One thing you may not know about me is that when I run workshops or live demos for organisations, I always give the organiser explicit permission – twice – to interrupt me.
Here’s why: I have ADHD. I have it pretty strong. I know most people have heard of it, but if you’re not sure what it actually means day-to-day: my brain chemistry works in such a way that it doesn’t regulate attention the way the majority of people’s does.
It’s not that I can’t focus – it’s that I can’t always choose what I focus on, and when something genuinely interests me, I tend to lock in on it. The problem is that “something genuinely interesting” can arrive mid-session during a workshop when a participant asks me a question, and twenty minutes later we’re deep in a rabbit hole that’s often valuable – but not what the group came for.
So before the session, I try to always tell the organiser: “Please step in and call me back on track if I wander off. You have my full permission to drag me kicking and screaming back on topic.”
Then I say it again, right at the start, in front of the audience. That second time matters because I want the attendees to know I’ve given that permission – so the organiser doesn’t look like “the bad guy” when they use it.
And some (most?) of them have had to use it!
But I really don’t like having to put that responsibility on someone else’s shoulders, and often I’m delivering workshops solo with no MC.
So what can I do about this?
This is the version of AI I care about. Not the version that replaces what humans do, but the version of technology that supports me – the human – in how I actually work – including the parts that don’t come easily.
I’ve spent most of my professional life helping organisations implement technology that makes their people’s working lives better. It would be fairly hypocritical not to do the same for myself.
So I built an AI tool to do that job for my workshops! I called it SessionPilot.
HumanSpark Labs – What this is
This is a Labs post.
Labs is where I document my own AI experiments – building real tools for my own use while learning what today’s AI is actually capable of (and where it quietly fails). I share the working notes, not just the polished outcome. SessionPilot is one of those projects.
SessionPilot is a real-time desktop app that listens to my live audio when I’m running a training workshop, checks it against my agenda, and gives me gentle visual nudges if I drift off-topic. It’s my external conscience – a quiet tap on the shoulder.

This post is my working notes on choosing a speech-to-text provider for it – with one big constraint front and centre: GDPR and EU data residency.
The pipeline is straightforward. My live audio goes to a cloud speech-to-text service. The transcript feeds a sliding-window buffer. That buffer runs through a hybrid embedding and LLM layer that detects when I’ve drifted off-agenda. Results appear in a transparent overlay on my screen – gentle nudges, a question wait-timer, a momentum tracker. Discreet. Helpful. Not intrusive.
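To make the buffer-and-detection stage concrete, here’s a rough sketch of a sliding-window transcript buffer with a drift check. The names are illustrative, not SessionPilot’s actual code, and the word-overlap score is a deliberately crude stand-in for the real embedding-plus-LLM layer:

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    timestamp: float  # seconds since session start


class SlidingTranscriptWindow:
    """Keep only the last `window_seconds` of transcript text.

    The drift detector only needs recent context, so older segments
    are evicted as new ones arrive.
    """

    def __init__(self, window_seconds: float = 120.0):
        self.window_seconds = window_seconds
        self._segments: deque = deque()

    def add(self, text: str, timestamp: float) -> None:
        self._segments.append(Segment(text, timestamp))
        cutoff = timestamp - self.window_seconds
        # Drop anything that has fallen out of the window.
        while self._segments and self._segments[0].timestamp < cutoff:
            self._segments.popleft()

    def current_text(self) -> str:
        return " ".join(s.text for s in self._segments)


def overlap_score(window_text: str, agenda_text: str) -> float:
    """Toy stand-in for embedding similarity: the fraction of words in
    the recent transcript that also appear in the agenda. A low score
    suggests the conversation has wandered off-topic."""
    window_words = set(window_text.lower().split())
    agenda_words = set(agenda_text.lower().split())
    if not window_words:
        return 1.0
    return len(window_words & agenda_words) / len(window_words)


# Usage: feed transcript chunks in, score the window against the agenda.
window = SlidingTranscriptWindow(window_seconds=60.0)
window.add("welcome everyone today we cover prompting basics", timestamp=0.0)
window.add("great question about my cat let me digress", timestamp=30.0)
score = overlap_score(window.current_text(), "prompting basics and agenda design")
print(f"on-topic score: {score:.2f}")
```

In the real system the overlap function would be replaced by embedding similarity with an LLM confirming genuine tangents, but the shape of the loop – append, evict, score, nudge – is the same.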
But here’s the thing about building AI tools for professional use. You can spend weeks evaluating accuracy benchmarks, latency curves, and pricing tiers. And then someone asks the question that actually matters: “Where does our audio actually go?”
Audio captured during a live workshop is almost always personal data – both mine and my participants’. And that means the speech-to-text layer, sitting right at the start of the pipeline, is where the compliance story either holds together or falls apart. I’d rather think it through now than explain myself to a data protection officer later.
What follows is the detailed vendor comparison. If you’re evaluating speech-to-text providers yourself, this should save you a solid week of research. If you’re not, the short version is this: most providers will tell you they’re GDPR compliant, but the gap between marketing language and contractual reality is wider than you’d expect. The decisions I made here – and the questions I asked – are the same kinds of decisions any organisation faces when bringing AI tools into their workflow.
Before the list: what “GDPR compliant” actually means in practice
Every reputable vendor in this space will tell you they’re GDPR compliant. That’s essentially table stakes now. But what you actually need to evaluate is something more specific: the data flow, the processing location, and what’s locked in a signed Data Processing Agreement.
At minimum, you want to be able to confirm all of the following before you commit to a provider:
- A DPA is available and can be signed (processor terms, not just a self-certification)
- A subprocessor list with locations is available and kept current
- There is a clear statement of what “EU endpoint” actually means – processing and storage, not just the API entry point
- You know what the default retention period is for audio, transcripts, and logs – and you can reduce it
- The training policy is explicit: is your data used to improve the model by default, and if not, is that documented contractually?
I’ve ranked the options below from “maximum control” to “needs more contractual digging.” That’s a pragmatic engineering view, not a legal opinion. If you’re building something that will handle genuinely sensitive data, please also talk to a lawyer. (I’m a recovering software engineer, not a solicitor.)
The vendors ranked
I’ve grouped these into four tiers based on how much control you realistically have over where data goes and how long it stays there.
Tier 1 – Maximum control: keep it in your infrastructure
Self-hosting (e.g., Whisper open-source)
Audio never leaves your infrastructure. You own the processing location, the logs, the retention policy, and the access controls. There’s no vendor relationship to manage because there’s no vendor.
Trade-off: you also own security, scaling, and ops. Model performance tuning is your problem. If you have genuinely regulated content – sensitive training sessions, medical or legal contexts – this is worth the overhead. Otherwise, it might be more than you need.
NVIDIA Riva (on-prem / private cloud / edge)
Designed to run in your own environment. If you don’t persist data, you get zero data retention by design. Enterprise-grade performance without shipping audio to a third party.
Trade-off: licensing and deployment complexity. This is more involved than running open-source Whisper, but you get vendor support for it. Worth considering for enterprise deployments where performance matters and data must stay local.
Speechmatics (containers / on-prem option)
Containerised deployments give you strong control over where data is processed and how long it’s kept. This is a “privacy-first” posture with a realistic path to production.
Trade-off: the procurement and integration effort is higher than plug-and-play SaaS. But you get genuine control, not just a checkbox.
Tier 2 – Strong managed options: EU endpoints with real retention controls
Soniox
Explicit EU endpoints with “no retention” as the default – audio and transcripts aren’t stored unless you specifically request it. Strong regional deployment story.
Trade-off: confirm what counts as “content” versus “system data.” Billing and usage metadata may still be retained separately. Worth asking about explicitly.
AssemblyAI (EU endpoint on AWS Dublin)
EU endpoint available, with a documented zero data retention mode for streaming when you opt out of model training. Some metadata is retained for logging and billing, but that’s fairly standard.
Trade-off: make sure your specific plan and configuration actually enable the zero-retention posture. Don’t assume – check the settings and confirm it in writing.
Google Cloud Speech-to-Text
Clear regional endpoint strategy and a documented choice between “with data logging” and “without data logging” modes. One of the more transparent big-cloud options on this specific question.
Trade-off: do the standard DPA and subprocessor due diligence, and make sure the correct endpoint is actually used everywhere in your stack – not just in your test environment.
Microsoft Azure Speech
Strong region selection and mature enterprise compliance programmes. If your clients are already in the Microsoft ecosystem, this is often the path of least resistance for procurement.
Trade-off: zero data retention isn’t always a single toggle. You still need to design retention and logging carefully. Don’t assume Azure handles it all by default.
Tier 3 – Mid-range: workable, but verify before you commit
Rev.ai (EU deployment in Frankfurt)
EU deployment is explicitly positioned to support GDPR. Retention defaults exist but can be shortened, and manual deletion is supported. A straightforward API.
Trade-off: validate the default retention and deletion semantics for your specific workflow – real-time streaming versus batch processing can behave differently.
Deepgram
EU endpoint exists. Good product. But I’ve seen potential mismatches between marketing language and contractual commitments on residency.
Trade-off: treat this as a “confirm with sales in writing” situation before you build on it for strict residency requirements. It may be fine – but get it in writing.
Gladia (EU company)
EU company, which is a good starting point. But depending on your configuration and tier, processing may span Europe and the USA. Zero data retention appears to be an Enterprise feature.
Trade-off: if the product capabilities are compelling, this can still work – but only if you can contractually pin down EU-only processing. Don’t assume the company’s EU origin means EU-only processing.
Amazon Transcribe
EU region selection is available. But “you can choose an EU region” is not the same as “data stays in EU.” Retention and storage management is largely your responsibility through your AWS architecture.
Trade-off: there are many controls available, and many ways to configure them incorrectly. If you’re already AWS-native and have solid governance in place, this is manageable. If not, the complexity isn’t worth it for this use case.
Tier 4 – Lower certainty: useful, but with caveats
OpenAI Whisper API
Technically attractive and often cost-effective. But data residency controls may only be available to eligible customers, and zero data retention availability can be gated by your commercial arrangement.
Trade-off: if your residency requirements are strict and you’re not on an enterprise agreement, this is a risk. If requirements are more flexible, it’s worth a closer look.
ElevenLabs
Enterprise zero-retention exists. But some product areas – agent conversation history in particular – can have long default retention periods unless you configure them explicitly.
Trade-off: excellent capability, but you need to be deliberate about which features you use and how retention is configured across all of them. Not a “set and forget” compliance posture.
Cartesia
GDPR compliance and optional zero data retention are marketed. But I haven’t been able to find a clear contractual commitment to processing staying in a specific country.
Trade-off: could be a good fit once you’ve confirmed residency, subprocessors, and retention semantics for your specific plan. Currently in the “needs more digging” pile for me.
Gemini public API vs. Vertex AI (a different layer)
This one’s a bit different – it belongs more to the LLM layer (tangent detection) than the STT layer. For EU-only processing, the clearer path is a regional cloud platform like Vertex AI rather than a public API where regional guarantees are less explicit.
Trade-off: if data minimisation is strong before it reaches this layer – or if you’re using a region-pinned enterprise offering – this can be fine. Just don’t assume the public API gives you residency guarantees.
Questions to ask every vendor (copy and paste these)
Before you sign anything, get written answers to all of these. Verbal assurances don’t hold up.
- Can you confirm that audio and transcripts are processed only in the EU when using your EU endpoint or region?
- Where is metadata – usage logs, billing data, telemetry – processed and stored?
- What is the default retention period for audio, transcripts, and logs? How do I reduce it to zero or as close as possible?
- Do you offer Zero Data Retention or an equivalent mode? Is it self-serve, or enterprise-only?
- Is customer data used for model training by default? Where is the opt-out documented in the contract – not just the privacy policy?
- Can you provide a signed DPA, SCCs where applicable, and an up-to-date subprocessor list with locations?
- What is your process and SLA for deletion requests and data subject access request (DSAR) support?
My current thinking
SessionPilot’s design philosophy is intentionally gentle. Nudges, not interruptions. Peripheral cues, not alerts. That matches how I want to be reminded – a quiet tap on the shoulder, not a foghorn. (Although, given my track record, some event organisers might argue I need the foghorn.) The same instinct applies to compliance: default to minimisation and control. If I can reduce what leaves the device, keep processing in-region, and push retention as close to zero as possible, everything downstream becomes simpler – legally and operationally.
For the initial build, I’m going with AssemblyAI for the managed SaaS path. EU endpoint, documented zero retention for streaming, real-time performance. Good fit. Self-hosted Whisper stays on the table as a fallback for sessions with stricter requirements – the abstract STT interface in the architecture means swapping providers is a config change, not a rewrite.
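To show what I mean by “a config change, not a rewrite”: a minimal version of that abstract STT interface might look like the sketch below. Every class name, endpoint, and config key here is a hypothetical placeholder – none of it is a real provider API.

```python
from typing import Callable, Dict, Protocol


class SpeechToText(Protocol):
    """Minimal interface every STT backend must satisfy."""

    def transcribe_chunk(self, audio: bytes) -> str: ...


class LocalWhisperSTT:
    """Placeholder for a self-hosted backend (illustrative only)."""

    def transcribe_chunk(self, audio: bytes) -> str:
        # A real implementation would run a local model here.
        return "<local transcript>"


class ManagedEUSTT:
    """Placeholder for a managed backend pinned to an EU endpoint."""

    def __init__(self, endpoint: str):
        self.endpoint = endpoint

    def transcribe_chunk(self, audio: bytes) -> str:
        # A real implementation would stream audio to self.endpoint.
        return f"<transcript via {self.endpoint}>"


# Provider registry: adding a backend means adding one entry here.
PROVIDERS: Dict[str, Callable[[dict], SpeechToText]] = {
    "local": lambda cfg: LocalWhisperSTT(),
    "managed_eu": lambda cfg: ManagedEUSTT(cfg["endpoint"]),
}


def build_stt(config: dict) -> SpeechToText:
    """Pick the backend from config – the rest of the app only ever
    sees the SpeechToText interface."""
    return PROVIDERS[config["provider"]](config)


# Swapping providers is literally a change to this dict.
stt = build_stt({"provider": "managed_eu", "endpoint": "https://eu.example.invalid"})
print(stt.transcribe_chunk(b"\x00\x01"))
```

The point of the pattern is that the residency decision lives in one config value, so a session with stricter requirements can drop back to the local backend without touching the pipeline code.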
One thing I’ve noticed working through this: the compliance conversation forces you to think much more carefully about your data architecture up front. That’s actually useful. Knowing that nothing hits disk, that LLM calls go through ZDR routing, and that all cloud processing is pinned to EU endpoints – those decisions shaped the whole system design, not just the vendor selection. I’d argue that GDPR-first thinking makes for better software in general, not just more defensible software.
That’s the Labs value: you learn things by building that you wouldn’t learn by reading.
If you’re building something similar and have been through this vendor comparison yourself, I’d genuinely love to hear what you found. Hit reply and let me know.
– Alastair
P.S. SessionPilot is a HumanSpark Labs project – built for my own use and documented here as I go. If you’re a trainer or facilitator who’d like to follow along, or you want to try it when it’s ready for testing, get on the list.
P.P.S. The questions in the vendor checklist above aren’t just for speech-to-text. They’re the same questions you should ask about any AI tool you’re bringing into your organisation. If your team is navigating those decisions and you’d like a thinking partner, that’s exactly the kind of conversation I have regularly. Book a Priorities Chat and we’ll work through it together.
P.P.P.S. I’m a recovering software engineer, not a solicitor. None of this is legal advice – confirm everything in writing with your own legal team before you commit to a vendor for sensitive data.

