
When Voice AI has to put its money where its mouth is

PCI compliance and the architecture gap most vendors are ignoring

Let me tell you about the most boring, most important problem in Voice AI right now.

It is not whose Voice AI handles interruptions better. It is not the Great Latency Wars. It is not whose demo sounds most like Scarlett Johansson. It is the fact that when a customer calls your AI-powered contact center and says “I’d like to pay my bill,” your fancy Voice AI has a full-blown identity crisis.

Bear with me. This is going to get weird, and then it is going to get interesting, and then you are going to look at every “AI-powered customer experience” company differently.

Chatbots had it easy…

Here is a thing most people do not think about: text-based AI payments are basically a solved problem.

Customer types card number into a secure form. Form is isolated from the chatbot. Chatbot never sees the digits. Transcript never records them. A tokenization layer sits between the customer and Stripe, and everyone goes home happy. Sierra launched Level 1 PCI-compliant chat payments in April 2026. Good for them. Genuinely.

But here’s the thing. Text-based payment compliance is to voice-based payment compliance what mini-golf is to Augusta National. They both involve putting a ball in a hole. The resemblance ends there.

The moment you add a human voice to the equation, everything breaks.

So why does everything break in voice?

Because sound is chaos. When a customer types a credit card number into a text field, the data is structured, contained, and controllable. When a customer speaks a credit card number into a phone, that audio waveform goes on a journey through your entire technology stack like a piece of luggage at O’Hare. It passes through the speech recognition engine. It gets transcribed into text. It enters the language model’s context window. It might get logged. It might get recorded for “quality assurance purposes.” It might end up in your analytics pipeline.

Every. Single. One. Of those touchpoints is now inside what PCI DSS (the Payment Card Industry Data Security Standard) calls a “Cardholder Data Environment.” Your ASR engine? In scope. Your LLM? In scope. Your call recording system? In scope. Your logging infrastructure? Believe it or not, also in scope.

This is not “add a compliance checkbox to the roadmap.” This is “re-architect your entire platform.”

And it gets worse!

The 5 ways Voice AI falls apart when money shows up

1. The Audio Stream Problem

A phone call is a continuous signal. You cannot “mask” a spoken credit card number the way you mask digits in a form field. The sound waves carrying those 16 digits pass through your telephony infrastructure, your SIP trunks, your media servers. If any of those components are not encrypted end-to-end with TLS and SRTP, you have a compliance failure before the AI even starts listening.

2. The Transcription Problem

Your speech recognition engine converts audio to text in real time. Customer says their card number. ASR engine dutifully writes it down. Congratulations, you have now created a text artifact containing cardholder data that is sitting in memory, ready to be shipped downstream to any system that is listening.

And people do not speak card numbers cleanly! They say “four two one seven… wait, no… four two one eight… then eight nine three zero.” They mumble. They talk over the AI. They sneeze in the middle of digit twelve. Building a reliable redaction layer for spoken card data is like trying to catch confetti in a windstorm.

3. The LLM Context Window Problem

This one is my favorite.

Even if you scrub the transcript, modern voice AI feeds conversation context into a language model to generate responses. If the card number enters that context window, the LLM might repeat it back. It might summarize it. It might casually reference it three turns later because, hey, it is trying to be helpful.

You cannot unit test this away. LLMs are non-deterministic. You cannot write a test that proves a large language model will never, under any combination of prompts and context, echo back a card number. It is like trying to guarantee your toddler will never say something embarrassing at Thanksgiving dinner. You can prepare. You cannot guarantee.

4. The Call Recording Problem

Most contact centers record calls. If a customer speaks their card number during a recorded call, that recording is now cardholder data. You either need real-time audio redaction (extremely hard), or you need to pause recording during the payment portion (introduces complexity and failure modes), or you need to just accept that your call recordings are now a compliance liability sitting on a server somewhere.
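The “pause recording” option, done carefully, at least needs to guarantee the recorder always comes back on, even when payment collection blows up mid-call. A minimal sketch, where `recorder` is a stand-in for your telephony platform’s recording control (real APIs differ; this is not any specific vendor’s interface):

```python
from contextlib import contextmanager

@contextmanager
def recording_paused(recorder):
    """Pause call recording for the payment segment only."""
    recorder.pause()
    try:
        yield
    finally:
        # Resume even if digit collection or the charge fails,
        # otherwise the rest of the call is silently unrecorded.
        recorder.resume()
```

The `finally` block is the whole point: a payment failure that also kills recording for the remainder of the call is exactly the kind of quiet failure mode this option introduces.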

Most companies pick the last option without even realizing it.

5. The Handoff Problem (a.k.a. The One That Kills the Experience)

Here is the “safe” approach most vendors use: the AI stops talking, transfers you to an IVR, the IVR collects your card info through a separate secure system, and then… maybe the AI comes back? Maybe you are on hold now? Maybe you just hang up because you have been transferred for the third time and you have things to do?

It works. Nobody would call it a good experience. And it destroys the entire value proposition of Voice AI, which is: this was supposed to feel like talking to a person.

Demo of a voice agent completing a PCI-compliant payment.

So how do you actually fix this?

You have to solve all five problems at once. And you have to do it without the customer noticing. That is the trick. The compliance has to be invisible.

Here is what a real solution looks like:

DTMF Suppression (a.k.a. “Just Use the Keypad”)

Instead of letting the customer speak their card number, the AI says: “Great, please enter your card number on your phone’s keypad.” The system captures the DTMF tones (the beeps) but suppresses them so they are not recorded or processed by the speech recognition engine. The digits never enter the AI’s brain.

This is the single most important design decision in PCI-compliant Voice AI, and it is the one most vendors skip because it requires deep telephony integration. You need to actually understand phone systems. Not just LLMs. Actual phone systems. The kind with SIP trunks and media gateways and all the infrastructure that most AI companies treat like someone else’s problem.
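The routing logic itself is simple once the telephony layer surfaces DTMF as distinct events. A hedged sketch, with hypothetical event shapes (the real work is in the media layer that detects and suppresses the tones before they reach recording or ASR):

```python
def route_event(event, payment_mode, secure_buffer, transcript):
    """Send DTMF digits to an isolated buffer; never to the transcript.

    `event` is a hypothetical dict from the telephony layer, e.g.
    {"type": "dtmf", "digit": "4"} or {"type": "audio", "text": "hello"}.
    """
    if event["type"] == "dtmf" and payment_mode:
        # Digits land in a buffer outside the AI's scope; the LLM,
        # logs, and recordings never see them.
        secure_buffer.append(event["digit"])
        return
    if event["type"] == "audio":
        # Normal conversational path: ASR text flows to the transcript.
        transcript.append(event["text"])
```

The point of the sketch is the fork: once `payment_mode` is on, keypad digits take a path that never touches the systems PCI would otherwise pull into scope.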

SIP-Level Encryption

The call itself is encrypted from carrier to AI using TLS for signaling and SRTP for media. This is table stakes for enterprise telephony but shockingly absent from voice AI platforms that were built as cloud-native apps by people who have never touched a PBX.

Just-in-Time Tokenization

The Voice AI acts as a passthrough. It collects the DTMF digits, immediately ships them to a payment processor (Stripe, Adyen, whoever), and only gets back a “success” or “failure” and the last four digits. The full card number never exists in the AI’s environment. It is like a relay race where the baton is on fire and the runner’s only job is to hand it off as fast as possible.
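The passthrough can be sketched in a few lines. `processor_charge` below stands in for a real gateway call (Stripe, Adyen, etc. each have their own API; nothing here is their actual interface), and the shape of the returned dict is an assumption:

```python
def charge_via_processor(digits, amount_cents, processor_charge):
    """Relay DTMF digits to the processor; keep only token + last four.

    `processor_charge` is a hypothetical gateway function assumed to
    return a dict like {"status": ..., "token": ...}.
    """
    last4 = digits[-4:]
    result = processor_charge(card_number=digits, amount=amount_cents)
    digits = None  # drop the full PAN immediately; nothing persists it
    return {"status": result["status"], "token": result["token"], "last4": last4}
```

Everything downstream (confirmation messages, receipts, CRM records) works off the token and the last four digits; the full card number has the shortest possible lifetime in memory and never hits a log or a database.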

The Guardian AI (Transcript Scrubbing)

Even with DTMF, some customers will ignore the instructions and just say their card number out loud. Because humans. A secondary AI model, a “Guardian,” monitors the transcript in real time and redacts any patterns that look like card numbers before the transcript is stored. Think of it as a bouncer for your transcripts. “Sorry, those 16 digits are not on the list.”
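The core of a text-side scrubber is a candidate pattern plus a Luhn check, so you redact card-like numbers without eating every long digit string (order numbers, tracking codes). A minimal sketch; a production Guardian also has to handle spelled-out digits (“four two one seven”) and split utterances, which this deliberately ignores:

```python
import re

def luhn_valid(digits: str) -> bool:
    """Standard Luhn checksum: true for plausible card numbers."""
    total = 0
    for i, d in enumerate(reversed(digits)):
        n = int(d)
        if i % 2 == 1:
            n *= 2
            if n > 9:
                n -= 9
        total += n
    return total % 10 == 0

# 13-19 digits, optionally separated by spaces or hyphens.
CANDIDATE = re.compile(r"(?:\d[ \-]?){13,19}")

def redact(transcript: str) -> str:
    def scrub(match):
        digits = re.sub(r"\D", "", match.group())
        if 13 <= len(digits) <= 19 and luhn_valid(digits):
            return "[REDACTED] "
        return match.group()  # not card-like; leave it alone
    return CANDIDATE.sub(scrub, transcript)
```

The Luhn check is what keeps the bouncer from throwing out innocent bystanders: a 16-digit account number that fails the checksum stays in the transcript, while anything that could actually be a card number is gone before storage.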

Seamless Handoff Architecture

All of this has to be invisible. The customer experience should be:

AI: “I can help you with that payment. Please enter your card number on your keypad.”

Customer types digits.

AI: “Got it. Processing now.”

Three seconds pass.

AI: “Your payment of $247.50 has been processed. You will receive a confirmation shortly. Is there anything else I can help with?”

No hold music. No transfer. No IVR purgatory. One conversation. That is the whole game.

Why this matters right now

Three things are converging:

PCI DSS 4.0 is fully enforced. As of March 2025, the new standard is stricter about scoping. Any component that could impact cardholder data security is in scope. For voice AI, that means everything from the ASR engine to the analytics dashboard.

The market is exploding. The call center AI market is projected to grow from about $3 billion in 2026 to over $13 billion by 2034. Gartner says conversational AI will cut contact center labor costs by $80 billion this year. The incentive to deploy voice AI has never been higher.

Most vendors have zero PCI compliance. Not Level 1. Not Level 4. Not anything. They have built beautiful demos with natural-sounding voices and sophisticated conversational flows, and they have completely ignored what happens when someone says the words “pay my bill.”

This is one of those situations where the gap between “impressive demo” and “production-ready enterprise software” is not a gap. It is a canyon. And the bridge across that canyon is built entirely out of the boring, unglamorous, deeply technical compliance work that no one wants to put on a pitch deck.

The receipts…

At Avaamo, we built for the hardest use cases first. HIPAA. FINRA. PCI DSS. Not because compliance is glamorous, but because you cannot fake it, shortcut it, or announce it through a partnership press release. It takes years to build and months to audit.

Every voice AI company has a demo. Not every voice AI company has the receipts.

Sriram Chakravarthy, CTO & Co-founder
sriram@avaamo.com