I Tried Running NVIDIA's PersonaPlex on RunPod — Here's What Happened
A step-by-step walkthrough of setting up NVIDIA's PersonaPlex speech-to-speech model on RunPod, including the challenges, workarounds, and whether it's worth the effort.
I recently wanted to test NVIDIA’s PersonaPlex — a real-time speech-to-speech conversational AI model — but I don’t have an NVIDIA GPU. My M3 Max has 64GB of unified memory, which sounds impressive until you realize PersonaPlex is built for CUDA. So I turned to cloud GPUs.
This is a walkthrough of getting PersonaPlex running on RunPod, the gotchas I hit along the way, and my honest take on whether the open-source speech-to-speech space is ready for prime time.
What is PersonaPlex?
PersonaPlex is NVIDIA’s real-time, full-duplex speech-to-speech model built on the Moshi architecture from Kyutai. It enables persona control through text-based role prompts and audio-based voice conditioning. In plain English: you can talk to it, and it talks back with consistent character voices.
The appeal is obvious — instead of the traditional STT → LLM → TTS pipeline with its accumulated latency, speech-to-speech models process audio directly. They understand emotional context and verbal cues better than text intermediaries.
Cloud GPU Options and Pricing
Since PersonaPlex requires CUDA, my options were cloud GPU providers. Here’s what the landscape looks like:
| GPU | Provider | Price/hour |
|---|---|---|
| A100 80GB | RunPod | $1.49 - $1.79 |
| A100 80GB | Lambda Labs | $1.79 |
| H100 | Hyperstack | $1.90 - $2.40 |
| H100 | AWS/GCP/Azure | $4.00 - $8.00 |
I went with RunPod for the per-second billing and straightforward interface.
Setting Up the Pod
After creating a RunPod account and adding credits (~$10-25 is plenty for testing), the deployment process is:
- Click Pods → Deploy
- Select A100 SXM 80GB
- Choose the RunPod PyTorch 2.x template
- Set Volume Disk to 50-100 GB (model weights are large)
- Important: Add `8998` to Expose HTTP Ports (more on this later)
- Deploy
The port exposure step is easy to miss, and you’ll hit a wall without it.
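Before installing anything, it's worth confirming the pod actually sees the GPU. Here's a quick check from the web terminal (this assumes the RunPod PyTorch template, which ships with torch preinstalled):

```python
# Sanity check: confirm CUDA is visible before spending time on setup.
import torch

print(torch.cuda.is_available())      # should print True
print(torch.cuda.get_device_name(0))  # should name the A100 you selected
```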
Installation Steps
Once the pod is running, connect via the web terminal and run:
```bash
cd /workspace
git clone https://github.com/NVIDIA/personaplex.git
cd personaplex

# Install the Opus codec
apt-get update && apt-get install -y libopus-dev

# Install the package (no requirements.txt; it installs from the moshi directory)
pip install moshi/.
```
The lack of a requirements.txt tripped me up initially. The install happens via the local package directory.
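Before launching the server, a quick import check confirms the install took (a minimal sanity check; the top-level module name matches the `python -m moshi.server` invocation used later):

```python
# Verify the moshi package is importable before starting the server.
import moshi
print("moshi installed at:", moshi.__file__)
```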
The HuggingFace Token Dance
PersonaPlex downloads model weights from HuggingFace, and you’ll hit authentication errors without proper setup. Here’s the sequence of errors I encountered:
Error 1: 401 Unauthorized
```
requests.exceptions.HTTPError: 401 Client Error: Unauthorized
```
Fix: Export your HuggingFace token:
```bash
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxx
```
Error 2: 401 Still (Gated Repo)
```
huggingface_hub.utils._errors.GatedRepoError: 401 Client Error
Cannot access gated repo for url...
```
Fix: Go to the model page on HuggingFace and click “Agree and access repository” to accept the license.
Error 3: 403 Forbidden
```
403 Forbidden: Please enable access to public gated repositories in your fine-grained token settings
```
Fix: Your token needs the right permissions. Go to HuggingFace → Settings → Tokens and create a new token with Read access (not fine-grained), or if using fine-grained tokens, enable “Access to public gated repos.”
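Rather than restarting the server to test each fix, you can verify the token end to end with huggingface_hub, which gets pulled in as a dependency. Note that the repo id below is a placeholder; substitute the one from your own 401/403 error message:

```python
# Verify the HF token end to end before re-running the server.
import os
from huggingface_hub import whoami, model_info

token = os.environ["HF_TOKEN"]
print(whoami(token=token)["name"])  # raises a 401 if the token itself is invalid

# model_info raises GatedRepoError if the license hasn't been accepted,
# and a 403 if the token lacks access to public gated repos.
# "nvidia/personaplex" is a placeholder; use the repo id from your error.
info = model_info("nvidia/personaplex", token=token)
print("token OK:", info.id)
```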
Running the Server
The README says to run:
```bash
SSL_DIR=$(mktemp -d); python -m moshi.server --ssl "$SSL_DIR"
```
This works, and you’ll see the model download and load:
```
INFO - loading mimi
INFO - mimi loaded
INFO - loading moshi
INFO - moshi loaded
INFO - warming up the model
======= Running on https://0.0.0.0:8998 =======
```
But here’s the catch — the --ssl flag breaks RunPod’s proxy.
The Port Exposure Problem
RunPod proxies HTTP services through URLs like https://<pod-id>-<port>.proxy.runpod.net. When I tried accessing the server, I got a 502 Bad Gateway error.
The issue: RunPod’s proxy handles SSL termination itself. It expects backends to serve plain HTTP. The --ssl flag makes PersonaPlex serve HTTPS internally, which the proxy can’t handle.
The fix: Run without SSL:
```bash
python -m moshi.server --host 0.0.0.0 --port 8998
```
Now you can access the web interface at:
```
https://<your-pod-id>-8998.proxy.runpod.net
```
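If the proxy URL still returns a 502, confirm the server is answering plain HTTP from inside the pod before blaming RunPod. A stdlib check from a second terminal (assuming the web UI responds at the root path, which the proxy URL above suggests):

```python
# Confirm the server serves plain HTTP locally. If this returns 200 but
# the proxy URL still 502s, the problem is on the proxy side.
import urllib.request

with urllib.request.urlopen("http://localhost:8998", timeout=10) as resp:
    print(resp.status)
```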
Was It Worth It?
After all that setup — creating accounts, configuring tokens, debugging SSL issues — I finally got PersonaPlex running.
And… it was underwhelming.
The latency was noticeable, the voice quality didn’t match my expectations, and the persona control felt limited. For roughly 30-40 minutes of A100 time, I spent about $1.50. Not a big loss, but it highlighted how far open-source speech-to-speech models have to go.
The Alternatives
If you’re exploring real-time voice AI, here’s what else is out there:
Hosted (Easiest Path)
| Service | Notes |
|---|---|
| GPT-4o Voice | Best quality, ~$20/mo via ChatGPT Plus |
| Gemini Live | Google’s equivalent, free tier available |
| ElevenLabs | Great voice quality, conversational AI API |
Open Source / Self-Hosted
| Model | Notes |
|---|---|
| Kyutai TTS 1.6B | From the Moshi team, newer and improved |
| Kyutai Pocket TTS | 100M params, runs on CPU |
| Ultravox | Speech-to-speech from Fixie.ai |
| Sesame CSM | Expressive character voices |
Honest Take
If you just want a good voice conversation experience right now, GPT-4o’s voice mode in the ChatGPT app is the most polished option. The open-source models are catching up, but they still lag in naturalness and latency.
For production voice AI, the traditional STT → LLM → TTS pipeline with providers like Deepgram, Claude, and Cartesia still offers more control, better quality, and predictable costs — even if it means slightly higher latency.
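For a sense of what that pipeline looks like structurally, here's a minimal sketch. The three stage functions are hypothetical stand-ins, not real SDK calls; the point is the shape of the loop, where each stage adds a network hop but can be swapped or instrumented independently:

```python
# Structural sketch of an STT -> LLM -> TTS voice pipeline.
# transcribe(), respond(), and synthesize() are hypothetical wrappers
# around whichever STT, LLM, and TTS providers you pick.

def transcribe(audio: bytes) -> str:
    """STT stage, e.g. a speech-to-text API call. Hypothetical."""
    raise NotImplementedError

def respond(text: str) -> str:
    """LLM stage, e.g. a chat completion call. Hypothetical."""
    raise NotImplementedError

def synthesize(text: str) -> bytes:
    """TTS stage, e.g. a text-to-speech API call. Hypothetical."""
    raise NotImplementedError

def handle_turn(audio_in: bytes) -> bytes:
    # Each stage is a separate round trip: that's where the extra latency
    # comes from, and also where the control comes from, since you can
    # log, filter, or swap any stage without touching the others.
    text = transcribe(audio_in)
    reply = respond(text)
    return synthesize(reply)
```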
Key Takeaways
- Cloud GPU testing is cheap — You can test CUDA-only models for a few dollars
- RunPod's HTTP proxy expects plain HTTP — Don't use the `--ssl` flag for services you want to expose
- HuggingFace gated models require three things: a valid token, license acceptance, and correct token permissions
- Open-source speech-to-speech isn’t ready — For production use, hosted APIs or traditional pipelines are still the better choice
- Always clone to
/workspace— On RunPod, data outside this directory is lost when pods reset
Total cost of this experiment: ~$1.50 in cloud GPU time, plus the sunk cost of my expectations.