Contactless Vitals API Latency: How to Benchmark and Optimize
A technical guide to measuring and optimizing contactless vitals API latency, covering P50/P99 benchmarks, edge inference, and pipeline tuning for production health platforms.

Vitals API latency optimization is one of those problems that doesn't seem urgent until it is. You integrate a contactless vitals SDK, run some tests, get heart rate readings in a second or two, and ship it. Then your users start complaining that the scan "feels slow," or worse, your telehealth platform's video call stutters every time the vitals module kicks in. The difference between a vitals API that responds in 80ms and one that responds in 400ms isn't academic. It's the difference between a feature people use and one they turn off.
"Our ME-rPPG solution enables real-time inference with only 3.6 MB memory usage and 9.46 ms latency, surpassing existing methods by 19.5% to 49.7% in processing efficiency." — Wang et al., Tsinghua University, arXiv, April 2025
What "latency" actually means for a vitals API
The word gets thrown around loosely. For a standard REST API — say, fetching a user profile — latency is simple: time between request sent and response received. For a contactless vitals API, it's more complicated because the pipeline has multiple stages, each with its own timing characteristics.
There are at least three distinct latency numbers worth tracking:
Frame-to-signal latency is the time from when the camera captures a frame to when the rPPG algorithm produces an intermediate signal value. This is your on-device processing time. On modern phones, research from Tsinghua University's ME-rPPG project showed this can get as low as 9.46ms per frame using state-space models optimized for temporal-spatial duality. On older hardware, you might be looking at 40-80ms.
Signal-to-vitals latency is how long the algorithm needs to accumulate enough signal to produce a stable vital sign reading. Heart rate typically needs 10-15 seconds of clean signal. Respiratory rate needs 20-30 seconds. This isn't something you can optimize away with faster hardware — it's a physiological constraint.
API round-trip latency is what matters if your architecture sends raw or processed data to a cloud endpoint for analysis. This adds network time, serialization overhead, and server-side processing. For many production deployments, this is the number that determines user experience.
| Latency type | What it measures | Typical range | Primary optimization lever |
|---|---|---|---|
| Frame-to-signal | On-device rPPG processing per frame | 9-80ms | Model architecture, quantization |
| Signal-to-vitals | Accumulation window for stable reading | 10-30s | Algorithm design, signal quality |
| API round-trip | Cloud request/response cycle | 50-500ms | Edge deployment, payload size, CDN |
| End-to-end perceived | User taps "scan" to seeing results | 15-45s | Pipeline parallelization |
Most teams focus on the API round-trip because it's the easiest to measure. But frame-to-signal latency is what determines whether your app can process video in real time without dropping frames. If your rPPG inference takes 50ms per frame and you're capturing at 30fps (33ms per frame), you're already behind. Frames queue up, memory usage spikes, and eventually the whole thing falls over.
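The queueing failure mode above is worth making concrete. This is a minimal sketch of the arithmetic, using the illustrative figures from the text (50ms inference against a 30fps capture rate); it assumes a single inference worker and no frame dropping.

```python
# Sketch: why per-frame inference slower than the capture interval causes
# unbounded queueing. The numbers (50 ms inference, 30 fps) are the
# illustrative figures from the text, not measurements.

def queued_frames(inference_ms: float, fps: float, seconds: float) -> int:
    """Frames waiting after `seconds` of capture, assuming one inference
    worker and no frame dropping."""
    captured = int(fps * seconds)
    processed = int(seconds * 1000 / inference_ms)
    return max(0, captured - processed)

# 50 ms inference against a 33 ms frame budget: the backlog grows
# by roughly 10 frames per second of scanning.
backlog = queued_frames(inference_ms=50, fps=30, seconds=10)
print(backlog)  # 100
```

The fix is either a faster model, a lower capture rate, or explicit frame dropping; silently letting the queue grow is what produces the memory spikes described above.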
How to benchmark vitals API latency properly
Bad benchmarking leads to bad decisions. The most common mistake is measuring average latency and calling it a day. Average latency is almost useless for production systems because it hides the tail. Your P50 might be 100ms while your P99 is 2 seconds — meaning one in a hundred users gets a terrible experience.
The percentile approach
A 2025 guide from OneUptime on latency percentiles laid out the standard framework most infrastructure teams now use:
- P50 (median): Half your requests finish faster than this. Your "normal" experience.
- P95: 95% of requests finish faster. This is where you start seeing the impact of garbage collection pauses, cold starts, and connection reuse failures.
- P99: 99% of requests finish faster. The experience for your unluckiest 1% of users. For health applications where trust matters, this number can't be ignored.
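The percentiles above are straightforward to compute from raw latency samples. This sketch uses the nearest-rank method; `samples_ms` is synthetic data chosen to show how a small number of slow requests vanishes from the mean but dominates P99.

```python
# Sketch: nearest-rank percentiles over raw latency samples. The mean
# hides the tail; P50/P95/P99 expose it. `samples_ms` is synthetic.
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value >= p% of the samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# 98 fast requests and 2 two-second outliers: the mean is 138 ms,
# barely worse than P50, but P99 tells the real story.
samples_ms = [100.0] * 98 + [2000.0] * 2
print(percentile(samples_ms, 50))  # 100.0
print(percentile(samples_ms, 99))  # 2000.0
```

Python's standard library also offers `statistics.quantiles` for this; the explicit version above just makes the ranking visible.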
For a contactless vitals API specifically, you want to benchmark each pipeline stage separately. Aggregate numbers tell you something is slow; per-stage numbers tell you what to fix.
What to measure and how
Set up instrumentation at each boundary in the pipeline. OpenTelemetry has become the standard for this — it gives you distributed tracing across on-device and cloud components without writing custom timing code everywhere.
Here's what a practical benchmark suite looks like:
- Camera capture consistency: Measure the interval between frames. You want steady 30fps or 60fps. Frame drops or irregular intervals degrade the rPPG signal, which means the algorithm needs more time to converge, which increases end-to-end latency even though no single component got slower.
- Face detection frequency: If you're running face detection every frame, you're probably wasting 25-35% of your compute budget. Benchmark the difference between every-frame and every-5th-frame detection. On most devices, interpolating face position between detections costs nearly nothing.
- Inference time per frame: Run your rPPG model on 1,000 frames from different devices and lighting conditions. Record P50 and P99. The variance matters as much as the median — high variance means unpredictable user experience.
- Network round-trip (if applicable): Test from multiple geographic regions, on both WiFi and cellular. Cellular P99 latency is routinely 3-5x higher than WiFi P99.
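To collect the per-stage numbers above, instrument each pipeline boundary. In production you would emit OpenTelemetry spans as mentioned earlier; this stdlib sketch shows the same idea with a context manager that accumulates durations per stage name (the `time.sleep` calls stand in for real work).

```python
# Sketch: stage-boundary timing with a context manager, so each pipeline
# stage (capture, detect, infer, post-process) reports its own durations.
# A production system would emit these as OpenTelemetry spans instead.
import time
from collections import defaultdict
from contextlib import contextmanager

durations_ms: dict[str, list[float]] = defaultdict(list)

@contextmanager
def timed(stage: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        durations_ms[stage].append((time.perf_counter() - start) * 1000)

# Usage: wrap each stage of the real pipeline. time.sleep is a stand-in
# for actual camera capture / detection / inference work.
with timed("face_detect"):
    time.sleep(0.005)
with timed("inference"):
    time.sleep(0.010)

for stage, samples in durations_ms.items():
    print(f"{stage}: {max(samples):.1f} ms")
```

Once each stage logs into its own bucket, computing per-stage P50 and P99 from `durations_ms` is a one-liner, and regressions show up at the stage that caused them rather than in an opaque aggregate.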
Benchmark environment pitfalls
Testing on a development machine with a wired connection and no other load tells you nothing about production. A 2026 analysis from Total Shift Left on monitoring API performance in production emphasized that synthetic benchmarks systematically underestimate real-world latency by 30-60% because they miss contention, thermal throttling, and variable network conditions.
Run benchmarks on actual target devices. An iPhone 15 Pro and a Samsung Galaxy A15 will give you latency numbers that differ by 3-5x for the same model. If you're targeting emerging markets or field deployments, benchmark on the worst hardware your users actually have.
Optimization strategies that move the needle
Once you know where the time goes, here's what actually helps.
Edge-first architecture
The single biggest latency reduction comes from moving inference off the cloud and onto the device. A cloud round-trip adds 50-500ms of latency that you can eliminate entirely if the rPPG model runs locally. The ME-rPPG work from Tsinghua University demonstrated that modern state-space architectures can run real-time inference using only 3.6 MB of memory, which fits comfortably on any smartphone made in the last five years.
The trade-off is model accuracy. Cloud models can be larger and more accurate. But for most vitals use cases, the accuracy difference between a well-optimized on-device model and a cloud model is smaller than the noise introduced by poor lighting or user movement. You get more accuracy improvement from reducing latency (which reduces motion artifacts from long scan times) than from running a bigger model in the cloud.
Pipeline parallelization
Most naive implementations process the vitals pipeline sequentially: capture frame → detect face → extract ROI → run inference → post-process. Each stage waits for the previous one to finish. But several of these stages can overlap.
While the rPPG model processes frame N, face detection can run on frame N+1 and the camera can capture frame N+2. This pipeline parallelism doesn't reduce the latency of any individual frame, but it increases throughput, which means you can run more sophisticated models without dropping frames.
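The overlap can be sketched with a bounded queue between a capture thread and an inference thread. `capture_loop` and the string "frames" here are stand-ins for a real camera and rPPG model; the structural point is the bounded queue, which gives you backpressure instead of the unbounded growth described earlier.

```python
# Sketch: overlapping capture and inference with a bounded queue and a
# worker thread. Frame strings stand in for real camera frames, and the
# "signal(...)" strings stand in for rPPG inference output.
import queue
import threading

frames: queue.Queue = queue.Queue(maxsize=4)  # bounded: backpressure, not unbounded growth
results = []

def capture_loop(n_frames: int):
    for i in range(n_frames):
        frames.put(f"frame-{i}")   # blocks if the queue is full
    frames.put(None)               # sentinel: capture finished

def inference_loop():
    while (frame := frames.get()) is not None:
        results.append(f"signal({frame})")  # stand-in for model inference

producer = threading.Thread(target=capture_loop, args=(8,))
consumer = threading.Thread(target=inference_loop)
producer.start(); consumer.start()
producer.join(); consumer.join()
print(len(results))  # 8
```

The small `maxsize` is deliberate: if inference falls behind, capture blocks (or, in a real system, drops frames) rather than letting memory grow without bound.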
Adaptive quality scaling
Not every frame needs the same processing. During the first few seconds of a scan, when the algorithm is still locking onto the signal, you can use lower-resolution frames and simpler preprocessing. Once the signal stabilizes, increase quality for the remaining seconds. This front-loads responsiveness when the user is most likely to give up and quit.
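A minimal sketch of that policy, as a pure function from scan state to frame resolution. The threshold (3 seconds) and the two resolutions are illustrative assumptions, not recommendations; a real implementation would drive this from a signal-quality metric.

```python
# Sketch: pick frame resolution from scan progress and signal stability.
# The 3-second threshold and the resolutions are illustrative only.

def frame_resolution(elapsed_s: float, signal_locked: bool) -> tuple[int, int]:
    """Cheap frames while acquiring the signal, full quality once stable."""
    if elapsed_s < 3.0 or not signal_locked:
        return (320, 240)   # low cost during signal acquisition
    return (640, 480)       # full quality once the signal is stable

print(frame_resolution(1.0, False))  # (320, 240)
print(frame_resolution(5.0, True))   # (640, 480)
```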
WebAssembly for browser deployments
If your vitals integration runs in a browser (telehealth platforms, web-based screening tools), WebAssembly offers near-native performance for the inference step. The rPPG edge implementation work published on GitHub by researchers studying real-time edge computing showed that WASM-compiled models ran within 15-20% of native speed on modern browsers, while being deployable without any app installation.
Reduce payload, not precision
If you must send data to a cloud API, send extracted signal values rather than raw video frames. A 30-second video clip at 30fps is roughly 900 frames of image data. The extracted rPPG signal from those same frames is a few kilobytes of time-series data. That's the difference between a 5MB upload and a 10KB upload, which on a cellular connection can mean 3-4 seconds versus 50 milliseconds.
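The back-of-envelope arithmetic behind that comparison, assuming a 10 Mbit/s cellular uplink (an assumption for illustration; real uplinks vary widely):

```python
# Sketch: transfer time for raw video vs extracted signal over an
# assumed 10 Mbit/s cellular uplink. Payload sizes follow the text
# (~5 MB of frames vs ~10 KB of time-series data).

def transfer_seconds(payload_bytes: float, uplink_mbit_s: float) -> float:
    return payload_bytes * 8 / (uplink_mbit_s * 1_000_000)

video_s = transfer_seconds(5 * 1024 * 1024, 10)   # ~4.2 s for raw frames
signal_s = transfer_seconds(10 * 1024, 10)        # ~8 ms for the signal
print(f"{video_s:.1f} s vs {signal_s * 1000:.0f} ms")  # 4.2 s vs 8 ms
```

Protocol overhead, TLS handshakes, and uplink variance push the real-world numbers higher, but the roughly 500x payload ratio is what makes the difference.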
Setting latency SLOs for health applications
Service Level Objectives for a vitals API need to account for the fact that health data carries higher stakes than typical API responses. Users don't retry a health scan the way they retry a failed page load. If the scan feels broken, they lose trust in the measurement itself.
| Metric | Recommended SLO | Why this threshold |
|---|---|---|
| Frame processing P50 | < 25ms | Keeps up with 30fps capture without buffering |
| Frame processing P99 | < 50ms | Prevents frame drops on lower-end devices |
| API round-trip P50 | < 150ms | Feels instantaneous to users |
| API round-trip P99 | < 500ms | Stays under frustration threshold |
| End-to-end scan time | < 30s | Beyond 30s, completion rates drop sharply |
| Cold start time | < 2s | Model loading before first frame processed |
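SLOs only matter if something checks them. This sketch compares measured percentiles against a subset of the table above; `measured` is synthetic example data, and in practice this check would run in CI against the benchmark suite's output.

```python
# Sketch: flag measured percentiles that violate the SLO table.
# SLO values mirror the table above; `measured` is synthetic.

SLOS_MS = {
    "frame_p50": 25, "frame_p99": 50,
    "api_p50": 150, "api_p99": 500,
}

def slo_violations(measured: dict[str, float]) -> list[str]:
    return [name for name, limit in SLOS_MS.items()
            if measured.get(name, 0) > limit]

measured = {"frame_p50": 18, "frame_p99": 62, "api_p50": 120, "api_p99": 480}
print(slo_violations(measured))  # ['frame_p99']
```

Gating releases on this kind of check catches the common failure pattern where the median stays healthy while the tail quietly regresses.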
The end-to-end scan time SLO is probably the most important one from a product perspective. Research on user engagement with health screening apps consistently shows that completion rates drop by roughly 10-15% for every additional 10 seconds of scan time beyond 20 seconds. An optimized pipeline that finishes in 15 seconds will have meaningfully higher engagement than one that takes 45 seconds, even if both produce identical accuracy.
Current research and evidence
The state of the art is moving fast. A few research threads worth watching:
The Tsinghua University ME-rPPG paper (Wang et al., April 2025, arXiv:2504.01774) introduced temporal-spatial state-space duality for memory-efficient rPPG. Their approach hit 9.46ms inference latency with 3.6 MB memory usage. Those numbers represent a 19.5-49.7% improvement over previous methods in processing efficiency. The practical implication: real-time vitals inference on hardware that would have struggled two years ago.
The RhythmEdge project, documented on GitHub and associated publications, demonstrated a working real-time edge computing system for rPPG that detects blood volume changes from video on constrained hardware. Their architecture separated the pipeline into parallelizable stages, allowing commodity ARM processors to handle the workload.
On the API infrastructure side, Aerospike's 2025 analysis of P99 latency in production systems found that database-backed health APIs commonly see P99 latencies 10-20x higher than P50 due to garbage collection pauses and connection pool exhaustion. Their recommendation: pre-warm connection pools and use consistent hashing to avoid the cold-path penalty.
Frequently asked questions
What is a good latency target for a contactless vitals API?
For on-device processing, target under 25ms per frame at P50 to maintain real-time 30fps processing. For cloud API round-trips, under 150ms P50 and under 500ms P99 keeps the experience responsive. End-to-end scan time should stay under 30 seconds.
Should vitals processing happen on-device or in the cloud?
On-device processing eliminates network latency entirely and is the right choice for most consumer-facing applications. Cloud processing makes sense when you need to run larger models for higher accuracy, or when your deployment targets thin clients (like browser-based telehealth) that can't run inference locally. A hybrid approach — on-device inference with cloud-based post-processing and storage — is increasingly common.
How does network quality affect vitals API performance?
If your architecture involves cloud processing, network quality is often the dominant factor in P99 latency. Cellular connections in particular add 100-300ms of variable latency compared to WiFi. The best mitigation is to minimize what you send over the network: transmit extracted signal data (kilobytes) rather than raw video (megabytes). For critical deployments, implement offline fallback so scans complete even when connectivity drops.
How do I benchmark on low-end devices without owning every phone model?
Cloud device farms like Firebase Test Lab and BrowserStack give you access to hundreds of real devices. Run your benchmark suite on the bottom-quartile phones in your target market. Pay particular attention to devices with 3-4GB of RAM and older chipsets (Snapdragon 4xx series, MediaTek Helio), as these represent the performance floor your users will actually experience.
Where vitals API optimization is heading
The gap between research-grade rPPG latency and production-grade latency is closing. Two years ago, getting real-time inference on a mid-range phone required serious compromise on model quality. Today, architectures like ME-rPPG from Tsinghua show that you can have both low latency and competitive accuracy on constrained hardware. The next frontier is consistent latency — not just fast averages, but tight P99 numbers across the full range of devices, lighting conditions, and skin tones that a production health platform encounters.
Companies like Circadify are building vitals SDKs and APIs designed for exactly this kind of production-grade performance, where the engineering challenge isn't just extracting a heart rate from a video feed but doing it reliably and quickly enough that developers can embed it into real products without worrying about latency budgets.
For platform teams evaluating contactless vitals integration, the benchmarking framework here gives you a starting point. Measure the right things, at the right percentiles, on the right hardware. The numbers will tell you where to spend your optimization time — and they'll probably surprise you.
