
Author: Ashar Mirza - VoicePing Inc.

The Problem

We run a translation microservice using FastAPI and vLLM. Under heavy load, we hit server latency issues that didn’t match what our GPU utilization metrics suggested: GPU utilization showed a stuttering pattern, spiking to 93%, dropping to 0%, then spiking again, rather than the consistent high utilization we expected. The question: if the GPU has idle periods, where is the bottleneck? This article covers how we identified the architectural issues in our FastAPI + multiprocessing setup that were preventing efficient GPU utilization.

System Context

Our translation service runs as multiple API servers behind a load balancer:

Figure 1: Overall system architecture showing client applications, proxy/load balancer, and multiple API servers

  • Clients: Web, mobile, backend services
  • Proxy: Routes requests based on language pairs and server health
  • API Servers: Multiple FastAPI instances, each running vLLM
This article focuses on a single API server’s internal architecture and bottlenecks.

API Server Architecture

Here’s the internal structure of one API server:

Figure 2: Single API server architecture showing FastAPI, multiprocessing queues, worker processes, and vLLM instances

Components

1. FastAPI Main Process

from fastapi import FastAPI

app = FastAPI()

# Single-threaded async event loop
@app.post("/translate")
async def translate_endpoint(request: TranslateRequest):
    result = await translation_service.translate(request)
    return result
  • Handles HTTP requests with async/await
  • Single Python process, one event loop
  • Non-blocking I/O for concurrent request handling

2. TranslationService

class TranslationService:
    def __init__(self, worker: TranslationWorker):
        self.worker = worker

    async def translate(self, request: TranslateRequest) -> TranslateResponse:
        # Create translation task
        event_task = self.worker.add_translation_task(
            text=request.text,
            source_lang=request.source_lang,
            target_lang=request.target_lang,
            timeout=30
        )

        # Wait asynchronously for result
        await event_task.event.wait()
        return TranslateResponse(translation=event_task.result.translation)
  • Creates translation tasks
  • Manages EventTask objects with asyncio.Event
  • Bridges async/await with multiprocessing

3. TranslationWorker (Main Process)

class TranslationWorker:
    def __init__(self):
        self.ctx = multiprocessing.get_context("spawn")
        self.translation_queue = None  # Created later in _initialize()
        self.event_queue = None
        self.translation_tasks = None  # Shared dict, created later in _initialize()
        self.event_tasks: Dict[str, EventTask] = {}

    def _initialize(self):
        # Create queues in main process
        self.translation_queue = self.ctx.JoinableQueue(maxsize=300)
        manager = self.ctx.Manager()
        self.translation_tasks = manager.dict()  # Shared state
        self.event_queue = self.ctx.Queue()

    def add_translation_task(...) -> EventTask:
        key = "t_" + generate_random_key(10)
        # Store in shared dict
        self.translation_tasks[key] = TranslationTask(...)

        # Send to workers via queue
        self.translation_queue.put(key)  # Serialization

        # Create event for async waiting
        event_task = EventTask(key)
        self.event_tasks[key] = event_task
        return event_task
  • Queues created in main process (shared with workers)
  • JoinableQueue for task distribution
  • Manager().dict() for shared task state
  • Event queue for results

4. Worker Processes

def run(self):
    for worker_id in range(self.num_workers):
        worker = self.ctx.Process(
            target=self.process_queue,
            args=(worker_id, ready_event)
        )
        worker.start()

def process_queue(self, worker_id, ready_event):
    # Each worker loads its own vLLM instance
    translation_processor = TranslationProcessor(
        worker_id=worker_id,
        model_key=self.model_key,
        gpu_memory_utilization=self.gpu_memory_per_worker
    )

    # Process from shared queue
    while True:
        key = self.translation_queue.get()  # Deserialization
        task = self.translation_tasks[key]

        # Translate using vLLM
        result = translation_processor.translate(
            task.text,
            task.source_lang,
            task.target_lang
        )

        # Send result back
        self.event_queue.put((key, EventType.completed, result))  # Serialization
  • Spawned as separate processes (ctx.Process)
  • Each loads its own vLLM model instance
  • Pull from shared translation_queue
  • Return via shared event_queue
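
For context, here is a minimal sketch of what a TranslationProcessor wrapping vLLM's offline LLM API could look like. The prompt template, sampling settings, and return type are illustrative assumptions, not our production code:

from vllm import LLM, SamplingParams

class TranslationProcessor:
    def __init__(self, worker_id: int, model_key: str, gpu_memory_utilization: float):
        self.worker_id = worker_id
        # Each worker process loads its own copy of the model weights onto the GPU
        self.llm = LLM(
            model=model_key,
            gpu_memory_utilization=gpu_memory_utilization,  # fraction of GPU memory this instance may reserve
        )
        self.sampling_params = SamplingParams(temperature=0.0, max_tokens=512)

    def translate(self, text: str, source_lang: str, target_lang: str) -> str:
        # Illustrative prompt format; the real service uses its own template
        prompt = f"Translate the following text from {source_lang} to {target_lang}:\n{text}\n"
        outputs = self.llm.generate([prompt], self.sampling_params)
        return outputs[0].outputs[0].text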

5. EventTask (Async Synchronization)

class EventTask:
    def __init__(self, key: str):
        self.key = key
        self.event = asyncio.Event()  # Async synchronization
        self.event_type = EventType.waiting
        self.result = None

    def update(self, event_type, result):
        self.event_type = event_type
        self.result = result
        self.event.set()  # Wake waiting coroutine
  • Bridges multiprocessing with async/await
  • Each request gets an EventTask
  • await event.wait() blocks coroutine until worker completes
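
The missing piece is how results travel from the multiprocessing event_queue back to these EventTask objects. A background thread in the main process consumes the queue and wakes the waiting coroutine. Below is a minimal sketch of that consumer; the method names and the way the event loop is passed in are illustrative, not our exact implementation:

import asyncio
import threading

def start_event_consumer(self, loop: asyncio.AbstractEventLoop):
    # Daemon thread so it does not block process shutdown
    threading.Thread(target=self._consume_events, args=(loop,), daemon=True).start()

def _consume_events(self, loop: asyncio.AbstractEventLoop):
    while True:
        key, event_type, result = self.event_queue.get()  # blocks in the thread, not the event loop
        event_task = self.event_tasks.get(key)
        if event_task is not None:
            # asyncio.Event is not thread-safe; schedule the update on the event loop
            loop.call_soon_threadsafe(event_task.update, event_type, result)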

Request Flow

Here’s what happens for a single translation request:

Figure 3: Step-by-step request flow showing serialization points and async waiting

Step by step:
  1. Client POST /translate → FastAPI creates async coroutine
  2. async translate() → TranslationService handles request
  3. create_task() → Generate ID, create TranslationTask in shared dict
  4. queue.put(key) → Serialize task key, send to workers (IPC overhead)
  5. Worker: vllm.translate() → Worker processes translation
  6. event_queue.put(result) → Serialize result, send back (IPC overhead)
  7. event.set() → Update EventTask, wake coroutine
  8. await event.wait() unblocked → Retrieve result
  9. Return response → Send to client
Overhead points:
  • Step 4: Serialization (pickle task key)
  • Step 6: Serialization (pickle result)
  • Step 8: Async waiting for multiprocessing result
  • IPC coordination throughout
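
To get a feel for how much a queue round trip costs on its own, here is a standalone sketch (not part of our service) that echoes messages through a pair of spawn-context queues. Because the worker does no real work, any latency it measures is pure serialization and IPC coordination:

import multiprocessing as mp
import time

def echo_worker(task_q, result_q):
    while True:
        item = task_q.get()
        if item is None:  # poison pill to stop the worker
            break
        result_q.put(item)  # echo back immediately: only queue overhead, no compute

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    task_q, result_q = ctx.Queue(), ctx.Queue()
    proc = ctx.Process(target=echo_worker, args=(task_q, result_q))
    proc.start()

    latencies = []
    for i in range(1000):
        start = time.perf_counter()
        task_q.put({"key": f"t_{i}", "text": "hello " * 50})
        result_q.get()
        latencies.append(time.perf_counter() - start)

    task_q.put(None)
    proc.join()
    latencies.sort()
    print(f"median queue round-trip: {latencies[len(latencies) // 2] * 1e3:.2f} ms")

The absolute number depends on the machine, but whatever it measures is paid at least twice per request (task out, result back) and grows under contention when many requests queue up.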

Baseline Performance

Before optimization attempts:

Figure 4: Baseline performance showing throughput decrease and response time increase under load

Pattern:
  • Response time grows roughly linearly with load (1.4s → 11.3s)
  • Throughput decreases under load (3.3 → 2.2 RPS)
  • Actual vLLM translation time per request: 300-450ms

Figure 5: GPU utilization pattern before (spiky) and after (consistent) optimization

Spiky pattern: the GPU alternates between busy and idle. This indicated the GPU was waiting for work rather than being compute-bound.

Attempt 1: Multiple Workers

First hypothesis: more workers = better parallelization. We increased from 1 worker to 2 workers.

Configuration

num_workers = 2
gpu_memory_per_model = 0.3
  • Worker 1: Models A+B
  • Worker 2: Model C
  • Both share the same GPU
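
Expressed as a sketch, the change amounted to running two worker processes on the same GPU, each loading its own vLLM instance(s). The variable and model names below are illustrative, not our exact config:

# Illustrative configuration for Attempt 1 (names are hypothetical)
num_workers = 2
gpu_memory_per_model = 0.3   # each vLLM instance reserves ~30% of GPU memory

worker_models = {
    0: ["model_a", "model_b"],  # Worker 1: Models A+B
    1: ["model_c"],             # Worker 2: Model C
}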

Results


Figure 6: Performance degradation when adding a second worker process

Median translation time also degraded: 452ms → 2,239ms. Performance dropped across all load levels.

Why Multiple Workers Failed

This result makes sense when you understand GPU behavior and our architecture.

Figure 7: Multiple worker processes competing for GPU compute capacity

The Issue: Compute Contention

When one worker is processing a translation:
  • It uses ~90% of GPU compute capacity
  • Other workers can’t effectively utilize the remaining capacity in parallel
  • Workers end up waiting for GPU availability
Why no parallel benefit:
  • Worker 1 starts vLLM generation → uses ~90% GPU compute
  • Worker 2 tries to start → only ~10% GPU compute available
  • Worker 2 runs slowly or waits
  • Effectively sequential execution despite separate processes
Additional overhead:
  • Process spawning and management
  • GPU memory split between workers (each loads model weights)
  • IPC queue coordination
  • Context switching between processes
The GPU can technically run multiple CUDA kernels simultaneously, but when one worker is actively using ~90% of compute capacity, there’s insufficient remaining capacity for another worker to run efficiently in parallel.

Additional Architectural Issues

With multiple workers competing for the same resources:
  • Context switching overhead: OS switching between worker processes
  • Doubled memory usage: Each worker loads full model weights
  • No effective parallelism: Sequential GPU execution despite parallel architecture
The same queues handle all workers (translation_queue and event_queue shared), so the IPC overhead per request remains constant. However, the additional overhead from process management, context switching, and memory duplication, combined with no parallel GPU benefit, made performance worse.

Identified Bottlenecks

After this experiment, we identified the core issues:

1. IPC Serialization Overhead

  • Every request: serialize task → worker, serialize result → main
  • Python multiprocessing queue uses pickle
  • Overhead on every request
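
The serialization cost itself is easy to isolate. A quick standalone sketch (the payload below is a made-up stand-in for a translation result) times a pickle round trip, which is what multiprocessing queues do on every put and get:

import pickle
import time

# Hypothetical payload roughly shaped like a translation result
payload = ("t_abc123", "completed", {"translation": "example sentence " * 100})

n = 10_000
start = time.perf_counter()
for _ in range(n):
    pickle.loads(pickle.dumps(payload))
elapsed = time.perf_counter() - start
print(f"average pickle round trip: {elapsed / n * 1e6:.1f} µs")

On its own, pickling a small payload is cheap; the per-request cost comes from doing it on every hop, combined with queue locking, process wake-ups, and the context switching described below.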

2. Compute Contention

  • One worker using ~90% GPU compute
  • Other workers can’t run effectively in parallel
  • Sequential execution despite multiprocessing

3. Async/Await + Multiprocessing Bridge

  • asyncio.Event waiting for multiprocessing result
  • Thread-based event queue consumer
  • Coordination overhead between async and multiprocess models

4. Wasted GPU Cycles

  • GPU idle while waiting for queue operations
  • Spiky utilization (93% → 0% → 93%)
  • Translation time ~400ms, total response time 11+ seconds
  • Most time spent in queues, not computing

5. Architecture Complexity

  • FastAPI (async/await)
  • TranslationService (bridge)
  • TranslationWorker (coordination)
  • JoinableQueue (IPC)
  • Worker processes (multiprocessing)
  • Event queue (IPC)
  • EventTask (async sync)
  • vLLM (actual work)
Each layer added latency.

Key Insights

1. Async/Await + Multiprocessing = Overhead

Bridging these two concurrency models requires coordination:
  • asyncio.Event for async waiting
  • Thread pool for consuming event queue
  • Serialization at process boundaries
This bridge has a cost.
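
Even the most direct way to await a blocking multiprocessing call from asyncio still routes through a thread pool. A minimal sketch (not our exact code) of what that hand-off looks like:

import asyncio

async def wait_for_result(event_queue):
    loop = asyncio.get_running_loop()
    # event_queue.get() blocks, so it must run in a thread pool executor;
    # the coroutine suspends until the worker pushes a result
    key, event_type, result = await loop.run_in_executor(None, event_queue.get)
    return key, event_type, result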

2. Multiple Processes ≠ GPU Parallelism

Adding worker processes doesn’t automatically improve GPU utilization when:
  • One worker uses ~90% of GPU compute
  • Insufficient remaining capacity for parallel work
  • Sequential execution despite multiprocessing overhead

3. Queue Overhead Dominates

At 25 concurrent requests:
  • vLLM translation time: ~400ms
  • Total response time: 11,258ms
  • Queue overhead: ~97% of total time
The majority of time was spent in queues and coordination, not computing.

4. Spiky GPU = Architectural Issue

  • Consistent GPU utilization (e.g. 90-95%) indicates a compute-bound workload
  • A spiky pattern (93% → 0% → 93%) indicates the GPU is waiting for work; the bottleneck is elsewhere (in our case, queues and IPC)
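
A simple way to see which pattern you have is to poll GPU utilization once per second while running a load test. A sketch using the pynvml bindings, assuming a single GPU at index 0:

import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Print utilization once per second; a compute-bound run stays consistently high,
# a queue-bound run oscillates between high and zero
for _ in range(60):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    print(f"gpu={util.gpu}% mem={util.memory}%")
    time.sleep(1)

pynvml.nvmlShutdown()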

Conclusion

The bottleneck wasn’t GPU capacity; it was our multiprocessing architecture. Issues identified:
  1. IPC overhead from queue serialization
  2. GPU compute contention without effective parallelism
  3. Async/await + multiprocessing coordination overhead
  4. Most latency from queues, not vLLM processing
Symptoms:
  • Spiky GPU utilization
  • Response time dominated by queue wait
  • Adding workers made performance worse
In Part 2, we’ll cover the solution: eliminating multiprocessing, using vLLM’s AsyncLLMEngine directly, and achieving an 82% throughput improvement in production.
Preview:
  • Remove multiprocessing architecture entirely
  • Use vLLM’s AsyncLLMEngine with FastAPI directly
  • Right-size continuous batching configuration
  • Production result: Improved throughput (+82%)