How a Go Rewrite Made Our WebSocket Proxy 100x More Efficient

Author

Akira Noda - VoicePing Inc.

TL;DR

We rewrote our WebSocket proxy server from Python to Go and reduced CPU usage to 1/10 and memory consumption to 1/100. The project not only improved resource efficiency but also taught us a crucial concurrency lesson:

Keep locks as small and as few as possible.

Context

Our system is a real-time STT (speech-to-text) and translation pipeline used in VoicePing, where each client device streams audio to our backend for speech-to-text and translation in multiple languages. The WebSocket proxy server sits in the middle of this pipeline:

System architecture overview

Each client maintains a persistent WebSocket session with the STT proxy
The proxy relays audio packets to one of several GPU-based inference servers
Waits for transcribed text and streams back partial transcripts and translations

This architecture must handle thousands of concurrent real-time audio sessions — with sub-second latency. However, our previous Python-based proxy became the bottleneck.

Before: Python Proxy (Inefficient)

Our first proxy server was implemented in Python (FastAPI + asyncio + websockets) and deployed using Gunicorn with multiple worker processes. It worked fine at small scale but quickly hit resource limits under production traffic:

Metric	Before (Python)	After (Go)
CPU usage	~12 cores, 40–50%	~12 cores, 4–5%
Memory usage	~25 GB	~10 MB

Why Python Struggled

Despite being asynchronous, Python’s architecture imposed several systemic bottlenecks: Single-Threaded Event Loop: The asyncio model multiplexes thousands of coroutines on a single thread. This means only one coroutine runs at a time — others wait until the loop yields control. Under heavy I/O, this single loop becomes a central choke point, especially for WebSocket workloads with constant read/write events. Gunicorn Multiprocessing: To use all CPU cores, we spawned multiple worker processes. Each process loaded a full Python runtime and app state — multiplying memory usage linearly. Heavy Task Contexts: Each WebSocket connection keeps its own stack frames, futures, and callbacks — consuming large memory per connection. Interpreter Overhead: Every coroutine runs inside the CPython interpreter, adding dynamic type checks and bytecode dispatch overhead. As a result, even though the system appeared concurrent, it was sequential at its core — every coroutine waited on the same event loop, amplifying latency and CPU load as connection counts rose. It was inevitable — Python’s model simply wasn’t suited for long-lived, high-throughput, low-latency WebSocket multiplexing at this scale. So we rewrote it in Go.

Birds-eye View of the Proxy Server

Proxy server architecture with connection pools

Our proxy server acts as a middle layer between clients and multiple inference servers. Clients send audio bytes via WebSocket, and the proxy routes each stream to one of several STT (speech-to-text) inference servers. Client → Proxy: Each client opens a WebSocket connection to the proxy and continuously sends audio chunks. Proxy → Inference Server: The proxy picks one active connection from its WebSocket connection pool — a pool of persistent backend connections (e.g., Pool A for Server A, Pool B for Server B). Streaming Processing: The proxy keeps the mapping between the client and the selected backend connection for the entire session, forwarding audio packets and returning STT results in real-time. Connection Reuse: When the session ends (client disconnects), the proxy returns the backend connection to the pool, making it available for another client. This reuse mechanism drastically reduces connection churn and resource overhead. The proxy server maintains two main parts:

Connection manager: handles routing and lifecycle of client connections
WebSocket connection pools: manage reusable backend connections for each inference server

Each pool corresponds to one inference target (e.g., A or B), and holds a limited number of pre-established WebSocket connections. This architecture allows the proxy to:

Balance load efficiently across inference servers
Avoid frequent connection establishment overhead

Functional Requirements for Connection Pool Management

Connection pool functional requirements

Designing the connection pool management was the most critical part of the new proxy architecture. The pool needed to efficiently handle thousands of concurrent WebSocket sessions while keeping the system stable and lightweight.

Requirement	Purpose
Pick an available connection	Each incoming request must quickly obtain a ready-to-use backend connection without blocking other clients. Ensures low latency and smooth load balancing.
Return connection to pool	Once a client session ends, the connection should be released and made available for reuse. Minimizes overhead of repeatedly opening/closing connections.
Keep only healthy connections	Periodic health checks remove or recreate failed connections. Prevents unhealthy connections from accumulating and causing silent failures.
Sync database config	Periodically synchronizes backend connection configuration from a central database, allowing dynamic scaling without restarts.

First Design (Naive)

Initial design with global mutex

At first, I implemented a straightforward design — simple but naïve:

A single array holding all connections
Boolean flags for “in-use” / “available” states
One global lock for all operations

Picking a connection involved:

Acquiring the lock
Scanning the array to find an available connection
Marking it as “in use”
Releasing the lock

Returning a connection worked similarly — acquire the lock, flip the flag, release the lock. We also had a separate goroutine to check database configuration periodically. This goroutine refreshed backend settings such as server lists or pool sizes from the database, ensuring the proxy always had the latest configuration without requiring a restart. A separate health-check goroutine periodically scanned all connections, removed unhealthy ones, and added new ones when needed.

// ────────────────────────────
// FIRST DESIGN (naive, global mutex)
// Single slice + flags, one coarse-grained mutex.
// ────────────────────────────

type Conn struct {
    id      string
    ws      *websocket.Conn
    inUse   bool
    healthy bool
}

type Pool struct {
    mu       sync.Mutex
    conns    []*Conn
    maxSize  int
    dialURL  string
}

// newConn dials a backend and returns a connected *Conn.
// NOTE: In this first design we (incorrectly) call this under the global lock.
func (p *Pool) newConn(ctx context.Context) (*Conn, error) {
    d := websocket.Dialer{}
    c, _, err := d.DialContext(ctx, p.dialURL, nil)
    if err != nil {
        return nil, err
    }
    return &Conn{
        id:      uuid.NewString(),
        ws:      c,
        inUse:   false,
        healthy: true,
    }, nil
}

// ────────────────────────────
// AdjustPool
// Check capacity vs current size; if not full, fill it.
// BAD PATTERN: holds the global mutex across slow I/O (dial).
// ────────────────────────────
func (p *Pool) AdjustPool(ctx context.Context) error {
    p.mu.Lock()
    defer p.mu.Unlock()

    cur := len(p.conns)
    if cur >= p.maxSize {
        return nil
    }

    needed := p.maxSize - cur
    for i := 0; i < needed; i++ {
        conn, err := p.newConn(ctx)
        if err != nil {
            return fmt.Errorf("adjust: dial failed: %w", err)
        }
        p.conns = append(p.conns, conn)
    }
    return nil
}

// ────────────────────────────
// GetPoolStats
// Return total / in-use / available counts.
// BAD PATTERN: O(n) scan under global lock every call.
// ────────────────────────────
type PoolStats struct {
    Total     int
    InUse     int
    Available int
    Healthy   int
    Unhealthy int
}

func (p *Pool) GetPoolStats() PoolStats {
    p.mu.Lock()
    defer p.mu.Unlock()

    var inUse, healthy int
    for _, c := range p.conns {
        if c.inUse {
            inUse++
        }
        if c.healthy {
            healthy++
        }
    }
    total := len(p.conns)
    return PoolStats{
        Total:     total,
        InUse:     inUse,
        Available: total - inUse,
        Healthy:   healthy,
        Unhealthy: total - healthy,
    }
}

// ────────────────────────────
// HealthCheck
// Ping all connections and mark healthy=false on failures.
// BAD PATTERN: holds global lock during network I/O & mutates in place.
// ────────────────────────────
func (p *Pool) HealthCheck(ctx context.Context, timeout time.Duration) {
    p.mu.Lock()
    defer p.mu.Unlock()

    deadline := time.Now().Add(timeout)
    for _, c := range p.conns {
        if c.ws == nil {
            c.healthy = false
            continue
        }
        if err := c.ws.WriteControl(websocket.PingMessage, []byte("ping"), deadline); err != nil {
            c.healthy = false
            _ = c.ws.Close()
            c.ws = nil
            continue
        }
        c.healthy = true
    }
}

Issues in the First Design

Despite working in small tests, the initial pooling model broke down under load. We observed: Incorrect total count (overshoot/undershoot): Concurrent pick/return operations mutated the same slice + flags under one coarse lock, so retries and timeouts occasionally double-returned or lost a connection, drifting past the max or starving the pool. Racey concurrent access → crashes & corrupt state: Health-check and metrics goroutines contended with request handlers; long health checks held the global lock, while readers sometimes observed half-updated flags, triggering panics or “no available connection” errors despite capacity. Goroutine leaks: Failed dials and timed-out health checks weren’t always canceled or reaped; retries spawned new goroutines while references to old ones lingered. Fragile observability: Counters derived from the slice + flags frequently disagreed with reality, making alerts noisy and masking real incidents. Lock contention & latency spikes: O(n) scans for an available slot under a single lock amplified tail latency as concurrency increased.

Root cause in one line: too much shared state guarded by one wide, long-held lock, plus operations (health/metrics) that should have been independent competing for that same lock.

Revised Design

Lock-free design with atomics and channels

The redesign focused on minimizing shared state and isolating responsibilities:

Component	Purpose
Queue for available connections	Enqueue/dequeue handles internal locks automatically
sync.Map for in-use connections	Lock-free concurrent map
Atomic variables	Health flags and counters
Dedicated goroutine per connection	Independent health checks

Each component now operates independently, with no pool-wide locks. The number & scope of locks dropped dramatically.

package pool

import (
	"context"
	"errors"
	"net/http"
	"sync"
	"sync/atomic"
	"time"

	"github.com/google/uuid"
	"github.com/gorilla/websocket"
)

// ────────────────────────────
// Conn (a single reusable backend WebSocket connection)
// ────────────────────────────
//
// ✅ GOOD PATTERN:
// - Keep each connection self-contained and concurrent-safe using atomics.
// - Encapsulate health logic inside the Conn itself (no shared state mutation).
// - Avoid external locks and let each conn manage its own goroutine lifecycle.

type Conn struct {
	id       string
	ws       *websocket.Conn
	healthy  atomic.Bool   // ✅ lock-free health status flag
	lastPing atomic.Int64  // ✅ atomic timestamp for last heartbeat
}

// ✅ GOOD PATTERN: Explicit small helper methods (no external mutation)
func (c *Conn) ID() string { return c.id }

func (c *Conn) IsAlive() bool {
	return c.healthy.Load()
}

// ✅ GOOD PATTERN: Safe close, idempotent and isolated
func (c *Conn) Close() error {
	if c.ws != nil {
		return c.ws.Close()
	}
	return nil
}

// ✅ GOOD PATTERN: Health loop runs independently per connection
// - No shared/global lock.
// - Non-blocking heartbeat.
// - Fails fast and marks itself dead without blocking pool operations.
func (c *Conn) StartHealthLoop(ctx context.Context, interval time.Duration) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			deadline := time.Now().Add(interval / 2)
			if err := c.ws.WriteControl(websocket.PingMessage, []byte("ping"), deadline); err != nil {
				c.healthy.Store(false)
				_ = c.Close()
				return
			}
			c.lastPing.Store(time.Now().UnixNano())
			c.healthy.Store(true)
		}
	}
}

// ────────────────────────────
// ConnPool (fast path only — no dialing, no blocking I/O)
// ────────────────────────────
//
// ✅ GOOD PATTERN:
// - Separate responsibilities: pool only manages available connections.
// - No dialing / blocking I/O under locks.
// - Use channel buffering and atomic counters for concurrency safety.
// - Eliminates coarse-grained global mutex.

type ConnPool struct {
	available chan *Conn   // ✅ buffered channel for ready conns (lock-free)
	inUse     sync.Map     // ✅ concurrent map for tracking active conns
	statsIn   atomic.Int64 // ✅ atomic counters (no need for locks)
	statsOut  atomic.Int64
}

// ✅ GOOD PATTERN: Explicit, fixed-capacity pool construction
func NewConnPool(capacity int) *ConnPool {
	return &ConnPool{
		available: make(chan *Conn, capacity),
	}
}

func (p *ConnPool) Capacity() int { return cap(p.available) }

// ✅ GOOD PATTERN: Non-blocking Acquire
// - Never holds locks while waiting for I/O.
// - Returns instantly if no conn available.
func (p *ConnPool) Acquire(ctx context.Context) (*Conn, error) {
	select {
	case c := <-p.available:
		p.inUse.Store(c.id, c)
		p.statsIn.Add(1)
		return c, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	default:
		return nil, errors.New("no connection available")
	}
}

// ✅ GOOD PATTERN: Non-blocking Release
// - Never waits for space in the channel.
// - Drops unhealthy or excess conns immediately.
// - No global mutex.
func (p *ConnPool) Release(c *Conn) {
	p.inUse.Delete(c.id)
	p.statsOut.Add(1)

	if c.IsAlive() {
		select {
		case p.available <- c:
			// ✅ returned to pool safely
		default:
			// ✅ pool full → discard stale conn safely
			_ = c.Close()
		}
	} else {
		// ✅ unhealthy → close immediately
		_ = c.Close()
	}
}

// ✅ GOOD PATTERN: Offer is used by reconciler (external goroutine)
// - Keeps dialing/repair logic out of hot path.
// - Backpressure-safe with non-blocking insert.
func (p *ConnPool) Offer(c *Conn) bool {
	select {
	case p.available <- c:
		return true
	default:
		return false
	}
}

// ────────────────────────────
// Stats and Metrics
// ────────────────────────────
//
// ✅ GOOD PATTERN:
// - Lock-free snapshot using atomics.
// - Avoids holding locks during metrics collection.

type PoolStats struct {
	Capacity  int
	Available int
	InUse     int
	Acquired  int64
	Released  int64
}

// ✅ GOOD PATTERN: Snapshot safely aggregates pool state without blocking
func (p *ConnPool) Snapshot() PoolStats {
	inUseCount := 0
	p.inUse.Range(func(_, _ any) bool {
		inUseCount++
		return true
	})
	return PoolStats{
		Capacity:  cap(p.available),
		Available: len(p.available),
		InUse:     inUseCount,
		Acquired:  p.statsIn.Load(),
		Released:  p.statsOut.Load(),
	}
}

Each component operates independently with no pool-wide locks. The number and scope of locks dropped dramatically.

Event-Driven Reconciliation

Reconciliation worker pattern

Keeping the right pool size was another challenge. When connections failed or returned, reconciliation could easily run concurrently and overshoot the maximum pool size. The solution was an event-driven reconciliation loop:

Each operation sends a message into a channel (messageCh)
The reconciliation goroutine processes these messages sequentially
This ensures no race conditions

This model allowed us to handle high concurrency while keeping the system deterministic and safe.

type ServerConnectionPool struct {
	ctx         context.Context
	cancel      context.CancelFunc
	reconcileCh chan struct{}
	logger      *zap.Logger
	// ... other fields ...
}

// ✅ GOOD PATTERN: Constructor wires a buffered (size=1) signal channel to enable coalescing.
func NewServerConnectionPool(logger *zap.Logger /* ... */) *ServerConnectionPool {
	ctx, cancel := context.WithCancel(context.Background())
	return &ServerConnectionPool{
		ctx:         ctx,
		cancel:      cancel,
		reconcileCh: make(chan struct{}, 1), // ✅ coalescing buffer
		logger:      logger,
		// ... init other fields ...
	}
}

// ✅ GOOD PATTERN: Non-blocking signal helper; bursts coalesce into a single pending signal.
func (bp *ServerConnectionPool) trySend(ch chan struct{}) {
	select {
	case ch <- struct{}{}:
	default:
		// already queued; coalesced
	}
}

// triggerReconcile only *requests* reconciliation; never performs it inline.
// ✅ GOOD PATTERN: no work on the caller's goroutine, prevents stampedes.
func (bp *ServerConnectionPool) triggerReconcile() {
	bp.trySend(bp.reconcileCh)
}

// ✅ GOOD PATTERN: Public starter that owns the worker lifecycle.
func (bp *ServerConnectionPool) Start() {
	go bp.reconcileWorker()
}

// ✅ GOOD PATTERN: Graceful shutdown.
func (bp *ServerConnectionPool) Stop() {
	bp.cancel()
}

// reconcileWorker serializes reconciliation and coalesces bursts.
// ✅ GOOD PATTERN:
//   - Single-threaded worker → no concurrent ensureCapacity() runs
//   - Periodic safety net with light jitter to avoid thundering herds
//   - Drain queue before each run to collapse multiple signals into one
func (bp *ServerConnectionPool) reconcileWorker() {
	jitter := func(base time.Duration) time.Duration {
		// small ±10% jitter
		n := time.Duration(float64(base) * (0.9 + 0.2*rand.Float64()))
		return n
	}

	ticker := time.NewTicker(jitter(5 * time.Second))
	defer ticker.Stop()

	bp.logger.Debug("Reconciliation worker started", zap.String("pool", bp.GetName()))
	defer bp.logger.Debug("Reconciliation worker stopped", zap.String("pool", bp.GetName()))

	// Optional: run once immediately on startup
	bp.ensureCapacity()

	for {
		select {
		case <-bp.ctx.Done():
			return

		case <-ticker.C:
			// ✅ Periodic reconciliation as a safety net
			bp.ensureCapacity()
			// reset ticker with jitter to spread load
			ticker.Reset(jitter(5 * time.Second))

		case <-bp.reconcileCh:
			// ✅ Drain any queued signals: burst -> single reconciliation
			for {
				select {
				case <-bp.reconcileCh:
					// keep draining
				default:
					bp.ensureCapacity()
					goto CONTINUE
				}
			}
		}
	CONTINUE:
	}
}

// ensureCapacity is the single authoritative place that:
// 1) Cleans unhealthy queued conns
// 2) Computes current vs target
// 3) Grows or shrinks the pool
// 4) Emits metrics/logs
func (bp *ServerConnectionPool) ensureCapacity() {}

// ────────────────────────────
// Event sources that *request* reconciliation
// ────────────────────────────

// Health watcher: when a server flips health, request reconcile.
// ✅ GOOD PATTERN: do not call ensureCapacity() here; just signal.
func (bp *ServerConnectionPool) startHealthWatcher(healthCh <-chan HealthEvent) {
	go func() {
		for {
			select {
			case <-bp.ctx.Done():
				return
			case ev := <-healthCh:
				if ev.Changed {
					bp.logger.Debug("health change → reconcile",
						zap.String("server", ev.ServerID))
					bp.triggerReconcile()
				}
			}
		}
	}()
}

// Config watcher: when scale target changes, request reconcile.
// ✅ GOOD PATTERN: apply config change, then signal worker.
func (bp *ServerConnectionPool) startConfigWatcher(cfgCh <-chan ScaleTarget) {
	go func() {
		for {
			select {
			case <-bp.ctx.Done():
				return
			case target := <-cfgCh:
				// update internal target state...
				bp.logger.Debug("scale target changed → reconcile",
					zap.Int("target", target.Connections))
				bp.triggerReconcile()
			}
		}
	}()
}

Key patterns:

Single-threaded worker serializes reconciliation
Buffered channel coalesces bursts into single operations
Periodic safety net with jitter prevents thundering herds

Local Performance Check

Local performance test setup and results

We validated performance locally with the following setup:

Test Setup

Component	Configuration
Proxy	Go-based WebSocket proxy
Backends	3 × Echo WebSocket servers
Load	3,000 connections simultaneously (no ramp-up)
Traffic	1 KB text messages @ 100 msg/s per connection

Results

Metric	Value
Concurrent sessions	~3,000 stable
Throughput	~300K messages/sec
Peak memory	~150 MB
Average memory	~60 MB
CPU usage	~4–5% of 12 cores

This confirms that the proxy maintains a flat memory footprint, even under fully concurrent connection bursts — validating the effectiveness of the connection pool isolation and event-driven reconciliation model.

Performance comparison summary

Conclusion

After deploying the new Go-based proxy, we observed major improvements across performance, scalability, and stability:

Category	Python (FastAPI + asyncio + Gunicorn)	Go (Goroutines + Channels + Atomics)	Improvement
CPU usage	~12 cores × 40–50%	~12 cores × 4–5%	~90% reduction
Memory usage	~25 GB	~60–150 MB	~99% reduction
Scalability	Limited to hundreds	Sustains thousands	10x scale

The Go rewrite wasn’t just a language change — it was a concurrency model transformation.

Key Takeaway: Design concurrency as independent, communicating processes — not as shared mutable state under protection.

This architectural shift allowed the proxy to scale from hundreds to thousands of concurrent WebSocket sessions with near-constant resource usage and without losing clarity in code or operations.

References

Go Concurrency Patterns - golang.org/doc/effective_go
gorilla/websocket - github.com/gorilla/websocket
Python asyncio Event Loop - docs.python.org

Getting Started

Quick Start Guide

Pricing & Plans

Live Captions & Webinars

PC Voice Translation

Subtitles, Minutes & Dictionary

Mobile App

Admin Features

SSO Configuration

Virtual Office

Productivity Management

Support & FAQ

Research

Hiring

Legal & Security

How a Go Rewrite Made Our WebSocket Proxy 100x More Efficient

Author

TL;DR

Context

Before: Python Proxy (Inefficient)

Why Python Struggled

Birds-eye View of the Proxy Server

Functional Requirements for Connection Pool Management

First Design (Naive)

Issues in the First Design

Revised Design

Event-Driven Reconciliation

Local Performance Check

Test Setup

Results

Conclusion

References

Getting Started

Quick Start Guide

Pricing & Plans

Live Captions & Webinars

PC Voice Translation

Subtitles, Minutes & Dictionary

Mobile App

Admin Features

SSO Configuration

Virtual Office

Productivity Management

Support & FAQ

Research

Hiring

Legal & Security

​Author

​TL;DR

​Context

​Before: Python Proxy (Inefficient)

​Why Python Struggled

​Birds-eye View of the Proxy Server

​Functional Requirements for Connection Pool Management

​First Design (Naive)

​Issues in the First Design

​Revised Design

​Event-Driven Reconciliation

​Local Performance Check

​Test Setup

​Results

​Conclusion

​References

Author

TL;DR

Context

Before: Python Proxy (Inefficient)

Why Python Struggled

Birds-eye View of the Proxy Server

Functional Requirements for Connection Pool Management

First Design (Naive)

Issues in the First Design

Revised Design

Event-Driven Reconciliation

Local Performance Check

Test Setup

Results

Conclusion

References