Building Distributed LLM Inference with Go

Running large language models locally is increasingly practical, but scaling inference across multiple machines is still a challenge most developers haven't tackled. When I built GoLLama as part of a distributed systems course project, I wanted to solve exactly that: routing requests across multiple llama.cpp workers behind a clean, provider-agnostic API.

The Problem

Most local LLM tooling assumes a single machine. You spin up llama.cpp, point your app at localhost:8080, and you're done. But what happens when:

Your model is too large for a single GPU?
You want redundancy across machines?
You need to handle concurrent requests without queueing them on one worker?

Existing solutions either require cloud providers (OpenAI, Anthropic) or complex orchestration frameworks. I wanted something lightweight and self-contained.

The Hub-and-Spoke Architecture

GoLLama uses a simple hub-and-spoke pattern:

Client → Hub (GoLLama API) → Worker 1 (llama.cpp)
                           → Worker 2 (llama.cpp)
                           → Worker 3 (llama.cpp)

The hub exposes a single OpenAI-compatible /v1/chat/completions endpoint. Incoming requests are load-balanced across registered workers — each of which is a standard llama.cpp HTTP server.
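
As a rough illustration of the proxying described above (a sketch, not GoLLama's actual code), the hub can be modeled as one standard-library reverse proxy per worker with a counter that rotates between them. The names newHubHandler, fakeWorker and demoProxy are hypothetical, the fake workers stand in for real llama.cpp servers, and health filtering and authentication are left out here.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
	"strings"
	"sync/atomic"
)

// newHubHandler forwards /v1/chat/completions to the given worker base URLs,
// rotating through them in order. It assumes at least one worker is supplied.
func newHubHandler(workers []*url.URL) http.Handler {
	proxies := make([]*httputil.ReverseProxy, len(workers))
	for i, u := range workers {
		proxies[i] = httputil.NewSingleHostReverseProxy(u)
	}
	var counter uint64
	mux := http.NewServeMux()
	mux.HandleFunc("/v1/chat/completions", func(w http.ResponseWriter, r *http.Request) {
		// The atomic counter keeps the rotation safe across concurrent requests.
		i := (atomic.AddUint64(&counter, 1) - 1) % uint64(len(proxies))
		proxies[i].ServeHTTP(w, r)
	})
	return mux
}

// fakeWorker stands in for a llama.cpp server and replies with its own name.
func fakeWorker(name string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, name)
	}))
}

// demoProxy sends n requests through a hub backed by two fake workers and
// returns the names of the workers that answered, in order.
func demoProxy(n int) []string {
	w1, w2 := fakeWorker("worker-1"), fakeWorker("worker-2")
	defer w1.Close()
	defer w2.Close()
	u1, _ := url.Parse(w1.URL)
	u2, _ := url.Parse(w2.URL)
	hub := httptest.NewServer(newHubHandler([]*url.URL{u1, u2}))
	defer hub.Close()

	var served []string
	for i := 0; i < n; i++ {
		resp, err := http.Post(hub.URL+"/v1/chat/completions", "application/json",
			strings.NewReader(`{"messages":[]}`))
		if err != nil {
			served = append(served, "error: "+err.Error())
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		served = append(served, string(body))
	}
	return served
}

func main() {
	fmt.Println(demoProxy(4)) // [worker-1 worker-2 worker-1 worker-2]
}
```

Because the workers already speak the same OpenAI-compatible protocol, the hub in this sketch never has to inspect the request body; it only decides where the request goes.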

Why Go?

Go was an obvious choice for the hub:

Goroutines make concurrent request handling trivial: net/http already serves each incoming request on its own goroutine
Standard library has excellent HTTP primitives; no framework needed
Fast compilation speeds up iteration in a course setting
Static binary means deployment to workers is a single file copy

JWT Authentication

Since the hub is meant to be self-hosted and potentially exposed over a network, I added JWT-based authentication. Every client request must include a bearer token:

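// authMiddleware rejects any request that does not carry a valid bearer token
// and passes everything else on to the wrapped handler.
func authMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // extractBearerToken and validateJWT are helpers defined elsewhere.
        token := extractBearerToken(r)
        if !validateJWT(token) {
            http.Error(w, "Unauthorized", http.StatusUnauthorized)
            return
        }
        next.ServeHTTP(w, r)
    })
}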

Token validation uses golang-jwt/jwt with a configurable secret loaded from environment variables.
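
The two helpers the middleware calls are not shown, so here is a hedged, standard-library-only sketch of what they have to do. It is not the golang-jwt/jwt API and not GoLLama's code: bearerFromHeader, verifyHS256 and signHS256 are hypothetical names, HS256 is assumed as the signing algorithm, and only the signature and the exp claim are checked. In a real deployment the library handles these checks and more.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
	"time"
)

// bearerFromHeader returns the token from an "Authorization: Bearer <token>"
// header value, or "" if the header is empty or uses another scheme.
func bearerFromHeader(header string) string {
	scheme, token, ok := strings.Cut(header, " ")
	if !ok || !strings.EqualFold(scheme, "Bearer") {
		return ""
	}
	return strings.TrimSpace(token)
}

// extractBearerToken mirrors the helper name used by the middleware.
func extractBearerToken(r *http.Request) string {
	return bearerFromHeader(r.Header.Get("Authorization"))
}

// verifyHS256 checks the declared algorithm, the HMAC-SHA256 signature and,
// if present, the exp claim. Other claims are deliberately not validated.
func verifyHS256(token string, secret []byte) bool {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return false
	}
	enc := base64.RawURLEncoding
	headerJSON, err := enc.DecodeString(parts[0])
	if err != nil {
		return false
	}
	var header struct {
		Alg string `json:"alg"`
	}
	// Reject anything that does not declare HS256, including "none".
	if json.Unmarshal(headerJSON, &header) != nil || header.Alg != "HS256" {
		return false
	}
	sig, err := enc.DecodeString(parts[2])
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(parts[0] + "." + parts[1]))
	if !hmac.Equal(sig, mac.Sum(nil)) { // constant-time comparison
		return false
	}
	payloadJSON, err := enc.DecodeString(parts[1])
	if err != nil {
		return false
	}
	var claims struct {
		Exp *float64 `json:"exp"`
	}
	if json.Unmarshal(payloadJSON, &claims) != nil {
		return false
	}
	return claims.Exp == nil || float64(time.Now().Unix()) < *claims.Exp
}

// signHS256 builds a token for the demo; a real client would receive one.
func signHS256(payloadJSON string, secret []byte) string {
	enc := base64.RawURLEncoding
	input := enc.EncodeToString([]byte(`{"alg":"HS256","typ":"JWT"}`)) + "." +
		enc.EncodeToString([]byte(payloadJSON))
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(input))
	return input + "." + enc.EncodeToString(mac.Sum(nil))
}

func main() {
	secret := []byte("demo-secret") // stand-in for the value read from the environment
	tok := signHS256(`{"sub":"client-1","exp":4102444800}`, secret)
	fmt.Println(verifyHS256(bearerFromHeader("Bearer "+tok), secret)) // true
	fmt.Println(verifyHS256(tok, []byte("wrong-secret")))             // false
}
```

Pinning the accepted algorithm matters more than it looks: a validator that trusts whatever the token header declares can be talked into skipping signature verification entirely.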

Worker Health Checks

The hub maintains a live registry of workers. A background goroutine pings each worker every 30 seconds:

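// healthCheck runs in its own goroutine and re-probes every worker every 30 seconds.
func (h *Hub) healthCheck() {
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()
    for range ticker.C {
        // Snapshot the registry under the lock so slow pings never block
        // routing and concurrent registration cannot race with this loop.
        h.mu.Lock()
        workers := append([]*Worker(nil), h.workers...)
        h.mu.Unlock()

        for _, worker := range workers {
            // Ping is assumed to enforce its own timeout.
            worker.SetHealthy(worker.Ping() == nil)
        }
    }
}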

Unhealthy workers are skipped during routing. This gives the system basic fault tolerance with minimal complexity.
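
The Ping method itself is not shown. A minimal sketch, assuming each worker exposes a GET /health route that answers 200 when it is ready (the llama.cpp server provides one, but treat the exact path and status codes as assumptions to check against your build), could look like the following. The Worker fields, the 2-second timeout and demoHealth are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// Worker is a hypothetical registry entry; the real struct is not shown in the post.
type Worker struct {
	BaseURL string
	healthy atomic.Bool
}

func (w *Worker) SetHealthy(ok bool) { w.healthy.Store(ok) }
func (w *Worker) IsHealthy() bool    { return w.healthy.Load() }

// Ping issues GET <BaseURL>/health and treats a transport error, a timeout
// or any non-200 status as a failed check.
func (w *Worker) Ping() error {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, w.BaseURL+"/health", nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("worker %s: health check returned %s", w.BaseURL, resp.Status)
	}
	return nil
}

// statusServer starts a fake worker that answers every request with code.
func statusServer(code int) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(code)
	}))
}

// demoHealth probes a ready worker, a worker that is still loading and a
// worker that has gone away, and returns the resulting health flags.
func demoHealth() []bool {
	up := statusServer(http.StatusOK)
	defer up.Close()
	loading := statusServer(http.StatusServiceUnavailable)
	defer loading.Close()
	gone := statusServer(http.StatusOK)
	goneURL := gone.URL
	gone.Close() // nothing is listening on this address any more

	workers := []*Worker{{BaseURL: up.URL}, {BaseURL: loading.URL}, {BaseURL: goneURL}}
	var states []bool
	for _, w := range workers {
		w.SetHealthy(w.Ping() == nil)
		states = append(states, w.IsHealthy())
	}
	return states
}

func main() {
	fmt.Println(demoHealth()) // [true false false]
}
```

Giving the probe its own short timeout keeps one hung worker from delaying the checks for all the others in the same tick.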

Load Balancing

I implemented round-robin with health filtering — simple but effective for a course project:

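// nextWorker returns the next healthy worker in round-robin order, or nil
// if no worker is registered or every worker is currently marked unhealthy.
func (h *Hub) nextWorker() *Worker {
    h.mu.Lock()
    defer h.mu.Unlock()

    // Scan at most one full lap, starting at the slot after the last pick.
    for i := 0; i < len(h.workers); i++ {
        idx := (h.idx + i) % len(h.workers)
        if h.workers[idx].IsHealthy() {
            h.idx = (idx + 1) % len(h.workers)
            return h.workers[idx]
        }
    }
    return nil // all workers down
}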
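
To see how the rotation behaves when a worker drops out, here is a self-contained run of the same selection logic. The Hub and Worker definitions are minimal stand-ins (the real structs are not shown in the post), and demoRoundRobin is a hypothetical driver.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Worker and Hub are pared-down stand-ins with only the fields nextWorker needs.
type Worker struct {
	Name    string
	healthy atomic.Bool
}

func (w *Worker) SetHealthy(ok bool) { w.healthy.Store(ok) }
func (w *Worker) IsHealthy() bool    { return w.healthy.Load() }

type Hub struct {
	mu      sync.Mutex
	idx     int
	workers []*Worker
}

// nextWorker is the round-robin selection from the post.
func (h *Hub) nextWorker() *Worker {
	h.mu.Lock()
	defer h.mu.Unlock()

	for i := 0; i < len(h.workers); i++ {
		idx := (h.idx + i) % len(h.workers)
		if h.workers[idx].IsHealthy() {
			h.idx = (idx + 1) % len(h.workers)
			return h.workers[idx]
		}
	}
	return nil // all workers down
}

// demoRoundRobin records which worker is chosen as health states change.
func demoRoundRobin() []string {
	a, b, c := &Worker{Name: "a"}, &Worker{Name: "b"}, &Worker{Name: "c"}
	for _, w := range []*Worker{a, b, c} {
		w.SetHealthy(true)
	}
	h := &Hub{workers: []*Worker{a, b, c}}

	pick := func() string {
		if w := h.nextWorker(); w != nil {
			return w.Name
		}
		return "none"
	}

	var order []string
	for i := 0; i < 4; i++ { // all healthy: plain rotation
		order = append(order, pick())
	}
	b.SetHealthy(false) // b fails a health check and is skipped from now on
	for i := 0; i < 3; i++ {
		order = append(order, pick())
	}
	a.SetHealthy(false)
	c.SetHealthy(false)
	order = append(order, pick()) // nobody left
	return order
}

func main() {
	fmt.Println(demoRoundRobin()) // [a b c a c a c none]
}
```

The nil return is the case callers must not forget: the hub needs to answer with an error such as 503 rather than dereference the result.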

What I Learned

Go's concurrency model is genuinely excellent for this use case. The goroutine-per-request model handled 50 concurrent inference requests in testing without any deadlocks — something I'd have spent much more time on in Java with thread pools.

llama.cpp's HTTP API is surprisingly clean. It exposes OpenAI-compatible endpoints, so the hub just proxies requests with minimal transformation.

Distributed systems failures are subtle. A worker that responds slowly (not down, just slow) is harder to handle than one that refuses connections. I added a configurable request timeout, but a proper circuit breaker would be the next step.
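
One common way to implement a per-request timeout like the one mentioned above is a context deadline on the outgoing call. This is a sketch under that assumption, not the hub's actual code: forward and demoTimeout are hypothetical names, and the timeout values are chosen only to make the demo quick.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// forward posts a chat request to one worker and gives up after timeout.
func forward(ctx context.Context, workerURL string, timeout time.Duration) (string, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		workerURL+"/v1/chat/completions", strings.NewReader(`{"messages":[]}`))
	if err != nil {
		return "", err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// demoTimeout calls a fast fake worker and a stalled one and reports
// whether the first succeeded and the second hit the deadline.
func demoTimeout() (fastOK, slowTimedOut bool) {
	fast := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	}))
	defer fast.Close()

	release := make(chan struct{})
	slow := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Accepts the connection but stalls, like an overloaded worker.
		select {
		case <-release:
		case <-time.After(10 * time.Second):
		}
		fmt.Fprint(w, "late")
	}))
	defer slow.Close()
	defer close(release) // runs first, so the stalled handler exits before Close

	body, err := forward(context.Background(), fast.URL, 2*time.Second)
	fastOK = err == nil && body == "ok"

	_, err = forward(context.Background(), slow.URL, 100*time.Millisecond)
	slowTimedOut = errors.Is(err, context.DeadlineExceeded)
	return fastOK, slowTimedOut
}

func main() {
	fmt.Println(demoTimeout()) // true true
}
```

A timeout bounds how long a single request can hang, but every new request still pays that wait until the worker recovers, which is the gap a circuit breaker closes.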

What's Next

GoLLama is open source — the code is on GitHub. Areas I'd extend given more time:

Weighted routing based on worker GPU VRAM
Circuit breaker for slow workers
Streaming support for token-by-token responses
Model registry so different workers can serve different models
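
For the circuit breaker item, a deliberately small sketch of the idea follows. It is one possible design, not something GoLLama implements: the Breaker type, its threshold and cooldown parameters, and the injectable clock are all hypothetical, and a production breaker would also limit how many trial requests pass once the cooldown ends.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// Breaker opens after `threshold` consecutive failures and stays open for
// `cooldown`, after which requests are allowed through again as a trial.
type Breaker struct {
	mu        sync.Mutex
	failures  int
	threshold int
	cooldown  time.Duration
	openedAt  time.Time
	now       func() time.Time // injectable clock so the demo needs no sleeping
}

func NewBreaker(threshold int, cooldown time.Duration) *Breaker {
	return &Breaker{threshold: threshold, cooldown: cooldown, now: time.Now}
}

// Allow reports whether a request may be sent to the worker right now.
func (b *Breaker) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.failures < b.threshold {
		return true // closed: traffic flows normally
	}
	// Open: only let traffic through once the cooldown has elapsed.
	return b.now().Sub(b.openedAt) >= b.cooldown
}

// Record feeds the outcome of a request back into the breaker.
func (b *Breaker) Record(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if err == nil {
		b.failures = 0 // any success closes the breaker
		return
	}
	b.failures++
	if b.failures >= b.threshold {
		b.openedAt = b.now() // open, or re-open after a failed trial
	}
}

// demoBreaker walks the breaker through closed, open, trial and closed again.
func demoBreaker() []bool {
	clock := time.Unix(0, 0)
	b := NewBreaker(2, 10*time.Second)
	b.now = func() time.Time { return clock }
	errSlow := errors.New("worker timed out")

	var allowed []bool
	allowed = append(allowed, b.Allow()) // closed

	b.Record(errSlow)
	b.Record(errSlow)
	allowed = append(allowed, b.Allow()) // open after two failures

	clock = clock.Add(11 * time.Second)
	allowed = append(allowed, b.Allow()) // cooldown over: trial allowed

	b.Record(errSlow)
	allowed = append(allowed, b.Allow()) // trial failed: open again

	clock = clock.Add(11 * time.Second)
	b.Record(nil)
	allowed = append(allowed, b.Allow()) // trial succeeded: closed
	return allowed
}

func main() {
	fmt.Println(demoBreaker()) // [true false true false true]
}
```

In the hub, one breaker per worker would sit next to the health flag: routing would skip a worker when either the last ping failed or its breaker is open.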

If you're building local LLM infrastructure, give it a look.