Running large language models locally is increasingly practical, but scaling inference across multiple machines is still a challenge most developers haven't tackled. When I built GoLLama as part of a distributed systems course project, I wanted to solve exactly that — route requests across multiple llama.cpp workers using a clean, provider-agnostic API.
The Problem
Most local LLM tooling assumes a single machine. You spin up llama.cpp, point your app at localhost:8080, and you're done. But what happens when:
- Your model is too large for a single GPU?
- You want redundancy across machines?
- You need to handle concurrent requests without queueing them on one worker?
Existing solutions either require cloud providers (OpenAI, Anthropic) or complex orchestration frameworks. I wanted something lightweight and self-contained.
The Hub-and-Spoke Architecture
GoLLama uses a simple hub-and-spoke pattern:
Client → Hub (GoLLama API) → Worker 1 (llama.cpp)
                           → Worker 2 (llama.cpp)
                           → Worker 3 (llama.cpp)
The hub exposes a single OpenAI-compatible /v1/chat/completions endpoint. Incoming requests are load-balanced across registered workers — each of which is a standard llama.cpp HTTP server.
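To make the forwarding step concrete, here is a minimal sketch that relays one request to one worker using the standard library's httputil.NewSingleHostReverseProxy. It is not taken from GoLLama: the newWorkerProxy and demo helpers are hypothetical, and the "worker" is an in-process stub that echoes the request path rather than a real llama.cpp server.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
	"strings"
)

// newWorkerProxy returns a handler that forwards every request to the
// worker at rawURL. Hypothetical helper, not taken from GoLLama.
func newWorkerProxy(rawURL string) (http.Handler, error) {
	target, err := url.Parse(rawURL)
	if err != nil {
		return nil, err
	}
	return httputil.NewSingleHostReverseProxy(target), nil
}

// demo starts a stub worker and a hub in-process, sends one chat request
// through the hub, and returns the body the client receives.
func demo() (string, error) {
	// Stub standing in for a llama.cpp server; the reply format is made up.
	worker := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, `{"path":%q}`, r.URL.Path)
	}))
	defer worker.Close()

	proxy, err := newWorkerProxy(worker.URL)
	if err != nil {
		return "", err
	}
	hub := httptest.NewServer(proxy)
	defer hub.Close()

	resp, err := http.Post(hub.URL+"/v1/chat/completions", "application/json",
		strings.NewReader(`{"messages":[]}`))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	body, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(body) // the stub reports the path it was asked for
}
```

A real hub would pick the target per request instead of fixing it up front, but the relay itself can stay this small because the workers already speak the same API as the hub.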
Why Go?
Go was an obvious choice for the hub:
- Goroutines make concurrent request handling trivial — each incoming request gets its own goroutine
- Standard library has excellent HTTP primitives; no framework needed
- Fast compilation speeds up iteration in a course setting
- Static binary means deploying the hub is a single file copy
JWT Authentication
Since the hub is meant to be self-hosted and potentially exposed over a network, I added JWT-based authentication. Every client request must include a bearer token:
    // authMiddleware rejects any request that lacks a valid bearer token.
    func authMiddleware(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            token := extractBearerToken(r)
            if !validateJWT(token) {
                http.Error(w, "Unauthorized", http.StatusUnauthorized)
                return
            }
            next.ServeHTTP(w, r) // token is valid: pass the request on
        })
    }

Token validation uses golang-jwt/jwt with a configurable secret loaded from environment variables.
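The middleware calls extractBearerToken without showing it. One plausible version, using only the standard library, is sketched below. The body is an assumption rather than GoLLama's actual helper, and tokenFor exists only to drive the demo.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// extractBearerToken returns the token from an "Authorization: Bearer <token>"
// header, or "" if the header is missing or uses another scheme.
// Assumed implementation; only the name comes from the post.
func extractBearerToken(r *http.Request) string {
	fields := strings.Fields(r.Header.Get("Authorization"))
	// The scheme name is matched case-insensitively.
	if len(fields) != 2 || !strings.EqualFold(fields[0], "Bearer") {
		return ""
	}
	return fields[1]
}

// tokenFor is a demo helper: it builds a request carrying the given
// Authorization header value and runs the extractor on it.
func tokenFor(header string) string {
	r, err := http.NewRequest(http.MethodPost, "http://hub.example/v1/chat/completions", nil)
	if err != nil {
		return ""
	}
	if header != "" {
		r.Header.Set("Authorization", header)
	}
	return extractBearerToken(r)
}

func main() {
	fmt.Printf("%q\n", tokenFor("Bearer abc.def.ghi")) // "abc.def.ghi"
	fmt.Printf("%q\n", tokenFor("Basic dXNlcjpwYXNz")) // ""
	fmt.Printf("%q\n", tokenFor(""))                   // ""
}
```

With this shape, validateJWT has to treat the empty string as invalid so that requests with no header or the wrong scheme still get a 401.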
Worker Health Checks
The hub maintains a live registry of workers. A background goroutine pings each worker every 30 seconds:
    // healthCheck runs in a background goroutine and refreshes every
    // worker's health flag once per tick.
    func (h *Hub) healthCheck() {
        ticker := time.NewTicker(30 * time.Second)
        for range ticker.C {
            for _, worker := range h.workers {
                if err := worker.Ping(); err != nil {
                    worker.SetHealthy(false)
                } else {
                    worker.SetHealthy(true)
                }
            }
        }
    }

Unhealthy workers are skipped during routing. This gives the system basic fault tolerance with minimal complexity.
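The loop relies on Ping, SetHealthy, and IsHealthy, which the post doesn't show. Here is one way they could look, as a sketch under stated assumptions: the Worker fields, the NewWorker constructor, and the probe helper are hypothetical, and the check assumes each worker answers GET /health with 200 OK when it is ready. The client timeout matters because the workers are pinged one after another, so a hung worker would otherwise stall the whole pass.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// Worker is a hypothetical shape for one registry entry.
type Worker struct {
	URL     string
	client  *http.Client
	healthy atomic.Bool // read by request handlers, written by the health loop
}

// NewWorker builds a worker whose pings give up after timeout.
func NewWorker(url string, timeout time.Duration) *Worker {
	return &Worker{URL: url, client: &http.Client{Timeout: timeout}}
}

// Ping returns an error unless the worker answers GET /health with 200 OK
// before the client timeout expires. The /health route is an assumption.
func (w *Worker) Ping() error {
	resp, err := w.client.Get(w.URL + "/health")
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("worker %s: status %d", w.URL, resp.StatusCode)
	}
	return nil
}

func (w *Worker) SetHealthy(ok bool) { w.healthy.Store(ok) }
func (w *Worker) IsHealthy() bool    { return w.healthy.Load() }

// probe runs one health check against a stub worker that waits delayMs
// before answering, with a ping timeout of timeoutMs, and returns the
// resulting health flag.
func probe(delayMs, timeoutMs int) bool {
	srv := httptest.NewServer(http.HandlerFunc(func(rw http.ResponseWriter, r *http.Request) {
		time.Sleep(time.Duration(delayMs) * time.Millisecond)
		rw.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	w := NewWorker(srv.URL, time.Duration(timeoutMs)*time.Millisecond)
	w.SetHealthy(w.Ping() == nil)
	return w.IsHealthy()
}

func main() {
	fmt.Println(probe(0, 1000)) // responsive worker
	fmt.Println(probe(300, 50)) // worker slower than the ping timeout
}
```

The atomic flag lets the routing path read health state without taking the hub's mutex while the background goroutine updates it.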
Load Balancing
I implemented round-robin with health filtering — simple but effective for a course project:
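    // nextWorker returns the next healthy worker in round-robin order,
    // or nil if every worker is down.
    func (h *Hub) nextWorker() *Worker {
        h.mu.Lock()
        defer h.mu.Unlock()
        // Scan at most one full lap, starting from the saved position.
        for i := 0; i < len(h.workers); i++ {
            idx := (h.idx + i) % len(h.workers)
            if h.workers[idx].IsHealthy() {
                h.idx = (idx + 1) % len(h.workers) // resume after this worker next time
                return h.workers[idx]
            }
        }
        return nil // all workers down
    }

What I Learned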
Go's concurrency model is genuinely excellent for this use case. The goroutine-per-request model handled 50 concurrent inference requests in testing without any deadlocks — something I'd have spent much more time on in Java with thread pools.
llama.cpp's HTTP API is surprisingly clean. It exposes OpenAI-compatible endpoints, so the hub just proxies requests with minimal transformation.
Distributed systems failures are subtle. A worker that responds slowly (not down, just slow) is harder to handle than one that refuses connections. I added a configurable request timeout, but a proper circuit breaker would be the next step.
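The post doesn't show how the request timeout is wired in. One standard-library option is http.TimeoutHandler, sketched here with a stub handler in place of the real proxying code. The withTimeout and statusFor names and the millisecond values are assumptions chosen for the demo.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// withTimeout caps how long the wrapped handler may take. When the limit
// passes, the client gets 503 Service Unavailable and the request context
// is cancelled so downstream work can stop.
func withTimeout(next http.Handler, d time.Duration) http.Handler {
	return http.TimeoutHandler(next, d, "worker timed out")
}

// statusFor sends one request through a handler that needs workMs to
// answer, under a limit of limitMs, and returns the status the client sees.
func statusFor(workMs, limitMs int) int {
	// Stub standing in for "forward the request to a worker".
	slow := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(time.Duration(workMs) * time.Millisecond):
			fmt.Fprint(w, "ok")
		case <-r.Context().Done(): // stop early once the deadline passes
		}
	})
	h := withTimeout(slow, time.Duration(limitMs)*time.Millisecond)

	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil)
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(0, 1000)) // 200: worker answered in time
	fmt.Println(statusFor(500, 50)) // 503: worker exceeded the limit
}
```

One caveat with this approach: http.TimeoutHandler buffers the response and does not support flushing, so it would not fit token-by-token streaming. A per-request context deadline on the outgoing call is the usual alternative there.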
What's Next
GoLLama is open source — the code is on GitHub. Areas I'd extend given more time:
- Weighted routing based on worker GPU VRAM
- Circuit breaker for slow workers
- Streaming support for token-by-token responses
- Model registry so different workers can serve different models
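For the circuit breaker item, a minimal sketch of the idea: stop sending traffic to a worker after a run of consecutive failures, then try it again once a cooldown has passed. Everything here (the Breaker type, its fields, the threshold of 2, the 10-second cooldown) is hypothetical and much simpler than a production breaker; in particular, it lets all requests through after the cooldown rather than a single trial request.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// Breaker is a minimal consecutive-failure circuit breaker for one worker.
type Breaker struct {
	mu        sync.Mutex
	failures  int           // consecutive failures seen so far
	threshold int           // failures needed to open the breaker
	cooldown  time.Duration // how long to stay open before retrying
	openedAt  time.Time     // when the breaker last opened
	now       func() time.Time // injectable clock, so the demo is deterministic
}

func NewBreaker(threshold int, cooldown time.Duration) *Breaker {
	return &Breaker{threshold: threshold, cooldown: cooldown, now: time.Now}
}

// Allow reports whether a request may be sent to the worker.
func (b *Breaker) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.failures < b.threshold {
		return true // closed: traffic flows normally
	}
	// Open: allow traffic again only after the cooldown has elapsed.
	return b.now().Sub(b.openedAt) >= b.cooldown
}

// Record feeds the outcome of a request back into the breaker.
func (b *Breaker) Record(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if err == nil {
		b.failures = 0 // any success closes the breaker
		return
	}
	b.failures++
	if b.failures >= b.threshold {
		b.openedAt = b.now() // open, or re-open after a failed retry
	}
}

// simulate walks a breaker through closed, open, retry, and closed again
// using a fake clock, recording what Allow returns at each step.
func simulate() []bool {
	clock := time.Unix(0, 0)
	b := NewBreaker(2, 10*time.Second)
	b.now = func() time.Time { return clock }
	timeout := errors.New("timeout")

	var out []bool
	out = append(out, b.Allow()) // closed
	b.Record(timeout)
	b.Record(timeout)
	out = append(out, b.Allow()) // open after two failures
	clock = clock.Add(11 * time.Second)
	out = append(out, b.Allow()) // cooldown elapsed, retry allowed
	b.Record(nil)
	out = append(out, b.Allow()) // closed again after a success
	return out
}

func main() {
	fmt.Println(simulate()) // [true false true true]
}
```

In the hub, a check like Allow could sit next to IsHealthy in the worker selection loop, with Record called after each proxied request, treating timeouts as failures.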
If you're building local LLM infrastructure, give it a look.