Horizontal Scaling Guide
This document describes how to run and operate the API in a horizontally scaled configuration using Redis pub/sub.
Overview
The API supports horizontal scaling across multiple instances for high availability and increased capacity. When enabled, WebSocket connections can be distributed across multiple servers while maintaining seamless communication between all clients.
Architecture
         ┌─────────────┐
         │Load Balancer│
         └──────┬──────┘
                │
   ┌────────┬───┴────┬────────┐
   │        │        │        │
┌──▼──┐  ┌──▼──┐  ┌──▼──┐  ┌──▼──┐
│API 1│  │API 2│  │API 3│  │API 4│
└──┬──┘  └──┬──┘  └──┬──┘  └──┬──┘
   │        │        │        │
   └────────┴───┬────┴────────┘
                │
         ┌──────▼──────┐
         │    Redis    │
         │  Pub/Sub +  │
         │  Presence   │
         └─────────────┘
Quick Start
Enable Distributed Mode
Set the environment variable:
ENABLE_DISTRIBUTED_CHAT=true
Optional: Set Instance ID
By default, each instance generates a unique ID. For better observability, set a custom ID:
INSTANCE_ID=api-pod-1
Configure Metrics Authentication
The /metrics endpoint is protected and requires authentication. You can access it via:
- Admin user session (logged in as admin)
- API key (recommended for monitoring tools)
To use an API key, generate and set it:
# Generate a secure API key (min 32 characters)
METRICS_API_KEY=$(openssl rand -base64 32)
# Or use a UUID
METRICS_API_KEY=$(uuidgen)
# Add to your environment
export METRICS_API_KEY=your-generated-key
Access metrics with the API key:
curl -H "Authorization: Bearer YOUR_METRICS_API_KEY" https://api.yourdomain.com/metrics
Example: Docker Compose
version: "3.8"
services:
  api-1:
    image: your-api:latest
    environment:
      - ENABLE_DISTRIBUTED_CHAT=true
      - INSTANCE_ID=api-1
      - REDIS_URL=redis://redis:6379
      - METRICS_API_KEY=${METRICS_API_KEY}
    ports:
      - "3001:3001"
    depends_on:
      - redis
  api-2:
    image: your-api:latest
    environment:
      - ENABLE_DISTRIBUTED_CHAT=true
      - INSTANCE_ID=api-2
      - REDIS_URL=redis://redis:6379
      - METRICS_API_KEY=${METRICS_API_KEY}
    ports:
      - "3002:3001"
    depends_on:
      - redis
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
Features
1. Cross-Instance WebSocket Broadcasting
All WebSocket messages are broadcast across all instances via Redis pub/sub:
- Chat messages
- Message deletions
- Message updates
- User disconnections
- Presence updates
2. Shared Presence Tracking
User presence (guest/member/admin counts) is tracked in Redis and automatically synchronized across instances.
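Conceptually, the shared presence data behaves like a Redis sorted set per role, keyed by user ID with the last-seen timestamp as the score, so stale entries can be pruned by age. A minimal in-memory sketch of that model (class and method names are illustrative, not the actual implementation):

```typescript
// Illustrative model of shared presence: a sorted-set-like map of
// userId -> lastSeen timestamp, pruned by age. In production this data
// lives in Redis, so every instance sees the same counts.
class PresenceTracker {
  private lastSeen = new Map<string, number>();

  // Roughly: ZADD chat:presence:<role> <now> <userId>
  touch(userId: string, now: number): void {
    this.lastSeen.set(userId, now);
  }

  // Roughly: ZREMRANGEBYSCORE ... -inf (now - maxAgeMs)
  prune(now: number, maxAgeMs: number): void {
    for (const [id, seen] of this.lastSeen) {
      if (now - seen > maxAgeMs) this.lastSeen.delete(id);
    }
  }

  // Roughly: ZCARD
  count(): number {
    return this.lastSeen.size;
  }
}
```

Each instance only touches entries for its own connections; because the real set lives in Redis, the count reflects users across the whole cluster.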
3. Graceful Shutdown
When an instance receives SIGTERM or SIGINT:
- Health checks return 503 (instance is removed from the load balancer)
- WebSocket clients are notified
- Connections are gracefully closed
- Presence data is cleaned up
- Instance is removed from cluster registry
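The steps above are order-sensitive: the instance must leave the load balancer before sockets close. A hedged sketch of how such a sequence might be driven (the step names and runner are illustrative, not the actual shutdown code):

```typescript
// Illustrative shutdown sequencing: each step is awaited in order, so the
// instance stops receiving new traffic before connections are torn down.
type Step = { name: string; run: () => Promise<void> };

async function gracefulShutdown(steps: Step[], log: string[]): Promise<void> {
  for (const step of steps) {
    await step.run();
    log.push(step.name);
  }
}

// Hypothetical step list mirroring the documented order.
const order = [
  "health-503",        // health checks start returning 503
  "notify-clients",    // WebSocket clients are told to reconnect
  "close-connections", // sockets closed gracefully
  "cleanup-presence",  // presence entries removed from Redis
  "deregister",        // instance removed from cluster registry
];
```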
4. Instance Heartbeat
Each instance sends a heartbeat to Redis every 10 seconds. Dead instances are automatically detected and cleaned up after 45 seconds of inactivity.
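The liveness rule is plain arithmetic on the last heartbeat timestamp; a minimal sketch using the 10-second/45-second values from this document (the function name is illustrative):

```typescript
// An instance is considered dead once its last heartbeat is older than the
// inactivity timeout. With a 10s interval and a 45s timeout, an instance can
// miss several beats before being evicted from the cluster registry.
const HEARTBEAT_INTERVAL_MS = 10_000; // how often each instance beats
const INSTANCE_TIMEOUT_MS = 45_000;   // inactivity before cleanup

function isInstanceAlive(lastHeartbeatMs: number, nowMs: number): boolean {
  return nowMs - lastHeartbeatMs <= INSTANCE_TIMEOUT_MS;
}
```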
5. Message Deduplication
Messages are deduplicated using a 5-second sliding window to prevent duplicate broadcasts during retries.
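A sliding-window dedup can be modeled as a map of recently seen message IDs, with the clock passed in for testability; a minimal sketch (names are illustrative, not the actual chat-pubsub code):

```typescript
// Sliding-window deduplication: a message ID seen within the last
// `windowMs` is treated as a duplicate and should be dropped.
class DedupWindow {
  private seen = new Map<string, number>(); // messageId -> first-seen time

  constructor(private windowMs = 5_000) {}

  isDuplicate(messageId: string, nowMs: number): boolean {
    // Evict IDs that have aged out of the window.
    for (const [id, at] of this.seen) {
      if (nowMs - at > this.windowMs) this.seen.delete(id);
    }
    if (this.seen.has(messageId)) return true;
    this.seen.set(messageId, nowMs);
    return false;
  }
}
```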
Monitoring
Metrics Endpoint
The /metrics endpoint provides detailed observability into your distributed cluster.
Authentication Required:
- Admin session (logged in as admin user), OR
- API key via Authorization: Bearer <key> header
Access metrics:
# With API key
curl -H "Authorization: Bearer YOUR_METRICS_API_KEY" https://api.yourdomain.com/metrics
# Or as logged-in admin user (with session cookie)
curl -b cookies.txt https://api.yourdomain.com/metrics
Response format:
{
  "timestamp": "2024-01-10T12:00:00.000Z",
  "instanceId": "api-1",
  "distributedMode": true,
  "pubsub": {
    "messagesPublished": 1523,
    "messagesReceived": 3046,
    "publishFailures": 0,
    "averageLatencyMs": 4.23,
    "maxLatencyMs": 45,
    "isDegraded": false,
    "uptime": 3600000
  },
  "chat": {
    "connectedClients": 123
  },
  "cluster": {
    "activeInstances": 3,
    "instances": [
      {
        "instanceId": "api-1",
        "startedAt": 1704888000000,
        "lastHeartbeat": 1704891600000,
        "connectedClients": 123
      },
      {
        "instanceId": "api-2",
        "startedAt": 1704888000000,
        "lastHeartbeat": 1704891600000,
        "connectedClients": 145
      },
      {
        "instanceId": "api-3",
        "startedAt": 1704888000000,
        "lastHeartbeat": 1704891600000,
        "connectedClients": 98
      }
    ]
  }
}
Key Metrics to Monitor
Pub/Sub Health:
- publishFailures: should be 0 or near 0
- averageLatencyMs: should be <50ms
- isDegraded: should be false
Cluster Health:
- activeInstances: should match expected replica count
- lastHeartbeat: should be <10 seconds old
Alerts to Configure:
- Pub/sub publish failures >10 in 1 minute
- Average latency >100ms for >5 minutes
- Degraded mode active
- Instance count mismatch
- Missing heartbeat >30 seconds
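These thresholds can be encoded as a pure check over a /metrics snapshot. A hedged sketch (field names follow the response format shown above; alert names are illustrative, and real alerting also needs the rate and duration windows that this instantaneous check omits):

```typescript
// Evaluate alert conditions from a subset of the /metrics response.
interface MetricsSnapshot {
  pubsub: { publishFailures: number; averageLatencyMs: number; isDegraded: boolean };
  cluster: { activeInstances: number; instances: { lastHeartbeat: number }[] };
}

function evaluateAlerts(m: MetricsSnapshot, expectedReplicas: number, nowMs: number): string[] {
  const alerts: string[] = [];
  if (m.pubsub.publishFailures > 10) alerts.push("publish-failures");
  if (m.pubsub.averageLatencyMs > 100) alerts.push("high-latency");
  if (m.pubsub.isDegraded) alerts.push("degraded-mode");
  if (m.cluster.activeInstances !== expectedReplicas) alerts.push("instance-count-mismatch");
  if (m.cluster.instances.some((i) => nowMs - i.lastHeartbeat > 30_000)) alerts.push("missing-heartbeat");
  return alerts;
}
```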
Redis Requirements
Connection Limits
Each instance requires 3 Redis connections:
- 1 for data operations
- 1 for pub/sub publishing
- 1 for pub/sub subscribing
Example: 10 instances = 30 Redis connections
Ensure your Redis maxclients setting accommodates this:
# In redis.conf
maxclients 10000
Memory Usage
Per instance:
- Presence data: ~100 bytes × unique users
- Instance heartbeat: ~200 bytes
- Pub/sub overhead: Minimal (messages not stored)
Example: 1000 concurrent users across 10 instances:
- Presence: ~100 KB
- Heartbeats: ~2 KB
- Total: <200 KB
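The arithmetic above can be wrapped in a small estimator (the ~100-byte and ~200-byte figures are the rough per-entry sizes from this document, not measured values):

```typescript
// Rough Redis footprint estimate: ~100 bytes per unique user for presence
// plus ~200 bytes per instance for heartbeats. Pub/sub adds no stored data.
function estimateRedisFootprintBytes(uniqueUsers: number, instances: number): number {
  const presenceBytes = uniqueUsers * 100;
  const heartbeatBytes = instances * 200;
  return presenceBytes + heartbeatBytes;
}
```

For the worked example of 1000 concurrent users across 10 instances, this gives about 102 KB, comfortably under the <200 KB figure above.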
Recommended Settings
# redis.conf
maxmemory 256mb
maxmemory-policy allkeys-lru
maxclients 10000
timeout 300
tcp-keepalive 60
Redis ACLs (Optional)
For security, create a dedicated user:
ACL SETUSER api-chat on >your-password \
~chat:* \
+@all \
-@dangerous
Then use:
REDIS_URL=redis://api-chat:your-password@redis:6379
Troubleshooting
Issue: Messages Not Broadcasting
Symptoms: Messages only visible to users on same instance
Checks:
- Verify ENABLE_DISTRIBUTED_CHAT=true on all instances
- Check Redis connectivity: redis-cli ping
- Review pub/sub metrics: GET /metrics (requires auth)
- Check for pub/sub failures in logs
Solution:
# Check Redis pub/sub
redis-cli
> PUBSUB CHANNELS chat:*
# Should show: chat:pubsub:global:broadcast, etc.
> PUBSUB NUMSUB chat:pubsub:global:broadcast
# Should show number of subscribers
Issue: Presence Counts Incorrect
Symptoms: User counts don't match reality
Checks:
- Check for dead instances: GET /metrics (requires auth) → cluster.instances
- Verify presence pruning is working
- Check Redis sorted sets:
redis-cli
> ZCARD chat:presence:global:members
> ZRANGE chat:presence:global:members 0 -1 WITHSCORES
Solution:
# Manual cleanup if needed
redis-cli
> DEL chat:presence:global:guests
> DEL chat:presence:global:members
> DEL chat:presence:global:admins
# Presence will rebuild automatically
Issue: High Pub/Sub Latency
Symptoms: averageLatencyMs >100ms
Checks:
- Redis network latency
- Redis CPU usage
- Redis memory usage
- Number of connected instances
Solution:
- Scale Redis (use cluster or increase resources)
- Increase the presence debounce interval (reduces Redis load)
- Check network between instances and Redis
Issue: Instance Not Appearing in Cluster
Symptoms: Instance missing from GET /metrics cluster list
Checks:
- Verify instance can reach Redis
- Check instance logs for heartbeat errors
- Verify instance ID is unique
Solution:
# Check Redis for instance keys
redis-cli
> KEYS chat:instances:*
> TTL chat:instances:api-1
# Should show ~30 seconds
Issue: Degraded Mode Active
Symptoms: pubsub.isDegraded: true
Cause: Redis connection lost
Solution:
- Check Redis availability
- Review Redis connection errors in logs
- Instance will auto-reconnect when Redis is available
- Monitor isDegraded: it should return to false
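Degraded mode boils down to a fallback decision on publish failure. A hedged sketch with the Redis publish injected so the behavior is testable (function and field names are illustrative, not the actual implementation):

```typescript
// When the Redis publish fails, deliver only to clients on this instance
// and flag degraded mode; normal operation publishes cluster-wide and the
// instance receives its own message back via the subscription.
async function broadcast(
  message: string,
  publishToRedis: (msg: string) => Promise<void>,
  deliverLocally: (msg: string) => void,
  state: { isDegraded: boolean },
): Promise<void> {
  try {
    await publishToRedis(message); // all subscribers (including us) deliver it
    state.isDegraded = false;
  } catch {
    state.isDegraded = true;       // Redis unreachable: local-only fallback
    deliverLocally(message);
  }
}
```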
Performance Tuning
Presence Update Debouncing
Presence updates are debounced to 500ms to reduce Redis load. To adjust:
// In src/lib/chat-manager.ts
private readonly presenceDebounceMs = 500; // Increase to reduce Redis load
Heartbeat Interval
Heartbeats are sent every 10 seconds. To adjust:
// In src/lib/instance-heartbeat.ts
private readonly heartbeatIntervalMs = 10_000; // Increase to reduce Redis load
Message Deduplication Window
Messages are deduplicated within a 5-second window. To adjust:
// In src/lib/chat-pubsub.ts (isDuplicateMessage method)
setTimeout(() => {
  this.recentMessageIds.delete(messageId);
}, 5000); // Adjust window size
Deployment Strategies
Blue-Green Deployment
- Deploy new version (blue) with ENABLE_DISTRIBUTED_CHAT=false
- Verify health checks pass
- Enable distributed mode on blue instances
- Shift traffic from green to blue
- Shut down green instances gracefully
Canary Deployment
- Deploy 1 instance with new version
- Monitor metrics for issues
- Gradually increase replica count
- Rollback if issues detected
Rolling Update
# Kubernetes deployment strategy
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
Ensures zero downtime with graceful shutdown.
Rollback Procedure
Emergency Rollback
If distributed mode causes issues:
# Option 1: Disable distributed mode
kubectl set env deployment/api ENABLE_DISTRIBUTED_CHAT=false
kubectl rollout restart deployment/api
# Option 2: Scale to single instance
kubectl scale deployment/api --replicas=1
Planned Rollback
- Set ENABLE_DISTRIBUTED_CHAT=false
- Perform rolling restart
- Verify single-instance mode working
- Scale down to 1 replica if needed
Redis Failover
Redis Master Failover
The API automatically reconnects to Redis when connection is lost:
- Pub/sub enters degraded mode (local broadcasts only)
- Presence tracking pauses
- Chat messages are still saved (once Redis comes back)
- Auto-recovers when Redis is available
Redis Cluster Mode
To use Redis Cluster:
// In src/lib/redis.ts
export const createRedisClient = (name = "Redis"): Redis => {
  return new Redis.Cluster([
    { host: 'redis-1', port: 6379 },
    { host: 'redis-2', port: 6379 },
    { host: 'redis-3', port: 6379 },
  ], {
    // ... existing config
  });
};
Testing Horizontal Scaling
Local Testing
# Terminal 1
ENABLE_DISTRIBUTED_CHAT=true PORT=3001 INSTANCE_ID=api-1 bun run dev
# Terminal 2
ENABLE_DISTRIBUTED_CHAT=true PORT=3002 INSTANCE_ID=api-2 bun run dev
# Terminal 3
ENABLE_DISTRIBUTED_CHAT=true PORT=3003 INSTANCE_ID=api-3 bun run dev
Connect WebSocket clients to different ports and verify messages broadcast across all.
Load Testing
# artillery-config.yml
config:
  target: 'http://localhost:3001'
  phases:
    - duration: 60
      arrivalRate: 10
  ws:
    url: 'ws://localhost:3001/chat'
scenarios:
  - engine: ws
    flow:
      - send:
          message: '{"type":"ping"}'
      - think: 5
Run across multiple instances and verify message delivery.
Security Considerations
Metrics Endpoint Security
The /metrics endpoint is protected by authentication to prevent unauthorized access to sensitive infrastructure information.
Why metrics are protected:
- Exposes instance IDs and cluster topology
- Shows connected client counts and activity patterns
- Reveals pub/sub performance metrics and system health
- Could be used for reconnaissance by attackers
Two authentication methods:
Admin Session (for manual checks)
- User must be authenticated and have admin role
- Good for ad-hoc debugging and monitoring
API Key (for monitoring tools)
- Set METRICS_API_KEY environment variable (min 32 chars)
- Use Authorization: Bearer <key> header
- Recommended for Prometheus, Grafana, Datadog, etc.
- Rotate the key regularly
Generate secure API key:
# Option 1: OpenSSL (recommended)
METRICS_API_KEY=$(openssl rand -base64 32)
# Option 2: UUID
METRICS_API_KEY=$(uuidgen)
Best practices:
- Always set METRICS_API_KEY in production
- Store the key in secrets management (not in git)
- Rotate the key every 90 days
- Use HTTPS to protect the key in transit
- Monitor failed authentication attempts
Redis Network Isolation
- Use private network for Redis
- Don't expose Redis port publicly
- Use TLS for Redis connections in production
Instance Authentication
Instances authenticate via Redis connection string. Ensure:
- Use strong Redis password
- Rotate credentials regularly
- Use Redis ACLs to limit permissions
Pub/Sub Message Validation
All pub/sub messages are validated:
- Structure validation
- Instance ID sanitization
- Type checking
Malformed messages are logged and dropped.
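A hedged sketch of the kind of validation described above (the envelope shape, field names, and the instance-ID pattern are assumptions for illustration, not the actual wire format):

```typescript
// Validate an incoming pub/sub envelope: structure checks, type checks,
// and instance-ID sanitization. Invalid messages return null so the caller
// can log and drop them.
interface PubSubEnvelope {
  type: string;
  instanceId: string;
  payload: unknown;
}

// Hypothetical sanitization rule: short alphanumeric IDs with - and _ only.
const INSTANCE_ID_RE = /^[a-zA-Z0-9_-]{1,64}$/;

function parseEnvelope(raw: string): PubSubEnvelope | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not JSON: drop
  }
  if (typeof data !== "object" || data === null) return null;
  const { type, instanceId, payload } = data as Record<string, unknown>;
  if (typeof type !== "string" || typeof instanceId !== "string") return null;
  if (!INSTANCE_ID_RE.test(instanceId)) return null; // reject suspicious IDs
  return { type, instanceId, payload };
}
```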
Summary
With distributed mode enabled, your API can:
- ✅ Scale horizontally across multiple instances
- ✅ Handle WebSocket connections on any instance
- ✅ Broadcast messages to all connected clients
- ✅ Track presence across the cluster
- ✅ Gracefully shutdown without dropping connections
- ✅ Auto-recover from Redis connection issues
- ✅ Detect and clean up dead instances
- ✅ Secure monitoring via authenticated /metrics endpoint
Important: Set METRICS_API_KEY in production to secure the /metrics endpoint and enable monitoring tools (Prometheus, Grafana, etc.) to access cluster health data.