# Fix Memcached Failover Issues
Your Memcached cluster is experiencing failover issues. When a Memcached server goes down, your application either crashes, returns errors, or experiences degraded performance. You need to configure proper failover handling.
Memcached doesn't have built-in clustering or automatic failover. Failover behavior depends entirely on your client library configuration.
Introduction
Understanding Memcached architecture is important:
- No master/slave: all nodes are equal
- No replication: data exists on one node only
- No automatic failover: the client handles node failures
- Data loss on node failure: the cache must be repopulated
Symptoms
Memcached failover issues present with:
- "Connection refused" errors when servers fail
- Application crashes on cache server failure
- Stale connections to dead servers
- Cache misses during failover
- Performance degradation during server failures
- Timeout errors reaching cache servers
- Inconsistent cache behavior across nodes
The commands under Step-by-Step Fix help confirm which servers are down and whether the client can still reach the rest.
Common Causes
- Client or server misconfiguration
- Missing or incorrect credentials
- Network connectivity issues
- Version compatibility problems
- Resource exhaustion or limits
- Permission or access denied
Step-by-Step Fix
Check Memcached server status:
```bash
# Check if Memcached is running
systemctl status memcached

# Check multiple servers
for server in memcached1 memcached2 memcached3; do
  echo "=== $server ==="
  ssh "$server" "systemctl status memcached"
done

# Check Memcached stats
echo "stats" | nc localhost 11211 | head -20

# Check server connectivity
nc -zv memcached1 11211
nc -zv memcached2 11211
nc -zv memcached3 11211
```
Check from application:
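```python
import memcache

# Connect to cluster
mc = memcache.Client(['memcached1:11211', 'memcached2:11211', 'memcached3:11211'])

# Test connectivity per server; connect() returns 1 on success and 0 on failure
# (a single mc.get() would only exercise the one server the key hashes to)
for server in mc.servers:
    status = 'OK' if server.connect() else 'DOWN'
    print(f"Server {server.address}: {status}")
```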
Common Issues and Solutions
Issue 1: Application Crashes on Server Failure
Error: Connection refused, server unavailable
Cause: Client not configured for failover.
Solution: Configure client with failover settings:
```python
# Python - python-memcached
import memcache

mc = memcache.Client(
    ['memcached1:11211', 'memcached2:11211', 'memcached3:11211'],
    dead_retry=30,      # Retry dead servers after 30 seconds
    socket_timeout=5,   # Socket timeout in seconds
    debug=0
)
# python-memcached takes no failover flag: keys that hash to a server
# marked dead are rehashed to another server automatically.

# Always handle exceptions
try:
    value = mc.get('my_key')
except Exception:
    value = None  # Fallback to database
```
Issue 2: Stale Connections to Dead Servers
Symptom: The client keeps trying a dead server.
Cause: The client doesn't mark the server as dead.
Solution: Configure retry and timeout:
```python
# Python - pymemcache
from pymemcache.client.base import Client
from pymemcache.client.hash import HashClient

# Single client with timeout
client = Client(
    ('memcached1', 11211),
    timeout=5,
    connect_timeout=5,
    ignore_exc=True  # Treat errors on reads as cache misses (return None)
)

# Hash client for multiple servers
hash_client = HashClient(
    [('memcached1', 11211), ('memcached2', 11211), ('memcached3', 11211)],
    timeout=5,
    connect_timeout=5,
    ignore_exc=True,
    retry_attempts=2,  # Failed attempts before a server is marked dead
    retry_timeout=1,   # Seconds between those attempts
    dead_timeout=30    # Retry dead server after 30s
)
```
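To make the retry settings concrete, here is a minimal, library-independent sketch of the bookkeeping a client needs in order to stop hammering a dead server. `DeadServerTracker`, `max_failures`, and the injectable `clock` are hypothetical names chosen for illustration; this is not how pymemcache or any other client is implemented internally.

```python
import time


class DeadServerTracker:
    """Track consecutive failures per server and skip servers marked dead."""

    def __init__(self, max_failures=2, dead_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.dead_timeout = dead_timeout
        self.clock = clock        # injectable so tests can fake time
        self._failures = {}       # server -> consecutive failure count
        self._dead_until = {}     # server -> time when a retry is allowed

    def record_failure(self, server):
        count = self._failures.get(server, 0) + 1
        self._failures[server] = count
        if count >= self.max_failures:
            # Too many failures in a row: stop using the server for a while
            self._dead_until[server] = self.clock() + self.dead_timeout

    def record_success(self, server):
        # Any success clears the failure history
        self._failures.pop(server, None)
        self._dead_until.pop(server, None)

    def is_available(self, server):
        retry_at = self._dead_until.get(server)
        if retry_at is None:
            return True
        # Once the dead timeout has passed, the server may be probed again
        return self.clock() >= retry_at
```

A caller would check `is_available()` before sending a request, then report the outcome with `record_success()` or `record_failure()`. Without this kind of state, every request pays the full connect timeout against the dead server.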
Issue 3: Data Loss on Failover
Symptom: A key that existed on the failed server is now missing.
Cause: Memcached doesn't replicate data.
Solution: Implement fallback logic:
```python
def get_with_fallback(key, db_query_func):
    """Get from cache, fallback to database on failure."""
    try:
        value = mc.get(key)
    except Exception:
        # Cache unavailable, get from database
        return db_query_func()
    if value is None:
        # Cache miss or server failure
        value = db_query_func()
        try:
            # python-memcached names the expiry argument `time`
            mc.set(key, value, time=3600)
        except Exception:
            pass  # Cache write failed; the value is still returned
    return value

# Usage
user = get_with_fallback(
    f'user:{user_id}',
    lambda: db.query_user(user_id)
)
```
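The fallback path is hard to exercise against live servers, so one option is a variant that takes the cache client as a parameter and can be checked with stubs. The sketch below assumes only that the client exposes `get(key)` and `set(key, value, time)` in the style of python-memcached; `read_through`, `BrokenCache`, and `DictCache` are hypothetical names for illustration.

```python
class BrokenCache:
    """Stub client that behaves like a cluster with every server down."""

    def get(self, key):
        raise ConnectionRefusedError("memcached unavailable")

    def set(self, key, value, time=0):
        raise ConnectionRefusedError("memcached unavailable")


class DictCache:
    """Stub client backed by a dict, standing in for a healthy server."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value, time=0):
        self.data[key] = value
        return True


def read_through(cache, key, load, ttl=3600):
    """Return the cached value, calling load() when the cache misses or fails."""
    try:
        value = cache.get(key)
    except Exception:
        value = None              # treat a cache failure as a miss
    if value is not None:
        return value
    value = load()                # exactly one call to the source of truth
    try:
        cache.set(key, value, time=ttl)
    except Exception:
        pass                      # the cache write is best effort
    return value


# With every server down, the loader still supplies the value
print(read_through(BrokenCache(), "user:7", lambda: {"id": 7}))
```

Passing the client in also makes it straightforward to assert that the loader runs once per miss and not at all on a hit.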
Issue 4: Hash Distribution Changes
When a server fails, keys redistribute to remaining servers:
For example, key 'user:1' was on memcached1. With memcached1 down, the key now maps to memcached2, but memcached2 doesn't have the data.
Solution: Use consistent hashing so that only the keys owned by the failed server are remapped:
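```python
# Python - pymemcache with consistent hashing
from pymemcache.client.hash import HashClient
from pymemcache.client.rendezvous import RendezvousHash

client = HashClient(
    [('memcached1', 11211), ('memcached2', 11211), ('memcached3', 11211)],
    hasher=RendezvousHash,  # HashClient's default; it has no use_consistent_hashing flag
    dead_timeout=30         # Retry a dead server after 30 seconds
)
```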
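To see why the hashing scheme matters, the following sketch compares naive modulo sharding with rendezvous hashing when memcached1 drops out. It uses only the standard library; `modulo_node`, `rendezvous_node`, and `moved_keys` are hypothetical helpers written for this illustration and do not reproduce any client library's exact algorithm.

```python
import hashlib


def _score(text):
    """Stable 64-bit hash (Python's built-in hash() is randomized per process)."""
    return int.from_bytes(hashlib.md5(text.encode()).digest()[:8], "big")


def modulo_node(key, nodes):
    """Naive sharding: the chosen index depends on how many nodes exist."""
    return nodes[_score(key) % len(nodes)]


def rendezvous_node(key, nodes):
    """Rendezvous hashing: pick the node with the highest per-key score."""
    return max(nodes, key=lambda node: _score(f"{node}|{key}"))


def moved_keys(pick, keys, before, after):
    """Count keys whose owner changes when the node list changes."""
    return sum(1 for k in keys if pick(k, before) != pick(k, after))


nodes = ["memcached1:11211", "memcached2:11211", "memcached3:11211"]
survivors = nodes[1:]                      # memcached1 has failed
keys = [f"user:{i}" for i in range(3000)]

lost = sum(1 for k in keys if rendezvous_node(k, nodes) == nodes[0])
print("owned by failed node:", lost)
print("moved (modulo):      ", moved_keys(modulo_node, keys, nodes, survivors))
print("moved (rendezvous):  ", moved_keys(rendezvous_node, keys, nodes, survivors))
```

With rendezvous hashing, the only keys that move are the ones the failed node owned, because every other key's highest-scoring node is still present. Modulo sharding also reshuffles keys between the surviving nodes, so healthy servers lose part of their hit rate too.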
Issue 5: Java Client Configuration
```java
// Java - spymemcached
import net.spy.memcached.AddrUtil;
import net.spy.memcached.ConnectionFactoryBuilder;
import net.spy.memcached.FailureMode;
import net.spy.memcached.MemcachedClient;

// The constructor throws IOException; call it from code that handles or declares it
MemcachedClient client = new MemcachedClient(
    new ConnectionFactoryBuilder()
        .setFailureMode(FailureMode.Redistribute)  // Redistribute on failure
        .setOpTimeout(5000)                        // Operation timeout 5s
        .setTimeoutExceptionThreshold(10)          // Mark dead after 10 failures
        .build(),
    AddrUtil.getAddresses("memcached1:11211 memcached2:11211 memcached3:11211")
);
```
Issue 6: PHP Client Configuration
```php
// PHP - Memcached extension
$m = new Memcached();
$m->setOption(Memcached::OPT_DISTRIBUTION, Memcached::DISTRIBUTION_CONSISTENT);
$m->setOption(Memcached::OPT_REMOVE_FAILED_SERVERS, true);
$m->setOption(Memcached::OPT_RETRY_TIMEOUT, 30);
$m->setOption(Memcached::OPT_CONNECT_TIMEOUT, 5000);
$m->setOption(Memcached::OPT_SERVER_FAILURE_LIMIT, 2);

$m->addServers([
    ['memcached1', 11211, 33],
    ['memcached2', 11211, 33],
    ['memcached3', 11211, 33]
]);
```
Issue 7: Node.js Client Configuration
```javascript
// Node.js - memcached
const Memcached = require('memcached');

const memcached = new Memcached({
  // server: weight (the weight is a plain number)
  'memcached1:11211': 1,
  'memcached2:11211': 1,
  'memcached3:11211': 1
}, {
  retries: 2,
  timeout: 5000,
  remove: true,  // Remove failed servers
  failOverServers: ['memcached-backup:11211'],
  retry: 30000
});
```
Issue 8: Connection Pool Exhaustion
Symptom: Too many connections to the remaining servers.
Solution: Configure connection limits:
```python
# Use connection pooling
from pymemcache.client.base import PooledClient
from pymemcache.client.hash import HashClient

# Pooled client
pool = PooledClient(
    ('memcached1', 11211),
    max_pool_size=10,
    timeout=5
)

# Or HashClient with pool per server
client = HashClient(
    [('memcached1', 11211), ('memcached2', 11211)],
    use_pooling=True,   # One connection pool per server
    max_pool_size=10,
    timeout=5,
    connect_timeout=5
)
```
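Pool sizes add up across application processes, so it can help to check the worst case against the server's connection limit (the `-c` option of `memcached`). The arithmetic below is a rough sketch; the function names, the 20% headroom, and the deployment numbers are assumptions made for illustration.

```python
def connections_per_server(app_processes, max_pool_size):
    """Worst-case connections one Memcached server sees from pooled clients.

    Each process keeps its own pool per server, so the pools add up.
    """
    return app_processes * max_pool_size


def max_safe_pool_size(app_processes, server_maxconns, headroom=0.2):
    """Largest per-process pool size that leaves `headroom` of maxconns free."""
    usable = int(server_maxconns * (1 - headroom))
    return max(1, usable // app_processes)


# Hypothetical deployment: 150 worker processes, memcached started with -c 1024
print(connections_per_server(150, 10))   # 1500, above the 1024 limit
print(max_safe_pool_size(150, 1024))     # 5
```

In this hypothetical case a pool size of 10 could exceed the limit under load, while 5 keeps the worst case at 750 connections.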
Monitoring and Health Checks
Server Health Check Script
```python
import socket
import time

def check_memcached(host, port=11211, timeout=5):
    """Check if Memcached server is healthy."""
    try:
        # The timeout applies to the connect and to the reads that follow
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b'stats\r\n')
            response = sock.recv(1024)
        # A healthy server answers the stats command with STAT lines
        return response.startswith(b'STAT')
    except OSError:
        # Covers refused connections, timeouts, and DNS failures
        return False

def monitor_servers(servers):
    """Monitor Memcached servers."""
    while True:
        for host, port in servers:
            status = check_memcached(host, port)
            print(f"{host}:{port} - {'OK' if status else 'DOWN'}")
        time.sleep(60)

servers = [('memcached1', 11211), ('memcached2', 11211), ('memcached3', 11211)]
monitor_servers(servers)
```
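A reachable server is not necessarily a useful one: after a failover, the hit ratio usually drops while the cache repopulates. The sketch below parses a raw reply to the `stats` command and derives the hit ratio from the `get_hits` and `get_misses` counters; `parse_stats` and `hit_ratio` are hypothetical helper names, and the sample reply is made up.

```python
def parse_stats(raw):
    """Parse the text reply to the `stats` command into a dict of strings."""
    stats = {}
    for line in raw.decode("ascii", errors="replace").splitlines():
        parts = line.split(" ", 2)
        # Data lines look like "STAT <name> <value>"; the reply ends with "END"
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def hit_ratio(stats):
    """Fraction of get requests served from cache, or None with no traffic."""
    hits = int(stats.get("get_hits", 0))
    misses = int(stats.get("get_misses", 0))
    total = hits + misses
    return hits / total if total else None


sample = b"STAT pid 1234\r\nSTAT get_hits 900\r\nSTAT get_misses 100\r\nEND\r\n"
print(hit_ratio(parse_stats(sample)))   # 0.9
```

A real reply is longer than a single 1024-byte read, so a collector that wants every counter should keep reading until it sees the `END` line.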
Prometheus Metrics
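```python
import time

from prometheus_client import Gauge, start_http_server
import memcache

mc = memcache.Client(['memcached1:11211', 'memcached2:11211', 'memcached3:11211'])

# Metrics
cache_hits = Gauge('memcached_hits', 'Cache hits')
cache_misses = Gauge('memcached_misses', 'Cache misses')
server_status = Gauge('memcached_server_status', 'Server status', ['server'])

def collect_metrics():
    for server in mc.servers:
        alive = bool(server.connect())  # connect() returns 1 or 0, never None
        server_status.labels(server=str(server.address)).set(1 if alive else 0)
    # Sum the hit/miss counters over the servers that answered "stats"
    stats = [s for _, s in mc.get_stats()]
    cache_hits.set(sum(int(s.get('get_hits', 0)) for s in stats))
    cache_misses.set(sum(int(s.get('get_misses', 0)) for s in stats))

start_http_server(8000)
while True:
    collect_metrics()
    time.sleep(15)
```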
High Availability Setup
Multiple Memcached Instances
# Run multiple Memcached instances per server
memcached -d -p 11211 -m 512 -c 1024
memcached -d -p 11212 -m 512 -c 1024
memcached -d -p 11213 -m 512 -c 1024
Client configuration:
mc = memcache.Client([
'server1:11211', 'server1:11212', 'server1:11213',
'server2:11211', 'server2:11212', 'server2:11213',
])
Twemproxy (Nutcracker)
Twemproxy provides a proxy layer that automatically ejects failed servers and retries them later:
# nutcracker.yml
alpha:
  listen: 127.0.0.1:11211
  hash: crc32a
  distribution: ketama
  auto_eject_hosts: true
  timeout: 500
  server_retry_timeout: 30000
  server_failure_limit: 2
  servers:
   - memcached1:11211:1
   - memcached2:11211:1
   - memcached3:11211:1
Run Twemproxy:
nutcracker -c nutcracker.yml -d
Client connects to Twemproxy:
mc = memcache.Client(['127.0.0.1:11211'])
Mcrouter
Mcrouter provides more advanced failover:
{
  "pools": {
    "memcached": {
      "servers": [
        "memcached1:11211",
        "memcached2:11211",
        "memcached3:11211"
      ]
    }
  },
  "routes": [
    {
      "route": "PoolRoute|memcached",
      "failover": {
        "failover_policy": "FailoverToNextPool",
        "retry_policy": {
          "tries": 3,
          "retry_delay_ms": 100
        }
      }
    }
  ]
}
Verification
```bash
# Test failover manually
# Stop one server
systemctl stop memcached

# Test from client
python test_memcached.py

# Check logs
tail -f /var/log/memcached.log

# Restart server
systemctl start memcached

# Verify client reconnects
python test_memcached.py
```
Test script:
```python
import memcache
import time

mc = memcache.Client(
    ['memcached1:11211', 'memcached2:11211', 'memcached3:11211'],
    dead_retry=30
)

# Set test key
mc.set('test_key', 'test_value')

# Get test key repeatedly
for i in range(100):
    try:
        value = mc.get('test_key')
        print(f"Attempt {i}: {value}")
    except Exception as e:
        print(f"Attempt {i}: Error - {e}")
    time.sleep(1)
```
Prevention
- [ ] Multiple Memcached servers configured
- [ ] Client configured with failover settings
- [ ] Timeout and retry settings appropriate
- [ ] Fallback logic implemented in application
- [ ] Consistent hashing enabled
- [ ] Connection pooling configured
- [ ] Health monitoring in place
- [ ] Twemproxy or Mcrouter for HA (optional)
- [ ] Failover tested manually
- [ ] Failover behavior documented