# Fix Memcached Failover Issues

Your Memcached cluster is experiencing failover issues. When a Memcached server goes down, your application either crashes, returns errors, or experiences degraded performance. You need to configure proper failover handling.

Memcached doesn't have built-in clustering or automatic failover. Failover behavior depends entirely on your client library configuration.

Introduction


Understanding Memcached's architecture is important:

  • No master/slave: all nodes are equal
  • No replication: data exists on one node only
  • No automatic failover: the client handles node failures
  • Data loss on node failure: the cache must be repopulated
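To make the "client handles node failures" point concrete, here is a minimal sketch of how a client might map each key to exactly one node. The `pick_node` and `keys_lost` helpers and the CRC32-modulo scheme are illustrative assumptions, not the internals of any particular client library.

```python
import zlib

NODES = ["memcached1:11211", "memcached2:11211", "memcached3:11211"]

def pick_node(key, nodes):
    """Map a key to exactly one node (hypothetical modulo scheme)."""
    return nodes[zlib.crc32(key.encode("utf-8")) % len(nodes)]

def keys_lost(keys, nodes, failed):
    """Return the keys whose only copy lived on the failed node."""
    return [k for k in keys if pick_node(k, nodes) == failed]

keys = [f"user:{i}" for i in range(1000)]
lost = keys_lost(keys, NODES, "memcached2:11211")
print(f"{len(lost)} of {len(keys)} keys lived only on memcached2")
```

Because no other node holds a copy, every key in `lost` becomes a cache miss until the application repopulates it.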

Symptoms

Memcached failover issues present with:

  • "Connection refused" errors when servers fail
  • Application crashes on cache server failure
  • Stale connections to dead servers
  • Cache misses during failover
  • Performance degradation during server failures
  • Timeout errors reaching cache servers
  • Inconsistent cache behavior across nodes

The diagnostic commands under Step-by-Step Fix below help confirm which of these symptoms you are seeing.

Common Causes

  • Client or server misconfiguration
  • Missing or incorrect credentials
  • Network connectivity issues
  • Version compatibility problems
  • Resource exhaustion or limits
  • Permission or access denied

Step-by-Step Fix

Check Memcached server status:

```bash
# Check if Memcached is running
systemctl status memcached

# Check multiple servers
for server in memcached1 memcached2 memcached3; do
    echo "=== $server ==="
    ssh "$server" "systemctl status memcached"
done

# Check Memcached stats
echo "stats" | nc localhost 11211 | head -20

# Check server connectivity
nc -zv memcached1 11211
nc -zv memcached2 11211
nc -zv memcached3 11211
```

Check from application:

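```python
import memcache

# Connect to cluster
mc = memcache.Client(['memcached1:11211', 'memcached2:11211', 'memcached3:11211'])

# Test connectivity to each server individually.
# A single mc.get() only contacts the one server the key hashes to.
for server in mc.servers:
    status = "reachable" if server.connect() else "unreachable"
    print(f"Server {server.address}: {status}")
```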

Common Issues and Solutions

Issue 1: Application Crashes on Server Failure

```python
# Error
# Connection refused, server unavailable
```

Cause: Client not configured for failover.

Solution: Configure client with failover settings:

```python
# Python - python-memcached
import memcache

mc = memcache.Client(
    ['memcached1:11211', 'memcached2:11211', 'memcached3:11211'],
    dead_retry=30,      # Retry dead servers after 30 seconds
    socket_timeout=5,   # Socket timeout in seconds
    debug=0
)
# python-memcached has no failover flag: once a server is marked dead,
# its keys are rehashed to the remaining servers automatically.

# Always handle exceptions
try:
    value = mc.get('my_key')
except Exception:
    value = None  # Fall back to the database
```
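The "always handle exceptions" advice can be centralized instead of repeated at every call site. Below is a minimal sketch of one way to do that. `SafeCache` and `BrokenClient` are hypothetical names introduced for illustration; the wrapper assumes only that the wrapped client exposes `get` and `set`.

```python
import logging

logger = logging.getLogger("cache")

class SafeCache:
    """Wrap a cache client so that failures look like cache misses.

    `client` is any object with get/set methods (hypothetical wrapper).
    """

    def __init__(self, client):
        self._client = client

    def get(self, key, default=None):
        try:
            value = self._client.get(key)
        except Exception as exc:  # cache errors must not reach the caller
            logger.warning("cache get failed for %r: %s", key, exc)
            return default
        return default if value is None else value

    def set(self, key, value, *args, **kwargs):
        try:
            return bool(self._client.set(key, value, *args, **kwargs))
        except Exception as exc:
            logger.warning("cache set failed for %r: %s", key, exc)
            return False


class BrokenClient:
    """Stand-in for a client whose server is down (demonstration only)."""

    def get(self, key):
        raise ConnectionRefusedError("server unavailable")

    def set(self, key, value, *args, **kwargs):
        raise ConnectionRefusedError("server unavailable")


cache = SafeCache(BrokenClient())
print(cache.get("my_key"))       # None
print(cache.set("my_key", "v"))  # False
```

With a wrapper like this, a dead cache server degrades to cache misses rather than application errors, and the warnings in the log show how often that happens.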

Issue 2: Stale Connections to Dead Servers

```python
# Client keeps trying dead server
```

Cause: Client doesn't mark server as dead.

Solution: Configure retry and timeout:

```python
# Python - pymemcache
from pymemcache.client.base import Client
from pymemcache.client.hash import HashClient

# Single client with timeout
client = Client(
    ('memcached1', 11211),
    timeout=5,
    connect_timeout=5,
    ignore_exc=True  # Ignore exceptions, return None
)

# Hash client for multiple servers
hash_client = HashClient(
    [('memcached1', 11211), ('memcached2', 11211), ('memcached3', 11211)],
    timeout=5,
    connect_timeout=5,
    ignore_exc=True,
    retry_attempts=2,  # Failed attempts before a server is marked dead
    retry_timeout=1,   # Seconds between those retry attempts
    dead_timeout=30    # Try to add a dead server back after 30s
)
```
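The retry and dead-server settings above follow a common pattern: count failures, skip the server for a while, then try it again. The sketch below illustrates that bookkeeping in isolation. `ServerState` and its parameter names are assumptions made for this example, not part of pymemcache or python-memcached.

```python
import time

class ServerState:
    """Track whether one server should be skipped (illustrative only)."""

    def __init__(self, failure_limit=2, dead_timeout=30.0, clock=time.monotonic):
        self.failure_limit = failure_limit
        self.dead_timeout = dead_timeout
        self._clock = clock
        self._failures = 0
        self._dead_until = 0.0

    def is_available(self):
        # A dead server becomes eligible again once the timeout has passed.
        return self._clock() >= self._dead_until

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_limit:
            self._dead_until = self._clock() + self.dead_timeout
            self._failures = 0

    def record_success(self):
        self._failures = 0
        self._dead_until = 0.0


# Demonstration with a fake clock so the timing is explicit
now = [0.0]
state = ServerState(failure_limit=2, dead_timeout=30.0, clock=lambda: now[0])
state.record_failure()
state.record_failure()
print(state.is_available())  # False
now[0] = 31.0
print(state.is_available())  # True
```

A client that never records failures this way keeps sending requests to a dead server and pays the full timeout on each one.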

Issue 3: Data Loss on Failover

```python
# Key exists on failed server, now missing
```

Cause: Memcached doesn't replicate data.

Solution: Implement fallback logic:

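```python
import memcache

mc = memcache.Client(['memcached1:11211', 'memcached2:11211', 'memcached3:11211'])

def get_with_fallback(key, db_query_func):
    """Get from cache, fall back to the database on failure."""
    try:
        value = mc.get(key)
    except Exception:
        value = None  # Cache unavailable, treat as a miss

    if value is None:
        # Cache miss or server failure
        value = db_query_func()
        try:
            # python-memcached takes the expiry in seconds as `time`
            mc.set(key, value, time=3600)
        except Exception:
            pass  # A failed cache write must not fail the request
    return value

# Usage (db and user_id come from your application)
user = get_with_fallback(
    f'user:{user_id}',
    lambda: db.query_user(user_id)
)
```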

Issue 4: Hash Distribution Changes

When a server fails, keys redistribute to remaining servers:

```python
# Key 'user:1' was on memcached1
# Now memcached1 is down, key goes to memcached2
# But memcached2 doesn't have the data
```

Solution: Use consistent hashing:

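```python
# Python - pymemcache with consistent hashing
from pymemcache.client.hash import HashClient
from pymemcache.client.rendezvous import RendezvousHash

client = HashClient(
    [('memcached1', 11211), ('memcached2', 11211), ('memcached3', 11211)],
    hasher=RendezvousHash,  # HashClient's default hasher, shown explicitly
    dead_timeout=30         # Try to add a dead server back after 30s
)
```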
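To show why the hashing scheme matters, the sketch below compares naive modulo placement with rendezvous (highest-random-weight) hashing when one of three servers fails. It is a self-contained simulation under assumed helper names (`modulo_node`, `rendezvous_node`, `moved_fraction`) and does not reproduce any client library's exact hash function.

```python
import hashlib

def _hash(text):
    """Stable 64-bit hash of a string."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def modulo_node(key, nodes):
    """Naive placement: hash the key, take it modulo the node count."""
    return nodes[_hash(key) % len(nodes)]

def rendezvous_node(key, nodes):
    """Rendezvous placement: pick the node with the highest score."""
    return max(nodes, key=lambda node: _hash(f"{node}|{key}"))

def moved_fraction(place, keys, before, after):
    """Fraction of keys on surviving nodes that changed node."""
    survivors = [k for k in keys if place(k, before) in after]
    moved = sum(1 for k in survivors if place(k, before) != place(k, after))
    return moved / len(survivors)

before = ["memcached1", "memcached2", "memcached3"]
after = ["memcached2", "memcached3"]  # memcached1 has failed
keys = [f"user:{i}" for i in range(10_000)]

print("modulo:    ", moved_fraction(modulo_node, keys, before, after))
print("rendezvous:", moved_fraction(rendezvous_node, keys, before, after))
```

With modulo placement, a large share of keys that lived on healthy servers move anyway, so the failure of one node causes misses across the whole cluster. With rendezvous hashing, only the keys from the failed node are reassigned.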

Issue 5: Java Client Configuration

```java
// Java - spymemcached
import net.spy.memcached.AddrUtil;
import net.spy.memcached.ConnectionFactoryBuilder;
import net.spy.memcached.FailureMode;
import net.spy.memcached.MemcachedClient;

// The constructor throws IOException; call it from code that handles it
MemcachedClient client = new MemcachedClient(
    new ConnectionFactoryBuilder()
        .setFailureMode(FailureMode.Redistribute)  // Redistribute on failure
        .setOpTimeout(5000)                        // Operation timeout 5s
        .setTimeoutExceptionThreshold(10)          // Treat node as failed after 10 consecutive timeouts
        .build(),
    AddrUtil.getAddresses("memcached1:11211 memcached2:11211 memcached3:11211")
);
```

Issue 6: PHP Client Configuration

```php
// PHP - Memcached extension
$m = new Memcached();
$m->setOption(Memcached::OPT_DISTRIBUTION, Memcached::DISTRIBUTION_CONSISTENT);
$m->setOption(Memcached::OPT_REMOVE_FAILED_SERVERS, true);
$m->setOption(Memcached::OPT_RETRY_TIMEOUT, 30);
$m->setOption(Memcached::OPT_CONNECT_TIMEOUT, 5000);
$m->setOption(Memcached::OPT_SERVER_FAILURE_LIMIT, 2);

$m->addServers([
    ['memcached1', 11211, 33],
    ['memcached2', 11211, 33],
    ['memcached3', 11211, 33]
]);
```

Issue 7: Node.js Client Configuration

```javascript
// Node.js - memcached
const Memcached = require('memcached');

const memcached = new Memcached({
    'memcached1:11211': { weight: 1 },
    'memcached2:11211': { weight: 1 },
    'memcached3:11211': { weight: 1 }
}, {
    retries: 2,
    timeout: 5000,
    remove: true, // Remove failed servers
    failOverServers: ['memcached-backup:11211'],
    failOverOnException: true,
    retry: 30000
});
```

Issue 8: Connection Pool Exhaustion

```python
# Too many connections to remaining servers
```

Solution: Configure connection limits:

```python
# Use connection pooling
from pymemcache.client.base import PooledClient
from pymemcache.client.hash import HashClient

# Pooled client
pool = PooledClient(
    ('memcached1', 11211),
    max_pool_size=10,
    timeout=5
)

# Or HashClient with a pool per server
client = HashClient(
    [('memcached1', 11211), ('memcached2', 11211)],
    use_pooling=True,
    max_pool_size=10,
    timeout=5,
    connect_timeout=5
)
```

Monitoring and Health Checks

Server Health Check Script

```python
import socket
import time

def check_memcached(host, port=11211, timeout=5):
    """Check if a Memcached server is healthy."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b'stats\r\n')
            response = sock.recv(1024)
        # A healthy server answers with lines that start with "STAT"
        return response.startswith(b'STAT')
    except OSError:
        # Covers refused connections, timeouts, and DNS failures
        return False

def monitor_servers(servers):
    """Monitor Memcached servers."""
    while True:
        for host, port in servers:
            status = check_memcached(host, port)
            print(f"{host}:{port} - {'OK' if status else 'DOWN'}")
        time.sleep(60)

servers = [('memcached1', 11211), ('memcached2', 11211), ('memcached3', 11211)]
monitor_servers(servers)
```
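The health check above only tests whether the server answers. If you also want the values it reports, the reply to `stats` can be parsed into a dictionary. The `parse_stats` helper below is a hypothetical name, and the sample bytes are made up for illustration. A real reader should keep receiving until the `END` line arrives, since a single `recv(1024)` may return only part of the reply.

```python
def parse_stats(raw):
    """Parse the text reply to the `stats` command into a dict of strings."""
    stats = {}
    for line in raw.decode("ascii", errors="replace").splitlines():
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats

# Made-up sample reply in the "STAT <name> <value>" text format
sample = b"STAT pid 1234\r\nSTAT uptime 86400\r\nSTAT curr_connections 12\r\nEND\r\n"
stats = parse_stats(sample)
print(stats["uptime"])  # 86400
```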

Prometheus Metrics

```python
import time

from prometheus_client import Gauge, start_http_server
import memcache

mc = memcache.Client(['memcached1:11211', 'memcached2:11211', 'memcached3:11211'])

# Metrics (populate the hit and miss gauges from mc.get_stats() as needed)
cache_hits = Gauge('memcached_hits', 'Cache hits')
cache_misses = Gauge('memcached_misses', 'Cache misses')
server_status = Gauge('memcached_server_status', 'Server status', ['server'])

def collect_metrics():
    for server in mc.servers:
        # connect() returns 1 on success and 0 on failure
        alive = bool(server.connect())
        server_status.labels(server=str(server.address)).set(1 if alive else 0)

start_http_server(8000)

# start_http_server() does not block, so refresh the gauges in a loop
while True:
    collect_metrics()
    time.sleep(15)
```
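A hit ratio is often more useful to alert on than raw hit and miss counts, because it drops sharply when a node fails and its keys have to be repopulated. The sketch below assumes a dictionary of stats keyed by the standard `get_hits` and `get_misses` counter names. The `hit_ratio` helper itself is a hypothetical name for this example.

```python
def hit_ratio(stats):
    """Return get_hits / (get_hits + get_misses), or None with no traffic."""
    hits = int(stats.get("get_hits", 0))
    misses = int(stats.get("get_misses", 0))
    total = hits + misses
    return hits / total if total else None

print(hit_ratio({"get_hits": "900", "get_misses": "100"}))  # 0.9
```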

High Availability Setup

Multiple Memcached Instances

```bash
# Run multiple Memcached instances per server
memcached -d -p 11211 -m 512 -c 1024
memcached -d -p 11212 -m 512 -c 1024
memcached -d -p 11213 -m 512 -c 1024
```

Client configuration:

```python
mc = memcache.Client([
    'server1:11211', 'server1:11212', 'server1:11213',
    'server2:11211', 'server2:11212', 'server2:11213',
])
```

Twemproxy (Nutcracker)

Twemproxy provides a proxy layer with automatic failover:

```yaml
# nutcracker.yml
alpha:
  listen: 127.0.0.1:11211
  hash: crc32a
  distribution: ketama
  auto_eject_hosts: true
  timeout: 500
  server_retry_timeout: 30000
  server_failure_limit: 2
  servers:
    - memcached1:11211:1
    - memcached2:11211:1
    - memcached3:11211:1
```

Run Twemproxy:

```bash
nutcracker -c nutcracker.yml -d
```

Client connects to Twemproxy:

```python
mc = memcache.Client(['127.0.0.1:11211'])
```

Mcrouter

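Mcrouter provides more advanced failover. A `FailoverRoute` tries its child routes in order, so a request that fails against the primary pool is retried against the next one. The example assumes a second pool named `backup` that contains the `memcached-backup:11211` host:

```json
{
  "pools": {
    "memcached": {
      "servers": [
        "memcached1:11211",
        "memcached2:11211",
        "memcached3:11211"
      ]
    },
    "backup": {
      "servers": [
        "memcached-backup:11211"
      ]
    }
  },
  "route": {
    "type": "FailoverRoute",
    "children": [
      "PoolRoute|memcached",
      "PoolRoute|backup"
    ]
  }
}
```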

Verification

```bash
# Test failover manually
# Stop one server (run on one of the Memcached hosts)
systemctl stop memcached

# Test from client
python test_memcached.py

# Check logs
tail -n 50 /var/log/memcached.log

# Restart server
systemctl start memcached

# Verify client reconnects
python test_memcached.py
```

Test script:

```python
import memcache
import time

mc = memcache.Client(
    ['memcached1:11211', 'memcached2:11211', 'memcached3:11211'],
    dead_retry=30
)

# Set test key
mc.set('test_key', 'test_value')

# Get test key repeatedly
for i in range(100):
    try:
        value = mc.get('test_key')
        print(f"Attempt {i}: {value}")
    except Exception as e:
        print(f"Attempt {i}: Error - {e}")
    time.sleep(1)
```

Prevention

  1. [ ] Multiple Memcached servers configured
  2. [ ] Client configured with failover settings
  3. [ ] Timeout and retry settings appropriate
  4. [ ] Fallback logic implemented in application
  5. [ ] Consistent hashing enabled
  6. [ ] Connection pooling configured
  7. [ ] Health monitoring in place
  8. [ ] Twemproxy or Mcrouter for HA (optional)
  9. [ ] Test failover manually
  10. [ ] Document failover behavior
