Introduction
A ZooKeeper client session expires when the server receives no heartbeat from the client within the session timeout period. On expiry the server deletes the session's ephemeral nodes and discards its watches, so any distributed coordination built on them fails.
Symptoms
Session expired error:
```bash
# In application logs:
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired

# Kafka using ZooKeeper:
[ERROR] Session expired event received (kafka.server.KafkaServer)
[ERROR] KeeperErrorCode = Session expired for /brokers/ids/1

# HBase using ZooKeeper:
ERROR [main-SendThread] zookeeper.ClientCnxn: Session 0x123456789 for server null, unexpected error, closing socket connection and attempting reconnect
```
Ephemeral nodes deleted:
```bash
$ zkCli.sh -server localhost:2181 ls /brokers/ids
[]   # Empty - all broker ephemeral nodes gone
```
Client state lost:
```bash
Session state: EXPIRED
Session ID: 0x123456789abcdef
```
Common Causes
1. Network partition - Client cannot reach the server
2. Session timeout too short - Not enough time to recover from a brief interruption
3. Server overload - Server too busy to process heartbeats
4. GC pauses - Long garbage collection pauses stall the client JVM, so heartbeats are not sent in time
5. Clock skew - Time synchronization issues
6. Resource exhaustion - Client process overloaded
Step-by-Step Fix
1. Check logs for specific error messages
2. Verify configuration settings
3. Test network connectivity
4. Review recent changes
5. Apply corrective action
6. Verify the fix
Step 1: Check ZooKeeper Server Status
```bash
# Check ZooKeeper process
ps aux | grep zookeeper
systemctl status zookeeper

# Check ZooKeeper is listening
netstat -tlnp | grep 2181
ss -tlnp | grep 2181

# The four-letter-word commands below must be allowed on ZooKeeper 3.5.3+,
# e.g. 4lw.commands.whitelist=ruok,stat,cons,envi,wchs,dump,mntr,srvr in zoo.cfg

# Check ZooKeeper health
echo ruok | nc localhost 2181
# Should respond: imok

# Check ZooKeeper stats
echo stat | nc localhost 2181

# Check connection status
echo cons | nc localhost 2181

# Check server environment
echo envi | nc localhost 2181

# Check watch count
echo wchs | nc localhost 2181

# Check ephemeral nodes
echo dump | nc localhost 2181
```
Step 2: Check Session Configuration
```bash
# Check tick time in zoo.cfg
grep tickTime /etc/zookeeper/conf/zoo.cfg

# Default: tickTime=2000 (2 seconds)

# Check session timeout
# Session timeout should be 2x tickTime minimum
# Recommended: 4x to 10x tickTime

# Default session timeout bounds:
# minSessionTimeout = 2 * tickTime
# maxSessionTimeout = 20 * tickTime

# Check configuration. The two session timeout lines are optional overrides.
# Keep comments on their own lines: zoo.cfg does not support trailing comments.
tickTime=2000
initLimit=10
syncLimit=5
minSessionTimeout=4000
maxSessionTimeout=40000

# Client-side session timeout:
# Kafka: zookeeper.session.timeout.ms=6000
# HBase: zookeeper.session.timeout=60000
# Storm: storm.zookeeper.session.timeout=20000
```
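The server does not simply accept the timeout a client asks for: it clamps the requested value into the range between minSessionTimeout and maxSessionTimeout. The sketch below estimates the granted timeout from the defaults described above. `negotiated_timeout` is a hypothetical helper written for this article, not a ZooKeeper command.

```bash
#!/usr/bin/env bash
# Sketch: estimate the session timeout a server would grant.
# negotiated_timeout is a hypothetical helper, not part of ZooKeeper.
# Usage: negotiated_timeout REQUESTED_MS TICK_MS [MIN_MS] [MAX_MS]
negotiated_timeout() {
  local requested=$1 tick=$2
  local min=${3:-$((tick * 2))}    # default minSessionTimeout = 2 * tickTime
  local max=${4:-$((tick * 20))}   # default maxSessionTimeout = 20 * tickTime
  if (( requested < min )); then
    echo "$min"
  elif (( requested > max )); then
    echo "$max"
  else
    echo "$requested"
  fi
}

negotiated_timeout 6000 2000    # within range: prints 6000
negotiated_timeout 60000 2000   # above the default max: prints 40000
negotiated_timeout 1000 2000    # below the default min: prints 4000
```

With tickTime=2000 and no overrides, a client requesting 60000 ms (the HBase value shown above) would end up with 40000 ms, so raising a client setting alone may have no effect until maxSessionTimeout is raised as well.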
Step 3: Check Network Connectivity
```bash
# Test connectivity to ZooKeeper
ping -c 4 zookeeper-server
traceroute zookeeper-server

# Test port connectivity
nc -zv zookeeper-server 2181
telnet zookeeper-server 2181

# Check for packet loss
mtr zookeeper-server

# Check firewall rules
iptables -L -n -v | grep 2181

# Allow ZooKeeper traffic
iptables -I INPUT -p tcp --dport 2181 -j ACCEPT   # Client connections
iptables -I INPUT -p tcp --dport 2888 -j ACCEPT   # Follower-to-leader (quorum) traffic
iptables -I INPUT -p tcp --dport 3888 -j ACCEPT   # Leader election

# Check network latency
ping -c 10 zookeeper-server | tail -1

# If latency approaches the session timeout, heartbeats arrive too late and sessions expire
```
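To turn the latency check into a pass/fail result, the ping summary can be compared with a budget derived from the session timeout. The sketch below uses the rule of thumb from this article's checklist (average latency below session timeout / 4). `latency_ok` is a hypothetical helper, and it assumes the `min/avg/max` summary line that Linux and BSD `ping` print.

```bash
#!/usr/bin/env bash
# Sketch: compare the average ping round-trip time with a latency budget.
# latency_ok is a hypothetical helper; it reads ping output on stdin.
# Usage: ping -c 10 HOST | latency_ok SESSION_TIMEOUT_MS
latency_ok() {
  local timeout_ms=$1
  # Summary line looks like: rtt min/avg/max/mdev = 0.4/0.5/0.6/0.1 ms
  awk -v t="$timeout_ms" -F'[=/]' '
    /min\/avg\/max/ { avg = $(NF-2) + 0; found = 1 }
    END {
      if (!found) { print "no summary line found"; exit 2 }
      budget = t / 4   # assumed rule of thumb: latency < session timeout / 4
      if (avg < budget) { printf "OK avg=%.3fms budget=%.0fms\n", avg, budget; exit 0 }
      printf "HIGH avg=%.3fms budget=%.0fms\n", avg, budget
      exit 1
    }'
}

# Sample summary line standing in for real ping output
echo 'rtt min/avg/max/mdev = 0.045/0.052/0.061/0.005 ms' | latency_ok 6000
```

The function exits with status 0 when the average is inside the budget, 1 when it is not, and 2 when no summary line was found, so it can be used directly in an `if` statement or a cron check.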
Step 4: Check Server Resource Usage
```bash
# Check CPU usage (pgrep -d, joins multiple PIDs with commas for top)
top -p "$(pgrep -d, -f QuorumPeerMain)"

# Check memory usage (ps column 6 is resident memory in KB)
free -h
ps aux | grep '[z]ookeeper' | awk '{print $6}'

# Check disk I/O
iostat -x 1 10

# Check ZooKeeper transaction log disk
df -h /var/lib/zookeeper

# Check file descriptors
lsof -p "$(pgrep -f QuorumPeerMain)" | wc -l

# Check open file limit
grep "open files" /proc/"$(pgrep -f QuorumPeerMain)"/limits

# Increase limits if needed:
# In /etc/security/limits.conf:
zookeeper soft nofile 65536
zookeeper hard nofile 65536
# For a systemd-managed service, set LimitNOFILE=65536 in the unit file instead
```
Step 5: Check GC and JVM Issues
```bash
# Check JVM configuration (-e stops grep from reading the pattern as an option)
ps aux | grep '[z]ookeeper' | grep -o -e '-Xmx[^ ]*'
ps aux | grep '[z]ookeeper' | grep -o -e '-Xms[^ ]*'

# Enable GC logging (if not enabled)
# In zookeeper-env.sh (JDK 8 flags; on JDK 9+ use -Xlog:gc*:file=/var/log/zookeeper/gc.log):
SERVER_JVMFLAGS="-Xloggc:/var/log/zookeeper/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps"

# Check GC logs
tail -100 /var/log/zookeeper/gc.log

# Look for long GC pauses (> 1 second); JDK 8 logs end each event with "real=N.NN secs"
grep -o 'real=[0-9.]* secs' /var/log/zookeeper/gc.log | sort -t'=' -k2 -n | tail

# If GC pauses exceed session timeout, sessions expire

# Optimize JVM settings:
export JVMFLAGS="-Xms2g -Xmx2g -XX:+UseG1GC -XX:MaxGCPauseMillis=200"

# Check heap usage (JDK 8; on JDK 9+ use: jhsdb jmap --heap --pid <pid>)
jmap -heap "$(pgrep -f org.apache.zookeeper.server.quorum.QuorumPeerMain)"
```
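Scanning a large GC log by eye is error-prone, so it can help to list only the pauses at or above a threshold. The sketch below assumes the JDK 8 log format, where each event ends with `real=N.NN secs`; unified logging on JDK 9+ uses a different layout and would need a different pattern. `long_gc_pauses` is a hypothetical helper, and the sample lines are abbreviated illustrations rather than real log output.

```bash
#!/usr/bin/env bash
# Sketch: list GC pauses at or above a threshold (JDK 8 log format assumed).
# long_gc_pauses is a hypothetical helper; it reads a GC log on stdin.
# Usage: long_gc_pauses THRESHOLD_SECONDS < /var/log/zookeeper/gc.log
long_gc_pauses() {
  local threshold=$1
  # Keep only the wall-clock part of each event, then filter numerically
  grep -o 'real=[0-9.]* secs' |
    awk -F'[= ]' -v t="$threshold" '$2 + 0 >= t { print $2 }'
}

# Abbreviated, illustrative lines in the JDK 8 style
long_gc_pauses 1 <<'EOF'
[Times: user=0.05 sys=0.00, real=0.02 secs]
[Times: user=6.10 sys=0.20, real=6.50 secs]
EOF
```

With a threshold of 1 second, only the 6.50 second pause is printed. A reasonable threshold is a fraction of the session timeout, since a pause close to the timeout is enough to cause an expiry.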
Step 6: Increase Session Timeout
```bash
# Increase server-side session timeout bounds
# In zoo.cfg (milliseconds: 20 seconds minimum, 120 seconds maximum):
tickTime=2000
minSessionTimeout=20000
maxSessionTimeout=120000

# Or increase tickTime (5 seconds per tick); initLimit and syncLimit scale with it
tickTime=5000

# Restart ZooKeeper (one server at a time in an ensemble)
systemctl restart zookeeper

# Client-side timeout configuration:

# Kafka:
zookeeper.session.timeout.ms=30000
zookeeper.connection.timeout.ms=15000

# HBase:
zookeeper.session.timeout=60000

# Storm:
storm.zookeeper.session.timeout=60000

# Solr:
zkClientTimeout=30000
```
Step 7: Check Client Configuration
```bash
# Check client connection string
grep -r "zookeeper" /etc/kafka/
grep -r "zookeeper" /etc/hbase/

# Ensure all servers in cluster listed:
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181

# Check client session timeout
# Kafka server.properties:
zookeeper.session.timeout.ms=30000

# Check related client settings
# Kafka (how far a ZooKeeper follower may lag the leader; not a retry option):
zookeeper.sync.time.ms=2000

# Storm retry configuration:
storm.zookeeper.retry.times=5
storm.zookeeper.retry.interval=1000
storm.zookeeper.retry.intervalceiling.millis=30000
```
Step 8: Check Ensemble Health
```bash
# Check ensemble status
echo stat | nc zk1 2181
echo stat | nc zk2 2181
echo stat | nc zk3 2181

# Check leader election
# Look for Mode: leader or Mode: follower

# Check zk server IDs
cat /var/lib/zookeeper/myid

# Check for split-brain
# Run against every server: exactly one should report leader
echo mntr | nc zk1 2181 | grep zk_server_state

# Check transaction log sync
echo stat | nc localhost 2181 | grep -E "Mode|Zxid"

# Check last zxid
# If large gap between servers, sync issue

# Check follower latency
echo srvr | nc localhost 2181
```
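The split-brain check comes down to counting how many servers report themselves as leader. The sketch below does that over collected `mntr` output. `check_single_leader` is a hypothetical helper, and the sample input stands in for the output of three real servers.

```bash
#!/usr/bin/env bash
# Sketch: verify that an ensemble reports exactly one leader.
# check_single_leader is a hypothetical helper. It reads mntr output from
# one or more servers on stdin and looks at the zk_server_state lines.
check_single_leader() {
  awk '
    $1 == "zk_server_state" { seen++; if ($2 == "leader") leaders++ }
    END {
      printf "servers=%d leaders=%d\n", seen, leaders
      exit (leaders == 1 ? 0 : 1)
    }'
}

# Sample input standing in for three "echo mntr | nc zkN 2181" calls
printf 'zk_server_state\t%s\n' follower leader follower | check_single_leader
```

Against a live ensemble the same function could be fed with `for h in zk1 zk2 zk3; do echo mntr | nc "$h" 2181; done | check_single_leader`. A non-zero exit status means zero or several leaders were reported, which is worth investigating before tuning any timeouts.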
Step 9: Monitor Session Events
```bash
# Create monitoring script
cat << 'EOF' > monitor_zk_sessions.sh
#!/bin/bash
# nc takes host and port as separate arguments
ZK_HOST=localhost
ZK_PORT=2181

echo "=== ZooKeeper Stats ==="
echo stat | nc "$ZK_HOST" "$ZK_PORT"

echo ""
echo "=== Connections ==="
echo cons | nc "$ZK_HOST" "$ZK_PORT" | head -20

echo ""
echo "=== Watch Count ==="
echo wchs | nc "$ZK_HOST" "$ZK_PORT"

echo ""
echo "=== Ephemeral Nodes ==="
echo dump | nc "$ZK_HOST" "$ZK_PORT" | grep -iE "session|ephemeral" | head -20

echo ""
echo "=== Server Health ==="
echo mntr | nc "$ZK_HOST" "$ZK_PORT"
EOF

chmod +x monitor_zk_sessions.sh

# Monitor in loop
watch -n 10 ./monitor_zk_sessions.sh

# Prometheus JMX exporter:
# Add to JVM:
# -javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent.jar=7070:/etc/jmx_exporter/zookeeper.yml

# Key metrics:
# zk_avg_latency
# zk_min_latency
# zk_max_latency
# zk_packets_received
# zk_packets_sent
# zk_num_alive_connections
```
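The key metrics become actionable once each one is compared with a limit. The sketch below pulls a single metric out of `mntr` output and reports whether it exceeds a threshold. `mntr_over` is a hypothetical helper, and the limits shown are placeholders to be replaced with values that suit your workload.

```bash
#!/usr/bin/env bash
# Sketch: compare one mntr metric with a limit.
# mntr_over is a hypothetical helper; it reads mntr output on stdin.
# Usage: echo mntr | nc HOST 2181 | mntr_over METRIC LIMIT
mntr_over() {
  awk -v key="$1" -v limit="$2" '
    $1 == key { found = 1; value = $2 + 0 }
    END {
      if (!found) { print key " missing"; exit 2 }
      if (value > limit + 0) { print "ALERT " key "=" value " limit=" limit; exit 1 }
      print "ok " key "=" value
    }'
}

# Sample mntr lines standing in for a live server
sample=$(printf 'zk_avg_latency\t3\nzk_max_latency\t950\nzk_num_alive_connections\t12\n')

echo "$sample" | mntr_over zk_avg_latency 100           # within the limit
echo "$sample" | mntr_over zk_max_latency 500 || true   # over the limit, exit status 1
```

The comparison is done in awk rather than shell arithmetic so that metrics reported as decimals are handled as well as integers.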
Step 10: Implement Session Recovery
```java
// Handle session expiration in client code

import org.apache.zookeeper.*;

public class ZooKeeperClient implements Watcher {
    private ZooKeeper zk;
    private String connectString = "zk1:2181,zk2:2181,zk3:2181";
    private int sessionTimeout = 30000;

    @Override
    public void process(WatchedEvent event) {
        switch (event.getState()) {
            case Expired:
                // Session expired, need to reconnect
                System.out.println("Session expired, reconnecting...");
                reconnect();
                // Recreate ephemeral nodes
                recreateEphemeralNodes();
                break;
            case Disconnected:
                // Temporarily disconnected, may recover
                System.out.println("Disconnected, waiting for reconnect...");
                break;
            case SyncConnected:
                System.out.println("Connected");
                break;
            case ConnectedReadOnly:
                System.out.println("Connected in read-only mode");
                break;
            default:
                // Other states need no action here
                break;
        }
    }

    private synchronized void reconnect() {
        try {
            if (zk != null) {
                zk.close();
            }
            // An expired session cannot be resumed, so open a new one
            zk = new ZooKeeper(connectString, sessionTimeout, this);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private void recreateEphemeralNodes() {
        // Recreate ephemeral nodes after reconnection
        try {
            zk.create("/my-ephemeral", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```
ZooKeeper Session Checklist
| Check | How to check | Expected |
|---|---|---|
| Server running | ruok | imok |
| Session timeout | zoo.cfg and client settings | > 4x tickTime |
| Network latency | ping | < session timeout / 4 |
| GC pauses | gc.log | Well below session timeout |
| File descriptors | lsof | Below the open files limit |
| Memory | free | Free memory available |
Verification
```bash
# After increasing timeout or fixing connectivity

# 1. Check ZooKeeper health
echo ruok | nc localhost 2181
# Expected: imok

# 2. Verify sessions stable
echo mntr | nc localhost 2181 | grep zk_num_alive_connections
# Expected: stable connection count across repeated runs

# 3. Check ephemeral nodes persist (dump only works on the leader)
echo dump | nc localhost 2181 | grep -i ephemeral
# Expected: sessions with ephemeral nodes listed

# 4. Monitor for session expiration
tail -f /var/log/zookeeper/zookeeper.log | grep -iE "expired|session"
# Expected: no new expiration errors

# 5. Test client reconnection
# Restart client, verify session re-establishes
# Expected: client reconnects successfully

# 6. Check latency
echo stat | nc localhost 2181 | grep Latency
# Expected: avg latency low
```
Prevention
To prevent ZooKeeper session expiration from recurring, implement these proactive measures:
1. Configure Appropriate Session Timeout
```bash
# In zoo.cfg:
tickTime=2000
initLimit=10
syncLimit=5
# Session timeout should be at least 2 * tickTime
minSessionTimeout=4000
maxSessionTimeout=40000

# Client-side configuration (Java): the timeout is a constructor argument, e.g.
# new ZooKeeper(connectString, 30000, watcher)   # 30 seconds
```
2. Monitor ZooKeeper Health
```yaml
# Prometheus alert rule; the metric name depends on your exporter configuration
groups:
  - name: zookeeper
    rules:
      - alert: ZooKeeperSessionExpirationsHigh
        expr: rate(zookeeper_session_expirations[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High rate of ZooKeeper session expirations"
```
3. Implement Connection Retry Logic
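```java
// Java client with retry handling (Apache Curator)
import org.apache.curator.RetryPolicy;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Base sleep of 1000 ms between retries, at most 3 retries
RetryPolicy retryPolicy = new ExponentialBackoffRetry(1000, 3);
CuratorFramework client = CuratorFrameworkFactory.builder()
    .connectString("zk1:2181,zk2:2181,zk3:2181")
    .sessionTimeoutMs(30000)
    .connectionTimeoutMs(10000)
    .retryPolicy(retryPolicy)
    .build();
client.start();  // The client must be started before use
```
Best Practices Checklist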
- [ ] Configure appropriate session timeouts
- [ ] Monitor session expiration rate
- [ ] Implement client retry logic
- [ ] Use ZooKeeper ensemble for HA
- [ ] Monitor network latency
- [ ] Test session recovery procedures
Related Issues
- [Fix ZooKeeper Ensemble Not Forming](/articles/fix-zookeeper-ensemble-not-forming)
- [Fix ZooKeeper Leader Election Failed](/articles/fix-zookeeper-leader-election-failed)
- [Fix Kafka Controller Not Moving](/articles/fix-kafka-controller-not-moving)