Background
A financial system’s slave database exhibited severe replication lag during peak traffic, with Seconds_Behind_Master
spiking to 30,000+ seconds. Initial checks ruled out network issues and hardware failures, narrowing the focus to configuration discrepancies.
Key Symptoms
- Replication Lag Spike:
- Slave lag fluctuated wildly, peaking during batch write operations.
- Error logs showed repeated
Slave SQL thread retreating
warnings.
- Configuration Anomalies:
- Binary Log Format: Mixed format (
BINLOG_FORMAT=MIXED
) led to inconsistent row-based logging. - Sync Binlog:
sync_binlog=1
caused excessive disk I/O under heavy writes. - Hardware Mismatch: Slave’s storage subsystem (HDD) lagged behind the master’s SSD.
- Binary Log Format: Mixed format (
Root Cause Analysis
1. Binary Log Overhead
- Mixed Logging: Row-based replication (
ROW
) was sporadically used for DDL statements, generating bloated binary logs. - Sync Policy:
sync_binlog=1
flushed logs to disk after every commit, overwhelming the slave’s HDD.
2. Resource Contention
- Slave Hardware: Underpowered CPU and slow disk I/O exacerbated replication delays.
- Lock Waits: Long-running transactions on the master held locks longer, delaying relay log application.
3. Suboptimal Parameters
- **
innodb_flush_log_at_trx_commit=1
**: Ensured durability at the cost of latency. - **
max_binlog_size=1G
**: Infrequent rotation caused prolonged recovery times.
Solution Steps
1. Align Binary Log Settings:
SET GLOBAL binlog_format = 'ROW'; -- Enforce consistent row-based logging
SET GLOBAL sync_binlog = 100; -- Reduce disk I/O frequency
2. Hardware Upgrades:
- Replaced the slave’s HDD with NVMe SSDs.
- Increased slave memory to 32GB for larger InnoDB buffers.
3. Parameter Tuning:
SET GLOBAL innodb_flush_log_at_trx_commit = 2; -- Balance durability and speed
SET GLOBAL max_binlog_size = 128M; -- Shorten recovery cycles
4. Monitoring Enhancements:
- Added Prometheus alerts for
Seconds_Behind_Master > 300
. - Enabled GTID-based replication for easier failover.