Introduction
As database administrators, addressing replication lag is a common challenge. This article examines an unusual lag pattern observed in a production environment, where delays spiked and recovered abruptly.
Problem Phenomenon
Monitoring exposed an irregular replication lag pattern:
https://via.placeholder.com/600x400?text=Replication+Lag+Visualization
Key observations:
- Sudden Spikes: Lag surged to ~71 seconds before dropping to 0 within seconds.
- Periodicity: The pattern repeated every few minutes, resembling a "sawtooth" curve.
Using a shell script to log Seconds_Behind_Master
every second:
for i in {1..300}; do
echo `date` `mysql -S /data/mysql/3306/data/mysqld.sock -uxxx -pxxx -e 'show slave status\G' | grep Second` >> /tmp/second.log
sleep 1
done
The log revealed oscillations between 0 and 71:
2023-04-17 14:20:00 0
2023-04-17 14:20:01 71
2023-04-17 14:20:02 0
...
Root Cause Investigation
Initial theories pointed to resource contention or query overload, but the abrupt lag fluctuations dismissed these. Deeper analysis identified:
- Time Drift: The secondary server’s clock lagged behind the primary by 71 seconds.
- Seconds_Behind_Master Calculation:
- MySQL computes lag by comparing the secondary’s system time against the primary’s binary log timestamps.
- If the secondary’s clock adjusts after the IO thread initializes,
Seconds_Behind_Master
becomes unreliable.
Critical Insight:
"Time discrepancies post-IO thread startup will skew Seconds_Behind_Master
calculations." — MySQL Documentation
Solution
To resolve the issue:
- Synchronize System Time:
- Deploy NTP to align all cluster nodes’ clocks.
- Avoid abrupt time changes; use gradual adjustments (e.g.,
ntpd
instead of manualdate
commands).
- Restart the IO Thread:
- After time alignment, restart the secondary’s IO thread to recalculate the time offset.
- Operational Precautions:
- Halt application writes during time adjustments if dependent on
NOW()
orCURDATE()
. - Validate changes in a staging environment first.
- Halt application writes during time adjustments if dependent on
Example Workflow:
-- On Secondary Server
STOP SLAVE IO_THREAD;
-- Gradually adjust system time (if necessary)
START SLAVE IO_THREAD;
Key Takeaways
- Time Consistency: Critical for accurate replication monitoring.
- Metrics Validation: Cross-check
Seconds_Behind_Master
with binary log positions andSHOW SLAVE STATUS
. - Change Coordination: Align time adjustments with application teams to prevent data inconsistencies.
For further details, refer to MySQL’s documentation on replication timing mechanisms.