Background
MySQL 5.7 reached its End-of-Life (EOL) in October 2023, yet many production systems still rely on this version. Recently, a large-scale transaction execution caused significant replication lag in a MySQL 5.7-based system, leading to data inconsistency between the primary and secondary databases. This impacted application logic due to read/write splitting. To address this, the business team urgently needed an accurate method to measure replication lag.
Key Questions
- How can we obtain the actual replication lag value?
- How do we ensure the measured value is reliable?
Analyzing Seconds_Behind_Master’s Reliability
The Seconds_Behind_Master
metric from SHOW SLAVE STATUS
indicates the delay in applying binary log events on the secondary. However, its accuracy is questionable:
- Calculation Flaws: It relies on timestamps from binary log entries, which may not reflect actual execution times (e.g.,
sync_binlog=0
delays log flushing). - Single-Threaded SQL Thread Limitations: Long transactions (e.g., bulk updates) pause timestamp updates until completion, causing sudden spikes in reported lag.
- Parallel Replication Gaps: When using
slave_parallel_workers
, the metric only tracks the last applied event, masking delays in other threads. - Network/I/O Overlooked: It ignores delays in log retrieval or relay log writes, potentially showing zero lag despite network issues.
Solution: pt-heartbeat
Percona Toolkit’s pt-heartbeat
injects periodic timestamps into a monitoring table on the primary and measures lag by comparing them against the secondary’s clock:
1. Primary Setup:
-- Create heartbeat table
CREATE TABLE heartbeat.heartbeat (
id INT PRIMARY KEY,
ts TIMESTAMP(6) NOT NULL,
server_id INT NOT NULL
);
-- Insert/update heartbeat every second
pt-heartbeat --host=primary_host --user=user --password=pass \
--create-table --update --interval=1 -D heartbeat
2. Secondary Monitoring:
pt-heartbeat --host=secondary_host --user=user --password=pass \
--monitor -D heartbeat
Why pt-heartbeat is Reliable
- Real-Time Precision: Updates every second with microsecond granularity.
- Clock Independence: Avoids reliance on MySQL’s internal timestamps by using the system clock.
- Parallel Replication Handling: Aggregates delays across all replication threads.
- Network Awareness: Accounts for propagation delays between primary and secondary.
Recommendations
- Deploy
pt-heartbeat
with intervals ≤ 0.5 seconds for balanced accuracy/load. - Ensure NTP synchronization between primary/secondary to avoid clock drift.
- Set alerts for lag thresholds (e.g., >1s) and optimize large transactions.
By replacing Seconds_Behind_Master
with pt-heartbeat
, teams gain actionable insights into replication health, ensuring data consistency and system stability.