Background
MySQL primary-secondary replication is critical for ensuring high availability and scalability. However, replication lag can lead to data inconsistencies and degrade failover capabilities. A common culprit is large transactions, especially those involving multi-table writes. This article analyzes a production incident caused by sharding operations and shares optimization insights.
Problem Statement
Developers implemented a sharding process: reading data from a single large table, processing it, and writing to 100 partitioned tables. This generated large transactions with multi-table writes, causing significant replication lag. Initial observations revealed:
- Secondary server relay log size: 1.6 GB (exceeding the default 1.1 GB).
INFORMATION_SCHEMA.INNODB_TRX
showed active transactions with dynamically changing insert counts.- Replication lag: 37,325 seconds (10.4 hours), with the secondary trailing 4 binary logs behind.
Troubleshooting Steps
1. Analyze Secondary Server Transactions
Checked INNODB_TRX
for long-running transactions:
SELECT * FROM INFORMATION_SCHEMA.INNODB_TRX WHERE TIME > 120;
2. Inspect Relay Log Size
Identified oversized relay logs using:
ls -lh /path/to/relaylogs/mysql-relay-bin.*
The 1.6 GB log indicated excessive backlog.
3. Parse Binary Logs for Row Changes
Used mysqlbinlog
to analyze transaction impact:
mysqlbinlog --base64-output=decode-rows -vv mysql-bin.003731 | awk '{...}' > analysis.txt
Output showed most transactions affected 1–4 rows, ruling out massive bulk operations.
4. Evaluate Transaction Sizes
Sorted transactions by size to pinpoint outliers:
mysqlbinlog ... | grep "^BEGIN" | awk '{...}' | sort -nr | head -n 20
Two transactions exceeded 700 MB each, totaling ~1.5 GB—matching the relay log size.
5. Visualize Large Transactions with my2sql
Leveraged my2sql
to parse detailed transaction metadata:
./my2sql -user repuser -password ... -work-type stats -start-file mysql-bin.003731
Results highlighted 8 transactions spanning 7 hours, each inserting 4 million rows across 100 tables.
Solutions
1. Split Large Transactions
Break multi-table writes into smaller batches (e.g., 10,000 rows per commit) to reduce lock contention and binary log generation. Example:
BEGIN;
INSERT INTO db.t_sharding_01 (...) VALUES (...);
-- Repeat for 10,000 rows
COMMIT;
2. Optimize Sharding Logic
Streamline data distribution to minimize inter-table dependencies. Implement parallel writes or batch processing where feasible.
3. Monitor and Alert
Set up alerts for:
- Relay log size exceeding 500 MB.
- Transactions running longer than 5 minutes.
Use tools like Prometheus + Grafana or Percona Monitoring and Management (PMM).
Key Takeaways
- Large transactions are a leading cause of replication lag.
- Analyze binary logs with
mysqlbinlog
andmy2sql
to identify bottlenecks. - Prioritize transaction granularity and monitoring to maintain replication health.