May 26, 2025

Resolving MySQL Replication Lag from Large Transactions: Step-by-Step Fixes

Learn how to diagnose and fix MySQL replication lag with practical strategies. This guide covers analyzing large transactions, optimizing binlog size, and improving secondary server performance using real-world case studies.

Background

MySQL primary-secondary replication is critical for ensuring high availability and scalability. However, replication lag can lead to data inconsistencies and degrade failover capabilities. A common culprit is large transactions, especially those involving multi-table writes. This article analyzes a production incident caused by sharding operations and shares optimization insights.

Problem Statement

Developers implemented a sharding process: reading data from a single large table, processing it, and writing to 100 partitioned tables. This generated large transactions with multi-table writes, causing significant replication lag. Initial observations revealed:

Secondary server relay log size: 1.6 GB (exceeding the default 1.1 GB).
INFORMATION_SCHEMA.INNODB_TRX showed active transactions with dynamically changing insert counts.
Replication lag: 37,325 seconds (10.4 hours), with the secondary trailing 4 binary logs behind.

Troubleshooting Steps

1. Analyze Secondary Server Transactions
Checked INNODB_TRX for long-running transactions:

SELECT * FROM INFORMATION_SCHEMA.INNODB_TRX WHERE TIME > 120;

‍

2. Inspect Relay Log Size

Identified oversized relay logs using:

ls -lh /path/to/relaylogs/mysql-relay-bin.*

‍

The 1.6 GB log indicated excessive backlog.

3. Parse Binary Logs for Row Changes

Used mysqlbinlog to analyze transaction impact:

mysqlbinlog --base64-output=decode-rows -vv mysql-bin.003731 | awk '{...}' > analysis.txt

‍

Output showed most transactions affected 1–4 rows, ruling out massive bulk operations.

4. Evaluate Transaction Sizes

Sorted transactions by size to pinpoint outliers:

mysqlbinlog ... | grep "^BEGIN" | awk '{...}' | sort -nr | head -n 20

‍

Two transactions exceeded 700 MB each, totaling ~1.5 GB—matching the relay log size.

5. Visualize Large Transactions with my2sql

Leveraged my2sql to parse detailed transaction metadata:

./my2sql -user repuser -password ... -work-type stats -start-file mysql-bin.003731

‍

Results highlighted 8 transactions spanning 7 hours, each inserting 4 million rows across 100 tables.

Solutions

1. Split Large Transactions
Break multi-table writes into smaller batches (e.g., 10,000 rows per commit) to reduce lock contention and binary log generation. Example:

BEGIN;  
INSERT INTO db.t_sharding_01 (...) VALUES (...);  
-- Repeat for 10,000 rows  
COMMIT;

‍

2. Optimize Sharding Logic
Streamline data distribution to minimize inter-table dependencies. Implement parallel writes or batch processing where feasible.

3. Monitor and Alert
Set up alerts for:

Relay log size exceeding 500 MB.
Transactions running longer than 5 minutes.
Use tools like Prometheus + Grafana or Percona Monitoring and Management (PMM).

‍

Key Takeaways

Large transactions are a leading cause of replication lag.
Analyze binary logs with mysqlbinlog and my2sql to identify bottlenecks.
Prioritize transaction granularity and monitoring to maintain replication health.