July 17, 2025

MySQL Deadlock Analysis: PXC 8.0.33 wsrep_group_commit_queue Issue

A deep dive into how empty binlog events in Percona XtraDB Cluster 8.0.33 caused deadlocks via wsrep_group_commit_queue, with reproduction steps and solution analysis.

Percona XtraDB Cluster (PXC) removed the wsrep_group_commit_queue component in version 8.0.41 because it was behind multiple deadlock bugs. As a DBA who handles many crash and hang tickets, I have repeatedly traced the root cause to this component. Here is an analysis of one such case.

Symptom Description

The customer ran two PXC clusters in different data centers, kept in sync through native MySQL replication, with each application accessing the nodes in its local data center.

During peak traffic, multiple PXC nodes frequently hung, stopped processing writes, and crashed after 600 seconds.

Error Log Analysis

MySQL's error.log showed:

2025-07-07T05:10:25.772284Z 0 [ERROR] [MY-012872] [InnoDB] [FATAL] Semaphore wait has lasted > 600 seconds...

InnoDB's SEMAPHORES section revealed two threads waiting for an RW-latch held by pthread ID 140395374548736.

Core File Analysis

GDB identified the hanging thread (thd1) waiting at wsrep_wait_for_turn_in_group_commit (line 468) for COND_wsrep_group_commit. The queue (wsrep_group_commit_queue) contained four THDs, with thd2 (ID 42) being the front element.

Deadlock Pattern

  1. thd1 waits for thd2 via COND_wsrep_group_commit
  2. thd2 waits for thd3 (the real leader) via m_stage_cond_binlog
  3. thd3 (leader) should remove thd2 from the queue but failed due to an empty binlog event
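
The dependency chain above can be written down as a small wait-for graph. This is only an illustrative sketch in Python, not anything taken from the PXC source: the thread names come from the GDB analysis, while the `waits_for` dictionary and the `blocking_chain` helper are hypothetical. thd3's own state is not modelled, since the analysis only tells us that it failed to dequeue thd2.

```python
# Toy wait-for graph for the three threads described above.
# Each key waits on its value; None means "no further wait is modelled here".
waits_for = {
    "thd1": "thd2",  # blocked on COND_wsrep_group_commit
    "thd2": "thd3",  # blocked on m_stage_cond_binlog
    "thd3": None,    # leader: per the analysis, it failed to dequeue thd2
}

def blocking_chain(start, graph):
    """Follow wait-for edges from `start` and return the visited threads in order."""
    chain = [start]
    seen = {start}
    current = graph.get(start)
    while current is not None:
        chain.append(current)
        if current in seen:  # a true cycle: stop instead of looping forever
            break
        seen.add(current)
        current = graph.get(current)
    return chain

print(" -> ".join(blocking_chain("thd1", waits_for)))
```

Walking the graph from thd1 ends at thd3, which is why nothing moves until thd2 is removed from the queue.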

Root Cause

When the replication SQL thread applies a no-op update (e.g., UPDATE t SET d=d WHERE id=1), the resulting binlog event is empty and the thread skips innobase_commit, so its entry is never removed from wsrep_group_commit_queue. That stale entry blocks every transaction queued behind it.
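
To make the mechanism concrete, here is a deliberately simplified Python model of a FIFO commit queue. It is a sketch under stated assumptions, not PXC's implementation: the `simulate` function and the transaction names are hypothetical, and "skipping the dequeue step" stands in for the skipped innobase_commit described above.

```python
from collections import deque

def simulate(txns, noop_txns=frozenset()):
    """Run txns through a toy FIFO commit queue and return (committed, stuck).

    A transaction commits only when it is at the front of the queue.
    Transactions in noop_txns skip the dequeue step, which is this model's
    stand-in for skipping innobase_commit.
    """
    queue = deque(txns)
    committed = []
    for txn in txns:
        if not queue or queue[0] != txn:
            continue  # not at the front: this transaction keeps waiting
        if txn in noop_txns:
            continue  # no-op path: the entry is never removed from the queue
        committed.append(queue.popleft())
    return committed, list(queue)

# Healthy run: every transaction commits and the queue drains.
print(simulate(["t1", "t2", "t3"]))
# t2 is a no-op: it stays at the front, so t3 can never take its turn.
print(simulate(["t1", "t2", "t3"], noop_txns={"t2"}))
```

In the second run only t1 commits. The point of the model is that a single leaked entry is enough to stall everything queued after it, which matches the hang seen under load.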

Reproduction Steps

  1. Deploy PXC + Replication clusters
  2. Create conflicting updates:
-- On Primary
UPDATE test.t SET d=1 WHERE id=1;
-- On Secondary 
UPDATE test.t SET d=1 WHERE id=1; -- Creates empty binlog event

  3. Stress with sysbench:
sysbench oltp_insert --threads=10 --time=600 --tables=1 --table-size=1000 \
--mysql-host=127.0.0.1 --mysql-port=10001 run

TPS drops to 0 once the deadlock occurs.
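
One way to spot the moment of the stall is to scan sysbench's per-interval output for the first 0 TPS reading. The Python sketch below assumes sysbench was run with --report-interval (not part of the command above) and that the interval lines follow the sysbench 1.0.x layout. The `first_stall` helper and the sample lines are hypothetical and for illustration only; they are not output captured from this incident.

```python
import re

# Matches interval lines such as "[ 10s ] thds: 10 tps: 523.40 qps: ...".
# The layout is assumed from sysbench 1.0.x run with --report-interval.
INTERVAL_RE = re.compile(r"\[\s*(\d+)s\s*\].*?\btps:\s*([0-9.]+)")

def first_stall(lines):
    """Return the elapsed second of the first interval reporting 0 TPS, or None."""
    for line in lines:
        match = INTERVAL_RE.search(line)
        if match and float(match.group(2)) == 0.0:
            return int(match.group(1))
    return None

# Hypothetical sample lines, for illustration only.
sample = [
    "[ 1s ] thds: 10 tps: 812.40 qps: 812.40",
    "[ 2s ] thds: 10 tps: 0.00 qps: 0.00",
]
print(first_stall(sample))
```

Comparing the reported second with the timestamps in error.log helps confirm that the stall and the semaphore wait belong to the same event.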

Solution

Percona fixed this in 8.0.41 by completely removing the problematic wsrep_group_commit_queue component.

Key Takeaways

  • The deadlock requires:
    a) Replication applying no-op updates
    b) Concurrent group commits
    c) Busy workload
  • Understanding usage scenarios helps reproduce elusive bugs.
