Introduction
This article presents an optimization case study for a GROUP BY statement in a MySQL production environment. The execution time was reduced from 3 seconds to 30 milliseconds!
Case Study
Before optimization:
SELECT taskUniqueId, max(reportTime) AS reportTime
FROM task_log_info
WHERE reportTime > '2024-04-07'
GROUP BY taskUniqueId;
Execution time: 3 seconds.
After optimization:
SELECT a.taskUniqueId, reportTime
FROM task_log_info a
JOIN (
SELECT taskUniqueId, max(id) AS id
FROM task_log_info
GROUP BY taskUniqueId
) tmp ON a.id = tmp.id AND reportTime >= '2024-04-07';
Execution time: 30 milliseconds!
Note: This modification is only applicable when the id
and reportTime
fields have a correlated relationship.
Environment Preparation
For GROUP BY optimization using indexes, we discuss two scenarios:
- No Index on the Table: A temporary table is generated for grouping. Indexes can be used to optimize and avoid using temporary tables.
- Index on the Table: Several scanning algorithms are available for GROUP BY statements:
- Loose Index Scan
- Tight Index Scan
- Combination of both algorithms
Test Data Preparation
CREATE TABLE t2 (
id INT AUTO_INCREMENT,
c1 CHAR(64) NOT NULL,
c2 CHAR(64) NOT NULL,
c3 CHAR(64) NOT NULL,
c4 CHAR(64) NOT NULL,
PRIMARY KEY(id),
KEY c1_c2_c3_idx (c1, c2, c3)
) ENGINE=INNODB;
INSERT INTO t2 VALUES
(null,'a','b','a','a'),
(null,'a','b','a','a'),
(null,'a','c','a','a'),
(null,'a','c','a','a'),
(null,'a','d','b','b'),
(null,'a','b','b','b'),
(null,'d','b','c','c'),
(null,'e','b','c','c'),
(null,'f','c','d','d'),
(null,'k','c','d','d'),
(null,'y','d','y','y'),
(null,'f','b','f','y'),
(null,'a','b','a','a'),
(null,'a','b','a','a'),
(null,'a','c','a','a'),
(null,'a','c','a','a'),
(null,'a','d','b','b'),
(null,'a','b','b','b'),
(null,'d','b','c','c'),
(null,'e','b','c','c'),
(null,'f','c','d','d'),
(null,'k','c','d','d'),
(null,'y','d','y','y'),
(null,'f','b','f','y');
-- Collect statistics, otherwise it may affect the test
ANALYZE TABLE t2;
No Index Scenario
GROUP BY without using an index:
mysql> explain select c4, count(*) from t2 group by c4 order by null;
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
| 1 | SIMPLE | t2 | NULL | ALL | NULL | NULL | NULL | NULL | 24 | 100.00 | Using temporary |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
mysql> explain select c4, count(*) from t2 group by c4;
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+----------------+---------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+----------------+
| 1 | SIMPLE | t2 | NULL | ALL | NULL | NULL | NULL | NULL | 24 | 100.00 | Using temporary; Using filesort |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+----------------+
1 row in set, 1 warning (0.00 sec)
Extra: Using temporary
indicates the use of a temporary table.
GROUP BY using an index:
mysql> explain select c1, count(*) from t2 group by c1;
+----+-------------+-------+------------+-------+---------------+---------------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+---------------+---------+------+------+----------+-------------+
| 1 | SIMPLE | t2 | NULL | index | c1_c2_c3_idx | c1_c2_c3_idx | 768 | NULL | 24 | 100.00 | Using index |
+----+-------------+-------+------------+-------+---------------+---------------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.01 sec)
Extra: Using index indicates a full index scan. This can still be time-consuming for large datasets.
Indexed Scenario
Loose Index Scan (Loose Index Scan)
Does not require scanning all indexes, but skips parts of the index based on the grouping prefix (GROUP BY fields).
Extra: Using index for group-by
Tight Index Scan (Tight Index Scan)
Requires scanning a range or all indexes.
Extra: Using index
Combination of Both Algorithms
For some statistical AGG(DISTINCT) operations, such as SUM|COUNT|AVG(distinct), the cost of using Loose Index Scan may be higher than Tight Index Scan.
Example of Loose Index Scan:
mysql> explain SELECT c1, MIN(c2) FROM t2 GROUP BY c1;
+----+-------------+-------+------------+-------+---------------+---------------+---------+------+------+----------+----------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+---------------+---------+------+------+----------+----------------+
| 1 | SIMPLE | t2 | NULL | range | c1_c2_c3_idx | c1_c2_c3_idx | 256 | NULL | 7 | 100.00 | Using index for group-by |
+----+-------------+-------+------------+-------+---------------+---------------+---------+------+------+----------+----------------+
1 row in set, 1 warning (0.00 sec)
Conditions for Using Loose Index Scan:
- The query is based on a single table.
- The GROUP BY fields satisfy the leftmost prefix match principle of the index.
- The columns used by the aggregation functions must be included in the index; if multiple aggregation functions are used, they must use the same field, and the GROUP BY fields + aggregation function fields must also satisfy the leftmost prefix match principle.
- The fields in the index must be full-field indexes, not prefix indexes, such as INDEX(c1(10)).
Example of Tight Index Scan:
For some GROUP BY operations that cannot use Loose Index Scan, Tight Index Scan may be used if the leftmost prefix match principle is satisfied.
Combination of Both Algorithms:
For statistical AGG(DISTINCT) operations, the cost of using Loose Index Scan may be higher than Tight Index Scan.
Example:
SELECT count(distinct c1, c2) from t2;
mysql> explain select count(distinct c1, c2) from t2;
+----+-------------+-------+------------+-------+---------------+---------------+---------+------+------+----------+----------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+---------------+---------+------+------+----------+----------------+
| 1 | SIMPLE | t2 | NULL | range | c1_c2_c3_idx | c1_c2_c3_idx | 512 | NULL | 10 | 100.00 | Using index for group-by |
+----+-------------+-------+------------+-------+---------------+---------------+---------+------+------+----------+----------------+
1 row in set, 1 warning (0.01 sec)
New Test Data:
INSERT INTO t2 VALUES
(null,'a','b','a','a'),
(null,'a','b','a','a'),
(null,'k','c','a','a'),
(null,'k','g','a','a'),
(null,'a','d','b','b'),
(null,'a','b','b','b'),
(null,'d','b','c','c'),
(null,'e','b','c','c'),
(null,'f','c','d','d'),
(null,'k','c','d','d'),
(null,'y','d','y','y'),
(null,'f','b','f','y'),
(null,'j','b','a','a'),
(null,'m','b','a','a'),
(null,'z','c','a','a'),
(null,'t','c','a','a'),
(null,'x','d','b','b'),
(null,'x','b','b','b'),
(null,'d','b','c','c'),
(null,'e','b','c','c'),
(null,'f','c','d','d'),
(null,'k','c','d','d'),
(null,'y','d','y','y'),
(null,'f','b','f','y');
Execution Plan:
The core of the optimization is to transform the Tight Index Scan into a Loose Index Scan.