MySQL8.0 优化器介绍（三）

GreatSQL社区原创内容未经授权不得随意使用，转载请联系小编并注明来源。
GreatSQL是MySQL的国产分支版本，使用上与MySQL一致。
作者：奥特曼爱小怪兽
文章来源：GreatSQL社区原创

往期回顾

MySQL8.0 优化器介绍（一）

MySQL8.0 优化器介绍（二）

本篇将进一步深入介绍优化器相关的join优化

为更好的理解本篇内容需要提前看一下以下内容：

单表访问的方法，参考《MySQL 是怎样运行的：从根儿上理解 MySQL》第10章"单表访问方法"
更多select语句级别的优化细节见（https://dev.mysql.com/doc/refman/8.0/en/select-optimization.html）

为了让读者对join优化有更深的了解，章节里的sql例子，留了一些思考和动手的问题。可能大家得到的答案会不同，但探索未知的过程，方式应该是一样的。

join优化（Join Optimizations）

MySQL可以使用Join Optimizations来改进上次分享过的join algorithms，或者决定如何执行部分查询。本次主要介绍三种经常用到的join Optimizations，更多的 join type 见下面的链接：（https://dev.mysql.com/doc/refman/8.0/en/explain-output.html#explain-join-types）

index merge

通常MySQL只会对每个表使用一个索引。但是，如果对同一表中的多个列在where后有条件限制，并且没有覆盖所有列的单个索引，无论选哪个索引都不是最佳的。对于这些情况，MySQL支持索引合并 (index merge)。select a,b,c from t where a=x1 and b=x2 and c=x3,这种情况下，建立一个多列的复合索引 index_abc 比使用 index merge +index_a+index_b+index_c 性能更好。

Index merge 支持三种算法见下表

查询计划使用index merge 时，会在explain sql 的 access type 列有"index_merge"，key 列会包含所有参与merge的列， key_length 包含一个所用索引的最长关键部分的列表。举个Intersection例子：

Intersection

以下代码块注释中提到的知识点略多

##无论optimizer 是否选择  index merge 取决于index  statistics. 
## index  statistics 是从哪个试图获得呢？mysql.innodb_index_stats  还是 information_schema.statistics 
## 还是 information_schema.INNODB_SYS_TABLESTATS？ 
## 可以参考 https://www.cnblogs.com/ClassicMan/articles/15871403.html
## index_dive  eq_range_index_dive_limit 这两个参数有什么作用？
##意味着即使返回相同STATEMENT_DIGEST_TEXT的sql查询语句, WHERE语句后面跟不同的值，得到的查询计划可能是不一样的  ##比如select * from people where name='唯一值'；
##select * from people where name='超级多的重复值'

## 同理index statistics 的改变会让同一个查询走不同的执行计划，
## 体现在 select a,b from t where a=1 and b=1  有时走了 index merges，有时没走。

CREATE TABLE `payment` (
 `payment_id` smallint unsigned NOT NULL,
 `customer_id` smallint unsigned NOT NULL,
 `staff_id` tinyint unsigned NOT NULL,
 `rental_id` int(DEFAULT NULL,
 `amount` decimal(5,2) NOT NULL,
 `payment_date` datetime NOT NULL,
 `last_update` timestamp NULL,
 PRIMARY KEY (`payment_id`),
 KEY `idx_fk_staff_id` (`staff_id`),
 KEY `idx_fk_customer_id` (`customer_id`),
 KEY `fk_payment_rental` (`rental_id`)
) ENGINE=InnoDB;

## case1  等值查询
SELECT *
  FROM sakila.payment
 WHERE staff_id = 1
   AND customer_id = 75;
 
 mysql> EXPLAIN SELECT *
                  FROM sakila.payment
                 WHERE staff_id = 1
                   AND customer_id = 75\G
**************************** 1. row *****************************
 id: 1
 select_type: SIMPLE
 table: payment
 partitions: NULL
 type: index_merge
 possible_keys: idx_fk_staff_id,idx_fk_customer_id
 key: idx_fk_customer_id,idx_fk_staff_id
 key_len: 2,1
 ref: NULL
 rows: 20
 filtered: 100
 Extra: Using intersect(idx_fk_customer_id,idx_fk_staff_id); Using 
where 1 row in set, 1 warning (0.0007 sec)
 
mysql> EXPLAIN FORMAT=TREE
 SELECT *
 FROM sakila.payment
 WHERE staff_id = 1
 AND customer_id = 75\G
**************************** 1. row ****************************
EXPLAIN: -> Filter: ((sakila.payment.customer_id = 75) and (sakila.payment.staff_id = 1)) (cost=14.48 rows=20)
             -> Index range scan on payment using intersect(idx_fk_customer_id,idx_fk_staff_id) (cost=14.48 rows=20)
1 row in set (0.0004 sec)

##注意"Index range scan on payment",两个等值查询条件，为啥触发了rang scan？


## case2  下面的sql范围查询也能用到index  merge 吗？执行计划 自己下去测试验证
SELECT *
 FROM sakila.payment
 WHERE payment_id > 10
 AND customer_id = 318;

Union Algorithm

##case1 等值查询
SELECT *
 FROM sakila.payment
 WHERE staff_id = 1
 OR customer_id = 318;
 
 mysql> EXPLAIN
 SELECT *
 FROM sakila.payment
 WHERE staff_id = 1
 OR customer_id = 318\G
**************************** 1. row *****************************
 id: 1
 select_type: SIMPLE
 table: payment
 partitions: NULL
 type: index_merge
 possible_keys: idx_fk_staff_id,idx_fk_customer_id
 key: idx_fk_staff_id,idx_fk_customer_id
 key_len: 1,2
 ref: NULL
 rows: 8069
 filtered: 100
 Extra: Using union(idx_fk_staff_id,idx_fk_customer_id); Using where
1 row in set, 1 warning (0.0008 sec)
 
 mysql> EXPLAIN FORMAT=TREE
 SELECT *
 FROM sakila.payment
 WHERE staff_id = 1
 OR customer_id = 318\G
**************************** 1. row ****************************
EXPLAIN: -> Filter: ((sakila.payment.staff_id = 1) or (sakila.payment.customer_id = 318)) (cost=2236.18 rows=8069)
             -> Index range scan on payment using union(idx_fk_staff_id,idx_fk_customer_id) (cost=2236.18 rows=8069)
1 row in set (0.0010 sec)

 ## case2 范围查询也能用到index  merge 吗？执行计划 自己下去测试验证， 
 ## 有主键参与后，和Intersection 章节的case2 执行计划中用到的索引个数有啥不同？
SELECT *
 FROM sakila.payment
 WHERE payment_id > 15000
 OR customer_id = 318;

Sort-Union Algorithm

SELECT *
 FROM sakila.payment
 WHERE customer_id < 30
 OR rental_id < 10;
 
 mysql> EXPLAIN
 SELECT *
 FROM sakila.payment
 WHERE customer_id < 30
 OR rental_id < 10\G
**************************** 1. row *****************************
 id: 1
 select_type: SIMPLE
 table: payment
 partitions: NULL
 type: index_merge
possible_keys: idx_fk_customer_id,fk_payment_rental
 key: idx_fk_customer_id,fk_payment_rental
 key_len: 2,5
 ref: NULL
 rows: 826
 filtered: 100
 Extra: Using sort_union(idx_fk_customer_id,fk_payment_rental); 
Using where 1 row in set, 1 warning (0.0009 sec)

mysql> EXPLAIN FORMAT=TREE
 SELECT *
 FROM sakila.payment
 WHERE customer_id < 30
 OR rental_id < 10\G
**************************** 1. row *****************************
EXPLAIN: -> Filter: ((sakila.payment.customer_id < 30) or (sakila.payment.rental_id < 10)) (cost=1040.52 rows=826)
             -> Index range scan on payment using sort_union(idx_fk_customer_id,fk_payment_rental) (cost=1040.52 rows=826)
1 row in set (0.0005 sec)

Multi-Range Read (MRR)

多范围读取（MRR）优化旨在减少对辅助索引进行范围扫描所导致的随机I/O量。优化读取索引

首先，根据行id（InnoDB的聚集索引）对键进行排序，然后按行的存储顺序检索行。多量程读取优化

可以用于范围扫描和使用索引的等值连接。不支持虚拟生成列上的辅助索引。

使用InnoDB进行多范围读取优化的主要用例是用于没有覆盖索引的磁盘绑定查询（ disk-bound queries 另外一个层面对disk-bound 的优化，详细可见：https://dev.mysql.com/doc/refman/8.0/en/optimizing-innodb-diskio.html）。优化的效果取决于需要多少行以及存储器的查找时间。MySQL将会估算（estimate）是否有用。然而，成本估算在过于悲观而不是过于乐观的一面，因此可能有必要提供帮助优化器做出正确决策的信息。

有两个 optimizer switches 控制MRR优化

mrr: Whether the optimizer is allowed to use the Multi-Range Read optimization. The default is ON.
mrr_cost_based: Whether the decision to use the Multi-Range Read optimization is cost based. You can disable this option to always use the optimization when it is supported. The default is ON

可以用MRR() 和NO_MRR() 两个optimizer switches 来控制表级别or 索引级别的 Multi-Range Read ，举个例子：

mysql> EXPLAIN
 SELECT /*+ MRR(city) */
 *
 FROM world.city
 WHERE CountryCode BETWEEN 'AUS' AND 'CHN'\G
**************************** 1. row *****************************
 id: 1
 select_type: SIMPLE
 table: city
 partitions: NULL
 type: range
possible_keys: CountryCode
 key: CountryCode
 key_len: 3
 ref: NULL
 rows: 812
 filtered: 100
 Extra: Using index condition; Using MRR
1 row in set, 1 warning (0.0006 sec)


有必要使用MRR（）优化器提示或禁用基于MRR_cost_based的优化器开关。
示例中查询的估计行数太小，没有MRR的hint时，基于成本的优化无法使用MRR。
只能显示用hint来干预查询计划使用MRR。
当MRR的优化被使用时, MySQL需要用到random read buffer来存储indexes. 
有一个参数可以影响MRR的性能 read_rnd_buffer_size.

Batched Key Access (BKA)

可以简单认为 BKA=BNL+MRR .这使得可以以与非索引连接类似的方式将连接缓冲区用于索引连接，并使用多范围读取优化来减少随机I/O的数量。BKA 用于大量 disk-bound 查询的场景。但是，没有明确的说明来确定优化何时有帮助，何时会导致性能下降。

可以借鉴一下国外知名dba在MySQL 优化方面的blog（http://oysteing.blogspot.com/2012/04/improved-dbt-3-results-with-mysql-565.html）

MRR 在最优时，查询耗时减少20%,最糟糕时查询耗时增加2/3。

BKA 主要的获益在一个相对较窄的查询范围，而其他查询的性能可能会降低，因此默认情况下禁用该优化。

（可以简单理解为 MySQL5.6时，bka优化带来的收益小于bka带来的成本开销）除非确定开启bka能来提升时，再用hint BKA() 来启用。session级别开启：

SET SESSION optimizer_switch = 'mrr=on,mrr_cost_based=off,batched_key_access=on';

一个使用BKA的例子

mysql> EXPLAIN
 SELECT /*+ BKA(ci) */
 co.Code, co.Name AS Country,
 ci.Name AS City
 FROM world.country co
 INNER JOIN world.city ci
 ON ci.CountryCode = co.Code\G
**************************** 1. row *****************************
 id: 1
 select_type: SIMPLE
 table: co
 partitions: NULL
 type: ALL
 possible_keys: PRIMARY
 key: NULL
 key_len: NULL
 ref: NULL
 rows: 239
 filtered: 100
 Extra: NULL
**************************** 2. row *****************************
 id: 1
 select_type: SIMPLE
 table: ci
 partitions: NULL
 type: ref
 possible_keys: CountryCode
 key: CountryCode
 key_len: 3
 ref: world.co.Code
 rows: 18
 filtered: 100
 Extra: Using join buffer (Batched Key Access)
2 rows in set, 1 warning (0.0007 sec)

注意看执行计划中Extra 的关键字 "Using join buffer",说明 join_buffer_size 会影响BKA 特性的性能。从全局怎么调整join_buffer_size，并又能充分利用上BKA，是一个极大的挑战。调优最常见的问题，搞定了A sql,又引出了其他问题，比如内存使用率过高。

其他join优化

MySQL 还自动支持其他join 优化，一旦对查询有性能帮助，优化器会自动选择他们，一般不需要手动。

了解一下其他join的优化方式，有助于我们在遇到sql性能问题时，可以适当给与优化器，一些有用的hint。

具体有哪些join 优化方式，可以查看explain 输出中的Extra 的内容说明。本文可能列举的不全，精力有限只做了一些简单的介绍，具体细节需要查看官网，以及大量的实践。

Condition Filtering 条件过滤当一个表有两个或多个与之相关联的条件，并且一个索引可以用于部分条件时，使用条件过滤优化。启用条件过滤后，在估计表的总体过滤时，将考虑其余条件的过滤效果。
- Optimizer Switch: condition_fanout_filter – enabled by default
- Optimizer Hints: None
- EXPLAIN Output: None
Derived Merge 优化器可以将派生表(derived table)、视图引用和公共表表达式合并到它们所属的查询块中。优化的替代方法是物化表（materialize the table）、视图引用或公共表表达式。
- Optimizer Switch: derived_merge – enabled by default.
- Optimizer Hints: MERGE(), NO_MERGE().
- EXPLAIN Output: The query plan reflects that the derived table has been merged
Engine Condition Pushdown 此优化将条件向下推到存储引擎。目前仅NDBCluster存储引擎支持它。
Index Condition Pushdown

官方文档中给的例子和解释如下：people表中（zipcode，lastname，firstname）构成一个索引 SELECT * FROM people WHERE zipcode='95054' AND lastname LIKE '%etrunia%' AND address LIKE '%Main Street%';

如果没有使用索引下推技术，则MySQL会通过zipcode='95054'从存储引擎中查询对应的数据，返回到MySQL服务端，然后MySQL服务端基于lastname LIKE '%etrunia%'和address LIKE '%Main Street%'来判断数据是否符合条件。

如果使用了索引下推技术，则MySQL首先会返回符合zipcode='95054'的索引，然后根据lastname LIKE '%etrunia%'和address LIKE '%Main Street%'来判断索引是否符合条件。如果符合条件，则根据该索引来定位对应的数据，如果不符合，则直接reject掉。

有了索引下推优化，可以在有like条件查询的情况下，减少回表次数。

该优化也用于二级索引的范围条件。
- Optimizer Switch: index_condition_pushdown – enabled by default.
- Optimizer Hints: NO_ICP().
- EXPLAIN Output: The traditional format has Using index condition in the Extra column, and the JSON format sets the index_condition field with the index condition that is pushed
Index Extensions InnoDB中的所有二级非唯一索引都将主键列附加到索引中。当启用索引扩展优化时，MySQL会将主键列视为索引的一部分。
- Optimizer Switch: use_index_extensions – enabled by default
- Optimizer Hints: None
- EXPLAIN Output: None
Index Visibility 当表具有不可见的索引（ invisible index）时，默认情况下，优化器在创建查询计划时不会考虑它。如果启用了索引可见性优化器开关，则将考虑不可见的索引。例如，这可以用于测试已添加但尚未可见的索引的效果。
- Optimizer Switch: use_invisible_indexes – disabled by default
- Optimizer Hints: None
- EXPLAIN Output: None
Loose Index Scan 在某些情况下，MySQL可以使用部分索引来提高聚合数据或包含DISTINCT子句的查询的性能。这要求列用于通过形成多列索引的左前缀以及不用于分组的其他列来分组数据。当存在GROUP BY子句时，只有MIN（）和MAX（）聚合函数才能使用这个特性。

distinct效率更高还是group by效率更高？
- Optimizer Switch: None.
- Optimizer Hints: NO_RANGE_OPTIMIZATION() disables the loose index scan optimization as well as index merges and range scans.
- EXPLAIN Output: The traditional format has Using index for group-by in the Extra column. The JSON format sets the using_index_for_group_by field to true.
Range Access Method 范围优化与其他优化略有不同，因为它被认为是一种访问方法。MySQL将只扫描表或索引的一个或多个部分，而不是执行完整的表或索引扫描。范围访问方法通常用于涉及运算符>、>=、<、=<、BETWEEN、IN（），为NULL、LIKE等 range-Optimization 与 index merge Optimization 具有同样的重要性。（https://dev.mysql.com/doc/refman/8.0/en/range-optimization.html）
- Optimizer Switch: None.
- Optimizer Hints: NO_RANGE_OPTIMIZATION() – this also disables the loose index scan and index merge optimizations. It does however not disable the skip scan optimization even though that also uses range access.
- EXPLAIN Output: The access method is set to range. range_optimizer_max_mem_size 可以限制 range access使用的内存。默认是8M
Semijoin 半联接优化用于IN和EXIST条件。支持四个策略:

当subquery materialization 开启时， Semijoin 会尽可能的使用 materialization策略。EXISTS 在MySQL8.0.16 以后支持半联接。NOT EXISTS 在MySQL8.0.17 以后支持半联接。每种策略，都可以以参数的形式，用于SEMIJOIN() and NO_SEMIJOIN() hint

SEMIJOIN(DUPSWEEDOUT):The duplicate weedout strategy executes the semijoin as if it is a normal join and removes the duplicates using a temporary table. EXPLAIN Output: The traditional format has Start temporary and End temporary in the Extra column for the tables involved. The JSON-formatted output uses a block named duplicates_removal

SEMIJOIN(FIRSTMATCH)：The first match strategy returns the first match for each value rather than all values. EXPLAIN Output: The traditional format has FirstMatch(...) in the Extra column where the value between parentheses is the name of the reference table. The JSON format sets the value of the first_match field to the name of the reference table

SEMIJOIN(LOOSESCAN)：The loose scan strategy uses an index to choose a single value from each of the subquery’s value groups. EXPLAIN Output: The traditional format has LooseScan(m..n) in the Extra column where m and n indicate which parts of the index are used for the loose scan. The JSON format sets the loosescan field equal to true

半连接特别怕 null 值，Oracle 经常在以下方面出问题：
- where null in (a,b,c,null), null exists (null) .
- sum（null) 返回null,count(null) 返回 0
- materialization
- duplicate weedout
- first match
- loose scan (不要和 loose index scan optimization混淆)。
Skip Scan

Skip Scan MySQL 8.0.13 引入，工作方式类似loose index scan.当多列索引的第二列上存在范围条件，但第一列上没有条件时使用。Skip Scan将整个索引扫描转换为一系列范围扫描（对索引中第一列的每个值进行一次范围扫描）。

scan in the Extra column, and the JSON format sets the using_index_for_skip_scan field to true
- Optimizer Switch: skip_scan – enabled by default.
- Optimizer Hints: SKIP_SCAN(), NO_SKIP_SCAN().
- EXPLAIN Output: The traditional format has Using index for skip
Subquery Materialization

子查询物化策略将子查询的结果存储在内部临时表中。如果可能的话，优化器将在临时表上添加一个自动生成的哈希索引，将使其快速连接到查询的其余部分。

当启用了subquery_materialization_cost_based优化器开关（默认开）时，优化器将使用估计的成本来决定是使用Subquery Materialization 还是使用 IN-to-EXIST子查询转换（将IN条件重写为EXISTS）。

当开关关闭时，优化器总是选择Subquery Materialization。
- Optimizer Switch: materialization – enabled by default.
- Optimizer Hints: SUBQUERY(MATERIALIZATION).
- EXPLAIN Output: The traditional format has MATERIALIZED as the select type. The JSON format creates a block named materialized_from_subquery.