Oracle直方图理解与实验

一.Oracle中直方图的作用
直方图是一种对被管理对象某一方面质量进行管理的描述工具,那么在Oracle中自然它也是对Oracle中某个对象质量的描述工具,这个对象就是Oracle中最重要的东西——“数据”。
在Oracle中直方图是一种对数据分布质量情况进行描述的工具。它会按照某一列不同值出现数量多少,以及出现的频率高低来绘制数据的分布情况,以便能够指导优化器根据数据的分布做出正确的选择。在某些情况下,表的列中的数值分布将会影响优化器使用索引还是执行全表扫描的决策。当 where 子句的值具有不成比例数量的数值时,将出现这种情况,使得全表扫描比索引访问的成本更低。这种情况下如果where 子句的过滤谓词列之上上有一个合理的正确的直方图,将会对优化器做出正确的选择发挥巨大的作用,使得SQL语句执行成本最低从而提升性能。

二.Oracle中使用直方图的场合
在分析表或索引时,直方图用于记录数据的分布。通过获得该信息,基于成本的优 化器就可以决定使用将返回少量行的索引,而避免使用基于限制条件返回许多行的索引。直方图的使用不受索引的限制,可以在表的任何列上构建直方图。
构造直方图最主要的原因就是帮助优化器在表中数据严重偏斜时做出更好的规划:例如,如果一到两个值构成了表中的大部分数据(数据偏斜),相关的索引就可能无法帮助减少满足查询所需的I/O数量。创建直方图可以让基于成本的优化器知道何时使用索引才最合适,或何时应该根据WHERE子句中的值返回表中80%的记录。
通常情况下在以下场合中建议使用直方图:
(1)、当Where子句引用了列值分布存在明显偏差的列时:当这种偏差相当明显时,以至于 WHERE 子句中的值将会使优化器选择不同的执行计划。这时应该使用直方图来帮助优化器来修正执行路径。(注意:如果查询不引用该列,则创建直方图没有意义。这种错误很常见,许多 DBA 会在偏差列上创建柱状图,即使没有任何查询引用该列。)

(2)、当列值导致不正确的判断时:这种情况通常会发生在多表连接时,例如,假设我们有一个五项的表联接,其结果集只有 10 行。Oracle 将会以一种使第一个联接的结果集(集合基数)尽可能小的方式将表联接起来。通过在中间结果集中携带更少的负载,查询将会运行得更快。为了使中间结果最小化,优化器尝试在 SQL 执行的分析阶段评估每个结果集的集合基数。在偏差的列上拥有直方图将会极大地帮助优化器作出正确的决策。如优化器对中间结果集的大小作出不正确的判断,它可能会选择一种未达到最优化的表联接方法。因此向该列添加直方图经常会向优化器提供使用最佳联接方法所需的信息。

三.Oracle直方图的种类
Oracle利用直方图来提高非均匀数据分布的选择率和技术的计算精度。但是实际上Oracle会采用另种不同的策略来生成直方图:其中一种是针对包含很少不同值的数据集;另一种是针对包含很多不同的数据集。Oracle会针对第一种情况生成频率直方图,针对第二种情况生成高度均衡直方图。通常情况下当BUCTET < 表的NUM_DISTINCT值得到的是HEIGHT BALANCED(高度平衡)直方图,而当BUCTET > 表的NUM_DISTINCT值的时候得到的是FREQUENCY(频率)直方图

四、试验证明(有直方图执行计划更加准确)

SQL> create table t_xff
  2  as select * from dba_objects;
 
Table created
 
 
SQL> create index ind_t_xff on t_xff(object_id) online nologging;
 
Index created

SQL> SELECT MAX(object_id),MIN(object_id) FROM t_xff;
 
MAX(OBJECT_ID) MIN(OBJECT_ID)
-------------- --------------
         76800              2

SQL>  UPDATE t_xff SET object_id=1000 WHERE object_id>100 AND object_id<76000;
 
72965 rows updated
 
SQL> commit;
 
Commit complete
 
SQL> 
SQL>   BEGIN
  2      DBMS_stats.gather_table_stats(cascade => TRUE,degree => 2
  3      ,estimate_percent => 100,force => TRUE,ownname => USER,tabname => 'T_XFF');
  4   END;
  5  /
 
PL/SQL procedure successfully completed

SQL> SELECT * FROM user_histograms WHERE table_name='T_XFF' AND column_name='OBJECT_ID';
TABLE COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_A
----- ---------- --------------- -------------- ----------
T_XFF OBJECT_ID            1          2
T_XFF OBJECT_ID            2          3
……
T_XFF OBJECT_ID            73205          76789
T_XFF OBJECT_ID            73206          76800

SQL>   SELECT COLUMN_NAME,HISTOGRAM FROM USER_TAB_COLS WHERE TABLE_NAME='T_XFF' AND column_name='OBJECT_ID';
 
COLUMN_NAME                    HISTOGRAM
------------------------------ ---------------
OBJECT_ID                      FREQUENCY
--在gather_table_stats方法中,默认的method_opt值为:FOR ALL COLUMNS SIZE AUTO,所以也是会收集直方图的统计信息(和oracle版本相关)
--注意:ENDPOINT_NUMBER ,ENDPOINT_VALUE 的分布情况

SQL> set autot trace exp stat
SQL> select object_name from t_xff where object_id=100;


执行计划
----------------------------------------------------------
Plan hash value: 2950241517

--------------------------------------------------------------------------------

| Id  | Operation                   | Name      | Rows  | Bytes | Cost (%CPU)| Time     |

-----------------------------------------------------------------------------------------

|   0 | SELECT STATEMENT            |           |     1 |    29 |     2   (0)| 00:00:01 |

|   1 |  TABLE ACCESS BY INDEX ROWID| T_XFF     |     1 |    29 |     2   (0)| 00:00:01 |

|*  2 |   INDEX RANGE SCAN          | IND_T_XFF |     1 |       |     1   (0)| 00:00:01 |

-----------------------------------------------------------------------------------------


Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("OBJECT_ID"=100)


统计信息
----------------------------------------------------------
          1  recursive calls
          0  db block gets
          4  consistent gets
          0  physical reads
          0  redo size
        432  bytes sent via SQL*Net to client
        416  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

SQL> select object_name from t_xff where object_id=1000;

已选择72965行。


执行计划
----------------------------------------------------------
Plan hash value: 667573674

---------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       | 72965 |  2066K|   292   (1)| 00:00:04 |
|*  1 |  TABLE ACCESS FULL| T_XFF | 72965 |  2066K|   292   (1)| 00:00:04 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("OBJECT_ID"=1000)


统计信息
----------------------------------------------------------
          1  recursive calls
          0  db block gets
       5833  consistent gets
         16  physical reads
          0  redo size
    2487154  bytes sent via SQL*Net to client
      53920  bytes received via SQL*Net from client
       4866  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
      72965  rows processed
--观察发现,因为有直方图的存在,oracle会只能的选择使用index或者全表扫描


SQL>   BEGIN
  2      DBMS_stats.gather_table_stats(cascade => TRUE,degree => 2
  3      ,estimate_percent => 100,force => TRUE,ownname => USER,tabname => 'T_XFF',method_opt => 'FOR ALL COLUMNS SIZE 1');
  4   END;
  5  /
 
PL/SQL procedure successfully completed
--删除直方图,设置method_opt:FOR ALL COLUMNS SIZE 1即可

SQL>  SELECT * FROM user_histograms WHERE table_name='T_XFF' AND column_name='OBJECT_ID';

TABLE COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_A
----- ---------- --------------- -------------- ----------
T_XFF OBJECT_ID                0              2
T_XFF OBJECT_ID                1          76800

SQL> SELECT COLUMN_NAME,HISTOGRAM FROM USER_TAB_COLS WHERE TABLE_NAME='T_XFF' AND column_name='OBJECT_ID';
 
COLUMN_NAME                    HISTOGRAM
------------------------------ ---------------
OBJECT_ID                      NONE

SQL> select object_name from t_xff where object_id=100;


执行计划
----------------------------------------------------------
Plan hash value: 2950241517

--------------------------------------------------------------------------------

| Id  | Operation                   | Name      | Rows  | Bytes | Cost (%CPU)| Time     |

--------------------------------------------------------------------------------

|   0 | SELECT STATEMENT            |           |   303 |  8787 |     7   (0)| 00:00:01 |

|   1 |  TABLE ACCESS BY INDEX ROWID| T_XFF     |   303 |  8787 |     7   (0)| 00:00:01 |

|*  2 |   INDEX RANGE SCAN          | IND_T_XFF |   303 |       |     2   (0)| 00:00:01 |

--------------------------------------------------------------------------------


Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("OBJECT_ID"=100)


统计信息
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          4  consistent gets
          0  physical reads
          0  redo size
        432  bytes sent via SQL*Net to client
        415  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

SQL> select object_name from t_xff where object_id=1000;

已选择72965行。


执行计划
----------------------------------------------------------
Plan hash value: 2950241517

--------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Rows  | Bytes | Cost (%CPU)| Time     |

--------------------------------------------------------------------------------

|   0 | SELECT STATEMENT            |           |   303 |  8787 |     7   (0)| 00:00:01 |

|   1 |  TABLE ACCESS BY INDEX ROWID| T_XFF     |   303 |  8787 |     7   (0)| 00:00:01 |

|*  2 |   INDEX RANGE SCAN          | IND_T_XFF |   303 |       |     2   (0)| 00:00:01 |

--------------------------------------------------------------------------------


Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("OBJECT_ID"=1000)


统计信息
----------------------------------------------------------
          0  recursive calls
          0  db block gets
       5833  consistent gets
          0  physical reads
          0  redo size
    2487154  bytes sent via SQL*Net to client
      53919  bytes received via SQL*Net from client
       4866  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
      72965  rows processed
--没有了直方图,oracle傻瓜的选择也使用index
--虽然两次逻辑读一样,但是全表扫描涉及到一次可以读多块,但是index扫描一次只能读一个数据块

DBMS_STATS比较复杂参数

method_opt
Accepts:
FOR ALL [INDEXED | HIDDEN] COLUMNS [size_clause]
FOR COLUMNS [size clause] column|attribute [size_clause] [,column|attribute [size_clause]…]
size_clause is defined as size_clause := SIZE {integer | REPEAT | AUTO | SKEWONLY}

– integer : Number of histogram buckets. Must be in the range [1,254].
– REPEAT : Collects histograms only on the columns that already have histograms.
– AUTO : Oracle determines the columns to collect histograms based on data distribution and the workload of the columns.
– SKEWONLY : Oracle determines the columns to collect histograms based on the data distribution of the columns.

举例说明:
method_opt => ‘FOR COLUMNS size 254 object_id’ 收集objct_id列直方图
method_opt => ‘FOR COLUMNS size 1 object_id’ 删除object_id列直方图
method_opt => ‘for all columns size repeat’ 重新分析现有直方图
method_opt => ‘for all columns size auto’ oracle决定收集哪些列的直方图(需要设置table monitoring)
method_opt => ‘for all columns size skewonly’ oracle分析所有列的分布情况,生成直方图
method_opt => ‘FOR COLUMNS object_id size SKEWONLY’ 收集object_id列分布情况,生成直方图
method_opt => ‘FOR all INDEXED COLUMNS size SKEWONLY’ 收集index列分布情况,并生成直方图

granularity
Granularity of statistics to collect (only pertinent if the table is partitioned).
‘ALL’ – gathers all (subpartition, partition, and global) statistics
‘AUTO’- determines the granularity based on the partitioning type. This is the default value.
‘DEFAULT’ – gathers global and partition-level statistics. This option is obsolete, and while currently supported, it is included in the documentation for legacy reasons only. You should use the ‘GLOBAL AND PARTITION’ for this functionality. Note that the default value is now ‘AUTO’.
‘GLOBAL’ – gathers global statistics
‘GLOBAL AND PARTITION’ – gathers the global and partition level statistics. No subpartition level statistics are gathered even if it is a composite partitioned object.
‘PARTITION ‘- gathers partition-level statistics
‘SUBPARTITION’ – gathers subpartition-level statistics.

options
Further specification of which objects to gather statistics for:
GATHER: Gathers statistics on all objects in the schema.
GATHER AUTO: Gathers all necessary statistics automatically. Oracle implicitly determines which objects need new statistics, and determines how to gather those statistics. When GATHER AUTO is specified, the only additional valid parameters are ownname, stattab, statid, objlist and statown; all other parameter settings are ignored. Returns a list of processed objects.
GATHER STALE: Gathers statistics on stale objects as determined by looking at the *_tab_modifications views. Also, return a list of objects found to be stale.
GATHER EMPTY: Gathers statistics on objects which currently have no statistics. also, return a list of objects found to have no statistics.
LIST AUTO: Returns a list of objects to be processed with GATHER AUTO.
LIST STALE: Returns list of stale objects as determined by looking at the *_tab_modifications views.
LIST EMPTY: Returns list of objects which currently have no statistics.