通过awr指标评估会话建立是否频繁

有朋友问我,通过awr怎么来判断系统这个时间段的会话建立情况,也就是说如果中间件或者客户端程序发生异常,过多的连接数据库.
我这里有个例子,用户的数据库大概在每秒钟建立10个连接左右(相对而言比较频繁)

[oracle@xifenfei tmp]$ grep "18-JUL-2012 18:40:03" listener.log |wc -l
9
[oracle@xifenfei tmp]$ grep "18-JUL-2012 18:40:04" listener.log |wc -l
7
[oracle@xifenfei tmp]$ grep "18-JUL-2012 18:41:04" listener.log |wc -l
12
[oracle@xifenfei tmp]$ grep "18-JUL-2012 18:40" listener.log |wc -l
554
[oracle@xifenfei tmp]$ grep "18-JUL-2012 18:50" listener.log |wc -l
598
[oracle@xifenfei tmp]$ grep "18-JUL-2012 19:00" listener.log |wc -l
597

我们分析这个时间段的awr,看那些地方是可以表明用户会话建立频繁
awr汇总信息
通过这个信息我们可以发现awr报告时间段(120分钟),和数据库在起点和终点的会话数
说明:该数据库问题很多,出现负载高,不全是会话建立频繁导致,这里只分析建立会话相关情况

Load Profile信息
通过这里的Logons为36对于这样的系统来说,明显异常

Top 5 Timed Events
这里的latch: session allocation等待就是比较明显的建立会话时候出现的等待

Time Model Statistics
Time Model Statistics中的connection management call elapsed time大家都明白的,建立会话花费时间

Dictionary Cache Stats
dc_usernames和dc_users请求值偏大

总结说明
在实际工作中:遇到过因为session建立太频繁导致监听繁忙,tnsping延迟比较严重案例,也遇到因为会话建立频繁导致系统内存被消耗完的案例.
在遇到会话建立过于频繁的案例,最有力的说明证据是监听日志,因为awr中的相关数据没有绝对标准(而且awr本身也是一个相对性的东西),而且一般客户对awr中我刚刚列举的数据概念性不强,所以一般只能作为分析的辅助工具,或者为进一步分析监听日志提供理由依据.

CURSOR_SHARING=SIMILAR引起的悲剧

一个客户反馈说有一系统经常性负载比较高,让我帮忙分析原因
系统负载情况

[oracle@zwq-kfdialdb ~]$ top -c
top - 17:11:06 up 78 days,  1:12,  5 users,  load average: 124.83, 125.90, 112.13
Tasks: 836 total, 152 running, 684 sleeping,   0 stopped,   0 zombie
Cpu(s): 98.1%us,  0.1%sy,  0.0%ni,  1.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132070908k total, 90494280k used, 41576628k free,  1147384k buffers
Swap: 67108856k total,        0k used, 67108856k free, 79109904k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                
12179 oracle    15   0 32.5g 142m 139m R 39.0  0.1   6:09.20 oracleahdial (LOCAL=NO)
11530 oracle    16   0 32.5g 469m 466m R 36.1  0.4  11:14.23 oracleahdial (LOCAL=NO)
11816 oracle    15   0 32.5g 467m 463m R 36.1  0.4   6:33.86 oracleahdial (LOCAL=NO)
11577 oracle    15   0 32.5g 480m 477m R 34.7  0.4   7:15.98 oracleahdial (LOCAL=NO)
12136 oracle    16   0 32.5g 455m 452m R 31.9  0.4   9:07.88 oracleahdial (LOCAL=NO)
11237 oracle    16   0 32.5g 997m 992m R 31.2  0.8  20:53.50 oracleahdial (LOCAL=NO)
11427 oracle    16   0 32.5g 137m 135m R 31.2  0.1  11:50.16 oracleahdial (LOCAL=NO)
12051 oracle    16   0 32.5g 459m 456m R 31.2  0.4   6:12.67 oracleahdial (LOCAL=NO)

[oracle@zwq-kfdialdb ~]$ vmstat 3 10
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
137  0      0 41566600 1147384 79109888    0    0     0     3    0    0  0  0 99  0  0
146  0      0 41567548 1147384 79109888    0    0     0    47 1058 32190 98  0  2  0  0
150  0      0 41568056 1147384 79109888    0    0     0    39 1081 31543 98  0  2  0  0
12  0      0 41568464 1147384 79109888    0    0     0    23 1056 32111 98  0  2  0  0
62  0      0 41568776 1147384 79109888    0    0     0    11 1067 31474 98  0  2  0  0
108  0      0 41568304 1147384 79109888    0    0     0    39 1059 31193 98  0  2  0  0
140  0      0 41569280 1147384 79109888    0    0     0    48 1063 31171 98  0  2  0  0
140  0      0 41569444 1147384 79109888    0    0     0    40 1075 30508 98  0  2  0  0

通过top和vmstat看出系统现在负载很高,主要都是用户进程导致.

查询等待事件

SQL> select event from v$session where wait_class#<>6;

EVENT
----------------------------------------------------------------
cursor: mutex S

SQL> /

EVENT
----------------------------------------------------------------
cursor: mutex S

SQL> /

EVENT
----------------------------------------------------------------
cursor: mutex S

SQL> /

EVENT
----------------------------------------------------------------
cursor: mutex S

SQL> SELECT a.*, s.sql_text
  2    FROM v$sql s,
  3         (SELECT sid,
  4                 event,
  5                 wait_class,
  6                 p1 cursor_hash_value,
  7                 p2raw Mutex_value,
  8                 TO_NUMBER (SUBSTR (p2raw, 1, 8), 'xxxxxxxx') hold_mutex_x_sid
  9            FROM v$session_wait
 10           WHERE event LIKE 'cursor%') a
 11   WHERE s.HASH_VALUE = a.cursor_hash_value
 12  /

no rows selected

SQL>  select event from v$session where wait_class#<>6;

EVENT
----------------------------------------------------------------
SQL*Net message to client

数据库开始的等待事件只有cursor: mutex S,等该等待事件消失后系统负载也恢复正常

再次查看系统负载

[oracle@zwq-kfdialdb ~]$ top -c -i10
top - 17:13:51 up 78 days,  1:15,  6 users,  load average: 12.57, 78.21, 96.45
Tasks: 702 total,   2 running, 700 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.5%us,  0.2%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132070908k total, 86477808k used, 45593100k free,  1147500k buffers
Swap: 67108856k total,        0k used, 67108856k free, 79116036k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND        
15045 oracle    16   0 13136 1476  724 R 10.6  0.0   0:00.12 top -c -i10    
10010 root      10  -5     0    0    0 D  0.0  0.0   0:00.68 [kondemand/4]  
10019 root      10  -5     0    0    0 D  0.0  0.0   1:41.58 [kondemand/13] 
10020 root      10  -5     0    0    0 D  0.0  0.0   1:52.28 [kondemand/14] 
10021 root      10  -5     0    0    0 R  0.0  0.0   2:01.54 [kondemand/15] 
12166 root      24   0 10084  300  216 D  0.0  0.0   0:00.00 /opt/VRTSgab/gablogd

[oracle@zwq-kfdialdb ~]$  vmstat 3 10
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 45484264 1147528 79117360    0    0     0     3    0    0  0  0 99  0  0
 2  0      0 45468452 1147528 79117456    0    0     4   161 1372 5369  2  2 97  0  0
 3  0      0 45463712 1147528 79117584    0    0     4   187 1602 7253  6  0 93  0  0
 1  0      0 45458220 1147528 79117648    0    0     1    99 1358 5821  2  0 98  0  0
 0  0      0 45475168 1147528 79117712    0    0     0    41 1321 5321  2  0 98  0  0
 0  0      0 45473624 1147528 79117744    0    0     3   104 1378 5455  2  0 98  0  0
 2  0      0 45474656 1147528 79117776    0    0     0    55 1196 4872  1  0 99  0  0
 0  0      0 45474376 1147532 79117824    0    0     8   113 1170 4990  2  0 98  0  0
 1  0      0 45475440 1147532 79117872    0    0     1    56 1187 5301  3  0 97  0  0
 1  0      0 45475824 1147532 79117888    0    0     3    99 1083 4643  3  0 97  0  0

结合上面的等待事件查询,我们可以大概评估出来,当cursor: mutex S等待消失后,系统负载也恢复正常,现在已经不存在环境,如果要找出问题只能够是借助AWR和ASH

分析ASH
Top User Events

Top SQL with Top Events

Activity Over Time

通过对ASH分析,可以大概确定,在这段时间内,引起系统负载高主要是cursor: mutex S导致

分析AWR
awr整体信息(从这里看数据库是相当的繁忙)

Load Profile(从这里看数据库业务比较小)

Top 5 Timed Foreground Events(主要等待事件cursor: mutex S,和前面分析相符)

OS LOAD(虽然和系统看到有一定出入,但是整体还是展示系统负载较高)

SQL ordered by Version Count(出现cursor: mutex S,因为load profile中解析不多,所以想到高版本问题,这里确实非常高)

通过这里的一些列分析,我们已经基本上可以确定,该数据库因为高版本问题导致cursor: mutex S以及library cache 相关等待严重,从而出现系统负载过高.

找出高版本原因
高版本相关信息和查询请见:关于High Versions Count总结

SQL> select * from table(version_rpt('f8b9tba7sfsb5')); 

COLUMN_VALUE
--------------------------------------------------------------------------------
Version Count Report Version 3.2.1 -- Today's Date 19-jul-12 18:13
RDBMS Version :11.2.0.1.0 Host: zwq-kfdialdb Instance 1 : ahdial
==================================================================
Addr: 000000080FA4CEA0  Hash_Value: 2408014181  SQL_ID f8b9tba7sfsb5
Sharable_Mem: 206315729 bytes   Parses: 48689
Stmt:
0 select count(*) as col_0_0_ from TBL_SP_SALES_RECORDS tblspsales
1 0_ where tblspsales0_.SALES_RECORDS_STATUS=:"SYS_B_0" and tblsps
2 ales0_.MOBILE_TELE_NO=:"SYS_B_1"
3


COLUMN_VALUE
--------------------------------------------------------------------------------
Versions Summary
----------------
AUTH_CHECK_MISMATCH :4
TRANSLATION_MISMATCH :4
ROLL_INVALID_MISMATCH :10219
PURGED_CURSOR :9

Total Versions:10219

Plan Hash Value Summary
-----------------------

COLUMN_VALUE
--------------------------------------------------------------------------------
Plan Hash Value Count
=============== =====
      791727930 920
     2820478500 9300

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Details for AUTH_CHECK_MISMATCH :

  # of Ver PARSING_USER_ID PARSING_SCHEMA_ID PARSING_SCHEMA_NAME
========== =============== ================= ===================
     10218               75                75 HOLLYSP

COLUMN_VALUE
--------------------------------------------------------------------------------
         2              107               107 HOLLYSP_TEST
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Details for TRANSLATION_MISMATCH :

Summary of objects probably causing TRANSLATION_MISMATCH

  Object# Owner.Object_Name
========= =================
    76737 HOLLYSP.TBL_SP_SALES_RECORDS
 HOLLYSP.TBL_SP_SALES_RECORDS
 HOLLYSP_TEST.TBL_SP_SALES_RECORDS

COLUMN_VALUE
--------------------------------------------------------------------------------
   107043 HOLLYSP_TEST.TBL_SP_SALES_RECORDS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Details for ROLL_INVALID_MISMATCH :

No details available
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Details for PURGED_CURSOR :

No details available
####
To further debug Ask Oracle Support for the appropiate level LLL.

COLUMN_VALUE
--------------------------------------------------------------------------------
alter session set events
 'immediate trace name cursortrace address 2408014181, level LLL';
To turn it off do use address 1, level 2147483648
================================================================

59 rows selected.

SQL> select * from table(version_rpt('6zhjf3qh8gyp9'));   

COLUMN_VALUE
--------------------------------------------------------------------------------
Version Count Report Version 3.2.1 -- Today's Date 19-jul-12 18:15
RDBMS Version :11.2.0.1.0 Host: zwq-kfdialdb Instance 1 : ahdial
==================================================================
Addr: 00000007FD46D2A8  Hash_Value: 2693266089  SQL_ID 6zhjf3qh8gyp9
Sharable_Mem: 111904227 bytes   Parses: 4880
Stmt:
0 select count(*) as col_0_0_ from TBL_SP_SALES_RECORDS tblspsales
1 0_ where tblspsales0_.SALES_RECORDS_STATUS=:"SYS_B_0" and tblsps
2 ales0_.CAMPAIGN_ID=:"SYS_B_1" and tblspsales0_.IS_CONN=:"SYS_B_2
3 " and tblspsales0_.SALES_TIME>=:"SYS_B_3" and tblspsales0_.SALES
4 _PERSON=:"SYS_B_4"

COLUMN_VALUE
--------------------------------------------------------------------------------
5

Versions Summary
----------------
BIND_MISMATCH :4804
INCOMP_LTRL_MISMATCH :372
HASH_MATCH_FAILED :4936

Total Versions:4935

Plan Hash Value Summary

COLUMN_VALUE
--------------------------------------------------------------------------------
-----------------------
Plan Hash Value Count
=============== =====
     1645985080 3
     2040125427 4933

~~~~~~~~~~~~~~~~~~~~~~~~~~~
Details for BIND_MISMATCH :

Consolidated details for :
BIND_MISMATCH,USER_BIND_PEEK_MISMATCH,BIND_UACS_DIFF and

COLUMN_VALUE
--------------------------------------------------------------------------------
BIND_EQUIV_FAILURE (Mislabled as ROW_LEVEL_SEC_MISMATCH BY bug 6964441 in 11gR1)

from v$sql_bind_capture
COUNT(*) POSITION MIN(MAX_LENGTH) MAX(MAX_LENGTH) DATATYPE (PRECISION,SCALE)
======== ======== =============== =============== ======== ================
    4936        1              32              32        1 (,)
    4936        2              32              32        1 (,)
    4936        3              32              32        1 (,)
    4936        4              32              32        1 (,)
    4936        5              32              32        1 (,)


COLUMN_VALUE
--------------------------------------------------------------------------------
SUM(DECODE(column,Y, 1, 0) FROM V$SQL
IS_OBSOLETE IS_BIND_SENSITIVE IS_BIND_AWARE IS_SHAREABLE
=========== ================= ============= ============
          0              4309             0         4309
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Details for INCOMP_LTRL_MISMATCH :

No details available
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Details for HASH_MATCH_FAILED :


COLUMN_VALUE
--------------------------------------------------------------------------------
No details available
####
To further debug Ask Oracle Support for the appropiate level LLL.
alter session set events
 'immediate trace name cursortrace address 2693266089, level LLL';
To turn it off do use address 1, level 2147483648
================================================================

62 rows selected.

根据经验,出现这么多的高版本情况,很可能是cursor_sharing参数设置问题

--对应sql语句
select count(*) as col_0_0_ from TBL_SP_SALES_RECORDS tblspsales0_ 
where tblspsales0_.SALES_RECORDS_STATUS=:"SYS_B_0" and 
tblspsales0_.CAMPAIGN_ID=:"SYS_B_1" and tblspsales0_.IS_CONN=:"SYS_B_2" 
and tblspsales0_.SALES_TIME>=:"SYS_B_3" and tblspsales0_.SALES_PERSON=:"SYS_B_4"

--cursor_sharing参数
SQL> show parameter cursor_sharing;

NAME                                 TYPE                             VALUE
------------------------------------ -------------------------------- ----------------
cursor_sharing                       string                           SIMILAR

根据oracle官方建议在11g中不推荐使用cursor_sharing=SIMILAR,其实在所有版本中都不推荐,设置为该值很容易导致高版本问题.而且该值会出现莫名其妙的,无法解释的高版本问题.而且根据oracle相关文档,在即将发布的12c版本中,将除掉SIMILAR值.对于客户库的该问题,因为很多sql未绑定参数,为了减少硬解析,建议在业务低谷时设置cursor_sharing=FORCE,并刷新shared pool.

未收集统计信息对象—执行sql动态采样

在一次ORA-7445导致oracle数据库down掉故障分析中,发现一条类似的sql非常大(通过复制到文档确定该sql大小是5M左右)

SELECT /* OPT_DYN_SAMP */ /*+ ALL_ROWS IGNORE_WHERE_CLAUSE NO_PARALLEL(SAMPLESUB) 
opt_param('parallel_execution_enabled', 'false') NO_PARALLEL_INDEX(SAMPLESUB) NO_SQL_TUNE 
*/ NVL(SUM(C1),0), NVL(SUM(C2),0) FROM (SELECT /*+ IGNORE_WHERE_CLAUSE NO_PARALLEL("DCREDITMSG_00")
 FULL("DCREDITMSG_00") NO_PARALLEL_INDEX("DCREDITMSG_00") */ 1 AS C1, CASE WHEN
 "DCREDITMSG_00"."PHONE_NO"='具体电话号码' OR "DCREDITMSG_00"."PHONE_NO"='具体电话号码' OR 
……………………N多OR "DCREDITMSG_00"."PHONE_NO"='具体电话号码'
"DCREDITMSG_00"."PHONE_NO"='具体电话号码' THEN 1 ELSE 0 END AS C2 FROM 
"BSSADMIN"."DCREDITMSG_00" SAMPLE BLOCK (0.032410 , 1) SEED (1) "DCREDITMSG_00") SAMPLESUB

当时该sql因某种原因导致大量的sql area中很多内存泄露,最终导致数据库down掉.通过实验找出类此奇怪SQL.

创建模拟表

SQL> create table t_xifenfei 
  2  as
  3  select  * from dba_objects;

Table created.

SQL> select count(*) from t_xifenfei;

  COUNT(*)
----------
     74605

SQL> select NUM_ROWS,LAST_ANALYZED from dba_tables 
  2  where  table_name='T_XIFENFEI' and owner='CHF';

  NUM_ROWS LAST_ANALYZE
---------- ------------

得出信息:
1.该表一共有记录数74605条
2.该表未收集统计信息

查看执行计划

SQL>  set autotrace trace exp
SQL> select /*+ dynamic_sampling(t 0) */ * from t_xifenfei t;

Execution Plan
----------------------------------------------------------
Plan hash value: 548923532

--------------------------------------------------------------------------------
| Id  | Operation         | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |            | 88868 |    17M|   299   (2)| 00:00:04 |
|   1 |  TABLE ACCESS FULL| T_XIFENFEI | 88868 |    17M|   299   (2)| 00:00:04 |
--------------------------------------------------------------------------------
--通过hint指定动态采样sql相关对象统计信息,
可以看到我们实际的表记录是74605而数据库采样出来的记录为88868,原则上还是可以接受

SQL> select * from t_xifenfei;

Execution Plan
----------------------------------------------------------
Plan hash value: 548923532

--------------------------------------------------------------------------------
| Id  | Operation         | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |            | 73449 |    14M|   298   (1)| 00:00:04 |
|   1 |  TABLE ACCESS FULL| T_XIFENFEI | 73449 |    14M|   298   (1)| 00:00:04 |
--------------------------------------------------------------------------------

Note
-----
   - dynamic sampling used for this statement (level=2)
--因为没有统计信息,数据库动态采样sql相关对象统计信息
可以看到我们实际的表记录是74605而数据库采样出来的记录为73449,比手工指定采样准确

对自动采样进行10046跟踪

SQL> conn / as sysdba
Connected.
SQL> oradebug  setmypid
Statement processed.
SQL> oradebug EVENT 10046 TRACE NAME CONTEXT FOREVER, LEVEL 12
Statement processed.
SQL> select count(*) from CHF.t_xifenfei;

  COUNT(*)
----------
     74605

SQL>  oradebug TRACEFILE_NAME
/u01/oracle/diag/rdbms/ora11g/ora11g/trace/ora11g_ora_27967.trc

分析trace文件

*** 2012-07-12 15:42:34.991
WAIT #0:nam='SQL*Net message from client'ela= 56716427 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=1342078954991525
=====================
PARSING IN CURSOR #3063864268 len=404 dep=1 uid=0 oct=3 lid=0 tim=1342078955037387 
hv=4184780033 ad='385d3708' sqlid='3gjvvxzwqxb81'
SELECT /* OPT_DYN_SAMP */ /*+ ALL_ROWS IGNORE_WHERE_CLAUSE NO_PARALLEL(SAMPLESUB) 
opt_param('parallel_execution_enabled', 'false') 
NO_PARALLEL_INDEX(SAMPLESUB) NO_SQL_TUNE */ NVL(SUM(C1),0), NVL(SUM(C2),0) FROM 
(SELECT /*+ NO_PARALLEL("T_XIFENFEI") FULL("T_XIFENFEI") NO_PARALLEL_INDEX("T_XIFENFEI") */ 1 AS C1, 
1 AS C2 FROM "CHF"."T_XIFENFEI" SAMPLE BLOCK (5.790441 , 1) SEED (1) "T_XIFENFEI") SAMPLESUB
END OF STMT
PARSE #3063864268:c=6000,e=5404,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=1,plh=205916192,tim=1342078955037303
EXEC #3063864268:c=0,e=206,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=1,plh=205916192,tim=1342078955037901
FETCH #3063864268:c=4998,e=4759,p=0,cr=65,cu=0,mis=0,r=1,dep=1,og=1,plh=205916192,tim=1342078955042730
STAT #3063864268 id=1 cnt=1 pid=0 pos=1 obj=0 op='SORT AGGREGATE (cr=65 pr=0 pw=0 time=4795 us)'
STAT #3063864268 id=2 cnt=4253 pid=1 pos=1 obj=76370 op='TABLE ACCESS SAMPLE T_XIFENFEI 
(cr=65 pr=0 pw=0 time=8247 us cost=19 size=61752 card=5146)'
CLOSE #3063864268:c=0,e=7,dep=1,type=0,tim=1342078955043024
=====================
PARSING IN CURSOR #3063864784 len=35 dep=0 uid=0 oct=3 lid=0 tim=1342078955043465 hv=2174183953 ad='3ed2d700' 
sqlid='fadutqq0tfuhj'
select count(*) from CHF.t_xifenfei
END OF STMT
PARSE #3063864784:c=51991,e=51648,p=0,cr=66,cu=0,mis=1,r=0,dep=0,og=1,plh=2715729601,tim=1342078955043464
EXEC #3063864784:c=0,e=75,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=2715729601,tim=1342078955043645
WAIT #3063864784: nam='SQL*Net message to client' ela= 3 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=1342078955043705

通过这个trace的分析,果然发现在执行我们需要的sql前,执行了SELECT /* OPT_DYN_SAMP */组成的一个复杂的采样sql语句.

收集统计信息查看执行计划

SQL> EXEC DBMS_STATS.gather_table_stats('CHF','T_XIFENFEI');

PL/SQL procedure successfully completed.

SQL> select NUM_ROWS,LAST_ANALYZED from dba_tables 
  2  where  table_name='T_XIFENFEI' and owner='CHF';

  NUM_ROWS LAST_ANALYZE
---------- ------------
     74605 12-JUL-12

SQL>  set autotrace trace exp
SQL> select * from t_xifenfei;

Execution Plan
----------------------------------------------------------
Plan hash value: 548923532

--------------------------------------------------------------------------------
| Id  | Operation         | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |            | 74605 |  7139K|   298   (1)| 00:00:04 |
|   1 |  TABLE ACCESS FULL| T_XIFENFEI | 74605 |  7139K|   298   (1)| 00:00:04 |
--------------------------------------------------------------------------------
--执行计划未提示dynamic sampling

继续做10046

SQL> conn / as sysdba
Connected.
SQL> oradebug  setmypid
Statement processed.
SQL> oradebug EVENT 10046 TRACE NAME CONTEXT FOREVER, LEVEL 12
Statement processed.
SQL> select count(*) from CHF.t_xifenfei;

  COUNT(*)
----------
     74605

SQL> oradebug TRACEFILE_NAME
/u01/oracle/diag/rdbms/ora11g/ora11g/trace/ora11g_ora_29780.trc

分析trace文件

*** 2012-07-12 16:14:53.914
Oradebug command 'EVENT 10046 TRACE NAME CONTEXT FOREVER, LEVEL 12' console output: <none>
WAIT #0: nam='SQL*Net message to client' ela= 8 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=1342080893914307

*** 2012-07-12 16:14:59.376
WAIT #0: nam='SQL*Net message from client' ela= 5461608 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=1342080899376008
=====================
PARSING IN CURSOR #3063709248 len=35 dep=0 uid=0 oct=3 lid=0 tim=1342080899379562 hv=2174183953 
ad='3ed2d700' sqlid='fadutqq0tfuhj'
select count(*) from CHF.t_xifenfei
END OF STMT

通过这里可以发现,当有了统计信息后,数据库不能再使用那条N多hint的sql去动态采样统计信息.

总结建议
动态采样(Dynamic Sampling)技术的最初提出是在Oracle 9i R2,在段(表,索引,分区)没有分析的情况下,为了使CBO 优化器得到足够的信息以保证做出正确的执行计划而发明的一种技术,可以把它看做分析手段的一种补充。当段对象没有统计信息时(即没有做分析),动态采样技术可以通过直接从需要分析的对象上收集数据块(采样)来获得CBO需要的统计信息。为了cbo,oracle引进了该功能,原则上说是一个很不错的东西,但是偶尔也是会出现一些意外,所以如果发现数据库中有表未做统计分析,建议手工处理下,ORACLE的自动收集统计信息程序也有不靠谱的时候(发现多次10g的库中有部分表未收集统计信息)

11g DirectPath Reads 噩梦案例

DirectPath Reads 说明
在oracle 11g以前的版本中,如果对大表进行全表扫描,wait event是:db file scattered read;在11g中,如果对大表进行全表扫描,wait event是:direct path read。在11g中,大表全表扫描时数据块不经过sga而直接进pga,这样会造成每次进行大表全表扫描,物理读都是很大,而在10g中,由于全表扫描的数据块在sga中已经存在,所以执行全表扫描时,它的物理读为0。但是这里主要是oracle在优化策略上的进步,即假定大表频繁全表扫描这种现象,在生产库上不会太多,通过把数据直接读入pga,进而减少了cache buffer的繁忙交换程度,提高了cache buffer的使用效率.

DirectPath Reads 优势
1. 减少了对栓的使用,避免可能的栓争用
2. 物理IO的大小不再取决于buffer_cache中所存在的块;试想某个8个块的extent中1,3,5,7号块在高速缓存中,而2,4,6,8块没有被缓存,传统的方式在读取该extent时将会是对2,4,6,8块进行4次db file sequential read,这是一种十分可怕的状况,其效率往往要比单次读取这个区间的所有8个块还要低得多,虽然Oracle为了避免这种情况总是尽可能的不缓存大表的块(读入后总是放在队列最冷的一端);而direct path read则可以完全避免这类问题,尽可能地单次读入更多的物理块。

DirectPath Reads 噩梦
这一切听起来都很美好,但是在大并发的OLTP系统中,这东西简直是一个噩梦.通过一个awr来说明该问题:这个是一个系统的awr报告,朋友反馈说系统有段运行缓慢,请求帮忙找出原因
分析总体信息

系统这段时间会话临时大幅度增加(从102增加到223),系统出现异常繁忙(60.62*16=969.92<2,454.52)
分析Load Profile信息

通过这个截图发现系统的业务不是很大,但是Physical reads参数异常
1.物理读大小:25071.1*8192/1024/1024=195.86796875M/S
2.物理读将近逻辑读一半,这个在一般系统中很难得到这个比例,进一步说明物理读过高

分析Top 5信息

这里可以发现direct path read等待很多

分析Host CPU

可以发现iowait很大占40.5%,io等待异常高(195M/S能不高吗?)
补充说明:在这里我们看到的%Idle=1-%System-%User不包括%WIO

处理建议
通过上面的评估,可以确定大部分是由于 导致了数据库的物理读过高,从而使得系统反应变慢,处理方法就是关闭掉11g该新特性
alter system set event= ‘10949 trace name context forever, level 1’ scope=spfile;
重启数据库

分析一例 TX Enqueue contention案例

应用反馈某个业务比较慢,需要紧急处理
查询等待事件

SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle9i Enterprise Edition Release 9.2.0.8.0 - 64bit Production
PL/SQL Release 9.2.0.8.0 - Production
CORE    9.2.0.8.0       Production
TNS for IBM/AIX RISC System/6000: Version 9.2.0.8.0 - Production
NLSRTL Version 9.2.0.8.0 - Production

SQL> select A.INST_ID,count(*), event
  2    from Gv$session_wait a
  3  where event not in ('SQL*Net more data to client',
  4  'rdbms ipc message',
  5               'smon timer',
  6               'pmon timer',
  7               'SQL*Net message from client',
  8               'lock manager wait for remote message',
  9               'ges remote message',
 10               'gcs remote message',
 11               'gcs for action',
 12               'client message',
 13               'pipe get',
 14               'null event',
 15               'PX Idle Wait',
 16               'single-task message',
 17               'PX Deq: Execution Msg',
 18               'KXFQ: kxfqdeq - normal deqeue',
 19               'listen endpoint status',
 20               'slave wait',
 21               'wakeup time manager','jobq slave wait')
 22  group by INST_ID,event
 23  order by 1 desc,2 desc;

   INST_ID   COUNT(*) EVENT
---------- ---------- --------------------------------------
         2          8 enqueue
         2          1 async disk IO
         2          1 db file sequential read
         2          1 SQL*Net message to client
         2          1 PX Deq: reap credit
         1          2 global cache cr request
         1          1 async disk IO
         1          1 PX Deq: reap credit
         1          1 PX Deq: Execute Reply

9 rows selected.

发现enqueue等待有些多,怀疑是TX enquenue,查询阻塞者

SQL> set linesize 100
SQL> set pagesize 66
SQL> col c1 for a15
SQL> col c1 heading "Program Name "
SQL> select l.inst_id,l.SID,program c1,l.TYPE,l.ID1,l.ID2,l.LMODE,l.REQUEST
  2  from gv$lock l,gv$session s
  3  where l.type like 'TX' and l.REQUEST =6
  4  and l.inst_id=s.inst_id and l.sid=s.sid
  5  order by id1
  6  / 

   INST_ID        SID Program Name    TY        ID1        ID2      LMODE    REQUEST
---------- ---------- --------------- -- ---------- ---------- ---------- ----------
         2        295 rtStopMain@zwq_ TX    1441805    2391806          0          6
                      bill_2 (TNS V1-
                      V3)

         2        992 rtStopMain@zwq_ TX    1441805    2391806          0          6
                      bill_2 (TNS V1-
                      V3)

         2       1238 rtStopMain@zwq_ TX    6946827    2546365          0          6
                      bill_2 (TNS V1-
                      V3)

         2       1298 rtStopMain@zwq_ TX    6946827    2546365          0          6
                      bill_2 (TNS V1-
                      V3)

         2       1684 rtStopMain@zwq_ TX    6946827    2546365          0          6
                      bill_2 (TNS V1-
                      V3)

         2       1553 rtStopMain@zwq_ TX    6946827    2546365          0          6
                      bill_2 (TNS V1-
                      V3)

         2         75 rtStopMain@zwq_ TX   12451856     199146          0          6
                      bill_2 (TNS V1-
                      V3)

         2       1125 rtStopMain@zwq_ TX   14352404      63837          0          6
                      bill_2 (TNS V1-
                      V3)

查询持有者

SQL> set linesize 100
SQL> set pagesize 66
SQL> col c1 for a15
SQL> col c1 heading "Program Name "
SQL> select l.inst_id,l.SID,program c1,l.TYPE,l.ID1,l.ID2,l.LMODE,l.REQUEST
  2  from gv$lock l,gv$session s
  3  where l.type like 'TX' and l.LMODE =6 and (l.ID1,l.ID2) in
  4  (select id1,id2 from gv$lock where type like 'TX' and REQUEST =6)
  5  and l.inst_id=s.inst_id and l.sid=s.sid
  6  order by id1
  7  / 

   INST_ID        SID Program Name    TY        ID1        ID2      LMODE    REQUEST
---------- ---------- --------------- -- ---------- ---------- ---------- ----------
         2         75 rtStopMain@zwq_ TX    1441805    2391806          6          0
                      bill_2 (TNS V1-
                      V3)

         2        992 rtStopMain@zwq_ TX    6946827    2546365          6          0
                      bill_2 (TNS V1-
                      V3)

         2        295 rtStopMain@zwq_ TX   12451856     199146          6          0
                      bill_2 (TNS V1-
                      V3)

         2       1553 rtStopMain@zwq_ TX   14352404      63837          6          0
                      bill_2 (TNS V1-
                      V3)

通过持有者和阻塞者可以得出:
1.持有者和阻塞者都是在2号实例上
2.持有者75阻塞了295/992的会话
3.持有者992阻塞了1238/1298/1684/1553的会话
4.持有者295阻塞了75的会话
5.持有者1553阻塞了1125的会话
6.同时分析发现,所有的持有者sid也在阻塞者中,也就是持有者阻塞了某个sid,而自身又被其他sid给阻塞,形成了多级阻塞或者环.如:75阻塞了295,而295有阻塞了75;992阻塞了1553,而1553阻塞了1125

查询阻塞和持有者对象

SQL> set linesize 110
SQL> col c0 for 999
SQL> col c0 heading "INS"
SQL> col c1 for a15
SQL> col c1 heading "Program Name "
SQL> select inst_id c0,sid,program c1,ROW_WAIT_OBJ# object_no, ROW_WAIT_FILE# Rfile_no,
  2  ROW_WAIT_BLOCK# Block_no ,ROW_WAIT_ROW# Row_no
  3  from gv$session
  4  where (inst_id,sid) in (select inst_id,sid from gv$session_wait where p1='1415053318')
  5  / 

 INS        SID Program Name     OBJECT_NO   RFILE_NO   BLOCK_NO     ROW_NO
---- ---------- --------------- ---------- ---------- ---------- ----------
   2         75 rtStopMain@zwq_    1323132         13     122601        111
                bill_2 (TNS V1-
                V3)

   2        295 rtStopMain@zwq_    1323132         13     122601        100
                bill_2 (TNS V1-
                V3)

   2        992 rtStopMain@zwq_    1323132         13     122601        101
                bill_2 (TNS V1-
                V3)

   2       1125 rtStopMain@zwq_    1323132         84      38445         70
                bill_2 (TNS V1-
                V3)

   2       1238 rtStopMain@zwq_    1323132         15     255066         41
                bill_2 (TNS V1-
                V3)

   2       1298 rtStopMain@zwq_    1323132         14     118411          8
                bill_2 (TNS V1-
                V3)

   2       1553 rtStopMain@zwq_    1323132         15     255066         19
                bill_2 (TNS V1-
                V3)

   2       1684 rtStopMain@zwq_    1323132         14     118411         21
                bill_2 (TNS V1-
                V3)

8 rows selected.

SQL> set linesize 100
SQL> set pagesize 100
SQL> col owner for a10
SQL> col object_name for a20
SQL> col object_type for a10
SQL> select owner,object_name,object_id,object_type
  2  from dba_objects
  3  where
  4  object_id in (select ROW_WAIT_OBJ# from gv$session
  5  where (inst_id, sid) in (select inst_id,sid from gv$session_wait where p1='1415053318'))
  6  / 
 

OWNER      OBJECT_NAME           OBJECT_ID OBJECT_TYP
---------- -------------------- ---------- ----------
DBACCADM   DCUSTCREDITBALANCE      1323132 TABLE

通过查询的出来,所有操作的聚焦点都是在DBACCADM.DCUSTCREDITBALANCE表上面

查询相关sql语句

SQL> SQL> set linesize 120
SQL> set pagesize 66
SQL> col c0 for 999
SQL> col c0 heading "INS"
SQL> col c1 for a9
SQL> col c1 heading "OS User"
SQL> col c2 for a9
SQL> col c2 heading "Oracle User"
SQL> col c3 for a15
SQL> col c3 heading "Program Name"
SQL> col b1 for a9
SQL> col b1 heading "Unix PID"
SQL> col b2 for 9999 justify left
SQL> col b2 heading "ORA SID"
SQL> col b3 for 999999 justify left
SQL> col b3 heading "SERIAL#"
SQL> col sql_text for a45
SQL> set space 1
SQL> break on b1 nodup on c0 nodup on c3 nodup on c1 nodup on c2 nodup on b2 nodup on b3 skip 2
SQL> select a.inst_id c0,b.sid b2,c.spid b1, b.program c3, b.username c2,b.serial# b3, a.sql_text
  2    from gv$sql a, gv$session b, gv$process c
  3   where
  4     a.address = b.sql_address
  5     and b.paddr = c.addr
  6     and a.hash_value = b.sql_hash_value
  7     and a.inst_id=b.inst_id and a.inst_id=c.inst_id
  8     and a.inst_id like '&inst_id' and b.sid like '&sid'
  9   order by c.spid,a.hash_value
 10  / 
Enter value for inst_id: 2
Enter value for sid: 75
old   8:    and a.inst_id like '&inst_id' and b.sid like '&sid'
new   8:    and a.inst_id like '2' and b.sid like '75'

 INS ORA SID Unix PID  Program Name    Oracle Us SERIAL# SQL_TEXT
---- ------- --------- --------------- --------- ------- ---------------------------------------------
   2      75 1167392   rtStopMain@zwq_ DBCUSTOPR   42815 update dcustcreditbalance  set limit_owe=:b0,
                       bill_2 (TNS V1-                   unbill_fee=:b1,prepay_fee=:b2,owe_fee=:b3,op_
                       V3)                               time=sysdate where id_no=:b4




SQL> /
Enter value for inst_id: 2
Enter value for sid: 992
old   8:    and a.inst_id like '&inst_id' and b.sid like '&sid'
new   8:    and a.inst_id like '2' and b.sid like '992'

 INS ORA SID Unix PID  Program Name    Oracle Us SERIAL# SQL_TEXT
---- ------- --------- --------------- --------- ------- ---------------------------------------------
   2     992 2760870   rtStopMain@zwq_ DBCUSTOPR   56282 update dcustcreditbalance  set limit_owe=:b0,
                       bill_2 (TNS V1-                   unbill_fee=:b1,prepay_fee=:b2,owe_fee=:b3,op_
                       V3)                               time=sysdate where id_no=:b4




SQL> /
Enter value for inst_id: 2
Enter value for sid: 295
old   8:    and a.inst_id like '&inst_id' and b.sid like '&sid'
new   8:    and a.inst_id like '2' and b.sid like '295'

 INS ORA SID Unix PID  Program Name    Oracle Us SERIAL# SQL_TEXT
---- ------- --------- --------------- --------- ------- ---------------------------------------------
   2     295 1639008   rtStopMain@zwq_ DBCUSTOPR   35740 update dcustcreditbalance  set limit_owe=:b0,
                       bill_2 (TNS V1-                   unbill_fee=:b1,prepay_fee=:b2,owe_fee=:b3,op_
                       V3)                               time=sysdate where id_no=:b4

其他阻塞者和持有者执行sql语句均和该语句相同,省略其他查询.通过这些查询可以确定是因为对dcustcreditbalance表的更新操作导致了这样的现象发生.

处理方案
1.临时处理方案:kill掉持有者
2.永久处理方案:修改这部分程序业务逻辑