statspack报告中逻辑读为负值

最近遇到两次在sp报告中,显示逻辑读为负数。进行分析情况如下:
一台是运营商的crm库(aix 5.3+9.2.0.8+rac)

--系统版本
[zwq_crm2:/home/crm_oraeye]oslevel -s 
5300-08-07-0920

--数据库版本
SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle9i Enterprise Edition Release 9.2.0.8.0 - 64bit Production
PL/SQL Release 9.2.0.8.0 - Production
CORE    9.2.0.8.0       Production
TNS for IBM/AIX RISC System/6000: Version 9.2.0.8.0 - Production
NLSRTL Version 9.2.0.8.0 - Production

--数据库启动时间
SQL> SELECT A.INST_ID, TO_CHAR (startup_time, 'yyyy-mm-dd hh24:mi:ss')
  2   FROM gv$instance a;

   INST_ID TO_CHAR(STARTUP_TIM
---------- -------------------
         2 2011-02-19 03:33:49
         1 2011-02-19 03:56:17

--异常的逻辑读统计数据
SQL>   SELECT b.snap_id,
  2           TO_CHAR (B.SNAP_TIME, 'yyyy-mm-dd hh24:mi:ss'),
  3           TO_CHAR (A.VALUE, '999,999,999,999,999')
  4      FROM stats$sysstat a, stats$snapshot b
  5     WHERE     a.NAME = 'session logical reads'
  6           AND a.instance_number = 2
  7           AND a.snap_id = b.snap_id
  8           AND A.SNAP_ID >=47913
  9           AND A.SNAP_ID <=47920
 10  ORDER BY a.snap_id;

   SNAP_ID TO_CHAR(B.SNAP_TIME TO_CHAR(A.VALUE,'999
---------- ------------------- --------------------
   47913	2012-02-10 20:00	4,764,705,272,783
   47914	2012-02-10 21:00	4,761,539,910,574
   47915	2012-02-10 22:00	4,749,529,436,021
   47916	2012-02-10 23:00	4,745,952,040,146 
   47917	2012-02-11 00:00	4,738,052,256,634 
   47918	2012-02-11 01:00	4,738,894,245,521 
   47919	2012-02-11 02:00	4,739,587,095,184 
   47920	2012-02-11 03:00	4,740,409,262,259 

另一台是运营商的开停机库(aix 5.3+9.2.0.8+rac)

--系统版本
[zwq_offon2:/home/oraeye]oslevel -s 
5300-08-07-0920

--数据库版本
SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle9i Enterprise Edition Release 9.2.0.8.0 - 64bit Production
PL/SQL Release 9.2.0.8.0 - Production
CORE    9.2.0.8.0       Production
TNS for IBM/AIX RISC System/6000: Version 9.2.0.8.0 - Production
NLSRTL Version 9.2.0.8.0 - Production

--数据库启动时间
SQL> SELECT A.INST_ID, TO_CHAR (startup_time, 'yyyy-mm-dd hh24:mi:ss')
  2    FROM gv$instance a;

   INST_ID TO_CHAR(STARTUP_TIM
---------- -------------------
         2 2010-01-23 19:16:46
         1 2010-01-23 19:15:09

--异常的逻辑读统计数据
SQL>   SELECT b.snap_id,
  2           TO_CHAR (B.SNAP_TIME, 'yyyy-mm-dd hh24:mi:ss'),
  3           TO_CHAR (A.VALUE, '999,999,999,999,999')
  4      FROM stats$sysstat a, stats$snapshot b
  5     WHERE     a.NAME = 'session logical reads'
  6           AND a.instance_number = 2
  7           AND a.snap_id = b.snap_id
  8           AND A.SNAP_ID IN ('38271', '38272', '38339', '38340')
  9  ORDER BY a.snap_id;

   SNAP_ID TO_CHAR(B.SNAP_TIME TO_CHAR(A.VALUE,'999
---------- ------------------- --------------------
     38271 2012-01-31 19:00:05    4,269,858,122,434
     38272 2012-01-31 20:00:02    4,266,001,522,867
     38339 2012-02-02 09:00:02    4,275,651,080,526
     38340 2012-02-02 10:00:02    4,250,263,107,466

这两个数据库中都出现了在数据库没有重启的情况下stats$sysstat.value的值出现波动情况,而且都是在40万亿以上的时候。不知道是特点的版本巧合(特定的数据库版本,特定的操作系统版本),还是Oracle未公布bug。出现这样的情况,目前只能通过重启实例来使得statspack捕获到的逻辑读值变小,从而避免这样的波动,从而解决sp中出现逻辑读为负数的情况。

找出11g undo 回滚段名称

当数据库启动的时报undo相关异常,很多情况下我们不得不使用隐含参数来处理(_offline_rollback_segments和_corrupted_rollback_segments),而这个就需要明确异常的undo回滚段名称。在11g中,undo回滚段的名称发生了改变,在11g之前的版本中,回滚段名称是”_SYSTEMn$”之类,而到了11g回滚段的名称变为了”_SYSTEMn_时间戳$”,因为时间戳我们不知道,所以我们不能通过n的值,来确定回滚段的名称,从而也就不能很明确的使用_offline_rollback_segments和_corrupted_rollback_segments来标明异常回滚段。既然回滚段的名称我们可以通过dba_rollback_segs视图来查询,那么我们可以根据这个视图找到对应的回滚段是存储在哪张基表中,下面为dba_rollback_segs视图对应的sql语句

select un.name, decode(un.user#,1,'PUBLIC','SYS'),
       ts.name, un.us#, f.file#, un.block#,
       s.iniexts * ts.blocksize,
       decode(bitand(ts.flags, 3), 1, to_number(NULL),
                                      s.extsize * ts.blocksize),
       s.minexts, s.maxexts,
       decode(bitand(ts.flags, 3), 1, to_number(NULL),
                                      s.extpct),
       decode(un.status$, 2, 'OFFLINE', 3, 'ONLINE',
                          4, 'UNDEFINED', 5, 'NEEDS RECOVERY',
                          6, 'PARTLY AVAILABLE', 'UNDEFINED'),
       decode(un.inst#, 0, NULL, un.inst#), un.file#
from sys.undo$ un, sys.seg$ s, sys.ts$ ts, sys.file$ f
where un.status$ != 1
  and un.ts# = s.ts#
  and un.file# = s.file#
  and un.block# = s.block#
  and s.type# in (1, 10)
  and s.ts# = ts.ts#
  and un.ts# = f.ts#
  and un.file# = f.relfile#

通过观察,我们知道回滚段信息是存储在sys.undo$中(name字段表示回滚段名称,status$字段表示回滚段状态[1:DELETE,2:OFFLINE,3:ONLINE,4:UNDEFINED,5:NEEDS RECOVERY,6:PARTLY AVAILABLE,其他表示:UNDEFINED]),通过这个视图我们需要找的到status$为5所对应name。
找这些值较简易的方法就是通过odu抽取sys.undo$表中数据,然后在其他库上还原,然后通过sql语句查询
1.设置odu参数(config.txt)
output_format dmp,其他参数根据你的实际情况设置

2.填写system表空间对应的数据文件(control.txt)

[oracle@localhost odu]$ more control.txt 
#ts #fno   #rfno     filename                                          block_size
1     1      1       /u01/oradata/first/system01.dbf

3.odu导出数据

ODU> unload table sys.undo$

Unloading table: UNDO$,object ID: 15
Unloading segment,storage(Obj#=15 DataObj#=15 TS#=0 File#=1 Block#=224 Cluster=0)
21 rows unloaded

4.导入至新库

imp chf/xifenfei file=C:\Users\XIFENFEI\Downloads\SYS_UNDO$.dmp FROMUSER=SYS TOUSER=CHF

5.查询需要处理的回滚段

--数据库版本10g及其以上版本
set pagesize 0 feedback off verify off heading off echo off linesize 1000
select WMSYS.WM_CONCAT(name) from UNDO$ where status$=5;

--数据库版本9i及其以下版本(自己拼接)
set pagesize 0 feedback off verify off heading off echo off 
select name from UNDO$ where status$=5;

现在已经找出来了需要处理的回滚段,其他数据库恢复步骤与以前数据库相同。

ARCn: Media recovery disabled原因分析

1.数据库版本

SQL> select * from v$version;

BANNER
-------------------------------------------------------------
Oracle9i Enterprise Edition Release 9.2.0.8.0 - Production
PL/SQL Release 9.2.0.8.0 - Production
CORE    9.2.0.8.0       Production
TNS for Linux: Version 9.2.0.8.0 - Production
NLSRTL Version 9.2.0.8.0 - Production

2.alert日志出现类此记录

Fri Feb 10 05:16:55 2012
ARC1: Media recovery disabled
Fri Feb 10 05:17:23 2012
ARC0: Media recovery disabled
Fri Feb 10 05:17:23 2012
ARC1: Media recovery disabled
Fri Feb 10 05:18:23 2012
ARC1: Media recovery disabled
Fri Feb 10 05:19:12 2012
Thread 1 advanced to log sequence 24
  Current log# 1 seq# 24 mem# 0: /tmp/recover/redo01.log
Fri Feb 10 05:19:12 2012
ARC0: Media recovery disabled
Fri Feb 10 05:19:23 2012
ARC1: Media recovery disabled
Fri Feb 10 05:20:23 2012
ARC1: Media recovery disabled
Fri Feb 10 05:21:23 2012
ARC1: Media recovery disabled

3.数据库归档信息

SQL> archive log list;
Database log mode              No Archive Mode
Automatic archival             Enabled
Archive destination            /tmp/recover
Oldest online log sequence     25
Current log sequence           27

SQL> show parameter log_archive_start

NAME                                 TYPE        VALUE
------------------------------------ ----------- ---------------
log_archive_start                    boolean     TRUE

从这里可以看出,数据库处于非归档模式,但是log_archive_start=true,这样会导致归档进程自动启动(ARCn),因为数据库设置为非归档模式,从而导致该进程启动后,又自己关闭

4.解决方法

SQL> alter system set log_archive_start=false scope=spfile;

System altered.

SQL> startup force      
ORACLE instance started.

Total System Global Area  202445176 bytes
Fixed Size                   450936 bytes
Variable Size             167772160 bytes
Database Buffers           33554432 bytes
Redo Buffers                 667648 bytes
Database mounted.
Database opened.
SQL> show parameter log_archive_start

NAME                                 TYPE        VALUE
------------------------------------ ----------- --------------
log_archive_start                    boolean     FALSE
SQL> ARCHIVE LOG LIST;
Database log mode              No Archive Mode
Automatic archival             Disabled
Archive destination            /tmp/recover
Oldest online log sequence     26
Current log sequence           28

ALERT中无类此记录出现

手动提交分布式事务一例

一.alert文件中出现了很多类此记录

Fri Feb 10 05:25:01 2012
Errors in file /tmp/recover/ahcx216_reco_7956.trc:
ORA-12154: TNS:could not resolve service name
Fri Feb 10 05:25:01 2012
Errors in file /tmp/recover/ahcx216_reco_7956.trc:
ORA-12154: TNS:could not resolve service name

这里可以看出来两个信息:
1.出错的进程是rec0进程,而该进程的作用是解决分布式事务失败后遗留问题(事务提交或者回滚)
2.错误信息是tns不能被解析

二.查看trace文件

/tmp/recover/ahcx216_reco_7956.trc
Oracle9i Enterprise Edition Release 9.2.0.8.0 - Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.8.0 - Production
ORACLE_HOME = /opt/oracle/product/9.2.0/db_1
System name:    Linux
Node name:      localhost.localdomain
Release:        2.6.9-89.0.0.0.1.ELhugemem
Version:        #1 SMP Tue May 19 04:38:38 EDT 2009
Machine:        i686
Instance name: ahcx216
Redo thread mounted by this instance: 1
Oracle process number: 7
Unix process pid: 7956, image: oracle@localhost.localdomain (RECO)


*** SESSION ID:(6.1) 2012-02-10 04:58:24.886
*** 2012-02-10 04:58:24.886
ERROR, tran=6.1.712757, session#=1, ose=0:
ORA-12154: TNS:could not resolve service name
………………
*** 2012-02-10 05:25:01.580
ERROR, tran=6.1.712757, session#=1, ose=0:
ORA-12154: TNS:could not resolve service name
*** 2012-02-10 05:25:01.627
ERROR, tran=12.19.99059, session#=1, ose=0:
ORA-12154: TNS:could not resolve service name

通过这里我们可以看出事务id分别为12.19.99059和6.1.712757有问题

三.查看dba_2pc_pending视图

SQL> select local_tran_id,state,fail_time,retry_time from dba_2pc_pending;

LOCAL_TRAN_ID     STATE            FAIL_TIME           RETRY_TIME
-------------- ---------------- --------------       --------------
6.1.712757	collecting	2010/12/1 13:39:03   2012/2/10 5:38:52
12.19.99059	collecting	2010/12/1 15:56:26   2012/2/10 5:38:52

可以看出,果然有两个分布式事务在2010年12月1日出了问题(本库是一个问题库,在把库拉起来后发现该问题)
因为是异机恢复,而且间隔时间较长,很多tns的信息都已经不存在,所以需要手工提交分布式事务

四.手动提交事务

SQL> BEGIN
  2  DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('6.1.712757');
  3  DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('12.19.99059');
  4  END;
  5  /
BEGIN
*
第 1 行出现错误:
ORA-30019: Illegal rollback Segment operation in Automatic Undo mode
ORA-06512: at "SYS.DBMS_TRANSACTION", line 65
ORA-06512: at "SYS.DBMS_TRANSACTION", line 85
ORA-06512: at line 2

SQL> alter session set "_smu_debug_mode"=4;

Session altered.

-------------------------------------------------------------
--设置UNDO_SUPPRESS_ERRORS=true也可以解决此问题
alter system set UNDO_SUPPRESS_ERRORS = TRUE;
EXECUTE DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('<事务ID>');
commit;
alter system set UNDO_SUPPRESS_ERRORS = false;
--------------------------------------------------------------

SQL> commit;

SQL> exec DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('6.1.712757');

PL/SQL procedure successfully completed.


SQL> exec DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('12.19.99059');
BEGIN DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('12.19.99059'); END;

*
ERROR at line 1:
ORA-01453: SET TRANSACTION must be first statement of transaction
ORA-06512: at "SYS.DBMS_TRANSACTION", line 65
ORA-06512: at "SYS.DBMS_TRANSACTION", line 85
ORA-06512: at line 1
--第一个分布式事务处理后,未提交导致

SQL> commit;

Commit complete.

SQL> exec DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('12.19.99059');

PL/SQL procedure successfully completed.

SQL> commit;

Commit complete.

五.补充说明
开始在另一个会话中,执行失败原因

SQL> alter session set "_smu_debug_mode"=4;

会话已更改。

SQL> commit;

提交完成。

SQL> BEGIN
  2  DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('6.1.712757');
  3  DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('12.19.99059');
  4  END;
  5  /
BEGIN
*
第 1 行出现错误:
ORA-01453: SET TRANSACTION must be first statement of transaction
ORA-06512: at "SYS.DBMS_TRANSACTION", line 65
ORA-06512: at "SYS.DBMS_TRANSACTION", line 85
ORA-06512: at line 3

因为这里的begin end中包含了两个事务的清理,在清理完第一个事务之后,需要提交才能够清理第二个,这里因为没有提交,导致ORA-01453错误。

通过sql语句获取awr/statspack逻辑读/物理读

在日常的性能监控中,我们经常需要需要通过一段时间内数据库的逻辑读/物理读来大致反映数据库的性能情况,这里列出通过awr和statspack来获取相关数据(查询最近30天数据,除掉第一条和最后一条数据)

awr逻辑读

WITH A AS
 (SELECT B.SNAP_ID, SUM(VALUE)/1024/1024/1024*8192 VALUE
    FROM DBA_HIST_SYSSTAT B
   WHERE B.DBID = (SELECT DBID FROM V$DATABASE)
          AND b.INSTANCE_NUMBER = &INST_NUM
     AND B.STAT_NAME IN ('session logical reads')
   GROUP BY B.SNAP_ID
   ORDER BY SNAP_ID)
SELECT A.SNAP_ID,
       LAG(VALUE, 1, '0') OVER(ORDER BY A.SNAP_ID) "START_VALUE(G)",
       VALUE "END_VALUE(G)",
       TO_CHAR(END_INTERVAL_TIME,'YYYY-MM-DD HH24:MI:SS')"END_TIME",
       VALUE - LAG(VALUE, 1, '0') OVER(ORDER BY A.SNAP_ID) "D-VALUE(G)"
  FROM A,(SELECT END_INTERVAL_TIME,SNAP_ID
    FROM DBA_HIST_SNAPSHOT B
   WHERE  B.DBID = (SELECT dbid FROM v$database)
     AND B.INSTANCE_NUMBER = &INST_NUM) B
     WHERE A.SNAP_ID=B.snap_id
     AND END_INTERVAL_TIME>=SYSDATE-30;

awr物理读

WITH A AS
 (SELECT B.SNAP_ID, SUM(VALUE)/1024/1024/1024*8192 VALUE
    FROM DBA_HIST_SYSSTAT B
   WHERE B.DBID = (SELECT DBID FROM V$DATABASE)
          AND b.INSTANCE_NUMBER = &INST_NUM
     AND B.STAT_NAME IN ('physical reads')
   GROUP BY B.SNAP_ID
   ORDER BY SNAP_ID)
SELECT A.SNAP_ID,
       LAG(VALUE, 1, '0') OVER(ORDER BY A.SNAP_ID) "START_VALUE(G)",
       VALUE "END_VALUE(G)",
       TO_CHAR(END_INTERVAL_TIME,'YYYY-MM-DD HH24:MI:SS')"END_TIME",
       VALUE - LAG(VALUE, 1, '0') OVER(ORDER BY A.SNAP_ID) "D-VALUE(G)"
  FROM A,(SELECT END_INTERVAL_TIME,SNAP_ID
    FROM DBA_HIST_SNAPSHOT B
   WHERE  B.DBID = (SELECT dbid FROM v$database)
     AND B.INSTANCE_NUMBER = &INST_NUM) B
     WHERE A.SNAP_ID=B.snap_id
     AND END_INTERVAL_TIME>=SYSDATE-30;

statspack逻辑读

WITH A AS
 (SELECT B.SNAP_ID, SUM(VALUE)/1024/1024/1024*8192 VALUE
    FROM STATS$SYSSTAT B
   WHERE B.DBID = (SELECT DBID FROM V$DATABASE)
     AND B.INSTANCE_NUMBER = &INST_NUM
     AND B.NAME IN ('session logical reads')
   GROUP BY B.SNAP_ID
   ORDER BY SNAP_ID)
SELECT A.SNAP_ID,
       LAG(VALUE, 1, '0') OVER(ORDER BY A.SNAP_ID) "START_VALUE(G)",
       VALUE "END_VALUE(G)",
       TO_CHAR(SNAP_TIME, 'YYYY-MM-DD HH24:MI:SS') "END_TIME",
       VALUE - LAG(VALUE, 1, '0') OVER(ORDER BY A.SNAP_ID) "D-VALUE(G)"
  FROM A,
       (SELECT SNAP_TIME, SNAP_ID
          FROM STATS$SNAPSHOT B
         WHERE B.DBID = (SELECT DBID FROM V$DATABASE)
           AND B.INSTANCE_NUMBER = &INST_NUM) B
 WHERE A.SNAP_ID = B.SNAP_ID
  AND snap_time>=SYSDATE-30;

statspack物理读

WITH A AS
 (SELECT B.SNAP_ID, SUM(VALUE)/1024/1024/1024*8192 VALUE
    FROM STATS$SYSSTAT B
   WHERE B.DBID = (SELECT DBID FROM V$DATABASE)
     AND B.INSTANCE_NUMBER = &INST_NUM
     AND B.NAME IN ('physical reads')
   GROUP BY B.SNAP_ID
   ORDER BY SNAP_ID)
SELECT A.SNAP_ID,
       LAG(VALUE, 1, '0') OVER(ORDER BY A.SNAP_ID) "START_VALUE(G)",
       VALUE "END_VALUE(G)",
       TO_CHAR(SNAP_TIME, 'YYYY-MM-DD HH24:MI:SS') "END_TIME",
       VALUE - LAG(VALUE, 1, '0') OVER(ORDER BY A.SNAP_ID) "D-VALUE(G)"
  FROM A,
       (SELECT SNAP_TIME, SNAP_ID
          FROM STATS$SNAPSHOT B
         WHERE B.DBID = (SELECT DBID FROM V$DATABASE)
           AND B.INSTANCE_NUMBER = &INST_NUM) B
 WHERE A.SNAP_ID = B.SNAP_ID
 AND snap_time>=SYSDATE-30;