ORA-00600 [ktbdchk1: bad dscn] 解决

启动数据库报错
SQL> startup
ORACLE instance started.

Total System Global Area  167772160 bytes
Fixed Size                  1260720 bytes
Variable Size             150995792 bytes
Database Buffers            8388608 bytes
Redo Buffers                7127040 bytes
Database mounted.
ORA-01092: ORACLE instance terminated. Disconnection forced

alert.log导错
Wed Aug 10 12:31:11 2011
Errors in file /u01/admin/xienfei/udump/xff_ora_8568.trc:
ORA-00600: internal error code, arguments: [ktbdchk1: bad dscn], [], [], [], [], [], [], []

xff_ora_8568.trc内容
[ktbdchk] -- readers_dsz -- bad dscn
scn: 0x0000.b1e60c00scn: 0x0000.0011fca1
*** 2011-08-10 12:31:11.998
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [ktbdchk1: bad dscn], [], [], [], [], [], [], []
Current SQL statement for this session:
select ctime, mtime, stime from obj$ where obj# = :1
根据上面错误判断,错误的scn为b1e60c00,不是整个数据文件的scn错误
而应该是一个对象的scn错误,所以继续在xff_ora_8568.trc文件中查找b1e60c00
找到结果如下:
Block header dump:  0x0040007a
 Object id on Block? Y
 seg/obj: 0x12  csc: 0x00.b1e60c00  itc: 1  flg: -  typ: 1 - DATA
     fsl: 0  fnx: 0x0 ver: 0x01

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0008.02a.000001d9  0x00802341.01bb.04  ----    1  fsc 0x0000.0011ae7c

data_block_dump,data header at 0x20fd6044
===============
tsiz: 0x1fb8
hsiz: 0xea
pbl: 0x20fd6044
bdba: 0x0040007a
     76543210
flag=--------
ntab=1
nrow=108
frre=-1
fsbo=0xea
fseo=0x453
avsp=0x369
tosp=0x369
0xe:pti[0]      nrow=108        offs=0

根据这个提示,发现dba为:0040007a的对象异常,查找对应的file_id,block
SQL> SELECT DBMS_UTILITY.data_block_address_file (TO_NUMBER ('40007a', 'XXXXXXXX')) file_id,
  2          DBMS_UTILITY.data_block_address_block (TO_NUMBER ('40007a', 'XXXXXXXX')) block_id
  3    FROM DUAL;

   FILE_ID   BLOCK_ID
---------- ----------
         1        122

使用bbed查看file=1,block=122的scn情况
BBED> p ktbbh
struct ktbbh, 48 bytes                      @20      
   ub1 ktbbhtyp                             @20       0x01 (KDDBTDATA)
   union ktbbhsid, 4 bytes                  @24      
      ub4 ktbbhsg1                          @24       0x00000012
      ub4 ktbbhod1                          @24       0x00000012
   struct ktbbhcsc, 8 bytes                 @28      
      ub4 kscnbas                           @28       0xb1e60c00
      ub2 kscnwrp                           @32       0x0000
   b2 ktbbhict                              @36       1
   ub1 ktbbhflg                             @38       0x02 (NONE)
   ub1 ktbbhfsl                             @39       0x00
   ub4 ktbbhfnx                             @40       0x00000000
   struct ktbbhitl[0], 24 bytes             @44      
      struct ktbitxid, 8 bytes              @44      
         ub2 kxidusn                        @44       0x0008
         ub2 kxidslt                        @46       0x002a
         ub4 kxidsqn                        @48       0x000001d9
      struct ktbituba, 8 bytes              @52      
         ub4 kubadba                        @52       0x00802341
         ub2 kubaseq                        @56       0x01bb
         ub1 kubarec                        @58       0x04
      ub2 ktbitflg                          @60       0x0001 (NONE)
      union _ktbitun, 2 bytes               @62      
         b2 _ktbitfsc                       @62       0
         ub2 _ktbitwrp                      @62       0x0000
      ub4 ktbitbas                          @64       0x0011ae7c

果然发现scn为0xb1e60c00,现在把其修改为:0x00124ac6(注意规则,一般linux下都是倒序)
BBED> set offset 28
        OFFSET          28

BBED> m /x c64a1200
BBED-00209: invalid number (c64a1200)
小技巧,一次性修改报错,尝试一次修改一点

BBED> m /x c64a
 File: /u01/oradata/xienfei/system01.dbf (0)
 Block: 122              Offsets:   28 to   43           Dba:0x00000000
------------------------------------------------------------------------
 c64ae6b1 00000000 01000200 00000000 

 <32 bytes per line>

BBED> set offset +2
        OFFSET          30

BBED> m /x 1200
 File: /u01/oradata/xienfei/system01.dbf (0)
 Block: 122              Offsets:   30 to   45           Dba:0x00000000
------------------------------------------------------------------------
 12000000 00000100 02000000 00000800 

 <32 bytes per line>

BBED> set offset -2
        OFFSET          28

BBED> dump
 File: /u01/oradata/xienfei/system01.dbf (0)
 Block: 122              Offsets:   28 to   43           Dba:0x00000000
------------------------------------------------------------------------
 c64a1200 00000000 01000200 00000000 

 <32 bytes per line>

BBED> sum apply
Check value for File 0, Block 122:
current = 0x3a4e, required = 0x3a4e

SQL> startup
ORACLE instance started.

Total System Global Area  167772160 bytes
Fixed Size                  1260720 bytes
Variable Size             150995792 bytes
Database Buffers            8388608 bytes
Redo Buffers                7127040 bytes
Database mounted.
Database opened.

cursor: pin S事件

A session waits for “cursor: pin S” when it wants a specific mutex in S (share) mode on a specific cursor and there is no concurrent X holder but it could not acquire that mutex immediately. This may seem a little strange as one might question why there should be any form of wait to get a mutex which has no session holding it in an incompatible mode. The reason for the wait is that in order to acquire the mutex in S mode (or release it) the session has to increment (or decrement) the mutex reference count and this requires an exclusive atomic update to the mutex structure itself. If there are concurrent sessions trying to make such an update to the mutex then only one session can actually increment (or decrement) the reference count at a time. A wait on “cursor: pin S” thus occurs if a session cannot make that atomic change immediately due to other concurrent requests.
Mutexes are local to the current instance in RAC environments.

Oracle10g中引用的mutexes机制一定程度的替代了library cache pin,其结构更简单,get&set的原子操作更快捷。
它相当于,每个child cursor下面都有一个mutexes这样的简单内存结构,当有session要执行该SQL而需要pin cursor操作的时候,session只需要以shared模式set这个内存位+1,表示session获得该mutex的shared mode lock.可以有很多session同时具有这个mutex的shared mode lock;但在同一时间,只能有一个session在操作这个mutext +1或者-1。+1 -1的操作是排它性的原子操作。如果因为session并行太多,而导致某个session在等待其他session的mutext +1/-1操作,则该session要等待cursor: pin S等待事件。
当看到系统有很多session等待cursor: pin S事件的时候,要么是CPU不够快,要么是某个SQL的并行执行次数太多了而导致在child cursor上的mutex操作争用。如果是Capacity的问题,则可以升级硬件。如果是因为SQL的并行太多,则要么想办法降低该SQL执行次数,要么将该SQL复制成N个其它的SQL。
select /*SQL 1*/object_name from t where object_id=?
select /*SQL 2*/object_name from t where object_id=?
select /*SQL …*/object_name from t where object_id=?
select /*SQL N*/object_name from t where object_id=?
这样就有了N个SQL Cursor,N个Mutex内存结构,就将争用分散开来,类似partition的作用了