记录一次ORA-600 4000数据库故障恢复

ORA-600[4000]错误
一朋友数据库因为当前redo丢失,在恢复的过程中启动报ORA-600[4000]错误

SMON: enabling cache recovery
Thu May 30 16:24:17 2013
Errors in file /u02/oracle/app/oracle/admin/xifenfei/udump/xifenfei1_ora_1458370.trc:
ORA-00600: internal error code, arguments: [4000], [83], [], [], [], [], [], []
Thu May 30 16:24:19 2013
Errors in file /u02/oracle/app/oracle/admin/xifenfei/udump/xifenfei1_ora_1458370.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00600: internal error code, arguments: [4000], [83], [], [], [], [], [], []
Thu May 30 16:24:19 2013
Error 704 happened during db open, shutting down database
USER: terminating instance due to error 704
Instance terminated by USER, pid = 1458370
ORA-1092 signalled during: alter database open resetlogs...

分析trace文件

*** 2013-05-30 16:24:17.979
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [4000], [83], [], [], [], [], [], []
Current SQL statement for this session:
select ctime, mtime, stime from obj$ where obj# = :1
--确定是obj$对象异常,通过某种手段找到obj$的objid和dataobjid均为16,对应16进制为12

Block header dump:  0x0040007a
 Object id on Block? Y
 seg/obj: 0x12  csc: 0xc1e.a329e76f  itc: 1  flg: -  typ: 1 - DATA
     fsl: 0  fnx: 0x0 ver: 0x01
 
 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0053.02a.000598bd  0x0d407e46.4f52.2f  --U-    1  fsc 0x0000.a329e772

这里比较明显obj$对象在rdba为0040007a的block上,scn为0c1e.a329e76f(13325725984623)且未提交的事务,这样的现象就决定了处理的特殊性(不是因为块延迟清理导致访问undo现象,该现象直接推进scn解决,而该情况不行)

数据文件头scn

SQL> SELECT DISTINCT CHECKPOINT# FROM V$DATAFILE_HEADER;

    CHECKPOINT_CHANGE#
-------------------------
           13324676536960

bbed查看文件头scn

   struct kcvfhckp, 160 bytes               @484     
      struct kcvcpscn, 8 bytes              @484     
         ub4 kscnbas                        @484      0x649c9a80
         ub2 kscnwrp                        @488      0x0c1e

这里看到的文件头scn也是为13324676536960(0c1e.649c9a80)和sql查询结果一致,也就是说数据库中的obj$的某个对象含有事务,且scn大于文件头scn(因为当前redo丢失,无法前滚,所以出现该情况),当数据库访问obj$的时候,为了事务的一致性,就需要访问undo(这里提示为83 回滚段),而undo异常,所以smon进程回滚失败,数据库无法正常启动

使用bbed提交事务

BBED> map
 File: /oradata/sys/xifenfei/system01.dbf (1)
 Block: 122                                   Dba:0x0040007a
------------------------------------------------------------
 KTB Data Block (Table/Cluster)

 struct kcbh, 20 bytes                      @0       

 struct ktbbh, 48 bytes                     @20      

 struct kdbh, 14 bytes                      @68      

 struct kdbt[1], 4 bytes                    @82      

 sb2 kdbr[108]                              @86      

 ub1 freespace[802]                         @302     

 ub1 rowdata[7084]                          @1104    

 ub4 tailchk                                @8188    


BBED> p ktbbh
struct ktbbh, 48 bytes                      @20      
   ub1 ktbbhtyp                             @20       0x01 (KDDBTDATA)
   union ktbbhsid, 4 bytes                  @24      
      ub4 ktbbhsg1                          @24       0x00000012
      ub4 ktbbhod1                          @24       0x00000012
   struct ktbbhcsc, 8 bytes                 @28      
      ub4 kscnbas                           @28       0xa329e76f
      ub2 kscnwrp                           @32       0x0c1e
   b2 ktbbhict                              @36       1
   ub1 ktbbhflg                             @38       0x02 (NONE)
   ub1 ktbbhfsl                             @39       0x00
   ub4 ktbbhfnx                             @40       0x00000000
   struct ktbbhitl[0], 24 bytes             @44      
      struct ktbitxid, 8 bytes              @44      
         ub2 kxidusn                        @44       0x0053
         ub2 kxidslt                        @46       0x002a
         ub4 kxidsqn                        @48       0x000598bd
      struct ktbituba, 8 bytes              @52      
         ub4 kubadba                        @52       0x0d407e46
         ub2 kubaseq                        @56       0x4f52
         ub1 kubarec                        @58       0x2f
      ub2 ktbitflg                          @60       0x2001 (KTBFUPB)<--需要提交
      union _ktbitun, 2 bytes               @62      
         b2 _ktbitfsc                       @62       0
         ub2 _ktbitwrp                      @62       0x0000
      ub4 ktbitbas                          @64       0xa329e772

BBED> set count 32
        COUNT           32

BBED> set offset 60
        OFFSET          60

BBED> d
 File: /oradata/sys/xifenfei/system01.dbf (1)
 Block: 122              Offsets:   60 to   91           Dba:0x0040007a
------------------------------------------------------------------------
 20010000 a329e772 0001006c ffff00ea 040c0368 03680000 006c1f7c 1f3c1efb 

 <32 bytes per line>

BBED> m /x 8001
 File: /oradata/sys/xifenfei/system01.dbf (1)
 Block: 122              Offsets:   60 to   91           Dba:0x0040007a
------------------------------------------------------------------------
 80010000 a329e772 0001006c ffff00ea 040c0368 03680000 006c1f7c 1f3c1efb 

 <32 bytes per line>

BBED> sum apply
Check value for File 1, Block 122:
current = 0xafd6, required = 0xafd6

尝试open数据库ORA-600[2662]解决

Thu May 30 21:16:00 2013
Errors in file /u02/oracle/app/oracle/admin/xifenfei/bdump/xifenfei1_smon_819664.trc:
ORA-00600: internal error code, arguments:[2662],[3102],[2737532996],[3102],[2745973074],[4194397],[],[]
Non-fatal internal error happenned while SMON was doing non-existent object cleanup.
SMON encountered 1 out of maximum 100 non-fatal internal errors.
Thu May 30 21:16:01 2013
Trace dumping is performing id=[cdmp_20130530211601]
Thu May 30 21:16:02 2013
Errors in file /u02/oracle/app/oracle/admin/xifenfei/bdump/xifenfei1_smon_819664.trc:
ORA-00600: internal error code, arguments:[2662],[3102],[2737532997],[3102],[2745973074],[4194397],[],[]
Thu May 30 21:16:03 2013
Non-fatal internal error happenned while SMON was doing logging scn->time mapping.
SMON encountered 2 out of maximum 100 non-fatal internal errors.
Thu May 30 21:16:05 2013
Errors in file /u02/oracle/app/oracle/admin/xifenfei/bdump/xifenfei1_smon_819664.trc:
ORA-00600: internal error code, arguments:[2662],[3102],[2737532997],[3102],[2745973074],[4194397],[],[]
Thu May 30 21:16:08 2013
Errors in file /u02/oracle/app/oracle/admin/xifenfei/bdump/xifenfei1_pmon_958764.trc:
ORA-00474: SMON process terminated with error
Thu May 30 21:16:08 2013
PMON: terminating instance due to error 474

数据库在open过程中遇到大量ORA-00600[2662],这个是因为数据库中文件头的scn小于访问的数据块scn导致该问题,解决方法推荐scn,如果数据库的scn本身就很大(和时间理论scn较接近),推进过程中可能遇到如下错误,这个时候就需要选择合适的方法/合适的值来推进scn

SQL> startup pfile=/home/oracle/pfile force
ORACLE instance started.

Total System Global Area 5.5835E+10 bytes
Fixed Size                  2177056 bytes
Variable Size            3.2867E+10 bytes
Database Buffers         2.2951E+10 bytes
Redo Buffers               14598144 bytes
Database mounted.
ORA-01052: required destination LOG_ARCHIVE_DUPLEX_DEST is not specified

后面的工作因为没有redo前滚,而且该库故障时有大量事务在跑,现在无法前滚,导致大量的undo回滚段异常,index和data不一致等故障,需要做的就是屏蔽undo seg,重建undo,重建库

乱用_allow_resetlogs_corruption参数导致悲剧

一个朋友11.2.0.1的数据库因为断电,出现不能正常open问题,自己尝试恢复,折腾了几天,最后让我帮忙的时候错误如下

SQL> startup
ORACLE 例程已经启动。

Total System Global Area  778387456 bytes
Fixed Size                  1374808 bytes
Variable Size             545260968 bytes
Database Buffers          226492416 bytes
Redo Buffers                5259264 bytes
数据库装载完毕。
ORA-01092: ORACLE instance terminated. Disconnection forced
ORA-00604: error occurred at recursive SQL level 1
ORA-01578: ORACLE data block corrupted (file # 1, block # 225)
ORA-01110: data file 1: 'F:\APP\ADMINISTRATOR\ORADATA\YFCLOUD\SYSTEM01.DBF'
进程 ID: 5964
会话 ID: 1144 序列号: 5

从启动的日志提示看初步判断就是悲剧了,因为根据经验值在11gr2版本中,该错误就是undo$(分析trace文件进步一确定是undo$),该block出现异常,数据库在启动的时候要扫描该表,把相关的回滚段给online起来,现在他异常了,数据库肯定无法正常启动
dbv检查数据库文件

F:\>dbv file='F:\APP\ADMINISTRATOR\ORADATA\YFCLOUD\SYSTEM01.DBF'

DBVERIFY: Release 11.2.0.1.0 - Production on 星期三 5月 22 11:06:00 2013

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.

DBVERIFY - 开始验证: FILE = F:\APP\ADMINISTRATOR\ORADATA\YFCLOUD\SYSTEM01.DBF
页 225 流入 - 很可能是介质损坏
Corrupt block relative dba: 0x004000e1 (file 1, block 225)
Fractured block found during dbv:
Data in bad block:
 type: 6 format: 2 rdba: 0x004000e1
 last change scn: 0x0000.00d65120 seq: 0x1 flg: 0x06
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0xb98e0601
 check value in block header: 0xb307
 computed block checksum: 0xe8ae



DBVERIFY - 验证完成

检查的页总数: 134400
处理的页总数 (数据): 98226
失败的页总数 (数据): 0
处理的页总数 (索引): 14189
失败的页总数 (索引): 0
处理的页总数 (其他): 4178
处理的总页数 (段)  : 1
失败的总页数 (段)  : 0
空的页总数: 17806
标记为损坏的总页数: 1
流入的页总数: 1
加密的总页数        : 0
最高块 SCN            : 14045769 (0.14045769)

看到这里,可以确定坏块的存在,根据上面的提示,我们发现tailchk值不正确,应该是5120+06+01,而不该是b98e0601,通过bbed查看

BBED> p kcbh
struct kcbh, 20 bytes                       @0
   ub1 type_kcbh                            @0        0x06
   ub1 frmt_kcbh                            @1        0xa2
   ub1 spare1_kcbh                          @2        0x00
   ub1 spare2_kcbh                          @3        0x00
   ub4 rdba_kcbh                            @4        0x004000e1
   ub4 bas_kcbh                             @8        0x00d65120
   ub2 wrp_kcbh                             @12       0x0000
   ub1 seq_kcbh                             @14       0x01
   ub1 flg_kcbh                             @15       0x06 (KCBHFDLC, KCBHFCKV)
   ub2 chkval_kcbh                          @16       0x5ba9
   ub2 spare3_kcbh                          @18       0x0000

BBED> p tailchk
ub4 tailchk                                 @8188     0xb98e0601

进一步证明是tailchk异常导致,分析alert日志,数据库异常断电,然后启动的时候发现如下错误

Recovery of Online Redo Log: Thread 1 Group 2 Seq 431 Reading mem 0
  Mem# 0: F:\APP\ADMINISTRATOR\ORADATA\YFCLOUD\REDO02.LOG
RECOVERY OF THREAD 1 STUCK AT BLOCK 451 OF FILE 3
Aborting crash recovery due to error 1172
Errors in file f:\app\administrator\diag\rdbms\yfcloud\yfcloud\trace\yfcloud_ora_772.trc:
ORA-01172: 线程 1 的恢复停止在块 451 (在文件 3 中)
ORA-01151: 如果需要, 请使用介质恢复以恢复块和还原备份
Errors in file f:\app\administrator\diag\rdbms\yfcloud\yfcloud\trace\yfcloud_ora_772.trc:
ORA-01172: 线程 1 的恢复停止在块 451 (在文件 3 中)
ORA-01151: 如果需要, 请使用介质恢复以恢复块和还原备份
ORA-1172 signalled during: alter database open...
Tue May 21 14:27:29 2013
ALTER DATABASE RECOVER  datafile 3  
Media Recovery Start
Serial Media Recovery started
Recovery of Online Redo Log: Thread 1 Group 2 Seq 431 Reading mem 0
  Mem# 0: F:\APP\ADMINISTRATOR\ORADATA\YFCLOUD\REDO02.LOG
Errors in file f:\app\administrator\diag\rdbms\yfcloud\yfcloud\trace\yfcloud_ora_772.trc  (incident=112164):
ORA-00600: 内部错误代码, 参数: [3020], [3], [451], [12583363], [], [], [], [], [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 3, block# 451, file offset is 3694592 bytes)
ORA-10564: tablespace UNDOTBS1
ORA-01110: 数据文件 3: 'F:\APP\ADMINISTRATOR\ORADATA\YFCLOUD\UNDOTBS01.DBF'
ORA-10560: block type 'KTU UNDO BLOCK'
Media Recovery failed with error 600
ORA-283 signalled during: ALTER DATABASE RECOVER  datafile 3  ...

因为file# 3, block# 451和redo信息不一致,出现ora-600[3020]错误,而file# 3为undo文件,朋友从而设置undo_management=’manual’并设置了_allow_resetlogs_corruption=true,然后进行不完全恢复,从而出现了如下错误提示

Tue May 21 14:41:23 2013
SMON: enabling cache recovery
Corrupt block relative dba: 0x004000e1 (file 1, block 225)
Fractured block found during buffer read
Data in bad block:
 type: 6 format: 2 rdba: 0x004000e1
 last change scn: 0x0000.00d65120 seq: 0x1 flg: 0x06
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0xb98e0601
 check value in block header: 0xb307
 computed block checksum: 0xe8ae
Reading datafile 'F:\APP\ADMINISTRATOR\ORADATA\YFCLOUD\SYSTEM01.DBF'
for corruption at rdba: 0x004000e1 (file 1, block 225)
Reread (file 1, block 225) found same corrupt data
Errors in file f:\app\administrator\diag\rdbms\yfcloud\yfcloud\trace\yfcloud_ora_4892.trc  (incident=120165):
ORA-01578: ORACLE 数据块损坏 (文件号 1, 块号 225)
ORA-01110: 数据文件 1: 'F:\APP\ADMINISTRATOR\ORADATA\YFCLOUD\SYSTEM01.DBF'
Errors in file f:\app\administrator\diag\rdbms\yfcloud\yfcloud\trace\yfcloud_ora_4892.trc:
ORA-00604: 递归 SQL 级别 1 出现错误
ORA-01578: ORACLE 数据块损坏 (文件号 1, 块号 225)
ORA-01110: 数据文件 1: 'F:\APP\ADMINISTRATOR\ORADATA\YFCLOUD\SYSTEM01.DBF'
Errors in file f:\app\administrator\diag\rdbms\yfcloud\yfcloud\trace\yfcloud_ora_4892.trc:
ORA-00604: 递归 SQL 级别 1 出现错误
ORA-01578: ORACLE 数据块损坏 (文件号 1, 块号 225)
ORA-01110: 数据文件 1: 'F:\APP\ADMINISTRATOR\ORADATA\YFCLOUD\SYSTEM01.DBF'
Error 604 happened during db open, shutting down database
USER (ospid: 4892): terminating the instance due to error 604

从而的原因基本上可以从操作过程中了解到:数据库是因为file# 3 block# 451和redo不一致导致问题,而恢复的操作人员冲动的使用了_allow_resetlogs_corruption参数,从而使得数据库出现了不一致性,也就是导致file# 1 block# 225坏块的根本原因,针对这样的情况,完全没有到使用_allow_resetlogs_corruption隐含参数地步

使用bbed修改tailchk

BBED> p tailchk
ub4 tailchk                                 @8188     0xb98e0601

BBED> verify
DBVERIFY - Verification starting
FILE = system01.dbf
BLOCK = 225

Block 225 is corrupt
***
Corrupt block relative dba: 0x004000e1 (file 0, block 225)
Fractured block found during verification
Data in bad block -
 type: 6 format: 2 rdba: 0x004000e1
 last change scn: 0x0000.00d65120 seq: 0x1 flg: 0x06
 consistency value in tail: 0xb98e0601
 check value in block header: 0x5ba9, computed block checksum: 0x0
 spare1: 0x0, spare2: 0x0, spare3: 0x0
***


DBVERIFY - Verification complete

Total Blocks Examined         : 1
Total Blocks Processed (Data) : 0
Total Blocks Failing   (Data) : 0
Total Blocks Processed (Index): 0
Total Blocks Failing   (Index): 0
Total Blocks Empty            : 0
Total Blocks Marked Corrupt   : 1
Total Blocks Influx           : 2

BBED> m /x 01062051
 File: system01.dbf (0)
 Block: 226              Offsets: 8188 to 8191           Dba:0x00000000
------------------------------------------------------------------------
 01062051

 <32 bytes per line>

BBED> p tailchk
ub4 tailchk                                 @8188     0x51200601

BBED> sum apply
Check value for File 0, Block 226:
current = 0xb307, required = 0xb307

BBED> verify
DBVERIFY - Verification starting
FILE = system01.dbf
BLOCK = 225


DBVERIFY - Verification complete

Total Blocks Examined         : 1
Total Blocks Processed (Data) : 1
Total Blocks Failing   (Data) : 0
Total Blocks Processed (Index): 0
Total Blocks Failing   (Index): 0
Total Blocks Empty            : 0
Total Blocks Marked Corrupt   : 0
Total Blocks Influx           : 0

bbed修改block之后,数据库直接正常打开,完成数据库恢复任务,在这里很明显是因为错误的使用了_allow_resetlogs_corruption参数,屏蔽了redo前滚导致了相关的坏块,所以大家在数据库异常恢复的时候,需要知道各个参数的意义,而不要乱使用,很可能导致不可控结果

一次侥幸的OSD-04016 O/S-Error异常恢复

一台数据库因为异常断电导致硬盘IO出现O/S-Error: (OS 23) 数据错误(循环冗余检查)错误,使得datafile 6无法完成实例恢复.使用dbv检查该数据文件也出现类似错误,尝试copy该文件,也出现了类似的错误.尝试dd拷贝完整,发现dd也只能拷贝81951个block.

Tue May 14 15:32:10 2013
Completed redo scan
 16941 redo blocks read, 1106 data blocks need recovery
Tue May 14 15:32:17 2013
Errors in file d:\oracle\product\10.2.0\admin\water\bdump\water_p002_1472.trc:
ORA-01115: IO error reading block from file 6 (block # 81951)
ORA-01110: data file 6: 'D:\ORACLE\PRODUCT\10.2.0\ORADATA\WATER\YD_DATA01.DBF'
ORA-27070: async read/write failed
OSD-04016: 异步 I/O 请求排队时出错。
O/S-Error: (OS 23) 数据错误(循环冗余检查)。

因为该数据库有一天前的备份,而且他们只要求恢复其中三张核心表的数据,通过分析数据字典,确定出来相关表的block均不在block 81951之上,也就是说,如果数据库只是该block异常了,可以通过跳过该block,从而copy相关block,来实现数据库恢复,因为是一个文件的中间部分异常了,所以决定使用dd来copy文件正常部分

dd if=D:\ORACLE\PRODUCT\10.2.0\ORADATA\WATER\YD_DATA01.DBF bs=8192 count=81951 of=h:\dd\yd_data01_1.dbf
dd if=D:\ORACLE\PRODUCT\10.2.0\ORADATA\WATER\YD_DATA01.DBF bs=8192  skip=81952   of=h:\dd\yd_data01_2.dbf

dd出来文件之后,因为我们跳过了block 81952(block 0 数据库为记录),所以我们需要通过dd来构造block 81952,并且把他们合并到一起

dd if=/dev/zero of=h:\dd\yd_data01_1.dbf seek=81951 bs=8192 count=1
dd if=h:\dd\yd_data01_2.dbf seek=81952 bs=8192 of=h:\dd\yd_data01_1.dbf 

然后使用dul工具抽出来客户需要的三张核心表的数据,恢复工作算完成。
针对本次恢复,如果需求是open数据库,通过设置隐含参数,bbed之类原则上也可以实现.
这次的恢复算是比较侥幸:1.客户有一天前的exp,只需要恢复三张核心表数据;2.三张表的数据恰好都不在损坏的block中;3.数据库就损坏了一个block.
如果出现不幸情况,那可能需要先硬盘恢复,然后数据库恢复,最后折腾数据.
总之再次提醒各位:数据库备份很重要,很重要.对于需求是不能丢失数据的系统备份,一定要rman的方式备份,千万别选择exp/expdp

bbed处理ORA-01200故障

一个朋友的测试库出现ORA-01200错误,正好周末比较空闲,随手帮他使用bbed进行了恢复,给广大朋友提供一种解决该问题的方法
数据库启动报错

C:\Users\Administrator>sqlplus /nolog

SQL*Plus: Release 11.1.0.6.0 - Production on 星期日 5月 12 22:09:11 2013

Copyright (c) 1982, 2007, Oracle.  All rights reserved.

SQL> connect/as sysdba
已连接到空闲例程。
SQL> startup force
ORACLE 例程已经启动。

Total System Global Area 1071333376 bytes
Fixed Size                  1334380 bytes
Variable Size             318768020 bytes
Database Buffers          746586112 bytes
Redo Buffers                4644864 bytes
数据库装载完毕。
ORA-01122: 数据库文件 1 验证失败
ORA-01110: 数据文件 1: 'D:\APP\ADMINISTRATOR\ORADATA\ORCL\SYSTEM01.DBF'
ORA-01200: 87946 的实际文件大小小于 88320 块的正确大小

这里的错误很明显是因为file 1的数据文件头记录block大小为88320个block,而该数据文件的实际大小只有87946个block,所以出现该问题.

dbv检测文件

D:\app\Administrator\oradata\orcl>dbv file=SYSTEM01.DBF

DBVERIFY: Release 11.1.0.6.0 - Production on 星期日 5月 12 22:30:29 2013

Copyright (c) 1982, 2007, Oracle.  All rights reserved.

DBVERIFY - 开始验证: FILE = SYSTEM01.DBF


DBVERIFY - 验证完成

检查的页总数: 87040
处理的页总数 (数据): 62870
失败的页总数 (数据): 0
处理的页总数 (索引): 11055
失败的页总数 (索引): 0
处理的页总数 (其它): 2437
处理的总页数 (段)  : 0
失败的总页数 (段)  : 0
空的页总数: 10678
标记为损坏的总页数: 0
流入的页总数: 0
加密的总页数        : 0
最高块 SCN            : 980055 (0.980055)

检查发现该数据文件未发现坏块,减小了该数据文件通过bbed恢复异常的风险,数据库最怕就是system中出现很多坏块

使用bbed修改kccfhfsz
因为win的bbed问题,所以拷贝到我的电脑上进行修改

C:\Users\XIFENFEI\Desktop\temp>bbed filename=system01.dbf blocksize=8192
Password:

BBED: Release 2.0.0.0.0 - Limited Production on Sun May 12 23:27:26 2013

Copyright (c) 1982, 2002, Oracle Corporation.  All rights reserved.

************* !!! For Oracle Internal Use only !!! ***************

BBED> set block 2
        BLOCK#          2
--从一台机器中拷贝到另外的机器,实际中的block可能发生改变,因为含block 0

BBED> map
 File: system01.dbf (0)
 Block: 2                                     Dba:0x00000000
------------------------------------------------------------
 Data File Header

 struct kcvfh, 360 bytes                    @0

 ub4 tailchk                                @8188


BBED> p kcvfhhdr.kccfhfsz
ub4 kccfhfsz                                @44       0x0001578a

--通过ORA-01200错误报出来的文件头记录大小88320实际就是0x0001578a


BBED> set mode edit
        MODE            Edit

BBED> set count 32
        COUNT           32

BBED> d
 File: system01.dbf (0)
 Block: 2                Offsets:   44 to   75           Dba:0x00000000
------------------------------------------------------------------------
00590100 00200000 01000300 00000000 00000000 00000000 00000000 00000000

 <32 bytes per line>

BBED> m /x 8A570100
 File: system01.dbf (0)
 Block: 2                Offsets:   44 to   75           Dba:0x00000000
------------------------------------------------------------------------
 8a570100 00200000 01000300 00000000 00000000 00000000 00000000 00000000

 <32 bytes per line>
--通过ORA-01200错误报出来的数据文件实际大小,来修改该文件头的kcvfhhdr.kccfhfsz值,也可以通过文件实际大小计算出来


BBED> p kcvfhhdr.kccfhfsz
ub4 kccfhfsz                                @44       0x0001578a

BBED> sum apply
Check value for File 0, Block 2:
current = 0x0f79, required = 0x0f79

BBED> verify
DBVERIFY - Verification starting
FILE = system01.dbf
BLOCK = 1


DBVERIFY - Verification complete

Total Blocks Examined         : 1
Total Blocks Processed (Data) : 0
Total Blocks Failing   (Data) : 0
Total Blocks Processed (Index): 0
Total Blocks Failing   (Index): 0
Total Blocks Empty            : 0
Total Blocks Marked Corrupt   : 0
Total Blocks Influx           : 0

打开数据库

SQL> select open_mode from v$database;

OPEN_MODE
----------
MOUNTED

SQL> alter database open;

数据库已更改。

ORA-600[2037]与ORA-07445[kcbs_dump_adv_state]错误

一台win oracle 数据库,重启后发现数据库无法访问,检查发现是Bug 4899479,但是oracle未提供完整的解决方法,这里根据自己对于数据库启动过程的理解,通过屏蔽前滚和回滚,拉起来数据库
数据库版本平台信息

ORACLE:11.1.0.7
OS:WIN 2008 R2 X64

数据库启动报错

Tue Apr 16 12:36:31 2013
alter database open
Beginning crash recovery of 1 threads
 parallel recovery started with 7 processes
Started redo scan
Completed redo scan
 28878 redo blocks read, 7353 data blocks need recovery
Started redo application at
 Thread 1: logseq 7960, block 14132
Recovery of Online Redo Log: Thread 1 Group 1 Seq 7960 Reading mem 0
  Mem# 0: D:\APP\SDWLJG-DB101\ORADATA\WLJG\REDO01.LOG
Tue Apr 16 12:36:32 2013
RECOVERY OF THREAD 1 STUCK AT BLOCK 915068 OF FILE 9
Hex dump of (file 9, block 1698691) in trace file c:\app\sdwljg-db101\diag\rdbms\wljg\wljg\trace\wljg_p001_1500.trc
Corrupt block relative dba: 0x0259eb83 (file 9, block 1698691)
Bad header found during crash/instance recovery
Data in bad block:
 type: 0 format: 0 rdba: 0x0000a206
 last change scn: 0x2359.0259eb83 seq: 0xf7 flg: 0x0b
 spare1: 0x0 spare2: 0x0 spare3: 0x601
 consistency value in tail: 0x02c10243
 check value in block header: 0x0
 block checksum disabled
Reread of rdba: 0x0259eb83 (file 9, block 1698691) found valid data
Slave exiting with ORA-1172 exception
Errors in file c:\app\sdwljg-db101\diag\rdbms\wljg\wljg\trace\wljg_p001_1500.trc:
ORA-01172: recovery of thread 1 stuck at block 915068 of file 9
ORA-01151: use media recovery to recover block, restore backup if needed
Tue Apr 16 12:36:32 2013
Errors in file c:\app\sdwljg-db101\diag\rdbms\wljg\wljg\trace\wljg_p003_4088.trc  (incident=187558):
ORA-00600: internal error code, arguments: [2037], [12619645], [41474], [6], [1], [247], [12619645], [0], [], [], [], []
Incident details in: c:\app\sdwljg-db101\diag\rdbms\wljg\wljg\incident\incdir_187558\wljg_p003_4088_i187558.trc
ORA-07445: exception encountered: core dump [kcbs_dump_adv_state()+1352] [ACCESS_VIOLATION]
 [ADDR:0xFFFFFFFFFFFFFFFF] [PC:0x16BFD20] [UNABLE_TO_READ] []
ORA-00600: internal error code, arguments: [2037], [12619645], [41474], [6], [1], [247], [12619645], [0], [], [], [], []
Incident details in: c:\app\sdwljg-db101\diag\rdbms\wljg\wljg\incident\incdir_187559\wljg_p003_4088_i187559.trc
Errors in file c:\app\sdwljg-db101\diag\rdbms\wljg\wljg\trace\wljg_p006_1216.trc  (incident=187567):

这里提示file 9 block 915068异常,但是通过dbv检查发现file 9无任何坏块.

trace文件内容

Dump continued from file: c:\app\sdwljg-db101\diag\rdbms\wljg\wljg\trace\wljg_p003_4088.trc
ORA-00600: internal error code, arguments: [2037], [12620930], [41474], [2], [1], [247], [12619645], [0], [], [], [], []

** DBGRL Error: ARB Alert Log
** DBGRL Error: <msg time='2013-04-16T11:05:58.522+08:00' org_id='oracle' comp_id='rdbms'
 msg_id='dbgexProcessError:1097:3370026720' type='TRACE' level='16'
 host_id='SDWLSCJG-DB' host_addr='172.18.1.15'>
 <txt>Incident details in: c:\app\sdwljg-db101\diag\rdbms\wljg\wlj
========= Dump for incident 129879 (ORA 600 [2037]) ========

*** 2013-04-16 11:05:58.522
----- SQL Statement (None) -----
Current SQL information unavailable - no cursor.

----- Call Stack Trace -----
calling              call     entry                argument values in hex      
location             type     point                (? means dubious value)     
-------------------- -------- -------------------- ----------------------------
ksedst1()+111        CALL???  skdstdst()+0         000000000 000000000 01CFC9B80
                                                   000000200
ksedst()+63          CALL???  ksedst1()+0          000000005 021B00600 005D30C80
                                                   000002004
dbkedDefDump()+1012  CALL???  ksedst()+0           000000000 000000000 000000000
                                                   000000000
ksedmp()+51          CALL???  dbkedDefDump()+0     000000003 000000002 021AF92C0
                                                   000405038
__PGOSF184_ksfdmp()  CALL???  ksedmp()+0           000000000 000000000 000000000
+27                                                27F00000000
dbgexPhaseII()+266   CALL???  __PGOSF184_ksfdmp()  00000000D 0082FAE50 000000000
                              +0                   000000004
dbgexProcessError()  CALL???  dbgexPhaseII()+0     021B00600 021AFCA50 000000201
+1313                                              000000000
dbgeExecuteForError  CALL???  dbgexProcessError()  021B00600 021B07590 000000001
()+55                         +0                   000000000
dbgePostErrorKGE()+  CALL???  dbgeExecuteForError  021AFCA30 021AFCA80 00000002E
1608                          ()+0                 000000005
dbkePostKGE_kgsf()+  CALL???  dbgePostErrorKGE()+  01CFC99D0 021B0E080 000000258
65                            0                    021B0E080
kgeade()+556         CALL???  dbkePostKGE_kgsf()+  000002000 000000000 000000009
                              0                    000000004
kgeriv_int()+105     CALL???  kgeade()+0           3A4F00000003 000C09482
                                                   0FFFFFFFF 000000000
kgeriv()+27          CALL???  kgeriv_int()+0       3A9A024E0 000000000 01CFC9410
                                                   000000000
kgesiv()+102         CALL???  kgeriv()+0           0000008D5 0000008C3 021AFD9A0
                                                   000AFDC73
ksesic7()+125        CALL???  kgesiv()+0           006371F20 000000007 27F912000
                                                   200000004
kcoexam()+248        CALL???  ksesic7()+0          2000007F5 000000000 000C09482
                                                   000000000
kcbtema()+2154       CALL???  kcoexam()+0          27FFC22C8 39E113470 3A940BBB8
                                                   000000000
kcrpap()+355         CALL???  kcbtema()+0          27FFC22C8 28BFC2628 000000000
                                                   021B10200
kcrpdv()+1655        CALL???  kcrpap()+0           021B101A0 000000002 000000004
                                                   000000512
kxfprdp()+1384       CALL???  kcrpdv()+0           3A7AD3098 000000000 00000000C
                                                   00757CF00
opirip()+1396        CALL???  kxfprdp()+0          00000001E 005CDB518 021AFF9E0
                                                   000000000
opidrv()+855         CALL???  opirip()+0           000000032 000000004 021AFFD30
                                                   000000000
sou2o()+52           CALL???  opidrv()+213         000000032 000000004 021AFFD30
                                                   021AFFDB0
opimai_real()+295    CALL???  sou2o()+0            000000000 7FEFD9819B5
                                                   000000000 000000000
opimai()+96          CALL???  opimai_real()+0      000000000 000000000 000000000
                                                   000000000
BackgroundThreadSta  CALL???  opimai()+0           021AFFE98 000000001 000000000
rt()+695                                           000000000
00000000775AF56D     CALL???  BackgroundThreadSta  00A26B7A0 000000000 000000000
                              rt()+0               000000000
0000000077923281     CALL???  00000000775AF560     000000000 000000000 000000000
                                                   000000000
 
--------------------- Binary Stack Dump ---------------------

查询mos发现During Startup (Open Database) Alert Log Shows ORA-600[2037] and ORA-7445[kcbs_dump_adv_state] [ID 551993.1]和我们这里展示的错误相符,引起该问题的原因主要是因为:The database may crash and fail to open due to undo/redo corruption if you are using distributed transactions.因为使用分布式事务的时候,数据库crash导致undo/redo corruption,从而使得数据库无法正常启动.

故障处理思路
因为通过数据库alert日志可以知道,数据库是在做前滚的时候并发进程失败,设置fast_start_parallel_rollback=false,禁止数据库实例恢复并发,可以恢复依然失败.因为前滚过不去,那就通过设置隐含参数禁止数据库前滚,在open数据库的过程中发现ora-600[2662]错误,推进scn,继续open数据库发现ora-600[4194],通过设置undo管理模式,屏蔽事务,屏蔽回滚段等方法,终于重新open库并重建undo,然后重建库算是完成恢复任务