ORA-600[2037]与ORA-07445[kcbs_dump_adv_state]错误

一台win oracle 数据库,重启后发现数据库无法访问,检查发现是Bug 4899479,但是oracle未提供完整的解决方法,这里根据自己对于数据库启动过程的理解,通过屏蔽前滚和回滚,拉起来数据库
数据库版本平台信息

ORACLE:11.1.0.7
OS:WIN 2008 R2 X64

数据库启动报错

Tue Apr 16 12:36:31 2013
alter database open
Beginning crash recovery of 1 threads
 parallel recovery started with 7 processes
Started redo scan
Completed redo scan
 28878 redo blocks read, 7353 data blocks need recovery
Started redo application at
 Thread 1: logseq 7960, block 14132
Recovery of Online Redo Log: Thread 1 Group 1 Seq 7960 Reading mem 0
  Mem# 0: D:\APP\SDWLJG-DB101\ORADATA\WLJG\REDO01.LOG
Tue Apr 16 12:36:32 2013
RECOVERY OF THREAD 1 STUCK AT BLOCK 915068 OF FILE 9
Hex dump of (file 9, block 1698691) in trace file c:\app\sdwljg-db101\diag\rdbms\wljg\wljg\trace\wljg_p001_1500.trc
Corrupt block relative dba: 0x0259eb83 (file 9, block 1698691)
Bad header found during crash/instance recovery
Data in bad block:
 type: 0 format: 0 rdba: 0x0000a206
 last change scn: 0x2359.0259eb83 seq: 0xf7 flg: 0x0b
 spare1: 0x0 spare2: 0x0 spare3: 0x601
 consistency value in tail: 0x02c10243
 check value in block header: 0x0
 block checksum disabled
Reread of rdba: 0x0259eb83 (file 9, block 1698691) found valid data
Slave exiting with ORA-1172 exception
Errors in file c:\app\sdwljg-db101\diag\rdbms\wljg\wljg\trace\wljg_p001_1500.trc:
ORA-01172: recovery of thread 1 stuck at block 915068 of file 9
ORA-01151: use media recovery to recover block, restore backup if needed
Tue Apr 16 12:36:32 2013
Errors in file c:\app\sdwljg-db101\diag\rdbms\wljg\wljg\trace\wljg_p003_4088.trc  (incident=187558):
ORA-00600: internal error code, arguments: [2037], [12619645], [41474], [6], [1], [247], [12619645], [0], [], [], [], []
Incident details in: c:\app\sdwljg-db101\diag\rdbms\wljg\wljg\incident\incdir_187558\wljg_p003_4088_i187558.trc
ORA-07445: exception encountered: core dump [kcbs_dump_adv_state()+1352] [ACCESS_VIOLATION]
 [ADDR:0xFFFFFFFFFFFFFFFF] [PC:0x16BFD20] [UNABLE_TO_READ] []
ORA-00600: internal error code, arguments: [2037], [12619645], [41474], [6], [1], [247], [12619645], [0], [], [], [], []
Incident details in: c:\app\sdwljg-db101\diag\rdbms\wljg\wljg\incident\incdir_187559\wljg_p003_4088_i187559.trc
Errors in file c:\app\sdwljg-db101\diag\rdbms\wljg\wljg\trace\wljg_p006_1216.trc  (incident=187567):

这里提示file 9 block 915068异常,但是通过dbv检查发现file 9无任何坏块.

trace文件内容

Dump continued from file: c:\app\sdwljg-db101\diag\rdbms\wljg\wljg\trace\wljg_p003_4088.trc
ORA-00600: internal error code, arguments: [2037], [12620930], [41474], [2], [1], [247], [12619645], [0], [], [], [], []

** DBGRL Error: ARB Alert Log
** DBGRL Error: <msg time='2013-04-16T11:05:58.522+08:00' org_id='oracle' comp_id='rdbms'
 msg_id='dbgexProcessError:1097:3370026720' type='TRACE' level='16'
 host_id='SDWLSCJG-DB' host_addr='172.18.1.15'>
 <txt>Incident details in: c:\app\sdwljg-db101\diag\rdbms\wljg\wlj
========= Dump for incident 129879 (ORA 600 [2037]) ========

*** 2013-04-16 11:05:58.522
----- SQL Statement (None) -----
Current SQL information unavailable - no cursor.

----- Call Stack Trace -----
calling              call     entry                argument values in hex      
location             type     point                (? means dubious value)     
-------------------- -------- -------------------- ----------------------------
ksedst1()+111        CALL???  skdstdst()+0         000000000 000000000 01CFC9B80
                                                   000000200
ksedst()+63          CALL???  ksedst1()+0          000000005 021B00600 005D30C80
                                                   000002004
dbkedDefDump()+1012  CALL???  ksedst()+0           000000000 000000000 000000000
                                                   000000000
ksedmp()+51          CALL???  dbkedDefDump()+0     000000003 000000002 021AF92C0
                                                   000405038
__PGOSF184_ksfdmp()  CALL???  ksedmp()+0           000000000 000000000 000000000
+27                                                27F00000000
dbgexPhaseII()+266   CALL???  __PGOSF184_ksfdmp()  00000000D 0082FAE50 000000000
                              +0                   000000004
dbgexProcessError()  CALL???  dbgexPhaseII()+0     021B00600 021AFCA50 000000201
+1313                                              000000000
dbgeExecuteForError  CALL???  dbgexProcessError()  021B00600 021B07590 000000001
()+55                         +0                   000000000
dbgePostErrorKGE()+  CALL???  dbgeExecuteForError  021AFCA30 021AFCA80 00000002E
1608                          ()+0                 000000005
dbkePostKGE_kgsf()+  CALL???  dbgePostErrorKGE()+  01CFC99D0 021B0E080 000000258
65                            0                    021B0E080
kgeade()+556         CALL???  dbkePostKGE_kgsf()+  000002000 000000000 000000009
                              0                    000000004
kgeriv_int()+105     CALL???  kgeade()+0           3A4F00000003 000C09482
                                                   0FFFFFFFF 000000000
kgeriv()+27          CALL???  kgeriv_int()+0       3A9A024E0 000000000 01CFC9410
                                                   000000000
kgesiv()+102         CALL???  kgeriv()+0           0000008D5 0000008C3 021AFD9A0
                                                   000AFDC73
ksesic7()+125        CALL???  kgesiv()+0           006371F20 000000007 27F912000
                                                   200000004
kcoexam()+248        CALL???  ksesic7()+0          2000007F5 000000000 000C09482
                                                   000000000
kcbtema()+2154       CALL???  kcoexam()+0          27FFC22C8 39E113470 3A940BBB8
                                                   000000000
kcrpap()+355         CALL???  kcbtema()+0          27FFC22C8 28BFC2628 000000000
                                                   021B10200
kcrpdv()+1655        CALL???  kcrpap()+0           021B101A0 000000002 000000004
                                                   000000512
kxfprdp()+1384       CALL???  kcrpdv()+0           3A7AD3098 000000000 00000000C
                                                   00757CF00
opirip()+1396        CALL???  kxfprdp()+0          00000001E 005CDB518 021AFF9E0
                                                   000000000
opidrv()+855         CALL???  opirip()+0           000000032 000000004 021AFFD30
                                                   000000000
sou2o()+52           CALL???  opidrv()+213         000000032 000000004 021AFFD30
                                                   021AFFDB0
opimai_real()+295    CALL???  sou2o()+0            000000000 7FEFD9819B5
                                                   000000000 000000000
opimai()+96          CALL???  opimai_real()+0      000000000 000000000 000000000
                                                   000000000
BackgroundThreadSta  CALL???  opimai()+0           021AFFE98 000000001 000000000
rt()+695                                           000000000
00000000775AF56D     CALL???  BackgroundThreadSta  00A26B7A0 000000000 000000000
                              rt()+0               000000000
0000000077923281     CALL???  00000000775AF560     000000000 000000000 000000000
                                                   000000000
 
--------------------- Binary Stack Dump ---------------------

查询mos发现During Startup (Open Database) Alert Log Shows ORA-600[2037] and ORA-7445[kcbs_dump_adv_state] [ID 551993.1]和我们这里展示的错误相符,引起该问题的原因主要是因为:The database may crash and fail to open due to undo/redo corruption if you are using distributed transactions.因为使用分布式事务的时候,数据库crash导致undo/redo corruption,从而使得数据库无法正常启动.

故障处理思路
因为通过数据库alert日志可以知道,数据库是在做前滚的时候并发进程失败,设置fast_start_parallel_rollback=false,禁止数据库实例恢复并发,可以恢复依然失败.因为前滚过不去,那就通过设置隐含参数禁止数据库前滚,在open数据库的过程中发现ora-600[2662]错误,推进scn,继续open数据库发现ora-600[4194],通过设置undo管理模式,屏蔽事务,屏蔽回滚段等方法,终于重新open库并重建undo,然后重建库算是完成恢复任务

非归档异常数据库rman备份

最近在数据库恢复中遇到一个案例:xx单位1.5T oracle 10.2.0.4(redhat 4.5),因为异常关闭操作系统,导致数据库不能启动,需要帮忙恢复。该数据库为非归档模式,使用裸设备,一个裸设备文件大小(35G),数据库文件大小4-30G都有,现在客户要求我们不能对现有环境进行任何操作,需要克隆一份数据库出来,然后在克隆的库上进行数据库恢复操作.数据库环境的克隆最好的方法就是使用rman来完成,但是该数据库为非归档模式,无法直接使用rman进行备份操作.最后采取dd的方式处理(需要注意dd文件大小为block_size*(v$datafile.blocks+1)+v$datafile.offset).因为不能使用rman的一条命令处理,心里一直不舒服,在家里实验,终于还是确定可以通过重建控制文件的方法来欺骗rman是归档模式,来实现rman完成类似工作.
数据库非非归档模式

SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - Production
PL/SQL Release 11.2.0.3.0 - Production
CORE    11.2.0.3.0      Production
TNS for Linux: Version 11.2.0.3.0 - Production
NLSRTL Version 11.2.0.3.0 - Production

SQL> archive log list;
Database log mode              No Archive Mode
Automatic archival             Disabled
Archive destination            /u01/oracle/oradata/ora11g/archivelog
Oldest online log sequence     7
Current log sequence           9

非归档模式尝试rman 备份

RMAN> backup database format '/u01/oracle/oradata/orall1g_%U';

Starting backup at 22-JAN-13
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=10 device type=DISK
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
RMAN-03009: failure of backup command on ORA_DISK_1 channel at 01/22/2013 16:10:49
ORA-19602: cannot backup or copy active file in NOARCHIVELOG mode
continuing other job steps, job failed will not be re-run
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
including current control file in backup set
including current SPFILE in backup set
channel ORA_DISK_1: starting piece 1 at 22-JAN-13
channel ORA_DISK_1: finished piece 1 at 22-JAN-13
piece handle=/u01/oracle/oradata/orall1g_13o02k8a_1_1 tag=TAG20130122T161048 comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:03
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================

rman的backup or copy命令不能在非归档模式下执行

尝试修改数据库为归档模式

SQL> alter database archivelog;
alter database archivelog
*
ERROR at line 1:
ORA-00265: instance recovery required, cannot set ARCHIVELOG mode

数据库非正常关闭,不能修改归档模式

重建控制文件

SQL> alter database backup controlfile to trace as '/tmp/ctl';

Database altered.

SQL> shutdown immediate
ORA-01109: database not open


Database dismounted.
ORACLE instance shut down.

--备份当前控制文件(保留控制文件现场)

SQL> STARTUP NOMOUNT
ORACLE instance started.

Total System Global Area  313860096 bytes
Fixed Size                  1344652 bytes
Variable Size             285215604 bytes
Database Buffers           20971520 bytes
Redo Buffers                6328320 bytes
SQL> CREATE CONTROLFILE REUSE DATABASE "ORA11G" NORESETLOGS  ARCHIVELOG
  2      MAXLOGFILES 16
  3      MAXLOGMEMBERS 3
  4      MAXDATAFILES 100
  5      MAXINSTANCES 8
  6      MAXLOGHISTORY 292
  7  LOGFILE
  8    GROUP 1 '/u01/oracle/oradata/ora11g/redo01.log'  SIZE 50M BLOCKSIZE 512,
  9    GROUP 2 '/u01/oracle/oradata/ora11g/redo02.log'  SIZE 50M BLOCKSIZE 512,
 10    GROUP 3 '/u01/oracle/oradata/ora11g/redo03.log'  SIZE 50M BLOCKSIZE 512
 11  DATAFILE
 12    '/u01/oracle/oradata/ora11g/system01.dbf',
 13    '/u01/oracle/oradata/ora11g/sysaux01.dbf',
 14    '/u01/oracle/oradata/ora11g/users01.dbf',
 15    '/u01/oracle/oradata/ora11g/dbfs01.dbf',
 16    '/u01/oracle/oradata/ora11g/tts_xifenfei02.dbf',
 17    '/u01/oracle/oradata/ora11g/tts_xifenfei01.dbf',
 18    '/u01/oracle/oradata/ora11g/system02.dbf',
 19    '/u01/oracle/oradata/ora11g/czum01.dbf',
 20    '/u01/oracle/oradata/ora11g/undotbs02.dbf'
 21  CHARACTER SET ZHS16GBK
 22  ;

Control file created.

数据库已经变为归档模式

SQL> archive log list;
Database log mode              Archive Mode
Automatic archival             Disabled
Archive destination            /u01/oracle/oradata/ora11g/archivelog
Oldest online log sequence     7
Next log sequence to archive   7
Current log sequence           9

归档模式尝试rman备份

RMAN> backup datafile 1 format '/u01/oracle/oradata/system01_%U';

Starting backup at 22-JAN-13
using channel ORA_DISK_1
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00001 name=/u01/oracle/oradata/ora11g/system01.dbf
channel ORA_DISK_1: starting piece 1 at 22-JAN-13
channel ORA_DISK_1: finished piece 1 at 22-JAN-13
piece handle=/u01/oracle/oradata/system01_02o02kl7_1_1 tag=TAG20130122T161742 comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:01:15
Finished backup at 22-JAN-13

RMAN-06497: WARNING: control file is not current, control file AUTOBACKUP skipped

因为现在的控制文件是新创建的,不能算是数据库的当前控制文件,所以未被rman自动备份(很好理解,重建控制文件后,我们做恢复都要使用using backup controlfile命令)

总结说明
1.数据库为非归档模式,不能使用rman的backup和copy命令来备份
2.因为数据库为非正常关闭不能直接修改为归档模式
3.通过重建控制文件修改数据库(注意备份)为归档模式实现rman正常备份
4.当rman备份好之后,使用原先控制文件替换现在控制文件

关于blockrecover 解决坏块相关测试与总结

悲剧的客户因为IBM p系列小机更换电源导致主机直接掉电,起来后发现数据库出现不少坏块,而且还有部分坏块中含有回滚事务,导致alert日志一直报smon回滚遇到坏块错误,该数据库版本是9.2.0.8 RAC,根据客户的备份情况,为了减少对业务的影响,决定使用blockrecover对其处理.这里通过10g数据库大概模拟出现含事务坏块的情况以及处理过程,重现了我们在处理的时候不确定的一些知识.
创建测试表

SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - Prod
PL/SQL Release 10.2.0.4.0 - Production
CORE    10.2.0.4.0      Production
TNS for Linux: Version 10.2.0.4.0 - Production
NLSRTL Version 10.2.0.4.0 - Production

SQL> conn /as sysdba
Connected.
SQL> archive log list;
Database log mode              Archive Mode
Automatic archival             Enabled
Archive destination            /u01/oracle/oradata/XFF/archivelog
Oldest online log sequence     231
Next log sequence to archive   233
Current log sequence           233
SQL> conn chf/xifenfei
Connected.
SQL> create table t_xifenfei
  2  as
  3  select * from dba_objects where rownum<10;

Table created.

SQL> select rowid,
  2  dbms_rowid.rowid_relative_fno(rowid) rel_fno,
  3  dbms_rowid.rowid_block_number(rowid) block
  4  from chf.t_xifenfei;

ROWID                 REL_FNO      BLOCK
------------------ ---------- ----------
AAANIqAAEAAAAAcAAA          4         28
AAANIqAAEAAAAAcAAB          4         28
AAANIqAAEAAAAAcAAC          4         28
AAANIqAAEAAAAAcAAD          4         28
AAANIqAAEAAAAAcAAE          4         28
AAANIqAAEAAAAAcAAF          4         28
AAANIqAAEAAAAAcAAG          4         28
AAANIqAAEAAAAAcAAH          4         28
AAANIqAAEAAAAAcAAI          4         28

9 rows selected.

当前的seq是233(也就是说我在233归档上创建了t_xienfei表)

dbv检查block

[oracle@xifenfei ~]$ dbv file='/u01/oracle/oradata/XFF/users01.dbf'

DBVERIFY: Release 10.2.0.4.0 - Production on Sun Jan 20 22:16:16 2013

Copyright (c) 1982, 2007, Oracle.  All rights reserved.

DBVERIFY - Verification starting : FILE = /u01/oracle/oradata/XFF/users01.dbf


DBVERIFY - Verification complete

Total Pages Examined         : 1280
Total Pages Processed (Data) : 904
Total Pages Failing   (Data) : 0
Total Pages Processed (Index): 38
Total Pages Failing   (Index): 0
Total Pages Processed (Other): 44
Total Pages Processed (Seg)  : 0
Total Pages Failing   (Seg)  : 0
Total Pages Empty            : 294
Total Pages Marked Corrupt   : 0
Total Pages Influx           : 0
Highest block SCN            : 3224018224 (2860.3224018224)

证明无任何坏块

切换归档

SQL> alter system switch logfile;

System altered.

SQL> /

System altered.

SQL> archive log list;
Database log mode              Archive Mode
Automatic archival             Enabled
Archive destination            /u01/oracle/oradata/XFF/archivelog
Oldest online log sequence     233
Next log sequence to archive   235
Current log sequence           235

现在已经切换seq到235

rman备份我们需要测试block(file 4 block 28)对应的数据文件

RMAN>  backup datafile 4 format '/u01/oracle/oradata/xff_4.rman';

具体见:rman制造坏块,bbed修复坏块

模拟数据库进行其他操作

SQL> alter system switch logfile;

System altered.

SQL> /

System altered.

SQL> archive log list;
Database log mode              Archive Mode
Automatic archival             Enabled
Archive destination            /u01/oracle/oradata/XFF/archivelog
Oldest online log sequence     235
Next log sequence to archive   237
Current log sequence           237
SQL> conn chf/xifenfei
Connected.
SQL> create table t_xff 
  2  as
  3  select * from dba_objects;

Table created.

SQL> alter system switch logfile;

System altered.

主要是为了模拟对其他block操作,对于block 28的恢复影响

对block 28进行操作

SQL> archive log list;
Database log mode              Archive Mode
Automatic archival             Enabled
Archive destination            /u01/oracle/oradata/XFF/archivelog
Oldest online log sequence     237
Next log sequence to archive   239
Current log sequence           239
SQL> update chf.t_xifenfei set object_name='www.orasos.com';

9 rows updated.

SQL> commit;

Commit complete.

SQL> alter system switch logfile;

System altered.

SQL> /

System altered.

SQL> /

System altered.

在seq为239的时候对block 28进行了一次update操作

模拟其他业务操作

SQL> archive log list;
Database log mode              Archive Mode
Automatic archival             Enabled
Archive destination            /u01/oracle/oradata/XFF/archivelog
Oldest online log sequence     240
Next log sequence to archive   242
Current log sequence           242
SQL> delete from chf.t_xff ;

50491 rows deleted.

SQL> commit;

Commit complete.

SQL> alter system switch logfile;

System altered.

SQL> /

System altered.

SQL> archive log list;
Database log mode              Archive Mode
Automatic archival             Enabled
Archive destination            /u01/oracle/oradata/XFF/archivelog
Oldest online log sequence     242
Next log sequence to archive   244
Current log sequence           244
SQL>  alter system switch logfile;  

System altered.

SQL> /

System altered.

这里可以知道在seq为246的时候做了备份归档操作

备份归档操作

RMAN> backup archivelog all format '/u01/oracle/oradata/xff_arch_%U' delete input;

模拟继续操作

SQL> alter system switch logfile;

System altered.

SQL> /

System altered.

SQL> archive log list;
Database log mode              Archive Mode
Automatic archival             Enabled
Archive destination            /u01/oracle/oradata/XFF/archivelog
Oldest online log sequence     248
Next log sequence to archive   250
Current log sequence           250

SQL> update chf.t_xifenfei set object_name='www.orasos.com' where rownum<5;

4 rows updated.

这里可以发现,在seq为250的时候我们再次对block 28进行了操作

使用rman制造坏块

RMAN> BLOCKRECOVER DATAFILE 4 block 28 clear;

dbv检查坏块

[oracle@xifenfei ~]$ dbv file='/u01/oracle/oradata/XFF/users01.dbf'

DBVERIFY: Release 10.2.0.4.0 - Production on Sun Jan 20 23:01:24 2013

Copyright (c) 1982, 2007, Oracle.  All rights reserved.

DBVERIFY - Verification starting : FILE = /u01/oracle/oradata/XFF/users01.dbf
Page 28 is influx - most likely media corrupt
Corrupt block relative dba: 0x0100001c (file 4, block 28)
Fractured block found during dbv: 
Data in bad block:
 type: 6 format: 2 rdba: 0x0100001c
 last change scn: 0x0b2c.c02ab081 seq: 0x1 flg: 0x04
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0xb0c4a6ea
 check value in block header: 0x393f
 computed block checksum: 0xc917



DBVERIFY - Verification complete

Total Pages Examined         : 1280
Total Pages Processed (Data) : 903
Total Pages Failing   (Data) : 0
Total Pages Processed (Index): 38
Total Pages Failing   (Index): 0
Total Pages Processed (Other): 47
Total Pages Processed (Seg)  : 0
Total Pages Failing   (Seg)  : 0
Total Pages Empty            : 291
Total Pages Marked Corrupt   : 1
Total Pages Influx           : 1
Highest block SCN            : 3224022228 (2860.3224022228)

强制kill数据库

[oracle@xifenfei ~]$ ps -ef|grep pmon
oracle    9744  9638  0 23:03 pts/1    00:00:00 grep pmon
oracle   32156     1  0 14:17 ?        00:00:10 ora_pmon_XFF
[oracle@xifenfei ~]$ kill -9 32156
[oracle@xifenfei ~]$ ps -ef|grep pmon
oracle    9751  9638  0 23:03 pts/1    00:00:00 grep pmon

为了模拟含事务的block出现坏块

启动数据库

SQL> startup mount
ORACLE instance started.

Total System Global Area  306184192 bytes
Fixed Size                  1267164 bytes
Variable Size             109054500 bytes
Database Buffers          188743680 bytes
Redo Buffers                7118848 bytes
Database mounted.
SQL> ALTER DATABASE OPEN;

Database altered.

数据库启动正常

查询坏块

SQL> select count(*) from chf.t_xifenfei;
select count(*) from chf.t_xifenfei
                         *
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 4, block # 28)
ORA-01110: data file 4: '/u01/oracle/oradata/XFF/users01.dbf'

alert日志

Sun Jan 20 23:04:37 2013
SMON: enabling tx recovery
Sun Jan 20 23:04:37 2013
Database Characterset is ZHS16GBK
Sun Jan 20 23:04:37 2013
Hex dump of (file 4, block 28) in trace file /u01/oracle/admin/XFF/bdump/xff_smon_9775.trc
Corrupt block relative dba: 0x0100001c (file 4, block 28)
Fractured block found during buffer read
Data in bad block:
 type: 6 format: 2 rdba: 0x0100001c
 last change scn: 0x0b2c.c02ab081 seq: 0x1 flg: 0x04
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0xb0c4a6ea
 check value in block header: 0x393f
 computed block checksum: 0xc917
Reread of rdba: 0x0100001c (file 4, block 28) found same corrupted data
ORACLE Instance XFF (pid = 8) - Error 1578 encountered while recovering transaction (9, 37) on object 53802.
Sun Jan 20 23:04:38 2013
Errors in file /u01/oracle/admin/XFF/bdump/xff_smon_9775.trc:
ORA-01578: ORACLE data block corrupted (file # 4, block # 28)
ORA-01110: data file 4: '/u01/oracle/oradata/XFF/users01.dbf'

通过试验步骤和alert日志可以发现因为block有事务,但是被标记为了坏块,所以smon无法回滚该事务,从而出现alert中类似提示

继续切换归档

SQL> alter system switch logfile;

System altered.

SQL> /

System altered.

SQL> /

System altered.

SQL> /

System altered.

SQL> archive log list;
Database log mode              Archive Mode
Automatic archival             Enabled
Archive destination            /u01/oracle/oradata/XFF/archivelog
Oldest online log sequence     252
Next log sequence to archive   254
Current log sequence           254

移走归档

[oracle@xifenfei archivelog]$ ls -l
total 2224
-rw-r----- 1 oracle oinstall  360960 Jan 20 22:59 1_247_792679299.dbf
-rw-r----- 1 oracle oinstall    1024 Jan 20 22:59 1_248_792679299.dbf
-rw-r----- 1 oracle oinstall 1630208 Jan 20 23:04 1_249_792679299.dbf
-rw-r----- 1 oracle oinstall  249344 Jan 20 23:09 1_250_792679299.dbf
-rw-r----- 1 oracle oinstall    1024 Jan 20 23:09 1_251_792679299.dbf
-rw-r----- 1 oracle oinstall    4608 Jan 20 23:09 1_252_792679299.dbf
-rw-r----- 1 oracle oinstall    1024 Jan 20 23:09 1_253_792679299.dbf
[oracle@xifenfei archivelog]$ mkdir bak
[oracle@xifenfei archivelog]$ mv *.dbf bak
[oracle@xifenfei archivelog]$ ll
total 4
drwxr-xr-x 2 oracle oinstall 4096 Jan 20 23:11 bak

为了重现,当我们使用blockrecover恢复的时候,如果缺少归档会怎么样

尝试blockrecover

RMAN> BLOCKRECOVER DATAFILE 4 block 28;

Starting blockrecover at 20-JAN-13
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=135 devtype=DISK
allocated channel: ORA_DISK_2
channel ORA_DISK_2: sid=126 devtype=DISK
allocated channel: ORA_DISK_3
channel ORA_DISK_3: sid=125 devtype=DISK
allocated channel: ORA_DISK_4
channel ORA_DISK_4: sid=124 devtype=DISK

channel ORA_DISK_1: restoring block(s)
channel ORA_DISK_1: specifying block(s) to restore from backup set
restoring blocks of datafile 00004
channel ORA_DISK_1: reading from backup piece /u01/oracle/oradata/xff_4.rman
channel ORA_DISK_1: restored block(s) from backup piece 1
piece handle=/u01/oracle/oradata/xff_4.rman tag=TAG20130120T222333
channel ORA_DISK_1: block restore complete, elapsed time: 00:00:01

starting media recovery

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of blockrecover command at 01/20/2013 23:11:41
RMAN-06053: unable to perform media recovery because of missing log
RMAN-06025: no backup of log thread 1 seq 253 lowscn 12286830510311 found to restore
RMAN-06025: no backup of log thread 1 seq 252 lowscn 12286830510307 found to restore
RMAN-06025: no backup of log thread 1 seq 251 lowscn 12286830510305 found to restore
RMAN-06025: no backup of log thread 1 seq 250 lowscn 12286830509979 found to restore
RMAN-06025: no backup of log thread 1 seq 249 lowscn 12286830489543 found to restore
RMAN-06025: no backup of log thread 1 seq 248 lowscn 12286830489541 found to restore
RMAN-06025: no backup of log thread 1 seq 247 lowscn 12286830489279 found to restore

alert日志记录

Sun Jan 20 23:11:38 2013
alter database recover datafile list clear
Sun Jan 20 23:11:38 2013
Completed: alter database recover datafile list clear
Sun Jan 20 23:11:38 2013
Starting block media recovery
Sun Jan 20 23:11:39 2013
Media Recovery Log /u01/oracle/oradata/XFF/archivelog/1_235_792679299.dbf
Sun Jan 20 23:11:41 2013
alter database recover cancel
Sun Jan 20 23:11:41 2013
Media Recovery Canceled
Completed: alter database recover cancel

blockrecover恢复途中或者异常终止,dbv检测

[oracle@xifenfei ~]$ dbv file=/u01/oracle/oradata/XFF/users01.dbf

DBVERIFY: Release 10.2.0.4.0 - Production on Sun Jan 20 23:18:29 2013

Copyright (c) 1982, 2007, Oracle.  All rights reserved.

DBVERIFY - Verification starting : FILE = /u01/oracle/oradata/XFF/users01.dbf
Page 28 is marked corrupt
Corrupt block relative dba: 0x0100001c (file 4, block 28)
Bad check value found during dbv: 
Data in bad block:
 type: 6 format: 2 rdba: 0x0100001c
 last change scn: 0x0b2c.c02b0248 seq: 0x1 flg: 0x04
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0x02480601
 check value in block header: 0x13fc
 computed block checksum: 0x663b



DBVERIFY - Verification complete

Total Pages Examined         : 1280
Total Pages Processed (Data) : 903
Total Pages Failing   (Data) : 0
Total Pages Processed (Index): 38
Total Pages Failing   (Index): 0
Total Pages Processed (Other): 47
Total Pages Processed (Seg)  : 0
Total Pages Failing   (Seg)  : 0
Total Pages Empty            : 291
Total Pages Marked Corrupt   : 1
Total Pages Influx           : 0
Highest block SCN            : 3224022228 (2860.3224022228)

在blockrecover处理过程,或者处理失败的后,block依然是坏块,如果遇到这类情况,没有事务可以直接指定seq/scn/time的方法来恢复,如果有事务,需要指定恢复时间点过该事务的时间点,让smon能够正常回滚,从而使得smon进程正常工作

还原归档后继续测试

[oracle@xifenfei archivelog]$ mv bak/* ./

RMAN> BLOCKRECOVER DATAFILE 4 block 28;

starting media recovery
archive log thread 1 sequence 247 is already on disk as file /u01/oracle/oradata/XFF/archivelog/1_247_792679299.dbf
…………
media recovery complete, elapsed time: 00:00:01
Finished blockrecover at 20-JAN-13

[oracle@xifenfei ~]$ dbv file=/u01/oracle/oradata/XFF/users01.dbf

DBVERIFY: Release 10.2.0.4.0 - Production on Sun Jan 20 23:15:43 2013

Copyright (c) 1982, 2007, Oracle.  All rights reserved.

DBVERIFY - Verification starting : FILE = /u01/oracle/oradata/XFF/users01.dbf


DBVERIFY - Verification complete

Total Pages Examined         : 1280
Total Pages Processed (Data) : 904
Total Pages Failing   (Data) : 0
Total Pages Processed (Index): 38
Total Pages Failing   (Index): 0
Total Pages Processed (Other): 47
Total Pages Processed (Seg)  : 0
Total Pages Failing   (Seg)  : 0
Total Pages Empty            : 291
Total Pages Marked Corrupt   : 0
Total Pages Influx           : 0
Highest block SCN            : 3224023169 (2860.3224023169)

>
blockrecover处理alert日志

Sun Jan 20 23:15:01 2013
Media Recovery Log /u01/oracle/oradata/XFF/archivelog/1_250_792679299.dbf
Sun Jan 20 23:15:01 2013
Media Recovery Log /u01/oracle/oradata/XFF/archivelog/1_251_792679299.dbf
Sun Jan 20 23:15:01 2013
Recovery of Online Redo Log: Thread 1 Group 3 Seq 252 Reading mem 0
  Mem# 0: /u01/oracle/oradata/XFF/redo03.log
Sun Jan 20 23:15:01 2013
Recovery of Online Redo Log: Thread 1 Group 1 Seq 253 Reading mem 0
  Mem# 0: /u01/oracle/oradata/XFF/redo01.log
Sun Jan 20 23:15:01 2013
Recovery of Online Redo Log: Thread 1 Group 2 Seq 254 Reading mem 0
  Mem# 0: /u01/oracle/oradata/XFF/redo02.log
Sun Jan 20 23:15:02 2013
Completed block media recovery

补充说明
1.在9i中使用blockrecover会在Archive destination中生产block的备份文件,类似469_519791_3063_2442393528.bkd(file 469 block 519791),需要注意Archive destination目录结尾需要”/”,不然可能出现直接写入和该目录并列的Archive destination+blockrecover产生文件;10g中不生成该文件;如果该block在9i中未备份,也不会在生产相关文件,而是利用归档恢复.
2.blockrecover需要还原从该block从备份之后的所有归档(如果被备份起来需要还原出来),如果缺少归档可能导致恢复失败,包括基于scn/seq/time的等
3.blockrecover在执行过程中或者执行异常终止,该block依然是坏块,不会对其他block产生影响.在使用blockrecover出现异常终止后的block修复,在后续blog中提供解决方法
4.对于不含事务的坏块,如果数据允许丢失可以通过设置event跳过坏块(特殊block除外),然后重建对象;对于含event的block 坏块,建议使用blockrecover处理或者直接恢复数据文件,如果没有备份,考试使用event跳过事务回滚,然后屏蔽坏块处理

rman制造坏块,bbed修复坏块

在rman中有隐藏的命令clear,可以标记数据块为corrupt,从而实现数据库坏块试验,本篇blog通过dbv,bbed,table select来验证坏块的出现和修复
创建测试表

SQL>  create table t_xifenfei as  
  2   select object_id,object_name from dba_objects where rownum<10;

Table created.

SQL>  select rowid,
  2   dbms_rowid.rowid_relative_fno(rowid) rel_fno,
  3   dbms_rowid.rowid_block_number(rowid) block
  4   from chf.t_xifenfei;

ROWID                 REL_FNO      BLOCK
------------------ ---------- ----------
AAAStAAAEAAAACrAAA          4        171
AAAStAAAEAAAACrAAB          4        171
AAAStAAAEAAAACrAAC          4        171
AAAStAAAEAAAACrAAD          4        171
AAAStAAAEAAAACrAAE          4        171
AAAStAAAEAAAACrAAF          4        171
AAAStAAAEAAAACrAAG          4        171
AAAStAAAEAAAACrAAH          4        171
AAAStAAAEAAAACrAAI          4        171

9 rows selected.

SQL> alter system checkpoint;

System altered.

dbv验证坏块

[oracle@xifenfei ~]$ dbv file=/u01/oracle/oradata/ora11g/users01.dbf

DBVERIFY: Release 11.2.0.3.0 - Production on Sun Jan 20 09:12:16 2013

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

DBVERIFY - Verification starting : FILE = /u01/oracle/oradata/ora11g/users01.dbf


DBVERIFY - Verification complete

Total Pages Examined         : 7680
Total Pages Processed (Data) : 3776
Total Pages Failing   (Data) : 0
Total Pages Processed (Index): 167
Total Pages Failing   (Index): 0
Total Pages Processed (Other): 744
Total Pages Processed (Seg)  : 0
Total Pages Failing   (Seg)  : 0
Total Pages Empty            : 2993
Total Pages Marked Corrupt   : 0
Total Pages Influx           : 0
Total Pages Encrypted        : 0
Highest block SCN            : 1006717955 (2956.1006717955)

这里创建了t_xifenfei表,数据存储在file 4 block 171中,现在该block一切正常,本试验就是要通过rman来使得该block corrupt,然后通过bbed来修复

bbed查看kcbh

BBED> set block 171
        BLOCK#          171

BBED> p kcbh
struct kcbh, 20 bytes                       @0       
   ub1 type_kcbh                            @0        0x06
   ub1 frmt_kcbh                            @1        0xa2
   ub1 spare1_kcbh                          @2        0x00
   ub1 spare2_kcbh                          @3        0x00
   ub4 rdba_kcbh                            @4        0x010000ab
   ub4 bas_kcbh                             @8        0x3c014bfe
   ub2 wrp_kcbh                             @12       0x0b8c
   ub1 seq_kcbh                             @14       0x02
   ub1 flg_kcbh                             @15       0x04 (KCBHFCKV)
   ub2 chkval_kcbh                          @16       0x6e0c  <--重点关注
   ub2 spare3_kcbh                          @18       0x0000

rman标记坏块

RMAN> BLOCKRECOVER DATAFILE 4 block 171 clear;

Starting recover at 20-JAN-13
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=18 device type=DISK
Finished recover at 20-JAN-13

dbv再次检查坏块

[oracle@xifenfei ~]$ dbv file=/u01/oracle/oradata/ora11g/users01.dbf

DBVERIFY: Release 11.2.0.3.0 - Production on Sun Jan 20 09:53:16 2013

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

DBVERIFY - Verification starting : FILE = /u01/oracle/oradata/ora11g/users01.dbf
Page 171 is marked corrupt
Corrupt block relative dba: 0x010000ab (file 4, block 171)
Bad check value found during dbv: 
Data in bad block:
 type: 6 format: 2 rdba: 0x010000ab
 last change scn: 0x0b8c.3c014bfe seq: 0x2 flg: 0x04
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0x4bfe0602
 check value in block header: 0x6e0c
 computed block checksum: 0xb5bc


DBVERIFY - Verification complete

Total Pages Examined         : 7680
Total Pages Processed (Data) : 3775
Total Pages Failing   (Data) : 0
Total Pages Processed (Index): 167
Total Pages Failing   (Index): 0
Total Pages Processed (Other): 744
Total Pages Processed (Seg)  : 0
Total Pages Failing   (Seg)  : 0
Total Pages Empty            : 2993
Total Pages Marked Corrupt   : 1
Total Pages Influx           : 1
Total Pages Encrypted        : 0
Highest block SCN            : 1006717955 (2956.1006717955)

尝试查询数据

SQL> select count(*) from t_xifenfei;
select count(*) from t_xifenfei
                     *
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 4, block # 171)
ORA-01110: data file 4: '/u01/oracle/oradata/ora11g/users01.dbf'

bbed验证坏块

BBED> set block 171
        BLOCK#          171

BBED> p kcbh
struct kcbh, 20 bytes                       @0       
   ub1 type_kcbh                            @0        0x06
   ub1 frmt_kcbh                            @1        0xa2
   ub1 spare1_kcbh                          @2        0x00
   ub1 spare2_kcbh                          @3        0x00
   ub4 rdba_kcbh                            @4        0x010000ab
   ub4 bas_kcbh                             @8        0x3c014bfe
   ub2 wrp_kcbh                             @12       0x0b8c
   ub1 seq_kcbh                             @14       0x02
   ub1 flg_kcbh                             @15       0x04 (KCBHFCKV)
   ub2 chkval_kcbh                          @16       0x6e0c  <--注意该值未变化
   ub2 spare3_kcbh                          @18       0x0000

BBED> verify
DBVERIFY - Verification starting
FILE = /u01/oracle/oradata/ora11g/users01.dbf
BLOCK = 171

Block 171 is corrupt
Corrupt block relative dba: 0x010000ab (file 0, block 171)
Bad check value found during verification
Data in bad block:
 type: 6 format: 2 rdba: 0x010000ab
 last change scn: 0x0b8c.3c014bfe seq: 0x2 flg: 0x04
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0x4bfe0602
 check value in block header: 0x6e0c
 computed block checksum: 0xb5bc


DBVERIFY - Verification complete

Total Blocks Examined         : 1
Total Blocks Processed (Data) : 0
Total Blocks Failing   (Data) : 0
Total Blocks Processed (Index): 0
Total Blocks Failing   (Index): 0
Total Blocks Empty            : 0
Total Blocks Marked Corrupt   : 1
Total Blocks Influx           : 0
Message 531 not found;  product=RDBMS; facility=BBED

这里通过dbv,bbed,select table都证明rman能够标记block为Corrupt.
这里需要分析:block已经被标记,那证明该块肯定有修改,也就是说chkval_kcbh一定要变化,但是这里没有变化,证明该处异常

bbed修复rman生成坏块

BBED> sum apply
Warning: contents of previous BIFILE will be lost. Proceed? (Y/N) y
Check value for File 0, Block 171:
current = 0xdbb0, required = 0xdbb0

BBED> verify
DBVERIFY - Verification starting
FILE = /u01/oracle/oradata/ora11g/users01.dbf
BLOCK = 171


DBVERIFY - Verification complete

Total Blocks Examined         : 1
Total Blocks Processed (Data) : 1
Total Blocks Failing   (Data) : 0
Total Blocks Processed (Index): 0
Total Blocks Failing   (Index): 0
Total Blocks Empty            : 0
Total Blocks Marked Corrupt   : 0
Total Blocks Influx           : 0
Message 531 not found;  product=RDBMS; facility=BBED

bbed测试坏块已经修复

dbv验证坏块

[oracle@xifenfei ~]$ dbv file=/u01/oracle/oradata/ora11g/users01.dbf

DBVERIFY: Release 11.2.0.3.0 - Production on Sun Jan 20 10:01:46 2013

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

DBVERIFY - Verification starting : FILE = /u01/oracle/oradata/ora11g/users01.dbf

DBVERIFY - Verification complete

Total Pages Examined         : 7680
Total Pages Processed (Data) : 3776
Total Pages Failing   (Data) : 0
Total Pages Processed (Index): 167
Total Pages Failing   (Index): 0
Total Pages Processed (Other): 744
Total Pages Processed (Seg)  : 0
Total Pages Failing   (Seg)  : 0
Total Pages Empty            : 2993
Total Pages Marked Corrupt   : 0
Total Pages Influx           : 0
Total Pages Encrypted        : 0
Highest block SCN            : 1006717955 (2956.1006717955)

dbv测试坏块也被修复

查询表验证

SQL> select count(*) from t_xifenfei;

  COUNT(*)
----------
         9

数据也可以正常查询,证明rman的坏块问题通过bbed解决

总结说明
1.我们可以通过rman的clear命令来标记坏块(BLOCKRECOVER DATAFILE file# BLOCK block1#, block2#, block3#… CLEAR ;)
2.我们可以通过bbed的sum apply命令来修复该类型坏块

ORA-600 kghstack_free2异常恢复

今天接触一案例ORA-00600[kghstack_free2],恢复起来很简单,顺便记录下,供其他人遇到类似情况参考.

Wed Feb 20 15:05:04 2013
SMON: enabling cache recovery
Wed Feb 20 15:05:05 2013
Successfully onlined Undo Tablespace 1.
Wed Feb 20 15:05:05 2013
SMON: enabling tx recovery
Wed Feb 20 15:05:06 2013
Errors in file f:\oracle\product\10.2.0\admin\cksoltp\bdump\cksoltp_smon_3556.trc:
ORA-00600: internal error code, arguments: [kghstack_free2], [], [], [], [], [], [], []

Wed Feb 20 15:05:06 2013
Database Characterset is AL32UTF8
replication_dependency_tracking turned off (no async multimaster replication found)
Wed Feb 20 15:05:08 2013
Errors in file f:\oracle\product\10.2.0\admin\cksoltp\bdump\cksoltp_smon_3556.trc:
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [_kghstack_err+101] 
[PC:0x603CC77F] [ADDR:0x59004E04] [UNABLE_TO_READ] []
ORA-00600: internal error code, arguments: [kghstack_free2], [], [], [], [], [], [], []

Starting background process QMNC
QMNC started with pid=18, OS id=3504
Wed Feb 20 15:05:13 2013
Errors in file f:\oracle\product\10.2.0\admin\cksoltp\bdump\cksoltp_pmon_4064.trc:
ORA-00474: SMON process terminated with error

这个错误很明显是因为smon在回滚事务的时候,遇到异常从而出现ORA-00600[kghstack_free2]错误,使得数据库不能被正常open,通过分析trace文件发现回滚段36中有事务未提交,需要回滚,针对这样的数据库恢复方法很简单:1.屏蔽事务回滚,2.屏蔽回滚段强制offline等
1) EVENT=”10513 trace name context forever”
2)_offline_rollback_segments= _SYSSMU36$
数据库启动后,通过分析trace文件,找出来异常对象,然后重建该对象,除掉event/_offline_rollback_segments即可