存储异常导致数据库不能启动恢复

xx医院存储突然掉线,导致数据库异常,现场工程师折腾了一天,问题依旧没有解决,无奈之下找到我们,希望我们能够帮忙恢复数据库.
启动报ORA-00600[2131]错误

Fri Nov 06 14:50:59 2015
ALTER DATABASE   MOUNT
This instance was first to mount
Fri Nov 06 14:50:59 2015
ALTER SYSTEM SET local_listener=' (ADDRESS=(PROTOCOL=TCP)(HOST=192.168.4.4)(PORT=1521))' SCOPE=MEMORY SID='xifenfei1';
NOTE: Loaded library: System 
SUCCESS: diskgroup DATA was mounted
NOTE: dependency between database xifenfei and diskgroup resource ora.DATA.dg is established
Errors in file /home/app/oracle/diag/rdbms/xifenfei/xifenfei1/trace/xifenfei1_ora_13221.trc  (incident=191085):
ORA-00600: internal error code, arguments: [2131], [33], [32], [], [], [], [], [], [], [], [], []
Incident details in: /home/app/oracle/diag/rdbms/xifenfei/xifenfei1/incident/incdir_191085/xifenfei1_ora_13221_i191085.trc
Fri Nov 06 14:51:10 2015
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
ORA-600 signalled during: ALTER DATABASE   MOUNT...

出现该错误的原因是由于:We are attempting to write a controlfile checkpoint progress record, but find we do not have the progress record generating this exception.由于控制文件异常导致,出现此类情况,我们一般使用单个控制文件一次尝试,如果都不可以考虑重建控制文件

由于坏块(逻辑/物理)导致数据库实例恢复无法进行

Beginning crash recovery of 2 threads
Started redo scan
kcrfr_rnenq: use log nab 393216
kcrfr_rnenq: use log nab 2
Completed redo scan
 read 4427 KB redo, 500 data blocks need recovery
Started redo application at
 Thread 1: logseq 5731, block 391398
 Thread 2: logseq 4252, block 520815
Recovery of Online Redo Log: Thread 1 Group 2 Seq 5731 Reading mem 0
  Mem# 0: +DATA/xifenfei/onlinelog/group_2.266.835331047
Recovery of Online Redo Log: Thread 2 Group 8 Seq 4252 Reading mem 0
  Mem# 0: +DATA/xifenfei/onlinelog/group_8.331.835330421
Errors in file /home/app/oracle/diag/rdbms/xifenfei/xifenfei1/trace/xifenfei1_ora_13770.trc  (incident=197486):
ORA-00600: internal error code, arguments: [kdxlin:psno out of range], [], [], [], [], [], [], [], [], [], [], []
Incident details in:/home/app/oracle/diag/rdbms/xifenfei/xifenfei1/incident/incdir_197486/xifenfei1_ora_13770_i197486.trc
Fri Nov 06 15:03:09 2015
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /home/app/oracle/diag/rdbms/xifenfei/xifenfei1/trace/xifenfei1_ora_13770.trc  (incident=197487):
ORA-01578: ORACLE data block corrupted (file # 2, block # 65207)
ORA-01110: data file 2: '+DATA/xifenfei/datafile/sysaux.257.835324753'
ORA-10564: tablespace SYSAUX
ORA-01110: data file 2: '+DATA/xifenfei/datafile/sysaux.257.835324753'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 81045
ORA-00600: internal error code, arguments: [kdxlin:psno out of range], [], [], [], [], [], [], [], [], [], [], []
Incident details in:/home/app/oracle/diag/rdbms/xifenfei/xifenfei1/incident/incdir_197487/xifenfei1_ora_13770_i197487.trc
Errors in file /home/app/oracle/diag/rdbms/xifenfei/xifenfei1/trace/xifenfei1_ora_13770.trc:
ORA-01578: ORACLE data block corrupted (file # 2, block # 65207)
ORA-01110: data file 2: '+DATA/xifenfei/datafile/sysaux.257.835324753'
ORA-10564: tablespace SYSAUX
ORA-01110: data file 2: '+DATA/xifenfei/datafile/sysaux.257.835324753'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 81045
ORA-00600: internal error code, arguments: [kdxlin:psno out of range], [], [], [], [], [], [], [], [], [], [], []
Recovery of Online Redo Log: Thread 2 Group 3 Seq 4253 Reading mem 0
  Mem# 0: +DATA/xifenfei/onlinelog/group_3.332.835330505
Hex dump of (file 14, block 62536) in trace file /home/app/oracle/diag/rdbms/xifenfei/xifenfei1/trace/xifenfei1_ora_13770.trc
Reading datafile '+DATA/xifenfei/datafile/ht01.dbf' for corruption at rdba: 0x0380f448 (file 14, block 62536)
Reread (file 14, block 62536) found same corrupt data (logically corrupt)
RECOVERY OF THREAD 1 STUCK AT BLOCK 62536 OF FILE 14
Fri Nov 06 15:03:13 2015
Abort recovery for domain 0
Aborting crash recovery due to error 1172
Errors in file /home/app/oracle/diag/rdbms/xifenfei/xifenfei1/trace/xifenfei1_ora_13770.trc:
ORA-01172: recovery of thread 1 stuck at block 62536 of file 14
ORA-01151: use media recovery to recover block, restore backup if needed
Abort recovery for domain 0
Errors in file /home/app/oracle/diag/rdbms/xifenfei/xifenfei1/trace/xifenfei1_ora_13770.trc:
ORA-01172: recovery of thread 1 stuck at block 62536 of file 14
ORA-01151: use media recovery to recover block, restore backup if needed
ORA-1172 signalled during: ALTER DATABASE OPEN...

查看资料发现和Bug 14301592 – Several errors by corrupt blocks shifted by 2 bytes in buffer cache during recovery caused by INDEX redo apply,可以通过ALLOW 1 CORRUPTION临时解决

使用ALLOW 1 CORRUPTION进行恢复,出现ORA-07445[kdxlin]错误

Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
+DATA/xifenfei/onlinelog/group_3.332.835330505     
ORA-00279: change 700860458 generated at 11/05/2015 21:20:15 needed for thread
1
ORA-00289: suggestion : +ARCHIVE/xifenfei/xifenfei_1_5731_835324843.arc
ORA-00280: change 700860458 for thread 1 is in sequence #5731


Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
+DATA/xifenfei/onlinelog/group_2.266.835331047
ORA-00283: recovery session canceled due to errors
ORA-10562: Error occurred while applying redo to data block (file# 2, block#
70104)
ORA-10564: tablespace SYSAUX
ORA-01110: data file 2: '+DATA/xifenfei/datafile/sysaux.257.835324753'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 82289
ORA-00607: Internal error occurred while making a change to a data block
ORA-00602: internal programming exception
ORA-07445: exception encountered: core dump [kdxlin()+4088] [SIGSEGV]
[ADDR:0xC] [PC:0x95FB572] [Address not mapped to object] []


ORA-01112: media recovery not started

ORA-07445[kdxlin()+4088]未找到类似说明,到了这一步,无法简单的恢复成功,只能通过设置隐含参数跳过实例恢复,尝试resetlog库

通过设置_allow_resetlogs_corruption参数继续恢复

SQL> startup pfile='/tmp/pfile.ora' mount;
ORACLE instance started.

Total System Global Area 7315603456 bytes
Fixed Size                  2267384 bytes
Variable Size            2566915848 bytes
Database Buffers         4731174912 bytes
Redo Buffers               15245312 bytes
Database mounted.
SQL> alter database open resetlogs;
alter database open resetlogs
*
ERROR at line 1:
ORA-01092: ORACLE instance terminated. Disconnection forced
ORA-00600: internal error code, arguments: [kclchkblk_4], [0], [700869927],
[0], [700860464], [], [], [], [], [], [], []
Process ID: 13563
Session ID: 157 Serial number: 3

alert日志报错

Fri Nov 06 19:26:39 2015
SMON: enabling cache recovery
Instance recovery: looking for dead threads
Instance recovery: lock domain invalid but no dead threads
Errors in file /home/app/oracle/diag/rdbms/xifenfei/xifenfei1/trace/xifenfei1_ora_13563.trc  (incident=319140):
ORA-00600: internal error code, arguments: [kclchkblk_4], [0], [700869927], [0], [700860464], [], [], [], [], [], [], []
Incident details in:/home/app/oracle/diag/rdbms/xifenfei/xifenfei1/incident/incdir_319140/xifenfei1_ora_13563_i319140.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Redo thread 2 internally disabled at seq 1 (CKPT)
ARC1: Archiving disabled thread 2 sequence 1
Archived Log entry 9956 added for thread 2 sequence 1 ID 0x0 dest 1:
ARC3: Archival started
ARC0: STARTING ARCH PROCESSES COMPLETE
Errors in file /home/app/oracle/diag/rdbms/xifenfei/xifenfei1/trace/xifenfei1_ora_13563.trc:
ORA-00600: internal error code, arguments: [kclchkblk_4], [0], [700869927], [0], [700860464], [], [], [], [], [], [], []
Errors in file /home/app/oracle/diag/rdbms/xifenfei/xifenfei1/trace/xifenfei1_ora_13563.trc:
ORA-00600: internal error code, arguments: [kclchkblk_4], [0], [700869927], [0], [700860464], [], [], [], [], [], [], []
Error 600 happened during db open, shutting down database
USER (ospid: 13563): terminating the instance due to error 600
Fri Nov 06 19:26:42 2015
Instance terminated by USER, pid = 13563
ORA-1092 signalled during: alter database open resetlogs...
opiodr aborting process unknown ospid (13563) as a result of ORA-1092
Fri Nov 06 19:26:42 2015
ORA-1092 : opitsk aborting process

这里是比较熟悉的ora-600[kclchkblk_4]错误,和ora-600[2662]错误类似,需要调整scn,由于数据库版本为11.2.0.4,无法使用常规方法调整scn,在修改控制文件,oradebug,bbed方法可供选择

使用oradebug方法处理
因为是asm环境,其他方法处理起来都相对麻烦

[oracle@wisetop1 ~]$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.4.0 Production on Fri Nov 6 19:30:59 2015

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> startup pfile='/tmp/pfile.ora' mount;
ORACLE instance started.

Total System Global Area 7315603456 bytes
Fixed Size                  2267384 bytes
Variable Size            2566915848 bytes
Database Buffers         4731174912 bytes
Redo Buffers               15245312 bytes
Database mounted.
SQL> oradebug setmypid
Statement processed.
SQL> oradebug poke 0x06001AE70 4 0x2FAF0800
BEFORE: [06001AE70, 06001AE74) = 00000000
AFTER:  [06001AE70, 06001AE74) = 2FAF0800
SQL> alter database open;

Database altered.

至此数据库open成功,后续就是处理一些坏块的工作,并建议客户逻辑重建库.

ORA-600 4155

某客户使用win 2003,Oracle 11.2.0.1+ASM架构方式,由于存储异常并且做了存储恢复之后,ASM可以正常mount起来,但是数据库无法打开
使用dbv检查system发现有少量坏块

DBVERIFY - 开始验证: FILE = +DATA/xifenfei/datafile/system.256.764288125
页 3117 标记为损坏
Corrupt block relative dba: 0x00400c2d (file 1, block 3117)
Bad header found during dbv: 
Data in bad block:
 type: 11 format: 2 rdba: 0x00400001
 last change scn: 0x0000.00000000 seq: 0x1 flg: 0x04
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0x00000b01
 check value in block header: 0xfeec
 computed block checksum: 0x0

Corrupt block relative dba: 0x0042002d (file 1, block 131117)
Bad header found during dbv: 
Data in bad block:
 type: 11 format: 2 rdba: 0x00400001
 last change scn: 0x0000.00000000 seq: 0x1 flg: 0x04
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0x00000b01
 check value in block header: 0xfeec
 computed block checksum: 0x0

Corrupt block relative dba: 0x0042003d (file 1, block 131133)
Bad header found during dbv: 
Data in bad block:
 type: 11 format: 2 rdba: 0x00400001
 last change scn: 0x0000.00000000 seq: 0x1 flg: 0x04
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0x00000b01
 check value in block header: 0xfeec
 computed block checksum: 0x0


DBVERIFY - 验证完成

检查的页总数: 222208
处理的页总数 (数据): 188939
失败的页总数 (数据): 19
处理的页总数 (索引): 17375
失败的页总数 (索引): 0
处理的页总数 (其他): 3190
处理的总页数 (段)  : 1
失败的总页数 (段)  : 0
空的页总数: 12701
标记为损坏的总页数: 3
流入的页总数: 0
加密的总页数        : 0
最高块 SCN            : 0 (0.0)

很多”页 131125 失败, 校验代码为 6125″类似错误忽略.
我们对于这些坏块进行分析,这些坏块未涉及oracle 最核心的基表数据,从理论上可以open数据库

尝试打开数据库

ALTER DATABASE OPEN
Beginning crash recovery of 1 threads
 parallel recovery started with 32 processes
Started redo scan
Mon Nov 16 14:12:45 2015
NOTE: dependency between database xifenfei and diskgroup resource ora.DATA.dg is established
Errors in file d:\oracle\diag\rdbms\xifenfei\xifenfei\trace\xifenfei_ora_5672.trc  (incident=937262):
ORA-00353: 日志损坏接近块 12509 更改 14199034494312 时间 02/05/2015 03:09:12
ORA-00312: 联机日志 2 线程 1: '+DATA/xifenfei/onlinelog/group_2.265.764288315'
ORA-00312: 联机日志 2 线程 1: '+DATA/xifenfei/onlinelog/group_2.264.764288315'
Incident details in: d:\oracle\diag\rdbms\xifenfei\xifenfei\incident\incdir_937262\xifenfei_ora_5672_i937262.trc
Media Recovery failed with error 399
ORA-355 signalled during: ALTER DATABASE RECOVER  DATABASE  ...

可以确定存储恢复的redo有问题(ORA-00353,ORA-00312),数据库无法直接打开

使用参数_allow_resetlogs_corruption屏蔽redo异常resetlogs库

Mon Nov 16 15:26:41 2015
alter database open resetlogs
Mon Nov 16 15:26:41 2015
Starting background process ASMB
Mon Nov 16 15:26:41 2015
ASMB started with pid=25, OS id=6612 
Starting background process RBAL
Mon Nov 16 15:26:41 2015
RBAL started with pid=26, OS id=6940 
NOTE: initiating MARK startup 
Starting background process MARK
Mon Nov 16 15:26:41 2015
MARK started with pid=27, OS id=1720 
NOTE: MARK has subscribed 
NOTE: Loaded library: System 
SUCCESS: diskgroup DATA was mounted
RESETLOGS is being done without consistancy checks. This may result
in a corrupted database. The database should be recreated.
Mon Nov 16 15:26:44 2015
NOTE: dependency between database xifenfei and diskgroup resource ora.DATA.dg is established
Archived Log entry 1 added for thread 1 sequence 1390429 ID 0xf6320db5 dest 1:
Archived Log entry 2 added for thread 1 sequence 1390427 ID 0xf6320db5 dest 1:
ARCH: Log corruption near block 16350 change 14199035261082 time ?
CORRUPTION DETECTED: thread 1 sequence 1390428 log 3 at block 16350. Arch found corrupt blocks
Errors in file d:\oracle\diag\rdbms\xifenfei\xifenfei\trace\xifenfei_ora_8132.trc  (incident=951271):
ORA-00353: 日志损坏接近块 16350 更改 14199035261082 时间 02/05/2015 03:12:49
ORA-00312: 联机日志 3 线程 1: '+DATA/xifenfei/onlinelog/group_3.267.764288315'
ORA-00312: 联机日志 3 线程 1: '+DATA/xifenfei/onlinelog/group_3.266.764288315'
Incident details in: d:\oracle\diag\rdbms\xifenfei\xifenfei\incident\incdir_951271\xifenfei_ora_8132_i951271.trc
Errors in file d:\oracle\diag\rdbms\xifenfei\xifenfei\trace\xifenfei_ora_8132.trc:
ORA-00354: 损坏重做日志块标头
ORA-00353: 日志损坏接近块 16350 更改 14199035261082 时间 02/05/2015 03:12:49
ORA-00312: 联机日志 3 线程 1: '+DATA/xifenfei/onlinelog/group_3.267.764288315'
ORA-00312: 联机日志 3 线程 1: '+DATA/xifenfei/onlinelog/group_3.266.764288315'
ARCH: All Archive destinations made inactive due to error 354
Committing creation of archivelog '+DATA/xifenfei/archivelog/2015_11_16/thread_1_seq_1390428.276.895937207' (error 354)
Deleted Oracle managed file +DATA/xifenfei/archivelog/2015_11_16/thread_1_seq_1390428.276.895937207
******************************************************
Detected premature EOF of log 3 at block 16350; re-trying archival
******************************************************
Mon Nov 16 15:26:49 2015
Sweep [inc][951271]: completed
Mon Nov 16 15:26:49 2015
Trace dumping is performing id=[cdmp_20151116152649]
Archived Log entry 3 added for thread 1 sequence 1390428 ID 0xf6320db5 dest 1:
RESETLOGS after incomplete recovery UNTIL CHANGE 14199033899179
Resetting resetlogs activation ID 4130475445 (0xf6320db5)
Errors in file d:\oracle\diag\rdbms\xifenfei\xifenfei\trace\xifenfei_m001_800.trc  (incident=951319):
ORA-00353: log corruption near block 2270 change 14199035131016 time 02/05/2015 03:12:58
ORA-00334: archived log: '+DATA/xifenfei/archivelog/2015_11_16/thread_1_seq_1390428.276.895937209'
Incident details in: d:\oracle\diag\rdbms\xifenfei\xifenfei\incident\incdir_951319\xifenfei_m001_800_i951319.trc
Errors in file d:\oracle\diag\rdbms\xifenfei\xifenfei\trace\xifenfei_m001_800.trc  (incident=951320):
ORA-00355: change numbers out of order
ORA-00353: log corruption near block 2270 change 14199035131016 time 02/05/2015 03:12:58
ORA-00334: archived log: '+DATA/xifenfei/archivelog/2015_11_16/thread_1_seq_1390428.276.895937209'
Incident details in: d:\oracle\diag\rdbms\xifenfei\xifenfei\incident\incdir_951320\xifenfei_m001_800_i951320.trc
Trace dumping is performing id=[cdmp_20151116152651]
Mon Nov 16 15:26:51 2015
Sweep [inc][951320]: completed
Sweep [inc][951319]: completed
Sweep [inc2][951320]: completed
Sweep [inc2][951319]: completed
Sweep [inc2][951271]: completed
Trace dumping is performing id=[cdmp_20151116152653]
Checker run found 1 new persistent data failures
Mon Nov 16 15:26:53 2015
Setting recovery target incarnation to 2
Mon Nov 16 15:26:54 2015
Assigning activation ID 4262085362 (0xfe0a42f2)
LGWR: STARTING ARCH PROCESSES
Mon Nov 16 15:26:54 2015
ARC0 started with pid=29, OS id=2896 
ARC0: Archival started
LGWR: STARTING ARCH PROCESSES COMPLETE
ARC0: STARTING ARCH PROCESSES
Mon Nov 16 15:26:55 2015
ARC1 started with pid=30, OS id=1748 
Mon Nov 16 15:26:55 2015
ARC2 started with pid=31, OS id=1920 
ARC1: Archival started
ARC1: Becoming the 'no FAL' ARCH
ARC1: Becoming the 'no SRL' ARCH
Thread 1 opened at log sequence 1
  Current log# 1 seq# 1 mem# 0: +DATA/xifenfei/onlinelog/group_1.262.764288315
  Current log# 1 seq# 1 mem# 1: +DATA/xifenfei/onlinelog/group_1.263.764288315
Successful open of redo thread 1
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Mon Nov 16 15:26:55 2015
SMON: enabling cache recovery
Mon Nov 16 15:26:55 2015
ARC3 started with pid=32, OS id=7236 
ARC2: Archival started
ARC3: Archival started
ARC0: STARTING ARCH PROCESSES COMPLETE
ARC0: Becoming the heartbeat ARCH
Errors in file d:\oracle\diag\rdbms\xifenfei\xifenfei\trace\xifenfei_ora_8132.trc  (incident=951272):
ORA-00600: 内部错误代码, 参数: [4155], [], [], [], [], [], [], [], [], [], [], []
Incident details in: d:\oracle\diag\rdbms\xifenfei\xifenfei\incident\incdir_951272\xifenfei_ora_8132_i951272.trc
Errors in file d:\oracle\diag\rdbms\xifenfei\xifenfei\trace\xifenfei_ora_8132.trc:
ORA-00600: 内部错误代码, 参数: [4155], [], [], [], [], [], [], [], [], [], [], []
Errors in file d:\oracle\diag\rdbms\xifenfei\xifenfei\trace\xifenfei_ora_8132.trc:
ORA-00600: 内部错误代码, 参数: [4155], [], [], [], [], [], [], [], [], [], [], []
Error 600 happened during db open, shutting down database
Trace dumping is performing id=[cdmp_20151116152658]
USER (ospid: 8132): terminating the instance due to error 600
Mon Nov 16 15:27:07 2015
Instance terminated by USER, pid = 8132
ORA-1092 signalled during: alter database open resetlogs...

在resetlogs过程中由于遭遇了ORA-600[4155]导致数据库无法正常打开.

分析相关trace文件

*** 2015-11-16 15:26:56.921
*** SESSION ID:(145.3) 2015-11-16 15:26:56.921
*** CLIENT ID:() 2015-11-16 15:26:56.921
*** SERVICE NAME:(SYS$USERS) 2015-11-16 15:26:56.921
*** MODULE NAME:(sqlplus.exe) 2015-11-16 15:26:56.921
*** ACTION NAME:() 2015-11-16 15:26:56.921
 
Dump continued from file: d:\oracle\diag\rdbms\xifenfei\xifenfei\trace\xifenfei_ora_8132.trc
ORA-00600: 内部错误代码, 参数: [4155], [], [], [], [], [], [], [], [], [], [], []

========= Dump for incident 951272 (ORA 600 [4155]) ========
----- Beginning of Customized Incident Dump(s) -----
XID passed in =xid: 0x000b.001.00fcfb45
XID from Undo block = xid: 0x000b.009.00fcc561
----- End of Customized Incident Dump(s) -----

*** 2015-11-16 15:26:57.203
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
----- Current SQL Statement for this session (sql_id=7j16t46cacjt9) -----
alter database open resetlogs

----- Call Stack Trace -----
calling              call     entry                argument values in hex      
location             type     point                (? means dubious value)     
-------------------- -------- -------------------- ----------------------------
ksedst1()+129        CALL???  skdstdst()           009233DA2 000000000 000000000
                                                   000000000
ksedst()+69          CALL???  ksedst1()            000000002 000000000 006F605E0
                                                   000000000
dbkedDefDump()+4536  CALL???  ksedst()             000000287 000000000 000000000
                                                   000000000
ksedmp()+43          CALL???  dbkedDefDump()       000000003 000000002 000000000
                                                   000468E71
ksfdmp()+87          CALL???  ksedmp()             000000000 000000000 000000000
                                                   000000000
dbgexPhaseII()+1819  CALL???  ksfdmp()             000000000 000000000 000000000
                                                   000000000
dbgexExplicitEndInc  CALL???  dbgexPhaseII()       000000000 000000000 000000000
()+755                                             000000000
dbgeEndDDEInvocatio  CALL???  dbgexExplicitEndInc  00CFB0570 00CFB7540 01F9012D8
nImpl()+748                   ()                   01F901300
dbgeEndDDEInvocatio  CALL???  dbgeEndDDEInvocatio  00CFB0570 00CFB7540 01F903130
n()+47                        nImpl()              00000000A
ktugce()+610         CALL???  dbgeEndDDEInvocatio  006E24498 01F8FF91E 00000002E
                              n()                  035636366
ktdgti()+609         CALL???  ktugce()             000000000 000001F68 000001F68
                                                   000000001
k2vGetCollectingInf  CALL???  ktdgti()             000000000 008A34E71 01F902470
o()+324                                            000000018
k2vcbk()+182         CALL???  k2vGetCollectingInf  000000000 000000000 000000000
                              o()                  000000008
kturRecoverTxn()+82  CALL???  k2vcbk()             000000000 FCFB450000000B
67                                                 000610001 7FFDF055860
kturRecoverUndoSegm  CALL???  kturRecoverTxn()     01F903448 009460001 000000000
ent()+1371                                         00147AE14
ktuiup()+1520        CALL???  kturRecoverUndoSegm  7FF0000000B 01F903628
                              ent()                000000000 00147AE14
ktuini()+80          CALL???  ktuiup()             000000001 00000000E 01F9038A0
                                                   003CB3D06
adbdrv()+44263       CALL???  ktuini()             000000000 01F9090C8 000000000
                                                   000000000
opiexe()+20842       CALL???  adbdrv()             000000023 000000003
                                                   7FF00000102 000000000
opiosq0()+5129       CALL???  opiexe()+16981       000000004 000000000 01F90A8E0
                                                   009361AB3
kpooprx()+357        CALL???  opiosq0()            000000003 00000000E 01F90ABB0
                                                   0000000A4
kpoal8()+940         CALL???  kpooprx()            000020C80 01E65CCD0 00CE91AD8
                                                   000000001
opiodr()+1662        CALL???  kpoal8()             00000005E 00000001C 01F90E120
                                                   00A4EF224
ttcpip()+1325        CALL???  opiodr()             480000000000005E
                                                   49004D000000001C 01F90E120
                                                   4100200000000000
opitsk()+2040        CALL???  ttcpip()             01E735200 000000000 000000000
                                                   000000000
opiino()+1258        CALL???  opitsk()             00000001E 000000000 000000000
                                                   01F90FA18
opiodr()+1662        CALL???  opiino()             00000003C 000000004 01F90FAD0
                                                   000000000
opidrv()+864         CALL???  opiodr()             00000003C 000000004 01F90FAD0
                                                   6F5C3A6400000000
sou2o()+98           CALL???  opidrv()+150         00000003C 000000004 01F90FAD0
                                                   000000000
opimai_real()+158    CALL???  sou2o()              01F90FB00 01F90FBC4
                                                   F0010000B07DF 1009C002B0019
opimai()+191         CALL???  opimai_real()        00000001B 01F90FC88 000000036
                                                   000000000
OracleThreadStart()  CALL???  opimai()             01F90FE90 01F60FF38 000000002
+724                                               01F90FC88
0000000078D3B6DA     CALL???  OracleThreadStart()  01F60FF38 000000000 000000000
                                                   01F90FFA8
 

--------------------- Binary Stack Dump ---------------------

UNDO BLK:  
xid: 0x000b.009.00fcc561  seq: 0x513d cnt: 0x2f  irb: 0x2f  icl: 0x0   flg: 0x0000
 
 Rec Offset      Rec Offset      Rec Offset      Rec Offset      Rec Offset
---------------------------------------------------------------------------
0x01 0x1f64     0x02 0x1ee4     0x03 0x1e30     0x04 0x1d7c     0x05 0x1cc8     
0x06 0x1c14     0x07 0x1b60     0x08 0x1aac     0x09 0x19f8     0x0a 0x1944     
0x0b 0x1890     0x0c 0x17dc     0x0d 0x1728     0x0e 0x1674     0x0f 0x15c0     
0x10 0x150c     0x11 0x1458     0x12 0x13a4     0x13 0x12f0     0x14 0x123c     
0x15 0x1188     0x16 0x10d4     0x17 0x1050     0x18 0x0fd0     0x19 0x0f1c     
0x1a 0x0e98     0x1b 0x0de4     0x1c 0x0d30     0x1d 0x0c7c     0x1e 0x0bc8     
0x1f 0x0b44     0x20 0x0ac4     0x21 0x0a10     0x22 0x095c     0x23 0x08a8     
0x24 0x07f4     0x25 0x0770     0x26 0x06f0     0x27 0x063c     0x28 0x0588     
0x29 0x04d4     0x2a 0x0420     0x2b 0x039c     0x2c 0x031c     0x2d 0x0298     
0x2e 0x0218     0x2f 0x0164     

ORA-600 4155是由于在恢复过程中发现事务的id和undo segment中的事务表中id不匹配从而出现此类问题.针对此问题,可以通过bbed修改事务表记录,或者直接丢弃该事务,从而绕过该错误.

处理掉异常事务id后,继续open库

Mon Nov 16 16:16:07 2015
ALTER DATABASE OPEN
Beginning crash recovery of 1 threads
 parallel recovery started with 32 processes
Started redo scan
Completed redo scan
 read 8 KB redo, 0 data blocks need recovery
Started redo application at
 Thread 1: logseq 1, block 2, scn 14199161880578
Recovery of Online Redo Log: Thread 1 Group 1 Seq 1 Reading mem 0
  Mem# 0: +DATA/xifenfei/onlinelog/group_1.262.764288315
  Mem# 1: +DATA/xifenfei/onlinelog/group_1.263.764288315
Completed redo application of 0.00MB
Completed crash recovery at
 Thread 1: logseq 1, block 19, scn 14199161900601
 0 data blocks read, 0 data blocks written, 8 redo k-bytes read
Current SCN is not changed: _minimum_giga_scn (scn 14199161880576) is too small
Mon Nov 16 16:16:08 2015
LGWR: STARTING ARCH PROCESSES
Mon Nov 16 16:16:08 2015
ARC0 started with pid=62, OS id=7612 
ARC0: Archival started
LGWR: STARTING ARCH PROCESSES COMPLETE
ARC0: STARTING ARCH PROCESSES
Mon Nov 16 16:16:09 2015
ARC1 started with pid=63, OS id=5620 
Mon Nov 16 16:16:09 2015
ARC2 started with pid=64, OS id=7308 
ARC1: Archival started
ARC1: Becoming the 'no FAL' ARCH
ARC1: Becoming the 'no SRL' ARCH
Thread 1 advanced to log sequence 2 (thread open)
Thread 1 opened at log sequence 2
  Current log# 2 seq# 2 mem# 0: +DATA/xifenfei/onlinelog/group_2.264.764288315
  Current log# 2 seq# 2 mem# 1: +DATA/xifenfei/onlinelog/group_2.265.764288315
Successful open of redo thread 1
Archived Log entry 1 added for thread 1 sequence 1 ID 0xfe09f6df dest 1:
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Mon Nov 16 16:16:10 2015
SMON: enabling cache recovery
Dictionary check beginning
Tablespace 'TEMP' #3 found in data dictionary,
but not in the controlfile. Adding to controlfile.
Dictionary check complete
Verifying file header compatibility for 11g tablespace encryption..
Verifying 11g file header compatibility for tablespace encryption completed
SMON: enabling tx recovery
*********************************************************************
WARNING: The following temporary tablespaces contain no files.
         This condition can occur when a backup controlfile has
Mon Nov 16 16:16:09 2015
ARC3 started with pid=65, OS id=7064 
         been restored.  It may be necessary to add files to these
ARC2: Archival started
         tablespaces.  That can be done using the SQL statement:
ARC3: Archival started
 
ARC0: STARTING ARCH PROCESSES COMPLETE
         ALTER TABLESPACE <tablespace_name> ADD TEMPFILE
ARC0: Becoming the heartbeat ARCH
 
         Alternatively, if these temporary tablespaces are no longer
         needed, then they can be dropped.
           Empty temporary tablespace: TEMP
*********************************************************************
Database Characterset is ZHS16GBK
No Resource Manager plan active
**********************************************************
WARNING: Files may exists in db_recovery_file_dest
that are not known to the database. Use the RMAN command
CATALOG RECOVERY AREA to re-catalog any such files.
If files cannot be cataloged, then manually delete them
using OS command.
One of the following events caused this:
1. A backup controlfile was restored.
2. A standby controlfile was restored.
3. The controlfile was re-created.
4. db_recovery_file_dest had previously been enabled and
   then disabled.
**********************************************************
replication_dependency_tracking turned off (no async multimaster replication found)
WARNING: AQ_TM_PROCESSES is set to 0. System operation                     might be adversely affected.
LOGSTDBY: Validating controlfile with logical metadata
LOGSTDBY: Validation complete
Completed: ALTER DATABASE OPEN

因为该数据库是经过了存储恢复,除system之外,其他文件也有大量坏块,因为恢复过程相对比较麻烦,除了上面列出来的ORA-00353,ORA-00312,ORA-600 4155等各种错误之外,还有大量的ORA-01578,ORA-01110.由于11G比较常见的ORA-00283,ORA-16433问题。

存储异常导致ORA-10562故障恢复

朋友数据库由于存储变动,导致数据库瞬间hang住,然后直接crash,之后无法正常启动,请求技术支持.
数据库报ORA-00600[2131]错误
不能mount,可以通过重建控制文件解决

Mon Nov 30 20:35:38 2015
alter database mount
Mon Nov 30 20:35:38 2015
NOTE: Loaded library: System 
Mon Nov 30 20:35:38 2015
SUCCESS: diskgroup DATADG was mounted
Mon Nov 30 20:35:38 2015
NOTE: dependency between database xifenfei and diskgroup resource ora.DATADG.dg is established
Errors in file /u01/app/oracle/diag/rdbms/xifenfei/xifenfei1/trace/xifenfei1_ora_26450.trc  (incident=3032256):
ORA-00600: internal error code, arguments: [2131], [33], [32], [], [], [], [], [], [], [], [], []
ORA-600 signalled during: alter database mount...

尝试recover数据库

Mon Nov 30 20:45:53 2015
ALTER DATABASE RECOVER  database  
Media Recovery Start
 started logmerger process
Parallel Media Recovery started with 80 slaves
Mon Nov 30 20:45:56 2015
Recovery of Online Redo Log: Thread 2 Group 11 Seq 617 Reading mem 0
  Mem# 0: +DATADG/xifenfei/redo011.log
Recovery of Online Redo Log: Thread 1 Group 4 Seq 5410 Reading mem 0
  Mem# 0: +DATADG/xifenfei/redo04.log
Recovery of Online Redo Log: Thread 1 Group 5 Seq 5411 Reading mem 0
  Mem# 0: +DATADG/xifenfei/redo05.log
Mon Nov 30 20:46:07 2015
Recovery of Online Redo Log: Thread 1 Group 6 Seq 5412 Reading mem 0
  Mem# 0: +DATADG/xifenfei/redo06.log
Mon Nov 30 20:46:13 2015
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0xC] [PC:0x95FB502, kdxlin()+4088] [flags: 0x0, count: 1]
Errors in file /u01/app/oracle/diag/rdbms/xifenfei/xifenfei1/trace/xifenfei1_pr13_30480.trc  (incident=3032568):
ORA-07445: 出现异常错误: 核心转储 [kdxlin()+4088] [SIGSEGV] [ADDR:0xC] [PC:0x95FB502] [Address not mapped to object] []
Mon Nov 30 20:46:17 2015
Sweep [inc][3032568]: completed
Sweep [inc2][3032568]: completed
Mon Nov 30 20:46:31 2015
Slave exiting with ORA-10562 exception
Errors in file /u01/app/oracle/diag/rdbms/xifenfei/xifenfei1/trace/xifenfei1_pr13_30480.trc:
ORA-10562: Error occurred while applying redo to data block (file# 2, block# 165054)
ORA-10564: tablespace SYSAUX
ORA-01110: 数据文件 2: '+DATADG/xifenfei/datafile/sysaux.265.861925867'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 271
ORA-00607: 当更改数据块时出现内部错误
ORA-00602: 内部编程异常错误
ORA-07445: 出现异常错误: 核心转储 [kdxlin()+4088] [SIGSEGV] [ADDR:0xC] [PC:0x95FB502] [Address not mapped to object] []
Mon Nov 30 20:46:31 2015
Recovery Slave PR13 previously exited with exception 10562
Mon Nov 30 20:46:33 2015
Checker run found 28 new persistent data failures
Mon Nov 30 20:46:35 2015
Media Recovery failed with error 448
Errors in file /u01/app/oracle/diag/rdbms/xifenfei/xifenfei1/trace/xifenfei1_pr00_30400.trc:
ORA-00283: 恢复会话因错误而取消
ORA-00448: 后台进程正常结束
ORA-10562 signalled during: ALTER DATABASE RECOVER  database  ...

通过这里可以看到,由于在recover 操作之时,由于某种原因redo的数据无法apply到file 2 block 165054中,导致数据库recover database失败.

按照数据文件recover操作

SQL> recover datafile 1;
Media recovery complete.
SQL> recover datafile 3,4,5,6,7;
Media recovery complete.
SQL> recover datafile 8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28;              
Media recovery complete.
SQL> recover datafile 2;
ORA-00283: recovery session canceled due to errors
ORA-10562: Error occurred while applying redo to data block (file# 2, block#
165054)
ORA-10564: tablespace SYSAUX
ORA-01110: data file 2: '+DATADG/xifenfei/datafile/sysaux.265.861925867'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 271
ORA-00607: Internal error occurred while making a change to a data block
ORA-00602: internal programming exception
ORA-07445: exception encountered: core dump [kdxlin()+4088] [SIGSEGV]
[ADDR:0xC] [PC:0x95FB502] [Address not mapped to object] []

错误提示和recover database一样,那我们只能让恢复跳过该block继续恢复,因为根据经验判断data object# 271不是系统核心对象,不会影响数据库的启动

跳过坏块继续恢复

SQL> recover  datafile 2 allow 1 corruption;
ORA-00283: recovery session canceled due to errors
ORA-00600: internal error code, arguments: [3020], [2], [69793], [8458401], [],
[], [], [], [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 2, block# 69793, file
offset is 571744256 bytes)
ORA-10564: tablespace SYSAUX
ORA-01110: data file 2: '+DATADG/xifenfei/datafile/sysaux.265.861925867'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 272


SQL> recover  datafile 2 allow 1 corruption;
Media recovery complete.
SQL> alter database open;

Database altered.

出现了ORA-600[3020] 继续跳过坏块,然后数据库顺利open,别忘记加tempfile

处理异常对象

SQL> select object_name,object_type from dba_objects where data_object_id in(272,271);

OBJECT_NAME
--------------------------------------------------------------------------------
OBJECT_TYPE
-------------------
SMON_SCN_TIME_TIM_IDX
INDEX

SMON_SCN_TIME_SCN_IDX
INDEX

SQL> select index_name from dba_indexes where table_name='SMON_SCN_TIME';

INDEX_NAME
------------------------------
SMON_SCN_TIME_TIM_IDX
SMON_SCN_TIME_SCN_IDX

SQL> set pages 1000
SQL> set long 1000
SQL> Select dbms_metadata.get_ddl('TABLE','SMON_SCN_TIME','SYS') FROM DUAL ;

DBMS_METADATA.GET_DDL('TABLE','SMON_SCN_TIME','SYS')
--------------------------------------------------------------------------------

  CREATE TABLE "SYS"."SMON_SCN_TIME"
   (    "THREAD" NUMBER,
        "TIME_MP" NUMBER,
        "TIME_DP" DATE,
        "SCN_WRP" NUMBER,
        "SCN_BAS" NUMBER,
        "NUM_MAPPINGS" NUMBER,
        "TIM_SCN_MAP" RAW(1200),
        "SCN" NUMBER DEFAULT 0,
        "ORIG_THREAD" NUMBER DEFAULT 0           /* for downgrade */
   ) CLUSTER "SYS"."SMON_SCN_TO_TIME_AUX" ("THREAD")

SQL> analyze table smon_scn_time validate structure cascade online;
analyze table smon_scn_time validate structure cascade online
*
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 2, block # 165054)
ORA-01110: data file 2: '+DATADG/xifenfei/datafile/sysaux.265.861925867'

SQL> truncate CLUSTER "SYS"."SMON_SCN_TO_TIME_AUX";

Cluster truncated.

关于SMON_SCN_TIME部分处理,可以参考:关于SMON_SCN_TIME若干问题说明.至此数据库基本上恢复完成,而且运气非常好,恢复的非常完美,数据实现0丢失.

asm disk格式化恢复

接到网友请求,由于操作人员粗心把asm disk的磁盘映射到另外的机器上,并且格式化为了win ntfs文件系统,导致asm 磁盘组异常,数据库无法使用
asm 日志报ORA-27072错

Mon Nov 30 12:00:13 2015
Errors in file c:\app\administrator\diag\asm\+asm\+asm\trace\+asm_gmon_868.trc:
ORA-27070: async read/write failed
OSD-04008: WriteFile() 失败, 无法写入文件
O/S-Error: (OS 21) 设备未就绪。
WARNING: IO Failed. group:1 disk(number.incarnation):0.0xf0f0bbfb disk_path:\\.\ORCLDISKDATA0
	 AU:1 disk_offset(bytes):2093056 io_size:4096 operation:Write type:synchronous
	 result:I/O error process_id:868
WARNING: disk 0.4042308603 (DATA_0000) not responding to heart beat
ERROR: too many offline disks in PST (grp 1)
WARNING: Disk DATA_0000 in mode 0x7f will be taken offline
Mon Nov 30 12:00:13 2015
NOTE: process 576:37952 initiating offline of disk 0.4042308603 (DATA_0000) with mask 0x7e in group 1
WARNING: Disk DATA_0000 in mode 0x7f is now being taken offline
NOTE: initiating PST update: grp = 1, dsk = 0/0xf0f0bbfb, mode = 0x15
kfdp_updateDsk(): 5 
kfdp_updateDskBg(): 5 
Errors in file c:\app\administrator\diag\asm\+asm\+asm\trace\+asm_gmon_868.trc:
ORA-27072: File I/O error
WARNING: IO Failed. group:1 disk(number.incarnation):1.0xf0f0bbfc disk_path:\\.\ORCLDISKDATA1
	 AU:1 disk_offset(bytes):1048576 io_size:4096 operation:Read type:synchronous
	 result:I/O error process_id:868
Errors in file c:\app\administrator\diag\asm\+asm\+asm\trace\+asm_gmon_868.trc:
ORA-27072: File I/O error
WARNING: IO Failed. group:1 disk(number.incarnation):1.0xf0f0bbfc disk_path:\\.\ORCLDISKDATA1
	 AU:1 disk_offset(bytes):1052672 io_size:4096 operation:Read type:synchronous
	 result:I/O error process_id:868
Errors in file c:\app\administrator\diag\asm\+asm\+asm\trace\+asm_gmon_868.trc:
ORA-27072: File I/O error
WARNING: IO Failed. group:1 disk(number.incarnation):2.0xf0f0bbfd disk_path:\\.\ORCLDISKDATA2
	 AU:1 disk_offset(bytes):1048576 io_size:4096 operation:Read type:synchronous
	 result:I/O error process_id:868
Errors in file c:\app\administrator\diag\asm\+asm\+asm\trace\+asm_gmon_868.trc:
ORA-27072: File I/O error
WARNING: IO Failed. group:1 disk(number.incarnation):2.0xf0f0bbfd disk_path:\\.\ORCLDISKDATA2
	 AU:1 disk_offset(bytes):1052672 io_size:4096 operation:Read type:synchronous
	 result:I/O error process_id:868
Errors in file c:\app\administrator\diag\asm\+asm\+asm\trace\+asm_gmon_868.trc:
ORA-27072: File I/O error
WARNING: IO Failed. group:1 disk(number.incarnation):3.0xf0f0bbfe disk_path:\\.\ORCLDISKDATA3
	 AU:1 disk_offset(bytes):1048576 io_size:4096 operation:Read type:synchronous
	 result:I/O error process_id:868
Errors in file c:\app\administrator\diag\asm\+asm\+asm\trace\+asm_gmon_868.trc:
ORA-27072: File I/O error
WARNING: IO Failed. group:1 disk(number.incarnation):3.0xf0f0bbfe disk_path:\\.\ORCLDISKDATA3
	 AU:1 disk_offset(bytes):1052672 io_size:4096 operation:Read type:synchronous
	 result:I/O error process_id:868
Errors in file c:\app\administrator\diag\asm\+asm\+asm\trace\+asm_gmon_868.trc:
ORA-27072: File I/O error
WARNING: IO Failed. group:1 disk(number.incarnation):4.0xf0f0bbff disk_path:\\.\ORCLDISKDATA4
	 AU:1 disk_offset(bytes):1048576 io_size:4096 operation:Read type:synchronous
	 result:I/O error process_id:868
Errors in file c:\app\administrator\diag\asm\+asm\+asm\trace\+asm_gmon_868.trc:
ORA-27072: File I/O error
WARNING: IO Failed. group:1 disk(number.incarnation):4.0xf0f0bbff disk_path:\\.\ORCLDISKDATA4
	 AU:1 disk_offset(bytes):1052672 io_size:4096 operation:Read type:synchronous
	 result:I/O error process_id:868
Errors in file c:\app\administrator\diag\asm\+asm\+asm\trace\+asm_gmon_868.trc:
ORA-27072: File I/O error
WARNING: IO Failed. group:1 disk(number.incarnation):6.0xf0f0bc01 disk_path:\\.\ORCLDISKDATA6
	 AU:1 disk_offset(bytes):1048576 io_size:4096 operation:Read type:synchronous
	 result:I/O error process_id:868
Errors in file c:\app\administrator\diag\asm\+asm\+asm\trace\+asm_gmon_868.trc:
ORA-27072: File I/O error
WARNING: IO Failed. group:1 disk(number.incarnation):6.0xf0f0bc01 disk_path:\\.\ORCLDISKDATA6
	 AU:1 disk_offset(bytes):1052672 io_size:4096 operation:Read type:synchronous
	 result:I/O error process_id:868
Errors in file c:\app\administrator\diag\asm\+asm\+asm\trace\+asm_gmon_868.trc:
ORA-27072: File I/O error
WARNING: IO Failed. group:1 disk(number.incarnation):7.0xf0f0bc02 disk_path:\\.\ORCLDISKDATA7
	 AU:1 disk_offset(bytes):1048576 io_size:4096 operation:Read type:synchronous
	 result:I/O error process_id:868
Errors in file c:\app\administrator\diag\asm\+asm\+asm\trace\+asm_gmon_868.trc:
ORA-27072: File I/O error
WARNING: IO Failed. group:1 disk(number.incarnation):7.0xf0f0bc02 disk_path:\\.\ORCLDISKDATA7
	 AU:1 disk_offset(bytes):1052672 io_size:4096 operation:Read type:synchronous
	 result:I/O error process_id:868
ERROR: no PST quorum in group: required 1, found 0
WARNING: Disk DATA_0000 in mode 0x7f offline aborted
Mon Nov 30 12:00:14 2015
SQL> alter diskgroup DATA dismount force /* ASM SERVER */ 
NOTE: cache dismounting (not clean) group 1/0xBB404B03 (DATA) 
Mon Nov 30 12:00:14 2015
NOTE: halting all I/Os to diskgroup DATA
Mon Nov 30 12:00:14 2015
NOTE: LGWR doing non-clean dismount of group 1 (DATA)
NOTE: LGWR sync ABA=367.7265 last written ABA 367.7265
NOTE: cache dismounted group 1/0xBB404B03 (DATA) 
kfdp_dismount(): 6 
kfdp_dismountBg(): 6 
NOTE: De-assigning number (1,0) from disk (\\.\ORCLDISKDATA0)
NOTE: De-assigning number (1,1) from disk (\\.\ORCLDISKDATA1)
NOTE: De-assigning number (1,2) from disk (\\.\ORCLDISKDATA2)
NOTE: De-assigning number (1,3) from disk (\\.\ORCLDISKDATA3)
NOTE: De-assigning number (1,4) from disk (\\.\ORCLDISKDATA4)
NOTE: De-assigning number (1,5) from disk (\\.\ORCLDISKDATA5)
NOTE: De-assigning number (1,6) from disk (\\.\ORCLDISKDATA6)
NOTE: De-assigning number (1,7) from disk (\\.\ORCLDISKDATA7)
SUCCESS: diskgroup DATA was dismounted
NOTE: cache deleting context for group DATA 1/-1153414397
SUCCESS: alter diskgroup DATA dismount force /* ASM SERVER */
ERROR: PST-initiated MANDATORY DISMOUNT of group DATA

这里的asm日志很明显由于asm disk无法正常访问,报ORA-27072错误,磁盘组强制dismount.

分析磁盘情况
asm-disk1
asm-disk2


通过与客户沟通,确定从I到O本为asm disk 被格式化为了NTFS文件系统的磁盘,结合asmtool分析可以发现还有一个asm disk没有格式化掉,该磁盘组中一个共有8个磁盘格式化掉了7个.

通过kfed分析磁盘信息

C:\Users\Administrator>kfed read '\\.\J:'
kfbh.endian:                        235 ; 0x000: 0xeb
kfbh.hard:                           82 ; 0x001: 0x52
kfbh.type:                          144 ; 0x002: *** Unknown Enum ***
kfbh.datfmt:                         78 ; 0x003: 0x4e
kfbh.block.blk:               542328404 ; 0x004: T=0 NUMB=0x20534654
kfbh.block.obj:                 2105376 ; 0x008: TYPE=0x0 NUMB=0x2020
kfbh.check:                        2050 ; 0x00c: 0x00000802
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                    63488 ; 0x014: 0x0000f800
kfbh.spare1:                   16711743 ; 0x018: 0x00ff003f
kfbh.spare2:                       2048 ; 0x01c: 0x00000800
ERROR!!!, failed to get the oracore error message

C:\Users\Administrator>kfed read '\\.\J:' blkn=2
kfbh.endian:                         70 ; 0x000: 0x46
kfbh.hard:                           73 ; 0x001: 0x49
kfbh.type:                           76 ; 0x002: *** Unknown Enum ***
kfbh.datfmt:                         69 ; 0x003: 0x45
kfbh.block.blk:                  196656 ; 0x004: T=0 NUMB=0x30030
kfbh.block.obj:                33563364 ; 0x008: TYPE=0x0 NUMB=0x22e4
kfbh.check:                           0 ; 0x00c: 0x00000000
kfbh.fcn.base:                    65537 ; 0x010: 0x00010001
kfbh.fcn.wrap:                    65592 ; 0x014: 0x00010038
kfbh.spare1:                        416 ; 0x018: 0x000001a0
kfbh.spare2:                       1024 ; 0x01c: 0x00000400
ERROR!!!, failed to get the oracore error message

C:\Users\Administrator>kfed read '\\.\J:' blkn=256
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                           13 ; 0x002: KFBTYP_PST_NONE
kfbh.datfmt:                          1 ; 0x003: 0x01
kfbh.block.blk:              2147483648 ; 0x004: T=1 NUMB=0x0
kfbh.block.obj:              2147483654 ; 0x008: TYPE=0x8 NUMB=0x6
kfbh.check:                    17662471 ; 0x00c: 0x010d8207
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
ERROR!!!, failed to get the oracore error message

C:\Users\Administrator>kfed read '\\.\J:' blkn=510
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfbh.datfmt:                          1 ; 0x003: 0x01
kfbh.block.blk:                     254 ; 0x004: T=0 NUMB=0xfe
kfbh.block.obj:              2147483654 ; 0x008: TYPE=0x8 NUMB=0x6
kfbh.check:                   717599272 ; 0x00c: 0x2ac5b228
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
kfdhdb.driver.provstr:    ORCLDISKDATA6 ; 0x000: length=13
kfdhdb.driver.reserved[0]:   1096040772 ; 0x008: 0x41544144
kfdhdb.driver.reserved[1]:           54 ; 0x00c: 0x00000036
kfdhdb.driver.reserved[2]:            0 ; 0x010: 0x00000000
kfdhdb.driver.reserved[3]:            0 ; 0x014: 0x00000000
kfdhdb.driver.reserved[4]:            0 ; 0x018: 0x00000000
kfdhdb.driver.reserved[5]:            0 ; 0x01c: 0x00000000
…………

通过分析,可以确定asm disk的备份block没有被覆盖,原则上可以通过备份block实现磁盘组恢复,从而减小了恢复难度

kfed恢复磁盘头

C:\Users\Administrator> kfed repair '\\.\J:'
C:\Users\Administrator>kfed read '\\.\J:' 
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfbh.datfmt:                          1 ; 0x003: 0x01
kfbh.block.blk:                     254 ; 0x004: T=0 NUMB=0xfe
kfbh.block.obj:              2147483654 ; 0x008: TYPE=0x8 NUMB=0x6
kfbh.check:                   717599272 ; 0x00c: 0x2ac5b228
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
kfdhdb.driver.provstr:    ORCLDISKDATA6 ; 0x000: length=13
kfdhdb.driver.reserved[0]:   1096040772 ; 0x008: 0x41544144
kfdhdb.driver.reserved[1]:           54 ; 0x00c: 0x00000036
kfdhdb.driver.reserved[2]:            0 ; 0x010: 0x00000000
kfdhdb.driver.reserved[3]:            0 ; 0x014: 0x00000000
kfdhdb.driver.reserved[4]:            0 ; 0x018: 0x00000000
kfdhdb.driver.reserved[5]:            0 ; 0x01c: 0x00000000
…………

确定asm disk相关信息
对于7个被格式化的磁盘都进行类似处理之后,通过工具看到相关磁盘信息如下
asm-disk3


恢复处理
根据ntfs的文件系统分布,我们可以知道,虽然asm disk header备份block正常,但是asm disk中间部分依旧有不少au会被破坏
ntfs


这样的情况,不合适直接使用工具拷贝出来datafile(由于可能记录block的字典正好被覆盖,导致拷贝出来的文件异常,在恢复过程中我们也做了试验小文件拷贝ok,大文件拷贝然后使用dbv检测有很多坏块),我们采用工具(asm disk header 彻底损坏恢复)从底层扫描直接重组出来asm disk中的数据文件,然后结合拷贝出来的控制文件,redo文件,参数文件,然后通过重命名相关路径,然后直接open数据库

Q:\>sqlplus / as sysdba

SQL*Plus: Release 11.2.0.1.0 Production on 星期三 1月 22 16:08:18 2014

Copyright (c) 1982, 2010, Oracle.  All rights reserved.


连接到:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options


SQL> set pages 1000
SQL> col name for a100
SQL> set lines 150
SQL> select file#,name from v$datafile;

     FILE# NAME
---------- --------------------------------------------------------------------
         1 +DATA/vspdb/datafile/system.256.778520603
         2 +DATA/vspdb/datafile/sysaux.257.778520603
         3 +DATA/vspdb/datafile/undotbs1.258.778520603
         4 +DATA/vspdb/datafile/users.259.778520603
         5 +DATA/vspdb/datafile/vsp_tbs.293.779926097
        …………
       147 +DATA/vspdb/datafile/index_dg.418.864665747
       148 +DATA/vspdb/datafile/data_dg.419.864667053
       149 +DATA/vspdb/datafile/vsp_mm_tbs.420.890410367
       150 +DATA/vspdb/datafile/vsp_mm_tbs.421.890410457

SQL> select member from v$logfile;

MEMBER
-------------------------------------------------------------------------------------
+DATA/vspdb/onlinelog/group_7.263.862676593
+DATA/vspdb/onlinelog/group_7.262.862676601
+DATA/vspdb/onlinelog/group_4.410.862652291
+DATA/vspdb/onlinelog/group_4.411.862652307
+DATA/vspdb/onlinelog/group_5.412.862653715
+DATA/vspdb/onlinelog/group_5.413.862653727
+DATA/vspdb/onlinelog/group_6.414.862676425
+DATA/vspdb/onlinelog/group_6.415.862676433

重命名数据文件和redo文件,open数据库

SQL> recover database;
完成介质恢复。
SQL> alter database open;

数据库已更改。

已用时间:  00: 00: 04.51

由于部分block被覆盖,使用空块代替,导致数据访问到该block就会出现ora-8103(模拟普通ORA-08103并解决,模拟极端ORA-08103并解决)错误,对于该种对象,最简单处理方法就是直接通过dul抽出来数据然后truncate table重新导入数据,当然如果你想彻底安全逻辑方式重建库最靠谱