0kb数据文件恢复或者文件丢失恢复

接到一个朋友恢复请求,由于rose频繁切换导致文件系统部分数据文件变化为0kb和文件丢失.
故障现象
部分数据文件变化为0kb和文件丢失.
file_lost
file_size_0


这里比较明显,数据库的users03变为了0kb和users04丢失.数据库alert日志报错信息如下:

Completed: alter database mount exclusive
alter database open
Errors in file E:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl\trace\orcl_dbw0_12008.trc:
ORA-01157: ????/?????? 7 - ??? DBWR ????
ORA-01110: ???? 7: 'E:\APP\ADMINISTRATOR\ORADATA\ORCL\USERS03.DBF'
ORA-27047: ??????????
OSD-04006: ReadFile() 失败, 无法读取文件
O/S-Error: (OS 38) 已到文件结尾。
Errors in file E:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl\trace\orcl_dbw0_12008.trc:
ORA-01157: ????/?????? 8 - ??? DBWR ????
ORA-01110: ???? 8: 'E:\APP\ADMINISTRATOR\ORADATA\ORCL\USERS04.DBF'
ORA-27041: ??????
OSD-04002: 无法打开文件
O/S-Error: (OS 2) 系统找不到指定的文件。
Errors in file E:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl\trace\orcl_ora_12040.trc:
ORA-01157: ????/?????? 7 - ??? DBWR ????
ORA-01110: ???? 7: 'E:\APP\ADMINISTRATOR\ORADATA\ORCL\USERS03.DBF'
ORA-1157 signalled during: alter database open...
Fri May 04 09:35:10 2018
Checker run found 2 new persistent data failures

alert日志的报错也比较明显,users03是文件超过了大小(大小为0kb,读取之后肯定超过大小),users04提示无法打开文件(文件在文件系统层面已经丢失).现在问题比较明显由于文件系统故障导致文件大小为0和丢失

碎片扫描恢复
常规的方法肯定无法恢复,比较好的方法只能是底层碎片扫描重组,结合多种扫描工具,最后发现一个做底层恢复的朋友的工具效果不错,扫描结果如下
file_scan


通过工具分析坏块情况

C:\Users\Administrator>dbv FiLe=D:\0504\ORCL_TS.4_FILE.7_10.ora

DBVERIFY: Release 11.2.0.4.0 - Production on 星期六 5月 5 08:52:53 2018

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

DBVERIFY - 开始验证: FILE = D:\0504\ORCL_TS.4_FILE.7_10.ora

………………

页 382565 标记为损坏
Corrupt block relative dba: 0x01c5d665 (file 7, block 382565)
Completely zero block found during dbv:

页 382566 标记为损坏
Corrupt block relative dba: 0x01c5d666 (file 7, block 382566)
Completely zero block found during dbv:

页 382567 标记为损坏
Corrupt block relative dba: 0x01c5d667 (file 7, block 382567)
Completely zero block found during dbv:



DBVERIFY - 验证完成

检查的页总数: 1374720
处理的页总数 (数据): 27582
失败的页总数 (数据): 0
处理的页总数 (索引): 20114
失败的页总数 (索引): 0
处理的页总数 (其他): 1319752
处理的总页数 (段)  : 0
失败的总页数 (段)  : 0
空的页总数: 1
标记为损坏的总页数: 7271
流入的页总数: 0
加密的总页数        : 0
最高块 SCN            : 228271996 (0.228271996)


C:\Users\Administrator>dbv FiLe=D:\0504\ORCL_TS.4_FILE.8_8.ora

DBVERIFY: Release 11.2.0.4.0 - Production on 星期六 5月 5 08:52:53 2018

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

DBVERIFY - 开始验证: FILE = D:\0504\ORCL_TS.4_FILE.8_8.ora


DBVERIFY - 验证完成

检查的页总数: 1136896
处理的页总数 (数据): 36639
失败的页总数 (数据): 0
处理的页总数 (索引): 57038
失败的页总数 (索引): 0
处理的页总数 (其他): 1043218
处理的总页数 (段)  : 0
失败的总页数 (段)  : 0
空的页总数: 1
标记为损坏的总页数: 0
流入的页总数: 0
加密的总页数        : 0
最高块 SCN            : 228271997 (0.228271997)

C:\Users\Administrator>

scan_resulte


这里通过分析恢复的两个文件总的block数量2511618,其中连续损坏7271个block损坏,由于出现问题之后,数据库被offline这两个文件继续启动运行了几个小时,导致少量block被覆盖,恢复软件直接置空.后续的恢复比较顺利,正常open数据库,然后处理坏块对象(正好不是业务核心表的lob字段,所有部分丢失影响不是非常大).

温馨提醒:
1. 数据文件和备份不要放在同一个阵列上,更不能是同一个分区(卷)上
2. 出现此类问题之后,应当理解停止对该分区的任何写操作,方式丢失或者大小为0的文件被覆盖.

ORA-00704 ORA-00702 恢复

数据库启动报ORA-01092 ORA-00704 ORA-00702错误

SQL> startup
ORACLE instance started.

Total System Global Area 3056513024 bytes
Fixed Size                  2257152 bytes
Variable Size             704646912 bytes
Database Buffers         2332033024 bytes
Redo Buffers               17575936 bytes
Database mounted.
ORA-01092: ORACLE instance terminated. Disconnection forced
ORA-00704: bootstrap process failure
ORA-00702: bootstrap verison '' inconsistent with version '8.0.0.0.0'
Process ID: 27344
Session ID: 191 Serial number: 3

bootstrap-ORA-00702


查看alert日志

Mon Apr 09 16:22:34 2018
ALTER DATABASE   MOUNT
Successful mount of redo thread 1, with mount id 1383493834
Database mounted in Exclusive Mode
Lost write protection disabled
Completed: ALTER DATABASE   MOUNT
Mon Apr 09 16:22:39 2018
ALTER DATABASE OPEN
Thread 1 opened at log sequence 3
  Current log# 3 seq# 3 mem# 0: /u01/app/oracle/oradata/orcl/redo03.log
Successful open of redo thread 1
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
SMON: enabling cache recovery
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_27344.trc:
ORA-00704: bootstrap process failure
ORA-00702: bootstrap verison '' inconsistent with version '8.0.0.0.0'
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_27344.trc:
ORA-00704: bootstrap process failure
ORA-00702: bootstrap verison '' inconsistent with version '8.0.0.0.0'
Error 704 happened during db open, shutting down database
USER (ospid: 27344): terminating the instance due to error 704
Instance terminated by USER, pid = 27344
ORA-1092 signalled during: ALTER DATABASE OPEN...
opiodr aborting process unknown ospid (27344) as a result of ORA-1092
Mon Apr 09 16:22:40 2018
ORA-1092 : opitsk aborting process

错误比较明显是由于数据库open过程中bootstrap异常导致,出现此类错误一般是由于软件介质和db不匹配或者bootstrap表的block故障导致.

官方说明

Versions 9.2, 10.1, 10.2, 11.1, 11.2, 12.1

Error:  ORA-00702 bootstrap verison '%s' inconsistent with version '%s' 
---------------------------------------------------------------------------
Cause:  The reading version of the boostrap is incompatible with the current 
	bootstrap version. 
Action: Restore a version of the software that is compatible with this 
	bootstrap version

由于bootstrap$以及其中相关表处理比较特殊,如果您遭遇此类bootstrap$相关异常无法解决,需要恢复支持,请联系我们
Phone:13429648788    Q Q:107644445QQ咨询惜分飞    E-Mail:dba@xifenfei.com

ORA-00600: internal error code, arguments: [16703], [1403], [20]

继续上篇的tab$被清空(ORA-600 16703故障解析—tab$表被清空),导致数据库启动异常的case
数据库日志分析
数据库open成功同时报ORA-7445 kokeicbegin和ORA-600 kzrini:!uprofile错误
ora-600-kzrini-uprofile


再次启动数据库直接报ORA-600 16703错误
ora-600-16703

这里有一个其他现象,就是数据库在open成功的同时(同一秒内),开始报异常.重启之后直接报
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00600: internal error code, arguments: [16703], [1403], [20], [], [], [], [], [], [], [], [], []
根据10046分析结果

=====================
select rowcnt,blkcnt,empcnt,avgspc,chncnt,avgrln,nvl(degree,1), nvl(instances,1) from tab$ where obj# = :1
END OF STMT
PARSE #140048443935120:c=0,e=390,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,plh=0,tim=1499185905161433
=====================
select blevel, leafcnt, distkey, lblkkey, dblkkey, clufac,        nvl(degree,1), nvl(instances,1) from ind$ where bo# = :1 and obj# = :2
END OF STMT
PARSE #140048443934176:c=1000,e=601,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,plh=0,tim=1499185905162088
=====================
PARSING IN CURSOR #140048443933232 len=70 dep=1 uid=0 oct=3 lid=0 tim=1499185905162444 hv=3377894161 ad='84f13d70' sqlid='32d4jrb4pd4sj'
select charsetid, charsetform from col$  where obj# = :1 and col# = :2
END OF STMT
PARSE #140048443933232:c=0,e=294,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,plh=0,tim=1499185905162443
=====================
PARSING IN CURSOR #140048443932288 len=52 dep=1 uid=0 oct=3 lid=0 tim=1499185905247020 hv=429618617 ad='84f0f2d0' sqlid='4krwuz0ctqxdt'
select ctime, mtime, stime from obj$ where obj# = :1
END OF STMT
PARSE #140048443932288:c=0,e=549,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,plh=0,tim=1499185905247019
BINDS #140048443932288:
select blevel, leafcnt, distkey, lblkkey, dblkkey, clufac,        nvl(degree,1), nvl(instances,1) from ind$ where bo# = :1 and obj# = :2
END OF STMT
PARSE #140048443934176:c=1000,e=601,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,plh=0,tim=1499185905162088
=====================
PARSING IN CURSOR #140048443933232 len=70 dep=1 uid=0 oct=3 lid=0 tim=1499185905162444 hv=3377894161 ad='84f13d70' sqlid='32d4jrb4pd4sj'
select charsetid, charsetform from col$  where obj# = :1 and col# = :2
END OF STMT
PARSE #140048443933232:c=0,e=294,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,plh=0,tim=1499185905162443
=====================
PARSING IN CURSOR #140048443932288 len=52 dep=1 uid=0 oct=3 lid=0 tim=1499185905247020 hv=429618617 ad='84f0f2d0' sqlid='4krwuz0ctqxdt'
select ctime, mtime, stime from obj$ where obj# = :1
END OF STMT
PARSE #140048443932288:c=0,e=549,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,plh=0,tim=1499185905247019
BINDS #140048443932288:
 Bind#0
  oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
  oacflg=00 fl2=0001 frm=00 csi=00 siz=24 off=0
  kxsbbbfp=7f5f91b87bd0  bln=22  avl=02  flg=05
  value=20
EXEC #140048443932288:c=2000,e=2686,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,plh=1218588913,tim=1499185905249810
WAIT #140048443932288: nam='db file sequential read' ela= 6205 file#=1 block#=337 blocks=1 obj#=36 tim=1499185905256132
WAIT #140048443932288: nam='db file sequential read' ela= 3739 file#=1 block#=338 blocks=1 obj#=36 tim=1499185905266704
WAIT #140048443932288: nam='db file sequential read' ela= 4966 file#=1 block#=241 blocks=1 obj#=18 tim=1499185905271761
FETCH #140048443932288:c=0,e=21976,p=3,cr=3,cu=0,mis=0,r=1,dep=1,og=4,plh=1218588913,tim=1499185905271820
STAT #140048443932288 id=1 cnt=1 pid=0 pos=1 obj=18 op='TABLE ACCESS BY INDEX ROWID OBJ$ (cr=3 pr=3 pw=0 time=21993 us)'
STAT #140048443932288 id=2 cnt=1 pid=1 pos=1 obj=36 op='INDEX RANGE SCAN I_OBJ1 (cr=2 pr=2 pw=0 time=16923 us)'
CLOSE #140048443932288:c=0,e=63,dep=1,type=0,tim=1499185905271941
BINDS #140048443935120:
 Bind#0
  oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
  oacflg=08 fl2=0001 frm=00 csi=00 siz=24 off=0
  kxsbbbfp=7f5f91c07de8  bln=22  avl=02  flg=05
  value=20
EXEC #140048443935120:c=1000,e=795,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,plh=2970138452,tim=1499185905272802
WAIT #140048443935120: nam='db file sequential read' ela= 3197 file#=1 block#=169 blocks=1 obj#=3 tim=1499185905276069
WAIT #140048443935120: nam='db file sequential read' ela= 3459 file#=1 block#=170 blocks=1 obj#=3 tim=1499185905404084
WAIT #140048443935120: nam='db file sequential read' ela= 6358 file#=1 block#=145 blocks=1 obj#=4 tim=1499185905410548
FETCH #140048443935120:c=999,e=137805,p=3,cr=3,cu=0,mis=0,r=0,dep=1,og=4,plh=2970138452,tim=1499185905410635
STAT #140048443935120 id=1 cnt=0 pid=0 pos=1 obj=4 op='TABLE ACCESS CLUSTER TAB$ (cr=3 pr=3 pw=0 time=137810 us)'
STAT #140048443935120 id=2 cnt=1 pid=1 pos=1 obj=3 op='INDEX UNIQUE SCAN I_OBJ# (cr=2 pr=2 pw=0 time=131330 us)'
 
*** 2017-07-05 00:31:46.094
Incident 176395 created, dump file: /oracle/diag/rdbms/orcl/orcl2/incident/incdir_176395/orcl_ora_51261_i176395.trc
ORA-00600: internal error code, arguments: [16703], [1403], [20], [], [], [], [], [], [], [], [], []

以及以往恢复经验和mos,基本上可以确定是由于tab$和obj$记录不匹配导致该问题.而且是obj#=20的记录在tab$和obj$中不匹配.

分析tab$和obj$记录

Data UnLoader: 11.2.0.1.5 - Internal Only - on Wed Jul 05 01:28:53 2017
with 64-bit io functions and the decompression option
 
Copyright (c) 1994 2017 Bernard van Duijnen All rights reserved.
 
 Strictly Oracle Internal Use Only
 
 
Found db_id = 1334610369
Found db_name = orcl
DUL> unload table TAB$( OBJ# number, DATAOBJ# number,
  2      TS# number, FILE# number, BLOCK# number,
  3      BOBJ# number, TAB# number, COLS number, CLUCOLS number,
  4      PCTFREE$ ignore, PCTUSED$ ignore, INITRANS ignore, MAXTRANS ignore,
  5      FLAGS ignore, AUDIT$ ignore, ROWCNT ignore, BLKCNT ignore,
  6      EMPCNT ignore, AVGSPC ignore, CHNCNT ignore, AVGRLN ignore,
  7      AVGSPC_FLB ignore, FLBCNT ignore,
  8      ANALYZETIME ignore, SAMPLESIZE ignore,
  9      DEGREE ignore, INSTANCES ignore,
 10      INTCOLS ignore, KERNELCOLS number, PROPERTY number)
 11      cluster  C_OBJ#(OBJ#)
 12      storage ( tablespace 0 segobjno 2 tabno 1 file 1 block 144);
. unloading table                      TAB$       0 rows unloaded
DUL> unload table OBJ$( OBJ# number, DATAOBJ# number, OWNER# number,
  2      NAME clean varchar2(30), NAMESPACE ignore, SUBNAME clean varchar2(30),
  3      TYPE# number, CTIME ignore, MTIME ignore, STIME ignore,
  4      STATUS ignore, REMOTEOWNER ignore, LINKNAME ignore,
  5      FLAGS ignore, OID$ hexraw)
  6      storage ( tablespace 0 segobjno 18 file 1 block 240);
. unloading table                      OBJ$   89804 rows unloaded
DUL>

这里可以明确表示tab$无记录,obj$有记录,从而启动的过程出现ORA-600 16703错误可以很好解释.由于数据库启动成功和报错几乎同时进行,以及上面看到的tab$记录不存在的情况,初步怀疑是有startup触发器清空tab$表记录
工具分析触发器表trigger$
startup-trigger
这里果然看到一个after startup on database的触发器,名字为DBMS_SUPPORT_DBMONITOR,而它调用的是DBMS_SUPPORT_DBMONITORP存储.

工具分析pl/sql表source$
DBMS_SUPPORT_DBMONITOR


对wraped加密的程序进行解密
DBMS_SUPPORT_DBMONITOR-unwraped

这里可以明确的看到DBMS_SUPPORT_DBMONITORP存储过程备份tab$表到orachk中然后delete tab$表,现在已经明确是由于DBMS_SUPPORT_DBMONITOR触发器在数据库重启之后开始执行调用DBMS_SUPPORT_DBMONITORP程序,如果判断数据库创建时间大于等于300天,便干掉tab$表,实现数据库破坏.

查找DBMS_SUPPORT_DBMONITOR等程序源头
分析相关程序创建时间,通过obj$表的ctime和name来判断
DBMS_SUPPORT_DBMONITOR-ctime
bootstrap-ctime

这里可以看出来DBMS_SUPPORT_DBMONITOR和DBMS_SUPPORT_DBMONITORP的创建时间基本上和数据库核心对象的创建时间相差无几,我们可以大概排除掉,是由于pl sql等工具连接数据库导致该发问题(类似:plsql dev引起的数据库被黑勒索比特币实现原理分析和解决方案),很可能是在dbca创建库的过程中就已经带有了DBMS_SUPPORT_DBMONITOR等程序,如果这样那很可能是由于数据库的安装介质被破坏导致该问题.

分析DBMS_SUPPORT_DBMONITOR来源
prvtsupp
20170711001626

这里已经很清晰,由于prvtsupp.plb被人注入了恶意脚本从而使得数据库被创建了DBMS_SUPPORT_DBMONITOR的触发器和DBMS_SUPPORT_DBMONITORP的存储过程,从而实现了对数据库的而易破坏.
md5

通过md5码对比,可以确定是有人对软件的安装介质进行了破坏,从而实现了恶意代码的注入,从而实现了数据库300天之后重启之后无法正常启动而是出现类似ORA-00600: internal error code, arguments: [16703], [1403], [20], [], [], [], [], [], [], [], [], []的错误.提醒各位一定要从官方途径下载oracle安装介质,如果是从其他互联网途径下载一定要验证md5,确保文件没有被人恶意篡改,造成无可挽回的损坏.如果真的不幸遇到这类问题,请保护现场联系我们
Tel:13429648788    Q Q:107644445QQ咨询惜分飞    E-Mail:dba@xifenfei.com
可以实现业务数据0丢失恢复,最大限度抢救您的数据

ora-600 2037 ORA-7445 kcbs_dump_adv_state

有客户系统断电,导致数据库无法启动,让我们帮忙解决,通过分析主要是ORA-600 2037和ORA-7445 _kcbs_dump_adv_state等错误,通过人工recover解决.
数据库报ORA-03113,无法启动成功

C:\Documents and Settings\Administrator>sqlplus / as sysdba

SQL*Plus: Release 10.2.0.1.0 - Production on 星期五 5月 12 09:50:36 2017

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

已连接到空闲例程。

SQL> startup
ORACLE 例程已经启动。

Total System Global Area 1258291200 bytes
Fixed Size                  1250548 bytes
Variable Size             218106636 bytes
Database Buffers         1031798784 bytes
Redo Buffers                7135232 bytes
数据库装载完毕。
ORA-03113: 通信通道的文件结束

分析alert日志

Fri May 12 09:50:43 2017
ALTER DATABASE OPEN
Fri May 12 09:50:43 2017
Beginning crash recovery of 1 threads
 parallel recovery started with 15 processes
Fri May 12 09:50:43 2017
Started redo scan
Fri May 12 09:50:43 2017
Completed redo scan
 1240 redo blocks read, 277 data blocks need recovery
Fri May 12 09:50:44 2017
Started redo application at
 Thread 1: logseq 5881, block 41179
Fri May 12 09:50:44 2017
Recovery of Online Redo Log: Thread 1 Group 1 Seq 5881 Reading mem 0
  Mem# 0 errs 0: E:\ORACLE\PRODUCT\10.2.0\ORADATA\xff\REDO01.LOG
Fri May 12 09:50:44 2017
Completed redo application
Fri May 12 09:50:44 2017
Errors in file e:\oracle\product\10.2.0\admin\xff\bdump\xff_p006_6072.trc:
ORA-00600: internal error code, arguments: [6110], [193], [3], [], [], [], [], []

Fri May 12 09:50:44 2017
Hex dump of (file 3, block 14004) in trace file e:\oracle\product\10.2.0\admin\xff\bdump\xff_p000_6024.trc
Corrupt block relative dba: 0x00c036b4 (file 3, block 14004)
Bad header found during crash/instance recovery
Data in bad block:
 type: 255 format: 7 rdba: 0x06010601
 last change scn: 0xa206.a2060601 seq: 0xb4 flg: 0x36
 spare1: 0x1 spare2: 0x6 spare3: 0x673
 consistency value in tail: 0x1b0a0708
 check value in block header: 0x36b4
 computed block checksum: 0xe4f5
Fri May 12 09:50:44 2017
Hex dump of (file 9, block 65507) in trace file e:\oracle\product\10.2.0\admin\xff\bdump\xff_p003_6056.trc
Corrupt block relative dba: 0x0240ffe3 (file 9, block 65507)
Bad header found during crash/instance recovery
Data in bad block:
 type: 3 format: 6 rdba: 0x06020601
 last change scn: 0xa206.a2060602 seq: 0xe3 flg: 0xff
 spare1: 0x1 spare2: 0x6 spare3: 0x6dc
 consistency value in tail: 0xc1028001
 check value in block header: 0xffe3
 computed block checksum: 0xff01
Fri May 12 09:50:44 2017
Reread of rdba: 0x00c036b4 (file 3, block 14004) found different data
Fri May 12 09:50:44 2017
Reread of rdba: 0x0240ffe3 (file 9, block 65507) found different data
Fri May 12 09:50:44 2017
Errors in file e:\oracle\product\10.2.0\admin\xff\bdump\xff_p005_6060.trc:
ORA-00600: internal error code,arguments:[2037],[17442602],[2718302723],[255],[9],[203],[657105414],[2147549568]
Fri May 12 09:50:44 2017
Errors in file e:\oracle\product\10.2.0\admin\xff\bdump\xff_p000_6024.trc:
ORA-07445:exception encountered:core dump[ACCESS_VIOLATION][_kclcomplete+79][PC:0x72B0C7][ADDR:0x220][UNABLE_TO_READ][]
Fri May 12 09:50:44 2017
Errors in file e:\oracle\product\10.2.0\admin\xff\bdump\xff_p006_6072.trc:
ORA-07445: exception encountered:core dump[ACCESS_VIOLATION][_kcbzdh+2496][PC:0x4A4928][ADDR:0xB][UNABLE_TO_READ][]
ORA-00600: internal error code, arguments: [6110], [193], [3], [], [], [], [], []
Errors in file e:\oracle\product\10.2.0\admin\xff\bdump\xff_p012_6128.trc:
ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [_kcbs_dump_adv_state+723] 
                                 [PC:0x5975A3] [ADDR:0xCBC0CBB2] [UNABLE_TO_READ] []
ORA-00600:internal error code,arguments:[2037],[17430318],[2718303745],[128],[1],[203],[4147028486],[2147549568]

错误比较明显由于坏块导致应用日志恢复异常,主要错误集中在ORA-600 2037,ORA-7445 _kcbs_dump_adv_state,ORA-7445_kcbzdh,ORA-7445 _kclcomplete等

dbv检查数据文件

E:\>dbv file=E:\ORACLE\PRODUCT\10.2.0\ORADATA\xff\SYSAUX01.DBF

DBVERIFY: Release 10.2.0.1.0 - Production on 星期五 5月 12 09:57:39 2017

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

DBVERIFY - 开始验证: FILE = E:\ORACLE\PRODUCT\10.2.0\ORADATA\xff\SYSAUX01.DBF

页 13353 标记为损坏
Corrupt block relative dba: 0x00c03429 (file 3, block 13353)
Bad header found during dbv:
Data in bad block:
 type: 1 format: 6 rdba: 0x3429a206
 last change scn: 0x066f.066f3429 seq: 0x0 flg: 0x00
 spare1: 0x6 spare2: 0xa2 spare3: 0x8c96
 consistency value in tail: 0x06018001
 check value in block header: 0x0
 block checksum disabled

页 14004 标记为损坏
Corrupt block relative dba: 0x00c036b4 (file 3, block 14004)
Bad header found during dbv:
Data in bad block:
 type: 1 format: 6 rdba: 0x36b4a206
 last change scn: 0x0673.067336b4 seq: 0x0 flg: 0x00
 spare1: 0x6 spare2: 0xa2 spare3: 0xfb97
 consistency value in tail: 0x06010210
 check value in block header: 0x0
 block checksum disabled

页 15261 标记为损坏
Corrupt block relative dba: 0x00c03b9d (file 3, block 15261)
Bad header found during dbv:
Data in bad block:
 type: 2 format: 6 rdba: 0x3b9da206
 last change scn: 0x0673.06733b9d seq: 0x0 flg: 0x00
 spare1: 0x6 spare2: 0xa2 spare3: 0x0
 consistency value in tail: 0x06018001
 check value in block header: 0x5549
 block checksum disabled



DBVERIFY - 验证完成

检查的页总数: 58880
处理的页总数 (数据): 19318
失败的页总数 (数据): 0
处理的页总数 (索引): 18610
失败的页总数 (索引): 0
处理的页总数 (其它): 13747
处理的总页数 (段)  : 0
失败的总页数 (段)  : 0
空的页总数: 7202
标记为损坏的总页数: 3
流入的页总数: 0
最高块 SCN            : 178325323 (0.178325323)


E:\>dbv file=E:\ORACLE\PRODUCT\10.2.0\ORADATA\xff\xff_BSE02

DBVERIFY: Release 10.2.0.1.0 - Production on 星期五 5月 12 10:10:24 2017

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

DBVERIFY - 开始验证: FILE = E:\ORACLE\PRODUCT\10.2.0\ORADATA\xff\xff_BSE02

页 65507 标记为损坏
Corrupt block relative dba: 0x0240ffe3 (file 9, block 65507)
Bad header found during dbv:
Data in bad block:
 type: 2 format: 6 rdba: 0xffe3a206
 last change scn: 0x06dc.06dcffe3 seq: 0x0 flg: 0x00
 spare1: 0x6 spare2: 0xa2 spare3: 0xb32
 consistency value in tail: 0x060102ff
 check value in block header: 0x0
 block checksum disabled



DBVERIFY - 验证完成

检查的页总数: 1310720
处理的页总数 (数据): 34102
失败的页总数 (数据): 0
处理的页总数 (索引): 30270
失败的页总数 (索引): 0
处理的页总数 (其它): 10850
处理的总页数 (段)  : 0
失败的总页数 (段)  : 0
空的页总数: 1235497
标记为损坏的总页数: 1
流入的页总数: 0
最高块 SCN            : 178325221 (0.178325221)

确实如alert日志报错,file 3和9 都出现坏块导致实例恢复无法进行。根据错误ORA-600 2037和ORA-7445 _kcbs_dump_adv_state,初步判断和During Startup (Open Database) Alert Log Shows ORA-600[2037] and ORA-7445[kcbs_dump_adv_state] (Doc ID 551993.1)文章描述相符(而且版本也相符)

尝试recover datafile部分file

E:\>sqlplus / as sysdba

SQL*Plus: Release 10.2.0.1.0 - Production on 星期五 5月 12 10:16:00 2017

Copyright (c) 1982, 2005, Oracle.  All rights reserved.


连接到:
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
With the Partitioning, OLAP and Data Mining options

SQL> recover datafile 1;
完成介质恢复。
SQL> recover datafile 2;
完成介质恢复。
SQL> recover datafile 3;
完成介质恢复。
SQL> recover datafile 4;
完成介质恢复。
SQL> recover datafile 9;
完成介质恢复。
SQL> alter database open;
alter database open
*
第 1 行出现错误:
ORA-00600: 内部错误代码, 参数: [kcratr1_lastbwr], [], [], [], [], [], [], []

ORA-00600 kcratr1_lastbwr错误比较明显,见ORA-00600:[Kcratr1_lastbwr] During Database Startup after a Crash (Doc ID 393984.1)

通过recover database处理

SQL> recover database;
完成介质恢复。
SQL> alter database open;

数据库已更改。

然后通过查询dba_extents 处理坏块对象

补充ORA-600 2037错误

Format: ORA-600 [2037] [a] [b] 1 [d] [e] [f] [g]


VERSIONS:
  versions 8.0 and above

DESCRIPTION:

  During recovery we are examining a block to ensure that it is not
  corrupt prior to applying any change vectors.

  The block has failed this check and this exception is raised.

ARGUMENTS:
  Arg [a] Relative Data Block Address (RDBA) that the redo vector is for
  Arg [b] The Block format  
  Arg 1 RDBA in the block itself
  Arg [d] The block type
  Arg [e] The sequence number
  Arg [f] Flags, if set  
  Arg [g] The return value from the block head/tail checker.

asm disk 分区表损坏恢复

有朋友反馈,他们做了xx存储的双活之后,重启主机发现gi无法正常启动,分析发现所有该存储的磁盘分区信息丢失,导致asmlib无法发现磁盘(使用分区做asm disk)
类似如下错误(磁盘分区丢失)

--fdisk -l 显示部分结果
Disk /dev/mapper/datahds1: 1099.5 GB, 1099511627776 bytes
255 heads, 63 sectors/track, 133674 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

--ls -l /dev/mapper/   显示结果无分区信息
lrwxrwxrwx 1 root root      7 May  6 03:44 datahds1 -> ../dm-1
lrwxrwxrwx 1 root root      7 May  6 03:26 datahds2 -> ../dm-3
lrwxrwxrwx 1 root root      7 May  6 03:26 datahds3 -> ../dm-8
lrwxrwxrwx 1 root root      7 May  6 03:26 ocrhds1 -> ../dm-0
lrwxrwxrwx 1 root root      7 May  6 03:26 ocrhds2 -> ../dm-2
lrwxrwxrwx 1 root root      7 May  6 03:26 ocrhds3 -> ../dm-4

asm日志显示

SUCCESS: diskgroup DATADG was mounted
NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 3
SUCCESS: diskgroup OCRHDS was mounted
ORA-15032: not all alterations performed
ORA-15017: diskgroup "DATA" cannot be mounted
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "DATA"

分析系统日志

May  6 02:23:27 db2 kernel: sdb: unknown partition table
May  6 02:23:27 db2 kernel: sde: unknown partition table
May  6 02:23:27 db2 kernel: sdc: unknown partition table
May  6 02:23:27 db2 kernel: sdf: unknown partition table
May  6 02:23:27 db2 kernel: sdd: unknown partition table
May  6 02:23:27 db2 kernel: sdj:Dev sdj: unable to read RDB block 0
May  6 02:23:27 db2 kernel: unable to read partition table
May  6 02:23:27 db2 kernel: sdi: sdi1
May  6 02:23:27 db2 kernel: sdk: sdk1
May  6 02:23:27 db2 kernel: sdg: unknown partition table
May  6 02:23:27 db2 kernel: sdl: sdl1
May  6 02:23:27 db2 kernel: sdm:Dev sdm: unable to read RDB block 0
May  6 02:23:27 db2 kernel: unable to read partition table
May  6 02:23:27 db2 kernel: sdo:Dev sdo: unable to read RDB block 0
May  6 02:23:27 db2 kernel: unable to read partition table
May  6 02:23:27 db2 kernel: sdn:Dev sdn: unable to read RDB block 0
May  6 02:23:27 db2 kernel: unable to read partition table
May  6 02:23:27 db2 kernel: sdp:Dev sdp: unable to read RDB block 0
May  6 02:23:27 db2 kernel: unable to read partition table
May  6 02:23:27 db2 kernel: sds:Dev sds: unable to read RDB block 0
May  6 02:23:27 db2 kernel: unable to read partition table
May  6 02:23:27 db2 kernel: sdh:
May  6 02:23:27 db2 kernel: sdt: sdt1
May  6 02:23:27 db2 kernel: sdv:Dev sdv: unable to read RDB block 0
May  6 02:23:27 db2 kernel: unable to read partition table
May  6 02:23:27 db2 kernel: sdq:Dev sdq: unable to read RDB block 0
May  6 02:23:27 db2 kernel: unable to read partition table
May  6 02:23:27 db2 kernel: sd 1:0:1:9: [sdr] Very big device. Trying to use READ CAPACITY(16).
May  6 02:23:27 db2 kernel: sdr:Dev sdr: unable to read RDB block 0
May  6 02:23:27 db2 kernel: unable to read partition table
May  6 02:23:27 db2 kernel: sd 2:0:0:9: [sdab] Very big device. Trying to use READ CAPACITY(16).
May  6 02:23:27 db2 kernel: sdab: unknown partition table
May  6 02:23:27 db2 kernel: sdac: unknown partition table
May  6 02:23:27 db2 kernel: sdw: sdw1
May  6 02:23:27 db2 kernel: sdu:Dev sdu: unable to read RDB block 0
May  6 02:23:27 db2 kernel: unable to read partition table
May  6 02:23:27 db2 kernel: sdx: sdx1
May  6 02:23:27 db2 kernel: sdy: sdy1
May  6 02:23:27 db2 kernel: sdaa: sdaa1
May  6 02:23:27 db2 kernel: sdz: sdz1
May  6 02:23:27 db2 kernel: sdae: unknown partition table
May  6 02:23:27 db2 kernel: sdaf: unknown partition table
May  6 02:23:27 db2 kernel: sdag: unknown partition table
May  6 02:23:27 db2 kernel: sdai:
May  6 02:23:27 db2 kernel: sdah: unknown partition table
May  6 02:23:27 db2 kernel: sdad: unknown partition table
May  6 02:23:28 db2 mcelog: failed to prefill DIMM database from DMI data

这里错误比较明显unknown partition table,磁盘的分区信息损坏.使用fdisk无法发现分区

partprobe也无效

[root@db2 oracle]# partprobe /dev/mapper/ocrhds3
[root@db2 oracle]# 
[root@db2 oracle]# ls -l /dev/mapper/ocrhds3*
lrwxrwxrwx 1 root root 7 May  6 07:30 /dev/mapper/ocrhds3 -> ../dm-4

从尚需信息看,磁盘的分区表信息应该已经损坏,现在能够做的,就是希望运气好,磁盘的分区的实际数据没有损坏

分析磁盘实际分区数据

[root@db2 ~]$ dd if=/dev/mapper/datahds1 of=/tmp/datahds1.dd bs=1024k count=50
[root@db2 ~]$ dd if=/tmp/datahds1.dd of=/tmp/xff01.dd  bs=3225 skip=1
[grid@db2 ~]$ kfed read /tmp/xff01.dd |more
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfbh.datfmt:                          1 ; 0x003: 0x01
kfbh.block.blk:                       0 ; 0x004: blk=0
kfbh.block.obj:              2147483648 ; 0x008: disk=0
kfbh.check:                  3110278718 ; 0x00c: 0xb963163e
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
kfdhdb.driver.provstr: ORCLDISKHDSDATA1 ; 0x000: length=16
kfdhdb.driver.reserved[0]:   1146307656 ; 0x008: 0x44534448
kfdhdb.driver.reserved[1]:    826364993 ; 0x00c: 0x31415441
kfdhdb.driver.reserved[2]:            0 ; 0x010: 0x00000000
kfdhdb.driver.reserved[3]:            0 ; 0x014: 0x00000000
kfdhdb.driver.reserved[4]:            0 ; 0x018: 0x00000000
kfdhdb.driver.reserved[5]:            0 ; 0x01c: 0x00000000
kfdhdb.compat:                186646528 ; 0x020: 0x0b200000
kfdhdb.dsknum:                        0 ; 0x024: 0x0000
kfdhdb.grptyp:                        1 ; 0x026: KFDGTP_EXTERNAL
kfdhdb.hdrsts:                        3 ; 0x027: KFDHDR_MEMBER
kfdhdb.dskname:             DATADG_0000 ; 0x028: length=11
kfdhdb.grpname:                  DATADG ; 0x048: length=6
kfdhdb.fgname:              DATADG_0000 ; 0x068: length=11
kfdhdb.capname:                         ; 0x088: length=0
kfdhdb.crestmp.hi:             33050696 ; 0x0a8: HOUR=0x8 DAYS=0x2 MNTH=0x4 YEAR=0x7e1
kfdhdb.crestmp.lo:           3813740544 ; 0x0ac: USEC=0x0 MSEC=0x44 SECS=0x35 MINS=0x38
kfdhdb.mntstmp.hi:             33050701 ; 0x0b0: HOUR=0xd DAYS=0x2 MNTH=0x4 YEAR=0x7e1
kfdhdb.mntstmp.lo:            411385856 ; 0x0b4: USEC=0x0 MSEC=0x150 SECS=0x8 MINS=0x6

通过上述分析,我们可以初步判断,分区磁盘的信息很可能是好的(因为asm disk header是好的,根据一般的规则从前往后覆盖,既然header是好的,后面的block被覆盖的概率非常小)

通过准备新磁盘直接把磁盘分区dd到新设备上

dd if=/dev/mapper/ocrhds1 of=/dev/mapper/ocrhdsnew1 skip=1 bs=3225
dd if=/dev/mapper/ocrhds2 of=/dev/mapper/ocrhdsnew2 skip=1 bs=3225
dd if=/dev/mapper/ocrhds3 of=/dev/mapper/ocrhdsnew3 skip=1 bs=3225
dd if=/dev/mapper/datahds1 of=/dev/mapper/datahdsnew1 skip=1 bs=3225
dd if=/dev/mapper/datahds2 of=/dev/mapper/datahdsnew2 skip=1 bs=3225
dd if=/dev/mapper/datahds3 of=/dev/mapper/datahdsnew3 skip=1 bs=3225

asmlib重新扫描磁盘

[root@db1 disks]# oracleasm scandisks
Reloading disk partitions: done
Cleaning any stale ASM disks...
Scanning system for ASM disks...
Instantiating disk "HDSOCR3"
Instantiating disk "HDSDATA2"
Instantiating disk "HDSDATA1"
Instantiating disk "HDSDATA3"
Instantiating disk "HDSOCR1"
Instantiating disk "HDSOCR2"
[root@db1 disks]# ls -ltr
total 0
brw-rw---- 1 grid asmadmin  8, 160 May  6 13:49 HDSOCR3
brw-rw---- 1 grid asmadmin  8, 192 May  6 13:49 HDSDATA2
brw-rw---- 1 grid asmadmin  8, 176 May  6 13:49 HDSDATA1
brw-rw---- 1 grid asmadmin  8, 208 May  6 13:49 HDSDATA3
brw-rw---- 1 grid asmadmin  8, 128 May  6 13:49 HDSOCR1
brw-rw---- 1 grid asmadmin  8, 144 May  6 13:49 HDSOCR2

kfed验证拷贝的分区

[root@db2 tmp]# /oracle/app/11.2.0/grid_1/bin/kfed read /dev/oracleasm/disks/HDSDATA1
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfbh.datfmt:                          1 ; 0x003: 0x01
kfbh.block.blk:                       0 ; 0x004: blk=0
kfbh.block.obj:              2147483648 ; 0x008: disk=0
kfbh.check:                  3110278718 ; 0x00c: 0xb963163e
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
kfdhdb.driver.provstr: ORCLDISKHDSDATA1 ; 0x000: length=16
kfdhdb.driver.reserved[0]:   1146307656 ; 0x008: 0x44534448
kfdhdb.driver.reserved[1]:    826364993 ; 0x00c: 0x31415441
kfdhdb.driver.reserved[2]:            0 ; 0x010: 0x00000000
kfdhdb.driver.reserved[3]:            0 ; 0x014: 0x00000000
kfdhdb.driver.reserved[4]:            0 ; 0x018: 0x00000000
kfdhdb.driver.reserved[5]:            0 ; 0x01c: 0x00000000
kfdhdb.compat:                186646528 ; 0x020: 0x0b200000
kfdhdb.dsknum:                        0 ; 0x024: 0x0000
kfdhdb.grptyp:                        1 ; 0x026: KFDGTP_EXTERNAL
kfdhdb.hdrsts:                        3 ; 0x027: KFDHDR_MEMBER
kfdhdb.dskname:             DATADG_0000 ; 0x028: length=11
kfdhdb.grpname:                  DATADG ; 0x048: length=6
kfdhdb.fgname:              DATADG_0000 ; 0x068: length=11
kfdhdb.capname:                         ; 0x088: length=0

asm和数据库启动正常

[grid@db2 ~]$ asmcmd
ASMCMD> lsdg
State    Type    Rebal  Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  EXTERN  N         512   4096  1048576   3145710  2378034                0         2378034              0             N  DATADG/
MOUNTED  NORMAL  N         512   4096  1048576     15342    14416             5114            4651              0             Y  OCRHDS/
ASMCMD> 

[oracle@db2 ~]$ sqlplus  / as sysdba

SQL*Plus: Release 11.2.0.4.0 Production on Sat May 6 13:54:21 2017

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> startup
ORACLE instance started.

Total System Global Area 3.6077E+10 bytes
Fixed Size                  2260648 bytes
Variable Size            7247757656 bytes
Database Buffers         2.8723E+10 bytes
Redo Buffers              104382464 bytes
Database mounted.
Database opened.
SQL> 

asm-disk-partition-lost-recovery


通过上述恢复,实现asm磁盘分区丢失数据0丢失