ORA-704 ORA-604 ORA-942 恢复

接到客户请求,oracle数据库停机重启维护之后,无法正常启动,请求我们给予协助
数据库启动报ORA-00704 ORA-00604 ORA-00942错误

SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------

Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - Production
PL/SQL Release 11.2.0.1.0 - Production
CORE    11.2.0.1.0      Production
TNS for 32-bit Windows: Version 11.2.0.1.0 - Production
NLSRTL Version 11.2.0.1.0 - Production

SQL> startup open
ORACLE 例程已经启动。

Total System Global Area 1288949760 bytes
Fixed Size                  1376520 bytes
Variable Size             377491192 bytes
Database Buffers          897581056 bytes
Redo Buffers               12500992 bytes
数据库装载完毕。
ORA-01092: ORACLE instance terminated. Disconnection forced
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-00942: table or view does not exist
进程 ID: 2756
会话 ID: 5 序列号: 9

alert日志报错

SMON: enabling cache recovery
Errors in file d:\app\administrator\diag\rdbms\xff\xff\trace\xff_ora_2756.trc:
ORA-00704: 引导程序进程失败
ORA-00604: 递归 SQL 级别 1 出现错误
ORA-00942: 表或视图不存在
Errors in file d:\app\administrator\diag\rdbms\xff\xff\trace\xff_ora_2756.trc:
ORA-00704: 引导程序进程失败
ORA-00604: 递归 SQL 级别 1 出现错误
ORA-00942: 表或视图不存在
Error 704 happened during db open, shutting down database
USER (ospid: 2756): terminating the instance due to error 704
Instance terminated by USER, pid = 2756
ORA-1092 signalled during: ALTER DATABASE OPEN...
opiodr aborting process unknown ospid (2756) as a result of ORA-1092
Fri Nov 30 12:51:26 2018
ORA-1092 : opitsk aborting process

根据这些年的恢复经验,恢复过相关错误的主要有:
ORA-01092 ORA-00704 ORA-00942
Oracle 11g丢失access$恢复方法
11.1.0.7版本也会出现access$表丢失导致数据库无法启动
总结主要两类:1. 由于某种bug导致access$表丢失的故障,另外一种是由于坏块导致数据库核心基表损坏引起,对于这个库进行分析

dbv 检查坏块

D:\APP\ADMINISTRATOR\ORADATA\ORCL> dbv file=system01.dbf

DBVERIFY: Release 10.2.0.3.0 - Production on 星期五 11月 30 15:11:24 2018

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

DBVERIFY - 开始验证: FILE = system01.dbf


DBVERIFY - 验证完成

检查的页总数: 93440
处理的页总数 (数据): 61979
失败的页总数 (数据): 0
处理的页总数 (索引): 13560
失败的页总数 (索引): 0
处理的页总数 (其它): 3067
处理的总页数 (段)  : 0
失败的总页数 (段)  : 0
空的页总数: 14834
标记为损坏的总页数: 0
流入的页总数: 0
最高块 SCN            : 659587683 (0.659587683)

通过dbv检查数据库,确定system没有坏块
10046跟踪数据库启动

SQL> startup mount;
ORACLE 例程已经启动。

Total System Global Area 1288949760 bytes
Fixed Size                  1376520 bytes
Variable Size             377491192 bytes
Database Buffers          897581056 bytes
Redo Buffers               12500992 bytes
数据库装载完毕。

SQL> oradebug setmypid
已处理的语句
SQL> alter session set db_file_multiblocK_read_count=1;

会话已更改。

SQL> ALTER SESSION SET EVENTS '704 trace name errorstack level 3';

会话已更改。

SQL> oradebug EVENT 10046 TRACE NAME CONTEXT FOREVER, LEVEL 12
已处理的语句
SQL> oradebug TRACEFILE_NAME
d:\app\administrator\diag\rdbms\xff\xff\trace\xff_ora_1132.trc

--trace信息
PARSE ERROR #3:len=208 dep=1 uid=0 oct=9 lid=0 tim=5838844893 err=942
CREATE UNIQUE INDEX I_OBJ1 ON OBJ$(OBJ#,OWNER#,TYPE#) PCTFREE 10 INITRANS 2 MAXTRANS 255 
STORAGE (  INITIAL 64K NEXT 1024K MINEXTENTS 1 MAXEXTENTS 2147483645
 PCTINCREASE 0 OBJNO 36 EXTENTS (FILE 1 BLOCK 336))

*** 2018-11-30 13:18:07.593
dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x0, level=3, mask=0x0)
----- Error Stack Dump -----
ORA-00704: 引导程序进程失败
ORA-00604: 递归 SQL 级别 1 出现错误
ORA-00942: 表或视图不存在

通过这一步基本上可以判断是由于obj$表丢失导致数据库创建I_OBJ1 index不成功,从而使得数据库无法正常启动。通过一些技巧修复出来obj$表,尝试启动数据库

SQL> alter database open;
alter database open
*
第 1 行出现错误:
ORA-01092: ORACLE instance terminated. Disconnection forced
ORA-00704: bootstrap process failure
ORA-00600: internal error code, arguments: [4000], [6], [], [], [], [], [], [],
[], [], [], []
进程 ID: 836
会话 ID: 355 序列号: 6027

alert日志报错

Fri Nov 30 15:33:47 2018
SMON: enabling cache recovery
Errors in file d:\app\administrator\diag\rdbms\xff\xff\trace\xff_ora_836.trc  (incident=648694):
ORA-00600: 内部错误代码, 参数: [4000], [6], [], [], [], [], [], [], [], [], [], []
Incident details in: d:\app\administrator\diag\rdbms\xff\xff\incident\incdir_648694\xff_ora_836_i648694.trc
Errors in file d:\app\administrator\diag\rdbms\xff\xff\trace\xff_ora_836.trc:
ORA-00704: 引导程序进程失败
ORA-00600: 内部错误代码, 参数: [4000], [6], [], [], [], [], [], [], [], [], [], []
Errors in file d:\app\administrator\diag\rdbms\xff\xff\trace\xff_ora_836.trc:
ORA-00704: 引导程序进程失败
ORA-00600: 内部错误代码, 参数: [4000], [6], [], [], [], [], [], [], [], [], [], []
Error 704 happened during db open, shutting down database
USER (ospid: 836): terminating the instance due to error 704
Fri Nov 30 15:33:57 2018
Instance terminated by USER, pid = 836
ORA-1092 signalled during: alter database open...
opiodr aborting process unknown ospid (836) as a result of ORA-1092

发现数据库启动报ORA-00704和ORA-00600: internal error code, arguments: [4000], [6], [], [], [], [], [], [],错误,根据以往经验,该问题是由于回滚段异常导致
分析trace信息

Dump continued from file: d:\app\administrator\diag\rdbms\xff\xff\trace\xff_ora_2124.trc
ORA-00600: 内部错误代码, 参数: [4000], [6], [], [], [], [], [], [], [], [], [], []

========= Dump for incident 651102 (ORA 600 [4000]) ========

*** 2018-11-30 15:46:32.125
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
----- Current SQL Statement for txff session (sql_id=6apq2rjyxmxpj) -----
select line#, sql_text from bootstrap$ where obj# != :1

----- Call Stack Trace -----
calling              call     entry                argument values in hex      
location             type     point                (? means dubious value)     
-------------------- -------- -------------------- ----------------------------
_skdstdst()+121      CALLrel  _kgdsdst()           E6B1614 2
_ksedst1()+93        CALLrel  _skdstdst()          E6B1614 0 1 436646 435BE2
                                                   436646
_ksedst()+49         CALLrel  _ksedst1()           0 1
_dbkedDefDump()+367  CALLrel  _ksedst()            0
2                                                  
_ksedmp()+44         CALLrel  _dbkedDefDump()      3 2
_ksfdmp()+56         CALLrel  _ksedmp()            3EB
_dbgexPhaseII()+164  CALLreg  00000000             C378A28 3EB
0                                                  
_dbgexProcessError(  CALLrel  _dbgexPhaseII()      E2B0454 E2B6E50 E6B6078
)+2061                                             
_dbgeExecuteForErro  CALLrel  _dbgexProcessError(  E2B0454 E2B6E50 1 0 E2B0454
r()+43                        )                    E2B6E50
__VInfreq__dbgePost  CALLrel  _dbgeExecuteForErro  E2B0454 E2B6E50 0 1 0
ErrorKGE()+260                r()                  
_dbkePostKGE_kgsf()  CALLrel  _dbgePostErrorKGE()  C378A28 E2BD274 258
+56                                                
_kgeade()+299        CALLreg  00000000             C378A28 E2BD274 258
_kgeriv_int()+79     CALLrel  _kgeade()            C378A28 C378B50 E2BD274 258 0
                                                   FA0 0 0 0 1 E6B68A8
_kgeriv()+22         CALLrel  _kgeriv_int()        C378A28 E2BD274 FA0 0 1
                                                   E6B68A8
_kgeasi()+107        CALLrel  _kgeriv()            C378A28 E2BD274 FA0 1 E6B68A8
__VInfreq__ktuGetUs  CALLrel  _kgeasi()            C378A28 E2BD274 FA0 2 1 0 6 0
egDba()+123                                        
_ktrgcm()+5147       CALLrel  _ktuGetUsegDba()     6 E6B6E78 0 0 E6B6F48 0
_ktrget2()+596       CALLrel  _ktrgcm()            F9FCEAC
_kdst_fetch()+816    CALLrel  _ktrget2()           F9FCEAC F9FCE24 303 0
_kdstf11001010000km  CALLrel  _kdst_fetch()        1 F9FCEA8 E6B71C8
()+2806                                            
_kdsttgr()+5944      CALLrel  _kdstf11001010000km  F9FCEA8 0 557B044C F9FCDF8
                              ()                   2F0C3A6 E6B7894
_qertbFetch()+767    CALLrel  _kdsttgr()           F9FCEA8 0 557B044C F9FCDF8
                                                   557B0498 2F0C3A6 E6B7894 1
_opifch2()+2729      CALLptr  00000000             E6E7228 0 0 2 26180001
_opifch()+53         CALLrel  _opifch2()           89 5 E6B7A04
_opiodr()+1248       CALLreg  00000000             5 2 E6B81FC
_rpidrus()+186       CALLrel  _opiodr()            5 2 E6B81FC 2
_rpidru()+90         CALLrel  _rpidrus()           E6B7D58
_rpiswu2()+557       CALLrel  _rpidru()            E6B813C
_rpidrv()+1242       CALLrel  _rpiswu2()           6D9A295C 0 6D9A29A8 2 E6B8184
                                                   0 6D9A2A28 0 0 544632 5448D6
                                                   E6B813C 8
_rpifch()+43         CALLrel  _rpidrv()            2 5 E6B81FC 8
_kqlbebs()+1213      CALLrel  _rpifch()            2 2 2 F013232 FA0 1 0 E6B8634
                                                   0 0 0 0 0
_kqlblfc()+175       CALLrel  _kqlbebs()           0 E6BBBD4
_adbdrv()+16992      CALLrel  _kqlblfc()           0 E6BBBD4
_opiexe()+13594      CALLrel  _adbdrv()            4A 6E6CBC48 6DDEDB6C E6BBD68
                                                   6D60697 6E6CBC48
_opiosq0()+6248      CALLrel  _opiexe()            4 0 E6BC734
_kpooprx()+277       CALLrel  _opiosq0()           3 E E6BC9A0 A4 0
_kpoal8()+632        CALLrel  _kpooprx()           E6BF0A4 E6BD420 1B 1 0 A4
_opiodr()+1248       CALLreg  00000000             5E 1C E6BF0A0
_ttcpip()+1051       CALLreg  00000000             5E 1C E6BF0A0 0
_opitsk()+1404       CALL???  00000000             C3832A8 5E E6BF0A0 0 E6BED30
                                                   E6BF1CC 53E52E 0 E6BF1F8
_opiino()+980        CALLrel  _opitsk()            0 0
_opiodr()+1248       CALLreg  00000000             3C 4 E6BFBF4
_opidrv()+1201       CALLrel  _opiodr()            3C 4 E6BFBF4 0
_sou2o()+55          CALLrel  _opidrv()            3C 4 E6BFBF4
_opimai_real()+124   CALLrel  _sou2o()             E6BFC04 3C 4 E6BFBF4
_opimai()+125        CALLrel  _opimai_real()       2 E6BFC2C
_OracleThreadStart@  CALLrel  _opimai()            2 E6BFF6C 7C9BA7F4 E6BFC34 0
4()+830                                            E6BFD04
7C82484C             CALLreg  00000000             E5BFF9C 0 0 E5BFF9C 0 E6BFFC4
00000000             CALL???  00000000             
 

--------------------- Binary Stack Dump ---------------------


Block header dump:  0x0040020b
 Object id on Block? Y
 seg/obj: 0x3b  csc: 0x00.27508136  itc: 1  flg: O  typ: 1 - DATA
     fsl: 0  fnx: 0x0 ver: 0x01
 
 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0006.015.0005e9e9  0x00c0052c.916f.10  --U-    1  fsc 0x0000.27508263

报错比较明显该block比较异常,通过bbed修改相关block信息,相关处理参考:
重现ORA-600 4000异常
通过bbed解决ORA-00600[4000]案例
记录一次ORA-600 4000数据库故障恢复
ORACLE 8.1.7 数据库ORA-600 4000故障恢复

处理之后数据库正常open

Fri Nov 30 15:57:34 2018
SMON: enabling cache recovery
Successfully onlined Undo Tablespace 2.
Verifying file header compatibility for 11g tablespace encryption..
Verifying 11g file header compatibility for tablespace encryption completed
SMON: enabling tx recovery
Database Characterset is ZHS16GBK
No Resource Manager plan active
replication_dependency_tracking turned off (no async multimaster replication found)
Fri Nov 30 15:57:38 2018
Starting background process QMNC
Fri Nov 30 15:57:38 2018
QMNC started with pid=23, OS id=1152 
Completed: alter database open

0kb数据文件恢复或者文件丢失恢复

接到一个朋友恢复请求,由于rose频繁切换导致文件系统部分数据文件变化为0kb和文件丢失.
故障现象
部分数据文件变化为0kb和文件丢失.
file_lost
file_size_0


这里比较明显,数据库的users03变为了0kb和users04丢失.数据库alert日志报错信息如下:

Completed: alter database mount exclusive
alter database open
Errors in file E:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl\trace\orcl_dbw0_12008.trc:
ORA-01157: ????/?????? 7 - ??? DBWR ????
ORA-01110: ???? 7: 'E:\APP\ADMINISTRATOR\ORADATA\ORCL\USERS03.DBF'
ORA-27047: ??????????
OSD-04006: ReadFile() 失败, 无法读取文件
O/S-Error: (OS 38) 已到文件结尾。
Errors in file E:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl\trace\orcl_dbw0_12008.trc:
ORA-01157: ????/?????? 8 - ??? DBWR ????
ORA-01110: ???? 8: 'E:\APP\ADMINISTRATOR\ORADATA\ORCL\USERS04.DBF'
ORA-27041: ??????
OSD-04002: 无法打开文件
O/S-Error: (OS 2) 系统找不到指定的文件。
Errors in file E:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl\trace\orcl_ora_12040.trc:
ORA-01157: ????/?????? 7 - ??? DBWR ????
ORA-01110: ???? 7: 'E:\APP\ADMINISTRATOR\ORADATA\ORCL\USERS03.DBF'
ORA-1157 signalled during: alter database open...
Fri May 04 09:35:10 2018
Checker run found 2 new persistent data failures

alert日志的报错也比较明显,users03是文件超过了大小(大小为0kb,读取之后肯定超过大小),users04提示无法打开文件(文件在文件系统层面已经丢失).现在问题比较明显由于文件系统故障导致文件大小为0和丢失

碎片扫描恢复
常规的方法肯定无法恢复,比较好的方法只能是底层碎片扫描重组,结合多种扫描工具,最后发现一个做底层恢复的朋友的工具效果不错,扫描结果如下
file_scan


通过工具分析坏块情况

C:\Users\Administrator>dbv FiLe=D:\0504\ORCL_TS.4_FILE.7_10.ora

DBVERIFY: Release 11.2.0.4.0 - Production on 星期六 5月 5 08:52:53 2018

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

DBVERIFY - 开始验证: FILE = D:\0504\ORCL_TS.4_FILE.7_10.ora

………………

页 382565 标记为损坏
Corrupt block relative dba: 0x01c5d665 (file 7, block 382565)
Completely zero block found during dbv:

页 382566 标记为损坏
Corrupt block relative dba: 0x01c5d666 (file 7, block 382566)
Completely zero block found during dbv:

页 382567 标记为损坏
Corrupt block relative dba: 0x01c5d667 (file 7, block 382567)
Completely zero block found during dbv:



DBVERIFY - 验证完成

检查的页总数: 1374720
处理的页总数 (数据): 27582
失败的页总数 (数据): 0
处理的页总数 (索引): 20114
失败的页总数 (索引): 0
处理的页总数 (其他): 1319752
处理的总页数 (段)  : 0
失败的总页数 (段)  : 0
空的页总数: 1
标记为损坏的总页数: 7271
流入的页总数: 0
加密的总页数        : 0
最高块 SCN            : 228271996 (0.228271996)


C:\Users\Administrator>dbv FiLe=D:\0504\ORCL_TS.4_FILE.8_8.ora

DBVERIFY: Release 11.2.0.4.0 - Production on 星期六 5月 5 08:52:53 2018

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

DBVERIFY - 开始验证: FILE = D:\0504\ORCL_TS.4_FILE.8_8.ora


DBVERIFY - 验证完成

检查的页总数: 1136896
处理的页总数 (数据): 36639
失败的页总数 (数据): 0
处理的页总数 (索引): 57038
失败的页总数 (索引): 0
处理的页总数 (其他): 1043218
处理的总页数 (段)  : 0
失败的总页数 (段)  : 0
空的页总数: 1
标记为损坏的总页数: 0
流入的页总数: 0
加密的总页数        : 0
最高块 SCN            : 228271997 (0.228271997)

C:\Users\Administrator>

scan_resulte


这里通过分析恢复的两个文件总的block数量2511618,其中连续损坏7271个block损坏,由于出现问题之后,数据库被offline这两个文件继续启动运行了几个小时,导致少量block被覆盖,恢复软件直接置空.后续的恢复比较顺利,正常open数据库,然后处理坏块对象(正好不是业务核心表的lob字段,所有部分丢失影响不是非常大).

温馨提醒:
1. 数据文件和备份不要放在同一个阵列上,更不能是同一个分区(卷)上
2. 出现此类问题之后,应当理解停止对该分区的任何写操作,方式丢失或者大小为0的文件被覆盖.

ORA-00704 ORA-00702 恢复

数据库启动报ORA-01092 ORA-00704 ORA-00702错误

SQL> startup
ORACLE instance started.

Total System Global Area 3056513024 bytes
Fixed Size                  2257152 bytes
Variable Size             704646912 bytes
Database Buffers         2332033024 bytes
Redo Buffers               17575936 bytes
Database mounted.
ORA-01092: ORACLE instance terminated. Disconnection forced
ORA-00704: bootstrap process failure
ORA-00702: bootstrap verison '' inconsistent with version '8.0.0.0.0'
Process ID: 27344
Session ID: 191 Serial number: 3

bootstrap-ORA-00702


查看alert日志

Mon Apr 09 16:22:34 2018
ALTER DATABASE   MOUNT
Successful mount of redo thread 1, with mount id 1383493834
Database mounted in Exclusive Mode
Lost write protection disabled
Completed: ALTER DATABASE   MOUNT
Mon Apr 09 16:22:39 2018
ALTER DATABASE OPEN
Thread 1 opened at log sequence 3
  Current log# 3 seq# 3 mem# 0: /u01/app/oracle/oradata/orcl/redo03.log
Successful open of redo thread 1
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
SMON: enabling cache recovery
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_27344.trc:
ORA-00704: bootstrap process failure
ORA-00702: bootstrap verison '' inconsistent with version '8.0.0.0.0'
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_27344.trc:
ORA-00704: bootstrap process failure
ORA-00702: bootstrap verison '' inconsistent with version '8.0.0.0.0'
Error 704 happened during db open, shutting down database
USER (ospid: 27344): terminating the instance due to error 704
Instance terminated by USER, pid = 27344
ORA-1092 signalled during: ALTER DATABASE OPEN...
opiodr aborting process unknown ospid (27344) as a result of ORA-1092
Mon Apr 09 16:22:40 2018
ORA-1092 : opitsk aborting process

错误比较明显是由于数据库open过程中bootstrap异常导致,出现此类错误一般是由于软件介质和db不匹配或者bootstrap表的block故障导致.

官方说明

Versions 9.2, 10.1, 10.2, 11.1, 11.2, 12.1

Error:  ORA-00702 bootstrap verison '%s' inconsistent with version '%s' 
---------------------------------------------------------------------------
Cause:  The reading version of the boostrap is incompatible with the current 
	bootstrap version. 
Action: Restore a version of the software that is compatible with this 
	bootstrap version

由于bootstrap$以及其中相关表处理比较特殊,如果您遭遇此类bootstrap$相关异常无法解决,需要恢复支持,请联系我们
Phone:13429648788    Q Q:107644445QQ咨询惜分飞    E-Mail:dba@xifenfei.com

ORA-00600: internal error code, arguments: [16703], [1403], [20]

继续上篇的tab$被清空(ORA-600 16703故障解析—tab$表被清空),导致数据库启动异常的case
数据库日志分析
数据库open成功同时报ORA-7445 kokeicbegin和ORA-600 kzrini:!uprofile错误
ora-600-kzrini-uprofile


再次启动数据库直接报ORA-600 16703错误
ora-600-16703

这里有一个其他现象,就是数据库在open成功的同时(同一秒内),开始报异常.重启之后直接报
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00600: internal error code, arguments: [16703], [1403], [20], [], [], [], [], [], [], [], [], []
根据10046分析结果

=====================
select rowcnt,blkcnt,empcnt,avgspc,chncnt,avgrln,nvl(degree,1), nvl(instances,1) from tab$ where obj# = :1
END OF STMT
PARSE #140048443935120:c=0,e=390,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,plh=0,tim=1499185905161433
=====================
select blevel, leafcnt, distkey, lblkkey, dblkkey, clufac,        nvl(degree,1), nvl(instances,1) from ind$ where bo# = :1 and obj# = :2
END OF STMT
PARSE #140048443934176:c=1000,e=601,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,plh=0,tim=1499185905162088
=====================
PARSING IN CURSOR #140048443933232 len=70 dep=1 uid=0 oct=3 lid=0 tim=1499185905162444 hv=3377894161 ad='84f13d70' sqlid='32d4jrb4pd4sj'
select charsetid, charsetform from col$  where obj# = :1 and col# = :2
END OF STMT
PARSE #140048443933232:c=0,e=294,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,plh=0,tim=1499185905162443
=====================
PARSING IN CURSOR #140048443932288 len=52 dep=1 uid=0 oct=3 lid=0 tim=1499185905247020 hv=429618617 ad='84f0f2d0' sqlid='4krwuz0ctqxdt'
select ctime, mtime, stime from obj$ where obj# = :1
END OF STMT
PARSE #140048443932288:c=0,e=549,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,plh=0,tim=1499185905247019
BINDS #140048443932288:
select blevel, leafcnt, distkey, lblkkey, dblkkey, clufac,        nvl(degree,1), nvl(instances,1) from ind$ where bo# = :1 and obj# = :2
END OF STMT
PARSE #140048443934176:c=1000,e=601,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,plh=0,tim=1499185905162088
=====================
PARSING IN CURSOR #140048443933232 len=70 dep=1 uid=0 oct=3 lid=0 tim=1499185905162444 hv=3377894161 ad='84f13d70' sqlid='32d4jrb4pd4sj'
select charsetid, charsetform from col$  where obj# = :1 and col# = :2
END OF STMT
PARSE #140048443933232:c=0,e=294,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,plh=0,tim=1499185905162443
=====================
PARSING IN CURSOR #140048443932288 len=52 dep=1 uid=0 oct=3 lid=0 tim=1499185905247020 hv=429618617 ad='84f0f2d0' sqlid='4krwuz0ctqxdt'
select ctime, mtime, stime from obj$ where obj# = :1
END OF STMT
PARSE #140048443932288:c=0,e=549,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,plh=0,tim=1499185905247019
BINDS #140048443932288:
 Bind#0
  oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
  oacflg=00 fl2=0001 frm=00 csi=00 siz=24 off=0
  kxsbbbfp=7f5f91b87bd0  bln=22  avl=02  flg=05
  value=20
EXEC #140048443932288:c=2000,e=2686,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,plh=1218588913,tim=1499185905249810
WAIT #140048443932288: nam='db file sequential read' ela= 6205 file#=1 block#=337 blocks=1 obj#=36 tim=1499185905256132
WAIT #140048443932288: nam='db file sequential read' ela= 3739 file#=1 block#=338 blocks=1 obj#=36 tim=1499185905266704
WAIT #140048443932288: nam='db file sequential read' ela= 4966 file#=1 block#=241 blocks=1 obj#=18 tim=1499185905271761
FETCH #140048443932288:c=0,e=21976,p=3,cr=3,cu=0,mis=0,r=1,dep=1,og=4,plh=1218588913,tim=1499185905271820
STAT #140048443932288 id=1 cnt=1 pid=0 pos=1 obj=18 op='TABLE ACCESS BY INDEX ROWID OBJ$ (cr=3 pr=3 pw=0 time=21993 us)'
STAT #140048443932288 id=2 cnt=1 pid=1 pos=1 obj=36 op='INDEX RANGE SCAN I_OBJ1 (cr=2 pr=2 pw=0 time=16923 us)'
CLOSE #140048443932288:c=0,e=63,dep=1,type=0,tim=1499185905271941
BINDS #140048443935120:
 Bind#0
  oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
  oacflg=08 fl2=0001 frm=00 csi=00 siz=24 off=0
  kxsbbbfp=7f5f91c07de8  bln=22  avl=02  flg=05
  value=20
EXEC #140048443935120:c=1000,e=795,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,plh=2970138452,tim=1499185905272802
WAIT #140048443935120: nam='db file sequential read' ela= 3197 file#=1 block#=169 blocks=1 obj#=3 tim=1499185905276069
WAIT #140048443935120: nam='db file sequential read' ela= 3459 file#=1 block#=170 blocks=1 obj#=3 tim=1499185905404084
WAIT #140048443935120: nam='db file sequential read' ela= 6358 file#=1 block#=145 blocks=1 obj#=4 tim=1499185905410548
FETCH #140048443935120:c=999,e=137805,p=3,cr=3,cu=0,mis=0,r=0,dep=1,og=4,plh=2970138452,tim=1499185905410635
STAT #140048443935120 id=1 cnt=0 pid=0 pos=1 obj=4 op='TABLE ACCESS CLUSTER TAB$ (cr=3 pr=3 pw=0 time=137810 us)'
STAT #140048443935120 id=2 cnt=1 pid=1 pos=1 obj=3 op='INDEX UNIQUE SCAN I_OBJ# (cr=2 pr=2 pw=0 time=131330 us)'
 
*** 2017-07-05 00:31:46.094
Incident 176395 created, dump file: /oracle/diag/rdbms/orcl/orcl2/incident/incdir_176395/orcl_ora_51261_i176395.trc
ORA-00600: internal error code, arguments: [16703], [1403], [20], [], [], [], [], [], [], [], [], []

以及以往恢复经验和mos,基本上可以确定是由于tab$和obj$记录不匹配导致该问题.而且是obj#=20的记录在tab$和obj$中不匹配.

分析tab$和obj$记录

Data UnLoader: 11.2.0.1.5 - Internal Only - on Wed Jul 05 01:28:53 2017
with 64-bit io functions and the decompression option
 
Copyright (c) 1994 2017 Bernard van Duijnen All rights reserved.
 
 Strictly Oracle Internal Use Only
 
 
Found db_id = 1334610369
Found db_name = orcl
DUL> unload table TAB$( OBJ# number, DATAOBJ# number,
  2      TS# number, FILE# number, BLOCK# number,
  3      BOBJ# number, TAB# number, COLS number, CLUCOLS number,
  4      PCTFREE$ ignore, PCTUSED$ ignore, INITRANS ignore, MAXTRANS ignore,
  5      FLAGS ignore, AUDIT$ ignore, ROWCNT ignore, BLKCNT ignore,
  6      EMPCNT ignore, AVGSPC ignore, CHNCNT ignore, AVGRLN ignore,
  7      AVGSPC_FLB ignore, FLBCNT ignore,
  8      ANALYZETIME ignore, SAMPLESIZE ignore,
  9      DEGREE ignore, INSTANCES ignore,
 10      INTCOLS ignore, KERNELCOLS number, PROPERTY number)
 11      cluster  C_OBJ#(OBJ#)
 12      storage ( tablespace 0 segobjno 2 tabno 1 file 1 block 144);
. unloading table                      TAB$       0 rows unloaded
DUL> unload table OBJ$( OBJ# number, DATAOBJ# number, OWNER# number,
  2      NAME clean varchar2(30), NAMESPACE ignore, SUBNAME clean varchar2(30),
  3      TYPE# number, CTIME ignore, MTIME ignore, STIME ignore,
  4      STATUS ignore, REMOTEOWNER ignore, LINKNAME ignore,
  5      FLAGS ignore, OID$ hexraw)
  6      storage ( tablespace 0 segobjno 18 file 1 block 240);
. unloading table                      OBJ$   89804 rows unloaded
DUL>

这里可以明确表示tab$无记录,obj$有记录,从而启动的过程出现ORA-600 16703错误可以很好解释.由于数据库启动成功和报错几乎同时进行,以及上面看到的tab$记录不存在的情况,初步怀疑是有startup触发器清空tab$表记录
工具分析触发器表trigger$
startup-trigger
这里果然看到一个after startup on database的触发器,名字为DBMS_SUPPORT_DBMONITOR,而它调用的是DBMS_SUPPORT_DBMONITORP存储.

工具分析pl/sql表source$
DBMS_SUPPORT_DBMONITOR


对wraped加密的程序进行解密
DBMS_SUPPORT_DBMONITOR-unwraped

这里可以明确的看到DBMS_SUPPORT_DBMONITORP存储过程备份tab$表到orachk中然后delete tab$表,现在已经明确是由于DBMS_SUPPORT_DBMONITOR触发器在数据库重启之后开始执行调用DBMS_SUPPORT_DBMONITORP程序,如果判断数据库创建时间大于等于300天,便干掉tab$表,实现数据库破坏.

查找DBMS_SUPPORT_DBMONITOR等程序源头
分析相关程序创建时间,通过obj$表的ctime和name来判断
DBMS_SUPPORT_DBMONITOR-ctime
bootstrap-ctime

这里可以看出来DBMS_SUPPORT_DBMONITOR和DBMS_SUPPORT_DBMONITORP的创建时间基本上和数据库核心对象的创建时间相差无几,我们可以大概排除掉,是由于pl sql等工具连接数据库导致该发问题(类似:plsql dev引起的数据库被黑勒索比特币实现原理分析和解决方案),很可能是在dbca创建库的过程中就已经带有了DBMS_SUPPORT_DBMONITOR等程序,如果这样那很可能是由于数据库的安装介质被破坏导致该问题.

分析DBMS_SUPPORT_DBMONITOR来源
prvtsupp
20170711001626

这里已经很清晰,由于prvtsupp.plb被人注入了恶意脚本从而使得数据库被创建了DBMS_SUPPORT_DBMONITOR的触发器和DBMS_SUPPORT_DBMONITORP的存储过程,从而实现了对数据库的而易破坏.
md5

通过md5码对比,可以确定是有人对软件的安装介质进行了破坏,从而实现了恶意代码的注入,从而实现了数据库300天之后重启之后无法正常启动而是出现类似ORA-00600: internal error code, arguments: [16703], [1403], [20], [], [], [], [], [], [], [], [], []的错误.提醒各位一定要从官方途径下载oracle安装介质,如果是从其他互联网途径下载一定要验证md5,确保文件没有被人恶意篡改,造成无可挽回的损坏.如果真的不幸遇到这类问题,请保护现场联系我们
Tel:13429648788    Q Q:107644445QQ咨询惜分飞    E-Mail:dba@xifenfei.com
可以实现业务数据0丢失恢复,最大限度抢救您的数据

asm disk 分区表损坏恢复

有朋友反馈,他们做了xx存储的双活之后,重启主机发现gi无法正常启动,分析发现所有该存储的磁盘分区信息丢失,导致asmlib无法发现磁盘(使用分区做asm disk)
类似如下错误(磁盘分区丢失)

--fdisk -l 显示部分结果
Disk /dev/mapper/datahds1: 1099.5 GB, 1099511627776 bytes
255 heads, 63 sectors/track, 133674 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

--ls -l /dev/mapper/   显示结果无分区信息
lrwxrwxrwx 1 root root      7 May  6 03:44 datahds1 -> ../dm-1
lrwxrwxrwx 1 root root      7 May  6 03:26 datahds2 -> ../dm-3
lrwxrwxrwx 1 root root      7 May  6 03:26 datahds3 -> ../dm-8
lrwxrwxrwx 1 root root      7 May  6 03:26 ocrhds1 -> ../dm-0
lrwxrwxrwx 1 root root      7 May  6 03:26 ocrhds2 -> ../dm-2
lrwxrwxrwx 1 root root      7 May  6 03:26 ocrhds3 -> ../dm-4

asm日志显示

SUCCESS: diskgroup DATADG was mounted
NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 3
SUCCESS: diskgroup OCRHDS was mounted
ORA-15032: not all alterations performed
ORA-15017: diskgroup "DATA" cannot be mounted
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "DATA"

分析系统日志

May  6 02:23:27 db2 kernel: sdb: unknown partition table
May  6 02:23:27 db2 kernel: sde: unknown partition table
May  6 02:23:27 db2 kernel: sdc: unknown partition table
May  6 02:23:27 db2 kernel: sdf: unknown partition table
May  6 02:23:27 db2 kernel: sdd: unknown partition table
May  6 02:23:27 db2 kernel: sdj:Dev sdj: unable to read RDB block 0
May  6 02:23:27 db2 kernel: unable to read partition table
May  6 02:23:27 db2 kernel: sdi: sdi1
May  6 02:23:27 db2 kernel: sdk: sdk1
May  6 02:23:27 db2 kernel: sdg: unknown partition table
May  6 02:23:27 db2 kernel: sdl: sdl1
May  6 02:23:27 db2 kernel: sdm:Dev sdm: unable to read RDB block 0
May  6 02:23:27 db2 kernel: unable to read partition table
May  6 02:23:27 db2 kernel: sdo:Dev sdo: unable to read RDB block 0
May  6 02:23:27 db2 kernel: unable to read partition table
May  6 02:23:27 db2 kernel: sdn:Dev sdn: unable to read RDB block 0
May  6 02:23:27 db2 kernel: unable to read partition table
May  6 02:23:27 db2 kernel: sdp:Dev sdp: unable to read RDB block 0
May  6 02:23:27 db2 kernel: unable to read partition table
May  6 02:23:27 db2 kernel: sds:Dev sds: unable to read RDB block 0
May  6 02:23:27 db2 kernel: unable to read partition table
May  6 02:23:27 db2 kernel: sdh:
May  6 02:23:27 db2 kernel: sdt: sdt1
May  6 02:23:27 db2 kernel: sdv:Dev sdv: unable to read RDB block 0
May  6 02:23:27 db2 kernel: unable to read partition table
May  6 02:23:27 db2 kernel: sdq:Dev sdq: unable to read RDB block 0
May  6 02:23:27 db2 kernel: unable to read partition table
May  6 02:23:27 db2 kernel: sd 1:0:1:9: [sdr] Very big device. Trying to use READ CAPACITY(16).
May  6 02:23:27 db2 kernel: sdr:Dev sdr: unable to read RDB block 0
May  6 02:23:27 db2 kernel: unable to read partition table
May  6 02:23:27 db2 kernel: sd 2:0:0:9: [sdab] Very big device. Trying to use READ CAPACITY(16).
May  6 02:23:27 db2 kernel: sdab: unknown partition table
May  6 02:23:27 db2 kernel: sdac: unknown partition table
May  6 02:23:27 db2 kernel: sdw: sdw1
May  6 02:23:27 db2 kernel: sdu:Dev sdu: unable to read RDB block 0
May  6 02:23:27 db2 kernel: unable to read partition table
May  6 02:23:27 db2 kernel: sdx: sdx1
May  6 02:23:27 db2 kernel: sdy: sdy1
May  6 02:23:27 db2 kernel: sdaa: sdaa1
May  6 02:23:27 db2 kernel: sdz: sdz1
May  6 02:23:27 db2 kernel: sdae: unknown partition table
May  6 02:23:27 db2 kernel: sdaf: unknown partition table
May  6 02:23:27 db2 kernel: sdag: unknown partition table
May  6 02:23:27 db2 kernel: sdai:
May  6 02:23:27 db2 kernel: sdah: unknown partition table
May  6 02:23:27 db2 kernel: sdad: unknown partition table
May  6 02:23:28 db2 mcelog: failed to prefill DIMM database from DMI data

这里错误比较明显unknown partition table,磁盘的分区信息损坏.使用fdisk无法发现分区

partprobe也无效

[root@db2 oracle]# partprobe /dev/mapper/ocrhds3
[root@db2 oracle]# 
[root@db2 oracle]# ls -l /dev/mapper/ocrhds3*
lrwxrwxrwx 1 root root 7 May  6 07:30 /dev/mapper/ocrhds3 -> ../dm-4

从尚需信息看,磁盘的分区表信息应该已经损坏,现在能够做的,就是希望运气好,磁盘的分区的实际数据没有损坏

分析磁盘实际分区数据

[root@db2 ~]$ dd if=/dev/mapper/datahds1 of=/tmp/datahds1.dd bs=1024k count=50
[root@db2 ~]$ dd if=/tmp/datahds1.dd of=/tmp/xff01.dd  bs=3225 skip=1
[grid@db2 ~]$ kfed read /tmp/xff01.dd |more
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfbh.datfmt:                          1 ; 0x003: 0x01
kfbh.block.blk:                       0 ; 0x004: blk=0
kfbh.block.obj:              2147483648 ; 0x008: disk=0
kfbh.check:                  3110278718 ; 0x00c: 0xb963163e
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
kfdhdb.driver.provstr: ORCLDISKHDSDATA1 ; 0x000: length=16
kfdhdb.driver.reserved[0]:   1146307656 ; 0x008: 0x44534448
kfdhdb.driver.reserved[1]:    826364993 ; 0x00c: 0x31415441
kfdhdb.driver.reserved[2]:            0 ; 0x010: 0x00000000
kfdhdb.driver.reserved[3]:            0 ; 0x014: 0x00000000
kfdhdb.driver.reserved[4]:            0 ; 0x018: 0x00000000
kfdhdb.driver.reserved[5]:            0 ; 0x01c: 0x00000000
kfdhdb.compat:                186646528 ; 0x020: 0x0b200000
kfdhdb.dsknum:                        0 ; 0x024: 0x0000
kfdhdb.grptyp:                        1 ; 0x026: KFDGTP_EXTERNAL
kfdhdb.hdrsts:                        3 ; 0x027: KFDHDR_MEMBER
kfdhdb.dskname:             DATADG_0000 ; 0x028: length=11
kfdhdb.grpname:                  DATADG ; 0x048: length=6
kfdhdb.fgname:              DATADG_0000 ; 0x068: length=11
kfdhdb.capname:                         ; 0x088: length=0
kfdhdb.crestmp.hi:             33050696 ; 0x0a8: HOUR=0x8 DAYS=0x2 MNTH=0x4 YEAR=0x7e1
kfdhdb.crestmp.lo:           3813740544 ; 0x0ac: USEC=0x0 MSEC=0x44 SECS=0x35 MINS=0x38
kfdhdb.mntstmp.hi:             33050701 ; 0x0b0: HOUR=0xd DAYS=0x2 MNTH=0x4 YEAR=0x7e1
kfdhdb.mntstmp.lo:            411385856 ; 0x0b4: USEC=0x0 MSEC=0x150 SECS=0x8 MINS=0x6

通过上述分析,我们可以初步判断,分区磁盘的信息很可能是好的(因为asm disk header是好的,根据一般的规则从前往后覆盖,既然header是好的,后面的block被覆盖的概率非常小)

通过准备新磁盘直接把磁盘分区dd到新设备上

dd if=/dev/mapper/ocrhds1 of=/dev/mapper/ocrhdsnew1 skip=1 bs=3225
dd if=/dev/mapper/ocrhds2 of=/dev/mapper/ocrhdsnew2 skip=1 bs=3225
dd if=/dev/mapper/ocrhds3 of=/dev/mapper/ocrhdsnew3 skip=1 bs=3225
dd if=/dev/mapper/datahds1 of=/dev/mapper/datahdsnew1 skip=1 bs=3225
dd if=/dev/mapper/datahds2 of=/dev/mapper/datahdsnew2 skip=1 bs=3225
dd if=/dev/mapper/datahds3 of=/dev/mapper/datahdsnew3 skip=1 bs=3225

asmlib重新扫描磁盘

[root@db1 disks]# oracleasm scandisks
Reloading disk partitions: done
Cleaning any stale ASM disks...
Scanning system for ASM disks...
Instantiating disk "HDSOCR3"
Instantiating disk "HDSDATA2"
Instantiating disk "HDSDATA1"
Instantiating disk "HDSDATA3"
Instantiating disk "HDSOCR1"
Instantiating disk "HDSOCR2"
[root@db1 disks]# ls -ltr
total 0
brw-rw---- 1 grid asmadmin  8, 160 May  6 13:49 HDSOCR3
brw-rw---- 1 grid asmadmin  8, 192 May  6 13:49 HDSDATA2
brw-rw---- 1 grid asmadmin  8, 176 May  6 13:49 HDSDATA1
brw-rw---- 1 grid asmadmin  8, 208 May  6 13:49 HDSDATA3
brw-rw---- 1 grid asmadmin  8, 128 May  6 13:49 HDSOCR1
brw-rw---- 1 grid asmadmin  8, 144 May  6 13:49 HDSOCR2

kfed验证拷贝的分区

[root@db2 tmp]# /oracle/app/11.2.0/grid_1/bin/kfed read /dev/oracleasm/disks/HDSDATA1
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfbh.datfmt:                          1 ; 0x003: 0x01
kfbh.block.blk:                       0 ; 0x004: blk=0
kfbh.block.obj:              2147483648 ; 0x008: disk=0
kfbh.check:                  3110278718 ; 0x00c: 0xb963163e
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
kfdhdb.driver.provstr: ORCLDISKHDSDATA1 ; 0x000: length=16
kfdhdb.driver.reserved[0]:   1146307656 ; 0x008: 0x44534448
kfdhdb.driver.reserved[1]:    826364993 ; 0x00c: 0x31415441
kfdhdb.driver.reserved[2]:            0 ; 0x010: 0x00000000
kfdhdb.driver.reserved[3]:            0 ; 0x014: 0x00000000
kfdhdb.driver.reserved[4]:            0 ; 0x018: 0x00000000
kfdhdb.driver.reserved[5]:            0 ; 0x01c: 0x00000000
kfdhdb.compat:                186646528 ; 0x020: 0x0b200000
kfdhdb.dsknum:                        0 ; 0x024: 0x0000
kfdhdb.grptyp:                        1 ; 0x026: KFDGTP_EXTERNAL
kfdhdb.hdrsts:                        3 ; 0x027: KFDHDR_MEMBER
kfdhdb.dskname:             DATADG_0000 ; 0x028: length=11
kfdhdb.grpname:                  DATADG ; 0x048: length=6
kfdhdb.fgname:              DATADG_0000 ; 0x068: length=11
kfdhdb.capname:                         ; 0x088: length=0

asm和数据库启动正常

[grid@db2 ~]$ asmcmd
ASMCMD> lsdg
State    Type    Rebal  Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  EXTERN  N         512   4096  1048576   3145710  2378034                0         2378034              0             N  DATADG/
MOUNTED  NORMAL  N         512   4096  1048576     15342    14416             5114            4651              0             Y  OCRHDS/
ASMCMD> 

[oracle@db2 ~]$ sqlplus  / as sysdba

SQL*Plus: Release 11.2.0.4.0 Production on Sat May 6 13:54:21 2017

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> startup
ORACLE instance started.

Total System Global Area 3.6077E+10 bytes
Fixed Size                  2260648 bytes
Variable Size            7247757656 bytes
Database Buffers         2.8723E+10 bytes
Redo Buffers              104382464 bytes
Database mounted.
Database opened.
SQL> 

asm-disk-partition-lost-recovery


通过上述恢复,实现asm磁盘分区丢失数据0丢失