错误操作导致数据文件丢失—alter database create datafile

alter database create datafile导致原始数据文件丢失
有客户一个小系统找我们恢复,通过Oracle Database Recovery Check 检测之后我们红框部分发现一奇怪现象
ctl
create-datafile-ctl
create-datafile


1.文件头fuzzy为NO,不符合数据库异常crash常识,也和其他文件该状态不匹配
2.文件的创建时间,scn均和checkpoint时间,scn一致(也就是说该文件是创建之后就checkpoint,然后就没有其他操作)
3.文件开始应用的归档为5,110和其他数据文件要求的3115相差甚远
结合这些情况,怀疑该文件被重新创建,查找alert日志果如发现如下信息

两个文件通过create datafile创建之后,然后offline操作.通过alert日志核查file 6和8的创建时间和seq信息
1
Fri Jan 16 15:03:36 2015
Thread 1 advanced to log sequence 5 (LGWR switch)
  Current log# 2 seq# 5 mem# 0: D:\APP\ADMINISTRATOR\ORADATA\MYCHOICE\REDO02.LOG
Fri Jan 16 15:13:19 2015
CREATE BIGFILE TABLESPACE "FBAUDIT"
DATAFILE 'E:\ZDSoft\ZDFood\databak\FBAUDIT' SIZE 10M
AUTOEXTEND ON NEXT 10M
MAXSIZE UNLIMITED LOGGING EXTENT MANAGEMENT LOCAL SEGMENT SPACE MANAGEMENT AUTO
Completed: CREATE BIGFILE TABLESPACE "FBAUDIT"
DATAFILE 'E:\ZDSoft\ZDFood\databak\FBAUDIT' SIZE 10M
AUTOEXTEND ON NEXT 10M
MAXSIZE UNLIMITED LOGGING EXTENT MANAGEMENT LOCAL SEGMENT SPACE MANAGEMENT AUTO

Sat Feb 07 15:03:46 2015
Thread 1 advanced to log sequence 110 (LGWR switch)
  Current log# 2 seq# 110 mem# 0: D:\APP\ADMINISTRATOR\ORADATA\MYCHOICE\REDO02.LOG
Sat Feb 07 15:20:41 2015
CREATE BIGFILE TABLESPACE "CARD"
DATAFILE 'E:\ZDSoft\ZDCARD\databak\CARD1' SIZE 10M
AUTOEXTEND ON NEXT 10M
MAXSIZE UNLIMITED LOGGING EXTENT MANAGEMENT LOCAL SEGMENT SPACE MANAGEMENT AUTO
Completed: CREATE BIGFILE TABLESPACE "CARD"
DATAFILE 'E:\ZDSoft\ZDCARD\databak\CARD1' SIZE 10M
AUTOEXTEND ON NEXT 10M
MAXSIZE UNLIMITED LOGGING EXTENT MANAGEMENT LOCAL SEGMENT SPACE MANAGEMENT AUTO

通过结合alert日志判断,我们可以确定,当前我们Oracle Database Recovery Check检查出来的情况,是由于执行了create datafile命令导致故障前的文件丢失,创建了一个新的数据文件,而由于该库为非归档模式,导致该文件数据无法恢复(备注:不光是非归档模式不行,就算是归档模式,也需要从文件创建到现在的所有归档才行).在大部分生产系统,我相信不可能有这么的归档,因为在执行alter database create datafile命令之时一定要慎重,评估确定是否丢失归档,否则可能导致不可理的损坏).
客户意识到了悲剧的发生,但是希望我们帮忙恢复一张核心数据,用户的余额信息.

对于alter database create datafile丢失文件恢复
通过工具扫描原始文件相关的记录(由于写入大量数据,无法完整恢复,只能通过工具扫描,恢复部分数据)[asm disk header 彻底损坏恢复]
scan-datafile


因为原库虽然丢失了这两个文件,但是已经open成功,通过相关的data obj结合这个里面扫描到的文件,抽取出来需要的对象的block,然后对block里面的数据进行读取恢复出来相关数据.在这里我们还有一个难点就是由于这两个文件都是bigfile,给恢复过程增加了难度
recover-data


至此我们已经实现了对于alter database create datafile导致文件丢失的核心数据的恢复.尽可能的减小的客户的损坏.这种恢复是取决运气,数据在磁盘上的block没有被覆盖.如果覆盖了基本无望.
如果需要数据库恢复,请联系我们(ORACLE数据库恢复技术支持),将为您提供专业数据库技术支持:
Phone:17813235971    Q Q:107644445    E-Mail:dba@xifenfei.com
再次提醒
1.在数据库出现故障之时,尽可能保护现场,做操作之前要之后后果别百度了就不分青红皂白的直接操作,导致不可逆的破坏,数据可能永久性丢失[Oracle异常恢复前备份保护现场建议—FileSystem环境|Oracle异常恢复前备份保护现场建议—ASM环境
2.使用alter database create datafile命令之前需要慎重,评估是否所有的归档都存在

DUL: Error: No compatibility segments found

最近有朋友误操作引起了非常大的事故,差点吃了官司.在做数据库迁移的时候,远程误操作删除了原库的system等几个数据库初始安装的文件,而且该磁盘空间使用率非常高,还有少量写入.最后结果比较悲剧,通过文件系统层面无法直接恢复出来数据文件,而且该库无任何有效备份,又没有表名,列名等信息,无奈之下只能通过底层io block重组来恢复数据文件,可是悲剧又一次发生,这个磁盘上以前也有一份system等文件,最后经过多方重组恢复出来一份相对理想的数据文件.但是第三方公司通过这样重组出来的数据文件和未被删除的业务文件恢复出来的数据大量有问题,依旧需要我们进一步分析恢复处理.这篇文章主要描述了dul在无法加载bootstrap命令之后通过一些方法依旧可以正常使用unload table/user 等命令实现数据尽可能恢复.你要知道几百张表没有表名/列名要把他们区分出来那是什么样的工作量……
在dul中配置system文件

D:\xifenfei\system01.dbf

D:\TEMP\recover\dul\bak>dul

Data UnLoader: 11.2.0.0.4 - Internal Only - on Wed Sep 28 17:01:56 2016
with 64-bit io functions

Copyright (c) 1994 2016 Bernard van Duijnen All rights reserved.

 Strictly Oracle Internal Use Only


DUL> show datafiles;
Sorry, no valid data files found in control.txt

使用默认的dul中数据文件配置方法,让dul自己发现数据文件方法不可行

随意表空间号和文件号dul识别

0 0 D:\xifenfei\system01.dbf

D:\TEMP\recover\dul\bak>dul

Data UnLoader: 11.2.0.0.4 - Internal Only - on Wed Sep 28 17:00:27 2016
with 64-bit io functions

Copyright (c) 1994 2016 Bernard van Duijnen All rights reserved.

 Strictly Oracle Internal Use Only



DUL: Warning: File Type mismatch 1 != 8
DUL: Warning: D:\xifenfei\system01.dbf Header tablespace number 3
!= 0
DUL: Warning: D:\xifenfei\system01.dbf Header relative file number 1 != 0
Found db_id = 2948357999
Found db_name = XIFENFEI
DUL: Warning: Found mismatch while checking file D:\xifenfei\system01.dbf
DUL: Warning: DUL osd_parameter or control.dul configuration error
DUL: Warning: Given file number(0) in control file does not match file# in dba(1)

通过这个识别我们可以知道system的表空间号为3,文件号为1

再次配置system让dul识别

3 1 D:\xifenfei\system01.dbf
D:\TEMP\recover\dul\bak>dul

Data UnLoader: 11.2.0.0.4 - Internal Only - on Wed Sep 28 17:03:46 2016
with 64-bit io functions

Copyright (c) 1994 2016 Bernard van Duijnen All rights reserved.

 Strictly Oracle Internal Use Only



DUL: Warning: File Type mismatch 1 != 8
Found db_id = 2948357999
Found db_name = XIFENFEI
DUL> show datafiles;
ts# rf# start   blocks offs open  err file name
  3   1     0   320257    0    1    0 D:\xifenfei\system01.dbf

dul正常识别出来system文件但是根据经验我们知道tablespace 3肯定是有问题的,因此后续操作依旧问题非常多

尝试dul bootstrap恢复失败

DUL> bootstrap;
Scanning SYSTEM tablespace to locate compatibility segment ...
DUL: Warning: No files found for tablespace 0
Reading EXT.dat 0 entries loaded and sorted 0 entries
Reading SEG.dat 0 entries loaded
Reading COMPATSEG.dat 0 entries loaded
Reading SCANNEDLOBPAGE.dat 0 entries loaded and sorted 0 entries
DUL: Error: No compatibility segments found

由于表空间号错误,dul无法加载到bootstrap$表,另外根据bbed分析恢复出来的system文件中bootstrap$这部分丢失

尝试人工加载dul所需数据字典

DUL> unload table OBJ$ 
  2     storage ( tablespace 3 segobjno 18 file 1 block 240);
. unloading table                      OBJ$   79074 rows unloaded

DUL> unload table TAB$( OBJ# number, DATAOBJ# number,
  2      cluster  C_OBJ#(OBJ#)
  3      storage ( tablespace 3 segobjno 2 tabno 1 file 1 block 144);
. unloading table                      TAB$    4482 rows unloaded

DUL> unload table COL$ ( OBJ# number, COL# number , SEGCOL# number,
  2      cluster C_OBJ#(OBJ#)
  3      storage ( tablespace 3 segobjno 2 tabno 5 file 1 block 144);
. unloading table                      COL$  114491 rows unloaded

DUL> unload table USER$
  2      cluster C_USER#(USER#)
  3      storage ( tablespace 3 segobjno 10 tabno 1 file 1 block 208);
. unloading table                     USER$      96 rows unloaded

----其他表省略,根据需要的依次处理

尝试使用dul恢复数据

DUL> desc portal_emr.BASEELEMENT;
Table PORTAL_EMR.BASEELEMENT
obj#= 87200, dataobj#= 87200, ts#= 9, file#= 7, block#=458
      tab#= 0, segcols= 8, clucols= 0
Column information:
icol# 01 segcol# 01       BENAME len   30 type  1 VARCHAR2 cs 852(ZHS16GBK)
icol# 02 segcol# 02     TYPENAME len   30 type  1 VARCHAR2 cs 852(ZHS16GBK)
icol# 03 segcol# 03     TYPETYPE len   22 type  2 NUMBER(0,0)
icol# 04 segcol# 04    BEXMLTEXT len 4000 type  1 VARCHAR2 cs 852(ZHS16GBK)
icol# 05 segcol# 05 DEPTGROUPCODE len   30 type  1 VARCHAR2 cs 852(ZHS16GBK)
icol# 06 segcol# 06     ISCOMMON len   22 type  2 NUMBER(0,0)
icol# 07 segcol# 07      BESPELL len   15 type  1 VARCHAR2 cs 852(ZHS16GBK)
icol# 08 segcol# 08     ELEMTYPE len   22 type  2 NUMBER(0)

DUL> show datafiles;
ts# rf# start   blocks offs open  err file name
  3   1     0   320257    0    1    0 D:\xifenfei\system01.dbf
  9   7     0  4170425    0    1    0 D:\BaiduYunDownload\PORTAL_EMR

DUL> unload table portal_emr.BASEELEMENT;
. unloading table               BASEELEMENT    1913 rows unloaded

这里描述了在dul无法加载bootstrap命令之后,通过人工加载数据字典实现正常的unload table/user功能,丢弃了一般处理思路中的只能通过scan 然后unload没有表名,列名的处理方法,从而实现了恢复的最大化.
我们对原厂官方oracle dual工具有深入研究,如果在oracle dul恢复方面有搞不定的问题.
请联系我们,提供专业ORACLE数据库恢复技术支持
Phone:17813235971    Q Q:107644445QQ咨询惜分飞    E-Mail:dba@xifenfei.com

kfed read asm disk header

kfed读取数据磁盘头主要参数解释说明

   % kfed read /dev/raw/raw1
     
   kfbh.endian:                          1 ; 0x000: 0x01
   kfbh.hard:                          130 ; 0x001: 0x82
   kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
   kfbh.datfmt:                          1 ; 0x003: 0x01
   kfbh.block.blk:                       0 ; 0x004: T=0 NUMB=0x0
   kfbh.block.obj:              2147483648 ; 0x008: TYPE=0x8 NUMB=0x0
   kfbh.check:                  2932902794 ; 0x00c: 0xaed08b8a
   kfbh.fcn.base:                        0 ; 0x010: 0x00000000
   kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
   kfbh.spare1:                          0 ; 0x018: 0x00000000
   kfbh.spare2:                          0 ; 0x01c: 0x00000000
   kfdhdb.driver.provstr:         ORCLDISK ; 0x000: length=8
   kfdhdb.driver.reserved[0]:            0 ; 0x008: 0x00000000
   kfdhdb.driver.reserved[1]:            0 ; 0x00c: 0x00000000
   kfdhdb.driver.reserved[2]:            0 ; 0x010: 0x00000000
   kfdhdb.driver.reserved[3]:            0 ; 0x014: 0x00000000
   kfdhdb.driver.reserved[4]:            0 ; 0x018: 0x00000000
   kfdhdb.driver.reserved[5]:            0 ; 0x01c: 0x00000000
   kfdhdb.compat:                168820736 ; 0x020: 0x0a100000
   kfdhdb.dsknum:                        0 ; 0x024: 0x0000
   kfdhdb.grptyp:                        1 ; 0x026: KFDGTP_EXTERNAL
   kfdhdb.hdrsts:                        3 ; 0x027: KFDHDR_MEMBER
   kfdhdb.dskname:              ASM01_0000 ; 0x028: length=10
   kfdhdb.grpname:                   ASM01 ; 0x048: length=5
   kfdhdb.fgname:               ASM01_0000 ; 0x068: length=10
   kfdhdb.capname:                         ; 0x088: length=0
   kfdhdb.crestmp.hi:             32837774 ; 0x0a8: HOUR=0xe DAYS=0x4 MNTH=0x4 YEAR=0x7d4
   kfdhdb.crestmp.lo:           1555722240 ; 0x0ac: USEC=0x0 MSEC=0x29c SECS=0xb MINS=0x17
   kfdhdb.mntstmp.hi:             32837774 ; 0x0b0: HOUR=0xe DAYS=0x4 MNTH=0x4 YEAR=0x7d4
   kfdhdb.mntstmp.lo:           1563864064 ; 0x0b4: USEC=0x0 MSEC=0x1ab SECS=0x13 MINS=0x17
   kfdhdb.secsize:                     512 ; 0x0b8: 0x0200
   kfdhdb.blksize:                    4096 ; 0x0ba: 0x1000
   kfdhdb.ausize:                  1048576 ; 0x0bc: 0x00100000
   kfdhdb.mfact:                    113792 ; 0x0c0: 0x0001bc80
   kfdhdb.dsksize:                    9075 ; 0x0c4: 0x00002373
   kfdhdb.pmcnt:                         2 ; 0x0c8: 0x00000002
   kfdhdb.fstlocn:                       1 ; 0x0cc: 0x00000001
   kfdhdb.altlocn:                       2 ; 0x0d0: 0x00000002
   kfdhdb.f1b1locn:                      2 ; 0x0d4: 0x00000002
   kfdhdb.redomirrors[0]:                0 ; 0x0d8: 0x0000
   kfdhdb.redomirrors[1]:                0 ; 0x0da: 0x0000
   kfdhdb.redomirrors[2]:                0 ; 0x0dc: 0x0000
   kfdhdb.redomirrors[3]:                0 ; 0x0de: 0x0000
   kfdhdb.ub4spare[0]:                   0 ; 0x0e0: 0x00000000
   ...
   kfdhdb.ub4spare[60]:                  0 ; 0x1d0: 0x00000000
   kfdhdb.acdb.aba.seq:                  0 ; 0x1d4: 0x00000000
   kfdhdb.acdb.aba.blk:                  0 ; 0x1d8: 0x00000000
   kfdhdb.acdb.ents:                     0 ; 0x1dc: 0x0000
   kfdhdb.acdb.ub2spare:                 0 ; 0x1de: 0x0000
 
  Breakdown:

   kfbh.endian  
     kf3.h   /* endianness of writer */ 
       Little endian = 1  
       Big endian = 0
  
   kfbh.hard    
     kf3.h   /* H.A.R.D. magic # and block size */  

  kfbh.type
    kf3.h    /* metadata block type               */

  kfbh.datfmt
    kf3.h   /* metadata block data format        */

  kfbh.block
    kf3.h   /* block location of this block      */
      blk -- Disk header should have T=0 and NUMB=0x0
      obj -- Disk header should have TYPE=0x8 NUMB=<disknumber>
    blk and obj values are derived from a series of macros in kf3.h.  See 
    "KFBL Macros" in kf3.h for more information.

  kfbh.check
    kf3.h   /* check value to verify consistency */

  kfbh.fcn
    kf3.h   /* change number of last change      */
     
  kfdhdb.driver
    kf3.h   /* OSMLIB driver reserved block  */
       If no driver is defined "ORCLDISK" is used.  
      
  kfdhdb.compat
    kf3.h   /* Comaptible software version   */
      example: 0x0a100000 
      You get:      
          a=10 1=1 so 10.1.0.0.0

  kfdhdb.dsknum
    kf3.h   /* OSM disk number               *
      This is the disk number.  The first disk being "0".  There can be up to
      ub2 disks in a diskgroup.  This allows for 65336 disks 0 through 65335.

  kfdhdb.grptyp
    kf3.h   /* Disk group type               */

  kfdhdb.hdrsts
    kf3.h   /* Disk header status            */
      This is what is used to determine if a disk is available or not to 
      the diskgroup.  0x03 is the correct value for a valid status.

  kfdhdb.dskname   /* OSM disk name       */
  kfdhdb.grpname   /* OSM disk group name */
  kfdhdb.fgname    /* Failure group name  */
  kfdhdb.capname   /* Capacity grp, unused*/
    kf3.h 

  kfdhdb.crestmp   /* Creation timestamp            */
  kfdhdb.mntstmp   /* Mount timestamp               */
    kf3.h To derive the hi and low time`from an unformated dump use the 
    "KFTS Macros" in kf3.h.

  kfdhdb.secsize
    kf3.h   /* Disk sector size (bytes)      */
      This is the physical sector size of the disk in bytes. All I/O's to the
      disk are described in physical sectors. This must be a power of 2. An
      ideal value would be 4096, but most disks are formatted with 512 byte
      sectors. (from asmlib.h)

  kfdhdb.blksize
    kf3.h   /* Metadata block (bytes)        */
       
  kfdhdb.ausize
    kf3.h   /* Allocation Unit (bytes)       */

  kfdhdb.mfact 
    kf3.h   /* Stride between phys addr AUs  */
     
  kfdhdb.dsksize
    kf3.h   /* Disk size in AUs              */
      Mulitply by AUs to get actual size of disk when added.  
         
  kfdhdb.pmcnt
    kf3.h   /* Permanent phys addressed AUs  */
      Number of physically addressed allocation units.

  kfdhdb.fstlocn
    kf3.h   /* First FreeSpace table blk num */
      Used to find freespace.

  kfdhdb.altlocn
    kf3.h   /* First Alocation table blk num */
      Used to find alocated space.

  kfdhdb.f1b1locn
    kf3.h   /* File Directory blk 1 AU num   */
      Beginging for file directory.

通过update _NEXT_OBJECT 实现obj$.obj#和obj$.dataobj#跳号

在一些特殊的情况下(比如ORA-00600 [15267],ORA-00600 [KKDLCOB-OBJN-EXISTS],Ora-600 [15260]),考虑需要把dba_objects中的object_id往前推进,这里通过试验的方法实现该功能
数据库版本信息

SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - Prod
PL/SQL Release 10.2.0.4.0 - Production
CORE    10.2.0.4.0      Production
TNS for Linux: Version 10.2.0.4.0 - Production
NLSRTL Version 10.2.0.4.0 - Production

分析obj和dataobj

SQL> select max(obj#),max(dataobj#) from obj$;

 MAX(OBJ#) MAX(DATAOBJ#)
---------- -------------
     51887         51907

SQL> select name from obj$ where obj#=51887;

NAME
------------------------------
T_DUL

SQL> select name from obj$ where dataobj#=51907;

NAME
------------------------------
_NEXT_OBJECT

SQL> select object_id,data_object_id from dba_objects where object_name='_NEXT_OBJECT';

no rows selected

为什么dba_objects中无_NEXT_OBJECT
因为dba_objects视图中跳过了_NEXT_OBJECT这条记录
_next_object


测试创建新表后obj和dataobj的变化

SQL>  create table t_xff as select * from dual;

Table created.

SQL> select max(obj#),max(dataobj#) from obj$;

 MAX(OBJ#) MAX(DATAOBJ#)
---------- -------------
     51898         51907

SQL> select name from obj$ where obj#=51898;

NAME
------------------------------
T_XFF

SQL> select max(object_id),max(data_object_id) from dba_objects where object_name='T_XFF';

MAX(OBJECT_ID) MAX(DATA_OBJECT_ID)
-------------- -------------------
         51898               51898

通过测试可以确定,obj发生增加,但是dataobj不一定增加(因为dataobj本身比obj大,如果出现obj>dataobj那属于异常情况)

测试数据库重启obj和dataobj是否会跳号

---正常重启数据库
SQL> SHUTDOWN IMMEDIATE;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> STARTUP
ORACLE instance started.

Total System Global Area  260046848 bytes
Fixed Size                  1266920 bytes
Variable Size              83888920 bytes
Database Buffers          171966464 bytes
Redo Buffers                2924544 bytes
Database mounted.
Database opened.
SQL> select max(obj#),max(dataobj#) from obj$;

 MAX(OBJ#) MAX(DATAOBJ#)
---------- -------------
     51898         51907

---强制重启数据库
SQL> shutdown abort
ORACLE instance shut down.
SQL> startup
ORACLE instance started.

Total System Global Area  260046848 bytes
Fixed Size                  1266920 bytes
Variable Size              83888920 bytes
Database Buffers          171966464 bytes
Redo Buffers                2924544 bytes
Database mounted.
Database opened.
SQL> select max(obj#),max(dataobj#) from obj$;

 MAX(OBJ#) MAX(DATAOBJ#)
---------- -------------
     51898         51907

通过这个证明obj和dataobj没有因为数据库重启而发生改变

实现obj跳号

SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup restrict
ORACLE instance started.

Total System Global Area  260046848 bytes
Fixed Size                  1266920 bytes
Variable Size              83888920 bytes
Database Buffers          171966464 bytes
Redo Buffers                2924544 bytes
Database mounted.
Database opened.
SQL>  update obj$ set dataobj#=1000000 where name='_NEXT_OBJECT';  

1 row updated.

SQL> commit;

Commit complete.

SQL> shutdown abort;
ORACLE instance shut down.
SQL> startup
ORACLE instance started.

Total System Global Area  260046848 bytes
Fixed Size                  1266920 bytes
Variable Size              83888920 bytes
Database Buffers          171966464 bytes
Redo Buffers                2924544 bytes
Database mounted.
Database opened.

SQL> select max(obj#),max(dataobj#) from obj$;

 MAX(OBJ#) MAX(DATAOBJ#)
---------- -------------
     51898       1000000

SQL> create table t_www_xifenfei_com as select * from dual;

Table created.

SQL> select max(obj#),max(dataobj#) from obj$;

 MAX(OBJ#) MAX(DATAOBJ#)
---------- -------------
   1000000       1000010

SQL> select max(object_id),max(data_object_id) from dba_objects;

MAX(OBJECT_ID) MAX(DATA_OBJECT_ID)
-------------- -------------------
       1000000             1000000

SQL> select object_name from dba_objects where object_id=1000000;

OBJECT_NAME
----------------------------------------------------------------
T_WWW_XIFENFEI_COM

通过丢_NEXT_OBJECT的更新实现obj和dataobj跳号(变成100w)

ORA-15042—hp rdisk丢失

有老朋友找到我,说一个客户的数据库异常,问题是asm无法正常mount,提示是缺少两块磁盘.问我是否可以恢复.因为是内网环境,通过他那边发过来的零零散散的信息,大概分析如下
asm alert日志报错
ERROR: diskgroup DGROUP1 was not mounted

Fri Aug 12 16:03:12 EAT 2016
SQL> alter diskgroup DGROUP1 mount 
Fri Aug 12 16:03:12 EAT 2016
NOTE: cache registered group DGROUP1 number=1 incarn=0xf6781b5c
Fri Aug 12 16:03:12 EAT 2016
NOTE: Hbeat: instance first (grp 1)
Fri Aug 12 16:03:16 EAT 2016
NOTE: start heartbeating (grp 1)
Fri Aug 12 16:03:16 EAT 2016
NOTE: cache dismounting group 1/0xF6781B5C (DGROUP1) 
NOTE: dbwr not being msg'd to dismount
ERROR: diskgroup DGROUP1 was not mounted

前台尝试mount asm 磁盘组报错ORA-15042
ORA-15042


从这里可以明显的看出来asm 磁盘组无法正常mount,是由于缺少asm disk 15,16.如果想恢复asm,最好的方法就是找出来这两个磁盘.通过kfed对现在的磁盘进行分析,最后我们发现asm disk 14对应的磁盘为disk160,,asm disk 17对应的disk163,根据第一感觉很可能是disk161和disk161两块盘异常,让机房检查硬件无任何告警

OS层面分析
省略和本次结论无关的记录

ls -l /dev/rdisk
crw-rw----   1 oracle     dba         13 0x000070 Jan  1  2016 disk160
crw-rw----   1 oracle     dba         13 0x000073 Jan  1  2016 disk163

ls -l /dev/disk
brw-r-----   1 bin        sys          1 0x000070 Jan 13  2015 disk160
brw-r-----   1 bin        sys          1 0x000071 Jan 13  2015 disk161
brw-r-----   1 bin        sys          1 0x000072 Jan 13  2015 disk162
brw-r-----   1 bin        sys          1 0x000073 Jan 13  2015 disk163

这里我们发现在hp unix中/dev/disk下面磁盘都存在,但是/dev/rdisk下面丢失,通过ioscan相关命令继续分析

ioscan -fNnkC disk
disk    160  64000/0xfa00/0x70  esdisk   CLAIMED     DEVICE       HP      OPEN-V
                      /dev/disk/disk160   /dev/rdisk/disk160
disk    161  64000/0xfa00/0x71  esdisk   CLAIMED     DEVICE       HP      OPEN-V
                      /dev/disk/disk161
disk    162  64000/0xfa00/0x72  esdisk   CLAIMED     DEVICE       HP      OPEN-V
                      /dev/disk/disk162
disk    163  64000/0xfa00/0x73  esdisk   CLAIMED     DEVICE       HP      OPEN-V
                      /dev/disk/disk163   /dev/rdisk/disk163

这里我们基本上可以确定是/dev/rdisk下面的盘发生丢失.进一步分析,因为rdisk是聚合后的盘符,那我们分析聚合前的盘符是否正常

ioscan -m dsf
/dev/rdisk/disk160       /dev/rdsk/c29t12d4
                         /dev/rdsk/c28t12d4
/dev/rdisk/disk163       /dev/rdsk/c29t12d7
                         /dev/rdsk/c28t12d7

ls -l /dev/rdsk
crw-r-----   1 bin        sys        188 0x1dc000 Apr 22  2014 c29t12d0
crw-r-----   1 bin        sys        188 0x1dc100 Apr 22  2014 c29t12d1
crw-r-----   1 bin        sys        188 0x1dc300 Jan 13  2015 c29t12d3
crw-r-----   1 bin        sys        188 0x1dc400 Jan 13  2015 c29t12d4
crw-r-----   1 bin        sys        188 0x1dc500 Jan 13  2015 c29t12d5
crw-r-----   1 bin        sys        188 0x1dc600 Jan 13  2015 c29t12d6
crw-r-----   1 bin        sys        188 0x1dc700 Jan 13  2015 c29t12d7

crw-r-----   1 bin        sys        188 0x1cc100 Apr 22  2014 c28t12d1
crw-r-----   1 bin        sys        188 0x1cc300 Jan 13  2015 c28t12d3
crw-r-----   1 bin        sys        188 0x1cc400 Jan 13  2015 c28t12d4
crw-r-----   1 bin        sys        188 0x1cc500 Jan 13  2015 c28t12d5
crw-r-----   1 bin        sys        188 0x1cc600 Jan 13  2015 c28t12d6
crw-r-----   1 bin        sys        188 0x1cc700 Jan 13  2015 c28t12d7

通过这里我们基本上可以大概判断出来/dev/rdsk/c28t12d5,/dev/rdsk/c28t12d6,/dev/rdsk/c29t12d5,/dev/rdsk/c29t12d6就是我们需要找的/dev/rdisk/disk161和disk162的聚合之前的盘符.也就是说,现在我们判断只有/dev/rdisk下面的字符设备有问题,其他均正常.

通过系统命令修复异常

insf -e -H 64000/0xfa00/0x71
insf -e -H 64000/0xfa00/0x72

hp-asm-disk


现在已经可以正常看到/dev/rdisk/disk161和/dev/rdisk/disk162盘符,初步判断,os层面盘符已经恢复正常.修改磁盘权限和所属组

chmod 660 /dev/rdisk/disk161
chmod 660 /dev/rdisk/disk162
chown oracle:dba /dev/rdisk/disk161
chown oracle:dba /dev/rdisk/disk162

正常启动asm,mount磁盘组,open数据库
asm-mount


这次的恢复,主要是从操作系统层面判断解决问题,从而实现数据库完美恢复,数据0丢失.有类似恢复案例:分区无法识别导致asm diskgroup无法mount