KFED-00322 [kfbtTraverseBlock][Invalid OSM block type]

在oracle asm的使用过程中由于操作系统层面的错误操作导致asm disk 被破坏,这里列举了几种破坏之后的kfed报错现象(KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type])
asm mount 磁盘组报错(ORA-15040 ORA-15042)

SQL> alter diskgroup DATA mount;
alter diskgroup DATA mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "2" is missing from group number "2"

asm alert日志报错(ORA-15335 ORA-15066 ORA-15196等)

ORA-15335: ASM metadata corruption detected in disk group 'DATA'
ORA-15130: diskgroup "DATA" is being dismounted
ORA-15066: offlining disk "DATA_0002" in group "DATA" may result in a data loss
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [2147483651] [48] [0 != 1]

kfed查看磁盘头报错
文件文件头(不光是disk header的4k,可能是连续的几个au,甚至更多)可能彻底损坏,一般kfed 读取都会看到KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type]之类错误

[oracle@fcomtaep2 disks]$ kfed read ASMRECO03
kfbh.endian: 0 ; 0x000: 0x00
kfbh.hard: 0 ; 0x001: 0x00
kfbh.type: 0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt: 0 ; 0x003: 0x00
kfbh.block.blk: 0 ; 0x004: blk=0
kfbh.block.obj: 0 ; 0x008: file=0
kfbh.check: 0 ; 0x00c: 0x00000000
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
kfbh.spare2: 0 ; 0x01c: 0x00000000
7FC18D899400 00000000 00000000 00000000 00000000 [................]
  Repeat 27 times
7FC18D8995C0 FEEE0001 0001FFFF FFFF0000 00000FFF [................]
7FC18D8995D0 00000000 00000000 00000000 00000000 [................]
  Repeat 1 times
7FC18D8995F0 00000000 00000000 00000000 AA550000 [..............U.]
7FC18D899600 20494645 54524150 00010000 0000005C [EFI PART....\...] <==== **** Here ******
7FC18D899610 BD82BBB3 00000000 00000001 00000000 [................]
7FC18D899620 0FFFFFFF 00000000 00000022 00000000 [........".......]
7FC18D899630 0FFFFFDE 00000000 FD8857E5 42D7B49B [.........W.....B]
7FC18D899640 0901FA87 6B3DB5AA 00000002 00000000 [......=k........]
7FC18D899650 00000080 00000080 FE48EB77 00000000 [........w.H.....]
7FC18D899660 00000000 00000000 00000000 00000000 [................]
  Repeat 25 times
7FC18D899800 EBD0A0A2 4433B9E5 B668C087 C79926B7 [......3D..h..&..]
7FC18D899810 5381F6DF 4626F988 0E4F468D D78D3B28 [...S..&F.FO.(;..]
7FC18D899820 000007A1 00000000 0FFFF85F 00000000 [........_.......]
7FC18D899830 00000000 00000000 00720070 006D0069 [........p.r.i.m.]
7FC18D899840 00720061 00000079 00000000 00000000 [a.r.y...........]
7FC18D899850 00000000 00000000 00000000 00000000 [................]
 Repeat 186 times
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][0]

“EFI PART”是分区的元数据,一般是被分区导致asm disk损坏.

[ebernal@dbaasm new2]$ kfed read emcpowerl | head -25
kfbh.endian: 0 ; 0x000: 0x00
kfbh.hard: 0 ; 0x001: 0x00
kfbh.type: 0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt: 0 ; 0x003: 0x00
kfbh.block.blk: 0 ; 0x004: blk=0
kfbh.block.obj: 0 ; 0x008: file=0
kfbh.check: 0 ; 0x00c: 0x00000000
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
kfbh.spare2: 0 ; 0x01c: 0x00000000
2ABD671E9400 00000000 00000000 00000000 00000000 [................]
  Repeat 31 times
2ABD671E9600 4542414C 454E4F4C 00000001 00000000 [LABELONE........]
2ABD671E9610 E4E1DDB1 00000020 324D564C 31303020 [.... ...LVM2 001] <==== **** Here ******
2ABD671E9620 50365A77 71327874 34303156 4B4E6136 [wZ6Ptx2qV1046aNK]
2ABD671E9630 35395159 5147634C 487A5A38 63575A37 [YQ95LcGQ8ZzH7ZWc]
2ABD671E9640 00000000 00000019 00030000 00000000 [................]
2ABD671E9650 00000000 00000000 00000000 00000000 [................]
2ABD671E9660 00000000 00000000 00001000 00000000 [................]
2ABD671E9670 0002F000 00000000 00000000 00000000 [................]
2ABD671E9680 00000000 00000000 00000000 00000000 [................]
  Repeat 215 times
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][0]

“LVM2 001” 是逻辑卷的名字,该asm disk很可能被做为lvm管理而被破坏

[ebernal@dbaasm tars]$ kfed read rhdisk16
kfbh.endian:                         65 ; 0x000: 0x41
kfbh.hard:                           73 ; 0x001: 0x49
kfbh.type:                           88 ; 0x002: *** Unknown Enum ***
kfbh.datfmt:                         32 ; 0x003: 0x20
kfbh.block.blk:              1111709260 ; 0x004: blk=1111709260
kfbh.block.obj:              1634861056 ; 0x008: file=131072
kfbh.check:                         119 ; 0x00c: 0x00000077
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
2B6FE2AC1400 20584941 4243564C 61720000 00000077  [AIX LVCB..raw...] <==== **** Here ******
2B6FE2AC1410 00000000 00000000 00000000 00000000  [................]
2B6FE2AC1420 00000000 00000000 30300000 38306430  [..........000d08]
2B6FE2AC1430 30306131 34643030 30303030 31303030  [1a0000d400000001]
2B6FE2AC1440 61006533 766C6D73 7461645F 00003161  [3e.asmlv_data1..]
2B6FE2AC1450 00000000 00000000 00000000 00000000  [................]
        Repeat 2 times
2B6FE2AC1480 54000000 4D206575 20207961 31312037  [...Tue May  7 11]
2B6FE2AC1490 3A33343A 32203633 0A333130 00000000  [:43:36 2013.....]
2B6FE2AC14A0 65755400 79614D20 20372020 343A3131  [.Tue May  7 11:4]
2B6FE2AC14B0 34323A38 31303220 00000A33 44000000  [8:24 2013......D]
2B6FE2AC14C0 41313830 30303444 6D6D7900 02007900  [081AD400.ymm.y..]
2B6FE2AC14D0 0100E40C 656E6F4E 00000000 00000000  [....None........]
2B6FE2AC14E0 00000000 00000000 00000000 00000000  [................]
        Repeat 14 times
2B6FE2AC15D0 00000000 00000000 65310000 61653934  [..........1e49ea]
2B6FE2AC15E0 342E3862 00000000 00000000 00000000  [b8.4............]
2B6FE2AC15F0 00000000 00000000 00000000 00000000  [................]
  Repeat 224 times
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][88]

这里的“AIX LVCB..raw” 是AIX OS volume 的元数据库,也就是说,asm disk 被作为了aix os层面破坏

[oracle@dbep2 disks]$ kfed read asm-disk3
kfbh.endian: 0 ; 0x000: 0x00
kfbh.hard: 0 ; 0x001: 0x00
kfbh.type: 0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt: 0 ; 0x003: 0x00
kfbh.block.blk: 0 ; 0x004: blk=0
kfbh.block.obj: 0 ; 0x008: file=0
kfbh.check: 0 ; 0x00c: 0x00000000
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
kfbh.spare2: 0 ; 0x01c: 0x00000000
06000000 00000000 00000000 00000000 00000000 [................]
  Repeat 25 times
0602100 51e2b7f6 00ed4e00 00000000 00000001  [...Q.N..........]
0602120 00000000 0000000b 00000100 0000003c [............<...]
0602140 00000242 0000007b 5d8468e7 6147782a [B...{....h.]*xGa]
0602160 d17851a2 327552e2 00000000 00000000 [.Qx..Ru2........]
0602200 00000000 00000000 3130752f 91a4f000 [......../u01....] <==== **** Here ******
0602220 ff8808e4 d5104cff 000000ac 00000100 [.....L..........]
0602240 00000000 00000000 00000000 09d18000 [................]
  Repeat 254 times
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][88]

这里的/u01很可能表明该asm disk被文件系统覆盖

对于asm disk的各种破坏情况,如果是normal/high冗余,那么asm dg没有问题,可以考虑通过删除异常盘,然后重新加入;如果是外部冗余遭遇到asm disk 被破坏,一般asm disk 会dismount,而且无法正常mount,如果有备份的磁盘头,可以尝试还原磁盘头,mount 磁盘组,然后只读方式迁移数据;如果没有备份磁盘头或者还原之后也无法mount,可能需要通过一些额外的方式处理比如通过工具在asm dismount状态下恢复数据文件,甚至通过对asm block/oracle block碎片重组的方式恢复数据.参考相关文章:
oracle asm系列文章汇总
pvid=yes导致asm无法mount
asm disk header 彻底损坏恢复
分区无法识别导致asm diskgroup无法mount
oracle asm disk格式化恢复—格式化为ext4文件系统
oracle asm disk格式化恢复—格式化为ntfs文件系统
asm disk误设置pvid导致asm diskgroup无法mount恢复
分享oracleasm createdisk重新创建asm disk后数据0丢失恢复案例
ORA-15042: ASM disk “N” is missing from group number “M” 故障恢复

自动精简配置(Thin Provisioning)导致asm异常

有朋友在一个存储空间给asm使用,发生空间不足,然后使用另外一个存储中的lun给asm的data磁盘组增加asm disk,运行了大概1天之后,asm磁盘组直接dismount,数据库crash.然后就无法正常mount.包括这个存储上的几个其他磁盘组也无法正常mount.
数据库异常日志

Sun Oct 23 08:43:59 2016
SUCCESS: diskgroup DATA was dismounted
SUCCESS: diskgroup DATA was dismounted
Sun Oct 23 08:44:00 2016
Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_lmon_79128.trc:
ORA-00202: control file: '+DATA/orcl/controlfile/current.278.892363163'
ORA-15078: ASM diskgroup was forcibly dismounted
Sun Oct 23 08:44:00 2016
Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_lgwr_79174.trc:
ORA-00345: redo log write error block 15924 count 2
ORA-00312: online log 2 thread 1: '+DATA/orcl/onlinelog/group_2.274.892363167'
ORA-15078: ASM diskgroup was forcibly dismounted
ORA-15078: ASM diskgroup was forcibly dismounted
Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_lgwr_79174.trc:
ORA-00202: control file: '+DATA/orcl/controlfile/current.278.892363163'
ORA-15078: ASM diskgroup was forcibly dismounted
Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_lgwr_79174.trc:
ORA-00204: error in reading (block 1, # blocks 1) of control file
ORA-00202: control file: '+DATA/orcl/controlfile/current.278.892363163'
ORA-15078: ASM diskgroup was forcibly dismounted
Sun Oct 23 08:44:00 2016
LGWR (ospid: 79174): terminating the instance due to error 204
Sun Oct 23 08:44:00 2016
opiodr aborting process unknown ospid (79742) as a result of ORA-1092
Sun Oct 23 08:44:01 2016
ORA-1092 : opitsk aborting process
Sun Oct 23 08:44:01 2016
ORA-1092 : opitsk aborting process
System state dump requested by (instance=1, osid=79174 (LGWR)), summary=[abnormal instance termination].
System State dumped to trace file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_diag_79118.trc
Instance terminated by LGWR, pid = 79174

很明显,数据库异常是由于asm diskgroup dismount,因此分析asm 日志

asm 日志

Sun Oct 23 07:00:31 2016
Time drift detected. Please check VKTM trace file for more details.
Sun Oct 23 08:43:55 2016
Errors in file /oracle/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_8755.trc:
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 1048576
WARNING: Write Failed. group:1 disk:2 AU:1222738 offset:0 size:1048576
ERROR: failed to copy file +DATA.524, extent 15030
ERROR: ORA-15080 thrown in ARB0 for group number 1
Errors in file /oracle/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_8755.trc:
ORA-15080: synchronous I/O operation to a disk failed
Sun Oct 23 08:43:55 2016
NOTE: stopping process ARB0
NOTE: rebalance interrupted for group 1/0xec689cdd (DATA)
NOTE: ASM did background COD recovery for group 1/0xec689cdd (DATA)
NOTE: starting rebalance of group 1/0xec689cdd (DATA) at power 1
Starting background process ARB0
Sun Oct 23 08:43:56 2016
ARB0 started with pid=24, OS id=103554 
NOTE: assigning ARB0 to group 1/0xec689cdd (DATA) with 1 parallel I/O
Errors in file /oracle/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_103554.trc:
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 1048576
WARNING: Write Failed. group:1 disk:2 AU:1222738 offset:0 size:1048576
ERROR: failed to copy file +DATA.256, extent 6570
ERROR: ORA-15080 thrown in ARB0 for group number 1
Errors in file /oracle/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_103554.trc:
ORA-15080: synchronous I/O operation to a disk failed
NOTE: stopping process ARB0
Sun Oct 23 08:43:58 2016
Errors in file /oracle/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_dbw0_8521.trc:
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 4096
WARNING: Write Failed. group:1 disk:3 AU:6789 offset:24576 size:4096
NOTE: cache initiating offline of disk 3 group DATA
NOTE: process _dbw0_+asm1 (8521) initiating offline of disk 3.3915934787 (DATA_0003) with mask 0x7e in group 1
Sun Oct 23 08:43:58 2016
WARNING: Disk 3 (DATA_0003) in group 1 mode 0x7f is now being offlined
WARNING: Disk 3 (DATA_0003) in group 1 in mode 0x7f is now being taken offline on ASM inst 1
NOTE: initiating PST update: grp = 1, dsk = 3/0xe9686c43, mask = 0x6a, op = clear
GMON updating disk modes for group 1 at 14 for pid 14, osid 8521
ERROR: Disk 3 cannot be offlined, since diskgroup has external redundancy.
ERROR: too many offline disks in PST (grp 1)
Sun Oct 23 08:43:58 2016
NOTE: cache dismounting (not clean) group 1/0xEC689CDD (DATA) 
NOTE: messaging CKPT to quiesce pins Unix process pid: 103577, image: oracle@node1 (B000)
WARNING: Disk 3 (DATA_0003) in group 1 mode 0x7f offline is being aborted
WARNING: Offline of disk 3 (DATA_0003) in group 1 and mode 0x7f failed on ASM inst 1
NOTE: halting all I/Os to diskgroup 1 (DATA)
Sun Oct 23 08:43:59 2016
NOTE: LGWR doing non-clean dismount of group 1 (DATA)
NOTE: LGWR sync ABA=160.10145 last written ABA 160.10145

错误信息很明显,由于Write Failed导致asm diskgroup dismount.

系统日志

Oct 23 08:43:55 node1 kernel: sd 6:0:12:1: [sdd]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 23 08:43:55 node1 kernel: sd 6:0:12:1: [sdd]  Sense Key : Data Protect [current] 
Oct 23 08:43:55 node1 kernel: sd 6:0:12:1: [sdd]  Add. Sense: Space allocation failed write protect
Oct 23 08:43:55 node1 kernel: sd 6:0:12:1: [sdd] CDB: Write(16): 8a 00 00 00 00 02 e7 18 37 f9 00 00 00 07 00 00
Oct 23 08:43:55 node1 kernel: end_request: critical space allocation error, dev sdd, sector 12467058681
Oct 23 08:43:55 node1 kernel: end_request: critical space allocation error, dev dm-3, sector 12467058681
Oct 23 08:43:55 node1 kernel: sd 8:0:6:1: [sdh]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 23 08:43:55 node1 kernel: sd 8:0:6:1: [sdh]  Sense Key : Data Protect [current] 
Oct 23 08:43:55 node1 kernel: sd 8:0:6:1: [sdh]  Add. Sense: Space allocation failed write protect
Oct 23 08:43:55 node1 kernel: sd 8:0:6:1: [sdh] CDB: Write(16): 8a 00 00 00 00 02 e7 18
Oct 23 08:43:55 node1 kernel: sd 6:0:4:1: [sdb]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 23 08:43:55 node1 kernel: sd 6:0:4:1: [sdb]  Sense Key : Data Protect [current] 
Oct 23 08:43:55 node1 kernel: sd 6:0:4:1: [sdb]   33Add. Sense: Space allocation failed write protect
Oct 23 08:43:55 node1 kernel: sd 6:0:4:1: [sdb] CDB: Write(16): 8a 00 00 00 00 02 e7 18 30 00 00 00 03 f9 00 00
Oct 23 08:43:55 node1 kernel: end_request: critical space allocation error, dev sdb, sector 12467056640
Oct 23 08:43:55 node1 kernel: f9 00 00 04 00
Oct 23 08:43:55 node1 kernel: end_request: critical space allocation error, dev dm-3, sector 12467056640
Oct 23 08:43:55 node1 kernel: 00 00
Oct 23 08:43:55 node1 kernel: end_request: critical space allocation error, dev sdh, sector 12467057657
Oct 23 08:43:55 node1 kernel: end_request: critical space allocation error, dev dm-3, sector 12467057657
Oct 23 08:43:57 node1 kernel: sd 8:0:6:1: [sdh]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 23 08:43:57 node1 kernel: sd 8:0:6:1: [sdh]  Sense Key : Data Protect [current] 
Oct 23 08:43:57 node1 kernel: sd 8:0:6:1: [sdh]  Add. Sense: Space allocation failed write protect
Oct 23 08:43:57 node1 kernel: sd 8:0:6:1: [sdh] CDB: Write(16): 8a 00 00 00 00 02 e7 18 37 f9 00 00 00 07 00 00
Oct 23 08:43:57 node1 kernel: end_request: critical space allocation error, dev sdh, sector 12467058681
Oct 23 08:43:57 node1 kernel: end_request: critical space allocation error, dev dm-3, sector 12467058681
Oct 23 08:43:57 node1 kernel: sd 8:0:12:1: [sdj]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 23 08:43:57 node1 kernel: sd 8:0:12:1: [sdj]  Sense Key : Data Protect [current] 
Oct 23 08:43:57 node1 kernel: sd 8:0:12:1: [sdj]  Add. Sense: Space allocation failed write protect
Oct 23 08:43:57 node1 kernel: sd 8:0:12:1: [sdj] CDB: Write(16): 8a 00 00 00 00 02 e7 18 30 00 00 00 03 f9 00 00
Oct 23 08:43:57 node1 kernel: end_request: critical space allocation error, dev sdj, sector 12467056640
Oct 23 08:43:57 node1 kernel: end_request: critical space allocation error, dev dm-3, sector 12467056640
Oct 23 08:43:57 node1 kernel: sd 6:0:4:1: [sdb]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 23 08:43:57 node1 kernel: sd 6:0:4:1: [sdb]  Sense Key : Data Protect [current] 
Oct 23 08:43:57 node1 kernel: sd 6:0:4:1: [sdb]  Add. Sense: Space allocation failed write protect
Oct 23 08:43:57 node1 kernel: sd 6:0:4:1: [sdb] CDB: Write(16): 8a 00 00 00 00 02 e7 18 33 f9 00 00 04 00 00 00
Oct 23 08:43:58 node1 kernel: sd 6:0:4:1: [sdb]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 23 08:43:58 node1 kernel: sd 6:0:4:1: [sdb]  Sense Key : Data Protect [current] 
Oct 23 08:43:58 node1 kernel: sd 6:0:4:1: [sdb]  Add. Sense: Space allocation failed write protect
Oct 23 08:43:58 node1 kernel: sd 6:0:4:1: [sdb] CDB: Write(16): 8a 00 00 00 00 03 3b 7e 78 30 00 00 00 08 00 00
Oct 23 10:50:59 node1 init: oracle-ohasd main process (6150) killed by TERM signal

错误信息为:critical space allocation error,严重空间分配错误.也就是linux在分配空间之时发生错误.在换而言之,由于分配空间错误导致asm 磁盘组dismount.

查看多路径信息

[root@node1 ~]# multipath -ll
36000d31003190c000000000000000003 dm-3 COMPELNT,Compellent Vol
size=80T features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 6:0:9:1  sdd 8:48  active ready running
  `- 8:0:9:1  sdi 8:128 active ready running
delldisk2 (36000d310031908000000000000000003) dm-4 COMPELNT,Compellent Vol
size=8.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 6:0:12:1 sde 8:64  active ready running
  |- 8:0:6:1  sdh 8:112 active ready running
  |- 6:0:4:1  sdb 8:16  active ready running
  `- 8:0:12:1 sdj 8:144 active ready running
delldisk1 (36000d31003190a000000000000000007) dm-2 COMPELNT,Compellent Vol
size=12T features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 6:0:1:1  sda 8:0   active ready running
  |- 8:0:2:1  sdf 8:80  active ready running
  |- 6:0:7:1  sdc 8:32  active ready running
  `- 8:0:3:1  sdg 8:96  active ready running

很明显报错的都是同一个lun(delldisk2),也就是存储空间使用完的存储.也就是说,由于delldisk2存储的空间使用尽了导致系统出现分配空间错误,从而导致asm 写失败,进而导致数据库异常.这种问题的本质其实就是存储给系统分配了8T,但是实际存储可以使用的空间不足8T,而os按照8T来使用从而出现该问题.专业名字叫做”存储精简卷”.因此各位在存储配置之时需要注意该问题.因为这种情况的出现一般只是写io异常,读依旧正常,因此不会丢失数据.

kfed read asm disk header

kfed读取数据磁盘头主要参数解释说明

   % kfed read /dev/raw/raw1
     
   kfbh.endian:                          1 ; 0x000: 0x01
   kfbh.hard:                          130 ; 0x001: 0x82
   kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
   kfbh.datfmt:                          1 ; 0x003: 0x01
   kfbh.block.blk:                       0 ; 0x004: T=0 NUMB=0x0
   kfbh.block.obj:              2147483648 ; 0x008: TYPE=0x8 NUMB=0x0
   kfbh.check:                  2932902794 ; 0x00c: 0xaed08b8a
   kfbh.fcn.base:                        0 ; 0x010: 0x00000000
   kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
   kfbh.spare1:                          0 ; 0x018: 0x00000000
   kfbh.spare2:                          0 ; 0x01c: 0x00000000
   kfdhdb.driver.provstr:         ORCLDISK ; 0x000: length=8
   kfdhdb.driver.reserved[0]:            0 ; 0x008: 0x00000000
   kfdhdb.driver.reserved[1]:            0 ; 0x00c: 0x00000000
   kfdhdb.driver.reserved[2]:            0 ; 0x010: 0x00000000
   kfdhdb.driver.reserved[3]:            0 ; 0x014: 0x00000000
   kfdhdb.driver.reserved[4]:            0 ; 0x018: 0x00000000
   kfdhdb.driver.reserved[5]:            0 ; 0x01c: 0x00000000
   kfdhdb.compat:                168820736 ; 0x020: 0x0a100000
   kfdhdb.dsknum:                        0 ; 0x024: 0x0000
   kfdhdb.grptyp:                        1 ; 0x026: KFDGTP_EXTERNAL
   kfdhdb.hdrsts:                        3 ; 0x027: KFDHDR_MEMBER
   kfdhdb.dskname:              ASM01_0000 ; 0x028: length=10
   kfdhdb.grpname:                   ASM01 ; 0x048: length=5
   kfdhdb.fgname:               ASM01_0000 ; 0x068: length=10
   kfdhdb.capname:                         ; 0x088: length=0
   kfdhdb.crestmp.hi:             32837774 ; 0x0a8: HOUR=0xe DAYS=0x4 MNTH=0x4 YEAR=0x7d4
   kfdhdb.crestmp.lo:           1555722240 ; 0x0ac: USEC=0x0 MSEC=0x29c SECS=0xb MINS=0x17
   kfdhdb.mntstmp.hi:             32837774 ; 0x0b0: HOUR=0xe DAYS=0x4 MNTH=0x4 YEAR=0x7d4
   kfdhdb.mntstmp.lo:           1563864064 ; 0x0b4: USEC=0x0 MSEC=0x1ab SECS=0x13 MINS=0x17
   kfdhdb.secsize:                     512 ; 0x0b8: 0x0200
   kfdhdb.blksize:                    4096 ; 0x0ba: 0x1000
   kfdhdb.ausize:                  1048576 ; 0x0bc: 0x00100000
   kfdhdb.mfact:                    113792 ; 0x0c0: 0x0001bc80
   kfdhdb.dsksize:                    9075 ; 0x0c4: 0x00002373
   kfdhdb.pmcnt:                         2 ; 0x0c8: 0x00000002
   kfdhdb.fstlocn:                       1 ; 0x0cc: 0x00000001
   kfdhdb.altlocn:                       2 ; 0x0d0: 0x00000002
   kfdhdb.f1b1locn:                      2 ; 0x0d4: 0x00000002
   kfdhdb.redomirrors[0]:                0 ; 0x0d8: 0x0000
   kfdhdb.redomirrors[1]:                0 ; 0x0da: 0x0000
   kfdhdb.redomirrors[2]:                0 ; 0x0dc: 0x0000
   kfdhdb.redomirrors[3]:                0 ; 0x0de: 0x0000
   kfdhdb.ub4spare[0]:                   0 ; 0x0e0: 0x00000000
   ...
   kfdhdb.ub4spare[60]:                  0 ; 0x1d0: 0x00000000
   kfdhdb.acdb.aba.seq:                  0 ; 0x1d4: 0x00000000
   kfdhdb.acdb.aba.blk:                  0 ; 0x1d8: 0x00000000
   kfdhdb.acdb.ents:                     0 ; 0x1dc: 0x0000
   kfdhdb.acdb.ub2spare:                 0 ; 0x1de: 0x0000
 
  Breakdown:

   kfbh.endian  
     kf3.h   /* endianness of writer */ 
       Little endian = 1  
       Big endian = 0
  
   kfbh.hard    
     kf3.h   /* H.A.R.D. magic # and block size */  

  kfbh.type
    kf3.h    /* metadata block type               */

  kfbh.datfmt
    kf3.h   /* metadata block data format        */

  kfbh.block
    kf3.h   /* block location of this block      */
      blk -- Disk header should have T=0 and NUMB=0x0
      obj -- Disk header should have TYPE=0x8 NUMB=<disknumber>
    blk and obj values are derived from a series of macros in kf3.h.  See 
    "KFBL Macros" in kf3.h for more information.

  kfbh.check
    kf3.h   /* check value to verify consistency */

  kfbh.fcn
    kf3.h   /* change number of last change      */
     
  kfdhdb.driver
    kf3.h   /* OSMLIB driver reserved block  */
       If no driver is defined "ORCLDISK" is used.  
      
  kfdhdb.compat
    kf3.h   /* Comaptible software version   */
      example: 0x0a100000 
      You get:      
          a=10 1=1 so 10.1.0.0.0

  kfdhdb.dsknum
    kf3.h   /* OSM disk number               *
      This is the disk number.  The first disk being "0".  There can be up to
      ub2 disks in a diskgroup.  This allows for 65336 disks 0 through 65335.

  kfdhdb.grptyp
    kf3.h   /* Disk group type               */

  kfdhdb.hdrsts
    kf3.h   /* Disk header status            */
      This is what is used to determine if a disk is available or not to 
      the diskgroup.  0x03 is the correct value for a valid status.

  kfdhdb.dskname   /* OSM disk name       */
  kfdhdb.grpname   /* OSM disk group name */
  kfdhdb.fgname    /* Failure group name  */
  kfdhdb.capname   /* Capacity grp, unused*/
    kf3.h 

  kfdhdb.crestmp   /* Creation timestamp            */
  kfdhdb.mntstmp   /* Mount timestamp               */
    kf3.h To derive the hi and low time`from an unformated dump use the 
    "KFTS Macros" in kf3.h.

  kfdhdb.secsize
    kf3.h   /* Disk sector size (bytes)      */
      This is the physical sector size of the disk in bytes. All I/O's to the
      disk are described in physical sectors. This must be a power of 2. An
      ideal value would be 4096, but most disks are formatted with 512 byte
      sectors. (from asmlib.h)

  kfdhdb.blksize
    kf3.h   /* Metadata block (bytes)        */
       
  kfdhdb.ausize
    kf3.h   /* Allocation Unit (bytes)       */

  kfdhdb.mfact 
    kf3.h   /* Stride between phys addr AUs  */
     
  kfdhdb.dsksize
    kf3.h   /* Disk size in AUs              */
      Mulitply by AUs to get actual size of disk when added.  
         
  kfdhdb.pmcnt
    kf3.h   /* Permanent phys addressed AUs  */
      Number of physically addressed allocation units.

  kfdhdb.fstlocn
    kf3.h   /* First FreeSpace table blk num */
      Used to find freespace.

  kfdhdb.altlocn
    kf3.h   /* First Alocation table blk num */
      Used to find alocated space.

  kfdhdb.f1b1locn
    kf3.h   /* File Directory blk 1 AU num   */
      Beginging for file directory.

oracle asm系列文章汇总—www.xifenfie.com

为了方便大家更容易的查看相关asm内容,今天(2016年7月28日)对asm的相关文章进行了汇总整理.如果有asm相关的其他问题,可以通过手机(13429648788)或者QQ(107644445)交流
远程访问ASM
ASMCMD常用命令
ASM简单管理(1)
ASM简单管理(2)
找回ASM中数据文件
bbed修改ASM中数据
ASM迁移至文件系统
普通库迁移至ASM存储
使用dd复制asm中文件
配置Oracle ASM磁盘
ASM中磁盘组权限设置
监控asm disk磁盘性能
create spfile to asm
pvid=yes导致asm无法mount—ASM恢复案例
asm数据文件迁移(os–>asm)
asm数据文件迁移(asm–>asm)
asm数据文件迁移(asm–>os)
asm disk格式化为ntfs恢复—ASM恢复案例
通过ftp/http拷贝asm中文件
ASM DISK HEADER 备份与恢复
手工修复ASM DISK HEADER 异常
asm disk header 彻底损坏恢复—ASM恢复案例
ASM未正常启动,使用dd找回数据文件
asm disk被格式化为ext4文件系统恢复—ASM恢复案例
ORACLE 12C ASM 新特性:共享密码文件
asm备份元数据之md_backup和md_restore
分区无法识别导致asm diskgroup无法mount—ASM恢复案例
使用losetup实现linux普通文件做asm disk
Oracle异常恢复前备份保护现场建议—ASM环境
多cpu环境中运行root.sh失败,asm报ORA-04031
asmlib异常报ORA-00600[kfklLibFetchNext00]
因asm sga_target设置不当导致11gr2 rac无法正常启动
asm disk误设置pvid导致asm diskgroup无法mount恢复—ASM恢复案例
使用_asm_allow_only_raw_disks实现普通文件做asm disk
ORACLE 12C RAC修改ocr/votedisk/asm spfile所在磁盘组名称
使用asm disk header 自动备份信息恢复异常asm disk header
分享oracleasm createdisk重新创建asm disk后数据0丢失恢复案例—ASM恢复案例
ADHU(ASM Disk Header Utility)—asm disk header备份恢复工具
How to Get the Contents of an Spfile on ASM when ASM/GRID is down
ORA-15042: ASM disk “N” is missing from group number “M” 故障恢复—ASM恢复案例

分区无法识别导致asm diskgroup无法mount

有客户咨询由于主机重启之后,导致四个磁盘组中的data2磁盘组无法mount(报ORA-15032,ORA-15017,ORA-15063),数据库无法open,让我们帮忙分析解决

Wed Mar 09 18:10:53 2016
NOTE: Assigning number (1,1) to disk (/dev/oracleasm/disks/VOL011)
Wed Mar 09 18:10:53 2016
ERROR: no read quorum in group: required 1, found 0 disks
NOTE: cache dismounting (clean) group 1/0xBD42B778 (DATA2) 
NOTE: messaging CKPT to quiesce pins Unix process pid: 45093, image: oracle@BA (TNS V1-V3)
NOTE: dbwr not being msg'd to dismount
NOTE: lgwr not being msg'd to dismount
NOTE: cache dismounted group 1/0xBD42B778 (DATA2) 
NOTE: cache ending mount (fail) of group DATA2 number=1 incarn=0xbd42b778
NOTE: cache deleting context for group DATA2 1/0xbd42b778
GMON dismounting group 1 at 16 for pid 18, osid 45093
NOTE: Disk DATA2_0001 in mode 0x9 marked for de-assignment
ERROR: diskgroup DATA2 was not mounted
ORA-15032: not all alterations performed
ORA-15017: diskgroup "DATA2" cannot be mounted
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "DATA2"
ERROR: ALTER DISKGROUP DATA2 MOUNT  /* asm agent *//* {0:0:431} */

这里很明显由于缺少asm disk导致data2无法正常mount,进一步分析发现data2是有两块磁盘组成

Mon Sep 14 13:14:35 2015
SQL> create diskgroup data2 external redundancy  disk '/dev/oracleasm/disks/VOL010','/dev/oracleasm/disks/VOL011' 
NOTE: Assigning number (4,0) to disk (/dev/oracleasm/disks/VOL010)
NOTE: Assigning number (4,1) to disk (/dev/oracleasm/disks/VOL011)
NOTE: initializing header on grp 4 disk DATA2_0000
NOTE: initializing header on grp 4 disk DATA2_0001
NOTE: initiating PST update: grp = 4
Mon Sep 14 13:14:35 2015
GMON updating group 4 at 29 for pid 26, osid 51535
NOTE: group DATA2: initial PST location: disk 0000 (PST copy 0)
NOTE: PST update grp = 4 completed successfully 
NOTE: cache registered group DATA2 number=4 incarn=0xea085f62
NOTE: cache began mount (first) of group DATA2 number=4 incarn=0xea085f62
NOTE: cache opening disk 0 of grp 4: DATA2_0000 path:/dev/oracleasm/disks/VOL010
NOTE: cache opening disk 1 of grp 4: DATA2_0001 path:/dev/oracleasm/disks/VOL011
NOTE: cache creating group 4/0xEA085F62 (DATA2)
NOTE: cache mounting group 4/0xEA085F62 (DATA2) succeeded
NOTE: allocating F1X0 on grp 4 disk DATA2_0000
NOTE: diskgroup must now be re-mounted prior to first use
NOTE: cache dismounting (clean) group 4/0xEA085F62 (DATA2) 
NOTE: messaging CKPT to quiesce pins Unix process pid: 51535, image: oracle@BA (TNS V1-V3)
NOTE: lgwr not being msg'd to dismount
NOTE: cache dismounted group 4/0xEA085F62 (DATA2) 
GMON dismounting group 4 at 30 for pid 26, osid 51535
GMON dismounting group 4 at 31 for pid 26, osid 51535
NOTE: Disk DATA2_0000 in mode 0x7e marked for de-assignment
NOTE: Disk DATA2_0001 in mode 0x7e marked for de-assignment
SUCCESS: diskgroup DATA2 was created

结合这部分信息,我们可以确定data2磁盘组是由两个磁盘组构成,分别为VOL010和VOL011,现在由于只发现了VOL011,因此data2磁盘组无法正常mount.观察发现该系统使用的是asmlib,通过oracleasm querydisk命令结合fdisk的盘符,
oracleasm-querydisk


基本上可以确定VOL010丢失应该在mpathb盘(由于只有该盘和分区未被使用,其他盘和分区已经全部被现在可以查询到的asmlib使用作为asmdisk)之上

Disk /dev/mapper/mpathb: 3846.7 GB, 3846677987328 bytes
255 heads, 63 sectors/track, 467665 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

             Device Boot      Start         End      Blocks   Id  System
/dev/mapper/mpathbp1               1      267350  2147483647+  ee  GPT

Disk /dev/mapper/mpathbp1: 3846.7 GB, 3846675890176 bytes
255 heads, 63 sectors/track, 467665 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xb84bb99a

                Device Boot      Start         End      Blocks   Id  System
/dev/mapper/mpathbp1p1               1      200513  1610620641   83  Linux
/dev/mapper/mpathbp1p2          200514      267349   536860170   83  Linux
/dev/mapper/mpathbp1p3          267350      467665  1609038270   83  Linux

这里我们发现奇怪现象:mpathb盘先使用parted分为一个mapthbp1分区,然后又使用fdisk分了三个p1p1,p1p2,p1p3三个子分区.然后我们查看/dev/mapper/中的设备情况
mpathb


发现没有p1p1,p1p2,p1p3这三个本该属于mapthb上的子分区.现在基本上明确,是由于对mapthb先使用了parted分区,然后再使用fdisk分区,在操作系统重启之后,无法正常识别相关子分区导致该问题.到此解决该问题的思路有三种.
1. 因为磁盘分区表信息是正常的,就是分区表信息没有同步到操作系统之上,想办法同步过去即可,os部分内容,此处忽略
2. 使用数据文件重组的方式直接对data2这两个asm disk进行重组,这里因为三个子分区未发现,直接对mapthbp1分区进行扫描即可,参考:asm disk header 彻底损坏恢复
3. 因为分区对于asm disk来说主要就是设置了磁盘的偏移量和大小,如果找到磁盘的偏移量,然后确定asm disk大小,直接通过dd命令把该部分dd到新的磁盘设备之上,然后直接mount磁盘组即可,这里重点讲解第三种方法恢复处理
使用dd出来mapthp1的磁盘头,然后使用bbed找出来偏移量,主要依据是第一次出现01820101信息的部分

BBED> d
 File: bp1 (0)
 Block: 64               Offsets:    0 to   63           Dba:0x00000000
------------------------------------------------------------------------
 01820101 00000000 00000080 bc60223c 00000000 00000000 00000000 00000000
 4f52434c 4449534b 564f4c30 31300000 00000000 00000000 00000000 00000000

 <32 bytes per line>

BBED> show all
        FILE#           0
        BLOCK#          64
        OFFSET          0
        DBA             0x00000000 (0 0,64)
        FILENAME        bp1
        BIFILE          bifile.bbd
        LISTFILE
        BLOCKSIZE       512
        MODE            Browse
        EDIT            Unrecoverable
        IBASE           Dec
        OBASE           Dec
        WIDTH           80
        COUNT           64
        LOGFILE         log.bbd
        SPOOL           No

这里基本上可以定位到asm disk header对于mapthbp1的偏移量为32256,dd出来asm disk header分析
dd


使用kfed查看磁盘头信息
kfed
kfed2

现在基本上可以确定,asm disk大小为1572871M,磁盘的偏移量为32256,然后使用dd命令把这部分dd到新的磁盘设备上,然后oracleasm scandisks后
asm-mount
asmcd-lsdg
oracle-open

data2 mount成功,数据库正常open,此数据库完美恢复