ERROR: diskgroup XXXX was not mounted

联系:手机/微信(+86 13429648788) QQ(107644445)QQ咨询惜分飞

标题:ERROR: diskgroup XXXX was not mounted

作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]

aix平台10.2.0.5 2节点RAC,由于节点2系统盘故障,通过节点1镜像系统,复制到节点2,结果由于节点2磁盘顺序和节点1不匹配,aix工程师进行了相关操作之后,节点1重启之后datadg磁盘组无法mount

SQL> alter diskgroup datadg mount 
Mon Jun 10 23:23:46 CST 2019
NOTE: cache registered group DATADG number=1 incarn=0x8cf61164
Mon Jun 10 23:23:46 CST 2019
NOTE: Hbeat: instance first (grp 1)
Mon Jun 10 23:23:50 CST 2019
NOTE: start heartbeating (grp 1)
Mon Jun 10 23:23:50 CST 2019
NOTE: cache dismounting group 1/0x8CF61164 (DATADG) 
NOTE: dbwr not being msg'd to dismount
ERROR: diskgroup DATADG was not mounted

检查datadg磁盘组相关信息

Tue Jan 29 19:21:45 CST 2019
NOTE: start heartbeating (grp 2)
NOTE: cache opening disk 0 of grp 2: DATADG_0000 path:/dev/rhdisk6
Tue Jan 29 19:21:45 CST 2019
NOTE: F1X0 found on disk 0 fcn 0.0
NOTE: cache opening disk 1 of grp 2: DATADG_0001 path:/dev/rhdisk7
NOTE: cache opening disk 2 of grp 2: DATADG_0002 path:/dev/rhdisk8
NOTE: cache opening disk 3 of grp 2: DATADG_0003 path:/dev/rhdisk9
NOTE: cache mounting (first) group 2/0x60E59155 (DATADG)
* allocate domain 2, invalid = TRUE 
Tue Jan 29 19:21:45 CST 2019
NOTE: attached to recovery domain 2
Tue Jan 29 19:21:45 CST 2019
NOTE: cache recovered group 2 to fcn 0.849668
Tue Jan 29 19:21:45 CST 2019
NOTE: LGWR attempting to mount thread 1 for disk group 2
NOTE: LGWR mounted thread 1 for disk group 2
NOTE: opening chunk 1 at fcn 0.849668 ABA 
NOTE: seq=21 blk=5394 
Tue Jan 29 19:21:46 CST 2019
NOTE: cache mounting group 2/0x60E59155 (DATADG) succeeded
SUCCESS: diskgroup DATADG was mounted

通过这里可以看出来datadg磁盘组是由rhdisk6-9 四块磁盘组成,查询相关磁盘信息发现
asm-foreign


这里确定rhdisk7磁盘异常,通过kfed分析磁盘情况

D:\BaiduNetdiskDownload\xifenfei>kfed read rhdisk7.dd
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                           34 ; 0x001: 0x22
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt:                          0 ; 0x003: 0x00
kfbh.block.blk:                   49407 ; 0x004: blk=49407
kfbh.block.obj:                       0 ; 0x008: file=0
kfbh.check:                           0 ; 0x00c: 0x00000000
kfbh.fcn.base:                    58396 ; 0x010: 0x0000e41c
kfbh.fcn.wrap:                   131072 ; 0x014: 0x00020000
kfbh.spare1:                 4294967064 ; 0x018: 0xffffff18
kfbh.spare2:                 2105310074 ; 0x01c: 0x7d7c7b7a
005918A00 00002200 0000C0FF 00000000 00000000  [."..............]
005918A10 0000E41C 00020000 FFFFFF18 7D7C7B7A  [............z{|}]
005918A20 00000000 00000000 00000000 00000000  [................]
  Repeat 253 times
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][0]

D:\BaiduNetdiskDownload\xifenfei>kfed read rhdisk7.dd blkn=1
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                            0 ; 0x001: 0x00
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt:                          0 ; 0x003: 0x00
kfbh.block.blk:                       0 ; 0x004: blk=0
kfbh.block.obj:                       0 ; 0x008: file=0
kfbh.check:                           0 ; 0x00c: 0x00000000
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
006EF8A00 00000000 00000000 00000000 00000000  [................]
  Repeat 255 times
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][0]

D:\BaiduNetdiskDownload\xifenfei>kfed read rhdisk7.dd blkn=2|more
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL
kfbh.datfmt:                          1 ; 0x003: 0x01
kfbh.block.blk:                33554432 ; 0x004: blk=33554432
kfbh.block.obj:                16777344 ; 0x008: file=128
kfbh.check:                  3844041089 ; 0x00c: 0xe51f6981
kfbh.fcn.base:               1297484544 ; 0x010: 0x4d560b00
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
kfdatb10.aunum:                       0 ; 0x000: 0x00000000
kfdatb10.shrink:                  49153 ; 0x004: 0xc001
kfdatb10.ub2pad:                  20555 ; 0x006: 0x504b
kfdatb10.auinfo[0].link.next:      2048 ; 0x008: 0x0800
kfdatb10.auinfo[0].link.prev:      2048 ; 0x00a: 0x0800
kfdatb10.auinfo[0].free:              0 ; 0x00c: 0x0000
kfdatb10.auinfo[0].total:         49153 ; 0x00e: 0xc001
kfdatb10.auinfo[1].link.next:      4096 ; 0x010: 0x1000
kfdatb10.auinfo[1].link.prev:      4096 ; 0x012: 0x1000
kfdatb10.auinfo[1].free:              0 ; 0x014: 0x0000
kfdatb10.auinfo[1].total:             0 ; 0x016: 0x0000
kfdatb10.auinfo[2].link.next:      6144 ; 0x018: 0x1800
kfdatb10.auinfo[2].link.prev:      6144 ; 0x01a: 0x1800
kfdatb10.auinfo[2].free:              0 ; 0x01c: 0x0000
kfdatb10.auinfo[2].total:             0 ; 0x01e: 0x0000
kfdatb10.auinfo[3].link.next:      8192 ; 0x020: 0x2000
kfdatb10.auinfo[3].link.prev:      8192 ; 0x022: 0x2000
kfdatb10.auinfo[3].free:              0 ; 0x024: 0x0000

对比磁盘可能的损坏情况,由于在aix 平台asm disk的block有一个特征一般0082开头,通过工具打开磁盘,检索该标记对比
正常磁盘
0082-good


异常磁盘
0082-had

通过上述分析,大概评估rhdisk7 元数据部分损坏的不光是block 0和1,人工修复继续使用的可能性不太大,而且基于客户的数据库不大,采取方案是直接拷贝数据文件、redo、控制文件到文件系统,然后在本地文件系统open库
open_db

运气不错,实现完美恢复数据0丢失

WARNING: Read Failed.导致asm磁盘组异常

联系:手机/微信(+86 13429648788) QQ(107644445)QQ咨询惜分飞

标题:WARNING: Read Failed.导致asm磁盘组异常

作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]

有客户对asm dg进行扩容,一段时间之后,asm data 磁盘组直接dismount

Wed May 29 18:37:25 2019
SUCCESS: ALTER DISKGROUP DATA ADD  DISK '/dev/oracleasm/disks/DATA_0028' SIZE 511993M ,
'/dev/oracleasm/disks/DATA_0027' SIZE 511993M ,
'/dev/oracleasm/disks/DATA_0026' SIZE 511993M ,
'/dev/oracleasm/disks/DATA_0025' SIZE 511993M /* ASMCA */
NOTE: starting rebalance of group 1/0x9e18e2f1 (DATA) at power 1
Wed May 29 18:37:26 2019
Starting background process ARB0
Wed May 29 18:37:26 2019
ARB0 started with pid=34, OS id=96638 
NOTE: assigning ARB0 to group 1/0x9e18e2f1 (DATA) with 1 parallel I/O
NOTE: Attempting voting file refresh on diskgroup DATA
NOTE: Refresh completed on diskgroup DATA. No voting file found.
cellip.ora not found.
Wed May 29 19:21:43 2019
WARNING: Read Failed. group:1 disk:27 AU:0 offset:360448 size:4096
WARNING: cache failed reading from group=1(DATA) dsk=27 blk=88 count=1 from disk= 27 
(DATA_0027) kfkist=0x20 status=0x02 osderr=0x0 file=kfc.c line=11596
ERROR: cache failed to read group=1(DATA) dsk=27 blk=88 from disk(s): 27(DATA_0027)
ORA-15080: synchronous I/O operation to a disk failed
ORA-27072: File I/O error
Linux-x86_64 Error: 5: Input/output error
Additional information: 4
Additional information: 704
Additional information: -1
NOTE: cache initiating offline of disk 27 group DATA
NOTE: process _user31879_+asm1 (31879) initiating offline of disk 27.3915911747 (DATA_0027) with mask 0x7e in group 1
NOTE: initiating PST update: grp = 1, dsk = 27/0xe9681243, mask = 0x6a, op = clear
Wed May 29 19:21:43 2019
GMON updating disk modes for group 1 at 10 for pid 35, osid 31879
ERROR: Disk 27 cannot be offlined, since diskgroup has external redundancy.
ERROR: too many offline disks in PST (grp 1)
Wed May 29 19:21:43 2019
NOTE: cache dismounting (not clean) group 1/0x9E18E2F1 (DATA) 
NOTE: messaging CKPT to quiesce pins Unix process pid: 90256, image: oracle@ftz-db-o1 (B000)
Wed May 29 19:21:43 2019
NOTE: halting all I/Os to diskgroup 1 (DATA)
WARNING: Offline for disk DATA_0027 in mode 0x7f failed.
Wed May 29 19:21:43 2019
NOTE: LGWR doing non-clean dismount of group 1 (DATA)
NOTE: LGWR sync ABA=27.3207 last written ABA 27.3207
Wed May 29 19:21:43 2019
ERROR: ORA-15130 thrown in ARB0 for group number 1
Errors in file /oracle/grid_base/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_96638.trc:
ORA-15130: diskgroup "" is being dismounted
ORA-15130: diskgroup "DATA" is being dismounted
Wed May 29 19:21:43 2019
NOTE: stopping process ARB0

后续继续mount data 磁盘组成功,但是立马又dismount

Wed May 29 18:37:25 2019
SUCCESS: ALTER DISKGROUP DATA ADD  DISK '/dev/oracleasm/disks/DATA_0028' SIZE 511993M ,
'/dev/oracleasm/disks/DATA_0027' SIZE 511993M ,
'/dev/oracleasm/disks/DATA_0026' SIZE 511993M ,
'/dev/oracleasm/disks/DATA_0025' SIZE 511993M /* ASMCA */
NOTE: starting rebalance of group 1/0x9e18e2f1 (DATA) at power 1
Wed May 29 18:37:26 2019
Starting background process ARB0
Wed May 29 18:37:26 2019
ARB0 started with pid=34, OS id=96638 
NOTE: assigning ARB0 to group 1/0x9e18e2f1 (DATA) with 1 parallel I/O
NOTE: Attempting voting file refresh on diskgroup DATA
NOTE: Refresh completed on diskgroup DATA. No voting file found.
cellip.ora not found.
Wed May 29 19:21:43 2019
WARNING: Read Failed. group:1 disk:27 AU:0 offset:360448 size:4096
WARNING: cache failed reading from group=1(DATA) dsk=27 blk=88 count=1 from disk= 27 
(DATA_0027) kfkist=0x20 status=0x02 osderr=0x0 file=kfc.c line=11596
ERROR: cache failed to read group=1(DATA) dsk=27 blk=88 from disk(s): 27(DATA_0027)
ORA-15080: synchronous I/O operation to a disk failed
ORA-27072: File I/O error
Linux-x86_64 Error: 5: Input/output error
Additional information: 4
Additional information: 704
Additional information: -1
NOTE: cache initiating offline of disk 27 group DATA
NOTE: process _user31879_+asm1 (31879) initiating offline of disk 27.3915911747 (DATA_0027) with mask 0x7e in group 1
NOTE: initiating PST update: grp = 1, dsk = 27/0xe9681243, mask = 0x6a, op = clear
Wed May 29 19:21:43 2019
GMON updating disk modes for group 1 at 10 for pid 35, osid 31879
ERROR: Disk 27 cannot be offlined, since diskgroup has external redundancy.
ERROR: too many offline disks in PST (grp 1)
Wed May 29 19:21:43 2019
NOTE: cache dismounting (not clean) group 1/0x9E18E2F1 (DATA) 
NOTE: messaging CKPT to quiesce pins Unix process pid: 90256, image: oracle@ftz-db-o1 (B000)
Wed May 29 19:21:43 2019
NOTE: halting all I/Os to diskgroup 1 (DATA)
WARNING: Offline for disk DATA_0027 in mode 0x7f failed.
Wed May 29 19:21:43 2019
NOTE: LGWR doing non-clean dismount of group 1 (DATA)
NOTE: LGWR sync ABA=27.3207 last written ABA 27.3207
Wed May 29 19:21:43 2019
ERROR: ORA-15130 thrown in ARB0 for group number 1
Errors in file /oracle/grid_base/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_96638.trc:
ORA-15130: diskgroup "" is being dismounted
ORA-15130: diskgroup "DATA" is being dismounted
Wed May 29 19:21:43 2019
NOTE: stopping process ARB0

对于上述的故障现象,本质原因是由于asm 磁盘组增加新磁盘之后,开始做rebalance,但是由于遭遇到 27号盘上有IO读错误,使得asm磁盘组无法正常完成rebalance,因而data磁盘组无法稳定的mount。解决该问题思路,通过patch asm磁盘组,禁止rebalance,从而使得data磁盘组不再dismount,再进行后续恢复

ORA-600 kfrValAcd30 恢复

联系:手机/微信(+86 13429648788) QQ(107644445)QQ咨询惜分飞

标题:ORA-600 kfrValAcd30 恢复

作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]

有客户由于存储控制器损坏,修复控制器之后,asm无法正常mount,报ORA-600 kfrValAcd30错误,让我们提供技术支持
kfrValAcd30


asm alert日志报错

Wed Apr 03 16:50:57 2019
SQL> alter diskgroup data mount 
NOTE: cache registered group DATA number=1 incarn=0x14248741
NOTE: cache began mount (first) of group DATA number=1 incarn=0x14248741
NOTE: Assigning number (1,0) to disk (ORCL:DATAVOL1)
Wed Apr 03 16:51:03 2019
NOTE: start heartbeating (grp 1)
kfdp_query(DATA): 15 
kfdp_queryBg(): 15 
NOTE: cache opening disk 0 of grp 1: DATAVOL1 label:DATAVOL1
NOTE: F1X0 found on disk 0 au 2 fcn 0.0
NOTE: cache mounting (first) external redundancy group 1/0x14248741 (DATA)
Wed Apr 03 16:51:04 2019
* allocate domain 1, invalid = TRUE 
Wed Apr 03 16:51:04 2019
NOTE: attached to recovery domain 1
NOTE: starting recovery of thread=1 ckpt=27.2697 group=1 (DATA)
Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_15951.trc  (incident=23394):
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [27], [2697], [28], [2697], [], [], [], [], []
Incident details in: /u01/app/grid/diag/asm/+asm/+ASM2/incident/incdir_23394/+ASM2_ora_15951_i23394.trc
Abort recovery for domain 1
NOTE: crash recovery signalled OER-600
ERROR: ORA-600 signalled during mount of diskgroup DATA
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [27], [2697], [28], [2697], [], [], [], [], []
ERROR: alter diskgroup data mount
NOTE: cache dismounting (clean) group 1/0x14248741 (DATA) 
NOTE: lgwr not being msg'd to dismount
freeing rdom 1
Wed Apr 03 16:51:05 2019
Sweep [inc][23394]: completed
Sweep [inc2][23394]: completed
Wed Apr 03 16:51:05 2019
Trace dumping is performing id=[cdmp_20190403165105]
NOTE: detached from domain 1
NOTE: cache dismounted group 1/0x14248741 (DATA) 
NOTE: cache ending mount (fail) of group DATA number=1 incarn=0x14248741
kfdp_dismount(): 16 
kfdp_dismountBg(): 16 
NOTE: De-assigning number (1,0) from disk (ORCL:DATAVOL1)
ERROR: diskgroup DATA was not mounted
NOTE: cache deleting context for group DATA 1/337938241

mos相关记录
参考:ORA-600 [KFRVALACD30] in ASM (Doc ID 2123013.1)
kfrValAcd30-mos


ORA-00600: internal error code, arguments: [kfrValAcd30]可能是bug或者硬件故障导致.基于客户的情况,最大可能就是由于硬件故障导致asm 磁盘组的acd无法正常进行,从而无法mount成功.如果运气好,通过kfed相关修复可以正常mount成功,运气不好可以通过底层进行恢复数据文件,从而最大限度恢复数据.

KFED-00322 [kfbtTraverseBlock][Invalid OSM block type]

在oracle asm的使用过程中由于操作系统层面的错误操作导致asm disk 被破坏,这里列举了几种破坏之后的kfed报错现象(KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type])
asm mount 磁盘组报错(ORA-15040 ORA-15042)

SQL> alter diskgroup DATA mount;
alter diskgroup DATA mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "2" is missing from group number "2"

asm alert日志报错(ORA-15335 ORA-15066 ORA-15196等)

ORA-15335: ASM metadata corruption detected in disk group 'DATA'
ORA-15130: diskgroup "DATA" is being dismounted
ORA-15066: offlining disk "DATA_0002" in group "DATA" may result in a data loss
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [2147483651] [48] [0 != 1]

kfed查看磁盘头报错
文件文件头(不光是disk header的4k,可能是连续的几个au,甚至更多)可能彻底损坏,一般kfed 读取都会看到KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type]之类错误

[oracle@fcomtaep2 disks]$ kfed read ASMRECO03
kfbh.endian: 0 ; 0x000: 0x00
kfbh.hard: 0 ; 0x001: 0x00
kfbh.type: 0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt: 0 ; 0x003: 0x00
kfbh.block.blk: 0 ; 0x004: blk=0
kfbh.block.obj: 0 ; 0x008: file=0
kfbh.check: 0 ; 0x00c: 0x00000000
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
kfbh.spare2: 0 ; 0x01c: 0x00000000
7FC18D899400 00000000 00000000 00000000 00000000 [................]
  Repeat 27 times
7FC18D8995C0 FEEE0001 0001FFFF FFFF0000 00000FFF [................]
7FC18D8995D0 00000000 00000000 00000000 00000000 [................]
  Repeat 1 times
7FC18D8995F0 00000000 00000000 00000000 AA550000 [..............U.]
7FC18D899600 20494645 54524150 00010000 0000005C [EFI PART....\...] <==== **** Here ******
7FC18D899610 BD82BBB3 00000000 00000001 00000000 [................]
7FC18D899620 0FFFFFFF 00000000 00000022 00000000 [........".......]
7FC18D899630 0FFFFFDE 00000000 FD8857E5 42D7B49B [.........W.....B]
7FC18D899640 0901FA87 6B3DB5AA 00000002 00000000 [......=k........]
7FC18D899650 00000080 00000080 FE48EB77 00000000 [........w.H.....]
7FC18D899660 00000000 00000000 00000000 00000000 [................]
  Repeat 25 times
7FC18D899800 EBD0A0A2 4433B9E5 B668C087 C79926B7 [......3D..h..&..]
7FC18D899810 5381F6DF 4626F988 0E4F468D D78D3B28 [...S..&F.FO.(;..]
7FC18D899820 000007A1 00000000 0FFFF85F 00000000 [........_.......]
7FC18D899830 00000000 00000000 00720070 006D0069 [........p.r.i.m.]
7FC18D899840 00720061 00000079 00000000 00000000 [a.r.y...........]
7FC18D899850 00000000 00000000 00000000 00000000 [................]
 Repeat 186 times
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][0]

“EFI PART”是分区的元数据,一般是被分区导致asm disk损坏.

[ebernal@dbaasm new2]$ kfed read emcpowerl | head -25
kfbh.endian: 0 ; 0x000: 0x00
kfbh.hard: 0 ; 0x001: 0x00
kfbh.type: 0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt: 0 ; 0x003: 0x00
kfbh.block.blk: 0 ; 0x004: blk=0
kfbh.block.obj: 0 ; 0x008: file=0
kfbh.check: 0 ; 0x00c: 0x00000000
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
kfbh.spare2: 0 ; 0x01c: 0x00000000
2ABD671E9400 00000000 00000000 00000000 00000000 [................]
  Repeat 31 times
2ABD671E9600 4542414C 454E4F4C 00000001 00000000 [LABELONE........]
2ABD671E9610 E4E1DDB1 00000020 324D564C 31303020 [.... ...LVM2 001] <==== **** Here ******
2ABD671E9620 50365A77 71327874 34303156 4B4E6136 [wZ6Ptx2qV1046aNK]
2ABD671E9630 35395159 5147634C 487A5A38 63575A37 [YQ95LcGQ8ZzH7ZWc]
2ABD671E9640 00000000 00000019 00030000 00000000 [................]
2ABD671E9650 00000000 00000000 00000000 00000000 [................]
2ABD671E9660 00000000 00000000 00001000 00000000 [................]
2ABD671E9670 0002F000 00000000 00000000 00000000 [................]
2ABD671E9680 00000000 00000000 00000000 00000000 [................]
  Repeat 215 times
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][0]

“LVM2 001” 是逻辑卷的名字,该asm disk很可能被做为lvm管理而被破坏

[ebernal@dbaasm tars]$ kfed read rhdisk16
kfbh.endian:                         65 ; 0x000: 0x41
kfbh.hard:                           73 ; 0x001: 0x49
kfbh.type:                           88 ; 0x002: *** Unknown Enum ***
kfbh.datfmt:                         32 ; 0x003: 0x20
kfbh.block.blk:              1111709260 ; 0x004: blk=1111709260
kfbh.block.obj:              1634861056 ; 0x008: file=131072
kfbh.check:                         119 ; 0x00c: 0x00000077
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
2B6FE2AC1400 20584941 4243564C 61720000 00000077  [AIX LVCB..raw...] <==== **** Here ******
2B6FE2AC1410 00000000 00000000 00000000 00000000  [................]
2B6FE2AC1420 00000000 00000000 30300000 38306430  [..........000d08]
2B6FE2AC1430 30306131 34643030 30303030 31303030  [1a0000d400000001]
2B6FE2AC1440 61006533 766C6D73 7461645F 00003161  [3e.asmlv_data1..]
2B6FE2AC1450 00000000 00000000 00000000 00000000  [................]
        Repeat 2 times
2B6FE2AC1480 54000000 4D206575 20207961 31312037  [...Tue May  7 11]
2B6FE2AC1490 3A33343A 32203633 0A333130 00000000  [:43:36 2013.....]
2B6FE2AC14A0 65755400 79614D20 20372020 343A3131  [.Tue May  7 11:4]
2B6FE2AC14B0 34323A38 31303220 00000A33 44000000  [8:24 2013......D]
2B6FE2AC14C0 41313830 30303444 6D6D7900 02007900  [081AD400.ymm.y..]
2B6FE2AC14D0 0100E40C 656E6F4E 00000000 00000000  [....None........]
2B6FE2AC14E0 00000000 00000000 00000000 00000000  [................]
        Repeat 14 times
2B6FE2AC15D0 00000000 00000000 65310000 61653934  [..........1e49ea]
2B6FE2AC15E0 342E3862 00000000 00000000 00000000  [b8.4............]
2B6FE2AC15F0 00000000 00000000 00000000 00000000  [................]
  Repeat 224 times
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][88]

这里的“AIX LVCB..raw” 是AIX OS volume 的元数据库,也就是说,asm disk 被作为了aix os层面破坏

[oracle@dbep2 disks]$ kfed read asm-disk3
kfbh.endian: 0 ; 0x000: 0x00
kfbh.hard: 0 ; 0x001: 0x00
kfbh.type: 0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt: 0 ; 0x003: 0x00
kfbh.block.blk: 0 ; 0x004: blk=0
kfbh.block.obj: 0 ; 0x008: file=0
kfbh.check: 0 ; 0x00c: 0x00000000
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
kfbh.spare2: 0 ; 0x01c: 0x00000000
06000000 00000000 00000000 00000000 00000000 [................]
  Repeat 25 times
0602100 51e2b7f6 00ed4e00 00000000 00000001  [...Q.N..........]
0602120 00000000 0000000b 00000100 0000003c [............<...]
0602140 00000242 0000007b 5d8468e7 6147782a [B...{....h.]*xGa]
0602160 d17851a2 327552e2 00000000 00000000 [.Qx..Ru2........]
0602200 00000000 00000000 3130752f 91a4f000 [......../u01....] <==== **** Here ******
0602220 ff8808e4 d5104cff 000000ac 00000100 [.....L..........]
0602240 00000000 00000000 00000000 09d18000 [................]
  Repeat 254 times
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][88]

这里的/u01很可能表明该asm disk被文件系统覆盖

对于asm disk的各种破坏情况,如果是normal/high冗余,那么asm dg没有问题,可以考虑通过删除异常盘,然后重新加入;如果是外部冗余遭遇到asm disk 被破坏,一般asm disk 会dismount,而且无法正常mount,如果有备份的磁盘头,可以尝试还原磁盘头,mount 磁盘组,然后只读方式迁移数据;如果没有备份磁盘头或者还原之后也无法mount,可能需要通过一些额外的方式处理比如通过工具在asm dismount状态下恢复数据文件,甚至通过对asm block/oracle block碎片重组的方式恢复数据.参考相关文章:
oracle asm系列文章汇总
pvid=yes导致asm无法mount
asm disk header 彻底损坏恢复
分区无法识别导致asm diskgroup无法mount
oracle asm disk格式化恢复—格式化为ext4文件系统
oracle asm disk格式化恢复—格式化为ntfs文件系统
asm disk误设置pvid导致asm diskgroup无法mount恢复
分享oracleasm createdisk重新创建asm disk后数据0丢失恢复案例
ORA-15042: ASM disk “N” is missing from group number “M” 故障恢复

自动精简配置(Thin Provisioning)导致asm异常

有朋友在一个存储空间给asm使用,发生空间不足,然后使用另外一个存储中的lun给asm的data磁盘组增加asm disk,运行了大概1天之后,asm磁盘组直接dismount,数据库crash.然后就无法正常mount.包括这个存储上的几个其他磁盘组也无法正常mount.
数据库异常日志

Sun Oct 23 08:43:59 2016
SUCCESS: diskgroup DATA was dismounted
SUCCESS: diskgroup DATA was dismounted
Sun Oct 23 08:44:00 2016
Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_lmon_79128.trc:
ORA-00202: control file: '+DATA/orcl/controlfile/current.278.892363163'
ORA-15078: ASM diskgroup was forcibly dismounted
Sun Oct 23 08:44:00 2016
Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_lgwr_79174.trc:
ORA-00345: redo log write error block 15924 count 2
ORA-00312: online log 2 thread 1: '+DATA/orcl/onlinelog/group_2.274.892363167'
ORA-15078: ASM diskgroup was forcibly dismounted
ORA-15078: ASM diskgroup was forcibly dismounted
Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_lgwr_79174.trc:
ORA-00202: control file: '+DATA/orcl/controlfile/current.278.892363163'
ORA-15078: ASM diskgroup was forcibly dismounted
Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_lgwr_79174.trc:
ORA-00204: error in reading (block 1, # blocks 1) of control file
ORA-00202: control file: '+DATA/orcl/controlfile/current.278.892363163'
ORA-15078: ASM diskgroup was forcibly dismounted
Sun Oct 23 08:44:00 2016
LGWR (ospid: 79174): terminating the instance due to error 204
Sun Oct 23 08:44:00 2016
opiodr aborting process unknown ospid (79742) as a result of ORA-1092
Sun Oct 23 08:44:01 2016
ORA-1092 : opitsk aborting process
Sun Oct 23 08:44:01 2016
ORA-1092 : opitsk aborting process
System state dump requested by (instance=1, osid=79174 (LGWR)), summary=[abnormal instance termination].
System State dumped to trace file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_diag_79118.trc
Instance terminated by LGWR, pid = 79174

很明显,数据库异常是由于asm diskgroup dismount,因此分析asm 日志

asm 日志

Sun Oct 23 07:00:31 2016
Time drift detected. Please check VKTM trace file for more details.
Sun Oct 23 08:43:55 2016
Errors in file /oracle/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_8755.trc:
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 1048576
WARNING: Write Failed. group:1 disk:2 AU:1222738 offset:0 size:1048576
ERROR: failed to copy file +DATA.524, extent 15030
ERROR: ORA-15080 thrown in ARB0 for group number 1
Errors in file /oracle/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_8755.trc:
ORA-15080: synchronous I/O operation to a disk failed
Sun Oct 23 08:43:55 2016
NOTE: stopping process ARB0
NOTE: rebalance interrupted for group 1/0xec689cdd (DATA)
NOTE: ASM did background COD recovery for group 1/0xec689cdd (DATA)
NOTE: starting rebalance of group 1/0xec689cdd (DATA) at power 1
Starting background process ARB0
Sun Oct 23 08:43:56 2016
ARB0 started with pid=24, OS id=103554 
NOTE: assigning ARB0 to group 1/0xec689cdd (DATA) with 1 parallel I/O
Errors in file /oracle/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_103554.trc:
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 1048576
WARNING: Write Failed. group:1 disk:2 AU:1222738 offset:0 size:1048576
ERROR: failed to copy file +DATA.256, extent 6570
ERROR: ORA-15080 thrown in ARB0 for group number 1
Errors in file /oracle/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_103554.trc:
ORA-15080: synchronous I/O operation to a disk failed
NOTE: stopping process ARB0
Sun Oct 23 08:43:58 2016
Errors in file /oracle/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_dbw0_8521.trc:
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 4096
WARNING: Write Failed. group:1 disk:3 AU:6789 offset:24576 size:4096
NOTE: cache initiating offline of disk 3 group DATA
NOTE: process _dbw0_+asm1 (8521) initiating offline of disk 3.3915934787 (DATA_0003) with mask 0x7e in group 1
Sun Oct 23 08:43:58 2016
WARNING: Disk 3 (DATA_0003) in group 1 mode 0x7f is now being offlined
WARNING: Disk 3 (DATA_0003) in group 1 in mode 0x7f is now being taken offline on ASM inst 1
NOTE: initiating PST update: grp = 1, dsk = 3/0xe9686c43, mask = 0x6a, op = clear
GMON updating disk modes for group 1 at 14 for pid 14, osid 8521
ERROR: Disk 3 cannot be offlined, since diskgroup has external redundancy.
ERROR: too many offline disks in PST (grp 1)
Sun Oct 23 08:43:58 2016
NOTE: cache dismounting (not clean) group 1/0xEC689CDD (DATA) 
NOTE: messaging CKPT to quiesce pins Unix process pid: 103577, image: oracle@node1 (B000)
WARNING: Disk 3 (DATA_0003) in group 1 mode 0x7f offline is being aborted
WARNING: Offline of disk 3 (DATA_0003) in group 1 and mode 0x7f failed on ASM inst 1
NOTE: halting all I/Os to diskgroup 1 (DATA)
Sun Oct 23 08:43:59 2016
NOTE: LGWR doing non-clean dismount of group 1 (DATA)
NOTE: LGWR sync ABA=160.10145 last written ABA 160.10145

错误信息很明显,由于Write Failed导致asm diskgroup dismount.

系统日志

Oct 23 08:43:55 node1 kernel: sd 6:0:12:1: [sdd]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 23 08:43:55 node1 kernel: sd 6:0:12:1: [sdd]  Sense Key : Data Protect [current] 
Oct 23 08:43:55 node1 kernel: sd 6:0:12:1: [sdd]  Add. Sense: Space allocation failed write protect
Oct 23 08:43:55 node1 kernel: sd 6:0:12:1: [sdd] CDB: Write(16): 8a 00 00 00 00 02 e7 18 37 f9 00 00 00 07 00 00
Oct 23 08:43:55 node1 kernel: end_request: critical space allocation error, dev sdd, sector 12467058681
Oct 23 08:43:55 node1 kernel: end_request: critical space allocation error, dev dm-3, sector 12467058681
Oct 23 08:43:55 node1 kernel: sd 8:0:6:1: [sdh]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 23 08:43:55 node1 kernel: sd 8:0:6:1: [sdh]  Sense Key : Data Protect [current] 
Oct 23 08:43:55 node1 kernel: sd 8:0:6:1: [sdh]  Add. Sense: Space allocation failed write protect
Oct 23 08:43:55 node1 kernel: sd 8:0:6:1: [sdh] CDB: Write(16): 8a 00 00 00 00 02 e7 18
Oct 23 08:43:55 node1 kernel: sd 6:0:4:1: [sdb]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 23 08:43:55 node1 kernel: sd 6:0:4:1: [sdb]  Sense Key : Data Protect [current] 
Oct 23 08:43:55 node1 kernel: sd 6:0:4:1: [sdb]   33Add. Sense: Space allocation failed write protect
Oct 23 08:43:55 node1 kernel: sd 6:0:4:1: [sdb] CDB: Write(16): 8a 00 00 00 00 02 e7 18 30 00 00 00 03 f9 00 00
Oct 23 08:43:55 node1 kernel: end_request: critical space allocation error, dev sdb, sector 12467056640
Oct 23 08:43:55 node1 kernel: f9 00 00 04 00
Oct 23 08:43:55 node1 kernel: end_request: critical space allocation error, dev dm-3, sector 12467056640
Oct 23 08:43:55 node1 kernel: 00 00
Oct 23 08:43:55 node1 kernel: end_request: critical space allocation error, dev sdh, sector 12467057657
Oct 23 08:43:55 node1 kernel: end_request: critical space allocation error, dev dm-3, sector 12467057657
Oct 23 08:43:57 node1 kernel: sd 8:0:6:1: [sdh]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 23 08:43:57 node1 kernel: sd 8:0:6:1: [sdh]  Sense Key : Data Protect [current] 
Oct 23 08:43:57 node1 kernel: sd 8:0:6:1: [sdh]  Add. Sense: Space allocation failed write protect
Oct 23 08:43:57 node1 kernel: sd 8:0:6:1: [sdh] CDB: Write(16): 8a 00 00 00 00 02 e7 18 37 f9 00 00 00 07 00 00
Oct 23 08:43:57 node1 kernel: end_request: critical space allocation error, dev sdh, sector 12467058681
Oct 23 08:43:57 node1 kernel: end_request: critical space allocation error, dev dm-3, sector 12467058681
Oct 23 08:43:57 node1 kernel: sd 8:0:12:1: [sdj]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 23 08:43:57 node1 kernel: sd 8:0:12:1: [sdj]  Sense Key : Data Protect [current] 
Oct 23 08:43:57 node1 kernel: sd 8:0:12:1: [sdj]  Add. Sense: Space allocation failed write protect
Oct 23 08:43:57 node1 kernel: sd 8:0:12:1: [sdj] CDB: Write(16): 8a 00 00 00 00 02 e7 18 30 00 00 00 03 f9 00 00
Oct 23 08:43:57 node1 kernel: end_request: critical space allocation error, dev sdj, sector 12467056640
Oct 23 08:43:57 node1 kernel: end_request: critical space allocation error, dev dm-3, sector 12467056640
Oct 23 08:43:57 node1 kernel: sd 6:0:4:1: [sdb]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 23 08:43:57 node1 kernel: sd 6:0:4:1: [sdb]  Sense Key : Data Protect [current] 
Oct 23 08:43:57 node1 kernel: sd 6:0:4:1: [sdb]  Add. Sense: Space allocation failed write protect
Oct 23 08:43:57 node1 kernel: sd 6:0:4:1: [sdb] CDB: Write(16): 8a 00 00 00 00 02 e7 18 33 f9 00 00 04 00 00 00
Oct 23 08:43:58 node1 kernel: sd 6:0:4:1: [sdb]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 23 08:43:58 node1 kernel: sd 6:0:4:1: [sdb]  Sense Key : Data Protect [current] 
Oct 23 08:43:58 node1 kernel: sd 6:0:4:1: [sdb]  Add. Sense: Space allocation failed write protect
Oct 23 08:43:58 node1 kernel: sd 6:0:4:1: [sdb] CDB: Write(16): 8a 00 00 00 00 03 3b 7e 78 30 00 00 00 08 00 00
Oct 23 10:50:59 node1 init: oracle-ohasd main process (6150) killed by TERM signal

错误信息为:critical space allocation error,严重空间分配错误.也就是linux在分配空间之时发生错误.在换而言之,由于分配空间错误导致asm 磁盘组dismount.

查看多路径信息

[root@node1 ~]# multipath -ll
36000d31003190c000000000000000003 dm-3 COMPELNT,Compellent Vol
size=80T features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 6:0:9:1  sdd 8:48  active ready running
  `- 8:0:9:1  sdi 8:128 active ready running
delldisk2 (36000d310031908000000000000000003) dm-4 COMPELNT,Compellent Vol
size=8.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 6:0:12:1 sde 8:64  active ready running
  |- 8:0:6:1  sdh 8:112 active ready running
  |- 6:0:4:1  sdb 8:16  active ready running
  `- 8:0:12:1 sdj 8:144 active ready running
delldisk1 (36000d31003190a000000000000000007) dm-2 COMPELNT,Compellent Vol
size=12T features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 6:0:1:1  sda 8:0   active ready running
  |- 8:0:2:1  sdf 8:80  active ready running
  |- 6:0:7:1  sdc 8:32  active ready running
  `- 8:0:3:1  sdg 8:96  active ready running

很明显报错的都是同一个lun(delldisk2),也就是存储空间使用完的存储.也就是说,由于delldisk2存储的空间使用尽了导致系统出现分配空间错误,从而导致asm 写失败,进而导致数据库异常.这种问题的本质其实就是存储给系统分配了8T,但是实际存储可以使用的空间不足8T,而os按照8T来使用从而出现该问题.专业名字叫做”存储精简卷”.因此各位在存储配置之时需要注意该问题.因为这种情况的出现一般只是写io异常,读依旧正常,因此不会丢失数据.