ASMLib failure raises ORA-00600 [kfklLibFetchNext00]

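A friend's historical database failed. It is a single-instance 10.2.0.1 database on ASM, running on Linux 4, with ASMLib managing the ASM disks.
ASM cannot mount its disk groups even though the ASM disks are visible, and the alert log reports ORA-00600 [kfklLibFetchNext00].
Operating system kernel: 2.6.9-78
oracleasmlib: 2.0.2-1
ASM disk group mount failure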

-- Earlier failure
SQL> ALTER DISKGROUP ALL MOUNT 
Thu Sep  6 14:23:16 2012
NOTE: cache registered group DGARC number=1 incarn=0x2bf96274
NOTE: cache registered group DGDATA number=2 incarn=0x2c196275
NOTE: cache registered group DGSYS number=3 incarn=0x2c196276
Thu Sep  6 14:23:16 2012
Errors in file /opt/app/oracle/admin/+ASM/bdump/+asm_rbal_10204.trc:
ORA-15183: ASMLIB initialization error [driver/agent not installed]
Thu Sep  6 14:23:16 2012
Errors in file /opt/app/oracle/admin/+ASM/bdump/+asm_rbal_10204.trc:
ORA-15183: ASMLIB initialization error [/opt/oracle/extapi/64/asm/orcl/1/libasm.so]
ORA-15183: ASMLIB initialization error [driver/agent not installed]
Thu Sep  6 14:23:16 2012
ERROR: no PST quorum in group 1: required 2, found 0
Thu Sep  6 14:23:16 2012
NOTE: cache dismounting group 1/0x2BF96274 (DGARC) 
NOTE: dbwr not being msg'd to dismount
ERROR: diskgroup DGARC was not mounted
Thu Sep  6 14:23:16 2012
ERROR: no PST quorum in group 2: required 2, found 0
Thu Sep  6 14:23:16 2012
NOTE: cache dismounting group 2/0x2C196275 (DGDATA) 
NOTE: dbwr not being msg'd to dismount
ERROR: diskgroup DGDATA was not mounted
Thu Sep  6 14:23:16 2012
ERROR: no PST quorum in group 3: required 2, found 0
Thu Sep  6 14:23:16 2012
NOTE: cache dismounting group 3/0x2C196276 (DGSYS) 
NOTE: dbwr not being msg'd to dismount
ERROR: diskgroup DGSYS was not mounted

-- Current failure
Thu Jan 24 13:49:45 2013
SQL> ALTER DISKGROUP ALL MOUNT 
Thu Jan 24 13:49:45 2013
NOTE: cache registered group DGARC number=1 incarn=0xf388cee9
NOTE: cache registered group DGDATA number=2 incarn=0xf3a8ceea
NOTE: cache registered group DGSYS number=3 incarn=0xf3a8ceeb
Thu Jan 24 13:49:45 2013
Errors in file /opt/app/oracle/admin/+ASM/bdump/+asm_rbal_13449.trc:
ORA-00600: internal error code, arguments: [kfklLibFetchNext00], 
[18446744073709551614], [0], [], [], [], [], []
Thu Jan 24 13:49:46 2013
Errors in file /opt/app/oracle/admin/+ASM/bdump/+asm_rbal_13449.trc:
ORA-00600: internal error code, arguments: [kfklLibFetchNext00], 
[18446744073709551614], [0], [], [], [], [], []
Thu Jan 24 13:49:46 2013
ERROR: no PST quorum in group 1: required 2, found 0
Thu Jan 24 13:49:46 2013
NOTE: cache dismounting group 1/0xF388CEE9 (DGARC) 
NOTE: dbwr not being msg'd to dismount
ERROR: diskgroup DGARC was not mounted
Thu Jan 24 13:49:46 2013
ERROR: no PST quorum in group 2: required 2, found 0
Thu Jan 24 13:49:46 2013
NOTE: cache dismounting group 2/0xF3A8CEEA (DGDATA) 
NOTE: dbwr not being msg'd to dismount
ERROR: diskgroup DGDATA was not mounted
Thu Jan 24 13:49:46 2013
ERROR: no PST quorum in group 3: required 2, found 0
Thu Jan 24 13:49:46 2013
NOTE: cache dismounting group 3/0xF3A8CEEB (DGSYS) 
NOTE: dbwr not being msg'd to dismount
ERROR: diskgroup DGSYS was not mounted
Shutting down instance: further logons disabled

Trace file information

----- Call Stack Trace -----
calling              call     entry                argument values in hex      
location             type     point                (? means dubious value)     
-------------------- -------- -------------------- ----------------------------
ksedst()+31          call     ksedst1()            000000000 ? 000000001 ?
                                                   000000000 ? 000000000 ?
                                                   000000000 ? 000000001 ?
ksedmp()+610         call     ksedst()             000000000 ? 000000001 ?
                                                   000000000 ? 000000000 ?
                                                   000000000 ? 000000001 ?
ksfdmp()+21          call     ksedmp()             000000003 ? 000000001 ?
                                                   000000000 ? 000000000 ?
                                                   000000000 ? 000000001 ?
kgerinv()+161        call     ksfdmp()             000000003 ? 000000001 ?
                                                   000000000 ? 000000000 ?
                                                   000000000 ? 000000001 ?
kgesinv()+33         call     kgerinv()            006469D40 ? 0064E1C58 ?
                                                   000000000 ? 000000000 ?
                                                   000000001 ? 000000001 ?
kgesinw()+166        call     kgesinv()            006469D40 ? 0064E1C58 ?
                                                   000000000 ? 000000000 ?
                                                   000000001 ? 000000001 ?
kfklLibScanNext()+2  call     kgesinw()            006469D40 ? 000000000 ?
39                                                 000000001 ? 000000000 ?
                                                   FFFFFFFFFFFFFFFE ?
                                                   000000000 ?
kfkLibFetchNext()+3  call     kfklLibScanNext()    0064DDD70 ? 7FBFFFDCD0 ?
43                                                 000000001 ? 000000000 ?
                                                   FFFFFFFFFFFFFFFE ?
                                                   000000000 ?
kfuitrnInit()+524    call     kfkLibFetchNext()    006469D40 ? 2A971DFF90 ?
                                                   000000001 ? 000000000 ?
                                                   FFFFFFFFFFFFFFFE ?
                                                   000000000 ?
kfkLibIterInit()+18  call     kfuitrnInit()        006469D40 ? 2A971DFCB0 ?
0                                                  2A971DFF90 ? 000000009 ?
                                                   000000009 ? 000000000 ?
kfkLoadAllLibs()+36  call     kfkLibIterInit()     000000000 ? 00646C7E0 ?
3                                                  2A971DFF90 ? 000000009 ?
                                                   000000009 ? 000000000 ?
kfkDiscoverString()  call     kfkLoadAllLibs()     000000000 ? 00646C7E0 ?
+107                                               2A971DFF90 ? 000000009 ?
                                                   000000009 ? 000000000 ?
Cannot find symbol
Cannot find symbol
Cannot find symbol
kfdDiscoverString()  call     kfkDiscoverString()  067A53768 ? 00646C7E0 ?
+28                                                2A971DFF90 ? 000000009 ?
                                                   000000009 ? 000000000 ?
kfdDiscoverShallow(  call     kfdDiscoverString()  067A53768 ? 000000000 ?
)+315                                              2A971DFF90 ? 000000009 ?
                                                   000000009 ? 000000000 ?
kfgbDriver()+1174    call     kfdDiscoverShallow(  000000180 ? 000000000 ?
                              )                    2A971DFF90 ? 000000009 ?
                                                   000000009 ? 000000000 ?
ksbabs()+564         call     kfgbDriver()         7FBFFFE5C0 ? 000000048 ?
                                                   000000000 ? 000000009 ?
                                                   000000009 ? 000000000 ?
ksbrdp()+727         call     ksbabs()             7FBFFFE5C0 ? 000000048 ?
                                                   000000000 ? 000000009 ?
                                                   000000009 ? 000000000 ?
opirip()+616         call     ksbrdp()             7FBFFFE5C0 ? 000000048 ?
                                                   000000001 ? 06002C770 ?
                                                   000000009 ? 000000000 ?
opidrv()+582         call     opirip()             000000032 ? 000000004 ?
                                                   7FBFFFF6C8 ? 06002C770 ?
                                                   000000009 ? 000000000 ?
sou2o()+114          call     opidrv()             000000032 ? 000000004 ?
                                                   7FBFFFF6C8 ? 06002C770 ?
                                                   000000009 ? 000000000 ?
opimai_real()+317    call     sou2o()              7FBFFFF6A0 ? 000000032 ?
                                                   000000004 ? 7FBFFFF6C8 ?
                                                   000000009 ? 000000000 ?
main()+116           call     opimai_real()        000000003 ? 7FBFFFF730 ?
                                                   000000004 ? 7FBFFFF6C8 ?
                                                   000000009 ? 000000000 ?
<0x3c9fb1c40b>       call     main()               000000003 ? 7FBFFFF730 ?
                                                   000000004 ? 7FBFFFF6C8 ?
                                                   000000009 ? 000000000 ?
 
--------------------- Binary Stack Dump ---------------------

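The customer's database is a historical one that is rarely used. ASM had already failed with ORA-15183 when it was started in 2012. After the machine was rebooted in 2013, starting ASM failed again, this time with ORA-00600 [kfklLibFetchNext00]. The 2012 errors already suggest that the problem is related to ASMLib. A search on MOS turns up note 429945.1, whose Call Stack Trace matches this one exactly, so the failure can be attributed to the issue described there (for a deeper analysis, strace can be used).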
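
The second ORA-00600 argument and the FFFFFFFFFFFFFFFE value in the call stack are the same number written in decimal and in hex. A minimal Python sketch of the conversion follows. Reading the value as a signed 64-bit integer is an assumption made for illustration; neither the trace nor the MOS note says what the value means.

```python
# Decimal argument printed in the alert log next to kfklLibFetchNext00.
ORA600_ARG = 18446744073709551614


def as_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as a signed one (assumed reading)."""
    return value - (1 << 64) if value >= (1 << 63) else value


print(hex(ORA600_ARG))          # 0xfffffffffffffffe, as in the call stack
print(as_signed64(ORA600_ARG))  # -2
```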

ORA-600: [kfklLibFetchNext00], [18446744073709551614], [0] when mounting diskgroup in ASM

Applies to: 
Linux OS - Version: 2.0.1-1 and later   [Release: RHEL4 and later ]
Information in this document applies to any platform.
Linux Kernel - Version: 2.0.1 

Symptoms
 3 RAC db.  
2 nodes are up and functioning except for 1 node - ASM did not come back up after 
the reboot even though all disks show available from asmlib's perspective:
 
Changes
 All that was done with resources were stopped on Node1 and an extra LUN added. 
 A reboot was then performed.

Cause
 The cause of the issue is libasm.o corruption

Ran the following to confirm that disks are ok:
/dev/oracleasm listdisks
/usr/sbin/asmtool -I -l /dev/oracleasm -n /dev/sdg1 -a label
/usr/sbin/oracleasm-discover 'ORCL:*'
dd if=/dev/sdg1 bs=8192 count=1 | od -c
==> output checked out fine
.
kfod asm_diskstring='ORCL:*'
==> this failed on Node1
KFOD-00600: file not found; argument [610][kfklLibFetchNext00] even though libasm.o exists

You might see the following call stack as well

----- Call Stack Trace -----

kfklLibScanNext 
kfkLibFetchNext 
kfuitrnInit 
kfkLibIterInit 
kfkLoadAllLibs 
kfkDiscoverString 
kfdDiscoverString 
kfdDiscoverShallow 
kfgbDriver 

strace showed
 
 Node1-failing
-------
stat("/opt/oracle/extapi/64/asm/orcl/1/libasm.so", {st_mode=S_IFREG|0777, st_size=19344, ...}) = 0
 getdents64(4, /* 0 entries */, 4096) = 0 <<<<
 close(4) = 0
 open("/opt/oracle/product/10.2.0/db_1/rdbms/mesg/kfodus.msb", O_RDONLY) = -1 
 ENOENT (No such file or directory)
 open("/opt/oracle/product/10.2.0/db_1/rdbms/mesg/kfodus.msb", O_RDONLY) = -1 
 ENOENT (No such file or directory)
 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2a9750d000
write(1, "KFOD-00600: file not found; argu"..., 69) = 69

Node2-working
 -----
 stat("/opt/oracle/extapi/64/asm/orcl/1/libasm.so", {st_mode=S_IFREG|0755, st_size=19344, ...}) = 0
 open("/opt/oracle/extapi/64/asm/orcl/1/libasm.so", O_RDONLY) = 4
read(4, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\20\23\0"..., 832) = 832
fstat(4, {st_mode=S_IFREG|0755, st_size=19344, ...}) = 0
mmap(NULL, 1066104, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 4, 0) = 0x2a9750d000

Based on the MOS description, the problem can be pinned on a damaged ASMLib library, libasm.so (written as libasm.o in the note).
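
On the working node the strace shows the library being opened and its first bytes read as "\177ELF". A rough Python check for the same thing is sketched below. The function name is made up for this example, and the check only confirms that the file can be opened and starts with the ELF magic bytes; it cannot prove that the rest of the library is intact.

```python
from pathlib import Path

ELF_MAGIC = b"\x7fELF"  # the same four bytes strace prints as "\177ELF"


def looks_like_elf(path: str) -> bool:
    """Return True if the file can be opened and starts with the ELF magic."""
    try:
        with Path(path).open("rb") as f:
            return f.read(4) == ELF_MAGIC
    except OSError:
        # Missing file, permission problem, and so on.
        return False


if __name__ == "__main__":
    # Library path taken from the alert log and the strace output.
    print(looks_like_elf("/opt/oracle/extapi/64/asm/orcl/1/libasm.so"))
```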

Solution

To implement the solution, reinstall the ASMlib RPM
>rpm -Uvh oracleasmlib-2.0.0-1
 
This replaces the /opt/oracle/extapi/64/asm/orcl/1/libasm.so

How to Get the Contents of an Spfile on ASM when ASM/GRID is down

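In 11g the ASM spfile is stored inside ASM itself. If ASM cannot start, can its spfile contents still be retrieved? Here gpnptool is used to locate the spfile, which also hints at how 11gR2 RAC uses an spfile stored in ASM to start ASM.
ASM spfile contents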

[grid@rac1 ~]$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.3.0 Production on Fri Dec 21 01:41:31 2012

Copyright (c) 1982, 2011, Oracle.  All rights reserved.


Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - Production
With the Real Application Clusters and Automatic Storage Management options

SQL> create pfile='/tmp/pfile' from spfile;

File created.

SQL> !more /tmp/pfile 
+ASM1.__oracle_base='/u01/app/gridbase'#ORACLE_BASE set from in memory value
+ASM2.asm_diskgroups='XIFENFEI'#Manual Mount
+ASM1.asm_diskgroups='XIFENFEI'#Manual Mount
*.asm_diskstring='/dev/oracleasm/disks/*'
*.asm_power_limit=1
*.diagnostic_dest='/u01/app/gridbase'
*.instance_type='asm'
*.large_pool_size=12M
*.remote_login_passwordfile='EXCLUSIVE'

Shut down the cluster (ASM is down)

[root@rac1 ~]# crsctl stop crs 
[root@rac1 ~]# ps -ef|grep pmon
root      8768  6372  0 02:53 pts/1    00:00:00 grep pmon
[root@rac1 ~]# crsctl stat res
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.

Get the ASM disk information with gpnptool

[root@rac1 ~]# gpnptool get -o-
<?xml version="1.0" encoding="UTF-8"?>
<gpnp:GPnP-Profile Version="1.0" xmlns="http://www.grid-pnp.org/2005/11/gpnp-profile" 
xmlns:gpnp="http://www.grid-pnp.org/2005/11/gpnp-profile" 
xmlns:orcl="http://www.oracle.com/gpnp/2005/11/gpnp-profile" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.grid-pnp.org/2005/11/gpnp-profile gpnp-profile.xsd" ProfileSequence="4" 
ClusterUId="885339054e904f1dbfa646b41d7a0edb" ClusterName="rac-cluster" PALocation="">
<gpnp:Network-Profile>
<gpnp:HostNetwork id="gen" HostName="*">
<gpnp:Network id="net1" IP="192.168.1.0" Adapter="eth0" Use="public"/>
<gpnp:Network id="net2" IP="10.10.1.0" Adapter="eth1" Use="cluster_interconnect"/>
</gpnp:HostNetwork>
</gpnp:Network-Profile>
<orcl:CSS-Profile id="css" DiscoveryString="+asm" LeaseDuration="400"/>
-- Key information (ASM disk information)
<orcl:ASM-Profile id="asm" DiscoveryString="/dev/oracleasm/disks/*" 
SPFile="+DATA/rac-cluster/asmparameterfile/registry.253.776955291"/>
<ds:Signature xmlns:ds="http://www.w3.org/2000/09/xmldsig#">
<ds:SignedInfo>
<ds:CanonicalizationMethod Algorithm="http://www.w3.org/2001/10/xml-exc-c14n#"/>
<ds:SignatureMethod Algorithm="http://www.w3.org/2000/09/xmldsig#rsa-sha1"/>
<ds:Reference URI="">
<ds:Transforms><ds:Transform Algorithm="http://www.w3.org/2000/09/xmldsig#enveloped-signature"/>
<ds:Transform Algorithm="http://www.w3.org/2001/10/xml-exc-c14n#"> 
<InclusiveNamespaces xmlns="http://www.w3.org/2001/10/xml-exc-c14n#" PrefixList="gpnp orcl xsi"/>
</ds:Transform></ds:Transforms><ds:DigestMethod Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/>
<ds:DigestValue>T2Q3r+5sER2Rp0VfeqzYh461f2s=</ds:DigestValue>
</ds:Reference>
</ds:SignedInfo>
<ds:SignatureValue>
LwcQEtlsPGfywzdYJrOqiTp4cZNFGB/S9Ts8OCvYOGf/Z8HDT2yN5p2nCuxArUfW+KzaPzPHHihpRVaTcAY31nJ2Rcf2vMqYp4e++shliQXC8mg
1oGxQGifkjZwA4pTTEK5MBmr4FTZnR3VArZjjVfJdsmOMfyH4YeSMU5HPjdA=
</ds:SignatureValue>
</ds:Signature>
</gpnp:GPnP-Profile>
Success.

Error CLSGPNP_NO_DAEMON getting profile.
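
The two attributes of interest can also be pulled out of the profile programmatically. The sketch below uses Python's standard xml.etree.ElementTree; the function name is hypothetical, and it assumes the input holds only the XML document, with the annotation line and the trailing "Success." and error lines removed.

```python
import xml.etree.ElementTree as ET

# Namespace bound to the orcl: prefix in the profile shown above.
ORCL_NS = "http://www.oracle.com/gpnp/2005/11/gpnp-profile"


def asm_profile(profile_xml: str) -> dict:
    """Return DiscoveryString and SPFile from a GPnP profile document."""
    # Parse from bytes so the XML declaration's encoding is honoured.
    root = ET.fromstring(profile_xml.encode("utf-8"))
    node = root.find(f".//{{{ORCL_NS}}}ASM-Profile")
    if node is None:
        raise ValueError("no orcl:ASM-Profile element found")
    return {
        "DiscoveryString": node.get("DiscoveryString"),
        "SPFile": node.get("SPFile"),
    }
```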

Get the ASM spfile information
Use kfed to read the kfdhdb.sp* and ausize fields from the disk headers, which locate the ASM spfile

[root@rac1 ~]# ls /dev/oracleasm/disks/
VOL1  VOL2  VOL3  VOL4
[root@rac1 ~]# kfed dev=/dev/oracleasm/disks/VOL1 op=READ | egrep "kfdhdb.sp|ausize"
kfdhdb.ausize:                  1048576 ; 0x0bc: 0x00100000
kfdhdb.spfile:                       22 ; 0x0f4: 0x00000016
kfdhdb.spfflg:                        1 ; 0x0f8: 0x00000001
[root@rac1 ~]# kfed dev=/dev/oracleasm/disks/VOL2 op=READ | egrep "kfdhdb.sp|ausize"
kfdhdb.ausize:                  1048576 ; 0x0bc: 0x00100000
kfdhdb.spfile:                        0 ; 0x0f4: 0x00000000
kfdhdb.spfflg:                        0 ; 0x0f8: 0x00000000
[root@rac1 ~]# kfed dev=/dev/oracleasm/disks/VOL3 op=READ | egrep "kfdhdb.sp|ausize"
kfdhdb.ausize:                  1048576 ; 0x0bc: 0x00100000
kfdhdb.spfile:                        0 ; 0x0f4: 0x00000000
kfdhdb.spfflg:                        0 ; 0x0f8: 0x00000000
[root@rac1 ~]# kfed dev=/dev/oracleasm/disks/VOL4 op=READ | egrep "kfdhdb.sp|ausize"
kfdhdb.ausize:                  1048576 ; 0x0bc: 0x00100000
kfdhdb.spfile:                        0 ; 0x0f4: 0x00000000
kfdhdb.spfflg:                        0 ; 0x0f8: 0x00000000

This shows that the ASM spfile is on disk VOL1: it starts at AU 22 and takes one AU (1 MB).
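
The values that feed the dd command (skip and bs) come straight from the kfed fields. A small Python sketch of that step follows; the function name is invented for this example, and it assumes kfed prints the fields in the "name: value ; offset: hex" layout shown above.

```python
import re


def spfile_location(kfed_output: str) -> tuple:
    """Return (first_au, byte_offset) of the spfile from kfed header output."""
    fields = {}
    for line in kfed_output.splitlines():
        # Example line: "kfdhdb.spfile:    22 ; 0x0f4: 0x00000016"
        m = re.match(r"\s*kfdhdb\.(\w+):\s+(\d+)\s*;", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    au_size, first_au = fields["ausize"], fields["spfile"]
    if first_au == 0:
        raise ValueError("this disk does not hold the spfile")
    return first_au, first_au * au_size
```

For VOL1 this gives AU 22, which is the skip=22 used with bs=1M in the dd command.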

Get the ASM spfile contents

[root@rac1 ~]# dd if=/dev/oracleasm/disks/VOL1 bs=1M skip=22 count=1 > /tmp/spfile
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 1.47474 seconds, 711 kB/s
[root@rac1 ~]# strings /tmp/spfile
+ASM1.__oracle_base='/u01/app/gridbase'#ORACLE_BASE set from in memory value
+ASM2.asm_diskgroups='XIFENFEI'#Manual Mount
+ASM1.asm_diskgroups='XIFENFEI'#Manual Mount
*.asm_diskstring='/dev/oracleasm/disks/*'
*.asm_power_limit=1
*.diagnostic_dest='/u01/app/gridbase'
*.instance_type='asm'
*.large_pool_size=12M
*.remote_login_passwordfile='EXCLUSIVE'

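The comparison shows that the ASM spfile can be retrieved by other means even when the ASM instance is not running. This experiment only locates the spfile inside ASM (readers can infer from it how 11gR2 RAC uses the spfile in ASM to start ASM). Do not imitate it in a production environment; the gpnptool command carries considerable risk.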

Recovering a damaged ASM disk header from the automatic ASM disk header backup

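Following kamus's article Where is the backup of ASM disk header block, I confirmed that ASM does back up the ASM disk header automatically, starting with 10.2.0.5. For those who do not back up ASM disk headers themselves, this adds a layer of protection and makes ASM safer.
From 10.2.0.5.0 onward, whatever the AU size, the automatic backup of the ASM disk header is stored in the second-to-last block of the second AU.
Calculation: (blocks per AU, that is AU_SIZE/block_size) * 2 - 2, because block numbering starts at 0. With the 4096-byte block size used here, this gives:
1M AU: block 510
2M AU: block 1022
4M AU: block 2046
8M AU: block 4094
16M AU: block 8190
32M AU: block 16382
64M AU: block 32766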
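
The same calculation as a short Python sketch. The function name is made up for illustration, and the 4096-byte default is the block size of the disk group used in this test.

```python
def backup_header_block(au_size: int, block_size: int = 4096) -> int:
    """Block number of the disk header backup: second-to-last block of the second AU."""
    blocks_per_au = au_size // block_size
    # Two AUs worth of blocks, minus 2 because numbering starts at 0
    # and the backup sits one block before the last one.
    return blocks_per_au * 2 - 2


for mb in (1, 2, 4, 8, 16, 32, 64):
    print(f"{mb}M AU -> block {backup_header_block(mb * 1024 * 1024)}")
```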
1. Compare the backup with the ASM disk header

SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - Prod
PL/SQL Release 10.2.0.5.0 - Production
CORE    10.2.0.5.0      Production
TNS for Linux: Version 10.2.0.5.0 - Production
NLSRTL Version 10.2.0.5.0 - Production

SQL> select to_char(sysdate,'yyyy-mm-dd hh24:mi:ss') "xifenfei.com"  from dual;

xifenfei.com
-------------------
2012-06-17 09:41:19

SQL>  select group_number,DISK_NUMBER,PATH,HEADER_STATUS 
   2  from v$asm_disk where group_number<>0;

GROUP_NUMBER DISK_NUMBER PATH            HEADER_STATU
------------ ----------- --------------- ------------
           1           1 /dev/raw/raw2   MEMBER
           1           0 /dev/raw/raw1   MEMBER

SQL> select group_number,name,BLOCK_SIZE,ALLOCATION_UNIT_SIZE from v$asm_diskgroup;

GROUP_NUMBER NAME                           BLOCK_SIZE ALLOCATION_UNIT_SIZE
------------ ------------------------------ ---------- --------------------
           1 DATA                                 4096              1048576

rac1->  kfed read /dev/raw/raw1 blknum=510 >/tmp/xifenfei.510
rac1->  kfed read /dev/raw/raw1 blknum=0 >/tmp/xifenfei.0
rac1-> ll /tmp/xifenfei*
-rw-r--r--  1 oracle oinstall 6606 Jun 14 04:11 /tmp/xifenfei.0
-rw-r--r--  1 oracle oinstall 6606 Jun 14 04:12 /tmp/xifenfei.510
rac1-> diff /tmp/xifenfei.510 /tmp/xifenfei.0
-- The diff returns nothing, which shows that the two blocks have identical contents
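
The same comparison can be done on the raw bytes instead of on kfed text dumps. The Python sketch below is read-only and rests on two assumptions: that the backup is a byte-for-byte copy of block 0 (the empty diff suggests so, but it is not confirmed here), and that the path can be read like an ordinary file, for example a dd image of the disk. The function name is hypothetical.

```python
def header_matches_backup(path: str, au_size: int = 1048576, block_size: int = 4096) -> bool:
    """Compare block 0 with the backup block (second-to-last block of the second AU)."""
    backup_block = au_size // block_size * 2 - 2
    with open(path, "rb") as dev:
        header = dev.read(block_size)
        dev.seek(backup_block * block_size)
        backup = dev.read(block_size)
    # A short read means the file is too small to hold both blocks.
    return len(header) == block_size and header == backup
```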

2. Try to corrupt the ASM disk header

rac1-> dd if=/dev/zero of=/dev/raw/raw1 bs=4096 count=1
1+0 records in
1+0 records out
rac1->  kfed read /dev/raw/raw1 blknum=0
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                            0 ; 0x001: 0x00
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt:                          0 ; 0x003: 0x00
kfbh.block.blk:                       0 ; 0x004: T=0 NUMB=0x0
kfbh.block.obj:                       0 ; 0x008: TYPE=0x0 NUMB=0x0
kfbh.check:                           0 ; 0x00c: 0x00000000
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000

SQL> select group_number,DISK_NUMBER,PATH,HEADER_STATUS 
   2 from v$asm_disk where group_number<>0;

GROUP_NUMBER DISK_NUMBER PATH            HEADER_STATU
------------ ----------- --------------- ------------
           1           1 /dev/raw/raw2   MEMBER
           1           0 /dev/raw/raw1   CANDIDATE

SQL> alter diskgroup  data dismount;

Diskgroup altered.

SQL> alter diskgroup  data mount;
alter diskgroup  data mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "DATA"

3. Repair the damaged ASM disk header with kfed repair

rac1-> kfed  repair '/dev/raw/raw1'
rac1->  kfed read /dev/raw/raw1 blknum=0
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfbh.datfmt:                          1 ; 0x003: 0x01
kfbh.block.blk:                       0 ; 0x004: T=0 NUMB=0x0
kfbh.block.obj:              2147483648 ; 0x008: TYPE=0x8 NUMB=0x0
kfbh.check:                   883602253 ; 0x00c: 0x34aab34d
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
…………

SQL> alter diskgroup  data mount;

Diskgroup altered.

4. Restore the ASM disk header with kfed merge

rac1-> dd if=/dev/zero of=/dev/raw/raw1 bs=4096 count=1
1+0 records in
1+0 records out
rac1->  kfed read /dev/raw/raw1 blknum=0
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                            0 ; 0x001: 0x00
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt:                          0 ; 0x003: 0x00
kfbh.block.blk:                       0 ; 0x004: T=0 NUMB=0x0
kfbh.block.obj:                       0 ; 0x008: TYPE=0x0 NUMB=0x0
kfbh.check:                           0 ; 0x00c: 0x00000000
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000

SQL> alter diskgroup  data dismount;

Diskgroup altered.

SQL> alter diskgroup  data mount;
alter diskgroup  data mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "DATA"

rac1->  kfed merge /dev/raw/raw1 /tmp/xifenfei.510

SQL> alter diskgroup  data mount;

Diskgroup altered.

The test shows that in 10.2.0.5 and later, the ASM disk header can be restored from its backup with kfed repair or kfed merge.

Backing up ASM metadata with md_backup and md_restore

11g ASM adds the md_backup and md_restore commands, which back up and restore ASM metadata.
Current disk group information

SQL> select PATH,b.NAME from v$asm_disk a,v$asm_diskgroup b where a.GROUP_NUMBER=b.GROUP_NUMBER;

PATH                                     NAME
---------------------------------------- ----------
/dev/oracleasm/disks/VOL2                DATA
/dev/oracleasm/disks/VOL1                DATA
/dev/oracleasm/disks/VOL4                XIFENFEI
/dev/oracleasm/disks/VOL3                XIFENFEI

md_backup operations

-- Back up all mounted disk groups
ASMCMD> md_backup /tmp/xifenfei.md    
Disk group metadata to be backed up: DATA
Disk group metadata to be backed up: XIFENFEI
Current alias directory path: XFF/ARCHIVELOG
Current alias directory path: XFF/ARCHIVELOG/2012_04_30
Current alias directory path: XFF/ONLINELOG
Current alias directory path: rac-cluster/OCRFILE
Current alias directory path: XFF/ARCHIVELOG/2012_05_01
Current alias directory path: XFF/CONTROLFILE
Current alias directory path: XFF/ARCHIVELOG/2012_04_13
Current alias directory path: rac-cluster/ASMPARAMETERFILE
Current alias directory path: rac-cluster
Current alias directory path: XFF
Current alias directory path: XFF/ARCHIVELOG/2012_03_03
Current alias directory path: XFF/PARAMETERFILE
Current alias directory path: XFF/DATAFILE
Current alias directory path: ASM/DATAFILE
Current alias directory path: XFF/CONTROLFILE
Current alias directory path: XFF
Current alias directory path: XFF/ONLINELOG
Current alias directory path: XFF/TEMPFILE
Current alias directory path: ASM

-- Back up a specific disk group
ASMCMD> md_backup /tmp/xifenfei_data.md -G DATA  
Disk group metadata to be backed up: DATA
Current alias directory path: XFF/ARCHIVELOG/2012_03_03
Current alias directory path: XFF/CONTROLFILE
Current alias directory path: XFF/ARCHIVELOG/2012_05_01
Current alias directory path: XFF/ARCHIVELOG
Current alias directory path: rac-cluster/OCRFILE
Current alias directory path: XFF/ARCHIVELOG/2012_05_24
Current alias directory path: XFF/ONLINELOG
Current alias directory path: XFF/ARCHIVELOG/2012_04_30
Current alias directory path: rac-cluster/ASMPARAMETERFILE
Current alias directory path: rac-cluster
Current alias directory path: XFF
Current alias directory path: XFF/ARCHIVELOG/2012_04_13

md_restore operations

-- Generate the SQL file (not executed)
ASMCMD> md_restore -S  /tmp/get_dg_sql -G data /tmp/xifenfei_data.md
Current Diskgroup metadata being restored: DATA

Corrupt one of the ASM disks in the XIFENFEI disk group (/dev/oracleasm/disks/VOL3)
[root@rac1 tmp]#  dd if=/dev/zero of=/dev/sdb1 bs=4096 count=1
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 4.8629e-05 seconds, 84.2 MB/s

-- Try to restore the disk group (the unit of operation is the disk group)
ASMCMD>  md_restore /tmp/xifenfei.md --silent -G xifenfei
Current Diskgroup metadata being restored: XIFENFEI
ASMCMD-9352: CREATE DISKGROUP failed
ORA-15018: diskgroup cannot be created
ORA-15033: disk /dev/oracleasm/disks/VOL4 belongs to diskgroup "XIFENFEI" (DBD ERROR: OCIStmtExecute)
-- If only one ASM disk in a disk group has a problem, this method does not work; you would even have to wipe every ASM disk in that disk group with dd first

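Summary
md_backup and md_restore back up and restore at the disk group level. If a single ASM disk in a disk group fails, solving it with this pair of commands is very cumbersome or outright impractical, because the cost is too high: the headers of the other ASM disks in the disk group must be wiped and all data files restored afterwards. In that scenario a dd or kfed backup is still necessary (see ASM DISK HEADER backup and recovery). If an entire disk group is damaged and has to be restored, these commands are ideal, or at least far more convenient than dd and kfed. md_backup/md_restore complement dd and kfed; their arrival does not make dd and kfed useless for backing up and recovering ASM metadata.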

Copying files out of ASM over ftp/http

1. Check the Oracle XML Database component

SQL> select comp_name, status, version from DBA_REGISTRY where comp_name='Oracle XML Database';

COMP_NAME                 STATUS                 VERSION
------------------------- ---------------------- ------------------------------
Oracle XML Database       VALID                  11.2.0.3.0

SQL> select count(*) from dba_objects where owner='XDB' and status='INVALID';

  COUNT(*)
----------
         0

2. Configure ftp and http for XDB

[oracle@rac1 ~]$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.3.0 Production on Tue May 1 12:05:27 2012

Copyright (c) 1982, 2011, Oracle.  All rights reserved.


Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options

SQL> execute dbms_xdb.sethttpport(8080);

PL/SQL procedure successfully completed.

SQL> execute dbms_xdb.setftpport(2100);

PL/SQL procedure successfully completed.

SQL> commit;

Commit complete.

SQL> select dbms_xdb.GETFTPPORT() from dual;

DBMS_XDB.GETFTPPORT()
---------------------
                 2100

SQL> select dbms_xdb.GETHTTPPORT() from dual;

DBMS_XDB.GETHTTPPORT()
----------------------
                  8080
-- Depending on your needs, configuring just one of the two is enough

SQL> show parameter dispatchers;

NAME                                 TYPE        VALUE
------------------------------------ ----------- -------------------------------
dispatchers                          string      (PROTOCOL=TCP) (SERVICE=XFFXDB)
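-- The dispatchers parameter is configured automatically. Note that the sidxdb mentioned in MOS is not accurate; here the service is the db_name followed by XDB
-- This is RAC and the sid is XFF1; in short, trust the automatic configuration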

3. Check the listener

[oracle@rac1 ~]$ lsnrctl status

LSNRCTL for Linux: Version 11.2.0.3.0 - Production on 01-MAY-2012 12:09:14

Copyright (c) 1991, 2011, Oracle.  All rights reserved.

Connecting to (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521))
STATUS of the LISTENER
------------------------
Alias                     LISTENER
Version                   TNSLSNR for Linux: Version 11.2.0.3.0 - Production
Start Date                01-MAY-2012 11:51:13
Uptime                    0 days 0 hr. 18 min. 1 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /u01/app/11.2.0/grid/network/admin/listener.ora
Listener Log File         /u01/app/gridbase/diag/tnslsnr/rac1/listener/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.1.31)(PORT=1521)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.1.33)(PORT=1521)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=rac1)(PORT=8080))(Presentation=HTTP)(Session=RAW))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=rac1)(PORT=2100))(Presentation=FTP)(Session=RAW))
Services Summary...
Service "+ASM" has 1 instance(s).
  Instance "+ASM1", status READY, has 1 handler(s) for this service...
Service "XFF" has 1 instance(s).
  Instance "XFF1", status READY, has 1 handler(s) for this service...
Service "XFFXDB" has 1 instance(s).
  Instance "XFF1", status READY, has 1 handler(s) for this service...
The command completed successfully

-- The following two endpoints are added automatically. If they are not, add them manually and restart or reload the listener
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=rac1)(PORT=8080))(Presentation=HTTP)(Session=RAW))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=rac1)(PORT=2100))(Presentation=FTP)(Session=RAW))
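
Whether both endpoints are registered can be checked by scanning the lsnrctl status output. The Python sketch below is one way to do that; the function name is invented for this example, and the regular expression assumes the endpoint layout shown in the output above.

```python
import re

# Matches "(HOST=rac1)(PORT=8080))(Presentation=HTTP)" inside an endpoint line.
ENDPOINT = re.compile(r"\(HOST=([^)]*)\)\(PORT=(\d+)\)\)\(Presentation=(\w+)\)")


def xdb_endpoints(lsnrctl_output: str) -> dict:
    """Map each presentation name (HTTP, FTP) to its (host, port) pair."""
    return {
        m.group(3): (m.group(1), int(m.group(2)))
        for m in ENDPOINT.finditer(lsnrctl_output)
    }
```

If either "HTTP" or "FTP" is missing from the result, the corresponding endpoint has not been registered with the listener.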

4. Basic ftp operations

[oracle@rac1 ~]$ ftp -n
ftp> open rac1 2100
Connected to rac1.
220- rac1 
Unauthorised use of this FTP server is prohibited and may be subject to civil and criminal prosecution.
220 rac1 FTP Server (Oracle XML DB/Oracle Database) ready.
530  Please login with USER and PASS.
530  Please login with USER and PASS.
KERBEROS_V4 rejected as an authentication type
ftp> user system xifenfei
331 pass required for SYSTEM
230 SYSTEM logged in
ftp> ls
227 Entering Passive Mode (192,168,1,31,181,5)
150 ASCII Data Connection
drw-r--r--   2 SYS      oracle         0 SEP 18 17:49 OLAP_XDS
drw-r--r--   2 SYS      oracle         0 SEP 18 17:47 home
drw-r--r--   2 SYS      oracle         0 SEP 18 18:02 images
drw-r--r--   2 SYS      oracle         0 SEP 18 17:49 olap_data_security
drw-r--r--   2 SYS      oracle         0 SEP 18 17:43 public
drw-r--r--   2 SYS      oracle         0 SEP 18 17:44 sys
-rw-r--r--   1 SYS      oracle         0 MAY 01 04:06 xdbconfig.xml
drw-r--r--   2 SYS      oracle         0 SEP 18 17:49 xds
226 ASCII Transfer Complete
ftp> cd sys
250 CWD Command successful
ftp> cd asm
250 CWD Command successful
ftp> ls
227 Entering Passive Mode (192,168,1,31,98,133)
150 ASCII Data Connection
drw-r--r--   2 SYS      oracle         0 MAY 01 04:14 XIFENFEI
drw-r--r--   2 SYS      oracle         0 MAY 01 04:14 DATA
226 ASCII Transfer Complete
ftp> cd xifenfei
250 CWD Command successful
ftp> ls
227 Entering Passive Mode (192,168,1,31,151,70)
150 ASCII Data Connection
drw-r--r--   2 SYS      oracle         0 MAY 01 04:15 XFF
drw-r--r--   2 SYS      oracle         0 MAY 01 04:15 ASM
226 ASCII Transfer Complete
ftp> cd xff
250 CWD Command successful
ftp> ls
227 Entering Passive Mode (192,168,1,31,100,14)
150 ASCII Data Connection
drw-r--r--   2 SYS      oracle         0 MAY 01 04:15 DATAFILE
drw-r--r--   2 SYS      oracle         0 MAY 01 04:15 CONTROLFILE
drw-r--r--   2 SYS      oracle         0 MAY 01 04:15 ONLINELOG
drw-r--r--   2 SYS      oracle         0 MAY 01 04:15 TEMPFILE
drw-r--r--   2 SYS      oracle         0 MAY 01 04:15 PARAMETERFILE
-rw-r--r--   1 SYS      oracle      3584 MAY 01 04:15 spfileXFF.ora
226 ASCII Transfer Complete
ftp> cd xff/datafile
250 CWD Command successful
ftp> ls
227 Entering Passive Mode (192,168,1,31,30,63)
150 ASCII Data Connection
-rw-r--r--   1 SYS      oracle  744497152 MAY 01 04:20 SYSTEM.256.776961315
-rw-r--r--   1 SYS      oracle  618668032 MAY 01 04:20 SYSAUX.257.776961315
-rw-r--r--   1 SYS      oracle  83894272 MAY 01 04:20 UNDOTBS1.258.776961317
-rw-r--r--   1 SYS      oracle   6291456 MAY 01 04:20 user_dd.dbf
-rw-r--r--   1 SYS      oracle  26222592 MAY 01 04:20 UNDOTBS2.264.776961693
-rw-r--r--   1 SYS      oracle  157294592 MAY 01 04:20 xifenfei01.dbf
226 ASCII Transfer Complete
ftp> get xifenfei01.dbf
local: xifenfei01.dbf remote: xifenfei01.dbf
227 Entering Passive Mode (192,168,1,31,143,34)
150 ASCII Data Connection
550- Error Response 
ORA-31198: Mismatch in number of bytes transferred due to non-binary mode
550 End Error Response 
270340 bytes received in 0.053 seconds (5e+03 Kbytes/s)
ftp> binary
200  Type set to I.
ftp> get xifenfei01.dbf
local: xifenfei01.dbf remote: xifenfei01.dbf
227 Entering Passive Mode (192,168,1,31,9,112)
150 BIN Data Connection
226 BIN Transfer Complete
157294592 bytes received in 14 seconds (1.1e+04 Kbytes/s)
-- The key point is to switch to binary transfer mode; the default is ASCII, which can fail as shown above
ftp> quit
221 QUIT Goodbye.
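
The same download can be scripted with Python's standard ftplib, whose retrbinary call switches the transfer to binary mode by itself, so the ASCII-mode error cannot occur. This is only a sketch: the function name and the ftp_factory parameter are made up for illustration (the parameter exists so the function can be exercised without a server), and it has not been run against an XDB FTP server.

```python
from ftplib import FTP


def fetch_asm_file(host, port, user, password, remote_dir, name, local_path,
                   ftp_factory=FTP):
    """Download one file from the XDB FTP service in binary mode."""
    ftp = ftp_factory()
    ftp.connect(host, port)
    try:
        ftp.login(user, password)
        ftp.cwd(remote_dir)
        with open(local_path, "wb") as out:
            # retrbinary issues TYPE I before RETR, i.e. binary transfer.
            ftp.retrbinary(f"RETR {name}", out.write)
    finally:
        ftp.close()
```

A call mirroring the session above might look like fetch_asm_file("rac1", 2100, "system", "xifenfei", "/sys/asm/xifenfei/xff/datafile", "xifenfei01.dbf", "/tmp/xifenfei01.dbf"), where the remote directory is pieced together from the cd commands in the transcript and is an assumption.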

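This article mainly follows How to configure XDB for using ftp and http protocols with ASM [ID 357714.1], but during my own test I found a few points in the note that do not quite fit, so here are some additions:
1. No ftp server needs to be started (only the operating system's ftp client is used here; with another client tool, even the OS client is unnecessary)
2. SERVICE=XDB in dispatchers is not accurate; the XDB service configured automatically by the system works fine, so trust the database's automatic configuration
3. Of the ftp and http features, only one needs to be configured (pick the one you need)
4. Access over http works as well (logged in as the system user)
[Figure: http_asm, result of access over http]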