一次 CRS-1013: ASM 磁盘组中的 OCR 位置不可访问 故障分析

联系:手机/微信(+86 17813235971) QQ(107644445)QQ咨询惜分飞

标题:一次 CRS-1013: ASM 磁盘组中的 OCR 位置不可访问 故障分析

作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]

有朋友告知我集群突然异常,让我给看看什么原因,集群alert日志

2021-03-14 21:02:15.517 [OHASD(31771)]CRS-8500: Oracle Clusterware OHASD 进程以操作系统进程 ID 31771 开头
2021-03-14 21:02:15.561 [OHASD(31771)]CRS-0714: Oracle Clusterware 发行版 12.1.0.2.0。
2021-03-14 21:02:15.619 [OHASD(31771)]CRS-2112: 已在节点 rac1 上启动 OLR 服务。
2021-03-14 21:02:15.791 [OHASD(31771)]CRS-1301: 已在节点 rac1 上启动 Oracle 高可用性服务。
2021-03-14 21:02:15.910 [OHASD(31771)]CRS-8017: 位置:/etc/oracle/lastgasp具有2个重新启动指导日志文件,0个已发布,0个出现错误
2021-03-14 21:02:16.789 [CSSDAGENT(32015)]CRS-8500: Oracle Clusterware CSSDAGENT 进程以操作系统进程 ID 32015 开头
2021-03-14 21:02:16.868 [CSSDMONITOR(32017)]CRS-8500: Oracle Clusterware CSSDMONITOR 进程以操作系统进程 ID 32017 开头
2021-03-14 21:02:17.751 [ORAROOTAGENT(32008)]CRS-8500: Oracle Clusterware ORAROOTAGENT 进程以操作系统进程 ID 32008 开头
2021-03-14 21:02:17.916 [ORAAGENT(32012)]CRS-8500: Oracle Clusterware ORAAGENT 进程以操作系统进程 ID 32012 开头
2021-03-14 21:02:18.604 [ORAAGENT(32012)]CRS-5011: 检查资源 "ora.asm" 失败: 详细资料见 "(:CLSN00006:)" 
(位于 "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc")
2021-03-14 21:02:18.969 [ORAAGENT(32117)]CRS-8500: Oracle Clusterware ORAAGENT 进程以操作系统进程 ID 32117 开头
2021-03-14 21:02:19.050 [MDNSD(32130)]CRS-8500: Oracle Clusterware MDNSD 进程以操作系统进程 ID 32130 开头
2021-03-14 21:02:19.117 [EVMD(32132)]CRS-8500: Oracle Clusterware EVMD 进程以操作系统进程 ID 32132 开头
2021-03-14 21:02:20.078 [GPNPD(32151)]CRS-8500: Oracle Clusterware GPNPD 进程以操作系统进程 ID 32151 开头
2021-03-14 21:02:21.145 [GIPCD(32172)]CRS-8500: Oracle Clusterware GIPCD 进程以操作系统进程 ID 32172 开头
2021-03-14 21:02:21.163 [GPNPD(32151)]CRS-2328: 已在节点 rac1 上启动 GPNPD。
2021-03-14 21:02:22.172 [ORAROOTAGENT(32181)]CRS-8500: Oracle Clusterware ORAROOTAGENT 进程以操作系统进程 ID 32181 开头
2021-03-14 21:02:22.339 [CLSECHO(32204)]CRS-10001: 14-Mar-21 21:02 ACFS-9391: 正在检查现有 ADVM/ACFS 安装。
2021-03-14 21:02:22.580 [CLSECHO(32209)]CRS-10001: 14-Mar-21 21:02 ACFS-9392: 正在验证操作系统的 ADVM/ACFS 安装文件。
2021-03-14 21:02:22.598 [CLSECHO(32211)]CRS-10001: 14-Mar-21 21:02 ACFS-9393: 正在验证 ASM 管理员设置。
2021-03-14 21:02:22.646 [CLSECHO(32216)]CRS-10001: 14-Mar-21 21:02 ACFS-9308: 正在加载已安装的 ADVM/ACFS 驱动程序。
2021-03-14 21:02:22.678 [CLSECHO(32219)]CRS-10001: 14-Mar-21 21:02 ACFS-9154: 正在加载 'oracleoks.ko' 驱动程序。
2021-03-14 21:02:22.809 [CLSECHO(32234)]CRS-10001: 14-Mar-21 21:02 ACFS-9154: 正在加载 'oracleadvm.ko' 驱动程序。
2021-03-14 21:02:22.892 [CLSECHO(32290)]CRS-10001: 14-Mar-21 21:02 ACFS-9154: 正在加载 'oracleacfs.ko' 驱动程序。
2021-03-14 21:02:23.054 [CLSECHO(32334)]CRS-10001: 14-Mar-21 21:02 ACFS-9327: 正在验证 ADVM/ACFS 设备。
2021-03-14 21:02:23.079 [CLSECHO(32336)]CRS-10001: 14-Mar-21 21:02 ACFS-9156: 正在检测控制设备 '/dev/asm/.asm_ctl_spec'。
2021-03-14 21:02:23.108 [CLSECHO(32340)]CRS-10001: 14-Mar-21 21:02 ACFS-9156: 正在检测控制设备 '/dev/ofsctl'。
2021-03-14 21:02:23.263 [CLSECHO(32346)]CRS-10001: 14-Mar-21 21:02 ACFS-9322: 已完成
2021-03-14 21:02:28.571 [CSSDMONITOR(32409)]CRS-8500: Oracle Clusterware CSSDMONITOR 进程以操作系统进程 ID 32409 开头
2021-03-14 21:02:28.756 [CSSDAGENT(32425)]CRS-8500: Oracle Clusterware CSSDAGENT 进程以操作系统进程 ID 32425 开头
2021-03-14 21:02:28.975 [OCSSD(32436)]CRS-8500: Oracle Clusterware OCSSD 进程以操作系统进程 ID 32436 开头
2021-03-14 21:02:30.072 [OCSSD(32436)]CRS-1713: CSSD 守护程序已在 hub 模式下启动
2021-03-14 21:02:46.185 [OCSSD(32436)]CRS-1707: 节点 rac1 (编号为 1) 的租约获取已完成
2021-03-14 21:02:47.337 [OCSSD(32436)]CRS-1605: CSSD 表决文件联机: ORCL:OCR3; 详细资料见
 /u01/app/gridbase/diag/crs/rac1/crs/trace/ocssd.trc。
2021-03-14 21:02:47.357 [OCSSD(32436)]CRS-1605: CSSD 表决文件联机: ORCL:OCR2; 详细资料见 
/u01/app/gridbase/diag/crs/rac1/crs/trace/ocssd.trc。
2021-03-14 21:02:47.365 [OCSSD(32436)]CRS-1605: CSSD 表决文件联机: ORCL:OCR1; 详细资料见 
/u01/app/gridbase/diag/crs/rac1/crs/trace/ocssd.trc。
2021-03-14 21:02:48.781 [OCSSD(32436)]CRS-1601: CSSD 重新配置完毕。活动节点为 rac1 rac2 。
2021-03-14 21:02:50.971 [OCTSSD(32591)]CRS-8500: Oracle Clusterware OCTSSD 进程以操作系统进程 ID 32591 开头
2021-03-14 21:02:51.938 [OCTSSD(32591)]CRS-2403: 主机 rac1 上的集群时间同步服务处于观察程序模式。
2021-03-14 21:02:52.140 [OCTSSD(32591)]CRS-2407: 新的集群时间同步服务引用节点为主机 rac2。
2021-03-14 21:02:52.140 [OCTSSD(32591)]CRS-2401: 已在主机 rac1 上启动了集群时间同步服务。
2021-03-14 21:02:52.167 [OCTSSD(32591)]CRS-2409: 主机 rac1 上的时钟与集群标准时间不同步。
由于集群时间同步服务正在以观察程序模式运行, 所以未采取任何操作。
2021-03-14 21:02:59.284 [ORAAGENT(32117)]CRS-5011: 检查资源 "ora.asm" 失败: 详细资料见 "(:CLSN00006:)" (
位于 "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc")
2021-03-14 21:03:01.486 [ORAAGENT(32117)]CRS-5011: 检查资源 "ora.asm" 失败: 详细资料见 "(:CLSN00006:)" (
位于 "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc")
2021-03-14 21:03:01.514 [ORAAGENT(32117)]CRS-5011: 检查资源 "ora.asm" 失败: 详细资料见 "(:CLSN00006:)" (
位于 "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc")
2021-03-14 21:03:18.163 [OCTSSD(32591)]CRS-2407: 新的集群时间同步服务引用节点为主机 rac1。
2021-03-14 21:03:19.406 [OCSSD(32436)]CRS-1625: 节点 rac2 (编号为 2) 已关闭
2021-03-14 21:03:19.419 [OCSSD(32436)]CRS-1601: CSSD 重新配置完毕。活动节点为 rac1 。
2021-03-14 21:03:24.916 [OSYSMOND(318)]CRS-8500: Oracle Clusterware OSYSMOND 进程以操作系统进程 ID 318 开头
2021-03-14 21:03:26.558 [CRSD(325)]CRS-8500: Oracle Clusterware CRSD 进程以操作系统进程 ID 325 开头
2021-03-14 21:03:27.750 [CRSD(325)]CRS-1012: 已在节点 rac1 上启动 OCR 服务。
2021-03-14 21:03:27.807 [CRSD(325)]CRS-1201: 已在节点 rac1 上启动 CRSD。
2021-03-14 21:03:28.470 [ORAAGENT(1027)]CRS-8500: Oracle Clusterware ORAAGENT 进程以操作系统进程 ID 1027 开头
2021-03-14 21:03:28.499 [ORAROOTAGENT(1031)]CRS-8500: Oracle Clusterware ORAROOTAGENT 进程以操作系统进程 ID 1031 开头
2021-03-14 21:03:28.515 [ORAAGENT(1036)]CRS-8500: Oracle Clusterware ORAAGENT 进程以操作系统进程 ID 1036 开头
2021-03-14 21:03:28.666 [ORAAGENT(1036)]CRS-5011: 检查资源 "oracledb" 失败: 详细资料见 "(:CLSN00007:)"
 (位于 "/u01/app/gridbase/diag/crs/rac1/crs/trace/crsd_oraagent_oracle.trc")
2021-03-14 21:03:30.649 [ORAAGENT(32117)]CRS-5011: 检查资源 "ora.asm" 失败: 详细资料见 "(:CLSN00006:)" 
(位于 "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc")
2021-03-14 21:03:30.718 [ORAAGENT(32117)]CRS-5011: 检查资源 "ora.asm" 失败: 详细资料见 "(:CLSN00006:)" 
(位于 "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc")
2021-03-14 21:03:30.722 [CRSD(325)]CRS-1024: 由于此节点上的 ASM 实例未处于活动状态, 此节点上的集群就绪服务终止。详细信息见 
(:PROCR00009:) (位于 /u01/app/gridbase/diag/crs/rac1/crs/trace/crsd.trc)。
2021-03-14 21:03:30.736 [ORAROOTAGENT(1031)]CRS-5822: 代理 '/u01/app/grid/12.1.0/bin/orarootagent_root' 已从服务器断开连接。
详细资料见 (:CRSAGF00117:) {0:3:3} (位于 /u01/app/gridbase/diag/crs/rac1/crs/trace/crsd_orarootagent_root.trc)。
2021-03-14 21:03:30.736 [ORAAGENT(1027)]CRS-5822: 代理 '/u01/app/grid/12.1.0/bin/oraagent_grid' 已从服务器断开连接。
详细资料见 (:CRSAGF00117:) {0:1:3} (位于 /u01/app/gridbase/diag/crs/rac1/crs/trace/crsd_oraagent_grid.trc)。
2021-03-14 21:03:30.793 [CRSD(1157)]CRS-8500: Oracle Clusterware CRSD 进程以操作系统进程 ID 1157 开头
2021-03-14 21:03:31.457 [OLOGGERD(1162)]CRS-8500: Oracle Clusterware OLOGGERD 进程以操作系统进程 ID 1162 开头
2021-03-14 21:03:31.798 [ORAAGENT(32117)]CRS-5011: 检查资源 "ora.asm" 失败: 详细资料见 "(:CLSN00006:)" 
(位于 "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc")
2021-03-14 21:03:31.823 [ORAAGENT(32117)]CRS-5011: 检查资源 "ora.asm" 失败: 详细资料见 "(:CLSN00006:)" 
(位于 "/u01/app/gridbase/diag/crs/rac1/crs/trace/ohasd_oraagent_grid.trc")
2021-03-14 21:03:40.234 [CRSD(1157)]CRS-1013: ASM 磁盘组中的 OCR 位置不可访问。
详细资料见 /u01/app/gridbase/diag/crs/rac1/crs/trace/crsd.trc。
2021-03-14 21:03:40.238 [CRSD(1157)]CRS-0804: 由于 Oracle 集群注册表错误 [PROC-26: 访问物理存储时出错
ORA-15077: 找不到提供所需磁盘组的 ASM 实例
], 集群就绪服务中止。详细资料见 (:CRSD00111:) (位于 /u01/app/gridbase/diag/crs/rac1/crs/trace/crsd.trc)。

从整个集群的启动过程看cssd,crs都起来了,然后等一会由于crs无法访问ocr磁盘组,导致异常.开始crs起来了,证明ocr磁盘组应该是mount成功过.后面看错误提示又无法访问了.根据经验以及ora.asm失败的提示,怀疑很可能是asm实例出现问题了.对于这样的情况,分析asm的alert日志是最好的方法.通过分析日志发现

Sun Mar 14 21:03:24 2021
NOTE: Instance updated compatible.asm to 12.1.0.0.0 for grp 1
Sun Mar 14 21:03:24 2021
SUCCESS: diskgroup ARCHLOG was mounted
Sun Mar 14 21:03:24 2021
NOTE: Instance updated compatible.asm to 12.1.0.0.0 for grp 2
Sun Mar 14 21:03:24 2021
SUCCESS: diskgroup DATA was mounted
Sun Mar 14 21:03:24 2021
NOTE: Instance updated compatible.asm to 12.1.0.0.0 for grp 3
Sun Mar 14 21:03:24 2021
SUCCESS: diskgroup OCR was mounted
Sun Mar 14 21:03:24 2021
NOTE: Instance updated compatible.asm to 12.1.0.0.0 for grp 5
Sun Mar 14 21:03:24 2021
SUCCESS: ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:9:3} */
Sun Mar 14 21:03:24 2021
WARNING: failed to online diskgroup resource ora.ARCHLOG.dg (unable to communicate with CRSD/OHASD)
WARNING: failed to online diskgroup resource ora.DATA.dg (unable to communicate with CRSD/OHASD)
WARNING: failed to online diskgroup resource ora.OCR.dg (unable to communicate with CRSD/OHASD)
Errors in file /u01/app/gridbase/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_32721.trc  (incident=123423):
ORA-00600: internal error code, arguments: [kfdAuDealloc2], [ARCHLOG], [213], [410], [0], [], [], [], [], [], [], []
Incident details in: /u01/app/gridbase/diag/asm/+asm/+ASM1/incident/incdir_123423/+ASM1_rbal_32721_i123423.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Sun Mar 14 21:03:25 2021
ERROR: An unrecoverable error has been identified in ASM metadata.
Sun Mar 14 21:03:27 2021
NOTE: [crsd.bin@rac1.schic.org (TNS V1-V3) 325] opening OCR file +OCR.255.4294967295
Starting background process ASMB
Sun Mar 14 21:03:27 2021
ASMB started with pid=28, OS id=932 
Sun Mar 14 21:03:27 2021
NOTE: ASMB registering with ASM instance as Standard client 0xffffffffffffffff (reg:3401595347) (new connection)
Sun Mar 14 21:03:27 2021
NOTE: Standard client +ASM1:+ASM:racscan registered, osid 934, mbr 0x0, asmb 932 (reg:3401595347)
Sun Mar 14 21:03:27 2021
NOTE: ASMB connected to ASM instance +ASM1 osid: 934 (Flex mode; client id 0xffffffffffffffff)
Sun Mar 14 21:03:28 2021
NOTE: AMDU dump of disk group ARCHLOG initiated at /u01/app/gridbase/diag/asm/+asm/+ASM1/incident/incdir_123423
ERROR: ORA-600 in COD recovery for diskgroup 1/0x730955f2 (ARCHLOG)
ERROR: ORA-600 thrown in RBAL for group number 1
Sun Mar 14 21:03:30 2021
Errors in file /u01/app/gridbase/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_32721.trc:
ORA-00600: internal error code, arguments: [kfdAuDealloc2], [ARCHLOG], [213], [410], [0], [], [], [], [], [], [], []
Sun Mar 14 21:03:30 2021
Errors in file /u01/app/gridbase/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_32721.trc:
ORA-00600: internal error code, arguments: [kfdAuDealloc2], [ARCHLOG], [213], [410], [0], [], [], [], [], [], [], []
USER (ospid: 32721): terminating the instance due to error 488
Sun Mar 14 21:03:30 2021
System state dump requested by (instance=1, osid=32721 (RBAL)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/gridbase/diag/asm/+asm/+ASM1/trace/+ASM1_diag_32691_20210314210330.trc
Sun Mar 14 21:03:30 2021
Dumping diagnostic data in directory=[cdmp_20210314210330], requested by (instance=1, osid=32721 (RBAL)), s
ummary=[abnormal instance termination].
Sun Mar 14 21:03:30 2021
Instance terminated by USER, pid = 32721

通过上述日志,果然发现ocr磁盘组先mount成功,然后asm实例由于ARCHLOG磁盘组的ORA-00600 kfdAuDealloc2错误而导致整个实例crash,从而使得ocr磁盘组无法被crs访问,从而出现了”CRS-0804: 由于 Oracle 集群注册表错误 [PROC-26: 访问物理存储时出错 ORA-15077: 找不到提供所需磁盘组的 ASM 实例], 集群就绪服务中止”这样的错误提示.进一步分析为什么archlog进程会报这个错误.

SQL> /* ASMCMD */alter diskgroup /*ASMCMD*/ "DATA" drop file '+DATA/xff/XIFENFEI.270.1040985885' 
Sun Mar 14 20:46:46 2021
SUCCESS: /* ASMCMD */alter diskgroup /*ASMCMD*/ "DATA" drop file '+DATA/xff/XIFENFEI.270.1040985885'
Sun Mar 14 20:49:24 2021
NOTE: Dropping directory '+archlog/oracledb/archivelog/2021_03_11' recursively
Sun Mar 14 20:49:24 2021
Errors in file /u01/app/gridbase/diag/asm/+asm/+ASM1/trace/+ASM1_ora_15281.trc  (incident=114015):
ORA-00600: internal error code, arguments: [kfdAuDealloc2], [ARCHLOG], [213], [410], [0], [], [], [], [], [], [], []
Incident details in: /u01/app/gridbase/diag/asm/+asm/+ASM1/incident/incdir_114015/+ASM1_ora_15281_i114015.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Sun Mar 14 20:49:24 2021
ERROR: An unrecoverable error has been identified in ASM metadata.
NOTE:AMDU dump of disk group ARCHLOG initiated at/u01/app/gridbase/diag/asm/+asm/+ASM1/incident/incdir_114015
Sun Mar 14 20:49:28 2021
Errors in file /u01/app/gridbase/diag/asm/+asm/+ASM1/trace/+ASM1_ora_15281.trc  (incident=114016):
ORA-00600: internal error code, arguments: [kfdAuDealloc2], [ARCHLOG], [213], [410], [0], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [kfdAuDealloc2], [ARCHLOG], [213], [410], [0], [], [], [], [], [], [], []

因为这个库有一个历史背景:几天前由于存储cache导致,数据库使用备份还原(还原到一个新磁盘组中,老磁盘组没有使用),今天估计是运维人员在清理老磁盘组中不要的文件,然后archlog中的归档日志的时候,清空了+archlog/oracledb/archivelog/2021_03_11中的文件,然后触发asm删除该目录的异常(异常原因估计和上次清理存储cache引起了该磁盘组的元数据异常有关).该故障的基本思路原因已经清楚:由于archlog磁盘组本身元数据库有问题,清理该磁盘组文件之后,引起该磁盘组删除空目录出发问题,从而使得整个asm 实例crash.进而引起crs异常.解决方法比较简单,因为archlog磁盘组本身已经不需要,直接dd掉磁盘头,让其启动的时候不再mount,故障解决

[grid@rac1 ~]$ crsctl status res -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details       
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.ARCHLOG.dg
               ONLINE  OFFLINE      rac1                     STABLE
               ONLINE  OFFLINE      rac2                     STABLE
ora.DATA.dg
               ONLINE  ONLINE       rac1                     STABLE
               ONLINE  ONLINE       rac2                     STABLE
ora.LISTENER.lsnr
               ONLINE  ONLINE       rac1                     STABLE
               ONLINE  ONLINE       rac2                     STABLE
ora.LISTENER1.lsnr
               ONLINE  ONLINE       rac1                     STABLE
               ONLINE  ONLINE       rac2                     STABLE
ora.NEWDATA.dg
               ONLINE  ONLINE       rac1                     STABLE
               ONLINE  ONLINE       rac2                     STABLE
ora.OCR.dg
               ONLINE  ONLINE       rac1                     STABLE
               ONLINE  ONLINE       rac2                     STABLE
ora.asm
               ONLINE  ONLINE       rac1                     Started,STABLE
               ONLINE  ONLINE       rac2                     Started,STABLE
ora.net1.network
               ONLINE  ONLINE       rac1                     STABLE
               ONLINE  ONLINE       rac2                     STABLE
ora.ons
               ONLINE  ONLINE       rac1                     STABLE
               ONLINE  ONLINE       rac2                     STABLE
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       rac2                     STABLE
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  ONLINE       rac1                     STABLE
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  ONLINE       rac1                     STABLE
ora.MGMTLSNR
      1        ONLINE  OFFLINE      rac2                     169.254.86.142 7.7.7
                                                             .1,STARTING
ora.cvu
      1        ONLINE  ONLINE       rac1                     STABLE
ora.mgmtdb
      1        ONLINE  OFFLINE                               Instance Shutdown,ST
                                                             ABLE
ora.oc4j
      1        ONLINE  OFFLINE      rac1                     STARTING
ora.xff.db
      1        ONLINE  OFFLINE      rac1                     STARTING
      2        ONLINE  OFFLINE      rac2                     STARTING
ora.rac1.vip
      1        ONLINE  ONLINE       rac1                     STABLE
ora.rac2.vip
      1        ONLINE  ONLINE       rac2                     STABLE
ora.scan1.vip
      1        ONLINE  ONLINE       rac2                     STABLE
ora.scan2.vip
      1        ONLINE  ONLINE       rac1                     STABLE
ora.scan3.vip
      1        ONLINE  ONLINE       rac1                     STABLE
--------------------------------------------------------------------------------
[grid@rac1 ~]$ 

Oracle Recovery Tools恢复—ORA-00704 ORA-01555故障

联系:手机/微信(+86 17813235971) QQ(107644445)QQ咨询惜分飞

标题:Oracle Recovery Tools恢复—ORA-00704 ORA-01555故障

作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]

由于虚拟化环境使用了精简模式(预分配),后面出现分布式存储空间不足,导致虚拟化环境中的数据库服务器异常,通过一系列操作恢复好系统,发现数据库无法open,请求我们给予解决
通过我们的Oracle Database Recovery Check脚本分析,分析文件的checkpoint scn 有部分3月2日,还有一些是2月28日,是严重不一致,而且对应的归档也丢失
20210314194011


基于这样的情况,试试看强制打开库

C:\Users\XIFENFEI>sqlplus  / as sysdba

SQL*Plus: Release 11.2.0.4.0 Production on 星期四 3月 11 23:51:39 2021

Copyright (c) 1982, 2013, Oracle.  All rights reserved.


连接到:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> startup mount pfile='d:/pfile.txt'
ORACLE 例程已经启动。

Total System Global Area 1603411968 bytes
Fixed Size                  2281656 bytes
Variable Size             469765960 bytes
Database Buffers         1124073472 bytes
Redo Buffers                7290880 bytes
数据库装载完毕。
SQL> recover database until cancel;
ORA-00279: 更改 57834775 (在 02/28/2021 22:37:35 生成) 对于线程 1 是必需的
ORA-00289: 建议:
D:\APP\XIFENFEI\PRODUCT\11.2.0.4\DBHOME_1\RDBMS\ARC0000003072_1043082043.0001
ORA-00280: 更改 57834775 (用于线程 1) 在序列 #3072 中


指定日志: {<RET>=suggested | filename | AUTO | CANCEL}
cancel
ORA-01547: 警告: RECOVER 成功但 OPEN RESETLOGS 将出现如下错误
ORA-01194: 文件 1 需要更多的恢复来保持一致性
ORA-01110: 数据文件 1: 'D:\BAIDUNETDISKDOWNLOAD\DATA\PROD\SYSTEM01.DBF'


ORA-01112: 未启动介质恢复


SQL> alter database open resetlogs;
alter database open resetlogs
*
第 1 行出现错误:
ORA-01092: ORACLE instance terminated. Disconnection forced
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 10 with name
"_SYSSMU10_1197734989$" too small
进程 ID: 7928
会话 ID: 96 序列号: 3

在数据库open的过程中,报ORA-01555错误,这类问题比较明显以前写过类似文章:
在数据库open过程中常遇到ORA-01555汇总
数据库open过程遭遇ORA-1555对应sql语句补充
使用_allow_resetlogs_corruption导致ORA-00704/ORA-01555故障
这次尝试使用自己开发的小程序:Oracle Recovery Tools进行恢复
20210311235641


然后直接尝试打开数据库成功

SQL> alter database open;
alter database open
*
第 1 行出现错误:
ORA-01113: ?? 1 ??????
ORA-01110: ???? 1: 'D:\BAIDUNETDISKDOWNLOAD\DATA\XFF\SYSTEM01.DBF'


SQL> recover database;
完成介质恢复。
SQL> alter database open;

数据库已更改。

这次证明,对于数据库open过程汇总报ORA-00704 ORA-01555故障,可以通过Oracle Recovery Tools工具一键式open库。
后续安排数据导出,对于个别导出报错的表利用dul进行处理,完成本次恢复任务

使用rman from service 搭建dataguard

联系:手机/微信(+86 17813235971) QQ(107644445)QQ咨询惜分飞

标题:使用rman from service 搭建dataguard

作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]

从oracle 12c开始提供了rman通过from service方式搭建dg,使用12c长期支持版19c(并打上最新的patch)
配置dataguard相关参数

alter system set db_unique_name='XIFENFEI' scope=spfile;
alter system set service_names='XIFENFEI';
alter system set log_archive_config='dg_config=(XIFENFEI,XIFENFEIDG)';
alter system set log_archive_dest_1='LOCATION=USE_DB_RECOVERY_FILE_DEST valid_for=(all_logfiles,all_roles) db_unique_name=XIFENFEI';
alter system set log_archive_dest_2='service=XIFENFEIDG lgwr async valid_for=(online_logfiles,primary_role) db_unique_name=XIFENFEIDG';
alter system set standby_file_management=auto;
alter system set db_file_name_convert='/u01/app/oracle/oradata/XIFENFEI/','/u01/app/oracle/oradata/XIFENFEI/' scope=spfile;
alter system set log_file_name_convert='/u01/app/oracle/oradata/XIFENFEI/','/u01/app/oracle/oradata/XIFENFEI/' scope=spfile;
alter system set fal_server=XIFENFEIDG;

配置tnsnames.ora

XIFENFEI =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.0.238)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = XIFENFEI)
    )
  )

XIFENFEIDG =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.0.124)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = XIFENFEI)
    )
  )

拷贝主库密码文件到备库

[oracle@primary ~]$ scp $ORACLE_HOME/dbs/orapwXIFENFEI 192.168.0.124:$ORACLE_HOME/dbs/
The authenticity of host '192.168.0.124 (192.168.0.124)' can't be established.
ECDSA key fingerprint is SHA256:NI2952z4Bqc3M/B+AK7EJRiJNauROIyluvu1l4NSTX0.
ECDSA key fingerprint is MD5:1d:64:dd:ef:1c:ad:ed:cf:70:22:2d:4d:7c:90:5e:5e.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.0.124' (ECDSA) to the list of known hosts.
oracle@192.168.0.124's password: 
orapwXIFENFEI                                                                   100% 2048     6.6MB/s   00:00    
[oracle@primary ~]$ 

备库启动到nomount状态

[oracle@standby ~]$ sqlplus / as sysdba

SQL*Plus: Release 19.0.0.0.0 - Production on Sat Mar 13 20:32:34 2021
Version 19.10.0.0.0

Copyright (c) 1982, 2020, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> create spfile from pfile='/tmp/pfile';

File created.

SQL> startup nomount pfile='/tmp/pfile'
ORACLE instance started.

Total System Global Area 4294963264 bytes
Fixed Size                  8904768 bytes
Variable Size             805306368 bytes
Database Buffers         3472883712 bytes
Redo Buffers                7868416 bytes
SQL> exit
Disconnected from Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.10.0.0.0

rman from service方式创建standby ctl和还原数据文件

[oracle@standby ~]$ rman target /

Recovery Manager: Release 19.0.0.0.0 - Production on Sat Mar 13 20:34:37 2021
Version 19.10.0.0.0

Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.

connected to target database: XIFENFEI (not mounted)

RMAN> restore standby controlfile from service XIFENFEI;

Starting restore at 13-MAR-21
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=9 device type=DISK

channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: using network backup set from service XIFENFEI
channel ORA_DISK_1: restoring control file
channel ORA_DISK_1: restore complete, elapsed time: 00:00:01
output file name=/u01/app/oracle/oradata/XIFENFEI/control01.ctl
output file name=/u01/app/oracle/fast_recovery_area/XIFENFEI/control02.ctl
Finished restore at 13-MAR-21

RMAN> alter database mount;

released channel: ORA_DISK_1
Statement processed

RMAN> restore database from service XIFENFEI;

Starting restore at 13-MAR-21
Starting implicit crosscheck backup at 13-MAR-21
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=12 device type=DISK
Finished implicit crosscheck backup at 13-MAR-21

Starting implicit crosscheck copy at 13-MAR-21
using channel ORA_DISK_1
Finished implicit crosscheck copy at 13-MAR-21

searching for all files in the recovery area
cataloging files...
no files cataloged

using channel ORA_DISK_1

channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: using network backup set from service XIFENFEI
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00001 to /u01/app/oracle/oradata/XIFENFEI/system01.dbf
channel ORA_DISK_1: restore complete, elapsed time: 00:00:07
channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: using network backup set from service XIFENFEI
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00003 to /u01/app/oracle/oradata/XIFENFEI/sysaux01.dbf
channel ORA_DISK_1: restore complete, elapsed time: 00:00:07
channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: using network backup set from service XIFENFEI
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00004 to /u01/app/oracle/oradata/XIFENFEI/undotbs01.dbf
channel ORA_DISK_1: restore complete, elapsed time: 00:00:07
channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: using network backup set from service XIFENFEI
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00007 to /u01/app/oracle/oradata/XIFENFEI/users01.dbf
channel ORA_DISK_1: restore complete, elapsed time: 00:00:01
Finished restore at 13-MAR-21
[/sehll]
备库启动mrp进程
1
1
2021-03-13T20:54:08.075418+08:00
Attempt to start background Managed Standby Recovery process (XIFENFEI)
Starting background process MRP0
2021-03-13T20:54:08.086269+08:00
MRP0 started with pid=56, OS id=8182 
2021-03-13T20:54:08.087276+08:00
Background Managed Standby Recovery process started (XIFENFEI)
2021-03-13T20:54:13.104757+08:00
 Started logmerger process
2021-03-13T20:54:13.112058+08:00

IM on ADG: Start of Empty Journal 

IM on ADG: End of Empty Journal 
PR00 (PID:8188): Managed Standby Recovery starting Real Time Apply
2021-03-13T20:54:13.205668+08:00
Parallel Media Recovery started with 4 slaves
2021-03-13T20:54:13.216576+08:00
Stopping change tracking
PR00 (PID:8188): Media Recovery Waiting for T-1.S-25 (in transit)
2021-03-13T20:54:13.269138+08:00
Recovery of Online Redo Log: Thread 1 Group 12 Seq 25 Reading mem 0
  Mem# 0: /u01/app/oracle/oradata/XIFENFEI/s_redo12.log

至此dataguard基本上搭建完成

rm -rf 删除数据文件恢复方法—文件系统反删除+oracle碎片重组

联系:手机/微信(+86 17813235971) QQ(107644445)QQ咨询惜分飞

标题:rm -rf 删除数据文件恢复方法—文件系统反删除+oracle碎片重组

作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]

最近有朋友公司运维人员,不小心在CentOS 8的操作系统上执行了rm -rf /*,导致系统无法启动,而且oracle数据库被删除.
通过专业的xfs文件系统工具,尝试恢复
20210306093412


通过对恢复出来的数据文件进行检测
20210302181953

发现通过文件系统层面恢复的数据文件有大量坏块(由于文件系统层面有文件分配目录被覆盖导致恢复出来的部分连续block被空块代替),无法满足业务需求,对于此类情况,考虑通过底层碎片重组技术进行恢复
20210308202839

参考以往类似文章
xfs删除数据文件恢复
dbca删除库和rm删库恢复
restore database误操作恢复
sql server 数据库 mdf 0kb 恢复
文件系统损坏导致数据文件异常恢复
Oracle 数据文件大小为0kb或者文件丢失恢复
对于操作系统层面误删除了数据文件,一般优先考虑os层面反删除恢复,如果恢复效果不好,考虑数据库层面碎片重组技术进行恢复

[star-new@email.tg].Devos加密数据库恢复

联系:手机/微信(+86 17813235971) QQ(107644445)QQ咨询惜分飞

标题:[star-new@email.tg].Devos加密数据库恢复

作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]

有客户咨询win服务器数据库被文件系统加密病毒加密,有oracle数据库文件也被加密,请求我们给予恢复支持,加密结果如下:
20210306085626


显示勒索信息
star-new@email.tg

我们通过专业工具分析,确认每个文件破坏192个block(分多段破坏),整体数据都还存在
20210306090511

通过自研工具对其数据文件进行恢复
20210306090755

然后通过技术手段,顺利open数据库,确认含xml字段表正常
20210306085900

导入数据到新库,完成本次数据库恢复工作,实现数据最大程度恢复,而且客户那边直接可以使用,有此类勒索病毒加密的数据库(sql,mysql,oracle)我们可以提供数据库级别恢复服务