RAC环境redo在各节点本地导致数据库故障恢复

联系:手机/微信(+86 17813235971) QQ(107644445)QQ咨询惜分飞

标题:RAC环境redo在各节点本地导致数据库故障恢复

作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]

最近一个运行在win平台的rac,由于断电之后,集群两个节点均无法正常启动,客户进行了一系列尝试,结果到了ora-600 kclchkblk_4错误无法继续.
通过对数据库日志分析,回溯了故障大概的原因,启动的时候报错为:
节点1启动报错

Sun Aug 03 15:21:22 2025
alter database open
This instance was first to open
Beginning crash recovery of 2 threads
 parallel recovery started with 32 processes
Started redo scan
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_ora_7108.trc:
ORA-00314: 日志 11 (用于线程 2) 要求的 sequence# 147717 与 147541 不匹配
ORA-00312: 联机日志 11 线程 2: 'D:\REDOLOG\REDO011.LOG'
Abort recovery for domain 0
Aborting crash recovery due to error 314
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_ora_7108.trc:
ORA-00314: 日志 11 (用于线程 2) 要求的 sequence# 147717 与 147541 不匹配
ORA-00312: 联机日志 11 线程 2: 'D:\REDOLOG\REDO011.LOG'
Abort recovery for domain 0
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_ora_7108.trc:
ORA-00314: 日志 11 (用于线程 2) 要求的 sequence# 147717 与 147541 不匹配
ORA-00312: 联机日志 11 线程 2: 'D:\REDOLOG\REDO011.LOG'
ORA-314 signalled during: alter database open...

节点2启动报错

Sat Aug 02 15:45:43 2025
Successful mount of redo thread 2, with mount id 1735887907
Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE)
Lost write protection disabled
Completed: ALTER DATABASE MOUNT /* db agent *//* {1:47460:124} */
ALTER DATABASE OPEN /* db agent *//* {1:47460:124} */
This instance was first to open
Beginning crash recovery of 2 threads
Sat Aug 02 15:45:49 2025
 parallel recovery started with 32 processes
Started redo scan
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl2\trace\orcl2_ora_3444.trc:
ORA-00314: ?? 1 (???? 1) ??? sequence# 67782 ? 60818 ???
ORA-00312: ???? 1 ?? 1: 'D:\REDOLOG\REDO01.LOG'
Abort recovery for domain 0
Aborting crash recovery due to error 314
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl2\trace\orcl2_ora_3444.trc:
ORA-00314: ?? 1 (???? 1) ??? sequence# 67782 ? 60818 ???
ORA-00312: ???? 1 ?? 1: 'D:\REDOLOG\REDO01.LOG'
Abort recovery for domain 0
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl2\trace\orcl2_ora_3444.trc:
ORA-00314: ?? 1 (???? 1) ??? sequence# 67782 ? 60818 ???
ORA-00312: ???? 1 ?? 1: 'D:\REDOLOG\REDO01.LOG'
ORA-314 signalled during: ALTER DATABASE OPEN /* db agent *//* {1:47460:124} */...

看到这两个报错信息得出两个结论:
1)比较明显节点1需要thead 2的 group 11 sequence为147717,但是实际group 11文件的sequence为147541;而节点2启动需要thread 1的group 1 sequence为67782,但是实际中group 1文件的sequnece为60818,这两个都相差比较多,属于非正常的情况,很可能是文件本身有问题
2)这是一套win的rac架构,理论上redo应该在共享文件系统(一般是asm中),而这个第一感觉很可能是本地文件系统中
客户当时恢复之时查询信息截图
ORA-00314-ORA-00312


查看了两个节点的最后redo切换信息

--节点1
Sat Aug 02 10:49:31 2025
Thread 1 advanced to log sequence 67782 (LGWR switch)
  Current log# 1 seq# 67782 mem# 0: D:\REDOLOG\REDO01.LOG

--节点2(redo每组2G,节点2长时间没跑业务,之时做数据库导出操作,所以切换时间比较久远)
Sat Jul 26 16:56:42 2025
Thread 2 advanced to log sequence 147717 (LGWR switch)
  Current log# 11 seq# 147717 mem# 0: D:\REDOLOG\REDO011.LOG

并查看两个机器d:/redolog信息(客户自行resetlogs之后的,非第一现场,但是可以确认两个节点各自有一份redo文件
redo


本来这个是一个比较小的故障,只要把节点2的thread 1的redo拷贝到到节点1或者节点1的thread 2的redo拷贝到节点2,然后正常open库即可,现场恢复对rac不太熟悉,直接按照互联网上检索的处理方法,加上_allow_resetlogs_corruption然后强制拉库,结果不太幸运,拉库失败报ORA-600 kclchkblk_4错误
kclchkblk_4


Sun Aug 03 18:59:24 2025
alter database open resetlogs
RESETLOGS is being done without consistancy checks. This may result
in a corrupted database. The database should be recreated.
RESETLOGS after incomplete recovery UNTIL CHANGE 21497084214
Resetting resetlogs activation ID 1543012633 (0x5bf88119)
Sun Aug 03 18:59:46 2025
Setting recovery target incarnation to 3
Sun Aug 03 18:59:46 2025
This instance was first to open
Picked broadcast on commit scheme to generate SCNs
Sun Aug 03 18:59:48 2025
Assigning activation ID 1735960667 (0x6778a85b)
Thread 1 opened at log sequence 1
  Current log# 1 seq# 1 mem# 0: D:\REDOLOG\REDO01.LOG
Successful open of redo thread 1
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Sun Aug 03 18:59:49 2025
SMON: enabling cache recovery
Instance recovery: looking for dead threads
Instance recovery: lock domain invalid but no dead threads
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_ora_8508.trc  (incident=728324):
ORA-00600: 内部错误代码, 参数: [kclchkblk_4], [5], [200595988], [5], [22247740], [], [], [], [], [], [], []
Incident details in: D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\incident\incdir_728324\orcl1_ora_8508_i728324.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Sun Aug 03 18:59:53 2025
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_ora_8508.trc:
ORA-00704: 引导程序进程失败
ORA-00704: 引导程序进程失败
ORA-00600: 内部错误代码, 参数: [kclchkblk_4], [5], [200595988], [5], [22247740], [], [], [], [], [], [], []
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_ora_8508.trc:
ORA-00704: 引导程序进程失败
ORA-00704: 引导程序进程失败
ORA-00600: 内部错误代码, 参数: [kclchkblk_4], [5], [200595988], [5], [22247740], [], [], [], [], [], [], []
Error 704 happened during db open, shutting down database
USER (ospid: 8508): terminating the instance due to error 704
Sun Aug 03 18:59:54 2025
opiodr aborting process unknown ospid (9480) as a result of ORA-1092
Sun Aug 03 19:00:09 2025
Instance terminated by USER, pid = 8508
ORA-1092 signalled during: alter database open resetlogs...
opiodr aborting process unknown ospid (8508) as a result of ORA-1092

这个故障之后,客户那边无法自行恢复,让我这边介入处理,对于这个错误以前处理比较多,一般就是scn问题,通过Patch SCN小工具快速解决
QQ20250808-185918


数据库open成功之后主要报一些ORA-600 4137,ORA-600 6006等错误

Database Characterset is ZHS16GBK
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_smon_8300.trc  (incident=1176205):
ORA-00600: 内部错误代码, 参数: [4137], [1.14.2713957], [0], [0], [], [], [], [], [], [], [], []
Incident details in: D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\incident\incdir_1176205\orcl1_smon_8300_i1176205.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Fri Aug 08 19:03:14 2025
ORACLE Instance orcl1 (pid = 25) - Error 600 encountered while recovering transaction (1, 14).
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_smon_8300.trc:
ORA-00600: 内部错误代码, 参数: [4137], [1.14.2713957], [0], [0], [], [], [], [], [], [], [], []
Fri Aug 08 19:03:15 2025
ORACLE Instance orcl1 (pid = 25) - Error 600 encountered while recovering transaction (5, 19).
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_smon_8300.trc:
ORA-00600: 内部错误代码, 参数: [4137], [5.19.2318502], [0], [0], [], [], [], [], [], [], [], []
Starting background process MMON
Fri Aug 08 19:03:18 2025
MMON started with pid=29, OS id=4624
Fri Aug 08 19:03:19 2025
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_smon_8300.trc  (incident=1176207):
ORA-00600: 内部错误代码, 参数: [6006], [1], [], [], [], [], [], [], [], [], [], []
Incident details in: D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\incident\incdir_1176207\orcl1_smon_8300_i1176207.trc
Starting background process MMNL
Fri Aug 08 19:03:19 2025
MMNL started with pid=30, OS id=8344
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
ORACLE Instance orcl1 (pid = 25) - Error 600 encountered while recovering transaction (46, 28) on object 197344.
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_smon_8300.trc:
ORA-00600: 内部错误代码, 参数: [6006], [1], [], [], [], [], [], [], [], [], [], []

通过重建undo解决该错误,数据库稳定运行,没有再crash和报明显错误,导出核心数据,完成本次恢复任务.

删除redo导致ORA-00313 ORA-00312故障处理

联系:手机/微信(+86 17813235971) QQ(107644445)QQ咨询惜分飞

标题:删除redo导致ORA-00313 ORA-00312故障处理

作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]

有客户由于误操作直接rm 删除了redo文件,导致数据库启动报ORA-00313 ORA-00312错

2025-03-07T14:49:16.325723+08:00
ALTER DATABASE OPEN
2025-03-07T14:50:00.124620+08:00
Ping without log force is disabled:
  instance mounted in exclusive mode.
2025-03-07T14:50:00.198907+08:00
Crash Recovery excluding pdb 2 which was cleanly closed.
2025-03-07T14:50:00.238450+08:00
Beginning crash recovery of 1 threads
 parallel recovery started with 15 processes
 Thread 1: Recovery starting at checkpoint rba (logseq 2966 block 74686), scn 0
2025-03-07T14:50:00.325246+08:00
Started redo scan
2025-03-07T14:50:00.341193+08:00
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_2681.trc:
ORA-00313: open failed for members of log group 2 of thread 1
ORA-00312: online log 2 thread 1: '/u01/app/oracle/oradata/orcl/redo02.log'
ORA-27037: unable to obtain file status
Linux-x86_64 Error: 2: No such file or directory
Additional information: 7
2025-03-07T14:50:00.372632+08:00
Slave encountered ORA-10388 exception during crash recovery
…………
2025-03-07T14:50:00.385698+08:00
Slave encountered ORA-10388 exception during crash recovery
2025-03-07T14:50:00.388594+08:00
Aborting crash recovery due to error 313
2025-03-07T14:50:00.388739+08:00
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_2681.trc:
ORA-00313: open failed for members of log group 2 of thread 1
ORA-00312: online log 2 thread 1: '/u01/app/oracle/oradata/orcl/redo02.log'
ORA-27037: unable to obtain file status
Linux-x86_64 Error: 2: No such file or directory
Additional information: 7
2025-03-07T14:50:00.389243+08:00
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_2681.trc:
ORA-00313: open failed for members of log group 2 of thread 1
ORA-00312: online log 2 thread 1: '/u01/app/oracle/oradata/orcl/redo02.log'
ORA-27037: unable to obtain file status
Linux-x86_64 Error: 2: No such file or directory
Additional information: 7
ORA-313 signalled during: ALTER DATABASE OPEN...

然后客户把历史的redo文件拷贝过来,尝试恢复数据库,报ORA-00314 ORA-00312错误

2025-03-07T15:07:30.784759+08:00
ALTER DATABASE OPEN
Ping without log force is disabled:
  instance mounted in exclusive mode.
2025-03-07T15:07:30.808497+08:00
Crash Recovery excluding pdb 2 which was cleanly closed.
2025-03-07T15:07:30.838664+08:00
Beginning crash recovery of 1 threads
 parallel recovery started with 15 processes
 Thread 1: Recovery starting at checkpoint rba (logseq 2966 block 74686), scn 0
2025-03-07T15:07:30.897547+08:00
Started redo scan
2025-03-07T15:07:30.898222+08:00
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_4106.trc:
ORA-00314: log 2 of thread 1, expected sequence# 2966 doesn't match 1646
ORA-00312: online log 2 thread 1: '/u01/app/oracle/oradata/orcl/redo02.log'
2025-03-07T15:07:30.930089+08:00
Slave encountered ORA-10388 exception during crash recovery
…………
2025-03-07T15:07:30.940051+08:00
Slave encountered ORA-10388 exception during crash recovery
2025-03-07T15:07:30.942274+08:00
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_mz00_4138.trc:
ORA-00312: online log 1 thread 1: '/u01/app/oracle/oradata/orcl/redo01.log'
2025-03-07T15:07:30.945509+08:00
Slave encountered ORA-10388 exception during crash recovery
2025-03-07T15:07:30.945512+08:00
Slave encountered ORA-10388 exception during crash recovery
2025-03-07T15:07:30.948369+08:00
Aborting crash recovery due to error 314
2025-03-07T15:07:30.948488+08:00
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_4106.trc:
ORA-00314: log 2 of thread 1, expected sequence# 2966 doesn't match 1646
ORA-00312: online log 2 thread 1: '/u01/app/oracle/oradata/orcl/redo02.log'
2025-03-07T15:07:30.949390+08:00
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_4106.trc:
ORA-00314: log 2 of thread 1, expected sequence# 2966 doesn't match 1646
ORA-00312: online log 2 thread 1: '/u01/app/oracle/oradata/orcl/redo02.log'
ORA-314 signalled during: ALTER DATABASE OPEN...

使用Oracle数据库异常恢复检查脚本(Oracle Database Recovery Check)脚本收集信息之后数据文件头状态和所需要redo信息
df_header


数据库需要sequence#为2966的redo日志,但是当前已经被删除,基于当前情况,只能进行强制非一致性恢复,尝试强制打开库

SQL> recover database;                 
ORA-00283: recovery session canceled due to errors
ORA-00314: log 2 of thread 1, expected sequence# 2966 doesn't match 1646
ORA-00312: online log 2 thread 1: '/u01/app/oracle/oradata/orcl/redo02.log'

QL> select group#,status,sequence# from v$log;

	  GROUP# STATUS 		 SEQUENCE#
---------------- ---------------- ----------------
	       1 UNUSED 			 0
	       3 CURRENT		      2967
	       2 ACTIVE 		      2966

SQL> 
SQL> 
SQL> recover database until cancel;
ORA-00279: change 163033183 generated at 03/07/2025 14:04:20 needed for thread 1
ORA-00289: suggestion : /u01/app/oracle/recovery_area/orcl/archivelog/2025_03_08/o1_mf_1_2966_%u_.arc
ORA-00280: change 163033183 for thread 1 is in sequence #2966


Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
cancel
ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '/u01/app/oracle/oradata/orcl/system01.dbf'


ORA-01112: media recovery not started


SQL> alter database open resetlogs;

Database altered.

运气不错,直接打开数据库成功,然后逻辑导出数据,完成此处恢复.这个让我想起来了一些类似案例:
Oracle 23ai rm redo*.log恢复
清空redo,导致ORA-27048: skgfifi: file header information is invalid
由于默认情况下oracle的redo文件扩展名是.log,然后被当做是不重要文件从而被清理导致数据库故障,在oracle服务器上清理数据之前建议查询v$datafile,v$logfile,v$tempfile,v$controlfile来确认是否是数据库文件