ORA-600 4194/ORA-600 4193/ORA-600 4137 恢复思路

对于常见的undo异常错误,ORA-600 4193,ORA-600 4194,ORA-600 4137等错误的处理一般步骤.
适用版本

Oracle Database - Enterprise Edition - Version 9.2.0.1 to 11.2.0.4 [Release 9.2 to 11.2]
Information in this document applies to any platform.

报错现象

The following error is occurring in the alert.log right before the database crashes.

ORA-00600: internal error code, arguments: [4194], [#], [#], [], [], [], [], []

This error indicates that a mismatch has been detected between redo records and rollback (undo) records.

ARGUMENTS:

Arg [a] - Maximum Undo record number in Undo block
Arg [b] - Undo record number from Redo block

Since we are adding a new undo record to our undo block, we would expect that the new record number
 is equal to the maximum record number in the undo block plus one. Before Oracle can add 
a new undo record to the undo block it validates that this is correct. If this validation fails,
 then an ORA-600 [4194] will be triggered.

报错原因

This also can be cause by the following defect

Bug 8240762 Abstract: Undo corruptions with ORA-600 [4193]/ORA-600 [4194] or ORA-600 [4137] after SHRINK

Details: 
Undo corruption may be caused after a shrink and the same undo block may be used 
for two different transactions causing several internal errors like:
ORA-600 [4193] / ORA-600 [4194] for new transactions
ORA-600 [4137] for a transaction rollback

处理步骤

Best practice to create a new undo tablespace.
This method includes segment check.

Create pfile from spfile to edit
>create pfile from spfile;

1. Shutdown the instance

2. set the following parameters in the pfile
    undo_management = manual
    event = '10513 trace name context forever, level 2'

3. >startup restrict pfile=<initsid.ora>

4. >select tablespace_name, status, segment_name from dba_rollback_segs where status != 'OFFLINE';

This is critical - we are looking for all undo segments to be offline - System will always be online.

If any are 'PARTLY AVAILABLE' or 'NEEDS RECOVERY' - Please open an issue with Oracle Support or update the current SR.

If all offline then continue to the next step

5. Create new undo tablespace - example
>create undo tablespace <new undo tablespace> datafile <datafile> size 2000M;

6. Drop old undo tablespace
>drop tablespace <old undo tablespace> including contents and datafiles;

7. >shutdown immediate;

8 >startup nomount;  --> Using your Original spfile

9 modify the spfile with the new undo tablespace name

  Alter system set undo_tablespace = '<new tablespace created in step 5>' scope=spfile;

10. >shutdown immediate;

11. >startup;  --> Using spfile
 


The reason we create a new undo tablespace first is to use new undo segment numbers
 that are higher then the current segments being used.
This way when a transaction goes to do block clean-out 
the reference to that undo segment does not exist and continues with the block clean-out.

参考:tep by step to resolve ORA-600 4194 4193 4197 on database crash (Doc ID 1428786.1)

ORA-600 kdsgrp1

在硬件恢复,断电,redo异常等恢复case中ORA-600 [kdsgrp1]是一个比较常见的错误,这里该出来官方关于该错误的解释说明和处理方法

RROR:
  Format: ORA-600 [kdsgrp1]

VERSIONS:
  versions 10.1 and above



DESCRIPTION:

 This error was introduced in 10g with the fix to Bug 2442351, it provides
 for an extra health check on a block, we detected a null row header,
 see Note:2442351.9 for more information.

 Error may be caused by:

 Case 1. A row referenced in an index that does not exist in the table.
 Case 2. An non-existent rowid pointed to by a chained row.

 Trace Examples:

 Case 1. Mismatch between table and index:
====================================================

 Trace file has:

 row 02433566.13 continuation at
 file# 9 block# 210278 slot 20 not found

 The file=9 block=210278 is rdba=0x02433566 which was taken from an index:
 
 row#3[7549] flag: ------, lock: 0, len=85, data:(6):  02 43 35 66 00 14

 But the slot 20 does not exist in the table block:
 
 tab 0, row 1, @0x1e62
 tl: 2 fb: --HDFL-- lb: 0x3
 tab 0, row 12, @0x191a
 tl: 2 fb: --HDFL-- lb: 0x1
 tab 0, row 17, @0x1675
 tl: 2 fb: --HDFL-- lb: 0x2    
 tab 0, row 21, @0x1459       
 tl: 2 fb: --HDFL-- lb: 0x4

 ORA-1499 may be produced by analyze:

 analyze table <table name> validate structure cascade;

 Case 2. A row points to another rowid which does not exist (Chained row does not exist).
============================================================================================
 
 Trace file has:

 row 1186b11a.ffffffff continuation at
 file# 70 block# 441621 slot 1 not found 

 It means that row with rdba 0x1186b11a continues in file# 70 block# 441621 slot 1.  
 But the information in file# 70 block# 441621 slot 1 does not exist.  It is:

 tab 0, row 16, @0xd7f    ---> This is the slot with the problem.
 tl: 29 fb: -------- lb: 0x0  cc: 11
 nrid:  0x1186bd15.1      ---> It points to rdba=0x1186bd15 slot 1 
(file# 70 block# 441621 slot 1) but that row does not exist in that block.

 For this case ANALYZE TABLE .. VALIDATE STRUCTURE is not detecting this logical corruption
Referece Bug 6858313

Run an export (exp) or Full Table Scan to identify if there is a permanent invalid chained row.

 Workaround for Case 2:
 
 The row producing the ORA-600 [kdsgrp1] can be skipped by setting the Event 10231

 Note that a testcase has concluded that event 10231 does not skip rows in an Index Organized Table (IOT)
 when there is an invalid nrid as explained in Case 2.  It only works for regular tables.

 Event 43810 skip corrupt block in IOT?s (10.2.0.4) 
nor  parameter _index_scan_check_skip_corrupt (11g) work for this case 2 on IOTs either.

FUNCTIONALITY:
  Kernel Data layer Seek/Scan

IMPACT:

  PROCESS FAILURE
  POSSIBLE PHYSICAL CORRUPTION

ORA-600 2663分析

在大家熟悉的ORA-600 2662错误中还有一个类似的ORA-600 2663的错误,以下是对该2663的解释说明

ERROR:
  Format: ORA-600 [2663] [a] [b] {c} [d]

VERSIONS:
  versions 10.1 to 11.1

DESCRIPTION:
  A data block SCN is ahead of the current SCN.

  The ORA-600 [2663] occurs when an SCN is compared to the dependent SCN 
  stored in a UGA variable.

  If the SCN is less than the dependent SCN then we signal the ORA-600 [2663]
  internal error.

ARGUMENTS:
  Arg [a]  Current SCN WRAP
  Arg [b]  Current SCN BASE
  Arg {c}  dependent SCN WRAP
  Arg [d]  dependent SCN BASE 

FUNCTIONALITY:
  Kernel Cache Redo File Redo Generation

IMPACT:
  INSTANCE FAILURE
  POSSIBLE PHYSICAL CORRUPTION

SUGGESTIONS:
  There are different situations where ORA-600 [2663] can be raised.

  It can be raised on startup or duing database operation.

  If not using RAC, Real Application Clusters,, check that 2 instances 
  have not mounted the same database.

  Check for SMON traces and have the alert.log and trace files ready
  to send to support.

  Check the SCN difference [argument d]-[argument b].

  If the SCNs in the error are very close, then try to shutdown and startup
  the instance several times. 

  In some situations, the SCN increment during startup may permit the 
  database to open. Keep track of the number of times you attempted a 
  startup.

ORA-600 2662分析

在数据库恢复中,ORA-600 2662我想是很多人都非常熟悉的错误,下文是对于该错误的一些解释
ORA-600 2662解释说明

ERROR:              

  Format: ORA-600 [2662] [a] [b] {c} [d] [e]
 
VERSIONS:
  versions 6.0 to 12.1
 
DESCRIPTION:

  A data block SCN is ahead of the current SCN.

  The ORA-600 [2662] occurs when an SCN is compared to the dependent SCN 
  stored in a UGA variable.

  If the SCN is less than the dependent SCN then we signal the ORA-600 [2662]
  internal error.

ARGUMENTS:
  Arg [a]  Current SCN WRAP
  Arg [b]  Current SCN BASE
  Arg {c}  dependent SCN WRAP
  Arg [d]  dependent SCN BASE 
  Arg [e]  Where present this is the DBA where the dependent SCN came from.

出现ORA-600 2662可能的原因

  (1) doing an open resetlogs with _ALLOW_RESETLOGS_CORRUPTION enabled    
  (2) a hardware problem, like a faulty controller, resulting in a failed 
      write to the control file or the redo logs     
  (3) restoring parts of the database from backup and not doing the 
      appropriate recovery     
  (4) restoring a control file and not doing a RECOVER DATABASE USING BACKUP 
      CONTROLFILE     
  (5) having _DISABLE_LOGGING set during crash recovery                      
  (6) problems with the DLM in a parallel server environment      
  (7) a bug   

ORA-600 2662解决方法

   (1) if the SCNs in the error are very close, attempting a startup several 
       times will bump up the dscn every time we open the database even if 
       open fails. The database will open when dscn=scn.
      
   (2)You can bump the SCN either on open or while the database is open 
      using Event:ADJUST_SCN 
      Be aware that you should rebuild the database if you use this
      option.

_ALLOW_RESETLOGS_CORRUPTION 参数

我相信_ALLOW_RESETLOGS_CORRUPTION 这个参数一定很多人都熟悉,是redo异常恢复的杀手锏之一,以下文章是来自官方的解释

DB_Parameter _ALLOW_RESETLOGS_CORRUPTION 
========================================
 
This documentation has been prepared avoiding the mention of the complex 
structures from the code and to simply give an insight to the 'damage it could 
cause'.  The usage of this parameter leads to an in-consistent Database with no 
other alternative but to rebuild the complete Database.  This parameter could 
be used when we realize that there are no stardard options available and are 
convinced that the customer understands the implications of using the Oracle's 
secret parameter.  The factors to be considered are ;-- 
 
1. Customer does not have a good backup. 
2. A lot of time and money has been invested after the last good backup and     
   there is no possibility for reproduction of the lost data. 
3. The customer has to be ready to export the full database and import it     
   back after creating a new one. 
4. There is no 100% guarantee that by using this parameter the database would 
   come up. 
5. Oracle does not support the database after using this parameter for        
   recovery.    
6. ALL OPTIONS including the ones mentioned in the action part of the error   
   message have been tried. 
 
 

 
By setting _ALLOW_RESETLOGS_CORRUPTION=TRUE, certain consistency checks are 
SKIPPED during database open stage.  This basically means it does not check 
the datafile headers as to what the status was before the shutdown and how it 
was shutdown.  The following cases mention few of the checks that were skipped. 
 
Case-I 
------ 
Verification that the datafile present has not been restored from a BACKUP 
taken before the database was opened successfully by using RESETLOGS.   

ORA-01190: control file or data file %s is from before the last RESETLOGS
    Cause: Attempting to use a data file when the log reset information in  
           the file does not match the control file.  Either the data file or  
           the control file is a backup that was made before the most recent 
           ALTER DATABASE OPEN RESETLOGS. 
   Action: Restore file from a more recent backup. 

 
Case-II 
------- 
Verification that the status bit of the datafile is not in a FUZZY state. 
The datafile could be in this state due to the database going down when the  
 - Datafile was on-line and open 
 - Datafile was not closed cleanly (maybe due to OS). 

ORA-01194: file %s needs more recovery to be consistent 
    Cause: An incomplete recover session was started, but an insufficient 
           number of logs were applied to make the file consistent.  The  
           reported file was not closed cleanly when it was last opened by 
           the database.  It must be recovered to a time when it was not  
           being updated.  The most likely cause of this error is forgetting 
           to restore the file from a backup before doing incomplete  
           recovery. 
   Action: Either apply more logs until the file is consistent or restore the 
           file from an older backup and repeat recovery. 
 

Case-III 
-------- 
Verification that the COMPLETE recover strategies have been applied for 
recovering the datafile and not any of the INCOMPLETE recovery options.  
Basically because the complete recovery is one in which we even apply the 
ON-LINE redo log files and open the DB without reseting the logs. 

ORA-01113: file '%s' needs media recovery starting at log sequence # %s 
    Cause: An attempt was made to open a database file that is in need of  
           media recovery. 
   Action: First apply media recovery to the file. 

 
Case-IV 
------- 
Verification that the datafile has been recovered through an END BACKUP if the 
control file indicates that it was in backup mode. 
This is useful when the DB has crashed while in hot backup mode and we lost 
all log files in DB version's less than V7.2. 

ORA-01195: on-line backup of file %s needs more recovery to be consistent" 
    Cause: An incomplete recovery session was started, but an insufficient  
           number of logs were applied to make the file consistent.  The 
           reported file is an on-line backup which must be recovered to the 
           time the backup ended. 
   Action: Either apply more logs until the file is consistent or resotre 
          the database files from an older backup and repeat recovery. 
 
In version 7.2, we could simply issue the ALTER DATABASE DATAFILE xxxx END 
BACKUP statement and proceed with the recovery.  But again to issue this 
statement, we need to have the ON-LINE redo logs or else we still are forced to
use this parameter. 
 

Case-V 
------ 
Verification that the data file status is not still in (0x10) MEDIA recovery 
FUZZY. 
When recovery is started, a flag is set in the datafile header status flag to 
indicate that the file is presently in media recovery.  This is reset when 
recovery is completed and at times when it has not been reset we are forced to 
use this paramter. 

ORA-01196: file %s is inconsistent due to a failed media recovery session 
    Cause: The file was being recovered but the recovery did not terminate 
           normally.  This left the file in an inconsistent state.  No more  
           recovery was successfully completed on this file. 
   Action: Either apply more logs until the file is consistent or restore the 
           backup again and repeat recovery. 
 
 
Case-VI 
------- 
Verification that the datafile has been restored form a proper backup to 
correspond with the log files.  This situation could happen when we have 
decided that the data file is invalid since its SCN is ahead of the last 
applied logs SCN but it has not failed on one of the ABOVE CHECKS. 

ORA-01152: file '%s' was not restored from a sufficientluy old backup" 
    Cause: A manual recovery session was started, but an insufficient number 
           of logs were applied to make the database consistent.  This file is 
           still in the future of the last log applied.  Note that this  
           mistake can not always be caught. 
   Action: Either apply more logs until the database is consistent or 
           restore the database file from an older backup and repeat  
           recovery.

使用_ALLOW_RESETLOGS_CORRUPTION 参数需谨慎,因为该参数可能导致数据库逻辑不一致,甚至可能把本来很简单的一个恢复弄的非常复杂甚至不可恢复的后果,建议在oracle support支持下使用.另外使用该参数resetlogs库之后,强烈建议通过逻辑方式重建库