rac redo log file被意外覆盖数据库恢复

当前日志被覆盖导致错误
朋友的一客户在一套rac上包含了两个数据库,其其中一个库增加redo group时候,覆盖了另外一个库的redo,悲剧的是刚好是current redo

Wed May 16 17:03:05 2012
ALTER DATABASE OPEN
This instance was first to open
Wed May 16 17:03:09 2012
Beginning crash recovery of 2 threads
 parallel recovery started with 15 processes
Wed May 16 17:03:11 2012
Started redo scan
Wed May 16 17:03:11 2012
Errors in file /oracle/admin/odsdb/udump/odsdb1_ora_2040024.trc:
ORA-00305: log 14 of thread 1 inconsistent; belongs to another database
ORA-00312: online log 14 thread 1: '/dev/rods_redo1_2_2'
ORA-00305: log 14 of thread 1 inconsistent; belongs to another database
ORA-00312: online log 14 thread 1: '/dev/rods_redo1_2_1'
Abort recovery for domain 0
Wed May 16 17:03:11 2012
Aborting crash recovery due to error 305
Wed May 16 17:03:11 2012
Errors in file /oracle/admin/odsdb/udump/odsdb1_ora_2040024.trc:
ORA-00305: log 14 of thread 1 inconsistent; belongs to another database
ORA-00312: online log 14 thread 1: '/dev/rods_redo1_2_2'
ORA-00305: log 14 of thread 1 inconsistent; belongs to another database
ORA-00312: online log 14 thread 1: '/dev/rods_redo1_2_1'
ORA-305 signalled during: ALTER DATABASE OPEN...
Wed May 16 17:03:13 2012
Shutting down instance (abort)

使用_allow_resetlogs_corruption= TRUE进行恢复

Wed May 16 18:16:48 2012
SMON: enabling cache recovery
Wed May 16 18:16:48 2012
Instance recovery: looking for dead threads
Instance recovery: lock domain invalid but no dead threads
Wed May 16 18:16:48 2012
Errors in file /oracle/admin/odsdb/udump/odsdb1_ora_2105454.trc:
ORA-00600: internal error code, arguments: [kclchkblk_4], [2522], [18446744072024280773], 
[2522], [18446744072024247666], [], [], []
Wed May 16 18:16:50 2012
Errors in file /oracle/admin/odsdb/udump/odsdb1_ora_2105454.trc:
ORA-00600: internal error code, arguments: [kclchkblk_4], [2522], [18446744072024280773], 
[2522], [18446744072024247666], [], [], []
Wed May 16 18:16:50 2012
Error 600 happened during db open, shutting down database
USER: terminating instance due to error 600
Instance terminated by USER, pid = 2105454
ORA-1092 signalled during: alter database open resetlogs...

ORA-600[KCLCHKBLK_4], is signaled because the SCN in a tempfile block is too high.
The same reason caused the ORA-600[2662]s in the alert logs.
因为是临时文件的scn太大的问题,那就比较好解决:
启动数据库到mount状态,查询出来相关temp file,然后drop掉.

ORA-00600[6856]

Wed May 16 20:25:16 2012
Errors in file /oracle/admin/odsdb/bdump/odsdb1_smon_2482210.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/dev/rods_redo2_1_1'
ORA-00600: internal error code, arguments: [6856], [0], [0], [], [], [], [], []
ORACLE Instance odsdb1 (pid = 16) - Error 600 encountered while recovering transaction 
(10, 8) on object 7162533.
Wed May 16 20:25:16 2012
Errors in file /oracle/admin/odsdb/bdump/odsdb1_smon_2482210.trc:
ORA-00600: internal error code, arguments: [6856], [0], [0], [], [], [], [], []

这里的4193和4194是比较熟悉的,根据这里的提示猜测6856也是和undo有关系
ORA-600[6856]SMON is trying to recover a dead transaction.
But the undo application runs into an internal error (trying to delete a row that is already deleted).
因为smon回滚的时候出现上面错误,解决方法是想办法终止回滚,使用event=”10513 trace name context forever, level 2″.

ORA-00600[4193]/ORA-00600[4194]

Wed May 16 20:25:17 2012
Errors in file /oracle/admin/odsdb/udump/odsdb1_ora_2547936.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/dev/rods_redo2_1_1'
ORA-00600: internal error code, arguments: [4194], [22], [25], [], [], [], [], []
Wed May 16 20:25:18 2012
Errors in file /oracle/admin/odsdb/udump/odsdb1_ora_2547936.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/dev/rods_redo2_1_1'
ORA-00600: internal error code, arguments: [4194], [22], [25], [], [], [], [], []
Wed May 16 20:25:56 2012
Errors in file /oracle/admin/odsdb/udump/odsdb1_ora_2547936.trc:
ORA-00600: internal error code, arguments: [4193], [22248], [22252], [], [], [], [], []

太常见错误,不再做说明,虽然使用event是的库open成功,因为部分回滚段有问题,该错误还是会出现(还是喜欢直接屏蔽回滚段)

ORA-00600[ktpridestroy2]

Wed May 16 20:36:26 2012
Errors in file /oracle/admin/odsdb/bdump/odsdb1_smon_2101296.trc:
ORA-00600: internal error code, arguments: [ktpridestroy2], [], [], [], [], [], [], []

This error could be the result of a corruption and involves the parallel rollback that SMON enables each startup.
解决:fast_start_parallel_rollback=false

ORA-00600[kturacf1]/ORA-00600[kcbgcur_9]

Wed May 16 20:49:15 2012
Errors in file /oracle/admin/odsdb/bdump/odsdb1_j000_2007088.trc:
ORA-00600: internal error code, arguments: [kturacf1], [2097152], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [kcbgcur_9], [780140563], [4], [4294959056], [2097152], [], [], []

ORA-00600[kturacf1]错误未查询到原因
ORA-00600[kcbgcur_9]错误原因可能是:Buffers are pinned in a specific class order to prevent internal deadlocks.
因为这两个错误是job产生非致命错误,在这次的处理过程中可以忽略

ORA-00600[4097]

Wed May 16 21:05:05 2012
Errors in file /oracle/admin/odsdb/bdump/odsdb1_j000_1716282.trc:
ORA-12012: error on auto execute of job 6603
ORA-20001: ORA-00600: internal error code, arguments: [4097], [], [], [], [], [], [], []
ORA-06512: at "EPBI.UP_SYSLOG_ONLINE_USER", line 141
ORA-00600: internal error code, arguments: [4097], [], [], [], [], [], [], []

When an instance has a rollback segment offline and the instance crashes, or
the user does a shutdown abort, the rollback segment wrap number does not get
updated. If that segment is then dropped and recreated immediately after the
instance is restarted, the wrap number could be lower than existing wrap
numbers. This will cause the ORA-600[4097] to occur in subsequent
transactions using Rollback.
这个错误也是因为回滚段wrap number未被及时更新导致的异常.

总结这次恢复过程
1.因当前redo丢失,使用隐含参数打开库,然后出现ORA-600[KCLCHKBLK_4](这个错误比较少见,更何况这个库是10.2.0.4)
2.undo出现问题出现ORA-00600[6856]错误不是很多见
3.接下来都是一些列undo导致的问题,其实如果开始就直接使用隐含参数删除掉有问题回滚段,效果可能会比event好.
4.因为部分trace文件没有拿到,未做深入分析,文章列出整体恢复思路
5.本次恢复的数据库版本是10.2.0.4,根据mos描述,很多错误应不会发生,但是实际还是发生了,MOS的版本范围,不要太看重.
6.其实这篇文章的本质不是展示恢复过程,而是再一次提醒:操作数据库慎重,特别是一台机器上多套库.

library cache lock等待事件

Library cache lock介绍
Oracle利用Library cache lock和Library cache pin来实现并发控制,Library cache lock是在handle上获取的,而Library cache pin则是在data heap上获取。访问对象时,首先必须获取handle上的lock,然后将访问的数据pin在内存中。lock的作用是控制进程间的并发访问,而pin的作用是保证数据一致性,防止数据在访问时被交换出去。
lock和pin的实现类似于enqueue,在每个handle上都有lock和pin的holder list和waiter list,用来保存持有该资源和等待该资源的队列。

Library cache lock相关sql语句

--找出library cache lock等待sid,saddr信息
select sid,saddr from v$session where event= 'library cache lock';

SID        SADDR
---------- --------
16         572ed244

--找出blocked信息
select kgllkhdl Handle,kgllkreq Request, kglnaobj Object
from x$kgllk where kgllkses = '572ed244'
and kgllkreq > 0;

HANDLE   REQUEST   OBJECT
-------- ---------- ------------------
62d064dc          2 EMPLOYEES

--找出blocking信息
select kgllkses saddr,kgllkhdl handle,kgllkmod mod,kglnaobj object
from x$kgllk lock_a
where kgllkmod > 0
and exists (select lock_b.kgllkhdl from x$kgllk lock_b
where kgllkses = '572ed244' /* blocked session */
and lock_a.kgllkhdl = lock_b.kgllkhdl
and kgllkreq > 0);

SADDR     HANDLE   MOD         OBJECT
--------  -------- ---------- ------------
572eac94  62d064dc          3  EMPLOYEES

--blocking 会话信息
select sid,username,terminal,program from v$session where saddr = '572eac94'

SID        USERNAME     TERMINAL  PROGRAM
---------- -----------  --------- --------------------------------------------
12          SCOTT        pts/20    sqlplus@goblin.forgotten.realms (TNS V1-V3)

--所有blocked 会话
select sid,username,terminal,program from v$session
where saddr in 
(select kgllkses from x$kgllk lock_a 
 where kgllkreq > 0
 and exists (select lock_b.kgllkhdl from x$kgllk lock_b
             where kgllkses = '572eac94' /* blocking session */
             and lock_a.kgllkhdl = lock_b.kgllkhdl
             and kgllkreq = 0)
);

SID        USERNAME  TERMINAL  PROGRAM
---------- --------- --------- -------------------------------------------
13         SCOTT     pts/22    sqlplus@goblin.forgotten.realms (TNS V1-V3)
16         SCOTT     pts/7     sqlplus@goblin.forgotten.realms (TNS V1-V3)

library cache pin等待事件

library cache pin说明
library cache pin 事件是用来管理library cache的并发访问的, pin一个object会引起相应的heap被载入内存中,如果客户端需要修改或检测这个object它就必须在锁住后取得一个pin.library cache pin的等待时间为3秒钟,其中有1秒钟用于PMON后台进程,即在取得pin之前最多等待3秒钟,否则就超时.library cache pin通常是发生在编译或重新编译PL/SQL,VIEW,TYPES等object时.编译通常都是显性的,如安装应用程序,升级,安装补丁程序等,但object的重新编译也可能发生在object变得无效时.library cache pin的参数如下,有用的主要是P1和P2:
P1 – KGL Handle address.
P2 – Pin address
P3 – 10*Mode + Namespace
其中,P1,P2可与x$kglpn和x$kglob表相关.x$kglpn和x$kglob是ORACLE数据库的内部数据字典.
x$kglpn library cache pin信息
x$kglob library cache object信息

查询方法一

--通过查询V$SESSION_WAIT找出正在等待”library cache pin”的session
SELECT sid,
       SUBSTR (event, 1, 30),
       TO_CHAR(p1, 'xxxxxxxx') p1_16,
       --P1RAW P1_16,
       p2,
       p3
  FROM v$session_wait
 WHERE wait_time = 0 AND event LIKE 'library cache pin%';
--P1 列是Library Cache Handle Address
--P2 列是Library Cache Pin Address. 

--找到相关session pin状态
SELECT ADDR,
       INDX,
       KGLPNADR,-- Library Cache Pin Address
       KGLPNUSE,
       KGLPNSES,--识别锁住此pin 的session
       KGLPNHDL,--Library Cache Handle Address
       kGLPNLCK,
       KGLPNMOD,-- Pin 锁 
       KGLPNREQ-- Pin 请求
  FROM x$kglpn
 WHERE KGLPNHDL LIKE '%EB3EB8%';--p1_16
 
 --询X$KGLOB (Library Cache Object),可找到相关的object
SELECT KGLNAOBJ-- 相关object的名字(取前面80个字符)
  FROM X$KGLOB
 WHERE KGLHDADR LIKE '%EB3EB8%';--p1_16
 
 --查出占着pin锁的session目前正在做什么
SELECT a.sid, a.username, a.program
  FROM v$session a, x$kglpn b
 WHERE a.saddr = b.kglpnuse AND b.kglpnhdl LIKE '%EB3EB8%'--p1_16
  AND b.kgnmod <> 0;
  
   --查出阻塞者正执行的SQL语句
 SELECT sid, sql_text
  FROM v$session, v$sqlarea
 WHERE v$session.sql_address = v$sqlarea.address AND sid =&sid;

查询方法二

--通过查询DBA_LOCK_INTERNAL和V$SESSION_WAIT,可得到与”library cache pin” 等待相关的object的名字
SELECT TO_CHAR (SESSION_ID, '999') sid,
       SUBSTR (LOCK_TYPE, 1, 30) TYPE,
       SUBSTR (lock_id1, 1, 23) Object_Name,
       SUBSTR (mode_held, 1, 4) HELD,
       SUBSTR (mode_requested, 1, 4) REQ,
       lock_id2 Lock_addr
  FROM dba_lock_internal
 WHERE mode_requested <> 'None' AND mode_requested <> mode_held
       AND session_id IN
              (SELECT sid
                 FROM v$session_wait
                WHERE wait_time = 0 AND event LIKE 'library cache pin%');
                
 --查出”library cache pin”占有者(即阻塞者)的session id
 SELECT sid Holder,
       KGLPNUSE Sesion,
       KGLPNMOD Held,
       KGLPNREQ Req
  FROM sys.x$kglpn, v$session
 WHERE KGLPNHDL IN (SELECT p1raw
                      FROM v$session_wait
                     WHERE wait_time = 0 AND event LIKE 'library cache pin%')
       AND KGLPNMOD <> 0
       AND v$session.saddr = x$kglpn.kglpnuse;
       
 --查出”library cache pin”占有者(阻塞者)正在等什么
 SELECT sid, SUBSTR (event, 1, 30), wait_time
  FROM v$session_wait
 WHERE sid IN
          (SELECT sid
             FROM x$kglpn, v$session
            WHERE KGLPNHDL IN
                     (SELECT p1raw
                        FROM v$session_wait
                       WHERE wait_time = 0
                             AND event LIKE 'library cache pin%')
                  AND KGLPNMOD <> 0
                  AND v$session.saddr = x$kglpn.kglpnuse);
                  
 --查出阻塞者正执行的SQL语句
 SELECT sid, sql_text
  FROM v$session, v$sqlarea
 WHERE v$session.sql_address = v$sqlarea.address AND sid =&sid;

cursor: pin S wait on X 等待事件

cursor: pin S整体描述

cursor: pin S A session waits on this event when it wants to update a shared mutex pin and another session 
is currently in the process of updating a shared mutex pin for the same cursor object. This wait event should
rarely be seen because a shared mutex pin update is very fast.(Wait Time: Microseconds)

--Parameter说明
P1 Hash value of cursor

P2 Mutex value 
64 bit platforms
8 bytes are used.
Top 4 bytes hold the session id (if the mutex is held X)
Bottom 4 bytes hold the ref count (if the mutex is held S).

32 bit platforms 
4 bytes are used.
Top 2 bytes hold the session id (if the mutex is held X) 
Bottom 2 bytes hold the ref count (if the mutex is held S).

P3 Mutex where (an internal code locator) OR'd with Mutex Sleeps

--查询sql
SELECT a.*, s.sql_text
  FROM v$sql s,
       (SELECT sid,
               event,
               wait_class,
               p1 cursor_hash_value,
               p2raw Mutex_value,
               TO_NUMBER (SUBSTR (p2raw, 1, 8), 'xxxxxxxx') hold_mutex_x_sid
          FROM v$session_wait
         WHERE event LIKE 'cursor%') a
 WHERE s.HASH_VALUE = a.cursor_hash_value

cursor: pin S wait on X描述

- In previous versions of Oracle, library cache pin is protected by “library cache pin latch”.
- But in recent versions of Oracle(I believe it’s 10.2.0.2), 
  library cache pin for the cursor LCO is protected by mutext.
- Mutex is allocated per LCO, so it enables fine-grained access control.

“cursor: pin S wait on X” wait event is mostly related to mutex and hard parse.
- When a process hard parses the SQL statement, it should acquire exclusive 
  library cache pin for the corresponding LCO.
- This means that the process acquires the mutex in exclusive mode.
- Another process which also executes the same query needs to acquire the mutex 
  but it’s being blocked by preceding process. The wait event is “cursor: pin S wait on X”.

--发生cursor: pin S wait on X原因
Frequent Hard Parses
If the frequency of Hard Parsing is extremely high, then contention can occur on this pin.

High Version Counts
When Version counts become excessive, a long chain of versions needs to 
be examined and this can lead to contention on this event

Known bugs
Bug 5907779 - Self deadlock hang on "cursor: pin S wait on X" (typically from DBMS_STATS) [ID 5907779.8]
Bug 7568642: BLOCKING_SESSION EMPTY FOR "CURSOR: PIN S WAIT ON X"

ORA-00600[kcratr1_lostwrt]/ORA-00600[3020]错误恢复

open数据库alert日志报ORA-00600[kcratr1_lostwrt]错误

Mon May 14 14:57:28 2012
ALTER DATABASE OPEN
Mon May 14 14:57:29 2012
Beginning crash recovery of 1 threads
Mon May 14 14:57:29 2012
Started redo scan
Mon May 14 14:57:29 2012
Errors in file d:\oracle\admin\cqgasold\udump\cqgasold_ora_504.trc:
ORA-00600: 内部错误代码,参数: [kcratr1_lostwrt], [], [], [], [], [], [], []

ORA-600 signalled during: alter database open...

查询相关SCN
同一个查询中SCN相同,省略

SQL> select file#,online_status "STATUS",to_char(change#,'9999999999999999') "SCN",
  2  To_char(time,'yyyy-mm-dd hh24:mi:ss')"TIME" from v$recover_file;

未选定行

SQL> select file#,to_char(checkpoint_change#,'999999999999999') "SCN",
  2  to_char(last_change#,'999999999999999')"STOP_SCN" from v$datafile;

     FILE# SCN              STOP_SCN                                            
---------- ---------------- ----------------                                    
         1       7842987188                                                     
         2       7842987188                                                     
         3       7842987188                                                     

SQL> select file#,to_char(checkpoint_change#,'9999999999999999') "SCN",
  2  to_char(RESETLOGS_CHANGE#,'9999999999999999') "RESETLOGS SCN"
  3  from v$datafile_header;

     FILE# SCN               RESETLOGS SCN                                      
---------- ----------------- -----------------                                  
         1        7842991811                 1                                  
         2        7842991811                 1                                  
         3        7842991811                 1                                  

这里看到奇怪现象datafile scn小于datafile_header scn,数据库异常断电一般来说也不会出现这样的情况,个人猜测是错误的恢复或者使用历史控制文件导致,对于这样的现状,我先尝试着使用using backup controlfile方式恢复,结果失败.估计控制文件有异常,本着先拉起库原则,重建控制文件.

进行完全恢复

SQL> recover database;
ORA-00283: 恢复会话因错误而取消
ORA-00600: 内部错误代码,参数: [3020], [8388617], [1], [23403], [25], [112],[], []
ORA-10567: Redo is inconsistent with data block (file# 2, block# 9)
ORA-10564: tablespace UNDOTBS1
ORA-01110: 数据文件 2: 'D:\ORACLE\ORADATA\CQGASOLD\UNDO_1.DBF'
ORA-10560: block type 'KTU SMU HEADER BLOCK'

尝试跳过坏块继续恢复

SQL> recover database allow 1 corruption;
ORA-00283: 恢复会话因错误而取消
ORA-00600: 内部错误代码,参数: [3020], [8388610], [1], [23403], [2264], [16],[], []
ORA-10567: Redo is inconsistent with data block (file# 2, block# 2)
ORA-10564: tablespace UNDOTBS1
ORA-01110: 数据文件 2: 'D:\ORACLE\ORADATA\CQGASOLD\UNDO_1.DBF'
ORA-10560: block type 'KTFB Bitmapped File Space Header'

使用dbv检查坏块数量

C:\>dbv file='d:\oracle\oradata\cqgasold\undo_1.dbf' blocksize=8192

DBVERIFY: Release 9.2.0.5.0 - Production on 星期二 5月 15 19:43:42 2012

Copyright (c) 1982, 2002, Oracle Corporation.  All rights reserved.

DBVERIFY - 验证正在开始 : FILE = d:\oracle\oradata\cqgasold\undo_1.dbf

DBV-00200: 块, dba 8388617, 已经标记为崩溃
汇入的页369 - 可能是介质损坏
***
Corrupt block relative dba: 0x00800171 (file 2, block 369)
Fractured block found during dbv:
Data in bad block -
 type: 2 format: 2 rdba: 0x00800171
 last change scn: 0x0001.d37c57db seq: 0x1 flg: 0x04
 consistency value in tail: 0x4e930260
 check value in block header: 0x8202, computed block checksum: 0x4e5f
 spare1: 0x0, spare2: 0x0, spare3: 0x0
***

汇入的页417 - 可能是介质损坏
***
Corrupt block relative dba: 0x008001a1 (file 2, block 417)
Fractured block found during dbv:
Data in bad block -
 type: 2 format: 2 rdba: 0x008001a1
 last change scn: 0x0001.d37c53d4 seq: 0x2 flg: 0x04
 consistency value in tail: 0x4b6b0201
 check value in block header: 0x6ae7, computed block checksum: 0x5abc
 spare1: 0x0, spare2: 0x0, spare3: 0x0
***
…………
--类此记录很多,我放弃了跳过坏块修复的方法

恢复过程中提示坏块数据库文件离线恢复

SQL> alter database datafile 'd:\oracle\oradata\cqgasold\undo_1.dbf' offline;

数据库已更改。

SQL> recover database;
完成介质恢复。
SQL> alter database open;
alter database open
*
ERROR 位于第 1 行:
ORA-00604: 递归 SQL 层 1 出现错误
ORA-00376: 此时无法读取文件 2
ORA-01110: 数据文件 2: 'D:\ORACLE\ORADATA\CQGASOLD\UNDO_1.DBF'

到了这一步,根据经验,数据库被open的可能性很多了,很有可能是open以后因为smon回滚导致数据库down

查看日志,屏蔽回滚段,完成恢复

Tue May 15 19:59:52 2012
alter database open
Tue May 15 19:59:52 2012
Beginning crash recovery of 1 threads
Tue May 15 19:59:52 2012
Started redo scan
Tue May 15 19:59:52 2012
Completed redo scan
 323 redo blocks read, 82 data blocks need recovery
Tue May 15 19:59:52 2012
Started recovery at
 Thread 1: logseq 23404, block 3, scn 0.0
Recovery of Online Redo Log: Thread 1 Group 4 Seq 23404 Reading mem 0
  Mem# 0 errs 0: F:\ORACLE\ORADATA\LOGCQGASOLD4.ORA
Tue May 15 19:59:52 2012
Completed redo application
Tue May 15 19:59:52 2012
Ended recovery at
 Thread 1: logseq 23404, block 326, scn 1.3548264979
 82 data blocks read, 82 data blocks written, 323 redo blocks read
Crash recovery completed successfully
Tue May 15 19:59:53 2012
Thread 1 advanced to log sequence 23405
Thread 1 opened at log sequence 23405
  Current log# 2 seq# 23405 mem# 0: D:\ORACLE\ORADATA\CQGASOLD\REDO02.LOG
Successful open of redo thread 1
Tue May 15 19:59:53 2012
SMON: enabling cache recovery
SMON: enabling tx recovery
Tue May 15 19:59:54 2012
Database Characterset is ZHS16GBK
Tue May 15 19:59:55 2012
replication_dependency_tracking turned off (no async multimaster replication found)
ORA-604 signalled during: alter database open...
Tue May 15 19:59:56 2012
SMON: about to recover undo segment 1
SMON: mark undo segment 1 as needs recovery
SMON: about to recover undo segment 2
SMON: mark undo segment 2 as needs recovery
SMON: about to recover undo segment 3
SMON: mark undo segment 3 as needs recovery
SMON: about to recover undo segment 4
SMON: mark undo segment 4 as needs recovery
SMON: about to recover undo segment 5
SMON: mark undo segment 5 as needs recovery
SMON: about to recover undo segment 6
SMON: mark undo segment 6 as needs recovery
SMON: about to recover undo segment 7
SMON: mark undo segment 7 as needs recovery
SMON: about to recover undo segment 8
SMON: mark undo segment 8 as needs recovery
SMON: about to recover undo segment 9
SMON: mark undo segment 9 as needs recovery
SMON: about to recover undo segment 10
SMON: mark undo segment 10 as needs recovery
Tue May 15 20:00:37 2012
Shutting down instance (abort)

看到这里,可以大概确定是因为undo文件离线,导致回滚段异常.
这个问题,基本上可以确定通过隐含参数屏蔽回滚段,然后open数据库,重建undo删除异常undo,数据库恢复完成。