large pool太小导致shared server异常

数据库出现如下错误

Fri Oct  5 09:33:54 2012
Process S001 started up but failed with error = 20
failed to start shared server 'S001', oer=20
Process S001 started up but failed with error = 20
failed to start shared server 'S001', oer=20
Process S001 started up but failed with error = 20
failed to start shared server 'S001', oer=20
Process S001 started up but failed with error = 20
failed to start shared server 'S001', oer=20
Process S001 started up but failed with error = 20
failed to start shared server 'S001', oer=20
Process S001 started up but failed with error = 20
failed to start shared server 'S001', oer=20
Process S001 started up but failed with error = 20
failed to start shared server 'S001', oer=20
Process S001 started up but failed with error = 20
failed to start shared server 'S001', oer=20

重启后错误提示变为

Successfully onlined Undo Tablespace 1.
Fri Oct  5 09:34:41 2012
SMON: enabling tx recovery
Fri Oct  5 09:34:41 2012
Database Characterset is AL32UTF8
replication_dependency_tracking turned off (no async multimaster replication found)
Completed: ALTER DATABASE OPEN
Fri Oct  5 13:53:50 2012
Errors in file /oracle/admin/ora/bdump/ora_s000_25948.trc:
ORA-04031: unable to allocate 72 bytes of shared memory 
("large pool","unknown object","session heap","trigger condition node")
Fri Oct  5 13:53:50 2012
Errors in file /oracle/admin/ora/bdump/ora_s000_25948.trc:
ORA-00600: internal error code, arguments: [ksudel1], [], [], [], [], [], [], []
ORA-04031: unable to allocate 72 bytes of shared memory 
("large pool","unknown object","session heap","trigger condition node")
Fri Oct  5 13:54:52 2012
found dead shared server 'S000', pid = (11, 1)
Fri Oct  5 17:25:59 2012
Errors in file /oracle/admin/ora/bdump/ora_s000_31081.trc:
ORA-04031: unable to allocate 72 bytes of shared memory 
("large pool","unknown object","session heap","trigger condition node")
Fri Oct  5 17:25:59 2012
Errors in file /oracle/admin/ora/bdump/ora_s000_31081.trc:
ORA-00600: internal error code, arguments: [ksudel1], [], [], [], [], [], [], []
ORA-04031: unable to allocate 72 bytes of shared memory 
("large pool","unknown object","session heap","trigger condition node")

通过这里的错误,我们可以看到是large pool不能分配72 bytes的连续内存空间而使得S000进程报错.那这两者有什么联系:我们知道S000是shared server的进程,那shared server为什么导致large pool不足呢?查询官方文档得出,如下三种情况会使用large pool

Session memory for the shared server and the Oracle XA interface 
(used where transactions interact with more than one database)

I/O server processes

Oracle backup and restore operations

主要也就是shared server/parallel query buffers/backup restore这几个操作会使用到large pool.在该案例中很明显的可以看到是因为shared server进程需要分配large pool中一部分空间,而没有连续空间从而出现该错误.数据库相关参数配置

SQL> show parameter mts;

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
mts_circuits                         integer     555
mts_dispatchers                      string      (PROTOCOL=TCP) (SERVICE=oraXDB)
mts_listener_address                 string
mts_max_dispatchers                  integer     5
mts_max_servers                      integer     20
mts_multiple_listeners               boolean     FALSE
mts_servers                          integer     1
mts_service                          string      ora
mts_sessions                         integer     550

SQL> show parameter large;

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
large_pool_size                      big integer 16777216

这里可以看出来,数据库明显配置了MTS,因为数据库在启动时候,最少会建立一个shared server进程,而这个时候因为large pool太小(16M),导致该进程无法正常建立,从而出现上述alert中相关错误,临时处理方法增加large pool.后续需要关注业务特点,考虑是否可以采用Oracle Dedicated server模式来处理.

ORA-00600[kccpb_sanity_check_2]

alert日志出现如下错误
数据库在mount的时候,因为出现ORA-00600[kccpb_sanity_check_2]错误导致数据库不能正常被mount成功

--sqlplus中报错
SQL> startup nomount;
ORACLE instance started.
 
Total System Global Area 4294967296 bytes                                     
Fixed Size                  2273256 bytes                                     
Variable Size            1486573592 bytes                                     
Database Buffers         2801795072 bytes                                     
Redo Buffers                4325376 bytes                                     
  
SQL> alter database mount;
alter database mount
*
ERROR at line 1:
ORA-600: internal error code, arguments: [kccpb_sanity_check_2], [2825], 
[2824], [0x000000000], [], [], [], [] 

--alert日志
Mon Sep 24 16:35:37 2012
ALTER DATABASE   MOUNT
Mon Sep 24 16:35:41 2012
Errors in file /opt/app/oracle/admin/lhgk/udump/lhgk_ora_17034.trc:
ORA-00600: internal error code, arguments: [kccpb_sanity_check_2], [2825], [2824], [0x000000000], [], [], [], []
Mon Sep 24 16:35:41 2012
ORA-600 signalled during: ALTER DATABASE   MOUNT...

错误原因

ORA-600 [kccpb_sanity_check_2] indicates that the seq# of the last read block is
higher than the seq# of the control file header block. This is indication of
the lost write of the header block during commit of the previous cf
transaction.

解决方法

1) restore a backup of a controlfile and recover

OR

2) recreate the controlfile

OR

3) restore the database from last good backup and recover

NOTE:  If you do not have any special backup of control file to restore and you are using Multiple Control File 
copies in your pfile/init.ora/spfile you can attempt to mount the database using each control file one by one.  
If you are able to mount the database with any of these control file copies you can then issue 
'alter database backup controlfile to trace' to recreate controlfile.

模拟普通ORA-08103并解决

在上一篇中说到:模拟极端ORA-08103并解决,不能通过修改成坏块来解决,这里演示了是一个普通的数据块出现异常,然后通过bbed修改为坏块通过dbms_repair来解决该故障,补充说明:在11.2.0.3.3的库中,使用该方法不能重现该错误,而是直接提示ORA-01578,证明ORACLE的新版本在这一方面进行了改进
创建测试表

SQL> SELECT * FROM V$VERSION;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - Prod
PL/SQL Release 10.2.0.4.0 - Production
CORE    10.2.0.4.0      Production
TNS for Linux: Version 10.2.0.4.0 - Production
NLSRTL Version 10.2.0.4.0 - Production

SQL> create table t_xifenfei as
  2   select * from dba_objects where rownum<3000;

Table created.

SQL> SELECT owner, segment_name, EXTENT_ID, FILE_ID, BLOCK_ID, BLOCKS 
  2  FROM dba_extents 
  3  WHERE segment_name='T_XIFENFEI' AND owner='CHF';

OWNER                          SEGMENT_NAME     EXTENT_ID    FILE_ID   BLOCK_ID     BLOCKS
------------------------------ --------------- ---------- ---------- ---------- ----------
CHF                            T_XIFENFEI               0          4         57          8
CHF                            T_XIFENFEI               1          4         65          8
CHF                            T_XIFENFEI               2          4         73          8
CHF                            T_XIFENFEI               3          4         81          8
CHF                            T_XIFENFEI               4          4         89          8
CHF                            T_XIFENFEI               5          4         97          8

6 rows selected.

SQL> SELECT DISTINCT dbms_rowid.rowid_block_number(rowid) blk#,
  2  dbms_rowid.rowid_relative_fno(rowid) file# 
  3   FROM t_xifenfei
  4   ORDER BY 2,1;

      BLK#      FILE#
---------- ----------
        60          4
        61          4
        62          4
        63          4
        64          4
        65          4
        66          4
        67          4
        68          4
        69          4
        70          4

      BLK#      FILE#
---------- ----------
        71          4
        72          4
        74          4
        75          4
        76          4
        77          4
        78          4
        79          4
        80          4
        81          4
        82          4

      BLK#      FILE#
---------- ----------
        83          4
        84          4
        85          4
        86          4
        87          4
        88          4
        90          4
        91          4
        92          4
        93          4
        94          4

      BLK#      FILE#
---------- ----------
        95          4
        96          4
        97          4
        98          4

37 rows selected.

模拟ORA-08103

SQL> CONN / AS SYSDBA
Connected.
SQL> SHUTDOWN IMMEDIATE
Database closed.
Database dismounted.
ORACLE instance shut down.

--破坏数据快(其实就是清空一个数据块block 95,注意dd和实际数据的block对应关系相差1)
[oracle@xifenfei ~]$ dd if=/dev/zero of=/u01/oracle/oradata/XFF/users01.dbf bs=8192 seek=95 count=1 conv=notrunc
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.000187113 seconds, 43.8 MB/s


SQL> STARTUP
ORACLE instance started.

Total System Global Area  318767104 bytes
Fixed Size                  1267236 bytes
Variable Size             104860124 bytes
Database Buffers          205520896 bytes
Redo Buffers                7118848 bytes
Database mounted.
Database opened.
SQL> SELECT COUNT(*) FROM CHF.T_XIFENFEI;
SELECT COUNT(*) FROM CHF.T_XIFENFEI
                         *
ERROR at line 1:
ORA-08103: object no longer exists


[oracle@xifenfei ~]$ exp chf/xifenfei tables=t_xifenfei file=/tmp/t_xifenfei.dmp

Export: Release 10.2.0.4.0 - Production on Fri Jan 13 22:09:43 2012

Copyright (c) 1982, 2007, Oracle.  All rights reserved.


Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Export done in ZHS16GBK character set and AL16UTF16 NCHAR character set

About to export specified tables via Conventional Path ...
. . exporting table                     T_XIFENFEI
EXP-00056: ORACLE error 8103 encountered
ORA-08103: object no longer exists
Export terminated successfully with warnings.

[oracle@xifenfei ~]$ expdp chf/xifenfei tables=t_xifenfei dumpfile=t_xifenfei.dmp

Export: Release 10.2.0.4.0 - Production on Friday, 13 January, 2012 22:10:26

Copyright (c) 2003, 2007, Oracle.  All rights reserved.

Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Starting "CHF"."SYS_EXPORT_TABLE_01":  chf/******** tables=t_xifenfei dumpfile=t_xifenfei.dmp 
Estimate in progress using BLOCKS method...
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
Total estimation using BLOCKS method: 384 KB
Processing object type TABLE_EXPORT/TABLE/TABLE
ORA-31693: Table data object "CHF"."T_XIFENFEI" failed to load/unload and is being skipped due to error:
ORA-02354: error in exporting/importing data
ORA-08103: object no longer exists
Master table "CHF"."SYS_EXPORT_TABLE_01" successfully loaded/unloaded
******************************************************************************
Dump file set for CHF.SYS_EXPORT_TABLE_01 is:
  /u01/oracle/oracle/product/10.2.0/db_1/rdbms/log/t_xifenfei.dmp
Job "CHF"."SYS_EXPORT_TABLE_01" completed with 1 error(s) at 22:10:50

测试证明当出现ORA-08103的时候,全表扫描,exp,expdp均不能正常工作

找到出现ORA-08103数据块

SQL> alter session set max_dump_file_size=unlimited;

Session altered.

SQL> alter session set db_file_multiblock_read_count=1;

Session altered.

SQL> alter session set events 'immediate trace name trace_buffer_on level 1048576';

Session altered.

SQL>  alter session set events '10200 trace name context forever, level 1';

Session altered.

SQL> alter session set events '8103 trace name errorstack level 3';

Session altered.

SQL> alter session set events '10236 trace name context forever, level 1';

Session altered.

SQL>  alter session set tracefile_identifier='ORA8103';

Session altered.


SQL> select * from chf.t_xifenfei;

……………………

ERROR:
ORA-08103: object no longer exists

2700 rows selected.


--在trace文件结尾发现如下记录,表示读到这个数据块时发生错误
KTRVAC: path typ=0, rdba=100005f


SQL> select to_number('100005f','xxxxxxxxxxxxx') from dual;

TO_NUMBER('100005F','XXXXXXXXXXXXX')
------------------------------------
                            16777311

SQL> select
  2  dbms_utility.data_block_address_file(16777311) FILE_NO,
  3  dbms_utility.data_block_address_block(16777311) BLOCK_NO
  4  from dual;

   FILE_NO   BLOCK_NO
---------- ----------
         4         95

bbed继续破坏异常块

BBED> set filename '/u01/oracle/oradata/XFF/users01.dbf'
        FILENAME        /u01/oracle/oradata/XFF/users01.dbf

BBED> set block 95
        BLOCK#          95

BBED> map
 File: /u01/oracle/oradata/XFF/users01.dbf (0)
 Block: 95                                    Dba:0x00000000
------------------------------------------------------------
BBED-00400: invalid blocktype (00)


BBED> set count 32
        COUNT           32

BBED> d
 File: /u01/oracle/oradata/XFF/users01.dbf (0)
 Block: 95               Offsets:    0 to   31           Dba:0x00000000
------------------------------------------------------------------------
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 

 <32 bytes per line>

--这个就是和11gr2的区别,在11g中使用该方法来模拟ORA-08103,直接提示坏块,从而不会出现ORA-08103
BBED> verify
DBVERIFY - Verification starting
FILE = /u01/oracle/oradata/XFF/users01.dbf
BLOCK = 95


DBVERIFY - Verification complete

Total Blocks Examined         : 1
Total Blocks Processed (Data) : 0
Total Blocks Failing   (Data) : 0
Total Blocks Processed (Index): 0
Total Blocks Failing   (Index): 0
Total Blocks Empty            : 1
Total Blocks Marked Corrupt   : 0
Total Blocks Influx           : 0

BBED> d offset 8180
 File: /u01/oracle/oradata/XFF/users01.dbf (0)
 Block: 95               Offsets: 8180 to 8191           Dba:0x00000000
------------------------------------------------------------------------
 00000000 00000000 00000000 

 <32 bytes per line>

BBED> m /x 01010101 offset 8188
BBED-00215: editing not allowed in BROWSE mode


BBED> set mode edit
        MODE            Edit

--修改sumcheck
BBED> m /x 01010101 offset 8188
Warning: contents of previous BIFILE will be lost. Proceed? (Y/N) y
 File: /u01/oracle/oradata/XFF/users01.dbf (0)
 Block: 95               Offsets: 8188 to 8191           Dba:0x00000000
------------------------------------------------------------------------
 01010101 

 <32 bytes per line>

BBED> sum
Check value for File 0, Block 95:
current = 0x0000, required = 0x0000

测试修改为坏块效果

SQL> select count(*) from chf.t_xifenfei;
select count(*) from chf.t_xifenfei
                         *
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 4, block # 95)
ORA-01110: data file 4: '/u01/oracle/oradata/XFF/users01.dbf'


SQL> exec dbms_repair.skip_corrupt_blocks('CHF','T_XIFENFEI');

PL/SQL procedure successfully completed.

SQL> select count(*) from chf.t_xifenfei;

  COUNT(*)
----------
      2918

通过让ORA-08103对应的块变为真正的坏块,然后使用dbms_repair或者event来跳过坏块,达到拯救数据的目的

模拟极端ORA-08103并解决

ORA-08103错误在数据库日程运维和异常恢复中都可能遇到,出现该错误的原因很多,有很多情况下(模拟普通ORA-08103并解决),直接通过修改块使其在数据库查询的时候表标志为坏块,然后使用event或者dbms_repair包来标志该块,然后跳过就可以解决该问题,但是有些时候,遇到极端情况,该方法会失效,需要借助极端工具来处理该极端问题.
分析表相关EXTENT

SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - Production
PL/SQL Release 11.2.0.3.0 - Production
CORE    11.2.0.3.0      Production
TNS for Linux: Version 11.2.0.3.0 - Production
NLSRTL Version 11.2.0.3.0 - Production

SQL> create table t_xifenfei  as
  2   select * from dba_objects;

Table created.

SQL> select count(*) from t_xifenfei;

  COUNT(*)
----------
     74504

SQL> select data_object_id,object_id from dba_objects where object_name='T_XIFENFEI';

DATA_OBJECT_ID  OBJECT_ID
-------------- ----------
         75592      75592

SQL> select file_id,block_id,block_id+blocks-1
  2      from dba_extents
  3   where segment_name ='T_XIFENFEI' AND owner='CHF';

   FILE_ID   BLOCK_ID BLOCK_ID+BLOCKS-1
---------- ---------- -----------------
         4        680               687
         4        688               695
         4        696               703
         4        704               711  <---注意
         4        712               719
         4        720               727
         4        728               735
         …………
24 rows selected.

SQL> Select segment_name,header_file,header_blocK
  2  from dba_segments where
  3  segment_name in ('T_XIFENFEI') and owner='CHF'
  4  ;

SEGMENT_NAME                   HEADER_FILE HEADER_BLOCK
------------------------------ ----------- ------------
T_XIFENFEI                               4          682

SQL> alter system checkpoint;

System altered.

SQL> alter system dump datafile 4 block 682;

System altered.

SQL> select value from v$diag_info where name='Default Trace File';

VALUE
--------------------------------------------------------------------------------
/u01/oracle/diag/rdbms/ora11g/ora11g/trace/ora11g_ora_28933.trc

--trace文件
Dump of memory from 0xB6CFD600 to 0xB6CFF600
B6CFD600 0000A223 010002AA 000DB4EA 04010000  [#...............]
B6CFD610 00008ECF 00000000 00000000 00000000  [................]
B6CFD620 00000000 00000018 00000480 00000A9C  [................]
B6CFD630 00000017 0000003F 00000080 010007BF  [....?...........]
B6CFD640 00000000 00000017 00000000 0000043F  [............?...]
B6CFD650 00000000 00000000 00000000 00000017  [................]
B6CFD660 0000003F 00000080 010007BF 00000000  [?...............]
B6CFD670 00000017 00000000 0000043F 01000780  [........?.......]
B6CFD680 01000780 00000000 00000000 00000000  [................]
B6CFD690 00000000 00000000 00000000 00000000  [................]
        Repeat 3 times
B6CFD6D0 00000001 00002000 00000000 00001434  [..... ......4...]
B6CFD6E0 00000000 010002A9 00000001 01000781  [................]
B6CFD6F0 010002A9 00000000 00000000 00000000  [................]
B6CFD700 00000000 00000000 00000018 00000000  [................]
B6CFD710 00012748 10000000 010002A8 00000008  [H'..............]
B6CFD720 010002B0 00000008 010002B8 00000008  [................]
B6CFD730 010002C0 00000008 010002C8 00000008  [................]   <----dump中找到下面值
……

  Extent Map
  -----------------------------------------------------------------
   0x010002a8  length: 8     
   0x010002b0  length: 8     
   0x010002b8  length: 8     
   0x010002c0  length: 8     <-----选择第四个exent
   0x010002c8  length: 8     
   …………   
  
  Auxillary Map
  --------------------------------------------------------
   Extent 0     :  L1 dba:  0x010002a8 Data dba:  0x010002ab
   Extent 1     :  L1 dba:  0x010002a8 Data dba:  0x010002b0
   Extent 2     :  L1 dba:  0x010002b8 Data dba:  0x010002b9
   Extent 3     :  L1 dba:  0x010002b8 Data dba:  0x010002c0  <---同上
   Extent 4     :  L1 dba:  0x010002c8 Data dba:  0x010002c9
   Extent 5     :  L1 dba:  0x010002c8 Data dba:  0x010002d0
   …………
  --------------------------------------------------------

--确定Extent 3的记录在file 4 block 682 的偏移量为304

------省略了相同部分B6CFD
SQL> SELECT TO_NUMBER('730','XXXXX') FROM DUAL;

TO_NUMBER('730','XXXXX')
------------------------
                    1840

SQL> SELECT TO_NUMBER('600','XXX') FROM DUAL;

TO_NUMBER('600','XXX')
----------------------
                  1536

SQL> SELECT 1840-1536 FROM DUAL;

 1840-1536
----------
       304

SQL> SHUTDOWN IMMEDIATE;
Database closed.
Database dismounted.
ORACLE instance shut down.

这里我们假设修改SEGMENT HEADER中关于EXTENT 3(从0开始计数)的映射地址,从而来使得该表在查询的时候出现ORA-08103错误

bbed修改相关值

[oracle@xifenfei ~]$ bbed parfile=bbed.par
BBED> SET MODE EDIT
        MODE            Edit

BBED> INFO
 File#  Name                                                        Size(blks)
 -----  ----                                                        ----------
     4  /u01/oracle/oradata/ora11g/users01.dbf                               0

BBED> SET FILE 4 BLOCK 682
        FILE#           4
        BLOCK#          682

BBED> D
 File: /u01/oracle/oradata/ora11g/users01.dbf (4)
 Block: 682              Offsets:    0 to  511           Dba:0x010002aa
------------------------------------------------------------------------
 23a20000 aa020001 eab40d00 00000104 cf8e0000 ………… 

 <32 bytes per line>

BBED> m /x 1100
 File: /u01/oracle/oradata/ora11g/users01.dbf (4)
 Block: 682              Offsets:    0 to  511           Dba:0x010002aa
------------------------------------------------------------------------
 11000000 aa020001 eab40d00 00000104 cf8e0000 ………… 

 <32 bytes per line>

BBED> d
 File: /u01/oracle/oradata/ora11g/users01.dbf (4)
 Block: 682              Offsets:    0 to  511           Dba:0x010002aa
------------------------------------------------------------------------
 11000000 aa020001 eab40d00 00000104 cf8e0000 ………… 

 <32 bytes per line>

BBED> sum apply
Check value for File 4, Block 682:
current = 0x2cfd, required = 0x2cfd

重现ORA-08103

SQL> startup
ORACLE instance started.

Total System Global Area  313860096 bytes
Fixed Size                  1344652 bytes
Variable Size             239078260 bytes
Database Buffers           67108864 bytes
Redo Buffers                6328320 bytes
Database mounted.
Database opened.
SQL> conn chf/xifenfei
Connected.
SQL> select count(*) from t_xifenfei;
select count(*) from t_xifenfei
                     *
ERROR at line 1:
ORA-08103: object no longer exists

定位坏块位置

SQL> alter session set max_dump_file_size=unlimited;

Session altered.

SQL> alter session set db_file_multiblock_read_count=1;

Session altered.

SQL> alter session set events 'immediate trace name trace_buffer_on level 1048576';

Session altered.

SQL> alter session set events '10200 trace name context forever, level 1';

Session altered.

SQL> alter session set events '8103 trace name errorstack level 3';

Session altered.

SQL> alter session set events '10236 trace name context forever, level 1';

Session altered.

SQL> alter session set tracefile_identifier='ORA8103';

Session altered.

SQL> select * from chf.t_xifenfei;
select * from chf.t_xifenfei
                  *
ERROR at line 1:
ORA-08103: object no longer exists

--trace文件关键内容
block_row_dump:
tab 0, row 0, @0x1f70
tl: 48 fb: --H-FL-- lb: 0x0  cc: 3
col  0: [ 9]  44 49 43 54 2e 42 41 53 45
col  1: [ 1]  32
col  2: [32]

…………

end_of_block_dump
 
The buffer with tsn: 0 rdba: 0x00400321 has already been dumped
 
The buffer with tsn: 4 rdba: 0x010002aa was pinned,
but could not be dumped

SQL> Select to_number('010002aa','xxxxxxxxxxxxxxxxxx') from dual;

TO_NUMBER('010002AA','XXXXXXXXXXXXXXXXXX')
------------------------------------------
                                  16777898

SQL> select 
  2  dbms_utility.data_block_address_file(16777898) FILE_NO,
  3  dbms_utility.data_block_address_block(16777898) BLOCK_NO
  4  from dual;

   FILE_NO   BLOCK_NO
---------- ----------
         4        682

检查坏块

[oracle@xifenfei ~]$ dbv file=/u01/oracle/oradata/ora11g/users01.dbf

DBVERIFY: Release 11.2.0.3.0 - Production on Fri Jan 13 18:03:13 2012

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

DBVERIFY - Verification starting : FILE = /u01/oracle/oradata/ora11g/users01.dbf
Block Checking: DBA = 16777898, Block Type = Unlimited data segment header with flg blks
Incorrect total map count: 24
Page 682 failed with check code 17006


DBVERIFY - Verification complete

Total Pages Examined         : 2240
Total Pages Processed (Data) : 1421
Total Pages Failing   (Data) : 0
Total Pages Processed (Index): 199
Total Pages Failing   (Index): 0
Total Pages Processed (Other): 229
Total Pages Processed (Seg)  : 0
Total Pages Failing   (Seg)  : 0
Total Pages Empty            : 391
Total Pages Marked Corrupt   : 1
Total Pages Influx           : 0
Total Pages Encrypted        : 0
Highest block SCN            : 898278 (0.898278)

--对应块
SQL> select 
  2  dbms_utility.data_block_address_file(16777898) FILE_NO,
  3  dbms_utility.data_block_address_block(16777898) BLOCK_NO
  4  from dual;

   FILE_NO   BLOCK_NO
---------- ----------
         4        682

通过这里dbv检查发现,该数据库已经是坏块了,也就是说,网上流传的人工标志为坏块的方法在此处行不通,这里我们考虑使用dul等类此工具挖取数据,来拯救数据

dul和odu常规操作

DUL> unload table chf.t_xifenfei;
. unloading table                T_XIFENFEI
DUL: Warning: Using data objno from segment header( 16779199) instead of expected id (75592) from dictionary

DUL: Error: No entry in control file for block: ts# = 4 rfile# = 0 block# = 1087
DUL: Error: While processing unknown file block# 1087
DUL: Error: Could not read/parse data block


ODU> unload table chf.t_xifenfei

Unloading table: T_XIFENFEI,object ID: 75592
Unloading segment,storage(Obj#=75592 DataObj#=75592 TS#=4 File#=4 Block#=682 Cluster=0)
corrupted block
0 rows unloaded

通过上面的测试证明,在该中情况下(SEGMENT HEADER)出现异常时,dul/odu均不能使用最常规的方法挖的数据

扫描数据文件方式挖
因为odu在这个方面的操作人性化于dul,所以只用odu进行相关测试,dul肯定能过实现相同功能

ODU> SCAN EXTENT TABLESPACE 4 PARALLEL 2

scan extent start: 2012-9-23 2:47:51
scanning extent...
scanning extent finished.
scan extent completed: 2012-9-23 2:47:51

ODU> unload table chf.t_xifenfei object  75592

Unloading table: T_XIFENFEI,object ID: 75592
Unloading segment,storage(Obj#=75592 DataObj#=75592 TS#=4 File#=4 Block#=682 Cluster=0)
74504 rows unloaded

试验证明通过odu扫描表空间/数据文件找回来所有的数据,然后truncate掉问题表,导入该数据库,问题可以得意顺利解决

ORA-30013导致RAC 节点down掉

今天一朋友让我帮忙分析他们的9.2.0.2 rac 节点2异常down掉原因,相关信息如下:
前提信息

OS:HP-UX B.11.31
DB:9.2.0.2.0 RAC

节点2alert日志信息

Fri Sep  7 13:13:49 2012
ARC0: Completed archiving  log 11 thread 2 sequence 11651
Fri Sep  7 13:31:56 2012
Errors in file /oracle/admin/agent/udump/agent2_ora_797.trc:
ORA-00600: internal error code, arguments: [kgavsd_3], [0], [], [], [], [], [], []
ORA-00028: your session has been killed
Fri Sep  7 13:31:58 2012
Errors in file /oracle/admin/agent/bdump/agent2_pmon_5938.trc:
ORA-30013: undo tablespace 'UNDOTBS1' is currently in use
Fri Sep  7 13:31:58 2012
PMON: terminating instance due to error 30013
Fri Sep  7 13:31:58 2012
Errors in file /oracle/admin/agent/bdump/agent2_lms7_6033.trc:
ORA-30013: undo tablespace '' is currently in use
Fri Sep  7 13:31:58 2012
…………
Errors in file /oracle/admin/agent/bdump/agent2_lms0_6027.trc:
ORA-30013: undo tablespace '' is currently in use
Fri Sep  7 13:31:58 2012
System state dump is made for local instance
Fri Sep  7 13:32:03 2012
Instance terminated by PMON, pid = 5938
Fri Sep  7 14:34:35 2012

这里可以看到因为ORA-30013的错误使得pmon进程异常,从而使得该rac的节点2 down掉.同时这里还发现了ORA-00600[kgavsd_3]错误,是否是因为该ORA-600导致了数据库异常down还是一个偶然机会,我们继续分析

查看ORA-600[kgavsd_3]相关trace文件

*********START PLSQL RUNTIME DUMP************
***Got ORA-28 while running PLSQL***
***********END PLSQL RUNTIME DUMP************
*** 2012-09-07 13:31:56.740
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [kgavsd_3], [0], [], [], [], [], [], []
ORA-00028: your session has been killed
Current SQL statement for this session:
--用户补档
DECLARE
  OUT_ERR_CODE NUMBER;
  OUT_ERR_MSG  VARCHAR2(1000);
  V_COUNT NUMBER;
BEGIN   
    WHILE TRUE LOOP
      SELECT COUNT(*) INTO V_COUNT FROM amc_stat_log  where proc_name in('pRunOdsChannelWareData') 
       and run_param=201208 AND STATE='A';
      IF V_COUNT>0 THEN
         dbms_output.put_line('exit loop '|| sysdate);
         EXIT; 
      END IF;
      sys.Dbms_Lock.sleep(600);
      dbms_output.put_line('wake up '|| sysdate);
    END LOOP;
   
   PKG_AME_ODS_DATA.P_Add_TO_AgentServ(201208,OUT_ERR_CODE,OUT_ERR_MSG);
   PKG_AME_ODS_DATA.P_Update_Serv_Ware_ID(201208, OUT_ERR_CODE, OUT_ERR_MSG);
   PKG_AMS_SETTLE.P_COMMISION_51PRE_FLAG(201208,OUT_ERR_CODE, OUT_ERR_MSG); 
END;
----- PL/SQL Call Stack -----
  object      line  object
  handle    number  name
c000000e84ec59d0        14  anonymous block
----- Call Stack Trace -----
calling              call     entry                argument values in hex      
location             type     point                (? means dubious value)     
-------------------- -------- -------------------- ----------------------------
Cannot find symbol in .
Cannot find symbol in .
Cannot find symbol in .
ksedmp()+512         call     9fffffffffff3940     000000000 ?
                                                   C000000000000A17 ?
                                                   40000000025B6540 ?
ksfdmp()+64          call     9fffffffffff3940     000000003 ?
kgerinv()+352        call     9fffffffffff3940     60000000000466B0 ?
                                                   000000003 ?
                                                   C000000000000714 ?
                                                   4000000004EBA6A0 ?
                                                   00001821B ?
                                                   6000000000468EA8 ?
kgesinv()+48         call     9fffffffffff3940     60000000000466B0 ?
                                                   600000000059BD98 ?
                                                   600000000046B070 ?
                                                   60000000000179C0 ?
                                                   6000000000017950 ?
kgesin()+112         call     9fffffffffff3940     60000000000466B0 ?
                                                   600000000059BD98 ?
                                                   4000000000B44C10 ?
                                                   000000001 ?
                                                   9FFFFFFFFFFF4310 ?
$cold_kgavsd_stackl  call     9fffffffffff3940     60000000000466B0 ?
et_done()+1184                                     600000000059BD98 ?
                                                   4000000000B44C10 ?
                                                   000000001 ? 000000000 ?
                                                   000000000 ? 000000000 ?
                                                   00001FE19 ?
pbesd_stacklet_done  call     9fffffffffff3940     60000000000466B0 ?
()+96                                              000000002 ? 000000000 ?
                                                   9FFFFFFFBEC6AE68 ?
pfrrun()+3328        call     9fffffffffff3940     9FFFFFFFBEC6AE68 ?
                                                   C000000000001D42 ?
                                                   4000000001ABBDA0 ?
                                                   9FFFFFFFBEC6B06C ?
                                                   9FFFFFFFFFFF64F0 ?
                                                   6000000000468EA8 ?
                                                   000000000 ? 000000000 ?
peicnt()+480         call     0000000000000000     9FFFFFFFBEC6AEE8 ?
                                                   C000000000000996 ?
                                                   40000000019F3680 ?
kkxexe()+816         call     0000000000000000     9FFFFFFFFFFF64F0 ?
                                                   9FFFFFFFBEC6AE68 ?
                                                   C00000000000099B ?
                                                   4000000001DD90F0 ?
                                                   00000FE4F ?
                                                   9FFFFFFFFFFF5F00 ?
                                                   60000000000467F0 ?
                                                   4000000000B603F0 ?
opiexe()+11168       call     0000000000000000     000000000 ?
                                                   C000000000002C60 ?
                                                   4000000001BEF980 ?
                                                   00000800F ?
                                                   9FFFFFFFFFFF6470 ?
                                                   9FFFFFFFBEC6AEB2 ?
                                                   6000000000040900 ?
                                                   9FFFFFFFBEC6B534 ?
opiall0()+3184       call     _etext_f()+23058430  000000004 ? 0000000C4 ?
                              09110686928          9FFFFFFFFFFF7B40 ?
                                                   C000000000002BDF ?
                                                   4000000001B26CD0 ?
                                                   000000000 ? 00000C893 ?
                                                   9FFFFFFFFFFF6690 ?
Cannot find symbol in .
kpoal8()+2064        call     9fffffffffff7ad0     000000001 ?
                                                   9FFFFFFFFFFF8304 ?
                                                   FFFFFFFFBFFFFFFF ?
                                                   9FFFFFFFFFFF83E4 ?
                                                   FFFFFFFFFFE7FBDF ?
                                                   9FFFFFFFFFFF7B88 ?
                                                   000000000 ?
                                                   6000000000474528 ?
opiodr()+3584        call     9fffffffffff81fc     6000000000040950 ?
                                                   000000000 ? 000000000 ?
                                                   C000000000002C60 ?
                                                   4000000001C09FE0 ?
                                                   00000C50B ?
                                                   9FFFFFFFFFFF81F0 ?
                                                   9FFFFFFFFFFF81D8 ?
ttcpip()+3776        call     _etext_f()+23058430  00000005E ? 000000014 ?
                              09114957288          9FFFFFFFFFFFA5F0 ?
                                                   6000000000040918 ?
                                                   C000000000001ABD ?
                                                   4000000001AB3BA0 ?
                                                   000000000 ? 00000C59B ?
opitsk()+1872        call     9fffffffffffa200     6000000000049FC0 ?
                                                   000000001 ?
                                                   9FFFFFFFFFFFA5F0 ?
                                                   000000001 ?
                                                   9FFFFFFFFFFFA740 ?
                                                   9FFFFFFFFFFFA564 ?
                                                   9FFFFFFFBF780058 ?
                                                   000000000 ?
opiino()+3184        call     000000000000057b     000000000 ? 000000000 ?
                                                   C00000000000132B ?
                                                   4000000001F78730 ?
                                                   000008001 ?
opiodr()+3584        call     0000000000000000     6000000000548A38 ?
                                                   4000000000B606F0 ?
                                                   6000000000548A38 ?
                                                   C000000000002C60 ?
                                                   4000000001C09FE0 ?
                                                   00000A201 ?
                                                   9FFFFFFFFFFFBC90 ?
                                                   4000000000B606F0 ?
opidrv()+976         call     _etext_f()+23058430  00000003C ? 000000004 ?
                              09114957288          9FFFFFFFFFFFEFB0 ?
                                                   6000000000040918 ?
sou2o()+80           call     _etext_f()+23058430  000000004 ? 000000004 ?
                              09114957288          9FFFFFFFFFFFEFB0 ?
main()+352           call     _etext_f()+23058430  9FFFFFFFFFFFEFD0 ?
                              09114957288          9FFFFFFFFFFFEFD4 ?
                                                   60000000004744F0 ?
                                                   9FFFFFFFFFFFEFB0 ?
main_opd_entry()+80  call     _etext_f()+23058430  000000000 ?
                              09114957288          9FFFFFFFFFFFF498 ?
                                                   C000000000000004 ?
                                                   C00000000002BE30 ?
 
--------------------- Binary Stack Dump ---------------------

Process global information:
     process: c000000d6428c0c0, call: c000000e46e772a8, xact: 0000000000000000, 
     curses: c000000d6437d020, usrses: c000000d6437d020
  ----------------------------------------
  SO: c000000d6428c0c0, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
  (process) Oracle pid=282, calls cur/top: c000000e46e772a8/c000000e46e772a8, flag: (0) -
            int error: 28, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 0 0 0
              last post received-location: No post
              last process to post me: c000000d64234f18 1 6
              last post sent: 0 0 104
              last post sent-location: kglpsl: in loop
              last process posted by me: c000000d6428e900 23 0
    (latch info) wait_event=0 bits=0
    Process Group: DEFAULT, pseudo proc: c000000d62234ee0
    O/S info: user: oracle, term: UNKNOWN, ospid: 797
    OSD pid info: Unix process pid: 797, image: oracle@gzagent2 (TNS V1-V3)
    ----------------------------------------
    SO: c000000d6437d020, type: 4, owner: c000000d6428c0c0, flag: INIT/-/-/0x00
    (session) trans: 0000000000000000, creator: c000000d6428c0c0, flag: (41) USR/- BSY/-/-/-/KIL/-
              DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000
              txn branch: 0000000000000000
              oct: 0, prv: 0, sql: c000000e76b03f10, psql: 0000000000000000, user: 31/CUSTOM
    O/S info: user: huangqianhai_lc, term: SVCTAG-D1MLV2X, ospid: 11124:11796, machine: WORKGROUP\SVCTAG-D1MLV2X
              program: plsqldev.exe
    application name: PL/SQL Developer, hash value=1190136663
    action name: 测试窗口 - 新建, hash value=3604520210
    last wait for 'null event' blocking sess=0x0 seq=142 wait_time=567341620
                =ea60, =0, =0
    temporary object counter: 0
      ----------------------------------------

通过这里可以看出来是因为pl/sql dev进行一个plsql的操作导致该错误发生,查询MOS[ID 403575.1]发现

Applies to:

Oracle Server - Enterprise Edition - Version: 9.2.0.7 and later   [Release: 9.2 and later ]
Information in this document applies to any platform.
***Checked for relevance on 27-Oct-2010***
Symptoms

The following errors appears in the alert log file :

Probe:read_pipe: receive failed, status 3
Probe:S:debug_loop: timeout. Action 1

*********START PLSQL RUNTIME DUMP************
***Got ORA-604 while running PLSQL***
***********END PLSQL RUNTIME DUMP************

ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [kgavsd_3], [0], [], [], [], [], [], []
ORA-00604: error occurred at recursive SQL level 2

Current SQL statement for this session:
begin :id := sys.dbms_transaction.local_transaction_id; end;


.
Cause

The ora-600 kgavsd_3 appears when calling kgavsd_stacklet_done function which is related to PLSQL DEBUG.

From traces, dbms_debug package is being used during trace generation.
The return code of 3 further indicates that the dbms_pipe message was interrupted.

Probably user cancelled a plsql program, so the appeared while trying to dump the stack

Solution

There is no data corruption over here. 
The error appears to be due to abnormal termination of aPL/SQL Developer application while executing a PL/SQL block.
Changing the PL/SQL and/or the procedure code could help in avoiding this error message.

Hence, this error can be safely ignored.

查找trace文件确实发现有name=SYS.DBMS_DEBUG,进一步表明该错误是由于plsql dev工具使用debug模式运行上面的plsql而引起该错误的发生,但是因为mos中记录和错误不是完全的一致,所以不能十分确定是该错误导致数据库down掉

继续分析ORA-30013

Error:      ORA-30013  (ORA-30013)
Text:      undo tablespace '%s' is currently in use 
---------------------------------------------------------------------------
Cause:    the specified undo tablespace is currently used by another instance. 
Action:    Wait for the undo tablespace to become available or change to another name and reissue the statement.

这个说明是没有疑问的:因为2节点配置的当前undo是UNDOTBS2,而UNDOTBS1是1节点使用的,证明这里的undo确实发生了错误,继续查询mos发现Bug 3368552

Hdr: 3368552 9.2.0.3 RDBMS 9.2.0.3 RAC PRODID-5 PORTID-23
Abstract: RAC:  ORA-30013 WHEN INSTANCE 2 ATTEMPTS TO ACCESS UNDO TABLESPACE OF INSTANCE 1
 
*** 01/12/04 06:21 am ***
TAR:
----
3554549.995
 
PROBLEM:
--------
The RAC database has been stable, but experienced an instance termination due 
to ORA-30031 error in the alert log (instance 2):
 
...
Tue Dec 23 03:01:46 2003
ARC1: Evaluating archive  log 4 thread 2 sequence 1116
ARC1: Beginning to archive log 4 thread 2 sequence 1116
Creating archive destination LOG_ARCHIVE_DEST_1: 
'/oracle/oradata/VLDB/logs/archives/VLDBN2/VLDB_0000001116_0002.arc'
ARC1: Completed archiving  log 4 thread 2 sequence 1116
Tue Dec 23 08:14:09 2003
Errors in file /oracle/admin/VLDB/bdump/vldbn2_pmon_22860.trc:
ORA-30013: undo tablespace 'UNDOTBS1' is currently in use
Tue Dec 23 08:14:09 2003
PMON: terminating instance due to error 30013
Tue Dec 23 08:14:10 2003
System state dump is made for local instance
Tue Dec 23 08:14:12 2003
Trace dumping is performing id=[cdmp_20031223081410]
Tue Dec 23 08:14:14 2003
Instance terminated by PMON, pid = 22860
<eof>
 
Instance 1 alert log shows only the reconfiguration and the cdump info:
..
Tue Dec 23 03:54:13 2003
Errors in file /oracle/admin/VLDB/udump/vldbn1_ora_13564.trc:
ORA-27037: unable to obtain file status
SVR4 Error: 2: No such file or directory
Additional information: 4
Tue Dec 23 08:14:10 2003
Trace dumping is performing id=[cdmp_20031223081410]
Tue Dec 23 08:14:12 2003
Reconfiguration started
List of nodes: 0,
Global Resource Directory frozen
one node partition
Communication channels reestablished
...

因为在9.2.0.3的RAC中有着该bug,那么我们可以大胆猜测在9.2.0.2中应该存在该bug,那么结合上面的ORA-00600[kgavsd_3]错误,我们大概还原该事故的全部:
1.节点1 dml操作了程序中报错的plsql中要范围的部分表对象,但是未提交(或者正在执行)
2.节点2 有用户使用pl/sql dev去执行程序中的plsql,因为是debug模式执行,需要UNDOTBS1的块来构建cr,从而使得节点2去访问UNDOTBS1,引发了Bug 3368552 从而使得数据库直接kill掉该plsql dev会话,进而出现ORA-00600[kgavsd_3]错误和pmon进程异常使得节点2 down掉