ORA-600[4194]/[4193]解决

朋友的库启动的时候出现ORA-600[4194]/[4193]错误

Tue Feb 14 09:34:11 2012
Errors in file d:\oracle\product\10.2.0\admin\interlib\bdump\interlib_smon_2784.trc:
ORA-01595: error freeing extent (2) of rollback segment (3))
ORA-00607: Internal error occurred while making a change to a data block
ORA-00600: internal error code, arguments: [4194], [6], [30], [], [], [], [], []

Tue Feb 14 09:35:34 2012
Errors in file d:\oracle\product\10.2.0\admin\interlib\udump\interlib_ora_2824.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-00600: internal error code, arguments: [4193], [2005], [2008], [], [], [], [], []
ORA-00600: internal error code, arguments: [4193], [2005], [2008], [], [], [], [], []

Tue Feb 14 09:36:30 2012
DEBUG: Replaying xcb 0x1fa24174, pmd 0x1fba06d4 for failed op 8
Doing block recovery for file 2 block 177
No block recovery was needed
Tue Feb 14 09:37:30 2012
Errors in file d:\oracle\product\10.2.0\admin\interlib\bdump\interlib_pmon_2732.trc:
ORA-00600: internal error code, arguments: [4193], [2005], [2008], [], [], [], [], []

Tue Feb 14 09:37:31 2012
Errors in file d:\oracle\product\10.2.0\admin\interlib\bdump\interlib_pmon_2732.trc:
ORA-00600: internal error code, arguments: [4193], [2005], [2008], [], [], [], [], []

从这里可以看到出现了ORA-600[4194]/[4193],第一感觉就是undo出现问题。
4193:表示undo和redo不一致(Arg [a] Undo record seq number,Arg [b] Redo record seq number );
4194:表示也是undo和redo不一致(Arg [a] Maximum Undo record number in Undo block,Arg [b] Undo record number from Redo block)
至于为什么有时候会只出现其中一个,我不太清楚,求答案

直接设置了下面参数,数据库就意外的open成功,这位朋友比较幸运

undo_tablespace=SYSTEM
undo_management=MANUAL

既然库已经open,然后新建undo空间,删除出问题的undo,做如下修改,数据库恢复完成

undo_tablespace=新undo
undo_management=AUTO

如果出现极端的情况可能需要做如下处理:
1.使用_offline_rollback_segments和_corrupted_rollback_segments屏蔽掉有问题的undo segment
2.继续可能出现ora-600[2662],需要推进scn

Query Duration=0与ORA-01555

1.ALERT日志错误
奇怪之处:Query Duration=0 sec,竟然出现了ORA-01555

Tue Feb  7 02:41:34 2012
ORA-01555 caused by SQL statement below (Query Duration=0 sec, SCN: 0x0b2e.efcd78a9):
Tue Feb  7 02:41:34 2012
SELECT "ID_NO","CUST_ID" FROM "DBACCADM"."DCUSTMSG" "C" WHERE "ID_NO"=:1

2.ORA-01555解释
超过了undo_retention时间,undo被覆盖导致ORA-01555

[zwq_acc1:/home/oraeye/check]oerr ora 1555
01555, 00000, "snapshot too old: rollback segment number %s with name \"%s\" too small"
// *Cause: rollback records needed by a reader for consistent read are
//         overwritten by other writers
// *Action: If in Automatic Undo Management mode, increase undo_retention
//          setting. Otherwise, use larger rollback segments

3.数据库版本

SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle9i Enterprise Edition Release 9.2.0.8.0 - 64bit Production
PL/SQL Release 9.2.0.8.0 - Production
CORE    9.2.0.8.0       Production
TNS for IBM/AIX RISC System/6000: Version 9.2.0.8.0 - Production
NLSRTL Version 9.2.0.8.0 - Production

4.undo基本信息
从这里可以发现,两个节点的undo表空间还有很多剩余,缺发生了undo被覆盖从而出现了ORA-01555

SQL> col name for a20
SQL> col value for a15
SQL> SELECT INST_ID, NAME, VALUE
  2    FROM GV$PARAMETER
  3   WHERE UPPER (Name) LIKE '%UNDO%';

   INST_ID NAME                 VALUE
---------- -------------------- ---------------
         1 undo_management      AUTO
         1 undo_tablespace      UNDOTBS1
         1 undo_suppress_errors FALSE
         1 undo_retention       1800
         2 undo_management      AUTO
         2 undo_tablespace      UNDOTBS2
         2 undo_suppress_errors FALSE
         2 undo_retention       1800

8 rows selected.


TABLESPACE_NAME                CURRENT_TOTAL(MB)   USED(MB)   FREE(MB)      FREE% AUT MAX_TOTAL(MB)
------------------------------ ----------------- ---------- ---------- ---------- --- -------------
UNDOTBS1                                   40950    1587.94 39362.0625      96.12 NO          40950
UNDOTBS2                                   57330    1926.31 55403.6875      96.64 NO          57330

SQL> SELECT DISTINCT STATUS ,
  2                  COUNT(*) "EXTENT_NUM",
  3                  SUM(BYTES) / 1024 / 1024 / 1024 "UNDO(G)"
  4    FROM DBA_UNDO_EXTENTS
  5   GROUP BY STATUS;

STATUS    EXTENT_NUM    UNDO(G)
--------- ---------- ----------
ACTIVE           208 .273658752
EXPIRED         7651 2.42865753
UNEXPIRED        941 .752548218

查询MOS[ID 761128.1],发现可能是Oracle bug导致(BUG:6799685 – ORA-1555 ERROR WITH QUERY DURATION=0 AND UNDO_RETENTION=1800和BUG:5475085 – V$UNDOSTAT.EXPBLKREUCNT IS NEVER INCREMENTED)

5.解决方法
Increase the size of the UNDO tablespace and increase the UNDO_RETENTION parameter value to try to prevent required undo expiring too quickly.
基于本库,因为undo空间还有很大剩余,直接设置UNDO_RETENTION=3600即可(可以从一定程度上缓解整个问题,但是要从根本上解决整个问题,需要升级到10.2.0.4及其以上版本)

StatSpack报告中redo size为负数

在一份statspack报告中发现redo size 为负数

DB Name         DB Id    Instance     Inst Num Release     Cluster Host
------------ ----------- ------------ -------- ----------- ------- ------------
CRM           3413823439 crm2                2 9.2.0.8.0   YES     zwq_crm2

              Snap Id     Snap Time      Sessions Curs/Sess Comment
            --------- ------------------ -------- --------- -------------------
Begin Snap:     47654 05-Feb-12 11:00:04    2,301      20.0
  End Snap:     47655 05-Feb-12 12:00:02    2,298      20.3
   Elapsed:               59.97 (mins)

Cache Sizes (end)
~~~~~~~~~~~~~~~~~
               Buffer Cache:    36,448M      Std Block Size:          8K
           Shared Pool Size:    10,240M          Log Buffer:     20,480K

Load Profile
~~~~~~~~~~~~                            Per Second       Per Transaction
                                   ---------------       ---------------
  负数=====>      Redo size:           -359,121.01             -7,828.69
              Logical reads:            349,787.58              7,625.22
              Block changes:              5,656.13                123.30
             Physical reads:             12,521.51                272.96
            Physical writes:                427.44                  9.32
                 User calls:             22,223.63                484.47
                     Parses:              4,673.27                101.88
                Hard parses:                 46.78                  1.02
                      Sorts:              4,027.70                 87.80
                     Logons:                 10.57                  0.23
                   Executes:             10,480.35                228.47
               Transactions:                 45.87

一时之间感觉很奇怪,在运行的数据库redo size不可能为负数(只要数据库在运行redo size都应该大于0).查询MOS[ID 308414.1]发现原来是一个bug引起(Bug:1713403 NEGATIVE VALUE IN V$SESSTAT FOR STATISTIC REDO SIZE),sp中的redo size其本质还是来源于V$SESSTAT.

Applies to:
Oracle Server - Enterprise Edition - Version: 8.1.7.4 to 9.2.0.8
This problem can occur on any platform.

Symptoms
Redo Size is displayed as a Negative number in a Statspack report.

For example:

Load Profile
~~~~~~~~~~~~ Per Second Per Transaction
             --------------- ---------------
Redo size:        -17,931.33      -55,277.73
Logical reads:     31,095.80       95,860.43
...


Cause
Bug:1713403 NEGATIVE VALUE IN V$SESSTAT FOR STATISTIC REDO SIZE
Overflow of 'redo size' statistic. This is fixed in Oracle10g and above.
Patches do not exist for earlier releases.

Solution
Use 'redo blocks written' instead to measure the amount of redo. 
Ignore negative 'redo size'.

Easy Connect Naming Method与EZCONNECT关系

Easy Connect Naming Method这个东西是Oracle 10g推出的东东,我想不用我解释它的好,dba和开发人员都喜欢它,以前一直都用它,从没有关注到它和sqlnet.ora中的NAMES.DIRECTORY_PATH的关系,昨天一朋友和我说到了EZCONNECT,今天查询了一些资料和做了一些实验,使得自己对NAMES.DIRECTORY_PATH和EZCONNECT有了新的认识,也怪自己一致忽略了这个知识点。

1.NAMES.DIRECTORY_PATH= (TNSNAMES)

[oracle@node1 samples]$ more $ORACLE_HOME/network/admin/sqlnet.ora
#NAMES.DIRECTORY_PATH= (EZCONNECT)
NAMES.DIRECTORY_PATH= (TNSNAMES)


[oracle@node1 samples]$ sqlplus hr/xifenfei@127.0.0.1/ecp

SQL*Plus: Release 10.2.0.5.0 - Production on Wed Feb 8 23:12:12 2012

Copyright (c) 1982, 2010, Oracle.  All Rights Reserved.

ERROR:
ORA-12154: TNS:could not resolve the connect identifier specified


Enter user-name: 
ERROR:
ORA-01017: invalid username/password; logon denied


Enter user-name: 
ERROR:
ORA-01017: invalid username/password; logon denied


SP2-0157: unable to CONNECT to ORACLE after 3 attempts, exiting SQL*Plus

证明sqlnet.ora中的NAMES.DIRECTORY_PATH设置为TNSNAMES时,Easy Connect Naming Method不能工作。

2.NAMES.DIRECTORY_PATH= (EZCONNECT)

[oracle@node1 samples]$ more $ORACLE_HOME/network/admin/sqlnet.ora
NAMES.DIRECTORY_PATH= (EZCONNECT)
#NAMES.DIRECTORY_PATH= (TNSNAMES)


[oracle@node1 samples]$ sqlplus hr/xifenfei@127.0.0.1/ecp

SQL*Plus: Release 10.2.0.5.0 - Production on Wed Feb 8 23:13:21 2012

Copyright (c) 1982, 2010, Oracle.  All Rights Reserved.


Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> 

登录成功,其实这里生效是因为设置了NAMES.DIRECTORY_PATH为EZCONNECT的功效,从EZCONNECT的上面也可以看出是

3.NAMES.DIRECTORY_PATH为默认值

[oracle@node1 samples]$ more $ORACLE_HOME/network/admin/sqlnet.ora
#NAMES.DIRECTORY_PATH= (EZCONNECT)
#NAMES.DIRECTORY_PATH= (TNSNAMES)
[oracle@node1 samples]$ sqlplus hr/xifenfei@127.0.0.1/ecp

SQL*Plus: Release 10.2.0.5.0 - Production on Wed Feb 8 23:13:49 2012

Copyright (c) 1982, 2010, Oracle.  All Rights Reserved.


Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> 

当NAMES.DIRECTORY_PATH为默认值时,使用Easy Connect Naming Method也可以登录成功(正是因为这个原因,导致我忽略了EZCONNECT的存在).

4.sqlnet.ora中关于NAMES.DIRECTORY_PATH说明

#names.directory_path
#
#Syntax:  <adapter-name>
#Default: TNSNAMES,ONAMES,HOSTNAME
#
# Sets the (ordered) list of naming adaptors to use in resolving a name.
# The default is as shown for 3.0.2 of sqlnet onwards. The default was
# (TNSNAMES, ONAMES) before that. The value can be presented without
# parentheses if only a single entry is being specified. The parameter is
# recognized from version 2.3.2 of sqlnet onward. Acceptable values include:
#  TNSNAMES -- tnsnames.ora lookup
#  ONAMES   -- Oracle Names
#  HOSTNAME -- use the hostname (or an alias of the hostname)
#  NIS      -- NIS (also known as "yp")
#  CDS      -- OSF DCE's Cell Directory Service
#  NDS      -- Novell's Netware Directory Service

5.关于NAMES.DIRECTORY_PATH参数的补充说明

tnsnames:local naming naming method
Set to resolve a net service name through the tnsnames.ora file on the client.

hostname:host naming method
Set to resolve a host name alias through an existing names resolution service or a centrally-maintained set of /etc/hosts files.

onames:Oracle Names method
Set to resolve database objects through a Oracle Names server.

ldap:directory naming naming method
Set to resolve a database service name, net service name, or net service alias through a directory server.

cds:Cell Directory Services (CDS) external naming method
Set to resolve an Oracle database name in a Distributed Computing Environment (DCE) environment.

nis:Network Information Service (NIS) external naming method
Set to resolve service information through an existing NIS.

Ezconnect:The easy connect naming method eliminates the need for service name lookup in the tnsnames.ora files for TCP/IP environments; in fact, no naming or directory system is required if you use this method.

ORA-00600[4454]

数据库版本信息

SQL> select * from gv$version;

   INST_ID BANNER
---------- ----------------------------------------------------------------
         1 Oracle9i Enterprise Edition Release 9.2.0.8.0 - 64bit Production
         1 PL/SQL Release 9.2.0.8.0 - Production
         1 CORE 9.2.0.8.0       Production
         1 TNS for IBM/AIX RISC System/6000: Version 9.2.0.8.0 - Production
         1 NLSRTL Version 9.2.0.8.0 - Production
         2 Oracle9i Enterprise Edition Release 9.2.0.8.0 - 64bit Production
         2 PL/SQL Release 9.2.0.8.0 - Production
         2 CORE 9.2.0.8.0       Production
         2 TNS for IBM/AIX RISC System/6000: Version 9.2.0.8.0 - Production
         2 NLSRTL Version 9.2.0.8.0 - Production

10 rows selected.

早上检查某运营商的结算数据库时发现

Mon Feb  6 16:03:23 2012
Errors in file /oracle9/app/admin/settl/udump/settl1_ora_1355948.trc:
ORA-00600: internal error code, arguments: [4454], [], [], [], [], [], [], []
Mon Feb  6 16:03:23 2012
Errors in file /oracle9/app/admin/settl/udump/settl1_ora_1355948.trc:
ORA-00600: internal error code, arguments: [4454], [], [], [], [], [], [], []

trace文件信息

*** SESSION ID:(100.35758) 2012-02-06 16:03:23.223
*** 2012-02-06 16:03:23.223
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [4454], [], [], [], [], [], [], []
Current SQL statement for this session:
select trim(operator_id) into :b1  from b_sys_proc where program_name=:b2
----- Call Stack Trace -----
calling              call     entry                argument values in hex      
location             type     point                (? means dubious value)     
-------------------- -------- -------------------- ----------------------------
ksedmp+0148          bl       ksedst               1029746CC ?
ksfdmp+0018          bl       01FD4010             
kgeriv+0118          bl       _ptrgl               
kgeasi+00cc          bl       kgeriv               1101FAF48 ? 110248038 ?
                                                   00000000C ? 110235978 ?
                                                   000000040 ?
ktcsptg+00c4         bl       kgeasi               1100062D8 ? 110389E88 ?
                                                   116600001166 ? 200000002 ?
                                                   000000000 ? 70000010738F0E8 ?
                                                   0FFFFFFFF ? 0FFFFFFFF ?
opiexe+0524          bl       ktcsptg              000000000 ? 000000000 ?
opiefn0+01c0         bl       opiexe               49FFFFA640 ? 4900000001 ?
                                                   FFFFFFFFFFFA5C8 ?
opiefn+0100          bl       opiefn0              01000CF60 ? 1029C61B0 ?
                                                   000000002 ? FFFFFFFFFFFC1BC ?
                                                   000000001 ? 00000000C ?
                                                   00000000B ? 110061F50 ?
opiodr+08cc          bl       _ptrgl               
ttcpip+0cc4          bl       _ptrgl               
opitsk+0d60          bl       ttcpip               11000CF60 ? 000000000 ?
                                                   000000000 ? 000000000 ?
                                                   000000000 ? 000000000 ?
                                                   000000000 ? 000000000 ?
opiino+0758          bl       opitsk               000000000 ? 000000000 ?
opiodr+08cc          bl       _ptrgl               
opidrv+032c          bl       opiodr               3C00000018 ? 4101FAF48 ?
                                                   FFFFFFFFFFFF790 ? 0A000EEA8 ?
sou2o+0028           bl       opidrv               3C0C000000 ? 4A0147B50 ?
                                                   FFFFFFFFFFFF790 ?
main+0138            bl       01FD3A28             
__start+0098         bl       main                 000000000 ? 000000000 ?
………………
SO: 70000010738f0e8, type: 4, owner: 7000001043acd90, flag: INIT/-/-/0x00
    (session) trans: 70000010dde1dd8, creator: 7000001043acd90, flag: (100041) USR/- BSY/-/-/-/-/-
              DID: 0001-0064-7625733A, short-term DID: 0000-0000-00000000
              txn branch: 7000001147dbed8
              oct: 0, prv: 0, sql: 700000117b7f778, psql: 0, user: 24/SETTLE
    O/S info: user: settle, term: , ospid: 1400836, machine: zwq_jies2
              program: /settlement/pkg01/bin/long/long_app@zwq_jies2 (T
    application name: /settlement/pkg01/bin/long/long_app@zwq_jies2 (T, hash value=0
    last wait for 'SQL*Net message from client' blocking sess=0x0 seq=483 wait_time=240
                driver id=54435000, #bytes=1, =0

查询MOS,发现是Bug# 1402161(虽然在mos中声明该bug是在9.2.0.1中已经被修复,但是这里的trace文件中的Call Stack Trace和mos中记录一致,而且网络上也存在很多9.2.0.8中关于该bug的情况),trace文件最后一段可以看出是在节点2上(zwq_jies2)运行/settlement/pkg01/bin/long/long_app程序导致出现该错误
产生该错误原因:

This exception is signalled because the savepoint number is not what was
  expected.
The current transaction savepoint is less than the beginning savepoint of 
  the transaction.

解决方法:
把/settlement/pkg01/bin/long/long_app中的需要运行的程序分割成几个小程序运行。

Break the job into smaller chunks and reconnect for each part of the job 
  to reset the savepoint number.