手动提交分布式事务一例

一.alert文件中出现了很多类此记录

Fri Feb 10 05:25:01 2012
Errors in file /tmp/recover/ahcx216_reco_7956.trc:
ORA-12154: TNS:could not resolve service name
Fri Feb 10 05:25:01 2012
Errors in file /tmp/recover/ahcx216_reco_7956.trc:
ORA-12154: TNS:could not resolve service name

这里可以看出来两个信息:
1.出错的进程是rec0进程,而该进程的作用是解决分布式事务失败后遗留问题(事务提交或者回滚)
2.错误信息是tns不能被解析

二.查看trace文件

/tmp/recover/ahcx216_reco_7956.trc
Oracle9i Enterprise Edition Release 9.2.0.8.0 - Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.8.0 - Production
ORACLE_HOME = /opt/oracle/product/9.2.0/db_1
System name:    Linux
Node name:      localhost.localdomain
Release:        2.6.9-89.0.0.0.1.ELhugemem
Version:        #1 SMP Tue May 19 04:38:38 EDT 2009
Machine:        i686
Instance name: ahcx216
Redo thread mounted by this instance: 1
Oracle process number: 7
Unix process pid: 7956, image: oracle@localhost.localdomain (RECO)


*** SESSION ID:(6.1) 2012-02-10 04:58:24.886
*** 2012-02-10 04:58:24.886
ERROR, tran=6.1.712757, session#=1, ose=0:
ORA-12154: TNS:could not resolve service name
………………
*** 2012-02-10 05:25:01.580
ERROR, tran=6.1.712757, session#=1, ose=0:
ORA-12154: TNS:could not resolve service name
*** 2012-02-10 05:25:01.627
ERROR, tran=12.19.99059, session#=1, ose=0:
ORA-12154: TNS:could not resolve service name

通过这里我们可以看出事务id分别为12.19.99059和6.1.712757有问题

三.查看dba_2pc_pending视图

SQL> select local_tran_id,state,fail_time,retry_time from dba_2pc_pending;

LOCAL_TRAN_ID     STATE            FAIL_TIME           RETRY_TIME
-------------- ---------------- --------------       --------------
6.1.712757	collecting	2010/12/1 13:39:03   2012/2/10 5:38:52
12.19.99059	collecting	2010/12/1 15:56:26   2012/2/10 5:38:52

可以看出,果然有两个分布式事务在2010年12月1日出了问题(本库是一个问题库,在把库拉起来后发现该问题)
因为是异机恢复,而且间隔时间较长,很多tns的信息都已经不存在,所以需要手工提交分布式事务

四.手动提交事务

SQL> BEGIN
  2  DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('6.1.712757');
  3  DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('12.19.99059');
  4  END;
  5  /
BEGIN
*
第 1 行出现错误:
ORA-30019: Illegal rollback Segment operation in Automatic Undo mode
ORA-06512: at "SYS.DBMS_TRANSACTION", line 65
ORA-06512: at "SYS.DBMS_TRANSACTION", line 85
ORA-06512: at line 2

SQL> alter session set "_smu_debug_mode"=4;

Session altered.

-------------------------------------------------------------
--设置UNDO_SUPPRESS_ERRORS=true也可以解决此问题
alter system set UNDO_SUPPRESS_ERRORS = TRUE;
EXECUTE DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('<事务ID>');
commit;
alter system set UNDO_SUPPRESS_ERRORS = false;
--------------------------------------------------------------

SQL> commit;

SQL> exec DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('6.1.712757');

PL/SQL procedure successfully completed.


SQL> exec DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('12.19.99059');
BEGIN DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('12.19.99059'); END;

*
ERROR at line 1:
ORA-01453: SET TRANSACTION must be first statement of transaction
ORA-06512: at "SYS.DBMS_TRANSACTION", line 65
ORA-06512: at "SYS.DBMS_TRANSACTION", line 85
ORA-06512: at line 1
--第一个分布式事务处理后,未提交导致

SQL> commit;

Commit complete.

SQL> exec DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('12.19.99059');

PL/SQL procedure successfully completed.

SQL> commit;

Commit complete.

五.补充说明
开始在另一个会话中,执行失败原因

SQL> alter session set "_smu_debug_mode"=4;

会话已更改。

SQL> commit;

提交完成。

SQL> BEGIN
  2  DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('6.1.712757');
  3  DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('12.19.99059');
  4  END;
  5  /
BEGIN
*
第 1 行出现错误:
ORA-01453: SET TRANSACTION must be first statement of transaction
ORA-06512: at "SYS.DBMS_TRANSACTION", line 65
ORA-06512: at "SYS.DBMS_TRANSACTION", line 85
ORA-06512: at line 3

因为这里的begin end中包含了两个事务的清理,在清理完第一个事务之后,需要提交才能够清理第二个,这里因为没有提交,导致ORA-01453错误。

sqlplus版本不正确,导致ORA-12560

sqlplus登录数据库报ORA-12560

C:\Users\XIFENFEI>sqlplus / as sysdba

SQL*Plus: Release 11.2.0.2.0 Production on Tue Feb 14 23:33:31 2012

Copyright (c) 1982, 2010, Oracle.  All rights reserved.

ERROR:
ORA-12560: TNS:protocol adapter error


Enter user-name:
ERROR:
ORA-12560: TNS:protocol adapter error


Enter user-name:
ERROR:
ORA-12560: TNS:protocol adapter error


SP2-0157: unable to CONNECT to ORACLE after 3 attempts, exiting SQL*Plus

C:\Users\XIFENFEI>sqlplus -v

SQL*Plus: Release 11.2.0.2.0 Production

在这里发现sqlplus的版本信息不正确,我的数据库安装的11.2.0.1,这么怎么显示sqlplus是11.2.0.2,然后进入%ORACLE_HOME%/bin目录下面执行sqlplus,登录成功

C:\Users\XIFENFEI>cd e:\oracle\11_2_0\BIN

e:\oracle\11_2_0\BIN>sqlplus / as sysdba

SQL*Plus: Release 11.2.0.1.0 Production on 星期二 2月 14 23:44:38 2012

Copyright (c) 1982, 2010, Oracle.  All rights reserved.


连接到:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - Production
With the Partitioning, Oracle Label Security, OLAP, Data Mining,
Oracle Database Vault and Real Application Testing options

SQL>

出现了这样的问题,考虑应该是环境变量设置不当导致,查看环境变量发现下面一部分

E:\oracle\timesten\bin;E:\oracle\timesten\ttoracle_home\instantclient_11_2;e:\oracle\11_2_0\bin;

问题的答案就是在这里,因为我的电脑是先安装Oracle,前几天安装了TimesTen,导致环境变量发生了改变,使得我在默认情况下使用sqlplus的时候,自动调用了TimesTen中的sqlplus导致

C:\Users\XIFENFEI>cd E:\oracle\timesten\ttoracle_home\instantclient_11_2

E:\oracle\timesten\ttoracle_home\instantclient_11_2>dir sqlplus*
 驱动器 E 中的卷没有标签。
 卷的序列号是 38D0-2A35

 E:\oracle\timesten\ttoracle_home\instantclient_11_2 的目录

2010/11/19  00:50           655,360 sqlplus.exe
2011/11/08  04:12            84,524 sqlplus.sym
2011/11/08  04:12               554 SQLPLUS_README
               3 个文件        740,438 字节
               0 个目录  9,825,832,960 可用字节

Query Duration=0与ORA-01555

1.ALERT日志错误
奇怪之处:Query Duration=0 sec,竟然出现了ORA-01555

Tue Feb  7 02:41:34 2012
ORA-01555 caused by SQL statement below (Query Duration=0 sec, SCN: 0x0b2e.efcd78a9):
Tue Feb  7 02:41:34 2012
SELECT "ID_NO","CUST_ID" FROM "DBACCADM"."DCUSTMSG" "C" WHERE "ID_NO"=:1

2.ORA-01555解释
超过了undo_retention时间,undo被覆盖导致ORA-01555

[zwq_acc1:/home/oraeye/check]oerr ora 1555
01555, 00000, "snapshot too old: rollback segment number %s with name \"%s\" too small"
// *Cause: rollback records needed by a reader for consistent read are
//         overwritten by other writers
// *Action: If in Automatic Undo Management mode, increase undo_retention
//          setting. Otherwise, use larger rollback segments

3.数据库版本

SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle9i Enterprise Edition Release 9.2.0.8.0 - 64bit Production
PL/SQL Release 9.2.0.8.0 - Production
CORE    9.2.0.8.0       Production
TNS for IBM/AIX RISC System/6000: Version 9.2.0.8.0 - Production
NLSRTL Version 9.2.0.8.0 - Production

4.undo基本信息
从这里可以发现,两个节点的undo表空间还有很多剩余,缺发生了undo被覆盖从而出现了ORA-01555

SQL> col name for a20
SQL> col value for a15
SQL> SELECT INST_ID, NAME, VALUE
  2    FROM GV$PARAMETER
  3   WHERE UPPER (Name) LIKE '%UNDO%';

   INST_ID NAME                 VALUE
---------- -------------------- ---------------
         1 undo_management      AUTO
         1 undo_tablespace      UNDOTBS1
         1 undo_suppress_errors FALSE
         1 undo_retention       1800
         2 undo_management      AUTO
         2 undo_tablespace      UNDOTBS2
         2 undo_suppress_errors FALSE
         2 undo_retention       1800

8 rows selected.


TABLESPACE_NAME                CURRENT_TOTAL(MB)   USED(MB)   FREE(MB)      FREE% AUT MAX_TOTAL(MB)
------------------------------ ----------------- ---------- ---------- ---------- --- -------------
UNDOTBS1                                   40950    1587.94 39362.0625      96.12 NO          40950
UNDOTBS2                                   57330    1926.31 55403.6875      96.64 NO          57330

SQL> SELECT DISTINCT STATUS ,
  2                  COUNT(*) "EXTENT_NUM",
  3                  SUM(BYTES) / 1024 / 1024 / 1024 "UNDO(G)"
  4    FROM DBA_UNDO_EXTENTS
  5   GROUP BY STATUS;

STATUS    EXTENT_NUM    UNDO(G)
--------- ---------- ----------
ACTIVE           208 .273658752
EXPIRED         7651 2.42865753
UNEXPIRED        941 .752548218

查询MOS[ID 761128.1],发现可能是Oracle bug导致(BUG:6799685 – ORA-1555 ERROR WITH QUERY DURATION=0 AND UNDO_RETENTION=1800和BUG:5475085 – V$UNDOSTAT.EXPBLKREUCNT IS NEVER INCREMENTED)

5.解决方法
Increase the size of the UNDO tablespace and increase the UNDO_RETENTION parameter value to try to prevent required undo expiring too quickly.
基于本库,因为undo空间还有很大剩余,直接设置UNDO_RETENTION=3600即可(可以从一定程度上缓解整个问题,但是要从根本上解决整个问题,需要升级到10.2.0.4及其以上版本)

ORA-00600[4454]

数据库版本信息

SQL> select * from gv$version;

   INST_ID BANNER
---------- ----------------------------------------------------------------
         1 Oracle9i Enterprise Edition Release 9.2.0.8.0 - 64bit Production
         1 PL/SQL Release 9.2.0.8.0 - Production
         1 CORE 9.2.0.8.0       Production
         1 TNS for IBM/AIX RISC System/6000: Version 9.2.0.8.0 - Production
         1 NLSRTL Version 9.2.0.8.0 - Production
         2 Oracle9i Enterprise Edition Release 9.2.0.8.0 - 64bit Production
         2 PL/SQL Release 9.2.0.8.0 - Production
         2 CORE 9.2.0.8.0       Production
         2 TNS for IBM/AIX RISC System/6000: Version 9.2.0.8.0 - Production
         2 NLSRTL Version 9.2.0.8.0 - Production

10 rows selected.

早上检查某运营商的结算数据库时发现

Mon Feb  6 16:03:23 2012
Errors in file /oracle9/app/admin/settl/udump/settl1_ora_1355948.trc:
ORA-00600: internal error code, arguments: [4454], [], [], [], [], [], [], []
Mon Feb  6 16:03:23 2012
Errors in file /oracle9/app/admin/settl/udump/settl1_ora_1355948.trc:
ORA-00600: internal error code, arguments: [4454], [], [], [], [], [], [], []

trace文件信息

*** SESSION ID:(100.35758) 2012-02-06 16:03:23.223
*** 2012-02-06 16:03:23.223
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [4454], [], [], [], [], [], [], []
Current SQL statement for this session:
select trim(operator_id) into :b1  from b_sys_proc where program_name=:b2
----- Call Stack Trace -----
calling              call     entry                argument values in hex      
location             type     point                (? means dubious value)     
-------------------- -------- -------------------- ----------------------------
ksedmp+0148          bl       ksedst               1029746CC ?
ksfdmp+0018          bl       01FD4010             
kgeriv+0118          bl       _ptrgl               
kgeasi+00cc          bl       kgeriv               1101FAF48 ? 110248038 ?
                                                   00000000C ? 110235978 ?
                                                   000000040 ?
ktcsptg+00c4         bl       kgeasi               1100062D8 ? 110389E88 ?
                                                   116600001166 ? 200000002 ?
                                                   000000000 ? 70000010738F0E8 ?
                                                   0FFFFFFFF ? 0FFFFFFFF ?
opiexe+0524          bl       ktcsptg              000000000 ? 000000000 ?
opiefn0+01c0         bl       opiexe               49FFFFA640 ? 4900000001 ?
                                                   FFFFFFFFFFFA5C8 ?
opiefn+0100          bl       opiefn0              01000CF60 ? 1029C61B0 ?
                                                   000000002 ? FFFFFFFFFFFC1BC ?
                                                   000000001 ? 00000000C ?
                                                   00000000B ? 110061F50 ?
opiodr+08cc          bl       _ptrgl               
ttcpip+0cc4          bl       _ptrgl               
opitsk+0d60          bl       ttcpip               11000CF60 ? 000000000 ?
                                                   000000000 ? 000000000 ?
                                                   000000000 ? 000000000 ?
                                                   000000000 ? 000000000 ?
opiino+0758          bl       opitsk               000000000 ? 000000000 ?
opiodr+08cc          bl       _ptrgl               
opidrv+032c          bl       opiodr               3C00000018 ? 4101FAF48 ?
                                                   FFFFFFFFFFFF790 ? 0A000EEA8 ?
sou2o+0028           bl       opidrv               3C0C000000 ? 4A0147B50 ?
                                                   FFFFFFFFFFFF790 ?
main+0138            bl       01FD3A28             
__start+0098         bl       main                 000000000 ? 000000000 ?
………………
SO: 70000010738f0e8, type: 4, owner: 7000001043acd90, flag: INIT/-/-/0x00
    (session) trans: 70000010dde1dd8, creator: 7000001043acd90, flag: (100041) USR/- BSY/-/-/-/-/-
              DID: 0001-0064-7625733A, short-term DID: 0000-0000-00000000
              txn branch: 7000001147dbed8
              oct: 0, prv: 0, sql: 700000117b7f778, psql: 0, user: 24/SETTLE
    O/S info: user: settle, term: , ospid: 1400836, machine: zwq_jies2
              program: /settlement/pkg01/bin/long/long_app@zwq_jies2 (T
    application name: /settlement/pkg01/bin/long/long_app@zwq_jies2 (T, hash value=0
    last wait for 'SQL*Net message from client' blocking sess=0x0 seq=483 wait_time=240
                driver id=54435000, #bytes=1, =0

查询MOS,发现是Bug# 1402161(虽然在mos中声明该bug是在9.2.0.1中已经被修复,但是这里的trace文件中的Call Stack Trace和mos中记录一致,而且网络上也存在很多9.2.0.8中关于该bug的情况),trace文件最后一段可以看出是在节点2上(zwq_jies2)运行/settlement/pkg01/bin/long/long_app程序导致出现该错误
产生该错误原因:

This exception is signalled because the savepoint number is not what was
  expected.
The current transaction savepoint is less than the beginning savepoint of 
  the transaction.

解决方法:
把/settlement/pkg01/bin/long/long_app中的需要运行的程序分割成几个小程序运行。

Break the job into smaller chunks and reconnect for each part of the job 
  to reset the savepoint number.

ORA-00600[ktspNextL1:4]

在检查某运营商的客服数据库时发现如下错误

Tue Jan 31 22:00:50 2012
Errors in file /oracle10/admin/ahunicom/bdump/ahunicom1_j005_24445074.trc:
ORA-00600: internal error code, arguments: [ktspNextL1:4], [], [], [], [], [], [], []
Tue Jan 31 22:01:18 2012
Trace dumping is performing id=[cdmp_20120131220118]
Tue Jan 31 22:01:18 2012
Errors in file /oracle10/admin/ahunicom/bdump/ahunicom1_j005_24445074.trc:
ORA-00600: internal error code, arguments: [ORA-00600: internal error code, arguments: [ktspNextL1:4], [], [], [], [], [], [], []
ORA-06512: at "SYS.PRVT_ADVISOR", line 1624
ORA-06512: at "SYS.DBMS_ADVISOR", line 186
ORA-06512: at "SYS.DBMS_SPACE", line 1500
ORA-06512: at "SYS.DBMS_SPACE", line 1566
], [], [], [], [], [], [], []

查看trace文件

Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
ORACLE_HOME = /oracle10/app/product/db/10.2.0
System name:    AIX
Node name:      zwq_kfdb1
Release:        1
Version:        6
Machine:        00C5C4764C00
Instance name: ahunicom1
Redo thread mounted by this instance: 1
Oracle process number: 192
Unix process pid: 24445074, image: oracle@zwq_kfdb1 (J005)

*** ACTION NAME:(AUTO_SPACE_ADVISOR_JOB) 2012-01-31 22:00:50.874
*** MODULE NAME:(DBMS_SCHEDULER) 2012-01-31 22:00:50.874
*** SERVICE NAME:(SYS$USERS) 2012-01-31 22:00:50.874
*** SESSION ID:(454.44574) 2012-01-31 22:00:50.874
*** 2012-01-31 22:00:50.873
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [ktspNextL1:4], [], [], [], [], [], [], []
Current SQL statement for this session:
insert into wri$_adv_objspace_trend_data select timepoint,  space_usage, space_alloc, quality from  table(dbms_space.object_growth_trend(:1, :2, :3, :4, NULL
, NULL,  NULL, 'FALSE', :5, 'FALSE'))
----- PL/SQL Call Stack -----
  object      line  object
  handle    number  name
70000010d9a65e0      1834  package body SYS.DBMS_SPACE
70000010d9a65e0      3922  package body SYS.DBMS_SPACE
70000010d9a65e0      4233  package body SYS.DBMS_SPACE
70000011a7d9c88         1  anonymous block
700000180678048       344  SYS.WRI$_ADV_OBJSPACE_TREND_T
700000180678048      1485  SYS.WRI$_ADV_OBJSPACE_TREND_T
7000001334867d0      1535  package body SYS.PRVT_ADVISOR
7000001334867d0      1618  package body SYS.PRVT_ADVISOR
70000011f9f64a8       186  package body SYS.DBMS_ADVISOR
70000010d9a65e0      1500  package body SYS.DBMS_SPACE
70000010d9a65e0      1566  package body SYS.DBMS_SPACE
----- Call Stack Trace -----
calling              call     entry                argument values in hex      
location             type     point                (? means dubious value)     
-------------------- -------- -------------------- ----------------------------
ksedst+001c          bl       ksedst1              70000017E9975D0 ? 100000001 ?
ksedmp+0290          bl       ksedst               104A2CDB0 ?
ksfdmp+0018          bl       03F2735C             
kgerinv+00dc         bl       _ptrgl               
kgeasnmierr+004c     bl       kgerinv              000000000 ? 000000001 ?
                                                   000000005 ? 7000000E19760FC ?
                                                   7000000E1976014 ?
ktspGetNextL1ForSca  bl       01F94828             
n+0104                                             
ktspScanInit+026c    bl       ktspGetNextL1ForSca  A0A0E1B89 ? 10564BCF4 ?
                              n                    80003000804DC ?
ktspGenExtentMap1+0  bl       ktspScanInit         000000000 ?
0e8                                                
kteinmap1+00bc       bl       ktspGenExtentMap1    000000000 ? 000000001 ?
                                                   FFFFFFFFFFE3A80 ?
kteinmap+0010        bl       kteinmap1            000000000 ? 00000000D ?
kdgini+036c          bl       kteinmap             FFFFFFFFFFE3DA0 ?
kdg_block_auto+018c  bl       kdgini               000000000 ? A0B4A708C ?
                                                   5676A0005676A ?
                                                   80003000804DC ? 000000000 ?
                                                   70000016AD82170 ? 110195498 ?
                                                   70000016AD82148 ?
ktsa_object_space_u  bl       kdg_block_auto       FFFFFFFFFFE4C50 ?
sage+0950                                          FFFFFFFFFFE4CA0 ? 200000780 ?
                                                   70000011A7D9C88 ?

由trace文件中的insert语句可以知道,这个错误是DBMS_SPACE.OBJECT_GROWTH_TREND进行空间分析时被触发
查询MOS[ID 841158.1],发现这个是一个没有公布的bug(5649098),在11.1.0.7中被修复

解决方案:
1.忽略这个错误,因为这个错误是不可重复的,发生的概率不大
2.升级到11.1.0.7及其以上版本
3.如果遇到这个错误,可以手工执行dbms_space.auto_space_advisor_job_proc();