Notes on the "Immediate Kill Session#" bug in RAC

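Today I found a large number of "Immediate Kill Session#" messages in the alert log on one node of a RAC cluster. My analysis is recorded below.
1. Alert log contents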

Sun Jan  1 02:12:28 2012
ALTER SYSTEM SET service_names='' SCOPE=MEMORY SID='ora9i1';
Sun Jan  1 02:12:28 2012
Immediate Kill Session#: 496, Serial#: 51199
Immediate Kill Session: sess: 0x406bfa26b78  OS pid: 12900
Immediate Kill Session#: 497, Serial#: 38504
Immediate Kill Session: sess: 0x406bfa280e0  OS pid: 12496
Immediate Kill Session#: 499, Serial#: 45296
Immediate Kill Session: sess: 0x406bfa2abb0  OS pid: 12467
Immediate Kill Session#: 502, Serial#: 18910
Immediate Kill Session: sess: 0x406bfa2ebe8  OS pid: 28887
Immediate Kill Session#: 503, Serial#: 26631
Immediate Kill Session: sess: 0x406bfa30150  OS pid: 20749
Immediate Kill Session#: 508, Serial#: 63586
Immediate Kill Session: sess: 0x406bfa36c58  OS pid: 27614
Immediate Kill Session#: 512, Serial#: 43388
Immediate Kill Session: sess: 0x406bfa3c1f8  OS pid: 4021
Immediate Kill Session#: 516, Serial#: 33975
Immediate Kill Session: sess: 0x406bfa41798  OS pid: 18481
Immediate Kill Session#: 517, Serial#: 24240
Immediate Kill Session: sess: 0x406bfa42d00  OS pid: 823
Immediate Kill Session#: 526, Serial#: 59767
Immediate Kill Session: sess: 0x406bfa4eda8  OS pid: 12529
Immediate Kill Session#: 527, Serial#: 45765
Immediate Kill Session: sess: 0x406bfa50310  OS pid: 6059
……………………
Sun Jan  1 02:22:29 2012
ALTER SYSTEM SET service_names='ora9i' SCOPE=MEMORY SID='ora9i1';
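
To get a quick count of how many sessions were killed in a window like the one above, the alert log can be scanned for the "Immediate Kill Session#" lines. The following is only a minimal sketch: the function name and the sample list are my own, and it assumes the line format is exactly the one shown in this log.

```python
import re

# Matches lines such as "Immediate Kill Session#: 496, Serial#: 51199".
# The companion "Immediate Kill Session: sess: ..." lines do not match.
KILL_RE = re.compile(r"Immediate Kill Session#:\s*(\d+),\s*Serial#:\s*(\d+)")


def killed_sessions(alert_lines):
    """Return (sid, serial#) pairs for every kill line found in the given lines."""
    pairs = []
    for line in alert_lines:
        match = KILL_RE.search(line)
        if match:
            pairs.append((int(match.group(1)), int(match.group(2))))
    return pairs


# A few lines copied from the alert log excerpt above.
sample = [
    "Sun Jan  1 02:12:28 2012",
    "ALTER SYSTEM SET service_names='' SCOPE=MEMORY SID='ora9i1';",
    "Immediate Kill Session#: 496, Serial#: 51199",
    "Immediate Kill Session: sess: 0x406bfa26b78  OS pid: 12900",
    "Immediate Kill Session#: 497, Serial#: 38504",
]
print(killed_sessions(sample))  # [(496, 51199), (497, 38504)]
```

Feeding it the lines of the real alert log (for example via `open(path).read().splitlines()`) gives the full list of affected SID/serial# pairs.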

2. Database configuration
2.1) Node A configuration

SQL> select instance_name from v$instance;

INSTANCE_NAME
----------------
ora9i1

SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bi
PL/SQL Release 10.2.0.4.0 - Production
CORE    10.2.0.4.0      Production
TNS for Linux IA64: Version 10.2.0.4.0 - Production
NLSRTL Version 10.2.0.4.0 - Production

SQL> show parameter name;

NAME                                 TYPE       VALUE
------------------------------------ ---------- --------------------
db_file_name_convert                 string
db_name                              string     ora9i
db_unique_name                       string     ora9i
global_names                         boolean    FALSE
instance_name                        string     ora9i1
lock_name_space                      string
log_file_name_convert                string
service_names                        string     ora9i

2.2) Node B configuration

SQL>  select instance_name from v$instance;

INSTANCE_NAME
----------------
ora9i2

SQL>  select * from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bi
PL/SQL Release 10.2.0.4.0 - Production
CORE    10.2.0.4.0      Production
TNS for Linux IA64: Version 10.2.0.4.0 - Production
NLSRTL Version 10.2.0.4.0 - Production

SQL> show parameter name;

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_file_name_convert                 string
db_name                              string      ora9i
db_unique_name                       string      ora9i
global_names                         boolean     FALSE
instance_name                        string      ora9i2
lock_name_space                      string
log_file_name_convert                string
service_names                        string      SYS$SYS.KUPC$C_2_2012010601100
                                                 6.ORA9I, ora9i, SYS$SYS.KUPC$S
                                                 _2_20120106011006.ORA9I

3. Searching MOS for a solution
3.1) Cause of the problem

This is caused by unpublished Bug 6955040 ALL THE SESSIONS LOST CONNECTION AFTER KILLING CRSD.BIN.

The problem is when CRSD is killed or crashed and restarted, 
CRSD will run resource check action but CRS resource status will not be available at that time. 
Then in instance check action, 
it fails to get the preferred node VIP resource status and considered the preferred node VIP resource is not running. 
Therefore, instance check action will remove the default database service name 
and disconnect sessions connected using default database service name.

This causes messages "ALTER SYSTEM" and "Immediate Kill Session" printed in alert log.

3.2) Solution

1) The fix is included in 10.2.0.5 patchset and 11.1.0.7 patchset.
    Apply the patchset once they are available.

OR

2) Configure a service name other than the default one (same as db_name), 
and get user to use the non-default service name for connection.

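Notes on the spfile in RAC

A friend's RAC database was recently upgraded by one of his colleagues, who created an spfile locally on each of the two nodes and then started the database from those local spfiles. My friend does not know Oracle very well, so he asked me for help. My advice was:
1. Use the backed-up original pfile to create an spfile in ASM, following the naming rule +ASM/SID/spfileSID
2. In the dbs directory, create a local initSID.ora and add one line to it: spfile='<path of the spfile in ASM>' (do the same on both nodes; note that the SID differs on each)
3. Restart the database on each node
Analysis of how the problem arose:
The colleague who did the upgrade did not understand the rules for the spfile in RAC either. Most likely, when he restarted the database he was told that the spfile did not exist, so he manually created an spfile from the pfile under dbs. When my friend checked the database after the National Day holiday, he found that the spfiles were all stored locally.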

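1. Usual parameter file search order
As we know, if no parameter file is specified, Oracle looks for the following files (under $ORACLE_HOME/dbs on Unix/Linux), in this order, to start the instance:
spfileSID.ora
spfile.ora
initSID.ora
If none of these files is found, startup fails.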
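
The lookup can be pictured as "first existing file wins". The sketch below only imitates that rule on an ordinary directory; the function name is hypothetical, and the default candidate list (spfileSID.ora, spfile.ora, initSID.ora) is passed as a parameter so it can be adjusted. It says nothing about how Oracle validates the file it finds.

```python
import os
import tempfile


def default_parameter_file(dbs_dir, sid, candidates=None):
    """Return the first candidate file that exists in dbs_dir, or None."""
    if candidates is None:
        candidates = [f"spfile{sid}.ora", "spfile.ora", f"init{sid}.ora"]
    for name in candidates:
        path = os.path.join(dbs_dir, name)
        if os.path.exists(path):
            return path
    return None  # corresponds to "startup fails"


# Demo on a throwaway directory: spfile.ora is found before initRACDB1.ora.
with tempfile.TemporaryDirectory() as dbs:
    open(os.path.join(dbs, "initRACDB1.ora"), "w").close()
    open(os.path.join(dbs, "spfile.ora"), "w").close()
    chosen = default_parameter_file(dbs, "RACDB1")
    print(os.path.basename(chosen))  # spfile.ora
```

Note that the rule only checks for existence, which is why an empty spfileSID.ora still takes precedence over a valid file further down the list.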

2. How the spfile is used at startup in RAC

[rac@cent1 dbs]$ echo $ORACLE_SID
RACDB1
[rac@cent1 dbs]$ touch spfileRACDB1.ora  <== manually create an empty spfile
[rac@cent1 dbs]$ sqlplus / as sysdba

SQL*Plus: Release 10.2.0.4.0 - Production on Thu Apr 29 13:45:50 2010

Copyright (c) 1982, 2007, Oracle.  All Rights Reserved.

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options

SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORA-27091: unable to queue I/O  <== starting the database from sqlplus fails
ORA-27069: attempt to do I/O beyond the range of the file
Additional information: 1
Additional information: 1
SQL>
SQL> exit
Disconnected from Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
[rac@cent1 dbs]$ crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....B1.inst application    OFFLINE   OFFLINE
ora....B2.inst application    ONLINE    ONLINE    cent2
ora.RACDB.db   application    ONLINE    ONLINE    cent1
ora....SM1.asm application    ONLINE    ONLINE    cent1
ora....T1.lsnr application    ONLINE    ONLINE    cent1
ora.cent1.gsd  application    ONLINE    ONLINE    cent1
ora.cent1.ons  application    ONLINE    ONLINE    cent1
ora.cent1.vip  application    ONLINE    ONLINE    cent1
ora....SM2.asm application    ONLINE    ONLINE    cent2
ora....T2.lsnr application    ONLINE    ONLINE    cent2
ora.cent2.gsd  application    ONLINE    ONLINE    cent2
ora.cent2.ons  application    ONLINE    ONLINE    cent2
ora.cent2.vip  application    ONLINE    ONLINE    cent2
[rac@cent1 dbs]$ srvctl start instance -i racdb1 -d racdb  <== starting with srvctl succeeds
[rac@cent1 dbs]$ sqlplus / as sysdba

SQL*Plus: Release 10.2.0.4.0 - Production on Thu Apr 29 13:47:25 2010

Copyright (c) 1982, 2007, Oracle.  All Rights Reserved.

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options

SQL> select instance_name, status from v$instance;

INSTANCE_NAME    STATUS
---------------- ------------
RACDB1           OPEN
-- This shows that srvctl does not use that search order to locate the parameter file

3. Checking where srvctl reads the spfile from

[rac@cent1 dbs]$ srvctl config database -d racdb -a
cent1 RACDB1 /rac/product/10.2.0/db
cent2 RACDB2 /rac/product/10.2.0/db
DB_NAME: RACDB
ORACLE_HOME: /rac/product/10.2.0/db
SPFILE: +DATA/RACDB/spfileRACDB.ora
DOMAIN: WORLD
DB_ROLE: null
START_OPTIONS: null
POLICY:  AUTOMATIC
ENABLE FLAG: DB ENABLED
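
When checking several databases, it can be handy to pull the SPFILE value out of this output programmatically. The following is a rough sketch that assumes the output has the "KEY: value" layout shown above; the function name is my own, and the sample text is copied from the listing.

```python
def parse_srvctl_config(text):
    """Collect the 'KEY: value' lines of the config output into a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # lines without a colon (the node/instance rows) are skipped
            settings[key.strip()] = value.strip()
    return settings


output = """cent1 RACDB1 /rac/product/10.2.0/db
cent2 RACDB2 /rac/product/10.2.0/db
DB_NAME: RACDB
ORACLE_HOME: /rac/product/10.2.0/db
SPFILE: +DATA/RACDB/spfileRACDB.ora
DOMAIN: WORLD
POLICY:  AUTOMATIC
"""

cfg = parse_srvctl_config(output)
print(cfg["SPFILE"])  # +DATA/RACDB/spfileRACDB.ora
```

The SPFILE entry is the location recorded in the cluster registry, which is the one srvctl used in the test above instead of the empty local file.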

4. Changing the spfile location recorded in CRS

[rac@cent1 dbs]$ srvctl modify database -d racdb -p '+DATA/RACDB/spfileRACDB1.ora'

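RAC load balancing configuration

1. Client-side load balancing (Client-Side LB)
How it works: when a client initiates a connection, it picks one address at random from the address list, so a random algorithm spreads the connection requests across the instances.

Drawbacks:
1.1) Connections are assigned without considering the real load on each node, so the resulting distribution is not necessarily balanced
1.2) The random algorithm only evens out over a long period; if many connections are initiated within a short time, they may all be assigned to the same node
1.3) In some cases a connection may be assigned to a failed node

Configuration: add a LOAD_BALANCE = YES entry to the connect descriptor in tnsnames.ora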
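
Drawback 1.2 is easy to see with a small simulation. The sketch below is not how Oracle Net is implemented; it just draws addresses uniformly at random (the host names are placeholders of mine) and compares a short burst with a long run.

```python
import random
from collections import Counter


def pick_address(addresses, rng):
    """Client-side load balancing in miniature: choose one address at random."""
    return rng.choice(addresses)


addresses = ["rac1-vip:1521", "rac2-vip:1521"]  # placeholder address list
rng = random.Random(42)  # fixed seed so the run is repeatable

# A short burst of connections can easily come out lopsided...
few = Counter(pick_address(addresses, rng) for _ in range(4))
# ...while a long run converges towards an even split.
many = Counter(pick_address(addresses, rng) for _ in range(10000))

print(dict(few))
print(dict(many))
```

Nothing in `pick_address` looks at node load or node health, which is exactly what drawbacks 1.1 and 1.3 describe.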

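2. Server-side load balancing (Server-Side LB)
How it works:
2.1) This mechanism relies on the load information collected by the Listener. While the database is running, the PMON background process gathers the system's load information and registers it with the Listener.
2.2) PMON registers not only with the local Listener but also with the Listeners on the other nodes. Where it registers is determined by the remote_listener and local_listener parameters. local_listener does not need to be set, but remote_listener does; its value is a tnsnames entry.
2.3) When a Listener receives a client connection request, it hands the connection to the least loaded node, which may be its own node or another one. In other words, the Listener redirects the client's connection request.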
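
Step 2.3 boils down to "pick the instance with the lowest registered load". A minimal sketch of that decision follows; the load figures are invented for illustration, and the actual metric the Listener uses is not described in these notes.

```python
def choose_instance(load_by_instance):
    """Return the name of the instance with the lowest registered load."""
    if not load_by_instance:
        raise ValueError("no instance has registered with the listener")
    return min(load_by_instance, key=load_by_instance.get)


# Hypothetical load values as PMON might have registered them.
registered = {"devdb1": 0.72, "devdb2": 0.31}
print(choose_instance(registered))  # devdb2
```

The decision is only as good as the last registration, so a listener working from stale load data can still route a connection to a busy node.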

Configuration:

SQL> show parameter listener;        

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
local_listener                       string
remote_listener                      string      LISTENERS_DEVDB

tnsnames.ora
LISTENERS_DEVDB =
  (ADDRESS_LIST =
    (ADDRESS = (PROTOCOL = TCP)(HOST = rac1-vip)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = rac2-vip)(PORT = 1521))
  )

listener.ora (remove the SID_LIST_LISTENER_NAME entry)
LISTENER_RAC1 =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = rac1-vip)(PORT = 1521)(IP = FIRST))
      (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.1.11)(PORT = 1521)(IP = FIRST))
    )
  )

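3. Using both together
Server-Side LB and Client-Side LB are not mutually exclusive; the two can work together. In that case the client first picks a random address from the address list and sends its request to the Listener at that address. After receiving the request, the Listener picks the most suitable node based on the load of each node and redirects the connection request to it.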

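Three kinds of RAC failover

1. Client-Side Connect Time Failover
1.1) Multiple addresses are configured in the client's tnsnames.ora. When the user initiates a connection, the client first tries the first address in the list; if that attempt fails, it tries the second address, and so on until a connection succeeds or every address has been tried.
1.2) This kind of failover only takes effect at the moment the connection is established. Once the connection exists, a node failure is not handled: the client simply sees the session drop, and the application must reconnect.
To enable it: add a FAILOVER=ON entry to the client's tnsnames.ora. Because the default value of this parameter is already ON, the client gets this failover behavior even without the entry.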

XFF_F =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.1.21)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.1.22)(PORT = 1521))
    (LOAD_BALANCE = yes)
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = devdb)
    )
  )
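
The connect-time behavior described in 1.1 amounts to walking the address list until one attempt succeeds. Here is a minimal sketch of that loop; `try_connect` and `fake_connect` are hypothetical stand-ins for a real driver call, and the two addresses are the ones from the XFF_F entry.

```python
def connect_with_failover(addresses, try_connect):
    """Try each address in turn and return the first successful connection."""
    errors = []
    for addr in addresses:
        try:
            return try_connect(addr)
        except ConnectionError as exc:
            errors.append((addr, str(exc)))  # remember why this address failed
    raise ConnectionError(f"all addresses failed: {errors}")


def fake_connect(addr):
    # Hypothetical stand-in for a real connect call: the first node is down.
    if addr == "192.168.1.21:1521":
        raise ConnectionError("no listener")
    return f"connected to {addr}"


print(connect_with_failover(["192.168.1.21:1521", "192.168.1.22:1521"], fake_connect))
# connected to 192.168.1.22:1521
```

Once `connect_with_failover` has returned, nothing in it watches the connection any more, which matches point 1.2: a later node failure just drops the session.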

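2. TAF (Transparent Application Failover)
2.1) After the connection is established and while the application is running, if an instance fails, the users connected to it are automatically migrated to another healthy instance. The migration is transparent to the application and needs no user intervention, although uncommitted transactions are rolled back during the migration.
2.2) Compared with Client-Side Connect Time Failover, the only addition is the FAILOVER_MODE clause, which has four sub-parameters
2.2.1) METHOD: the possible values are BASIC and PRECONNECT
BASIC creates the connection to another instance only when the node failure is detected
PRECONNECT also establishes a connection to the backup instance when the initial connection is made, so the session can switch over immediately when a failure occurs.
2.2.2) TYPE: the possible values are SESSION and SELECT
The difference lies in how SELECT statements are handled. With SELECT, a query that was running when the failover happened continues on the new node and returns the remaining rows. With SESSION, only the session is re-established, and the query has to be executed again to return the full result set.
2.2.3) DELAY: the interval between retries, in seconds
2.2.4) RETRIES: the number of retries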

XFF_T =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.1.21)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.1.22)(PORT = 1521))
    (LOAD_BALANCE = yes)
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = devdb)
      (FAILOVER_MODE =
        (TYPE = SELECT)
        (METHOD = BASIC)
        (RETRIES = 180)
        (DELAY = 5)
      )
    )
  )
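
RETRIES and DELAY together describe a plain retry loop. The sketch below imitates it with the values from the XFF_T entry; it is an illustration of the idea only, not Oracle's client code, and `flaky` is a made-up stand-in for "the surviving instance is not ready yet". The sleep function is injectable so the demo does not actually wait.

```python
import time


def reconnect(try_connect, retries=180, delay=5, sleep=time.sleep):
    """Attempt to reconnect up to `retries` times, pausing `delay` seconds between tries."""
    for attempt in range(1, retries + 1):
        try:
            return try_connect()
        except ConnectionError:
            if attempt == retries:
                raise  # out of attempts: give the error back to the caller
            sleep(delay)


attempts = {"n": 0}


def flaky():
    # Hypothetical: the first two attempts fail, the third succeeds.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("instance still down")
    return "session restored"


waits = []
print(reconnect(flaky, retries=180, delay=5, sleep=waits.append))  # session restored
print(waits)  # [5, 5]
```

With RETRIES = 180 and DELAY = 5, a loop like this keeps trying for roughly 15 minutes before giving up, which is worth keeping in mind when choosing the two values.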

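3. Server-Side TAF
3.1) Server-Side TAF has all the characteristics of TAF
3.2) It is configured on the server, so no related configuration is needed on the client. Changing a parameter does not require editing every client's tnsnames.ora; you only modify the service on the server
Each instance can be given one of two roles for a service
PREFERRED: preferred instance; instances with this role are chosen first to provide the service
AVAILABLE: standby instance; the service moves to an AVAILABLE instance only when the PREFERRED instances are unavailable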

XFF_RAC =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.1.21)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.1.22)(PORT = 1521))
    (LOAD_BALANCE = yes)
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = XFF)
    )
  )
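
The PREFERRED/AVAILABLE rule can be written down as a small selection function. This is a simplified model of the description above and nothing more: the function name is my own, and real Clusterware placement has more cases than this sketch covers.

```python
def instances_for_service(preferred, available, running):
    """Pick where a service runs: preferred instances if any are up, else available ones."""
    up_preferred = [inst for inst in preferred if inst in running]
    if up_preferred:
        return up_preferred
    # No preferred instance is up: fall back to the available instances.
    return [inst for inst in available if inst in running]


# Service XFF with devdb1 as PREFERRED and devdb2 as AVAILABLE (assumed roles).
print(instances_for_service(["devdb1"], ["devdb2"], {"devdb1", "devdb2"}))  # ['devdb1']
print(instances_for_service(["devdb1"], ["devdb2"], {"devdb2"}))            # ['devdb2']
```

Because the choice is made on the server, the client entry XFF_RAC only needs the service name and carries no FAILOVER_MODE clause.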

Starting and stopping cluster services (10g)

I. Starting and stopping CRS
Stop CRS
/etc/init.d/init.crs stop
Start CRS
/etc/init.d/init.crs start

II. Starting and stopping all cluster resources
Stop
./crs_stop -all
Start
./crs_start -all

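III. Operating on CRS resources step by step
1. Stopping the cluster
srvctl stop service -d <db_name> -s <service_name>
srvctl stop database -d <db_name>
srvctl stop asm -n <node1>
srvctl stop asm -n <node2>
srvctl stop nodeapps -n <node1>
srvctl stop nodeapps -n <node2>

2. Starting the cluster
srvctl start nodeapps -n <node1>
srvctl start nodeapps -n <node2>
srvctl start asm -n <node1>
srvctl start asm -n <node2>
srvctl start database -d <db_name>
srvctl start service -d <db_name> -s <service_name>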

3. Test
3.1) Shutdown
srvctl stop service -d devdb -s XFF
srvctl stop instance -d devdb -i devdb1,devdb2 -o immediate
(srvctl stop database -d devdb -o immediate)
srvctl stop asm -n rac1
srvctl stop asm -n rac2
srvctl stop nodeapps -n rac1
srvctl stop nodeapps -n rac2

3.2) Startup
srvctl start nodeapps -n rac1
srvctl start nodeapps -n rac2
srvctl start asm -n rac1
srvctl start asm -n rac2
srvctl start database -d devdb
(srvctl start instance -d devdb -i devdb1,devdb2)
srvctl start service -d devdb -s XFF
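
The important point in the sequences above is the ordering: startup goes nodeapps, ASM, database, service, and shutdown walks the same layers in reverse. A small helper can generate both lists for any set of nodes. This sketch only builds and prints the command strings (it does not run them), and the function names are my own.

```python
def start_commands(db, nodes, service):
    """Build the start sequence: nodeapps, then ASM, then database, then service."""
    cmds = [f"srvctl start nodeapps -n {n}" for n in nodes]
    cmds += [f"srvctl start asm -n {n}" for n in nodes]
    cmds.append(f"srvctl start database -d {db}")
    cmds.append(f"srvctl start service -d {db} -s {service}")
    return cmds


def stop_commands(db, nodes, service):
    """Build the stop sequence: the same layers as startup, in reverse order."""
    cmds = [f"srvctl stop service -d {db} -s {service}"]
    cmds.append(f"srvctl stop database -d {db} -o immediate")
    cmds += [f"srvctl stop asm -n {n}" for n in nodes]
    cmds += [f"srvctl stop nodeapps -n {n}" for n in nodes]
    return cmds


for cmd in stop_commands("devdb", ["rac1", "rac2"], "XFF"):
    print(cmd)
```

Printing the commands first and reviewing them before pasting them into a shell is a cheap safeguard on a production cluster.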