RAC hang caused by killing processes by mistake

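A customer reported that their system was hung and could not archive redo, and asked us to step in urgently to analyze it.
Node 1 alert log
Redo could not be archived and every online redo log had filled up. After someone manually ran ALTER SYSTEM ARCHIVE LOG CURRENT, the database archived all of the pending redo, but the redo generated after that again went unarchived. Once all the redo logs were full again, the database showed a large number of log file switch (archiving needed) waits.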

Tue Sep 24 22:05:37 2013
Thread 1 advanced to log sequence 47282 (LGWR switch)
  Current log# 6 seq# 47282 mem# 0: +DATA/q9db/onlinelog/group_6.1244.818697409
Tue Sep 24 22:07:31 2013
ORACLE Instance q9db1 - Can not allocate log, archival required
Thread 1 cannot allocate new log, sequence 47283
All online logs needed archiving
  Current log# 6 seq# 47282 mem# 0: +DATA/q9db/onlinelog/group_6.1244.818697409
Tue Sep 24 22:28:17 2013
ALTER SYSTEM ARCHIVE LOG
Archived Log entry 259646 added for thread 1 sequence 47266 ID 0x354620c2 dest 1:
Tue Sep 24 22:28:18 2013
Thread 1 advanced to log sequence 47283 (LGWR switch)
  Current log# 7 seq# 47283 mem# 0: +DATA/q9db/onlinelog/group_7.1243.818697415
Archived Log entry 259647 added for thread 1 sequence 47267 ID 0x354620c2 dest 1:
Archived Log entry 259648 added for thread 1 sequence 47268 ID 0x354620c2 dest 1:
Archived Log entry 259649 added for thread 1 sequence 47269 ID 0x354620c2 dest 1:
Archived Log entry 259650 added for thread 1 sequence 47270 ID 0x354620c2 dest 1:
Archived Log entry 259651 added for thread 1 sequence 47271 ID 0x354620c2 dest 1:
Archived Log entry 259652 added for thread 1 sequence 47272 ID 0x354620c2 dest 1:
Tue Sep 24 22:28:28 2013
Archived Log entry 259653 added for thread 1 sequence 47273 ID 0x354620c2 dest 1:
Archived Log entry 259654 added for thread 1 sequence 47274 ID 0x354620c2 dest 1:
Archived Log entry 259655 added for thread 1 sequence 47275 ID 0x354620c2 dest 1:
Archived Log entry 259656 added for thread 1 sequence 47276 ID 0x354620c2 dest 1:
Archived Log entry 259657 added for thread 1 sequence 47277 ID 0x354620c2 dest 1:
Archived Log entry 259658 added for thread 1 sequence 47278 ID 0x354620c2 dest 1:
Archived Log entry 259659 added for thread 1 sequence 47279 ID 0x354620c2 dest 1:
Tue Sep 24 22:28:39 2013
Archived Log entry 259660 added for thread 1 sequence 47280 ID 0x354620c2 dest 1:
Archived Log entry 259661 added for thread 1 sequence 47281 ID 0x354620c2 dest 1:
Archived Log entry 259662 added for thread 1 sequence 47282 ID 0x354620c2 dest 1:
Tue Sep 24 22:29:39 2013
Thread 1 advanced to log sequence 47284 (LGWR switch)
  Current log# 8 seq# 47284 mem# 0: +DATA/q9db/onlinelog/group_8.1242.818697417
Tue Sep 24 22:31:18 2013
Thread 1 advanced to log sequence 47285 (LGWR switch)
  Current log# 16 seq# 47285 mem# 0: +DATA/q9db/onlinelog/group_16.1884.827003545
Thread 1 advanced to log sequence 47286 (LGWR switch)
  Current log# 17 seq# 47286 mem# 0: +DATA/q9db/onlinelog/group_17.1885.827003587
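The pattern in this excerpt (log switches continue while no new "Archived Log entry" lines appear) can be checked mechanically. The following is a minimal Python sketch, assuming the alert log has already been read into a list of lines. The function name unarchived_sequences and the sample list are illustrative only and are not part of any Oracle tool.

```python
import re

ADVANCED = re.compile(r"Thread (\d+) advanced to log sequence (\d+)")
ARCHIVED = re.compile(r"Archived Log entry \d+ added for thread (\d+) sequence (\d+)")

def unarchived_sequences(lines, thread=1):
    """Return sequences that were switched out of but never reported as archived."""
    current = None
    completed, archived = set(), set()
    for line in lines:
        m = ADVANCED.search(line)
        if m and int(m.group(1)) == thread:
            if current is not None:
                completed.add(current)  # the previous current log is now full
            current = int(m.group(2))
            continue
        m = ARCHIVED.search(line)
        if m and int(m.group(1)) == thread:
            archived.add(int(m.group(2)))
    return sorted(completed - archived)

# Lines shortened from the node 1 excerpt above
sample = [
    "Thread 1 advanced to log sequence 47282 (LGWR switch)",
    "Thread 1 advanced to log sequence 47283 (LGWR switch)",
    "Archived Log entry 259662 added for thread 1 sequence 47282 ID 0x354620c2 dest 1:",
    "Thread 1 advanced to log sequence 47284 (LGWR switch)",
    "Thread 1 advanced to log sequence 47285 (LGWR switch)",
    "Thread 1 advanced to log sequence 47286 (LGWR switch)",
]
backlog = unarchived_sequences(sample)
print(backlog)
```

A backlog that keeps growing after a manual archive is the signal that the archiver itself is not working, rather than a one-off full destination.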

Node 2 alert log
Node 2 logged a large number of IPC Send timeout messages.

Tue Sep 24 15:22:19 2013
IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)]
…………
Tue Sep 24 18:51:55 2013
IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)]
Tue Sep 24 18:57:54 2013
IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)]
Receiver: inst 1 binc 464003926 ospid 1566
Tue Sep 24 19:03:57 2013
IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)]
Receiver: inst 1 binc 464003926 ospid 1566
Tue Sep 24 19:09:53 2013
IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)]
…………
Tue Sep 24 20:22:00 2013
IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)]

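Node 1 hung because it could not archive, and node 2 hung shortly afterwards. While node 1 was hung, we took a systemstate dump on each node and analyzed the dumps with ass. The output for node 1 and node 2 was roughly as follows:
Node 1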

393:waiting for 'log file switch (archiving needed)'
394:waiting for 'log file switch (archiving needed)'
     Cmd: Insert
395:waiting for 'log file switch (archiving needed)'
     Cmd: Insert
397:waiting for 'log file switch (archiving needed)'
     Cmd: Insert
398:waiting for 'log file switch (archiving needed)'
     Cmd: Insert
451:waiting for 'SQL*Net message from client'
469:waiting for 'log file switch (archiving needed)'
     Cmd: Insert
470:waiting for 'log file switch (archiving needed)'
     Cmd: Insert
471:waiting for 'log file switch (archiving needed)'
     Cmd: Insert
618:waiting for 'log file switch (archiving needed)'
     Cmd: Insert
626:waiting for 'log file switch (archiving needed)'
     Cmd: Insert

NO BLOCKING PROCESSES FOUND

Node 2

515:waiting for 'gc buffer busy acquire'
     Cmd: Insert
516:waiting for 'gc buffer busy acquire'
     Cmd: Insert
517:waiting for 'gc buffer busy acquire'
     Cmd: Insert
518:waiting for 'gc buffer busy acquire'
     Cmd: Insert
519:waiting for 'gc buffer busy acquire'
     Cmd: Insert
520:waiting for 'gc buffer busy acquire'
     Cmd: Select
521:waiting for 'gc current request'
     Cmd: Insert
522:waiting for 'enq: TX - row lock contention'[Enq TX-00BA0020-001E3E3C]
     Cmd: Select
523:waiting for 'gc buffer busy acquire'
     Cmd: Insert
524:waiting for 'SQL*Net message from client'
525:waiting for 'gc buffer busy acquire'
     Cmd: Insert
526:waiting for 'gc buffer busy acquire'
     Cmd: Insert
527:waiting for 'enq: TX - row lock contention'[Enq TX-00BA0020-001E3E3C]
     Cmd: Select
528:waiting for 'SQL*Net message from client'
529:waiting for 'gc buffer busy acquire'
     Cmd: Select

                    Resource Holder State
    Enq TX-0005001E-0022374F   223: waiting for 'gc current request'
    Enq TX-0047001B-002BCEB2   247: waiting for 'gc current request'
    Enq TX-015B001E-000041FF   330: waiting for 'gc current request'
    Enq TX-00010010-002EA7CD   179: waiting for 'gc current request'
    Enq TX-00BA0020-001E3E3C    ??? Blocker

Object Names
~~~~~~~~~~~~
Enq TX-0005001E-0022374F
Enq TX-0047001B-002BCEB2
Enq TX-015B001E-000041FF
Enq TX-00010010-002EA7CD
Enq TX-00BA0020-001E3E3C

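From this we can see that many transactions on node 2 are hung on gc current request. That wait arises because node 1 cannot archive, so some blocks cannot be shipped to node 2; node 2 stays stuck waiting, and the IPC Send timeout messages follow. Transactions on node 1 are blocked, or hung outright, because archiving is impossible. The question that remains is why node 1 cannot archive.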
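When the ass output runs to hundreds of processes, a quick tally of wait events makes the dominant wait on each node obvious. This is a small Python sketch under the assumption that the ass report is available as text lines in the format shown above; wait_summary is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches per-process lines such as: 515:waiting for 'gc buffer busy acquire'
WAIT = re.compile(r"^\s*(\d+):\s*waiting for '([^']+)'")

def wait_summary(lines):
    """Count how many processes are waiting on each event."""
    counts = Counter()
    for line in lines:
        m = WAIT.match(line)
        if m:
            counts[m.group(2)] += 1
    return counts

# Lines copied from the node 2 output above
sample = [
    "520:waiting for 'gc buffer busy acquire'",
    "     Cmd: Select",
    "521:waiting for 'gc current request'",
    "522:waiting for 'enq: TX - row lock contention'[Enq TX-00BA0020-001E3E3C]",
    "523:waiting for 'gc buffer busy acquire'",
    "    Enq TX-0005001E-0022374F   223: waiting for 'gc current request'",
]
summary = wait_summary(sample)
print(summary.most_common())
```

Lines from the "Resource Holder State" table start with the enqueue name, so they are deliberately not counted as separate processes.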
Continuing with the node 1 alert log

Tue Sep 24 15:18:20 2013
opidrv aborting process O000 ospid (7332) as a result of ORA-28
Immediate Kill Session#: 1904, Serial#: 1065
Immediate Kill Session: sess: 0x24a2522a38  OS pid: 7338
Immediate Kill Session#: 3597, Serial#: 11107
Immediate Kill Session: sess: 0x24c27cf498  OS pid: 7320
Tue Sep 24 15:18:23 2013
opidrv aborting process W000 ospid (7980) as a result of ORA-28
Tue Sep 24 15:18:23 2013
opidrv aborting process W001 ospid (8560) as a result of ORA-28
Tue Sep 24 15:18:35 2013
LGWR: Detected ARCH process failure
LGWR: Detected ARCH process failure
LGWR: Detected ARCH process failure
LGWR: Detected ARCH process failure
LGWR: STARTING ARCH PROCESSES
Tue Sep 24 15:18:35 2013
ARC0 started with pid=66, OS id=10793 
Tue Sep 24 15:18:35 2013
Errors in file /u01/app/oracle/diag/rdbms/q9db/q9db1/trace/q9db1_nsa2_12635.trc:
ORA-00028: your session has been killed
LNS: Failed to archive log 8 thread 1 sequence 47156 (28)
ARC0: Archival started
LGWR: STARTING ARCH PROCESSES COMPLETE
Thread 1 advanced to log sequence 47157 (LGWR switch)
ARC0: STARTING ARCH PROCESSES
  Current log# 9 seq# 47157 mem# 0: +DATA/q9db/onlinelog/group_9.1241.818697421
Tue Sep 24 15:18:36 2013
ARC1 started with pid=81, OS id=10805 
Tue Sep 24 15:18:36 2013
ARC2 started with pid=84, OS id=10807 
Tue Sep 24 15:18:36 2013
ARC3 started with pid=87, OS id=10809 
ARC1: Archival started
ARC2: Archival started
ARC2: Becoming the 'no FAL' ARCH
ARC2: Becoming the 'no SRL' ARCH
ARC1: Becoming the heartbeat ARCH
Error 1031 received logging on to the standby
PING[ARC1]: Heartbeat failed to connect to standby 'q9adgdg'. Error is 1031.
ARC3: Archival started
ARC0: STARTING ARCH PROCESSES COMPLETE
Archived Log entry 259135 added for thread 1 sequence 47156 ID 0x354620c2 dest 1:
Error 1031 received logging on to the standby
FAL[server, ARC3]: Error 1031 creating remote archivelog file 'q9adgdg'
FAL[server, ARC3]: FAL archive failed, see trace file.
ARCH: FAL archive failed. Archiver continuing
ORACLE Instance q9db1 - Archival Error. Archiver continuing.
Tue Sep 24 15:18:46 2013
opidrv aborting process O001 ospid (9605) as a result of ORA-28
Tue Sep 24 15:18:46 2013
opidrv aborting process O000 ospid (10813) as a result of ORA-28
Tue Sep 24 15:18:46 2013
Immediate Kill Session#: 2909, Serial#: 369
Immediate Kill Session: sess: 0x24226c7200  OS pid: 9091
Immediate Kill Session#: 3380, Serial#: 30271
Immediate Kill Session: sess: 0x2422782c58  OS pid: 10265
Immediate Kill Session#: 3597, Serial#: 11109
Immediate Kill Session: sess: 0x24c27cf498  OS pid: 10267
Tue Sep 24 15:20:14 2013
Restarting dead background process DIAG
Tue Sep 24 15:20:14 2013
DIAG started with pid=64, OS id=20568 
Restarting dead background process PING
Tue Sep 24 15:20:14 2013
PING started with pid=68, OS id=20570 
Restarting dead background process LMHB
Tue Sep 24 15:20:14 2013
LMHB started with pid=70, OS id=20572 
Restarting dead background process SMCO
…………
Tue Sep 24 15:23:13 2013
ARC0: Detected ARCH process failure
Tue Sep 24 15:23:13 2013
Thread 1 advanced to log sequence 47158 (LGWR switch)
  Current log# 10 seq# 47158 mem# 0: +DATA/q9db/onlinelog/group_10.1240.818697423
ARC0: STARTING ARCH PROCESSES
ARC0: STARTING ARCH PROCESSES COMPLETE
ARC0: Becoming the heartbeat ARCH
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance q9db1 - Archival Error
ORA-00028: your session has been killed
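To see at a glance which processes were aborted with ORA-28 and which background processes the instance tried to restart, the alert log can be scanned for the two message shapes shown above. This is a rough Python sketch; the function name and the sample list are assumptions for illustration.

```python
import re

ABORTED = re.compile(r"opidrv aborting process (\w+) ospid \((\d+)\) as a result of ORA-(\d+)")
RESTARTED = re.compile(r"Restarting dead background process (\w+)")

def killed_and_restarted(lines):
    """Return ([(process, ospid), ...], [restarted process names])."""
    killed, restarted = [], []
    for line in lines:
        m = ABORTED.search(line)
        if m:
            killed.append((m.group(1), int(m.group(2))))
            continue
        m = RESTARTED.search(line)
        if m:
            restarted.append(m.group(1))
    return killed, restarted

# Lines copied from the node 1 alert log above
sample = [
    "opidrv aborting process O000 ospid (7332) as a result of ORA-28",
    "opidrv aborting process W000 ospid (7980) as a result of ORA-28",
    "LGWR: Detected ARCH process failure",
    "Restarting dead background process DIAG",
    "Restarting dead background process PING",
]
killed, restarted = killed_and_restarted(sample)
print(killed, restarted)
```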

Checking the ARCn processes

[oracle@q9db01 ~]$ ps -ef|grep ora_ar
oracle   20718 12870  0 22:07 pts/14   00:00:00 grep ora_ar
[oracle@q9db01 ~]$ ps -ef|grep ora_ar
oracle   25998 12870  0 22:07 pts/14   00:00:00 grep ora_ar

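At this point the picture is basically clear. From 15:15 onward, a fault in the customer's middleware opened a flood of sessions against the database. To keep other workloads from being affected, the DBA killed sessions in bulk with alter system kill session and, in doing so, killed a number of system processes by mistake, including the ARCn (0, 1, 2, 3) processes. Afterwards the ARCn processes, for some reason, could not start normally, so redo could not be archived, and the system hung as soon as every redo group was full. The mass session kills had left the instance itself in an abnormal state (normally the ARCn processes restart automatically after being killed). The fix: add redo groups and archive manually at regular intervals as a stopgap, then restart node 1 during an off-peak window, which resolved the problem. A friendly reminder: be careful when killing processes.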
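One way to lower the risk of this kind of accident is to generate the kill statements from a filtered session list instead of killing by pattern. The Python sketch below assumes the session rows have already been fetched from V$SESSION (whose TYPE column distinguishes USER from BACKGROUND sessions). The row dictionaries, the SID/serial values, and the helper name kill_statements are all made up for illustration.

```python
def kill_statements(sessions, program_prefix):
    """Build KILL SESSION statements for user sessions of one program only."""
    statements = []
    for s in sessions:
        if s["type"] != "USER":
            continue  # never touch background processes such as ARCn or LGWR
        if not s["program"].startswith(program_prefix):
            continue  # leave unrelated applications alone
        statements.append(
            "ALTER SYSTEM KILL SESSION '%d,%d' IMMEDIATE;" % (s["sid"], s["serial"])
        )
    return statements

# Hypothetical rows, shaped like a query result from V$SESSION
sessions = [
    {"sid": 101, "serial": 11, "type": "USER", "program": "JDBC Thin Client"},
    {"sid": 202, "serial": 22, "type": "BACKGROUND", "program": "oracle@host (ARC0)"},
    {"sid": 303, "serial": 33, "type": "USER", "program": "sqlplus@host"},
]
stmts = kill_statements(sessions, "JDBC")
print("\n".join(stmts))
```

Reviewing the generated statements before running them gives one more chance to catch a session that should not be killed.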

Installing Oracle 12c single-node RAC

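Anyone who has installed Oracle 12c RAC will have noticed that it is a monster: it consumes a great deal of memory, I/O and CPU. Without physical machines, installing Oracle 12c RAC in virtual machines takes a fairly well-resourced host. With 8 GB of RAM, even if the host itself used none of it, there would only just be enough for two node VMs, so an 8 GB host basically cannot run a two-node Oracle 12c RAC. For those who lack such resources but still want to try out 12c RAC features, I show a single-node RAC here (a RAC cluster with only one node). The focus is on where a single-node installation differs from a multi-node one, and the installation is meant only for experimenting.
Memory requirements
[Image: 12c_rac_require]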


Preparing the environment

[oracle@xifenfei ~]$ more /etc/oracle-release 
Oracle Linux Server release 5.8

[oracle@xifenfei ~]$ free -m
             total       used       free     shared    buffers     cached
Mem:          4350       4036        313          0          4       2805
-/+ buffers/cache:       1226       3124
Swap:         2047        853       1193

[oracle@xifenfei ~]$ /sbin/ifconfig
eth0      Link encap:Ethernet  HWaddr 00:0C:29:02:1B:A7  
          inet addr:192.168.30.22  Bcast:192.168.30.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:66152 errors:0 dropped:0 overruns:0 frame:0
          TX packets:83647 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:21143877 (20.1 MiB)  TX bytes:44756747 (42.6 MiB)

eth1      Link encap:Ethernet  HWaddr 00:0C:29:02:1B:B1  
          inet addr:10.10.30.22  Bcast:10.10.30.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:8064 errors:0 dropped:0 overruns:0 frame:0
          TX packets:366 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:645158 (630.0 KiB)  TX bytes:64125 (62.6 KiB)

[oracle@xifenfei ~]$ more /etc/hosts
127.0.0.1       localhost.localdomain localhost

10.10.30.22     xifenfei-priv

192.168.30.22   xifenfei
192.168.30.32   xifenfei-vip
192.168.30.42   scan-ip

[root@xifenfei ~]# yum install oracle-validated
[root@xifenfei rpm]# rpm -ivh cvuqdisk-1.0.9-1.rpm
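A common slip when preparing /etc/hosts is putting the VIP or SCAN address on a different subnet from the public interface. The check below is a small Python sketch using the standard ipaddress module. The addresses are the ones from the hosts file above, while the helper name off_subnet is an assumption.

```python
import ipaddress

def off_subnet(public_ip, netmask, addresses):
    """Return the name -> ip entries that are NOT on the public interface's subnet."""
    network = ipaddress.ip_network("%s/%s" % (public_ip, netmask), strict=False)
    return {
        name: ip
        for name, ip in addresses.items()
        if ipaddress.ip_address(ip) not in network
    }

# eth0 is the public interface in the ifconfig output above
public_names = {"xifenfei-vip": "192.168.30.32", "scan-ip": "192.168.30.42"}
bad = off_subnet("192.168.30.22", "255.255.255.0", public_names)
print(bad)  # an empty dict means the VIP and SCAN share the public subnet
```

The private name xifenfei-priv (10.10.30.22) is expected to be on a different network, so it is left out of the check.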

Installing the GRID software
[Image: single_rac_gi0.jpg]
[Image: single_rac_gi1]
[Image: single_rac_gi2.jpg]


Installing the Oracle DB software
[Image: single_rac_db1]


Creating the database
[Image: single_rac_dbca]


Installation result

[root@xifenfei ~]# crsctl status res -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details       
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.ASMNET1LSNR_ASM.lsnr
               ONLINE  ONLINE       xifenfei                 STABLE
ora.DATA.dg
               ONLINE  ONLINE       xifenfei                 STABLE
ora.LISTENER.lsnr
               ONLINE  ONLINE       xifenfei                 STABLE
ora.net1.network
               ONLINE  ONLINE       xifenfei                 STABLE
ora.ons
               ONLINE  ONLINE       xifenfei                 STABLE
ora.proxy_advm
               ONLINE  ONLINE       xifenfei                 STABLE
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       xifenfei                 STABLE
ora.MGMTLSNR
      1        ONLINE  ONLINE       xifenfei                 169.254.243.16 10.10
                                                             .30.22,STABLE
ora.asm
      1        ONLINE  ONLINE       xifenfei                 STABLE
      2        OFFLINE OFFLINE                               STABLE
      3        OFFLINE OFFLINE                               STABLE
ora.cdb.db
      1        ONLINE  ONLINE       xifenfei                 Open,STABLE
ora.cvu
      1        ONLINE  ONLINE       xifenfei                 STABLE
ora.mgmtdb
      1        ONLINE  ONLINE       xifenfei                 Open,STABLE
ora.oc4j
      1        ONLINE  ONLINE       xifenfei                 STABLE
ora.scan1.vip
      1        ONLINE  ONLINE       xifenfei                 STABLE
ora.xifenfei.vip
      1        ONLINE  ONLINE       xifenfei                 STABLE
--------------------------------------------------------------------------------
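For longer listings it can help to pick out only the resources whose target is ONLINE but whose state is not. The following Python sketch parses the tabular layout shown above. It is an illustration only; the resource ora.demo.lsnr in the sample is invented to show a failing row.

```python
STATES = {"ONLINE", "OFFLINE", "INTERMEDIATE", "UNKNOWN"}

def target_online_but_down(lines):
    """Return (resource, state) for rows with Target ONLINE and State not ONLINE."""
    problems = []
    resource = None
    for line in lines:
        if not line.strip() or line.startswith("-"):
            continue  # blank lines and separator rules
        if not line[0].isspace():
            resource = line.strip()  # resource name (header lines are harmlessly overwritten)
            continue
        states = [tok for tok in line.split() if tok in STATES]
        if len(states) >= 2 and states[0] == "ONLINE" and states[1] != "ONLINE":
            problems.append((resource, states[1]))
    return problems

sample = """\
ora.LISTENER.lsnr
               ONLINE  ONLINE       xifenfei                 STABLE
ora.asm
      1        ONLINE  ONLINE       xifenfei                 STABLE
      2        OFFLINE OFFLINE                               STABLE
ora.demo.lsnr
               ONLINE  OFFLINE      xifenfei                 STABLE
""".splitlines()
down = target_online_but_down(sample)
print(down)
```

Rows such as the second and third ora.asm instances, where both target and state are OFFLINE, are not reported because nothing is asking them to run.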

[root@xifenfei ~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:0C:29:02:1B:A7  
          inet addr:192.168.30.22  Bcast:192.168.30.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:66419 errors:0 dropped:0 overruns:0 frame:0
          TX packets:83849 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:21168865 (20.1 MiB)  TX bytes:44798962 (42.7 MiB)

eth0:1    Link encap:Ethernet  HWaddr 00:0C:29:02:1B:A7  
          inet addr:192.168.30.42  Bcast:192.168.30.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

eth0:2    Link encap:Ethernet  HWaddr 00:0C:29:02:1B:A7  
          inet addr:192.168.30.32  Bcast:192.168.30.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

eth1      Link encap:Ethernet  HWaddr 00:0C:29:02:1B:B1  
          inet addr:10.10.30.22  Bcast:10.10.30.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:12244 errors:0 dropped:0 overruns:0 frame:0
          TX packets:368 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:994988 (971.6 KiB)  TX bytes:64536 (63.0 KiB)

eth1:1    Link encap:Ethernet  HWaddr 00:0C:29:02:1B:B1  
          inet addr:169.254.243.16  Bcast:169.254.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:91800 errors:0 dropped:0 overruns:0 frame:0
          TX packets:91800 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:68999524 (65.8 MiB)  TX bytes:68999524 (65.8 MiB)

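Configuring a second listener on 11gR2 GI

A customer generates a very large volume of archived logs every day. To avoid affecting the business, they wanted a dedicated 10 GbE network used solely for shipping archived logs to the Data Guard (DG) standby. That requires adding a listener on the dedicated network in an 11g (11.2.0.3, Linux) RAC. This post walks through the whole procedure for configuring a second listener on the primary: name resolution, adding the network, adding the VIPs, configuring the listener, and setting listener_networks.
Network interfaces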

[oracle@q9db01 admin]$ /sbin/ifconfig eth2
eth2      Link encap:Ethernet  HWaddr 90:E2:BA:1E:14:34  
          inet addr:192.168.5.60  Bcast:192.168.5.255  Mask:255.255.255.0
          inet6 addr: fe80::92e2:baff:fe1e:1434/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3932094203 errors:0 dropped:0 overruns:0 frame:0
          TX packets:176073749 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:5752649063554 (5.2 TiB)  TX bytes:31298228144 (29.1 GiB)

[grid@q9db02 ~]$ /sbin/ifconfig eth2
eth2      Link encap:Ethernet  HWaddr 90:E2:BA:1E:13:5C  
          inet addr:192.168.5.61  Bcast:192.168.5.255  Mask:255.255.255.0
          inet6 addr: fe80::92e2:baff:fe1e:135c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:22861938 errors:0 dropped:0 overruns:0 frame:0
          TX packets:187447459 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:2603356463 (2.4 GiB)  TX bytes:276672018723 (257.6 GiB)

Configuring the hosts file

#public dg ip
192.168.5.60     q9db01-dg
192.168.5.61     q9db02-dg

192.168.5.64    q9db01-dg-vip
192.168.5.65    q9db02-dg-vip

Configuration in CRS

--Add the network resource
[root@q9db01 ~]# srvctl add network -k 2 -S 192.168.5.0/255.255.255.0/eth2 -w static -v

--Start the network resource
[root@q9db01 ~]# crsctl start res ora.net2.network

--Add the VIP resources
[root@q9db01 ~]# srvctl add vip -n q9db01 -A 192.168.5.64/255.255.255.0 -k 2
[root@q9db01 ~]# srvctl add vip -n q9db02 -A 192.168.5.65/255.255.255.0 -k 2

--Start the VIP resources
[root@q9db01 ~]# srvctl start vip -i q9db01-dg-vip
[root@q9db01 ~]# srvctl start vip -i q9db02-dg-vip

--Create the listener with netca
[root@q9db01 ~]# su - grid
[grid@q9db01 ~]$ export DISPLAY=172.18.50.150:0.0
[grid@q9db01 ~]$ netca
------Select Subnet 2 (the 192.168.5.0 network that 192.168.5.60 belongs to) and a port other than 1521---------

--Check resource status
[grid@q9db01 ~]$ crsctl status res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.ARCH.dg
               ONLINE  ONLINE       q9db01                                       
               ONLINE  ONLINE       q9db02                                       
ora.DATA.dg
               ONLINE  ONLINE       q9db01                                       
               ONLINE  ONLINE       q9db02                                       
ora.LISTENER.lsnr
               ONLINE  ONLINE       q9db01                                       
               ONLINE  ONLINE       q9db02                                       
ora.LISTENER_DG.lsnr
               ONLINE  ONLINE       q9db01                                       
               ONLINE  ONLINE       q9db02                                       
ora.OCR_VOTE.dg
               ONLINE  ONLINE       q9db01                                       
               ONLINE  ONLINE       q9db02                                       
ora.asm
               ONLINE  ONLINE       q9db01                   Started             
               ONLINE  ONLINE       q9db02                   Started             
ora.gsd
               OFFLINE OFFLINE      q9db01                                       
               OFFLINE OFFLINE      q9db02                                       
ora.net1.network
               ONLINE  ONLINE       q9db01                                       
               ONLINE  ONLINE       q9db02                                       
ora.net2.network
               ONLINE  ONLINE       q9db01                                       
               ONLINE  ONLINE       q9db02                                       
ora.ons
               ONLINE  ONLINE       q9db01                                       
               ONLINE  ONLINE       q9db02                                       
ora.registry.acfs
               ONLINE  ONLINE       q9db01                                       
               ONLINE  ONLINE       q9db02                                       
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       q9db01                                       
ora.cvu
      1        ONLINE  ONLINE       q9db01                                       
ora.oc4j
      1        ONLINE  ONLINE       q9db01                                       
ora.q9db.db
      1        ONLINE  ONLINE       q9db01                   Open                
      2        ONLINE  ONLINE       q9db02                   Open                
ora.q9db01-dg-vip.vip
      1        ONLINE  ONLINE       q9db01                                       
ora.q9db01.vip
      1        ONLINE  ONLINE       q9db01                                       
ora.q9db02-dg-vip.vip
      1        ONLINE  ONLINE       q9db02                                       
ora.q9db02.vip
      1        ONLINE  ONLINE       q9db02                                       
ora.scan1.vip
      1        ONLINE  ONLINE       q9db01                        

--Check listener status
[grid@q9db01 ~]$ lsnrctl status listener_dg

LSNRCTL for Linux: Version 11.2.0.3.0 - Production on 19-JUN-2013 14:40:24

Copyright (c) 1991, 2011, Oracle.  All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER_DG)))
STATUS of the LISTENER
------------------------
Alias                     LISTENER_DG
Version                   TNSLSNR for Linux: Version 11.2.0.3.0 - Production
Start Date                19-JUN-2013 14:38:43
Uptime                    0 days 0 hr. 1 min. 42 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /u01/app/11.2.0/grid/network/admin/listener.ora
Listener Log File         /u01/app/grid/diag/tnslsnr/q9db01/listener_dg/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER_DG)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.5.64)(PORT=1522)))
The listener supports no services
The command completed successfully

Configuring listener_networks

--Configure in the RDBMS home's tnsnames.ora
Q9DB01_LOCAL_NET1 =(DESCRIPTION =(ADDRESS = (PROTOCOL = TCP)(HOST = q9db01-vip )(PORT = 1521)))

Q9DB01_LOCAL_NET2 =(DESCRIPTION =(ADDRESS = (PROTOCOL = TCP)(HOST = q9db01-dg-vip )(PORT = 1522)))

Q9DB02_LOCAL_NET1 =(DESCRIPTION =(ADDRESS = (PROTOCOL = TCP)(HOST = q9db02-vip )(PORT = 1521)))

Q9DB02_LOCAL_NET2 =(DESCRIPTION =(ADDRESS = (PROTOCOL = TCP)(HOST = q9db02-dg-vip )(PORT = 1522)))

Q9DB_REMOTE_NET2 =(DESCRIPTION_LIST =(DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = q9db01-dg-vip )
(PORT = 1522)))(DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = q9db02-dg-vip )(PORT = 1522))))

--Set listener_networks
----Node 1
SQL> ALTER SYSTEM SET listener_networks='((NAME=network1)(LOCAL_LISTENER=Q9DB01_LOCAL_NET1)
     (REMOTE_LISTENER=q9dbscan:1521))','((NAME=network2)(LOCAL_LISTENER=Q9DB01_LOCAL_NET2)
     (REMOTE_LISTENER=Q9DB_REMOTE_NET2))'SCOPE=BOTH SID='q9db1';

System altered.

----Node 2
SQL> ALTER SYSTEM SET listener_networks='((NAME=network1)(LOCAL_LISTENER=Q9DB02_LOCAL_NET1)
     (REMOTE_LISTENER=q9db-scan:1521))','((NAME=network2)(LOCAL_LISTENER=Q9DB02_LOCAL_NET2)
     (REMOTE_LISTENER=Q9DB_REMOTE_NET2))'SCOPE=BOTH SID='q9db2';

System altered.
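Because the two statements differ only in the node-specific alias prefix and the SID, they can be generated from a template, which avoids typos between nodes. This Python sketch only builds the text of the statement. The helper name is hypothetical, and the SCAN listener address is passed in as a parameter because it depends on the cluster's actual SCAN name.

```python
def listener_networks_sql(alias_prefix, scan_listener, sid,
                          remote_net2="Q9DB_REMOTE_NET2"):
    """Build the ALTER SYSTEM statement for one instance."""
    net1 = "((NAME=network1)(LOCAL_LISTENER=%s_LOCAL_NET1)(REMOTE_LISTENER=%s))" % (
        alias_prefix, scan_listener)
    net2 = "((NAME=network2)(LOCAL_LISTENER=%s_LOCAL_NET2)(REMOTE_LISTENER=%s))" % (
        alias_prefix, remote_net2)
    return "ALTER SYSTEM SET listener_networks='%s','%s' SCOPE=BOTH SID='%s';" % (
        net1, net2, sid)

# The SCAN address here is an example value; use the cluster's real SCAN name
for prefix, sid in (("Q9DB01", "q9db1"), ("Q9DB02", "q9db2")):
    print(listener_networks_sql(prefix, "q9db-scan:1521", sid))
```

The alias names follow the tnsnames.ora entries defined earlier in this section (Q9DB01_LOCAL_NET1, Q9DB01_LOCAL_NET2, and so on).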

Checking listener status

--Node 1
[grid@q9db01 ~]$ lsnrctl status listener_dg

LSNRCTL for Linux: Version 11.2.0.3.0 - Production on 19-JUN-2013 17:12:45

Copyright (c) 1991, 2011, Oracle.  All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER_DG)))
STATUS of the LISTENER
------------------------
Alias                     LISTENER_DG
Version                   TNSLSNR for Linux: Version 11.2.0.3.0 - Production
Start Date                19-JUN-2013 16:47:03
Uptime                    0 days 0 hr. 25 min. 42 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /u01/app/11.2.0/grid/network/admin/listener.ora
Listener Log File         /u01/app/grid/diag/tnslsnr/q9db01/listener_dg/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER_DG)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.5.64)(PORT=1522)))
Services Summary...
Service "q9db" has 2 instance(s).
  Instance "q9db1", status READY, has 2 handler(s) for this service...
  Instance "q9db2", status READY, has 1 handler(s) for this service...
The command completed successfully

--Node 2
[grid@q9db02 ~]$ lsnrctl status listener_dg

LSNRCTL for Linux: Version 11.2.0.3.0 - Production on 19-JUN-2013 17:12:02

Copyright (c) 1991, 2011, Oracle.  All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER_DG)))
STATUS of the LISTENER
------------------------
Alias                     LISTENER_DG
Version                   TNSLSNR for Linux: Version 11.2.0.3.0 - Production
Start Date                19-JUN-2013 16:52:24
Uptime                    0 days 0 hr. 19 min. 37 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /u01/app/11.2.0/grid/network/admin/listener.ora
Listener Log File         /u01/app/grid/diag/tnslsnr/q9db02/listener_dg/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER_DG)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.5.65)(PORT=1522)))
Services Summary...
Service "q9db" has 2 instance(s).
  Instance "q9db1", status READY, has 1 handler(s) for this service...
  Instance "q9db2", status READY, has 2 handler(s) for this service...
The command completed successfully
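The key difference from the earlier status output is that the listener now reports service q9db with both instances registered. A check like that can be scripted against captured lsnrctl status text. The Python sketch below assumes the output format shown above, and registered_instances is an illustrative name.

```python
import re

SERVICE = re.compile(r'Service "([^"]+)" has \d+ instance')
INSTANCE = re.compile(r'Instance "([^"]+)", status (\w+)')

def registered_instances(lines):
    """Map each service name to {instance name: status}."""
    services = {}
    current = None
    for line in lines:
        m = SERVICE.search(line)
        if m:
            current = services.setdefault(m.group(1), {})
            continue
        m = INSTANCE.search(line)
        if m and current is not None:
            current[m.group(1)] = m.group(2)
    return services

# Lines copied from the node 2 output above
sample = [
    "Services Summary...",
    'Service "q9db" has 2 instance(s).',
    '  Instance "q9db1", status READY, has 1 handler(s) for this service...',
    '  Instance "q9db2", status READY, has 2 handler(s) for this service...',
    "The command completed successfully",
]
found = registered_instances(sample)
print(found)
```

An empty result corresponds to the "The listener supports no services" state seen before listener_networks was set.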

Testing the new listener

--tns entries in the RDBMS home
q9db1dg=
(DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = q9db01-dg-vip)(PORT = 1522))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SID = q9db1)
    )
 )

q9db2dg=
(DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = q9db02-dg-vip)(PORT = 1522))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SID = q9db2)
    )
 )

--Verification
SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
PL/SQL Release 11.2.0.3.0 - Production
CORE    11.2.0.3.0      Production
TNS for Linux: Version 11.2.0.3.0 - Production
NLSRTL Version 11.2.0.3.0 - Production

SQL> conn test/test@q9db1dg
Connected.
SQL> show parameter instance_name

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
instance_name                        string      q9db1
SQL> conn test/test@q9db2dg
Connected.
SQL>  show parameter instance_name

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
instance_name                        string      q9db2

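At this point both instances can be reached through the new listener, so the listener configuration is complete and the databases can communicate over the dedicated network.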

VIP/lsnr resources failing because the gateway is unreachable

crs_stat shows the listener and VIP of node 1 flapping (ONLINE one moment, OFFLINE the next)

rac1-> crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.devdb.db   application    ONLINE    ONLINE    rac1        
ora....b1.inst application    ONLINE    ONLINE    rac1        
ora....b2.inst application    ONLINE    ONLINE    rac2        
ora....SM1.asm application    ONLINE    ONLINE    rac1        
ora....C1.lsnr application    ONLINE    OFFLINE               
ora.rac1.gsd   application    ONLINE    ONLINE    rac1        
ora.rac1.ons   application    ONLINE    ONLINE    rac1        
ora.rac1.vip   application    ONLINE    ONLINE    rac2        
ora....SM2.asm application    ONLINE    ONLINE    rac2        
ora....C2.lsnr application    ONLINE    OFFLINE               
ora.rac2.gsd   application    ONLINE    ONLINE    rac2        
ora.rac2.ons   application    ONLINE    ONLINE    rac2        
ora.rac2.vip   application    ONLINE    ONLINE    rac1        
rac1-> crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.devdb.db   application    ONLINE    ONLINE    rac1        
ora....b1.inst application    ONLINE    ONLINE    rac1        
ora....b2.inst application    ONLINE    ONLINE    rac2        
ora....SM1.asm application    ONLINE    ONLINE    rac1        
ora....C1.lsnr application    ONLINE    OFFLINE               
ora.rac1.gsd   application    ONLINE    ONLINE    rac1        
ora.rac1.ons   application    ONLINE    ONLINE    rac1        
ora.rac1.vip   application    ONLINE    ONLINE    rac2        
ora....SM2.asm application    ONLINE    ONLINE    rac2        
ora....C2.lsnr application    ONLINE    ONLINE    rac2        
ora.rac2.gsd   application    ONLINE    ONLINE    rac2        
ora.rac2.ons   application    ONLINE    ONLINE    rac2        
ora.rac2.vip   application    ONLINE    ONLINE    rac2

Checking crsd.log

0Attempting to start `ora.rac1.vip` on member `rac2`
0Start of `ora.rac1.vip` on member `rac2` failed.
0startRunnable: setting CLI values
0Attempting to start `ora.rac1.vip` on member `rac1`
0Start of `ora.rac1.vip` on member `rac1` succeeded.
0startRunnable: setting CLI values
0Attempting to start `ora.rac1.LISTENER_RAC1.lsnr` on member `rac1`
0Start of `ora.rac1.LISTENER_RAC1.lsnr` on member `rac1` succeeded.
u_freem: mem passed is null
0CheckResource error for ora.rac1.vip error code = 1
0In stateChanged, ora.rac1.vip target is ONLINE
0ora.rac1.vip on rac1 went OFFLINE unexpectedly
0StopResource: setting CLI values
0Attempting to stop `ora.rac1.vip` on member `rac1`
0Stop of `ora.rac1.vip` on member `rac1` succeeded.
0ora.rac1.vip RESTART_COUNT=0 RESTART_ATTEMPTS=0
0ora.rac1.vip failed on rac1 relocating.
0StopResource: setting CLI values
0Attempting to stop `ora.rac1.LISTENER_RAC1.lsnr` on member `rac1`
0Stop of `ora.rac1.LISTENER_RAC1.lsnr` on member `rac1` succeeded.
0Attempting to start `ora.rac1.vip` on member `rac2`
0Start of `ora.rac1.vip` on member `rac2` failed.
0Attempting to start `ora.rac1.vip` on member `rac2`
0Start of `ora.rac1.vip` on member `rac2` succeeded.
0CRS-1002: Resource 'ora.rac1.vip' is already running on member 'rac2'

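This shows that the failure of the VIP resource takes the lsnr resource down with it, after which the VIP is started again and then the lsnr. That is why these two resources keep flapping when we watch them with crs_stat -t.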
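The same start/stop cycle can be quantified by counting attempts per resource in crsd.log. The following Python sketch assumes log lines in the shortened form shown above (without timestamps); flap_summary is a hypothetical helper.

```python
import re
from collections import Counter

ATTEMPT = re.compile(r"Attempting to (start|stop) `([^`]+)` on member `([^`]+)`")
UNEXPECTED = re.compile(r"(ora\.\S+) on (\S+) went OFFLINE unexpectedly")

def flap_summary(lines):
    """Count start/stop attempts and unexpected OFFLINE events per resource."""
    attempts, unexpected = Counter(), Counter()
    for line in lines:
        m = ATTEMPT.search(line)
        if m:
            attempts[(m.group(1), m.group(2))] += 1
            continue
        m = UNEXPECTED.search(line)
        if m:
            unexpected[m.group(1)] += 1
    return attempts, unexpected

# Lines copied from the crsd.log excerpt above
sample = [
    "0Attempting to start `ora.rac1.vip` on member `rac2`",
    "0Start of `ora.rac1.vip` on member `rac2` failed.",
    "0Attempting to start `ora.rac1.vip` on member `rac1`",
    "0ora.rac1.vip on rac1 went OFFLINE unexpectedly",
    "0Attempting to stop `ora.rac1.vip` on member `rac1`",
]
attempts, unexpected = flap_summary(sample)
print(attempts, unexpected)
```

A resource with many start attempts and repeated unexpected OFFLINE events in a short window is the one to investigate first, here the VIP rather than the listener that depends on it.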

Analyzing ora.rac1.vip.log

[ora.rac1.vip]: clsrcexecut:env ORACLE_CONFIG_HOME=/u01/app/oracle/product/10.2.0/crs_1
[ora.rac1.vip]: clsrcexecut:cmd=/u01/app/oracle/product/10.2.0/crs_1/bin/racgeut -e 
_USR_ORA_DEBUG=0 54 /u01/app/oracle/product/10.2.0/crs_1/bin/racgvip check rac1
[ora.rac1.vip]: clsrcexecut: rc = 1, time = 6.430s
[ora.rac1.vip]: end for resource = ora.rac1.vip, action=check,status=1,time=6.450s
[ora.rac1.vip]: ping to 192.168.1.1 via eth0 failed, rc = 1 (host=rac1)
ping to 192.168.1.1 via eth0 failed, rc = 1 (host=rac1)
[ora.rac1.vip]: clsrcstartorp: Error with malloc
[ora.rac1.vip]: ping to 192.168.1.1 via eth0 failed, rc = 1 (host=rac1)
ping to 192.168.1.1 via eth0 failed, rc = 1 (host=rac1)
Interface eth0 checked failed (host=rac1)
Invalid parameters, or failed to bring up VIP (host=rac1)

This shows that pinging 192.168.1.1 (the gateway) through eth0 fails, which keeps the VIP resource from working properly.
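To pull the failing ping targets out of a long VIP resource log, the "ping to ... failed" lines can be extracted with a regular expression. This is a minimal Python sketch, assuming the message format shown above; the helper name is an assumption.

```python
import re

PING_FAILED = re.compile(
    r"ping to (\S+) via (\S+) failed, rc = (\d+) \(host=([^)]+)\)"
)

def failed_pings(lines):
    """Return the distinct (target ip, interface, host) combinations that failed."""
    failures = set()
    for line in lines:
        m = PING_FAILED.search(line)
        if m:
            failures.add((m.group(1), m.group(2), m.group(4)))
    return failures

# Lines copied from ora.rac1.vip.log above
sample = [
    "[ora.rac1.vip]: ping to 192.168.1.1 via eth0 failed, rc = 1 (host=rac1)",
    "ping to 192.168.1.1 via eth0 failed, rc = 1 (host=rac1)",
    "Interface eth0 checked failed (host=rac1)",
]
result = failed_pings(sample)
print(result)
```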

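Confirming the cause and fixing it
We pinged the gateway (192.168.1.1) manually from node 1 and it was indeed unreachable. Further checking showed that a firewall had been turned on unexpectedly on the gateway server and was filtering some inbound packets. Node 1 happened to be among the blocked hosts, so its pings to the gateway failed and the error above appeared. After the firewall was turned off (or its rules were reset), RAC worked normally and the VIP and lsnr resources no longer flapped.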

OCR/Vote disk maintenance operations

Database version

SQL>  select * from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - Prod
PL/SQL Release 10.2.0.5.0 - Production
CORE    10.2.0.5.0      Production
TNS for Linux: Version 10.2.0.5.0 - Production
NLSRTL Version 10.2.0.5.0 - Production

OCR test (can be done online)

rac2-> ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          2
         Total space (kbytes)     :     160396
         Used space (kbytes)      :       4376
         Available space (kbytes) :     156020
         ID                       : 1302494786
         Device/File Name         : /dev/raw/raw11
                                    Device/File integrity check succeeded

                                    Device/File not configured

         Cluster registry integrity check succeeded

rac2-> more /etc/oracle/ocr.loc 
ocrconfig_loc=/dev/raw/raw11
local_only=false

--Add an OCR mirror
[root@rac2 bin]# ./ocrconfig -replace ocrmirror /dev/raw/raw12

rac2-> ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          2
         Total space (kbytes)     :     160396
         Used space (kbytes)      :       4376
         Available space (kbytes) :     156020
         ID                       : 1302494786
         Device/File Name         : /dev/raw/raw11
                                    Device/File integrity check succeeded
         Device/File Name         : /dev/raw/raw12
                                    Device/File integrity check succeeded

         Cluster registry integrity check succeeded

rac2-> more /etc/oracle/ocr.loc 
#Device/file  getting replaced by device /dev/raw/raw12 
ocrconfig_loc=/dev/raw/raw11
ocrmirrorconfig_loc=/dev/raw/raw12
local_only=false
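Since ocr.loc is a plain key=value file, a script can confirm after each ocrconfig step that the expected locations are recorded. This Python sketch parses text in the format shown above; parse_ocr_loc is an illustrative name, and the sample text is copied from the output above.

```python
def parse_ocr_loc(text):
    """Parse ocr.loc content into a dict, skipping comments and blank lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings

sample = """\
#Device/file  getting replaced by device /dev/raw/raw12
ocrconfig_loc=/dev/raw/raw11
ocrmirrorconfig_loc=/dev/raw/raw12
local_only=false
"""
conf = parse_ocr_loc(sample)
print(conf)
print("mirror configured:", "ocrmirrorconfig_loc" in conf)
```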

--Remove the primary OCR
[root@rac2 bin]# ./ocrconfig -replace ocr

rac2-> more /etc/oracle/ocr.loc 
#Device/file /dev/raw/raw11 being deleted 
ocrconfig_loc=/dev/raw/raw12
local_only=false

rac2-> ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          2
         Total space (kbytes)     :     160396
         Used space (kbytes)      :       4376
         Available space (kbytes) :     156020
         ID                       : 1302494786
         Device/File Name         : /dev/raw/raw12
                                    Device/File integrity check succeeded

                                    Device/File not configured

         Cluster registry integrity check succeeded

--Additionally, to remove the OCR mirror
[root@rac2 bin]# ./ocrconfig -replace ocrmirror

Vote disk test (offline in 10g / online in 11g)

--Stop CRS
[root@rac2 bin]# ./crsctl stop crs
[root@rac1 bin]# ./crsctl stop crs

--Query the vote disks
rac2-> crsctl query css votedisk
 0.     0    /dev/raw/raw31

--Add vote disks
[root@rac2 bin]# ./crsctl add css votedisk /dev/raw/raw23 -force
Now formatting voting disk: /dev/raw/raw23
successful addition of votedisk /dev/raw/raw23.
[root@rac2 bin]# ./crsctl add css votedisk /dev/raw/raw33 -force
Now formatting voting disk: /dev/raw/raw33
successful addition of votedisk /dev/raw/raw33.
[root@rac2 bin]# ./crsctl add css votedisk /dev/raw/raw32 -force
Now formatting voting disk: /dev/raw/raw32
successful addition of votedisk /dev/raw/raw32.

rac2-> crsctl query css votedisk
 0.     0    /dev/raw/raw31
 1.     0    /dev/raw/raw23
 2.     0    /dev/raw/raw33
 3.     0    /dev/raw/raw32

located 4 votedisk(s).

--Remove a vote disk
[root@rac2 bin]# ./crsctl delete css votedisk /dev/raw/raw33 -force
successful deletion of votedisk /dev/raw/raw33.
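Removing one disk brings the count from four back to three. As I understand the general Clusterware rule (it is not stated in the output above), a node must be able to access more than half of the configured voting disks, so an even count adds no extra fault tolerance over the odd count below it. The arithmetic can be sketched in Python; the function name is made up for illustration.

```python
def tolerated_failures(disk_count):
    """Voting disks that can be lost while a strict majority stays accessible."""
    required = disk_count // 2 + 1  # strictly more than half
    return disk_count - required

for n in (1, 3, 4, 5):
    print(n, "voting disk(s): can lose", tolerated_failures(n))
```

Under that assumption, three and four voting disks both tolerate the loss of one disk, which is why odd counts are the usual choice.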

--Start CRS
[root@rac2 bin]# ./crsctl start crs
[root@rac1 bin]# ./crsctl start crs

For further details, see the official operations note [ID 428681.1]