DATAGUARD中MAXIMUM AVAILABILITY+LGWR SYNC导致主库不能启动

qq群里面的一朋友的的DG因为备库已经下架,主库重启的时候不能正常启动,帮忙处理结果如下
版本相关信息

Oracle9i Enterprise Edition Release 9.2.0.1.0 - 64bit Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.1.0 - Production
ORACLE_HOME = /export/home/oracle/product/9.2.0
System name:    SunOS
Node name:      eTermSrv2
Release:        5.10
Version:        Generic
Machine:        sun4u
Instance name: abcd

数据库不能启动日志

Fri Aug 10 12:37:56 2012
ALTER DATABASE OPEN
Fri Aug 10 12:37:56 2012
LGWR: Primary database is in CLUSTER CONSISTENT mode
LGWR: Primary database is in MAXIMUM AVAILABILITY mode
LGWR: Destination LOG_ARCHIVE_DEST_1 is not serviced by LGWR
LNS0 started with pid=16
Fri Aug 10 12:37:59 2012
ORA-1013 signalled during: ALTER DATABASE OPEN...
Fri Aug 10 12:41:45 2012
LGWR: Error 12535 verifying archivelog destination LOG_ARCHIVE_DEST_2
LGWR: Continuing...
Fri Aug 10 12:41:45 2012
Errors in file /export/home/oracle/admin/abcd/bdump/abcd_lgwr_11504.trc:
ORA-12535: TNS:operation timed out
******************************************************************
LGWR: Setting 'active' archival for destination LOG_ARCHIVE_DEST_2
******************************************************************
Creating archive destination LOG_ARCHIVE_DEST_2: 'ora9i'
LGWR: Error 12535 creating archivelog file 'ora9i'
Fri Aug 10 12:45:32 2012
Errors in file /export/home/oracle/admin/abcd/bdump/abcd_lgwr_11504.trc:
ORA-12535: TNS:operation timed out
LGWR: Completed archiving  log 1 thread 1 sequence 6808
Thread 1 advanced to log sequence 6808
Fri Aug 10 12:45:32 2012
Errors in file /export/home/oracle/admin/abcd/bdump/abcd_lgwr_11504.trc:
ORA-01157: cannot identify/lock data file 1 - see DBWR trace file
ORA-01110: data file 1: '/export/home/oracle/oradata/abcd/system01.dbf'
LGWR: terminating instance due to error 1157
Instance terminated by LGWR, pid = 11504

通过这里大概看出数据库原dg配置是MAXIMUM AVAILABILITY,因为备机下架,导致ora9i的tns不能访问,从而出现一些列错误,其中使得lgwr异常,因为oracle的某种内部机制,导致dbwr不能访问数据文件(这里体现出来是system01.dbf不能访问,但实际上应该是所有所有数据文件均不能访问,因为system01.dbf位于第一,所以报出该错误.)

*** SESSION ID:(3.1) 2012-08-10 12:37:56.847
Destination LOG_ARCHIVE_DEST_1 is not serviced by LGWR
*** 2012-08-10 12:41:45.614
Error 12535 connecting to destination LOG_ARCHIVE_DEST_2 standby host 'ora9i'
*** 2012-08-10 12:41:45.615
LGWR: Error 12535 verifying archivelog destination LOG_ARCHIVE_DEST_2
Continuing...
ORA-12535: TNS:operation timed out
*** 2012-08-10 12:45:32.514
Error 12535 connecting to destination LOG_ARCHIVE_DEST_2 standby host 'ora9i'
Error 12535 attaching to destination LOG_ARCHIVE_DEST_2 standby host 'ora9i'
*** 2012-08-10 12:45:32.514
LGWR: Error 12535 creating archivelog file 'ora9i'
*** 2012-08-10 12:45:32.514
kcrrfail: dest:2 err:12535 force:0
ORA-12535: TNS:operation timed out
error 1157 detected in background process
ORA-01157: cannot identify/lock data file 1 - see DBWR trace file
ORA-01110: data file 1: '/export/home/oracle/oradata/abcd/system01.dbf'

通过trace文件,更加清楚的说明,可能是因为lgwr异常导致dbwr访问数据文件出现问题.

问题分析/解决汇总

SQL> show parameter log_archive_dest_state_1;

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
log_archive_dest_state_1             string      enable

SQL> show parameter log_archive_dest_state_2;

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
log_archive_dest_state_2             string      enable

SQL> show parameter log_archive_dest_1;

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
log_archive_dest_1                   string      location=/export/home/oracle/o
                                                 radata/abcd/archive

SQL> show parameter log_archive_dest_2;

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
log_archive_dest_2                   string      SERVICE=ora9i LGWR SYNC AFFIRM

SQL>select  protection_mode,database_role from v$database;

PROTECTION_MODE      DATABASE_ROLE
-------------------- ----------------
MAXIMUM AVAILABILITY PRIMARY

SQL>   show parameter succ;

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
log_archive_min_succeed_dest         integer     1

通过上面的sql查询结果和alert日志与trace文件结合分析,大体结论是:
数据库的dg的保护模式为:MAXIMUM AVAILABILITY
数据库的日志传输方式是:LGWR SYNC AFFIRM
现在因为备机下架,主库LGWR不能通过tns访问备库,从而导致数据库的dbwr进程访问异常,是的数据库不能正常被open
这里的故障有一些巧合:MAXIMUM AVAILABILITY+LGWR SYNC AFFIRM+9.2.0.1+SunOS
我不清楚是不是ORACLE的bug导致,但是这个问题解决起来比较简单,只需要修改log_archive_dest_state_2=defer,使得log_archive_dest_2参数不生效,让lgwr不再访问备机

深入理解LOG_ARCHIVE_DEST_n与STANDBY_ARCHIVE_DEST

一、案例引入
朋友的dg配置如下,问我为什么归档日志都放在了use_db_recovery_file_dest

主库
log_archive_dest_1='location=/U01/app/oracle/oradata/bfodb/arch,valid_for=(ONLINE_LOGFILE,ALL_ROLES)'
log_archive_dest_2='service=tnsname,ARCH SYNC NOAFFIRM delay=0 OPTIONAL max_failure=0 max_connections=1   reopen=300  register net_timeout=180  valid_for=(online_logfile,primary_role)'
standby_archive_dest='location=use_db_recovery_file_dest'

备库
log_archive_dest_1='location=/U01/app/oracle/oradata/bfodb/arch,valid_for=(ONLINE_LOGFILE,ALL_ROLES)'
log_archive_dest_2='location=location=use_db_recovery_file_dest,valid_for=(STANDBY_LOGFILE,STANDBY_ROLE)'
standby_archive_dest='location=use_db_recovery_file_dest'

我很惯性的回答,直接传输过来的日志放到LOG_ARCHIVE_DEST_n下面,fal_*过来的归档放置在standby_archive_dest中,也没有过多的思考为什么,因为我们的库都是这样的规则,我已经认为是一种准则了。这个规则也整合符合了他们的要求(都在use_db_recovery_file_dest)中,没有仔细的去看他们的配置,当朋友说到valid_for的属性的时候,我感觉有点不对头了。
1)他们的log_archive_dest_1配置的是ONLINE_LOGFILE,这个是备库,所以肯定不会放到这里
2)难道是通过log_archive_dest_2使得直接传输过来的日志放置到了use_db_recovery_file_dest中?让朋友查询v$standby_log,发现他们的库没有使用standby redo logfile,也就是说,log_archive_dest_2不可能用来传输日志了,现在剩下来可以传输日志的,只有standby_archive_dest了。
3)问题解决了,都传输到use_db_recovery_file_dest,因为LOG_ARCHIVE_DEST_n都不能用(只是从排除法证明)

二、问题深入分析
standby_archive_dest和log_archive_dest_*到底有什么关系,在什么情况下传输到对应的目录中?

如果备库利用standby redo log在备库端自动归档,那么归档日志将会被放置到LOG_ARCHIVE_DEST_n
如果备库是利用主库的arch进程传输过来的归档,那么将会被放置到STANDBY_ARCHIVE_DEST
补充说明:
1)主库的LOG_ARCHIVE_DEST_n='service'默认的arch传输方式,primary会远程将archived log传输到standby_archive_dest下

2)fal_*是通过arch传输过来的,所以使用的是STANDBY_ARCHIVE_DEST目录,如果不存在这个目录,就使用LOG_ARCHIVE_DEST_n

3)If both parameters are specified, the STANDBY_ARCHIVE_DEST initialization parameter 
  overrides the directory location specified with the LOG_ARCHIVE_DEST_n parameter.
如果STANDBY_ARCHIVE_DEST和LOG_ARCHIVE_DEST_n两个参数指定,STANDBY_ARCHIVE_DEST初始化覆盖LOG_ARCHIVE_DEST_n目录指定位置参数。

4)If none of the initialization parameters have been specified, then archived redo log files are stored in the default location for the STANDBY_ARCHIVE_DEST initialization parameter.
如果STANDBY_ARCHIVE_DEST和LOG_ARCHIVE_DEST_n都没有配置,将会把归档日志放到STANDBY_ARCHIVE_DEST的默认目录

5)如果STANDBY_ARCHIVE_DEST未配置,而配置了LOG_ARCHIVE_DEST_n,那么产生的归档将放置到LOG_ARCHIVE_DEST_n

6)建议:STANDBY_ARCHIVE_DEST和LOG_ARCHIVE_DEST_n=‘location’的配置相同

7)在11g,已经不建议使用STANDBY_ARCHIVE_DEST,也就是说建议配置dg的时候尽量使用standby redo logfile

Data Guard出现gap sequence修复

一、出现gap sequence现象
备库

Fetching gap sequence in thread 1, gap sequence 710-716
Tue May 31 15:02:38 2011
FAL[client]: Failed to request gap sequence 
 GAP - thread 1 sequence 710-716
 DBID 3240478808 branch 746916894
FAL[client]: All defined FAL servers have been attempted.
-------------------------------------------------------------
Check that the CONTROL_FILE_RECORD_KEEP_TIME initialization
parameter is defined to a value that is sufficiently large
enough to maintain adequate log switch information to resolve
archivelog gaps.
-------------------------------------------------------------

主库

Tue May 31 13:50:47 2011
FAL[server]: Fail to queue the whole FAL gap
 GAP - thread 1 sequence 710-716
 DBID 3240478808 branch 746916894

二、修复操作
1、查询备库的scn

SQL> SELECT CURRENT_SCN FROM V$DATABASE;
CURRENT_SCN
-----------
    1154337
--在出现意外datafile header scn不一致的时候,需要根据提示归档日志,找出最小scn

2、确定主库是否添加数据文件

SQL> select FILE#,name from v$datafile where CREATION_CHANGE#> =1154337;
no rows selected

确定主库在这个scn之后是否有添加数据文件,如果添加文件,需要手工在备库添加

3、备库停止日志应用

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;

4、主库增量备份并传输到备库上
主库进行增量备份

RMAN> BACKUP INCREMENTAL FROM SCN  1154337 DATABASE 
FORMAT '/home/oracle/xff_%U' tag 'XIFENFEI';
[oracle@localhost ~]$ scp xff* 192.168.1.30:/home/oracle/rman

说明:主库之前必须要做过rman的全备(没有全备的库,基于scn的增量备份也能够成功)

5、备库上进行恢复

RMAN> CATALOG START WITH '/home/oracle/rman';
RMAN> RECOVER DATABASE NOREDO;

说明:CATALOG START WITH是10g及其以后版本中才存在功能,没有该功能可以采用catalog或者复制主库的控制文件,rman备份放置和主库备份时相同目录实现。

6、主库上创建standby controlfile文件并传输到备库

RMAN> BACKUP CURRENT CONTROLFILE FOR STANDBY 
FORMAT '/home/oracle/xff_ctl.bck';
[oracle@localhost ~]$ scp xff_ctl.bck 192.168.1.30:/home/oracle/rman

创建standby controlfile两步可以需要根据实际情况考虑,大多数情况下不需要

7、备库恢复控制文件

RMAN> shutdown;
RMAN> STARTUP NOMOUNT;
RMAN> RESTORE STANDBY CONTROLFILE FROM '/home/oracle/rman/xff_ctl.bck';
RMAN> alter database mount;

8、清空备库日志组

SQL> ALTER DATABASE CLEAR LOGFILE GROUP 1; 
注:如果采用了standby log模式,不需要清空,如果清空会出现
SQL> ALTER DATABASE CLEAR LOGFILE GROUP 1;
ALTER DATABASE CLEAR LOGFILE GROUP 1
*
ERROR at line 1:
ORA-19527: physical standby redo log must be renamed
ORA-00312: online log 1 thread 1: '/u01/oradata/xienfei/redo01.log'

说明:如果没有采用standby log模式,有几组需要清空几组

9、备库重设flashback

SQL> ALTER DATABASE FLASHBACK OFF; 
SQL> ALTER DATABASE FLASHBACK ON;

10、备库重新接收并应用日志

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;

三、修复成功标志
1、sql中操作
在主库中执行alter system switch logfile;
分别主备库中执行select max(sequence#) from v$archived_log;如果一致标示修复成功

2、通过alert文件
主库

PING[ARC0]: Error 3113 when pinging standby xff.
Tue May 31 14:11:51 2011
Thread 1 advanced to log sequence 719
  Current log# 3 seq# 719 mem# 0: /u01/oradata/xienfei/redo03.log
Tue May 31 14:20:05 2011
Thread 1 advanced to log sequence 720
  Current log# 1 seq# 720 mem# 0: /u01/oradata/xienfei/redo01.log
Tue May 31 14:20:16 2011
ARC0: Standby redo logfile selected for thread 1 sequence 719 for destination LOG_ARCHIVE_DEST_2

备库

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION
Tue May 31 15:30:37 2011
Attempt to start background Managed Standby Recovery process (xff)
MRP0 started with pid=18, OS id=14704
Tue May 31 15:30:37 2011
MRP0: Background Managed Standby Recovery process started (xff)
Managed Standby Recovery not using Real Time Apply
 parallel recovery started with 2 processes
Media Recovery Log /u01/archive/1_718_746916894.arc
Tue May 31 15:30:43 2011
Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION
Tue May 31 15:30:52 2011
RFS[1]: Successfully opened standby log 4: '/u01/oradata/xienfei/s_redo1.log'
Media Recovery Log /u01/archive/1_719_746916894.arc
Media Recovery Waiting for thread 1 sequence 720

删除data guard归档日志

Oracle Data Guard 是通过归档文件来进行数据同步的。 主库的归档文件,我们可以在RMAN 备份的时候进行删除,但是备库的归档文件无法自动删除,需要写脚本来定时删除。

#!/bin/bash
export ORACLE_HOME=/opt/oracle/product/10.2.0/db_1
export ORACLE_SID=zjch
export ARCHIVE_DIR=/opt/oracle/oradata/zjch/archive
export LOG_FILE=/home/oracle/logs/del_archive.log

echo "开始删除归档日志:`date`……">>$LOG_FILE
if [ `whoami` != 'oracle' ]
then
echo "Error: You must be oracle to execute.">>$LOG_FILE
exit 99
fi

del_seq=`ls -tr $ARCHIVE_DIR/|grep -v stdarch | head -1|cut -f2 -d_`
$ORACLE_HOME/bin/sqlplus -silent "/ as sysdba" <<XFF>tmp.log
set pagesize 0 feedback off verify off heading off echo off
select max(sequence#) from v\$ARCHIVED_LOG where APPLIED='YES';
exit;
XFF
max_sn=`cat tmp.log`
rm tmp.log
max_sn=$(( $max_sn - 20 ))
#我这里是保留最近的20个归档文件,这个具体情况自己决定

while [[ ${del_seq} -lt ${max_sn} ]]
do
  echo "${ARCHIVE_DIR}/1_${del_seq}_681145105.dbf">>$LOG_FILE
  rm ${ARCHIVE_DIR}/1_${del_seq}_681145105.dbf
  #这里是我定义归档文件的格式,具体根据自己的归档文件格式来匹配,关键是匹配日志的sequence no。
  del_seq=$(( $del_seq + 1 ))
done
echo "删除归档日志结束:`date`……">>$LOG_FILE

#清楚controlfile中信息
$ORACLE_HOME/bin/rman target / <<XIFENFEI>>$LOG_FILE
crosscheck archivelog all;
delete expired archivelog all;
YES
exit;
XIFENFEI

echo "………………………………………………………………………………………………………………………">>$LOG_FILE

重点是在读取现在存在的归档日志文件中sequence最小值时,
使用了ls -tr $ARCHIVE_DIR/|grep -v stdarch | head -1|cut -f2 -d_

Oracle 9I Data Guard配置

1、修改参数文件
primary:
*.standby_archive_dest=’/opt/ora9/oradata/txzldb/stdarch/’
*.log_archive_dest_2=’service=txzldb.Standby optional reopen=60′ –配置tns
*.utl_file_dir=’/opt/ora9′
*.remote_archive_enable=TRUE
*.standby_file_management=AUTO
*.fal_server=’txzldb.Standby’
*.fal_client=’txzldb.Primary’
*.log_archive_dest_state_2=defer

standby:
*.standby_archive_dest=’/opt/ora9/oradata/txzldb/stdarch/’
*.log_archive_dest_2=’service=txzldb.Primary optional reopen=60′ –配置tns
*.utl_file_dir=’/opt/ora9′
*.remote_archive_enable=TRUE
*.standby_file_management=AUTO
*.fal_server=’txzldb.Primary’
*.fal_client=’txzldb.Standby’
*.log_archive_dest_state_2=defer

2、备份主库
backup database format=’/home/oracle/full_backup_%U_%s.bak’;
sql “alter system archive log current”;
–backup filesperset 10 archivelog all format=’/home/oracle/archivlog_%U_%s.bak’;(可选)
alter database create standby controlfile as ‘/home/oracle/control01.ctl’;

3、拷贝密码文件控制文件
scp /opt/oracle/product/9.2.0/dbs/orapwora9i
192.168.3.2:/opt/oracle/product/9.2.0/dbs/
scp control01.ctl
192.168.3.2:/opt/oracle/oradata/ora9i/control01.ctl

4、恢复备机
拷贝备份至备机与主机相同的目录
sqlplus “/ as sysdba”
  startup nomount pfile=”
  alter database mount standby database;
  exit
  rman target / nocatalog
  restore database
–restore archivelog all;(可选)
–recover automatic standby database;(可选)

5、启动恢复状态
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;
create spfile from pfile=”;
alter system set log_archive_dest_state_2=enable scope=both;

6、添加临时表空间
recover managed standby database cancel;
alter database open read only;
alter tablespace temp add tempfile ‘/opt/oracle/oradata/ora9i/temp01.dbf’
SIZE 412M AUTOEXTEND ON NEXT 640K MAXSIZE UNLIMITED;
shutdown immediate;
startup nomount pfile=”
alter database mount standby database;
recover managed standby database disconnect from session;