Monitoring ASM disk I/O performance

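Anyone who uses ASM has probably wondered how to monitor its disk I/O, since ASM looks like a black box compared with raw devices or a file system, where an OS command such as iostat does the job. Oracle anticipated this need when designing ASM and records per-disk statistics in the views v$asm_disk and v$asm_disk_stat. The two views expose the same information; the difference is that every query against v$asm_disk reads the disk headers, whereas v$asm_disk_stat returns the disk header data already held in memory, so querying it has very little impact on the disks. Sampling v$asm_disk_stat twice and subtracting the values therefore gives the I/O figures for each ASM disk over that interval. Oracle provides a tool for this called asmiostat; see ASMIOSTAT Script to collect iostats for ASM disks [ID 437996.1].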
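The subtraction described above can be sketched in shell. This is only an illustration under assumptions: each snapshot is a text file with one disk per line in the form `path reads writes` (for example spooled from a query on v$asm_disk_stat), and the function name `asm_delta` and the file layout are hypothetical, not part of asmiostat.

```shell
#!/bin/sh
# asm_delta OLD NEW
# Print, for every disk present in both snapshots, the number of reads and
# writes that happened between them.
# Assumed (hypothetical) snapshot line format: <path> <reads> <writes>
asm_delta() {
    awk '
        # First file: remember the old cumulative counters per disk path.
        NR == FNR { r[$1] = $2; w[$1] = $3; next }
        # Second file: subtract the old counters from the new ones.
        ($1 in r) { printf "%s %d %d\n", $1, $2 - r[$1], $3 - w[$1] }
    ' "$1" "$2"
}
```

Calling `asm_delta snap1.txt snap2.txt` a few seconds apart gives per-interval counts, which is the same idea asmiostat applies to its Reads and Writes columns.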

Make sure TIMED_STATISTICS=TRUE
TRUE is the default, but it is worth checking, because when the parameter is FALSE, READ_TIME and WRITE_TIME are reported as 0

[grid@xifenfei tmp]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.1.0 Production on Fri Feb 1 08:29:01 2013

Copyright (c) 1982, 2013, Oracle.  All rights reserved.


Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Automatic Storage Management option

SQL> show parameter TIMED_STATISTICS

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
timed_statistics                     boolean     TRUE

Using asmiostat

[grid@xifenfei tmp]$ ./asmiostat.sh help=y
Invalid parameter: <interval> must be > 0; <count> must be >= 0

./asmiostat.sh [-s ASM ORACLE_SID] [-h ASM ORACLE_HOME] [-g diskgroup] [<interval>] [<count>]

Output:
  DiskPath - Path to ASM disk
  DiskName - ASM disk name
  Gr       - ASM disk group number
  Dsk      - ASM disk number
  Reads    - Reads 
  Writes   - Writes 
  AvRdTm   - Average read time (in msec)
  AvWrTm   - Average write time (in msec)
  KBRd     - Kilobytes read
  KBWr     - Kilobytes written
  AvRdSz   - Average read size (in bytes)
  AvWrSz   - Average write size (in bytes)
  RdEr     - Read errors
  WrEr     - Write errors

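What the values mean

  DiskPath - Path to ASM disk
  DiskName - ASM disk name
  Gr       - ASM disk group number
  Dsk      - ASM disk number
  Reads    - Number of read I/O requests during the interval
  Writes   - Number of write I/O requests during the interval
  AvRdTm   - Average time per read I/O request (in msec)
  AvWrTm   - Average time per write I/O request (in msec)
  KBRd     - Amount of data read during the interval (KB)
  KBWr     - Amount of data written during the interval (KB)
  AvRdSz   - Average amount of data returned per read I/O request (bytes)
  AvWrSz   - Average amount of data written per write I/O request (bytes)
  RdEr     - Number of read I/O errors during the interval
  WrEr     - Number of write I/O errors during the interval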
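The averaged columns are simple ratios of interval totals. The sketch below shows that arithmetic only; it is not taken from the asmiostat source. The helper name `avg_per_request` is hypothetical, and the scale factor of 1000 in the usage note assumes the time total is expressed in seconds.

```shell
#!/bin/sh
# avg_per_request TOTAL COUNT SCALE
# Print TOTAL * SCALE / COUNT with one decimal place.
# Prints 0.0 when COUNT is 0, so an idle interval does not divide by zero.
avg_per_request() {
    awk -v t="$1" -v c="$2" -v s="$3" 'BEGIN {
        if (c == 0) printf "0.0\n"
        else        printf "%.1f\n", t * s / c
    }'
}
```

For instance, `avg_per_request 0.0024 4 1000` treats 0.0024 seconds spent on 4 reads as an average of 0.6 msec per read, and `avg_per_request 32768 4 1` gives the average request size in bytes.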

asmiostat sample output

[grid@xifenfei tmp]$ ./asmiostat.sh -s $ORACLE_SID -h $ORACLE_HOME -g DATA 1 3

Date: Fri Feb  1 08:31:45 CST 2013    Interval: 1 secs    Disk Group: DATA

DiskPath - DiskName                      Gr Dsk    Reads   Writes AvRdTm AvWrTm     KBRd     KBWr  AvRdSz  AvWrSz RdEr WrEr
/dev/sdb - DATA_0000                      1   0        0        0    0.0    0.0        0        0       0       0    0    0

Date: Fri Feb  1 08:31:47 CST 2013    Interval: 1 secs    Disk Group: DATA

DiskPath - DiskName                      Gr Dsk    Reads   Writes AvRdTm AvWrTm     KBRd     KBWr  AvRdSz  AvWrSz RdEr WrEr
/dev/sdb - DATA_0000                      1   0        4        3    0.6 1006.1        0        0       0       0    0    0

Date: Fri Feb  1 08:31:49 CST 2013    Interval: 1 secs    Disk Group: DATA

DiskPath - DiskName                      Gr Dsk    Reads   Writes AvRdTm AvWrTm     KBRd     KBWr  AvRdSz  AvWrSz RdEr WrEr
/dev/sdb - DATA_0000                      1   0        8        2    1.3    1.5        0        0       0       0    0    0


Scripts for killing sessions in bulk

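There are many situations where we need to kill a batch of sessions on a schedule to free database resources: a bug may prevent temp space from being released, a bug may keep PGA from being freed, or too many inactive sessions may be holding resources. Two ways to do this are shown here.
Killing sessions with a stored procedure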

--Create the log table
CREATE TABLE kill_session_record
(
   kill_time        DATE,
   kill_statement   VARCHAR2 (1000)
)
/

--Create the stored procedure that kills sessions
CREATE OR REPLACE PROCEDURE kill_inactive_session
IS
   CURSOR c
   IS
      SELECT sid, serial#
        FROM v$session s
       WHERE s.status = 'INACTIVE' AND s.username = 'XIFENFEI';

   k_sid      NUMBER;
   k_serial   NUMBER;
BEGIN
   OPEN c;

   FETCH c
   INTO k_sid, k_serial;

   WHILE c%FOUND
   LOOP
      BEGIN
         EXECUTE IMMEDIATE
               'ALTER SYSTEM DISCONNECT SESSION '''
            || k_sid
            || ','
            || k_serial
            || ''' IMMEDIATE';

         INSERT INTO kill_session_record (kill_time, kill_statement)
              VALUES (
                        SYSDATE,
                           'ALTER SYSTEM DISCONNECT SESSION '''
                        || k_sid
                        || ','
                        || k_serial
                        || ''' IMMEDIATE');
      EXCEPTION
         WHEN OTHERS
         THEN
            INSERT INTO kill_session_record (kill_time, kill_statement)
                 VALUES (
                           SYSDATE,
                              'Failure:ALTER SYSTEM DISCONNECT SESSION '''
                           || k_sid
                           || ','
                           || k_serial
                           || ''' IMMEDIATE');

            COMMIT;
      END;

      FETCH c
      INTO k_sid, k_serial;
   END LOOP;

   COMMIT;

   CLOSE c;
END;
/

--Schedule it with a job
DECLARE
   job   NUMBER;
BEGIN
   sys.DBMS_JOB.submit (job,
                        what        => 'kill_inactive_session;',
                        next_date   => SYSDATE,
                        interval    => 'TRUNC(SYSDATE + 1) +7/24');
   COMMIT;
   DBMS_OUTPUT.put_line (job);
END;
/
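The interval expression is ordinary Oracle date arithmetic: TRUNC(SYSDATE + 1) is midnight of the next day, and adding 7/24 of a day moves that to 07:00, so the job reruns every day at 07:00. The small helper below only illustrates the fraction-of-a-day conversion; the function name `day_fraction_to_hhmm` is hypothetical.

```shell
#!/bin/sh
# day_fraction_to_hhmm NUM DEN
# Convert NUM/DEN of a day (counted from midnight) into an HH:MM clock time.
# Integer arithmetic is used throughout: a day has 1440 minutes.
day_fraction_to_hhmm() {
    minutes=$(( $1 * 1440 / $2 ))
    printf '%02d:%02d\n' $(( minutes / 60 )) $(( minutes % 60 ))
}
```

`day_fraction_to_hhmm 7 24` prints 07:00; changing the interval to `+ 22.5/24`, for example, would correspond to `day_fraction_to_hhmm 45 48`, that is 22:30.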

On versions earlier than 10gR2, replace ALTER SYSTEM DISCONNECT SESSION with ALTER SYSTEM KILL SESSION

Killing sessions from the shell

--Shell script
# more kill_inactive_session.sh
#!/bin/sh
tmpfile0=/tmp/.kill_inactive_0
tmpfile1=/tmp/.kill_inactive_1
tmpfile2=/tmp/.kill_inactive_2
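# Feedback is switched off so that a line such as "6 rows selected." is not picked up as a pid below
sqlplus / as sysdba <<EOF
set feedback off
spool $tmpfile1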
select 'kill time:'||to_char(sysdate,'yyyy-mm-dd hh24:mi:ss') execute_time from dual;
select p.spid,s.sid,s.serial# from v\$process p,v\$session s
where s.paddr=p.addr
and username='XIFENFEI'
and s.status='INACTIVE';
spool off
EOF
cat $tmpfile1>>$tmpfile0
grep "^[0123456789]" $tmpfile1 |awk '{print $1}'>$tmpfile2
for x in `cat $tmpfile2`
do
kill -9 $x
done
rm $tmpfile1 $tmpfile2

--crontab schedule
00 07 * * * /u01/script/kill_inactive_session.sh

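Both scripts can take extra conditions in the WHERE clause to narrow down which sessions get killed. Killing at the database level is gentler than killing at the OS level, so prefer the database-level approach. If resources must be released immediately, an OS-level kill may be needed. With either method, an inactive session that holds an uncommitted transaction is killed as well, and its transaction is rolled back.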
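As a middle ground between the two approaches, the shell side can generate database-level statements instead of calling kill -9. The following is a minimal sketch, assuming the spooled rows have the three columns selected by the shell script (spid, sid, serial#); the function name `gen_disconnect` is hypothetical.

```shell
#!/bin/sh
# gen_disconnect
# Read spooled "spid sid serial#" rows on stdin and print one
# ALTER SYSTEM DISCONNECT SESSION statement per row.
# Lines that are not exactly three all-digit fields (headings, prompts,
# "N rows selected." feedback) are ignored.
gen_disconnect() {
    awk '
        NF == 3 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ {
            printf "ALTER SYSTEM DISCONNECT SESSION '\''%s,%s'\'' IMMEDIATE;\n", $2, $3
        }
    '
}
```

The generated statements can be reviewed first and then piped into sqlplus, which keeps the gentler database-level kill while still being driven from cron.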

Oracle 12c: defining PL/SQL functions in a SQL statement with WITH

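Oracle 12c supports writing a function inside the SQL statement itself, to supply function logic that the statement needs. This is useful when you do not want to create a new function in the database, or when the database is open read only and you still need a new function. It makes the Oracle database a little more flexible.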

SQL> select * from v$version;

BANNER                                                                               CON_ID
-------------------------------------------------------------------------------- ----------
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production              0
PL/SQL Release 12.1.0.1.0 - Production                                                    0
CORE    12.1.0.1.0      Production                                                        0
TNS for Linux: Version 12.1.0.1.0 - Production                                            0
NLSRTL Version 12.1.0.1.0 - Production                                                    0

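The test builds a simple function that checks whether the input value is a number: it returns Y if it is and N if it is not.
Before 12c you had to create the function first and then call it from a SELECT; in 12c a single SELECT statement does it all.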

SQL> with function Is_Number
  2    (x in varchar2) return varchar2 is
  3      Plsql_Num_Error exception;
  4       pragma exception_init(Plsql_Num_Error, -06502);
  5   begin
  6     if (To_Number(x) is NOT null) then
  7       return 'Y';
  8     else
  9       return '';
 10     end if;
 11   exception
 12    when Plsql_Num_Error then
 13      return 'N';
 14   end Is_Number;
 15  select is_number('www.orasos.com') is_num from dual;
 16  /

IS_NUM
--------------------------------------------------------------------------------
N

Things to watch for when using IDLE_TIME

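When inactive sessions need to be killed regularly, one approach is a stored procedure or script run on a schedule. Another is to set idle_time in a profile, but two details need attention: 1. sessions with v$session.status=SNIPED should be cleaned up; 2. an uncommitted transaction may be forcibly rolled back when the session times out.
Setting up the Oracle profile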

SQL> CREATE PROFILE KILLIDLE LIMIT IDLE_TIME 1;

Profile created.

SQL> select * from dba_profiles where profile='KILLIDLE';

PROFILE                        RESOURCE_NAME                    RESOURCE LIMIT
------------------------------ -------------------------------- -------- ------------
KILLIDLE                       COMPOSITE_LIMIT                  KERNEL   DEFAULT
KILLIDLE                       SESSIONS_PER_USER                KERNEL   DEFAULT
KILLIDLE                       CPU_PER_SESSION                  KERNEL   DEFAULT
KILLIDLE                       CPU_PER_CALL                     KERNEL   DEFAULT
KILLIDLE                       LOGICAL_READS_PER_SESSION        KERNEL   DEFAULT
KILLIDLE                       LOGICAL_READS_PER_CALL           KERNEL   DEFAULT
KILLIDLE                       IDLE_TIME                        KERNEL   1
KILLIDLE                       CONNECT_TIME                     KERNEL   DEFAULT
KILLIDLE                       PRIVATE_SGA                      KERNEL   DEFAULT
KILLIDLE                       FAILED_LOGIN_ATTEMPTS            PASSWORD DEFAULT
KILLIDLE                       PASSWORD_LIFE_TIME               PASSWORD DEFAULT
KILLIDLE                       PASSWORD_REUSE_TIME              PASSWORD DEFAULT
KILLIDLE                       PASSWORD_REUSE_MAX               PASSWORD DEFAULT
KILLIDLE                       PASSWORD_VERIFY_FUNCTION         PASSWORD DEFAULT
KILLIDLE                       PASSWORD_LOCK_TIME               PASSWORD DEFAULT
KILLIDLE                       PASSWORD_GRACE_TIME              PASSWORD DEFAULT

16 rows selected.

SQL> ALTER USER CHF PROFILE KILLIDLE;

User altered.

SQL> SELECT USERNAME,PROFILE FROM DBA_USERS where username='CHF';

USERNAME                       PROFILE
------------------------------ ------------------------------
CHF                            KILLIDLE

SQL> SHOW PARAMETER resource_limit 

NAME                                 TYPE        VALUE
------------------------------------ ----------- ---------------
resource_limit                       boolean     FALSE

SQL> ALTER SYSTEM SET resource_limit=TRUE;

System altered.

For the profile to take effect, resource_limit must be set to TRUE. IDLE_TIME is specified in minutes.

Testing IDLE_TIME

--session 1
SQL> show user;
USER is "CHF"

SQL> select * from t_xifenfei;

        ID
----------
         1

--delete the row
SQL> delete from t_xifenfei;

1 row deleted.

--get the sid
SQL> select sid from v$mystat where rownum=1;

       SID
----------
        20

--time at which the session starts to sit idle
SQL> select to_char(sysdate,'yyyy-mm-dd hh24:mi:ss') from dual;

TO_CHAR(SYSDATE,'YY
-------------------
2013-02-12 22:30:02

--session 2
SQL> show user;
USER is "SYS"

--check the time
SQL> select status,to_char(sysdate,'yyyy-mm-dd hh24:mi:ss') from v$session where sid=20;

STATUS   TO_CHAR(SYSDATE,'YY
-------- -------------------
INACTIVE 2013-02-12 22:31:00

--session 1
SQL> select to_char(sysdate,'yyyy-mm-dd hh24:mi:ss') from dual;
select to_char(sysdate,'yyyy-mm-dd hh24:mi:ss') from dual
*
ERROR at line 1:
ORA-02396: exceeded maximum idle time, please connect again
----the session has already timed out

--session 2
SQL> select status,to_char(sysdate,'yyyy-mm-dd hh24:mi:ss') from v$session where sid=20;

STATUS   TO_CHAR(SYSDATE,'YY
-------- -------------------
SNIPED   2013-02-12 22:34:40
----the session status is SNIPED

--session 1
SQL> conn chf/xifenfei
Connected.
SQL> select * from t_xifenfei;

        ID
----------
         1
----the transaction was rolled back

SNIPED – An inactive session that has exceeded some configured limits (for example, resource limits specified for the resource manager consumer group or idle_time specified in the user’s profile). Such sessions will not be allowed to become active again.
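A SNIPED session is only torn down once its client sends another request to the database; only then does the client connection end. If the client never sends such a request, the connection stays around, and the database may well exceed its configured session limit and raise ORA-00018, so the application can no longer work normally. When that happens, the following script resolves it with kill -9 on the pids.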

Script to kill SNIPED sessions

#!/bin/sh
tmpfile=/tmp/.kill_sniped
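# Feedback is switched off so that a line such as "6 rows selected." is not picked up as a pid below
sqlplus system/manager <<EOF
set feedback off
spool $tmpfile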
select p.spid from v\$process p,v\$session s
where s.paddr=p.addr
and s.status='SNIPED';
spool off
EOF
for x in `cat $tmpfile | grep "^[0123456789]"`
do
kill -9 $x
done
rm $tmpfile
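Filtering the spool file on a leading digit alone is fragile, because SQL*Plus feedback such as "6 rows selected." also starts with a digit and would be handed to kill -9. A stricter filter is sketched below; the function name `sniped_pids` is hypothetical, and it assumes the spool file holds one spid per line as in the script above.

```shell
#!/bin/sh
# sniped_pids FILE
# Print only those lines of the spool file that consist of a single
# all-digit token, i.e. lines that can only be a process id.
# Headings, separators, SQL> prompts and "N rows selected." are dropped.
sniped_pids() {
    awk 'NF == 1 && $1 ~ /^[0-9]+$/ { print $1 }' "$1"
}
```

A dry run such as `for pid in $(sniped_pids /tmp/.kill_sniped); do echo kill -9 "$pid"; done` prints the commands without executing them, which is a cheap check before putting the real script into cron.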

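One more note: IDLE_TIME forcibly terminates any session whose idle time exceeds the configured limit. Even if the session has an open transaction, once it has been inactive for longer than IDLE_TIME the database still terminates it and rolls the transaction back.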

A guess at how dbms_shared_pool.purge works

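The question: after dbms_shared_pool.purge removes a SQL statement's information from the shared pool, why is FIRST_LOAD_TIME unchanged when that statement is executed again?
Test: purge a statement, load it again, and FIRST_LOAD_TIME stays the same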

SQL> select to_char(sysdate,'yyyy-mm-dd hh24:mi:ss') from dual;

TO_CHAR(SYSDATE,'YY
-------------------
2013-02-12 16:44:00

SQL>  select SQL_ID,FIRST_LOAD_TIME from v$sql where sql_text like 'select to_char(sysdate,%dual';     

SQL_ID        FIRST_LOAD_TIME
------------- --------------------------------------
46zkt5sgbxrxv 2013-02-12/16:43:59

SQL> SELECT ADDRESS,HASH_VALUE,SQL_TEXT FROM V$SQLAREA where sql_id='46zkt5sgbxrxv';

ADDRESS  HASH_VALUE
-------- ----------
SQL_TEXT
--------------------------------------------------------------------------------
2587FFAC  515825595
select to_char(sysdate,'yyyy-mm-dd hh24:mi:ss') from dual


SQL>  exec dbms_shared_pool.purge('2587FFAC,515825595','C');

PL/SQL procedure successfully completed.

SQL> SELECT ADDRESS,HASH_VALUE,SQL_TEXT FROM V$SQLAREA where sql_id='46zkt5sgbxrxv';

no rows selected

SQL> !date
Tue Feb 12 16:55:15 CST 2013

SQL> select to_char(sysdate,'yyyy-mm-dd hh24:mi:ss') from dual;

TO_CHAR(SYSDATE,'YY
-------------------
2013-02-12 16:55:23

SQL> select FIRST_LOAD_TIME FROM V$SQLAREA where sql_id='46zkt5sgbxrxv';

FIRST_LOAD_TIME
--------------------------------------
2013-02-12/16:43:59

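The first execution of the statement set FIRST_LOAD_TIME to 2013-02-12/16:43:59. I then used dbms_shared_pool.purge to "remove" the statement's information from the shared pool, yet after running the same SQL again FIRST_LOAD_TIME had not changed. v$sql is built on a single base table, x$kglcursor_child, with no WHERE condition, so for the row to disappear from v$sql something in the x$ table itself must have changed, and that table is read straight from memory. My guess is therefore that dbms_shared_pool.purge sets some kind of marker in the memory the statement occupies in the shared pool, so that the statement no longer shows up in v$sql and related views. If that memory has not yet been aged out of the shared pool, the next execution of the statement reuses it first and updates the relevant fields, which would explain why FIRST_LOAD_TIME does not change.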

Testing the guess

--session 1
SQL> exec dbms_shared_pool.purge('2587FFAC,515825595','C');

PL/SQL procedure successfully completed.

SQL> select FIRST_LOAD_TIME FROM V$SQLAREA where sql_id='46zkt5sgbxrxv';

no rows selected

SQL> declare
  2  begin 
  3  FOR a IN  1..10000000  
  4  LOOP  
  5  EXECUTE IMMEDIATE 'insert into t_xifenfei values ('||a||')';
  6  END LOOP;
  7  commit;
  8  end;  
  9  /

--session 2
SQL> select count(sql_text) from v$sql where sql_text like 'insert into t_xifenfei%'
  2  ;

COUNT(SQL_TEXT)
---------------
            444

SQL> /

COUNT(SQL_TEXT)
---------------
            445

SQL> /

COUNT(SQL_TEXT)
---------------
            444

SQL> /

COUNT(SQL_TEXT)
---------------
            442
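--The dynamic SQL is still running, but the number of these statements in the shared pool has stopped growing, so the shared pool is full
--and some older statements have been flushed out. The purged statement must have been aged out of the shared pool by now, so run it again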

--session 3
SQL> select to_char(sysdate,'yyyy-mm-dd hh24:mi:ss') from dual;

TO_CHAR(SYSDATE,'YY
-------------------
2013-02-12 17:09:08

SQL> select SQL_ID,FIRST_LOAD_TIME from v$sql where sql_text like 'select to_char(sysdate,%dual';

SQL_ID        FIRST_LOAD_TIME
------------- --------------------------------------
46zkt5sgbxrxv 2013-02-12/17:09:07

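Here part of the shared pool was flushed out, and since the shared pool ages out the least recently used objects first, the statement purged at the beginning must have been aged out. Running the same SQL again therefore produced a new FIRST_LOAD_TIME, which confirms part of the guess.
This also answers a question from another reader: when does FIRST_LOAD_TIME change? I believe it changes when the memory the statement occupies has been aged out and the statement is then loaded into memory again. Flushing the shared pool has the same effect as aging out: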

SQL> select to_char(sysdate,'yyyy-mm-dd hh24:mi:ss') from dual;

TO_CHAR(SYSDATE,'YY
-------------------
2013-02-12 17:09:08

SQL> select SQL_ID,FIRST_LOAD_TIME from v$sql where sql_text like 'select to_char(sysdate,%dual';

SQL_ID        FIRST_LOAD_TIME
------------- --------------------------------------
46zkt5sgbxrxv 2013-02-12/17:09:07

SQL> alter system flush shared_pool;

System altered.

SQL> select SQL_ID,FIRST_LOAD_TIME from v$sql where sql_text like 'select to_char(sysdate,%dual';

no rows selected

SQL> select to_char(sysdate,'yyyy-mm-dd hh24:mi:ss') from dual;

TO_CHAR(SYSDATE,'YY
-------------------
2013-02-12 18:52:33

SQL> select SQL_ID,FIRST_LOAD_TIME from v$sql where sql_text like 'select to_char(sysdate,%dual';

SQL_ID        FIRST_LOAD_TIME
------------- --------------------------------------
46zkt5sgbxrxv 2013-02-12/18:52:33

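The shared pool is complicated, and this is only a rough first guess; I have not analyzed it with system-level dumps or similar methods. Readers who are interested are welcome to dig deeper and discuss.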