因未配置Hugepage会话数添增悲剧案例

今天一朋友反馈他们的一个数据库hang住了,通过ssh也不能登录系统,他们没有办法重启系统解决问题,现在想让我帮忙找出问题原因
分析awr得出


询问朋友,他们的库一般session保持在200个左右,这次突然飙升到750以上,属于异常情况

分析监听日志

看到在截图的时间内,整体访问较频繁,某个ip访问异常频繁,通过这些信息,初步怀疑是用户的数据库内存使用完,导致系统数据库hang住.

查看系统日志

Jun 26 14:35:55 result01 kernel: [5613531.566617] Free swap  = 0kB
Jun 26 14:35:55 result01 kernel: [5613531.566618] Total swap = 2104504kB
Jun 26 14:35:55 result01 kernel: [5613531.566620] Free swap:            0kB
Jun 26 14:35:55 result01 kernel: [5613531.591073] 2359296 pages of RAM
Jun 26 14:35:55 result01 kernel: [5613531.591074] 318236 reserved pages
Jun 26 14:35:55 result01 kernel: [5613531.591075] 73353 pages shared
Jun 26 14:35:56 result01 kernel: [5613531.591076] 529 pages swap cached
Jun 26 14:35:56 result01 kernel: [5613531.591079] Out of Memory: Kill process 8904 (oracle) score 891 and children.
Jun 26 14:35:56 result01 kernel: [5613531.591201] Out of memory: Killed process 8904 (oracle).
Jun 26 14:35:56 result01 kernel: [5613531.592280] oracle invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0

通过这个日志看出系统内存和交换分区都使用完,因为内存不够,系统开始kill掉部分oracle进程.通过这些确定是系统内存使用完导致hang住可以理解.

分析hang住原因
为什么session意外的从200添增到750的时候,系统内存被使用完

cat /proc/meminfo
MemTotal:      8164240 kB
SwapTotal:     2104504 kB
PageTables:      69732 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:     2048 kB

sga_target=3674210304
pga_aggregate_target=1732247552

从这里得出几个信息:
1.数据库总内存8g,swap配置2g
2.数据库未使用Hugepage
3.数据库设置sga和pga信息

内存参数估算
数据库总计占用内存为:(3674210304+1732247552)/1024/1024=5156M(pga可能未使用完,也可能超过)
结合实际sga_target=3674210304,会话数.
保守估计下Oracle进程占用的系统内存3674210304/(4*1024)*1.5*750/1024/1024=960M
估算如果使用Hugepage Oracle进程占用系统内存为:3674210304/(2*1024*1024)*1.5*750/1024/1024=1.9M
通过这里分析Oracle总占用内存为:5156+960=6116M
通过保守计算留给系统的内存大概为:1.8G左右
因为系统的其他操作,最终导致该系统内存耗完,系统和数据库hang住

总结说明
这是一个实实在在因为linux中因为未配置Hugepage,因为用户突增,导致系统内存消耗光,从而使得系统和数据库hang住的例子.
这个库因为sga不是非常大,所以Oracle占用系统内存不是高到离谱,如果sga配置为32g,1000个session,那就会占用12g的系统内存
通过这些可以看出在linux中配置Hugepage的优点:Hugepage不光是为了减轻cpu的负担,还可以减少系统内存的消耗;在没有极端的情况下,建议linux的数据库系统配置Hugepage.

To find the TX Enqueue contention in a RAC or OPS environment

今天查找TX Enqueue看到的一篇文章,拿出来共享下

PURPOSE 
------------- 
To find the TX Enqueue contention in a RAC or OPS environment 

What is TX Enqueue ? 
In one word oracle is maintaining queue for transaction. 

How Many Resources ? 
1/ active transaction 

How Many Locks? 
1/transaction + 1/process waiting for a locked row by that 
transaction. 

How Many Users? 
1 + 1/ process waiting for something locked by this transaction. 

Who Uses? 
All processes 

What need to investigate? 
The mode of TX (6/4), Holding/Waiting/Requesting 

SCOPE & APPLICATION 
===================== 

This document will help to analyze the application design related to transaction bottlenecks 
and database performance tuning. 

Let start with an example: 
=================== 
create table akdas (A1 number, Col1 Varchar2(10), Col2 Varchar2(10)); 
insert into akdas values(5,'Hello','Hi'); 
insert into akdas values(6,'Sudip','Datta'); 
insert into akdas values(7,'Preetam','Roy'); 
insert into akdas values(8,'Michael','Polaski'); 

From Node 1: 
========== 
update akdas set a1=11 where a1=6; 

From Node 2: 
========== 
update akdas set a1=12 where a1=7; 
update akdas set a1=11 where a1=6;  /* this will wait for Node1: to complete the transaction */ 

This Note Is Made To Analyzing Only the TX-Mode-6 (Exclusive). 


1. Now run the following query to track down the problem: Who is waiting 
=================================================================== 
prompt 
prompt Query 1. Waiting for TX Enqueue where mode is Exclusive 
prompt ===================================== 
prompt 
set linesize 100 
set pagesize 66 
col c1 for a15 
col c1 heading "Program Name " 
select l.inst_id,l.SID,program c1,l.TYPE,l.ID1,l.ID2,l.LMODE,l.REQUEST 
from gv$lock l,gv$session s 
where l.type like 'TX' and l.REQUEST =6 
and l.inst_id=s.inst_id and l.sid=s.sid 
order by id1 
/ 

Output will be here 
=============== 
   INST_ID      SID     Program Name       TY     ID1     ID2       LMODE      REQUEST 
-----------  ---------- ------------------ ---   -------- --------  ---------- --------
         2           13  sqlplus@opcbsol   TX     393236  780       0          6 
                         2 (TNS V1-V3) 

It is clear that SID 12 of instance 2 is doing a DML and waiting on REQUEST Mode 6. 


2. Let's run the next query to find who is holding 
=========================================== 

prompt 
prompt 
prompt Query 2. Holding for TX Enqueue where mode greater than 6 
prompt ======================================= 
prompt 
set linesize 100 
set pagesize 66 
col c1 for a15 
col c1 heading "Program Name " 
select l.inst_id,l.SID,program c1,l.TYPE,l.ID1,l.ID2,l.LMODE,l.REQUEST 
from gv$lock l,gv$session s 
where l.type like 'TX' and l.LMODE =6 and (l.ID1,l.ID2) in 
(select id1,id2 from gv$lock where type like 'TX' and REQUEST =6) 
and l.inst_id=s.inst_id and l.sid=s.sid 
order by id1 
/ 

Output will be here 
=============== 
   INST_ID      SID     Program Name      TY        ID1        ID2      LMODE    REQUEST 
   ----------  ---------- -------------- ---   ---------- --------   ----------- --------
         1          12    sqlplus@opcbsol TX     393236        780      6          0 
                          1 (TNS V1-V3) 

So holder is SID 12 on instance 1. Where LMODE = 6. 


3. Let's find out the exact file#, block# and Record# where it is waiting 
=============================================================== 

prompt 
prompt 
prompt Query 3. Object# ,File#, Block# and Slot# TX Enqueue in detail 
prompt ======================================== 
prompt 
set linesize 110 
col c0 for 999 
col c0 heading "INS" 
col c1 for a15 
col c1 heading "Program Name " 
select inst_id c0,sid,program c1,ROW_WAIT_OBJ# object_no, ROW_WAIT_FILE# Rfile_no,
ROW_WAIT_BLOCK# Block_no ,ROW_WAIT_ROW# Row_no 
from gv$session 
where (inst_id,sid) in (select inst_id,sid from gv$session_wait where p1='1415053318') 
/ 

Output Will be here 
=============== 
 INS     SID    Program Name     OBJECT_NO RFILE_NO BLOCK_NO  ROW_NO 
----- ---------- -------------   ---------------    --------- -------
   2         13     sqlplus@opcbsol  7261      9        12346     1 
                      2 (TNS V1-V3) 
  

From the output, it is clear that it is waiting on Relative_File# 9, Block# 12346, Row Number 1. 
Here Row Number 1 means the slot number in the block 12346. This Row_No start from 0 (zero). 


4. Let's Find the object details 
============================= 

prompt 
prompt 
prompt Query 4. Object Involve for TX Enqueue in detail 
prompt =============================== 
prompt 
set linesize 100 
set pagesize 100 
col owner for a10 
col object_name for a20 
col object_type for a10 
select owner,object_name,object_id,object_type 
from dba_objects 
where 
object_id in (select ROW_WAIT_OBJ# from gv$session 
where (inst_id, sid) in (select inst_id,sid from gv$session_wait where p1='1415053318')) 
/ 

Output Will be here 
=============== 
OWNER      OBJECT_NAME  OBJECT_ID   OBJECT_TYP 
---------  ------------ --------    -----------
AKDAS      AKDAS        7261        TABLE 


5. Let’s find the row value details 
============================= 

prompt 
prompt 
prompt Query 5. Finding the row value 
prompt ==================== 
prompt 
select * from <Owner>.<Table Name>  where rowid like 
DBMS_ROWID.ROWID_CREATE(1,&Object_No,&Rfile_No, &Block_No, &Row_Number) 
/ 
From query 3 and 4  we will get the value for all variables. 
Owner = AKDAS 
Table_Name = AKDAS 
Object_No = 7261 
Rfile_No =  9 
Block_No = 12346 
Row_Number = 1 

Output Will be here 
=============== 
        A1    Col1                 Col2 
  ---------- --------------- ---------- 
         6      Hello                Hi 

So we can drag down to the row value where TX Enqueue contention exists. 


6. Let’s find the user activity that is "Holder" and "Waiter" 
==================================================== 

set linesize 120 
set pagesize 66 
col c0 for 999 
col c0 heading "INS" 
col c1 for a9 
col c1 heading "OS User" 
col c2 for a9 
col c2 heading "Oracle User" 
col c3 for a15 
col c3 heading "Program Name" 
col b1 for a9 
col b1 heading "Unix PID" 
col b2 for 9999 justify left 
col b2 heading "ORA SID" 
col b3 for 999999 justify left 
col b3 heading "SERIAL#" 
col sql_text for a45 
set space 1 
break on b1 nodup on c0 nodup on c3 nodup on c1 nodup on c2 nodup on b2 nodup on b3 skip 2 
select a.inst_id c0,b.sid b2,c.spid b1, b.program c3, b.username c2,b.serial# b3, a.sql_text 
  from gv$sql a, gv$session b, gv$process c 
 where 
   a.address = b.sql_address 
   and b.paddr = c.addr 
   and a.hash_value = b.sql_hash_value 
   and a.inst_id=b.inst_id and a.inst_id=c.inst_id 
   and a.inst_id like '&inst_id' and b.sid like '&sid' 
 order by c.spid,a.hash_value 
/ 

This query asks the Instance Number and Sid number, which you can get from step 1 and 2. 
But remember , you can see the waiter activity, but you may not see the holder activity. 
Reason is, the holder is sitting idle after doing the DML operation. So SQL for Holder 
should not be seen under gv$sql. 

This all query can be run for single instance database, but all GV$ view need to replace to V$ 
and there is no INST_ID for V$ View, that part need to be taken care. 

来自:How to Find TX Enqueue Contention in RAC or OPS [ID 179582.1]