Monday, June 07, 2010

Row cache objects latch contention

A data loading process was running on UltraSPARC T2 CPU. To take advantage of the platform architecture (or, I'd rather say, to avoid it's limitations) the loading process has been design to run with a massive amount of parallel query slaves in order to extract the maximum output from CMT architecture.

Every time this data loading process executed, it experienced strange slowdowns on seemingly random points in time. Performance drops were quite substantial, which prompted to do an additional investigation. Upon a closer examination of ASH data, it turned out that all slowdowns were due to latch: row cache objects contention.

Row cache objects latch protects the dictionary cache. The first thing was to figure out whether most of the contention was contributed by a particular row cache objects child latch:
SQL> select latch#, child#, sleeps
2 from v$latch_children
3 where name='row cache objects'
4 and sleeps > 0
5 order by sleeps desc;

LATCH# CHILD# SLEEPS
---------- ---------- ----------
270 1 24241645
270 5 523
270 4 52
The first child certainly doesn't look good when we take the amount of sleeps experienced by it, compared to all the other child latches. Once we have the troublesome child latch identified, we can move on and see which type of dictionary cache it protects:
SQL> select distinct s.kqrstcln latch#,r.cache#,r.parameter name,r.type,r.subordinate#
from v$rowcache r,x$kqrst s
where r.cache#=s.kqrstcid
order by 1,4,5; 2 3 4

LATCH# CACHE# NAME TYPE SUBORDINATE#
------- ------ --------------------------------- ----------- ------------
1 3 dc_rollback_segments PARENT
2 1 dc_free_extents PARENT
3 4 dc_used_extents PARENT
4 2 dc_segments PARENT
5 0 dc_tablespaces PARENT
6 5 dc_tablespace_quotas PARENT
7 6 dc_files PARENT
8 10 dc_users PARENT
8 7 dc_users SUBORDINATE 0
8 7 dc_users SUBORDINATE 1
8 7 dc_users SUBORDINATE 2
9 8 dc_objects PARENT
9 8 dc_object_grants SUBORDINATE 0
10 17 dc_global_oids PARENT
11 12 dc_constraints PARENT
12 13 dc_sequences PARENT
13 16 dc_histogram_defs PARENT
13 16 dc_histogram_data SUBORDINATE 0
13 16 dc_histogram_data SUBORDINATE 1
14 32 kqlsubheap_object PARENT
15 19 dc_table_scns PARENT
15 19 dc_partition_scns SUBORDINATE 0
16 18 dc_outlines PARENT
17 14 dc_profiles PARENT
18 47 realm cache PARENT
18 47 realm auth SUBORDINATE 0
19 48 Command rule cache PARENT
20 49 Realm Object cache PARENT
20 49 Realm Subordinate Cache SUBORDINATE 0
21 46 Rule Set Cache PARENT
22 34 extensible security user and rol PARENT
23 35 extensible security principal pa PARENT
24 37 extensible security UID to princ PARENT
25 36 extensible security principal na PARENT
26 33 extensible security principal ne PARENT
27 38 XS security class privilege PARENT
28 39 extensible security midtier cach PARENT
29 44 event map PARENT
30 45 format PARENT
31 43 audit collector PARENT
32 15 global database name PARENT
33 20 rule_info PARENT
34 21 rule_or_piece PARENT
34 21 rule_fast_operators SUBORDINATE 0
35 23 dc_qmc_ldap_cache_entries PARENT
36 52 qmc_app_cache_entries PARENT
37 53 qmc_app_cache_entries PARENT
38 27 qmtmrcin_cache_entries PARENT
39 28 qmtmrctn_cache_entries PARENT
40 29 qmtmrcip_cache_entries PARENT
41 30 qmtmrctp_cache_entries PARENT
42 31 qmtmrciq_cache_entries PARENT
43 26 qmtmrctq_cache_entries PARENT
44 9 qmrc_cache_entries PARENT
45 50 qmemod_cache_entries PARENT
46 24 outstanding_alerts PARENT
47 22 dc_awr_control PARENT
48 25 SMO rowcache PARENT
49 40 sch_lj_objs PARENT
50 41 sch_lj_oids PARENT

60 rows selected.
The first child protects dc_rollback_segments. We can confirm it by referencing data in v$rowcache:
SQL> select parameter, gets
2 from v$rowcache
3 order by gets desc;

PARAMETER GETS
-------------------------------- ----------
dc_rollback_segments 310995555
dc_tablespaces 76251831
dc_segments 3912096
dc_users 2307601
dc_objects 1460725
dc_users 608659
dc_histogram_defs 250666
global database name 67475
dc_histogram_data 43098
dc_histogram_data 14364
dc_global_oids 14320
outstanding_alerts 2956
dc_profiles 2555
dc_awr_control 1925
dc_object_grants 745
dc_files 532
dc_constraints 201
sch_lj_oids 158
dc_sequences 156
dc_table_scns 20
sch_lj_objs 18
dc_qmc_ldap_cache_entries 0
qmc_app_cache_entries 0
qmc_app_cache_entries 0
qmtmrcin_cache_entries 0
qmtmrctn_cache_entries 0
qmtmrcip_cache_entries 0
qmtmrctp_cache_entries 0
qmtmrciq_cache_entries 0
qmtmrctq_cache_entries 0
qmrc_cache_entries 0
qmemod_cache_entries 0
SMO rowcache 0
dc_users 0
dc_partition_scns 0
dc_users 0
realm auth 0
Realm Subordinate Cache 0
rule_or_piece 0
rule_info 0
audit collector 0
format 0
event map 0
extensible security midtier cach 0
XS security class privilege 0
extensible security principal ne 0
extensible security principal na 0
extensible security UID to princ 0
extensible security principal pa 0
extensible security user and rol 0
Rule Set Cache 0
Realm Object cache 0
Command rule cache 0
realm cache 0
dc_outlines 0
kqlsubheap_object 0
dc_tablespace_quotas 0
dc_used_extents 0
rule_fast_operators 0
dc_free_extents 0

60 rows selected
The next step is to see whether latch miss source can give us some more hints regarding the issue:
SQL> select "WHERE", sleep_count, location
2 from v$latch_misses
3 where parent_name='row cache objects'
4 and sleep_count > 0;

WHERE SLEEP_COUNT LOCATION
------------------- ----------- ------------------------------
kqrpre: find obj 20612167 kqrpre: find obj
kqrpup 7 kqrpup
kqrcmt: while loop 1 kqrcmt: while loop
kqrcmt: clear flag 1 kqrcmt: clear flag
kqreqd 1026837 kqreqd
kqreqd: reget 2602576 kqreqd: reget

6 rows selected
Now if you take kqrpre: find obj and plug it into a search on My Oracle Support you'll quickly yield Bug 5749075 High Requests on dc_rollback_segments. Among other things, this note points out at the unusually high number of undo segments being created due to cleanup not able to work properly...
SQL> select count(*) from dba_rollback_segs;

COUNT(*)
----------
14838
...and this seems to be the case. The only difference is that the issue has been observed on 11GR2 and the bug has been filled against the older versions. Though it was still worth checking in case we were seeing a regression. Indeed, after getting rid of that many undo segments by simply recreating the undo tablespace, the issue, thought not completely vanished, manifested itself a lot less making it's impact relatively insignificant to the process throughput.

5 comments:

  1. Anonymous1:43 a.m.

    Great post but I dont get the part with :
    select distinct s.kqrstcln latch#,r.cache#,r.parameter name,r.type,r.subordinate#
    from v$rowcache r,x$kqrst s
    where r.cache#=s.kqrstcid
    order by 1,4,5;

    How do You conclude that its dc_rollback_segs related ?
    Regards
    Greg

    ReplyDelete
  2. You go by the child latch number. We know that CHILD# 1 is what attracting almost all of the contention. From this query you check what line corresponds to LATCH# 1 (which is a row cache objects child latch number) - that would be dc_rollback_segments.

    ReplyDelete
  3. Anonymous10:38 a.m.

    select latch#, child#, sleeps from v$latch_children where name='row cache objects' and sleeps>0 order by sleeps desc;

    LATCH# CHILD# SLEEPS
    ---------- ---------- ----------
    270 8 144401
    270 9 38280
    270 4 6582
    270 5 6536
    270 7 4476
    270 13 1984
    270 12 892
    270 17 149
    270 32 128
    270 1 15
    270 10 4


    select distinct s.kqrstcln latch#,r.cache#,r.parameter name,r.type,r.subordinate# from v$rowcache r,x$kqrst s where r.cache#=s.kqrstcid order by 1,4,5;

    LATCH# CACHE# NAME TYPE SUBORDINATE#
    ---------- ---------- -------------------------------- ----------- ------------
    1 3 dc_rollback_segments PARENT
    2 1 dc_free_extents PARENT
    3 4 dc_used_extents PARENT
    4 2 dc_segments PARENT
    5 0 dc_tablespaces PARENT
    6 5 dc_tablespace_quotas PARENT
    7 6 dc_files PARENT
    8 10 dc_users PARENT
    8 7 dc_users SUBORDINATE 0
    8 7 dc_users SUBORDINATE 1
    8 7 dc_users SUBORDINATE 2
    9 8 dc_objects PARENT
    9 8 dc_object_grants SUBORDINATE 0
    10 17 dc_global_oids PARENT
    11 12 dc_constraints PARENT

    in my case, is it 'dc_users' related? Thanks!

    ReplyDelete
  4. Rajorshi Kumar Sen1:49 p.m.

    Hi Alex

    We hit the same problem in our database , we are in 11.1.0.7. We found that 40% of the undo segments are offlined as well , as a part of the symptom:
    SQL>select count(*) USOn from dba_rollback_segs where status='ONLINE';
    USON
    ----------
    2789
    SQL> select count(*) USOff from dba_rollback_segs where status='OFFLINE';
    USON
    ----------
    2138
    Check the Metalink Note:240746.1 .
    It has a workaround of setting _rollback_segment_count hidden parameter.

    thanks
    ---Raj [rajorshisen@yahoo.co.in]

    ReplyDelete
  5. As a follow up there's a patch for this. There's a fixed number of latches - 128 - for rollback segments. There's a patch to increase that number but currently not available for linux.

    Bug 14226599 - Increase dc_rollback_segs hash buckets to reduce 'latch: row cache objects' waits [ID 14226599.8]

    ReplyDelete