Identify the root cause of performance issues at a precise point in the past—even when the sessions have disconnected.
In my previous two articles on the topic of performance tuning (
bit.ly/SEzR6t and
bit.ly/U7U5IJ), you learned how to identify the cause of performance bottlenecks occurring in real time and view the summary of what happened in the past. Although these processes provide information to help resolve performance issues in many cases, there are times when you would like to know
precisely what happened at a specific point in the past. In this article, you will learn how to identify the precise sequence of events in the past in a historical—not current—session.
To run the examples in this article, execute the setup.sql
script. The script creates a user called ARUP and creates three tables: TEST1, TEST2, and TEST3. In the very rare event that a user called ARUP already exists, modify the script to create another user with a different name and replace all occurrences of ARUP with that name. The sample database created by setup.sql will take up less than 200 MB of space in your database.
After running the setup.sql script, open four SQL*Plus sessions, connected as ARUP. In three of these sessions, execute the test1.sql, test2.sql, and upd1.sql scripts, respectively. Here is an example of one session running upd1.sql on a UNIX-based system:
# sqlplus arup/arup
SQL> @upd1.sql
In the fourth session, execute upd1.sql again. This last session will hang, because the third session has already executed upd1.sql, which updates one row of the TEST1 table but does not commit and therefore keeps the row locked. The fourth session executes the same script, which attempts to update the same row. But the row is locked by the third session, so the fourth session waits until the lock is gone and the row is available for locking.
The test1.sql and test2.sql scripts are designed to consume significant amounts of CPU, so the first two sessions will be slow due to CPU contention. These sessions will run for a very long time.
After all the sessions have waited for about 10 minutes, stop the execution in sessions 1 and 2 by pressing Control-C to exit each SQL*Plus session. In session 3, enter
commit and press
Enter. This will release the lock on the row, and you will see session 4 show “1 row updated.” Exit SQL*Plus sessions 3 and 4. Now all the sessions connected as ARUP are closed—and history.
Past Sessions
In the setup, I showed you how to simulate performance issues in three of the four sessions. If these were typical business applications, the applications would have showed signs of slowdown, inviting the ire of the respective application owners and users and bringing unwanted attention to you, the DBA. Now that you have that attention, what should you do next?
In my previous two performance tuning articles, you learned how to find the clues to the sources of performance issues in sessions. Unfortunately, those techniques will be of no help in this case. The V$SESSION view shows the reason for a performance issue in a session, but the session that caused the issue in this article is no longer present. The other important view, V$SESSION_EVENT, shows all the wait events waited for by the session, but, as with the V$SESSION view, it will show the data only if the session is still active in the database. The view V$SESSTAT shows the resource usage for a session and can offer clues to which sessions have consumed what amount of a resource such as redo or memory. However, because all the sessions that experienced performance issues are historical, looking into these views will not help resolve the issues. You need to determine the specific events that occurred at a point in time in historical sessions. If you had enabled tracing, the trace files would have shown the events, but you did not expect these issues in advance and didn’t enable tracing, because tracing would have degraded performance. So how do you
now identify the cause of the performance issues suffered by these old sessions?
Active Session History
Fortunately, looking at performance issues in old sessions is easy with an Oracle Database feature called Active Session History. Note that the use of Active Session History requires Oracle Diagnostics Pack, a licensed option of Oracle Database available since Oracle Database 10
g Release 1.
Every second, Active Session History polls the database to identify the active sessions and dumps relevant information about each of them—such as the user ID, state, the machine it is connected from, and the SQL it is executing—into a special area in the system global area (SGA) of the database instance called the ASH buffer. So even though a session is no longer present in the database instance, the ASH buffer has captured its information. In addition, because Active Session History records activities every second, it can show a second-by-second snapshot of the activities in a session. In other words, Active Session History can show a movie of the activities of the session instead of a single picture. (Note that when the ASH buffer is filled, the data is written to disk and the snapshots are taken every 10 seconds rather than every second.)
You can examine the contents of the ASH buffer in a view named V$ACTIVE_SESSION_HISTORY. Here are a few of the important columns of the V$ACTIVE_SESSION_HISTORY view:
SAMPLE_ID. The unique identifier of the Active Session History record.
SAMPLE_TIME. When Active Session History captured this data on all active sessions.
USER_ID. The numerical user ID (not the username) of the database user who created this session.
SESSION_ID. The session ID (SID) of the session.
SESSION_STATE. The state the session was in when Active Session History took the sample. It shows WAITING if the session was waiting for something; otherwise, it shows ON CPU to indicate that the session was doing productive work.
EVENT. If the session was in a WAITING state (in the SESSION_STATE column), this column will show the wait event the session was waiting for.
TIME_WAITED. If the session was in a WAITING state, this column will show how long it had been waiting when Active Session History took the sample.
WAIT_TIME. If the session is doing productive work—not in a WAITING state—this column will show how long the session waited for the last wait event.
SQL_ID. The ID of the SQL statement the session was executing at the time the sample was taken.
SQL_CHILD_NUMBER. The child number of the cursor. If this is the only version of the cursor, the child number will be 0.
Knowing the meaning of the V$ACTIVE_SESSION_HISTORY columns, you can identify what it was that historical sessions were waiting for. To begin the identification, you need to pose two questions to the application owners or users executing the SQL statements that experienced slow performance:
- Which username was used to connect to the database?
- What was the time interval (start and end times) of the period when the performance issues occurred?
Because you ran the setup script as the user ARUP, the answer to the first question is ARUP. Next, you need to find out the USER_ID of the ARUP user by issuing the following SQL:
select user_id
from dba_users
where username = 'ARUP';
USER_ID
—————————————
92
Now suppose the user told you that the performance issues occurred between 4:55 p.m. and 5:05 p.m. on September 29. With this information, you can query the V$ACTIVE_SESSION_HISTORY view to find out the activities of the ARUP sessions (with USER_ID 92) within that period, as shown in
Listing 1. (The output has been truncated to fit the space available.) Because Active Session History collects information on all active sessions, you need to order the output by SID, which identifies a session (shown under SESSION_ID), and then by the collection time (shown under SAMPLE_TIME).
Let’s examine the first row of the output. It shows that the session identified by SESSION_ID 39 was waiting for the “enq: TX - row lock contention” event on 29-SEP-12 at 04.55.02.379 PM. Because the session was in a WAITING state, the value of the WAIT_TIME column is irrelevant, so it shows up as 0. Because the session was still in a WAITING state when Active Session History captured the state, the TIME_WAITED column shows 0. When the session finally got the lock, it could do what it had to do and stopped waiting. At that point, the total time the session had waited was updated in Active Session History, shown in the first boldface line in the
Listing 1 output (sample time 29-SEP-12 05.16.52.078): 1,310,761,160 microseconds (shown in the TIME_WAITED column), or about 22 minutes. This is such an important property of Active Session History that I repeat: Seeing 0 in the WAIT_TIME column does not mean that the session didn’t wait at all. It simply means that the session was waiting for some event for more than one second because the previous WAIT_TIME and TIME_WAITED column values showed 0. You should look at the last occurrence of the wait event (the EVENT column value) for that session in Active Session History to determine what the total wait time really was.
When you explain to your user that the cause of the delay was an unavailable lock during the period 04.55.02.379 PM to 05.16.52.078 PM, that person might ask you what SQL statement the session was executing at that time. That’s extremely easy to get from the
Listing 1 output: the session with SESSION_ID 39 was executing the SQL statement with SQL_ID fx60htyzmz6wv and child number (CH#) 0.
You can get that SQL statement text with this query:
select SQL_TEXT
from v$sql
where sql_id = 'fx60htyzmz6wv';
SQL_TEXT
————————————————————————————————————————————————
update test1 set status = 'D' where object_id = :b1
The SQL statement includes an UPDATE that had to lock the row. Because the row had already been locked by another session, it was not possible for this session (SESSION_ID 39) to succeed and it therefore had to wait. The next logical questions from the user would perhaps be which specific row on which table the SQL statement and the session were waiting for and which session had locked the row. These are also very easy to find with the query shown in
Listing 2. The BLOCKING_SESSION column shows the session that was holding the lock: the session with SESSION_ID 43. The
Listing 2 output also shows the object ID (CURRENT_OBJ#) of the table whose row was locked and other information necessary to get the row information. With the data in the
Listing 2 output, you can get the ROWID of the locked row by using the query shown in Listing 3. The row with ROWID AAAdvSAAHAAABGPAAw had been locked by the session with SESSION_ID 43 and was being requested to be updated by the session with SESSION_ID 39. You now have the information on why the session (with SESSION_ID 39) was slow—it was waiting for a lock for 22 minutes—what SQL statement it was executing, and what specific row it was looking to lock. You have successfully uncovered the root cause of the performance issue in the session with SESSION_ID 39.
Code Listing 3: Getting specific row information
select
owner||'.'||object_name||':'||nvl(subobject_name,'-') obj_name,
dbms_rowid.rowid_create (
1,
o.data_object_id,
row_wait_file#,
row_wait_block#,
row_wait_row#
) row_id
from v$session s, dba_objects o
where sid = &sid
and o.data_object_id = s.row_wait_obj#
OBJ_NAME ROW_ID
————————————— ———————————————————
ARUP.TEST1:- AAAdvSAAHAAABGPAAw
Resource Contention
After uncovering the root cause of the slowness in session 39, you now turn your attention to session 44. Revisiting the output in
Listing 1, you can see that session 44 switched between waiting (shown under SESSION_STATE as WAITING) and doing productive work (shown as ON CPU). Note the very first line in the output for the session with SESSION_ID 44 and SAMPLE_TIME 29-SEP-12 04.55.34.419 PM. The SESSION_STATE column shows WAITING, which means that the session was waiting at that time. The EVENT and TIME_WAITED columns show “resmgr:cpu quantum” and “109984,” respectively. This means that the session had already waited for 109,984 microseconds, or about 0.11 seconds, at that time for the “resmgr:cpu quantum” event. The next line in
Listing 1, sampled about a second later, shows the SESSION_STATE column value as ON CPU, which means that the session was doing productive work at that time—not waiting. You need to know why the session was intermittently waiting for this wait event and therefore slowing down.
The “resmgr:cpu quantum” event is due to Oracle Database’s Database Resource Management feature. Database Resource Management acts as a resource governor: it limits CPU consumption of individual sessions when the total CPU demand from all sessions goes up to more than 100 percent, and it enables more-important sessions to get the CPU they need. Because the output shows the session waiting, you can conclude that the CPU consumption by the session with SESSION_ID 44 was high enough at that time for Database Resource Management to limit its CPU usage. Well, the application owner counters, this application is very important and the resource usage should not have been constrained.
In that case, you may suspect that the session was under a consumer group that has a more restrictive CPU allocation than expected. So your next stop is to find out what consumer group was active for the session at that time—
not now. Fortunately, the process for finding this information is straightforward. Active Session History records the consumer group that was active for a session at the time of sampling and displays that information in the CONSUMER_GROUP_ID column in the V$ACTIVE_SESSION_HISTORY view. You can see that information by using the query shown in Listing 4.
Code Listing 4: Listing consumer groups
select sample_time, session_state, event, consumer_group_id
from v$active_session_history
where user_id = 92
and sample_time between
to_date('29-SEP-12 04.55.02 PM','dd-MON-yy hh:mi:ss PM')
and
to_date('29-SEP-12 05.05.02 PM','dd-MON-yy hh:mi:ss PM')
and session_id = 44
order by 1;
SESSION
SAMPLE_TIME _STATE EVENT CONSUMER_GROUP_ID
————————————————————————— ——————— —————————————————— —————————————————
29-SEP-12 04.55.34.419 PM WAITING resmgr:cpu quantum 12166
29-SEP-12 04.55.35.419 PM ON CPU 12166
29-SEP-12 04.55.36.419 PM WAITING resmgr:cpu quantum 12166
29-SEP-12 04.55.37.419 PM ON CPU 12166
29-SEP-12 04.55.38.419 PM WAITING resmgr:cpu quantum 12166
29-SEP-12 04.55.39.419 PM WAITING resmgr:cpu quantum 12166
29-SEP-12 04.55.40.419 PM ON CPU 12166
… output truncated …
29-SEP-12 04.55.37.419 PM ON CPU 12162
29-SEP-12 04.55.38.419 PM ON CPU 12166
29-SEP-12 04.55.39.419 PM ON CPU 12162
29-SEP-12 04.55.40.419 PM ON CPU 12162
Because the session may have been under different consumer groups, it’s wise to select the consumer groups for all of the sampled data in Active Session History for that session, rather than just one sample. In this case, the session was under consumer group 12166 until it changed to 12162 on 29-SEP-12 at 04.55.37.419 PM. To find out the names of the consumer groups, use the following query:
select name
from v$rsrc_consumer_group
where id in (12166,12162);
ID NAME
—————— ————————————
12166 OTHER_GROUPS
12162 APP_GROUP
From the output of Listing 4, you can see that from 04.55.34.419 PM, the session was under consumer_group_id 12166, which is OTHER_GROUPS (as shown in the query against v$rsrc_consumer_group above). The CONSUMER_GROUP_ID changed from 12166 to 12162 at 04.55.37.419 PM. This could have happened due to one of the three most likely reasons: a DBA manually activated a different resource plan, a different plan was activated automatically by a scheduling mechanism, or a DBA changed the consumer group of the session from OTHER_GROUPS to APP_GROUP (also shown in the query against v$rsrc_consumer_group above) online. Whatever the reason, with the APP_GROUP consumer group in effect, the session was doing more-productive work (as shown by the SESSION_STATE value ON CPU) than waiting with the “resmgr:cpu quantum” event. This observation may lead to many conclusions, but the most obvious is perhaps that the APP_GROUP consumer group is less restrictive in terms of CPU allocation than OTHER_GROUPS. In that case, you should examine why the OTHER_GROUPS consumer group was activated earlier and, perhaps more importantly, whether this restriction was necessary or just a mistake. In any case, you have now found the root cause of the wait.
The next obvious question is why session 44 consumed so much CPU that it had to be constrained by Database Resource Management. The answer lies in the SQL statement that session 44 was executing at that time (not now). The SQL_ID column in the output of
Listing 1 was fngb4y81xr57x. You can get the text of that SQL statement with the following query:
SQL> select SQL_TEXT from v$sql
where sql_id = 'fngb4y81xr57x';
SQL_TEXT
———————————————————————————————————
SELECT MAX(TEST1.OWNER) FROM TEST1,
TEST2, TEST2, TEST2, TEST2, TEST2,
TEST2, TEST2, TEST2, TEST2, TEST2,
TEST2, TEST2, TEST2, TEST2, TEST2,
TEST2, TEST2, TEST2, TEST2, TEST2,
TEST2, TEST2, TEST2, TEST2
You can immediately see from the output that the query was performing a Cartesian join (joining
all rows of a table with all rows of another
without a join condition) on a TEST2 table several times. Even if the TEST2 table has a relatively small number of rows, the Cartesian join will produce a lot of logical I/Os. Because logical I/O requires CPU cycles to execute, it’s hardly surprising that the session consumed so much CPU that Database Resource Management limited it. To resolve this issue, you will need to rewrite the query to eliminate or reduce Cartesian joins.
More Uses
In the previous sections, you saw how to find issues that occurred at a specific point in the past in Active Session History. Hopefully, the content gave you an idea of how powerful Active Session History is and how it can be used in many circumstances. Here is another example of the power and usage of Active Session History: suppose a user complains that things seemed to have been slow from a specific client machine—prolaps01—between 4:55 p.m. and 5:05 p.m. on September 29. Because Active Session History also records the machine name, you can use the query shown in Listing 5 to display all the different events experienced by all sessions from the prolaps01 machine and how often each event occurred. You can see from the Listing 5 output that during that time, the sessions connected from the prolaps01 client machine experienced locking and Resource Manager–related waits many times. With this information, you can dig further inside the V$ACTIVE_SESSION_HISTORY view to identify specific sessions and what they were doing earlier to have experienced these waits.
Code Listing 5: Checking all events from a machine
select event, count(1)
from v$active_session_history
where machine = 'prolaps01'
and sample_time between
to_date('29-SEP-12 04.55.00 PM','dd-MON-yy hh:mi:ss PM')
and
to_date('29-SEP-12 05.05.00 PM','dd-MON-yy hh:mi:ss PM')
group by event
order by event;
EVENT COUNT(1)
———————————————————————————— ————————
… output truncated …
db file scattered read 93
db file parallel write 127
log file parallel write 134
db file sequential read 293
control file parallel write 612
control file sequential read 948
enq: TX - row lock contention 1309
resmgr:cpu quantum 1371
Although pulling specific data from Active Session History is great for spot analysis, sometimes you may want to take a look at the entire collection of Active Session History data within a time frame. Active Session History reports for a specific time period are great for that. You can generate an Active Session History report from Oracle Enterprise Manager or from the command line. For the latter, connect to the database as a DBA user and execute the following script at the SQL*Plus prompt: @$ORACLE_HOME/rdbms/admin/ashrpt.sql.
You can find more information on this procedure in “
NEXT STEPS.”
Active Session History Archive
Active Session History collects information on active sessions from the database instance every second. Depending on the database activity, that will lead to a lot of data collected inside the ASH buffer, but because the ASH buffer is a memory-resident structure, it has only a finite amount of space. In addition, when the instance goes down, the instance’s memory vanishes with it. Therefore, Oracle Database archives the information from the ASH buffer to a database table to make it persistent. This archived table data is visible in a view called DBA_HIST_ACTIVE_SESS_HISTORY. If you don’t find the data in the V$ACTIVE_SESSION_HISTORY view, check for it in the DBA_HIST_ACTIVE_SESS_HISTORY view, as shown in
Listing 6. The output shows that the session was experiencing row lock waits—you can get the lock and row information from the DBA_HIST_ACTIVE_SESS_HISTORY view by using the query shown in Listing 7.
Code Listing 7: Getting row lock information from the Active Session History archive
select sample_time, session_state, blocking_session,
owner||'.'||object_name||':'||nvl(subobject_name,'-') obj_name,
dbms_ROWID.ROWID_create (
1,
o.data_object_id,
current_file#,
current_block#,
current_row#
) row_id
from dba_hist_active_sess_history s, dba_objects o
where user_id = 92
and sample_time between
to_date('29-SEP-12 04.55.02 PM','dd-MON-yy hh:mi:ss PM')
and
to_date('29-SEP-12 05.05.02 PM','dd-MON-yy hh:mi:ss PM')
and event = 'enq: TX - row lock contention'
and o.data_object_id = s.current_obj#
order by 1,2;
Conclusion
Active Session History is a very powerful facility inside Oracle Database that records the information on all active sessions in the database instance in a buffer every second and exposes that data to you in a view called V$ACTIVE_SESSION_HISTORY and subsequently to a persistent table visible in the DBA_HIST_ACTIVE_SESS_HISTORY view. Active Session History offers a second-by-second record of the activities inside the session—even when that session has disconnected or ceased that activity—enabling you to go back in time and identify what was ailing a specific session at a certain point in the past.
Setup.sql
REM This to for setting up the test users, tables, etc.
REM
REM Create the user. If this user exists, use a different name
REM
create user arup identified by arup
/
grant create session, unlimited tablespace, create table to arup
/
connect arup/arup
REM
REM create all test tables
REM
drop table test1
/
drop table test2
/
drop table test3
/
create table test1
as
select * from all_objects
/
create table test2
as
select * from test1
/
create table test3
as
select rownum col1, created col2
from test2
/
Script: test1.sql
declare
l_dummy_1 varchar2(2000);
l_dummy_2 varchar2(2000);
l_stmt varchar2(2000);
begin
for i in 1..71540 loop
l_stmt :=
'select to_char(col2,''mm/dd/yyyy hh24:mi:ss'')'||
' from test3'||
' where col1 = to_char('||i||')';
dbms_output.put_line('l_stmt='||l_stmt);
execute immediate l_stmt into l_dummy_1;
l_stmt :=
'select col1 '||
'from test3 '||
'where col2 = to_date('''||
l_dummy_1||
''',''mm/dd/yyyy hh24:mi:ss'')'||
' and col1 = '''||to_char(i)||'''';
dbms_output.put_line('l_stmt='||l_stmt);
execute immediate l_stmt into l_dummy_2;
end loop;
end;
/
Script: test2.sql
declare
l_dummy varchar2(200);
begin
select max(test1.owner)
into l_dummy
from test1, test2, test2, test2, test2,
test2, test2, test2, test2,
test2, test2, test2, test2,
test2, test2, test2, test2,
test2, test2, test2, test2,
test2, test2, test2, test2;
dbms_lock.sleep(120);
end;
/
Script: upd1.sql
update test1 set status = 'D' where object_id = 2
/
|