Slow Batch Processing
The operations group noticed that its overnight batch processing was taking significantly longer after an additional data feed program was added to the nightly run schedule. The decision was made to add more processors to the database server to speed things up. Unfortunately, this did not do the trick; the programs ran at about the same overall speed as before.
The v$system_event view can come in handy in situations like this for taking a quick glance at the overall system. By making a copy of the view's contents at two separate points in time, you can get a high level summary of what processes were waiting for during the time interval. Here is an extremely basic SQL*Plus script that displays the wait event activity for the instance over a 30 second time interval:
CREATE TABLE previous_events AS
SELECT SYSDATE timestamp, v$system_event.*
FROM   v$system_event;

EXECUTE dbms_lock.sleep (30);

SELECT A.event,
       A.total_waits - NVL (B.total_waits, 0) total_waits,
       A.time_waited - NVL (B.time_waited, 0) time_waited
FROM   v$system_event A, previous_events B
WHERE  B.event (+) = A.event
ORDER BY A.event;

Of course, you could use Statspack, utlbstat, or any number of other tools out there to capture this information as well.
Running the above script while the batch programs were running generated the following output:
EVENT                          TOTAL_WAITS TIME_WAITED
------------------------------ ----------- -----------
LGWR wait for redo copy                115          41
buffer busy waits                       53          26
control file parallel write             45          44
control file sequential read            25           0
db file scattered read                 932         107
db file sequential read              76089        6726
direct path read                       211          19
direct path write                      212          15
enqueue                                 37        5646
file identify                            2           0
file open                               37           0
free buffer waits                       11         711
latch free                              52          44
log buffer space                         2           8
log file parallel write               4388        1047
log file sequential read               153          91
log file single write                    2           6
log file switch completion               2          24
write complete waits                     6         517

The idle events have been removed for clarity. What stands out here is the enqueue wait event. During the 30 second time interval, processes collectively spent over 56 seconds waiting for enqueues, or locks that you can see in v$lock. The most common types of locks to show up in an enqueue wait are the table-level lock (TM) and the transaction enqueue or row-level lock (TX).
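If the contention is still in progress when you look, a quick check of v$lock shows who is holding those enqueues and who is queued behind them. A minimal sketch:

-- Sessions holding or requesting TM/TX enqueues; BLOCK = 1 marks a blocker
SELECT sid, type, id1, id2, lmode, request, block
FROM   v$lock
WHERE  type IN ('TM', 'TX')
ORDER BY id1, id2;

A session showing BLOCK = 1 is holding a lock that some other session has requested.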
The parameters for the enqueue wait event provide information about the type and location of the lock waited upon, but the information is fairly obscure. The first two bytes of P1RAW are the ASCII codes of the lock type, and the last two bytes are the requested lock mode. P2 and P3 correspond to the ID1 and ID2 columns in the v$lock view.
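For example, a sketch like the following decodes the lock type and requested mode from P1 for any session currently waiting on an enqueue (it assumes the wait shows up in v$session_wait at the moment you query it):

-- Decode the enqueue wait's P1 into lock type and requested mode
SELECT sid,
       chr (bitand (p1, 4278190080) / 16777216) ||
       chr (bitand (p1,   16711680) / 65536)     lock_type,
       bitand (p1, 65535)                        lock_mode,
       p2 id1,
       p3 id2
FROM   v$session_wait
WHERE  event = 'enqueue';

With the P1 value of 1415053318 that shows up later in the trace file, this decodes to TX in mode 6, matching the manual decoding below.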
In order to get a sense of where the locking contention was coming from, I used the oradebug facility in SQL*Plus to activate system event 10046 for one of the batch programs as it was running. Since tracing incurs a lot of overhead, this significantly slowed down batch processing. But this was only done once, and with the greater good of eliminating the contention problems in mind.
To activate event 10046, I looked at v$session and v$process to find the Oracle PID of the dedicated server process running one of the batch programs.
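A lookup along the following lines does the job; the username is just a placeholder for whatever account the batch programs connect under:

-- Find the Oracle PID (and OS PID) of a session's dedicated server process
SELECT p.pid oracle_pid, p.spid os_pid, s.sid, s.serial#, s.program
FROM   v$session s, v$process p
WHERE  p.addr = s.paddr
AND    s.username = 'BATCH_USER';   -- placeholder account name

With the Oracle PID in hand, I ran the following commands in SQL*Plus: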
SQL> oradebug setorapid 13
Unix process pid: 19751, image: oracle@dbserver.acme.com (TNS V1-V3)
SQL> oradebug session_event 10046 trace name context forever, level 8
Statement processed.
SQL>

This caused the dedicated server process to begin writing a trace file including wait event information. An excerpt from the trace file is as follows:
EXEC #5:c=0,e=0,p=3,cr=2,cu=1,mis=0,r=1,dep=1,og=4,tim=2313020980
XCTEND rlbk=0, rd_only=0
WAIT #1: nam='write complete waits' ela= 11 p1=3 p2=2 p3=0
WAIT #4: nam='db file sequential read' ela= 4 p1=10 p2=12815 p3=1
WAIT #4: nam='db file sequential read' ela= 1 p1=10 p2=12865 p3=1
WAIT #4: nam='db file sequential read' ela= 5 p1=3 p2=858 p3=1
=====================
PARSING IN CURSOR #4 len=65 dep=1 uid=502 oct=6 lid=502 tim=2313021001 hv=417623354 ad='55855844'
UPDATE CUSTOMER_CALLS SET ATTR_3 = :b1 WHERE CUSTOMER_CALL_ID=:b2
END OF STMT
EXEC #4:c=1,e=10,p=3,cr=2,cu=3,mis=0,r=1,dep=1,og=4,tim=2313021001
WAIT #5: nam='db file sequential read' ela= 0 p1=10 p2=5789 p3=1
WAIT #5: nam='enqueue' ela= 307 p1=1415053318 p2=196705 p3=6744
WAIT #5: nam='enqueue' ela= 307 p1=1415053318 p2=196705 p3=6744
WAIT #5: nam='enqueue' ela= 53 p1=1415053318 p2=196705 p3=6744
WAIT #5: nam='db file sequential read' ela= 0 p1=10 p2=586 p3=1
WAIT #5: nam='db file sequential read' ela= 1 p1=3 p2=858 p3=1
EXEC #5:c=0,e=668,p=3,cr=5,cu=3,mis=0,r=1,dep=1,og=4,tim=2313021669

We can see the enqueue wait events pretty clearly. Using the magical decoder ring on the P1 values, we can see that Oracle was waiting on TX locks requested in mode 6 (exclusive mode). (1415053318 is 54580006 in hexadecimal. The 54 corresponds to a "T", the 58 to an "X", and 0006 is exclusive mode. Look up the v$lock view in the Server Reference Guide for a list of lock types and modes.)
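If you would rather not do the hexadecimal conversion by hand, a TO_CHAR with a hexadecimal format mask does it in SQL:

-- Convert the decimal P1 value from the trace file to hexadecimal
SELECT TO_CHAR (1415053318, 'XXXXXXXX') p1_hex FROM dual;

This returns 54580006, the value decoded above.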
What is not clear from the excerpt above is what table and row the process was trying to lock when it had to wait. If we were looking at the database while the lock contention was happening, we could look at the v$locked_object view, or the ROW_WAIT_OBJ# and ROW_WAIT_ROW# columns of v$session. But since all I had to work with was the contents of the trace file, I could see that the waits occurred while processing cursor number five. Searching backwards through the trace file I found the exact SQL statement that corresponded to cursor number five at the time of the lock contention.
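For the record, a live lookup might have looked something like the sketch below, which joins the row wait columns of v$session to dba_objects to name the table being waited on (this only works while the wait is actually in progress):

-- Sessions currently waiting on a row, and the object that row belongs to
SELECT s.sid, o.owner, o.object_name,
       s.row_wait_file#, s.row_wait_block#, s.row_wait_row#
FROM   v$session s, dba_objects o
WHERE  o.object_id = s.row_wait_obj#
AND    s.row_wait_obj# <> -1;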
The statement found in the trace file, together with the lock type, turned out to be enough information to diagnose the contention problem and develop a workaround. By using autonomous transactions, we were able to modify the batch programs that were contending with each other so that they held locks on critical path rows for extremely short periods of time. Adjusting the schedules of the programs in the nightly run schedule also helped.
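To illustrate the pattern (the procedure name is made up; the table and columns come from the trace excerpt above), an update wrapped in an autonomous transaction commits and releases its row lock immediately instead of holding it until the caller's transaction finishes:

CREATE OR REPLACE PROCEDURE set_call_attr (p_call_id IN NUMBER, p_value IN VARCHAR2)
AS
   PRAGMA AUTONOMOUS_TRANSACTION;
BEGIN
   UPDATE customer_calls
   SET    attr_3 = p_value
   WHERE  customer_call_id = p_call_id;
   COMMIT;   -- releases the TX lock right away, independent of the caller
END;
/

The usual caveat applies: the autonomous update stays committed even if the calling transaction later rolls back, so this only fits work that is safe to commit on its own.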