Saturday, 9 July 2016

Slow Batch Processing

The operations group noticed that its overnight batch processing was taking significantly longer after an additional data feed program was added to the nightly run schedule. The decision was made to add more processors to the database server to speed things up. Unfortunately, this did not do the trick; the programs ran at about the same overall speed as before.
The v$system_event view comes in handy in situations like this for taking a quick look at the overall system. By capturing the view's contents at two separate points in time, you get a high-level summary of what processes were waiting for during the interval. Here is an extremely basic SQL*Plus script that displays the wait event activity for the instance over a 30 second time interval:
        CREATE TABLE previous_events
        AS
        SELECT SYSDATE timestamp, v$system_event.*
        FROM   v$system_event;

        EXECUTE dbms_lock.sleep (30);

        SELECT   A.event, 
                 A.total_waits - NVL (B.total_waits, 0) total_waits,
                 A.time_waited - NVL (B.time_waited, 0) time_waited
        FROM     v$system_event A, previous_events B
        WHERE    B.event (+) = A.event
        ORDER BY A.event;
Of course, you could use Statspack, utlbstat, or any number of other tools out there to capture this information as well.
Running this script while the batch programs were active generated the following output:
        EVENT                          TOTAL_WAITS TIME_WAITED
        ------------------------------ ----------- -----------
        LGWR wait for redo copy                115          41
        buffer busy waits                       53          26
        control file parallel write             45          44
        control file sequential read            25           0
        db file scattered read                 932         107
        db file sequential read              76089        6726
        direct path read                       211          19
        direct path write                      212          15
        enqueue                                 37        5646
        file identify                            2           0
        file open                               37           0
        free buffer waits                       11         711
        latch free                              52          44
        log buffer space                         2           8
        log file parallel write               4388        1047
        log file sequential read               153          91
        log file single write                    2           6
        log file switch completion               2          24
        write complete waits                     6         517
The idle events have been removed for clarity. What stands out here is the enqueue wait event. TIME_WAITED is reported in centiseconds, so during the 30 second interval processes collectively spent over 56 seconds waiting for enqueues, which are the locks you can see in v$lock. The most common lock types to show up in an enqueue wait are the table-level lock (TM) and the transaction enqueue, or row-level lock (TX).
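While the contention is actually happening, v$lock shows who is holding and who is waiting. A minimal sketch along these lines lists the TM and TX locks in the instance; rows with BLOCK = 1 are the sessions blocking others, and rows with a non-zero REQUEST are the waiters:
        SELECT   sid, type, id1, id2, lmode, request, block
        FROM     v$lock
        WHERE    type IN ('TM', 'TX')
        ORDER BY id1, id2, request;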
The parameters for the enqueue wait event provide information about the type and location of the lock waited upon, but the information is fairly obscure. The first two bytes of P1RAW are the ASCII codes of the lock type, and the last two bytes are the requested lock mode. P2 and P3 correspond to the ID1 and ID2 columns in the v$lock view.
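If you catch the wait while it is still in progress, the same decoding can be done in SQL against v$session_wait. A rough sketch (the bit arithmetic simply extracts the two ASCII bytes and the requested mode from P1):
        SELECT sid,
               CHR (BITAND (p1, -16777216) / 16777216) ||
               CHR (BITAND (p1, 16711680) / 65536)     lock_type,
               BITAND (p1, 65535)                      requested_mode,
               p2                                      id1,
               p3                                      id2
        FROM   v$session_wait
        WHERE  event = 'enqueue';
The ID1 and ID2 values can then be matched against the corresponding columns of v$lock to find the session holding the lock.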
In order to get a sense of where the locking contention was coming from, I used the oradebug facility in SQL*Plus to activate system event 10046 for one of the batch programs as it was running. Tracing incurs a lot of overhead and significantly slowed down batch processing, but it was only done once, with the greater good of eliminating the contention problem in mind.
To activate event 10046, I looked at v$session and v$process to find the Oracle PID of the dedicated server process running one of the batch programs. Then I ran the following commands in SQL*Plus:
        SQL> oradebug setorapid 13
        Unix process pid: 19751, image: oracle@dbserver.acme.com (TNS V1-V3)
        SQL> oradebug session_event 10046 trace name context forever, level 8
        Statement processed.
        SQL>
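For reference, the Oracle PID lookup mentioned above can be done with a query roughly like this one; the username filter is just an illustration, so substitute whatever identifies the batch session on your system:
        SELECT   s.sid, s.serial#, s.program, p.pid, p.spid
        FROM     v$session s, v$process p
        WHERE    p.addr = s.paddr
        AND      s.username = 'BATCH_USER';
The PID column is what oradebug setorapid expects; SPID is the operating system process id shown in the oradebug output.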
The oradebug command caused the dedicated server process to begin writing a trace file that includes wait event information. An excerpt from the trace file follows:
        EXEC #5:c=0,e=0,p=3,cr=2,cu=1,mis=0,r=1,dep=1,og=4,tim=2313020980
        XCTEND rlbk=0, rd_only=0
        WAIT #1: nam='write complete waits' ela= 11 p1=3 p2=2 p3=0
        WAIT #4: nam='db file sequential read' ela= 4 p1=10 p2=12815 p3=1
        WAIT #4: nam='db file sequential read' ela= 1 p1=10 p2=12865 p3=1
        WAIT #4: nam='db file sequential read' ela= 5 p1=3 p2=858 p3=1
        =====================
        PARSING IN CURSOR #4 len=65 dep=1 uid=502 oct=6 lid=502 
        tim=2313021001 hv=417623354 ad='55855844'
        UPDATE CUSTOMER_CALLS SET ATTR_3 = :b1 WHERE CUSTOMER_CALL_ID=:b2
        END OF STMT
        EXEC #4:c=1,e=10,p=3,cr=2,cu=3,mis=0,r=1,dep=1,og=4,tim=2313021001
        WAIT #5: nam='db file sequential read' ela= 0 p1=10 p2=5789 p3=1
        WAIT #5: nam='enqueue' ela= 307 p1=1415053318 p2=196705 p3=6744
        WAIT #5: nam='enqueue' ela= 307 p1=1415053318 p2=196705 p3=6744
        WAIT #5: nam='enqueue' ela= 53 p1=1415053318 p2=196705 p3=6744
        WAIT #5: nam='db file sequential read' ela= 0 p1=10 p2=586 p3=1
        WAIT #5: nam='db file sequential read' ela= 1 p1=3 p2=858 p3=1
        EXEC #5:c=0,e=668,p=3,cr=5,cu=3,mis=0,r=1,dep=1,og=4,tim=2313021669
We can see the enqueue wait events pretty clearly. Using the magical decoder ring on the P1 values, we can see that Oracle was waiting on TX locks requested in mode 6 (exclusive mode). (1415053318 is 54580006 in hexadecimal. The 54 corresponds to a "T", the 58 to an "X", and 0006 is exclusive mode. Look up the v$lock view in the Server Reference Guide for a list of lock types and modes.)
What is not clear from the excerpt above is which table and row the process was trying to lock when it had to wait. If we had been looking at the database while the lock contention was happening, we could have looked at the v$locked_object view, or the ROW_WAIT_OBJ# and ROW_WAIT_ROW# columns of v$session. But all I had to work with was the contents of the trace file, which showed that the waits occurred while processing cursor number five. Searching backwards through the trace file, I found the exact SQL statement that corresponded to cursor number five at the time of the lock contention.
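Had the contention been caught live, a query roughly like this one (a sketch assuming the SID of the waiting session is known) would have pointed straight at the object, with the row wait columns pinpointing the row:
        SELECT s.sid, o.owner, o.object_name,
               s.row_wait_file#, s.row_wait_block#, s.row_wait_row#
        FROM   v$session s, dba_objects o
        WHERE  o.object_id = s.row_wait_obj#
        AND    s.sid = :waiting_sid;
If you need to fetch the row itself, DBMS_ROWID.ROWID_CREATE can assemble a ROWID from the object number and the row wait columns.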
The statement behind cursor number five turned out to be enough information to diagnose the contention problem and develop a workaround. By using autonomous transactions, we were able to modify the batch programs that were contending with each other so that they held locks on critical path rows for extremely short periods of time. Adjusting where the programs fell in the nightly run schedule also helped.
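The general shape of the change was to wrap the update of the contended row in an autonomous transaction so that the row-level lock is released immediately instead of being held until the main transaction commits. A minimal sketch, with the table and column names borrowed from the trace excerpt purely for illustration and the datatypes assumed:
        CREATE OR REPLACE PROCEDURE update_call_attr
            (p_call_id IN NUMBER, p_value IN VARCHAR2)
        AS
            PRAGMA AUTONOMOUS_TRANSACTION;
        BEGIN
            -- The lock on the critical path row is taken and released here,
            -- independently of the caller's (possibly long-running) transaction.
            UPDATE customer_calls
            SET    attr_3 = p_value
            WHERE  customer_call_id = p_call_id;
            COMMIT;
        END update_call_attr;
        /
Because an autonomous transaction commits independently of its caller, this approach only makes sense where the update does not need to roll back along with the rest of the batch run.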
