It’s all About the Latch

Norfolk Door Latch, courtesy of Morris L. Hallowell IV – http://www.hallowellco.com/iron%20door%20and%20cabinet%20hardware.htm.

Recently a user posted a question to the IIUG Forums that I’ll paraphrase and explain here because I think that this is a subject that is generally not well understood:

“I have an application that runs multiple sessions that each insert staging rows into a driver table, update the rows, use that data to perform additional operations on other permanent tables, then delete the staging rows from the driver table. Each of the many sessions operates on a unique set of keys to the driver table rows and the table has row level locking and the sessions are running under dirty read isolation with LOCK MODE WAIT 30 set. Sometimes during the update or delete processing one or more sessions gets a -244 error (finderr: -244 Could not do a physical-order read to fetch next row) with the ISAM error indicating that the lock timeout expired.

If I add an index on the key the problem disappears. But given the LOCK MODE WAIT and dirty read isolation I would not expect locking to be a problem. What’s going on here?”

Here’s what’s happening:

Without the index every update and delete has to scan the entire table looking for matching rows to update/delete. Each row as it is examined has to be locked, dirty read or no dirty read, for a fraction of a second. As multiple scans zip through the table it is inevitable that one or more will encounter a row that’s locked. If multiple sessions have to delete multiple rows on the same page they will all have to latch the same page in the cache and so will queue up behind one another each holding a lock on a different row on that one page. They will single thread on the cache page latch and on the LRU latch needed to move the page from the clean part of the LRU queue to the dirty part or from the middle of the dirty queue to the most recently used end of the dirty queue. Those latches are exclusive even under dirty read isolation to maintain the integrity of the cache page and the LRU queues. All that time the N+1st session is waiting for a row lock on one of the rows in that frozen page to clear so it can read that row and continue its scan. That row lock cannot be satisfied, even under dirty read, because the cache page itself is latched. Eventually, sometimes, the last waiter times out before it gets access to the locks or latches that it needs.

The index almost eliminates the problem because that waiter who doesn’t need to update another row on the same cache page is able to go around the page locks because it is scanning the index instead which is not being locked and which this N+1st session does not have to acquire a lock for.