Investigate Potential AuditThread Deadlock
At UCSD we saw an AuditThread that had been idle for 13 days, which suggests a deadlock somewhere in the AuditThread. They have audit blocking enabled, with a maximum block of 5 minutes. In my testing, once the block goes beyond 5 minutes a NullPointerException is thrown and the audit fails.
At the very least I think it might be time to rework how the blocking works now that we have better tools with Functions, Suppliers, etc. It might also be good to look into capping the number of retries instead of the total time spent waiting. That way, instead of checking against elapsed time, we only increment an attempt counter and sleep for the specified amount of time between attempts.
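A rough sketch of what the retry-count approach could look like, assuming the blocked call can be wrapped in a Supplier that returns null while the service is unavailable. The BoundedRetry class and its retry method are hypothetical names for illustration, not existing code:

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical helper for illustration; not part of the existing code base. */
final class BoundedRetry {

    private BoundedRetry() {}

    /**
     * Calls the supplier up to maxAttempts times, sleeping for pause between
     * failed attempts. A null result is treated as "not ready yet" (an assumption).
     * Returns an empty Optional when every attempt fails or the thread is
     * interrupted, so callers never receive a null to dereference.
     */
    static <T> Optional<T> retry(Supplier<T> action, int maxAttempts, Duration pause) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            T result = action.get();
            if (result != null) {
                return Optional.of(result);
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(pause.toMillis());
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and give up instead of blocking further.
                    Thread.currentThread().interrupt();
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }
}
```

The worst-case wait is then (maxAttempts - 1) * pause plus the time spent in the calls themselves, and the caller has to handle the empty case explicitly rather than running into a NullPointerException.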
Info so far
- It occurs during an audit, so the beginning and end are ok
- RequestThread log entries are seen continuously, so the batch has not yet been closed
- ValidationThread log entries are not seen, which makes it a good candidate for where the blocking is occurring
- This would put the idle point somewhere around:
validator.add(item.getFileDigest(), token);
- The TokenValidator itself seems sound, should be able to process without a lock
- TokenValidator::add does attempt to acquire a lock, which can block, but it should only be blocked while TokenValidator::processBatch is running
- TokenValidator::processBatch only seems to block when communicating with the IMS; I don't see the database calls being what is causing us to block indefinitely (though maybe it's possible)
- It's possible that the old IMSService::blockUntil had a bug in it, though it seemed to be sound
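To illustrate the interaction described above, here is a minimal sketch of an add that waits on the batch lock for a bounded time instead of indefinitely. BoundedLockSketch is a hypothetical stand-in and does not reflect the real TokenValidator internals; the lock field, the pending list, and the method signatures are all assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for the validator; names and fields are illustrative only. */
final class BoundedLockSketch {
    private final ReentrantLock batchLock = new ReentrantLock();
    private final List<String> pending = new ArrayList<>();

    /** Returns false instead of waiting forever if the lock is not free in time. */
    boolean add(String digest, long timeout, TimeUnit unit) {
        try {
            if (!batchLock.tryLock(timeout, unit)) {
                return false; // a batch still holds the lock; caller can log and retry
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        try {
            pending.add(digest);
            return true;
        } finally {
            batchLock.unlock();
        }
    }

    /** Holds the lock for the whole remote call, as a stalled batch would. */
    void processBatch(Runnable remoteCall) {
        batchLock.lock();
        try {
            remoteCall.run();
            pending.clear();
        } finally {
            batchLock.unlock();
        }
    }

    /** Number of digests queued since the last batch. */
    int pendingCount() {
        batchLock.lock();
        try {
            return pending.size();
        } finally {
            batchLock.unlock();
        }
    }
}
```

With something along these lines, a remote call that never returns would show up as repeated add timeouts in the log rather than as a silently idle thread.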