In a previous post, I discussed the DeveloperWorks Process Server Endurance article. In that article, they ran into a situation where the maxMessages limit (100 by default) was breached. At my client site, we had a situation where messages in an Event Sequenced queue would remain locked forever, causing the maxMessages limit to be breached.
Strangely, it only occurred in our production cluster. We worked with IBM Support, who found that this is a known issue that was resolved in WPS 18.104.22.168. Our cluster’s messaging engines are configured as active-passive, which allows for fail-over processing.
Background explanation of Event Sequencing
By default, messages are processed in the order that they arrive in a queue. While this is a true statement, you have to remember that Message Driven Beans (MDBs) by default allow 10 instances to receive messages at once. So while messages get picked up in order, they may be processed in parallel. A problem arises when one message depends on the processing of a message earlier in the queue. Think of a typical database insert-or-update choice. If the row doesn’t exist, I want to insert it. If the row exists, I want to update it with the new info.
In a single-threaded scenario, the first message is processed and creates the row. The second message arrives and updates the row. In a multi-threaded scenario, both messages are picked up at the same time and, non-deterministically, both may end up deciding they need to insert the row, resulting in duplicate rows.
Event sequencing allows us to say “Hey, message1 and message2 are actually related. They have an id that means they work on the same data. Process server, if you see a message with an id, please ensure that you don’t process the next message with the same id until the first message has completed processing”. This results in a system where messages with the same id are processed sequentially while messages with unique ids are processed in parallel.
Background explanation of Event Sequencing’s Implementation in Process Server
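To make the duplicate-row case concrete, here is a minimal sketch that replays the two orderings by hand instead of using real threads. The `Table` class, the id `42`, and the method names are made up for illustration; the "table" is just a list with no unique constraint.

```java
import java.util.ArrayList;
import java.util.List;

public class UpsertRace {
    // Hypothetical stand-in for a table with no unique constraint on the id column.
    static final class Table {
        final List<String> rows = new ArrayList<>();
        boolean exists(String id) { return rows.stream().anyMatch(r -> r.startsWith(id + "=")); }
        void insert(String id, String value) { rows.add(id + "=" + value); }
        void update(String id, String value) {
            rows.replaceAll(r -> r.startsWith(id + "=") ? id + "=" + value : r);
        }
    }

    // The "write" half of insert-or-update, acting on an earlier existence check.
    static void write(Table t, boolean existed, String id, String value) {
        if (existed) t.update(id, value); else t.insert(id, value);
    }

    // One thread: each message checks and writes before the next one starts.
    static int sequential() {
        Table t = new Table();
        write(t, t.exists("42"), "42", "first");
        write(t, t.exists("42"), "42", "second");
        return t.rows.size();
    }

    // Two threads, worst-case interleaving: both check before either writes.
    static int interleaved() {
        Table t = new Table();
        boolean seenByMsg1 = t.exists("42");
        boolean seenByMsg2 = t.exists("42");
        write(t, seenByMsg1, "42", "first");
        write(t, seenByMsg2, "42", "second");
        return t.rows.size();
    }

    public static void main(String[] args) {
        System.out.println("sequential rows:  " + sequential());   // 1
        System.out.println("interleaved rows: " + interleaved());  // 2
    }
}
```

With real MDB instances the bad interleaving only happens some of the time, which is what makes it hard to spot in testing.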
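The idea can be sketched outside of Process Server in a few lines of plain Java. This is not how WPS implements it (that is covered next); it is only an illustration of "same id runs in order, different ids run in parallel". The class name `KeyedSequencer` and the keys are my own, and the sketch chains each new task onto the previous task for the same key.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class KeyedSequencer {
    // Last submitted task per key; new work for a key is chained behind it.
    private final Map<String, CompletableFuture<Void>> tails = new ConcurrentHashMap<>();
    private final ExecutorService pool;

    public KeyedSequencer(int threads) { pool = Executors.newFixedThreadPool(threads); }

    public CompletableFuture<Void> submit(String key, Runnable work) {
        // compute() is atomic per key, so two submitters cannot both become the tail.
        return tails.compute(key, (k, tail) ->
            tail == null ? CompletableFuture.runAsync(work, pool)
                         : tail.thenRunAsync(work, pool));
    }

    public void shutdown() { pool.shutdown(); }

    public static void main(String[] args) {
        KeyedSequencer seq = new KeyedSequencer(10);
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        List<CompletableFuture<Void>> done = new ArrayList<>();
        done.add(seq.submit("row-1", () -> log.add("row-1: insert")));
        done.add(seq.submit("row-2", () -> log.add("row-2: insert")));
        done.add(seq.submit("row-1", () -> log.add("row-1: update")));
        done.forEach(CompletableFuture::join);
        seq.shutdown();
        // "row-1: insert" always precedes "row-1: update"; row-2 may land anywhere.
        System.out.println(log);
    }
}
```

One side effect of chaining this way: if a task throws, later tasks for that key never run. The sketch also never cleans up the map, which a real implementation would need to do.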
Event Sequencing has its own database inside of the WPSDB that is created via the WPS install procedures. (FYI, Event Sequencing and any tables or tools that it uses are normally prefixed with ‘es’ somewhere. ‘esAdmin’ allows you to interact with the event sequencing runtime). In this table, event sequencing stores the UID of the message on the queue, the value of the user-defined key, and various module-related information about where the message is supposed to go.
When a message arrives on a queue with event sequencing turned on, the first thing that happens is that the sequencing MDB picks up the message and queries its table to determine if the key is currently ‘locked’ (in use). If it’s in use, the message remains locked on the queue, waiting for the lock to be released. If it’s not in use, the key becomes locked and Event Sequencing passes the message off to the module/component, which begins its normal processing. When the invocation is complete, event sequencing removes the lock from the database and the next message with that key is processed.
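The flow above can be sketched as a small single-threaded simulation. Everything here is a stand-in: `lockTable` plays the role of the event sequencing table, `waiting` plays the role of messages left locked on the queue, and the message UIDs and keys are invented. It is meant to show the order of the steps, not the real WPS code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SequencingFlow {
    // Stand-in for the event sequencing table: user-defined key -> UID holding the lock.
    private final Map<String, String> lockTable = new HashMap<>();
    // Messages left waiting on the queue because their key is locked.
    private final Map<String, Deque<String>> waiting = new HashMap<>();
    // Order in which messages were handed to the target component.
    final List<String> dispatched = new ArrayList<>();

    // A message arrives: lock the key and dispatch, or leave the message waiting.
    void onMessage(String uid, String key) {
        if (lockTable.containsKey(key)) {
            waiting.computeIfAbsent(key, k -> new ArrayDeque<>()).add(uid);
        } else {
            lockTable.put(key, uid);
            dispatched.add(uid);
        }
    }

    // The invocation finished: remove the lock, then release the next waiting message.
    void onComplete(String uid, String key) {
        lockTable.remove(key, uid); // only removes the row if the UID matches
        Deque<String> queue = waiting.get(key);
        if (queue != null && !queue.isEmpty() && !lockTable.containsKey(key)) {
            onMessage(queue.poll(), key);
        }
    }

    boolean isLocked(String key) { return lockTable.containsKey(key); }

    public static void main(String[] args) {
        SequencingFlow es = new SequencingFlow();
        es.onMessage("m1", "A");
        es.onMessage("m2", "A"); // key A is locked by m1, so m2 waits
        es.onMessage("m3", "B"); // different key, dispatched immediately
        System.out.println(es.dispatched); // [m1, m3]
        es.onComplete("m1", "A");
        System.out.println(es.dispatched); // [m1, m3, m2]
    }
}
```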
The Problem Report
Event sequencing works fine on a single server, but what happened on a clustered server is that the UID of the message actually changed depending on which messaging engine picked up the message. In a clustered fail-over scenario, the UID of the message became out of sync with the UID in the event sequencing database. The “new” UID message would complete, but event sequencing would display an error about “Unable to find UID xxxxxxxxx” when trying to unlock. Because the unlock never succeeded, the locks stayed in place and every later message with the same key remained locked on the queue.
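Here is a rough sketch of the failure mode as I understand it, assuming the unlock is matched on the message UID. The UID strings (`ME1:0001`, `ME2:0001`) and the key are invented; real UIDs look nothing like this.

```java
import java.util.HashMap;
import java.util.Map;

public class StaleLockDemo {
    // Stand-in for the event sequencing table: key -> UID recorded when the lock was taken.
    private final Map<String, String> lockTable = new HashMap<>();

    void lock(String key, String uid) { lockTable.put(key, uid); }

    // Unlock succeeds only if the UID matches the one recorded at lock time.
    boolean unlock(String key, String uid) { return lockTable.remove(key, uid); }

    boolean isLocked(String key) { return lockTable.containsKey(key); }

    public static void main(String[] args) {
        StaleLockDemo es = new StaleLockDemo();
        // Lock recorded under the UID seen by the first messaging engine.
        es.lock("customer-7", "ME1:0001");
        // After fail-over the same message shows up with a different UID.
        boolean released = es.unlock("customer-7", "ME2:0001");
        System.out.println("released: " + released);                     // false
        System.out.println("still locked: " + es.isLocked("customer-7")); // true
    }
}
```

Nothing ever presents the original UID again, so the key stays locked until someone clears it by hand.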
My Rant-y Feature Request
One thing I didn’t quite understand is why a lock wouldn’t time out after X minutes. Event sequencing is bound by the global WebSphere transaction timeout value of 180 seconds. You have 180 seconds to complete your synchronous call or WebSphere will time it out and roll back, so no legitimate lock in the database should exist longer than this timeout. Once a lock outlives it, the user could then define the action to take (stop processing the next message, or just continue and assume the risk). Ceasing the processing of a certain key is a huge value to my client. We deal primarily in migration projects where the old system used to shut down the listener port on any error, rendering the application unusable. In today’s WPS architecture, we get Failed events but we lose the sequencing of the messages. I’d love to be able to say “key XYZ is in failed state, please just accumulate the subsequent messages while support performs problem determination”.
We’re still in the process of upgrading our versions, but support was able to reproduce the problem and verify that upgrading resolves it.
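To be clear about what I am asking for, here is a sketch of the decision I would like to be able to configure. None of this exists in WPS as far as I know; the enum names and the `decide` method are entirely hypothetical, and the 180 seconds is just the transaction timeout mentioned above.

```java
import java.time.Duration;
import java.time.Instant;

public class LockTimeoutPolicy {
    // Hypothetical user-selected action for a lock that outlived the transaction timeout.
    enum OnExpiry { HOLD_KEY, RELEASE_AND_CONTINUE }
    enum Decision { WAIT, HOLD_FOR_SUPPORT, PROCESS_NEXT }

    // The global transaction timeout from the post.
    static final Duration TX_TIMEOUT = Duration.ofSeconds(180);

    static Decision decide(Instant lockedAt, Instant now, OnExpiry policy) {
        boolean expired = Duration.between(lockedAt, now).compareTo(TX_TIMEOUT) > 0;
        if (!expired) return Decision.WAIT; // the holder may still be running
        // Past the timeout the holder cannot still be in its transaction.
        return policy == OnExpiry.HOLD_KEY ? Decision.HOLD_FOR_SUPPORT : Decision.PROCESS_NEXT;
    }

    public static void main(String[] args) {
        Instant lockedAt = Instant.parse("2024-01-01T00:00:00Z"); // arbitrary timestamp
        System.out.println(decide(lockedAt, lockedAt.plusSeconds(60), OnExpiry.HOLD_KEY));              // WAIT
        System.out.println(decide(lockedAt, lockedAt.plusSeconds(600), OnExpiry.HOLD_KEY));             // HOLD_FOR_SUPPORT
        System.out.println(decide(lockedAt, lockedAt.plusSeconds(600), OnExpiry.RELEASE_AND_CONTINUE)); // PROCESS_NEXT
    }
}
```

`HOLD_KEY` is the option my client would pick: subsequent messages for the key accumulate while support investigates, and every other key keeps flowing.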