Event Sequencing Message Locking – Resolution!

In a previous post, I discussed the DeveloperWorks Process Server Endurance article. In that article, they ran into a situation where the maxMessages limit (100 by default) was breached. At my client site, we had a situation where messages in an Event Sequenced queue would remain locked forever, eventually breaching the maxMessages limit.

Strangely, it only occurred in our production cluster. We worked with IBM Support, and it turned out to be a known issue that was resolved in WPS 6.0.2.2. Our cluster’s messaging engines are configured active-passive, which allows for fail-over processing.

Background explanation of Event Sequencing

By default, messages are processed in the order that they arrive in a queue. While this is a true statement, you have to remember that Message Driven Beans (MDBs) by default allow 10 instances to be receiving messages. So while messages get picked up in order, they may be processed in parallel. A problem arises when one message is dependent on the processing of a message earlier in the queue. Think of a typical database insert-or-update decision: if the row doesn’t exist, I want to insert it; if the row exists, then update it with the new info.

In a single-threaded scenario, the first message is processed and creates the row, then the second message arrives and updates it. In a multi-threaded scenario, both messages are picked up at the same time and, non-deterministically, both may end up determining they need to insert the row, resulting in duplicate rows.
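
To make the race concrete, here is a minimal sketch of that check-then-insert logic. The CUSTOMER table and its columns are made up for illustration, not taken from any real schema. Run from two MDB instances at once, both can see “no row yet” and both take the INSERT branch:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical upsert logic (table/column names invented for illustration).
// Two MDB instances running this concurrently for the same key can both see
// "no row yet" and both take the INSERT branch, producing duplicate rows.
public class CustomerUpsert {
    public void upsert(Connection con, String customerId, String name) throws SQLException {
        boolean exists;
        try (PreparedStatement check = con.prepareStatement(
                "SELECT COUNT(*) FROM CUSTOMER WHERE CUSTOMER_ID = ?")) {
            check.setString(1, customerId);
            try (ResultSet rs = check.executeQuery()) {
                rs.next();
                exists = rs.getInt(1) > 0;
            }
        }
        String sql = exists
                ? "UPDATE CUSTOMER SET NAME = ? WHERE CUSTOMER_ID = ?"
                : "INSERT INTO CUSTOMER (NAME, CUSTOMER_ID) VALUES (?, ?)";
        try (PreparedStatement stmt = con.prepareStatement(sql)) {
            stmt.setString(1, name);
            stmt.setString(2, customerId);
            stmt.executeUpdate(); // without coordination, both threads can INSERT
        }
    }
}
```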

Event sequencing allows us to say “Hey, message1 and message2 are actually related. They have an id that means they work on the same data. Process Server, if you see a message with that id, please ensure that you don’t process the next message with the same id until the first message has completed processing.” This results in a system where messages with the same id are processed sequentially while messages with unique ids are processed in parallel.
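
To picture that contract, here is a tiny sketch of my own (not WPS internals): work that shares a sequencing key runs one item at a time in arrival order, while different keys proceed in parallel.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Conceptual sketch only -- this is not how WPS implements event sequencing,
// just an illustration of the contract: work items that share a key run one
// at a time in arrival order, while different keys proceed in parallel.
public class PerKeySequencer {
    private final ConcurrentMap<String, ExecutorService> lanes = new ConcurrentHashMap<>();

    public void submit(String sequencingKey, Runnable work) {
        // One single-threaded "lane" per key: submission order == execution order.
        // (A real implementation would also retire idle lanes.)
        ExecutorService lane = lanes.computeIfAbsent(
                sequencingKey, k -> Executors.newSingleThreadExecutor());
        lane.submit(work);
    }
}
```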

Background explanation of Event Sequencing’s implementation in Process Server

Event Sequencing has its own set of tables inside the WPSDB, created via the WPS install procedures. (FYI, Event Sequencing and any tables or tools it uses are normally prefixed with ‘es’ somewhere; ‘esAdmin’ lets you interact with the event sequencing runtime.) In these tables, event sequencing stores the UID of the message on the queue, the value of the user-defined key, and various module-related information about where the message is supposed to go.

When a message arrives on a queue with event sequencing turned on, the first thing that happens is that the sequencing MDB picks up the message and queries its table to determine whether the key is currently ‘locked’ (in use). If it is in use, the message remains locked on the queue, waiting for the lock to be released. If it is not in use, the key becomes locked and Event Sequencing passes the message off to the module/component, which begins its normal processing. When the invocation is complete, event sequencing removes the semaphore from the database and the next message with that key is processed.
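
Here is a rough sketch of that lock/unlock flow. The ES_LOCK table, its columns, and the insert-or-fail locking scheme are all my own invention for illustration; the real event sequencing schema and logic inside WPSDB are different.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;

// Illustrative only: a hypothetical ES_LOCK table with a UNIQUE sequencing key.
// Inserting a row "locks" the key; a duplicate-key failure means another
// message already holds it and this one has to wait on the queue.
public class SequencingLockSketch {

    /** Try to mark the key as in use; returns false if another message holds it. */
    public boolean acquire(Connection con, String messageUid, String key) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO ES_LOCK (MESSAGE_UID, SEQ_KEY, LOCKED_AT) VALUES (?, ?, ?)")) {
            ps.setString(1, messageUid);
            ps.setString(2, key);
            ps.setTimestamp(3, new Timestamp(System.currentTimeMillis()));
            ps.executeUpdate();
            return true;
        } catch (SQLException keyAlreadyLocked) {
            return false; // leave the message on the queue until the lock is released
        }
    }

    /** Called when the invocation completes: release the lock recorded for this message. */
    public void release(Connection con, String messageUid) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "DELETE FROM ES_LOCK WHERE MESSAGE_UID = ?")) {
            ps.setString(1, messageUid);
            ps.executeUpdate();
        }
    }
}
```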

The Problem Report

Event sequencing works fine on a single server, but what happened on our clustered servers was that the UID of the message actually changed depending on which messaging engine picked it up. In a fail-over scenario, the UID of the message became out of sync with the UID recorded in the event sequencing database. The message with the “new” UID would complete processing, but event sequencing would log an error along the lines of “Unable to find UID xxxxxxxxx” when trying to unlock, and it simply left all the locks in place.
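
In terms of the hypothetical sketch above, the failure mode looks something like this. The UIDs, the key, and the in-memory H2 database are all made up, and you would need the H2 driver on the classpath to actually run it:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

// Toy reproduction of the UID mismatch, reusing the SequencingLockSketch above.
public class FailoverMismatchDemo {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection("jdbc:h2:mem:wpsdb");
             Statement ddl = con.createStatement()) {
            ddl.execute("CREATE TABLE ES_LOCK (MESSAGE_UID VARCHAR(64) PRIMARY KEY, "
                      + "SEQ_KEY VARCHAR(64) UNIQUE, LOCKED_AT TIMESTAMP)");

            SequencingLockSketch locks = new SequencingLockSketch();
            locks.acquire(con, "UID-A", "customer-42"); // locked under the original UID

            // ... fail-over: the same logical message comes back under a new UID ...

            locks.release(con, "UID-B"); // matches no row, so the key stays locked forever
        }
    }
}
```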

My Rant-y Feature Request

One thing I didn’t quite understand is why a lock wouldn’t time out after X minutes. Event sequencing is bound by the global WebSphere transaction timeout value of 180 seconds: you have 180 seconds to complete your synchronous call or WebSphere will time it out and roll back. No legitimate lock in the database could possibly exist longer than this timeout, so anything older than that must be orphaned. The user could then define the action to take (stop processing the next message, or just continue and assume the risk).

Ceasing the processing of a certain key is a huge value to my client. We deal primarily in migration projects where the old system used to shut down the listener port on any error, rendering the application unusable. In today’s WPS architecture, we get failed events but we lose the sequencing of the messages. I’d love to be able to say “key XYZ is in a failed state, please just accumulate the subsequent messages while support performs problem determination”.
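
Here is a sketch of the behaviour I’m wishing for, again built on the made-up ES_LOCK table rather than anything WPS actually ships: any lock older than the transaction timeout is treated as orphaned and cleared (or it could be flagged instead, so subsequent messages for that key simply accumulate).

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;

// Hypothetical clean-up pass: since no live transaction can outlast the global
// transaction timeout, any lock older than that must be an orphan.
public class StaleLockSweeper {
    private static final long TX_TIMEOUT_MILLIS = 180_000L; // 180-second global timeout

    /** Delete (or, alternatively, flag) locks older than the transaction timeout. */
    public int sweep(Connection con) throws SQLException {
        Timestamp cutoff = new Timestamp(System.currentTimeMillis() - TX_TIMEOUT_MILLIS);
        try (PreparedStatement ps = con.prepareStatement(
                "DELETE FROM ES_LOCK WHERE LOCKED_AT < ?")) {
            ps.setTimestamp(1, cutoff);
            return ps.executeUpdate();
        }
    }
}
```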

Today

We’re in the process of upgrading our versions, but IBM Support was able to reproduce the issue and verify that the upgrade resolves it.

Author: dan

Comments

  1. Hi,
    This is a good one..!
    We encountered a similar problem on our pre-production cluster, and we seem to have resolved it.
    Our current version of Process Server is 6.0.2.25.

    Cheers..

  2. Hi Vijita,

    I am glad that you were able to find your issue in PREproduction and not POST like we did :). As a side note, we ran into another instance where event sequencing messages lock. I believe there is both a SIBus (WAS) iFix and a WPS event sequencing fix coming out, so be aware.

  3. I am working on the same project as Vijita. We have hit several different issues with the SI Bus and SCA due to our topology (multiple ME clusters in one cell). We have had to play around with targetSignificance etc. for activation specs and connection factories. This seems to be because the underlying SI Bus does not like the use of remote puts and gets, which is part of the default behaviour of WAS and WPS.

    We have had to tie our connection factories and activation specifications to specific MEs so that they always connect to the one which hosts the queue points and do not try to use remote puts/gets.

    Unfortunately, the SCA does not then honour the settings and so a further fix is required to sort that out.

    We have also had to load the system up with fix packs and we are now looking at WPS 6.0.2.4 plus fixes to resolve our issues.

    So far we have spent about 2 months trying to resolve these issues….

    Regards
    Chris

  4. Dan, Please contact Sunita about this. (Jeff B says you’re the worst volley-ball player he has ever seen.)
