developerWorks has a fantastic article about endurance testing with WebSphere Process Server. From the abstract:
Endurance testing is an important aspect of reliability. This article provides insight into the various problems and solutions encountered by the WebSphere Process Server Validation team as they performed an endurance run on WebSphere Process Server V6.0.2. Upon completing this article, you will be able to tune up your own WebSphere Process Server environment for optimum performance and stability.
What they’ve done is set up clustered, production-quality WPS instances running a typical scenario: a web service invocation into BPEL, Business Rule lookups, and more web service calls out. Then they hammered the server with thirty 20KB business objects per minute (1,800 an hour, or 43,200 a day) for a week and documented the issues they ran into. The article also covers every spot where they had to apply performance tuning to the server and its backing databases.
I think the most valuable part of the article is the “Problems and solutions” section, where they describe every problem they encountered and its resolution. I’d like to point out three of them specifically, because I personally hit them with my client. We actually run a smaller number of transactions per hour (700-1000) and still had these problems.
Observed out-of-memory errors after running for three days. We determined that they were due to memory fragmentation.
We added the following JVM args:

```
-Dibm.dg.trc.print=st_verify -Xgcpolicy:optavgpause -Xcompactgc -Xnopartialcompactgc
```
My client had an issue where the JVM heap size was turned up too high and we ran out of native memory (if your server ever just goes “POOF” and disappears, you ran out of native memory), so we added the same arguments.
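For reference, these flags belong in the server’s “Generic JVM arguments” (in the admin console: Application servers > server1 > Process Definition > Java Virtual Machine). They end up in the server’s server.xml; a rough sketch of what that fragment looks like, with placeholder heap sizes and a path that will differ in your cell:

```
<!-- config/cells/MyCell/nodes/MyNode/servers/server1/server.xml (path illustrative) -->
<jvmEntries xmi:type="processexec:JavaVirtualMachine"
    initialHeapSize="512" maximumHeapSize="1024"
    genericJvmArguments="-Dibm.dg.trc.print=st_verify -Xgcpolicy:optavgpause -Xcompactgc -Xnopartialcompactgc"/>
```

Note the trade-off from the native memory story above: the bigger you make maximumHeapSize, the less address space is left for native memory, which is exactly how my client’s server “POOF”ed.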
Problems with event sequencing messages. WebSphere Process Server ships with a controlled maximum active message count that defaults to 100 (set in /opt/WebSphere Process Server/properties/EventSequencing.properties). In our ND7 seven-day run for the PE scenario, the endurance load caused event sequencing problems for some messages.
This problem was attributed to the limited active message count. Setting it to 0 (which means unlimited) solved the issue, and we saw the messages getting event sequenced properly.
In our project, we use Event Sequencing. We also experienced a situation where the queue stopped consuming messages because the ‘maxActiveMessages’ was blown. I believe in my heart of hearts that this is a PMR (Event Sequencing doesn’t unlock the messages properly, but that’s for another post). We changed it to 0 (unlimited) and our queue processed successfully*.
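For concreteness, the fix amounts to a one-line change in the properties file mentioned above (the property name matches the ‘maxActiveMessages’ setting I referred to; a restart is likely needed for it to take effect):

```
# /opt/WebSphere Process Server/properties/EventSequencing.properties
# Default caps in-flight event-sequenced messages at 100.
# 0 = unlimited; the trade-off is you lose backpressure if downstream stalls.
maxActiveMessages=0
```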
Messages do not process completely. The default value for Work managers > es-workmanager > Maximum number of threads is 20. During the ND7 seven-day endurance run for the PE scenario, we observed that certain messages were never processed completely. Some of them were queued up in the SCA queues and therefore were not processed.
The activation specs for BPEL define how many BPEL instances can be created. Based on our throughput, we raised the maximum concurrency on the JMS activation specifications BPEApiActivationSpec and BPEInternalActivationSpec to 40. Increasing the concurrency to 40 helped solve the issue.
My client currently has an issue where messages sit in the internal BPEL SCA queues in the ‘LOCKED’ state and never get processed. We haven’t yet followed this recommendation, but I suspect we will need to.
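If you’d rather not click through the console to find those activation specs, a rough wsadmin (Jython) sketch for locating them is below. Treat this as an assumption-laden starting point (the config type name and output format should be verified against your WAS version); it only lists the specs so you can then raise their maximum concurrency to 40 as the article recommends:

```
# Run inside wsadmin: wsadmin.sh -lang jython -f listSpecs.py  (filename hypothetical)
# Lists J2C activation specs so you can spot BPEApiActivationSpec
# and BPEInternalActivationSpec before changing their max concurrency.
for specId in AdminConfig.list('J2CActivationSpec').splitlines():
    print AdminConfig.showAttribute(specId, 'name')
```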
The only negative thing I have to say about this article is that they focused too much on ‘working around’ the problems and not delving deeper within IBM to get them fixed. The article was published on 18 Jul 2007 but here it is 27 Feb 2008 and the problems still exist.