GlassFish lockups were Microsoft's fault

Just days after the official release of GlassFish V2 in September 2007, we migrated one of our production applications from JBoss 4 to GlassFish V2. I had been playing with GlassFish V1 and later V2 for many months and -really- liked the command line and web based admin console. Earlier I had been given the opportunity to make major changes to our application to port it over to Java EE 5 using features such as JSF, JPA, JAX-WS and EJB 3. I had also converted our proprietary JBoss JMX MBean service to a standard JCA resource adapter. The rollout of GlassFish V2 and the new version of our application went smoothly, and we lived happily ever after. Well, not quite. A few weeks later we experienced our first lockup.

Users accessing our web application were presented with a blank white screen, and the browser busy indicator spun forever. java.exe wasn't using much CPU, there was no disk activity, and the web admin console worked fine. Restarting GlassFish solved the problem and we forgot about it until it locked up again a few weeks later. This time users were presented with an error that said: "Maximum Connections Reached: 4096". Earlier we had purchased a GlassFish support contract for assurance, so I promptly opened a case with Sun. The support engineers thought that we were under heavy load and reached the maximum number of connections, and told me how to increase the default setting. I knew our application had nowhere near 4096 simultaneous users, but gave it a try anyway. It didn't help.

We kept experiencing lockups every couple of weeks. Each time we reported it to Sun and restarted GlassFish. One day we found that restarting GlassFish did not solve the problem. Even Postgres seemed to be dead, and restarting it didn't help either. After a bit of panic we found that simply rebooting Windows 2003 Server brought everything back to life. Sun said that I was one of only a small handful of people in the world experiencing something called a Non-Paged Pool leak in the Windows 2003 Server TCP/IP stack, and that they had been working with Microsoft for over a year to diagnose it. Sun was absolutely 100% positive that it was not caused by the JVM, and therefore not by GlassFish, and told me that fixing our bi-weekly lockups would require a Microsoft patch. We waited months for this patch (a hotfix), which was never publicly released. By that time we were experiencing lockups once per week, sometimes two or three times per week. What was different? Increased user activity. We were restarting GlassFish every Friday to try to prevent the lockups from happening. After installing the Microsoft hotfix I swear the problem got worse. We resorted to restarting GlassFish every Monday, Wednesday and Friday, but the restarts didn't seem to make any difference. Sometimes it would lock up 20 minutes after restarting. An interesting note: the NP Pool column in Windows Task Manager for java.exe would stay pretty steady until a lockup. Once locked up, it would increase by 1K every 15-20 seconds.

Around that time we decided to evaluate our other options. Sun was telling us that we would not suffer from the effects of the NP Pool leak if we switched to another operating system such as Windows 2000, XP, Vista, Solaris or Linux. According to Sun, the NP Pool leak was unique to Windows 2003 Server. We were also considering going back to JBoss, since our application had been running on JBoss on the same computer (but with an older JDBC driver version) for over a year without lockups. Then the bombshell. Someone reported the same lockup symptoms on Linux. Then another, then another. A bunch of people on the GlassFish users mailing list got involved in heated discussions about their daily lockups and how their apps worked fine on Tomcat.

Sun explained what they thought was happening, but no one was buying it. I'll try to explain in my own words how GlassFish and Tomcat are different. Tomcat has a socket server called Coyote that uses blocking IO. Since it uses blocking IO, it dedicates a thread to each request. If there are 30 simultaneous HTTP requests, there will be 30 threads. There is a default maximum of 100 threads. If all 100 threads are servicing requests, the 101st user will immediately get an HTTP 500 error. Years ago Jeanfrancois Arcand of Sun suggested to the Tomcat community that they use non-blocking IO and a small pool of threads to service requests. The Tomcat community rejected this idea, so he created a project called Grizzly. Grizzly is the default socket server in GlassFish. It uses non-blocking IO and has a default of 5 threads to service ALL web requests. If all threads are being used, new web requests are queued until a thread becomes available. The number of Grizzly threads is configurable, and Sun has done extensive scalability testing. You can tell GlassFish to use Coyote instead of Grizzly by adding the following JVM option to domain.xml:

-Dcom.sun.enterprise.web.connector.useCoyoteConnector=true

If you do that, GlassFish will lock up after 100 threads are blocked instead of 5. This appears to fix the problem, but it only buys some time. To some people, this was evidence that the problem was Grizzly and that they should therefore abandon GlassFish. I don't understand that logic. I wanted to solve the problem, so I kept working with Sun. Jeanfrancois suggested that something in my application was blocking for a long time, so the Grizzly thread servicing the request was not going back into the pool to be reused. Four more web requests would follow, and they would also get into this blocked state. At that point there are no threads left to service requests. All new web requests are queued, which explains why web users saw a white screen while the browser's busy animation spun forever.
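The starvation scenario Jeanfrancois described can be reproduced outside of any application server. The following is only a minimal sketch, assuming a plain java.util.concurrent fixed thread pool as a stand-in for Grizzly's worker pool (it is not Grizzly's actual implementation) and a latch as a stand-in for a call that never returns. The class and method names are made up for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PoolStarvationDemo {

    /**
     * Fills a fixed pool of poolSize workers with tasks that block on a latch
     * (a stand-in for a stuck database login), then submits one more request.
     * Returns true if that extra request was still waiting after waitMillis.
     */
    public static boolean extraRequestStarves(int poolSize, long waitMillis) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch stuck = new CountDownLatch(1); // not released while we measure
        try {
            for (int i = 0; i < poolSize; i++) {
                pool.submit(() -> {
                    stuck.await(); // this worker never goes back to the pool
                    return null;
                });
            }
            // Every worker is now occupied, so this request sits in the queue.
            Future<String> extra = pool.submit(() -> "served");
            try {
                extra.get(waitMillis, TimeUnit.MILLISECONDS);
                return false; // a worker was free after all
            } catch (TimeoutException e) {
                return true; // queued behind the blocked workers
            }
        } finally {
            stuck.countDown();   // let the blocked tasks finish
            pool.shutdownNow();  // and clean up the pool
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("extra request starved with 5 workers: "
                + extraRequestStarves(5, 200));
    }
}
```

Raising the pool size from 5 to 100 changes nothing about the outcome in this sketch. It only takes more blocked tasks to get there, which matches what happens when switching from Grizzly to Coyote.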

To prove it, Sun asked me to provide a thread dump of GlassFish while it was in a locked-up state. To do this I ran the following command from the command prompt:

asadmin generate-jvm-report --type=thread > thread_dump.txt

It didn't take long for them to point out that all five Grizzly threads were waiting on ResourceManager, which is responsible for JCA connection pools of things such as database connections. The ResourceManager had no usable database connections and was trying to create a new one. It was waiting on the Microsoft SQL Server 2005 JDBC driver, which was stuck in a prelogin() method. Turning on fine logging revealed only that the JDBC driver had received a "TDS prelogin response error". Sun suggested that we use the JDBC driver's loginTimeout property to override the default value of 0 (wait forever). We tried this with no success. Next we tried jTDS, the open source driver for MS SQL Server. Bingo! We've been running for over a month without experiencing lockups and are very happy that the problem was not GlassFish's fault.
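For reference, here is a rough sketch of where a login timeout can be expressed in plain JDBC code. It assumes a standalone program rather than a GlassFish connection pool (where the same property would be set on the pool instead), and the host, port and database name are hypothetical placeholders. Keep in mind that in our case the timeout did not help with the Microsoft driver, so this shows the setting, not a guaranteed fix.

```java
import java.sql.DriverManager;

public class LoginTimeoutSketch {

    // Hypothetical connection details, for illustration only.
    static final String HOST = "dbhost.example.com";
    static final int PORT = 1433;
    static final String DATABASE = "appdb";

    /** URL for the Microsoft SQL Server JDBC driver with loginTimeout in seconds. */
    public static String microsoftUrl(int loginTimeoutSeconds) {
        return "jdbc:sqlserver://" + HOST + ":" + PORT
                + ";databaseName=" + DATABASE
                + ";loginTimeout=" + loginTimeoutSeconds;
    }

    /** URL for the open source jTDS driver with loginTimeout in seconds. */
    public static String jtdsUrl(int loginTimeoutSeconds) {
        return "jdbc:jtds:sqlserver://" + HOST + ":" + PORT + "/" + DATABASE
                + ";loginTimeout=" + loginTimeoutSeconds;
    }

    public static void main(String[] args) {
        // JDBC-wide setting; whether a driver honors it is up to the driver.
        DriverManager.setLoginTimeout(30);

        System.out.println(microsoftUrl(30));
        System.out.println(jtdsUrl(30));
    }
}
```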

Still, there is some skepticism on the GlassFish users mailing list. My problem was caused by JDBC, but some of the other people aren't using databases. One guy who is using a database says that the exact same .war file and MySQL JDBC driver never lock up on Tomcat, but lock up several times daily on GlassFish. All I can say is that Jeanfrancois's explanation of Grizzly's thread pool and queue makes perfect sense, and obviously something is blocking those threads. Maybe you don't notice lockups on Tomcat because whatever is causing them eventually times out and releases the threads before all 100 of them are blocked? Generate a thread dump and get Sun to work with you to figure it out. Unfortunately Sun hasn't been able to help everyone who provides a thread dump on the mailing list, and has suggested that these users purchase a support contract. That doesn't bode well for people evaluating GlassFish. I wrote this blog entry to give those people a better understanding of the problem, in the hope that it can be resolved.
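A full thread dump is what support will ask for, but a quick summary of thread states can already hint at whether worker threads are piling up in a waiting or blocked state. The following is a minimal sketch using the JDK's standard ThreadMXBean. It assumes the code runs inside the JVM being inspected, and the class and method names are made up for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

public class ThreadStateSummary {

    /** Counts the live threads of the current JVM by thread state. */
    public static Map<Thread.State, Integer> countByState() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // false, false: skip lock details, we only need each thread's state
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info == null) {
                continue; // defensive: skip entries without thread info
            }
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        countByState().forEach((state, n) -> System.out.println(state + ": " + n));
    }
}
```

If the WAITING or BLOCKED counts keep climbing while the server stops responding, that is a good moment to capture the full dump and look at what those threads are waiting on.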

Comments:

Thanks for sharing your interesting story.
We are moving from WebLogic and I really like GlassFish.
Lately we have been experiencing some trouble with GlassFish, but nothing fatal.
The mailing list is helpful, but not all the time, especially when you really need it.
I will consider suggesting that my company purchase the support contract.

Posted by Kenneth on July 02, 2008 at 10:35 PM EDT #

Hi man, this sucks but I need to know the name of this stream that sema4 put me onto in like 2000 or something. He said somewhere it keeps him happy whilst hacking but I can't find that now. It had some low-profile name and a grey & orange colour scheme and this cube on the front.

Can you help me?

Posted by flamoot on July 08, 2008 at 07:06 AM EDT #
