First, a brief summary of the original problem and solution. In late September 2007 we migrated our production server from JBoss to GlassFish V2, which had just been released. We had been testing on betas and release candidates for months. Just weeks after the migration, GlassFish started to lock up, and users got blank white screens while the browser's spinner spun forever waiting for a response. Using our GlassFish support contract, we discovered the problem was the Microsoft JDBC driver that we were using to communicate with a remote SQL Server. It seemed that when SQL Server was under extreme load, the driver would get a "TDS prelogin response error" and get stuck, blocking the calling thread indefinitely. GlassFish uses its own HTTP server, Grizzly, which takes advantage of NIO and asynchronous sockets. It needs only 5 threads (default config) for most loads, so once the JDBC driver had locked all 5 threads, subsequent HTTP requests would block waiting for one of the threads to become available. Our solution was to switch to the open source jTDS JDBC driver. (Side note: I recently tried the latest version of the Microsoft JDBC driver and the problem still exists.)
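To make the lockup mechanism concrete, here is a minimal sketch of a small fixed thread pool being starved by calls that never return. It is not Grizzly or GlassFish code: the class name, the method name, and the use of a plain ExecutorService as a stand-in for the HTTP worker pool are all assumptions made for illustration, and a latch stands in for the hung driver call.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class; a plain thread pool stands in for the HTTP workers.
public class PoolStarvationDemo {

    /** Returns true if a new request gets no worker while stuckCalls calls hang. */
    public static boolean isStarved(int workers, int stuckCalls) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        // Never released until the end: simulates a driver call that blocks forever.
        CountDownLatch release = new CountDownLatch(1);
        // Lets us wait until every hung call that can get a thread has one.
        CountDownLatch started = new CountDownLatch(Math.min(workers, stuckCalls));
        try {
            for (int i = 0; i < stuckCalls; i++) {
                pool.submit(() -> {
                    started.countDown();
                    release.await(); // the "stuck" database call
                    return null;
                });
            }
            started.await();

            // A fresh request arrives after the hung calls have taken their threads.
            Future<String> next = pool.submit(() -> "response");
            try {
                next.get(500, TimeUnit.MILLISECONDS);
                return false; // a free worker served the request
            } catch (TimeoutException e) {
                return true;  // no worker was available: the user sees a spinner
            }
        } finally {
            release.countDown();
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("5 workers, 5 hung calls, starved: " + isStarved(5, 5));
        System.out.println("5 workers, 4 hung calls, starved: " + isStarved(5, 4));
    }
}
```

With all 5 workers hung, the sixth request simply waits in the queue. With one worker still free it is answered immediately, which is consistent with the server seeming fine until the last thread got stuck.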
The system ran well for six months, then in December 2008 we had our first Windows crash. The server was not accessible over the network, and the local console showed a blank gray screen. We had to cycle the power to reboot Windows. This happened again, and again, and again until September 2009. Sometimes it happened several times a day, sometimes 3-4 times a week, sometimes only every couple of weeks. There was no pattern. We did notice that the NP Pool column for java.exe in Windows Task Manager would stay relatively stable until "something" opened the floodgates, causing it to rise rapidly until Windows crashed. We had noticed exactly the same thing while diagnosing our original problem with the Microsoft JDBC driver, and we think the problem is somehow related to the remote SQL Server being under extreme load. We also found that java.exe's memory was growing. It would sometimes balloon by hundreds of megabytes until we got Java Heap Space errors. That problem turned out to be a bug in an SQL query that tried to load every record in a multi-gigabyte table.
After fixing the SQL query, we were still leaking hundreds of megabytes of BlobBuffer objects, a class that is part of the jTDS driver. The BlobBuffer objects seemed to be unreferenced, so we figured it was a bug in the jTDS driver. In September 2009 I read the release notes of the new jTDS v1.2.3 release. They had fixed a number of nasty bugs that seemed to be directly related to all of our problems:
- Corrected bug [1755448], login failure leaves unclosed sockets. (We think this is the source of our NP Pool leak.)
- Added missing finalizer in connection class to ensure resources are released if an application fails to close a connection.
- Corrected bug [2796385], running out of UDP sockets.
- Resolved problem [1957748], Java VM is leaking memory in File.deleteOnExit(). (We think this is the source of our BlobBuffer leak.)
- Corrected bug [1869156] memory leak of WeakReferences
- Corrected bug [1793584], Login timeout canceled too early.
- Corrected bug [1843801] infinite loop if DB connection dies during a batch.
- Corrected bug [1883905] unintentional infinite wait.
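The finalizer item in the list above is a safety net for applications that forget to close a connection. As a rough illustration of the application-side habit that avoids relying on it, here is a sketch using a hypothetical FakeConnection class (not part of jTDS or JDBC) that counts open handles. It uses try-with-resources, which requires Java 7 or later; on Java 6 the equivalent is a try/finally block. A real java.sql.Connection implements AutoCloseable, so the same pattern applies to it.

```java
// Hypothetical demo; FakeConnection stands in for a real JDBC connection.
public class CloseDemo {

    static final class FakeConnection implements AutoCloseable {
        static int openCount = 0; // number of handles currently open

        FakeConnection() { openCount++; }

        void query(boolean fail) {
            if (fail) throw new IllegalStateException("simulated query failure");
        }

        @Override
        public void close() { openCount--; }
    }

    /** Closes only on the happy path; returns the handles left open afterwards. */
    public static int leakyCall(boolean fail) {
        FakeConnection.openCount = 0;
        try {
            FakeConnection c = new FakeConnection();
            c.query(fail);
            c.close(); // skipped when query() throws
        } catch (IllegalStateException e) {
            // The error is handled, but the connection is never closed.
        }
        return FakeConnection.openCount;
    }

    /** Closes on every path; returns the handles left open afterwards. */
    public static int safeCall(boolean fail) {
        FakeConnection.openCount = 0;
        try (FakeConnection c = new FakeConnection()) {
            c.query(fail);
        } catch (IllegalStateException e) {
            // close() has already run by the time this handler executes.
        }
        return FakeConnection.openCount;
    }

    public static void main(String[] args) {
        System.out.println("leaky, failing query: " + leakyCall(true) + " left open");
        System.out.println("safe, failing query:  " + safeCall(true) + " left open");
    }
}
```

The leaky version leaves one handle open whenever the query throws. Without a finalizer in the driver, such a handle would stay open even after the connection object became unreachable.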
I upgraded the driver on our production server and we've now been running problem free for a month! After two years, it looks like our GlassFish lockups and Windows crashes are finally behind us!
I suspect that I upgraded the JDBC driver in December 2008 during an application upgrade. After our first Windows crash I rolled back the application upgrade but forgot about the JDBC driver upgrade. The bugs were probably introduced after June 2008, and before December 2008. That would explain why we ran problem free for almost six months.
jTDS v1.2.3 now implements the JDBC4 APIs so that it can be compiled on Java 6. They didn't actually implement a JDBC4 driver; they just got it to compile. When I installed it on GlassFish in our test environment I had to use GlassFish's proprietary "JDBC30DataSource=true" JDBC property to make the driver work. When I installed it on our production server, I did not have to add this property!

1 comment:
We had a similar case with GlassFish, and the cause was a printer driver. After several weeks of testing we found the printer driver could not repeatedly print 10k jobs. Unfortunately we have no alternative unless the manufacturer fixes their driver bug.