05 May 2011

Sometimes It's Not The Database

The other day for Meme Monday I put together a list of 9 Things That Go Wrong With SQL Server That Aren't The Disk.  That got me to thinking about the old "It's the database" stand-by for when The Application isn't working right.  Many of us have been there... The Application is acting up, and the knee-jerk response by the development or support team is, "it's the database"
The Database... Seriously?

The frustrating thing with this is when the problem is found to lie in The Application or in the computing infrastructure, as it often is, but all the business stakeholders remember hearing was "it's the database."

Then there are the few cases when it's true... a broken clock is right twice a day, after all.  This happens just often enough that the other support and infrastructure teams sometimes actively refuse to do any troubleshooting unless, and until, the DBA can prove that the problem lies with a non-database component.  Cross-discipline troubleshooting/resolution team?  Not when it's obviously the database

I was at a shop that recently had some recurring database connectivity issues with helpful error messages like "SQL Server Does Not Exist Or Access Denied", "Timeout Expired" and "General Network Error."  Now, you know where the finger points when the words "SQL Server" appear in an error message.

So the database team dutifully combed event logs, performance monitor and server side trace data, and was able to demonstrate pretty conclusively that nothing was wrong with the database or the server it was on.  Till the next time The Application got busy and started throwing errors, when they had to do it all over again.

What it ultimately came down to was the owners of The Application server had turned off connection pooling while troubleshooting an installation issue weeks earlier.  It turns out that, by default, Windows only allows 4000 sockets (ports) per destination IP address.  With the amount of database server traffic generated by the application server -- it had recently been "consolidated" -- they were able to hit that threshold pretty reliably.

Microsoft has a KB article on considerations for servers with connection pooling disabled that describes this nicely:
If you rapidly open and close 4000 sockets in less than 4 minutes, you will reach the default maximum setting for client anonymous ports, and new socket connection attempts fail until the existing set of TIME_WAIT sockets times out.
After weeks of recurring connection errors with the word "SQL Server" in them, the connection pooling issue was quietly fixed by upping the socket limit to some random number like 65535 :) and The Application performance returned to normal.  I'm sure you can guess the words that were never spoken to the business stakeholders... "It's Not The Database."

No comments: