05 May 2011

Sometimes It's Not The Database

The other day for Meme Monday I put together a list of 9 Things That Go Wrong With SQL Server That Aren't The Disk.  That got me thinking about the old "it's the database" stand-by for when The Application isn't working right.  Many of us have been there... The Application is acting up, and the knee-jerk response from the development or support team is, "it's the database."
The Database... Seriously?

The frustrating thing is that when the problem is found to lie in The Application or in the computing infrastructure, as it often is, all the business stakeholders remember hearing is "it's the database."

Then there are the few cases when it's true... a broken clock is right twice a day, after all.  This happens just often enough that the other support and infrastructure teams sometimes actively refuse to do any troubleshooting unless, and until, the DBA can prove that the problem lies with a non-database component.  A cross-discipline troubleshooting/resolution team?  Not when it's obviously the database.

I was at a shop that recently had some recurring database connectivity issues with helpful error messages like "SQL Server Does Not Exist Or Access Denied", "Timeout Expired" and "General Network Error."  Now, you know where the finger points when the words "SQL Server" appear in an error message.

So the database team dutifully combed event logs, Performance Monitor counters and server-side trace data, and was able to demonstrate pretty conclusively that nothing was wrong with the database or the server it was on.  Until the next time The Application got busy and started throwing errors, when they had to do it all over again.

What it ultimately came down to was that the owners of The Application server had turned off connection pooling while troubleshooting an installation issue weeks earlier.  It turns out that, by default, Windows allows only about 4000 anonymous (ephemeral) ports for outbound client connections.  With the amount of database server traffic generated by the application server -- it had recently been "consolidated" -- they were able to hit that threshold pretty reliably.
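For context, connection pooling just means reusing a small set of long-lived database connections instead of opening and closing a socket for every request.  A minimal, hypothetical sketch of the idea in Python (illustrative only -- not how any particular driver actually implements it):

```python
import queue

class ConnectionPool:
    """Tiny illustrative pool: hand out idle connections, create new ones
    only up to max_size, and take them back for reuse instead of closing."""

    def __init__(self, connect, max_size=10):
        self._connect = connect        # factory that opens a real connection
        self._idle = queue.Queue()     # connections waiting to be reused
        self._created = 0
        self._max = max_size

    def acquire(self):
        try:
            return self._idle.get_nowait()   # reuse an idle connection
        except queue.Empty:
            if self._created < self._max:
                self._created += 1
                return self._connect()       # open a brand-new socket
            return self._idle.get()          # block until one is released

    def release(self, conn):
        self._idle.put(conn)                 # back to the pool; socket stays open

# With pooling, thousands of "queries" reuse the same handful of sockets;
# without it, each one would burn a fresh ephemeral port.
opened = 0
def fake_connect():
    global opened
    opened += 1
    return object()   # stand-in for a real database connection

pool = ConnectionPool(fake_connect, max_size=5)
for _ in range(10_000):
    conn = pool.acquire()
    pool.release(conn)

print(opened)  # in this single-threaded loop, one connection serves all 10,000 requests
```

With pooling disabled, each of those 10,000 requests would instead open and close its own socket -- which is exactly what pushes a client into the port limit described next.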

Microsoft has a KB article on considerations for servers with connection pooling disabled that describes this nicely:
If you rapidly open and close 4000 sockets in less than 4 minutes, you will reach the default maximum setting for client anonymous ports, and new socket connection attempts fail until the existing set of TIME_WAIT sockets times out.
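The arithmetic in that quote is easy to check.  A back-of-the-envelope sketch, assuming the old Windows defaults the KB article describes (about 4000 anonymous client ports and a 4-minute TIME_WAIT):

```python
# Rough ephemeral-port math for a client with connection pooling disabled.
# Values are the old Windows defaults referenced in the KB article.
EPHEMERAL_PORTS = 4000      # default budget of anonymous client ports
TIME_WAIT_SECONDS = 4 * 60  # closed sockets linger in TIME_WAIT for 4 minutes

# Every open/close cycle parks a port in TIME_WAIT, so the highest
# sustainable rate of new connections is ports / lingering time:
max_rate = EPHEMERAL_PORTS / TIME_WAIT_SECONDS
print(f"~{max_rate:.1f} new connections/sec before ports run out")
```

Roughly 16-17 new connections per second doesn't sound like much, and a busy, consolidated application server with pooling disabled can exceed it easily -- at which point new socket attempts fail until existing TIME_WAIT sockets age out.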
After weeks of recurring connection errors with the words "SQL Server" in them, the issue was quietly worked around by upping the socket limit to some random number like 65535 :) and The Application's performance returned to normal.  I'm sure you can guess the words that were never spoken to the business stakeholders... "It's Not The Database."

02 May 2011

Meme Monday: It's not the disk!

Thomas LaRock, aka @SQLRockstar (not THAT rock-star), has a fun way to get the blog month started: Meme Monday.  This month, it's just a list of 9 things we frequently see go wrong with a database server that are NOT related to disk issues.

In no particular order...

9. Good old boys
8. "It's just like a spreadsheet" database design
7. Out of date statistics 
6. Improperly partitioned tables (see 7 above)
5. Good old boys
4. Microsoft patches
3. Vendor software
2. Internally developed software
1. Good old boys

I couldn't believe the number of disk-related problems that kept coming to mind while putting this non-disk list together - especially when considering the SAN as a black-box hard disk.

You may have noticed that Good old boys made the list slightly more than once.  Unfortunately, that is based on years of direct observation.  I'm even going to manufacture a statistic to support its frequency in the list:

Nearly 60% of production database outages are, directly or indirectly, caused by the incompetence fostered by Good old boy cronyism...  and that's NOT the disk.