15 November 2010

Nearly Competent

I thought this was a pretty good insight from Jeff Atwood over at CodingHorror:
You can tell a competent software developer from an incompetent one with a single interview question:
What's the worst code you've seen recently?
If their answer isn't immediately and without any hesitation these two words:
My own.
Then you should end the interview immediately. Sorry, pal.

I love the brutal efficiency of the question, and find it to be revealing in the same way as the “on a scale of 1 – 10, how would you rate yourself” question.  This question isn't quite as efficient -- there are more than two answers -- but any answer that isn't actually between 1 and 10 should end the interview.  If the applicant doesn’t realize that 15 is not less than 10, what can you possibly hope to salvage by continuing?

Yes. I’ve heard, with my own ears, somebody seriously rate themselves higher than 10 out of 10.  I'll never forget the claim: "On a scale of 1 to 10, I'm a 13.  My SQL Server skills are that good."  Of course the candidate was hired (it wouldn't be much of a story otherwise) and the consequences weren't pretty.  Years later, the team that made the hire is still trying to overcome the reputation of minimal competence, unresponsiveness, and outright contempt that this individual bequeathed them.

On a lighter note, a little tidbit about the most overlooked risk in software engineering (again, from CodingHorror):
Q: What is the most often-overlooked risk in software engineering?
A: Incompetent programmers.

My name is “Bob” and I’m a developer. Now let’s begin planning around my incompetence.

11 November 2010

Restore Me, Please

Our shop has a stored procedure to restore databases that takes care of all the ugly details of file moves and the like.  You supply a target_database_name and a backup_file and it does the rest.  No worrying about which paths to move your data and log files to on each server.  It's even supposed to distinguish between native and Red-Gate backups and do the appropriate magic.


A couple years ago, we had a rock-star DBA "optimize" the restore procedure, and it has only worked intermittently ever since.  We've all taken to using the native backup syntax so we don't depend on "the procedure."  However, we occasionally get a compressed database we need to restore (.sqb), and I recently ran through a painful restore that I thought I'd share to give a quick overview of restoring backups that have been compressed with Red-Gate SQL Backup.

After making a good-faith effort to restore with "the procedure," I grabbed version 6.4 of the Red-Gate SQB Converter so I could extract the .SQB file into its constituent .BAK files.  Not entirely related, but we tend to have 15 .BAK files wrapped up in each .SQB file.  It helps backup performance, but can be a killer on MemToLeave Virtual Address Space (VAS) on 32-bit servers.

In this particular case, it wouldn’t convert the file because it wanted a password.  This is often an indication that the converter version is older than the SQL backup version that created the .sqb file, but I'm sure it could have other meanings, too (like needing a password?).

Since I couldn't get my .bak files out of the .sqb file, I had to crack open "the procedure" and scarf its Red-Gate syntax to do the restore.  Other than the additional Red-Gate syntax, the process is the same one we would use with native tools…

1. Restore filelistonly to get your logical filenames
2. Determine the location to place your .mdf and .ldf files on the server
3. Restore your database with move

Here is the relevant syntax…

1. Get your logical filenames with this:

EXEC master..sqlbackup N'-SQL "restore filelistonly from disk = ''path_to_your_sqb_file.sqb''"'

2. Get the default file locations for your server (if you can create a database, you should have permissions to write restored database files there, too):

CREATE DATABASE bob
EXEC bob..sp_helpfile
DROP DATABASE bob

3. Once you have your logical filenames, and the paths to where you want to move them, you can run the restore…

EXEC master..sqlbackup N'-SQL "restore database your_database from disk = ''path_to_your_sqb_file.sqb'' with move ''data_file'' to ''path_to_mdf_file.mdf'', move ''log_file'' to ''path_to_your_log_file.ldf''"'
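
For comparison, the native-tool version of the same three steps looks like this (the file and logical names are placeholders, just like above):

-- native equivalent of the Red-Gate restore above (placeholder names)
restore filelistonly from disk = 'path_to_your_bak_file.bak'

restore database your_database
from disk = 'path_to_your_bak_file.bak'
with move 'data_file' to 'path_to_mdf_file.mdf',
     move 'log_file' to 'path_to_your_log_file.ldf'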

I'm still a big fan of Red-Gate SQL Backup; I just don't listen to rock any more.

08 November 2010

The Patching Treadmill

Well, it's almost Patch Tuesday again.  Are you ready to apply and test this month's patches?  It's one of the most important things you can do to maintain the security and availability of your SQL Servers.  Remember Slammer?  The patch for that one was months old, but hadn't been widely applied.

Keeping up with patches used to be a time-consuming, hit-and-miss sort of affair.  By scheduling a regular monthly patch release, Microsoft has provided the predictability that many organizations need to manage and keep up with the patching process.  There isn't really an excuse any more for not having some sort of regularly scheduled patch deployment and testing process.

We happen to use Windows Server Update Services (WSUS) to authorize and deploy patches in our environment.  We push the patches through QA and test environments before pushing them to pilot groups and then to the rest of the enterprise.  It's a pretty decent system and has served us well.

Testing the patches is critical, because they do occasionally cause production outages of their own.  We had to back one out of production a couple of months ago because it affected the communication between one of our Windows applications and its WCF service.  Of course, since we're not perfect, the same patch made it back into test last month, but we were able to get it backed out before it made it through to production again (it had been mistakenly re-authorized in WSUS).  We'll have to really watch out this month: the only thing more embarrassing than getting caught by the same mistake twice is getting caught by it three times.  Three strikes, and you're out, right?

To help get things rolling, this is the schedule we use for applying Microsoft's monthly patches.  Place the Microsoft release on the second Tuesday, and the testing and production rollouts fall right into place.

This is based on an Excel Template by Vertex42

03 November 2010

My Descendants Thank You

I have 3,932,160 grandchildren.

At least I will in 600 years, if my children and theirs each produce two reproducing children per generation.  And that’s exactly when today’s performance optimization will pay for its own development time in CPU cycles.

I have written previously about intentionally leaving query results unordered to save some academic CPU cycles while burning money in end-user time – a poor “passive” optimization.  Now we’ll discuss actively making inane optimizations that will never pay for themselves – a poor “aggressive” optimization.

A developer modifies a query with the academic goal of saving several milliseconds per execution (perhaps as few as 6 ms).  If this were a query that ran thousands of times per hour during user time, we might have been able to quantify some savings to the business.  Unfortunately, the optimized queries were in procedures that run once per day during a nightly batch cycle with no time pressure.

I ran the numbers on the CPU savings, and it will take more than 600 years for the saved CPU cycles to add up to the time spent to “optimize” the query.  Because the CPU time for a batch process does not correspond to the real money being spent in user time, this optimization will never pay for itself… even after the 600-year break-even period for the CPU time.
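
For the curious, here is the back-of-the-envelope version of that math.  The 6 ms per execution comes from above; the developer time is my own assumption of roughly 25 minutes:

-- break-even sketch with assumed numbers: 6 ms saved per run, one run per day,
-- and roughly 25 minutes of developer time spent on the "optimization"
declare @ms_saved_per_run decimal(10, 2), @runs_per_year int, @dev_seconds decimal(10, 2)
select @ms_saved_per_run = 6, @runs_per_year = 365, @dev_seconds = 25 * 60

select @dev_seconds / (@ms_saved_per_run * @runs_per_year / 1000.0) as years_to_break_even
-- comes out to roughly 685 years with these assumptions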

These two types of performance optimization illustrate the danger of optimizing for purity of execution without giving any consideration to the practical implications of where our technical decisions force the business to spend its money.

If we keep wasting it like this, they certainly won’t be spending it on us.

29 October 2010

Don’t waste my (users’) time

Tradeoffs are everywhere.  This is particularly true in the realm of software development.  As the saying goes, “Speed costs money.  How fast do you want to go?”  We should always keep these tradeoffs in mind as we balance our technology decisions with the needs of the business.

Here is a quick tale of a not-so-great performance optimization that really emphasizes the importance of making optimization decisions that actually have business value…

The case of a missed sort optimization
A business user and her colleagues currently sort and search a grid manually because "without some kind of 'sort' you have no idea where in the list your application will be found".  These results apparently get scrambled each time the grid refreshes.  The company is spending wage dollars (read “real money”) to do the work that the CPU could be doing for “free”.

In this case, the ten seconds' worth of real money spent by the developer to type “ORDER BY” would have been a huge win.  One ten-second, real-money line of code would have saved thousands of several-second, real-money live employee sorts.  I don’t know about you, but I get excited about single order-of-magnitude performance increases.  This was a three-orders-of-magnitude oversight!
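
To be concrete, the fix under discussion is a single clause on the query that feeds the grid.  The table and column names here are made up for illustration:

-- the ten-second fix: return the grid rows in a predictable order
-- (table and column names are hypothetical)
select application_name, status, last_updated
from dbo.Applications
order by application_name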

Because the developer didn't make this 10 second investment up front, the company now has two options (there may be others... conversations are good at flushing out options!):
1.    do nothing and continue to spend  real money to do the sorting that the CPU could be doing for free
2.    spend thousands of dollars of real money to add an ORDER BY (or an appropriate index to force a sort).

I can hear you scoffing now, “Sure, thousands of dollars!” Picture it with this barely exaggerated process –

We start with our line-of-business employees spending their time (remember, that’s real money) adding the suggestion to the feature request system and voting it up (picture Microsoft Connect for your business applications).  We just spent more real money getting the bug typed in and voted up than we would have spent with the initial sort, but I digress.

Next, we have the bug validation team spending real money making the appropriate contacts and follow ups with a decision maker who determines (spending more real money) if this is a legitimate feature request.

Once we have determined that this is a real bug, worthy of company resources (more real money), we proceed to the team (think real money multiplied by Dilbert) that determines which features and applications get worked on by which development teams, and when.

Now that we’ve gotten all the way through the verification, validation, prioritization, and schedulization phase, it’s time for a software development team (there’s that real money multiplier again) to actually begin working the bug – remember that ORDER BY thing we were talking about?  Our software development team is going to go through its own verification, validation, prioritization and schedulization process.

They have to reproduce the bug, find the source code module where they can actually make the fix, fix it (our initial 10-second line of code), then run it through the test deployment process, user acceptance testing, the pre-prod deployment process, the pre-prod testing process, the change request for production process (think about another team here), the production deployment process, and the post-production testing process.

Don't get me wrong, this isn't some process-oriented bureaucracy; this is a kanbanagilescrum .NET development shop with all the modern efficiencies at its disposal.

You probably thought I forgot where I was going with this.  It was a close call there for a second, but the point is still here.  The problem with this isn't that the developer made a simple oversight on a search screen and nobody noticed before production.  That’s understandable, and we all do similar things routinely.  The problem is that there really are developers who intentionally leave all of the sorting to the end user because they want to save the CPU sort cost for the 400 rows.

These decisions not only erode our users’ confidence in both the development team and the application, they consciously make the tradeoff between CPU time and user time – trading measurably large amounts of real money, paid in the form of user time, for infinitesimally small bits of CPU time.

Stop wasting my users’ time!

10 September 2010

Alter Ego

So the other day I'm doing some light reading from Paul Randal about DBCC CHECKDB when I learn that, beginning with SQL 2005, CHECKDB actually performs its checks on an internal database snapshot and not the "real" database.  That was a totally unexpected bit of knowledge, so it really caught my eye.  It's great to have somebody who loves to share these things with the world.  Thanks, Paul!

A little more reading revealed that the internal database snapshot is implemented as an alternate stream of the live database file, and that I have no control over where it's created, so it can cause a nearly full drive to fill up during a CHECKDB (the reason for his article in the first place).

I had never heard of an alternate stream before, but I did find a couple of pretty good explanations that I won't cover here, other than to mention that it's a feature of NTFS that allows multiple data streams to be stored within a single file (the second link has downloadable source code for utilities that manipulate alternate data streams, if you really want to get down in the weeds).

I wanted to see one of these alternate streams in action myself, and discovered that you can view a file's alternate streams by using the /R switch of the DIR command.  Here is DBCC CHECKDB in action, complete with a peek at the alternate stream in the file system.

I'm using the following command to provide directory listings for the  AdventureWorks database:


-- list the AdventureWorks data file, including any alternate streams (/R)
declare @cmd sysname
set @cmd = 'dir /R "C:\AdventureWorks_Data*"'
exec master..xp_cmdshell @cmd


This shows the AdventureWorks_Data.mdf file:


Now run dbcc checkdb('AdventureWorks') and, while that is still running, kick off the DIR command from a separate SSMS window.  This shows the AdventureWorks_Data.mdf file and the alternate stream that is used for the internal database snapshot:


Notice that the DIR command only reported a single file in the summary.  The alternate data stream is completely invisible to most DOS-style commands, and you'll never see one in Windows Explorer.
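
If you want to play with an alternate stream without running CHECKDB, here is a quick sketch using the same xp_cmdshell approach as above.  It assumes xp_cmdshell is enabled and that C:\temp exists; the file and stream names are made up:

-- create a scratch file, tuck some text into an alternate stream named "secret",
-- read the stream back, then list the file along with its streams
exec master..xp_cmdshell 'echo plain text > C:\temp\ads_demo.txt'
exec master..xp_cmdshell 'echo hidden text > C:\temp\ads_demo.txt:secret'
exec master..xp_cmdshell 'more < C:\temp\ads_demo.txt:secret'
exec master..xp_cmdshell 'dir /R C:\temp\ads_demo.txt'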

I didn't really expect to do so much digging after a little reading about DBCC CHECKDB, but it turns out there was a lot to learn, and I couldn't resist diving into the mysterious internal database snapshot and its alternate data stream.

Enjoy!

07 September 2010

Feeling the Love

DBAs @ Midnight just kicked off its second season. If, like me, you missed the live season opener, you'll have to settle for the preview until the recording is available.

In the meantime, they are forming a zombie gang and soliciting public displays of affection. I don't know if I'd make a good zombie gangster but I sure do enjoy the show, so here's my PDA:



Have a great second season Sean and Jen!

06 September 2010

That's Virtually Twice as Fast

We do much of our software testing and validation on virtual machines. We happen to be using VMWare ESX as the host and are testing with Windows XP SP3 as the guest.

We were getting pretty poor launch performance from our application as compared to a physical machine. I'll measure launch time in Task Manager Grid Squares (TMGS). Think of them like story points for performance monitoring. They just represent some arbitrary amount of time (whatever that is... 9 seconds per square?).

We have a main application, a supporting GIS application and a 3rd application for moving things around the network, and it was taking as many as 10 Task Manager grid squares for the three applications to launch. Here is what the launch sequence looks like in Task Manager:

Here is the key for the image:
1. Main application launch
2. GIS application launch -- almost 1 minute of 100% CPU -- OUCH!
3. Network search
4. Opening the first search result

To help with the launch speed, we added a virtual processor and a virtual GB of RAM to the virtual machines. This reduced our launch time from 10 Task Manager grid squares to just over 6.


You can see that opening a search result wasn't helped much because it is rather network-intensive. Main application launch is a mix of network and processor, and didn't fare much better with two processors than it did before the upgrade.

The real benefit we saw was the large gain in both the launch of the GIS application and the processing of search results, both of which are processor-intensive. The times for those were cut by more than half.

03 September 2010

Chew Once, Swallow Twice

I've been working on a legacy system that's been sitting on the shelf for about a year. Most of the system is in VB.Net, so I've been writing a fair amount of "VB#" as habit dictates that I end each line with a semi-colon -- it works in T-SQL and C#.

A workmate did find an import/export routine written in C#, and it had a great catch statement I thought I'd share. First it catches any error and swallows it, a la ON ERROR RESUME NEXT. But, just for good measure, it also checks the error message so it can execute... a comment!

catch (Exception ex)
{
    if (ex.Message.Contains("Violation of PRIMARY KEY"))
    {
        //ok
    }
}


I guess any data integrity error worth suppressing is worth over-suppressing. I've probably done the same type of thing a dozen times, so it's nice to have a little fun with somebody else's forgotten bit of error control.
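
If you actually do want to ignore just the duplicate-key case, the same idea can be expressed much more narrowly.  Here is a T-SQL sketch (with a hypothetical temp table) that suppresses only error 2627 -- violation of a PRIMARY KEY constraint -- and reports anything else:

-- swallow only duplicate-key errors; anything else gets raised
create table #ImportTarget (id int primary key, name varchar(50))
insert into #ImportTarget values (1, 'first')

begin try
    insert into #ImportTarget values (1, 'duplicate')  -- hits the primary key
end try
begin catch
    if error_number() <> 2627
        raiserror('Unexpected error during import', 16, 1)
    -- else: ok (really this time)
end catch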

01 September 2010

What's in a name?

The software team I've been working with had a “string or binary data would be truncated” error where we were trying to copy a 100-character “buyer name” column from one table into a 50-character column in another table. We also had a "seller name" column with the potential for the same error, and we fixed that one up, too.

I took a quick peek at the data to see how the name lengths were distributed in each table. Both names had an average length in the mid- to upper-teens. Virtually none of the names were longer than 50 characters (probably why the bug went unnoticed for years).
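
The peek itself was nothing fancy -- something along these lines, with the table and column names made up for illustration:

-- rough length profile for a name column (hypothetical table and column)
select
    avg(len(buyer_name)) as avg_length,
    max(len(buyer_name)) as max_length,
    sum(case when len(buyer_name) > 50 then 1 else 0 end) as longer_than_50
from dbo.PurchaseRecords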

Here are a couple visuals of the distribution:


Since this was a reasonably large sample (more than 100,000 names in each column), it probably makes a good guideline for full name column sizing.

As is usual with software defects, the standard “we inherited it” rule applies in this case. That brings up a quick, unrelated observation: I rarely meet developers who claim to write bugs, just folks who inherit them. I'll have to write more on that later.

At any rate, it seems that bug writers like myself are so prolific that it doesn't take very many of us to keep all the real developers busy.

16 February 2010

Are you a reader or a writer?

I stumbled across a query that gives the percentage of reads versus writes and thought it might be interesting to see how some of my servers stack up.

I gave the query a couple of tweaks so it returns the percentages directly, as well as the total reads and writes. I needed the totals so that I could calculate read and write percentages for a group of servers.

select
    sum(user_seeks + user_scans + user_lookups) as total_reads,
    sum(user_updates) as total_writes,
    cast(
        cast(sum(user_seeks + user_scans + user_lookups) as decimal) * 100.0
        / cast(sum(user_updates) + sum(user_seeks + user_scans + user_lookups) as decimal)
        as decimal(5, 2)) as read_percent,
    cast(
        cast(sum(user_updates) as decimal) * 100
        / cast(sum(user_updates) + sum(user_seeks + user_scans + user_lookups) as decimal)
        as decimal(5, 2)) as write_percent
from sys.dm_db_index_usage_stats


I recycled the original query from http://statisticsio.com/Home/tabid/36/articleType/ArticleView/articleId/324/Is-8020-a-90rsquos-Estimate.aspx

After running this as a multi-server query against a group of 12 instances, I put the query results into a spreadsheet so I could see how individual servers stacked up against the group.
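
The group roll-up is just a weighted aggregate of the per-instance totals.  If you would rather do it in T-SQL than in a spreadsheet, something like this works; the #instance_stats table is a stand-in for the collected multi-server results (the two sample rows are copied from the table below):

-- group-level read percentage from per-instance totals
create table #instance_stats (instance sysname, total_reads bigint, total_writes bigint)
insert into #instance_stats values ('instance01', 5299601, 1961846)
insert into #instance_stats values ('instance02', 35440282, 3570652)

select
    sum(total_reads) as group_reads,
    sum(total_writes) as group_writes,
    cast(sum(total_reads) * 100.0
         / (sum(total_reads) + sum(total_writes)) as decimal(5, 2)) as group_read_percent
from #instance_stats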

There was quite a bit of variation, but all the instances had more reads than writes. The most read-skewed instance in the group did nearly 40 reads for each write, while the least read-skewed server only did 4 reads for every 3 writes.

As a group, these instances were reading over writing at a 5-to-1 ratio: the 80/20 rule we hear so much about is alive and well here.

Instance              Reads   Read Pct        Writes   Write Pct   Read/Write Ratio
instance01        5,299,601        73%     1,961,846         27%   2.7:1
instance02       35,440,282        91%     3,570,652          9%   9.93:1
instance03        9,083,137        79%     2,475,755         21%   3.67:1
instance04       27,181,812        64%    15,568,632         36%   1.75:1
instance05       11,438,327        72%     4,384,756         28%   2.61:1
instance06      391,095,544        94%    27,148,899          6%   14.41:1
instance07      295,659,557        57%   220,249,874         43%   1.34:1
instance08      488,649,390        97%    13,016,100          3%   37.54:1
instance09       47,661,242        87%     6,896,585         13%   6.91:1
instance10       63,179,128        80%    15,418,897         20%   4.1:1
instance11      195,456,589        87%    28,239,952         13%   6.92:1
instance12      747,410,692        86%   123,282,598         14%   6.06:1
Group Totals  2,317,555,301        83%   462,214,546         17%   5.01:1