November 09, 2006

How NOT to move to a new database server

We bought a new server (a shiny Dell 2950 with Windows 2003 x64 and SQL Server 2005) back in May with the idea of moving Shelby and our other SQL databases to it in a planned, smooth way in June or July. We needed to free up a server for our new Rez West office so we figured it was a great opportunity to upgrade the server hardware and maybe speed up Shelby a good bit.

Due to a number of delays in the Rez West office completion, we had plenty of time to play with the new server. As June melted into July, we systematically moved our non-Shelby databases to the new server and started configuring it for Shelby.

When we updated to Shelby 5.6 in early August, massive performance problems followed. After some investigation, we came to the conclusion that 5.6 placed enough additional resource demands on the server that the old guy, a Dell 2500, wasn't getting the job done any more.

Before we could plan a cutover, the real adventure began. The old server crashed on a Wednesday morning, a time of the week when Shelby is typically under its heaviest load. In response, I divided our IT staff into two teams: one team would work on repairing the old server, while the second team would work on moving to the new server. We would go with whichever team finished first. Along the way, we discovered that Shelby 5.6 still wasn't compatible with SQL Server 2005, despite the fact that we had been told that the May 2006 update would be compatible. So we finished restoring the old server (the crash had been caused by Windows registry corruption) and after a 6-hour outage we had no choice but to stay there - as painfully slow as it was.

By early October, we were faced with a dilemma. We had to move off the old server to free it up for Rez West (not to mention the fact that it was so slow it was nearly unusable), but we still didn't have a SQL 2005-compatible version from Shelby and we already had databases running on SQL 2005 - there was no going back. That's when we got the idea to use the free version of VMware to create a virtual server for Shelby that would be configured with SQL 2000 and run on the new 2950 server. Our thought was to run this configuration temporarily until Shelby's October update, which was promised to be SQL 2005-compatible. Seemed like a brilliant idea at the time. Little did we know.

We have been on that configuration for the last five weeks, suffering from massive instability the entire time, including another server crash causing a 12-hour unplanned outage two weeks ago. The source of our grief was SQL 2000, not Shelby, but the user experience was that Shelby would randomly get disconnected from its database and start throwing errors. Shelby's error handling is an awful mess of infinite loops, so the user's only recourse was to bring up Task Manager, shut down the client, and restart it. We spent hours on the phone with Microsoft tech support. They had us apply all kinds of updates and patches, including an unreleased hotfix. We also tried a number of changes to our VMware configuration, some of which actually made the problem worse. Each time, we and our user community hoped that stability would finally be achieved, only to be bitterly disappointed. Not only was this frustrating for our users, but it was embarrassing for us. Though they were characteristically gracious, the users must have been thinking, "Does our IT Dept. know what it's doing?"

Our goal for this time was simply to hang on until we could install the SQL 2005-compatible version of Shelby, which was released last week. Every day the database was flaky and users were inconvenienced. Then we had something go bad last week in Bank Reconciliation, preventing our Finance folks from reconciling October. Shelby tech support concluded there was a damaged software component, which would be reinstalled by updating to the latest software version. So we scheduled the update for Sunday afternoon. Unfortunately, the Shelby update procedure didn't work. We fought that all day Monday and finally got all desktops updated.

The last step was to install new backup software because the version of Veritas we were running doesn't support Windows 2003 Server x64. To address that, we installed a trial version of ARCserve yesterday.

We finally had all the pieces in place to cut over to SQL Server 2005, running directly on our 2950. Then today the Bank Rec thing started crashing SQL 2000 every time we ran it. Since SQL 2000 was crashing and had been completely unstable for five weeks, we figured we would go ahead and cut over with yet another unplanned outage.

So there you have a tale of how NOT to move to a new database server. By the way, so far the new setup seems stable and faster by 50-60%. Now we're praying it's really over.
