Unusual errors with Distributed Replay

I’ve come to really like Distributed Replay in the last couple of years. I’ve used it to do a scale test, I’ve used it to test a workload for performance regressions before upgrading. It has a lot of possibilities.

One problem with it is there’s no GUI, so configuring it requires a fair bit of time spent playing around on the command line or in the depths of Component Services.

Another problem is that there aren’t a lot of people using it, so there isn’t a lot of detail on what to do if something goes wrong.

Make that ‘when something goes wrong’, DReplay is a little finicky.

The more common errors are firewall and COM related and they appear in several blog posts, a search for the error code usually turns up a resolution. However there was an error which I ran into recently which turned up exactly 0 results in google. So, to fix that problem, here’s a description of the error, the circumstances and what turned out to be the cause of the error.

To start, the scenario. The preprocess of the trace files had been done, the firewall configured, the COM setting changes made. The services were running, no errors showing in the logs. I’d used DReplay on the machine previously with the same processed trace file and it had worked fine. This time, however…


“Error DReplay   Failed to get client information from controller.”

After turning up nothing in google, I spent half the afternoon checking logs, restarting the services, restarting the computer, checking and rechecking the firewall and the COM settings. I finally went and checked the details of the controller and client services.


Anyone spotted the problem yet? For those who haven’t, let me highlight pieces of those last two screenshots.



Distributed Replay is not instanced. If there are two versions of SQL Server installed on the machine, and this laptop has SQL 2012 and SQL 2014, and the replay controller and client were installed with both, then the service points to the executables from the most recent installation. The older version’s executables are still there however, and they still execute. They throw errors, but the errors do not, in any way, indicate that there’s a version problem.

The above error is what the Replay option of DReplay returns. If the preprocess is run from the incorrect directory, the error returned is “Error DReplay   Object reference not set to an instance of an object.”


The fix is as simple as changing to the correct directory and running the correct version of DReplay, the one that matches the version which the services point to.

T-SQL Tuesday: Disasters don’t just come in huge

imageSo we’re supposed to talk about disasters we’ve had or discuss disaster recovery technologies. I’m going to take a slightly different approach…


I’m in the fortunate position of living in a region of the world that’s relatively free of natural disasters. We’re reasonably geologically stable, the nearest fault lines are the Great Rift Valley and somewhere in the Antarctic ocean. We don’t get tornadoes, we’re a long way from the ocean (and Madagascar partially protects the coast from typhoons and tsunamis)

Given that, and looking at the recent events in Japan and Southern USA, local IT managers might be grateful that their critical infrastructure is here, not there. But that is no reason for complacency, no reason to ignore putting a disaster recovery plan in place.

Major huge national disasters, while they attract a whole lot of attention (and rightly so) are probably not the main cause of IT disasters. IT disasters, for the most part, are likely to be caused more by smaller events like these1

  • A drive in the RAID 5 array fails, and the SAN admin switches out the wrong drive.
  • A SAN controller fails and overwrites a LUN or 2 with binary garbage.
  • The server room floor collapses dumping the SAN 2 floors down into a garage, with the server on top of it.
  • A water leak a floor above the server room results in the UPS getting a shower, and the resultant power surge fries the servers’ main boards
  • A developer with far too many permissions truncates an important table on the production server, thinking he was working on the development environment.
  • The mains power fails but the generators don’t come online because their fuel was drained a day earlier in preparation for maintenance.

Events like those (or probably even more mundane events) are the ones that we need to plan for. Relatively minor disaster that can leave business without critical infrastructure or critical data for hours or days.

You need to plan for the small disasters as well as the large ones. Plan for the dropped table. Plan for two drives failing in the RAID 5 array. Plan for the server’s power supply failing. Plan for the big disasters too, just don’t think that they’re the only thing that endangers your data and your business.

(1) I’ve personally seen 3 of those events happen, I’ve heard from people who have seen two more and I know of a company that’s at risk of one. They’re not just made-up improbably occurrences.

Full backups, the log chain and the COPY_ONLY option.

There have been a large number of posts made regarding the interaction between full database backups and the log chain (like this one). However there still seems to be a lot of confusion around regarding how they interact. I doubt I can clear up the confusion with yet another blog post, but I’ll try.

One that seems to come up now and again is the COPY_ONLY Option on backups. Typically the comment is something like ‘Use the COPY_ONLY option when taking ad-hoc full backups so that you don’t impact the log backups.’ Now we know from the blog posts linked above that full backups don’t ever break the log chain (and I’m not going to run yet more tests to prove it) so what is the copy only option there for?

Books Online states the following regarding the COPY_ONLY option for backups – “Specifies that the backup is a copy-only backup, which does not affect the normal sequence of backups. A copy-only backup is created independently of your regularly scheduled, conventional backups. A copy-only backup does not affect your overall backup and restore procedures for the database.”

Well, that doesn’t clear things up much. It does however go on to say this: “When used with BACKUP DATABASE, the COPY_ONLY option creates a full backup that cannot serve as a differential base. The differential bitmap is not updated, and differential backups behave as if the copy-only backup does not exist. Subsequent differential backups use the most recent conventional full backup as their base.”

So it’s not the log chain that copy only is there to not affect, it’s the differential base. Let’s test and see how it works.


Why is CheckDB rolling back transactions?

I saw this one on a forum recently and it’s a fun question to look into.

When running CheckDB on an in-use database, sometimes messages like the following will appear in the SQL error log

1 transactions rolled back in database ‘Testing’ (8). This is an informational message only. No user action is required.
DBCC CHECKDB (Testing) WITH no_infomsgs executed by theAdmin found 0 errors and repaired 0 errors. Elapsed time: 0 hours 0 minutes 31 seconds.

What’s going on here? Why is checkDB rolling transactions back?

Well, the truth is, it’s not rolling anything back.

When CheckDB runs it has to be able to get a consistent view of the database, one without half-done transactions, without data modifications in progress. On SQL 2000 (and earlier), CheckDB used the transaction log to get that consistent view. It would read through the transaction log to see what had been done to the various structures in the database. In SQL 2005 this was changed and CheckDB now uses the database snapshot feature to get its consistent, point-in-time view of the database.

Database Snapshots

A database snapshot is a read-only, point-in-time copy of a database. It uses a combination of sparse file and copy-on-write technology to reduce space usage and, as a result, it’s usually much smaller than the database it was based on.

When a snapshot is created, any open transactions are rolled back within the context of the database snapshot to generate the consistent, point-in-time copy. This generates messages in the SQL error log stating how many transactions were rolled back


The first thing that CheckDB has to do is to create an internal, hidden database snapshot to get the consistent view of the DB that it needs. It will do this unless it’s run with the TABLOCK option. When the snapshot is created, any open transactions will be rolled back but only in the context of the snapshot, not in the context of the actual database. CheckDB will then run against the snapshot and will drop it once complete.

So the answer to the question of why checkDB is rolling back transactions is a simple one. It isn’t.

Backing up to NUL vs Backup with Truncate only

Or “It’s 10pm, do you know where your log records are?

Something that I’ve started to see recently is the idea that Backup Log … With Truncate Only, which was deprecated in SQL 2005 and gone in SQL 2008 can be simulated by backing the log up to the NUL device (backup log … to disk = ‘Nul’). Is that true? If so, is that a good idea?

To answer those questions, first we need to explore what the two statements do.

Backup Log With Truncate_Only

When this statement is run against a database in full or bulk-logged recovery, SQL discards the inactive log records for that database. From that point until a full database backup is done, the database is in pseudo-simple recovery, the inactive log records get discarded when a checkpoint runs and any attempt to back the log up results in an error message.

Msg 4214, Level 16, State 1, Line 1
BACKUP LOG cannot be performed because there is no current database backup.

So once the log has been truncated it’s exceedingly clear that the log chain is broken.


Why the DMVs are not a replacement for SQL Trace

With the introduction of the DMVs in SQL 2005, there’s now a wealth of information available on every aspect of SQL’s behaviour. It’s possible now to interrogate the SQL procedure cache to find out what queries have been running and how well or badly they are performing. So does all this mean that SQL Trace is now obsolete?

It does not. The execution stats available through sys.dm_exec_query_stats are only retained while the plan for the query is in cache. As soon as the plan is removed from cache (for whatever reason), the query stats for that query will be discarded. There are also a couple of reasons for a query’s plan not going into the cache at all.

Let’s try a quick example. I’m going to use the AdventureWorks database (SQL 2008), because it’s convenient. I’m going to create 4 procedures, run them several times, along with a couple other commands and then compare what a trace shows and what a query of the plan cache shows.


Deleting the transaction log

Or “Why is my database now inaccessible?

This is another things that I’ve seen a few things recently on various forums. A DB’s transaction log fills the disk, probably due to a lack of log backups, and then either SQL is stopped and the log file deleted or the database is detached and the log file deleted. Either way it’s not a particularly good thing to do.

No, I’ll be blunt, it’s a appallingly bad thing to do. The transaction log is not an optional piece of the database. It’s not like an installation log or activity log where the entries are there just in case anyone’s interested. The transaction log is what SQL uses to ensure, at all times, that the data in the database is transactionally consistent. It’s what gives SQL databases the consistency and durability properties that are required from a relational database engine.

The transaction log has a number of uses within SQL.

  1. In SQL 2000 and before, the inserted and deleted tables were materialised from the transaction log. This was changed in SQL 2005 and they’re now materialised from the row version store in TempDB.
  2. A transaction rollback uses the transaction log to determine what needs to be undone.
  3. In transactional replication, the log reader uses the transaction log to determine what changes need to be replicated
  4. Used by Change Data Capture in SQL 2008 to extract changes made to registered tables
  5. During restart-recovery to ensure that transactions that had committed when the service stopped but whose changes had not been written to the data file are replayed and to ensure that transactions that hadn’t completed are rolled back.

For now, I’m just going to consider the last one.


Corruption month

There’s been a rash of database-corruption posts recently. On the two forums that I read, there have so far been around 12 corruption-related problems so far this month. That’s frightening. Many of the problems were only solved either with some data loss or with a lot of work, or both. So what are the two things that most of these cases had in common.

  1. No backups
  2. No corruption checks


Having no backups is pretty inexcusable, whether its because no backups were scheduled or whether it’s because the backups were failing. Paul’s written about the importance of the right backups, and I’m not going to repeat him here.

Just having backups is not sufficient. The backups have to be restorable. A backup that’s not restorable is, in my opinion, worse than no backup at all. That means that the jobs have to be checked to ensure that they did run without error and the backups have to be restored somewhere to test them. Ultimately that’s the only way to be 100%, absolutely sure that a backup is restorable.


High Availability != Backups

Backup solutions are not the same as high availability solutions. They have different purposes and picking the wrong one for the wrong reasons may be disasterous.

A backup solution is a one that creates separate copies of data that can be used to restore the original in the case of data loss. A high availability solution is one designed to keep a system usable and available in the face of hardware (or similar) failures

Solutions like RAID, database mirroring, clustering and SAN replication are forms of high availability. They are not backup solutions and cannot replace good database backups. Having the database on a RAID array will not help if the DB becomes corrupt. SAN replication or database mirroring is of no use if a user somehow deletes critical records.

On the other hand, a backup strategy, however good, is not going to help much when a critical piece of hardware burns out and the business will be losing millions if the app’s not up within 10 minutes. If the app is critical, both backups and an appropriate high availability solution have to be considered. Picking the right high availability solution is complex. Entire books have been written on it.

The only time having no backups at all is an acceptable option is if it’s a system where the data can be completely recreated from another source with no difficulties (example here would be a reporting database that’s replicated directly from a different server)

Here’s another take on backups –

When did CheckDB last run?

If corruption is found in a database it’s very important to know when the corruption started. In the cases where a restore is needed to fix the corruption, knowing which backup is clean can save a great deal of time in restoring databases and checking them.

On SQL 2000, the only way to know was to to go back through the history of the checkDB jobs (everyone is running checkDB regularly, right?) and see when the last successful run was.

On SQL 2005 and higher, when checkDB runs successfully, ie without finding any errors, it writes the date of the last run into the database’s boot page.  Whenever SQL Server then starts that database up, be it during a server start, a database restore or bringing the database online, SQL prints a message into the error log.