Corruption

Repairing a replication subscriber

It’s a question which has come up a couple of times. If a subscriber of a transactional replication publication becomes corrupt, is it safe to run CheckDB with the REPAIR_ALLOW_DATA_LOSS option?

The theory is, since the subscriber is a copy of another database, allowing CheckDB to discard data in the process of repairing won’t actually cause data loss (the data is still there in the publisher).

As with many things, it’s partially true; however, there’s a little more to it than just that.

To see why, let’s set up some replication, intentionally corrupt some data in the subscriber, and run a repair.

First, the source DB. I’m going to keep it really simple so that we can see the effects. A database with a single table called Customers, populated with 100 000 rows via SQLDataGenerator. The filler column simulates other columns that would be in a real table.

CREATE TABLE Customers (
    CustomerID INT IDENTITY PRIMARY KEY,
    FirstName VARCHAR(50),
    Surname VARCHAR(50),
    AccountNumber CHAR(15) UNIQUE,
    AccountStatus VARCHAR(25),
    Filler CHAR(200) DEFAULT ''
)

Once populated, we set up a transactional replication publication with default settings, and for simplicity have the subscriber be a second database on the same instance. If we query the table on the subscriber, we have exactly the same number of rows as the publisher has.
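A quick sanity check, since both databases are on the same instance (the publication database name, ReplicationPublisher, is just an assumption for illustration):

-- Row counts should match between publisher and subscriber
SELECT COUNT(*) AS PublisherRows FROM ReplicationPublisher.dbo.Customers;
SELECT COUNT(*) AS SubscriberRows FROM ReplicationSubscriber.dbo.Customers;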

[Screenshot: ReplicationSubscriber]

To simulate a misbehaving IO subsystem, I’m going to take the subscription DB offline, open the mdf in a hex editor, scribble on top of a few pages in the user table, then bring the DB back online. A checkDB returns a couple of pages of errors ending with:

CHECKDB found 0 allocation errors and 16 consistency errors in table ‘Customers’ (object ID 421576540).

CHECKDB found 0 allocation errors and 18 consistency errors in database ‘ReplicationSubscriber’.

repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (ReplicationSubscriber).

Minimum level to repair is REPAIR_ALLOW_DATA_LOSS and, based on the error messages, doing so will deallocate five pages. Let’s run the repair and then see what’s left in the table.
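The repair run itself looks roughly like this (the database has to be in single-user mode for a repair):

ALTER DATABASE ReplicationSubscriber SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

DBCC CHECKDB ('ReplicationSubscriber', REPAIR_ALLOW_DATA_LOSS) WITH NO_INFOMSGS, ALL_ERRORMSGS;

ALTER DATABASE ReplicationSubscriber SET MULTI_USER;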

[Screenshot: ReplicationSubscriber After Repair]

We’ve lost 155 rows from the Customers table. The replication publisher still has them, but the subscriber now does not. So what happens if someone makes some data modifications on the publisher and changes one of the rows that was lost at the subscriber?

UPDATE dbo.Customers
SET AccountStatus = 'Closed'
WHERE CustomerID = 48700;
GO

INSERT INTO dbo.Customers
    (FirstName,
    Surname,
    AccountNumber,
    AccountStatus
    )
VALUES  ('Mary',
    'White',
    '4985563214AN',
    'Pending'
    )

Give it a few seconds to replicate and then let’s check the table in the subscriber. The new customer isn’t there… If we go and have a look at the replication, it’s throwing errors.

[Screenshot: BrokenReplication]

“The row was not found at the Subscriber when applying the replicated command”, which is quite true; the row’s not there any longer.

What’s happened is that one of the rows updated on the publisher was lost when CheckDB repaired the database. The rows lost due to the repair weren’t automatically fetched from the publisher, there’s no mechanism in transactional replication for that to happen. Hence when the row was updated on the publisher, the update statement was replicated and the row on the subscriber couldn’t be found, causing the replication to fail, retry and fail again. It will keep failing until it is either reinitialised or the missing row is manually added to the subscriber.

What we needed to do, to prevent this outcome, was to sync the subscriber with the publisher after CheckDB deallocated the pages, either manually or with something like SQLDataCompare, and explicitly add back the rows which were on the deallocated pages. The primary key values must be kept the same, as that’s what replication uses to identify which rows in the subscriber need to be updated.
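Since in this example the subscriber sits on the same instance as the publisher, the missing rows could be pulled back with something along these lines (a sketch only; ReplicationPublisher is an assumed name for the publication database):

-- Re-insert the rows lost by the repair, keeping the original CustomerID values
-- so that replicated updates and deletes can find them again
SET IDENTITY_INSERT ReplicationSubscriber.dbo.Customers ON;

INSERT INTO ReplicationSubscriber.dbo.Customers
    (CustomerID, FirstName, Surname, AccountNumber, AccountStatus, Filler)
SELECT p.CustomerID, p.FirstName, p.Surname, p.AccountNumber, p.AccountStatus, p.Filler
FROM ReplicationPublisher.dbo.Customers AS p
WHERE NOT EXISTS (SELECT 1
                  FROM ReplicationSubscriber.dbo.Customers AS s
                  WHERE s.CustomerID = p.CustomerID);

SET IDENTITY_INSERT ReplicationSubscriber.dbo.Customers OFF;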

So to go back to the original question, yes it’s probably fine to run CheckDB with repair on a replication subscription, providing that the data deallocated as part of the repair is manually synced from the publisher afterwards. If it isn’t, the replication will break as soon as one of the rows affected by the repair is updated or deleted at the publisher.

Quest Webcast: Dos and Don’ts of Database corruption

I was guest presenter on last week’s Quest Pain of the Week presentation. The presentation is available for download from Quest. This is a roundup of the questions from the presentation. (I have edited some of the questions for spelling, grammar and readability)

Q: Why do you have to wait for a maintenance window to take a backup? You can back up while the database is being used.

You don’t have to wait, and with a 24×7 system you obviously can’t. But full backups do have an impact even though the database can be in use at the time. They’re IO intensive. If you’re going to do backups while the system is in use, make sure that the additional IO load doesn’t adversely affect user performance, or look into things like SAN-level snapshot backups.

Q: If you partition your data, can you only backup the latest partition?

Yes, however be mindful of the requirements for restoring from file/filegroup backups. In full or bulk-logged recovery, if the older filegroups are not read-only, you’ll need log backups covering the entire span from the oldest of the file/filegroup backups you use to the newest.

If the partitions with older data are read-only, then this is very easy to do.

For more information see http://msdn.microsoft.com/en-us/library/ms177425.aspx
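As an illustration (database and filegroup names are made up), backing up just the filegroup that holds the newest partition looks like this:

-- Back up only the filegroup containing the current partition
BACKUP DATABASE PartitionedDB
    FILEGROUP = 'Partition_Current'
    TO DISK = 'D:\Backups\PartitionedDB_Partition_Current.bak';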

Q: What if you inherit a database that fails the integrity check every day and the backups have the same issue so you don’t know when the integrity check originally failed?

That just means that restoring from backup (full, file/filegroup or page) is not an option for fixing the corruption. It probably also means that there will be no avoiding data loss.

Q: The 825 errors – these are reported in the logs too aren’t they? I can review to see if these are occurring, even if the alert is not set?

Yes they are (and I don’t know why I thought during the presentation that they aren’t; I’ve written a blog post on them myself). https://www.sqlinthewild.co.za/index.php/2008/12/06/when-is-a-critical-io-error-not-a-critical-io-error/

Msg 825, Level 10, State 2, Line 1.
A read of the file ‘D:\Data\SomeDB.mdf’ at offset 0x00000020e24000 succeeded after failing 1 time(s) with error: incorrect checksum (expected: 0x7532c420; actual: 0x320e4240). Additional messages in the SQL Server error log and system event log may provide more detail. This error condition threatens database integrity and must be corrected. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
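If you do want to be notified when one occurs, a SQL Agent alert on error 825 is easy to set up. A sketch (the operator name is assumed to already exist):

EXEC msdb.dbo.sp_add_alert
    @name = N'Error 825 - read retry',
    @message_id = 825,
    @include_event_description_in = 1;

EXEC msdb.dbo.sp_add_notification
    @alert_name = N'Error 825 - read retry',
    @operator_name = N'DBA Team',
    @notification_method = 1;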

Q: Can you use Resource Governor for a large DB backup to keep the DB’s performance at an acceptable level?

Resource governor only limits CPU and memory, not IO.

Q: If you have data loss, shouldn’t you be doing a restore?

Maybe. It could be (due to time to restore or other constraints) that the fastest and acceptable solution is to repair (CheckTable or CheckFileGroup) and then sync in missing data later.

Q: I just had this happen on an archive database that I can play with. The problem occurred with LOBs. Since repair_allow_data_loss removes the entire page that the corruption is on, how does this work with LOBs that are not stored on the same page as the record?

The entire LOB chain (all LOB pages for the row) will be deallocated and the data/index record which is the parent will be deleted. (Ref: Chapter 11 of SQL Server 2008 Internals by Kalen Delaney et al. Chapter 11 written by Paul Randal)

Q: Is it possible to create a corrupted database in a test lab to get an 825 error?

Error 825 is a hard one to generate. To get it, you need an IO to fail (or the retrieved page to be damaged) and then, on the immediate retry, for the IO to succeed and the page to be correct. That’s hard to fake. You’d need an IO filter driver coded to intermittently mess up IOs.

For 824, either use a hex editor to edit pages of a test database, or download and restore Paul Randal’s pre-corrupted databases.

For 823, create a database with a secondary file on a flash drive. Create a large table in the filegroup on the flash drive, checkpoint, DBCC dropcleanbuffers, then pull the flash drive out and query the table. If that doesn’t throw the error (meaning the pages are still in memory), update a row and checkpoint.

It goes without saying that neither of these should ever be done on a server!
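For completeness, a rough sketch of that 823 test rig (drive letters, names and row counts are made up; again, test machines only):

-- Secondary filegroup on the flash drive (assumed to be E:)
CREATE DATABASE CorruptionTest
ON PRIMARY (NAME = CorruptionTest_Data, FILENAME = 'D:\Data\CorruptionTest.mdf'),
FILEGROUP FlashFG (NAME = CorruptionTest_Flash, FILENAME = 'E:\CorruptionTest_Flash.ndf')
LOG ON (NAME = CorruptionTest_Log, FILENAME = 'D:\Data\CorruptionTest_log.ldf');
GO

USE CorruptionTest;
GO

CREATE TABLE dbo.LargeTable (ID INT IDENTITY PRIMARY KEY, Filler CHAR(4000) DEFAULT '') ON FlashFG;
GO

-- Repeat the insert enough times to span plenty of pages
INSERT INTO dbo.LargeTable DEFAULT VALUES;
GO 10000

CHECKPOINT;
DBCC DROPCLEANBUFFERS;

-- Pull the flash drive out here, then:
SELECT COUNT(*) FROM dbo.LargeTable;
-- If that still works from cache, update a row, CHECKPOINT, and query again.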

Q: In the case of a suspect (not recoverable) database, can we update the system table sysdatabases and change the status to bring it back online? (This solution used to work in SQL 2000.)

That is a SQL 2000 solution. In SQL 2005 and above, sysdatabases isn’t a table and the system tables cannot be updated as was possible in SQL 2000.

Fortunately there’s a fully documented and supported alternative – ALTER DATABASE <Database name> SET EMERGENCY. Once in emergency mode you can run CheckDB with repair_allow_data_loss and then bring the database online (assuming the repair succeeds), or extract data if the repair fails.
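The sequence looks something like this (the database name is illustrative):

ALTER DATABASE SuspectDB SET EMERGENCY;
ALTER DATABASE SuspectDB SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

DBCC CHECKDB ('SuspectDB', REPAIR_ALLOW_DATA_LOSS) WITH NO_INFOMSGS, ALL_ERRORMSGS;

-- If the repair succeeds the database comes back online; return it to multi-user
ALTER DATABASE SuspectDB SET MULTI_USER;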

Q: Is Check DB the only tool available?

To check database integrity, yes.

Q: If you’ve got a foreign key, you can restore a table, but will integrity not be broken?

Relational and possibly transactional integrity may be broken. I’d guess that’s one reason table-level restores aren’t available natively in the product.

Q: Checksum on database backup – this is not an option in a maintenance plan – true? To enable this, do I have to script all database backups?

Correct, it’s not an option on maintenance plans. Personally I’m not a fan of maintenance plans for anything other than trivial environments.
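Scripting it is straightforward though; a backup with checksum might look like this (the path is illustrative):

BACKUP DATABASE SomeDB
    TO DISK = 'D:\Backups\SomeDB.bak'
    WITH CHECKSUM, INIT;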

Q: We use some SQL Express; do all these items work in Express? I had one where the hard drive died and I was able to get it back from suspect, at which point I did a full backup and restored it elsewhere on a full SQL Server to fix it.

The only limitation for CheckDB on Express is no parallelism.

File/filegroup restores can be done on Express, though they are offline restores.

Page restores can be done on Express, though they are offline restores.
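For illustration, a single-page restore looks something like this (the page ID and paths are made up); on Express the database is offline while it runs:

RESTORE DATABASE SomeDB
    PAGE = '1:153'
    FROM DISK = 'D:\Backups\SomeDB.bak'
    WITH NORECOVERY;

-- Then restore the subsequent log backups, finishing with a tail-log backup,
-- and complete the restore WITH RECOVERY.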


Thanks to Paul Randal and Robert Davis for assistance with a couple of these answers.

Does a missing data file send a database suspect?

Short answer: No

I keep seeing this come up on various forums when people ask what makes a database go suspect or what could have caused their database to be marked suspect. I can almost guarantee when that question comes up, one or more people will answer ‘missing files’. That may have been true on earlier versions (I don’t have a SQL 2000 or earlier instance to play with), but it is not true in the current versions of SQL. A missing file may result in the database being inaccessible (depending what file), but it will not result in a suspect database.

Let’s prove it.

I’m going to create a database with three files, two of which are in the primary filegroup, with one user table on each filegroup. (T-SQL code at the end) Before each test I’ll begin a transaction, modify data in both tables, shut SQL down so that there’s an uncommitted transaction in the log (database cannot be cleanly shut down), then rename a file before restarting SQL and see what happens.
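A rough sketch of that layout (file names and paths are illustrative; the full script is at the end of the post):

CREATE DATABASE FileTest
ON PRIMARY
    (NAME = FileTest1, FILENAME = 'D:\Data\FileTest1.mdf'),
    (NAME = FileTest2, FILENAME = 'D:\Data\FileTest2.ndf'),
FILEGROUP FG_Secondary
    (NAME = FileTest3, FILENAME = 'D:\Data\FileTest3.ndf')
LOG ON
    (NAME = FileTest_log, FILENAME = 'D:\Data\FileTest_log.ldf');
GO

USE FileTest;
GO

CREATE TABLE dbo.TableOnPrimary (ID INT IDENTITY PRIMARY KEY, Filler CHAR(100)) ON [PRIMARY];
CREATE TABLE dbo.TableOnSecondary (ID INT IDENTITY PRIMARY KEY, Filler CHAR(100)) ON FG_Secondary;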

I’m not going to play with the transaction log. That I’ve done before. In SQL 2005/2008, if the transaction log is missing and the database was cleanly shut down, SQL will recreate it. If the transaction log is missing and the database was not cleanly shut down, the database goes into the RECOVERY_PENDING state, so no SUSPECT here.

Let’s try the file in the secondary filegroup first.

(more…)

Corruption in sys.dependencies

The second type of corruption I want to look at is that of missing references in sys.sql_dependencies. This typically appears on SQL 2005 in a database upgraded from SQL 2000.

On SQL 2000 it was possible to make direct updates to the system tables and, in my opinion, it was done too often without the DBAs realising the long-term consequences.

As an interesting aside, early in 2005 I attended a series of pre-launch courses on SQL 2005. Among the other attendees was a senior DBA from one of our large banks. When he heard that 2005 hid the system tables and that it was no longer possible to update them, he went on a 10 minute rant about how MS was making his life impossible and removing essential features, etc, etc. It turned out that he did direct updates to drop users, drop logins, link logins to users, drop columns from tables and several other things. He absolutely would not accept that there were other ways to do those tasks and that modifying the system tables was risky. I’d hate to see the state of that database…

Back to the point…

(more…)

Stats blob corruptions

There have been a couple of odd types of corruption that I’ve been seeing on the forums. I want to take the time to go into some detail, because the fix is possibly not obvious. The first one that I want to look at is corruption of a stats blob.

These are specifically SQL 2000 errors, I don’t know what the equivalent errors in SQL 2005 look like, if there are equivalent errors.

The errors that indicate corruption in the stats blob look like this.

Server: Msg 8929, Level 16, State 1, Line 4
Object ID 2: Errors found in text ID 153175064576 owned by data record identified by RID = (1:526:13) id = 238623893 and indid = 1.
Server: Msg 8929, Level 16, State 1, Line 4
Object ID 2: Errors found in text ID 153175195648 owned by data record identified by RID = (1:526:15) id = 50815243 and indid = 3.

The things to note are the Object ID (2 being sysindexes in SQL 2000) and the reference to an indid in the section that identifies the record.

In SQL 2000, the statistics for an index (or indeed for column statistics) were stored in an image column in sysindexes called statblob (ref). Each index (and statistic) has a row in sysindexes and keeps its statistics data in that image column. This means that the statblob is just as susceptible to corruption as any other LOB column. Unfortunately it’s not as easily fixed. In fact, since it’s a corruption in the system tables, CheckDB will not repair it.
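The id and indid in the error can at least be mapped back to the index or statistic that owns the damaged stats blob (SQL 2000 syntax, using the values from the first error above):

-- Identify the index/statistic that owns the damaged stats blob
SELECT OBJECT_NAME(id) AS TableName, name AS IndexOrStatName, indid
FROM sysindexes
WHERE id = 238623893 AND indid = 1;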

All well and good, so how do we fix these?

(more…)

Deleting the transaction log

Or “Why is my database now inaccessible?”

This is another thing that I’ve seen a few times recently on various forums. A DB’s transaction log fills the disk, probably due to a lack of log backups, and then either SQL is stopped and the log file deleted, or the database is detached and the log file deleted. Either way it’s not a particularly good thing to do.

No, I’ll be blunt, it’s an appallingly bad thing to do. The transaction log is not an optional piece of the database. It’s not like an installation log or activity log where the entries are there just in case anyone’s interested. The transaction log is what SQL uses to ensure, at all times, that the data in the database is transactionally consistent. It’s what gives SQL databases the consistency and durability properties that are required of a relational database engine.

The transaction log has a number of uses within SQL.

  1. In SQL 2000 and before, the inserted and deleted tables were materialised from the transaction log. This was changed in SQL 2005 and they’re now materialised from the row version store in TempDB.
  2. A transaction rollback uses the transaction log to determine what needs to be undone.
  3. In transactional replication, the log reader uses the transaction log to determine what changes need to be replicated.
  4. In SQL 2008, Change Data Capture uses the transaction log to extract the changes made to registered tables.
  5. During restart-recovery, the log is used to ensure that transactions that had committed when the service stopped, but whose changes had not been written to the data file, are replayed, and that transactions that hadn’t completed are rolled back.

For now, I’m just going to consider the last one.

(more…)

Does Repair_Allow_Data_Loss cause data loss?

Short answer: Yes. That’s why the option has that name.

Long answer:

When CheckDB finds corruption it will, as part of the output, include the minimum level of repair that will fix the errors that it found. In SQL 2005 and higher, it will look something like the following.

CHECKDB found 42 allocation errors and 137 consistency errors in database ‘SomeDB’.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (SomeDB).

There are two options for the minimum repair level. REPAIR_REBUILD and REPAIR_ALLOW_DATA_LOSS.

If the minimum level is REPAIR_REBUILD then it means that the corruption was confined to just the nonclustered indexes. This is good. Nonclustered indexes are redundant data and hence can be dropped and recreated (which is pretty much what CheckDB with the REPAIR_REBUILD option does)

If the minimum level is REPAIR_ALLOW_DATA_LOSS then running CheckDB with the REPAIR_ALLOW_DATA_LOSS option will cause data loss. That’s why the option is named like that. How much data it loses depends on the extent of the corruption.

There’s one other case. What about when checkDB doesn’t specify a minimum level?

CHECKDB found 3 allocation errors and 7 consistency errors in database ‘SomeDB’.

That means that there’s damage to one or more of the allocation pages or there’s corruption in the system tables. That kind of corruption isn’t repairable, so no minimum repair level can be specified.

So, in conclusion, will running CheckDB with the REPAIR_ALLOW_DATA_LOSS option cause data loss? If that’s the minimum level that CheckDB specified, then yes it will.

For more info, see Paul Randal’s blog and an article that I wrote at SQL Server Central.

Edit: There is one case where the repair level repair_allow_data_loss is required, but no data will be lost. This is the case of the incorrect PFS page (see my article at SQLServerCentral for more info). If that is the only error, repair_allow_data_loss will be the minimum level, but no data will actually be lost.

When is a critical IO error not a critical IO error?

When it succeeds on the second try.

Anyone who’s done any research at all on database corruption and IO problems has probably seen an example of the 823 and 824 errors.

Error 823 is a physical IO error. It means that, for some reason, the OS was incapable of completing the requested IO. It may mean that the disk has failed, the disk is missing, a filter driver is misbehaving, or a number of other things. On SQL 2005 it looks something like this:

Error: 823, Severity: 24, State: 2.
The operating system returned error 21(The device is not ready.) to SQL Server during a read at offset 0x0000d2364e2000 in file ‘D:\Data\SomeDB.mdf’. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Error 823 is classified as a severity 24, the second highest severity error that SQL has. As the error message says, it’s a very serious condition.

Error 824 is a logical IO error. It means that the page requested was read from disk successfully, but there was something wrong with the page. The page header may have invalid values in it, the checksum may be incorrect, the torn page bits may be incorrect, etc. On SQL 2005, it looks something like this:

(more…)