When is a critical IO error not a critical IO error?

When it succeeds on the second try.

Anyone who’s done any research at all on database corruption and IO problems has probably seen an example of the 823 and 824 errors.

Error 823 is a physical IO error. It means that, for some reason, the OS was incapable of completing the requested IO. It may mean that the disk has failed, the disk is missing, a filter driver is misbehaving, or a number of other things. On SQL 2005 it looks something like this:

Error: 823, Severity: 24, State: 2.
The operating system returned error 21(The device is not ready.) to SQL Server during a read at offset 0x0000d2364e2000 in file ‘D:\Data\SomeDB.mdf’. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Error 823 is classified as a severity 24, the second highest severity error that SQL has. As the error message says, it’s a very serious condition.

Error 824 is a logical IO error. It means that the page requested was read from disk successfully, but there was something wrong with the page. The page header may have invalid values in it, the checksum may be incorrect, the torn page bits may be incorrect, etc. On SQL 2005, it looks something like this:


Identifying inaccurate statistics

I wrote previously about statistics, what they’re needed for and I briefly mentioned what can happen when they’re inaccurate.

One question I’ve seen asked a few times is on how to identify stats that are inaccurate, what DMV to use. The bad new is that there is no DMV that identifies inaccurate statistics. The Stats_Date function can be used to see when the stats were last updated, but that doesn’t say if that was too long ago ort not. If a table is readonly, or is changed twice a year, statistics that are four months old are perfectly valid.

The rowmodcnt column in sysindexes can give an approximate count of the number of rows that have changed since the stats were last updated, although sysindexes is deprecated and will disappear in the next version of SQL and the rowmodcnt column is no longer completely accurate. The rowmodcnt however is just a count of changes. It gives no indication of whether that number is too high and causing a problem. In some cases 50% of the table may have changed before there’s a problem with queries. In other cases 5% is enough to cause problems.


Recovery model and transaction logs

Or “Remind me again why we’re doing these backups in the first place

If there’s one topic that comes up again and again on the forums, it has to be the problem of full transaction logs, usually followed by the discovery that the DB is in full recovery mode and no log backups are running.

Quite often, someone will suggest to just truncate the log and shrink it. It’s a dangerous suggestion, not so much for what is said, but for what is not said. To understand why, requires a little background information. First, a look at recovery models and how they affect the transaction log.

Recovery modes

I’m going to ignore bulk-logged mode for now, mainly it’s less used than the other two and I’m not completely comfortable with how it works. I’m also going to ignore replication and database mirroring, as they complicate the issue.

Regardless of the recovery model that the database is in, transactions are logged into the transaction log before they are considered complete. The entries in the transaction log are considered active until all the data pages that were modified by the transaction have been written to disk. The two processes that write pages to disk are the lazy writer and the checkpoint process. Once all the pages that the transaction have been written to disk, the log records for that transaction are marked as inactive.


What is a deadlock

I’ve run into a fair bit of confusion in the forums recently on this, so I thought I’d quickly explain what a deadlock is, and how it differs from blocking and locks.


Locking is a normal part of SQL’s operations. When a row is read by a query, a lock is placed on that row to prevent the row from changing during the read. When a row is modified a lock is placed on the row to prevent any queries reading the value as it is changing.

Locks can occur at the row level, at the page level or at the table level. As long as they are short lived, they are not a problem in and of themselves


Token and perm user store growth

I wrote about TokenandPermUserStore issue a while back. I revisited the issue recently on 2 production servers to see if it was still an issue on a fairly recent SQL build.

The two servers in question run the same application, but with a vast difference in usage.

Server 1 – Itanium, 12 processors, 48 GB memory, build 3042 (SP2) running on Windows Server 2003 DataCenter SP1
Server 2 – x64, 16 processors, 64 GB memory, build 3215 (SP2 CU5) running on Windows Server 2003 DataCenter SP2

On both servers there’s a job that checks the size of the token cache and flushes it if it’s above 750 MB. The size was set because when I first set up the jobs, I noticed that the CMEMTHREAD waits were getting noticeable if the cache got above 1 GB.

On both servers, I set up a job that would log the size of the cache to a table. On server 1, that ran every 15 min. On server 2 it ran every 5 min. The results were a little frightening.


On recoverability

Had a lovely situation this morning.

I can’t go into details, but essentially a database had no off-site backups, no tape backups and the only full backups were on disk. Then the SAN glitched and both the data files and backups were corrupted.

I’m probably preaching to the converted, but I don’t think there are many more important things on a production server than ensuring good backups.

However, having the backups is not enough. Can they be restored? If there’s a complete drive failure of all drives connected to a server, can the databases be recovered?

When the pawpaw hits the fan (as the local saying goes), that’s the only thing that matters.

Memory and SQL 2005 SP2

Or “Why are all my processes waiting on memory. There’s tonnes of memory

It’s probably not new news that there was a fairly nasty memory-related bug in SQL 2005 RTM and SP1 that was related to the relaxing of limits on cache size. Specifically the TokenAndPermUserStore cache.

On systems with large amounts of memory (20GB+) and frequent ad-hoc queries or significant usage of dynamic SQL, the cache can grow quite large, and by quite large I’m talking upwards of 2GB. I think I saw the cache at close on 8GB at one time on one of my servers.

The problem with this is that is takes quite a bit of time to search through several GB of cache to find the required tokens. Making matters worse, access to that cache is synchronised, so only a single thread may have access at a time.

The main symptom of that problem is lots of CMEMTHREAD waits without an apparent wait resource and a higher than normal CPU usage.

But that problem was fixed in SP2 with a change to the caching behaviour. Right?


Shrinking databases

Or “Order the pages, shuffle the pages.

Do you ever shrink your data files? I’ve personally never been fond of it, especially for production databases. After all, they’ll simply have to grow again and, especially if the data files are on independent drives, there’s little difference between space free on the drive or space free in the data file. There is also a more insidious reason for not shrinking a database.

Let’s take a very simple database (The creation code is at the end of the post). I have two tables, both with a tens of thousands of rows. Both tables have a clustered index on a uniqueidentifier and are heavily fragmented (>99%).

DBCC SHOWCONTIG(LargeTable1) -- 99.30%
DBCC SHOWCONTIG(LargeTable2) -- 99.21%

To fix the fragmentation, rebuild both indexes. That fixes the fragmentation, but now the data file is using almost twice the space necessary.

DBCC ShowFileStats -- 3363 extents total, 1697 used (215 MB total, 106 MB free)

So, shrink the database to release the wasted space back to the OS

DBCC SHRINKDATABASE (TestingShrink, 10) -- Shrink to 10% free

That’s fixed the space issue. But now, have another look at those two indexes that were just rebuilt.