On Counts

Or “What’s the fastest way to count the rows?”

It’s a fairly common to need to know the number of rows in a table, the number of rows that match a certain condition or whether or not there are any rows that match a condition. There’s also a number of ways of doing so, some better than others. The problem being that counting is not a cheap operation, especially on big tables. It’s not as bad as a sort, but it still can be expensive.

So, given that, let’s take a look at some of the ways.

Querying the metadata

If all that’s needed is the number of rows in the table, and it’s not 100% important that the value be completely accurate all the time, the system metadata can be queried. In SQL 2000 and below, that info was in sysindexes. In 2005 and higher it’s been moved into sys.partitions.

SELECT OBJECT_NAME(object_id) AS TableName, SUM(rows) AS TotalRows
FROM sys.partitions
WHERE index_id in (0,1)
AND object_id = OBJECT_ID('TableName')
GROUP BY object_id

The advantage of this approach is that it is fast. Since it’s not actually counting anything and, in fact, isn’t even accessing the table that’s being counted, it’s the fastest way to get the count of rows in the table.

The disadvantage is it can only get the number of rows in the table and cannot consider any criteria at all. It also may not be 100% accurate, depending how and when the table’s rowcount metadata is updated by the SQL engine.

Checking for existence of rows

If all that’s needed is to know whether or not a row exists for a certain criteria, use EXISTS rather than counting the entire number of rows and checking that it’s greater than 0. It’s (slightly) quicker and it uses fewer IOs.

CREATE TABLE TestingCounts (
ID INT IDENTITY PRIMARY KEY,
AChar CHAR(1),
SomePaddingColumn CHAR(400)
)

INSERT INTO TestingCounts (SomePaddingColumn, AChar)
SELECT TOP (100000) a.name, LEFT(b.name,1)
FROM sys.columns a CROSS JOIN sys.columns b
GO

SET STATISTICS IO ON
SET STATISTICS TIME ON
GO

DECLARE @RowCount INT
SELECT @RowCount = COUNT(*) FROM TestingCounts WHERE ID>50000
IF (@RowCount > 0)
PRINT 'Count - Rows Exist'

IF EXISTS (SELECT 1 FROM TestingCounts WHERE ID>50000)
PRINT 'Exists - Rows Exist'

Count:

Table ‘TestingCounts’. Scan count 1, logical reads 2645, physical reads 0.
SQL Server Execution Times:
CPU time = 15 ms,  elapsed time = 13 ms.

Exists:

Table ‘TestingCounts’. Scan count 1, logical reads 3
SQL Server Execution Times:
CPU time = 0 ms,  elapsed time = 0 ms.

Count(*) vs Count(ColumnName)

The difference between these two is not always understood, and I’ve often seen hints and tips that say that COUNT(ColumnName) is better than COUNT(*). That advice is wrong and it’s possible that it comes from the (correct) advice that SELECT <Column List> is better than SELECT *.

In truth, COUNT(*) means to count the number of rows in the resultset. COUNT(ColumnName) means to count the number of rows in the resultset where that column is not null. Hence, COUNT(ColumnName) will return a different result to COUNT(*) if the column that’s used is nullable and contains any nulls.

So what does that mean for the performance of the two?

Because COUNT(*) just means count the rows then, assuming that there are no other criteria in the query, to satisfy it SQL will find the smallest index that exists on the table and scan the leaf pages to count the rows.

COUNT(ColumnName) requires that the column specified be checked to see if there are any null values. If the column that’s specified is defined as NOT NULLable, them SQL treats it just like a COUNT(*). If the column is nullable then, regardless or whether or not it contains any NULLs, it has to be checked. That means that to evaluate a COUNT(ColumnName), SQL must either scan an index that has that column in it or it must do a full table scan.

This means that, at best, a COUNT(Column) can be as fast as a COUNT(*), but it cannot be faster. It also means that COUNT(*) may return a different value to COUNT(Column)

-- using the same table created above

CREATE INDEX idx_AChar ON TestingCounts (Achar)
GO

SELECT COUNT(*) FROM TestingCounts WHERE ID%3 = 0
SELECT COUNT(SomePaddingColumn) FROM TestingCounts WHERE ID%3 = 0

Count(*):

Table ‘TestingCounts’. Scan count 1, logical reads 138.
SQL Server Execution Times:
CPU time = 16 ms,  elapsed time = 15 ms.

COUNT(SomePaddingColumn):

Table ‘TestingCounts’. Scan count 1, logical reads 5285
SQL Server Execution Times:
CPU time = 47 ms,  elapsed time = 47 ms.

Of course, if there are other conditions in the query or a group by, the situation becomes rather more complicated as to what indexes will be used. The main point though remains true. Because COUNT(ColumnName) has to check that column for NULL values and COUNT(*) does not, COUNT(ColumnName) cannot be the faster option.

23 Comments

  1. Jack Corbett

    Nice summary, Gail. Especially the Count(*) vs. Count(ColumnName) section. I think I knew it, but not as well as you expressed it.

    Reply
  2. ManorLT

    Well put. One thing I would change is the ‘(slightly)’ in “Checking for Existence”. On large tables, this will likely make a big difference. If the column is not indexed, ‘if exists’ requires only a scan until it reaches the first record with the value, instead of scanning the whole table. If it the column is indexed but there are a lot of duplicates, ‘if exists’ avoids having to count all the leaf level nodes of the index with the given value.

    Reply
  3. sunny

    After seeing the below query and comment

    SELECT OBJECT_NAME(object_id) AS TableName, SUM(rows) AS TotalRows
    FROM sys.partitions
    WHERE index_id in (0,1)
    AND object_id = OBJECT_ID(’TableName’)
    GROUP BY object_id

    a doubt came to mind. Since you said “It also may not be 100% accurate, depending how and when the table’s rowcount metadata is updated by the SQL engine”.My doubt is that if we run the above query immediately after updating IO STATISICS, would it not be accurate ?

    Reply
  4. Gail

    IO Statistics? You mean after updating column statistics?

    The data in sys.partitions is not dependent on column statistics and is not affected by column statistics.

    Reply
  5. Pingback: SQL Server and Cloud Links for the Week | Brent Ozar - SQL Server DBA

  6. pikes

    Instead of using exists, why not just select top 1 * into a CTE and count that? May not be as good performance wise, would need to test it 🙂

    Reply
  7. Chris Sherlock

    “If all that’s needed is to know whether or not a row exists for a certain criteria, use EXISTS rather than counting the entire number of rows and checking that it’s greater than 0.”

    The example uses:

    IF EXISTS (SELECT 1 FROM TestingCounts WHERE ID>50000)
    PRINT ‘Exists – Rows Exist’

    If you ever needed to implement this in a query, you could do the same thing. This is also known as a semi-join.

    e.g.

    SELECT o.OrgUnitName
    FROM OrgUnit AS o
    WHERE o.OrgUnitID IN
    (
    SELECT p.OrgUnitID FROM Person AS p
    WHERE p.OrgUnitID = o.OrgUnitID
    )

    Reply
  8. Chris Sherlock

    Oops… my above example query is correct, but with the wrong syntax!

    Try:

    SELECT o.OrgUnitID
    FROM OrgUnit AS o
    WHERE EXISTS
    (
    SELECT p.OrgUnitID FROM Person AS p
    WHERE p.OrgUnitID = o.OrgUnitID
    )

    Reply
  9. Gail

    Pikes: Why go to all the trouble of declaring a CTE and then counting that. You’re not gaining anything by doing so.

    Reply
  10. David

    Thats what I love about powershell, the number of rows is essentially a property of a table. Instantiate the database as an object, grab the tables collection and select the name of the table and the rowcount. Want the dataspace and indexspace usage for each table? Add these to the select list in the pipeline. Want to sort by rowcount or dataspaceused? Add a sort-object to the pipeline. Done.

    Reply
  11. pikes

    Yes you are right! I think your solution is the most elegant, by far.

    Reply
  12. Gail

    David, how does powershell get the row count? By actually counting the rows in the table or by using the metadata?

    Reply
  13. Seb

    Do you know why a SELECT COUNT(column1) FROM table1 takes about twice as much time to complete than a SELECT COUNT(column1) FROM table1 WHERE column1 IS NOT NULL? I looked at the execution plans but couldn’t find out why there is such a big difference. In my case 97% of the plan is due to an Index Scan in both cases, with the same values in the details.

    Reply
  14. julio

    Excelente información para los que somos movatos

    Reply
  15. Christopher

    great article I loved this one and was brought back it today, when trying to help someone with a problem at SSC:
    http://www.sqlservercentral.com/Forums/Topic714996-338-3.aspx

    Reply
  16. VBindra

    Great article!
    I think that COUNT(1) will also be treated in the same way as COUNT(*). Is it correct?

    Reply
  17. Gail

    It should be.

    Reply
  18. Sm

    Gail,

    Instead of “Select 1 from Tablename” IF we use “Select Top 1 ” will it decrease performance?

    Sm

    Reply
  19. Gail

    Why don’t you test it out and tell me?

    Reply
  20. Sm

    ok, let me test it ‘n Come back to you Gail..again..see you…then
    Sm

    Reply
  21. Sm

    Gail, Now just I have tested but all are same. may be its my dev box. but can you tell me how to calculate CPU time = 16 ms, elapsed time = 15 ms.. And also I tried with mine query Top 1 statements but no difference i found. Can u tell me y so?

    Sm.

    Reply
  22. Prakash

    Gail,

    Thanks a Lot for the Wonderful Explanation..

    Reply
  23. sql-developer

    There is a wrong assumption that sys.partitions uses statistics between many DBAs I talked to.
    I’m happy to read that you have stated that it is not based on statistics.

    Reply

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.