When COUNT() isn’t the only way to count


I have come across this situation a number of times in the past, and it seems to be one of those things that are so obvious once you see the solution, but that you just can't see before the penny drops.

Imagine the following scenario:

You want to find the total number of orders that have the Order Status ‘A’ and the number of orders with an Order Status of ‘B’. This is a simple enough request, and one I'm sure you have come across before.

Let's start off with some test data.

--Test Structure
USE master
go
IF DB_ID('Sandbox') IS NULL
BEGIN
    CREATE DATABASE Sandbox
END
GO

USE Sandbox
GO
IF OBJECT_ID('dbo.CountExample') IS NOT NULL
BEGIN
    DROP TABLE dbo.CountExample
END
GO
IF OBJECT_ID('dbo.Nums') IS NOT NULL
BEGIN
    DROP FUNCTION dbo.Nums
END
GO
-- Test Function to allow fast test data creation
CREATE FUNCTION [dbo].[Nums] (@m AS bigint)
RETURNS TABLE
AS
RETURN
    WITH t0 AS (SELECT n = 1 UNION ALL SELECT n = 1),   -- 2 rows
         t1 AS (SELECT n = 1 FROM t0 AS a, t0 AS b),    -- 4 rows
         t2 AS (SELECT n = 1 FROM t1 AS a, t1 AS b),    -- 16 rows
         t3 AS (SELECT n = 1 FROM t2 AS a, t2 AS b),    -- 256 rows
         t4 AS (SELECT n = 1 FROM t3 AS a, t3 AS b),    -- 65,536 rows
         t5 AS (SELECT n = 1 FROM t4 AS a, t4 AS b),    -- over 4 billion rows
         results AS (SELECT ROW_NUMBER() OVER (ORDER BY n) AS n
                     FROM t5)
    SELECT n
    FROM results
    WHERE n <= @m

GO
CREATE TABLE dbo.CountExample
(OrderId int NOT NULL,
OrderStatus char(1) NOT NULL)

GO

--Test data
INSERT INTO dbo.CountExample
(OrderId,
OrderStatus)
SELECT n,
CHAR(n % 27 + 64)
FROM dbo.Nums (1000) AS N
GO
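
If you want a quick look at the rows that were just generated, a simple check along these lines should do the trick:

--Optional: peek at the first few generated rows
SELECT TOP (10) OrderId,
       OrderStatus
FROM dbo.CountExample
ORDER BY OrderId
GO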

Now that we have some test data and tables, we can take a look at what solutions are possible.

Solution 1:

The solution that I have seen from a lot of people is basically to run two queries, one for each Order Status, and then collect the results together.

Something along the lines of:

SELECT (SELECT COUNT(*) CountA
        FROM dbo.CountExample AS CE
        WHERE OrderStatus = 'A') CountA,
       (SELECT COUNT(*) CountB
        FROM dbo.CountExample AS CE
        WHERE OrderStatus = 'B') CountB

This delivers the correct result, but the table has to be read twice (once for each Order Status). There are variations of this solution, using sub-queries or CTEs, but I hope you get the idea: a separate COUNT() is required for each total that you want to calculate.
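
One such CTE variation might look something like this; it returns the same result, but still needs one COUNT() per Order Status:

WITH CountA
AS (SELECT COUNT(*) AS CountA
    FROM dbo.CountExample AS CE
    WHERE OrderStatus = 'A'),
CountB
AS (SELECT COUNT(*) AS CountB
    FROM dbo.CountExample AS CE
    WHERE OrderStatus = 'B')
SELECT CountA.CountA,
       CountB.CountB
FROM CountA
CROSS JOIN CountB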

Solution 2:

The best way that I know of to achieve this is to change the logic from a COUNT() to a SUM(). This sounds wrong at first, especially because the column Order Status is a char(1) and not an integer!

Take a look at how I have solved the problem with SUM():

 
SELECT SUM(CASE WHEN OrderStatus = 'A' THEN 1 ELSE 0 END) CountA,
       SUM(CASE WHEN OrderStatus = 'B' THEN 1 ELSE 0 END) CountB
FROM dbo.CountExample AS CE

Looking at the code, we can see that I have not just used SUM(), but also a CASE expression. CASE is one of my favourite constructs in T-SQL, as it allows you to apply logic to an entire set, or just part of it, without filtering via a WHERE clause.

If you take a look at the execution plan, you will also see that the table is accessed only once. This is an instant improvement over the “standard” solution of COUNT()-ing per Order Status, and it has the added bonus that the I/O cost stays the same no matter how many different Order Status totals are required; each extra total only adds a little CPU work, not another read of the table.
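
To show how this scales, here is a sketch with two more status letters added purely as an illustration; the table is still only scanned once:

SELECT SUM(CASE WHEN OrderStatus = 'A' THEN 1 ELSE 0 END) CountA,
       SUM(CASE WHEN OrderStatus = 'B' THEN 1 ELSE 0 END) CountB,
       SUM(CASE WHEN OrderStatus = 'C' THEN 1 ELSE 0 END) CountC,
       SUM(CASE WHEN OrderStatus = 'D' THEN 1 ELSE 0 END) CountD
FROM dbo.CountExample AS CE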

So there you go.  COUNT() isn’t always the best way to count data in SQL Server.

9 thoughts on “When COUNT() isn’t the only way to count”

  1. Thanks for the post.
    But I have one question. Why do we need SUM() when we can use COUNT() in the same manner? e.g.
    COUNT(CASE WHEN OrderStatus = 'A' THEN 1 ELSE NULL END)
    would do the job.

    I am not sure there would be any performance benefit, but your statement that “COUNT() isn’t always the best way to count data in SQL Server” could be challenged by some. 🙂

    PS: In your test data, please add a “GO” after the creation of the “Sandbox” DB so it runs as a separate batch. Otherwise the objects can sometimes end up in the master DB instead. (That was the case for me.)

    • Usmann, thanks for your comment.

      You are right in saying that COUNT() could be used. The idea behind my post was to provoke thought and make clear that COUNT() wasn’t necessary to actually count.

      I intentionally used SUM() to steer away from thinking explicitly about counting, especially as I have seen that people aren’t aware of COUNT() ignoring NULL values. This would just cloud the process, especially for someone who really hasn’t heard of this method before.
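
      As a quick illustration of that point, COUNT() over an expression only counts the non-NULL values, so something along these lines would behave just like the SUM() version:

      SELECT COUNT(*) AS AllRows, -- counts every row in the table
             COUNT(CASE WHEN OrderStatus = 'A' THEN 1 ELSE NULL END) AS OnlyA -- NULLs are ignored, so only the 'A' orders are counted
      FROM dbo.CountExample AS CE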

      P.S. Thanks for the catch on the missing GO. Re-formatting the code to display as code on WordPress managed to drop that statement!

  2. Moreover, after some runs with 1,000,000 rows, the COUNT() was ever so slightly better than SUM()

    COUNT() STATISTICS AVG
    SQL Server Execution Times:
    CPU time = 313 ms, elapsed time = 471 ms.

    SUM() STATISTICS AVG
    SQL Server Execution Times:
    CPU time = 344 ms, elapsed time = 475 ms.

    Please let me know If I may have missed something. Thanks.

    • Thanks for the performance comparison, but please don’t forget the premise of the post.

      I was comparing the traditional COUNT() and the SUM() using CASE and not the performance of COUNT() vs. SUM().

      If you compare my two sample queries over one million rows, you will see that SUM() with CASE beats COUNT() hands down.
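
      If you want to repeat that comparison, the test table can be reloaded with a million rows using the same Nums function, something along these lines:

      TRUNCATE TABLE dbo.CountExample
      GO
      --Reload with 1,000,000 rows of test data
      INSERT INTO dbo.CountExample
      (OrderId,
       OrderStatus)
      SELECT n,
             CHAR(n % 27 + 64)
      FROM dbo.Nums (1000000) AS N
      GO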

      • Thanks for the feedback. I always like your style of addressing situations. Much appreciated. One more thing: my performance test was against the same logic, which is why I asked for your input. Am I missing something?

        I used
        CHECKPOINT;
        DBCC DROPCLEANBUFFERS
        DBCC FREEPROCCACHE

        before each execution. Thanks.

  3. While the execution time may be faster, please don’t forget the (in my opinion more important) I/O costs of each query.

    If you run:

    SET STATISTICS IO ON

    before running the queries, you will notice that the COUNT(*) query has double the reads of the SUM() query. This gets progressively worse for each further COUNT() that is added, whereas the SUM() only ever costs a little more CPU time, because it still just scans the table once.

    Consider the following:

    P = number of pages in the table
    N = number of separate COUNT() commands

    The reads for the COUNT() solution would be:

    P * N

    The reads for the SUM() solution would always be

    P

    I/O is almost always the bottleneck on a SQL Server, so an easy re-write like this can save you a lot of time and money.
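
    To put some purely illustrative numbers on that: if the table occupies 2,500 pages (P = 2,500) and you need totals for four statuses (N = 4), the COUNT()-per-status approach reads roughly 2,500 * 4 = 10,000 pages, while the single-scan SUM() approach still reads only 2,500.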

    • Thanks again for giving your precious time. I am extremely sorry, but I am unable to understand your last comment. I have run the below queries with IO statistics ON (1,000,000 rows)

      SELECT COUNT(CASE WHEN OrderStatus = 'A' THEN 1 ELSE NULL END) CountA,
      COUNT(CASE WHEN OrderStatus = 'B' THEN 1 ELSE NULL END) CountB
      FROM dbo.CountExample AS CE

      SELECT SUM(CASE WHEN OrderStatus = 'A' THEN 1 ELSE 0 END) CountA,
      SUM(CASE WHEN OrderStatus = 'B' THEN 1 ELSE 0 END) CountB
      FROM dbo.CountExample AS CE

      and below are the statistics respectively

      (1 row(s) affected)
      Table 'CountExample'. Scan count 1, logical reads 2349, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

      SQL Server Execution Times:
      CPU time = 313 ms, elapsed time = 323 ms.

      (1 row(s) affected)
      Table 'CountExample'. Scan count 1, logical reads 2349, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

      SQL Server Execution Times:
      CPU time = 344 ms, elapsed time = 348 ms.

      Moreover, the following queries

      SELECT COUNT(*) FROM [dbo].[CountExample] AS CE

      SELECT SUM(orderid) FROM [dbo].[CountExample] AS CE

      yielded following statistics

      (1 row(s) affected)
      Table 'CountExample'. Scan count 1, logical reads 2349, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

      SQL Server Execution Times:
      CPU time = 78 ms, elapsed time = 84 ms.

      (1 row(s) affected)
      Table 'CountExample'. Scan count 1, logical reads 2349, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

      SQL Server Execution Times:
      CPU time = 281 ms, elapsed time = 277 ms.

      My understanding was that COUNT() could be less resource intensive, and the stats above seem to support that.

      As far as P * N is concerned, how does N come into play? Could you please give some more details? I hope I am not bothering you. Thanks.

      • Usmann, my comments and comparisons (including I/O) were on my original queries.

        COUNT() and SUM() in themselves are pretty similar on resource usage. You rightly point that fact out.

        However, my method of counting separate Order Statuses beats the “traditional” way every time. My P and N explanation should make that clear if you only consider the methods I originally presented.
