On table variable row estimations

At first glance, the question of how many rows are estimated from a table variable is easy: one row.

But is it really that simple? Well, not really. To dig into the why, we first need to identify why table variables estimate 1 row in the first place. The obvious answer is that they don’t have statistics. However…

ALTER DATABASE CURRENT SET AUTO_CREATE_STATISTICS OFF;
GO

CREATE TABLE Test (SomeCol INT);

INSERT INTO Test (SomeCol)
VALUES (1),(22),(37),(45),(55),(67),(72),(86),(91);

SELECT SomeCol FROM Test; -- estimated rows: 9, matching the rows just inserted

SELECT * FROM sys.stats WHERE object_id = OBJECT_ID('Test'); -- confirms there are no statistics on the table

DROP TABLE dbo.Test;

That table has no statistics, but it still estimates rows correctly.

So it’s not just the absence of statistics. Hmmm… Where else is there a difference with a table variable?

It has to do with when the plans are generated. The Extended Events results are from a session tracking statement start and end, plus the post-compilation event for the plan. For the query using the table variable, the entire batch is compiled before execution starts. For the permanent table, there are multiple compilation events.
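
The session itself isn’t shown here; a sketch of one that captures the relevant events (the event and target choices are mine) would be:

CREATE EVENT SESSION TrackCompiles ON SERVER
    ADD EVENT sqlserver.sql_statement_starting,
    ADD EVENT sqlserver.sql_statement_completed,
    ADD EVENT sqlserver.query_post_compilation_showplan
    ADD TARGET package0.ring_buffer;
GO

ALTER EVENT SESSION TrackCompiles ON SERVER STATE = START;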

And this is because of something called ‘deferred compile’. For the table variable, the entire batch is compiled at the start, at a time where the table variable does not exist, and because there are no statistics, no recompile is triggered after the insert. Hence, there cannot be any row estimation other than 1 row, because the table did not exist when the estimate was made.

For the permanent table, the compilation of the query that uses the table is deferred until the query starts, not when the batch starts. Hence the plan for the query is generated after the table exists, after it’s been populated. That’s the difference here.

Now, there are still no statistics, so there’s no way to get data distribution, but that’s not the only way to get information about the rows in the table. The Storage Engine knows how many rows are in the table, even though the data distribution isn’t known.

Hence, with a table variable we can expect to see an estimated row count other than 1 any time the table variable exists before the query that uses it is compiled.

That will happen when the table variable is a table-type parameter, when the query using it has the RECOMPILE option, and when SQL 2019’s deferred compile for table variables is in play.
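
The demo below uses a user-defined table type for the table-valued parameter case. A minimal definition matching the procedure would be the following (the single int column is inferred from the SELECT, so treat it as an assumption):

CREATE TYPE TestTableType AS TABLE (SomeCol INT);
GO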

CREATE OR ALTER PROCEDURE TestRowEstimations @Input TestTableType READONLY AS

SELECT SomeCol FROM @Input; -- table-valued parameter

DECLARE @Test TABLE (SomeCol INT);

INSERT INTO @Test (SomeCol)
VALUES (1),(22),(37),(45),(55),(67),(72),(86),(91);

SELECT SomeCol FROM @Test; -- normal select against the table variable

SELECT SomeCol FROM @Test OPTION (RECOMPILE); -- select with OPTION (RECOMPILE)
GO
Table-valued parameter
Normal select on compatibility mode 140
Normal select on compatibility mode 150
Select with OPTION(RECOMPILE)

Pluralsight Free April

It’s a few days into April, but not too late I hope to mention that Pluralsight is offering their entire library, free to new accounts, for the month of April. Sign up on their promotion page to take advantage of this offer.

And, I do have a few courses up there for anyone interested:

And there’s an upcoming course on index maintenance as well that I hope will be published shortly.

A rant about presentations

My company’s internal conference is in a couple of weeks, so this seems like a good time to have a quick rant about some presentation failings I’ve seen over the last year or so.

If you want to, or are planning to present at a conference (or even just a usergroup), please, please, please pay attention to the following.

Don’t read your presentation

Please don’t read the bullets on your slides one by one. Please also don’t read a speech off your phone. If I wanted to have something read to me, I’d get an audio book.

A presentation should feel dynamic. It is, and should feel like, a live performance.

If you need reminders or cue cards, that’s fine, but put keywords on them, points that need to be discussed, not the entire speech.

Watch your font size

This is for the slides but especially for the demos. Font size of 30 is probably the smallest you should be using on slides.

In demos, if I’m sitting in the back row and can’t read the code, there may be a problem. My eyes are not the best though, so that might be a failing on my part. If, however, I’m sitting in the second row and can’t read the code, there’s definitely a problem.

If the conference insists on, or offers time for, a tech check, take the opportunity to check your fonts. A tech check isn’t just ‘does my laptop see the projector? Yes, done.’ Walk to the back of the room and go through all the slides, then start your demo and walk to the back of the room again. Make sure that everything is clearly visible.

Minimalistic slides

Please don’t put an essay on your slide. Please don’t have fancy animation (unless you’re doing a presentation on animation). Don’t have things that flash, flicker, spin or dance.

It’s distracting, and it probably means your audience is watching your slides and not listening to you. You should be the star of the presentation, not your slides. They’re a supporting character.

Themes

I like the Visual Studio dark theme. It’s nice to code with, but it’s absolutely terrible on a projector, especially if the room is not dark. For projectors you want strong contrast. Dark font on a light background usually works. Dark blue on black does not, and neither do two similar shades of blue.

Check that your demos are visible, check that the code is readable from the back of the room.

Learn how to zoom in, whether with the Windows built-in tools or an installed app. Use the zoom any time what you’re showing may not be clear.

Repeat the question

Especially if the session is being recorded. Your voice is being recorded, the audience isn’t. It is so frustrating to listen to a recorded session, hear a minute of silence followed by the presenter giving a single word answer.

Even if the session is not being recorded, acoustics often mean that the presenter can hear a question while part of the audience can’t.

It also gives you a chance to confirm that you heard the question correctly and gives you a few moments to think on an answer.

Revisiting catch-all queries

I originally wrote about catch-all queries early in 2009, just as something that I’d seen several times in client code. It turned into the 3rd most popular post ever on my blog.

A lot’s changed since 2009. When I wrote the original post, most production servers were SQL 2005 or SQL 2000. SQL 2008 had been out less than a year, and its fix for catch-all queries, the RECOMPILE hint, didn’t even work properly (it had an incorrect-results bug in RTM, was pulled in SP1 and fixed in SP2).

As such, my feelings on how to solve the problem with catch-all queries have changed over the years.

Before I get to solutions, let’s start with the root cause of the problem with catch-all queries – plan caching and the need for plans to be safe for reuse.

Let’s take a sample query. I’ll use the same one I used in the original post.

CREATE PROCEDURE SearchHistory
(@Product int = NULL, @OrderID int = NULL, @TransactionType char(1) = NULL, @Qty int = NULL)
AS
SELECT ProductID, ReferenceOrderID, TransactionType, Quantity,
TransactionDate, ActualCost
FROM Production.TransactionHistory
WHERE (ProductID = @Product Or @Product IS NULL)
AND (ReferenceOrderID = @OrderID OR @OrderID Is NULL)
AND (TransactionType = @TransactionType OR @TransactionType Is NULL)
AND (Quantity = @Qty Or @Qty is null)
GO

There are two nonclustered indexes on the TransactionHistory table, one on ProductID, one on ReferenceOrderID and ReferenceLineID.
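
Those indexes ship with AdventureWorks; they’re roughly equivalent to the following (the names are the AdventureWorks ones, but treat the exact definitions as assumptions):

CREATE NONCLUSTERED INDEX IX_TransactionHistory_ProductID
    ON Production.TransactionHistory (ProductID);

CREATE NONCLUSTERED INDEX IX_TransactionHistory_ReferenceOrderID_ReferenceLineID
    ON Production.TransactionHistory (ReferenceOrderID, ReferenceLineID);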

For the initial discussion, let’s just consider two of the clauses in the WHERE. I’ll leave the other two in the stored proc, but they won’t be used.

WHERE (ProductID = @Product Or @Product IS NULL)
AND (ReferenceOrderID = @OrderID OR @OrderID Is NULL)

We would expect, if the ProductID parameter is passed, to get a seek using the index on ProductID; if the ReferenceOrderID parameter is passed, a seek using the index on ReferenceOrderID; and if both are passed, either an index intersection or a seek on one of the indexes with a key lookup and secondary filter for the other. Plus, in all cases, a key lookup to fetch the columns for the SELECT.

That’s not what we get (I cleared the plan cache before running each of these).

ProductScan

OrderScan

The expected indexes are used, but they’re used for scans not seeks. Why? Let’s just consider the second plan for a bit.

The indexes aren’t used for seeks because plans must be safe for reuse. If a plan was generated with an index seek, seeking for ReferenceOrderID = @OrderID, and that plan was cached and reused later when @OrderID was NULL, we’d get incorrect results. ReferenceOrderID = NULL matches no records.

And so we have index scans with the full predicate (ReferenceOrderID = @OrderID OR @OrderID Is NULL) applied after the index is read.

This is not particularly efficient, as the properties of the index scan show.

InefficientIndexScan

The entire index, all 113,443 rows, was read to return a single row. Not ideal, but it’s far from the largest problem with this form of query.

The plan’s got an index scan on the index on ReferenceOrderID, and then a key lookup back to the clustered index. That key lookup has a secondary filter on it, (ProductID = @Product OR @Product IS NULL). The optimiser assumed that a small number of rows would be returned from the index scan on ReferenceOrderID (1.47 to be specific), and hence that the key lookup would be cheap, but that’s not going to be the case if the plan is reused with a ProductID passed to it instead of a ReferenceOrderID.

Before we look at that, the performance characteristics for the procedure being called with the ReferenceOrderID parameter are:

PerformanceOrder

The duration and CPU are both in microseconds, making this a very fast query, despite the index scan.

Now, without clearing the plan cache, I’m going to run the procedure with only the ProductID parameter passed.
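
For reference, the pair of test calls looks something like this (the parameter values are placeholders, not the ones used for the measurements):

EXEC SearchHistory @OrderID = 41798;  -- first call: plan compiled and cached for the OrderID path
EXEC SearchHistory @Product = 790;    -- second call: reuses that cached plan, no recompile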

PerformanceProduct

CPU has gone from an average of 8 ms to around 120 ms. Duration has gone from an average of around 6 ms to about 125 ms, and reads have jumped from 271 (about 2 MB of data processed) to 340,597 (about 2.6 GB of data processed).

And this is for a table that has 113k records and a query that returned 4 rows.

The key lookup, which was fine when an OrderID was passed, is not fine when @OrderID is NULL and the index scan returns the entire table.

ExpensiveIndexScan

ExpensiveKeyLookup

The plans that the optimiser has come up with for this query form aren’t stable. They’re safe for reuse, they have to be, but performance-wise they’re not stable.

But maybe it’s just this form of query; there are other ways to write queries with multiple optional parameters.

Let’s try the CASE and COALESCE forms.

CREATE PROCEDURE SearchHistory_Coalesce
(@Product int = NULL, @OrderID int = NULL, @TransactionType char(1) = NULL, @Qty int = NULL)
AS
SELECT ProductID, ReferenceOrderID, TransactionType, Quantity,
TransactionDate, ActualCost
FROM Production.TransactionHistory
WHERE ProductID = COALESCE(@Product, ProductID)
AND ReferenceOrderID = COALESCE(@OrderID, ReferenceOrderID)
AND TransactionType = COALESCE(@TransactionType, TransactionType)
AND Quantity = COALESCE(@Qty, Quantity)
GO

CREATE PROCEDURE SearchHistory_Case
(@Product int = NULL, @OrderID int = NULL, @TransactionType char(1) = NULL, @Qty int = NULL)
AS
SELECT ProductID, ReferenceOrderID, TransactionType, Quantity,
TransactionDate, ActualCost
FROM Production.TransactionHistory
WHERE ProductID = CASE WHEN @Product IS NULL THEN ProductID ELSE @Product END
AND ReferenceOrderID = CASE WHEN @OrderID IS NULL THEN ReferenceOrderID ELSE @OrderID END
AND TransactionType = CASE WHEN @TransactionType IS NULL THEN TransactionType ELSE @TransactionType END
AND Quantity = CASE WHEN @Qty IS NULL THEN Quantity ELSE @Qty END
GO

Coalesce

Case

These both give us full table scans, rather than the index scan/key lookup we saw earlier. That means their performance will be predictable and consistent no matter what parameter values are used. Consistently bad, but at least consistent.

It’s also worth noting that neither of these will return correct results if there are NULL values in the columns used in the WHERE clause (because NULL != NULL). Thanks to Hugo Kornelis (b | t) for pointing this out.

And then two more forms that were mentioned in comments on the original post, slightly more complicated:

CREATE PROCEDURE SearchHistory_Case2
(@Product int = NULL, @OrderID int = NULL, @TransactionType char(1) = NULL, @Qty int = NULL)
AS
SELECT  ProductID,
ReferenceOrderID,
TransactionType,
Quantity,
TransactionDate,
ActualCost
FROM    Production.TransactionHistory
WHERE   (CASE WHEN @Product IS NULL THEN 1
WHEN @Product = ProductID THEN 1
ELSE 0
END) = 1
AND (CASE WHEN @OrderID IS NULL THEN 1
WHEN @OrderID = ReferenceOrderID THEN 1
ELSE 0
END) = 1
AND (CASE WHEN @TransactionType IS NULL THEN 1
WHEN @TransactionType = TransactionType THEN 1
ELSE 0
END) = 1
AND (CASE WHEN @Qty IS NULL THEN 1
WHEN @Qty = Quantity THEN 1
ELSE 0
END) = 1
GO

CREATE PROCEDURE SearchHistory_Complex
(@Product int = NULL, @OrderID int = NULL, @TransactionType char(1) = NULL, @Qty int = NULL)
AS
SELECT  ProductID,
ReferenceOrderID,
TransactionType,
Quantity,
TransactionDate,
ActualCost
FROM    Production.TransactionHistory
WHERE ((ProductID = @Product AND @Product IS NOT NULL) OR (@Product IS NULL))
AND ((ReferenceOrderID = @OrderID AND @OrderID IS NOT NULL) OR (@OrderID IS NULL))
AND ((TransactionType = @TransactionType AND @TransactionType IS NOT NULL) OR (@TransactionType IS NULL))
AND ((Quantity = @Qty AND @Qty IS NOT NULL) OR (@Qty IS NULL))

These two give the same execution plans as the first form we looked at, index scan and key lookup.

Performance-wise, we’ve got two different categories of query. We’ve got some queries where the execution plan contains an index scan on one or the other index on the table (depending on the parameters passed) and a key lookup, and others where the execution plan contains a table scan (clustered index scan) no matter what parameters are passed.

But how do they perform? To test that, I’m going to start with an empty plan cache and run each query form 10 times with just the OrderID being passed and then 10 times with just the ProductID being passed, and aggregate the results.

Procedure               Parameter   CPU (ms)   Duration (ms)   Reads
SearchHistory           OrderID     5.2        50              271
SearchHistory           ProductID   123        173             340,597
SearchHistory_Coalesce  OrderID     7.8        43              805
SearchHistory_Coalesce  ProductID   9.4        45              805
SearchHistory_Case      OrderID     12.5       55              805
SearchHistory_Case      ProductID   7.8        60              804
SearchHistory_Case2     OrderID     10.5       48              272
SearchHistory_Case2     ProductID   128        163             340,597
SearchHistory_Complex   OrderID     7.8        40              272
SearchHistory_Complex   ProductID   127        173             340,597

The query forms that had the clustered index scan in the plan have consistent performance. On large tables that will be consistently bad performance, since it’s a full table scan, but it will at least be consistent.

The query forms that had the key lookup have erratic performance. No real surprise there: key lookups don’t scale well, and looking up every single row in the table is going to hurt. And note that if I ran the queries in the reverse order on an empty plan cache, the queries with the ProductID passed would be fast and the queries with the OrderID would be slow.

So how do we fix this?

When I first wrote about this problem 7 years ago, I recommended using dynamic SQL and discussed the dynamic SQL solution in detail. The dynamic SQL solution still works very well; it’s just not my preferred solution any longer.

What is, is the RECOMPILE hint.

Yes, it does cause increased CPU usage due to the recompiles (and I know I’m likely to get called irresponsible and worse for recommending it), but in *most* cases that won’t be a huge problem. And if it is, use dynamic SQL.

I recommend considering the RECOMPILE hint first because it’s faster to implement and far easier to read. Dynamic SQL is harder to debug because of the lack of syntax highlighting and the increased complexity of the code. In the last 4 years, I’ve only had one case where I went for the dynamic SQL solution for a catch-all query, and that was on a server that was already high on CPU, with a query that ran many times a second.
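
For comparison, the dynamic SQL form of this search looks something like the sketch below: parameterised via sp_executesql so that each combination of supplied parameters gets its own cached plan. This is a sketch, not production code; error handling is left out.

CREATE PROCEDURE SearchHistory_Dynamic
(@Product int = NULL, @OrderID int = NULL, @TransactionType char(1) = NULL, @Qty int = NULL)
AS
DECLARE @sSQL nvarchar(2000);
SET @sSQL = N'SELECT ProductID, ReferenceOrderID, TransactionType, Quantity,
    TransactionDate, ActualCost
FROM Production.TransactionHistory
WHERE 1 = 1';

-- Only the predicates for parameters that were actually supplied get appended
IF @Product IS NOT NULL
    SET @sSQL = @sSQL + N' AND ProductID = @_Product';
IF @OrderID IS NOT NULL
    SET @sSQL = @sSQL + N' AND ReferenceOrderID = @_OrderID';
IF @TransactionType IS NOT NULL
    SET @sSQL = @sSQL + N' AND TransactionType = @_TransactionType';
IF @Qty IS NOT NULL
    SET @sSQL = @sSQL + N' AND Quantity = @_Qty';

EXEC sp_executesql @sSQL,
    N'@_Product int, @_OrderID int, @_TransactionType char(1), @_Qty int',
    @_Product = @Product, @_OrderID = @OrderID,
    @_TransactionType = @TransactionType, @_Qty = @Qty;
GO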

From SQL 2008 SP2/SQL 2008 R2 onwards, the recompile hint relaxes the requirement that the generated plan be safe for reuse, since it’s never going to be reused. This firstly means that the plans generated for the queries can be the optimal forms, index seeks rather than index scans, and secondly will be optimal for the parameter values passed.
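
Applied to the original form of the procedure, that’s a one-line change (a sketch):

ALTER PROCEDURE SearchHistory
(@Product int = NULL, @OrderID int = NULL, @TransactionType char(1) = NULL, @Qty int = NULL)
AS
SELECT ProductID, ReferenceOrderID, TransactionType, Quantity,
    TransactionDate, ActualCost
FROM Production.TransactionHistory
WHERE (ProductID = @Product OR @Product IS NULL)
    AND (ReferenceOrderID = @OrderID OR @OrderID IS NULL)
    AND (TransactionType = @TransactionType OR @TransactionType IS NULL)
    AND (Quantity = @Qty OR @Qty IS NULL)
OPTION (RECOMPILE);
GO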

CatchallIndexSeek

And performance-wise?

RecompilePerformance

Reads down, duration down and CPU down even though we’re recompiling the plan on every execution (though this is quite a simple query, so we shouldn’t expect a lot of CPU to generate the plan).

How about the other forms, do they also improve with the RECOMPILE hint added? As I did before, I’m going to run each 10 times and aggregate the results, this time after adding the RECOMPILE hint to each.

Procedure               Parameter   CPU (ms)   Duration (ms)   Reads
SearchHistory           OrderID     0          1.3             28
SearchHistory           ProductID   0          1.2             19
SearchHistory_Coalesce  OrderID     6.2        1.2             28
SearchHistory_Coalesce  ProductID   3.2        1.2             19
SearchHistory_Case      OrderID     1.6        1.3             28
SearchHistory_Case      ProductID   0          1.2             19
SearchHistory_Case2     OrderID     7.8        15.6            232
SearchHistory_Case2     ProductID   7.8        11.7            279
SearchHistory_Complex   OrderID     1.5        1.4             28
SearchHistory_Complex   ProductID   0          1.2             19

What can we conclude from that?

One thing we note is that the second form of the CASE statement has higher CPU, duration and reads than any of the others. If we look at the plan, it’s still running as an index scan/key lookup, despite the recompile hint.

The second thing is that the more complex forms perform much the same as the simpler forms; we don’t gain anything by adding more complex predicates to ‘guide’ the optimiser.

Third, the coalesce form might use slightly more CPU than the other forms, but I’d need to test a lot more to say that conclusively. The numbers we’ve got are small enough that there might well be measuring errors comparable to the number itself.

Hence, when this query form is needed, stick to the simpler forms of the query and avoid adding unnecessary predicates to ‘help’ the optimiser. Test the query with NULLs in the filtered columns to make sure it works as intended.

Consider the RECOMPILE hint first, over dynamic SQL, to make it perform well. If the query has long compile times or runs very frequently, then use dynamic SQL, but don’t automatically discount the recompile hint for fear of the overhead. In many cases it’s not that bad.

Obsessing over query operator costs

A common problem when looking at execution plans is attributing too much meaning and value to the costs of operators.

The percentages shown for operators in query plans are based on costs generated by the query optimiser. They are not times, they are not CPU usage, they are not IO.

The bigger problem is that they can be completely incorrect.

Before digging into the why of incorrect percentages, let’s take a step back and look at why those costs exist.

The SQL query optimiser is a cost-based optimiser. It generates good plans by estimating costs for each query operator and then trying to minimise the total cost of the plan. These costs are based on estimated row counts and heuristics.

The costs we see in the query plan are these compilation-time cost estimates. Because they’re compile-time estimations, they won’t change between one execution of a query using a particular plan and another execution using the same plan, even if the parameter values are different, and even if the row counts through the operators are different.

Since the estimations are partially based on those row counts, that means that any time the query runs with row counts different to what were estimated, the costs will be wrong.

Let’s look at a quick example of that.
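
The plans in the screenshots below came from a query along these lines; the table, columns and parameter values here are my reconstruction, not the exact code behind the screenshots:

EXEC sp_executesql
    N'SELECT SalesOrderID, OrderDate, TotalDue
      FROM Sales.SalesOrderHeader
      WHERE CustomerID = @CustomerID',
    N'@CustomerID int',
    @CustomerID = 0;      -- no matching rows; the plan is compiled with a 1-row estimate

EXEC sp_executesql
    N'SELECT SalesOrderID, OrderDate, TotalDue
      FROM Sales.SalesOrderHeader
      WHERE CustomerID = @CustomerID',
    N'@CustomerID int',
    @CustomerID = 11091;  -- a value that does return rows; the cached plan and its costs are reused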

Cost1

There are no customers with an ID of 0, so the plan is generated with an estimation of one row being returned by the index seek, and one row looked up to the clustered index. Those are the only two operators that do any real work in that plan, and each is estimated to read and fetch just one row, so each gets an estimation of 50% of the cost of the entire query (0.0033 to be specific).

Run the same query with a different parameter value: the plan is reused, and so the displayed costs are the same.

Cost2

That parameter returns 28 rows. The index seek is probably much the same cost, because one row or 28 contiguous rows aren’t that different in the work needed. The key lookup is a different matter. It’s always a single-row seek, so to look up 28 rows it has to execute 28 times, and hence does 28 times the work. It’s definitely no longer 50% of the work of executing the query.

The costs still show 50%, because they were generated for the parameter value that returned no rows, and they’re displayed as stored in the plan. They’re not run-time costs; they’re compile-time costs, tied to the plan.

Another thing can make the cost estimations inaccurate, and that’s incorrect costing calculations by the optimiser. Scalar user-defined functions are the easiest example there.
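
The queries behind the screenshot aren’t reproduced here, but the effect is easy to show with any scalar UDF that hides data access. The example below is mine, not the one from the screenshot:

CREATE FUNCTION dbo.CustomerOrderCount (@CustomerID int)
RETURNS int
AS
BEGIN
    -- a hidden query, executed once per row of the outer query
    RETURN (SELECT COUNT(*) FROM Sales.SalesOrderHeader WHERE CustomerID = @CustomerID);
END
GO

-- The UDF version: costed as nearly free, usually far slower
SELECT c.CustomerID, dbo.CustomerOrderCount(c.CustomerID) AS OrderCount
FROM Sales.Customer AS c;

-- The same logic as a join and aggregate: costed (and performing) far more honestly
SELECT c.CustomerID, COUNT(soh.SalesOrderID) AS OrderCount
FROM Sales.Customer AS c
LEFT JOIN Sales.SalesOrderHeader AS soh ON soh.CustomerID = c.CustomerID
GROUP BY c.CustomerID;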

CostsOff

The first query there, the one that’s apparently 15% of the cost of the batch, runs in 3.2 seconds. The second runs in 270 ms.

The optimiser gives scalar UDFs a very low cost (they have their own plans, with costs in them though) and so the costs for the rest of the query and batch are meaningless.

The costs in a plan may give some idea what’s going on, but they’re not always correct, and should not be obsessed over, especially not when the plan’s a simple one with only a couple of operators. After all, the cost percentages add to 100% (usually).

Books of 2016

I set myself a reading goal of 75 books for last year, and managed 73. I’m not overly happy about that; there were months where I barely managed to read anything.

Books2016

The full list, with Amazon links, is at the end of this post; I’ll mention a few of the standout books first.

Dust and Light, and its sequel Ash and Silver

A novel magic system, complex politics, a war, an ancient mystery and the main character is slap in the middle of all of them, and he doesn’t remember why.

An interesting theme in these is on memory and what we are if our memory is stripped away.

Halting State

Near-future Scotland. The book starts with a bank robbery, and the suspects are a bunch of orcs and a dragon. The robbery occurred in a persistent, online world, and the police are a little out of their depth. It gets more complicated from there.

Song for Arbonne

A beautifully written story of the land of Arbonne, land of troubadours and joglars and courtly love, worshipping a goddess and ruled by a Queen; and a land to the north where only the warrior god is worshipped and the king and high priest have sworn to conquer Arbonne.

Pandora’s Star

The first story of the Commonwealth saga, a futuristic society where space travel is almost unknown as wormholes link the worlds of the commonwealth together, and where people can live forever thanks to memory implants and rejuvenation techniques.

It all starts when an astronomer observes a star disappearing, enveloped in an instant by some form of Dyson sphere.

The Bands of Mourning

The last in the sequel series to Mistborn, we return to the world of Allomancy and mists. It’s hundreds of years after the end of “Hero of Ages”, the world is in an early Industrial Age.

This book completes the adventures of Wax and Wayne, started in The Alloy of Law and continued in Shadows of Self.

City of Stairs and its sequel City of Blades

Another completely different fantasy setting. For centuries the Divinities had ruled and protected the continent, their miracles feeding the people, protecting them, etc. Then on one day, the Divinities were killed and civilisation on the continent collapsed.

Almost 100 years later strange things with a divine feel to them are happening and must be investigated.

What If?

A book full of strange questions and well-researched answers, such as “What would happen if the Earth stopped spinning?” (Hint: Bad things would happen), or “What would happen if you tried to hit a baseball travelling at 90 percent of the speed of light?” (Hint: Bad things would happen).

It’s hilarious, it’s well-researched, it’s fantastic.

Full list:

Dust and Light: A Sanctuary Novel by Carol Berg
Skin Game: A Novel of the Dresden Files by Jim Butcher
Sacrifice (Star Wars: Legacy of the Force, Book 5) by Karen Traviss
Inferno (Star Wars: Legacy of the Force, Book 6) by Troy Denning
Fury (Star Wars: Legacy of the Force, Book 7) by Aaron Allston
Revelation (Star Wars: Legacy of the Force, Book 8) by Karen Traviss
The Republic of Thieves (Gentleman Bastards) by Scott Lynch
Ash and Silver: A Sanctuary Novel by Carol Berg
Arthur (The Pendragon Cycle, Book 3) by Stephen R. Lawhead
City of Stairs (The Divine Cities) by Robert Jackson Bennett
This May Go On Your Permanent Record by Kelly Swails
Words of Radiance: Part Two (The Stormlight Archive) by Brandon Sanderson
Rookie Privateer (Privateer Tales) (Volume 1) by Jamie McFarlane
Invincible (Star Wars: Legacy of the Force, Book 9) by Troy Denning
Shattered: The Iron Druid Chronicles by Kevin Hearne
White Tiger (Dark Heavens, Book 1) by Kylie Chan
Something More Than Night by Ian Tregillis
Crown of Renewal (Legend of Paksenarrion) by Elizabeth Moon
Halting State (Ace Science Fiction) by Charles Stross
The Crimson Campaign (The Powder Mage Trilogy) by Brian McClellan
The Long Way Down (Daniel Faust) (Volume 1) by Craig Schaefer
London Falling by Paul Cornell
Learning R by Richard Cotton
Song for Arbonne by Guy Gavriel Kay
Pendragon (The Pendragon Cycle, Book 4) by Stephen R. Lawhead
The First Casualty by Mike Moscoe
Dragons In The Stars (Star Rigger) by Jeffrey A. Carver
Girl on the Moon by Jack McDonald Burnett
Skinwalker (Jane Yellowrock, Book 1) by Faith Hunter
Terms of Enlistment (Frontlines) by Marko Kloos
Rath’s Deception (The Janus Group) (Volume 1) by Piers Platt
Valour by John Gwynne
Death from the Skies!: The Science Behind the End of the World by Philip Plait Ph.D.
The Fabric of the Cosmos: Space, Time, and the Texture of Reality by Brian Greene
Flex by Ferrett Steinmetz
Sword Coast Adventurer’s Guide by Wizards RPG Team
Lines of Departure (Frontlines) by Marko Kloos
The Dark Ability (Volume 1) by D.K. Holmberg
Shadows of Self: A Mistborn Novel by Brandon Sanderson
ATLAS (ATLAS Series) by Isaac Hooke
Pandora’s Star (The Commonwealth Saga) by Peter F. Hamilton
Interim Errantry: Three Tales of the Young Wizards by Diane Duane
Leviathan Wakes (The Expanse Book 1) by James S.A. Corey
Path of Destruction (Star Wars: Darth Bane, Book 1) by Drew Karpyshyn
Manifold: Time by Stephen Baxter
The Bands of Mourning: A Mistborn Novel by Brandon Sanderson
Physics of the Future: How Science Will Shape Human Destiny and Our Daily Lives by the Year 2100 by Michio Kaku
Ruin (The Faithful and the Fallen) by John Gwynne
Virtual Destruction: Craig Kreident (Craig Kreident Thrillers) (Volume 1) by Kevin J Anderson, Doug Beason
Before the Awakening (Star Wars) by Greg Rucka
The Weapon of a Jedi: A Luke Skywalker Adventure by Jason Fry
Parley (Privateer Tales) (Volume 3) by Jamie McFarlane
Calamity (The Reckoners) by Brandon Sanderson
Grail (The Pendragon Cycle, Book 5) by Stephen R. Lawhead
Into the Black (Odyssey One) by Evan Currie
Avalon:: The Return of King Arthur by Stephen R. Lawhead
Meeting Infinity by Gregory Benford, James S.A. Corey, Madeline Ashby, Aliette de Bodard, Kameron Hurley, John Barnes, S
Desert Rising by Kelley Grant
Deepsix by Jack McDevitt
The Steel Remains by Richard Morgan
Child of the Daystar (The Wings of War Book 1) by Bryce O’Connor
The Terran Privateer (Duchy of Terra) (Volume 1) by Glynn Stewart
Throne Of Jade by Naomi Novik
Wireless by Charles Stross
Outriders by Jay Posey
The Vorrh by Brian Catling
The Engines Of God by Jack McDevitt
Parallel Worlds: A Journey Through Creation, Higher Dimensions, and the Future of the Cosmos by Michio Kaku
What If?: Serious Scientific Answers to Absurd Hypothetical Questions by Randall Munroe
City of Blades (The Divine Cities) by Robert Jackson Bennett
To Hold the Bridge: Tales from the Old Kingdom and Beyond by Garth Nix
Footfall by Larry Niven, Jerry Pournelle
Brandon Sanderson’s White Sand Volume 1 by Brandon Sanderson, Rik Hoskin

What is a SARGable predicate?

‘SARGable’ is a weird term. It gets bandied around a lot when talking about indexes and whether queries can seek on indexes. The term’s an abbreviation: ‘SARG’ stands for Search ARGument, and it means that the predicate can be executed using an index seek.

Lovely. So a predicate must be SARGable to be able to use an index seek, and it must be able to use an index seek to be SARGable. A completely circular definition.

So what does it actually mean for a predicate to be SARGable? (and we’ll assume for this discussion that there are suitable indexes available)

The most general form for a predicate is <expression> <operator> <expression>. To be SARGable, a predicate must have, on one side, a column rather than an expression on a column. So: <column> <operator> <expression>.

SELECT * FROM Numbers
WHERE Number = 42;

Seek1

SELECT * FROM Numbers
WHERE Number + 0 = 42;

Scan1

SELECT * FROM Numbers
WHERE Number = 42 + 0;

Seek2

Any(1) function on a column will prevent an index seek from happening, even if the function would not change the column’s value or the way the operator is applied, as seen in the case above. Zero added to an integer doesn’t change the value of the column, but it is still sufficient to prevent an index seek operation from happening.

While I haven’t yet found any production code where the predicate is literally of the form Column + 0 = @Value, I have seen many less obvious cases of functions on columns that do nothing other than prevent index seeks.

UPPER(Column) = UPPER(@Variable) in a case-insensitive database is one of them; RTRIM(Column) = @Variable is another, since SQL Server ignores trailing spaces when comparing strings.
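
In both cases the function can simply be dropped without changing the results, leaving a predicate that can seek (the table and column names below are placeholders):

-- Instead of WHERE UPPER(StringColumn) = UPPER(@Variable)   (case-insensitive collation)
-- or         WHERE RTRIM(StringColumn) = @Variable          (trailing spaces are ignored anyway)
SELECT 1
FROM dbo.SomeTable
WHERE StringColumn = @Variable;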

The other requirement for a predicate to be SARGable, for SQL Server at least, is that the column and expression are of the same data type or, if the data types differ, such that the expression will be implicitly converted to the data type of the column.

SELECT 1 FROM SomeTable
WHERE StringColumn = 0;

Scan2

SELECT 1 FROM SomeTable
WHERE StringColumn = '0';

Seek3

There are some exceptions here. Comparing a DATE column to a DATETIME value would normally implicitly convert the column to DATETIME (the more precise data type), but that doesn’t cause index scans. Neither does comparing an ASCII column to a Unicode string, at least in some collations.

In general though, conversions should be explicit and decided on by the developer, not left up to what SQL Server decides.

What about operators?

The majority are fine. Equality, inequality, IN (with a list of values) and IS NULL all allow index usage. EXISTS and IN with a subquery are treated like joins, which may or may not use indexes depending on the join type chosen.

LIKE is a slight special case. Predicates with LIKE are only SARGable if the wildcard is not at the start of the string.

SELECT 1 FROM SomeStrings
WHERE ASCIIString LIKE 'A%'

Seek4

SELECT 1 FROM SomeStrings
WHERE ASCIIString LIKE '%A'

Scan3

There are blog posts that claim that adding NOT makes a predicate non-SARGable. In the general case that’s not true.

SELECT * FROM Numbers
WHERE NOT Number > 100;

Seek5

SELECT * FROM Numbers
WHERE NOT Number <= 100;

Seek6

SELECT * FROM Numbers
WHERE NOT Number = 137;

Seek7

These index seeks are returning most of the table, but there’s nothing in the definition of ‘SARGable’ that requires small portions of the table to be returned.

That’s mostly that for SARGable in SQL Server. It’s mostly about having no functions on the column and no implicit conversions of the column.

(1) An explicit CAST of a DATE column to DATETIME still leaves the predicate SARGable. This is an exception that’s been specifically coded into the optimiser.

SQL Server 2016 features: R services

One of the more interesting features in SQL 2016 is the integration of the R language.

For those who haven’t seen it before, R is a statistical and data analysis language. It’s been around for ages, and has become popular in recent years.

R looks something like this (and I make no promises that this is well-written R); it’s taken from a Morse code-related challenge.

library(stringr) # str_split and str_c below come from the stringr package

MessageLetters <- str_split(Message, "")

MessageEncoded <- list(1:length(MessageLetters))

ListOfDots <- lapply(lapply(c(MaxCharacterLength:1), function(x) rep.int(".", times = x)), function(x) str_c(x, collapse=''))
ListOfDashes <- lapply(lapply(c(MaxCharacterLength:1), function(x) rep.int("-", times = x)), function(x) str_c(x, collapse=''))

If you’re interested in learning R, I found the Learning R book to be very good.

SQL 2016 offers the ability to run R from a SQL Server session. It’s not that SQL suddenly understands R; it doesn’t. Instead it can call out to the R runtime, pass data to it and get data back.

Installing the R components is very easy.

2016-05-23_09-39-43

And there’s an extra licence to accept.

2016-05-23_09-44-48

It’s worth noting that the pre-installed Azure gallery image for RC3 does not include the R services. Whether the RTM one will or not remains to be seen, but I’d suggest installing manually for now.

Once installed, it has to be enabled with sp_configure.

EXEC sp_configure 'external scripts enabled', 1
RECONFIGURE

It’s not currently very intuitive to use. The current way R code is run is similar to dynamic SQL, with the same inherent difficulties in debugging.

EXEC sp_execute_external_script
  @language = N'R',
  @script = N'data(iris)
    OutputDataSet <- head(iris)'
  WITH RESULT SETS (([Sepal.Length] NUMERIC(4,2) NOT NULL, [Sepal.Width] NUMERIC(4,2) NOT NULL, [Petal.Length] NUMERIC(4,2) NOT NULL, [Petal.Width]  NUMERIC(4,2) NOT NULL, [Species] VARCHAR(30)));
go

It’s possible to pass data in as well, using a parameter named @input_data_1 (there’s no @input_data_2) and, from what I can tell from the documentation, @parameter1, which takes a comma-delimited list of values for parameters defined with @params. There are no examples using these that I can find, so it’s a little unclear how they precisely work.

See https://msdn.microsoft.com/en-us/library/mt604368.aspx and https://msdn.microsoft.com/en-us/library/mt591993.aspx for more details.
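
Piecing those two pages together, I’d guess that passing data and parameters in looks something like the following. It’s untested, so treat the details (particularly how the @params values surface inside the script) as assumptions:

EXEC sp_execute_external_script
  @language = N'R',
  @script = N'OutputDataSet <- subset(InputDataSet, SomeCol > Cutoff)',
  @input_data_1 = N'SELECT SomeCol FROM dbo.Test',
  @params = N'@Cutoff int',
  @Cutoff = 50
WITH RESULT SETS ((SomeCol int));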

It’s not fast. The iris example above took ~4 seconds to execute. This is on an Azure A3 VM. Not a great machine admittedly, but the R code, which just returns the first 6 rows of a built-in data set, ran in under a second on my desktop. This is likely not something you’ll be doing as part of an OLTP process.

I hope this external_script method is temporary. It’s ugly, hard to troubleshoot, and it means I have to write my R somewhere else, probably R Studio, maybe Visual Studio, and move it over once tested and working. I’d much rather see something like

CREATE PROCEDURE GetIrisData
  WITH Language = 'R' -- or USQL or Python or …
  AS
…
GO

Maybe in SQL Server 2020?

Upcoming conferences

It’s shaping up to be a busy year for conferences, well, busy by my standards that is. While I’m unfortunately missing SQLBits, I’ll still be getting a chance to enjoy an English summer.

InsideSQL

The InsideSQL conference is a new conference organised by Neil Hambly. It’s a deep-dive conference, with longer sessions than many conferences offer, and an opportunity to dig deep into topics.

I’m presenting two sessions there: first a look at SQL waits, why there are waits and what the various wait types mean; second a nice deep discussion on SQL Server indexes.

SQLSaturday Iceland

Iceland has been on my to-visit list for some time, so a SQLSaturday there? Perfect excuse for a short visit and a bit of exploration. I wonder if there’s any chance I’ll get to see the Aurora Borealis.

I’m doing a full-day precon on the Thursday (Friday is an Icelandic public holiday) on execution plans, as well as a regular session on Query Store on the Saturday.

South African SQLSaturdays

Fast forward to September, the South African SQL Saturdays are again running on back-to-back weekends. Johannesburg on the 3rd of September, Cape Town on the 10th and Durban planned for the 17th.

We’d love to have more international speakers join us for these. The local weather is lovely in September, and the exchange rate to the dollar/pound so poor that you won’t believe how cheap things are here. Come down for a two week African holiday, and get three SQLSaturday presentations in on the side.

SQL 2016 features: Stretch Database

Stretch database allows for a table to span an ‘earthed’ SQL Server instance and an Azure SQL Database. It allows for parts (or all) of a table, presumably older, less used parts, to be stored in Azure instead of on local servers. This could be very valuable for companies that are obliged to retain transactional data for long periods of time, but don’t want that data filling up the SAN/flash array.

After having played with it, as it is in RC2, I have some misgivings. It’s still a useful feature, but probably not as useful as I initially assumed when it was announced.

To start with, the price. Stretch is advertised as an alternative to expensive enterprise-grade storage. The storage part is cheap, it’s costed as ‘Read-Access Geographically Redundant Storage’ blob storage.

PriceStorage

Then there are the compute costs.

PriceCompute

The highest tier is 2000 DSU at $25/hour. To compare the costs to SQL Database, a P2 has the same compute costs as the lowest tier of Stretch, and that’s with a preview discount applied to Stretch. It’s going to be a hard sell to my clients at that price (though that may be partially because of the R15=$1 exchange rate).

The restrictions on what tables are eligible are limiting too. The documented forbidden data types aren’t too much of a problem. This feature’s intended for transactional tables, maybe audit tables, and the disallowed data types are the complex ones: HierarchyID, Geography, XML, SQL_Variant.

A bigger concern is the disallowed features. No computed columns, no defaults, no check constraints, and the table can’t be referenced by a foreign key. I can’t think of too many transactional tables I’ve seen that don’t have one or more of those.

It’s looking more like an archive table, specifically designed to be stretchable, will be needed, rather than stretching the transactional table itself. I haven’t tested whether it’s possible to stretch a partitioned table (or partition a stretched table) in order to partition switch into a stretched table. If it is, that may be the way to go.

I have another concern about stretch that’s related to debugging it. When I tested in RC2, my table was listed as valid by the stretch wizard, but when I tried, the ALTER TABLE succeeded but no data was moved. It turned out that the Numeric data type wasn’t allowed (a bug in RC2 I suspect, not an intentional limitation), but the problem wasn’t clear from the stretch-related DMVs. The problem is still present in RC3.
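
For reference, the statements involved and the check I was doing look roughly like this; the table name is a placeholder, and the database-level enablement done by the wizard is left out:

-- Instance-level switch
EXEC sp_configure 'remote data archive', 1;
RECONFIGURE;

-- Stretch the table (the form of ALTER TABLE involved)
ALTER TABLE dbo.SomeAuditTable
    SET (REMOTE_DATA_ARCHIVE = ON (MIGRATION_STATE = OUTBOUND));

-- The migration DMV, which didn't surface the actual error
SELECT * FROM sys.dm_db_rda_migration_status;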

StretchDMV

The actual error message was nowhere to be found. The new built-in Extended Events session specifically for stretch tables was of no additional help.

StretchXE

The error log contained a different message, but still not one that pinpointed the problem.

This blog post was based on RC2 and written before the release of RC3, however post RC3 testing has shown no change yet. I hope at least the DMVs are expanded before RTM to include actual error messages and more details. We don’t need new features that are hard to diagnose.

As for the other limitations, I’m hoping that Stretch will be like Hekaton, very limited in its first version and expanded out in the next major version. It’s an interesting feature with potential, I’d hate to see that potential go to waste.