
Tuesday, August 26, 2014

JOINs and Indexes

A couple of days ago, while running some tests to track down some timeouts in the DB, I discovered that the JOIN involved had no supporting index on either end. So my first idea was to create both indexes, and then I was asked: "Is it necessary to create an index on both tables when doing a join between them?"

My first answer was "Yes, it's a good practice to do that". But the question kept hanging in my head: is it REALLY necessary? So I went straight to SSMS to find out.

This is the scenario:
(Yes, there is no FK between them, and one cannot be created because, strangely, there are Child.ParentId values with no corresponding Parent.ParentId. And no, the affected rows cannot be deleted.)
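
The diagram isn't reproduced here, but roughly, the relevant parts of the two tables look like this (just a sketch: the column types are assumed, and the real tables have more columns):

create table Parent (
  ParentId    int          not null,
  Description nvarchar(60) not null              -- type assumed
)

create table Child (
  ChildID     int          not null
    constraint PK_Child primary key clustered,
  ParentId    int          not null,             -- no FK, as explained above
  Description nvarchar(60) not null              -- type assumed
)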


Indexes are the following:

Table Child (approx. 3.8M rows)

  Index_name             Index_description                                   Index_keys
  IX_Child_Description   nonclustered located on PRIMARY                     Description
  PK_Child               clustered, unique, primary key located on PRIMARY   ChildID

Table Parent (approx. 5.6M rows)

  Index_name              Index_description                 Index_keys
  IX_Parent_Description   nonclustered located on PRIMARY   Description
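
These listings are simply the output of sp_helpindex, in case you want to reproduce them:

exec sp_helpindex 'Child'
exec sp_helpindex 'Parent'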



There is no index on ParentId in either table. In a normal scenario the FK would exist, and a unique index on Parent.ParentID would exist along with it.

The query was similar to this one:
select * from Child c
inner join Parent p on p.ParentID = c.ParentID
where p.Description = 'XXXXXXXXXXX'
So, as you can see, the predicate filters on the Parent table only. Let's see the resulting execution plan for this query:



There is an Index Scan on the Child table, because there is no index on the Child.ParentId column (even SSMS suggests creating an index on it). Let's first create an index on Parent.ParentID, then, and re-execute the query:
create unique clustered index PX_Parent on Parent(ParentId)

Better, but there is still an Index Scan on the Child table. Let's now create the index on Child.ParentId and re-execute the query:
create index IX_Child_ParentId on Child(ParentId)
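
A quick way to double-check the effect, besides looking at the graphical plan, is to compare the logical reads before and after creating the index (a side note, not part of the original test):

set statistics io on

select * from Child c
inner join Parent p on p.ParentID = c.ParentID
where p.Description = 'XXXXXXXXXXX'

set statistics io off
-- the Messages tab shows the logical reads per table;
-- the reads on Child should drop sharply once the seek kicks in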
Now an Index Seek on the new index is being done, so both indexes were used. In this scenario, creating both was useful. OK, so far I've said nothing new. But what if I change the query like this:
select * from Child c
inner join Parent p on p.ParentID = c.ParentID
where c.Description = 'XXXXXXXXXXX'
So instead of filtering on the Parent table, I now filter on the Child table. Is the index on Child.ParentId still useful? Let's see...


No, the new index was not used. So, if I drop it, the execution plan should remain the same:
drop index Child.IX_Child_ParentId
The resulting execution plan:


So, the index on Child.ParentId is used only when I filter on the Parent table first, so that the retrieved rows are then used to probe the Child table. In this second scenario, creating both indexes wasn't necessary.
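
As an aside, the index suggestion SSMS showed earlier comes from the missing-index DMVs, so you can also query them directly. A rough sketch, ordered by the estimated impact:

select d.statement as table_name
     , d.equality_columns
     , d.inequality_columns
     , d.included_columns
     , s.user_seeks
     , s.avg_user_impact
from sys.dm_db_missing_index_details d
inner join sys.dm_db_missing_index_groups g
        on g.index_handle = d.index_handle
inner join sys.dm_db_missing_index_group_stats s
        on s.group_handle = g.index_group_handle
order by s.avg_user_impact desc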

Summarizing: creating indexes on both tables when joining is a good practice, yes, but not always necessary. If the scenario had been only the second query, creating an index on Parent.ParentId alone would have solved the problem. Keep in mind that this can vary depending on the number of rows in the tables, the statistics, different predicates, etc.

In the end, one would think that creating both indexes is a good idea. Truth be told, it's up to the dev to check whether it's really necessary. Disk space can be an important constraint if the table has a large number of rows. In another context, for instance a "chaotic" environment in which queries change a lot, it could be a good idea to create both indexes "just in case" to avoid potential performance issues... As always, it's up to the DBA to decide whether to create both indexes or not.

Wednesday, October 2, 2013

Different table alias => Different execution plan


So, I finally decided to start writing a little about my SQL experiences throughout my career as a database person.

For my first entry, I decided to write about a problem I had a year ago: a query that performed very poorly under certain conditions, mostly because of SQL parameter sniffing. I had the following scenario:
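
The diagram isn't shown here; schematically, the two tables looked like this (a sketch, with the column types assumed):

create table Subscription (
  SubscriptionId     int          not null primary key,
  SubscriptionName   nvarchar(60) not null,
  SubscriptionNumber nvarchar(20) not null      -- type assumed
)

create table Client (
  ClientId       int           not null primary key,
  ClientName     nvarchar(100) not null,        -- type assumed
  SubscriptionId int           not null         -- joins to Subscription
)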


And the problematic query was the following (inside an SP, of course):
declare @SubscriptionName   nvarchar(60)

select @SubscriptionName = SubscriptionName
from Subscription where SubscriptionId = 246939

select
          c.ClientId
         ,c.ClientName                 
         ,su.SubscriptionNumber   
from Client c
inner join Subscription su
       on c.SubscriptionId = su.SubscriptionId
where su.SubscriptionName = @SubscriptionName
The parameter sniffing occurred because of the distribution of the SubscriptionName column in the Subscription table: we have a lot of duplicated names across different SubscriptionIDs. So, for instance, we have:
select count(*) as [ = 1 ] from (
select SubscriptionName, count(*) as SubscriptionCount
from Subscription su
group by SubscriptionName
having count(*) = 1 ) as Subscriptions
= 1
-----------
26275
select count(*) as [ > 1] from (
select SubscriptionName, count(*) as SubscriptionCount
from Subscription su
group by SubscriptionName
having count(*) > 1 ) as Subscriptions
> 1
-----------
11093

So: a lot of non-repeated names versus a fair number of repeated ones. But a deeper analysis showed this:
select count(*) as [ >= 1000] from (
select SubscriptionName, count(*) as SubscriptionCount
from Subscription su
group by SubscriptionName
having count(*) >= 1000 ) as Subscriptions
>= 1000
-----------
17

select count(*) as [ < 1000]  from (
select SubscriptionName, count(*) as SubscriptionCount
from Subscription su
group by SubscriptionName
having count(*) < 1000 ) as Subscriptions
< 1000
-----------
37351
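
For what it's worth, those four counts can also be gathered in a single pass (a compact rewrite, not how I originally ran them):

select
    sum(case when SubscriptionCount  =    1 then 1 else 0 end) as [ = 1 ]
  , sum(case when SubscriptionCount  >    1 then 1 else 0 end) as [ > 1 ]
  , sum(case when SubscriptionCount >= 1000 then 1 else 0 end) as [ >= 1000 ]
  , sum(case when SubscriptionCount  < 1000 then 1 else 0 end) as [ < 1000 ]
from (
  select SubscriptionName, count(*) as SubscriptionCount
  from Subscription
  group by SubscriptionName
) as Subscriptions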

So now we have a LOT of names with low repetition, and very few (in comparison) highly repeated ones. When a name from one group happened to be the one used to build the execution plan, the resulting plan was not adequate for the other kind of distribution (and vice versa).

I couldn't change the table's structure (nor create any index), and I was working on SQL 2005 at the time, so a lot of the possible optimizations could not be applied. The RECOMPILE option was ruled out as well: the SP was HEAVILY used, and the DBA was explicitly against it. So I had to deal with this in a different way, and he (the DBA) proposed using different aliases for the Subscription table according to the distribution of the name used for the search.

So, if the name count was above 1000, one alias should be used, and if not, the default alias:
declare
  @SubscriptionId     int,
  @SubscriptionName   nvarchar(60),
  @sql                nvarchar(max),
  @alias              nvarchar(6)

-- @SubscriptionId is the SP's parameter in the real code;
-- it is set here only so the snippet runs on its own
select @SubscriptionId = 246939

select @alias = N'su'

select @SubscriptionName = SubscriptionName
from Subscription
where SubscriptionId = @SubscriptionId

if (select count(SubscriptionName) 
   from Subscription 
   where SubscriptionName = @SubscriptionName) > 1000
  select @alias = N'sus'

set @sql =
  N'select top 100
    c.ClientId
  , c.ClientName                 
  , ' + @alias + '.SubscriptionNumber   
  from Client c
  inner join Subscription ' + @alias + '
  on c.SubscriptionId = ' + @alias + '.SubscriptionId
  where ' + @alias + '.SubscriptionName = @SubscriptionName'

exec sp_executesql @sql, 
    N'@SubscriptionName nvarchar(60)', 
      @SubscriptionName
This way, each alias produced a different query text, so I was forcing the optimizer to compile and cache a separate execution plan for each one. If the SubscriptionName distribution was under 1000, this was the generated execution plan:


But, if the SubscriptionName distribution was above 1000, this happened:


If I used the same alias in both cases, the execution plan was the same, no matter what distribution the value had. So, this way I was able to force the optimizer to deal with the name distribution issue we had (and still have) on that table in two different ways. I think it's a nice way to handle some specific parameter sniffing problems, when you're not able to use the "normal" ways of dealing with them.
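
If you want to confirm that each alias really gets its own cached plan, you can peek at the plan cache with the standard DMVs (a rough sketch; the text filter is just illustrative):

select st.text, qs.execution_count, qs.creation_time
from sys.dm_exec_query_stats qs
cross apply sys.dm_exec_sql_text(qs.sql_handle) st
where st.text like N'%SubscriptionName = @SubscriptionName%'
-- two entries should show up, one per alias, each with its own plan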

OK, that's it, thanks for reading! This is my first time writing an IT blog, so any comments and suggestions are highly welcome. Stay tuned: in the next installment I'll be talking about running background traces in SQL Server to help with debugging.