Wednesday, August 27, 2014

NULL vs Empty Strings

A lot of situations I had to face, info I had to learn, or study I had to make about SQL came from a job colleague asking me something. This is no different, since this time the question was: "I have an address field. It is better to store the empty fields as NULL or as empty strings?"

Ok, let's get down to business. This is a rather easy test, since we only need to create two tables, one with a varchar (or nvarchar) NULLable field , and another table with the same datatype, but non NULLable field. After that, fill the first table with NULL data, and the second one with empty strings. Also fill both tables with random numbers. This is the creation/filling script:
create table temp_null(id int identity primary key, data nvarchar(4000) NULL)
create table temp_emptystring(id int identity primary key, data nvarchar(4000) NOT NULL)

declare @counter int = 0
while (@counter < 100000)
begin
 insert into temp_null(data) values(NULL)
 insert into temp_emptystring(data) values('')

 set @counter = @counter + 1
end

set @counter = 0
while (@counter < 100000)
begin
 insert into temp_null(data) values(abs(checksum(@counter)))
 insert into temp_emptystring(data) values(abs(checksum(@counter)))

 set @counter = @counter + 1
end
Let's also create some indexes on both tables on the data column:
create index IX_temp_null_data on temp_null(data)

create index IX_temp_emptystring on temp_emptystring(data)
Ok, now with the tables/index created , let's check different things. First, space used:
exec sp_spaceused 'temp_null'
exec sp_spaceused 'temp_emptystring'
name rows reserved data index_size unused
temp_null 200000 11544 KB 6720 KB 3752 KB 1072 KB
name rows reserved data index_size unused
temp_emptystring 200000 10392 KB 6720 KB 3144 KB 528 KB

Less space used by date on the empty string option, and also less space used on index on the same option, ok. Next, search performance check. For that, let's execute some search queries. The search will be done on the data column, if it's NULL, or equal to an empty string, and also if it's NOT NULL, or different than an empty string:
select * from temp_null
where data is NULL

select * from temp_emptystring
where data = ''

select * from temp_null
where data is not NULL

select * from temp_emptystring
where data != ''
These are the results:

Lower logical reads and lower execution plan consumption when using empty strings. So using NULL values is not so performant as using empty strings. Good info to have in mind when designing new tables and CRUDs.

Tuesday, August 26, 2014

JOINs and Indexes

A couple of days ago, when performing some tests to check some timeouts in the DB, I discovered that the JOIN being done had no indexes in neither of both ends. So the first idea was to create both indexes, and then I was asked "Is it necessary to create index in both tables when doing a join between them?"

My first answer was "Yes, it's a good practice to do that". But the question kept hanging in my head: is it REALLY necessary to do so? So I went directly to the MSSMS to find out.

This is the scenario:
(Yes, no FK between them, and cannot be created because there were Child.ParentId that had no corresponding Parent.ParentId, strangely. No, the affected rows cannot be deleted)


Indexes are the following:

Table Child (3.8m rows approx)

  Index_name   Index_description Index_keys
IX_Child_Description nonclustered located on PRIMARY Description
PK_Child clustered, unique, primary key located on PRIMARY ChildID

Table Parent (5.6 m rows approx)

  Index_name   Index_description Index_keys
IX_Parent_Description nonclustered located on PRIMARY Description



No index on ParentId on neither of both tables. In a normal scenario, since the FK would exist, a Unique index on Parent.ParentID would also exist. 

The query was similar to this one:
select * from Child c
inner join Parent p on p.ParentID = c.ParentID
where p.Description = 'XXXXXXXXXXX'
So, as you can see the predicate is filtering the Parent table only. Let's see the resulting execution plan for this query:



There is a Index Scan being done on the Child table, because there is no index on the Child.ParentId column (Even the MSSMS suggests creating an index on that column). Let's first create and index on Parent.ParentID then, and re execute the query:
create unique clustered index PX_Parent on Parent(ParentId)

Better, but still an Index Scan on the Child table. Let's create the index and re execute the query:
create index IX_Child_ParentId on Child(ParentId)
Now, an Index Seek on the new index is being done. So, both indexes were used. In this scenario, creating both was useful. Ok, so far, I've said nothing new. But, what if I change the query like this:
select * from Child c
inner join Parent p on p.ParentID = c.ParentID
where c.Description = 'XXXXXXXXXXX'
So instead of filtering on the Parent table, I filter on the Child table. Is the index on Child.ParentId still useful? Let see...


No, the new index was not used. So, should I drop it, the execution plan should remain the same:
drop index Child.IX_Child_CommonId
The resulting execution plan:


So, the index is used only if I filter first on the Parent table, and then the retrieved rows are used to filter in the Child table. In this scenario, creating both indexes wasn't necessary. 

Summarizing, creating indexes on both tables when joining is a good practice, yes, but not always practical. If the scenario would have been the second query, creating an index only in the Parent.ParentId table would have solved the problem. Have in mind that this could differ depending on the rows on the table, statistics, different predicates, etc...

In the end, one would think that creating both indexes seems a good idea. Truth be told, is up to the dev to check if it's really necessary. Disk space could be an important constraint, if the table has a large amount of rows. In another context, for instance in a "chaotic" environment in which queries change a lot, it would be a good idea to create both indexes "just in case" to avoid potential performance issues ... As always, is up to the DBA to decide whether to create both indexes or not.