A Little Bit of SQL
Different stuff I had to face (and learn) during my work as a DB person

Tuesday, October 4, 2016

Following last week's post about SQL Forbidden Patterns: I also work with a Best Practices checklist that helps developers know what to do (and what not to do) to get the best possible scripts and SQL solutions before handing them over to me for code review.

As you can see, it is related to the previous post I published here on the blog. I personally think it is very helpful when developing SQL scripts and, most important of all, it eases my job ;-)

Wednesday, September 28, 2016

SQL Forbidden Patterns

A couple of years ago I had to give a class on the SQL Forbidden Patterns that were arising among the developers at the company I work for. A couple of weeks ago I had to revisit that class, and I thought it would be a good idea to share its content publicly.
Inside the ZIP file you'll find info on how to avoid specific patterns that negatively affect the performance of the SQL Server, and some other patterns that don't, but for which we had to define a standard to keep everyone from going their own way (all with examples):
- UDFs
- Triggers
- Cursors
- Datatypes
- Ranking Functions
- WHERE/JOIN clauses
- DISTINCT
- String Delimiters
- System Tables
- IDENTITY
- ID Values
- DateTime format
- GO Commands
- Common mistakes
And here it is! SQL Forbidden Patterns
Let me know your thoughts in the comments section!
Special thanks to Guillaume Columeau, who helped me (back then) to build up the content of the course.
Wednesday, December 3, 2014
Using Cross Database Ownership Chaining and Views
In this post I'll talk about an error we faced when we started working with integrated security in our web application and its connection to SQL Server:
Msg 916, Level 14, State 1, Procedure ViewClients, Line 3
The server principal "Login2" is not able to access the database "Blog" under the current security context.
First, let me show you the scenario:
We have a web application on two different web servers, both pointing to the same SQL Server instance, but each app to a different DB. The tricky part here is that DB B is built on views over DB A, so it has no "real" tables, only views.

Programmatically, we use SPs to access the data, so the SPs in DB B execute selects on the views in DB B, which are created from DB A.
So, for integrated security, we wanted to create two different logins, each one mapped to a different user in each DB:
Web Server A -> Login A -> User A -> DB A
Web Server B -> Login B -> User B -> DB B
The problem arose the moment we tried to access the app on Web Server B: we received the error message mentioned above.
The first option was to create User B in DB A and grant it the same permissions as User A. But that way we would be granting access and select permissions to User B on DB A, and this was not what we wanted. So what to do? After doing a little research, I came up with Cross Database Ownership Chaining. Briefly, with this option we could avoid giving User B permissions on DB A while still allowing the SPs and views in DB B to access the data in DB A. After enabling it, we could execute the SPs in DB B with no problems.
To check if a DB has the chaining option enabled, you can use this query:
select name, is_db_chaining_on
from sys.databases
where name in ('<dbname here>')

Here is a fully functional code example of the analyzed scenario.
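As a minimal sketch of that scenario (the login, password, and object names below are placeholders of mine, assuming DB_A and DB_B already exist):

-- login used by Web Server B (placeholder name and password)
use master
create login Login2 with password = 'StrongP@ssword1'
go

-- DB A holds the real tables; Login2 gets a user here only so the
-- ownership chain can cross into the database. Note that no object
-- permissions are granted on dbo.Clients.
use DB_A
go
create table dbo.Clients(ClientID int identity primary key, Name nvarchar(100))
create user User2 for login Login2
go

-- DB B holds only views over DB A
use DB_B
go
create view dbo.ViewClients
as
select ClientID, Name from DB_A.dbo.Clients
go
create user User2 for login Login2
grant select on dbo.ViewClients to User2
go

-- both databases must have chaining enabled, and the view and the table
-- must share the same owner (dbo here) for the chain to stay unbroken
alter database DB_A set db_chaining on
alter database DB_B set db_chaining on

With this in place, Login2 can select from DB_B.dbo.ViewClients without holding any permission on the underlying table in DB A.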
Of course, if misused, this option can lead to security holes, so don't be too quick to enable it. Here is an article about the possible dangers of enabling Cross-Database Chaining.
Wednesday, August 27, 2014
NULL vs Empty Strings
A lot of the situations I've had to face, the info I've had to learn, and the research I've had to do about SQL came from a colleague asking me something. This is no different, since this time the question was: "I have an address field. Is it better to store the empty fields as NULL or as empty strings?"
OK, let's get down to business. This is a rather easy test, since we only need to create two tables: one with a NULLable varchar (or nvarchar) field, and another with the same datatype but a non-NULLable field. After that, fill the first table with NULLs and the second one with empty strings, and also fill both tables with random numbers. This is the creation/filling script:
create table temp_null(id int identity primary key, data nvarchar(4000) NULL)
create table temp_emptystring(id int identity primary key, data nvarchar(4000) NOT NULL)

declare @counter int = 0
while (@counter < 100000)
begin
    insert into temp_null(data) values(NULL)
    insert into temp_emptystring(data) values('')
    set @counter = @counter + 1
end

set @counter = 0
while (@counter < 100000)
begin
    insert into temp_null(data) values(abs(checksum(@counter)))
    insert into temp_emptystring(data) values(abs(checksum(@counter)))
    set @counter = @counter + 1
end

Let's also create some indexes on both tables on the data column:
create index IX_temp_null_data on temp_null(data)
create index IX_temp_emptystring on temp_emptystring(data)

OK, now with the tables and indexes created, let's check different things. First, space used:
exec sp_spaceused 'temp_null'
exec sp_spaceused 'temp_emptystring'
| name | rows | reserved | data | index_size | unused |
| temp_null | 200000 | 11544 KB | 6720 KB | 3752 KB | 1072 KB |
| temp_emptystring | 200000 | 10392 KB | 6720 KB | 3144 KB | 528 KB |
Less space reserved with the empty-string option, and also a smaller index. OK. Next, a search performance check. For that, let's execute some search queries on the data column: whether it is NULL or equal to an empty string, and also whether it is NOT NULL or different from an empty string:
select * from temp_null where data is NULL
select * from temp_emptystring where data = ''
select * from temp_null where data is not NULL
select * from temp_emptystring where data != ''

These are the results:
Lower logical reads and a cheaper execution plan when using empty strings. So NULL values are not as performant as empty strings. Good info to keep in mind when designing new tables and CRUDs.
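If you want to reproduce the logical-read numbers yourself, one simple way (this snippet is mine, not part of the original test script) is to wrap the queries with SET STATISTICS IO:

-- logical reads per statement show up in the Messages tab
set statistics io on

select * from temp_null where data is NULL
select * from temp_emptystring where data = ''

set statistics io off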
Tuesday, August 26, 2014
JOINs and Indexes
A couple of days ago, while running some tests to investigate timeouts in the DB, I discovered that the JOIN being executed had no supporting index on either side. So the first idea was to create both indexes, and then I was asked: "Is it necessary to create an index on both tables when doing a join between them?"
My first answer was "Yes, it's a good practice to do that". But the question kept hanging in my head: is it REALLY necessary to do so? So I went straight to SSMS to find out.
This is the scenario:
(Yes, there is no FK between them, and it cannot be created because, strangely, there were Child.ParentId values with no corresponding Parent.ParentId. And no, the affected rows cannot be deleted.)

The indexes are the following:
Table Child (3.8m rows approx)
Index_name | Index_description | Index_keys |
IX_Child_Description | nonclustered located on PRIMARY | Description |
PK_Child | clustered, unique, primary key located on PRIMARY | ChildID |
Table Parent (5.6m rows approx)
Index_name | Index_description | Index_keys |
IX_Parent_Description | nonclustered located on PRIMARY | Description |
There is no index on ParentId in either table. In a normal scenario, since the FK would exist, a unique index on Parent.ParentID would also exist.

The query was similar to this one:
select *
from Child c
inner join Parent p on p.ParentID = c.ParentID
where p.Description = 'XXXXXXXXXXX'

So, as you can see, the predicate filters on the Parent table only. Let's see the resulting execution plan for this query:
There is an Index Scan on the Child table, because there is no index on the Child.ParentId column (even SSMS suggests creating an index on that column). Let's first create an index on Parent.ParentID, then, and re-execute the query:
create unique clustered index PX_Parent on Parent(ParentId)

Better, but still an Index Scan on the Child table. Let's create the index on Child.ParentId as well and re-execute the query:

create index IX_Child_ParentId on Child(ParentId)
Now an Index Seek on the new index is being done, so both indexes were used; in this scenario, creating both was useful. OK, so far I've said nothing new. But what if I change the query like this:
select *
from Child c
inner join Parent p on p.ParentID = c.ParentID
where c.Description = 'XXXXXXXXXXX'

So instead of filtering on the Parent table, I filter on the Child table. Is the index on Child.ParentId still useful? Let's see...
No, the new index was not used. So, if I drop it, the execution plan should remain the same:
drop index Child.IX_Child_ParentId

The resulting execution plan:
So the index is used only if I filter on the Parent table first, and the retrieved rows are then used to filter the Child table. In this scenario, creating both indexes wasn't necessary.

Summarizing: creating indexes on both tables when joining is a good practice, yes, but not always necessary. Had the scenario been the second query, creating an index only on Parent.ParentId would have solved the problem. Keep in mind that this can differ depending on the number of rows in the table, the statistics, different predicates, etc.

In the end, one might think that creating both indexes seems a good idea. Truth be told, it's up to the dev to check whether it's really necessary. Disk space can be an important constraint if the table has a large number of rows. In another context, for instance a "chaotic" environment in which queries change a lot, it might be a good idea to create both indexes "just in case" to avoid potential performance issues... As always, it's up to the DBA to decide whether to create both indexes or not.
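By the way, a quick way to check afterwards whether an index is really being used (my own addition, assuming the Child table above) is the index usage stats DMV:

-- seeks/scans/lookups recorded for each index on Child since the last restart
select i.name as index_name, s.user_seeks, s.user_scans, s.user_lookups
from sys.indexes i
left join sys.dm_db_index_usage_stats s
    on s.object_id = i.object_id
    and s.index_id = i.index_id
    and s.database_id = db_id()
where i.object_id = object_id('Child')

If user_seeks and user_scans stay at zero (or NULL) over a representative workload, the index is a candidate for dropping.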
Thursday, June 26, 2014
Replacing ORs by using UNION ALL
A common problem I often find in SPs is the use of OR in the WHERE clause. It's very common for the business to ask to be able to retrieve something given certain conditions affecting several fields of a table, like getting a person by their first name OR by their last name (or both). Although OR seems the easiest way to solve this (and yes, it is, from a human POV), the main problem with using it in predicates is that it is not scalable.
Depending on the distribution of data across rows, or even the number of rows in a table, ORs can drastically affect performance, since the engine can switch from choosing an Index Seek to an Index Scan. This is not always bad, since depending on the data it is sometimes the better choice, but in this case we can avoid the problem by replacing the OR with a UNION ALL.
For instance, let's check this scenario:
We have the table Subscriptions, which has more than 5M rows and the following indexes (among others):
PK {SubscriptionID}
IX {PolicyNumber}
IX {AssociationNumber}
IX {IsInternal, PolicyNumber}
And this query:
select *
from Subscriptions
where IsInternal = '1'
and (PolicyNumber like @sSubscriptionNu or AssociationNumber like @sSubscriptionNu)

You see the OR in the predicate? If we execute this query and check the execution plan, the engine cannot use the IXs and goes directly to the PK, causing an index scan:
OK, let's work on the query then, removing the OR by using a UNION ALL:
select *
from (
    select SubscriptionID, IsInternal from Subscriptions where PolicyNumber like @sSubscriptionNu
    union all
    select SubscriptionID, IsInternal from Subscriptions where AssociationNumber like @sSubscriptionNu
) S
where S.IsInternal = '1'

And let's check the execution plan now:
At first sight (even though it looks more complex) we can see an improvement in the execution plans: the first one costs 75% of the batch, the second one 25%. Now let's compare the results of both versions (old and new) using the SQL Profiler:

Here we can see the logical reads have dropped from 25599 to 379! A great improvement indeed.
Of course, there are a lot of other scenarios where an OR is used and can be replaced by a UNION ALL, but the idea is the same; it is just a bit more complicated to refactor the SQL code.
As we saw, replacing the OR with a UNION ALL is better, but not always. If we execute the same query over an empty table (an extreme case, I know), the OR is better. But since normal operation in a production environment means a table filled with a lot of rows, using a UNION ALL is preferred. So we can say that UNION ALL is more scalable than OR: with the OR, any change in the statistics or the table contents, or even a simple change in the query, can turn an efficient query into a sluggish one.
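If you want to play with the pattern yourself, here is a self-contained toy version of the rewrite (a sketch of mine; the table and index names are not the real Subscriptions schema):

-- toy table with two independently indexed search columns
create table dbo.Subs(
    SubID int identity primary key,
    IsInternal char(1) not null,
    PolicyNumber varchar(20) not null,
    AssociationNumber varchar(20) not null
)
create index IX_Subs_Policy on dbo.Subs(PolicyNumber)
create index IX_Subs_Association on dbo.Subs(AssociationNumber)
go

declare @sSubscriptionNu varchar(20) = '12345%'

-- OR version: tends to fall back to a scan on the clustered index
select *
from dbo.Subs
where IsInternal = '1'
and (PolicyNumber like @sSubscriptionNu or AssociationNumber like @sSubscriptionNu)

-- UNION ALL version: each branch can seek its own index
select *
from (
    select SubID, IsInternal from dbo.Subs where PolicyNumber like @sSubscriptionNu
    union all
    select SubID, IsInternal from dbo.Subs where AssociationNumber like @sSubscriptionNu
) S
where S.IsInternal = '1'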

Friday, April 4, 2014
Finding Serializable transactions
Situation: the main app in my company was creating Serializable transactions, and thus creating Range Locks in the production DB.
So, we had to modify the app to change the way transactions were being generated, since we found that the framework was using TransactionScope and, as you may know (or not), its default isolation level for new connections is Serializable.
So, the app was modified, and after that my boss came over to my desk and told me: "We need to know if there are any Serializable transactions in the XXX environment. Can you check that?" Challenge accepted!
Since I had no direct access to the server, and therefore could not leave a SQL Profiler running, I had to think of a workaround. My idea was to use SQL Server traces, specifically the 'Audit Login' event (I have to thank Guillaume Columeau for that, a SQL Server expert in Paris). The TextData column of the trace shows the connection's isolation level; in fact, all the connection properties are displayed in that column.
First of all, I had to try the idea in my local environment. So, after creating the trace (I'm not going to go into detail about the tools you can use to create them, since there are a lot of places on the Internet for that; just Google it ;-), I ran the app and went through some random flows that used the DB, using a previous version of the app that I knew for sure was creating serializable transactions. This is what I found:
select TextData, HostName, ApplicationName
from fn_trace_gettable('C:\Program Files\Microsoft SQL Server\MSSQL10_50.INTEGRATION\MSSQL\LOG\INTEGRATION$DBM_Logins.trc', default)
(I truncated the TextData field, since it's a little bit long)
So, indeed, as I was expecting, the app was generating serializable transactions. Next, I had to identify the SP being executed inside each one. From the traces we can also retrieve the execution start time (in the StartTime column).
The SQL instance also has a trace running for queries, so all I had to do was check which SPs were executed in that timeframe:
select ObjectName
from fn_trace_gettable('C:\Program Files\Microsoft SQL Server\MSSQL10_50.INTEGRATION\MSSQL\LOG\INTEGRATION$DBM_Queries_And_Locks.trc', default)
where StartTime >= '2014-02-17T16:20:21.390'
and EndTime <= '2014-02-17T16:30:44.053'
Searching for that SP inside the app, I could confirm that it was using TransactionScope with default values, so SERIALIZABLE was indeed being used. Remember that this was a previous version, before the change I mentioned at the beginning of this blog entry, so it was the expected result. With that, the POC was done, and I was ready to move to the real environment.
End of story: I replicated all of this in the environment my boss asked me to check (with help from the DBA, the same guy I mentioned before) and, happily, there were no serializable transactions being generated, so the app had been correctly modified.
I'm attaching the code for creating (and stopping) the trace: LoginTrace.zip
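Since the ZIP is only linked, here is a minimal sketch of what a trace like this can look like (the output path and max file size are assumptions of mine; the event and column IDs are the documented ones for Audit Login, TextData, HostName, and ApplicationName):

-- create the trace definition (the server appends .trc to the path)
declare @traceid int, @maxsize bigint = 5
exec sp_trace_create @traceid output, 0, N'C:\Temp\DBM_Logins', @maxsize

-- event 14 = Audit Login; columns: 1 = TextData (holds the connection
-- properties, including the isolation level), 8 = HostName, 10 = ApplicationName
exec sp_trace_setevent @traceid, 14, 1, 1
exec sp_trace_setevent @traceid, 14, 8, 1
exec sp_trace_setevent @traceid, 14, 10, 1

-- start the trace
exec sp_trace_setstatus @traceid, 1

-- when you are done: stop it (0) and close/delete the definition (2)
-- exec sp_trace_setstatus @traceid, 0
-- exec sp_trace_setstatus @traceid, 2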
