Thursday, September 27, 2012

SQL Server Indexes (Clustered and Non - Clustered Indexes) Part-1


         Indexing has become a critical buzzword in the arena of databases. Not only database users, but also all of us are using indexes in our daily life.  Got puzzled? Let me give you one scenario. Have you ever noticed how did a librarian arranges and orders all the books in the library? And when you are asking for a particular book, he immediately gives it to you within some couple of minutes without searching the whole library. This is possible because he has already indexed each book and categorized them according to the author and the publisher and again ordered all the books falling under same author and publisher in an alphabetical order of the title. So when you ask for a particular book, he just refers to the shelf belonging to the author and publisher of the book and the book is with you in no time.
Let’s take another common example of maintaining your household components. What you generally do when you are in need of Feviquick glue. I can bet on that you will search the whole house like meJ. But the point here is that you can avoid this irritating search if you would have remembered or noted the place where you had kept the Feviquick glue. Not only this, but for all that is part of our daily activities, like books, clothes, visiting places, etc. things that we use regularly or frequently, we remember them i.e. we keep information about them in our memory (brain). This is what we called in database language as indexing so that next time when we use them, it becomes easier for us to access them. One thing you might have noted here is that I am insisting about frequent and regular things and not about rare or occasional things like you might not want to remember the places where you don’t think you would have any business to do or the strangers in the street to whom you are not expecting to meet again. But if you want to remember also, nobody will resist you as human brains have enough capacity to remember everything that comes throughout its life span. But in database world we have space constraints as well as processing constraints, so we must take wise decisions while indexing. From the above discussion, we concluded that an index is a structure or classification used to find information easier.
Let’s have a look on how SQL Server is maintaining indexes in the databases.

Definition: Indexes are relational data structure created for a table or a view in a database in order to make data easier to search and retrieve when the query is processed. The disk space required to store the index structure is usually less than that required by the table. And the main reason behind are the key values on which the indexing are usually done and not on all the columns.
Let’s have an example to elaborate it in more detail: Consider a table nameddbo.EmployeeDim. The details about dbo.EmployeeDim are given below:
Database: OLAP
Table Name: dbo.EmployeeDim
Data:

emp_name
emp_id
emp_DOB
Department
Arun
101
1984-03-16 00:00:00.000
sales
Ahmed
201
1985-05-24 00:00:00.000
finance
Bobby
104
1985-04-09 00:00:00.000
HR
John
156
1982-07-12 00:00:00.000
marketing
Vasim
132
1982-02-12 00:00:00.000
production
Steve
229
1984-12-15 00:00:00.000
admin
Arif
203
1983-02-19 00:00:00.000
production





Script:

USE [OLAP]
GO
/****** Object:  Table [dbo].[EmployeeDim]   ******/
CREATE TABLE [dbo].[EmployeeDim](
                [emp_name] [varchar](50) NULL,
                [emp_id] [int] NULL,
                [emp_DOB] [datetime] NULL,
                [Department] [varchar](50) NULL
) ON [PRIMARY]


Table Structure:

Column_name
Type
Length
emp_name
varchar
50
emp_id
int
4
emp_DOB
datetime
8
Department
varchar
50

Here in this example, we will use information provided by the employee table. You might have noted here that none of the columns are sorted i.e. the information about the employees are not in an ordered fashion. Although this is a small example with limited number of rows, but guess a table like this having tones of records and u have been told to find the DOB (date of birth) and department for the employee id ‘229’ how u will get it then and imagine how much time it will take to scan each row to get the information about the employee id ‘229’. Before going for a deep discussion, let’s have some idea about the terms used in indexing.
Table Scan: Whenever we are firing a query on a database, the SQL server query optimizer is trying to fetch the best possible way to run the query for better performance. In this process, when the Query Optimizer determines that there are no useful indexes available to produce the output result set, then it will go for table scan. In table scan, it has to scan the whole table row by row in order to get the requested rows. This is very slow and is recommended to avoid them. There are rarely some cases where table scan are faster than index scan i.e. where the table size is less.
Index Scan: Auxiliary data structure used to accelerate access to data within the database. Here, it will not crawl on the whole table rather it will directly fetch those rows and pages that qualify the requesting query. For this a separate data structure is maintained to keep information about the key values of the table (described below).
Note: In some references, you may find that Table scan is called as Index scan and Index scan is named as Index seek.
The main property of index is that it is in sorted form. For example, as in case of dictionary, all the words are in sorted form, so when you are looking for any word, you are directly jumping to the corresponding page related to the word. In the same way the database engine is directly seeking the demanded rows if the corresponding column is indexed.

In our above example, suppose we want to find the employee name, his DOB and department information for employee id ‘132’. The below query will do your job.
   
SELECT [emp_name]
      ,[emp_DOB]
      ,[Department]
  FROM [OLAP].[dbo].[EmployeeDim]
WHERE [emp_id] = 156
Let’s have a look on the background process that has carried on for this simple query. As of now, there is no index in our table named [EmployeeDim], so the database engine will search each record, row by row, to get the desired result. Below is shown how the database engine performs the operation.




The Execution Plan for the above query shows that 100 % of the total cost is spent for Table Scan.


Now let’s create one index on the column named [emp_id] and see how it is working. Follow the below procedures to create an index on a table:
Step1: Open SQL Server Management Server and navigate to [EmployeeDim] table as below:








Step2: Select the column named [emp_id] and click on “Manage Indexes and Keys” on the toolbar as below:
             
Step3: On “Indexes/Keys” dialog box, click on Add button and click on the ellipse button shown beside the Columns property.
             
Step4: Then on “Index Columns” dialog box, select the column ‘emp_id’ in Column Name and click OK button. After that press Close button and save the table
          

The above index can be easily created by executing one simple T-SQL query (This is why coders are the best in programming worldJ). The t-sql command to create index is given below:
USE [OLAP]
GO
/****** Object:  Index [IX_EmployeeDim]    ******/
CREATE NONCLUSTERED INDEX [IX_EmployeeDim] ON [dbo].[EmployeeDim]
(
            [emp_id] ASC
) 
After creation of indexes it is always good to have a look on the newly created index, you can do this by two ways:
•           Expand Indexes folder under the table EmployeeDim, you will find one index was created on the name of IX_EmployeeDim.


•           Or else you can run the below query to check all the available indexes under a table.

EXEC sp_helpindex 'dbo.EmployeeDim'

Now let’s execute the same query to find out the information about the employee having employee id 132. This time we may see that SQL Server engine is performing its operation in a different way and is shown as below:

A separate data structure is created for the column emp_id in ascending order and has the reference to the original row with the SL_No column (SL_No is taken to describe the example pictorially; you can assume it as the row_id maintained internally for each row by the database engine, it is usually not visible to the user).


Here the SQL server database engine will directly jump to the row 132 in the index structure by using search algorithm, usually it performs B-tree search for faster extraction of data. As soon as it got the row, it will seek the corresponding record in the main table by the reference id (row_id) and will display the rest of the information from the table about the employee having emp_id ‘132’.
As in this example, we have taken a small table, so performance difference is negligible, but in case of big tables with records more than lakhs, indexes plays a vital role in fetching data promptly.

Types of indexes: We have basically two types of indexes in SQL Server.
Clustered index:  It’s a type of index where the data itself is arranged at the indexed column i.e. we have the actual data at the leaf nodes. For example, in a science definition pocket book, every word is arranged alphabetically and the definition of the word is given along with it. Here there is no need to look up for the referencing page, as in case of other example like in a general book with indexes at the end of the book, contains page number along with the term or word which is then followed to get the information about the term.
In our above example, if we create cluster index on the column emp_id, the table itself will be rearranged according to the sorting order of emp_id.
Syntax:
USE [OLAP]
GO
/****** Object:  Index [IX_EmployeeDim]  ******/
CREATE CLUSTERED INDEX [IX_EmployeeDim] ON [dbo].[EmployeeDim]
(
            [emp_id] ASC
)
After executing the query, perform a select statement on the table [EmployeeDim], you will notice that all the rows are ordered by [emp_id].

SELECT * FROM dbo.EmployeeDim

Advantage:
·         Seeking operations are much faster than normal indexing.
Disadvantage:
·         It will rearrange the whole data at every operation of UPDATE, INSERT or DELETE on key column / cluster index column, in order to keep ordering of the data in that column.

Non-Clustered index: Here leaf node contains index pages or reference to the actual data record instead of data pages (actual data). Refer the first example for details about the syntax. It will not affect the default order of the table rather a separate data structure is maintained for each index.
Advantage:
·         Can have more than one non-clustered index on a table.
·         Reordering of the table data is not necessary
Disadvantage:
·         Need separate data structure. As a result consume more space
·         Over indexing i.e. more number of indexes may cause performance degradation.

Cluster Indexing
Non-Cluster Indexing
• Actual data at the leaf nodes.
• Reference to the data on the leaf nodes.
• Data for the index column is sorted physically.
• Data in the index structure is sorted.
• Updating data in the cluster index column affects the whole table.
• Updating data does not affect the whole table, only the index structure has to be rebuild for the index column.
• Only 1 Cluster Index is possible for one table.
• Up to 249 Non Clustered Indexes are possible for one table in SQL Server 2005 and 999 in SQL Server 2008.
• Clustered index are unique in nature for a given table.
• This may not be unique.
• Can be efficient on fixed or rarely changing tables.
• Is good for table which is modified frequently.
• Index order matches the physical storage order of the data.
• Index order does not match with the physical order of the data.
• Does not required additional disk space to store data in the index as it is physically reordering the main table.
• Separate data structure is maintained for each index, resulting additional space consumption.


Best practices for Indexing:
·         Create index on columns that are frequently used for data retrieval or in filter conditions. For example in SELECT statement and WHERE clause or in ORDER BY,GROUP BY, and DISTINCT clauses. As in our first example, if most of the queries are referencing emp_name column instead of emp_id column, then it is wise to create index on emp_name instead of emp_id i.e. create indexes based on use.
·         Every table should have a clustered index otherwise it will act as heap table only even if it has non-clustered indexes, resulting performance degradation.
·         Create indexes on selective columns only that are frequently used.
·         Try to keep the keys of clustered index as small as possible in size. (i.e. int, smallint)
·         Optimize cluster index for ranged data i.e. data that are retrieved using BETWEEN, >, >=, <, and <=.
·         Creative covering indexes are a wise decision for most frequently used queries.
·         In case of composite indexes, order the columns according to the restrictiveness i.e. smaller quantity first as described in the composite index example above.
·         Create index on all foreign keys which may be helpful in joining operations.
·         Use multiple narrow indexes (i.e. on smaller number of columns, may be single column index) rather than a few wide indexes (composite indexes).
·         Avoid using indexes that are rarely used or else drop them.
·         Identify poorly performing queries with the help of SQL Profiler, Query Execution plan, and try to optimize it.
·         Last but not the least; remove indexes while performing BULK insert or heavy insert operations especially in case of Clustered Index as it will perform additional operations for rearranging the indexes for each insertion of the record. Instead, drop indexes before BULK insert and recreate it after insertion operation finishes.







No comments:

Post a Comment