Reducing data footprints on test systems

Picture the scene:

You are a DBA. You have to make everyone happy, including DEVs (yes, including them too). Part of that is to ensure that your DEVs have a fresh copy of your production data so that they can test their programming with current data and structures. The issue with that is your production database is multiple hundreds of GB in size and your test system doesn’t have the capacity to store a full copy of your database. Your IT Budget will also not stretch to purchasing more storage! What can you do?

I had a similar situation recently and took a look at the possibilties and here is what I came up with:

1. Try out tools like SQL Virtual Restore to allow you to “mount” a backup from a remote server avoiding the space issue all together. This is, of course, a solution that requires a license (no budget maybe?)

2. Use the great features offered to you by SQL Server Developer Edition: data compression to the rescue!

So of course, I chose door number 2.

As you may know, Developer Edition offers the same features as Enterprise Edition with the caveat that the instance cannot be used for production purposes in any way. This means that the seriously powerful and useful data compression becomes available for your test servers! This counts even if you only use Standard Edition in production – joy of joys! The good thing being that Developer Edition only costs around $50 (or less if you have an MSDN subscription) – even more joys!

If you have never seen/used compression take a quick look over on MSDN to see what it all is (Data Compression). Basically, you can compress data in indexes and tables at the storage level avoiding any query rewrites and still profiting from the storage savings. This can be a major advantage in terms of raw storage needs, but can also benefit you in high read environments with low RAM sizes. The data remains compressed when held in the buffer pool and is only de-compressed when being accessed. This means that you can keep more data in the buffer pool and reduce hard disk I/O (obviously not for free, compressing and de-compressing costs CPU cycles).  This may be acceptable on a test system with extremely constrained storage space.

The usage in my example scenario is now an even better proposition, not only can I reduce my storage footprint, I can also potentially increase test system performance on the I/O side of the equation (who doesn’t have I/O issues, especially on a test box!).

The next hurdle is of course identifying which indexes and tables that are in the database you want to squeeze down. This is possible via SSMS’ object explorer, but only if you want to spend an eternity doing so! The best way is to take a look at the meta-data tables/views/functions to interrogate the system objects. These are really interesting on their own (at least they are to me), but after writing my Index Script Creator, I realised the potential for using the information in these system objects to allow me to automate certain tasks like this one.

Similarly to the Index Script Creator, my Index Compressor Script (uploaded as a doc  but it is a .sql file really – stupid wordpress restrictions!) runs off and investigates all tables and indexes (down to the partition level) and then generates an index rebuild command to compress any currently non-compressed partitions of indexes / tables. The resulting commands can then be run and space can be reclaimed – just be aware that rebuilding indexes does cause log activity.

After compressing your tables and indexes you should have a much smaller amount of data (depending on how well your data compresses) and be able to reclaim some disk space.

I realise that this method will mean that you have to have a large enough disk to have the database in its uncompressed state to begin with, but you can end up with a cost effective solution to a budget-constrained test environment.

Happy zipping!

Presenting the Index Creator Script

Born of a need to originally script out indexes in a way similar to how SSMS creates them, I created a script to do just that. The original was very quick-n-dirty as can be seen here:
I had not accounted for very much, other than the indexes as they were. All options, schema information etc. was ignored, as we have nothing special at work and I really needed those indexes quickly.

Since posting on ASK, I have tinkered on-and-off with the script for a while. I am now at a point where I think other people could really profit from it and no longer have it stuck to some hard-coded schemas etc.

Presenting the Index Creator Script v1.0! This script will go through the current database, finding all indexes (optionally system indexes too) and supply you with create index scripts.

It is clever enough to spot the difference between / usage of :

– Primary Keys
– Unique Constraints
– Clustered and Non-Clustered Indexes
– Filtered Indexes (produces the filter too)
– Included Columns
– Partitioned Tables/Indexes (although the partition schema and functions are not produced – yet!)
– Data Compression (on a partition level if used – yes this is possible!)
– Fill Factor
– Index Padding
– Locking (Row and Page)

I have supplied two versions of the code; one for SQL 2005 and one for SQL 2008 and above. This is done as SQL 2008 offers Data Compression, which is implemented in the indexes and partitions. Some of the script relies on this information and would not be backwards compatible.

It has been very interesting coding this script, as it has enlightened me on the structures in SQL server with regards to indexes. For example, as of SQL 2005, regardless of edition, SQL Server creates indexes using partitions. Although partitions cannot be used by editions lower than Enterprise/Developer Edition, all indexes are created with at least one partition. This makes sense, as that would mean there would have to be a separate structures depending upon edition. This way, regardless of edition, the storage engine works the same, you just don’t get the option of creating partitions on an edition lower than Enterprise/Developer. As soon as you migrate a database to Enterprise Edition, you get the possibility of then splitting the indexes on to multiple partitions.

Even better than that, I found out that indexes can be compressed by partition. I sort of knew this already, but in writing the script I saw this in even more clarity. Each partition of an index can use a different level of compression. This can be very interesting, especially if the costs of compression are high, but the benefits in storage are high too. Think of a CPU bound system where some partitions are accessed often and would need a lower compression to reduce CPU load, with other partitions that are accessed rarely which can benefit from the higher compression ratio.

I hope these scripts are of some benefit. If you have comments/questions/suggestions, please get in touch.

Index Creator Script – This is a zip file. Download, change file extension and open in your favourite ZIP manager (damn you wordpress!). There are 2 .sql files in there (one for 2005 and one for 2008). Disclaimer – use at your own risk, I am not responsible if it breaks your PC/Server.

In reality, this script can’t break things, but you have been warned!

UPDATE: Thanks to @Fatherjack for the quick heads-up on a syntax error. Things should look good now though! 🙂

Who accessed my indexes?

I saw and answered a question today at ASK about the DMV sys.dm_db_index_usage_stats, a really useful DMV that shows you the usage of indexes in a database since the statistics were last reset (this happens when SQL Server is restarted, or the database is attached/detached or brought online).

The question asked what the data in the column system_scans actually meant.  To better understand this, we have to understand what the usage stats are.  This is my take on the matter (please get in touch with your views/comments).

The usage statistics for an index keep track of how often an index has been accessed, be it from a user through queries or through the system itself for internal use.  The statistics are updated when one or more of the following occur:

  • an index is used to satisfy a query
  • an index is updated when data has been changed/deleted/inserted
  • an index has been rebuilt/reorganised
  • the statistics for an index have been recalculated

Note that the list above can be split into two categories.  User actions (the first two) and system actions (the second two).  These categories are also separated in the index usage statistics.  This allows you to see at a glance, who (users/system) is responsible for the most index usage, and also what usage you are seeing.

The original question posted at ASK was wanting to know what the system_scan column data really meant.  This counter keeps track of how often the index has been scanned (entirely or partially) by SQL Server internally.  This type of operation would be for index rebuilds/reorgs or statistics updates.  The system runs off and scans the index to collect the information it needs.  This may have been triggered by a user requesting a statistics refresh, but the real work is still classed as being done by SQL Server internally.

If a user were to write a query that scanned an index, then the column user_scans would be incremented by one as it was a request by the user for data from the index.  Equally, if an index seek is performed, the counter user_seeks would be incremented.

As a rule of thumb, if you have low user reads (user_scans+user_seeks) and high system values or high user updates, then an index may be a good candidate for being dropped.

But remember, the usage statistics are only as valid as your system uptime.  First, a freshly started system will have low values across the board.  Secondly, a system that has been running for over a year may have skewed values just through the long uptime.  An index may have had high usage 6 months ago, but is now behaving totally differently after a further 6 months of use.