<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Simon Munro &#187; Data Access</title>
	<atom:link href="http://simonmunro.com/tag/data-access/feed/" rel="self" type="application/rss+xml" />
	<link>http://simonmunro.com</link>
	<description>Software Development and Public Cloud</description>
	<lastBuildDate>Mon, 06 Feb 2012 19:55:17 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='simonmunro.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Simon Munro &#187; Data Access</title>
		<link>http://simonmunro.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://simonmunro.com/osd.xml" title="Simon Munro" />
	<atom:link rel='hub' href='http://simonmunro.com/?pushpress=hub'/>
		<item>
		<title>The Trouble With Sharding</title>
		<link>http://simonmunro.com/2009/09/10/the-trouble-with-sharding/</link>
		<comments>http://simonmunro.com/2009/09/10/the-trouble-with-sharding/#comments</comments>
		<pubDate>Thu, 10 Sep 2009 21:35:35 +0000</pubDate>
		<dc:creator>simonmunro</dc:creator>
				<category><![CDATA[Cloud]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[SQL Azure]]></category>
		<category><![CDATA[Architecture]]></category>
		<category><![CDATA[Azure]]></category>
		<category><![CDATA[Cloud Computing]]></category>
		<category><![CDATA[Data Access]]></category>
		<category><![CDATA[Database]]></category>
		<category><![CDATA[Microsoft]]></category>
		<category><![CDATA[Sharding]]></category>

		<guid isPermaLink="false">http://simonmunro.wordpress.com/2009/09/10/the-trouble-with-sharding/</guid>
		<description><![CDATA[Database sharding, as a technique for scaling out SQL databases, has started to gain mindshare amongst developers.&#160; This has recently been driven by the interest in SQL Azure, closely followed by disappointment because of the 10GB database size limitation, which in turn is brushed aside by Microsoft who, in a vague way, point to [...]]]></description>
			<content:encoded><![CDATA[<p align="justify">Database sharding, as a technique for scaling out SQL databases, has started to gain mindshare amongst developers.&#160; This has recently been driven by the interest in SQL Azure, closely followed by disappointment because of the 10GB database size limitation, which in turn is brushed aside by Microsoft who, in a vague way, point to sharding as a solution to the scalability of SQL Azure.&#160; SQL Azure is a great product and sharding is an effective (and successful) technique, but before developers who have little experience with building scalable systems are let loose on sharding (or even worse, vendor support for ‘automatic’ sharding), we need to spend some time understanding the problem that we are trying to solve, the issues with sharding, and some ways forward to tackle the technical implementation.</p>
<p align="justify">The basic principles of sharding are fairly simple.&#160; The idea is to partition your data across two or more physical databases so that each database (or node) has a subset of the data.&#160; The theory is that in most cases a query or connection only needs to look in one particular shard for data, leaving the other shards free to handle other requests.&#160; Sharding is easily explained by a simple single-table example.&#160; Let’s say you have a large customer table that you want to split into two shards.&#160; You can create the shards by having all of the customers whose last names start with ‘A’ up to ‘L’ in one database and those from ‘M’ to ‘Z’ in another, i.e. a partition key on the first character of the Last Name field.&#160; With roughly half of the alphabet in each shard (12 and 14 letters) you might expect an even spread of customers across both shards, but without data you can’t be sure – maybe there are more customers in the first shard than the second, and maybe your particular region has more in one than the other.&#160; </p>
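<p align="justify">As a rough illustration of a first-letter partition key, here is a minimal Python sketch that routes a customer to one of two shards by the first letter of the last name and then counts how many rows land in each.&#160; The shard names, the <code>shard_for</code> and <code>shard_counts</code> helpers and the sample names are all invented for this example; it only shows how an even-looking split of letters can become an uneven split of rows.</p>
<pre>
```python
from collections import Counter

# Hypothetical shard names; the letter ranges mirror the 'A'-'L' / 'M'-'Z' example.
SHARDS = {
    "shard_a_l": set("ABCDEFGHIJKL"),
    "shard_m_z": set("MNOPQRSTUVWXYZ"),
}


def shard_for(last_name):
    """Return the shard whose letter range contains the first letter of last_name."""
    initial = last_name.strip()[:1].upper()
    for shard, letters in SHARDS.items():
        if initial in letters:
            return shard
    raise ValueError(f"no shard for last name {last_name!r}")


def shard_counts(last_names):
    """Count how many customers land in each shard."""
    return Counter(shard_for(name) for name in last_names)


# Invented sample data, not real customer statistics.
customers = ["Adams", "Brown", "Clark", "Davis", "Evans", "Jones", "Khan", "Smith"]
print(shard_counts(customers))
```
</pre>
<p align="justify">With these made-up names, seven of the eight customers land in the first shard, even though it owns fewer letters of the alphabet.</p>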
<p align="justify">Let’s say that you think that it will be better to shard customers by region to get a more even split and you have three shards: one for the US, one for Europe and one for the rest of the world.&#160; Even in the unlikely event that the number of rows is even, you may find that the load across each shard differs.&#160; 80% of your business may come from a single region or, even if the amount of business is even, the load will differ across different times of the day as business hours move across the world.&#160; The same problem exists across all primary entities that are candidates for sharding.&#160; For example, your product catalogue sharding strategy will have similar issues.&#160; You can use product codes for an even split, but you may find that top selling products are all in one shard.&#160; If you fix that you may find that top selling products are seasonal, so today’s optimal shard will not work at all tomorrow.&#160; The problem can be expressed as:</p>
<blockquote><p align="justify"><em><font color="#000000">The selection of a partition key for sharding is dependent on the number of rows that will be in each shard and the usage profile of the candidate shard over time.</font></em></p>
</blockquote>
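<p align="justify">To make the two halves of that statement concrete, the following Python sketch compares the share of rows per shard with the share of requests per shard at different hours of the day.&#160; The three regional shards, the row counts and the request figures are invented, and <code>row_share</code> and <code>busiest_shard_by_hour</code> are hypothetical helper names; treat it as one possible way to look at a candidate partition key, not as a recommended tool.</p>
<pre>
```python
from collections import defaultdict

# Invented figures for three hypothetical regional shards: identical row counts,
# but requests (per UTC hour) that peak at different times of the day.
ROWS = {"us": 1000, "europe": 1000, "rest_of_world": 1000}
REQUESTS = [  # (utc_hour, shard, request_count)
    (9, "europe", 800), (9, "us", 50), (9, "rest_of_world", 150),
    (15, "europe", 300), (15, "us", 900), (15, "rest_of_world", 100),
]


def row_share(rows):
    """Fraction of all rows held by each shard."""
    total = sum(rows.values())
    return {shard: count / total for shard, count in rows.items()}


def busiest_shard_by_hour(requests):
    """For each hour, return (shard, share of that hour's requests) for the busiest shard."""
    per_hour = defaultdict(dict)
    for hour, shard, count in requests:
        per_hour[hour][shard] = per_hour[hour].get(shard, 0) + count
    result = {}
    for hour, loads in per_hour.items():
        shard = max(loads, key=loads.get)
        result[hour] = (shard, loads[shard] / sum(loads.values()))
    return result


print(row_share(ROWS))
print(busiest_shard_by_hour(REQUESTS))
```
</pre>
<p align="justify">In this made-up data every shard holds a third of the rows, yet at 09:00 UTC the Europe shard takes 80% of the requests and by 15:00 UTC the US shard is the busiest one.</p>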
<p align="justify">Those are some of the issues in just trying to figure out your sharding strategy – and that is the easy part.&#160; Sharding seems to have a rule that the application layer is responsible for understanding how the data is split across each shard (whereas the term ‘partition’ is applied more to the RDBMS only, where partitioning is transparent to the application).&#160; This creates some problems: </p>
<ul>
<li>
<div align="justify">The application needs to maintain an index of partition keys in order to query the correct database when fetching data.&#160; This means that there is some additional overhead – database round trips, index caches and some translation of application queries into queries against the correctly connected database.&#160; While simple for a single table, it is likely that a single object may need to be hydrated from multiple databases, and figuring out dynamically where to fetch each piece of data (depending on pieces already fetched) can be quite complex.</div>
</li>
<li>
<div align="justify">Any sharding strategy will always be biased towards a particular data traversal path.&#160; For example, in a customer-biased sharding strategy you may keep related rows (such as the customer’s orders) in the same shard as the customer.&#160; This works well because the entire customer object and related collections can be hydrated from a single physical database connection, making the ‘My Orders’ page snappy.&#160; Unfortunately, although it works for the customer-oriented traversal path, the order fulfilment path is hindered by current and open orders being scattered all over the place.</div>
</li>
<li>
<div align="justify">Because the application layer owns the indexes and is responsible for fetching data, the database is rendered impotent as a query tool: each individual database knows nothing about the other shards and cannot execute a query accordingly.&#160; Even if the shard index were available in each database, it would trample all over the domain of the application layer, causing heaps of trouble.&#160; This means that all data access needs to go through the application layer, which creates a lot of work to build an object implementation of all database entities, their variations and query requirements.&#160; SQL cannot be used as a query language across the shards, and neither can ADO, OLE DB or ODBC – making it impossible to use existing query and reporting tools such as Reporting Services or Excel.</div>
</li>
<li>
<div align="justify">In some cases, sharding may be slower.&#160; Queries that need to aggregate or sort across multiple shards will not be able to take advantage of heavy lifting performed in the database.&#160; You will land up re-inventing the wheel by developing your own query optimisers in the application layer.</div>
</li>
</ul>
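<p align="justify">The problems above can be sketched in a few lines of Python, using two in-memory SQLite databases as stand-ins for physical shards.&#160; Everything here is an assumption made for illustration: the shard names, the <code>directory</code> dictionary playing the role of the application-owned index of partition keys, and the sample orders.&#160; The customer-oriented path touches one shard, while the date-ordered list and the revenue total have to visit every shard and combine the results in the application.</p>
<pre>
```python
import heapq
import sqlite3


def make_shard(orders):
    """Create an in-memory SQLite database standing in for one physical shard."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (customer TEXT, placed TEXT, total REAL)")
    db.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    return db


# Hypothetical layout: orders live in the same shard as their customer.
shards = {
    "shard_a_l": make_shard([("Adams", "2009-09-01", 40.0), ("Khan", "2009-09-03", 15.0)]),
    "shard_m_z": make_shard([("Munro", "2009-09-02", 25.0)]),
}
# The application-owned index of partition keys: customer name to shard name.
directory = {"Adams": "shard_a_l", "Khan": "shard_a_l", "Munro": "shard_m_z"}


def orders_for(customer):
    """Customer-oriented path: one index lookup, then a query against one shard."""
    db = shards[directory[customer]]
    return db.execute(
        "SELECT placed, total FROM orders WHERE customer = ? ORDER BY placed", (customer,)
    ).fetchall()


def all_orders_by_date():
    """Fulfilment-oriented path: query every shard, then merge the sorted results here."""
    per_shard = [
        db.execute("SELECT placed, customer, total FROM orders ORDER BY placed").fetchall()
        for db in shards.values()
    ]
    return list(heapq.merge(*per_shard))


def total_revenue():
    """Cross-shard aggregate: each shard sums locally, the application adds the partial sums."""
    return sum(
        db.execute("SELECT COALESCE(SUM(total), 0) FROM orders").fetchone()[0]
        for db in shards.values()
    )


print(orders_for("Munro"))
print(all_orders_by_date())
print(total_revenue())
```
</pre>
<p align="justify">Even in this toy version the sorting and summing that a single database would do for free have moved into application code, and no reporting tool pointed at one shard could produce the last two results.</p>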
<p align="justify">In order to implement sharding successfully we need to deal with the following:</p>
<ol>
<li>
<div align="justify">The upfront selection of the best sharding strategy.&#160; What entities do we want to shard?&#160; What do we want to shard on?&#160; </div>
</li>
<li>
<div align="justify">The architecture and implementation of our application layer and data access layer.&#160; Do we roll our own?&#160; Do we use an existing framework?</div>
</li>
<li>
<div align="justify">The ability to monitor performance and identify problems with the shards, so that we can change (and re-optimise) our initially chosen sharding strategy as the amount of data and usage patterns change over time.</div>
</li>
<li>
<div align="justify">Consideration for other systems that may need to interface with our system, including large monolithic legacy systems and out-of-the-box reporting tools.</div>
</li>
</ol>
<p align="justify">So some things to think about if you are considering sharding:</p>
<ul>
<li>
<div align="justify">Sharding is no silver bullet and needs to be evaluated architecturally, just like any other major data storage and data access decision.</div>
</li>
<li>
<div align="justify">Sharding of the entire system may not be necessary.&#160; Perhaps only the part of the web front-end that must perform under high load needs to be sharded, and the back-office transactional systems don’t need to be sharded at all.&#160; So you could build a system that has a small part sharded and migrates data to a more traditional model (or even a data warehouse) as needed.</div>
</li>
<li>
<div align="justify">Sharding for scalability is not the only approach for data – perhaps some use could be made of non-SQL storage.</div>
</li>
<li>
<div align="justify">Hand coding all of the application objects may be a lot of work and difficult to maintain.&#160; A framework or a code generation tool could help.&#160; However, it has to be feature complete and handle the issues raised in this post.</div>
</li>
<li>
<div align="justify">You will need to take a very careful approach to the requirements, in a behavioural or domain-driven style.&#160; Creating a solution where every entity is sharded, every object is made of shards, and every possible query combination that could be thought up is implemented is going to be a lot of work and result in a brittle, unmaintainable system.</div>
</li>
<li>
<div align="justify">You need to look at your database vendor’s support for partitioning.&#160; Maybe it will be good enough for your solution and you don’t need to bother with sharding at all.</div>
</li>
<li>
<div align="justify">Sharding, by splitting data across multiple physical databases, loses some (maybe a lot) of the essence of SQL – queries, data consistency, foreign keys, locking.&#160; You will need to understand if that loss is worthwhile – maybe you will land up with a data store that is too dumbed down to be useful.</div>
</li>
</ul>
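<p align="justify">The loss of foreign keys is easy to show with a small Python sketch, again using two in-memory SQLite databases as stand-ins for separate shards.&#160; The table layout and the <code>customer_exists</code> and <code>place_order</code> helpers are hypothetical.&#160; Because the customers and the orders live in different databases, the database cannot enforce the relationship, so the application has to check it itself, and that check is weaker than a real constraint.</p>
<pre>
```python
import sqlite3

# Two in-memory SQLite databases standing in for separate physical shards.
customer_shard = sqlite3.connect(":memory:")
order_shard = sqlite3.connect(":memory:")
customer_shard.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
# A FOREIGN KEY constraint cannot reference a table in a different database,
# so customer_id is just a plain integer column here.
order_shard.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
with customer_shard:
    customer_shard.execute("INSERT INTO customers VALUES (1, 'Adams')")


def customer_exists(customer_id):
    """Application-level replacement for the foreign key lookup."""
    row = customer_shard.execute(
        "SELECT 1 FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()
    return row is not None


def place_order(customer_id, total):
    """Insert an order only if the customer exists in the other shard.

    The check and the insert are separate operations on separate databases,
    so unlike a real foreign key this is not protected against a concurrent delete.
    """
    if not customer_exists(customer_id):
        raise LookupError(f"unknown customer {customer_id}")
    with order_shard:  # commits on success, rolls back on error
        order_shard.execute(
            "INSERT INTO orders (customer_id, total) VALUES (?, ?)", (customer_id, total)
        )


place_order(1, 10.0)
```
</pre>
<p align="justify">Calling <code>place_order</code> with a customer id that does not exist raises <code>LookupError</code>, but only because the application remembered to check; nothing stops another code path from inserting an orphaned order directly.</p>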
<p align="justify">If you are looking at a Microsoft stack specifically, there are some interesting products and technologies that may affect your decisions.&#160; These observations are purely my own and are not gleaned from NDA-sourced information.</p>
<ul>
<li>
<div align="justify">ADO.NET Data Services (Astoria) could be the interface at the application level in front of sharded objects.&#160; It replaces the SQL language with a queryable RESTful language.</div>
</li>
<li>
<div align="justify">The Entity Framework is a big deal for Microsoft and will most likely, over time, be the method with which Microsoft delivers sharding solutions.&#160; EF is destined to be supported by other Microsoft products, such as SQL Reporting Services, SharePoint and Office, meaning that sharded EF models will be queryable with standard tools.&#160; Also, Astoria supports EF already, providing a mechanism for querying the data with a non-SQL language.</div>
</li>
<li>
<div align="justify">Microsoft is a pretty big database player and has some smart people on the database team.&#160; One would expect that they will put effort into the SQL core to better handle partitioning within the SQL model.&#160; They already have Madison, which although more read-only and quite closely tuned for specific hardware configurations, offers a compelling parallelised database platform.</div>
</li>
<li>
<div align="justify">The Azure platform has more than just SQL Azure – it also has Azure Storage, which is a really good storage technology for distributed parallel solutions.&#160; It can also be used in conjunction with SQL Azure within an Azure solution, allowing a hybrid approach where SQL Azure and Azure Storage play to their particular strengths.</div>
</li>
<li>
<div align="justify">The SQL Azure team has been promising some magic to come out of the Patterns &amp; Practices team – we’ll have to wait and see.</div>
</li>
<li>
<div align="justify">Ayende seems to want to add sharding to <a href="http://ayende.com/Blog/archive/2009/09/06/sql-azure-sharding-and-nhibernate-a-call-for-volunteers.aspx">NHibernate</a>.</div>
</li>
</ul>
<p align="justify">Database sharding has typically been the domain of large websites that have reached the limits of their own, really big, datacentres and have the resources to shard their data.&#160; The cloud, with small commodity servers, such as those used with SQL Azure, has raised sharding as a solution for smaller websites but they may not be able to pull off sharding because of a lack of resources and experience.&#160; The frameworks aren’t quite there and the tools don’t exist (like an analysis tool for candidate shards based on existing data) – and without those tools it may be a daunting task.</p>
<p align="justify">I am disappointed that the SQL Azure team throws out the bone of sharding as the solution to their database size limitation without backing it up with some tools, realistic scenarios and practical advice.&#160; Sharding a database takes more than just hand waving and PowerPoint presentations; it requires a solid engineering approach to the problem.&#160; Perhaps they should talk more to the Azure services team to offer hybrid SQL Azure and Azure Storage architectural patterns that are compelling and architecturally valid.&#160; I am particularly concerned when it is offered as a simple solution to small businesses that have to make a huge investment in a technology and an architecture that they are possibly unable to maintain.</p>
<p align="justify">Sharding will, however, gain traction and is a viable solution to scaling out databases, SQL Azure and others.&#160; I will try and do my bit by communicating some of the issues and solutions – let me know in the comments if there is a demand.</p>
<p align="justify">Simon Munro</p>
<p align="justify"><a href="http://twitter.com/simonmunro">@simonmunro</a></p>
]]></content:encoded>
			<wfw:commentRss>http://simonmunro.com/2009/09/10/the-trouble-with-sharding/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/b1b8b0098653a14d0338ffac00b5e52c?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">simonmunro</media:title>
		</media:content>
	</item>
	</channel>
</rss>
