<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Tokutek &#187; Martin Farach-Colton</title>
	<atom:link href="http://tokutek.com/author/martin/feed/" rel="self" type="application/rss+xml" />
	<link>http://tokutek.com</link>
	<description>Database Performance</description>
	<lastBuildDate>Thu, 02 Sep 2010 17:47:27 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Announcing TokuDB v4.1</title>
		<link>http://tokutek.com/2010/08/announcing-tokudb-v4-1/</link>
		<comments>http://tokutek.com/2010/08/announcing-tokudb-v4-1/#comments</comments>
		<pubDate>Sun, 15 Aug 2010 19:40:38 +0000</pubDate>
		<dc:creator>Martin Farach-Colton</dc:creator>
				<category><![CDATA[TokuView]]></category>

		<guid isPermaLink="false">http://tokutek.com/?p=1618</guid>
		<description><![CDATA[Tokutek is pleased to announce immediate availability of TokuDB for MySQL, version 4.1. It is designed for continuous querying and analysis of large volumes of rapidly arriving and changing data, while maintaining full ACID properties. TokuDB v4.1 includes important improvements, most notably support for SAVEPOINT and improved performance of the Fast Loader (introduced in v4.0). [...]]]></description>
			<content:encoded><![CDATA[<p>Tokutek is pleased to announce immediate availability of <a href="http://tokutek.com/products/tokudb-for-mysql-v4/">TokuDB for MySQL, version 4.1</a>. It is designed for continuous querying and analysis of large volumes of rapidly arriving and changing data, while maintaining full ACID properties.</p>
<p>TokuDB v4.1 includes important improvements, most notably support for SAVEPOINT and improved performance of the Fast Loader (introduced in v4.0).</p>
<p>This new release builds on TokuDB&#8217;s unique combination of capabilities:</p>
<ul>
<li>10x-50x faster indexing for faster querying</li>
<li>Full support for ACID transactions</li>
<li>Short recovery time (seconds or minutes, not hours or days)</li>
<li>Immunity to database aging to eliminate performance degradation and maintenance headaches</li>
<li>5x-15x data compression for reduced disk use and lower storage costs</li>
</ul>
<p>Because of its high indexing performance and transaction support, TokuDB is well suited to Web applications that must simultaneously store and query large volumes of rapidly arriving data, including:</p>
<ul>
<li>Logfile Analysis</li>
<li>High-speed Webcrawling</li>
<li>Real-time clickstream analysis</li>
<li>Social Networking</li>
<li>eCommerce Personalization</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://tokutek.com/2010/08/announcing-tokudb-v4-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Announcing TokuDB 2.1.0</title>
		<link>http://tokutek.com/2009/08/announcing_tokudb_210/</link>
		<comments>http://tokutek.com/2009/08/announcing_tokudb_210/#comments</comments>
		<pubDate>Fri, 07 Aug 2009 00:23:00 +0000</pubDate>
		<dc:creator>Martin Farach-Colton</dc:creator>
				<category><![CDATA[TokuView]]></category>

		<guid isPermaLink="false">http://announcing_tokudb_210</guid>
		<description><![CDATA[Tokutek&#0174; announces the release of the TokuDB storage engine for MySQL&#0174;, version 2.1.0. This release offers the following improvements over our previous release: Faster indexing of sequential keys. Faster bulk loads on tables with auto-increment fields. Faster range queries in some circumstances. Added support for InnoDB. Upgraded from MySQL 5.1.30 to 5.1.36. Fixed [...]]]></description>
			<content:encoded><![CDATA[<p>Tokutek&#0174; announces the release of the <a href="http://www.tokutek.com/early_release.php">TokuDB storage engine for MySQL&#0174;, version 2.1.0</a>.  This release offers the following improvements over our previous release:</p>
<ul>
<li> Faster indexing of sequential keys.
<li> Faster bulk loads on tables with auto-increment fields.
<li> Faster range queries in some circumstances.
<li> Added support for InnoDB.
<li> Upgraded from MySQL 5.1.30 to 5.1.36.
<li> Fixed all known bugs.
</ul>
<h3>About TokuDB</h3>
<p>TokuDB for MySQL is a storage engine built with Tokutek&#8217;s Fractal Tree&#8482; technology. TokuDB provides near seamless compatibility for MySQL applications. Tables can be individually defined to use TokuDB, MyISAM, InnoDB&#0174; or other MySQL-compliant storage engines. Data is loaded, inserted, and queried using standard MySQL commands, with no restrictions or special requirements. Fractal Trees can index data up to 50 times faster than traditional database technologies, enabling near real time analysis on large volumes of rapidly arriving data.
</p>
<p>
Notice: Tokutek and Fractal Tree are trademarks or registered trademarks of Tokutek, Inc.  MySQL is a registered trademark of Sun Microsystems, Inc.  InnoDB is a registered trademark of Oracle Corporation.</p>
]]></content:encoded>
			<wfw:commentRss>http://tokutek.com/2009/08/announcing_tokudb_210/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The Depth of  a B-tree</title>
		<link>http://tokutek.com/2009/04/the_depth_of_a_b_tree/</link>
		<comments>http://tokutek.com/2009/04/the_depth_of_a_b_tree/#comments</comments>
		<pubDate>Tue, 28 Apr 2009 23:06:00 +0000</pubDate>
		<dc:creator>Martin Farach-Colton</dc:creator>
				<category><![CDATA[TokuView]]></category>

		<guid isPermaLink="false">http://the_depth_of_a_b_tree</guid>
		<description><![CDATA[Shlomi Noach recently wrote a useful primer on the depth of B-trees and how that plays out for point queries &#8212; in both clustered indexes, like InnoDB, and in unclustered indexes, like MyISAM. Here, I&#8217;d like to talk about the effect of B-tree depth on insertions and range queries. And, of course, I&#8217;ll talk about [...]]]></description>
			<content:encoded><![CDATA[<p>Shlomi Noach recently wrote a useful primer on the <a href="http://code.openark.org/blog/mysql/the-depth-of-an-index-primer" title="depth of B-trees">depth of B-trees</a> and how that plays out for point queries &#8212; in both clustered indexes, like InnoDB, and in unclustered indexes, like MyISAM.  Here, I&#8217;d like to talk about the effect of B-tree depth on insertions and range queries.  And, of course, I&#8217;ll talk about alternatives like Fractal Trees, since that&#8217;s the basis of Tokutek&#8217;s storage engine for MySQL. </p>
<p>Please see Shlomi&#8217;s post for details, but I can summarize a few points, partly because I need some vocabulary for the points I&#8217;d like to make below.  Shlomi notes that there are two main features determining the depth of a B-tree (or B+-tree):</p>
<ol>
<li> The number of rows in the database.  We&#8217;ll call that <i>N</i>.</li>
<li>The size of the indexed key.  Let&#8217;s call <i>B</i> the number of keys that fit in a B-tree node.  (Sometimes <i>B</i> is used to refer to the node size itself, rather than the number of keys it holds, but I hope my choice will make sense directly.) </li>
</ol>
<p>Given these quantities, the depth of a B-tree is log<sub><i>B</i></sub> <i>N</i>, give or take a little.  That&#8217;s just (log <i>N</i>)/log <i>B</i>.  Now we can rephrase Shlomi&#8217;s point as noting that smaller keys mean a bigger <i>B</i>, which reduces (log <i>N</i>)/log <i>B</i>.  If we cut the key size in half, then the depth of the B-tree goes from (log <i>N</i>)/log <i>B</i> to (log <i>N</i>)/log 2<i>B</i> (twice as many keys fit in the tree nodes), and, taking logs base 2, that&#8217;s just (log <i>N</i>)/(1+log <i>B</i>).</p>
<p>Let&#8217;s put some numbers in there.  Say you have a billion rows, and you can currently fit 64 keys in a node.  Then the depth of the tree is (log 10<sup>9</sup>)/ log 64 &asymp; 30/6 = 5.  Now you rebuild the tree with keys half the size and you get log 10<sup>9</sup> / log 128 &asymp; 30/7 = 4.3.  Assuming the top 3 levels of the tree are in memory, then you go from 2 disk seeks on average to 1.3 disk seeks on average, or about 35% fewer seeks. </p>
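<p>As a rough check of that arithmetic, here is a minimal Python sketch.  The function names <code>btree_depth</code> and <code>expected_seeks</code> are made up for illustration, and the 3 cached levels are the assumption from the text, not a measurement.</p>
<pre>
```python
import math

def btree_depth(num_rows, keys_per_node):
    """Approximate B-tree depth as log_B(N) = log(N) / log(B)."""
    return math.log(num_rows) / math.log(keys_per_node)

def expected_seeks(depth, cached_levels=3):
    """Assume each level below the cached top of the tree costs one disk seek."""
    return max(depth - cached_levels, 0.0)

N = 10**9  # a billion rows, as in the text

depth_big_keys = btree_depth(N, 64)     # 64 keys fit in a node
depth_small_keys = btree_depth(N, 128)  # keys half the size, so 128 fit

seeks_big_keys = expected_seeks(depth_big_keys)
seeks_small_keys = expected_seeks(depth_small_keys)

# Fraction of disk seeks saved by halving the key size.
seek_reduction = 1.0 - seeks_small_keys / seeks_big_keys

print(round(depth_big_keys, 1), round(depth_small_keys, 1))
print(round(seeks_big_keys, 1), round(seeks_small_keys, 1))
print(round(seek_reduction, 2))
```
</pre>
<p>Without the rounding to 30/6 and 30/7, the reduction comes out a shade above 35%, which is close enough for this kind of estimate.</p>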
<p>That&#8217;s a nice savings, assuming, of course, that the new, smaller key you used is as useful for queries.  And the time for an insertion into a B-tree enjoys the same savings.  An insertion is O((log <i>N</i>)/log <i>B</i>) &#8212; about the same as point queries, up to a constant, but in any case, you&#8217;d still get a similar speedup. </p>
<p>What about range queries?  Here things aren&#8217;t so sensitive to the depth of the tree.  In a range query, you seek to a leaf that has your first row, and then you iterate through all rows, jumping to sibling leaves as necessary, until you reach your ending row.  The initial time to seek to the first leaf &#8212; the part that depends on the depth of the tree &#8212; is typically in the noise.</p>
<p>There is, however, a more subtle effect going on here.  If you have fast insertions (say, by using smaller keys), you can afford to keep more indexes.  And having the right index around can make a huge difference to a range query: not having an index can mean that a range query over a small number of rows can become a full table scan.  I&#8217;ve seen a 5-order-of-magnitude speedup in such queries once the right index is added.</p>
<p>Here&#8217;s a simple example of what I&#8217;m talking about, using iiBench. [Details available at <a href="http://tokutek.com/products/iibench/">http://tokutek.com/products/iibench/</a>.] </p>
<pre>
mysql> show variables like 'query_cache_type';
+------------------------------+---------+
| Variable_name                | Value   |
+------------------------------+---------+
| query_cache_type             | OFF     |
+------------------------------+---------+
7 rows in set (0.00 sec)

mysql> select count(*) from tokudb_1B_noindex where customerid = 50000;
+----------+
| count(*) |
+----------+
|    10014 |
+----------+
1 row in set (11 min 17.55 sec)

mysql> alter table tokudb_1B_noindex add index cust_idx (customerid);

mysql> select count(*) from tokudb_1B_noindex where customerid = 50000;
+----------+
| count(*) |
+----------+
|    10014 |
+----------+
1 row in set (0.31 sec)
</pre>
<p>We issue a query on a table with no secondary indexes, and then again after building an appropriate secondary index &#8212; with query caching off, of course.  The query time drops from more than 11 minutes to 0.31 seconds, in this case a speedup factor of about 2185.</p>
<p>So for range queries, it all boils down to having the right set of indexes.  Faster insertions mean you can keep more indexes, so the savings on insertion times mentioned above can help: if you can afford to keep 4 indexes on your live data using big keys, you can afford around 6 indexes if you use keys half the size. [Well, it's more complicated than that, because this assumes you are keeping 3 levels of each index in memory.  Caveat indexor.]</p>
<p>Once again, this assumes that you can find small keys to use for all your indexes.  And the fastest indexes for range queries are <a href="http://en.wikipedia.org/wiki/Index_(database)#Covering_Index" title="covering indexes">covering indexes</a>, which tend to have large keys. [Stay tuned for a tokuview posting on covering indexes.]  For non-covering indexes, we&#8217;d have to do lots of point queries into the main table.  Yikes!</p>
<p>So we are caught in a bind.  We&#8217;d like to build secondary indexes to make range queries fast, and we&#8217;d like the secondary indexes to be covering for our important queries, but covering indexes are slow to build using B-trees, partly because the keys are large and the trees are deep, but for a bunch of other reasons as well, reasons I&#8217;ll cover in another blog post.</p>
<p>What to do?  Of course, my solution is to use TokuDB for MySQL, which has much faster insertions.  TokuDB isn&#8217;t built on B-trees, but rather on Fractal Trees, so the whole large-key question goes away.  See the <a href="http://www.tokutek.com/technology.php" title="Technology Brief on Tokutek's Technology Page">Technology Brief on Tokutek&#8217;s Technology Page</a> for a discussion of the insertion performance of Fractal Trees and TokuDB.</p>
]]></content:encoded>
			<wfw:commentRss>http://tokutek.com/2009/04/the_depth_of_a_b_tree/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hacking for Faster Insertions: Is this really how you want to spend your time?</title>
		<link>http://tokutek.com/2008/04/hacking-for-faster-insertions-is-this-really-how-you-want-to-spend-your-time/</link>
		<comments>http://tokutek.com/2008/04/hacking-for-faster-insertions-is-this-really-how-you-want-to-spend-your-time/#comments</comments>
		<pubDate>Fri, 04 Apr 2008 23:18:00 +0000</pubDate>
		<dc:creator>Martin Farach-Colton</dc:creator>
				<category><![CDATA[TokuView]]></category>

		<guid isPermaLink="false">http://blogs.tokutek.com/tokuview/hacking_for_faster_insertions_is_this_really_how_you_want_to_spend_your_tim/#When:18:18:00Z</guid>
		<description><![CDATA[Recall that I&#8217;ve claimed that it takes 28 years to fill a disk with random insertions, given a set of reasonable assumptions. Recall what they are: We are focusing on the storage engine (a la MySQL) level, and we are looking at a database on a single disk &#8212; the one we are using for [...]]]></description>
			<content:encoded><![CDATA[<p>Recall that I&#8217;ve claimed that it takes 28 years to fill a disk with random insertions, given a set of reasonable assumptions.  Recall what they are:</p>
<p>We are focusing on the storage engine (a la MySQL) level, and we are looking at a database on a single disk &#8212; the one we are using for illustration is the 1TB Hitachi Deskstar 7K1000.  It has a disk seek time of 14ms and a transfer rate of around 69MB/s. [See tomshardware.com]  We insert random &lt;key,value&gt; pairs, with key and value each 8 bytes.  So that&#8217;s 62.5 billion pairs to fill the disk, and at 4KB-size blocks, that&#8217;s 2^28 leaves (= 2^40 bytes / 2^12 bytes/leaf).</p>
<p>Now, my analysis requires each insertion to induce a disk seek.  Suppose we do something clever with main memory.  After all, we have this main memory hanging around.  It should be possible to buffer up some insertions, and once we fill up main memory, insert key/value pairs that belong on the same leaf.  Thus, fetching a leaf will be amortized over the number of rows that we expect to insert into a leaf.</p>
<p>So what&#8217;s the best we could do?  If we could somehow figure it out, the best we could do would be to take the leaf that&#8217;s going to receive the most insertions, fetch that one in, and do all the insertions from memory into that one.  This assumes we won&#8217;t overflow that leaf and have to rebalance, etc.  We&#8217;re just going to count the cost of fetching in a leaf.</p>
<p>At first, you&#8217;d get a lot of advantage, since you start with an empty database with no leaves, so the first set of insertions (256 of them, in fact) will all go to the first leaf.  So let&#8217;s consider the situation where we&#8217;ve filled up the database half way.  We&#8217;ll discount this time and just look at the time to fill up the second half of the disk.  At that point, we have 2^27 leaves.</p>
<p>Now, let&#8217;s suppose we have 1GB = 2^30 bytes of memory.  It really won&#8217;t make an appreciable difference, it turns out, if we have as much or half as much memory.  With this much memory, we can get 2^27 rows into memory.  </p>
<p>Now we have what is known as a <b>balls and bins</b> problem.  Think of the leaves as a set of 2^27 bins, and the in-memory rows as 2^27 balls.  Since the insertion keys are random, we can think of each row/ball as picking a random leaf/bin where it goes.  Now, what is the expected maximum number of balls in any single bin?  The answer is well known &#8212; ln (2^27) / lnln (2^27) = 18.7/2.9 = 6.4.  So we expect the very best leaf to typically get 6.4 rows inserted.</p>
<p>So with the maximum speedup we could expect, filling the second half of the disk takes 14 years/6.4 = 2.1875 years.  Now, that&#8217;s a lot better, but in fact, it&#8217;d be hard (and probably impossible) to get anywhere near that performance.  Why?  First, there&#8217;s the question of how you know which rows go on the same leaf without actually having the leaves.  You could keep a B-tree in memory of the minimum key in each leaf, but this would fill half of memory.  Second, we didn&#8217;t count the time to fill the first half of the disk.  We could do a finer analysis of the time needed to load the last half of the database, plus the preceding 1/4 rows, plus the preceding 1/8, etc., until you have a small enough database to fit in memory, and that ends up at around 4 years.  </p>
<p>Finally, and most importantly, the factor of 6.4 is far too generous.  The average number of balls in each bin is 1, and we only expect a single bin to get above 6 balls.  Then we immediately empty that bin (by inserting the rows into the database).  Now we throw some more rows in memory (balls into bins), and they are very unlikely to produce yet another 6-ball bin.  They are almost certainly going to land in bins that are either empty or have a single ball.  Not to get too technical, the number of bins with a large number of balls drops exponentially as the number of balls increases, until we get down to a single bin with more than 6 balls.  As you empty these very full bins, it takes a large number of new balls to get more bins that have a lot of balls.</p>
<p>Now, if I were still an academic, I&#8217;d take the time to study this recycling balls-and-bins problem, but instead I have a company now, and theorems of this type are not really on any critical path!  Still, one of my partners, Michael Bender, suspects that the answer is going to be lnln 2^27 = 2.9.  I&#8217;d be very surprised if the answer is any more than this.</p>
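<p>The estimate is easy to reproduce.  The following is a small Python sketch of the leading-order formula only; the name <code>expected_max_load</code> is hypothetical, and the code evaluates ln n / lnln n rather than simulating 2^27 balls.</p>
<pre>
```python
import math

def expected_max_load(n):
    """Leading-order estimate ln(n) / ln(ln(n)) for n balls thrown into n bins."""
    return math.log(n) / math.log(math.log(n))

n = 2**27  # number of leaves (bins) and of buffered rows (balls), as in the text

ln_n = math.log(n)
lnln_n = math.log(ln_n)
best_leaf_rows = expected_max_load(n)

print(round(ln_n, 1), round(lnln_n, 1), round(best_leaf_rows, 1))
```
</pre>
<p>Because the value grows so slowly with n, doubling or halving the number of balls and bins barely moves it.</p>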
<p>So a realistic number for going through all this work is that it&#8217;ll take 28 years/2.9 = 9.7 years.  Your disk will still die before you fill it.</p>
<p>My recommendation: don&#8217;t use this heuristic!</p>
<p><u>Sorting before inserting?</u></p>
<p>What about another heuristic?  How about, when memory fills, sort the in-memory rows, then merge the sorted order of the database with the sorted order of the in-memory inserted rows.  To compute the time, we need to figure out how good a job we do of keeping the leaves of the tree in order on the disk.  If we keep them all nicely in order and packed &#8212; which is to say, if every time we merge, we rebuild the index from scratch, we get the following:  The bandwidth time for one batch of memory is 1GB/69MB/s = 14.5s.  Discounting sort time, the first memory-full of rows takes 14.5s, the second batch takes 14.5s * 2, then 14.5s * 3, up until 14.5s * 1000, for the last batch before filling the disk.  That gives a total time of 14.5s * 1000 * 1001/2 = 7.3 million seconds = 84 days.  What an improvement!  Still, it&#8217;s almost 3 months to index the data, compared to the 10 hours it takes to log the data. So your yield is 0.2% of bandwidth.</p>
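<p>Here is the same sum as a short Python sketch.  It assumes decimal units (1GB = 10^9 bytes, 1TB = 1000 batches) and, like the text, charges only bandwidth time for each rebuild; the variable names are illustrative.</p>
<pre>
```python
MEMORY_BYTES = 10**9                # one memory-full of rows per batch
BANDWIDTH_BYTES_PER_S = 69 * 10**6  # transfer rate of the disk in the text
BATCHES = 1000                      # 1TB of disk / 1GB per batch

# Time to stream one memory-full at full bandwidth, about 14.5 seconds.
seconds_per_gb = MEMORY_BYTES / BANDWIDTH_BYTES_PER_S

# Rebuilding after batch k rewrites k GB, so the total is
# seconds_per_gb * (1 + 2 + ... + BATCHES).
total_seconds = seconds_per_gb * BATCHES * (BATCHES + 1) / 2
total_days = total_seconds / 86400

# One sequential pass over the full disk, as a fraction of the total work.
sequential_fraction = (seconds_per_gb * BATCHES) / total_seconds

print(round(seconds_per_gb, 1), round(total_days), round(100 * sequential_fraction, 1))
```
</pre>
<p>The quadratic growth is the whole story here: each new batch pays to rewrite everything inserted before it.</p>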
<p>Let me point out that we&#8217;ve made some pretty big assumptions here.  For one thing, we&#8217;ve assumed that you are rebuilding the index from scratch each time.  It&#8217;s much faster to build an index on data given in sequential order, but surely it&#8217;s not free!  So the 3 months to build the index this way is a gross underestimate.  If you don&#8217;t rebuild the index each time, your database ages.  It&#8217;s pretty tricky to estimate the time it takes to build the index this way, but a very nasty bit of estimating later, I get that it takes at least 41 years (my main assumption is that once the database is half full, it is fully aged).  Oy!</p>
<p><u>Bigger Batches</u></p>
<p>By now you&#8217;re saying to yourself that you can get a 1 TB database built in much less than 3 months.  If you have all the data ahead of time, you can sort the data in roughly 40 hours, plus at least another 10 hours to build the index (assuming our magical no-overhead indexer for sorted data).  So you get a tradeoff between</p>
<ul>
<li> Waiting for all the data to come in (and get stale), then 2 days to build the index.
<li> Insert as you go at a limited rate, so that the entire project takes 3 months.  Throughout the process, all inserted data is available for querying.
<li> Anything in between.
</ul>
<p><u>Conclusions</u></p>
<p>And we&#8217;re back to the old tradeoff: you can get faster insertion if you are willing to let enough of your data get stale.  I want a storage engine that&#8217;s fast for insertions, fast for range queries and doesn&#8217;t age.  So does everyone else.  This would end 90% of the kludges of how DBAs deal with databases (ok, so I made up the 90%, but it&#8217;s probably not far off, is it?).</p>
<p>Next, we&#8217;re going to look at Cache Obliviousness, and see if we can get any closer to this impossible dream.</p>
]]></content:encoded>
			<wfw:commentRss>http://tokutek.com/2008/04/hacking-for-faster-insertions-is-this-really-how-you-want-to-spend-your-time/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Tradeoff: Insertions versus Point Queries</title>
		<link>http://tokutek.com/2008/03/tradeoff-insertions-versus-point-queries/</link>
		<comments>http://tokutek.com/2008/03/tradeoff-insertions-versus-point-queries/#comments</comments>
		<pubDate>Tue, 11 Mar 2008 23:55:01 +0000</pubDate>
		<dc:creator>Martin Farach-Colton</dc:creator>
				<category><![CDATA[TokuView]]></category>

		<guid isPermaLink="false">http://blogs.tokutek.com/tokuview/tradeoff_insertions_versus_point_queries/#When:18:55:01Z</guid>
		<description><![CDATA[I&#8217;ve been waving my hands about lower bounds. Well, sometimes I haven&#8217;t been waving my hands, because the lower bounds are tight. But in other cases (lenient insertions, range queries), the lower bounds are very far from what we&#8217;re used to. So now, for a bit of math: Brodal and Fagerberg showed in 2003 that [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been waving my hands about lower bounds.  Well, sometimes I haven&#8217;t been waving my hands, because the lower bounds are tight.  But in other cases (lenient insertions, range queries), the lower bounds are very far from what we&#8217;re used to.</p>
<p>So now, for a bit of math:</p>
<p>Brodal and Fagerberg showed in 2003 that there&#8217;s a tradeoff between insertions and queries.  The insertions they consider are lenient.  Well, any lower bound for lenient is a lower bound for strict, but they also gave upper bounds, so it matters.  Also, they don&#8217;t know from lenient, but if you look at their upper bound, they are implementing lenient insertions.  The queries they consider are, unfortunately, point queries.  That&#8217;s too bad for us, because we&#8217;ve already seen that point queries are just too slow to be of interest on hard disks.</p>
<p>Still, they have matching upper and lower bounds, so let&#8217;s see what they say:</p>
<p>They show that, for any p between 0 and 1, if Query time &lt; 1/p * log_B N then Insertion time &gt; 1/(p B^{1-p}) * log_B N, and vice versa.  That is, we get a lower envelope of the (Query, Insertion) times that are possible.  Let&#8217;s try p=1.  This says that if Query time is at most log_B N (which a B-tree gets), then insertion time is at least log_B N (also what a B-tree gets).  So a B-tree is optimal for one point of this tradeoff.</p>
<p>More interestingly, when p=1/2, the query time doubles to 2log_B N.  The insertion time is then no less than 2/sqrt{B} * log_B N.  Wow.  Taking B=4K, sqrt{B} = 64, and we get something that&#8217;s 32 times faster than a B-tree.  But we&#8217;re talking about a lower bound.  Maybe it&#8217;s not a very good lower bound.  Maybe there&#8217;s no data structure that does this well.</p>
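<p>To make the two points on the tradeoff curve concrete, here is a small Python sketch that evaluates the multipliers of log_B N in the bounds as stated in this post.  The function names are hypothetical, and the formulas are this post&#8217;s paraphrase, not code from the paper.</p>
<pre>
```python
def query_factor(p):
    """Multiplier of log_B N in the query bound: 1/p."""
    return 1.0 / p

def insertion_factor(p, B):
    """Multiplier of log_B N in the insertion lower bound: 1/(p * B**(1 - p))."""
    return 1.0 / (p * B ** (1.0 - p))

B = 4096  # B = 4K, as in the text

for p in (1.0, 0.5):
    q = query_factor(p)
    i = insertion_factor(p, B)
    # 1/i is how many times cheaper an insertion could be than in a B-tree.
    print(p, q, i, 1.0 / i)

insertion_gain_at_half = 1.0 / insertion_factor(0.5, B)
```
</pre>
<p>At p=1 both multipliers are 1, the B-tree point; at p=1/2 the query multiplier is 2 and the insertion multiplier is 1/32.</p>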
<p>It turns out there is (please see their <a href="http://www.brics.dk/~gerth/pub45.html">paper</a> for more details).  Consider what that means: you can speed up insertions by a factor of 32 by slowing down queries by a factor of 2.  That&#8217;s a pretty favorable tradeoff!  Unfortunately, their solution doesn&#8217;t do range queries any better than B-trees &#8212; in fact, it&#8217;s somewhat more cumbersome, though no worse asymptotically than B-trees.</p>
<p><span style="text-decoration: underline;">Conclusions</span></p>
<p>The Brodal &amp; Fagerberg result is quite nice.  They&#8217;ve really captured something interesting about storing data on disk.  However, it&#8217;s not the result we need, because the tradeoff that I think is interesting &#8220;in the field&#8221; is one between insertions and range queries, the kind of thing that translates to a MySQL database.  But range queries depend on locality on secondary storage, and the typical mathematical models in use for thinking about external memory (the Disk Access Model and the Cache-Oblivious Model, about which more later) don&#8217;t capture this sort of locality.  They focus on intra-block locality, not inter-block locality.  I don&#8217;t know of any theoretical results that address the problem we have been considering.</p>
<p>So if you can&#8217;t do theory, you do hacks.  Next time, we&#8217;ll start looking at some of the hacks that people use (or claim to use) to speed up their databases.  Which of them play nicely with disks?</p>
]]></content:encoded>
			<wfw:commentRss>http://tokutek.com/2008/03/tradeoff-insertions-versus-point-queries/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tradeoffs: Updates versus Range Queries</title>
		<link>http://tokutek.com/2008/03/tradeoffs-updates-versus-range-queries/</link>
		<comments>http://tokutek.com/2008/03/tradeoffs-updates-versus-range-queries/#comments</comments>
		<pubDate>Tue, 04 Mar 2008 20:14:57 +0000</pubDate>
		<dc:creator>Martin Farach-Colton</dc:creator>
				<category><![CDATA[TokuView]]></category>

		<guid isPermaLink="false">http://tokutek.com/?p=42</guid>
		<description><![CDATA[Sorry for the delay, now on to range queries and lenient updates. Let&#8217;s call them queries and updates, for short. So far, I&#8217;ve shown that B-trees (and any of a number of other data structures) are very far from the &#8220;tight bound.&#8221; I&#8217;ll say a bound is a tight if it&#8217;s a lower bound and [...]]]></description>
			<content:encoded><![CDATA[<p>Sorry for the delay, now on to range queries and lenient updates.  Let&#8217;s call them queries and updates, for short.  So far, I&#8217;ve shown that B-trees (and any of a number of other data structures) are very far from the &#8220;tight bound.&#8221;  I&#8217;ll say a bound is tight if it&#8217;s a lower bound and you can come up with a data structure that matches it.</p>
<p>So how do we match the bandwidth bound for queries and updates?  I already mentioned in passing how to do this, but let&#8217;s look more closely.</p>
<p><u>Fast Updates</u></p>
<p>The way to get fast updates is to log them.  You can easily saturate disk bandwidth by writing out the insertion, deletion and update requests with no index.  </p>
<p>A query now will typically start by sorting the data.  Even a point query requires looking at all the data, but a range query requires looking at all the data log N times (in order to sort it), or using a large amount of extra storage.  Let&#8217;s focus on sorting for range queries.  </p>
<p>For queries that range over all the data, sorting may be tolerable (after all, you&#8217;d be doing a lot better in terms of bandwidth than a B-tree gets over the same query!).  This solution does, however, have some drawbacks.  First, for smaller queries, where you only look at x% of the data, the fraction of bandwidth you are getting is x/log N, for appropriate base of the log.  You can beat this somewhat if you put a lot of work into it.  But you&#8217;re still going to get very little of your bandwidth.</p>
<p>[2 points if you've figured out that some queries don't actually require sorting the data or using lots of space.  2 more points if you figured out that end users don't want this kind of noise in their database, where some queries turn out to be unpredictably slow.  A final 2 points for figuring out that almost no one wants to learn enough about a database to be able to predict which queries will be slow.]</p>
<p><u> Fast Queries</u></p>
<p>The way to get fast queries is to lay out the data on disk carefully.  That is, the best you could do would be to take all your data, sort it, and insert it in order into a sensible database (MySQL + InnoDB will do nicely) that will lay out the leaves more or less in order.  Then range queries become linear scans on the disk, and you can get close to bandwidth rates.  This is one of the main tricks for getting performance out of a data warehouse.  Dumping and reloading data for this purpose is also one of the time-consuming things that DBAs do.</p>
<p>Inserting data now requires batching data, sorting it and inserting it.  The time per insertion might not be so bad, but the latency is killer.  It is not uncommon for data to become available for query in the database a day or two after it comes into the system.</p>
<p><u>Conclusion</u></p>
<p>As we try to tease apart what&#8217;s fast and what&#8217;s inherently slow in a database, we get the idea that there&#8217;s some basic tradeoff between updates (even lenient ones) and range queries.  Is this tradeoff inherent, or is there some way to get bandwidth speeds on both lenient updates and range queries?</p>
<p>Next time we&#8217;ll look at some tradeoff lower bounds and try to get closer to some answer.</p>
]]></content:encoded>
			<wfw:commentRss>http://tokutek.com/2008/03/tradeoffs-updates-versus-range-queries/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How Fast Can Updates Run?</title>
		<link>http://tokutek.com/2008/02/how-fast-can-updates-run/</link>
		<comments>http://tokutek.com/2008/02/how-fast-can-updates-run/#comments</comments>
		<pubDate>Mon, 11 Feb 2008 19:27:12 +0000</pubDate>
		<dc:creator>Martin Farach-Colton</dc:creator>
				<category><![CDATA[TokuView]]></category>

		<guid isPermaLink="false">http://tokutek.com/?p=45</guid>
		<description><![CDATA[Last time, I introduced the notion of strict and lenient updates. Now it&#8217;s time to see what the performance characteristics are of each. Just to rehash, we are focusing on the storage engine (a la MySQL) level, and we are looking at a database on a single disk &#8212; the one we are using for [...]]]></description>
			<content:encoded><![CDATA[<p>Last time, I introduced the notion of strict and lenient updates.  Now it&#8217;s time to see what the performance characteristics are of each.</p>
<p>Just to rehash, we are focusing on the storage engine (a la MySQL) level, and we are looking at a database on a single disk &#8212; the one we are using for illustration is the 1TB Hitachi Deskstar 7K1000.  It has a disk seek time of 14ms and a transfer rate of around 69MB/s. [See tomshardware.com]  We will insert and delete random &lt;key,value&gt; pairs, with key and value each 8 bytes.  So that&#8217;s 62.5 billion pairs to fill the disk.</p>
<p><u>Strict Updates</u></p>
<p>These are the easier update types to analyze.  Please review the definition of strict updates from the last blog entry.  Now notice that each insertion or deletion requires a point query.  For example, during an insertion, in order to determine if there&#8217;s already a row with a particular key value in the database, one must look up that key.  In order to tell how many rows are removed during a delete operation (if any), one must look up the key being deleted.  There&#8217;s no getting around the fact that the disk head needs to move to the location on disk of the information being returned.</p>
<p>Since these point queries are not the main point of what&#8217;s going on, and since many people might not realize that they are even happening, I&#8217;m going to call them <b>cryptogets</b> (&#8220;crypto&#8221; because they are hidden, and &#8220;get&#8221; because a point query is often referred to as a get at the storage engine level).</p>
<p>Since no strict update can be faster than a point query, and each point query costs a 14ms disk seek, filling up the disk with 62.5 billion pairs takes at least 28 years.  And once again, this is independent of the data structure used.</p>
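<p>As a rough check on that figure, here is a minimal Python sketch of the arithmetic.  It assumes that 1TB means 10^12 bytes, that each pair takes 16 bytes, and that every strict update costs exactly one 14ms seek.  The names are mine and purely illustrative.</p>

```python
DISK_BYTES = 10**12                  # assumed: 1TB = 10^12 bytes
PAIR_BYTES = 16                      # assumed: 8-byte key plus 8-byte value
SEEK_SECONDS = 0.014                 # the 14ms seek time quoted above
SECONDS_PER_YEAR = 365 * 24 * 3600


def seek_bound_years(disk_bytes=DISK_BYTES, pair_bytes=PAIR_BYTES,
                     seek_seconds=SEEK_SECONDS):
    """Years needed to fill the disk if every pair costs one disk seek."""
    pairs = disk_bytes // pair_bytes
    return pairs * seek_seconds / SECONDS_PER_YEAR


PAIRS = DISK_BYTES // PAIR_BYTES     # number of pairs that fit on the disk
YEARS = seek_bound_years()           # seek-bound time to insert them all

print(PAIRS, "pairs,", round(YEARS, 1), "years")
```

<p>Under those assumptions the result lands just under 28 years, which is where the figure above comes from.</p>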
<p><u>Lenient Updates</u></p>
<p>What about lenient updates?  The only lower bound is the 10 hour bandwidth lower bound we&#8217;ve already talked about.  Ten hours is the time it takes to write the data out sequentially on disk.  And certainly, if you don&#8217;t need to return any information at the time of an update, you could just log the updates, which is to say, you could write them out sequentially to disk.</p>
<p>That makes queries slow, but the point is that a log is a (very simple) data structure that matches the disk-bandwidth lower bound.</p>
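<p>To make the idea concrete, here is a toy in-memory sketch of a log used as a lenient store.  A Python list stands in for the sequentially written file, and the class and method names are hypothetical, not any storage engine&#8217;s API.  Updates append without reading anything back; a lookup has to scan the log from newest to oldest.</p>

```python
class UpdateLog:
    """Toy lenient store: every update is an append, nothing is read back."""

    def __init__(self):
        self.records = []  # stands in for a sequentially written file

    def insert(self, key, value):
        # No lookup, so the caller gets no feedback about an existing key.
        self.records.append(("insert", key, value))

    def delete(self, key):
        # No lookup, so no row count and no error for a missing key.
        self.records.append(("delete", key, None))

    def get(self, key):
        # Queries pay the price: scan from the newest record to the oldest.
        for op, k, value in reversed(self.records):
            if k == key:
                return value if op == "insert" else None
        return None
```

<p>Every update costs one append regardless of how much data is already stored, while the cost of <code>get</code> grows with the length of the log.  That is the tradeoff described above.</p>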
<p>What about B-trees?  A B-tree index will do (at least) one disk seek per update.  That means that it will take at least 28 years to finish filling the disk.  The gap between 28 years and the 10 hour bandwidth bound is even worse than the gap B-trees showed for range queries.</p>
<p>[Yes, yes, I know, what about presorting the data before inserting?  Won't that make things much faster?  Yes, it will, but that's the subject of another blog entry.]</p>
<p><u>Conclusions</u></p>
<p>B-trees do a disk seek per insertion during which they swap in all the information needed to implement a strict update.  They don&#8217;t get any performance benefit from implementing lenient updates.  I believe that this is the reason that people don&#8217;t make the distinction between different classes of updates.</p>
<p>So far I&#8217;ve focused on lower bounds.  In future blog entries, I&#8217;ll talk about algorithms for actually implementing lenient updates.  Ok, here comes the punch line: lenient updates stomp all over strict updates, performance-wise.</p>
<p>But first, we&#8217;ll look at the tradeoff between updates and range queries.  Stay tuned.</p>
]]></content:encoded>
			<wfw:commentRss>http://tokutek.com/2008/02/how-fast-can-updates-run/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Updates &amp; Discipline</title>
		<link>http://tokutek.com/2008/02/updates-discipline/</link>
		<comments>http://tokutek.com/2008/02/updates-discipline/#comments</comments>
		<pubDate>Tue, 05 Feb 2008 13:57:13 +0000</pubDate>
		<dc:creator>Martin Farach-Colton</dc:creator>
				<category><![CDATA[TokuView]]></category>

		<guid isPermaLink="false">http://tokutek.com/?p=48</guid>
		<description><![CDATA[So far, I&#8217;ve analyzed point and range queries. Now it&#8217;s time to talk about insertions and deletions. We&#8217;ll call the combination updates. Updates come in two flavors, and today we&#8217;ll cover both. Depending on the exact settings of your database, the updates give a varying amount of feedback. For example, when a key is deleted, [...]]]></description>
			<content:encoded><![CDATA[<p>So far, I&#8217;ve analyzed point and range queries.  Now it&#8217;s time to talk about insertions and deletions.  We&#8217;ll call the combination updates.  Updates come in two flavors, and today we&#8217;ll cover both.</p>
<p>Depending on the exact settings of your database, the updates give a varying amount of feedback.  For example, when a key is deleted, all rows with that key are deleted (assuming the database allows duplicate keys).  The normal behavior is to return the number of rows deleted.  The normal behavior when deleting a key that has no corresponding rows in the database is to return an error message.  On insertion, one can allow duplicates or not.  In the latter case, the storage engine returns an error message if a duplicate insertion is attempted.</p>
<p>We&#8217;ll see that the details of error messages have a profound influence on the lower-bound arguments I&#8217;ve been making (and we&#8217;ll see a bit later that they have a profound influence on how quickly you can actually implement these operations).  Here, we define two classes of update operations:<br />
<br />
<u>Strict Updates</u></p>
<p>These are the insertions and deletions that come by default in MySQL, and many other databases.  </p>
<ul>
<li>Insertions let you know if there&#8217;s already a row with that key in the database.</li>
<li>Unique-key indices: Returns an error on the insertion of a repeated key.</li>
<li>Deleting a key will tell you how many rows had that key.</li>
<li>Deletion of a non-existing key will return an error.</li>
</ul>
<p>As I say, these are the types of updates that we are all used to.<br />
</p>
<p><u>Lenient Updates</u></p>
<p>The alternative to strict updates is lenient updates.  In a sense, they are equivalent, in that there&#8217;s nothing you can do with one that you can&#8217;t do with the other.  It&#8217;s just going to turn out that lenient updates can be made to run <b>much</b> faster.</p>
<ul>
<li>Insertions do not let you know if there&#8217;s already a row with that key in the database. </li>
<li>Unique-key indices: Overwrites with new value on the insertion of a repeated key.  Does not return an error.</li>
<li>Deleting a key will not tell you how many rows had that key.</li>
<li>Deletion of a non-existing key will not return an error.</li>
</ul>
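<p>The two lists above can be sketched side by side for a unique-key index.  This is a minimal illustration, not MySQL&#8217;s behavior or API: the class names and the <code>DuplicateKey</code> exception are hypothetical, and a Python dictionary stands in for the on-disk structure.  What matters is what each call reports back to the caller.</p>

```python
class DuplicateKey(Exception):
    """Hypothetical error raised by a strict insert of a repeated key."""


class StrictIndex:
    """Unique-key index that gives feedback on every update."""

    def __init__(self):
        self.rows = {}

    def insert(self, key, value):
        if key in self.rows:          # must check whether the key exists
            raise DuplicateKey(key)
        self.rows[key] = value

    def delete(self, key):
        if key not in self.rows:      # must check whether the key exists
            raise KeyError(key)
        del self.rows[key]
        return 1                      # number of rows deleted


class LenientIndex:
    """Unique-key index that reports nothing back to the caller."""

    def __init__(self):
        self.rows = {}

    def insert(self, key, value):
        self.rows[key] = value        # overwrite silently, no error

    def delete(self, key):
        self.rows.pop(key, None)      # no row count, no error if missing
```

<p>Both classes end up holding the same rows after the same sequence of successful operations.  The only difference is that the strict version has to find out whether the key is present before it can answer.</p>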
<p><u>Conclusions</u></p>
<p>What&#8217;s the upshot?  I&#8217;ve been presenting operations in terms of lower bounds that are dominated by disk seeks (slow!) versus those dominated by disk bandwidth (much faster).  See if you can use this analysis to understand the behavior of these two types of updates.</p>
]]></content:encoded>
			<wfw:commentRss>http://tokutek.com/2008/02/updates-discipline/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Range Queries: Is the Bottleneck Seeks or Bandwidth?</title>
		<link>http://tokutek.com/2008/01/range-queries-is-the-bottleneck-seeks-or-bandwidth/</link>
		<comments>http://tokutek.com/2008/01/range-queries-is-the-bottleneck-seeks-or-bandwidth/#comments</comments>
		<pubDate>Tue, 22 Jan 2008 02:25:27 +0000</pubDate>
		<dc:creator>Martin Farach-Colton</dc:creator>
				<category><![CDATA[TokuView]]></category>

		<guid isPermaLink="false">http://tokutek.com/?p=51</guid>
		<description><![CDATA[Last time I talked about point queries. The conclusion was that big databases and point queries don&#8217;t mix. It&#8217;s ok to do them from time to time, but it&#8217;s not how you&#8217;re going to use your database, unless you have a lot of time. Today, I&#8217;d like to talk about range queries, which seem much [...]]]></description>
			<content:encoded><![CDATA[<p>Last time I talked about point queries.  The conclusion was that big databases and point queries don&#8217;t mix.  It&#8217;s ok to do them from time to time, but it&#8217;s not how you&#8217;re going to use your database, unless you have a lot of time.  Today, I&#8217;d like to talk about range queries, which seem much more useful for the analysis of big databases, say in a business intelligence setting.</p>
<p>Recall that the focus is on the storage engine (a la MySQL) level, and a database on a single disk.  The one we are using for illustration is the 1TB Hitachi Deskstar 7K1000.  It has a disk seek time of 14ms and a transfer rate of around 69MB/s [See tomshardware.com].  Now imagine filling the disk with random &lt;key,value&gt; pairs, with 8 bytes for each key and 8 bytes for each value.  So that&#8217;s 62.5 billion 16-byte pairs.</p>
<p><u>Range Queries</u></p>
<p>Suppose the above data is stored in a B-tree, and that you&#8217;d like to iterate over all the data in order by key.  Further suppose that the B-tree has 4KB blocks.  Thus each leaf has 256 key-value pairs.  The tree has around 244 million leaves.   Before we figure out how long it&#8217;s going to take to look at those leaves, we have to understand how aged the B-tree is.  </p>
<p>A B-tree that you get from loading up the data in sorted order in one fell swoop has all the leaves laid out in a relatively nice order, so that sibling leaves tend to be on the same track, and you get lots of leaves for each disk seek.  As the B-tree ages (either from doing insertions &#038; deletions, or from inserting things in a more random order to begin with), the leaves tend to be scattered all over the place.  Let&#8217;s assume that the tree is well aged to see how bad things can get.</p>
<p>A range query over the whole data set in our scenario will give a disk seek per leaf, which would take 3.4 x 10^6 seconds (= 244 x 10^6 x 14 x 10^-3 seconds) or 39 days.  For one query.</p>
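<p>Here is the same arithmetic as a small Python sketch.  It assumes that 1TB means 10^12 bytes, that each pair takes 16 bytes, and that a fully aged tree costs exactly one seek per leaf.  It counts seeks only and ignores internal nodes and transfer time.  The function name is hypothetical.</p>

```python
DISK_BYTES = 10**12          # assumed: 1TB = 10^12 bytes
PAIR_BYTES = 16              # assumed: 8-byte key plus 8-byte value
SEEK_SECONDS = 0.014         # the 14ms seek time quoted above
SECONDS_PER_DAY = 24 * 3600


def full_scan_days(block_bytes=4096):
    """Days to scan every leaf if each leaf costs one disk seek."""
    pairs_per_leaf = block_bytes // PAIR_BYTES
    leaves = DISK_BYTES // PAIR_BYTES // pairs_per_leaf
    return leaves * SEEK_SECONDS / SECONDS_PER_DAY


PAIRS_PER_LEAF = 4096 // PAIR_BYTES                   # pairs in a 4KB leaf
LEAVES = DISK_BYTES // PAIR_BYTES // PAIRS_PER_LEAF   # leaves in the tree
SCAN_DAYS = full_scan_days()

print(PAIRS_PER_LEAF, LEAVES, round(SCAN_DAYS, 1))
```

<p>With 4KB blocks this gives 256 pairs per leaf, roughly 244 million leaves, and a scan time between 39 and 40 days, matching the estimate above.</p>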
<p>Now the question is, can you do better?  The first thing to note is that I have made my argument as to how long a B-tree would take, and in particular one that&#8217;s aged and has a specific block size.  In the point query case, I made an argument about *any* data structure.  In this case, I can&#8217;t.  I can&#8217;t say that there isn&#8217;t some other data structure that ages well, for example, and thus does better.  For example, in the extreme case where the data has been sorted on disk, it can be read at full bandwidth, in which case it would take 10 hours to do a range query.  That&#8217;s still a longish time, but it&#8217;s not 39 days, and it means that the B-tree example is using about 1% of available disk bandwidth.</p>
<p>So there are certainly ways to fix this.  Use a bigger block size for the B-tree (and kill insertion performance).  Sort the data before insertion into the B-tree (and kill insertion performance).  The point is that unlike point queries, the natural lower bound is bandwidth, not disk seeks, and the textbook data structure for storing data on disk has poor performance for this operation.</p>
<p>As I&#8217;ve already argued, range queries are the natural query type for large data sets, not least because of the argument above, that point queries can&#8217;t even be implemented on disks, but also more positively because large data is useful for large-scale market analysis, where aggregate statistics about the data set are mined.  And that&#8217;s implemented by range queries.</p>
<p><u>Conclusion</u></p>
<p>Unlike point queries, where performance will only be gained by (expensive!) hardware changes, data structural changes are key for range queries.  In a way, we already know this.  What is a data warehouse if not a database that&#8217;s been optimized for range queries at the expense of insertion times?</p>
<p>In future postings, we&#8217;ll see if there are any alternatives to a data warehouse for achieving fast range queries (hint: there are).  In the next posting, we&#8217;ll look at insertion rates.</p>
]]></content:encoded>
			<wfw:commentRss>http://tokutek.com/2008/01/range-queries-is-the-bottleneck-seeks-or-bandwidth/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
