<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Tokutek &#187; zardosht</title>
	<atom:link href="http://tokutek.com/author/zardosht/feed/" rel="self" type="application/rss+xml" />
	<link>http://tokutek.com</link>
	<description>Database Performance</description>
	<lastBuildDate>Mon, 26 Jul 2010 19:52:50 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>On &#8220;Replace Into&#8221;, &#8220;Insert Ignore&#8221;, and Secondary Keys</title>
		<link>http://tokutek.com/2010/07/on-replace-into-insert-ignore-and-secondary-keys/</link>
		<comments>http://tokutek.com/2010/07/on-replace-into-insert-ignore-and-secondary-keys/#comments</comments>
		<pubDate>Wed, 21 Jul 2010 20:51:55 +0000</pubDate>
		<dc:creator>zardosht</dc:creator>
				<category><![CDATA[TokuView]]></category>
		<category><![CDATA[B-tree]]></category>
		<category><![CDATA[disk seek]]></category>
		<category><![CDATA[Fractal Trees]]></category>
		<category><![CDATA[insert ignore]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[replace into]]></category>
		<category><![CDATA[TokuDB]]></category>

		<guid isPermaLink="false">http://tokutek.com/?p=1584</guid>
		<description><![CDATA[In posts on June 30 and July 6, I explained how implementing the commands &#8220;replace into&#8221; and &#8220;insert ignore&#8221; with TokuDB&#8217;s fractal tree data structures can be two orders of magnitude faster than implementing them with B-trees. Towards the end of each post, I hinted that there are some caveats that complicate the story [...]]]></description>
			<content:encoded><![CDATA[<p>
In posts on <a href="http://tokutek.com/2010/06/making-replace-into-fast-by-avoiding-disk-seeks/">June 30</a> and <a href="http://tokutek.com/2010/07/making-insert-ignore-fast-by-avoiding-disk-seeks/">July 6</a>, I explained how implementing the commands &#8220;replace into&#8221; and &#8220;insert ignore&#8221; with <a href="http://tokutek.com/2010/04/how-fractal-trees-work-talk-at-mysql-2010/">TokuDB&#8217;s fractal tree data structures</a> can be two orders of magnitude faster than implementing them with B-trees. Towards the end of each post, I hinted that there are some caveats that complicate the story a little. In this post, I explain one of the complications: secondary indexes.</p>
<p>
Secondary indexes act the same way in TokuDB as they do in InnoDB. They store the defined secondary key, and the primary key as a pointer to the rest of the row. So, say the table foo has the following schema:</p>
<pre>
create table foo (a int, b int, c int, primary key (a), key(b));
</pre>
<p>And we did:</p>
<pre>
insert into foo values (1,10,100),(2,20,200);
</pre>
<p>
Logically, there is one dictionary that stores all the data (this is the clustered primary key). Let us call it the main dictionary:</p>
<pre>
key  value
1    10,100
2    20,200
</pre>
<p>And there is another dictionary for the secondary key that stores the column &#8216;b&#8217; and the primary key, &#8216;a&#8217;:</p>
<pre>
key  value
10   1
20   2
</pre>
<p>
For secondary indexes to work properly, there must be a one-to-one correspondence between elements in the secondary index and in the primary index. If this correspondence is broken, then the table is corrupt.</p>
<p>
Now suppose we were to execute:</p>
<pre>
replace into foo values (1,1000,1000);
</pre>
<p>
This does:<br />
<UL><br />
<LI> in the main dictionary, replace the entry with key &#8216;1&#8217; and value &#8216;10,100&#8217; with key &#8216;1&#8217; and value &#8216;1000,1000&#8217;.<br />
<LI> in the secondary dictionary, remove the key &#8216;10&#8217; with value &#8216;1&#8217;.<br />
<LI> in the secondary dictionary, insert the key &#8216;1000&#8217; with value &#8216;1&#8217;.<br />
</UL></p>
<p>
Notice that we cannot perform the second step unless we know the content of the existing row that is being replaced. Learning the content of the existing row requires a lookup in the main dictionary, which incurs a disk seek.</p>
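<p>
As a rough illustration, the three steps can be sketched in Python with two plain dicts standing in for the main and secondary dictionaries. This is a toy model under stated assumptions, not TokuDB code: the helper name replace_into and the lookups list are hypothetical, and the values of b are assumed to be distinct, as they are in the example rows.</p>
<pre>
```python
# Toy model: plain dicts stand in for the two on-disk dictionaries.
# Assumes distinct values of b, as in the example rows above.
main = {1: (10, 100), 2: (20, 200)}    # a maps to (b, c)
secondary = {10: 1, 20: 2}             # b maps to a
lookups = []                           # primary keys we had to read first


def replace_into(a, b, c):
    """Hypothetical helper that mirrors the three steps listed above."""
    old = main.get(a)                  # the read that costs a disk seek
    lookups.append(a)
    main[a] = (b, c)                   # step 1: overwrite the main entry
    if old is not None:
        del secondary[old[0]]          # step 2: drop the stale secondary entry
    secondary[b] = a                   # step 3: add the new secondary entry


replace_into(1, 1000, 1000)
print(main)        # {1: (1000, 1000), 2: (20, 200)}
print(secondary)   # {20: 2, 1000: 1}
```
</pre>
<p>
Step 2 needs the old value of b, and the only place to find it is the existing row. That is why the read of the main dictionary cannot be skipped once a secondary key exists.</p>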
<p>
So, when executing &#8220;replace into&#8221; or &#8220;insert ignore&#8221; on tables with secondary keys, all engines must still incur a disk seek on the primary dictionary to learn where associated elements are in a secondary index, whereas if no secondary keys exist, then TokuDB&#8217;s fractal trees can avoid this disk seek.</p>
<p>
Even with secondary indexes, fractal tree indexes are preferred. B-trees still incur additional disk seeks on insertions into secondary indexes that fractal trees do not. However, with no secondary indexes, fractal trees can do away with the mandatory disk seek, whereas B-trees cannot.</p>
]]></content:encoded>
			<wfw:commentRss>http://tokutek.com/2010/07/on-replace-into-insert-ignore-and-secondary-keys/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Why &#8220;insert &#8230; on duplicate key update&#8221; May Be Slow, by Incurring Disk Seeks</title>
		<link>http://tokutek.com/2010/07/why-insert-on-duplicate-key-update-may-be-slow-by-incurring-disk-seeks/</link>
		<comments>http://tokutek.com/2010/07/why-insert-on-duplicate-key-update-may-be-slow-by-incurring-disk-seeks/#comments</comments>
		<pubDate>Wed, 14 Jul 2010 19:17:49 +0000</pubDate>
		<dc:creator>zardosht</dc:creator>
				<category><![CDATA[TokuView]]></category>
		<category><![CDATA[B-tree]]></category>
		<category><![CDATA[disk seek]]></category>
		<category><![CDATA[Fractal Trees]]></category>
		<category><![CDATA[insert]]></category>
		<category><![CDATA[insert ignore]]></category>
		<category><![CDATA[insert on duplicate key update]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[replace into]]></category>
		<category><![CDATA[TokuDB]]></category>

		<guid isPermaLink="false">http://tokutek.com/?p=1580</guid>
		<description><![CDATA[In my post on June 18th, I explained why the semantics of normal ad-hoc insertions with a primary key are expensive because they require disk seeks on large data sets. I previously explained why it would be better to use &#8220;replace into&#8221; or to use &#8220;insert ignore&#8221; over normal inserts. In this post, I explain [...]]]></description>
			<content:encoded><![CDATA[<p>
In my post on June 18th, I explained why the semantics of normal ad-hoc insertions with a primary key are expensive because they require disk seeks on large data sets. I previously explained why it would be better <a href="http://tokutek.com/2010/06/making-replace-into-fast-by-avoiding-disk-seeks/">to use &#8220;replace into&#8221;</a> or <a href="http://tokutek.com/2010/07/making-insert-ignore-fast-by-avoiding-disk-seeks/">to use &#8220;insert ignore&#8221;</a> over normal inserts. In this post, I explain why another alternative to normal inserts, &#8220;insert &#8230; on duplicate key update&#8221;, is no better in MySQL, because the command incurs disk seeks.</p>
<p>
The reason &#8220;insert ignore&#8221; and &#8220;replace into&#8221; can be made fast with <a href="http://tokutek.com/2010/04/how-fractal-trees-work-talk-at-mysql-2010/">TokuDB&#8217;s fractal trees</a> is that the semantics of what to do when a duplicate key is found are simple. In one case, you ignore, and in the other, you overwrite. With specific tombstone messages defined for these simple semantics, we defer the uniqueness check to a more opportune time.</p>
<p>
The semantics of &#8220;insert &#8230; on duplicate key update&#8221; are not simple:<br />
<UL><br />
<LI>if the primary (or unique) key does not exist, insert the new row<br />
<LI>if the primary (or unique) key does exist, perform some update as defined in the SQL statement<br />
</UL></p>
<p>
The problem is we do not have a way of encoding the SQL update function into a message, the way we are able to encode &#8220;replace into&#8221; as an &#8216;i&#8217; and &#8220;insert ignore&#8221; as an &#8216;ii&#8217;. If we did, we could similarly make &#8220;insert &#8230; on duplicate key update&#8221; fast.</p>
<p>
I am not claiming that this is not theoretically possible, just that the storage engine API in MySQL does not allow for the encoding of updates as messages. Instead, what MySQL does is the following:<br />
<UL><br />
<LI>call handler::write_row to attempt an insertion, if it succeeds, we are done<br />
<LI>if handler::write_row returns an error indicating a duplicate key, outside of the handler, apply the necessary update to the row<br />
<LI>call handler::update_row to apply the update<br />
</UL></p>
<p>
The storage engine API does not have any access to the function that applies an update to the existing row. This is why the storage engine has no way of encoding any SQL update function (even some simple ones, such as &#8220;increment column a&#8221;).</p>
<p>
So, in the meantime, to implement these semantics, B-trees and Fractal Tree data structures both:<br />
<UL><br />
<LI>look up the primary (or unique) key to verify existence<br />
<LI>take the appropriate action based on whether the primary (or unique) key exists<br />
</UL></p>
<p>
The first step incurs a disk seek on large data sets with an ad-hoc primary (or unique) key. And that is why it is slow.</p>
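<p>
The flow can be sketched in Python as follows. This is only a toy model: ToyHandler, DuplicateKey, and increment_b are hypothetical names, and the two methods merely borrow the names handler::write_row and handler::update_row from the description above. They are not the MySQL API. The lookups counter stands for reads that would be disk seeks on a large, ad-hoc data set.</p>
<pre>
```python
class DuplicateKey(Exception):
    """Raised by the toy write_row when the primary key already exists."""


class ToyHandler:
    """Hypothetical stand-in for a storage engine handler; not the MySQL API."""

    def __init__(self):
        self.rows = {}      # primary key maps to a row (a dict of columns)
        self.lookups = 0    # each lookup models a potential disk seek

    def write_row(self, key, row):
        self.lookups += 1               # the existence check cannot be deferred
        if key in self.rows:
            raise DuplicateKey(key)
        self.rows[key] = row

    def update_row(self, key, new_row):
        self.rows[key] = new_row        # the handler only sees the finished row


def insert_on_duplicate_key_update(handler, key, row, update):
    """Server-side flow: the update function never reaches the handler."""
    try:
        handler.write_row(key, row)
    except DuplicateKey:
        # The update is computed outside the handler, from the existing row.
        new_row = update(dict(handler.rows[key]))
        handler.update_row(key, new_row)


def increment_b(row):
    row["b"] += 1
    return row


h = ToyHandler()
insert_on_duplicate_key_update(h, 1, {"b": 10}, increment_b)   # inserts
insert_on_duplicate_key_update(h, 1, {"b": 10}, increment_b)   # updates
print(h.rows, h.lookups)   # {1: {'b': 11}} 2
```
</pre>
<p>
Because update_row receives only the finished row, the toy handler has nothing it could encode as a message. Every call pays for the lookup in write_row.</p>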
<p>
So, the moral of the story is this. In MySQL, &#8220;insert &#8230; on duplicate key update&#8221; is slower than &#8220;replace into&#8221;. Although the semantics are slightly different in the case where the primary key is found (the former is defined as an update, whereas the latter is defined as a delete followed by an insert), if possible, the simpler semantics of &#8220;replace into&#8221; allow it to be faster than &#8220;insert &#8230; on duplicate key update&#8221;.</p>
]]></content:encoded>
			<wfw:commentRss>http://tokutek.com/2010/07/why-insert-on-duplicate-key-update-may-be-slow-by-incurring-disk-seeks/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Making &#8220;Insert Ignore&#8221; Fast, by Avoiding Disk Seeks</title>
		<link>http://tokutek.com/2010/07/making-insert-ignore-fast-by-avoiding-disk-seeks/</link>
		<comments>http://tokutek.com/2010/07/making-insert-ignore-fast-by-avoiding-disk-seeks/#comments</comments>
		<pubDate>Tue, 06 Jul 2010 20:57:15 +0000</pubDate>
		<dc:creator>zardosht</dc:creator>
				<category><![CDATA[TokuView]]></category>
		<category><![CDATA[B-tree]]></category>
		<category><![CDATA[disk seek]]></category>
		<category><![CDATA[Fractal Trees]]></category>
		<category><![CDATA[insert]]></category>
		<category><![CDATA[insert ignore]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[TokuDB]]></category>

		<guid isPermaLink="false">http://tokutek.com/?p=1577</guid>
		<description><![CDATA[In my post from three weeks ago, I explained why the semantics of normal ad-hoc insertions with a primary key are expensive because they require disk seeks on large data sets. Towards the end of the post, I claimed that it would be better to use “replace into” or “insert ignore” over normal inserts, because [...]]]></description>
			<content:encoded><![CDATA[<p>
In my <a href="http://tokutek.com/2010/06/disk-seeks-are-evil-so-let%e2%80%99s-avoid-them-pt-4/">post from three weeks ago</a>, I explained why the semantics of normal ad-hoc insertions with a primary key are expensive because they require disk seeks on large data sets. Towards the end of the post, I claimed that it would be better to use “replace into” or “insert ignore” over normal inserts, because the semantics of these statements do NOT require disk seeks. In my <a href="http://tokutek.com/2010/06/making-replace-into-fast-by-avoiding-disk-seeks/">post last week</a>, I explained how the command “replace into” can be fast with <a href="http://tokutek.com/2010/04/how-fractal-trees-work-talk-at-mysql-2010/">TokuDB&#8217;s fractal trees</a>. Today, I explain how &#8220;insert ignore&#8221; can be fast, using a strategy that is very similar to what we do with &#8220;replace into&#8221;.</p>
<p>
The semantics of &#8220;insert ignore&#8221; are similar to those of &#8220;replace into&#8221;:<br />
<UL><br />
<LI> if the primary (or unique) key does not exist: insert the new row<br />
<LI> if the primary (or unique) key does exist: do nothing<br />
</UL></p>
<p>
B-trees have the same problem with &#8220;insert ignore&#8221; that they have with &#8220;replace into&#8221;. They perform a lookup of the primary key, incurring a disk seek. We have already shown how fractal trees do not incur this disk seek for &#8220;replace into&#8221;, so let&#8217;s see how we can avoid disk seeks with &#8220;insert ignore&#8221;.</p>
<p>
The only difference with &#8220;replace into&#8221; is that when the primary (or unique) key exists, instead of overwriting the old row with the new row, we disregard the new row. So, all we need to do is tweak our tombstone messaging scheme (that we use for deletes and &#8220;replace into&#8221;) so that &#8220;insert ignore&#8221; commands do not overwrite old rows with new rows. Similar to deletes and &#8220;replace into&#8221;, with this scheme, &#8220;insert ignore” can be two orders of magnitude faster than insertions into a B-tree.</p>
<p>
Here is what we do. We insert a message into the fractal tree, with a new message &#8220;ii&#8221;, to signify that we are doing an &#8220;insert ignore&#8221;. The only difference between this message and the normal &#8220;i&#8221; message for insertions is what we do on queries and merges. On queries, if the message is an &#8220;ii&#8221;, then the value in the LOWER node is read, and not the higher node. On merges, if the higher node has a message of &#8220;ii&#8221;, the value in the LOWER node takes precedence over the value in the higher node.</p>
<p>
Let&#8217;s look at an example that is similar to what we looked at for &#8220;replace into&#8221;:</p>
<pre>
create table foo (a int, b int, primary key (a));
</pre>
<p>
Suppose the fractal tree for this table looks as follows:</p>
<pre>
- 

- -

- - - -

....

(i (1,1)) (i (2,2)) (i (3,3)) (i (4,4)) ... (i (1000,1000)) ... (i (2^32, 2^32))
</pre>
<p>
The ‘i’ stands for insertion message. Now suppose we do:</p>
<pre>
insert ignore into foo values (1000, 1001);
</pre>
<p>
With fractal trees, we insert (ii (1000,1001)) into the top node. The tree then looks as such:</p>
<pre>
(ii (1000,1001)) 

- -

- - - -

....

(i (1,1)) (i (2,2)) (i (3,3)) (i (4,4)) ... (i (2^32, 2^32))
</pre>
<p>
So upon querying the key &#8216;1000&#8217;, a cursor notices that (1000,1001) has a message of &#8220;ii&#8221;. If it finds another value for the key 1000 in a lower node, it reads that value; otherwise, it reads (1000,1001). Because (1000,1000) is located in a lower node, the cursor returns (1000,1000) to the user. On merges, the message in the lower node, (1000,1000), overwrites the message in the higher node, (1000,1001).</p>
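<p>
The query and merge rules for &#8220;ii&#8221; can be sketched in Python. This is a simplified model of my own, not TokuDB&#8217;s implementation: the tree is a list of dicts ordered from the top node down to the leaves, each message is a (kind, value) pair, tombstones are left out, and the function names are hypothetical.</p>
<pre>
```python
def insert_ignore(levels, key, value):
    """Drop an 'ii' message into the top node without reading anything."""
    levels[0][key] = ("ii", value)


def query(levels, key):
    """Resolve a key by walking from the top node down to the leaves."""
    fallback = None
    for node in levels:                 # top node first, leaves last
        if key in node:
            kind, value = node[key]
            if kind == "i":
                return value            # a plain insertion settles the lookup
            fallback = value            # 'ii': a lower value would still win
    return fallback


def merge(higher, lower):
    """Flush every message in the higher node down into the lower node."""
    for key, (kind, value) in higher.items():
        if kind == "ii" and key in lower:
            continue                    # the existing lower value takes precedence
        lower[key] = (kind, value)
    higher.clear()


levels = [{}, {}, {1: ("i", 1), 1000: ("i", 1000)}]
insert_ignore(levels, 1000, 1001)       # key exists lower down: ignored on read
insert_ignore(levels, 5, 50)            # key is new: the 'ii' value is used
print(query(levels, 1000), query(levels, 5))   # 1000 50
merge(levels[0], levels[2])
print(levels[2][1000], levels[2][5])    # ('i', 1000) ('ii', 50)
```
</pre>
<p>
Note that insert_ignore never reads a lower node. The decision about which value wins is postponed to query time or merge time, which is the whole point of the scheme.</p>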
<p>
While &#8220;insert ignore&#8221; can be fast, there are caveats (indexes, triggers, replication), just as there are with &#8220;replace into&#8221;. In a future posting, I will get into some of them.</p>
]]></content:encoded>
			<wfw:commentRss>http://tokutek.com/2010/07/making-insert-ignore-fast-by-avoiding-disk-seeks/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Making &#8220;Replace Into&#8221; Fast, by Avoiding Disk Seeks</title>
		<link>http://tokutek.com/2010/06/making-replace-into-fast-by-avoiding-disk-seeks/</link>
		<comments>http://tokutek.com/2010/06/making-replace-into-fast-by-avoiding-disk-seeks/#comments</comments>
		<pubDate>Wed, 30 Jun 2010 04:10:02 +0000</pubDate>
		<dc:creator>zardosht</dc:creator>
				<category><![CDATA[TokuView]]></category>
		<category><![CDATA[B-tree]]></category>
		<category><![CDATA[disk seek]]></category>
		<category><![CDATA[Fractal Trees]]></category>
		<category><![CDATA[insert]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[replace]]></category>
		<category><![CDATA[replace into]]></category>
		<category><![CDATA[TokuDB]]></category>
		<category><![CDATA[update]]></category>

		<guid isPermaLink="false">http://tokutek.com/?p=1548</guid>
		<description><![CDATA[In this post two weeks ago, I explained why the semantics of normal ad-hoc insertions with a primary key are expensive because they require disk seeks on large data sets. Towards the end of the post, I claimed that it would be better to use &#8220;replace into&#8221; or &#8220;insert ignore&#8221; over normal inserts, because the [...]]]></description>
			<content:encoded><![CDATA[<p>
In <a href="http://tokutek.com/2010/06/disk-seeks-are-evil-so-let%e2%80%99s-avoid-them-pt-4/">this</a> post two weeks ago, I explained why the semantics of normal ad-hoc insertions with a primary key are expensive because they require disk seeks on large data sets. Towards the end of the post, I claimed that it would be better to use &#8220;replace into&#8221; or &#8220;insert ignore&#8221; over normal inserts, because the semantics of these statements do NOT require disk seeks. In this post, I explain how the command &#8220;replace into&#8221; can be fast with fractal trees. </p>
<p>
The semantics of &#8220;replace into&#8221; are as follows:<br />
<UL><br />
<LI>if the primary (or unique) key does not exist, insert the new row<br />
<LI>if the primary (or unique) key does exist, overwrite the existing row with the new row<br />
</UL></p>
<p>
The slow, expensive way B-trees implement these semantics is:<br />
<UL><br />
<LI>look up the primary (or unique) key to verify its existence<br />
<LI>if it does not exist, insert the new row, otherwise overwrite the existing row<br />
</UL></p>
<p>
The first step incurs a disk seek. That slows down performance considerably.</p>
<p>
Instead, with <a href="http://tokutek.com/2010/04/how-fractal-trees-work-talk-at-mysql-2010/">TokuDB&#8217;s fractal tree data structure</a>, we can follow a similar <a href="http://tokutek.com/2010/06/disk-seeks-are-evil-so-let%e2%80%99s-avoid-them-pt-3-deletions/">strategy to what we use for deletes</a>. Recall that for deletes, we do not look up the key we wish to delete, and physically remove its data from the tree. Instead, we use tombstone messaging, or tombstone deletions, to defer the physical removal of the data to a better time. The same idea applies here. Instead of searching for the primary key, as we are forced to do with B-trees, we insert an insertion message into the fractal tree, and defer the existence check for later.</p>
<p>
Let us look at an example. Take the following table:</p>
<pre>
create table foo (a int, b int, primary key (a));
</pre>
<p>
Suppose the fractal tree for this table looks as follows:</p>
<pre>
- 

- -

- - - -

....

(i (1,1)) (i (2,2)) (i (3,3)) (i (4,4)) ... (i (1000,1000)) ... (i (2^32, 2^32))
</pre>
<p>
The &#8216;i&#8217; stands for insertion message. Now suppose we do:</p>
<pre>
replace into foo values (1000, 1001);
</pre>
<p>
With fractal trees, we simply insert (i (1000,1001)) into the top node. The tree then looks as such:</p>
<pre>
(i (1000,1001)) 

- -

- - - -

....

(i (1,1)) (i (2,2)) (i (3,3)) (i (4,4)) ... (i (2^32, 2^32))
</pre>
<p>
Similar to deletes, with this scheme, &#8220;replace into&#8221; can be two orders of magnitude faster than insertions into a B-tree.</p>
<p>
On queries, a message in a higher node overrides messages in lower nodes. So upon querying the key &#8216;1000&#8217;, a cursor notices that (1000,1001) is located higher than (1000,1000), and therefore returns (1000,1001) to the user. On merges, the message in the higher node overwrites the message in the lower node.</p>
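<p>
Here is a minimal Python sketch of these rules, under the same kind of simplification as the diagrams: the tree is a list of dicts ordered from the top node to the leaves, and only &#8216;i&#8217; messages are modeled. The function names are hypothetical, and this is not how TokuDB is actually coded.</p>
<pre>
```python
def replace_into(levels, key, value):
    """Drop an insertion message into the top node; the old row is never read."""
    levels[0][key] = ("i", value)


def query(levels, key):
    """Return the value from the highest node holding a message for the key."""
    for node in levels:                 # top node first, leaves last
        if key in node:
            return node[key][1]
    return None


def merge(higher, lower):
    """Flush messages downward; the higher (newer) message overwrites."""
    lower.update(higher)
    higher.clear()


levels = [{}, {}, {1: ("i", 1), 1000: ("i", 1000)}]
replace_into(levels, 1000, 1001)
before_merge = query(levels, 1000)      # served from the top node
merge(levels[0], levels[2])
after_merge = query(levels, 1000)       # served from the leaf
print(before_merge, after_merge)        # 1001 1001
```
</pre>
<p>
The answer is the same before and after the merge. Only the place where the old row gets overwritten has moved, from statement time to merge time.</p>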
<p>
So, by using messages, fractal trees can achieve the same performance boost for &#8220;replace into&#8221; as they do for insertions. In fact, using &#8220;replace into&#8221; in the manner above is how a customer has achieved an 80x speedup under actual field conditions. The details can be found <a href="http://tokutek.com/customers/a-social-networking-case-study/">here</a>.</p>
<p>
Next week, I will explain how &#8220;insert ignore&#8221; can similarly be fast. Also, while this shows how &#8220;replace into&#8221; can be fast (and IS fast for a lot of scenarios we see with TokuDB), there are some caveats (with indexes, triggers, and replication). I will get into those in a couple of weeks.</p>
]]></content:encoded>
			<wfw:commentRss>http://tokutek.com/2010/06/making-replace-into-fast-by-avoiding-disk-seeks/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Making Updates Fast, by Avoiding Disk Seeks</title>
		<link>http://tokutek.com/2010/06/making-updates-fast-by-avoiding-disk-seeks/</link>
		<comments>http://tokutek.com/2010/06/making-updates-fast-by-avoiding-disk-seeks/#comments</comments>
		<pubDate>Tue, 22 Jun 2010 14:19:23 +0000</pubDate>
		<dc:creator>zardosht</dc:creator>
				<category><![CDATA[TokuView]]></category>
		<category><![CDATA[B-tree]]></category>
		<category><![CDATA[disk seek]]></category>
		<category><![CDATA[Fractal Trees]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[TokuDB]]></category>
		<category><![CDATA[update]]></category>

		<guid isPermaLink="false">http://tokutek.com/?p=1519</guid>
		<description><![CDATA[The analysis that shows how to make deletions really fast by using clustering keys and TokuDB&#8217;s fractal tree based engine also applies to make updates really fast. (I left it out of the last post to keep the story simple). As a quick example, let&#8217;s look at the following statement: update foo set price=price+1 where [...]]]></description>
			<content:encoded><![CDATA[<p>
The <a href="http://tokutek.com/2010/06/making-deletions-fast-by-avoiding-disk-seeks/">analysis</a> that shows how to make deletions really fast by using <a href="http://tokutek.com/2009/05/introducing_multiple_clustering_indexes/">clustering keys</a> and TokuDB&#8217;s <a href="http://tokutek.com/2010/04/how-fractal-trees-work-talk-at-mysql-2010/">fractal tree</a> based engine also applies to making updates really fast. (I left it out of the last post to keep the story simple.) As a quick example, let&#8217;s look at the following statement:</p>
<pre>
update foo set price=price+1 where product='toy';
</pre>
<p>
Executing this statement has two steps:<br />
<UL><br />
<LI>a query to find where product='toy'<br />
<LI>a combination of insertions and deletions to change old rows to new rows<br />
</UL></p>
<p>
The analysis is identical to that for deletions. Just like for deletes, clustering keys make the query go fast, as explained <a href="http://tokutek.com/2010/06/making-deletions-fast-by-avoiding-disk-seeks/">here</a>. In this case, the appropriate clustering key would be on &#8216;product&#8217;. And fractal tree data structures make the insertions and deletions go fast, as explained <a href="http://tokutek.com/2010/05/disk-seeks-are-evil-so-lets-avoid-them-part-1/">here</a> and <a href="http://tokutek.com/2010/06/disk-seeks-are-evil-so-let%e2%80%99s-avoid-them-pt-3-deletions/">here</a>.</p>
<p>
So, the same story applies. Updates are a combination of queries and value changes. To make updates fast, both the query and the value changes need to be fast. With B-tree based storage engines, users may be hesitant to add indexes to speed up the query, due to the added cost of value changes in the index. With TokuDB, this tradeoff does not exist, because TokuDB uses fractal tree data structures. Using a clustering key to drive down the cost of the query, and using fractal tree indexes to keep the cost of the data changes down, leads to very fast updates.</p>
]]></content:encoded>
			<wfw:commentRss>http://tokutek.com/2010/06/making-updates-fast-by-avoiding-disk-seeks/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Disk seeks are evil, so let’s avoid them, pt. 4</title>
		<link>http://tokutek.com/2010/06/disk-seeks-are-evil-so-let%e2%80%99s-avoid-them-pt-4/</link>
		<comments>http://tokutek.com/2010/06/disk-seeks-are-evil-so-let%e2%80%99s-avoid-them-pt-4/#comments</comments>
		<pubDate>Fri, 18 Jun 2010 21:12:49 +0000</pubDate>
		<dc:creator>zardosht</dc:creator>
				<category><![CDATA[TokuView]]></category>
		<category><![CDATA[B-tree]]></category>
		<category><![CDATA[disk seek]]></category>
		<category><![CDATA[Fractal Trees]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[TokuDB]]></category>

		<guid isPermaLink="false">http://tokutek.com/?p=1495</guid>
		<description><![CDATA[Continuing in the theme from previous posts, I&#8217;d like to examine another case where we can eliminate all disk seeks from a MySQL operation and therefore get two orders-of-magnitude speedup. The general outline of these posts is: B-trees do insertion disk seeks. While they&#8217;re at it, they piggyback some other work on the disk seeks. [...]]]></description>
			<content:encoded><![CDATA[<p>
Continuing in the theme from previous posts, I&#8217;d like to examine another case where we can eliminate all disk seeks from a MySQL operation and therefore get two orders-of-magnitude speedup. The general outline of these posts is:<br />
<UL><br />
<LI> B-trees do insertion disk seeks.  While they&#8217;re at it, they piggyback some other work on the disk seeks. This piggyback work requires disk seeks regardless.<br />
<LI> <a href="http://tokutek.com/2010/04/how-fractal-trees-work-talk-at-mysql-2010/">TokuDB&#8217;s Fractal Tree indexes</a> don&#8217;t do insertion disk seeks.  If we also get rid of the piggyback work, we end up with no disk seeks, and a two-orders-of-magnitude improvement.<br />
</UL></p>
<p>
So it&#8217;s all about finding out which piggyback work is important (important enough to pay a huge performance penalty for), and which isn&#8217;t.</p>
<p>
This blog post is about one of the most straightforward operations: ad-hoc insertions with a primary key.  Since the difference I&#8217;ve identified between B-tree indexes and Fractal tree indexes is the disk seeks on insertions, it may seem that there&#8217;s little to look at in this case.  But the semantics of insertion into a primary key is as follows:  if the primary key being inserted does not already exist in the table, insert the element, otherwise return an error notifying the user that a duplicate key exists.</p>
<p>
So, if we have a table:</p>
<pre>
create table foo (a int, b int, primary key (a));
</pre>
<p>and we run:</p>
<pre>
insert into foo values (1000,1);
</pre>
<p>
Before inserting (1000,1) into the table, the storage engine must first check to see if any row in the table where a=1000 exists. If so, return an error stating that there is a duplicate key. The requirement to return an error is VERY expensive: it requires a disk seek.</p>
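<p>
To make the requirement concrete, here is a tiny Python sketch in which a dict stands in for the table. The names strict_insert, DuplicateKeyError, and stats are hypothetical. The only point is that the error cannot be raised without first reading the table, and on a large on-disk table that read is the disk seek.</p>
<pre>
```python
class DuplicateKeyError(Exception):
    """Stands in for the duplicate-key error returned to the user."""


def strict_insert(table, key, row, stats):
    """Plain insert: the key must be looked up before anything is written."""
    stats["lookups"] += 1              # this read is the expensive part
    if key in table:
        raise DuplicateKeyError(key)
    table[key] = row


foo = {}
stats = {"lookups": 0}
strict_insert(foo, 1000, 1, stats)     # succeeds
try:
    strict_insert(foo, 1000, 2, stats)
    rejected = False
except DuplicateKeyError:
    rejected = True                    # second insert is refused
print(foo, rejected, stats)
```
</pre>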
<p>
The only way a disk based storage engine can verify if a key exists is to look up the element. This is the same reasoning used for unique secondary keys <a href="http://tokutek.com/2010/05/disk-seeks-are-evil-so-lets-avoid-them-pt-2/">here</a>. If the primary key is ad-hoc, the lookup requires disk seeks. And disk seeks are evil.</p>
<p>
Once again, B-trees already pay for the disk seek just for the insertion.  Because the leaf node must be in memory to do the insertion, verifying uniqueness essentially comes for free.  So B-trees suffer from disk seeks, whether you do duplicate checking or not.</p>
<p>
So how can users achieve fast performance with a Fractal tree based storage engine &#8212; as compared to B-trees, which will be slow regardless? Remove this requirement of returning an error if a duplicate key is found. It is way too expensive. MySQL has syntax to perform other actions instead of returning an error:<br />
<UL><br />
<LI>replace into<br />
<LI>insert ignore<br />
<LI>insert &#8230; on duplicate key update<br />
</UL></p>
<p>
The question is, do these commands do piggyback work that requires disk seeks? If they do, performance will be slow for ANY disk based storage engine. If not, then fractal tree data structures will achieve a speedup of up to two orders of magnitude. In next week&#8217;s post, I examine these cases.</p>
]]></content:encoded>
			<wfw:commentRss>http://tokutek.com/2010/06/disk-seeks-are-evil-so-let%e2%80%99s-avoid-them-pt-4/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Making Deletions Fast, by Avoiding Disk Seeks</title>
		<link>http://tokutek.com/2010/06/making-deletions-fast-by-avoiding-disk-seeks/</link>
		<comments>http://tokutek.com/2010/06/making-deletions-fast-by-avoiding-disk-seeks/#comments</comments>
		<pubDate>Tue, 08 Jun 2010 16:19:41 +0000</pubDate>
		<dc:creator>zardosht</dc:creator>
				<category><![CDATA[TokuView]]></category>
		<category><![CDATA[B-tree]]></category>
		<category><![CDATA[delete]]></category>
		<category><![CDATA[disk seek]]></category>
		<category><![CDATA[Fractal Trees]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[TokuDB]]></category>

		<guid isPermaLink="false">http://tokutek.com/?p=1491</guid>
		<description><![CDATA[In my last post, I discussed how fractal tree data structures can be up to two orders of magnitude faster on deletions over B-trees. I focused on the deletions where the row entry is known (the storage engine API handler::delete_row), but I did not fully analyze how MySQL delete statements can be fast. In this [...]]]></description>
			<content:encoded><![CDATA[<p>
In my <a href="http://tokutek.com/2010/06/disk-seeks-are-evil-so-let%e2%80%99s-avoid-them-pt-3-deletions/">last post</a>, I discussed how fractal tree data structures can be up to two orders of magnitude faster on deletions over B-trees. I focused on the deletions where the row entry is known (the storage engine API handler::delete_row), but I did not fully analyze how MySQL delete statements can be fast. In this post, I do. Here I show how one can use TokuDB, a storage engine that uses fractal tree data structures, to make MySQL deletions run fast.</p>
<p>
Let&#8217;s take a step back and analyze the work needed to be done to execute a MySQL delete statement. Suppose we have the table:</p>
<pre>
create table foo (
	id int auto_increment,
	a int,
	b int,
	primary key (id)
)
</pre>
<p>
Say we wish to perform the following operation that deletes 100,000 rows:</p>
<pre>
delete from foo where a=1;
</pre>
<p>
In MySQL, this statement executes in two steps. First, MySQL finds all the rows where a=1, via a query. The query is equivalent to &#8220;select * from foo where a=1;&#8221; Then for each row found, MySQL deletes the row by calling handler::delete_row(row). So the time to execute deletions is equivalent to the time to find the rows (T_query) plus the time to delete the found rows (T_change). Or, stated another way:</p>
<pre>
T_delete = T_query + T_change
</pre>
<p>
The previous post shows why T_change is faster for fractal tree data structures than for B-trees. Let&#8217;s use that fact to see how we can make deletions fast under MySQL.</p>
<p>
Back to the original problem at hand:</p>
<pre>
delete from foo where a=1;
</pre>
<p>
For this statement, T_query is a table scan. Because no index on &#8216;a&#8217; exists, every element must be processed to find where a=1. For large tables, this can be expensive. The (minor) advantage here is T_change, the cost of removing the 100,000 rows, requires no disk seeks, because rows where a=1 stay in memory as MySQL deletes them. The problem remains, we process too much data in T_query to delete some rows.</p>
<p>
So how can we speed up a query? Indexing! Or specifically, add an index on &#8216;a&#8217;. The schema is now:</p>
<pre>
create table foo (
	id int auto_increment,
	a int,
	b int,
	primary key (id),
	key (a)
)
</pre>
<p>
Now what is the cost of &#8220;delete from foo where a=1;&#8221;? Well, for TokuDB, InnoDB, and MyISAM, T_query requires roughly 100,000 disk seeks, because point queries are needed to retrieve the entire row. The key (a) is not a covering index. The (minor) advantage here (again) is that T_change has no additional disk seeks, because the query does the disk seeks necessary to bring the rows where a=1 into memory, both in the primary index (or .MYD file for MyISAM) and secondary index.</p>
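<p>
To see what &#8220;not a covering index&#8221; means in practice, it can help to compare two queries with EXPLAIN. This is a sketch under the schema above; output details depend on the storage engine and MySQL version:</p>
<pre>
-- needs column b, which key (a) does not contain, so each matching
-- row costs a point query into the primary index (or .MYD file)
explain select * from foo where a=1;

-- needs only column a, so key (a) alone can answer it;
-- the "Extra" column should show "Using index"
explain select a from foo where a=1;
</pre>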
<p>
The problem remains that we are still doing about 100,000 disk seeks!</p>
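<p>
For a rough sense of scale, assume a disk that manages about 200 seeks per second. That figure is an illustrative assumption, not a measurement:</p>
<pre>
-- back-of-the-envelope estimate; 200 seeks per second is an assumed figure
select 100000 / 200 as seconds_spent_seeking;
-- 100,000 seeks at 200 per second is 500 seconds, over 8 minutes
</pre>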
<p>
So how can we further speed up T_query? Remember, we want to make the query &#8220;select * from foo where a=1;&#8221; faster. <a href="http://tokutek.com/2009/05/introducing_multiple_clustering_indexes/">Clustering indexes</a>! The schema looks like:</p>
<pre>
create table foo (
	id int auto_increment,
	a int,
	b int,
	primary key (id),
	clustering key (a)
)
</pre>
<p>
Suppose MyISAM and InnoDB supported clustering indexes (they don&#8217;t). What is their cost for performing the deletion? Well, T_query becomes much faster, because it would be a range query as opposed to 100,000 point queries. But T_change is still expensive. As explained in my last post, B-trees require disk seeks for deletion when the row is not in memory. The rows that need to be deleted in the primary key are not in memory and require disk seeks. So, for MyISAM and InnoDB, we are still stuck with at least 100,000 disk seeks. For this reason, some erroneously think &#8220;clustering keys do not help here&#8221;.</p>
<p>
Now what about TokuDB, a storage engine that uses fractal tree data structures? T_query is also fast, because a range query is performed. But T_change is ALSO fast, because the deletions in the primary index do NOT always require disk seeks. This is where the advantage of fractal tree data structures over B-trees comes into play, allowing clustering keys to speed up a deletion procedure.</p>
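<p>
If foo already exists with a plain key (a), the clustering key can presumably be added in place rather than by recreating the table. The statements below are a sketch that assumes TokuDB&#8217;s clustering-index extension to ALTER TABLE; the index name clstr_a is made up for illustration, and the syntax should be checked against your TokuDB version:</p>
<pre>
-- assumption: TokuDB accepts "clustering" in front of an index definition
alter table foo add clustering index clstr_a (a);

-- the delete itself is unchanged; only the access path differs
delete from foo where a=1;
</pre>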
<p>
So, the moral of the story is this. Deletions are a combination of queries and value changes. To make deletions fast, both the query and the value changes need to be fast. With B-tree based storage engines, users may be hesitant to add indexes to speed up the query, due to the added cost of value changes in the index. With TokuDB, this tradeoff does not exist, because TokuDB uses fractal tree data structures. Using a clustering key to drive down the cost of T_query, and using fractal tree indexes to keep the cost of T_change down, leads to very fast deletes.</p>
]]></content:encoded>
			<wfw:commentRss>http://tokutek.com/2010/06/making-deletions-fast-by-avoiding-disk-seeks/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Disk seeks are evil, so let’s avoid them, pt. 3 (Deletions)</title>
		<link>http://tokutek.com/2010/06/disk-seeks-are-evil-so-let%e2%80%99s-avoid-them-pt-3-deletions/</link>
		<comments>http://tokutek.com/2010/06/disk-seeks-are-evil-so-let%e2%80%99s-avoid-them-pt-3-deletions/#comments</comments>
		<pubDate>Wed, 02 Jun 2010 17:40:26 +0000</pubDate>
		<dc:creator>zardosht</dc:creator>
				<category><![CDATA[TokuView]]></category>
		<category><![CDATA[B-tree]]></category>
		<category><![CDATA[disk seek]]></category>
		<category><![CDATA[Fractal Trees]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[TokuDB]]></category>

		<guid isPermaLink="false">http://tokutek.com/?p=1488</guid>
		<description><![CDATA[As mentioned in parts 1 and 2, having many disk seeks is bad (they slow down performance). Fractal tree data structures minimize disk seeks on ad-hoc insertions, whereas B-trees practically guarantee that disk seeks are performed on ad-hoc insertions. As a result, fractal tree data structures can insert data up to two orders of magnitude [...]]]></description>
			<content:encoded><![CDATA[<p>
As mentioned in parts <a href="http://tokutek.com/2010/05/disk-seeks-are-evil-so-lets-avoid-them-part-1/">1</a> and <a href="http://tokutek.com/2010/05/disk-seeks-are-evil-so-lets-avoid-them-pt-2/">2</a>, having many disk seeks is bad (they slow down performance). Fractal tree data structures minimize disk seeks on ad-hoc insertions, whereas B-trees practically guarantee that disk seeks are performed on ad-hoc insertions. As a result, fractal tree data structures can insert data up to two orders of magnitude faster than B-trees can.</p>
<p>
In this post, let&#8217;s examine deletions, and get an intuitive understanding of why fractal tree data structures are likewise up to two orders of magnitude faster than B-trees on deletions. In MySQL 5.1, this advantage is really eye-popping for TokuDB v. InnoDB, because InnoDB does not use its insert buffer for deletions. I understand there is a delete buffer in 5.5, which I haven&#8217;t experimented with yet.</p>
<p>
B-trees exhibit the same weakness on deletions as they do on insertions: they need to have the appropriate leaf node in memory. For large tables, bringing the leaf node into memory often requires a disk seek. Fractal tree data structures do not have this requirement.</p>
<p>
Before going on, a clarification. In MySQL, delete statements have two steps: queries and value changes. For instance, the statement:</p>
<pre>
delete from foo where a=1;
</pre>
<p>must first query all rows where a=1 (the first step), and then proceed to remove the rows that are found (the second step). In this post, we focus on the second step. For storage engine developers, this is the function handler::delete_row. In a future post, I will analyze the first step, tie it together with this post, and show how deletions can be fast in MySQL with TokuDB. </p>
<p>
Back to deletions. Let&#8217;s analyze value changes. We know the contents of the row being deleted. So how can fractal tree data structures avoid an unnecessary disk seek? The answer: deletion messages (sometimes called tombstone deletes).</p>
<p>
Suppose we have a fractal tree data structure with the following elements inserted: (1), (3),&#8230;(999). Up until now, we have shown the fractal tree like this:</p>
<pre>
-

- -

- - - -

...

1 3 5 7 9 ... 999
</pre>
<p>
In reality, the elements stored are not just keys, but rather (message, key) pairs. The message may be one of two operations: insertion or deletion. We represent an insertion with the message &#8216;i&#8217;. So, the fractal tree looks more like this:</p>
<pre>
-

- -

- - - -

...

(i,1) (i,3) (i,5)... (i,999)
</pre>
<p>
To delete an element, for example (5), we insert a deletion message into the tree, marking it with a &#8216;d&#8217;. So, after deleting (5), the fractal tree data structure looks like this:</p>
<pre>
(d,5)

- -

- - - -

...

(i,1) (i,3) (i,5)... (i,999)
</pre>
<p>
With this scheme, deletions are as fast as insertions, which is to say up to two orders of magnitude faster than insertions or deletions in a B-tree. </p>
<p>
On queries, a message in a higher node overrides messages in lower nodes. So upon querying (5), a cursor notices that (d,5) is located higher than (i,5), and therefore the key (5) does not exist in the fractal tree data structure. On merges, the deletion message and insertion message cancel each other out, and space is reclaimed.</p>
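<p>
The override rule can be mimicked with an ordinary table, purely as a toy model. The table msgs and its columns are hypothetical, and this is not how TokuDB stores messages; depth 0 stands for the top node and larger depths for lower nodes:</p>
<pre>
-- hypothetical table: one row per (message, key) pair, tagged with node depth
create table msgs (depth int, op char(1), k int);
insert into msgs values (0,'d',5), (3,'i',1), (3,'i',3), (3,'i',5);

-- a key exists only if its highest message (smallest depth) is an insertion
select m.k
from msgs m
where m.op = 'i'
  and m.depth = (select min(depth) from msgs where k = m.k);
-- returns keys 1 and 3; key 5 is hidden by the (d,5) message above it
</pre>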
<p>
So, by using deletion messages and treating deletions like insertions, fractal tree data structures can achieve the same performance boost (two orders of magnitude) over B-trees.</p>
]]></content:encoded>
			<wfw:commentRss>http://tokutek.com/2010/06/disk-seeks-are-evil-so-let%e2%80%99s-avoid-them-pt-3-deletions/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Disk seeks are evil, so let&#8217;s avoid them, pt. 2</title>
		<link>http://tokutek.com/2010/05/disk-seeks-are-evil-so-lets-avoid-them-pt-2/</link>
		<comments>http://tokutek.com/2010/05/disk-seeks-are-evil-so-lets-avoid-them-pt-2/#comments</comments>
		<pubDate>Tue, 25 May 2010 20:23:26 +0000</pubDate>
		<dc:creator>zardosht</dc:creator>
				<category><![CDATA[TokuView]]></category>
		<category><![CDATA[Fractal Trees]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[TokuDB]]></category>

		<guid isPermaLink="false">http://tokutek.com/?p=1484</guid>
		<description><![CDATA[In part 1, I discussed why having many disk seeks is bad (they slow down performance), and how fractal tree data structures minimize disk seeks on ad-hoc insertions, whereas B-trees practically guarantee that disk seeks are performed on ad-hoc insertions. As a result, fractal tree data structures can insert data up to two orders of [...]]]></description>
			<content:encoded><![CDATA[<p>
In <a href="http://tokutek.com/2010/05/disk-seeks-are-evil-so-lets-avoid-them-part-1/">part 1</a>, I discussed why having many disk seeks is bad (they slow down performance), and how fractal tree data structures minimize disk seeks on ad-hoc insertions, whereas B-trees practically guarantee that disk seeks are performed on ad-hoc insertions. As a result, fractal tree data structures can insert data up to two orders of magnitude faster than B-trees can.</p>
<p>
Now that insertion disk seeks are out of the way (and I don&#8217;t want to shortchange the importance of getting rid of these seeks!), let&#8217;s look at other places where databases perform seeks, and see if we can get rid of them.  Over my next couple of posts, I will look at several use cases and analyze whether disk seeks are required. If disk seeks are required, then performance will suffer on large amounts of data, for TokuDB and any other disk-based storage engine.</p>
<p>
If disk seeks are not required, things get interesting. Removing these unnecessary disk seeks can speed up a database, as long as all disk seeks in a command execution are removed. Since TokuDB eliminates seeks on insertions, we should avoid disk seeks altogether. Since B-trees induce disk seeks on ad-hoc insertions, cleaning up the remaining disk seeks has limited utility. </p>
<p>
For today, let&#8217;s look at a simple use case that may be obvious: insertions on secondary indexes v. unique secondary indexes. Take the following table:</p>
<pre>
Create Table: CREATE TABLE `t` (
  `a` int(11) NOT NULL AUTO_INCREMENT,
  `b` int(11) DEFAULT NULL,
  `c` int(11) DEFAULT NULL,
  PRIMARY KEY (`a`),
  UNIQUE KEY `b_unique` (`b`),
  KEY `b_norm` (`b`)
)
</pre>
<p>
Suppose most of the table resides on disk.</p>
<p>
Now I run:</p>
<pre>
insert into t (b) values (1000);
</pre>
<p>
Are there any mandatory disk seeks involved?</p>
<p>
When inserting into the fractal tree for the primary dictionary, we use an auto increment value, so insertions are sequential. Insertions run really fast, because a disk seek is usually not mandatory (disk seeks eventually happen when blocks get full, but they do not occur on EACH insertion).</p>
<p>
Let&#8217;s look at inserting into b_unique and b_norm visually. Take the following identical fractal trees for &#8216;b_unique&#8217; and &#8216;b_norm&#8217;:</p>
<pre>
-

- -

- - - -

...

1, 3, 5, 7, ..., 999, 1001, 1003, ...
</pre>
<p>
To insert into &#8216;b_norm&#8217;, the fractal tree can simply insert (1000) in the top node. To insert into &#8216;b_unique&#8217;, the fractal tree must first search for 1000, verify that it is not between 999 and 1001, and then insert into the top node. This lookup causes a disk seek and slows down the insertion.</p>
<p>
Note that because B-trees require disk seeks to do insertions anyway, some operations come with no additional cost in B-trees. Uniqueness checks are one such example. As a result, some B-tree users may not think twice about making secondary keys unique (after all, unique keys can help the query optimizer). Fractal tree data structures, on the other hand, incur a huge cost for a uniqueness check.</p>
<p>
So, the moral of this story is if you care about insertion performance and avoiding disk seeks, try to avoid unique secondary keys, and go with normal secondary keys.  Otherwise, fractal tree data structures will be just as slow as B-trees, and not two orders of magnitude faster.</p>
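<p>
For the table t above, acting on this advice might look like the following. It is a sketch that is only safe when the application does not rely on the database to enforce uniqueness of b:</p>
<pre>
-- b_norm already indexes b, so queries on b still have an index to use
alter table t drop index b_unique;

-- this insert no longer needs a uniqueness lookup on b
insert into t (b) values (1000);
</pre>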
]]></content:encoded>
			<wfw:commentRss>http://tokutek.com/2010/05/disk-seeks-are-evil-so-lets-avoid-them-pt-2/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Disk Seeks are Evil, so Let&#8217;s Avoid Them, Part 1</title>
		<link>http://tokutek.com/2010/05/disk-seeks-are-evil-so-lets-avoid-them-part-1/</link>
		<comments>http://tokutek.com/2010/05/disk-seeks-are-evil-so-lets-avoid-them-part-1/#comments</comments>
		<pubDate>Thu, 20 May 2010 19:42:21 +0000</pubDate>
		<dc:creator>zardosht</dc:creator>
				<category><![CDATA[TokuView]]></category>

		<guid isPermaLink="false">http://tokutek.com/?p=1479</guid>
		<description><![CDATA[Disk seeks are expensive. Typically, a disk can perform no more than a few hundred seeks per second. So, any database operation that induces a disk seek is going to be slow, perhaps unacceptably slow. Adding disks can sometimes help performance, but that approach is expensive, adds complexity, and anyhow minimizing the disk seeks helps [...]]]></description>
			<content:encoded><![CDATA[<p>Disk seeks are expensive. Typically, a disk can perform no more than a few hundred seeks per second.  So, any database operation that induces a disk seek is going to be slow, perhaps unacceptably slow.  Adding disks can sometimes help performance, but that approach is expensive, adds complexity, and anyhow minimizing the disk seeks helps more.</p>
<p>
TokuDB <a href=http://tokutek.com/2010/04/how-fractal-trees work-talk-at-mysql-2010/>fractal tree data structures</a> deliver insertion performance benefits over traditional B-trees by performing fewer disk seeks on random insertions (in effect, turning random I/O into sequential I/O). This is why TokuDB typically outperforms InnoDB on insertion workloads: TokuDB&#8217;s random insertions into secondary indexes are much faster than InnoDB&#8217;s, up to two orders of magnitude faster.</p>
<p>
So let&#8217;s consider the first place where TokuDB avoids a disk seek that a B-tree cannot. On an insertion, a B-tree needs to have the appropriate leaf node in memory. For large tables, this requires a disk seek. A detailed view of what Fractal Tree indexes do is available in this <a href=http://tokutek.com/2010/04/fractal-tree-video-from-opensql-camp-portland-in-2009/>video</a>. Here is an intuitive way to understand why fractal tree indexes are fast at insertions (a sample fractal tree is in the figure below). Seven out of eight (87.5%) insertions will be in the top three nodes, which will always be in memory. So, at least 87.5% of insertions will be strictly in-memory, doing no seeks, and in practice this percentage is even higher. Now, when the fractal tree does write to disk, it will write nodes of greater depth. These nodes are written at disk bandwidth, and so the performance is not limited by disk seek time. This is why fractal tree index insertions are so fast. </p>
<div id="attachment_1481" class="wp-caption alignnone" style="width: 310px"><a href="http://tokutek.com/wp-content/uploads/2010/05/simple-fractal-tree.png"><img src="http://tokutek.com/wp-content/uploads/2010/05/simple-fractal-tree-300x220.png" alt="" title="Simple Fractal Tree" width="300" height="220" class="size-medium wp-image-1481" /></a><p class="wp-caption-text">Simple Fractal Tree</p></div>
<p>
Insertions, therefore, are an operation that does not inherently require a disk seek. Some data structures perform a disk seek anyway (B-trees) and others don&#8217;t (Fractal Tree indexes). Other operations require a disk seek no matter what data structure you use, e.g. point queries, uniqueness checks, etc. In a previous post, I talked about how to avoid disk seeks by using a <a href=http://tokutek.com/2009/05/introducing_multiple_clustering_indexes/>clustering index</a> to replace point queries from secondary indexes into the primary table. </p>
<p>
So now that TokuDB gets rid of insertion disk seeks, it&#8217;s time to get rid of as many as possible.  In the coming weeks, I&#8217;ll be posting a series of blogs about other cases where disk seeks can be avoided.</p>
]]></content:encoded>
			<wfw:commentRss>http://tokutek.com/2010/05/disk-seeks-are-evil-so-lets-avoid-them-part-1/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>
