In this posting I’ll describe TokuDB’s multiple clustering index feature. (This posting is by Zardosht.)
In general (not just for TokuDB) a clustered index or a clustering index is an index that stores the all of the data for the rows. Quoting the MySQL 5.1 reference manual:
Accessing a row through the clustered index is fast because the row data is on the same page where the index search leads. If a table is large, the clustered index architecture often saves a disk I/O operation when compared to storage organizations that store row data using a different page from the index record.
Most storage engines allow at most one clustered index for each table. For example, MyISAM does not support clustered indexes at all, whereas InnoDB allows only the primary key be a clustered index.
In TokuDB, we have added functionality and grammar syntax to define multiple clustered indexes. As a result, users can get the performance advantages of clustered indexes for multiple indexes.
How to define multiple clustered indexes
To define a clustered index in TokuDB, simply add the key word “clustering” in front of any index that is to be created. Here are some examples:
create table foo (a int, b int, clustering key (a)) engine=tokudb; alter table foo add clustering index clstr_key(a); create clustering index clstr_key on foo(a);
How clustering keys work
Like InnoDB, TokuDB clusters data with its primary key (if no primary key is defined, then TokuDB auto-generates a hidden one). The primary index maintains a copy of the entire row. Each secondary index stores, for each row, the row’s secondary key and primary key.
A clustering index maintains a copy of the entire row, not just the primary key. As a result, when querying on a clustering key lookups in the primary index are always avoided. A clustering index is a covering index for any query. Another way to think of a clustering index is that it is a materialization of the table sorted in another order.
An example using clustering keys
Suppose we had 40M rows in an iiBench table with this schema:
Create Table: CREATE TABLE `purchases_index` ( `transactionid` bigint(20) NOT NULL AUTO_INCREMENT, `dateandtime` datetime DEFAULT NULL, `cashregisterid` int(11) NOT NULL, `customerid` int(11) NOT NULL, `productid` int(11) NOT NULL, `price` float NOT NULL ) ENGINE=TOKUDB AUTO_INCREMENT=40000001 DEFAULT CHARSET=latin1
Now suppose we wanted to perform the following queries in sequence:
- Query 1:
select sum(price) from purchases_index where customerid > 51 and customerid < 55;
which scans more than 1200 rows.
- Query 2:
select * from purchases_index where customerid = 100;
which returns 406 rows.
- Query 3:
select productid, dateandtime from purchases_index where customerid = 200 or customerid=300;
which returns 820 rows.
Ideally, to perform the first query, we would want a covering index of (customerid, price) on the table. To perform the third query, we would want a covering index of (customerid, productid, dateandtime). For the second query, a covering index would need to include all the fields.
With other storage engines, the common solution is to define a key on the field (customerid). Here are the results with TokuDB when using a key on customerid:
Query 1: 9.47 seconds Query 2: 3.11 seconds Query 3: 6.16 seconds
Here are the results with MyISAM when using a key on customerid:
Query 1: 2.66 seconds Query 2: 0.91 seconds Query 3: 1.78 seconds
Here are the results with TokuDB when using a clustering key on customerid:
Query 1: 0.14 seconds Query 2: 0.06 seconds Query 3: 0.20 seconds
As we can see, clustering indexes provide the same order of magnitude of performance as covering indexes, but for a broader range of queries.
A limitation of clustering keys
A clustering key cannot be defined to be unique.

Hi! I can see this being very useful for some workloads, but I think it is very important to mention the main reason why most engines don’t support multiple clustered indexes: INSERT and UPDATE performance will degrade and disk usage will go up. If you maintain multiple copies of the data, it’s that many more I/O ops that need to happen when the data changes.
@Eric, thanks for the comment – excellent points! Because TokuDB provides very high indexed insertion rates and compression, TokuDB can really make clustering indexes work. In many cases, TokuDB can meet or exceed insertion rate and disk use requirements even with multiple clustering and/or non-clustering indexes, even on large tables.
How is this any different from a covering index in MyISAM (or InnoDB) obtained by just appending all remaining columns after the indexed keys?
Especially with the restriction on no unique indexes, that seems to be exactly the same?
From the description it sounds like this is just a syntax shortcut for creating covering indexes, which are supported in MyISAM and InnoDB (but not Falcon). But a covering index is not a clustered index, since as Eric writes it does not improve insert/update speed in any scenarios.
@Kristian, good questions. I will address covering indexes vs. clustering indexes in a blog post later today.
Can you describe how much extra storage is needed for each add-on clustered index/table ?
@Venu, with TokuDB, we typically measure compression ratios of 3x to 12x relative to the amount of raw data in a table. Since a clustering index is another copy of the table in a different sort order, it will probably consume somewhere between 1/3 and 1/12 of the table’s raw data size.
Ah, now I think I see the difference between this and normal covering indexes. The difference here is that the extra fields are only stored in the leaf nodes (assuming Btree index). The keys in interior nodes need not store the extra fields, unlike in traditional covering indexes where the extra fields needlessly increase the key size proper.
That might be useful, especially when the row size is much bigger than the key size (not at all uncommon).
This is very nice. I missed this post when you first wrote it. It can be used to provide much better bounds on worst-case query performance (a few disk reads versus thousands) and do a much better job at making queries ‘online’.
[...] Summary: A query like TPC-H Query 17 can be sped up by large factors by using straight_joins and clustering indexes. (This entry posted by [...]
[...] blogging about TokuDB’s multiple clustering indexes feature, Baron Schwartz suggested we contribute the patch to allow other storage engine to [...]
[...] recently posted a blog entry on clustering indexes, which are good for speeding up queries. Eric Day brought up the concern that clustering indexes [...]
[...] I (Zardosht) posted an entry introducing clustering indexes. Here, I elaborate on three differences between a clustering index and a covering [...]