Debunking Yet Another Myth: Column-Stores As A Storage-Layer Only Optimization

| | Comments (0)
Editors note: This post was co-authored by Daniel Abadi and Samuel Madden

Our goal in this myth-debunking series has been to shed some light on common misconceptions about column-store databases that we have come across in our conversations with people in the database community. For example, many people think a column in a column store and an index in a row store are similar data structures. We debunked this two posts ago. Other people think that you can simulate a column-store by vertically partitioning a row-store. We debunked this previously, though the difference between these two is more subtle than the difference between a column and an index. This post we debunk yet another myth, which is perhaps even more subtle than the vertical partitioning myth:

Myth: The performance advantages of a column store can gained by replacing a row-oriented storage layer of a DBMS with a column-oriented storage layer without also rewriting the row-oriented query execution system that plans and processes queries on top of the storage system.


Experiments that debunk this myth

At first blush, it may seem that all of the benefits of column orientation can be had by simply replacing the storage layer. For example, one commonly cited benefit of column stores is: Because they store each attribute of a table separately, they are better for read queries since they can avoid reading in unnecessary attributes. This benefit for read queries can clearly be obtained solely by replacing the storage layer -- the query executor simply requests the columns it needs and the storage layer reads them off disk and merges them into rows so that the query executer can process them in the standard fashion. The other commonly cited advantage of column stores is that by storing data from the same attribute domain consecutively, they see higher data value locality, and thus an improved compression ratio. Again, this improvement is at the storage layer, and the executor need not be aware of compression at all if data is decompressed as it is read off of disk.

In our recent SIGMOD paper, "Column-Stores vs. Row-Stores: How Different Are They Really?", we ran some experiments to discover how closely related the performance of column stores is to these storage layer optimizations. We took the C-Store column-store database (the academic precursor to Vertica) and ran it on the same Star Schema Benchmark discussed in our previous two posts. We then ran the same benchmark on the same C-Store software after stripping out all column-specific operations in the query executer (which sits on top of the storage layer). More specifically, the column-oriented storage layer reads in the subset of data needed to answer a particular query (i.e., only those columns accessed by the query), merges these columns into tuples, and then runs a traditional query executer on these tuples. So in the first case we have a column-oriented executer sitting on top of a column-oriented storage layer, and in the second case we have a traditional (row-oriented) executer sitting on top of a column-oriented storage layer. Our benchmark results of these two systems are presented in the figure below:

Executor performance in row and column stores
Surprisingly, there is an order of magnitude difference between the two systems despite the identical column-oriented storage layer from the C-Store codebase. Clearly, the query executer plays a central role in the column-store performance story.


Examining strategies that lead to performance improvements

In our paper we describe four key column-oriented execution strategies that are responsible for the bulk of the performance difference. We successively remove each of these strategies in order to break down the performance contribution of each one. In the following, we give a brief overview of these four strategies:

  • Operating directly on compressed data
  • Late materialization
  • Block iteration
  • The invisible join


Operating Directly On Compressed Data

Compression is usually considered a tradeoff: One trades reduced I/O (reading in less data) for increased CPU cycles (to perform the decompression). If I/O is the main system bottleneck, then this is a good tradeoff to make. However, if a query executor can operate directly on compressed data, decompression can be avoided completely and there is no longer a CPU tradeoff. Column stores are well suited for operating directly on compressed data since they can use schemes like run-length encoding, where a sequence of repeated values is replaced by a count and the value (e.g., 1, 1, 1, 2, 2 → 1 × 3, 2 × 2) that are very easy to operate on directly. Repeated values are far more likely to occur within a column of data from the same attribute than across different attributes within a row (tuple). Run-length encoding is also particularly useful for sorted or secondarily sorted columns. (In fact, operating directly on run-length encoded data does more to improve performance than simply avoiding the need to perform decompression; it also allows the query executor to perform the same operation on multiple column values at once, further reducing CPU costs.)


Late Materialization

A row-oriented query executer sitting on top of a column-oriented storage layer results in query plans where columns that are needed for a particular query are read off disk (or from memory) and immediately merged together into tuples so that standard, row-oriented database operations (e.g., selections and joins) can be performed over these projected tuples. This approach to query execution forces an "early materialization" of rows from component columns, since tuples need to be materialized at the beginning of a query plan before the row-oriented executer can process its input data. In contrast, a sophisticated column-oriented query executer will maintain the input data in columns and run query operators over these columns in isolation as much as possible. For example, a selection predicate can be applied to just the column involved in the predicate (of course, the query executer needs to keep track of the tuple ids of the tuples that match the predicate). Such an approach is typically called "late materialization" since tuples are not materialized until late in a query plan (sometimes not until the end).

The advantages of late materialization are four-fold. First, selection and aggregation operators tend to render the construction of some tuples unnecessary (if the executor waits long enough before constructing a tuple, it might be able to avoid constructing it altogether). Second, if data is compressed using a column-oriented compression method, it can be operated on directly when using late materialization, but must be decompressed before it can be combined with values from other columns. Hence, not using late materialization removes the advantages of operating directly on compressed data. Third, cache performance is improved when operating directly on column data, since a given cache line is not polluted with surrounding irrelevant attributes for a given operation (as shown in the PAX research paper). Fourth, the block iteration optimization described in the next subsection has a higher impact on performance for fixed-length attributes. In a row store, if any attribute in a tuple is variable-width, then the entire tuple is variable width. In a late materialized column store, fixed-width columns can be operated on separately.


Block Iteration

In order to process a series of tuples, most row stores first iterate through each tuple, and then need to extract the needed attributes from these tuples through a tuple representation interface. In many cases, such as in MySQL, this leads to tuple-at-a-time processing, where there are 1-2 function calls to extract needed data from a tuple for each operation (which can result in a relatively high overhead if the operation is simple, such as a predicate application).

Some row stores (such as IBM DB2) reduce the overhead of tuple-at-a-time processing by making many tuples available at once to query operators and evaluating this entire block of tuples in single operator call. However, most row-stores do not perform this optimization, since they tend to be optimized for transactional workloads where tuple-at-a-time processing tends not to be a key bottleneck. On the other hand, most column-stores operate on blocks of values at once, and since attributes are already stored separately, no attribute extraction is needed. Furthermore, if the column is fixed width, these values can be iterated through directly as an array. Operating on data as an array not only minimizes per-tuple overhead, but it also exploits potential for parallelism on modern CPUs, as loop-pipelining techniques can be used.


Invisible Join

C-Store uses a special column-oriented join technique (designed especially for star schema joins) that rewrites a join into a special type of predicate evaluation on a fact table column that can in some cases be evaluated without repeatedly consulting the dimension table (see our paper for more details).


Summary: Understanding the performance differences

Of the four techniques discussed so far, we found that operating directly on compressed data and late materialization accounted for the majority of the performance difference between the column-oriented executors. In particular, operating directly on compressed data resulted in about a factor of two performance improvement and late materialization resulted in about a factor of three.

The bottom line is that column stores are more than just a storage layer optimization. They also contain many optimizations at the query executer level which also contribute greatly to query performance.

Two more side notes on this topic:

  1. Observation clarifies performance claims. Without the observations in this blog post, the column store claims that they are two orders of magnitude faster than row stores would be confusing. You would need to have tables that are hundreds of columns wide, with queries accessing 1% of these columns to get two orders of magnitude if column-stores only yielded the storage layer benefit of reading in fewer columns. Now, it should be clear that column-stores rely on the storage layer to get one order of magnitude performance improvement and the query executer to get the other order of magnitude.

  2. Not all column stores perform identically. The reason for this is that even though the storage layer is similar across these systems, they often have substantially different query executers. It is worth understanding the column-oriented query execution features a system offers before selecting a particular column store to use. Some systems that claim to be "column stores" got their start by taking an open source row store database and upgrading the storage layer to be column-oriented. Such systems are doomed to perform worse than true column-stores, that provide both a column-oriented storage system and a column-oriented query executor.

Editors note: For 150 more pages on column-oriented query execution, see Daniel Abadi's PhD thesis

 

Categories

Leave a comment

About this Post

This page contains a single post by Daniel Abadi published on December 10, 2008 3:04 PM.

Field Fodder -- Compression in Real World Datasets was the previous entry in this blog.

The Innovator's Dilemma for Analytic Database Systems is the next entry in this blog.

Find recent content on the main index or look in the archives to find all content.