A Look at Dremel

I recently presented the paper Dremel: Interactive Analysis of Web-Scale Datasets by Melnik et al. (2010) in my team’s paper reading group at work. The paper presents one of Google’s foundational data analysis systems and is an interesting read for anyone interested in the intersection of data analytics, database systems and distributed systems. In particular, it is enlightening to understand the design decisions required to make this system operate at the scale of petabyte-sized datasets – inevitably many datasets at Google’s scale.

The purpose of this blog post is to provide a walkthrough of the paper in alternative wording, which may be useful to you if you have read the paper and are seeking clarification on certain points; or simply prefer the language of a blog over that of an academic paper. That said, the paper is itself a relatively easy read and I encourage you to review the original paper before or after reading this post.

TL;DR

For the busy executives among you, I like to provide a quick TL;DR summary:

Dremel is Google’s internal data analysis and exploration system. It is designed for interactive (i.e. fast) analysis of read-only nested data. Its design builds on ideas from parallel database management systems as well as web search.

The key contributions of the paper are:

  • A column-striped storage format for nested data,
  • A SQL-like language adapted to nested data,
  • Use of execution trees from web search systems applied to database queries.

Overall, the paper shows that it is possible to build a system that enables analysis of trillions of read-only records spanning petabytes of storage at interactive speeds usually below 10 seconds. If you want to learn more about how, read on.

Motivation

Let’s begin by understanding the motivation for building Dremel. The observation the Dremel paper makes in its introduction is that data found in typical “web computing” environments is often non-relational. Non-relational here means that it is not natural to fit the data exchanged by such web computing systems into a relational model, i.e. a flat collection of tuples as you would store them in your run-of-the-mill MySQL database. Typical examples of such non-relational data include:

  • Messages exchanged by distributed systems (think: Protocol Buffer messages or Thrift structs),
  • Structured documents like websites,
  • Data structures used in everyday programming languages.

Notice especially that each of these categories of data usually involve nested data. Further, the paper makes the observation that if you were to store such data using a relational model, the process of de-structuring, or “unnesting”, and re-structuring this data, as well as splitting it into well-normalized database tables, is prohibitive at web scale.

So, what if we could …

  • Store such data in its original, nested structure,
  • Access data as-is without re-structuring it,
  • Describe queries on nested data without complicated joins

… and do all of this blazingly fast on petabytes of data for interactive analysis. Well, lo and behold, this is what Dremel was designed for!

Trivia

Before we dive into the nitty gritty details of how Dremel works, it may be entertaining to know some trivia about it first.

Dremel has been in production at Google since 2006. A selection of use cases for Dremel at Google include analysis of:

  • Crawled web documents,
  • Spam,
  • Build system results and
  • Crash reports.

Further, there are two ways to use Dremel outside of Google. The first is Google’s BigQuery service, which Google provides as part of its cloud offering. The second is Apache Drill, effectively an open source re-implementation of Dremel.

But wait, what about the name? Turns out, Dremel is a pretty clever name: “Dremel is a brand of power tools that primarily rely on their speed as opposed to torque”.

Technical Details

The next few paragraphs digest the technical details of the Dremel paper, roughly in their original order.

Data Model

Arguably the most important pillar of the Dremel system is that its data model centers around nested data. What is nested data? In essence, and likely in practice, this just means data that can be described with the Protocol Buffers specification. Protocol Buffers is Google’s data serialization format, and comes with an interface description language (IDF) that allows defining nested data structures. Here is an example of a nested data structure, a.k.a. record, defined in the Protocol Buffer IDF:

message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward; }
  repeated group Name {
    repeated group Language {
      required string Code;
      optional string Country; }
    optional string Url; }}

If you’ve never seen a Protocol Buffers record (or message) before, you should notice:

  • It describes a nested data structure,
  • Fields or sub-records (groups) can be:
    • required (must be present),
    • optional (can be omitted),
    • repeated (occur zero or more times).

There is an important connection between optional and repeated fields in that both can be omitted. Omitted here means that it need not have a value when serializing a concrete object from some programming language into the final data format (which can then be stored or sent over the wire). This will become very relevant in the later discussion on lossless storage of such records.

A final piece of important nomenclature associated with nested data records is a field’s path. In the above record, each named entity is a field. The path to a field is the dot-separated list of fields one would traverse in order to reach that field. Examples of field paths are:

  • Document.Links.Forward,
  • Document.Name,
  • Document.DocId.

In the paper and this article, the top-most field name is often omitted, e.g. Name.Language.Country is synonymous with Document.Name.Language.Country.

Storage Model

Given a data model, any data management system must decide how to store data following this model. This is called the storage model. For this, Dremel borrows from the world of Online Analytical Processing (OLAP) relational databases, which for many years now have exploited the benefits of column-oriented storage. The idea of column-oriented storage in the world of relational databases is very simple: Instead of storing tuples “row after row” in memory, data is stored “column after column”. For this table:

A B C
0 1 2
3 4 5
6 7 8

storing the data in this table in a row-oriented fashion would mean storing them in memory row-by-row:

Data:        0 | 1 | 2 $ 3 | 4 | 5 $ 6 | 7 | 8
Memory Cell: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

where I use $ to denote row boundaries. In column-oriented layout, columns are stored next to each other in memory:

Data:        0 | 3 | 6 $ 1 | 4 | 7 $ 2 | 5 | 8
Memory Cell: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

where $ now indicates a column boundary. The latter layout is beneficial to queries that do not read entire rows at a time, but access only specific columns. It turns out that most data analysis workloads are like this. For example, the following “business intelligence” analysis

SELECT SUM(price) from Purchases WHERE date > '2019-01-01'

looks at only two columns in a table with potentially hundreds of columns. In column-oriented storage, the DB engine can scan through only the columns it needs, while in row-oriented storage it (naively) would need to access the entire row (all columns).

Things are not quite the same for Dremel since its data model revolves around nested data as opposed to flat, relational data. Nevertheless, Dremel is an analytical query engine and wants to access individual fields quickly. Therefore, one innovation Dremel makes is to employ a column-striped layout for its storage model too, resulting in greater efficiency and fast encoding and decoding of data. It is noteworthy to mention what column-oriented storage actually means for nested data. Effectively, it means storing the “leaf” fields side-by-side in memory just like the column values in the second example above. In the record presented in the previous section, examples of leaf fields would be Document.DocId or Document.Links.Backward. Note that group fields like Document.Name aren’t stored themselves, since they exclusively influence the logical structure of the data but do not store data themselves. Only a leaf field can actually have a value.

The figure below, taken from the paper, illustrates the difference between storing a record in record-oriented format, which you can think of as row-oriented, versus the column-oriented format that Dremel employs instead.

storage

In terms of how columns are physically stored, the paper explains that a distributed file system such as the Google File System (GFS) is a suitable choice.

The next section discusses the challenges in implementing column-striped storage for nested data.

Lossless Columnar Representation

As the previous paragraph explained, Dremel uses column-oriented storage, which means leaf fields are stored contiguously in memory. Interestingly, this means that the values are “lifted” out of their logical structure. Take the following record:

Name
    Language:
        Code: 'en'
    Language:
        Code: 'en-us'
Name
    Language:
        Code: 'en-gb'

In a column-oriented storage model, values of the Name.Language.Code field will be stored in memory like this:

'en'
'en-us'
'en-gb'

In the original record, each value had a logical location within the record structure. However, now values are stored in an entirely flat manner. Upon re-assembly, how do we know that the first two values belong to the first Name group and the third value to the second Name group?

This is where the Dremel paper introduces two additional pieces of information that is stored with each field value to achieve lossless columnar representation. The first piece of information is called the repetition level and the second is termed the definition level.

Repetition Level

The repetition level tells us at what repeated field in a field’s path the value has repeated. Recall that in our running record example, both Name and Name.Language are repeated groups. As such, for a field path like Name.Language.Code, the repetition value tells us whether a particular value belongs to a repetition of the inner group Name.Language or to the outer group Name. Consider our previous record:

Name:
    Language:
        Code: 'en'
    Language:
        Code: 'en-us'
Name:
    Language:
        Code: 'en-gb'

For the value en-us, it was the Language group that repeated “most recently”. For the en-gb value, it was the entire Name group that repeated most recently. As such, the repetition level of the latter value is 1, because in the field path Name.Language.Code it is the first field, Name, that was repeated. For the former example, en-us, the repetition value is 2, because it is the second field, Name.Language, that repeated last. For the the very first value in the record, en, the repetition level is 0 because it’s the Document itself that repeated most recently (it didn’t actually “repeat” because it’s the first record in this example, but alas, this is the base case).

Equipped with the power of assigning a repetition level to each value we store, we can go back and amend our in-memory layout from above to the following:

'en'    | 0
'en-us' | 2
'en-gb' | 1

Here I’ve written the repetition numbers in the right-most column.

Definition Level

Another number Dremel associates with each stored value is its definition level. This level indicates how may fields in a path that could be omitted, are actually present. In Name.Language.Code, both Name and Name.Language are repeated groups, which means they could be omitted. If Name is however present in a field’s path, the definition level is bumped to 1. If Name.Language is present too, the definition level is bumped to 2. This number becomes relevant for encoding of NULL values. In the following record:

Name:
    Language:
        Code: 'en-us'
        Country: 'USA'
    Language:
        Code: 'fr'
Name:
    Language:
        Code: 'en-gb'
        Country: 'USA'

The second Name.Language group has omitted the Name.Language.Country field. This information has to be stored in form of a NULL value in the Name.Language.Country column. What the definition level tells us, then, is how much of the record surrounding a field is omitted. In the above record, the Name and Name.Language group are present, but not the final Name.Language.Country field. The definition level of that NULL value would be 2, because it’s two ancestor fields are present in the record. In this record:

Name:
    Url: 'http://www.example.com'

the Language group is omitted as a whole, so the definition level of the NULL value we store for the invisible Name.Language.Country field is 1. Zooming out further:

Document:
    DocId: 1
Document:
    DocId: 2

this record doesn’t even have a Name group, so we’d store a NULL value for the Name.Language.Country column with definition value 0, because none of the optional or repeated fields in the path Name.Language.Country, which are Name, Name.Language and Name.Language.Country, are present.

My understanding of the purpose of this value is not to aid in lossless storage. This, as far as I can tell, is accompalished by the repetition level alone. I believe the purpose of this number is to be able to store NULL values without storing an actual value. That is, any definition level smaller than the length of the path indicates that the value is NULL. The exact value then tells us how much of the path is omitted. Many real-world datasets are very sparse (have lots of NULL values), so efficient encoding of NULL values would have high priority in a design for a system like this.

Storage of Repetition and Definition Level

The paper mentions a number of optimizations that are used to store definition and repetition levels efficiently. One such optimization is to omit explicit NULL values as the definition level already reveals whether a value is present in the final record or not. Further, definition levels are not stored for required fields, since by the nature of that qualification the field and thus its whole path must be defined. Also, if the definition level is zero, the repetition level must be zero too because the whole group (one level below the whole record, e.g. Document) is omitted. Levels in general stored in a bit-packed fashion.

Assembly of Records

As we’ll review in more detail later, Dremel emits records as its query output. To do so, it must assemble records from the data columns it reads from memory. What is particularly important and interesting about this task is that Dremel must be able to read partial records. That is, records that follow the structure of the record schema (the Protocol Buffer specification) while omitting certain fields or entire groups. The paper describes that at Google many datasets are sparse, often with thousands of fields in the record schema, of which usually only a small subset is queried.

Most of the examples we’ve looked at so far were partial records, for example in

Name:
    Language:
        Code: 'en-us'

the only field present is Name.Language.Code. Fields like Name.Language.Country and entire groups such as Links are omitted. Nevertheless, the record follows the original schema, with Language nested under Name and Code under Language.

To do the actual record assembly, Dremel creates a Finite State Machine (FSM). Here is an example figure from the paper:

assembly

This FSM constructs the record by logically jumping from column to column and scanning values from memory until exhausted. Edge values indicate repetition levels.

Query Language

Now that we know more about how Dremel encodes and assembles records, let’s move on to investigate how Dremel allows users to query these records. For this, Dremel provides a SQL-like query language. Overall, the look and feel of this language is very much like SQL, with the noteable exception that fields can be referenced by their full, nested path. Here is an example query:

SELECT DocId AS Id,
  COUNT(Name.Language.Code) WITHIN Name AS Cnt,
  Name.Url + ',' + Name.Language.Code AS Str
FROM t
WHERE REGEXP(Name.Url, '^http') AND DocId < 20;

In general, this query language takes a table – a collection of records following one particular schema – as input and produces zero or more (partial) records as output. Note again: instead of rows, it outputs records. Here is an example of a single record emitted as output:

DocId: 10
Name:
    Cnt: 2
    Language:
        Str: 'http://A,en-us'
        Str: 'http://A,en'
Name:
    Cnt: 0

There are three important points to mention about this query language, which can all be observed in the query and its result. These are how Dremel deals with selection, projection and aggregation. The following paragraphs dive into these three aspects.

Selection

The first point is about how Dremel does selection. Selection is accomplished via the usual WHERE statement. In the query above, it is the statement

WHERE REGEXP(Name.Url, '^http') AND DocId < 20;

For the nested record structures Dremel operates on, selection conceptually has the semantics of “pruning subtrees”, as the paper puts it. The idea is that given a record substructure, or subtree, the WHERE statement has the effect of taking the entire tree out of consideration if it does not match the selection predicate. Above, if a particular Name substructure of a Document record does not have a Url field matching the '^http' regular expression, the whole Name substructure is thrown away. This means its Name.Language.Code field, albeit a different field from Name.Url, will not be used to produce a record even though it is referenced in the projection (line 2 in the full query). The second clause in the selection, DocId < 20, prunes away entire records, i.e. Documents, at the top-most level.

Projection

Projections in Dremel have the interesting semantics that if a projection involves two fields at different levels of nesting, values will be produced for every field at the most-nested level. Consider the projection

Name.Url + ',' + Name.Language.Code AS Str

Here, Name.Url is at nesting level 2 (starting at Document), while Name.Language.Code is nested one level more. This means there could be many Name.Language.Code values for one Name.Url within a Name group. The semantics of Dremel here dictate that one value will be produced for each value at the most-nested level, i.e. one value for every Name.Language.Code, using the same Name.Url. This is exemplified by the sample output shown above, which could have been produced by a record like this (disregarding selection):

Name:
    Url: 'http://A'
    Language:
        Code: 'en'
    Language:
        Code: 'en-us'

Aggregation

Dremel’s aggregation semantics are showcased by the following line in the original query:

COUNT(Name.Language.Code) WITHIN Name AS Cnt

Make particular notice of the WITHIN keyword, which is not part of ANSI SQL since ANSI SQL’s data model includes only flat data. Because Dremel operates on nested data, it is necessary to specify at which level of nesting the aggregation should be performed. Consider this record:

Name:
    Language:
        Code: 'en'
    Language:
        Code: 'en-us'
Name:
    Language:
        Code: 'en-gb'

Counting how many Name.Language.Code fields there are could mean different things here. We could count how many such fields there are in the entire record or we could count how many such fields there are within each Name group. This is what the WITHIN keyword gives us control over. Above, we specify to aggregate WITHIN Name, which means we’ll produce one Cnt value for each Name group. This produces the output from above:

DocId: 10
Name:
    Cnt: 2
    Language:
        Str: 'http://A,en-us'
        Str: 'http://A,en'
Name:
    Cnt: 0

If we instead wrote

COUNT(Name.Language.Code) WITHIN Name.Language AS Cnt

we would get

DocId: 10
Name:
    Language:
        Cnt: 1
        Str: 'http://A,en-us'
    Language:
        Cnt: 1
        Str: 'http://A,en'
Name:
    Language:
        Cnt: 0

Query Execution

Having discussed how Dremel encodes and stores data, as well as how it lets users write queries to access this data, let us now explore how Dremel bridges the gap between these two steps, i.e. how it executes a query.

For query execution, Dremel borrows concepts from the domain of web-search computing. One concept from this realm is that of a serving tree. The way Dremel executes a query is that it first sends it from the client (e.g. the web interface a data analyst uses to write her query) to a root server. The root server then reads metadata about the tables the query accesses. For clarification, a table here means a collection of records. This root server then forwards a rewritten version of the query to a number of intermediary servers below it in the serving tree, which may, recursively, do the same. This process repeats until the (rewritten) query reaches leaf servers. Leaf servers are the servers that have a direct link to the storage layer in which data is located, such as a GFS cluster. Each leaf server has access to a subset of the entire table.

serving tree

The crucial process in this serving tree is that of query-rewriting. How are queries rewritten? Fundamentally, a server at one level in the serving tree will rewrite its query such that the work the query does can be divided among the servers below it. Once the servers below get their results, the original server aggregates these results and passes it further up the serving tree. This repeats all the way to the root server. Classic divide-and-conquer.

Let’s look at an example. A root server may receive the following query from a client:

SELECT COUNT(A) FROM T

This query does a count of all A fields in the table T. It can be rewritten into

SELECT SUM(C) FROM (R_1 UNION ALL ... R_n)

where R_i is defined as

SELECT COUNT(A) AS C FROM T_i

Here, T_i is a disjoint partition of the original table T. At the leaf servers, such a partition is called a tablet. Each level of the serving tree performs a further such rewrite of the query until the query reaches the leaf servers. The leaf servers then scan their partition of T to do the actual counting work. These counts are passed back up to intermediate serves, which sum these counts into their own local count. At the root server, this sum becomes the total count we originally wanted.

The end result of this recursive decomposition of the original query into smaller and smaller queries is that the final amount of work to be done by leaf servers is small in comparison to the overall work. This allows the entire query to be executed faster than if the root server had to scan the entire table sequentially itself. One experiment in the paper actually shows that changing the serving tree topology from 1:2900 (one root server, 2900 leaf servers) to 1:100:2900 (one root server, 100 intermediate servers, 2900 leaf servers) results in an order of magnitude performance gain for particular queries.

Query Dispatch

Finally, the paper touches upon interesting details about how Dremel approaches query dispatch. Specifically, this topic touches upon how Dremel deals with concurrent execution of multiple queries.

The paper explains that Dremel is, of course, a multi-user system, where many users can schedule queries at the same time. Dremel’s query dispatcher is responsible for coordinating the execution of these queries. This dispatcher has a notion of different levels of priority between queries, where higher priority queries are (probably) executed sooner than lower priority ones during times of high load.

To allow queries to execute in parallel, Dremel divides its capacity into so-called slots. A slot is equal to one thread on a leaf server. For example, a system with 3,000 leaf servers, each running 8 threads that could scan the storage layer, has 3,000 x 8 = 24,000 available slots in total. Recall that a table is itself partitioned into tablets. A single slot is assigned one or more tablets that it alone can read from. If a tablet is split into 100K tablets, that means that each slot would be responsible for 5 tablets, of which it can read one at a time. If two queries need to access the same tablet, the slot would presumably be able to service only one of these queries and this is where a delay could occur for the second query.

For fault tolerance, each tablet is two to three-way replicated, such that if a slot cannot reach a tablet, it can contact at least one other source of that same tablet data. On the topic of tablets: It is unclear to me whether a tablet can contain multiple columns, but I simplistically think of them as fixed-size splits of one column. For example, one tablet may contain 50MB worth of data for the Name.Language.Code field. The paper lists 100K-800K as typical examples for the number of tablets for one table.

One very interesting mention in the paper is that the query dispatcher has a parameter that controls the percentage of tablets that must be read before a query is considered copmlete. It turns out that reducing this parameter from 100% to something slightly lower, like 98%, can result in “signficant” speedups to the overall execution.

Takeaways

well there it is

Well, there it is: Dremel, a platform for “Interactive Analysis of Web-Scale Datasets”. What can we take away from it? I would say:

  • Scan-based queries of read-only nested data can be executed at interactive speeds (<10s),
  • Columnar storage can be applied to nested data just as it has been applied to relational data in DBMS,
  • If trading speed against accuracy is acceptable, a query can be terminated much earlier and yet see most of the data.

Overall, I found the paper very well written and insightful on many fronts. I hope this article provided some further detail where more detail might have been necessary but was outside the scope of an academic paper. As such, query on!