I should first issue a disclaimer since I work for Google. This doesn’t
reflect the opinion of my employer, so any inaccuracies are my fault alone.
If it’s any consolation, at least Google isn’t trying to sell Mapreduce as a product.
David J. DeWitt and Michael Stonebraker recently wrote an article titled Mapreduce: a giant step backwards. While the article makes some very good points, I think they have failed to appreciate exactly what Mapreduce is good for. To quote from this response, I had a major WTF moment when I read the article by DeWitt and Stonebraker. Apparently I am not the only one who had this reaction.
The article reminds me of a conversation I had with a relational database guru (an IBM fellow) back in 2000. At the time he made a comment about how much better the web would be if people just stored their data in “a relational database” instead of on this messy web thing. At the time I remember thinking he needed to reboot his brain, because the web was about decentralization and expression, not data. In fairness the guy was very smart and I respected his opinion on many things. In this case I think he was so skilled with a hammer that everything looked like a nail to him.
What is Mapreduce?
Before addressing the specific points in the article by DeWitt and Stonebraker, I’d like to say what I think Mapreduce really is. Mapreduce is a component of a system for distributed computing. It is not a data storage system, and it is not a general purpose computing platform. It is well tailored for a class of applications that are of interest to Google, and we use it routinely to process petabytes of information.
From a conceptual standpoint, Mapreduce is a very simple system. The basic idea is to use multiple phases to process an input data set, producing an output data set. It is designed for massive data sets, and the only requirement for the data sets is that they must be readable and writable at high speed. At Google, these data sets are often stored in the Google File System (GFS), or bigtable, though the data might also be stored in a relational database if it can support sufficient data rates.
A diagram is shown to the right. A Mapreduce consists of three phases. First comes a map phase that takes input records and produces output (key,value) pairs. This is followed by a shuffle phase that groups the (key,value) pairs by common values of the key. Finally, a reduce phase takes all pairs for a given key and produces a new value for that key; these values are the output of the Mapreduce.
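To make the three phases concrete, here is a minimal single-machine sketch in Python. It is only an illustration of the data flow, not Google's implementation: the names `map_phase`, `shuffle_phase`, `reduce_phase` and `run_mapreduce` are hypothetical, and a real Mapreduce spreads each phase across many machines and handles failures along the way.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply map_fn to every input record and collect the (key, value) pairs it emits."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def shuffle_phase(pairs):
    """Group the emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Produce one output value per key from all values seen for that key."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def run_mapreduce(records, map_fn, reduce_fn):
    """Chain the three phases: map, then shuffle, then reduce."""
    return reduce_phase(shuffle_phase(map_phase(records, map_fn)), reduce_fn)

# Example application (assumed for illustration): counting words.
def word_count_map(line):
    return [(word, 1) for word in line.split()]

def word_count_reduce(word, counts):
    return sum(counts)

counts = run_mapreduce(
    ["the web is messy", "the web is big"],
    word_count_map,
    word_count_reduce,
)
```

The application author writes only the two small functions at the bottom; everything else is the framework's job.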
The power of this technique is that it fits a broad class of problems that are important to solve at Google.
The article by DeWitt and Stonebraker
The article by DeWitt and Stonebraker levels several criticisms at the Mapreduce paradigm:
- A giant step backward in the programming paradigm for large-scale data intensive applications
- A sub-optimal implementation, in that it uses brute force instead of indexing
- Not novel at all — it represents a specific implementation of well known techniques developed nearly 25 years ago
- Missing most of the features that are routinely included in current DBMS
- Incompatible with all of the tools DBMS users have come to depend on
I’ll address these individually.
1. A giant step backwards?
Whether a programming concept is a step backward depends on which direction you are trying to walk. Mapreduce is a step toward solving problems that were not easy to solve with existing systems.
When the authors get around to explaining their statement, it is changed somewhat to say “MapReduce is a step backwards in database access”. That’s weird because MapReduce has virtually nothing to do with database access. MapReduce doesn’t care where the data comes from, so long as it gets the data fast enough.
The authors base their criticism on the following points that they claim are ignored by MapReduce:
- Schemas are good
- Separation of the schema from the application is good
- High level access languages are good
The statement that “schemas are good” is troubling to me, because schemas can also be bad. What makes a schema good is when it facilitates the kind of processing you want to do on the underlying data. If you have a common set of access patterns and a common set of algorithms you want to apply on data, then schemas can be helpful.
The statement that “separation of the schema from the application is good” is also correct but simplistic. Schemas are not without cost, because data must be brought into conformance with a schema, and schemas need to adapt to the characteristics and usage of the data. The shuffle phase of a Mapreduce can be interpreted as an attempt to bend data into a schema arranged around a single index consisting of keys in a set of (key,value) pairs. By making the schema dynamic, Mapreduce is able to solve a broader class of problems.
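One way to picture the "dynamic schema" idea is that the same raw records can be arranged around whichever key a particular job cares about. The sketch below is a toy under stated assumptions: the `group_by` helper and the log record fields (`user`, `url`, `status`) are invented for illustration and do not come from any Google system.

```python
from collections import defaultdict

def group_by(records, key_fn):
    """Arrange records around the key chosen by key_fn, as a shuffle would."""
    groups = defaultdict(list)
    for record in records:
        groups[key_fn(record)].append(record)
    return dict(groups)

# Hypothetical log records; nothing about them is declared up front.
logs = [
    {"user": "alice", "url": "/a", "status": 200},
    {"user": "bob", "url": "/a", "status": 404},
    {"user": "alice", "url": "/b", "status": 200},
]

# The same data, organized around two different "indexes" chosen per job.
by_user = group_by(logs, lambda record: record["user"])
by_url = group_by(logs, lambda record: record["url"])
```

Each job picks its own key at run time, so no single up-front layout has to suit every access pattern.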
The strangest statement in the paper is probably the following:
If a programmer wants to write a new application against a data set, he or she must discover the record structure. In modern DBMSs, the schema is stored in a collection of system catalogs and can be queried (in SQL) by any user to uncover such structure.
SQL allows a program to query on data type, but this is almost irrelevant since application programs are written to depend on semantics rather than raw data types. Just knowing that something is an unsigned 64-bit integer is not very helpful, since the application must know whether this represents a timestamp, a hash value, a number of bytes, a userid, or whatever. Oh wait. You can’t store unsigned 64-bit integers in many relational databases. Bad example.
Mapreduce is a part of the complete Google computing infrastructure, but there is another part that maintains a catalog of commonly used serializable data structures called protocol buffers (see the paper on sawzall). Protocol buffers are expressed in a data description language, and a compiler is used to generate code for them in a high-level language. Programmers typically refer to the catalog of protocol buffers for easy parsing and serialization of data, and to the comments that describe the semantics of data.
The statement that “high level access languages are good” is certainly true, but presumably the authors should have used the singular form of the word, namely “language”. Relational databases have been notably poor in their support for any high-level language other than SQL. This is particularly true for object-oriented languages, which is why object databases and object-relational database layers were created. It’s probably accurate to say that most programming with databases still uses crude adaptation layers like ODBC and JDBC to bridge between SQL and application code. High level languages are definitely good, but applications should be written in the best language for the task, and databases are often an impediment to this.
2. A sub-optimal implementation
The statement that Mapreduce uses brute force instead of indexing neglects the fact that a relational database system has to perform essentially a map operation in order to construct the index.
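The point can be illustrated with a toy index builder: to index a column, every row has to be scanned once and a (value, row id) pair emitted, which is a map followed by a grouping step. This is a deliberately simplified sketch; `build_index` and the sample rows are hypothetical, and real databases use structures such as B-trees rather than a Python dictionary.

```python
from collections import defaultdict

def build_index(rows, column):
    """Build a value -> row ids lookup for one column by scanning every row."""
    # Map step: visit each row and emit (column value, row id).
    pairs = [(row[column], row_id) for row_id, row in enumerate(rows)]
    # Grouping step: collect the row ids under each distinct value.
    index = defaultdict(list)
    for value, row_id in pairs:
        index[value].append(row_id)
    return dict(index)

# Hypothetical table contents, for illustration only.
rows = [
    {"user": "alice", "bytes": 10},
    {"user": "bob", "bytes": 20},
    {"user": "alice", "bytes": 5},
]
index = build_index(rows, "user")
```

Once the index exists, lookups are cheap, but the full pass over the data has still been paid for at construction time.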
DeWitt and Stonebraker are correct in saying that MapReduce is a poor implementation of the SELECT statement in SQL. It’s fair to point out that most databases provide for laughable implementations of the query
The discussion about skew is equally strange, since the paper by Sanjay and Jeff contains a section that specifically addresses this problem. Perhaps the confusion was caused by the use of the term “load balancing” rather than “skew”.
Finally, here’s my answer to the comment that “we have serious doubts about how well MapReduce applications can scale”. This is nothing short of giggle material.
3. Not novel at all
All good work builds on the work of predecessors, and all academic papers must confine their citations to the pieces that the authors deem most relevant. Since the contribution of the MapReduce paper is to describe a system for processing large data sets, its authors concentrated on previous systems work.
4. Missing most of the features that are routinely included in current DBMS
Good. That way Mapreduce doesn’t have to drag around all the sludge that makes relational database systems run so slowly, and can concentrate on solving the problem it was intended to solve.
OK, maybe you can tell that by this time I was losing my patience with the irrelevant criticisms of Mapreduce. Mapreduce didn’t set out to replicate the functionality of a relational database, and while it’s possible to implement many things in a relational database, that doesn’t mean it is the best solution for a given problem.
5. Incompatible with all of the tools DBMS users have come to depend on
It’s true. If you are dependent on DBMS tools, then Mapreduce won’t help you with your dependency. It also won’t help you get rid of your pack-a-day smoking habit.
The nature of systems research
At the risk of over-generalization, advances in computer science fall into one of three categories:
- theory
- applications
- systems
The creators of the Mapreduce system didn’t claim to advance the theory of distributed systems or databases.
They were motivated by a few applications, specifically in document and logs processing. From these core applications they designed a system that is capable of solving a broad class of problems on huge data sets. The same could be said for relational databases, which were designed primarily around indexing for selects, and ACID for updates.
The success of a system is determined by the capability the system provides.
Both Mapreduce and relational databases are useful systems, but they don’t really compete with each other. The Mapreduce system has evolved over time, and incorporates many useful features such as fault tolerance, automated load balancing, and scheduling that make it extremely powerful for solving a broad class of problems. It functions as part of a system that includes bigtable, GFS, protocol buffers, sawzall, and cluster management. For business reasons this system has mostly remained proprietary to Google, but other groups have constructed systems (notably Hadoop) to implement some of the same functionality. The authors of the Mapreduce system aren’t selling anything, but are merely reporting on the success that their community has had with their very remarkable system.