Everyone needs a hobby
February 7th, 2008 ·
→ Tags: Amusements
The oldest email address in existence?
February 5th, 2008 ·
For some reason today I was wondering what the oldest email address is at this point. All addresses in RFC 821 are from the now-defunct .ARPA domain, but presumably the postmaster user at the oldest domain is a reasonable contender. root@localhost is another popular one, but the real question is - what is the oldest personal SMTP email address that is still in usage?
→ Tags: The internet
How to pronounce FQL
January 28th, 2008 ·
If you are a geek, you know about the SQL query language, which is pronounced “sequel”. Facebook just released a new query language with the acronym FQL. I suspect that this will come to be known as “fecal”.
→ Tags: Amusements
Papers, please?
January 28th, 2008 · 1 Comment
Before you view this blog posting, perhaps I should be asking for your papers. It seems that Rudy wants me to.
→ 1 CommentTags: Politics
Mapreduce: a major disruption to database dogma
January 23rd, 2008 ·
I should first issue a disclaimer since I work for Google. This doesn’t
reflect the opinion of my employer, so any inaccuracies are my fault alone.
If it’s any consolation, at least Google isn’t trying to sell Mapreduce as a product.
David J. DeWitt and Michael Stonebraker recently wrote an article titled Mapreduce: a giant step backwards. While the article makes some very good points, I think they have failed to appreciate exactly what Mapreduce is good for. To quote from this response, I had a major WTF moment when I read the article by DeWitt and Stonebraker. Apparently I am not the only one who had this reaction.
The article reminds me of a conversation I had with a relational database guru (an IBM fellow) back in 2000. At the time he made a comment about how much better the web would be if people just stored their data in “a relational database” instead of on this messy web thing. At the time I remember thinking he needed to reboot his brain, because the web was about decentralization and expression, not data. In fairness the guy was very smart and I respected his opinion on many things. In this case I think he was so skilled with a hammer that everything looked like a nail to him.
What is Mapreduce?
Before addressing the specific points in the article by DeWitt and Stonebraker, I’d like to say what I think Mapreduce really is. Mapreduce is a component of a system for distributed computing. It is not a data storage system, and it is not a general purpose computing platform. It is well tailored for a class of applications that are of interest to Google, and we use it routinely to process petabytes of information.
From a conceptual standpoint, Mapreduce is a very simple system. The basic idea is to use multiple phases to process an input data set, producing an output data set. It is designed for massive data sets, and the only requirement for the data sets is that they must be readable and writable at high speed. At Google, these data sets are often stored in the Google File System (GFS), or bigtable, though the data might also be stored in a relational database if it can support sufficient data rates.
A diagram is shown to the right. A Mapreduce consists of three phases. First comes a map phase that takes input records and produces output (key,value) pairs. This is followed by a shuffle phase that groups the (key,value) pairs by common values of the key, and finally a reduce phase that takes all pairs for a given key and produces a new value for the same key and this is the output of the Mapreduce.
The power of this technique is that it fits a broad class of problems that are important to solve at Google.
The article by DeWitt and Stonebraker
The article by DeWitt and Stonebraker levels several criticisms at the Mapreduce paradigm:
- A giant step backward in the programming
paradigm for large-scale data intensive applications - A sub-optimal implementation, in that it uses brute force instead of indexing
- Not novel at all — it represents a specific implementation of well known techniques developed nearly 25 years ago
- Missing most of the features that are routinely included in current DBMS
- Incompatible with all of the tools DBMS users have come to depend on
I’ll address these individually.
1. A giant step backwards?
Whether a programming concept is a step backward depends on which direction you are trying to walk. Mapreduce is a step toward solving problems that were not easy to solve with existing systems.
When the authors get around to explaining their statement, it is changed somewhat to say “MapReduce is a step backwards in database access”. That’s weird because MapReduce has virtually nothing to do with database access. MapReduce doesn’t care where the data comes from, so long as it gets the data fast enough.
The authors base their critisicm on the following points that they claim are ignored by MapReduce:
- Schemas are good
- Separation of the schema from the application is good
- High level access languages are good
The statements that “schemas are good” is troubling to me, because schemas can also be bad. What makes a schema good is when it facilitates the kind of processing you want to do on the underlying data. If you have a common set of access patterns and a common set of algorithms you want to apply on data, then schemas can be helpful.
The statement that “separation of the schema from the application” is also correct but simplistic. Schemas are not without cost, because data must be brought into conformance with a schema, and schemas need to adapt to the characteristics and usage of the data. The shuffle phase of a Mapreduce can be interpreted as an attempt to bend data into a schema arranged around a single index consisting of keys in a set of (key,value) pairs. By making the schema dynamic, Mapreduce is able to solve a broader class of problems.
The strangest statement in the paper is probably the following:
If a programmer wants to write a new application against a data set, he or she must discover the record structure. In modern DBMSs, the schema is stored in a collection of system catalogs and can be queried (in SQL) by any user to uncover such structure.
SQL allows a program to query on data type, but this is almost irrelevant since application programs are written to depend on semantics rather than raw data types. Just knowing that something is an unsigned 64-bit integer is not very helpful, since the application must know whether this represents a timestamp, a hash value, a number of bytes, a userid, or whatever. Oh wait. You can’t store unsigned 64-bit integers in many relational databases. Bad example.
Mapreduce is a part of the complete Google computing infrastructure, but there is another part that maintains a catalog of commonly used serializable data structures called protocol buffers (see the paper on sawzall). Protocol buffers are expressed in a data description language, and a compiler is used to generate code for these in a high-level language for these. Programmers typically refer to the catalog of protocol buffers for easy parsing and serialization of data, and to the comments that describe the semantics of data.
The statement that “high level access languages are good” is certainly true, but presumably the authors should have used the singular form of the word, namely “language”. Relational databases have been notably poor in their support for any high-level language other than SQL. This is particularly true for object-oriented languages, which is why object databases and object-relational database layers were created. It’s probably accurate to say that most programming with databases still uses crude adaptation layers like ODBC and JDBC to bridge between SQL and application code. High level languages are definitely good, but applications should be written in the best language for the task, and databases are often an impediment to this.
2. A sub-optimal implementation
The statement that Mapreduce uses brute force instead of indexing neglects the fact that a relational database system has to perform essentially a map operation in order to construct the index.
DeWitt and Stonebraker are correct in saying that MapReduce is a poor implementation of the SELECT statement in SQL. It’s fair to point out that most databases provide for laughable implementations of the query
The discussion about skew is equally strange, since the paper by Sanjay and Jeff contains a section that specifically addresses this problem. Perhaps the confusion was caused by the use of the terms “load balancing” rather than “skew”.
Finally, here’s my answer to the comment that “we have serious doubts about how well MapReduce applications can scale”. This is nothing short of giggle material.
3. Not novel at all
All good work builds on the work of predecessors, and all academic papers must confine their citations to the pieces that the authors deem most relevant. Since the contribution of the MapReduce paper is to describe a system for processing large data sets, they concentrated on previous systems work.
4. Missing most of the features that are routinely included
in current DBMS
Good. That way Mapreduce doesn’t have to drag around all the sludge that makes relational database systems run so slowly, and can concentrate on solving the problem it was intended to solve.
OK, maybe you can tell that by this time I was losing my patience with the irrelevant criticisms of Mapreduce. Mapreduce didn’t set out to replicate the functionality of a relational database, and while it’s possible to implement many things in a relational database, that doesn’t mean it is the best solution for a given problem.
5. Incompatible with all of the tools DBMS users have come to depend on
It’s true. If you are dependent on DBMS tools, then Mapreduce won’t help you with your dependency. It also won’t help you get rid of your pack a day smoking habit.
The nature of systems research
At the risk of over-generalization, advances in computer science fall into one of three categories:
- theory
- applications
- systems
The creators of the Mapreduce system didn’t claim to advance the theory of distributed systems or databases.
They were motivated by a few applications, specifically in document and logs processing. From these core applications they designed a system that is capable of solving a broad class of problems on huge data sets. The same could be said for relational databases, which were designed primarily around indexing for selects, and ACID for updates.
The success of a system is determined by the capability the system provides.
Both Mapreduce and relational databases are useful systems, but they don’t really compete with each other. The Mapreduce system has evolved over time, and incorporates many useful features such as fault tolerance, automated load balancing, and scheduling that makes it extremely powerful for solving a broad
class of problems. It functions as part of a system that includes bigtable, GFS, protocol buffers, sawzall, and cluster management. For business reasons this system has mostly remained proprietary to Google, but other groups have constructed systems (notably Hadoop) to implement some of the same functionality. The authors of the Mapreduce system aren’t selling anything, but are merely reporting on the success that their community has had with their very remarkable system.
Lyrics that inspire
January 12th, 2008 ·
During your lifetime there are a few songs that will stop you in your tracks and grab your attention. Maybe it’s the moment, or the melody, or the people you are with, or a snatch of the lyrics. Here’s my list.
I don’t need to fight to prove I’m right. I don’t need to be forgiven.
The Who
Lindsey Buckingham, Fleetwood Mac
When you got nothing, you got nothing to lose.
Bob Dylan
The Who
I can’t complain - sometimes I still do.
Life’s been good to me so far.
Joe Walsh
→ Tags: Inspirations
Managing a blog
January 5th, 2008 ·
You may have noticed that there are essentially no comments allowed on this blog. The reason is simple - the vast majority of comments that are generated for blogs are pure spam, and I don’t want to deal with it. There are also a huge number of sites who copy my content and then link to my site, hoping to get me to link back to them so they can accumulate precious PageRank. Ordinary readers of the web may not be aware of the nonsense that has crept in at the corners of the web, but if you try running a blog or a web site or a mail server then you end up spending most of your time dealing with riff raff.
If you want to get some notice for comments, then you can always resort to the traditional means of social interaction by contacting me. You don’t have to agree with me to get a link from me, but it helps to have some social skills.
→ Tags: The internet
An Atheist President
December 6th, 2007 ·
That title is a phrase that cannot be constructed in American politics. We somehow seem to have forgotten the meaning of “religious freedom”. Romney has a press conference to explain why a member of the fourth largest christian sect in the United States is qualified to be president in light of his unorthodox religion - this is proof positive that the United States has become a theocracy in which religious tolerance has vanished from public discourse.
As a lifelong atheist I have mostly tried to keep away from religion, because religion is evil. This is mostly possible in my life, provided I am willing to ignore the constant reminders that America is a mostly christian country. I really don’t care that much, but it has started to bug me that our foreign policy is mostly guided by religious intolerance in response to religious intolerance.
At what point can humanity come to respect all individual opinions about the origin of life and the role of faith?

Income Inequality in the Attention Economy
November 6th, 2007 ·
I recently submitted a paper with the title “income Inequality in the Attention Economy’. This is my first paper in welfare economics, and the results in the paper came as a surprise to me. Among other things it shows that an increasing amount of attention is concentrated on a tiny number of websites. Among over 200 million websites (suitably defined), fully 50% of the traffic falls in only 1000 web sites. If you look only at the top one million websites (ignoring the bottom 199 million web sites), then the same thing is true - only 0.1% of the websites in this group get 50% of the traffic. In other words, in the attention economy of the web, the rich are getting richer.
It’s interesting to speculate about why this is true, and what the implications are for “heavy tail markets” that economists have spoken about lately. Suffice it to say that I was surprised by the results!
The paper is available here.
→ Tags: Research · The internet
Novel Manhole Covers
October 6th, 2007 ·
A recent posting by Bruce Schneier about some beautiful Japanese manhole covers reminded me of something that occurred a few weeks ago. While taking a walk around the Google complex I noticed an unusual manhole cover. As my colleague and I continued walking I realized that this offers an entirely new form of advertising opportunity, where advertising could be embedded into manhole covers. We could even make it so that when someone stepped on the surrounding surface, it would register as a “click”, resulting in a payment by the advertiser.

Now you’re probably thinking that this is the stupidest idea you have ever heard of. In reality I’d be surprised if someone had not already done this. There are definitely limits of how far we should be willing to tolerate advertising, and lots of people complain about the pervasiveness of advertising around the world. In recent months the authorities who run the Golden Gate Bridge were pondering whether to put advertising on the bridge. I think that goes way over the line of what is reasonable, but when society keeps insisting on lower taxes, our public facilities will eventually be starved to the point where they look for advertising as a business model.
I just have to admire a society that will have such beautiful manhole covers without being obnoxious about it.
→ Tags: Amusements