Special Interest Group on CRAP

Thoughts by Kevin McCurley
Not affiliated with ACM. They have their own crap.


Keyspace Tuneview

February 23rd, 2008 ·

I recently went to buy a dock for my iPod, to try and play music from the iPod through my stereo system. I read a bunch of reviews of devices, and opted to purchase a Keyspace Tuneview through Amazon. Step 1 is realizing that you aren’t purchasing it through Amazon, but that’s typical (and extremely irritating - more on this later).

When the device finally arrived, it appeared to be intact, so I started hooking up all the cables and gizmos. The dock worked OK provided I used the button on the iPod, but the remote never functioned. It never synced up, and I tried every avenue I could think of to fix it: searching online for tips, going to the manufacturer’s web site, and following the directions in the manual. After investing two hours in this process, I sent an email to the manufacturer. After two days with no response, I gave up and returned it for a refund.

This experience reminded me of several things:
1. Many consumer electronics components these days have very poor quality control.
2. Both Microsoft and Apple have failed to deliver products that really address user desires in the home entertainment space. Windows Media Player is a disaster, and Apple apparently forgot that people sometimes like to browse and organize their large photo collections. There are lots of manufacturers trying to plug the holes in these products, but most of their attempts are poorly implemented and lack inter-operation.
3. Amazon has decided to expand their business through promotions for third party resellers. I applaud their efforts in organizing reviews and centralizing purchasing, but their reputation will stand on the reputations of their partners, and my experiences so far have not been good. In my case the shipping date was misrepresented on the Amazon web site until after I made the purchase. Moreover, when you order things from multiple sources you end up having an order arrive in a bunch of separate packages with individual shipping charges, and every reseller has their own screwball return and warranty policies. If I wanted this much chaos I wouldn’t have gone to Amazon.

That’s my consumer rant for today.

Tags: Rants

The “war on terror”

February 21st, 2008 ·

I recently was around someone who used the term “war on terror”. I almost started ranting, but decided to pause and think about why that term bugs me so much. Sometimes it’s better to think before you speak.

I was reminded of it again when I heard a talk by Thomas Barnett from the TED conference. He used the term “holocaust” to describe what is currently going on in Sudan, and had an excellent explanation for why the US and the world are so impotent to do anything about it. It also explains why our existing military is so useless in the “war on terror”.

The United States has the greatest military imaginable now, but it doesn’t provide us with our national security. This is not really a failure on their part, because they have been built up in a way that was optimized to fight the wars we saw in the past. Unfortunately their capabilities are only a small part of what contributes to national security. As Barnett points out, the US military consists of a force that is young, muscular, well armed, and slightly pissed off. They want someone to pick a fight with and they want to kick ass. George Bush wishes he was one of those. Both are unfortunately useless in the face of terror, because terror is a political movement that can only be fought as a political movement. At the moment our president is a political eunuch.

Using the military to fight the “war on terror” is like using a flamethrower to treat skin inflammation. Calling it a war is in fact kind of silly, because most people think wars should be fought with armies. In the 1960s, it was popular to use the term “war” to describe other social movements, such as the war on poverty or the war on drugs or the war on illiteracy. If you think about it, the military would be just as useless for these wars as it is in the war on terror. Some people were upset by the use of the term “war” to describe the initiatives, but the terminology was chosen to motivate the public, and the only thing they could get motivated about was war.

We’ll probably never eradicate terror, but we can do many things to reduce it. Getting most of the world pissed off at us is certainly not an optimal strategy, because it just creates a fertile ground for recruiting terrorists, and devalues our words. We need some leadership in this country, and we need a foreign policy that addresses the political forces of terror. It’s time we have a president who can translate will into useful actions. It’s time we treated the war on terror more like the war on poverty, and it’s time we had a security force that was tuned for security instead of being tuned for the old wars.

Tags: Inspirations · Politics

Bitching about the war

February 7th, 2008 ·

There have been a lot of complaints about Hillary having voted for funding the war in Iraq, and I’m sure that she regrets that decision just as those who voted for the Gulf of Tonkin resolution regretted it later. It’s easy to forget but at the time, the Bush administration was spreading lies about WMD in order to coerce a decision that was pre-ordained by their theology.

If you really want to contemplate disaster, think about the wars that John McCain will get us into if he becomes president.

Tags: Politics

Everyone needs a hobby

February 7th, 2008 ·

Tags: Amusements

The oldest email address in existence?

February 5th, 2008 ·

For some reason today I was wondering what the oldest email address is at this point. All addresses in RFC 821 are from the now-defunct .ARPA domain, but presumably the postmaster user at the oldest domain is a reasonable contender. root@localhost is another popular one, but the real question is - what is the oldest personal SMTP email address that is still in use?

Tags: The internet

How to pronounce FQL

January 28th, 2008 ·

If you are a geek, you know about the SQL query language, which is pronounced “sequel”. Facebook just released a new query language with the acronym FQL. I suspect that this will come to be known as “fecal”.

Tags: Amusements

Papers, please?

January 28th, 2008 · 1 Comment

Before you view this blog posting, perhaps I should be asking for your papers. It seems that Rudy wants me to.

Tags: Politics

Mapreduce: a major disruption to database dogma

January 23rd, 2008 ·

I should first issue a disclaimer since I work for Google. This doesn’t reflect the opinion of my employer, so any inaccuracies are my fault alone. If it’s any consolation, at least Google isn’t trying to sell Mapreduce as a product.

David J. DeWitt and Michael Stonebraker recently wrote an article titled Mapreduce: a giant step backwards. While the article makes some very good points, I think they have failed to appreciate exactly what Mapreduce is good for. To quote from this response, I had a major WTF moment when I read the article by DeWitt and Stonebraker. Apparently I am not the only one who had this reaction.

The article reminds me of a conversation I had with a relational database guru (an IBM fellow) back in 2000. At the time he made a comment about how much better the web would be if people just stored their data in “a relational database” instead of on this messy web thing. At the time I remember thinking he needed to reboot his brain, because the web was about decentralization and expression, not data. In fairness the guy was very smart and I respected his opinion on many things. In this case I think he was so skilled with a hammer that everything looked like a nail to him.

What is Mapreduce?

Before addressing the specific points in the article by DeWitt and Stonebraker, I’d like to say what I think Mapreduce really is. Mapreduce is a component of a system for distributed computing. It is not a data storage system, and it is not a general purpose computing platform. It is well tailored for a class of applications that are of interest to Google, and we use it routinely to process petabytes of information.

From a conceptual standpoint, Mapreduce is a very simple system. The basic idea is to use multiple phases to process an input data set, producing an output data set. It is designed for massive data sets, and the only requirement for the data sets is that they must be readable and writable at high speed. At Google, these data sets are often stored in the Google File System (GFS), or bigtable, though the data might also be stored in a relational database if it can support sufficient data rates.

A diagram of the Mapreduce concept is shown to the right. A Mapreduce consists of three phases. First comes a map phase that takes input records and produces output (key,value) pairs. This is followed by a shuffle phase that groups the (key,value) pairs by common values of the key. Finally, a reduce phase takes all pairs for a given key and produces a new value for the same key; these values are the output of the Mapreduce.
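The three phases can be sketched on a single machine in a few lines of Python. This is only an illustration of the concept, using word counting as the example. The function names (map_phase, shuffle_phase, reduce_phase, run_mapreduce) are made up here, and nothing in it reflects how Google's distributed implementation is actually written.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record and collect the (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle_phase(pairs):
    """Group the values of all pairs that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Produce one output value per key from the grouped values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_mapreduce(records, mapper, reducer):
    """Chain the three phases: map, then shuffle, then reduce."""
    return reduce_phase(shuffle_phase(map_phase(records, mapper)), reducer)


def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    return [(word.lower(), 1) for word in line.split()]


def count_reducer(key, values):
    """Sum the counts emitted for one word."""
    return sum(values)


documents = ["the web is messy", "the web is big"]
word_counts = run_mapreduce(documents, word_mapper, count_reducer)
print(word_counts)
```

In a real deployment the map and reduce calls would run on many machines and the shuffle would move data between them; the sketch keeps everything in memory to show only the data flow.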

The power of this technique is that it fits a broad class of problems that are important to solve at Google.

The article by DeWitt and Stonebraker

The article by DeWitt and Stonebraker levels several criticisms at the Mapreduce paradigm:

  1. A giant step backward in the programming paradigm for large-scale data intensive applications
  2. A sub-optimal implementation, in that it uses brute force instead of indexing
  3. Not novel at all — it represents a specific implementation of well known techniques developed nearly 25 years ago
  4. Missing most of the features that are routinely included in current DBMS
  5. Incompatible with all of the tools DBMS users have come to depend on

I’ll address these individually.

1. A giant step backwards?

Whether a programming concept is a step backward depends on which direction you are trying to walk. Mapreduce is a step toward solving problems that were not easy to solve with existing systems.

When the authors get around to explaining their statement, it is changed somewhat to say “MapReduce is a step backwards in database access”. That’s weird because MapReduce has virtually nothing to do with database access. MapReduce doesn’t care where the data comes from, so long as it gets the data fast enough.

The authors base their criticism on the following points that they claim are ignored by MapReduce:

  • Schemas are good
  • Separation of the schema from the application is good
  • High level access languages are good

The statement that “schemas are good” is troubling to me, because schemas can also be bad. What makes a schema good is when it facilitates the kind of processing you want to do on the underlying data. If you have a common set of access patterns and a common set of algorithms you want to apply on data, then schemas can be helpful.

The statement that “separation of the schema from the application is good” is also correct but simplistic. Schemas are not without cost, because data must be brought into conformance with a schema, and schemas need to adapt to the characteristics and usage of the data. The shuffle phase of a Mapreduce can be interpreted as an attempt to bend data into a schema arranged around a single index consisting of keys in a set of (key,value) pairs. By making the schema dynamic, Mapreduce is able to solve a broader class of problems.
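One way to picture the "dynamic schema" idea is to shuffle the same records twice, each time around a different key chosen at run time. The sketch below is a toy under stated assumptions: the log records, their field names, and the shuffle_by helper are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical log records; the field names are made up for this example.
log_records = [
    {"url": "/a", "user": "alice", "bytes": 100},
    {"url": "/b", "user": "bob", "bytes": 250},
    {"url": "/a", "user": "bob", "bytes": 50},
]


def shuffle_by(records, key_fn, value_fn):
    """Arrange records around a single index whose key is chosen by the caller."""
    index = defaultdict(list)
    for record in records:
        index[key_fn(record)].append(value_fn(record))
    return dict(index)


# The same data, bent into two different key-centred layouts on demand.
bytes_by_url = shuffle_by(log_records, lambda r: r["url"], lambda r: r["bytes"])
bytes_by_user = shuffle_by(log_records, lambda r: r["user"], lambda r: r["bytes"])

print(bytes_by_url)
print(bytes_by_user)
```

Neither layout has to be declared ahead of time; each job picks the key that suits the question it is asking.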

The strangest statement in the paper is probably the following:

If a programmer wants to write a new application against a data set, he or she must discover the record structure. In modern DBMSs, the schema is stored in a collection of system catalogs and can be queried (in SQL) by any user to uncover such structure.

SQL allows a program to query on data type, but this is almost irrelevant since application programs are written to depend on semantics rather than raw data types. Just knowing that something is an unsigned 64-bit integer is not very helpful, since the application must know whether this represents a timestamp, a hash value, a number of bytes, a userid, or whatever. Oh wait. You can’t store unsigned 64-bit integers in many relational databases. Bad example.
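A small illustration of the point about types versus semantics: the eight bytes below decode to the same unsigned 64-bit integer no matter what, and only outside knowledge says whether it is a timestamp or a byte count. The value and both interpretations are chosen purely for the example.

```python
import struct
from datetime import datetime, timezone

# Eight raw bytes holding an unsigned 64-bit integer (little-endian).
raw = struct.pack("<Q", 1201046400)

# The declared type only tells us how to decode the bytes...
(value,) = struct.unpack("<Q", raw)

# ...the meaning has to come from somewhere else.
as_timestamp = datetime.fromtimestamp(value, tz=timezone.utc)  # seconds since the epoch?
as_gibibytes = value / 2**30                                   # or a number of bytes?

print(value)
print(as_timestamp.isoformat())
print(round(as_gibibytes, 2))
```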

Mapreduce is a part of the complete Google computing infrastructure, but there is another part that maintains a catalog of commonly used serializable data structures called protocol buffers (see the paper on sawzall). Protocol buffers are expressed in a data description language, and a compiler is used to generate code for them in a high-level language. Programmers typically refer to the catalog of protocol buffers for easy parsing and serialization of data, and to the comments that describe the semantics of data.

The statement that “high level access languages are good” is certainly true, but presumably the authors should have used the singular form of the word, namely “language”. Relational databases have been notably poor in their support for any high-level language other than SQL. This is particularly true for object-oriented languages, which is why object databases and object-relational database layers were created. It’s probably accurate to say that most programming with databases still uses crude adaptation layers like ODBC and JDBC to bridge between SQL and application code. High level languages are definitely good, but applications should be written in the best language for the task, and databases are often an impediment to this.

2. A sub-optimal implementation

The statement that Mapreduce uses brute force instead of indexing neglects the fact that a relational database system has to perform essentially a map operation in order to construct the index.
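To make that concrete, here is a toy index build over a tiny in-memory table. It is only a sketch: real database systems use structures such as B-trees rather than a Python dictionary, and the table contents and function names are invented. The point is simply that building the index visits every row once, just as a brute-force scan does.

```python
from collections import defaultdict

# A hypothetical table of (row_id, value) pairs.
rows = [
    (0, "relational"),
    (1, "web"),
    (2, "relational"),
]


def build_index(table):
    """Visit every row (the 'map' step) and group row ids by value."""
    index = defaultdict(list)
    for row_id, value in table:
        index[value].append(row_id)
    return dict(index)


def brute_force_select(table, wanted):
    """Answer the same lookup by scanning every row, with no index."""
    return [row_id for row_id, value in table if value == wanted]


index = build_index(rows)
print(index["relational"])
print(brute_force_select(rows, "relational"))
```

The index pays off only when the cost of that full pass is amortized over many later lookups on the same key.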

DeWitt and Stonebraker are correct in saying that MapReduce is a poor implementation of the SELECT statement in SQL. It’s fair to point out that most databases provide for laughable implementations of the query

SELECT URL,SNIPPET FROM WEB WHERE DOCUMENT CONTAINS “relational” OR SYNONYM OR STEM GROUP BY MEANING ORDER BY RELEVANCE,QUALITY

The discussion about skew is equally strange, since the paper by Sanjay and Jeff contains a section that specifically addresses this problem. Perhaps the confusion was caused by the use of the term “load balancing” rather than “skew”.

Finally, here’s my answer to the comment that “we have serious doubts about how well MapReduce applications can scale”. This is nothing short of giggle material.

3. Not novel at all

All good work builds on the work of predecessors, and all academic papers must confine their citations to the pieces that the authors deem most relevant. Since the contribution of the MapReduce paper is to describe a system for processing large data sets, they concentrated on previous systems work.

4. Missing most of the features that are routinely included in current DBMS

Good. That way Mapreduce doesn’t have to drag around all the sludge that makes relational database systems run so slowly, and can concentrate on solving the problem it was intended to solve.

OK, maybe you can tell that by this time I was losing my patience with the irrelevant criticisms of Mapreduce. Mapreduce didn’t set out to replicate the functionality of a relational database, and while it’s possible to implement many things in a relational database, that doesn’t mean it is the best solution for a given problem.

5. Incompatible with all of the tools DBMS users have come to depend on

It’s true. If you are dependent on DBMS tools, then Mapreduce won’t help you with your dependency. It also won’t help you get rid of your pack-a-day smoking habit.

The nature of systems research

At the risk of over-generalization, advances in computer science fall into one of three categories:

  1. theory
  2. applications
  3. systems

The creators of the Mapreduce system didn’t claim to advance the theory of distributed systems or databases.

They were motivated by a few applications, specifically in document and logs processing. From these core applications they designed a system that is capable of solving a broad class of problems on huge data sets. The same could be said for relational databases, which were designed primarily around indexing for selects, and ACID for updates.

The success of a system is determined by the capability the system provides. Both Mapreduce and relational databases are useful systems, but they don’t really compete with each other. The Mapreduce system has evolved over time, and incorporates many useful features such as fault tolerance, automated load balancing, and scheduling that make it extremely powerful for solving a broad class of problems. It functions as part of a system that includes bigtable, GFS, protocol buffers, sawzall, and cluster management. For business reasons this system has mostly remained proprietary to Google, but other groups have constructed systems (notably Hadoop) to implement some of the same functionality. The authors of the Mapreduce system aren’t selling anything, but are merely reporting on the success that their community has had with their very remarkable system.

Tags: Research

Lyrics that inspire

January 12th, 2008 ·

During your lifetime there are a few songs that will stop you in your tracks and grab your attention. Maybe it’s the moment, or the melody, or the people you are with, or a snatch of the lyrics. Here’s my list.

I don’t need to fight to prove I’m right. I don’t need to be forgiven.

The Who

You can go your own way.

Lindsey Buckingham, Fleetwood Mac

When you got nothing, you got nothing to lose.

Bob Dylan

I’ll tip my hat to the new constitution
Take a bow for the new revolution
Smile and grin at the change all around me

The Who

I can’t complain - sometimes I still do.
Life’s been good to me so far.

Joe Walsh

Tags: Inspirations

Managing a blog

January 5th, 2008 ·

You may have noticed that there are essentially no comments allowed on this blog. The reason is simple - the vast majority of comments that are generated for blogs are pure spam, and I don’t want to deal with it. There are also a huge number of sites that copy my content and then link to my site, hoping to get me to link back to them so they can accumulate precious PageRank. Ordinary readers of the web may not be aware of the nonsense that has crept in at the corners of the web, but if you try running a blog or a web site or a mail server then you end up spending most of your time dealing with riff raff.

If you want to get some notice for comments, then you can always resort to the traditional means of social interaction by contacting me. You don’t have to agree with me to get a link from me, but it helps to have some social skills.

Tags: The internet