We have a number of primary data directives at Factual: one is to build global coverage of the world’s POIs and businesses. At the time of writing we manage over thirty-one million entities in thirty-four countries; many more are in the oven.

We produce and manage this content on a country-by-country basis — this continues to make sense, as each country requires its own validation and canonicalization configurations. However, we are crafting a number of initiatives that abstract international boundaries to help create truly global applications. The first, and perhaps most important, step is standardizing Local and POI data on a single schema, employed across all countries.

Anyone who has worked with Local Data is a member of a small fraternity, bound by a deep and shared frustration that this stuff is, without doubt, a pig to work with. The vagaries of postal formatting and political administrative hierarchies will always ensure that the World is never easily typed.

In capturing an un-normalized world, our choice therefore was either to capture every attribute for each country and retain semantic clarity, or to compromise in return for ease-of-use across international borders. As the mobile use case is increasingly international, we opted for the latter. This will undoubtedly entail some semantic shoehorning, but the big win is that creating and growing truly global location-based applications becomes easier, as you can now manage all of your Factual place data centrally.

The Schema

The schema presented here is a superset; we will formally support only a subset of these attributes in any given country.

factual_id – Factual ID
name – Business/POI name
po_box – PO Box. As they do not represent the physical location of a brick-and-mortar store, PO Boxes are often excluded from mobile use cases. We’ve isolated these for only a limited number of countries, but more will follow.
address – Street address
address_extended – Additional address, incl. suite numbers
locality – City, town, or equivalent
region – State, province, territory, or equivalent
admin_region – Additional sub-division, usually but not always a country sub-division
post_town – Town employed in postal addressing
postcode – Postcode or equivalent (ZIP code in the US)
country – The ISO 3166-1 alpha-2 country code
tel – Telephone number with local formatting
fax – Fax number, formatted as above
website – Authority page (official website)
latitude – Latitude in decimal degrees (WGS84 datum). Values will not exceed 6 decimal places (0.111m)
longitude – As above, but sideways
category – String name of category tree and category branch
status – Boolean representing the business as a going concern: closed (0) or open (1). We are aware that this will prove confusing to electrical engineers.
email – Contact email address of the organization
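
For illustration, a single record expressed against this schema might serialize to JSON something like the following (all values are invented, not an actual Factual record):

{
  "factual_id": "0e9a4a42-8c1e-4f2b-9d2a-0c3c5e6f7a8b",
  "name": "Joe's Coffee",
  "address": "123 Main St",
  "address_extended": "Suite 4",
  "locality": "Los Angeles",
  "region": "CA",
  "postcode": "90013",
  "country": "US",
  "tel": "(213) 555-0123",
  "website": "http://example.com",
  "latitude": 34.048123,
  "longitude": -118.246582,
  "category": "Food & Beverage > Cafes",
  "status": 1
}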


If this looks familiar, it should: it’s based on vcard/hcard attribute names, and you’ll see similar attribute names in Facebook’s OpenGraph markup, and in the return values from the Bing and Google Maps APIs (Ovi currently adheres to the city/district/state model, and Yahoo punts entirely). It could be argued that we should employ something more semantically rich, like Google’s multi-typed attributes in their Places API, but we lean towards ease-of-use, where we lean at all.

I’ll shortly post a link to a JSON file on GitHub where we enumerate the supported attributes for each country, and note where the ‘proper’ label differs from the default attribute. This schema harmonization is the first step in our data-centric approach to Geo and Local; we hope it helps engender more global location-based applications and, at the very least, makes ingesting and managing local data that much easier.

- Tyler Bell, Factual Product Bod

How long does it take to change your language?

Sun once estimated that it takes on average 18 months to evaluate and approve proposals for changes to the Java language. That estimate doesn’t include building, testing, and releasing the change. Nor does it include the time and effort needed to update compilers, IDEs, and so on.

This means that if you want a useful change added to the Java language, you may have to wait years – if the change is made at all.

What if things were different?

What if you could change your programming language to suit your needs? What if you could change it without asking permission, and without waiting on any kind of community process or support? With Clojure, you can. Clojure is a Lisp, so it has macros, and macros allow you to do these things.

So how do macros work? First let’s clarify what Clojure macros are not. Clojure macros are not like “macros” you might find elsewhere, such as Excel macros or C macros. We’re not talking about a tool that records keystrokes to play back later. Nor are we talking about simple text substitution.

Clojure macros allow your program to interact with the compiler, control order of evaluation and program flow, modify code, and write new code at compile time.

If this sounds like too much power, it’s because it is. But Lisp traditionally errs on the side of giving us too much power. As Clojure programmers, we benefit from this reckless allocation of power.

If you wish to modify the core Clojure language to better serve your own selfish needs, there’s a good chance you can do it with macros. Clojure’s macro facility is well defined, powerful, flexible and fun. And you don’t need anyone’s permission or help to use it.

We’re going to demonstrate some of this power with three small examples. Providing an actual tutorial on writing Clojure macros is beyond the scope of this article. What we want to achieve here is a healthy understanding of what the Clojure macro facility brings to the table, and why we as software developers should take it seriously.

Example 1: “unless”

Suppose you wanted to add an unless idiom to Java, so that you could write:

unless(false) {
  System.out.println("This will get printed out.");
}

Well, too bad. Java doesn’t support unless, and there’s no way for you to add it at the language level. But in Clojure we could write a simple macro:

(defmacro unless [condition & body]
  `(when (not ~condition) ~@body))

Now in our normal codebase we can do things like:

user> (unless (> 8 1) (println "The println is never even evaluated"))
nil
user> (unless (> 1 8) (println "This gets run!"))
This gets run!

So with a two-line snippet of code, we just added a fundamental flow control construct to our language. And we didn’t need Larry Ellison’s permission or James Gosling’s help to do it.

I double-dog dare you to petition Oracle to add unless to the Java language.

Example 2: “time”

Consider the common desire to get a simple micro benchmark for a chunk of code. Suppose we’re starting with this Java method…

public int myStuff(int n) {
  return betterStuff(
    ExternalLib.doStuff(n));
}

… and we want to wrap a micro benchmark around the call to doStuff(). We’re forced to do something like this:

public int myStuff(int n) {
  MyTimer t = new MyTimer();
  int stuff = ExternalLib.doStuff(n);
  t.stop();
  System.out.println("Duration: " + t.duration());

  return betterStuff(stuff);
}

Wow, that got real ugly real fast.

If you’re familiar with AOP you might be tempted to define an aspect that allows you to annotate arbitrary methods and get micro benchmarks. But AOP is a whole ‘nother language, has serious limitations, and must be explicitly bolted on to your project. By contrast, Clojure macros are written in Clojure, are a natural part of the language, and afford far more raw power and flexibility than AOP.

Let’s look at supporting micro benchmarks in Clojure. First let’s see how our code looks before:

(defn my-stuff [n]
  (better-stuff
    (ExternalLib/doStuff n)))

Now with explicit micro benchmarking:

(defn my-stuff [n]
  (let [start (System/currentTimeMillis)
        stuff (ExternalLib/doStuff n)]
    (println "Duration:" (- (System/currentTimeMillis) start))
    (better-stuff stuff)))

Just like the Java example, that got real ugly real fast. But there’s clearly a pattern that can be abstracted out. What we’d really like to be able to write is this:

(defn my-stuff [n]
  (better-stuff
    (time (ExternalLib/doStuff n))))

Notice that we’re not changing our initial syntax, nor are we introducing any extra syntax – we’re simply wrapping time around our target code.

But could that possibly work? It seems tricky because time would need full control over the order of evaluation. That is, time would need to start a timer, evaluate (ExternalLib/doStuff n), stop the timer and output the micro benchmark, then finally return the result from the call to (ExternalLib/doStuff n).

In Java, this kind of shenanigans is simply not allowed. But in Clojure we can support exactly what we want, by writing a small macro:

(defmacro time [expr]
  `(let [start# (System/currentTimeMillis)
         ret# ~expr]
     (println (str "Duration: " (- (System/currentTimeMillis) start#)))
     ret#))

In some ways the time macro isn’t a big deal. We saved ourselves from some syntax ugliness – so what?

Well, if you take a step back you can see it goes deeper than that. We modified the Clojure language itself, including having our way with control flow. Without that ability, we wouldn’t be able to create such a nice time idiom. Instead, we’d be forced to change the syntax of all code we wanted to benchmark. But with Clojure’s macro facility we were able to support very concise syntax.

Example 3: “with-open”

Java 7 is going to be released with a new feature called try-with-resources. This is going to relieve us of some of the boilerplate try/finally Java code we’ve had to write all these years for making sure to close down resources when we’re done with them. This is a great little improvement to the Java language!

So how long did it take to get this feature added with Java 7? According to this page, the call for proposals began on February 27, 2009. At the time of writing (early 2011), the official release date for Java 7 is mid 2011. That’s well over 2 years turnaround time!

Actually, we should count ourselves lucky. What if this change hadn’t been approved by the JCP? We’d be totally out of luck. On the other hand, we can implement this idiom with a Clojure macro with very little fanfare. We might call it with-open. Once written, it would allow us to do things like this:

;; my-res will be automatically closed once do-something-with is done
(with-open [my-res (open-a-resource)]
  (do-something-with my-res))

More examples… are waiting for you in Clojure core

Clojure itself already comes with a wide variety of macros that do useful things. Since Clojure is an open source project, a great way to get more exposure to real world macros is to browse the parts of Clojure that are implemented in Clojure, and look for the telltale defmacro definitions.

In fact, the macros discussed here are actually implemented in Clojure already. You can go code diving for time (a little fancier than shown here), unless (known as when-not and implemented a little differently), and with-open. And as you can imagine, that’s just the beginning.

Remember Spider-Man

With so much raw power and flexibility, macros come with their own unique risks. The best Clojure programmers are aware of this and exercise caution when creating macros. As Spider-Man might say: with great power comes great responsibility.

Macros should be viewed as a kind of “nuclear option”. We should use them sparingly and with great deliberation. The Clojure community has been known to joke that “the first rule of macro club is: don’t write macros”. What they mean is, we should avoid writing a macro when a simpler solution will do (such as writing a plain old function instead).

We should also place a high value on composability. Due to the nature of macros, especially their sheer power, it’s not hard to get carried away and lose sight of composability. But our code should be easy and natural to use with other code in a modular fashion.

An additional concern is code readability. Macros, if not written and documented well, can make it especially difficult for others to understand your code.

These concerns shouldn’t scare you away from the judicious use of macros. However, they should give you pause. In other words, be careful and remember your core values as you bring your nukes online.

“For The Win”

Macros are one reason many consider Lisp to be the most powerful language ever designed. While I won’t try to make that argument here, I will suggest that Clojure macros are very special.

If someone is considering picking up a new JVM language – or any language, for that matter – and they ask you something like…

Many languages support first class functions, have useful idioms like map and reduce, great libraries, plenty of syntactic sugar, etc. Is there anything unique to Clojure that makes it worth considering above the rest?

… you can tell them, “Dude, macros FTW.”

It is a universal truth that our own code is perfect and that had we simply written every line of our project, every library, none of this would be happening. Let’s say that you’ve decided you’re ready to take matters into your own hands, but for some reason, your employer just isn’t into paying you to completely rewrite the kernel. What then? How in the world are you going to get your jobs to finish with other people’s wtf-segfaulting libraries in there?

The unexpected death of an otherwise properly configured Hadoop job can come from two basic failure modes: unpredictable crashes and regular uncaught exceptions. Regular uncaught exceptions are just that, regular – they happen in the same place every time but just weren’t caught. The only regular thing about unpredictable crashes is that if you run things long enough, you’ll have some.

Other People’s Code… May Have Issues

In one of our science projects, we have a few Hadoop Streaming jobs that run over Ruby and rely on libxml to parse documents. This creates a perfect storm of badness – the web is full of really bad HTML, and libxml occasionally goes into infinite loops or outright segfaults. On some documents, it always segfaults.

The interesting thing about Hadoop is that handling unpredictable crashes is built right in. Depending on how unstable things are, you might have to do some tuning to keep from retreading too much ground or giving up too soon, but you barely have to think about it. This might seem counterintuitive at first, but when you think about the redundant/commodity model, it makes sense to simply retry your failed tasks with a change of venue and hope for the best. And indeed, it pretty much works as advertised.

On the other hand, I was really surprised by the amount of magic required to just rerun tasks and skip over records that cause consistent failures. I was of course being naive, expecting some kind of “stop_failing_dammit=true” flag. After thinking about this some more, though, I get why it’s hard and not-so-well documented. The very reasonable assumption that Hadoop makes is that you catch your exceptions. As it turns out, particularly when you’re doing Streaming, stuff can just break and there’s not much you can do about it.

Handling Random Failures

To deal with random failures, all you have to do is give Hadoop enough attempts per task to have a very high likelihood of finishing the job. One not-so-scientific option, of course, is to set the attempt number so high that a job never fails. The reasonable thing to do is just set it to a small multiple of FAILURE_RATE * TOTAL_RECORDS / NUMBER_OF_TASKS, which is roughly the number of failures each task can expect to hit.
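
For example, with made-up numbers: if roughly 1 in 10,000 records (a 0.01% failure rate) triggers a random crash, and a job has 10,000,000 records spread across 1,000 map tasks, then 0.0001 * 10,000,000 / 1,000 = 1 expected failure per task, so a max attempts setting of 4 or 5 gives each task comfortable headroom.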

For the map tasks, set the number of attempts with:

mapred.map.max.attempts

For the reduce tasks, set the number of attempts with:

mapred.reduce.max.attempts

If the same task fails more than the number of times specified in these settings, your job is killed.

Handling Predictable Failures

Earlier, I mentioned that some documents just cause libxml to segfault. The problem with a segfault is that it’s not an exception Ruby can catch, so the task fails every time in the same place.

Fortunately, Hadoop has a feature for skipping bad records. However, when we were wrestling with our job’s stability issues, we discovered it was not as easy as setting “fail_horribly=false” or even the real setting, “mapred.skip.mode.enabled=true”.

Particularly with Streaming, Hadoop doesn’t know exactly which record causes a task to fail, so it can’t just automagically retry the task, bypassing the offending input. There are a few things we can do to point it in the right direction and some choices we can make to weigh the value of each record against wasted work. When a task fails and skipping mode is triggered, Hadoop can essentially do a binary search on failed ranges.

Triggering Skip Mode:

Hadoop will go into skip mode if the same record fails a set number of times. The default is 2.

mapred.skip.attempts.to.start.skipping=2

Trading Records for Attempts:

Once in skip mode, we can choose to hold on to all but the bad record, toss out the entire range, or something in between. This setting controls how long Hadoop will keep retrying and failing to narrow the range.

mapred.skip.map.max.skip.records=1 #only skip the bad record
mapred.skip.map.max.skip.records=0 #don’t go into skip mode
mapred.skip.map.max.skip.records=Long.MAX_VALUE #don’t try to narrow

Giving Hadoop Some Help:

Because there is lots of buffering going on, telling Hadoop every time a record is processed will help it figure out where things went wrong. In Streaming, this can be done by emitting the following counter:

reporter:counter:SkippingTaskCounters,MapProcessedRecords,1

Putting it all Together

To demonstrate how all of these settings fit together, I wrote a simple ruby script to simulate predictable failures. The input is a file with 99,990 “good” records and 10 “bad” ones. The goal is to run the job to the end.

Our mapper (naughty_or_nice.rb):

#!/usr/bin/env ruby
STDIN.each do |line|
  if line
    line.chomp!
    if line=='Naughty'
      #simulate a failure
      warn "No Presents!"
      exit!
    else
      warn "reporter:counter:SkippingTaskCounters,MapProcessedRecords,1"
      puts 'You get presents'
    end
  end
end

Running our job:

      hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.1+169.127-streaming.jar          \
      -D mapred.skip.mode.enabled=true                                      \
      -D mapred.skip.map.max.skip.records=1                                 \
      -D mapred.skip.attempts.to.start.skipping=2                           \
      -D mapred.map.tasks=1000                                              \
      -D mapred.map.max.attempts=10                                        \
      -D mapred.reduce.tasks=0                                              \
      -D mapred.job.name="Skip Bad Records Test"                            \
      -input  "/user/hadoop/samples/skip-bad-records/skip_records_test.txt" \
      -output "/user/hadoop/samples/skip-bad-records/output/"               \
      -mapper "$APP_PATH/samples/map_reduce/naughty_or_nice.rb"

And here is our result: we kept all of our good records and skipped the 10 “Naughty” ones.

To demonstrate the mapred.skip.map.max.skip.records setting, I tried it again allowing 2 records to be thrown out. This time some “Nice” children missed out on presents because their names were listed next to “Naughty” ones. Santa did save himself some work, though.

Some Final Concerns

The setting for manually incrementing the record counter, mapred.skip.map.auto.incr.proc.count=false, doesn’t seem to get picked up by the jobconf. It looks like it can only be set for the whole cluster, which doesn’t make it awesome for mixing native Java and Streaming.

Good Luck Out There

Getting work done often depends on tolerating some amount of instability and getting around some known issues. If you’re struggling to get through some Hadoop jobs, hopefully this helps.

Is that a fact?

One of the product management interview questions I’ve been asking of late is this: “how tall was Napoleon?” The question requires a prop – a printout of search results for how tall was napoleon.

There are a couple assumptions I ask candidates to make. First, that his height is one of many that I want them to be able to algorithmically determine, so they can all be assembled into a large table of how tall every famous statesman was. Therefore, their solution should be generalized. Second, their algorithm can access anything on the search results page, anything linked to the search results page, as well as anything else they think would be useful, so long as it can be mapped back to the search results page in some way.

The most basic answers tend to involve strategies like extracting all the numbers on the search results page and averaging them or taking their mode.

Just to gather the numbers, there are lots of problems that need to be solved. If you try to grab his height off a web page, how does your algorithm know which of the many numbers on that page might actually represent his height? Once your algorithm grabs a bunch of potential heights (5’2.5, 5 foot 2½ inches, 5’2 1/2″, five-foot-two), how does it canonicalize the many representations? How do you know the subject referenced by each height you find (Napoleon I versus Napoleon II versus Napoleon III versus Napoleon Bonaparte, etc.) is actually the one you are looking for? Eventually, you might deal with things like weighing the confidence you have in each height you retrieved based on the different sites you sourced the heights from. The list goes on.
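
To give a flavor of just the canonicalization step, here is a deliberately tiny Ruby sketch (illustrative only; it ignores spelled-out forms like “five-foot-two” and metric inputs) that maps a few feet-and-inches spellings onto centimeters:

def height_to_cm(str)
  # normalize a couple of common spellings, then parse feet and inches
  s = str.downcase.gsub('½', '.5')
  if s =~ /(\d+)\s*(?:'|ft|foot|feet)\s*(\d+(?:\.\d+)?)?/
    feet   = $1.to_f
    inches = ($2 || 0).to_f
    ((feet * 12 + inches) * 2.54).round(1)
  end
end

height_to_cm("5'2.5")             # => 158.8
height_to_cm("5 foot 2½ inches")  # => 158.8

Even this toy version hints at how quickly the edge cases pile up, which is exactly the point of the question.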

But then there’s a subtlety to the problem in the case of my having chosen Napoleon: there are actually two different inch standards commonly used to report Napoleon’s height!

Napoleon is commonly considered short because he is often cited at around 5’2″. That’s 5’2″, however, measured under the old French standard for an inch (pouce). Translated to the current inch most of us are familiar with (the English standard), he was actually closer to 5’7″. Typical search results will therefore have a smattering of data that clusters around these two separate points. Averaging the numbers from both clusters to determine his height therefore will give a very wrong answer. Similarly, picking one of the two clusters by mode may well yield a very incomplete answer.

This, in my mind, is where the question can lead to a more interesting conversation.

Things I look for in a good answer:

  1. Did the candidate have adequate solutions for extracting the numbers off the page, canonicalizing, etc.?
  2. Did the candidate realize that Napoleon might be an outlier relative to the general problem of figuring out how tall any given statesman is? If so, how would their algorithm determine this?
  3. If they did realize the outlier nature, did they have a solution in mind for establishing a definitive height? Alternately, can they come up with a way to present the search for this fact to end users in order to take advantage of a broader audience of problem solvers?

The latter gets to the crux of what Factual delivers. We’re trying to weave together every fact we can (and not just from the internet!) into databases that developers can feel confident about building off of and contributing back into. That means encountering variations on the Napoleon problem over and over, and figuring out how to solve them systematically. It also means considering how the presentation of data to end users might incorporate even more helpful inputs to determine fact from fiction.  It’s a pretty interesting journey so far.

Incidentally, Napoleon himself gets some credit for enabling one of the better solutions to this problem: searching for his height in centimeters. It was during his rule that the metric system was adopted in France.


Data Testing Challenge

Testing code is hard, but testing data is even harder.

When writing code, you’d like to be able to test the “happy path”, error conditions, throughput, thread-safety, fault-tolerance, etc. Back in the real world, however, you find yourself delighted when the code you’re working with has even a single test. The silver lining is that code testing is becoming easier and easier, with tools like continuous integration, dependency injection, mocking frameworks, unit testing frameworks, Fitnesse, automatic test generators like AgitarOne, and so on.

For data, as for code, there are plenty of things to test, including consistency, accuracy, coverage, correctness, completeness, and so on. However, not only is it hard to test data using automated tests (as opposed to manual review), it’s often hard to even know where to start!

How does one test data???

Let’s say I gave you the following list of all of the Starbucks cafes in zip codes 90067 and 98134, and then asked you to tell me if the data is high quality. What would you do? How would you approach the problem?

1999 Ave. of the Stars, Century City, CA 90067, (310) 557-0785
2025 Ave. of the Stars, Century City, CA 90067, 310-228-1234
2049 Century Park East, Century City, CA 90067, (310) 286-3023
2401 Utah Avenue South, Seattle, WA, 98134, (206) 447-0828
4115 4th Avenue S, Seattle, 98134, (206)-381-1553
1926 1st Ave S., Seattle, WA, 98134, (206)-624-6045
134071 Seventh Ave NW, Seattle, WA, 98134, (206)-521-1242

Here are some of the things you might want to look for:

  1. Is the data complete? For example, the ‘state’ for the Starbucks on 4th Avenue S. is missing.
  2. Is the data comprehensive? For example, there’s a Starbucks at 1875 Century Park East, Century City, CA, but it’s not on the list.
  3. Is the data correct? For example, the Starbucks at 1926 1st Ave S. should actually be at 1962 1st Ave S.
  4. Is the data structured correctly? For example, you did find a telephone number for the Starbucks at 2401 Utah Avenue South, but it’s actually the fax number, not the phone number.
  5. If the data was cleaned/canonicalized, was it cleaned properly? For example, notice that all of the phones have slightly different formats (see the quick sketch after this list). Also, “Ave. of the Stars” should probably be “Avenue of the Stars”, and there are other small issues as well: sometimes the address contains ‘Ave’ and sometimes it contains ‘Avenue’; ‘S’, ‘S.’, and ‘South’ are used interchangeably; we encounter digits and spelled-out numbers in street names (e.g. ‘4th’ vs ‘Seventh’); and so on.
  6. Are there outliers in your data? For example, the last Starbucks in the list (which was made up) has a 6 digit street number. If street numbers are never more than 5 digits, then you know you have a problem.
  7. If you clustered data (for deduping), is it underfolded? For example, if you had Starbucks@123 First St, Los Angeles, CA, 90067 and Starbucks@123 1st Street, Los Angeles, CA 90067, they SHOULD be considered the same object.
  8. If you clustered data (for deduping), is it overfolded? For example, if you had Starbucks@123 First St, Los Angeles, CA, 90067 and Starbucks@123 First Blvd, Los Angeles, CA 90067, they SHOULD NOT be considered the same object.
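
To make item 5 concrete, here is a tiny Ruby sketch (illustrative only) that collapses the different US phone formats from the list above into one canonical shape:

# Illustrative only: normalize US phone spellings like "(310) 557-0785",
# "310-228-1234", and "(206)-381-1553" into a single format.
def normalize_us_phone(raw)
  digits = raw.gsub(/\D/, '')                      # keep digits only
  digits = digits[1..-1] if digits.length == 11 && digits.start_with?('1')
  return nil unless digits.length == 10            # not a 10-digit US number
  "(#{digits[0, 3]}) #{digits[3, 3]}-#{digits[6, 4]}"
end

normalize_us_phone("310-228-1234")    # => "(310) 228-1234"
normalize_us_phone("(206)-381-1553")  # => "(206) 381-1553"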

Not only are these attributes hard to test, but often the tests are as hard as the original problem! For example, if you can write an automated test to see if dedupe is overfolding or underfolding, then presumably you can update the deduping algorithm to use this code to make sure you’re not overfolding or underfolding. That leaves you back at square one for testing.

Hard problems offer great rewards

While testing data quality is a big challenge, it also yields great dividends. When you’re talking about millions or even billions of pieces of data, improving the quality by a few percent has a huge absolute impact.

Your turn!

Here are a few example problems that you might encounter during the Factual interview process. If you work on one of these challenges and are happy with your results, please email your solution to us at jobs@factual.com. We’d love to meet you!

A few practical notes: First, for efficiency reasons, a lot of our data-crunching code is in Java. If your answers are in Java, that would be ideal; if not, that’s okay too. Second, these problems are hard and have a tendency to suck up as much time as you allot to them. Please don’t feel like you have to go overboard and write thousands of lines of code. The goal is to play with data and have some fun, not to take a week off from your regular activities. =)

Problem 1: Outlier Detection

Given a file containing between 10,000 and 1,000,000 inputs (1 line = 1 input), write a program to emit outliers along with brief justifications of why they are outliers. You can assume the total amount of data will not exceed 500MB. For example:

Sample Input

12345
12342
23532
34539
348597
30948
30487
abcde
38947
98374
… 9990 more 5-digit numbers …

Sample Output

348597 (99.99% of inputs have length 5)
abcde (99.99% of inputs consist of only digits)
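
For a sense of what the most basic approach might look like, here is a deliberately naive Ruby sketch (far from the ideal solution described below, and assuming the input arrives on STDIN) that flags inputs whose length or character class is rare:

#!/usr/bin/env ruby
# Naive sketch: flag inputs whose length or character class covers <1% of all inputs.
def classify(s)
  return :digits  if s =~ /\A\d+\z/
  return :letters if s =~ /\A[a-z]+\z/i
  :mixed
end

lines = STDIN.readlines.map(&:chomp)
total = lines.size.to_f

length_counts = Hash.new(0)
class_counts  = Hash.new(0)
lines.each do |l|
  length_counts[l.length]   += 1
  class_counts[classify(l)] += 1
end

lines.each do |l|
  length_share = length_counts[l.length]   / total
  class_share  = class_counts[classify(l)] / total
  if length_share < 0.01
    puts "#{l} (only #{(length_share * 100).round(2)}% of inputs have length #{l.length})"
  elsif class_share < 0.01
    puts "#{l} (only #{(class_share * 100).round(2)}% of inputs are #{classify(l)})"
  end
end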

The Ideal Solution

  1. Is relatively fast and memory efficient. For 500MB of data, it’s okay to take a few seconds or a gigabyte or two of RAM, but remember, you will often encounter much more data in production!
  2. Provides helpful justifications (e.g. “99.95% of strings have length 4” instead of “looks suspicious”).
  3. Finds a significant portion of actual outliers without too many false positives.
  4. Works well across multiple domains: marathon times, UK postal codes, book ISBNs, country names, etc.
  5. Consists of well-written/well-structured code, so that it’s easy to tweak the outlier-detection algorithm 6 months from now.
  6. Half credit for unimplemented ideas. If you implement algorithm “X”, great! If you don’t want to invest more time but make a note that “Algorithm X seems like it would be promising”, we’ll happily give you partial credit.

Sample Data

It’s hard to give sample data for a problem like this without strongly biasing approaches and solutions. If you’d like some sample data, we recommend that you download a dataset of marathon times for a race, all words in the English language, etc. If your heuristics work well, then you (probably) won’t see many outliers in the downloaded data. However, if you manually add a couple of suspicious lines, they should be flagged.

Problem 2: Dedupe

You will be given a file with up to 1,000,000 inputs (< 500MB of data). There will be one input per line, and each input will consist of the following TAB-separated elements: id, address, city, region, and postal code. You can assume that the data itself will not contain tabs, so just splitting each line by ‘\t’ will be fine.

Your goal is to generate a list of pairs of elements that you think are identical. For example:

Sample Input

1 123 First Street Los Angeles California 90025
2 123 First St. West Los Angeles CA 90025
3 123 First Street Paris Texas 75462
4 123 First Street Paris France 75008
5 123 1er Street Paris France 75008

Sample Output

(1,2)
(4,5)
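
Again, purely to make the problem concrete, here is a deliberately naive Ruby sketch. It blocks on postal code and compares normalized address tokens, so it would catch pair (1,2) above but miss (4,5), and it ignores city and region entirely:

#!/usr/bin/env ruby
# Naive sketch: block on postal code, then flag pairs whose address tokens overlap heavily.
require 'set'

ABBREV = { 'st' => 'street', 'st.' => 'street', 'ave' => 'avenue', 'ave.' => 'avenue' }

def tokens(address)
  address.downcase.split.map { |w| ABBREV.fetch(w, w) }.to_set
end

# columns: id, address, city, region, postal code
records = STDIN.readlines.map { |l| l.chomp.split("\t") }

records.group_by { |r| r[4] }.each_value do |block|
  block.combination(2) do |a, b|
    ta, tb = tokens(a[1]), tokens(b[1])
    similarity = (ta & tb).size.to_f / (ta | tb).size
    puts "(#{a[0]},#{b[0]})" if similarity >= 0.75
  end
end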

The Ideal Solution

  1. Works for addresses from various countries, although even doing a good job with US data is an impressive feat.
  2. Does not explode combinatorially. If you’re comparing every possible pair of inputs, that won’t scale very far.
  3. Does not overfold often (i.e. two inputs are declared dupes, but they are not).
  4. Does not underfold often (i.e. two inputs are dupes, but they are not marked that way by the solution).
  5. Consists of well-written/well-structured code, so that it’s easy to tweak the deduping algorithm 6 months from now.
  6. Some algorithm configuration is okay, but should not be excessive.
  7. Half credit for unimplemented ideas. If you implement algorithm “X”, great! If you don’t want to invest more time but make a note that “Algorithm X seems like it would be promising”, we’ll happily give you partial credit.

Sample Data

As with outlier detection, sample data tends to bias solutions. A good test would be to find a list of locations for some chain (Starbucks, McDonald’s, etc.). Once you have this list, copy some of the inputs in the list and modify them slightly (add a typo, change the street number, reformat the phone number, etc). Your dedupe code will mark some of your copied rows as duplicates of existing inputs, while other copied rows will be correctly marked as too different from any existing row.

A Practical Guide to Varnish

Why Varnish Matters…

What is Varnish?

Varnish is an open source, high-performance HTTP accelerator that sits in front of a web stack and caches pages.  This caching layer is very configurable and can be used for both static and dynamic content.

One great thing about Varnish is that it can improve the performance of your website without requiring any code changes.  If you haven’t heard of Varnish (or have heard of it, but haven’t used it), please read on.  Adding Varnish to your stack can be completely noninvasive, but if you tweak your stack to play along with some of Varnish’s more advanced features, you’ll be able to increase performance by orders of magnitude.

Some of the high profile companies using Varnish include Twitter, Facebook, Heroku, and LinkedIn.

Our Use Case

One of Factual’s first high profile projects was Newsweek’s “America’s Best High Schools: The List”. After realizing that we had only a few weeks to increase our throughput by tenfold, we looked into a few options. We decided to go with Varnish because it was noninvasive, extremely fast and battlefield tested by other companies. The result yielded a system that performed 15 times faster and a successful launch that hit the front page of msn.com.  Varnish now plays a major role in our stack and we’re looking to implement more performance tweaks designed with Varnish in mind.

A Simple Use Case

The easiest and safest way to add Varnish to your stack is to serve and cache static content.  Aside from using a CDN, Varnish is probably the next best thing that you can use for free.  However, dynamic content is where you can squeeze real performance out of your stack if you know where and how to use it.  This guide will only scratch the surface on how Varnish can drastically improve performance.  Advanced features such as edge side includes and header manipulation allow you to leverage Varnish for even higher throughput.  Hopefully, we’ll get to more of these advanced features in future blog posts, but for now, we’ll just give you an introduction.

“Hello World”

Installation

Please follow the installation guide on Varnish’s documentation page: http://www.varnish-cache.org/docs

Assuming you’ve installed it correctly, you should be able to run both your webserver and Varnish on different ports. The rest of this guide will assume that you have your webserver running on port 8080 and Varnish running on port 80.
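
To sanity-check that layout before touching any VCL, starting Varnish by hand looks roughly like this (treat the flags and paths as a sketch; they vary by version and platform, and the packaged init scripts usually handle this for you):

varnishd -a :80 -f /etc/varnish/default.vcl -s malloc,1G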

Varnish Configuration Language: VCL

Varnish uses its own domain specific language for configuration. Unlike a lot of other projects, Varnish’s configuration language is not declarative. It’s very expressive and yet easy to follow. For Ubuntu, Varnish’s config file is located at /etc/varnish/default.vcl. A lot of the examples we’ll dive into are based on Varnish’s own documentation.

This is a simple Varnish config file that will cache all requests whose URI begins with “/stylesheets”. There are a few things to note here that we’ll explain later:

  • the removal of the Accept-Encoding header
  • the removal of Set-Cookie
  • return(lookup) and return(pass) in vcl_recv
# Defining your webserver.
backend default {
  .host = "127.0.0.1";
  .port = "8080";
}
 
# Incoming request
# can return pass or lookup (or pipe, but not used often)
sub vcl_recv {
 
  # set default backend
  set req.backend = default;
 
  # remove
  unset req.http.Accept-Encoding;
 
 
  # lookup stylesheets in the cache
  if (req.url ~ "^/stylesheets") {
    return(lookup);
  }
 
  return(pass);
}
 
# called after recv and before fetch
# allows for special hashing before cache is accessed
sub vcl_hash {
 
}
 
 
# Called after the content is fetched from the webserver
# returns pass or deliver
sub vcl_fetch {
  if (req.url ~ "^/stylesheets") {
    # removing cookie
    unset beresp.http.Set-Cookie;
 
    # Cache for 1 day
    set beresp.ttl = 1d;
    return(deliver);
  }
}
 
# called after fetch or lookup yields a hit
sub vcl_deliver {
 
}
 
#
sub vcl_error {
 
}

Now let’s look at a few things in detail:

Removing Accept-Encoding Header

We do this because Varnish doesn’t handle encodings (gzip, deflate, etc.). Instead, Varnish will defer to the webservers to do this. For now, we’re going to ignore this header and just have the webservers give us non-encoded content. The proper way to handle encodings is to have the encoding normalized, but we’ll discuss this later.

Removal of Set-Cookie

We do this because we don’t want the webserver giving us session-specific content. This is just a safeguard and probably a little unnecessary, but it’s a good thing to note when caching. We’ll discuss session-specific content later.

Returning “pass” vs “lookup”

Returning “pass” tells Varnish to not even try to do a cache lookup. Returning “lookup” tells Varnish to look up the object in its cache instead of fetching it from the webserver. If the object is cached, the webserver is never hit. If it isn’t in the cache, the content is fetched from the webserver and vcl_fetch is called on the response.

Manipulating the Hashing Function

User/Session Specific Content

Let’s say that we want to cache every user’s “/profile” page. This can be done by including the cookie in the hash function like this:

sub vcl_hash {
  if (req.url ~ "^/profile$") {
    set req.hash += req.http.cookie;
  }
}

Canonicalized URL Caching

In Ruby on Rails, it is common practice to attach a trailing timestamp to static asset URLs so that the browser re-fetches the file whenever it changes (e.g. /stylesheets/main.css?123232113). Let’s say we don’t want to include this when we cache our stylesheets. Here is an example that will remove the trailing timestamp.

sub vcl_hash {
  if (req.url ~ "^/stylesheets") {
    set req.url = regsub(req.url, "\?\d+", "");
  }
}

Browser Specific CSS

Caching browser specific content.  One trick we use is to have a small portion of our CSS be browser specific to handle various differences between browsers.  We do this by having a dynamic call that will serve up CSS based on the User-Agent header.  The problem with this technique is that we’ll have different CSS being served by the same URL.   Varnish can still cache this by adding the User-Agent header to the hash, like so:

sub vcl_hash {
  if (req.url ~ "^/stylesheets/browser_specific.css") {
    set req.hash += req.http.User-Agent;
  }
}

ACLs

Varnish has options to create ACLs that control access to certain requests:

# create ACL
acl admin {
  "localhost";
  "192.168.2.20";
}
 
sub vcl_recv {
  # protect admin urls from unauthorized ip's
  if (req.url ~ "^/admin") {
    if (client.ip ~ admin) {
      return(pass);
    } else {
      error 405 "Not allowed in admin area.";
    }
  }
}

Purging

There are times when we need to purge certain cached objects without restarting the server. Varnish allows 2 ways to purge: lookup and URL. These examples are based on the Varnish documentation page on purging: http://www.varnish-cache.org/trac/wiki/VCLExamplePurging

Purge by lookup

Purging by lookup uses the vcl_hit function and the “PURGE” HTTP method:

acl purgeable {
  "localhost";
  "192.168.2.20";
}
 
sub vcl_recv {
  if (req.request == "PURGE") {
    if (!client.ip ~ purgeable) {
      error 405 "Not allowed to purge.";
    }
    return(lookup);
  }
}
 
sub vcl_hit {
  if (req.request == "PURGE") {
    set obj.ttl = 0s;
    error 200 "Purged.";
  }
}
 
sub vcl_miss {
  if (req.request == "PURGE") {
    error 404 "Not in cache.";
  }
}

Purge by URL

Purging by url is probably a safer bet if you are using cookies or any other tricks in your hash function:

sub vcl_recv {
  if (req.request == "PURGE") {
    if (!client.ip ~ purgeable) {
      error 405 "Not allowed.";
    }
    purge("req.url == " req.url " && req.http.host == " req.http.host);
    error 200 "Purged.";
  }
}

Handling Encodings

It’s good to canonicalize your encoded requests because you could either end up with redundant cached objects, or you could end up returning incorrectly encoded objects. For more details, please refer to the Varnish FAQ on Compression. Below is a snippet from that page.

if (req.http.Accept-Encoding) {
  if (req.url ~ "\.(jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg)$") {
    # No point in compressing these
    remove req.http.Accept-Encoding;
  } elsif (req.http.Accept-Encoding ~ "gzip") {
    set req.http.Accept-Encoding = "gzip";
  } elsif (req.http.Accept-Encoding ~ "deflate" && req.http.user-agent !~ "Internet Explorer") {
    set req.http.Accept-Encoding = "deflate";
  } else {
    # unknown algorithm
    remove req.http.Accept-Encoding;
  }
}

Advanced Backends

Multiple Backends

Let’s pretend that we have a special assets server that serves up just our stylesheets. Here is an example of having multiple backends for this purpose:

backend default {
  .host = "127.0.0.1";
  .port = "8080";
}
 
backend stylesheets {
  .host = "10.0.0.10";
  .port = "80";
}
 
sub vcl_recv {
  if (req.url ~ "^/stylesheets") {
    # set stylesheets backend
    set req.backend = stylesheets;
    return(lookup);
  }
 
  # set default backend
  set req.backend = default;
  return(pass);
}

Round Robin and Random Multiple Server Backend

backend server1 {
  .host = "10.0.0.10";
}
 
backend server2{
  .host = "10.0.0.11";
}
 
director multi_servers1 round-robin {
  {
    .backend = server1;
  }
  {
    .backend = server2;
  }
}
 
director multi_servers2 random {
  {
    .backend = server1;
  }
  {
    .backend = server2;
  }
}

Varnish stays on our stack happily ever after…

When we first started using Varnish, it was out of desperation and all new to us.  Over the past year, we’ve been figuring out ways to leverage its performance in more creative ways.  At this point, we couldn’t imagine putting together a stack that didn’t include this great project.

We hope this post has been helpful for anyone interested in getting Varnish set up for the first time.  Future blog posts will include more advanced features.

How to Hack Jet Lag with 4 Simple Tools

A while back I traveled to China to visit the Factual offices in Shanghai. There was an 8-hour time difference, and from experience I knew that it can induce some serious jet lag.

Shortly before the trip to China I had read about a study at the Harvard Medical School suggesting that fasting while in transit may help us recover from jet lag. While working in Shanghai, I had the idea of combining this finding with other tricks I had picked up to fight jet lag. I put together a no-holds-barred approach to “hack” jet lag and decided to try it out on my trip home from Shanghai to Los Angeles.

By applying four simple tools I was able to greatly reduce the effects of jet lag. I touched down Saturday afternoon and was feeling back on PST by Monday morning.

When traveling east it can take about one day to adjust to each hour of time difference. (Apparently traveling west to east produces worse jet lag than east to west due to how our circadian rhythm works.) So that would be a solid week to adjust when coming home to Los Angeles from Shanghai. However, it took me only about 44 hours to be fully adjusted to Los Angeles time.

Here’s how I did it:

Tool #1: Sleep deprivation. Avoid sleeping in transit – this helps you fall asleep at the correct time in your new time zone.

I basically did not sleep while in transit. I waited until 10 PM PST to go to sleep the first night. That amounted to about 30 hours of no sleep from leaving the hotel in Shanghai to going to bed my first night home. That night I slept a deep, unbroken 8 hours. I guess the science here is pretty sound: once you’re sufficiently sleep deprived, it doesn’t matter what time zone your brain thinks it’s in — when you go down, you go down solid.

My goal then became to stay awake and active during the day and only sleep at night. Sunday, my first full day back, I struggled to avoid napping. Jet lag really kicked in after lunch and I ended up sleeping maybe 1 or 2 hours that afternoon. Ideally I wouldn’t have slept at all, but it seems those few hours didn’t do any real harm. After Sunday, I didn’t have any more problems staying awake during the day.

Tool #2: Natural Sunlight. Expose yourself to the natural sunlight of your new time zone.

About four hours before touching down in LA, I realized the sun was “rising” outside the plane. I opened my visor all the way and set my forehead against the window for about an hour, eyes wide open. It was roughly 10 AM PST. The idea here was to give my body a strong and natural signal that I was in a new time zone. I’ve read that the sunrise and sunset can be the most effective times to do this. Of course, this required me to have a window seat on the plane. I also lucked out with the timing – I hadn’t planned to be able to get the “sunrise” this way.

Both Saturday and Sunday I went for afternoon walks and exposed my body to the sun (which by the way felt GREAT). I especially made sure to experience the sunset on both days by standing outside for about 15 minutes in direct view of the sun going down.

The basic idea of getting natural sunlight in your new time zone seems pretty straightforward. However, later research turned up some subtle nuances:

“If you travel on an east-bound flight:
- across <6 time zones, light in the morning is advised.
- across 6 to 10 time zones, avoid light in the morning and get light at midday.

If you travel on a west-bound flight:
- across <6 time zones, light in the late afternoon is advised.
- across 6 to 10 time zones, avoid light in the afternoon and get light at midday.”

Tool #3: Drugs. Use drugs to help you sleep and help you wake up.

(WARNING: I am not a doctor. You should always consult your personal physician before taking drugs.)

I’ve always thought the most brutal aspect of jet lag is the sheer inability to fall asleep at night. Other than the self-induced sleep deprivation while in transit, my chosen weapons against insomnia were two specific drugs: sublingual melatonin and Actifed.

I consider melatonin the mild natural sleep aid, and Actifed the jackhammer. (I guess it’s the antihistamine properties of a drug like Actifed that make you very drowsy. Admittedly there may be better solutions. I don’t have any experience with “sleeping pills”, Valium, or the like.)

Before going to sleep Saturday night, I took a maximum dose of melatonin. Perhaps I didn’t need it in order to sleep well that night — considering that I hadn’t slept in 30 hours. But melatonin is used naturally by our bodies to help regulate sleep, so my idea was that it could help reset my internal clock.

I also took melatonin before going to sleep Sunday night. However, I woke up about 3:30 AM. I gave myself 15 minutes to fall back to sleep naturally. When that didn’t work, I popped an Actifed. I think I was back to sleep by around 4:30 AM. This was the only time I had trouble sleeping at the right time.

I woke up a little late and a little tired Monday morning. My chosen drug to help me wake up is simple caffeine, and so I had two cups of hot black tea upon rising. I made it to work about 10 AM feeling mostly natural.

Tool #4: Fasting. Avoid eating any food while in transit, and then eat at the correct time in your new time zone.

This was the idea that inspired me to take a no-holds-barred approach to hacking jet lag in the first place. I did not eat a single thing in transit. That amounted to about 20 hours of no food. My first meal back home was a late lunch upon arriving in Los Angeles. From then on, I simply ate normally and at normal times.

As I understand it, this is the core of the lesson from the study done by the Harvard Medical School. The idea is that this period of fasting, followed by eating at the “correct” time, is a trigger that signals your body to reset its internal clock to the new time zone. The study suggests that fasting 16 hours is enough to activate the trigger.

Conclusion

This was not a scientific experiment. There was no “control”, nor can I tell you how much one tool paid off versus the other tools.

What I can tell you is that I made an 8-hour time shift in the more difficult direction and was free of jet lag within about 44 hours. I touched down Saturday afternoon and was able to hit work Monday morning feeling back to normal. I accomplished this with just the four simple tools described above (and lots of love, encouragement, and support from my understanding wife).

The overall approach may sound misery inducing, but I never felt miserable. Most of the time I actually felt fine.

For one thing, I’ve never enjoyed airplane food, so fasting was not a hardship. I’ve also never been entirely comfortable sleeping in-flight, so most of the sleep deprivation wasn’t an issue for me either. As for the drugs and the sunlight/sunset, those are just no-brainers as far as I’m concerned. The hardest part was staying awake Saturday until my correct bedtime the first night back home — but even that was not too bad (my wife helped by standing guard).

But if you aren’t able (or inclined) to use all four of these tools, it’s possible that the ones you are willing to use can make a significant improvement for you the next time you have to adjust to a new time zone. And of course, there’s a wealth of additional research and advice available online for dealing with jet lag — the Wikipedia entry is a fine start.

I hope this might help you hack jet lag sometime!

Things have a funny way of working out this way. A couple features were pushed back from a previous release and some last minute improvements were thrown in, and suddenly we found ourselves dragging out a lot more fresh code in our release than usual. All this the night before one of our heavy API users was launching something of their own. They were expecting to hit us thousands of times a second and most of their calls touched some piece of code that hadn’t been tested in the wild. Ordinarily, we would soft launch and put the system through its paces. But now we had no time for that. We really wanted to hammer the entire stack, yesterday, and so we couldn’t rely on internal compute resources.

Typically, people turn to a service for this sort of thing but for the load we wanted, they charge many hundreds of dollars. It was also short notice and after hours, so there was going to be some sort of premium to boot.

At first, I thought we should use something like JMeter from some EC2 machines. However, we didn’t have anything set up for that. What were we going to do with the stats? How would we synchronize starting and stopping? It just seemed like this path was going to take a while.

I wanted to go to bed. Soon.

We routinely run Hadoop jobs on EC2, so everything was already baked to launch a bunch of machines and run jobs. Initially, it seemed like a silly idea, but when the hammer was sitting right there, I saw nails. I Googled it to see if anyone had tried. Of course not. Why would they? And if they did, why admit it? Perhaps nobody else found themselves on the far right side of the Sleep vs Sensitivity-to-Shame scale.

Sleep vs Sensitivity to Shame

So, it was settled — Hadoop it is! I asked my colleagues to assemble the list of test URLs. Along with some static stuff (no big deal) and a couple dozen basic pages, we had a broad mixture of API requests and AJAX calls we needed to simulate. For the AJAX stuff, we simply grabbed URLs from the Firebug console. Everything else was already in tests or right on the surface, so we’d have our new test set in less than an hour. I figured a few dozen lines of ruby code using Hadoop streaming could probably do what I had in mind.

I’ve read quite a few post-mortems that start off like this, but read on, it turns out alright.

Hadoop Streaming

Hadoop Streaming is a utility that ships with Hadoop that works with pretty much any language. Any executable that reads from stdin and writes to stdout can be a mapper or reducer. By default, they read line-by-line with the bytes up to the first tab character representing the key and any remainder becomes the value. It’s a great resource and bridges the power of Hadoop with the ease of quick scripts. We use it a lot when we need to scale out otherwise simple jobs.

Designing Our Job

We had just two basic requirements for our job:

  1. Visit URLs quickly
  2. Produce response time stats

Input File

The only input in this project is a list of URLs — only keys and no values. The job would have to run millions of URLs through the process to sustain the desired load for the desired time but we only had hundreds of calls that needed testing. First, we wanted to skew the URL frequency towards the most common calls. To do that, we just put those URLs in multiple times. Then we wanted to shuffle them for better distribution. Finally, we just needed lots of copies.

for i in {1..10000};  do sort -R < sample_urls_list.txt; done > full_urls_list.txt

Mapper

The mapper was going to do most of the work for us. It needed to fetch URLs as quickly as possible and record the elapsed time for each request. Hadoop processes definitely have overhead and even though each EC2 instance could likely be fetching hundreds of URLs at once, it couldn’t possibly run hundreds of mappers. To get past this issue, we had two options: 1) just launch more machines and under-utilize them or 2) fetch lots of URLs concurrently with each mapper. We’re trying not to needlessly waste money, so #1 is out.

I had used the curb gem (libcurl bindings for ruby) on several other projects and it worked really well. It turns out that it was going to be especially helpful here since it has a Multi class which can run concurrent requests each with blocks that function essentially as callbacks. With a little hackery, it could be turned into a poor/lazy/sleep-deprived man’s thread pool.

The main loop:

@curler.perform do
  flush_buffer!
  STDIN.take(@concurrent_requests-@curler.requests.size).each do |url|
    add_url(url.chomp)
  end
end

Blocks for success and failure:

curl.on_success do
  if errorlike_content?(curl.body_str)
    log_error(curl.url,'errorlike content')
  else
    @response_buffer<<[curl.url,Time.now-start_time]
  end
end
curl.on_failure do |curl_obj,args|
  error_type = args.first.to_s if args.is_a?(Array)
  log_error(curl.url,error_type)
end

As you can see, each completion calls a block that outputs the URL and the number of seconds it took to process the request. A little healthy paranoia about thread safety resulted in the extra steps of buffering and flushing output — this would ensure we don’t interleave output coming from multiple callbacks.

If there is an error, the mapper will just emit the URL without an elapsed time as a hint to the stats aggregator. In addition, it uses the ruby “warn” method to emit a line to stderr. This increments a built-in Hadoop counter mechanism that watches stderr for messages in the following format:

reporter:counter:[group],[counter],[amount]

in this case the line is:

reporter:counter:error,[errortype],1

This is a handy way to report what’s happening while a job is in progress and is surfaced through the standard Hadoop job web interface.

Mapper-Only Implementation

The project could actually be done here if all we wanted was raw data to analyze via some stats software or a database. One could simply cat together the part files from HDFS and start crunching.

In this case, the whole job would look like this:

hadoop  jar $HADOOP_HOME/hadoop-streaming.jar                                  \
-D mapred.map.tasks=100                                                        \
-D mapred.reduce.tasks=0                                                       \
-D mapred.job.name="Stress Test - Mapper Only"                                 \
-D mapred.speculative.execution=false                                          \
-input  "/mnt/hadoop/urls.txt"                                                 \
-output "/mnt/hadoop/stress_output"                                            \
-mapper "$MY_APP_PATH/samples/stress/get_urls.rb 100"

and then when it finishes:

hadoop dfs -cat /mnt/hadoop/stress_output/part* > my_combined_data.txt

Reducer

In our case, I wanted to use the reducer to compute the stats as part of the job. In Hadoop Streaming, a reducer can expect to receive lines through stdin, sorted by key, and can expect that the same key will not find its way to multiple reducers. This eliminates the requirement that the reducer code track state for more than one key at a time — it can simply do whatever it does to values associated with a key (e.g. aggregate) and move on when a new key arrives. This is a good time to mention the aggregate package, which could be used as the reducer to accumulate stats. In our case, I wanted to control my mapper output as well as retain the flexibility to not run a reducer altogether and just get raw data.

The streaming job command looks like this:

hadoop  jar $HADOOP_HOME/hadoop-streaming.jar                                  \
-D mapred.map.tasks=100                                                        \
-D mapred.reduce.tasks=8                                                       \
-D mapred.job.name="Stress Test - Full"                                        \
-D mapred.speculative.execution=false                                          \
-input  "/mnt/hadoop/urls.txt"                                                 \
-output "/mnt/hadoop/stress_output"                                            \
-mapper "$MY_APP_PATH/samples/stress/get_urls.rb 100"                          \
-reducer "$MY_APP_PATH/samples/stress/stats_aggregator.rb --reducer"

For each key (URL) and value (elapsed time), the reducer updates the following running stats in the loop shown below:

  1. sum – total time elapsed for all requests
  2. min – fastest response
  3. max – slowest response
  4. count – number of requests processed
key,val = l.chomp.split("\t",2)
if last_key.nil? || last_key!=key
  write_stats(curr_rec) unless last_key.nil?
  curr_rec = STATS_TEMPLATE.dup.update(:key=>key)
  last_key=key
end
 
if val && val!=''
  val=val.to_f
  curr_rec[:sum]+=val
  curr_rec[:count]+=1
  curr_rec[:min]=val if curr_rec[:min].nil? || val<curr_rec[:min]
  curr_rec[:max]=val if curr_rec[:max].nil? || val>curr_rec[:max]
else
  curr_rec[:errors]+=1
end

Finally, as we flush, we compute the overall average for the key.

def write_stats(stats_hash)
  stats_hash[:average]=stats_hash[:sum]/stats_hash[:count] if stats_hash[:count]>0
  puts stats_hash.values_at(*STATS_TEMPLATE.keys).join("\t")
end

Final Stats (optional)

When the job completes, it produces as many part files as there are total reducers. Before this data can be loaded into, say, a spreadsheet, it needs to be merged and converted into a friendly format. A few more lines of code get us a csv file that can easily be dropped into your favorite spreadsheet/charting software:

hadoop dfs -cat /mnt/hadoop/stress_output/part* | $MY_APP_PATH/samples/stress/stats_aggregator.rb --csv > final_stats.csv

Our CSV converter looks like this:

class CSVConverter
  def print_stats
    puts STATS_TEMPLATE.keys.to_csv
    STDIN.each_line do |l|
      puts l.chomp.split("\t").to_csv
    end
  end
end

Source Code

The mapper, get_urls.rb:

#!/usr/bin/env ruby1.9
require 'rubygems'
require 'curb'
 
class MultiCurler
  DEFAULT_CONCURRENT_REQUESTS = 100
  def initialize(opts={})
    @concurrent_requests = opts[:concurrent_requests] || DEFAULT_CONCURRENT_REQUESTS
    @curler = Curl::Multi.new
    @response_buffer=[]
  end
  def start
    while !STDIN.eof?
      STDIN.take(@concurrent_requests).each do |url|
        add_url(url.chomp)
      end
      run
    end
  end
 
  private
 
  def run
    @curler.perform do
      flush_buffer!
      STDIN.take(@concurrent_requests-@curler.requests.size).each do |url|
        add_url(url.chomp)
      end
    end
    flush_buffer!
  end
 
  def flush_buffer!
    while output = @response_buffer.pop
      puts output.join("\t")
    end
  end
 
  def add_url(u)
 
    #skip really obvious input errors
    return log_error(u,'missing url') if u.nil?
    return log_error(u,'invalid url') unless u=~/^http:\/\//i
 
    c = Curl::Easy.new(u) do |curl|
      start_time = Time.now
      curl.follow_location = true
      curl.enable_cookies=true
      curl.on_success do
        if errorlike_content?(curl.body_str)
          log_error(curl.url,'errorlike content')
        else
          @response_buffer<<[curl.url,Time.now-start_time]
        end
      end
      curl.on_failure do |curl_obj,args|
        error_type = args.first.to_s if args.is_a?(Array)
        log_error(curl.url,error_type)
      end
    end
    @curler.add(c)
  end
 
  def errorlike_content?(page_body)
    page_body.nil? || page_body=='' || page_body=~/(unexpected error|something went wrong|Api::Error)/i
  end
 
  def log_error(url,error_type)
    @response_buffer<<[url,nil]
    warn "reporter:counter:error,#{error_type||'unknown'},1"
  end
 
end
 
 
concurrent_requests = ARGV.first ? ARGV.first.to_i : nil
 
runner=MultiCurler.new(:concurrent_requests=>concurrent_requests)
runner.start

The reducer and postprocessing script, stats_aggregator.rb:

#!/usr/bin/env ruby1.9
require 'rubygems'
require 'csv'
 
 
module Stats
  STATS_TEMPLATE={:key=>nil,:sum=>0,:average=>nil,:max=>nil,:min=>nil,:errors=>0,:count=>0}
 
  class Reducer
    def run
 
      last_key=nil
      curr_rec=nil
 
      STDIN.each_line do |l|
        key,val = l.chomp.split("\t",2)
        if last_key.nil? || last_key!=key
          write_stats(curr_rec) unless last_key.nil?
          curr_rec = STATS_TEMPLATE.dup.update(:key=>key)
          last_key=key
        end
 
        if val && val!=''
          val=val.to_f
          curr_rec[:sum]+=val
          curr_rec[:count]+=1
          curr_rec[:min]=val if curr_rec[:min].nil? || val<curr_rec[:min]
          curr_rec[:max]=val if curr_rec[:max].nil? || val>curr_rec[:max]
        else
          curr_rec[:errors]+=1
        end
      end
      write_stats(curr_rec) if curr_rec
    end
 
    private
 
    def write_stats(stats_hash)
      stats_hash[:average]=stats_hash[:sum]/stats_hash[:count] if stats_hash[:count]>0
      puts stats_hash.values_at(*STATS_TEMPLATE.keys).join("\t")
    end
  end
 
 
  class CSVConverter
    def print_stats
      puts STATS_TEMPLATE.keys.to_csv
      STDIN.each_line do |l|
        puts l.chomp.split("\t").to_csv
      end
    end
  end
 
 
end
 
 
mode = ARGV.shift
 
case mode
  when '--reducer' #hadoop mode
    Stats::Reducer.new.run
  when '--csv' #csv converter; run with: hadoop dfs -cat /mnt/hadoop/stress_output/part* | stats_aggregator.rb --csv
    Stats::CSVConverter.new.print_stats
  else
    abort 'Invalid mode specified for stats aggregator.  Valid options are --reducer, --csv'
end

Reckoning (Shame Computation)

In a moment of desperation, we used Hadoop to solve a problem for which it is only tenuously appropriate, and it actually turned out great. I wrote very little code, it worked with almost no iteration, and a couple of up-front hours bought us an easily repeatable process. Our last-minute stress test exposed a few issues that we were able to quickly correct, resulting in a smooth launch. All this, and it only cost us about $10 of EC2 time.

Hadoop Streaming is a powerful tool that every Hadoop shop, even the pure Java shops, should consider part of their toolbox. Lots of big data jobs are actually simple except for scale, so your local Python, Ruby, Bash, Perl, or whatever coder, along with some EC2 dollars, can give you access to some pretty powerful stuff.

We do a lot of Java work at Factual. Most of our back-end data store is written in Java, and we use a ton of Java libraries and frameworks. We recently began experimenting with Clojure in certain parts of our data store. Clojure is a Lisp implementation that runs on the JVM and offers excellent Java interoperability.

Because our data store already had a custom Lisp-like query engine, using Clojure was a natural experiment. In certain situations we found it to be more expressive and more productive than what we’d have to do in Java to get the same work done.

But why should anyone be interested in a Lisp? Aren’t there too many parentheses? And isn’t it just… weird?

Some consider Lisps to be the most powerful programming languages created so far. You don’t have to agree, but one thing is hard to deny: Lisps in general, and Clojure specifically, offer some interesting ideas and compelling features. Working with a Lisp will inspire you to think differently about programming.

This post kicks off an ongoing series of discussions about Clojure. We’ll look at Clojure from the perspective of a Java developer who may not be very familiar with Lisp, but would like to be more productive and is open to thinking about programming in a new way.

Clojure uses “prefix notation”… but don’t be afraid!

As a Java programmer, one of the things that might immediately weird you out about Clojure is that it uses “prefix notation”, meaning the first thing you see in a line of code is the name of the function being called. So, instead of…

myObj.doSomething(anArg);

… you’ll see…

(do-something my-obj an-arg)

A little more formally, the syntax looks like this:

(fn-name arg1 arg2 arg3 ...)

The Clojure way to print out a String shouldn’t look strange to you…

(println "Hello Clojure!")

But maybe the way to add two numbers in Clojure will seem odd to you:

(+ 2 3)

Lisps, including Clojure, hold consistently to this prefix notation. This might seem strange at first, and it may take some getting used to. But believe it or not, there is some power and beauty in this way of doing things. As a small example, consider doing a logical AND across 5 variables. In Java:

... (boolA && boolB && boolC && boolD && boolE) ...

In Clojure:

... (and boolA boolB boolC boolD boolE) ...

By putting the function at the head, the language can support variable numbers of arguments in a cleaner fashion, as the quick examples below illustrate. Obviously, this example is not earth-shattering, but even a very simple case shows how prefix notation can make for cleaner code.
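A few REPL-style illustrations (arbitrary sample values); many core Clojure functions and operators are naturally variadic:

(+ 1 2 3 4 5)              ; => 15
(and true true false true) ; => false
(< 1 2 3 4)                ; => true, even comparisons take any number of arguments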

Clojure is functional

Lisps do a great job of allowing you to treat functions as first class objects. This means you can easily do useful things with functions, like passing them into other functions, returning them from other functions, and composing them together. Let’s first look at simple function definition syntax in Clojure, then move on to treating functions as first class objects…

Defining named functions

Here’s an example of a simple function definition in Clojure:

(defn do-something []
  (println "hello!"))

The above function takes no arguments. Here is a function that takes 3 arguments and prints out each one in succession:

(defn do-something [arg1 arg2 arg3]
  (println arg1)
  (println arg2)
  (println arg3))

When you call a function, its return value is the last expression evaluated. So in the above example, the return value of do-something will be the return value of the third call to println. The println function prints its arguments to standard out, then returns nil. Therefore, do-something as written will always return nil. However, we could easily upgrade do-something to return something else if we wanted to, such as a String:

(defn do-something [arg1 arg2 arg3]
  (println arg1)
  (println arg2)
  (println arg3)
  "my returned String")

Defining anonymous functions

Many programming languages support the concept of anonymous functions (also called “closures” or “code blocks” in certain situations). These can be convenient and make our code more concise. If we’re only going to use the function once, for example, there’s no need to give it a name… just define it in the same place where you use it, don’t name it, and therefore it’s “anonymous”.

In Java, we have a kind of “poor man’s” version of anonymous functions and closures, where we use anonymous inner classes.

As an example, imagine you have a collection of ugly Strings and you need to transform them into pretty Strings. You could write the imperative code to do this “by hand”, but Google’s Guava library offers an alternative.

There’s a transform method in their Collections2 class that takes a Collection and a Function. It returns a new Collection where each item is the result of applying the Function to the corresponding item in the original Collection. So the Function lets you define exactly how you want to transform each item.

You could create a new Java class that implements Function and pass that to transform. But why bother creating a new named class, if you’re only going to use it once? So instead, you can define the Function anonymously like this:

import com.google.common.base.Function;
...
  // an anonymous function to take an
  // ugly String and make it pretty:
  new Function<String, String>(){
    @Override
    public String apply(String ugly) {
      return ugly.replaceAll("ugly", "pretty");
    }
  }...

Imagine that the ellipses (…) represent the surrounding code that passes this anonymous function into Collections2.transform(). Since the Function implementation is defined and used exactly once, there’s no real need to give it a name. (We’ll look at the complete transformation implementation a little later.)

Clojure has excellent support for anonymous functions and closures. Here’s an example of creating an anonymous function in Clojure, which takes one argument and prints it out:

(fn [an-arg] (println "my arg is" an-arg))

Here’s the Clojure equivalent of the previous Java code to prettify Strings:

...
  (fn [ugly] (re-gsub #"ugly" "pretty" ugly))...

re-gsub takes a regular expression, a replacement, and a String. It replaces all matches in the String with the specified replacement — effectively doing what replaceAll did for us in the Java code. Note that the regular expression is written as #"ugly" — Clojure’s way of compiling a regex.
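For instance, a quick REPL check with an illustrative input string (re-gsub came from the contrib str-utils library of this era; on newer Clojure versions, clojure.string/replace does the same job):

(re-gsub #"ugly" "pretty" "some ugly strings")
=> "some pretty strings"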

Clojure also has a less verbose version of defining anonymous functions that doesn’t require named arguments:

#(re-gsub #"ugly" "pretty" %)

The above is equivalent to the previous version using the explicit (fn [...]) form. The difference here is that we didn’t need to give the ugly argument a name. Instead, we can use the % notation, which means “the one and only argument”.

This may look cryptic at first, but don’t worry. It’s easy enough to break this apart and see exactly what’s happening. First off, the #(...) syntax is Clojure shorthand for defining an anonymous function. The code contained in the “…” section is the implementation of the anonymous function. In this case, our implementation just applies re-gsub to the incoming argument. We define our regex, which in this case is just anything that is “ugly”. Our replacement is “pretty”, and then we apply that to the one and only incoming argument, signified with the %.

What would you do if you want an anonymous function that takes multiple arguments, but you don’t want to name the arguments? Clojure lets you use the convenient % notation along with argument indices, like this:

#(println "my three args: " %1 %2 %3)

The above anonymous function expects 3 arguments, and prints them out on one line.

Consider that doing this with Guava’s Function interface would be somewhat awkward — you’d need your function implementation to take some kind of Collection, array, or other first class object that grouped the arguments. Then your function would need to unpack those arguments explicitly.

The ability to treat functions as first class objects in this way offers power and flexibility. One huge win is the ability to pass functions into other functions and combine them, which is what “composability” is all about. Let’s take a look at three common functions provided by Clojure that work very well when composed with other functions…

Three Lispy workhorses: map, reduce, and filter

One area where Lisps are particularly strong is when you need to work with collections of things. This is good, because as programmers we are constantly working with collections of things – iterating over them, transforming them, summarizing them to a single result, and so on.

Three workhorses that Lisps use to deal with collections of things are the functions map, reduce, and filter. (You might recognize “map” and “reduce” from Google’s seminal work on distributed batch processing, and you’d be right to make the connection. Their “map” and “reduce” idiom is a nod to these classic Lispy workhorses that have been with us for decades.)

Introducing map

It’s very common to have a collection of things, and need to transform it into a collection of different things. Clojure’s map function applies a given function to each element in a collection and returns a new collection with the results:

(map f coll)

The first argument f to map is the function we want applied to each element in the original collection. The second argument is our original collection that we want transformed. (Note: the map function is not related to hash maps or dictionaries!)

A simple example using map:

; Get the squares of some integers...
(map #(* % %) [0 1 2 3 4 5 6])
=> (0 1 4 9 16 25 36)

Let’s break down this example, working from the inside out. Our second and last argument to map is the collection we want to transform — a handful of integers.

The first argument to map is the function we want applied across our collection of integers. In this case that’s an anonymous function that takes a value and produces the square of that value. So we end up with a new collection: the squares of the items from the original collection.

Imperative pseudo-code to illustrate what map is doing:

// inputs are f and coll
new_coll = new Collection
for (each item i in coll) {
  new_coll.add( f(i) )
}
return new_coll

Transforming a collection: Java vs. Clojure

Say you’re given a list of ugly Strings and you need to turn it into a list of pretty Strings. In Java, assuming you’re a fan of Google’s nifty Guava utilities for collections, you could do this:

private Collection<String> prettyStrings(Collection<String> uglyStrings) {
  return Collections2.transform(uglyStrings, new Function<String, String>(){
    @Override
    public String apply(String ugly) {
      return ugly.replaceAll("ugly", "pretty");
    }});
}

Using what we’ve learned so far, here’s how you could do the same thing in Clojure:

(defn pretty-strings [ugly-strs]
  (map #(re-gsub #"ugly" "pretty" %) ugly-strs))
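And calling it, just to sanity-check the behavior (with made-up input strings):

(pretty-strings ["so ugly" "very ugly"])
=> ("so pretty" "very pretty")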

Introducing reduce

We often need to take a collection and summarize it down to just one result. For example, suppose you have a collection of numbers and you want to get the grand total of all the numbers. In Java we might do this:

private long mySum(Collection<Integer> nums) {
  long sum = 0;
  for(int num : nums) {
    sum += num;
  }
  return sum;
}

In Clojure, we’d use the reduce function, and do this:

(defn my-sum [nums] (reduce + nums))

And actually, since the Clojure code to do the summing is so short, you probably wouldn’t even define it as a named function. Instead, when you need to do it, you’d just write:

(reduce + nums)

That’s so concise, it seems like cheating. In a way it is cheating. Lispy, functional languages like Clojure natively support this very common concept. This often means that common tasks can be done very concisely in Clojure, compared to an imperative language like Java.

There are two versions of reduce; the simple version used above is like this:

(reduce f coll)

This simple version of reduce applies a function f to an accumulator value and each element of the collection coll. The function f is expected to take 2 arguments. reduce supplies these arguments on each iteration: it plugs in the accumulator (“the running total”) as the first argument, and it plugs in the next item in the iteration as the second argument. The accumulator starts with the first value of the collection, and the iteration starts on the next item.

A more flexible version of reduce allows you to specify the initial accumulator value, val:

(reduce f val coll)

The accumulator value could be anything you design it to be, such as a number, another collection, or any arbitrary data structure.
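A few illustrative REPL examples of both arities (the values here are arbitrary):

(reduce + [1 2 3 4])      ; => 10   accumulator starts at 1, then 1+2, 3+3, 6+4
(reduce + 100 [1 2 3 4])  ; => 110  accumulator starts at the supplied val, 100
(reduce conj [] '(1 2 3)) ; => [1 2 3]   the accumulator can be a collection, too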

Another way to understand reduce is to consider this imperative pseudo code:

iterator = coll.iterator()
if (val is specified?) {
  accum = val
} else {
  accum = iterator.next()
}
 
while(iterator.hasNext()) {
  accum = f(accum, iterator.next())
}
 
return accum

In the Clojure summation example, our function f is simply the + function. (Note that the + function can naturally take two arguments.) We didn’t make use of the optional val argument to reduce. However, it’s worthwhile to note that the optional val adds important flexibility. It can be very handy to supply an explicit custom starting value, e.g. a pre-populated collection or an empty hash-map.

A simple histogram example using reduce

Let’s use reduce to build a word histogram. We’ll define a function called counts that takes a collection of words and builds a hash-map where each key is a unique word, and each value is the count of how many times that word appears in the collection:

(defn counts [words]
  (reduce
    (fn [map-so-far word]
      (assoc map-so-far word (inc (get map-so-far word 0))))
    {}
    words))

Our call to reduce runs across words, and its initial accumulator is an empty hash-map (the {}). The core of the work is done by our first argument to reduce. It’s this anonymous function…

(fn [map-so-far word]
  (assoc map-so-far word (inc (get map-so-far word 0))))

…which contains the logic to populate the hash-map. As reduce runs over a collection and passes in arguments to f, the first argument is the “running total” (in our case, map-so-far), and the second argument is the next item in the collection (in our case, word). assoc is the Clojure way to take a hash-map and assign a value to a key in that hash-map (like Java’s Map.put()). So (assoc map-so-far word ...) is saying, “in the current hash-map of counts, associate this word with X”. In our case, X is obtained by the call to inc.

The inc call is what implements the word counting logic. Here, get is like Java’s Map.get, except you can supply an optional argument to indicate what to return should the requested key not exist. So our call to get says, “return the associated value of this word in the current hash-map, or 0 if the word isn’t in the hash-map”. Our inc call simply increments the value returned from get. Strung together, we get exactly the logic we need to keep running word counts.
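A couple of tiny examples of assoc and get in isolation may help (illustrative values):

(assoc {} "the" 1)       ; => {"the" 1}
(get {"the" 1} "the" 0)  ; => 1
(get {"the" 1} "cat" 0)  ; => 0   the default kicks in when the key is missing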

Now, if we wanted to we could shorten this up by using the % notation to avoid needing to name our arguments to the anonymous function:

(defn counts [words]
  (reduce
    #(assoc %1 %2 (inc (get %1 %2 0)))
    {}
    words))
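Either version behaves the same way; for example (the words are arbitrary, and map ordering may differ):

(counts ["the" "cat" "sat" "on" "the" "mat"])
=> {"the" 2, "cat" 1, "sat" 1, "on" 1, "mat" 1}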

I double-dog dare you to write the Java code that does this, and compare.

Introducing filter

We don’t always want to use every item in a collection. Sometimes we wish to skip certain items that don’t meet our criteria. Clojure’s filter function helps with this.

(filter pred coll)

The first argument to filter is pred, which must be a function that takes one argument. pred is expected to have logic that answers the question, “should this item be kept?”. The second and last argument to filter is the original collection you want to filter based on pred. For example, say you have some numbers but you only want the positive ones. Here’s some Clojure for that:

(filter #(> % 0) [8 -3 4 2 -1 -9 -12 7])

The first argument to filter, our pred, is an anonymous function that asks, “is this item % larger than zero?”. The result from running this filter is exactly what we’d expect:

=> (8 4 2 7)

Now, in the case of finding positive numbers, Clojure has the pos? function built-in. So we could have also done:

(filter pos? [8 -3 4 2 -1 -9 -12 7])

And of course you’re free to use custom predicates as well, and define whatever arbitrary filter logic you want:

(defn my-pred? [num]
  (> num 6))

Here we’ve defined a custom predicate my-pred? which only returns true when its argument is greater than 6. So:

(filter my-pred? [8 -3 4 2 -1 -9 -12 7])
=> (8 7)

Thinking in Clojure with functional composition

When you have a collection of things and need to turn that into a collection of different things, consider using the map function. The above examples are somewhat simple, but bear in mind that you can make the function you pass into map be as complex as you need it to be.

Clojure offers powerful functions to customize your iteration over the collection, like skipping elements based on arbitrary filter logic; taking only each nth element; partitioning the elements into groups; and so on.
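For example, take-nth and partition are two such built-ins (shown here with arbitrary sample data):

(take-nth 2 [0 1 2 3 4 5 6]) ; => (0 2 4 6)          every 2nd element
(partition 3 [1 2 3 4 5 6])  ; => ((1 2 3) (4 5 6))  groups of 3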

When you have a collection of things and need to calculate a single result from that collection, consider using the reduce function. Remember that you can make the function you pass into reduce be as complex as you need it to be.

When you have a collection but only need some of the items, filter is probably the first thing you need to apply to the collection before doing other things.

With map, reduce, and filter, you can begin to see some of the power of function composition. Each of these three functions is designed to take another function as an argument and use it in its work. The passed-in function could be a built-in Clojure function, a function provided by someone else’s library, or a custom function you’ve written yourself.

In Java, we tend to think imperatively. We like to design our code in terms of setting up variables, looping through collections to yank out items, changing the state of our variables, and so on. In a Lispy language like Clojure, we want to think more functionally. That is, we want to minimize side-effects and strive for the expressiveness of cleanly applying functions to things.

Imagine we want to add up the squares of all the odd numbers in a collection. In Java, we might do something like this:

long sum = 0;
for(int i : coll) {
  if(i % 2 != 0) {
    sum += i * i;
  }
}

In Clojure, we can compose three of our favorite Lispy workhorses to do this a little more expressively:

(reduce + (map #(* % %) (filter odd? coll)))

The best way to read the Java code is probably top to bottom. To best read the Clojure code, try working from the inside out. The first thing we do is get only the odd numbers in coll, by applying the call to filter. (Note that Clojure has the odd? function built in.) Next we use map to transform our collection of odd numbers into a new collection of the squares of those odd numbers. Finally, we run a reduce over the squares, using + so that our final result is their summation.
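If you want to see the intermediate steps, here is the same pipeline broken apart with a small sample collection (illustrative values):

(filter odd? [1 2 3 4 5])                            ; => (1 3 5)
(map #(* % %) '(1 3 5))                              ; => (1 9 25)
(reduce + '(1 9 25))                                 ; => 35
(reduce + (map #(* % %) (filter odd? [1 2 3 4 5])))  ; => 35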

Hopefully you can see it’s a different way of thinking! To write the Clojure code, we didn’t think in terms of, “how do I define a variable and then mutate that variable?” Instead, we asked, “which functions can we compose together, apply to our collection and get to where we’re going?”

Lispy languages like Clojure heavily leverage functional composition. map, reduce, and filter are just the tip of the iceberg. Clojure has a wealth of powerful and versatile functions.

Stay tuned…

This post aimed to gently introduce you to some basic Clojure concepts, starting from a Java perspective. In future posts we’ll further explore the power and flexibility offered by Clojure. Please stay tuned…!