Blog

Scaling up with Hadoop, EMR & mrjob

By Mark Simpson on 14th July 2015 | Development

In a previous post, we talked about creating 3D terrain resources. While terrain is arguably the single most important feature in a map, our mapping platform also provides buildings, transport networks (road, rail, tram), trees, place names and more. Given the number of features that make up a complete ‘slice’ of the map, there’s a lot of work to be done in creating it!

slice_half_res_transparent

When building a small geographical area like a town or small city, a single computer provides enough horsepower to get the job done in a timely fashion. The problem is, we don’t just build cities. We build the world, often a country or even a continent at a time.

As an example of what we’re up against: say we wanted to create the map tiles for Canada. A simple approach would be to take a handful of dedicated computers, line up the work and wait. Unfortunately, you’d be waiting for a very long time indeed, as Canada’s land mass sprawls over 9.98 million km²! We often need to get things done quickly, so it’s not feasible to play the waiting game.

How can we make this scale?

big_pbaws_logo_300pxhadoop-logo mrjob
AWS & Hadoop mrjob

We’re lucky enough to be operating in the cloud era so, rather than maintaining our network of computers, we harness Amazon’s Elastic Mapreduce (EMR) and Simple Storage Service (S3). EMR is an Apache Hadoop-based distributed computing service. Tasks that can be structured as MapReduce jobs can achieve close to linear scaling. At the simplest level, it’s the old cliché of, “if it takes one person 24 hours to dig a hole, how long would it take 24 people to dig the same hole?” The answer should be “roughly one hour”.

In addition to using EMR, we use the excellent mrjob package for Python. Python is a language that is well suited for GIS processing due to the wealth of packages in its ecosystem (GDAL & Shapely are particularly useful). Consequently, mrjob is a great fit for us.

Although we could’ve used any number of distributed computing frameworks and services, we chose EMR as it allows us to provision hundreds of machines at the drop of a hat.

Partitioning

To exploit the power of distributed computing, we need to divide our workload into discrete pieces that can be processed independently. I quite like Ayende’s at-a-glance description of MapReduce: thinking in C# LINQ terms, it’s basically a distributed Select() + GroupBy().

In our workflow, a typical “map” function reads features from ESRI shapefiles (e.g. WGS84 building outlines) and then calculates which spatial index* each feature should be assigned to in our target resource space – it’s spatial binning, hadoop-style. The map function produces key value pairs of (K: spatial index, V: feature).

* A spatial index is a way of referring to a particular cell in a spatial structure (e.g. a cell in a quadtree). For interested readers: we use Morton Codes in conjunction with a cubemap.

Map

spatial_binning

If we were to run the building features through the map function, the output would be:

Map input Map output (Key) Map output (Value)
Building 108 Cell A Building 108
Building 109 Cell A Building 109
Building 291 Cell D Building 291

You can see that the output keys contain 2x ”Cell A” and 1x ”Cell D”.

In JSON format, this would look something like this (notice these are three independent results)

{
    "Key": "Cell A",
    "Value": {
        "building_id": 108,
        "building_outline": "... polygon data"
    }
}

{
    "Key": "Cell A",
    "Value": {
        "building_id2": 109,
        "building_outline": "...polygon data"
    }
}

{
    "Key": "Cell D",
    "Value": {
        "building_id": 291,
        "building_outline": "... polygon data"
    }
}

Reduce

The “reduce” function’s input will be a key value pair of (K: spatial index, V: N features).

The reduce function input for Cell A would be:

{
    "Key": "Cell A",
    "Value": [
        {
            "building_id": 108,
            "building_outline": "... polygon data"
        },
        {
            "building_id": 109,
            "building_outline": "... polygon data"
        }
    ]
}

… whereas the reduce function input for Cell D would be:

{
    "Key": "Cell D",
    "Value": [
        {
            "building_id": 291,
            "building_outline": "... polygon data"
        }
    ]
}

As we can see, the map outputs have been grouped by the spatial index to form the reduce input – the value of the reduce input is a list of features rather than a single element. As the reducer runs, we have everything we need to process a single resource cell. We then generate a mesh (and any other metadata that is useful) and output a binary resource that can be streamed over the web.

The beauty of MapReduce is that it doesn’t matter which machine is processing any given range of input features – processing happens in parallel. In summary, we can crunch tens of millions of features in arbitrary shapefiles, group them by spatial index in our target coordinate space and, finally, output a binary resource.

Piece of cake, right?

Unfortunately not. Once you get your head around how MapReduce works, the job plumbing is relatively straightforward. However, the trouble with building 3D meshes for millions of square kilometers of the world is that one-in-a-million, pathological issues suddenly become an everyday occurrence.

Off the top of my head, problems include:

  • Vendor A’s data contains error case X, while Vendor B’s data does not exhibit case X, but error cases Y, Z and W instead.
  • Processing data that straddles the boundary of vendor shapefiles (each vendor tends to have their own conventions)
  • The quality of a particular vendor’s data varies from location to location. Things you took for granted in San Francisco might not hold in Alaska.
  • Computational geometry issues.
  • Hadoop is an all-or-nothing batch job – one malformed input can kill a massive job, wasting time and money.

    Robustness suddenly becomes a byword for success. We’ll touch on this in a future post.

  •  
     

    eeGeo Ltd
    Suite 2, 144 West Marketgait
    Dundee
    DD1 1NJ




    +44 1382 221 424
    enquiries@eegeo.com
    eegeo.com
    @eeGeo

    Terms | Privacy | Legal | © eeGeo Ltd, 2016