How AWS Lambda Transforms Our Data Processing Capabilities

Written by Steve Gifford

July 10, 2024

The evolution of cloud computing is truly remarkable, especially for those who’ve been in the field for a while. I recall the days of naming individual servers (monkey species were my favorite) and meticulously maintaining hardware. Now, with technologies like AWS Lambda, we’ve entered a new era of cloud services that’s revolutionizing how we approach data processing and storage.

While some argue that ‘serverless is just someone else’s server,’ I’ve come to appreciate this shift. We no longer focus on hardware; instead, we break down services into tasks and execute them using various cloud-based solutions. AWS Lambda has become an invaluable tool for our weather data ingestion at Wet Dog Weather, allowing us to efficiently process vast amounts of meteorological information without worrying about the underlying infrastructure.

What is AWS Lambda?

Introduced in 2014, AWS Lambda is an event-driven, serverless Function as a Service (FaaS) addition to AWS. It started as a fairly minimal “run some code on a trigger” service and has grown into something much more complex since.
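At its simplest, a Lambda is just a Python function that AWS calls with an event payload whenever the trigger fires. A minimal sketch of the classic handler (the event shape depends on the trigger; this one just echoes it back):

    # handler.py - a minimal AWS Lambda handler in Python.
    # AWS calls lambda_handler(event, context) each time the trigger
    # fires, e.g. an S3 upload or a scheduled EventBridge rule.
    import json

    def lambda_handler(event, context):
        # 'event' carries the trigger payload; its shape depends on the source.
        print(json.dumps(event))
        return {"statusCode": 200, "body": "ok"}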

AWS added container support a few years later, which was necessary for us: our importers and data processing tasks require a fair bit of geospatial and meteorology code. We use Python because our customers use it; thus, we need a host of libraries in the container.

Container-based Lambda acts a teeny bit differently, though.

AWS Lambda Container Workloads

Though AWS encourages you to think of Lambdas as individual execution instances, it makes sense to delve deeper into the specifics. We’ve found the following essential:

  • AWS will fire up your container once for the first request
  • It will reuse an existing container for a group of requests
  • Each request has a hard timeout of up to 15 minutes
  • The container itself has a hard timeout, too

If you’re using heavy containers like we do, consider how your container is invoked and how reusable it can be. For us, bigger containers that can do multiple things work best.
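Container reuse is the main lever for making heavy containers pay off. Anything initialized at module scope survives across warm invocations, so the expensive setup is paid once per container rather than once per request. A hedged sketch of the pattern (the initialization here is a stand-in, not our actual import code):

    import time

    # Module scope runs once, at cold start; warm invocations skip
    # straight to the handler and reuse whatever lives up here.
    _CACHE = {}

    def _expensive_init():
        # Stand-in for loading lookup tables, opening datasets, etc.
        time.sleep(2)
        return {"ready_at": time.time()}

    _TABLES = _expensive_init()  # paid once per container

    def lambda_handler(event, context):
        # Requests that land on a warm container reuse _TABLES and _CACHE.
        return {"tables_loaded_at": _TABLES["ready_at"]}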

Building our containers is the hardest part.

Containers are the Worst Solution Except for All the Other Solutions

On the import side of our Boxer service, we read a variety of data formats that are popular in the weather world. Contained within those, we process model and observed data sets that carry a wide variety of assumptions and defaults. That part is actually more complex than the data formats themselves.

Our data display and query pipelines require us to interpret data, convert units, and sometimes perform more complex processing to integrate different models into the same conceptual space. This is separate from the geospatial processing for some of our static visual products.
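As a small example of what putting models “into the same conceptual space” means in practice: two models may report the same field in different units, so everything gets normalized before anything downstream sees it. A hedged numpy sketch (the field values and unit strings are illustrative):

    import numpy as np

    # Hypothetical: one model reports 2 m temperature in Kelvin,
    # another in degrees Fahrenheit. Normalize both to Celsius.
    def to_celsius(field: np.ndarray, units: str) -> np.ndarray:
        if units == "K":
            return field - 273.15
        if units == "degF":
            return (field - 32.0) * 5.0 / 9.0
        if units == "degC":
            return field
        raise ValueError(f"unhandled units: {units}")

    model_a = to_celsius(np.array([288.15, 290.0]), "K")   # -> [15.0, 16.85]
    model_b = to_celsius(np.array([59.0, 62.6]), "degF")   # -> [15.0, 17.0]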

All of this means we pull in a lot of strange Python libraries. Like siblings in a big family, they quibble. We use Conda (& Mamba) to manage them, but we’ll still get the occasional weird crash in a container.

That’s what testing is for.
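One cheap safety net here is a smoke test that runs inside the built container and imports every heavy dependency, since version and ABI conflicts tend to surface on import or first use. A minimal pytest-style sketch (the library list is illustrative, not our full set):

    # test_smoke.py - run inside the built container image, e.g. via pytest.
    import importlib

    import pytest

    HEAVY_LIBS = ["numpy", "pyproj", "xarray"]  # illustrative subset

    @pytest.mark.parametrize("name", HEAVY_LIBS)
    def test_import(name):
        # Import-time failures catch most dependency conflicts early.
        assert importlib.import_module(name) is not None

    def test_basic_numeric_op():
        import numpy as np
        assert np.isclose(np.arange(4).sum(), 6)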

Lambda Containers for Import

Most of our containers are running data import. This is how we do everything from GFS every 6 hours to MRMS every few minutes. The common containers are constantly in use for an AWS Lambda somewhere, and the less common containers will spin up when needed.

Of course, only some data sets are in AWS, so we have periodic scrapers to get data. These drop data sets into AWS and let the rest of the processing kick off as designed. It’s a nicely logical architecture.
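The scrapers themselves can be simple: fetch a file from the upstream source and drop it into S3, where an S3 event notification kicks off the matching import Lambda. A hedged boto3 sketch (the URL, bucket, and key are placeholders):

    import urllib.request

    import boto3

    s3 = boto3.client("s3")

    def scrape(url: str, bucket: str, key: str) -> None:
        # Fetch the upstream file and drop it into S3; an S3 event
        # notification on the bucket then triggers the import Lambda.
        with urllib.request.urlopen(url) as resp:
            s3.put_object(Bucket=bucket, Key=key, Body=resp.read())

    # scrape("https://example.com/data/latest.grib2",
    #        "wet-dog-ingest", "incoming/latest.grib2")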

Lambda Containers for Processing

Once we’ve pulled data into Boxer, it must go through a few steps to reach the user. First, we’ll reorganize the variables for display. This consists of building a data tile pyramid and letting the rest of the system know it’s available.
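A data tile pyramid is just the full-resolution grid plus successively downsampled copies, so the display side can pull data at whatever zoom level it needs. A toy numpy sketch of the idea (real pyramids also tile each level and deal with projections):

    import numpy as np

    def build_pyramid(grid: np.ndarray, levels: int) -> list:
        # Level 0 is full resolution; each level halves both dimensions
        # by averaging 2x2 blocks (even dimensions assumed for brevity).
        pyramid = [grid]
        for _ in range(levels):
            g = pyramid[-1]
            h, w = g.shape[0] // 2 * 2, g.shape[1] // 2 * 2
            g = g[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
            pyramid.append(g)
        return pyramid

    pyr = build_pyramid(np.random.rand(512, 1024), levels=3)
    # shapes: (512, 1024), (256, 512), (128, 256), (64, 128)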

After the main visual display is ready, we’ll do several optional steps for the static tile service: reprojecting and resampling the data for that more limited use case. Ironically, we do more work for the less frequently used case.

In these steps, we take advantage of the way AWS Lambda works. On import, we’ll strip out the variables and levels we want from a model (GFS, for example). That can result in hundreds of new time series + variable + level entries, which are tossed back into Boxer for processing.
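That fan-out is just a loop over the variable and level combinations, with each combination queued as its own work item for AWS to schedule across containers. A hedged sketch using SQS (the queue URL and message schema are illustrative, not Boxer’s actual format):

    import json

    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/boxer-work"  # placeholder

    def fan_out(model: str, run: str, variables: list, levels: list) -> None:
        # One message per variable + level combo; each message becomes
        # its own Lambda invocation downstream.
        for var in variables:
            for level in levels:
                sqs.send_message(
                    QueueUrl=QUEUE_URL,
                    MessageBody=json.dumps(
                        {"model": model, "run": run, "variable": var, "level": level}
                    ),
                )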

AWS Lambda will spin up a container and reuse it for a fixed period. Thus, our hundreds of GFS variable-processing requests will run across some number of these instances, but we won’t incur the startup cost for each one.

AWS Lambda Advantages & Disadvantages

Data import for Boxer is bursty. The big models, like GFS, come in periodically, and radar, like MRMS, comes in every few minutes. Custom models also show up on their own schedules.

All of that could be predicted, except when there’s an upstream network problem. For instance, we find that MRMS or HRRR will often bunch up. Using Lambdas allows us to catch up fairly quickly: if we’re hit with several hours’ worth of data at once, we automatically go wide to get through it.
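“Going wide” is mostly automatic, since each backlogged file becomes its own invocation and Lambda scales out toward its concurrency limit. One knob worth knowing about is reserved concurrency, which both guarantees and caps how wide a function can go. A hedged boto3 sketch (the function name and limit are placeholders, not necessarily how we tune Boxer):

    import boto3

    lam = boto3.client("lambda")

    # Reserve (and thereby cap) concurrency for an importer so a
    # several-hour backlog fans out without starving other functions.
    lam.put_function_concurrency(
        FunctionName="boxer-mrms-import",   # placeholder name
        ReservedConcurrentExecutions=100,   # placeholder limit
    )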

If you think purely in terms of CPU, AWS Lambda is kind of expensive. I’ll admit to fantasizing about buying a nice high-end PC and running all of this in an office. That would be great—until it isn’t. The reliability and maintainability can’t be beat.
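Some back-of-envelope math shows why. Assuming roughly current x86 list pricing of about $0.0000166667 per GB-second (check AWS’s pricing page; this drifts), a Lambda that’s busy around the clock costs several times what a small always-on instance with similar memory would:

    # Hedged back-of-envelope math; the price is an approximate list
    # price and will drift. The point is the ratio, not the dollars.
    GB_SECOND_PRICE = 0.0000166667      # approx. x86 Lambda $/GB-second
    memory_gb = 4
    seconds_per_month = 30 * 24 * 3600  # 2,592,000

    lambda_cost = GB_SECOND_PRICE * memory_gb * seconds_per_month
    print(f"4 GB Lambda, fully busy all month: ~${lambda_cost:,.0f}")  # ~$173

    # Bursty workloads flip that math: you pay nothing while idle,
    # which is exactly the shape of our import traffic.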

Why We’re Committed to AWS Lambda

We don’t just do data import. There’s much more going on in the rest of the system, particularly the parts that answer user queries. But we like AWS Lambda for data import and much of our processing chain.

We’re happy with how Lambdas have worked out for Boxer. The approach meshes well with our Infrastructure as Code and general DevOps mindset, and we have yet to have anything within the system go down. So we’ll be sticking with it for now, despite the cost.