Friday, May 6, 2016

Migration to DevOps and AWS – Part 1

Introduction

It’s been about a year since NerdWallet moved all of our code and data from a set of leased servers into AWS. The project finished in early 2015, and it seems a no-brainer to talk about it as the first post in our engineering blog. This article discusses the key decisions made by the project team, challenges we came across, and general learnings gained from the migration process.

NerdWallet’s early days

NerdWallet started in 2009, but didn’t raise our Series A round of funding until 2015. Being a bootstrapped company, we ran a very lean shop, especially in the early days. We had a very small engineering team, and saving even $100 a month on bills was something we often talked about. Given this, we made a conscious decision to optimize for speed of iteration, cost-effectiveness and operational simplicity in all areas of the company, and that included our serving infrastructure.

The company was originally built on a LAMP tech stack, with Redis added as a caching layer later on. We ran on a few leased (bare-metal) servers out of a data center in Chicago and generally co-located software on a small number of machines. We did configuration management by embedding credentials for our dev, test, and prod environments in one file and using hostname as a key to switch between them. All of the initial server setup and administration was done via SSH and checklists. PHP made code deployments fairly easy: we deployed code by rsyncing it to the servers. We ran CentOS 5 with PHP 5.3.
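To make the hostname-keyed configuration approach concrete, here is a minimal sketch in Python. The original code was PHP and is not shown in this post, so the hostnames, setting names, and the `load_config` helper are all hypothetical, and real credentials are deliberately left out.

```python
import socket

# Hypothetical hostnames and settings, for illustration only.
# The real file also held credentials, which is the risk discussed later.
CONFIG_BY_HOSTNAME = {
    "dev-web-01": {"env": "dev", "db_host": "localhost"},
    "test-web-01": {"env": "test", "db_host": "test-db.internal"},
    "prod-web-01": {"env": "prod", "db_host": "prod-db.internal"},
}


def load_config(hostname=None):
    """Return the settings for the machine the code is running on."""
    hostname = hostname or socket.gethostname()
    try:
        return CONFIG_BY_HOSTNAME[hostname]
    except KeyError:
        # An unknown host fails loudly instead of silently picking an environment.
        raise RuntimeError("no configuration for host %r" % hostname)
```

The appeal is that one file covers every environment. The cost is that every environment's settings ship to every machine and live in source control.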

As NerdWallet started hiring engineers in 2013, we made some early moves toward automation to help with developer onboarding (for example, setting up development environments using Vagrant with Chef for provisioning). However, our production environment remained manually administered. By late 2013, the business was doing well enough that cost, while important, was no longer such a huge driver for technology decisions.

NerdWallet – AWS planning

Moving into 2014, we decided to deprecate new PHP development and start moving toward Python being the de facto language for all of our business logic. To start fitting Python into our PHP monolith, we also began moving toward more of a microservice architecture: the Python business logic would expose REST interfaces that would be consumed by PHP front-end code. (We still have similar models today for new development, except that the front-end is now powered by Node.js.)
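As a rough illustration of that split, the sketch below shows a tiny Python service exposing one JSON endpoint over HTTP, which a front end in any language could call. It uses only the standard library. The `/v1/ping` path, the `service_app` name, and the port are assumptions made for the example, not our actual service design.

```python
import json
from wsgiref.simple_server import make_server


def service_app(environ, start_response):
    """Tiny WSGI app exposing one read-only JSON endpoint (hypothetical path)."""
    if environ["REQUEST_METHOD"] == "GET" and environ["PATH_INFO"] == "/v1/ping":
        status, payload = "200 OK", {"status": "ok"}
    else:
        status, payload = "404 Not Found", {"error": "not found"}
    body = json.dumps(payload).encode("utf-8")
    start_response(status, [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def serve(host="127.0.0.1", port=8000):
    """Run the app locally; a front end would then call GET /v1/ping over HTTP."""
    make_server(host, port, service_app).serve_forever()
```

Because the contract is plain HTTP and JSON, the consumer can change (PHP then, Node.js now) without the Python service needing to know.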

We started investigating AWS for two major reasons:

  1. Agility. Spinning up an EC2 instance for experimentation is super fast compared with calling up our hosting provider, trying to lease a new machine, and hoping we sized it correctly, since we were stuck paying the bill for at least a month.
  2. High availability. AWS makes it very easy for us to spread our instances across multiple availability zones that are still connected by private networking. Gaining that sort of high availability with our leased data center partners would have been much trickier.

For a few months, we ran in a split data center model where our core PHP applications ran at our original hosting provider, while our new Python services ran in AWS (backed by an RDS instance). While taking a WAN trip from our PHP server to AWS to invoke Python services is not ideal, it did not have a huge impact on our page load times. However, it was clear that we should go all-in on our AWS investment, and we began work in late 2014 to forklift our PHP applications over to AWS as well.

The design questions

There were a few decisions we made while planning the cutover that were no-brainers to us:

  • Use Chef to build and administer the AWS PHP servers. We were already using Chef to build our development VMs and our Python servers, and it was very clear that we should be using the same paradigm for our PHP servers as well.
  • Use Chef for configuration management. We had been checking in production service names + credentials to GitHub and looking at hostnames to figure out whether we should use the production environment or the test environment. Having production credentials in source control is a security risk, and having only one place to look for service and environment information makes it much easier to figure out why your application is picking up a configuration setting.
  • Build applications for high availability. AWS makes it very easy to set up stateless servers across several availability zones in a given region. We wanted to ensure new software was designed to be stateless and work behind an ELB, and also go back through our old code and redesign it where necessary. (Our PHP application server used to have a “warm” standby that was kept in sync with all code updates, but all traffic always went to one node.)
  • Take advantage of Amazon’s VPC. Our old data center exposed all of our hosts to the public internet, and we relied on iptables to protect access at the network layer. Amazon’s VPC let us emulate a more traditional data center with some boxes not exposed to the internet at all except via NAT, subnetting, access control enforced by the routing layer, and more. This was clearly something we wanted to take advantage of as we knew we would be rolling out more services in the future.
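To illustrate the subnetting idea, here is a minimal sketch that carves a VPC address range into one public and one private subnet per availability zone, using Python's standard `ipaddress` module. The CIDR block, the zone names, and the `plan_subnets` helper are hypothetical and do not describe our actual network layout.

```python
import ipaddress


def plan_subnets(vpc_cidr, zones, new_prefix=24):
    """Assign one public and one private subnet to each availability zone."""
    # subnets() yields consecutive, non-overlapping blocks of the requested size.
    blocks = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix)
    plan = {}
    for zone in zones:
        plan[zone] = {"public": str(next(blocks)), "private": str(next(blocks))}
    return plan


# Hypothetical VPC range and zone names, for illustration only.
layout = plan_subnets("10.0.0.0/16", ["zone-a", "zone-b"])
```

In a layout like this, only the public subnets would route directly to the internet, while hosts in the private subnets would reach it through NAT.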

We had to make a few tricky decisions while planning the cutover:

  • AMI Design
    • Do we build a stripped-down base AMI that contains a very small set of packages and have Chef build everything else from scratch at provision time? Do we take a middle-ground approach and build one base AMI for each of our tech stacks, or do we start producing immutable AMIs as part of our build process? All of these have differing trade-offs between machine bring-up times, build times, deploy times, and reproducibility of environments.
  • How many changes to our infrastructure should we make at once?
    • We already had decided to change our configuration management and networking. However, there were a few other changes we knew had to be made at some point too. How many of them should gate the AWS cutover? On the one hand, doing multiple smaller changes reduces risk at each milestone. On the other hand, each of these items also has a high manual testing cost that we could reduce by batching a bunch of changes together.
    • System software: Our software stack had grown a little old. CentOS 6 had been released long ago, and our DevOps engineer was more familiar with the Ubuntu stack. Ubuntu 14.04 LTS also shipped standard with PHP 5.5 and Apache 2.4, both of which would be upgrades for us.
    • Should we try to move our code to a new hosting platform and upgrade the underlying software at the same time, or try to break it into two smaller projects?
    • NerdWallet software design: Should we try to refactor all of our PHP applications to be HA aware before we cut over to AWS?
  • Infrastructure automation:
    • It would be ideal for QA purposes to be able to bring up a copy of the NerdWallet production environment just by running a script or two. How should we automate creation of VPCs, security groups, etc.?
    • Should we invest in auto-scaling groups? While scaling our traffic up and down by load is not a super important use case for us, having a self-healing infrastructure (if a machine dies, AWS will automatically create a new one for us) is certainly nice.
    • How do we bootstrap new EC2 instances so that they understand which type of server they are supposed to be?
  • Amazon managed services:
    • Did we want to leverage ElastiCache and RDS instead of trying to forklift our existing Redis and MySQL servers over with the same configuration files?
    • Our existing DNS provider (Rackspace) worked fine, but should we migrate over to Route53 so that all of our infrastructure lives under one roof?
  • SSL termination:
    • We knew for sure that we wanted all access to NerdWallet’s servers to run through ELBs. We did have to decide whether we wanted to terminate any inbound SSL connections at the ELB itself (and have it route traffic via HTTP internally) or pass through connections to our application servers and have them terminate the SSL connections themselves.
  • Database migration was a tricky enough topic that it is covered in its own section below.

The design decisions

Where did we end up with the above questions and how would I evaluate our decisions?

  • AMI design: We decided to go with a base AMI that is built up by Chef. While this makes the bring-up of a new server fairly slow, it does mean we don’t have to worry much about AMI management. It also works well with our Vagrant environment where we combine multiple Chef roles into one virtual machine. This decision worked well for us at the time, although we are looking at alternatives now. We have so many services that we are running into difficulties making sure our AWS environment and our dev environment stay relatively converged.
  • Scope of changes: We did decide to migrate our production environment to Ubuntu, Apache 2.4 and PHP 5.5 at the same time as moving to AWS. Thanks to our use of Vagrant and Chef automation, we were able to build new VMs with this system software and test our software against them fairly easily. The PHP move was pretty seamless, but we ran into some problems with the Apache migration, which I will cover below. While changing so many core pieces of infrastructure at once is a pretty scary thing, I think this ended up being the right decision in the end, as it let us end the AWS cutover project feeling like our core software infrastructure was in a good spot.
  • HA: We did not try to refactor all of our app infrastructure to be HA aware as part of the cutover. We decided this was one step too far, as it would have added at least several weeks to the project. This process was particularly complicated for us because NerdWallet runs a self-hosted version of WordPress, and WordPress isn’t particularly designed to be run in a cluster (and even if it were, many of the plugins in the WP ecosystem assume they can use the local disk as a state store).
  • Infra automation:
    • We used auto-scaling groups to create and provision all of our servers. This also turned out to be a good way to handle our bootstrapping problem — the launch configuration for each auto-scaling group specifies which Chef roles to run on boot-up.
    • We investigated CloudFormation for infrastructure bring-up, but found it pretty unwieldy to use. We started trying to build one gigantic JSON file that described the entire NerdWallet environment, but it was hard to read and maintaining it over time became very tricky. It was also hard to test what would happen when you tried to update the CF template. We are still provisioning security groups, IAM roles, etc. by hand, which will become a bottleneck for us as we continue to grow our infrastructure. It’s definitely something we should have tackled during the cutover process. There are now alternatives to CloudFormation like HashiCorp’s Terraform (which didn’t exist when we were doing our planning) or Ansible’s EC2 modules, but we have not done a full evaluation of either yet.
  • AWS managed services: We decided to use AWS managed services where possible to take advantage of the built-in backup and built-in cross-AZ failover. While you do lose some visibility into the inner workings of the service, this decision has generally worked out for us. Unfortunately, we were unable to use RDS in our initial cutover due to an issue that will be discussed later. We used Route53, which has been very helpful for us: Being able to automate DNS record creation as part of instance bootstrap has been great.
  • SSL termination: We decided to terminate SSL at the ELB and have it forward HTTP traffic internally. We went this route because it makes certificate management much easier: the certificate and private key can be stored at the Amazon level instead of being distributed to all of our machines. We’ve had no problems here.
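The bootstrapping approach described above, where each auto-scaling group's launch configuration names the Chef roles to run at boot, can be sketched roughly as follows. This is an illustration under assumptions rather than our actual tooling: the group names, role names, file path, and helper functions are hypothetical. It does rely on two real Chef conventions, the `role[name]` run-list syntax and `chef-client -j` for passing JSON attributes.

```python
import json

# Hypothetical group-to-role mapping; the post does not list the real Chef roles.
ROLES_BY_GROUP = {
    "php-web": ["base", "php_app"],
    "python-api": ["base", "python_service"],
}


def chef_run_list(group):
    """Translate a group's role names into Chef run-list entries."""
    return ["role[%s]" % role for role in ROLES_BY_GROUP[group]]


def render_user_data(group):
    """Build a boot script that writes first-boot JSON and runs chef-client."""
    attributes = json.dumps({"run_list": chef_run_list(group)})
    return "\n".join([
        "#!/bin/bash",
        "set -e",
        # Assumed path; any location readable by chef-client would do.
        "cat > /etc/chef/first-boot.json <<'EOF'",
        attributes,
        "EOF",
        "chef-client -j /etc/chef/first-boot.json",
        "",
    ])
```

The rendered script would be supplied as the launch configuration's user data, so every instance the group creates, including replacements for dead machines, converges to the same roles without manual steps.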

Stay tuned for part 2 on how we planned for and executed the cutover!
