First, let me tell you a little about what Clew is, so you have some context for what we're trying to achieve here and can hopefully relate it back to your own projects. Clew lets you search through all your cloud services, like Google Drive, Dropbox and GitHub, using one search bar. It's a desktop app supported by a backend service, which is what we deployed here.
Besides whatever logic we use to resolve your search terms into search results, the application has the makings of a standard web service, and most of the requirements are fairly obvious and relatable:
- A Laravel PHP monolith that uses most standard features and packages of the framework to get things done.
- Most of the ancillary services you can think of: RDS for databases, SQS for queues, SES for email, Elasticsearch, Redis Cache, S3 for storage and a few others.
- CodePipeline for continuous deployments.
- Monitoring, logs and alarms on CloudWatch.
- An analytics service that runs on its own servers and independent DBs (but is deployed as a part of the larger Laravel monolith).
- An update server for deploying app updates using CloudFront and S3.
Things we care about
- High availability (horizontal and vertical scalability); we can currently scale most of this infrastructure up or down, left or right across different regions in about ten minutes by changing one configuration file.
- Security; the ability to provision virtual private networks, log events, and automatically serve and rotate security keys. These are all features that both AWS and GCP support.
- Componentization of different services; since Clew is built on the Laravel (PHP) framework, we wanted to be able to run our queues, jobs, email servers, socket servers and Redis caches independently of the main monolith application. I like to think of it as a modular monolith.
- Low maintenance; this is mostly a nice-to-have because unless you're using a managed service for your deployment, you do just have to keep an eye on things and update them as necessary.
- Ease of migration; the CloudFormation templates we use can be converted to Terraform templates, which can then be used to help deploy across other service providers.
- Supports seamless continuous deployment.
Key decisions we made
- Using CloudFormation (infrastructure as code) to deploy about 70% of our infrastructure. This means a large portion of our infrastructure is written in several YAML files. You can also go with a Terraform template instead, but I had a hard time getting Terraform resources to deploy. It's something I will likely take another look at down the line.
- We want to keep all our services under one cloud service provider, currently AWS.
- Wherever possible, we want to use a managed (dedicated) service for any ancillary services (like email servers). This is mostly for failure management.
- Security is number one priority. (Pat yourself on the back for being well versed in YouTube culture if you got that reference.)
Why Infrastructure as Code?
Servers are complicated, fickle things. In other words, there's plenty you could do wrong and a few configurations that work well. The great thing about infrastructure as code is that you get to precisely define things like open ports, firewalls, virtual private networks and scaling policies. All this and more would otherwise have been incredibly painful to define, monitor and maintain.
Infrastructure as code lets you deploy a fleet of otherwise disconnected services as one unified, secure platform for your application to run on. Furthermore, thanks to supporting services like CloudFormation, it lets you iterate on your infrastructure over time and seamlessly roll out and scale your services as you need.
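To make "precisely define things like open ports" a bit more concrete, here's a minimal sketch in Python that builds a tiny CloudFormation template as a plain dict and prints it as JSON (CloudFormation accepts JSON as well as YAML). This is not one of our actual templates: the resource name `WebSecurityGroup`, the `VpcId` parameter, the chosen ports and the CIDR ranges are all assumptions for illustration.

```python
import json


def build_template(app_port: int = 443, ssh_cidr: str = "10.0.0.0/16") -> dict:
    """Return a minimal CloudFormation template as a plain dict.

    Everything here (names, ports, CIDR ranges) is hypothetical and only
    illustrates how firewall rules end up as reviewable text.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Hypothetical web security group (illustration only)",
        "Parameters": {
            # The VPC is passed in, so the same template works per environment.
            "VpcId": {"Type": "AWS::EC2::VPC::Id"},
        },
        "Resources": {
            "WebSecurityGroup": {
                "Type": "AWS::EC2::SecurityGroup",
                "Properties": {
                    "GroupDescription": "HTTPS from anywhere, SSH from inside the VPC only",
                    "VpcId": {"Ref": "VpcId"},
                    "SecurityGroupIngress": [
                        # Public application traffic.
                        {"IpProtocol": "tcp", "FromPort": app_port,
                         "ToPort": app_port, "CidrIp": "0.0.0.0/0"},
                        # Administrative access restricted to the private range.
                        {"IpProtocol": "tcp", "FromPort": 22,
                         "ToPort": 22, "CidrIp": ssh_cidr},
                    ],
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_template(), indent=2))
```

The point is less the specific rules and more that every open port lives in a file you can review, diff and roll back.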
How to think about designing your infrastructure
Think of all the services you need for the application, then think about how you're going to have them communicate securely, with low latency and little friction. Then think about scalability, followed by how you're going to deploy code to your infrastructure. All the in-between questions, like "how can my queue workers scale independently of the primary servers?", can be thought through and defined at a per-service level once you can adequately picture your bird's-eye view.
That's the beauty and power of coupling infrastructure as code with managed cloud services like AWS and GCP.
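As an example of answering one of those per-service questions, here's a rough sketch of how you might decide the number of queue workers from the queue backlog alone, independent of the web servers. This is a hypothetical rule, not the scaling policy we run: the function name, the throughput figure and the min/max bounds are all assumptions you'd replace with your own measurements.

```python
import math


def desired_workers(
    backlog: int,
    jobs_per_worker_per_min: int,
    target_drain_minutes: int,
    min_workers: int = 1,
    max_workers: int = 10,
) -> int:
    """Workers needed to drain `backlog` jobs within the target time.

    Hypothetical sizing rule: one worker clears
    jobs_per_worker_per_min * target_drain_minutes jobs in the window,
    and the result is clamped to [min_workers, max_workers].
    """
    if jobs_per_worker_per_min <= 0 or target_drain_minutes <= 0:
        raise ValueError("throughput and drain time must be positive")
    capacity_per_worker = jobs_per_worker_per_min * target_drain_minutes
    needed = math.ceil(backlog / capacity_per_worker)
    return max(min_workers, min(max_workers, needed))


if __name__ == "__main__":
    # 1,500 queued jobs, 60 jobs/min per worker, drain within 5 minutes.
    print(desired_workers(1500, 60, 5))
```

Because the input is the queue depth rather than web traffic, the worker fleet can grow or shrink without touching the primary servers.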
Enforcing high availability and high performance
Our backend sometimes serves thousands of critical requests per second and frequently processes hundreds of incoming calls from apps, webhooks and web sockets. High availability and performance are critical for our service to function smoothly. In our early days we ran most of Clew's tasks (queues, jobs, analytics, logging, etc.) less independently, and we used to experience downtime any time we deployed or when individual services failed. Here are some things we did that almost eliminated these infrastructure issues.
- Deploy across multiple data centers and use load balancers to distribute the traffic. Codify health checks for each instance so that any instances that bug out don't get any traffic and eventually get replaced. (The tutorial down below talks about how to set this up.)
- Set up our CI/CD pipeline to support blue-green deployments. This was critical in taking the pressure off deployments and shipping consistently.
- Database reads and writes can easily become a bottleneck. Distributed databases and database clusters with independent read/write instances help mitigate this. RDS (from AWS) is perfect for this use case and provides automatic failovers and distributes read replicas across data centers for better performance.
- Async, async, async. Wherever possible, try to do things asynchronously and set up error boundaries around all these async tasks. This way, when a specific task fails, it can fail independently of the overall process and can be re-queued or debugged independently. With Laravel, we leverage queues to do this.
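The "error boundary plus re-queue" idea from the last point can be sketched in a few lines. This is a simplified, in-process Python simulation, not Laravel's queue API and not our production code; `Job`, `drain` and `max_attempts` are hypothetical names made up for the illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Job:
    name: str
    run: Callable[[], None]
    attempts: int = 0


def drain(jobs: Iterable[Job], max_attempts: int = 3) -> tuple[list[str], list[str]]:
    """Run every job; a failing job never stops the others.

    Failed jobs go to the back of the queue until they have used up
    `max_attempts`, after which they are recorded as failed so they can be
    inspected separately.
    """
    queue = deque(jobs)
    done: list[str] = []
    failed: list[str] = []
    while queue:
        job = queue.popleft()
        job.attempts += 1
        try:
            job.run()  # the error boundary: exceptions stay inside this job
            done.append(job.name)
        except Exception:
            if job.attempts < max_attempts:
                queue.append(job)  # re-queue for another attempt
            else:
                failed.append(job.name)  # give up; debug independently
    return done, failed
```

A real queue would add delays between retries and persist failed jobs somewhere durable, but the shape is the same: each task fails on its own terms.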
I'm giving this its own section because it's top of mind for me right now. It's one thing to have all the pieces together and see your infrastructure work, but what happens when it doesn't?
How do you chase down errors and identify bottlenecks and edge cases? The first step here is to set up an error monitoring tool. We leverage Bugsnag and a self-hosted tool to monitor for exceptions and provide a useful UI for identifying when and where things fail. Next up, we'll likely use a consolidated logging tool like Splunk, which would give us the ability to query logs efficiently and quickly surface problems.
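Whatever logging tool you end up with, querying gets much easier if each log line is structured. Here's a small sketch using Python's standard `logging` module that emits one JSON object per line, including the stack trace when there is one. The field names and the `make_logger` helper are assumptions for illustration, not the format Bugsnag or Splunk require.

```python
import json
import logging
from typing import TextIO


class JsonFormatter(logging.Formatter):
    """Format each record as a single JSON object (hypothetical field names)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Keep the full traceback so failures can be searched later.
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def make_logger(name: str, stream: TextIO) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger
```

Calling `logger.exception("search failed")` inside an `except` block then produces a line you can filter by `level` or search by exception type.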
Tutorial & Question
This wasn't meant to be anything more than a quick dissection of how and why we deploy Clew the way we do. I used countless resources to put our infrastructure together and felt obliged to share our learnings. As long as you have some idea of what you want, the linked resources (primarily the one below) will help you figure out how to best deploy it.
The "Laravel on AWS" reference architecture by Lionel Martin has by far been the most tangible resource for deploying Laravel to AWS via CloudFormation. The template is slightly outdated and needs some obvious changes to work. With this as a starting point, I was able to extend it to add the other resources and services I needed for the application.
If you have any other questions or feedback feel free to tweet me @fakeUdara.