Mark Russo February 12, 2021
Designing and building cloud applications that will operate at scale is very different from building a small-scale software system. Whether evaluating the architecture proposal for a new large-scale system or assessing an existing one, there is a set of things I look for. Here is my set of top no-no’s for the scalability and robustness of at-scale cloud applications.
Assuming the result set will fit in memory
For any query that can return multiple records, do not assume the result set will fit into memory. It’s easy to rationalize, thinking “we don’t expect to have more than x of those.” Adversaries attacking your APIs and the security team running fuzz tests can violate your expectations.
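One way to keep memory bounded is to page through results instead of fetching them all at once. Here is a minimal sketch; `run_query` is a hypothetical callable standing in for your data store's limit/offset (or cursor-token) pagination.

```python
def fetch_in_pages(run_query, page_size=500):
    """Yield rows one page at a time instead of loading the full result set.

    `run_query` is a hypothetical callable (limit, offset) -> list of rows;
    substitute your data store's cursor- or token-based pagination.
    """
    offset = 0
    while True:
        page = run_query(limit=page_size, offset=offset)
        if not page:
            return
        yield from page          # caller streams rows; memory stays bounded
        if len(page) < page_size:
            return               # a short page means we reached the end
        offset += page_size

# Simulated table with more rows than we would want to hold at once.
rows = list(range(1234))

def fake_query(limit, offset):
    return rows[offset:offset + limit]

total = sum(1 for _ in fetch_in_pages(fake_query, page_size=100))
```

However many rows an attacker or fuzz test provokes, the process only ever holds one page.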
Optimistic health checks
Does your system rely on the health check API provided by a framework? What will that API’s response be when a microservice has lost connectivity with the database or a vital background thread gets stuck? If health checks fail to notice when important processing has stopped, customers can be impacted without your SRE team receiving alerts.
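A deep health check probes the dependencies that matter rather than just reporting that the process is up. This sketch (with a hypothetical `ping_database` callable and a heartbeat the background worker is expected to refresh) illustrates the idea:

```python
import time

class HealthCheck:
    """Deep health check: verify dependencies, not just process liveness."""

    def __init__(self, ping_database, max_heartbeat_age=30.0):
        self.ping_database = ping_database          # hypothetical DB probe
        self.max_heartbeat_age = max_heartbeat_age  # seconds of worker silence tolerated
        self.heartbeat = time.monotonic()

    def beat(self):
        # The vital background thread calls this each time it makes progress.
        self.heartbeat = time.monotonic()

    def status(self):
        checks = {
            "database": self._probe_ok(self.ping_database),
            "worker": (time.monotonic() - self.heartbeat) < self.max_heartbeat_age,
        }
        return ("ok" if all(checks.values()) else "unhealthy"), checks

    @staticmethod
    def _probe_ok(probe):
        try:
            probe()
            return True
        except Exception:
            return False

hc = HealthCheck(ping_database=lambda: None)  # stand-in DB ping that succeeds
hc.beat()
state, checks = hc.status()

hc.heartbeat -= 60            # simulate a worker that stopped reporting
stuck_state, _ = hc.status()
```

A framework's default liveness endpoint would report "ok" in both cases; the deep check flips to unhealthy as soon as the worker goes quiet, so the SRE team gets paged before customers notice.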
Too little focus on handling failures
The pressure to meet deadlines for new systems and new features can leave architects and senior engineers little time to think deeply about handling failures. Often the implementation too readily rejects or abandons units of work, and services are too quick to trigger a restart. That approach can unnecessarily impact the user experience. With careful design, failures outside the critical path of a workflow can be tolerated instead of failing the logical transaction or requesting a restart. Services can be made to degrade gracefully when facing some types of failure and auto-heal when conditions improve.
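The pattern can be as simple as isolating non-critical side effects so their failures never fail the transaction. In this sketch (an invented order-placement flow; `record_analytics` stands in for any non-critical dependency), the analytics service is down but the order still succeeds:

```python
import logging

log = logging.getLogger("orders")

def record_analytics(event):
    # Hypothetical non-critical side effect; imagine a call to a metrics
    # service that is currently unreachable.
    raise ConnectionError("metrics service unreachable")

def save_order(order, orders):
    orders.append(order)   # critical path: this must succeed
    return order

def place_order(order, orders):
    """Fail the request only if the critical step fails; degrade gracefully
    when a non-critical dependency is down."""
    result = save_order(order, orders)
    try:
        record_analytics({"type": "order_placed", "id": order})
    except Exception:
        # Log and continue: analytics is not worth failing the transaction.
        log.warning("analytics unavailable; order %s still accepted", order)
    return result

orders = []
placed = place_order("A-1001", orders)
```

The same shape applies to caches, recommendation calls, and audit trails: wrap them, log the failure, and let the core workflow finish.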
Choosing synchronous processing without considering async
When implementing an API endpoint, the simplest thing to do is to code all steps to run synchronously. That way by the time the response is sent, 100% of the processing is complete. Does the API contract actually require that all processing be complete when the response is received? What runs in 100ms in your development environment might take multiple seconds at scale, and users are impatient creatures. Any processing not required to form the API’s response is a candidate for handling asynchronously.
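The split looks like this in miniature: do only what the response needs synchronously, and hand the rest to a worker via a queue. This sketch uses Python's in-process `queue` as a stand-in for whatever message broker your system uses; the user-creation handler and welcome-email job are invented for illustration.

```python
import queue
import threading

work_queue = queue.Queue()
sent_emails = []

def worker():
    # Drains deferred jobs; in production this would be a separate consumer
    # reading from a message broker.
    while True:
        job = work_queue.get()
        sent_emails.append(job)   # stand-in for slow work (email, indexing)
        work_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

def create_user(name):
    user = {"name": name, "status": "created"}   # needed to form the response
    work_queue.put(("welcome_email", name))      # not needed; defer it
    return user                                  # respond immediately

response = create_user("ada")
work_queue.join()   # only for the demo; the worker normally drains on its own
```

The caller sees the response as soon as the record exists; the email lands a moment later without inflating request latency.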
Using your NoSQL data store as a relational database
RDBMSs are so convenient to use. Referential integrity enforcement, cascading deletes, the ability to roll back transactions…all boost developer productivity. But if your next project will operate at the scale of billions of events per day and your architect directs you to use a particular NoSQL database, your next step SHOULD NOT be to build a relational emulation layer on top of the database. Setting aside some conveniences and not promising all ACID properties is a key part of how NoSQL systems scale so much better than RDBMSs. Embrace the tool for what it is. Do not try to turn it into something else.
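Embracing the tool usually means denormalizing: copy what the read path needs into one document at write time instead of emulating joins at read time. A tiny sketch with invented user/order shapes:

```python
# Relational habit: normalized rows, joined at read time.
users = {1: {"name": "ada"}}
orders_normalized = [{"user_id": 1, "item": "widget"}]

# NoSQL-friendly shape: denormalize what the read path needs into a single
# document, accepting duplication instead of emulating a join.
order_document = {
    "item": "widget",
    "user": {"id": 1, "name": users[1]["name"]},  # copied at write time
}

# One key lookup now answers "who placed this order?" with no join.
buyer = order_document["user"]["name"]
```

The trade-off is explicit: duplicated data that may need updating in more than one place, in exchange for reads that scale with simple key lookups.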
Ignoring batch optimizations
Do you have an async pipeline that is adding thousands of records per second into a data store one at a time? If that data store supports bulk operations, make use of them. Bulk operations make much more efficient use of resources, increasing the load the system can handle before new nodes are needed.
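A common way to adopt bulk operations without restructuring the pipeline is a small batching buffer. In this sketch, `bulk_insert` is a hypothetical callable standing in for your store's bulk API (a multi-row INSERT, a `_bulk` endpoint, etc.):

```python
class BatchWriter:
    """Buffer single-record writes and flush them as one bulk operation."""

    def __init__(self, bulk_insert, batch_size=500):
        self.bulk_insert = bulk_insert   # hypothetical bulk API callable
        self.batch_size = batch_size
        self.buffer = []

    def add(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.bulk_insert(self.buffer)  # one round trip for many records
            self.buffer = []

calls = []   # record each bulk call so we can see how many round trips occur
writer = BatchWriter(bulk_insert=calls.append, batch_size=100)
for i in range(250):
    writer.add(i)
writer.flush()   # drain the partial final batch
```

Two hundred fifty one-at-a-time writes collapse into three round trips. A production version would also flush on a timer so records never sit in the buffer too long.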
Using HTTP for high-volume communication between server nodes
RESTful APIs are the standard way for microservices to communicate with each other. They are simple to create and easy to debug; they are so convenient. But convenience usually comes at a cost. HTTP traffic is bulky and requires a lot of handshaking. Instead of always using REST, consider solutions that offer a binary protocol over a persistent socket. gRPC and Thrift (among others) enable efficient remote procedure calls, making more cost-efficient use of your cloud infrastructure.
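To get a feel for the payload difference (setting aside gRPC and Thrift themselves, which bring their own codegen), compare a JSON body with a fixed binary framing of the same data. The telemetry record here is invented for illustration:

```python
import json
import struct

# One telemetry reading: (sensor_id, timestamp, value).
reading = (42, 1613088000, 21.5)

# Text encoding, as a typical JSON-over-HTTP body would carry it.
text_payload = json.dumps(
    {"sensor_id": 42, "timestamp": 1613088000, "value": 21.5}
).encode()

# Fixed binary framing: unsigned 32-bit int, unsigned 64-bit int, and a
# double, network byte order = exactly 20 bytes per reading.
binary_payload = struct.pack("!IQd", *reading)

ratio = len(text_payload) / len(binary_payload)
```

The binary frame is a fraction of the JSON size before any field names, headers, or per-request handshakes are counted; multiplied by billions of messages per day, that gap shows up directly on the infrastructure bill.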
Not specifying the life cycle for each of your data stores
In the mad dash towards your product launch, it is easy to forget to do data life cycle planning. For any stored entity, how long should it be retained? As records age, should they migrate across hot, warm, and cold data stores? Should full detail be retained, or should the data be aggregated? Skipping this analysis can lead to uncontrolled growth of your data sets. That growth increases latency at the data tier, which can cascade upward as failures when timeout thresholds are reached.
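The outcome of that planning can be captured in a simple policy table that a scheduled job applies. The thresholds below are invented; the point is that each entity has an explicit answer for every age:

```python
# Hypothetical retention policy: (max age in days, tier) per stage of life.
POLICY = [
    (30, "hot"),     # full detail in the low-latency store
    (180, "warm"),   # cheaper storage, slower queries
    (730, "cold"),   # aggregated or archived
]

def tier_for_age(age_days, policy=POLICY):
    """Decide where a record of the given age belongs; anything past the
    last threshold is a candidate for deletion or aggregation."""
    for threshold, tier in policy:
        if age_days <= threshold:
            return tier
    return "expired"

tiers = [tier_for_age(days) for days in (7, 90, 365, 1000)]
```

With the policy explicit, a nightly migration job keeps the hot store small and bounded instead of letting it grow without limit.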
Not investigating how worker processes will collide at scale
You have two instances of a microservice processing a data feed in your development environment and all is well. Throughput is good and there are no signs of resource contention. Will that still be the case when the data feed becomes a firehose and many microservice instances are running? How will your database respond to so much concurrent access to a single table? How often will microservice instances contend for the same record? Depending on the answers, throughput per worker can drop as more workers are added, and total throughput can drop as well. Not doing this analysis early on can prove fatal to a system at scale.
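One way to keep workers from colliding is to have each one atomically claim a batch of records before processing. In this sketch a lock stands in for whatever atomic claim your store offers (a conditional update, or PostgreSQL's SELECT … FOR UPDATE SKIP LOCKED); the record set and worker count are invented:

```python
import threading

records = {f"rec-{i}": None for i in range(100)}   # record id -> owner
lock = threading.Lock()

def claim_batch(worker_id, n=10):
    """Atomically claim up to n unowned records so no two workers ever
    process the same one."""
    claimed = []
    with lock:   # stand-in for a conditional update / SKIP LOCKED query
        for rec_id, owner in records.items():
            if owner is None:
                records[rec_id] = worker_id
                claimed.append(rec_id)
                if len(claimed) == n:
                    break
    return claimed

results = {}

def run(worker_id):
    results[worker_id] = claim_batch(worker_id)

threads = [threading.Thread(target=run, args=(w,)) for w in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

all_claimed = [rec for batch in results.values() for rec in batch]
```

Five concurrent workers claim fifty records with zero overlap. The load test worth running early is this same pattern against your real database, at the worker counts you expect in production.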
Ignoring hosting costs when making architecture decisions
The new feature your team launched drives $5.99 of revenue per user per month and you have millions of users. You are a hero! Two months later you are notified that your cloud hosting costs are $12.50 per user per month, and unpleasant conversations with executives ensue. Don’t let this happen to you. Do cost projections for each piece of cloud infrastructure you use, and pursue frugality at every opportunity.
Common among these items is choosing expediency and convenience over doing the extra work required to keep the system viable when operating at scale. The time saved early on by expedient choices results in much more time spent on unplanned work later, often in the context of a production outage.
Keep these no-no’s in mind and share your own with the community.