Modern Backend Scaling: From Stateless APIs to Serverless Architectures


Building scalable backend systems is not about memorizing a list of tools. It is about understanding where bottlenecks appear, how systems evolve with growth, and which techniques solve which problems.

Modern systems rarely scale with a single technique. Instead, they combine multiple layers:

  • Stateless services

  • Load balancing

  • Database scaling

  • Distributed caching

  • CDNs

  • Asynchronous processing

  • Service architecture (monolith vs microservices)

  • Serverless computing

This article explains these concepts in a practical, interview-ready way. It corrects common misconceptions and aligns the concepts with cloud architecture as it is practiced in 2025 and beyond.


Statelessness — The Foundation of Horizontal Scaling

Horizontal scaling means adding more servers instead of increasing the power of a single machine.

For example:

  • Vertical scaling → upgrading a server from 4GB RAM → 32GB RAM

  • Horizontal scaling → running multiple instances of the same service

For horizontal scaling to work properly, backend services must be stateless.

A stateless service does not store request-specific or user-specific state locally in memory. Every request must be independently processable by any server instance.

Why stateful servers break scaling

If a server stores session data in memory:

Server A memory:
session[user123] = {...}

And the next request goes to Server B, the session will not exist there.

Result:

401 Unauthorized

This creates inconsistent behaviour across servers.

Correct approach: externalize state

State must be stored in shared infrastructure:

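| State type      | Recommended storage                          |
|-----------------|----------------------------------------------|
| Sessions        | Redis or another distributed cache           |
| Files           | Object storage (S3, R2, GCS)                 |
| Persistent data | Managed databases (PostgreSQL, MySQL, etc.)  |
| Cache           | Redis / Memcached                            |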

Example: storing sessions in Redis (Node.js)

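import express from "express"
import session from "express-session"
// connect-redis v7 exposes RedisStore as the default export;
// v8 and later use a named export: import { RedisStore } from "connect-redis"
import RedisStore from "connect-redis"
import { createClient } from "redis"

const app = express()

// One Redis instance shared by every application server
const redisClient = createClient({ url: process.env.REDIS_URL })
await redisClient.connect()

app.use(
  session({
    // Session data lives in Redis, not in this process's memory
    store: new RedisStore({ client: redisClient }),
    secret: process.env.SESSION_SECRET,
    resave: false,
    saveUninitialized: false
  })
)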

Now any server instance can verify the session.

Stateless architecture is the prerequisite for scalable backend systems.
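The difference between per-instance memory and a shared store can be reproduced without any infrastructure. The following is a minimal sketch: `createServer`, `login`, and `handle` are illustrative names, and a plain `Map` stands in for Redis.

```javascript
// Sketch: the same two-instance setup, first with private memory, then with a shared store.
// The Map named sharedStore is a stand-in for Redis; all names are illustrative.

function createServer(name, sessionStore) {
  return {
    name,
    // Log a user in: create a session record and return its ID.
    login(userId) {
      const sessionId = `sess-${userId}`;
      sessionStore.set(sessionId, { userId });
      return sessionId;
    },
    // Handle a request: 200 if the session is known to this instance, 401 otherwise.
    handle(sessionId) {
      return sessionStore.has(sessionId) ? 200 : 401;
    },
  };
}

// Stateful: each instance owns a private Map.
const statefulA = createServer("A", new Map());
const statefulB = createServer("B", new Map());
const firstSession = statefulA.login("user123");
const statefulResult = statefulB.handle(firstSession); // session exists only on A

// Stateless: both instances read and write the same external store.
const sharedStore = new Map();
const statelessA = createServer("A", sharedStore);
const statelessB = createServer("B", sharedStore);
const secondSession = statelessA.login("user123");
const statelessResult = statelessB.handle(secondSession);

console.log({ statefulResult, statelessResult });
// { statefulResult: 401, statelessResult: 200 }
```

The server code is identical in both cases. Only the location of the session data changes, which is why externalizing state is usually a small refactor rather than a rewrite.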


Load Balancers — Distributing Traffic

Once multiple backend instances exist, a mechanism is required to distribute traffic among them. That component is the load balancer.

All incoming traffic first goes to the load balancer, which forwards requests to backend servers.

Common production load balancers:

  • NGINX

  • HAProxy

  • AWS Elastic Load Balancing (ALB / NLB)

  • Cloudflare

  • Envoy

Common load balancing algorithms

Round Robin

Requests are distributed sequentially.

Server selection pattern:

A → B → C → A → B → C

Works well when:

  • requests have similar processing cost

  • servers have equal capacity
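Round robin needs nothing more than a counter. A minimal sketch, where `createRoundRobin` is an illustrative name rather than any load balancer's API:

```javascript
// Sketch: round-robin selection over a fixed list of servers.
function createRoundRobin(servers) {
  let next = 0; // index of the server that receives the next request
  return function pick() {
    const server = servers[next];
    next = (next + 1) % servers.length; // wrap around after the last server
    return server;
  };
}

const pick = createRoundRobin(["A", "B", "C"]);
const order = Array.from({ length: 6 }, () => pick());

console.log(order.join(" → ")); // A → B → C → A → B → C
```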

Weighted Round Robin

Used when servers have different capacity.

Example configuration:

Server A weight = 2
Server B weight = 1
Server C weight = 1

Traffic ratio:

A : B : C = 2 : 1 : 1
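One simple way to get this ratio is to give each server as many slots in the rotation as its weight. The sketch below takes that approach; `createWeightedRoundRobin` is an illustrative name, and production balancers typically interleave the picks more smoothly than this does.

```javascript
// Sketch: weighted round robin by repeating each server `weight` times in the rotation.
function createWeightedRoundRobin(weights) {
  const slots = [];
  for (const [server, weight] of Object.entries(weights)) {
    for (let i = 0; i < weight; i++) slots.push(server);
  }
  let next = 0;
  return function pick() {
    const server = slots[next];
    next = (next + 1) % slots.length;
    return server;
  };
}

const pickWeighted = createWeightedRoundRobin({ A: 2, B: 1, C: 1 });

// Count where 400 requests end up.
const counts = { A: 0, B: 0, C: 0 };
for (let i = 0; i < 400; i++) counts[pickWeighted()]++;

console.log(counts); // { A: 200, B: 100, C: 100 }
```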

Least Connections

Requests are sent to the server with the fewest active connections.

Useful when:

  • request processing time varies

  • some requests are heavy

This algorithm adapts automatically to server load.
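The bookkeeping behind least connections is a counter per server that goes up when a request starts and down when it finishes. A minimal sketch with illustrative names (`acquire`, `release`):

```javascript
// Sketch: least-connections selection with per-server in-flight counters.
function createLeastConnections(servers) {
  const active = new Map(servers.map((server) => [server, 0]));
  return {
    // Pick the server with the fewest in-flight requests (earliest server wins ties).
    acquire() {
      let best = null;
      for (const [server, count] of active) {
        if (best === null || count < active.get(best)) best = server;
      }
      active.set(best, active.get(best) + 1);
      return best;
    },
    // Call when a request finishes so the server becomes attractive again.
    release(server) {
      active.set(server, Math.max(0, active.get(server) - 1));
    },
  };
}

const balancer = createLeastConnections(["A", "B", "C"]);
const first = balancer.acquire();  // A
const second = balancer.acquire(); // B
const third = balancer.acquire();  // C
balancer.release(second);          // B finishes its request quickly
const fourth = balancer.acquire(); // B again: it is the only idle server

console.log([first, second, third, fourth].join(" → ")); // A → B → C → B
```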

Health Checks

Load balancers continuously check server health using lightweight endpoints.

Example health endpoint:

app.get("/health", (req, res) => {
  res.status(200).send("OK")
})

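If a server fails a configured number of consecutive checks (no HTTP 200, or no response before the timeout), the load balancer removes it from rotation and restores it once the checks pass again.

This keeps new requests away from failing instances.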
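On the load balancer side, health tracking usually involves thresholds so that one slow response does not eject a server. The sketch below assumes hypothetical thresholds (3 failures to mark unhealthy, 2 successes to recover); real values are configured per load balancer.

```javascript
// Sketch: per-server health state driven by probe results.
// unhealthyAfter and healthyAfter are assumed thresholds, not defaults of any product.
function createHealthTracker({ unhealthyAfter = 3, healthyAfter = 2 } = {}) {
  let healthy = true;
  let failures = 0;  // consecutive failed probes
  let successes = 0; // consecutive successful probes
  return {
    // Record one probe result and return whether the server is in rotation.
    record(statusCode) {
      if (statusCode === 200) {
        successes++;
        failures = 0;
        if (!healthy && successes >= healthyAfter) healthy = true;
      } else {
        failures++;
        successes = 0;
        if (healthy && failures >= unhealthyAfter) healthy = false;
      }
      return healthy;
    },
  };
}

const tracker = createHealthTracker();
const probes = [200, 503, 503, 503, 200, 200];
const inRotation = probes.map((code) => tracker.record(code));

console.log(inRotation); // [ true, true, true, false, false, true ]
```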


Database Scaling

Scaling application servers is relatively easy. Scaling databases is much harder because databases maintain state and consistency.

Two primary strategies exist.


Read Replicas

A primary database handles writes while replicas serve read queries.

Flow:

Write queries → Primary database
Read queries → Replicas

Example architecture:

  • Primary DB → writes

  • Replica 1 → reads

  • Replica 2 → reads

  • Replica 3 → reads

This dramatically reduces load on the primary database.

Many SaaS applications are read-heavy. A rough, commonly cited split is:

70–90% reads
10–30% writes

In such workloads, replicas can serve most of the traffic.
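The routing rule above can be sketched as a small function. This is a deliberately naive version: it classifies a statement by its first keyword, and `createQueryRouter` and the connection names are illustrative. Statements inside transactions, or reads that take locks, would still need the primary.

```javascript
// Sketch: send SELECTs to replicas in rotation, everything else to the primary.
function createQueryRouter(primary, replicas) {
  let next = 0;
  return function route(sql) {
    const isRead = /^\s*select\b/i.test(sql);
    if (!isRead || replicas.length === 0) return primary;
    const replica = replicas[next];
    next = (next + 1) % replicas.length;
    return replica;
  };
}

const route = createQueryRouter("primary", ["replica-1", "replica-2", "replica-3"]);
const targets = [
  route("SELECT * FROM users WHERE id = 1"),
  route("UPDATE users SET name = 'Ada' WHERE id = 1"),
  route("select count(*) from orders"),
  route("INSERT INTO orders (user_id) VALUES (1)"),
];

console.log(targets); // [ 'replica-1', 'primary', 'replica-2', 'primary' ]
```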

Replication Lag

Replicas usually synchronize asynchronously, so they can briefly lag behind the primary.

Example:

  1. User updates profile

  2. Write occurs in primary DB

  3. Replica updates after ~200 ms

  4. Immediate read may still return old data

Solutions include:

  • read-after-write routing to primary

  • delaying read queries briefly

  • caching updated data on client
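The first option, read-after-write routing, can be sketched as follows. The 1000 ms window and the names `createStickyRouter`, `recordWrite`, and `targetForRead` are assumptions for illustration; the clock is injected so the behaviour is easy to follow.

```javascript
// Sketch: after a user writes, route that user's reads to the primary for a short window.
function createStickyRouter({ stickyMs = 1000, now = () => Date.now() } = {}) {
  const lastWriteAt = new Map(); // userId -> timestamp of that user's latest write
  return {
    recordWrite(userId) {
      lastWriteAt.set(userId, now());
    },
    targetForRead(userId) {
      const wroteAt = lastWriteAt.get(userId);
      const recentlyWrote = wroteAt !== undefined && now() - wroteAt < stickyMs;
      return recentlyWrote ? "primary" : "replica";
    },
  };
}

let clock = 0; // fake clock in milliseconds
const router = createStickyRouter({ stickyMs: 1000, now: () => clock });

router.recordWrite("user123");                      // profile update at t = 0
clock = 200;
const soonAfter = router.targetForRead("user123");  // still inside the window
const otherUser = router.targetForRead("user456");  // never wrote anything
clock = 1500;
const later = router.targetForRead("user123");      // window has passed

console.log({ soonAfter, otherUser, later });
// { soonAfter: 'primary', otherUser: 'replica', later: 'replica' }
```

With several application instances, the last-write timestamps would themselves need to live somewhere shared (a cookie or a shared cache, for example), for the same reason sessions do.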

Managed database services already handle much of this complexity.

Popular managed providers:

  • AWS RDS

  • Google Cloud SQL

  • Neon

  • PlanetScale

  • CockroachDB

  • YugabyteDB


Database Sharding

Sharding divides a large dataset across multiple databases.

Instead of storing everything in a single database:

Orders 1-5 → DB shard A
Orders 6-10 → DB shard B

Sharding improves:

  • query performance

  • throughput

  • storage scalability

Choosing a Sharding Key

A sharding key determines how data is distributed.

Examples:

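| Sharding key      | Typical benefit               |
|-------------------|-------------------------------|
| User ID           | Distributes users evenly      |
| Geographic region | Keeps data near users         |
| Date              | Useful for time-series data   |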

Example query routing:

function getShard(userId) {
  // Assumes a non-negative integer ID; returns a shard index from 0 to 3
  return userId % 4
}

This distributes users across 4 shards.
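The modulo only works directly on numeric IDs. For string keys such as UUIDs, a common approach is to hash the key first. The sketch below uses an FNV-1a style hash over UTF-16 code units purely as an example; `fnv1a` and `getShardForKey` are illustrative names, and any stable hash would serve.

```javascript
// Sketch: map arbitrary string keys to a shard index via a stable 32-bit hash.
function fnv1a(text) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep unsigned 32-bit
  }
  return hash;
}

function getShardForKey(key, shardCount) {
  return fnv1a(String(key)) % shardCount;
}

// Count how many of 1000 sample keys land on each of 4 shards.
const perShard = [0, 0, 0, 0];
for (let i = 0; i < 1000; i++) {
  perShard[getShardForKey(`user-${i}`, 4)]++;
}

console.log(perShard);
```

The same key always maps to the same shard, which is the property query routing depends on.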

Sharding introduces complexity such as:

  • cross-shard queries

  • distributed transactions

  • data rebalancing
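The rebalancing cost is easy to underestimate. With plain modulo routing, adding a single shard changes the target shard for most keys. A small sketch that counts this, with `countMovedKeys` as an illustrative helper:

```javascript
// Sketch: how many keys map to a different shard when the shard count changes
// under simple modulo routing (id % shardCount).
function countMovedKeys(totalKeys, oldShards, newShards) {
  let moved = 0;
  for (let id = 0; id < totalKeys; id++) {
    if (id % oldShards !== id % newShards) moved++;
  }
  return moved;
}

const moved = countMovedKeys(1000, 4, 5);
console.log(`${moved} of 1000 keys change shard`); // 800 of 1000 keys change shard
```

Going from 4 to 5 shards relocates 80% of the keys here. Techniques such as consistent hashing exist to reduce that fraction.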

Therefore most teams use managed distributed databases instead of building this from scratch.


Content Delivery Networks (CDNs)

CDNs reduce latency by serving content from servers closer to users.

Static assets are the primary use case:

  • JavaScript bundles

  • CSS

  • images

  • fonts

  • videos

Instead of requesting assets from a distant server, users receive them from nearby edge locations.

Benefits:

  • lower latency

  • reduced origin server load

  • improved global performance

Modern CDN providers include:

  • Cloudflare

  • AWS CloudFront

  • Fastly

  • Akamai

Example CDN caching headers

Cache-Control: public, max-age=3600

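This allows browsers and CDN nodes to cache the content for one hour (3600 seconds). To give CDN nodes a different lifetime from browsers, add the s-maxage directive.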
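In practice, different kinds of responses usually get different caching policies. The sketch below shows one plausible policy; the file-name pattern and the chosen lifetimes are assumptions, and `cacheControlFor` is an illustrative helper, not part of any framework.

```javascript
// Sketch: choose a Cache-Control value based on the kind of asset being served.
function cacheControlFor(path) {
  // Fingerprinted assets (e.g. app.3f9a1c2b.js) never change under the same URL,
  // so they can be cached for a year and marked immutable.
  if (/\.[0-9a-f]{8,}\.(js|css|woff2|png|jpg|svg)$/.test(path)) {
    return "public, max-age=31536000, immutable";
  }
  // HTML entry points should be revalidated so new deployments are picked up.
  if (path === "/" || path.endsWith(".html")) {
    return "no-cache";
  }
  // Everything else: cache for one hour.
  return "public, max-age=3600";
}

console.log(cacheControlFor("/assets/app.3f9a1c2b.js")); // public, max-age=31536000, immutable
console.log(cacheControlFor("/index.html"));             // no-cache
console.log(cacheControlFor("/images/logo.png"));        // public, max-age=3600
```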

CDNs also help protect against DDoS attacks by absorbing large traffic spikes at the edge before they reach the origin.


Edge Computing

Traditional CDNs primarily served static files.

Edge computing allows code execution at edge locations.

Examples:

  • Cloudflare Workers

  • Vercel Edge Functions

  • Fastly Compute (formerly Compute@Edge)

Edge computing is useful for:

  • authentication checks

  • localization

  • request routing

  • header manipulation
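Localization is a good fit because it needs only the request headers. A minimal sketch of choosing a locale from the Accept-Language header; `pickLocale` is an illustrative helper, and the matching rule here (exact tag, then base language) is a simplification.

```javascript
// Sketch: pick the best supported locale from an Accept-Language header value.
function pickLocale(acceptLanguage, supported, fallback) {
  const preferences = (acceptLanguage || "")
    .split(",")
    .map((part) => {
      const [tag, ...params] = part.trim().split(";");
      const q = params.map((p) => p.trim()).find((p) => p.startsWith("q="));
      // A missing q value means full preference (1).
      return { tag: tag.toLowerCase(), weight: q ? parseFloat(q.slice(2)) : 1 };
    })
    .filter((pref) => pref.tag && !Number.isNaN(pref.weight))
    .sort((a, b) => b.weight - a.weight); // highest preference first

  for (const { tag } of preferences) {
    if (supported.includes(tag)) return tag;   // exact match, e.g. "fr-ch"
    const base = tag.split("-")[0];
    if (supported.includes(base)) return base; // fall back to the base language, e.g. "fr"
  }
  return fallback;
}

console.log(pickLocale("fr-CH, fr;q=0.9, en;q=0.8", ["en", "fr"], "en")); // fr
console.log(pickLocale("de", ["en", "fr"], "en"));                        // en
```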

Example authentication at edge:

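export default async function handler(req) {
  const token = req.headers.get("authorization")

  // Reject requests with no credentials before they reach the origin.
  // This only checks that the header is present; a real check would also validate the token.
  if (!token) {
    return new Response("Unauthorized", { status: 401 })
  }

  // Forward the original request to the origin
  return fetch(req)
}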

Edge computing reduces unnecessary traffic reaching origin servers.

However, edge environments have limitations:

  • restricted runtime APIs

  • limited memory

  • execution time limits

They are designed for lightweight computation, not heavy workloads.


Asynchronous Processing

Many operations do not require immediate completion.

Performing them synchronously increases latency unnecessarily.

Examples of asynchronous tasks:

  • sending emails

  • notifications

  • video processing

  • report generation

  • background data cleanup

Instead of blocking user requests, these tasks are pushed to a message queue.

Background job architecture

Components:

  • Producer (application server)

  • Queue (Redis / RabbitMQ / Kafka)

  • Worker (background processor)

Example with BullMQ

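import { Queue } from "bullmq"

// BullMQ stores jobs in Redis; host and port here assume a local instance
const connection = { host: "localhost", port: 6379 }
const emailQueue = new Queue("email", { connection })

// Enqueue the job and return without waiting for the email to be sent
await emailQueue.add("sendInvite", {
  email: "user@example.com"
})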

Worker:

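import { Worker } from "bullmq"

const connection = { host: "localhost", port: 6379 }

// sendEmail is a placeholder for the application's own mail-sending function
new Worker("email", async job => {
  await sendEmail(job.data.email)
}, { connection })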

Benefits:

  • faster user response times

  • improved reliability

  • easier scaling of background workers

This pattern is heavily used in production systems.
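The reliability benefit comes largely from retries: a failed job goes back on the queue instead of failing the user's request. The sketch below imitates that with an in-memory queue. It is not how BullMQ is implemented; `createQueue`, `drain`, and the `maxAttempts` value are illustrative.

```javascript
// Sketch: producer / queue / worker with bounded retries, all in memory.
function createQueue({ maxAttempts = 3 } = {}) {
  const pending = [];
  const failed = [];
  return {
    // Producer side: enqueue and return immediately.
    add(name, data) {
      pending.push({ name, data, attempts: 0 });
    },
    // Worker side: process jobs until the queue is empty,
    // re-queueing failures until they have used up maxAttempts.
    drain(handler) {
      let completed = 0;
      while (pending.length > 0) {
        const job = pending.shift();
        job.attempts++;
        try {
          handler(job);
          completed++;
        } catch (err) {
          if (job.attempts < maxAttempts) pending.push(job);
          else failed.push(job);
        }
      }
      return { completed, failed: failed.length };
    },
  };
}

const queue = createQueue({ maxAttempts: 3 });
queue.add("sendInvite", { email: "user@example.com" });
queue.add("sendInvite", { email: "flaky@example.com" });
queue.add("sendInvite", { email: "broken@example.com" });

const sent = [];
let flakyFailuresLeft = 1; // the flaky address fails once, then succeeds
const result = queue.drain((job) => {
  if (job.data.email === "broken@example.com") throw new Error("mailbox rejected");
  if (job.data.email === "flaky@example.com" && flakyFailuresLeft-- > 0) {
    throw new Error("temporary failure");
  }
  sent.push(job.data.email);
});

console.log(result); // { completed: 2, failed: 1 }
console.log(sent);   // [ 'user@example.com', 'flaky@example.com' ]
```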


Monolith vs Microservices

A monolith is a single deployable application containing all business logic.

Advantages:

  • simple architecture

  • easy debugging

  • faster development

  • simpler deployments

Monoliths scale surprisingly well with horizontal scaling.

Microservices divide an application into independent services.

Examples:

  • authentication service

  • payment service

  • notification service

  • order service

When microservices make sense

Microservices are primarily about organizational scalability, not just performance.

They are useful when:

  • the engineering organization grows past roughly 100 engineers

  • services require independent scaling

  • different technologies are needed

Example:

image processing service → Rust
API service → Node.js
analytics pipeline → Python

Tradeoffs

Microservices introduce:

  • network latency

  • distributed debugging complexity

  • data consistency challenges

  • infrastructure overhead

For most startups and mid-scale systems, a well-structured monolith is often the best starting architecture.


Serverless Computing

Serverless abstracts away server management.

Developers deploy functions, and infrastructure automatically executes them when events occur.

Popular platforms:

  • AWS Lambda

  • Cloudflare Workers

  • Vercel Functions

  • Google Cloud Functions

Serverless execution model

Instead of running servers continuously:

Request arrives
↓
Function instance starts
↓
Code executes
↓
Response returned
↓
Instance frozen or terminated (platforms often keep it warm briefly for reuse)

Billing occurs only during execution time.

Advantages

  • automatic scaling

  • no server management

  • pay-per-execution pricing

  • excellent for burst traffic

Cold Start Problem

Because functions are created on demand, the first request may experience startup latency.

Modern improvements such as:

  • lightweight VM technologies

  • optimized runtimes

  • edge execution

have significantly reduced cold start delays.
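Whether a request pays the cold start depends on whether a warm instance is still around. Platforms commonly keep an instance alive for a while after it finishes a request, though the exact policy is platform-specific and usually not guaranteed. The sketch below models that with a hypothetical 5-second idle timeout and an injected clock.

```javascript
// Sketch: classify invocations as cold or warm based on an idle timeout.
// idleTimeoutMs is an assumed value, not a documented setting of any platform.
function createFunctionPlatform({ idleTimeoutMs, now }) {
  let warmUntil = -Infinity; // no instance exists yet
  return {
    invoke() {
      const t = now();
      const coldStart = t > warmUntil; // no warm instance available
      warmUntil = t + idleTimeoutMs;   // instance stays warm after this request
      return coldStart ? "cold" : "warm";
    },
  };
}

let time = 0; // fake clock in milliseconds
const platform = createFunctionPlatform({ idleTimeoutMs: 5000, now: () => time });

const starts = [];
starts.push(platform.invoke()); // t = 0: first request ever
time = 1000;
starts.push(platform.invoke()); // t = 1 s: instance still warm
time = 4000;
starts.push(platform.invoke()); // t = 4 s: still warm
time = 20000;
starts.push(platform.invoke()); // t = 20 s: instance was reclaimed while idle

console.log(starts); // [ 'cold', 'warm', 'warm', 'cold' ]
```

Steady traffic therefore sees mostly warm starts, while rarely used functions pay the cold start on nearly every call.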

Good serverless use cases

Serverless is ideal for:

  • event processing

  • image/video processing (jobs that fit within the platform's execution time limit)

  • webhooks

  • scheduled jobs

  • API endpoints with unpredictable traffic

Poor serverless use cases

Serverless is less suitable for:

  • long running tasks

  • latency-critical systems

  • persistent connections

  • heavy stateful workloads


Practical Scaling Strategy

In real systems, scaling evolves gradually.

Typical progression:

  1. Start with a simple monolith

  2. Add caching

  3. Optimize database queries

  4. Add horizontal scaling

  5. Introduce background jobs

  6. Use CDNs

  7. Scale the database with replicas

  8. Consider microservices if necessary

Most applications never need extreme architectures.


Key Interview Takeaways

Interviewers frequently test conceptual understanding.

Important points to remember:

Horizontal vs Vertical Scaling

Vertical → upgrade machine
Horizontal → add machines

Stateless Services

State must be externalized to enable scaling.

Load Balancers

Distribute traffic and monitor server health.

Database Scaling

Common techniques:

  • read replicas

  • sharding

CDN

Improves latency and reduces origin traffic.

Asynchronous Processing

Use message queues for tasks that do not need to finish before the response is sent.

Microservices

Primarily solve team scaling problems.

Serverless

Best for event-driven and burst workloads.


Final Principles for Scalable Systems

  1. Measure before optimizing
    Use observability tools like Prometheus, Grafana, or New Relic.

  2. Prefer simple architectures first
    Complexity increases operational cost.

  3. Scale gradually
    Optimize based on real bottlenecks.

  4. Design stateless services early

  5. Understand tradeoffs
    Every architecture decision introduces new constraints.
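As a concrete case of measuring first: averages hide the slow requests that users actually notice, so latency is usually reported as percentiles. A minimal sketch using the nearest-rank method on made-up sample values; `percentile` is an illustrative helper, and monitoring tools may compute percentiles differently.

```javascript
// Sketch: nearest-rank percentiles over a set of request latencies.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest rank: the smallest value with at least p% of samples at or below it.
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Made-up latencies in milliseconds: mostly fast, with two slow outliers.
const latenciesMs = [12, 15, 11, 14, 13, 250, 16, 12, 18, 900];

const mean = latenciesMs.reduce((sum, value) => sum + value, 0) / latenciesMs.length;
const p50 = percentile(latenciesMs, 50);
const p95 = percentile(latenciesMs, 95);

console.log({ mean, p50, p95 }); // { mean: 126.1, p50: 14, p95: 900 }
```

The typical request takes 14 ms, yet the mean suggests 126 ms and the tail reaches 900 ms. Each number points to a different bottleneck to investigate.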

Scalable systems are not built through a single technology choice. They emerge through careful measurement, iterative improvements, and disciplined architectural decisions.

Mastering these principles prepares engineers not only for system design interviews but also for building production-grade backend systems.


This is part of the series Backend First Principles. Next: Understanding Concurrency in Backend Systems

Backend First Principles

Part 2 of 17

This series documents my learning journey through the "Backend from First Principles" playlist. Instead of jumping directly into frameworks, the focus is on understanding the core concepts that power backend systems. Throughout this series, I explore how backend systems actually work — from the request-response lifecycle, HTTP fundamentals, routing, serialization, authentication, and validation to more advanced topics like caching, task queues, observability, security, and scaling. The goal of this series is to build a strong conceptual foundation for backend engineering that applies across languages and frameworks. By learning backend development from first principles, we gain a deeper understanding of how modern web systems are designed, built, and scaled.
