Backend Performance & Scaling: A Practical Engineering Guide


Modern backend systems must handle millions of requests, unpredictable traffic bursts, and strict latency expectations.
Understanding performance metrics, bottlenecks, database optimization, caching, and scaling strategies is essential for building reliable systems.

This article explains the core concepts engineers use to design high-performance backends, while also preparing you for system design interviews and real production scenarios.


Latency: The Core Performance Metric

Latency is the time taken for a request to travel through the entire system pipeline.

Typical backend flow:

  1. User interaction triggers a request

  2. Browser sends HTTP request

  3. API server processes request

  4. Server queries database / services

  5. Response returned

  6. Browser renders result

Latency = Total time between request initiation and response completion

Example:

User clicks "View Products"
→ HTTP request sent
→ Backend processes request
→ Database query executed
→ Response returned
→ UI rendered

If the total process takes 320 ms, the request latency is 320 ms.
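
One way to capture the server-side part of this number is to wrap a handler and record the elapsed time. The sketch below is only an illustration: timed and fakeHandler are hypothetical names, not part of any framework, and the measurement excludes network transfer and browser rendering.

```javascript
// Hypothetical helper: runs an async function and logs how long it took.
async function timed(label, fn) {
  const start = performance.now(); // monotonic clock, in milliseconds
  try {
    return await fn();
  } finally {
    const elapsedMs = performance.now() - start;
    console.log(`${label} took ${elapsedMs.toFixed(1)} ms`);
  }
}

// Stand-in for a real request handler (assumption for illustration).
async function fakeHandler() {
  await new Promise((resolve) => setTimeout(resolve, 20));
  return { status: 200 };
}
```

Calling timed("GET /products", fakeHandler) returns whatever the handler returns, so the wrapper can be added without changing the calling code.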

Why Latency Matters

Modern UX expectations:

Latency        User Perception
<100ms         Instant
100-300ms      Fast
300-1000ms     Noticeable
>1s            Slow

For large systems like payment gateways or SaaS platforms, keeping latency low is critical for retention and conversion.


Why Average Latency is Misleading

Average latency hides extremely slow requests.

Example:

Requests        Latency
990 requests    50ms
10 requests     5s

Average latency:

≈ 100 ms

But 1% of users experience a 5-second delay, which is unacceptable.

This is why engineers use percentile metrics.
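
The arithmetic behind that example can be checked directly. This small sketch rebuilds the 1000 requests from the table and computes the mean alongside the share of slow requests; the variable names are illustrative.

```javascript
// 990 fast requests at 50 ms and 10 slow requests at 5000 ms.
const latenciesMs = [
  ...Array(990).fill(50),
  ...Array(10).fill(5000),
];

// Mean latency across all requests.
const meanMs = latenciesMs.reduce((sum, ms) => sum + ms, 0) / latenciesMs.length;

// Fraction of requests that took one second or longer.
const slowShare = latenciesMs.filter((ms) => ms >= 1000).length / latenciesMs.length;

console.log(meanMs);    // 99.5
console.log(slowShare); // 0.01
```

The mean looks healthy at 99.5 ms, yet it says nothing about the 1% of requests that took 50 times longer.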


Percentile Latency (P50, P90, P99)

Percentiles describe the distribution of request times.

Metric    Meaning
P50       50% of requests complete within this time
P90       90% of requests complete within this time
P95       95% of requests complete within this time
P99       99% of requests complete within this time

Example:

P50 = 120ms
P90 = 350ms
P99 = 2s

Interpretation:

  • Half of requests complete within 120ms

  • 10% take longer than 350ms

  • 1% take longer than 2 seconds
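
To make the definition concrete, here is a minimal sketch that computes percentiles from raw samples using the nearest-rank method. That method is an assumption chosen for simplicity; monitoring systems often approximate percentiles from histograms instead. The sample data is made up.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b); // numeric sort, copy first
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Hypothetical sample: 100 requests taking 1 ms, 2 ms, ..., 100 ms.
const samples = Array.from({ length: 100 }, (_, i) => i + 1);

console.log(percentile(samples, 50)); // 50
console.log(percentile(samples, 90)); // 90
console.log(percentile(samples, 99)); // 99
```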

Why P99 is Important

The slowest requests often involve:

  • Complex queries

  • Payment flows

  • External API calls

  • Heavy business logic

Optimizing P95/P99 latency significantly improves system reliability.


Throughput: Requests a System Can Handle

Throughput measures system capacity.

Common metric:

Requests per second (RPS)

Example:

API server capacity = 2000 RPS

If traffic exceeds this:

Latency increases
Queues build up
Timeouts occur

Relationship Between Throughput and Latency

As throughput increases:

Latency initially increases slowly
Then rises sharply near capacity

This occurs because servers start queueing requests.
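
A rough way to see why the curve bends is the single-server queueing model (M/M/1), in which the average time a request spends in the system is 1 / (service rate − arrival rate). That model is a simplifying assumption introduced here for illustration; real servers have many workers and non-random arrivals, so the absolute numbers should not be taken literally.

```javascript
// Average time in system for an M/M/1 queue, in milliseconds.
// arrivalRps: incoming requests per second.
// serviceRps: maximum requests per second the server can complete.
function mm1LatencyMs(arrivalRps, serviceRps) {
  if (arrivalRps >= serviceRps) return Infinity; // the queue grows without bound
  return 1000 / (serviceRps - arrivalRps);
}

// Capacity of 2000 RPS, as in the throughput example.
for (const rps of [1000, 1600, 1800, 1900, 1980]) {
  const percent = ((rps / 2000) * 100).toFixed(0);
  console.log(`${rps} RPS (${percent}% load): ${mm1LatencyMs(rps, 2000).toFixed(1)} ms`);
}
```

Under this model, latency is 1 ms at 50% load, 5 ms at 90% load, and 50 ms at 99% load: the last few percent of capacity cost far more than the first half.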


System Utilization and Performance

Utilization measures how much of system capacity is used.

Utilization = current load / max capacity

Example:

Utilization    Behavior
20%            Idle
60%            Optimal
80%            High load
100%           Overloaded

Important principle:

Production systems should not run at 100% utilization

Most systems operate around:

60% – 80% utilization

This ensures buffer capacity for traffic bursts.


Traffic is Bursty

Traffic rarely arrives uniformly.

Example SaaS traffic:

Normal load: 300 RPS
Burst load: 2000 RPS

Bursts happen due to:

  • marketing campaigns

  • flash sales

  • viral content

  • batch jobs

Systems must maintain capacity headroom.
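
Headroom can be turned into a simple sizing calculation. The sketch below assumes each server handles 1000 RPS and that the fleet should stay at or below 70% utilization during a burst; both figures are assumptions for illustration, and serversNeeded is a hypothetical helper.

```javascript
// Number of servers needed so a peak load stays under the target utilization.
function serversNeeded(peakRps, perServerRps, targetUtilization) {
  if (targetUtilization <= 0 || targetUtilization > 1) {
    throw new RangeError("targetUtilization must be in (0, 1]");
  }
  return Math.ceil(peakRps / (perServerRps * targetUtilization));
}

// Burst load from the example: 2000 RPS.
console.log(serversNeeded(2000, 1000, 0.7)); // 3
// Normal load from the example: 300 RPS.
console.log(serversNeeded(300, 1000, 0.7));  // 1
```

Sizing for the normal load alone would leave the system far short of what the burst requires, which is why capacity planning starts from the peak.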


Finding Bottlenecks (Measure, Don’t Guess)

Performance tuning must always follow:

Measure → Identify bottleneck → Optimize

Never optimize blindly.

Common bottlenecks:

  • database queries

  • synchronous logging

  • serialization

  • network latency

  • external APIs


Example: Hidden Bottleneck

Example API:

GET /products/:id

Initial assumption:

Database is slow

Engineers add Redis caching.

But performance does not improve.

After instrumentation:

DB query = 8ms
Redis lookup = 3ms
Logging service = 500ms

Problem:

Logging performed synchronously

Correct solution:

async logging

Code Example: Async Logging

Bad

await logger.logToRemoteService(data);

Better

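setImmediate(() => {
  // Fire-and-forget, but catch failures to avoid an unhandled rejection
  logger.logToRemoteService(data).catch((err) => console.error(err));
});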

Or using a queue:

await logQueue.add(data);
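
The queue idea can be sketched without any external service. In this hypothetical in-memory version, log() only appends to a buffer on the request path, and flush() ships the batch later. createBufferedLogger and sendBatch are illustrative names, not an existing library API.

```javascript
// Hypothetical buffered logger: log() is cheap, flush() does the slow work.
function createBufferedLogger(sendBatch) {
  const buffer = [];

  return {
    // Called on the request path: never waits on the network.
    log(entry) {
      buffer.push(entry);
    },
    // Called off the request path, for example from a timer or on shutdown.
    async flush() {
      if (buffer.length === 0) return;
      const batch = buffer.splice(0, buffer.length); // take everything queued so far
      try {
        await sendBatch(batch);
      } catch (err) {
        // A logging failure should not crash the service; report and move on.
        console.error("log flush failed:", err);
      }
    },
    // Number of entries waiting to be shipped.
    pending() {
      return buffer.length;
    },
  };
}
```

A service would typically call flush() on an interval and once more during shutdown. The trade-off is that entries still in the buffer are lost if the process crashes before the next flush.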

Profiling and Distributed Tracing

Profiling

Profilers measure:

  • CPU usage

  • memory allocation

  • function execution time

Example Node.js profiling:

node --prof server.js

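This writes a V8 tick log (isolate-*.log) that node --prof-process can summarize. Visualization tools can also generate flame graphs showing which functions consume the most time.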


Distributed Tracing

Tracing tracks requests across services.

Typical tools:

  • OpenTelemetry

  • Jaeger

  • Zipkin

  • Datadog APM

Tracing shows:

API Gateway
  ↓
User Service
  ↓
Database
  ↓
Payment Service

Example insight:

API logic = 2ms
DB query = 800ms

Clear bottleneck identified.
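
Once span timings are available, finding the bottleneck is a matter of comparing them. The sketch below uses a simplified span shape ({ name, durationMs }), which is an assumption for illustration and not the OpenTelemetry data model; it also assumes the spans run one after another rather than overlapping.

```javascript
// Given the spans recorded for one request, report the slowest step
// and the share of total time it accounts for.
function findBottleneck(spans) {
  if (spans.length === 0) return null;
  const totalMs = spans.reduce((sum, s) => sum + s.durationMs, 0);
  const slowest = spans.reduce((a, b) => (b.durationMs > a.durationMs ? b : a));
  const share = totalMs === 0 ? 0 : slowest.durationMs / totalMs;
  return { name: slowest.name, share };
}

const spans = [
  { name: "API logic", durationMs: 2 },
  { name: "DB query", durationMs: 800 },
];

const bottleneck = findBottleneck(spans);
console.log(bottleneck.name);             // DB query
console.log(bottleneck.share.toFixed(3)); // 0.998
```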


Database Performance Optimization

Databases often become the first scaling bottleneck.

Key issues include:

  • N+1 queries

  • missing indexes

  • inefficient joins

  • excessive connections


The N+1 Query Problem

Common ORM issue.

Example scenario:

Fetch posts with authors.

Bad approach

const posts = await db.posts.findMany();

for (const post of posts) {
  post.author = await db.users.findUnique({
    where: { id: post.authorId }
  });
}

For 100 posts → 101 queries


Optimized Solution

Use joins or bulk fetch.

const posts = await db.posts.findMany({
  include: { author: true }
});

Or SQL:

SELECT p.*, u.name
FROM posts p
JOIN users u
ON p.author_id = u.id;
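
The bulk-fetch alternative can be sketched as follows: collect the author IDs, load all authors in one query, then attach them in memory. The db object here is an in-memory stand-in that only counts queries; findPosts and findUsersByIds are hypothetical names, not a real ORM API.

```javascript
// In-memory stand-in for a database (assumption for illustration).
const db = {
  queryCount: 0,
  users: [{ id: 1, name: "Ada" }, { id: 2, name: "Linus" }],
  posts: [
    { id: 10, authorId: 1 },
    { id: 11, authorId: 2 },
    { id: 12, authorId: 1 },
  ],
  async findPosts() {
    this.queryCount += 1;
    return this.posts.map((p) => ({ ...p }));
  },
  async findUsersByIds(ids) {
    this.queryCount += 1; // a single query, like WHERE id IN (...)
    return this.users.filter((u) => ids.includes(u.id));
  },
};

async function postsWithAuthors() {
  const posts = await db.findPosts();
  const authorIds = [...new Set(posts.map((p) => p.authorId))]; // de-duplicate IDs
  const authors = await db.findUsersByIds(authorIds);
  const byId = new Map(authors.map((u) => [u.id, u]));
  // Attach each author in memory instead of querying per post.
  return posts.map((p) => ({ ...p, author: byId.get(p.authorId) ?? null }));
}
```

However many posts are returned, this approach issues two queries instead of one per post plus one.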

Database Indexing

Indexes dramatically improve query performance.

Without index:

Full table scan

With index:

Direct lookup

Example:

CREATE INDEX idx_posts_author_id
ON posts(author_id);

Query:

SELECT * FROM posts WHERE author_id = 5;

Illustrative improvement (actual gains depend on table size and how selective the column is):

2s → 30ms
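
The difference between a scan and an index lookup can be imitated in memory. This is only an analogy: database indexes are typically B-trees rather than hash maps, and the table, sizes, and function names below are made up for illustration.

```javascript
// Hypothetical table of 100,000 posts spread across 1,000 authors.
const posts = Array.from({ length: 100000 }, (_, i) => ({ id: i, authorId: i % 1000 }));

// "Full table scan": inspect every row.
function scanByAuthor(rows, authorId) {
  return rows.filter((row) => row.authorId === authorId);
}

// "Index": built once, maps authorId to the matching rows.
function buildIndex(rows) {
  const index = new Map();
  for (const row of rows) {
    if (!index.has(row.authorId)) index.set(row.authorId, []);
    index.get(row.authorId).push(row);
  }
  return index;
}

const authorIndex = buildIndex(posts);

// "Direct lookup": one map lookup instead of 100,000 comparisons.
function lookupByAuthor(index, authorId) {
  return index.get(authorId) ?? [];
}
```

Both functions return the same rows. The index costs extra memory and must be kept up to date on every write, which mirrors the trade-off a real database makes.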

Composite Index

Useful when queries filter multiple columns.

CREATE INDEX idx_user_created
ON orders(user_id, created_at);

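Works for:

WHERE user_id = ?
WHERE user_id = ? AND created_at > ?

It generally does not help a query that filters only on created_at, because user_id is the leading column of the index.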

Covering Index

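Allows the database to serve results directly from the index.

Example (INCLUDE syntax as supported by PostgreSQL 11+ and SQL Server):

CREATE INDEX idx_user_email
ON users(email)
INCLUDE (id);

Query:

SELECT id FROM users WHERE email = 'a@b.com';

The database can answer the query from the index alone and avoid reading the main table.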


Connection Pooling

Creating a database connection is expensive.

Operations include:

  • TCP handshake

  • authentication

  • session allocation

  • memory allocation

Opening a connection per request causes huge overhead.


Connection Pooling Example

Node.js using pg:

import { Pool } from "pg";

const pool = new Pool({
  max: 20,
  host: "localhost",
  user: "postgres",
  password: "password",
  database: "appdb"
});

const result = await pool.query(
  "SELECT * FROM users WHERE id=$1",
  [1]
);

Connections are reused instead of recreated.
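
To show what "reused" means mechanically, here is a deliberately minimal pool sketch. It is not how pg implements its pool; SimplePool is a hypothetical class, and real pools also handle timeouts, health checks, and idle eviction.

```javascript
// Hypothetical minimal connection pool.
class SimplePool {
  constructor(createConnection, max) {
    this.createConnection = createConnection;
    this.max = max;
    this.idle = [];    // connections ready for reuse
    this.total = 0;    // connections created so far
    this.waiters = []; // callers waiting for a free connection
  }

  async acquire() {
    if (this.idle.length > 0) return this.idle.pop(); // reuse an idle connection
    if (this.total < this.max) {
      this.total += 1;
      try {
        return await this.createConnection(); // the expensive step
      } catch (err) {
        this.total -= 1; // creation failed, free the slot
        throw err;
      }
    }
    // Pool exhausted: wait until someone calls release().
    return new Promise((resolve) => this.waiters.push(resolve));
  }

  release(conn) {
    const waiter = this.waiters.shift();
    if (waiter) waiter(conn); // hand the connection straight to a waiting caller
    else this.idle.push(conn);
  }
}
```

The max setting plays the same role as max: 20 in the pg example: it caps how many connections exist at once, and extra callers wait instead of opening new ones.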


External Pooling (PgBouncer)

In large systems:

App servers → PgBouncer → Database

Benefits:

  • connection reuse

  • prevents connection exhaustion

  • improves throughput


Caching

Caching stores results of expensive operations.

Typical improvement:

DB query = 800ms
Cache lookup = 5ms

Common caching tools:

  • Redis

  • Memcached

  • Valkey


Cache Invalidation

Two major strategies exist.

Time Based

Cache expires automatically.

TTL = 5 minutes

Example Redis (ioredis-style arguments; node-redis v4 takes an options object such as { EX: 300 } instead):

await redis.set(
  "user:42",
  JSON.stringify(user),
  "EX",
  300
);

Event Based

Invalidate cache when data changes.

await redis.del(`user:${userId}`);

Cache Patterns

Cache Aside

Most widely used.

Flow:

check cache
→ miss
→ query DB
→ store in cache

Example:

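const cached = await redis.get(key);
let user;

if (cached) {
  // Cache hit: values are stored as JSON strings, so parse before use
  user = JSON.parse(cached);
} else {
  // Cache miss: load from the database, then populate the cache with a TTL
  user = await db.getUser(id);
  await redis.set(key, JSON.stringify(user), "EX", 300);
}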
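
The pieces covered so far (TTL expiry, event-based invalidation, and the cache-aside read path) can be combined in one runnable sketch. TtlCache is an in-memory stand-in for Redis and getUser is a hypothetical helper; the injectable clock exists only to make the expiry behavior easy to test.

```javascript
// In-memory stand-in for Redis (assumption); `now` is injectable for testing.
class TtlCache {
  constructor(now = () => Date.now()) {
    this.now = now;
    this.entries = new Map(); // key -> { value, expiresAt }
  }
  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // time-based expiry
      return undefined;
    }
    return entry.value;
  }
  set(key, value, ttlMs) {
    this.entries.set(key, { value, expiresAt: this.now() + ttlMs });
  }
  del(key) {
    this.entries.delete(key); // event-based invalidation
  }
}

// Cache-aside read: check the cache, and on a miss load and store the result.
async function getUser(cache, loadUser, id) {
  const key = `user:${id}`;
  const cached = cache.get(key);
  if (cached !== undefined) return cached;
  const user = await loadUser(id);
  cache.set(key, user, 5 * 60 * 1000); // TTL = 5 minutes
  return user;
}
```

A write path would update the database and then call cache.del(`user:${id}`) so the next read reloads fresh data.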

Write Through

Write to the cache and the DB in the same operation, so the cache stays consistent with the database. The cost is slower writes.


Write Behind

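Write to the cache first and persist to the DB later, asynchronously.

Used in high-throughput systems, with the trade-off that writes not yet persisted are lost if the cache fails.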
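
The difference between the two write patterns is easiest to see side by side. In this sketch, cache and store are plain Maps standing in for Redis and a database, and everything is synchronous for brevity; both are simplifying assumptions, and the function names are hypothetical.

```javascript
// Write-through: the store is updated before the write is acknowledged.
function writeThrough(cache, store, key, value) {
  store.set(key, value);
  cache.set(key, value);
}

// Write-behind: the cache is updated immediately, the store on a later flush.
function createWriteBehind(cache, store) {
  const dirty = new Map(); // keys written to the cache but not yet persisted

  return {
    write(key, value) {
      cache.set(key, value); // fast path: cache only
      dirty.set(key, value); // remember to persist later
    },
    // Persists all pending writes and returns how many keys were flushed.
    flush() {
      for (const [key, value] of dirty) store.set(key, value);
      const flushed = dirty.size;
      dirty.clear();
      return flushed;
    },
  };
}
```

Repeated writes to the same key collapse into a single store write at flush time, which is where write-behind gets its throughput advantage.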


Cache Hit Ratio

Measures caching effectiveness.

hit ratio = cache hits / total requests

Example:

90% hit ratio

Meaning:

90% requests served from cache
10% hit database

A low hit ratio usually means the caching strategy needs rework: the wrong data is cached, TTLs are too short, or keys are too fine-grained.
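
The ratio also translates directly into average latency. The sketch below reuses the earlier illustrative figures of 5 ms for a cache lookup and 800 ms for a DB query; it is simplified in that a miss is charged only the DB time, and the function names are hypothetical.

```javascript
// Fraction of requests served from the cache.
function hitRatio(hits, misses) {
  const total = hits + misses;
  return total === 0 ? 0 : hits / total;
}

// Average latency when hits cost cacheMs and misses cost dbMs.
function averageLatencyMs(hits, misses, cacheMs, dbMs) {
  const total = hits + misses;
  return total === 0 ? 0 : (hits * cacheMs + misses * dbMs) / total;
}

console.log(hitRatio(900, 100));                 // 0.9
console.log(averageLatencyMs(900, 100, 5, 800)); // 84.5
console.log(averageLatencyMs(500, 500, 5, 800)); // 402.5
```

Even at a 90% hit ratio, the misses dominate the average, which is why small improvements in hit ratio can have a large effect.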


Vertical Scaling (Scaling Up)

Upgrade the hardware of a single machine.

Example upgrades:

  • more CPU cores

  • more RAM

  • faster SSD

  • better network

Benefits:

  • simple architecture

  • minimal code changes

Limitations:

  • hardware limits

  • single point of failure

  • no geographic distribution


Horizontal Scaling (Scaling Out)

Add more servers.

Example:

1 server → 1000 RPS
5 servers → ~5000 RPS

Benefits:

  • near-linear scaling for stateless services (until a shared dependency such as the database saturates)

  • fault tolerance

  • geo-distribution


Load Balancing

Traffic must be distributed across servers.

Common algorithms:

  • round robin

  • least connections

  • IP hashing
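
Each of these algorithms fits in a few lines. The sketch below is a simplified illustration rather than how any particular load balancer implements them; server names are plain strings and the hash function is a basic one chosen for brevity.

```javascript
// Round robin: cycle through the servers in order.
function roundRobin(servers) {
  let next = 0;
  return () => {
    const server = servers[next];
    next = (next + 1) % servers.length;
    return server;
  };
}

// Least connections: pick the server with the fewest active connections.
// activeConnections is a Map of server -> current connection count.
function leastConnections(servers, activeConnections) {
  return servers.reduce((best, s) =>
    (activeConnections.get(s) ?? 0) < (activeConnections.get(best) ?? 0) ? s : best
  );
}

// IP hashing: the same client IP always maps to the same server.
function ipHash(servers, clientIp) {
  let hash = 0;
  for (const ch of clientIp) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // keep as unsigned 32-bit
  }
  return servers[hash % servers.length];
}
```

Round robin suits uniform requests, least connections adapts when some requests run much longer than others, and IP hashing keeps a client on one server, which matters when servers hold per-client state.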

Example architecture:

Client
   ↓
Load Balancer
   ↓
Server Cluster
   ↓
Database

Popular load balancers:

  • Nginx

  • HAProxy

  • AWS ELB

  • Cloudflare


Challenges of Horizontal Scaling

Horizontal scaling introduces complexity.

Problems include:

  • state synchronization

  • distributed transactions

  • network partitions

  • consistency issues

Solutions involve:

  • stateless services

  • distributed caches

  • message queues

  • consensus protocols


Key Interview Takeaways

Engineers should understand:

  • Core performance metrics: latency, throughput, percentiles, utilization

  • Database optimization: indexes, N+1 queries, connection pooling

  • Caching: cache aside, TTL, cache invalidation, hit ratio

  • Scaling: vertical scaling, horizontal scaling, load balancing

  • Observability: profiling, tracing, metrics

Final Thoughts

Performance engineering is not about blind optimizations.

It requires:

  • measurement

  • bottleneck analysis

  • data-driven decisions

The best backend engineers focus on:

  • optimizing P95/P99 latency

  • maintaining capacity headroom

  • designing scalable distributed architectures

Mastering these concepts enables you to design systems capable of handling millions of users reliably and efficiently.


This article is part of the series Backend First Principles. Next: Modern Backend Scaling

Backend First Principles

Part 3 of 17

This series documents my learning journey through the "Backend from First Principles" playlist. Instead of jumping directly into frameworks, the focus is on understanding the core concepts that power backend systems. Throughout this series, I explore how backend systems actually work — from the request-response lifecycle, HTTP fundamentals, routing, serialization, authentication, and validation to more advanced topics like caching, task queues, observability, security, and scaling. The goal of this series is to build a strong conceptual foundation for backend engineering that applies across languages and frameworks. By learning backend development from first principles, we gain a deeper understanding of how modern web systems are designed, built, and scaled.
