Backend Performance & Scaling — Practical Engineering Guide

Modern backend systems must handle millions of requests, unpredictable traffic bursts, and strict latency expectations.
Understanding performance metrics, bottlenecks, database optimization, caching, and scaling strategies is essential for building reliable systems.

This article explains the core concepts engineers use to design high-performance backends, while also preparing you for system design interviews and real production scenarios.

Latency: The Core Performance Metric

Latency is the time taken for a request to travel through the entire system pipeline.

Typical backend flow:

User interaction triggers a request
Browser sends HTTP request
API server processes request
Server queries database / services
Response returned
Browser renders result

Latency = Total time between request initiation and response completion

Example:

User clicks "View Products"
→ HTTP request sent
→ Backend processes request
→ Database query executed
→ Response returned
→ UI rendered

If the total process takes 320 ms, the request latency is 320 ms.

Why Latency Matters

Modern UX expectations:

Latency	User Perception
<100ms	Instant
100-300ms	Fast
300-1000ms	Noticeable
>1s	Slow

For large systems like payment gateways or SaaS platforms, keeping latency low is critical for retention and conversion.

Why Average Latency is Misleading

Average latency hides extreme slow requests.

Example:

Request	Latency
990 requests	50ms
10 requests	5s

Average latency:

≈ 100 ms

But 1% of requests take 5 seconds, which is unacceptable for the users who hit them.

This is why engineers use percentile metrics.
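The arithmetic above can be checked with a few lines of JavaScript. This is a minimal sketch: the `latencies` array simply writes out the example's 990 fast and 10 slow requests in milliseconds, and the variable names are illustrative.

```javascript
// 990 requests at 50 ms and 10 requests at 5 s (5000 ms), as in the example.
const latencies = [...Array(990).fill(50), ...Array(10).fill(5000)];

const mean = latencies.reduce((sum, ms) => sum + ms, 0) / latencies.length;
const slowCount = latencies.filter((ms) => ms >= 1000).length;

console.log(`mean = ${mean} ms`); // mean = 99.5 ms
console.log(`${slowCount} of ${latencies.length} requests took >= 1 s`); // 10 of 1000 requests took >= 1 s
```

The mean looks healthy even though ten requests were a hundred times slower than the rest.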

Percentile Latency (P50, P90, P99)

Percentiles describe the distribution of request times.

Metric	Meaning
P50	50% of requests complete within this time
P90	90% of requests complete within this time
P95	95% within this time
P99	99% within this time

Example:

P50 = 120ms
P90 = 350ms
P99 = 2s

Interpretation:

Half of all requests complete within 120ms
10% take longer than 350ms
1% take longer than 2 seconds
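For intuition, a percentile can be computed directly from raw samples. The sketch below uses the nearest-rank definition. The `percentile` helper and the synthetic data are assumptions for illustration; monitoring systems usually estimate percentiles from histograms instead of sorting every sample.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Synthetic data: 100 requests taking 1 ms, 2 ms, ..., 100 ms.
const samples = Array.from({ length: 100 }, (_, i) => i + 1);

console.log(percentile(samples, 50)); // 50
console.log(percentile(samples, 90)); // 90
console.log(percentile(samples, 99)); // 99
```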

Why P99 is Important

The slowest requests often involve:

Complex queries
Payment flows
External API calls
Heavy business logic

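Tail latency also affects more users than the percentages suggest: a page that issues many requests is only as fast as its slowest one. Optimizing P95/P99 latency therefore significantly improves perceived reliability.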

Throughput: Requests a System Can Handle

Throughput measures system capacity.

Common metric:

Requests per second (RPS)

Example:

API server capacity = 2000 RPS

If traffic exceeds this:

Latency increases
Queues build up
Timeouts occur

Relationship Between Throughput and Latency

As throughput increases:

Latency initially increases slowly
Then rises sharply near capacity

This occurs because servers start queueing requests.
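The shape of that curve can be illustrated with a textbook queueing model. The sketch below assumes an M/M/1 queue (random arrivals, a single server), which is a simplification, not a measurement of any real system. The function name `meanLatencyMs` is hypothetical.

```javascript
// Simplified M/M/1 queueing model (an assumption, not a measurement):
// mean time in system W = 1 / (mu - lambda), both rates in requests/second.
function meanLatencyMs(arrivalRps, capacityRps) {
  if (arrivalRps >= capacityRps) return Infinity; // the queue grows without bound
  return 1000 / (capacityRps - arrivalRps);
}

const capacityRps = 2000; // the 2000 RPS capacity used in the example above
for (const load of [500, 1000, 1800, 1980]) {
  const utilization = Math.round((load / capacityRps) * 100);
  const latency = meanLatencyMs(load, capacityRps).toFixed(2);
  console.log(`${load} RPS (${utilization}% utilized): ${latency} ms`);
}
// 1000 RPS -> 1.00 ms, 1800 RPS -> 5.00 ms, 1980 RPS -> 50.00 ms
```

Going from 50% to 90% utilization multiplies latency by five in this model, and the last few percent before capacity multiply it again by ten.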

System Utilization and Performance

Utilization measures how much of system capacity is used.

Utilization = current load / max capacity

Example:

Utilization	Behavior
20%	Idle
60%	Optimal
80%	High load
100%	Overloaded

Important principle:

Production systems should not run at 100% utilization

Most systems operate around:

60% – 80% utilization

This ensures buffer capacity for traffic bursts.

Traffic is Bursty

Traffic rarely arrives uniformly.

Example SaaS traffic:

Normal load: 300 RPS
Burst load: 2000 RPS

Bursts happen due to:

marketing campaigns
flash sales
viral content
batch jobs

Systems must maintain capacity headroom.
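Headroom can be turned into a rough sizing calculation. This is a sketch under stated assumptions: the per-server capacity of 1000 RPS and the 70% utilization target are illustrative numbers, and `serversNeeded` is a hypothetical helper.

```javascript
// How many servers keep utilization at or below the target for a given load?
function serversNeeded(peakRps, perServerRps, targetUtilization) {
  if (targetUtilization <= 0 || targetUtilization > 1) {
    throw new RangeError("targetUtilization must be in (0, 1]");
  }
  return Math.ceil(peakRps / (perServerRps * targetUtilization));
}

// Assumed numbers: 1000 RPS per server and a 70% utilization target.
console.log(serversNeeded(300, 1000, 0.7));  // 1 (normal load)
console.log(serversNeeded(2000, 1000, 0.7)); // 3 (burst load)
```

Sizing for the burst rather than the normal load is what keeps the system inside the 60% to 80% band when traffic spikes.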

Finding Bottlenecks (Measure, Don’t Guess)

Performance tuning must always follow:

Measure → Identify bottleneck → Optimize

Never optimize blindly.

Common bottlenecks:

database queries
synchronous logging
serialization
network latency
external APIs

Example: Hidden Bottleneck

Example API:

GET /products/:id

Initial assumption:

Database is slow

Engineers add Redis caching.

But performance does not improve.

After instrumentation:

DB query = 8ms
Redis lookup = 3ms
Logging service = 500ms
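A breakdown like the one above can come from wrapping each step in a timer. The following is a minimal sketch, not a replacement for a tracing library: `timed` and `handleRequest` are hypothetical names, and `sleep` stands in for the real database and logging calls.

```javascript
// Hypothetical helper: runs an async step and records its duration in ms.
// `performance` is a global in Node.js 16+ and in browsers.
async function timed(label, timings, step) {
  const start = performance.now();
  try {
    return await step();
  } finally {
    timings[label] = performance.now() - start;
  }
}

// Stand-in for real work (DB query, remote log call).
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function handleRequest() {
  const timings = {};
  await timed("db", timings, () => sleep(5));
  await timed("log", timings, () => sleep(60));
  return timings;
}

handleRequest().then((timings) => console.log(timings));
```

Printing or exporting the per-step timings makes the dominant step visible before any optimization work starts.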

Problem:

Logging performed synchronously

Correct solution:

async logging

Code Example: Async Logging

Bad

await logger.logToRemoteService(data);

Better

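setImmediate(() => {
  // Not awaited, so the response is not delayed; catch so a failure is not an unhandled rejection.
  logger.logToRemoteService(data).catch(console.error);
});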

Or using a queue:

await logQueue.add(data);
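The queue idea can be sketched in-process as a buffer that is flushed in batches. This is an illustration under assumptions: `BufferedLogger` and its `send` callback are hypothetical, and a production setup would more likely use a dedicated queue or log shipper.

```javascript
class BufferedLogger {
  // send: async function receiving an array of entries (hypothetical transport).
  constructor(send, maxBatch = 100) {
    this.send = send;
    this.maxBatch = maxBatch;
    this.buffer = [];
  }

  // Called on the request path: only appends to memory, never waits on the network.
  log(entry) {
    this.buffer.push(entry);
  }

  // Called off the request path, for example from a timer.
  async flush() {
    while (this.buffer.length > 0) {
      const batch = this.buffer.splice(0, this.maxBatch);
      try {
        await this.send(batch);
      } catch (err) {
        console.error("log flush failed, dropping batch", err);
      }
    }
  }
}
```

A periodic `setInterval(() => logger.flush(), 1000)` would drain the buffer in the background. Entries still in memory are lost if the process crashes, which is usually acceptable for logs but not for business data.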

Profiling and Distributed Tracing

Profiling

Profilers measure:

CPU usage
memory allocation
function execution time

Example Node.js profiling:

node --prof server.js

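This writes a V8 log file (isolate-*.log) that node --prof-process can summarize. Visualization tools can turn profiling data into flame graphs showing slow functions.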

Distributed Tracing

Tracing tracks requests across services.

Typical tools:

OpenTelemetry
Jaeger
Zipkin
Datadog APM

Tracing shows:

API Gateway
  ↓
User Service
  ↓
Database
  ↓
Payment Service

Example insight:

API logic = 2ms
DB query = 800ms

Clear bottleneck identified.

Database Performance Optimization

Databases often become the first scaling bottleneck.

Key issues include:

N+1 queries
missing indexes
inefficient joins
excessive connections

The N+1 Query Problem

Common ORM issue.

Example scenario:

Fetch posts with authors.

Bad approach

const posts = await db.posts.findMany();

for (const post of posts) {
  post.author = await db.users.findUnique({
    where: { id: post.authorId }
  });
}

For 100 posts → 101 queries (1 for the posts, then 1 per post for its author)

Optimized Solution

Use joins or bulk fetch.

const posts = await db.posts.findMany({
  include: { author: true }
});

Or SQL:

SELECT p.*, u.name
FROM posts p
JOIN users u
ON p.author_id = u.id;
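The "bulk fetch" alternative can be sketched without an ORM: collect the distinct author IDs, load them in one query, and join in memory. The names below are assumptions; `fetchUsersByIds` stands for any data-access function that runs a single query such as `WHERE id IN (...)`.

```javascript
// posts: array of { authorId, ... }
// fetchUsersByIds: hypothetical async function that loads all given users in ONE query.
async function attachAuthors(posts, fetchUsersByIds) {
  // Deduplicate so each author is requested once.
  const ids = [...new Set(posts.map((post) => post.authorId))];
  const users = await fetchUsersByIds(ids);

  // Index users by id for O(1) lookups while joining in memory.
  const byId = new Map(users.map((user) => [user.id, user]));
  return posts.map((post) => ({ ...post, author: byId.get(post.authorId) ?? null }));
}
```

However many posts are passed in, this costs two queries in total: one for the posts and one for their authors.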

Database Indexing

Indexes dramatically improve query performance.

Without index:

Full table scan

With index:

Direct lookup

Example:

CREATE INDEX idx_posts_author_id
ON posts(author_id);

Query:

SELECT * FROM posts WHERE author_id = 5;

Illustrative improvement (actual numbers depend on table size and data distribution):

2s → 30ms

Composite Index

Useful when queries filter multiple columns.

CREATE INDEX idx_user_created
ON orders(user_id, created_at);

Works for:

WHERE user_id = ?
WHERE user_id = ? AND created_at > ?
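Why these two filters work, while a filter on created_at alone generally cannot seek into the index, follows from how entries are ordered. The sketch below models the index as an array sorted by (userId, createdAt). It is a toy illustration of the ordering, not how any database engine is implemented, and every name in it is hypothetical.

```javascript
// Index entries kept sorted by (userId, createdAt), like idx_user_created.
const index = [
  [1, 100], [1, 200], [2, 50], [2, 150], [2, 300], [3, 120],
].map(([userId, createdAt], rowId) => ({ userId, createdAt, rowId }));

// Binary search: first position whose key is >= (userId, createdAt).
function lowerBound(entries, userId, createdAt) {
  let lo = 0;
  let hi = entries.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    const e = entries[mid];
    const less = e.userId < userId || (e.userId === userId && e.createdAt < createdAt);
    if (less) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// WHERE user_id = ? [AND created_at > ?]: seek once, then read a contiguous run.
function scan(entries, userId, after = -Infinity) {
  const rowIds = [];
  for (let i = lowerBound(entries, userId, after); i < entries.length && entries[i].userId === userId; i++) {
    if (entries[i].createdAt > after) rowIds.push(entries[i].rowId);
  }
  return rowIds;
}

console.log(scan(index, 2));      // [ 2, 3, 4 ]
console.log(scan(index, 2, 100)); // [ 3, 4 ]
```

Rows with `createdAt > 100` across all users are scattered through the array (row IDs 1, 3, 4, 5), so there is no single starting point to seek to. This is why column order in a composite index matters.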

Covering Index

Allows the database to serve a query entirely from the index.

Example:

CREATE INDEX idx_user_email
ON users(email)
INCLUDE (id);

Query:

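SELECT id FROM users WHERE email = 'a@b.com';

The database can answer this from the index alone (an index-only scan) without reading the main table. The INCLUDE clause shown is PostgreSQL 11+ and SQL Server syntax.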

Connection Pooling

Creating a database connection is expensive.

Operations include:

TCP handshake
authentication
session allocation
memory allocation

Opening a connection per request causes huge overhead.

Connection Pooling Example

Node.js using pg:

import { Pool } from "pg";

const pool = new Pool({
  max: 20,
  host: "localhost",
  user: "postgres",
  password: "password",
  database: "appdb"
});

const result = await pool.query(
  "SELECT * FROM users WHERE id=$1",
  [1]
);

Connections are reused instead of recreated.
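To make the reuse concrete, here is a toy pool showing the acquire/release cycle that libraries such as pg implement internally. It is a sketch under assumptions: `SimplePool` and `createConnection` are hypothetical, and real pools also handle timeouts, broken connections, and idle eviction.

```javascript
class SimplePool {
  // createConnection: hypothetical async factory that opens one connection.
  constructor(createConnection, max) {
    this.createConnection = createConnection;
    this.max = max;
    this.idle = [];     // connections ready for reuse
    this.size = 0;      // connections created so far
    this.waiters = [];  // resolvers for callers waiting on a free connection
  }

  async acquire() {
    if (this.idle.length > 0) return this.idle.pop(); // reuse, no handshake
    if (this.size < this.max) {
      this.size++;
      try {
        return await this.createConnection();
      } catch (err) {
        this.size--; // creation failed, free the slot
        throw err;
      }
    }
    // Pool exhausted: wait until another caller releases a connection.
    return new Promise((resolve) => this.waiters.push(resolve));
  }

  release(conn) {
    const next = this.waiters.shift();
    if (next) next(conn);
    else this.idle.push(conn);
  }
}
```

With pg, `pool.query(...)` performs the acquire, query, and release steps on the caller's behalf, so application code rarely manages connections by hand.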

External Pooling (PgBouncer)

In large systems:

App servers → PgBouncer → Database

Benefits:

connection reuse
prevents connection exhaustion
improves throughput

Caching

Caching stores results of expensive operations.

Typical improvement:

DB query = 800ms
Cache lookup = 5ms

Common caching tools:

Redis
Memcached
Valkey

Cache Invalidation

Two major strategies exist.

Time Based

Cache expires automatically.

TTL = 5 minutes

Example Redis (ioredis-style arguments):

await redis.set(
  "user:42",
  JSON.stringify(user),
  "EX",
  300
);

Event Based

Invalidate cache when data changes.

await redis.del(`user:${userId}`);
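Both strategies can be seen side by side in a small in-memory cache. This is a sketch for illustration only: `TtlCache` is a hypothetical class, and the injectable `now` function exists so that expiry can be demonstrated without waiting.

```javascript
class TtlCache {
  constructor(ttlMs, now = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;            // injectable clock, defaults to wall time
    this.entries = new Map();  // key -> { value, expiresAt }
  }

  set(key, value) {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  // Time based: an entry past its expiry is treated as a miss.
  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  // Event based: call this whenever the underlying data changes.
  delete(key) {
    this.entries.delete(key);
  }
}

const userCache = new TtlCache(5 * 60 * 1000); // TTL = 5 minutes
userCache.set("user:42", { name: "Ann" });
userCache.delete("user:42"); // e.g. right after the user row is updated
```

The two approaches are often combined: explicit deletes keep data fresh on known writes, while the TTL bounds staleness for anything the deletes miss.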

Cache Patterns

Cache Aside

Most widely used.

Flow:

check cache
→ miss
→ query DB
→ store in cache

Example:

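const cached = await redis.get(key);
let user = cached ? JSON.parse(cached) : null; // the cache stores JSON strings

if (!user) {
  user = await db.getUser(id);
  await redis.set(key, JSON.stringify(user), "EX", 300); // TTL so stale entries expire
}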

Write Through

Every write goes to the cache and the DB in the same operation, so the cache never lags behind the database.

Write Behind

Write to cache first, DB later asynchronously.

Used in high throughput systems.
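The difference between the two write patterns is easiest to see in code. The sketch below assumes a `cache` with a Map-like `set` method and a hypothetical `db.save(key, value)` async call. Neither is a real library API.

```javascript
// Write-through: the write completes only after both stores are updated.
async function writeThrough(cache, db, key, value) {
  await db.save(key, value); // hypothetical async persistence call
  cache.set(key, value);
}

// Write-behind: update the cache immediately, persist later in a batch.
class WriteBehind {
  constructor(cache, db) {
    this.cache = cache;
    this.db = db;
    this.dirty = new Map(); // key -> latest value not yet persisted
  }

  write(key, value) {
    this.cache.set(key, value);
    this.dirty.set(key, value); // repeated writes to one key collapse into one save
  }

  // Run periodically. Anything still in `dirty` is lost if the process
  // crashes first, and a failed save here is not retried in this sketch.
  async flush() {
    const pending = [...this.dirty];
    this.dirty.clear();
    for (const [key, value] of pending) {
      await this.db.save(key, value);
    }
  }
}
```

Write-behind trades durability for throughput: callers never wait on the database, and several writes to the same key cost a single save.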

Cache Hit Ratio

Measures caching effectiveness.

hit ratio = cache hits / total requests

Example:

90% hit ratio

Meaning:

90% requests served from cache
10% hit database

A low hit ratio usually points to a poor caching strategy, such as badly chosen keys, TTLs that are too short, or data that is rarely read twice.
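The hit ratio translates directly into average latency. As a rough sketch, the function below reuses the 5 ms cache lookup and 800 ms database query figures from the caching example. `averageLatencyMs` is a hypothetical helper.

```javascript
// Average latency for a given mix of cache hits and misses.
function averageLatencyMs(hits, misses, cacheMs, dbMs) {
  const total = hits + misses;
  if (total === 0) return 0;
  return (hits * cacheMs + misses * dbMs) / total;
}

// 90% hit ratio: 900 hits and 100 misses.
console.log(averageLatencyMs(900, 100, 5, 800)); // 84.5
// 50% hit ratio: 500 hits and 500 misses.
console.log(averageLatencyMs(500, 500, 5, 800)); // 402.5
```

Even at a 90% hit ratio the misses dominate the average, which is why small improvements in hit ratio can have a large effect.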

Vertical Scaling (Scaling Up)

Upgrade the hardware of a single machine.

Example upgrades:

more CPU cores
more RAM
faster SSD
better network

Benefits:

simple architecture
minimal code changes

Limitations:

hardware limits
single point of failure
no geographic distribution

Horizontal Scaling (Scaling Out)

Add more servers.

Example:

1 server → 1000 RPS
5 servers → ~5000 RPS

Benefits:

scaling well beyond the limits of a single machine
fault tolerance
geo-distribution

Load Balancing

Traffic must be distributed across servers.

Common algorithms:

round robin
least connections
IP hashing
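The three algorithms can each be sketched in a few lines. These are simplified illustrations with hypothetical function names. Real load balancers add health checks, weights, and connection tracking, and the string hash used here is only a stand-in.

```javascript
// Round robin: hand out servers in a fixed rotation.
function roundRobin(servers) {
  let next = 0;
  return () => {
    const server = servers[next];
    next = (next + 1) % servers.length;
    return server;
  };
}

// Least connections: pick the server with the fewest in-flight requests.
// activeCounts is a Map of server -> current connection count.
function leastConnections(servers, activeCounts) {
  return servers.reduce((best, server) =>
    (activeCounts.get(server) ?? 0) < (activeCounts.get(best) ?? 0) ? server : best
  );
}

// IP hashing: the same client IP always maps to the same server
// (as long as the server list does not change).
function ipHash(servers, ip) {
  let hash = 0;
  for (const ch of ip) hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  return servers[hash % servers.length];
}

const pick = roundRobin(["a", "b", "c"]);
console.log(pick(), pick(), pick(), pick()); // a b c a
```

Round robin suits uniform requests, least connections copes better with requests of varying cost, and IP hashing gives a client a consistent server when some per-client state is kept locally.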

Example architecture:

Client
   ↓
Load Balancer
   ↓
Server Cluster
   ↓
Database

Popular load balancers:

Nginx
HAProxy
AWS ELB
Cloudflare

Challenges of Horizontal Scaling

Horizontal scaling introduces complexity.

Problems include:

state synchronization
distributed transactions
network partitions
consistency issues

Solutions involve:

stateless services
distributed caches
message queues
consensus protocols

Key Interview Takeaways

Engineers should understand:

Core performance metrics

Latency
Throughput
Percentiles
Utilization

Database optimization

Indexes
N+1 queries
Connection pooling

Caching

Cache aside
TTL
Cache invalidation
Hit ratio

Scaling

Vertical scaling
Horizontal scaling
Load balancing

Observability

Profiling
Tracing
Metrics

Final Thoughts

Performance engineering is not about blind optimizations.

It requires:

measurement
bottleneck analysis
data-driven decisions

The best backend engineers focus on:

optimizing P95/P99 latency
maintaining capacity headroom
designing scalable distributed architectures

Mastering these concepts enables you to design systems capable of handling millions of users reliably and efficiently.

This is part of the Backend First Principles series. Next: Modern Backend Scaling
