Backend Error Handling: Building Fault-Tolerant Systems (Modern Guide for Engineers & Interviews)

Backend systems fail. Databases disconnect, APIs time out, users send bad input, and business logic behaves unexpectedly under real traffic. A strong backend engineer does not assume systems will work; they design systems that expect failure and handle it gracefully.

Modern backend engineering focuses on fault tolerance, observability, and predictable error handling so that failures do not cascade into system outages.

This article explains:

  • Types of backend errors

  • Real engineering strategies to handle them

  • Modern production practices

  • Global error handling architecture

  • Security considerations

  • Interview-ready explanations

  • Practical code examples

Everything here reflects how production backend systems are built today.


1. The Backend Error-Handling Mindset

Errors are not exceptional in backend systems — they are expected.

Examples from real systems:

  • A database connection pool exhausts

  • A payment API times out

  • A user submits malformed JSON

  • A distributed cache becomes unavailable

  • A race condition corrupts business logic

The question is never:

Will errors happen?

The real question is:

How does your system behave when they happen?

A production-grade backend system must:

  • Detect failures early

  • Prevent error propagation

  • Recover automatically when possible

  • Degrade gracefully when recovery is impossible

This mindset defines reliable backend architecture.


2. Logic Errors (The Most Dangerous Ones)

Logic errors do not crash systems — they silently produce incorrect results.

Example scenario:

An e-commerce backend applies a discount twice.

const finalPrice = price - discount - discount;

The system runs successfully, but the company loses money on every order.

Why Logic Errors Happen

Common reasons:

  1. Misunderstood requirements

  2. Incorrect algorithm implementation

  3. Missing edge cases

  4. Unexpected user behaviour

Real Example

A loyalty system calculates reward points:

function calculatePoints(orderAmount: number) {
  return orderAmount * 0.1
}

If the requirement was 10 points per dollar, this implementation is wrong: it awards 0.1 points per dollar, 100 times too few, and nothing crashes to reveal the mistake.
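One way to catch this kind of mistake is to write the requirement down as a named constant and assert on known inputs. The sketch below assumes the rule really is 10 points per dollar; the constant name `POINTS_PER_DOLLAR`, the input check, and the choice to round down are illustrative assumptions, not part of the original system.

```typescript
// Assumed requirement: 10 points per dollar spent.
const POINTS_PER_DOLLAR = 10

function calculatePoints(orderAmount: number): number {
  if (!Number.isFinite(orderAmount) || orderAmount < 0) {
    throw new RangeError("orderAmount must be a non-negative number")
  }
  // Round down so partial points are never awarded (an assumption).
  return Math.floor(orderAmount * POINTS_PER_DOLLAR)
}

// A unit test states the requirement in executable form.
function testCalculatePoints(): void {
  const cases: Array<[number, number]> = [[0, 0], [1, 10], [25, 250]]
  for (const [amount, expected] of cases) {
    const actual = calculatePoints(amount)
    if (actual !== expected) {
      throw new Error(`calculatePoints(${amount}) = ${actual}, expected ${expected}`)
    }
  }
}

testCalculatePoints()
```

With a test like this in the build, the 0.1 multiplier would fail on the very first case that is not zero.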

Prevention Strategies

Production teams rely on:

  • Unit testing

  • Integration tests

  • Feature flags

  • Monitoring business metrics

Example metric monitoring:

  • failed transactions

  • refund rate

  • payment failures

  • order completion rate

A sudden shift in any of these metrics is often the first visible sign of a logic error.


3. Database Errors

Backend applications depend heavily on databases. Database failures can bring entire systems down.

Common categories:

  • connection errors

  • constraint violations

  • query errors

  • deadlocks


Connection Errors

Connection errors occur when the backend cannot communicate with the database.

Reasons:

  • database overload

  • network issues

  • connection pool exhaustion

Example Node.js connection pooling:

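import { Pool } from "pg"

const pool = new Pool({
  connectionString: process.env.DB_URL,
  max: 20,                       // at most 20 open connections
  connectionTimeoutMillis: 5000, // fail if no connection frees up within 5 s
  idleTimeoutMillis: 30000       // close connections that sit idle for 30 s
})

When all 20 connections are busy, further requests wait for one to be released. With connectionTimeoutMillis set, a request that waits longer than 5 seconds fails with an error. Without it, node-postgres waits indefinitely and requests pile up.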

Prevention

  • Use connection pooling

  • Monitor pool usage

  • Implement timeouts


Constraint Violations

Databases enforce data integrity.

Examples:

  • duplicate emails

  • invalid foreign keys

  • null constraints

Example error:

duplicate key value violates unique constraint "users_email_key"

Example Handling

try {
  await db.insertUser(user)
} catch (error) {
  // 23505 is PostgreSQL's unique_violation code
  if (error.code === "23505") {
    throw new Error("Email already exists")
  }
  throw error // do not swallow unrelated failures
}
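The same idea extends to the other constraint types listed above. The sketch below maps PostgreSQL SQLSTATE codes to client-safe responses; the `mapDbError` helper, the `HttpError` shape, and the chosen status codes and messages are assumptions for illustration.

```typescript
// PostgreSQL SQLSTATE codes for integrity constraint violations.
const UNIQUE_VIOLATION = "23505"
const FOREIGN_KEY_VIOLATION = "23503"
const NOT_NULL_VIOLATION = "23502"

interface HttpError {
  status: number
  message: string
}

// Hypothetical helper: translates a driver error into a client-safe response.
function mapDbError(err: unknown): HttpError {
  const code =
    typeof err === "object" && err !== null && "code" in err
      ? String((err as { code: unknown }).code)
      : ""

  switch (code) {
    case UNIQUE_VIOLATION:
      return { status: 409, message: "Resource already exists" }
    case FOREIGN_KEY_VIOLATION:
      return { status: 400, message: "Referenced resource does not exist" }
    case NOT_NULL_VIOLATION:
      return { status: 400, message: "A required field is missing" }
    default:
      // Anything else is unexpected: hide the details from the client.
      return { status: 500, message: "Internal server error" }
  }
}
```

Keeping the mapping in one place means the raw driver message never has to leave the repository layer.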

Query Errors

Malformed SQL or incorrect table names cause failures.

Example:

SELECT * FROM customer;

If the actual table is customers, the query fails.

Prevention

  • Use ORMs

  • Type-safe query builders

  • Database migrations


Deadlocks

Deadlocks occur when multiple transactions wait on each other.

Example scenario:

Transaction A locks row 1
Transaction B locks row 2

Then:

A waits for row 2
B waits for row 1

The database resolves this by aborting one of the transactions, which the application can then retry.

Strategy

Retry transactions:

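// PostgreSQL reports the aborted transaction with SQLSTATE 40P01
const DEADLOCK_DETECTED = "40P01"

// `work` is the callback holding the transaction's queries
async function runTransaction(work, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await db.transaction(work)
    } catch (err) {
      const isLastAttempt = attempt === maxAttempts
      // Rethrow anything that is not a deadlock, or a deadlock on the final attempt
      if (err.code !== DEADLOCK_DETECTED || isLastAttempt) throw err
    }
  }
}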

4. External Service Errors

Modern applications rely heavily on third-party services.

Examples:

  • payment gateways

  • authentication providers

  • AI APIs

  • email services

  • cloud storage

Every external dependency introduces failure points.


Network Failures

Requests fail due to:

  • DNS resolution errors

  • network partitions

  • connectivity issues

Example:

await fetch(paymentGateway)

This request may time out or hang, so give it an explicit deadline.
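A minimal sketch of an explicit deadline follows. `withTimeout` and `TimeoutError` are hypothetical names, and the wrapper only stops waiting: it does not cancel the underlying work. With fetch specifically, passing `signal: AbortSignal.timeout(ms)` in the options also aborts the request.

```typescript
class TimeoutError extends Error {
  constructor(ms: number) {
    super(`Operation timed out after ${ms} ms`)
    this.name = "TimeoutError"
  }
}

// Rejects with TimeoutError if `task` does not settle within `ms` milliseconds.
function withTimeout<T>(task: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new TimeoutError(ms)), ms)
    task.then(
      (value) => {
        clearTimeout(timer) // settled in time: cancel the pending timeout
        resolve(value)
      },
      (error) => {
        clearTimeout(timer)
        reject(error)
      }
    )
  })
}
```

A call such as `withTimeout(chargeCard(order), 3000)` (where `chargeCard` is a placeholder) then fails fast instead of holding a request open.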


Rate Limiting

Many APIs enforce request limits.

Typical response:

HTTP 429 Too Many Requests

Exponential Backoff Strategy

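async function retryRequest(fn, retries = 5) {
  let delay = 1000

  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      return await fn()
    } catch (err) {
      // Retry only rate-limit responses, and give up after the last attempt
      if (err.status !== 429 || attempt === retries) throw err
      await new Promise(r => setTimeout(r, delay))
      delay *= 2 // 1 s, 2 s, 4 s, ...
    }
  }
}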

Service Outages

Cloud providers sometimes experience outages.

Example:

  • AWS S3 downtime

  • authentication service outage

Production systems implement fallbacks:

  • cached data

  • secondary storage

  • queueing operations
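The cached-data fallback can be sketched as follows. The in-memory `lastKnownGood` map and the `fetchWithFallback` helper are assumptions for illustration; a real system would more likely use a shared cache and an expiry policy.

```typescript
// Hypothetical in-memory store of the last successful result per key.
const lastKnownGood = new Map<string, unknown>()

interface FallbackResult<T> {
  value: T
  stale: boolean // true when the value came from the fallback copy
}

async function fetchWithFallback<T>(
  key: string,
  loader: () => Promise<T>
): Promise<FallbackResult<T>> {
  try {
    const value = await loader()
    lastKnownGood.set(key, value) // refresh the fallback copy
    return { value, stale: false }
  } catch (err) {
    if (lastKnownGood.has(key)) {
      // Serve stale data instead of failing the request.
      return { value: lastKnownGood.get(key) as T, stale: true }
    }
    throw err // nothing cached: surface the outage
  }
}
```

Returning the `stale` flag lets callers decide whether to show a warning or skip operations that need fresh data.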


5. Input Validation Errors

Users frequently send invalid data.

Examples:

  • invalid email

  • missing fields

  • malformed JSON

Backend validation protects the system.


Format Validation

if (!emailRegex.test(email)) {
  throw new Error("Invalid email format")
}

Range Validation

if (quantity <= 0 || quantity > 100) {
  throw new Error("Invalid quantity")
}

Required Fields

if (!title) {
  throw new Error("Title is required")
}

These errors typically return:

HTTP 400 Bad Request
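The three checks can be combined into one validator that reports every problem at once, so the client does not have to fix fields one request at a time. This is a sketch: the field names, the simple email pattern, and the response shape are assumptions.

```typescript
// Deliberately simple pattern (an assumption): something@something.tld
const emailRegex = /^[^\s@]+@[^\s@]+\.[^\s@]+$/

function validateCreateOrder(body: Record<string, unknown>): string[] {
  const errors: string[] = []
  const email = body["email"]
  const title = body["title"]
  const quantity = body["quantity"]

  if (typeof email !== "string" || !emailRegex.test(email)) {
    errors.push("Invalid email format")
  }
  if (typeof title !== "string" || title.trim() === "") {
    errors.push("Title is required")
  }
  if (
    typeof quantity !== "number" ||
    !Number.isInteger(quantity) ||
    quantity <= 0 ||
    quantity > 100
  ) {
    errors.push("Invalid quantity")
  }
  return errors
}

// Builds the 400 response, or returns null when the input is valid.
function validationResponse(
  body: Record<string, unknown>
): { status: number; body: { errors: string[] } } | null {
  const errors = validateCreateOrder(body)
  return errors.length > 0 ? { status: 400, body: { errors } } : null
}
```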

6. Configuration Errors

Configuration mistakes are common in production deployments.

Example:

An API key exists locally but is missing in production.

process.env.OPENAI_API_KEY

If the variable is undefined, the failure surfaces only at runtime, when the first request that needs the key runs.


Best Practice: Validate Config on Startup

function validateConfig() {
  if (!process.env.OPENAI_API_KEY) {
    throw new Error("Missing OPENAI_API_KEY")
  }
}

validateConfig()

The process now fails fast at startup with a clear message, so a misconfigured deployment never begins serving traffic.
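With more than one setting, it helps to report every missing variable in a single error and return a typed config object. The sketch below is one possible shape; the `AppConfig` fields, the `PORT` default, and passing `env` as a parameter (so the function can be tested without touching `process.env`) are assumptions.

```typescript
interface AppConfig {
  databaseUrl: string
  openAiApiKey: string
  port: number
}

function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const missing: string[] = []

  // Records the name of any variable that is absent or empty.
  const required = (name: string): string => {
    const value = env[name]
    if (value === undefined || value === "") {
      missing.push(name)
      return ""
    }
    return value
  }

  const databaseUrl = required("DB_URL")
  const openAiApiKey = required("OPENAI_API_KEY")
  const port = Number(env["PORT"] ?? "3000") // assumed default port

  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`)
  }
  if (!Number.isInteger(port) || port <= 0 || port > 65535) {
    throw new Error("PORT must be an integer between 1 and 65535")
  }
  return { databaseUrl, openAiApiKey, port }
}
```

At startup this might be called once as `loadConfig(process.env)`, with the result passed to the rest of the application.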


7. Health Checks (Proactive Error Detection)

Health checks allow infrastructure to verify system health.

Typical endpoint:

GET /health

Example implementation:

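app.get("/health", async (req, res) => {
  try {
    await db.query("SELECT 1") // cheap round trip to confirm the database answers
    res.status(200).send("OK")
  } catch {
    // 503 signals that this instance is temporarily unable to serve traffic
    res.status(503).send("DB Error")
  }
})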

These checks are used by:

  • Kubernetes

  • load balancers

  • monitoring systems
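When a service depends on more than the database, the endpoint can run several checks in parallel with a per-check deadline, so that a hung dependency cannot hang the probe itself. The sketch below is illustrative: `runHealthChecks`, `withDeadline`, the report shape, and the one-second default are assumptions.

```typescript
type CheckFn = () => Promise<void>

interface HealthReport {
  status: "ok" | "unavailable"
  checks: Record<string, "up" | "down">
}

// Fails the check if it does not finish within `ms` milliseconds.
function withDeadline(check: CheckFn, ms: number): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("health check timed out")), ms)
    check().then(
      () => { clearTimeout(timer); resolve() },
      (err) => { clearTimeout(timer); reject(err) }
    )
  })
}

async function runHealthChecks(
  checks: Record<string, CheckFn>,
  timeoutMs: number = 1000
): Promise<HealthReport> {
  const report: HealthReport = { status: "ok", checks: {} }
  await Promise.all(
    Object.keys(checks).map(async (name) => {
      try {
        await withDeadline(checks[name], timeoutMs)
        report.checks[name] = "up"
      } catch (err) {
        report.checks[name] = "down"
        report.status = "unavailable" // any failed dependency fails the probe
      }
    })
  )
  return report
}
```

A route handler could then respond with 200 when `status` is "ok" and 503 otherwise, returning the per-dependency map as the body.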


8. Monitoring and Observability

Production systems rely heavily on monitoring.

Important metrics:

  • error rate

  • response time

  • throughput

  • CPU usage

  • memory usage

Example stack used today:

  • Prometheus

  • Grafana

  • Loki

  • OpenTelemetry

  • Datadog

Structured logging example:

logger.error({
  message: "Database error",
  userId: user.id,
  requestId
})

Structured logs make debugging easier.


9. Global Error Handling Architecture

Production systems centralize error handling.

Typical architecture:

  • routing layer

  • handler layer

  • service layer

  • repository layer

  • global error handler

Errors propagate upward until handled.


Example Global Error Handler (Express)

app.use((err, req, res, next) => {

  if (err instanceof ValidationError) {
    return res.status(400).json({
      message: err.message
    })
  }

  if (err instanceof NotFoundError) {
    return res.status(404).json({
      message: err.message
    })
  }

  console.error(err)

  res.status(500).json({
    message: "Internal server error"
  })
})

Advantages:

  • consistent responses

  • reduced duplicated code

  • centralized error management
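A handler like this depends on the error classes it tests for. One possible design, sketched under assumptions, gives every expected error its own status code so the handler needs a single branch; the `AppError` base class and the `toErrorResponse` helper are hypothetical names.

```typescript
// Hypothetical base class: every expected failure carries its own HTTP status.
class AppError extends Error {
  readonly status: number

  constructor(message: string, status: number) {
    super(message)
    this.name = new.target.name
    this.status = status
    // Keeps instanceof working when compiling to older JavaScript targets.
    Object.setPrototypeOf(this, new.target.prototype)
  }
}

class ValidationError extends AppError {
  constructor(message: string) {
    super(message, 400)
  }
}

class NotFoundError extends AppError {
  constructor(message: string) {
    super(message, 404)
  }
}

interface ErrorResponse {
  status: number
  body: { message: string }
}

function toErrorResponse(err: unknown): ErrorResponse {
  if (err instanceof AppError) {
    return { status: err.status, body: { message: err.message } }
  }
  // Unknown errors: never leak internal details to the client.
  return { status: 500, body: { message: "Internal server error" } }
}
```

Inside the Express handler this reduces to computing `toErrorResponse(err)` and sending `res.status(status).json(body)`, and new error types need no change to the handler.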


10. Error Recovery Strategies

Two categories exist:

Recoverable Errors

Examples:

  • network timeouts

  • temporary resource exhaustion

Solution:

  • retries

  • exponential backoff

  • queueing


Non-Recoverable Errors

Examples:

  • corrupted data

  • missing resources

  • invalid business rules

Solution:

  • graceful degradation

  • disable non-critical features

  • fallback systems


11. Error Propagation Control

Errors should propagate in a controlled way.

Example:

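async function getUser(id) {
  try {
    return await repo.getUser(id)
  } catch (err) {
    // Wrap the low-level error, keeping the original attached as the cause
    throw new ServiceError("User lookup failed", err)
  }
}

Wrapping adds service-level context while keeping the original error available for debugging.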

Without proper propagation, debugging becomes extremely difficult.
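A minimal sketch of such a wrapper class follows, together with a helper that flattens the chain for logging. The `original` field name and the `describeErrorChain` helper are assumptions; the built-in `cause` option of `Error` (ES2022) is an alternative where the runtime supports it.

```typescript
class ServiceError extends Error {
  readonly original: unknown

  constructor(message: string, original: unknown) {
    super(message)
    this.name = "ServiceError"
    this.original = original
    // Keeps instanceof working when compiling to older JavaScript targets.
    Object.setPrototypeOf(this, new.target.prototype)
  }
}

// Returns the messages of all wrapped errors, outermost first.
function describeErrorChain(err: unknown): string[] {
  const messages: string[] = []
  let current: unknown = err
  while (current !== undefined && current !== null) {
    if (current instanceof ServiceError) {
      messages.push(current.message)
      current = current.original // step down to the wrapped error
    } else {
      messages.push(current instanceof Error ? current.message : String(current))
      current = undefined // reached the root cause
    }
  }
  return messages
}
```

Logging the whole chain at the top-level handler shows both what the service was trying to do and what actually failed underneath.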


12. Security and Error Handling

Improper error messages can expose sensitive information.

Example bad response:

SQL Error: duplicate key value violates constraint users_email_key

This exposes internal schema.

Correct response:

Email already exists

Authentication Security Example

Bad implementation:

User does not exist
Password incorrect

Attackers can enumerate valid emails.

Correct implementation:

Invalid email or password
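In code, both failure paths should return the same message. The sketch below assumes hypothetical `findUser` and `verifyPassword` functions supplied by the caller; note that it does not address timing differences between the two paths, which a hardened implementation would also need to consider.

```typescript
interface UserRecord {
  email: string
  passwordHash: string
}

interface LoginResult {
  ok: boolean
  message: string
}

const GENERIC_AUTH_ERROR = "Invalid email or password"

async function login(
  email: string,
  password: string,
  findUser: (email: string) => Promise<UserRecord | undefined>,
  verifyPassword: (password: string, hash: string) => Promise<boolean>
): Promise<LoginResult> {
  const user = await findUser(email)
  const passwordMatches =
    user !== undefined && (await verifyPassword(password, user.passwordHash))

  if (user === undefined || !passwordMatches) {
    // Same message for both causes, so responses do not reveal which emails exist.
    return { ok: false, message: GENERIC_AUTH_ERROR }
  }
  return { ok: true, message: "Login successful" }
}
```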

13. Logging Security

Sensitive information should never appear in logs.

Avoid logging:

  • passwords

  • credit card numbers

  • API keys

  • authentication tokens

Instead log identifiers:

logger.error({
  userId,
  requestId,
  message: "Authentication failure"
})
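As a safety net, a logger can also redact known sensitive keys before writing anything. The sketch below assumes plain JSON-like data and a hypothetical key list; it is a backstop, not a substitute for keeping secrets out of log calls in the first place.

```typescript
// Hypothetical list of keys to mask, compared case-insensitively.
const SENSITIVE_KEYS = new Set([
  "password",
  "cardnumber",
  "apikey",
  "token",
  "authorization",
])

// Returns a deep copy of `value` with sensitive fields replaced.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redact(item))
  }
  if (typeof value === "object" && value !== null) {
    const source = value as Record<string, unknown>
    const clean: Record<string, unknown> = {}
    for (const key of Object.keys(source)) {
      clean[key] = SENSITIVE_KEYS.has(key.toLowerCase())
        ? "[REDACTED]"
        : redact(source[key])
    }
    return clean
  }
  return value // primitives pass through unchanged
}
```

A logger wrapper could call `redact` on every payload, for example `logger.error(redact(payload))`.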

14. Interview-Ready Concepts

If asked about backend error handling, mention these principles:

Core Concepts

  • fault tolerance

  • graceful degradation

  • centralized error handling

  • observability

  • retries with backoff

  • circuit breakers
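Circuit breakers are the one concept in this list without an example elsewhere in the article, so here is a minimal sketch. It is a simplification under stated assumptions: the thresholds are arbitrary, the injectable `now` clock exists only to make the class testable, and concurrent trial calls in the half-open state are not limited.

```typescript
type BreakerState = "closed" | "open" | "half-open"

class CircuitBreaker {
  private failures = 0
  private openedAt = 0
  private state: BreakerState = "closed"

  constructor(
    private readonly failureThreshold: number = 3,
    private readonly resetAfterMs: number = 10000,
    private readonly now: () => number = () => Date.now()
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.resetAfterMs) {
        // Fail fast without contacting the unhealthy dependency.
        throw new Error("Circuit open: call rejected")
      }
      this.state = "half-open" // cool-down elapsed: allow a trial call
    }
    try {
      const result = await fn()
      this.failures = 0
      this.state = "closed" // success closes the circuit again
      return result
    } catch (err) {
      this.failures += 1
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open"
        this.openedAt = this.now()
      }
      throw err
    }
  }

  getState(): BreakerState {
    return this.state
  }
}
```

In an interview, the key point is that the breaker stops sending traffic to a failing dependency, which gives it time to recover and keeps callers from queueing behind timeouts.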

Architecture Terms

  • health checks

  • structured logging

  • monitoring

  • error propagation

  • retry strategies

  • distributed tracing

Common Interview Question

Q: Why use global error handling?

Answer:

It centralizes error responses, prevents duplicated logic across layers, improves maintainability, and ensures consistent API responses.


Final Thoughts

Reliable backend systems are designed with failure in mind.

A robust backend system must:

  • detect failures early

  • isolate errors

  • recover automatically

  • degrade gracefully when necessary

  • maintain data integrity

  • protect sensitive information

Error handling is not just exception catching — it is a core architectural responsibility of backend engineers.

The difference between fragile systems and production-grade platforms is how well they handle failure.


This is part of the series Backend First Principles. Next: Configuration Management for Modern Backend Systems

Backend First Principles

Part 7 of 17

This series documents my learning journey through the "Backend from First Principles" playlist. Instead of jumping directly into frameworks, the focus is on understanding the core concepts that power backend systems. Throughout this series, I explore how backend systems actually work — from the request-response lifecycle, HTTP fundamentals, routing, serialization, authentication, and validation to more advanced topics like caching, task queues, observability, security, and scaling. The goal of this series is to build a strong conceptual foundation for backend engineering that applies across languages and frameworks. By learning backend development from first principles, we gain a deeper understanding of how modern web systems are designed, built, and scaled.
