Payment Processing Systems: A Complete Architect’s Guide

From a user clicking “Buy” to funds settling in a merchant’s account: a deep dive into every layer of a production-grade payment gateway, explained for beginners and senior engineers alike.

📖 18 min read | 🎓 Beginner to Advanced | 🏗 System Design & Architecture

Table of Contents

What is a Payment Gateway?
Why Do We Need It?
Core Requirements
Baseline Architecture
Payment Processing Flow
Security & Tokenization
Transaction Lifecycle
Database & Double-Entry Bookkeeping
Idempotency & Concurrency
Scalability & Distributed Systems
Regulatory Compliance
Handling Network Failures
Interview FAQ

What is a Payment Gateway?

When a user clicks the Buy button, a complex financial workflow is triggered behind the scenes — often completing in under three seconds. It may look simple from the user’s perspective, but the underlying architecture is highly critical: it can process transactions worth millions of dollars, where even a single lost penny can erode customer trust and trigger regulatory consequences.

In system design interviews, designing a payment system is a classic challenge because it pushes candidates beyond basic request-response thinking. You must address stringent requirements around consistency, security, reliability, and data integrity — all while ensuring transactions are processed accurately at scale.

A payment gateway is the secure digital bridge between a customer, a merchant, and the financial networks (banks and card networks) that makes online payments possible.

Figure 1: The payment gateway as a secure bridge between all parties

Why Do We Need a Payment Gateway?

Without a payment gateway, every merchant would need to directly integrate with multiple banks and card networks — a massive engineering burden that is complex, insecure, and impossible to scale. The gateway solves three fundamental problems:

Security
Encrypts sensitive card data, runs fraud checks, and ensures PCI DSS compliance so merchants don’t have to handle raw financial data.

Trust
Reliable fund handling means money is never lost or mishandled during the transaction process. Trust is the foundation of any financial product — even a single failed or duplicate charge damages brand reputation.

Abstraction
Acts as a single integration point, hiding the complexity of bank and card-network APIs. Merchants integrate once with the gateway; the gateway handles the rest.

Core Requirements: Functional & Non-Functional

Before drawing any architecture diagram, experienced engineers clarify exactly what the system must do. This is called requirements gathering — it shapes every design decision that follows.

FUNCTIONAL REQUIREMENTS
(What the system must do)

1. Idempotent Authorization
A payment request must be processed exactly once, regardless of how many times the client retries due to timeouts or network issues. No customer should ever be charged twice for the same transaction.

2. Two-Phase Settlement (Authorize & Capture)
Support “holding” funds at checkout (authorization) and only moving them later — for example, when an order ships (capture). This model is standard in e-commerce and hospitality.

3. Refunds & Reversals
Handle the return of funds with full auditability. Every reversal must be traceable back to the original transaction.

4. Webhooks
Asynchronously notify downstream services — such as order fulfillment or inventory systems — when payment status changes, without blocking the main payment flow.

5. Reconciliation
A background process that verifies the internal ledger matches what external payment providers (PSPs like Stripe, Razorpay, or Adyen) actually report.

NON-FUNCTIONAL REQUIREMENTS
(How well it must do it)

✅ 100% Data Accuracy — No ghost payments or missing ledger entries
✅ Strong Consistency — Ledger must always reflect accurate fund state
✅ High Availability — API must be available to receive payment intents at all times
✅ Low Latency — p95 authorization latency must be under 500ms
✅ PCI DSS Compliant — Raw card data minimized via tokenization

⚠️ Why correctness matters more than speed
Unlike social media or search systems where eventual consistency is acceptable, payment systems demand strong consistency. A ledger entry out of sync with the real world means money disappearing or a customer being charged twice. There is no silent “undo” in financial systems: a mistake can only be corrected through an explicit, auditable reversal.

Baseline Architecture: The Five Core Components

At its core, a payment system coordinates between three entities:

– The User (Client) — who initiates the payment
– The Internal Ledger — the source of financial truth
– The External Payment Service Provider (PSP) — Stripe, Razorpay, Adyen, etc.

Here are the five building blocks that make that coordination possible:

1. Payment API Gateway
The entry point. Receives all incoming payment requests, validates them, and orchestrates the entire flow.

2. Idempotency Store (Redis)
A fast key-value cache that stores unique request keys. Prevents duplicate processing when clients retry failed requests.

3. Ledger Database (PostgreSQL)
The financial source of truth. Uses ACID transactions to record every movement of money with full integrity.

4. Wallet / Account Service
Manages real-time balances for users and merchants. Updated on every transaction event.

5. Reconciliation Engine
An offline batch job that compares PSP settlement reports against internal ledger entries. Catches any discrepancies.

Figure 2: Five core components of a baseline payment architecture

Payment Processing Flow: Step by Step

Here is exactly what happens from the moment a customer clicks “Buy” to the moment their bank approves or declines the payment.

Step 1 — Customer Initiates Payment
The customer decides to purchase and is redirected to the payment gateway. They enter their card details or UPI ID on a secure, encrypted page.

Step 2 — Payment API Gateway Receives the Request
The gateway validates the incoming request — checking format, authentication, and basic sanity before proceeding.

Step 3 — Idempotency Check
The server looks up the unique idempotency key attached to this request. If it already exists in the cache, the stored result is returned immediately — preventing any duplicate charge.

Step 4 — Fraud Detection
Before touching any bank system, the request passes through fraud checks:
– IP geolocation — is the request from an unexpected region?
– Velocity checks — has this card attempted too many transactions recently?
– ML scoring — does the pattern match known fraud signatures?
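
The velocity check above can be sketched as a sliding-window counter. This is a minimal illustration, not a production fraud engine: the class name `VelocityChecker`, the token `tok_123`, and the limit of 3 attempts per 60 seconds are all assumptions chosen for the example.

```python
from collections import defaultdict, deque


class VelocityChecker:
    """Flags a card that attempts too many transactions inside a time window."""

    def __init__(self, max_attempts: int, window_seconds: float) -> None:
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self._attempts = defaultdict(deque)  # card token -> attempt timestamps

    def allow(self, card_token: str, now: float) -> bool:
        """Record an attempt at time `now`; return False if the limit is hit."""
        attempts = self._attempts[card_token]
        # Drop attempts that have fallen out of the sliding window.
        while attempts and now - attempts[0] >= self.window_seconds:
            attempts.popleft()
        if len(attempts) >= self.max_attempts:
            return False  # rejected attempts are not recorded
        attempts.append(now)
        return True


# Hypothetical limit: at most 3 attempts per card in any 60-second window.
checker = VelocityChecker(max_attempts=3, window_seconds=60)
results = [checker.allow("tok_123", now=t) for t in (0, 10, 20, 30, 61)]
```

The fourth attempt is rejected because three attempts already sit inside the window; the fifth is allowed because the first one has expired by then. A real system would keep these counters in a shared store rather than in process memory.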

Step 5 — Acquiring Bank
The gateway forwards the transaction details to the merchant’s acquiring bank, which routes the request to the appropriate card network (Visa, Mastercard, RuPay, etc.).

Step 6 — Issuing Bank Authorization
The card network contacts the customer’s issuing bank. The bank verifies the customer’s identity, available funds, and spending patterns, then sends back an approval or decline.

Step 7 — Result Delivered
– If Approved: The payment gateway confirms success. The ledger is updated. The customer sees an order confirmation.
– If Declined: The customer is prompted to retry with a different payment method.

Figure 3: End-to-end payment flow from customer to bank decision

Security: How Payment Gateways Protect Your Data

Card Tokenization (Beginner Explanation)
Raw card numbers should never be stored in a merchant’s or gateway’s application database. Instead, gateways use tokenization: replacing the actual card number with a unique, random token. The token is meaningless on its own and can only be mapped back to the real card inside the secure vault of the party that issued it (the card network in the flow below).

How tokenization works:
1. Customer enters card details on the payment gateway’s secure page
2. Gateway sends the card number to the card association’s network (Visa, Mastercard)
3. Card association creates a unique token and returns it
4. Gateway stores only the token — never the raw card number
5. Future payments use the token to reference the original card securely
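
The steps above can be sketched with a toy vault. Everything here is illustrative: `TokenVault` is a hypothetical in-memory stand-in for the real vault, the `tok_` prefix is an assumption, and `4111111111111111` is a widely used test card number, not real data.

```python
import secrets


class TokenVault:
    """Toy stand-in for the secure vault that maps tokens back to card numbers."""

    def __init__(self) -> None:
        self._cards_by_token = {}

    def tokenize(self, card_number: str) -> str:
        # The token is random, so it reveals nothing about the card number.
        token = "tok_" + secrets.token_urlsafe(16)
        self._cards_by_token[token] = card_number
        return token

    def detokenize(self, token: str) -> str:
        # Only the vault can perform this lookup.
        return self._cards_by_token[token]


vault = TokenVault()
token = vault.tokenize("4111111111111111")

# What the gateway or merchant persists: the token only, never the card number.
merchant_record = {"customer_id": "cus_1", "card_token": token}
```

Because the token is generated randomly rather than derived from the card number, a leaked `merchant_record` cannot be reversed into card data without access to the vault.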

Four Key Security Layers

1. HTTPS Protocol
All payment data is transmitted over HTTPS — every byte is encrypted in transit between the customer’s browser and the gateway server. No plain-text data ever travels across the network.

2. Transaction Validation via Hash Function
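A secret key, known only to the merchant and the gateway, is used to compute a keyed cryptographic hash (an HMAC) of each transaction request. The receiver recomputes the hash and compares it with the one sent, which proves the request came from a holder of the key and hasn’t been tampered with in transit.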
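
A minimal sketch of this check using Python’s standard `hmac` module. The choice of HMAC-SHA256, the payload fields, and the secret value are assumptions for illustration; a given gateway may specify a different algorithm or signing format.

```python
import hashlib
import hmac


def sign(secret_key: bytes, payload: bytes) -> str:
    """Compute a keyed hash (HMAC-SHA256) of the request body."""
    return hmac.new(secret_key, payload, hashlib.sha256).hexdigest()


def verify(secret_key: bytes, payload: bytes, signature: str) -> bool:
    """Recompute the hash and compare it in constant time."""
    expected = sign(secret_key, payload)
    return hmac.compare_digest(expected, signature)


secret = b"shared-secret-for-illustration-only"
payload = b'{"order_id": "ord_42", "amount": 10000, "currency": "USD"}'
signature = sign(secret, payload)

# An attacker who changes the amount cannot produce a matching signature.
tampered = payload.replace(b"10000", b"1")
```

`hmac.compare_digest` is used instead of `==` so that the comparison time does not leak how many leading characters of a forged signature were correct.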

3. IP Address Verification
The IP address of every incoming request is checked against known malicious ranges and flagged for unusual geographic origin. This blocks many automated bot attacks before they reach the payment logic.

4. 3-D Secure (Virtual Payer Authentication)
An additional authentication step — such as a one-time password or biometric confirmation — that the customer must complete before the transaction is approved. This protects both buyer and seller from fraudulent card-not-present transactions.

Figure 4: Tokenization — the raw card number never reaches the merchant database

Transaction Lifecycle: Authorize → Capture → Settle

A payment is not a single action — it is a state machine that moves through defined stages. Understanding this lifecycle is essential for handling failures, refunds, and disputes correctly.

AUTHORIZATION
What it is: The bank confirms the customer has sufficient funds and places a hold on that amount.
Important: The money has NOT been debited yet. This is a reservation only.
When it happens: Always at checkout, the moment the customer submits their payment.

CAPTURE
What it is: The command that tells the bank to transfer the reserved funds to the merchant.
Important: Some businesses separate authorization and capture deliberately.
Example: An e-commerce store authorizes at checkout, but only captures when the item actually ships.

SETTLEMENT
What it is: The final inter-bank movement of money — funds physically move from the issuing bank to the acquiring bank.
When it happens: Typically in batches, 1–2 business days after capture.

RECONCILIATION
What it is: An asynchronous safety net — a background batch job that compares internal ledger entries against the PSP’s settlement reports.
Why it matters: Catches any discrepancies that real-time logic missed. Example: A transaction marked “failed” internally but “successful” at the bank must be detected and corrected.
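
As a rough sketch of what such a job compares, assume both the internal ledger and the PSP settlement report can be reduced to a mapping from transaction ID to status. The function name `reconcile`, the `MISSING` marker, and the sample data are hypothetical.

```python
def reconcile(ledger: dict, psp_report: dict) -> list:
    """Return (transaction_id, internal_status, psp_status) for every mismatch."""
    mismatches = []
    for txn_id in sorted(ledger.keys() | psp_report.keys()):
        internal = ledger.get(txn_id, "MISSING")
        external = psp_report.get(txn_id, "MISSING")
        if internal != external:
            mismatches.append((txn_id, internal, external))
    return mismatches


ledger = {"txn_1": "SETTLED", "txn_2": "FAILED", "txn_3": "SETTLED"}
psp_report = {"txn_1": "SETTLED", "txn_2": "SETTLED", "txn_4": "SETTLED"}

mismatches = reconcile(ledger, psp_report)
```

Here `txn_2` is the case described above (failed internally, successful at the PSP), while `txn_3` and `txn_4` exist on only one side. Real reconciliation would also compare amounts and currencies, not just status.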

REFUNDS & REVERSALS
A separate flow that reverses a captured or settled payment. Every reversal must link to an original transaction ID and be recorded in the ledger with full auditability.

Payment State Machine
Every transaction moves through these states:
INITIATED → AUTHORIZED → CAPTURED → SETTLED
INITIATED → FAILED
CAPTURED or SETTLED → REVERSED / REFUNDED

Never skip states. Never mark FAILED immediately on a timeout (see “Handling Network Failures”).
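
One way to enforce “never skip states” is an explicit transition table. The sketch below is one reading of the diagram, with two assumptions: FAILED is reachable from INITIATED, and REVERSED / REFUNDED are reachable from CAPTURED or SETTLED. The names `ALLOWED_TRANSITIONS`, `can_transition`, and `transition` are hypothetical.

```python
# Allowed moves between states; anything not listed is rejected.
ALLOWED_TRANSITIONS = {
    "INITIATED": {"AUTHORIZED", "FAILED"},
    "AUTHORIZED": {"CAPTURED"},
    "CAPTURED": {"SETTLED", "REVERSED", "REFUNDED"},
    "SETTLED": {"REVERSED", "REFUNDED"},
    "FAILED": set(),
    "REVERSED": set(),
    "REFUNDED": set(),
}


class InvalidTransition(Exception):
    """Raised when code tries to skip or rewind a state."""


def can_transition(current: str, target: str) -> bool:
    return target in ALLOWED_TRANSITIONS.get(current, set())


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the move is not allowed."""
    if not can_transition(current, target):
        raise InvalidTransition(f"{current} -> {target} is not allowed")
    return target


state = transition("INITIATED", "AUTHORIZED")
state = transition(state, "CAPTURED")
skip_allowed = can_transition("INITIATED", "SETTLED")
```

Keeping the table in one place means every status update in the codebase goes through the same check, so a bug that tries to jump from INITIATED straight to SETTLED fails loudly instead of corrupting the record.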

Figure 5: Transaction state machine — every payment moves through these defined stages

Database Design & Double-Entry Bookkeeping

The database is the backbone of a payment system. Financial systems demand strong consistency, which is why relational SQL databases like PostgreSQL are the usual choice for the ledger. Many NoSQL databases offer more schema flexibility but have historically provided weaker multi-record transaction guarantees, which makes the ACID properties a ledger needs harder to enforce.

What is Double-Entry Bookkeeping? (Beginner Explanation)
Double-entry bookkeeping is a 500-year-old accounting principle that says: every financial transaction must affect at least two accounts. This keeps the system mathematically balanced and fully auditable at all times.

Example: A user pays $100 to a merchant
– DEBIT → User’s account: -$100 (money leaves)
– CREDIT → Merchant’s account: +$100 (money arrives)

Rule: Total Debits must always equal Total Credits.
If they don’t match, something is wrong — a double charge, a lost entry, or a bug.
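
The balancing rule can be expressed in a few lines. This is a minimal sketch: the `Entry` class, the account names, and the use of integer cents are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    account: str
    entry_type: str  # "DEBIT" or "CREDIT"
    amount_cents: int  # integer minor units, never floats


def is_balanced(entries) -> bool:
    """Total debits must equal total credits."""
    debits = sum(e.amount_cents for e in entries if e.entry_type == "DEBIT")
    credits = sum(e.amount_cents for e in entries if e.entry_type == "CREDIT")
    return debits == credits


# The $100 payment from the example above, as two entries.
payment = [
    Entry("user:alice", "DEBIT", 10_000),
    Entry("merchant:shop", "CREDIT", 10_000),
]

# Simulate a lost write: only the debit side was recorded.
broken = payment[:1]
```

`is_balanced(payment)` holds, while `is_balanced(broken)` does not, which is exactly the signal a reconciliation job looks for.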

Why does this matter for engineers?
It makes bugs visible. If the ledger doesn’t balance, your reconciliation engine flags it immediately. Without double-entry, a lost database write could mean money disappearing with no trace.

Schema Overview (Simplified)

The two most critical tables:

TRANSACTIONS TABLE (immutable event log)
– id (UUID) — unique transaction identifier
– idempotency_key (VARCHAR, UNIQUE) — prevents duplicate processing
– amount (BIGINT in cents) — always store money as integers, never floats
– currency (CHAR 3) — ISO currency code e.g. USD, INR, EUR
– status (VARCHAR) — INITIATED | AUTHORIZED | CAPTURED | SETTLED | FAILED | UNKNOWN
– psp_token (VARCHAR) — tokenized card reference from PSP
– created_at (TIMESTAMP)

LEDGER ENTRIES TABLE (double-entry, append-only)
– id (UUID)
– transaction_id (FK → transactions)
– account_id (UUID) — which account is affected
– entry_type (VARCHAR) — DEBIT or CREDIT
– amount (BIGINT)
– created_at (TIMESTAMP)

⚠️ Important: Ledger entries are APPEND-ONLY. You never update or delete a ledger row. Corrections are made by adding new reversing entries. This is the foundation of financial auditability.
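
The append-only rule can also be enforced by the database itself. The sketch below uses SQLite (via Python’s built-in `sqlite3` module) purely so it runs anywhere. It is not the PostgreSQL schema itself: the integer primary key, the trigger names, and the `post` helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ledger_entries (
    id INTEGER PRIMARY KEY,
    transaction_id TEXT NOT NULL,
    account_id TEXT NOT NULL,
    entry_type TEXT NOT NULL CHECK (entry_type IN ('DEBIT', 'CREDIT')),
    amount INTEGER NOT NULL CHECK (amount > 0)  -- minor units (cents)
);
-- Reject any attempt to rewrite history.
CREATE TRIGGER ledger_no_update BEFORE UPDATE ON ledger_entries
BEGIN SELECT RAISE(ABORT, 'ledger_entries is append-only'); END;
CREATE TRIGGER ledger_no_delete BEFORE DELETE ON ledger_entries
BEGIN SELECT RAISE(ABORT, 'ledger_entries is append-only'); END;
""")


def post(conn, txn_id, debit_account, credit_account, amount):
    """Insert a balanced debit/credit pair in one transaction."""
    with conn:  # both rows commit together or not at all
        conn.executemany(
            "INSERT INTO ledger_entries "
            "(transaction_id, account_id, entry_type, amount) VALUES (?, ?, ?, ?)",
            [
                (txn_id, debit_account, "DEBIT", amount),
                (txn_id, credit_account, "CREDIT", amount),
            ],
        )


post(conn, "txn_1", "user:alice", "merchant:shop", 10_000)
# A refund is a new pair of reversing entries, not an edit of the old rows.
post(conn, "txn_1_refund", "merchant:shop", "user:alice", 10_000)

try:
    conn.execute("UPDATE ledger_entries SET amount = 1 WHERE id = 1")
    update_rejected = False
except sqlite3.DatabaseError:
    update_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM ledger_entries").fetchone()[0]
```

After the refund the ledger holds four rows, and the attempted UPDATE is rejected by the trigger. The history of what happened stays intact even though the net balance is back to zero.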

Figure 6: Double-entry ensures every dollar debited is credited somewhere

Idempotency & Concurrency Control

What is Idempotency? (Beginner-Friendly Definition)
An operation is idempotent if repeating it multiple times always produces the same result as doing it once.

Real-world analogy: Pressing an elevator button 10 times still only calls the elevator once.

In payments: If a network timeout causes the client to retry a payment request 3 times, the customer should be charged exactly once — not three times.

How Idempotency Keys Work
1. The client generates a unique key for each payment request (typically a random UUID; txn_abc123 is used here as a short stand-in)
2. The client sends this key in every request header
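3. The server checks Redis: does this key already exist? The check and the claim must be a single atomic set-if-absent operation, so two concurrent retries cannot both see NO.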
• YES → Return the stored result. Do not re-execute.
• NO → Process the payment, store the result with this key, return result.
4. All future retries with the same key get the cached result — guaranteed.
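
The flow above can be sketched with an in-memory store standing in for Redis. The names `IdempotencyStore`, `claim`, and `handle_payment` are hypothetical, and the lock imitates an atomic set-if-absent (in Redis this would typically be a `SET` with the `NX` option).

```python
import threading


class IdempotencyStore:
    """In-memory stand-in for Redis; `claim` mimics an atomic set-if-absent."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._results = {}

    def claim(self, key: str) -> bool:
        """Return True only for the first caller that presents this key."""
        with self._lock:
            if key in self._results:
                return False
            self._results[key] = None  # claimed, result not stored yet
            return True

    def save(self, key: str, result: dict) -> None:
        with self._lock:
            self._results[key] = result

    def get(self, key: str):
        with self._lock:
            return self._results.get(key)


charges = []  # records every time the (simulated) PSP is actually called


def handle_payment(store: IdempotencyStore, key: str, amount_cents: int) -> dict:
    if store.claim(key):
        charges.append(amount_cents)  # stands in for the real PSP call
        result = {"status": "AUTHORIZED", "amount": amount_cents}
        store.save(key, result)
        return result
    # Retry: return the stored result, or report that the first attempt
    # has claimed the key but not finished yet.
    return store.get(key) or {"status": "IN_PROGRESS"}


store = IdempotencyStore()
responses = [handle_payment(store, "txn_abc123", 10_000) for _ in range(3)]
```

Three requests with the same key produce three identical responses but only one entry in `charges`. A production store would also expire keys after a retention period.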

⚠️ Never mark a payment FAILED just because a request returned an error or timed out. Look up the result stored under the idempotency key first: the “failure” may only be a timeout, and the payment may have succeeded on the PSP side.

Concurrency Control
Two threads might simultaneously try to spend the same wallet balance — a classic race condition where $100 might be spent twice when only $100 exists.

Two strategies prevent this:

Pessimistic Locking (SELECT FOR UPDATE)
Locks the database row when reading it, so no other transaction can modify it until the first one completes. Simple and reliable, but creates contention under high load.

Optimistic Locking (Version Numbers)
Attaches a version counter to each record. When updating, the system checks: is the version the same as when I read it? If another transaction changed it in the meantime, the update is rejected and the client retries. Lower contention, better for high-throughput systems.
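
A minimal sketch of optimistic locking, using SQLite through Python’s `sqlite3` module so the example is self-contained. The `wallets` table and the helper names are assumptions. For the pessimistic variant, PostgreSQL’s `SELECT ... FOR UPDATE` would lock the row at read time instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wallets ("
    "id TEXT PRIMARY KEY, balance INTEGER NOT NULL, version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO wallets VALUES ('w1', 10000, 1)")
conn.commit()


def read_wallet(conn, wallet_id):
    return conn.execute(
        "SELECT balance, version FROM wallets WHERE id = ?", (wallet_id,)
    ).fetchone()


def try_spend(conn, wallet_id, amount, balance_read, version_read) -> bool:
    """Apply the spend only if nobody changed the row since it was read."""
    if balance_read < amount:
        return False
    with conn:
        cur = conn.execute(
            "UPDATE wallets SET balance = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (balance_read - amount, wallet_id, version_read),
        )
    return cur.rowcount == 1  # 0 rows updated means the version moved on


# Two requests read the same snapshot, then both try to spend the full balance.
balance_a, version_a = read_wallet(conn, "w1")
balance_b, version_b = read_wallet(conn, "w1")
first = try_spend(conn, "w1", 10_000, balance_a, version_a)
second = try_spend(conn, "w1", 10_000, balance_b, version_b)
final_balance, final_version = read_wallet(conn, "w1")
```

The first spend succeeds and bumps the version, so the second update matches no rows and is rejected. The caller would then re-read the wallet and find there are no funds left to spend.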

Figure 7: Idempotency key flow — retries always return the cached result, never re-charge

Scalability & Distributed Systems

Designing for scale means ensuring the system can handle exponential growth in transaction volume without compromising reliability or consistency. Here are the four core patterns used in production payment systems:

1. Horizontal Scaling (Stateless API Servers)
The Payment API Gateway holds no state — all state lives in the database and cache. This means you can add more server instances behind a load balancer whenever traffic grows. Adding 10 servers is just as simple as adding 1.

2. Database Sharding
The database is often the scalability bottleneck. Sharding splits data across multiple database nodes using a partition key (e.g. user_id modulo N). Each shard handles a subset of transactions, enabling parallel writes far beyond what a single server can handle.

Example: user_id ending in 0–4 → Shard A, user_id ending in 5–9 → Shard B
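
A small sketch of routing by partition key. Both function names are hypothetical. The second variant is an assumption for the case where IDs are strings such as UUIDs: it hashes the key first, because Python’s built-in `hash()` for strings is randomized per process and would route the same user to different shards after a restart.

```python
import hashlib


def shard_for_numeric_id(user_id: int, shard_count: int) -> int:
    """Route by user_id modulo N, as described above."""
    return user_id % shard_count


def shard_for_key(key: str, shard_count: int) -> int:
    """Route a string key using a stable hash, then modulo N."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


numeric_shard = shard_for_numeric_id(12345, 4)
string_shard = shard_for_key("user-abc", 4)
```

One design caveat: with plain modulo routing, changing the number of shards remaps most keys, so growing the cluster requires a planned data migration.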

3. Geo-Redundancy
Deploy regional database replicas and API nodes closer to your users. This reduces latency and eliminates a single data center as a failure point. Combine with coordinated failover so traffic automatically reroutes if a region goes down.

4. Event-Driven Architecture (Kafka / Message Bus)
Instead of the payment service directly calling the ledger service, notification service, and order service — it publishes a single event to a message bus. Each downstream service independently consumes that event on its own schedule.

Benefits:
– Each service scales independently
– A slow notification service doesn’t block payment processing
– Events can be replayed if a consumer crashes
– Full audit trail of every system action

Figure 8: Stateless API servers + sharded DB + event bus = horizontal scale

Regulatory Compliance

Payment systems must comply with financial regulations. Non-compliance can result in severe penalties, loss of payment processing licenses, and criminal liability.

PCI DSS (Payment Card Industry Data Security Standard)
Applies to: Any system that stores, processes, or transmits cardholder data.
Key requirements:
– Encrypt all stored and transmitted card data
– Minimize raw card data exposure — use tokenization
– Restrict access to cardholder data on a need-to-know basis
– Maintain detailed audit logs of all access to financial data
– Conduct regular penetration testing and vulnerability scans

GDPR (General Data Protection Regulation)
Applies to: Any system handling personal data of individuals in the EU, regardless of where your company is based.
Key requirements:
– Establish a lawful basis (such as consent or contractual necessity) before processing personal data
– Allow users to request deletion of their data (right to be forgotten)
– Notify the supervisory authority within 72 hours of becoming aware of a qualifying data breach
– Only retain data for as long as necessary

AML (Anti-Money Laundering)
Applies to: All regulated financial institutions and payment processors.
Key requirements:
– Flag and report suspicious transaction patterns (e.g. structuring large amounts into smaller ones)
– Verify customer identity for transactions above regulatory thresholds (KYC — Know Your Customer)
– Maintain transaction records for regulatory audit

Geo-Fencing
Automatically block transactions from sanctioned countries or high-risk regions using IP geolocation and BIN (Bank Identification Number) checks. This is both a legal requirement and a fraud-reduction measure.

Audit Logs
Every action in a payment system — every state change, every API call, every admin access — must be logged immutably. These logs are the evidence you need during regulatory audits, dispute resolution, and fraud investigations.

Handling Network Failures

Network failures are inevitable in distributed systems. A payment request may reach the gateway successfully, but the response may never return due to timeouts or connectivity issues. The critical mistake most beginners make: marking a payment as FAILED immediately after a timeout.

⚠️ A timeout means “I didn’t get a response” — NOT “the payment failed.”
The PSP may have successfully processed the payment on their side. Marking it FAILED prematurely and issuing a refund costs real money.

The correct status for any unconfirmed payment is: UNKNOWN.
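
In code, the rule comes down to treating a timeout and an explicit decline as different outcomes. The exception classes and the `authorize` function below are hypothetical stand-ins for whatever a real PSP client library raises.

```python
class PspTimeout(Exception):
    """Raised by the hypothetical PSP client when no response arrives in time."""


class PspDeclined(Exception):
    """Raised when the PSP explicitly declines the payment."""


def authorize(psp_call, amount_cents: int) -> str:
    """Map the outcome of a PSP call to a payment status."""
    try:
        psp_call(amount_cents)
    except PspTimeout:
        # No answer is not a "no": the PSP may have processed the payment.
        return "UNKNOWN"
    except PspDeclined:
        return "FAILED"
    return "AUTHORIZED"


# Three simulated PSP behaviours.
def ok(amount_cents):
    return None


def timed_out(amount_cents):
    raise PspTimeout()


def declined(amount_cents):
    raise PspDeclined()


statuses = [authorize(call, 10_000) for call in (ok, timed_out, declined)]
```

Only an explicit decline becomes FAILED. The timed-out payment stays UNKNOWN until something authoritative resolves it.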

Six Reliability Mechanisms That Work Together

1. Idempotency Keys
Every retry of the same request includes the same unique key. The server returns the cached result without re-executing. No double charges.

2. Payment States
Maintain explicit states: INITIATED → PROCESSING → SUCCESS / FAILED / UNKNOWN
Never collapse UNKNOWN into FAILED. Treat them as different things.

3. Reconciliation Jobs
A background job periodically queries the PSP for all payments in UNKNOWN state and updates their final status. Runs every few minutes or hours depending on SLA requirements.

4. Webhooks
The PSP proactively pushes payment outcomes to your server asynchronously. Even if the original request timed out, the webhook delivers the final result. This is the fastest recovery mechanism.

5. Outbox Pattern
Write the payment update AND the event record to the database in the same ACID transaction. A background process then reliably publishes the event to Kafka. This removes the dual-write problem, where the database write succeeds but the message publish fails, so a committed event is never lost. Delivery is at-least-once, which means consumers should deduplicate events.
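
A minimal sketch of the pattern, using SQLite through Python’s `sqlite3` module in place of the production database and a plain Python list in place of Kafka. The table layout and the names `mark_captured` and `relay_once` are assumptions for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE payments (id TEXT PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    payload TEXT NOT NULL,
    published INTEGER NOT NULL DEFAULT 0
);
INSERT INTO payments VALUES ('pay_1', 'AUTHORIZED');
""")


def mark_captured(conn, payment_id: str) -> None:
    """Update the payment and record the event in ONE transaction."""
    event = {"type": "payment.captured", "payment_id": payment_id}
    with conn:  # both writes commit, or neither does
        conn.execute(
            "UPDATE payments SET status = 'CAPTURED' WHERE id = ?", (payment_id,)
        )
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def relay_once(conn, publish) -> int:
    """Publish pending events in order; return how many were pending."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(payload)  # if this raises, the row stays pending and is retried
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


published = []  # stands in for the message bus
mark_captured(conn, "pay_1")
first_pass = relay_once(conn, published.append)
second_pass = relay_once(conn, published.append)
```

If the relay crashes after publishing but before marking the row, the event is published again on the next pass. That is why this design gives at-least-once delivery and relies on consumers to ignore duplicates.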

6. Retries and Dead Letter Queues (DLQs)
Transient failures (network blips, temporary PSP downtime) are handled with exponential backoff retries. Messages that fail repeatedly after all retries are moved to a Dead Letter Queue for manual investigation — they are never silently discarded.
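
A small sketch of retry with exponential backoff and a dead letter queue. The function name, the four-attempt limit, and the 0.5-second base delay are assumptions. The `sleep` parameter is injectable so the example runs instantly.

```python
import time


class TransientError(Exception):
    """A failure that may succeed if retried (network blip, PSP briefly down)."""


def process_with_retries(message, handler, dead_letters,
                         max_attempts=4, base_delay=0.5, sleep=time.sleep) -> bool:
    """Try the handler up to max_attempts times, doubling the wait each time."""
    for attempt in range(max_attempts):
        try:
            handler(message)
            return True
        except TransientError:
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    # Out of retries: park the message for manual investigation.
    dead_letters.append(message)
    return False


delays = []        # captured instead of actually sleeping
dead_letters = []  # stands in for the DLQ


def always_fails(message):
    raise TransientError("PSP unavailable")


delivered = process_with_retries(
    {"id": "evt_1"}, always_fails, dead_letters, sleep=delays.append
)
```

The message is attempted four times with growing pauses, then lands in `dead_letters` rather than disappearing. Production systems commonly add random jitter to the delay so that many clients do not retry in lockstep.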

Figure 9: UNKNOWN state + reconciliation + webhooks = exactly-once processing

Interview FAQ: Payment Systems

These questions appear regularly in system design interviews at FAANG companies and fintech startups. Understanding the reasoning behind each answer matters as much as the answer itself.

Q: What happens if a payment gateway processes a transaction successfully but the client never receives the response?

A: The client times out and retries with the same idempotency key. The server recognizes the key and returns the stored result of the first attempt, so the customer is not charged again. If the server itself never received a definitive answer from the PSP, it keeps the transaction in UNKNOWN status until a reconciliation job or webhook confirms the actual outcome.

Q: How do payment systems prevent duplicate charges during retries?

A: By requiring every payment request to include a client-generated, globally unique idempotency key (typically a UUID). The server records this key alongside the result on the first execution. Any retry with the same key returns the stored result without re-running the payment logic.

Q: Why should a payment NOT be marked as FAILED immediately after a timeout?

A: Because a timeout means “I didn’t receive a response” — not “the payment failed.” The PSP may have successfully processed it on their side. If you mark it FAILED and issue a refund, you lose money unnecessarily. The correct approach is to mark the payment as UNKNOWN and resolve it via reconciliation or webhook.

Q: What is the purpose of reconciliation jobs in a payment system?

A: Reconciliation is the asynchronous safety net. A batch job periodically compares internal ledger entries against PSP settlement reports. It catches discrepancies that real-time logic missed — such as payments that succeeded at the PSP but are still UNKNOWN internally — and flags or auto-corrects them.

Q: How do webhooks improve reliability in payment processing?

A: Webhooks let the PSP push payment outcomes to your server asynchronously, rather than requiring your server to poll. Even if the original API call timed out and the client never received a response, the PSP will still deliver the final result via webhook — decoupling outcome delivery from the original request lifecycle.

Q: How do you ensure payment events are not lost after updating the database?

A: Use the Outbox Pattern. Write both the payment status update and the event record to the database inside the same ACID transaction. A separate background process (or Change Data Capture tool) then reliably publishes the event to Kafka. This eliminates the race condition where a DB write succeeds but the subsequent Kafka publish fails — no events are ever silently lost.

Conclusion: Building Production-Grade Payment Systems

A robust payment gateway is not a single feature — it is a carefully orchestrated composition of multiple engineering disciplines working together:

– Idempotency to prevent duplicate charges
– Strong consistency to ensure every penny is accounted for
– Event-driven architecture to scale each component independently
– Fraud detection to protect users and merchants in real time
– Regulatory compliance to meet legal obligations across markets
– Exhaustive failure handling to recover gracefully from network issues

Every design decision in a payment system exists to protect one fundamental truth: money must never be lost, duplicated, or misrouted. Even a single incorrect ledger entry can cause regulatory consequences and permanent loss of customer trust.

Master these principles and you have the architectural foundation to design any financial system — from a startup’s first checkout flow to a platform processing billions of transactions per year.