The Three Pillars of Scalable Data Processing
Every unit of work in a data processing system should aspire to be small, independently processable, and consistently sized. When these three properties hold, scaling becomes almost trivial. Reality rarely cooperates, which is why understanding these properties matters so much for platform engineering.
The Ideal State: Independent, Small, and Consistent
These three properties represent the pinnacle of what we do as platform engineers. Every message flowing through a system, every unit of work we design, should aspire to be:
Small: Quick to process, minimal memory footprint, and fast to retry if something goes wrong. Yes, smaller units mean more per-message overhead, but the predictability you gain typically outweighs that cost.
Independently processable: No coordination required with other units of work. I can process this message without knowing anything about any other message in the system.
Consistently sized: No surprises. Every unit of work takes roughly the same amount of time and resources, with no landmines lurking in the queue.
When these three properties hold, scaling becomes almost simple. See lag growing on a Kafka topic? Double the consumer count. Since everything is independently consumable, small, and consistently sized, you've just doubled your throughput. The math works because the work itself cooperates with horizontal scaling. LinkedIn's Kafka benchmarks demonstrate this principle: properly partitioned message streams achieved 2.6 million records per second with three consumers (nearly 3x the throughput of a single consumer), showing near-linear scaling. The catch is work that's indirectly coupled, such as when every consumer shares a cache or database.
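To make that concrete, here is a minimal sketch of what such an independent consumer can look like, assuming the confluent_kafka Python client, a local broker, and a hypothetical topic named `events`. Because each message is small and self-contained, "double the consumer count" is literally just launching another copy of this process in the same consumer group; the group rebalancing spreads partitions across however many copies are running.

```python
# A minimal sketch of an independent, stateless consumer, assuming the
# confluent_kafka client, a local broker, and a hypothetical "events" topic.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "event-processors",      # every copy joins the same group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])

def handle(payload: bytes) -> None:
    # Small, independent unit of work: no lookups into other messages,
    # no shared mutable state, so any instance can take any message.
    ...

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        handle(msg.value())
finally:
    consumer.close()
```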
Where Things Break: Serial Dependencies
Imagine you're processing time-series data where each minute's data depends on the previous minute's state. You can't process this minute until the last minute completes. That's a serial dependency.
Now shrink the window. Instead of minute-level data, you're receiving data every second. Your processing time must now complete in under one second, every single time, or you'll never catch up. You're permanently underwater.
The natural instinct when falling behind is to double the cluster size to increase throughput. But with serial dependencies, this accomplishes almost nothing. You still have to process everything in sequence. With two processors, one works while the other waits its turn. You've doubled your infrastructure costs while your actual utilization drops by half. This isn't just intuition; it's Amdahl's Law. Even with 90% of your code parallelizable, the remaining 10% serial portion caps your maximum speedup at 10x, regardless of how many processors you add.
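A quick back-of-the-envelope makes that ceiling concrete; the snippet below just plugs a 90% parallel fraction into Amdahl's formula.

```python
# Amdahl's Law: speedup(N) = 1 / ((1 - p) + p / N), where p is the parallel
# fraction of the work and N is the number of processors.
def speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

for n in (2, 4, 16, 1024):
    print(f"{n:>4} processors -> {speedup(0.9, n):.2f}x")
# 2 -> 1.82x, 4 -> 3.08x, 16 -> 6.40x, 1024 -> 9.91x: the 10% serial slice
# caps you just under 10x, no matter how much hardware you add.
```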
This gap between what we want and what we can actually achieve is the core challenge of platform engineering. If you're interested in bridging these operational realities, Working in the Mud explores how teams navigate the space between ideal technical states and pragmatic compromises.
This is the fundamental problem with serial processing: horizontal scaling (the primary tool for handling load) stops working. Adding resources doesn't translate to proportional throughput because the work itself refuses to parallelize.
The Landmine Problem: Inconsistent Sizing
Serial dependencies are bad enough. Combine them with inconsistently sized work, and you've created a system that will eventually choke.
Most messages might process in milliseconds. But scattered throughout the queue are landmines: messages that take 10-100x longer. Maybe they have more complex data structures, more nested relationships, or trigger more expensive downstream operations.
When these landmines hit a serial processing system, throughput goes off a cliff. You're not just briefly delayed; you're blocked for what feels like forever in distributed systems terms. And if you're processing serially, you can't route around the problem. Everything stacks up behind the slow one. Real systems demonstrate this dramatically: one analysis found web requests with average latency under 50ms experienced p99 spikes close to 1 second (a 20x variance between typical and worst-case performance).
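Here is a toy model of that pile-up, with made-up numbers: a serial consumer running exactly at its arrival rate, hit by a single 100x message.

```python
# Toy model, made-up numbers: messages arrive every 5 ms and normally take
# 5 ms to process, so a single serial consumer keeps up exactly -- until one
# 500 ms landmine lands in the stream.
arrival_interval_ms = 5.0
durations_ms = [5.0] * 1000
durations_ms[200] = 500.0                    # the landmine: 100x the typical message

clock = 0.0                                  # when the consumer is next free
lags = []
for i, duration in enumerate(durations_ms):
    arrived = i * arrival_interval_ms
    clock = max(clock, arrived) + duration   # serial: one message at a time
    lags.append(clock - arrived)             # how far behind arrival it finished

print(f"max lag:   {max(lags):.0f} ms")      # ~500 ms right after the landmine
print(f"final lag: {lags[-1]:.0f} ms")       # still ~500 ms: no slack, so no recovery
```

Because the consumer has no spare capacity, the lag introduced by the landmine never drains; every message behind it inherits the delay.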
This is why consistent sizing matters so much. If I know every message takes roughly the same resources, I can reason about capacity planning. I can predict when I'll fall behind and by how much. Inconsistent sizing removes that predictability entirely.
Smart Partitioning: Finding Independence Dimensions
The solution lies in identifying dimensions along which work is naturally independent. This requires understanding your business domain, not just your technical architecture.
For many SaaS systems, customers represent the obvious partition boundary. Are you doing any cross-customer analytics? Any cross-customer data aggregation? If not, you can process each customer's data independently. This is enormously valuable.
Within a single customer, the question becomes more interesting. Can you independently process across devices? Across sensors? Across feature types? Each dimension where you can maintain independence is a dimension where you can scale horizontally. Whether you use batch or real-time processing for each dimension shapes your architecture significantly; Batch and Real-Time Platforms Have Different Jobs explores how these different workload types have competing requirements for independence.
When you identify a valid independence boundary, you can structure your queues and consumers around it. Ten customers means ten independent processing streams. Falling behind on one customer? Scale that stream while other customers remain unaffected.
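One common way to encode that boundary, sketched here with the confluent_kafka client, a local broker, and a hypothetical `customer-events` topic, is to key every message by customer ID so each customer's data lands on one partition. Ordering is preserved within a customer, while different customers' streams scale independently.

```python
# A minimal sketch of customer-keyed partitioning, assuming confluent_kafka
# and a hypothetical "customer-events" topic.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish(customer_id: str, event: dict) -> None:
    producer.produce(
        "customer-events",
        key=customer_id.encode(),             # the independence boundary
        value=json.dumps(event).encode(),
    )

publish("cust-42", {"sensor": "temp-1", "reading": 21.5})
publish("cust-99", {"sensor": "temp-7", "reading": 19.0})
producer.flush()
```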
But claiming independence requires honesty about your actual dependencies. If processing one customer's data ever requires knowledge of another customer's state, you've lost that independence guarantee. Better to design explicitly for the dependencies you have than to discover them during an incident.
The Resource Utilization Trap
Here's where this gets expensive. Say you have data that requires serial processing within a customer, but you're processing all customers through a single queue. If your queue structure doesn't partition by customer, adding consumers creates contention. Two consumers might try to process the same customer's data, with one winning while the other blocks. Or you implement locking, and now you're coordinating across consumers instead of simply processing work.
The result: doubled infrastructure costs with far less than doubled throughput. This is the real cost of violating independence assumptions. Scaling doesn't work the way you expect, and you pay full price for resources you can't actually use.
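A toy model shows how little the second consumer buys you; the 90/10 traffic split and the 10 ms of simulated work are invented numbers, but the shape of the result is the point.

```python
# Toy model of the trap: two consumers pulling from one queue that isn't
# partitioned by customer, with a per-customer lock for correctness.
import queue
import threading
import time

work = queue.Queue()
for i in range(100):
    work.put("cust-A" if i % 10 else "cust-B")   # ~90% of traffic is one customer

locks = {"cust-A": threading.Lock(), "cust-B": threading.Lock()}

def consumer() -> None:
    while True:
        try:
            customer = work.get_nowait()
        except queue.Empty:
            return
        with locks[customer]:                    # the hot customer serializes here
            time.sleep(0.01)                     # simulated I/O-bound processing

start = time.time()
threads = [threading.Thread(target=consumer) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"two consumers: {time.time() - start:.2f}s")
# ~0.9s versus ~1.0s for a single consumer: double the workers, ~10% more throughput.
```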
Designing for the Ideal
So how do you build systems that maintain small, independent, consistently sized work?
- Question serial dependencies. Every time you find yourself requiring sequential processing, ask whether the business logic truly demands it. Often, serial processing is an accident of implementation rather than a genuine requirement.
- Design partition boundaries early. Don't retrofit independence. Build your data model and queue structure around the dimensions where independence naturally exists. This is architectural work, and your early decisions compound across years. Software Architecture Is a Building provides a framework for thinking about these structural decisions.
- Validate independence claims. Before committing to a partitioning strategy, trace through your actual data flows. Where do cross-partition queries happen? Where might they happen in the future?
- Monitor for landmines. Track processing time distributions, not just averages. When your p99 is 100x your p50, that tells you something important about your sizing consistency (see the sketch after this list).
- Separate fast paths from slow paths. If some work is fundamentally more expensive, route it differently. Don't let expensive operations poison queues full of cheap ones. The Async Decoupling Pattern shows how isolating expensive batch work from real-time systems prevents infrastructure strain.
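For the "monitor for landmines" point, the sketch below computes a p99/p50 ratio from a batch of hypothetical processing durations; in a real system the samples would come from your metrics pipeline.

```python
import random
from statistics import mean, quantiles

# 98% well-behaved messages (~5 ms) plus 2% landmines (400-600 ms); all hypothetical.
durations_ms = [random.gauss(5, 1) for _ in range(980)] + \
               [random.uniform(400, 600) for _ in range(20)]

cuts = quantiles(durations_ms, n=100)   # 99 cut points: cuts[49]=p50, cuts[98]=p99
p50, p99 = cuts[49], cuts[98]
print(f"mean={mean(durations_ms):.1f}ms  p50={p50:.1f}ms  "
      f"p99={p99:.1f}ms  p99/p50={p99 / p50:.0f}x")
# The mean (~15 ms) looks tame; the ~100x p99/p50 ratio is what exposes the landmines.
```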
The Payoff
When you achieve the three pillars, capacity planning becomes predictable. Scaling becomes proportional. Incidents become recoverable. You can look at growing lag and know exactly what lever to pull.
That's the promise of small, independent, consistently sized work. Systems that honor these properties let you scale without surprises, and in platform engineering, predictability is worth everything.

