Batch and Real-Time Platforms Have Different Jobs
When designing data platforms, I frequently encounter teams trying to build one unified system that handles both real-time streaming and batch analytics. The instinct makes sense: both workloads operate on the same underlying data, so why not share infrastructure?
Getting this architecture right has real consequences for cost, scalability, and operational complexity.
The challenge is that these workloads have fundamentally different characteristics. Supporting both well on a single platform is expensive and complex. In most cases, you get better results by separating them early and letting each system lean into its strengths.
Independence as an Architectural Lens
When I evaluate a data processing workload, I ask one question: how independent is this work? Independence is the key to horizontal scalability. The more independent the units of work, the more easily the system scales.
Several dimensions of independence matter:
- Payload independence: Can I process payload A without caring about payload B?
- Time independence: Does it matter when I process something, as long as I eventually process it?
- Entity independence: Can I process data for entity X without knowing anything about entity Y?
- Order independence: Can I process items out of order without breaking correctness?
Workloads that score high on independence scale well horizontally. Workloads with dependencies require coordination, which limits throughput and adds complexity.
Real-Time Workloads Favor Independence
Consider a streaming consumer that receives payloads and writes them to a data warehouse like BigQuery. The job is simple: receive payload, store with correct timestamps, acknowledge, move on.
This workload is highly independent across almost every dimension. Each payload is self-contained. A payload for one IP address has nothing to do with a payload for another. Processing can happen on separate workers without coordination. Even if payloads arrive out of order, it often does not matter because you write them with their original timestamps and let the warehouse handle the rest.
This independence is what makes real-time platforms scalable. You can parallelize across workers, partitions, and regions because each unit of work is isolated.
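A minimal sketch of such a consumer, using an in-memory sink and hypothetical payload fields (`ip`, `observed_at`, `data`) rather than a real warehouse client: each payload is handled in isolation, and the original event timestamp is stored regardless of arrival order.

```python
def handle_payload(payload: dict, sink: list) -> None:
    """Process one payload independently: store it with its original
    event timestamp, then move on. No coordination with other payloads."""
    sink.append({
        "ip": payload["ip"],                    # hypothetical field names
        "observed_at": payload["observed_at"],  # original event time, not arrival time
        "data": payload["data"],
    })

# Payloads can arrive in any order, on any worker, and the result is the same.
sink: list = []
payloads = [
    {"ip": "10.0.0.2", "observed_at": "2024-05-01T10:10:00Z", "data": "b"},
    {"ip": "10.0.0.1", "observed_at": "2024-05-01T10:00:00Z", "data": "a"},
]
for p in payloads:
    handle_payload(p, sink)
```

Because `handle_payload` touches nothing outside its own arguments, the loop could just as easily be split across many workers or partitions.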
Where Independence Breaks Down
Not every component enjoys this luxury. Consider maintaining state in a relational database.
Say you track assets with a first_seen timestamp for each IP address. If a payload arrives for an IP at 10:00, and another arrives at 10:10, first_seen should be 10:00. Straightforward.
But what if the payloads arrive out of order? If the 10:10 payload arrives first, you write 10:10 as first_seen. When the 10:00 payload arrives, you have a correctness problem.
You can solve this in the application layer by always flooring to the earlier timestamp. But now you are trading compute for independence. Every write becomes a conditional update instead of a simple insert.
This is the core trade-off: accept dependencies and build for ordered processing, or build for independence and push complexity into the application layer. Neither is wrong, but you have to choose consciously.
Batch Workloads Accept Dependencies
Batch workloads, including ML pipelines, accept dependencies that real-time systems avoid.
An ML model analyzing daily patterns needs all the payloads from that day. It cannot operate on a single payload in isolation. Run the model before all the data has landed, and your results are incomplete. Many models analyze patterns across entities, comparing one IP to others. Time series analysis requires data in order.
Every independence dimension that makes real-time processing scalable becomes a dependency dimension for batch processing.
The saving grace is latency tolerance. Real-time platforms race against incoming data volume. If payloads pile up faster than you can process them, you fall behind.
Batch workloads can afford to be slow. Nobody cares if an anomaly detection model takes 30 minutes to run. This latency tolerance is how batch workloads function despite their dependencies. They can afford the coordination costs because they are not racing the clock.
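One concrete way latency tolerance pays for coordination is a completeness gate: the batch job simply refuses to run until every expected partition has landed. A sketch, assuming hourly partitions and hypothetical helper names:

```python
def day_is_complete(expected_hours: set, landed_hours: set) -> bool:
    """Only run the batch job once every hourly partition for the day
    has landed. A streaming consumer could never afford to wait like this;
    a batch job just retries later."""
    return expected_hours <= landed_hours  # subset check: all expected landed?

expected = {f"{h:02d}" for h in range(24)}
landed = expected - {"23"}  # the last partition has not landed yet
# day_is_complete(expected, landed) is False, so the job waits and retries.
```

The gate itself is trivial; the point is that a batch scheduler can poll it every few minutes without anyone noticing the delay.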
Leaning Into Strengths
When you have two workloads with opposite characteristics, building one platform to handle both gets expensive. You end up with a system that cannot fully leverage the strengths of either approach.
Your real-time processing cannot scale as horizontally because it keeps accommodating batch dependencies. Your batch workloads cannot get the complete views they need because the infrastructure is optimized for independent parallel processing.
The alternative is two platforms connected by a thin interface.
Split your real-time platform from your batch platform as early as possible in the data flow. Let the real-time platform do what it does well: massively parallel, independent processing of incoming data. Let the batch platform do what it does well: analysis of complete datasets with cross-entity correlation.
The interface between them should be minimal. The real-time platform lands data in a store. The batch platform reads from that store when ready. They communicate through data, not shared processing infrastructure.
Assessing Your Own Workloads
When designing or inheriting a data platform, score each workload against the independence dimensions above. Workloads with similar profiles can share infrastructure. Workloads with opposite profiles should be separated.
The mistake I see teams make is assuming that because two workloads operate on the same data, they should share the same platform. But shared data does not mean shared infrastructure. The workload characteristics matter more than the data source.
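As a rough rubric, the four dimensions can be turned into a simple score. This is a hypothetical scoring sketch, not a formal methodology; the example profiles reflect the streaming-ingest and daily-ML workloads discussed above:

```python
from dataclasses import dataclass

@dataclass
class IndependenceProfile:
    """Score a workload on the four independence dimensions."""
    payload: bool  # can each payload be processed in isolation?
    time: bool     # can processing be deferred without breaking anything?
    entity: bool   # can entities be processed without cross-references?
    order: bool    # is out-of-order processing safe?

    def score(self) -> int:
        return sum([self.payload, self.time, self.entity, self.order])

streaming_ingest = IndependenceProfile(payload=True, time=True, entity=True, order=True)
# The daily model tolerates latency (time=True) but depends on complete,
# ordered, cross-entity data for everything else.
daily_ml_model = IndependenceProfile(payload=False, time=True, entity=False, order=False)
# Opposite profiles: a strong signal to separate the platforms.
```

Workloads with similar scores and similar failing dimensions are candidates for shared infrastructure; near-opposite profiles, like the two above, are the signal to split.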
The Trade-Off
Building one platform that handles both batch and real-time well is possible. Frameworks like Apache Flink and Spark Structured Streaming have made unified architectures more viable, and some organizations successfully run both workloads on shared infrastructure. It is also expensive and complex. According to Matillion's 2024 Data Readiness Survey, 89% of organizations report issues with their current platform's ability to scale pipelines to meet processing needs, and 70% rate pipeline management as complex.
The complexity tax shows up in how teams spend their time. When your platform tries to serve both workload types, the maintenance burden compounds as you juggle competing optimization requirements.
Separating them early costs you some duplication. You maintain two systems instead of one. But each system can lean into what it does well, and you avoid the ongoing complexity of trying to serve opposite requirements.
For most teams, the separation is worth it.