For whatever reason I stumbled upon an interesting post on the Materialized View blog on Substack. In contrast to its streaming counterpart, batch data processing has evolved a great deal in the past 10 years, which contributed enormously to shaping current Modern Data Platforms. While the heirs of the big data technology stacks start consolidating, cultural challenges arise, as I discussed in Modern Data Platform, Old Mindset. But stepping back to the streaming world, it seems we are still trapped in Kafka; I dare say we stubbornly cling to yesterday’s architectures. However, the data streaming story is longer than the “Big Data streaming” fuss. Long before Apache Kafka and Flink became household names, organizations were wrestling with real-time integration using custom daemons, message queues like IBM MQ, and ad-hoc log scraping. As batch-oriented systems (think Hadoop MapReduce circa 2006) flourished, the idea of processing events on the fly remained an afterthought.
The first wave of true stream‐processing platforms emerged in the early 2010s:
- Storm (2011) introduced a distributed, fault-tolerant model for handling unbounded tuples.
- Samza (2013), born at LinkedIn, embraced Kafka as both transport and storage, pioneering a more integrated approach.
- Spark Streaming (2013) grafted micro-batching onto the Spark engine, blurring the line between batch and stream.
- Flink (2014) championed native event-time processing and stateful operators, setting a new bar for low-latency analytics.
Over the past decade, these innovations gave rise to patterns like the Lambda Architecture (batch + speed layers) and later the Kappa Architecture (single stream‐only pipeline). Yet despite all this activity, many core components have barely budged in their APIs or operational models.
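The defining move of the Lambda Architecture is merging a precomputed batch view with a real-time speed layer at query time. A minimal sketch of that merge, using hypothetical page-view counts (in a real deployment the batch view would come from something like Hadoop and the speed layer from a stream processor):

```python
# Conceptual sketch of the Lambda Architecture's query-time merge.
# The data below is made up for illustration.

def merge_views(batch_view: dict, speed_view: dict) -> dict:
    """Combine precomputed batch counts with fresh streaming increments."""
    merged = dict(batch_view)
    for key, delta in speed_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged

# Batch layer: counts computed over all events up to the last batch run.
batch_view = {"page_a": 1000, "page_b": 250}
# Speed layer: increments for events that arrived since that run.
speed_view = {"page_b": 7, "page_c": 3}

print(merge_views(batch_view, speed_view))
# {'page_a': 1000, 'page_b': 257, 'page_c': 3}
```

The Kappa Architecture's pitch is precisely that this merge step, and the duplicated logic it implies, disappears when everything flows through a single stream-only pipeline.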
The Stagnation of Kafka-Centric Streaming
As mentioned at the beginning, the “Kafka: The End of the Beginning” post argues that while batch processing has enjoyed a renaissance (thanks to tools like Spark, Hadoop 2.0/YARN, and modern data-warehouse engines), streaming solutions such as Kafka and Flink have hit a plateau. Even as cloud-native paradigms emerge, Kafka’s entrenched protocol and ecosystem present a formidable barrier to fresh alternatives. Companies that built streaming pipelines five years ago still maintain them almost unchanged today, which has slowed the pace of real innovation. The piece draws a parallel to Hadoop’s early days, suggesting streaming may be at an analogous “end of the beginning” moment, ripe for disruption by a new generation of lightweight, cloud-first systems.
The journey from Samza to Flink hints at what Flink’s own evolution might look like. As “From Samza to Flink: A Decade of Stream Processing” argues, Samza, despite being formally impeccable, with an elegant decoupling of compute and transport, never reached massive adoption given its operational complexity. Kafka Streams and Kafka Connect sprang from those lessons, offering more focused, client-library models. However, the author remains skeptical of Flink’s monolithic architecture, which, despite cutting-edge semantics, reintroduces some of the same operational challenges Samza faced. This philosophical tug-of-war between “single-purpose” and “one-size-fits-all” solutions underscores a core tension: should a stream engine do everything, or just what it does best?
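What “client-library model” means in practice: stream processing embedded in the application as ordinary code with in-process state, rather than jobs submitted to a separate cluster. A rough sketch of that embedded style (not Kafka Streams itself, just an illustration of the idea, with a running word count standing in for a stateful operator):

```python
# Illustration of the client-library model: a stateful stream transform
# that runs inside the application process, with local state, instead of
# on a dedicated processing cluster. Not actual Kafka Streams code.

from collections import defaultdict
from typing import Iterable, Iterator, Tuple

def running_counts(events: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Stateful map: emit a running count per key as events arrive."""
    state = defaultdict(int)  # local state store (a RocksDB analogue)
    for key in events:
        state[key] += 1
        yield key, state[key]

stream = ["kafka", "flink", "kafka"]
print(list(running_counts(stream)))
# [('kafka', 1), ('flink', 1), ('kafka', 2)]
```

The trade-off the post circles around is visible even here: the library model makes deployment trivial (it is just your app), but pushes scaling, fault tolerance, and state management back onto the application, which is exactly what a monolithic engine like Flink takes off your hands.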
The Rise of Real-Time Table Formats
On the storage side, Apache Iceberg has become a de facto standard for table formats in both batch and streaming contexts. In “Optimizing Apache Iceberg Tables for Real-Time Analytics”, the authors dive into partitioning strategies, file‐level sorting, and compaction tactics that can dramatically accelerate query performance. They warn, however, that streaming writes can trigger “small file explosions” and metadata bloat—issues that must be tamed with periodic compaction and careful schema design. Aligning your Iceberg layout to your most-common filters, they argue, is just as critical as choosing the right stream engine.
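The “small file explosion” and its remedy are easy to see in miniature. Below is a toy model of the bin-packing idea behind compaction (in real Iceberg deployments this is done by the `rewrite_data_files` maintenance procedure); the file sizes are made up, and the greedy grouping is a simplification of what a real rewrite planner does:

```python
# Toy model of bin-pack compaction: rewrite many small data files into
# fewer files near a target size. Sizes are in MB and are hypothetical;
# Iceberg's rewrite_data_files procedure does the real work.

def compact(file_sizes_mb: list[int], target_mb: int = 512) -> list[int]:
    """Greedily group files into output files close to target_mb."""
    outputs, current = [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if current + size > target_mb and current > 0:
            outputs.append(current)  # close the current output file
            current = 0
        current += size
    if current:
        outputs.append(current)
    return outputs

# A streaming writer committing every few minutes leaves files like these:
small_files = [8, 12, 5, 200, 16, 9, 300, 7, 450, 11]
print(compact(small_files))
# 10 files become 3: [450, 500, 68]
```

Fewer, larger files mean fewer manifest entries to plan over and fewer objects to open per query, which is why the authors treat periodic compaction as non-negotiable for streaming writes.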
What Could Our Road Ahead Look Like?
As we look forward, it’s clear that the next wave of streaming innovation will need to address both protocol lock-in and operational complexity. Promising efforts—such as the S2 project rethinking cloud-native pipelines—show that we’re not out of ideas. But entrenched ecosystems rarely give way easily. We may well see a bifurcation: on one hand, highly managed, serverless streaming services that abstract away operational toil; on the other, lightweight, embeddable libraries optimized for edge use-cases.
Ultimately, our obsession with shiny new frameworks must be tempered by the realization that lasting change often comes from small, iterative improvements to our day-to-day tools. If the next decade of streaming can replicate the burst of innovation we saw in batch—from Hadoop to Spark, from Hive to modern data-warehouses—then we’ll finally inhabit a truly “modern” streaming ecosystem: cloud-native, schema-aware, and infinitely elastic. Until then, we’ll continue to build on the shoulders of giants—Kafka, Samza, Flink, Iceberg—while dreaming of what comes next.