Modern Data Stack and the Data Chasm — Part II: A Path to Leaner Data Systems
This article was originally published on Synnada on October 20, 2023.
Emergent Lean Patterns in Data Practices
In Part 1, we explored how the expanding data ecosystem led to fragmentation and complexity, thrusting teams into a “data chasm” where they are unable to integrate siloed tools. In Part 2, we focus on pioneering data teams that crossed this chasm, continuing innovation and growth. By studying these real-world cases, we aim to uncover principles for leaner data systems.
The examples chosen for this study are not random — they were selected based on the popularity, longevity and robustness of the solutions deployed. These technology platforms and methodologies have stood the test of time, continuously proving their worth and adaptability despite the dynamic landscape of data ecosystems. Such resilience makes them valuable case studies for identifying lean data practices.
First, we cover Airbnb’s data story: Amid rapid growth, Airbnb was struggling with an increasingly complex data infrastructure and inefficient workflows. They tackled this challenge by designing and building Airflow, a foundational solution to develop, schedule, and monitor data pipelines at scale. Next, we explore Uber’s story: the company was facing rising logging costs and maintainability hurdles. They simplified their data infrastructure by replacing one of their core infrastructure components, along with all its supplementing tools, with a stronger, more unified technology — resulting in a leaner system. Finally, we discuss Apache Arrow, an industry-wide solution that bridges the data chasm created by the explosive growth in big data tooling.
While we constrain our attention to these specific cases, we may cover other examples like Apache Spark, Apache Kafka, and DuckDB over time in subsequent posts (feel free to suggest others). Let’s examine our first case study.
Airbnb Tackling Tooling Disorganization and Complexity
Founded in 2007, Airbnb grew from humble beginnings into a hospitality giant serving over 4 million hosts and 1.5 billion guests globally. Airbnb’s data team contributed greatly to this success. However, the path was not without challenges.
Airbnb formed its data team in 2010. By 2015, the team ballooned to 70 people across 6 groups — massive growth in 5 years. Currently, the estimated team size is around 450. In 12 years, Airbnb’s data team grew by 43,000%!
Rapid growth strains organizations, and Airbnb was already facing significant data infrastructure challenges in 2014 (maybe even earlier). As the company was still a rapidly scaling startup, solving these challenges quickly became critical if the incredible growth rate were to continue.
The Data Chasm
Back then we knew so little about the business that any insight was groundbreaking; data infrastructure was fast, stable, and real-time (I was querying our production MySQL database); the company was so small that everyone was in the loop about every decision; and the data team (me) was aligned around a singular set of metrics and methodologies. [Source]
Airbnb’s first, one-person data team used to query production databases directly. As the team grew, they adopted an agile, decentralized approach that allowed engineers to build custom tooling tailored to their projects’ goals. By 2015, the now 70-person data team was using a complex and fragmented infrastructure with a myriad of problems:
- Unclear tool ownership led to disjointed workflows spanning siloed systems. Tasks that once took hours now took days.
- Interconnected pipelines evolved into complex hairballs of jobs. Debugging failures was a nightmare.
- New data workers were overwhelming systems designed for niche use cases. General-purpose platforms were lacking.
Airbnb was hit hard by the “data chasm” trap: Their initial decentralized approach that catalyzed growth became an impediment, now threatening further growth.
Crossing the Chasm
In 2014, with the data team now 28 people strong, infrastructure engineers realized duct tape wouldn’t suffice amid mounting complexity and a projected 4x growth. The team knew further growth and specialization meant deeper complexity and entanglement unless they reassessed their approach to workflow management and built a robust yet flexible foundation to support their growth.
Their initial approach resulted in networks of jobs with interconnected dependencies. However, these networks could be “disentangled” into directed acyclic graphs (DAGs) of computational steps. DAG computations and their architectural requirements are extensively covered in the literature, and Airbnb relied on these strong theoretical foundations while developing the Airflow platform rather than re-inventing the wheel.
Leveraging first principles and sound software engineering, Airbnb engineered Airflow as a modular, enduring workflow management solution offering high-level abstractions that simplify workflow authoring, scheduling, monitoring and logging. Airflow’s scalable, flexible and extensible architecture enabled Airbnb to cross the data chasm and continue their trajectory of data-driven innovation.
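The core abstraction behind this design is simple enough to sketch in a few lines of standard-library Python. The toy scheduler below (task names are hypothetical, and this is not Airflow’s actual implementation, which adds retries, backfills, and distributed execution) shows how a topological sort turns a dependency “hairball” into a safe execution order:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline (illustrative names, not Airbnb's actual jobs):
# each task maps to the set of tasks it depends on.
pipeline = {
    "extract_bookings": set(),
    "extract_users": set(),
    "join_tables": {"extract_bookings", "extract_users"},
    "compute_metrics": {"join_tables"},
    "publish_dashboard": {"compute_metrics"},
}

def run_order(dag):
    """Return one valid execution order for the DAG; raises CycleError
    if the 'hairball' of jobs is not actually acyclic."""
    return list(TopologicalSorter(dag).static_order())

order = run_order(pipeline)
print(order)
```

Airflow expresses the same structure declaratively — tasks are defined as operators and wired together with explicit dependency declarations — but the scheduling guarantee rests on exactly this property of DAGs.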
Uber Wrestling with Scaling Issues Around Logging
Uber, pioneering ride-hailing since 2009, operates in 70 countries and 10,500 cities globally with over 131 million monthly active users.
With Uber’s business growing exponentially (both in terms of the number of cities/countries we operated in and the number of riders/drivers using the service in each city), the amount of incoming data also increased and the need to access and analyze all the data in one place required us to build the first generation of our analytical data warehouse. [Source]
In its early days, Uber’s data needs were met by a few databases. But with exponential business and data volume growth, they constantly had to improve/iterate on their data infrastructure. By 2015, they were relying on the ELK (Elasticsearch, Logstash, Kibana) stack for logging, mainly focusing on observability use cases. With Elasticsearch’s document model, they could easily accommodate their ever-changing data needs — essential in such a rapid environment.
The Data Chasm
Fast-forward to 2017: the logging infrastructure spanned thousands of Kafka topics, stored 2.5PB of data, and ingested at 5GB/s. This incredible growth was fueled by increases in both data users and use cases, like demand forecasting, ETA predictions and driving usage. Meanwhile, microservices exploded to over 3,000, each producing more log volume. This volume was now straining the ELK stack, ultimately hindering debugging and analytics efforts. Challenges included:
- Operational Complexity: The fragmented ELK deployment required constant resharding and rebalancing across 50+ clusters per region, consuming significant engineering resources.
- Hardware Costs: Elasticsearch’s indexing requirements consumed significant hardware resources, making it expensive to scale.
- Performance Bottlenecks: More than 80% of queries were aggregation queries, and Elasticsearch was not optimized for fast aggregations across large datasets.
- Schema Conflicts: Elasticsearch’s schema enforcement led to thousands of daily type conflicts and mapping explosions, reducing developer productivity. While Elasticsearch can be configured in a schema-agnostic fashion, this probably would have required significant rearchitecting resulting in migration costs/complexities, and may have exacerbated the aforementioned problems.
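To make the type-conflict failure mode concrete, here is a toy standard-library model of schema-on-write ingestion (illustrative only — Elasticsearch’s real dynamic mapping is far more nuanced): a field’s type is pinned by the first document that uses it, and later documents that use a different type for the same field are rejected.

```python
# Toy model of schema-on-write type conflicts (not Elasticsearch's
# actual mapping logic).
def ingest(doc, schema):
    """Pin each field's type on first sighting; report conflicts."""
    conflicts = []
    for field, value in doc.items():
        expected = schema.setdefault(field, type(value))
        if type(value) is not expected:
            conflicts.append(field)
    return conflicts

schema = {}
# Service A logs "code" as an integer: the mapping is now pinned.
first = ingest({"code": 500, "msg": "timeout"}, schema)
# Service B logs the same field as a string: a type conflict.
second = ingest({"code": "ERR_TIMEOUT"}, schema)
print(first, second)
```

Multiply this pattern across thousands of microservices emitting heterogeneous logs, and daily type conflicts become an organizational tax rather than an occasional annoyance.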
Uber hit the “data chasm” when the initial benefits of the ELK approach started turning into liabilities at scale. The system that once enabled growth was now a bottleneck inhibiting it.
Crossing the Chasm
Fundamentally, data engineering solutions should aim to maximize added value per unit cost spent. Uber’s ELK stack, being a search infrastructure at its core, was overextended for analytics use cases — causing costs to rapidly outpace the added value.
Recognizing this mismatch, Uber sought a solution better optimized for analytical workloads with strong unit economics. By this time, the industry had already started seeing positive results from applying column-oriented databases to analytics workloads — a concept previously explored in academic work such as C-Store and MonetDB, which were shown to provide order-of-magnitude efficiency gains. Realizing this, Uber leaned towards ClickHouse, whose architecture follows in the footsteps of these early technologies and was well-suited to Uber’s needs.
ClickHouse’s column-oriented processing offers high-performance ad-hoc analytics at scale. Furthermore, its efficient JSON ingestion and manipulation features enable a schema-agnostic logging architecture that can adapt to evolving data over time. This avoided the scaling and cost drawbacks of attempting to re-architect Elasticsearch into a schema-agnostic system.
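Why column orientation pays off for an aggregation-heavy query mix is easy to see even in a toy standard-library sketch (field names are illustrative; ClickHouse’s actual engine is vectorized C++): an aggregate touches exactly one densely packed column instead of every field of every record.

```python
from array import array

# The same log records, stored row-wise (toy data).
rows = [
    {"service": "rides", "latency_ms": 12, "status": 200},
    {"service": "eats",  "latency_ms": 48, "status": 500},
    {"service": "rides", "latency_ms": 7,  "status": 200},
]

# Columnar layout: one compact, typed buffer per field.
columns = {
    "service":    [r["service"] for r in rows],
    "latency_ms": array("i", (r["latency_ms"] for r in rows)),
    "status":     array("i", (r["status"] for r in rows)),
}

# "SELECT avg(latency_ms)" scans a single contiguous buffer; the
# row store above would have to walk every field of every record.
avg_latency = sum(columns["latency_ms"]) / len(columns["latency_ms"])
print(avg_latency)
```

With 80%+ of Uber’s queries being aggregations, this access pattern — plus the compression wins that come from storing like-typed values together — is precisely where a columnar engine earns its keep.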
A purpose-built architecture leveraging ClickHouse to meet their evolving analytics requirements was a promising vehicle for Uber as they sought to cross the data chasm. Advantages included:
- 10x higher ingestion throughput per node over Elasticsearch.
- 50% lower hardware costs while ingesting more data.
- Increased operational efficiency and reduced overhead.
By optimizing for value per unit cost spent, Uber successfully crossed the chasm with ClickHouse — enabling continued analytics innovation with an architecture positioned for scale, performance, and efficiency.
Apache Arrow Bridging the Interoperability Gap
The big data era of the 2010s was akin to a “Cambrian Explosion” of data tools. From Hadoop’s distributed data storage to Spark’s in-memory computation and Pandas for Python-based dataframe manipulation, the ecosystem was teeming with specialized life forms. Companies like Databricks, Cloudera, and DataStax were at the forefront of this revolution, each offering unique solutions optimized for specific tasks. This diversity fueled innovation and growth, allowing organizations to tailor their data analytics capabilities precisely.
The Data Chasm
However, this biodiversity of tools came at a steep cost. Each tool had its own internal data representation, making data interchange a cumbersome and resource-intensive task. As Wes McKinney, the creator of Pandas and a key contributor to Apache Arrow, points out: “The computational overhead from data serialization and deserialization was becoming a bottleneck.” The very diversity that had been an asset was turning into a liability, creating a data chasm filled with inefficiency and complexity.
Crossing the Chasm with Arrow
Recognizing the root cause of the problem as the lack of a universal data interchange format, the team behind Apache Arrow took a first-principles approach to find a solution. They turned to foundational research in the field of database management systems, particularly the work done on MonetDB/X100. These papers demonstrated that vectorized, columnar in-memory processing could lead to order-of-magnitude performance improvements in a vast space of use cases.
Inspired by this scientific insight, Apache Arrow took the road to be a “universal columnar memory format” optimized for analytics and compatible across multiple languages and platforms. As Jacques Nadeau, a key contributor to Apache Arrow, states: “Arrow is designed to be the underpinning of the analytics big data ecosystem for the next decade.”
The project incorporates the following key principles:
- Columnar Format: Building on prior academic research, Arrow leverages columnar storage to enable more efficient analytics operations.
- Language and Platform Agnosticism: The Arrow ecosystem provides libraries for multiple languages to realize seamless data interchange in practice.
- Zero-Copy Reads: Arrow supports zero-copy reads to significantly reduce the computational overhead associated with data movement. Zero-copy reads are a key source of the performance improvements mentioned above.
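The zero-copy idea can be sketched with nothing but the Python standard library (a toy analogy, not Arrow’s actual format; the pyarrow library provides the real implementation): handing a consumer a memoryview shares the underlying buffer instead of serializing and copying it.

```python
from array import array

# A contiguous, typed buffer of 64-bit integers -- a toy stand-in
# for a columnar Arrow buffer.
values = array("q", range(1_000_000))

view = memoryview(values)   # zero-copy handle to the same memory
window = view[100:200]      # zero-copy slice: no bytes are moved

# Writes through the original buffer are visible through the view,
# which shows the "read" shared memory rather than copying it.
values[100] = -1
print(window[0], len(window))
```

Arrow generalizes this buffer-sharing discipline across process and language boundaries, which is what lets, say, a Python consumer read data produced by a JVM process without a serialize/deserialize round trip.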
By grounding their solution in well-established research and taking a first-principles approach, Apache Arrow aspires to bridge the “data chasm” that had been created by the Cambrian explosion of data tools. It is turning the ecosystem’s diversity from a liability back into an asset, setting a new standard for efficient, scalable data processing.
In a short period of time, Apache Arrow has positioned itself as a cornerstone of the modern data stack, enabling organizations to escape the “data chasm” and continue their trajectory of data-driven innovation. Its universal columnar data format is being adopted by a range of big data tools, possibly setting a new standard for data interchange.
Introducing the Lean Data Stack
Studying these successful “crossing the chasm” stories reveals common threads that catalyze success. Rather than chasing trends (e.g. creating an “Ops” layer with each hype cycle), these teams focused on root causes and engineered durable solutions by asking foundational questions first. Their approaches reflect that even though the hype cycle favors the early bird, only thoughtful foundations deliver in the long haul.
Another theme is the pursuit of economic yet capable systems aligning costs tightly with the added value. Rather than overbuilding lavishly, they weighed expenses like compute, storage and bandwidth against needs for performance, speed and flexibility. Ultimately, they were trying to justify whether their data engineering complexity delivers enough additional insights per dollar spent. This pragmatism provides guidance: Anchor architectures to measurable value per dollar.
Beyond foundations and economics, these solutions also focus on composability through modularity, capability depth, and standards. Rather than monolithic structures, these teams built flexible systems from simple, compatible components with clean interfaces on top of extensible cores. Such extensible cores allow third parties to enhance and expand in unforeseen ways, letting users dynamically adapt by piecing together elements as necessary.
Distilling lessons from these pioneers, we put forth the initial tenets of the Lean Data Stack — a paradigm change from today’s Modern Data Stack. The vision is to architect simple yet powerful data systems optimized for longevity, thrift, and flexibility. This refined methodology is defined by the following principles:
- Foundational: Build sound foundations focused on durable solutions rather than ephemeral trends. Engineer resilient systems positioned for the long haul.
- High-surplus: Align infrastructure costs tightly to the value gained per dollar spent. Pragmatism over lavishness.
- Compoundable: Favor modular components with capability depth that can be composed together for complex capabilities. Enable agility through interchangeable solutions based on common standards.
While these principles are just an initial framing, we believe they capture the essence of the Lean Data Stack. There is still much left to uncover about what makes data systems and architectures lean. By taking an iterative approach, analyzing additional case studies and investigating practical examples, we aim to continue adding depth to the Lean Data Stack concept — equipping it to meet future data challenges.
In the next article, we will explore how Synnada applies these concepts as we build next-generation data infrastructure technology. By embracing foundational, high-surplus, and compoundable design, Synnada is engineering a solution that aims to help organizations bypass the data chasm.
References and Remarks
 The Airbnb engineering team provided an in-depth look at Airflow’s origins and evolution in this Medium post. Maxime Beauchemin, Airflow’s original creator, also explained key technical details in this blog and this video.
 This YouTube video by Uber engineer Hua Wang covers how Uber utilized the ELK stack and the challenges they faced at scale.
 The Uber engineering team outlined their ClickHouse implementation journey and how ClickHouse’s columnar architecture proved better optimized for Uber’s analytical workloads in this blog post and presentation.
 This post by Dremio provides background on the history and goals of Apache Arrow.