19 - Google BigQuery / Dremel (CMU Advanced Databases / Spring 2023)

CMU Database Group・2-minute read

The class shifts focus from individual techniques to real-world systems, emphasizing how industry papers apply the methods discussed and how a grasp of fundamentals helps decode marketing claims. Topics include BigQuery, Spark SQL, and Snowflake, highlighting the separation of compute from storage in modern database systems and challenges such as missing data statistics and adaptive query optimization.

Insights

  • Industry papers typically lag a few years behind actual system development, but reading them with an understanding of the fundamentals helps decipher marketing claims about new techniques and builds a diverse catalog of problem-solving approaches.
  • Google's significant influence on the database world, shown by Dremel evolving into BigQuery, underscores the importance of open-source software for large systems; distinctive features such as the in-memory shuffle service and in-situ data processing set these systems apart in query-processing efficiency and fault tolerance.

Summary

00:00

Analyzing Industry Papers for System Development Trends

  • The class will now focus on real systems rather than individual techniques from papers.
  • The goal is to understand how industry papers apply the discussed techniques.
  • Understanding fundamentals helps in deciphering marketing claims about new techniques.
  • Reading papers helps in building a catalog of different problem-solving approaches.
  • The agenda includes topics like BigQuery, Spark SQL, Snowflake, Velox, and Redshift.
  • Industry papers are typically a few years behind in terms of system development.
  • A recurring theme is the separation of compute from storage in modern database systems.
  • Lack of statistics about data and adaptive query optimization are common challenges.
  • Support for non-relational data formats like JSON and XML is crucial.
  • Most systems covered will focus on vectorized execution for query processing.
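
The vectorized model mentioned above can be sketched in a few lines: instead of pulling one tuple at a time through the plan, operators work on column-oriented batches, invoking each predicate once per batch rather than once per row. This is an illustrative sketch, not any real engine's API; the column names and batch size are made up.

```python
# Illustrative sketch of batch-at-a-time (vectorized) execution versus
# tuple-at-a-time iteration. All names and sizes here are hypothetical.

BATCH_SIZE = 4  # real engines typically use ~1024 rows per batch

def scan_batches(rows, batch_size=BATCH_SIZE):
    """Yield column-oriented batches: dict of column name -> list of values."""
    for i in range(0, len(rows), batch_size):
        chunk = rows[i:i + batch_size]
        yield {
            "price": [r[0] for r in chunk],
            "qty":   [r[1] for r in chunk],
        }

def vectorized_filter(batch, threshold):
    """Evaluate the predicate over a whole column at once (one call per
    batch, not one per tuple), producing a selection vector of indices."""
    return [i for i, p in enumerate(batch["price"]) if p > threshold]

rows = [(10, 2), (55, 1), (30, 4), (80, 3), (5, 9)]
total_qty = 0
for batch in scan_batches(rows):
    for i in vectorized_filter(batch, threshold=25):
        total_qty += batch["qty"][i]

print(total_qty)  # qty summed over rows with price > 25 -> 1 + 4 + 3 = 8
```

The per-batch function call amortizes interpretation overhead across many tuples, which is the core argument for vectorization in these systems.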

13:52

Google's Influence on Open-Source Database Evolution

  • YouTube developed a sharding middleware for MySQL, later open-sourced as Vitess (now commercialized by PlanetScale).
  • YouTube's autonomy, owing to its financial success, allowed it to open-source the middleware without legal interference.
  • Vitess is the only public product from the original YouTube project, and it required rewrites to function without internal Google services like Spanner.
  • Google's cloud services influence the importance of open-source software, especially for large systems like Dremel or Redshift.
  • Google's AlloyDB, based on Postgres, offers an unusual on-premises deployment option shipped as a Docker image.
  • Google's App Engine included a JSON-style document database that reportedly inspired MongoDB, another example of Google's influence on the database field.
  • Dremel, developed in 2006, aimed to enable quick analysis on data files generated from other tools, particularly for MapReduce jobs.
  • Dremel evolved from a shared-nothing system to a disaggregated architecture on top of the Google file system, and was commercialized as BigQuery in 2012.
  • Dremel's in-memory shuffle service sets it apart; among the systems covered, only Spark has something comparable in its query processing.
  • Dremel's approach to in-situ data processing allows direct query execution on data files without prior ingestion or schema definition, aligning with modern data lake concepts.

27:49

Optimizing Query Execution with Shuffle Service

  • The coordinator node in query execution plans acts as the central point, gathering file locations before query execution begins.
  • A batch approach is recommended to prevent file server overload, ensuring all necessary data is obtained at the start.
  • Data at rest is stored on a distributed file system, akin to Colossus, with a coordinator node managing worker tasks.
  • Workers retrieve data from the file system, perform computations, and store outputs in a Shuffle service for quick access.
  • The Shuffle service facilitates data transfer between query stages, optimizing performance and enabling fault tolerance.
  • Dedicated hardware handles hashing and partitioning within the Shuffle service, enhancing speed.
  • Intermediate results are stored in the Shuffle service, allowing for seamless continuation of queries if a worker fails.
  • The Shuffle service acts as a checkpoint in the query lifecycle, ensuring uninterrupted progress even if a worker node encounters issues.
  • The abstraction layer between producers and consumers in the Shuffle service simplifies software engineering tasks, despite potential performance trade-offs.
  • The benefits of the Shuffle service extend beyond performance, offering ease of maintenance and improved fault tolerance in query processing.
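
The shuffle-service pattern described in the bullets above can be sketched as follows: producer workers hash-partition their output into a shared in-memory store, and the next stage's consumers each read one partition. Because the store persists intermediate results, it doubles as a checkpoint, so a replacement worker can re-read the same partition if a consumer dies. All class and method names here are illustrative, not BigQuery's actual interface.

```python
from collections import defaultdict

# Minimal sketch of a shuffle service decoupling two query stages.
# Hypothetical API; the real system backs this with dedicated hardware.

class ShuffleService:
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.partitions = defaultdict(list)  # partition id -> rows

    def write(self, key, row):
        # Hash-partition each producer row to a partition.
        self.partitions[hash(key) % self.num_partitions].append(row)

    def read(self, partition_id):
        # Idempotent read: a restarted worker can safely re-read it.
        return list(self.partitions[partition_id])

shuffle = ShuffleService(num_partitions=2)

# Stage 1: producer workers emit (key, value) rows into the shuffle.
for key, value in [("a", 1), ("b", 2), ("a", 3), ("b", 4)]:
    shuffle.write(key, (key, value))

# Stage 2: consumers aggregate, one partition each, independently.
totals = {}
for pid in range(shuffle.num_partitions):
    for key, value in shuffle.read(pid):
        totals[key] = totals.get(key, 0) + value

print(totals)  # merged across partitions: a -> 4, b -> 6
```

Note how neither stage references the other directly: producers finish and exit before consumers start, which is exactly the abstraction boundary the lecture credits with simplifying the engineering.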

44:02

Efficient Memory Management and Fault Tolerance

  • Workers have larger memory machines to avoid memory limitations and are located in the same data center for low latency and high internal bandwidth.
  • Memory is used to store data and state, with workers evicting data when memory limits are exceeded.
  • The system does not support transactions and is not optimized for multi-reads or transactional queries.
  • Each worker has its own local memory for processing data, with the ability to spill to disk if needed.
  • Worker nodes can write results fetched from object storage (S3-style stores) to local cache and disk.
  • Checkpoints ensure fault tolerance, with the ability to reassign tasks if a worker fails.
  • Stragglers are handled by reassigning tasks to new workers if deadlines are not met.
  • Dynamic repartitioning allows for scaling up or down based on the size of the stage.
  • BigQuery pricing is based on the amount of data read, with additional charges for egress and ingress.
  • Query optimization in BigQuery involves heuristics and a simple cost-based optimizer to generate and adjust query plans at runtime.
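
The scan-based pricing point above is worth a back-of-the-envelope sketch: under the on-demand model, cost is proportional to bytes read, so pruning columns or partitions cuts the bill directly. The $5/TiB rate below is purely illustrative, not a quoted price.

```python
# Back-of-the-envelope sketch of pay-per-bytes-scanned pricing.
# The rate is an assumption for illustration only.

PRICE_PER_TIB = 5.00   # assumed USD per TiB scanned
TIB = 1024 ** 4

def query_cost(bytes_scanned, price_per_tib=PRICE_PER_TIB):
    """Cost of a query that reads `bytes_scanned` bytes."""
    return bytes_scanned / TIB * price_per_tib

full_scan = query_cost(10 * TIB)    # SELECT * over the whole table
pruned    = query_cost(0.5 * TIB)   # only the columns/partitions needed

print(f"full scan: ${full_scan:.2f}, pruned: ${pruned:.2f}")
```

This pricing model is one reason columnar formats matter so much in these systems: reading fewer columns is not just faster, it is literally cheaper.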

59:38

Efficient Data Processing Techniques in Google's Infrastructure

  • During dynamic repartitioning, a worker reads data back from the shuffle nodes and rehashes it into two partitions, splitting an oversized partition so it does not spill to disk.
  • This is crucial for efficiency: partitioning can be adjusted on the fly without stopping the query or rebuilding hash tables, unlike traditional approaches.
  • Google relies on the distributed file system Colossus for storage; Dremel accesses it through an API and benefits from Colossus's continuous improvements without any direct involvement.
  • Capacitor, a non-open-source tool similar to Parquet or ORC, is used for database encoding and file formatting, with utility libraries for data access.
  • Capacitor aids in converting nested data structures like JSON to a columnar format, optimizing data traversal and access without parsing entire schemas.
  • Google emphasizes the importance of catalogs within data sets, storing thousands of attributes efficiently for quick access during queries.
  • Google's transition from a NoSQL-focused company to supporting SQL-based database systems, like Dremel, marked a significant shift in their internal infrastructure.
  • Systems inspired by Dremel, such as Drill, Dremio, and Apollo, have emerged, each with unique approaches to query execution and data processing.
  • Dremio, founded in 2015, leverages Apache Arrow and Reflections for fast query execution on external data files, utilizing materialized views for efficiency.
  • Impala, initiated in 2012, aims to replicate Google's F1 and Dremel techniques in open source, focusing on codegen (code generation) for expression evaluation and CSV file parsing.
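
The column-striping idea behind Capacitor (and Parquet) can be sketched in miniature: a nested record is shredded into flat per-path columns so a query can read just the paths it touches without parsing whole documents. This hedged sketch omits the repetition/definition levels that real Dremel-style formats use to reconstruct records; the field names are made up.

```python
# Simplified sketch of shredding nested records into columnar form,
# in the spirit of Capacitor/Parquet (repetition levels omitted).

def shred(record, prefix="", columns=None):
    """Flatten one nested dict into dotted column paths -> value lists."""
    if columns is None:
        columns = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            shred(value, path, columns)   # recurse into nested structure
        else:
            columns.setdefault(path, []).append(value)
    return columns

columns = {}
docs = [
    {"name": "a", "geo": {"country": "US", "city": "NYC"}},
    {"name": "b", "geo": {"country": "DE", "city": "Berlin"}},
]
for doc in docs:
    shred(doc, columns=columns)

# A query touching only geo.country reads one small column,
# not the full documents.
print(columns["geo.country"])  # -> ['US', 'DE']
```

This is what lets the engine "optimize data traversal and access without parsing entire schemas," as the bullet above puts it: each dotted path becomes an independently readable column.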

01:13:58

Shuffle Phase Improves Performance and Efficiency

  • The paper from 2011 introduced the shuffle phase, which may seem wasteful but actually creates engineering opportunities: decomposing the data system into separate services allows each to be developed and scaled independently. The abstraction also hides the data-transfer plumbing behind additional code, making worker code and other system parts cleaner to write and the system as a whole easier to maintain.