Bigtable: Distributed Storage for Structured Data
Google's wide-column distributed database powering Search, Gmail, Maps, and YouTube. Deep dive into the data model, tablet hierarchy, read/write paths, and fault tolerance.
Published by Google in 2006, Bigtable is a distributed storage system for structured data designed to scale to petabytes across thousands of commodity servers. It underlies Google Search (web index), Gmail (email storage), Google Maps (geo data), and Google Analytics.
๐ก Core idea
Data Model
The fundamental unit of data is a cell identified by three coordinates: row key, column (family:qualifier), and timestamp. Multiple timestamped versions of a cell are kept automatically.
-- Bigtable data model: (row, column family:qualifier, timestamp) โ value
Table: "webtable"
Row Key Column Family: "contents" Column Family: "anchor"
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
"com.cnn.www" contents:html@t9 โ "<html>โฆ" anchor:cnnsi.com@t8 โ "CNN"
contents:html@t8 โ "<html>โฆ" anchor:my.look.ca@t5 โ "CNN.com"
"com.google.www" contents:html@t6 โ "<html>โฆ"
"com.nytimes.www" contents:html@t11 โ "<html>โฆ" anchor:slate.com@t9 โ "NYT"
Key insight: rows sorted lexicographically by row key
Reversed domains โ com.google.* groups all Google pages togetherArbitrary string up to 64 KB. Rows sorted lexicographically. A row range is called a tablet.
Must be declared at table creation. Unit of access control and compression. Example: contents, anchor.
64-bit integer (usually ยตs since epoch). Multiple versions per cell; configurable retention (last N or last T seconds).
Three-Level Tablet Location Hierarchy
Finding which tablet server holds a given row key requires traversing three levels of indirection. Clients cache locations aggressively โ after the first lookup, reads go directly to the tablet server with zero extra RPCs.
โข 1 level: Chubby would need to store all tablet locations โ doesn't scale.
โข 2 levels: Would limit total tablets to ~128 K (insufficient for Google scale).
โข 3 levels: 128 K METADATA tablets ร 128 K user tablets each = ~17 billion user tablets. Sufficient for petabytes.
โข 4+ levels: Adds latency with no benefit at Google's data scale.
Client Read Request โ Step by Step
The master is never in the read path. Clients find tablet servers autonomously through the 3-level hierarchy and then communicate directly.
Tablet Internals: Write & Read Paths
Write Path
Read Path
๐ก Bloom filters: the secret to read performance
Master Responsibilities
The master handles only metadata and control operations. All data reads and writes go directly between clients and tablet servers โ the master never touches actual data.
| Responsibility | Details | Impact if Master Fails |
|---|---|---|
| Tablet assignment | Assigns unassigned tablets to tablet servers on startup or after failure | New assignments blocked; existing tablets still serve requests |
| Server health monitoring | Monitors tablet server heartbeats via Chubby; detects failures | Cannot detect new failures; existing tablets unaffected |
| Load balancing | Moves tablets between servers based on load metrics | No rebalancing; hot spots persist until master recovers |
| Schema management | Handles table creation, deletion, column family changes | Schema changes blocked; reads/writes continue |
| GC of SSTable files | Removes orphaned SSTables from GFS | Storage may grow; no correctness impact |
Chubby Integration
Chubby unavailability is catastrophic
If Chubby becomes unavailable, new clients cannot perform their initial lookup. Tablet servers that lose their Chubby session suicide (to prevent split-brain). The master cannot detect failures. However, existing clients with cached locations can continue to read/write for some time.
Real-World Uses at Google
Geo data at planetary scale
Email storage
Time-series click data
Python HappyBase API Example
HappyBase provides a Pythonic interface to HBase (which shares Bigtable's API). The same patterns apply to Google Cloud Bigtable via the official client library.
1import happybase
2
3# Connect to HBase (compatible with Bigtable API)
4connection = happybase.Connection('bigtable-host', port=9090)
5
6# Create a table with two column families
7connection.create_table(
8 'webtable',
9 {'contents': {}, 'anchor': {'max_versions': 10}}
10)
11
12table = connection.table('webtable')
13
14# Write a row
15table.put(
16 b'com.google.www',
17 {
18 b'contents:html': b'<html>...</html>',
19 b'anchor:nyt.com': b'Google',
20 b'anchor:reddit.com': b'Google Search',
21 }
22)
23
24# Read a single row
25row = table.row(b'com.google.www', columns=[b'contents', b'anchor'])
26print(row) # {b'contents:html': b'...', b'anchor:nyt.com': b'Google', ...}
27
28# Scan a range (uses sorted row keys efficiently)
29for key, data in table.scan(
30 row_start=b'com.google.',
31 row_stop=b'com.googlez.' # exclusive upper bound
32):
33 print(key, data[b'contents:html'][:100])
34
35# Delete a specific cell version
36table.delete(b'com.google.www', columns=[b'anchor:nyt.com'])
37
38# Read specific version (timestamp)
39row = table.row(
40 b'com.google.www',
41 columns=[b'contents:html'],
42 timestamp=1672531200000 # Unix ms
43)Bigtable Exam Questions
These 6 questions cover the core Bigtable concepts tested in the exam. Click any question to reveal the hint, then follow the link for a full model answer.