Bigtable: Distributed Storage for Structured Data

Google's wide-column distributed database powering Search, Gmail, Maps, and YouTube. Deep dive into the data model, tablet hierarchy, read/write paths, and fault tolerance.

Published by Google in 2006, Bigtable is a distributed storage system for structured data designed to scale to petabytes across thousands of commodity servers. It underlies Google Search (web index), Gmail (email storage), Google Maps (geo data), and Google Analytics.

๐Ÿ’ก Core idea

Bigtable is a sparse, distributed, persistent, multidimensional sorted map. The map is indexed by a (row key, column family:qualifier, timestamp) triple and each value in the map is an uninterpreted array of bytes.

Data Model

The fundamental unit of data is a cell identified by three coordinates: row key, column (family:qualifier), and timestamp. Multiple timestamped versions of a cell are kept automatically.

bigtable-data-model.sql
-- Bigtable data model: (row, column family:qualifier, timestamp) โ†’ value

Table: "webtable"

Row Key                  Column Family: "contents"        Column Family: "anchor"
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
"com.cnn.www"            contents:html@t9  โ†’ "<html>โ€ฆ"   anchor:cnnsi.com@t8  โ†’ "CNN"
                         contents:html@t8  โ†’ "<html>โ€ฆ"   anchor:my.look.ca@t5 โ†’ "CNN.com"
"com.google.www"         contents:html@t6  โ†’ "<html>โ€ฆ"
"com.nytimes.www"        contents:html@t11 โ†’ "<html>โ€ฆ"   anchor:slate.com@t9  โ†’ "NYT"

Key insight: rows sorted lexicographically by row key
Reversed domains โ†’ com.google.* groups all Google pages together
Row Key

Arbitrary string up to 64 KB. Rows sorted lexicographically. A row range is called a tablet.

Column Family

Must be declared at table creation. Unit of access control and compression. Example: contents, anchor.

Timestamp

64-bit integer (usually ยตs since epoch). Multiple versions per cell; configurable retention (last N or last T seconds).

Three-Level Tablet Location Hierarchy

Finding which tablet server holds a given row key requires traversing three levels of indirection. Clients cache locations aggressively โ€” after the first lookup, reads go directly to the tablet server with zero extra RPCs.

Bigtable 3-Level Hierarchy
Cache:
Level 0Level 1Level 2Level 3โ€ฆโ€ฆChubby FileRoot Tablet (never split)META 1META 2META NTโ‚Tโ‚‚Tโ‚™
Chubby
Root Tablet (Level 1)
METADATA Tablets (Level 2)
User Tablets (Level 3)
Click any node for details
Why exactly 3 levels?

โ€ข 1 level: Chubby would need to store all tablet locations โ€” doesn't scale.

โ€ข 2 levels: Would limit total tablets to ~128 K (insufficient for Google scale).

โ€ข 3 levels: 128 K METADATA tablets ร— 128 K user tablets each = ~17 billion user tablets. Sufficient for petabytes.

โ€ข 4+ levels: Adds latency with no benefit at Google's data scale.

Client Read Request โ€” Step by Step

The master is never in the read path. Clients find tablet servers autonomously through the 3-level hierarchy and then communicate directly.

1
Cache check
Check client cache
Does the client have a cached tablet location for this row key? If yes, skip steps 2โ€“7 entirely.
2
Level 0
Query Chubby (if cold)
Read the single Chubby file storing the network address of the root tablet server. This is the only Chubby read in the entire flow.
3
Level 1
Root Tablet Server
Send: "Where is the METADATA tablet for row R?" The root tablet responds with the network address of the responsible METADATA tablet server.
4
Cache
Cache root tablet location
Client caches the root tablet location. Future requests skip step 2.
5
Level 2
METADATA Tablet Server
Send: "Where is the user tablet for row R?" The METADATA tablet server responds with (tablet_server_host, port, key_range).
6
Cache
Cache METADATA result
Client caches the user tablet location. Future reads for rows in this tablet skip steps 2โ€“5.
7
Level 3
Tablet Server โ€” GET
Client sends GET(rowKey, column, timestamp) directly to the user tablet server. No master involved!
8
Merge
Merge memtable + SSTables
Tablet server checks memtable, then SSTables (using Bloom filters to skip empty ones). Returns merged result.

Tablet Internals: Write & Read Paths

Write Path

WRITE PATH (SSTable + WAL): 1. Client โ†’ Tablet Server: PUT (rowKey, column, value) 2. Tablet Server: Write to WAL (Write-Ahead Log) on GFS โ€” ensures durability 3. Tablet Server: Write to Memtable (in-memory sorted buffer) 4. Return success to client โ† fast! (just memory + sequential disk write) When memtable is full โ†’ Minor Compaction: โ€ข Freeze current memtable โ€ข Write to new SSTable on GFS โ€ข Create fresh empty memtable Periodically โ†’ Major Compaction: โ€ข Merge ALL SSTables + memtable into one SSTable โ€ข Only time deleted data is physically removed (tombstones eliminated)

Read Path

READ PATH (SSTable merge): 1. Client โ†’ Tablet Server: GET (rowKey) 2. Check memtable first (most recent writes, in-memory, fast) 3. Check SSTable files in order (newest first): a. Use Bloom filter: "Does this SSTable possibly contain the key?" โ†’ NO โ†’ skip this SSTable (save I/O!) โ†’ YES โ†’ check SSTable index โ†’ seek to block โ†’ read value 4. Merge results from memtable + all SSTables: โ€ข SSTable files are immutable and sorted โ€ข Use a merge iterator (heap-based) โ€ข Return most recent version (by timestamp) Note: more SSTables = more merge work โ†’ compactions keep count low

๐Ÿ’ก Bloom filters: the secret to read performance

Each SSTable has a Bloom filter โ€” a compact probabilistic data structure that answers "does key K exist in this SSTable?" in O(1) with ~1% false positive rate. Without Bloom filters, every SSTable file would need to be scanned for every read. With them, reads typically touch only 1โ€“2 files regardless of how many SSTables exist.

Master Responsibilities

The master handles only metadata and control operations. All data reads and writes go directly between clients and tablet servers โ€” the master never touches actual data.

Bigtable master responsibilities and failure impact
ResponsibilityDetailsImpact if Master Fails
Tablet assignmentAssigns unassigned tablets to tablet servers on startup or after failureNew assignments blocked; existing tablets still serve requests
Server health monitoringMonitors tablet server heartbeats via Chubby; detects failuresCannot detect new failures; existing tablets unaffected
Load balancingMoves tablets between servers based on load metricsNo rebalancing; hot spots persist until master recovers
Schema managementHandles table creation, deletion, column family changesSchema changes blocked; reads/writes continue
GC of SSTable filesRemoves orphaned SSTables from GFSStorage may grow; no correctness impact

Chubby Integration

Four ways Bigtable uses Chubby
๐Ÿ“
Root tablet location
A single Chubby file stores the network address of the root tablet server. This is the entry point for all clients.
๐Ÿ’“
Tablet server liveness
Each tablet server acquires an exclusive lock on a unique file in /bigtable/servers/. If the lock is lost, the tablet server commits suicide (kills itself). The master detects failure by watching this directory.
๐Ÿ†
Master election
The master acquires a master lock at startup to prevent split-brain. Only one active master can exist at a time.
๐Ÿ“‹
Schema and ACL storage
Table schemas and access control lists are stored in Chubby, making them highly available and consistent.

Chubby unavailability is catastrophic

If Chubby becomes unavailable, new clients cannot perform their initial lookup. Tablet servers that lose their Chubby session suicide (to prevent split-brain). The master cannot detect failures. However, existing clients with cached locations can continue to read/write for some time.

Real-World Uses at Google

๐ŸŒ In the Wild:Google Maps

Geo data at planetary scale

Google Maps stores satellite imagery, map tiles, and Points of Interest in Bigtable. The row key encodes the geohash of the location, enabling efficient range scans for a bounding box query. Column families separate different resolutions and data types. At peak, Maps serves billions of tile requests per day from Bigtable.
๐ŸŒ In the Wild:Gmail

Email storage

Each user's email is stored with the user ID as the row key prefix. Column families separate message headers, body, attachments (stored in GFS with pointer in Bigtable), and labels. Timestamps provide automatic versioning for drafts and edits. Bigtable's sorted row keys allow efficient mailbox scans (newest-first via reverse timestamps).
๐ŸŒ In the Wild:Google Analytics

Time-series click data

Analytics stores event data with reversed timestamp row keys (so newest events are first in sort order). Column families separate different event types (pageview, click, conversion). Range scans on row key prefixes efficiently retrieve a time window for a given website. Bigtable's multi-version cells store historical snapshots.

Python HappyBase API Example

HappyBase provides a Pythonic interface to HBase (which shares Bigtable's API). The same patterns apply to Google Cloud Bigtable via the official client library.

bigtable_client.py
1import happybase
2
3# Connect to HBase (compatible with Bigtable API)
4connection = happybase.Connection('bigtable-host', port=9090)
5
6# Create a table with two column families
7connection.create_table(
8    'webtable',
9    {'contents': {}, 'anchor': {'max_versions': 10}}
10)
11
12table = connection.table('webtable')
13
14# Write a row
15table.put(
16    b'com.google.www',
17    {
18        b'contents:html':     b'<html>...</html>',
19        b'anchor:nyt.com':    b'Google',
20        b'anchor:reddit.com': b'Google Search',
21    }
22)
23
24# Read a single row
25row = table.row(b'com.google.www', columns=[b'contents', b'anchor'])
26print(row)  # {b'contents:html': b'...', b'anchor:nyt.com': b'Google', ...}
27
28# Scan a range (uses sorted row keys efficiently)
29for key, data in table.scan(
30    row_start=b'com.google.',
31    row_stop=b'com.googlez.'      # exclusive upper bound
32):
33    print(key, data[b'contents:html'][:100])
34
35# Delete a specific cell version
36table.delete(b'com.google.www', columns=[b'anchor:nyt.com'])
37
38# Read specific version (timestamp)
39row = table.row(
40    b'com.google.www',
41    columns=[b'contents:html'],
42    timestamp=1672531200000      # Unix ms
43)

Bigtable Exam Questions

These 6 questions cover the core Bigtable concepts tested in the exam. Click any question to reveal the hint, then follow the link for a full model answer.