When you run backend jobs — duplicate detection, geocode QA, brand matching, ML feature extraction where spatial proximity matters — over tens of millions of places, you hit a scaling wall:

A fixed-resolution grid can't match the wildly uneven density of real-world place data. What you need is adaptive processing blocks: spatial partitions where every unit has a bounded, predictable cost.

This post describes a pattern in Postgres with the h3-pg extension:

Emit non-overlapping H3 cells such that each cell has place_count <= threshold.

Dense areas subdivide into finer hexes; sparse areas stay coarse. The output is an adaptive frontier — a cut across H3 resolutions where every place belongs to exactly one emitted cell.

All code and benchmark queries are in the overturemaps-pg repo.

The goal

Given a table of places with a geometry column, produce:

cell (h3index), res (int), place_count (bigint)

…where:

  1. no two emitted cells overlap: no emitted cell is an ancestor or descendant of another
  2. every place falls in exactly one emitted cell
  3. place_count <= threshold for every emitted cell (res 10 is the floor; in the benchmarks below no res 10 cell was ever needed)

These cells work as queue items, batch inputs, or sharding units for backend processing.

Strict nesting via cell_to_parent

H3 distinguishes logical containment (exact) from geographic containment (approximate). h3_cell_to_parent is a bitwise truncation of the index — the logical hierarchy is exact. Geographic boundaries are a different story (cell boundary polygons don't nest perfectly), but for index-based work that doesn't matter.

We compute the cell once at res 10 and derive all coarser levels via h3_cell_to_parent(). This is both fast (one latlng_to_cell call instead of ten — cell_to_parent is just bit manipulation) and simple (one source of truth for the hierarchy).

The pattern: "emit the coarsest cell under a threshold"

Pick a threshold like 10,000 places.

For each place (really: for each res 10 bucket of places), find the coarsest resolution at which the place's containing cell holds <= threshold places.

That gives you a frontier across resolutions.
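The selection logic can be sketched in Python on a toy hierarchy, where a cell is a tuple of digits and its parent is the tuple's prefix. This is the shape of the algorithm only, not real H3 indexing; `build_frontier` and the sample counts are illustrative:

```python
# Toy model of the "coarsest cell under a threshold" frontier.
# Cells are digit paths (tuples); the parent of a cell is its prefix.
# This is NOT real H3 index math -- just the selection logic.
from collections import Counter

def build_frontier(leaf_counts, threshold):
    """leaf_counts: {leaf_path: count}. Returns {emitted_cell: count}."""
    max_res = max(len(p) for p in leaf_counts)
    # Aggregate counts at every coarser resolution.
    level = {r: Counter() for r in range(max_res + 1)}
    for leaf, n in leaf_counts.items():
        for r in range(max_res + 1):
            level[r][leaf[:r]] += n
    frontier = Counter()
    for leaf, n in leaf_counts.items():
        # Coarsest ancestor whose total is under the threshold;
        # fall back to the leaf itself (like the ELSE branch in the SQL).
        emit = leaf
        for r in range(1, max_res + 1):
            if level[r][leaf[:r]] <= threshold:
                emit = leaf[:r]
                break
        frontier[emit] += n
    return dict(frontier)

leaves = {(0, 0): 6000, (0, 1): 7000, (1, 0): 800, (1, 1): 200}
print(build_frontier(leaves, 10_000))
# Dense branch (0, *) splits to res 2; sparse branch (1, *) stays at res 1.
```

The dense branch exceeds the threshold at res 1 (13,000 places) so its leaves are emitted individually, while the sparse branch collapses into a single coarse cell.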

Using the cells as processing blocks

The output is a work queue:

A typical pattern:

  1. materialize the frontier (REFRESH MATERIALIZED VIEW)
  2. enqueue cell ids into your job system
  3. for each cell, load/process places in that cell, then write results

Handling boundary effects

Any spatial chunking scheme has the same boundary problem: two places can be true duplicates while landing in adjacent cells. If you only compare within the home cell, you miss cross-boundary pairs.

The fix is to process each cell with an overlap buffer using H3's neighbor lookup:

for each cell in frontier:
    candidates = places_in(cell) ∪ places_in(gridDisk(cell, 1))
    results    = find_matches(candidates)
    emit only results whose "home" cell == cell

The overlap means each job loads ~7x more candidates at k=1 (home cell plus 6 neighbors). In practice the cost is modest because the threshold already caps each cell's count.
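A runnable 1-D toy of the same pattern, with integer cells standing in for H3 indexes and `grid_disk` standing in for H3's gridDisk (the names, EPS, and the "first place's home cell owns the pair" rule are all illustrative choices, not H3 API):

```python
# Toy 1-D version of the overlap-buffer pattern. Cells are integers,
# grid_disk(c, 1) = {c-1, c, c+1}, and two places "match" when they are
# within EPS of each other.
from itertools import combinations

EPS = 0.3

def home_cell(x):
    return int(x)  # toy latLngToCell: cell = integer part of position

def grid_disk(cell, k=1):
    return range(cell - k, cell + k + 1)

def process(frontier_cells, places):
    by_cell = {}
    for x in places:
        by_cell.setdefault(home_cell(x), []).append(x)
    results = set()
    for cell in frontier_cells:
        # Candidates: home cell plus its k=1 neighbors.
        cands = [x for c in grid_disk(cell) for x in by_cell.get(c, [])]
        for a, b in combinations(sorted(cands), 2):
            if b - a <= EPS and home_cell(a) == cell:
                # Emit only pairs "owned" by this cell, so a pair that
                # straddles a boundary is reported exactly once.
                results.add((a, b))
    return results

places = [0.9, 1.1, 2.05, 2.3]  # 0.9 and 1.1 straddle the 0|1 boundary
print(process([0, 1, 2], places))
```

The cross-boundary pair (0.9, 1.1) is found by the job for cell 0 (via its neighbor buffer) and is skipped by the job for cell 1 because its home cell differs, so no pair is lost and no pair is duplicated.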

Visualizing the frontier

The processing blocks are the deliverable — but if you want to sanity-check partitions, convert cells to polygons and render them. Color by place_count, by res, or extrude in 3D.

h3_cell_to_boundary is only needed for visualization. All partitioning and overlap logic stays in H3 index space.

Implementation

Requires extensions: postgis, h3, h3_postgis. The view does four things:

  1. compute h3_r10 once and derive h3_r1..h3_r9 via h3_cell_to_parent
  2. compute counts per cell at each resolution
  3. join counts back and choose emit_res via a CASE ladder
  4. group by the emitted cell so each place is counted exactly once

CREATE MATERIALIZED VIEW places_h3_t{threshold} AS
WITH
base AS (
  SELECT h3_r10, place_count,
    h3_cell_to_parent(h3_r10, 9) AS h3_r9, h3_cell_to_parent(h3_r10, 8) AS h3_r8,
    h3_cell_to_parent(h3_r10, 7) AS h3_r7, h3_cell_to_parent(h3_r10, 6) AS h3_r6,
    h3_cell_to_parent(h3_r10, 5) AS h3_r5, h3_cell_to_parent(h3_r10, 4) AS h3_r4,
    h3_cell_to_parent(h3_r10, 3) AS h3_r3, h3_cell_to_parent(h3_r10, 2) AS h3_r2,
    h3_cell_to_parent(h3_r10, 1) AS h3_r1
  FROM (
    SELECT h3_latlng_to_cell(p.geometry::point, 10) AS h3_r10,
           COUNT(*)::bigint AS place_count
    FROM places p GROUP BY 1
  ) r10
),
c1  AS (SELECT h3_r1  AS cell, SUM(place_count)::bigint AS cnt FROM base GROUP BY 1),
c2  AS (SELECT h3_r2  AS cell, SUM(place_count)::bigint AS cnt FROM base GROUP BY 1),
c3  AS (SELECT h3_r3  AS cell, SUM(place_count)::bigint AS cnt FROM base GROUP BY 1),
c4  AS (SELECT h3_r4  AS cell, SUM(place_count)::bigint AS cnt FROM base GROUP BY 1),
c5  AS (SELECT h3_r5  AS cell, SUM(place_count)::bigint AS cnt FROM base GROUP BY 1),
c6  AS (SELECT h3_r6  AS cell, SUM(place_count)::bigint AS cnt FROM base GROUP BY 1),
c7  AS (SELECT h3_r7  AS cell, SUM(place_count)::bigint AS cnt FROM base GROUP BY 1),
c8  AS (SELECT h3_r8  AS cell, SUM(place_count)::bigint AS cnt FROM base GROUP BY 1),
c9  AS (SELECT h3_r9  AS cell, SUM(place_count)::bigint AS cnt FROM base GROUP BY 1),
c10 AS (SELECT h3_r10 AS cell, SUM(place_count)::bigint AS cnt FROM base GROUP BY 1),
tagged AS (
  SELECT b.h3_r10,
    CASE
      WHEN c1.cnt  <= {threshold} THEN 1 WHEN c2.cnt  <= {threshold} THEN 2
      WHEN c3.cnt  <= {threshold} THEN 3 WHEN c4.cnt  <= {threshold} THEN 4
      WHEN c5.cnt  <= {threshold} THEN 5 WHEN c6.cnt  <= {threshold} THEN 6
      WHEN c7.cnt  <= {threshold} THEN 7 WHEN c8.cnt  <= {threshold} THEN 8
      WHEN c9.cnt  <= {threshold} THEN 9 ELSE 10
    END AS emit_res
  FROM base b
  JOIN c1 ON b.h3_r1=c1.cell JOIN c2 ON b.h3_r2=c2.cell JOIN c3 ON b.h3_r3=c3.cell
  JOIN c4 ON b.h3_r4=c4.cell JOIN c5 ON b.h3_r5=c5.cell JOIN c6 ON b.h3_r6=c6.cell
  JOIN c7 ON b.h3_r7=c7.cell JOIN c8 ON b.h3_r8=c8.cell JOIN c9 ON b.h3_r9=c9.cell
  JOIN c10 ON b.h3_r10=c10.cell
)
SELECT
  h3_cell_to_parent(t.h3_r10, t.emit_res) AS cell,
  t.emit_res AS res,
  SUM(b.place_count)::bigint AS place_count
FROM base b
JOIN tagged t ON b.h3_r10 = t.h3_r10
GROUP BY 1, 2;

Why "res 10 then parent"

Two reasons.

Performance. h3_latlng_to_cell does real work (face lookup, coordinate projection, digit computation). h3_cell_to_parent is nearly free — it's a bitwise truncation. For 73M places, that's 1 expensive call + 9 cheap calls instead of 10 expensive calls per point.

Correctness. H3's logical hierarchy is exact: cell_to_parent always gives you the one true ancestor. But latlng_to_cell is a geometric classification step — it projects a point onto an icosahedral face and assigns it to the nearest cell center. H3's docs are silent on whether latlng_to_cell(P, 6) is guaranteed to equal cell_to_parent(latlng_to_cell(P, 7), 6) for every point. By deriving all resolutions from a single res 10 cell, we sidestep the question entirely.
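A 1-D analogue makes both points concrete: cells at resolution r are the intervals [k/2^r, (k+1)/2^r) on [0, 1), point classification costs a multiply-and-floor per resolution, and deriving a parent is a bit shift. In this toy the geometry nests perfectly, so both routes always agree; H3's hexagons don't nest geographically, which is exactly why deriving everything from one fine cell is the safer route there. The function names are illustrative:

```python
# Toy 1-D analogue of "compute fine, derive coarse": cells at resolution r
# are intervals [k/2**r, (k+1)/2**r) on [0, 1).

def point_to_cell(p, res):
    return int(p * (1 << res))   # the "expensive" geometric classification

def cell_to_parent(cell, res, parent_res):
    return cell >> (res - parent_res)   # the "free" truncation

p = 0.73
fine = point_to_cell(p, 10)          # classify once at the finest level
derived = [cell_to_parent(fine, 10, r) for r in range(1, 10)]
direct  = [point_to_cell(p, r) for r in range(1, 10)]
assert derived == direct  # in this nested toy the two routes always agree
```

One classification plus nine shifts replaces ten classifications, mirroring the 1-expensive-plus-9-cheap accounting above.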

Note: if you need metric-distance buffers (e.g. "all places within 500m"), use PostGIS geography (geodesic) or H3 gridDisk neighbors, not EPSG:3857 planar math.

Validation

The benchmark suite runs two checks against the view:

  1. No overlapping cells. For every emitted cell, generate all its ancestors via h3_cell_to_parent and check that none of them are also emitted. Result: 0 overlapping pairs across all four thresholds.

  2. Boundary mismatch proof. Find a place that h3_latlng_to_cell assigns to a cell but that falls outside the cell's h3_cell_to_boundary_geometry polygon. This always finds one — because H3's geographic containment is approximate. It confirms that you should never use ST_Intersects on boundary polygons to determine cell membership; always use the index.

Boundary mismatch: a place point outside the H3 cell boundary polygon but correctly assigned by the index. H3's geographic containment is approximate; the logical index is exact.
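The overlap check (check 1) can be sketched in the same toy digit-path model, where a cell's ancestors are its prefixes; `overlapping_pairs` is an illustrative name, not a query from the benchmark suite:

```python
# Sketch of the no-overlap check: with cells as digit paths
# (parent = prefix), a frontier is overlap-free iff no emitted cell
# has an emitted ancestor.

def overlapping_pairs(emitted):
    emitted = set(emitted)
    return [(cell[:r], cell)                  # (ancestor, descendant)
            for cell in emitted
            for r in range(1, len(cell))      # every proper ancestor
            if cell[:r] in emitted]

good = [(0, 0), (0, 1), (1,)]
bad  = [(0,), (0, 1), (1,)]          # (0,) is an ancestor of (0, 1)
print(overlapping_pairs(good))       # a valid frontier: no pairs
print(overlapping_pairs(bad))        # one pair: ((0,), (0, 1))
```

The SQL version does the same walk with h3_cell_to_parent and an anti-join against the emitted set.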

Picking a threshold: what the numbers say

Benchmarked on 72,783,221 places from the Overture Maps dataset. Each view takes ~190 seconds to materialize regardless of threshold (the dominant cost is the res 10 grouping).

threshold   cells    res range   avg/cell   stddev   Gini
    5,000   63,795   1–9            1,141    1,192   0.547
   10,000   32,740   1–9            2,223    2,350   0.551
   50,000    7,279   1–7            9,999   11,695   0.597
  100,000    3,910   1–6           18,615   23,444   0.630

The Gini coefficient (0 = perfectly uniform, 1 = all places in one cell) measures distribution evenness.
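For reference, the coefficient can be computed from per-cell counts with the standard sorted-values formula (this is a generic implementation, not the benchmark suite's query):

```python
# Gini coefficient of per-cell place counts.
# On sorted values x_1..x_n: G = sum((2i - n - 1) * x_i) / (n * sum(x)).

def gini(counts):
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    return sum((2 * i - n - 1) * x for i, x in enumerate(xs, 1)) / (n * total)

print(gini([100] * 10))       # perfectly uniform -> 0.0
print(gini([0] * 9 + [1000])) # all in one of 10 cells -> 0.9 (-> 1 as n grows)
```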

Lower Gini = more uniform distribution. The sweet spot is the highest threshold where Gini hasn't degraded much — you get fewer cells (less scheduling overhead) without sacrificing workload balance.

5K→10K: free win. Doubling the threshold cuts cells in half (64K → 33K) with a Gini increase of just 0.004 (0.547 → 0.551). That's half the queue items for essentially the same distribution quality.

10K→50K: the unexplored gap. Gini jumps from 0.551 to 0.597 (+0.046) — a much bigger degradation. But somewhere in this range (say 20K–25K) the Gini likely stays close to 0.55 while cutting cells further. We didn't benchmark intermediate values; worth investigating for your workload.

50K+: uniformity breaks down. The stddev (11,695) exceeds the average (9,999) — cell sizes vary wildly. The finest resolutions disappear. Fine for "split the world into ~7K chunks," too uneven for balanced work queues.

100K: barely adaptive. Only 6 resolution levels remain, Gini hits 0.63. The bottom decile averages 10 places/cell, the top decile averages 74,575 — a 7,500x spread.

Watch for single-place cells. Every threshold produces cells with exactly 1 place — pure scheduling overhead. At t=5K there are 851 of them (1.3% of cells); at t=100K only 18. Not worth optimizing away, but good to know when sizing your queue.

The practical rule: set the threshold to the largest batch your worker can process without memory pressure.

Detailed breakdown for threshold = 10,000:

res   cells    places       avg/cell   min     max
  1      640      572,928        895     2   9,953
  2      794    1,503,620      1,894     1  10,000
  3    2,893    6,640,106      2,295     1  10,000
  4    7,751   18,321,519      2,364     1   9,995
  5    8,634   18,201,626      2,108     1   9,997
  6    7,486   17,339,166      2,316     1   9,986
  7    3,878    8,889,703      2,292     1   9,896
  8      657    1,303,188      1,984     1   9,956
  9        7       11,365      1,624   357   2,217

32,740 cells covering 72,783,221 places, each place counted exactly once. Every max stays at or below the threshold. Full benchmark results including decile distributions and validation queries are in the companion repo.

Querying the view

-- Basic
SELECT cell, res, place_count FROM places_h3_t10000;

-- With geometry for rendering
SELECT cell, res, place_count,
  ST_SetSRID(h3_cell_to_boundary(cell)::geometry, 4326) AS geom
FROM places_h3_t10000;

-- Filter to a bounding box (Los Angeles)
SELECT cell, res, place_count,
  ST_SetSRID(h3_cell_to_boundary(cell)::geometry, 4326) AS geom
FROM places_h3_t10000
WHERE ST_Intersects(
  ST_SetSRID(h3_cell_to_boundary(cell)::geometry, 4326),
  ST_MakeEnvelope(-118.7, 33.7, -117.7, 34.4, 4326)
);

This is a good use case for a materialized view — the expensive part is building the multi-resolution counts; querying becomes a cheap SELECT with optional bbox filtering. Refresh on your data update schedule.

Conclusion

  1. Compute once at res 10, derive parents. h3_cell_to_parent is a bitwise truncation, practically free: one expensive latlng_to_cell call per point instead of ten.
  2. The 5K–10K threshold range is the sweet spot for backend work queues. Below 5K you're paying scheduling overhead for marginal uniformity gains. Above 50K the distribution becomes too uneven for balanced workloads.
  3. Use the index, not the polygon. H3's geographic containment is approximate — cellToBoundary polygons don't perfectly contain all index-assigned points. Always assign places via latLngToCell, never via ST_Intersects on the boundary.
  4. Validate your frontier. The overlap check (0 pairs across all thresholds) and total count match confirm the view is correct. Run them after any schema change.

Tested with PostgreSQL 17, PostGIS 3.5, and h3-pg 4.x on the Overture Maps dataset (72.8M places). Full benchmark code and queries: github.com/nikmarch/overturemaps-pg