When you run backend jobs — duplicate detection, geocode QA, brand matching, ML feature extraction where spatial proximity matters — over tens of millions of places, you hit a scaling wall:

A fixed-resolution grid can't match the wildly uneven density of real-world place data. What you need is adaptive processing blocks: spatial partitions where every unit has a bounded, predictable cost.

This post describes a pattern in Postgres with the h3-pg extension:

Emit non-overlapping H3 cells such that each cell has place_count <= threshold.

Dense areas subdivide into finer hexes; sparse areas stay coarse. The output is an adaptive frontier — a cut across H3 resolutions where every place belongs to exactly one emitted cell.

All code and benchmark queries are in the overturemaps-pg repo.

The goal

Given a table of places with a geometry column, produce:

cell (h3index), res (int), place_count (bigint)

…where:

  1. no two emitted cells overlap: no emitted cell is an ancestor or descendant of another
  2. every place falls in exactly one emitted cell
  3. place_count <= threshold for every emitted cell (res 10 is the floor; in the benchmarks below no res 10 cell was ever needed)

These cells work as queue items, batch inputs, or sharding units for backend processing.

Strict nesting via cell_to_parent

H3 distinguishes logical containment (exact) from geographic containment (approximate). h3_cell_to_parent is a bitwise truncation of the index — the logical hierarchy is exact. Geographic boundaries are a different story (cell boundary polygons don't nest perfectly), but for index-based work that doesn't matter.

We compute the cell once at res 10 and derive all coarser levels via h3_cell_to_parent(). This is both fast (one latlng_to_cell call instead of ten — cell_to_parent is just bit manipulation) and simple (one source of truth for the hierarchy).

The pattern: "emit the coarsest cell under a threshold"

Pick a threshold like 10,000 places.

For each place (really: for each res 10 bucket of places), find the coarsest resolution at which the place's containing cell holds <= threshold places.

That gives you a frontier across resolutions.
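The selection logic can be sketched in Python on a toy hierarchy, where a cell is a tuple of digits and its parent is the tuple's prefix. This is the shape of the algorithm only, not real H3 indexing; `build_frontier` and the sample counts are illustrative:

```python
# Toy model of the "coarsest cell under a threshold" frontier.
# Cells are digit paths (tuples); the parent of a cell is its prefix.
# This is NOT real H3 index math -- just the selection logic.
from collections import Counter

def build_frontier(leaf_counts, threshold):
    """leaf_counts: {leaf_path: count}. Returns {emitted_cell: count}."""
    max_res = max(len(p) for p in leaf_counts)
    # Aggregate counts at every coarser resolution.
    level = {r: Counter() for r in range(max_res + 1)}
    for leaf, n in leaf_counts.items():
        for r in range(max_res + 1):
            level[r][leaf[:r]] += n
    frontier = Counter()
    for leaf, n in leaf_counts.items():
        # Coarsest ancestor whose total is under the threshold;
        # fall back to the leaf itself (like the ELSE branch in the SQL).
        emit = leaf
        for r in range(1, max_res + 1):
            if level[r][leaf[:r]] <= threshold:
                emit = leaf[:r]
                break
        frontier[emit] += n
    return dict(frontier)

leaves = {(0, 0): 6000, (0, 1): 7000, (1, 0): 800, (1, 1): 200}
print(build_frontier(leaves, 10_000))
# Dense branch (0, *) splits to res 2; sparse branch (1, *) stays at res 1.
```

The dense branch exceeds the threshold at res 1 (13,000 places) so its leaves are emitted individually, while the sparse branch collapses into a single coarse cell.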

Using the cells as processing blocks

The output is a work queue:

A typical pattern:

  1. materialize the frontier (REFRESH MATERIALIZED VIEW)
  2. enqueue cell ids into your job system
  3. for each cell, load/process places in that cell, then write results

Handling boundary effects

Any spatial chunking scheme has the same boundary problem: two places can be true duplicates while landing in adjacent cells. If you only compare within the home cell, you miss cross-boundary pairs.

The fix is to process each cell with an overlap buffer using H3's neighbor lookup:

for each cell in frontier:
    candidates = places_in(cell) ∪ places_in(gridDisk(cell, 1))
    results    = find_matches(candidates)
    emit only results whose "home" cell == cell

The overlap means each job loads ~7x more candidates at k=1 (home cell plus 6 neighbors). In practice the cost is modest because the threshold already caps each cell's count.
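A runnable 1-D toy of the same pattern, with integer cells standing in for H3 indexes and `grid_disk` standing in for H3's gridDisk (the names, EPS, and the "first place's home cell owns the pair" rule are all illustrative choices, not H3 API):

```python
# Toy 1-D version of the overlap-buffer pattern. Cells are integers,
# grid_disk(c, 1) = {c-1, c, c+1}, and two places "match" when they are
# within EPS of each other.
from itertools import combinations

EPS = 0.3

def home_cell(x):
    return int(x)  # toy latLngToCell: cell = integer part of position

def grid_disk(cell, k=1):
    return range(cell - k, cell + k + 1)

def process(frontier_cells, places):
    by_cell = {}
    for x in places:
        by_cell.setdefault(home_cell(x), []).append(x)
    results = set()
    for cell in frontier_cells:
        # Candidates: home cell plus its k=1 neighbors.
        cands = [x for c in grid_disk(cell) for x in by_cell.get(c, [])]
        for a, b in combinations(sorted(cands), 2):
            if b - a <= EPS and home_cell(a) == cell:
                # Emit only pairs "owned" by this cell, so a pair that
                # straddles a boundary is reported exactly once.
                results.add((a, b))
    return results

places = [0.9, 1.1, 2.05, 2.3]  # 0.9 and 1.1 straddle the 0|1 boundary
print(process([0, 1, 2], places))
```

The cross-boundary pair (0.9, 1.1) is found by the job for cell 0 (via its neighbor buffer) and is skipped by the job for cell 1 because its home cell differs, so no pair is lost and no pair is duplicated.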

Visualizing the frontier

The processing blocks are the deliverable — but if you want to sanity-check partitions, convert cells to polygons and render them. Color by place_count, by res, or extrude in 3D.

h3_cell_to_boundary is only needed for visualization. All partitioning and overlap logic stays in H3 index space.

Implementation

Requires extensions: postgis, h3, h3_postgis. The view does four things:

  1. compute h3_r10 once and derive h3_r1..h3_r9 via h3_cell_to_parent
  2. compute counts per cell at each resolution
  3. join counts back and choose emit_res via a CASE ladder
  4. group by the emitted cell so each place is counted exactly once

CREATE MATERIALIZED VIEW places_h3_t{threshold} AS
WITH
base AS (
  SELECT h3_r10, place_count,
    h3_cell_to_parent(h3_r10, 9) AS h3_r9, h3_cell_to_parent(h3_r10, 8) AS h3_r8,
    h3_cell_to_parent(h3_r10, 7) AS h3_r7, h3_cell_to_parent(h3_r10, 6) AS h3_r6,
    h3_cell_to_parent(h3_r10, 5) AS h3_r5, h3_cell_to_parent(h3_r10, 4) AS h3_r4,
    h3_cell_to_parent(h3_r10, 3) AS h3_r3, h3_cell_to_parent(h3_r10, 2) AS h3_r2,
    h3_cell_to_parent(h3_r10, 1) AS h3_r1
  FROM (
    SELECT h3_latlng_to_cell(p.geometry::point, 10) AS h3_r10,
           COUNT(*)::bigint AS place_count
    FROM places p GROUP BY 1
  ) r10
),
c1  AS (SELECT h3_r1  AS cell, SUM(place_count)::bigint AS cnt FROM base GROUP BY 1),
c2  AS (SELECT h3_r2  AS cell, SUM(place_count)::bigint AS cnt FROM base GROUP BY 1),
c3  AS (SELECT h3_r3  AS cell, SUM(place_count)::bigint AS cnt FROM base GROUP BY 1),
c4  AS (SELECT h3_r4  AS cell, SUM(place_count)::bigint AS cnt FROM base GROUP BY 1),
c5  AS (SELECT h3_r5  AS cell, SUM(place_count)::bigint AS cnt FROM base GROUP BY 1),
c6  AS (SELECT h3_r6  AS cell, SUM(place_count)::bigint AS cnt FROM base GROUP BY 1),
c7  AS (SELECT h3_r7  AS cell, SUM(place_count)::bigint AS cnt FROM base GROUP BY 1),
c8  AS (SELECT h3_r8  AS cell, SUM(place_count)::bigint AS cnt FROM base GROUP BY 1),
c9  AS (SELECT h3_r9  AS cell, SUM(place_count)::bigint AS cnt FROM base GROUP BY 1),
c10 AS (SELECT h3_r10 AS cell, SUM(place_count)::bigint AS cnt FROM base GROUP BY 1),
tagged AS (
  SELECT b.h3_r10,
    CASE
      WHEN c1.cnt  <= {threshold} THEN 1 WHEN c2.cnt  <= {threshold} THEN 2
      WHEN c3.cnt  <= {threshold} THEN 3 WHEN c4.cnt  <= {threshold} THEN 4
      WHEN c5.cnt  <= {threshold} THEN 5 WHEN c6.cnt  <= {threshold} THEN 6
      WHEN c7.cnt  <= {threshold} THEN 7 WHEN c8.cnt  <= {threshold} THEN 8
      WHEN c9.cnt  <= {threshold} THEN 9 ELSE 10
    END AS emit_res
  FROM base b
  JOIN c1 ON b.h3_r1=c1.cell JOIN c2 ON b.h3_r2=c2.cell JOIN c3 ON b.h3_r3=c3.cell
  JOIN c4 ON b.h3_r4=c4.cell JOIN c5 ON b.h3_r5=c5.cell JOIN c6 ON b.h3_r6=c6.cell
  JOIN c7 ON b.h3_r7=c7.cell JOIN c8 ON b.h3_r8=c8.cell JOIN c9 ON b.h3_r9=c9.cell
  JOIN c10 ON b.h3_r10=c10.cell
)
SELECT
  h3_cell_to_parent(t.h3_r10, t.emit_res) AS cell,
  t.emit_res AS res,
  SUM(b.place_count)::bigint AS place_count
FROM base b
JOIN tagged t ON b.h3_r10 = t.h3_r10
GROUP BY 1, 2;

Why "res 10 then parent"

Two reasons.

Performance. h3_latlng_to_cell does real work (face lookup, coordinate projection, digit computation). h3_cell_to_parent is nearly free — it's a bitwise truncation. For 73M places, that's 1 expensive call + 9 cheap calls instead of 10 expensive calls per point.

Correctness. H3's logical hierarchy is exact: cell_to_parent always gives you the one true ancestor. But latlng_to_cell is a geometric classification step — it projects a point onto an icosahedral face and assigns it to the nearest cell center. H3's docs are silent on whether latlng_to_cell(P, 6) is guaranteed to equal cell_to_parent(latlng_to_cell(P, 7), 6) for every point. By deriving all resolutions from a single res 10 cell, we sidestep the question entirely.
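A 1-D analogue makes both points concrete: cells at resolution r are the intervals [k/2^r, (k+1)/2^r) on [0, 1), point classification costs a multiply-and-floor per resolution, and deriving a parent is a bit shift. In this toy the geometry nests perfectly, so both routes always agree; H3's hexagons don't nest geographically, which is exactly why deriving everything from one fine cell is the safer route there. The function names are illustrative:

```python
# Toy 1-D analogue of "compute fine, derive coarse": cells at resolution r
# are intervals [k/2**r, (k+1)/2**r) on [0, 1).

def point_to_cell(p, res):
    return int(p * (1 << res))   # the "expensive" geometric classification

def cell_to_parent(cell, res, parent_res):
    return cell >> (res - parent_res)   # the "free" truncation

p = 0.73
fine = point_to_cell(p, 10)          # classify once at the finest level
derived = [cell_to_parent(fine, 10, r) for r in range(1, 10)]
direct  = [point_to_cell(p, r) for r in range(1, 10)]
assert derived == direct  # in this nested toy the two routes always agree
```

One classification plus nine shifts replaces ten classifications, mirroring the 1-expensive-plus-9-cheap accounting above.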

Note: if you need metric-distance buffers (e.g. "all places within 500m"), use PostGIS geography (geodesic) or H3 gridDisk neighbors, not EPSG:3857 planar math.

Validation

The benchmark suite runs two checks against the view:

  1. No overlapping cells. For every emitted cell, generate all its ancestors via h3_cell_to_parent and check that none of them are also emitted. Result: 0 overlapping pairs across all four thresholds.

  2. Boundary mismatch proof. Find a place that h3_latlng_to_cell assigns to a cell but that falls outside the cell's h3_cell_to_boundary_geometry polygon. This always finds one — because H3's geographic containment is approximate. It confirms that you should never use ST_Intersects on boundary polygons to determine cell membership; always use the index.

Boundary mismatch: a place point outside the H3 cell boundary polygon but correctly assigned by the index. H3's geographic containment is approximate; the logical index is exact.
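The overlap check (check 1) can be sketched in the same toy digit-path model, where a cell's ancestors are its prefixes; `overlapping_pairs` is an illustrative name, not a query from the benchmark suite:

```python
# Sketch of the no-overlap check: with cells as digit paths
# (parent = prefix), a frontier is overlap-free iff no emitted cell
# has an emitted ancestor.

def overlapping_pairs(emitted):
    emitted = set(emitted)
    return [(cell[:r], cell)                  # (ancestor, descendant)
            for cell in emitted
            for r in range(1, len(cell))      # every proper ancestor
            if cell[:r] in emitted]

good = [(0, 0), (0, 1), (1,)]
bad  = [(0,), (0, 1), (1,)]          # (0,) is an ancestor of (0, 1)
print(overlapping_pairs(good))       # a valid frontier: no pairs
print(overlapping_pairs(bad))        # one pair: ((0,), (0, 1))
```

The SQL version does the same walk with h3_cell_to_parent and an anti-join against the emitted set.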

Picking a threshold: what the numbers say

Benchmarked on 72,783,221 places from the Overture Maps dataset. Each view takes ~190 seconds to materialize regardless of threshold (the dominant cost is the res 10 grouping).

threshold   cells    res range   avg/cell   stddev   Gini
    5,000   63,795   1–9            1,141    1,192   0.547
   10,000   32,740   1–9            2,223    2,350   0.551
   50,000    7,279   1–7            9,999   11,695   0.597
  100,000    3,910   1–6           18,615   23,444   0.630

The Gini coefficient (0 = perfectly uniform, 1 = all places in one cell) measures distribution evenness.
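For reference, the coefficient can be computed from per-cell counts with the standard sorted-values formula (this is a generic implementation, not the benchmark suite's query):

```python
# Gini coefficient of per-cell place counts.
# On sorted values x_1..x_n: G = sum((2i - n - 1) * x_i) / (n * sum(x)).

def gini(counts):
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    return sum((2 * i - n - 1) * x for i, x in enumerate(xs, 1)) / (n * total)

print(gini([100] * 10))       # perfectly uniform -> 0.0
print(gini([0] * 9 + [1000])) # all in one of 10 cells -> 0.9 (-> 1 as n grows)
```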

Lower Gini = more uniform distribution. The sweet spot is the highest threshold where Gini hasn't degraded much — you get fewer cells (less scheduling overhead) without sacrificing workload balance.

5K→10K: free win. Doubling the threshold cuts cells in half (64K → 33K) with a Gini increase of just 0.004 (0.547 → 0.551). That's half the queue items for essentially the same distribution quality.

10K→50K: the unexplored gap. Gini jumps from 0.551 to 0.597 (+0.046) — a much bigger degradation. But somewhere in this range (say 20K–25K) the Gini likely stays close to 0.55 while cutting cells further. We didn't benchmark intermediate values; worth investigating for your workload.

50K+: uniformity breaks down. The stddev (11,695) exceeds the average (9,999) — cell sizes vary wildly. The finest resolutions disappear. Fine for "split the world into ~7K chunks," too uneven for balanced work queues.

100K: barely adaptive. Only 6 resolution levels remain, Gini hits 0.63. The bottom decile averages 10 places/cell, the top decile averages 74,575 — a 7,500x spread.

Watch for single-place cells. Every threshold produces cells with exactly 1 place — pure scheduling overhead. At t=5K there are 851 of them (1.3% of cells); at t=100K only 18. Not worth optimizing away, but good to know when sizing your queue.

The practical rule: set the threshold to the largest batch your worker can process without memory pressure.

Detailed breakdown for threshold = 10,000:

res   cells    places       avg/cell   min     max
  1      640      572,928        895     2   9,953
  2      794    1,503,620      1,894     1  10,000
  3    2,893    6,640,106      2,295     1  10,000
  4    7,751   18,321,519      2,364     1   9,995
  5    8,634   18,201,626      2,108     1   9,997
  6    7,486   17,339,166      2,316     1   9,986
  7    3,878    8,889,703      2,292     1   9,896
  8      657    1,303,188      1,984     1   9,956
  9        7       11,365      1,624   357   2,217

32,740 cells covering 72,783,221 places, each place counted exactly once. Every max stays at or below the threshold. Full benchmark results including decile distributions and validation queries are in the companion repo.

Querying the view

-- Basic
SELECT cell, res, place_count FROM places_h3_t10000;

-- With geometry for rendering
SELECT cell, res, place_count,
  ST_SetSRID(h3_cell_to_boundary(cell)::geometry, 4326) AS geom
FROM places_h3_t10000;

-- Filter to a bounding box (Los Angeles)
SELECT cell, res, place_count,
  ST_SetSRID(h3_cell_to_boundary(cell)::geometry, 4326) AS geom
FROM places_h3_t10000
WHERE ST_Intersects(
  ST_SetSRID(h3_cell_to_boundary(cell)::geometry, 4326),
  ST_MakeEnvelope(-118.7, 33.7, -117.7, 34.4, 4326)
);

This is a good use case for a materialized view — the expensive part is building the multi-resolution counts; querying becomes a cheap SELECT with optional bbox filtering. Refresh on your data update schedule.

Conclusion

  1. Compute once at res 10, derive parents. h3_cell_to_parent is a bitwise truncation, practically free: one expensive latlng_to_cell call per point instead of ten.
  2. The 5K–10K threshold range is the sweet spot for backend work queues. Below 5K you're paying scheduling overhead for marginal uniformity gains. Above 50K the distribution becomes too uneven for balanced workloads.
  3. Use the index, not the polygon. H3's geographic containment is approximate — cellToBoundary polygons don't perfectly contain all index-assigned points. Always assign places via latLngToCell, never via ST_Intersects on the boundary.
  4. Validate your frontier. The overlap check (0 pairs across all thresholds) and total count match confirm the view is correct. Run them after any schema change.

Tested with PostgreSQL 17, PostGIS 3.5, and h3-pg 4.x on the Overture Maps dataset (72.8M places). Full benchmark code and queries: github.com/nikmarch/overturemaps-pg