When working with geospatial data, intersection operations are fundamental, frequently used to answer questions like “Which points fall within this boundary?” or “Do these two areas overlap?”
When we are not looking for high-precision, millimeter-level intersects, we can often leverage spatial indexing and approximation techniques to dramatically improve performance. The core concept is to reduce the number of expensive, full geometry comparisons by pre-filtering candidates with a simpler, faster method.
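To make the idea concrete, here is a minimal Python sketch of this filter-then-refine pattern (the function names and values are illustrative, not from any geospatial library): a cheap bounding-box check prunes most candidates before the expensive exact point-in-polygon test.

```python
# Two-phase "filter then refine": a cheap bounding-box check prunes most
# candidates before the expensive exact point-in-polygon test runs.
# All names and coordinates here are illustrative.

def bounding_box(ring):
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return min(xs), min(ys), max(xs), max(ys)

def in_box(pt, box):
    x, y = pt
    xmin, ymin, xmax, ymax = box
    return xmin <= x <= xmax and ymin <= y <= ymax

def in_polygon(pt, ring):
    # Exact (planar) ray-casting test: the expensive step we want to avoid
    # running on every point.
    x, y = pt
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        if (y1 > y) != (y2 > y) and x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
            inside = not inside
    return inside

def points_in_polygon(points, ring):
    box = bounding_box(ring)
    candidates = [p for p in points if in_box(p, box)]      # cheap pre-filter
    return [p for p in candidates if in_polygon(p, ring)]   # exact refine
```

The pre-filter is a few float comparisons per point, while the exact test walks every polygon edge; with jagged real-world boundaries the difference adds up fast.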
Now here is an interesting rabbit hole I fell into while doing a similar thing in BigQuery (Google’s SQL-based big data warehouse). And you know, if it’s BigQuery, the use case must be quite big too; yes, it was: intersecting terabytes of data.
What I wanted to do was get the number of points that lie inside a city polygon. BigQuery has its own geospatial analytics library (https://docs.cloud.google.com/bigquery/docs/geospatial-data)
that helps do these kinds of things using ST_ functions, for instance:
WITH Points AS (
  SELECT id, timestamp ts, duration, lat, lon, ST_GEOGPOINT(lon, lat) AS geo_point
  FROM `project-id.dataset_id.daily_stops`
  WHERE _PARTITIONTIME BETWEEN TIMESTAMP("2025-01-01") AND TIMESTAMP("2025-01-02")
),
TargetPolygons AS (
  SELECT city_name, geo AS polygon
  FROM `project-id.dataset_id.city_polygon`
)
SELECT DISTINCT id, ts, duration, lat, lon
FROM Points AS pt
INNER JOIN TargetPolygons AS pl
ON ST_INTERSECTS(pt.geo_point, pl.polygon)

BigQuery geography is based on S2 Geometry (https://s2geometry.io/), which is implemented on a 3D sphere instead of a 2D surface (it has more of a geodesic worldview than a planar Euclidean one!).
To make this computationally more efficient, the sphere is further projected onto a cube, which is hierarchically divided into smaller cells; each cell is assigned a unique integer, and the integers are ordered along a Hilbert curve to preserve local relationships. You can read about it here.
What that means is that when I use ST_ functions, I am essentially using an S2 geometry engine in the back.
You can read more about different grid systems in geospatial analytics here.
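As a rough illustration of the hierarchical-cell idea (not actual S2: S2 works on cube faces and orders cells along a Hilbert curve, while this sketch uses a flat unit square and a simpler Z-order curve), here is why integer cell IDs are so useful: a coarser cell's ID is a bit-prefix of every finer cell inside it, and nearby points end up with identical or near-identical IDs.

```python
# Simplified sketch of hierarchical cell IDs. Real S2 projects the sphere onto
# a cube and uses a Hilbert curve; this toy version quantizes a unit square and
# interleaves bits (a Z-order curve), which shows the same prefix property.

def cell_id(x, y, level):
    """Interleave the bits of quantized (x, y) in [0, 1) into one integer."""
    xi = int(x * (1 << level))
    yi = int(y * (1 << level))
    cid = 0
    for bit in reversed(range(level)):
        cid = (cid << 2) | (((xi >> bit) & 1) << 1) | ((yi >> bit) & 1)
    return cid

def parent(cid, level, parent_level):
    # A coarser cell (lower level) is a bit-prefix of the finer cells inside
    # it, so "point is inside cell" becomes an integer prefix/equality check.
    return cid >> (2 * (level - parent_level))
```

This is exactly what makes spatial joins hashable: membership in a covering cell reduces to comparing integers instead of comparing geometries.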
The goal was simple: take a table of daily stops (Points) and join it to a table of city boundaries (TargetPolygons).
Now, I went with intuition first: if we want to capture points that are near or inside the city, a bounding box seems computationally cheaper than a complex, jagged polygon.
This was pretty basic, but I fell into the simplicity trap. The intuitive SQL approach, I thought, was to use the BETWEEN operator on the bounding-box limits instead of a polygon intersect:
WITH Points AS (
  SELECT id, timestamp ts, duration, lat, lon, ST_GEOGPOINT(lon, lat) AS geo_point
  FROM `project-id.dataset_id.daily_stops`
  WHERE _PARTITIONTIME BETWEEN TIMESTAMP("2025-01-01") AND TIMESTAMP("2025-01-02")
),
TargetPolygons AS (
  SELECT city_name, geo AS polygon,
    ST_BOUNDINGBOX(geo) bounding_box
  FROM `project-id.dataset_id.city_polygon`
)
SELECT DISTINCT id, ts, duration, lat, lon
FROM Points AS pt
INNER JOIN TargetPolygons AS pl
-- ON ST_INTERSECTS(pt.geo_point, pl.polygon)
ON pt.lon BETWEEN pl.bounding_box.xmin AND pl.bounding_box.xmax
AND pt.lat BETWEEN pl.bounding_box.ymin AND pl.bounding_box.ymax

Now, if you work with big joins, you know a Cartesian condition is probably not a good idea, and it wasn’t:
| Method | Mechanism | Complexity | Slot Time (this workload) |
| --- | --- | --- | --- |
| ST_INTERSECTS | S2 cell index lookup (hash join) | $O(N)$ (roughly) | 213k ms |
| BETWEEN | Cartesian / nested loop | $O(N \times M)$ | 17.8M ms |
BigQuery projects geospatial data onto a sphere using the S2 Geometry library. When you use native spatial predicates like ST_INTERSECTS, BigQuery translates the geometries into S2 cell IDs (integers). It then performs an Equality Join (Hash Join) on those integers, which is incredibly fast ($O(N)$ complexity).
When you use BETWEEN, you are forcing an Inequality Join ($a > b$). BigQuery cannot use the S2 spatial index for raw floats. Instead, it falls back to a Cross Join (Cartesian Product), mathematically comparing every single point against every single city’s limits ($O(N \times M)$ complexity).
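A toy Python sketch of the difference (the cell IDs and boxes are made-up values, not real S2 output): the equality join needs only one hash probe per point, while the inequality join has no choice but to compare every point against every box.

```python
# Equality predicates on precomputed cell IDs can be answered with a hash
# lookup per point (roughly O(N)); inequality predicates on raw floats force
# comparing every point against every box (O(N * M)). Toy data throughout.
from collections import defaultdict

points = [("p1", 5), ("p2", 9), ("p3", 5)]   # (point id, covering cell id)
cities = [("A", 5), ("B", 9)]                # (city name, covering cell id)

# Hash join: build the index once, then probe once per point.
index = defaultdict(list)
for name, cid in cities:
    index[cid].append(name)
hash_matches = [(pid, name) for pid, cid in points for name in index[cid]]

# Nested-loop join: every point compared against every city's float limits.
pts = [("p1", 2.0, 2.0), ("p2", 8.0, 8.0), ("p3", 2.5, 2.5)]
boxes = [("A", 0.0, 0.0, 4.0, 4.0), ("B", 6.0, 6.0, 9.0, 9.0)]
loop_matches = [
    (pid, name)
    for pid, x, y in pts
    for name, xmin, ymin, xmax, ymax in boxes
    if xmin <= x <= xmax and ymin <= y <= ymax
]
```

Both strategies return the same matches here; the difference is that the nested loop's work grows with points times cities, which is exactly what blows up at terabyte scale.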
But at the time, I couldn’t understand why a bounding box would take longer than a jagged polygon, not realizing I had switched from a hash join to a Cartesian join!
The manual bounding-box query was ~83x more expensive because it bypassed BigQuery’s spatial indexing engine entirely.
With that realized, I tried BigQuery’s built-in bounding-box intersect, ST_INTERSECTSBOX:
WITH Points AS (
  SELECT id, timestamp ts, duration, lat, lon, ST_GEOGPOINT(lon, lat) AS geo_point
  FROM `project-id.dataset_id.daily_stops`
  WHERE _PARTITIONTIME BETWEEN TIMESTAMP("2025-01-01") AND TIMESTAMP("2025-01-02")
),
TargetPolygons AS (
  SELECT city_name, geo AS polygon,
    ST_BOUNDINGBOX(geo) bounding_box
  FROM `project-id.dataset_id.city_polygon`
)
SELECT DISTINCT id, ts, duration, lat, lon
FROM Points AS pt
INNER JOIN TargetPolygons AS pl
-- ON ST_INTERSECTS(pt.geo_point, pl.polygon)
-- ON pt.lon BETWEEN pl.bounding_box.xmin AND pl.bounding_box.xmax
-- AND pt.lat BETWEEN pl.bounding_box.ymin AND pl.bounding_box.ymax
ON ST_INTERSECTSBOX(pt.geo_point,
  pl.bounding_box.xmin, pl.bounding_box.ymin,
  pl.bounding_box.xmax, pl.bounding_box.ymax)

Surprisingly, this was still taking too long.
BigQuery sees a GEOGRAPHY object on one side (the point) and raw float columns on the other (xmin, ymin, etc.). Because it cannot easily build a spatial grid purely on floats, it once again falls back to a Cross Join!
By now it was established that whatever ST_INTERSECTS was doing was undeniably the fastest. Still, there had to be a way to accelerate it further using a bounding box, because all of my experience said that a box has to be faster than such a jagged geospatial polygon!
So, the next idea was: what if the bounding box were exactly the kind of polygon that the ST_INTERSECTS function is comfortable with?
WITH Points AS (
  SELECT id, timestamp ts, duration, lat, lon, ST_GEOGPOINT(lon, lat) AS geo_point
  FROM `project-id.dataset_id.daily_stops`
  WHERE _PARTITIONTIME BETWEEN TIMESTAMP("2025-01-01") AND TIMESTAMP("2025-01-02")
),
TargetPolygons AS (
  SELECT city_name, geo AS polygon,
    ST_BOUNDINGBOX(geo) bounding_box
  FROM `project-id.dataset_id.city_polygon`
),
TargetEnvelopes AS (
  SELECT
    city_name,
    ST_GEOGFROMTEXT(FORMAT('POLYGON((%f %f, %f %f, %f %f, %f %f, %f %f))',
      bounding_box.xmin, bounding_box.ymin,
      bounding_box.xmax, bounding_box.ymin,
      bounding_box.xmax, bounding_box.ymax,
      bounding_box.xmin, bounding_box.ymax,
      bounding_box.xmin, bounding_box.ymin
    )) AS polygon
  FROM TargetPolygons
)
SELECT DISTINCT id, ts, duration, lat, lon
FROM Points AS pt
INNER JOIN TargetEnvelopes AS pl
ON ST_INTERSECTS(pt.geo_point, pl.polygon)

| Metric | Failed Manual Box (BETWEEN) | New Constructed Box (ST_GEOGFROMTEXT) | Original Exact Intersect |
| --- | --- | --- | --- |
| Duration | 1 min 5 sec | 5 sec | 4 sec |
| Compute (Slot ms) | 17,856,403 | 327,200 | 213,050 |
And voila! This was it; at least it was comparable!
The reason it was still slightly higher is that the query was creating the bounding-box polygon at runtime. Saving TargetEnvelopes ahead of time removed that overhead and helped process 84 GB of data in just 2 seconds! (This was one of the test cases, of course; the actual intersect we had to do was at a much bigger scale, and geographic accuracy wasn’t an important factor, since we wanted points in and around the area, so this worked like a charm.) It doesn’t sound like much, but it was confusing, and eventually insightful, to see how the BigQuery geospatial engine actually works under the hood. This rabbit hole took about an hour, so I don’t feel too bad about missing something as simple as a Cartesian join messing up my initial optimization!
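As a closing aside, the envelope construction in the TargetEnvelopes CTE can be sketched in plain Python, assuming only a bounding box as input: build a closed WKT ring (the first vertex repeated at the end), which is the shape of text that ST_GEOGFROMTEXT parses.

```python
# Plain-Python version of what the TargetEnvelopes CTE does: turn a bounding
# box into a closed WKT polygon ring, walking the four corners and repeating
# the first vertex at the end to close the ring.

def envelope_wkt(xmin, ymin, xmax, ymax):
    ring = [
        (xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax), (xmin, ymin),
    ]
    coords = ", ".join(f"{x:f} {y:f}" for x, y in ring)
    return f"POLYGON(({coords}))"
```

Precomputing these strings (or the parsed geographies) once per city is exactly the "save TargetEnvelopes ahead of time" step, so the per-query work is only the indexed intersect.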
