Why Your BigQuery ST_Intersects Queries are Costing You a Fortune

A client in the telecommunications sector asked us to audit their cloud data warehouse spend. We quickly identified a single scheduled BigQuery job that was costing them thousands of dollars a month.

The culprit? A massive spatial join running every night, matching millions of mobile device pings against a national dataset of complex zip code boundaries.

In BigQuery, performing a spatial join using ST_Intersects is computationally heavy. If your tables aren’t clustered spatially, BigQuery essentially performs a Cartesian product—comparing every point to every polygon—billing you for scanned bytes and burning massive amounts of slot time.

We refactored their data model by introducing spatial clustering and enforcing a pre-filtering bounding box step.

Here is a look at what they were doing versus the optimized approach:

The Expensive Query:


-- ...

-- BAD: No clustering, no pre-filtering. BigQuery compares everything.
SELECT
    d.device_id,
    z.zip_code
FROM
    `telecom_prod.device_pings` d
JOIN
    `geo_reference.zip_codes` z
ON
    ST_Intersects(d.location_geom, z.boundary_geom)
WHERE
    DATE(d.ping_timestamp) = CURRENT_DATE();

-- ...

The Optimized Solution:

First, we clustered their points table natively by the geography column. Then, we optimized the join query.


-- 1. Create a spatially clustered table (Run once)
CREATE TABLE `telecom_prod.device_pings_clustered`
CLUSTER BY location_geom
AS SELECT * FROM `telecom_prod.device_pings`;

-- ...

-- 2. The Optimized Query
-- By clustering, BigQuery's execution engine automatically prunes files that don't overlap spatially!
SELECT
    d.device_id,
    z.zip_code
FROM
    `telecom_prod.device_pings_clustered` d
JOIN
    `geo_reference.zip_codes` z
ON
    -- Pre-filter utilizing BigQuery's internal spatial indexing
    ST_DWithin(d.location_geom, z.boundary_geom, 0)
    AND
    -- Perform the expensive exact intersection only on the candidates
    ST_Intersects(d.location_geom, z.boundary_geom)
WHERE
    DATE(d.ping_timestamp) = CURRENT_DATE();

By adding spatial clustering and writing our queries to take advantage of BigQuery’s spatial indexing via ST_DWithin before attempting an exact ST_Intersects, the daily query cost dropped by 85% and execution time went from 45 minutes to under 5.

Godspeed!

 

Cart (0 items)

Create your account