
ST_Intersects Queries are Costing You a FortuneA client in the telecommunications sector asked us to audit their cloud data warehouse spend. We quickly identified a single scheduled BigQuery job that was costing them thousands of dollars a month.
The culprit? A massive spatial join running every night, matching millions of mobile device pings against a national dataset of complex zip code boundaries.
In BigQuery, performing a spatial join using ST_Intersects is computationally heavy. If your tables aren’t clustered spatially, BigQuery essentially performs a Cartesian product—comparing every point to every polygon—billing you for scanned bytes and burning massive amounts of slot time.
We refactored their data model by introducing spatial clustering and enforcing a pre-filtering bounding box step.
Here is a look at what they were doing versus the optimized approach:
-- ...
-- BAD: No clustering, no pre-filtering. BigQuery compares everything.
SELECT
d.device_id,
z.zip_code
FROM
`telecom_prod.device_pings` d
JOIN
`geo_reference.zip_codes` z
ON
ST_Intersects(d.location_geom, z.boundary_geom)
WHERE
DATE(d.ping_timestamp) = CURRENT_DATE();
-- ...
First, we clustered their points table natively by the geography column. Then, we optimized the join query.
-- 1. Create a spatially clustered table (Run once)
CREATE TABLE `telecom_prod.device_pings_clustered`
CLUSTER BY location_geom
AS SELECT * FROM `telecom_prod.device_pings`;
-- ...
-- 2. The Optimized Query
-- By clustering, BigQuery's execution engine automatically prunes files that don't overlap spatially!
SELECT
d.device_id,
z.zip_code
FROM
`telecom_prod.device_pings_clustered` d
JOIN
`geo_reference.zip_codes` z
ON
-- Pre-filter utilizing BigQuery's internal spatial indexing
ST_DWithin(d.location_geom, z.boundary_geom, 0)
AND
-- Perform the expensive exact intersection only on the candidates
ST_Intersects(d.location_geom, z.boundary_geom)
WHERE
DATE(d.ping_timestamp) = CURRENT_DATE();
By adding spatial clustering and writing our queries to take advantage of BigQuery’s spatial indexing via ST_DWithin before attempting an exact ST_Intersects, the daily query cost dropped by 85% and execution time went from 45 minutes to under 5.
Godspeed!