BigQuery Performance Optimization Tips


In some or the other way, a performance optimized query is the one which takes less resource, less time, less costs and has fewer failures. Based on my learnings and experience, I have outlined the following points for a better performing query in BigQuery.

Four key areas to avoid Performance Pitfalls

  • Input/Output
    • How many bytes did you read? Write to storage?
    • Do no SELECT *, use only the columns you need
    • Denormalize your schemas and take advantage of nested and repeated fields
    • Use granular suffixes in your table wildcards for more specificity
  • Communication between slots (shuffle)
    • How many bytes did you pass to the next stage?
    • External direct data connections can never be cached
    • Live edits to underlying external sources (e.g. spreadsheets) could create race conditions
    • Native BigQuery tables have intelligence built-in like automatic predicate pushdown
    • Pre-filter your data before doing JOINs
    • Many shuffle stages can indicate partition issues (skew)
    • Do not use WITH clauses in place of materializing results
      • Commonly filtering and transforming the same results? Store them into a permanent table
      • WITH clause queries are not materialized and are re-queried if referenced more than once
    • Be careful using GROUP BY across many distinct values
  • CPU work
    • User-defined functions (UDF), functions
      • Reduce Javascript UDFs to reduce computational load
        • Javascript UDFs require BigQuery to launch a Java subprocess to run
      • Use native SQL functions whenever possible
      • Concurrent rate limits
        • For UDF queries – 6
        • For non-UDF queries – 100
  • SQL syntax
    • Is there a more efficient way to write your query?
    • Understand your data model before applying JOINs and UNIONs
      • Know your join conditions and if they’re unique – no accidental cross joins
      • Filter wildcard UNIONs with _TABLE_SUFFIX filter
      • Do not use self-joins (consider window functions instead)
    • Push intensive operations to the end of the query
      • Large sorts should be the last operation in your query
      • If you need to sort early, filter or limit your data before sorting
    • Large GROUP BY means many forced shuffles

Prevent Hotspots (Data skew) in your Data

  • Filter your dataset as early as possible (this avoids overloading workers on JOINs)
  • Use the Query Explanation map and compare the Max vs the Avg times to highlight skew
  • BigQuery will automatically attempt to reshuffle workers that are overloaded with data

Diagnose performance issues with the Query Plan and Timeline

  • Query Plan and Timeline are part of Execution details
  • Understand the amount of read, compute and write your query triggers
  • Date Partitioning a BigTable (or a big table) based on a specified day or a Date column, prior to running BigQuery is having a best performance
  • Use monitoring metrics to monitor resource utilization, query performance, etc.
  • Turn off the use of cache when you are debugging your query

  • March 9, 2022