In some or the other way, a performance optimized query is the one which takes less resource, less time, less costs and has fewer failures. Based on my learnings and experience, I have outlined the following points for a better performing query in BigQuery.
Four key areas to avoid Performance Pitfalls
- Input/Output
- How many bytes did you read? Write to storage?
- Do no SELECT *, use only the columns you need
- Denormalize your schemas and take advantage of nested and repeated fields
- Use granular suffixes in your table wildcards for more specificity
- Communication between slots (shuffle)
- How many bytes did you pass to the next stage?
- External direct data connections can never be cached
- Live edits to underlying external sources (e.g. spreadsheets) could create race conditions
- Native BigQuery tables have intelligence built-in like automatic predicate pushdown
- Pre-filter your data before doing JOINs
- Many shuffle stages can indicate partition issues (skew)
- Do not use WITH clauses in place of materializing results
- Commonly filtering and transforming the same results? Store them into a permanent table
- WITH clause queries are not materialized and are re-queried if referenced more than once
- Be careful using GROUP BY across many distinct values
- CPU work
- User-defined functions (UDF), functions
- Reduce Javascript UDFs to reduce computational load
- Javascript UDFs require BigQuery to launch a Java subprocess to run
- Use native SQL functions whenever possible
- Concurrent rate limits
- For UDF queries – 6
- For non-UDF queries – 100
- Reduce Javascript UDFs to reduce computational load
- User-defined functions (UDF), functions
- SQL syntax
- Is there a more efficient way to write your query?
- Understand your data model before applying JOINs and UNIONs
- Know your join conditions and if they’re unique – no accidental cross joins
- Filter wildcard UNIONs with _TABLE_SUFFIX filter
- Do not use self-joins (consider window functions instead)
- Push intensive operations to the end of the query
- Large sorts should be the last operation in your query
- If you need to sort early, filter or limit your data before sorting
- Large GROUP BY means many forced shuffles
Prevent Hotspots (Data skew) in your Data
- Filter your dataset as early as possible (this avoids overloading workers on JOINs)
- Use the Query Explanation map and compare the Max vs the Avg times to highlight skew
- BigQuery will automatically attempt to reshuffle workers that are overloaded with data
Diagnose performance issues with the Query Plan and Timeline
- Query Plan and Timeline are part of Execution details
- Understand the amount of read, compute and write your query triggers
- Date Partitioning a BigTable (or a big table) based on a specified day or a Date column, prior to running BigQuery is having a best performance
- Use monitoring metrics to monitor resource utilization, query performance, etc.
- Turn off the use of cache when you are debugging your query
Leave a Reply