7 “Gotchas” for Data Engineers
New to Google BigQuery
By
Josh Levy
Everywhere you look these days, IT organizations are looking to the cloud to solve their data storage, movement, and analytics challenges…and with good reason! Cloud services from Amazon, Google, Microsoft and others have revolutionized how we think about data, from an IT and an end user perspective. They’ve vastly reduced the function of the corporate data center, which is rarely a core competency of the organization. They’ve democratized access to data, analytics, and processing power. They’ve disrupted how we think about the costs involved with data warehousing and analytics. Cloud is all these things and more, so it’s no wonder many organizations are rushing to push their big data operations into the cloud. Cost savings, performance improvement, and streamlined operations are all well and good, but what happens when you actually get down to doing the work? Chances are your data engineers will be new to whatever cloud technology your organization decides to utilize. This article details my own experience as a data engineer being exposed to Google BigQuery (GBQ) for the first time.
I’ve been a data engineer for many years and I’ve worked with most flavors of RDBMS and SQL in my career. A recent project had my coworkers and I redesigning an existing SQL Server database into GBQ, which inspired me to write this article as a resource for fellow data engineers new to GBQ. Here are some things that might take some getting used to, along with mitigation strategies where I’ve found them:
- Case Sensitivity
Unlike most RDBMS, BigQuery is case sensitive, not only for string comparison, but for object names as well. For string comparison, using The UPPER function on both sides (or one side of comparing to a literal) will eliminate the effect of case sensitivity. Curiously, although object names are case sensitive when referenced in a FROM clause, column names are not case sensitive. - Fully Qualified Object Names
All object names referenced in the FROM clause must be fully qualified to the dataset level, or the project level if they don’t exist in your current billing project. Not only that, but the entire fully qualified name must be enclosed in backwards single quotes (usually shares a key with the tilde). You will absolutely spend several frustrated minutes at some point trying to figure out why a query errors out until you realize one of those quotes doesn’t look like the other. - Standard SQL vs. Legacy SQL
BigQuery has two flavors of SQL, for some reason. Standard SQL is very much like ANSI SQL and is what you should use. However, the classic BigQuery Web UI (which I prefer for reasons I’ll get into shortly) defaults to Legacy SQL. There’s an option to use Standard SQL, and there’s also a Chrome extension called BigQuery Mate that will let you skip that step. - No Stored Procedures or Functions
This is a big one and there’s really no easy way around it. Google provides tools with which to do ETL and data transformation and preparation, but they don’t offer the flexibility of coding your own procedures. Functions can be created, but they are temporary, only existing for a single session, which makes them very limited as a programming tool. - Classic UI vs. New UI
The classic UI (the one that looks like Gmail) has one big advantage in that it allows you to add a project to which you don’t have explicit named access to the object browser. For some reason, the new UI doesn’t have that feature. It does however, default to Standard SQL, which can save you a few headaches. Both are fine, but neither is perfect. - No Query Plan Estimates
You have to wonder if whoever runs Google’s Cloud Services division was a doctor in a past life. What other industry would perform a service without explaining the cost first? That’s essentially what Google is doing here, albeit in a less predatory manner. Simply put, there is no way to determine if your complex query will run efficiently until you actually run it. Once you run the query, the execution plan is actually very helpful and actionable. - Limited DDL Capability
There is very little ability to alter a table once it exists. You can’t remove a column or change a datatype. You can’t rename fields or tables. Nearly any table or view altering operation involves a “CREATE TABLE AS SELECT” type of operation, meaning you technically will end up with a “new” table or view.
I would imagine that reading that list, it looks like I’m disparaging BigQuery quite a bit. That’s not the case at all. These are, for the most part, annoyances that new GBQ developers will encounter and overcome. There’s actually a lot to like, too. Here are some pleasant surprises that I encountered during my first GBQ project:
- Automatic Object Expiration
This one’s for the lazy DBA who always has 25 temp tables covered with dust in the corner of his/her database. In GBQ, you can (and should) create a dataset and specify a short object lifespan. This way, you can have a sandbox environment that cleans itself up and you don’t need to worry about incurring additional storage charges for data you no longer need. - Query History
GBQ logs all of the queries you run for billing purposes of course, but it also exposes them to you in an easily searchable list. This can be extremely handy if you ever lose track of a piece of code, which happens to the best of us. - Cached Query Results
Google charges to store data and in most cases to retrieve it as well. If you’re like me, you may find yourself running the same query periodically during development. GBQ will cache those results by default and your project will not be charged. There is an option to turn this feature off in the Web UI options if you want. - Autocomplete
Most query tools like TOAD or SQL Management Studio will autocomplete table and column names for you with varying degrees of success. I’ve found that the GBQ Web UI autocomplete - GUI Based Table Creation
Sometimes you just need to create a quick small table to store parameters or something like that. The GBQ Web UI allows a user with no SQL skills the ability to create a table and add columns of various data types.
The upshot is that GBQ takes a little getting used to and still has one or two glaring functionality gaps, mainly related to the inability to create stored procedures or functions. However, the potential benefits of cloud data storage and analytics far outweigh these considerations. I continue to learn new shortcuts and techniques every day, and as I overcome the various small challenges, I can see the bright “cloudy” future in front of me.
Josh Levy is a Manager in the Analytics practice of aspirent. He has spent the past 20 years working in various capacities and industries within the Business Intelligence space. [@AspirentConsult]