Introduction to BigQuery
What is BigQuery?
- BigQuery is a service provided by Google Cloud Platform, a suite of products and services that includes application hosting, cloud computing, database services, and more, all running on Google's scalable infrastructure
- BigQuery is Google's solution for companies that need a fully managed, cloud-based, interactive query service for massive datasets
- Google BigQuery is an enterprise data warehouse built using BigTable and Google Cloud Platform.
- It’s serverless and completely managed.
- BigQuery works well with data of all sizes, from a 100-row Excel spreadsheet to several petabytes of data.
- Most importantly, it can execute a complex query over that data within a few seconds.
- We should note before we proceed that BigQuery is not a transactional database. It takes around two seconds to run even a simple query like 'SELECT * FROM bigquery-public-data.object LIMIT 10' on a 100 KB table with 500 rows, so it shouldn't be thought of as an OLTP (Online Transaction Processing) database. BigQuery is for Big Data!
- BigQuery supports an SQL-like query syntax, which makes it user-friendly and beginner-friendly.
- It's accessible via its web UI, command-line tool, or client libraries (available in C#, Go, Java, Node.js, PHP, Python, and Ruby); a short Python sketch follows this list.
- You can also take advantage of its REST API and get your job done by sending a JSON request (also sketched after this list).
- BigQuery is a serverless, highly available, petabyte-scale service that allows you to execute complex SQL queries quickly.
- It lets you focus on analysis rather than handling infrastructure.
- The hardware is completely abstracted away and is not visible to you, not even as virtual machines.
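For a sense of what this looks like in practice, here is a minimal sketch of running a query with the Python client library, assuming the google-cloud-bigquery package is installed and application-default credentials are configured (the table is one of Google's public sample datasets):

```python
from google.cloud import bigquery

# Assumes application-default credentials and a default project are set up.
client = bigquery.Client()

# Standard SQL against a public sample table.
sql = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""

query_job = client.query(sql)    # starts the query job
for row in query_job.result():   # waits for the job to finish
    print(row.name, row.total)
```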
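The same idea works over the REST API by POSTing a JSON body to the jobs.query endpoint. The sketch below uses google-auth to obtain a token; treat the exact request shape as an illustration rather than a reference:

```python
import google.auth
from google.auth.transport.requests import AuthorizedSession

# Assumes application-default credentials with BigQuery access.
credentials, project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/bigquery"]
)
session = AuthorizedSession(credentials)

# jobs.query: run a query synchronously by sending a JSON request.
url = f"https://bigquery.googleapis.com/bigquery/v2/projects/{project_id}/queries"
body = {"query": "SELECT 1 AS one", "useLegacySql": False}

response = session.post(url, json=body)
response.raise_for_status()
print(response.json().get("rows"))
```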
Now, let's dive deeper to understand it better. Suppose you are a data scientist (or a startup that analyzes data) and you need to analyze terabytes of data. If you choose a tool like MySQL, the first step before even thinking about any query is to have infrastructure in place that can store data of this magnitude.
Designing this setup is itself a difficult task, because you have to figure out the RAM size, whether to use DC/OS or Kubernetes, and other factors. And if you have streaming data coming in, you will need to set up and maintain a Kafka cluster. With BigQuery, all you have to do is bulk upload your CSV/JSON files and you are done (a short loading sketch follows this paragraph); BigQuery handles all the backend for you. If you need streaming data ingestion, you can use a tool such as Fluentd. Another advantage is that you can connect Google Analytics with BigQuery seamlessly.
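As a rough sketch of that bulk upload with the Python client library, the following loads a local CSV into a table and lets BigQuery infer the schema (the project, dataset, table, and file names are made up for illustration):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Fully qualified table id: project.dataset.table (illustrative names).
table_id = "my-project.analytics.events"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the CSV header row
    autodetect=True,       # let BigQuery infer the schema
)

with open("events.csv", "rb") as source_file:
    load_job = client.load_table_from_file(source_file, table_id, job_config=job_config)

load_job.result()  # wait for the load job to complete
print(client.get_table(table_id).num_rows, "rows loaded")
```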
Why BigQuery?
- Service for interactive analysis of massive datasets (TBs)
- Query billions of rows: seconds to write, seconds to return
- Uses a SQL-style query syntax
- It's a service that can be accessed through an API
- Reliable and Secure
  - Replicated across multiple sites
  - Secured through Access Control Lists
- Scalable
  - Store hundreds of terabytes
  - Pay only for what you use (see the dry-run sketch after this list)
- Fast (really)
  - Run ad hoc queries on multi-terabyte data sets in seconds
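To make the pay-for-what-you-use point concrete, here is a small sketch of estimating a query's cost with a dry run in the Python client. The dry run validates the query and reports how many bytes it would scan, which is what on-demand pricing is based on, without actually reading any data (the public table is just an example):

```python
from google.cloud import bigquery

client = bigquery.Client()

# A dry run validates the query and reports the bytes it would scan.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

sql = """
    SELECT title, score
    FROM `bigquery-public-data.hacker_news.full`
    WHERE score > 100
"""
query_job = client.query(sql, job_config=job_config)

gb = query_job.total_bytes_processed / 1e9
print(f"This query would scan about {gb:.2f} GB")
```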