Posts

Showing posts from March, 2021

Interview Questions Accenture (Pune)

- Tell me about the things you did in your project related to GCP.
- Legacy SQL vs Standard SQL
- What details do we get from the Execution Details tab?
- How will you reduce query cost?
- Is there any impact of LIMIT on query cost?
- On which types of columns can we PARTITION tables?
- What does MERGE do in BigQuery?
- Suppose there are 2 columns in a table: Column1 of type TIMESTAMP, Column2 of type STRING that holds timestamp values. How will you calculate the average of the difference between these 2 columns?
- A JSON file is on GCS. Load the JSON data into a table using the bq command-line tool.
- There is a CSV file on GCS. The schema of the CSV and the table match, but there is an error while loading the CSV file into the table. What are the possible errors? (Expected answer: UTF-8 vs Unicode character issues)
- What is SAFE_CAST()?
- Suppose you have an array of 100 elements. How will you get the value of the 98th element? (Expecting array functions, OFFSET or ORDINAL)
- What is RANK() and DENSE_RANK()?
- What is FIRST_VALUE() and LAST_VALUE()?
- What is LEAD() and LAG()?
- Wha
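A minimal sketch of answers to a few of the SQL-function questions above, using hypothetical table and column names (mydataset.events, col_ts, col_str are illustrations, not from the original post):

-- Average difference between a TIMESTAMP column and a STRING column holding timestamps (hypothetical names).
SELECT
  AVG(TIMESTAMP_DIFF(col_ts, SAFE_CAST(col_str AS TIMESTAMP), SECOND)) AS avg_diff_seconds
FROM mydataset.events;

-- SAFE_CAST returns NULL instead of raising an error when the cast fails.
SELECT SAFE_CAST('not-a-number' AS INT64);  -- NULL

-- 98th element of an array: OFFSET is zero-based, ORDINAL is one-based.
SELECT
  arr[OFFSET(97)]  AS via_offset,
  arr[ORDINAL(98)] AS via_ordinal
FROM (SELECT GENERATE_ARRAY(1, 100) AS arr);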

BigQuery Execution Details

  How to use BigQuery execution plans to optimize query performance
BigQuery breaks down your SQL into a series of execution steps (stages) and creates a query execution tree. The query execution tree consists of a series of leaf nodes and the master node. The execution plan is an ideal tool for understanding how the query is turned into a series of work units, and it provides valuable insight into where the query is spending the most time.
Finding the execution plan for your query
To find the execution plan, execute the query in the WebUI, navigate to the preview panel (the area right below the query editor), and click on the fourth tab, "Execution details", to see the query plan.
To check the execution plan for older queries, find the query that you ran in the past using the "Query History" window. Once you find the query, click on it to expand, then click on "Open query in editor". Now you should see the "Execution d
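As a programmatic complement to the Query History window, a sketch like the one below lists your recent jobs along with bytes processed and slot time. It assumes the region-qualified INFORMATION_SCHEMA.JOBS_BY_USER view; adjust the region to match your dataset location.

SELECT
  job_id,
  creation_time,
  total_bytes_processed,
  total_slot_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY creation_time DESC
LIMIT 20;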

CLUSTER BY

  Within a partition, BigQuery uses the cluster keys to sort the data. You can specify up to four columns in the CLUSTER BY clause. It is recommended to use the columns most likely to appear in filter conditions as the clustering columns.
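A minimal DDL sketch showing clustering columns alongside a date partition; mydataset.sales and its columns are hypothetical names for illustration:

CREATE TABLE mydataset.sales
(
  sale_date   DATE,
  customer_id INT64,
  store_id    INT64,
  amount      NUMERIC
)
PARTITION BY sale_date
CLUSTER BY customer_id, store_id;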

PARTITION BY

  "Partition by" as the name specifies, splits the table into multiple partitions. You can specify only one column as part of the PARTITION by clause. Its recommended to use the columns that are most likely to be used as part of the filter conditions as the partition key. Partition key helps with data pruning, reducing the amount of data scanned by the queries thereby reducing the overall cost. As of this writing, Bigquery only supports DATE, Timestamp and Numeric columns to be used as a Partition by column

Accessing BigQuery

  Accessing BigQuery
- You can access BigQuery and run jobs from your web browser.
- Developers can use the bq command-line tool, a Python-based tool that can access BigQuery from the command line.
- Developers can also leverage the service API, a RESTful API to access BigQuery programmatically. It requires authorization via OAuth2, and Google provides client libraries for Python, JavaScript, PHP, etc.
- Integration is possible with third-party tools, such as visualization and statistical tools like Tableau, QlikView, R, etc.
- You can export data to a .csv or JSON file, or to Google Cloud Storage.
There are three main ways you interact with BigQuery: loading and exporting data, querying and viewing data, and managing data. To perform these interactions, you can use the Cloud Console, the bq command-line tool, or the BigQuery REST API and client libraries.
Loading and exporting data
In most cases, you load data into BigQuery storage. If you want to get the data back out of BigQuery, you can export the data. Alterna
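Exporting can also be done purely in SQL; a hedged sketch using the EXPORT DATA statement, with a hypothetical bucket path and the mydataset.sales table from earlier:

EXPORT DATA OPTIONS(
  uri = 'gs://my-bucket/exports/sales_*.csv',  -- hypothetical bucket; the * wildcard lets BigQuery shard the output
  format = 'CSV',
  overwrite = true,
  header = true
) AS
SELECT * FROM mydataset.sales;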

BigQuery Organisation

BigQuery is structured as a hierarchy with 4 levels:
- Projects: top-level containers in the Google Cloud Platform that store the data
- Datasets: within projects, datasets hold one or more tables of data
- Tables: within datasets, tables are row-column structures that hold the actual data
- Jobs: the tasks you perform on the data, such as running queries, loading data, and exporting data
Projects
- Projects are the top-level containers that store the data
- Within a project, you can configure settings, permissions, and other metadata that describe your applications
- Each project has a name, an ID, and a number that you’ll use as identifiers
- When billing is enabled, each project is associated with one billing account, but multiple projects can be billed to the same account
Datasets
- Datasets allow you to organize and control access to your tables
- All tables must belong to a dataset. You must create a dataset before loading data into BigQuery
- You can configure permissions at the organizatio
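A short sketch of how this hierarchy shows up in SQL, using hypothetical project, dataset, and table names:

-- A dataset (schema) inside a project, then a table inside that dataset.
CREATE SCHEMA `my-project.analytics`;  -- hypothetical project and dataset names

CREATE TABLE `my-project.analytics.page_views`
(
  view_date DATE,
  url       STRING,
  visits    INT64
);

-- Queries reference tables as project.dataset.table:
SELECT COUNT(*) FROM `my-project.analytics.page_views`;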

BigQuery Time Travel (Snapshot decorator)

  BigQuery Time Travel and Undelete using Snapshot decorator
What is Time Travel in BigQuery
With time travel, you can access the data in a table from any point (e.g. any particular second) within the last seven days, even if the data has been deleted. You can use time travel to query data that was updated or deleted, restore a table that was deleted, or restore a table that expired. Better yet, there is no additional cost to enable this feature, and there is no maintenance overhead either.
How to query a table using Time Travel in BigQuery
If you want to look at the data in a table as of a particular time, all you have to do is add the FOR SYSTEM_TIME AS OF suffix (also known as a snapshot decorator):
SELECT * FROM table FOR SYSTEM_TIME AS OF '2020-01-01 12:00:00-07:00';
You can also use a relative time as shown below:
SELECT * FROM `mydataset.mytable` FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
With time travel you can query up to 7 days in the past. H
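Building on the snippets above, one way to materialize a past snapshot into a new table is a CREATE TABLE ... AS SELECT with the snapshot decorator; a sketch, assuming mydataset.mytable still exists and the timestamp is within the seven-day time-travel window:

CREATE TABLE mydataset.mytable_restored AS
SELECT *
FROM `mydataset.mytable`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);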

BigQuery - Data Encryption and Security

With the cloud, data encryption is no longer a luxury but a necessity. BigQuery encrypts all your data at rest and encrypts all your data in transit. This tutorial will take a deep dive into BigQuery's encryption features.
Data Security Features in BigQuery
- Automatic data encryption by BigQuery using Google-managed keys
- All communication and data transfer between clients and the server is protected through TLS
- You can choose the geographical location where your data is stored, based on your cloud region
- Deployment inside a cloud platform VPC
- Granular permission control using IAM (Identity and Access Management)
What is End-to-End Encryption
End-to-end encryption (E2EE) means the data is encrypted from the moment it leaves the user until it is loaded, and vice versa. In BigQuery, this means that only the customer and the runtime components can read the data. No third party, including BigQuery's cloud computing platform or any ISP, can see the data in the clear.
1. Data encryption at rest (i.e. on-disk e

BigQuery Compression

  Google BigQuery Compression and Columnar Storage Overview
BigQuery is a columnar analytics database. Behind the scenes, all the data is compressed by default in BigQuery. Although the data is compressed, Google charges customers for the uncompressed data size (for both storage and data scanned).
How BigQuery stores the data internally
With a columnar format, any column-level analytical operations are going to be much faster. To understand the organization of data in BigQuery, we need to familiarize ourselves with the two concepts below:
Repetition level: the level of repetition, for repeated column types
Definition level: tells us whether the field has been assigned a value or is null
For example, consider the table below:
message Book {
  required string title,
  repeated string author,
  repeated group price {
    optional int64 discount,
    optional int64 usd,
    optional int64 eur,
  }
}
And consider we have three records in the table:
Book1: author: "AAA"
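A hedged sketch of how a schema like the Book message above might be declared as a BigQuery table with nested and repeated fields; the column names mirror the message, but this mapping is an illustration, not taken from the original post:

CREATE TABLE mydataset.book
(
  title  STRING NOT NULL,          -- required field
  author ARRAY<STRING>,            -- repeated field
  price  ARRAY<STRUCT<             -- repeated group of optional fields
    discount INT64,
    usd      INT64,
    eur      INT64
  >>
);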

BigQuery Caching

  What is BigQuery Caching?
When you execute a SQL query in BigQuery, the results are cached for 24 hours. If you happen to rerun the same query within those 24 hours, the results will be served from the cache (there is no charge for results served from the cache). Caveat to note: caching is offered at the user account level, which means the caching benefits are available only to you. If your teammate runs the same query, their query will not utilize your cache; it will scan the disk instead.
How does Caching work Internally?
Under the hood, whenever you execute a SQL query, once BigQuery completes the query processing and fetches the output result set, it automatically creates a temp table and stores the results in it. The next time you execute the same query, BigQuery serves the results directly from the temp table. (There is no storage cost for the temp tables, and they are automatically dropped after 24 hours.)
What happens to Cache results when the underlying data changes?
BigQuery Cache results a
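To see whether your recent queries were actually served from the cache, a sketch like the one below can help, again assuming the region-qualified INFORMATION_SCHEMA.JOBS_BY_USER view:

SELECT
  job_id,
  query,
  cache_hit,
  total_bytes_processed  -- typically 0 for cache hits
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
ORDER BY creation_time DESC;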

BigQuery Columnar Storage

Understanding Google BigQuery Columnar Storage Architecture
Traditional RDBMS systems use row-level storage: the entire row is stored together so it can be retrieved as a unit. A columnar database stores data by columns rather than by rows, which makes it suitable for analytical query processing, and thus for data warehouses.
BigQuery stores data in a proprietary columnar format called Capacitor, which has a number of benefits for data warehouse workloads. Each column in the table is stored in a separate file block, and all the columns are stored in a single Capacitor file, which is compressed and encrypted on disk. BigQuery dynamically uses query access patterns to determine the optimal number of physical shards.
Colossus - Google’s distributed file system
The data persistence layer is provided by Google’s distributed file system, Colossus, where data is automatically compressed, encrypted, replicated, and distributed. Colossus ensures durability using erasure encoding to store redundant c
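Because storage is columnar, only the columns a query references are read from disk; a small sketch against the hypothetical mydataset.sales table used earlier:

-- Reads every column block: the most bytes scanned.
SELECT * FROM mydataset.sales;

-- Reads only the two referenced column blocks, so far fewer bytes are scanned.
SELECT sale_date, amount FROM mydataset.sales;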

BigQuery Compute Engine (Slot)

  Understanding Google BigQuery Compute Slots
BigQuery uses compute slots to process your SQL query and return the results. In this tutorial we will go through BigQuery slots in detail.
What is a BigQuery slot?
A BigQuery slot is a virtual CPU used by BigQuery to execute SQL queries. BigQuery automatically calculates how many slots are required by each query, depending on query size and complexity. There are two pricing models in BigQuery, on-demand pricing and flat-rate pricing, and the slot allocation differs between them. Let's take a detailed look.
BigQuery slot allocation for on-demand customers
Customers on the on-demand pricing model (pay as you go) can get up to 2,000 slots for processing their queries. The slots are allocated on demand (only when there is a need) and automatically de-allocated once the job is done. Customers only pay for the active usage milliseconds. One caveat here is that the 2,000 slots are not guaranteed. If the compute resources are available in the shared pool, Google
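One common way to approximate how many slots a finished query used is to divide its total slot-milliseconds by its elapsed milliseconds; a sketch, again assuming the region-qualified INFORMATION_SCHEMA.JOBS_BY_USER view:

SELECT
  job_id,
  total_slot_ms,
  SAFE_DIVIDE(total_slot_ms,
              TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) AS approx_avg_slots
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
ORDER BY total_slot_ms DESC
LIMIT 10;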

Architecture of BigQuery

  Google BigQuery Architecture and Concepts
Google BigQuery has a unique cloud-native architecture that allows for fast queries at petabyte scale using the distributed processing power of Google’s infrastructure.
BigQuery Architecture
As the architecture diagram shows, the storage and compute layers are separated (decoupled), which allows them to scale independently on demand. The storage and compute layers are connected by a high-speed network. When a user submits a query, BigQuery spins up the compute on demand, reads the data from the distributed storage, and fulfills the query. The decoupling of the storage and compute layers allows the compute layer to scale out and support more queries and users. This model works for small and medium-sized business customers as well. BigQuery doesn't use standard GCS buckets to store the data; it uses Colossus, which is Google's proprietary storage system. Now that we know the high-level architecture, let's take a deep dive in

Features of BigQuery

  Features of BigQuery
Following are some of the useful features of BigQuery:
1. Fully Managed, Serverless Insight
GCP (Google Cloud Platform) leads the industry in letting you analyze data at the scale of the entire web, with familiar SQL, in a fully managed, serverless architecture where the backend infrastructure is handled entirely on your behalf. One of the wonderful features of Google's big data analytics products is that they scale automatically while you focus only on the business insight you want to uncover.
2. Fast Queries on Petabyte-scale Datasets
BigQuery is Cloud Platform's fully managed data warehouse that lets you cost-effectively query massive volumes of data at the speed you would expect from Google. Google does not charge a daily fee; you pay as you go. Google provides pricing benefits, along with the scalability and security of Google's best infrastructure, to power your business insights.
3. Unified Batch and Stream Proces