close

Jack Ma once said: "In the next twenty or thirty years, the most expensive and valuable thing will not be oil or houses, but data." We often hear about the applications of big data helping companies target customer needs and identify potential customers. However, even with vast amounts of data, its value cannot be highlighted without effective data processing. To address this issue, companies can utilize the PaaS service - BigQuery for interactive analysis of large datasets. According to the official introduction: "BigQuery is a highly scalable and cost-effective serverless multi-cloud data warehouse system designed to enhance enterprise flexibility." In simple terms, BigQuery is a cloud-based enterprise data warehouse.

Enterprise Data Warehouse (EDW) is a data warehouse that many companies use to centrally manage all data within the enterprise. EDW acts as a unified and centralized big data warehouse, storing enterprise data from different systems (such as ERP, CRM) together.

 

To better understand BigQuery, it is necessary to first understand the architecture of a data warehouse (Data Warehouse), which mainly consists of three parts: data source (Raw data layer), warehouse (Warehouse), and the interface presented through analytical tools (Analytical tools).

In-depth Exposure! GCP Architecture in Action! - Big Data Analysis with BigQuery

Firstly, the diagram illustrates a layer architecture of a data warehouse. The database directly connects to the user's analytical interface. This architecture faces the problem of being unable to demonstrate flexible analytical performance. For example, if there is 500GB of data in the database, directly querying it would take too long. Additionally, without precise input, it is also unable to filter out valid data. Therefore, the design of a single-layer architecture is rarely applied in large-scale data platforms.

In-depth Exposure! GCP Architecture in Action! - Big Data Analysis with BigQuery

Two-layer architecture adds a Data Mart Layer between EDW and the user interface. Here, the Data Mart can be seen as a subset of the data warehouse, focusing on specific business topics, usually containing only one subject domain, and is a smaller-scale database. This structure solves the problem of slow queries faced by the single-layer architecture. Because the Data Mart limits the data that users can query, on the one hand, each department can more conveniently access the required data, and on the other hand, such limitations also enhance the overall system security.

In-depth Exposure! GCP Architecture in Action! - Big Data Analysis with BigQuery

The design of the three-layer architecture further adds an Online Analytical Processing (OLAP) layer, which is a method for quickly solving multidimensional analysis problems. Our commonly seen relational databases belong to the two-dimensional form, while OLAP cubes consist of multidimensional data. The four types of OLAP operations include: Roll-up, Drill-down, Slice and dice, and Pivot (rotate), achieving fast query functions for multidimensional data. Therefore, OLAP is particularly optimized for reading and can also achieve data updates and deletions. Currently, almost all data warehouses provide OLAP services.

In-depth Exposure! GCP Architecture in Action! - Big Data Analysis with BigQuery

Source:​​https://cloud.google.com/architecture/bigquery-data-warehouse
 

BigQuery shares the same characteristics as EDW as a data warehouse. Because it is a fully managed service in the cloud, it is also considered a Cloud Data Warehouse (CDW).

 

BigQuery

The main features of BigQuery are fully managed and serverless. Both the software and hardware of the service are managed by GCP. The service is highly available and will not easily interrupt. Users do not need to worry about hardware resources, and machine resources can flexibly increase or decrease, with costs depending on actual usage.

In-depth Exposure! GCP Architecture in Action! - Big Data Analysis with BigQuery

資料來源:https://cloud.google.com/bigquery 
 

BigQuery supports various import sources: local, Google Cloud Storage, Cloud Storage, BigQuery Data Transfer Service (DTS), Data Fusion plugins, etc., to upload data to the data warehouse. The supported data formats include: Avro, CSV, JSON, ORC, or Parquet formats.

In-depth Exposure! GCP Architecture in Action! - Big Data Analysis with BigQuery
Source:https://cloud.google.com/blog/products/data-analytics/new-blog-series-bigquery-explained-overview

 

The overall architecture of BigQuery separates storage and computation, with each function independent. The decoupled structure reduces the frequency of system errors, allowing enterprises to choose storage and processing solutions flexibly according to business needs, and also control usage costs separately. When uploading data to BigQuery, the data is encrypted to ensure data security. The uploaded data is also stored in different data centers, each with different encryption keys. In terms of operation, using Standard SQL syntax, there is no learning curve for developers. BigQuery also supports Join operations between data tables for real-time analytics.

Additionally, BigQuery is fast in computation for two main reasons: one is the use of Columnar storage (data stored by column), so searches do not need to scan all data and can quickly find data. The other reason is the Tree architecture, which means the Query is distributed and processed by thousands of servers. Since its establishment in 2010, besides being a query service for big data warehouses, BigQuery also provides other diverse services, such as BigQuery ML: allowing users to build machine learning models, BigQuery GIS: supporting analysis techniques for geographic spatial data formats, etc.


Author

 

 

Assistant Engineer
許媁涵 Annie Hsu