Google Cloud Data Summit 2022

Google continues to be an industry leader in data and analytics, and, if you ask us, they outdid themselves once again at the Cloud Data Summit 2022.

Those participating shared insights on how you can advance your business using the next generation of data solutions. Using key innovations in AI, machine learning, analytics, and databases, you and your organization can solve complex challenges and make smarter decisions.

Although the digital event has ended, the sessions can still be viewed on demand here.

Our key takeaways from the event

BigLake

Organizations store, manage, and distribute more data than ever before, and the volume is still growing by the day. If you want to watch the session on demand, click here.

This new Google Cloud Platform (GCP) service allows customers to integrate data lakes and warehouses, manage access at the row and column level, and analyze that data using the GCP-native tool BigQuery or an open-source processing engine such as Spark (through the BigQuery Storage Read API).

This extends a decade of BigQuery innovations to data lakes, with support for:

  • multi-cloud storage
  • open formats
  • unified security and governance

BigLake is based on BigQuery (BQ) and allows you to examine files in familiar formats (CSV, JSON, Avro, Parquet, and ORC) that may be spread over several cloud storage systems (Google Cloud Storage, Amazon S3, Azure Blob Storage) from a centralized place. This enables a single source of data "truth" to be shared across numerous cloud platforms without duplicating or copying your data.

BigLake extends BigQuery to data lakes, so in the fullness of time, you can get the same functionality as BigQuery (even through the BigLake storage APIs for external engines). It does not necessarily help transfer data from Snowflake to BigQuery, though. 

To start using BigLake features in BigQuery, you must first enable the BigQuery Connection API.

  1. Once it is enabled, navigate to BigQuery and add a new external data source by clicking the "+ ADD DATA" button and selecting "External data source."
  2. As the connection type, choose "Cloud Resource (for BigLake tables)." If desired, give it a distinct connection ID, a friendly name, a data location, and a description. To connect to Amazon S3 or Azure Blob Storage, you must use the "via BigQuery Omni" connection type.

  3. Once the new external data source has been added, it should be visible in the left menu under "External connections."

  4. Go to the new connection and copy its service account ID. This service account is what the connection uses to read data stored in the cloud. In IAM & Admin, grant the service account the "Storage Object Viewer" role on your project. This enables the connection to access data in your Cloud Storage buckets.

  5. You can now create your first BigLake table, which will read data from your Cloud Storage bucket. To make a new table, go to an existing dataset (or make a new one) and select Create Table. Select "Google Cloud Storage" under "Create a table from." To read files from an external cloud provider, you can instead pick Amazon S3 or Azure Blob Storage here.

  6. After specifying the file's location, format, and desired destination, set the table type to "External table," which instructs BigQuery not to import the file but to read it from the external data source. Then, check the box next to "Use Cloud Resource connection to establish approved external table," which will open the Connection ID drop-down menu and allow you to pick the connection you created earlier.

  7. Complete the remaining options (Schema, Partition/Cluster settings, and advanced options) and click "CREATE TABLE."

  8. You should now be able to query the table. These tables are read-only (no DML statements) and can be combined with BigQuery's native tables in queries. You can also begin applying access controls to these tables to make full use of BigLake's capabilities.
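
The console steps above can also be approximated with SQL DDL submitted from Python. The following is a minimal sketch rather than the only way to do it. All project, dataset, table, connection, bucket, and user names, as well as the `region` column, are hypothetical placeholders. It assumes the `google-cloud-bigquery` client library is installed and that the connection's service account can already read the bucket.

```python
"""Sketch: build the DDL for a BigLake table and a row-level access policy.

Every identifier passed in below is a hypothetical placeholder.
"""


def build_biglake_table_ddl(table_id, connection_id, file_format, uris):
    """Return a CREATE EXTERNAL TABLE statement bound to a connection."""
    uri_list = ", ".join(f"'{uri}'" for uri in uris)
    return (
        f"CREATE EXTERNAL TABLE `{table_id}`\n"
        f"WITH CONNECTION `{connection_id}`\n"
        f"OPTIONS (format = '{file_format}', uris = [{uri_list}])"
    )


def build_row_access_policy_ddl(policy_name, table_id, grantees, filter_expr):
    """Return a CREATE ROW ACCESS POLICY statement for the given table."""
    grantee_list = ", ".join(f"'{grantee}'" for grantee in grantees)
    return (
        f"CREATE ROW ACCESS POLICY {policy_name}\n"
        f"ON `{table_id}`\n"
        f"GRANT TO ({grantee_list})\n"
        f"FILTER USING ({filter_expr})"
    )


def run_statements(statements):
    """Run each statement in BigQuery (needs credentials and a real project)."""
    # Imported lazily so the builders above work without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    for sql in statements:
        client.query(sql).result()  # wait for each job to finish


if __name__ == "__main__":
    table = "my-project.my_dataset.sales"  # hypothetical table
    table_ddl = build_biglake_table_ddl(
        table_id=table,
        connection_id="my-project.us.my-connection",  # hypothetical connection
        file_format="PARQUET",
        uris=["gs://my-bucket/sales/*.parquet"],  # hypothetical bucket
    )
    policy_ddl = build_row_access_policy_ddl(
        policy_name="us_only",
        table_id=table,
        grantees=["user:analyst@example.com"],  # hypothetical user
        filter_expr="region = 'US'",  # assumes the files have a region column
    )
    print(table_ddl)
    print(policy_ddl)
    # To execute against a real project: run_statements([table_ddl, policy_ddl])
```

Keeping the DDL builders separate from `run_statements` means the generated SQL can be reviewed, or pasted into the BigQuery console, before anything is executed.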

Further announcements

Google also announced updates to Analytics Hub, introduced Spark everywhere, and announced Dataflow Prime GA.

They further launched Dataplex GA.

With Dataplex's intelligent data fabric, companies can centrally manage, monitor, and administer their data across data lakes, data warehouses, and data marts with uniform rules, enabling access to reliable data and powering analytics at scale.

Security and governance are centralized, allowing for distributed ownership while maintaining global control.

Built-in data intelligence helps harmonize scattered data without requiring data migration.

It is an open platform that supports open source technologies and has a robust partner ecosystem.


ABOUT CRYSTALLOIDS

Crystalloids helps companies improve their customer experiences and build marketing technology. Founded in 2006 in the Netherlands, Crystalloids creates crystal-clear solutions that turn customer data into information and knowledge into wisdom. As a leading Google Cloud Partner, Crystalloids combines experience in software development, data science, and marketing, making it a one-of-a-kind IT company. Using the Agile approach, Crystalloids ensures that use cases show immediate value to its clients, so they can focus more on decision making and less on programming.