A best practice for collecting and publishing data in Google Cloud Platform in 3 steps
by Dico Klaassen on Feb 17, 2021 11:23:38 AM
Dico Klaassen, Scrum Master and Software Developer
Introduction
There are several ways to design a cloud-based data analytics solution. In this article, I share a best practice that we recently implemented for a broadcasting and media organisation. The company wanted to benchmark its current architecture against the solution that Crystalloids proposes as a best practice, with its business and requirements in mind. The approach can serve as the foundation of your own data solution, and it requires only Google Cloud components.
The goal
The organisation was already loading a number of social media data sources into a Datahub in GCP. The Datahub gave publishers insight into the customer reach of their shows and other formats, and it also supported growth hacking. They had a working solution in place but wanted to take their architecture to the next level, so they challenged Crystalloids to design and build a future-proof solution.
They were looking for a solution that is:
- Easy to use, by both developers and analysts.
- Easy to monitor: issues should be easy to detect, before the data users even notice them.
- Easy to maintain: new information can be added and existing information changed while the data remains available to its users.
- Easy to expand: a new data source can be added by applying the same solution to it.
- Easy to reload: when a step fails, it should be easy to redo that step individually, without any dependency on the rest of the process.
- Easy to secure: groups of people should only have access to the data that has been defined for them to use.
The solution
Instead of loading data from a source into BigQuery and building reports directly on those BigQuery tables, the best practice is to divide the process into three steps:
1. Load
This step only takes data from the source and writes it to BigQuery in its raw form. No transformations are applied, so you can be sure that you are looking at the same data that was received from the source. It can be useful to write the data not only to BigQuery but also to a file in Google Cloud Storage. For example, data received as JSON can then be stored exactly as it arrived.
2. Normalise
This step takes the raw data and applies the transformations that are needed to make the data usable. Make sure that:
- all data is deduplicated: no transformed table contains two identical rows,
- all data is current: only the most recent version of each record is present,
- all data is stored at the lowest granularity: raw data is not aggregated, only its most detailed level is kept.
3. Publish
Once the data is normalised and all entities are stored in tables, you are ready to start gathering the data that you want to use in your reports or for specific analyses. This should happen in a separate publish step. Here you also have the possibility to combine multiple entities and/or sources.
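The three steps can be sketched in a few lines of plain Python. This is only an in-memory illustration of the idea, not the implementation we built: in the real solution the rows live in BigQuery tables and the transformations would typically be queries. The function names, the `post_id`, `show_id` and `views` fields, and the sample data are all invented for this example.

```python
import json

def load(payloads, received_at):
    """Load: keep every payload exactly as received; add only ingestion metadata."""
    return [{"received_at": received_at, "payload": p} for p in payloads]

def normalise(raw_rows, key_field):
    """Normalise: parse the raw payloads and keep one row per key, the newest."""
    latest = {}
    for row in raw_rows:
        record = json.loads(row["payload"])
        key = record[key_field]
        seen = latest.get(key)
        # ISO dates in the same format compare correctly as strings.
        if seen is None or row["received_at"] >= seen["received_at"]:
            latest[key] = {**record, "received_at": row["received_at"]}
    return sorted(latest.values(), key=lambda r: r[key_field])

def publish(posts, shows):
    """Publish: combine two normalised entities into a report-ready table."""
    titles = {show["show_id"]: show["title"] for show in shows}
    totals = {}
    for post in posts:
        totals[post["show_id"]] = totals.get(post["show_id"], 0) + post["views"]
    return [
        {"show_id": show_id, "title": titles.get(show_id), "total_views": views}
        for show_id, views in sorted(totals.items())
    ]

# Two daily loads: p1 arrives twice on day two, with an updated view count.
raw = load(['{"post_id": "p1", "show_id": "s1", "views": 100}',
            '{"post_id": "p2", "show_id": "s1", "views": 50}'], "2021-02-16")
raw += load(['{"post_id": "p1", "show_id": "s1", "views": 120}',
             '{"post_id": "p1", "show_id": "s1", "views": 120}'], "2021-02-17")

posts = normalise(raw, "post_id")
report = publish(posts, [{"show_id": "s1", "title": "Evening News"}])
print(report)
```

Note that aggregation only happens in `publish`. The raw layer still holds all four payloads untouched, so the normalise and publish steps can be rerun at any time without going back to the source.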
To orchestrate these steps and make sure that each step is only executed once the previous one has finished, I recommend using Google Cloud Workflows. Within the workflow, each step of the solution is handled by a Cloud Function.
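To make the ordering guarantee concrete, here is a minimal sketch of the control flow in plain Python. It is not Workflows syntax (a real workflow is defined in YAML or JSON and calls the Cloud Functions over HTTP); `run_pipeline` and its `start_at` parameter are hypothetical names used only to show that a step runs after its predecessor succeeds and that a run can be resumed from the step that failed.

```python
def run_pipeline(steps, start_at=None):
    """Run (name, callable) steps in order and stop at the first failure.

    start_at names the step to resume from, so a failed step can be
    redone without repeating the steps that already succeeded.
    Returns a list of (name, status) tuples.
    """
    names = [name for name, _ in steps]
    if start_at is not None and start_at not in names:
        raise ValueError(f"unknown step: {start_at}")
    start_index = names.index(start_at) if start_at is not None else 0

    results = []
    for name, step in steps[start_index:]:
        try:
            step()
        except Exception as exc:
            # Later steps depend on this one, so they must not run.
            results.append((name, f"failed: {exc}"))
            break
        results.append((name, "ok"))
    return results
```

For example, `run_pipeline(steps, start_at="normalise")` skips the load step and reruns normalise and publish only.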
In this case, the data had to be loaded from the source on a daily basis. The first step is therefore triggered by Cloud Scheduler, which can start the process every day at the same time.
With this setup, monitoring can be applied to each individual step. Operational engineers can act on the resulting signals to help manage the production lifecycle.
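One way to make each step observable is to have it emit a structured log line when it finishes. The sketch below assumes that the step runs in a Cloud Function, where JSON lines written to standard output are picked up by Cloud Logging and the `severity` and `message` fields are treated specially. The helper names and the extra fields (`step`, `status`, `duration_s`, `rows`) are my own choice for this example, not part of any Google API.

```python
import json
import time

def build_log_entry(step, status, duration_s, rows=None):
    """Build one structured log entry describing the outcome of a step."""
    entry = {
        "severity": "INFO" if status == "ok" else "ERROR",
        "message": f"step {step} finished with status {status}",
        "step": step,
        "status": status,
        "duration_s": round(duration_s, 3),
    }
    if rows is not None:
        entry["rows"] = rows
    return entry

def monitored(step, func):
    """Run func (which returns a row count), log the outcome, re-raise on failure."""
    started = time.monotonic()
    try:
        rows = func()
    except Exception:
        print(json.dumps(build_log_entry(step, "failed", time.monotonic() - started)))
        raise
    print(json.dumps(build_log_entry(step, "ok", time.monotonic() - started, rows)))
    return rows
```

An alert on entries with `severity` set to `ERROR`, filtered per `step`, would then flag a failed load before the data users notice that a report is stale.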
The solution has a pay-per-use pricing model, so you only pay for what you actually run. And since the process is serverless, there are no servers to provision or manage.
If you would like to learn more about this use case, or if you need assistance with your data product, please contact us.
ABOUT CRYSTALLOIDS
Crystalloids helps companies improve their customer experiences and build marketing technology. Founded in 2006 in the Netherlands, Crystalloids builds crystal-clear solutions that turn customer data into information and knowledge into wisdom. As a leading Google Cloud Partner, Crystalloids combines experience in software development, data science, and marketing, making it a one-of-a-kind IT company. Using the Agile approach, Crystalloids ensures that use cases show immediate value to its clients and frees up their time to focus on decision making rather than programming.