The process and best practices of maintenance in Google Cloud
by Livia Cirnu on Jul 10, 2020 4:04:50 PM
Every development team has a long list of software features requested by the business units. Whether the goal is to build a new product or improve an existing application, developers are needed. But as the business grows, IT managers may find that the maintenance work resulting from the developers' output takes up more and more of their time. Once the customer base has grown and the business has seen its first spikes in usage, it is time to start thinking about professionalising maintenance.
Bringing someone in-house might seem like an easy option if your business can find the right talent to manage its operations. Another option is to outsource to a larger remote maintenance support team. In that case, your company has to deal with a ticketing system handled by various engineers in different time zones, and an in-house person will still be needed to manage things closely. The third option is to find a dedicated consultant who can provide exactly what your organisation needs, communicate closely with the internal team and help exactly when help is needed.
Although working in the cloud allows you to easily deploy applications without having to hire an operations team, there is always some part of maintenance that needs to be done and can’t be neglected to ensure business continuity. The more visibility and monitoring there is in place, the easier it is to grow, innovate and save costs. In this article, we will share what maintenance means in the context of Google Cloud and how we monitor cloud operations for our clients.
The maintenance scope
Daily maintenance includes monitoring processes such as data pipelines, data quality and billing. Quality assurance is part of these activities. More specifically, we check for data freshness (was the data delivered on time?), data completeness (did we receive all the files?) and data accuracy (do the files contain the information we expected, in the appropriate format?).
Traditionally there used to be a software development team that designed and built the solutions and an operational team that would maintain the environment. Today we rely on DevOps practices. With DevOps, both the Development and the Operations teams are combined to ensure that the software and data development cycles run smoothly.
Data pipelines
As part of our maintenance efforts, the first category to monitor is the data pipelines. A data pipeline is the infrastructure that delivers data and applies the necessary transformations along the way, so it is important to keep an eye on its health.
The basic health indicators are:
- Status of the job: success or failure?
- Latency: how long does it take for the job to complete?
Take a Dataflow job as an example. To cover the checks specified above, the maintenance team will take these actions:
- Configure alerts on Stackdriver to send emails when a dataflow fails;
- Collect metrics on time and the amount of data that is ingested by the process.
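The two health indicators can be turned into a small automated check. The sketch below is illustrative only: `JobRun`, `health_issues` and the 30-minute limit are hypothetical names and values, not part of the Dataflow or Stackdriver APIs, and in practice the run records would come from whatever metrics you collect.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class JobRun:
    """One finished run of a pipeline job (hypothetical record shape)."""
    name: str
    succeeded: bool
    duration: timedelta
    bytes_ingested: int


def health_issues(run: JobRun, max_duration: timedelta) -> list[str]:
    """Return human-readable problems for a run; an empty list means healthy."""
    issues = []
    # Indicator 1: status of the job.
    if not run.succeeded:
        issues.append(f"{run.name}: job failed")
    # Indicator 2: latency, compared against an agreed limit.
    if run.duration > max_duration:
        issues.append(f"{run.name}: took {run.duration}, limit is {max_duration}")
    return issues


# Example run with made-up numbers: successful, but slower than the limit.
run = JobRun("daily-import", succeeded=True,
             duration=timedelta(minutes=50), bytes_ingested=2_000_000)
for issue in health_issues(run, max_duration=timedelta(minutes=30)):
    print(issue)
```

Returning a list of issues rather than a single boolean keeps the reason for an alert visible in the notification.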
Data Quality
After checking whether the infrastructure works as expected, the next thing we monitor is the data quality. We mentioned earlier that data can be assessed through its freshness, completeness and accuracy (whether it satisfies the business requirements).
The starting point is to check the data the moment it arrives for the following criteria:
- Schema
- Format
- Size
- Arrival time
The maintenance team has created scripts that automate these checks. When one of the checks fails, an alert is triggered.
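As an illustration of what such a script might look like, here is a minimal sketch that runs the four arrival checks on a CSV delivery. It is not the team's actual script: the column names, the size threshold and the `check_delivery` function are assumptions made for the example.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical expectations for one delivery; adjust per data source.
EXPECTED_COLUMNS = ["customer_id", "order_date", "amount"]
MIN_SIZE_BYTES = 1


def check_delivery(content: bytes, arrived_at: datetime, deadline: datetime) -> list[str]:
    """Run the size, arrival time, format and schema checks; return the failures."""
    failures = []
    # Size: an empty file usually means the export went wrong.
    if len(content) < MIN_SIZE_BYTES:
        failures.append("size: file is smaller than expected")
    # Arrival time: data freshness.
    if arrived_at > deadline:
        failures.append("arrival time: file arrived after the deadline")
    # Format: the file must be readable as UTF-8 CSV.
    try:
        rows = list(csv.reader(io.StringIO(content.decode("utf-8"))))
    except (UnicodeDecodeError, csv.Error):
        failures.append("format: file is not readable as UTF-8 CSV")
        return failures
    # Schema: the header row must match the expected columns exactly.
    header = rows[0] if rows else []
    if header != EXPECTED_COLUMNS:
        failures.append(f"schema: unexpected header {header}")
    return failures


deadline = datetime(2020, 7, 10, 6, 0, tzinfo=timezone.utc)
arrived = datetime(2020, 7, 10, 5, 30, tzinfo=timezone.utc)
sample = b"customer_id,order_date,amount\n42,2020-07-09,19.99\n"
print(check_delivery(sample, arrived, deadline))  # prints []
```

Any non-empty result would be the trigger for an alert.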
Billing
One of the critical business maintenance operations is to monitor the billing cost and any significant changes in expenses that might have occurred. In case of any suspicious spikes in the billing report, we would investigate:
- which environment incurred the expenses (dev, test or prod)
- which Google Cloud products/resources caused them
- which processes from these products are involved
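The drill-down by environment, product and process can be sketched as a comparison of two billing periods. The rows, names and the 1.5x threshold below are invented for illustration; real figures would come from your billing export.

```python
from collections import defaultdict

# Hypothetical billing rows: (environment, product, process, cost).
previous = [
    ("prod", "BigQuery", "scheduled-query-a", 120.0),
    ("prod", "Dataflow", "daily-import", 80.0),
    ("dev", "BigQuery", "ad-hoc", 10.0),
]
current = [
    ("prod", "BigQuery", "scheduled-query-a", 125.0),
    ("prod", "Dataflow", "daily-import", 310.0),
    ("dev", "BigQuery", "ad-hoc", 12.0),
]


def totals(rows):
    """Sum cost per (environment, product, process)."""
    out = defaultdict(float)
    for env, product, process, cost in rows:
        out[(env, product, process)] += cost
    return out


def find_spikes(previous, current, ratio=1.5):
    """Return (key, old_cost, new_cost) where cost grew by more than `ratio`."""
    before, after = totals(previous), totals(current)
    spikes = []
    for key, new_cost in after.items():
        old_cost = before.get(key, 0.0)
        if new_cost > old_cost * ratio:
            spikes.append((key, old_cost, new_cost))
    # Largest absolute increase first, so the biggest offender is on top.
    return sorted(spikes, key=lambda s: s[2] - s[1], reverse=True)


for (env, product, process), old, new in find_spikes(previous, current):
    print(f"{env} / {product} / {process}: {old:.2f} -> {new:.2f}")
```

With these sample rows only the `daily-import` Dataflow process in `prod` is flagged, which answers all three questions at once.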
Once the source is identified, we contact the developers or product owners in charge of that particular project. Below you can see an example of what the report looks like and which Google Cloud products contributed to the overall costs.
The maintenance process
Reports and automated alerts
There are many visualisation applications that can be used for maintenance and monitoring, such as Google Data Studio, Looker and Tableau. To monitor the Google Cloud environments, we work with Google Data Studio, which allows us to create custom dashboards and easily detect errors or over-consumption. In addition, the reports can be shared with other team members and used to forecast resource usage and costs, and the tool is free of charge.
The monitoring process can be automated in Google Cloud by setting up alerts that check data accuracy or when tables were last updated. We set up the warning system and monitoring alerts via Stackdriver, Google Cloud's monitoring product (since renamed Cloud Monitoring). With Stackdriver we can, for instance, monitor data pipelines: when a Dataflow job is lagging behind or has failed, we are notified by an automated email.
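A check on when tables were last updated boils down to comparing timestamps against a maximum age. The sketch below assumes the timestamps have already been read from table metadata; the table names, the 24-hour limit and the `stale_tables` helper are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_tables(last_updated, max_age, now):
    """Return the names of tables whose last update is older than max_age."""
    return sorted(
        name for name, updated_at in last_updated.items()
        if now - updated_at > max_age
    )


# Hypothetical table names and timestamps, hard-coded for illustration.
now = datetime(2020, 7, 10, 9, 0, tzinfo=timezone.utc)
last_updated = {
    "sales.orders": datetime(2020, 7, 10, 6, 0, tzinfo=timezone.utc),
    "sales.customers": datetime(2020, 7, 8, 6, 0, tzinfo=timezone.utc),
}
print(stale_tables(last_updated, timedelta(hours=24), now))
# prints ['sales.customers']
```

Passing `now` in as a parameter, rather than reading the clock inside the function, keeps the check easy to test.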
Daily monitoring
For each project, we use Confluence to document the processes, what kind of checks we need to perform and how to act in case something fails. The daily tasks include checking the following:
- Dataflows
- Cron jobs
- Scheduled queries
- The status of open Jira tickets
- File delivery in certain buckets
Maintaining the production environment helps to identify performance deviations and availability issues that might affect users of the systems. We work closely with the developers and communicate with them about expected behaviour or bugs that have arisen. While the developers should focus on new releases, they are inevitably part of the monitoring team that ensures all the other processes run smoothly.
Monthly Service Review
A best practice is to take maintenance to a more strategic level by reviewing it periodically with senior IT management. As a result of this review, structural improvements can be made. These are the elements we usually cover in a service review:
- Action points/minutes from the previous review
- Maintenance activities and tools
- Incident management
- Service requests
- SLA (service level agreement) overview
- Maintenance meeting - Topics & Minutes
- Billing report
- Service Improvement Plan
This kind of review gives us a helicopter view of the maintenance work, the cooperation between the teams and the technology involved.
The Service Improvement Plan may cover topics such as those shown in the image below.
Summary
Maintenance is just as crucial for a seamless software experience as the application itself. Organisations can no longer afford to run the risk of system failures that would affect productivity and cost money. We can help you maintain a secure and reliable cloud environment without burdening your budget.
ABOUT CRYSTALLOIDS
Crystalloids helps companies improve their customer experiences and build marketing technology. Founded in 2006 in the Netherlands, Crystalloids builds crystal-clear solutions that turn customer data into information and knowledge into wisdom. As a leading Google Cloud Partner, Crystalloids combines experience in software development, data science and marketing, making it a one-of-a-kind IT company. Using the Agile approach, Crystalloids ensures that use cases show immediate value to its clients and frees up their time to focus on decision making rather than programming.