«Do not put humans in a robot’s job». Ten rules Google follows

At Lviv’s IT Arena, Google’s engineering lead Christof Leng shared ten things his team learned from running production infrastructure, focusing on the balance between reliability and velocity.

Christof Leng (left) speaks virtually to the IT Arena audience

By Vitalii Holich

During his six years at Google, Christof Leng, a German engineer with a PhD in computer science, has worked on a range of its services, including Cloud, Ads, and internal developer tooling. Speaking virtually at Lviv’s 8th annual IT Arena tech festival, he offered ten tips for balancing reliability and velocity.

The speaker emphasised the importance of site reliability engineering (SRE), in which operations work is treated as a software problem. Google started SRE in 2003 to close the gaps between operations and engineering. The approach combines two things that are crucial to a company’s success: reliability of the product, so that customers don’t run away, and velocity, so that the pace of development doesn’t slow down over time.

To explain this, Christof Leng shared ten things his company learned from balancing reliability and velocity.

Reliability can’t be taken for granted. It is essential to have a voice at the table where decisions are made that focuses on the reliability of the product and keeps the topic on the agenda, so that it’s not constantly postponed. In Google’s experience, it always helps to start early with the shift-left approach: testing the initial components of the product (code, deployment) in the earlier phases of development, as opposed to the «right shift» approach of testing a product at its final stage.
Cattle vs Pets. Whether the subject is processes, infrastructure, or software, scaling unique individual systems (pets) demands more investment and carries a bigger opportunity cost than scaling a range of similar ones (cattle). So if you operate numerous services, as Google does, their architectures and technologies should be similar, so that changes and scaling can be performed centrally without relying excessively on each product’s individual team.
Blamelessness. The main question is not why the engineer pushed the red button that caused trouble, but why that button is there and what we should do about it to avoid similar issues in the future.
Measure what matters: user experience. When deciding whether the system is healthy or broken, focus on the user experience. If your measurements show that the system is healthy but the user is unhappy, then you’re measuring the wrong thing.
The best way to learn how the system works is by watching its failure modes. Put it into production and see it burst into flames; without doing this, you cannot know the system very well. Failures are complex and easy to misinterpret from a distance, so you need to understand each aspect of a failure, as seemingly unrelated failures may have a deeper connection.
No heroes. Don’t try to single-handedly uphold SLAs (agreements with your clients/users) and SLOs (objectives your team must hit to meet those agreements), or to meet arbitrary deadlines. You can burn out, which may harm both the team and the system.
Automate yourself out of your job. At Google, engineers automate themselves out of their work every 18 months. If they have been doing the same manual tasks for longer than that, something must change. The rule is not to put humans in a robot’s job: if a task can easily be put into code, it’s not a task for employees, who can do something more useful and exciting.
Change is the number one reason for outages. Find the right reliability-velocity balance, and minimise the cost of each individual change. Always be able to stop a change in progress, and have a source of truth, like GitHub, where you keep code and configurations so that you know what the current state of production is. Have a copy of your production systems for testing and deployment purposes.

Outages are inevitable. It’s not great, it’s life. Be able to roll back your change quickly if it impacts the user negatively. Use written communication for incident management: it’s great for analysing the immediate incident response. Have a reliable incident management protocol, and always go to the system’s source code first.

No haunted graveyards. Avoid creating parts of your system that no one wants to enter for fear of changing anything. Anyone can build a complex system; building a simple one is harder. A simple system is easier to change, makes it easier to adopt new technology, and is more resilient.
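
The talk, as reported here, did not go into how the balance between reliability and velocity is quantified. One common SRE practice that ties together the points on user-facing measurement, SLOs, and stopping changes is an error budget. The Python sketch below is illustrative only: the class `SloWindow`, the function `may_ship_changes`, and all numbers are assumptions made for this example, not Google’s implementation.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """User-facing request counts over one measurement window (hypothetical)."""
    total_requests: int
    failed_requests: int
    slo_target: float  # e.g. 0.999 means 99.9% of requests should succeed

    def allowed_failures(self) -> float:
        # The error budget: how many failures the SLO tolerates in this window.
        return self.total_requests * (1.0 - self.slo_target)

    def budget_remaining(self) -> float:
        """Fraction of the error budget left; negative when overspent."""
        allowed = self.allowed_failures()
        if allowed == 0:
            # No budget at all: fine only while nothing has failed.
            return 1.0 if self.failed_requests == 0 else float("-inf")
        return 1.0 - self.failed_requests / allowed


def may_ship_changes(window: SloWindow) -> bool:
    # Assumed policy: keep releasing while budget remains, otherwise pause
    # risky changes and spend the time on reliability work.
    return window.budget_remaining() > 0.0


healthy = SloWindow(total_requests=1_000_000, failed_requests=250, slo_target=0.999)
overspent = SloWindow(total_requests=1_000_000, failed_requests=1_200, slo_target=0.999)
print(may_ship_changes(healthy), may_ship_changes(overspent))
```

With a 99.9% target and one million requests, roughly 1,000 failures are tolerated, so the first window still has about three quarters of its budget and the second has overspent it. Counting failures as the user sees them, rather than as internal health checks report them, is what keeps such a number honest.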

You can follow Christof Leng on LinkedIn and Twitter.

More info about IT Arena

From October 7 to 9, Lviv hosted the 8th annual IT Arena tech conference, with a mix of virtual panel sessions and in-person meet-ups in the city’s grand hotels, cafés, and the Pravda Beer Theatre. The event culminated in a Saturday gala at the 19th-century Lviv Opera House, with a start-up pitch competition, a champagne reception, and live music by a cello-violin-drum trio.

You can learn more at the IT Arena website.

At Lviv Now’s Wealth and Democracy page, you can read recaps of some of the panels and ideas discussed, including:

«Six Questions You Should Ask. AirBnB’s expert on how to make decisions.»

«Product-Builders Need to Talk with Customers Every Week»

«Invest in Smart Friends Who Have Integrity: Advice From Angel Investor»

«Six Positive to One Negative»: How to Lead Your Team.

To receive our weekly email digest of stories, please follow us on Substack.

Lviv Now is an English-language website for Lviv, produced by the Tvoe Misto («Your City») media hub, which also hosts regular problem-solving public forums to benefit the city and its people.