I’m a CTO by title, but a data engineer at heart.

It’s hard not to feel a twinge of envy toward my software engineering colleagues when I read (OK, briefly skim over) the service-level agreements of popular SaaS solutions and see a contractual guarantee of 99.9% (or higher!) availability.

My co-founder and I formally interviewed hundreds of data teams prior to launching our startup and know most are nowhere near that level of reliability for their internal operations.

It’s a good day when all the dashboards have refreshed and you haven’t heard from your friends in marketing that the “data looks funny.”

Severe consequences made SaaS reliability a priority

I take heart knowing it wasn’t always this way. In 2004, Google’s search engine was down for hours as a result of the MyDoom virus. Netflix was down on Christmas Eve in 2012 due to an AWS outage. In 2015, an internal DNS error took Apple’s app store ecosystem down for 12 hours.

You get the idea. Software engineers have had decades to hone their reliability best practices. They had no choice: the cost of downtime was too high.

Google got about 86 billion search queries in 2004. It now handles that many searches in a little over 10 days. In 2018, an hour of downtime during Prime Day may have cost Amazon up to $100 million in lost sales.

On the other hand, the consequences of data downtime (periods when data is partial, erroneous, missing, or otherwise inaccurate) have not been seen as severe. Until recently, that is.

Data downtime is more expensive than you think

Not only has data become more widely adopted across the enterprise, but it has also been unleashed from dashboards to feed machine learning models and land directly in operational systems.

Thanks to data warehouse technologies like Snowflake, data operations have been able to scale cost-effectively. In many cases now, such as with advertising or merchant platforms, data is a crucial part of the customer offering. Data IS the product.

At the same time, data engineers take longer to hire and cost more in today’s tight labor market. The Dice 2020 Tech Job Report named data engineer the fastest-growing job in technology, with 50% year-over-year growth in open positions, and the 2022 report puts the average salary at $117,295.

The product telemetry from our data observability platform, which has end-to-end access across hundreds of data stacks, shows about one data issue a year for every 15 tables in the data warehouse. For a hypothetical mid-sized company with 3,500 tables that takes roughly eight hours to identify and resolve each issue, that works out to roughly 1,866 hours of data downtime a year.
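To make the arithmetic behind that estimate explicit, here is a minimal back-of-the-envelope sketch; the table count, issue rate, and resolution time are the illustrative figures above, not universal constants:

```python
# Back-of-the-envelope estimate of annual data downtime,
# using the illustrative figures from the paragraph above.
tables = 3_500                        # tables in the warehouse
issues_per_table_per_year = 1 / 15    # ~1 data issue per 15 tables per year
hours_per_issue = 8                   # time to identify and resolve each issue

incidents_per_year = tables * issues_per_table_per_year
downtime_hours = incidents_per_year * hours_per_issue

print(f"~{int(incidents_per_year)} incidents/year, ~{int(downtime_hours)} hours of data downtime")
# -> ~233 incidents/year, ~1866 hours of data downtime
```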

So how can data engineers take a page out of software engineering’s handbook and start building data platforms that are as reliable as SaaS solutions? We need to mature and standardize emerging best practices across people, processes, and technology.

People

Google shaped the evolution of the site reliability engineer as a specialization and sub-field of software engineering. Their SRE handbook remains canon.

With this innovation, there was now clear ownership of reliability and accountability for downtime. What started as a role with the swagger of a fighter pilot, celebrated for the downtime dogfight, matured into a measured stride toward consistent, repeatable excellence.

The data engineering space is currently going through the same evolution. While many data teams still respond to incidents ad hoc and, according to Forrester, spend up to 40% of their working time tackling data quality issues, others such as DoorDash, Disney Streaming Services, and Equifax have started hiring data reliability engineers to take a more proactive approach.

If you are going to treat data like a product, you also need a data product manager. Having someone focused on longer-term needs outside of the daily grind is a crucial piece of the data reliability puzzle.

After all, data platforms are mostly internal tools and have no need to be first to market. They should be built for the long haul, because you only get one chance to make the favorable first impression with your users that is so crucial to their adoption.

Uber’s big data platform was built over the course of five years, constantly evolving with the needs of the business; Pinterest has gone through several iterations of its core data analytics product; and leading the pack, LinkedIn has been building and iterating on its data platform since 2008.

Process

The inconvenient truth is that there are quite a few areas where we data teams need to mature our reliability processes. Measuring data quality is one of those areas.

Most teams don’t have data on their time to detection and time to resolution for data incidents. Some track their number of incidents, but that’s only for the incidents they catch. Data drift, hazy ownership across data assets, opaque incident triaging processes, and other factors conspire to make data quality tracking difficult.

Data teams should nonetheless strive to establish connections with business stakeholders and codify data SLAs (after all, SaaS products have SLAs!) that include uptime and other data health metrics.
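As a minimal sketch of what measuring against those SLAs could look like, assuming a simple incident log with start, detection, and resolution timestamps (the field names and the example incident below are hypothetical, not from any particular tool):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Incident:
    # Hypothetical incident record; field names are illustrative only.
    started: datetime    # when the data first went bad
    detected: datetime   # when the team found out
    resolved: datetime   # when the data was healthy again

def sla_metrics(incidents: list[Incident], period: timedelta) -> dict:
    """Basic data SLA metrics: incident count, average TTD/TTR, and uptime.

    Assumes at least one incident occurred during the period.
    """
    ttd = [i.detected - i.started for i in incidents]
    ttr = [i.resolved - i.started for i in incidents]
    downtime = sum(ttr, timedelta())
    return {
        "incidents": len(incidents),
        "avg_time_to_detection": sum(ttd, timedelta()) / len(incidents),
        "avg_time_to_resolution": sum(ttr, timedelta()) / len(incidents),
        "uptime_pct": 100 * (1 - downtime / period),
    }

# Example: one incident that took 2 hours to detect and 6 more to resolve,
# measured over a 30-day period.
incident = Incident(
    started=datetime(2024, 1, 10, 9, 0),
    detected=datetime(2024, 1, 10, 11, 0),
    resolved=datetime(2024, 1, 10, 17, 0),
)
print(sla_metrics([incident], period=timedelta(days=30)))
```

Even a crude rollup like this gives stakeholders a shared, numeric definition of what “reliable data” means, which is exactly the point of an SLA.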

Stronger DevOps, or DataOps, processes can be implemented as well. Projects should be iterative, with a strong peer-review process in an active environment.

Governance and naming conventions should not be overlooked either; you would never see a microservice named after a person in software engineering, but it’s a very, very common practice for each engineer to have their own schema in the warehouse.
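One lightweight way to enforce that kind of convention is to lint schema names automatically. Here is a minimal sketch, assuming a hypothetical rule that production schemas are named after domains rather than people; the allowed prefixes and example names are made up:

```python
import re

# Hypothetical convention: production schemas must be named after a domain,
# e.g. "finance" or "marketing_attribution", never after an individual engineer.
ALLOWED_SCHEMA = re.compile(r"^(raw|staging|analytics|finance|marketing)(_[a-z0-9]+)*$")

def lint_schemas(schemas: list[str]) -> list[str]:
    """Return the schema names that violate the naming convention."""
    return [s for s in schemas if not ALLOWED_SCHEMA.match(s)]

# Flags the personal sandbox schema, passes the domain-named ones.
print(lint_schemas(["analytics", "finance_reporting", "johns_sandbox"]))
# -> ['johns_sandbox']
```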

Technology

The technology infrastructure surrounding SaaS reliability has matured considerably in recent years. The data reliability space has seen a renaissance as well. Solutions like Snowflake, Databricks, Fivetran, dbt, and others have quickly formed the foundation of a modern cloud data stack.

Not only is it regular practice to test code with tools like Jenkins, but those pipelines also run alongside observability platforms like Datadog or New Relic, which monitor cloud applications and immediately alert teams when systems experience issues that could impact performance.

The data observability and testing spaces have matured in parallel, so data engineers, like their software engineering brethren, can be sure they are the first to know when data goes bad or their system is down.
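As a rough sketch of what one of those checks might look like on the data side, assuming a placeholder orders table with an updated_at column (the names, thresholds, and connection are illustrative, not any specific product’s API):

```python
from datetime import datetime, timedelta
import sqlite3  # stand-in for your warehouse connection

def check_orders_table(conn: sqlite3.Connection) -> list[str]:
    """Return failure messages; an empty list means the checks passed."""
    failures = []

    # Freshness: has the orders table been updated within the last 24 hours?
    (latest,) = conn.execute("SELECT MAX(updated_at) FROM orders").fetchone()
    if latest is None or datetime.fromisoformat(latest) < datetime.now() - timedelta(hours=24):
        failures.append("orders is stale: no rows updated in the last 24 hours")

    # Volume: did today's load produce a plausible number of rows?
    (rows_today,) = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE updated_at >= date('now')"
    ).fetchone()
    if rows_today < 1_000:
        failures.append(f"orders volume looks low: only {rows_today} rows today")

    return failures
```

Wired into a scheduler or CI job, a check like this fails loudly the way a broken unit test does; observability platforms extend the same idea with automated monitoring and alerting across the whole warehouse.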

Unfortunately, not all software reliability solutions have a data engineering relative. One of the biggest gaps right now is a parallel to git: a solution for peer review in an active environment to coordinate evolving code and move it from staging to production.

SaaS level reliability for data is possible

Five 9s of uptime seems like an impossible benchmark for data platforms, but data engineering teams can clear this bar by emphasizing reliability across their people, processes, and technology.

Lior is the CTO and co-founder of Monte Carlo, the data reliability company, and the creator of the industry’s first end-to-end data observability platform. He is based in the San Francisco Bay Area. You can connect with him on LinkedIn, Twitter, or the Monte Carlo blog.