The concept of observability traces back to the 1960s and Rudolf E. Kalman’s canonical work on decomposing complex systems so that humans could understand them. It was a heady time for new computing systems in aerospace and navigation. Advances in these systems were outpacing humans’ ability to reason about them, and Kalman’s work is widely credited with laying the foundation for observability theory.
Observability as we know it today—the $9 billion category that is a staple of modern IT operations—is more commonly associated with Google’s site reliability engineering approaches to hyperscale services like Google Search, Google Ads, and YouTube.
According to Google’s Site Reliability Engineering book, it was in 2003, coinciding with the creation of Borg (the cluster operating system that would inspire Kubernetes), that Google created a novel monitoring system called Borgmon. Google recognized that the many moving parts of microservices running across distributed infrastructure required a new model for understanding dynamic systems: one that worked in real time and didn’t swamp platform teams with noisy pages.
Borgmon “made the collection of time-series data a first-class role of the monitoring system and replaced check scripts with a rich language for manipulating time-series into charts and alerts.”
Enter Prometheus and Grafana
Borgmon became the inspiration for Prometheus, which was publicly released in