Mean time to innocence – Splunk's case for why your observability data is as much a political problem as a technical one
- Summary:
- When a VIP customer calls and the trace has been discarded, you can't prove the issue wasn't yours. Splunk's Stephane Estevez has a name for that problem – and an argument for why fixing it starts long before the incident.
Look up "observability" in a dictionary and you will find "the capacity for being observed" – a meaning that predates IT by nearly two centuries. What you will not find is an agreed-upon definition of what the cloud native industry actually means when it uses the word. Every vendor has its own version, and so does every customer.
That is Stephane Estevez's diagnosis. Estevez is Observability Market Advisor at Splunk – eight years at the company, and considerably longer on the buy side running service catalogs for managed service providers and telcos. In a discussion at KubeCon Europe 2026 in Amsterdam, he argues that the definitional fog creates some of the most expensive misunderstandings in enterprise IT:
Everybody talks about real time. Nobody does real time. Everybody talks of end-to-end. Where the hell is end, and where does it start? The devil really hides in the detail.
A data platform that arrived at observability
Splunk's positioning is worth understanding on its own terms, because its history shapes its current approach. It did not start as an observability vendor. It started as a machine data platform – founded by engineers from Yahoo who needed a better way to manage logs – and observability followed as one consequence of what that platform could already do:
We started as an IT ops solution from Yahoo guys that said it's a nightmare to manage logs. Observability is just a subset. Cyber security is a consequence. We didn't start like that. We are a data platform, a machine data platform to be more precise.
That history shows up directly in how Splunk approaches the market today, particularly in its commitment to OpenTelemetry – the open source, vendor-neutral telemetry framework that has become, in Estevez's words, "as big as Kubernetes" in terms of community adoption. Splunk runs a dedicated OpenTelemetry engineering team in Krakow contributing to the project full-time, and has co-developed OpenTelemetry eBPF Instrumentation (OBI) alongside Grafana Labs.
Estevez describes the combination of OBI and standard OpenTelemetry as solving two different problems simultaneously – breadth versus depth:
eBPF gives you this capability to say, we don't care anymore. Whatever your application, at least you have the minimum telemetry. If you want to go deeper, then it's better to go for OpenTelemetry. That's why it's an interesting combination – you can have breadth and depth.
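What "going deeper" with OpenTelemetry looks like in practice is ordinary application code opening its own spans. Here is a minimal sketch, assuming a Python service using the OpenTelemetry SDK – the checkout service, span name, and `customer.tier` attribute are illustrative examples, not anything Splunk prescribes – of the kind of context that kernel-level instrumentation cannot supply on its own:

```python
# Sketch of the "depth" half: a hand-written span carrying business context.
# Assumes opentelemetry-api and opentelemetry-sdk are installed; the service,
# span names, and attributes are illustrative only.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Print spans locally for the sketch; a real deployment would export
# OTLP to a collector rather than to the console.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")

def process_checkout(order_id: str, customer_tier: str) -> None:
    # eBPF can see the request and its latency from the kernel; only the
    # application knows which order and which customer this trace belongs to.
    with tracer.start_as_current_span("checkout.process") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("customer.tier", customer_tier)  # e.g. "vip"
        # ... business logic ...

process_checkout("ord-1234", "vip")
```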
Shipping a vendor distribution of an open standard could raise lock-in concerns. Estevez's answer to that worry is blunt:
If you know you can get out easily, then you're more likely to get in. Just as simple as that.
A no-sampling stance
For two decades, the Application Performance Monitoring (APM) industry has operated on the assumption that you sample trace data because storing everything costs too much. Splunk's no-sampling stance runs against that orthodoxy, and Estevez's case for it comes down to a single question: why are you sampling in the first place?
Why are you sampling? The only reason I could find is cost. It costs too much money to store too much data, period.
The scenario he uses to illustrate the risk will be familiar to anyone who has sat in an incident room. A VIP customer calls, something is not working, the transaction did not flag as an error, so the trace was discarded. Without it, the operations team cannot prove the issue was not on their side:
You cannot prove that you're innocent. There's so many cases where you need the data.
He calls this "mean time to innocence" – the speed at which a team can demonstrate a problem is not theirs. Any CIO who has fielded a blame-laden post-incident call will be horribly familiar with this scenario.
The implications compound as AI enters the picture. AI-driven analysis depends on complete data sets to detect early signals long before an error fires – a trace discarded because everything looked fine at the time takes that signal with it. Estevez says that Splunk can sustain the no-sampling position through economies of scale on storage, but his underlying argument is that the industry default has been shaped by cost assumptions rather than by what the data is actually worth.
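The mechanics of why that discarded trace is unrecoverable are worth spelling out. In the scenario above, the pipeline kept only traces that flagged errors; another common pattern makes the keep-or-drop decision at trace start, from nothing more than the trace ID. Either way, the decision is made before anyone knows the trace will matter, and what is dropped cannot be recovered. A minimal sketch of the latter, using the OpenTelemetry Python SDK's built-in samplers – the 10% ratio is an illustrative value, not a Splunk or industry recommendation:

```python
# Head-based sampling: the keep/drop decision happens at trace start,
# long before a VIP escalation makes the trace valuable.
# Sampler classes are part of the OpenTelemetry Python SDK; the 10%
# ratio below is an illustrative choice.
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ALWAYS_ON, ParentBased, TraceIdRatioBased

# Typical cost-driven setup: keep roughly 1 in 10 traces, drop the rest.
# A clean-looking VIP transaction has a 90% chance of never being stored.
sampled_provider = TracerProvider(
    sampler=ParentBased(root=TraceIdRatioBased(0.1))
)

# The no-sampling stance Estevez argues for: record everything and let
# storage economics, not the sampler, absorb the cost.
full_fidelity_provider = TracerProvider(sampler=ALWAYS_ON)
```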
The Splunk State of Observability 2025 report, drawn from 1,855 ITOps and engineering professionals across nine countries, puts numbers behind the pain: 52% cite a high volume of false alerts as a persistent problem, and 59% still struggle with too many disparate tools – symptoms of environments reacting to noise rather than working from complete signal.
Breadth, depth, and instrumentation
Underpinning the no-sampling position is a more fundamental data collection problem: getting telemetry out of complex, heterogeneous environments in the first place. Estevez notes this is where most organizations spend the most time and get the least attention from vendors:
The first challenge is that change, because that's the biggest one. It's all the processes you build on top of it, all the trainings you've done on it, the scripts that you created for remediation. There's a lot of change that needs to be done.
OBI addresses one specific and persistent gap in that picture. Standard OpenTelemetry works well for common runtimes – Java, .NET – but struggles with compiled languages like C, and with anything that is difficult to instrument directly. eBPF (Extended Berkeley Packet Filter) lets programs run safely inside the Linux kernel without modifying it. OBI uses that capability to capture telemetry from applications without requiring code changes, providing baseline coverage across an environment while standard OpenTelemetry handles deeper instrumentation where it is needed. The co-development with Grafana Labs means it is a community project, not a proprietary one. Splunk's logic is that anything lowering the cost of OpenTelemetry onboarding in complex environments helps both sides: organizations spend less time on instrumentation groundwork, and more data flows into the platform.
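OBI itself runs at the kernel level and needs nothing from the application, so there is no application code to show for it. The closest analogue that can be sketched in code is OpenTelemetry's library auto-instrumentation – a different mechanism from eBPF, shown here purely to illustrate the zero-touch principle: telemetry is attached from outside, and the business code stays untouched. The package and endpoint below are illustrative:

```python
# Zero-touch baseline telemetry, illustrated with OpenTelemetry's library
# auto-instrumentation (not OBI/eBPF, but the same "no code changes" idea).
# Requires opentelemetry-sdk and opentelemetry-instrumentation-requests.
import requests
from opentelemetry import trace
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# One line at startup: every outgoing HTTP call now emits a span with
# method, URL, status code, and duration. Nothing below changes.
RequestsInstrumentor().instrument()

def fetch_inventory():
    # No SDK imports, no spans, no decorators in the business code itself.
    return requests.get("https://example.com/api/inventory", timeout=5)

fetch_inventory()
```

OBI pushes the same idea below the runtime: because the capture happens in the kernel, even a C binary that no agent can hook into still produces the baseline telemetry.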
Deployment, silos, and licensing traps
The conversation about what data to collect quickly connects to a harder question: where does it go, and who can see it? Estevez describes a market where the sales pitch – real-time, end-to-end, full-stack – has become so homogenized that buyers cannot distinguish between vendors without digging into the detail:
How would you decide who you work with when everybody tells you the same story? Unfortunately, they have to go and investigate a little bit more into the details, because the devil is in the detail.
Licensing models, he says, are a specific trap that buyers underestimate:
There's a lot of traps in all the licensing model.
This matters particularly as deployment models diversify. Many observability vendors grew up solving cloud problems and have not invested seriously in on-premises capability. For large European enterprises facing tightening data residency requirements, that creates a structural mismatch:
There's a need more and more. Can I do on-prem? Sovereign cloud? And we do have the solution for all of the three different models.
Splunk's Cisco parentage provides a structural advantage here – the ability to offer SaaS, on-premises, and hybrid deployment to customers whose regulatory environment or risk posture rules out a pure cloud approach. The CNCF Q1 2026 State of Cloud Native Development report puts the trend in context: hybrid cloud adoption among all developers has reached 34% – the highest proportion recorded to date – with data sovereignty requirements and privacy regulations cited as key drivers.
There is a parallel shift underway across organizational silos, one that Estevez sees accelerating. Security teams and observability teams have historically operated separately, with little appetite for sharing data in either direction. That is starting to change:
We see interest now from the cyber security guys. They're really more interested in observability data. They don't want to give too much of their own data to the others yet, but they're more and more interested.
The report data supports this: 44% of observability leaders say they can solve issues collaboratively with security teams, compared to 29% of others. The biggest barrier is not technology – it is resistance to change, cited by 59% of respondents. Getting 40 to 50 applications across multiple business units into a single screen, as BMW has done, requires platform capability. But it also requires the organizational will to match.
My take
Estevez is the first to admit the market is crowded, and his description of it – a landscape where every vendor claims "real user monitoring, synthetic monitoring, APM, distributed tracing, infrastructure monitoring, log management, pipeline management" – is the most honest articulation of the competitive problem you are likely to hear from someone inside it.
Splunk's no-sampling position is a claim about what data is actually worth – and as AI-driven analysis becomes more central to how teams detect and respond to incidents, the cost of discarding traces to save on storage looks increasingly hard to justify. Mean time to innocence is a small, sharp concept with real organizational weight: the value of complete trace data is political as well as operational.
The OBI/eBPF story is under-appreciated. The ability to provide baseline telemetry from any application without touching the code, combined with deep instrumentation via OpenTelemetry where you want it, is a practical answer to one of the most expensive problems in observability rollouts – getting data out of complex environments without a multi-year instrumentation project.
Worth keeping in mind is the licensing point – the detail that bites you is usually commercial, not technical.