Main content

How Norway's welfare system moved 400GB of daily logs to managed OpenSearch without a service interruption

By Alyx MacQueen April 22, 2026
Dyslexia mode
Excerpt:
The Norwegian Labor and Welfare Administration (NAV) pays out 33% of Norway’s state budget. A thousand developers rely on its central logging infrastructure every day. At OpenSearchCon Europe 2026, platform engineer Hans Kristian Flaatten and Aiven Product Director Dmitry Kan walked through how they swapped out the underlying search engine while the lights stayed on – and what happened when a last-minute integration problem turned up three days before Christmas.

NAV's Hans Kristian Flaatten and Aiven's Dmitry Kan at OpenSearchCon

NAV is the Norwegian government agency responsible for unemployment, sickness, disability, maternity and pension payments to 5.5 million citizens. When platform engineer Hans Kristian Flaatten and his colleagues looked at their logging stack, two forces converged. The Elasticsearch license had changed, so the tool no longer qualified as open source under NAV’s own sourcing strategy. And a wider cloud migration was already underway to reduce the maintenance burden on internal teams.

I sat down with Flaatten and and Aiven Product Director Dmitry Kan at OpenSearchCon in Prague to discuss an unusual migration story that included a failed experiment with Grafana Loki, a scale problem Aiven’s engineering manager initially refused to take on, and a three-day scramble before Christmas. Plus a substantial reduction in cluster size at the end of it all. 

Ten years of paving the highway

NAV a decade ago looked almost nothing like the organization Flaatten now works in. It was project-based, reliant on the lowest-bidder public tender model, with private contractors delivering what new management eventually concluded was a slow, expensive, low-satisfaction service for citizens.

A new welfare director and a new IT director arrived with a different theory. Instead of outsourcing, NAV began hiring developers, moving to product teams, adopting open source, and building an internal developer platform called NAIS.

Flaatten calls NAIS (NAV's Application Infrastructure Service) a golden path – a concept familiar to anyone who has followed the platform engineering conversation at Kubernetes conferences. Product teams stay on the road because the road is paved. Off-road is permitted, but only with serious effort and equipment. Guardrails sit at the edges, preventing irrevocable damage to neighboring teams. The tools, Flaatten acknowledged, are sharp and need to be used carefully. All of the source code sits on GitHub, because NAV is publicly funded and the code therefore ought to be publicly available.

Centralized logging was built out alongside this platform, and it is what the more than one hundred product teams rely on whenever something misbehaves in production.

Why Loki didn’t stick

Before arriving at OpenSearch, NAV attempted a more modernized logging stack built on Grafana Loki. The pitch was attractive on paper – cheaper storage, tighter integration with Prometheus metrics and distributed traces, a cleaner fit with the Grafana ecosystem NAV was already using for other observability work. In practice, Flaatten said, it did not survive contact with the developers who had to use it.

High cardinality queries were a persistent bottleneck. Full text search – a feature NAV’s thousand or so developers used heavily to investigate incidents – was limited. Switching workflows without feature parity was not going to happen at that scale. Loki had merit for smaller, single-application use cases, but central logging needed a tool that worked at least as well as the Elasticsearch cluster NAV was trying to leave behind.

When NAV’s go-to-market team approached Aiven, the opening question was whether Aiven could actually run a workload this size. Kan’s engineering manager, when first shown the spreadsheet, said no.

He says, 'boss, I love you, but no, we have not run this scale'.

The cluster in question had 40 nodes in its largest configuration, with others at 15 to 20, and roughly 60 terabytes of disk storage. Kan said his first instinct as a product leader was not to walk away but to restructure the conversation. He and NAV agreed to a weekly working call, with Aiven’s engineering team reporting progress transparently and NAV providing equally transparent feedback. A proof of concept was run. Cluster size grew in careful increments.

By the time the migration was complete, NAV was serving the same load with 15 nodes at 64GB of memory each – fewer nodes and less storage than the Elasticsearch cluster it was replacing. The performance gap had been closing for some time; the managed service simply made the difference visible.

Dual write, no data transferred

NAV’s technical migration strategy is genuinely elegant. No data was transferred from the old Elasticsearch cluster to the new OpenSearch one. Instead, log shippers – most of them running inside Kubernetes where Flaatten had direct control, and a smaller number in legacy environments – were reconfigured to send logs in parallel to both systems. Kan described the constraint as:

Building a startup while you’re jumping on the tape.

Logging had to stay available throughout, because developers rely on it the moment anything starts going wrong in production. Dual writing gave NAV a period during which both stacks ingested identical data, which made verification possible. Once the new cluster had accumulated enough history to be useful for incident investigation – NAV only retains logs for a matter of months – the Elasticsearch cluster was switched off. No data transfer, no cutover window, no downtime.

Flaatten also used the migration to tighten governance around sensitive data. The previous environment had drifted into something he was blunt about calling a dumping ground for information that should never have been in the logging platform – data that belonged in a properly access-controlled database, referenced by case identifier rather than recorded in raw form. Role-based access control was deliberately not enabled on the new stack, because open developer access to logs had always been a strength at NAV. The fix was better discipline at the point of log generation, not tighter access downstream.

Three days before Christmas

The migration ran smoothly enough that Flaatten was finalizing sign-off documentation when a last-minute compatibility problem surfaced. NAV’s existing log and trace shipping stack used an Elasticsearch agent with OpenTelemetry connectors, and it did not work with OpenSearch. Kan’s go-to-market team was unwilling to sign off until a path forward existed. It was three days before Christmas.

Aiven evaluated two options. One was Vector, which it already uses internally. The other was Data Prepper, part of the OpenSearch project itself. Vector did not support traces. Data Prepper did, but it was not yet a productized service in Aiven’s portfolio. They resolved it by agreeing that NAV would set up Data Prepper itself in the short term, while Aiven added a managed Data Prepper service to its roadmap. That service is now built.

This is the instance of co-creating with your customer.

Kan credited the weekly call pattern with making this kind of real-time problem-solving possible. He also attributed some of it to cultural alignment – Aiven is Finnish in origin, NAV is Norwegian, and both sides were comfortable skipping the small talk and getting straight to what was not working.

My take

Flaatten’s critique of “move fast and break stuff” development within the OpenSearch community was the most concrete suggestion I have heard for how to earn enterprise trust. He pointed at what the Kubernetes community achieved with its certified conformance program – the guarantee that the same query on the same data works consistently across distributions – and asked out loud whether OpenSearch could do the equivalent. That question is one the Foundation’s newly announced Long-Term Support (LTS) program will have to answer if it wants to move beyond the version 1.3 customers who are currently afraid to upgrade.

The cluster size reduction is a finding that vendor marketing rarely prepares you for. OpenSearch ended up smaller than the Elasticsearch installation it replaced. Migrations don’t usually go that way around.

Additionally, Flaatten’s openness about the Loki detour was refreshing. Enterprise case studies rarely include failed attempts, which is a shame because failed attempts are where most of the real learning lives. His willingness to describe it as a good-faith effort that did not work for centralized logging made the rest of the conversation more credible, not less.

Disqus Comments Loading...