Cisco Live Amsterdam - how DHL Express keeps a ThousandEyes focused on its global network to ensure 100% uptime
- Summary:
- What if you could run a global network and be pretty certain you weren't going to get any nasty surprises? That's what DHL Express is aiming to deliver.
DHL Group is the largest global logistics company, delivering 1.8 billion packages per annum, so it’s fair to say that it has some pretty serious connectivity requirements, especially given DHL Express’s ‘100% Uptime’ commitment.
The courier firm is a heavy user of Cisco’s ThousandEyes cloud-based digital experience monitoring platform to manage its most critical parcel delivery applications, its SD-WAN overlay and underlay networks, as well as its Contact Center-as-a-Service (CCaaS) sites.
As seen above, it’s a big tech ask, as Richard Alcalay, VP, Global End User & Telecom Services at DHL, confirms:
From a telecom perspective, we have 3,300 sites in 220 countries, ranging from a small shop where you just drop off a parcel to a data center at one of the airports, for example. So it's a massive spectrum that we have to cover, from a telecoms perspective.
And all the time with that ‘100% Uptime’ idea looming over them. Alcalay explains:
Every year we look at issues that we had in the previous year and then come up with a set of initiatives that we then focus on in order just to take us that little bit further, always trying to make sure that we have that 100% uptime. Because if we have a network outage, it impacts the way that we can sort or not sort our parcels, which means, obviously, a knock-on effect down the line. At the end of the day, you're impacting customer quality, which is something that we want to avoid at all costs.
Tackling complexity
Complexity is the order of the day, adds David Branik, VP, Head of Telecoms, DHL:
We're a large company that's basically everywhere. We have on-premises data centers, we're in the cloud, we're using the Internet, we're in every location, virtually every country in the world. That creates its own complexity. It's not easy to reduce the complexity so that we can find what's happening, where, and why? Sometimes it's not obvious. It's not like someone says, 'Well, this is what's not working', and, 'This is where some application isn't working, or someone's not connecting'.
Such complexity was, in part, self-inflicted, both by DHL and by the wider industry, suggests Alcalay:
The whole industry added complexity with the move away from MPLS (Multi-Protocol Label Switching), where you can fix point-to-point network connections and the vast majority of our applications were in our data centers and third-party data centers. Now we've moved more and more to the cloud and we've got SaaS applications, and all of a sudden then we're becoming much more reliant on the Internet. That's when we really said, 'OK, we need to be able to monitor that'.
DHL needs to know what’s happening with traffic across its networks, he explains:
In the past, you knew very clearly where that traffic was going to be traversing. Now, all of a sudden, it could go through a multitude of MSPs. It's taking the optimal route, but sometimes that can bite you, if there's an issue with a party, an ISP that maybe you don't have a relationship with. So being able to drill down and say, 'OK, it's right there, let's divert that traffic' [matters].
Surprise - no surprises!
With ThousandEyes up and running, there were some surprises to be uncovered, says Branik, although DHL Express is no longer taken by surprise:
We had a couple of things where we had a fairly low packet loss in different places and we actually found it with ThousandEyes before it actually impacted the business. It was pleasant to find that it's actually doing what it was meant to do, and pro-actively tell us about issues that potentially down the road would cause an impact and reduce our traffic. We've hit that a couple of times, and this is where it's been a very pleasant experience to find it out and be able to fix it before the business is even aware that there's an impact.
Obviously, you have your own ISPs, or your providers have their ISPs, but if [something's] somewhere on the Internet, you don't have full control over it. Once we found the issue [we had was] somewhere on an unrelated internet path, and we were looking for an alternate path, because it wasn't a directly-contracted ISP with us, so it helped us to identify that along the path between the application and the users.
In that case:
We didn't engage with [the ISP] in the end. I mean, we just took a different path. You can only control what you can. I'd love to fix the Internet, but that's not always possible.
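The early warning Branik describes, catching low-level packet loss before it affects the business, can be illustrated with a minimal Python sketch. It is not ThousandEyes code and does not use any ThousandEyes API; the threshold values, the class and function names, and the site labels are all hypothetical, chosen only to show the idea of alerting on sustained loss rather than on every blip, and of grouping the results per country.

```python
from collections import defaultdict, deque

# Hypothetical tuning values, not taken from DHL or ThousandEyes.
LOSS_THRESHOLD_PCT = 1.0   # loss at or above this level counts as degraded
CONSECUTIVE_ROUNDS = 3     # degraded rounds in a row needed before alerting


class LossMonitor:
    """Tracks recent packet-loss samples per site (illustrative only)."""

    def __init__(self, threshold_pct=LOSS_THRESHOLD_PCT, rounds=CONSECUTIVE_ROUNDS):
        self.threshold_pct = threshold_pct
        self.rounds = rounds
        # Each site keeps only its most recent `rounds` samples.
        self._history = defaultdict(lambda: deque(maxlen=rounds))

    def record(self, site, loss_pct):
        """Store one sample; return True if the site shows sustained loss."""
        window = self._history[site]
        window.append(loss_pct)
        return len(window) == self.rounds and all(
            sample >= self.threshold_pct for sample in window
        )


def alerts_by_country(samples, monitor=None):
    """Group alerting sites by country.

    `samples` is an iterable of (country, site, loss_pct) tuples in time order.
    """
    if monitor is None:
        monitor = LossMonitor()
    alerts = defaultdict(set)
    for country, site, loss_pct in samples:
        if monitor.record(site, loss_pct):
            alerts[country].add(site)
    return {country: sorted(sites) for country, sites in alerts.items()}
```

Under these assumptions, a site with three consecutive rounds of 2% loss would be flagged, while a site that sees a single 5% spike between clean rounds would not, which is the kind of tuning trade-off discussed later in the piece.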
Benefits - and fewer people on group calls!
Being ahead of problems has been a major benefit, agrees Alcalay:
Being able to drill down into [a problem] is one thing, but the pro-active monitoring, that's really useful. We've got all of those sites in numerous countries, so having the ability within the tool to set up site-by-site or country-by-country dashboards with the alerting that they want centrally, [is important]. You don't want an alert every time everything goes off, but what you do want within a country is more alerts and they can then pro-actively look into them. We had instances where we had in one country traffic being routed in a very logical way, [while] in another country, just next door, they were having terrible performance, and the traffic was being routed in a less-than-logical way. It was through ThousandEyes that we were able to identify that and then rectify it.
That clearly has a beneficial business impact, he adds:
You're reducing the company's cost, whether it's in troubleshooting or not having that impact at all, and not having to [reroute] planes or packets or doing manual work in the backend. This is where it's hitting the business, and this is why I think it's key to get [monitoring] right and tune it to you. It's working on what's expected on our network... The tuning is very important at the beginning. It's worth spending the time there to get it right, because after that, you get the benefits.
One of which is that DHL has also been able to reduce the level of human traffic around incidents when they do occur. Alcalay explains:
When you've got an incident, a lot of people join the call. You've got network people, you've got hosting people, your application people, and if you can very quickly identify where the issue is, be it a network issue or not, it just reduces the number of people within those incident calls so that they can then quickly get rid of the rest of the people who are not involved. That leaves the people that can do something about a quieter space to be able to focus.
And it’s 2026 - there has to be an AI angle in play somewhere, surely? Branik laughs:
Well, I'd like to come into the office in the morning and talk to my computer and say, 'So what's going on? Is everything good?' and the computer tells me, 'Sure' and in the back end it checks everything. But I don't think we're there yet.