5 tips for reducing latency: here's one more

A few months ago, Stephen J. Bigelow published an article listing five tips for reducing hybrid cloud latency. The good thing about this article is that it addresses a real issue: latency. However, we felt we needed to add a few things to properly address this core problem.

Processors wait all the time

Way too often, we mistakenly assume that the performance of an Internet service is directly tied to the computing power of a platform. Yet when you take a closer look, most of the time, processors are… waiting! Of course, we keep giving them more and more to do (virtualization), but that is not the heart of the matter: we are wrong to think we know where the problem comes from. When you spread information across multiple systems in different geographic locations and break up the storage of data centers, you gain in scalability and redundancy. Nevertheless, by adding latency at every level, you lose processing speed. Moving information from acquisition cards to memory, then to the processor, then to the network interface, and finally from network to network: the time lost in these transfers has become significantly greater than the time spent processing the information itself. And it goes even further: with all this back and forth, TCP introduces latency into every request, through connection establishment, acknowledgements, data transfer and connection teardown…
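To make this concrete, here is a minimal sketch (the hostname is a placeholder and this is an illustration, not a monitoring tool) that separates the time spent on the TCP three-way handshake from the time spent actually exchanging application data:

```python
# Minimal sketch: how much of a simple HTTP exchange is spent just setting up
# the TCP connection, before any application data moves at all.
import socket
import time

HOST, PORT = "example.com", 80  # hypothetical target, replace with your own

t0 = time.perf_counter()
sock = socket.create_connection((HOST, PORT), timeout=5)   # SYN / SYN-ACK / ACK
t_connect = time.perf_counter() - t0

request = f"HEAD / HTTP/1.1\r\nHost: {HOST}\r\nConnection: close\r\n\r\n"
t1 = time.perf_counter()
sock.sendall(request.encode())
first_bytes = sock.recv(4096)                               # wait for the response headers
t_response = time.perf_counter() - t1
sock.close()

print(f"TCP connection setup : {t_connect * 1000:.1f} ms")
print(f"Request + first bytes: {t_response * 1000:.1f} ms")
```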

Among other things, Stephen J. Bigelow's article insists on the need for proximity. While it is true that the distance between the different connections does influence latency (especially if traffic goes all the way around the world for lack of peering), it is wrong to think that proximity is a necessary and sufficient condition. In the Internet world…

… the fastest path is not always the shortest one.

At best, you might reduce the distance by a few thousand kilometers and gain a few milliseconds. However, there is much more to lose. Why? Because of all the hops along the way and the efficiency of the routers deployed (performance, quality of configuration, traffic prioritization, overload, questionable QoS management mechanisms). This is why I would rather have a 5,000 km path that is as direct as possible, run by engineers who know what they are doing, than an 18-hop path covering a few hundred kilometers around Paris over overwhelmed routers on their knees.
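A back-of-the-envelope calculation makes the point. All figures below are illustrative assumptions (fiber propagation at roughly 200 km per millisecond, per-hop delays picked for the example), not measurements:

```python
# Back-of-the-envelope sketch: why a long, direct path can beat a short path
# with many overloaded hops. All numbers are illustrative assumptions.
FIBER_SPEED_KM_PER_MS = 200  # ~2/3 of the speed of light, in optical fiber

def one_way_latency_ms(distance_km, hops, per_hop_delay_ms):
    propagation = distance_km / FIBER_SPEED_KM_PER_MS
    processing = hops * per_hop_delay_ms   # queueing + forwarding at each router
    return propagation + processing

# 5,000 km, direct, well-run routers (~0.5 ms each across 6 hops)
print(one_way_latency_ms(5000, hops=6, per_hop_delay_ms=0.5))   # ≈ 28 ms

# A few hundred km around Paris, but 18 hops of congested routers (~5 ms each)
print(one_way_latency_ms(300, hops=18, per_hop_delay_ms=5.0))   # ≈ 91.5 ms
```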

Let’s go back to the five tips for reducing latency.

Yes, when it comes to supervising the quality of your service, it is important to work with direct peering between data centers and cloud infrastructures. Not because it shortens the distance, but because direct peering reduces the number of hops and “unsupervised” infrastructures in between, and therefore the latency.

In fact, the most important advice is in the conclusion: “However, developers have to take the time to assess the possible conceptual changes that might help”. And here… it does not help much if you have no way to evaluate those changes. Thankfully, there is a solution. The comments that follow come from our experience as an Internet Operator and from 16 years of hindsight helping most of the major global players improve the performance of their services.

« Yes, in order to supervise the quality of service, working with direct peering between data centers and cloud infrastructures is extremely important. »

You can’t improve what you’re not measuring.

Yes, in the Internet world, nothing is an exact science. Technologies change all the time, side effects appear, and higher loads lead to unpredictable performance. Long story short, nothing ever goes as planned. But that is also why it is so fascinating. Even the best constantly question themselves. No position is immutable. Ensuring quality, as you see it on mainstream sites like Amazon, Google, Facebook or Apple, takes a constant, tremendous amount of work. You need not only deep expertise and skills, but also useful and precise measurements. Without these measurements, skilled people (rare and expensive) are not used properly and waste too much time on tasks that better, automated information could handle.

Way too often, users complain about how long technical incidents last and see engineers as unqualified and unable to fix issues. But did you know that most of an outage’s duration is spent not fixing the problem, but finding out where it comes from and why?

Only real end-to-end monitoring, before, during and after a change, can truly help: like a GPS or a compass, it keeps you on course and tells you where to go.

How can you measure latency? What should you measure exactly?

Once again, people too often talk about latency while quoting the famous “ping”, when they are actually just measuring the round trip of packets over a protocol (ICMP) that no application actually uses! Most of the applications we rely on work over TCP or UDP, not ICMP. Networks are often tuned to optimize those modes and to process ICMP “as time permits”, or not at all, by filtering it. Why? To avoid crowds of people “pinging” routers non-stop and degrading their performance.

We are not saying that ping is useless. It is a good “first step” debugging tool that anyone can use to:

  • know if a destination is still reachable
  • get a first glimpse of what the response times are on certain segments

That said, an absolute ICMP measurement does not mean much on its own. Its variations, however, do: a destination that starts losing packets and takes 800 ms to respond instead of 60 ms is a good indicator of a transmission problem (overloaded link, poor quality, DoS attack in progress…).

If we really want to improve latency, the right methods and the right tools need to be put in the hands of the developers and DevOps teams responsible for keeping services operational. We need to move from theoretical latency to real latency: in other words, end-to-end application latency. We need to know how much there is, and why.

For those of you who might be interested (and to keep this short), I invite you to read up on SYN/ACK and to look more closely at everything that is exchanged in an HTTP session, for instance.
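As a starting point, here is a rough sketch (the hostname is a placeholder, and the figures it prints depend entirely on your network) that splits a single HTTPS request into the phases mentioned above: DNS resolution, the TCP handshake (SYN/SYN-ACK/ACK), TLS negotiation, and the time to the first byte of the response:

```python
# Rough sketch: break one HTTPS request into its latency phases.
import socket
import ssl
import time

HOST = "example.com"  # hypothetical host, replace with the service you monitor

def phase(fn):
    """Run fn and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn()
    return result, (time.perf_counter() - start) * 1000

addr, t_dns = phase(lambda: socket.getaddrinfo(HOST, 443)[0][4][0])
raw, t_tcp = phase(lambda: socket.create_connection((addr, 443), timeout=5))
ctx = ssl.create_default_context()
tls, t_tls = phase(lambda: ctx.wrap_socket(raw, server_hostname=HOST))
req = f"HEAD / HTTP/1.1\r\nHost: {HOST}\r\nConnection: close\r\n\r\n".encode()
_, t_ttfb = phase(lambda: (tls.sendall(req), tls.recv(4096)))
tls.close()

for name, ms in [("DNS resolution", t_dns), ("TCP connect", t_tcp),
                 ("TLS handshake", t_tls), ("Request + first byte", t_ttfb)]:
    print(f"{name:22s} {ms:7.1f} ms")
```

Run against a distant server, this kind of breakdown usually shows that several round trips happen before the first byte of useful data ever arrives.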

« Only real end-to-end monitoring, before, during and after a change, can truly help: like a GPS or a compass, it keeps you on course and tells you where to go. »

Are latency requirements the same for all services and purposes?

Well, the answer is no. Just as it is foolish to decree that a good-quality IP connection must have an error rate below 10^-8, a latency of 10 ms is perfectly acceptable in some cases (mail/SMTP, for instance) and completely unacceptable in others (voice, for instance… or high-frequency trading, where people fight over nanoseconds).

If you do not want to end up with approximations, or even outright mistakes, the quality of a service must be measured under real conditions, without any arbitrary extrapolation.

Let’s take the example of video. From traditional broadcasting to today’s OTT, by way of IPTV, things have completely changed over the past few years. You can now comfortably watch a video over a high-latency connection, simply because developers have understood how the Internet works and invented protocols that turn potential weaknesses into strengths. I am thinking of HTTP Live Streaming (aka HLS) in particular: the third revolution brought by Apple, after the iPhone and the iPad, in the years 2007-2009.
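Here is a toy simulation of that idea. All the figures are assumptions chosen for the example (segment length, round-trip time, bandwidth, bitrate), not HLS specifics; the point is simply that a player requesting short segments ahead of playback lets its buffer absorb even a very high round-trip time:

```python
# Toy simulation: segmented HTTP streaming tolerates high latency because the
# buffer grows as long as each segment downloads faster than it plays.
SEGMENT_DURATION_S = 6.0      # assumed segment length
RTT_S = 0.4                   # a deliberately "bad" 400 ms round trip
THROUGHPUT_MBPS = 10.0        # assumed available bandwidth
BITRATE_MBPS = 4.0            # assumed video bitrate of the chosen rendition

segment_size_mbit = BITRATE_MBPS * SEGMENT_DURATION_S
download_time_s = RTT_S + segment_size_mbit / THROUGHPUT_MBPS  # request + transfer

buffer_s = 0.0
for segment in range(5):      # fetch the first few segments
    buffer_s += SEGMENT_DURATION_S - download_time_s
    print(f"segment {segment}: buffer grows by "
          f"{SEGMENT_DURATION_S - download_time_s:.1f}s -> {buffer_s:.1f}s ahead")
# As long as download_time_s stays below SEGMENT_DURATION_S, playback never
# stalls, no matter how high the latency is.
```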

So how should you proceed? Where should you start?

Well, first, start by measuring. But measuring what?

Ideally, there are two important things to know:

  • The quality of the service (not the “quality of service”, aka QoS, but in fact the quality of the service). At Witbe, we have invented a methodology that classifies this quality through a measurement we call QoE, focusing on 3 main areas:
    • availability
    • performance
    • integrity
  • The quality of the transport layer (level 3, the IP level). Indeed, while the first measurement tells us what is going on, the second lets us go further in the root cause analysis and understand “why”. At Witbe, we call this set of measurements “SMARTPING”. And that’s what we’ve been calling it since 2000, when we invented it.
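To illustrate how those two families of measurements might sit side by side, here is a minimal sketch. The class and field names are purely illustrative assumptions; they are not Witbe's actual QoE or SMARTPING data model:

```python
# Minimal sketch: recording a service-level view and a transport-level view
# for the same measurement point, so "what" and "why" can be correlated.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ServiceQuality:          # the quality of the service, as a user sees it
    available: bool            # did the service respond at all?
    response_time_ms: float    # how fast did it respond?
    content_ok: bool           # was the content delivered intact (integrity)?

@dataclass
class TransportQuality:        # the layer-3 (IP) view, to explain the "why"
    latency_ms: float
    jitter_ms: float
    packet_loss_pct: float

@dataclass
class MeasurementPoint:
    timestamp: datetime
    service: ServiceQuality
    transport: TransportQuality

sample = MeasurementPoint(
    timestamp=datetime.now(timezone.utc),
    service=ServiceQuality(available=True, response_time_ms=230.0, content_ok=True),
    transport=TransportQuality(latency_ms=42.0, jitter_ms=3.5, packet_loss_pct=0.1),
)
print(sample)
```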

From checking to controlling

With the right indicators, you can better understand the consequences of a routing change on the performance and quality of a service. You can immediately diagnose fluctuating performance caused by routing asymmetry… Without them, you might as well look for a needle in a haystack and keep pretending that good quality of service cannot be guaranteed because of the Internet. It’s the famous “it’s not me, it’s the Web”.

At Witbe, our job for the past 16 years has been to help the players of the digital world move from checking to controlling. Not long ago, QoS measurement systems could still be one way to achieve that control. Today, however, they are no longer enough, and it is important to go further, with a real QoE expertise policy. This is how you will get the most out of everything Internet technologies have to offer, and how you will finally become proactive instead of reactively putting up with it all.