You open Moments, tap a short video, watch it freeze into a slideshow, and finally get hit with "loading failed." These days even watching a cat clip tests your patience. Anyone building a social app should understand this pain: users don't care how complex your backend is, they just conclude your tech is bad.
Latency is a stealth killer: invisible most of the time, but when it flares up it sends churn through the roof. Last year I helped a social platform with optimization and found their Southeast Asian users averaging over 300ms; young users would scroll twice and leave for a competitor. After we pushed latency below 80ms with a high-defense (DDoS-protected) CDN, next-day retention rose 17%.
Many people think buying a big-bandwidth server fixes latency. That's wishful thinking. Real-world latency is the sum of many stacked factors: physical distance, routing hops, carriers squabbling over peering, sudden traffic congestion... even an undersea cable tremor can send your experience back to the 2G era.
I recently tested three CDN providers on the same Los Angeles-to-Shanghai link: one held steady at 120ms, another spiked to 400ms with packet loss. The gap comes down to node placement and routing strategy. Some vendors' "global nodes" are pure label inflation; in practice the traffic may land on cheap rented racks.
Physical distance is a hard limit, but route optimization can save you
The speed of light is a law of physics: Beijing to New York can't get below roughly 60ms no matter how clever you are. But much of real-world latency has nothing to do with distance. I captured the path of a Singapore user reaching a Shenzhen server: a direct route would take about 80ms, yet the traffic detoured through Japan and then the United States, producing an absurd 220ms.
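To see how much of a measured number is pure physics versus bad routing, you can compute the fiber lower bound yourself. A back-of-envelope sketch, assuming light in fiber covers roughly 200 km per millisecond (the function names and coordinates here are my own, not from any vendor's tooling):

```python
from math import radians, sin, cos, acos

# Light in fiber travels at roughly 2/3 of c: about 200 km per millisecond.
FIBER_KM_PER_MS = 200.0
EARTH_RADIUS_KM = 6371.0

def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in km."""
    la1, lo1, la2, lo2 = map(radians, (lat1, lon1, lat2, lon2))
    central = acos(sin(la1) * sin(la2) + cos(la1) * cos(la2) * cos(lo1 - lo2))
    return EARTH_RADIUS_KM * central

def min_one_way_ms(lat1, lon1, lat2, lon2):
    """Physical lower bound on one-way latency over ideal, straight fiber."""
    return great_circle_km(lat1, lon1, lat2, lon2) / FIBER_KM_PER_MS

# Beijing -> New York is roughly 11,000 km, so ~55 ms one way at best.
print(round(min_one_way_ms(39.9, 116.4, 40.7, -74.0), 1))
```

Anything far above that floor is routing overhead, not physics — which is exactly how you spot a detour like the Singapore-to-Shenzhen case.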
That is a textbook routing-strategy failure. Some small CDNs, to save cost, funnel all traffic through a handful of core nodes and then forward it onward, calling it "intelligent scheduling" when it is really corner-cutting. A good CDN should dispatch traffic like a ride-hailing service: match the nearest available node at any moment, and even anticipate a user's movement and pre-position resources.
Nearest-node access is more than a simple geographic lookup
Don't trust schemes that "assign nodes based on IP geolocation" alone. Mobile users are always on the move, and when a phone switches between 4G and Wi-Fi its IP can instantly jump across provinces. Worse, some carriers go years without updating their IP databases, so a Shanghai user landing on a Beijing node is no surprise at all.
The approach I use now is a triple judgment: IP geolocation database + real-time network probing + device GPS (with the user's permission). For example, if a user is detected moving from Chaoyang District to Haidian District, the system switches to a closer node within five minutes even if the IP hasn't changed. In my measurements this strategy cut latency fluctuation in mobile scenarios by 40%.
Here's the gist of the node selection logic in use (pseudo-code):
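A minimal sketch of that triple-judgment selection, assuming hypothetical node records and a caller-supplied RTT probe (`pick_node`, `probe_rtt_ms`, and the field names are illustrative, not a real API):

```python
import time

# Sketch of the "triple judgment" described above: IP geolocation as the
# baseline, live RTT probes as the tiebreaker, and (opt-in) GPS to catch
# moves the IP database misses. All names here are hypothetical.

REEVALUATE_SECS = 300  # re-check roughly every 5 minutes

def pick_node(user, nodes, probe_rtt_ms, now=None):
    now = now if now is not None else time.time()
    cached = user.get("node")
    if cached and now - user.get("picked_at", 0) < REEVALUATE_SECS:
        return cached  # stick with the last choice inside the window

    # 1) Baseline: candidate nodes in the user's region; GPS-derived
    #    region (if the user granted it) overrides the IP-derived one.
    region = user.get("gps_region") or user["ip_region"]
    candidates = [n for n in nodes if n["region"] == region] or nodes

    # 2) Real-time probe: lowest measured RTT among healthy candidates.
    best = min(
        (n for n in candidates if n["healthy"]),
        key=lambda n: probe_rtt_ms(n),
        default=None,
    )
    if best:
        user["node"], user["picked_at"] = best, now
    return best
```

The 5-minute window matches the re-evaluation cadence mentioned above; in practice you would also want hysteresis so two nodes with near-identical RTT don't cause flapping.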
BGP route optimization is the hidden trump card
From the same Beijing data center to a Guangzhou user, a China Telecom direct link might be 80ms, while transiting through China Unicom can soar to 200ms. A quality CDN must have multi-line BGP interconnection, plus the ability to probe carrier link quality in real time and switch dynamically.
CDN07 deserves special mention here: their Anycast network is genuinely well done. During last year's Double Eleven, while helping an e-commerce client absorb traffic, it automatically shifted congested China Telecom traffic in South China onto China Mobile lines; latency dropped from 190ms to 110ms and users never noticed a thing.
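That kind of carrier failover can be sketched as a smoothed-RTT picker with hysteresis, so one noisy probe doesn't cause flapping. This is a hypothetical illustration, not CDN07's actual implementation; the constants are guesses:

```python
# Hypothetical carrier-line failover on a multi-line (BGP) node: keep an
# exponentially smoothed RTT per carrier, and only switch when the active
# line is clearly worse than the best alternative.

SWITCH_RATIO = 1.5   # switch only if the active line is 50% worse than the best
ALPHA = 0.2          # EWMA smoothing factor for RTT samples

class LinePicker:
    def __init__(self, lines):
        self.rtt = {line: None for line in lines}  # smoothed RTT per carrier
        self.active = lines[0]

    def report(self, line, sample_ms):
        """Feed one RTT probe result into the smoothed estimate."""
        prev = self.rtt[line]
        self.rtt[line] = sample_ms if prev is None else (
            (1 - ALPHA) * prev + ALPHA * sample_ms
        )

    def choose(self):
        known = {l: r for l, r in self.rtt.items() if r is not None}
        if not known:
            return self.active
        best = min(known, key=known.get)
        current = known.get(self.active)
        # Hysteresis: keep the active line unless it's clearly degraded.
        if current is None or current > SWITCH_RATIO * known[best]:
            self.active = best
        return self.active
```

With the numbers from the Double Eleven story (telecom at ~190ms, mobile at ~110ms), 190 exceeds 1.5 × 110, so the picker fails over; a small wobble on the new line would not flap it back.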
I also benchmarked the routing optimization of the three vendors head-to-head (results in ms):
Could the high-defense features themselves add latency?
Some vendors, to show off their protection capacity, route all traffic straight through a high-defense scrubbing center. Forcing a user to fetch a single image through a black-hole protection node 800 kilometers away is pointlessly roundabout. The sane design is edge security combined with joint defense at the center.
I've seen a clever design in CDN07's console: ordinary traffic is served from the nearest edge, and only traffic suspected of being an attack is redirected to the scrubbing center. It preserves security without hurting the normal user experience; the balance is quite well struck.
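The edge-plus-scrubbing split boils down to a cheap classifier in front of the dispatcher. The thresholds and heuristics below are illustrative, not CDN07's actual rules:

```python
# Hypothetical "edge + scrubbing center" joint defense: serve normal
# requests from the nearest edge node, and divert only traffic that trips
# cheap local heuristics to the (farther away) scrubbing center.

RATE_LIMIT_RPS = 100       # per-IP request rate that smells like a flood
SCRUB_CENTER = "scrub-center"

def route_request(req, rate_of, nearest_edge):
    """Return which node should handle this request."""
    suspicious = (
        rate_of(req["ip"]) > RATE_LIMIT_RPS   # abnormal per-IP request rate
        or not req.get("user_agent")          # headless flood traffic
    )
    return SCRUB_CENTER if suspicious else nearest_edge(req["ip"])
```

The point is the asymmetry: the happy path pays zero extra distance, and only the suspicious minority eats the detour to the scrubbing center.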
TCP-level tuning can claw back 10-20ms
Many people ignore transport-layer optimization, yet every millisecond here is a real experience gain. For example, raising the initial congestion window (initcwnd) from 10 to 16 makes small files load noticeably faster. Enabling TCP Fast Open can cut roughly a third off the first-packet round-trip cost on repeat connections.
Here's a kernel-parameter tuning snippet that tested well:
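The exact values will vary by kernel and workload; what follows is a representative Linux set for the knobs discussed above. The BBR choice and the placeholder gateway are assumptions, and note that initcwnd is a per-route option, not a sysctl — measure before and after on your own traffic:

```shell
# Enable TCP Fast Open for both client and server sides
# (saves one round trip on repeat connections).
sysctl -w net.ipv4.tcp_fastopen=3

# Prefer the BBR congestion control algorithm; it tends to hold latency
# down under loss better than CUBIC on long paths (assumption: the
# tcp_bbr module is available on this kernel).
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Raise the initial congestion window to 16 segments on the default
# route so small files finish in fewer round trips.
# Replace <gateway> and eth0 with your actual gateway and interface.
ip route change default via <gateway> dev eth0 initcwnd 16 initrwnd 16
```

Apply these one at a time and watch p95 latency, not just averages — a bigger initcwnd helps small objects but can worsen loss on already-congested links.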
Don't forget the physical-layer voodoo
I once hit a bizarre case: a CDN node's latency spiked on a regular cycle, and it turned out the machine room's air conditioning started on a timer, and the resulting temperature swings made the NIC's crystal oscillator frequency drift. You will never find that on a monitoring dashboard; you have to camp out in the machine room to catch it. So when choosing a CDN vendor, check whether they run their own data centers; quality control in rented racks really comes down to luck.
New protocols aren't a silver bullet, but they're worth a try
QUIC genuinely shines on mobile networks, especially weak ones. But don't flip the whole site over blindly; compatibility issues on older devices will make you debug until you question your life choices. Roll it out first on fault-tolerant resources like images and video, then expand once it's stable.
Measured data: on a 4G network, QUIC's latency was 23% lower than TCP's, and packet-loss recovery was 5x faster. Under Wi-Fi the advantage was not obvious, and it was sometimes worse due to the encryption overhead.
Use multi-dimensional metrics for monitoring
Watching only average latency is fooling yourself. I insist on the 95th-percentile latency and latency variance; the latter better reflects user experience — users would rather have a stable 150ms than a roller coaster between 80ms and 300ms. I also recommend WebPageTest's filmstrip view to see what users actually experience; it's far more intuitive than staring at numbers.
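The percentile math is simple enough to inline in any monitoring job. A minimal sketch using the nearest-rank percentile (function names are my own):

```python
import math

def p95(samples):
    """Nearest-rank 95th percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

def summarize(samples):
    """Mean, p95, and standard deviation -- the trio argued for above."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / n
    return {"mean": mean, "p95": p95(samples), "stddev": variance ** 0.5}

# A stable 150ms feed versus the 80/300ms roller coaster from the text:
print(summarize([150] * 20))   # p95 stays at 150, stddev 0
print(summarize([80, 300] * 10))  # p95 jumps to 300, stddev 110
```

Two feeds can share a similar average while their p95 and spread tell completely different stories — which is exactly why the average alone misleads.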
Finally, a blunt word: some CDN vendors' node counts are padded with enough water to farm fish in. A claimed 500 nodes may mean fewer than 200 actually usable, and some even pass off 1Gbps ports as 10Gbps. When evaluating, run real business traffic for a full 24 hours and cross-verify with third-party tools.
A truly high-quality high-defense CDN for social apps should be like air: users can't feel it, yet they can't do without it for a moment. Now, whenever I see a friend's video load instantly, I know there's a team behind it grinding out every millisecond of routing optimization and node scheduling. That unglamorous engineering effort is the real bedrock of user experience.

