Recently I helped an e-commerce site deal with node switching on their high-defense CDN; their business nearly collapsed under a latency spike on one node. One look at the backend told the story: they were treating node switching like flipping a light switch, hard-cutting manually, and user complaints were piling up like snowflakes.
Switching nodes on a high-defense CDN is not simply "changing roads"; it is more like changing tires while racing down the highway. You have to account for live business traffic, session persistence, DNS caching, and the back-to-origin policy; one careless step and users see a screen full of 500 errors.
Why do nodes become business bottlenecks? I found that 80% of the problems come from over-reliance on default configuration. Many teams assume that buying a high-defense CDN settles the matter, so when a node is overwhelmed by DDoS they have no fallback. And these days even your CDN can be an unreliable teammate: some providers' node monitoring is little more than decoration.
Start with the typical manual-switch scenario. When a node's response time jumps from 50ms to 2000ms, the first instinct is to cut away immediately. But if you simply shut the node down, established TCP connections are severed mid-flight, and a user's half-paid order is lost with them.
A reliable manual switch goes through a gray-scale process: give the new node a 5% traffic weight, observe 15 minutes for stability, then raise the proportion step by step. In my own experiments with the CDN5 API, forced instantaneous switching failed as often as 37% of the time, while gray-scale switching failed almost never.
The gray-scale process is straightforward to script against your CDN provider's API.
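Here is a minimal sketch of such a gray-scale switching script. The `set_node_weight` and `error_rate` functions are placeholders: CDN5's actual API endpoints, authentication, and monitoring interface are not shown in the original and are assumptions here.

```python
import time

# Weight ramp for gray-scale node switching: 5% first, then widen.
RAMP_STEPS = [5, 20, 50, 100]       # percent of traffic on the new node
OBSERVE_SECONDS = 15 * 60           # hold each step for 15 minutes

def set_node_weight(node_id: str, weight: int) -> None:
    # Placeholder: in production this would be an authenticated
    # HTTP call to the CDN provider's weight-adjustment API.
    print(f"node {node_id} -> {weight}% traffic")

def error_rate(node_id: str) -> float:
    # Placeholder: pull the node's 5xx ratio from your monitoring system.
    return 0.0

def gray_scale_switch(new_node: str, max_error_rate: float = 0.01,
                      observe_seconds: int = OBSERVE_SECONDS) -> bool:
    """Ramp traffic onto new_node step by step; abort on elevated errors."""
    for weight in RAMP_STEPS:
        set_node_weight(new_node, weight)
        time.sleep(observe_seconds)          # let the step stabilize
        if error_rate(new_node) > max_error_rate:
            set_node_weight(new_node, 0)     # roll back immediately
            return False
    return True
```

The key design point is that each weight step is a checkpoint: a bad reading rolls the new node back to 0% instead of pushing forward, which is exactly what a hard cut cannot do.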
Manual switching, though, is ultimately firefighting. What you really want is an automatic disaster-recovery system. Good automatic switching should work like autopilot: changing lanes before the danger arrives.
I compared the solutions of three providers: CDN5's intelligent routing keys off real-time attack signatures, CDN07 relies on global probe monitoring, and 08Host built its own traffic-prediction algorithm. Honestly, none of them is perfect; which one fits depends on your business profile.
Don't believe the "no configuration required" marketing. Remember a cloud vendor's auto-switching failure last year? It misjudged the traffic pattern, shunted all normal user requests to the standby node, and the standby node was promptly crushed. Good automatic switching must be paired with a circuit-breaker mechanism.
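To make the circuit-breaker idea concrete, here is a minimal sketch. The thresholds and cooldown values are illustrative assumptions, not tuned production numbers; the point is that the breaker stops feeding traffic to a distressed node instead of letting the switch itself pierce the standby.

```python
import time

class CircuitBreaker:
    """Stop shifting traffic to a backup node once it shows distress."""

    def __init__(self, failure_threshold: int = 5, cooldown: float = 300.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown      # seconds the circuit stays open
        self.failures = 0
        self.opened_at = None         # None means the circuit is closed

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()   # trip the breaker

    def record_success(self) -> None:
        self.failures = 0

    def allow_traffic(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            # Half-open: let traffic probe the node again.
            self.opened_at = None
            self.failures = 0
            return True
        return False
```

In the failed auto-switch described above, a breaker like this on the standby node would have tripped after the first few overload errors and held the remaining traffic back, rather than dumping the entire load onto it.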
There is a hard-won lesson here: always define a rollback policy for automatic switching. I once skipped the rollback conditions; when the node recovered, traffic automatically cut back and triggered a second outage. My standard practice now is to require at least 2 hours of observed stability before an automatic rollback.
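The 2-hour stabilization gate can be expressed as a simple check. This is a sketch under the assumption that your monitoring system produces `(timestamp, healthy)` samples; the function names and the 60-second coverage slack are mine, not from any provider's API.

```python
STABLE_WINDOW = 2 * 3600  # require 2 hours of continuous health

def safe_to_roll_back(health_samples, now, window=STABLE_WINDOW):
    """True only if every sample in the last `window` seconds is healthy
    and the samples actually cover the whole window.

    health_samples: list of (unix_timestamp, healthy: bool) tuples.
    """
    recent = [(t, ok) for t, ok in health_samples if now - t <= window]
    if not recent:
        return False
    # Oldest sample must sit near the start of the window, otherwise
    # we simply have not observed the node long enough yet.
    if min(t for t, _ in recent) > now - window + 60:
        return False
    return all(ok for _, ok in recent)
```

A single unhealthy sample, or insufficient history, keeps traffic on the standby node, which is exactly what prevents the "recovered node triggers a second failure" scenario.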
DNS switching is where the pits get even deeper. Setting the TTL is an art: some teams drop it to 60 seconds across the board, DNS queries skyrocket, and the authoritative server gets dragged down. I recommend a 300-second TTL during business peaks, lowered to 60 seconds for the night maintenance window. And don't forget local resolver behavior: DNS cache refresh in some regions can be slow enough to make you question your life.
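The TTL schedule above can be sketched as a tiny helper. The specific maintenance hours are an example assumption; only the 300s/60s split comes from the text.

```python
PEAK_TTL = 300        # seconds: conservative TTL for business hours
MAINTENANCE_TTL = 60  # seconds: short TTL so planned switches propagate fast

def ttl_for(hour: int) -> int:
    """Return the DNS TTL to publish for a given local hour (0-23)."""
    in_maintenance = 2 <= hour < 5   # e.g. 02:00-05:00 local time, an assumption
    return MAINTENANCE_TTL if in_maintenance else PEAK_TTL
```

The trade-off it encodes: a long TTL shields the authoritative server from query load when you are not switching, while the short TTL is only paid for during the window when fast propagation actually matters.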
Recently, while designing a dual-active solution for a financial client, I concluded that Anycast-based automatic switchover is the endgame. The cost hurts, but the switch is genuinely imperceptible to users. Against DDoS in particular, Anycast spreads the attack traffic across global nodes, which is far more robust than defending a single point.
Finally, a concrete suggestion: run a failure drill every month. Pull the power cord on the primary node and see whether automatic switching really holds. I have seen too many teams configure things once and walk away, only to discover during a real incident that monitoring and alerting were never set up correctly.
Node switching is, at bottom, risk balancing. Switch too fast and you may trigger a chain reaction; too slow and the blast radius of the failure grows. After stepping in enough pits, my principles are: during business peaks, prioritize stability and use gray-scale switching; during night maintenance, automatic disaster recovery is fair game; before any major event, run a full-link load test.
Go check your CDN configuration now and see whether your node-switching strategy is still stuck in the stone age. The best switch is always the one users never feel.