At three in the morning, my phone started buzzing like crazy. I grabbed it half-asleep and found more than twenty monitoring alerts stacked up: the new version of the game had just gone live, the promotion had worked far beyond expectations, and the burst of traffic had saturated our bandwidth. Word from the server room: core switch port utilization had spiked to 98%, players were dropping en masse, and the forums and customer service lines had instantly exploded. I flipped open my laptop to do an emergency capacity expansion, cursing under my breath: in this business you don't worry that nobody will play your game, you worry that too many people will play it at once, and the servers will teach you humility within minutes.
This traffic roller coaster is all too common in the gaming industry. A version update, a popular event, a streamer promotion, or a sudden viral moment on social media can multiply traffic several-fold, sometimes tens of times, in an instant. The traditional fixed-bandwidth defense is like building a fixed-width toll gate on a highway: fine on ordinary days, but on holidays it turns into a parking lot. Worse, many DDoS attacks deliberately pick exactly these business peaks to blend into legitimate traffic, adding insult to injury.
Why do so many gaming companies fall prey to traffic spikes? The root cause is rigid resource planning with no elasticity. To control costs, many teams buy bandwidth sized to average traffic and set defense thresholds for routine attack volumes. When unexpected traffic actually hits, the manual expansion process is painfully slow: get approval, contact the vendor to adjust, run tests... by the time everything is done, the players are long gone. I learned this the hard way. Back in my browser-game days, an in-game event spiked traffic on one server; it took nearly 20 minutes from detection to completed expansion, and the post-event numbers showed we had lost a full 37% of potential paying users.
Elastic bandwidth auto-scaling is the right answer. Put bluntly, it equips the CDN with intelligent traffic sensing and automatic scaling: under normal load you run on baseline bandwidth and save money; once the system detects a burst or an attack, it expands capacity on demand, then shrinks back automatically after the peak passes. The key word is "automatic": no human intervention, with response times measured in seconds. In my own tests, the real delay from the anomaly crossing the trigger threshold to the expansion taking effect can be held under 10 seconds, which for the player experience is the difference between life and death.
But real elastic scaling takes more than the concept; it has to land on concrete technical choices. Three links matter most: accurate traffic-monitoring algorithms, a seamless bandwidth-scheduling mechanism, and a sane billing model. On the monitoring side, you cannot watch overall bandwidth usage alone; you must combine multiple dimensions such as requests per second, concurrent connections, and access frequency to specific URLs, or you will misjudge. A patch-package download, for instance, can push bandwidth sky-high with relatively few requests, while a CC attack may use little bandwidth but explode the request count. A good system has to tell business spikes apart from attack traffic, which calls for machine-learning-based behavioral analysis.
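To make the multi-signal idea concrete, here is a minimal triage sketch in Python. The 2x multipliers and the label names are illustrative assumptions, not tuned values; a production system would replace the fixed thresholds with a learned model:

```python
def classify_anomaly(mbps, rps, baseline_mbps, baseline_rps):
    """Rough multi-signal triage of a traffic anomaly.

    - high bandwidth + normal request rate  -> likely a patch-download surge
    - normal bandwidth + exploding requests -> likely a CC-style attack
    - both elevated                         -> likely a genuine business spike
    """
    bw_high = mbps > 2 * baseline_mbps    # illustrative 2x threshold
    req_high = rps > 2 * baseline_rps
    if bw_high and not req_high:
        return "download-surge"
    if req_high and not bw_high:
        return "cc-attack-suspect"
    if bw_high and req_high:
        return "business-spike"
    return "normal"
```

The point is the shape of the decision, not the numbers: a single bandwidth gauge cannot separate these three cases, but two signals already can.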
Scheduling mechanisms vary widely between CDN providers. CDN5 builds on a BGP anycast network for global load balancing, automatically steering traffic into a redundant node pool during expansion; its bandwidth resources are pooled and shared, so elasticity is strong, but the price runs high. CDN07 pre-provisions standby bandwidth channels and switches routes quickly when a threshold is triggered; it is fast, but capacity is capped. 08Host's scheme is the most unusual: an edge-node elastic-collaboration mode that temporarily reallocates bandwidth between nodes over a P2P protocol, low-cost but complex. Which one to use depends on your business profile and budget.
Ask about billing up front! Some vendors talk up elastic expansion but actually bill at peak: you pay for the month's highest bandwidth peak, and one attack can brush out an astronomical bill that will scare you into a heart attack. Don't trust vague "pay-as-you-go" marketing; write into the contract whether it is 95th-percentile billing or monthly-peak billing. I strongly recommend the 95th-percentile model (rank the month's bandwidth samples from high to low, discard the top 5%, and bill at the highest remaining value): you keep the elasticity without letting burst traffic bankrupt you. In our measurements it saved over 30% of bandwidth costs across a year.
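The 95th-percentile rule is easy to verify for yourself. A minimal Python sketch, assuming the common industry granularity of one sample every 5 minutes:

```python
def billable_95th(samples_mbps):
    """95th-percentile billing: sort the month's bandwidth samples from
    high to low, discard the top 5%, and bill at the highest remaining value."""
    if not samples_mbps:
        return 0
    ordered = sorted(samples_mbps, reverse=True)
    discard = int(len(ordered) * 0.05)  # the top 5% of samples are "free"
    return ordered[discard] if discard < len(ordered) else ordered[-1]

# A 30-day month of 5-minute samples is 12 * 24 * 30 = 8640 points, so
# roughly 36 hours of your highest peaks each month go unbilled.
```

That 36-hour allowance is exactly why this model pairs so well with elastic scaling: short bursts, whether a launch spike or an absorbed attack, fall into the discarded 5%.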
The actual configuration is not complicated if you get the key steps right. First, threshold setting: too low and you expand by mistake and burn money; too high and the protection is meaningless. A common recommendation is a baseline threshold at 1.5 times your daily peak and an emergency threshold at 2.5 times. Second, expansion granularity: scale up stepwise rather than jumping straight to the ceiling. For example, expand by 50 Gbps first, watch for 5 minutes, and continue only if traffic keeps climbing, to avoid wasting resources. Finally, set scale-down conditions: typically, start automatic scale-down only after traffic has fallen back below the threshold and stayed stable for 15-30 minutes.
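The threshold and step logic above can be sketched as one small decision function. The numbers mirror the recommendations here; the function name and the choice of a double step at the emergency threshold are my own illustrative assumptions:

```python
def plan_capacity(current_gbps, traffic_gbps, daily_peak_gbps,
                  step_gbps=50, stable_minutes=0):
    """One scheduling tick of a stepwise scaling policy.
    Returns the target capacity in Gbps."""
    baseline = 1.5 * daily_peak_gbps    # start stepping up here
    emergency = 2.5 * daily_peak_gbps   # take a bigger step here
    if traffic_gbps >= emergency:
        return current_gbps + 2 * step_gbps
    if traffic_gbps >= baseline:
        return current_gbps + step_gbps
    # scale down only after traffic has stayed below threshold long enough,
    # and never below the capacity needed for the normal daily peak
    if stable_minutes >= 15 and current_gbps - step_gbps >= daily_peak_gbps:
        return current_gbps - step_gbps
    return current_gbps
```

Run on a timer (say every 5 minutes, matching the observation window above), this gives the "expand in steps, shrink cautiously" behavior without ever slamming capacity to the ceiling in one move.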
As a concrete landing, a traffic self-sensing setup can be built on Nginx + Lua.
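Here is a minimal sketch using OpenResty (Nginx with the lua-nginx-module): a shared-memory counter tracks the request rate per 10-second window, and a worker timer flags when it crosses a threshold. The threshold value, dict size, and upstream name are placeholders to adapt to your own baseline:

```nginx
http {
    lua_shared_dict traffic_stats 10m;   # shared counter across workers

    init_worker_by_lua_block {
        local RPS_THRESHOLD = 5000       -- placeholder; tune to ~1.5x daily peak
        ngx.timer.every(10, function()
            local stats = ngx.shared.traffic_stats
            local reqs = stats:get("reqs") or 0
            stats:set("reqs", 0)         -- reset the 10-second window
            local rps = reqs / 10
            if rps > RPS_THRESHOLD then
                -- in production, call your CDN vendor's scale-up API here
                ngx.log(ngx.ERR, "traffic spike: ", rps,
                        " req/s, triggering scale-up")
            end
        end)
    }

    server {
        listen 80;
        location / {
            log_by_lua_block {
                -- incr(key, delta, init): atomic bump of the window counter
                ngx.shared.traffic_stats:incr("reqs", 1, 0)
            }
            proxy_pass http://game_backend;   # your upstream pool
        }
    }
}
```

Counting in `log_by_lua_block` keeps the hot path cheap; the heavier decision logic runs off the request path in the timer.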
Don't forget to integrate with your CDN vendor's API. Mainstream vendors now expose rich open interfaces; 08Host's expansion API, for example, can be triggered directly with curl.
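A sketch of what that call can look like. The endpoint path and JSON fields below are assumptions for illustration (check 08Host's actual API reference for the real names); the script prints the request so it can be reviewed first, then piped to `sh` to fire for real:

```shell
#!/bin/sh
# Build a hypothetical vendor scale-up request and print it for review.
API_TOKEN="${API_TOKEN:-replace-me}"
ENDPOINT="https://api.08host.example/v1/bandwidth/scale"
PAYLOAD='{"domain":"game.example.com","target_gbps":50,"mode":"stepwise"}'

cat <<EOF
curl -X POST "$ENDPOINT" \\
  -H "Authorization: Bearer $API_TOKEN" \\
  -H "Content-Type: application/json" \\
  -d '$PAYLOAD'
EOF
```

Whatever the vendor's real schema, the pattern is the same: an authenticated POST with the domain and target bandwidth, wired into the monitoring trigger so no human has to be awake at 3 a.m.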
For real deployments I suggest a tiered strategy: use 08Host as the everyday baseline protection, since it is cost-effective; enable CDN5's elastic protection in advance of large planned events; and trigger CDN07's emergency protection mode only when a genuinely large-scale attack hits. Used in combination, this strikes the best balance between cost and protection. Last year our game ran a global e-sports event, and this setup carried the opening-moment influx of 2.3 million simultaneous players without a single stutter.
One more small trick: layer a manual emergency button on top of the automatic scaling. We built a one-click expansion function into the ops backend, made the button extra large and red, and labeled it "Press with caution!" -- who has time to hunt for a hidden menu in a real emergency? I've pressed that button only twice in three years, but it saved my life both times.
A final word of caution: elastic scaling is not a panacea. It solves the bandwidth-resource problem, but if the application itself has performance bottlenecks -- an undersized database connection pool, poor cache design, inefficient code -- no amount of bandwidth will save you. It's like widening a highway to 100 lanes while the exit is still a single narrow lane: the jam stays. So before adopting an elastic CDN, optimize the application itself first, or you're just burning money.
Reality is cruel: players never give you a second chance. One disconnection and you may lose a user forever. Our system now does fully automatic elastic expansion, and we can finally sleep through the night. After all, making technology serve people, instead of people firefighting every day, is the highest state of the craft.

