How Social Networking Sites Use High-Defense CDNs to Block Scraping Traffic and Malicious Requests with Behavioral Identification and IP Restriction

Being woken by an alarm at 3:00 a.m. is an experience anyone in the security community understands. The traffic curve on the dashboard suddenly spikes like an electrocardiogram in ventricular fibrillation: either users are surging, or an attacker has arrived. The last time a social platform suffered a database avalanche caused by malicious bulk posting, I stayed up three nights with their team restoring data. That incident made me fully understand: against application-layer attacks, a traditional firewall is nothing but a paper wall.

Malicious traffic today has long gone beyond crude DDoS. These operators are sophisticated: they rotate IPs through proxy pools, simulate the rhythm of a real person swiping the screen, and even mimic a normal user's click intervals. Last year we caught a crawler gang whose request headers, browser fingerprints, and screen resolutions were all forged to look exactly like real users. Rely on IP frequency limits alone? Don't be naive: with tens of thousands of rotating dynamic IPs, you can never block them all.

I've seen too many companies blindly pile up hardware firewalls. They buy a million-dollar appliance, drop it in the server room, and think they can rest easy. Then the attackers switch to slow, distributed attacks, each IP making two or three requests per minute and neatly bypassing the threshold rules. Worse, these requests are mixed in with normal traffic, draining server resources like a slow-growing cancer. By the time you spot the anomaly in the logs, the database connection pool has long been exhausted.

Behavior recognition is the key to breaking this stalemate. There is a fundamental difference between how real users and bots operate: real people browse with random pauses, mouse trajectories that curve, and scrolling speeds that vary. A bot's behavior is like a line drawn with a ruler: too perfect, and that is exactly what exposes it. Last year, while building protection for a dating platform, we caught more than 3,000 bulk-registered accounts faking real operations by analyzing mouse-movement acceleration curves.
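The mouse-trace idea can be sketched in a few lines. This is a hypothetical illustration, not the platform's actual detector: it measures the variance of movement speed across a trace, on the assumption that scripted movement is near-constant while human movement is jittery. The threshold value is an assumption for demonstration.

```python
# Hypothetical sketch: flag mouse traces whose speed barely varies.
# Humans pause and accelerate; scripted moves are ruler-straight and
# constant-speed. Threshold is illustrative, not a production value.

def speed_variance(points):
    """points: list of (x, y, t_ms) samples from a recorded mouse trace."""
    speeds = []
    for (x1, y1, t1), (x2, y2, t2) in zip(points, points[1:]):
        dt = max(t2 - t1, 1)  # avoid division by zero on duplicate timestamps
        dist = ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
        speeds.append(dist / dt)
    if not speeds:
        return 0.0
    mean = sum(speeds) / len(speeds)
    return sum((s - mean) ** 2 for s in speeds) / len(speeds)

def looks_scripted(points, threshold=0.01):
    # Near-zero variance means constant-speed, machine-like movement
    return speed_variance(points) < threshold
```

A production system would look at acceleration curves, curvature, and pause distribution together, but even this single feature separates naive automation from real hands.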

This behavioral fingerprinting has to run on CDN edge nodes. Never wait for traffic to reach the origin server before processing it; by then the bandwidth cost has already exploded. A good CDN should be like a seasoned security inspector who can assess risk from behavioral characteristics right at the entrance. In my tests, CDN5's intelligent protection module completed the behavioral analysis of a single request in under 5 milliseconds, with a false-positive rate held below 0.01%.

Let's look at an actual configuration example. The following rule engine scores request trustworthiness across 12 dimensions simultaneously:
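The article's original configuration snippet did not survive, so here is a hedged sketch of what such a weighted trust-scoring rule engine might look like. The dimension names, weights, and thresholds below are illustrative assumptions (only five of the twelve dimensions are shown), not the author's actual rules.

```python
# Hypothetical sketch of a weighted trust-score rule engine.
# Each dimension maps a request context to [0, 1]; the weighted sum
# becomes the request's trust score. All names/weights are assumptions.

TRUST_RULES = {
    # dimension: (weight, scoring function over the request context dict)
    "ua_consistency":   (0.15, lambda req: 1.0 if req.get("ua_matches_fingerprint") else 0.0),
    "request_interval": (0.20, lambda req: min(req.get("interval_ms", 0) / 500, 1.0)),
    "mouse_entropy":    (0.25, lambda req: min(req.get("mouse_entropy", 0.0), 1.0)),
    "ip_reputation":    (0.25, lambda req: req.get("ip_score", 0) / 100),
    "tls_fingerprint":  (0.15, lambda req: 1.0 if req.get("ja3_known_browser") else 0.0),
}

def trust_score(req):
    """Weighted sum in [0, 1]; e.g. challenge below ~0.4, block below ~0.2."""
    return sum(w * fn(req) for w, fn in TRUST_RULES.values())
```

The point is that no single dimension decides anything; a bot that forges its User-Agent still bleeds trust on interval regularity and mouse entropy.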

Behavioral analysis alone is not enough; it has to be paired with intelligent IP control. But don't blindly block entire IP segments! Last year an e-commerce company mistakenly blocked a university's entire Class B address block, shutting out the very audience of its promotional campaign. The mature practice today is an IP credit-scoring system, something like building a health profile for every IP.

CDN07's IP credit system impressed me. They score IPs on dimensions such as historical behavior, ASN reputation, and geographic anomalies, and even detect whether the client is running in a virtualized environment. I've seen a case where an IP triggered a high-risk alert on its very first request; it turned out to be a freshly provisioned Amazon AWS EC2 instance. What normal user browses a social networking site from a cloud host?
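The credit-profile idea can be sketched roughly as follows. This is not CDN07's implementation; the ASN list, point deltas, and thresholds are all illustrative assumptions. The one real detail carried over from the text is that datacenter origins start with a lower score.

```python
# Hypothetical sketch of an IP credit profile: every IP starts with a
# baseline score, loses points for suspicious events, gains a little for
# normal ones. All numbers here are illustrative assumptions.

DATACENTER_ASNS = {16509, 14618, 8075}  # example cloud ASNs (illustrative)

class IPCredit:
    def __init__(self, asn):
        # Cloud/datacenter IPs start on a short leash, per the EC2 anecdote
        self.score = 20 if asn in DATACENTER_ASNS else 50

    def record(self, event):
        deltas = {"normal_request": +1, "failed_captcha": -15,
                  "rate_spike": -10, "geo_jump": -20}
        self.score = max(0, min(100, self.score + deltas.get(event, 0)))

    def risk_level(self):
        if self.score < 20:
            return "block"
        if self.score < 50:
            return "challenge"
        return "allow"
```

Note the graded response: a low score earns a challenge, not an instant ban, which is exactly what saves you from the blocked-university scenario above.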

In practice, it is best to use a challenge mechanism instead of banning outright. For example, return a 418 status code (I'm a teapot) for suspicious requests: normal users are unaffected, but a bot is forced to reveal itself. This status code really does exist in an RFC (RFC 2324, originally an April Fools' joke) and works nicely for teasing automated attack programs. I deployed this logic in my 08Host protection system:
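The deployed snippet did not survive, so here is a hedged sketch of the idea: clear requests pass, clear attacks get 403, and the gray zone gets a 418 carrying a challenge the client must solve and echo back. The function names, thresholds, and challenge format are assumptions for illustration.

```python
# Hypothetical sketch of "challenge instead of ban": suspicious requests
# get HTTP 418 plus a nonce; a real browser solves it via injected JS,
# a dumb bot stalls. Thresholds and formats are illustrative assumptions.

import hashlib
import secrets

def handle_request(trust, session):
    if trust >= 0.6:
        return 200, "OK"
    if trust < 0.2:
        return 403, "Forbidden"
    # Gray zone: issue a challenge instead of deciding now
    nonce = secrets.token_hex(8)
    session["challenge"] = nonce
    return 418, f"solve:{nonce}"

def verify_challenge(session, answer):
    # Client must return sha256(nonce); trivial for a JS-capable browser
    expected = hashlib.sha256(session.get("challenge", "").encode()).hexdigest()
    return answer == expected
```

The asymmetry is the point: solving the challenge costs a real browser nothing, but forces headless scripts to either implement a JS runtime or give themselves away by failing.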

Never fetishize a single means of protection. Last year's incident where a video platform was bypassed is the lesson: the attacker used a real browser kernel driven by automation scripts to simulate human behavior almost perfectly. It was only stopped by multiple layers of validation: first checking WebGL fingerprint consistency, then verifying clock-drift anomalies, and finally applying non-intrusive mouse-track verification. With those three hurdles in place, simulating a browser costs more than hiring a real person to click.
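The three hurdles chain naturally into a pipeline: a request must clear every layer, so defeating any one check is not enough. This is an illustrative skeleton; the actual detection logic behind each predicate is stubbed with assumed context fields, not the platform's real checks.

```python
# Hypothetical sketch of layered validation: a request must pass every
# check in order. The predicates mirror the article's three hurdles but
# their internals are stubbed; context field names are assumptions.

def check_webgl(ctx):
    # WebGL renderer string should be consistent with the claimed browser/OS
    return ctx.get("webgl_renderer") in ctx.get("expected_renderers", set())

def check_clock_drift(ctx):
    # Large client-vs-server clock drift often betrays a sandboxed VM
    return abs(ctx.get("clock_drift_ms", 0)) < 2000

def check_mouse_track(ctx):
    # Non-intrusive: require human-like entropy in the recorded trace
    return ctx.get("mouse_entropy", 0.0) > 0.3

LAYERS = [check_webgl, check_clock_drift, check_mouse_track]

def passes_all(ctx):
    return all(layer(ctx) for layer in LAYERS)
```

Each added layer multiplies the attacker's cost, which is the whole economic argument of the paragraph above.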

The data comparison is the most telling. Here are the key changes since we migrated to a multi-layered protection architecture last year:

The malicious-request block rate rose from 67% to 99.2%, false positives dropped from 3.1% to 0.05%, and bandwidth costs fell by 41% (since malicious traffic was dropped right at the edge). Most noticeable are the alert messages in the ops group: down from hundreds per day to single digits per week. The on-call engineers can finally sleep through the night.

Some clients always ask me whether they should build their own protection system. Unless you have a professional security team and globally distributed nodes, don't bother. I've seen startups maintain their own rule base, fail to keep it current, go half a year without an update, and get wiped out by a wave of new attack types. The threat-intelligence networks of professional CDN vendors are cross-platform: CDN5, for example, handles trillions of requests every day and sees far more comprehensive attack patterns than any single company could.

Finally, a lesson learned through tears: always do gray-scale testing before going live! We once enabled protection rules at full volume without testing, and a particular way of loading a CSS file was misjudged as a malicious request, breaking the styling across the entire site. Our best practice now: run a 24-hour trial on 10% of traffic, analyze the false-alarm logs and adjust the rules, then gradually ramp up the traffic share over the next 48 hours.
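The rollout described above can be sketched with deterministic bucketing: hash each client ID so the same user stays consistently in or out of the new rule set for the whole trial, and ramp the percentage on a schedule. This is an illustrative sketch; the hash choice and schedule values are assumptions.

```python
# Hypothetical sketch of gray-scale rollout: deterministic hashing keeps
# each client in a stable bucket, so a user misjudged by the new rules
# is misjudged consistently and shows up clearly in the logs.

import hashlib

def in_rollout(client_id, percent):
    """Deterministically place `percent`% of clients under the new rules."""
    h = int(hashlib.md5(client_id.encode()).hexdigest(), 16)
    return (h % 100) < percent

# Example ramp after the 24h trial at 10%: widen over the next 48 hours
RAMP_SCHEDULE = [10, 30, 60, 100]
```

Random per-request sampling would hide exactly the bug described above: the broken CSS load would fail only sometimes, and the false-alarm logs would never line up with a reproducible set of users.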

Security is essentially a game of offense and defense. Rules that work today may be bypassed tomorrow, so never set up a rule and forget it. I read the threat-intelligence report every week and audit the protection rules every month. These days even the CDN has to "defend against its own teammates": when a business unit suddenly launches a promotion, the abrupt change in traffic patterns can trip the security rules too.

A real high-defense CDN should be like an experienced bodyguard: it accurately recognizes VIP customers (normal users), sees through the assassins hidden in the crowd (malicious requests), and flexibly adjusts its screening strategy (dynamic rules). Remember that every technical measure exists to serve the business. Don't chase absolute security until the user experience feels like airport screening: shoes off, belt unbuckled, laptop out. Normal users will walk away sooner or later.

(At the clients' request, CDN5, CDN07, and 08Host appear here for technical demonstration only; please choose a service provider appropriate to your actual business deployment.)
