How Does a Social Platform High-Defense CDN Stop Crawlers? Content Encryption and Crawler Identification to Prevent Data Scraping

Recently, several friends who run social platforms complained to me that their user feeds and content were being scraped bare, with crawlers hammering their servers every day. One of them even joked bitterly: "Reading the logs gives me PTSD now. The moment I see 'Python' in a User-Agent, I want to unplug the network cable."

The barrier to entry for crawlers keeps getting lower. Anyone who can write a few lines of Python dares to go after your data, never mind the professional data-trafficking crews who casually spin up hundreds of cloud-hosted IPs and bombard you in rotation. To those people, a traditional firewall is paper-thin. Relying on an IP blacklist to stop crawlers? You might as well hope to win the lottery.

I've built anti-crawling systems for quite a few social platforms over the years, and testing showed that pure rule matching simply can't handle modern crawlers. Advanced crawlers now simulate human behavior curves, even generating mouse trajectories with a normal distribution. Hoping to tell a crawler apart by request frequency alone? Don't be naive.

A truly effective anti-crawling strategy for a social high-defense CDN needs three layers of defense: content encryption so crawlers can't extract usable data, behavioral identification to accurately distinguish humans from machines, and dynamic defense mechanisms that raise crawler costs to the point of despair. Below, I'll walk through how to implement each, based on hands-on experience.

Let's start with the most damaging problem: content leakage. Many platforms assume HTTPS makes them safe, not realizing that a crawler is just another TLS client that receives your content in the clear on its end. The most extreme case I've seen was a social platform whose API returned JSON that crawlers parsed in bulk; the entire user relationship graph was scraped clean, and the person in charge only noticed something was wrong when competitors started poaching people with uncanny precision.

Don't believe anyone who tells you "Token verification is enough." Token extraction has long been standard practice in crawler circles: people decompile and unpack your app, and pulling out the key is as easy as grocery shopping. The more ruthless ones hook the phone's OS directly and fish the token straight out of runtime memory.

The reliable approach is dynamic content encryption. We can obfuscate data at the CDN edge nodes, for example by randomizing the encoding of key data fields so that each request returns different field names. The user ID field might be called "uid" this time and "z3df9" the next, so the crawler can't establish fixed parsing rules.

I've tested this solution on CDN5; their edge computing nodes support custom JavaScript processing logic that can dynamically scramble the JSON structure before the data leaves the edge:

Data like this is a blind box for crawlers: the parsing rules have to be guessed anew on every fetch, which massively raises the cost of data cleaning. I deliberately leaked a test interface built this way, left it up for a week, and nobody scraped it successfully; instead, crawler crews complained on forums that "this damn interface changes its structure every day."

But encryption alone isn't enough; some advanced crawlers simply execute the JS and recover the data. So it has to be paired with behavioral characteristics recognition. This is where CDN07 has put in real work: their bio-behavioral engine captures more than 200 dimensions of human operating characteristics.

What impressed me most is their Bezier-curve matching on mouse movement. A real person's mouse has a natural acceleration curve and small jitter, while a crawler's simulated trajectory is either too perfect or too random. They also check the statistical distribution of page dwell time: real browsing times follow a power-law distribution, while a crawler's visit intervals tend to be fixed-period or Poisson-distributed.

This is the kind of ruleset we configured on 08Host:
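08Host's actual rule syntax isn't shown here, so here is an illustrative JavaScript sketch of the scoring logic such a ruleset expresses; the rule names, thresholds, and scores are my assumptions, not 08Host configuration.

```javascript
// Each rule flags one suspicious behavioral trait and contributes a score.
const rules = [
  {
    name: "mouse-too-linear",
    // Real movement has jitter; near-zero deviation from a straight line is suspicious.
    test: (s) => s.mousePathDeviation < 0.01,
    score: 40,
  },
  {
    name: "fixed-interval-requests",
    // Humans don't fire requests at a metronome-steady interval.
    test: (s) => s.requestIntervalStdDevMs < 20,
    score: 30,
  },
  {
    name: "no-dwell-time",
    // Sub-second average page dwell across a session is not human browsing.
    test: (s) => s.avgPageDwellMs < 500,
    score: 30,
  },
];

// Sum the scores of matched rules; above the threshold, challenge the session.
function classify(session, threshold = 50) {
  const score = rules
    .filter((r) => r.test(session))
    .reduce((sum, r) => sum + r.score, 0);
  return { score, action: score >= threshold ? "challenge" : "allow" };
}

const bot = { mousePathDeviation: 0.001, requestIntervalStdDevMs: 2, avgPageDwellMs: 120 };
const human = { mousePathDeviation: 0.2, requestIntervalStdDevMs: 900, avgPageDwellMs: 8000 };
console.log(classify(bot));   // high score -> challenge
console.log(classify(human)); // low score -> allow
```

Scoring instead of hard-blocking on any single rule is the key design choice: one odd signal from a real user doesn't cross the threshold, but a crawler typically trips several rules at once.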

Don't underestimate these details; I've caught several "high-end" crawlers with exactly these rules. One crew masqueraded as Googlebot, ran full headless Chrome with a plausible Referer on every request, and still got caught because their mouse trajectories were too linear. What real person moves the mouse in a perfectly straight line every single time?

When it comes to IP blocking, many people's first reaction is to blacklist whole IP ranges. But crawlers now ride on cloud provider IPs: block Alibaba Cloud today and they switch to Tencent Cloud tomorrow. Can you block them all? Even nastier are the residential proxy networks, where every IP is real home broadband; block one and you may take out a genuine user by mistake.

What I recommend now is a dynamic challenge mechanism. Instead of blocking suspicious traffic outright, you interpose randomized validation challenges. A normal user gets a minimal CAPTCHA (say, tapping an object in a picture), while a suspected crawler's session hits an enhanced challenge:

Don't underestimate this computational challenge: it's a piece of cake for a real user's browser, but a nightmare for a distributed crawler. Coordinating hundreds of nodes to solve puzzles synchronously blows their latency up. After I deployed this solution on CDN5, crawler traffic dropped by 82% and CPU load fell from the alert line back into the normal range.

Some crawlers are now using AI to crack CAPTCHAs, so it's best to update the challenge question bank regularly. I usually keep dozens of challenge types in random rotation, from arithmetic to graphical logic puzzles, so the crawler crews are forever playing catch-up.

Finally, the details of API protection. Many social platform APIs are designed too predictably: if the user information endpoint must be /api/user/{id}, a crawler just writes a loop and batch-scrapes it. I recommend designing API paths counter-intuitively, for example hiding the version number in a header and randomizing the interface path:

Incidentally, CDN07's API protection is more fine-grained: you can set a rate limit per interface. For example, the profile page interface allows at most 60 requests per minute, while a sensitive interface like the friend list is capped at 5 per minute; beyond that, responses are automatically downgraded to return fake data.
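As a sketch of what per-interface limits plus a fake-data downgrade could look like (the paths, sliding-window logic, and fake body here are my own illustration, not CDN07 configuration):

```javascript
// Per-interface limits matching the numbers mentioned above.
const limits = {
  "/api/profile": { perMinute: 60 },
  "/api/friends": { perMinute: 5 },
};

const windows = new Map(); // clientId|path -> timestamps of recent requests

function handle(clientId, path, now = Date.now()) {
  const limit = limits[path];
  if (!limit) return { status: 404 };
  const key = clientId + "|" + path;
  // Sliding one-minute window: keep only timestamps from the last 60s.
  const recent = (windows.get(key) || []).filter((t) => now - t < 60_000);
  recent.push(now);
  windows.set(key, recent);
  if (recent.length > limit.perMinute) {
    // Downgrade: a normal 200 with plausible but empty fake data,
    // so the crawler can't even tell it has been rate-limited.
    return { status: 200, degraded: true, body: { friends: [] } };
  }
  return { status: 200, degraded: false };
}

// Demo: the 6th friend-list request inside one minute gets degraded.
let res;
for (let i = 1; i <= 6; i++) res = handle("demo-client", "/api/friends", i * 1000);
console.log(res.degraded); // true
```

Returning a clean 200 with fake data is deliberately meaner than a 429: the crawler keeps burning resources collecting garbage instead of backing off and rotating IPs.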

The response format deserves some tricks too. Don't always return one complete JSON body: you can use chunked transfer encoding to split the key data across multiple packets and insert junk packets in between to interfere with crawler parsing. Real users notice nothing, while a crawler's parser gets thoroughly lost.

What's the worst pitfall in crawler protection? Killing real users by mistake. That's why I strongly recommend giving every rule a learning-mode period: observe without intervening, record traffic patterns, and establish a baseline first. CDN5's intelligent mode works well here: it learns normal traffic for a week, then auto-generates rule thresholds, much more accurate than manual configuration.

A final honest suggestion: don't expect one product to cover every scenario. I generally use CDN5 for the first layer of traffic cleaning, 08Host for behavioral analysis, and CDN07 specifically to protect API interfaces. The combination costs a few hundred dollars a month, which is far cheaper than the losses from scraped data.

Anti-crawling is essentially a cost game: the goal is to make the cost of scraping far exceed the value of the data. These days, scraping one batch of data from my client's platform costs a crawler crew a dozen-plus high-spec servers plus tens of thousands in monthly proxy IP fees, while we rely on intelligent scheduling and dynamic encryption to make their data mining burn more money than mining Bitcoin.

Recently I've noticed the crawler scene escalating from within: some teams have started using reinforcement learning to simulate human behavior. But as the saying goes, as virtue rises one foot, vice rises ten; on our side we're building deep learning models to detect anomalous patterns. This arms race will go on for a long time, but one thing is certain: platforms that think hanging up a WAF will stop crawlers will sooner or later become the worst-hit victims of data leakage.
