How a High-Defense CDN Blocks Malicious Crawlers Through UA Identification and Frequency Limiting

I recently helped a few customers deal with crawler attacks, and found that many people assume that once a high-defense CDN is in place everything is fine, only to get scraped into despair. One e-commerce site even had its price database crawled; a competitor fed the data straight into dynamic pricing, and the boss nearly offered up the whole technical team as a sacrifice.

A high-defense CDN can indeed absorb DDoS traffic, but against crawlers it has to be used with skill. My testing showed that plain per-IP rate limiting can't stop advanced crawlers: attackers simply rotate a proxy pool to break your rules. What actually works is the one-two punch of UA identification combined with frequency limiting.

UA identification is not just keyword matching

Many teams write UA rules that only block requests openly declaring "Python" or "curl". But today's advanced crawlers all fake the UA, for example by impersonating a mainstream browser:
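The original example isn't shown here, so here is a hypothetical Python sketch of what such a spoofed request looks like. The UA string is a typical Chrome-on-Windows value; the target URL is illustrative.

```python
import urllib.request

# A crawler impersonating desktop Chrome: from the headers alone,
# this request is indistinguishable from a real browser's.
SPOOFED_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

def build_request(url: str) -> urllib.request.Request:
    """Build a request whose headers mimic a real Chrome browser."""
    return urllib.request.Request(url, headers=SPOOFED_HEADERS)
```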

Can you tell whether that's a real person or a crawler just by looking at it? In tests I ran on CDN5 last year, 83% of crawler requests spoofed the UA. So the key has to be the behavioral profile: a real person using Chrome's UA doesn't hit an API endpoint 10 times per second.

Frequency limits need a dynamic strategy

A rigid "5 requests per second" rule will hurt normal users, especially during promotions, when real user traffic also spikes. Here is the dynamic frequency algorithm I've run in practice on 08Host nodes:
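The original configuration isn't reproduced here, so this is a minimal Python sketch of the idea: the per-IP threshold scales with the rolling site-wide request rate, so limits loosen automatically during promotions. The baseline and multiplier values are assumptions for illustration, not 08Host's actual parameters.

```python
import time
from collections import defaultdict, deque

class DynamicRateLimiter:
    """Per-IP limit that scales with overall site traffic.

    base_limit: requests/sec allowed per IP at normal traffic.
    When global traffic rises above the assumed baseline (e.g. a
    promotion doubles it), per-IP headroom grows proportionally,
    capped at 4x so a botnet can't inflate its own allowance forever.
    """

    def __init__(self, base_limit=5, window=1.0, baseline_rps=100.0):
        self.base_limit = base_limit
        self.window = window
        self.baseline_rps = baseline_rps
        self.global_hits = deque()          # timestamps of all requests
        self.per_ip = defaultdict(deque)    # ip -> recent timestamps

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        cutoff = now - self.window

        # Track site-wide request rate over the sliding window.
        self.global_hits.append(now)
        while self.global_hits and self.global_hits[0] < cutoff:
            self.global_hits.popleft()
        global_rps = len(self.global_hits) / self.window

        # Scale factor: 1.0 at baseline traffic, capped at 4x.
        factor = min(4.0, max(1.0, global_rps / self.baseline_rps))
        limit = self.base_limit * factor

        hits = self.per_ip[ip]
        while hits and hits[0] < cutoff:
            hits.popleft()
        if len(hits) >= limit:
            return False                    # over the dynamic limit: block
        hits.append(now)
        return True
```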

This configuration helped a news site cut false positives by 92% while raising the crawler interception rate by 37%.

In the real world, the UA fingerprint library has to be continuously updated!

Crawler frameworks have learned to rotate UAs, but each framework still leaves a fingerprint. Puppeteer, for example, defaults to a UA containing "HeadlessChrome". Advanced crawlers now deliberately strip that marker, but they can still be detected through their JavaScript execution characteristics.

The detection rule base I maintain on the CDN07 platform contains 1,700+ fingerprints. A few recent additions include:
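The actual rule entries aren't reproduced here; below is a hypothetical Python sketch of what such UA-fingerprint rules look like. The patterns target well-known framework markers (HeadlessChrome, python-requests, Scrapy), but the specific regexes and tags are illustrative, not the real CDN07 rule base.

```python
import re

# Illustrative fingerprint rules (not the real CDN07 rule base):
# each entry pairs a regex with the framework it usually betrays.
UA_FINGERPRINTS = [
    (re.compile(r"HeadlessChrome/", re.I), "puppeteer-default"),
    (re.compile(r"python-requests/", re.I), "requests-library"),
    (re.compile(r"\bScrapy/", re.I), "scrapy-framework"),
    (re.compile(r"Chrome/4[0-9]\."), "outdated-chrome"),  # ancient UA, likely replayed
]

def match_fingerprint(ua: str):
    """Return the first fingerprint tag a UA matches, or None."""
    for pattern, tag in UA_FINGERPRINTS:
        if pattern.search(ua):
            return tag
    return None
```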

These rules look simple, but each one carries lessons paid for in blood and tears. One customer was fooled by a crawler that used an outdated browser UA to slip past the rules; in the end only browser-feature detection intercepted it.

Frequency limits should be designed in layers

Don't apply one set of frequency rules to every interface. A login endpoint carries a completely different risk level than a product detail page, so I usually set up three tiers of frequency limits for clients:

Static resources are relaxed to 50 r/s, APIs are held strictly to 10 r/s, and the login endpoint is capped at 3 r/s or lower. The CAPTCHA endpoint deserves special protection - many crawlers brute-force refresh it, so limit it to 1 request per 10 seconds.

Example configuration on CDN5:
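CDN5's actual configuration syntax isn't shown here, so here is the tiered policy above expressed as a plain Python structure. The path patterns and field names are my own, for illustration only.

```python
# Three-tier frequency limits mirroring the policy above.
# Path prefixes and field names are illustrative, not CDN5 syntax.
RATE_TIERS = [
    {"match": "/static/", "limit_rps": 50,  "burst": 100},  # static assets: relaxed
    {"match": "/api/",    "limit_rps": 10,  "burst": 20},   # APIs: strict
    {"match": "/login",   "limit_rps": 3,   "burst": 5},    # login: tight
    {"match": "/captcha", "limit_rps": 0.1, "burst": 1},    # 1 request / 10 s
]

def tier_for(path: str) -> dict:
    """Pick the first tier whose prefix matches the request path."""
    for tier in RATE_TIERS:
        if path.startswith(tier["match"]):
            return tier
    return {"match": "*", "limit_rps": 10, "burst": 20}  # default tier
```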

Never trust a UA whitelist.

Some playbooks suggest allowing only common browser UAs through; that one-size-fits-all approach is guaranteed to cause trouble. Plenty of legitimate crawlers (search engines, price-comparison sites) need special treatment these days. I've seen a site block Googlebot, and its organic traffic plummeted 40%.

The right approach is to verify the authenticity of well-known crawlers. Googlebot, for example, has an official verification method:
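Google's documented method is a reverse-plus-forward DNS check: look up the requesting IP's PTR record, confirm the hostname ends in googlebot.com or google.com, then resolve that hostname forward and confirm it maps back to the same IP. A Python sketch, with the resolvers injectable so the logic can be tested offline:

```python
import socket

GOOGLE_DOMAINS = (".googlebot.com", ".google.com")

def verify_googlebot(ip, reverse=socket.gethostbyaddr, forward=socket.gethostbyname):
    """Reverse-then-forward DNS check per Google's documented procedure."""
    try:
        hostname = reverse(ip)[0]          # PTR lookup
    except OSError:
        return False
    if not hostname.endswith(GOOGLE_DOMAINS):
        return False                       # PTR record isn't in Google's domains
    try:
        return forward(hostname) == ip     # forward lookup must round-trip
    except OSError:
        return False
```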

Bingbot and the others have similar verification mechanisms. This part has to be configured manually; don't rely on the CDN's default rules.

Browser fingerprint verification is the ultimate weapon

Advanced crawlers can now fake the UA and IP perfectly; in the end it takes browser fingerprinting to identify them. JavaScript challenges probe browser environment characteristics, such as checking whether navigator.plugins is populated and whether the WebGL renderer matches the claimed UA.

On CDN07 it can be configured like this:
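The CDN07 configuration itself isn't reproduced here; below is a server-side Python sketch of the idea. It assumes the client-side JS challenge posts back fields such as the navigator.plugins count, the WebGL renderer string, and the navigator.webdriver flag, and the server flags headless-looking combinations. The field names are my own, not a real CDN07 schema.

```python
# Server-side scoring of fingerprint fields posted by a JS challenge.
# Renderer names below are common headless/software-GL signatures.
HEADLESS_RENDERERS = ("swiftshader", "llvmpipe", "mesa offscreen")

def is_suspicious_fingerprint(fp: dict) -> bool:
    """Flag fingerprints typical of headless browsers."""
    if fp.get("webdriver"):                # navigator.webdriver is true
        return True
    if fp.get("plugins_count", 0) == 0:    # a real desktop Chrome reports plugins
        return True
    renderer = fp.get("webgl_renderer", "").lower()
    return any(name in renderer for name in HEADLESS_RENDERERS)
```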

In testing, this approach blocked 99.9% of headless browsers, but watch out for the SEO impact: it's best to whitelist known search engine crawlers.

Don't forget to monitor and iterate

Crawler techniques evolve constantly; rules that worked last month may fail this month. You need a monitoring system in place to keep an eye on these metrics:

Anomalous-request ratio, CAPTCHA trigger rate, and per-endpoint request frequency distribution. I usually build a real-time dashboard in Grafana and adjust the rules the moment an anomaly shows up.
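The three metrics can be computed from request logs with a few lines of Python. This is a sketch over an assumed log schema (the "blocked" and "captcha_shown" fields are illustrative, not any specific CDN's log format):

```python
from collections import Counter

def crawler_metrics(log_entries):
    """Summarize the three monitoring metrics from request-log dicts.

    Each entry is assumed to look like:
    {"path": "/api/x", "blocked": bool, "captcha_shown": bool}
    """
    total = len(log_entries)
    if total == 0:
        return {"anomaly_ratio": 0.0, "captcha_rate": 0.0, "per_path": Counter()}
    blocked = sum(1 for e in log_entries if e.get("blocked"))
    captcha = sum(1 for e in log_entries if e.get("captcha_shown"))
    return {
        "anomaly_ratio": blocked / total,
        "captcha_rate": captcha / total,
        "per_path": Counter(e["path"] for e in log_entries),  # feed to a Grafana panel
    }
```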

I recently spotted a new crawler that simulates mouse trajectories; next time I'll cover defending against that with behavioral biometrics. These days even the CDN has to watch its own back: some crawlers even disguise themselves as the CDN's own monitoring bots!

Truly effective protection is always a layered combination of strategies: UA detection sifts out the low-end crawlers, frequency limits stop the mid-tier attacks, and browser fingerprint verification takes out the advanced players. Finally, leave yourself an escape hatch: if a real user ever gets blocked by mistake, at least give them a channel to appeal.
