Recently, a friend who works in online education came to me to complain: their high-priced recorded course videos showed up on all sorts of pirate sites within days of release, and were even bundled and sold on Taobao for 9.9. “We can't stop it. We're obviously using a CDN, so why is it as flimsy as papier-mâché?” I know this problem all too well, so today I'm going to lay it all out for you.
Crawlers these days are no longer a simple `curl`. They use distributed nodes, crawl slowly at low request rates, and even disguise themselves as a normal app client, aiming at the blind spots in your defenses. Think a CDN lets you rest easy? Naive. The default configuration of many CDNs is fine against a CC attack, but against professional crawlers it barely scratches the surface.
I have tested several mainstream providers and found plenty of pitfalls. For example, some CDNs' WAF rule bases have not been updated in years and let through any slightly altered User-Agent; others, chasing acceleration performance, cache status codes that should not be cached, which gives crawlers a green light. Worst of all, some providers put anti-crawling features in a value-added package, so without paying extra you are basically running unprotected.
If you really want to stop crawlers, you have to switch from an “acceleration” mindset to an “attack and defense” mindset. A CDN should not be just a traffic hauler; it should be your first line of defense. Don't expect a single measure to solve the problem; you need a combination of techniques. Next, I share a few strategies I have tested and found effective; some of these configurations can even give you high-end protection on a mid-range budget.
Let's start with the simplest: recognizing machine traffic. Many crawlers don't bother to disguise themselves and give themselves away in the request headers: an empty Referer, a missing Accept-Language, or the default headers of a less common HTTP tool library. These can be intercepted by configuring a rule in the CDN management console.
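Every CDN console has its own rule syntax, so here is the logic of such a rule as a minimal Python sketch rather than any vendor's real configuration. The function name `looks_like_bot` and the marker list are my own assumptions; tune them to your traffic.

```python
from typing import Mapping

# Hypothetical markers of HTTP tool libraries; adjust to what you see in logs.
SUSPICIOUS_UA_MARKERS = ("python-requests", "curl/", "go-http-client", "scrapy")


def looks_like_bot(headers: Mapping[str, str]) -> bool:
    """Return True when the request headers match the simple heuristics above."""
    # HTTP header names are case-insensitive, so normalize them first.
    h = {name.lower(): value.strip() for name, value in headers.items()}

    # Empty or missing Referer.
    if not h.get("referer"):
        return True
    # Missing Accept-Language, which real browsers normally send.
    if not h.get("accept-language"):
        return True
    # Missing User-Agent, or one that carries a tool-library default.
    ua = h.get("user-agent", "").lower()
    if not ua or any(marker in ua for marker in SUSPICIOUS_UA_MARKERS):
        return True
    return False
```

Keep in mind that an empty Referer is also what you get when someone types a URL directly, so a rule like this fits API or media endpoints that are only ever called from your own player page, not your landing pages.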
But advanced crawlers fake browser fingerprints, and that's when a challenge mechanism is needed. I highly recommend enabling the CDN's JavaScript challenge feature: a normal browser automatically runs the JS computation and submits the token, while most crawlers simply get stuck. CDN07 is particularly well tuned in this regard and can also distinguish real browsers from headless tools like PhantomJS.
Rate limiting is the big one. Never use a single global rate limit, or you will hurt normal users by accident. Limit dynamically by IP, session ID, or even business dimension, for example on the video API endpoints.
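To make the per-dimension idea concrete, here is a rough sliding-window limiter in Python. It is a sketch, not a CDN configuration: the class name, the key format, and the limits are all assumptions, and a real deployment would keep the counters in the CDN or a shared store rather than in process memory.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class SlidingWindowLimiter:
    """Allow at most `limit` requests per key within `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


def limiter_key(ip: str, session_id: str, endpoint: str) -> str:
    """Build a key per endpoint, IP and session instead of one global counter."""
    return f"{endpoint}|{ip}|{session_id}"
```

Usage would look like `limiter.allow(limiter_key(ip, session_id, "/api/video"))` on each request; because the key includes the endpoint, the video API can get a tighter limit than ordinary page views.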
But the most powerful measure of all is behavioral analysis. A truly professional protection setup establishes a traffic baseline and detects unusual patterns: an IP frantically walking from /video/123 to /video/99999 at 2:00 a.m., or an account whose download count suddenly spikes within a short period. CDN5 implements this kind of dynamic rule best; it can trigger real-time human verification or even a temporary block.
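The ID-traversal pattern is easy to illustrate. The sketch below is my own toy heuristic in Python, not how any vendor does it: it takes one client's request paths in order and flags a long run of increasing video IDs. The function name, the `/video/<id>` path shape, and the thresholds are assumptions.

```python
import re
from typing import Iterable

# Assumes video pages live at /video/<numeric id>, as in the example above.
VIDEO_PATH = re.compile(r"^/video/(\d+)$")


def is_sequential_scan(paths: Iterable[str], min_run: int = 20, max_step: int = 1) -> bool:
    """Return True if the paths contain a run of `min_run` video IDs,
    each larger than the previous one by at most `max_step`."""
    run = 0
    prev = None
    for path in paths:
        match = VIDEO_PATH.match(path)
        if match is None:
            continue  # ignore requests that are not video pages
        video_id = int(match.group(1))
        if prev is not None and 0 < video_id - prev <= max_step:
            run += 1
        else:
            run = 1  # this ID starts a new run
        prev = video_id
        if run >= min_run:
            return True
    return False
```

A real baseline would also look at time of day and per-account download counts; this only covers the traversal case.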
On to video watermarks. Many people think a watermark is a logo in the corner, but that can be erased with a single ffmpeg command, so it protects nothing. An effective watermark meets three conditions: dynamic rendering, information binding, and resistance to removal. For example, generate a separate watermark for the user behind each request.
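As a sketch of the "information binding" part only, the Python below builds a per-request watermark string that ties the user, the video and the time together with an HMAC tag, so a leaked copy can be traced and a forged watermark can be detected. The function names, the text layout and the key are hypothetical, user IDs are assumed to contain no spaces, and actually rendering the text onto the video is left to your player or transcoder.

```python
import hashlib
import hmac
import time
from typing import Optional

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical placeholder key


def _tag(user_id: str, video_id: str, ts: str, key: bytes) -> str:
    payload = f"{user_id}:{video_id}:{ts}".encode("utf-8")
    # Truncated so the tag stays short enough to overlay on screen.
    return hmac.new(key, payload, hashlib.sha256).hexdigest()[:10]


def watermark_text(user_id: str, video_id: str,
                   key: bytes = SECRET_KEY, now: Optional[int] = None) -> str:
    """Build the watermark text for one request: user, timestamp, tag."""
    ts = str(int(time.time()) if now is None else now)
    return f"{user_id} {ts} {_tag(user_id, video_id, ts, key)}"


def verify_watermark(text: str, video_id: str, key: bytes = SECRET_KEY) -> bool:
    """Check that a watermark found in a leaked copy was issued by us."""
    parts = text.split(" ")
    if len(parts) != 3:
        return False
    user_id, ts, tag = parts
    expected = _tag(user_id, video_id, ts, key)
    return hmac.compare_digest(tag.encode("utf-8"), expected.encode("utf-8"))
```

Because the timestamp changes on every request, the rendered text is different each time, which covers the "dynamic" condition as well.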
But this can still be circumvented. A more advanced approach is discrete digital watermarking: user information is split into perturbation signals embedded in different frames of the audio and video data, invisible to the naked eye but extractable by an algorithm. 08Host offers out-of-the-box integration for this; it is a bit more expensive, but the result can serve as key evidence in legal forensics.
Recently I also came across a sneaky trick: some crawlers impersonate a CDN edge node's back-to-origin requests. To counter this, set up two-way verification at the origin, for example by having the CDN vendor assign a dedicated token.
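The origin-side half of that check can be sketched in a few lines of Python. The header name `x-origin-token` and the function name are assumptions of mine; use whatever header your CDN vendor actually lets you configure, and keep the token out of source control.

```python
import hmac
from typing import Mapping

ORIGIN_TOKEN_HEADER = "x-origin-token"  # hypothetical header name
ORIGIN_TOKEN = "replace-with-the-token-agreed-with-your-cdn"  # hypothetical value


def is_trusted_edge_request(headers: Mapping[str, str],
                            expected: str = ORIGIN_TOKEN) -> bool:
    """Return True only if the request carries the token shared with the CDN."""
    supplied = ""
    for name, value in headers.items():
        # Header names are case-insensitive.
        if name.lower() == ORIGIN_TOKEN_HEADER:
            supplied = value
            break
    if not supplied:
        return False
    # Constant-time comparison so the token cannot be guessed via timing.
    return hmac.compare_digest(supplied.encode("utf-8"), expected.encode("utf-8"))
```

Requests that fail the check can be rejected at the origin before they touch any video data. This only covers the origin verifying the edge; the other direction is left to your CDN setup.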
Finally, I have to complain: some vendors lock anti-crawling features away so that you cannot use them unless you buy the most expensive plan. CDN07, by contrast, is quite fair: even the basic version includes an API analysis report that clearly shows a crawler heat map and the top attack sources. I recommend pulling a report every week and watching it closely; you may find a competitor's IP range crawling like crazy...
To be honest, there is no 100% anti-crawling solution. But with a three-layer defense of CDN rules + watermark tracking + origin verification, you can at least push the cost of crawling to the point where the other side can't turn a profit. The key is to keep iterating: I update my blocked IP ranges once a month and adjust rate thresholds once a quarter. Security is a continuous contest of attack and defense, and standing still just means waiting to be crawled.
If you want recommendations: for small and medium-sized sites, CDN05 is cost-effective enough; high-traffic video platforms should go straight to CDN07's behavioral analysis solution. 08Host suits those with advanced needs for custom watermarks and legal forensics. And don't forget that the last line of defense is always human-centered design: add a “report piracy” button to the video player and let your users become your sentinels.

