29 May 2026
Abusive scraping of public-interest websites amounts to 25% of total traffic
Scrapers are automated systems that collect web pages at scale. Some are legitimate: search engines, archival projects, academic crawlers or monitoring tools. Scrapers that act in good faith identify themselves, publish documentation, provide a contact address, and respect website scraping policies such as robots.txt or rate limits. The others do not.
At Qurium, we are mainly dealing with cases of opaque scraping systems that generate large volumes of traffic while providing no stable infrastructure that is easily identifiable, no point of contact, no declared purpose, and no practical opt-out mechanism. These systems may distribute requests across very large pools of residential, mobile, ISP, VPN, or brokered proxy addresses. At the level of a single request, such traffic can resemble ordinary user activity. However, at aggregate scale, it can overload infrastructure, consume massive bandwidth, trigger cache churn, increase database and application processing, inflate logging volume, and force site operators into defensive mitigation.
We estimate that today no less than 25% of the traffic and processing costs of the organizations we host are related to handling scrapers and other automated access systems, and it is only going to get worse. This “overhead” includes not only bandwidth, but also web server processing, cache pressure, application execution, database load, log ingestion, monitoring, alerting, and incident-response work.
Opaque scraping creates an accountability problem. Legitimate crawlers identify themselves and allow a website operator to approve, deny, rate-limit, license, or otherwise manage automated access. Opaque scrapers deny that choice. They extract value from public-interest content while shifting infrastructure costs onto the targeted organization.
Opaque scrapers often claim the mantle of “public access”, but their behavior is closer to someone walking into a public library, reserving every seat, monopolizing the shelves, and then leaving the library to pay for the disruption.
One of the current research challenges is therefore not just to fingerprint such patterns and block abusive IP addresses, but to attribute the traffic to the actors behind it.
The scraper hits “Arab Reporters for Investigative Journalism”
On 14 May, the English-language site of ARIJ – Arab Reporters for Investigative Journalism received a massive scraping event targeting its library of investigations. The volume of the event was 10,000 higher than regular scraping traffic.
Arab Reporters for Investigative Journalism is a Jordan-based nonprofit organization supporting investigative journalism and fact-checking across the Arab world. ARIJ publishes investigations and provides journalists with training, mentoring, funding, networking, and safety support in a difficult regional press-freedom environment.
Hence, the target of the scraping was not a commercial actor but a public-interest journalism website with hundreds of investigations, run by an organization that is easy to contact and collaborate with.
Analyzing the scraping traffic logs
To understand the event, we analyzed several million lines of web log data. We started by reducing the raw requests to unique sessions: we grouped the requests into sessionized visits, which collapse short web redirects into the same logical visit. More details of our technical analysis methodology are included at the end of this article.
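The sessionization step can be sketched as follows. This is a minimal illustration, not the pipeline used for the analysis: the function name `sessionize`, the record layout and the 30-second gap threshold are all assumptions made for the example.

```python
from collections import defaultdict

# Assumption: requests from the same IP that are less than this many seconds
# apart belong to one logical visit. The value is illustrative only.
SESSION_GAP_SECONDS = 30


def sessionize(requests, gap=SESSION_GAP_SECONDS):
    """Collapse (timestamp, ip) request records into per-IP sessions.

    Returns a list of (ip, first_seen, last_seen, request_count) tuples,
    ordered by the time each session started.
    """
    by_ip = defaultdict(list)
    for ts, ip in requests:
        by_ip[ip].append(ts)

    sessions = []
    for ip, stamps in by_ip.items():
        stamps.sort()
        start = prev = stamps[0]
        count = 1
        for ts in stamps[1:]:
            if ts - prev < gap:
                count += 1          # same burst: redirect or follow-up request
            else:
                sessions.append((ip, start, prev, count))
                start = ts          # long pause: a new logical visit begins
                count = 1
            prev = ts
        sessions.append((ip, start, prev, count))
    return sorted(sessions, key=lambda s: (s[1], s[0]))


# Documentation-range addresses, hypothetical timestamps in seconds.
example = [
    (0.0, "192.0.2.10"),
    (0.4, "192.0.2.10"),     # redirect follow-up, same visit
    (1.0, "198.51.100.7"),
    (9000.0, "192.0.2.10"),  # same IP returns hours later: new session
]
sessions = sessionize(example)
```

In this toy input, four raw requests collapse into three sessions, which is the same kind of reduction as the raw-to-sessionized ratio reported in the appendix.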
The main finding of our quantitative analysis is that during one day (23.13 hours), the website received traffic from 1.34 million unique IP addresses, of which more than 76% “only scraped once”. The traffic originated from more than 7,300 autonomous systems distributed across 223 of the 249 assigned ISO country codes.
After a few hours of “cool down”, IP addresses that have already scraped content are reused to scrape more. The median “cool-down” period (the time before the same IP returns to scrape) is approximately 2 hours and 17 minutes.
IP rotation within the same network (/24) takes a median of 17.4 minutes, that is, the typical time before a different IP from the same /24 appears. This means the traffic moved through pools of available addresses rather than repeatedly hammering the same IPs or the very same networks.
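The two timing metrics above can be computed along these lines. This is a sketch under assumptions: sessions are given as `(ip, start_time)` pairs, the reuse gap is taken as the per-IP average of the gaps between consecutive sessions (our reading of the “median average” figure in the appendix), and only IPv4 addresses are handled. The function names are hypothetical.

```python
import ipaddress
from collections import defaultdict
from statistics import mean, median


def same_ip_reuse_gaps(sessions):
    """Per-IP average gap in seconds between consecutive sessions of that IP."""
    starts = defaultdict(list)
    for ip, ts in sessions:
        starts[ip].append(ts)
    averages = {}
    for ip, stamps in starts.items():
        if len(stamps) < 2:
            continue  # single-session IPs have no reuse gap
        stamps.sort()
        averages[ip] = mean(b - a for a, b in zip(stamps, stamps[1:]))
    return averages


def slash24(ip):
    """Return the /24 network that contains an IPv4 address."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))


def switch_gaps_within_24(sessions):
    """Gaps between consecutive sessions in one /24 that use different IPs."""
    per_net = defaultdict(list)
    for ip, ts in sessions:
        per_net[slash24(ip)].append((ts, ip))
    gaps = []
    for events in per_net.values():
        events.sort()
        for (t0, ip0), (t1, ip1) in zip(events, events[1:]):
            if ip0 != ip1:
                gaps.append(t1 - t0)
    return gaps


# Hypothetical sessions: (ip, start time in seconds).
sample = [
    ("192.0.2.10", 0),
    ("198.51.100.7", 100),
    ("192.0.2.77", 600),
    ("192.0.2.10", 8000),
]
reuse = same_ip_reuse_gaps(sample)
switches = switch_gaps_within_24(sample)
median_reuse = median(reuse.values())
median_switch = median(switches)
```

Reporting the median rather than the mean keeps a few very long gaps from dominating the figure, which matters here because the appendix shows the mean is much larger than the median for both metrics.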
The traffic spanned almost all the countries in the world and a large pool of network operators. The strongest presence was from highly populated countries such as Vietnam, Brazil, and India, and from other countries with large consumer ISP and mobile markets. This supports the conclusion that the scraping used distributed access infrastructure rather than a hosting environment or a traditional residential proxy infrastructure.
The high-level aggregated traffic pattern is consistent with a “proxy provider” that has access to very large pools of IP space across many ISPs and countries and keeps the individual usage per IP at a very low rate.
We shared our results with several researchers who are mapping residential proxies, and their feedback was that the behavioral evidence and their internal classifiers point toward a system tagged as NetNut.
Israeli Safe-T (Alarum) acquires NetNut and Chi Cooked to merge proxy and ISP operations

NetNut publicly markets residential proxy services for web data collection and describes rotating residential proxies with very large IP pools across many countries. Its website describes use cases such as web scraping, ad verification, accessing geo-restricted content, and large-scale data collection. But what makes NetNut different is its claim that it can operate its proxy infrastructure without the need for traditional proxies. How is that possible?
To better understand what NetNut is offering, we looked into its corporate history, as it might help explain why an ISP-integrated proxy model is technically plausible.

In 2019, the Israeli-owned company “Safe-T Group” acquired NetNut, described publicly as a business proxy network solution provider. Safe-T filings state that the company acquired NetNut and certain assets required for NetNut’s ongoing operations from DiViNetworks.

In December 2020, Safe-T acquired Chi Cooked LLC, described as a U.S.-based provider of cloud-based global IP proxy services.
Safe-T presented the acquisition as complementary to NetNut and as strengthening its position as a “one-stop shop for proxy-related business solutions”. This suggests that Safe-T was assembling a broader proxy-services portfolio: NetNut and DiViNetworks represented the ISP-integrated proxy model, while Chi Cooked added cloud-based/global IP proxy infrastructure and an existing customer base.
But the real deal was that Chi Cooked LLC operated in the sneaker-proxy ecosystem, providing proxy infrastructure used by sneaker resellers and bot operators to reach retail websites through large pools of IP addresses. After Safe-T Group acquired Chi Cooked LLC in December 2020, it moved quickly to turn that niche proxy infrastructure into a formal commercial product. By February 2021, Safe-T was promoting a dedicated residential proxy network for the sneaker resale market.


Safe-T later rebranded as Alarum Technologies in January 2023. Alarum filings and investor materials continue to list NetNut as part of the group’s proxy and web-data-collection business.
The DiViNetworks connection is central because the company did not come from the proxy business. DiViNetworks came from the ISP bandwidth-optimization and ISP-integration world, not from a conventional datacenter proxy model. The company specialized in optimizing network traffic by redirecting live IP traffic to a smart binary raw compressor; once transported to the other side of a dedicated link, the compressed traffic is decompressed and IP-routed.
According to information disclosed by the International Finance Corporation (IFC), part of the World Bank Group, the company, formerly known as iPortent, received USD 5 million in funding in 2013 to support its expansion in emerging markets. According to the 2013 project description, its bandwidth-optimization solution, known as DiviCloud, had points of presence (“PoPs”) in 15 global Internet hubs and servers in more than 60 client locations in emerging markets.
Later on, the company changed their business focus (DiViCloud/DiViLink) and started to market services to ISPs, WISPs, and WiFi operators for bandwidth monetization (DiviPlus+), including router integration with platforms such as MikroTik, Cisco, and Juniper. Crucially, DiViNetworks has advertised that operators can monetize bandwidth without allocating specific IP ranges. In other words, the model is presented as using existing active IP space inside provider networks rather than requiring a separate proxy allocation.
DiViNetworks’ advertisement for its bandwidth monetization service (DiviPlus+) can be found here.
The magic “route reflector” – scraping without proxies
Alarum’s own description of NetNut’s patented “reflector” method is vague, and it boldly claims that the method is totally different from conventional anonymous proxies that reroute traffic through an intermediate proxy device. According to that description, the technology does not need a proxy-type device inside the ISPs.
We decided to brainstorm what such an architecture could look like. We needed to find out what it would take to run scraping traffic in parallel with existing ISP clients, and we came up with the design of our “(cc) Lease-a-Flow” service.
From a technical point of view, we believe that NetNut’s reflector-based anonymous proxying is a form of ISP-side session or “flow borrowing”. In such a solution, proxy-originated traffic enters a provider network through a controlled path, such as a network tunnel, and is then source-NATed through provider-side address space. In other words, the external traffic takes over an existing IP address from the provider.
Therefore, the website targeted by NetNut’s scraping sees sessions associated with addresses announced by, or active inside, the ISP provider network rather than with the requester’s original infrastructure.
In other words, the scraping attack on ARIJ is possible by implementing a simple NAT in residential ISPs around the world, piggybacking on subscribers’ “unused bandwidth”, and tunneling the traffic back to the scraper.
(cc) Lease-a-Flow: From whiteboard to the Lab
We reproduced the core behavior of our hypothesis in our lab with a minimal MikroTik setup. A remote system delivered web requests to the router over a GRE tunnel. On the MikroTik, traffic arriving from the tunnel interface was classified as tunnel/proxy traffic. A source NAT rule then translated that traffic to one of the public addresses announced by the MikroTik before forwarding it to the Internet.
In functional terms, the router behavior is:
- Receive web traffic through a controlled tunnel (from NetNut).
- Classify that traffic in the ISP as tunnel-originated proxy traffic.
- Apply source NAT to that traffic.
- Translate the source to an announced provider-side public address within an address pool.
- Forward the packet to the Internet.
- Preserve state so that return traffic is mapped back to the correct tunnel-originated flow.
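The steps listed above can be modelled in a few lines of Python. This is a toy simulation for illustration, not the MikroTik configuration used in our lab and not NetNut’s implementation: the class names `Flow` and `ToySourceNat`, the interface name `gre-proxy`, the round-robin address choice and all addresses are assumptions. For simplicity the source port is kept unchanged during translation.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class Flow:
    """One direction of a connection, identified by its 5-tuple."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: str = "tcp"


class ToySourceNat:
    """Source-NATs tunnel traffic onto a pool of provider-side addresses."""

    def __init__(self, public_pool, tunnel_interface="gre-proxy"):
        self.tunnel_interface = tunnel_interface  # hypothetical name
        self._pool = cycle(public_pool)           # round-robin over the pool
        self.state = {}                           # translated flow -> original

    def egress(self, in_interface, flow):
        # Classify: only traffic arriving from the tunnel is translated.
        if in_interface != self.tunnel_interface:
            return flow
        # Translate the source to a provider-side public address.
        translated = Flow(next(self._pool), flow.src_port,
                          flow.dst_ip, flow.dst_port, flow.proto)
        # Preserve state so the reply can be mapped back to the tunnel flow.
        self.state[translated] = flow
        return translated

    def ingress(self, reply):
        # A reply is the mirror image of a translated outbound flow.
        key = Flow(reply.dst_ip, reply.dst_port,
                   reply.src_ip, reply.src_port, reply.proto)
        original = self.state.get(key)
        if original is None:
            return None  # no NAT state: not tunnel-originated traffic
        return Flow(reply.src_ip, reply.src_port,
                    original.src_ip, original.src_port, reply.proto)


nat = ToySourceNat(["203.0.113.5", "203.0.113.6"])
request = Flow("10.255.0.2", 40001, "198.51.100.80", 443)
outbound = nat.egress("gre-proxy", request)
inbound = nat.ingress(Flow("198.51.100.80", 443, outbound.src_ip, 40001))
```

The target website only ever sees `outbound.src_ip`, a provider-side address, while `inbound` shows the reply being handed back toward the tunnel endpoint.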
But an alert reader who has mastered the TCP/IP bible, “TCP/IP Illustrated”, will quickly notice a major problem with this approach: collisions.
The hard part is not basic egress. The hard part is safe coexistence with real subscriber traffic. If tunnel-originated proxy flows and ordinary subscriber flows share the same public address space and source port numbers, the router cannot tell identical flows apart: the NAT mapping takes precedence over any pre-routing decision, so by default the return traffic of every matching flow is delivered to NetNut.
The router should therefore ensure that each translated flow remains uniquely identifiable, but in this simple setup it has no means to do so and cannot avoid mapping collisions between NAT-ed proxy traffic and normal subscriber traffic. A collision occurs when a translated flow (5-tuple) matches an existing subscriber flow at a given time. The return traffic is then ambiguous, and the source NAT state, applied at pre-routing, takes over all of it. All the traffic, including the subscriber’s real ingress flow, will be routed back to NetNut!
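The collision condition can be illustrated with a small check on 5-tuples. This is a sketch with hypothetical helper names and documentation-range addresses; it only shows when two flows become indistinguishable on the return path, not how a real router resolves the conflict.

```python
def five_tuple(src_ip, src_port, dst_ip, dst_port, proto="tcp"):
    """Build the 5-tuple that identifies a flow as seen on the Internet side."""
    return (proto, src_ip, src_port, dst_ip, dst_port)


def find_collisions(subscriber_flows, translated_proxy_flows):
    """Return 5-tuples used by both a subscriber and a translated proxy flow."""
    return set(subscriber_flows) & set(translated_proxy_flows)


# A subscriber holding public address 203.0.113.5 talks to a web server.
subscriber = [five_tuple("203.0.113.5", 51000, "198.51.100.80", 443)]

# Tunnel traffic after source NAT onto the same provider-side address space.
proxy = [
    five_tuple("203.0.113.5", 51000, "198.51.100.80", 443),  # same 5-tuple
    five_tuple("203.0.113.5", 51001, "198.51.100.80", 443),  # different port
]

collisions = find_collisions(subscriber, proxy)
```

Only the first proxy flow collides. A reply from `198.51.100.80:443` to `203.0.113.5:51000` could belong to either party, which is exactly the ambiguity described above.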
A related requirement for this simple architecture is a fast (NAT) connection tear-down. If proxy-originated sessions are short-lived, the proxy system must close them cleanly and release NAT state as quickly as possible. In practice, this may involve aggressively ending idle or completed sessions (sending TCP RST) once the web request is completed, so that the router’s connection-tracking and NAT tables are not kept occupied longer than necessary. Fast tear-down reduces the probability of port exhaustion, stale mappings, and collisions with ordinary subscriber traffic using the same provider-side address space.
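The link between tear-down speed and NAT table occupancy follows from Little’s law: the mean number of entries equals the arrival rate multiplied by the mean holding time. The sketch below uses the peak session rate reported in the appendix. The two holding times are hypothetical values chosen for illustration, and the whole load is treated as if it crossed a single NAT, which is a deliberate simplification.

```python
def mean_nat_entries(sessions_per_second, hold_seconds):
    """Little's law: mean concurrent entries = arrival rate x holding time."""
    return sessions_per_second * hold_seconds


PEAK_SESSION_RATE = 209  # sessions/second, from the appendix

# Hypothetical holding times, for illustration only.
FAST_TEARDOWN_SECONDS = 2     # state released right after the request
SLOW_TEARDOWN_SECONDS = 120   # state left to expire on an idle timeout

fast = mean_nat_entries(PEAK_SESSION_RATE, FAST_TEARDOWN_SECONDS)
slow = mean_nat_entries(PEAK_SESSION_RATE, SLOW_TEARDOWN_SECONDS)
```

Under these assumptions the slow case keeps 60 times more state alive than the fast case, and every extra entry is another 5-tuple that a subscriber flow could collide with.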
In practice, a more advanced solution will require full separation through two separate connection-tracking tables, traffic buffering, reserved source-port ranges, dedicated address pools, SYN cookie tricks, or equivalent mechanisms. We doubt that this can be implemented with a few clicks, with six lines of router configuration, or in under 15 minutes, as NetNut claims.
Our lab experiment shows that to coexist with ordinary subscriber traffic, NetNut proxy-originated flows must remain distributed across a large address pool, statistically low per address, and short-lived enough not to exhaust NAT state or collide with active subscriber sessions. This operational constraint exactly matches the traffic pattern observed in the ARIJ event: many IPs, many /24s, a high share of single-session IPs, delayed same-IP reuse, faster switching between different IPs inside the same provider networks, and short session lifetimes.

Why Opaque Scraping Matters
For public-interest media organizations, this kind of scraping creates a structural problem. The scraper can operate at industrial scale while making each individual request appear ordinary. Blocking single IPs is ineffective because the pool is too large. Blocking entire countries or large ISPs risks excluding legitimate readers. Overly aggressive rate limiting may harm real audiences, especially in regions where access is already fragile.
The lack of scraper identification is a deliberate attempt to avoid scrutiny and accountability. If the scraper presented a stable identity, declared purpose, contact address, and opt-in or opt-out mechanism, the affected organization could make an informed decision. It could allow access, restrict it, negotiate it, license it, or block it. Without that, the organization can only infer intent and infrastructure from behavior.
If NetNut’s solution for bandwidth monetization resembles our “Lease-a-Flow” model, it also raises serious security and ethical concerns: NetNut would be able to gain access to ISP subscribers’ traffic, and subscribers are entirely unaware of what their IP addresses are used for.
Conclusion
The 14 May scraping event against ARIJ’s English-language website was not an ordinary spike in automated traffic. It was a large-scale, highly distributed extraction operation against a public-interest journalism platform, involving 1.34 million unique IP addresses, more than 7,300 autonomous systems, and traffic from 223 country codes over roughly one day. Most IPs appeared only once, while reused addresses typically returned hours later, and traffic rotated rapidly across IPs within the same networks.
The observed traffic pattern is consistent with a NetNut-style ISP-integrated proxy model: a system able to send externally supplied requests out through provider-side IP space while keeping per-address usage low, short-lived, and widely distributed. Qurium’s lab experiment shows that the basic behavior, receiving proxy traffic through a tunnel and source-NATing it through provider-announced addresses, is technically plausible with standard routing and NAT mechanisms, although safe coexistence with real subscriber traffic is far more complex than simple marketing claims suggest.
Such systems create an accountability gap. Public-interest websites are left carrying the cost of industrial-scale scraping while the scraper provides no stable identity, no contact point, no declared purpose, and no meaningful way to approve, deny, limit, or negotiate access. The burden is shifted onto media organizations that already operate under financial, political, and security pressure.
For organizations like ARIJ, the problem is not merely unwanted traffic. It is a structural imbalance: scrapers can extract content at scale while hiding behind millions of ordinary-looking residential or ISP addresses. Blocking individual IPs is useless, blocking whole networks risks excluding real readers, and aggressive rate limits can harm the very audiences these journalism projects are meant to serve.
The central issue is transparency. Legitimate crawlers that identify themselves allow site operators to make informed decisions. Opaque proxy-based scraping systems do the opposite. They turn public access into unaccountable extraction, consuming infrastructure resources and potentially implicating unaware ISP subscribers whose IP addresses may be used in the process. If bandwidth-monetization systems resemble the “Lease-a-Flow” model described here, they raise not only operational concerns for website owners, but also serious ethical and security questions for ISPs, subscribers, and the broader public-interest web.
ARIJ’s Director General, Rawan Damen, says: “Although Qurium’s investigation provides insights on how the scraping was carried out and indicates who might be behind it, a definitive attribution and understanding of the underlying motives behind the Israeli company NetNut’s targeted attack remain a primary focus of our inquiry.”
Qurium has reached out twice (15 and 21 May) to NetNut’s marketing representative Anna Vainshtein and their abuse team to discuss the event. At the time of writing, we have received no response.
Disclaimer
Our lab setup does not reproduce any proprietary NetNut or DiViNetworks implementation, nor does it prove how their production systems work internally. It does, however, demonstrate that the basic network behavior (receiving externally supplied traffic through a tunnel and making it egress through provider-side announced address space) can be achieved with standard router NAT functionality and minimal configuration.
Appendix
Studying live traffic
We decided to record the scraper’s live traffic from one of the large ASNs flooding the site, VNPT-VN AS45899. According to public documentation, the provider was a trial user of DiViNetworks solutions in 2013-2014.

We recorded traffic coming from a large IP pool 14.160.0.0/11. We found that more than 97% of the traffic had the same TCP signature mss=1452,sackOK,TS,nop,wscale=4. We also found two large clusters of NAT traffic:
- NAT Type A: Prefix: 14.191.0.0/16, Ports: 1024–32767, TTL: mostly 48–49
- NAT Type B: Prefix: other observed 14.160.x.x /16s, Ports: 32768–60999, TTL: mostly 51–52
When we analyzed the two clusters, we found a clear correlation between NAT source-port ranges and network prefixes, suggesting that two types of “NAT clusters” are operating and handling the scraper requests.
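The two clusters can be expressed as a simple classification rule over the source prefix and source port. This is a sketch using only the ranges listed above; the function name and the return labels are our own for this example. TTL is left out of the rule because the observed values are only “mostly” in the stated ranges.

```python
import ipaddress

OBSERVED_POOL = ipaddress.ip_network("14.160.0.0/11")
TYPE_A_PREFIX = ipaddress.ip_network("14.191.0.0/16")
TYPE_A_PORTS = range(1024, 32768)    # 1024-32767 inclusive
TYPE_B_PORTS = range(32768, 61000)   # 32768-60999 inclusive


def classify_nat_cluster(src_ip, src_port):
    """Label a packet as NAT Type A, Type B, or neither, per the ranges above."""
    addr = ipaddress.ip_address(src_ip)
    if addr not in OBSERVED_POOL:
        return "outside-pool"
    in_type_a_prefix = addr in TYPE_A_PREFIX
    if in_type_a_prefix and src_port in TYPE_A_PORTS:
        return "A"
    if not in_type_a_prefix and src_port in TYPE_B_PORTS:
        return "B"
    return "unclassified"


# Hypothetical packets inside and outside the recorded pool.
labels = [
    classify_nat_cluster("14.191.10.1", 2000),
    classify_nat_cluster("14.170.3.4", 40000),
    classify_nat_cluster("14.191.10.1", 40000),
    classify_nat_cluster("192.0.2.1", 2000),
]
```

Counting how many packets fall into “unclassified” gives a quick measure of how clean the prefix-to-port-range correlation is in a given capture.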


Quantitative Analyses
raw requests: 3,593,003
sessionized visits: 1,750,423
observation window: 83,271 seconds ≈ 23.13 hours
average raw request rate: 43.15 requests/second
average session rate: 21.02 sessions/second
peak raw request rate: 368 requests/second
peak session rate: 209 sessions/second
peak raw requests/minute: 12,017
peak sessions/minute: 4,460
Address-space diversity:
unique IPs: 1,340,761
unique /24 networks: 368,362
single-session IPs: 1,024,889
IPs with 2+ sessions: 315,872
single-session IP share: 76.44%
reused-IP share: 23.56%
Raw vs sessionized:
raw requests per session: 3,593,003 / 1,750,423 ≈ 2.05 raw requests/session
Same-IP reuse:
median average same-IP reuse gap: 8,228.35 seconds ≈ 2.29 hours
mean average same-IP reuse gap: 11,100.20 seconds ≈ 3.08 hours
/24 switching:
median /24 different-IP switch gap: 1,043.29 seconds ≈ 17.39 minutes
mean /24 different-IP switch gap: 2,990.30 seconds ≈ 49.84 minutes
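The derived figures above follow directly from the raw counts. The short script below recomputes them as a consistency check; the variable names are ours, and all input values are taken from the lists above.

```python
# Raw counts as reported above.
raw_requests = 3_593_003
sessions = 1_750_423
window_seconds = 83_271
unique_ips = 1_340_761
single_session_ips = 1_024_889

reused_ips = unique_ips - single_session_ips

derived = {
    "window_hours": window_seconds / 3600,
    "raw_requests_per_second": raw_requests / window_seconds,
    "sessions_per_second": sessions / window_seconds,
    "raw_requests_per_session": raw_requests / sessions,
    "single_session_share_pct": 100 * single_session_ips / unique_ips,
    "reused_share_pct": 100 * reused_ips / unique_ips,
    "median_reuse_gap_hours": 8_228.35 / 3600,
    "median_switch_gap_minutes": 1_043.29 / 60,
}

for name, value in derived.items():
    print(f"{name}: {value:.2f}")
```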
Raw Data Analysis
To understand the traffic flow we conducted ten different analyses:
- Raw Volumetric: Total raw requests, average request rate, peak requests per second, and peak requests per minute.
- Sessions Volumetric: Collapsed short same-IP request bursts into logical visits to remove redirect and canonicalization noise.
- IP and /24 diversity analysis: Counted unique IPs and unique /24 networks to estimate address-pool breadth.
- Country and ASN/org grouping: Grouped sessions by country and network organization to understand geographic and provider-level distribution.
- IP reuse analysis: Measured how long it takes for the same IP to reappear in a later session.
- Net switching analysis: Measured how quickly traffic inside the same /24 moves from one IP to another.
- Active pool analysis: Measured how many IPs, /24s, countries, and ASN/orgs were active in sliding windows such as 1, 5, 15, and 60 minutes.
- Pool-size correlation analysis: Compared ASN-level observed pool size (number of IPs, number of /24s, and effective pool size) against same-IP reuse delay.
- Reuse-tail visualization: Plotted how many reused IPs returned after minutes, hours, or most of the observation window.
- ASN infrastructure: Analyzed historical information about the scrapers’ ASNs in our infrastructure.
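As an illustration of the active pool analysis listed above, the following sketch counts how many distinct IPs and /24 networks start a session inside one time window. The function name, the `(timestamp, ip)` record layout and the sample data are assumptions, and only IPv4 addresses are handled.

```python
def active_pool(sessions, window_start, window_seconds):
    """Count distinct IPs and /24s with a session start inside the window.

    The window is half-open: [window_start, window_start + window_seconds).
    """
    window_end = window_start + window_seconds
    ips = {ip for ts, ip in sessions if window_start <= ts < window_end}
    # IPv4 only: a /24 is identified by the first three octets.
    nets = {ip.rsplit(".", 1)[0] + ".0/24" for ip in ips}
    return len(ips), len(nets)


# Hypothetical session starts: (timestamp in seconds, ip).
sample = [
    (0, "192.0.2.10"),
    (30, "192.0.2.77"),
    (50, "198.51.100.7"),
    (400, "192.0.2.10"),
]

first_minute = active_pool(sample, 0, 60)
later_minute = active_pool(sample, 360, 60)
fifteen_minutes = active_pool(sample, 0, 900)
```

Sliding `window_start` across the observation period and repeating the count for 1, 5, 15, and 60 minute windows shows how quickly the active pool grows as the window widens.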



