Contrary Research Rundown #142
An internet built for AI, plus new memos on Waymo, Saildrone, and more.
Research Rundown
For nearly 40 years, openness has been a fundamental characteristic of the internet. The open web was built on principles of decentralization, universality, non-discrimination, collaborative development, consensus, and accessibility. This openness plays a crucial role in civic engagement and has enabled countless communities, creators, and knowledge-sharing platforms to flourish.
This landscape is changing rapidly due to automated traffic. By April 2025, nearly 50% of all internet traffic was generated by bots, much of it from automated scrapers and crawlers that may soon outpace human traffic entirely. As more AI agents come online, the volume of artificial traffic will only increase; AI-driven traffic grew 49% in Q1 2025 alone. This shift is restructuring how information flows, how value is created, and who captures that value in the digital economy.
The Invisible Middleman
The global web scraping market is projected to reach $1.3 billion by 2025, with AI as a major driver. To understand this shift, it's worth examining how scraping has evolved from its origins.
JumpStation, one of the first "crawler-based" search engines, launched in December 1993 to organize the growing number of webpages on the internet. Other early uses of scraping bots included gauging the size of the internet and identifying broken or dead links on servers. Crawlers were largely undisruptive and could even be beneficial, bringing people to websites from search engines like Google or Bing in exchange for their data. Websites began to use machine-readable files, called robots.txt files, to specify which content crawlers should leave alone.
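To make the robots.txt convention concrete, here is a minimal sketch using Python's standard-library robotparser. The rules shown are hypothetical (a publisher blocking one AI crawler entirely while fencing off only a private section for everyone else), though GPTBot is a real crawler user agent:

```python
from urllib import robotparser

# Hypothetical robots.txt: block GPTBot everywhere,
# but only keep /private/ off-limits for other crawlers.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# A compliant crawler checks these rules before fetching a URL.
print(rp.can_fetch("GPTBot", "https://example.com/articles/1"))    # False
print(rp.can_fetch("Googlebot", "https://example.com/articles/1")) # True
print(rp.can_fetch("Googlebot", "https://example.com/private/x"))  # False
```

Crucially, robots.txt is purely advisory: nothing forces a crawler to check it, which is why compliance became contentious once scraped data turned into training fuel.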
Web scraping has historically relied on three methods: HTML scraping, crawler-based scraping, and API access. These formed the backbone of early search engines and data aggregators. With the advent and increasing adoption of AI search engines and chatbots, scraping has moved from indexing content to ingesting it and generating new output. AI tools like Perplexity and ChatGPT can instantly summarize the internet and give users direct answers, creating what amounts to an invisible middleman effect. ChatGPT's prominence pushed internet scraping into the spotlight and exposed AI models' data scraping practices to widespread scrutiny.
Source: PPC Land
Modern AI scrapers can extract, clean, and organize data from almost any website automatically, even adapting to changes in site structure or layout. AI-powered scraping tools in 2025 are so advanced that they can mimic human browsing, bypass anti-bot systems, and adapt to changes in website structure, making data extraction from almost any public site routine. These tools use machine learning to understand complex, dynamic content, including JavaScript-heavy pages, and employ techniques like human behavior emulation and dynamic proxy rotation to avoid detection and maintain reliable access.
Many of the most popular models of the last few years, including OpenAI’s GPT-3, Google’s Gemini 2.0 Flash, and Meta’s Llama 2, were trained on web-crawled data, much of it drawn from Common Crawl. Common Crawl offers users "a copy of the Internet," serving as one of the largest and most widely used repositories of scraped data; it spans 250 billion web pages collected over 18 years, encompassing everything from blogs and Wikipedia to news articles and code repositories. Over half of Llama 2’s training data came from Common Crawl, illustrating just how central scraped web content has become to the AI ecosystem. Whether for competitive reasons or legal exposure, more recent models like GPT-4 and Llama 4 haven’t disclosed information about their training data.
The Death of the Link Economy
Traditionally, the web’s link economy meant that people visited websites, generating ad revenue for bloggers, forums, and niche media. Now, a growing share of users no longer want to browse links. They expect AI to distill the web into instant answers, reinforcing a feedback loop that deprioritizes original exploration. In the old model, attention flowed to the source; in the new model, content flows to the platform, competing directly with its sources of data. This shift is so significant that publishers and website owners are seeing dramatic drops in web traffic, as users increasingly get instant answers from AI systems rather than visiting sources.
The launch of Google's AI Overviews and AI Mode in 2025 was followed by immediate, dramatic declines in referral traffic to news outlets, with some publishers reporting traffic drops of 50% or more within weeks. Beyond Google's AI Overviews, other AI tools like Perplexity and ChatGPT have replaced search for 25% of users and are reportedly having a significant impact on traffic to major publishers.
Three prominent examples include the New York Times, Business Insider, and Washington Post. Data for the New York Times shows that organic search traffic dropped to 36.5% of total visits by April 2025, 44% less than three years earlier. Business Insider saw website traffic plummet by 55% between April 2024 and April 2025. The company cited "extreme traffic drops" as the reason for a 21% workforce reduction in May 2025. Finally, the Washington Post saw its online audience shrink by nearly half in 2025. Nicholas Thompson, CEO of The Atlantic, told his company in 2025 to expect traffic from Google to diminish to nearly zero over time as Google shifts from a search engine to an "answer engine."
By training their LLMs on publicly available data, generative AI companies capture the direct benefits of scraping, while the public bears its direct costs. And while major publishers can block or license to AI companies, independent creators and niche forums lack the resources to do so, making their content more vulnerable to scraping and uncredited summarization.
AI crawlers and scrapers contributed to a record 16% of all known-bot impressions in 2024, inflating traffic metrics and making it harder to measure genuine engagement. If users no longer visit sites, ad impressions fall, starving the revenue that funds free content, from niche sites to major outlets. As AI tools summarize and surface information directly, independent creators lose referral traffic, visibility, and potential ad or affiliate revenue, undermining the incentive to produce original work.
The collapse extends to knowledge-sharing communities. On Stack Overflow, the sum of questions and answers posted in April 2025 was down over 64% from April 2024, and more than 90% from April 2020, according to Stack Overflow's official data explorer. Developers are moving to Discord servers, niche forums, and even TikTok for code help, further fragmenting the traditional open web community.
Beyond the numbers, there's a temporal dimension to this crisis. Traditional journalism and research operate on human timescales. Processes like investigation, verification, and publication take days or weeks, whereas AI systems can process and synthesize information in seconds. While these systems still struggle with achieving the context and real-time accuracy that human journalists provide, they nevertheless create an asymmetric competition where human-generated content struggles to maintain relevance in fast-moving information cycles.
This speed differential particularly affects breaking news and technical documentation, where AI can provide instant summaries that reduce demand for the original reporting or detailed guides that took significant effort to produce.
Retreat Behind Paywalls & AI-Proofing
Websites are now fighting back for fear that AI crawlers will help displace them. But there's a problem: this pushback is also threatening the transparency and openness of the web that allow non-AI applications to flourish in the first place.
Companies on the internet previously made data publicly available and generated revenue through ads. However, the current business model is shifting toward safeguarding data on private websites, making it accessible only to registered or paying users. More than two-thirds of leading newspapers across the EU and the US now operate some kind of online paywall, a figure that has steadily increased since 2017. The New York Times alone boasts 10.8 million digital-only subscribers, with digital subscription revenue nearing $1 billion annually.
The rise of private APIs follows similar logic. While private APIs enable businesses to protect intellectual property and monetize data, they also limit experimentation, interoperability, and the free flow of information that defined the early web. Since mid-2023, websites have erected crawler restrictions on over 25% of the highest-quality data sources.
Some publishers have signed licensing deals; others are pursuing legal action or blocking bots. Major licensing deals include Reddit's $60 million per year agreement with Google (2024), giving Google access to Reddit's data for AI training, and the Associated Press's multi-year licensing deal with OpenAI (2023). The New York Times secured a three-year, $100 million deal with Google for a content and distribution partnership, but explicitly forbids scraping for AI training by other companies.
While major AI developers like OpenAI and Anthropic publicly commit to respecting website restrictions, reports suggest inconsistent compliance. Website operators have documented cases of aggressive crawling that overwhelms servers or ignores and circumvents robots.txt directives, despite those public commitments. This has spawned a new industry of protective services: companies like TollBit and ScalePost offer monetization tools for AI data usage, while infrastructure providers like Cloudflare have developed bot detection and traffic management systems to help websites control automated access.
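The first line of defense most of these systems share is user-agent denylisting. A minimal sketch (the helper function is hypothetical, but tokens like GPTBot, CCBot, and ClaudeBot are real crawler user agents published by their operators):

```python
# Illustrative subset of AI crawler tokens published by their operators.
AI_CRAWLER_TOKENS = {"gptbot", "ccbot", "claudebot", "google-extended", "bytespider"}

def is_ai_crawler(user_agent: str) -> bool:
    """Return True if the request's User-Agent matches a known AI crawler token."""
    ua = user_agent.lower()
    return any(token in ua for token in AI_CRAWLER_TOKENS)

# A server would run this check per request and return a 403 (or a toll page)
# on a match. Declared user agents are trivially spoofed, which is why
# providers like Cloudflare layer on behavioral and fingerprint signals.
print(is_ai_crawler("Mozilla/5.0 (compatible; GPTBot/1.1)"))  # True
print(is_ai_crawler("Mozilla/5.0 (Windows NT 10.0) Chrome/125.0"))  # False
```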
Legal cases are mounting. The New York Times is pursuing an ongoing lawsuit against OpenAI and Microsoft for copyright infringement over the use of its articles in AI training datasets. Over 88% of top US news outlets now block AI data collection bots from OpenAI and others.
A Self-Cannibalizing Knowledge System
This retreat behind paywalls could be creating a spiral into knowledge inequality. As premium content becomes increasingly gated, AI systems trained on freely available data may become less accurate or comprehensive over time, particularly for specialized domains. Meanwhile, those who can afford multiple subscriptions gain access to higher-quality information, while others rely on potentially degraded AI summaries.
Furthermore, as AI-generated content floods the web, it creates a feedback loop in which new models train on synthetic data from previous AI systems. This risks what researchers call "model collapse": degraded output quality as training data becomes increasingly artificial rather than human-generated. In image generation, this manifests as increasing blurriness and artifact accumulation; in text, it appears as semantic drift and reduced diversity of expression.
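The dynamic can be illustrated with a toy simulation (a stylized sketch, not a claim about real training pipelines): fit a Gaussian "model" to data, sample from it, refit on those samples, and repeat. Because each generation sees only the previous generation's output, sampling noise compounds and the fitted distribution loses variance over time, an analogue of the diversity loss described above:

```python
import random
import statistics

random.seed(42)

# Generation 0: the "human" data distribution.
mu, sigma = 0.0, 1.0
history = [sigma]

# Each generation fits a Gaussian to samples drawn from the previous
# generation's fitted Gaussian, then becomes the next data source.
for _ in range(200):
    samples = [random.gauss(mu, sigma) for _ in range(50)]
    mu, sigma = statistics.mean(samples), statistics.stdev(samples)
    history.append(sigma)

# The fitted standard deviation performs a downward-drifting random walk;
# in the long run it collapses toward zero.
print(f"std: gen 0 = {history[0]:.2f}, gen 200 = {history[-1]:.2f}")
```

In this toy setting the collapse is driven purely by finite-sample estimation error; real model collapse compounds this with architectural and objective-function biases.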
Source: Nature
More critically for the web's future, as AI systems produce content faster than humans can create it, the internet risks becoming primarily a training ground for machines rather than a space for human creativity and discovery.
What Kind of Internet Are We Building?
The internet is now undergoing a transition from an open knowledge commons to a privatized, AI-intermediated information ecosystem. Bundled content and exclusive partnerships are becoming more common. The EU and UK are considering opt-out copyright regimes for AI training, requiring explicit permission for scraping; California is advancing legislation to mandate transparency of materials used in AI training; and the US is increasing debate over new copyright protections for digital content.
The European Union's AI Act includes provisions for transparency in training data usage, while California's proposed legislation would require AI companies to disclose their data sources. However, enforcement remains challenging, particularly for international operators or companies that don't primarily operate in these jurisdictions. More fundamentally, regulation tends to lag technological change.
By the time comprehensive AI training data regulations are implemented and enforced, the current generation of models will already be trained, and new technical approaches may circumvent existing rules. Although lawmakers acknowledge the need for regulation, they fear restrictive regulations may lead the United States to lose its lead in the AI "arms race." Trying to force the web back to its early form won't work. That era was shaped by slower tools, different incentives, and a web less crowded by algorithms.
Despite the challenges, new models for content creation and distribution are emerging. Substack and similar newsletter platforms provide direct creator-audience relationships that bypass traditional advertising and search traffic dependencies. Patreon and OnlyFans demonstrate sustainable creator economies based on direct payment rather than attention arbitrage. Some publishers, like the Financial Times, are experimenting with creating content specifically designed to be discovered and cited by AI systems, with business models based on attribution and link-backs.
In addition, advertising may not completely disappear in an AI-driven internet. Tools like Perplexity, Google AI, and ChatGPT have shown some inclination towards including ads in their products. Perplexity first talked about the potential for running ads in April 2024 before starting to roll them out in results in early 2025. In December 2024, ChatGPT indicated that it was considering serving ads in the future, and in June 2025, Google started placing targeted ads inside third-party AI chatbot conversations. In March 2025, Perplexity’s Head of Advertising said the company believed it could “do a better job of surfacing advertising in a way that is truly incorporated into the user flow versus it being a distraction.”
But the presence of ads within AI responses only further distances the creators of original content from the value generated by user attention to that content. The invisible middleman makes value capture all but impossible for would-be creators of original content online. Some platforms are experimenting with human-in-the-loop AI systems, in which AI generates initial content that humans edit and verify before publication, but these approaches address content quality, not monetization. Creators' inability to monetize online remains an obstacle to any continuation of the open web.
Ultimately, the open web is unlikely to die outright. Instead, it risks becoming a training ground for AI companies: increasingly synthetic and AI-generated, less economically viable for human creators, and accessible to most users only through AI intermediaries. The open web is becoming a source of raw material, driving a shift away from foundational values of openness and accessibility and toward closed systems, paywalls, and machine-to-machine content loops. The speed and scale of this shift mean the next few years will determine whether the internet preserves its role as an open platform for independent creators, diverse voices, and knowledge-sharing, or becomes little more than infrastructure for machine learning.
Saildrone is developing ocean drones to understand the seafloor by collecting high-resolution data based on wave patterns and sensor data. To learn more, read our full memo here and check out some open roles below:
Senior DevOps Engineer - Alameda, CA (Hybrid)
Senior Software Engineer, Robotics - Alameda, CA (Hybrid)
ShipBob is a logistics company that provides a suite of products designed to streamline ecommerce supply chains. To learn more, read our full memo here and check out some open roles below:
Software Development Engineer II (Full Stack) - Remote (India)
Software Development Engineer III - Remote (India)
Tanium enables organizations to manage software and clients, monitor vulnerabilities and compliance, detect threats, respond to incidents, and discover unmanaged devices, all from a single, scalable platform. To learn more, read our full memo here and check out some open roles below:
AI Senior Software Engineer - Durham, NC
Senior Software Engineer, Backend/Full-Stack - Durham, NC
Waymo is an autonomous vehicle company spun out in 2016 from GoogleX’s Project Chauffeur team. Waymo’s primary product is Waymo Driver, a software and hardware suite that enables level 4 autonomous driving, meaning no rider supervision or intervention is required. To learn more, read our full memo here and check out some open roles below:
Machine Learning Engineer (Perception and Sensor Simulation) - Mountain View, CA
Senior Software Engineer, Planner Infrastructure - Mountain View, CA
Owner helps restaurants pull order traffic off third-party marketplaces into a direct channel it controls, boosting margins and helping restaurants capture first-party customer data to drive retention. To learn more, read our full memo here and check out some open roles below:
Senior Software Engineer, Full-Stack - Remote (US)
Senior Site Reliability Engineer - Remote (US)
Check out some standout roles from this week.
Databricks | Mountain View, CA - Engineering Manager (Applied AI), Senior Software Engineer (Backend), Senior Software Engineer (Compliance), Senior Software Engineer (IAM)
Glean | Palo Alto, CA - Senior Data Scientist (Core Product), Cloud Infrastructure Engineer, Lead Software Engineer (Data Foundations), Senior Site Reliability Engineer, Software Engineer (Fullstack), Software Engineer (Product Backend)
Rippling | San Francisco, CA - Forward Deployed Engineer, Senior Full Stack Engineer (Backend - Time Products), Senior Software Engineer (Data Bridge), Senior Fullstack Software Engineer (Benefits), Senior Software Engineer (Identity Platform)
IYO sues OpenAI over ‘io’ trademark. The Google X spin-out claims Sam Altman saw its smart-earbud pitch, passed, then copied the concept via Jony Ive’s newly acquired team, winning a temporary order that forced OpenAI to drop the “io” brand. Altman replied by sharing their email exchange, showing he declined to invest because he was “working on something competitive.” OpenAI also said in a court filing that what they are working on “is not an in-ear device, nor a wearable device.”
OpenAI leadership rejects 50% white-collar job-loss forecast from Anthropic’s CEO. Sam Altman and COO Brad Lightcap said there is no evidence to support Dario Amodei’s five-year displacement claim, signaling a more moderate narrative on AI’s labor impact. Lightcap said, “Dario’s a scientist, and I would hope he takes an evidence-based approach to these types of things.”
Kalshi, Polymarket both hit unicorn status in new funding rounds. Kalshi raised a $185 million Series C led by Paradigm at a $2 billion valuation, and The Information reports Polymarket is close to raising more than $200 million at a valuation of over $1 billion in a round led by Founders Fund. Polymarket reportedly has much more trading volume than Kalshi but, unlike Kalshi, it does not have a CFTC license to operate in the US.
Tesla launched its first Austin robotaxi rides. Tesla launched a limited, invite-only rollout of self-driving Model Y cabs in Austin on June 22. The service uses about 10-20 cars in a geofenced zone and charges $4.20 per ride. While overall successful, passenger videos showed several driving mistakes, and the National Highway Traffic Safety Administration contacted Tesla to gather more information about the incidents.
Intercom backs its Fin AI agent with $1 million performance guarantee. Intercom says it has beaten Decagon in 100% of “bake-offs” and is now offering new customers a $1 million refund if they are not satisfied with the first 90 days of using their AI resolution agent. According to Intercom’s CEO, Fin has eight figures of ARR and grew nearly 400% in Q1 2025.
Canva eyes a $400-$500 million secondary sale at a $37 billion valuation. This would be an increase from the $32 billion valuation from a secondary sale in October 2024, but still below the peak $40 billion valuation from a September 2021 funding round. According to The Information, Canva did $660 million in revenue and $150 million in EBITDA in Q1 2025.
Judge says Anthropic’s AI training is “fair use”, but storage wasn’t. US District Judge William Alsup said Anthropic’s use of copyrighted books to train Claude is “exceedingly transformative” and protected under the fair use doctrine, but he ordered a separate trial over the company’s download and storage of millions of pirated copies.
OpenAI quietly builds a ChatGPT docs suite to rival Workspace and Office. Designs revealed by The Information show in-app document editing and team chat, an expansion that pits OpenAI against Google’s and Microsoft’s software businesses. This comes as tensions between Microsoft and OpenAI have reached a “boiling point” over their existing partnership.
Airtable “relaunches” as an AI-native app platform with agents. The no-code software unveiled an “AI refounding” that lets users build production-ready apps, spin up thousands of agents, and access free AI credits on every plan. It’s a big product shift for Airtable, which last raised at an $11 billion valuation in 2021.
Western governors launch an “Energy Superabundance” push. Utah’s Spencer Cox and peers aim to scale production, modernize transmission, and fast-track advanced tech to cut costs and bolster grid reliability across the West. Cox has blamed the federal government for hobbling states’ ability to accelerate energy production because of permitting delays, outdated infrastructure, and overregulation.
An essay by Eric Flaningam outlines the logic behind Meta’s AI spending blitz. The essay argues there are four key reasons why Meta is making these investments: 1) it needs to own the next big platform, 2) it has one of the most profitable applications of AI in the world (ads), 3) it hasn’t been making enough progress on its models, and 4) it has no reasonable alternative.
Abridge raised a $300 million Series E at a $5.3 billion valuation. Andreessen Horowitz led the round for the clinical-documentation startup, which uses AI to automatically transcribe doctors’ conversations. The product is now live in 150 health systems, and Abridge says it will support more than 50 million medical conversations this year.
At Contrary Research, we’ve built the best starting place to understand private tech companies. We can't do it alone, nor would we want to. We focus on bringing together a variety of different perspectives.
That's why applications are open for our Research Fellowship. In the past, we've worked with software engineers, product managers, investors, and more. If you're interested in researching and writing about tech companies, apply here!
You should check out TollBit and its most recent report on the state of AI agents and the web. There's some great stuff in there.