SearchSpace
Benchmark

Comparing web search API providers on a Deep Research gauntlet

Every agentic web search API provider will tell you its results are the best, so we put four of the major providers head to head. An LLM judge assigns each provider an Elo rating across 32 unique research tasks that focus on real-world use cases rather than trivia.

An LLM judge runs a pairwise round-robin: for each brief, it is shown two providers' reports side by side (with every cited source, so the judge can verify grounding) and picks the winner on coverage, grounding, depth, and clarity. Everything else, including the environment, the agent, and the LLM, is held constant; the search provider is the only variable. Every pairwise outcome feeds a per-provider Elo rating, and the full win/loss matrix is published.
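To make the rating step concrete, here is a minimal sketch of how pairwise verdicts could feed Elo ratings. It uses the standard Elo update rule; the starting rating, the K-factor, the provider labels, and the sample verdicts are all illustrative assumptions, not the benchmark's published settings or results. It also assumes each pair of providers is judged once per task.

```python
from itertools import combinations

K = 32.0        # assumed update step; the benchmark's actual value is not stated
BASE = 1000.0   # assumed starting rating for every provider

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(ratings: dict, winner: str, loser: str, k: float = K) -> None:
    """Apply one judged game in place; points gained by the winner are lost by the loser."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta

providers = ["A", "B", "C", "D"]            # placeholder labels for four providers
pairs = list(combinations(providers, 2))    # every unordered pairing: 6 for 4 providers
games_total = len(pairs) * 32               # 6 pairings x 32 tasks, if each is judged once

ratings = {p: BASE for p in providers}
# Hypothetical verdicts as (winner, loser), for illustration only.
for winner, loser in [("A", "B"), ("A", "C"), ("B", "C")]:
    update_elo(ratings, winner, loser)
```

Sequential Elo updates depend on the order in which games are processed, so a real run would need to fix or average over that order; the sketch does not address this.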

Each task is written from scratch as an open-ended research question with a reproducible LLM prompt and hand-selected verticals. These verticals are selected from industry reports (linked below) showing where revenue is actually flowing across agentic and LLM use cases. The tasks are also intentionally not trivia: they're meant to be real-world questions a paying customer would want insight on.

The judge uses a larger model than the agents, and its cost is excluded from anything we'd attribute to a provider. The agents all run on the same smaller, more cost-efficient model, so the benchmark measures the search, not the synthesizer.

To bring the benefits of Deep Research to as many people as possible, the economics of providing a deep-research-like tool have to actually work. If offering Deep Research dossiers costs millions of dollars for millions of users, it will be limited to specialized use cases. We at SearchSpace see a future where web search access becomes a commodity, as LLMs have, so that the range of viable use cases grows.

Cost per research brief

agent LLM + search fees · lower is better
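As a rough illustration of the cost metric, the sketch below adds the agent's LLM token spend to the search fees for one brief. The field names, token counts, and prices are hypothetical placeholders, not measured values or any provider's actual pricing. Judge cost is left out, matching the description above.

```python
from dataclasses import dataclass

@dataclass
class BriefUsage:
    """Resources one agent run consumed for a single brief (hypothetical fields)."""
    input_tokens: int
    output_tokens: int
    search_calls: int

def cost_per_brief(usage: BriefUsage,
                   usd_per_m_input: float,
                   usd_per_m_output: float,
                   usd_per_search: float) -> float:
    """Agent LLM cost plus search fees, in USD. Judge cost is deliberately excluded."""
    llm_cost = (usage.input_tokens / 1e6) * usd_per_m_input \
             + (usage.output_tokens / 1e6) * usd_per_m_output
    search_cost = usage.search_calls * usd_per_search
    return llm_cost + search_cost

# Illustrative numbers only.
usage = BriefUsage(input_tokens=200_000, output_tokens=10_000, search_calls=40)
total = cost_per_brief(usage, usd_per_m_input=0.50, usd_per_m_output=2.00,
                       usd_per_search=0.005)
```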

Second, latency matters: an agent may run tens or hundreds of searches per report, so the time each search takes adds up. If a Deep Research report can be served in single-digit seconds instead of minutes, that is the equivalent of going from dial-up internet to gigabit fiber.

Search latency

p50 wall-clock per search call · lower is better

Search latency is the p50 (median) wall-clock time for the provider's API to return search results.
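A p50 figure like this could be collected roughly as follows: time each search call with a monotonic clock, then take the median of the samples. This is a sketch, not the benchmark's harness; `search_fn` stands in for any provider client, and the sample latencies are made up.

```python
import statistics
import time

def timed_call(search_fn, query):
    """Run one search and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = search_fn(query)
    return result, time.perf_counter() - start

def p50(latencies):
    """Median latency: half of the calls were at least this fast."""
    return statistics.median(latencies)

# Made-up latencies in seconds; the single slow call barely moves the median.
samples = [0.8, 0.3, 2.5, 0.4, 0.5]
median_latency = p50(samples)
```

The median is robust to a few slow outliers, which is why it says little about tail latency; a p95 or p99 would be needed for that.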

Some providers do better in specific fields, so a single Elo rating per provider collapses too much detail. We therefore report a full per-vertical breakdown of each provider's win rate. We'd also encourage you to explore the full data to see every dossier produced with each provider, along with the judge's explanation of why it chose the winner it did.
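A per-vertical breakdown of this kind can be computed directly from the judged games. The sketch below assumes each game is recorded as a (vertical, winner, loser) tuple, which is a hypothetical layout rather than the published database schema; the sample games are invented.

```python
from collections import defaultdict

def win_rates_by_vertical(games):
    """Map (vertical, provider) -> share of that provider's games in the vertical it won."""
    wins = defaultdict(int)
    played = defaultdict(int)
    for vertical, winner, loser in games:
        wins[(vertical, winner)] += 1
        played[(vertical, winner)] += 1
        played[(vertical, loser)] += 1
    return {key: wins[key] / count for key, count in played.items()}

# Invented games, for illustration only.
games = [
    ("legal", "A", "B"),
    ("legal", "A", "C"),
    ("crypto", "B", "A"),
]
rates = win_rates_by_vertical(games)
```

In this toy data, provider A wins every legal game but loses its only crypto game, the kind of split that a single overall rating would hide.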

A deep-research API isn't used for bar trivia; it's used for the questions AI applications actually handle in production, where the money is. So we chose eight categories from the highest-revenue, most token-hungry regions of the AI market: healthcare & biotech, science & IP, legal & regulatory, financial markets, crypto, software & developer tools, cybersecurity, and current events.

There are copious industry reports outlining where money is being spent and where tokens are being generated. Enterprise spending on generative AI roughly tripled to about $37 billion in 2025, and beyond the obvious software engineering, use cases sit in regulated, high-stakes verticals like healthcare and biotech, legal, and finance. Importantly, these verticals require fresh, external information that a model with a knowledge cutoff doesn't have memorized. That is precisely where the purpose-built AI-search vendors (Exa, Tavily, Parallel) sell: deep-research agents, competitive and market monitoring, financial research, technical-docs lookup, and news.

These market figures are 2025–2026 estimates from VC and analyst surveys (notably Menlo Ventures' enterprise survey and Sacra's private-company estimates). Treat them as directional, not audited, and fast-changing.

Benchmarks you can't critique, inspect, and judge yourself are a marketing claim. So this entire endeavor is published as a single database you open in your browser: every report each provider's agent produced, the sources it used, and the judge's winner and reasoning for every matchup. Read the actual reports next to each other, see the sources each agent had access to, and read why the judge decided how it did.

The interactive explorer loads the full run entirely in your browser, with no account required. Open any task to compare each provider's report and cited sources, and check out the judge's verdict on every game.

Open the data explorer