Why code search at scale is essential when you grow beyond one repository

Graham McBain
December 10, 2025

Your AI coding assistant can write code, but can it tell you where that endpoint is called across your 500 microservices spread out across dozens of repositories? As engineering organizations scale from dozens to thousands of repositories, a critical gap emerges: AI tools optimize for the code you're writing, but enterprises need visibility into all the code that already exists. This is the big code problem, and it explains why organizations like Uber, Stripe, Dropbox, and 4 of the top 6 US banks have standardized on Sourcegraph.

The shift to agentic coding workflows has paradoxically made large-scale code search more valuable, not less. Claude Code, Cursor, and Codex excel at generating new code within your current project, but they operate in isolation, unable to answer the questions that matter most at scale: Where is this API consumed? Who depends on this deprecated function? What's the blast radius of this change across our entire organization?

The fundamental limitation of AI coding assistants

Every major AI coding tool shares the same architectural constraint: workspace-focused context. Cursor indexes your local workspace using cloud-hosted embeddings. Claude Code searches on-demand using grep and glob within your current directory. OpenAI Codex operates on a single working directory. Even Windsurf's enterprise remote indexing requires manual repository-by-repository configuration.

None of these tools can automatically discover and search across all repositories in an enterprise organization.

Consider what this means practically. You're an engineer at a company with 400 repositories across three code hosts. You've written a payment processing endpoint and need to know everywhere it's called before modifying its behavior. With Cursor or Claude Code, you'd need to clone all 400 repositories locally, open them in your workspace, and hope the context window captures the relevant callers. This isn't a workflow; it's a workaround.

Research from Augment Code's engineering team captures the problem precisely: "Custom decorators buried three directories deep, subtle overrides in sibling microservices, critical business logic scattered across modules, all of this remains invisible to the model." The result is AI suggestions that "seem correct in isolation but fail when integrated with existing systems."

Some coding agents are starting to address this gap. Amp's Librarian subagent, for example, can search across GitHub repositories beyond your local workspace—and it works well for quick cross-repo research. But these features are early, limited to single code hosts (GitHub only for Librarian), and designed for developer exploration rather than enterprise-scale enumeration. When you need to search across thousands of repositories spanning GitHub, GitLab, Bitbucket, and Perforce—with sub-second latency, audit logging, and compliance controls—you need purpose-built infrastructure, not an agent subfeature.

How Sourcegraph solves cross-repository code discovery

Code search at scale is our primary objective, and the Sourcegraph platform is designed from the ground up for it: the infrastructure, the application driving searches, and the query language all serve that goal. The platform connects to your code hosts (GitHub, GitLab, Bitbucket, Gerrit, Perforce, Azure DevOps) and indexes every repository into a unified search corpus using Zoekt, a trigram-based search engine that delivers sub-second queries across billions of lines of code.
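To make the trigram idea concrete, here is a toy sketch (not Zoekt's actual implementation) of how a trigram index narrows a substring search down to a few candidate documents before verifying exact matches:

```python
from collections import defaultdict

def trigrams(text):
    """Every 3-character substring of the text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

class TrigramIndex:
    """Toy trigram index in the spirit of Zoekt: map each trigram to the
    documents containing it, so a substring query only scans a small
    candidate set instead of the whole corpus."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for tri in trigrams(text):
            self.postings[tri].add(doc_id)

    def search(self, pattern):
        # Intersect posting lists for the pattern's trigrams, then verify
        # the candidates with an exact substring check.
        tris = trigrams(pattern)
        if not tris:
            return sorted(self.docs)  # pattern too short to prune
        candidates = set.intersection(*(self.postings[t] for t in tris))
        return sorted(d for d in candidates if pattern in self.docs[d])

# Hypothetical repo paths and handlers, purely for illustration.
idx = TrigramIndex()
idx.add("payments-svc/main.go", 'mux.Handle("/api/v2/payments", h)')
idx.add("orders-svc/main.go", 'mux.Handle("/api/v1/orders", h)')
print(idx.search("/api/v2/payments"))  # ['payments-svc/main.go']
```

The production system adds sharding, ranking, and regex support on top, but the core trick is the same: posting-list intersection makes candidate selection cheap enough to run across billions of lines.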

This architectural decision enables the queries that matter for enterprises:

repo:myorg/.* endpoint("/api/v2/payments") type:symbol

This single query searches for the payments endpoint symbol across every repository in your organization, returning every definition and reference. No cloning. No context window limits. Deterministic, auditable results.

The query language provides precision that LLM-based semantic search cannot. Sourcegraph's search syntax supports:

  • Repository filtering: repo:github.com/myorg/.* matches all repositories in an organization
  • Path filtering: file:\.go$ file:internal/ finds Go files in internal directories
  • Symbol search: type:symbol CreateUser finds function and class definitions
  • Diff search: type:diff after:"2 weeks ago" author:security-team tracks recent security changes
  • Commit message search: type:commit SLO finds commits that mention SLOs in their message—essential for understanding why changes were made, not just what changed
  • Boolean logic: (repo:service-a OR repo:service-b) AND deprecated lang:python
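As a toy illustration of these filter semantics (this is not Sourcegraph's implementation; the field names simply mirror the filters above), a minimal evaluator over an in-memory document might look like:

```python
import re

def matches(filters, doc):
    """Toy evaluator for a few Sourcegraph-style filters against a single
    in-memory document: doc = {"repo": ..., "path": ..., "content": ...}.
    Every filter must pass, mirroring the implicit AND between filters."""
    for key, value in filters:
        if key == "repo" and not re.search(value, doc["repo"]):
            return False
        if key == "file" and not re.search(value, doc["path"]):
            return False
        if key == "content" and value not in doc["content"]:
            return False
    return True

# Hypothetical document, for illustration only.
doc = {"repo": "github.com/myorg/billing",
       "path": "internal/pay.go",
       "content": "calls the deprecated helper"}
print(matches([("repo", r"myorg/.*"), ("file", r"\.go$"),
               ("content", "deprecated")], doc))  # True
```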

This precision is essential for security, compliance, and large-scale refactoring: use cases where you need every instance, not a representative sample.

Deep Search bridges semantic exploration and deterministic answers

Sourcegraph's Deep Search recognizes that developers need both natural language exploration and precise enumeration. Deep Search is an agentic search system powered by state-of-the-art models that understands questions like:

  • "How does authentication flow through our backend services?"
  • "Find examples of rate limiting implementations across our repositories"
  • "Which GraphQL APIs in the sourcegraph repo appear unused?"
  • "What's the history of our caching layer? How has it evolved over time?"

Deep Search uses Sourcegraph's code search and navigation as tools, iteratively refining its understanding through multiple search queries. The key differentiator: Deep Search shows which searches it performed as sources, enabling engineers to transition from semantic exploration to deterministic enumeration. You start with "how does X work?" and end with a precise query that finds every instance.

This workflow, using semantic search to explore and the query language to enumerate, is unavailable in any AI coding assistant. Claude Code and Cursor can answer questions about code they can see, but they cannot provide exhaustive, reproducible searches across an entire organization's codebase.

The cross-repository use cases that define enterprise value

Large engineering organizations consistently face scenarios where workspace-limited tools fail:

Impact analysis before changes. A platform team needs to modify a core authentication library. Before making the change, they need to identify every service that imports this library, understand how each service uses it, and estimate the migration effort. With Sourcegraph: repo:myorg/.* file:go.mod content:"auth-lib" finds every Go service with the dependency. Cross-repository code navigation then shows exactly how each service calls the library's methods.

API deprecation and migration tracking. Engineering leadership mandates deprecating an old REST endpoint in favor of a GraphQL API. Sourcegraph's Code Insights can track migration progress over time: how many repositories still reference the deprecated endpoint, how that number trends week-over-week, and which teams own the remaining callers.

Security vulnerability response. A critical CVE affects a logging library. Security teams need to find every instance of the vulnerable pattern across all repositories before the disclosure window closes: not a representative sample, but every single instance. A deterministic search like lang:java logger.format(userInput) returns auditable, complete results.

Onboarding acceleration. New engineers can search across the entire codebase without asking colleagues which repositories to clone. As one CERN engineer noted: "Sourcegraph helped me answer a question in like 5 seconds flat... Normally I probably would have bugged a bunch of people."

These scenarios share a common thread: they require organization-wide visibility that no single-repository tool provides.

Sourcegraph in agentic workflows: MCP integration

The agentic AI future doesn't eliminate the need for code search at scale; it amplifies it. AI agents making autonomous code changes need reliable, deterministic tools for understanding existing codebases. Sourcegraph's MCP (Model Context Protocol) server exposes code search and navigation capabilities directly to AI agents.

This means Claude, Codex, and other coding agents can query Sourcegraph's best-in-class search index programmatically, gaining organization-wide code context. The combination is powerful: AI creativity for generating code, Sourcegraph precision for understanding the codebase that code will integrate with.
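As a sketch of what this wiring can look like, an MCP-capable client such as Claude Code reads server definitions from a project-level .mcp.json. The server package name below is hypothetical, and the SRC_ENDPOINT/SRC_ACCESS_TOKEN variables follow the src CLI's conventions; check Sourcegraph's MCP documentation for the actual command:

```json
{
  "mcpServers": {
    "sourcegraph": {
      "command": "npx",
      "args": ["-y", "@sourcegraph/mcp-server"],
      "env": {
        "SRC_ENDPOINT": "https://sourcegraph.example.com",
        "SRC_ACCESS_TOKEN": "<your-access-token>"
      }
    }
  }
}
```

Once registered, the agent can call the server's search tools the same way it calls any other MCP tool, so organization-wide queries become part of its normal planning loop.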

Batch Changes: from finding problems to fixing them at scale

Search identifies problems; Batch Changes solves them organization-wide. Sourcegraph's Batch Changes product creates pull requests across every repository matching a search query, then tracks those changes through CI checks and code review to merge in a unified dashboard.

The workflow:

  1. Write a declarative batch spec defining what to find and how to change it
  2. Preview changes before creating pull requests
  3. Apply the batch change to create PRs across all affected repositories
  4. Track progress through review to merge from a single dashboard
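For illustration, a minimal batch spec might look like the following. The logger names and the comby rewrite are hypothetical; the spec fields (name, on, steps, changesetTemplate) follow Sourcegraph's documented batch spec format:

```yaml
name: replace-deprecated-logger
description: Replace the deprecated logger.format call with logf.Sanitize
# 1. What to find: every repository matching a Sourcegraph search query
on:
  - repositoriesMatchingQuery: lang:go logger.format patternType:literal
# 2. How to change it: run a rewrite step in a container in each repo
steps:
  - run: comby -in-place 'logger.format(:[args])' 'logf.Sanitize(:[args])' .go -matcher .go -exclude-dir .,vendor
    container: comby/comby
# 3. The pull request to open in each affected repository
changesetTemplate:
  title: Replace deprecated logger.format
  body: Automated migration created with Sourcegraph Batch Changes.
  branch: batch-changes/replace-deprecated-logger
  commit:
    message: Replace deprecated logger.format with logf.Sanitize
  published: false
```

With published: false, the preview step shows every diff before any PR is opened, which is what makes organization-wide changes reviewable rather than risky.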

Workiva reported an 80% reduction in the time required for large-scale code changes. Indeed called Batch Changes a "key capability for reducing the hidden burden of updates pushed across teams."

This find-and-fix capability doesn't exist in any AI coding assistant. You can ask Claude Code to modify files in your current directory, but you cannot orchestrate coordinated changes across 200 repositories with tracked review status.

Enterprise features for regulated environments

Sourcegraph offers deployment flexibility that AI coding assistants cannot match:

  • Self-hosted and air-gapped deployment: Customer code never leaves their infrastructure; Sourcegraph employees have no access to customer code
  • SOC 2 Type II and ISO 27001 certified: Enterprise security compliance with annual third-party penetration testing
  • Repository permission inheritance: Automatically syncs permissions from connected code hosts (GitHub, GitLab, Bitbucket)
  • Multi-code host unification: Single search interface across GitHub Enterprise, GitLab self-managed, Bitbucket Data Center, Perforce, and Gerrit

For engineering leaders in regulated industries (financial services, healthcare, government), these capabilities are non-negotiable. The ability to search code without sending it to third-party cloud services eliminates entire categories of compliance concerns.

Building the case for code search at scale

If you've used Sourcegraph at a previous company, you know the productivity difference it enables. Building the case for adoption at a new organization requires connecting capabilities to business outcomes:

Developer productivity: Forrester research indicates knowledge workers spend up to 30% of their time searching for information. Code search at scale transforms code discovery from interrupting colleagues into self-service queries.

Security response time: When a vulnerability drops, the difference between finding all instances in minutes versus days directly impacts exposure window and remediation cost.

Technical debt visibility: Code Insights dashboards quantify technical debt (deprecated API usage, outdated dependencies, migration progress) in terms leadership can track and prioritize.

Agentic AI effectiveness: As your organization adopts AI coding tools, Sourcegraph provides the organization-wide context those tools lack, making AI-generated code more likely to integrate successfully with existing systems.

The question isn't whether your engineers search code; they do, constantly. The question is whether they search efficiently across your entire codebase, or inefficiently across whatever repositories they happen to have cloned locally.

How competing solutions compare

GitHub Code Search: capable but constrained

GitHub Code Search is actively maintained and supports cross-repository queries using org: and enterprise: qualifiers. The query syntax handles regex, symbol search, and boolean operations. For organizations fully committed to GitHub, it provides meaningful code search capabilities.

However, GitHub Code Search only searches GitHub-hosted repositories. Enterprises using GitLab, Bitbucket, Perforce, or multiple code hosts cannot achieve unified search. The query language also lacks Sourcegraph's diff and commit search capabilities: you can't search through commit messages or track how code changed over time.

Augment Code: semantic understanding without traditional search

Augment Code's Context Engine provides impressive semantic understanding of codebases, indexing codebases of 100 million+ lines with multi-repository awareness. The platform excels at answering questions like "where is the payment token validated?" with results spanning multiple services.

The limitation: Augment Code is an AI assistant, not a code search engine. It provides semantic, AI-interpreted responses rather than deterministic query results. You cannot write a regex to find every instance of a vulnerable pattern; you ask a question and receive an AI-curated answer. For compliance and security enumeration, this distinction matters.

Open-source alternatives: Hound, OpenGrok, Livegrep

Traditional open-source tools serve important roles at smaller scales:

Hound (created by Etsy) offers simple deployment and fast trigram-based regex search across multiple repositories. However, its single-server architecture limits scale, and it provides no semantic search or code intelligence.

OpenGrok (Oracle) supports multiple version control systems and provides web-based cross-referencing. The tradeoff: complex setup requiring Java, Tomcat, and ctags, with significant memory requirements (8-16GB+ JVM heap) and challenging operational overhead for large deployments.

Livegrep (from Stripe) delivers blazing-fast regex search for gigabyte-scale codebases, but explicitly targets single-digit gigabyte scale, not enterprise-wide deployment across thousands of repositories.

None offer semantic/AI-powered search, enterprise security features (SSO, RBAC, audit logging), or the ability to act on search results through automated changes.

grep and ripgrep: necessary but insufficient

Every developer uses grep or ripgrep for local searching; ripgrep can search the Linux kernel source 32x faster than GNU grep. But command-line tools require local file access, provide no persistent index or web interface, and cannot search repositories that haven't been cloned.

The limitation becomes acute at scale: you can't grep a repository you don't have locally, and no enterprise expects developers to clone 500+ repositories to answer basic questions about code usage.
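The local-only model is easy to demonstrate; grep can only see files that already exist on disk (the paths and code below are illustrative):

```shell
# grep/ripgrep search only what is already on disk.
mkdir -p /tmp/grep-demo
printf 'func main() { logger.Format("x") }\n' > /tmp/grep-demo/main.go
# Finds the call in the one file we have; every uncloned repository
# in the organization is simply invisible to this search.
grep -rn 'logger\.Format' /tmp/grep-demo
```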

Conclusion

Code search at scale solves a problem that scales faster than headcount: as repositories multiply, the cost of fragmented code discovery grows exponentially. AI coding assistants accelerate code generation but cannot provide organization-wide visibility; the two are complementary capabilities, not substitutes.

Sourcegraph's combination of deterministic query language, semantic Deep Search, cross-repository code navigation, and Batch Changes creates a workflow unavailable elsewhere: explore with natural language, enumerate with precise queries, act with automated changes, track with unified dashboards. The MCP integration ensures these capabilities enhance rather than compete with agentic AI workflows.

For engineering leaders evaluating code search, the decision framework is straightforward: if your organization has more repositories than any developer can reasonably clone, if you use multiple code hosts, if security response requires exhaustive enumeration, or if you're building agentic AI workflows that need reliable code context, code search at scale isn't optional. It's infrastructure.
