We wanted to help engineering managers answer a simple question: "How much is my team using AI coding tools?"
Simple question. Turns out, not a simple answer.
#The Integration Path (That Didn't Work Out)
We started where most people would—looking at direct integrations with the major AI coding tools.
Claude Code had the best API we found. Clean endpoints, good documentation, usage analytics by user and timeframe. We mapped out the fields we'd store: session counts, lines generated, acceptance rates. It looked promising.
Cursor requires an Enterprise account for API access. That rules it out for many of the teams we work with.
GitHub Copilot has a decent API, though it's more focused on admin controls than detailed usage analytics. Still workable.
OpenAI's API provides token usage, but nothing that connects to who's using it or in what context. Dead end.
We were ready to move forward with Claude Code as our first integration. Then we hit the attribution problem.
#The Email Matching Problem
Here's the thing about AI coding tools: developers often use different email addresses for different services.
Your GitHub account might use sarah@company.com. Your Claude Code account might use sarah.personal@gmail.com. Same person, different identity across systems.
We could solve this with an email mapping system—let teams manually link emails across services. Not elegant, but workable. We even built most of the backend for it.
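If you're curious, here's a minimal sketch of the idea, with illustrative field names rather than our actual schema:

```python
# Hypothetical sketch of the identity-mapping approach we abandoned.
# Teams would manually declare which service emails belong to the same person.
IDENTITY_MAP = {
    "sarah@company.com": "sarah",          # GitHub
    "sarah.personal@gmail.com": "sarah",   # Claude Code
}

def canonical_developer(email: str) -> str:
    """Resolve any service email to one canonical developer ID."""
    return IDENTITY_MAP.get(email.lower(), email.lower())
```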
But then we started asking: What are we actually trying to measure?
#The Pull Request Realization
The more we thought about it, the more we realized we were approaching this backwards.
We were trying to get generalized usage analytics over a time period: "Sarah used Claude Code for 15 hours this month and generated 2,000 lines of code."
But what does that actually tell you? Without context, it's just numbers. Did those 2,000 lines ship? Were they good quality? Did they solve real problems or introduce bugs?
We already have a source of truth in engineering teams: pull requests.
Every meaningful code change goes through a PR. PRs are reviewed, tested, merged, and deployed. They connect to actual features, bugs, and business value. They're timestamped, attributed to specific developers, and linked to repositories.
If we want to understand AI tool adoption, we should be asking: "Which pull requests were created with AI assistance?"
That question is answerable. It's specific. And it's useful.
#Switching to AI Detection
Instead of integrating with AI tool APIs, we started detecting AI usage directly in pull requests.
This shift changed everything. No more worrying about email mapping across services. No more dependency on specific tool APIs. No more incomplete data when developers use multiple AI tools.
Instead, we analyze the PR itself—the commits, the code, the patterns—and determine whether AI was involved.
#How Detection Works
We use a two-tier approach:
1. Explicit Attribution Check (Fast)
First, we scan for explicit markers that developers leave when using AI tools:
- AI bot commit authors: GitHub Copilot bot (`github-copilot[bot]`), Devin bot (`devin-bot`), Claude bot emails (`noreply@anthropic.com`)
- Claude Code footers: "🤖 Generated with Claude Code"
- GitHub Copilot co-author attributions: `Co-Authored-By: GitHub Copilot`
- Branch names: `codex/feature`, `cursor-refactor`, etc.
- PR labels: `codex`, `ai-generated`, etc.
If we find explicit attribution, we're done. No API call needed.
This catches about 75% of AI-generated PRs instantly. Bot authors are the most reliable signal—if a commit is authored by github-copilot[bot] or contains noreply@anthropic.com, that's 100% definitive with zero false positives.
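To make the tier-1 check concrete, here's a simplified sketch (not our production code; the structure of the `pr` dict and the helper name are illustrative, the marker strings come from the list above):

```python
import re

BOT_AUTHORS = {"github-copilot[bot]", "devin-bot"}
BOT_EMAILS = {"noreply@anthropic.com"}
FOOTER_PATTERNS = [r"🤖 Generated with Claude Code", r"Co-Authored-By: GitHub Copilot"]
BRANCH_PATTERNS = [r"^codex/", r"cursor"]
AI_LABELS = {"codex", "ai-generated"}

def explicit_ai_attribution(pr: dict) -> str | None:
    """Return the kind of explicit marker found, or None if there isn't one."""
    for commit in pr["commits"]:
        if commit["author_login"] in BOT_AUTHORS or commit["author_email"] in BOT_EMAILS:
            return "bot-author"          # 100% definitive, zero false positives
        if any(re.search(p, commit["message"]) for p in FOOTER_PATTERNS):
            return "commit-footer"
    if any(re.search(p, pr["branch"]) for p in BRANCH_PATTERNS):
        return "branch-name"
    if AI_LABELS & {label.lower() for label in pr["labels"]}:
        return "pr-label"
    return None  # fall through to tier 2 (AI-powered analysis)
```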
2. AI-Powered Detection (When Needed)
For PRs without explicit markers, we use OpenAI to analyze commit patterns and code structure:
Strong AI indicators:
- Commit messages that are unusually verbose with perfect grammar
- Highly structured PR descriptions with consistent formatting
- Multiple commits with identical structural patterns
- Detailed explanations for trivial changes
Human indicators:
- Incremental commits with typos ("WIP", "fix typo", "oops")
- Inconsistent formatting across commits
- References to conversations, tickets, or team members
- Domain-specific knowledge with business context
The prompt explicitly warns against penalizing good engineering practices. Well-documented code isn't evidence of AI. Clean commit messages aren't suspicious. We're looking for patterns that suggest AI assistance, not just professional work.
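Here's a minimal sketch of what tier 2 can look like using the OpenAI Python SDK. The model name, prompt wording, and function name are illustrative, not our production setup, but the shape is the same: describe the indicators, warn against penalizing good engineering, ask for a score and confidence level.

```python
from openai import OpenAI  # official openai package, v1+

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """You estimate whether a pull request was written with AI assistance.
Strong AI indicators: unusually verbose commit messages with perfect grammar,
highly structured PR descriptions, identical structural patterns across commits,
detailed explanations for trivial changes.
Human indicators: incremental commits with typos ("WIP", "fix typo"), inconsistent
formatting, references to conversations or tickets, domain-specific business context.
Do NOT treat clean, well-documented work as evidence of AI on its own.
Respond with JSON: {"score": 0-100, "confidence": "definitive|high|medium|low"}."""

def score_pr_with_llm(commit_messages: list[str], pr_description: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": pr_description + "\n\n" + "\n---\n".join(commit_messages)},
        ],
    )
    return response.choices[0].message.content
```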
#The Good Developer Problem
This is the tricky part: How do you distinguish between AI-generated code and code written by a senior developer who writes clean, well-documented commits?
You can't always, with 100% certainty. That's fine.
We're not trying to catch people or enforce policies. We're measuring adoption trends at the team level. If your detection is 85% accurate, that's good enough to see "our team is using AI tools more this quarter" or "most AI usage is coming from our frontend developers."
The key is being conservative with detection. We use confidence levels (Definitive, High, Medium, Low) and err on the side of "probably human" when we're uncertain.
We also give teams a configurable threshold. The default is 50 (scores of 50 or higher count as AI-assisted), but teams can lower it to 40 for broader detection or raise it to 70 for stricter accuracy.
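In code, that policy is just a few lines. A sketch, assuming the score and confidence come back from the analysis step above (field names are illustrative):

```python
DEFAULT_THRESHOLD = 50  # teams can lower to 40 (broader) or raise to 70 (stricter)

def classify(score: int, confidence: str, threshold: int = DEFAULT_THRESHOLD) -> str:
    # Err on the side of "probably human" when the model isn't sure.
    if confidence == "low":
        return "human"
    return "ai-assisted" if score >= threshold else "human"
```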
#The Retrospective Advantage
Here's another benefit of the PR-based approach: we can analyze historical pull requests without any setup.
When a team signs up, they don't need to configure integrations with Claude Code, GitHub Copilot, and whatever other tools their developers use. We just analyze their existing pull requests.
For teams on free trials, we scan the last 30 days. For paid teams, we go back 6 months.
This gives you immediate insights on day one. You can see adoption trends over time, identify when AI usage started ramping up, and understand current patterns in the context of historical behavior—no waiting, no integration configuration, no email mapping setup.
The alternative would be connecting to multiple AI tool APIs (each with different authentication, data formats, and permission models), solving the email attribution problem for each one, and then aggregating the results. That's weeks of work before you get your first insight.
With PR-based detection, the data is already there. Your pull requests exist, they have history, and they already connect to your developers through GitHub/Bitbucket accounts. We just run the detection and show you the results.
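The retrospective scan itself is unglamorous: page through recently merged PRs and run detection over each one. A simplified sketch against the GitHub REST API (window lengths match the text: 30 days for trials, roughly 180 for paid teams):

```python
from datetime import datetime, timedelta, timezone
import requests

def recent_pull_requests(owner: str, repo: str, token: str, days: int = 30) -> list[dict]:
    """Fetch merged PRs from the last `days` days for retrospective analysis."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    prs, page = [], 1
    while True:
        resp = requests.get(
            f"https://api.github.com/repos/{owner}/{repo}/pulls",
            params={"state": "closed", "sort": "updated", "direction": "desc",
                    "per_page": 100, "page": page},
            headers={"Authorization": f"Bearer {token}"},
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        for pr in batch:
            if pr.get("merged_at") and datetime.fromisoformat(pr["merged_at"].replace("Z", "+00:00")) >= cutoff:
                prs.append(pr)
        # Results are sorted by recency, so stop once we're past the window.
        if datetime.fromisoformat(batch[-1]["updated_at"].replace("Z", "+00:00")) < cutoff:
            break
        page += 1
    return prs
```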
#The Metrics That Matter
We settled on three core metrics to start:
AI Adoption Rate: What percentage of your active developers are using AI tools?
This answers "Is AI being adopted across the team, or just by a few people?" You want to see steady growth here, not just a couple of early adopters carrying the entire team.
AI-Assisted PR Rate: What percentage of your pull requests involve AI assistance?
This shows actual usage patterns. A team might have 80% adoption but only 20% of PRs use AI—that suggests occasional use, not deep integration into workflows.
Tool Breakdown: Which AI tools is your team using?
We detect Claude Code, GitHub Copilot, Cursor, ChatGPT, and others. This helps you understand whether your team has standardized on one tool or if everyone's using something different.
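Once every PR has a detection result, the three metrics fall out of simple aggregation. A sketch, assuming each result looks like `{"author": str, "ai_assisted": bool, "tool": str | None}` (that shape is illustrative):

```python
from collections import Counter

def team_metrics(results: list[dict]) -> dict:
    developers = {r["author"] for r in results}
    ai_developers = {r["author"] for r in results if r["ai_assisted"]}
    ai_prs = [r for r in results if r["ai_assisted"]]
    return {
        # Share of active developers with at least one AI-assisted PR
        "ai_adoption_rate": len(ai_developers) / len(developers) if developers else 0.0,
        # Share of PRs that involved AI assistance
        "ai_assisted_pr_rate": len(ai_prs) / len(results) if results else 0.0,
        # Which tools the AI-assisted PRs came from
        "tool_breakdown": Counter(r["tool"] for r in ai_prs if r["tool"]),
    }
```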
These metrics integrate into our existing dashboard alongside DORA metrics and code review analytics. You can see them over time, filter by repository, and track adoption trends.
#What We're Not Measuring (Yet)
We deliberately didn't try to measure:
AI code quality vs. human code quality: Too many variables, too easy to misinterpret. Good developers using AI write better code than junior developers without it. Comparing the two doesn't tell you much.
Productivity gains from AI: You'd need a control group and consistent tasks to measure this properly. Most teams don't have that setup.
Individual developer AI dependency: We show team-level metrics, not individual "AI usage scores." The goal is team insights, not surveillance.
Maybe we'll add some of these later. Maybe not. We're starting with what's useful and not creepy.
#Lessons from Building This
1. API integrations aren't always the answer
We spent a week planning Claude Code integration before realizing we didn't need it. Sometimes the data you want is already in your system, just in a different form.
2. Source of truth matters
Pull requests are already your source of truth for code changes. Building on top of that foundation is easier than introducing a parallel tracking system.
3. Imperfect detection beats perfect integration
An 85% accurate detection system that works for all AI tools is more useful than a 100% accurate integration with one tool that only 30% of your team uses.
4. Context is everything
"Sarah used AI for 10 hours this week" tells you nothing. "Sarah's last 3 PRs were AI-assisted and all merged within a day" tells you she's using AI effectively.
5. Conservative detection prevents false accusations
When in doubt, assume human. You can always make detection stricter later. Starting too aggressive creates trust issues.
6. Zero-setup historical analysis is a huge win
No integrations to configure, no email mappings to solve, no waiting for data to accumulate. The PRs are already there with their full history. Day one insights beat weeks of integration work.
#What We're Still Wrestling With
We're confident in the approach, but honest about what we don't know yet.
How accurate is "85% accurate"?
That's our estimate based on testing with known Claude Code users. We don't have a perfect validation dataset. It's probably in the 80-90% range, but we're not certain.
The more explicit attribution we find (Claude Code footers, Copilot markers), the more confident we can be. When we're doing pure pattern analysis, accuracy varies more.
Are we biased toward certain tools?
Yes, probably. Claude Code users are likely over-represented because of automatic footers. Cursor users who don't mark their PRs might be under-counted. Developers using ChatGPT to write code that they then manually commit are probably invisible.
We're measuring what we can detect, not absolute truth. That's fine for team-level trends, but it means you shouldn't obsess over exact percentages.
What about senior developers?
A skilled developer using AI but writing natural commit messages might be undetectable. Our metrics might skew toward obvious AI usage patterns.
This could mean we're undercounting AI usage among experienced developers who've learned to use AI tools subtly. Or it could mean our detection correctly identifies AI-generated patterns that even good developers leave behind.
We don't know which. Probably some of both.
Is tool breakdown actually useful?
We track which tools are being used, but honestly, we're not sure if that information drives decisions.
Maybe it matters if you're deciding which tools to license for the team. Maybe it helps you understand if everyone's on the same page or using different tools. Maybe it's just interesting trivia that doesn't change anything.
We included it because we can measure it and it might be useful. But we're not convinced it's as valuable as adoption rates or usage patterns.
What's the real accuracy of our detection?
The honest answer: we don't know precisely. We know explicit attribution is 100% accurate. We know our AI analysis is conservative and rarely produces false positives. Based on spot-checking known cases, we're probably 80-90% accurate overall.
But without a large validated dataset, that's an educated guess, not a proven fact.
These aren't failures—they're the reality of any measurement system. We're starting with useful-but-imperfect data rather than waiting for perfect data that might never come.
The alternative is having no visibility at all into AI adoption. Imperfect measurement beats no measurement, as long as you know the limitations.
#What's Next
We're rolling this out with the three core metrics: adoption rate, PR assistance rate, and tool breakdown.
Next steps we're considering:
Improving detection accuracy: As we gather more data, we can refine the patterns we look for. Maybe commit timing patterns (AI tools tend to produce commits in bursts). Maybe code style consistency (AI is very consistent, humans less so).
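As a purely speculative sketch of that timing signal (the two-minute threshold is made up for illustration):

```python
from datetime import datetime

def commit_burstiness(commit_timestamps: list[datetime]) -> float:
    """Fraction of consecutive commits made less than 2 minutes apart."""
    if len(commit_timestamps) < 2:
        return 0.0
    ts = sorted(commit_timestamps)
    close = sum(1 for a, b in zip(ts, ts[1:]) if (b - a).total_seconds() < 120)
    return close / (len(ts) - 1)
```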
Correlation with other metrics: Does AI usage correlate with faster cycle times? Lower bug rates? Better code review feedback? We'll have the data to explore these questions.
Team-specific customization: Different teams use AI differently. Backend developers might use it for boilerplate. Frontend developers for component generation. Letting teams customize detection for their context could improve accuracy.
Integration with team health metrics: Is AI helping reduce burnout by automating tedious work? Or creating pressure to ship faster? Understanding the human impact matters as much as the technical metrics.
#The Real Question
"How much is my team using AI tools?" is really asking: "Is AI helping my team work better?"
Usage metrics are just the starting point. The interesting questions come after: What are they using it for? How does it affect quality and velocity? Are some team members benefiting more than others?
We can't answer all these questions yet. But by starting with PR-based detection, we've built a foundation that grows with the data.
If you're trying to measure AI adoption in your team, start simple. Don't over-engineer. And remember: you're measuring trends, not judging developers.
The goal is insight, not surveillance.
Tracking AI usage in your engineering team? We've built AI detection into Coderbuds alongside DORA metrics and code review analytics. Start tracking for free.