Cosplaying the eval process

Should compliance be built into AI coding tools, or should AI tools be observable by existing compliance systems? Most enterprise compliance tooling assumes the latter, so I wanted to explore what the former might look like. To do that, I cosplayed the eval process: I imagined what evaluation systems might look like if they addressed real business problems.
While studying Claude Code and other AI coding tools, I stumbled onto something interesting: the compliance gap. These tools generate incredible value for individual developers but become potential liabilities at enterprise scale. Every feature that makes them powerful for individuals (speed, autonomy, broad capability) becomes a risk when deployed across an organization that can’t audit what the AI is actually doing.
Tags: Eval Systems, Claude Code, React, TypeScript

What I didn't do

I didn't interview compliance officers, use real violations, or test on actual code. This is pure speculation: a thought experiment about what evaluation systems could look like if they solved business problems.
Everything below uses synthetic examples and fictional metrics to illustrate concepts. I'm cosplaying the eval process to explore architectural questions, not conducting research or making claims about what actually works in production.
I'm borrowing concepts from the eval framework, such as error analysis, judge training, and TPR/TNR metrics, while skipping operational elements like golden datasets, production monitoring, and the improvement flywheel that a real implementation would require.

The discovery process

I started by researching how enterprises adopt AI coding tools, gathering perspectives from online communities and technical articles about how teams integrate them into their engineering workflows. Three patterns emerged:
Velocity Maximizers focus on compressing development timelines from months to weeks
Debt Eliminators work to modernize legacy systems while maintaining velocity
Compliance Maintainers seek to maintain speed despite regulatory overhead
All three patterns share a common tension: they want AI to accelerate development while maintaining the governance and quality standards that enterprises require. This made me think that compliance might be where evaluation becomes critically important. When code generated by AI needs to pass SOC2 audits or handle PCI data, “it works” isn’t enough. Organizations might need systematic evaluation that can prove compliance, not just functionality.

The enterprise dilemma

Currently, deploying AI coding tools in enterprises gives developers superpowers with no way to monitor how they're used. Even with on-premises deployment, the core problems persist:
// Developer prompt:
"Make our user deletion more efficient"

// AI generates technically correct but legally problematic code
// No GDPR compliance, no audit trail, SQL injection risk
// Developer gets working code, company gets sued
The developer asked for efficiency but got code with compliance issues they didn’t know to look for. AI tools change how this problem manifests: instead of forgetting to implement audit logging yourself, you forget to ask the AI for it.
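To make that gap concrete, here is a rough sketch of what a compliance-aware version of the same deletion might need to include. Everything here (the Db interface, the table names, the deletion_requests check) is an illustrative assumption, not real generated output or a real schema.

// Hypothetical sketch: a compliance-aware version of "delete this user".
// The Db interface, table names, and audit fields are illustrative assumptions.
interface Db {
  execute(sql: string, params?: unknown[]): Promise<any[]>;
}

async function deleteUserCompliantly(db: Db, userId: string, requestedBy: string) {
  // Verify the user actually requested erasure before deleting anything
  const requests = await db.execute(
    "SELECT id FROM deletion_requests WHERE user_id = ? AND confirmed = true",
    [userId]
  );
  if (requests.length === 0) {
    throw new Error("No confirmed deletion request for this user");
  }

  // Parameterized queries instead of string interpolation (no SQL injection),
  // plus cascade deletion of related user data
  await db.execute("DELETE FROM sessions WHERE user_id = ?", [userId]);
  await db.execute("DELETE FROM preferences WHERE user_id = ?", [userId]);
  await db.execute("DELETE FROM users WHERE id = ?", [userId]);

  // Audit trail for the deletion itself
  await db.execute(
    "INSERT INTO audit_log (action, user_id, actor, at) VALUES (?, ?, ?, ?)",
    ["user_deleted", userId, requestedBy, new Date().toISOString()]
  );
}

The point isn't this particular implementation; it's that none of these steps are things a developer would think to prompt for when all they asked about was efficiency.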

Testing the hypothesis

So I had a hypothesis: compliance automation could be a compelling use case for practical eval systems. But I needed to explore whether it actually made sense: what would systematic evaluation that helps rather than hinders AI-assisted development look like?
Important disclaimer: Everything below uses synthetic examples to illustrate concepts. These are thought experiments to test whether compliance evaluation makes sense as a practical use case.
I'm using the eval framework from Hamel Husain and Shreya Shankar's work as a structure for this exploration.

Experiment 1: Error Analysis

First, I wanted to see if compliance violations had recognizable patterns. To explore this, I imagined analyzing synthetic coding interactions and asking whether error-analysis patterns could theoretically apply from a compliance perspective. The goal was to turn vague regulatory fears into specific, actionable patterns.
Important: This should never be done with synthetic data in the real world. Actual compliance requires real violations, real regulations, and expert knowledge. I'm oversimplifying compliance issues to look like cookie-cutter problems, which they're not. This is purely conceptual.
Error Analysis Trace Viewer
Developer Prompt
Make our user deletion more efficient
Generated Code
function deleteUser(userId: string) {
  const query = `DELETE FROM users WHERE id = ${userId}`;
  return db.execute(query);
}
Compliance Notes
GDPR requires audit trail for all user data deletions
SQL injection vulnerability through direct string interpolation
Missing cascade deletion for related user data (sessions, preferences, etc.)
No verification that user consented to deletion
The interface above is a trace viewer, which provides a detailed breakdown of what happened during code generation. It shows the developer's original request, the AI-generated code with highlighted potential issues, and compliance notes explaining what went wrong and why, written as if explaining regulations to someone who's never encountered them. In a real implementation, this information would come from a domain expert. The eval framework calls this a "benevolent dictator," someone who understands both the regulations and how they apply to code.
Click any highlighted line to see which regulation it might violate, why this matters in plain English, and how often this pattern appears. The concept would be to reveal compliance implications that developers might not think to look for when reviewing AI-generated code.
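For a sense of what would sit behind such a viewer, here is a hypothetical shape for a single trace record. The field names are my assumption for illustration, not part of any real tool or framework.

// Hypothetical data shape for one annotated trace; field names are illustrative.
interface ComplianceFinding {
  line: number;         // which line of the generated code is flagged
  regulation: string;   // e.g. "GDPR" or "PCI-DSS"
  explanation: string;  // plain-English note written for non-experts
  frequency: number;    // how often this pattern shows up across collected traces
}

interface ComplianceTrace {
  developerPrompt: string;        // the original request
  generatedCode: string;          // what the AI produced
  findings: ComplianceFinding[];  // annotations from the domain expert ("benevolent dictator")
}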
Pattern Analysis
Click any bar to see violation examples and regulatory impact
After collecting enough trace data from the analysis above, the next step would be to look for patterns in the violations. This conceptual pattern analysis suggests that most violations would likely happen in areas developers rarely think about when reviewing code.
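The aggregation behind that kind of chart would be straightforward. Assuming the hypothetical ComplianceTrace shape sketched above, it might look like this:

// Count how often each regulation shows up across collected traces.
// Assumes the hypothetical ComplianceTrace / ComplianceFinding shapes above.
function violationCounts(traces: ComplianceTrace[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const trace of traces) {
    for (const finding of trace.findings) {
      counts.set(finding.regulation, (counts.get(finding.regulation) ?? 0) + 1);
    }
  }
  return counts;
}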

Experiment 2: Training a Judge

Next question: could you theoretically train a model to reliably identify these compliance patterns? To illustrate the concept, imagine building a judge that goes from 34% to 90% accuracy through systematic refinement. This conceptual prototype explores what that iterative process might look like.
Judge Training Evolution
Systematic refinement from 34% to 90% accuracy
Overall accuracy: 90%
Performance breakdown
Catches violations (TPR): 91%
Passes clean code (TNR): 88%
False alarms (FPR): 12%
Misses violations (FNR): 9%
Training examples: 75 total (+35 added)
Key changes
Added business context awareness
Nuanced example distinctions
Reasoning requirement
To illustrate the concept, imagine this evolution from 34% to 90% accuracy as a demonstration of how systematic refinement might work in practice. Version 0.1 would start with naive yes/no classification: no regulatory knowledge, no examples, just "is this compliant?" Predictably terrible, at a fictional 34% accuracy.
Version 0.2 would add regulation-specific rules (GDPR, PCI-DSS, SOC2), potentially improving to a fictional 59% but creating too many false positives. Version 0.3 might add concrete violation examples, reaching an imaginary 75% but still lacking nuance. The conceptual breakthrough in v1.0 would be adding business context awareness to achieve balance with fictional metrics of 91% true positive rate and 88% true negative rate.
You can click on any iteration to see the actual prompts used and understand how small changes might compound. In theory, the final version would learn to distinguish between caching user preferences (usually fine) and caching payment tokens (never fine). This kind of nuanced judgment could potentially emerge from systematic training with real metrics rather than clever prompting.
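To ground the fictional TPR/TNR numbers above, here is a sketch of the scoring harness such a judge would need: run it over expert-labeled examples and compute the confusion-matrix rates. Every name here is an assumption for illustration; the judge call is a placeholder that would, in practice, wrap an LLM with the v1.0-style prompt (regulation rules, concrete examples, business context, and a reasoning requirement).

// Hypothetical harness for scoring a compliance judge against labeled examples.
interface LabeledExample {
  code: string;
  isViolation: boolean;           // ground-truth label from a domain expert
}

interface JudgeVerdict {
  isViolation: boolean;
  reasoning: string;              // the v1.0 judge must explain its call
}

// Placeholder for the LLM call that applies the judge prompt to a code sample.
declare function judge(code: string): Promise<JudgeVerdict>;

async function scoreJudge(examples: LabeledExample[]) {
  let tp = 0, tn = 0, fp = 0, fn = 0;
  for (const ex of examples) {
    const verdict = await judge(ex.code);
    if (ex.isViolation && verdict.isViolation) tp++;         // caught a real violation
    else if (!ex.isViolation && !verdict.isViolation) tn++;  // passed clean code
    else if (!ex.isViolation && verdict.isViolation) fp++;   // false alarm
    else fn++;                                               // missed a violation
  }
  // Assumes both classes appear in the labeled set (no zero denominators).
  return {
    tpr: tp / (tp + fn),   // "catches violations"
    tnr: tn / (tn + fp),   // "passes clean code"
    fpr: fp / (fp + tn),   // "false alarms"
    fnr: fn / (fn + tp),   // "misses violations"
  };
}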

How this could work in practice

The eval framework includes CI/CD integration with golden datasets and async production monitoring. But the fundamental question remains: should compliance be embedded in AI tools, or should AI tools be observable by existing compliance systems?
The key challenge is meeting governance needs without stifling the AI assistance that makes developers productive. Heavy-handed compliance checking could turn AI coding tools into compliance theater that developers work around rather than tools that actually help them write better code.
This suggests embedding compliance intelligence into the AI generation process itself, rather than treating it as a separate validation step. Make the AI naturally produce more compliant code, then use lighter-touch monitoring to ensure it's working.
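As a rough illustration of that split (all function names and prompt text below are assumptions, not a real API), the generation path could carry the compliance guidance while monitoring stays asynchronous and non-blocking:

// Hypothetical wrapper: compliance guidance is injected at generation time,
// and outputs are queued for out-of-band review rather than blocking the developer.
declare function generateCode(prompt: string): Promise<string>;
declare function enqueueForReview(record: { prompt: string; code: string }): void;

const COMPLIANCE_CONTEXT = [
  "When generating code that touches user data:",
  "- use parameterized queries, never string interpolation",
  "- write an audit log entry for destructive operations",
  "- check consent or erasure requests before deleting personal data",
].join("\n");

async function generateWithCompliance(userPrompt: string): Promise<string> {
  const code = await generateCode(`${COMPLIANCE_CONTEXT}\n\n${userPrompt}`);
  enqueueForReview({ prompt: userPrompt, code }); // lighter-touch monitoring, async
  return code;
}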
Success would look different for each enterprise use case. Velocity Maximizers might measure time saved on compliance reviews during rapid development cycles. Debt Eliminators could track how many legacy code updates maintain regulatory standards without slowing modernization. Compliance Maintainers might focus on audit preparation time and regulatory confidence scores. The challenge is building evaluation systems that prove value in these real business contexts.

What I learned

Compliance automation illustrates how eval systems could address real business problems: the patterns seem recognizable, the stakes are real, and success would be measurable.
But there's a big difference between synthetic demos and production reality. I built this investigation on conceptual examples rather than talking to actual enterprise teams. Real challenges like proprietary training data, enterprise scale, and developer adoption are still open questions.
The biggest insight was realizing that AI coding tools create new governance challenges that traditional compliance approaches might not handle well. The speed of AI evolution versus the pace of governance adaptation creates interesting tensions that someone will need to solve.
Still, the exercise felt worthwhile. It's a good place to explore what practical evaluation systems might look like when they're solving real business problems.
This exploration follows the error analysis and judge training patterns from recent eval frameworks, adapted to a compliance context, and it touched only those two phases. A real system would need the full operationalization cycle of monitoring, analyzing, improving, and deploying that creates continuous improvement.
This eval system is very incomplete, and there are far more complex implementations I'd like to explore, but I wanted an accessible way to investigate the topic and see whether compliance automation could be a compelling use case.

Resources