Agentic Self-Optimization Framework: Architecture & Design
Date: April 3, 2026
Authors: Integrity Studio AI Research
Version: 1.0
Status: Architecture & Design Specification
Executive Summary
This document defines an OpenTelemetry (OTEL)-based observability framework that measures agent/skill/plugin code quality using the churn data model from Popescu et al. (2026), translating code stability metrics into reinforcement learning (RL) reward signals. The framework enables self-optimizing agentic systems that learn to produce higher-quality, longer-lived code by observing post-deployment code survival, merge velocity, and downstream stability patterns.
Core Innovation: Treating agent outputs as continuous experiments with measurable feedback loops, enabling RL agents to optimize for production-grade code stability—not just task completion.
Key Components:
- Code Churn Telemetry Model (OTEL spans + attributes)
- RL Environment abstraction (Gymnasium-compatible)
- Multi-agent policy learning (Stable Baselines3 + Pufferlib)
- Feedback integration pipeline (Git analysis → OTEL → RL reward)
- Deployment safety guardrails (gradual rollout, rollback triggers)
1. Problem Statement
Current State (2026)
Agent-generated code exhibits distinct failure modes:
- Higher churn rates: Claude Code median 0.8–1.0 vs. human median 0–0.4 (Popescu et al., Figure 9, §4.2)
- Lower survival rates: Effect sizes modest but consistently negative (Cliff’s δ = −0.05 to −0.14; Figure 8, §4.2)
- Repository concentration bias: Agent PRs concentrate in 0-star/test repos (Codex 75.3%, Claude Code 51.7%, Copilot 59.6%, Devin 64.1%; Popescu et al. Table 4, §4.1.1). Human PRs more distributed (40.5% in 0-star repos).
- Faster merge velocity: Codex median 0.5 min vs. human median 0.4 hours in low-star repos; human PRs ~10 hours in high-star repos (Figure 6, §4.1.3 Pull Request Merges)
Root cause: Agents optimize for immediate task completion; lack feedback on production-grade code quality.
Desired Future State
- Agents learn from code survival signals: Each line’s post-deployment lifetime becomes a training signal
- Continuous improvement loop: Agent policies updated as new code stability data arrives
- Production-grade outputs: Agents internalize patterns that correlate with high-survival code
- Measurable framework quality: Operators track agent skill improvement via OTEL dashboards
2. Churn Data Model (From Popescu et al. 2026)
Metrics Collected Per PR/Commit
| Metric | Type | Collection Window | Production Signal |
|---|---|---|---|
| Survival Rate | % lines unmodified | 3d, 7d, 21d post-merge | Long-term code viability |
| Churn Rate | % lines modified | Same windows | Instability indicator |
| Deletion Rate | % lines deleted | Same windows | Code rejection signal |
| Merge Rate | % PRs merged | Immediate | Code review quality |
| Merge Time | Hours to merge | Immediate | Review friction / urgency |
| Change Size | Median lines added | Per PR | Scope complexity |
| Review Comments | Count per PR | Pre-merge | Code clarity issues |
| Review Count | Number of reviewers | Pre-merge | Review thoroughness |
Why These Metrics?
- Survival metrics = Ground truth of code quality in production context
- Merge signals = Immediate feedback (faster → less controversial)
- Change size = Proxy for defect density (larger changes → more bugs)
- Review signals = Proxy for code clarity (fewer comments → clearer code)
3. OTEL Telemetry Architecture
Span Structure: Code Quality Pipeline
# Parent span: Code generation (day 0)
Span: "code-generation"
  start: 2026-04-03T10:15:00Z
  end: 2026-04-03T10:15:45Z
  attributes:
    agent.name: "claude-code-v4.5"
    agent.skill: "api-endpoint-refactor"
    agent.model: "sonnet-4.6"
    tool.type: "plugin|skill|agent"
    correlation.trace_id: "abc-123-def"  # Used to link delayed spans
    code.event: "generated"

# Immediate child: Code review/merge event (same day or +1-7 days)
Child Span: "code-merge-event"
  start: 2026-04-04T08:00:00Z  # Sometime after generation
  end: 2026-04-04T08:05:00Z
  attributes:
    code.quality.merged: true
    code.quality.merge_time_hours: 22.25
    code.quality.review_comments: 4
    code.quality.review_count: 2
    git.commit_hash: "abc123..."

# Delayed spans: Code survival checkpoints (created on days 3, 7, 21)
# These are NOT child spans: a span that starts days after its parent has ended
# does not fit the OTEL parent-child model. Each checkpoint is a separate span
# that points back to the generation span through a span link and the shared
# correlation.trace_id attribute.
Span: "code-survival-3d" (created on 2026-04-06T10:15:00Z)
  links:
    - parent_span_context: (from code-generation trace)
  attributes:
    code.quality.survival_3d: 0.92
    code.quality.churn_3d: 0.05
    code.quality.deletion_3d: 0.03
    correlation.trace_id: "abc-123-def"  # Matches parent
    code.event: "survival_checkpoint"
    code.checkpoint_window: "3d"

Span: "code-survival-21d" (created on 2026-04-24T10:15:00Z)
  links:
    - parent_span_context: (from code-generation trace)
  attributes:
    code.quality.survival_21d: 0.78
    code.quality.churn_21d: 0.15
    code.quality.deletion_21d: 0.07
    correlation.trace_id: "abc-123-def"  # Matches parent
    code.event: "survival_checkpoint"
    code.checkpoint_window: "21d"
    code.rl_reward_trigger: true  # This span triggers RL training
OTEL Attribute Naming Convention
Pattern: code.quality.{metric_name} for all Popescu et al. metrics
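code.quality.survival_3d # Float [0.0, 1.0]
code.quality.survival_7d
code.quality.survival_21d
code.quality.churn_3d # Float [0.0, 1.0]
code.quality.churn_7d
code.quality.churn_21d
code.quality.deletion_3d # Float [0.0, 1.0]
code.quality.deletion_7d
code.quality.deletion_21d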
code.quality.merge_rate # Float [0.0, 1.0]
code.quality.merge_time_hours # Float >= 0
code.quality.change_size_lines # Int >= 0
code.quality.review_comments # Int >= 0
code.quality.review_count # Int >= 0
Resource Attributes (agent/skill identity):
agent.name # "claude-code", "github-copilot"
agent.skill # "refactor-api", "write-tests"
agent.model # "sonnet-4.6", "gpt-4-turbo"
agent.version # "1.2.3"
tool.type # "plugin" | "skill" | "agent"
repository.name # GitHub repo name
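As a minimal sketch of the naming convention above, the helper below builds the attribute dictionary for one survival checkpoint. The function name `checkpoint_attributes` and its parameters are illustrative assumptions and not part of any OTEL SDK; the resulting dictionary could be handed to whichever span exporter the pipeline uses.

```python
from typing import Dict, Union

AttrValue = Union[str, float, int, bool]
VALID_WINDOWS = ("3d", "7d", "21d")


def checkpoint_attributes(
    window: str,
    survival: float,
    churn: float,
    deletion: float,
    correlation_id: str,
    agent_name: str,
    skill: str,
) -> Dict[str, AttrValue]:
    """Build span attributes for one survival checkpoint (hypothetical helper)."""
    if window not in VALID_WINDOWS:
        raise ValueError(f"unknown checkpoint window: {window}")
    for name, value in (("survival", survival), ("churn", churn), ("deletion", deletion)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0.0, 1.0], got {value}")

    attrs: Dict[str, AttrValue] = {
        f"code.quality.survival_{window}": survival,
        f"code.quality.churn_{window}": churn,
        f"code.quality.deletion_{window}": deletion,
        "correlation.trace_id": correlation_id,
        "code.event": "survival_checkpoint",
        "code.checkpoint_window": window,
        "agent.name": agent_name,
        "agent.skill": skill,
    }
    # Only the 21-day checkpoint is marked as the trigger for RL training.
    if window == "21d":
        attrs["code.rl_reward_trigger"] = True
    return attrs


attrs_21d = checkpoint_attributes(
    "21d", 0.78, 0.15, 0.07, "abc-123-def", "claude-code", "refactor-api"
)
```

Validating the window and the [0.0, 1.0] range at construction time keeps malformed values out of the feature store, where they would otherwise surface only at training time.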
Data Ingestion Pipeline
[Git Webhook: New PR merged]
↓
[Async job: Track commit hash + agent metadata]
↓
[Cron: Measure survival @ 3d, 7d, 21d]
↓
[Emit OTEL span with code.quality.survival_* attributes]
↓
[OTEL collector: Route to feature store + RL feedback buffer]
Implementation:
- Git hook (post-merge): Capture PR metadata + agent identity
- Scheduled cron (3d/7d/21d): Git blame analysis → survival metrics
- OTEL exporter: Batch emit spans with aggregated metrics
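The scheduling step can be sketched as follows, assuming the merge timestamp is stored with each tracked commit. The names `checkpoint_due_times` and `due_windows` are hypothetical; the sketch only shows how a periodic job could decide which checkpoints are ready to measure.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Set

# Checkpoint windows used throughout this document.
CHECKPOINT_DAYS = {"3d": 3, "7d": 7, "21d": 21}


def checkpoint_due_times(merged_at: datetime) -> Dict[str, datetime]:
    """Return the timestamp at which each survival checkpoint becomes due."""
    return {
        window: merged_at + timedelta(days=days)
        for window, days in CHECKPOINT_DAYS.items()
    }


def due_windows(merged_at: datetime, now: datetime, already_measured: Set[str]) -> List[str]:
    """List checkpoint windows that are due at `now` and not yet measured."""
    due = checkpoint_due_times(merged_at)
    return [
        window
        for window, due_at in due.items()
        if due_at <= now and window not in already_measured
    ]


merged = datetime(2026, 4, 3, 10, 15, tzinfo=timezone.utc)
schedule = checkpoint_due_times(merged)
pending = due_windows(merged, datetime(2026, 4, 11, tzinfo=timezone.utc), {"3d"})
```

Tracking which windows were already measured makes the job idempotent, so a missed or repeated cron run does not skip or duplicate a checkpoint.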
4. RL Environment Design (Gymnasium Compatible)
Agent’s Decision Space
State (observation):
observation = {
    "task_description": str,            # Task prompt
    "context_files": List[str],         # Code to refactor/modify
    "recent_agent_history": List[{      # Past agent actions in this repo
        "survival_21d": float,
        "churn_7d": float,
        "merge_time": float,
        "change_size": int,
    }],
    "repository_profile": {             # Repo characteristics
        "star_count": int,
        "team_size": int,
        "language": str,
        "avg_pr_size": int,
    },
}
Action:
action = {
    "code": str,                # Generated code
    "scope_confidence": float,  # Agent's confidence in change scope
    "test_coverage": float,     # % new code with tests
    "refactor_depth": int,      # 1=minimal, 5=comprehensive
}
Reward Function (multi-objective):
def compute_reward(action, outcomes_at_21d, task_success=False):
    """
    Outcomes = code metrics observed 21 days post-merge.
    task_success = whether the task goal was reached (the completion
    bonus is granted only when this is passed explicitly as True).
    """
    # Weighted sum of quality signals
    survival_reward = outcomes_at_21d["survival_21d"] * 100  # [0, 100]
    # Penalize high-churn code
    churn_penalty = outcomes_at_21d["churn_21d"] * -50  # [-50, 0]
    # Reward fast merges (proxy for code clarity)
    merge_speed = 1.0 / (1.0 + outcomes_at_21d["merge_time_hours"] / 24)
    merge_reward = merge_speed * 30  # (0, 30]
    # Intrinsic reward for reaching the task goal
    task_completion_reward = 50 if task_success else 0
    # Shape reward: penalize oversized changes (penalty saturates at 1000 lines)
    scope_penalty = min(outcomes_at_21d["change_size_lines"] / 1000, 1.0) * -20  # [-20, 0]
    total = (survival_reward + churn_penalty + merge_reward +
             task_completion_reward + scope_penalty)
    return total
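As a worked example of the reward arithmetic, the sketch below evaluates the same weighted sum on the 21-day checkpoint values shown in Section 3 and on the two extremes. The function name `reward_21d`, the explicit `task_success` argument, and the 200-line change size are assumptions made for illustration.

```python
def reward_21d(
    survival: float,
    churn: float,
    merge_time_hours: float,
    change_size_lines: int,
    task_success: bool,
) -> float:
    """Weighted sum of the five reward terms described above."""
    survival_reward = survival * 100                           # [0, 100]
    churn_penalty = churn * -50                                # [-50, 0]
    merge_reward = 30 / (1 + merge_time_hours / 24)            # (0, 30]
    task_reward = 50 if task_success else 0                    # {0, 50}
    scope_penalty = min(change_size_lines / 1000, 1.0) * -20   # [-20, 0]
    return survival_reward + churn_penalty + merge_reward + task_reward + scope_penalty


# Section 3 example: survival 0.78, churn 0.15, merged after 22.25 hours.
# The 200-line change size is an assumed value.
example = reward_21d(0.78, 0.15, 22.25, 200, True)   # approximately 132.07

# Bounds of the reward scale.
best = reward_21d(1.0, 0.0, 0.0, 0, True)                      # 180.0
worst = reward_21d(0.0, 1.0, float("inf"), 1000, False)        # -70.0
```

The total therefore ranges from -70 to 180, and survival alone accounts for 100 of those 250 points, so it dominates the other terms by design.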
Why 21 Days?
This framework adopts the 21-day window as the primary feedback signal based on the assumption that this timeframe captures meaningful code stability patterns observed in the Popescu et al. study. Using 21d as the reward window:
- ✅ Captures long-term stability patterns (aligns with Popescu et al.’s measurement window)
- ✅ Provides enough data for statistical significance
- ✅ Balances feedback latency with learning speed
- ⚠️ Trade-off: RL agents see rewards with 21-day delay; mitigate with auxiliary intermediate rewards (merge time, review count)
Note: Empirical validation of 21 days as a stability threshold is not provided in the Popescu et al. paper and should be verified via independent replication before deploying this framework to production systems.
Auxiliary Rewards (Immediate Feedback)
To reduce 21-day training latency, include immediate proxy rewards:
immediate_reward = {
    "merge_success": +10 if merged else -5,
    "review_velocity": 5 / (1 + merge_time_hours),   # Fast = good signal
    "comment_ratio": -num_review_comments / 10,      # More comments = clarity issues
}
These don’t replace the 21d reward; they’re added as auxiliary signals for policy gradient updates.
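A minimal sketch of collapsing the three proxy terms into one scalar follows. The function name is hypothetical, and giving unmerged PRs a review velocity of zero is an assumption, since the text does not say how that term behaves when there is no merge time.

```python
def immediate_reward_total(
    merged: bool,
    merge_time_hours: float,
    num_review_comments: int,
) -> float:
    """Sum the three immediate proxy rewards into a single scalar."""
    merge_success = 10.0 if merged else -5.0
    # Assumption: an unmerged PR has no merge time, so it earns no velocity bonus.
    review_velocity = 5.0 / (1.0 + merge_time_hours) if merged else 0.0
    comment_ratio = -num_review_comments / 10.0
    return merge_success + review_velocity + comment_ratio


# Section 3 example: merged after 22.25 hours with 4 review comments.
aux = immediate_reward_total(True, 22.25, 4)   # approximately 9.82
```

The auxiliary total stays within roughly plus or minus 15 for typical inputs, which keeps it small relative to the 21-day reward scale.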
5. Multi-Agent RL Training Architecture
Approach: Multi-Skill Policy Learning
Goal: Train separate policies for each skill/agent type; share learned patterns via a central value function.
┌─────────────────────────────────────────────────────┐
│ Central Agent Repository (GitHub) │
│ - claude-code/refactor-api/v1.2 │
│ - github-copilot/write-tests/v3.1 │
│ - devin/debug-integration-tests/v0.9 │
└─────────────────────────────────────────────────────┘
↓ (generates code + emits OTEL)
┌─────────────────────────────────────────────────────┐
│ OTEL Observation Buffer (Git-based Feature Store) │
│ - Collects 21-day survival outcomes │
│ - Batch exports for training (weekly) │
└─────────────────────────────────────────────────────┘
↓ (feeds into training loop)
┌─────────────────────────────────────────────────────┐
│ RL Training Environment (Gymnasium-compatible) │
│ - Vectorized environments (parallel training) │
│ - PPO with vectorized scaling (Pufferlib) │
│ - Shared value function (transfer learning) │
└─────────────────────────────────────────────────────┘
↓ (produces improved policies)
┌─────────────────────────────────────────────────────┐
│ Policy Deployment │
│ - Canary: 10% of requests → new policy │
│ - Monitor: Survival rate @ 7d vs. baseline │
│ - Rollback: If survival drops > 2% │
└─────────────────────────────────────────────────────┘
Training Algorithm: PPO + Advantage Aggregation
Why PPO (Proximal Policy Optimization):
- ✅ Reuses each rollout for several epochs of minibatch updates (10 epochs per rollout in the configuration below)
- ✅ Clipped policy updates stay stable under high-variance reward signals
- ✅ Proven stable for continuous actions (scope_confidence, test_coverage)
- ⚠️ Limitation: No built-in mechanism for rewards that arrive 21 days after the action
Mitigation: Advantage Estimation with Bootstrapping
# Pseudocode (Stable Baselines3 + custom callback).
# model, compute_gae, update_policy, value_function_cache, action_cache, and the
# action probabilities are placeholders for objects owned by the training loop.
import torch

def compute_gae_with_delayed_rewards(
    trajectories: List[Trajectory],
    delayed_rewards_at_21d: Dict[str, dict],  # trajectory id -> 21d outcome metrics
    lambda_gae: float = 0.95,
    clip_range: float = 0.2,  # Standard PPO value
):
    """
    Use immediate proxy rewards for on-policy updates;
    retroactively adjust value estimates when 21d outcomes arrive.
    """
    # Phase 1: Initial training (on immediate rewards)
    for traj in trajectories:
        advantage = compute_gae(
            rewards=traj.immediate_rewards,  # merge time, review signals
            value_estimates=model.value_function(traj.observations),
            gae_lambda=lambda_gae,  # "lambda" is a reserved word in Python
        )
        update_policy(advantage)

    # Phase 2: Delayed reward correction (async)
    loss = 0.0
    for traj_id, delayed_outcomes in delayed_rewards_at_21d.items():
        actual_reward = compute_reward(action_cache[traj_id], delayed_outcomes)
        prior_estimate = value_function_cache[traj_id]
        # Adjust advantage retroactively
        advantage_correction = actual_reward - prior_estimate
        # Importance sampling ratio (off-policy safety)
        ratio = action_prob_current / action_prob_old
        # PPO-style symmetric clipping: constrain ratio to [1-clip_range, 1+clip_range]
        clipped_ratio = torch.clamp(ratio, 1 - clip_range, 1 + clip_range)
        # Clipped surrogate loss: keep the more pessimistic of the two terms
        policy_loss = -torch.minimum(
            ratio * advantage_correction,
            clipped_ratio * advantage_correction,
        )
        loss += policy_loss.mean()
    return loss
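For reference, the `compute_gae` step used in Phase 1 can be sketched in plain Python as standard Generalized Advantage Estimation. This version assumes a single trajectory with no episode termination inside it; the signature is illustrative and is not the Stable Baselines3 internal API.

```python
from typing import List, Sequence


def compute_gae(
    rewards: Sequence[float],
    values: Sequence[float],
    last_value: float,
    gamma: float = 0.99,
    gae_lambda: float = 0.95,
) -> List[float]:
    """Generalized Advantage Estimation over one trajectory.

    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t     = delta_t + gamma * gae_lambda * A_{t+1}
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value   # bootstrap value for the state after the last step
    running = 0.0             # A_{t+1}, accumulated while walking backwards
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * gae_lambda * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# With gamma = lambda = 1 and zero value estimates, GAE reduces to reward-to-go.
adv = compute_gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], last_value=0.0,
                  gamma=1.0, gae_lambda=1.0)
```

Setting `gae_lambda` closer to 0 leans on the value function (lower variance, more bias), which matters here because the value function stands in for the 21-day outcome until that outcome arrives.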
Stable Baselines3 Configuration
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

# CodeQualityEnv and the three callbacks below are project-specific classes,
# not part of Stable Baselines3.

# 1. Vectorized environments for parallel rollouts
env = make_vec_env(
    CodeQualityEnv,
    n_envs=8,  # 8 parallel environment copies
    vec_env_cls=SubprocVecEnv,
)

# 2. PPO with tuned hyperparameters for code generation
model = PPO(
    policy="MlpPolicy",
    env=env,
    learning_rate=3e-4,
    n_steps=2048,       # Rollout steps per environment before each update
    batch_size=64,      # Mini-batch for updates
    n_epochs=10,        # Epochs per rollout
    gamma=0.99,         # Discount factor
    gae_lambda=0.95,
    clip_range=0.2,
    clip_range_vf=0.2,
    ent_coef=0.01,      # Entropy for exploration
    verbose=1,
)

# 3. Callbacks for intermediate rewards + delayed reward correction
model.learn(
    total_timesteps=1_000_000,
    callback=[
        DelayedRewardCallback(),     # Apply 21d outcomes retroactively
        SurvivalMetricCallback(),    # Log survival_21d to OTEL
        RollbackTriggerCallback(),   # Halt training if survival drops
    ],
)
Multi-Agent Coordination: Skill-Level Policies (Manual Weight Transfer)
Architecture: One policy per (agent, skill) pair; shared value function for transfer learning.
⚠️ Important: Stable Baselines3 does not natively support cross-policy weight sharing. This architecture achieves transfer learning through manual weight transfer — loading the value network from one trained skill as initialization for the next. This is a workaround, not a built-in SB3 feature.
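import torch as th
from stable_baselines3 import PPO

skills = [
    "refactor-api-endpoint",
    "write-integration-tests",
    "debug-failing-tests",
    "add-observability",
]
base_policy_kwargs = dict(
    net_arch=dict(pi=[64, 64], vf=[64, 64]),  # Separate actor (pi) and critic (vf) networks
    activation_fn=th.nn.ReLU,
)
policies = {}
shared_vf_weights = None  # Will hold value network weights from first policy
for i, skill in enumerate(skills):
    env = CodeQualityEnv(skill)
    policies[skill] = PPO(
        policy="MlpPolicy",
        env=env,
        policy_kwargs=base_policy_kwargs,
        learning_rate=3e-4,
    )
    # Transfer learning: Use first policy's value network as warm-start for subsequent policies
    if i == 0:
        # Train first policy to convergence
        policies[skill].learn(total_timesteps=100_000)
        source = policies[skill].policy
        # In SB3, policy.value_net is only the final linear head; the hidden
        # vf layers live in policy.mlp_extractor.value_net, so copy both.
        shared_vf_weights = {
            "hidden": source.mlp_extractor.value_net.state_dict(),
            "head": source.value_net.state_dict(),
        }
    elif shared_vf_weights is not None:
        # Load value network weights from prior skill as initialization
        # (requires every skill's environment to share one observation shape)
        target = policies[skill].policy
        target.mlp_extractor.value_net.load_state_dict(shared_vf_weights["hidden"])
        target.value_net.load_state_dict(shared_vf_weights["head"])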
# Alternative: Use SurveyL2-based reward function to enforce consistency across skills
6. Feedback Integration Pipeline
Data Flow: Git → OTEL → RL Reward Buffer
┌──────────────────────────────────────────────┐
│ 1. GitHub Webhook (Post-Merge Event) │
│ - PR metadata + agent identity │
│ - Commit hash + timestamp │
└──────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────┐
│ 2. Async Indexer (Real-time) │
│ - Store metadata in feature store │
│ - Emit OTEL span "code-generation" │
│ - Tag with agent.name, tool.type │
└──────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────┐
│ 3. Survival Measurement Cron (Scheduled) │
│ - At +3d, +7d, +21d: │
│ * Run `git blame` on changed files │
│ * Count lines still in HEAD │
│ * Compute survival_rate │
│ - Emit OTEL checkpoint span with metrics │
└──────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────┐
│ 4. OTEL Collector & Export │
│ - Route spans to feature store (Parquet) │
│ - Buffer outcomes for weekly training │
│ - Dashboard: Survival by agent/skill │
└──────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────┐
│ 5. RL Training Loop (Weekly) │
│ - Load feature store → trajectories │
│ - Compute 21d rewards │
│ - Apply PPO updates │
│ - Save policy checkpoints │
└──────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────┐
│ 6. Deployment & Canary Testing │
│ - Version new policies │
│ - Route 10% of requests to new agents │
│ - Monitor survival_7d vs. baseline │
│ - Auto-rollback on degradation │
└──────────────────────────────────────────────┘
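Steps 4 and 5 depend on holding each trajectory until its 21-day outcome arrives. A minimal in-memory sketch of such a buffer follows; the class and field names are hypothetical, and a production version would presumably persist to the feature store rather than to process memory.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class PendingTrajectory:
    correlation_id: str        # same value as correlation.trace_id on the spans
    observations: list
    actions: list
    immediate_reward: float    # proxy reward available at merge time


@dataclass
class DelayedRewardBuffer:
    """Holds trajectories until their 21-day reward is known."""
    pending: Dict[str, PendingTrajectory] = field(default_factory=dict)
    ready: List[Tuple[PendingTrajectory, float]] = field(default_factory=list)

    def add(self, traj: PendingTrajectory) -> None:
        self.pending[traj.correlation_id] = traj

    def resolve(self, correlation_id: str, reward_21d: float) -> bool:
        """Attach the 21d reward; returns False if the trajectory is unknown."""
        traj = self.pending.pop(correlation_id, None)
        if traj is None:
            return False
        self.ready.append((traj, reward_21d))
        return True

    def drain(self) -> List[Tuple[PendingTrajectory, float]]:
        """Return all resolved trajectories for the weekly run and clear them."""
        batch, self.ready = self.ready, []
        return batch


buffer = DelayedRewardBuffer()
buffer.add(PendingTrajectory("abc-123-def", observations=[], actions=[], immediate_reward=9.8))
buffer.resolve("abc-123-def", reward_21d=132.07)
weekly_batch = buffer.drain()
```

Keying the buffer on the correlation ID means the 21-day checkpoint span alone is enough to match an outcome to its trajectory.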
Implementation: Python-Based Feedback Collector
# feedback_collector.py
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Set

from apscheduler.schedulers.blocking import BlockingScheduler


@dataclass
class CodeMetrics:
    commit_hash: str
    agent_name: str
    skill: str
    merge_time_hours: float
    survival_3d: float
    survival_7d: float
    survival_21d: float
    churn_7d: float
    deletion_21d: float


class SurvivalMeasurer:
    def __init__(self, repo_path: str):
        self.repo_path = repo_path

    def measure_survival_at_timestamp(
        self,
        commit_hash: str,
        target_timestamp: datetime,
    ) -> Dict[str, float]:
        """
        Measure survival at target_timestamp by:
        1. Checkout HEAD (current state)
        2. For each line in commit_hash, check if it still exists
        3. Compute survival_rate = (lines_still_present / original_lines)

        Note: target_timestamp is not used yet; the comparison is always
        against the current HEAD.
        """
        # Get original lines from target commit
        original_lines = self._get_lines_from_commit(commit_hash)
        # Get current lines from HEAD
        current_lines = self._get_lines_from_head()
        # Compute metrics
        survival_rate = self._compute_survival(original_lines, current_lines)
        churn_rate = self._compute_churn(original_lines, current_lines)
        deletion_rate = self._compute_deletion(original_lines, current_lines)
        return {
            "survival_rate": survival_rate,
            "churn_rate": churn_rate,
            "deletion_rate": deletion_rate,
        }

    def _get_lines_from_commit(self, commit_hash: str) -> Set[str]:
        """Get hash of each line in commit_hash"""
        # Use git show + content hashing
        raise NotImplementedError

    def _get_lines_from_head(self) -> Set[str]:
        """Get hash of each line currently in HEAD"""
        raise NotImplementedError

    def _compute_survival(
        self,
        original_lines: Set[str],
        current_lines: Set[str],
    ) -> float:
        if not original_lines:
            return 0.0  # Nothing to measure; avoids division by zero
        return len(original_lines & current_lines) / len(original_lines)

    def _compute_churn(self, original_lines: Set[str], current_lines: Set[str]) -> float:
        # Not specified in this document; line hashes alone cannot
        # separate modified lines from deleted ones.
        raise NotImplementedError

    def _compute_deletion(self, original_lines: Set[str], current_lines: Set[str]) -> float:
        raise NotImplementedError


# Cron job: Measure at 3d, 7d, 21d
# emit_otel_span and pending_measurements_at_3d are project helpers defined elsewhere.
scheduler = BlockingScheduler()


# Runs daily at midnight and picks up commits that have reached the 3-day mark.
# (day_of_week=3 would mean "every Thursday", not "every 3 days".)
@scheduler.scheduled_job('cron', hour=0)
def measure_survival_3d():
    for commit in pending_measurements_at_3d():
        measurer = SurvivalMeasurer(commit.repo_path)
        metrics = measurer.measure_survival_at_timestamp(
            commit.hash,
            datetime.now() - timedelta(days=3),
        )
        emit_otel_span("code-survival-3d", attributes={
            "code.quality.survival_3d": metrics["survival_rate"],
            "agent.name": commit.agent_name,
            "git.commit_hash": commit.hash,
        })
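One way to approach the `git blame` step is to parse porcelain output and count the lines still attributed to the agent's commit. The sketch below assumes its input is the text produced by `git blame --line-porcelain` (or `--porcelain`) for a changed file at HEAD; the function names are hypothetical, and attribution can be distorted by squash merges, rebases, or reformatting commits.

```python
import re
from collections import Counter

# Porcelain header line: <40-hex sha> <orig line> <final line> [<lines in group>]
_HEADER = re.compile(r"^([0-9a-f]{40}) \d+ \d+(?: \d+)?$")


def count_lines_by_commit(porcelain_output: str) -> Counter:
    """Count how many lines of the blamed file are attributed to each commit."""
    counts: Counter = Counter()
    current_sha = None
    for line in porcelain_output.split("\n"):
        match = _HEADER.match(line)
        if match:
            current_sha = match.group(1)
        elif line.startswith("\t") and current_sha is not None:
            # Content lines are prefixed with a TAB in porcelain output.
            counts[current_sha] += 1
    return counts


def survival_rate(porcelain_output: str, commit_hash: str, original_line_count: int) -> float:
    """Fraction of the commit's original lines still attributed to it."""
    if original_line_count <= 0:
        return 0.0
    surviving = count_lines_by_commit(porcelain_output)[commit_hash]
    return surviving / original_line_count


# Synthetic porcelain output: two lines from the agent commit, one from a later edit.
agent_sha = "a" * 40
other_sha = "b" * 40
sample = "\n".join([
    f"{agent_sha} 1 1 2", "author Agent", "\tline one",
    f"{agent_sha} 2 2", "\tline two",
    f"{other_sha} 3 3 1", "author Human", "\tline three",
])
rate = survival_rate(sample, agent_sha, original_line_count=4)
```

In this example the agent commit originally added four lines and two are still attributed to it, so the rate is 0.5.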
7. Deployment Safety & Guardrails
Canary Rollout Strategy
Phase 1: Shadow Deployment (1 week)
- New policy runs in parallel; results NOT used
- Monitor 7-day survival rate vs. baseline
- No user impact
Phase 2: Canary (2 weeks)
- Route 10% of requests to new policy
- Compare survival_7d, merge_rate to baseline
- If divergence > 2%, revert immediately
Phase 3: Gradual Rollout (2-4 weeks)
- 10% → 25% → 50% → 100%
- Monitor at each step
- Roll back if survival_21d drops
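The traffic split in Phases 2 and 3 can be sketched with deterministic hash-based bucketing, so that a given request ID always lands on the same policy at a given rollout percentage. The function name and the use of a request ID as the routing key are assumptions; the document does not specify how requests are assigned.

```python
import hashlib

ROLLOUT_STEPS = [10, 25, 50, 100]  # percentages from Phase 2 and Phase 3


def routes_to_new_policy(request_id: str, rollout_percent: int) -> bool:
    """Deterministically assign a request to the new policy."""
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100   # stable bucket in [0, 99]
    return bucket < rollout_percent


# Approximate share of traffic on the new policy during the canary phase.
canary_share = sum(routes_to_new_policy(f"req-{i}", 10) for i in range(10_000)) / 10_000
```

Because the bucket is fixed per request ID, raising the percentage from 10 to 25 only adds requests to the new policy and never moves an already-migrated request back, which keeps before/after comparisons clean.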
Rollback Triggers
| Metric | Threshold | Action |
|---|---|---|
| survival_21d | < baseline - 2% | Immediate rollback |
| churn_21d | > baseline + 3% | Rollback after 24h |
| merge_rate | < baseline - 5% | Halt rollout (investigate) |
| merge_time_hours | > baseline * 2 | Pause (review friction) |
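The trigger table can be expressed as a small decision function. This sketch assumes the percentage thresholds are absolute percentage points on rates scaled to [0, 1] (so "baseline - 2%" means baseline minus 0.02); the action labels and function name are illustrative.

```python
from typing import Dict, List


def rollback_actions(current: Dict[str, float], baseline: Dict[str, float]) -> List[str]:
    """Evaluate the rollback triggers and return the actions that fire."""
    actions: List[str] = []
    if current["survival_21d"] < baseline["survival_21d"] - 0.02:
        actions.append("immediate_rollback")
    if current["churn_21d"] > baseline["churn_21d"] + 0.03:
        actions.append("rollback_after_24h")
    if current["merge_rate"] < baseline["merge_rate"] - 0.05:
        actions.append("halt_rollout")
    if current["merge_time_hours"] > baseline["merge_time_hours"] * 2:
        actions.append("pause")
    return actions


baseline = {"survival_21d": 0.80, "churn_21d": 0.15, "merge_rate": 0.90, "merge_time_hours": 4.0}
degraded = {"survival_21d": 0.75, "churn_21d": 0.16, "merge_rate": 0.90, "merge_time_hours": 9.0}
fired = rollback_actions(degraded, baseline)
```

Here survival fell five points and merge time more than doubled, so the immediate rollback and the pause both fire while the churn and merge-rate triggers stay quiet.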
Value Function Monitoring
Track the trained value function to detect reward hacking (e.g., agent finds spurious correlation):
# Red flags for reward hacking
# (pseudocode: ">>" means "much greater than", not a bit shift)
if value_function(observation) >> expected_value:
    # Policy may have exploited an artifact
    # Trigger manual code review of recent agent outputs
    log_alert("value-function-anomaly", severity="high")
if survival_21d_actual < predicted_value * 0.8:
    # Value function consistently overestimates
    # Retrain with calibration adjustment
    log_alert("value-prediction-miscalibration")
8. Monitoring & Observability
OTEL Dashboard Schema
# Grafana dashboard: Agent Code Quality
Panels:
- Agent Comparison (Heatmap)
Series: agent.name (rows) × survival_21d (color)
Metric: code.quality.survival_21d
- Skill Improvement Over Time (Time Series)
Series: Per-skill survival_21d
Aggregation: 7d rolling average
- Churn Rate by Agent (Bar Chart)
Metric: code.quality.churn_21d
Comparison: Agent vs. human baseline
- Merge Velocity (Box Plot)
Metric: code.quality.merge_time_hours
Grouping: agent.name
- Policy Version Deployment (Stacked Area)
Metric: % requests → policy_v1, policy_v2, etc.
Trigger: Rollback on survival drop
Key Queries (Prometheus-compatible)
# Average survival rate by agent (21 days)
avg(code_quality_survival_21d) by (agent_name)
# Survival degradation detection
code_quality_survival_21d < 0.75
# RL training frequency
increase(rl_policy_training_total[1w])
# Canary success (new policy maintains baseline)
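# (aggregate each side first: series with different policy_version labels
# do not match in a direct vector division)
(avg(code_quality_survival_21d{policy_version="v2"})
/
avg(code_quality_survival_21d{policy_version="v1"})) > 0.98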
9. Implementation Roadmap
Phase 1: Instrumentation (Weeks 1-2)
- Git webhook → OTEL span emitter
- SurvivalMeasurer cron job (3d checkpoint)
- OTEL collector export to Parquet feature store
- Dashboard: Survival by agent
Phase 2: RL Environment (Weeks 3-4)
- Gymnasium-compatible CodeQualityEnv
- Reward function (21d outcomes + immediate proxies)
- Stable Baselines3 PPO setup
- Delayed reward correction callback
Phase 3: Training Loop (Weeks 5-6)
- Weekly training cron
- Policy versioning + checkpointing
- Canary deployment infrastructure
- Rollback triggers
Phase 4: Production Deployment (Weeks 7-8)
- Shadow testing (1 week, phase 1)
- Canary rollout (phase 2)
- Gradual rollout (phase 3)
- Monitoring + alerting
10. Risks & Mitigation
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Reward hacking | High | Critical | Value function anomaly detection; manual code review; hold-out test set |
| Distribution shift (new codebase) | Medium | High | Transfer learning via shared value function; domain adaptation layer |
| 21d training latency | High | Medium | Auxiliary immediate rewards (merge time, reviews); policy distillation |
| Data quality (Git blame failures) | Medium | Medium | Validation checks; fallback to PR size + review count |
| Agent diversity (models diverge) | Medium | High | Centralized policy hub; skill-specific fine-tuning only |
11. Success Metrics
Primary (21-day outcomes) — Hypothesis-Level Projections
These targets are projections based on reward structure design. Validation requires Phase 1 pilot implementation with controlled rollout and empirical measurement. See Whitepaper §9 for full qualification.
- Survival Rate @ 21d: Hypothesis Target ≥ 85% (baseline: agent 75-78%, human ~90%)
- Churn Rate @ 21d: Hypothesis Target ≤ 10% (baseline: agent 15-18%, human ~5%)
- Agent-to-Human Code Quality Ratio: Hypothesis Target ≥ 0.90
  - Computed as: min(survival_agent / survival_human, (1 - churn_agent) / (1 - churn_human))
  - Current baseline ratio: ~0.83-0.87 (agent 75-78% survival vs. human 90%)
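The ratio can be checked against the baseline figures quoted in this section with a few lines of arithmetic. The function name is hypothetical, and the inputs are the baseline and target values listed above, not new measurements.

```python
def quality_ratio(
    survival_agent: float,
    survival_human: float,
    churn_agent: float,
    churn_human: float,
) -> float:
    """Agent-to-human quality ratio: the weaker of the survival and stability ratios."""
    if survival_human <= 0 or churn_human >= 1:
        raise ValueError("human baseline must have survival > 0 and churn < 1")
    survival_ratio = survival_agent / survival_human
    stability_ratio = (1 - churn_agent) / (1 - churn_human)
    return min(survival_ratio, stability_ratio)


# Baseline range: agent survival 75-78%, churn 15-18%; human survival 90%, churn 5%.
baseline_low = quality_ratio(0.75, 0.90, 0.18, 0.05)    # approximately 0.83
baseline_high = quality_ratio(0.78, 0.90, 0.15, 0.05)   # approximately 0.87

# Meeting both hypothesis targets (85% survival, 10% churn) clears the 0.90 bar.
at_target = quality_ratio(0.85, 0.90, 0.10, 0.05)       # approximately 0.94
```

In all three cases the survival ratio is the binding term, so improving survival is what moves the headline metric under these baselines.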
Secondary (user experience)
- Merge Rate: Target ≥ 90% (code passes review)
- Merge Time: Target ≤ 4 hours median
- Review Comments: Target ≤ 2 per PR (clarity proxy)
Tertiary (system health)
- Policy Update Frequency: ≥ 1 per week
- Canary Rollback Rate: ≤ 5% (stability)
- Value Function Calibration Error: ≤ 10%
References
Popescu et al. (2026): “Investigating Autonomous Agent Contributions in the Wild: Activity Patterns and Code Change over Time”. arXiv:2604.00917v1. https://arxiv.org/html/2604.00917v1
Schulman et al. (2017): “Proximal Policy Optimization Algorithms”. arXiv:1707.06347
Stable Baselines3 (2021): https://stable-baselines3.readthedocs.io/en/master/
Gymnasium (Farama Foundation): Fork of OpenAI Gym, maintained by Farama Foundation. https://gymnasium.farama.org/
OpenTelemetry (CNCF): https://opentelemetry.io/docs/
Pufferlib (Suarez, J., et al., 2024): Vectorized reinforcement learning library with parallel environment support and 1M+ steps/second performance. https://pypi.org/project/pufferlib/
Document Generated: April 3, 2026
Next Steps: Implement Phase 1 (Git webhook + OTEL instrumentation)