Building AI Agents You Can Trust
Jeff Toffoli
Founder, Quallaa

What time is it?
Demo repository
github.com/quallaa-ai/trace-demo
Just learning about AI agents — haven't built anything yet
Built a simple chatbot or demo, but not in production
Working with AI agents in production (or close to it)
Managing teams or projects that involve AI agents
A framework for building AI agents that make better decisions — with live code and real API calls.
Why AI agents fail at timing decisions
Models can't do timestamp arithmetic — and the workarounds most teams reach for make it worse.
How to fix it with computed facts
Pre-compute what the model can't calculate. Present facts, not interpretations. Measure whether it works.
Where to draw the trust boundary
Which decisions the model should own, which ones code must enforce, and how to tell the difference.
How to measure what matters
Paired evals that test whether your context engineering actually changes agent behavior — not just whether responses look right.
Sarah — The Faucet
42 hours of silence → Gentle follow-up referencing the faucet
→ References "checking with my husband"
→ No pressure. Respectful.
→ After two unanswered follow-ups, graceful close
Mike — The Burst Pipe
Saturday 11:31 PM → Immediate acknowledgment of urgency
→ Check the on-call schedule. Escalate.
→ Saturday 11:31 PM is not Monday 9 AM
→ Time of day matters. Day of week matters.
Getting one right is easy. Getting both right from the same agent — no if/else, no special-casing — that's the hard problem.
What engineers reach for first — and why they backfire for timing decisions
Programmatic cadences
"Follow up at 24h, 48h, 72h, then stop." Rigid. Treats Sarah and Mike identically.
schedule(24h, 48h, 72h)
Exponential back-off
"Wait 1h, then 2h, then 4h, then 8h..." Borrowed from retry logic. Customers aren't failed API calls.
delay *= 2
Decision trees
if (urgency === 'high' && hoursElapsed > 1) { escalate() }. Every edge case spawns a new branch.
if/else/if/else...
Curiosity budgets
"The model can ask at most 3 clarifying questions." Arbitrary caps on judgment.
maxQuestions: 3
Reflexion loops
Reflexion with external feedback (test results, tool outputs) works. But for context problems — where the model can't see the data it needs — self-critique without new information just reinforces the gap.
retry(critique(output))
"The scaffolding that feels safest is the most likely to be stealing a judgment call."
Hard-coded delays (e.g., wait 4 hours, then 1 day, then 3 days)
Rule-based logic with some conditional branching
Let the model decide, but I'm not sure it's working well
Haven't built follow-up logic yet
LLMs are optimized for next-token probability, not numeric precision. 847 × 293 isn't calculated — it's generated as a token sequence that looks like a result.
They'll get 2+2 right (common in training data) but fail on anything requiring multi-step precision. One wrong step cascades through the rest.
Reasoning models (o1, DeepSeek-R1) have gotten dramatically better through reinforcement learning. Tool-using agents can call calculators.
But arithmetic and mathematical reasoning are genuinely different skills, and for our use case — timestamp arithmetic in a system prompt — there is no calculator to call. The model is on its own.
Inject facts, not interpretations.
Interpretations
Urgency scores, cadence steps, backoff rules, tone guidance. Code decides what the situation means and tells the model what to do.
Facts
Elapsed time, waiting status, response latency, unanswered count, conversation span. Code computes what happened and lets the model decide what it means.
The three-part contract: JavaScript computes. The prompt presents. The model interprets.
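A minimal sketch of the two shapes, in the demo's own JavaScript. The field names here are illustrative assumptions, not the trace-demo repo's actual schema:

```js
// Interpretations: code has already decided what the situation means.
// (Illustrative field names, not the demo's actual schema.)
const interpretation = {
  urgency: 'low',     // code's judgment, not a fact
  cadenceStep: 2,     // prescribes the next follow-up
  tone: 'gentle',     // tells the model how to sound
};

// Facts: code reports what happened; the model decides what it means.
const facts = {
  lastContactMessageAgo: '42h',  // elapsed time, pre-computed
  waitingOnContact: true,        // whose turn it is
  lastReplyLatency: '2m',        // how fast they replied last time
  unansweredFollowUps: 0,        // count of nudges with no response
  conversationSpan: '1d',        // total span of the thread
};
```

The first object tells the model what to do; the second leaves the judgment call to it.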
A five-layer framework, TRACE: each letter is a distinct responsibility.
Trust Boundaries
Runtime Observability
Adaptive Evals
Context Engineering
Enforcement

Every tool the agent has is a trust decision. What can it do — and where are the guardrails?
Four tools, each granting a specific power
send_sms
Can contact customers directly. Every message represents the business.
schedule_followup
Can commit to future actions. The agent decides when to come back.
check_schedule
Can see the full follow-up state. Knows what's already planned.
cancel_event
Can undo its own decisions. Self-correction without human intervention.
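In function-calling terms, each of those capabilities is just a declared tool. A sketch in the common JSON-schema style (the tool names match the talk; the parameter schemas are assumptions, and the exact envelope fields vary by provider):

```js
// Tool declarations in the common function-calling style. Parameter schemas
// are assumptions; field names like input_schema vary by provider.
const tools = [
  {
    name: 'send_sms',
    description: 'Send a text message to the customer.',
    input_schema: {
      type: 'object',
      properties: { body: { type: 'string', description: 'Message text' } },
      required: ['body'],
    },
  },
  {
    name: 'schedule_followup',
    description: 'Schedule this agent to wake up and reconsider later.',
    input_schema: {
      type: 'object',
      properties: { at: { type: 'string', description: 'ISO 8601 datetime' } },
      required: ['at'],
    },
  },
  // check_schedule and cancel_event are declared the same way.
];
```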
The send_sms handler pipeline, sketched in code below:
Opt-out check
Enforcement — legal fines. Always blocked.
Length cap
Quality gate — SMS has limits.
Empty guard
Quality gate — don't send nothing.
Success
All checks pass → message sent.
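As a sketch, that pipeline is a guarded handler. isOptedOut and deliverSms below are hypothetical stand-ins for your own opt-out store and SMS provider, not the demo repo's actual code:

```js
const SMS_MAX_LENGTH = 320;

// Hypothetical helpers: stand-ins for your persistence layer and SMS provider.
async function isOptedOut(contactId) { /* look up opt-out status */ return false; }
async function deliverSms(contactId, body) { /* call your SMS provider */ }

async function handleSendSms({ contactId, body }) {
  // 1. Opt-out check: enforcement. Legal consequence, never a judgment call.
  if (await isOptedOut(contactId)) {
    return { ok: false, error: 'Contact has opted out. Message blocked.' };
  }
  // 2. Length cap: quality gate. SMS has limits.
  if (body.length > SMS_MAX_LENGTH) {
    return { ok: false, error: `Message exceeds ${SMS_MAX_LENGTH} characters.` };
  }
  // 3. Empty guard: quality gate. Don't send nothing.
  if (!body.trim()) {
    return { ok: false, error: 'Message body is empty.' };
  }
  // 4. Success: all checks pass, the message goes out.
  await deliverSms(contactId, body);
  return { ok: true };
}
```

Returning failures as tool results rather than throwing lets the model see why a call was blocked and adjust, while the block itself never depends on the model's judgment.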
Trust is the architecture — tools as capabilities, handlers as contracts. Enforcement rules live inside, but they're the subset that's never a judgment call.

If the agent's cadence is emergent, not prescribed, how do you know what it's doing? Without traces, an emergent cadence is a black box.
Every agent decision becomes a structured record
What the model saw
The temporal context block injected into the prompt
What the model decided
The response text — tone, content, judgment calls
What tools it called
Tool names, inputs, and results
Three panels, one decision. Fully queryable.
SAW
Last message from contact: 1d ago. Customer said "let me check with my husband."
DECIDED
"Hey, just checking in — still interested in the faucet repair? No rush at all."
TOOLS
send_sms → ✓
schedule_followup → Fri
Calm tone. Gentle. Referenced the faucet. Scheduled another check-in.
Same agent, same code, same tools. The trace makes the why visible.
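One plausible shape for such a record (the field names are assumptions, not the demo's actual schema):

```js
// One trace record per agent decision. Field names are illustrative.
const trace = {
  decisionId: 'dec_0142',
  saw: {
    // The exact temporal context block injected into the prompt.
    timing: "Last message from contact: 1d ago\nContact's last reply took 2m",
  },
  decided: {
    // The response text: tone, content, judgment calls.
    text: 'Hey, just checking in — still interested in the faucet repair? No rush at all.',
  },
  tools: [
    // Tool names, inputs, and results.
    { name: 'send_sms', input: { body: '...' }, result: { ok: true } },
    { name: 'schedule_followup', input: { at: 'Fri' }, result: { ok: true } },
  ],
};
```

Plain structured records are what make "fully queryable" true: finding every decision where the model saw a long gap becomes a data query, not archaeology.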

Emergent behavior is harder to verify. The answer isn't to abandon emergence — it's to build adaptive evals.
"Change a signal, run the eval, measure the effect."
Standard evals check facts about a response
Regex judges. Fast, deterministic. They tell you the response is acceptable.
But they can't tell you if the model is responding to context
A response can pass every check and still be identical whether the customer texted 5 minutes ago or 2 days ago.
Paired differentiation tests whether context matters
Same scenario, two temporal conditions. An LLM judge scores 0–1 on whether behavior actually changes across urgency, action, and tone. Based on established NLP methodology — perturb inputs in meaningful ways, measure whether models appropriately shift output.
5 minutes ago
"Great, let me know when you're ready to schedule!"
2 days ago
"Great, let me know when you're ready to schedule!"
Follow timing-sensitive instructions (when to act, how long to wait)
Handle edge cases it wasn't explicitly told about
Stop doing something it was originally told to do (competing instructions)
Produce different behavior for situations that look similar but aren't

Everything we've built — trust, traces, evals — is in service of one question: what does the model actually see?
"Most agent failures are not model failures — they are context failures." — Phil Schmid
Before — Raw timestamps
[2026-03-10T14:47:00-07:00] Customer: "Leaky faucet..."
[2026-03-10T14:52:00-07:00] Agent: "I can help..."
[2026-03-10T14:55:00-07:00] Customer: "Let me check with my husband"
[2026-03-12T09:00:00-07:00] ← current time
Model must subtract ISO timestamps to know 42 hours passed. It can't.
After — Pre-computed context (Thursday 9 AM)
[1d ago] Customer: "Leaky faucet..."
[1d ago] Agent: "I can help..."
[1d ago] Customer: "Let me check with my husband"
── CONVERSATION TIMING ──
Last message from contact: 1d ago
Contact's last reply took 2m
Conversation spans 1d (5 messages)
No arithmetic required. The model reads facts and reasons.
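The pre-computation itself is small. A sketch, assuming a simple { sender, at, text } message shape (an assumption, not the demo's schema):

```js
// Floor-based humanizer: 42 hours renders as "1d", 5 minutes as "5m".
function fmtDuration(ms) {
  const m = Math.max(1, Math.floor(ms / 60_000));
  if (m < 60) return `${m}m`;
  const h = Math.floor(m / 60);
  if (h < 24) return `${h}h`;
  return `${Math.floor(h / 24)}d`;
}

// Render the transcript with relative prefixes plus a computed timing block,
// so the model never has to subtract ISO timestamps.
function renderContext(messages, now) {
  const lines = messages.map(
    msg => `[${fmtDuration(now - msg.at)} ago] ${msg.sender}: ${msg.text}`
  );
  const lastContact = [...messages].reverse().find(m => m.sender === 'Customer');
  lines.push('── CONVERSATION TIMING ──');
  if (lastContact) {
    lines.push(`Last message from contact: ${fmtDuration(now - lastContact.at)} ago`);
  }
  const span = messages[messages.length - 1].at - messages[0].at;
  lines.push(`Conversation spans ${fmtDuration(span)} (${messages.length} messages)`);
  return lines.join('\n');
}
```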
Same conversation, two moments
Routine. She just said she'd check with her husband. Nothing to do.
Last message from contact: 2m ago
Status: Normal conversation pace
No cadence table. No if (hoursElapsed > 24). The model sees the gap and acts on it.
One computed fact changes whether the model acts
Without response pattern
"Last message from contact: 3h ago"
"Contact's last reply took 2m"
With response pattern
"Last message from contact: 3h ago"
"Contact replied to you in 2m. You have not replied in 3h."
No labels. No directives. Two computed durations — the model draws its own conclusion.
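Computing that pair of durations takes only a few lines. A sketch, under the same assumed message shape as above:

```js
// The response-pattern signal: two computed durations, no labels, no directives.
function responsePattern(messages, now) {
  const lastContact = [...messages].reverse().find(m => m.sender === 'Customer');
  if (!lastContact) return null;
  const before = messages.slice(0, messages.indexOf(lastContact));
  const prevAgent = [...before].reverse().find(m => m.sender === 'Agent');
  if (!prevAgent) return null;
  return {
    contactReplyLatencyMs: lastContact.at - prevAgent.at, // how fast they answered you
    yourSilenceMs: now - lastContact.at,                  // how long they've been waiting
  };
}

// Rendered as plain facts, e.g.:
// "Contact replied to you in 2m. You have not replied in 3h."
```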
Merkelbach (2025): up to 18% performance shifts from how signals are presented. Specificity is the mechanism.
Not prompt engineering by intuition — engineering by measurement.

What must never be a model decision.
Enforcement rules live inside trust boundaries. They're the subset where the consequence is legal, financial, or trust-destroying regardless of context.
Enforcement — code decides
Always wrong, no matter the context.
Opt-out blocking
$500–$1,500 per-text fines.
Race condition guards
Duplicate messages confuse customers.
Duplication caps
Runaway scheduling is a system failure.
Interpretation — model decides
Only sometimes wrong, depending on context.
Tone and urgency
Whether to follow up after a reply
How long to wait
When to escalate
Inject the fact, let the model decide.
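The same split applies inside schedule_followup. A sketch of one context-independent guard, where MAX_PENDING_FOLLOWUPS, listPending, and scheduleEvent are illustrative names, not the demo's actual code:

```js
const MAX_PENDING_FOLLOWUPS = 3;

// Hypothetical helpers: stand-ins for your scheduling store.
async function listPending(contactId) { /* query scheduled events */ return []; }
async function scheduleEvent(contactId, at) { /* persist the follow-up */ }

async function handleScheduleFollowup({ contactId, at }) {
  // Duplication cap: runaway scheduling is a system failure, never a judgment call.
  const pending = await listPending(contactId);
  if (pending.length >= MAX_PENDING_FOLLOWUPS) {
    return { ok: false, error: 'Too many pending follow-ups. Cancel one first.' };
  }
  // When to follow up stays the model's decision; code only caps the blast radius.
  await scheduleEvent(contactId, at);
  return { ok: true };
}
```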
The scaffolding that feels safest is the most likely to be stealing a judgment call.
No fine-tuning from outcomes — general reasoning is the feature
No programmatic back-off — the model sees timing facts and decides
No per-agent skill progression — agents get better context, not levels
No Reflexion loops for context problems — fix the input, not the reasoning
No curiosity budgets — don't cap judgment with arbitrary limits
These aren't future work. They're intentionally excluded.
The thesis: temporal context + tools + model judgment is sufficient — adding these mechanisms would undermine the model's ability to exercise contextual judgment.
Trust Boundaries
Trust in tool handlers, not decision trees
Runtime Observability
Structured traces make every decision visible
Adaptive Evals
Measure behavioral change with paired differentiation
Context Engineering
Computed facts, response patterns, the flywheel
Enforcement
Hard rules for context-independent harms only
Sarah's faucet and Mike's burst pipe flow through the same stack and produce completely different, appropriate responses.
Pre-compute what models can't calculate
Convert temporal arithmetic into temporal state. Give the model facts, not math problems.
Trust the model with judgment, enforce only what's context-independent
The scaffolding that feels safest is the most likely to be stealing a judgment call.
Measure whether your signals actually change behavior
If you can't score it, you can't improve it. Inject facts, not interpretations — then measure whether the model acts on them.
Jeff Toffoli · jeff@quallaa.com
Questions, feedback, or war stories welcome
jeff@quallaa.com
quallaa.com/trace-framework
Your agent has no idea what time it is.
Now you know how to fix that.

