Building AI Agents You Can Trust
Jeff Toffoli
Founder, Quallaa

What time is it?
Demo repository
github.com/quallaa-ai/trace-demo
Just learning about AI agents — haven't built anything yet
Built a simple chatbot or demo, but not in production
Working with AI agents in production (or close to it)
Managing teams or projects that involve AI agents
A framework for building AI agents that make better decisions — with live code and real API calls.
Why AI agents fail at timing decisions
Models can't do timestamp arithmetic — and the workarounds most teams reach for make it worse.
How to fix it with computed facts
Pre-compute what the model can't calculate. Present facts, not interpretations. Measure whether it works.
Where to draw the trust boundary
Which decisions the model should own, which ones code must enforce, and how to tell the difference.
How to measure what matters
Paired evals that test whether your context engineering actually changes agent behavior — not just whether responses look right.
Sarah — The Faucet
42 hours of silence → Gentle follow-up referencing the faucet
→ References "checking with my husband"
→ No pressure. Respectful.
→ After two unanswered follow-ups, graceful close
Mike — The Burst Pipe
Saturday 11:31 PM → Immediate acknowledgment of urgency
→ Check the on-call schedule. Escalate.
→ Saturday 11:31 PM is not Monday 9 AM
→ Time of day matters. Day of week matters.
Getting one right is easy. Getting both right from the same agent — no if/else, no special-casing — that's the hard problem.
What engineers reach for first — and why they backfire for timing decisions
Programmatic cadences
"Follow up at 24h, 48h, 72h, then stop." Rigid. Treats Sarah and Mike identically.
schedule(24h, 48h, 72h)
Exponential back-off
"Wait 1h, then 2h, then 4h, then 8h..." Borrowed from retry logic. Customers aren't failed API calls.
delay *= 2
Decision trees
if (urgency === 'high' && hoursElapsed > 1) { escalate() }. Every edge case spawns a new branch.
if/else/if/else...
Curiosity budgets
"The model can ask at most 3 clarifying questions." Arbitrary caps on judgment.
maxQuestions: 3
Reflexion loops
Reflexion with external feedback (test results, tool outputs) works. But for context problems — where the model can't see the data it needs — self-critique without new information just reinforces the gap.
retry(critique(output))
"The scaffolding that feels safest is the most likely to be stealing a judgment call."
Hard-coded delays (e.g., wait 4 hours, then 1 day, then 3 days)
Rule-based logic with some conditional branching
Let the model decide, but I'm not sure it's working well
Haven't built follow-up logic yet
LLMs are optimized for next-token probability, not numeric precision. 847 × 293 isn't calculated — it's generated as a token sequence that looks like a result.
They'll get 2+2 right (common in training data) but fail on anything requiring multi-step precision. One wrong step cascades through the rest.
Reasoning models (o1, DeepSeek-R1) have gotten dramatically better through reinforcement learning. Tool-using agents can call calculators.
But arithmetic and mathematical reasoning are genuinely different skills, and for our use case — timestamp arithmetic in a system prompt — there is no calculator to call. The model is on its own.
Inject facts, not interpretations.
Interpretations
Urgency scores, cadence steps, backoff rules, tone guidance. Code decides what the situation means and tells the model what to do.
Facts
Elapsed time, waiting status, response latency, unanswered count, conversation span. Code computes what happened and lets the model decide what it means.
The three-part contract: JavaScript computes. The prompt presents. The model interprets.
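A minimal sketch of the two shapes, in the demo's own JavaScript. The field names here are illustrative assumptions, not the trace-demo repo's actual schema:

```js
// Interpretations: code has already decided what the situation means.
// (Illustrative field names, not the demo's actual schema.)
const interpretation = {
  urgency: 'low',     // code's judgment, not a fact
  cadenceStep: 2,     // prescribes the next follow-up
  tone: 'gentle',     // tells the model how to sound
};

// Facts: code reports what happened; the model decides what it means.
const facts = {
  lastContactMessageAgo: '42h',  // elapsed time, pre-computed
  waitingOnContact: true,        // whose turn it is
  lastReplyLatency: '2m',        // how fast they replied last time
  unansweredFollowUps: 0,        // count of nudges with no response
  conversationSpan: '1d',        // total span of the thread
};
```

The first object tells the model what to do; the second leaves the judgment call to it.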
A five-layer framework, TRACE: each letter is a distinct responsibility.
Trust Boundaries
Runtime Observability
Adaptive Evals
Context Engineering
Enforcement

Every tool the agent has is a trust decision. What can it do — and where are the guardrails?
Four tools, each granting a specific power
send_sms
Can contact customers directly. Every message represents the business.
schedule_followup
Can commit to future actions. The agent decides when to come back.
check_schedule
Can see the full follow-up state. Knows what's already planned.
cancel_event
Can undo its own decisions. Self-correction without human intervention.
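In function-calling terms, each of those capabilities is just a declared tool. A sketch in the common JSON-schema style (the tool names match the talk; the parameter schemas are assumptions, and the exact envelope fields vary by provider):

```js
// Tool declarations in the common function-calling style. Parameter schemas
// are assumptions; field names like input_schema vary by provider.
const tools = [
  {
    name: 'send_sms',
    description: 'Send a text message to the customer.',
    input_schema: {
      type: 'object',
      properties: { body: { type: 'string', description: 'Message text' } },
      required: ['body'],
    },
  },
  {
    name: 'schedule_followup',
    description: 'Schedule this agent to wake up and reconsider later.',
    input_schema: {
      type: 'object',
      properties: { at: { type: 'string', description: 'ISO 8601 datetime' } },
      required: ['at'],
    },
  },
  // check_schedule and cancel_event are declared the same way.
];
```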
The send_sms handler pipeline, sketched in code below:
Opt-out check
Enforcement — legal fines. Always blocked.
Length cap
Quality gate — SMS has limits.
Empty guard
Quality gate — don't send nothing.
Success
All checks pass → message sent.
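As a sketch, that pipeline is a guarded handler. isOptedOut and deliverSms below are hypothetical stand-ins for your own opt-out store and SMS provider, not the demo repo's actual code:

```js
const SMS_MAX_LENGTH = 320;

// Hypothetical helpers: stand-ins for your persistence layer and SMS provider.
async function isOptedOut(contactId) { /* look up opt-out status */ return false; }
async function deliverSms(contactId, body) { /* call your SMS provider */ }

async function handleSendSms({ contactId, body }) {
  // 1. Opt-out check: enforcement. Legal consequence, never a judgment call.
  if (await isOptedOut(contactId)) {
    return { ok: false, error: 'Contact has opted out. Message blocked.' };
  }
  // 2. Length cap: quality gate. SMS has limits.
  if (body.length > SMS_MAX_LENGTH) {
    return { ok: false, error: `Message exceeds ${SMS_MAX_LENGTH} characters.` };
  }
  // 3. Empty guard: quality gate. Don't send nothing.
  if (!body.trim()) {
    return { ok: false, error: 'Message body is empty.' };
  }
  // 4. Success: all checks pass, the message goes out.
  await deliverSms(contactId, body);
  return { ok: true };
}
```

Returning failures as tool results rather than throwing lets the model see why a call was blocked and adjust, while the block itself never depends on the model's judgment.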
Trust is the architecture — tools as capabilities, handlers as contracts. Enforcement rules live inside, but they're the subset that's never a judgment call.

If the agent's cadence is emergent, not prescribed, how do you know what it's doing? Without traces, an emergent cadence is a black box.
Every agent decision becomes a structured record
What the model saw
The temporal context block injected into the prompt
What the model decided
The response text — tone, content, judgment calls
What tools it called
Tool names, inputs, and results
Three panels, one decision. Fully queryable.
SAW
Last message from contact: 1d ago. Customer said "let me check with my husband."
DECIDED
"Hey, just checking in — still interested in the faucet repair? No rush at all."
TOOLS
send_sms → ✓
schedule_followup → Fri
Calm tone. Gentle. Referenced the faucet. Scheduled another check-in.
Same agent, same code, same tools. The trace makes the why visible.
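One plausible shape for such a record (the field names are assumptions, not the demo's actual schema):

```js
// One trace record per agent decision. Field names are illustrative.
const trace = {
  decisionId: 'dec_0142',
  saw: {
    // The exact temporal context block injected into the prompt.
    timing: "Last message from contact: 1d ago\nContact's last reply took 2m",
  },
  decided: {
    // The response text: tone, content, judgment calls.
    text: 'Hey, just checking in — still interested in the faucet repair? No rush at all.',
  },
  tools: [
    // Tool names, inputs, and results.
    { name: 'send_sms', input: { body: '...' }, result: { ok: true } },
    { name: 'schedule_followup', input: { at: 'Fri' }, result: { ok: true } },
  ],
};
```

Plain structured records are what make "fully queryable" true: finding every decision where the model saw a long gap becomes a data query, not archaeology.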

Emergent behavior is harder to verify. The answer isn't to abandon emergence — it's to build adaptive evals.
"Change a signal, run the eval, measure the effect."
Standard evals check facts about a response
Regex judges. Fast, deterministic. They tell you the response is acceptable.
But they can't tell you if the model is responding to context
A response can pass every check and still be identical whether the customer texted 5 minutes ago or 2 days ago.
Paired differentiation tests whether context matters
Same scenario, two temporal conditions. An LLM judge scores 0–1 on whether behavior actually changes across urgency, action, and tone. Based on established NLP methodology — perturb inputs in meaningful ways, measure whether models appropriately shift output.
5 minutes ago
"Great, let me know when you're ready to schedule!"
2 days ago
"Great, let me know when you're ready to schedule!"
Follow timing-sensitive instructions (when to act, how long to wait)
Handle edge cases it wasn't explicitly told about
Stop doing something it was originally told to do (competing instructions)
Produce different behavior for situations that look similar but aren't

Everything we've built — trust, traces, evals — is in service of one question: what does the model actually see?
"Most agent failures are not model failures — they are context failures." — Phil Schmid
Before — Raw timestamps
[2026-03-10T14:47:00-07:00] Customer: "Leaky faucet..."
[2026-03-10T14:52:00-07:00] Agent: "I can help..."
[2026-03-10T14:55:00-07:00] Customer: "Let me check with my husband"
[2026-03-12T09:00:00-07:00] ← current time
Model must subtract ISO timestamps to know 42 hours passed. It can't.
After — Pre-computed context (Thursday 9 AM)
[1d ago] Customer: "Leaky faucet..."
[1d ago] Agent: "I can help..."
[1d ago] Customer: "Let me check with my husband"
── CONVERSATION TIMING ──
Last message from contact: 1d ago
Contact's last reply took 2m
Conversation spans 1d (5 messages)
No arithmetic required. The model reads facts and reasons.
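The pre-computation itself is small. A sketch, assuming a simple { sender, at, text } message shape (an assumption, not the demo's schema):

```js
// Floor-based humanizer: 42 hours renders as "1d", 5 minutes as "5m".
function fmtDuration(ms) {
  const m = Math.max(1, Math.floor(ms / 60_000));
  if (m < 60) return `${m}m`;
  const h = Math.floor(m / 60);
  if (h < 24) return `${h}h`;
  return `${Math.floor(h / 24)}d`;
}

// Render the transcript with relative prefixes plus a computed timing block,
// so the model never has to subtract ISO timestamps.
function renderContext(messages, now) {
  const lines = messages.map(
    msg => `[${fmtDuration(now - msg.at)} ago] ${msg.sender}: ${msg.text}`
  );
  const lastContact = [...messages].reverse().find(m => m.sender === 'Customer');
  lines.push('── CONVERSATION TIMING ──');
  if (lastContact) {
    lines.push(`Last message from contact: ${fmtDuration(now - lastContact.at)} ago`);
  }
  const span = messages[messages.length - 1].at - messages[0].at;
  lines.push(`Conversation spans ${fmtDuration(span)} (${messages.length} messages)`);
  return lines.join('\n');
}
```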
Same conversation, two moments
Routine. She just said she'd check with her husband. Nothing to do.
Last message from contact: 2m ago
Status: Normal conversation pace
No cadence table. No if (hoursElapsed > 24). The model sees the gap and acts on it.
One computed fact changes whether the model acts
Without response pattern
"Last message from contact: 3h ago"
"Contact's last reply took 2m"
With response pattern
"Last message from contact: 3h ago"
"Contact replied to you in 2m. You have not replied in 3h."
No labels. No directives. Two computed durations — the model draws its own conclusion.
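Computing that pair of durations takes only a few lines. A sketch, under the same assumed message shape as above:

```js
// The response-pattern signal: two computed durations, no labels, no directives.
function responsePattern(messages, now) {
  const lastContact = [...messages].reverse().find(m => m.sender === 'Customer');
  if (!lastContact) return null;
  const before = messages.slice(0, messages.indexOf(lastContact));
  const prevAgent = [...before].reverse().find(m => m.sender === 'Agent');
  if (!prevAgent) return null;
  return {
    contactReplyLatencyMs: lastContact.at - prevAgent.at, // how fast they answered you
    yourSilenceMs: now - lastContact.at,                  // how long they've been waiting
  };
}

// Rendered as plain facts, e.g.:
// "Contact replied to you in 2m. You have not replied in 3h."
```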
Merkelbach (2025): up to 18% performance shifts from how signals are presented. Specificity is the mechanism.
Not prompt engineering by intuition — engineering by measurement.

What must never be a model decision.
Enforcement rules live inside trust boundaries. They're the subset where the consequence is legal, financial, or trust-destroying regardless of context.
Enforcement — code decides
Always wrong, no matter the context.
Opt-out blocking
$500–$1,500 per-text fines.
Race condition guards
Duplicate messages confuse customers.
Duplication caps
Runaway scheduling is a system failure.
Interpretation — model decides
Only sometimes wrong, depending on context.
Tone and urgency
Whether to follow up after a reply
How long to wait
When to escalate
Inject the fact, let the model decide.
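The same split applies inside schedule_followup. A sketch of one context-independent guard, where MAX_PENDING_FOLLOWUPS, listPending, and scheduleEvent are illustrative names, not the demo's actual code:

```js
const MAX_PENDING_FOLLOWUPS = 3;

// Hypothetical helpers: stand-ins for your scheduling store.
async function listPending(contactId) { /* query scheduled events */ return []; }
async function scheduleEvent(contactId, at) { /* persist the follow-up */ }

async function handleScheduleFollowup({ contactId, at }) {
  // Duplication cap: runaway scheduling is a system failure, never a judgment call.
  const pending = await listPending(contactId);
  if (pending.length >= MAX_PENDING_FOLLOWUPS) {
    return { ok: false, error: 'Too many pending follow-ups. Cancel one first.' };
  }
  // When to follow up stays the model's decision; code only caps the blast radius.
  await scheduleEvent(contactId, at);
  return { ok: true };
}
```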
The scaffolding that feels safest is the most likely to be stealing a judgment call.
No fine-tuning from outcomes — general reasoning is the feature
No programmatic back-off — the model sees timing facts and decides
No per-agent skill progression — agents get better context, not levels
No Reflexion loops for context problems — fix the input, not the reasoning
No curiosity budgets — don't cap judgment with arbitrary limits
These aren't future work. They're intentionally excluded.
The thesis: temporal context + tools + model judgment is sufficient — adding these mechanisms would undermine the model's ability to exercise contextual judgment.
Trust Boundaries
Trust in tool handlers, not decision trees
Runtime Observability
Structured traces make every decision visible
Adaptive Evals
Measure behavioral change with paired differentiation
Context Engineering
Computed facts, response patterns, the flywheel
Enforcement
Hard rules for context-independent harms only
Sarah's faucet and Mike's burst pipe flow through the same stack and produce completely different, appropriate responses.
Pre-compute what models can't calculate
Convert temporal arithmetic into temporal state. Give the model facts, not math problems.
Trust the model with judgment, enforce only what's context-independent
The scaffolding that feels safest is the most likely to be stealing a judgment call.
Measure whether your signals actually change behavior
If you can't score it, you can't improve it. Inject facts, not interpretations — then measure whether the model acts on them.
Jeff Toffoli · jeff@quallaa.com
Questions, feedback, or war stories welcome
jeff@quallaa.com
quallaa.com/trace-framework
Your agent has no idea what time it is.
Now you know how to fix that.

