Engineering · 10 min read

Risk Scoring for Public-Facing AI: Eight Dimensions, Compound Scores, Hard Stops

Jeff Toffoli
Robotic arm with auto annotation label -- risk scoring as a structured, scorable, automatable system

"Is this AI deployment safe?" is the wrong question. Every deployment carries risk. The right question is: how much risk, on which dimensions, and what safety measures match it?

We built a risk scoring engine that answers exactly that — eight dimensions, scored 1 to 5, combined into a compound score with hard-stop thresholds for combinations that simply shouldn't ship. It runs inside our MCP write tools, so every configuration change to a public-facing AI agent gets evaluated before it takes effect.

This post walks through the framework. There's an interactive version at /risk-scoring where you can score example deployments yourself.

The Quallaa risk scoring presentation page, showing the eight dimensions and live deployment scoring
The interactive risk scoring presentation at /risk-scoring. Move the sliders for any deployment and watch the compound score change in real time.

The Eight Dimensions

The dimensions are designed to be independent — moving one shouldn't automatically move the others. They're also designed to be scorable by a non-expert reading the criteria, not just by someone who's read NIST AI RMF cover to cover. Each one is anchored in real frameworks (NIST, EU AI Act, OWASP Agentic Top 10, UC Berkeley Agentic Standards, NVIDIA Frontier Risk, AWS Agentic Security Scoping Matrix), then synthesized into criteria you can actually apply.

1. Autonomy. How much freedom does the agent have to act without human approval? Score 1: drafts responses, human reviews and sends every one. Score 5: initiates actions, spawns sub-tasks, runs continuously without a human in the loop. A plumber's text-back agent that responds, books, and follows up on its own — with the owner reviewing conversations daily — is autonomy 3.

2. Action capability. What can the agent actually do in the world? Score 1: read-only, no external effects. Score 5: irreversible financial, legal, or physical actions across multiple systems. Sending an email is higher than answering a question. Charging a card is higher than sending an email. Filing a permit is higher than charging a card.

3. Consequence severity. If the agent is wrong, how bad is it? Score 1: customer mildly annoyed. Score 5: physical harm, legal liability, or catastrophic financial loss. The same wrong answer about pricing has different severity for a yard care company versus a hospital triage line.

4. Reversibility. If the agent makes a mistake, can you undo it? Unlike the other dimensions, a higher score here is safer. Score 1: permanent, with no remediation path. Score 5: trivial to reverse, no residue. A booked appointment is highly reversible. A sent legal document is not.

5. Audience exposure. How many people see the agent's outputs, and who are they? Score 1: one authenticated user at a time. Score 5: broadcast to a public audience, indexed, archived. A 1:1 text conversation is low exposure. A social media auto-post is high exposure.

6. Domain sensitivity. What field is the agent operating in? Score 1: low-stakes general help. Score 5: regulated domains — health, legal, financial, election-related. Domain sensitivity is independent of consequence — a low-consequence wrong answer in a regulated domain still triggers compliance obligations that don't exist elsewhere.

7. Identity representation. Who does the agent appear to be? Score 1: explicitly labeled as AI, distinct from any individual. Score 5: speaks as a specific named human, creating reasonable belief that a human is present. The Air Canada chatbot was a 5 on this dimension. The customer believed a representative of Air Canada had quoted them a refund. The court agreed.

8. Data sensitivity. What kind of data does the agent touch? Score 1: public information only. Score 5: regulated PII, PHI, financial records, or credentials. An agent that knows your business hours is low. An agent that knows your customers' medication histories is high.
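
In code, the eight dimensions reduce to a small typed profile. A sketch in TypeScript (the field names and the `Score` union are illustrative, not the engine's actual schema), filled in with the plumber text-back example scored later in this post:

```typescript
// Illustrative shape only; the real engine's schema isn't published.
type Score = 1 | 2 | 3 | 4 | 5;

interface RiskProfile {
  autonomy: Score;
  actionCapability: Score;
  consequenceSeverity: Score;
  reversibility: Score; // higher = easier to undo
  audienceExposure: Score;
  domainSensitivity: Score;
  identityRepresentation: Score;
  dataSensitivity: Score;
}

// The plumber text-back bot as scored in this post:
const plumberBot: RiskProfile = {
  autonomy: 3,
  actionCapability: 2,
  consequenceSeverity: 2,
  reversibility: 4,
  audienceExposure: 1,
  domainSensitivity: 1,
  identityRepresentation: 3,
  dataSensitivity: 2,
};
```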

Why Independence Matters

The trick is that these dimensions don't track each other. A plumber's text-back bot might be autonomy 3 (acts on its own with daily review), action capability 2 (texts and books, nothing more), consequence severity 2 (a wrong booking is annoying but recoverable), reversibility 4 (easy to fix), exposure 1 (1:1 conversations), domain sensitivity 1, identity 3 (sounds like a person, not labeled as AI), data sensitivity 2.

A fully autonomous marketing campaign agent might be autonomy 5, action capability 4, consequence severity 4 (brand damage at scale), reversibility 1 (sent emails can't be unsent), exposure 5 (broadcast), domain sensitivity 2, identity 4, data sensitivity 3.

These are very different deployments. A scoring system that collapses them into a single number ("medium risk") loses the information that lets you decide what to do about it. The plumber bot needs guardrails on what it agrees to. The marketing agent needs human review on every send.

Compound Scoring

Independent dimensions still need to be combined into something actionable. We use a weighted geometric mean, computed over each dimension's remaining safety headroom (6 minus the risk score) rather than over the raw scores. A dimension at its worst leaves a headroom of 1, which drags the whole product down, so no single dimension can be hidden by averaging it against the others. A deployment that's 5 on consequence severity and minimal everywhere else doesn't get a "low risk" overall score: the compounding punishes that one severe dimension harder than a simple average would.
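
The compounding step can be sketched as follows; the equal default weights and the 6-minus-score headroom inversion are illustrative assumptions, not the production values:

```typescript
// Compound score via a weighted geometric mean on the inverted scale.
// Each 1-5 risk score x becomes "safety headroom" (6 - x); the weighted
// geometric mean of headroom is then flipped back to a 1-5 risk number.
// One dimension at 5 leaves headroom 1, which drags the product down
// hard, so a single severe dimension can't be averaged away.
// Reversibility is already oriented "higher = safer" in this framework,
// so it skips the inversion. Equal weights are an assumption.
function compoundScore(
  profile: Record<string, number>,
  weights: Record<string, number> = {},
  invertedDims: Set<string> = new Set(["reversibility"]),
): number {
  let logSum = 0;
  let weightSum = 0;
  for (const [dim, score] of Object.entries(profile)) {
    const w = weights[dim] ?? 1;
    const headroom = invertedDims.has(dim) ? score : 6 - score;
    logSum += w * Math.log(headroom);
    weightSum += w;
  }
  return 6 - Math.exp(logSum / weightSum); // 1 = low risk, 5 = high
}
```

With every dimension at its safest (risk 1, reversibility 5) this returns 1. Spike consequence severity alone to 5 and the compound jumps to roughly 1.9, versus 1.5 for a plain arithmetic average of the same scores.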

Then we layer hard-stop thresholds on top. Some combinations of dimensions trigger an automatic block, regardless of the average. Examples:

  • Identity representation 5 + audience exposure 4 = "speaking as a named human to a broadcast audience" — historically the configuration that produces the most damaging incidents. Not allowed without explicit owner acknowledgment.
  • Action capability 4 + reversibility 1 + autonomy 4 = "irreversible action, no human approval, high impact" — too dangerous to ship by default.
  • Domain sensitivity 5 + data sensitivity 5 = "regulated domain, regulated data" — requires compliance review, not just a configuration toggle.

The thresholds aren't arbitrary. They come from incident analysis: which configurations historically produce the failures that hit the news? Those are the ones we hard-stop.
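
As a sketch, the hard stops are just predicates evaluated before any averaging. The thresholds below mirror the three examples above; the rule wording, type names, and the idea that these three are the whole set are assumptions:

```typescript
// One predicate per hard stop; any match blocks the change outright,
// regardless of the compound average.
interface Profile {
  autonomy: number;
  actionCapability: number;
  reversibility: number; // higher = easier to undo
  audienceExposure: number;
  domainSensitivity: number;
  identityRepresentation: number;
  dataSensitivity: number;
}

const hardStops: { reason: string; hit: (p: Profile) => boolean }[] = [
  {
    reason: "speaking as a named human to a broadcast audience",
    hit: (p) => p.identityRepresentation >= 5 && p.audienceExposure >= 4,
  },
  {
    reason: "irreversible high-impact action with no human approval",
    hit: (p) =>
      p.actionCapability >= 4 && p.reversibility <= 1 && p.autonomy >= 4,
  },
  {
    reason: "regulated domain handling regulated data",
    hit: (p) => p.domainSensitivity >= 5 && p.dataSensitivity >= 5,
  },
];

function checkHardStops(p: Profile): string[] {
  return hardStops.filter((s) => s.hit(p)).map((s) => s.reason);
}
```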

What Happens at Each Tier

The compound score maps to five tiers, each with proportional safety measures:

  • Tier 1 (minimal): Standard logging, escalation on confusion. Default for low-stakes deployments.
  • Tier 2 (low): Above plus daily owner review of escalations.
  • Tier 3 (moderate): Above plus weekly conversation sampling, owner notification on policy edges, explicit guardrails on sensitive topics.
  • Tier 4 (high): Above plus human approval on configuration changes, real-time monitoring, scoped tool access, stricter trust boundary defaults.
  • Tier 5 (critical): Above plus pre-deployment review, ongoing audit, restricted to specific use cases.
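
The score-to-tier mapping above can be sketched with evenly spaced cut points; the real thresholds aren't published, so these are placeholder values:

```typescript
// Map a 1-5 compound score onto the five tiers. The cut points here
// are evenly spaced guesses, not the engine's actual thresholds:
// 1.0-1.8 minimal, 1.8-2.6 low, 2.6-3.4 moderate, 3.4-4.2 high,
// 4.2-5.0 critical.
const TIERS = ["minimal", "low", "moderate", "high", "critical"] as const;
type Tier = (typeof TIERS)[number];

function tierFor(compound: number): Tier {
  if (compound < 1 || compound > 5) throw new RangeError("score out of range");
  const index = Math.min(Math.floor((compound - 1) / 0.8), 4);
  return TIERS[index];
}
```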

The point is that "AI safety" stops being a binary. It's a sliding scale where the safety measures are proportional to the actual risk profile of this specific deployment, not the average of what AI vendors are selling.

Where the Engine Lives

The risk scoring engine isn't a separate dashboard or audit tool. It runs inside our MCP write handlers — the tools that let owners (and Claude Desktop, and other MCP clients) reconfigure their public-facing AI. Every time someone updates their instructions, toggles a tool, or changes their business info, the engine evaluates whether the change shifts the risk profile.

If the change moves a dimension up, the trust layer responds by surfacing a contextual interface that explains what just changed, walks the owner through what it means, and captures their understanding before the change takes effect. If the change crosses a hard-stop threshold, the system declines the change and explains why.
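
That evaluation step reduces to: check hard stops on the post-change profile, then flag any dimension that moved up so the trust layer can walk the owner through it. A minimal sketch, with the hard-stop check injected as a function; every name here is hypothetical, not the actual MCP handler API:

```typescript
type Evaluation =
  | { outcome: "allow" }
  | { outcome: "explain"; raisedDimensions: string[] }
  | { outcome: "decline"; reasons: string[] };

function evaluateChange(
  before: Record<string, number>,
  after: Record<string, number>,
  hardStopReasons: (p: Record<string, number>) => string[],
): Evaluation {
  // Hard stops win over any averaging: decline and explain why.
  const reasons = hardStopReasons(after);
  if (reasons.length > 0) return { outcome: "decline", reasons };

  // Any dimension moving up triggers the contextual walkthrough;
  // otherwise the change applies without ceremony.
  const raised = Object.keys(after).filter(
    (dim) => after[dim] > (before[dim] ?? 0),
  );
  return raised.length > 0
    ? { outcome: "explain", raisedDimensions: raised }
    : { outcome: "allow" };
}
```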

The owner never sees "your deployment is now Tier 3." They see: "You just enabled email tools. That means your agent can now send messages on your behalf to anyone in your contact list. Here's what that means for the kinds of mistakes that become harder to take back, and here's how to set the guardrails."

The risk framework is the engine. The owner experience is the conversation.

Why We Built It

Public-facing AI is the most dangerous deployment category for AI right now. Not because the models are bad, but because the deployments don't come with the safety scaffolding that internal AI tools and authenticated copilots have by default. There's no login, no account, no contract — just a stranger talking to an AI that represents your business.

The existing AI safety frameworks are good, but they're written for AI labs and large enterprises. They assume you have a compliance team, a deployment review board, and a budget for outside auditors. None of that exists for a plumber or a yoga studio. So the frameworks may as well not exist for the audience that needs them most.

A scorable system, with criteria a non-expert can apply, that runs automatically at the moment of configuration, is the only way risk-proportional safety reaches the businesses actually deploying public-facing AI. Building it took less time than the marketing copy about it would have. The hard part was deciding it was the product, not a feature.

Try It

The interactive version at /risk-scoring lets you score the eight dimensions for example deployments — a plumber's text-back, a real estate lead qualifier, a healthcare appointment scheduler, a financial services intake bot — and see how the compound score and tier change as you move the sliders. You can also score your own deployment.

If you're building public-facing AI, the framework is yours to use. The numbers are derived from real frameworks, the criteria are documented, and the engine is open in the sense that it runs predictably from inputs you control. If something in the scoring looks wrong for your domain, tell us — incident data is how the thresholds get sharper.
