AI-driven inbox triage for small teams

TL;DR: A small-team inbox-triage AI sorts incoming mail, drafts replies, and routes by category, giving a founder back 1–2 hours a day. A typical small-team inbox gets 30–80 messages daily, of which 5–10 need a human reply. The five jobs: classify (support, sales, billing, partnership, spam), route, draft a response in your voice, auto-handle the rest, and surface the messages that need judgment with a draft attached. Build cost: $5k–$13k. Run cost: $50–$80/mo at 50 messages a day. The guardrails to engineer in: never auto-send, never delete, always log, always escalate when confidence drops. Done well it returns hours. Done poorly it loses mail.

The shared inbox (hello@yourcompany.com, support@, sales@) is where founder time goes to die. A typical small-team inbox gets 30–80 messages a day, of which maybe 5–10 actually need a human reply, and the rest are notifications, newsletters, vendor pitches, and questions a saved reply could answer.

An inbox-triage AI sorts, drafts, and routes. Done well, it gives a founder back 1–2 hours a day. Done poorly, it loses mail and damages trust. The line between the two is a few specific design decisions.

What inbox triage actually does

Five jobs:

Classify every incoming message: support, sales, billing, partnership, spam, vendor pitch, internal.
Route to the right person, channel, or queue.
Draft a response for the messages that need one, in the team's voice.
Auto-handle the messages that don't need a human (auto-reply, archive, label).
Surface the messages that genuinely need human judgment, with a pre-drafted reply attached.

The goal isn't full automation. It's making the human's first interaction with the message be "approve or edit," not "read from scratch."

What the build looks like

Most builds have the same architecture.

Step 1: Pull from the inbox. Gmail API, Outlook Graph API, or IMAP. Polling every 1–5 minutes is fine for most teams. Webhooks where available.
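A minimal sketch of the IMAP variant of this step, using only Python's standard library. The function names and the returned field names are illustrative assumptions, not part of any particular build. The mailbox is opened read-only so that polling cannot mark messages as read or delete them.

```python
import email
import imaplib
from email import policy


def parse_message(raw: bytes) -> dict:
    """Turn raw RFC 822 bytes into the fields the later steps need."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain", "html"))
    return {
        "from": str(msg["From"] or ""),
        "subject": str(msg["Subject"] or ""),
        "message_id": str(msg["Message-ID"] or ""),
        "body": body.get_content().strip() if body is not None else "",
    }


def fetch_unseen(host: str, user: str, password: str) -> list[dict]:
    """Fetch unread messages without changing any flags on the server."""
    with imaplib.IMAP4_SSL(host) as conn:
        conn.login(user, password)
        conn.select("INBOX", readonly=True)  # read-only: nothing is marked or deleted
        _, data = conn.search(None, "UNSEEN")
        messages = []
        for num in data[0].split():
            _, parts = conn.fetch(num, "(RFC822)")
            messages.append(parse_message(parts[0][1]))
        return messages
```

A scheduler would call `fetch_unseen` every 1–5 minutes and track `message_id` values it has already processed, since a read-only fetch leaves messages unread.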

Step 2: Classify with an LLM. A short prompt that returns one of 5–8 categories. Few-shot examples help a lot.
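One way to sketch the classification step. The few-shot examples here are made up for illustration, and `complete` is a stand-in for whichever LLM client the build uses (it takes a prompt string and returns the model's text). Anything outside the known category list is treated as "needs_review" rather than guessed at.

```python
from typing import Callable

CATEGORIES = ["support", "sales", "billing", "partnership",
              "spam", "vendor pitch", "internal"]

# Hypothetical few-shot examples; a real build would draw these from the team's inbox.
FEW_SHOT = [
    ("Subject: Can't log in\nI reset my password and still can't get in.", "support"),
    ("Subject: Pricing for 20 seats\nDo you offer annual discounts?", "sales"),
    ("Subject: Invoice 1042 charged twice\nPlease refund the duplicate.", "billing"),
]


def build_prompt(subject: str, body: str) -> str:
    """Assemble a short instruction, the examples, and the message to classify."""
    lines = [
        "Classify the email into exactly one category: " + ", ".join(CATEGORIES) + ".",
        "Reply with the category name only.",
        "",
    ]
    for text, label in FEW_SHOT:
        lines += [text, f"Category: {label}", ""]
    lines += [f"Subject: {subject}\n{body}", "Category:"]
    return "\n".join(lines)


def classify(subject: str, body: str, complete: Callable[[str], str]) -> str:
    """Return a known category, or 'needs_review' when the answer is unusable."""
    answer = complete(build_prompt(subject, body)).strip().lower()
    return answer if answer in CATEGORIES else "needs_review"
```

Passing the model call in as a function keeps the prompt logic testable without network access and makes it easy to swap providers later.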

Step 3: Look up context. Is this a known customer? Has there been a previous thread? Do we have an open deal in HubSpot? Pull the relevant context.
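A rough sketch of the context lookup, with in-memory sets and dicts standing in for the CRM and the thread history. The `Context` fields and the `lookup_context` signature are assumptions for illustration; a real build would replace the arguments with CRM and mail-API queries.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    known_customer: bool = False
    open_deal: bool = False
    previous_threads: list[str] = field(default_factory=list)


def lookup_context(
    sender: str,
    customers: set[str],
    open_deals: set[str],
    threads: dict[str, list[str]],
) -> Context:
    """Answer the three context questions for one sender address."""
    key = sender.strip().lower()  # normalize so "Ada@X.com" matches "ada@x.com"
    return Context(
        known_customer=key in customers,
        open_deal=key in open_deals,
        previous_threads=threads.get(key, []),
    )
```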

Step 4: Route or draft. Routing is rule-based. Drafting is LLM-based, with retrieval over past replies and the team's saved templates.
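The rule-based half of this step can be as small as a lookup table. The destination names below are hypothetical placeholders; the point of the sketch is that unknown categories fall through to human review instead of being dropped.

```python
# Hypothetical destinations; each team would define its own.
ROUTES = {
    "support": "support queue",
    "sales": "sales owner",
    "billing": "billing queue",
    "partnership": "founder",
    "internal": "team channel",
    "spam": "auto-handle",
    "vendor pitch": "auto-handle",
}

# Categories where a drafted reply is worth generating.
DRAFT_CATEGORIES = {"support", "sales", "billing", "partnership"}


def plan(category: str) -> dict:
    """Decide where a message goes and whether the LLM should draft a reply."""
    return {
        "route": ROUTES.get(category, "founder review"),  # unknown means a human looks
        "draft": category in DRAFT_CATEGORIES,
    }
```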

Step 5: Surface the result. Either as a draft saved to the inbox, a Slack message with the draft, or a queue UI the founder reviews each morning.
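For the Slack option, a minimal sketch using a Slack incoming webhook, which accepts a JSON body with a `text` field. The message layout and function names are assumptions; the webhook URL would come from the team's own Slack configuration.

```python
import json
import urllib.request


def slack_payload(sender: str, subject: str, category: str, draft: str) -> dict:
    """Build the JSON body for a Slack incoming webhook."""
    text = (
        f"[{category}] {subject} (from {sender})\n"
        f"Proposed reply:\n{draft}"
    )
    return {"text": text}


def post_to_slack(webhook_url: str, payload: dict) -> int:
    """POST the payload and return the HTTP status code."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

Note that this only notifies. Sending the reply stays a human action in the inbox or the review queue.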

That's the whole architecture. The hard parts are in steps 3, 4, and 5.

The cost

For a 1–5 person team handling 50 messages a day:

LLM API: 50 messages × 30 days × $0.02 per message = ~$30/mo.
Hosting: $20/mo.
Gmail API: free at this volume.
Vector store (for past-replies retrieval): $0–$30/mo.
Total: ~$50–$80/mo.
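The line items above can be reproduced with a few lines of arithmetic, which makes it easy to plug in a different volume. The function and parameter names are illustrative; the default values are the figures from the list.

```python
def monthly_run_cost(
    messages_per_day: int,
    cost_per_message: float = 0.02,
    days: int = 30,
    hosting: float = 20.0,
    vector_store: tuple[float, float] = (0.0, 30.0),
) -> tuple[float, float, float]:
    """Return (LLM cost, low total, high total) in dollars per month."""
    llm = messages_per_day * days * cost_per_message
    low = llm + hosting + vector_store[0]
    high = llm + hosting + vector_store[1]
    return llm, low, high


llm, low, high = monthly_run_cost(50)
print(f"LLM ${llm:.0f}/mo, total ${low:.0f}-${high:.0f}/mo")
# prints: LLM $30/mo, total $50-$80/mo
```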

Build cost: $5k–$12k for a working system tuned to the team's voice and rules.

When this pays back

Math we see most often.

A founder spends 90 minutes a day on inbox triage. Of that, 60 minutes is reading and dismissing messages that didn't need them, and 30 minutes is drafting replies that mostly write themselves.

An inbox-triage system that auto-handles or pre-drafts about 70% of messages cuts that 90 minutes to ~25 minutes (review the AI's drafts, approve, edit the few that need real attention).

Savings: 65 minutes/day × 22 working days ≈ 24 hours/month. At $200/hr founder time, that's roughly $4,800/month. Against a $5k–$12k build, payback comes in 1–3 months.
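The same payback math as a small calculator, so the inputs can be swapped for a different founder rate or build quote. Names are illustrative. The unrounded figures come out slightly below the rounded ones in the text.

```python
def payback(
    minutes_saved_per_day: float = 65,
    working_days: int = 22,
    hourly_rate: float = 200.0,
    build_cost: tuple[float, float] = (5_000, 12_000),
) -> tuple[float, float, tuple[float, ...]]:
    """Return (hours saved per month, dollar value per month, payback months)."""
    hours = minutes_saved_per_day * working_days / 60
    monthly_value = hours * hourly_rate
    months = tuple(cost / monthly_value for cost in build_cost)
    return hours, monthly_value, months


hours, value, months = payback()
print(f"{hours:.1f} h/month, ${value:,.0f}/month, "
      f"payback {months[0]:.1f} to {months[1]:.1f} months")
# prints: 23.8 h/month, $4,767/month, payback 1.0 to 2.5 months
```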

The failure modes to engineer for

The reason this is hard is the failure cost. A wrongly-classified or auto-archived message can be a lost lead or an angry customer.

Three failure modes to design around.

1. The auto-handle false positive. The AI classifies an important message as "newsletter" and archives it. The customer waits a week for a reply that never comes.

Fix: Conservative auto-handle rules. Only auto-archive things you're 99% sure about (Mailchimp newsletters, vendor pitches with specific patterns). When in doubt, surface for human review.
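A sketch of what a conservative rule might look like in code. The 0.99 threshold mirrors the "99% sure" rule above; the category allowlist, the `Decision` type, and the extra guard for known customers are assumptions added for illustration. The only automated action is archiving, never deleting, and every decision is logged.

```python
import logging
from dataclasses import dataclass

log = logging.getLogger("triage")

AUTO_ARCHIVE_THRESHOLD = 0.99
AUTO_ARCHIVE_CATEGORIES = {"newsletter", "vendor pitch"}  # assumed allowlist


@dataclass
class Decision:
    action: str  # "archive" or "review"
    reason: str


def decide(category: str, confidence: float, known_customer: bool) -> Decision:
    """Archive only when every check passes; otherwise surface for a human."""
    if known_customer:
        decision = Decision("review", "sender is a known customer")
    elif category not in AUTO_ARCHIVE_CATEGORIES:
        decision = Decision("review", f"category '{category}' is never auto-archived")
    elif confidence < AUTO_ARCHIVE_THRESHOLD:
        decision = Decision("review", f"confidence {confidence:.2f} is below threshold")
    else:
        decision = Decision("archive", "allowlisted category at high confidence")
    log.info("category=%s confidence=%.3f -> %s (%s)",
             category, confidence, decision.action, decision.reason)
    return decision
```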

2. The drafted reply that's confidently wrong. The AI drafts a response with information that's almost right but not quite. The founder is busy, doesn't read carefully, hits send. Customer gets bad info.

Fix: The draft includes confidence flags. "I'm 90% sure about X, less sure about Y. Please verify." Reviewers learn to trust high-confidence drafts and edit low-confidence ones.
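One possible shape for those flags, assuming the drafting step can return a list of factual claims with a confidence score for each. That output format is an assumption of this sketch, not something every model provides out of the box.

```python
def annotate_draft(
    draft: str,
    claims: list[tuple[str, float]],
    threshold: float = 0.9,
) -> str:
    """Append a reviewer note listing every claim below the confidence threshold."""
    unsure = [text for text, confidence in claims if confidence < threshold]
    if not unsure:
        return draft
    notes = "\n".join(f"- VERIFY: {claim}" for claim in unsure)
    return f"{draft}\n\n[Reviewer notes, remove before sending]\n{notes}"
```

Keeping the notes in a clearly marked block at the end means a rushed reviewer sees them before hitting send, and the reply body itself stays clean.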

3. The privacy slip. The system feeds customer email content to a third-party LLM. Some industries (healthcare, legal) can't do this without explicit consent or specific compliance posture.

Fix: Choose your model deliberately. OpenAI and Anthropic both offer enterprise tiers with stronger data handling. For sensitive industries, self-hosted local models may be the only viable path.

What's worth surfacing, and what isn't

The right ratio depends on your inbox.

For a small-team inbox of 50 messages a day, we typically see:

Auto-handled (no human touch): 20–30 messages/day. Newsletters, automated notifications, obvious spam.
Drafted and queued for review: 15–20 messages/day. Routine support, sales replies, partnership outreach. Founder skims, approves, sends.
Surfaced as "needs you specifically": 5–10 messages/day. Anything genuinely high-stakes or ambiguous.

That ratio is the goal. If the AI is auto-handling under 30%, it's too conservative and not saving enough time. If it's auto-handling over 60%, it's almost certainly losing mail.
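The 30% and 60% bounds above translate directly into a daily health check. The function name and the returned labels are illustrative.

```python
def auto_handle_health(auto_handled: int, total: int) -> str:
    """Flag an auto-handle rate outside the 30%-60% band described above."""
    if total == 0:
        return "no data"
    rate = auto_handled / total
    if rate < 0.30:
        return "too conservative"
    if rate > 0.60:
        return "likely losing mail"
    return "ok"
```

For example, 25 auto-handled messages out of 50 returns "ok", while 35 out of 50 returns "likely losing mail" and is a prompt to audit what got archived.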

What to integrate with

A good inbox-triage system writes to:

The inbox itself. Drafts saved as drafts, labels applied, threads marked read.
Slack. Notifications for messages that need urgent attention.
The CRM. New leads or known customers tagged with the relevant context.
A review UI. A simple morning review queue that shows the AI's classifications and lets the founder fast-process them.

The review UI is what makes the difference between "AI helps" and "AI actually gives the founder time back." Without it, the founder still has to context-switch into the inbox to use the drafts.

What we ship for clients

A typical Webdimonia inbox-triage build:

Inbox audit and rule set: included. We watch a sample of messages and define the categories that matter.
Classification + routing pipeline: $2k–$4k.
Drafting with retrieval over past replies: $2k–$4k.
Slack and CRM integrations: $1k–$2k.
Review queue UI: $1k–$3k.
Voice tuning, 30-day calibration window: included.

Total: $6k–$13k. The voice tuning is the part most studios skip and the part that decides whether the founder trusts the drafts.

Three questions to decide this week

How many inbox messages a day, and how many need a human reply? If under 20/day total, this isn't worth building.
Do you have past replies (1+ year of email history) the AI can learn voice from? If yes, voice matching is feasible. If no, the AI's drafts will sound generic.
Are there compliance constraints on what the AI can read? If yes, plan for a self-hosted model. If no, an OpenAI or Anthropic enterprise tier is a reasonable default.

If you want a quote on an inbox-triage build for your team's specific inbox shape and voice, send us a description of the inbox volume, the categories of messages you receive, and your CRM. We send a tiered proposal back within two days.
