Most businesses do not need an AI agent. They need a better workflow.

A lot of AI conversations begin with the model: should we use the most capable API, add a chatbot, build an agent, or run an open-weight model ourselves?

Those questions matter eventually. But there is a tendency, especially in business conversations, to start with the most impressive AI capability available and try to make it fit the organization afterward. The assumption seems to be that once people have access to the tools, the right answers and opportunities will reveal themselves.

That is a recipe for wasted time, unnecessary expense, and disappointing results.

I prefer to start with two simpler questions:

Which workflow is currently expensive, repetitive, slow, or difficult to scale?
What is the least complicated system that can improve it reliably?

Sometimes the answer is an LLM. Other times, the right approach is more conventional: a retrieval layer, document parsing, a rules engine, an API integration, or a better internal process. Often, the useful answer is a combination of those things.

That distinction matters because AI demos are easy to build. They only need to work once.

A real business workflow has to handle messy inputs, permissions, retries, exceptions, review steps, downstream systems, audit logs, and the fact that the model will occasionally be wrong. It also needs a way to measure whether the AI is helping at all.

That is the context worth keeping in mind before deciding where AI belongs in a workflow.

The opportunity is operational leverage

There is a practical reason smaller organizations should pay attention to AI: for some categories of work, it can reduce the labour and time needed to reach a useful result.

A workplace study involving 5,172 customer-support agents at a Fortune 500 software company found that AI assistance increased the number of issues resolved per hour by roughly 15%, with some of the largest benefits going to less-experienced workers.[1] The result came from one organization, so it should not be treated as a universal benchmark. But it is useful evidence that AI assistance can improve a bounded, measurable workflow under real operating conditions.

The opportunity is not limited to customer support. An OECD survey of more than 5,000 small and medium-sized businesses across seven countries found that, among SMEs already using generative AI, 65% reported improved employee performance, 35% said it helped them scale, and 29% said it helped them compete with larger companies.[2] Those are self-reported outcomes rather than measured causal effects, but they point to the kinds of advantages smaller organizations are actively looking for.

A separate field experiment on a cross-border online-retail platform offers a more concrete example. Generative-AI enhancements were introduced across seven consumer-facing workflows, with sales effects ranging from no detectable improvement to a 16.3% increase. Smaller and newer sellers saw disproportionately larger gains.[3] That study is still a working paper, so its findings should be treated as promising evidence rather than a settled rule.

The point is not that AI automatically gives a small business the same advantages as a larger company. The evidence suggests something narrower:

AI can give small teams more operational leverage.

A smaller organization may be able to:

answer more customer requests without growing the support team at the same rate
process more documents before adding administrative headcount
search internal knowledge without relying on one person who remembers where everything lives
prepare first-pass summaries, comparisons, and classifications faster
respond to routine exceptions without starting every answer from scratch
explore new products, suppliers, or markets with less manual research

The gains are not automatic. The same technology can create value in one workflow and add friction in another. Task selection, implementation quality, and measurement still matter.

The useful question is not:

Does AI improve productivity?

It is:

Does this AI-assisted workflow improve productivity for this team, on this kind of work, under these operating conditions?

Use the right tool for each part of the problem

The easiest mistake to make is treating every operational problem as an LLM problem. If a process is deterministic, a deterministic system is usually better.

Problem	Likely starting point
Stable validation rules	Rules engine
Copying data between systems	API integration or RPA for legacy systems
Extracting fields from files	OCR and document AI
Comparing structured records	Deterministic reconciliation
Searching internal policies	Retrieval and reranking
Summarizing grounded documents	LLM with source links
Handling ambiguous exceptions	LLM with human review
Predicting demand or inventory	Classical ML and optimization
High-stakes judgment	Human decision-making with AI support

The most expensive AI mistake is not choosing the wrong model. It is using a model where a rules engine, retrieval system, or ordinary integration would have been more reliable.

That does not just add complexity. It also creates unnecessary operating costs. Whether the model is billed per token or running on your own infrastructure, you are paying for probabilistic behaviour in a part of the workflow that could have been predictable by design.

What makes a good first AI workflow?

The best early AI candidates tend to be:

repetitive enough that improvement matters
language-heavy or document-heavy
grounded in accessible source material
easy to measure
reviewable by a person
still useful when only partially automated

The weakest early candidates tend to require:

autonomous judgment under ambiguity
perfect accuracy
unreliable source data
too many fragile integrations
expensive verification
high-liability decisions without human review

Consider, for example, an operations team processing supplier invoices and packing lists manually. The first useful intervention may not be an agent. It may be document extraction, deterministic reconciliation against purchase orders, and an exception queue for the cases that still need a person. The model handles the ambiguous text and maps the result into a strict internal schema. Ordinary rules handle the stable checks.
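To make the invoice example concrete, here is a minimal sketch of the deterministic half: reconciling already-extracted invoice lines against purchase orders and sending mismatches to an exception queue. The `Line` schema, the `reconcile` function, the price tolerance, and the sample data are all assumptions made for illustration, not a prescribed design.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Line:
    """One invoice or purchase-order line in a (hypothetical) strict internal schema."""
    po_number: str
    sku: str
    quantity: int
    unit_price: Decimal


def reconcile(invoice: Line, po_lines: dict[tuple[str, str], Line],
              price_tolerance: Decimal = Decimal("0.01")) -> list[str]:
    """Return the reasons a line needs review; an empty list means it matches."""
    expected = po_lines.get((invoice.po_number, invoice.sku))
    if expected is None:
        return ["no matching purchase-order line"]
    reasons = []
    if invoice.quantity != expected.quantity:
        reasons.append(f"quantity {invoice.quantity} differs from ordered {expected.quantity}")
    if abs(invoice.unit_price - expected.unit_price) > price_tolerance:
        reasons.append(f"unit price {invoice.unit_price} differs from agreed {expected.unit_price}")
    return reasons


# Made-up data: in practice `extracted` would come from the extraction step.
po_lines = {("PO-1001", "SKU-7"): Line("PO-1001", "SKU-7", 40, Decimal("12.50"))}
extracted = [
    Line("PO-1001", "SKU-7", 40, Decimal("12.50")),  # matches the purchase order
    Line("PO-1001", "SKU-7", 45, Decimal("12.50")),  # quantity mismatch
    Line("PO-2002", "SKU-9", 10, Decimal("3.00")),   # unknown purchase order
]

auto_matched, exception_queue = [], []
for line in extracted:
    reasons = reconcile(line, po_lines)
    if reasons:
        exception_queue.append((line, reasons))  # a person reviews these
    else:
        auto_matched.append(line)
```

Nothing in this step calls a model. The model's job ends once the text has been mapped into the schema, and every check after that is repeatable and testable.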

This is why a useful first project may be less exciting than an AI demo. It may be an internal search tool, a document-processing queue, an exception-review interface, or a support assistant. But those are the systems that can save time while keeping the failure modes visible and the review path clear.

Where agents fit

Agents are useful when the next step cannot always be determined in advance. A system may need to gather missing context, retrieve records, choose from a small number of approved tools, or route an exception differently depending on what it finds.

That does not mean an agent should have broad authority.

The safer pattern is usually bounded:

allow-listed tools
least-privilege credentials
short action horizons
explicit stop conditions
approval gates for consequential actions
idempotent operations
an audit trail for tool calls
controls for indirect prompt injection in retrieved or uploaded content
a manual-review path when confidence is low
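Several of the controls above can be sketched in a few lines. The example below covers allow-listed tools, a short action horizon, an explicit stop condition, an approval gate, and an audit trail. It does not cover credentials, idempotency, or prompt-injection defences. The function `run_bounded_agent`, the action format, and the tool names are hypothetical, and `choose_action` is a stand-in for whatever model call would pick the next step.

```python
from typing import Callable

Tool = Callable[[dict], dict]


def run_bounded_agent(choose_action: Callable[[dict], dict],
                      tools: dict[str, Tool],
                      needs_approval: set[str],
                      max_steps: int = 5) -> tuple[str, dict, list[dict]]:
    """Run a short tool loop and return (status, context, audit_trail)."""
    context: dict = {}
    audit: list[dict] = []
    for step in range(max_steps):                 # short action horizon
        action = choose_action(context)           # stand-in for the model's decision
        name, args = action["tool"], action.get("args", {})
        if name == "finish":                      # explicit stop condition
            audit.append({"step": step, "tool": name, "outcome": "finished"})
            return "done", context, audit
        if name not in tools:                     # allow-list check
            audit.append({"step": step, "tool": name, "outcome": "rejected"})
            return "manual_review", context, audit
        if name in needs_approval:                # approval gate: hold, do not execute
            audit.append({"step": step, "tool": name, "args": args, "outcome": "held"})
            return "awaiting_approval", context, audit
        context[name] = tools[name](args)
        audit.append({"step": step, "tool": name, "args": args, "outcome": "executed"})
    # Step budget exhausted without a finish signal: hand off to a person.
    return "manual_review", context, audit


def scripted_policy(context: dict) -> dict:
    """Scripted replacement for a model, so the example runs without one."""
    if "lookup_order" not in context:
        return {"tool": "lookup_order", "args": {"order_id": "A-17"}}
    return {"tool": "issue_refund", "args": {"order_id": "A-17", "amount": 25}}


tools = {
    "lookup_order": lambda args: {"order_id": args["order_id"], "status": "delivered"},
    "issue_refund": lambda args: {"refunded": args["amount"]},
}
status, context, audit = run_bounded_agent(
    scripted_policy, tools, needs_approval={"issue_refund"}
)
```

In this run the lookup executes, but the refund is held for approval rather than performed. The loop, not the model, decides what is allowed to happen.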

A workflow defines the rails. An agent may help navigate one constrained part of the route.

It should not be given the keys to every system because the word “agent” sounds advanced.

A logical decision flow

Not every workflow needs every layer, but the following diagram is a useful reference model for document-heavy and knowledge-heavy systems. It shows the logical decision flow rather than the runtime topology. At production volumes, slower stages are often split across queues, workers, and independent services.

flowchart TD
    A["Documents, messages,<br/>forms, or events"]
    B["Ingestion and access checks<br/>Optional redaction"]
    C["Parse and normalize"]
    D["Deterministic checks<br/>before inference"]
    E["Retrieve context<br/>Apply permissions"]
    F["Task-specific model<br/>or bounded agent loop"]
    G["Validate schema,<br/>sources, and policy"]
    H{"Valid<br/>and low risk?"}
    I{"Automated action<br/>allowed?"}
    J["Downstream update"]
    K["Manual-review queue"]
    L["Human decision"]
    M["Audit logs, traces,<br/>and workflow metrics"]
    N["External clarification<br/>Hold state"]

    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    G --> H

    H -- Yes --> I
    H -- No --> K

    I -- Yes --> J
    I -- No --> K

    K --> L
    K --> M

    L -- Approve or correct --> J
    L -- Need more information --> N

    J --> M
    N --> M

In plain English: safe, validated cases can proceed automatically when policy allows it. Uncertain or higher-risk cases go to a person. The workflow does not need to automate every accepted output. Some workflows can safely write downstream after validation. Others should always require approval because the business risk is higher.

A clarification request exits the current workflow run and preserves its state until new information arrives.

The ordering is illustrative. Some workflows retrieve context earlier, apply additional rules after retrieval, or skip retrieval entirely. If a bounded agent pattern is used inside node F of the diagram, that node becomes an internal loop: the model can call approved tools or gather missing context before passing a structured payload to validation.

Escalation should not depend on the model confidently describing its own answer. It should be driven by validation results, source coverage, deterministic rules, business-risk categories, and thresholds tested against real examples.
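The two decision points in the flow can be written as a small, testable function. This is a sketch under stated assumptions: the `ValidationResult` fields, the risk labels, and the 0.9 coverage threshold are placeholders that a real workflow would define and tune against its own examples. Note that the model's self-reported confidence is not an input.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    DOWNSTREAM_UPDATE = "downstream_update"
    MANUAL_REVIEW = "manual_review"


@dataclass(frozen=True)
class ValidationResult:
    schema_valid: bool       # output parsed and matched the expected schema
    rules_passed: bool       # deterministic business rules all passed
    source_coverage: float   # share of claims backed by a retrieved source, 0..1
    risk_category: str       # hypothetical labels: "low", "medium", "high"


def route(result: ValidationResult, automation_allowed: bool,
          min_coverage: float = 0.9) -> Route:
    """Mirror the two diamonds in the diagram: valid and low risk, then allowed."""
    valid_and_low_risk = (
        result.schema_valid
        and result.rules_passed
        and result.source_coverage >= min_coverage
        and result.risk_category == "low"
    )
    if not valid_and_low_risk:
        return Route.MANUAL_REVIEW
    if not automation_allowed:       # policy says this workflow always needs approval
        return Route.MANUAL_REVIEW
    return Route.DOWNSTREAM_UPDATE


clean = ValidationResult(True, True, 0.95, "low")
thin_sources = ValidationResult(True, True, 0.60, "low")
```

Here `route(clean, automation_allowed=True)` proceeds downstream, while `route(thin_sources, automation_allowed=True)` and `route(clean, automation_allowed=False)` both go to manual review.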

The workflow also needs ordinary distributed-systems controls:

retries
idempotency
deduplication
timeouts
rate-limit handling
provider-outage handling
safe handling of partial failures
downstream-write protection
tracing for model calls and tool use
selective caching, with freshness and access-control rules where appropriate

AI adds another failure-prone component to a software system.[4][5]
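Three of those controls (retries, idempotency, and deduplication) fit in a short sketch. Everything named here is an assumption for illustration: `TransientError` marks failures worth retrying, `DownstreamWriter` is an in-memory stand-in for a real downstream system, and the backoff values are arbitrary.

```python
import hashlib
import json
import time
from typing import Callable


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def idempotency_key(workflow_id: str, payload: dict) -> str:
    """Derive a stable key so the same write is recognised if it is sent twice."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{workflow_id}:{canonical}".encode("utf-8")).hexdigest()


def call_with_retries(operation: Callable[[], dict], attempts: int = 3,
                      base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> dict:
    """Retry transient failures with exponential backoff, then give up."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise                             # out of attempts: surface the failure
            sleep(base_delay * 2 ** attempt)      # the delay doubles after each failure
    raise ValueError("attempts must be at least 1")


class DownstreamWriter:
    """In-memory stand-in for a downstream system that deduplicates by key."""

    def __init__(self) -> None:
        self.applied: dict[str, dict] = {}

    def write(self, key: str, payload: dict) -> dict:
        if key not in self.applied:               # a repeated key is a no-op
            self.applied[key] = payload
        return self.applied[key]


writer = DownstreamWriter()
payload = {"invoice": "INV-9", "status": "approved"}
key = idempotency_key("run-42", payload)
failures = iter([True, True, False])              # fail twice, then succeed


def flaky_write() -> dict:
    if next(failures):
        raise TransientError("simulated timeout")
    return writer.write(key, payload)


delays: list[float] = []
result = call_with_retries(flaky_write, sleep=delays.append)  # record delays, no real sleep
writer.write(key, payload)                        # a duplicate delivery changes nothing
```

The retry wrapper is only safe because the write is idempotent. Retrying a non-idempotent write is how a timeout turns into a duplicate payment.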

Architecture choices also carry operational consequences. The right option among SaaS, an enterprise API, a hybrid design, or a private deployment depends on the organization’s constraints:

Data governance: data classification, redaction requirements, and vendor retention or training terms
Performance and scale: latency, peak request volume, and whether the workload is interactive or batch-oriented
Financials and strategy: ongoing token costs, migration costs, and vendor lock-in
Operational ownership: who will monitor for model drift, handle vendor API deprecations, and maintain, evaluate, and support the workflow after launch

Evaluation is part of the architecture

A workflow is not production-ready because ten examples looked good in a demo. It needs a representative evaluation set.

For document extraction, measure:

field accuracy
false positives
false negatives
exception rate
correction rate
time per item

If the exception rate is too high, the context-switching and cognitive load of manual review can erase the time saved by automation.
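As a rough sketch, the extraction metrics and the exception-rate trade-off can both be computed from a labelled evaluation set. The definitions below are one reasonable choice rather than a standard: a false positive is a value extracted where the reference has none, a false negative is a missed value, and a wrong value counts against accuracy only. The time model is deliberately simple and all numbers are made up.

```python
def score_extractions(predictions: list[dict], gold: list[dict],
                      fields: list[str]) -> dict:
    """Compare predicted fields with reference labels, document by document."""
    correct = false_pos = false_neg = total = 0
    for pred, truth in zip(predictions, gold):
        for field in fields:
            p, t = pred.get(field), truth.get(field)
            total += 1
            if p == t:
                correct += 1
            elif t is None:
                false_pos += 1   # a value was extracted where none exists
            elif p is None:
                false_neg += 1   # a real value was missed
            # otherwise: a wrong value, which lowers accuracy only
    if total == 0:
        raise ValueError("nothing to score")
    return {
        "field_accuracy": correct / total,
        "false_positives": false_pos,
        "false_negatives": false_neg,
    }


def minutes_saved_per_item(manual_minutes: float, automated_minutes: float,
                           review_minutes: float, exception_rate: float) -> float:
    """Baseline time minus automated time plus the expected cost of review."""
    return manual_minutes - (automated_minutes + exception_rate * review_minutes)


fields = ["invoice_no", "total", "due_date"]
gold = [
    {"invoice_no": "INV-1", "total": "100.00", "due_date": None},
    {"invoice_no": "INV-2", "total": "250.00", "due_date": "2025-03-01"},
]
predictions = [
    {"invoice_no": "INV-1", "total": "100.00", "due_date": "2025-01-01"},
    {"invoice_no": "INV-2", "total": "205.00", "due_date": None},
]
scores = score_extractions(predictions, gold, fields)

saved_low = minutes_saved_per_item(6.0, 0.5, 8.0, exception_rate=0.25)
saved_high = minutes_saved_per_item(6.0, 0.5, 8.0, exception_rate=0.75)
```

With these made-up numbers, a 25% exception rate saves 3.5 minutes per item, while a 75% exception rate costs half a minute per item compared with doing the work manually.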

For retrieval-based assistants, measure:

retrieval relevance
citation accuracy
groundedness
abstention quality
reviewer overrides
time saved

Track versions of the model, prompt, parser, retrieval configuration, source corpus, and policy rules. Approved human corrections should feed back into the evaluation set so regressions can be caught before later changes reach production.

At scale, those checks should run as an automated regression suite integrated into the deployment pipeline, with targeted human audits for cases where automated scoring is not enough.
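One way to wire this into a pipeline is to fingerprint the versioned components and compare each candidate run against a stored baseline. This is a minimal sketch: `WorkflowVersion`, `find_regressions`, the metric names, and the 0.01 tolerance are illustrative assumptions, and it treats higher as better for every metric.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class WorkflowVersion:
    """The components whose versions should be tracked together."""
    model: str
    prompt: str
    parser: str
    retrieval_config: str
    source_corpus: str
    policy_rules: str

    def fingerprint(self) -> str:
        """Short, stable identifier for this exact combination of versions."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def find_regressions(baseline: dict[str, float], candidate: dict[str, float],
                     max_drop: float = 0.01) -> list[str]:
    """List metrics that dropped by more than max_drop (higher is assumed better)."""
    regressions = []
    for name, base in baseline.items():
        new = candidate.get(name)
        if new is None:
            regressions.append(f"{name}: missing from candidate run")
        elif base - new > max_drop:
            regressions.append(f"{name}: {base:.3f} -> {new:.3f}")
    return regressions


v1 = WorkflowVersion("model-a", "prompt-v3", "parser-1.2", "top5", "corpus-2025-01", "rules-v7")
v2 = WorkflowVersion("model-a", "prompt-v4", "parser-1.2", "top5", "corpus-2025-01", "rules-v7")

baseline = {"field_accuracy": 0.96, "citation_accuracy": 0.91}
candidate = {"field_accuracy": 0.97, "citation_accuracy": 0.85}
regressions = find_regressions(baseline, candidate)
deploy_allowed = not regressions
```

In this example the prompt change improves field accuracy but drops citation accuracy, so the gate blocks the deployment. Storing each result set under its fingerprint makes it possible to say which change caused the drop.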

A small change can improve one part of the workflow while quietly breaking another. Evaluation is not a final QA task. It is part of the production system.

What a useful first engagement looks like

A useful first engagement is rarely an “AI transformation” project.

It is a workflow audit.

The goal is to:

identify one bottleneck
measure the current process
separate deterministic steps from ambiguous ones
design the smallest useful intervention
test it against real examples
add review and audit controls
compare the result against the baseline
decide whether production integration is justified

A useful rule of thumb for the third item, separating deterministic steps from ambiguous ones: if a value can be validated with a fixed rule, a schema, or a database lookup, keep that validation outside the model prompt. Use the model for the language-heavy or ambiguous part.
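In code, that rule of thumb might look like the sketch below: the model returns JSON, and ordinary code checks it with a fixed pattern, a lookup, and a date parse. The field names, the `INV-` pattern, and `KNOWN_SUPPLIER_IDS` (a stand-in for a database lookup) are hypothetical.

```python
import json
import re
from datetime import date

KNOWN_SUPPLIER_IDS = {"SUP-001", "SUP-002"}   # stand-in for a database lookup
INVOICE_NO = re.compile(r"INV-\d{4,}")        # hypothetical fixed format rule


def validate_model_output(raw: str) -> list[str]:
    """Check model output with fixed rules; an empty list means it passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    errors = []
    if not INVOICE_NO.fullmatch(str(data.get("invoice_no", ""))):
        errors.append("invoice_no does not match INV-<digits>")
    supplier = data.get("supplier_id")
    if not isinstance(supplier, str) or supplier not in KNOWN_SUPPLIER_IDS:
        errors.append("supplier_id not found")
    try:
        date.fromisoformat(str(data.get("due_date", "")))
    except ValueError:
        errors.append("due_date is not an ISO date")
    return errors


good = '{"invoice_no": "INV-20417", "supplier_id": "SUP-001", "due_date": "2025-03-01"}'
bad = '{"invoice_no": "20417", "supplier_id": "SUP-999", "due_date": "March 1"}'
```

`validate_model_output(good)` returns an empty list, while `validate_model_output(bad)` returns three errors. None of these checks depends on how the prompt was worded, so they keep working when the model or the prompt changes.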

Sometimes the answer is a retrieval-based assistant. Sometimes it is document extraction with an exception queue. Sometimes the correct answer is still a parser, a rules engine, and a normal API integration.

The practical skill is separating the steps that need language understanding from the ones that should remain predictable, testable, and boring.

If your team is trying to identify where AI could remove a real bottleneck without creating a larger operational problem, that workflow audit is the right place to start.

Source notes

[1] Erik Brynjolfsson, Danielle Li, and Lindsey R. Raymond, “Generative AI at Work”, The Quarterly Journal of Economics, 2025.
[2] OECD, Generative AI and the SME Workforce, 2025.
[3] Lu Fang, Zhe Yuan, Kaifu Zhang, Dante Donati, and Miklos Sarvary, “Generative AI and Firm Productivity: Field Experiments in Online Retail”, 2025. Working paper.
[4] National Institute of Standards and Technology, Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, 2024.
[5] OWASP, Top 10 for Large Language Model Applications.