Most businesses do not need an AI agent. They need a better workflow.
A practical framework for identifying which operational problems benefit from AI — and which are better served by a rules engine, integration, or better process.
A lot of AI conversations begin with the model: should we use the most capable API, add a chatbot, build an agent, or run an open-weight model ourselves?
Those questions matter eventually. But there is a tendency, especially in business conversations, to start with the most impressive AI capability available and try to make it fit the organization afterward. The assumption seems to be that once people have access to the tools, the right answers and opportunities will reveal themselves.
That is a recipe for wasted time, unnecessary expense, and disappointing results.
I prefer to start with two simpler questions:
- Which workflow is currently expensive, repetitive, slow, or difficult to scale?
- What is the least complicated system that can improve it reliably?
Sometimes the answer is an LLM. Other times, the right approach is more conventional: a retrieval layer, document parsing, a rules engine, an API integration, or a better internal process. Often, the useful answer is a combination of those things.
That distinction matters because AI demos are easy to build. They only need to work once.
A real business workflow has to handle messy inputs, permissions, retries, exceptions, review steps, downstream systems, audit logs, and the fact that the model will occasionally be wrong. It also needs a way to measure whether the AI is helping at all.
That is the context worth keeping in mind before deciding where AI belongs in a workflow.
The opportunity is operational leverage
There is a practical reason smaller organizations should pay attention to AI: for some categories of work, it can reduce the labour and time needed to reach a useful result.
A workplace study involving 5,172 customer-support agents at a Fortune 500 software company found that AI assistance increased the number of issues resolved per hour by roughly 15%, with some of the largest benefits going to less-experienced workers.[1] The result came from one organization, so it should not be treated as a universal benchmark. But it is useful evidence that AI assistance can improve a bounded, measurable workflow under real operating conditions.
The opportunity is not limited to customer support. An OECD survey of more than 5,000 small and medium-sized businesses across seven countries found that, among SMEs already using generative AI, 65% reported improved employee performance, 35% said it helped them scale, and 29% said it helped them compete with larger companies.[2] Those are self-reported outcomes rather than measured causal effects, but they point to the kinds of advantages smaller organizations are actively looking for.
A separate field experiment on a cross-border online-retail platform offers a more concrete example. Generative-AI enhancements were introduced across seven consumer-facing workflows, with sales effects ranging from no detectable improvement to an increase of 16.3%. Smaller and newer sellers saw disproportionately larger gains.[3] That study is still a working paper, so its findings should be treated as promising evidence rather than a settled rule.
The point is not that AI automatically gives a small business the same advantages as a larger company. The evidence suggests something narrower:
AI can give small teams more operational leverage.
A smaller organization may be able to:
- answer more customer requests without growing the support team at the same rate
- process more documents before adding administrative headcount
- search internal knowledge without relying on one person who remembers where everything lives
- prepare first-pass summaries, comparisons, and classifications faster
- respond to routine exceptions without starting every answer from scratch
- explore new products, suppliers, or markets with less manual research
The gains are not automatic. The same technology can create value in one workflow and add friction in another. Task selection, implementation quality, and measurement still matter.
The useful question is not:
Does AI improve productivity?
It is:
Does this AI-assisted workflow improve productivity for this team, on this kind of work, under these operating conditions?
Use the right tool for each part of the problem
The easiest mistake to make is treating every operational problem as an LLM problem. If a process is deterministic, a deterministic system is usually better.
| Problem | Likely starting point |
|---|---|
| Stable validation rules | Rules engine |
| Copying data between systems | API integration or RPA for legacy systems |
| Extracting fields from files | OCR and document AI |
| Comparing structured records | Deterministic reconciliation |
| Searching internal policies | Retrieval and reranking |
| Summarizing grounded documents | LLM with source links |
| Handling ambiguous exceptions | LLM with human review |
| Predicting demand or inventory | Classical ML and optimization |
| High-stakes judgment | Human decision-making with AI support |
The most expensive AI mistake is not choosing the wrong model. It is using a model where a rules engine, retrieval system, or ordinary integration would have been more reliable.
That does not just add complexity. It also creates unnecessary operating costs. Whether the model is billed per token or running on your own infrastructure, you are paying for probabilistic behaviour in a part of the workflow that could have been predictable by design.
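As a minimal sketch of that idea, the following Python routes a request with plain rules first and only falls back to a model for whatever the rules cannot place. The patterns, labels, and the `fake_model` stand-in are all hypothetical; a real deployment would substitute its own rules and its own model call.

```python
import re
from typing import Callable, Optional, Tuple

# Hypothetical routing rules: stable patterns that never need a model.
RULES = [
    (re.compile(r"\binvoice\b", re.IGNORECASE), "billing"),
    (re.compile(r"\b(password|login)\b", re.IGNORECASE), "account_access"),
]


def classify_by_rules(text: str) -> Optional[str]:
    """Return a label when a deterministic rule matches, else None."""
    for pattern, label in RULES:
        if pattern.search(text):
            return label
    return None


def classify(text: str, model_fallback: Callable[[str], str]) -> Tuple[str, str]:
    """Try the rules first; call the model only for the remainder.

    Returns (label, path) so the share of traffic that actually
    reaches the model can be measured.
    """
    label = classify_by_rules(text)
    if label is not None:
        return label, "rules"
    return model_fallback(text), "model"


def fake_model(text: str) -> str:
    """Stand-in for a real model call (an assumption for this sketch)."""
    return "needs_review"


print(classify("Question about invoice 4411", fake_model))
print(classify("The shipment arrived damaged", fake_model))
```

Recording which path handled each request makes the operating cost visible: every item answered by the rules is one that did not pay for probabilistic behaviour.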
What makes a good first AI workflow?
The best early AI candidates tend to be:
- repetitive enough that improvement matters
- language-heavy or document-heavy
- grounded in accessible source material
- easy to measure
- reviewable by a person
- still useful when only partially automated
The weakest early candidates tend to involve:
- autonomous judgment under ambiguity
- a requirement for perfect accuracy
- unreliable source data
- too many fragile integrations
- expensive verification
- high-liability decisions without human review
Consider, for example, an operations team processing supplier invoices and packing lists manually. The first useful intervention may not be an agent. It may be document extraction, deterministic reconciliation against purchase orders, and an exception queue for the cases that still need a person. The model handles the ambiguous text and maps the result into a strict internal schema. Ordinary rules handle the stable checks.
This is why a useful first project may be less exciting than an AI demo. It may be an internal search tool, a document-processing queue, an exception-review interface, or a support assistant. But those are the systems that can save time while keeping the failure modes visible and the review path clear.
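The invoice example above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: the `Line` record, the one-cent price tolerance, and the queue names are all hypothetical, and the extraction step that produces the invoice lines is assumed to have already run.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List, Tuple


@dataclass(frozen=True)
class Line:
    """One line item, used for both invoices and purchase orders."""
    sku: str
    quantity: int
    unit_price: Decimal


def reconcile(invoice: List[Line], purchase_order: List[Line],
              price_tolerance: Decimal = Decimal("0.01")) -> List[str]:
    """Compare extracted invoice lines with the purchase order.

    Returns a list of discrepancies; an empty list is a clean match.
    """
    issues = []
    ordered_by_sku = {line.sku: line for line in purchase_order}
    for line in invoice:
        ordered = ordered_by_sku.get(line.sku)
        if ordered is None:
            issues.append(f"{line.sku}: not on purchase order")
            continue
        if line.quantity != ordered.quantity:
            issues.append(f"{line.sku}: quantity {line.quantity} "
                          f"!= ordered {ordered.quantity}")
        if abs(line.unit_price - ordered.unit_price) > price_tolerance:
            issues.append(f"{line.sku}: unit price {line.unit_price} "
                          f"!= ordered {ordered.unit_price}")
    return issues


def route(invoice: List[Line], purchase_order: List[Line]) -> Tuple[str, List[str]]:
    """Clean matches proceed; anything else goes to a person."""
    issues = reconcile(invoice, purchase_order)
    return ("exception_queue", issues) if issues else ("auto_match", [])


po = [Line("A-1", 10, Decimal("2.50")), Line("B-2", 5, Decimal("9.99"))]
print(route([Line("A-1", 10, Decimal("2.50"))], po))
print(route([Line("A-1", 12, Decimal("2.50"))], po))
```

Nothing in the reconciliation step calls a model. The model's job ends once the document text has been mapped into `Line` records.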
Where agents fit
Agents are useful when the next step cannot always be determined in advance. A system may need to gather missing context, retrieve records, choose from a small number of approved tools, or route an exception differently depending on what it finds.
That does not mean an agent should have broad authority.
The safer pattern is usually bounded:
- allow-listed tools
- least-privilege credentials
- short action horizons
- explicit stop conditions
- approval gates for consequential actions
- idempotent operations
- an audit trail for tool calls
- controls for indirect prompt injection in retrieved or uploaded content
- a manual-review path when confidence is low
A workflow defines the rails. An agent may help navigate one constrained part of the route.
It should not be given the keys to every system because the word “agent” sounds advanced.
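A rough sketch of the bounded pattern, assuming a hypothetical pair of tools and a `choose_action` callable that stands in for the model. It covers only four of the controls listed above (allow-listed tools, a step limit as the stop condition, an approval gate, and an audit trail); credentials, idempotency, and prompt-injection controls are out of scope here.

```python
from typing import Callable, Dict, List, Optional, Tuple

Action = Tuple[str, str]  # (tool name, single string argument)


def lookup_order(order_id: str) -> dict:
    """Hypothetical read-only tool."""
    return {"order_id": order_id, "status": "shipped"}


def issue_refund(order_id: str) -> dict:
    """Hypothetical consequential tool; gated behind approval below."""
    return {"order_id": order_id, "refunded": True}


ALLOWED_TOOLS: Dict[str, Callable[[str], dict]] = {
    "lookup_order": lookup_order,
    "issue_refund": issue_refund,
}
NEEDS_APPROVAL = {"issue_refund"}
MAX_STEPS = 3  # short action horizon; the value is an assumption


def run_bounded(choose_action: Callable[[List[dict]], Optional[Action]],
                approve: Callable[[str, str], bool]) -> Tuple[str, List[dict]]:
    """Run at most MAX_STEPS tool calls and record every decision."""
    audit: List[dict] = []
    for step in range(MAX_STEPS):
        action = choose_action(audit)
        if action is None:  # the model signals it is finished
            return "done", audit
        tool, arg = action
        entry = {"step": step, "tool": tool, "arg": arg}
        if tool not in ALLOWED_TOOLS:
            audit.append({**entry, "outcome": "blocked"})
            return "manual_review", audit
        if tool in NEEDS_APPROVAL and not approve(tool, arg):
            audit.append({**entry, "outcome": "approval_denied"})
            return "manual_review", audit
        audit.append({**entry, "outcome": "ok",
                      "result": ALLOWED_TOOLS[tool](arg)})
    return "step_limit_reached", audit


def scripted(actions: List[Action]) -> Callable[[List[dict]], Optional[Action]]:
    """Stand-in for a model: replays a fixed list of actions, then stops."""
    def choose(audit: List[dict]) -> Optional[Action]:
        return actions[len(audit)] if len(audit) < len(actions) else None
    return choose


status, trail = run_bounded(
    scripted([("lookup_order", "42"), ("issue_refund", "42")]),
    approve=lambda tool, arg: False,
)
print(status, [entry["outcome"] for entry in trail])
```

The loop never trusts the model's choice of tool. Anything outside the allow-list, or any consequential call without approval, ends the run and hands the case to a person with the audit trail attached.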
A logical decision flow
Not every workflow needs every layer, but the following diagram is a useful reference model for document-heavy and knowledge-heavy systems. It shows the logical decision flow rather than the runtime topology. At production volumes, slower stages are often split across queues, workers, and independent services.
```mermaid
flowchart TD
A["Documents, messages,<br/>forms, or events"]
B["Ingestion and access checks<br/>Optional redaction"]
C["Parse and normalize"]
D["Deterministic checks<br/>before inference"]
E["Retrieve context<br/>Apply permissions"]
F["Task-specific model<br/>or bounded agent loop"]
G["Validate schema,<br/>sources, and policy"]
H{"Valid<br/>and low risk?"}
I{"Automated action<br/>allowed?"}
J["Downstream update"]
K["Manual-review queue"]
L["Human decision"]
M["Audit logs, traces,<br/>and workflow metrics"]
N["External clarification<br/>Hold state"]
A --> B
B --> C
C --> D
D --> E
E --> F
F --> G
G --> H
H -- Yes --> I
H -- No --> K
I -- Yes --> J
I -- No --> K
K --> L
K --> M
L -- Approve or correct --> J
L -- Need more information --> N
J --> M
N --> M
```
In plain English: safe, validated cases can proceed automatically when policy allows it. Uncertain or higher-risk cases go to a person. The workflow does not need to automate every accepted output. Some workflows can safely write downstream after validation. Others should always require approval because the business risk is higher.
A clarification request exits the current workflow run and preserves its state until new information arrives.
The ordering is illustrative. Some workflows retrieve context earlier, apply additional rules after retrieval, or skip retrieval entirely. If a bounded agent pattern is used at node F of the diagram (the task-specific model or bounded agent loop), that step becomes an internal loop: the model can call approved tools or gather missing context before passing a structured payload to validation.
Escalation should not depend on the model confidently describing its own answer. It should be driven by validation results, source coverage, deterministic rules, business-risk categories, and thresholds tested against real examples.
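One way to express that in code is a routing function whose inputs are all produced outside the model. The field names, the minimum source count, and the risk categories below are assumptions made for illustration; the thresholds in particular would need to be tested against real examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    """Inputs to the escalation decision.

    There is deliberately no field for the model's self-reported
    confidence.
    """
    schema_valid: bool    # output parsed into the expected structure
    cited_sources: int    # retrieved sources that support the answer
    rule_failures: int    # deterministic checks that failed
    risk_category: str    # assigned by business policy, not by the model


MIN_SOURCES = 1                 # hypothetical threshold
AUTO_ALLOWED_RISK = {"low"}     # hypothetical policy


def decide(signals: Signals) -> str:
    """Return 'automated_action' or 'manual_review'."""
    if not signals.schema_valid or signals.rule_failures > 0:
        return "manual_review"
    if signals.cited_sources < MIN_SOURCES:
        return "manual_review"
    if signals.risk_category not in AUTO_ALLOWED_RISK:
        return "manual_review"
    return "automated_action"


print(decide(Signals(True, 2, 0, "low")))
print(decide(Signals(True, 2, 0, "high")))
```

Because every input is observable, a reviewer can see exactly why a case was escalated, and the thresholds can be tuned against the evaluation set.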
The workflow also needs ordinary distributed-systems controls:
- retries
- idempotency
- deduplication
- timeouts
- rate-limit handling
- provider-outage handling
- safe handling of partial failures
- downstream-write protection
- tracing for model calls and tool use
- selective caching, with freshness and access-control rules where appropriate
AI adds another failure-prone component to a software system.[4][5]
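Two of those controls, retries and idempotency, can be sketched together. This is a simplified illustration: `TransientError`, the in-memory result store, and the backoff defaults are assumptions, and a production system would need a durable store and a downstream service that also honours the idempotency key.

```python
import time
from typing import Callable, Dict


class TransientError(Exception):
    """Hypothetical error type for timeouts, rate limits, and outages."""


def with_retries(operation: Callable[[], dict], attempts: int = 3,
                 base_delay: float = 0.0) -> dict:
    """Retry transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            time.sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")


class IdempotentWriter:
    """Ensure each idempotency key results in at most one completed write."""

    def __init__(self, write: Callable[[dict], dict]) -> None:
        self._write = write
        self._results: Dict[str, dict] = {}  # in-memory for the sketch only

    def submit(self, key: str, payload: dict) -> dict:
        if key in self._results:
            # Duplicate delivery: return the earlier result, do not rewrite.
            return self._results[key]
        result = with_retries(lambda: self._write(payload))
        self._results[key] = result
        return result


calls = {"count": 0}


def flaky_write(payload: dict) -> dict:
    """Demo downstream call that fails once, then succeeds."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise TransientError("simulated timeout")
    return {"written": payload}


writer = IdempotentWriter(flaky_write)
print(writer.submit("invoice-1", {"total": 10}))
print(writer.submit("invoice-1", {"total": 10}), calls["count"])
```

The second `submit` call returns the stored result without touching the downstream system, which is the behaviour a retried queue message or a double-clicked approval needs.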
Architecture choices also carry operational consequences. The right option among SaaS, an enterprise API, a hybrid design, or a private deployment depends on the organization’s constraints:
- Data governance: data classification, redaction requirements, and vendor retention or training terms
- Performance and scale: latency, peak request volume, and whether the workload is interactive or batch-oriented
- Financials and strategy: ongoing token costs, migration costs, and vendor lock-in
- Operational ownership: who will monitor for model drift, handle vendor API deprecations, and maintain, evaluate, and support the workflow after launch
Evaluation is part of the architecture
A workflow is not production-ready because ten examples looked good in a demo. It needs a representative evaluation set.
For document extraction, measure:
- field accuracy
- false positives
- false negatives
- exception rate
- correction rate
- time per item
If the exception rate is too high, the context-switching and cognitive load of manual review can erase the time saved by automation.
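A small sketch of how some of these numbers might be computed from a labelled evaluation set. The record layout and the metric definitions are assumptions (teams define false positives and negatives differently), and the minutes used in the final calculation are illustrative rather than measured.

```python
from typing import Dict, List


def extraction_metrics(items: List[dict]) -> Dict[str, float]:
    """Score extracted fields against reviewed ground truth.

    Each item holds 'predicted' and 'expected' dicts (field -> value or
    None when the field is absent) and a 'sent_to_review' flag.
    """
    if not items:
        raise ValueError("evaluation set is empty")
    total = correct = false_pos = false_neg = exceptions = 0
    for item in items:
        exceptions += 1 if item["sent_to_review"] else 0
        for name, truth in item["expected"].items():
            guess = item["predicted"].get(name)
            total += 1
            if guess == truth:
                correct += 1
            elif truth is None:
                false_pos += 1  # a value was produced for an absent field
            elif guess is None:
                false_neg += 1  # a field that was present was missed
    if total == 0:
        raise ValueError("evaluation set has no expected fields")
    return {
        "field_accuracy": correct / total,
        "false_positives": false_pos,
        "false_negatives": false_neg,
        "exception_rate": exceptions / len(items),
    }


def net_minutes_saved_per_item(manual_minutes: float, exception_rate: float,
                               review_minutes: float) -> float:
    """Human minutes saved per item, assuming clean items need no human time."""
    return manual_minutes - exception_rate * review_minutes


evaluation_set = [
    {"predicted": {"invoice_no": "INV-1", "po_no": "PO-9", "vat_id": None},
     "expected": {"invoice_no": "INV-1", "po_no": "PO-9", "vat_id": None},
     "sent_to_review": False},
    {"predicted": {"invoice_no": "INV-2", "po_no": None, "vat_id": "X1"},
     "expected": {"invoice_no": "INV-2", "po_no": "PO-7", "vat_id": None},
     "sent_to_review": True},
]
metrics = extraction_metrics(evaluation_set)
print(metrics)
print(net_minutes_saved_per_item(4.0, metrics["exception_rate"], 10.0))
```

With these illustrative figures, half the items go to review and each review costs more than handling the item manually would have, so the net saving is negative. That is the situation the exception rate is meant to expose.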
For retrieval-based assistants, measure:
- retrieval relevance
- citation accuracy
- groundedness
- abstention quality
- reviewer overrides
- time saved
Track versions of the model, prompt, parser, retrieval configuration, source corpus, and policy rules. Approved human corrections should feed back into the evaluation set so regressions can be caught before later changes reach production.
At scale, those checks should run as an automated regression suite integrated into the deployment pipeline, with targeted human audits for cases where automated scoring is not enough.
A small change can improve one part of the workflow while quietly breaking another. Evaluation is not a final QA task. It is part of the production system.
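The version tracking and regression check described above might look something like the following. The `ConfigVersion` fields mirror the components listed earlier; the version strings, metric names, and tolerance are placeholders, and every metric is assumed to be one where higher is better.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ConfigVersion:
    """Everything that can change the workflow's behaviour."""
    model: str
    prompt: str
    parser: str
    retrieval: str
    corpus: str
    policy_rules: str


def regressions(baseline: Dict[str, float], candidate: Dict[str, float],
                tolerance: float = 0.01) -> List[str]:
    """Names of higher-is-better metrics that dropped by more than tolerance.

    A metric missing from the candidate run counts as a regression.
    """
    return sorted(
        name for name, old in baseline.items()
        if candidate.get(name, 0.0) < old - tolerance
    )


current = ConfigVersion("model-a", "prompt-v3", "parser-1.2",
                        "retrieval-v5", "corpus-2025-01", "rules-v7")
proposed = ConfigVersion("model-a", "prompt-v4", "parser-1.2",
                         "retrieval-v5", "corpus-2025-01", "rules-v7")

baseline_scores = {"field_accuracy": 0.95, "citation_accuracy": 0.90}
candidate_scores = {"field_accuracy": 0.96, "citation_accuracy": 0.85}

failed = regressions(baseline_scores, candidate_scores)
print(proposed.prompt, "blocked by:" if failed else "passes", failed)
```

In this made-up run the prompt change improves extraction while degrading citation accuracy, which is the kind of quiet trade-off a deployment gate should catch.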
What a useful first engagement looks like
A useful first engagement is rarely an “AI transformation” project.
It is a workflow audit.
The goal is to:
- identify one bottleneck
- measure the current process
- separate deterministic steps from ambiguous ones
- design the smallest useful intervention
- test it against real examples
- add review and audit controls
- compare the result against the baseline
- decide whether production integration is justified
A useful rule of thumb for separating deterministic steps from ambiguous ones: if a value can be validated with a fixed rule, a schema, or a database lookup, keep that validation outside the model prompt. Use the model for the language-heavy or ambiguous part.
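As a brief illustration of that rule of thumb, the checks below run on a model-extracted payload after the model has finished, using a fixed pattern, a date parse, and a lookup. The invoice-number format, the field names, and the `KNOWN_SUPPLIERS` set (a stand-in for a database query) are all hypothetical.

```python
import re
from datetime import date
from typing import List

# Stand-in for a database lookup (an assumption for this sketch).
KNOWN_SUPPLIERS = {"SUP-001", "SUP-002"}


def validate_extraction(payload: dict) -> List[str]:
    """Deterministic checks on a model-extracted payload.

    Returns a list of error messages; an empty list means it passed.
    """
    errors = []
    # Fixed rule: hypothetical invoice-number format.
    if not re.fullmatch(r"INV-\d{4,}", str(payload.get("invoice_no", ""))):
        errors.append("invoice_no: unexpected format")
    # Schema-style check: the date must parse as a real ISO date.
    try:
        date.fromisoformat(str(payload.get("invoice_date", "")))
    except ValueError:
        errors.append("invoice_date: not an ISO date")
    # Lookup: the supplier must already exist in the system of record.
    if payload.get("supplier_id") not in KNOWN_SUPPLIERS:
        errors.append("supplier_id: unknown supplier")
    return errors


print(validate_extraction({"invoice_no": "INV-12345",
                           "invoice_date": "2025-03-01",
                           "supplier_id": "SUP-001"}))
print(validate_extraction({"invoice_no": "12345",
                           "invoice_date": "March 1",
                           "supplier_id": "SUP-999"}))
```

None of these checks benefits from being phrased as a prompt instruction. They are cheaper, testable, and behave the same way on every run.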
Sometimes the answer is a retrieval-based assistant. Sometimes it is document extraction with an exception queue. Sometimes the correct answer is still a parser, a rules engine, and a normal API integration.
The practical skill is separating the steps that need language understanding from the ones that should remain predictable, testable, and boring.
If your team is trying to identify where AI could remove a real bottleneck without creating a larger operational problem, that workflow audit is the right place to start.
Source notes
1. Erik Brynjolfsson, Danielle Li, and Lindsey R. Raymond, “Generative AI at Work”, The Quarterly Journal of Economics, 2025.
2. OECD, Generative AI and the SME Workforce, 2025.
3. Lu Fang, Zhe Yuan, Kaifu Zhang, Dante Donati, and Miklos Sarvary, “Generative AI and Firm Productivity: Field Experiments in Online Retail”, 2025. Working paper.
4. National Institute of Standards and Technology, Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, 2024.
5. OWASP, Top 10 for Large Language Model Applications.