DocFlow: Multi-Format Invoice & Document Processing Agent

API, FinTech
Python, Node.js, Claude API, PostgreSQL, AWS S3, AWS Lambda

  • 92% Manual Entry Reduced: for routine invoice processing
  • 98.4% Extraction Accuracy: field-level accuracy on validation set
  • <8s Processing Time: per document, average
  • 100% Discrepancy Detection: of test-set total/line-item mismatches caught

Illustrative Project

DocFlow is an illustrative example demonstrating an AI document automation pattern, not a completed client engagement.

Overview

DocFlow processes incoming invoices and contracts that arrive in wildly inconsistent formats — different vendors, different layouts, scanned PDFs and native digital documents alike — and extracts structured, validated data without requiring a human to read and key in every document manually.

The Challenge

Traditional OCR-and-template automation breaks the moment a new vendor uses an unfamiliar layout, requiring constant rule maintenance. The goal was a system that understands document content regardless of its visual layout, while still being rigorous enough to catch genuine errors — a mismatched total, a missing line item — rather than confidently extracting wrong numbers.

Architecture & Technical Decisions

Layout-Agnostic Extraction

Rather than relying on positional rules, documents are converted to a format the model can read directly (including vision-capable processing for scanned documents), and the model is prompted with a structured extraction schema plus few-shot examples covering several real layout variations — including documents with multi-page line items and unusual currency formatting.
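
As a rough illustration of the schema-plus-few-shot approach described above, the following sketch assembles a prompt and parses the model's JSON reply into typed objects. The field names, `INVOICE_SCHEMA`, `build_extraction_prompt`, and `parse_extraction` are all hypothetical names chosen for this example, not DocFlow's actual schema or code, and the model call itself is deliberately left out.

```python
import json
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical schema: these field names are assumptions for illustration.
INVOICE_SCHEMA = {
    "type": "object",
    "required": ["vendor_name", "invoice_number", "currency", "total", "line_items"],
    "properties": {
        "vendor_name": {"type": "string"},
        "invoice_number": {"type": "string"},
        "currency": {"type": "string", "description": "ISO 4217 code, e.g. USD"},
        "total": {"type": "string", "description": "Decimal as a string, e.g. 1234.50"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["description", "amount"],
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "string"},
                },
            },
        },
    },
}


def build_extraction_prompt(document_text: str, examples: list[tuple[str, dict]]) -> str:
    """Combine the schema, few-shot examples, and the target document into one prompt."""
    parts = [
        "Extract the invoice fields as JSON matching this schema:",
        json.dumps(INVOICE_SCHEMA, indent=2),
    ]
    for i, (example_text, example_output) in enumerate(examples, start=1):
        parts.append(f"Example {i} document:\n{example_text}")
        parts.append(f"Example {i} output:\n{json.dumps(example_output)}")
    parts.append(f"Document to extract:\n{document_text}")
    parts.append("Respond with JSON only.")
    return "\n\n".join(parts)


@dataclass
class LineItem:
    description: str
    amount: Decimal


@dataclass
class Invoice:
    vendor_name: str
    invoice_number: str
    currency: str
    total: Decimal
    line_items: list[LineItem]


def parse_extraction(raw_json: str) -> Invoice:
    """Parse the model's JSON reply, rejecting replies that omit required fields."""
    data = json.loads(raw_json)
    missing = [key for key in INVOICE_SCHEMA["required"] if key not in data]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return Invoice(
        vendor_name=data["vendor_name"],
        invoice_number=data["invoice_number"],
        currency=data["currency"],
        # str() keeps Decimal parsing exact even if the model returns a bare number.
        total=Decimal(str(data["total"])),
        line_items=[
            LineItem(item["description"], Decimal(str(item["amount"])))
            for item in data["line_items"]
        ],
    )
```

Amounts are carried as strings and parsed into `Decimal` so that later checks on totals are not affected by floating-point rounding.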

Validation Layer Outside the Model

Extracted data isn't trusted blindly. A deterministic validation step in code checks that line items sum to the stated total within a rounding tolerance, that required fields are present, and that values fall within expected ranges (a $50,000 line item from a vendor whose invoices are typically small is flagged for review rather than auto-approved).
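
A minimal sketch of what such a validation step might look like follows. The tolerance of 0.01, the list of required fields, and the per-vendor `max_expected_line_amount` parameter are assumptions made for this example; the source does not state the actual values or how per-vendor ranges are derived.

```python
from decimal import Decimal

ROUNDING_TOLERANCE = Decimal("0.01")  # assumed tolerance, not taken from the source
REQUIRED_FIELDS = ("vendor_name", "invoice_number", "total", "line_items")  # assumed


def validate_invoice(invoice: dict, max_expected_line_amount: Decimal) -> list[str]:
    """Return a list of human-readable issues; an empty list means all checks passed."""
    issues = []

    # 1. Required fields must be present and non-empty.
    for name in REQUIRED_FIELDS:
        if invoice.get(name) in (None, "", []):
            issues.append(f"missing required field: {name}")

    line_items = invoice.get("line_items") or []
    amounts = [Decimal(str(item["amount"])) for item in line_items]
    total = invoice.get("total")

    # 2. Line items must sum to the stated total within the rounding tolerance.
    if total not in (None, "") and amounts:
        line_sum = sum(amounts, Decimal("0"))
        if abs(line_sum - Decimal(str(total))) > ROUNDING_TOLERANCE:
            issues.append(f"total mismatch: line items sum to {line_sum}, stated total is {total}")

    # 3. Each line item must fall within the range expected for this vendor.
    for item, amount in zip(line_items, amounts):
        if amount > max_expected_line_amount:
            issues.append(f"amount out of expected range: {item['description']} = {amount}")

    return issues
```

For example, calling `validate_invoice(invoice, Decimal("1000"))` on an invoice containing a 50,000.00 line item would return an out-of-range issue, which downstream code could use to route the document to review instead of auto-approval.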

Confidence Scoring Per Field

Each extracted field carries a confidence signal derived from the model's own uncertainty and cross-validation against the document. Low-confidence fields are highlighted for human review rather than silently accepted, keeping the human review queue focused on genuinely uncertain cases instead of every document.
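
One way the field-level routing could be sketched is shown below. The 0.85 threshold, the `ExtractedField` type, and the cross-check rule (halving confidence when the extracted value does not appear verbatim in the document text) are hypothetical choices for illustration; the source does not specify how the confidence signal is actually computed.

```python
from dataclasses import dataclass, replace

REVIEW_THRESHOLD = 0.85  # assumed cutoff; in practice this is a business decision


@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: str
    confidence: float  # 0.0 to 1.0, as reported for this field


def adjust_confidence(field: ExtractedField, document_text: str) -> ExtractedField:
    """Hypothetical cross-check: penalize values not found verbatim in the document."""
    if field.value in document_text:
        return field
    return replace(field, confidence=field.confidence * 0.5)


def route_fields(
    fields: list[ExtractedField],
    document_text: str,
    threshold: float = REVIEW_THRESHOLD,
) -> tuple[list[ExtractedField], list[ExtractedField]]:
    """Split fields into (auto-accepted, needs human review) after the cross-check."""
    adjusted = [adjust_confidence(f, document_text) for f in fields]
    accepted = [f for f in adjusted if f.confidence >= threshold]
    needs_review = [f for f in adjusted if f.confidence < threshold]
    return accepted, needs_review
```

Because routing happens per field, a document with one doubtful value sends only that value to the review queue while the remaining fields are accepted.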

  • Structured extraction schema with explicit types and required fields
  • Few-shot examples covering layout variation, not just one canonical format
  • Code-level validation of totals and required fields, independent of model confidence
  • Field-level confidence routing — only uncertain fields go to human review, not entire documents

Results

  • 92% reduction in manual data entry time for routine invoice processing
  • 98.4% field-level extraction accuracy, measured against a held-out validation set covering a variety of real-world documents
  • Average processing time under 8 seconds per document, including validation
  • 100% of total/line-item mismatches in the test set were caught by the validation layer before reaching downstream accounting systems
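
For readers unfamiliar with the metric, field-level accuracy can be computed as the share of labeled fields whose extracted value matches the ground truth. The sketch below assumes exact-match comparison and a hypothetical `field_level_accuracy` helper; the source does not describe the scoring procedure behind the figures above, and the sample values used with it would be toy data only.

```python
def field_level_accuracy(predictions: list[dict], ground_truth: list[dict]) -> float:
    """Fraction of labeled fields, across all documents, that were extracted exactly."""
    correct = 0
    total = 0
    for predicted, expected_fields in zip(predictions, ground_truth):
        for field_name, expected_value in expected_fields.items():
            total += 1
            # A field missing from the prediction counts as incorrect.
            if predicted.get(field_name) == expected_value:
                correct += 1
    if total == 0:
        raise ValueError("no labeled fields to score")
    return correct / total
```

Scoring per field instead of per document means a single wrong value lowers the score only slightly, so this metric is best read alongside the discrepancy checks in the validation layer.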

What I Learned

The extraction model was almost never the bottleneck — modern vision-capable LLMs are genuinely good at reading inconsistent document layouts. The real engineering value was in the validation layer and confidence routing: deciding what counts as 'this needs a human' versus 'this is safe to auto-approve' is a business judgment encoded in code, not something to leave entirely to model confidence.
