Agentic document digitization

Hand it to a process, not an API.

alma reads any document with a multi-agent pipeline — typed, handwritten, tables, maps, stamps, signatures — then verifies every field against cited evidence before it ships. Nothing is fabricated.

Trusted by archivists, registries and enterprise data teams

text track

“Ezekiel A. Holder”

est 0.71

vision track

“Ezekiel Braithwaite”

est 0.69

both reads below threshold → escalated to a reviewer

grantor = "Ezekiel Braithwaite" · reviewer · ¶2 L3

amber + ✓ = a human confirmed it — not the machine

Verified, not generated

Drag the seam. Watch ink become a dataset you can trust.

Left, a 1911 conveyance in fading copperplate. Right, the same page after alma — every field boxed, scored, and linked to the exact pixels it came from.

Read & scored — 47 fields · 44 auto-confirmed (machine) · 3 to review
grantor = "Ezekiel A. Holder" · est 0.96 · ¶2 L3
parish = "St. Michael" · est 0.98
parcel = Lot 7 · №1843/214 · est 0.94
consideration = £ 240 ? · est 0.71 · review
teal = machine-read & scored · amber = a human confirmed it
every value → score + citation + bounding box
Raw scan — 1911 conveyance, mixed type + handwriting

Know all men by these Presents that I, Ezekiel A. Holder of the parish of Saint Michael, planter, in consideration of the sum of…

blur · foxing · faded ink · no structure
The pipeline

Five agents. One orchestrated pass.

A document is handed to a process, not an API — detected, read on two tracks, reconciled, and shipped only with the evidence behind it.

  • text
  • vision
  • verified ✓ human
  • escalated
  1. Detect & segment

    Locate every field, table, stamp and margin note.

  2. Dual-track read

    Text and vision models read each field on their own.

  3. Verify

    Reconcile both reads into one scored, cited value.

  4. Export

    Confirmed values ship as CSV, JSON or via the API.

  5. Escalate

    The doubtful ~3% goes to a human reviewer.
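The five steps above can be sketched as a single loop. This is a rough illustration only, not alma's implementation: the `Reading` type, the `run_pass` function, the stand-in agents and the 0.90 threshold are all assumptions made for the example, and the values are the conveyance fields shown earlier on this page.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Reading:
    key: str
    value: str
    est: float  # machine estimate in [0, 1]


def run_pass(
    page: str,
    detect: Callable[[str], Iterable[str]],
    read_text: Callable[[str], Reading],
    read_vision: Callable[[str], Reading],
    verify: Callable[[Reading, Reading], Reading],
    threshold: float = 0.90,  # assumed cut-off, not a documented alma value
) -> Tuple[List[Reading], List[Reading]]:
    """Run one pass and return (export batch, review queue)."""
    export: List[Reading] = []
    review: List[Reading] = []
    for region in detect(page):             # 01 detect & segment
        text = read_text(region)            # 02 dual-track read (text)
        vision = read_vision(region)        # 02 dual-track read (vision)
        reading = verify(text, vision)      # 03 verify
        if reading.est >= threshold:
            export.append(reading)          # 04 export
        else:
            review.append(reading)          # 05 escalate
    return export, review


# Toy stand-ins for the agents, replaying the conveyance values from this page.
CANNED = {
    "grantor": ("Ezekiel A. Holder", 0.96),
    "parish": ("St. Michael", 0.98),
    "parcel": ("Lot 7 · №1843/214", 0.94),
    "consideration": ("£ 240 ?", 0.71),
}


def detect(page: str) -> List[str]:
    return list(CANNED)


def read_canned(region: str) -> Reading:
    value, est = CANNED[region]
    return Reading(region, value, est)


def keep_lower(text: Reading, vision: Reading) -> Reading:
    # Placeholder verifier: keep whichever read is less confident.
    return text if text.est <= vision.est else vision


export, review = run_pass("conveyance-1911", detect, read_canned, read_canned, keep_lower)
print(len(export), "to export;", [r.key for r in review], "to review")
# prints: 3 to export; ['consideration'] to review
```

Passing the agents in as plain callables keeps the control flow visible: every field is either exported or escalated, and nothing leaves the loop unscored.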

How it shows its work

Two readings. One value.

field · passenger.surname · ship's manifest, 1907

text track · OCR/HTR
surname = "Kowalczyk" · est 0.68
vision track · VLM
surname = "Kowalczyk" · est 0.71
machine-resolved
surname = “Kowalczyk” · est 0.95
cited · manifest p.3 · passenger roll 1907-A

Neither track is trusted alone.

est = alma's estimate — only a human sign-off turns it amber + ✓
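One way to picture "two readings, one value" is a small reconciliation function. The sketch below is an assumption-heavy illustration, not alma's scoring: `TrackRead`, `reconcile`, the 0.90 threshold and the noisy-OR style combining rule are all hypothetical. Note that this simple rule yields roughly 0.907 for the manifest example, not the 0.95 shown above, so the real verifier evidently weighs more than two numbers.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TrackRead:
    track: str  # "text" or "vision"
    value: str
    est: float  # that track's own estimate in [0, 1]


def reconcile(
    text: TrackRead,
    vision: TrackRead,
    threshold: float = 0.90,  # assumed cut-off for illustration
) -> Tuple[Optional[str], Optional[float], str]:
    """Return (value, combined estimate, status) for one field."""
    if text.value != vision.value:
        # The tracks disagree: no machine value is chosen, a reviewer decides.
        return None, None, "escalated"
    # The tracks agree: treat them as independent, so the agreed value is
    # doubted only if both tracks would be wrong (a noisy-OR style heuristic).
    combined = 1.0 - (1.0 - text.est) * (1.0 - vision.est)
    status = "machine-resolved" if combined >= threshold else "escalated"
    return text.value, combined, status


agree = reconcile(
    TrackRead("text", "Kowalczyk", 0.68),
    TrackRead("vision", "Kowalczyk", 0.71),
)
disagree = reconcile(
    TrackRead("text", "Ezekiel A. Holder", 0.71),
    TrackRead("vision", "Ezekiel Braithwaite", 0.69),
)
print(agree[0], round(agree[1], 3), agree[2])
print(disagree[2])
```

The point of the sketch is the shape of the decision: agreement between two weak reads can clear the bar, while disagreement always goes to a person.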

What's under the hood

Built to read what OCR gives up on.

table
map
stamp
sign
hand
redact
Any artifact, not just clean type
Handwriting, tables, maps, stamps, signatures, redactions — the things plain OCR drops.
birth.date = "1923-04-07" · est 0.93
A score on every field
Each value carries an estimate and a citation to the source pixels.
text 0.68
vision 0.71
Two tracks, cross-checked
A text track and a vision track — never one model's lone guess.
“Kraków”
gazetteer
matched · place index
Grounded in your vocabulary
Readings checked against your gazetteers, indexes and code lists.
97% auto · 3% → human
Only the residue reaches a person
Reviewers see the doubtful 3% — candidates and evidence pre-assembled.
csv · xlsx · xml · json · API · MCP
Export anywhere, access anywhere
Excel, CSV, XML, JSON — plus an enterprise API and MCP, with RBAC on every key.
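Checking a reading against a gazetteer can be approximated with fuzzy string matching. The sketch below uses Python's standard `difflib` as a stand-in; `match_vocabulary`, the 0.8 cutoff and the `PLACE_INDEX` entries are hypothetical, and nothing here describes how alma actually performs grounding.

```python
from difflib import get_close_matches
from typing import Optional, Sequence

# Hypothetical gazetteer entries, standing in for a project's own place index.
PLACE_INDEX = ["Kraków", "Kowel", "Kraśnik"]


def match_vocabulary(
    reading: str,
    vocabulary: Sequence[str],
    cutoff: float = 0.8,  # assumed similarity floor for illustration
) -> Optional[str]:
    """Return the closest vocabulary entry, or None if nothing is close enough."""
    matches = get_close_matches(reading, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# A read that dropped the diacritic still resolves to the indexed spelling.
print(match_vocabulary("Krakow", PLACE_INDEX))
# A place that is not in the index is left unmatched rather than forced.
print(match_vocabulary("Lwów", PLACE_INDEX))
```

Returning `None` instead of the nearest weak candidate matters here: an unmatched reading is a signal for review, not something to paper over.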

Any document, any century

Digitize ships' manifests

Centuries of formats, one process — built for the archive, not the demo.

  • Ship's manifest
    surname = "Kowalczyk" · est 0.71
  • Estate ledger
    amount = "£240 14s" · est 0.88
  • Birth register
    born = "1923-04-07" · est 0.93
  • Cadastral map
    parcel.id = "№1843/214" · est 0.90
  • Census return
    occupation = "shipwright" · est 0.74
  • Medical intake
    blood_type = "O+" · est 0.82
  • Conveyance deed
    grantor = "Holder" · est 0.62
text track · vision track · sent to a reviewer

The product, not a pitch

See it in the app.

Real screens from the verifying app — every value scored, evidenced, and yours to check.

01 / 04

The Verifying Room

Reviewers see the scan and every reading side by side.

  • scan ↔ readings, locked together
  • confirm, edit or reject in a keystroke
  • every field scored as you go
alma.intergen.app/review
alma's verifying room: a scanned document beside its extracted fields, each reading shown with a confidence score for side-by-side human review.
02 / 04

The review docket

Only the doubtful residue reaches a person.

  • sorted by what needs you most
  • deeds, manifests, ledgers, forms
  • the easy 97% never lands here
alma.intergen.app/queue
alma's review docket: a queue of only the low-confidence documents routed to a human, spanning several document types.
03 / 04

Knowledge grounding

Every reading checked against your own vocabulary.

  • names, places, parcels, codes
  • matches shown inline with each read
  • your corrections compound over time
alma.intergen.app/knowledge
alma's knowledge layer: extracted readings matched against a project's own controlled vocabulary and gazetteer of names, places and codes.
04 / 04

Per-field confidence

Each value carries a score, evidence, and a reason.

  • value · score · evidence · reason
  • export csv · json · xml · xlsx
  • RBAC scoped on every key
alma.intergen.app/console
alma's developer console: per-field results where each value is returned with a confidence score, an evidence citation, and a reason.

A model you own.

Every correction your reviewers make is captured as training data. alma fine-tunes and distills a model specific to your archive — one you own, host how you choose, and call via API.

The Flywheel

The learning loop

Each lap, fewer fields need a human — and the model gets cheaper to run.

Cycle 1 · 71% auto-confirmed · 38% / page
  1. occupation = "shipwright" · est 0.71
    alma reads · machine estimate
  2. you correct · a human confirms
  3. corrections → training data · every fix captured
  4. distilled model · yours to own

↺ reads better — next cycle

rising est = the machine reading better; only ✓ means a human confirmed it.
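Step 3 of the loop, corrections becoming training data, might look something like the sketch below. The `Correction` record, the `to_training_jsonl` function and the JSON Lines layout are assumptions for illustration; the page does not describe alma's actual training format. The two sample rows reuse readings shown elsewhere on this page.

```python
import json
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Correction:
    key: str
    machine_value: str
    machine_est: float
    confirmed_value: str  # what the reviewer accepted or typed


def to_training_jsonl(corrections: Iterable[Correction]) -> str:
    """Serialize reviewer decisions as one JSON object per line."""
    lines = []
    for c in corrections:
        record = {
            "field": c.key,
            "prediction": c.machine_value,
            "est": c.machine_est,
            "label": c.confirmed_value,
            # Confirmations are kept too: they tell the model what it got right.
            "machine_was_right": c.machine_value == c.confirmed_value,
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


corrections = [
    Correction("occupation", "shipwright", 0.71, "shipwright"),
    Correction("grantor", "Ezekiel A. Holder", 0.71, "Ezekiel Braithwaite"),
]
print(to_training_jsonl(corrections))
```

Keeping the machine's estimate beside the human label is a deliberate choice in this sketch: it lets a later training run see where low scores were justified and where they were too cautious.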

Call it from anywhere

One request. Cited fields back.

Invoke your own model over REST or MCP. Every field returns scored and evidence-linked.

alma.intergen.app/settings/keys
alma developer console: a scoped API key, schema selector, and a live log of recent /v1/digitize calls with per-field confidence.

Issue scoped keys; watch every call land.

$ curl -sX POST https://api.alma.intergen.app/v1/digitize \
    -H "Authorization: Bearer $ALMA_KEY" \
    -d '{ "document": "s3://intake/manifest-1907A.tif",
          "schema": "ship_manifest.v3" }'

# 200 OK
{
  "doc_id": "doc_3f9a2c",
  "status": "unreviewed",
  "fields": [
    {
      "key": "passenger.surname",
      "value": "Kowalczyk",
      "confidence": 0.91,
      "evidence": {
        "page": 3, "line": 12,
        "bbox": [0.18, 0.42, 0.39, 0.46]
      },
      "status": "unreviewed"
    }
  ]
}
passenger.surname = "Kowalczyk" · est 0.91 · manifest p.3

est = machine score; a ✓ appears only after a reviewer accepts.

webhooks on escalation · RBAC scopes on every key
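On the client side, a response like the one above could be consumed roughly as follows. This is a sketch under stated assumptions: the JSON is copied from the example on this page, while `pending_fields`, `bbox_to_pixels` and the reading of `bbox` as normalized `[x0, y0, x1, y1]` fractions of the page are guesses made for illustration, not documented behavior.

```python
import json
from typing import Dict, List, Sequence, Tuple

# Response body copied from the example on this page.
RESPONSE = """
{
  "doc_id": "doc_3f9a2c",
  "status": "unreviewed",
  "fields": [
    {
      "key": "passenger.surname",
      "value": "Kowalczyk",
      "confidence": 0.91,
      "evidence": {"page": 3, "line": 12, "bbox": [0.18, 0.42, 0.39, 0.46]},
      "status": "unreviewed"
    }
  ]
}
"""


def pending_fields(payload: Dict) -> List[Dict]:
    """Fields that no reviewer has accepted yet."""
    return [f for f in payload["fields"] if f["status"] == "unreviewed"]


def bbox_to_pixels(
    bbox: Sequence[float], width: int, height: int
) -> Tuple[int, int, int, int]:
    # ASSUMPTION: bbox is [x0, y0, x1, y1] as fractions of page width and height.
    x0, y0, x1, y1 = bbox
    return (round(x0 * width), round(y0 * height),
            round(x1 * width), round(y1 * height))


payload = json.loads(RESPONSE)
for field in pending_fields(payload):
    evidence = field["evidence"]
    # 1000 x 2000 is a made-up scan size, used only to show the conversion.
    box = bbox_to_pixels(evidence["bbox"], width=1000, height=2000)
    print(field["key"], field["value"], field["confidence"],
          "page", evidence["page"], "pixels", box)
```

Treating `status` as the gate, rather than `confidence`, mirrors the rule stated above: a score is only the machine's estimate until a reviewer accepts the value.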

Built for archives that matter

Enterprise control, archival discipline.

  • RBAC at every step

    Who can read, correct, and export — enforced on every action and every key.

  • Evidence-linked audit

    A tamper-evident trail on every field: who, when, what changed, and why.

  • Self-host or VPC

    Open-weight models you run in your own environment. Your data and model stay yours.
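A tamper-evident trail of the kind listed above is commonly built as a hash chain, where each entry's hash covers the entry before it. The sketch below shows that general technique with Python's standard `hashlib`; it is not a description of alma's audit log, and `append_entry`, `verify_trail`, the entry fields and the sample timestamps are all hypothetical.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def _digest(prev_hash: str, entry: Dict[str, str]) -> str:
    # Canonical JSON (sorted keys) so the same entry always hashes the same way.
    payload = json.dumps(entry, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(trail: List[Dict], who: str, when: str, field: str,
                 old: str, new: str, why: str) -> None:
    """Append one change, chained to the hash of the previous entry."""
    entry = {"who": who, "when": when, "field": field,
             "old": old, "new": new, "why": why}
    prev_hash = trail[-1]["hash"] if trail else GENESIS
    trail.append({"entry": entry, "hash": _digest(prev_hash, entry)})


def verify_trail(trail: List[Dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for link in trail:
        if link["hash"] != _digest(prev_hash, link["entry"]):
            return False
        prev_hash = link["hash"]
    return True


trail: List[Dict] = []
# Reviewer name and timestamps below are made up for the example.
append_entry(trail, "reviewer-1", "2024-01-01T10:00:00Z", "grantor",
             "Ezekiel A. Holder", "Ezekiel Braithwaite", "matches signature")
append_entry(trail, "reviewer-1", "2024-01-01T10:05:00Z", "consideration",
             "£ 240 ?", "£ 240", "ledger cross-check")
print(verify_trail(trail))

trail[0]["entry"]["new"] = "Ezekiel A. Holder"  # simulate a silent edit
print(verify_trail(trail))
```

A chain like this detects tampering only if the latest hash is anchored somewhere the editor cannot reach, which is a deployment decision outside the scope of this sketch.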

In production

  • Barbados Land Registry · historical land records
  • National archive · 40k handwritten pages
  • Title insurer · auditor-signed exports
  • 40,000 pages of handwriting we'd written off as un-digitizable — alma gave us a queue of 900 to actually check.

    Head of Records, National Archive
  • The confidence scores are why our auditors signed off.

    VP Operations, Title Insurer
  • We own the model now. That changed the procurement conversation.

    CIO, Ministry of Lands

Stop posting documents to an API. Hand them to a process.

Bring alma your hardest archive — the faded, the handwritten, the irregular. We'll verify it, field by field, and leave you a model you own.