
Your Documents Deserve a Vault, Not a Cloud.

153 AI-powered tools for document processing, search, and compliance -- running 100% on your hardware. Zero cloud. Zero data exposure. One install command.

$ npx -y ocr-provenance-mcp install

No cloud APIs. No data exposure. Your GPU, your container, your data.

HIPAA, SOC 2, SOX compliance exports built in. Not bolted on.

One install. Works with Claude Code, Cursor, and Windsurf automatically.

154 MCP Tools
5 AI Models
100% Local
GPU Accelerated
3,700+ Tests Passing

After You Install

Just Tell Your AI: "Open the OCR Dashboard"

Once installed, your AI client already has all 153 tools loaded. Simply ask it to open the dashboard and it will launch the full web UI on localhost:3367 -- no extra setup needed.

Step 1: Install

npx -y ocr-provenance-mcp install

Step 2: Ask Your AI

"Open the OCR dashboard"

Your AI opens the dashboard with document stats, search, compliance reports, and account management.
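
For readers curious what that request looks like on the wire, here is a minimal sketch of the JSON-RPC message an MCP client sends when it calls a tool. The `tools/call` method and message shape come from the Model Context Protocol; the tool name `ocr_dashboard_open` is taken from the tool list on this page. The empty `arguments` object is an assumption, since the tool's actual input schema is not shown here, and `tool_call` is a hypothetical helper name.

```python
import json
from itertools import count
from typing import Optional

# Monotonic request ids, as JSON-RPC expects a unique id per request.
_ids = count(1)


def tool_call(name: str, arguments: Optional[dict] = None) -> str:
    """Build an MCP `tools/call` request as a JSON string.

    `arguments` defaults to an empty object (an assumption: the real
    tool may accept optional parameters not documented on this page).
    """
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    })


request = tool_call("ocr_dashboard_open")
```

In practice your AI client builds and sends this message for you; the sketch only illustrates why a plain-language request is enough.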


1,150 docs: processed in ~3 min

153 MCP tools

30-75x more tools than competitors

100% local processing

$0.03 per file

Your AI Tools Are Leaking Your Documents to the Cloud.

  • Your team sends confidential documents to cloud AI tools -- and you can't prove where the data goes
  • Your compliance audit asks for document processing provenance -- and you have nothing to show
  • You need AI-powered search across 10,000 contracts -- but your legal team won't approve cloud processing
  • Your HIPAA officer just asked how patient records are handled by your AI tools
  • Every OCR vendor wants you to upload documents to THEIR servers
"Every document you send to a cloud API is a compliance risk you can't take back. And your auditor is going to ask about it."

One Install. Zero Cloud. Complete Document Intelligence.

Before: Send documents to cloud APIs and hope for the best
After: 100% local processing -- documents never leave your machine

Before: 'Where did this data come from?' -- no answer
After: SHA-256 provenance chains with one-click W3C PROV export

Before: Manual HIPAA compliance documentation
After: Automated compliance reports in seconds -- HIPAA, SOC 2, SOX

Before: Pay $0.07 per page to cloud OCR
After: $0.03 per file -- a 100-page contract costs three cents

Before: Hire a consultant to build document search ($50K+)
After: 153 tools installed in 60 seconds -- semantic search, contract analysis, approval workflows

Before: 'We need 6 months to build this internally'
After: Processing documents in under 5 minutes after install

$ npx -y ocr-provenance-mcp install

See Exactly What It Does

Included: Business Strategy Knowledge Base -- Every install ships with a fully searchable database of 1,147 Alex Hormozi YouTube video transcripts plus all 3 of his books ($100M Offers, $100M Leads, $100M Money Models). That's 2.6+ million tokens of business strategy context -- more than fits in any context window. Point your AI at this database and ask anything about pricing, offers, lead generation, or growth strategy. Ready to search the moment you install. No credits needed.

OCR Speed: ~2-5 sec/page

Embedding: ~12ms/chunk

Semantic Search: <100ms

Hybrid Search: <200ms

Full Pipeline (1,150 docs): ~3 min

$ npx -y ocr-provenance-mcp install

Built for the Industries That Can't Afford a Data Breach

Healthcare

HIPAA compliance built in. Patient records processed on your hardware. Automated compliance reports. 6+ year audit trails.


Legal

Attorney-client privilege protected by architecture. Contract analysis, obligation tracking, and playbook comparison -- without exposing a single clause to the cloud.


Financial Services

SOC 2 and SOX compliance exports. HMAC-signed billing. SHA-256 provenance chains. The audit trail your regulator wants to see.


154 Tools Across 18 Categories. All Included.

Every tool listed below installs with a single command.

📥

Document Ingestion

7

Ingest files, directories. Process, retry, reprocess.

  • ocr_ingest_files

    Ingest specific files

  • ocr_ingest_directory

    Bulk ingest entire directory

  • ocr_process_pending

    Run full OCR pipeline on pending documents

  • ocr_convert_raw

    Quick OCR preview without creating records

  • ocr_reprocess

    Re-run OCR with different settings

  • ocr_retry_failed

    Reset failed documents to pending

  • ocr_status

    Check processing status

🔍

Search

7

Keyword, semantic, hybrid. Cross-database. RAG context.

  • ocr_search

    Unified search — keyword, semantic, or hybrid

  • ocr_search_cross_db

    Search across multiple databases

  • ocr_rag_context

    Assemble search context for RAG

  • ocr_benchmark_compare

    Compare search quality across databases

  • ocr_fts_manage

    FTS5 index maintenance (rebuild/status)

  • ocr_search_export

    Export search results to CSV/JSON

  • ocr_search_saved

    Manage saved searches (save/list/execute)

📄

Document Management

10

List, view, delete, deduplicate, version history, structure.

  • ocr_document_get

    Get full document details

  • ocr_document_list

    Browse documents with cursor pagination

  • ocr_document_delete

    Permanently delete document

  • ocr_document_find_similar

    Find similar documents by embedding

  • ocr_document_duplicates

    Find duplicate documents

  • ocr_document_structure

    Get document outline/tree structure

  • ocr_document_update_metadata

    Update title/author/subject

  • ocr_document_versions

    Find all versions of re-ingested document

  • ocr_document_workflow

    Track document review states

  • ocr_export

    Export document as JSON/markdown/CSV

🔗

Provenance Tracking

6

Full chain-of-custody. SHA-256. W3C PROV export.

  • ocr_provenance_get

    Get provenance chain with descendants

  • ocr_provenance_query

    Query provenance with filters

  • ocr_provenance_timeline

    Processing timeline for document

  • ocr_provenance_verify

    Verify data integrity via hash chains

  • ocr_provenance_export

    Export provenance (JSON/W3C-PROV/CSV)

  • ocr_provenance_processor_stats

    Per-processor performance stats

👁️

Vision AI (VLM)

3

Describe images, charts, diagrams — local Chandra VLM.

  • ocr_vlm_describe

    Generate AI description of image (Chandra VLM)

  • ocr_vlm_process

    Run VLM analysis on all document images

  • ocr_vlm_status

    Check VLM processing status

🖼️

Image Processing

9

Extract, search, reanalyze, stats.

  • ocr_extract_images

    Extract images from PDF/DOCX files

  • ocr_image_get

    Get full image details

  • ocr_image_list

    List images with VLM status

  • ocr_image_search

    Search images by keyword or semantic similarity

  • ocr_image_pending

    List images needing VLM processing

  • ocr_image_reanalyze

    Re-run VLM with custom prompt

  • ocr_image_reset_failed

    Reset failed VLM images to pending

  • ocr_image_delete

    Delete images permanently

  • ocr_image_stats

    Get image processing statistics

🧮

Embeddings

4

768-dim vectors with nomic-embed-text-v1.5.

  • ocr_embedding_get

    Get specific embedding details

  • ocr_embedding_list

    List embeddings with filtering

  • ocr_embedding_rebuild

    Rebuild embeddings for chunk/image/document

  • ocr_embedding_stats

    Get embedding coverage statistics

⚖️

Document Comparison

6

Side-by-side diff. Batch compare. Similarity matrix.

  • ocr_document_compare

    Diff two documents (text + structural)

  • ocr_comparison_get

    Retrieve full diff data

  • ocr_comparison_list

    List past comparisons

  • ocr_comparison_batch

    Compare multiple document pairs

  • ocr_comparison_discover

    Find likely-similar document pairs

  • ocr_comparison_matrix

    NxN pairwise cosine similarity matrix

🗂️

Clustering

7

Auto-cluster by similarity. No parameter tuning needed.

  • ocr_cluster_documents

    Group documents by similarity (HDBSCAN/k-means)

  • ocr_cluster_list

    Browse existing clusters

  • ocr_cluster_get

    Inspect cluster details

  • ocr_cluster_assign

    Auto-classify document into existing cluster

  • ocr_cluster_merge

    Merge two clusters

  • ocr_cluster_reassign

    Move document to different cluster

  • ocr_cluster_delete

    Delete all clusters for a run

📋

Contract Lifecycle

9

Clauses, obligations, calendar, playbooks, summaries.

  • ocr_contract_extract

    Extract contract-specific information

  • ocr_document_summarize

    Generate structured summary from chunks

  • ocr_corpus_summarize

    Summarize entire corpus

  • ocr_obligation_list

    List contract obligations with filters

  • ocr_obligation_update

    Update obligation status

  • ocr_obligation_calendar

    Calendar view of deadlines

  • ocr_playbook_create

    Create playbook with preferred terms

  • ocr_playbook_list

    List all playbooks

  • ocr_playbook_compare

    Compare document against playbook

🛡️

Compliance & Audit

5

SOC 2, HIPAA, SOX. Full audit trail.

  • ocr_compliance_report

    Generate compliance overview

  • ocr_compliance_hipaa

    HIPAA-specific compliance report

  • ocr_compliance_export

    Export audit trail (SOC 2/HIPAA/SOX format)

  • ocr_user_info

    Get/create user with roles

  • ocr_audit_query

    Query audit log with filters

👥

Collaboration

11

Annotations, locking, alerts, review workflows.

  • ocr_annotation_create

    Create annotation (comment/correction/flag/approval)

  • ocr_annotation_get

    Get annotation with threaded replies

  • ocr_annotation_list

    List annotations with filters

  • ocr_annotation_update

    Edit annotation or change status

  • ocr_annotation_delete

    Delete annotation and replies

  • ocr_annotation_summary

    Get annotation statistics

  • ocr_document_lock

    Acquire exclusive/shared lock

  • ocr_document_lock_status

    Check lock status

  • ocr_document_unlock

    Release document lock

  • ocr_search_alert_check

    Check for new docs matching saved search

  • ocr_search_alert_enable

    Enable/disable search alerts

Workflow & Approvals

8

Multi-step chains. Assignment. Queue management.

  • ocr_workflow_submit

    Submit document for review

  • ocr_workflow_assign

    Assign reviewer

  • ocr_workflow_review

    Review (approve/reject/changes requested)

  • ocr_workflow_status

    Get workflow state and history

  • ocr_workflow_queue

    List documents in workflow queue

  • ocr_approval_chain_create

    Create reusable approval chain

  • ocr_approval_chain_apply

    Apply approval chain to document

  • ocr_approval_step_decide

    Decide on approval step

💾

Database Management

20

Multi-DB. Backup, restore, clone, merge, snapshot, share.

  • ocr_db_create

    Create new SQLite database

  • ocr_db_delete

    Permanently delete database

  • ocr_db_list

    List all databases with pagination

  • ocr_db_select

    Switch active database

  • ocr_db_stats

    Get database statistics (size, counts, quality)

  • ocr_db_archive

    Archive database (hide from default list)

  • ocr_db_recent

    Show recently accessed databases

  • ocr_db_rename

    Rename database

  • ocr_db_search

    Find databases by name/description/tags

  • ocr_db_summary

    AI-readable database profile

  • ocr_db_tag

    Add/remove/set tags and metadata

  • ocr_db_unarchive

    Restore archived database

  • ocr_db_workspace

    Create/list/manage database workspaces

  • ocr_db_backup

    Create atomic backup (VACUUM INTO)

  • ocr_db_clone

    Clone database to new name

  • ocr_db_import

    Import documents from JSON export

  • ocr_db_merge

    Merge source database into current

  • ocr_db_restore

    Restore database from backup

  • ocr_db_snapshot

    Create/list/restore/delete snapshots

  • ocr_export_stream

    Stream export as JSON-Lines

🏷️

Tags & Organization

6

Create, apply, search tags across everything.

  • ocr_tag_create

    Create reusable tag with color

  • ocr_tag_list

    List tags with usage counts

  • ocr_tag_apply

    Attach tag to any entity

  • ocr_tag_remove

    Detach tag from entity

  • ocr_tag_search

    Find entities by tag

  • ocr_tag_delete

    Delete tag permanently

📊

Reports & Analytics

8

Quality, cost, performance, error, trend analysis.

  • ocr_report_overview

    Quality and corpus overview

  • ocr_report_performance

    Pipeline performance analytics

  • ocr_document_report

    Detailed report for single document

  • ocr_cost_summary

    Cost analytics by document/mode/month

  • ocr_error_analytics

    Error and recovery analytics

  • ocr_evaluation_report

    Comprehensive evaluation report

  • ocr_trends

    Time-series trends (quality/volume)

  • ocr_export_audit_log

    Export audit log as CSV/JSON

🧠

Intelligence

5

Interactive guide. Table extraction. Smart recommendations.

  • ocr_guide

    System state overview with prioritized next steps

  • ocr_document_extras

    Supplementary data (charts, links, tracked changes)

  • ocr_document_tables

    Extract table data from document

  • ocr_table_export

    Export table data as CSV/JSON/markdown

  • ocr_document_recommend

    Related document recommendations

⚙️

System

33

Health, config, maintenance, license, dashboard, webhooks.

  • ocr_health_check

    Diagnose data integrity issues

  • ocr_db_maintenance

    Database maintenance (analyze/vacuum)

  • ocr_config_get

    View system configuration

  • ocr_config_set

    Change configuration setting

  • ocr_license_status

    Check license status and balance

  • ocr_dashboard_open

    Open dashboard in browser

  • ocr_dashboard_status

    Check health of all 3 services

  • ocr_webhook_create

    Register webhook for event notifications

  • ocr_webhook_list

    List registered webhooks

  • ocr_webhook_delete

    Remove webhook registration

  • ocr_chunk_get

    Inspect specific chunk by ID

  • ocr_chunk_list

    Browse all chunks in document

  • ocr_chunk_context

    Expand result with surrounding text

  • ocr_document_page

    Read specific page of document

  • ocr_db_share

    Export database to shared folder

  • ocr_db_import_shared

    Import from shared folder

  • ocr_db_transfer

    Package database for transfer

  • ocr_db_receive

    Import transfer bundle

  • ocr_extraction_get

    Get extraction results by ID

  • ocr_extraction_list

    List structured extractions

  • ocr_export_annotations

    Export annotations as CSV/JSON

  • ocr_export_obligations_csv

    Export obligations as CSV

"Other document tools give you OCR. We give you OCR + semantic search + vision AI + provenance + compliance + clustering + contract lifecycle + collaboration + workflow + analytics. All local. All in one install."

How It Works

Watch the full OCR Provenance MCP demo and walkthrough (15 min)

Model | Purpose | VRAM
Marker-pdf v1.10.2 | Document OCR with layout preservation | 8-10 GB
Chandra v0.1.8 | Vision AI -- images, charts, diagrams | ~18 GB
nomic-embed-text-v1.5 | 768-dim semantic embeddings | 2-3 GB
HDBSCAN | Auto-clustering by similarity | CPU
ms-marco-MiniLM-L-12-v2 | Cross-encoder reranking | ~1 GB

DOCUMENT --> OCR_RESULT --> CHUNK --> EMBEDDING
                       --> IMAGE --> VLM_DESC --> EMBEDDING
     ^ SHA-256 hash at every node
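
To make the diagram concrete, here is a minimal sketch of how a SHA-256 hash chain over those nodes could work. It is an illustration under stated assumptions, not the product's actual implementation: the `ProvNode` fields, the `make_node` and `verify_chain` helpers, and the way the parent hash is folded into each node are all hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ProvNode:
    kind: str                   # e.g. "DOCUMENT", "OCR_RESULT", "CHUNK"
    content_hash: str           # SHA-256 of this node's own content
    parent_hash: Optional[str]  # chain hash of the parent, None for the root
    chain_hash: str             # SHA-256 over kind, content hash, parent hash


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def make_node(kind: str, content: bytes,
              parent: Optional[ProvNode] = None) -> ProvNode:
    """Hash the content, then bind it to the parent's chain hash."""
    content_hash = sha256_hex(content)
    parent_hash = parent.chain_hash if parent is not None else None
    chain_hash = sha256_hex(
        f"{kind}|{content_hash}|{parent_hash or ''}".encode()
    )
    return ProvNode(kind, content_hash, parent_hash, chain_hash)


def verify_chain(nodes: List[ProvNode], contents: List[bytes]) -> bool:
    """Recompute every hash from root to leaf and compare to stored values."""
    if len(nodes) != len(contents):
        return False
    parent: Optional[ProvNode] = None
    for node, content in zip(nodes, contents):
        if make_node(node.kind, content, parent) != node:
            return False  # content or linkage changed since recording
        parent = node
    return True
```

Because each node's hash covers its parent's hash, changing any upstream artifact invalidates everything derived from it, which is the property an auditor cares about.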

Built for the Industries That Can't Afford a Data Breach

Healthcare

HIPAA compliance built in. Patient records processed on your hardware. Automated compliance reports. 6+ year audit trails. Consumer AI tools are explicitly prohibited for unredacted PHI -- this isn't.

Legal

Attorney-client privilege protected by architecture, not policy. Contract analysis, obligation tracking, and playbook comparison -- without exposing a single clause to the cloud. 50% of large law firms still use on-premises document management. Now they can have AI too.

Financial Services

SOC 2 and SOX compliance exports. HMAC-signed billing with tamper detection. SHA-256 provenance chains with W3C PROV export. DORA requires proof of operational resilience -- this is it.
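
As an illustration of what HMAC signing with tamper detection means, here is a minimal sketch using Python's standard `hmac` module. The record fields, the key handling, and the helper names `sign_record` and `verify_record` are assumptions made for the example; the product's real billing format is not documented on this page.

```python
import hashlib
import hmac
import json


def sign_record(record: dict, key: bytes) -> str:
    """Sign a canonical JSON encoding of the record with HMAC-SHA256."""
    # sort_keys + fixed separators give a stable byte representation,
    # so the same record always produces the same signature.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_record(record: dict, signature: str, key: bytes) -> bool:
    """Return True only if the record is unchanged since it was signed."""
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(sign_record(record, key), signature)


# Hypothetical billing entry: three files at $0.03 each.
key = b"example-secret-key"
entry = {"files": 3, "amount_usd": "0.09"}
signature = sign_record(entry, key)
```

Any edit to a signed entry, even a single digit, changes the recomputed signature and makes `verify_record` return False.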

100% local processing -- all inference on YOUR hardware

Zero telemetry -- no analytics, no tracking, no phone-home

SHA-256 provenance chains -- every extraction linked to source

Container hardening -- cap-drop=ALL, non-root, no-new-privileges

Zod schema validation on all 154 tool inputs

Models run offline -- HF_HUB_OFFLINE=1 enforced
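
The hardening items above map onto standard Docker options. The sketch below assembles an equivalent `docker run` command line; it is a hedged illustration, not the installer's actual invocation. The image name passed in, the 1000:1000 user, and the `hardened_docker_args` helper are hypothetical, and GPU and volume flags are omitted.

```python
from typing import List


def hardened_docker_args(image: str, uid: int = 1000, gid: int = 1000) -> List[str]:
    """Build a `docker run` argv reflecting the hardening list above."""
    return [
        "docker", "run", "--rm",
        "--cap-drop=ALL",                    # drop every Linux capability
        "--security-opt=no-new-privileges",  # block privilege escalation
        f"--user={uid}:{gid}",               # run as a non-root user
        "-e", "HF_HUB_OFFLINE=1",            # keep Hugging Face libraries offline
        image,
    ]


# Placeholder image name, used only for illustration.
args = hardened_docker_args("example/ocr-image:latest")
```

Passing the list to `subprocess.run(args)` would launch the container; printing `" ".join(args)` shows the equivalent shell command.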

"We don't need to guarantee your data stays private. There's no cloud to send it to. The processing happens on your GPU, in a container, on your machine. That's not a promise -- it's the architecture."

Everything You Need. Nothing You Don't.

First 100 customers who spend $100 get $10,000 credited to their account. That's 333,000+ files of processing power. Limited to the first 100.

  • 153 MCP tools for complete document lifecycle
    $50,000/yr
  • HIPAA, SOC 2, SOX compliance export suite
    $10,000/yr
  • Contract management: obligations, playbooks, extraction
    $15,000/yr
  • Hybrid semantic + keyword search with RAG assembly
    $5,000/yr
  • SHA-256 cryptographic provenance chains
    Priceless
  • 1,150 Hormozi business strategy documents (1,147 video transcripts plus 3 books)
    $997
  • Approval workflows with multi-step chains
    $5,000/yr
  • Zero telemetry, zero tracking, zero cloud
    Peace of mind
Total Value $86,000+/yr

Your Price: $0.03 per file.

No subscription. Credits never expire.

Markdown and text files: Free. Forever.
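
The arithmetic behind these numbers is easy to check. The sketch below uses only figures quoted on this page: $0.03 per file, the $0.07 per page cloud price, and the $10,000 launch credit. The function names are hypothetical, and free .md and .txt files are simply left out of the billable count.

```python
from decimal import Decimal

PRICE_PER_FILE = Decimal("0.03")        # per-file price quoted on this page
CLOUD_PRICE_PER_PAGE = Decimal("0.07")  # per-page cloud OCR price quoted on this page


def per_file_cost(billable_files: int) -> Decimal:
    """Cost under per-file pricing; page count does not matter."""
    return PRICE_PER_FILE * billable_files


def per_page_cost(pages: int, rate: Decimal = CLOUD_PRICE_PER_PAGE) -> Decimal:
    """Cost under per-page pricing."""
    return rate * pages


def files_for_credit(credit: Decimal) -> int:
    """How many billable files a credit balance covers."""
    return int(credit // PRICE_PER_FILE)


contract_local = per_file_cost(1)             # one 100-page contract, one file
contract_cloud = per_page_cost(100)           # same contract at $0.07 per page
launch_files = files_for_credit(Decimal("10000"))
```

A single 100-page contract costs $0.03 per file versus $7.00 at $0.07 per page, and a $10,000 credit covers 333,333 files, matching the "333,000+" figure above.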

$0.03 / file

Pay for what you use

  • All 154 tools included free
  • .md and .txt files process for free
  • Buy credits via Stripe
  • No monthly fee, no subscription
  • Credits never expire
$ npx -y ocr-provenance-mcp install

Enterprise

For regulated organizations

  • Everything in Pay-Per-Use
  • Commercial license
  • Priority support
  • Volume pricing
  • Compliance documentation
  • Custom terms
"Cloud OCR charges $0.01-0.07 per page -- and your data leaves every time. We charge $0.03 per file -- and your data never leaves. A 100-page contract costs three cents."

What You Need

Supported Formats

PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, EPUB, PNG, JPG, JPEG, TIFF, TIF, BMP, GIF, WEBP, TXT (free), MD (free), CSV, HTML

20 file types supported. .md and .txt files process free -- no OCR models needed, just embedding. Works great on CPU.

System Requirements

Component | Minimum | Recommended
Docker | Engine 20+ | Desktop (latest)
Node.js | 20+ | 22+ LTS
RAM | 8 GB | 16+ GB
Disk | 30 GB | 50+ GB
GPU | Optional (CPU works for .md/.txt) | NVIDIA RTX 3060+ (16+ GB VRAM)
OS | Windows with WSL2 | Windows with WSL2 + NVIDIA GPU

Full GPU processing (OCR + VLM + Embeddings): Windows with NVIDIA GPU. Minimum 16 GB VRAM for VLM (Chandra). Recommended: 24 GB (RTX 3090/4090).

CPU-only mode: Works on Windows for .md/.txt embedding and all search/management tools. No GPU required.

macOS: Bare metal release coming soon. The Docker container does not currently support Mac GPU passthrough.

Linux: Supported with NVIDIA GPU via Docker.

Works With Your AI Client. Automatically.

Claude Code

claude mcp add ocr-provenance-mcp -s user -- npx -y ocr-provenance-mcp

Claude Desktop

Add to claude_desktop_config.json

Cursor

Add to ~/.cursor/mcp.json

Windsurf

Standard MCP configuration
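
For the file-based clients, a manual entry might look like the output of the sketch below. It assumes the common `mcpServers` layout used by `claude_desktop_config.json` and `~/.cursor/mcp.json`, and reuses the `npx -y ocr-provenance-mcp` command from the Claude Code line above. The `add_mcp_server` helper is hypothetical; check your client's documentation before editing a real config file.

```python
import json


def add_mcp_server(config: dict, name: str = "ocr-provenance-mcp") -> dict:
    """Return a copy of an MCP client config with this server registered.

    Existing servers are preserved; the input dict is not modified.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers[name] = {
        "command": "npx",
        "args": ["-y", "ocr-provenance-mcp"],
    }
    merged["mcpServers"] = servers
    return merged


# Starting from an empty config yields the minimal entry to paste in.
snippet = json.dumps(add_mcp_server({}), indent=2)
```

Printing `snippet` gives a JSON object with a single `mcpServers` entry that can be merged into the client's config file by hand.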
"The installer auto-detects and registers with every supported client. You probably don't need to do any of this."

Your Next Compliance Audit Is Coming. Be Ready.

Install in 60 seconds. Process your first document in 5 minutes. Generate your first compliance report in 10.

$ npx -y ocr-provenance-mcp install

Start processing documents in under 60 seconds