Your Documents Deserve a Vault, Not a Cloud.

153 AI-powered tools for document processing, search, and compliance -- running 100% on your hardware. Zero cloud. Zero data exposure. One install command.

$ npx -y ocr-provenance-mcp install
✓ Docker detected
✓ Pulling AI models...
✓ Starting container...
✓ License provisioned
✓ Registered with Claude Code
Ready. 153 tools available.

No cloud APIs. No data exposure. Your GPU, your container, your data.

HIPAA, SOC 2, SOX compliance exports built in. Not bolted on.

One install. Works with Claude Code, Cursor, and Windsurf automatically.

153 MCP Tools

5 AI Models

100% Local

GPU Accelerated

3,700+ Tests Passing

After You Install

Just Tell Your AI: "Open the OCR Dashboard"

Once installed, your AI client already has all 153 tools loaded. Simply ask it to open the dashboard and it will launch the full web UI on localhost:3367 -- no extra setup needed.

Step 1: Install

npx -y ocr-provenance-mcp install

Step 2: Ask Your AI

"Open the OCR dashboard"

Your AI opens the dashboard with document stats, search, compliance reports, and account management.

1,150 docs processed in ~3 min
153 MCP tools
30-75x more tools than competitors
100% local processing
$0.03 per file

Your AI Tools Are Leaking Your Documents to the Cloud.

Your team sends confidential documents to cloud AI tools -- and you can't prove where the data goes
Your compliance audit asks for document processing provenance -- and you have nothing to show
You need AI-powered search across 10,000 contracts -- but your legal team won't approve cloud processing
Your HIPAA officer just asked how patient records are handled by your AI tools
Every OCR vendor wants you to upload documents to THEIR servers

"Every document you send to a cloud API is a compliance risk you can't take back. And your auditor is going to ask about it."

One Install. Zero Cloud. Complete Document Intelligence.

Before

Send documents to cloud APIs and hope for the best

'Where did this data come from?' -- no answer

Manual HIPAA compliance documentation

Pay $0.07 per page to cloud OCR

Hire a consultant to build document search ($50K+)

'We need 6 months to build this internally'

After

100% local processing -- documents never leave your machine

SHA-256 provenance chains with one-click W3C PROV export

Automated compliance reports in seconds -- HIPAA, SOC 2, SOX

$0.03 per file -- a 100-page contract costs three cents

153 tools installed in 60 seconds -- semantic search, contract analysis, approval workflows

Processing documents in under 5 minutes after install

See Exactly What It Does

[Figure: OCR Provenance MCP system architecture. Documents flow through a local GPU processing engine (OCR, VLM, and Embedding workers) into 153 tool branches, with a SHA-256 provenance tracking chain.]

Included: Business Strategy Knowledge Base -- Every install ships with a fully searchable database of 1,147 Alex Hormozi YouTube video transcripts plus all 3 of his books ($100M Offers, $100M Leads, $100M Money Models). That's 2.6+ million tokens of business strategy context -- more than fits in any context window. Point your AI at this database and ask anything about pricing, offers, lead generation, or growth strategy. Ready to search the moment you install. No credits needed.

Metric	Value
OCR Speed	~2-5 sec/page
Embedding	~12ms/chunk
Semantic Search	<100ms
Hybrid Search	<200ms
Full Pipeline (1,150 docs)	~3 min

Built for the Industries That Can't Afford a Data Breach

Healthcare

HIPAA compliance built in. Patient records processed on your hardware. Automated compliance reports. 6+ year audit trails.

Legal

Attorney-client privilege protected by architecture. Contract analysis, obligation tracking, and playbook comparison -- without exposing a single clause to the cloud.

Financial Services

SOC 2 and SOX compliance exports. HMAC-signed billing. SHA-256 provenance chains. The audit trail your regulator wants to see.

153 Tools Across 18 Categories. All Included.

Every tool listed below installs with a single command.

📥

Document Ingestion

Ingest files, directories. Process, retry, reprocess.

▸ ocr_ingest_files -- Ingest specific files
▸ ocr_ingest_directory -- Bulk ingest entire directory
▸ ocr_process_pending -- Run full OCR pipeline on pending documents
▸ ocr_convert_raw -- Quick OCR preview without creating records
▸ ocr_reprocess -- Re-run OCR with different settings
▸ ocr_retry_failed -- Reset failed documents to pending
▸ ocr_status -- Check processing status

🔍

Search

Keyword, semantic, hybrid. Cross-database. RAG context.

▸ ocr_search -- Unified search: keyword, semantic, or hybrid
▸ ocr_search_cross_db -- Search across multiple databases
▸ ocr_rag_context -- Assemble search context for RAG
▸ ocr_benchmark_compare -- Compare search quality across databases
▸ ocr_fts_manage -- FTS5 index maintenance (rebuild/status)
▸ ocr_search_export -- Export search results to CSV/JSON
▸ ocr_search_saved -- Manage saved searches (save/list/execute)
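
The page does not say how hybrid search combines keyword and semantic results. As an illustration only, here is a minimal sketch of one common fusion technique, reciprocal rank fusion (RRF). RRF is swapped in as an assumption, and the function name, the constant k=60, and the sample document IDs are all hypothetical rather than taken from the product.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by both searches rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a keyword search and a semantic search.
keyword_hits = ["contract_a", "contract_b", "contract_c"]
semantic_hits = ["contract_b", "contract_d", "contract_a"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

Documents that appear in both lists (contract_a and contract_b here) outrank documents found by only one search, which is the usual motivation for a hybrid mode.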

📄

Document Management

List, view, delete, deduplicate, version history, structure.

▸ ocr_document_get -- Get full document details
▸ ocr_document_list -- Browse documents with cursor pagination
▸ ocr_document_delete -- Permanently delete document
▸ ocr_document_find_similar -- Find similar documents by embedding
▸ ocr_document_duplicates -- Find duplicate documents
▸ ocr_document_structure -- Get document outline/tree structure
▸ ocr_document_update_metadata -- Update title/author/subject
▸ ocr_document_versions -- Find all versions of re-ingested document
▸ ocr_document_workflow -- Track document review states
▸ ocr_export -- Export document as JSON/markdown/CSV

🔗

Provenance Tracking

Full chain-of-custody. SHA-256. W3C PROV export.

▸ ocr_provenance_get -- Get provenance chain with descendants
▸ ocr_provenance_query -- Query provenance with filters
▸ ocr_provenance_timeline -- Processing timeline for document
▸ ocr_provenance_verify -- Verify data integrity via hash chains
▸ ocr_provenance_export -- Export provenance (JSON/W3C-PROV/CSV)
▸ ocr_provenance_processor_stats -- Per-processor performance stats

👁️

Vision AI (VLM)

Describe images, charts, and diagrams with the local Chandra VLM.

▸ ocr_vlm_describe -- Generate AI description of image (Chandra VLM)
▸ ocr_vlm_process -- Run VLM analysis on all document images
▸ ocr_vlm_status -- Check VLM processing status

🖼️

Image Processing

Extract, search, reanalyze, stats.

▸ ocr_extract_images -- Extract images from PDF/DOCX files
▸ ocr_image_get -- Get full image details
▸ ocr_image_list -- List images with VLM status
▸ ocr_image_search -- Search images by keyword or semantic similarity
▸ ocr_image_pending -- List images needing VLM processing
▸ ocr_image_reanalyze -- Re-run VLM with custom prompt
▸ ocr_image_reset_failed -- Reset failed VLM images to pending
▸ ocr_image_delete -- Delete images permanently
▸ ocr_image_stats -- Get image processing statistics

🧮

Embeddings

768-dim vectors with nomic-embed-text-v1.5.

▸ ocr_embedding_get -- Get specific embedding details
▸ ocr_embedding_list -- List embeddings with filtering
▸ ocr_embedding_rebuild -- Rebuild embeddings for chunk/image/document
▸ ocr_embedding_stats -- Get embedding coverage statistics

⚖️

Document Comparison

Side-by-side diff. Batch compare. Similarity matrix.

▸ ocr_document_compare -- Diff two documents (text + structural)
▸ ocr_comparison_get -- Retrieve full diff data
▸ ocr_comparison_list -- List past comparisons
▸ ocr_comparison_batch -- Compare multiple document pairs
▸ ocr_comparison_discover -- Find likely-similar document pairs
▸ ocr_comparison_matrix -- NxN pairwise cosine similarity matrix
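
To make "NxN pairwise cosine similarity matrix" concrete, here is a small standard-library sketch of the underlying math. It is not the tool's implementation: the function names are hypothetical, and the 3-dimensional vectors stand in for the 768-dimensional embeddings mentioned elsewhere on this page.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as similar to nothing.
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """Build the symmetric NxN matrix of pairwise cosine similarities."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            score = cosine_similarity(vectors[i], vectors[j])
            matrix[i][j] = matrix[j][i] = score
    return matrix


# Toy document embeddings (hypothetical values).
doc_vectors = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
matrix = similarity_matrix(doc_vectors)
for row in matrix:
    print([round(value, 3) for value in row])
```

The diagonal is always 1.0 for non-zero vectors, and only the upper triangle needs computing because the matrix is symmetric.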

🗂️

Clustering

Auto-cluster by similarity. No parameter tuning needed.

▸ ocr_cluster_documents -- Group documents by similarity (HDBSCAN/k-means)
▸ ocr_cluster_list -- Browse existing clusters
▸ ocr_cluster_get -- Inspect cluster details
▸ ocr_cluster_assign -- Auto-classify document into existing cluster
▸ ocr_cluster_merge -- Merge two clusters
▸ ocr_cluster_reassign -- Move document to different cluster
▸ ocr_cluster_delete -- Delete all clusters for a run

📋

Contract Lifecycle

Clauses, obligations, calendar, playbooks, summaries.

▸ ocr_contract_extract -- Extract contract-specific information
▸ ocr_document_summarize -- Generate structured summary from chunks
▸ ocr_corpus_summarize -- Summarize entire corpus
▸ ocr_obligation_list -- List contract obligations with filters
▸ ocr_obligation_update -- Update obligation status
▸ ocr_obligation_calendar -- Calendar view of deadlines
▸ ocr_playbook_create -- Create playbook with preferred terms
▸ ocr_playbook_list -- List all playbooks
▸ ocr_playbook_compare -- Compare document against playbook

🛡️

Compliance & Audit

SOC 2, HIPAA, SOX. Full audit trail.

▸ ocr_compliance_report -- Generate compliance overview
▸ ocr_compliance_hipaa -- HIPAA-specific compliance report
▸ ocr_compliance_export -- Export audit trail (SOC 2/HIPAA/SOX format)
▸ ocr_user_info -- Get/create user with roles
▸ ocr_audit_query -- Query audit log with filters

👥

Collaboration

Annotations, locking, alerts, review workflows.

▸ ocr_annotation_create -- Create annotation (comment/correction/flag/approval)
▸ ocr_annotation_get -- Get annotation with threaded replies
▸ ocr_annotation_list -- List annotations with filters
▸ ocr_annotation_update -- Edit annotation or change status
▸ ocr_annotation_delete -- Delete annotation and replies
▸ ocr_annotation_summary -- Get annotation statistics
▸ ocr_document_lock -- Acquire exclusive/shared lock
▸ ocr_document_lock_status -- Check lock status
▸ ocr_document_unlock -- Release document lock
▸ ocr_search_alert_check -- Check for new docs matching saved search
▸ ocr_search_alert_enable -- Enable/disable search alerts

✅

Workflow & Approvals

Multi-step chains. Assignment. Queue management.

▸ ocr_workflow_submit -- Submit document for review
▸ ocr_workflow_assign -- Assign reviewer
▸ ocr_workflow_review -- Review (approve/reject/changes requested)
▸ ocr_workflow_status -- Get workflow state and history
▸ ocr_workflow_queue -- List documents in workflow queue
▸ ocr_approval_chain_create -- Create reusable approval chain
▸ ocr_approval_chain_apply -- Apply approval chain to document
▸ ocr_approval_step_decide -- Decide on approval step

💾

Database Management

Multi-DB. Backup, restore, clone, merge, snapshot, share.

▸ ocr_db_create -- Create new SQLite database
▸ ocr_db_delete -- Permanently delete database
▸ ocr_db_list -- List all databases with pagination
▸ ocr_db_select -- Switch active database
▸ ocr_db_stats -- Get database statistics (size, counts, quality)
▸ ocr_db_archive -- Archive database (hide from default list)
▸ ocr_db_recent -- Show recently accessed databases
▸ ocr_db_rename -- Rename database
▸ ocr_db_search -- Find databases by name/description/tags
▸ ocr_db_summary -- AI-readable database profile
▸ ocr_db_tag -- Add/remove/set tags and metadata
▸ ocr_db_unarchive -- Restore archived database
▸ ocr_db_workspace -- Create/list/manage database workspaces
▸ ocr_db_backup -- Create atomic backup (VACUUM INTO)
▸ ocr_db_clone -- Clone database to new name
▸ ocr_db_import -- Import documents from JSON export
▸ ocr_db_merge -- Merge source database into current
▸ ocr_db_restore -- Restore database from backup
▸ ocr_db_snapshot -- Create/list/restore/delete snapshots
▸ ocr_export_stream -- Stream export as JSON-Lines
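
The backup tool above is described as using VACUUM INTO, a standard SQLite statement (SQLite 3.27 or later) that writes a consistent, compacted copy of a database to a new file. The sketch below shows that statement from Python's built-in sqlite3 module. The helper name, file names, and the toy documents table are hypothetical and say nothing about the product's real schema.

```python
import os
import sqlite3
import tempfile


def backup_database(source_path, backup_path):
    """Write a consistent copy of source_path to backup_path via VACUUM INTO."""
    # isolation_level=None puts the connection in autocommit mode;
    # VACUUM cannot run inside an open transaction.
    conn = sqlite3.connect(source_path, isolation_level=None)
    try:
        # The target file must not already exist, or SQLite raises an error.
        conn.execute("VACUUM INTO ?", (backup_path,))
    finally:
        conn.close()


# Build a throwaway database to back up (illustrative schema only).
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "source.db")
backup = os.path.join(workdir, "backup.db")

conn = sqlite3.connect(source)
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO documents (title) VALUES (?)", [("a.pdf",), ("b.pdf",)])
conn.commit()
conn.close()

backup_database(source, backup)

# Confirm the copy contains the same rows.
check = sqlite3.connect(backup)
backup_rows = check.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
check.close()
print(f"Backup holds {backup_rows} documents")
```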

🏷️

Tags & Organization

Create, apply, search tags across everything.

▸ ocr_tag_create -- Create reusable tag with color
▸ ocr_tag_list -- List tags with usage counts
▸ ocr_tag_apply -- Attach tag to any entity
▸ ocr_tag_remove -- Detach tag from entity
▸ ocr_tag_search -- Find entities by tag
▸ ocr_tag_delete -- Delete tag permanently

📊

Reports & Analytics

Quality, cost, performance, error, trend analysis.

▸ ocr_report_overview -- Quality and corpus overview
▸ ocr_report_performance -- Pipeline performance analytics
▸ ocr_document_report -- Detailed report for single document
▸ ocr_cost_summary -- Cost analytics by document/mode/month
▸ ocr_error_analytics -- Error and recovery analytics
▸ ocr_evaluation_report -- Comprehensive evaluation report
▸ ocr_trends -- Time-series trends (quality/volume)
▸ ocr_export_audit_log -- Export audit log as CSV/JSON

🧠

Intelligence

Interactive guide. Table extraction. Smart recommendations.

▸ ocr_guide -- System state overview with prioritized next steps
▸ ocr_document_extras -- Supplementary data (charts, links, tracked changes)
▸ ocr_document_tables -- Extract table data from document
▸ ocr_table_export -- Export table data as CSV/JSON/markdown
▸ ocr_document_recommend -- Related document recommendations

⚙️

System

Health, config, maintenance, license, dashboard, webhooks.

▸ ocr_health_check -- Diagnose data integrity issues
▸ ocr_db_maintenance -- Database maintenance (analyze/vacuum)
▸ ocr_config_get -- View system configuration
▸ ocr_config_set -- Change configuration setting
▸ ocr_license_status -- Check license status and balance
▸ ocr_dashboard_open -- Open dashboard in browser
▸ ocr_dashboard_status -- Check health of all 3 services
▸ ocr_webhook_create -- Register webhook for event notifications
▸ ocr_webhook_list -- List registered webhooks
▸ ocr_webhook_delete -- Remove webhook registration
▸ ocr_chunk_get -- Inspect specific chunk by ID
▸ ocr_chunk_list -- Browse all chunks in document
▸ ocr_chunk_context -- Expand result with surrounding text
▸ ocr_document_page -- Read specific page of document
▸ ocr_db_share -- Export database to shared folder
▸ ocr_db_import_shared -- Import from shared folder
▸ ocr_db_transfer -- Package database for transfer
▸ ocr_db_receive -- Import transfer bundle
▸ ocr_extraction_get -- Get extraction results by ID
▸ ocr_extraction_list -- List structured extractions
▸ ocr_export_annotations -- Export annotations as CSV/JSON
▸ ocr_export_obligations_csv -- Export obligations as CSV

"Other document tools give you OCR. We give you OCR + semantic search + vision AI + provenance + compliance + clustering + contract lifecycle + collaboration + workflow + analytics. All local. All in one install."

See the complete tool reference and API documentation →

How It Works

Model	Purpose	VRAM
Marker-pdf v1.10.2	Document OCR with layout preservation	8-10 GB
Chandra v0.1.8	Vision AI -- images, charts, diagrams	~18 GB
nomic-embed-text-v1.5	768-dim semantic embeddings	2-3 GB
HDBSCAN	Auto-clustering by similarity	CPU
ms-marco-MiniLM-L-12-v2	Cross-encoder reranking	~1 GB

DOCUMENT --> OCR_RESULT --> CHUNK --> EMBEDDING
                       --> IMAGE --> VLM_DESC --> EMBEDDING
     ^ SHA-256 hash at every node
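
The diagram above says each node carries a SHA-256 hash but does not spell out how the hashes are linked. Here is a minimal sketch of one way such a chain could work, assuming each node's hash covers its own content plus its parent's hash. The record layout, field names, and functions are hypothetical, and only the linear DOCUMENT to EMBEDDING branch is modeled.

```python
import hashlib
import json


def node_hash(kind, content, parent_hash):
    """Hash a node's type, its content digest, and its parent's hash together."""
    payload = json.dumps(
        {
            "kind": kind,
            "content_sha256": hashlib.sha256(content).hexdigest(),
            "parent": parent_hash,
        },
        sort_keys=True,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def build_chain(stages):
    """Link (kind, content) stages so each node commits to its predecessor."""
    chain, parent = [], None
    for kind, content in stages:
        digest = node_hash(kind, content, parent)
        chain.append({"kind": kind, "content": content, "parent": parent, "hash": digest})
        parent = digest
    return chain


def verify_chain(chain):
    """Recompute every hash; any edited content or broken link fails the check."""
    parent = None
    for node in chain:
        if node["parent"] != parent:
            return False
        if node_hash(node["kind"], node["content"], parent) != node["hash"]:
            return False
        parent = node["hash"]
    return True


chain = build_chain([
    ("DOCUMENT", b"%PDF-1.7 raw bytes"),
    ("OCR_RESULT", b"Extracted text of the contract."),
    ("CHUNK", b"Extracted text"),
    ("EMBEDDING", b"[0.12, -0.03, 0.88]"),
])
intact = verify_chain(chain)

# Simulate tampering with the OCR output after the fact.
chain[1]["content"] = b"Tampered text."
tampered_detected = not verify_chain(chain)
print(intact, tampered_detected)
```

Because every hash depends on the parent hash, changing any upstream node invalidates the nodes derived from it, which is what lets an extraction be traced back to its source.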

Built for the Industries That Can't Afford a Data Breach

Healthcare

HIPAA compliance built in. Patient records processed on your hardware. Automated compliance reports. 6+ year audit trails. Consumer AI tools are explicitly prohibited for unredacted PHI -- this isn't.

Legal

Attorney-client privilege protected by architecture, not policy. Contract analysis, obligation tracking, and playbook comparison -- without exposing a single clause to the cloud. 50% of large law firms still use on-premises document management. Now they can have AI too.

Financial Services

SOC 2 and SOX compliance exports. HMAC-signed billing with tamper detection. SHA-256 provenance chains with W3C PROV export. DORA requires proof of operational resilience -- this is it.
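
For readers unfamiliar with HMAC-based tamper detection, the following sketch shows the general idea with Python's standard hmac module. It is not the product's billing code: the record fields, the secret, and the function names are made up for illustration.

```python
import hashlib
import hmac
import json


def sign_record(record, secret):
    """Return an HMAC-SHA256 signature over a canonical JSON form of the record."""
    message = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_record(record, signature, secret):
    """Constant-time comparison of the expected and supplied signatures."""
    return hmac.compare_digest(sign_record(record, secret), signature)


secret = b"demo-secret-not-for-production"  # Hypothetical key.
record = {"files_processed": 12, "charged_cents": 36}  # Hypothetical fields.

signature = sign_record(record, secret)
valid = verify_record(record, signature, secret)

# Any change to the record after signing invalidates the signature.
record["charged_cents"] = 3
tamper_caught = not verify_record(record, signature, secret)
print(valid, tamper_caught)
```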

100% local processing -- all inference on YOUR hardware

Zero telemetry -- no analytics, no tracking, no phone-home

SHA-256 provenance chains -- every extraction linked to source

Container hardening -- cap-drop=ALL, non-root, no-new-privileges

Zod schema validation on all 153 tool inputs

Models run offline -- HF_HUB_OFFLINE=1 enforced
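
The hardening settings listed above correspond to standard Docker options. As a rough sketch of what they look like on a docker run command line, the snippet below assembles such a command as an argument list. The image name and the user ID are placeholders, and the installer's actual invocation is not shown on this page and may differ.

```python
import shlex


def hardened_run_command(image, uid=1000, gid=1000):
    """Assemble a docker run argument list using the hardening flags named above."""
    return [
        "docker", "run", "--rm",
        "--cap-drop=ALL",                        # Drop every Linux capability.
        "--security-opt", "no-new-privileges",   # Block privilege escalation.
        "--user", f"{uid}:{gid}",                # Run as a non-root user.
        "--env", "HF_HUB_OFFLINE=1",             # Keep model loading offline.
        image,
    ]


# "example/ocr-image:latest" is a placeholder, not the real image name.
command = hardened_run_command("example/ocr-image:latest")
print(shlex.join(command))
```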

"We don't need to guarantee your data stays private. There's no cloud to send it to. The processing happens on your GPU, in a container, on your machine. That's not a promise -- it's the architecture."

Everything You Need. Nothing You Don't.

First 100 customers who spend $100 get $10,000 credited to their account. That's 333,000+ files of processing power. Limited to the first 100.

Included	Value
153 MCP tools for complete document lifecycle	$50,000/yr
HIPAA, SOC 2, SOX compliance export suite	$10,000/yr
Contract management: obligations, playbooks, extraction	$15,000/yr
Hybrid semantic + keyword search with RAG assembly	$5,000/yr
SHA-256 cryptographic provenance chains	Priceless
1,147 Hormozi business strategy transcripts plus 3 books	$997
Approval workflows with multi-step chains	$5,000/yr
Zero telemetry, zero tracking, zero cloud	Peace of mind

Total Value: roughly $86,000/yr

Your Price: $0.03 per file.

No subscription. Credits never expire.

Markdown and text files: Free. Forever.

$0.03 / file

Pay for what you use

All 153 tools included free
.md and .txt files process for free
Buy credits via Stripe
No monthly fee, no subscription
Credits never expire

$ npx -y ocr-provenance-mcp install

Enterprise

For regulated organizations

Everything in Pay-Per-Use
Commercial license
Priority support
Volume pricing
Compliance documentation
Custom terms

Contact Sales

"Cloud OCR charges $0.01-0.07 per page -- and your data leaves every time. We charge $0.03 per file -- and your data never leaves. A 100-page contract costs three cents."
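
The arithmetic behind that comparison is easy to check. The sketch below uses only the prices quoted on this page ($0.03 per file, $0.01-0.07 per page for cloud OCR, and the $10,000 launch credit). The function names are illustrative.

```python
from decimal import Decimal

PER_FILE = Decimal("0.03")             # Flat price per file, from this page.
CLOUD_PER_PAGE_LOW = Decimal("0.01")   # Quoted cloud OCR range, per page.
CLOUD_PER_PAGE_HIGH = Decimal("0.07")


def per_file_cost(num_files):
    """Cost under flat per-file pricing, regardless of page count."""
    return PER_FILE * num_files


def per_page_cost_range(num_files, pages_per_file):
    """Low and high cost under per-page pricing."""
    pages = num_files * pages_per_file
    return CLOUD_PER_PAGE_LOW * pages, CLOUD_PER_PAGE_HIGH * pages


# One 100-page contract.
local = per_file_cost(1)
cloud_low, cloud_high = per_page_cost_range(1, 100)

# How many files the $10,000 launch credit covers at $0.03 each.
files_for_credit = int(Decimal("10000") / PER_FILE)

print(f"Per-file pricing: ${local}; per-page pricing: ${cloud_low} to ${cloud_high}")
print(f"$10,000 of credit covers {files_for_credit:,} files")
```

Decimal is used instead of float so that cent-level amounts stay exact.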

What You Need

Supported Formats

PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, EPUB, PNG, JPG, JPEG, TIFF, TIF, BMP, GIF, WEBP, TXT (free), MD (free), CSV, HTML

20 file types supported. .md and .txt files process free -- no OCR models needed, just embedding. Works great on CPU.

System Requirements

Component	Minimum	Recommended
Docker	Engine 20+	Desktop (latest)
Node.js	20+	22+ LTS
RAM	8 GB	16+ GB
Disk	30 GB	50+ GB
GPU	Optional (CPU works for .md/.txt)	NVIDIA RTX 3060+ (16+ GB VRAM)
OS	Windows with WSL2	Windows with WSL2 + NVIDIA GPU

Full GPU processing (OCR + VLM + Embeddings): Windows with NVIDIA GPU. Minimum 16 GB VRAM for VLM (Chandra). Recommended: 24 GB (RTX 3090/4090).

CPU-only mode: Works on Windows for .md/.txt embedding and all search/management tools. No GPU required.

macOS: Bare metal release coming soon. The Docker container does not currently support Mac GPU passthrough.

Linux: Supported with NVIDIA GPU via Docker.

Works With Your AI Client. Automatically.

Claude Code

claude mcp add ocr-provenance-mcp -s user -- npx -y ocr-provenance-mcp

Claude Desktop

Add to claude_desktop_config.json

Cursor

Add to ~/.cursor/mcp.json

Windsurf

Standard MCP configuration

"The installer auto-detects and registers with every supported client. You probably don't need to do any of this."
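
If you do need to register the server by hand, clients that read a JSON config file (such as claude_desktop_config.json or ~/.cursor/mcp.json) generally expect an entry under an "mcpServers" key. The sketch below builds that entry from the same npx command shown for Claude Code. Treat the exact layout as an assumption and check your client's documentation; the helper name and the pre-existing "other-server" entry are hypothetical.

```python
import json


def add_mcp_server(config, name, command, args):
    """Return a copy of config with one more server under "mcpServers"."""
    updated = dict(config)
    servers = dict(updated.get("mcpServers", {}))
    servers[name] = {"command": command, "args": list(args)}
    updated["mcpServers"] = servers
    return updated


# A config that already lists another (hypothetical) server.
existing = {"mcpServers": {"other-server": {"command": "node", "args": ["server.js"]}}}

config = add_mcp_server(
    existing,
    name="ocr-provenance-mcp",
    command="npx",
    args=["-y", "ocr-provenance-mcp"],
)
print(json.dumps(config, indent=2))
```

Merging into the existing dictionary rather than overwriting the file keeps any servers you have already configured.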

Full setup guide, configuration options, and troubleshooting →

Your Next Compliance Audit Is Coming. Be Ready.

Install in 60 seconds. Process your first document in 5 minutes. Generate your first compliance report in 10.

$ npx -y ocr-provenance-mcp install

Installed and ready in under 60 seconds
