
Case Study

OCR Document Search Tool

Python/Flask/Tesseract full-text search pipeline

Python, Flask, Tesseract OCR, Elasticsearch

Role

Sole Engineer

Outcome

Eliminated 100% of manual file uploads

Proprietary / internal project

Overview

This internal tool automated the ingestion of scanned research documents: it ran each file through an OCR pipeline and indexed the extracted text for instant keyword search, removing an entirely manual step from the research workflow.
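The project's code is proprietary, so the following is only a minimal sketch of how an ingestion pass of this kind might look, not the actual implementation. The function names (`ingest_new_documents`, `file_digest`), the file suffix list, and the idea of tracking processed files by content hash are all assumptions made for illustration. The OCR and indexing steps are passed in as plain callables so the sketch stays independent of Tesseract and Elasticsearch.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict, Iterable


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def ingest_new_documents(
    root: Path,
    seen: Dict[str, str],
    ocr: Callable[[Path], str],
    index: Callable[[str, str, str], None],
    suffixes: Iterable[str] = (".pdf", ".png", ".tif", ".tiff", ".jpg"),
) -> int:
    """Walk `root`, OCR every file not seen before, and pass the text to `index`.

    `seen` maps content digest -> path and is updated in place, so calling
    this repeatedly (for example from a scheduler) only processes new files.
    Returns the number of documents ingested in this pass.
    """
    allowed = {s.lower() for s in suffixes}
    count = 0
    for path in sorted(root.rglob("*")):
        # Skip directories and file types the pipeline does not handle.
        if not path.is_file() or path.suffix.lower() not in allowed:
            continue
        digest = file_digest(path)
        if digest in seen:
            continue  # same content already processed, possibly under another name
        text = ocr(path)                 # hypothetical OCR step (e.g. a Tesseract wrapper)
        index(digest, str(path), text)   # hypothetical indexing step (e.g. a search client)
        seen[digest] = str(path)
        count += 1
    return count
```

Keying on a content hash rather than a file name is one way to make repeated passes safe: renamed or duplicated scans are skipped instead of being indexed twice.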

Problem

Thousands of scanned research documents were sitting in shared drives with no searchable text layer, forcing analysts to manually open and skim files to find what they needed.

Features

  • Automated ingestion pipeline with zero manual upload steps
  • Tesseract-based OCR with post-processing for accuracy
  • Full-text keyword search across the entire document archive
  • Simple internal UI for search and retrieval
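To illustrate the idea behind full-text keyword search, here is a small in-memory inverted index. It is a deliberately simplified stand-in, not the Elasticsearch setup the tool used, and the class and function names (`KeywordIndex`, `tokenize`) are hypothetical. It assumes simple lowercase alphanumeric tokenization and AND semantics across query terms.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


class KeywordIndex:
    """Toy inverted index: maps each token to the set of documents containing it."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        """Index the OCR text of one document under `doc_id`."""
        for token in tokenize(text):
            self._postings[token].add(doc_id)

    def search(self, query: str) -> List[str]:
        """Return sorted ids of documents that contain every query term."""
        terms = tokenize(query)
        if not terms:
            return []
        # Copy the first posting set so intersections do not mutate the index.
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return sorted(result)


index = KeywordIndex()
index.add("doc1", "Quarterly revenue report, 2019")
index.add("doc2", "Revenue forecast and market analysis")
print(index.search("revenue forecast"))  # ['doc2']
```

A production search engine adds relevance scoring, stemming, and persistence on top of this structure; the sketch only shows why a keyword lookup can avoid rescanning every document.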

Key Learnings

  • OCR accuracy on real-world scanned documents needs meaningful pre- and post-processing to be usable in search.
  • Automating a manual step that seems 'small' can unlock outsized time savings across a team.
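As a rough illustration of the text-side post-processing mentioned above, the sketch below cleans raw OCR output before indexing. The specific steps (Unicode normalization, rejoining words hyphenated across line breaks, collapsing whitespace) are assumptions about what such a stage might include, and `clean_ocr_text` is a hypothetical name. Image-side pre-processing such as deskewing or binarization is not shown.

```python
import re
import unicodedata


def clean_ocr_text(raw: str) -> str:
    """Normalize raw OCR output so that keyword search matches whole words."""
    # NFKC folds typographic ligatures (for example the single "fi" glyph)
    # into plain letters, so "final" is searchable as typed.
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words split by a hyphen at a line break ("docu-\nment").
    # Caveat: this also merges genuine compounds that happen to break
    # at the hyphen, which a real pipeline might check against a dictionary.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse runs of spaces, tabs, and newlines into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(clean_ocr_text("The \ufb01nal docu-\nment   was\nscanned."))
# The final document was scanned.
```

Without a step like this, a search for "document" would miss any page where the scanner's line break left "docu-" and "ment" as separate tokens.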