PDF Extractor
Transform unsearchable, image-based PDFs into structured, searchable text. Concurrent processing, AI-powered named entity recognition, and comprehensive analytics — built for researchers, journalists, and OSINT professionals.
View on GitHub
Concurrent OCR Processing
Uses multiple CPU cores to process pages in parallel. Pages are rendered at 300 DPI to improve OCR accuracy. Interrupted jobs can be resumed, so you never lose progress on large documents.
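A minimal sketch of how concurrent, resumable page OCR could be arranged, assuming PyMuPDF (`fitz`), `pytesseract`, and Pillow are installed. The function names, the `page_NNNN.txt` naming, and the "skip pages whose text file already exists" resume rule are illustrative assumptions, not the project's actual code.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def page_text_path(out_dir, page_number):
    # Hypothetical naming scheme: one text file per page (1-based in the name).
    return Path(out_dir) / f"page_{page_number + 1:04d}.txt"


def pending_pages(out_dir, page_count):
    # Assumed resume rule: a page counts as done once its text file exists.
    return [n for n in range(page_count)
            if not page_text_path(out_dir, n).exists()]


def ocr_page(job):
    # Third-party imports live here so each worker process loads them itself.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    pdf_path, out_dir, page_number = job
    with fitz.open(pdf_path) as doc:
        pix = doc[page_number].get_pixmap(dpi=300)  # render the page at 300 DPI
        image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    text = pytesseract.image_to_string(image)

    # Write to a temp file, then rename, so a crash never leaves a
    # half-written page that would be mistaken for a finished one.
    target = page_text_path(out_dir, page_number)
    tmp = target.with_suffix(".tmp")
    tmp.write_text(text, encoding="utf-8")
    tmp.replace(target)
    return page_number


def run(pdf_path, out_dir, page_count, workers=None):
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    jobs = [(str(pdf_path), str(out_dir), n)
            for n in pending_pages(out_dir, page_count)]
    # workers=None lets the pool size itself from the machine's CPU count.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sorted(pool.map(ocr_page, jobs))
```

Calling `run(...)` a second time after an interruption only processes the pages that are still missing. On platforms that spawn worker processes, the call belongs under an `if __name__ == "__main__":` guard.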
AI-Powered Analytics
Named Entity Recognition powered by spaCy identifies person names across documents. Word count analysis includes per-page breakdowns and CSV reports that track each word's occurrences and page locations.
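The analytics described above can be sketched roughly as follows. This is an illustration under assumptions: the function names, the CSV column layout, the simple word regex, and the `en_core_web_sm` model choice are all hypothetical, and the spaCy model has to be downloaded separately before `person_names` will run.

```python
import csv
import re
from collections import Counter, defaultdict

# Simplistic tokenizer (assumption): lowercase ASCII words, optional apostrophe.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def word_report(pages):
    """pages: list of page texts, first item is page 1.
    Returns (word, total_count, pages_it_appears_on) rows, most frequent first."""
    totals = Counter()
    locations = defaultdict(set)
    for page_number, text in enumerate(pages, start=1):
        for word in WORD_RE.findall(text.lower()):
            totals[word] += 1
            locations[word].add(page_number)
    return [(word, count, sorted(locations[word]))
            for word, count in totals.most_common()]


def write_word_csv(rows, handle):
    # Hypothetical column layout: page numbers are space-separated in one cell.
    writer = csv.writer(handle)
    writer.writerow(["word", "count", "pages"])
    for word, count, pages in rows:
        writer.writerow([word, count, " ".join(map(str, pages))])


def person_names(pages, model="en_core_web_sm"):
    """Map each detected PERSON entity to the pages it appears on."""
    import spacy  # imported lazily; only needed for the NER step

    nlp = spacy.load(model)
    found = defaultdict(set)
    for page_number, doc in enumerate(nlp.pipe(pages), start=1):
        for ent in doc.ents:
            if ent.label_ == "PERSON":
                found[ent.text].add(page_number)
    return {name: sorted(numbers) for name, numbers in found.items()}
```

For example, `word_report(["Alice met Bob.", "Bob left."])` puts `("bob", 2, [1, 2])` first, since "bob" is the only word that occurs twice.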
Batch Processing
Process entire directories of PDFs automatically. Consolidated output, detailed error logging, and multi-language support via Tesseract. Feed it a folder, walk away, come back to structured data.
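A rough sketch of the batch loop, assuming a per-file function is supplied by the caller. The names `find_pdfs`, `process_folder`, `process_one`, and the `errors.log` format are hypothetical; the point is only that one failing PDF is logged and skipped rather than stopping the run.

```python
from pathlib import Path


def find_pdfs(folder):
    # Match .pdf case-insensitively so files like SCAN.PDF are not missed.
    return sorted(p for p in Path(folder).iterdir()
                  if p.is_file() and p.suffix.lower() == ".pdf")


def process_folder(folder, process_one, error_log="errors.log"):
    """Run process_one(pdf_path) on every PDF in folder.
    Returns (results, failures), both keyed by file name."""
    results, failures = {}, {}
    for pdf in find_pdfs(folder):
        try:
            results[pdf.name] = process_one(pdf)
        except Exception as exc:  # keep going; record what went wrong
            failures[pdf.name] = f"{type(exc).__name__}: {exc}"
    if failures:
        # Append so repeated runs accumulate a history of problems.
        with open(error_log, "a", encoding="utf-8") as log:
            for name, reason in failures.items():
                log.write(f"{name}\t{reason}\n")
    return results, failures
```

For multi-language documents, `pytesseract.image_to_string` accepts a `lang` argument such as `"eng+deu"`, provided the matching Tesseract language data is installed.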
Tech Stack
Built with Python 3.8+, Tesseract OCR, PyMuPDF, Pillow, and spaCy. Clean architecture with automated installation scripts. Open source under a permissive license — use it, modify it, build on it.
Output & Reporting
Each processed PDF generates a dedicated folder containing combined text output, word count reports in CSV format, detected person names with page locations, high-resolution page renderings, and individual page extractions.
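To make the per-PDF output folder concrete, here is a small sketch of one possible layout. Every file and folder name below is a placeholder chosen for illustration; the real tool's names may differ.

```python
from pathlib import Path


def output_layout(pdf_path, out_root="output"):
    """Return the (hypothetical) paths produced for one PDF."""
    base = Path(out_root) / Path(pdf_path).stem  # one folder per PDF
    return {
        "folder": base,
        "combined_text": base / "combined.txt",      # all pages in one file
        "word_counts": base / "word_counts.csv",     # word count report
        "person_names": base / "person_names.csv",   # names with page numbers
        "page_images": base / "images",              # high-resolution renderings
        "page_texts": base / "pages",                # individual page extractions
    }
```

With these placeholder names, `output_layout("scans/report.pdf")` places everything under `output/report/`.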
Get Started
View on GitHub →