Machine Learning

Historical Documents AI

Multilingual OCR and retrieval-augmented generation (RAG) over historical document archives. Placed 3rd at a national AI hackathon.

Last updated 2026-04-16
Python · Jupyter Notebook

Overview

A document-intelligence platform that processes degraded scans from the 1960s through the 1990s in Azerbaijani, Russian, and English. The pipeline has three stages: a Llama-4-Maverick vision model performs OCR (87.75% character accuracy), BAAI bge-large embeddings are indexed in Pinecone for semantic search, and a Llama-4-Maverick LLM answers questions with citations. The platform is packaged as a Dockerized FastAPI service, exposed through ngrok for live demos, and includes a benchmarking framework that drove every model choice. Placed 3rd in the hackathon's AI track.
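The retrieval and citation steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's code: toy_embed is a hypothetical character-trigram stand-in for the bge-large embedding model, the in-memory list stands in for the Pinecone index, and the Chunk fields, document IDs, and sample texts are made up for the example.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    """One OCR'd passage plus the metadata needed to cite it (hypothetical schema)."""
    doc_id: str
    page: int
    text: str


def toy_embed(text: str) -> Counter:
    # Assumption: stand-in for a real embedding model; counts character trigrams.
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[Chunk], top_k: int = 2) -> list[Chunk]:
    # Rank every chunk by similarity to the query and keep the best top_k.
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c.text)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, hits: list[Chunk]) -> str:
    # Number each source so the LLM can cite it as [n] in its answer.
    sources = "\n".join(
        f"[{i}] ({c.doc_id}, p. {c.page}) {c.text}" for i, c in enumerate(hits, start=1)
    )
    return (
        "Answer using only the sources below and cite them as [n].\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


# Made-up sample passages, for illustration only.
chunks = [
    Chunk("doc_a", 2, "The oil refinery in Baku was expanded in 1974."),
    Chunk("doc_b", 1, "Cotton harvest figures for the spring season."),
]
question = "When was the Baku oil refinery expanded?"
hits = retrieve(question, chunks, top_k=1)
prompt = build_prompt(question, hits)
```

Carrying doc_id and page alongside each chunk is what makes citations possible: the answer can point back to a specific scan page rather than to the archive as a whole.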

Key highlights

Vision-language OCR on degraded historical scans (87.75% character accuracy)
Trilingual support: Azerbaijani, Russian, English
Hybrid pipeline: Llama-4-Maverick vision OCR → bge-large embeddings → Pinecone
FastAPI service containerized with Docker; ngrok-exposed for live demos
Benchmarking harness that compared candidate models before final selection
3rd place at a national AI hackathon
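As a rough illustration of how a character-accuracy figure and a model comparison like the ones above can be produced, the sketch below scores OCR output against a ground-truth transcription. It assumes character accuracy means 1 minus the character error rate (Levenshtein edit distance divided by reference length), which is a common definition; the source does not state which formula the project used. The names character_accuracy and rank_candidates, and the labels model_a and model_b, are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    # Assumed definition: 1 - CER, clamped at 0 for very noisy output.
    if not reference:
        return 1.0 if not hypothesis else 0.0
    cer = levenshtein(reference, hypothesis) / len(reference)
    return max(0.0, 1.0 - cer)


def rank_candidates(reference: str, outputs: dict[str, str]) -> list[tuple[str, float]]:
    # Score each candidate model's OCR output and sort best-first.
    scores = [(name, character_accuracy(reference, text)) for name, text in outputs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Placeholder strings, not real benchmark data.
ranking = rank_candidates("abcd", {"model_a": "abxd", "model_b": "abcd"})
```

In practice the reference would be a hand-checked transcription of a scan, and the score would be averaged over a held-out set of pages per language before choosing a model.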

Tech stack

Python 3.11
FastAPI
Pinecone
Llama-4-Maverick
BAAI bge-large-en-v1.5
PyMuPDF
Sentence-Transformers
Docker
Azure OpenAI

Topics

#rag
#ocr
#vision-language-model
#fastapi
#pinecone
#multilingual
#document-ai
#hackathon