Machine Learning
Historical Documents AI
Multilingual OCR and retrieval-augmented generation (RAG) over historical document archives. Placed 3rd at a national AI hackathon.
- Python · Jupyter Notebook
Overview
A document-intelligence platform that processes degraded scans from the 1960s to the 1990s in Azerbaijani, Russian, and English. Llama-4-Maverick, used as a vision model, performs OCR (87.75% character accuracy); BAAI bge-large embeddings of the extracted text are indexed in Pinecone for semantic search; and Llama-4-Maverick, used as an LLM, answers questions with citations. The system is packaged as a FastAPI service in Docker and exposed through an ngrok tunnel, and a benchmarking framework drove every model choice. It placed 3rd in the hackathon's AI track.
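The character-accuracy figure and the benchmarking step can be illustrated with a small sketch. The source does not say how the 87.75% was computed, so this assumes the common definition of character accuracy as 1 minus the character error rate (Levenshtein edit distance divided by reference length). The function names and the `rank_models` helper are hypothetical and not taken from the project.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed metric: 1 - CER, clamped to the range [0, 1]."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    cer = levenshtein(reference, hypothesis) / len(reference)
    return max(0.0, 1.0 - cer)


def rank_models(reference: str, outputs: dict[str, str]) -> list[tuple[str, float]]:
    """Hypothetical harness step: score each model's OCR output, best first."""
    scores = {name: character_accuracy(reference, text)
              for name, text in outputs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Calling `rank_models` with a ground-truth transcription and a mapping of model name to OCR output returns the candidates ordered by score, which is one plausible way a harness could compare models on the same page.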
Key highlights
- Vision-language OCR on degraded historical scans (87.75% character accuracy)
- Trilingual support: Azerbaijani, Russian, English
- Hybrid pipeline: Llama-4-Maverick vision OCR → bge-large embeddings → Pinecone
- FastAPI service containerized with Docker; ngrok-exposed for live demos
- Benchmarking harness that compared candidate models before final selection
- 3rd place at a national AI hackathon
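The retrieval half of the pipeline (embed text chunks, index them, query, then prompt the LLM with citable sources) can be sketched without external services. This is a toy stand-in under stated assumptions, not the project's code: `toy_embed` replaces the bge-large model with bag-of-words counts, `InMemoryIndex` replaces Pinecone, and the `Chunk` fields and prompt wording are hypothetical.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str
    text: str
    source: str  # hypothetical metadata: originating file name
    page: int    # hypothetical metadata: page number in that file


def toy_embed(text: str) -> Counter:
    # Stand-in for a real embedding model: lowercase term counts.
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryIndex:
    """Stand-in for a vector database: stores (chunk, vector) pairs in a list."""

    def __init__(self) -> None:
        self._rows: list[tuple[Chunk, Counter]] = []

    def upsert(self, chunk: Chunk) -> None:
        # Replace any existing row with the same id, then append the new one.
        self._rows = [r for r in self._rows if r[0].chunk_id != chunk.chunk_id]
        self._rows.append((chunk, toy_embed(chunk.text)))

    def query(self, question: str, top_k: int = 2) -> list[Chunk]:
        q = toy_embed(question)
        scored = [(cosine(q, vec), chunk) for chunk, vec in self._rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        # Drop chunks that share no terms with the question.
        return [chunk for score, chunk in scored[:top_k] if score > 0]


def build_prompt(question: str, chunks: list[Chunk]) -> str:
    # Number each retrieved chunk so the LLM can cite it as [n].
    context = "\n".join(
        f"[{i}] ({c.source}, p. {c.page}) {c.text}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the numbered sources and cite them as [n].\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

Numbering the retrieved chunks in the prompt is one simple way to let the answering model attach citations that map back to a file and page; a real deployment would swap in the actual embedding model and vector store.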
Tech stack
- Python 3.11
- FastAPI
- Pinecone
- Llama-4-Maverick
- BAAI bge-large-en-v1.5
- PyMuPDF
- Sentence-Transformers
- Docker
- Azure OpenAI
Topics
- #rag
- #ocr
- #vision-language-model
- #fastapi
- #pinecone
- #multilingual
- #document-ai
- #hackathon