Machine Learning
Real Estate RAG
Production-grade retrieval-augmented generation over Azerbaijani real-estate listings with cited answers.
- JavaScript
Overview
An end-to-end Retrieval-Augmented Generation system that scrapes pasharealestate.az via Firecrawl, chunks pages with structure-aware overlap, and stores embeddings in PostgreSQL with pgvector. A single SQL CTE runs hybrid retrieval, fusing HNSW vector kNN with full-text search via Reciprocal Rank Fusion. The server then streams Anthropic Claude answers over Server-Sent Events with strict per-claim [Sn] citations. The project ships with an offline evaluation harness that tracks recall@k and citation correctness across query sets.
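To make the fusion step concrete, here is a minimal in-memory sketch of Reciprocal Rank Fusion in JavaScript. It is an illustration of the scoring rule only, not the project's SQL CTE: the function name `reciprocalRankFusion`, the chunk ids, and the constant `k = 60` (a commonly used default) are all assumptions made for this example.

```javascript
// Illustrative sketch: Reciprocal Rank Fusion over ranked lists of ids.
// Each list is ordered best-first. An id's fused score is the sum of
// 1 / (k + rank) over every list it appears in, with 1-based ranks.
// k = 60 is an assumed default, not a value taken from the project.
function reciprocalRankFusion(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Hypothetical chunk ids from the two retrievers.
const vectorHits = ["a", "b", "c"]; // e.g. vector kNN order
const textHits = ["c", "a", "d"]; // e.g. full-text search order
const fused = reciprocalRankFusion([vectorHits, textHits]);
console.log(fused.map((row) => row.id));
```

Because the rule uses only ranks, it needs no score normalization between cosine distance and full-text rank, which is one reason it suits a single-query fusion of two differently scaled retrievers.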
Key highlights
- Hybrid retrieval: HNSW vector similarity + PostgreSQL full-text search fused with Reciprocal Rank Fusion in a single SQL CTE
- Streamed Anthropic Claude responses over SSE with strict per-claim citation enforcement
- Structure-aware chunking pipeline that preserves listing context across embeddings
- Offline eval harness measuring recall@k and citation correctness on held-out queries
- Express server with Zod schema validation and Pino structured logging
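The two evaluation metrics named above can be sketched as small pure functions. This is a hedged illustration, not the project's harness: the function names `recallAtK` and `citationRate` are hypothetical, and the sketch assumes that [Sn] markers are 1-based indexes into the retrieved sources and that each sentence counts as one claim.

```javascript
// Illustrative sketch: fraction of relevant ids found in the top-k results.
function recallAtK(retrievedIds, relevantIds, k) {
  if (relevantIds.length === 0) return 0;
  const topK = new Set(retrievedIds.slice(0, k));
  const hits = relevantIds.filter((id) => topK.has(id)).length;
  return hits / relevantIds.length;
}

// Illustrative sketch: fraction of sentences that carry at least one
// [Sn] marker pointing at a source that was actually provided.
// Assumption: sources are numbered 1..sourceCount, one claim per sentence.
function citationRate(answer, sourceCount) {
  // Naive sentence split on ., ! or ? followed by whitespace.
  const sentences = answer
    .split(/(?<=[.!?])\s+/)
    .filter((s) => s.trim().length > 0);
  if (sentences.length === 0) return 0;
  const cited = sentences.filter((s) =>
    [...s.matchAll(/\[S(\d+)\]/g)].some((m) => {
      const n = Number(m[1]);
      return n >= 1 && n <= sourceCount;
    })
  );
  return cited.length / sentences.length;
}

// Hypothetical example data.
const recall = recallAtK(["a", "b", "c", "d"], ["b", "d", "z"], 3);
const rate = citationRate(
  "The flat has three rooms [S1]. It is near the park [S5]. Parking is included.",
  2
);
console.log({ recall, rate });
```

In this example only one of three relevant ids is in the top 3, and only the first sentence cites a source that exists, so both values come out to one third. Checking that a cited source actually supports the claim would need more than marker matching and is out of scope for this sketch.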
Tech stack
- Node.js
- Express
- PostgreSQL
- pgvector
- Anthropic Claude
- Firecrawl
- Transformers.js via @xenova/transformers (MiniLM)
- SSE
- Zod
- Pino
Topics
- #rag
- #retrieval-augmented-generation
- #hybrid-search
- #pgvector
- #anthropic-claude
- #real-estate
- #semantic-search