OCR and Hybrid Retrieval Pipeline

I developed OCR and indexing workflows for scanned enterprise documents stored in OracleDB, then connected the extracted text to hybrid retrieval systems for agentic search.

Highlights

  • Processed 250k+ OracleDB document records, each with multiple scanned pages.
  • Used DeepSeek-OCR and other OCR/VLM models for text extraction.
  • Indexed extracted content into Milvus for retrieval and agent workflows.
  • Combined dense vector search with BM25 keyword scoring to improve both recall and precision.
  • Added multi-concept AND ranking, adaptive thresholds, Persian exact-match boosts, duplicate-chunk cleanup, indexed-letter discovery, and nightly incremental indexing jobs.
  • Built lookback windows so indexing runs catch Oracle rows that arrive late, plus operational maintenance workflows for both OCR and no-OCR indexing.
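
As an illustration of the dense + BM25 combination listed above, here is a minimal sketch that merges two ranked result lists. It assumes reciprocal rank fusion (RRF) as the merge rule, which is one common choice and not necessarily the method used in this project. The function name, the constant k=60, and the document IDs are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Assumption: RRF is used as the fusion rule. Each document receives
    1 / (k + rank) from every list it appears in, and the sums are
    sorted in descending order. k=60 is a conventional default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (highest first); break ties by ID for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists: one from a vector search, one from BM25.
dense_hits = ["doc_7", "doc_2", "doc_9"]
bm25_hits = ["doc_2", "doc_4", "doc_7"]

fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)
```

Documents found by both retrievers (doc_2 and doc_7 here) rise above those found by only one, which is the usual motivation for pairing keyword and vector search. Rank-based fusion also avoids having to normalize BM25 scores against embedding similarities, which live on different scales.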

Technologies

Python, OracleDB, Milvus, BM25, embeddings, DeepSeek-OCR, OCR/VLM models, Persian text retrieval, indexing jobs, and local LLM agents.