Scalable Hybrid Retrieval for Natural Language Email Queries

Abstract
Most email retrieval systems still rely on basic keyword matching and often fail when a query shares few terms with the target messages or is phrased in natural language. These systems also overlook entities such as senders or dates unless they are formatted rigidly. We introduce a scalable hybrid retrieval framework that combines sparse keyword search with dense semantic retrieval. Our approach expands each query into diverse variants, embeds them using a pretrained model, and fuses the resulting rankings to improve robustness. On a public email dataset, our method produces consistent, semantically relevant results across diverse queries and outperforms either component alone.
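To make the rank-fusion step concrete, the following is a minimal sketch of one common way to merge a sparse ranking with dense rankings from several query variants. The abstract does not name the fusion rule, so reciprocal rank fusion (RRF) is an assumption here, as are the function name `reciprocal_rank_fusion`, the constant `k=60`, and the message IDs, which are purely illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list, best first.

    Assumption: RRF stands in for the unspecified fusion rule.
    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings: one from keyword search, two from dense retrieval
# over different expanded variants of the same query.
sparse_hits = ["msg_17", "msg_03", "msg_42"]
dense_hits_variant_a = ["msg_03", "msg_17", "msg_08"]
dense_hits_variant_b = ["msg_03", "msg_08", "msg_17"]

fused = reciprocal_rank_fusion(
    [sparse_hits, dense_hits_variant_a, dense_hits_variant_b]
)
print(fused)
```

Because RRF uses only rank positions, it does not require the sparse and dense scores to be on a comparable scale, which is one reason it is a frequent choice for hybrid retrieval.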
Type
This paper develops a robust hybrid retrieval system for natural language email queries. Final project for CPSC 477 at Yale.