HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieval Results in RAG Systems (WWW 2025)
HtmlRAG proposes using HTML, rather than plain text, as the format for retrieved external knowledge in RAG systems. To handle the length and noise of HTML, it introduces two components: Lossless HTML Cleaning and Two-Step Block-Tree-Based HTML Pruning. Together they aim to retain the semantic information in the retrieved pages while addressing the long-context problem in RAG systems.
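To make the cleaning idea concrete, here is a minimal sketch using only the Python standard library. It is not HtmlRAG's implementation. It assumes that cleaning means dropping content a reader never sees (scripts, styles, comments, tag attributes) while keeping the tag structure and the text. The names HtmlCleaner and clean_html, and the tag sets, are hypothetical and chosen for illustration only.

```python
import html
import re
from html.parser import HTMLParser

# Assumption: content inside these tags carries no meaning for a reader.
SKIP_TAGS = {"script", "style"}
# Assumption: void tags are dropped entirely in this sketch.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class HtmlCleaner(HTMLParser):
    """Re-emits tags without attributes and skips scripts, styles, comments."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag not in VOID_TAGS:
            self.parts.append(f"<{tag}>")  # attributes are discarded

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag not in VOID_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            text = re.sub(r"\s+", " ", data)  # collapse whitespace runs
            if text.strip():
                self.parts.append(html.escape(text, quote=False))

    # Comments are dropped: the inherited handle_comment does nothing.


def clean_html(raw_html: str) -> str:
    cleaner = HtmlCleaner()
    cleaner.feed(raw_html)
    cleaner.close()
    return "".join(cleaner.parts)


if __name__ == "__main__":
    page = (
        "<html><head><style>p{color:red}</style></head><body>"
        '<!-- nav --><div class="a"><p>Hello <b>world</b></p></div>'
        "</body></html>"
    )
    print(clean_html(page))
```

This sketch leaves empty tags such as an emptied head element in place and does nothing about pruning. Selecting which blocks of the cleaned tree to keep is a separate step and is not shown here.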
In short, HtmlRAG keeps retrieved documents in HTML and relies on cleaning and pruning to make that format practical within a limited context window. Users can apply HtmlRAG in their own systems by following the installation guide provided above.