HtmlRAG

HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieval Results in RAG Systems (WWW 2025)

Overview:

HtmlRAG proposes using HTML, rather than plain text, as the format for external knowledge in retrieval-augmented generation (RAG) systems. Because raw HTML is far longer and noisier than plain text, the system introduces Lossless HTML Cleaning and Two-Step Block-Tree-Based HTML Pruning to shorten it. The aim is to retain the semantic information carried by the HTML while keeping the input within the long-context limits of RAG systems.
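
The cleaning step described above can be illustrated with a minimal sketch. It is not HtmlRAG's implementation: the function name clean_html_sketch, the tag lists, and the specific rules (drop comments, script/style content, and attributes; remove elements left empty) are assumptions chosen to show what "removing irrelevant content while keeping the text and tag structure" can look like, using only the Python standard library.

```python
import re
from html import escape
from html.parser import HTMLParser

# Assumed rule set for this sketch, not taken from HtmlRAG itself.
DROP_TAGS = {"script", "style"}  # content of these tags is discarded
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # carry no text; dropped


class CleaningParser(HTMLParser):
    """Rebuilds HTML without comments, attributes, or script/style content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a dropped tag

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag not in VOID_TAGS:
            self.parts.append(f"<{tag}>")  # attributes are discarded

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing elements hold no text, so they are dropped

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag not in VOID_TAGS:
            if self.parts and self.parts[-1] == f"<{tag}>":
                self.parts.pop()  # element turned out to be empty
            else:
                self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Collapse whitespace runs; whitespace-only text is dropped.
        text = re.sub(r"\s+", " ", data)
        if text.strip() and self.skip_depth == 0:
            self.parts.append(escape(text, quote=False))

    # Comments are dropped because HTMLParser.handle_comment is a no-op
    # unless overridden.


def clean_html_sketch(raw_html: str) -> str:
    """Return a shorter HTML string that keeps the text and tag structure."""
    parser = CleaningParser()
    parser.feed(raw_html)
    parser.close()
    return "".join(parser.parts)
```

Calling clean_html_sketch on a page fragment returns the same visible text wrapped in the same nesting of tags, minus the parts a language model is unlikely to need. Dropping whitespace-only text between tags is a simplification that a production cleaner would probably handle more carefully.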

Features:

  • HtmlRAG Proposal: Uses HTML, rather than plain text, to represent retrieved external knowledge in RAG systems.
  • Lossless HTML Cleaning: Removes irrelevant content and compresses redundant structures while retaining semantic information.
  • Two-Step Block-Tree-Based HTML Pruning: Prunes the HTML in two steps over a block-tree representation, first with an embedding model and then with a generative model.
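
The pruning idea in the last item can be sketched in a few lines. This is a simplified illustration under stated assumptions, not HtmlRAG's algorithm: the block tree is flattened into a list of (tag path, text) pairs, the budget is counted in words, and bow_cosine is a bag-of-words stand-in for a real embedding model. The names Block, bow_cosine, and prune_blocks are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple

Block = Tuple[str, str]  # (tag path in the block tree, text of the block)


def bow_cosine(query: str, text: str) -> float:
    """Bag-of-words cosine similarity; a stand-in for an embedding model."""
    q = Counter(query.lower().split())
    t = Counter(text.lower().split())
    dot = sum(q[w] * t[w] for w in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in t.values())
    )
    return dot / norm if norm else 0.0


def prune_blocks(
    blocks: List[Block],
    query: str,
    score_fn: Callable[[str, str], float],
    max_words: int,
) -> List[Block]:
    """Drop the lowest-scoring blocks until the kept text fits max_words."""
    # Sort block indices by relevance, least relevant first.
    order = sorted(range(len(blocks)), key=lambda i: score_fn(query, blocks[i][1]))
    kept = set(range(len(blocks)))
    total = sum(len(text.split()) for _, text in blocks)
    for i in order:
        if total <= max_words:
            break
        kept.discard(i)
        total -= len(blocks[i][1].split())
    return [blocks[i] for i in sorted(kept)]  # preserve document order


blocks = [
    ("html/body/nav", "home about contact login"),
    ("html/body/div/p[1]", "HtmlRAG keeps HTML structure for retrieval"),
    ("html/body/div/p[2]", "subscribe to our newsletter today"),
    ("html/body/footer", "copyright 2025 all rights reserved"),
]
kept = prune_blocks(blocks, "why keep HTML structure in retrieval", bow_cosine, 8)
```

In this sketch a second, finer-grained pass would call prune_blocks again on smaller blocks with a different score_fn, for example one backed by a generative model. How HtmlRAG actually builds its block tree and scores blocks in each step is not specified here and should be taken from the project itself.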

Summary:

HtmlRAG uses HTML rather than plain text to represent retrieved knowledge in RAG systems, and manages the complexity of HTML through Lossless HTML Cleaning and Two-Step Block-Tree-Based HTML Pruning. Users can apply HtmlRAG in their own systems by following the project's installation guide.