Python Boilerpipe


A Python interface to Boilerpipe for boilerplate removal and full-text extraction from HTML pages

Overview:

python-boilerpipe is a Python wrapper for Boilerpipe, a Java library that removes boilerplate content from HTML pages and extracts their full text. It exposes Boilerpipe's functionality through a Python interface so that it can be integrated into Python applications.

Features:

  • Boilerplate Removal: Removes boilerplate content from HTML pages.
  • Fulltext Extraction: Extracts the full text of an HTML page.
  • Multiple Extractor Types: Supports DefaultExtractor, ArticleExtractor, ArticleSentencesExtractor, KeepEverythingExtractor, KeepEverythingWithMinKWordsExtractor, LargestContentExtractor, NumWordsRulesExtractor, and CanolaExtractor.
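
As a rough illustration of how the features above fit together, the following sketch wraps extraction in a small helper. The helper name extract_text and the EXTRACTOR_TYPES tuple are hypothetical and exist only for this example. The Extractor class, its extractor= and html= keywords, and getText() are assumed to match the python-boilerpipe package, which in turn is assumed to need JPype and a Java runtime, so check them against the installed version before relying on this.

```python
# Extractor names as listed in the Features section above.
EXTRACTOR_TYPES = (
    "DefaultExtractor",
    "ArticleExtractor",
    "ArticleSentencesExtractor",
    "KeepEverythingExtractor",
    "KeepEverythingWithMinKWordsExtractor",
    "LargestContentExtractor",
    "NumWordsRulesExtractor",
    "CanolaExtractor",
)


def extract_text(html, extractor="ArticleExtractor"):
    """Hypothetical helper: return the main text of an HTML string."""
    # Reject unknown names early, before the Java side is touched.
    if extractor not in EXTRACTOR_TYPES:
        raise ValueError(f"unknown extractor type: {extractor!r}")
    # Imported lazily so this module loads even without the wrapper installed.
    # Assumption: python-boilerpipe exposes Extractor in boilerpipe.extract.
    from boilerpipe.extract import Extractor

    return Extractor(extractor=extractor, html=html).getText()
```

A call might look like extract_text(page_html, "LargestContentExtractor"), where page_html is an HTML string you have already fetched. Validating the name in Python first is a design choice for this sketch: it gives a plain ValueError instead of an error raised from the Java bridge.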

Summary:

python-boilerpipe makes the Java library Boilerpipe available from Python. It removes boilerplate from HTML pages, extracts their full text, and lets the caller choose among several extractor types, which makes it a useful tool for content extraction tasks in Python.