Dedoc

Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and metadata from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parse...

Overview

Dedoc is an open, universal system that converts various document types into a unified output format. It extracts the logical structure and content of documents, including tables, text formatting, and metadata, which makes it useful for document analysis. Dedoc can also be integrated into broader document content and structure analysis systems, and its design is flexible and extensible.

Dedoc processes documents automatically and supports both structured and unstructured data formats. It handles diverse document types, from legal papers to scientific reports, with a focus on accurate extraction and formatting. Dedoc has also received a development grant from the Innovation Assistance Foundation in recognition of its potential in the AI field.

Features

  • Automatic Document Structure Extraction: Processes documents regardless of the input data type, extracting both content and logical structure.
  • Wide Format Compatibility: Works with various formats like DOC/DOCX, ODT, XLS/XLSX, CSV, TXT, JSON, as well as images, archives, PDFs, and HTML.
  • Metadata and Formatting Extraction: Automatically retrieves metadata and text formatting attributes, enhancing the usability of the extracted information.
  • Extensible Architecture: Allows for the easy addition of new document formats and modifications to the output data format.
  • Advanced Table Data Extraction: Capable of recognizing complex multipage tables through contour analysis, extracting both textual content and physical structure.
  • Support for Scanned Documents: Utilizes Tesseract OCR for processing images and PDFs without a textual layer, ensuring robust document interpretation.
  • Intelligent Document Orientation Detection: Incorporates modern machine learning techniques to identify document orientations and structures accurately.
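
To make the ideas of a unified output format and an extensible architecture more concrete, here is a minimal sketch in Python. It is not Dedoc's actual API or data model: the names `UnifiedDocument`, `READERS`, `register`, and `parse` are hypothetical, and only three simple text-based formats are handled. The sketch only illustrates the general pattern of registering one reader per format and returning the same structure for every input.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, field
from typing import Callable, Dict, List


@dataclass
class UnifiedDocument:
    """Hypothetical uniform output: text lines, tables, and metadata."""
    lines: List[str] = field(default_factory=list)
    tables: List[List[List[str]]] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)


# Hypothetical registry: file extension -> reader function.
READERS: Dict[str, Callable[[str], UnifiedDocument]] = {}


def register(extension: str):
    """Decorator that adds a reader for one extension to the registry."""
    def wrap(func):
        READERS[extension] = func
        return func
    return wrap


@register("txt")
def read_txt(content: str) -> UnifiedDocument:
    # Plain text: every line becomes one text line of the document.
    return UnifiedDocument(lines=content.splitlines())


@register("csv")
def read_csv(content: str) -> UnifiedDocument:
    # CSV: the whole file becomes a single table of string cells.
    rows = list(csv.reader(io.StringIO(content)))
    return UnifiedDocument(tables=[rows])


@register("json")
def read_json(content: str) -> UnifiedDocument:
    # JSON: assumes a top-level object; each key becomes a "key: value" line.
    data = json.loads(content)
    return UnifiedDocument(lines=[f"{key}: {value}" for key, value in data.items()])


def parse(filename: str, content: str) -> dict:
    """Pick a reader by extension and return the uniform structure as a dict."""
    extension = filename.rsplit(".", 1)[-1].lower()
    if extension not in READERS:
        raise ValueError(f"unsupported format: {extension}")
    document = READERS[extension](content)
    document.metadata["file_name"] = filename
    document.metadata["file_type"] = extension
    return asdict(document)
```

In this sketch, supporting a new format means writing one more function decorated with `register`; callers of `parse` keep receiving the same `lines`, `tables`, and `metadata` keys whatever the input type.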