CCKS2019 Task5


CCKS 2019 Evaluation Task 5: Information Extraction from Public Company Announcements (3rd place)

Overview

PDF is a standard format for distributing electronic documents, and academic and other institutions routinely publish their announcements as PDF files. Extracting structured data from these unstructured documents remains a significant challenge in the field of knowledge graphs. This project addresses it by using Adobe's Acrobat DC SDK to transform PDF files into structured data.

Acrobat's format conversion produces semi-structured intermediate files, from which our approach extracts comprehensive and accurate information. The method placed third in the CCKS 2019 public company announcement evaluation. Converting PDF files to XML and then extracting tables and text segments gives our solution an edge over existing open-source PDF parsing methods.
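
The XML step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the converted file marks tables with `Table`, rows with `TR`, cells with `TH`/`TD`, and paragraphs with `P`, and the sample document and function names are hypothetical. Only the search for Table tags is taken from the description; the other tag names should be checked against real Acrobat output.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for an XML file exported from a PDF.
# The tag names (Table, TR, TH, TD, P) are assumptions for this sketch.
SAMPLE_XML = """
<Document>
  <P>Announcement on the resignation of a director</P>
  <Table>
    <TR><TH>Name</TH><TH>Position</TH></TR>
    <TR><TD>Zhang San</TD><TD>Director</TD></TR>
  </Table>
</Document>
"""


def extract_tables(xml_text):
    """Return every table as a list of rows, each row a list of cell strings."""
    root = ET.fromstring(xml_text.strip())
    tables = []
    for table in root.iter("Table"):
        rows = []
        for row in table.iter("TR"):
            # Keep only direct cell children and flatten any nested markup.
            cells = [
                "".join(cell.itertext()).strip()
                for cell in row
                if cell.tag in ("TH", "TD")
            ]
            rows.append(cells)
        tables.append(rows)
    return tables


def extract_paragraphs(xml_text):
    """Return the text of every P element that has non-empty content."""
    root = ET.fromstring(xml_text.strip())
    texts = ("".join(p.itertext()).strip() for p in root.iter("P"))
    return [text for text in texts if text]


if __name__ == "__main__":
    print(extract_tables(SAMPLE_XML))
    print(extract_paragraphs(SAMPLE_XML))
```

The sketch does not handle nested tables or merged cells, which real announcements may contain.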

Features

  • Comprehensive Format Conversion: Uses the Acrobat DC SDK to convert PDF documents into XML, preserving table and text structure for downstream extraction.
  • Effective Table Extraction: Searches the XML for Table tags to locate and extract the relevant tables quickly.
  • Advanced Entity Recognition: Applies a Bi-LSTM-CRF model for named entity recognition to improve the accuracy of information extracted from text paragraphs.
  • Template-Based Extraction: Defines templates keyed on trigger words (such as "resignation") to find and extract the sentences that contain valuable information points.
  • Training on Pre-Trained Word Vectors: Uses word vectors pre-trained on financial-domain text to improve the extraction results of the Bi-LSTM-CRF model.
  • Heuristic Rule-Based Approach: Uses heuristic rules to annotate the training data, which streamlines labeling at the cost of some noise in the data.
  • Single Evaluation Opportunity: The evaluation was run through a web API that allowed only one test submission, so parameters had to be set precisely before the first run.
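
As a rough illustration of the trigger-word idea in the list above, the following sketch splits text into sentences and keeps those that contain a trigger. It is an assumption-laden example rather than the project's templates: the event label, the trigger stem `resign`, the naive sentence splitter, and the sample text are all hypothetical.

```python
import re

# Hypothetical trigger table: event type -> lowercase trigger stems.
# Only the "resignation" trigger is mentioned in the description above.
TRIGGERS = {
    "resignation": ("resign",),
}

# Naive splitter: a run of non-terminator characters plus an optional
# terminator. It would split wrongly on abbreviations and decimal numbers.
SENTENCE_RE = re.compile(r"[^.!?。！？]+[.!?。！？]?")


def split_sentences(text):
    """Split text into stripped, non-empty sentences."""
    return [s.strip() for s in SENTENCE_RE.findall(text) if s.strip()]


def find_trigger_sentences(text, triggers=TRIGGERS):
    """Return (event_type, sentence) pairs for sentences containing a trigger."""
    hits = []
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        for event_type, stems in triggers.items():
            if any(stem in lowered for stem in stems):
                hits.append((event_type, sentence))
                break  # one event type per sentence is enough for this sketch
    return hits


if __name__ == "__main__":
    sample = (
        "The board received a written resignation from director Zhang San. "
        "The company will hold its annual meeting in May. "
        "Li Si resigned as supervisor for personal reasons."
    )
    for event_type, sentence in find_trigger_sentences(sample):
        print(event_type, "->", sentence)
```

In a fuller pipeline, the selected sentences would presumably be passed on to the entity recognition step, so that the model only has to label text already likely to contain an information point.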
Flask

Flask is a lightweight and popular web framework for Python, known for its simplicity and flexibility. It is widely used to build web applications, providing a minimalistic approach to web development with features like routing, templates, and support for extensions.