PDF (.pdf)

PDF file support information

Mantium supports PDF files with features for extracting and analyzing the data they contain. This page describes those features, their limitations, and how to add a PDF data source.

Features

  • PDF Object Parsing: Mantium's PDF tech allows for the parsing of all objects in a PDF document into Python objects. This feature makes it easier for users to manipulate PDF documents programmatically.
  • Text Analysis: Mantium's PDF tech can group and analyze text in a human-readable way. This feature makes it easier for users to extract specific pieces of text from a document, such as headings or paragraphs.
  • Data Extraction: Mantium's PDF tech extracts several types of data from PDF documents, including text, images (JPEG, JBIG2, and bitmaps), the table of contents, and tagged content. This gives users a comprehensive view of the data contained in a PDF document.
  • PDF Specification Support: Mantium's PDF tech supports almost all features from the PDF-1.7 specification, so users can access and analyze the data in most PDF documents.
  • CJK Language Support: Mantium's PDF tech supports Chinese, Japanese, and Korean (CJK) languages, as well as vertical writing. This feature ensures that users can access and analyze PDF documents in languages other than English.
  • Font Support: Mantium's PDF tech supports various font types, including Type1, TrueType, Type3, and CID. This feature ensures that users can access and analyze PDF documents that use different font types.
  • Encryption Support: Mantium's PDF tech supports both RC4 and AES encryption. This feature ensures that users can access and analyze PDF documents that are encrypted with these encryption algorithms.
  • Interactive Form Extraction: Mantium's PDF tech supports AcroForm interactive form extraction. This feature makes it easier for users to extract and analyze data from PDF forms.
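
As a rough illustration of the PDF-1.7 point above, the sketch below reads the version declared in a file's header and checks that it is no later than 1.7. It uses only the Python standard library. The function names `pdf_header_version` and `within_pdf_1_7` are hypothetical and are not part of Mantium. A document's catalog can declare a later version than its header, so treat this as a quick first check rather than a definitive one.

```python
import re

# Hypothetical helpers for illustration only; these are not Mantium APIs.
_HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_header_version(data: bytes):
    """Return (major, minor) from the '%PDF-x.y' header, or None if absent."""
    # The header normally sits at the very start of the file; searching the
    # first 1024 bytes tolerates a small amount of leading junk.
    match = _HEADER_RE.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def within_pdf_1_7(data: bytes) -> bool:
    """True if the header declares a version no later than 1.7."""
    version = pdf_header_version(data)
    return version is not None and version <= (1, 7)
```

To check a file on disk, read it in binary mode and pass the bytes to `within_pdf_1_7`, for example `within_pdf_1_7(open(path, "rb").read())`, where `path` is whatever file you intend to upload.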

Limitations

Mantium's PDF tech has a few limitations:

  • PDF Compatibility: While Mantium's PDF tech supports almost all features from the PDF-1.7 specification, some PDF documents may not be fully compatible with the tech. In some cases, certain features may not be extracted or analyzed correctly.
  • Image Extraction: Mantium's PDF tech extracts several image types, but some PDF documents use image formats that it does not support, and those images may not be extracted.
  • OCR: Mantium's PDF tech does not currently support optical character recognition (OCR). Text can be extracted from a scanned PDF document only if the document already contains an embedded text layer.
  • Form Extraction: While Mantium's PDF tech supports AcroForm interactive form extraction, it may not be able to extract data from certain types of forms. In some cases, the data may need to be manually extracted from the form.
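
Some of these limitations can be anticipated before upload. The sketch below is a heuristic scan, using only the Python standard library, for a few name tokens defined by the PDF specification (`/Encrypt`, `/AcroForm`, `/Font`). The function `scan_pdf_markers` is hypothetical and is not a Mantium API. The scan can miss markers stored inside compressed object streams, so a negative result is only a hint.

```python
def scan_pdf_markers(data: bytes) -> dict:
    """Heuristically report which PDF name tokens appear in the raw bytes.

    Hypothetical helper for illustration; it does not parse the PDF and
    cannot see dictionaries stored in compressed object streams.
    """
    return {
        # The trailer references an encryption dictionary via /Encrypt.
        "encrypted": b"/Encrypt" in data,
        # Interactive forms are referenced from the catalog via /AcroForm.
        "has_acroform": b"/AcroForm" in data,
        # Text drawn on a page needs a font resource; this substring also
        # matches related names such as /FontDescriptor.
        "has_fonts": b"/Font" in data,
    }
```

A file with no `/Font` marker is often a scanned document without a text layer, in which case text extraction is unlikely to return anything.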

Usage

To use the PDF Data Connector in Mantium, follow these steps:

  1. Click Data Source on the left navigation bar to go to the Data Sources section.
  2. In the top-right corner, select Add Data Source.
  3. From the list of Data Sources, select the PDF Data Connector.
  4. Provide the details needed to label the Data Source and wait for the job to complete.
  5. Upload the PDF file containing the data you want to analyze (e.g., research papers, reports).
  6. Click the Finish and Sync button to finalize the setup and synchronize the data.