Convert PDF to Text

Convert PDF files to plain text that can be indexed, searched, or analyzed. This transformation is useful for applications that need the text content of a PDF file but not its original layout or formatting.

Parameters

The PDF to Text transformation has two required parameters:

  • Source Column: The column name containing the PDF files you want to extract text from. Defaults to content.
  • Destination Column: The column name that holds the extracted text. Defaults to text.

Usage

To use the PDF to Text transformation in Mantium, follow these steps:

  1. Set the Source Column parameter by selecting the column that contains the PDF files to be converted.
  2. Set the Destination Column parameter by specifying the name of the new column that will hold the extracted text.
  3. Run the transformation by clicking the Save and Run Transforms button. In the resulting dataset, the new column contains the text extracted from the PDF file in each row.

Example 1: Extracting Text Data from PDF Files

The video below demonstrates how to use the PDF to Text transformation in just a few seconds. If you prefer text, please continue reading.

Suppose we have a dataset with a column called 'PDF File' that contains PDF files in binary format:

PDF File
<PDF binary data>
<PDF binary data>

If we want to extract the text data from the PDF files and create a new column called 'Text Data', we can use the PDF to Text transformation. We would configure the transformation as follows:

  • Source Column: PDF File
  • Destination Column: Text Data

The resulting dataset would look like this:

PDF File          | Text Data
<PDF binary data> | <Text extracted from PDF file>
<PDF binary data> | <Text extracted from PDF file>

As the example shows, the PDF to Text transformation keeps the original PDF column and adds a new column holding the text extracted from each file.
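
For readers who want to see the same column-to-column mapping in code, here is a minimal Python sketch. It is not Mantium's implementation, which this page does not describe. The function names pdf_to_text and extract_with_pypdf are hypothetical, the dataset is assumed to be a list of dictionaries, and the default extractor assumes the third-party pypdf package is installed. Only the default column names (content and text) come from the parameters described above.

```python
import io
from typing import Callable, Dict, List


def extract_with_pypdf(pdf_bytes: bytes) -> str:
    """Extract plain text from PDF bytes (assumes `pip install pypdf`)."""
    from pypdf import PdfReader  # imported lazily so the sketch loads without pypdf

    reader = PdfReader(io.BytesIO(pdf_bytes))
    # extract_text() can yield nothing for image-only pages, so fall back to "".
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def pdf_to_text(
    rows: List[Dict[str, object]],
    source_column: str = "content",    # default named in the Parameters section
    destination_column: str = "text",  # default named in the Parameters section
    extract: Callable[[bytes], str] = extract_with_pypdf,
) -> List[Dict[str, object]]:
    """Return copies of the rows with the extracted text in destination_column."""
    result = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are left unchanged
        new_row[destination_column] = extract(row[source_column])
        result.append(new_row)
    return result
```

Calling pdf_to_text(rows, source_column="PDF File", destination_column="Text Data") on rows shaped like the example dataset would produce rows with both columns, matching the table above. The extract argument is a design choice of this sketch that lets a different PDF library be swapped in.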