Convert Microsoft Word to Text

Convert Microsoft Word files (.docx) to plain text that can be used for indexing, searching, or analyzing a document's content. This enrichment is useful for applications that need the text of a Word file but not its original layout or formatting.
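
As a rough illustration of what this kind of conversion involves, the sketch below pulls paragraph text out of .docx bytes using only the Python standard library. It is not Mantium's implementation, and the function name docx_bytes_to_text is hypothetical. It relies on the fact that a .docx file is a ZIP archive whose main body is stored in word/document.xml, and it ignores headers, footers, footnotes, tabs, and line breaks.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_bytes_to_text(data: bytes) -> str:
    """Return the paragraph text of a .docx file, one paragraph per line.

    Hypothetical helper for illustration only; layout and formatting are dropped.
    """
    # A .docx file is a ZIP archive; the document body lives in word/document.xml.
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        # Each w:t element holds a run of text; joining the runs discards formatting.
        runs = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

The function takes the raw bytes held in the source column and returns a single string, which is the shape of value the destination column would hold.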

Parameters

  • Source Column: The column name containing the Microsoft Word files you want to extract text from. Defaults to content.
  • Destination Column: The column name that holds the extracted text. Defaults to text.

Usage

To use the Convert Microsoft Word (.docx) to Text transformation in Mantium, follow these steps:

  1. Configure the Source Column parameter by selecting the column that contains the Microsoft Word files to be converted.
  2. Configure the Destination Column parameter by specifying the name of the new column that will be created with the extracted text data.
  3. Run the transformation by clicking the Save and Run Transforms button. The resulting dataset will have the specified Microsoft Word files converted to text and stored in the new column.

Example 1 - Extracting Text Data from Microsoft Word Files

Suppose we have a dataset with a column called 'Docx File' that contains Microsoft Word files in binary format:

<Docx binary data>

If we want to extract the text data from the Microsoft Word files and create a new column called 'Text', we can use the Convert Microsoft Word (.docx) to Text transformation.

We would configure the transformation as follows (see image below):

Source Column: Docx File  
Destination Column: Text

The resulting dataset would look like this:

Docx File, Text
<Docx binary data>, <Text extracted from Docx file>
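
The column behavior in this example can be sketched in a few lines of Python. This is an assumption-laden illustration, not Mantium's code: convert_column and fake_converter are hypothetical names, rows are modeled as plain dictionaries, and the converter is passed in as an argument so the sketch does not depend on any particular .docx parser. The defaults mirror the documented parameter defaults (content and text).

```python
from typing import Callable


def convert_column(
    rows: list[dict],
    converter: Callable[[bytes], str],
    source_column: str = "content",
    destination_column: str = "text",
) -> list[dict]:
    """Return copies of rows with converted text stored under destination_column."""
    result = []
    for row in rows:
        new_row = dict(row)  # keep every original column, including the source
        new_row[destination_column] = converter(row[source_column])
        result.append(new_row)
    return result


# Stand-in converter so the sketch runs without a real .docx file.
def fake_converter(data: bytes) -> str:
    return data.decode("utf-8")


rows = [{"Docx File": b"Quarterly report"}]
converted = convert_column(rows, fake_converter, "Docx File", "Text")
```

In this sketch the source column is left in place and the destination column is added next to it, which matches the two-column result shown above.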