Using Apache Nifi and Tika to extract content from pdf

Ralf Meier Sat, 20 Feb 2016 07:32:00 -0800

Hi Everybody, 

I’m new to NiFi and want to find out whether it is possible to extract content and 
metadata from PDFs using a library like Tika. 
My first idea was to use the following processors:
- GetFile (watch a specific folder)
- IdentifyMimeType (identify whether the file is of type application/pdf)
- RouteOnAttribute (continue only if it is a PDF)
- ExecuteStreamCommand:
        I changed the following settings.
        Command Arguments: {flowfile_contents}
        Command Path: tika-python parse all
        
I use the Python Tika wrapper tika-python
(https://github.com/chrismattmann/tika-python).
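
For illustration, here is a minimal sketch of the kind of script ExecuteStreamCommand could call. It assumes that the processor pipes the flowfile content to the command's standard input and replaces the content with whatever the command writes to standard output. The script name extract_pdf.py, the "-" argument, and the extract() helper are hypothetical; parser.from_buffer is the tika-python call for parsing in-memory bytes, and it needs Java available because tika-python starts a local Tika server on first use.

```python
import json
import sys


def extract(data, parse=None):
    """Return a dict with the 'metadata' and 'content' of a raw document."""
    if parse is None:
        # Imported lazily so the module loads even without tika installed.
        from tika import parser
        parse = parser.from_buffer
    parsed = parse(data) or {}
    return {
        "metadata": parsed.get("metadata") or {},
        # Tika may return None as content for documents without text.
        "content": (parsed.get("content") or "").strip(),
    }


def main():
    # The PDF bytes arrive on stdin; the JSON result goes to stdout.
    data = sys.stdin.buffer.read()
    json.dump(extract(data), sys.stdout)


# Hypothetical convention: only read stdin when called with a "-" argument.
if __name__ == "__main__" and sys.argv[1:2] == ["-"]:
    main()
```

With such a script, the processor settings would presumably look like Command Path: python3 and Command Arguments: extract_pdf.py;- (assuming the default ";" argument delimiter), rather than passing the content as an argument.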


But it is not working. 
Does anybody have an idea how to use Tika to extract the content and the metadata 
in NiFi, or what I’m doing wrong?

Thanks for your help.
BR 
Ralf
