Re: Using Apache Nifi and Tika to extract content from pdf

Matt Burgess Sat, 20 Feb 2016 07:44:55 -0800

I have a blog post on how to do this in NiFi with a Groovy script in the 
ExecuteScript processor (new in 0.5.0), using PDFBox instead of Tika:


http://funnifi.blogspot.com/2016/02/executescript-extract-text-metadata.html?m=1

Jython is also supported but can't yet use Java libraries (it uses Jython 
scripts/modules instead). The other languages (Groovy, Lua, JavaScript, JRuby) 
can use Java libraries like Tika and PDFBox.

Regards,
Matt


> On Feb 20, 2016, at 10:31 AM, Ralf Meier <[email protected]> wrote:
> 
> Hi Everybody, 
> 
> I’m new to NiFi and I want to find out whether it is possible to extract content
> and metadata from PDFs using a library like Tika.
> My first idea was to use the following processors:
> - GetFile (Watch a specific Folder)
> - IdentifyMimeType (Identify if the file is of type application/pdf)
> - RouteOnAttribute (If it is a pdf)
> - ExecuteStreamCommand:
>       I changed the following settings.
>       Command Arguments: {flowfilw_contents}
>       Command Path: tika-python parse all
>       
> I use the python tika wrapper from 
> (https://github.com/chrismattmann/tika-python)
> 
> But it is not working. 
> Does anybody have an idea how to use Tika to extract the content and the
> metadata using NiFi, or what I’m doing wrong?
> 
> Thanks for your help.
> BR 
> Ralf
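
For the ExecuteStreamCommand route described in the quoted message, one possible shape is a small filter script: as far as I understand the processor, it pipes the flowfile content to the command's stdin and takes the command's stdout as the new content. The sketch below assumes the tika-python package is installed and that its parser.from_buffer call accepts the raw PDF bytes; the function names (extract, main) and the JSON output layout are illustrative choices, not anything from the thread.

```python
#!/usr/bin/env python3
"""Read a PDF from stdin, write its text and metadata to stdout as JSON."""
import json
import sys


def extract(pdf_bytes, parse=None):
    """Return {"metadata": ..., "content": ...} for the given PDF bytes.

    `parse` is any callable that takes bytes and returns a dict with
    "metadata" and "content" keys. It defaults to tika-python's
    parser.from_buffer (an assumption about the installed wrapper).
    """
    if parse is None:
        # Imported lazily so the rest of the script works without tika.
        from tika import parser
        parse = parser.from_buffer
    parsed = parse(pdf_bytes)
    return {
        # Either key may be missing or None for unparseable input.
        "metadata": parsed.get("metadata") or {},
        "content": (parsed.get("content") or "").strip(),
    }


def main(stdin, stdout, parse=None):
    """Filter: PDF bytes in on stdin, one JSON document out on stdout."""
    pdf_bytes = stdin.read()
    if not pdf_bytes:
        return  # empty input: nothing to parse, emit nothing
    json.dump(extract(pdf_bytes, parse), stdout)
    stdout.write("\n")


if __name__ == "__main__":
    main(sys.stdin.buffer, sys.stdout)
```

With a script like this saved somewhere on the NiFi host (say as extract_pdf.py, a hypothetical name), Command Path would hold only the interpreter, for example python3, and Command Arguments only the path to the script. The flowfile content then arrives on stdin rather than as an argument.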
