Re: Using Apache Nifi and Tika to extract content from pdf

Matt Burgess Sat, 20 Feb 2016 13:33:07 -0800

I will update the blog to make this clearer. I used PDFBox 1.8.10, so I'm 
not sure what else you need for the 2.0 series. For the JAR issue with 1.8.10, 
the PDFBox docs say you need three JARs (pdfbox, fontbox, and jempbox) plus 
commons-logging, but I think that's already in NiFi.


The stack trace from the script error should be in logs/nifi-app.log; if you 
send it along, I can take a look. In the Module Directory property you should 
be able to either point to the folder containing the JARs or supply a 
comma-separated list of the individual JARs.
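
As a minimal sketch of the comma-separated option (not taken from the blog 
post), the Groovy snippet below collects the JARs in a folder and joins their 
paths with commas, producing a value that could be pasted into the property. 
The helper name jarList and the folder /opt/nifi-libs/pdfbox are hypothetical.

```groovy
// Hypothetical helper: returns a comma-separated list of the JAR files
// found directly inside the given folder, sorted by file name.
String jarList(File dir) {
    def jars = dir.listFiles().findAll { it.isFile() && it.name.endsWith('.jar') }
    return jars.sort { it.name }.collect { it.absolutePath }.join(',')
}

// Example folder (an assumption; use wherever you put the PDFBox JARs).
def dir = new File('/opt/nifi-libs/pdfbox')
if (dir.isDirectory()) {
    println jarList(dir)
}
```

Pointing the property at the folder itself is simpler; the explicit list is 
mainly useful when the folder also holds JARs you do not want loaded.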

For the Groovy "magic" (syntactic sugar and closure coercion while using 
the NiFi APIs), I explain some of that in another post on that blog: 
http://funnifi.blogspot.com/2016/02/executescript-processor-replacing-flow.html?m=1
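
To illustrate closure coercion on its own, here is a small self-contained 
sketch that runs without NiFi. StreamCallbackLike is a hypothetical stand-in 
with the same single-method shape as NiFi's StreamCallback; the upper-casing 
logic is only an example, not what the PDF script does.

```groovy
// Stand-in for NiFi's StreamCallback so this sketch runs without NiFi on the
// classpath; the interface name is hypothetical.
interface StreamCallbackLike {
    void process(InputStream input, OutputStream output)
}

// A two-parameter closure, coerced to the one-method interface via `as`.
// Groovy builds an object whose process(...) method calls the closure.
def upperCase = { input, output ->
    output.write(input.getText('UTF-8').toUpperCase().getBytes('UTF-8'))
} as StreamCallbackLike

// Call it the way a framework would: through the interface method.
def out = new ByteArrayOutputStream()
upperCase.process(new ByteArrayInputStream('hello pdf'.getBytes('UTF-8')), out)
println out.toString('UTF-8')   // prints HELLO PDF
```

In the quoted script, `{inputStream, outputStream -> ...} as StreamCallback` 
follows the same pattern: session.write receives an object implementing the 
interface and calls it with the flow file's input and output streams.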

Hope this helps,
Matt

> On Feb 20, 2016, at 3:54 PM, Ralf Meier <[email protected]> wrote:
> 
> Hi,
> 
> Thanks for the information. I am trying to understand your workflow, but I 
> get some errors when I test it:
> 
> : org.apache.nifi.processor.exception.ProcessException: 
> javax.script.ScriptException: 
> org.codehaus.groovy.control.MultipleCompilationErrorsException: startup 
> failed:
> Script36800.groovy: 15: unable to resolve class PDFTextStripper 
>  @ line 15, column 9.
>    def s = new PDFTextStripper()
> 
> I downloaded pdfbox-2.0.0-RC3.jar and copied it into a folder named pdfbox 
> in my download folder. I then changed the path (Module Directory) in the 
> ExecuteScript processor to this folder. I didn’t change anything else. 
> 
> But I get this error. Do you have any hints? That would be great.
> 
> 
> To be honest (I’m totally new to Groovy), I also did not understand in 
> detail what happens here:
> 
> flowFile = session.write(flowFile, { inputStream, outputStream ->
>     doc = PDDocument.load(inputStream)
>     info = doc.getDocumentInformation()
>     s.writeText(doc, new OutputStreamWriter(outputStream))
> } as StreamCallback)
> 
> Thanks for your help.
> 
> BR
> Ralf
> 
> 
> 
> 
>> On 20.02.2016 at 16:44, Matt Burgess <[email protected]> wrote:
>> 
>> I have a blog post on how to do this with NiFi using a Groovy script in the 
>> ExecuteScript (new in 0.5.0) processor using PDFBox instead of Tika:
>> 
>> http://funnifi.blogspot.com/2016/02/executescript-extract-text-metadata.html?m=1
>> 
>> Jython is also supported but can't yet use Java libraries (it uses Jython 
>> scripts/modules instead). The other languages (Groovy, Lua, JavaScript, 
>> JRuby) can use Java libraries like Tika and PDFBox.
>> 
>> Regards,
>> Matt
>> 
>>> On Feb 20, 2016, at 10:31 AM, Ralf Meier <[email protected]> wrote:
>>> 
>>> Hi Everybody, 
>>> 
>>> I’m new to NiFi and I want to find out whether it is possible to extract 
>>> content and metadata from PDFs using a library like Tika. 
>>> My first idea was to use the following processors:
>>> - GetFile (Watch a specific Folder)
>>> - IdentifyMimeType (Identify whether the file is of type application/pdf) 
>>> - RouteOnAttribute (If it is a pdf)
>>> - ExecuteStreamCommand:
>>>     I changed the following settings.
>>>     Command Arguments: {flowfilw_contents}
>>>     Command Path: tika-python parse all
>>>     
>>> I use the Python Tika wrapper from 
>>> (https://github.com/chrismattmann/tika-python)
>>> 
>>> But it is not working. 
>>> Does anybody have an idea how to use Tika to extract the content and the 
>>> metadata using NiFi, or what I’m doing wrong?
>>> 
>>> Thanks for your help.
>>> BR 
>>> Ralf
>
