Re: Using Apache Nifi and Tika to extract content from pdf

Ralf Meier Sat, 20 Feb 2016 12:59:38 -0800

Hi,

Thanks for the information. I am trying to understand your workflow, but I 
get an error when I test it:


: org.apache.nifi.processor.exception.ProcessException: 
javax.script.ScriptException: 
org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
Script36800.groovy: 15: unable to resolve class PDFTextStripper 
 @ line 15, column 9.
   def s = new PDFTextStripper()

I downloaded pdfbox-2.0.0-RC3.jar and copied it into a folder named pdfbox in 
my download folder. I then changed the path (Module Directory) in the 
ExecuteScript processor to this folder. I didn’t change anything else. 

But I still get this error. Do you have any hints? That would be great.


To be honest, I’m totally new to Groovy, and I also did not understand in 
detail what happens here:

flowFile = session.write(flowFile, {inputStream, outputStream ->
        doc = PDDocument.load(inputStream)
        info = doc.getDocumentInformation()
        s.writeText(doc, new OutputStreamWriter(outputStream))
    } as StreamCallback
)
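
My best guess so far, which may well be wrong, is that session.write hands 
the closure an input stream with the current FlowFile content and an output 
stream for the new content, and that "as StreamCallback" turns the closure 
into an object of that interface. Below is a minimal sketch of that reading. 
It does not use NiFi or PDFBox at all: the StreamCallback interface and the 
write helper are stand-ins I declared myself so the script runs on its own, 
and upper-casing the text stands in for the PDF text extraction.

```groovy
// Stand-in for NiFi's StreamCallback (assumption: a single process method
// taking the old content as input and the new content as output).
interface StreamCallback {
    void process(InputStream inputStream, OutputStream outputStream) throws IOException
}

// Hypothetical helper imitating session.write(flowFile, callback):
// it feeds the old content to the callback and collects what the callback writes.
byte[] write(byte[] oldContent, StreamCallback callback) {
    def input = new ByteArrayInputStream(oldContent)
    def output = new ByteArrayOutputStream()
    callback.process(input, output)
    return output.toByteArray()
}

// The closure is coerced to the interface with "as StreamCallback",
// the same shape as in the snippet above.
def newContent = write('hello pdf'.getBytes('UTF-8'), { inputStream, outputStream ->
    def text = inputStream.getText('UTF-8')            // read the old content
    def writer = new OutputStreamWriter(outputStream, 'UTF-8')
    writer.write(text.toUpperCase())                   // write the new content
    writer.flush()                                     // push buffered characters out
} as StreamCallback)

println new String(newContent, 'UTF-8')
```

If that reading is right, then in the real script the FlowFile returned by 
session.write would contain the extracted text instead of the PDF bytes.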

Thanks for your help.

BR
Ralf




> On 20.02.2016 at 16:44, Matt Burgess <[email protected]> wrote:
> 
> I have a blog post on how to do this with NiFi using a Groovy script in the 
> ExecuteScript (new in 0.5.0) processor using PDFBox instead of Tika:
> 
> http://funnifi.blogspot.com/2016/02/executescript-extract-text-metadata.html?m=1
> 
> Jython is also supported but can't yet use Java libraries (it uses Jython 
> scripts/modules instead). The other languages (Groovy, Lua, JavaScript, 
> JRuby) can use Java libraries like Tika and PDFBox.
> 
> Regards,
> Matt
> 
> Sent from my iPhone
> 
> On Feb 20, 2016, at 10:31 AM, Ralf Meier <[email protected]> wrote:
> 
>> Hi everybody, 
>> 
>> I’m new to NiFi and I want to find out whether it is possible to extract 
>> content and metadata from PDFs using a library like Tika. 
>> My first idea was to use the following processors:
>> - GetFile (Watch a specific Folder)
>> - IdentifyMimeType (Identify if the file is of type application/pdf) 
>> - RouteOnAttribute (If it is a pdf)
>> - ExecuteStreamCommand:
>>      I changed the following settings.
>>      Command Arguments: {flowfilw_contents}
>>      Command Path: tika-python parse all
>>      
>> I use the Python Tika wrapper from 
>> https://github.com/chrismattmann/tika-python
>> 
>> But it is not working. 
>> Does anybody have an idea how to use Tika to extract the content and the 
>> metadata using NiFi, or what I’m doing wrong?
>> 
>> Thanks for your help.
>> BR 
>> Ralf
