Hi,
thanks for the information. I am trying to understand your workflow, but I get
an error when I test it:
: org.apache.nifi.processor.exception.ProcessException:
javax.script.ScriptException:
org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
Script36800.groovy: 15: unable to resolve class PDFTextStripper
@ line 15, column 9.
def s = new PDFTextStripper()
I downloaded pdfbox-2.0.0-RC3.jar and copied it into a folder named pdfbox in my
download folder. I then set the Module Directory property of the ExecuteScript
processor to this folder. I didn’t change anything else.
But I still get this error. Do you have any hints? That would be great.
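One thing that may be worth checking, offered as an assumption because the script's import lines are not shown in this message: in PDFBox 2.0 the PDFTextStripper class lives in the org.apache.pdfbox.text package, whereas the 1.8.x releases had it in org.apache.pdfbox.util. If the script was written against 1.8.x, imports along these lines might resolve the class with the 2.0.0-RC3 jar:

```groovy
// Sketch only: assumes the original script imported org.apache.pdfbox.util.*,
// which is where PDFTextStripper lived in PDFBox 1.8.x.
import org.apache.pdfbox.pdmodel.PDDocument
// In PDFBox 2.0.x the text-extraction classes are in the "text" package.
import org.apache.pdfbox.text.PDFTextStripper
```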
To be honest, I’m totally new to Groovy, so I also did not understand in detail
what happens here:
flowFile = session.write(flowFile, {inputStream, outputStream ->
    doc = PDDocument.load(inputStream)
    info = doc.getDocumentInformation()
    s.writeText(doc, new OutputStreamWriter(outputStream))
  } as StreamCallback
)
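As far as the snippet can be read on its own, it passes a Groovy closure to session.write and uses "as StreamCallback" to turn that closure into an object implementing the callback interface; the closure reads the old content from inputStream and writes the new content to outputStream. The following is a minimal sketch of that pattern that runs without NiFi or PDFBox. The StreamCallback interface and the write method here are hypothetical stand-ins defined in the sketch itself, not the NiFi classes, and upper-casing text replaces the PDF extraction.

```groovy
// Hypothetical stand-in for NiFi's StreamCallback, declared locally so the
// sketch runs without NiFi on the classpath.
interface StreamCallback {
    void process(InputStream inputStream, OutputStream outputStream) throws IOException
}

// Hypothetical stand-in for session.write(flowFile, callback): hands the old
// content to the callback and returns whatever the callback wrote.
byte[] write(byte[] content, StreamCallback callback) {
    def input = new ByteArrayInputStream(content)
    def output = new ByteArrayOutputStream()
    callback.process(input, output)
    return output.toByteArray()
}

// The closure takes the same two parameters as process(); "as StreamCallback"
// coerces it into an object that implements the interface.
def result = write('hello pdf'.getBytes('UTF-8'), { inputStream, outputStream ->
    def writer = new OutputStreamWriter(outputStream, 'UTF-8')
    writer.write(inputStream.getText('UTF-8').toUpperCase())
    writer.flush()   // push buffered characters into outputStream
} as StreamCallback)

println new String(result, 'UTF-8')
```

In the quoted snippet the same roles would apply: the incoming flow file content plays the part of inputStream, and the text written by s.writeText becomes the new flow file content.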
Thanks for your help.
BR
Ralf
> On 20.02.2016 at 16:44, Matt Burgess <[email protected]> wrote:
>
> I have a blog post on how to do this with NiFi using a Groovy script in the
> ExecuteScript (new in 0.5.0) processor using PDFBox instead of Tika:
>
> http://funnifi.blogspot.com/2016/02/executescript-extract-text-metadata.html?m=1
>
> Jython is also supported but can't yet use Java libraries (it uses Jython
> scripts/modules instead). The other languages (Groovy, Lua, JavaScript,
> JRuby) can use Java libraries like Tika and PDFBox.
>
> Regards,
> Matt
>
> On Feb 20, 2016, at 10:31 AM, Ralf Meier <[email protected]> wrote:
>
>> Hi Everybody,
>>
>> I’m new to NiFi and I want to find out if it is possible to extract content
>> and metadata from PDFs using a library like Tika.
>> My first idea was to use the following processors:
>> - GetFile (watch a specific folder)
>> - IdentifyMimeType (identify whether the file is of type application/pdf)
>> - RouteOnAttribute (if it is a PDF)
>> - ExecuteStreamCommand:
>> I changed the following settings.
>> Command Arguments: {flowfilw_contents}
>> Command Path: tika-python parse all
>>
>> I use the Python Tika wrapper from
>> https://github.com/chrismattmann/tika-python
>>
>> But it is not working.
>> Does anybody have an idea how to use Tika to extract the content and the
>> metadata using NiFi, or what I’m doing wrong?
>>
>> Thanks for your help.
>> BR
>> Ralf