I will update the blog post to make this clearer. I used PDFBox 1.8.10, so I'm not sure what else you need for the 2.0 series. For the JAR issue with 1.8.10, the PDFBox documentation says you need three JARs: pdfbox, fontbox, and jempbox, plus commons-logging, though I think that one already ships with NiFi.
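One thing that may be worth ruling out for the "unable to resolve class PDFTextStripper" error you quoted is the package name. As far as I know, 1.8.x has the class at org.apache.pdfbox.util.PDFTextStripper, while 2.0 moved it to org.apache.pdfbox.text.PDFTextStripper, so an import written against 1.8.10 would not resolve against a 2.0 JAR. Below is a minimal sketch in plain Java that reports which of the two locations is visible. The class name PdfBoxVersionCheck and the helper isLoadable are made up for illustration, and it only inspects the classpath you start it with, not NiFi's own class loading.

```java
// Illustrative check, not part of NiFi or PDFBox: reports which package
// PDFTextStripper can be loaded from on the current classpath.
class PdfBoxVersionCheck {

    // Location of the class in PDFBox 1.8.x.
    static final String V1_CLASS = "org.apache.pdfbox.util.PDFTextStripper";
    // Location of the class in PDFBox 2.0.
    static final String V2_CLASS = "org.apache.pdfbox.text.PDFTextStripper";

    // Returns true if the named class can be found without initializing it.
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, PdfBoxVersionCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("1.8.x location found: " + isLoadable(V1_CLASS));
        System.out.println("2.0 location found:   " + isLoadable(V2_CLASS));
    }
}
```

To try it, compile it and run it with the JARs from your pdfbox folder on the classpath. If only the 2.0 location is found, changing the import in the script is probably the first thing to try.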
The stack trace from the script error should be in logs/nifi-app.log; if you send it along, I can take a look. You should be able to point the Module Directory property at the folder containing the JARs, or give it a comma-separated list of the individual JARs.

For the Groovy "magic" (syntactic sugar and closure coercion while using the NiFi APIs), I explain some of that in another post on that blog:

http://funnifi.blogspot.com/2016/02/executescript-processor-replacing-flow.html?m=1

Hope this helps,
Matt

> On Feb 20, 2016, at 3:54 PM, Ralf Meier <n...@cht3.com> wrote:
>
> Hi,
>
> thanks for your information. I tried to understand your workflow but get
> some errors when I test it:
>
> : org.apache.nifi.processor.exception.ProcessException:
> javax.script.ScriptException:
> org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
> Script36800.groovy: 15: unable to resolve class PDFTextStripper
>  @ line 15, column 9.
>    def s = new PDFTextStripper()
>
> I downloaded pdfbox-2.0.0-RC3.jar and copied it into a folder named pdfbox
> in my download folder. I then changed the path (Module Directory) in the
> ExecuteScript processor to this folder. The rest I didn’t change.
>
> But I get this error. Do you have some hints? This would be great.
>
> To be honest (I’m totally new to Groovy), I also did not understand what
> happens here in detail:
>
> flowFile = session.write(flowFile, {inputStream, outputStream ->
>     doc = PDDocument.load(inputStream)
>     info = doc.getDocumentInformation()
>     s.writeText(doc, new OutputStreamWriter(outputStream))
>   } as StreamCallback
> )
>
> Thanks for your help.
>
> BR
> Ralf
>
>> On 20.02.2016 at 16:44, Matt Burgess <mattyb...@gmail.com> wrote:
>>
>> I have a blog post on how to do this with NiFi using a Groovy script in the
>> ExecuteScript processor (new in 0.5.0), using PDFBox instead of Tika:
>>
>> http://funnifi.blogspot.com/2016/02/executescript-extract-text-metadata.html?m=1
>>
>> Jython is also supported but can't yet use Java libraries (it uses Jython
>> scripts/modules instead). The other languages (Groovy, Lua, JavaScript,
>> JRuby) can use Java libraries like Tika and PDFBox.
>>
>> Regards,
>> Matt
>>
>> Sent from my iPhone
>>
>>> On Feb 20, 2016, at 10:31 AM, Ralf Meier <n...@cht3.com> wrote:
>>>
>>> Hi Everybody,
>>>
>>> I’m new to NiFi and I want to find out if it is possible to extract content
>>> and metadata from PDFs using a library like Tika.
>>> My first idea was to use the following processors:
>>> - GetFile (watch a specific folder)
>>> - IdentifyMimeType (identify whether the file is of type application/pdf)
>>> - RouteOnAttribute (if it is a PDF)
>>> - ExecuteStreamCommand:
>>>   I changed the following settings.
>>>   Command Arguments: {flowfilw_contents}
>>>   Command Path: tika-python parse all
>>>
>>> I use the Python Tika wrapper from
>>> (https://github.com/chrismattmann/tika-python)
>>>
>>> But it is not working.
>>> Does somebody have an idea how to use Tika to extract the content and the
>>> metadata using NiFi, or what I’m doing wrong?
>>>
>>> Thanks for your help.
>>> BR
>>> Ralf
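On the session.write(...) question quoted above: `{ inputStream, outputStream -> ... } as StreamCallback` asks Groovy to turn the closure into an object implementing NiFi's StreamCallback interface, whose single method receives the FlowFile's current content as an InputStream and an OutputStream for the new content. The rough Java equivalent is a lambda, sketched below. To keep it runnable without NiFi or PDFBox, the StreamCallback interface and the write helper here are local stand-ins I made up with the same shape, and upper-casing the text stands in for the PDFBox text extraction.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ClosureCoercionSketch {

    // Local stand-in with the same shape as NiFi's
    // org.apache.nifi.processor.io.StreamCallback: one method, two streams.
    interface StreamCallback {
        void process(InputStream in, OutputStream out) throws IOException;
    }

    // Hypothetical stand-in for session.write(flowFile, callback): hands the
    // old content to the callback and returns whatever the callback wrote.
    static byte[] write(byte[] oldContent, StreamCallback callback) throws IOException {
        ByteArrayOutputStream newContent = new ByteArrayOutputStream();
        try (InputStream in = new ByteArrayInputStream(oldContent)) {
            callback.process(in, newContent);
        }
        return newContent.toByteArray();
    }

    // Reads a stream fully into a UTF-8 string.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    // The lambda plays the role of the Groovy closure: read the old content,
    // transform it (upper-casing stands in for PDF text extraction), and
    // write the result as the new content.
    static String rewrite(String original) {
        try {
            byte[] result = write(original.getBytes(StandardCharsets.UTF_8), (in, out) -> {
                String text = readAll(in).toUpperCase(Locale.ROOT);
                Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
                writer.write(text);
                writer.flush(); // push buffered characters into the output stream
            });
            return new String(result, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rewrite("hello pdf")); // prints HELLO PDF
    }
}
```

In the quoted script the closure body does the same job with real classes: it loads a PDDocument from the input stream and has the PDFTextStripper write the extracted text to the output stream, which becomes the FlowFile's new content.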