Re: Using Apache Nifi and Tika to extract content from pdf

Matt Burgess Sun, 21 Feb 2016 15:31:20 -0800

There are some RegEx processors you can use to check whether the text parsed
from the PDF is "empty" or contains only whitespace, or you can use the
scripting processor for that too.
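
As a rough illustration of the "empty or only whitespace" check, here is a
minimal sketch in plain Python. The function name is_blank is made up for this
example and is not part of NiFi; the idea of matching the whole text against
"whitespace only" is the part that should carry over to a regex-based
processor or a script.

```python
import re

# Matches text that consists of nothing but whitespace (or nothing at all).
BLANK_PATTERN = re.compile(r'\A\s*\Z')


def is_blank(text):
    """Return True when the extracted text is missing, empty, or whitespace only.

    `is_blank` is an illustrative helper name, not a NiFi API.
    """
    return text is None or BLANK_PATTERN.match(text) is not None
```

A flow could then route "blank" results to a failure path instead of writing
them to disk or indexing them.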


For Jython, check the unit test:
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-scripting-bundle/nifi-scripting-processors/src/test/java/org/apache/nifi/processors/script/TestExecuteJython.java
It refers to resources in
nifi/nifi-nar-bundles/nifi-scripting-bundle/nifi-scripting-processors/src/test/resources/jython,
and also does some flowfile manipulation. Remember that if you use Jython
you can't use JARs like PDFBox; you'd need a Jython-compatible
module/script. The scripting processors should eventually support both
same-language modules (currently only JRuby and Jython support those) and
JVM libraries (JARs), to allow maximum flexibility and power.
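
For orientation, a Jython script body for ExecuteScript tends to have the
shape sketched below. This is a sketch under assumptions, not a copy of the
unit test resources: `session` and `REL_SUCCESS` are the variables the
processor binds for the script, the names collapse_whitespace and
CollapseWhitespace are made up for illustration, and it assumes commons-io
(IOUtils) is visible to the script because it ships with NiFi rather than
being an external JAR. The try/except guards are only there so the text
helper can also be tried in plain Python outside NiFi.

```python
import re

try:
    # These imports resolve only under Jython inside NiFi.
    from org.apache.commons.io import IOUtils
    from java.nio.charset import StandardCharsets
    from org.apache.nifi.processor.io import StreamCallback
except ImportError:
    # Plain-Python fallback so the helper below can be exercised outside NiFi.
    StreamCallback = object


def collapse_whitespace(text):
    """Collapse every run of whitespace to one space and trim both ends."""
    return re.sub(r'\s+', ' ', text).strip()


class CollapseWhitespace(StreamCallback):
    """Reads the incoming flowfile content and writes the cleaned text back."""

    def __init__(self):
        pass

    def process(self, inputStream, outputStream):
        # Read the whole content as UTF-8 text, clean it, and write it out.
        text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        outputStream.write(bytearray(collapse_whitespace(text).encode('utf-8')))


try:
    flowFile = session.get()   # `session` is bound by ExecuteScript
except NameError:
    flowFile = None            # not running inside NiFi

if flowFile is not None:
    # session.write returns the updated flowfile reference to pass along.
    flowFile = session.write(flowFile, CollapseWhitespace())
    session.transfer(flowFile, REL_SUCCESS)
```

The callback class plays the same role as the Groovy closure coerced with
"as StreamCallback": the session hands it the input and output streams of
the flowfile content.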

To that end, I would love to hear any comments, questions, or
suggestions to make the scripting processors better. Russell's suggestion
for adding Clojure is a great example; I am hoping we can take this thing
as far as it can go :)

Regards,
Matt


On Sun, Feb 21, 2016 at 7:22 AM, Ralf Meier <[email protected]> wrote:

> Hi,
>
> thanks for your help. Now the workflow is working, but I still have some
> issues. The PutFile at the end of the workflow writes the file to disk, but
> in my case the content of the flow file is mostly empty (only one PDF
> worked for me), even though the rest is processed just fine. Also, when I
> try to put the result into e.g. Elasticsearch, it is empty.
>
> Is there a special hint for this?
>
> In addition, I searched the documentation to find out what the syntax
> would be in Python to read the input flowfile and to create a new flowfile
> and parse it back. Is there any documentation? Or where can I find some
> information?
>
> Sorry for all my questions.
>
> BR and thanks.
>
> Ralf
>
>
> On 20.02.2016 at 22:27, Matt Burgess <[email protected]> wrote:
>
> I will update the blog to make these more clear. I used PDFBox 1.8.10 so
> I'm not sure what else you need for the 2.0-series. For the JAR issue with
> 1.8.10, PDFBox doc says you need 3 JARs: PDFBox, fontbox, and jempbox, plus
> commons-logging but I think that's already in NiFi.
>
> The stack trace from the script error should be in logs/nifi-app.log, if
> you send it along I can take a look. You should be able to point to the
> folder containing the JARs, or supply a comma-separated list of each JAR in
> the Module Path property.
>
> For the groovy "magic" stuff (syntactic sugar and closure coercion while
> using the NiFi APIs), I explain some of that in another post on that blog:
>
> http://funnifi.blogspot.com/2016/02/executescript-processor-replacing-flow.html?m=1
>
> Hope this helps,
> Matt
>
> On Feb 20, 2016, at 3:54 PM, Ralf Meier <[email protected]> wrote:
>
> Hi,
>
> thanks for your information. I tried to understand your workflow but got
> some errors when I tested it:
>
> : org.apache.nifi.processor.exception.ProcessException: 
> javax.script.ScriptException: 
> org.codehaus.groovy.control.MultipleCompilationErrorsException: startup 
> failed:
> Script36800.groovy: 15: unable to resolve class PDFTextStripper
>  @ line 15, column 9.
>    def s = new PDFTextStripper()
>
>
> I downloaded pdfbox-2.0.0-RC3.jar and copied it into a folder named pdfbox
> in my download folder. I then changed the path (Module Directory) in the
> ExecuteScript processor to this folder. I didn’t change the rest.
>
> But I get this error. Do you have some hints? This would be great.
>
>
> To be honest (I’m totally new to Groovy), I also did not
> understand what happens here in detail:
>
> flowFile = session.write(flowFile, {inputStream, outputStream ->
>         doc = PDDocument.load(inputStream)
>         info = doc.getDocumentInformation()
>         s.writeText(doc, new OutputStreamWriter(outputStream))
>     } as StreamCallback
> )
>
> Thanks for your help.
>
> BR
> Ralf
>
>
>
>
> On 20.02.2016 at 16:44, Matt Burgess <[email protected]> wrote:
>
> I have a blog post on how to do this with NiFi using a Groovy script in
> the ExecuteScript (new in 0.5.0) processor using PDFBox instead of Tika:
>
>
> http://funnifi.blogspot.com/2016/02/executescript-extract-text-metadata.html?m=1
>
> Jython is also supported but can't yet use Java libraries (it uses Jython
> scripts/modules instead). The other languages (Groovy, Lua, JavaScript,
> JRuby) can use Java libraries like Tika and PDFBox.
>
> Regards,
> Matt
>
> Sent from my iPhone
>
> On Feb 20, 2016, at 10:31 AM, Ralf Meier <[email protected]> wrote:
>
> Hi Everybody,
>
> I’m new to NiFi and I want to find out if it is possible to extract
> content and metadata from PDFs using a library like Tika.
> My first idea was to use the following processors:
> - GetFile (Watch a specific Folder)
> - IdentifyMimeType (Identify if the file is of type application/pdf)
> - RouteOnAttribute (If it is a pdf)
> - ExecuteStreamCommand:
> I changed the following settings.
> Command Arguments: {flowfilw_contents}
> Command Path: tika-python parse all
> I use the python tika wrapper from (
> https://github.com/chrismattmann/tika-python)
>
> But it is not working.
> Does anybody have an idea how to use Tika to extract the content and the
> metadata using NiFi, or what I’m doing wrong?
>
> Thanks for your help.
> BR
> Ralf
>
>
>
>
