Re: Using Apache Nifi and Tika to extract content from pdf

Ralf Meier Sun, 21 Feb 2016 04:23:31 -0800

Hi,

thanks for your help. The workflow is running now, but I still have some 
issues. The PutFile at the end of the workflow writes the file to disk, but in 
most cases the content of the flow file is empty (only one PDF worked for me), 
even though the rest of the flow is processed just fine. The result is also 
empty when I try to put it into e.g. Elasticsearch.


Do you have any hints for this?

In addition, I searched the documentation for the Python syntax to read the 
incoming flow file, create a new flow file, and pass it back. Is there any 
documentation on this? Or where can I find some information?

Sorry for all my questions.

BR and thanks.

Ralf
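
On the Python question above, a minimal sketch of the read-transform-write 
pattern might look like the following. It is plain Python, not a NiFi script: 
`process`, `incoming`, and `outgoing` are hypothetical names, `io.BytesIO` 
only stands in for the flow file content, and the upper-casing is a 
placeholder for real processing. Inside ExecuteScript the same shape would be 
a StreamCallback passed to session.write, as in the Groovy snippet quoted 
below, working on Java streams instead of Python ones.

```python
import io


def process(input_stream, output_stream):
    """Read the whole input, transform it, and write the result back out."""
    # Read the incoming content and decode it as UTF-8 text.
    text = input_stream.read().decode("utf-8")

    # Placeholder transformation (an assumption for illustration only).
    transformed = text.upper()

    # Write the new content and flush, so nothing is left in a buffer.
    output_stream.write(transformed.encode("utf-8"))
    output_stream.flush()


# Stand-ins for the incoming and outgoing flow file content.
incoming = io.BytesIO(b"hello nifi")
outgoing = io.BytesIO()

process(incoming, outgoing)
print(outgoing.getvalue().decode("utf-8"))
```

The explicit flush matters whenever the output goes through a buffered 
writer: content that is never flushed may not reach the underlying stream.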


> On 20.02.2016 at 22:27, Matt Burgess <[email protected]> wrote:
> 
> I will update the blog to make these more clear. I used PDFBox 1.8.10 so I'm 
> not sure what else you need for the 2.0-series. For the JAR issue with 
> 1.8.10, PDFBox doc says you need 3 JARs: PDFBox, fontbox, and jempbox, plus 
> commons-logging but I think that's already in NiFi.
> 
> The stack trace from the script error should be in logs/nifi-app.log, if you 
> send it along I can take a look. You should be able to point to the folder 
> containing the JARs, or supply a comma-separated list of each JAR in the 
> Module Path property.
> 
> For the groovy "magic" stuff (syntactic sugar and closure coercion while 
> using the NiFi APIs), I explain some of that in another post on that blog: 
> http://funnifi.blogspot.com/2016/02/executescript-processor-replacing-flow.html?m=1
> 
> Hope this helps,
> Matt
> 
> On Feb 20, 2016, at 3:54 PM, Ralf Meier <[email protected]> wrote:
> 
>> Hi,
>> 
>> thanks for the information. I am trying to understand your workflow, but I 
>> get some errors when I test it:
>> 
>> : org.apache.nifi.processor.exception.ProcessException: 
>> javax.script.ScriptException: 
>> org.codehaus.groovy.control.MultipleCompilationErrorsException: startup 
>> failed:
>> Script36800.groovy: 15: unable to resolve class PDFTextStripper 
>>  @ line 15, column 9.
>>    def s = new PDFTextStripper()
>> 
>> I downloaded pdfbox-2.0.0-RC3.jar and copied it into a folder named pdfbox 
>> in my download folder. I then changed the path (Module Directory) in the 
>> ExecuteScript processor to this folder. I didn’t change anything else. 
>> 
>> But I still get this error. Do you have any hints? That would be great.
>> 
>> 
>> To be honest, I’m totally new to Groovy, and I also did not understand in 
>> detail what happens here:
>> 
>> flowFile = session.write(flowFile, {inputStream, outputStream ->
>>     doc = PDDocument.load(inputStream)
>>     info = doc.getDocumentInformation()
>>     s.writeText(doc, new OutputStreamWriter(outputStream))
>> } as StreamCallback)
>> 
>> Thanks for your help.
>> 
>> BR
>> Ralf
>> 
>> 
>> 
>> 
>>> On 20.02.2016 at 16:44, Matt Burgess <[email protected]> wrote:
>>> 
>>> I have a blog post on how to do this with NiFi using a Groovy script in the 
>>> ExecuteScript (new in 0.5.0) processor using PDFBox instead of Tika:
>>> 
>>> http://funnifi.blogspot.com/2016/02/executescript-extract-text-metadata.html?m=1
>>> 
>>> Jython is also supported but can't yet use Java libraries (it uses Jython 
>>> scripts/modules instead). The other languages (Groovy, Lua, JavaScript, 
>>> JRuby) can use Java libraries like Tika and PDFBox.
>>> 
>>> Regards,
>>> Matt
>>> 
>>> On Feb 20, 2016, at 10:31 AM, Ralf Meier <[email protected]> wrote:
>>> 
>>>> Hi Everybody, 
>>>> 
>>>> I’m new to NiFi and I want to find out whether it is possible to extract 
>>>> content and metadata from PDFs using a library like Tika. 
>>>> My first idea was to use the following processors:
>>>> - GetFile (Watch a specific Folder)
>>>> - IdentifyMimeType (Identify if the file is of type application/pdf) 
>>>> - RouteOnAttribute (If it is a pdf)
>>>> - ExecuteStreamCommand:
>>>>    I changed the following settings.
>>>>    Command Arguments: {flowfilw_contents}
>>>>    Command Path: tika-python parse all
>>>>    
>>>> I use the Python Tika wrapper from 
>>>> https://github.com/chrismattmann/tika-python
>>>> 
>>>> But it is not working. 
>>>> Does anybody have an idea how to use Tika to extract the content and the 
>>>> metadata using NiFi, or what I’m doing wrong?
>>>> 
>>>> Thanks for your help.
>>>> BR 
>>>> Ralf
>>
