RE: Error parsing PDFs

Allison, Timothy B. Mon, 17 Oct 2016 08:02:55 -0700

Hmmm…Thank you, Julien.  I’m trying to find the exact version of nutch’s 
TikaParser that would result in that stacktrace…I don’t see one where line 167 
is the call to tika’s parser to parse the pdf…any recommendations?

I’m at a loss to figure out how Tika would be adding xmpMM:DocumentID more than 
once.  Any ideas?

We could change our code to “set”, and if there are multiples, that would 
overwrite the earlier ids, but there really should only be one DocumentID.

Also, any pointers to setting up Nutch in IntelliJ aside from what Google 
returns?  It seems to be non-trivial.

From: Julien Nioche [mailto:[email protected]]
Sent: Monday, October 17, 2016 9:39 AM
To: [email protected]
Subject: Re: Error parsing PDFs

The Metadata object is brand new for each document parsed, see 
[https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java#L102]

On 17 October 2016 at 14:03, Allison, Timothy B. 
<[email protected]<mailto:[email protected]>> wrote:
New feature. :)   We didn't extract xmpMM in 1.11.

Thank you for sharing a test file! I'm not able to reproduce this with Tika trunk.

The error means that a value for xmpMM:DocumentID was already set in the 
Metadata object, and you're trying to add another value.  xmpMM:DocumentID is 
"SIMPLE" and only allows one value.
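
To illustrate the add-versus-set behaviour described here, a minimal 
self-contained Java sketch follows. The class SingleValueMetadataSketch and 
the "uuid:..." values are hypothetical stand-ins invented for this example; 
this is not Tika's Metadata class, and it throws IllegalStateException where 
Tika reports PropertyTypeException.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a metadata store in which every property is
// single-valued ("SIMPLE"). This is NOT Tika's Metadata implementation.
class SingleValueMetadataSketch {
    private final Map<String, String> values = new HashMap<>();

    // add() refuses a second value for a name that already holds one.
    void add(String name, String value) {
        if (values.containsKey(name)) {
            throw new IllegalStateException(name + " : SIMPLE");
        }
        values.put(name, value);
    }

    // set() replaces whatever was stored, so the last value wins.
    void set(String name, String value) {
        values.put(name, value);
    }

    String get(String name) {
        return values.get(name);
    }

    public static void main(String[] args) {
        SingleValueMetadataSketch metadata = new SingleValueMetadataSketch();
        metadata.add("xmpMM:DocumentID", "uuid:first");
        try {
            // A second add() on the same single-valued name is rejected.
            metadata.add("xmpMM:DocumentID", "uuid:second");
        } catch (IllegalStateException e) {
            System.out.println("add rejected: " + e.getMessage());
        }
        // A set-style call silently overwrites the earlier id instead.
        metadata.set("xmpMM:DocumentID", "uuid:second");
        System.out.println(metadata.get("xmpMM:DocumentID"));
    }
}
```

Under these assumptions, a fresh store accepts the first id, so a second add 
on the same name implies either a reused store or two ids being read from the 
same document.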

Is nutch reusing the Metadata object, not clearing it, or prepopulating it with 
xmp metadata?  I'll take a look at nutch.

-----Original Message-----
From: Vincent 
[mailto:[email protected]<mailto:[email protected]>]
Sent: Monday, October 17, 2016 8:13 AM
To: [email protected]<mailto:[email protected]>
Subject: Re: Error parsing PDFs

Hi,

After some additional trying I found that this error does not occur for this 
document in Tika 1.11. I forgot to mention in my last message that I was using 
Tika 1.13. So is this perhaps a bug in the new Tika version?

Regards,

Vincent

On 17-10-16 13:37, Vincent wrote:
> Hi all,
>
> I have some trouble using Tika to parse some PDFs. I crawl them with
> Nutch 1.11, using parse-tika. Some documents will get parsed
> correctly, but most won't, and the error isn't very clear to me:
>
> org.apache.tika.metadata.PropertyTypeException: xmpMM:DocumentID : SIMPLE
>         at org.apache.tika.metadata.Metadata.add(Metadata.java:338)
>         at org.apache.tika.parser.image.xmp.JempboxExtractor.addMetadata(JempboxExtractor.java:199)
>         at org.apache.tika.parser.image.xmp.JempboxExtractor.extractXMPMM(JempboxExtractor.java:145)
>         at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:216)
>         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:136)
>         at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:167)
>         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
>         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
>
> I tested the document with PDFBox ExtractText, and it works fine.
>
> An example of a failing document is:
>
> https://gemeente.groningen.nl/system/files/1._jaarstukken_groninger_archieven_br_raad.pdf
>
>
> Any suggestions?
>
> Thanks in advance!
> Vincent Slot
>

--

Open Source Solutions for Text Engineering

http://www.digitalpebble.com/
http://digitalpebble.blogspot.com/
#digitalpebble (http://twitter.com/digitalpebble)
