Re: Error parsing PDFs

Julien Nioche Mon, 17 Oct 2016 06:39:15 -0700

The Metadata object is brand new for each document parsed; see
https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java#L102
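
To make the failure mode discussed in the quoted messages below concrete, here is a minimal, self-contained sketch of a metadata store that rejects a second value for a single-valued ("SIMPLE") property. ToyMetadata and ToyMetadataDemo are hypothetical names invented for illustration; this is not Tika's Metadata class and does not use the Tika API, and the identifier values are made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical toy model of a metadata store; NOT Tika's Metadata class. */
class ToyMetadata {
    // Names of properties that may hold at most one value (assumption for this sketch).
    private final Set<String> singleValued;
    private final Map<String, List<String>> values = new HashMap<>();

    ToyMetadata(Set<String> singleValued) {
        this.singleValued = singleValued;
    }

    /** Adds a value; refuses a second value for a single-valued property. */
    void add(String name, String value) {
        if (values.containsKey(name) && singleValued.contains(name)) {
            throw new IllegalStateException(name + " : SIMPLE");
        }
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Returns the stored values, or an empty list if none were added. */
    List<String> get(String name) {
        return values.getOrDefault(name, Collections.<String>emptyList());
    }
}

class ToyMetadataDemo {
    public static void main(String[] args) {
        Set<String> simple = Collections.singleton("xmpMM:DocumentID");

        // Reusing one store: the second add for the same property is rejected.
        ToyMetadata reused = new ToyMetadata(simple);
        reused.add("xmpMM:DocumentID", "uuid:first");
        try {
            reused.add("xmpMM:DocumentID", "uuid:second");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }

        // A fresh store per document: the same add succeeds.
        ToyMetadata fresh = new ToyMetadata(simple);
        fresh.add("xmpMM:DocumentID", "uuid:second");
        System.out.println(fresh.get("xmpMM:DocumentID"));
    }
}
```

In this toy model, a store that is created fresh for each document can only hit the rejection if the same single-valued property is added twice while handling that one document.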


On 17 October 2016 at 14:03, Allison, Timothy B. <[email protected]> wrote:

> New feature. :)   We didn't extract xmpMM in 1.11.
>
> Thank you for sharing a test file! I'm not able to reproduce this with Tika
> trunk.
>
> The error means that a value for xmpMM:DocumentID was already set in the
> Metadata object, and you're trying to add another value.  xmpMM:DocumentID
> is "SIMPLE" and only allows one value.
>
> Is Nutch reusing the Metadata object, not clearing it, or prepopulating it
> with XMP metadata?  I'll take a look at Nutch.
>
>
> -----Original Message-----
> From: Vincent [mailto:[email protected]]
> Sent: Monday, October 17, 2016 8:13 AM
> To: [email protected]
> Subject: Re: Error parsing PDFs
>
> Hi,
>
> After some more testing, I found that this error does not occur for
> this document in Tika 1.11. I forgot to mention in my last message that I
> was using Tika 1.13. So is this perhaps a bug in the new Tika version?
>
> Regards,
>
> Vincent
>
> On 17-10-16 13:37, Vincent wrote:
> > Hi all,
> >
> > I have some trouble using Tika to parse some PDFs. I crawl them with
> > Nutch 1.11, using parse-tika. Some documents will get parsed
> > correctly, but most won't, and the error isn't very clear to me:
> >
> > org.apache.tika.metadata.PropertyTypeException: xmpMM:DocumentID : SIMPLE
> >         at org.apache.tika.metadata.Metadata.add(Metadata.java:338)
> >         at org.apache.tika.parser.image.xmp.JempboxExtractor.addMetadata(JempboxExtractor.java:199)
> >         at org.apache.tika.parser.image.xmp.JempboxExtractor.extractXMPMM(JempboxExtractor.java:145)
> >         at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:216)
> >         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:136)
> >         at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:167)
> >         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> >         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> >         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> >         at java.lang.Thread.run(Thread.java:745)
> >
> > I tested the document with PDFBox ExtractText, and it works fine.
> >
> > An example of a failing document is:
> >
> > https://gemeente.groningen.nl/system/files/1._jaarstukken_groninger_archieven_br_raad.pdf
> >
> >
> > Any suggestions?
> >
> > Thanks in advance!
> > Vincent Slot
> >
>
>


-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>
