Re: Error parsing PDFs

Julien Nioche Mon, 17 Oct 2016 09:22:08 -0700

Hi Tim

On 17 October 2016 at 16:02, Allison, Timothy B. <[email protected]> wrote:


> Hmmm…Thank you, Julien.  I’m trying to find the exact version of nutch’s
> TikaParser that would result in that stacktrace…I don’t see one where line
> 167 is the call to tika’s parser to parse the pdf…any recommendations?
>

Weird, me neither.


>
>
> I’m at a loss to figure out how Tika would be adding xmpMM:DocumentID
> more than once.  Any ideas?
>
>
>
> We could change our code to “set”, and if there are multiples, that would
> overwrite the earlier ids, but there really should only be one DocumentID.
>
>
>
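
For anyone following the thread, the difference between "add" and "set" on a single-valued property can be sketched in plain Java. This is a simplified stand-in and not Tika's code: the class `MetadataSketch`, its constructor, and the use of `IllegalStateException` are assumptions made purely for illustration. In Tika the analogous calls are `Metadata.add(Property, String)`, which raised the `PropertyTypeException` in the stack trace, and `Metadata.set(Property, String)`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Simplified, hypothetical model of a metadata container; NOT Tika's Metadata class. */
class MetadataSketch {
    // Names of properties that may hold at most one value (assumed registry).
    private final Set<String> singleValued = new HashSet<>();
    private final Map<String, List<String>> values = new HashMap<>();

    MetadataSketch(String... singleValuedNames) {
        for (String name : singleValuedNames) {
            singleValued.add(name);
        }
    }

    /** Appends a value; refuses a second value for a single-valued property. */
    void add(String name, String value) {
        List<String> current = values.get(name);
        if (current != null && !current.isEmpty() && singleValued.contains(name)) {
            // Stand-in for the exception type seen in the stack trace.
            throw new IllegalStateException(name + " : SIMPLE");
        }
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Replaces whatever was stored before, so it never conflicts. */
    void set(String name, String value) {
        List<String> fresh = new ArrayList<>();
        fresh.add(value);
        values.put(name, fresh);
    }

    /** Returns a copy of the stored values (empty if none). */
    List<String> getValues(String name) {
        return new ArrayList<>(values.getOrDefault(name, new ArrayList<>()));
    }

    public static void main(String[] args) {
        MetadataSketch metadata = new MetadataSketch("xmpMM:DocumentID");
        metadata.add("xmpMM:DocumentID", "uuid:first");
        try {
            metadata.add("xmpMM:DocumentID", "uuid:second");
        } catch (IllegalStateException e) {
            System.out.println("add rejected: " + e.getMessage());
        }
        metadata.set("xmpMM:DocumentID", "uuid:second");
        System.out.println(metadata.getValues("xmpMM:DocumentID"));
    }
}
```

Switching to "set" would make the failure go away, but it would also hide the fact that two values were seen for a property that should only ever have one, so it treats the symptom rather than the cause.
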
> Also, any pointers to setting up nutch in Intellij aside from what Google
> returns?  Seems to be non-trivial.
>

No idea, sorry. I've never used IntelliJ in my life.

J.


>
> *From:* Julien Nioche [mailto:[email protected]]
> *Sent:* Monday, October 17, 2016 9:39 AM
>
> *To:* [email protected]
> *Subject:* Re: Error parsing PDFs
>
>
>
> The Metadata object is brand new for each document parsed, see
> https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java#L102
>
>
>
> On 17 October 2016 at 14:03, Allison, Timothy B. <[email protected]>
> wrote:
>
> New feature. :)   We didn't extract xmpMM in 1.11.
>
> Thank you for sharing a test file! I'm not able to reproduce this with Tika
> trunk.
>
> The error means that a value for xmpMM:DocumentID was already set in the
> Metadata object, and you're trying to add another value.  xmpMM:DocumentID
> is "SIMPLE" and only allows one value.
>
> Is nutch reusing the Metadata object, not clearing it, or prepopulating it
> with xmp metadata?  I'll take a look at nutch.
>
>
>
> -----Original Message-----
> From: Vincent [mailto:[email protected]]
> Sent: Monday, October 17, 2016 8:13 AM
> To: [email protected]
> Subject: Re: Error parsing PDFs
>
> Hi,
>
> After some additional trying I found that this error does not occur for
> this document in Tika 1.11. I forgot to mention in my last message that I
> was using Tika 1.13. So is this perhaps a bug in the new Tika version?
>
> Regards,
>
> Vincent
>
> On 17-10-16 13:37, Vincent wrote:
> > Hi all,
> >
> > I have some trouble using Tika to parse some PDFs. I crawl them with
> > Nutch 1.11, using parse-tika. Some documents will get parsed
> > correctly, but most won't, and the error isn't very clear to me:
> >
> > org.apache.tika.metadata.PropertyTypeException: xmpMM:DocumentID : SIMPLE
> >         at org.apache.tika.metadata.Metadata.add(Metadata.java:338)
> >         at org.apache.tika.parser.image.xmp.JempboxExtractor.addMetadata(JempboxExtractor.java:199)
> >         at org.apache.tika.parser.image.xmp.JempboxExtractor.extractXMPMM(JempboxExtractor.java:145)
> >         at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:216)
> >         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:136)
> >         at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:167)
> >         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> >         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> >         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> >         at java.lang.Thread.run(Thread.java:745)
> >
> > I tested the document with PDFBox ExtractText, and it works fine.
> >
> > An example of a failing document is:
> >
> > https://gemeente.groningen.nl/system/files/1._jaarstukken_groninger_archieven_br_raad.pdf
> >
> >
> > Any suggestions?
> >
> > Thanks in advance!
> > Vincent Slot
> >
>
>
>
>
>
> --
>
>
> *Open Source Solutions for Text Engineering*
>
>
> http://www.digitalpebble.com
> http://digitalpebble.blogspot.com/
> #digitalpebble <http://twitter.com/digitalpebble>
>



-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>
