Hi Tim

On 17 October 2016 at 16:02, Allison, Timothy B. <talli...@mitre.org> wrote:

> Hmmm…Thank you, Julien. I’m trying to find the exact version of nutch’s
> TikaParser that would result in that stacktrace…I don’t see one where line
> 167 is the call to tika’s parser to parse the pdf…any recommendations?

Weird, me neither.

> I’m at a loss to figure out how Tika would be adding xmpMM:DocumentID
> more than once. Any ideas?
>
> We could change our code to “set”, and if there are multiples, that would
> overwrite the earlier ids, but there really should only be one DocumentID.
>
> Also, any pointers to setting up nutch in Intellij aside from what Google
> returns? Seems to be non-trivial.

No idea, sorry. I never used Intellij in my life J.

> *From:* Julien Nioche [mailto:lists.digitalpeb...@gmail.com]
> *Sent:* Monday, October 17, 2016 9:39 AM
> *To:* user@tika.apache.org
> *Subject:* Re: Error parsing PDFs
>
> The Metadata object is brand new for each document parsed, see
> [https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java#L102]
>
> On 17 October 2016 at 14:03, Allison, Timothy B. <talli...@mitre.org>
> wrote:
>
> New feature. :) We didn't extract xmpMM in 1.11.
>
> Thank you for sharing a test file! I'm not able to reproduce this with Tika
> trunk.
>
> The error means that a value for xmpMM:DocumentID was already set in the
> Metadata object, and you're trying to add another value. xmpMM:DocumentID
> is "SIMPLE" and only allows one value.
>
> Is nutch reusing the Metadata object, not clearing it, or prepopulating it
> with xmp metadata? I'll take a look at nutch.
>
> -----Original Message-----
> From: Vincent [mailto:vincent.s...@openindex.io]
> Sent: Monday, October 17, 2016 8:13 AM
> To: user@tika.apache.org
> Subject: Re: Error parsing PDFs
>
> Hi,
>
> After some additional trying I found that this error does not occur for
> this document in Tika 1.11. I forgot to mention in my last message that I
> was using Tika 1.13. So is this perhaps a bug in the new Tika version?
>
> Regards,
>
> Vincent
>
> On 17-10-16 13:37, Vincent wrote:
> > Hi all,
> >
> > I have some trouble using Tika to parse some PDFs. I crawl them with
> > Nutch 1.11, using parse-tika. Some documents will get parsed
> > correctly, but most won't, and the error isn't very clear to me:
> >
> > org.apache.tika.metadata.PropertyTypeException: xmpMM:DocumentID : SIMPLE
> >     at org.apache.tika.metadata.Metadata.add(Metadata.java:338)
> >     at org.apache.tika.parser.image.xmp.JempboxExtractor.addMetadata(JempboxExtractor.java:199)
> >     at org.apache.tika.parser.image.xmp.JempboxExtractor.extractXMPMM(JempboxExtractor.java:145)
> >     at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:216)
> >     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:136)
> >     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:167)
> >     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> >     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> >     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> >     at java.lang.Thread.run(Thread.java:745)
> >
> > I tested the document with PDFBox ExtractText, and it works fine.
> >
> > An example of a failing document is:
> >
> > https://gemeente.groningen.nl/system/files/1._jaarstukken_groninger_archieven_br_raad.pdf
> >
> > Any suggestions?
> >
> > Thanks in advance!
> >
> > Vincent Slot
>
> --
> *Open Source Solutions for Text Engineering*
>
> http://www.digitalpebble.com
> http://digitalpebble.blogspot.com/
> #digitalpebble <http://twitter.com/digitalpebble>

--
*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>
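
For anyone following the thread, the difference between "add" and "set" on a single-valued property, as described in the quoted messages, can be sketched in plain Java. This is only a minimal model under stated assumptions, not Tika's code: SingleValueStore, declareSingleValued and the placeholder ids are hypothetical names invented for illustration, and IllegalStateException stands in for Tika's PropertyTypeException.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a metadata container; NOT Tika's Metadata class.
class SingleValueStore {
    private final Map<String, List<String>> values = new HashMap<>();
    private final Set<String> singleValued = new HashSet<>();

    // Mark a property as single-valued (what the thread calls "SIMPLE").
    void declareSingleValued(String name) {
        singleValued.add(name);
    }

    // add() appends a value, but refuses a second value for a single-valued property.
    void add(String name, String value) {
        List<String> current = values.computeIfAbsent(name, k -> new ArrayList<>());
        if (singleValued.contains(name) && !current.isEmpty()) {
            // Assumption: stands in for the PropertyTypeException seen in the stack trace.
            throw new IllegalStateException(name + " : SIMPLE");
        }
        current.add(value);
    }

    // set() replaces whatever was stored, so a later value overwrites an earlier one.
    void set(String name, String value) {
        List<String> replacement = new ArrayList<>();
        replacement.add(value);
        values.put(name, replacement);
    }

    // Returns the first stored value, or null if the property is absent.
    String get(String name) {
        List<String> current = values.get(name);
        return (current == null || current.isEmpty()) ? null : current.get(0);
    }
}

class AddVersusSetDemo {
    public static void main(String[] args) {
        SingleValueStore store = new SingleValueStore();
        store.declareSingleValued("xmpMM:DocumentID");

        store.add("xmpMM:DocumentID", "first-id"); // placeholder id
        try {
            store.add("xmpMM:DocumentID", "second-id"); // second add is rejected
        } catch (IllegalStateException e) {
            System.out.println("add rejected: " + e.getMessage());
        }

        store.set("xmpMM:DocumentID", "second-id"); // set overwrites silently
        System.out.println("after set: " + store.get("xmpMM:DocumentID"));
    }
}
```

Switching to "set" would hide a duplicate rather than explain it, which is why the thread still asks where the second xmpMM:DocumentID comes from.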