Hmmm… Thank you, Julien. I'm trying to find the exact version of Nutch's TikaParser that would produce that stack trace. I don't see one where line 167 is the call to Tika's parser to parse the PDF. Any recommendations?
I'm at a loss to figure out how Tika would be adding xmpMM:DocumentID more than once. Any ideas?

We could change our code to "set", and if there are multiples, that would overwrite the earlier IDs, but there really should only be one DocumentID.

Also, any pointers to setting up Nutch in IntelliJ aside from what Google returns? It seems to be non-trivial.

From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com]
Sent: Monday, October 17, 2016 9:39 AM
To: user@tika.apache.org
Subject: Re: Error parsing PDFs

The Metadata object is brand new for each document parsed, see
[https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java#L102]

On 17 October 2016 at 14:03, Allison, Timothy B. <talli...@mitre.org> wrote:

New feature. :) We didn't extract xmpMM in 1.11.

Thank you for sharing a test file! I'm not able to reproduce this with Tika trunk.

The error means that a value for xmpMM:DocumentID was already set in the Metadata object, and you're trying to add another value. xmpMM:DocumentID is "SIMPLE" and only allows one value.

Is Nutch reusing the Metadata object, not clearing it, or prepopulating it with XMP metadata?

I'll take a look at Nutch.

-----Original Message-----
From: Vincent [mailto:vincent.s...@openindex.io]
Sent: Monday, October 17, 2016 8:13 AM
To: user@tika.apache.org
Subject: Re: Error parsing PDFs

Hi,

After some additional testing I found that this error does not occur for this document in Tika 1.11. I forgot to mention in my last message that I was using Tika 1.13. So is this perhaps a bug in the new Tika version?

Regards,
Vincent

On 17-10-16 13:37, Vincent wrote:
> Hi all,
>
> I have some trouble using Tika to parse some PDFs. I crawl them with
> Nutch 1.11, using parse-tika. Some documents will get parsed
> correctly, but most won't, and the error isn't very clear to me:
>
> org.apache.tika.metadata.PropertyTypeException: xmpMM:DocumentID : SIMPLE
>     at org.apache.tika.metadata.Metadata.add(Metadata.java:338)
>     at org.apache.tika.parser.image.xmp.JempboxExtractor.addMetadata(JempboxExtractor.java:199)
>     at org.apache.tika.parser.image.xmp.JempboxExtractor.extractXMPMM(JempboxExtractor.java:145)
>     at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:216)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:136)
>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:167)
>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
>
> I tested the document with PDFBox ExtractText, and it works fine.
>
> An example of a failing document is:
>
> https://gemeente.groningen.nl/system/files/1._jaarstukken_groninger_archieven_br_raad.pdf
>
> Any suggestions?
>
> Thanks in advance!
> Vincent Slot

--
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>
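
For readers following the "add" versus "set" discussion in this thread, here is a minimal Java sketch of the difference, assuming only what the thread states: a single-valued ("SIMPLE") property rejects a second added value, whereas setting it overwrites the earlier one. `SingleValueStore` is a hypothetical stand-in written for illustration, not Tika's `Metadata` class, and the `uuid:` values are made up.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a metadata container whose properties are all
// single-valued. This is NOT Tika's Metadata class; it only models the
// behaviour described in the thread.
class SingleValueStore {
    private final Map<String, String> values = new HashMap<>();

    // "add" semantics: a second value for an existing single-valued key is rejected.
    void add(String name, String value) {
        if (values.containsKey(name)) {
            // Message format imitates the one in the stack trace (assumption).
            throw new IllegalStateException(name + " : SIMPLE");
        }
        values.put(name, value);
    }

    // "set" semantics: the last value wins and silently replaces the earlier one.
    void set(String name, String value) {
        values.put(name, value);
    }

    String get(String name) {
        return values.get(name);
    }

    public static void main(String[] args) {
        SingleValueStore store = new SingleValueStore();
        store.add("xmpMM:DocumentID", "uuid:first");    // made-up value
        try {
            store.add("xmpMM:DocumentID", "uuid:second"); // second add is rejected
        } catch (IllegalStateException e) {
            System.out.println("add rejected: " + e.getMessage());
        }
        store.set("xmpMM:DocumentID", "uuid:second");   // set overwrites instead
        System.out.println(store.get("xmpMM:DocumentID"));
    }
}
```

Switching to "set" would hide the exception, but it would also hide whatever is producing the duplicate DocumentID, which is the open question in this thread.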