New feature. :) We didn't extract xmpMM in 1.11. Thank you for sharing a test file! I'm not able to reproduce this with Tika trunk.
The error means that a value for xmpMM:DocumentID was already set in the Metadata object, and you're trying to add another value. xmpMM:DocumentID is "SIMPLE" and only allows one value. Is Nutch reusing the Metadata object, not clearing it, or prepopulating it with XMP metadata? I'll take a look at Nutch.

-----Original Message-----
From: Vincent [mailto:[email protected]]
Sent: Monday, October 17, 2016 8:13 AM
To: [email protected]
Subject: Re: Error parsing PDFs

Hi,

After some additional trying I found that this error does not occur for this document in Tika 1.11. I forgot to mention in my last message that I was using Tika 1.13. So is this perhaps a bug in the new Tika version?

Regards,
Vincent

On 17-10-16 13:37, Vincent wrote:
> Hi all,
>
> I have some trouble using Tika to parse some PDFs. I crawl them with
> Nutch 1.11, using parse-tika. Some documents will get parsed
> correctly, but most won't, and the error isn't very clear to me:
>
> org.apache.tika.metadata.PropertyTypeException: xmpMM:DocumentID : SIMPLE
>     at org.apache.tika.metadata.Metadata.add(Metadata.java:338)
>     at org.apache.tika.parser.image.xmp.JempboxExtractor.addMetadata(JempboxExtractor.java:199)
>     at org.apache.tika.parser.image.xmp.JempboxExtractor.extractXMPMM(JempboxExtractor.java:145)
>     at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:216)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:136)
>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:167)
>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
>
> I tested the document with PDFBox ExtractText, and it works fine.
>
> An example of a failing document is:
>
> https://gemeente.groningen.nl/system/files/1._jaarstukken_groninger_archieven_br_raad.pdf
>
> Any suggestions?
>
> Thanks in advance!
> Vincent Slot
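
P.S. To make the single-value rule concrete, here is a minimal sketch in plain Java that needs no Tika on the classpath. It is a simplified stand-in and not Tika's actual code: the class `SimpleMetadataModel`, its `add`/`get`/`clear` methods, the `multiValued` flag and the use of `IllegalStateException` are all hypothetical names chosen for illustration. It only models the behaviour described in this thread: a second `add` on a single-valued ("SIMPLE") property is rejected, while a cleared or fresh object accepts the value again.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a metadata store with single-valued properties.
// This is an illustration only, not the Tika Metadata class.
public class SimpleMetadataModel {
    private final Map<String, List<String>> values = new HashMap<>();

    // Adds a value. If the property is single-valued and already has a
    // value, the call is rejected instead of silently appending.
    public void add(String name, boolean multiValued, String value) {
        List<String> existing = values.get(name);
        if (existing == null) {
            existing = new ArrayList<>();
            values.put(name, existing);
        } else if (!multiValued) {
            // Message shaped like the one in the stack trace (assumption).
            throw new IllegalStateException(name + " : SIMPLE");
        }
        existing.add(value);
    }

    // Returns the stored values, or an empty list if none were set.
    public List<String> get(String name) {
        List<String> existing = values.get(name);
        return existing == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(existing);
    }

    // Drops everything, as a caller would do before reusing the object.
    public void clear() {
        values.clear();
    }

    public static void main(String[] args) {
        SimpleMetadataModel metadata = new SimpleMetadataModel();
        metadata.add("xmpMM:DocumentID", false, "uuid:first");
        try {
            // Simulates a reused or prepopulated object being parsed into again.
            metadata.add("xmpMM:DocumentID", false, "uuid:second");
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
        metadata.clear();
        metadata.add("xmpMM:DocumentID", false, "uuid:second");
        System.out.println(metadata.get("xmpMM:DocumentID"));
    }
}
```

If the calling code does reuse or prepopulate one metadata object per document, then clearing it or creating a fresh one before each parse would avoid the second `add` in this model. Whether that is what happens in Nutch is still an open question.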
