Hi all,

I'm having trouble using Tika to parse some PDFs. I crawl them with Nutch 1.11, using the parse-tika plugin. Some documents are parsed correctly, but most are not, and the error isn't very clear to me:

org.apache.tika.metadata.PropertyTypeException: xmpMM:DocumentID : SIMPLE
        at org.apache.tika.metadata.Metadata.add(Metadata.java:338)
        at org.apache.tika.parser.image.xmp.JempboxExtractor.addMetadata(JempboxExtractor.java:199)
        at org.apache.tika.parser.image.xmp.JempboxExtractor.extractXMPMM(JempboxExtractor.java:145)
        at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:216)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:136)
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:167)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

I tested the same document with PDFBox's ExtractText tool directly, and it works fine.

An example of a failing document is:

https://gemeente.groningen.nl/system/files/1._jaarstukken_groninger_archieven_br_raad.pdf

Any suggestions?

Thanks in advance!
Vincent Slot
