Hi all,
I'm having trouble parsing some PDFs with Tika. I crawl them with
Nutch 1.11, using the parse-tika plugin. Some documents are parsed
correctly, but most fail with the following exception, which isn't very
clear to me:
org.apache.tika.metadata.PropertyTypeException: xmpMM:DocumentID : SIMPLE
    at org.apache.tika.metadata.Metadata.add(Metadata.java:338)
    at org.apache.tika.parser.image.xmp.JempboxExtractor.addMetadata(JempboxExtractor.java:199)
    at org.apache.tika.parser.image.xmp.JempboxExtractor.extractXMPMM(JempboxExtractor.java:145)
    at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:216)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:136)
    at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:167)
    at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
    at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
I tested the same document with PDFBox's ExtractText tool, and it
extracts the text without problems.
An example of a failing document is:
https://gemeente.groningen.nl/system/files/1._jaarstukken_groninger_archieven_br_raad.pdf
Any suggestions?
Thanks in advance!
Vincent Slot