I'm getting a lot of errors when parsing PDFs, and they come from TikaParser. Is Tika a reliable PDF parser? I'm just trying to understand whether Tika fails on most PDF files or only on PDFs with an incorrect format.
Thanks, aj
2010-07-11 03:06:11,867 WARN parse.Parser - Error parsing: http://www.ninds.nih.gov/research/epilepsyweb/epilepsy_benchmarks_guide_2007.pdf: failed(2,0): null
2010-07-11 03:06:11,900 ERROR tika.TikaParser - Error parsing http://www.ninds.nih.gov/research/molecular_libraries/Flyer.pdf
java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.pushbackinputstr...@14cf72c
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:380)
    at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:528)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:179)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:847)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:814)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:63)
    at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
    at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:85)
    at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:41)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)

--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA

