I'm getting lots of errors when parsing PDFs. They come from TikaParser. Is
Tika a reliable PDF parser? I'm just trying to understand whether Tika fails on
most PDF files or only on PDFs with an incorrect format. Thanks, AJ

2010-07-11 03:06:11,867 WARN  parse.Parser - Error parsing: http://www.ninds.nih.gov/research/epilepsyweb/epilepsy_benchmarks_guide_2007.pdf: failed(2,0): null
2010-07-11 03:06:11,900 ERROR tika.TikaParser - Error parsing http://www.ninds.nih.gov/research/molecular_libraries/Flyer.pdf
java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.pushbackinputstr...@14cf72c
        at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:380)
        at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:528)
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:179)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:847)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:814)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:63)
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:85)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:41)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
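
The "expected='endstream' actual=''" message suggests the parser reached the end of the data while still inside a stream, so one possibility is that the bytes handed to the parser are incomplete rather than Tika being unreliable in general. To help tell the two cases apart, a rough check like the one below could be run on a saved copy of a failing file. This is only a sketch: the class name PdfTruncationCheck and the method looksComplete are made up for illustration and are not part of Tika, PDFBox, or Nutch. It tests three things (a %PDF- header, a %%EOF marker near the end, and matching stream/endstream counts), which is a heuristic and not a real validation; binary stream data that happens to contain the word "stream" would confuse it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PdfTruncationCheck {

    /** Counts occurrences of token in text. */
    public static int count(String text, String token) {
        int n = 0;
        int i = text.indexOf(token);
        while (i >= 0) {
            n++;
            i = text.indexOf(token, i + token.length());
        }
        return n;
    }

    /** Heuristic only: true if the bytes look like a complete PDF file. */
    public static boolean looksComplete(byte[] pdf) {
        // ISO-8859-1 maps every byte to one char, so binary data survives.
        String s = new String(pdf, StandardCharsets.ISO_8859_1);

        // Assumption: the header sits at offset 0 (some files have leading junk).
        if (!s.startsWith("%PDF-")) {
            return false;
        }

        // A complete file normally ends with a %%EOF marker.
        String tail = s.substring(Math.max(0, s.length() - 1024));
        if (!tail.contains("%%EOF")) {
            return false;
        }

        // "endstream" contains "stream", so subtract to get the opening keywords.
        int ends = count(s, "endstream");
        int opens = count(s, "stream") - ends;
        return opens == ends;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfTruncationCheck <file.pdf>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(args[0] + " looks complete: " + looksComplete(data));
    }
}
```

If the check reports false for the files that fail and true for files that parse, the problem is more likely in the input data than in the parser itself.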


-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
