Hi AJ,
Tika wraps PDFBox, which is a pretty good PDF parser.
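From the error below (expected='endstream' actual=''), it looks like
the parser ran out of data in the middle of a stream, so I'm wondering
if you've got file.content.limit set to something too small.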
Partial PDFs cannot be parsed by PDFBox (or just about any PDF parser
that I know of). I usually use a max size of 10MB if I'm processing
PDFs. Or you can use -1 to specify no limit.
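For reference, here's roughly what that override might look like in
conf/nutch-site.xml. Treat it as a sketch, not a drop-in config: the
value is in bytes (10MB = 10485760), and the http.content.limit entry
is my assumption, added because the failing URLs are http:// and, as
far as I know, each protocol plugin reads its own limit.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml (sketch): merge these properties into your
     existing <configuration> element rather than replacing the file. -->
<configuration>
  <!-- Limit for content fetched via the file protocol, in bytes.
       10485760 = 10MB; a negative value such as -1 means no limit. -->
  <property>
    <name>file.content.limit</name>
    <value>10485760</value>
  </property>
  <!-- Assumption: same limit for content fetched over HTTP. -->
  <property>
    <name>http.content.limit</name>
    <value>10485760</value>
  </property>
</configuration>
```

Since the limit is applied when the content is fetched, segments that
already hold truncated PDFs would most likely need to be re-fetched
before parsing them again.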
-- Ken
On Jul 11, 2010, at 2:50pm, AJ Chen wrote:
I'm getting lots of errors from parsing PDFs. They come from
TikaParser. Is Tika a reliable PDF parser? I'm just trying to
understand whether Tika fails on most PDF files or only on PDFs with
an incorrect format. Thanks, aj
2010-07-11 03:06:11,867 WARN parse.Parser - Error parsing: http://www.ninds.nih.gov/research/epilepsyweb/epilepsy_benchmarks_guide_2007.pdf: failed(2,0): null
2010-07-11 03:06:11,900 ERROR tika.TikaParser - Error parsing http://www.ninds.nih.gov/research/molecular_libraries/Flyer.pdf
java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.pushbackinputstr...@14cf72c
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:380)
at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:528)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:179)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:847)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:814)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:63)
at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:85)
at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:41)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
--
AJ Chen, PhD
--------------------------------------------
Ken Krugler