Hi
I am debugging Nutch in Eclipse on Ubuntu. The crawler itself runs
smoothly. However, when it tries to parse a PDF file, I get the error
message "failed(2,0): Can't retrieve Tika parser for mime-type
application/pdf".
I stepped into Tika with the debugger and found this constructor in the
TikaConfig class:
    public TikaConfig() throws MimeTypeException, IOException {
        ParseContext context = new ParseContext();
        Iterator<Parser> iterator = ServiceRegistry.lookupProviders(
                Parser.class, this.getClass().getClassLoader());
        while (iterator.hasNext()) {
            Parser parser = iterator.next();
            for (MediaType type : parser.getSupportedTypes(context)) {
                parsers.put(type.toString(), parser);
            }
        }
        mimeTypes = MimeTypesFactory.create("tika-mimetypes.xml");
    }
The while loop does nothing: it never puts an <application/pdf, parser
class> entry into its Map. That is why it cannot retrieve a parser for
the MIME type application/pdf. I strongly suspect that no Parser class
is registered with ServiceRegistry. However, even after I set the
property in nutch-site.xml and parse-plugins.xml, the problem is still
there.
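One way to test this suspicion might be a small check like the one below.
It is only a sketch, and it assumes that ServiceRegistry.lookupProviders
discovers parsers through the standard META-INF/services
provider-configuration files on the classpath. The class name
ProviderCheck and the helper findProviderFiles are made up for
illustration; they are not part of Nutch or Tika.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

public class ProviderCheck {

    // Returns the URL of every provider-configuration file for the given
    // service name that is visible to the given class loader.
    static List<URL> findProviderFiles(String serviceName, ClassLoader loader)
            throws IOException {
        return Collections.list(
                loader.getResources("META-INF/services/" + serviceName));
    }

    public static void main(String[] args) throws IOException {
        // Assumption: Tika parsers are declared under this service name.
        String service = "org.apache.tika.parser.Parser";
        List<URL> files = findProviderFiles(
                service, ProviderCheck.class.getClassLoader());
        if (files.isEmpty()) {
            System.out.println("No provider file found for " + service);
        }
        for (URL file : files) {
            System.out.println("Found provider file: " + file);
        }
    }
}
```

If this prints no provider file when run with the same classpath as the
Eclipse launch configuration, the jar that declares the parsers is
probably not visible to that class loader, which would match the empty
iterator seen in the debugger.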
Can anybody help me?
cheers
Xiao