nutch parse Tika problem

Xiao Li Wed, 21 Dec 2011 21:11:59 -0800

Hi

I am debugging Nutch in Eclipse on Ubuntu. The crawler itself runs
smoothly. However, when it tries to parse a PDF file, I get the error
message "failed(2,0): Can't retrieve Tika parser for mime-type
application/pdf".


I stepped into Tika with the debugger and found this in the TikaConfig class:

public TikaConfig() throws MimeTypeException, IOException {
    ParseContext context = new ParseContext();
    Iterator<Parser> iterator = ServiceRegistry.lookupProviders(
        Parser.class, this.getClass().getClassLoader());
    while (iterator.hasNext()) {
        Parser parser = iterator.next();
        for (MediaType type : parser.getSupportedTypes(context)) {
            parsers.put(type.toString(), parser);
        }
    }
    mimeTypes = MimeTypesFactory.create("tika-mimetypes.xml");
}

The while loop does not do anything: it never puts an <application/pdf,
parser> entry into the map. That is why no parser can be retrieved for
mime-type application/pdf. I strongly suspect that no Parser class is
registered with the ServiceRegistry. However, even after I set the property
in nutch-site.xml and parse-plugins.xml, the problem remains.
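
One way to test that suspicion is sketched below. ServiceRegistry.lookupProviders
discovers providers through META-INF/services/<interface name> files on the
class path, so listing those files shows whether any jar registers a Tika
parser at all. The class name ProviderFileCheck and the helper
findProviderFiles are made up for illustration, and the sketch assumes it is
run with the same class path (and ideally the same class loader) that
TikaConfig sees.

```java
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic helper, not part of Nutch or Tika.
public class ProviderFileCheck {

    // Returns the location of every provider file for the given service
    // interface that the given class loader can see.
    public static List<URL> findProviderFiles(String serviceName, ClassLoader loader)
            throws IOException {
        List<URL> found = new ArrayList<URL>();
        // A jar that registers providers ships META-INF/services/<interface name>.
        String resource = "META-INF/services/" + serviceName;
        for (URL url : Collections.list(loader.getResources(resource))) {
            found.add(url);
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: this class is loaded by the same class loader as TikaConfig.
        ClassLoader loader = ProviderFileCheck.class.getClassLoader();
        List<URL> files = findProviderFiles("org.apache.tika.parser.Parser", loader);
        if (files.isEmpty()) {
            System.out.println("No provider file for org.apache.tika.parser.Parser found");
        }
        for (URL url : files) {
            System.out.println("Provider file: " + url);
        }
    }
}
```

If nothing is listed, the jar that contains the Tika parser implementations
is probably not visible to that class loader, which would explain the empty
iterator regardless of what the Nutch configuration files say.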

Can anybody help me?

cheers
Xiao
