Hello,
I am trying to use the nutch-2.x trunk to parse text/html content with Tika.
I get the error "parser for text/html not found".
I see that the parse-tika code was changed. These lines
// get the right parser using the mime type as a clue
String mimeType = page.getContentType().toString();
CompositeParser compositeParser = (CompositeParser) tikaConfig.getParser();
Parser parser = compositeParser.getParsers().get(MediaType.parse(mimeType));
return no parser.
However, if I revert to the older version with
// get the right parser using the mime type as a clue
String mimeType = page.getContentType().toString();
Parser parser = tikaConfig.getParser(mimeType);
it works.
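
For anyone trying to reproduce this, the difference between an exact map lookup and a more tolerant lookup can be modelled without Tika. The sketch below is only an illustration under assumptions: it uses a plain HashMap of strings in place of the map returned by getParsers(), and the class name ParserLookupSketch, the registry contents and the "text/html; charset=UTF-8" value are all made up for the example. It shows how an exact-key lookup can miss when the key differs at all from the registered type (for example, if the content type carried parameters), while a lookup that first reduces the value to its base type still matches. Whether this is what actually happens in parse-tika is not confirmed here.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class ParserLookupSketch {

    // Hypothetical stand-in for a parser registry keyed by media type.
    // The real map in Tika is keyed by MediaType objects, not strings.
    static Map<String, String> registry() {
        Map<String, String> parsers = new HashMap<>();
        parsers.put("text/html", "HtmlParser");
        parsers.put("application/pdf", "PDFParser");
        return parsers;
    }

    // Exact-key lookup: the key must equal the registered type exactly.
    static String exactLookup(Map<String, String> parsers, String mimeType) {
        return parsers.get(mimeType);
    }

    // Drop any parameters (everything from ';' on) and normalize case
    // before looking the type up.
    static String baseTypeLookup(Map<String, String> parsers, String mimeType) {
        int semicolon = mimeType.indexOf(';');
        String base = semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType;
        return parsers.get(base.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Map<String, String> parsers = registry();
        String withCharset = "text/html; charset=UTF-8"; // assumed example value

        System.out.println(exactLookup(parsers, withCharset));    // prints "null"
        System.out.println(baseTypeLookup(parsers, withCharset)); // prints "HtmlParser"
        System.out.println(exactLookup(parsers, "text/html"));    // prints "HtmlParser"
    }
}
```

Logging the exact value of mimeType and the keys of the map returned by getParsers() just before the lookup would show whether a mismatch of this kind is involved.
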
Has anyone tested the new tika with text/html types?
Thanks.
Alex.