Hi,
While addressing and porting NUTCH-840 [0], I've discovered a couple of
anomalies.
Within org.apache.nutch.tika.TestDOMContentUtils#setup, trunk uses
org.cyberneko.html.parsers.DOMFragmentParser like so:

  private static void setup() throws Exception {
    conf = NutchConfiguration.create();
    conf.setBoolean("parser.html.form.use_action", true);
    utils = new DOMContentUtils(conf);
    DOMFragmentParser parser = new DOMFragmentParser();
    for (int i = 0; i < testPages.length; i++) {
      DocumentFragment node = new HTMLDocumentImpl().createDocumentFragment();
      try {
        parser.parse(
            new InputSource(new ByteArrayInputStream(testPages[i].getBytes())),
            node);
        testBaseHrefURLs[i] = new URL(testBaseHrefs[i]);
      } catch (Exception e) {
        assertTrue("caught exception: " + e, false);

Whereas 2.x proposes to use org.apache.nutch.parse.tika.TikaParser like so:

  private static void setup() throws Exception {
    conf = NutchConfiguration.create();
    conf.setBoolean("parser.html.form.use_action", true);
    utils = new DOMContentUtils(conf);
    TikaParser tikaParser = new TikaParser();
    tikaParser.setConf(conf);
    Parser parser = tikaParser.getTikaConfig().getParser("text/html");
    for (int i = 0; i < testPages.length; i++) {
      Metadata tikamd = new Metadata();
      HTMLDocumentImpl doc = new HTMLDocumentImpl();
      doc.setErrorChecking(false);
      DocumentFragment root = doc.createDocumentFragment();
      DOMBuilder domhandler = new DOMBuilder(doc, root);
      ParseContext context = new ParseContext();
      // to add once available in Tika
      //context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);
      try {
        parser.parse(new ByteArrayInputStream(testPages[i].getBytes()),
            domhandler, tikamd, context);
        testBaseHrefURLs[i] = new URL(testBaseHrefs[i]);
      } catch (Exception e) {
        e.printStackTrace();
        fail("caught exception: " + e);

Some observations then:
* All 3 tests pass on trunk when using DOMFragmentParser, but only the
first test passes on 2.x when the above code is executed and the JUnit
assertions are made.
* Even when the trunk code above is ported to 2.x, we still have a
failing test, which indicates unpredictable/different parsing behavior.
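
To see where the two DOM trees actually diverge, a small dump helper might be useful. The following is only a sketch using the plain JDK org.w3c.dom API; DomDump is a hypothetical name of mine, not something that exists in Nutch or Tika, and I have not run it against the test pages. It prints one node per line, with attributes sorted by name so that the output from both parsers can be diffed as text.

```java
import java.util.Map;
import java.util.TreeMap;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

/** Hypothetical diagnostic helper (not part of Nutch or Tika). */
final class DomDump {

  /** Returns an indented, one-node-per-line rendering of the subtree under root. */
  static String dump(Node root) {
    StringBuilder sb = new StringBuilder();
    dump(root, 0, sb);
    return sb.toString();
  }

  private static void dump(Node node, int depth, StringBuilder sb) {
    for (int i = 0; i < depth; i++) {
      sb.append("  ");
    }
    switch (node.getNodeType()) {
      case Node.ELEMENT_NODE:
        // Element names are printed exactly as the parser produced them,
        // so differences in case or in dropped elements stay visible.
        sb.append('<').append(node.getNodeName());
        // Sort attributes by name so attribute order does not cause noise.
        Map<String, String> sorted = new TreeMap<>();
        NamedNodeMap attrs = node.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
          Node attr = attrs.item(i);
          sorted.put(attr.getNodeName(), attr.getNodeValue());
        }
        for (Map.Entry<String, String> e : sorted.entrySet()) {
          sb.append(' ').append(e.getKey())
            .append("=\"").append(e.getValue()).append('"');
        }
        sb.append('>');
        break;
      case Node.TEXT_NODE:
        sb.append("#text \"").append(node.getNodeValue().trim()).append('"');
        break;
      default:
        // Document fragments, comments, etc.: just print the node name.
        sb.append(node.getNodeName());
    }
    sb.append('\n');
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      dump(c, depth + 1, sb);
    }
  }
}
```

In each setup() variant one could then print DomDump.dump(node) for the cyberneko fragment and DomDump.dump(root) for the Tika fragment and diff the two outputs for the failing test pages.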
One final problem I am having is that the line below always returns null
when I debug it in Eclipse; however, this must be down to my Eclipse
environment!

  Parser parser = tikaParser.getTikaConfig().getParser("text/html");

Has anyone experienced, or is anyone aware of, any discrepancies in
DOM/DOMFragment handling between cyberneko and Tika within Nutch? Maybe
this is one for the Tika mailing list?
Thanks all the same
Lewis
[0] https://issues.apache.org/jira/browse/NUTCH-840
--
*Lewis*