Hi,
In the process of addressing and porting NUTCH-840 [0], I've discovered a
couple of anomalies.
Within org.apache.nutch.tika.TestDOMContentUtils#setup, trunk uses
org.cyberneko.html.parsers.DOMFragmentParser like so:

  private static void setup() throws Exception {
    conf = NutchConfiguration.create();
    conf.setBoolean("parser.html.form.use_action", true);
    utils = new DOMContentUtils(conf);
    DOMFragmentParser parser = new DOMFragmentParser();
    for (int i = 0; i < testPages.length; i++) {
      DocumentFragment node = new HTMLDocumentImpl().createDocumentFragment();
      try {
        parser.parse(
            new InputSource(new ByteArrayInputStream(testPages[i].getBytes())),
            node);
        testBaseHrefURLs[i] = new URL(testBaseHrefs[i]);
      } catch (Exception e) {
        assertTrue("caught exception: " + e, false);

Whereas 2.x proposes to use org.apache.nutch.parse.tika.TikaParser like so

        private static void setup() throws Exception {
                conf = NutchConfiguration.create();
                conf.setBoolean("parser.html.form.use_action", true);
                utils = new DOMContentUtils(conf);
                TikaParser tikaParser = new TikaParser();
                tikaParser.setConf(conf);
                Parser parser = tikaParser.getTikaConfig().getParser("text/html");
                for (int i = 0; i < testPages.length; i++) {
                        Metadata tikamd = new Metadata();

                        HTMLDocumentImpl doc = new HTMLDocumentImpl();
                        doc.setErrorChecking(false);
                        DocumentFragment root = doc.createDocumentFragment();
                        DOMBuilder domhandler = new DOMBuilder(doc, root);
                        ParseContext context = new ParseContext();
                        // to add once available in Tika
                        //context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);
                        try {
                                parser.parse(new ByteArrayInputStream(testPages[i].getBytes()),
                                                domhandler, tikamd, context);
                                testBaseHrefURLs[i] = new URL(testBaseHrefs[i]);
                        } catch (Exception e) {
                                e.printStackTrace();
                                fail("caught exception: " + e);


Some observations then:

* All 3 tests pass on trunk when using DOMFragmentParser, but only the
first test passes on 2.x when the above code is executed and the JUnit
assertions are made.
* Even when the trunk code above is ported to 2.x, we still have a
failing test, indicating unpredictable/different parsing behavior.
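
One way to narrow down where the two DOM trees diverge might be to dump
each DocumentFragment as an indented outline and diff the two outputs.
Below is a minimal sketch using only the JDK's org.w3c.dom API. The
DomDump class and its dump method are hypothetical names of my own, not
part of Nutch or Tika, and the small hand-built fragment in main simply
stands in for whatever DOMFragmentParser or DOMBuilder would produce.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical debugging helper, not part of Nutch or Tika.
public class DomDump {

  /** Returns an indented outline of the subtree rooted at the given node. */
  public static String dump(Node node) {
    StringBuilder sb = new StringBuilder();
    dump(node, 0, sb);
    return sb.toString();
  }

  private static void dump(Node node, int depth, StringBuilder sb) {
    // Two spaces of indentation per tree level.
    for (int i = 0; i < depth; i++) {
      sb.append("  ");
    }
    sb.append(node.getNodeName());
    // For text nodes, also show the (trimmed) text content.
    if (node.getNodeType() == Node.TEXT_NODE) {
      sb.append(" \"").append(node.getNodeValue().trim()).append('"');
    }
    sb.append('\n');
    // Recurse over the children in document order.
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      dump(c, depth + 1, sb);
    }
  }

  public static void main(String[] args) throws Exception {
    // Stand-in fragment; in the real test this would be the parser output.
    Document doc =
        DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    DocumentFragment root = doc.createDocumentFragment();
    Element a = doc.createElement("a");
    a.setAttribute("href", "http://www.nutch.org");
    a.appendChild(doc.createTextNode("anchor"));
    root.appendChild(a);
    System.out.print(dump(root));
  }
}
```

Calling dump on the fragment produced by each setup variant, for the
same test page, and comparing the two strings should show whether the
parsers disagree on element names, nesting, or text nodes.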

One final problem I am having is that the code below always returns null
when I debug it in Eclipse, although this must be down to my Eclipse
environment!

Parser parser = tikaParser.getTikaConfig().getParser("text/html");

Has anyone experienced, or is anyone aware of, any apparent discrepancies
in DOM/DOMFragment handling between cyberneko and Tika within Nutch? Maybe
this is one for the Tika mailing list?
Thanks all the same
Lewis

[0] https://issues.apache.org/jira/browse/NUTCH-840

-- 
*Lewis*
