[ https://issues.apache.org/jira/browse/NUTCH-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243194#comment-14243194 ]
Sebastian Nagel commented on NUTCH-1898:
----------------------------------------

The raw document can also be viewed with the following command (other protocol implementations are similar):
{noformat}
bin/nutch plugin protocol-http org.apache.nutch.protocol.http.Http <url>
{noformat}

But agreed: to find problems it's always useful to have a look at the raw HTML, and it's easier to have few but powerful debugging tools.

Or does "raw" mean the serialized DOM, which (1) is also available for binary, non-HTML document formats, (2) may look slightly different, and (3) isn't easily viewed by any other tool?

> Add -dumpRawHTML parameter to parsechecker tool
> -----------------------------------------------
>
>                 Key: NUTCH-1898
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1898
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.9, 2.2.1
>            Reporter: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 2.4, 1.10
>
>
> The ability to obtain raw HTML alongside all of the other parse data we get
> within the existing parsechecker would complement the tool.
> This issue should merely append the raw HTML markup to the existing output.
> It should be an optional parameter, same as -dumpText

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)