[ 
https://issues.apache.org/jira/browse/NUTCH-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243194#comment-14243194
 ] 

Sebastian Nagel commented on NUTCH-1898:
----------------------------------------

The raw document can also be viewed with the following command (other protocol 
implementations work similarly):
{noformat}
bin/nutch plugin protocol-http org.apache.nutch.protocol.http.Http  <url>
{noformat}
But agreed: when hunting for problems it's always useful to look at the raw HTML, 
and it's easier to have a few powerful debugging tools.
Or does "raw" mean the serialized DOM, which (1) is also available for binary, 
non-HTML document formats, (2) may look slightly different, and (3) isn't 
easily viewed with any other tool?

> Add -dumpRawHTML parameter to parsechecker tool
> -----------------------------------------------
>
>                 Key: NUTCH-1898
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1898
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.9, 2.2.1
>            Reporter: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 2.4, 1.10
>
>
> The ability to obtain raw HTML alongside all of the other parse data we get 
> within the existing parsechecker would complement the tool.
> This issue should merely append the raw HTML markup to the existing output. 
> It should be an optional parameter, the same as -dumpText.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
