If you just want to check the parser, use this command:

$ nutch parsechecker -dumpText <url>

This should work out of the box without any modification.
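
As for the "HtmlParseFilter not found" error when calling HtmlParser's main() directly: as far as I can tell, that exception is raised when the plugin repository cannot find the extension point, which usually means the plugins directory itself was not located. The bin/nutch script sets up the classpath for you; when launching the class yourself, one possible workaround (an untested sketch, not a confirmed fix) is to put a nutch-site.xml on your classpath that points the plugin.folders property at the plugins dir. The path below is a placeholder that you would need to replace with your own:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Where Nutch looks for plugins. The default is the relative path
       "plugins", which is resolved against the classpath. -->
  <property>
    <name>plugin.folders</name>
    <!-- Hypothetical absolute path; substitute your own install location. -->
    <value>/path/to/nutch/runtime/local/plugins</value>
  </property>
</configuration>
```

The conf directory holding this file (along with nutch-default.xml) has to be on the classpath as well, otherwise the setting is never read.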

On Sun, Aug 26, 2012 at 8:48 AM, Shaya Potter wrote:

> I'm trying to run the main function in HtmlParser (just to test how
> Nutch's parser works compared to others) and I can't seem to figure out how
> to get it to run.
>
> http://svn.apache.org/viewvc/nutch/branches/branch-1.5.1/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java?revision=1356339&view=markup
>
> when I run it naively, I get an error
>
> Exception in thread "main" java.lang.RuntimeException:
> org.apache.nutch.parse.HtmlParseFilter not found.
>     at org.apache.nutch.parse.HtmlParseFilters.<init>(HtmlParseFilters.java:55)
>
> in looking at HtmlParseFilters, I see that it throws the runtime exception
> if it can't find any HtmlParseFilter classes, however, I can't seem to
> figure out how to make it able to find them (I see the jars in the plugins
> dir, but do they have to be registered?). could the main() in HtmlParser
> ever work as is?
>
> any pointers would be appreciated.
>
> thanks.
>
>
