You can run any plugin from the terminal using

  ./bin/nutch plugin

In the case of the HtmlParser main() method you would want to run

  ./bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser $pathToLocalFile

You have actually identified an improvement that the main() method for
this class could do with, e.g.
1) When the arguments are not correctly specified, it should print a
usage message to stdout explaining the correct plugin usage, as more or
less every other plugin does. Currently we just get a nasty stack trace
like the following:
Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.nutch.plugin.PluginRepository.main(PluginRepository.java:421)
Caused by: java.io.FileNotFoundException: http:/www.trancearoundtheworld.com (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:120)
    at org.apache.nutch.parse.html.HtmlParser.main(HtmlParser.java:274)
    ... 5 more
2) The plugin main method only lets you parse local files. An
improvement would be to add functionality similar to the parserchecker,
as highlighted by Sourajit.
If you wish to add these functions, then please open a Jira issue. The
contribution would be great.
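
For anyone picking this up, here is a minimal sketch of what the two improvements might look like. It is plain Java and does not touch the actual HtmlParser code: the class name HtmlParserMainSketch, the usage text, the exit codes and the helper names are all made up for illustration, and the step where the content would be handed to the parser is left as a comment.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

// Illustration only: this is not the real HtmlParser and not Nutch code.
public class HtmlParserMainSketch {

  // Hypothetical usage text; the real wording would be settled in the Jira issue.
  static final String USAGE =
      "Usage: HtmlParserMainSketch <pathToLocalFile | url>";

  // True when the argument looks like a remote URL rather than a local path.
  static boolean isRemote(String target) {
    return target.startsWith("http://") || target.startsWith("https://");
  }

  // Reads the raw bytes from either a URL or a local file.
  static byte[] readBytes(String target) throws IOException {
    if (!isRemote(target)) {
      return Files.readAllBytes(Paths.get(target));
    }
    try (InputStream in = new URL(target).openStream()) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
      return out.toByteArray();
    }
  }

  // Returns 0 on success, 1 on bad arguments, 2 when the input cannot be read.
  static int run(String[] args) {
    if (args.length != 1) {
      // Point 1: print a usage message instead of failing with a stack trace.
      System.out.println(USAGE);
      return 1;
    }
    try {
      // Point 2: accept a URL as well as a local file.
      byte[] content = readBytes(args[0]);
      // A real implementation would hand the content to the parser here.
      System.out.println("Read " + content.length + " bytes from " + args[0]);
      return 0;
    } catch (IOException e) {
      System.out.println("Could not read " + args[0] + ": " + e);
      return 2;
    }
  }

  public static void main(String[] args) {
    run(args);
  }
}
```

Run with no arguments, this prints the usage line rather than a stack trace. Run with an http:// or https:// argument, it fetches the content over the network instead of treating the URL as a file name, which is what produced the FileNotFoundException above.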
Thanks
Lewis
On Sun, Aug 26, 2012 at 4:18 AM, Shaya Potter <[email protected]> wrote:
> I'm trying to run the main function in HtmlParser (just to test how
> Nutch's parser works compared to others) and I can't seem to figure out how
> to get it to run.
>
> http://svn.apache.org/viewvc/nutch/branches/branch-1.5.1/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java?revision=1356339&view=markup
>
> when I run it naively, I get an error
>
> Exception in thread "main" java.lang.RuntimeException: org.apache.nutch.parse.HtmlParseFilter not found.
>     at org.apache.nutch.parse.HtmlParseFilters.<init>(HtmlParseFilters.java:55)
>
> Looking at HtmlParseFilters, I see that it throws the runtime exception
> if it can't find any HtmlParseFilter classes. However, I can't seem to
> figure out how to make it able to find them (I see the jars in the plugins
> dir, but do they have to be registered?). Could the main() in HtmlParser
> ever work as is?
>
> any pointers would be appreciated.
>
> thanks.
>