Hi Shaya,

Can you elaborate? The plugin has been around for a good while. If you
have suggestions for improving it, they are very welcome.

Thanks

On Sun, Aug 26, 2012 at 1:41 PM, Shaya Potter <[email protected]> wrote:
> OK, so it seems that Nutch isn't doing much differently (at least in the
> smattering of tests I've done) from Jsoup's Document.text() (from what I
> can tell, the only differences may be in the spacing between elements).
>
> On 08/26/2012 06:28 AM, Lewis John Mcgibbney wrote:
>>
>> You can easily run any plugin from the terminal using
>>
>> ./bin/nutch plugin
>>
>> In the case of the HtmlParser main() method you would want to run
>>
>> ./bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser $pathToLocalFile
>>
>> You have actually identified an area for improvement in the main()
>> method of this class, e.g.
>>
>> 1) When the arguments are not correctly specified, it should print a
>> usage message to stdout explaining the correct plugin usage, as more
>> or less every other plugin does. Currently we just get a nasty stack
>> trace like the following:
>>
>> Exception in thread "main" java.lang.reflect.InvocationTargetException
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at org.apache.nutch.plugin.PluginRepository.main(PluginRepository.java:421)
>> Caused by: java.io.FileNotFoundException: http:/www.trancearoundtheworld.com (No such file or directory)
>>         at java.io.FileInputStream.open(Native Method)
>>         at java.io.FileInputStream.<init>(FileInputStream.java:120)
>>         at org.apache.nutch.parse.html.HtmlParser.main(HtmlParser.java:274)
>>         ... 5 more
>>
>> 2) The plugin main method only enables you to parse local files. An
>> improvement would be to add functionality similar to the parserchecker,
>> as highlighted by Sourajit.
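
The argument check described in point 1 could look roughly like the sketch below. It is not the actual HtmlParser code: the class name UsageCheckSketch, the method checkArgs and the wording of the usage text are made up for illustration, and only the command form is taken from this thread. Printing to stderr rather than stdout is also a choice of this sketch.

```java
import java.io.File;

// Hypothetical sketch, not the real org.apache.nutch.parse.html.HtmlParser.
public class UsageCheckSketch {

  // Assumed usage text, modelled on the command shown earlier in this thread.
  static final String USAGE =
      "Usage: ./bin/nutch plugin parse-html "
      + "org.apache.nutch.parse.html.HtmlParser <pathToLocalFile>";

  // Returns an error message for bad arguments, or null when they look fine.
  public static String checkArgs(String[] args) {
    if (args == null || args.length != 1) {
      return USAGE;
    }
    File file = new File(args[0]);
    if (!file.isFile()) {
      // Covers both a missing path and a URL passed by mistake.
      return "Not a local file: " + args[0] + "\n" + USAGE;
    }
    return null;
  }

  public static void main(String[] args) {
    String error = checkArgs(args);
    if (error != null) {
      // Print a short message instead of letting a stack trace escape.
      System.err.println(error);
      System.exit(1);
    }
    System.out.println("Arguments OK, would parse " + args[0]);
  }
}
```

With a guard like this, passing a URL or no argument at all would produce the usage line instead of the FileNotFoundException shown above.
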
>>
>> If you wish to add these functions then please open a Jira issue; the
>> contribution would be great.
>>
>> Thanks
>>
>> Lewis
>>
>> On Sun, Aug 26, 2012 at 4:18 AM, Shaya Potter <[email protected]> wrote:
>>>
>>> I'm trying to run the main function in HtmlParser (just to test how
>>> Nutch's parser works compared to others) and I can't seem to figure out
>>> how to get it to run.
>>>
>>>
>>> http://svn.apache.org/viewvc/nutch/branches/branch-1.5.1/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java?revision=1356339&view=markup
>>>
>>> When I run it naively, I get an error:
>>>
>>> Exception in thread "main" java.lang.RuntimeException: org.apache.nutch.parse.HtmlParseFilter not found.
>>>      at org.apache.nutch.parse.HtmlParseFilters.<init>(HtmlParseFilters.java:55)
>>>
>>> Looking at HtmlParseFilters, I see that it throws the runtime exception
>>> if it can't find any HtmlParseFilter classes. However, I can't seem to
>>> figure out how to make it find them (I see the jars in the plugins
>>> dir, but do they have to be registered?). Could the main() in HtmlParser
>>> ever work as is?
>>>
>>> any pointers would be appreciated.
>>>
>>> thanks.
>>>
>>
>>
>>
>



-- 
Lewis
