See: https://issues.apache.org/jira/browse/NUTCH-961

 
 
-----Original message-----
> From:Shaya Potter <[email protected]>
> Sent: Sun 26-Aug-2012 17:59
> To: [email protected]
> Subject: Re: running main() in plugins?
> 
> It could be that the "magic" (i.e. analysis) Nutch does in the 
> background gets rid of most of the cruft. I'm just playing around on my 
> own, trying to see how I can get the best text to analyze. In many 
> cases there's a lot of cruft, and I was wondering whether Nutch does 
> anything to remove it (headers, footers, sidebars...).
> 
> What I'm doing now for my experiments is relatively heavyweight: 
> I'm applying the readability algorithm to web pages before I index them 
> into a Lucene database.  Probably not the best idea for Nutch, though.
> 
> With that said, if Nutch is doing more processing than a jsoup 
> Document.text() operation, the question is why?  (Some reasons might be 
> obvious: metadata, getting outbound links.)
> 
> On 08/26/2012 08:55 AM, Lewis John Mcgibbney wrote:
> > Hi Shaya,
> >
> > Can you elaborate? The plugin has been around for a good while. If you
> > have suggestions to improve they are very welcome.
> >
> > Thanks
> >
> > On Sun, Aug 26, 2012 at 1:41 PM, Shaya Potter <[email protected]> wrote:
> >> OK, so it seems (at least from a smattering of tests I've done) that
> >> Nutch isn't doing much differently from jsoup's Document.text(). As far
> >> as I can tell, the only differences are perhaps some issues with spacing
> >> between elements.
> >>
> >> On 08/26/2012 06:28 AM, Lewis John Mcgibbney wrote:
> >>>
> >>> You can easily run any plugin from the terminal using
> >>>
> >>> ./bin/nutch plugin
> >>>
> >>> in the case of the HtmlParser main() method you would want to do
> >>>
> >>> ./bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser $pathToLocalFile
> >>>
> >>> You have actually identified an improvement which we could do with
> >>> having in the main() method for this class e.g.
> >>>
> >>> 1) When the arguments are not correctly specified, it should print a
> >>> usage message to stdout explaining the correct plugin usage, as more
> >>> or less every other plugin does. Currently we just get a nasty stack
> >>> trace like the following
> >>>
> >>> Exception in thread "main" java.lang.reflect.InvocationTargetException
> >>>          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>>          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>          at java.lang.reflect.Method.invoke(Method.java:597)
> >>>          at org.apache.nutch.plugin.PluginRepository.main(PluginRepository.java:421)
> >>> Caused by: java.io.FileNotFoundException: http:/www.trancearoundtheworld.com (No such file or directory)
> >>>          at java.io.FileInputStream.open(Native Method)
> >>>          at java.io.FileInputStream.<init>(FileInputStream.java:120)
> >>>          at org.apache.nutch.parse.html.HtmlParser.main(HtmlParser.java:274)
> >>>          ... 5 more
> >>>
> >>> 2) The plugin main method only lets you parse local files. An
> >>> improvement would be to add functionality similar to the parserchecker,
> >>> as highlighted by Sourajit.
> >>>
> >>> If you wish to add these functions then please open a Jira issue; the
> >>> contribution would be great.
> >>>
> >>> Thanks
> >>>
> >>> Lewis
> >>>
> >>> On Sun, Aug 26, 2012 at 4:18 AM, Shaya Potter <[email protected]> wrote:
> >>>>
> >>>> I'm trying to run the main function in HtmlParser (just to test how
> >>>> Nutch's parser works compared to others) and I can't seem to figure
> >>>> out how to get it to run.
> >>>>
> >>>>
> >>>> http://svn.apache.org/viewvc/nutch/branches/branch-1.5.1/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java?revision=1356339&view=markup
> >>>>
> >>>> when I run it naively, I get an error
> >>>>
> >>>> Exception in thread "main" java.lang.RuntimeException: org.apache.nutch.parse.HtmlParseFilter not found.
> >>>>       at org.apache.nutch.parse.HtmlParseFilters.<init>(HtmlParseFilters.java:55)
> >>>>
> >>>> In looking at HtmlParseFilters, I see that it throws the runtime
> >>>> exception if it can't find any HtmlParseFilter classes. However, I
> >>>> can't seem to figure out how to make it find them (I see the jars in
> >>>> the plugins dir, but do they have to be registered?). Could the main()
> >>>> in HtmlParser ever work as is?
> >>>>
> >>>> any pointers would be appreciated.
> >>>>
> >>>> thanks.
> >>>>
> >>>
> >>>
> >>>
> >>
> >
> >
> >
> 
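As a rough illustration of the two improvements Lewis lists for
HtmlParser.main() (a usage message on bad arguments, and accepting
something other than a local file), here is a minimal Java sketch. It is
not the actual Nutch code: the class name ParserArgsSketch, the methods
validate() and isRemote(), and the usage text are all made up for
illustration. The sketch only classifies the argument; it does not fetch
or parse anything.

```java
import java.io.File;
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical sketch of argument handling; not the real HtmlParser.main(). */
public class ParserArgsSketch {

  // Assumed usage text, invented for this example.
  static final String USAGE = "Usage: ParserArgsSketch <localFileOrUrl>";

  /** Returns an error message for unusable arguments, or null if they look fine. */
  static String validate(String[] args) {
    if (args == null || args.length != 1 || args[0].trim().isEmpty()) {
      return USAGE;
    }
    String target = args[0];
    if (isRemote(target)) {
      return null; // a real tool would go on to fetch this over the network
    }
    if (!new File(target).isFile()) {
      return "Not a local file: " + target + System.lineSeparator() + USAGE;
    }
    return null;
  }

  /** True if the argument parses as a URI with an http or https scheme. */
  static boolean isRemote(String target) {
    try {
      String scheme = new URI(target).getScheme();
      return "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
    } catch (URISyntaxException e) {
      return false; // e.g. paths containing spaces or backslashes
    }
  }

  public static void main(String[] args) {
    String error = validate(args);
    if (error != null) {
      // Print a usage message instead of letting a stack trace escape.
      System.out.println(error);
      System.exit(1);
    }
    System.out.println((isRemote(args[0]) ? "remote: " : "local: ") + args[0]);
  }
}
```

With a check like this, passing a URL would no longer fall through to
FileInputStream and fail with the FileNotFoundException shown in the
quoted trace.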
