See: https://issues.apache.org/jira/browse/NUTCH-961
-----Original message-----
> From: Shaya Potter <[email protected]>
> Sent: Sun 26-Aug-2012 17:59
> To: [email protected]
> Subject: Re: running main() in plugins?
>
> It could be that the "magic" (i.e. analysis) Nutch is doing in the
> background gets rid of most of the cruft. I'm just playing around on my
> own trying to see how I can get the best text to analyze, and in many
> cases there's a lot of cruft, so I was wondering if Nutch did anything
> to remove said cruft (headers, footers, sidebars...)
>
> what I'm doing now for my experiments is relatively heavyweight, but
> I'm applying the readability algorithm to web pages before I index them
> into my Lucene database. probably not the best idea for Nutch though.
>
> With that said, if Nutch is doing more processing than a jsoup
> Document.text() operation, the question is why? (some might be obvious,
> metadata, getting outbound links)
>
> On 08/26/2012 08:55 AM, Lewis John Mcgibbney wrote:
> > Hi Shaya,
> >
> > Can you elaborate? The plugin has been around for a good while. If you
> > have suggestions to improve they are very welcome.
> >
> > Thanks
> >
> > On Sun, Aug 26, 2012 at 1:41 PM, Shaya Potter <[email protected]> wrote:
> >> ok, so it seems that Nutch isn't doing much different (at least from a
> >> smattering of tests I've done) than Jsoup's Document.text() ability (from
> >> what I can tell at least, perhaps only some issues with spacing between
> >> elements).
> >>
> >> On 08/26/2012 06:28 AM, Lewis John Mcgibbney wrote:
> >>>
> >>> You can easily run any plugin from the terminal using
> >>>
> >>> ./bin/nutch plugin
> >>>
> >>> in the case of the HtmlParser main() method you would want to do
> >>>
> >>> ./bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser $pathToLocalFile
> >>>
> >>> You have actually identified an improvement which we could do with
> >>> having in the main() method for this class e.g.
> >>>
> >>> 1) When the arguments are not correctly specified it should print a
> >>> usage message to stdout explaining the correct plugin usage as with
> >>> more or less every other plugin. Currently we just get a nasty stack
> >>> trace like the following
> >>>
> >>> Exception in thread "main" java.lang.reflect.InvocationTargetException
> >>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>     at java.lang.reflect.Method.invoke(Method.java:597)
> >>>     at org.apache.nutch.plugin.PluginRepository.main(PluginRepository.java:421)
> >>> Caused by: java.io.FileNotFoundException:
> >>> http:/www.trancearoundtheworld.com (No such file or directory)
> >>>     at java.io.FileInputStream.open(Native Method)
> >>>     at java.io.FileInputStream.<init>(FileInputStream.java:120)
> >>>     at org.apache.nutch.parse.html.HtmlParser.main(HtmlParser.java:274)
> >>>     ... 5 more
> >>>
> >>> 2) The plugin main method only enables you to parse local files. An
> >>> improvement would be to add functionality similar to the parserchecker
> >>> as highlighted by Sourajit
> >>>
> >>> If you wish to add these functions then please open a Jira issue, the
> >>> contribution would be great.
> >>>
> >>> Thanks
> >>>
> >>> Lewis
> >>>
> >>> On Sun, Aug 26, 2012 at 4:18 AM, Shaya Potter <[email protected]> wrote:
> >>>>
> >>>> I'm trying to run the main function in HtmlParser (just to test how
> >>>> Nutch's parser works compared to others) and I can't seem to figure
> >>>> out how to get it to run.
> >>>>
> >>>> http://svn.apache.org/viewvc/nutch/branches/branch-1.5.1/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java?revision=1356339&view=markup
> >>>>
> >>>> when I run it naively, I get an error
> >>>>
> >>>> Exception in thread "main" java.lang.RuntimeException:
> >>>> org.apache.nutch.parse.HtmlParseFilter not found.
> >>>>     at org.apache.nutch.parse.HtmlParseFilters.<init>(HtmlParseFilters.java:55)
> >>>>
> >>>> in looking at HtmlParseFilters, I see that it throws the runtime
> >>>> exception if it can't find any HtmlParseFilter classes, however, I
> >>>> can't seem to figure out how to make it able to find them (I see the
> >>>> jars in the plugins dir, but do they have to be registered? could the
> >>>> main() in HtmlParser ever work as is?)
> >>>>
> >>>> any pointers would be appreciated.
> >>>>
> >>>> thanks.
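
Point 1 in the quoted thread asks that main() print a usage message rather than a stack trace when the arguments are wrong. One way that check could look is sketched below in plain Java. The class name ParserMainSketch and the helper checkArgs are hypothetical and are not taken from the Nutch source; the real HtmlParser.main() may be structured quite differently, and the actual parsing step is left out.

```java
import java.io.File;

/**
 * Hypothetical stand-in for a plugin main(); NOT the actual HtmlParser code.
 * It only demonstrates validating arguments before any file is opened.
 */
public class ParserMainSketch {

    // Assumed usage text; the wording is illustrative only.
    static final String USAGE = "Usage: ParserMainSketch <pathToLocalFile>";

    /**
     * Checks the command-line arguments.
     * Returns null if they look usable, otherwise a message for the user.
     */
    static String checkArgs(String[] args) {
        if (args == null || args.length != 1) {
            return USAGE;
        }
        File input = new File(args[0]);
        // A URL such as the one in the quoted stack trace is not a local
        // file, so it is rejected here instead of failing later with a
        // FileNotFoundException.
        if (!input.isFile() || !input.canRead()) {
            return "Not a readable local file: " + args[0]
                + System.lineSeparator() + USAGE;
        }
        return null;
    }

    public static void main(String[] args) {
        String problem = checkArgs(args);
        if (problem != null) {
            System.err.println(problem);
            System.exit(1);
        }
        // The real parsing work would start here; this sketch only
        // reports which file it would have parsed.
        System.out.println("Would parse: " + args[0]);
    }
}
```

The message goes to stderr here; printing to stdout, as suggested in the thread, would work just as well. Keeping the check in a separate method makes it testable without touching the parser itself.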

