Hi Shaya,

Can you elaborate? The plugin has been around for a good while. If you have suggestions to improve it, they are very welcome.

Thanks

On Sun, Aug 26, 2012 at 1:41 PM, Shaya Potter <[email protected]> wrote:
> ok, so it seems that Nutch isn't doing much different (at least from a
> smattering of tests I've done) than Jsoup's Document.text() ability (from
> what I can tell at least, perhaps only some issues with spacing between
> elements).
>
> On 08/26/2012 06:28 AM, Lewis John Mcgibbney wrote:
>>
>> You can easily run any plugin from the terminal using
>>
>> ./bin/nutch plugin
>>
>> in the case of the HtmlParser main() method you would want to do
>>
>> ./bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser $pathToLocalFile
>>
>> You have actually identified an improvement which we could do with
>> having in the main() method for this class, e.g.
>>
>> 1) When the arguments are not correctly specified it should print a
>> usage message to stdout explaining the correct plugin usage, as with
>> more or less every other plugin. Currently we just get a nasty stack
>> trace like the following
>>
>> Exception in thread "main" java.lang.reflect.InvocationTargetException
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>     at org.apache.nutch.plugin.PluginRepository.main(PluginRepository.java:421)
>> Caused by: java.io.FileNotFoundException: http:/www.trancearoundtheworld.com (No such file or directory)
>>     at java.io.FileInputStream.open(Native Method)
>>     at java.io.FileInputStream.<init>(FileInputStream.java:120)
>>     at org.apache.nutch.parse.html.HtmlParser.main(HtmlParser.java:274)
>>     ... 5 more
>>
>> 2) The plugin main method only enables you to parse local files. An
>> improvement would be to add functionality similar to the parserchecker,
>> as highlighted by Sourajit.
>>
>> If you wish to add these functions then please open a Jira issue, the
>> contribution would be great.
>>
>> Thanks
>>
>> Lewis
>>
>> On Sun, Aug 26, 2012 at 4:18 AM, Shaya Potter <[email protected]> wrote:
>>>
>>> I'm trying to run the main function in HtmlParser (just to test how
>>> Nutch's parser works compared to others) and I can't seem to figure out
>>> how to get it to run.
>>>
>>> http://svn.apache.org/viewvc/nutch/branches/branch-1.5.1/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java?revision=1356339&view=markup
>>>
>>> when I run it naively, I get an error
>>>
>>> Exception in thread "main" java.lang.RuntimeException: org.apache.nutch.parse.HtmlParseFilter not found.
>>>     at org.apache.nutch.parse.HtmlParseFilters.<init>(HtmlParseFilters.java:55)
>>>
>>> in looking at HtmlParseFilters, I see that it throws the runtime
>>> exception if it can't find any HtmlParseFilter classes, however, I can't
>>> seem to figure out how to make it able to find them (I see the jars in
>>> the plugins dir, but do they have to be registered?). could the main()
>>> in HtmlParser ever work as is?
>>>
>>> any pointers would be appreciated.
>>>
>>> thanks.

--
Lewis
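
For reference, the argument check suggested in point 1) of the quoted message could look roughly like the sketch below. This is only an illustration and not the actual HtmlParser code: the class name `ParserMainSketch`, the `checkArgs` helper, and the wording of the usage text are all assumptions, and the parsing step itself is left out.

```java
import java.io.File;

public class ParserMainSketch {
    // Hypothetical usage text; the real plugin may word this differently.
    static final String USAGE =
        "Usage: ./bin/nutch plugin parse-html "
        + "org.apache.nutch.parse.html.HtmlParser <pathToLocalFile>";

    /** Returns a message describing the problem, or null if the arguments look usable. */
    static String checkArgs(String[] args) {
        if (args == null || args.length != 1) {
            return USAGE;
        }
        File file = new File(args[0]);
        if (!file.isFile()) {
            // Covers the case from the thread where a URL was passed instead of a local path.
            return "Not a local file: " + args[0] + System.lineSeparator() + USAGE;
        }
        return null;
    }

    public static void main(String[] args) {
        String problem = checkArgs(args);
        if (problem != null) {
            // Print a usage message instead of letting a stack trace escape.
            System.out.println(problem);
            System.exit(1);
        }
        // The actual parsing would happen here; omitted in this sketch.
        System.out.println("Would parse: " + args[0]);
    }
}
```

Keeping the check in a separate method that returns a message makes it possible to test the argument handling without calling `System.exit`.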

