It could be that the "magic" (i.e. analysis) Nutch does in the background gets rid of most of the cruft. I'm just playing around on my own, trying to see how I can get the best text to analyze. In many cases there's a lot of cruft, and I was wondering whether Nutch does anything to remove it (headers, footers, sidebars...).

What I'm doing now for my experiments is relatively heavyweight: I'm applying the readability algorithm to web pages before I index them into my Lucene database. Probably not the best idea for Nutch, though.

With that said, if Nutch is doing more processing than a jsoup Document.text() operation, the question is why? (Some reasons might be obvious: metadata, getting outbound links.)

On 08/26/2012 08:55 AM, Lewis John Mcgibbney wrote:
Hi Shaya,

Can you elaborate? The plugin has been around for a good while. If you
have suggestions to improve they are very welcome.

Thanks

On Sun, Aug 26, 2012 at 1:41 PM, Shaya Potter <[email protected]> wrote:
OK, so from the smattering of tests I've done, it seems that Nutch isn't
doing much differently from Jsoup's Document.text() (as far as I can
tell, the only differences are perhaps some issues with spacing between
elements).
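
To make the "spacing between elements" point concrete, here is a toy sketch of how two text extractors can disagree only in whitespace. It is not code from Nutch or jsoup; the class TagStripSketch and its two methods are hypothetical, and regex tag stripping is used purely for illustration, not as a real HTML parser.

```java
// Toy illustration only: neither method reflects how Nutch or jsoup work.
class TagStripSketch {
    // Removes every tag and puts nothing in its place,
    // so text from adjacent elements runs together.
    static String stripTight(String html) {
        return html.replaceAll("<[^>]+>", "").trim();
    }

    // Replaces every tag with a space, then collapses whitespace runs,
    // so text from adjacent elements stays separated.
    static String stripSpaced(String html) {
        return html.replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<div><p>Hello</p><p>world</p></div>";
        System.out.println(stripTight(html));   // prints "Helloworld"
        System.out.println(stripSpaced(html));  // prints "Hello world"
    }
}
```

Differences of this kind matter for indexing because tokens can get glued together, even when the extracted words are otherwise identical.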

On 08/26/2012 06:28 AM, Lewis John Mcgibbney wrote:

You can easily run any plugin from the terminal using

./bin/nutch plugin

in the case of the HtmlParser main() method you would want to do

./bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser $pathToLocalFile

You have actually identified an improvement which we could do with
having in the main() method for this class, e.g.

1) When the arguments are not correctly specified, it should print a
usage message to stdout explaining the correct plugin usage, as more
or less every other plugin does. Currently we just get a nasty stack
trace like the following:

Exception in thread "main" java.lang.reflect.InvocationTargetException
         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
         at java.lang.reflect.Method.invoke(Method.java:597)
         at org.apache.nutch.plugin.PluginRepository.main(PluginRepository.java:421)
Caused by: java.io.FileNotFoundException: http:/www.trancearoundtheworld.com (No such file or directory)
         at java.io.FileInputStream.open(Native Method)
         at java.io.FileInputStream.<init>(FileInputStream.java:120)
         at org.apache.nutch.parse.html.HtmlParser.main(HtmlParser.java:274)
         ... 5 more

2) The plugin main method only enables you to parse local files; an
improvement would be to add functionality similar to the parserchecker,
as highlighted by Sourajit.
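
As a rough idea of what point 1 could look like, here is a minimal sketch of an argument check that prints a usage message instead of letting a FileNotFoundException escape. This is not the actual HtmlParser code: the class name HtmlParserMainSketch, the checkArgs helper and the wording of the usage text are all made up for illustration.

```java
import java.io.File;

// Hypothetical sketch; not taken from the Nutch source tree.
class HtmlParserMainSketch {
    // Assumed wording; a real plugin would name its own class here.
    static final String USAGE = "Usage: HtmlParserMainSketch <pathToLocalFile>";

    // Returns null when the arguments look usable,
    // otherwise the message that should be shown to the user.
    static String checkArgs(String[] args) {
        if (args.length != 1) {
            return USAGE;
        }
        File file = new File(args[0]);
        if (!file.isFile()) {
            return "Not a local file: " + args[0] + System.lineSeparator() + USAGE;
        }
        return null;
    }

    public static void main(String[] args) {
        String problem = checkArgs(args);
        if (problem != null) {
            // Print the usage message rather than a stack trace.
            System.out.println(problem);
            System.exit(1);
        }
        // The actual parsing of the file would start here.
        System.out.println("Would parse " + args[0]);
    }
}
```

With a check like this, passing a URL where a local path is expected produces a short message plus the usage line rather than the trace above.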

If you wish to add these functions then please open a Jira issue; the
contribution would be great.

Thanks

Lewis

On Sun, Aug 26, 2012 at 4:18 AM, Shaya Potter <[email protected]> wrote:

I'm trying to run the main function in HtmlParser (just to test how
Nutch's parser works compared to others) and I can't seem to figure out
how to get it to run.


http://svn.apache.org/viewvc/nutch/branches/branch-1.5.1/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java?revision=1356339&view=markup

when I run it naively, I get an error

Exception in thread "main" java.lang.RuntimeException: org.apache.nutch.parse.HtmlParseFilter not found.
      at org.apache.nutch.parse.HtmlParseFilters.<init>(HtmlParseFilters.java:55)

In looking at HtmlParseFilters, I see that it throws the runtime
exception if it can't find any HtmlParseFilter classes. However, I
can't seem to figure out how to make it able to find them. (I see the
jars in the plugins dir, but do they have to be registered? Could the
main() in HtmlParser ever work as is?)

any pointers would be appreciated.

thanks.







