On Mon, Jan 24, 2011 at 11:07 AM, Mattmann, Chris A (388J)
<chris.a.mattm...@jpl.nasa.gov> wrote:
> I'd be happy to comment:
> A simple shell script doesn't provide URL filtering and control of how you 
> crawl those documents on the local file system. Nutch has several levels of 
> URL filtering based on regex, MIME type, and others. Also, if there are any 
> outlinks in those local files that point to remote content, Nutch will go and 
> crawl it for you, something that a simple shell script doesn't take care of.

OK, thanks, those are good points. What we have dealt with,
and what I believe the original poster in this thread wanted,
was simply a requirement to dump the contents of documents in a
filesystem hierarchy.
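
To make that narrower requirement concrete, here is a minimal sketch in
Python (my choice of language; the thread only mentions a shell script).
The function name dump_tree and the "==> path <==" header format are
hypothetical, and the sketch assumes every file can be read as text.
Binary formats such as PDF would need a separate text-extraction step,
which is not shown here.

```python
import os
import sys


def dump_tree(root, out=sys.stdout):
    """Walk `root` and write each file's path and text content to `out`.

    Returns the number of files dumped. Hypothetical helper, not part of
    Nutch or any other tool discussed in this thread.
    """
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a deterministic order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                # Assumption: files are text; undecodable bytes are replaced.
                with open(path, "r", encoding="utf-8", errors="replace") as fh:
                    text = fh.read()
            except OSError as exc:
                # Unreadable files are reported and skipped, not fatal.
                print(f"skipping {path}: {exc}", file=sys.stderr)
                continue
            out.write(f"==> {path} <==\n{text}\n")
            count += 1
    return count
```

Calling dump_tree("/path/to/docs") would print every file under that
directory to standard output. Note what is absent: no URL filtering and
no following of outlinks, which are exactly the things Nutch adds.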

> Also, it would be great if you could elaborate what the extra configuration 
> and maintenance issues are regarding Nutch? If you had something specific in 
> mind, patches or issue comments, welcome :)

I didn't mean it that way. Nutch is indeed quite easy to set up
and run. Nevertheless, if one's use case does not require the
features it provides, learning how to do so and maintaining an
instance of Nutch are unnecessary overhead.

