BELLINI ADAM wrote:
Hi,
But how do I dump the content? I tried this command:
./bin/nutch readseg -dump crawl/segments/20090903121951/content/ toto
and it said :
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
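A possible reading of the error above: readseg -dump seems to expect the segment directory itself (the one that holds content/, crawl_fetch/, parse_text/ and so on), not the content/ subdirectory, so the tool would go looking for paths that do not exist under content/. A minimal sketch under that assumption follows. The helper name dump_content and the NUTCH variable are made up for illustration; the -no* switches are the ones readseg lists in its usage message, so check them against your Nutch version.

```shell
#!/bin/sh
# Hypothetical helper, not part of Nutch: dump only the raw content of a segment.
# NUTCH is an assumed variable pointing at the nutch launcher script.
NUTCH=${NUTCH:-./bin/nutch}

dump_content() {
    # $1: segment directory (a trailing "/content" is tolerated and stripped)
    # $2: output directory for the dump
    segment=${1%/}                 # drop a trailing slash, if any
    segment=${segment%/content}    # drop a trailing /content, if any
    # Skip every part of the segment except the fetched content.
    "$NUTCH" readseg -dump "$segment" "$2" \
        -nofetch -nogenerate -noparse -noparsedata -noparsetext
}
```

With this, dump_content crawl/segments/20090903121951/content/ toto would run readseg against crawl/segments/20090903121951 and write the dump under toto.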
Actually, it's quite easy to modify the parse-html filter to do this, that is,
to save the HTML to a file or to some database. You could then configure it
to skip all unnecessary plugins. I think it depends a lot on the other
requirements you have whether using Nutch for this task is the
right way to
Skúlason magg...@gmail.com wrote:
From: Magnús Skúlason magg...@gmail.com
Subject: Re: R: Using Nutch for only retriving HTML
To: nutch-user@lucene.apache.org
Date: Wednesday, 30 September 2009, 11:48
Actually, it's quite easy to modify the
parse-html filter to do this.
That is saving the HTML
From: olson_...@yahoo.it
Subject: Re: R: Using Nutch for only retriving HTML
To: nutch-user@lucene.apache.org
Thanks, Magnús and Susam, for your responses and for pointing me in the right
direction. I think I will spend some time over the next few weeks trying
Nutch out. I only needed
From: mbel...@msn.com
To: nutch-user@lucene.apache.org
Subject: RE: R: Using Nutch for only retriving HTML
Date: Wed, 30 Sep 2009 21:04:03 +
Hi,
Maybe you can run a crawl (don't forget to filter the pages to keep only html
or htm files; you do this in conf/crawl-urlfilter.txt)
after
BELLINI ADAM wrote:
Me again,
I forgot to tell you the easiest way...
Once the crawl is finished, you can dump the whole db (it contains all the links
to your html pages) to a text file:
./bin/nutch readdb crawl_folder/crawldb/ -dump DBtextFile
and you can perform the wget on this db and
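A rough sketch of that readdb plus wget idea. It assumes that each record in the text dump starts with a line of the form "<url><TAB>Version: ..." followed by Status, Fetch time and similar lines, and that the dump lands in a part file under the output directory; both should be checked against your own dump. The function names extract_urls and html_only, the part-00000 file name and the pages directory are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Nutch.

# Print one URL per line from a crawldb text dump.
# Assumption: the URL is the first tab-separated field on the first
# line of each record, and no other line starts with http:// or https://.
extract_urls() {
    # $1: path to the dump file
    awk -F '\t' '$1 ~ /^https?:\/\// { print $1 }' "$1"
}

# Keep only URLs that end in .html or .htm (case-insensitive).
html_only() {
    grep -E -i '\.html?$'
}

# Example use (left as a comment so nothing is downloaded by accident):
#   extract_urls DBtextFile/part-00000 | html_only | wget -i - -P pages
```

Here wget -i - reads the URL list from standard input and -P sets the directory the pages are saved to. Note that html_only drops URLs without an .html or .htm suffix, so pages served from paths like / would be skipped.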
Sorry for pushing this topic, but I would like to know if Nutch would help me
get the raw HTML in my situation described below.
I am sure it would be a simple answer to those who know Nutch. If not then I
guess Nutch is the wrong tool for the job.
Thanks,
O. O.
--- Thu 24/9/09, O. Olson
On Wed, Sep 30, 2009 at 1:39 AM, O. Olson olson_...@yahoo.it wrote:
Sorry for pushing this topic, but I would like to know if Nutch would help me
get the raw HTML in my situation described below.
I am sure it would be a simple answer to those who know Nutch. If not then I
guess Nutch is the