Thanks, it works fine :)
> Date: Thu, 1 Oct 2009 20:05:09 +0200
> From: a...@getopt.org
> To: nutch-user@lucene.apache.org
> Subject: Re: R: Using Nutch for only retriving HTML
BELLINI ADAM wrote:
Hi,
But how do I dump the content? I tried this command:
./bin/nutch readseg -dump crawl/segments/20090903121951/content/ toto
and it said:
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
file:/usr/local/nutch-1.0/
ct 2009 18:16:43 +0200
> From: a...@getopt.org
> To: nutch-user@lucene.apache.org
> Subject: Re: R: Using Nutch for only retriving HTML
BELLINI ADAM wrote:
Hi,
Thanks for the advice,
but I guess that when you run the readseg command it will not return the pages
as is (as if browsed).
I tried it and it returns information about the pages:
Recno:: 0
URL:: http://blabla.com/blabla.jsp
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Mon
23:38:28 +0200
> From: a...@getopt.org
> To: nutch-user@lucene.apache.org
> Subject: Re: R: Using Nutch for only retriving HTML
BELLINI ADAM wrote:
Me again,
I forgot to tell you the easiest way...
Once the crawl is finished you can dump the whole db (it contains all the links
to your HTML pages) to a text file:
./bin/nutch readdb crawl_folder/crawldb/ -dump DBtextFile
and you can perform the wget on this db and archi
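
A minimal sketch of the step between the readdb dump and wget, for anyone following the same route. It assumes that each record in the text dump starts with a line of the form "<url><TAB>Version: N"; that layout is an assumption here and should be checked against your own dump file. The function name extract_urls and the suffix filter are hypothetical helpers, not part of Nutch.

```python
import re
from typing import Iterable, List, Sequence

# Assumption: a record in the crawldb text dump begins with
# "<url><TAB>Version: N". Verify this against your own dump.
RECORD_START = re.compile(r"^(https?://\S+)\tVersion: \d+")


def extract_urls(lines: Iterable[str],
                 suffixes: Sequence[str] = (".html", ".htm")) -> List[str]:
    """Return URLs of dump records whose path ends in one of `suffixes`."""
    wanted = tuple(suffixes)
    urls = []
    for line in lines:
        match = RECORD_START.match(line)
        if match is None:
            continue  # not the first line of a record
        url = match.group(1)
        # Ignore any query string or fragment when checking the extension.
        path = url.split("?", 1)[0].split("#", 1)[0]
        if path.lower().endswith(wanted):
            urls.append(url)
    return urls
```

Writing the returned URLs one per line to a file (say urls.txt, a name chosen only for illustration) gives a list that can be fetched with wget -i urls.txt.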
> From: mbel...@msn.com
> To: nutch-user@lucene.apache.org
> Subject: RE: R: Using Nutch for only retriving HTML
> Date: Wed, 30 Sep 2009 21:04:03 +
>
>
> Hi,
> maybe you can run a crawl (don't forget to filter the pages to keep only html
> or htm files (you will do it
ep 2009 20:46:50 +
> From: olson_...@yahoo.it
> Subject: Re: R: Using Nutch for only retriving HTML
> To: nutch-user@lucene.apache.org
>
> Thanks Magnús and Susam for your responses and pointing me in the right
> direction. I think I would spend time over the next few weeks t
Skúlason wrote:
> From: Magnús Skúlason
> Subject: Re: R: Using Nutch for only retriving HTML
> To: nutch-user@lucene.apache.org
> Date: Wednesday, 30 September 2009, 11:48
Actually it's quite easy to modify the parse-html filter to do this.
That is, to save the HTML to a file or to some database; you could then
configure it to skip all unnecessary plugins. I think it depends a lot on
the other requirements you have whether using Nutch for this task is the
right way to
On Wed, Sep 30, 2009 at 1:39 AM, O. Olson wrote:
> Sorry for pushing this topic, but I would like to know if Nutch would help me
> get the raw HTML in my situation described below.
>
> I am sure it would be a simple answer to those who know Nutch. If not then I
> guess Nutch is the wrong tool fo