RE: R: Using Nutch for only retriving HTML

2009-10-02 Thread BELLINI ADAM
thx dude it works fine :) > Date: Thu, 1 Oct 2009 20:05:09 +0200 > From: a...@getopt.org > To: nutch-user@lucene.apache.org > Subject: Re: R: Using Nutch for only retriving HTML > > BELLINI ADAM wrote: > > hi, > > but how to dump the c

Re: R: Using Nutch for only retriving HTML

2009-10-01 Thread Andrzej Bialecki
BELLINI ADAM wrote: hi, but how to dump the content ? i tried this command : ./bin/nutch readseg -dump crawl/segments/20090903121951/content/ toto and it said : Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/usr/local/nutch-1.0/

RE: R: Using Nutch for only retriving HTML

2009-10-01 Thread BELLINI ADAM
ct 2009 18:16:43 +0200 > From: a...@getopt.org > To: nutch-user@lucene.apache.org > Subject: Re: R: Using Nutch for only retriving HTML > > BELLINI ADAM wrote: > > hi, > > thx for the advice, > > but guess when u run the readseg command it will not return the pages a

Re: R: Using Nutch for only retriving HTML

2009-10-01 Thread Andrzej Bialecki
BELLINI ADAM wrote: hi, thx for the advice, but I guess when u run the readseg command it will not return the pages as is (as if browsed). i tried it and it returns information about pages : Recno:: 0 URL:: http://blabla.com/blabla.jsp CrawlDatum:: Version: 7 Status: 67 (linked) Fetch time: Mon

RE: R: Using Nutch for only retriving HTML

2009-10-01 Thread BELLINI ADAM
23:38:28 +0200 > From: a...@getopt.org > To: nutch-user@lucene.apache.org > Subject: Re: R: Using Nutch for only retriving HTML > > BELLINI ADAM wrote: > > > > me again, > > > > i forgot to tell u the easiest way... > > > > once the crawl is f

Re: R: Using Nutch for only retriving HTML

2009-09-30 Thread Andrzej Bialecki
BELLINI ADAM wrote: me again, i forgot to tell u the easiest way... once the crawl is finished you can dump the whole db (it contains all the links to your html pages) in a text file.. ./bin/nutch readdb crawl_folder/crawldb/ -dump DBtextFile and you can perform the wget on this db and archi
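A minimal sketch of the "dump the db, then wget" idea from the message above, not the poster's actual script. It assumes each record in the text dump starts with the URL at the beginning of a line, followed by a tab; check your own dump first, since the format (and whether `-dump` writes a single file or part files inside a directory) may differ by Nutch version. The helper name `extract_urls`, the file `urls.txt`, and the sample records are hypothetical.

```shell
#!/bin/sh
# Sketch: pull URLs out of a crawldb text dump so they can be fed to wget.
# Assumption: a record's first line is "<url><TAB>..." (verify on a real dump).

extract_urls() {
    # $1 = dump file; prints one URL per line, keeping only .html/.htm pages
    awk -F '\t' '/^https?:\/\// { print $1 }' "$1" | grep -E '\.html?$'
}

# Tiny hand-made sample standing in for a real dump (hypothetical data).
sample=$(mktemp)
printf 'http://example.com/a.html\tVersion: 7\nStatus: 2 (db_fetched)\n\nhttp://example.com/img.png\tVersion: 7\nStatus: 2 (db_fetched)\n\nhttp://example.com/b.htm\tVersion: 7\n' > "$sample"

extract_urls "$sample" > urls.txt
cat urls.txt
rm -f "$sample"

# With a real URL list, the pages could then be fetched with:
#   wget -i urls.txt
```

Filtering on the URL suffix is only a rough stand-in for "keep html or htm files"; pages served from other extensions (for example .jsp) would need a different filter.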

RE: R: Using Nutch for only retriving HTML

2009-09-30 Thread BELLINI ADAM
> From: mbel...@msn.com > To: nutch-user@lucene.apache.org > Subject: RE: R: Using Nutch for only retriving HTML > Date: Wed, 30 Sep 2009 21:04:03 + > > > hi > maybe you can run a crawl (don't forget to filter the pages just to keep html > or htm files (you will do it

RE: R: Using Nutch for only retriving HTML

2009-09-30 Thread BELLINI ADAM
ep 2009 20:46:50 + > From: olson_...@yahoo.it > Subject: Re: R: Using Nutch for only retriving HTML > To: nutch-user@lucene.apache.org > > Thanks Magnús and Susam for your responses and pointing me in the right > direction. I think I would spend time over the next few weeks t

Re: R: Using Nutch for only retriving HTML

2009-09-30 Thread O. Olson
Skúlason wrote: > From: Magnús Skúlason > Subject: Re: R: Using Nutch for only retriving HTML > To: nutch-user@lucene.apache.org > Date: Wednesday, 30 September 2009, 11:48 > Actually it's quite easy to modify the > parse-html filter to do this. > > That is saving the H

Re: R: Using Nutch for only retriving HTML

2009-09-30 Thread Magnús Skúlason
Actually it's quite easy to modify the parse-html filter to do this, that is, to save the HTML to a file or to some database; you could then configure it to skip all unnecessary plugins. I think it depends a lot on the other requirements you have whether using Nutch for this task is the right way to

Re: R: Using Nutch for only retriving HTML

2009-09-29 Thread Susam Pal
On Wed, Sep 30, 2009 at 1:39 AM, O. Olson wrote: > Sorry for pushing this topic, but I would like to know if Nutch would help me > get the raw HTML in my situation described below. > > I am sure it would be a simple answer to those who know Nutch. If not then I > guess Nutch is the wrong tool fo