Re: R: Using Nutch for only retriving HTML

2009-10-01 Thread Andrzej Bialecki
BELLINI ADAM wrote: Hi, but how do I dump the content? I tried this command: ./bin/nutch readseg -dump crawl/segments/20090903121951/content/ toto and it said: Exception in thread main org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:

Re: R: Using Nutch for only retriving HTML

2009-09-30 Thread Magnús Skúlason
Actually it's quite easy to modify the parse-html filter to do this, that is, to save the HTML to a file or to some database; you could then configure it to skip all unnecessary plugins. I think it depends a lot on the other requirements you have whether using Nutch for this task is the right way to

Re: R: Using Nutch for only retriving HTML

2009-09-30 Thread O. Olson
Skúlason magg...@gmail.com wrote: From: Magnús Skúlason magg...@gmail.com Subject: Re: R: Using Nutch for only retriving HTML To: nutch-user@lucene.apache.org Date: Wednesday 30 September 2009, 11:48 Actually it's quite easy to modify the parse-html filter to do this. That is saving the HTML

RE: R: Using Nutch for only retriving HTML

2009-09-30 Thread BELLINI ADAM
:50 + From: olson_...@yahoo.it Subject: Re: R: Using Nutch for only retriving HTML To: nutch-user@lucene.apache.org Thanks Magnús and Susam for your responses and pointing me in the right direction. I think I would spend time over the next few weeks trying out Nutch over. I only needed

RE: R: Using Nutch for only retriving HTML

2009-09-30 Thread BELLINI ADAM
From: mbel...@msn.com To: nutch-user@lucene.apache.org Subject: RE: R: Using Nutch for only retriving HTML Date: Wed, 30 Sep 2009 21:04:03 + Hi, maybe you can run a crawl (don't forget to filter the pages so that you keep only html or htm files; you do this in conf/crawl-urlfilter.txt) after
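
The "keep only html or htm files" idea from the message above can be sketched in Python as a simple predicate. This is an illustration of the filtering decision only, not the syntax of conf/crawl-urlfilter.txt; the function name is_html_url is hypothetical, and it assumes "html or htm file" means the URL path ends in .html or .htm.

```python
from urllib.parse import urlparse


def is_html_url(url):
    """Return True if the URL's path ends in .html or .htm (case-insensitive).

    Assumption: the file extension is taken from the path only, so query
    strings and fragments are ignored.
    """
    path = urlparse(url).path.lower()
    return path.endswith((".html", ".htm"))


if __name__ == "__main__":
    candidates = [
        "http://example.com/index.html",
        "http://example.com/page.HTM?lang=en",
        "http://example.com/logo.png",
    ]
    for url in candidates:
        print(url, is_html_url(url))
```

Note that a rule this strict also rejects directory-style URLs such as http://example.com/, which often serve HTML, so a real filter may need to let those through.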

Re: R: Using Nutch for only retriving HTML

2009-09-30 Thread Andrzej Bialecki
BELLINI ADAM wrote: Me again, I forgot to tell you the easiest way... once the crawl is finished you can dump the whole db (it contains all the links to your html pages) to a text file: ./bin/nutch readdb crawl_folder/crawldb/ -dump DBtextFile and you can perform the wget on this db and
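
The dump-then-wget idea above needs a plain list of URLs first. Below is a minimal Python sketch of that step. It assumes, without confirmation from this thread, that each record in the text dump begins with a line that starts with the URL followed by whitespace and metadata; the function name urls_from_crawldb_dump and the sample text are hypothetical.

```python
def urls_from_crawldb_dump(text):
    """Collect the URLs that start a line in a crawldb text dump.

    Assumption: a record's first line begins with the URL, followed by
    whitespace and metadata. Lines that do not start with http:// or
    https:// are ignored.
    """
    urls = []
    for line in text.splitlines():
        if line.startswith(("http://", "https://")):
            # The URL is the first whitespace-delimited token on the line.
            urls.append(line.split()[0])
    return urls


if __name__ == "__main__":
    # Hypothetical dump excerpt, for illustration only.
    sample = (
        "http://example.com/a.html\tVersion: 7\n"
        "Status: 2 (db_fetched)\n"
        "\n"
        "http://example.com/b.htm\tVersion: 7\n"
        "Status: 1 (db_unfetched)\n"
    )
    for url in urls_from_crawldb_dump(sample):
        print(url)
```

Writing the resulting URLs one per line to a file lets wget read them with its -i option.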

R: Using Nutch for only retriving HTML

2009-09-29 Thread O. Olson
Sorry for pushing this topic, but I would like to know if Nutch would help me get the raw HTML in my situation described below. I am sure it would be a simple answer to those who know Nutch. If not then I guess Nutch is the wrong tool for the job. Thanks, O. O. --- Thu 24/9/09, O. Olson

Re: R: Using Nutch for only retriving HTML

2009-09-29 Thread Susam Pal
On Wed, Sep 30, 2009 at 1:39 AM, O. Olson olson_...@yahoo.it wrote: Sorry for pushing this topic, but I would like to know if Nutch would help me get the raw HTML in my situation described below. I am sure it would be a simple answer to those who know Nutch. If not then I guess Nutch is the