Re: Pulling out URLs

2009-03-12 Thread vishal vachhani
A simple solution would be to dump the segments using the following command, and then write a script that extracts the Outlinks present in the documents of the segment:
    $NUTCH_home/bin/nutch readseg -dump -dir segDirsPath -nocontent -nofetch -nogenerate -noparse -noparsetext
This will give you a dump
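A minimal sketch of the kind of extraction script suggested above. It assumes each outlink appears in the text dump as a line of the form "outlink: toUrl: <url> anchor: <text>"; that line shape is an assumption about the dump output and is not confirmed anywhere in this thread, so check it against a real dump first. The function name extract_outlinks is hypothetical.

```python
import re

# Assumed line shape in the readseg text dump (not confirmed in this thread):
#   outlink: toUrl: http://example.com/page anchor: Example
OUTLINK_RE = re.compile(r"outlink:\s+toUrl:\s+(\S+)")


def extract_outlinks(lines):
    """Return outlink URLs found in the dump lines, in first-seen order, without duplicates."""
    seen = set()
    urls = []
    for line in lines:
        match = OUTLINK_RE.search(line)
        if match is None:
            continue  # not an outlink line
        url = match.group(1)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

To use it, open the dump file produced by readseg and pass the file object to extract_outlinks, since iterating over a file yields its lines.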

Re: Pulling out URLs

2009-03-12 Thread MyD
Thank you for the hint. How can this be done with the Segment Reader (Nutch 0.9 API)? Thanks in advance.
Cheers, MyD
vishal vachhani wrote: Simple solution would be done the segments using following command and just write a script which can extract the Outlinks present in the documents

Re: Pulling out URLs

2009-03-12 Thread Gosavi.Shyam
Hi, try this command:
    bin/nutch readseg segment_dir output
(i.e. bin/nutch readseg ./crawldir/segments/* output.log)
Regards, sanjshra
MyD wrote: Thank you for the hint. How can this be done with the Segment Reader (Nutch 0.9 api)? Thanks in advance. Cheers, MyD vishal vachhani

Re: Fwd: fetch but not index

2009-03-12 Thread Gosavi.Shyam
Hi, yes, you can index using the index command. Try the following commands:
    bin/nutch invertlinks crawl/linkdb crawl/segments/*
then
    bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
Regards, sanjshra
陈琛 wrote: thanks very much. i am testing by the way ,
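The two commands above have to run in that order, so it can help to script them. The following is only a sketch: it builds the same two command lines as argument lists, expanding crawl/segments/* in Python instead of relying on the shell. The helper name build_index_commands and the nutch parameter are hypothetical; the directory layout (segments, linkdb, indexes, crawldb under one crawl directory) is taken from the commands quoted above.

```python
import glob
import os


def build_index_commands(crawl_dir, nutch="bin/nutch"):
    """Return the invertlinks and index command lines as argument lists, in run order."""
    # Expand <crawl_dir>/segments/* here so that no shell is needed to run the commands.
    segments = sorted(glob.glob(os.path.join(crawl_dir, "segments", "*")))
    linkdb = os.path.join(crawl_dir, "linkdb")
    invertlinks = [nutch, "invertlinks", linkdb] + segments
    index = [
        nutch,
        "index",
        os.path.join(crawl_dir, "indexes"),
        os.path.join(crawl_dir, "crawldb"),
        linkdb,
    ] + segments
    return [invertlinks, index]
```

Each returned list can be passed to subprocess.run(cmd, check=True) from the Nutch home directory, so that a failure in invertlinks stops the run before indexing starts.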