Hi and thanks Ferdy,

It seems that since I'm using -noFilter and -noNorm with "nutch generate
...", everything runs much more quickly (by the way, my Nutch version is
1.6).

Now I would like to optimize my crawling loop, since I don't want to reindex
everything with solrindex, and I only want to add newly discovered links to the linkdb.

Here is what my loop currently looks like:

bin/nutch generate crawl/crawldb crawl/segments -topN 10000 -noFilter -noNorm
s2=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s2
bin/nutch parse $s2
bin/nutch updatedb crawl/crawldb $s2

bin/nutch generate crawl/crawldb crawl/segments -topN 10000 -noFilter -noNorm
s3=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s3
bin/nutch parse $s3
bin/nutch updatedb crawl/crawldb $s3

bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
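
For readability, here is the same thing written as an actual shell loop (just a sketch with the same -topN and segment layout; the SEGMENTS and seg variables are only mine, to collect the segment paths so I could reuse them afterwards):

SEGMENTS=""
for i in 1 2; do
  # generate a new fetch list without URL filtering/normalizing
  bin/nutch generate crawl/crawldb crawl/segments -topN 10000 -noFilter -noNorm
  # pick up the segment that was just created
  seg=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $seg
  bin/nutch parse $seg
  bin/nutch updatedb crawl/crawldb $seg
  # remember this segment so only the new ones get invertlinks/solrindex later
  SEGMENTS="$SEGMENTS $seg"
done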

I've read the documentation on invertlinks and solrindex, but I still don't
understand how to run invertlinks / solrindex only on the latest segments
(here $s2 and $s3).

Could someone tell me how to set my command lines to something like:
bin/nutch invertlinks crawl/linkdb -dir $s2 $s3
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb $s2 $s3
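
From the usage messages, it looks to me as if -dir expects a whole segments directory, while individual segments can simply be listed without -dir, so perhaps something like this (untested, please correct me if I'm wrong):

bin/nutch invertlinks crawl/linkdb $s2 $s3
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb $s2 $s3

Is that the correct syntax?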

I already have about 1,000,000 indexed URLs and I don't want to break
anything by running the wrong test.


My tool will be used for press coverage (finding new articles and storing them
for data reporting). So I need a quick loop, so that the site database
(currently 2,000 URLs) always has all of its URLs indexed (it would be
critical to miss some important news just because the crawl takes too
long).





