Dear all,
I have a problem with my Nutch Internet crawl/recrawl script (I wanted to
understand how it works, so I wrote it myself).
After I merge the indexes (merging the segments seems to work fine), search
no longer works for me:
$ bin/nutch org.apache.nutch.searcher.NutchBean http
Total hits: 0
Before recrawling I was able to search (the index was located at crawl/indexes).
My script:
---------------------------------------------
#!/bin/bash
export JAVA_HOME=/usr/lib/jvm/java-6-sun
#Inject new urls
bin/nutch inject crawl/crawldb dmoz/urls
echo "new URLs injected (dmoz/urls)"
#generate segments
bin/nutch generate crawl/crawldb crawl/segments -topN $3
echo "segments generated"
#generate fetch-list
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
echo "fetch-list generated"
#fetch
bin/nutch fetch $s1 -threads $2
echo "fetching done"
#update the database with results of fetch
bin/nutch updatedb crawl/crawldb $s1
echo "database updated"
#merge segments
bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
rm -r crawl/segments
mv crawl/MERGEDsegments crawl/segments
echo "segments merged"
#inverting links
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
echo "links inverted"
#indexing
bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb crawl/segments/*
echo "indexing done"
#dedup - delete duplicate documents in the index
bin/nutch dedup crawl/NEWindexes
echo "dedup done"
#merging indexes
bin/nutch merge crawl/MERGEDindexes crawl/NEWindexes
echo "indexes merged"
# replace indexes with indexes_merged
mv --verbose crawl/indexes crawl/OLDindexes
mv --verbose crawl/MERGEDindexes crawl/indexes/part-00000
#clean up
rm -rf crawl/NEWindexes
rm -rf crawl/OLDindexes
-------------------------------------------------
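One step I am not sure about is the final index swap, so here is a minimal sketch that reproduces only those two mv commands in a scratch directory. The scratch directory and the variable names (work, swap_result, fixed_result) are hypothetical and exist only for this test; the empty directories stand in for real indexes. It only shows how mv behaves once crawl/indexes has been renamed; it does not tell me whether this is the actual cause of the empty search results.

```shell
#!/bin/bash
# Hypothetical scratch directory; nothing here touches the real crawl/ tree.
work=$(mktemp -d)
mkdir -p "$work/crawl/indexes" "$work/crawl/MERGEDindexes"

# Same order as in the script: move the old index aside first...
mv "$work/crawl/indexes" "$work/crawl/OLDindexes"

# ...then try to move the merged index to crawl/indexes/part-00000.
# At this point crawl/indexes no longer exists as a directory.
if mv "$work/crawl/MERGEDindexes" "$work/crawl/indexes/part-00000" 2>/dev/null; then
    swap_result="ok"
else
    swap_result="failed"
fi
echo "swap without mkdir: $swap_result"

# Assumption: recreating the parent directory first lets the same mv succeed.
mkdir -p "$work/crawl/indexes"
fixed_result="failed"
mv "$work/crawl/MERGEDindexes" "$work/crawl/indexes/part-00000" && fixed_result="ok"
echo "swap with mkdir: $fixed_result"

# Remove the scratch directory again.
rm -rf "$work"
```

If the first mv fails here as well, a "mkdir -p crawl/indexes" between the two mv commands in the script might be worth trying.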
What's wrong with the script?
Thank you in advance,
Kind Regards,
--
Andrey Sapegin,
Software Developer,
Unister GmbH
www.unister.de