Since generate does not pick up the URLs that have not yet been fetched, no amount of indexing now adds more to my index. I've hit some kind of wall.

Can I force Nutch to generate only URLs that have not yet been fetched, and not the ones already fetched?

Cheers
Shane.


On 26/03/14 09:29, Shane Wood wrote:
Yes. The only error or warning I get is:

mapred.FileOutputCommitter - Output path is null in cleanup

What does this mean? What would be the command line to index a single domain, say test.com?

Why does generate give me the same fetch list every time? I thought Nutch would only re-index the same page once every 30 days. My setup fetches the same pages every time I index, which seems a waste of resources.
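
One common way to restrict a crawl to a single domain is a rule in Nutch's conf/regex-urlfilter.txt, for example `+^http://([a-z0-9]*\.)*test\.com/` in place of the default accept-everything rule. The sketch below only checks which URLs such a pattern would accept, using grep -E. The sample URLs and the `accepts` function name are hypothetical, and it assumes grep's extended regex behaves like Nutch's Java regex for a pattern this simple.

```shell
#!/bin/sh
# Pattern as it might appear (after the leading '+') in conf/regex-urlfilter.txt.
# Assumption: test.com is the only domain that should be crawled.
PATTERN='^http://([a-z0-9]*\.)*test\.com/'

# accepts URL: succeeds when the URL matches the pattern.
accepts() {
    printf '%s\n' "$1" | grep -Eq "$PATTERN"
}

# Hypothetical sample URLs, used only to illustrate the rule.
for url in http://test.com/ http://www.test.com/page.html http://example.org/; do
    if accepts "$url"; then
        echo "accept $url"
    else
        echo "reject $url"
    fi
done
```

A URL that no rule accepts is dropped, so the accept-everything rule at the end of the file would need to be removed or replaced for the restriction to take effect.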

Cheers
Shane.


On 26/03/14 06:37, d_k wrote:
Are you sure all the steps are working? Did you look at the logs?


On Tue, Mar 25, 2014 at 4:50 AM, Shane Wood <[email protected]> wrote:

I have set up Nutch, Solr and MySQL as per this how-to:
http://nlp.solutions.asia/?p=362
I run Nutch using these commands.

./bin/nutch inject urls
./bin/nutch generate -topN 20
./bin/nutch fetch -all
./bin/nutch parse -all
./bin/nutch updatedb

./bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex
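
The generate/fetch/parse/updatedb cycle above could be wrapped in a small script, sketched here under assumptions: the `NUTCH` variable and the `crawl_round` function name are hypothetical, and the commands and arguments are copied unchanged from the list above. Each step runs only if the previous one succeeded, so a failing generate or fetch is not hidden by the later steps.

```shell
#!/bin/sh
# Hypothetical wrapper: NUTCH and crawl_round are illustrative names.
# NUTCH defaults to the launcher path used in the commands above.
NUTCH="${NUTCH:-./bin/nutch}"

# Run one crawl round; stop at the first step that fails.
crawl_round() {
    "$NUTCH" generate -topN 20 &&
    "$NUTCH" fetch -all &&
    "$NUTCH" parse -all &&
    "$NUTCH" updatedb
}
```

After a one-time `./bin/nutch inject urls`, calling `crawl_round` and checking its exit status before running solrindex would show which round, if any, failed.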

I have a /crawl folder, yet nothing appears in it while it's indexing. Where
does Nutch store the content etc. while it's indexing?

Is there an informative FAQ on what differences using MySQL makes to your
setup?

Cheers for any help
Shane.


