I am using Nutch 2.0 and Solr 4.0 and am having minimal success. I have 3 URLs in my seed.txt, and my regex-urlfilter.xml is set to allow everything.
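
To be concrete about "allow everything": my seed list and URL filter boil down to roughly the snippet below (the example.com hosts are placeholders rather than my real seed URLs, and the filter lines are paraphrased, not copied verbatim from my file):

```
# urls/seed.txt -- three seed URLs (placeholders)
http://www.example.com/
http://www.example.org/
http://www.example.net/

# URL filter -- default skip rules removed, one catch-all accept rule left
+.
```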
I ran this script:

```bash
#!/bin/bash
# Nutch crawl

export NUTCH_HOME=~/java/workspace/Nutch2.0/runtime/local

# depth in the web exploration
n=1
# number of selected urls for fetching
maxUrls=50000
# solr server
solrUrl=http://localhost:8983

for (( i = 1 ; i <= $n ; i++ ))
do
  log=$NUTCH_HOME/logs/log

  # Generate
  $NUTCH_HOME/bin/nutch generate -topN $maxUrls > $log
  batchId=`sed -n 's|.*batch id: \(.*\)|\1|p' < $log`

  # rename log file by appending the batch id
  log2=$log$batchId
  mv $log $log2
  log=$log2

  # Fetch
  $NUTCH_HOME/bin/nutch fetch $batchId >> $log

  # Parse
  $NUTCH_HOME/bin/nutch parse $batchId >> $log

  # Update
  $NUTCH_HOME/bin/nutch updatedb >> $log

  # Index
  $NUTCH_HOME/bin/nutch solrindex $solrUrl $batchId >> $log
done
```

Of course, I run `bin/nutch inject urls` before I run the script, but when I look at the logs I see lines like `Skipping ...: different batch id`, and some of the URLs being skipped aren't even in my seed.txt. I want them indexed into Solr, but they aren't added. I have 3 URLs in my seed.txt.

After running this script I also tried:

```bash
bin/nutch parse -force -all
bin/nutch updatedb
bin/nutch solrindex http://127.0.0.1:8983/solr/sites -reindex
```

My questions are as follows:

1. Why were those last three commands necessary?
2. How do I get all of the URLs handled during the parse job? Even with `-force -all` I still get the "different batch id, skipping" messages.
3. In the script above, if I set `generate -topN` to 5, does that mean that if a site links to another site, which links to another site, and so on, those sites will all be included in the fetch/parse cycle?
4. What about this command, why is it even mentioned: `bin/nutch crawl urls -solr http://127.0.0.1:8983/solr/sites -depth 3 -topN 10000 -threads 3`?
5. When I run `bin/nutch updatedb` it takes 1-2 minutes and then just echoes `Killed`. This concerns me.

Please help.
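
For completeness, the full sequence I run from the Nutch runtime directory looks roughly like this (`crawl.sh` is just what I call the script above, not a Nutch command):

```bash
cd ~/java/workspace/Nutch2.0/runtime/local

# inject the 3 seed URLs before the crawl loop
bin/nutch inject urls

# run the generate/fetch/parse/updatedb/solrindex loop shown above
./crawl.sh

# the extra commands I tried by hand afterwards
bin/nutch parse -force -all
bin/nutch updatedb
bin/nutch solrindex http://127.0.0.1:8983/solr/sites -reindex
```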

