I am running Nutch 2.0 in local mode with Solr 4.0 beta.
I have a crawl script (included further down).
When I run this:
root@serverip:/usr/share/nutch/runtime/local# bin/nutch updatedb
DbUpdaterJob: starting
Exception in thread "main" java.lang.RuntimeException: job failed: name=update-table, jobid=job_local_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:47)
at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:96)
at org.apache.nutch.crawl.DbUpdaterJob.updateTable(DbUpdaterJob.java:103)
at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:117)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.DbUpdaterJob.main(DbUpdaterJob.java:121)
I looked at the logs, but there wasn't much information. Here is the log:
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: topN: 18000
GeneratorJob: done
GeneratorJob: generated batch id: 1346109594-506815820
FetcherJob: starting
FetcherJob: batchId: 1346109594-506815820
FetcherJob : timelimit set for : -1
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread0, activeThreads=0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread2, activeThreads=7
-finishing thread FetcherThread3, activeThreads=6
-finishing thread FetcherThread4, activeThreads=5
-finishing thread FetcherThread5, activeThreads=4
-finishing thread FetcherThread6, activeThreads=3
-finishing thread FetcherThread7, activeThreads=2
-finishing thread FetcherThread1, activeThreads=1
-finishing thread FetcherThread8, activeThreads=0
-finishing thread FetcherThread9, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0.0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: done
ParserJob: starting
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: batchId: 1346109594-506815820
Skipping http://www.questacon.edu.au/; different batch id
Skipping http://www.cbc.ca/; different batch id
Skipping http://www.ecokids.ca/; different batch id
Skipping http://www.texted.ca/; different batch id
Skipping http://www.texted.ca/app/en/; different batch id
Skipping http://www.911forkids.com/; different batch id
Skipping http://www.abcmouse.com/; different batch id
Skipping http://get.adobe.com/flashplayer; different batch id
Skipping http://get.adobe.com/flashplayer/; different batch id
Skipping http://get.adobe.com/flashplayer/otherversions/; different batch id
Skipping http://www.adobe.com/go/getflashplayer; different batch id
Skipping http://www.afrigadget.com/; different batch id
Skipping http://www.anamalz.com/; different batch id
Skipping http://www.angelinaballerina.com/; different batch id
Skipping http://www.angelinaballerina.com/usa/index.html; different batch id
Skipping http://www.animaljam.com/; different batch id
Skipping http://kids.aol.com/; different batch id
Skipping http://www.aquariumofthebay.com/; different batch id
Skipping http://www.bbc.com/news/; different batch id
Skipping http://www.bbc.com/sport/; different batch id
Skipping http://www.bbc.com/travel; different batch id
Skipping http://www.bbc.com/travel/; different batch id
Skipping http://www.bbcamerica.com/; different batch id
Skipping http://www.bbcamericashop.com/dvd/life-discovery-channel-version-15686.html; different batch id
Skipping http://bbcearth.com/; different batch id
Skipping http://bbcearth.com/meet-your-planet; different batch id
Skipping http://bbcearth.com/people; different batch id
Skipping http://bbcearth.com/people/alastair-fothergill; different batch id
Skipping http://www.bbc.co.uk/news/world_radio_and_tv/; different batch id
Skipping http://www.bbc.co.uk/sport/0/; different batch id
Skipping http://www.themouseclub.co.uk/; different batch id
ParserJob: success
DbUpdaterJob: starting
SolrIndexerJob: starting
SolrIndexerJob: done.
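The cycle log above can be condensed to two numbers: how many records the QueueFeeder handed to the fetcher, and how many lines report a URL skipped by the parser. The helper below is a minimal sketch and not part of Nutch. The name summarize_cycle is made up for illustration, and it assumes the log lines keep exactly the wording shown above.

```shell
#!/bin/bash
# Hypothetical helper (not part of Nutch): summarise one cycle's log file.
# Assumes the "QueueFeeder finished" and "Skipping" lines look like the log above.
summarize_cycle() {
    local log=$1
    local queued
    local skipped
    # Number of records handed to the fetcher, taken from the
    # "QueueFeeder finished: total N records" line.
    queued=$(sed -n 's/^QueueFeeder finished: total \([0-9][0-9]*\) records.*/\1/p' "$log" | head -n 1)
    # Number of lines that start with "Skipping" (URLs the parser passed over).
    skipped=$(grep -c '^Skipping' "$log")
    echo "queued=${queued:-unknown} skipped=${skipped}"
}
```

Calling summarize_cycle "$log" at the end of each loop iteration would print one line per cycle. A result of queued=0 would suggest that the generated batch had nothing for the fetcher to work on.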
The script I am running is:
#!/bin/bash
# Nutch crawl
export NUTCH_HOME=/usr/share/nutch/runtime/local
# depth in the web exploration
n=5
# number of selected urls for fetching
maxUrls=18000
# solr server
solrUrl=http://localhost:8983/solr/sites
for (( i = 1 ; i <= $n ; i++ ))
do
    log=$NUTCH_HOME/logs/log
    # Generate
    $NUTCH_HOME/bin/nutch generate -topN $maxUrls > $log
    batchId=`sed -n 's|.*batch id: \(.*\)|\1|p' < $log`
    # rename log file by appending the batch id
    log2=$log$batchId
    mv $log $log2
    log=$log2
    echo "Starting cycle $i of $n Log file : $log2"
    # Fetch
    $NUTCH_HOME/bin/nutch fetch $batchId >> $log
    # Parse
    $NUTCH_HOME/bin/nutch parse $batchId >> $log
    # Update
    $NUTCH_HOME/bin/nutch updatedb >> $log
    # Index
    $NUTCH_HOME/bin/nutch solrindex $solrUrl $batchId >> $log
done
echo "starting finish crawl";
bin/nutch parse -force -all
bin/nutch updatedb
bin/nutch solrindex http://127.0.0.1:8983/solr/sites -reindex
echo "done"
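Because the loop above keeps going after a step fails, solrindex still runs when updatedb has errored. The following is a minimal sketch of two helpers that make such failures visible. Both names (extract_batch_id, run_step) are hypothetical and not part of Nutch, and the approach assumes that bin/nutch exits with a non-zero status when a job fails. Unlike the script above, run_step also appends stderr to the log, so any exception text ends up next to the job output.

```shell
#!/bin/bash
# Hypothetical helpers, not part of Nutch.

# Read generate output on stdin and print the batch id, using the same
# sed expression as the script above.
extract_batch_id() {
    sed -n 's|.*batch id: \(.*\)|\1|p' | head -n 1
}

# run_step NAME LOGFILE COMMAND [ARGS...]
# Runs COMMAND, appends its stdout and stderr to LOGFILE, and returns
# the command's exit status. Prints a short message on failure.
run_step() {
    local name=$1
    local log=$2
    local status
    shift 2
    "$@" >> "$log" 2>&1
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "$name failed with exit status $status, see $log" >&2
    fi
    return "$status"
}

# Possible use inside the loop (assumes a non-zero exit status on failure):
#   run_step updatedb "$log" "$NUTCH_HOME/bin/nutch" updatedb || break
```

Stopping the loop at the first failed step keeps each log short and makes it clear which command produced the error.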
---------------------------------
It seems that when I change these settings
# depth in the web exploration
n=5
# number of selected urls for fetching
maxUrls=18000
and set maxUrls to anything higher, I don't get any new results.
Changing n=5 has no effect either, because updatedb errors on
every iteration.
Each time, it outputs the same 3 lines:
adding 250 documents
adding 250 documents
adding 3 documents
My seeds total up to 300, and I am running this with 618 MB of RAM on
the Amazon EC2 free servers.
Please help!
On Mon, Aug 27, 2012 at 5:09 AM, Lewis John Mcgibbney <
[email protected]> wrote:
> Hi Robert,
>
> Please describe your problem and we will be more than happy to give
> you a hand. The Nutch community is pretty active and in a very healthy
> state. If people do not get back to your messages immediately, then
> don't be disappointed; it's because people have lives outside of the
> ASF and Nutch ;0)
>
> What version are you using, and what version of Solr as well?
> Do the Solrj libraries match? All the usual stuff... let's try and debug
> and get to the bottom of your error.
>
> Thanks
>
> Lewis
>
> On Sun, Aug 26, 2012 at 9:41 PM, Robert Irribarren <[email protected]> wrote:
> > Thank you. I just sent a few good ones, and I was fed up with no replies, so I
> > just sent an error log with no description to see if people actually cared.
> > Thanks Lewis for your response, even if it shows no interest in the error
> > itself but rather a course of action that I can follow to fit into the
> > mailing list better. I thank you.
> >
> > On Sun, Aug 26, 2012 at 3:39 AM, Lewis John Mcgibbney <
> > [email protected]> wrote:
> >
> >> Hi Robert,
> >>
> >>
> >> On Sun, Aug 26, 2012 at 5:25 AM, Robert Irribarren <[email protected]> wrote:
> >> > org.apache.solr.common.SolrException: Server Error
> >> >
> >> > Server Error
> >> ...
> >>
> >> Please read this [0] before posting to the list. It saves both you and
> >> us loads of time and also means there is less unnecessary noise on the
> >> list.
> >>
> >> Thank you
> >>
> >> Lewis
> >>
> >> [0] http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer#Step_One:_Using_the_Mailing_Lists
> >>
>
>
>
> --
> Lewis
>