I forgot to mention that my regex-urlfilter.txt looks like this:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.
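One thing worth checking with a filter file like this: the -[?*!@=] rule silently rejects every URL containing a query string, so seeds can be dropped before they ever reach the fetcher. A minimal way to verify how the filter chain treats your seed URLs, assuming the stock URLFilterChecker class and its -allCombined flag are present in your Nutch build (it reads URLs from stdin and prints "+" for accepted, "-" for rejected):

cd /usr/share/nutch/runtime/local
# feed a few seed URLs through all configured URL filters at once
printf '%s\n' \
  "http://www.cbc.ca/" \
  "http://get.adobe.com/flashplayer?promoid=test" \
  | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined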
On Mon, Aug 27, 2012 at 7:39 PM, Robert Irribarren <[email protected]> wrote:

> I am running Nutch 2.0 in local mode with Solr 4.0 beta.
> I have this script here.
>
> And I run this:
>
> root@serverip:/usr/share/nutch/runtime/local# bin/nutch updatedb
> DbUpdaterJob: starting
> Exception in thread "main" java.lang.RuntimeException: job failed:
> name=update-table, jobid=job_local_0001
>         at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:47)
>         at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:96)
>         at org.apache.nutch.crawl.DbUpdaterJob.updateTable(DbUpdaterJob.java:103)
>         at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:117)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.crawl.DbUpdaterJob.main(DbUpdaterJob.java:121)
>
> I looked at the logs, but there wasn't much information. Here is the log:
>
> GeneratorJob: Selecting best-scoring urls due for fetch.
> GeneratorJob: starting
> GeneratorJob: filtering: true
> GeneratorJob: topN: 18000
> GeneratorJob: done
> GeneratorJob: generated batch id: 1346109594-506815820
> FetcherJob: starting
> FetcherJob: batchId: 1346109594-506815820
> FetcherJob : timelimit set for : -1
> FetcherJob: threads: 10
> FetcherJob: parsing: false
> FetcherJob: resuming: false
> Using queue mode : byHost
> Fetcher: threads: 10
> QueueFeeder finished: total 0 records.
> Hit by time limit :0
> -finishing thread FetcherThread0, activeThreads=0
> Fetcher: throughput threshold: -1
> Fetcher: throughput threshold sequence: 5
> -finishing thread FetcherThread2, activeThreads=7
> -finishing thread FetcherThread3, activeThreads=6
> -finishing thread FetcherThread4, activeThreads=5
> -finishing thread FetcherThread5, activeThreads=4
> -finishing thread FetcherThread6, activeThreads=3
> -finishing thread FetcherThread7, activeThreads=2
> -finishing thread FetcherThread1, activeThreads=1
> -finishing thread FetcherThread8, activeThreads=0
> -finishing thread FetcherThread9, activeThreads=0
> 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0.0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
> -activeThreads=0
> FetcherJob: done
> ParserJob: starting
> ParserJob: resuming: false
> ParserJob: forced reparse: false
> ParserJob: batchId: 1346109594-506815820
> Skipping http://www.questacon.edu.au/; different batch id
> Skipping http://www.cbc.ca/; different batch id
> Skipping http://www.ecokids.ca/; different batch id
> Skipping http://www.texted.ca/; different batch id
> Skipping http://www.texted.ca/app/en/; different batch id
> Skipping http://www.911forkids.com/; different batch id
> Skipping http://www.abcmouse.com/; different batch id
> Skipping http://get.adobe.com/flashplayer; different batch id
> Skipping http://get.adobe.com/flashplayer/; different batch id
> Skipping http://get.adobe.com/flashplayer/otherversions/; different batch id
> Skipping http://www.adobe.com/go/getflashplayer; different batch id
> Skipping http://www.afrigadget.com/; different batch id
> Skipping http://www.anamalz.com/; different batch id
> Skipping http://www.angelinaballerina.com/; different batch id
> Skipping http://www.angelinaballerina.com/usa/index.html; different batch id
> Skipping http://www.animaljam.com/; different batch id
> Skipping http://kids.aol.com/; different batch id
> Skipping http://www.aquariumofthebay.com/; different batch id
> Skipping http://www.bbc.com/news/; different batch id
> Skipping http://www.bbc.com/sport/; different batch id
> Skipping http://www.bbc.com/travel; different batch id
> Skipping http://www.bbc.com/travel/; different batch id
> Skipping http://www.bbcamerica.com/; different batch id
> Skipping http://www.bbcamericashop.com/dvd/life-discovery-channel-version-15686.html; different batch id
> Skipping http://bbcearth.com/; different batch id
> Skipping http://bbcearth.com/meet-your-planet; different batch id
> Skipping http://bbcearth.com/people; different batch id
> Skipping http://bbcearth.com/people/alastair-fothergill; different batch id
> Skipping http://www.bbc.co.uk/news/world_radio_and_tv/; different batch id
> Skipping http://www.bbc.co.uk/sport/0/; different batch id
> Skipping http://www.themouseclub.co.uk/; different batch id
> ParserJob: success
> DbUpdaterJob: starting
> SolrIndexerJob: starting
> SolrIndexerJob: done.
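A note on reading the failure quoted above: the one-line RuntimeException from DbUpdaterJob is only the job wrapper; in local mode the underlying exception from the failed Hadoop task normally lands in logs/hadoop.log under the runtime directory, not on stdout. A quick sketch for pulling it out, assuming the default local layout shown in the transcript:

cd /usr/share/nutch/runtime/local
# show the most recent error plus its surrounding stack trace
grep -n -A 30 ERROR logs/hadoop.log | tail -n 40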
>
> The script I am running is:
>
> #!/bin/bash
>
> # Nutch crawl
>
> export NUTCH_HOME=/usr/share/nutch/runtime/local
>
> # depth in the web exploration
> n=5
> # number of selected urls for fetching
> maxUrls=18000
> # solr server
> solrUrl=http://localhost:8983/solr/sites
>
> for (( i = 1 ; i <= $n ; i++ ))
> do
>
>   log=$NUTCH_HOME/logs/log
>
>   # Generate
>   $NUTCH_HOME/bin/nutch generate -topN $maxUrls > $log
>
>   batchId=`sed -n 's|.*batch id: \(.*\)|\1|p' < $log`
>
>   # rename log file by appending the batch id
>   log2=$log$batchId
>   mv $log $log2
>   log=$log2
>
>   echo "Starting cycle $i of $n  Log file: $log2"
>
>   # Fetch
>   $NUTCH_HOME/bin/nutch fetch $batchId >> $log
>
>   # Parse
>   $NUTCH_HOME/bin/nutch parse $batchId >> $log
>
>   # Update
>   $NUTCH_HOME/bin/nutch updatedb >> $log
>
>   # Index
>   $NUTCH_HOME/bin/nutch solrindex $solrUrl $batchId >> $log
>
> done
>
> echo "starting finish crawl"
> bin/nutch parse -force -all
> bin/nutch updatedb
> bin/nutch solrindex http://127.0.0.1:8983/solr/sites -reindex
> echo "done"
>
> ---------------------------------
>
> It seems that when I change
>
> # depth in the web exploration
> n=5
> # number of selected urls for fetching
> maxUrls=18000
>
> and set maxUrls to anything higher, I don't get any newer results, and
> n=5 doesn't affect it either, because each iteration errors on updatedb.
> Each time it outputs the same 3 lines:
>
> adding 250 documents
> adding 250 documents
> adding 3 documents
>
> My seeds total up to 300, and I am running this on 618 MB of RAM on the
> Amazon EC2 free tier.
> Please help!
>
> On Mon, Aug 27, 2012 at 5:09 AM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
>> Hi Robert,
>>
>> Please describe your problem and we will be more than happy to give
>> you a hand. The Nutch community is pretty active and in a very healthy
>> state. If people do not get back to your messages immediately, don't
>> be disappointed; it's because people have lives outside of the ASF and
>> Nutch ;0)
>>
>> What version are you using, and what version of Solr?
>> Do the Solrj libraries match? All the usual stuff... let's try to
>> debug and get to the bottom of your error.
>>
>> Thanks
>>
>> Lewis
>>
>> On Sun, Aug 26, 2012 at 9:41 PM, Robert Irribarren <[email protected]>
>> wrote:
>> > Thank you. I just sent a few good ones, and I was fed up with no
>> > replies, so I just sent an error log with no description to see if
>> > people actually cared. Thanks, Lewis, for your response, even if it
>> > shows no interest in the error itself but rather a course of action
>> > that I can follow to fit into the mailing list better. I thank you.
>> >
>> > On Sun, Aug 26, 2012 at 3:39 AM, Lewis John Mcgibbney <
>> > [email protected]> wrote:
>> >
>> >> Hi Robert,
>> >>
>> >> On Sun, Aug 26, 2012 at 5:25 AM, Robert Irribarren <
>> >> [email protected]> wrote:
>> >> > org.apache.solr.common.SolrException: Server Error
>> >> >
>> >> > Server Error
>> >> ...
>> >>
>> >> Please read this [0] before posting to the list. It saves both you
>> >> and us loads of time and also means there is less unnecessary noise
>> >> on the list.
>> >>
>> >> Thank you
>> >>
>> >> Lewis
>> >>
>> >> [0] http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer#Step_One:_Using_the_Mailing_Lists
>>
>> --
>> Lewis
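One more observation on the script above: the loop keeps running even after updatedb fails, so later iterations generate, fetch, and index against a crawl db that was never updated. That would explain both the repeating "adding 250 documents" output and why raising maxUrls changes nothing. A minimal sketch of a guard that stops the crawl on the first failed step, using the same variables as the script; the run_step helper name is made up for illustration:

# hypothetical helper: run one nutch step, append its output to the
# cycle log, and abort the whole crawl if the step exits non-zero
run_step() {
  "$NUTCH_HOME/bin/nutch" "$@" >> "$log" || {
    echo "step '$*' failed in cycle $i; see $log" >&2
    exit 1
  }
}

# inside the loop, replacing the bare invocations:
run_step fetch "$batchId"
run_step parse "$batchId"
run_step updatedb
run_step solrindex "$solrUrl" "$batchId"

With this in place, the first failing updatedb halts the run and points at the cycle log, instead of letting every later iteration reuse stale batch ids.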

