I forgot to mention that my regex-urlfilter.txt looks like this:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.
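
A rough way to sanity-check these rules is plain GNU grep (the loop rule
needs grep -P, since the \1 backreference is beyond ordinary ERE). This only
approximates what Nutch's URLFilter actually runs, but it is quick:

# should match, i.e. the URL would be skipped by the loop-breaking rule:
# the slash-delimited segment "/a" repeats three times
echo "http://example.com/a/x/a/y/a/z/" | grep -P '(/[^/]+)/[^/]+\1/[^/]+\1/'

# should print nothing, mirroring the -[?*!@=] probable-query rule
echo "http://example.com/page?id=3" | grep -vE '[?*!@=]'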



On Mon, Aug 27, 2012 at 7:39 PM, Robert Irribarren <[email protected]> wrote:

> I am running Nutch 2.0 in local mode with Solr 4.0 Beta, and I am using the
> script shown further down.
>
> When I run this:
> root@serverip:/usr/share/nutch/runtime/local# bin/nutch updatedb
> DbUpdaterJob: starting
> Exception in thread "main" java.lang.RuntimeException: job failed: name=update-table, jobid=job_local_0001
>         at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:47)
>         at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:96)
>         at org.apache.nutch.crawl.DbUpdaterJob.updateTable(DbUpdaterJob.java:103)
>         at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:117)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.crawl.DbUpdaterJob.main(DbUpdaterJob.java:121)
>
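> To dig for the root cause I grepped hadoop.log, where local-mode jobs
> usually log the underlying exception (this assumes NUTCH_HOME is exported
> as in my script below):
>
> # pull the context around any exception in the Hadoop/Nutch log
> grep -n -B 2 -A 20 "Exception" $NUTCH_HOME/logs/hadoop.log | tail -n 40
>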
> There wasn't much information there either. Here is the log from the run:
> GeneratorJob: Selecting best-scoring urls due for fetch.
> GeneratorJob: starting
> GeneratorJob: filtering: true
> GeneratorJob: topN: 18000
> GeneratorJob: done
> GeneratorJob: generated batch id: 1346109594-506815820
> FetcherJob: starting
> FetcherJob: batchId: 1346109594-506815820
> FetcherJob : timelimit set for : -1
> FetcherJob: threads: 10
> FetcherJob: parsing: false
> FetcherJob: resuming: false
> Using queue mode : byHost
> Fetcher: threads: 10
> QueueFeeder finished: total 0 records. Hit by time limit :0
> -finishing thread FetcherThread0, activeThreads=0
> Fetcher: throughput threshold: -1
> Fetcher: throughput threshold sequence: 5
> -finishing thread FetcherThread2, activeThreads=7
> -finishing thread FetcherThread3, activeThreads=6
> -finishing thread FetcherThread4, activeThreads=5
> -finishing thread FetcherThread5, activeThreads=4
> -finishing thread FetcherThread6, activeThreads=3
> -finishing thread FetcherThread7, activeThreads=2
> -finishing thread FetcherThread1, activeThreads=1
> -finishing thread FetcherThread8, activeThreads=0
> -finishing thread FetcherThread9, activeThreads=0
> 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0.0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
> -activeThreads=0
> FetcherJob: done
> ParserJob: starting
> ParserJob: resuming:    false
> ParserJob: forced reparse:      false
> ParserJob: batchId:     1346109594-506815820
> Skipping http://www.questacon.edu.au/; different batch id
> Skipping http://www.cbc.ca/; different batch id
> Skipping http://www.ecokids.ca/; different batch id
> Skipping http://www.texted.ca/; different batch id
> Skipping http://www.texted.ca/app/en/; different batch id
> Skipping http://www.911forkids.com/; different batch id
> Skipping http://www.abcmouse.com/; different batch id
> Skipping http://get.adobe.com/flashplayer; different batch id
> Skipping http://get.adobe.com/flashplayer/; different batch id
> Skipping http://get.adobe.com/flashplayer/otherversions/; different batch id
> Skipping http://www.adobe.com/go/getflashplayer; different batch id
> Skipping http://www.afrigadget.com/; different batch id
> Skipping http://www.anamalz.com/; different batch id
> Skipping http://www.angelinaballerina.com/; different batch id
> Skipping http://www.angelinaballerina.com/usa/index.html; different batch id
> Skipping http://www.animaljam.com/; different batch id
> Skipping http://kids.aol.com/; different batch id
> Skipping http://www.aquariumofthebay.com/; different batch id
> Skipping http://www.bbc.com/news/; different batch id
> Skipping http://www.bbc.com/sport/; different batch id
> Skipping http://www.bbc.com/travel; different batch id
> Skipping http://www.bbc.com/travel/; different batch id
> Skipping http://www.bbcamerica.com/; different batch id
> Skipping http://www.bbcamericashop.com/dvd/life-discovery-channel-version-15686.html; different batch id
> Skipping http://bbcearth.com/; different batch id
> Skipping http://bbcearth.com/meet-your-planet; different batch id
> Skipping http://bbcearth.com/people; different batch id
> Skipping http://bbcearth.com/people/alastair-fothergill; different batch id
> Skipping http://www.bbc.co.uk/news/world_radio_and_tv/; different batch id
> Skipping http://www.bbc.co.uk/sport/0/; different batch id
> Skipping http://www.themouseclub.co.uk/; different batch id
> ParserJob: success
> DbUpdaterJob: starting
> SolrIndexerJob: starting
> SolrIndexerJob: done.
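>
> All those "different batch id" skips made me want to check which batch id
> the rows actually carry, so I tried dumping one of the skipped rows (I am
> guessing at the WebTableReader options for 2.0 here):
>
> # print the stored row, including its batch id marker, for a skipped URL
> $NUTCH_HOME/bin/nutch readdb -url http://www.cbc.ca/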
>
>
> The script I am running is:
>
> #!/bin/bash
>
> # Nutch crawl
>
> export NUTCH_HOME=/usr/share/nutch/runtime/local
>
> # depth in the web exploration
> n=5
> # number of selected urls for fetching
> maxUrls=18000
> # solr server
> solrUrl=http://localhost:8983/solr/sites
>
>
> for (( i = 1 ; i <= $n ; i++ ))
> do
>
> log=$NUTCH_HOME/logs/log
>
> # Generate
> $NUTCH_HOME/bin/nutch generate -topN $maxUrls > $log
>
> batchId=`sed -n 's|.*batch id: \(.*\)|\1|p' < $log`
>
> # rename log file by appending the batch id
> log2=$log$batchId
> mv $log $log2
> log=$log2
>
> echo "Starting cycle $i of $n  Log file : $log2"
> # Fetch
> $NUTCH_HOME/bin/nutch fetch $batchId >> $log
>
> # Parse
> $NUTCH_HOME/bin/nutch parse $batchId >> $log
>
> # Update
> $NUTCH_HOME/bin/nutch updatedb >> $log
>
> # Index
> $NUTCH_HOME/bin/nutch solrindex $solrUrl $batchId >> $log
>
> done
>
> echo "starting finish crawl";
> bin/nutch parse -force -all
> bin/nutch updatedb
> bin/nutch solrindex http://127.0.0.1:8983/solr/sites -reindex
> echo "done"
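>
> One change I am considering (just a sketch, not tested yet): break out of
> the loop when generate produces no new batch id, since otherwise the later
> steps run against a stale one:
>
> batchId=`sed -n 's|.*batch id: \(.*\)|\1|p' < $log`
> if [ -z "$batchId" ]; then
>     # nothing new was generated this cycle; fetch/parse would be no-ops
>     echo "No new batch generated in cycle $i, stopping"
>     break
> fi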
>
>
>
> ---------------------------------
>
> It seems that when I change the
> # depth in the web exploration
> n=5
> # number of selected urls for fetching
> maxUrls=18000
>
> maxUrls setting to anything higher, I don't get any new results, and n=5
> has no effect either, because every iteration fails on updatedb. Each run
> outputs the same three lines:
> adding 250 documents
> adding 250 documents
> adding 3 documents
>
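> To verify what actually reaches the index, I count the documents in Solr
> after each run (assuming the core is still named "sites"):
>
> # numFound in the response should grow if indexing is really progressing
> curl "http://localhost:8983/solr/sites/select?q=*:*&rows=0&wt=json"
>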
> My seed list totals around 300 URLs, and I am running this on 618 MB of RAM
> on an Amazon EC2 free-tier server.
> Please help!
>
>
>
> On Mon, Aug 27, 2012 at 5:09 AM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
>> Hi Robert,
>>
>> Please describe your problem and we will be more than happy to give
>> you a hand. The Nutch community is pretty active and in a very healthy
>> state; if people do not get back to your messages immediately, don't
>> be disappointed, it's because people have lives outside of the
>> ASF and Nutch ;0)
>>
>> What version are you using, and what version of Solr? Do the SolrJ
>> libraries match? All the usual stuff... let's try to debug this and
>> get to the bottom of your error.
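>>
>> A quick way to check the SolrJ side (assuming the default local-runtime
>> layout) is to list the jars Nutch ships and compare the solr-solrj
>> version against your Solr server:
>>
>> # the solr-solrj jar version should be compatible with the Solr server
>> ls /usr/share/nutch/runtime/local/lib | grep -i solr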
>>
>> Thanks
>>
>> Lewis
>>
>> On Sun, Aug 26, 2012 at 9:41 PM, Robert Irribarren <[email protected]>
>> wrote:
>> > Thank you. I just sent a few good ones; I was fed up with getting no
>> > replies, so I sent an error log with no description to see if people
>> > actually cared. Thanks, Lewis, for your response, even if it shows no
>> > interest in the error itself but rather a course of action I can follow
>> > to fit into the mailing list better. I thank you.
>> >
>> > On Sun, Aug 26, 2012 at 3:39 AM, Lewis John Mcgibbney <
>> > [email protected]> wrote:
>> >
>> >> Hi Robert,
>> >>
>> >>
>> >> On Sun, Aug 26, 2012 at 5:25 AM, Robert Irribarren
>> >> <[email protected]> wrote:
>> >> > org.apache.solr.common.SolrException: Server Error
>> >> >
>> >> > Server Error
>> >> ...
>> >>
>> >> Please read this [0] before posting to the list. It saves both you and
>> >> us loads of time and also means there is less unnecessary noise on the
>> >> list.
>> >>
>> >> Thank you
>> >>
>> >> Lewis
>> >>
>> >> [0]
>> >> http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer#Step_One:_Using_the_Mailing_Lists
>> >>
>>
>>
>>
>> --
>> Lewis
>>
>
>
