Hello everybody. I have a problem with my current Nutch 2.1 configuration.
I have been running my crawling process (through the bin/crawl script) every
morning for the last few months, and everything went fine: in every run the
script generated a new batch, which was fetched, then parsed, and finally
indexed into Solr (I also use HBase as my data store).
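Roughly, each morning's run boils down to the following cycle (just a sketch
of what I understand the 2.1 bin/crawl script to do; the URL dir, topN, thread
count and Solr URL here are placeholders, not my real settings):

  # seed injection (done once), then the repeated cycle:
  bin/nutch inject urls/
  bin/nutch generate -topN 50000            # prints "generated batch id: ..."
  bin/nutch fetch <batchId> -threads 10
  bin/nutch parse <batchId>
  bin/nutch updatedb
  bin/nutch solrindex http://localhost:8983/solr/ <batchId>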

I do not know why, but something has broken and nothing is fetched any more
(and because of this there is also nothing to parse and index).

In case it helps, I ran bin/nutch readdb -stats; the output is here:
http://pastebin.com/r3vFK4T4
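If needed, I can also look up individual records, e.g. with something like
this (just a sketch - the URL is a placeholder, and I am assuming I read the
WebTableReader options correctly):

  # print the stored record (status, fetchTime, markers, ...) for one key
  bin/nutch readdb -url http://example.com/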

The stats contain, among other things:
status 3 (status_gone): 9907
status 1 (status_unfetched):    6798166
status 0 (null):        2

I do not fully understand these attributes, but 6798166 unfetched URLs seems
very suspicious.

During the crawling process I can see that GeneratorJob generates some batch
id, but I do not know whether that batch actually contains any URLs or is
empty.

GeneratorJob: generated batch id: 1329930901-1268252438 
FetcherJob: starting 
FetcherJob : timelimit set for : -1 
FetcherJob: threads: 10 
FetcherJob: parsing: false 
FetcherJob: resuming: false 
FetcherJob: batchId: 1329930901-1268252438 
Using queue mode : byHost 
...
-activeThreads=0, spinWaiting=0, fetchQueues= 0, fetchQueues.totalSize=0 
-activeThreads=0 
FetcherJob: done

Fetching finishes almost immediately because, as I said, there is no URL to
be fetched.
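To check whether the generated batch is really empty, I plan to dump the
webtable and grep for the batch id, roughly like this (a sketch;
/tmp/webdb_dump and the part-r-00000 file name are assumptions about where
the local-mode dump lands, and I am assuming the batchId field shows up in
the dump output):

  bin/nutch readdb -dump /tmp/webdb_dump
  grep -c '1329930901-1268252438' /tmp/webdb_dump/part-r-00000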

Can you help me or give any advice?
Thanks, Jan


