Hello all, I'm new to Nutch and learning to crawl, so pardon my newbie questions.
I set urls/seed.txt to a single URL and also added that same URL to the last line of conf/regex-urlfilter.txt. When I run the 'bin/nutch crawl urls -dir mydir -depth 5' command, it completes in a minute or so, and when I run readdb with the -stats option, only 1 URL has been retrieved. Please advise why Nutch is not retrieving all pages from the given URL. I am behind a proxy and have set up the proxy details in the nutch-default.xml file. Any help is greatly appreciated. Thank you.

I'm dumping the log below.

Thanks,
Suresh

dwbilab01@dwbilab01-OptiPlex-990:~/apache-nutch-1.6$ bin/nutch crawl urls/ -dir mondaycrawl/ -depth 5
solrUrl is not set, indexing will be skipped...
crawl started in: mondaycrawl
rootUrlDir = urls
threads = 10
depth = 5
solrUrl=null
Injector: starting at 2013-06-03 10:24:28
Injector: crawlDb: mondaycrawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-06-03 10:24:43, elapsed: 00:00:14
Generator: starting at 2013-06-03 10:24:43
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: mondaycrawl/segments/20130603102451
Generator: finished at 2013-06-03 10:24:58, elapsed: 00:00:15
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-06-03 10:24:58
Fetcher: segment: mondaycrawl/segments/20130603102451
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.igate.com/
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
fetch of http://www.igate.com/ failed with: Http code=407, url=http://www.igate.com/
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-06-03 10:25:05, elapsed: 00:00:07
ParseSegment: starting at 2013-06-03 10:25:05
ParseSegment: segment: mondaycrawl/segments/20130603102451
ParseSegment: finished at 2013-06-03 10:25:12, elapsed: 00:00:07
CrawlDb update: starting at 2013-06-03 10:25:12
CrawlDb update: db: mondaycrawl/crawldb
CrawlDb update: segments: [mondaycrawl/segments/20130603102451]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2013-06-03 10:25:25, elapsed: 00:00:13
Generator: starting at 2013-06-03 10:25:25
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2013-06-03 10:25:32
LinkDb: linkdb: mondaycrawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: file:/home/dwbilab01/apache-nutch-1.6/mondaycrawl/segments/20130603102451
LinkDb: finished at 2013-06-03 10:25:39, elapsed: 00:00:07
crawl finished: mondaycrawl

dwbilab01@dwbilab01-OptiPlex-990:~/apache-nutch-1.6$ bin/nutch readdb mondaycrawl/crawldb/ -stats
CrawlDb statistics start: mondaycrawl/crawldb/
Statistics for CrawlDb: mondaycrawl/crawldb/
TOTAL urls: 1
retry 1: 1
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 1
CrawlDb statistics: done
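In case it helps, the last line I added to conf/regex-urlfilter.txt looks roughly like this (reconstructed from memory, so the exact pattern may differ slightly):

  +^http://([a-z0-9]*\.)*igate.com/

and the proxy details I put into nutch-default.xml are the standard http.proxy.* properties, roughly like this (the host and port below are placeholders, not my real values):

  <property>
    <name>http.proxy.host</name>
    <value>proxy.example.com</value>
  </property>
  <property>
    <name>http.proxy.port</name>
    <value>8080</value>
  </property>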

