Hello all, I'm new to Nutch and learning to crawl, so pardon my newbie questions.
I set urls/seed.txt to a single URL and also added that same URL to the last line of conf/regex-urlfilter.txt. When I run the 'bin/nutch crawl urls -dir mydir -depth 5' command, it completes in a minute or so, and when I run readdb with the -stats option, only 1 URL has been retrieved. Please advise why Nutch is not retrieving all pages from the given URL. I am behind a proxy and have set up the proxy details in the nutch-default.xml file. Any help is greatly appreciated. Thank you.

I'm dumping the log below.

Thanks,
Suresh

dwbilab01@dwbilab01-OptiPlex-990:~/apache-nutch-1.6$ bin/nutch crawl urls/ -dir mondaycrawl/ -depth 5
solrUrl is not set, indexing will be skipped...
crawl started in: mondaycrawl
rootUrlDir = urls
threads = 10
depth = 5
solrUrl=null
Injector: starting at 2013-06-03 10:24:28
Injector: crawlDb: mondaycrawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-06-03 10:24:43, elapsed: 00:00:14
Generator: starting at 2013-06-03 10:24:43
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: mondaycrawl/segments/20130603102451
Generator: finished at 2013-06-03 10:24:58, elapsed: 00:00:15
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-06-03 10:24:58
Fetcher: segment: mondaycrawl/segments/20130603102451
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.igate.com/
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
fetch of http://www.igate.com/ failed with: Http code=407, url=http://www.igate.com/
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-06-03 10:25:05, elapsed: 00:00:07
ParseSegment: starting at 2013-06-03 10:25:05
ParseSegment: segment: mondaycrawl/segments/20130603102451
ParseSegment: finished at 2013-06-03 10:25:12, elapsed: 00:00:07
CrawlDb update: starting at 2013-06-03 10:25:12
CrawlDb update: db: mondaycrawl/crawldb
CrawlDb update: segments: [mondaycrawl/segments/20130603102451]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2013-06-03 10:25:25, elapsed: 00:00:13
Generator: starting at 2013-06-03 10:25:25
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2013-06-03 10:25:32
LinkDb: linkdb: mondaycrawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: file:/home/dwbilab01/apache-nutch-1.6/mondaycrawl/segments/20130603102451
LinkDb: finished at 2013-06-03 10:25:39, elapsed: 00:00:07
crawl finished: mondaycrawl

dwbilab01@dwbilab01-OptiPlex-990:~/apache-nutch-1.6$ bin/nutch readdb mondaycrawl/crawldb/ -stats
CrawlDb statistics start: mondaycrawl/crawldb/
Statistics for CrawlDb: mondaycrawl/crawldb/
TOTAL urls: 1
retry 1: 1
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 1
CrawlDb statistics: done
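In case it helps, the last line I added to conf/regex-urlfilter.txt looks roughly like this (reconstructed from memory, so the exact pattern may differ slightly):

  +^http://([a-z0-9]*\.)*igate.com/

and the proxy details I put into nutch-default.xml are the standard http.proxy.* properties, roughly like this (the host and port below are placeholders, not my real values):

  <property>
    <name>http.proxy.host</name>
    <value>proxy.example.com</value>
  </property>
  <property>
    <name>http.proxy.port</name>
    <value>8080</value>
  </property>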

