Hello All, I am running Nutch 1.7 in Eclipse, and I start out with the Crawl job using the following settings:
Main Class: org.apache.nutch.crawl.Crawl
Arguments: urls -dir crawl -depth 10 -topN 10

In the "urls" directory I have only one URL, http://www.ebay.com, and I expect the whole website to be crawled. However, the crawl seems to stop after only a few URLs are fetched. I am using nutch-default.xml and have already set http.content.limit to -1 in it, as mentioned in another message on this mailing list. Please see the log below and advise. I am running Eclipse on CentOS 6.4 with Nutch 1.7.

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.4/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 10
solrUrl=null
topN = 10
Injector: starting at 2013-08-18 22:48:45
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 1
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-08-18 22:48:47, elapsed: 00:00:02
Generator: starting at 2013-08-18 22:48:47
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 10
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20130818224849
Generator: finished at 2013-08-18 22:48:51, elapsed: 00:00:03
Fetcher: starting at 2013-08-18 22:48:51
Fetcher: segment: crawl/segments/20130818224849
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
fetching http://www.ebay.com/ (queue crawl delay=5000ms)
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-08-18 22:48:56, elapsed: 00:00:05
ParseSegment: starting at 2013-08-18 22:48:56
ParseSegment: segment: crawl/segments/20130818224849
Parsed (15ms):http://www.ebay.com/
ParseSegment: finished at 2013-08-18 22:48:57, elapsed: 00:00:01
CrawlDb update: starting at 2013-08-18 22:48:57
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20130818224849]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404
purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2013-08-18 22:48:58, elapsed: 00:00:01
Generator: starting at 2013-08-18 22:48:58
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 10
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2013-08-18 22:48:59
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: file:/home/general/workspace/nutch/crawl/segments/20130818224849
LinkDb: finished at 2013-08-18 22:49:00, elapsed: 00:00:01
crawl finished: crawl
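For reference, this is how I set http.content.limit in my configuration (I am writing the property from memory, so the description text may differ slightly from the stock nutch-default.xml):

```xml
<property>
  <name>http.content.limit</name>
  <!-- -1 should disable truncation of downloaded content -->
  <value>-1</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. Content longer than this will be truncated;
  a value of -1 disables the limit.</description>
</property>
```

I understand overrides are normally supposed to go in nutch-site.xml rather than nutch-default.xml, so please let me know if editing nutch-default.xml directly could be part of the problem.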

