Hello All,

I am running Nutch 1.7 in Eclipse, and I launch the Crawl job with the
following settings.

Main class: org.apache.nutch.crawl.Crawl
Arguments: urls -dir crawl -depth 10 -topN 10
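
If I run the same job from the command line instead of Eclipse, I believe the
equivalent invocation from the Nutch 1.7 runtime would be roughly this (the
working directory is just my local setup):

    # run from the Nutch runtime directory (path is my own setup)
    bin/nutch crawl urls -dir crawl -depth 10 -topN 10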

In the "urls" directory I have only one URL http://www.ebay.com and I
expect the whole website to be crawled , however I get the following log
output and the crawl seems to stop after a few urls are fetched.
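
For completeness, the seed list is just a plain text file in the "urls"
directory with one URL per line. Mine looks like this (the file name seed.txt
is my own choice; as far as I know Nutch reads every file in that directory):

    # urls/seed.txt -- file name is arbitrary
    http://www.ebay.com/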

I use nutch-default.xml and have already set http.content.limit to -1 in it,
as suggested in an earlier message on this mailing list, but the crawl still
stops early. Please see the log below and advise.
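
In case the exact syntax matters, this is the property entry as I have it
(standard Hadoop-style XML configuration; I edited nutch-default.xml directly,
though I understand overrides normally belong in nutch-site.xml):

    <!-- in nutch-default.xml; overrides usually go in nutch-site.xml -->
    <property>
      <name>http.content.limit</name>
      <value>-1</value>
    </property>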

I am running Eclipse on CentOS 6.4 with Nutch 1.7.

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.4/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 10
solrUrl=null
topN = 10
Injector: starting at 2013-08-18 22:48:45
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 1
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-08-18 22:48:47, elapsed: 00:00:02
Generator: starting at 2013-08-18 22:48:47
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 10
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20130818224849
Generator: finished at 2013-08-18 22:48:51, elapsed: 00:00:03
Fetcher: starting at 2013-08-18 22:48:51
Fetcher: segment: crawl/segments/20130818224849
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
fetching http://www.ebay.com/ (queue crawl delay=5000ms)
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-08-18 22:48:56, elapsed: 00:00:05
ParseSegment: starting at 2013-08-18 22:48:56
ParseSegment: segment: crawl/segments/20130818224849
Parsed (15ms):http://www.ebay.com/
ParseSegment: finished at 2013-08-18 22:48:57, elapsed: 00:00:01
CrawlDb update: starting at 2013-08-18 22:48:57
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20130818224849]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2013-08-18 22:48:58, elapsed: 00:00:01
Generator: starting at 2013-08-18 22:48:58
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 10
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2013-08-18 22:48:59
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment:
file:/home/general/workspace/nutch/crawl/segments/20130818224849
LinkDb: finished at 2013-08-18 22:49:00, elapsed: 00:00:01
crawl finished: crawl