The logs say:
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.

Please dump the segment and inspect the outlinks that were extracted.
Also check your URL filters.
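
To make both checks concrete, here is a minimal sketch assuming a stock
Nutch 1.7 layout; the segment name is the one from the log below, and the
eBay search URL is a hypothetical example outlink:

```shell
# 1) Dump the segment so the extracted outlinks can be inspected:
#      bin/nutch readseg -dump crawl/segments/20130818224849 segdump
#      grep outlink segdump/dump | less
#
# 2) The default conf/regex-urlfilter.txt ships with the rule "-[?*!@=]",
#    which rejects any URL containing a query string or session id --
#    and that covers most eBay outlinks. The same check in plain grep:
url='http://www.ebay.com/sch/i.html?_nkw=laptop'   # hypothetical outlink
if printf '%s\n' "$url" | grep -q '[?*!@=]'; then
  echo "rejected by default filter"
else
  echo "passed default filter"
fi
```

If the dump shows outlinks but the next generate round selects 0 records,
the filter rule above is the usual suspect; commenting it out (or
whitelisting the host) lets the crawl go deeper.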


On Sun, Aug 18, 2013 at 8:02 PM, S.L <[email protected]> wrote:

> Hello All,
>
> I am running Nutch 1.7 in Eclipse, and I start the Crawl job with
> the following settings.
>
> Main Class :org.apache.nutch.crawl.Crawl
> Arguments : urls -dir crawl -depth 10 -topN 10
>
> In the "urls" directory I have only one URL, http://www.ebay.com, and I
> expect the whole website to be crawled; however, I get the following log
> output and the crawl seems to stop after only a few URLs are fetched.
>
> I use the nutch-default.xml and have already set http.content.limit to -1
> in it, as mentioned in another message on this mailing list. However, the
> crawl still stops after a few URLs are fetched; please see the log below
> and advise.
>
> I am running Eclipse on CentOS 6.4 with Nutch 1.7.
>
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in
>
> [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.4/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
>
> [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> explanation.
> solrUrl is not set, indexing will be skipped...
> crawl started in: crawl
> rootUrlDir = urls
> threads = 10
> depth = 10
> solrUrl=null
> topN = 10
> Injector: starting at 2013-08-18 22:48:45
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: total number of urls rejected by filters: 1
> Injector: total number of urls injected after normalization and filtering:
> 1
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2013-08-18 22:48:47, elapsed: 00:00:02
> Generator: starting at 2013-08-18 22:48:47
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 10
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl/segments/20130818224849
> Generator: finished at 2013-08-18 22:48:51, elapsed: 00:00:03
> Fetcher: starting at 2013-08-18 22:48:51
> Fetcher: segment: crawl/segments/20130818224849
> Using queue mode : byHost
> Fetcher: threads: 10
> Fetcher: time-out divisor: 2
> QueueFeeder finished: total 1 records + hit by time limit :0
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> Fetcher: throughput threshold: -1
> Fetcher: throughput threshold retries: 5
> fetching http://www.ebay.com/ (queue crawl delay=5000ms)
> -finishing thread FetcherThread, activeThreads=8
> -finishing thread FetcherThread, activeThreads=8
> -finishing thread FetcherThread, activeThreads=2
> -finishing thread FetcherThread, activeThreads=3
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=6
> -finishing thread FetcherThread, activeThreads=7
> -finishing thread FetcherThread, activeThreads=1
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2013-08-18 22:48:56, elapsed: 00:00:05
> ParseSegment: starting at 2013-08-18 22:48:56
> ParseSegment: segment: crawl/segments/20130818224849
> Parsed (15ms):http://www.ebay.com/
> ParseSegment: finished at 2013-08-18 22:48:57, elapsed: 00:00:01
> CrawlDb update: starting at 2013-08-18 22:48:57
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20130818224849]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: 404 purging: false
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2013-08-18 22:48:58, elapsed: 00:00:01
> Generator: starting at 2013-08-18 22:48:58
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 10
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=1 - no more URLs to fetch.
> LinkDb: starting at 2013-08-18 22:48:59
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: internal links will be ignored.
> LinkDb: adding segment:
> file:/home/general/workspace/nutch/crawl/segments/20130818224849
> LinkDb: finished at 2013-08-18 22:49:00, elapsed: 00:00:01
> crawl finished: crawl
>
