How can I analyze the segment dump from Nutch? The segments folder seems to contain a number of subfolders, so could you please let me know which specific one I need to look into? Also, the index and data files are not plain text files, so I cannot make any sense of them directly.
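(For anyone finding this thread later: in Nutch 1.x a segment directory typically contains the parts crawl_generate, crawl_fetch, content, crawl_parse, parse_data and parse_text, stored as binary Hadoop MapFiles/SequenceFiles, which is why the index and data files are unreadable as-is. The readseg tool converts them to plain text; a sketch, assuming the segment path from the log below:)

```shell
# Dump the whole segment to a plain-text file under segdump/
bin/nutch readseg -dump crawl/segments/20130818224849 segdump

# Extracted outlinks live in the parse_data part; skipping the other
# parts keeps the dump small and easier to scan.
bin/nutch readseg -dump crawl/segments/20130818224849 segdump_parse \
    -nocontent -nofetch -nogenerate -noparse -noparsetext

# The result is an ordinary text file:
less segdump_parse/dump
```

(Paths and option names follow the Nutch 1.x SegmentReader usage; check `bin/nutch readseg` with no arguments for the exact flags in your build.)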
I am using the default regex filter that comes with Nutch 1.7; I have not changed it. Thank you.

On Mon, Aug 19, 2013 at 4:07 AM, Tejas Patil <[email protected]> wrote:
> The logs say:
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=1 - no more URLs to fetch.
>
> Please get the segment dump and analyse it for the outlinks extracted. Also
> check your filters.
>
>
> On Sun, Aug 18, 2013 at 8:02 PM, S.L <[email protected]> wrote:
>
> > Hello All,
> >
> > I am running Nutch 1.7 in eclipse and I start out with the Crawl job with
> > the following settings.
> >
> > Main Class : org.apache.nutch.crawl.Crawl
> > Arguments : urls -dir crawl -depth 10 -topN 10
> >
> > In the "urls" directory I have only one URL http://www.ebay.com and I
> > expect the whole website to be crawled, however I get the following log
> > output and the crawl seems to stop after a few urls are fetched.
> >
> > I use the nutch-default.xml and have already set http.content.limit to -1
> > in it as mentioned in the other message in this mailing list. However the
> > crawl stops after a few URLs are fetched, please see the log below and
> > advise.
> >
> > I am running eclipse on CentOS 6.4/Nutch 1.7
> >
> > SLF4J: Class path contains multiple SLF4J bindings.
> > SLF4J: Found binding in
> > [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.4/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> > SLF4J: Found binding in
> > [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> > solrUrl is not set, indexing will be skipped...
> > crawl started in: crawl
> > rootUrlDir = urls
> > threads = 10
> > depth = 10
> > solrUrl=null
> > topN = 10
> > Injector: starting at 2013-08-18 22:48:45
> > Injector: crawlDb: crawl/crawldb
> > Injector: urlDir: urls
> > Injector: Converting injected urls to crawl db entries.
> > Injector: total number of urls rejected by filters: 1
> > Injector: total number of urls injected after normalization and filtering: 1
> > Injector: Merging injected urls into crawl db.
> > Injector: finished at 2013-08-18 22:48:47, elapsed: 00:00:02
> > Generator: starting at 2013-08-18 22:48:47
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: true
> > Generator: normalizing: true
> > Generator: topN: 10
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: Partitioning selected urls for politeness.
> > Generator: segment: crawl/segments/20130818224849
> > Generator: finished at 2013-08-18 22:48:51, elapsed: 00:00:03
> > Fetcher: starting at 2013-08-18 22:48:51
> > Fetcher: segment: crawl/segments/20130818224849
> > Using queue mode : byHost
> > Fetcher: threads: 10
> > Fetcher: time-out divisor: 2
> > QueueFeeder finished: total 1 records + hit by time limit :0
> > Using queue mode : byHost
> > Using queue mode : byHost
> > Using queue mode : byHost
> > Using queue mode : byHost
> > Using queue mode : byHost
> > Using queue mode : byHost
> > Using queue mode : byHost
> > Using queue mode : byHost
> > Using queue mode : byHost
> > Using queue mode : byHost
> > Fetcher: throughput threshold: -1
> > Fetcher: throughput threshold retries: 5
> > fetching http://www.ebay.com/ (queue crawl delay=5000ms)
> > -finishing thread FetcherThread, activeThreads=8
> > -finishing thread FetcherThread, activeThreads=8
> > -finishing thread FetcherThread, activeThreads=2
> > -finishing thread FetcherThread, activeThreads=3
> > -finishing thread FetcherThread, activeThreads=4
> > -finishing thread FetcherThread, activeThreads=5
> > -finishing thread FetcherThread, activeThreads=6
> > -finishing thread FetcherThread, activeThreads=7
> > -finishing thread FetcherThread, activeThreads=1
> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > -finishing thread FetcherThread, activeThreads=0
> > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > -activeThreads=0
> > Fetcher: finished at 2013-08-18 22:48:56, elapsed: 00:00:05
> > ParseSegment: starting at 2013-08-18 22:48:56
> > ParseSegment: segment: crawl/segments/20130818224849
> > Parsed (15ms):http://www.ebay.com/
> > ParseSegment: finished at 2013-08-18 22:48:57, elapsed: 00:00:01
> > CrawlDb update: starting at 2013-08-18 22:48:57
> > CrawlDb update: db: crawl/crawldb
> > CrawlDb update: segments: [crawl/segments/20130818224849]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: true
> > CrawlDb update: URL filtering: true
> > CrawlDb update: 404 purging: false
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: finished at 2013-08-18 22:48:58, elapsed: 00:00:01
> > Generator: starting at 2013-08-18 22:48:58
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: true
> > Generator: normalizing: true
> > Generator: topN: 10
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: 0 records selected for fetching, exiting ...
> > Stopping at depth=1 - no more URLs to fetch.
> > LinkDb: starting at 2013-08-18 22:48:59
> > LinkDb: linkdb: crawl/linkdb
> > LinkDb: URL normalize: true
> > LinkDb: URL filter: true
> > LinkDb: internal links will be ignored.
> > LinkDb: adding segment:
> > file:/home/general/workspace/nutch/crawl/segments/20130818224849
> > LinkDb: finished at 2013-08-18 22:49:00, elapsed: 00:00:01
> > crawl finished: crawl
> >
>
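(A likely culprit worth ruling out: the stock regex-urlfilter.txt in Nutch 1.x ships with a rule along the lines of `-[?*!@=]` that rejects any URL containing query-string characters, and most eBay outlinks carry query parameters, which would explain why the Generator found 0 records at depth 2. The configured filters can be tested directly from the command line; a sketch, assuming the URLFilterChecker class available in Nutch 1.x and a hypothetical eBay search URL:)

```shell
# Run candidate outlinks through all configured URL filter plugins.
# Each input URL is echoed back prefixed with + (accepted) or - (rejected).
echo "http://www.ebay.com/sch/i.html?_nkw=laptop" | \
    bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
```

(If such URLs come back rejected, relaxing or removing the query-string rule in conf/regex-urlfilter.txt should let the crawl proceed past depth 1.)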

