How can I analyze the segment dump from Nutch? The segments folder seems to
contain a number of subfolders; can you please let me know which specific
folder I need to look into? Also, the index and data files are not plain
text, so I cannot make sense of them directly.
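To clarify what I mean: my understanding is that the readseg tool is the way to turn a segment into readable text (the segment name below is from my own run; the exact flags are my guess from the usage message, so please correct me if I have them wrong):

```shell
# List the segment contents, then dump only the parse data
# (which is where I believe the extracted outlinks live),
# skipping content/fetch/generate data to keep the output small.
bin/nutch readseg -list crawl/segments/20130818224849
bin/nutch readseg -dump crawl/segments/20130818224849 segdump \
    -nocontent -nofetch -nogenerate -noparsetext
less segdump/dump
```

Is that the right approach, and is parse_data the folder I should be looking at for outlinks?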

I am using the default regex filter that ships with Nutch 1.7; I have not
changed it.
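In case it matters, my understanding is that the filters can be tested with the URLFilterChecker class, which (I believe) reads URLs from stdin and prints '+' or '-' for each one; the sample URL below is just an illustrative outlink, not one from my dump:

```shell
# Check whether the combined URL filters accept a sample URL.
# (My assumption: '-allCombined' runs every active filter plugin.)
echo "http://www.ebay.com/sch/ebayadvsearch" | \
    bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
```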

Thank You.


On Mon, Aug 19, 2013 at 4:07 AM, Tejas Patil <[email protected]> wrote:

> The logs say:
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=1 - no more URLs to fetch.
>
> Please get the segment dump and analyse it for the outlinks extracted. Also
> check your filters.
>
>
> On Sun, Aug 18, 2013 at 8:02 PM, S.L <[email protected]> wrote:
>
> > Hello All,
> >
> > I am running Nutch 1.7 in Eclipse, and I start the Crawl job with the
> > following settings.
> >
> > Main Class :org.apache.nutch.crawl.Crawl
> > Arguments : urls -dir crawl -depth 10 -topN 10
> >
> > In the "urls" directory I have only one URL, http://www.ebay.com, and I
> > expect the whole website to be crawled; however, I get the following log
> > output, and the crawl seems to stop after a few URLs are fetched.
> >
> > I use nutch-default.xml and have already set http.content.limit to -1
> > in it, as mentioned in another message on this mailing list. However,
> > the crawl still stops after a few URLs are fetched; please see the log
> > below and advise.
> >
> > I am running Eclipse on CentOS 6.4 with Nutch 1.7.
> >
> > SLF4J: Class path contains multiple SLF4J bindings.
> > SLF4J: Found binding in
> >
> >
> [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.4/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> > SLF4J: Found binding in
> >
> >
> [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> > explanation.
> > solrUrl is not set, indexing will be skipped...
> > crawl started in: crawl
> > rootUrlDir = urls
> > threads = 10
> > depth = 10
> > solrUrl=null
> > topN = 10
> > Injector: starting at 2013-08-18 22:48:45
> > Injector: crawlDb: crawl/crawldb
> > Injector: urlDir: urls
> > Injector: Converting injected urls to crawl db entries.
> > Injector: total number of urls rejected by filters: 1
> > Injector: total number of urls injected after normalization and
> filtering:
> > 1
> > Injector: Merging injected urls into crawl db.
> > Injector: finished at 2013-08-18 22:48:47, elapsed: 00:00:02
> > Generator: starting at 2013-08-18 22:48:47
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: true
> > Generator: normalizing: true
> > Generator: topN: 10
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: Partitioning selected urls for politeness.
> > Generator: segment: crawl/segments/20130818224849
> > Generator: finished at 2013-08-18 22:48:51, elapsed: 00:00:03
> > Fetcher: starting at 2013-08-18 22:48:51
> > Fetcher: segment: crawl/segments/20130818224849
> > Using queue mode : byHost
> > Fetcher: threads: 10
> > Fetcher: time-out divisor: 2
> > QueueFeeder finished: total 1 records + hit by time limit :0
> > Using queue mode : byHost
> > Using queue mode : byHost
> > Using queue mode : byHost
> > Using queue mode : byHost
> > Using queue mode : byHost
> > Using queue mode : byHost
> > Using queue mode : byHost
> > Using queue mode : byHost
> > Using queue mode : byHost
> > Using queue mode : byHost
> > Fetcher: throughput threshold: -1
> > Fetcher: throughput threshold retries: 5
> > fetching http://www.ebay.com/ (queue crawl delay=5000ms)
> > -finishing thread FetcherThread, activeThreads=8
> > -finishing thread FetcherThread, activeThreads=8
> > -finishing thread FetcherThread, activeThreads=2
> > -finishing thread FetcherThread, activeThreads=3
> > -finishing thread FetcherThread, activeThreads=4
> > -finishing thread FetcherThread, activeThreads=5
> > -finishing thread FetcherThread, activeThreads=6
> > -finishing thread FetcherThread, activeThreads=7
> > -finishing thread FetcherThread, activeThreads=1
> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > -finishing thread FetcherThread, activeThreads=0
> > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > -activeThreads=0
> > Fetcher: finished at 2013-08-18 22:48:56, elapsed: 00:00:05
> > ParseSegment: starting at 2013-08-18 22:48:56
> > ParseSegment: segment: crawl/segments/20130818224849
> > Parsed (15ms):http://www.ebay.com/
> > ParseSegment: finished at 2013-08-18 22:48:57, elapsed: 00:00:01
> > CrawlDb update: starting at 2013-08-18 22:48:57
> > CrawlDb update: db: crawl/crawldb
> > CrawlDb update: segments: [crawl/segments/20130818224849]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: true
> > CrawlDb update: URL filtering: true
> > CrawlDb update: 404 purging: false
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: finished at 2013-08-18 22:48:58, elapsed: 00:00:01
> > Generator: starting at 2013-08-18 22:48:58
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: true
> > Generator: normalizing: true
> > Generator: topN: 10
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: 0 records selected for fetching, exiting ...
> > Stopping at depth=1 - no more URLs to fetch.
> > LinkDb: starting at 2013-08-18 22:48:59
> > LinkDb: linkdb: crawl/linkdb
> > LinkDb: URL normalize: true
> > LinkDb: URL filter: true
> > LinkDb: internal links will be ignored.
> > LinkDb: adding segment:
> > file:/home/general/workspace/nutch/crawl/segments/20130818224849
> > LinkDb: finished at 2013-08-18 22:49:00, elapsed: 00:00:01
> > crawl finished: crawl
> >
>
