Hi,

I am using Nutch 1.10 with the crawl script to crawl my seed for 2 rounds. My seed is http://www.apple.com/. I checked the generate list for the 2nd iteration and it does not include all URLs, only 11. I was expecting the second-round generate list to contain all outlinks on the page. I have set db.max.outlinks.per.page to 50, and -topN defaults to 50000, as I can see from the logs.
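To make my expectation concrete, here is a minimal Python sketch of how I imagine the second-round list being built from one page's outlinks. It is only a model of my understanding, not Nutch's actual code path: the function name, the `accept` callback, and the sample links are all hypothetical.

```python
from typing import Callable, Iterable, List


def expected_generate_list(
    outlinks: Iterable[str],
    accept: Callable[[str], bool],
    max_outlinks_per_page: int,
    top_n: int,
) -> List[str]:
    """Hypothetical model of the expectation, not Nutch's real code path."""
    kept: List[str] = []
    seen = set()
    for url in outlinks:
        if len(kept) >= max_outlinks_per_page:
            break  # the per-page cap is an upper bound, not a target
        if url in seen or not accept(url):
            continue  # in this model, duplicates and rejected URLs are skipped
        seen.add(url)
        kept.append(url)
    return kept[:top_n]  # topN can only shrink the list further


# Made-up sample: two distinct on-site links (each repeated) and one off-site link.
sample_links = [
    "http://www.apple.com/imac/",
    "http://www.apple.com/tv/",
    "http://www.apple.com/imac/",
    "http://example.org/",
    "http://www.apple.com/tv/",
]
on_site = lambda url: url.startswith("http://www.apple.com/")
print(expected_generate_list(sample_links, on_site, 50, 50000))
```

With the sample input this keeps only the two distinct on-site links, even though the cap is 50. So under this model a short list would simply mean that few distinct, accepted outlinks were recorded for the page, which is what I am trying to confirm or rule out.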
My understanding was that each iteration fetches the URLs in its generate list and records their outlinks, which are then fetched in the next iteration. Since apple.com is my seed, the 1st iteration looks fine: it shows only one URL. For the second iteration, though, I was expecting the list to include all outlinks of that page, up to the db.max.outlinks.per.page value.

My regex-urlfilter looks like this:

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+^http(s?)://([a-z0-9]*\.)*apple.com

Below is the command I am using:

./crawl -i -D solr.server.url=http://localhost:8983/solr/ urls/ TestCrawl/ 2

Here are the logs of the 2nd iteration:

/Users/manishverma/Manish/temp/Solr-Nutch-backup/nutchBackup-12-7-15/apache-nutch-1.10/runtime/local/bin/nutch clean -Dsolr.server.url=http://localhost:8983/solr/ TestCrawl//crawldb
Wed Dec 9 15:39:57 PST 2015 : Iteration 2 of 2
Generating a new segment
/Users/manishverma/Manish/temp/Solr-Nutch-backup/nutchBackup-12-7-15/apache-nutch-1.10/runtime/local/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true TestCrawl//crawldb TestCrawl//segments -topN 50000 -numFetchers 1 -noFilter
Generator: starting at 2015-12-09 15:39:58
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: Partitioning selected urls for politeness.
Generator: segment: TestCrawl/segments/20151209154000
Generator: finished at 2015-12-09 15:40:01, elapsed: 00:00:03
Operating on segment : 20151209154000
Fetching : 20151209154000
/Users/manishverma/Manish/temp/Solr-Nutch-backup/nutchBackup-12-7-15/apache-nutch-1.10/runtime/local/bin/nutch fetch -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D fetcher.timelimit.mins=180 TestCrawl//segments/20151209154000 -noParsing -threads 50
Fetcher: starting at 2015-12-09 15:40:01
Fetcher: segment: TestCrawl/segments/20151209154000
Fetcher Timelimit set for : 1449715201950
Using queue mode : byHost
Fetcher: threads: 50
Fetcher: time-out divisor: 2
QueueFeeder finished: total 12 records + hit by time limit :0
Using queue mode : byHost
fetching http://www.apple.com/retail/code/ (queue crawl delay=1000ms)
Using queue mode : byHost
fetching http://www.apple.com/gifts/ (queue crawl delay=1000ms)
Using queue mode : byHost
fetching http://www.apple.com/gifts/for-music-lovers/ (queue crawl delay=1000ms)
Using queue mode : byHost
fetching http://www.apple.com/ipad-pro/ (queue crawl delay=1000ms)
Using queue mode : byHost
fetching http://www.apple.com/imac/ (queue crawl delay=1000ms)
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.apple.com/wss/fonts/?family=Myriad+Set+Pro&v=1 (queue crawl delay=1000ms)
fetching http://www.apple.com/ipad-mini-4/ (queue crawl delay=1000ms)
Using queue mode : byHost
fetching http://www.apple.com/tv/ (queue crawl delay=1000ms)
Using queue mode : byHost
fetching http://www.apple.com/iphone-6s/ (queue crawl delay=1000ms)
Using queue mode : byHost
fetching http://www.apple.com/watch/ (queue crawl delay=1000ms)
Using queue mode : byHost
fetching http://www.apple.com/sitemap/ (queue crawl delay=1000ms)
Using queue mode : byHost
fetching http://images.apple.com/main/rss/hotnews/hotnews.rss (queue crawl delay=1000ms)
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
Thread FetcherThread has no more work available
fetcher.maxNum.threads can't be < than 50 : using 50 instead
-finishing thread FetcherThread, activeThreads=12
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=11
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=10
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=9
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=8
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=7
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=6
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=5
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=4
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=3
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0
-activeThreads=0
Fetcher: finished at 2015-12-09 15:40:06, elapsed: 00:00:04
Parsing : 20151209154000
/Users/manishverma/Manish/temp/Solr-Nutch-backup/nutchBackup-12-7-15/apache-nutch-1.10/runtime/local/bin/nutch parse -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1 TestCrawl//segments/20151209154000
ParseSegment: starting at 2015-12-09 15:40:07
ParseSegment: segment: TestCrawl/segments/20151209154000
Parsed (5ms):http://images.apple.com/main/rss/hotnews/hotnews.rss
Parsed (3ms):http://www.apple.com/gifts/
Parsed (2ms):http://www.apple.com/gifts/for-music-lovers/
Parsed (2ms):http://www.apple.com/imac/
Parsed (2ms):http://www.apple.com/ipad-mini-4/
Parsed (3ms):http://www.apple.com/ipad-pro/
Parsed (3ms):http://www.apple.com/iphone-6s/
Parsed (2ms):http://www.apple.com/retail/code/
Parsed (5ms):http://www.apple.com/sitemap/
Parsed (3ms):http://www.apple.com/tv/
Parsed (3ms):http://www.apple.com/watch/
ParseSegment: finished at 2015-12-09 15:40:09, elapsed: 00:00:02
CrawlDB update
/Users/manishverma/Manish/temp/Solr-Nutch-backup/nutchBackup-12-7-15/apache-nutch-1.10/runtime/local/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true TestCrawl//crawldb TestCrawl//segments/20151209154000
CrawlDb update: starting at 2015-12-09 15:40:10
CrawlDb update: db: TestCrawl/crawldb
CrawlDb update: segments: [TestCrawl/segments/20151209154000]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2015-12-09 15:40:11, elapsed: 00:00:01
Link inversion
/Users/manishverma/Manish/temp/Solr-Nutch-backup/nutchBackup-12-7-15/apache-nutch-1.10/runtime/local/bin/nutch invertlinks TestCrawl//linkdb TestCrawl//segments/20151209154000
LinkDb: starting at 2015-12-09 15:40:12
LinkDb: linkdb: TestCrawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: TestCrawl/segments/20151209154000
LinkDb: merging with existing linkdb: TestCrawl/linkdb
LinkDb: finished at 2015-12-09 15:40:14, elapsed: 00:00:02
Dedup on crawldb
/Users/manishverma/Manish/temp/Solr-Nutch-backup/nutchBackup-12-7-15/apache-nutch-1.10/runtime/local/bin/nutch dedup TestCrawl//crawldb
Indexing 20151209154000 to index
/Users/manishverma/Manish/temp/Solr-Nutch-backup/nutchBackup-12-7-15/apache-nutch-1.10/runtime/local/bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/ TestCrawl//crawldb -linkdb TestCrawl//linkdb TestCrawl//segments/20151209154000
Indexer: starting at 2015-12-09 15:40:18
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance (mandatory)
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication
*****************************************************
URL: http://images.apple.com/main/rss/hotnews/hotnews.rss
*****************************************************
URL: http://www.apple.com/gifts/
*****************************************************
URL: http://www.apple.com/gifts/for-music-lovers/
*****************************************************
URL: http://www.apple.com/imac/
*****************************************************
URL: http://www.apple.com/ipad-mini-4/
*****************************************************
URL: http://www.apple.com/ipad-pro/
*****************************************************
URL: http://www.apple.com/iphone-6s/
*****************************************************
URL: http://www.apple.com/retail/code/
*****************************************************
URL: http://www.apple.com/sitemap/
*****************************************************
URL: http://www.apple.com/tv/
*****************************************************
URL: http://www.apple.com/watch/
Indexer: finished at 2015-12-09 15:40:20, elapsed: 00:00:01
Cleaning up index if possible
/Users/manishverma/Manish/temp/Solr-Nutch-backup/nutchBackup-12-7-15/apache-nutch-1.10/runtime/local/bin/nutch clean -Dsolr.server.url=http://localhost:8983/solr/ TestCrawl//crawldb

Please suggest.

Thanks
Manish Verma
AML Search
+1 669 224 9924
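As a sanity check on the regex-urlfilter rules quoted above, here is a small Python sketch that replays them against the 12 URLs from the fetch log. It assumes first-match-wins semantics with rejection as the default when no rule matches, which is how I understand the Nutch regex URL filter to behave. It uses Python's `re` module rather than Java's regex engine, which should be equivalent for these two patterns. The helper names `parse_rules` and `accepts` are my own, not Nutch APIs.

```python
import re
from typing import List, Tuple

# The two rules exactly as quoted from regex-urlfilter in this message.
RULES_TEXT = r"""
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+^http(s?)://([a-z0-9]*\.)*apple.com
"""

# The 12 URLs reported as "fetching ..." in the 2nd-iteration log.
FETCHED_URLS = [
    "http://www.apple.com/retail/code/",
    "http://www.apple.com/gifts/",
    "http://www.apple.com/gifts/for-music-lovers/",
    "http://www.apple.com/ipad-pro/",
    "http://www.apple.com/imac/",
    "http://www.apple.com/wss/fonts/?family=Myriad+Set+Pro&v=1",
    "http://www.apple.com/ipad-mini-4/",
    "http://www.apple.com/tv/",
    "http://www.apple.com/iphone-6s/",
    "http://www.apple.com/watch/",
    "http://www.apple.com/sitemap/",
    "http://images.apple.com/main/rss/hotnews/hotnews.rss",
]


def parse_rules(text: str) -> List[Tuple[bool, re.Pattern]]:
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url: str, rules: List[Tuple[bool, re.Pattern]]) -> bool:
    """Assumed semantics: the first rule found in the URL decides; else reject."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


rules = parse_rules(RULES_TEXT)
for url in FETCHED_URLS:
    print("+" if accepts(url, rules) else "-", url)
```

Under these assumptions every one of the 12 fetched URLs is accepted, so the two rules shown do not appear to be what limits the list. Any apple.com outlink that is missing from the segment would have to be dropped somewhere else.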

