Hi,

I am using Nutch 1.10 and running the crawl script on my seed for 2 rounds.
My seed is http://www.apple.com/
I checked the generate list for the 2nd iteration and it does not include all
URLs, just 11. I was expecting the second-round generate list to contain
all outlinks found on the page.
I have set db.max.outlinks.per.page to 50, and -topN defaults to 50000,
as I can see from the logs.

My understanding was that each iteration fetches the URLs due for fetch and
records their outlinks, which are then fetched in the next iteration. Since
apple.com is my seed, the 1st iteration looks fine: it shows only one URL. For
the second iteration, though, I expected the generate list to include all
outlinks of the seed page, up to the db.max.outlinks.per.page value.
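
To make that expectation concrete, here is a minimal toy model of the cycle as I understand it. It is not Nutch code: the crawl() function, the link_graph dictionary and the "fetched"/"unfetched" states are made up for illustration, and it ignores scoring, URL filters, normalization and fetch failures.

```python
# Toy model of generate -> fetch -> parse -> updatedb (illustration only, not Nutch).
MAX_OUTLINKS_PER_PAGE = 50  # mirrors my db.max.outlinks.per.page setting


def crawl(seed, link_graph, rounds, max_outlinks=MAX_OUTLINKS_PER_PAGE):
    """Return the generate list of each round for a hypothetical link graph."""
    crawldb = {seed: "unfetched"}
    generate_lists = []
    for _ in range(rounds):
        # generate: select every URL that has not been fetched yet
        batch = [url for url, state in crawldb.items() if state == "unfetched"]
        generate_lists.append(batch)
        for url in batch:
            # fetch: mark the page as done
            crawldb[url] = "fetched"
            # parse + updatedb: keep at most max_outlinks outlinks per page
            for outlink in link_graph.get(url, [])[:max_outlinks]:
                crawldb.setdefault(outlink, "unfetched")
    return generate_lists


# Hypothetical seed page with 80 outlinks
graph = {"http://www.apple.com/": [f"http://www.apple.com/p{i}/" for i in range(80)]}
lists = crawl("http://www.apple.com/", graph, rounds=2)
print(len(lists[0]), len(lists[1]))
```

Under this model the second generate list would hold 50 URLs for a seed page with 80 outlinks, which is what I was expecting to see instead of 11.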

My regex-urlfilter.txt looks like this:
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+^http(s?)://([a-z0-9]*\.)*apple.com
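
As a sanity check on these two rules, here is a small sketch that replays them in Python. It assumes the filter applies the rules top to bottom, that the first pattern found anywhere in the URL decides, and that a URL matching no rule is rejected. The accepts() helper is hypothetical, and Python's re module stands in for the Java regex engine, so treat it as an approximation.

```python
import re

# The two rules from my regex-urlfilter, in file order: (sign, pattern)
RULES = [
    ("-", re.compile(r".*(/[^/]+)/[^/]+\1/[^/]+\1/")),
    ("+", re.compile(r"^http(s?)://([a-z0-9]*\.)*apple.com")),
]


def accepts(url):
    """Return True if the first matching rule is a '+' rule (assumed semantics)."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # assumption: no matching rule means the URL is rejected


for url in [
    "http://www.apple.com/imac/",
    "http://images.apple.com/main/rss/hotnews/hotnews.rss",
    "http://www.apple.com/a/b/a/c/a/",
    "http://www.example.com/",
]:
    print(url, accepts(url))
```

If this approximation is right, the filter does not seem to be what drops the apple.com outlinks, since URLs on www.apple.com and images.apple.com pass and only the looping path and the foreign host are rejected.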

Below is the command I am using:

./crawl -i -D solr.server.url=http://localhost:8983/solr/  urls/ TestCrawl/  2


Here are the logs of the 2nd iteration:

/Users/manishverma/Manish/temp/Solr-Nutch-backup/nutchBackup-12-7-15/apache-nutch-1.10/runtime/local/bin/nutch
 clean -Dsolr.server.url=http://localhost:8983/solr/ TestCrawl//crawldb
Wed Dec 9 15:39:57 PST 2015 : Iteration 2 of 2
Generating a new segment
/Users/manishverma/Manish/temp/Solr-Nutch-backup/nutchBackup-12-7-15/apache-nutch-1.10/runtime/local/bin/nutch
 generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D 
mapred.reduce.tasks.speculative.execution=false -D 
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 
TestCrawl//crawldb TestCrawl//segments -topN 50000 -numFetchers 1 -noFilter
Generator: starting at 2015-12-09 15:39:58
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: Partitioning selected urls for politeness.
Generator: segment: TestCrawl/segments/20151209154000
Generator: finished at 2015-12-09 15:40:01, elapsed: 00:00:03
Operating on segment : 20151209154000
Fetching : 20151209154000
/Users/manishverma/Manish/temp/Solr-Nutch-backup/nutchBackup-12-7-15/apache-nutch-1.10/runtime/local/bin/nutch
 fetch -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D 
mapred.reduce.tasks.speculative.execution=false -D 
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 
-D fetcher.timelimit.mins=180 TestCrawl//segments/20151209154000 -noParsing 
-threads 50
Fetcher: starting at 2015-12-09 15:40:01
Fetcher: segment: TestCrawl/segments/20151209154000
Fetcher Timelimit set for : 1449715201950
Using queue mode : byHost
Fetcher: threads: 50
Fetcher: time-out divisor: 2
QueueFeeder finished: total 12 records + hit by time limit :0
Using queue mode : byHost
fetching http://www.apple.com/retail/code/ (queue crawl delay=1000ms)
Using queue mode : byHost
fetching http://www.apple.com/gifts/ (queue crawl delay=1000ms)
Using queue mode : byHost
fetching http://www.apple.com/gifts/for-music-lovers/ (queue crawl delay=1000ms)
Using queue mode : byHost
fetching http://www.apple.com/ipad-pro/ (queue crawl delay=1000ms)
Using queue mode : byHost
fetching http://www.apple.com/imac/ (queue crawl delay=1000ms)
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.apple.com/wss/fonts/?family=Myriad+Set+Pro&v=1 (queue crawl 
delay=1000ms)
fetching http://www.apple.com/ipad-mini-4/ (queue crawl delay=1000ms)
Using queue mode : byHost
fetching http://www.apple.com/tv/ (queue crawl delay=1000ms)
Using queue mode : byHost
fetching http://www.apple.com/iphone-6s/ (queue crawl delay=1000ms)
Using queue mode : byHost
fetching http://www.apple.com/watch/ (queue crawl delay=1000ms)
Using queue mode : byHost
fetching http://www.apple.com/sitemap/ (queue crawl delay=1000ms)
Using queue mode : byHost
fetching http://images.apple.com/main/rss/hotnews/hotnews.rss (queue crawl 
delay=1000ms)
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=12
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
Thread FetcherThread has no more work available
fetcher.maxNum.threads can't be < than 50 : using 50 instead
-finishing thread FetcherThread, activeThreads=12
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=11
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=10
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=9
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=8
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=7
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=6
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=5
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=4
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=3
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, 
fetchQueues.getQueueCount=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, 
fetchQueues.getQueueCount=1
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, 
fetchQueues.getQueueCount=0
-activeThreads=0
Fetcher: finished at 2015-12-09 15:40:06, elapsed: 00:00:04
Parsing : 20151209154000
/Users/manishverma/Manish/temp/Solr-Nutch-backup/nutchBackup-12-7-15/apache-nutch-1.10/runtime/local/bin/nutch
 parse -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D 
mapred.reduce.tasks.speculative.execution=false -D 
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 
-D mapred.skip.attempts.to.start.skipping=2 -D 
mapred.skip.map.max.skip.records=1 TestCrawl//segments/20151209154000
ParseSegment: starting at 2015-12-09 15:40:07
ParseSegment: segment: TestCrawl/segments/20151209154000
Parsed (5ms):http://images.apple.com/main/rss/hotnews/hotnews.rss
Parsed (3ms):http://www.apple.com/gifts/
Parsed (2ms):http://www.apple.com/gifts/for-music-lovers/
Parsed (2ms):http://www.apple.com/imac/
Parsed (2ms):http://www.apple.com/ipad-mini-4/
Parsed (3ms):http://www.apple.com/ipad-pro/
Parsed (3ms):http://www.apple.com/iphone-6s/
Parsed (2ms):http://www.apple.com/retail/code/
Parsed (5ms):http://www.apple.com/sitemap/
Parsed (3ms):http://www.apple.com/tv/
Parsed (3ms):http://www.apple.com/watch/
ParseSegment: finished at 2015-12-09 15:40:09, elapsed: 00:00:02
CrawlDB update
/Users/manishverma/Manish/temp/Solr-Nutch-backup/nutchBackup-12-7-15/apache-nutch-1.10/runtime/local/bin/nutch
 updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D 
mapred.reduce.tasks.speculative.execution=false -D 
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 
TestCrawl//crawldb TestCrawl//segments/20151209154000
CrawlDb update: starting at 2015-12-09 15:40:10
CrawlDb update: db: TestCrawl/crawldb
CrawlDb update: segments: [TestCrawl/segments/20151209154000]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2015-12-09 15:40:11, elapsed: 00:00:01
Link inversion
/Users/manishverma/Manish/temp/Solr-Nutch-backup/nutchBackup-12-7-15/apache-nutch-1.10/runtime/local/bin/nutch
 invertlinks TestCrawl//linkdb TestCrawl//segments/20151209154000
LinkDb: starting at 2015-12-09 15:40:12
LinkDb: linkdb: TestCrawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: TestCrawl/segments/20151209154000
LinkDb: merging with existing linkdb: TestCrawl/linkdb
LinkDb: finished at 2015-12-09 15:40:14, elapsed: 00:00:02
Dedup on crawldb
/Users/manishverma/Manish/temp/Solr-Nutch-backup/nutchBackup-12-7-15/apache-nutch-1.10/runtime/local/bin/nutch
 dedup TestCrawl//crawldb
Indexing 20151209154000 to index
/Users/manishverma/Manish/temp/Solr-Nutch-backup/nutchBackup-12-7-15/apache-nutch-1.10/runtime/local/bin/nutch
 index -Dsolr.server.url=http://localhost:8983/solr/ TestCrawl//crawldb -linkdb 
TestCrawl//linkdb TestCrawl//segments/20151209154000
Indexer: starting at 2015-12-09 15:40:18
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
        solr.server.url : URL of the SOLR instance (mandatory)
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default 
solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : username for authentication
        solr.auth.password : password for authentication


*****************************************************
URL: http://images.apple.com/main/rss/hotnews/hotnews.rss
*****************************************************
URL: http://www.apple.com/gifts/
*****************************************************
URL: http://www.apple.com/gifts/for-music-lovers/
*****************************************************
URL: http://www.apple.com/imac/
*****************************************************
URL: http://www.apple.com/ipad-mini-4/
*****************************************************
URL: http://www.apple.com/ipad-pro/
*****************************************************
URL: http://www.apple.com/iphone-6s/
*****************************************************
URL: http://www.apple.com/retail/code/
*****************************************************
URL: http://www.apple.com/sitemap/
*****************************************************
URL: http://www.apple.com/tv/
*****************************************************
URL: http://www.apple.com/watch/
Indexer: finished at 2015-12-09 15:40:20, elapsed: 00:00:01
Cleaning up index if possible
/Users/manishverma/Manish/temp/Solr-Nutch-backup/nutchBackup-12-7-15/apache-nutch-1.10/runtime/local/bin/nutch
 clean -Dsolr.server.url=http://localhost:8983/solr/ TestCrawl//crawldb



Please suggest.

Thanks
Manish Verma
AML Search
+1 669 224 9924
