Hello All, I am crawling the website www.xyz.com, with its home page provided as the seed URL. However, I see the following error in the log, and Nutch does not go beyond fetching the home page (www.xyz.com). Please advise whether the robots.txt for this particular website is what stops the crawl from going any further.
The robots.txt error that I see is:

2013-11-24 01:33:48,118 WARN robots.SimpleRobotRulesParser - Unknown line in robots.txt file (size 597): noindex: *natuzzi*
2013-11-24 01:33:48,118 WARN robots.SimpleRobotRulesParser - Unknown line in robots.txt file (size 597): noindex: *natuzzi*
2013-11-24 01:33:48,120 WARN robots.SimpleRobotRulesParser - Unknown line in robots.txt file (size 597): noindex: *natuzzi*
2013-11-24 01:33:48,121 WARN robots.SimpleRobotRulesParser - Unknown line in robots.txt file (size 597): noindex: *natuzzi*

The full log is as follows:

2013-11-24 01:33:40,467 WARN crawl.Crawl - solrUrl is not set, indexing will be skipped...
2013-11-24 01:33:40,631 INFO crawl.Crawl - crawl started in: crawl
2013-11-24 01:33:40,631 INFO crawl.Crawl - rootUrlDir = urls
2013-11-24 01:33:40,631 INFO crawl.Crawl - threads = 30
2013-11-24 01:33:40,631 INFO crawl.Crawl - depth = 1000
2013-11-24 01:33:40,631 INFO crawl.Crawl - solrUrl=null
2013-11-24 01:33:40,631 INFO crawl.Crawl - topN = 100000000
2013-11-24 01:33:40,791 INFO crawl.Injector - Injector: starting at 2013-11-24 01:33:40
2013-11-24 01:33:40,791 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb
2013-11-24 01:33:40,791 INFO crawl.Injector - Injector: urlDir: urls
2013-11-24 01:33:40,853 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2013-11-24 01:33:41,138 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-11-24 01:33:41,154 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-11-24 01:33:41,201 WARN snappy.LoadSnappy - Snappy native library not loaded
2013-11-24 01:33:42,046 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2013-11-24 01:33:42,498 INFO crawl.Injector - Injector: total number of urls rejected by filters: 0
2013-11-24 01:33:42,498 INFO crawl.Injector - Injector: total number of urls injected after normalization and filtering: 1
2013-11-24 01:33:42,498 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
2013-11-24 01:33:42,534 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-11-24 01:33:43,649 INFO crawl.Injector - Injector: finished at 2013-11-24 01:33:43, elapsed: 00:00:02
2013-11-24 01:33:43,651 INFO crawl.Generator - Generator: starting at 2013-11-24 01:33:43
2013-11-24 01:33:43,651 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2013-11-24 01:33:43,651 INFO crawl.Generator - Generator: filtering: true
2013-11-24 01:33:43,651 INFO crawl.Generator - Generator: normalizing: true
2013-11-24 01:33:43,652 INFO crawl.Generator - Generator: topN: 100000000
2013-11-24 01:33:43,652 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
2013-11-24 01:33:43,740 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-11-24 01:33:43,875 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-11-24 01:33:43,876 INFO crawl.AbstractFetchSchedule - defaultInterval=5
2013-11-24 01:33:43,876 INFO crawl.AbstractFetchSchedule - maxInterval=5
2013-11-24 01:33:43,903 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
2013-11-24 01:33:43,922 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-11-24 01:33:43,922 INFO crawl.AbstractFetchSchedule - defaultInterval=5
2013-11-24 01:33:43,922 INFO crawl.AbstractFetchSchedule - maxInterval=5
2013-11-24 01:33:43,924 INFO regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
2013-11-24 01:33:44,799 INFO crawl.Generator - Generator: Partitioning selected urls for politeness.
2013-11-24 01:33:45,800 INFO crawl.Generator - Generator: segment: crawl/segments/20131124013345
2013-11-24 01:33:45,845 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-11-24 01:33:45,988 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
2013-11-24 01:33:46,896 INFO crawl.Generator - Generator: finished at 2013-11-24 01:33:46, elapsed: 00:00:03
2013-11-24 01:33:46,896 INFO fetcher.Fetcher - Fetcher: starting at 2013-11-24 01:33:46
2013-11-24 01:33:46,896 INFO fetcher.Fetcher - Fetcher: segment: crawl/segments/20131124013345
2013-11-24 01:33:47,028 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-11-24 01:33:47,434 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,435 INFO fetcher.Fetcher - Fetcher: threads: 30
2013-11-24 01:33:47,435 INFO fetcher.Fetcher - Fetcher: time-out divisor: 2
2013-11-24 01:33:47,624 INFO fetcher.Fetcher - QueueFeeder finished: total 1 records + hit by time limit :0
2013-11-24 01:33:47,636 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,641 INFO fetcher.Fetcher - fetching http://www.xyz.com/(queue crawl delay=0ms)
2013-11-24 01:33:47,641 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,642 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-11-24 01:33:47,642 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,642 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-11-24 01:33:47,643 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,644 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-11-24 01:33:47,644 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,644 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-11-24 01:33:47,646 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,646 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,647 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-11-24 01:33:47,647 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-11-24 01:33:47,647 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,649 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-11-24 01:33:47,650 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,650 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-11-24 01:33:47,651 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,651 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-11-24 01:33:47,652 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,652 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-11-24 01:33:47,653 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,653 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-11-24 01:33:47,654 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,654 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-11-24 01:33:47,655 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,655 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-11-24 01:33:47,656 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,656 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-11-24 01:33:47,657 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,657 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-11-24 01:33:47,658 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,658 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-11-24 01:33:47,659 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,659 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-11-24 01:33:47,660 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,660 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-11-24 01:33:47,661 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,662 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-11-24 01:33:47,662 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,662 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-11-24 01:33:47,662 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,662 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-11-24 01:33:47,663 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,663 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-11-24 01:33:47,664 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,664 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-11-24 01:33:47,664 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,664 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-11-24 01:33:47,664 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,665 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-11-24 01:33:47,665 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,665 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-11-24 01:33:47,666 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,666 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-11-24 01:33:47,667 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,667 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-11-24 01:33:47,667 INFO fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,667 INFO fetcher.Fetcher - Fetcher: throughput threshold: -1
2013-11-24 01:33:47,667 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2013-11-24 01:33:47,667 INFO fetcher.Fetcher - Fetcher: throughput threshold retries: 5
2013-11-24 01:33:47,749 INFO http.Http - http.proxy.host = null
2013-11-24 01:33:47,749 INFO http.Http - http.proxy.port = 8080
2013-11-24 01:33:47,749 INFO http.Http - http.timeout = 50000
2013-11-24 01:33:47,749 INFO http.Http - http.content.limit = -1
2013-11-24 01:33:47,749 INFO http.Http - http.agent = Test-Crawler (Test-Crawler)
2013-11-24 01:33:47,749 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2013-11-24 01:33:47,749 INFO http.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2013-11-24 01:33:48,118 WARN robots.SimpleRobotRulesParser - Problem processing robots.txt for http://www.xyz.com/
2013-11-24 01:33:48,118 WARN robots.SimpleRobotRulesParser - Unknown line in robots.txt file (size 597): noindex: *natuzzi*
2013-11-24 01:33:48,118 WARN robots.SimpleRobotRulesParser - Unknown line in robots.txt file (size 597): noindex: *natuzzi*
2013-11-24 01:33:48,120 WARN robots.SimpleRobotRulesParser - Unknown line in robots.txt file (size 597): noindex: *natuzzi*
2013-11-24 01:33:48,121 WARN robots.SimpleRobotRulesParser - Unknown line in robots.txt file (size 597): noindex: *natuzzi*
2013-11-24 01:33:48,124 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
2013-11-24 01:33:48,668 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
2013-11-24 01:33:48,668 INFO fetcher.Fetcher - -activeThreads=0
2013-11-24 01:33:49,065 INFO fetcher.Fetcher - Fetcher: finished at 2013-11-24 01:33:49, elapsed: 00:00:02
2013-11-24 01:33:49,069 INFO crawl.CrawlDb - CrawlDb update: starting at 2013-11-24 01:33:49
2013-11-24 01:33:49,070 INFO crawl.CrawlDb - CrawlDb update: db: crawl/crawldb
2013-11-24 01:33:49,070 INFO crawl.CrawlDb - CrawlDb update: segments: [crawl/segments/20131124013345]
2013-11-24 01:33:49,070 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
2013-11-24 01:33:49,070 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
2013-11-24 01:33:49,070 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
2013-11-24 01:33:49,070 INFO crawl.CrawlDb - CrawlDb update: 404 purging: false
2013-11-24 01:33:49,070 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
2013-11-24 01:33:49,072 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-11-24 01:33:49,371 INFO regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
2013-11-24 01:33:49,450 INFO regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
2013-11-24 01:33:49,544 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-11-24 01:33:49,544 INFO crawl.AbstractFetchSchedule - defaultInterval=5
2013-11-24 01:33:49,545 INFO crawl.AbstractFetchSchedule - maxInterval=5
2013-11-24 01:33:50,116 INFO crawl.CrawlDb - CrawlDb update: finished at 2013-11-24 01:33:50, elapsed: 00:00:01
2013-11-24 01:33:50,119 INFO crawl.Generator - Generator: starting at 2013-11-24 01:33:50
2013-11-24 01:33:50,119 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2013-11-24 01:33:50,119 INFO crawl.Generator - Generator: filtering: true
2013-11-24 01:33:50,119 INFO crawl.Generator - Generator: normalizing: true
2013-11-24 01:33:50,119 INFO crawl.Generator - Generator: topN: 100000000
2013-11-24 01:33:50,120 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
2013-11-24 01:33:50,125 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-11-24 01:33:50,259 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-11-24 01:33:50,259 INFO crawl.AbstractFetchSchedule - defaultInterval=5
2013-11-24 01:33:50,259 INFO crawl.AbstractFetchSchedule - maxInterval=5
2013-11-24 01:33:50,279 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-11-24 01:33:50,279 INFO crawl.AbstractFetchSchedule - defaultInterval=5
2013-11-24 01:33:50,279 INFO crawl.AbstractFetchSchedule - maxInterval=5
2013-11-24 01:33:51,157 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ...
2013-11-24 01:33:51,159 INFO crawl.Crawl - Stopping at depth=1 - no more URLs to fetch.
2013-11-24 01:33:51,209 INFO crawl.LinkDb - LinkDb: starting at 2013-11-24 01:33:51
2013-11-24 01:33:51,209 INFO crawl.LinkDb - LinkDb: linkdb: crawl/linkdb
2013-11-24 01:33:51,209 INFO crawl.LinkDb - LinkDb: URL normalize: true
2013-11-24 01:33:51,209 INFO crawl.LinkDb - LinkDb: URL filter: true
2013-11-24 01:33:51,210 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/general/workspace/nutch/crawl/segments/20131124013345
2013-11-24 01:33:51,211 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-11-24 01:33:52,260 INFO crawl.LinkDb - LinkDb: finished at 2013-11-24 01:33:52, elapsed: 00:00:01
2013-11-24 01:33:52,260 INFO crawl.Crawl - crawl finished: crawl
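
In case it helps with diagnosis, here is a minimal sketch of how the robots.txt rules could be checked outside Nutch for the agent name shown in the log (Test-Crawler). It uses Python's standard urllib.robotparser, not the SimpleRobotRulesParser that Nutch uses, so the two may not behave identically. The robots.txt content, the /private/ rule, and the extra URLs are hypothetical placeholders, since the real 597-byte file is not reproduced in this post; only the noindex line is taken from the warning above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. Only the "noindex" line comes from the
# warning in the log; the real file for www.xyz.com is not shown in the post.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
noindex: *natuzzi*
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines; lines whose directive this parser
    # does not recognise (such as "noindex:") are skipped.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    agent = "Test-Crawler"  # the http.agent value from the log
    for url in (
        "http://www.xyz.com/",
        "http://www.xyz.com/products.html",   # hypothetical page
        "http://www.xyz.com/private/x.html",  # hypothetical page
    ):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, agent, url))
```

Pasting the site's actual robots.txt into SAMPLE_ROBOTS_TXT and listing a few links found on the home page would show whether any Disallow rule covers them, at least as this parser interprets the file.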

