Hello All,

I am crawling the website www.xyz.com, with its home page provided as the
seed URL. However, I see the following warnings in the log, and Nutch does
not go beyond fetching the home page (www.xyz.com). Please advise whether
the robots.txt for this particular website is what stops the crawl from
going any further.

The robots.txt warning that I see (logged at WARN level) is:

2013-11-24 01:33:48,118 WARN  robots.SimpleRobotRulesParser -     Unknown
line in robots.txt file (size 597): noindex: *natuzzi*
2013-11-24 01:33:48,118 WARN  robots.SimpleRobotRulesParser -     Unknown
line in robots.txt file (size 597): noindex: *natuzzi*
2013-11-24 01:33:48,120 WARN  robots.SimpleRobotRulesParser -     Unknown
line in robots.txt file (size 597): noindex: *natuzzi*
2013-11-24 01:33:48,121 WARN  robots.SimpleRobotRulesParser -     Unknown
line in robots.txt file (size 597): noindex: *natuzzi*
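
As a cross-check outside Nutch, a minimal sketch like the one below could
show whether a robots.txt containing such a line would block the home page.
It uses Python's standard-library parser, not Nutch's
SimpleRobotRulesParser, so it only approximates what Nutch does. The sample
robots.txt content and the /private/ path are hypothetical; only the
"noindex" line and the agent name Test-Crawler are taken from the log.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: only the "noindex" line comes from the log;
# the User-agent and Disallow lines are made up for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
noindex: *natuzzi*
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines; directives it does not
    # recognize (such as "noindex") are skipped.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ("http://www.xyz.com/", "http://www.xyz.com/private/page.html"):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, "Test-Crawler", url))
```

To test the real site, the sample text would be replaced with the actual
contents of http://www.xyz.com/robots.txt.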


The full log is as follows:

2013-11-24 01:33:40,467 WARN  crawl.Crawl - solrUrl is not set, indexing
will be skipped...
2013-11-24 01:33:40,631 INFO  crawl.Crawl - crawl started in: crawl
2013-11-24 01:33:40,631 INFO  crawl.Crawl - rootUrlDir = urls
2013-11-24 01:33:40,631 INFO  crawl.Crawl - threads = 30
2013-11-24 01:33:40,631 INFO  crawl.Crawl - depth = 1000
2013-11-24 01:33:40,631 INFO  crawl.Crawl - solrUrl=null
2013-11-24 01:33:40,631 INFO  crawl.Crawl - topN = 100000000
2013-11-24 01:33:40,791 INFO  crawl.Injector - Injector: starting at
2013-11-24 01:33:40
2013-11-24 01:33:40,791 INFO  crawl.Injector - Injector: crawlDb:
crawl/crawldb
2013-11-24 01:33:40,791 INFO  crawl.Injector - Injector: urlDir: urls
2013-11-24 01:33:40,853 INFO  crawl.Injector - Injector: Converting
injected urls to crawl db entries.
2013-11-24 01:33:41,138 WARN  util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2013-11-24 01:33:41,154 WARN  mapred.JobClient - No job jar file set.  User
classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-11-24 01:33:41,201 WARN  snappy.LoadSnappy - Snappy native library not
loaded
2013-11-24 01:33:42,046 INFO  regex.RegexURLNormalizer - can't find rules
for scope 'inject', using default
2013-11-24 01:33:42,498 INFO  crawl.Injector - Injector: total number of
urls rejected by filters: 0
2013-11-24 01:33:42,498 INFO  crawl.Injector - Injector: total number of
urls injected after normalization and filtering: 1
2013-11-24 01:33:42,498 INFO  crawl.Injector - Injector: Merging injected
urls into crawl db.
2013-11-24 01:33:42,534 WARN  mapred.JobClient - No job jar file set.  User
classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-11-24 01:33:43,649 INFO  crawl.Injector - Injector: finished at
2013-11-24 01:33:43, elapsed: 00:00:02
2013-11-24 01:33:43,651 INFO  crawl.Generator - Generator: starting at
2013-11-24 01:33:43
2013-11-24 01:33:43,651 INFO  crawl.Generator - Generator: Selecting
best-scoring urls due for fetch.
2013-11-24 01:33:43,651 INFO  crawl.Generator - Generator: filtering: true
2013-11-24 01:33:43,651 INFO  crawl.Generator - Generator: normalizing: true
2013-11-24 01:33:43,652 INFO  crawl.Generator - Generator: topN: 100000000
2013-11-24 01:33:43,652 INFO  crawl.Generator - Generator: jobtracker is
'local', generating exactly one partition.
2013-11-24 01:33:43,740 WARN  mapred.JobClient - No job jar file set.  User
classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-11-24 01:33:43,875 INFO  crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-11-24 01:33:43,876 INFO  crawl.AbstractFetchSchedule -
defaultInterval=5
2013-11-24 01:33:43,876 INFO  crawl.AbstractFetchSchedule - maxInterval=5
2013-11-24 01:33:43,903 INFO  regex.RegexURLNormalizer - can't find rules
for scope 'partition', using default
2013-11-24 01:33:43,922 INFO  crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-11-24 01:33:43,922 INFO  crawl.AbstractFetchSchedule -
defaultInterval=5
2013-11-24 01:33:43,922 INFO  crawl.AbstractFetchSchedule - maxInterval=5
2013-11-24 01:33:43,924 INFO  regex.RegexURLNormalizer - can't find rules
for scope 'generate_host_count', using default
2013-11-24 01:33:44,799 INFO  crawl.Generator - Generator: Partitioning
selected urls for politeness.
2013-11-24 01:33:45,800 INFO  crawl.Generator - Generator: segment:
crawl/segments/20131124013345
2013-11-24 01:33:45,845 WARN  mapred.JobClient - No job jar file set.  User
classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-11-24 01:33:45,988 INFO  regex.RegexURLNormalizer - can't find rules
for scope 'partition', using default
2013-11-24 01:33:46,896 INFO  crawl.Generator - Generator: finished at
2013-11-24 01:33:46, elapsed: 00:00:03
2013-11-24 01:33:46,896 INFO  fetcher.Fetcher - Fetcher: starting at
2013-11-24 01:33:46
2013-11-24 01:33:46,896 INFO  fetcher.Fetcher - Fetcher: segment:
crawl/segments/20131124013345
2013-11-24 01:33:47,028 WARN  mapred.JobClient - No job jar file set.  User
classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-11-24 01:33:47,434 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,435 INFO  fetcher.Fetcher - Fetcher: threads: 30
2013-11-24 01:33:47,435 INFO  fetcher.Fetcher - Fetcher: time-out divisor: 2
2013-11-24 01:33:47,624 INFO  fetcher.Fetcher - QueueFeeder finished: total
1 records + hit by time limit :0
2013-11-24 01:33:47,636 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,641 INFO  fetcher.Fetcher - fetching
http://www.xyz.com/(queue crawl delay=0ms)
2013-11-24 01:33:47,641 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,642 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-11-24 01:33:47,642 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,642 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-11-24 01:33:47,643 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,644 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-11-24 01:33:47,644 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,644 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-11-24 01:33:47,646 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,646 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,647 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-11-24 01:33:47,647 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-11-24 01:33:47,647 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,649 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-11-24 01:33:47,650 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,650 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-11-24 01:33:47,651 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,651 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-11-24 01:33:47,652 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,652 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-11-24 01:33:47,653 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,653 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-11-24 01:33:47,654 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,654 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-11-24 01:33:47,655 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,655 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-11-24 01:33:47,656 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,656 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-11-24 01:33:47,657 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,657 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-11-24 01:33:47,658 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,658 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-11-24 01:33:47,659 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,659 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-11-24 01:33:47,660 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,660 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-11-24 01:33:47,661 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,662 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-11-24 01:33:47,662 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,662 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-11-24 01:33:47,662 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,662 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-11-24 01:33:47,663 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,663 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-11-24 01:33:47,664 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,664 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-11-24 01:33:47,664 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,664 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-11-24 01:33:47,664 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,665 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-11-24 01:33:47,665 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,665 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-11-24 01:33:47,666 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,666 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-11-24 01:33:47,667 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,667 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-11-24 01:33:47,667 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-11-24 01:33:47,667 INFO  fetcher.Fetcher - Fetcher: throughput
threshold: -1
2013-11-24 01:33:47,667 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-11-24 01:33:47,667 INFO  fetcher.Fetcher - Fetcher: throughput
threshold retries: 5
2013-11-24 01:33:47,749 INFO  http.Http - http.proxy.host = null
2013-11-24 01:33:47,749 INFO  http.Http - http.proxy.port = 8080
2013-11-24 01:33:47,749 INFO  http.Http - http.timeout = 50000
2013-11-24 01:33:47,749 INFO  http.Http - http.content.limit = -1
2013-11-24 01:33:47,749 INFO  http.Http - http.agent = Test-Crawler
(Test-Crawler)
2013-11-24 01:33:47,749 INFO  http.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2013-11-24 01:33:47,749 INFO  http.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2013-11-24 01:33:48,118 WARN  robots.SimpleRobotRulesParser - Problem
processing robots.txt for http://www.xyz.com/
2013-11-24 01:33:48,118 WARN  robots.SimpleRobotRulesParser -     Unknown
line in robots.txt file (size 597): noindex: *natuzzi*
2013-11-24 01:33:48,118 WARN  robots.SimpleRobotRulesParser -     Unknown
line in robots.txt file (size 597): noindex: *natuzzi*
2013-11-24 01:33:48,120 WARN  robots.SimpleRobotRulesParser -     Unknown
line in robots.txt file (size 597): noindex: *natuzzi*
2013-11-24 01:33:48,121 WARN  robots.SimpleRobotRulesParser -     Unknown
line in robots.txt file (size 597): noindex: *natuzzi*
2013-11-24 01:33:48,124 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=0
2013-11-24 01:33:48,668 INFO  fetcher.Fetcher - -activeThreads=0,
spinWaiting=0, fetchQueues.totalSize=0
2013-11-24 01:33:48,668 INFO  fetcher.Fetcher - -activeThreads=0
2013-11-24 01:33:49,065 INFO  fetcher.Fetcher - Fetcher: finished at
2013-11-24 01:33:49, elapsed: 00:00:02
2013-11-24 01:33:49,069 INFO  crawl.CrawlDb - CrawlDb update: starting at
2013-11-24 01:33:49
2013-11-24 01:33:49,070 INFO  crawl.CrawlDb - CrawlDb update: db:
crawl/crawldb
2013-11-24 01:33:49,070 INFO  crawl.CrawlDb - CrawlDb update: segments:
[crawl/segments/20131124013345]
2013-11-24 01:33:49,070 INFO  crawl.CrawlDb - CrawlDb update: additions
allowed: true
2013-11-24 01:33:49,070 INFO  crawl.CrawlDb - CrawlDb update: URL
normalizing: true
2013-11-24 01:33:49,070 INFO  crawl.CrawlDb - CrawlDb update: URL
filtering: true
2013-11-24 01:33:49,070 INFO  crawl.CrawlDb - CrawlDb update: 404 purging:
false
2013-11-24 01:33:49,070 INFO  crawl.CrawlDb - CrawlDb update: Merging
segment data into db.
2013-11-24 01:33:49,072 WARN  mapred.JobClient - No job jar file set.  User
classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-11-24 01:33:49,371 INFO  regex.RegexURLNormalizer - can't find rules
for scope 'crawldb', using default
2013-11-24 01:33:49,450 INFO  regex.RegexURLNormalizer - can't find rules
for scope 'crawldb', using default
2013-11-24 01:33:49,544 INFO  crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-11-24 01:33:49,544 INFO  crawl.AbstractFetchSchedule -
defaultInterval=5
2013-11-24 01:33:49,545 INFO  crawl.AbstractFetchSchedule - maxInterval=5
2013-11-24 01:33:50,116 INFO  crawl.CrawlDb - CrawlDb update: finished at
2013-11-24 01:33:50, elapsed: 00:00:01
2013-11-24 01:33:50,119 INFO  crawl.Generator - Generator: starting at
2013-11-24 01:33:50
2013-11-24 01:33:50,119 INFO  crawl.Generator - Generator: Selecting
best-scoring urls due for fetch.
2013-11-24 01:33:50,119 INFO  crawl.Generator - Generator: filtering: true
2013-11-24 01:33:50,119 INFO  crawl.Generator - Generator: normalizing: true
2013-11-24 01:33:50,119 INFO  crawl.Generator - Generator: topN: 100000000
2013-11-24 01:33:50,120 INFO  crawl.Generator - Generator: jobtracker is
'local', generating exactly one partition.
2013-11-24 01:33:50,125 WARN  mapred.JobClient - No job jar file set.  User
classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-11-24 01:33:50,259 INFO  crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-11-24 01:33:50,259 INFO  crawl.AbstractFetchSchedule -
defaultInterval=5
2013-11-24 01:33:50,259 INFO  crawl.AbstractFetchSchedule - maxInterval=5
2013-11-24 01:33:50,279 INFO  crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-11-24 01:33:50,279 INFO  crawl.AbstractFetchSchedule -
defaultInterval=5
2013-11-24 01:33:50,279 INFO  crawl.AbstractFetchSchedule - maxInterval=5
2013-11-24 01:33:51,157 WARN  crawl.Generator - Generator: 0 records
selected for fetching, exiting ...
2013-11-24 01:33:51,159 INFO  crawl.Crawl - Stopping at depth=1 - no more
URLs to fetch.
2013-11-24 01:33:51,209 INFO  crawl.LinkDb - LinkDb: starting at 2013-11-24
01:33:51
2013-11-24 01:33:51,209 INFO  crawl.LinkDb - LinkDb: linkdb: crawl/linkdb
2013-11-24 01:33:51,209 INFO  crawl.LinkDb - LinkDb: URL normalize: true
2013-11-24 01:33:51,209 INFO  crawl.LinkDb - LinkDb: URL filter: true
2013-11-24 01:33:51,210 INFO  crawl.LinkDb - LinkDb: adding segment:
file:/home/general/workspace/nutch/crawl/segments/20131124013345
2013-11-24 01:33:51,211 WARN  mapred.JobClient - No job jar file set.  User
classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-11-24 01:33:52,260 INFO  crawl.LinkDb - LinkDb: finished at 2013-11-24
01:33:52, elapsed: 00:00:01
2013-11-24 01:33:52,260 INFO  crawl.Crawl - crawl finished: crawl
