Hi SL, what's the exact Nutch version you are using?
There is a tool to check the robots.txt file:

1. Download it:
   % wget http://www.macys.com/robots.txt

2. Place the URLs to check in a file:
   % cat urls.txt
   http://www.macys.com/

3. Test:
   % bin/nutch org.apache.nutch.protocol.RobotRulesParser \
       robots.txt urls.txt "MyAgentName"

You can also check any URL separately, e.g.:

% bin/nutch parsechecker 'http://www.macys.com/'
fetching: http://www.macys.com/
Fetch failed with protocol status: temp_moved(13), lastModified=0: http://www.macys.com/

It's a redirect onto itself (which is strange, and may be the reason for
your problem).

Sebastian

2013/12/23 S.L <[email protected]>

> I just noticed upon segment reading that the following entry in the
> segment indicates that this could be a robots.txt issue:
>
> http://www.macys.com/   Version: 7
> Status: 3 (db_gone)
> Fetch time: Sun Dec 22 19:49:57 EST 2013
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 5 seconds (0 days)
> Score: 1.0
> Signature: null
>
> *Metadata: _pst_: robots_denied(18), lastModified=0*
>
> However, in robots.txt for macys.com I do not see anything that is
> restricting.
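[Editor's note: the robots.txt rules quoted below can also be replayed offline with Python's stdlib parser. This is an independent parser, not the one Nutch uses, so results are only indicative — in particular, wildcard rules like `*natuzzi*` are not part of the original robots.txt spec and parsers treat them differently. Only the `User-agent: *` group is reproduced here.]

```python
# Cross-check the quoted robots.txt rules with Python's stdlib parser.
# NOTE: this is NOT Nutch's robots parser; it is only a sanity check
# that the quoted rules do not block the site root.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 120
Disallow: /compare
Disallow: /registry/wedding/compare
Disallow: /catalog/product/zoom.jsp
Disallow: /search
Disallow: /shop/search
Disallow: /shop/registry/wedding/search
Disallow: *natuzzi*
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# No Disallow rule matches "/", so the root page is allowed ...
print(rp.can_fetch("MyAgentName", "http://www.macys.com/"))        # True
# ... while the explicitly listed paths are blocked.
print(rp.can_fetch("MyAgentName", "http://www.macys.com/search"))  # False
# The group also asks for a 120-second crawl delay.
print(rp.crawl_delay("MyAgentName"))                               # 120
```

This supports the point made in the thread: the rules themselves allow `/`, so the `robots_denied` status is unlikely to come from the rules and more likely from how the robots.txt (or the page) was fetched.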
>
> User-agent: bingbot
> Crawl-delay: 0
> Disallow: /compare
> Disallow: /registry/wedding/compare
> Disallow: /catalog/product/zoom.jsp
> Disallow: /search
> Disallow: /shop/search
> Disallow: /shop/registry/wedding/search
> Disallow: *natuzzi*
> noindex: *natuzzi*
> Disallow: *Natuzzi*
> noindex: *Natuzzi*
>
> User-agent: *
> Crawl-delay: 120
> Disallow: /compare
> Disallow: /registry/wedding/compare
> Disallow: /catalog/product/zoom.jsp
> Disallow: /search
> Disallow: /shop/search
> Disallow: /shop/registry/wedding/search
> Disallow: *natuzzi*
> noindex: *natuzzi*
> Disallow: *Natuzzi*
> noindex: *Natuzzi*
>
> On Sun, Dec 22, 2013 at 7:29 PM, S.L <[email protected]> wrote:
>
> > Hi All,
> >
> > I am trying to crawl www.macys.com, however it does not fetch anything.
> > I have checked nutch-site.xml and regex-urlfilter.txt and there is
> > nothing that could filter anything. However, it does crawl, and the
> > crawl finishes very soon, within 20 seconds, without fetching any
> > additional URLs.
> >
> > I am not sure what's causing this behavior. Can anyone please try to
> > parse this URL as a seed URL and let me know if they succeed in
> > parsing it?
> >
> > I have also tried to run the Nutch commands in a step-by-step fashion,
> > and it also resulted in nothing being added to fetch. If I use
> > parsechecker alone for www.macys.com, it shows many outlinks though.
> >
> > Please see the step-by-step execution of the Nutch commands below.
> >
> > Inject
> > =============
> >
> > 2013-12-22 18:00:32,171 INFO  crawl.Injector - Injector: starting at 2013-12-22 18:00:32
> > 2013-12-22 18:00:32,171 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
> > 2013-12-22 18:00:32,171 INFO  crawl.Injector - Injector: urlDir: urls
> > 2013-12-22 18:00:32,172 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
> > 2013-12-22 18:00:32,406 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > 2013-12-22 18:00:32,426 WARN  mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> > 2013-12-22 18:00:32,444 WARN  snappy.LoadSnappy - Snappy native library not loaded
> > 2013-12-22 18:00:33,110 INFO  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
> > 2013-12-22 18:00:33,743 INFO  crawl.Injector - Injector: total number of urls rejected by filters: 1
> > 2013-12-22 18:00:33,743 INFO  crawl.Injector - Injector: total number of urls injected after normalization and filtering: 1
> > 2013-12-22 18:00:33,743 INFO  crawl.Injector - Injector: Merging injected urls into crawl db.
> > 2013-12-22 18:00:33,775 WARN  mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> > 2013-12-22 18:00:34,880 INFO  crawl.Injector - Injector: finished at 2013-12-22 18:00:34, elapsed: 00:00:02
> >
> > Generate
> > =================
> >
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: true
> > Generator: normalizing: true
> > Generator: topN: 1000
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: Partitioning selected urls for politeness.
> > Generator: segment: crawl/segments/20131222180230
> > Generator: finished at 2013-12-22 18:02:31, elapsed: 00:00:03
> >
> > Fetch
> > ==================
> >
> > SLF4J: Class path contains multiple SLF4J bindings.
> > SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.4/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> > SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> > Fetcher: starting at 2013-12-22 18:04:21
> > Fetcher: segment: crawl/segments/20131222180230
> > Using queue mode : byHost
> > Fetcher: threads: 10
> > Fetcher: time-out divisor: 2
> > QueueFeeder finished: total 1 records + hit by time limit :0
> > Using queue mode : byHost
> > Using queue mode : byHost
> > fetching http://www.macys.com/ (queue crawl delay=0ms)
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > Using queue mode : byHost
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > Fetcher: throughput threshold: -1
> > Fetcher: throughput threshold retries: 5
> > -finishing thread FetcherThread, activeThreads=2
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=0
> > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > -activeThreads=0
> > Fetcher: finished at 2013-12-22 18:04:24, elapsed: 00:00:02
> >
> > Parse
> > ====================
> >
> > SLF4J: Class path contains multiple SLF4J bindings.
> > SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.4/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> > SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> > ParseSegment: starting at 2013-12-22 18:15:44
> > ParseSegment: segment: crawl/segments/20131222180230
> > Exception in thread "main" java.io.IOException: Segment already parsed!
> >         at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:89)
> >         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:975)
> >         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
> >         at java.security.AccessController.doPrivileged(Native Method)
> >         at javax.security.auth.Subject.doAs(Subject.java:415)
> >         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
> >         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
> >         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
> >         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
> >         at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:209)
> >         at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:243)
> >         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >         at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:216)
> >
> > UpdateDB
> > ===================
> >
> > SLF4J: Class path contains multiple SLF4J bindings.
> > SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.4/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> > SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> > CrawlDb update: starting at 2013-12-22 18:25:51
> > CrawlDb update: db: crawl/crawldb
> > CrawlDb update: segments: [crawl/segments/20131222180230]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: false
> > CrawlDb update: URL filtering: false
> > CrawlDb update: 404 purging: false
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: finished at 2013-12-22 18:25:52, elapsed: 00:00:01
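[Editor's note: two symptoms in this thread are worth separating. The "Segment already parsed!" exception usually just means the segment already contains parse data (for example, the fetcher ran with parsing enabled, or parse was run twice); it is unrelated to the robots issue. The `robots_denied` status, by contrast, can occur even when the rules allow the page: many crawlers treat a robots.txt fetch that fails or redirects oddly as a conservative "deny all", while a clean 404 means "no restrictions". The sketch below is a hypothetical illustration of that policy, not Nutch's actual code, whose behavior depends on the version and `http.robots.*` settings.]

```python
# Hypothetical sketch of conservative robots.txt handling in a crawler.
# This is NOT Nutch's implementation; it only illustrates why a failed
# or redirected robots.txt fetch can end in robots_denied even when the
# published rules allow the page.
from urllib.robotparser import RobotFileParser

def may_fetch(robots_status, robots_body, agent, url):
    """Decide whether `url` may be fetched, given the HTTP status and
    body returned for the site's robots.txt request."""
    if robots_status == 404:
        return True    # no robots.txt at all -> everything is allowed
    if robots_status != 200:
        return False   # redirects, 5xx, timeouts -> conservative deny
    rp = RobotFileParser()
    rp.parse(robots_body.splitlines())
    return rp.can_fetch(agent, url)

RULES = "User-agent: *\nDisallow: /search\n"

# A clean 200 defers to the rules: "/" is allowed, "/search" is not.
print(may_fetch(200, RULES, "MyAgentName", "http://www.macys.com/"))  # True
# A redirect on robots.txt denies everything, whatever the rules say.
print(may_fetch(302, RULES, "MyAgentName", "http://www.macys.com/"))  # False
```

Under this reading, the `temp_moved(13)` redirect observed with parsechecker is a plausible path to the `robots_denied(18)` entry in the crawldb, which is why checking the Nutch version and the redirect behavior of the site is a sensible next step.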

