I just noticed, while reading the segment, that the following entry indicates this could be a robots.txt issue:
http://www.macys.com/  Version: 7
Status: 3 (db_gone)
Fetch time: Sun Dec 22 19:49:57 EST 2013
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 5 seconds (0 days)
Score: 1.0
Signature: null
Metadata: _pst_: robots_denied(18), lastModified=0

However, in the robots.txt for macys.com I do not see anything that restricts the root URL:

User-agent: bingbot
Crawl-delay: 0
Disallow: /compare
Disallow: /registry/wedding/compare
Disallow: /catalog/product/zoom.jsp
Disallow: /search
Disallow: /shop/search
Disallow: /shop/registry/wedding/search
Disallow: *natuzzi*
noindex: *natuzzi*
Disallow: *Natuzzi*
noindex: *Natuzzi*

User-agent: *
Crawl-delay: 120
Disallow: /compare
Disallow: /registry/wedding/compare
Disallow: /catalog/product/zoom.jsp
Disallow: /search
Disallow: /shop/search
Disallow: /shop/registry/wedding/search
Disallow: *natuzzi*
noindex: *natuzzi*
Disallow: *Natuzzi*
noindex: *Natuzzi*

On Sun, Dec 22, 2013 at 7:29 PM, S.L <[email protected]> wrote:
> Hi All,
>
> I am trying to crawl www.macys.com, but it does not fetch anything. I have
> checked nutch-site.xml and regex-urlfilter.txt, and there is nothing there
> that would filter it out. The crawl does run, but it finishes within 20
> seconds without fetching any additional URLs.
>
> I am not sure what is causing this behavior. Can anyone please try this
> URL as a seed URL and let me know whether parsing succeeds?
>
> I have also tried running the Nutch commands step by step, and that too
> resulted in nothing being added to fetch. If I use parseChecker alone on
> www.macys.com, it does show many outlinks.
>
> Please see the step-by-step execution of the Nutch commands below.
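As a quick sanity check of the rules quoted above, Python's stdlib parser can evaluate them for an arbitrary agent. Caveats: Nutch uses its own robots parser, the stdlib one does not understand wildcard rules like "*natuzzi*" (so those lines are left out here), and "mynutchbot" is a made-up placeholder agent name:

```python
# Sanity-check the quoted macys.com robots.txt rules with the Python
# stdlib parser. This only approximates what Nutch's parser does; the
# agent name "mynutchbot" is hypothetical, and the wildcard Disallow
# lines are omitted because urllib.robotparser treats them literally.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Crawl-delay: 120
Disallow: /compare
Disallow: /registry/wedding/compare
Disallow: /catalog/product/zoom.jsp
Disallow: /search
Disallow: /shop/search
Disallow: /shop/registry/wedding/search
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The root page matches none of the Disallow rules, so any agent may fetch it,
# while a disallowed path like /search is correctly blocked:
print(rp.can_fetch("mynutchbot", "http://www.macys.com/"))        # True
print(rp.can_fetch("mynutchbot", "http://www.macys.com/search"))  # False
```

If these rules are really what the server returned to Nutch, the root URL should not have been robots_denied, which suggests the denial came from something else (for example, the server returning different content to Nutch's agent string).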
>
>
> Inject
> =============
>
> 2013-12-22 18:00:32,171 INFO crawl.Injector - Injector: starting at 2013-12-22 18:00:32
> 2013-12-22 18:00:32,171 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb
> 2013-12-22 18:00:32,171 INFO crawl.Injector - Injector: urlDir: urls
> 2013-12-22 18:00:32,172 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
> 2013-12-22 18:00:32,406 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2013-12-22 18:00:32,426 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> 2013-12-22 18:00:32,444 WARN snappy.LoadSnappy - Snappy native library not loaded
> 2013-12-22 18:00:33,110 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
> 2013-12-22 18:00:33,743 INFO crawl.Injector - Injector: total number of urls rejected by filters: 1
> 2013-12-22 18:00:33,743 INFO crawl.Injector - Injector: total number of urls injected after normalization and filtering: 1
> 2013-12-22 18:00:33,743 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
> 2013-12-22 18:00:33,775 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> 2013-12-22 18:00:34,880 INFO crawl.Injector - Injector: finished at 2013-12-22 18:00:34, elapsed: 00:00:02
>
> Generate
> =================
>
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 1000
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl/segments/20131222180230
> Generator: finished at 2013-12-22 18:02:31, elapsed: 00:00:03
>
> Fetch
> ==================
>
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.4/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> Fetcher: starting at 2013-12-22 18:04:21
> Fetcher: segment: crawl/segments/20131222180230
> Using queue mode : byHost
> Fetcher: threads: 10
> Fetcher: time-out divisor: 2
> QueueFeeder finished: total 1 records + hit by time limit :0
> Using queue mode : byHost
> Using queue mode : byHost
> fetching http://www.macys.com/ (queue crawl delay=0ms)
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> Fetcher: throughput threshold: -1
> Fetcher: throughput threshold retries: 5
> -finishing thread FetcherThread, activeThreads=2
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2013-12-22 18:04:24, elapsed: 00:00:02
>
> Parse
> ====================
>
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.4/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> ParseSegment: starting at 2013-12-22 18:15:44
> ParseSegment: segment: crawl/segments/20131222180230
> Exception in thread "main" java.io.IOException: Segment already parsed!
>         at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:89)
>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:975)
>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
>         at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:209)
>         at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:243)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:216)
>
>
> UpdateDB
> ===================
>
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.4/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> CrawlDb update: starting at 2013-12-22 18:25:51
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20131222180230]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: false
> CrawlDb update: URL filtering: false
> CrawlDb update: 404 purging: false
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2013-12-22 18:25:52, elapsed: 00:00:01
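Since the robots.txt quoted earlier does not disallow the root URL, one thing worth double-checking (this is a guess on my part, not something visible in the logs) is whether http.agent.name is set in nutch-site.xml; Nutch matches that agent name against the User-agent groups in robots.txt, and a server may also serve different content depending on the agent string. A minimal fragment, with a placeholder agent name:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- "mynutchbot" is a placeholder; use your own agent name.
       Nutch matches this value against User-agent groups in robots.txt. -->
  <property>
    <name>http.agent.name</name>
    <value>mynutchbot</value>
  </property>
</configuration>
```

After the next crawl cycle, "bin/nutch readdb crawl/crawldb -url http://www.macys.com/" should show whether the entry's metadata still carries robots_denied.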

