Hi all, I am trying to crawl www.macys.com, but Nutch does not fetch anything. I have checked nutch-site.xml and regex-urlfilter.txt, and there is nothing there that should filter it out. The crawl does run, but it finishes within about 20 seconds without fetching any additional URLs.
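For reference, the stock regex-urlfilter.txt in Nutch 1.x looks roughly like the fragment below (assumed defaults; check your own copy). The `-[?*!@=]` rule is worth double-checking in particular, since it silently drops any outlink containing a query string, which is very common on e-commerce sites like macys.com:

```
# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.
```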
I am not sure what is causing this behavior. Could anyone please try www.macys.com as a seed URL and let me know whether parsing succeeds for them? I have also tried running the Nutch commands step by step, and that likewise resulted in nothing being added to the fetch list, although if I use parseChecker alone on www.macys.com it shows many outlinks. Please see the step-by-step output of the Nutch commands below.

Inject
=============
2013-12-22 18:00:32,171 INFO crawl.Injector - Injector: starting at 2013-12-22 18:00:32
2013-12-22 18:00:32,171 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb
2013-12-22 18:00:32,171 INFO crawl.Injector - Injector: urlDir: urls
2013-12-22 18:00:32,172 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2013-12-22 18:00:32,406 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-12-22 18:00:32,426 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-12-22 18:00:32,444 WARN snappy.LoadSnappy - Snappy native library not loaded
2013-12-22 18:00:33,110 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2013-12-22 18:00:33,743 INFO crawl.Injector - Injector: total number of urls rejected by filters: 1
2013-12-22 18:00:33,743 INFO crawl.Injector - Injector: total number of urls injected after normalization and filtering: 1
2013-12-22 18:00:33,743 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
2013-12-22 18:00:33,775 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-12-22 18:00:34,880 INFO crawl.Injector - Injector: finished at 2013-12-22 18:00:34, elapsed: 00:00:02

Generate
=================
Generator: Selecting best-scoring urls due for fetch.
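To sanity-check which rule (if any) is catching the outlinks, here is a minimal Python sketch of how Nutch's RegexURLFilter applies its rules: rules are evaluated top to bottom, the first matching pattern wins, `+` accepts and `-` rejects. This is an approximation for experimenting locally, not Nutch's actual code, and the rules below mirror only a few of the assumed stock defaults:

```python
import re

# (sign, pattern) pairs, evaluated top to bottom; first match wins.
# These mirror a few of the stock regex-urlfilter.txt rules (assumed defaults).
RULES = [
    ("-", re.compile(r"^(file|ftp|mailto):")),  # skip non-http schemes
    ("-", re.compile(r"[?*!@=]")),              # skip probable query URLs
    ("+", re.compile(r".")),                    # accept everything else
]

def url_filter(url):
    """Return the URL if accepted, or None if rejected (Nutch-style)."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return url if sign == "+" else None
    return None  # no rule matched: treated as rejected

print(url_filter("http://www.macys.com/"))                   # accepted
print(url_filter("http://www.macys.com/shop/product?ID=1"))  # rejected by -[?*!@=]
```

Feeding a few of the outlinks that parseChecker reports through a loop like this would show quickly whether the query-string rule is what is swallowing them.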
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 1000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20131222180230
Generator: finished at 2013-12-22 18:02:31, elapsed: 00:00:03

Fetch
==================
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.4/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Fetcher: starting at 2013-12-22 18:04:21
Fetcher: segment: crawl/segments/20131222180230
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.macys.com/ (queue crawl delay=0ms)
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-12-22 18:04:24, elapsed: 00:00:02

Parse
====================
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.4/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
ParseSegment: starting at 2013-12-22 18:15:44
ParseSegment: segment: crawl/segments/20131222180230
Exception in thread "main" java.io.IOException: Segment already parsed!
	at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:89)
	at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:975)
	at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
	at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:209)
	at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:243)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:216)

UpdateDB
===================
SLF4J: Class path contains multiple SLF4J bindings.
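Incidentally, the "Segment already parsed!" exception in the Parse step usually means the segment already contains parse data, either because parse was run twice against the same segment or because the fetcher parsed while fetching. If you intend to run parsing as a separate step, it may be worth confirming that fetcher.parse is off in nutch-site.xml (a sketch of the property, not a full config):

```xml
<property>
  <name>fetcher.parse</name>
  <value>false</value>
  <description>If true, fetcher will parse content while fetching,
  and a later standalone parse of the same segment will fail with
  "Segment already parsed!".</description>
</property>
```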
SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.4/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
CrawlDb update: starting at 2013-12-22 18:25:51
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20131222180230]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2013-12-22 18:25:52, elapsed: 00:00:01

