I just noticed, while reading the segment, that the following entry indicates this could be a robots.txt issue:
http://www.macys.com/  Version: 7
Status: 3 (db_gone)
Fetch time: Sun Dec 22 19:49:57 EST 2013
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 5 seconds (0 days)
Score: 1.0
Signature: null
Metadata: _pst_: robots_denied(18), lastModified=0

However, in the robots.txt for macys.com I do not see anything that restricts the root URL:

User-agent: bingbot
Crawl-delay: 0
Disallow: /compare
Disallow: /registry/wedding/compare
Disallow: /catalog/product/zoom.jsp
Disallow: /search
Disallow: /shop/search
Disallow: /shop/registry/wedding/search
Disallow: *natuzzi*
noindex: *natuzzi*
Disallow: *Natuzzi*
noindex: *Natuzzi*

User-agent: *
Crawl-delay: 120
Disallow: /compare
Disallow: /registry/wedding/compare
Disallow: /catalog/product/zoom.jsp
Disallow: /search
Disallow: /shop/search
Disallow: /shop/registry/wedding/search
Disallow: *natuzzi*
noindex: *natuzzi*
Disallow: *Natuzzi*
noindex: *Natuzzi*

On Sun, Dec 22, 2013 at 7:29 PM, S.L <[email protected]> wrote:
> Hi All,
>
> I am trying to crawl www.macys.com, but it does not fetch anything. I have
> checked nutch-site.xml and regex-urlfilter.txt, and there is nothing there
> that would filter it out. The crawl does run, but it finishes within 20
> seconds without fetching any additional URLs.
>
> I am not sure what is causing this behavior. Can anyone please try this
> URL as a seed URL and let me know whether parsing succeeds?
>
> I have also tried running the Nutch commands step by step, and that too
> resulted in nothing being added to fetch. If I use parseChecker alone on
> www.macys.com, it does show many outlinks.
>
> Please see the step-by-step execution of the Nutch commands below.
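As a quick sanity check of the rules quoted above, Python's stdlib parser can evaluate them for an arbitrary agent. Caveats: Nutch uses its own robots parser, the stdlib one does not understand wildcard rules like "*natuzzi*" (so those lines are left out here), and "mynutchbot" is a made-up placeholder agent name:

```python
# Sanity-check the quoted macys.com robots.txt rules with the Python
# stdlib parser. This only approximates what Nutch's parser does; the
# agent name "mynutchbot" is hypothetical, and the wildcard Disallow
# lines are omitted because urllib.robotparser treats them literally.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Crawl-delay: 120
Disallow: /compare
Disallow: /registry/wedding/compare
Disallow: /catalog/product/zoom.jsp
Disallow: /search
Disallow: /shop/search
Disallow: /shop/registry/wedding/search
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The root page matches none of the Disallow rules, so any agent may fetch it,
# while a disallowed path like /search is correctly blocked:
print(rp.can_fetch("mynutchbot", "http://www.macys.com/"))        # True
print(rp.can_fetch("mynutchbot", "http://www.macys.com/search"))  # False
```

If these rules are really what the server returned to Nutch, the root URL should not have been robots_denied, which suggests the denial came from something else (for example, the server returning different content to Nutch's agent string).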
>
>
> Inject
> =============
>
> 2013-12-22 18:00:32,171 INFO crawl.Injector - Injector: starting at 2013-12-22 18:00:32
> 2013-12-22 18:00:32,171 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb
> 2013-12-22 18:00:32,171 INFO crawl.Injector - Injector: urlDir: urls
> 2013-12-22 18:00:32,172 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
> 2013-12-22 18:00:32,406 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2013-12-22 18:00:32,426 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> 2013-12-22 18:00:32,444 WARN snappy.LoadSnappy - Snappy native library not loaded
> 2013-12-22 18:00:33,110 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
> 2013-12-22 18:00:33,743 INFO crawl.Injector - Injector: total number of urls rejected by filters: 1
> 2013-12-22 18:00:33,743 INFO crawl.Injector - Injector: total number of urls injected after normalization and filtering: 1
> 2013-12-22 18:00:33,743 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
> 2013-12-22 18:00:33,775 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> 2013-12-22 18:00:34,880 INFO crawl.Injector - Injector: finished at 2013-12-22 18:00:34, elapsed: 00:00:02
>
> Generate
> =================
>
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 1000
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl/segments/20131222180230
> Generator: finished at 2013-12-22 18:02:31, elapsed: 00:00:03
>
> Fetch
> ==================
>
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.4/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> Fetcher: starting at 2013-12-22 18:04:21
> Fetcher: segment: crawl/segments/20131222180230
> Using queue mode : byHost
> Fetcher: threads: 10
> Fetcher: time-out divisor: 2
> QueueFeeder finished: total 1 records + hit by time limit :0
> Using queue mode : byHost
> Using queue mode : byHost
> fetching http://www.macys.com/ (queue crawl delay=0ms)
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> Fetcher: throughput threshold: -1
> Fetcher: throughput threshold retries: 5
> -finishing thread FetcherThread, activeThreads=2
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2013-12-22 18:04:24, elapsed: 00:00:02
>
> Parse
> ====================
>
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.4/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> ParseSegment: starting at 2013-12-22 18:15:44
> ParseSegment: segment: crawl/segments/20131222180230
> Exception in thread "main" java.io.IOException: Segment already parsed!
>         at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:89)
>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:975)
>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
>         at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:209)
>         at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:243)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:216)
>
>
> UpdateDB
> ===================
>
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.4/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> CrawlDb update: starting at 2013-12-22 18:25:51
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20131222180230]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: false
> CrawlDb update: URL filtering: false
> CrawlDb update: 404 purging: false
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2013-12-22 18:25:52, elapsed: 00:00:01
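Since the robots.txt quoted earlier does not disallow the root URL, one thing worth double-checking (this is a guess on my part, not something visible in the logs) is whether http.agent.name is set in nutch-site.xml; Nutch matches that agent name against the User-agent groups in robots.txt, and a server may also serve different content depending on the agent string. A minimal fragment, with a placeholder agent name:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- "mynutchbot" is a placeholder; use your own agent name.
       Nutch matches this value against User-agent groups in robots.txt. -->
  <property>
    <name>http.agent.name</name>
    <value>mynutchbot</value>
  </property>
</configuration>
```

After the next crawl cycle, "bin/nutch readdb crawl/crawldb -url http://www.macys.com/" should show whether the entry's metadata still carries robots_denied.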

