Hi SL,

what's the exact Nutch version you are using?

There is a tool to check the robots.txt file:
1. download it:
% wget http://www.macys.com/robots.txt
2. put the URLs to check in a file:
% cat urls.txt
http://www.macys.com/
3. test:
% bin/nutch org.apache.nutch.protocol.RobotRulesParser \
   robots.txt urls.txt "MyAgentName"
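
To double-check the rules independently of Nutch's parser, here is a small sketch using Python's stdlib urllib.robotparser with the rules you posted (note the caveats: robotparser ignores the non-standard "noindex" lines and does not expand the "*natuzzi*" wildcards the way Nutch does, but neither affects the plain front page):

```python
# Sanity-check the macys.com robots.txt rules with the stdlib parser.
from urllib.robotparser import RobotFileParser

# Default-agent block as posted in the earlier mail.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 120
Disallow: /compare
Disallow: /registry/wedding/compare
Disallow: /catalog/product/zoom.jsp
Disallow: /search
Disallow: /shop/search
Disallow: /shop/registry/wedding/search
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# The front page is not covered by any Disallow rule.
print(rp.can_fetch("MyAgentName", "http://www.macys.com/"))        # True
print(rp.can_fetch("MyAgentName", "http://www.macys.com/search"))  # False
print(rp.crawl_delay("MyAgentName"))                               # 120
```

So the front page itself should be allowed; the robots_denied status is more likely triggered by whatever URL the redirect points Nutch at, or by how the robots.txt is served.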

You can also check any URL separately, e.g.:

% bin/nutch parsechecker 'http://www.macys.com/'
fetching: http://www.macys.com/
Fetch failed with protocol status: temp_moved(13), lastModified=0:
http://www.macys.com/

The page redirects to itself (which is strange, and may be the cause of your
problem).
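
By "redirect on itself" I mean: the server answers the request for http://www.macys.com/ with a 3xx whose Location header points back at the same URL, so the fetcher loops instead of getting content. A minimal helper to illustrate the comparison (the Location values below are hypothetical examples, not what the server actually sent):

```python
# Check whether a redirect target is effectively the URL that was requested.
from urllib.parse import urlsplit

def is_self_redirect(request_url: str, location: str) -> bool:
    """True if the Location header points back at the requested URL."""
    def norm(u):
        s = urlsplit(u)
        # Scheme and host are case-insensitive; an empty path means "/".
        return (s.scheme.lower(), s.netloc.lower(), s.path or "/", s.query)
    return norm(request_url) == norm(location)

print(is_self_redirect("http://www.macys.com/", "http://www.macys.com"))    # True
print(is_self_redirect("http://www.macys.com/", "http://www1.macys.com/"))  # False
```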

Sebastian


2013/12/23 S.L <[email protected]>

> I just noticed, while reading the segment, that the following entry
> indicates this could be a robots.txt issue:
>
>
> http://www.macys.com/    Version: 7
> Status: 3 (db_gone)
> Fetch time: Sun Dec 22 19:49:57 EST 2013
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 5 seconds (0 days)
> Score: 1.0
> Signature: null
>
> *Metadata: _pst_: robots_denied(18), lastModified=0*
>
> However, in the robots.txt for macys.com, I do not see anything that
> would restrict it:
>
> User-agent: bingbot
> Crawl-delay: 0
> Disallow: /compare
> Disallow: /registry/wedding/compare
> Disallow: /catalog/product/zoom.jsp
> Disallow: /search
> Disallow: /shop/search
> Disallow: /shop/registry/wedding/search
> Disallow: *natuzzi*
> noindex: *natuzzi*
> Disallow: *Natuzzi*
> noindex: *Natuzzi*
>
> User-agent: *
> Crawl-delay: 120
> Disallow: /compare
> Disallow: /registry/wedding/compare
> Disallow: /catalog/product/zoom.jsp
> Disallow: /search
> Disallow: /shop/search
> Disallow: /shop/registry/wedding/search
> Disallow: *natuzzi*
> noindex: *natuzzi*
> Disallow: *Natuzzi*
> noindex: *Natuzzi*
>
>
>
>
>
> On Sun, Dec 22, 2013 at 7:29 PM, S.L <[email protected]> wrote:
>
> > Hi All,
> >
> > I am trying to crawl www.macys.com, but it does not fetch anything. I
> > have checked nutch-site.xml and regex-urlfilter.txt and there is
> > nothing that could filter it out. The crawl does run, but it finishes
> > within 20 seconds without fetching any additional URLs.
> >
> > I am not sure what is causing this behavior. Can anyone please try
> > this URL as a seed URL and let me know if they succeed in parsing it?
> >
> > I have also tried running the Nutch commands step by step, and that
> > also resulted in nothing being added to fetch. If I use parseChecker
> > alone on www.macys.com, it shows many outlinks, though.
> >
> > Please see the step by step execution of the nutch commands below.
> >
> >
> > Inject
> > =============
> >
> > 2013-12-22 18:00:32,171 INFO  crawl.Injector - Injector: starting at
> > 2013-12-22 18:00:32
> > 2013-12-22 18:00:32,171 INFO  crawl.Injector - Injector: crawlDb:
> > crawl/crawldb
> > 2013-12-22 18:00:32,171 INFO  crawl.Injector - Injector: urlDir: urls
> > 2013-12-22 18:00:32,172 INFO  crawl.Injector - Injector: Converting
> > injected urls to crawl db entries.
> > 2013-12-22 18:00:32,406 WARN  util.NativeCodeLoader - Unable to load
> > native-hadoop library for your platform... using builtin-java classes
> where
> > applicable
> > 2013-12-22 18:00:32,426 WARN  mapred.JobClient - No job jar file set.
> > User classes may not be found. See JobConf(Class) or
> JobConf#setJar(String).
> > 2013-12-22 18:00:32,444 WARN  snappy.LoadSnappy - Snappy native library
> > not loaded
> > 2013-12-22 18:00:33,110 INFO  regex.RegexURLNormalizer - can't find rules
> > for scope 'inject', using default
> > 2013-12-22 18:00:33,743 INFO  crawl.Injector - Injector: total number of
> > urls rejected by filters: 1
> > 2013-12-22 18:00:33,743 INFO  crawl.Injector - Injector: total number of
> > urls injected after normalization and filtering: 1
> > 2013-12-22 18:00:33,743 INFO  crawl.Injector - Injector: Merging injected
> > urls into crawl db.
> > 2013-12-22 18:00:33,775 WARN  mapred.JobClient - No job jar file set.
> > User classes may not be found. See JobConf(Class) or
> JobConf#setJar(String).
> > 2013-12-22 18:00:34,880 INFO  crawl.Injector - Injector: finished at
> > 2013-12-22 18:00:34, elapsed: 00:00:02
> >
> >
> > Generate
> > =================
> >
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: true
> > Generator: normalizing: true
> > Generator: topN: 1000
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: Partitioning selected urls for politeness.
> > Generator: segment: crawl/segments/20131222180230
> > Generator: finished at 2013-12-22 18:02:31, elapsed: 00:00:03
> >
> > Fetch
> > ==================
> >
> > SLF4J: Class path contains multiple SLF4J bindings.
> > SLF4J: Found binding in
> >
> [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.4/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> > SLF4J: Found binding in
> >
> [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> > explanation.
> > Fetcher: starting at 2013-12-22 18:04:21
> > Fetcher: segment: crawl/segments/20131222180230
> > Using queue mode : byHost
> > Fetcher: threads: 10
> > Fetcher: time-out divisor: 2
> > QueueFeeder finished: total 1 records + hit by time limit :0
> > Using queue mode : byHost
> > Using queue mode : byHost
> > fetching http://www.macys.com/ (queue crawl delay=0ms)
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > Using queue mode : byHost
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > Fetcher: throughput threshold: -1
> > Fetcher: throughput threshold retries: 5
> > -finishing thread FetcherThread, activeThreads=2
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=0
> > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > -activeThreads=0
> > Fetcher: finished at 2013-12-22 18:04:24, elapsed: 00:00:02
> >
> > Parse
> > ====================
> >
> > SLF4J: Class path contains multiple SLF4J bindings.
> > SLF4J: Found binding in
> >
> [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.4/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> > SLF4J: Found binding in
> >
> [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> > explanation.
> > ParseSegment: starting at 2013-12-22 18:15:44
> > ParseSegment: segment: crawl/segments/20131222180230
> > Exception in thread "main" java.io.IOException: Segment already parsed!
> >     at
> >
> org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:89)
> >     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:975)
> >     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
> >     at java.security.AccessController.doPrivileged(Native Method)
> >     at javax.security.auth.Subject.doAs(Subject.java:415)
> >     at
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
> >     at
> > org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
> >     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
> >     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
> >     at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:209)
> >     at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:243)
> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >     at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:216)
> >
> >
> > UpdateDB
> > ===================
> >
> > SLF4J: Class path contains multiple SLF4J bindings.
> > SLF4J: Found binding in
> >
> [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.4/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> > SLF4J: Found binding in
> >
> [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> > explanation.
> > CrawlDb update: starting at 2013-12-22 18:25:51
> > CrawlDb update: db: crawl/crawldb
> > CrawlDb update: segments: [crawl/segments/20131222180230]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: false
> > CrawlDb update: URL filtering: false
> > CrawlDb update: 404 purging: false
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: finished at 2013-12-22 18:25:52, elapsed: 00:00:01
> >
