Hi all, I am trying to crawl www.macys.com, but Nutch does not fetch anything. I have checked nutch-site.xml and regex-urlfilter.txt, and there is nothing there that should filter it out. The crawl does run, but it finishes within about 20 seconds without fetching any additional URLs.
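For reference, the stock regex-urlfilter.txt in Nutch 1.x looks roughly like the fragment below (assumed defaults; check your own copy). The `-[?*!@=]` rule is worth double-checking in particular, since it silently drops any outlink containing a query string, which is very common on e-commerce sites like macys.com:

```
# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.
```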
I am not sure what is causing this behavior. Could anyone please try www.macys.com as a seed URL and let me know whether parsing succeeds for them? I have also tried running the Nutch commands step by step, and that likewise resulted in nothing being added to the fetch list, although if I use parseChecker alone on www.macys.com it shows many outlinks. Please see the step-by-step output of the Nutch commands below.

Inject
=============
2013-12-22 18:00:32,171 INFO crawl.Injector - Injector: starting at 2013-12-22 18:00:32
2013-12-22 18:00:32,171 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb
2013-12-22 18:00:32,171 INFO crawl.Injector - Injector: urlDir: urls
2013-12-22 18:00:32,172 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2013-12-22 18:00:32,406 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-12-22 18:00:32,426 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-12-22 18:00:32,444 WARN snappy.LoadSnappy - Snappy native library not loaded
2013-12-22 18:00:33,110 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2013-12-22 18:00:33,743 INFO crawl.Injector - Injector: total number of urls rejected by filters: 1
2013-12-22 18:00:33,743 INFO crawl.Injector - Injector: total number of urls injected after normalization and filtering: 1
2013-12-22 18:00:33,743 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
2013-12-22 18:00:33,775 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-12-22 18:00:34,880 INFO crawl.Injector - Injector: finished at 2013-12-22 18:00:34, elapsed: 00:00:02

Generate
=================
Generator: Selecting best-scoring urls due for fetch.
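To sanity-check which rule (if any) is catching the outlinks, here is a minimal Python sketch of how Nutch's RegexURLFilter applies its rules: rules are evaluated top to bottom, the first matching pattern wins, `+` accepts and `-` rejects. This is an approximation for experimenting locally, not Nutch's actual code, and the rules below mirror only a few of the assumed stock defaults:

```python
import re

# (sign, pattern) pairs, evaluated top to bottom; first match wins.
# These mirror a few of the stock regex-urlfilter.txt rules (assumed defaults).
RULES = [
    ("-", re.compile(r"^(file|ftp|mailto):")),  # skip non-http schemes
    ("-", re.compile(r"[?*!@=]")),              # skip probable query URLs
    ("+", re.compile(r".")),                    # accept everything else
]

def url_filter(url):
    """Return the URL if accepted, or None if rejected (Nutch-style)."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return url if sign == "+" else None
    return None  # no rule matched: treated as rejected

print(url_filter("http://www.macys.com/"))                   # accepted
print(url_filter("http://www.macys.com/shop/product?ID=1"))  # rejected by -[?*!@=]
```

Feeding a few of the outlinks that parseChecker reports through a loop like this would show quickly whether the query-string rule is what is swallowing them.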
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 1000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20131222180230
Generator: finished at 2013-12-22 18:02:31, elapsed: 00:00:03

Fetch
==================
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.4/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Fetcher: starting at 2013-12-22 18:04:21
Fetcher: segment: crawl/segments/20131222180230
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.macys.com/ (queue crawl delay=0ms)
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-12-22 18:04:24, elapsed: 00:00:02

Parse
====================
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.4/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
ParseSegment: starting at 2013-12-22 18:15:44
ParseSegment: segment: crawl/segments/20131222180230
Exception in thread "main" java.io.IOException: Segment already parsed!
	at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:89)
	at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:975)
	at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
	at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:209)
	at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:243)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:216)

UpdateDB
===================
SLF4J: Class path contains multiple SLF4J bindings.
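Incidentally, the "Segment already parsed!" exception in the Parse step usually means the segment already contains parse data, either because parse was run twice against the same segment or because the fetcher parsed while fetching. If you intend to run parsing as a separate step, it may be worth confirming that fetcher.parse is off in nutch-site.xml (a sketch of the property, not a full config):

```xml
<property>
  <name>fetcher.parse</name>
  <value>false</value>
  <description>If true, fetcher will parse content while fetching,
  and a later standalone parse of the same segment will fail with
  "Segment already parsed!".</description>
</property>
```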
SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.4/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
CrawlDb update: starting at 2013-12-22 18:25:51
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20131222180230]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2013-12-22 18:25:52, elapsed: 00:00:01

