Hello,

Yes, test.com and www.test.com both exist. test.com does not redirect to www.test.com; it serves a page whose outgoing links all carry the www. prefix, e.g. www.test.com/page1, www.test.com/page2.
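A possible workaround, assuming the urlnormalizer-regex plugin is enabled in plugin.includes: a rewrite rule in conf/regex-normalize.xml that folds the bare host into the www. form, so both spellings are treated as one host. This is an untested sketch, not taken from an actual config:

<?xml version="1.0"?>
<regex-normalize>
  <!-- Untested sketch: rewrite bare-host URLs to the www. form so that
       outlinks like www.test.com/page1 end up on the same host as the
       seed URL http://test.com -->
  <regex>
    <pattern>^http://test\.com</pattern>
    <substitution>http://www.test.com</substitution>
  </regex>
</regex-normalize>

This would matter in particular if db.ignore.external.links is set to true in nutch-site.xml, since test.com and www.test.com then count as different hosts and the www. outlinks are dropped as external.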
First launch of crawler script:

root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
Injector: starting at 2012-08-07 16:00:30
Injector: crawlDb: crawl/crawldb
Injector: urlDir: seed.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-08-07 16:00:32, elapsed: 00:00:02
Generator: starting at 2012-08-07 16:00:33
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/crawldb/segments/20120807160035
Generator: finished at 2012-08-07 16:00:36, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2012-08-07 16:00:37
Fetcher: segment: crawl/crawldb/segments/20120807160035
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
fetching http://test.com
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-08-07 16:00:41, elapsed: 00:00:04
ParseSegment: starting at 2012-08-07 16:00:41
ParseSegment: segment: crawl/crawldb/segments/20120807160035
Parsing: http://test.com
ParseSegment: finished at 2012-08-07 16:00:44, elapsed: 00:00:02
CrawlDb update: starting at 2012-08-07 16:00:44
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-08-07 16:00:45, elapsed: 00:00:01
LinkDb: starting at 2012-08-07 16:00:46
LinkDb: linkdb: crawl/crawldb/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/data/nutch/crawl/crawldb/segments/20120807160035
LinkDb: finished at 2012-08-07 16:00:47, elapsed: 00:00:01

Second launch of script:

root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
Injector: starting at 2012-08-07 16:01:30
Injector: crawlDb: crawl/crawldb
Injector: urlDir: seed.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-08-07 16:01:32, elapsed: 00:00:02
Generator: starting at 2012-08-07 16:01:33
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2012-08-07 16:01:35
Fetcher: segment: crawl/crawldb/segments/20120807160035
Fetcher: java.io.IOException: Segment already fetched!
        at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:58)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1204)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1240)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1213)
ParseSegment: starting at 2012-08-07 16:01:35
ParseSegment: segment: crawl/crawldb/segments/20120807160035
Exception in thread "main" java.io.IOException: Segment already parsed!
        at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:87)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
        at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:178)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:164)
CrawlDb update: starting at 2012-08-07 16:01:36
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-08-07 16:01:37, elapsed: 00:00:01
LinkDb: starting at 2012-08-07 16:01:37
LinkDb: linkdb: crawl/crawldb/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/data/nutch/crawl/crawldb/segments/20120807160035
LinkDb: merging with existing linkdb: crawl/crawldb/linkdb
LinkDb: finished at 2012-08-07 16:01:40, elapsed: 00:00:02

But when seed.txt contains www.test.com instead of test.com, the second launch of the crawler script does find a next segment for fetching.

On Sat, Aug 4, 2012 at 7:33 PM, Mathijs Homminga <mathijs.hommi...@kalooga.com> wrote:

> What do you mean exactly with "it falls on fetch phase"?
> Do you get an error?
> Does "test.com" exist?
> Does it perhaps redirect to "www.test.com"?
> ...
>
> Mathijs
>
>
> On Aug 4, 2012, at 17:11 , Alexei Korolev <alexei.koro...@gmail.com> wrote:
>
> > yes
> >
> > On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney <
> > lewis.mcgibb...@gmail.com> wrote:
> >
> >> http:// ?
> >>
> >> hth
> >>
> >> On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev <alexei.koro...@gmail.com> wrote:
> >>
> >>> Hello,
> >>>
> >>> I have a small script:
> >>>
> >>> $NUTCH_PATH inject crawl/crawldb seed.txt
> >>> $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
> >>>
> >>> s1=`ls -d crawl/crawldb/segments/* | tail -1`
> >>> $NUTCH_PATH fetch $s1
> >>> $NUTCH_PATH parse $s1
> >>> $NUTCH_PATH updatedb crawl/crawldb $s1
> >>>
> >>> In seed.txt I have just one site, for example "test.com". When I start
> >>> the script, it fails in the fetch phase.
> >>> If I change test.com to www.test.com it works fine. The reason seems to
> >>> be that the outgoing links on test.com all have the www. prefix.
> >>> What do I need to change in the nutch config to make it work with test.com?
> >>>
> >>> Thank you in advance. I hope my explanation is clear :)
> >>>
> >>> --
> >>> Alexei A. Korolev
> >>
> >> --
> >> Lewis
> >
> > --
> > Alexei A. Korolev

--
Alexei A. Korolev
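A note on the script quoted above: in the second run the Generator selects 0 records and therefore creates no new segment, so `ls -d crawl/crawldb/segments/* | tail -1` picks up the previous, already-fetched segment, which is exactly what the "Segment already fetched!" exception complains about. A minimal guard, sketched against the quoted script (the before/after comparison is an assumption, not part of the original crawl.sh; $NUTCH_PATH is assumed to be set as in the original):

#!/bin/sh
# Sketch: run one crawl cycle, but stop cleanly when generate
# produces no new segment (otherwise tail -1 re-selects the old one).
$NUTCH_PATH inject crawl/crawldb seed.txt

before=`ls -d crawl/crawldb/segments/* 2>/dev/null | tail -1`
$NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
s1=`ls -d crawl/crawldb/segments/* 2>/dev/null | tail -1`

if [ "$s1" = "$before" ]; then
    # Generator printed "0 records selected for fetching, exiting ..."
    echo "No new segment generated; skipping fetch/parse/updatedb."
    exit 0
fi

$NUTCH_PATH fetch $s1
$NUTCH_PATH parse $s1
$NUTCH_PATH updatedb crawl/crawldb $s1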