Hi,

I made a simple example.

Put this in seed.txt:
http://mobile365.ru

It will produce the error.

Put this in seed.txt instead:
http://www.mobile365.ru

and the second launch of the crawler script will work fine and fetch the
http://www.mobile365.ru/test.html page.
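
Incidentally, as Mathijs noted, the "Segment already fetched!" /
"Segment already parsed!" errors on the second run come from the script
itself: it always picks the newest segment, even when generate selected no
urls and created no new segment. A possible guard (just a sketch based on
the script quoted below, not tested):

  # remember the newest segment before generate; skip the cycle if no new one appears
  before=`ls -d crawl/crawldb/segments/* 2>/dev/null | tail -1`
  $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
  s1=`ls -d crawl/crawldb/segments/* | tail -1`
  if [ "$s1" = "$before" ]; then
    echo "generate selected no urls, skipping fetch/parse/updatedb"
    exit 0
  fi

That still leaves the real question: why does generate select nothing when
the seed url has no www prefix.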

On Tue, Aug 7, 2012 at 6:23 PM, Mathijs Homminga <
mathijs.hommi...@kalooga.com> wrote:

> Hi,
>
> I read from your logs:
> - test.com is injected.
> - test.com is fetched and parsed successfully.
> - but when you run a generate again (second launch), no segment is created
> (because no url is selected) and your script tries to fetch and parse the
> first segment again. Hence the errors.
>
> So test.com is fetched successfully. The question remains: why is no url
> selected in the second generate?
> Many answers are possible. Can you tell us what urls you have in your crawldb
> after the first cycle? Perhaps no outlinks have been found / added.
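>
> For example, something like this should show what ended up in the crawldb
> (just a sketch, assuming the crawl/crawldb path from your script):
>
>   bin/nutch readdb crawl/crawldb -stats
>   # dump every known url with its status into a text directory
>   bin/nutch readdb crawl/crawldb -dump crawldb-dump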
>
> Mathijs
>
>
>
>
> On Aug 7, 2012, at 16:02 , Alexei Korolev <alexei.koro...@gmail.com>
> wrote:
>
> > Hello,
> >
> > Yes, test.com and www.test.com exist.
> > test.com does not redirect to www.test.com; it opens a page whose outgoing
> > links all have the www. prefix, like www.test.com/page1 and
> > www.test.com/page2.
> >
> > First launch of the crawler script:
> >
> > root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
> > Injector: starting at 2012-08-07 16:00:30
> > Injector: crawlDb: crawl/crawldb
> > Injector: urlDir: seed.txt
> > Injector: Converting injected urls to crawl db entries.
> > Injector: Merging injected urls into crawl db.
> > Injector: finished at 2012-08-07 16:00:32, elapsed: 00:00:02
> > Generator: starting at 2012-08-07 16:00:33
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: true
> > Generator: normalizing: true
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: Partitioning selected urls for politeness.
> > Generator: segment: crawl/crawldb/segments/20120807160035
> > Generator: finished at 2012-08-07 16:00:36, elapsed: 00:00:03
> > Fetcher: Your 'http.agent.name' value should be listed first in
> > 'http.robots.agents' property.
> > Fetcher: starting at 2012-08-07 16:00:37
> > Fetcher: segment: crawl/crawldb/segments/20120807160035
> > Using queue mode : byHost
> > Fetcher: threads: 10
> > Fetcher: time-out divisor: 2
> > QueueFeeder finished: total 1 records + hit by time limit :0
> > Using queue mode : byHost
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > fetching http://test.com
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Fetcher: throughput threshold: -1
> > Fetcher: throughput threshold retries: 5
> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > -finishing thread FetcherThread, activeThreads=0
> > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > -activeThreads=0
> > Fetcher: finished at 2012-08-07 16:00:41, elapsed: 00:00:04
> > ParseSegment: starting at 2012-08-07 16:00:41
> > ParseSegment: segment: crawl/crawldb/segments/20120807160035
> > Parsing: http://test.com
> > ParseSegment: finished at 2012-08-07 16:00:44, elapsed: 00:00:02
> > CrawlDb update: starting at 2012-08-07 16:00:44
> > CrawlDb update: db: crawl/crawldb
> > CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: false
> > CrawlDb update: URL filtering: false
> > CrawlDb update: 404 purging: false
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: finished at 2012-08-07 16:00:45, elapsed: 00:00:01
> > LinkDb: starting at 2012-08-07 16:00:46
> > LinkDb: linkdb: crawl/crawldb/linkdb
> > LinkDb: URL normalize: true
> > LinkDb: URL filter: true
> > LinkDb: adding segment:
> > file:/data/nutch/crawl/crawldb/segments/20120807160035
> > LinkDb: finished at 2012-08-07 16:00:47, elapsed: 00:00:01
> >
> > Second launch of the script:
> >
> > root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
> > Injector: starting at 2012-08-07 16:01:30
> > Injector: crawlDb: crawl/crawldb
> > Injector: urlDir: seed.txt
> > Injector: Converting injected urls to crawl db entries.
> > Injector: Merging injected urls into crawl db.
> > Injector: finished at 2012-08-07 16:01:32, elapsed: 00:00:02
> > Generator: starting at 2012-08-07 16:01:33
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: true
> > Generator: normalizing: true
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: 0 records selected for fetching, exiting ...
> > Fetcher: Your 'http.agent.name' value should be listed first in
> > 'http.robots.agents' property.
> > Fetcher: starting at 2012-08-07 16:01:35
> > Fetcher: segment: crawl/crawldb/segments/20120807160035
> > Fetcher: java.io.IOException: Segment already fetched!
> >    at
> >
> org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:58)
> >    at
> > org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
> >    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> >    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> >    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1204)
> >    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1240)
> >    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1213)
> >
> > ParseSegment: starting at 2012-08-07 16:01:35
> > ParseSegment: segment: crawl/crawldb/segments/20120807160035
> > Exception in thread "main" java.io.IOException: Segment already parsed!
> >    at
> >
> org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:87)
> >    at
> > org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
> >    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> >    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> >    at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
> >    at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:178)
> >    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >    at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:164)
> > CrawlDb update: starting at 2012-08-07 16:01:36
> > CrawlDb update: db: crawl/crawldb
> > CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: false
> > CrawlDb update: URL filtering: false
> > CrawlDb update: 404 purging: false
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: finished at 2012-08-07 16:01:37, elapsed: 00:00:01
> > LinkDb: starting at 2012-08-07 16:01:37
> > LinkDb: linkdb: crawl/crawldb/linkdb
> > LinkDb: URL normalize: true
> > LinkDb: URL filter: true
> > LinkDb: adding segment:
> > file:/data/nutch/crawl/crawldb/segments/20120807160035
> > LinkDb: merging with existing linkdb: crawl/crawldb/linkdb
> > LinkDb: finished at 2012-08-07 16:01:40, elapsed: 00:00:02
> >
> >
> > But when seed.txt has www.test.com instead of test.com, the second launch
> > of the crawler script finds the next segment for fetching.
> >
> > On Sat, Aug 4, 2012 at 7:33 PM, Mathijs Homminga <
> > mathijs.hommi...@kalooga.com> wrote:
> >
> >> What do you mean exactly by "it falls on the fetch phase"?
> >> Do you get an error?
> >> Does "test.com" exist?
> >> Does it perhaps redirect to "www.test.com"?
> >> ...
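> >>
> >> (For example, one quick way to check for a redirect, assuming curl is
> >> available on the crawl machine:
> >>
> >>   curl -sI http://test.com | grep -i '^Location:'
> >>
> >> No output would mean no redirect.)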
> >>
> >> Mathijs
> >>
> >>
> >> On Aug 4, 2012, at 17:11 , Alexei Korolev <alexei.koro...@gmail.com>
> >> wrote:
> >>
> >>> yes
> >>>
> >>> On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney <
> >>> lewis.mcgibb...@gmail.com> wrote:
> >>>
> >>>> http://   ?
> >>>>
> >>>> hth
> >>>>
> >>>> On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev <
> >> alexei.koro...@gmail.com>
> >>>> wrote:
> >>>>> Hello,
> >>>>>
> >>>>> I have a small script:
> >>>>>
> >>>>> $NUTCH_PATH inject crawl/crawldb seed.txt
> >>>>> $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
> >>>>>
> >>>>> s1=`ls -d crawl/crawldb/segments/* | tail -1`
> >>>>> $NUTCH_PATH fetch $s1
> >>>>> $NUTCH_PATH parse $s1
> >>>>> $NUTCH_PATH updatedb crawl/crawldb $s1
> >>>>>
> >>>>> In seed.txt I have just one site, for example "test.com". When I start
> >>>>> the script, it fails in the fetch phase.
> >>>>> If I change test.com to www.test.com it works fine. The reason seems to
> >>>>> be that the outgoing links on test.com all have the www. prefix.
> >>>>> What do I need to change in the nutch config to make it work with
> >>>>> test.com?
> >>>>>
> >>>>> Thank you in advance. I hope my explanation is clear :)
> >>>>>
> >>>>> --
> >>>>> Alexei A. Korolev
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Lewis
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Alexei A. Korolev
> >>
> >>
> >
> >
> > --
> > Alexei A. Korolev
>
>


-- 
Alexei A. Korolev
