-----Original message-----
> From: Alexei Korolev <alexei.koro...@gmail.com>
> Sent: Wed 08-Aug-2012 15:43
> To: user@nutch.apache.org
> Subject: Re: crawling site without www
>
> Hi, Sebastian
>
> Seems you are right. I have db.ignore.external.links set to true.
> But how do I configure nutch to process mobile365.ru and www.mobile365.ru
> as a single site?
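For reference, one way to collapse the two host names into one is a rewrite rule in conf/regex-normalize.xml. This is a sketch, assuming the urlnormalizer-regex plugin is enabled via plugin.includes and that the site should be crawled under the non-www name:

```xml
<?xml version="1.0"?>
<!-- Sketch: strip a leading "www." so that http://www.mobile365.ru/... and
     http://mobile365.ru/... normalize to the same URL. Assumes the
     urlnormalizer-regex plugin is listed in plugin.includes. -->
<regex-normalize>
  <regex>
    <pattern>^(https?://)www\.</pattern>
    <substitution>$1</substitution>
  </regex>
</regex-normalize>
```

Note that this rule strips "www." from every crawled URL, not just this site; if other hosts are crawled too, the pattern should be restricted to the one host.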
You can use the HostURLNormalizer for this task, or just crawl either the www
host or the non-www host, not both.

> Thanks.
>
> On Tue, Aug 7, 2012 at 10:58 PM, Sebastian Nagel
> <wastl.na...@googlemail.com> wrote:
>
> > Hi Alexei,
> >
> > I tried a crawl with your script fragment and Nutch 1.5.1,
> > with the URL http://mobile365.ru as seed. It worked,
> > see the annotated log below.
> >
> > Which version of Nutch do you use?
> >
> > Check the property db.ignore.external.links (default is false).
> > If true, the link from mobile365.ru to www.mobile365.ru
> > is skipped.
> >
> > Look into your crawldb (bin/nutch readdb).
> >
> > Check your URL filters with
> > bin/nutch org.apache.nutch.net.URLFilterChecker
> >
> > Finally, send the nutch-site.xml and every configuration
> > file you changed.
> >
> > Good luck,
> > Sebastian
> >
> > % nutch inject crawl/crawldb seed.txt
> > Injector: starting at 2012-08-07 20:31:00
> > Injector: crawlDb: crawl/crawldb
> > Injector: urlDir: seed.txt
> > Injector: Converting injected urls to crawl db entries.
> > Injector: Merging injected urls into crawl db.
> > Injector: finished at 2012-08-07 20:31:15, elapsed: 00:00:15
> >
> > % nutch generate crawl/crawldb crawl/crawldb/segments -adddays 0
> > Generator: starting at 2012-08-07 20:31:23
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: true
> > Generator: normalizing: true
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: Partitioning selected urls for politeness.
> > Generator: segment: crawl/crawldb/segments/20120807203131
> > Generator: finished at 2012-08-07 20:31:39, elapsed: 00:00:15
> >
> > # Note: personally, I would prefer not to place segments (also linkdb)
> > # in the crawldb/ folder.
> >
> > % s1=`ls -d crawl/crawldb/segments/* | tail -1`
> >
> > % nutch fetch $s1
> > Fetcher: starting at 2012-08-07 20:32:00
> > Fetcher: segment: crawl/crawldb/segments/20120807203131
> > Using queue mode : byHost
> > Fetcher: threads: 10
> > Fetcher: time-out divisor: 2
> > QueueFeeder finished: total 1 records + hit by time limit :0
> > Using queue mode : byHost
> > fetching http://mobile365.ru/
> > Using queue mode : byHost
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > Using queue mode : byHost
> > Fetcher: throughput threshold: -1
> > -finishing thread FetcherThread, activeThreads=1
> > Fetcher: throughput threshold retries: 5
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=0
> > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > -activeThreads=0
> > Fetcher: finished at 2012-08-07 20:32:08, elapsed: 00:00:07
> >
> > % nutch parse $s1
> > ParseSegment: starting at 2012-08-07 20:32:12
> > ParseSegment: segment: crawl/crawldb/segments/20120807203131
> > Parsed (10ms): http://mobile365.ru/
> > ParseSegment: finished at 2012-08-07 20:32:20, elapsed: 00:00:07
> >
> > % nutch updatedb crawl/crawldb/ $s1
> > CrawlDb update: starting at 2012-08-07 20:32:24
> > CrawlDb update: db: crawl/crawldb
> > CrawlDb update: segments: [crawl/crawldb/segments/20120807203131]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: false
> > CrawlDb update: URL filtering: false
> > CrawlDb update: 404 purging: false
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: finished at 2012-08-07 20:32:38, elapsed: 00:00:13
> >
> > # see whether the outlink is now in crawldb:
> > % nutch readdb crawl/crawldb/ -stats
> > CrawlDb statistics start: crawl/crawldb/
> > Statistics for CrawlDb: crawl/crawldb/
> > TOTAL urls: 2
> > retry 0: 2
> > min score: 1.0
> > avg score: 1.0
> > max score: 1.0
> > status 1 (db_unfetched): 1
> > status 2 (db_fetched): 1
> > CrawlDb statistics: done
> > # => yes: http://mobile365.ru/ is fetched, outlink found
> >
> > % nutch generate crawl/crawldb crawl/crawldb/segments -adddays 0
> > Generator: starting at 2012-08-07 20:32:58
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: true
> > Generator: normalizing: true
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: Partitioning selected urls for politeness.
> > Generator: segment: crawl/crawldb/segments/20120807203307
> > Generator: finished at 2012-08-07 20:33:14, elapsed: 00:00:15
> >
> > % s1=`ls -d crawl/crawldb/segments/* | tail -1`
> >
> > % nutch fetch $s1
> > Fetcher: starting at 2012-08-07 20:33:34
> > Fetcher: segment: crawl/crawldb/segments/20120807203307
> > Using queue mode : byHost
> > Fetcher: threads: 10
> > Fetcher: time-out divisor: 2
> > QueueFeeder finished: total 1 records + hit by time limit :0
> > Using queue mode : byHost
> > fetching http://www.mobile365.ru/test.html
> > # got it
> >
> >
> > On 08/07/2012 04:37 PM, Alexei Korolev wrote:
> > > Hi,
> > >
> > > I made a simple example.
> > >
> > > Put in seed.txt:
> > > http://mobile365.ru
> > >
> > > It will produce an error.
> > >
> > > Put in seed.txt:
> > > http://www.mobile365.ru
> > >
> > > and the second launch of the crawler script will work fine and fetch
> > > the http://www.mobile365.ru/test.html page.
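The behaviour Alexei describes is consistent with db.ignore.external.links=true: Nutch compares host names, so www.mobile365.ru counts as an external host for a page on mobile365.ru and the outlink is dropped before it ever reaches the crawldb. A quick way to confirm is to flip the property in nutch-site.xml. This is a sketch of the fragment; the comment is mine, not the shipped description:

```xml
<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <!-- When true, outlinks pointing to other hosts are discarded. Since the
       www and non-www names are different hosts, the link from
       http://mobile365.ru to http://www.mobile365.ru/test.html is skipped. -->
</property>
```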
> > >
> > > On Tue, Aug 7, 2012 at 6:23 PM, Mathijs Homminga
> > > <mathijs.hommi...@kalooga.com> wrote:
> > >
> > >> Hi,
> > >>
> > >> I read from your logs:
> > >> - test.com is injected.
> > >> - test.com is fetched and parsed successfully.
> > >> - but when you run a generate again (second launch), no segment is
> > >> created (because no url is selected) and your script tries to fetch
> > >> and parse the first segment again. Hence the errors.
> > >>
> > >> So test.com is fetched successfully. The question remains: why is no
> > >> url selected in the second generate?
> > >> Many answers are possible. Can you tell us what urls you have in your
> > >> crawldb after the first cycle? Perhaps no outlinks have been found /
> > >> added.
> > >>
> > >> Mathijs
> > >>
> > >> On Aug 7, 2012, at 16:02 , Alexei Korolev <alexei.koro...@gmail.com>
> > >> wrote:
> > >>
> > >>> Hello,
> > >>>
> > >>> Yes, test.com and www.test.com exist.
> > >>> test.com does not redirect to www.test.com; it opens a page with
> > >>> outgoing links with www., like www.test.com/page1 www.test.com/page2
> > >>>
> > >>> First launch of the crawler script:
> > >>>
> > >>> root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
> > >>> Injector: starting at 2012-08-07 16:00:30
> > >>> Injector: crawlDb: crawl/crawldb
> > >>> Injector: urlDir: seed.txt
> > >>> Injector: Converting injected urls to crawl db entries.
> > >>> Injector: Merging injected urls into crawl db.
> > >>> Injector: finished at 2012-08-07 16:00:32, elapsed: 00:00:02
> > >>> Generator: starting at 2012-08-07 16:00:33
> > >>> Generator: Selecting best-scoring urls due for fetch.
> > >>> Generator: filtering: true
> > >>> Generator: normalizing: true
> > >>> Generator: jobtracker is 'local', generating exactly one partition.
> > >>> Generator: Partitioning selected urls for politeness.
> > >>> Generator: segment: crawl/crawldb/segments/20120807160035
> > >>> Generator: finished at 2012-08-07 16:00:36, elapsed: 00:00:03
> > >>> Fetcher: Your 'http.agent.name' value should be listed first in
> > >>> 'http.robots.agents' property.
> > >>> Fetcher: starting at 2012-08-07 16:00:37
> > >>> Fetcher: segment: crawl/crawldb/segments/20120807160035
> > >>> Using queue mode : byHost
> > >>> Fetcher: threads: 10
> > >>> Fetcher: time-out divisor: 2
> > >>> QueueFeeder finished: total 1 records + hit by time limit :0
> > >>> Using queue mode : byHost
> > >>> Using queue mode : byHost
> > >>> -finishing thread FetcherThread, activeThreads=1
> > >>> Using queue mode : byHost
> > >>> fetching http://test.com
> > >>> -finishing thread FetcherThread, activeThreads=1
> > >>> Using queue mode : byHost
> > >>> -finishing thread FetcherThread, activeThreads=1
> > >>> Using queue mode : byHost
> > >>> -finishing thread FetcherThread, activeThreads=1
> > >>> Using queue mode : byHost
> > >>> -finishing thread FetcherThread, activeThreads=1
> > >>> Using queue mode : byHost
> > >>> -finishing thread FetcherThread, activeThreads=1
> > >>> Using queue mode : byHost
> > >>> -finishing thread FetcherThread, activeThreads=1
> > >>> Using queue mode : byHost
> > >>> -finishing thread FetcherThread, activeThreads=1
> > >>> Using queue mode : byHost
> > >>> -finishing thread FetcherThread, activeThreads=1
> > >>> Fetcher: throughput threshold: -1
> > >>> Fetcher: throughput threshold retries: 5
> > >>> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > >>> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > >>> -finishing thread FetcherThread, activeThreads=0
> > >>> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > >>> -activeThreads=0
> > >>> Fetcher: finished at 2012-08-07 16:00:41, elapsed: 00:00:04
> > >>> ParseSegment: starting at 2012-08-07 16:00:41
> > >>> ParseSegment: segment: crawl/crawldb/segments/20120807160035
> > >>> Parsing: http://test.com
> > >>> ParseSegment: finished at 2012-08-07 16:00:44, elapsed: 00:00:02
> > >>> CrawlDb update: starting at 2012-08-07 16:00:44
> > >>> CrawlDb update: db: crawl/crawldb
> > >>> CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
> > >>> CrawlDb update: additions allowed: true
> > >>> CrawlDb update: URL normalizing: false
> > >>> CrawlDb update: URL filtering: false
> > >>> CrawlDb update: 404 purging: false
> > >>> CrawlDb update: Merging segment data into db.
> > >>> CrawlDb update: finished at 2012-08-07 16:00:45, elapsed: 00:00:01
> > >>> LinkDb: starting at 2012-08-07 16:00:46
> > >>> LinkDb: linkdb: crawl/crawldb/linkdb
> > >>> LinkDb: URL normalize: true
> > >>> LinkDb: URL filter: true
> > >>> LinkDb: adding segment:
> > >>> file:/data/nutch/crawl/crawldb/segments/20120807160035
> > >>> LinkDb: finished at 2012-08-07 16:00:47, elapsed: 00:00:01
> > >>>
> > >>> Second launch of the script:
> > >>>
> > >>> root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
> > >>> Injector: starting at 2012-08-07 16:01:30
> > >>> Injector: crawlDb: crawl/crawldb
> > >>> Injector: urlDir: seed.txt
> > >>> Injector: Converting injected urls to crawl db entries.
> > >>> Injector: Merging injected urls into crawl db.
> > >>> Injector: finished at 2012-08-07 16:01:32, elapsed: 00:00:02
> > >>> Generator: starting at 2012-08-07 16:01:33
> > >>> Generator: Selecting best-scoring urls due for fetch.
> > >>> Generator: filtering: true
> > >>> Generator: normalizing: true
> > >>> Generator: jobtracker is 'local', generating exactly one partition.
> > >>> Generator: 0 records selected for fetching, exiting ...
> > >>> Fetcher: Your 'http.agent.name' value should be listed first in
> > >>> 'http.robots.agents' property.
> > >>> Fetcher: starting at 2012-08-07 16:01:35
> > >>> Fetcher: segment: crawl/crawldb/segments/20120807160035
> > >>> Fetcher: java.io.IOException: Segment already fetched!
> > >>>     at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:58)
> > >>>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
> > >>>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> > >>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> > >>>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1204)
> > >>>     at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1240)
> > >>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >>>     at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1213)
> > >>>
> > >>> ParseSegment: starting at 2012-08-07 16:01:35
> > >>> ParseSegment: segment: crawl/crawldb/segments/20120807160035
> > >>> Exception in thread "main" java.io.IOException: Segment already parsed!
> > >>>     at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:87)
> > >>>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
> > >>>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> > >>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> > >>>     at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
> > >>>     at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:178)
> > >>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >>>     at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:164)
> > >>> CrawlDb update: starting at 2012-08-07 16:01:36
> > >>> CrawlDb update: db: crawl/crawldb
> > >>> CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
> > >>> CrawlDb update: additions allowed: true
> > >>> CrawlDb update: URL normalizing: false
> > >>> CrawlDb update: URL filtering: false
> > >>> CrawlDb update: 404 purging: false
> > >>> CrawlDb update: Merging segment data into db.
> > >>> CrawlDb update: finished at 2012-08-07 16:01:37, elapsed: 00:00:01
> > >>> LinkDb: starting at 2012-08-07 16:01:37
> > >>> LinkDb: linkdb: crawl/crawldb/linkdb
> > >>> LinkDb: URL normalize: true
> > >>> LinkDb: URL filter: true
> > >>> LinkDb: adding segment:
> > >>> file:/data/nutch/crawl/crawldb/segments/20120807160035
> > >>> LinkDb: merging with existing linkdb: crawl/crawldb/linkdb
> > >>> LinkDb: finished at 2012-08-07 16:01:40, elapsed: 00:00:02
> > >>>
> > >>> But when seed.txt has www.test.com instead of test.com, the second
> > >>> launch of the crawler script finds the next segment for fetching.
> > >>>
> > >>> On Sat, Aug 4, 2012 at 7:33 PM, Mathijs Homminga
> > >>> <mathijs.hommi...@kalooga.com> wrote:
> > >>>
> > >>>> What do you mean exactly with "it fails in the fetch phase"?
> > >>>> Do you get an error?
> > >>>> Does "test.com" exist?
> > >>>> Does it perhaps redirect to "www.test.com"?
> > >>>> ...
> > >>>>
> > >>>> Mathijs
> > >>>>
> > >>>> On Aug 4, 2012, at 17:11 , Alexei Korolev <alexei.koro...@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>> yes
> > >>>>>
> > >>>>> On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney
> > >>>>> <lewis.mcgibb...@gmail.com> wrote:
> > >>>>>
> > >>>>>> http:// ?
> > >>>>>>
> > >>>>>> hth
> > >>>>>>
> > >>>>>> On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev
> > >>>>>> <alexei.koro...@gmail.com> wrote:
> > >>>>>>> Hello,
> > >>>>>>>
> > >>>>>>> I have a small script:
> > >>>>>>>
> > >>>>>>> $NUTCH_PATH inject crawl/crawldb seed.txt
> > >>>>>>> $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
> > >>>>>>>
> > >>>>>>> s1=`ls -d crawl/crawldb/segments/* | tail -1`
> > >>>>>>> $NUTCH_PATH fetch $s1
> > >>>>>>> $NUTCH_PATH parse $s1
> > >>>>>>> $NUTCH_PATH updatedb crawl/crawldb $s1
> > >>>>>>>
> > >>>>>>> In seed.txt I have just one site, for example "test.com". When I
> > >>>>>>> start the script, it fails in the fetch phase.
> > >>>>>>> If I change test.com to www.test.com, it works fine. It seems the
> > >>>>>>> reason is that the outgoing links on test.com all have the www.
> > >>>>>>> prefix.
> > >>>>>>> What do I need to change in the nutch config to make it work with
> > >>>>>>> test.com?
> > >>>>>>>
> > >>>>>>> Thank you in advance. I hope my explanation is clear :)
> > >>>>>>>
> > >>>>>>> --
> > >>>>>>> Alexei A. Korolev
> > >>>>>>
> > >>>>>> --
> > >>>>>> Lewis
> > >>>>>
> > >>>>> --
> > >>>>> Alexei A. Korolev
> > >>>>
> > >>>
> > >>> --
> > >>> Alexei A. Korolev
> > >>
> > >
> >
>
> --
> Alexei A. Korolev
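A closing note on the script itself: the "Segment already fetched!" / "Segment already parsed!" exceptions in the second run are a symptom, not the cause. Generate selected 0 records and created no new segment, so `ls ... | tail -1` picked up the previous segment again. A defensive sketch of the cycle (paths follow the script quoted above; `newest_segment` is a helper name introduced here, not a Nutch command):

```shell
#!/bin/sh
# Newest segment = last one lexicographically; segment names are
# timestamps (e.g. 20120807160035), so this is also chronological.
newest_segment() {
  ls -d "$1"/* 2>/dev/null | tail -1
}

# Sketch of the generate/fetch cycle with a guard, assuming $NUTCH_PATH
# points at bin/nutch as in the script above:
#   before=$(newest_segment crawl/crawldb/segments)
#   $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
#   s1=$(newest_segment crawl/crawldb/segments)
#   if [ -z "$s1" ] || [ "$s1" = "$before" ]; then
#       echo "Generator created no new segment; nothing to fetch." >&2
#       exit 0
#   fi
#   $NUTCH_PATH fetch "$s1"
#   $NUTCH_PATH parse "$s1"
#   $NUTCH_PATH updatedb crawl/crawldb "$s1"
```

With the guard in place, a run where generate selects nothing exits cleanly instead of re-fetching the old segment.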