Hi,

From your logs I read the following:
- test.com is injected.
- test.com is fetched and parsed successfully.
- but when you run generate again (second launch), no new segment is created
(because no url is selected), and your script then tries to fetch and parse the
first segment again. Hence the errors.
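
You could make your script robust against this by checking whether generate
actually created a new segment before fetching. A rough, untested sketch, using
the same paths as in your crawl.sh:

$NUTCH_PATH inject crawl/crawldb seed.txt
# remember the newest segment (if any) before generate runs
last=`ls -d crawl/crawldb/segments/* 2>/dev/null | tail -1`
$NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
s1=`ls -d crawl/crawldb/segments/* | tail -1`
# if no url was selected, generate creates no segment, so s1 is unchanged
if [ "$s1" = "$last" ]; then
  echo "Generator selected no urls, nothing to fetch."
  exit 0
fi
$NUTCH_PATH fetch $s1
$NUTCH_PATH parse $s1
$NUTCH_PATH updatedb crawl/crawldb $s1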

So test.com is fetched successfully. The question remains: why is no url selected
in the second generate?
Many answers are possible. Can you tell us which urls you have in your crawldb after
the first cycle? Perhaps no outlinks have been found / added.
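
You can inspect the crawldb with the readdb tool, e.g. (paths taken from your
script; the dump directory name is just an example):

$NUTCH_PATH readdb crawl/crawldb -stats
$NUTCH_PATH readdb crawl/crawldb -dump crawldb-dump

-stats shows how many urls are in the db and their statuses; -dump writes the
urls themselves (here into crawldb-dump), so you can see whether the
www.test.com outlinks were added at all.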

Mathijs




On Aug 7, 2012, at 16:02 , Alexei Korolev <alexei.koro...@gmail.com> wrote:

> Hello,
> 
> Yes, test.com and www.test.com exist.
> test.com does not redirect to www.test.com; it opens a page whose outgoing links
> have the www. prefix, like www.test.com/page1 and www.test.com/page2
> 
> First launch of crawler script
> 
> root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
> Injector: starting at 2012-08-07 16:00:30
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: seed.txt
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2012-08-07 16:00:32, elapsed: 00:00:02
> Generator: starting at 2012-08-07 16:00:33
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl/crawldb/segments/20120807160035
> Generator: finished at 2012-08-07 16:00:36, elapsed: 00:00:03
> Fetcher: Your 'http.agent.name' value should be listed first in
> 'http.robots.agents' property.
> Fetcher: starting at 2012-08-07 16:00:37
> Fetcher: segment: crawl/crawldb/segments/20120807160035
> Using queue mode : byHost
> Fetcher: threads: 10
> Fetcher: time-out divisor: 2
> QueueFeeder finished: total 1 records + hit by time limit :0
> Using queue mode : byHost
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> fetching http://test.com
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Fetcher: throughput threshold: -1
> Fetcher: throughput threshold retries: 5
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2012-08-07 16:00:41, elapsed: 00:00:04
> ParseSegment: starting at 2012-08-07 16:00:41
> ParseSegment: segment: crawl/crawldb/segments/20120807160035
> Parsing: http://test.com
> ParseSegment: finished at 2012-08-07 16:00:44, elapsed: 00:00:02
> CrawlDb update: starting at 2012-08-07 16:00:44
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: false
> CrawlDb update: URL filtering: false
> CrawlDb update: 404 purging: false
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2012-08-07 16:00:45, elapsed: 00:00:01
> LinkDb: starting at 2012-08-07 16:00:46
> LinkDb: linkdb: crawl/crawldb/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment:
> file:/data/nutch/crawl/crawldb/segments/20120807160035
> LinkDb: finished at 2012-08-07 16:00:47, elapsed: 00:00:01
> 
> Second launch of the script
> 
> root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
> Injector: starting at 2012-08-07 16:01:30
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: seed.txt
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2012-08-07 16:01:32, elapsed: 00:00:02
> Generator: starting at 2012-08-07 16:01:33
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Fetcher: Your 'http.agent.name' value should be listed first in
> 'http.robots.agents' property.
> Fetcher: starting at 2012-08-07 16:01:35
> Fetcher: segment: crawl/crawldb/segments/20120807160035
> Fetcher: java.io.IOException: Segment already fetched!
>    at
> org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:58)
>    at
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
>    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1204)
>    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1240)
>    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1213)
> 
> ParseSegment: starting at 2012-08-07 16:01:35
> ParseSegment: segment: crawl/crawldb/segments/20120807160035
> Exception in thread "main" java.io.IOException: Segment already parsed!
>    at
> org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:87)
>    at
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
>    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>    at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
>    at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:178)
>    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>    at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:164)
> CrawlDb update: starting at 2012-08-07 16:01:36
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: false
> CrawlDb update: URL filtering: false
> CrawlDb update: 404 purging: false
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2012-08-07 16:01:37, elapsed: 00:00:01
> LinkDb: starting at 2012-08-07 16:01:37
> LinkDb: linkdb: crawl/crawldb/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment:
> file:/data/nutch/crawl/crawldb/segments/20120807160035
> LinkDb: merging with existing linkdb: crawl/crawldb/linkdb
> LinkDb: finished at 2012-08-07 16:01:40, elapsed: 00:00:02
> 
> 
> But when seed.txt has www.test.com instead of test.com, the second launch of the
> crawler script finds the next segment for fetching.
> 
> On Sat, Aug 4, 2012 at 7:33 PM, Mathijs Homminga <
> mathijs.hommi...@kalooga.com> wrote:
> 
>> What do you mean exactly by "it fails at the fetch phase"?
>> Do you get an error?
>> Does "test.com" exist?
>> Does it perhaps redirect to "www.test.com"?
>> ...
>> 
>> Mathijs
>> 
>> 
>> On Aug 4, 2012, at 17:11 , Alexei Korolev <alexei.koro...@gmail.com>
>> wrote:
>> 
>>> yes
>>> 
>>> On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney <
>>> lewis.mcgibb...@gmail.com> wrote:
>>> 
>>>> http://   ?
>>>> 
>>>> hth
>>>> 
>>>> On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev <
>> alexei.koro...@gmail.com>
>>>> wrote:
>>>>> Hello,
>>>>> 
>>>>> I have small script
>>>>> 
>>>>> $NUTCH_PATH inject crawl/crawldb seed.txt
>>>>> $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
>>>>> 
>>>>> s1=`ls -d crawl/crawldb/segments/* | tail -1`
>>>>> $NUTCH_PATH fetch $s1
>>>>> $NUTCH_PATH parse $s1
>>>>> $NUTCH_PATH updatedb crawl/crawldb $s1
>>>>> 
>>>>> In seed.txt I have just one site, for example "test.com". When I start
>>>>> the script, it fails at the fetch phase.
>>>>> If I change test.com to www.test.com it works fine. The reason seems to
>>>>> be that the outgoing links on test.com all have the www. prefix.
>>>>> What do I need to change in the nutch config to make it work with test.com?
>>>>> 
>>>>> Thank you in advance. I hope my explanation is clear :)
>>>>> 
>>>>> --
>>>>> Alexei A. Korolev
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Lewis
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Alexei A. Korolev
>> 
>> 
> 
> 
> -- 
> Alexei A. Korolev
