Re: crawling site without www

2012-08-08 Thread Alexei Korolev
Hi, Sebastian

It seems you are right: I have db.ignore.external.links set to true.
But how can I configure Nutch to process mobile365.ru and www.mobile365.ru as
a single site?

Thanks.

On Tue, Aug 7, 2012 at 10:58 PM, Sebastian Nagel wastl.na...@googlemail.com
 wrote:

 Hi Alexei,

 I tried a crawl with your script fragment and Nutch 1.5.1
 and the URL http://mobile365.ru as seed. It worked,
 see the annotated log below.

 Which version of Nutch do you use?

 Check the property db.ignore.external.links (default is false).
 If true, the link from mobile365.ru to www.mobile365.ru
 is skipped.

 Look into your crawldb (bin/nutch readdb)

 Check your URL filters with
  bin/nutch org.apache.nutch.net.URLFilterChecker

 Finally, send the nutch-site.xml and every configuration
 file you changed.

 Good luck,
 Sebastian


RE: crawling site without www

2012-08-08 Thread Markus Jelsma


-Original message-
 From:Alexei Korolev alexei.koro...@gmail.com
 Sent: Wed 08-Aug-2012 15:43
 To: user@nutch.apache.org
 Subject: Re: crawling site without www
 
 Hi, Sebastian
 
 It seems you are right: I have db.ignore.external.links set to true.
 But how can I configure Nutch to process mobile365.ru and www.mobile365.ru as
 a single site?

You can use the HostURLNormalizer for this task or just crawl the www OR the 
non-www, not both.

 

Re: crawling site without www

2012-08-08 Thread Alexei Korolev
 You can use the HostURLNormalizer for this task or just crawl the www OR
 the non-www, not both.


I'm trying to crawl only the version without www. As I see it, I can remove the
www. prefix using a properly configured regex-normalize.xml.
But will it work if mobile365.ru redirects to www.mobile365.ru (a very
common situation on the web)?
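
A rule along these lines in conf/regex-normalize.xml should do it (a sketch,
assuming the urlnormalizer-regex plugin is active; the pattern is untested):

<regex-normalize>
  <!-- strip a leading www. from the host, keeping the scheme -->
  <regex>
    <pattern>^(https?://)www\.</pattern>
    <substitution>$1</substitution>
  </regex>
</regex-normalize>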

Thanks.

Alexei


RE: crawling site without www

2012-08-08 Thread Markus Jelsma

If it starts to redirect and you are on the wrong side of the redirect, you're 
in trouble. But with the HostNormalizer you can then renormalize all URLs to 
the host that is being redirected to.


Re: crawling site without www

2012-08-08 Thread Alexei Korolev
So I see just one solution for crawling a limited set of sites with
behaviour like mobile365's: limit the scope of sites using
regex-urlfilter.txt with a list like this

+^www.mobile365.ru
+^mobile365.ru

Thanks.


-- 
Alexei A. Korolev


Re: crawling site without www

2012-08-08 Thread Sebastian Nagel
Hi Alexei,

  So I see just one solution for crawling a limited set of sites with
  behaviour like mobile365's: limit the scope of sites using
  regex-urlfilter.txt with a list like this
 
 +^www.mobile365.ru
 +^mobile365.ru

Better:
+^https?://(?:www\.)?mobile365\.ru/
or, to catch all hosts under mobile365.ru:
+^https?://(?:[a-z0-9-]+\.)*mobile365\.ru/

and don't forget to remove the final rule

# accept anything else
+.

and replace it with

# skip everything else
-.

If you have more than a few hosts / domains you want to allow,
the urlfilter-domain plugin would be a more comfortable choice.
There a simple line has the desired effect:
mobile365.ru
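
To use it, activate the plugin and list the domain in the plugin's filter
file, roughly like this (a sketch — the plugin list is abbreviated, and the
file name is the default of the urlfilter.domain.file property):

In nutch-site.xml:
<property>
  <name>plugin.includes</name>
  <!-- add urlfilter-domain to your existing plugin list -->
  <value>protocol-http|urlfilter-domain|parse-(html|tika)|index-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

In conf/domain-urlfilter.txt:
mobile365.ru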


Sebastian



Re: crawling site without www

2012-08-08 Thread Alexei Korolev
Ok. Thank you a lot. I'll try later :)


-- 
Alexei A. Korolev


Re: crawling site without www

2012-08-07 Thread Alexei Korolev
Hello,

Yes, test.com and www.test.com exist.
test.com does not redirect to www.test.com; it opens a page whose outgoing
links carry the www. prefix, like www.test.com/page1 and www.test.com/page2.

First launch of the crawler script

root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
Injector: starting at 2012-08-07 16:00:30
Injector: crawlDb: crawl/crawldb
Injector: urlDir: seed.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-08-07 16:00:32, elapsed: 00:00:02
Generator: starting at 2012-08-07 16:00:33
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/crawldb/segments/20120807160035
Generator: finished at 2012-08-07 16:00:36, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in
'http.robots.agents' property.
Fetcher: starting at 2012-08-07 16:00:37
Fetcher: segment: crawl/crawldb/segments/20120807160035
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
fetching http://test.com
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-08-07 16:00:41, elapsed: 00:00:04
ParseSegment: starting at 2012-08-07 16:00:41
ParseSegment: segment: crawl/crawldb/segments/20120807160035
Parsing: http://test.com
ParseSegment: finished at 2012-08-07 16:00:44, elapsed: 00:00:02
CrawlDb update: starting at 2012-08-07 16:00:44
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-08-07 16:00:45, elapsed: 00:00:01
LinkDb: starting at 2012-08-07 16:00:46
LinkDb: linkdb: crawl/crawldb/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment:
file:/data/nutch/crawl/crawldb/segments/20120807160035
LinkDb: finished at 2012-08-07 16:00:47, elapsed: 00:00:01

Second launch of the script

root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
Injector: starting at 2012-08-07 16:01:30
Injector: crawlDb: crawl/crawldb
Injector: urlDir: seed.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-08-07 16:01:32, elapsed: 00:00:02
Generator: starting at 2012-08-07 16:01:33
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Fetcher: Your 'http.agent.name' value should be listed first in
'http.robots.agents' property.
Fetcher: starting at 2012-08-07 16:01:35
Fetcher: segment: crawl/crawldb/segments/20120807160035
Fetcher: java.io.IOException: Segment already fetched!
at
org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:58)
at
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1204)
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1240)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1213)

ParseSegment: starting at 2012-08-07 16:01:35
ParseSegment: segment: crawl/crawldb/segments/20120807160035
Exception in thread main java.io.IOException: Segment already parsed!
at
org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:87)
at

Re: crawling site without www

2012-08-07 Thread Mathijs Homminga
Hi,

I read from your logs: 
- test.com is injected.
- test.com is fetched and parsed successfully. 
- but when you run a generate again (second launch), no segment is created 
(because no url is selected) and your script tries to fetch and parse the first 
segment again. Hence the errors.

So test.com is fetched successfully. The question remains: why is no url selected 
in the second generate? 
Many answers are possible. Can you tell us which urls you have in your crawldb 
after the first cycle? Perhaps no outlinks have been found / added. 
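
In any case the script should skip the fetch when generate selects nothing; a
minimal guard could look like this (a sketch, using the same segment layout
and $NUTCH_PATH variable as your script):

before=`ls -d crawl/crawldb/segments/* 2>/dev/null | tail -1`
$NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
s1=`ls -d crawl/crawldb/segments/* | tail -1`
# generate creates no new segment when 0 records are selected,
# so the newest segment is still the previously fetched one
if [ "$s1" = "$before" ]; then
  echo "no new segment generated, nothing to fetch"
  exit 0
fi
$NUTCH_PATH fetch $s1
$NUTCH_PATH parse $s1
$NUTCH_PATH updatedb crawl/crawldb $s1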

Mathijs


Re: crawling site without www

2012-08-07 Thread Alexei Korolev
Hi,

I made a simple example.

Put in seed.txt:
http://mobile365.ru

It will produce the error.

Put in seed.txt:
http://www.mobile365.ru

and the second launch of the crawler script will work fine and fetch the
http://www.mobile365.ru/test.html page.


Re: crawling site without www

2012-08-07 Thread Sebastian Nagel
Hi Alexei,

I tried a crawl with your script fragment and Nutch 1.5.1
and the URL http://mobile365.ru as seed. It worked,
see the annotated log below.

Which version of Nutch do you use?

Check the property db.ignore.external.links (default is false).
If true, the link from mobile365.ru to www.mobile365.ru
is skipped.

Look into your crawldb (bin/nutch readdb)

Check your URL filters with
 bin/nutch org.apache.nutch.net.URLFilterChecker
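(it reads URLs from stdin and prints one +/- line per URL; if I remember
the options correctly, -allCombined applies all active filters), e.g.:

% echo http://www.mobile365.ru/ | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined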

Finally, send the nutch-site.xml and every configuration
file you changed.

Good luck,
Sebastian

% nutch inject crawl/crawldb seed.txt
Injector: starting at 2012-08-07 20:31:00
Injector: crawlDb: crawl/crawldb
Injector: urlDir: seed.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-08-07 20:31:15, elapsed: 00:00:15

% nutch generate crawl/crawldb crawl/crawldb/segments -adddays 0
Generator: starting at 2012-08-07 20:31:23
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/crawldb/segments/20120807203131
Generator: finished at 2012-08-07 20:31:39, elapsed: 00:00:15

# Note: personally, I would prefer not to place segments (also linkdb)
#   in the crawldb/ folder.

% s1=`ls -d crawl/crawldb/segments/* | tail -1`

% nutch fetch $s1
Fetcher: starting at 2012-08-07 20:32:00
Fetcher: segment: crawl/crawldb/segments/20120807203131
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
fetching http://mobile365.ru/
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
Fetcher: throughput threshold: -1
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-08-07 20:32:08, elapsed: 00:00:07

% nutch parse $s1
ParseSegment: starting at 2012-08-07 20:32:12
ParseSegment: segment: crawl/crawldb/segments/20120807203131
Parsed (10ms):http://mobile365.ru/
ParseSegment: finished at 2012-08-07 20:32:20, elapsed: 00:00:07

% nutch updatedb crawl/crawldb/ $s1
CrawlDb update: starting at 2012-08-07 20:32:24
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/crawldb/segments/20120807203131]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-08-07 20:32:38, elapsed: 00:00:13

# see whether the outlink is now in crawldb:
% nutch readdb crawl/crawldb/ -stats
CrawlDb statistics start: crawl/crawldb/
Statistics for CrawlDb: crawl/crawldb/
TOTAL urls: 2
retry 0:2
min score:  1.0
avg score:  1.0
max score:  1.0
status 1 (db_unfetched):1
status 2 (db_fetched):  1
CrawlDb statistics: done
# = yes: http://mobile365.ru/ is fetched, outlink found

% nutch generate crawl/crawldb crawl/crawldb/segments -adddays 0
Generator: starting at 2012-08-07 20:32:58
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/crawldb/segments/20120807203307
Generator: finished at 2012-08-07 20:33:14, elapsed: 00:00:15

% s1=`ls -d crawl/crawldb/segments/* | tail -1`

% nutch fetch $s1
Fetcher: starting at 2012-08-07 20:33:34
Fetcher: segment: crawl/crawldb/segments/20120807203307
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
fetching http://www.mobile365.ru/test.html
# got it



Re: crawling site without www

2012-08-04 Thread Lewis John Mcgibbney
http://   ?

hth

On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev alexei.koro...@gmail.com wrote:
 Hello,

 I have small script

 $NUTCH_PATH inject crawl/crawldb seed.txt
 $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0

 s1=`ls -d crawl/crawldb/segments/* | tail -1`
 $NUTCH_PATH fetch $s1
 $NUTCH_PATH parse $s1
 $NUTCH_PATH updatedb crawl/crawldb $s1

 In seed.txt I have just one site, for example test.com. When I start the
 script, it fails at the fetch phase.
 If I change test.com to www.test.com it works fine. It seems the reason is
 that outgoing links on test.com all have the www. prefix.
 What do I need to change in the Nutch config to make it work with test.com?

 Thank you in advance. I hope my explanation is clear :)

 --
 Alexei A. Korolev



-- 
Lewis


Re: crawling site without www

2012-08-04 Thread Mathijs Homminga
What do you mean exactly with "it fails at the fetch phase"?
Do you get an error?
Does test.com exist?
Does it perhaps redirect to www.test.com?
...

Mathijs


On Aug 4, 2012, at 17:11 , Alexei Korolev alexei.koro...@gmail.com wrote:

 yes
 



Re: crawling site without www

2012-08-04 Thread Sebastian Nagel
Hi Alexei,

Because users are lazy, some browsers automatically
try to add the www (and other stuff) to escape from
a "server not found" error; see
http://www-archive.mozilla.org/docs/end-user/domain-guessing.html

Nutch does no domain guessing. The urls have to be correct
and the host name must be complete.
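
So a seed has to be a complete URL including the scheme, e.g. a seed.txt line:

http://test.com/

rather than a bare host name like test.com.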

Finally, even if test.com sends an HTTP redirect pointing
to www.test.com: check whether your URL filters
accept both hosts.
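
In regex-urlfilter.txt a single rule can cover both hosts, e.g. (a sketch):

+^https?://(www\.)?test\.com/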

Sebastian
