RE: Two Nutch parallel crawl with two conf folder.

2010-03-08 Thread Pravin Karne
Hi guys, any pointers on the following?
Your help will be highly appreciated.

Thanks
-Pravin

-Original Message-
From: Pravin Karne
Sent: Friday, March 05, 2010 12:57 PM
To: nutch-user@lucene.apache.org
Subject: Two Nutch parallel crawl with two conf folder.

Hi,

I want to run two parallel Nutch crawls with two conf folders.

I am using the crawl command to do this. I have two separate conf folders; all
files in conf are the same except crawl-urlfilter.txt, which contains
different domain filters.

 e.g. the 1st conf has:
 +.^http://([a-z0-9]*\.)*abc.com/

 and the 2nd conf has:
 +.^http://([a-z0-9]*\.)*xyz.com/


I am starting the two crawls with the above configurations on separate
consoles (one right after the other).

I am using the following crawl commands:

  bin/nutch --nutch_conf_dir=/home/conf1 crawl urls -dir test1 -depth 1

  bin/nutch --nutch_conf_dir=/home/conf2 crawl urls -dir test2 -depth 1

[Note: We have modified nutch.sh to accept '--nutch_conf_dir'.]
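
A minimal sketch of the same two runs using the NUTCH_CONF_DIR environment
variable instead of a patched nutch.sh (this assumes the stock bin/nutch of
this era reads NUTCH_CONF_DIR near the top of the script; please verify in
your own copy):

  # sketch only: relies on bin/nutch honouring NUTCH_CONF_DIR
  NUTCH_CONF_DIR=/home/conf1 bin/nutch crawl urls -dir test1 -depth 1
  NUTCH_CONF_DIR=/home/conf2 bin/nutch crawl urls -dir test2 -depth 1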

The urls file has the following entries:

http://www.abc.com
http://www.xyz.com
http://www.pqr.com


Expected Result:

 CrawlDB test1 should contain abc.com's data and CrawlDB test2 should
contain xyz.com's data.

Actual Results:

  The URL filter of the first run is overridden by the URL filter of the second run.

  So both CrawlDBs contain xyz.com's data.


Please provide pointers regarding this.

Thanks in advance.

-Pravin




RE: Content of redirected urls empty

2010-03-08 Thread BELLINI ADAM


Any ideas, guys?


 From: mbel...@msn.com
 To: nutch-user@lucene.apache.org
 Subject: Content of redirected urls empty
 Date: Fri, 5 Mar 2010 22:01:05 +
 
 
 
 Hi,
 the content of my redirected URLs is empty, but the other metadata is still there.
 I have an HTTP URL that is redirected to HTTPS.
 In my index I find the HTTP URL, but with empty content.
 Could you explain this, please?
 

Re: Two Nutch parallel crawl with two conf folder.

2010-03-08 Thread MilleBii
How parallel is parallel in your case?
Don't forget that Hadoop in distributed mode will serialize your jobs anyhow.

For the rest, why don't you create two Nutch directories and run things
totally independently?
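
A minimal sketch of that setup (the install and output paths below are only
illustrative, and the release directory name is an assumption):

  # two complete, independent copies of the Nutch install
  cp -r apache-nutch-1.0 /opt/nutch-abc
  cp -r apache-nutch-1.0 /opt/nutch-xyz
  # edit conf/crawl-urlfilter.txt separately in each copy, then:
  (cd /opt/nutch-abc && bin/nutch crawl urls -dir crawl-abc -depth 1)
  (cd /opt/nutch-xyz && bin/nutch crawl urls -dir crawl-xyz -depth 1)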


2010/3/8, Pravin Karne pravin_ka...@persistent.co.in:
 [...]



-- 
-MilleBii-


Re: Content of redirected urls empty

2010-03-08 Thread Andrzej Bialecki

On 2010-03-08 14:55, BELLINI ADAM wrote:



 [...]


There are two ways to redirect - one is with the protocol, and the other is
with content (either a meta refresh or JavaScript).


When you dump the segment, is there really no content for the redirected
URL?
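
For reference, a segment can be dumped with the readseg tool (the segment
path below is just a placeholder):

  bin/nutch readseg -dump crawl/segments/20100308123456 seg_dump
  # the text output lands in seg_dump/ (typically a file named "dump");
  # search it for the URL in question and check its Content section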



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: Content of redirected urls empty

2010-03-08 Thread BELLINI ADAM



Hi, I've just dumped my segments and found that I have both URLs: the original
one (HTTP) with empty content, and the redirect destination URL (HTTPS) with
non-empty content!

But in my search I find only the HTTPS URL, with empty content, even though
logically the content of the HTTPS URL is not empty!
It is just mixing the HTTPS URL with the content of the HTTP one.


Our redirect is done in Java code with response.sendRedirect(…), so it seems to
be an HTTP redirect, right?

Thanks for helping me :)
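
One quick way to double-check that it really is a protocol-level redirect
rather than a meta refresh or JavaScript one (the URL below is just a
placeholder):

  curl -sI http://www.example.com/page
  # an "HTTP/1.1 301" or "302" status line plus a "Location: https://..."
  # header means a protocol redirect; a 200 with the redirect buried in the
  # HTML body would be a content redirect instead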


 Date: Mon, 8 Mar 2010 15:51:34 +0100
 From: a...@getopt.org
 To: nutch-user@lucene.apache.org
 Subject: Re: Content of redirected urls empty
 
 [...]

RE: Content of redirected urls empty

2010-03-08 Thread BELLINI ADAM

I'm sorry, I just checked twice: in my index I have the original URL, which is
the HTTP one with the empty content, but it doesn't index the HTTPS one. And I
am using the Solr index.
Thanks



 From: mbel...@msn.com
 To: nutch-user@lucene.apache.org
 Subject: RE: Content of redirected urls empty
 Date: Mon, 8 Mar 2010 17:01:34 +
 
 
 
 
 [...]
  

RE: By Indexing I get: OutOfMemoryError: GC overhead limit exceeded ...

2010-03-08 Thread Patricio Galeas
Hello Ted,

I ran 'ps aux' and confirmed that only 1 GB was defined.
I adjusted NUTCH_HEAPSIZE to 8 GB (the physical RAM) and ran it again
successfully.

Do you know which parameters need to be adjusted if not enough physical RAM is
available on the server, for example 2 GB of RAM?
I ran a web crawl (depth=6) without the topN parameter and the segments grew
exponentially.
Later, I had a lot of problems merging the segments and indexing
(not enough memory, too many open files, etc.).

Thank you for your help
Pato
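
A hedged sketch of the knobs usually involved on a small (say 2 GB) box; the
numbers are illustrative, not tuned values from this thread:

  export NUTCH_HEAPSIZE=1200    # MB for the client JVM, read by bin/nutch
  export HADOOP_HEAPSIZE=1200   # MB for Hadoop daemons, read by bin/hadoop
  ulimit -n 16384               # raise the open-file limit hit while merging/indexing
  # keep segments small instead of letting them grow with every depth level
  bin/nutch crawl urls -dir crawl -depth 6 -topN 5000
  # in distributed mode, mapred.child.java.opts in hadoop-site.xml caps the per-task heap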



----- Original Message -----
From: Ted Yu yuzhih...@gmail.com
To: nutch-user@lucene.apache.org
Sent: Saturday, March 6, 2010, 15:42:38
Subject: Re: By Indexing I get: OutOfMemoryError: GC overhead limit exceeded
...

Can you use 'ps aux' to find out the -Xmx command-line parameter passed to
java for the following action?
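
For example, a one-liner sketch (not from the original thread) to show just
the -Xmx argument of the running java processes:

  # [j]ava keeps grep from matching its own process; tr puts one argument per line
  ps aux | grep '[j]ava' | tr -s ' ' '\n' | grep -- '-Xmx'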

On Fri, Mar 5, 2010 at 1:14 PM, Patricio Galeas pgal...@yahoo.de wrote:

 Hello all,
 I am running Nutch in a virtual machine (Debian) with 8 GB RAM and 1.5 TB
 for the Hadoop temporary folder.
 Running the index process with a 1.3 GB segments folder, I got
  OutOfMemoryError: GC overhead limit exceeded  (see below).

 I created the segments using slice=5
 and I also set HADOOP_HEAPSIZE to the maximum physical memory (8000).

 Do I need more memory to run the index process?
 Are there any limitations on running Nutch in a virtual machine?

 Thank you!
 Pato

 ...
 ...
 2010-03-05 19:52:13,864 INFO  plugin.PluginRepository - Nutch
 Scoring (org.apache.nutch.scoring.ScoringFilter)
 2010-03-05 19:52:13,864 INFO  plugin.PluginRepository - Ontology
 Model Loader (org.apache.nutch.ontology.Ontology)
 2010-03-05 19:52:13,867 INFO  lang.LanguageIdentifier - Language identifier
 configuration [1-4/2048]
 2010-03-05 19:52:22,961 INFO  lang.LanguageIdentifier - Language identifier
 plugin supports: it(1000) is(1000) hu(1000) th(1000) sv(1000) sq(1000)
 fr(1000) ru(1000) fi(1000) es(1000) en(1000) el(1000) ee(1000) pt(1000)
 de(1000) da(1000) pl(1000) no(1000) nl(1000)
 2010-03-05 19:52:22,961 INFO  indexer.IndexingFilters - Adding
 org.apache.nutch.analysis.lang.LanguageIndexingFilter
 2010-03-05 19:52:22,963 INFO  indexer.IndexingFilters - Adding
 org.apache.nutch.indexer.basic.BasicIndexingFilter
 2010-03-05 19:52:22,964 INFO  indexer.IndexingFilters - Adding
 org.apache.nutch.indexer.anchor.AnchorIndexingFilter
 2010-03-05 19:52:36,278 WARN  mapred.LocalJobRunner - job_local_0001
 java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.nio.ByteBuffer.allocate(ByteBuffer.java:312)
at java.nio.charset.CharsetEncoder.encode(CharsetEncoder.java:775)
at org.apache.hadoop.io.Text.encode(Text.java:388)
at org.apache.hadoop.io.Text.encode(Text.java:369)
at org.apache.hadoop.io.Text.writeString(Text.java:409)
at org.apache.nutch.parse.Outlink.write(Outlink.java:52)
at org.apache.nutch.parse.ParseData.write(ParseData.java:152)
at
 org.apache.hadoop.io.GenericWritable.write(GenericWritable.java:135)
at
 org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
at
 org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
at
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:613)
at
 org.apache.nutch.indexer.IndexerMapReduce.map(IndexerMapReduce.java:67)
at
 org.apache.nutch.indexer.IndexerMapReduce.map(IndexerMapReduce.java:50)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
 2010-03-05 19:52:37,277 FATAL indexer.Indexer - Indexer:
 java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:72)
at org.apache.nutch.indexer.Indexer.run(Indexer.java:92)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.Indexer.main(Indexer.java:101)






RE: Two Nutch parallel crawl with two conf folder.

2010-03-08 Thread Pravin Karne

Can we share a Hadoop cluster between two Nutch instances?
So there would be two Nutch instances, and they would point to the same Hadoop cluster.

This way I am able to share my hardware bandwidth. I know that Hadoop in
distributed mode serializes jobs, but that will not affect my flow; I just want
to share my hardware resources.

I tried with two Nutch setups, but somehow the second instance is overriding the
first one's configuration.


Any pointers?

Thanks
-Pravin
 

-Original Message-
From: MilleBii [mailto:mille...@gmail.com] 
Sent: Monday, March 08, 2010 8:02 PM
To: nutch-user@lucene.apache.org
Subject: Re: Two Nutch parallel crawl with two conf folder.

[...]



Re: Two Nutch parallel crawl with two conf folder.

2010-03-08 Thread MilleBii
Yes, it should work. I personally run some test crawls on the same
hardware, even in the same Nutch directory, so I share the conf
directory.
But if you don't want that, I would use two Nutch directories and of
course two different crawl directories, because with Hadoop they will
end up on the same hdfs: (assuming you run in distributed or pseudo-distributed mode).
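
A minimal sketch of that layout, assuming two separate install directories as
suggested earlier in the thread and purely illustrative output paths on the
shared cluster:

  # both installs point at the same cluster, but each run writes its CrawlDB
  # and segments under its own HDFS directory
  (cd /opt/nutch-abc && bin/nutch crawl urls -dir crawls/abc -depth 1)
  (cd /opt/nutch-xyz && bin/nutch crawl urls -dir crawls/xyz -depth 1)
  $HADOOP_HOME/bin/hadoop fs -ls crawls/   # both outputs, side by side on the cluster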

2010/3/9, Pravin Karne pravin_ka...@persistent.co.in:

 [...]



-- 
-MilleBii-