Re: Two Nutch parallel crawl with two conf folder.

2010-03-09 Thread eks dev
Sorry for the noise... I've mixed up emails.



- Original Message 
> From: eks dev 
> To: nutch-user@lucene.apache.org
> Sent: Tue, 9 March, 2010 18:07:47
> Subject: Re: Two Nutch parallel crawl with two conf folder.
> [...]

Re: Two Nutch parallel crawl with two conf folder.

2010-03-09 Thread eks dev
coool answer



- Original Message 
> From: MilleBii 
> To: nutch-user@lucene.apache.org
> Sent: Tue, 9 March, 2010 8:35:42
> Subject: Re: Two Nutch parallel crawl with two conf folder.
> [...]






Re: Two Nutch parallel crawl with two conf folder.

2010-03-09 Thread Gora Mohanty
On Tue, 9 Mar 2010 14:36:33 +0100
MilleBii  wrote:

> Never tried... Also you may want to check $NUTCH_HOME variable
> which should be different for each instance, otherwise it will
> only use one of the two conf dir.
[...]

Had meant to reply to the original poster, but had forgotten.
We have indeed run multiple instances of Nutch in separate
directories, without any problems.

I presume that you are using the crawl.sh script, or a derivative
of it. If so, as pointed out above, a likely cause of what you
are seeing is that the NUTCH_HOME variable in the script is set
to the same directory, so that the configuration from that directory
is the one picked up.
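
[Illustration: a minimal sketch of this fix - give each instance its own
installation so that the NUTCH_HOME the script resolves, and therefore the
conf directory, differs. The paths below are hypothetical; check how your
crawl.sh actually derives NUTCH_HOME.]

# instance 1, in one shell
export NUTCH_HOME=/home/user/nutch-abc
cd $NUTCH_HOME && bin/nutch crawl urls -dir test1 -depth 1

# instance 2, in another shell
export NUTCH_HOME=/home/user/nutch-xyz
cd $NUTCH_HOME && bin/nutch crawl urls -dir test2 -depth 1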

Regards,
Gora


Re: Two Nutch parallel crawl with two conf folder.

2010-03-09 Thread MilleBii
Never tried... Also you may want to check the $NUTCH_HOME variable, which
should be different for each instance; otherwise it will only use one of the
two conf dirs.

2010/3/9, Pravin Karne :
> [...]

RE: Two Nutch parallel crawl with two conf folder.

2010-03-09 Thread Pravin Karne
Hi Millebii,
Thanks for your valuable inputs.

As per our requirements we need to run multiple Nutch instances, with each
instance pointing to its own conf dir and crawlDB.

crawl-urlfilter.txt is different in the two conf folders. But in our case both
Nutch instances are picking up the same conf dir instead of their own.

So both crawlDBs have the same data. [Actually we have separate filters in the
two confs, so the data in the two crawlDBs should be different.]
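
[Illustration: one quick way to see which conf directory each instance really
resolves is to trace the launcher with the shell. This is only a sketch and
assumes the modified bin/nutch still builds a CLASSPATH before invoking java.]

sh -x bin/nutch --nutch_conf_dir=/home/conf1 crawl urls -dir test1 -depth 1 2>&1 \
  | grep -i classpath | head -5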

Have you tried such a scenario?


Thanks
-Pravin



-Original Message-
From: MilleBii [mailto:mille...@gmail.com] 
Sent: Tuesday, March 09, 2010 1:06 PM
To: nutch-user@lucene.apache.org
Subject: Re: Two Nutch parallel crawl with two conf folder.

[...]

Re: Two Nutch parallel crawl with two conf folder.

2010-03-08 Thread MilleBii
Yes, it should work. I personally run some test crawls on the same hardware,
even from the same Nutch directory, thus sharing the conf directory.
But if you don't want that, I would use two Nutch directories and of course
two different crawl directories, because with Hadoop they will end up on the
same HDFS (assuming you run in distributed or pseudo-distributed mode).
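
[Illustration: a small sketch of the above, assuming two independent Nutch
directories and pseudo-distributed Hadoop. Both -dir arguments resolve to
paths on the same HDFS, so they have to differ; the directory names are
hypothetical.]

cd /home/user/nutch-abc && bin/nutch crawl urls -dir crawl-abc -depth 1
cd /home/user/nutch-xyz && bin/nutch crawl urls -dir crawl-xyz -depth 1
# both crawl directories now sit side by side on the shared HDFS
hadoop fs -ls /user/$USER/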

2010/3/9, Pravin Karne :
> [...]


-- 
-MilleBii-


RE: Two Nutch parallel crawl with two conf folder.

2010-03-08 Thread Pravin Karne

Can we share a Hadoop cluster between two Nutch instances?
So there will be two Nutch instances, and they will point to the same Hadoop cluster.

This way I am able to share my hardware bandwidth. I know that Hadoop in
distributed mode serializes jobs, but that will not affect my flow. I just want
to share my hardware resources.
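
[Illustration: a sketch of what sharing one cluster could look like, assuming
both Nutch directories carry a hadoop-site.xml pointing at the same NameNode
and JobTracker. Hostnames, ports and paths are hypothetical.]

# keep one copy of the cluster settings and point both installs at it
cat > /tmp/shared-hadoop-site.xml <<'EOF'
<configuration>
  <property><name>fs.default.name</name><value>hdfs://master:9000</value></property>
  <property><name>mapred.job.tracker</name><value>master:9001</value></property>
</configuration>
EOF
cp /tmp/shared-hadoop-site.xml /home/user/nutch-abc/conf/hadoop-site.xml
cp /tmp/shared-hadoop-site.xml /home/user/nutch-xyz/conf/hadoop-site.xml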

I tried with two Nutch setups, but somehow the second instance is overriding the
first one's configuration.


Any pointers ?

Thanks
-Pravin
 

-Original Message-
From: MilleBii [mailto:mille...@gmail.com] 
Sent: Monday, March 08, 2010 8:02 PM
To: nutch-user@lucene.apache.org
Subject: Re: Two Nutch parallel crawl with two conf folder.

[...]



Re: Two Nutch parallel crawl with two conf folder.

2010-03-08 Thread MilleBii
How parallel is parallel in your case?
Don't forget that Hadoop in distributed mode will serialize your jobs anyhow.

For the rest, why don't you create two Nutch directories and run things
totally independently?


2010/3/8, Pravin Karne :
> [...]


-- 
-MilleBii-


RE: Two Nutch parallel crawl with two conf folder.

2010-03-08 Thread Pravin Karne
Hi guys, any pointers on the following?
Your help will be highly appreciated.

Thanks
-Pravin

-Original Message-
From: Pravin Karne
Sent: Friday, March 05, 2010 12:57 PM
To: nutch-user@lucene.apache.org
Subject: Two Nutch parallel crawl with two conf folder.

Hi,

I want to run two parallel Nutch crawls with two conf folders.

I am using the crawl command to do this. I have two separate conf folders; all
files in conf are the same except crawl-urlfilter.txt. In this file we have
different filters (domain filters).

 e.g. the 1st conf has:
 +.^http://([a-z0-9]*\.)*abc.com/

    the 2nd conf has:
 +.^http://([a-z0-9]*\.)*xyz.com/
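
[Illustration: a sketch of what each conf's crawl-urlfilter.txt might look like
as a whole. Only the accept rule comes from this message; the trailing
catch-all is assumed from the stock template.]

cat > /home/conf1/crawl-urlfilter.txt <<'EOF'
# accept abc.com and its subdomains (rule as above)
+.^http://([a-z0-9]*\.)*abc.com/
# skip everything else (assumed from the default template)
-.
EOF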


I am starting the two crawls with the above configurations, each on a separate
console (one followed by the other).

I am using the following crawl commands:

  bin/nutch --nutch_conf_dir=/home/conf1 crawl urls -dir test1 -depth 1

  bin/nutch --nutch_conf_dir=/home/conf2 crawl urls -dir test2 -depth 1

[Note: We have modified nutch.sh for '--nutch_conf_dir']
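
[Illustration: the exact nutch.sh change is not shown here, so below is a
hypothetical sketch of how such a flag could be handled. The key point is that
the chosen directory must end up first on the classpath, ahead of
$NUTCH_HOME/conf, otherwise the default conf still wins.]

# near the top of bin/nutch (hypothetical)
case "$1" in
  --nutch_conf_dir=*)
    NUTCH_CONF_DIR="${1#--nutch_conf_dir=}"
    shift
    ;;
esac
NUTCH_CONF_DIR="${NUTCH_CONF_DIR:-$NUTCH_HOME/conf}"
CLASSPATH="$NUTCH_CONF_DIR"   # the conf dir must come first on the classpath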

The urls file has the following entries:

http://www.abc.com
http://www.xyz.com
http://www.pqr.com


Expected Result:

 CrawlDB test1 should contain abc.com's data and CrawlDB test2 should contain
xyz.com's data.

Actual Results:

  The URL filter of the first run is overridden by the URL filter of the second run.

  So both CrawlDBs have xyz.com's data.


Please provide pointers regarding this.

Thanks in advance.

-Pravin

