Re: Nutch segment merge is very slow

2010-04-06 Thread MilleBii
I never merge segments on my single box: it is too slow and, worse, it
always ends up filling the disk.
It does mean you have to ditch older segments after a while, though, so you
need to keep segments at least younger than the maximum recrawl interval.
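
Something along these lines would do it; the crawl/segments path and the
30-day cutoff below are only placeholders, to be matched to your own layout
and recrawl interval:

  # Sketch: delete segment directories older than the assumed maximum
  # recrawl interval (30 days here). Segment names are timestamps, but
  # directory mtime is a rough, workable proxy on a single box.
  find crawl/segments -mindepth 1 -maxdepth 1 -type d -mtime +30 -print0 \
    | xargs -0 -r rm -rf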

2010/4/6, arkadi.kosmy...@csiro.au:
 Hi,

 -Original Message-
 From: Susam Pal [mailto:susam@gmail.com]
 Sent: Tuesday, 6 April 2010 12:18 AM
 To: nutch-user@lucene.apache.org
 Subject: Re: Nutch segment merge is very slow

 On Mon, Apr 5, 2010 at 5:27 PM, ashokkumar.raveendi...@wipro.com
 wrote:

  Hi
 
  I'm using the Nutch crawler in my project and have crawled more than 2GB
  of data using the Nutch runbot script. Up to 2GB, the segment merge
  finished within 24 hrs, but now it takes more than 48 hrs and is still
  running. I have set depth to 16 and topN to 2500. I want to run the
  crawler every day, as per my requirement.
 
 
 
  How can I speed up the segment merge and index process?
 
 
 
  Regards
 
  Ashokkumar.R
 
 
 Hi,

 From my experience of running Nutch on a single box to crawl a corporate
 intranet with a depth as high as 16 and a topN value greater than 1000, I
 feel it isn't feasible to have one crawl per day.

 That is, if you consider your site a monolithic object and try to recrawl
 the whole site each time. Normally, web sites are not homogeneous. To keep
 your index up to date, you only have to regularly recrawl the parts that
 change fast, typically 1% to 10% of the site, and refresh the other parts
 perhaps once every few months.


 One of these options might help you.

 1. Run Nutch on a Hadoop cluster to distribute the job and speed up
 processing.

 Then you have to run your web server on a cluster as well, because it will
 have to serve all the content of your site every day, in addition to serving
 other clients.


 2. Reduce the crawl depth to about 7, 8, or whatever works for you. This
 means you wouldn't be crawling links discovered in the crawl performed at
 depth 8. This may be a good or a bad thing for you depending on whether you
 want to crawl URLs found so deep in the crawl. These URLs may be obscure and
 less important because they are so many hops away from your seed URLs.

 Losing quality. Can that be a good thing?

 3. However, if the URLs found very deep are also important and you want to
 crawl them, you might have to sacrifice low-ranking URLs by setting a
 smaller topN value, say, 1000, or whatever works for you.

 Losing quality in a different way. How do you calculate ranks? Link-based
 methods do not work nearly as well on intranets as on the global Web.


 Regards,
 Susam Pal

 Regards,

 Arkadi Kosmynin
 CSIRO Astronomy and Space Science



-- 
-MilleBii-


Re: Nutch segment merge is very slow

2010-04-05 Thread Susam Pal
On Mon, Apr 5, 2010 at 5:27 PM, ashokkumar.raveendi...@wipro.com wrote:

 Hi

 I'm using the Nutch crawler in my project and have crawled more than 2GB of
 data using the Nutch runbot script. Up to 2GB, the segment merge finished
 within 24 hrs, but now it takes more than 48 hrs and is still running. I
 have set depth to 16 and topN to 2500. I want to run the crawler every day,
 as per my requirement.



 How can I speed up the segment merge and index process?



 Regards

 Ashokkumar.R


Hi,

From my experience of running Nutch on a single box to crawl a corporate
intranet with a depth as high as 16 and a topN value greater than 1000, I
feel it isn't feasible to have one crawl per day.

One of these options might help you.

1. Run Nutch on a Hadoop cluster to distribute the job and speed up
processing.

2. Reduce the crawl depth to about 7, 8, or whatever works for you. This
means you wouldn't be crawling links discovered in the crawl performed at
depth 8. This may be a good or a bad thing for you depending on whether you
want to crawl URLs found so deep in the crawl. These URLs may be obscure and
less important because they are so many hops away from your seed URLs.
3. However, if the URLs found very deep are also important and you want to
crawl them, you might have to sacrifice low-ranking URLs by setting a
smaller topN value, say, 1000, or whatever works for you.
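
For example, with the one-shot crawl command in Nutch 1.x, the reduced depth
and topN from options 2 and 3 would be passed roughly like this (the urls/
seed directory and crawl/ output directory are just placeholder names):

  # Sketch: single crawl run with a reduced depth and topN
  bin/nutch crawl urls -dir crawl -depth 8 -topN 1000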

Regards,
Susam Pal


RE: Nutch segment merge is very slow

2010-04-05 Thread ashokkumar.raveendiran
Hi,
Thank you for your suggestion. I have around 500+ internet URLs
configured for crawling, and the crawl process is running in the Amazon
cloud. I have already reduced my depth to 8 and topN to 1000, increased
the fetcher threads to 150, and limited the crawl to 50 URLs per host
using the generate.max.per.host property. With this configuration,
Generate, Fetch, Parse, and Update complete in at most 10 hrs. When it
comes to the segment merge, it takes a lot of time. As a temporary
solution I am not doing the segment merge and am directly indexing the
fetched segments. With this solution I am able to finish the crawl
process within 24 hrs. Now I am looking for a long-term solution to
optimize the segment merge process.
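
For reference, the settings described above would normally be overridden in
conf/nutch-site.xml, roughly as follows; the values simply mirror the
configuration above, and the property descriptions should be checked against
nutch-default.xml for your Nutch version:

  <!-- Overrides mirroring the configuration described above -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>150</value>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>50</value>
  </property>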

Regards
Ashokkumar.R

-Original Message-
From: Susam Pal [mailto:susam@gmail.com]
Sent: Monday, April 05, 2010 7:48 PM
To: nutch-user@lucene.apache.org
Subject: Re: Nutch segment merge is very slow

On Mon, Apr 5, 2010 at 5:27 PM, ashokkumar.raveendi...@wipro.com
wrote:

 Hi

 I'm using the Nutch crawler in my project and have crawled more than 2GB
 of data using the Nutch runbot script. Up to 2GB, the segment merge
 finished within 24 hrs, but now it takes more than 48 hrs and is still
 running. I have set depth to 16 and topN to 2500. I want to run the
 crawler every day, as per my requirement.



 How can I speed up the segment merge and index process?



 Regards

 Ashokkumar.R


Hi,

From my experience of running Nutch on a single box to crawl a corporate
intranet with a depth as high as 16 and a topN value greater than 1000, I
feel it isn't feasible to have one crawl per day.

One of these options might help you.

1. Run Nutch on a Hadoop cluster to distribute the job and speed up
processing.

2. Reduce the crawl depth to about 7, 8, or whatever works for you. This
means you wouldn't be crawling links discovered in the crawl performed at
depth 8. This may be a good or a bad thing for you depending on whether you
want to crawl URLs found so deep in the crawl. These URLs may be obscure and
less important because they are so many hops away from your seed URLs.
3. However, if the URLs found very deep are also important and you want to
crawl them, you might have to sacrifice low-ranking URLs by setting a
smaller topN value, say, 1000, or whatever works for you.

Regards,
Susam Pal



Re: Nutch segment merge is very slow

2010-04-05 Thread Andrzej Bialecki
On 2010-04-05 16:54, ashokkumar.raveendi...@wipro.com wrote:
 Hi,
   Thank you for your suggestion. I have around 500+ internet URLs
 configured for crawling, and the crawl process is running in the Amazon
 cloud. I have already reduced my depth to 8 and topN to 1000, increased
 the fetcher threads to 150, and limited the crawl to 50 URLs per host
 using the generate.max.per.host property. With this configuration,
 Generate, Fetch, Parse, and Update complete in at most 10 hrs. When it
 comes to the segment merge, it takes a lot of time. As a temporary
 solution I am not doing the segment merge and am directly indexing the
 fetched segments. With this solution I am able to finish the crawl
 process within 24 hrs. Now I am looking for a long-term solution to
 optimize the segment merge process.

Segment merging is not strictly necessary unless you have a hundred
segments or so. If this step takes too much time and the number of
segments is still well below a hundred, just don't merge them.
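
As a rough sketch of the two paths with the Nutch 1.x command line (the
crawl/ paths are placeholders, and the exact indexer arguments can differ
between versions):

  # Index the fetched segments directly, skipping the merge step
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

  # Or, only if the number of segments grows large, merge first and
  # index the merged output instead
  bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/MERGEDsegments/*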


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: Nutch segment merge is very slow

2010-04-05 Thread Arkadi.Kosmynin
Hi,

 -Original Message-
 From: Susam Pal [mailto:susam@gmail.com]
 Sent: Tuesday, 6 April 2010 12:18 AM
 To: nutch-user@lucene.apache.org
 Subject: Re: Nutch segment merge is very slow
 
 On Mon, Apr 5, 2010 at 5:27 PM, ashokkumar.raveendi...@wipro.com
 wrote:
 
  Hi
 
  I'm using the Nutch crawler in my project and have crawled more than 2GB
  of data using the Nutch runbot script. Up to 2GB, the segment merge
  finished within 24 hrs, but now it takes more than 48 hrs and is still
  running. I have set depth to 16 and topN to 2500. I want to run the
  crawler every day, as per my requirement.
 
 
 
  How can I speed up the segment merge and index process?
 
 
 
  Regards
 
  Ashokkumar.R
 
 
 Hi,
 
 From my experience of running Nutch on a single box to crawl a corporate
 intranet with a depth as high as 16 and a topN value greater than 1000, I
 feel it isn't feasible to have one crawl per day.

That is, if you consider your site a monolithic object and try to recrawl the
whole site each time. Normally, web sites are not homogeneous. To keep your
index up to date, you only have to regularly recrawl the parts that change
fast, typically 1% to 10% of the site, and refresh the other parts perhaps
once every few months.
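
One way to approximate this in Nutch 1.x is the adaptive fetch schedule,
which shortens the recrawl interval for pages that change often and
stretches it for pages that don't. A sketch of the nutch-site.xml overrides,
with purely illustrative interval values (check the exact property names and
defaults in nutch-default.xml for your version):

  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
  <property>
    <!-- fast-changing pages: recrawl down to roughly once a day (seconds) -->
    <name>db.fetch.schedule.adaptive.min_interval</name>
    <value>86400</value>
  </property>
  <property>
    <!-- stable pages: recrawl at most every ~90 days (seconds) -->
    <name>db.fetch.schedule.adaptive.max_interval</name>
    <value>7776000</value>
  </property>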

 
 One of these options might help you.
 
 1. Run Nutch on a Hadoop cluster to distribute the job and speed up
 processing.

Then you have to run your web server on a cluster as well, because it will have
to serve all the content of your site every day, in addition to serving other
clients.

 
 2. Reduce the crawl depth to about 7, 8, or whatever works for you. This
 means you wouldn't be crawling links discovered in the crawl performed at
 depth 8. This may be a good or a bad thing for you depending on whether you
 want to crawl URLs found so deep in the crawl. These URLs may be obscure and
 less important because they are so many hops away from your seed URLs.

Losing quality. Can that be a good thing?

 3. However, if the URLs found very deep are also important and you want to
 crawl them, you might have to sacrifice low-ranking URLs by setting a
 smaller topN value, say, 1000, or whatever works for you.

Losing quality in a different way. How do you calculate ranks? Link-based
methods do not work nearly as well on intranets as on the global Web.

 
 Regards,
 Susam Pal

Regards,

Arkadi Kosmynin
CSIRO Astronomy and Space Science