Re: Running out of disk space during segment merger

2010-04-09 Thread Yves Petinot

Arkadi,

Thanks a lot for these suggestions. Indeed, skipping the merge step seems
perfectly acceptable, as I have a fairly reasonable number of segments. As
an experiment I also attempted to perform the crawl with a single segment
(which avoids the merge issue entirely), but my cluster does not seem able
to handle the amount of disk space required during the fetch job's reduce
tasks. Somewhat frustrating, as I only have about 1.2M URLs in this crawl.
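
One way around the fetch-side blow-up is to cap the segment size at
generate time and run several smaller generate/fetch/updatedb rounds
instead of one huge segment. A minimal sketch, assuming Nutch 1.0-era
commands; the crawl/... paths, the 300000 cap and the four rounds are
made-up example values, not taken from this thread.

  # Sketch only: several capped segments instead of one big one.
  # Paths, -topN value and round count are illustrative.
  for round in 1 2 3 4; do
    # Generate a fetch list of at most ~300k URLs as a new segment.
    bin/nutch generate crawl/crawldb crawl/segments -topN 300000
    # Pick the segment directory that generate just created (newest one).
    SEGMENT=$(hadoop fs -ls crawl/segments | awk '{print $NF}' |
              grep 'crawl/segments/' | sort | tail -1)
    # Fetch it and fold the results back into the crawldb.
    bin/nutch fetch $SEGMENT
    # (if fetcher.parse is disabled, also run: bin/nutch parse $SEGMENT)
    bin/nutch updatedb crawl/crawldb $SEGMENT
  done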


At any rate, thanks again for your comments!

-yp

arkadi.kosmy...@csiro.au wrote:

There are two solutions:

1. Write a lightweight version of the segment merger, which should not be hard
if you are familiar with Hadoop.
2. Don't merge segments. If you have a reasonable number of segments, even in
the hundreds, Nutch can still handle this.

Regards,

Arkadi
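
A minimal sketch of option 2, assuming the Nutch 1.0-era Lucene indexing
workflow in which the linkdb and indexer jobs already accept a list of
segments, so the merge step can simply be skipped. The crawl/... paths
are illustrative, not the layout used in this thread.

  # Sketch only: index the finished segments directly, without merging.
  # Collect the per-segment directories from HDFS (drops the "Found N items" line).
  SEGS=$(hadoop fs -ls crawl/segments | awk '{print $NF}' | grep 'crawl/segments/')

  # Build the link database over all segments at once.
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments

  # Index every segment in one job; the indexer takes multiple segment arguments.
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $SEGS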

  


--
Yves Petinot
Senior Software Engineer
www.snooth.com
y...@snooth.com




RE: Running out of disk space during segment merger

2010-04-09 Thread Arkadi.Kosmynin
Hi Yves,

I am glad it helped. Wish you success.

Regards,

Arkadi




Re: Running out of disk space during segment merger

2010-03-26 Thread Yves Petinot
Thanks a lot for your reply, Arkadi. It's good to know that this is
indeed a known problem. With the current version, is there anything more
one can do other than throwing a huge amount of temp disk space at each
node? Based on my experience, and given a replication factor of N, would
a rule of thumb of reserving roughly N TB per 1M URLs make sense?


-y
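
On the temp-space question: the allocation that fails in the stack trace
quoted further down is under hadoop_tmp/mapred/local, i.e. the
tasktrackers' mapred.local.dir, not HDFS, so reserving terabytes of HDFS
capacity per million URLs may be sizing the wrong resource (if the ~20 GB
of segment data mentioned below corresponds to the ~1.2M URLs mentioned
elsewhere in the thread, the crawl data itself is only on the order of
15-20 GB per million URLs before replication). A hedged diagnostic
sketch; the hostnames are placeholders, the hadoop_tmp path is the one
from the trace, and the property names are stock Hadoop 0.19/0.20
settings.

  # Sketch only: watch how full the local MapReduce scratch space gets
  # while the merge runs (node1..node3 are placeholder hostnames).
  for host in node1 node2 node3; do
    ssh $host df -h /home/snoothbot/nutch/hadoop_tmp
  done
  # If it fills up, the usual Hadoop 0.19/0.20 knobs are:
  #   mapred.local.dir           - comma-separated list spanning all spare disks
  #   mapred.compress.map.output - true, to shrink intermediate map output
  #   mapred.reduce.tasks        - more reduces, smaller per-task spills
  # (set in conf/hadoop-site.xml or mapred-site.xml on each node; changing
  #  mapred.local.dir requires restarting the tasktrackers)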

arkadi.kosmy...@csiro.au wrote:

Hi Yves,

Yes, what you got is a normal result. This issue is discussed every few
months on this list. To my mind, the segment merger is too general: it assumes
that the segments are at arbitrary stages of completion and works on this
assumption. But this is not a common case at all. Mostly, people just want to
merge finished segments. The algorithm could be much cheaper in this case.

Regards,

Arkadi

-Original Message-
From: Yves Petinot [mailto:y...@snooth.com] 
Sent: Friday, 26 March 2010 6:01 AM

To: nutch-user@lucene.apache.org
Subject: Running out of disk space during segment merger

Hi,

I was wondering whether others on the list have been facing issues with
the segment merge phase, with all the nodes of their Hadoop cluster
eventually running out of disk space? The type of error I'm getting looks
like this:


java.io.IOException: Task: attempt_201003241258_0005_r_01_0 - The reduce 
copier failed
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:375)
at org.apache.hadoop.mapred.Child.main(Child.java:158)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not 
find any valid local directory for 
file:/home/snoothbot/nutch/hadoop_tmp/mapred/local/taskTracker/jobcache/job_201003241258_0005/attempt_201003241258_0005_r_01_0/output/map_115.out
at 
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2384)

FSError: java.io.IOException: No space left on device

To give a little background, I'm currently running this on a 3-node
cluster, each node having 500 GB drives, which are mostly empty at the
beginning of the process (~400 GB available on each node). The
replication factor is set to 2 and I also enabled Hadoop block
compression. Now, the Nutch crawl takes up around 20 GB of disk (with 7
segments to merge, one of them being 9 GB, the others ranging from 1 to
3 GB in size), so intuitively there should be plenty of space available
for the merge operation, but we still end up running out of space during
the reduce phase (7 reduce tasks). I'm currently trying to increase the
number of reduce tasks to limit the resource/disk consumption of any
given task, but I'm wondering whether someone has experienced this type of
issue before and whether there is a better way of approaching it. For
instance, would using the multiple output segments option be useful in
decreasing the amount of temp disk space needed at any given time?


many thanks in advance,

-yp

  




RE: Running out of disk space during segment merger

2010-03-26 Thread Arkadi.Kosmynin
There are two solutions:

1. Write a lightweight version of the segment merger, which should not be hard
if you are familiar with Hadoop.
2. Don't merge segments. If you have a reasonable number of segments, even in
the hundreds, Nutch can still handle this.

Regards,

Arkadi




RE: Running out of disk space during segment merger

2010-03-25 Thread Arkadi.Kosmynin
Hi Yves,

Yes, what you got is a normal result. This issue is discussed every few
months on this list. To my mind, the segment merger is too general: it assumes
that the segments are at arbitrary stages of completion and works on this
assumption. But this is not a common case at all. Mostly, people just want to
merge finished segments. The algorithm could be much cheaper in this case.

Regards,

Arkadi

-Original Message-
From: Yves Petinot [mailto:y...@snooth.com] 
Sent: Friday, 26 March 2010 6:01 AM
To: nutch-user@lucene.apache.org
Subject: Running out of disk space during segment merger

Hi,

I was wondering whether others on the list have been facing issues with
the segment merge phase, with all the nodes of their Hadoop cluster
eventually running out of disk space? The type of error I'm getting looks
like this:

java.io.IOException: Task: attempt_201003241258_0005_r_01_0 - The reduce 
copier failed
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:375)
at org.apache.hadoop.mapred.Child.main(Child.java:158)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not 
find any valid local directory for 
file:/home/snoothbot/nutch/hadoop_tmp/mapred/local/taskTracker/jobcache/job_201003241258_0005/attempt_201003241258_0005_r_01_0/output/map_115.out
at 
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2384)

FSError: java.io.IOException: No space left on device

To give a little background, I'm currently running this on a 3-node
cluster, each node having 500 GB drives, which are mostly empty at the
beginning of the process (~400 GB available on each node). The
replication factor is set to 2 and I also enabled Hadoop block
compression. Now, the Nutch crawl takes up around 20 GB of disk (with 7
segments to merge, one of them being 9 GB, the others ranging from 1 to
3 GB in size), so intuitively there should be plenty of space available
for the merge operation, but we still end up running out of space during
the reduce phase (7 reduce tasks). I'm currently trying to increase the
number of reduce tasks to limit the resource/disk consumption of any
given task, but I'm wondering whether someone has experienced this type of
issue before and whether there is a better way of approaching it. For
instance, would using the multiple output segments option be useful in
decreasing the amount of temp disk space needed at any given time?

many thanks in advance,

-yp
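
On the last question above: the merger does have a slicing mode that
writes several bounded output segments instead of a single large one. A
hedged sketch, assuming the Nutch 1.0-era mergesegs tool and its -dir and
-slice options; the paths and the 500000 slice size are example values.
Since the slicing is applied on the output side, it may not reduce the
shuffle volume of the reduce phase much, but it does bound the size of
each resulting segment.

  # Sketch only: merge into slices of at most ~500k entries per output segment.
  bin/nutch mergesegs crawl/segments_merged -dir crawl/segments -slice 500000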