Re: Running out of disk space during segment merger
Arkadi, thanks a lot for these suggestions. Skipping the merge step does indeed seem perfectly acceptable, as I have a fairly reasonable number of segments. As an experiment I also attempted to perform my crawl with a single segment (which avoids the question of merging altogether), but my cluster doesn't seem able to handle the amount of disk space required during the fetch's reduce tasks. Somewhat frustrating, as I only have about 1.2M URLs in this crawl. At any rate, thanks again for your comments!

-yp

arkadi.kosmy...@csiro.au wrote: [...]

--
Yves Petinot
Senior Software Engineer
www.snooth.com
y...@snooth.com
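[The reduce-task count and local spill directories discussed above are plain Hadoop job settings. A sketch of the relevant hadoop-site.xml properties for Hadoop of this vintage (0.19/0.20 property names); the values are purely illustrative, not recommendations:]

```xml
<!-- hadoop-site.xml (Hadoop 0.19/0.20 property names; values illustrative) -->
<property>
  <name>mapred.reduce.tasks</name>
  <!-- more, smaller reduces means less temp space held by any one task -->
  <value>14</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <!-- spreading shuffle spill space over several disks raises the ceiling;
       with a single 500GB drive per node, that one disk is the limit -->
  <value>/disk1/mapred/local,/disk2/mapred/local</value>
</property>
```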
RE: Running out of disk space during segment merger
Hi Yves,

I am glad it helped. Wish you success.

Regards,
Arkadi

-----Original Message-----
From: Yves Petinot [mailto:y...@snooth.com]
Sent: Saturday, 10 April 2010 12:56 AM
To: nutch-user@lucene.apache.org
Subject: Re: Running out of disk space during segment merger

[...]
Re: Running out of disk space during segment merger
Thanks a lot for your reply, Arkadi. It's good to know that this is indeed a known problem. With the current version, is there anything one can do other than throwing a huge amount of temp disk space at each node? Based on my experience, and given a replication factor of N, would a rule of thumb of reserving roughly N TB per 1M URLs make sense?

-y

arkadi.kosmy...@csiro.au wrote: [...]
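[On the rule-of-thumb question above, a quick back-of-the-envelope sketch may help frame it. The per-million-URL segment size and the number of transient copies the shuffle holds are assumptions for illustration, not measured constants:]

```python
# Rough, illustrative estimate of transient disk needed during a merge.
# Assumptions (not measured): ~20 GB of segment data per 1M URLs, and the
# sort/shuffle holding roughly 3 transient copies of that data at its peak.
def merge_temp_space_gb(urls_millions, replication, gb_per_million=20.0, copies=3):
    segment_gb = urls_millions * gb_per_million
    return segment_gb * replication * copies

# 1.2M URLs at replication factor 2:
print(merge_temp_space_gb(1.2, 2))  # -> 144.0
```

[This is only a sanity-check device; the real peak depends on how the reduce-side merge spills map output onto mapred.local.dir.]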
RE: Running out of disk space during segment merger
There are two solutions:

1. Write a lightweight version of the segment merger, which should not be hard if you are familiar with Hadoop.
2. Don't merge segments. If you have a reasonable number of segments, even in the 100s, Nutch can still handle this.

Regards,
Arkadi

-----Original Message-----
From: Yves Petinot [mailto:y...@snooth.com]
Sent: Saturday, 27 March 2010 2:21 AM
To: nutch-user@lucene.apache.org
Subject: Re: Running out of disk space during segment merger

[...]
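[For option 2, skipping the merge, the finished segments can simply all be handed to the indexer. A sketch for a Nutch 1.x layout, with illustrative paths under crawl/:]

```shell
# Index every finished segment directly; no merge step.
# Nutch 1.x Indexer usage: index <index> <crawldb> <linkdb> <segment> ...
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
```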
RE: Running out of disk space during segment merger
Hi Yves,

Yes, what you got is a normal result. This issue is discussed every few months on this list. To my mind, the segment merger is too general: it assumes that the segments are at arbitrary stages of completion and works on that assumption. But this is not a common case at all; mostly, people just want to merge finished segments. The algorithm could be much cheaper in this case.

Regards,
Arkadi

-----Original Message-----
From: Yves Petinot [mailto:y...@snooth.com]
Sent: Friday, 26 March 2010 6:01 AM
To: nutch-user@lucene.apache.org
Subject: Running out of disk space during segment merger

Hi,

I was wondering if some people on the list have been facing issues with the segment merge phase, with all the nodes on their Hadoop cluster eventually running out of disk space? The type of errors I'm getting looks like this:

java.io.IOException: Task: attempt_201003241258_0005_r_01_0 - The reduce copier failed
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:375)
	at org.apache.hadoop.mapred.Child.main(Child.java:158)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for file:/home/snoothbot/nutch/hadoop_tmp/mapred/local/taskTracker/jobcache/job_201003241258_0005/attempt_201003241258_0005_r_01_0/output/map_115.out
	at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
	at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2384)
FSError: java.io.IOException: No space left on device

To give a little background, I'm currently running this on a 3-node cluster, each node having 500GB drives which are mostly empty at the beginning of the process (~400 GB available on each node). The replication factor is set to 2 and I also enabled Hadoop block compression. Now, the Nutch crawl takes up around 20 GB of disk (with 7 segments to merge, one of them being 9 GB, the others ranging from 1 to 3 GB in size), so intuitively there should be plenty of space available for the merge operation, but we still end up running out of space during the reduce phase (7 reduce tasks).

I'm currently trying to increase the number of reduce tasks to limit the resource/disk consumption of any given task, but I'm wondering if someone has experienced this type of issue before and whether there is a better way of approaching it? For instance, would using the multiple output segments option be useful in decreasing the amount of temp disk space needed at any given time?

Many thanks in advance,
-yp
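[On the multiple-output-segments question: Nutch's SegmentMerger accepts a -slice option that splits the merged output into several smaller segments, capping the size of any single output segment. A command sketch, with illustrative paths and slice size:]

```shell
# Merge everything under crawl/segments, but emit output segments of at
# most 50000 URLs each via -slice (paths and slice size illustrative).
bin/nutch mergesegs crawl/segments_merged -dir crawl/segments -slice 50000
```

[Note that -slice bounds the size of each output segment; whether it also lowers peak temp usage during the reduce is less clear, since the shuffle still moves all the data.]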