I found that indexing requires more disk space than I expected. So, for a slave node with limited space, building a smaller index, e.g. 2M pages instead of 10M pages, can avoid the disk-space error.
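As a rough pre-flight check before kicking off a big indexing job, a script like the following can warn when a slave partition is already low. This is just a sketch: DIRS and MIN_FREE_KB are placeholders I made up, not Hadoop settings, and DIRS should point at the partitions actually backing dfs.data.dir and mapred.local.dir on your nodes.

```shell
#!/bin/sh
# Hypothetical pre-flight check: warn if any Hadoop local partition is low
# on space before an indexing job starts. DIRS and MIN_FREE_KB are example
# knobs, not real Hadoop configuration.
DIRS="${DIRS:-/tmp}"                  # space-separated list of partitions
MIN_FREE_KB="${MIN_FREE_KB:-1048576}" # require ~1 GB free per partition

for d in $DIRS; do
    # df -Pk prints POSIX-format 1K blocks; field 4 of line 2 is free space
    free_kb=$(df -Pk "$d" | awk 'NR==2 {print $4}')
    if [ "$free_kb" -lt "$MIN_FREE_KB" ]; then
        echo "WARN: $d has only ${free_kb}KB free (< ${MIN_FREE_KB}KB)"
    else
        echo "OK: $d has ${free_kb}KB free"
    fi
done
```

Running this from cron (or just before each crawl/index cycle) on every slave would have flagged the node before the reduce tasks started failing.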
A related question: after crawling/indexing for some time, each slave node
accumulates lots of files (under hdfs/data/current and hdfs/mapreduce). What's
the correct way to recover the occupied disk space? I assume some of these
files are needed for communicating with the master node.

thanks,
-aj

On Thu, Aug 19, 2010 at 10:30 AM, AJ Chen <[email protected]> wrote:
> I'm indexing 5M pages on a small/cheap cluster. There are some fatal errors
> I'm trying to understand.
>
> 1. A "no space left on device" error occurs on a slave node even though
> there is still 30% free space (>20GB) in the hdfs partition. Is it possible
> that the disk requirement may surge during nutch indexing?
>
> 2010-08-19 02:34:23,546 INFO mapred.ReduceTask -
> attempt_201008141418_0034_r_000004_2 Scheduled 1 outputs (1 slow hosts and
> 0 dup hosts)
> 2010-08-19 02:34:24,191 ERROR mapred.ReduceTask - Task:
> attempt_201008141418_0034_r_000004_2 - FSError:
> org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
>         at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:192)
>         at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>         at java.io.BufferedOutputStream.write(BufferedOutputStream.java:104)
>         at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:49)
>         at java.io.DataOutputStream.write(DataOutputStream.java:90)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleToDisk(ReduceTask.java:1620)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1416)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
> Caused by: java.io.IOException: No space left on device
>         at java.io.FileOutputStream.writeBytes(Native Method)
>         at java.io.FileOutputStream.write(FileOutputStream.java:260)
>         at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:190)
>         ... 8 more
>
> 2. Task failure errors. Any idea what might cause them?
>
> Task attempt_201008141418_0034_r_000005_0 failed to report status for 600
> seconds. Killing!
> Task attempt_201008141418_0034_r_000001_0 failed to report status for 600
> seconds. Killing!
> Task attempt_201008141418_0034_r_000002_0 failed to report status for 601
> seconds. Killing!
>
> 2010-08-19 00:49:28,309 INFO mapred.TaskTracker - Process Thread Dump:
> lost task
> 28 active threads
> Thread 12334 (IPC Client (47) connection to vmo-crawl08-dev/10.1.1.60:9001
> from jboss):
>   State: TIMED_WAITING
>   Blocked count: 2498
>   Waited count: 2498
>   Stack:
>     java.lang.Object.wait(Native Method)
>     org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:403)
>     org.apache.hadoop.ipc.Client$Connection.run(Client.java:445)
> Thread 11256 (process reaper):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 0
>   Stack:
>     java.lang.UNIXProcess.waitForProcessExit(Native Method)
>     java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>
> thanks,
> aj
>
> --
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> http://web2express.org
> twitter @web2express
> Palo Alto, CA, USA
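On the disk-space question itself: as far as I understand (not verified on this cluster), the reduce-side shuffle in that stack trace spills to the local filesystem under mapred.local.dir, not to HDFS, which would explain running out of space while the hdfs partition still shows 30% free. If that's the case, pointing mapred.local.dir at the largest disks and reserving non-HDFS headroom on each datanode may help. The property names below are from the Hadoop 0.20 era and the values are examples only:

```xml
<!-- hadoop-site.xml fragment: example values, adjust paths/sizes per node -->
<property>
  <name>mapred.local.dir</name>
  <!-- comma-separated list; put shuffle/spill space on the biggest disks -->
  <value>/data1/mapred/local,/data2/mapred/local</value>
</property>
<property>
  <name>dfs.datanode.du.reserved</name>
  <!-- bytes of non-HDFS space to keep free on each datanode (10 GB here) -->
  <value>10737418240</value>
</property>
```

As for recovering space: spill files under mapred.local.dir are normally cleaned up when a job finishes, and `hadoop fs -expunge` empties the HDFS trash if fs.trash.interval is enabled. The files under hdfs/data/current, on the other hand, are live HDFS block data and shouldn't be deleted by hand.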

