Hi Wenyia,

The chunk size property causes seqdirectory to write smaller sequence files. Using several small files as input allows more map tasks to run in parallel, because each file is assigned to its own map task.
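
To make the arithmetic concrete, here is a rough sketch in Python. It is only a back-of-the-envelope model, not Mahout or Hadoop code: the function name, the 12 MB input size, and the assumption that chunks fill up evenly are all mine, and the 64 MB block size is taken as the default. It assumes each output file gets at least one map task and that a file larger than one block is split once per block.

```python
import math

def estimate_map_tasks(total_mb, chunk_mb, block_mb=64):
    """Rough estimate of (output files, map tasks) for a given chunk size.

    Hypothetical model: seqdirectory fills each chunk up to chunk_mb, and
    every file is split into one map task per block_mb of data (minimum one).
    """
    if total_mb <= 0 or chunk_mb <= 0 or block_mb <= 0:
        raise ValueError("sizes must be positive")
    n_files = math.ceil(total_mb / chunk_mb)
    # All files are full chunks except the last, which holds the remainder.
    sizes = [chunk_mb] * (n_files - 1) + [total_mb - chunk_mb * (n_files - 1)]
    n_tasks = sum(math.ceil(size / block_mb) for size in sizes)
    return n_files, n_tasks

# Hypothetical 12 MB of input text:
print(estimate_map_tasks(12, 5))    # chunked at 5 MB
print(estimate_map_tasks(12, 64))   # left at the 64 MB default
```

Under these assumptions the 5 MB chunk size yields three files and three map tasks, while the 64 MB setting yields one file and one map task. Real chunk boundaries depend on document sizes, so treat the counts as approximate.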

In the case of the Reuters example, forcing the chunk size to 5 MB causes 3 separate files to be generated instead of a single sequence file. The filesystem block size of 64 MB is treated as an upper bound for input splits, so input smaller than 64 MB is processed by a single mapper unless it is chunked into smaller files.

Drew

On Mon, Jun 27, 2011 at 4:36 PM, wine lover <[email protected]> wrote:
> Hello Everyone,
>
> When using seqdirectory to convert a directory of documents to SequenceFile
> format, it asks to set the chunk size parameter:
> <-chunk <MAX SIZE OF EACH CHUNK in Megabytes> 64>
>
> In the example of build-reuters.sh, the chunk size is set to 5, but I do
> not know why. Is this parameter input-dependent or system-dependent? Is there any
> rule for setting it?
>
> When using seq2sparse to create vectors from a SequenceFile, I notice that
> build-reuters.sh uses it as follows:
> $MAHOUT seq2sparse \
> -i mahout-work/reuters-out-seqdir/ \
> -o mahout-work/reuters-out-seqdir-sparse-lda \
> -wt tf -seq -nr 3 \
>
> What does "-nr 3" stand for?
>
> Thanks,
>
> Wenyia
>
