Hi Som,

Crunch uses CombineFileInputFormat to pack large numbers of small input 
files into a single input split. As of Crunch 0.10.0 (not yet released), 
this behaviour is disabled by default for file formats that are not built 
into Crunch, but I believe that in 0.9.x CrunchCombineFileInputFormat is 
used by default for all subclasses of FileInputFormat.

You should be able to disable this behaviour by calling 
formatBundle.set(RuntimeParameters.DISABLE_COMBINE_FILE, "true") in your 
custom TableSource implementation.

I'm a little confused as to why only one mapper is being created if your input 
is indeed 366 GB -- from what I understand, CombineFileInputFormat is just 
supposed to combine small files into a smaller number of splits. Could you give 
a bit more background on what your custom source is doing? In any case, turning 
on DISABLE_COMBINE_FILE should get around this for now. 
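
For a rough sense of scale, here is a back-of-the-envelope sketch (plain Java, not Crunch code) of the estimate you describe: total source size divided by block size. The 128 MB block size and the class and method names are assumptions for illustration only. The real split count also depends on per-file boundaries across your 170 input paths, so treat the result as an order of magnitude rather than an exact figure.

```java
public class ExpectedSplits {

    // Ceiling division: how many blockSizeBytes-sized chunks are needed
    // to cover totalBytes.
    static long expectedSplits(long totalBytes, long blockSizeBytes) {
        if (blockSizeBytes <= 0) {
            throw new IllegalArgumentException("block size must be positive");
        }
        return (totalBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        // Source size taken from the log line in the original message.
        long totalBytes = 366566818559L;
        // Assumed HDFS block size; the thread does not state the actual value.
        long blockSizeBytes = 128L * 1024 * 1024;
        System.out.println("Rough expected splits: "
                + expectedSplits(totalBytes, blockSizeBytes));
    }
}
```

Under that assumption the estimate comes out in the thousands, which is why a single mapper looks wrong to me.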

- Gabriel



> On Wed, Apr 2, 2014 at 7:21 AM, Som Satpathy <[email protected]> wrote:
> Hi Josh/all,
> 
> I have a query regarding how Crunch decides the number of mappers required
> to process a data source formed out of multiple inputs.
> 
> I have data stored as multiple sequence files, and I have implemented a
> source class that implements TableSource<K, V>. I have a
> MultiSequenceFileInputFormat which is set as my input format class in
> configureSource(). I also made sure my getSize() returns the total size of
> all the input sequence files.
> 
> But interestingly, while applying a doFn() over data read from the above
> source, I never see more than 1 mapper created.
> 
> Here is what I see in my logs -
> 
> 14/04/01 19:46:46 INFO crunch.OneToOneTrainingRecordPreSampler: source size
> in bytes: 366566818559
> 
> 14/04/01 19:46:51 INFO input.FileInputFormat: Total input paths to process:
> 170
> 
> 
> But there is always only 1 mapper running.
> 
> As per my understanding, I should be seeing (total source size / block size)
> number of mappers spawned. I might be missing something here, and I look
> forward to hearing your thoughts to help me fix this.
> 
> 
> Thanks,
> 
> Som
> 
> 
> 
