Hi Som,

Crunch uses the CombineFileInputFormat to wrap large numbers of input files into a single input split if the underlying files are small. As of Crunch 0.10.0 (not yet released), this behaviour is disabled by default for file formats that are not built into Crunch, but I believe that in 0.9.x the CrunchCombineFileInputFormat is used by default for all subclasses of FileInputFormat.

You should be able to disable this behaviour by calling formatBundle.set(RuntimeParameters.DISABLE_COMBINE_FILE, "true") in your custom TableSource implementation.

I'm a little confused as to why only one mapper is being created if your input is indeed 366 GB; from what I understand, CombineFileInputFormat is just supposed to combine small files into a smaller number of splits. Could you give a bit more background on what your custom source is doing?

In any case, turning on DISABLE_COMBINE_FILE should get around this for now.

- Gabriel

On Wed, Apr 2, 2014 at 7:21 AM, Som Satpathy <[email protected]> wrote:
> Hi Josh/all,
>
> I have a query regarding how crunch decides the number of mappers required
> to process a data source formed out of multiple inputs.
>
> I have data stored as multiple sequence files, and I have implemented a
> source class that implements TableSource<K, V>. I have a
> MultiSequenceFileInputFormat which is set as my input format class in
> configureSource(). I also made sure my getSize() returns the total size of
> all the input sequence files.
>
> But interestingly, while applying a doFn() over data read from the above
> source, I never see more than 1 mapper created.
>
> Here is what I see in my logs -
>
> 14/04/01 19:46:46 INFO crunch.OneToOneTrainingRecordPreSampler: source size
> in bytes: 366566818559
>
> 14/04/01 19:46:51 INFO input.FileInputFormat: Total input paths to process:
> 170
>
> But there is always only 1 mapper running.
>
> As per my understanding, I should be seeing (total source size / block size)
> number of mappers spawned. I might be missing something here, and I look
> forward to hearing your thoughts to help me fix this.
>
> Thanks,
>
> Som
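
For reference, the "(total source size / block size)" expectation in the quoted message can be sketched as a quick calculation. This is only a rough estimate and not how Crunch or Hadoop actually computes splits: real split planning works per file and depends on the configured minimum and maximum split sizes. The class name SplitEstimate, the method estimateSplits, and the 128 MB block size are assumptions for illustration; only the byte count comes from the log above.

```java
public class SplitEstimate {

    // Rough estimate only: ceil(totalBytes / blockBytes).
    // Real split planning is done per file, so the actual number of
    // mappers can differ from this figure.
    static long estimateSplits(long totalBytes, long blockBytes) {
        if (totalBytes < 0 || blockBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative, block size positive");
        }
        // Integer ceiling division without going through floating point.
        return (totalBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long totalBytes = 366566818559L;      // source size reported in the log
        long blockBytes = 128L * 1024 * 1024; // assumed block size, not stated in the thread
        System.out.println("Estimated splits: " + estimateSplits(totalBytes, blockBytes));
    }
}
```

Under that assumed block size the estimate is in the thousands, which is why a single mapper for this input looks wrong.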
