Thanks for your input, Gabriel. I was able to resolve the problem by using SeqFileTableSource(List<Path> paths, PTableType<K, V> ptype) instead of my custom TableSource implementation backed by a MultiSequenceFileInputFormat. Before instantiating the SeqFileTableSource, I just build the list of input paths needed for my job.
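
A minimal sketch of what that change might look like, assuming String keys and values and a directory of part files. The class name, the buildInputPaths helper, the base directory, and the part-file naming are all hypothetical and only illustrate building the path list before constructing the source; the job's real key/value types and paths are not shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.seq.SeqFileTableSource;
import org.apache.crunch.types.PTableType;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class MultiPathSeqFileRead {

    // Hypothetical helper: builds one Path per part file under baseDir.
    // The "part-%05d" naming is an assumption made for illustration.
    static List<Path> buildInputPaths(String baseDir, int partCount) {
        List<Path> paths = new ArrayList<Path>();
        for (int i = 0; i < partCount; i++) {
            paths.add(new Path(baseDir, String.format("part-%05d", i)));
        }
        return paths;
    }

    public static void main(String[] args) {
        Pipeline pipeline = new MRPipeline(MultiPathSeqFileRead.class, new Configuration());

        // Build the list of input paths first (hypothetical location and count).
        List<Path> inputs = buildInputPaths("/data/training/seq", 170);

        // Assumed key/value types; substitute the PTableType the job actually uses.
        PTableType<String, String> ptype =
            Writables.tableOf(Writables.strings(), Writables.strings());

        // One built-in source over all paths, in place of a custom TableSource.
        PTable<String, String> records =
            pipeline.read(new SeqFileTableSource<String, String>(inputs, ptype));

        // Apply DoFns to `records` and write the results here.

        pipeline.done();
    }
}
```

Because the source is one of Crunch's built-in ones, split calculation is left entirely to Crunch rather than to a custom input format.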

Thanks,
Som

On Wed, Apr 2, 2014 at 1:32 AM, Gabriel Reid <[email protected]> wrote:

> Hi Som,
>
> Crunch uses the CombineFileInputFormat to wrap large numbers of input
> files into a single input split if the underlying files are small. As of
> Crunch 0.10.0 (not yet released), this behaviour is disabled by default for
> file formats that are not built in to Crunch, but I believe in 0.9.x the
> CrunchCombineFileInputFormat will be used by default for all subclasses of
> FileInputFormat.
>
> You should be able to disable this behaviour by calling
> formatBundle.set(RuntimeParameters.DISABLE_COMBINE_FILE, "true") in your
> custom TableSource implementation.
>
> I'm a little confused as to why only one mapper is being created if your
> input is indeed 366 GB -- from what I understand, CombineFileInputFormat is
> just supposed to combine small files into a smaller number of splits. Could
> you give a bit more background on what your custom source is doing? In any
> case, turning on DISABLE_COMBINE_FILE should get around this for now.
>
> - Gabriel
>
>
> On Wed, Apr 2, 2014 at 7:21 AM, Som Satpathy <[email protected]> wrote:
>
> > Hi Josh/all,
> >
> > I have a query regarding how Crunch decides the number of mappers
> > required to process a data source formed out of multiple inputs.
> >
> > I have data stored as multiple sequence files, and I have implemented a
> > source class that implements TableSource<K, V>. I have a
> > MultiSequenceFileInputFormat which is set as my input format class in
> > configureSource(). I also made sure my getSize() returns the total size
> > of all the input sequence files.
> >
> > But interestingly, while applying a DoFn over data read from the above
> > source, I never see more than 1 mapper created.
> >
> > Here is what I see in my logs -
> >
> > 14/04/01 19:46:46 INFO crunch.OneToOneTrainingRecordPreSampler: source size in bytes: 366566818559
> >
> > 14/04/01 19:46:51 INFO input.FileInputFormat: Total input paths to process: 170
> >
> > But there is always only 1 mapper running.
> >
> > As per my understanding, I should be seeing (total source size / block
> > size) number of mappers spawned. I might be missing something here, and
> > I look forward to hearing your thoughts to help me fix this.
> >
> > Thanks,
> >
> > Som
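
For reference, the (total source size / block size) estimate from the original question can be worked out with a short sketch. The source size is the value from the log above; the 128 MB and 64 MB block sizes are assumptions, since the cluster's block size is not stated in the thread. The class and method names are hypothetical, and the result is only a rough figure, because splits are computed per file.

```java
public class MapperEstimate {

    // Ceiling division: how many block-sized splits are needed to cover totalBytes.
    static long estimateSplits(long totalBytes, long blockSizeBytes) {
        if (totalBytes < 0 || blockSizeBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative / positive");
        }
        return (totalBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long sourceSize = 366_566_818_559L;      // "source size in bytes" from the log
        long block128 = 128L * 1024 * 1024;      // assumed block size (128 MB)
        long block64 = 64L * 1024 * 1024;        // assumed block size (64 MB)

        System.out.println(estimateSplits(sourceSize, block128)); // prints 2732
        System.out.println(estimateSplits(sourceSize, block64));  // prints 5463
    }
}
```

Under either assumption the expected mapper count is in the thousands, which is why a single mapper pointed to the inputs being combined into one split.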
