Hi Josh/all, I have a question about how Crunch decides the number of mappers required to process a data source formed from multiple inputs.
I have data stored as multiple sequence files, and I have written a source class that implements TableSource<K, V>. My input format class, set in configureSource(), is a MultiSequenceFileInputFormat. I also made sure that getSize() returns the total size of all the input sequence files. However, when I apply a DoFn over the data read from this source, I never see more than one mapper created.

Here is what I see in my logs:

14/04/01 19:46:46 INFO crunch.OneToOneTrainingRecordPreSampler: source size in bytes: 366566818559
14/04/01 19:46:51 INFO input.FileInputFormat: Total input paths to process: 170

But there is always only one mapper running. As I understand it, I should see roughly (total source size / block size) mappers spawned. I might be missing something here, and I look forward to hearing your thoughts on how to fix this.

Thanks,
Som
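
For reference, the mapper count expected above can be sketched as a simple ceiling division. This is a rough estimate only: the 128 MB block size is an assumption (the actual block size is not stated in this message), and the class and method names are hypothetical. FileInputFormat also computes splits per file rather than over the combined size, so the real count for 170 files would differ somewhat.

```java
// Rough estimate of the number of input splits (and hence mappers)
// expected for a given total input size. Hypothetical helper, not part
// of Crunch or Hadoop.
class ExpectedMappers {

    // Ceiling of totalBytes / blockBytes, treating all input as one
    // contiguous stream. Real split computation happens per file.
    static long expectedSplits(long totalBytes, long blockBytes) {
        if (totalBytes < 0 || blockBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (totalBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long sourceSize = 366566818559L;      // from the log line above
        long blockSize = 128L * 1024 * 1024;  // ASSUMED block size: 128 MB
        System.out.println(expectedSplits(sourceSize, blockSize));
    }
}
```

Under that assumption the estimate is in the thousands of mappers, which is why a single mapper looks wrong.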
