On 25 Jan 2013, at 10:37, Bertrand Dechoux wrote:

> It seems to me the question has not been answered:
> "is it possible yes or no to force a smaller split size than a block on the mappers"
>
> Not that I know of (but you could implement something to do it), but why would
> you do it?
> By default, if the split is set under the size of a block, it will be a block.
> One of the reasons is data locality. The second is that a block is written to a
> single hard drive (leaving replicas aside), so if n mappers were reading n parts
> from the same block, they would share that drive's bandwidth... so it is not a
> clear win.
>
> You can change the block size of the file you want to read, but using a smaller
> block size is really an anti-pattern; most people increase the block size.
> (Note: the block size of a file is fixed when the file is written, and it can
> differ between two files.)
>
> Are you trying to handle data that are too small?
> If Hive supports multi-threading for mappers, that might be a solution, but I
> don't know the configuration for that.
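Regarding "you could implement something to do it": is the sketch below roughly
what you mean? It is completely untested on my side, the 32 MB cap and the class
name are just placeholders, and I realise Hive reads these files through its own
input format rather than TextInputFormat:

    import org.apache.hadoop.mapred.TextInputFormat;

    // Untested sketch: force sub-block splits by capping the value that
    // the old mapred FileInputFormat.computeSplitSize() would return.
    public class SmallSplitTextInputFormat extends TextInputFormat {
        private static final long MAX_SPLIT = 32L * 1024 * 1024; // placeholder cap

        @Override
        protected long computeSplitSize(long goalSize, long minSize, long blockSize) {
            // default is max(minSize, min(goalSize, blockSize)); clamp it further
            return Math.min(MAX_SPLIT, super.computeSplitSize(goalSize, minSize, blockSize));
        }
    }

To give some context: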
The files are RCFiles with a block size of 128 MB IIRC, but the compression achieves a ratio of nearly 1:100, so a single block can represent roughly 12 GB of uncompressed data. When it goes through a mapper, there is simply not enough memory available for it. Since the compression scheme is BLOCK, I expected it would be possible to instruct Hive to process only a limited number of fragments at a time instead of everything in the file in one go.
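If there is a supported knob for this, I would have expected it to be something
along these lines (the values here are guesses on my part, not a confirmed fix):

    -- guesses, not a confirmed fix:
    set mapred.max.split.size=33554432;
    set mapred.child.java.opts=-Xmx2048m;

The first is only meaningful if the input format honours it for RCFiles; the
second just throws more heap at each mapper rather than limiting how much it
decompresses at once.

David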