On 25 Jan 2013, at 10:37, Bertrand Dechoux wrote:
It seems to me the question has not been answered: "is it possible, yes or no, to force a smaller split size than a block on the mappers".
Not that I know of (but you could implement something to do it), but why would you do it?
By default, if the split size is set below the size of a block, the split will still be a block.
One of the reasons is data locality. The second is that a block is written to a single hard drive (leaving replicas aside), so if n mappers were reading n parts of the same block, they would all share that drive's bandwidth... So it is not a clear win.
You can change the block size of the file you want to read, but using a smaller block size is really an anti-pattern. Most people increase the block size.
(Note: the block size of a file is fixed when the file is written, and it can differ between two different files.)
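For illustration only, a per-file block size can be chosen at write time, e.g. from a Hive session before rewriting a table. The table names and the 64 MB value below are made up, and whether the session-level setting actually propagates to the files the job writes depends on the Hadoop/Hive version:

set dfs.block.size=67108864;  -- ask for ~64 MB blocks on files written by this session's jobs
CREATE TABLE events_smallblocks AS SELECT * FROM events;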
That will be my approach for now, or disabling compression altogether for these files. The only problem I have is that compression is so efficient that any operation in the mapper (so on the uncompressed data) just makes the mapper throw an OOM exception, no matter how much memory I give it.
What partly works, though, is setting a low mapred.max.split.size. In a directory containing 34 files, I get 33 mappers (???). When setting hive.merge.mapfiles to false (and leaving mapred.max.split.size at its fs blocksize default), it doesn't seem to have any effect and I get only 20 mappers.
Are you trying to handle data which is too small? If Hive supports multi-threading for the mapper, it might be a solution, but I don't know the configuration for that.
Regards
Bertrand
PS: the question is quite general and not really Hive-related
I realized that after re-reading the whole thread :-)
Thanks for all the answers, everyone!
David
On Fri, Jan 25, 2013 at 8:46 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
Not all files are splittable. Sequence Files are; raw gzip files are not.
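For illustration (table names made up; the properties are the Hadoop 1.x-era names), gzip-compressed text data could be rewritten as a block-compressed SequenceFile from Hive so the result stays splittable:

set hive.exec.compress.output=true;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
CREATE TABLE events_seq STORED AS SEQUENCEFILE AS SELECT * FROM events_gz;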
On Fri, Jan 25, 2013 at 1:47 AM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
set mapred.min.split.size=1024000;
set mapred.max.split.size=4096000;
set hive.merge.mapfiles=false;
I had set the above values, and setting the max split size to a lower value did increase my number of maps. My blocksize was 128 MB.
The only thing was that my files on HDFS were not heavily compressed, and I was using RCFileFormat.
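(For illustration of the arithmetic, assuming a format that can be split at that granularity: with 128 MB blocks and mapred.max.split.size=4096000 bytes, one block is 134217728 bytes, so it can yield roughly 134217728 / 4096000 ≈ 33 splits, hence many more maps.)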
I would suggest that if you have heavily compressed files, you may want to check what the size will be after decompression and allocate more memory to the maps.
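For example, with an illustrative heap size and the Hadoop 1.x-era property name:

set mapred.child.java.opts=-Xmx2048m;  -- 2 GB heap for the child JVMs that run the map (and reduce) tasks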
On Fri, Jan 25, 2013 at 11:46 AM, David Morel <dmore...@gmail.com> wrote:
Hello,
I have seen many posts on various sites and MLs, but didn't find a firm answer anywhere: is it possible, yes or no, to force a smaller split size than a block on the mappers, from the client side? I'm not after pointers to the docs (unless you're very, very sure :-) but after real-life experience along the lines of 'yes, it works this way, I've done it like this...'
All the parameters that I could find (especially specifying a max input split size) seem to have no effect, and the files that I have are so heavily compressed that they completely saturate the mappers' memory when processed.
A solution I could imagine for this specific issue is reducing the block size, but for now I simply went with disabling in-file compression for those. And changing the block size on a per-file basis is something I'd like to avoid if at all possible.
All the Hive settings that we tried only got me as far as raising the number of mappers from 5 to 6 (yay!), whereas I would have needed at least ten times more.
Thanks!
D.Morel
--
Nitin Pawar
--
Bertrand Dechoux