How is the number of mappers to be launched calculated exactly?
Are the file format and compression taken into account (256MB of compressed
data would expand to many more MB when the mapper decompresses it)?

I've created a couple of ORC files (no compression, 1 file = 1 table) with
different stripe size settings:
256, 128, 64 and 16MB. Their sizes are, respectively:
327,814,200; 413,030,657; 413,030,290; 433,481,175.
When I run a query (SELECT * FROM … ORDER BY) over those tables, the number
of map tasks launched is, respectively:
1, 2, 2, 2.
I would expect it to be aligned with my chunk size (256MB), so always 2, as
it's always a multiple of the stripe sizes I've chosen.
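
To make that expectation concrete, here is a minimal sketch of the arithmetic
behind it. It assumes the file sizes quoted above are in bytes and that "MB"
means 2**20 bytes; expected_mappers is a hypothetical helper name, not a Hive
or Hadoop function, and the one-mapper-per-chunk rule is only the naive model
described above, not the engine's actual split logic.

```python
import math

MB = 1024 * 1024  # assumption: "MB" in this post means 2**20 bytes

# File sizes quoted above (assumed to be bytes), for the tables created with
# stripe sizes of 256, 128, 64 and 16 MB respectively.
FILE_SIZES = [327_814_200, 413_030_657, 413_030_290, 433_481_175]


def expected_mappers(file_size: int, chunk_size: int = 256 * MB) -> int:
    """Naive estimate: one mapper per chunk_size-sized piece of the file."""
    return math.ceil(file_size / chunk_size)


expected = [expected_mappers(size) for size in FILE_SIZES]
print(expected)  # [2, 2, 2, 2]
```

Under this naive model every table should get 2 map tasks, which is why the
single mapper for the first table looks surprising.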
After I change the engine to Tez it gets even more interesting: the number
of mappers is, respectively:
2, 2, 4, 13
Why is it different?

Also, when I examine the source table files using the orcdump utility, I can
see that the number of stripes is not consistent with the declared stripe
size; respectively:
8, 118, 118, 118
Is the number of mappers based on the declared stripe size (DDL
= Hive metastore) rather than on the file itself?
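
For comparison, a rough sketch of the stripe counts one might naively expect
if the declared stripe size were simply an on-disk size. That reading is an
assumption and may not be how ORC applies the setting; the file sizes are
again assumed to be bytes, and naive_stripe_count is a hypothetical helper.

```python
import math

MB = 1024 * 1024  # assumption: "MB" in this post means 2**20 bytes

# (declared stripe size in MB, file size assumed in bytes, stripes reported by orcdump)
TABLES = [
    (256, 327_814_200, 8),
    (128, 413_030_657, 118),
    (64, 413_030_290, 118),
    (16, 433_481_175, 118),
]


def naive_stripe_count(file_size: int, stripe_mb: int) -> int:
    """Stripes needed if every stripe held exactly stripe_mb MB on disk."""
    return math.ceil(file_size / (stripe_mb * MB))


naive = [naive_stripe_count(size, stripe_mb) for stripe_mb, size, _ in TABLES]
for (stripe_mb, _, observed), estimate in zip(TABLES, naive):
    print(f"stripe={stripe_mb}MB naive={estimate} observed={observed}")
```

The naive estimates come out as 2, 4, 7 and 26, which do not match the
8, 118, 118, 118 stripes reported above.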

~Maciek
