Hi,

I’m trying to optimize our map/reduce job which generates RFiles using 
AccumuloFileOutputFormat. We have a fixed time window in which we need to 
generate a predefined amount of simulation data, and we also have an upper 
bound on the number of cores we can use. Disks are likewise fixed at 4 per 
node, all SSDs. So I can’t add more machines, disks, or cores to achieve 
higher writes/s.

So far we’ve managed to utilize 100% of all available cores, and the SSDs are 
also highly utilized. I’m trying to reduce processing time, and we’re willing 
to trade extra disk space for higher data generation speed. The data itself is 
tens of columns of floating-point numbers, each serialized to a fixed 9-byte 
value, which doesn’t lend itself well to compression. With no compression and 
replication set to 1 we can generate the same amount of data in almost half 
the time. With Snappy it’s almost 10% more generation time compared to no 
compression, and almost twice the size on disk for all the generated RFiles.
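
For reference, here’s roughly how the output side of the job is configured. 
This is a minimal sketch assuming the Accumulo 1.x mapreduce API (the static 
setters on AccumuloFileOutputFormat); the job name and output path below are 
placeholders, not our real values:

  import org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class RFileGenJobConfig {
      // Sets up only the output format; mapper/reducer wiring is omitted.
      public static Job configure(Configuration conf) throws Exception {
          Job job = Job.getInstance(conf, "rfile-generation");           // placeholder name
          job.setOutputFormatClass(AccumuloFileOutputFormat.class);
          FileOutputFormat.setOutputPath(job, new Path("/tmp/rfiles"));  // placeholder path
          AccumuloFileOutputFormat.setCompressionType(job, "none");      // vs "snappy"; default is "gz"
          AccumuloFileOutputFormat.setReplication(job, 1);               // HDFS replication of the RFiles
          return job;
      }
  }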

dataBlockSize doesn’t seem to make any difference for uncompressed data, and 
indexBlockSize didn’t change anything either (I tried 64K vs. the default 128K).
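
In case the exact calls matter, those two knobs were set like this (continuing 
the sketch above, inside the same configure() method; the data block size value 
shown is just an example, not the exact value we ran with):

          AccumuloFileOutputFormat.setDataBlockSize(job, 128 * 1024);  // example value; no visible effect uncompressed
          AccumuloFileOutputFormat.setIndexBlockSize(job, 64 * 1024);  // tried 64K vs. the default 128K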

Are there any other tricks I could employ to achieve higher writes/s?

Ara.


