I have a table with 82 regions and about 44 million rows. It takes almost 6 minutes to count the rows with MapReduce. Is that a reasonable rate for a ten-machine cluster of data nodes? That's just over 12,000 rows per second per machine. Can I do better? Right now the only custom thing I am doing is setting scan.setCaching to 10,000. There's one gz-compressed column per row, but I just want to count rows, not decompress the columns...
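In case it's relevant, here's a sketch of the kind of job I'm running (assuming the standard TableMapReduceUtil API; "mytable", CountRows, and CountMapper are placeholder names, not my real code). Beyond the caching, I've thrown in setCacheBlocks(false) and a FirstKeyOnlyFilter as things I'm considering, on the guess that the filter stops the whole gz value from being shipped to the mappers, but I don't know whether it saves the region server any decompression work:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class CountRows {
  // Mapper just bumps a counter per row; it never touches the cell value.
  static class CountMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context) {
      context.getCounter("count", "rows").increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "count rows");
    job.setJarByClass(CountRows.class);

    Scan scan = new Scan();
    scan.setCaching(10000);                    // the one setting I've customized
    scan.setCacheBlocks(false);                // full scan, so skip the block cache
    scan.setFilter(new FirstKeyOnlyFilter());  // return only the first KeyValue of each row

    TableMapReduceUtil.initTableMapperJob("mytable", scan, CountMapper.class,
        NullWritable.class, NullWritable.class, job);
    job.setNumReduceTasks(0);                  // counters only; no reduce or output needed
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}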
Is one map task assigned per region? Some map tasks only see a few thousand rows, while others see over 2 million. Does this mean the regions aren't balanced, or does region splitting take column size into account as well as row count? (In which case regions of equal byte size could hold very different row counts, since my columns are gz blobs.)
Thanks, Justin
