I have a table with 82 regions and about 44 million rows. It takes almost 6 minutes to count the rows with MapReduce. Is that a reasonable rate for a ten-machine cluster of data nodes? That's just over 12,000 rows per second per machine. Can I do better? Right now the only custom thing I am doing is setting scan.setCaching to 10,000. There's one gz-compressed column per row, but I just want to count rows, not decompress the columns...
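In case it's relevant, here's a sketch of the kind of job I'm running (assuming the standard TableMapReduceUtil API; "mytable", CountRows, and CountMapper are placeholder names, not my real code). Beyond the caching, I've thrown in setCacheBlocks(false) and a FirstKeyOnlyFilter as things I'm considering, on the guess that the filter stops the whole gz value from being shipped to the mappers, but I don't know whether it saves the region server any decompression work:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class CountRows {
  // Mapper just bumps a counter per row; it never touches the cell value.
  static class CountMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context) {
      context.getCounter("count", "rows").increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "count rows");
    job.setJarByClass(CountRows.class);

    Scan scan = new Scan();
    scan.setCaching(10000);                    // the one setting I've customized
    scan.setCacheBlocks(false);                // full scan, so skip the block cache
    scan.setFilter(new FirstKeyOnlyFilter());  // return only the first KeyValue of each row

    TableMapReduceUtil.initTableMapperJob("mytable", scan, CountMapper.class,
        NullWritable.class, NullWritable.class, job);
    job.setNumReduceTasks(0);                  // counters only; no reduce or output needed
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}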
Is one map task assigned per region? Some map tasks only see a few thousand rows, while others see over 2 million. Does this mean the regions aren't balanced, or does region splitting take column size into account as well as row count? (In which case regions of equal byte size could hold very different row counts, since my columns are gz blobs.)
Thanks, Justin
