I bet what you're seeing is more efficient batching in the latter case. BatchWriter goes through a binning phase whenever it fills up half of its buffer, binning everything in the buffer into tablets. If you give it sorted data, each binning pass will probably touch only a subset of the tablets, whereas random data will likely hit all of them. Fewer batches translate into fewer RPC calls and less general overhead.
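To make the binning effect concrete, here is a toy simulation in Python. It is only a sketch and not Accumulo's actual code: the row key format, the split points, and the fixed flush size of 500 rows are all made-up assumptions (the real BatchWriter triggers on buffer bytes and groups by tablet server). It just counts how many distinct tablets each flush touches for sorted versus shuffled input.

```python
import bisect
import random


def tablets_touched_per_flush(rows, split_points, flush_size):
    """For each flush of `flush_size` rows, count the distinct tablets hit.

    Hypothetical model: tablet i holds every row <= split_points[i] that is
    greater than the previous split; the last tablet holds the remainder.
    """
    counts = []
    for start in range(0, len(rows), flush_size):
        chunk = rows[start:start + flush_size]
        # bisect_left returns the index of the first split >= row,
        # which is the index of the tablet that owns the row.
        tablets = {bisect.bisect_left(split_points, row) for row in chunk}
        counts.append(len(tablets))
    return counts


random.seed(42)  # fixed seed so the shuffle is repeatable

# Assumed layout: 9 split points, so 10 tablets over 10,000 row keys.
split_points = [f"row_{i:05d}" for i in range(1000, 10000, 1000)]
rows = [f"row_{i:05d}" for i in range(10000)]

shuffled = rows[:]
random.shuffle(shuffled)

sorted_counts = tablets_touched_per_flush(sorted(rows), split_points, 500)
random_counts = tablets_touched_per_flush(shuffled, split_points, 500)

print("tablet batches, sorted input:  ", sum(sorted_counts))
print("tablet batches, shuffled input:", sum(random_counts))
```

In this model a sorted flush never spans more than two tablets, while a shuffled flush is spread over nearly all of them, so the shuffled run produces several times as many per-tablet batches for the same data.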
This generally indicates that if your data starts roughly partitioned it will load faster, and that becomes more important as you scale up.

Adam

Hi,

We just did a simple test:

- insert 10k batches of columns
- sort the same 10k batch based on row keys and insert

So basically the batch writer in the first test has items in non-sorted order and in the second one in sorted order. We noticed 50% better performance in the sorted version! Why is that the case? Is this something we need to consider doing for live ingest scenarios?

Thanks,
Ara.
