Data size: 20 GB. The load took about an hour with the default HBase settings; after varying several parameters, we got it down to ~20 minutes. This is still slow and we are trying to improve it.
We wrote a Java client that essentially issues `put`s to HBase tables in batches. The parameters we tuned:

1. Disabling compaction.
2. Varying the put batch size (tried 1000, 5000, 10000, 20000, 40000).
3. Setting AutoFlush on/off.
4. Varying the client write buffer (2 MB, 128 MB, 256 MB).
5. Changing regionserver.handler.count to 100.
6. Varying regionserver size from 128 to 256/512/1024.
7. Increasing the number of regions.
8. Creating regions with pre-specified keys (so that clients hit the regions directly).
9. Varying the number of clients (from 30 to 100).

The above was tested on a 38-node cluster with 2 regions each. We did not try disabling the WAL, for fear of losing data.

Are there any other parameters that we missed during the process?

Viv
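For reference, here is a minimal sketch of how the pre-specified region keys in item 8 could be computed. It is not our actual client code. It assumes row keys start with a uniformly distributed 8-hex-digit prefix (for example, a hash of the natural key), which is an assumption made for illustration. The class name `PreSplitKeys` and the method `hexSplitPoints` are hypothetical. It uses only the Java standard library; the resulting byte arrays are what one would hand to a table-creation call that accepts split keys.

```java
import java.nio.charset.StandardCharsets;

public class PreSplitKeys {

    // Returns numRegions - 1 boundaries that divide the 8-hex-digit
    // key space [00000000, ffffffff] into numRegions equal ranges.
    // ASSUMPTION: row keys begin with a uniformly distributed hex prefix.
    public static String[] hexSplitPoints(int numRegions) {
        if (numRegions < 2) {
            throw new IllegalArgumentException("need at least 2 regions to split");
        }
        final long keySpace = 1L << 32; // number of distinct 8-hex-digit prefixes
        String[] splits = new String[numRegions - 1];
        for (int i = 1; i < numRegions; i++) {
            long boundary = keySpace * i / numRegions;
            splits[i - 1] = String.format("%08x", boundary);
        }
        return splits;
    }

    // Converts the split points to the byte[][] form used for split keys.
    public static byte[][] toBytes(String[] splits) {
        byte[][] out = new byte[splits.length][];
        for (int i = 0; i < splits.length; i++) {
            out[i] = splits[i].getBytes(StandardCharsets.UTF_8);
        }
        return out;
    }

    public static void main(String[] args) {
        // 38 nodes x 2 regions each, matching the cluster described above.
        String[] splits = hexSplitPoints(38 * 2);
        for (String s : splits) {
            System.out.println(s);
        }
    }
}
```

With evenly spaced boundaries like these, each client's puts should spread across all regions only if the key prefixes really are uniform; sequential or skewed keys would still concentrate writes on a few regions.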
