Hello all,

I have a job that does heavy writing into HBase. The most recent run put 94 million records, each written to two tables: one table stores a KeyValue per record, while the other batches records into bundles of up to a few thousand each. This latest run took about 25 minutes.
We are currently in a phase of development where we need to run these migrations often, and we noticed that enabling the WAL slows the job down about 6-8x. In the interest of speed, we have disabled the WAL and added the following safeguards (rough sketches of both are appended at the end of this mail):

1) At the beginning of the job we check for any dead servers. At the end of the job we check again and compare. If there is a new dead server, we retry the job (the jobs are idempotent/reentrant).

2) At the end of the job, if no servers were lost, we force a memstore flush on the tables that were written to, using HBaseAdmin.flush(String tableName). We then poll the HServerLoad.RegionLoad for all regions of the flushed tables, checking memStoreSizeMB and waiting until it reaches 0 (with a time limit, after which the job fails).

We feel these two mechanisms give us enough protection against losing data to a region server crash, since the Hadoop job is the only process writing to these tables. I use the same technique on another, smaller job, and that one has worked fine.

However, on this larger job I am seeing a NotServingRegionException when calling the initial flush(). We have re-run the job a few times now, and it has happened every time, each time with a different region. Searching for that region in the Admin UI confirms it doesn't exist. Trying to flush the table manually from the hbase shell shows the same problem. However, when I tried the flush from the shell maybe 10-20 minutes after the job finished, it worked that time. Is it possible that a split or compaction is happening at the time of the flush, making the region temporarily unavailable?

Any thoughts are appreciated.

Thanks,
Bryan
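
P.S. In case it helps to see it concretely, here is roughly what safeguard 1 looks like. This is a simplified sketch against the 0.92-style client API, not our actual code; the class and method names are made up for illustration.

import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Simplified sketch of safeguard 1 (dead-server check), 0.92-style client API.
public class DeadServerCheck {

  // Snapshot the dead-server list before the job starts.
  public static Set<ServerName> snapshotDeadServers(HBaseAdmin admin) throws Exception {
    return new HashSet<ServerName>(admin.getClusterStatus().getDeadServerNames());
  }

  // After the job, report whether any server died while it was running.
  public static boolean serverDiedDuringJob(HBaseAdmin admin, Set<ServerName> before) throws Exception {
    Collection<ServerName> after = admin.getClusterStatus().getDeadServerNames();
    for (ServerName sn : after) {
      if (!before.contains(sn)) {
        return true; // a new dead server appeared during the job
      }
    }
    return false;
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    Set<ServerName> before = snapshotDeadServers(admin);
    // ... run the MapReduce job here, with setWriteToWAL(false) on the Puts ...
    if (serverDiedDuringJob(admin, before)) {
      // jobs are idempotent/reentrant, so we simply rerun
      System.err.println("Region server lost during job; rerunning");
    }
  }
}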
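
And here is the flush-and-wait step from safeguard 2, again as a rough sketch: the 5-second poll interval and the timeout handling are stand-ins for whatever values you'd actually pick.

import java.util.Map;

import org.apache.hadoop.hbase.ClusterStatus;
import org.apache.hadoop.hbase.HServerLoad;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Simplified sketch of safeguard 2 (flush, then wait for empty memstores).
public class FlushAndWait {

  // Force a memstore flush of the table, then poll the region loads until every
  // region of the table reports memStoreSizeMB == 0, or fail after timeoutMs.
  public static void flushAndWait(HBaseAdmin admin, String tableName, long timeoutMs)
      throws Exception {
    admin.flush(tableName); // this is the call that throws NotServingRegionException

    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!memStoresEmpty(admin, tableName)) {
      if (System.currentTimeMillis() > deadline) {
        throw new RuntimeException("Timed out waiting for memstore flush of " + tableName);
      }
      Thread.sleep(5000);
    }
  }

  // True when every region of the table shows an empty memstore in the region loads.
  private static boolean memStoresEmpty(HBaseAdmin admin, String tableName) throws Exception {
    ClusterStatus status = admin.getClusterStatus();
    for (ServerName server : status.getServers()) {
      HServerLoad load = status.getLoad(server);
      for (Map.Entry<byte[], HServerLoad.RegionLoad> e : load.getRegionsLoad().entrySet()) {
        HServerLoad.RegionLoad region = e.getValue();
        // region names look like "tableName,startKey,timestamp.encodedName."
        if (region.getNameAsString().startsWith(tableName + ",")
            && region.getMemStoreSizeMB() > 0) {
          return false;
        }
      }
    }
    return true;
  }
}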
