Hi all, In this chapter of our 0.89 to 0.90 migration saga, we are seeing what we suspect might be latency related artifacts.
The setting: - Our EC2 dev environment running our CI builds - CDH3 U0 (both hadoop and hbase) setup in pseudo-clustered mode We have several unit tests that have started mysteriously failing in random ways as soon as we migrated our EC2 CI build to the new 0.90 CDH3. Those tests used to run against 0.89 and never failed before. They also run OK on our local macbooks. On EC2, we are seeing lots of issues where the setup data is not being persisted in time for the tests to assert against them. They are also not always being torn down properly. We first suspected our new code around secondary indexes; we do have extensive unit tests around it that provide us with a solid level of confidence that it works properly in our CRUD scenarios. We also performance tested against the old hbase-trx contrib code and our new secondary indexes seem to be running slightly faster as well (of course, that could be due to the bump from 0.89 to 0.90). We first started seeing issues running our hudson build on the same machine as the hbase pseudo-cluster. We figured that was putting too much load on the box, so we created a separate large instance on EC2 to host just the 0.90 stack. This migration nearly quadrupled the number of unit tests failing at times. The only difference between for first and second CI setup is the network in between. Before we start tearing down our code line by line, I'd like to see if there are latency related configuration tweaks we could try to make the setup more resilient to network lag. Are there any hbase/zookepper settings that might help? For instance, we see things such as HBASE_SLAVE_SLEEP in hbase-env.sh . Can that help? Any suggestions are more than welcome. Also, the overview above may not be enough to go on, so please let me know if I could provide more details. Thank you in advance for any help. -GS
