Hello,
We are deciding whether to choose HBase or Cassandra for our future
projects. I'm recommending HBase to my boss and coworkers, because HBase is
good both for analysis (MapReduce) and for OLTP (get/put provides relatively
fast response). Cassandra is superior in get/put response time, but it does
not seem well suited to MapReduce because it can't perform range queries
based on row keys (the order-preserving partitioner, OPP, can, but it seems
difficult to use).
However, my boss points out the following as weaknesses of HBase and
insists that we choose Cassandra. I prefer HBase because it has stronger
potential, thanks to its active community and the rich ecosystem that comes
with being part of the Hadoop family. Are there any good explanations (or
future improvement plans/ideas) that could persuade him to change his mind?
(1) Ease of use
Cassandra does not require any other software. All Cassandra nodes have
the same role, which makes it pretty easy to run.
On the other hand, HBase requires HDFS and ZooKeeper, and users have to
operate and manage both. The nodes in the cluster have various roles, and
users need to design the placement of the different types of nodes.
(2) Failover time
One of our potential customers requires that the system complete failover
within one second. "One second" means the interval between when the system
detects a node failure and when the clients regain access to data.
Cassandra continues to process requests as long as one of the three replica
nodes remains. Therefore, the requirement is met.
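As far as I understand, "one of three replicas is enough" holds only when
requests use consistency level ONE; QUORUM needs a majority of the replicas
and ALL needs every replica. Here is a minimal sketch of that arithmetic,
assuming a replication factor of 3. The function names are my own and are
not part of any Cassandra API.

```python
def required_replicas(consistency_level: str, replication_factor: int) -> int:
    """Number of replicas that must respond for a request to succeed."""
    levels = {
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,  # strict majority
        "ALL": replication_factor,
    }
    return levels[consistency_level]


def can_serve(consistency_level: str, replication_factor: int,
              live_replicas: int) -> bool:
    """True if enough replicas are alive to satisfy the consistency level."""
    return live_replicas >= required_replicas(consistency_level,
                                              replication_factor)


if __name__ == "__main__":
    # Replication factor 3 with two replicas down (one still alive).
    for level in ("ONE", "QUORUM", "ALL"):
        print(level, can_serve(level, replication_factor=3, live_replicas=1))
```

So the one-second requirement would be met with a single surviving replica
only if the application can accept consistency level ONE.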
However, HBase seems to take minutes, because it needs to reassign regions
to live region servers, open the reassigned regions and load their block
indexes into memory, and replay the logs. As hardware gets more powerful,
each node will be able to handle more regions. As a result, failover time
will get longer in proportion to the number of regions, won't it?
## My question:
Is it possible to improve failover time? If so, how much can it be
shortened?
##
(3) SPOF
Cassandra has no SPOF. HBase and future versions of HDFS eliminate the SPOF
by using backup masters; however, a master failure *can* still affect the
operation of the entire system in some way. How long does it take to detect
a master failure, promote one of the backup masters to be the new master,
and return to normal operation?
(4) Storage and analysis of sensor data
If the row key is (sensor_id) or (sensor_id, timestamp), Cassandra can hash
the row key and distribute inserts from many sensors to the entire cluster
(no hotspot). Although the MapReduce framework may send tasks to all nodes,
nodes that hold no relevant data will do nothing and will not waste CPU or
I/O resources.
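To illustrate what I mean by hashing the row key, here is a rough sketch. It
takes the MD5 hash of a (sensor_id, timestamp) key and maps it to a node.
The modulo mapping, the key format, and the cluster size are simplifying
assumptions of mine; a real Cassandra cluster assigns token ranges on a ring
rather than using a modulo.

```python
import hashlib
from collections import Counter

NUM_NODES = 4  # hypothetical cluster size


def row_key(sensor_id: int, timestamp: int) -> str:
    """Build a (sensor_id, timestamp) row key; the format is illustrative."""
    return f"sensor-{sensor_id:05d}:{timestamp:010d}"


def hashed_node(key: str, num_nodes: int = NUM_NODES) -> int:
    """Pick a node from the MD5 hash of the key (simplified: modulo, no ring)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_nodes


def distribution(num_sensors: int, readings_per_sensor: int) -> Counter:
    """Count how many inserts land on each node."""
    counts = Counter()
    for sensor_id in range(num_sensors):
        for timestamp in range(readings_per_sensor):
            counts[hashed_node(row_key(sensor_id, timestamp))] += 1
    return counts


if __name__ == "__main__":
    # 1000 sensors, 10 readings each: inserts spread over all nodes.
    for node, count in sorted(distribution(1000, 10).items()):
        print(f"node {node}: {count} inserts")
```

Because adjacent keys hash to unrelated nodes, writes from many sensors
spread out, but a scan over one sensor's time range is no longer a
contiguous read.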
## My question:
Are there any case studies where HBase is used as storage for sensor data?
##
Regards,
Maumau