Re: HBase and Cassandra on StackOverflow

Sam Seigal Tue, 30 Aug 2011 12:22:50 -0700

Will the write call to HBase block until the written record is fully
replicated? If not (since replication happens at the block level), isn't
there a window in which, if a region server goes down, the data is not
available anywhere else until it comes back up?
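
To make the question concrete, here is a minimal Python sketch of the two
behaviours being contrasted. It is a toy model, not HDFS or HBase code: the
names Replica, pipelined_write and async_write are hypothetical, and whether
a real HBase write waits this way is exactly what is being asked. The reply
quoted below only says that replication is not asynchronous.

```python
class Replica:
    """Hypothetical stand-in for one node holding a copy of the data."""

    def __init__(self, name):
        self.name = name
        self.records = []

    def store(self, record):
        self.records.append(record)
        return True  # acknowledgement back to the caller


def pipelined_write(record, pipeline):
    """Synchronous case: return only after every replica has acknowledged."""
    for replica in pipeline:
        if not replica.store(record):
            raise IOError(f"{replica.name} did not acknowledge the write")
    return len(pipeline)  # number of copies that exist when the call returns


def async_write(record, pipeline, backlog):
    """Contrast case: acknowledge after one copy and replicate later."""
    pipeline[0].store(record)
    backlog.append((record, pipeline[1:]))  # copies still owed
    return 1


nodes = [Replica("dn1"), Replica("dn2"), Replica("dn3")]
copies = pipelined_write("row1", nodes)
print(copies, [r.records for r in nodes])
```

In the synchronous case there is no point at which the caller has been
acknowledged while only one copy exists. In the asynchronous case the
backlog list is the window the question describes.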


On Tue, Aug 30, 2011 at 9:17 AM, Andrew Purtell <[email protected]> wrote:

> > Is the replication strategy for HBase completely reliant on HDFS' block
> > replication pipelining ?
>
> Yes.
>
> > Is this replication process asynchronous ?
>
> No.
>
> Best regards,
>
>
>        - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>
>
> >________________________________
> >From: Sam Seigal <[email protected]>
> >To: [email protected]; Andrew Purtell <[email protected]>
> >Cc: "[email protected]" <[email protected]>
> >Sent: Tuesday, August 30, 2011 7:35 PM
> >Subject: Re: HBase and Cassandra on StackOverflow
> >
> >A question inline:
> >
> >On Tue, Aug 30, 2011 at 2:47 AM, Andrew Purtell <[email protected]>
> wrote:
> >
> >> Hi Chris,
> >>
> >> Appreciate your answer on the post.
> >>
> >> Personally speaking, however, the endless Cassandra vs. HBase discussion
> >> is tiresome, and rarely do blog posts or emails in this regard shed any
> >> light. Often, Cassandra proponents misstate their case out of ignorance
> >> of HBase or due to commercial or personal agendas. It is difficult to
> >> find clear-eyed analysis among the partisans. I'm not sure it will make
> >> any difference posting a rebuttal to some random thing jbellis says.
> >> Better to focus on improving HBase than play whack-a-mole.
> >>
> >>
> >> Regarding some of the specific points in that post:
> >>
> >> HBase is proven in production deployments larger than the largest
> >> publicly reported Cassandra cluster, ~1K nodes versus 400 or 700 or
> >> somesuch. But basically this is the same order of magnitude, with HBase
> >> having a slight edge. I don't see a meaningful difference here. Stating
> >> otherwise is false.
> >>
> >> HBase supports replication between clusters (i.e. data centers). I
> >> believe, but admit I'm not super familiar with the Cassandra option here,
> >> that the main difference is that HBase provides a simple mechanism and
> >> the user must build a replication architecture useful for them, while
> >> Cassandra attempts to hide some of that complexity. I do not know if they
> >> succeed there, but large scale cross data center replication is rarely
> >> one size fits all, so I doubt it.
> >>
> >> Cassandra does not have strong consistency in the sense that HBase
> >> provides. It can provide strong consistency, but at the cost of failing
> >> any read if there is insufficient quorum. HBase/HDFS does not have that
> >> limitation. On the other hand, HBase has its own and different scenarios
> >> where data may not be immediately available. The differences between the
> >> systems are nuanced and which to use depends on the use case
> >> requirements.
> >>
> >>
> >I have a question regarding this point. Is the replication strategy for
> >HBase completely reliant on HDFS' block replication pipelining? Is this
> >replication process asynchronous? If it is, then is there not a window
> >where, if a machine dies before the replication pipeline for a particular
> >block has started, that block will be unavailable until the machine comes
> >back up? Sorry if I am missing something important here.
> >
> >
> >> Cassandra's RandomPartitioner / hash based partitioning means efficient
> >> MapReduce or table scanning is not possible, whereas HBase's distributed
> >> ordered tree is naturally efficient for such use cases, which I believe
> >> explains why Hadoop users often prefer it. This may or may not be a
> >> problem for any given use case. Using an ordered partitioner with
> >> Cassandra used to require frequent manual rebalancing to avoid blowing up
> >> nodes. I don't know if more recent versions still have this mis-feature.
> >>
> >> Cassandra is no less complex than HBase. All of this complexity is
> >> "hidden" in the sense that with Hadoop/HBase the layering is obvious --
> >> HDFS, HBase, etc. -- but the Cassandra internals are no less layered. An
> >> impartial analysis of implementation and algorithms will reveal that
> >> Cassandra's theory of operation in its full detail is substantially more
> >> complex. Compare the BigTable and Dynamo papers and this is clear. There
> >> are actually more opportunities for something to go wrong with Cassandra.
> >>
> >> While we are looking at codebases, it should be noted that HBase has
> >> substantially more unit tests.
> >>
> >> With Cassandra, all RPC is via Thrift with various wrappers, so actually
> >> all Cassandra clients are second class in the sense that jbellis means
> >> when he states "Non-Java clients are not second-class citizens".
> >>
> >> The master-slave versus peer-to-peer argument is larger than Cassandra
> >> vs. HBase, and not nearly as one sided as claimed. The famous (infamous?)
> >> global failure of Amazon's S3 in 2008, a fully peer-to-peer system, due
> >> to a single flipped bit in a gossip message demonstrates how in peer to
> >> peer systems every node can be a single point of failure. There is no
> >> obvious winner; instead, there is a series of trade-offs. Claiming
> >> otherwise is intellectually dishonest. Master-slave architectures seem
> >> easier to operate and reason about in my experience. Of course, I'm
> >> partial there.
> >>
> >> I have just scratched the surface.
> >>
> >>
> >> Best regards,
> >>
> >>
> >>        - Andy
> >>
> >> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> >> (via Tom White)
> >>
> >>
> >> >________________________________
> >> >From: Chris Tarnas <[email protected]>
> >> >To: [email protected]
> >> >Sent: Tuesday, August 30, 2011 2:02 PM
> >> >Subject: HBase and Cassandra on StackOverflow
> >> >
> >> >Someone with better knowledge than me might be interested in helping
> >> >answer this question over at StackOverflow:
> >> >
> >> >http://stackoverflow.com/questions/7237271/large-scale-data-processing-hbase-cassandra
> >> >
> >> >-chris
> >> >
> >> >
> >>
> >
> >
> >
>
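
The "insufficient quorum" point in the quoted message can be illustrated
with the usual Dynamo-style arithmetic: with N replicas, a write acknowledged
by W of them and a read answered by R of them overlap in at least one replica
when R + W > N, and a read has to fail when fewer than R replicas are
reachable. The Python sketch below only encodes that rule. The function
names are hypothetical and nothing here is Cassandra's API.

```python
def quorum(n):
    """Smallest majority of n replicas."""
    return n // 2 + 1


def is_strongly_consistent(n, r, w):
    """Read and write sets must overlap in at least one replica."""
    return r + w > n


def can_read(live_replicas, r):
    """A read at level r fails when fewer than r replicas are reachable."""
    return live_replicas >= r


n = 3
r = w = quorum(n)
print(is_strongly_consistent(n, r, w))  # quorum reads plus quorum writes
print(can_read(live_replicas=1, r=r))   # only one of three replicas is up
```

With three replicas and quorum reads and writes, the overlap rule holds, but
a read is refused once two of the three replicas are unreachable. That
refused read is the trade-off the quoted message describes.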
