Re: HBase and Cassandra on StackOverflow

Joseph Boyd Tue, 30 Aug 2011 13:04:52 -0700

On Tue, Aug 30, 2011 at 12:22 PM, Sam Seigal <[email protected]> wrote:
>
> Will the write call to HBase block until the record written is fully
> replicated ?


No. Data isn't written to disk immediately.

> If not (since it is happening at the block level), then isn't
> there a window where a region server goes down, the data might not be
> available anywhere else, until it comes back up ?

The data would be in the write-ahead log.
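
A toy sketch of that idea, in Python rather than anything HBase-specific. The class and method names (ToyRegionServer, put, recover) are made up for illustration, and the plain list standing in for the log is an assumption: in a real deployment the log lives on HDFS and outlives the region server process.

```python
class ToyRegionServer:
    """Toy model: append each edit to a durable log, then update memory."""

    def __init__(self, wal):
        self.wal = wal        # stand-in for the log on HDFS; assumed to survive a crash
        self.memstore = {}    # in-memory state only; lost when the server dies

    def put(self, row, value):
        self.wal.append((row, value))  # 1. record the edit in the log first
        self.memstore[row] = value     # 2. apply it in memory; disk flush comes later

    @classmethod
    def recover(cls, wal):
        """Rebuild the in-memory state by replaying the surviving log in order."""
        server = cls(wal)
        for row, value in wal:
            server.memstore[row] = value
        return server


wal = []                     # hypothetical durable log shared across restarts
rs = ToyRegionServer(wal)
rs.put("row1", "a")
rs.put("row2", "b")
rs.put("row1", "c")

del rs                       # the server dies before anything was flushed to disk
recovered = ToyRegionServer.recover(wal)
print(recovered.memstore)
```

The point is only the ordering: because the log append happens before the in-memory update, whoever takes over can replay the log and see every acknowledged edit.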


...joe


> On Tue, Aug 30, 2011 at 9:17 AM, Andrew Purtell <[email protected]> wrote:
>
> > > Is the replication strategy for HBase completely reliant on HDFS' block
> > > replication pipelining ?
> >
> > Yes.
> >
> > > Is this replication process asynchronous ?
> >
> >
> > No.
> >
> > Best regards,
> >
> >
> >        - Andy
> >
> > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > (via Tom White)
> >
> >
> > >________________________________
> > >From: Sam Seigal <[email protected]>
> > >To: [email protected]; Andrew Purtell <[email protected]>
> > >Cc: "[email protected]" <[email protected]>
> > >Sent: Tuesday, August 30, 2011 7:35 PM
> > >Subject: Re: HBase and Cassandra on StackOverflow
> > >
> > >A question inline:
> > >
> > >On Tue, Aug 30, 2011 at 2:47 AM, Andrew Purtell <[email protected]> wrote:
> > >
> > >> Hi Chris,
> > >>
> > >> Appreciate your answer on the post.
> > >>
> > >> Personally speaking however the endless Cassandra vs. HBase discussion is
> > >> tiresome and rarely do blog posts or emails in this regard shed any light.
> > >> Often, Cassandra proponents mis-state their case out of ignorance of HBase
> > >> or due to commercial or personal agendas. It is difficult to find clear eyed
> > >> analysis among the partisans. I'm not sure it will make any difference
> > >> posting a rebuttal to some random thing jbellis says. Better to focus on
> > >> improving HBase than play whack a mole.
> > >>
> > >>
> > >> Regarding some of the specific points in that post:
> > >>
> > >> HBase is proven in production deployments larger than the largest publicly
> > >> reported Cassandra cluster, ~1K versus 400 or 700 or somesuch. But basically
> > >> this is the same order of magnitude, with HBase having a slight edge. I
> > >> don't see a meaningful difference here. Stating otherwise is false.
> > >>
> > >> HBase supports replication between clusters (i.e. data centers). I believe,
> > >> but admit I'm not super familiar with the Cassandra option here, that the
> > >> main difference is HBase provides a simple mechanism and the user must build
> > >> a replication architecture useful for them, while Cassandra attempts to hide
> > >> some of that complexity. I do not know if they succeed there, but large
> > >> scale cross data center replication is rarely one size fits all so I doubt
> > >> it.
> > >>
> > >> Cassandra does not have strong consistency in the sense that HBase
> > >> provides. It can provide strong consistency, but at the cost of failing any
> > >> read if there is insufficient quorum. HBase/HDFS does not have that
> > >> limitation. On the other hand, HBase has its own and different scenarios
> > >> where data may not be immediately available. The differences between the
> > >> systems are nuanced and which to use depends on the use case requirements.
> > >>
> > >>
> > >I have a question regarding this point. Is the replication strategy for
> > >HBase completely reliant on HDFS' block replication pipelining ? Is this
> > >replication process asynchronous ? If it is, then is there not a window,
> > >where when a machine is to die and the replication pipeline for a particular
> > >block has not started yet, that block will be unavailable until the machine
> > >comes back up ? Sorry, if I am missing something important here.
> > >
> > >
> > >> Cassandra's RandomPartitioner / hash based partitioning means efficient
> > >> MapReduce or table scanning is not possible, whereas HBase's distributed
> > >> ordered tree is naturally efficient for such use cases, I believe explaining
> > >> why Hadoop users often prefer it. This may or may not be a problem for any
> > >> given use case. Using an ordered partitioner with Cassandra used to require
> > >> frequent manual rebalancing to avoid blowing up nodes. I don't know if more
> > >> recent versions still have this mis-feature.
> > >>
> > >> Cassandra is no less complex than HBase. All of this complexity is "hidden"
> > >> in the sense that with Hadoop/HBase the layering is obvious -- HDFS, HBase,
> > >> etc. -- but the Cassandra internals are no less layered. An impartial
> > >> analysis of implementation and algorithms will reveal that Cassandra's
> > >> theory of operation in its full detail is substantially more complex.
> > >> Compare the BigTable and Dynamo papers and this is clear. There are actually
> > >> more opportunities for something to go wrong with Cassandra.
> > >>
> > >> While we are looking at codebases, it should be noted that HBase has
> > >> substantially more unit tests.
> > >>
> > >> With Cassandra, all RPC is via Thrift with various wrappers, so actually
> > >> all Cassandra clients are second class in the sense that jbellis means when
> > >> he states "Non-Java clients are not second-class citizens".
> > >>
> > >> The master-slave versus peer-to-peer argument is larger than Cassandra vs.
> > >> HBase, and not nearly as one sided as claimed. The famous (infamous?) global
> > >> failure of Amazon's S3 in 2008, a fully peer-to-peer system, due to a single
> > >> flipped bit in a gossip message demonstrates how in peer to peer systems
> > >> every node can be a single point of failure. There is no obvious winner,
> > >> instead, a series of trade offs. Claiming otherwise is intellectually
> > >> dishonest. Master-slave architectures seem easier to operate and reason
> > >> about in my experience. Of course, I'm partial there.
> > >>
> > >> I have just scratched the surface.
> > >>
> > >>
> > >> Best regards,
> > >>
> > >>
> > >>        - Andy
> > >>
> > >> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > >> (via Tom White)
> > >>
> > >>
> > >> >________________________________
> > >> >From: Chris Tarnas <[email protected]>
> > >> >To: [email protected]
> > >> >Sent: Tuesday, August 30, 2011 2:02 PM
> > >> >Subject: HBase and Cassandra on StackOverflow
> > >> >
> > >> >Someone with better knowledge than me might be interested in helping answer
> > >> >this question over at StackOverflow:
> > >> >
> > >> >
> > >> >http://stackoverflow.com/questions/7237271/large-scale-data-processing-hbase-cassandra
> > >> >
> > >> >-chris
> > >> >
> > >> >
> > >>
> > >
> > >
> > >
> >
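
On Andy's quoted point that a quorum-based store has to fail a read when too few replicas answer, here is a minimal sketch of that rule. It is a generic illustration, not Cassandra's actual read path; the names Unavailable and quorum_read, and the (value, timestamp) replica format, are assumptions made up for this example.

```python
class Unavailable(Exception):
    """Raised when too few replicas respond to satisfy the requested quorum."""


def quorum_read(replicas, required):
    """Return the newest value among responding replicas, or fail.

    replicas: list of (value, timestamp) tuples, with None for a replica
              that is down or did not answer in time.
    required: how many responses are needed before the read may succeed.
    """
    live = [r for r in replicas if r is not None]
    if len(live) < required:
        raise Unavailable(f"got {len(live)} responses, need {required}")
    # Newest timestamp wins among the replicas that answered.
    return max(live, key=lambda r: r[1])[0]


# Three replicas, quorum of two. One replica down: the read still succeeds.
value = quorum_read([("v2", 2), ("v1", 1), None], required=2)

# Two replicas down: the read is refused rather than risk returning stale data.
try:
    quorum_read([("v2", 2), None, None], required=2)
    failed = False
except Unavailable:
    failed = True

print(value, failed)
```

The second call is the trade-off in question: a replica holding the data is still up, but the read is rejected because consistency cannot be guaranteed.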

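On the quoted partitioning point, a small sketch of why key order matters for scans. This is a toy model under stated assumptions: four nodes, made-up keys and split points, and MD5 modulo the node count standing in for hash placement. It is not how either system actually assigns ranges or tokens.

```python
import hashlib
from bisect import bisect_right

NODES = 4
keys = [f"user{i:03d}" for i in range(100)]   # hypothetical row keys


def hash_node(key):
    """Hash partitioning: the node is derived from a digest of the key."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NODES


# Ordered partitioning: each node owns one contiguous, sorted key range.
split_points = ["user025", "user050", "user075"]   # assumed range boundaries


def range_node(key):
    """Ordered partitioning: the node is found by position among the splits."""
    return bisect_right(split_points, key)


# A range scan over ten adjacent keys.
scan = [k for k in keys if "user010" <= k < "user020"]

hash_nodes = {hash_node(k) for k in scan}     # nodes touched under hashing
range_nodes = {range_node(k) for k in scan}   # nodes touched under ordering

print("hash partitioning touches nodes:", sorted(hash_nodes))
print("ordered partitioning touches nodes:", sorted(range_nodes))
```

With ordered ranges the scan stays on the one node that owns that slice of the key space, while hashing will typically scatter the same ten keys across several nodes. The flip side, as noted in the quoted text, is that ordered ranges can become unbalanced.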