This needs to be documented on the official blog.
On Mon, Jun 23, 2014 at 3:31 PM, Josh Elser wrote:
Sent too quickly..
- The BatchScanner communicates with tservers in *parallel*, which is
where it really shows its strength.
- A "default" locality group. You don't have to define the locality
groups for a table at creation time in Accumulo (or modify the table
if you want to insert a new column family). Because of this, you have
a lot more flexibility in how you structure your tables while still
being able to take advantage of the efficient filtering you get from
the locality groups you have configured. Adding a new locality group
does still require a compaction to rewrite the data into separate
files.
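To make the "default" locality group point concrete, here is a minimal, self-contained sketch in plain Java. It is a toy model, not the Accumulo API: the class `LocalityGroupSketch` and its methods are hypothetical names invented for illustration. It only shows the idea that a column family with no configured group falls into a default group, so new families need no table change, while a scan over configured families can skip every other group.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only (hypothetical names, not Accumulo classes).
class LocalityGroupSketch {
    static final String DEFAULT_GROUP = "default";

    // Configured groups: group name -> column families stored together.
    private final Map<String, Set<String>> groups = new HashMap<>();

    public void setGroup(String name, Set<String> families) {
        groups.put(name, families);
    }

    // A family that was never assigned to a group lands in the default group,
    // so inserting a brand-new family requires no configuration change.
    public String groupFor(String family) {
        for (Map.Entry<String, Set<String>> e : groups.entrySet()) {
            if (e.getValue().contains(family)) {
                return e.getKey();
            }
        }
        return DEFAULT_GROUP;
    }

    // A scan that fetches only some families needs to read only their groups.
    public Set<String> groupsToRead(Set<String> fetchedFamilies) {
        Set<String> needed = new TreeSet<>();
        for (String family : fetchedFamilies) {
            needed.add(groupFor(family));
        }
        return needed;
    }
}
```

In this model, a scan that fetches only a family assigned to a configured group never touches the default group, which is the filtering benefit described above.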
On 6/23/14, 3:24 PM, Josh Elser wrote:
A few observations I can make from watching both communities (although
only really participating in Accumulo's).
- HBase undeniably has a much larger public community of both users and
developers; however, we are seeing broader adoption across different
vertical markets with Accumulo. IMO, we have a rather responsive
community built up here. Lots of smart people are working on it who are
available and happy to help with problems.
- BatchScanner: The BatchScanner is a query construct which will
automatically fetch data from a collection of Ranges on a table and
return the results in the form of a Java Iterator. This makes for a very
natural way to read lots of data from Accumulo, automatically performing
some reduction of the data server-side (using Accumulo Iterators), and
getting a wonderfully simple Iterator<Entry<Key,Value>> in your client
code. It really helps to encourage a stateless, functional style in
your code.
I really like it, and, when combined with the ability to push a bunch of
work server-side, it has often kept me from having to write MapReduce
jobs (which is always a win to me).
- Accumulo Iterators are a common thing you might hear as a difference.
AFAICT, they're a bit more powerful than what you can do with HBase
filters because you are presented with a stream of Key-Value pairs
inside of the TServer. Again, it's a bit functional programming
inspired. You have the ability to combine, consume, seek within the
stream and do what you please (more context would be helpful in giving
specific examples).
That being said, Iterators do come with a learning curve, but that's to
be expected with the amount of flexibility they provide. It's just like
anything else :)
- <disclaimer>I can't comment about running HBase in production
environments, but I tend to hear a lot of "war stories" about it. I also
don't know how much of this is from running old versions of HBase which
don't have known issues patched.</disclaimer>
In my experience, Accumulo just works. It doesn't require much
day-to-day interaction, processes stay running and if some node goes
haywire, I have absolutely no qualms about `kill -9`'ing it and
knowing that everything will come back fine.
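As a rough illustration of the BatchScanner and Iterator points above, here is a self-contained sketch in plain Java. It does not use the Accumulo API. `BatchScanSketch`, `sumByRow` and `batchScan` are hypothetical names, each "server" is just an in-memory sorted map, and keys are assumed to look like `row:qualifier`. It shows the shape of the idea: several ranges are fetched from several servers in parallel, each server reduces its own sorted stream before returning it, and the client gets back a plain Iterator.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy model only (hypothetical names, not Accumulo classes).
class BatchScanSketch {

    // Stand-in for a server-side iterator: consumes a sorted stream of
    // "row:qualifier" -> count entries and emits one summed entry per row.
    static SortedMap<String, Long> sumByRow(SortedMap<String, Long> stream) {
        SortedMap<String, Long> out = new TreeMap<>();
        for (Map.Entry<String, Long> e : stream.entrySet()) {
            String row = e.getKey().split(":", 2)[0];
            out.merge(row, e.getValue(), Long::sum);
        }
        return out;
    }

    // Each range is {startInclusive, endExclusive}. Assumes a row never
    // spans two servers, so per-server sums are already final.
    static Iterator<Map.Entry<String, Long>> batchScan(
            List<SortedMap<String, Long>> servers, List<String[]> ranges) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, servers.size()));
        try {
            List<Future<SortedMap<String, Long>>> futures = new ArrayList<>();
            for (SortedMap<String, Long> server : servers) {
                // One task per server; all servers are queried in parallel.
                futures.add(pool.submit(() -> {
                    SortedMap<String, Long> hits = new TreeMap<>();
                    for (String[] r : ranges) {
                        hits.putAll(server.subMap(r[0], r[1]));
                    }
                    return sumByRow(hits); // reduce before sending to the client
                }));
            }
            List<Map.Entry<String, Long>> results = new ArrayList<>();
            for (Future<SortedMap<String, Long>> f : futures) {
                results.addAll(f.get().entrySet());
            }
            // Order across servers is not meaningful in this sketch.
            return results.iterator();
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The client code stays a simple loop over the returned Iterator, which is the stateless style described above; the summing happens before any data leaves the "server".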
My $0.02.
- Josh
On 6/23/14, 2:49 PM, Josh Elser wrote:
Another way you could word this is that Accumulo has a very "mature"
security implementation, whereas, like you pointed out, HBase has only
recently added this in 0.98.
The note about visibility being in the Key as opposed to the Value also
has an impact when writing Iterators. Because the visibility is a
"first class citizen" instead of an afterthought, having it be part of
what uniquely defines a Key-Value pair makes aggregations much easier
to think about, IMO. This is especially relevant when doing this
server-side with an Accumulo Iterator.
There are also other differences between the implementations'
visibility filtering, the most common being the support of a "NOT"
operator in HBase, whereas Accumulo explicitly chose not to implement
this. Allowing "NOT" in the syntax makes it much more likely that data
is inadvertently leaked. Marking data correctly is more difficult than
it seems, and introducing the ability to negate certain branches makes
it even more difficult. Auditors are scary :)
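A small sketch of why negation is risky, in plain Java. This is neither HBase's nor Accumulo's expression parser. A label is simply modeled as a `Predicate` over the reader's set of authorizations, and `VisibilityNotSketch`, `token` and the label names are made up for illustration. With only AND/OR, giving a reader more authorizations can never hide data and giving fewer can never reveal it. With NOT, a reader who is merely missing an authorization (for example because an account was set up incompletely) suddenly matches.

```java
import java.util.Set;
import java.util.function.Predicate;

// Toy model only (hypothetical names, not a real visibility parser).
class VisibilityNotSketch {

    // A single label token: satisfied when the reader holds that authorization.
    static Predicate<Set<String>> token(String name) {
        return auths -> auths.contains(name);
    }

    public static void main(String[] args) {
        // AND/OR only: visible to readers holding both authorizations.
        Predicate<Set<String>> andOnly = token("staff").and(token("finance"));
        // With NOT: visible to staff who do NOT hold "contractor".
        Predicate<Set<String>> withNot = token("staff").and(token("contractor").negate());

        Set<String> fewerAuths = Set.of("staff");
        Set<String> moreAuths = Set.of("staff", "contractor");

        // Without NOT, the reader with fewer authorizations is denied.
        System.out.println("andOnly, fewer auths: " + andOnly.test(fewerAuths));
        // With NOT, the reader with fewer authorizations is the one who matches,
        // so forgetting to assign "contractor" exposes the data.
        System.out.println("withNot, fewer auths: " + withNot.test(fewerAuths));
        System.out.println("withNot, more auths:  " + withNot.test(moreAuths));
    }
}
```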
- Josh
On 6/23/14, 2:34 PM, Aaron wrote:
I'm not sure of all the differences, but, wrt HBase Cell Level security
(CLS): while similar, it is not 100% the same. If I understand how the
HBase CLS works, it's an extension to the ACL system, and that ACL is
"applied" to a cell. In Accumulo's case, it is part of the key. So the
ramification is that in Accumulo, you can have:
RowID, CF, CQ, VIS1, TS --> Value1
RowID, CF, CQ, VIS2, TS --> Value2
If everything is the same, including the timestamp, the visibility can
actually determine which value to return. So, a more concrete example
would be:
XXX, METADATA, NAME, everyone, 100 --> Bruce Wayne
XXX, METADATA, NAME, alfred-only, 100 --> Batman
Where Alfred could/would see both "values"...but everyone else would
only see "Bruce"
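The Bruce Wayne / Batman example can be modeled in a few lines of plain Java. This is a simplification and not the Accumulo API. `VisibilityKeySketch` is a hypothetical name, the key is just a list of five strings, the visibility is treated as a single token rather than a full expression, and results come back in insertion order rather than Accumulo's sorted key order. The point it shows is that because the visibility is one of the key fields, both entries coexist, and the reader's authorizations decide which values come back.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only (hypothetical names, not Accumulo classes).
class VisibilityKeySketch {

    // Key fields: row, column family, column qualifier, visibility, timestamp.
    static Map<List<String>, String> sampleTable() {
        Map<List<String>, String> table = new LinkedHashMap<>();
        // The two keys differ only in visibility, so neither overwrites the other.
        table.put(Arrays.asList("XXX", "METADATA", "NAME", "everyone", "100"), "Bruce Wayne");
        table.put(Arrays.asList("XXX", "METADATA", "NAME", "alfred-only", "100"), "Batman");
        return table;
    }

    // Return the values whose visibility token is among the reader's authorizations.
    static List<String> scan(Map<List<String>, String> table, Set<String> auths) {
        List<String> visible = new ArrayList<>();
        for (Map.Entry<List<String>, String> e : table.entrySet()) {
            String visibility = e.getKey().get(3);
            if (auths.contains(visibility)) {
                visible.add(e.getValue());
            }
        }
        return visible;
    }
}
```

Assuming every reader holds the `everyone` authorization, a scan with just that authorization returns only "Bruce Wayne", while Alfred, holding both `everyone` and `alfred-only`, gets both values.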
Hope that helps.
Cheers,
Aaron
PS: this is my understanding of how HBase CLS works...based on what I
have read/interpreted.
On Mon, Jun 23, 2014 at 1:55 PM, Jianshi Huang wrote:
Er... basically I need to explain to my manager why we should choose
Accumulo instead of HBase.
So what are the pros and cons of Accumulo vs. HBase? (btw HBase 0.98
also got cell-level security, modeled after Accumulo)
--
Jianshi Huang
LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
--
Donald Miner
Chief Technology Officer
ClearEdge IT Solutions, LLC
Cell: 443 799 7807
www.clearedgeit.com <http://www.clearedgeit.com>