I think A1 is ultimately the right thing, as well.

The problem is not that you don't know how to label your data accurately (which is the biggest problem in Accumulo, since updating the visibility is very costly); it's that it's hard to add your enrichment data after the fact.

The reason that's hard, though, is that your enrichment client needs to act like any other client, i.e., it needs authorizations to read the original data. It seems reasonable to me to try to tackle the problem of ensuring that the process that needs to enrich some data has the appropriate authorizations to read that data.

Christopher wrote:
I think part of your question pertains to the differences between ABAC
(attribute-based access controls) and RBAC (role-based access controls).

In both A1 and A2, you're thinking in terms of RBAC. The only real
difference is whether you want to have one additional role, or
repurpose the existing ones. However, Accumulo's data visibilities are
more like ABAC. Of course, you can use whatever method works for you,
but the intent is more ABAC than RBAC.

The main pitfall with RBAC is that roles and users change, and data is
complex and large and you don't want to re-write it when things change.
However, attributes are properties of the data itself, upon which you
can make access decisions. Ideally, these attributes are things that
don't change, because they are inherent to the data.

To think in terms of ABAC, the main question to ask is "What properties
of this data element will determine who can access it?". For example,
does it contain personal information or medical history? Does it contain
usernames and email addresses? What is it about this data that makes it
worth protecting? Does it need to be protected? I think that's mainly
what John Vines' talk was about (the differences between RBAC and ABAC).
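
As a rough illustration of labeling by attribute, here is a minimal Python
sketch that derives a visibility label from what a record contains rather
than from who should read it. The field names (ssn, diagnosis, email, ...)
and the label terms (pii, medical, contact) are hypothetical, and this is
plain string building, not a call into any Accumulo API.

```python
def visibility_for(record):
    """Build an AND-ed visibility label from the sensitive attributes
    a record carries. All field names and terms here are hypothetical."""
    terms = []
    if record.get("ssn") or record.get("date_of_birth"):
        terms.append("pii")
    if record.get("diagnosis"):
        terms.append("medical")
    if record.get("email") or record.get("username"):
        terms.append("contact")
    # An empty label means the entry carries no visibility restriction.
    return "&".join(terms)


label = visibility_for({"diagnosis": "asthma", "email": "someone@example.com"})
```

The label depends only on the content of the record, so it does not need to
be rewritten when users or roles change; only the authorizations granted to
each user change.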

If RBAC is more appropriate for your data, I'd probably go with A1,
because it's easier to implement and maintain. The biggest drawback is
that you require additional storage space to store the additional role
in each visibility. Because of some internal optimizations, if you go
this route, I'd recommend making this role a prefix, rather than a
suffix "SUPERUSER|(restOfVisibility)" vs. "(restOfVisibility)|SUPERUSER".
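
To make the A1/A2 difference concrete, here is a small self-contained Python
sketch that evaluates a visibility expression against a set of
authorizations. It is a simplified stand-in for what Accumulo does
server-side, not its actual implementation: it assumes "&" binds tighter
than "|" instead of rejecting unparenthesized mixes, and it does not handle
quoted terms. The labels are the ones from the original question.

```python
import re

# A term, or one of the operators/parentheses (simplified grammar).
TOKEN = re.compile(r"\s*([A-Za-z0-9_\-.:/]+|[()&|])")


def tokenize(expr):
    tokens, pos = [], 0
    expr = expr.strip()
    while pos < len(expr):
        m = TOKEN.match(expr, pos)
        if not m:
            raise ValueError(f"unexpected character at {pos} in {expr!r}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def is_visible(expr, auths):
    """Return True if the authorization set satisfies the expression."""
    tokens = tokenize(expr)
    if not tokens:
        return True  # an empty label is readable by everyone
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            rhs = parse_and()
            result = result or rhs
        return result

    def parse_and():
        nonlocal pos
        result = parse_atom()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            rhs = parse_atom()
            result = result and rhs
        return result

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            result = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if tok in (")", "&", "|"):
            raise ValueError(f"unexpected token {tok!r}")
        return tok in auths

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("trailing tokens in expression")
    return result


A1 = "ingestor|(foo_end_user_group|bar_end_user_group)"
A2 = "foo_end_user_group|bar_end_user_group"

a1_for_ingestor = is_visible(A1, {"ingestor"})
a2_for_ingestor = is_visible(A2, {"ingestor"})
```

With A1 the ingestor's single authorization is enough to read every entry;
with A2 the same authorization set reads nothing until the ingestor is also
granted each group authorization.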


--
Christopher L Tubbs II
http://gravatar.com/ctubbsii

On Mon, Feb 16, 2015 at 5:39 PM, Srikanth Viswanathan wrote:

    Hello,

    I'm using Accumulo to store raw and value-added data and expose this
    data to a small number of end users. During ingestion, the system will
    connect to accumulo as a single accumulo user called, say, "ingestor".
    This user will first store data, and then later in the ingestion
    pipeline read the same data back to add value and write the
    value-added data back. End-users will connect as themselves (i.e.,
    individual accumulo accounts) to read the data.

    The questions I am facing are:
    Q1. How to manage the read authorizations for the ingestor?
    Q2. How to ensure data in accumulo is never orphaned due to current
    users lacking authorizations to read certain columns?

    It seems to me that I have two options, both of which will solve both
    my problems above:
    A1. Grant the ingestor a single authorization and store the data with
    labels that allow the ingestor access via this label. e.g.,
    "ingestor|(foo_end_user_group|bar_end_user_group)". By doing this, I
    don't have to maintain special authorization logic for the ingestor,
    and I can also fall back on it to read data that might otherwise be
    orphaned.
    A2.  Store only the end user groups in the visibility labels
    ("foo_end_user_group|bar_end_user_group"), and
    force the ingestion user to obtain all group authorizations needed in
    order to read the data. This will require special logic to update the
    ingestor's authorizations when a new authorization is added to the
    system.

    A1 seems simpler to me, but I heard John Vines discourage this in his
    talk at the 2014 Accumulo Summit. Doesn't the user in either case see
    the same set of data (i.e., "everything")? What then are the potential
    pitfalls of A1 compared to A2?

    Thank you!

    Srikanth Viswanathan

