> Is a sandbox the same thing as a workspace? Can the terms be used 
> interchangeably? Just want to make sure I'm not misinterpreting your answers.

Yes. Sorry I wasn’t consistent with the terminology. 

> Is it fair to describe each sandbox as a separate index table for the global 
> data set? And then when users do deletes, it is only reflected in the index 
> fields, right?

Not quite. The sandbox is a pointer only. Changes are made directly to the 
global data, scoped with an additional workspace visibility. For edits (which I 
haven’t been talking about), this ends up being a new column because of the 
visibility change. For a delete, we’d like to simply add a !workspace_name 
visibility and delete the original column.
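
To make the delete case concrete, here is a minimal Python sketch of the semantics we're after. It is not Accumulo code: as far as I know, visibility expressions today have no NOT operator, which is why we're asking about one. The function names (parse_visibility, is_visible, scope_delete) and the labels are made up for illustration, and the sketch only handles flat conjunctions, not | or parentheses.

```python
def parse_visibility(expr):
    """Split a flat conjunction such as 'analyst & !workspace_a'
    into (required, excluded) sets of labels. Hypothetical syntax."""
    required, excluded = set(), set()
    for term in expr.split("&"):
        term = term.strip()
        if not term:
            continue
        if term.startswith("!"):
            excluded.add(term[1:].strip())
        else:
            required.add(term)
    return required, excluded


def is_visible(expr, auths):
    """A record is visible when the reader holds every required label
    and none of the excluded (negated) labels."""
    required, excluded = parse_visibility(expr)
    auths = set(auths)
    return required <= auths and not (excluded & auths)


def scope_delete(expr, workspace):
    """Return the visibility a record would carry after being deleted
    inside one workspace: the original expression plus '!workspace'."""
    expr = expr.strip()
    return f"{expr} & !{workspace}" if expr else f"!{workspace}"


original = "analyst"
hidden = scope_delete(original, "workspace_a")  # "analyst & !workspace_a"

# Readers in workspace_a no longer see the record; everyone else still does.
in_workspace = is_visible(hidden, {"analyst", "workspace_a"})   # False
elsewhere = is_visible(hidden, {"analyst", "workspace_b"})      # True
```

Publishing the delete globally would then remove the record outright, and abandoning it would restore the original visibility.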

> But you can't just delete values from the index because you need to keep 
> track of the changes in case the user decides to delete globally (after 
> appropriate authorization checks, etc...)

Correct. The user may also choose to undo/abandon the delete.

> Because the visibility is part of the key, changing it involves re-writing 
> the data. Which might be just an index record in your case. However, this is 
> generally an expensive operation.

It would be an operation to the global data with a workspace-scoped visibility. 
It wouldn’t be terribly expensive with a NOT operator in the case of deletes 
because there’s only one record to change.

I really appreciate you thinking about this problem, Mike. My team has spent a 
long time discussing a solution and felt the NOT operator would work best for 
our situation. We’re happy to consider other possible approaches though too.


On Mar 19, 2014, at 1:46 PM, Mike Drob <[email protected]> wrote:

> Thanks, that's really helpful. Couple more questions.
> 
> Is a sandbox the same thing as a workspace? Can the terms be used 
> interchangeably? Just want to make sure I'm not misinterpreting your answers.
> 
> Is it fair to describe each sandbox as a separate index table for the global 
> data set? And then when users do deletes, it is only reflected in the index 
> fields, right?
> But you can't just delete values from the index because you need to keep 
> track of the changes in case the user decides to delete globally (after 
> appropriate authorization checks, etc...)
> 
> Because the visibility is part of the key, changing it involves re-writing 
> the data. Which might be just an index record in your case. However, this is 
> generally an expensive operation.
> 
> I think I need to think on this use case some more; it's definitely 
> interesting and not something I had considered before.
> 
> 
> On Wed, Mar 19, 2014 at 1:24 PM, Jeff Kunkle <[email protected]> wrote:
>> You have a large amount of data, that is generally readable by all users.
> 
> Not necessarily. All data has some visibility constraint that a user's 
> authorizations may or may not satisfy. 
> 
>> Users create their own sandbox, from which they can later exclude portions 
>> of the global data set.
> 
> Yes, users create their own sandboxes which are populated with global data. 
> They may decide to delete some of that data and the change needs to be scoped 
> to their sandbox until the change is published globally.
> 
>> User can share their sandbox with others, so really we are talking about 
>> sandbox permissions and not so much user permissions.
> 
> Yes, users can share their sandbox with others, but a sandbox is just a 
> collection of pointers to data. Users sharing a workspace may not necessarily 
> see all of the same data depending on their authorizations.
> 
>> Sandboxes are created often. Or, at least much more often than the data 
>> changes.
> 
> Yes, sandboxes are created often. The data is likely to be ingested more 
> frequently than sandboxes will be created.
> 
>> Do users typically remove large amounts of data from their sandbox? 1%? 10%? 
>> 99%?
> 
> I don’t have good numbers to share here.
> 
>> Assuming data is removed via rules, are the rules applied automatically to 
>> new data under ingest?
> I would say no, although I’m not positive I understand the question. Users 
> are not removing data from their sandbox per se, but they may delete data 
> that should then be hidden from their workspace. The data is not really 
> deleted though and is still visible to other users in other sandboxes. Only 
> when the deletion is published does it get deleted for everyone.
> 
> On Mar 19, 2014, at 1:03 PM, Mike Drob <[email protected]> wrote:
> 
>> Wait, I'm really confused by what you are describing, Jeff. Sorry if these 
>> are obvious questions, but can you help me get a better grasp of your use 
>> case?
>> 
>> You have a large amount of data, that is generally readable by all users.
>> Users create their own sandbox, from which they can later exclude portions 
>> of the global data set.
>> User can share their sandbox with others, so really we are talking about 
>> sandbox permissions and not so much user permissions.
>> Sandboxes are created often. Or, at least much more often than the data 
>> changes.
>> 
>> Are those all accurate statements? If so, can you clarify the following 
>> points:
>> 
>> Do users typically remove large amounts of data from their sandbox? 1%? 10%? 
>> 99%?
>> Assuming data is removed via rules, are the rules applied automatically to 
>> new data under ingest?
>> 
>> Thanks,
>> Mike
>> 
>> 
>> On Wed, Mar 19, 2014 at 12:54 PM, Jeff Kunkle <[email protected]> wrote:
>> Hi John,
>> 
>> Yes it’s accurate that the system controls the label and who is associated 
>> with it; there are no Accumulo-internal user accounts. But I don’t think 
>> it’s feasible to remove a sandbox label from something that should be 
>> hidden. Such a scenario would imply that all data is “tagged” with the 
>> labels of every sandbox that is allowed to see the data, which would be 
>> most. It would also imply that the creation of a new sandbox would 
>> necessitate changing the visibility of everything in Accumulo to include the 
>> new sandbox label, effectively rewriting the entire database. Sandboxes are 
>> created and deleted all the time in our application, so it doesn’t seem like 
>> a feasible solution to me.
>> 
>> -Jeff
>> 
>> On Mar 19, 2014, at 12:16 PM, Josh Elser <[email protected]> wrote:
>> 
>> > It kind of sounds like you could manage this much easier by controlling 
>> > the authorizations a user gets (notably the workspace name) and the 
>> > grant/revoke above the Accumulo level.
>> >
>> > A sandbox has a unique label and the external system controls which users 
>> > are granted that label. This way, each sandbox can be modified 
>> > individually (using authorizations that contain the data visibility and 
>> > the sandbox label) or the original data set could be modified (by omitting 
>> > a sandbox label in the authorizations used).
>> >
>> > Is that accurate?
>> >
>> > On 3/19/14, 12:05 PM, Jeff Kunkle wrote:
>> >> I attempted to simplify the scenario to facilitate discussion, which on
>> >> second thought may have been a mistake. Here’s the whole scenario:
>> >>
>> >> Different users have access to different subsets of the data depending
>> >> on their authorizations and the visibility of the data. Users “work
>> >> with” the data in what we call a sandbox. Sandboxes can be shared with
>> >> other users (this is the group creation I was talking about earlier).
>> >> Deletes to the data would be “scoped” to the sandbox by changing the
>> >> visibility to add “& !workspace_name” so that people viewing the
>> >> workspace wouldn’t see the data but everyone else would.
>> >>
>> >> On Mar 19, 2014, at 11:48 AM, Sean Busbey <[email protected]> wrote:
>> >>
>> >>> On Wed, Mar 19, 2014 at 10:43 AM, Jeff Kunkle <[email protected]> wrote:
>> >>>
>> >>>    New groups are created on the fly by our application when needed.
>> >>>    Under the scenario you describe we’d have to go through all the
>> >>>    data in Accumulo whenever a group is created so that users in the
>> >>>    group can see the existing data.
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> Ah! So your use case is that all data defaults to world readable and
>> >>> then users have the option of opting out of seeing subsets. Right?
>> >>>
>> >>> In your scenario user groups also get to opt-out of seeing data on the
>> >>> fly, yes? Both require rewriting the data. Does the group creation
>> >>> happen more often?
>> >>
>> 
>> 
> 
> 
