> Is a sandbox the same thing as a workspace? Can the terms be used
> interchangeably? Just want to make sure I'm not misinterpreting your answers.

Yes. Sorry I wasn’t consistent with the terminology.

> Is it fair to describe each sandbox as a separate index table for the global
> data set? And then when users do deletes, it is only reflected in the index
> fields, right?

Not quite. The sandbox is a pointer only. Changes are made directly to the
global data, scoped with an additional workspace visibility. For edits (which
I haven’t been talking about), this ends up being a new column because of the
visibility change. For a delete, we’d like to simply add a !workspace_name
visibility and delete the original column.

> But you can't just delete values from the index because you need to keep
> track of the changes in case the user decides to delete globally (after
> appropriate authorization checks, etc...)

Correct. The user may also choose to undo/abandon the delete.

> Because the visibility is part of the key, changing it involves re-writing
> the data. Which might be just an index record in your case. However, this is
> generally an expensive operation.

It would be an operation to the global data with a workspace-scoped
visibility. It wouldn’t be terribly expensive with a NOT operator in the case
of deletes because there’s only one record to change.

I really appreciate you thinking about this problem Mike. My team has spent a
long time discussing a solution and felt the NOT operator would work best for
our situation. We’re happy to consider other possible approaches though too.

On Mar 19, 2014, at 1:46 PM, Mike Drob wrote:

> Thanks, that's really helpful. Couple more questions.
>
> Is a sandbox the same thing as a workspace? Can the terms be used
> interchangeably? Just want to make sure I'm not misinterpreting your answers.
>
> Is it fair to describe each sandbox as a separate index table for the global
> data set? And then when users do deletes, it is only reflected in the index
> fields, right?
>
> But you can't just delete values from the index because you need to keep
> track of the changes in case the user decides to delete globally (after
> appropriate authorization checks, etc...)
>
> Because the visibility is part of the key, changing it involves re-writing
> the data. Which might be just an index record in your case. However, this is
> generally an expensive operation.
>
> I think I need to think on this use case some more, it's definitely
> interesting and not something I had considered before.
>
>
> On Wed, Mar 19, 2014 at 1:24 PM, Jeff Kunkle wrote:
>> You have a large amount of data, that is generally readable by all users.
>
> Not necessarily. All data has some visibility constraint that a user's
> authorizations may or may not satisfy.
>
>> Users create their own sandbox, from which they can later exclude portions
>> of the global data set.
>
> Yes, users create their own sandboxes which are populated with global data.
> They may decide to delete some of that data and the change needs to be scoped
> to their sandbox until the change is published globally.
>
>> User can share their sandbox with others, so really we are talking about
>> sandbox permissions and not so much user permissions.
>
> Yes, users can share their sandbox with others, but a sandbox is just a
> collection of pointers to data. Users sharing a workspace may not necessarily
> see all of the same data depending on their authorizations.
>
>> Sandboxes are created often. Or, at least much more often than the data
>> changes.
>
> Yes, sandboxes are created often. The data is likely to be ingested more
> frequently than sandboxes will be created.
>
>> Do users typically remove large amounts of data from their sandbox? 1%? 10%?
>> 99%?
>
> I don’t have good numbers to share here.
>
>> Assuming data is removed via rules, are the rules applied automatically to
>> new data under ingest?
> I would say no, although I’m not positive I understand the question.
> Users are not removing data from their sandbox per se, but they may delete
> data that should then be hidden from their workspace. The data is not really
> deleted though and is still visible to other users in other sandboxes. Only
> when the deletion is published does it get deleted for everyone.
>
> On Mar 19, 2014, at 1:03 PM, Mike Drob wrote:
>
>> Wait, I'm really confused by what you are describing, Jeff. Sorry if these
>> are obvious questions, but can you help me get a better grasp of your use
>> case?
>>
>> You have a large amount of data, that is generally readable by all users.
>> Users create their own sandbox, from which they can later exclude portions
>> of the global data set.
>> User can share their sandbox with others, so really we are talking about
>> sandbox permissions and not so much user permissions.
>> Sandboxes are created often. Or, at least much more often than the data
>> changes.
>>
>> Are those all accurate statements? If so, can you clarify the following
>> points:
>>
>> Do users typically remove large amounts of data from their sandbox? 1%? 10%?
>> 99%?
>> Assuming data is removed via rules, are the rules applied automatically to
>> new data under ingest?
>>
>> Thanks,
>> Mike
>>
>>
>> On Wed, Mar 19, 2014 at 12:54 PM, Jeff Kunkle wrote:
>> Hi John,
>>
>> Yes it’s accurate that the system controls the label and who is associated
>> with it; there are no Accumulo-internal user accounts. But I don’t think
>> it’s feasible to remove a sandbox label from something that should be
>> hidden. Such a scenario would imply that all data is “tagged” with the
>> labels of every sandbox that is allowed to see the data, which would be
>> most. It would also imply that the creation of a new sandbox would
>> necessitate changing the visibility of everything in Accumulo to include the
>> new sandbox label, effectively rewriting the entire database.
>> Sandboxes are created and deleted all the time in our application, so it
>> doesn’t seem like a feasible solution to me.
>>
>> -Jeff
>>
>> On Mar 19, 2014, at 12:16 PM, Josh Elser wrote:
>>
>> > It kind of sounds like you could manage this much easier by controlling
>> > the authorizations a user gets (notably the workspace name) and the
>> > grant/revoke above the Accumulo level.
>> >
>> > A sandbox has a unique label and the external system controls which users
>> > are granted that label. This way, each sandbox can be modified
>> > individually (using authorizations that contain the data visibility and
>> > the sandbox label) or the original data set could be modified (by omitting
>> > a sandbox label in the authorizations used).
>> >
>> > Is that accurate?
>> >
>> > On 3/19/14, 12:05 PM, Jeff Kunkle wrote:
>> >> I attempted to simplify the scenario to facilitate discussion, which on
>> >> second thought may have been a mistake. Here’s the whole scenario:
>> >>
>> >> Different users have access to different subsets of the data depending
>> >> on their authorizations and the visibility of the data. Users “work
>> >> with” the data in what we call a sandbox. Sandboxes can be shared with
>> >> other users (this is the group creation I was talking about earlier).
>> >> Deletes to the data would be “scoped” to the sandbox by changing the
>> >> visibility to add “& !workspace_name” so that people viewing the
>> >> workspace wouldn’t see the data but everyone else would.
>> >>
>> >> On Mar 19, 2014, at 11:48 AM, Sean Busbey wrote:
>> >>
>> >>> On Wed, Mar 19, 2014 at 10:43 AM, Jeff Kunkle wrote:
>> >>>
>> >>> New groups are created on the fly by our application when needed.
>> >>> Under the scenario you describe we’d have to go through all the
>> >>> data in Accumulo whenever a group is created so that users in the
>> >>> group can see the existing data.
>> >>>
>> >>> Ah! So your use case is that all data defaults to world readable and
>> >>> then users have the option of opting out of seeing subsets. Right?
>> >>>
>> >>> In your scenario user groups also get to opt-out of seeing data on the
>> >>> fly, yes? Both require rewriting the data. Does the group creation
>> >>> happen more often?
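
For readers following the thread, the sandbox-scoped delete Jeff describes (appending "& !workspace_name" to a record's visibility) can be sketched as a small stand-alone evaluator. This is only an illustration under stated assumptions: the `!` operator is the one proposed in the discussion, and nothing here calls Accumulo or mirrors its parser. The names `tokenize`, `is_visible`, `scope_delete`, and the labels `analyst`, `admin`, and `workspace_42` are all hypothetical, and giving `&` higher precedence than `|` is a simplification chosen for this sketch.

```python
import re

# Hypothetical mini-evaluator for visibility expressions with a NOT operator.
# Grammar assumed for this sketch (not Accumulo's own parser):
#   expr   := term ('|' term)*
#   term   := factor ('&' factor)*
#   factor := '!' factor | '(' expr ')' | LABEL
TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|!()])")


def tokenize(expression):
    """Split an expression into labels and operator tokens."""
    expression = expression.strip()
    tokens, pos = [], 0
    while pos < len(expression):
        match = TOKEN.match(expression, pos)
        if not match:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(expression, authorizations):
    """Return True if the given set of labels satisfies the expression."""
    tokens = tokenize(expression)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def parse_expr():
        value = parse_term()
        while peek() == "|":
            take()
            rhs = parse_term()  # always parse, then combine
            value = value or rhs
        return value

    def parse_term():
        value = parse_factor()
        while peek() == "&":
            take()
            rhs = parse_factor()  # always parse, then combine
            value = value and rhs
        return value

    def parse_factor():
        token = take()
        if token == "!":
            return not parse_factor()
        if token == "(":
            value = parse_expr()
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return value
        if token is None or token in {"&", "|", ")"}:
            raise ValueError("expected a label")
        return token in authorizations

    result = parse_expr()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


def scope_delete(visibility, workspace):
    """Hide a record from one workspace by AND-ing in a negated label."""
    return f"({visibility}) & !{workspace}"


original = "analyst | admin"
hidden = scope_delete(original, "workspace_42")
print(hidden)                                            # (analyst | admin) & !workspace_42
print(is_visible(hidden, {"analyst"}))                   # True
print(is_visible(hidden, {"analyst", "workspace_42"}))   # False
```

The sketch assumes, as Josh suggests earlier in the thread, that the workspace name is passed as one of the authorizations when a user is viewing that workspace. Under that assumption only the single deleted record is rewritten, and creating a new sandbox touches no existing data, which is the cost argument Jeff makes for the NOT operator.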
