Re: NOT operator in visibility string

2014-03-20 Thread joeferner
If I were to add a configurable switch to enable/disable the NOT
operator, would that increase the likelihood of this patch being accepted? I
could make it default to 'disabled'.



--
View this message in context: 
http://apache-accumulo.1065345.n5.nabble.com/NOT-operator-in-visibility-string-tp7949p8310.html
Sent from the Users mailing list archive at Nabble.com.


Re: NOT operator in visibility string

2014-03-20 Thread Josh Elser
For scenarios 1-3, that's not a scalable approach. Every time one of 
these operations occurs, you're re-creating *double* the size of the 
original dataset (your workspace) by deleting and then re-inserting 
every key-value.


I believe this has already been said, but this is the job of an 
authorization service, and cell-level visibilities do not appropriately 
solve this problem.


On 3/20/14, 10:36 AM, sfeng88 wrote:

I currently work on the same project as Jeff. To elaborate on Jeff's
posts, here are some more detailed scenarios of why the NOT operator would
be helpful.

scenario 1 - hide a value from users using workspace1


temporarily add !workspace1 to data you want to exclude

row1 cf1-col1:value1 []
     cf1-col2:value2 [a]
row2 cf1-col1:value3 [b]
     cf1-col2:value4 [b & !workspace1]

1) user A scans the table providing authorizations [a,b]
   -- sees all 4 values

2) user B scans the table providing authorization [a]
   -- sees value1, value2

3) user A when using workspace1 scans the table providing
authorizations [a,b,workspace1]
   -- is prevented from seeing value4


scenario 2 - hide workspace data from other users


add workspace2 to private data

row1 cf1-col1:value1 []
     cf1-col2:value2 [a]
row2 cf1-col1:value3 [b]
     cf1-col2:value4 [b & !workspace1]
row3 cf1-col1:value5 [workspace2]
     cf1-col2:value6 [a & b & workspace2]

4) user A scans the table providing authorizations [a,b]
   -- sees the 4 original values

5) user B scans the table providing authorization [a]
   -- sees value1, value2

6) user A when using workspace2 scans the table providing
authorizations [a,b,workspace2]
   -- sees all 6 values


scenario 3 - publish workspace data


remove workspace2 from data no longer required to be private

row1 cf1-col1:value1 []
     cf1-col2:value2 [a]
row2 cf1-col1:value3 [b]
     cf1-col2:value4 [b & !workspace1]
row3 cf1-col1:value5 [workspace2] -- change to []
     cf1-col2:value6 [a & b & workspace2] -- change to [a & b]

7) user A scans the table providing authorizations [a,b]
   -- sees the 4 original values and newly published value5, value6

8) user B scans the table providing authorization [a]
   -- sees value1, value2 and newly published value5


scenario 4 - failsecure data


no additional data is returned when no authorizations are provided

row1 cf1-col1:value1 []
     cf1-col2:value2 [a]
row2 cf1-col1:value3 [b]
     cf1-col2:value4 [b & !workspace1]
row3 cf1-col1:value5 [workspace2]
     cf1-col2:value6 [a & b & workspace2]

9) user A scans the table with no authorizations
   -- sees only the unmarked value1 (3 fewer than allowed in scenario 2)

10) user B scans the table with no authorizations
   -- sees only the unmarked value1 (1 fewer than allowed in scenario 2)
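To make the proposed semantics concrete, here is a minimal, self-contained sketch of how a conjunction-only expression with the proposed '!' would evaluate against a set of authorizations. This is illustrative only: Accumulo's real ColumnVisibility supports '&', '|', and parentheses but has no NOT, and the class and method names below are mine.

```java
import java.util.Set;

public class ToyVisibility {

    // Sketch only: evaluates conjunction-only expressions such as
    // "b & !workspace1". A '!'-prefixed term hides the value from anyone
    // presenting that authorization; a plain term requires it.
    public static boolean visible(String expr, Set<String> auths) {
        if (expr.isEmpty()) {
            return true; // unlabeled data is visible to everyone
        }
        for (String term : expr.split("&")) {
            term = term.trim();
            if (term.startsWith("!")) {
                if (auths.contains(term.substring(1))) {
                    return false; // presenting the negated auth hides the value
                }
            } else if (!auths.contains(term)) {
                return false; // missing a required authorization
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // user A with [a,b] passes "b & !workspace1";
        // adding workspace1 to the scan authorizations hides the value
        System.out.println(visible("b & !workspace1", Set.of("a", "b")));
        System.out.println(visible("b & !workspace1", Set.of("a", "b", "workspace1")));
    }
}
```

Under this sketch, user A's scan with [a,b] sees value4 (scenario 1, step 1) while a scan with [a,b,workspace1] is prevented from seeing it (step 3), matching the tables above.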




--
View this message in context: 
http://apache-accumulo.1065345.n5.nabble.com/NOT-operator-in-visibility-string-tp7949p8314.html
Sent from the Users mailing list archive at Nabble.com.



Re: NOT operator in visibility string

2014-03-20 Thread sfeng88
For scenarios 1-3, it is a very small dataset that we will be adding/hiding.
In addition, we would rather not duplicate Accumulo's Authorization code
into our project to filter what should or should not be hidden from the user
given our scenarios. 

Going through this entire thread, am I wrong to assume that Accumulo will
not be accepting this patch? If so, please let us know so we can come up
with alternative ways to solve our use cases. 



--
View this message in context: 
http://apache-accumulo.1065345.n5.nabble.com/NOT-operator-in-visibility-string-tp7949p8317.html
Sent from the Users mailing list archive at Nabble.com.


Re: NOT operator in visibility string

2014-03-20 Thread Keith Turner
I am curious about your intent when you write !workspace1: as proposed,
it is equivalent to every possible authorization except workspace1. I am
wondering if you actually wanted something more restrictive, like the set
of all possible workspaces except workspace1. Would you want to write
something like b & workspace[2-inf], where workspace[2-inf] is made-up
syntax meaning any label consisting of workspace followed by an integer GTE 2?
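If the set of workspace labels in play is known and finite, a restriction like this can be approximated today without new syntax by expanding the family into a disjunction. A hypothetical helper (the class and method names are my own, not an Accumulo API):

```java
import java.util.List;
import java.util.stream.Collectors;

public class LabelFamily {

    // Hypothetical helper: enumerate a known, finite family of workspace
    // labels into an OR clause, e.g. "b&(workspace2|workspace3)".
    public static String allExcept(String prefix, List<String> workspaces,
                                   String excluded) {
        String ors = workspaces.stream()
            .filter(w -> !w.equals(excluded))
            .collect(Collectors.joining("|"));
        return prefix + "&(" + ors + ")";
    }

    public static void main(String[] args) {
        System.out.println(allExcept("b",
            List.of("workspace1", "workspace2", "workspace3"), "workspace1"));
    }
}
```

The obvious cost is that adding a new workspace label means rewriting existing visibilities, which is why this only approximates the open-ended [2-inf] idea.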


On Thu, Mar 20, 2014 at 10:36 AM, sfeng88 susan.f...@altamiracorp.com wrote:

 [quoted scenarios snipped; see sfeng88's earlier message above]



Re: NOT operator in visibility string

2014-03-20 Thread Sean Busbey
On Thu, Mar 20, 2014 at 10:32 AM, sfeng88 susan.f...@altamiracorp.com wrote:

 [quoted text snipped; see sfeng88's earlier message above]




Hey Susan!

We haven't called a formal vote, but from my estimation consensus within
the project is opposed to adding a NOT operator. After our pending releases
are handled, I'm going to try to pull the reasoning into a more organized
document since I expect this will come up again.

Since I'd like that document to include some examples that can be
implemented as-is, I'd be happy to help work through how your use case
can be handled without a NOT operator, even though NOT seems like the
obvious choice.

Did you happen to already read my earlier approach[1]? I believe it can be
used to accomplish all of the scenarios you presented.

-Sean

[1]: http://s.apache.org/35e


Re: Hadoop HA with Accumulo 1.5

2014-03-20 Thread Josh Elser
Yes, removing the biggest SPOF from the entire Accumulo architecture is 
a good thing :)


The usage of it with Accumulo is, in all of my testing, completely 
transparent. Once you configure HDFS correctly, there should be nothing 
additional you have to do with Accumulo except make sure 
instance.dfs.uri in accumulo-site.xml is up to date.
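For reference, a sketch of the client-side HDFS settings involved, following the Hadoop HA guide. The nameservice ID mycluster is a placeholder, and this is not the full set of required properties (the per-NameNode RPC addresses are omitted; see the guide linked in the quoted message):

```xml
<!-- hdfs-site.xml (client side); "mycluster" is a placeholder nameservice ID -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```

Accumulo's instance.dfs.uri would then point at hdfs://mycluster rather than a single NameNode host.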


On 3/20/14, 6:17 PM, Ott, Charlie H. wrote:

So I was looking into the software configuration for HA in regard to
hdfs clients utilizing the class
“org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider”
(http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-High-Availability-Guide/cdh4hag_topic_2_3.html)

So I was wondering, does Accumulo 1.5 benefit from the HA feature of
having a failover namenode?



RE: Hadoop HA with Accumulo 1.5

2014-03-20 Thread dlmarion
We had an issue in our testing
(https://issues.apache.org/jira/browse/ACCUMULO-2480). The root cause was a
misconfiguration of automatic failover. The sshfence method does not
handle network failures, so you also have to configure the
shell(/bin/true) method (separated by a newline, unlike other Hadoop
configuration property values). However, if you hit a hiccup during
failover for some reason, you could run into ACCUMULO-2480. I had to restart
the entire Accumulo database because different tservers started reporting
the error, but the restart worked and Accumulo recovered with no issues.
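The fencing setup described above can be sketched as follows (the property name is from the Hadoop HA documentation; the newline inside the value is significant):

```xml
<!-- hdfs-site.xml: fencing methods are tried in order; the newline
     separator is required, unlike other Hadoop property values -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence
shell(/bin/true)</value>
</property>
```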

-Original Message-
From: Josh Elser [mailto:josh.el...@gmail.com] 
Sent: Thursday, March 20, 2014 7:06 PM
To: user@accumulo.apache.org
Subject: Re: Hadoop HA with Accumulo 1.5

[quoted reply and original message snipped; see Josh Elser's message above]




Re: Combiner behaviour

2014-03-20 Thread Josh Elser

Russ,

Close to it. I'll try to work up some actual code to what I'm suggesting.

On 3/20/14, 1:12 AM, Russ Weeks wrote:

Hi, Josh,

Thanks for walking me through this.  This is my first stab at it:

import java.io.IOException;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.IteratorEnvironment;
import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
import org.apache.accumulo.core.iterators.WrappingIterator;

public class RowSummingCombiner extends WrappingIterator {

  Key lastKey;
  long sum;

  @Override
  public Key getTopKey() {
    if (lastKey == null)
      return super.getTopKey();
    return lastKey;
  }

  @Override
  public Value getTopValue() {
    Value v = new Value(Long.toString(sum).getBytes());
    lastKey = null;
    sum = 0; // reset so a re-seek of this instance starts a fresh sum
    return v;
  }

  @Override
  public boolean hasTop() {
    return lastKey != null || super.hasTop();
  }

  @Override
  public void next() throws IOException {
    // consume everything the source has, accumulating the sum
    while (super.hasTop()) {
      lastKey = super.getTopKey();
      if (!lastKey.isDeleted()) {
        sum += Long.parseLong(super.getTopValue().toString());
      }
      super.next();
    }
  }

  @Override
  public SortedKeyValueIterator<Key,Value> deepCopy(IteratorEnvironment env) {
    RowSummingCombiner instance = new RowSummingCombiner();
    instance.setSource(getSource().deepCopy(env));
    return instance;
  }
}

I restrict the scanner to the single CF/CQ that I'm interested in
summing. The biggest disadvantage is that I can't utilize any of the
logic in the Combiner class hierarchy for value decoding etc. because
the logic to combine based on the common (row, cf, cq, vis) tuple is
baked in at the top level of that hierarchy and I don't see an easy way
to plug in new behaviour. But, each instance of the RowSummingCombiner
returns its own sum, and then my client just has to add up a handful of
values. Is this what you were getting at?
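The client-side final step described here can be sketched in plain Java (Accumulo's Value type is stood in by String, and the method name is my own):

```java
import java.util.List;

public class PartialSums {

    // Each ScanSession's combiner hands back one partial sum encoded as a
    // string; the client just adds up the handful of partials.
    public static long finalSum(Iterable<String> partials) {
        long total = 0;
        for (String v : partials) {
            total += Long.parseLong(v);
        }
        return total;
    }

    public static void main(String[] args) {
        // 1807 + 656 + 565 = 3028
        System.out.println(finalSum(List.of("1807", "656", "565")));
    }
}
```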

Regards,
-Russ


On Wed, Mar 19, 2014 at 3:51 PM, Josh Elser josh.el...@gmail.com wrote:

Ummm, you got the gist of it (I may have misspoke in what I
initially said).

My first thought was to make an iterator that filters down to the
columns you want. It doesn't look like the core includes an iterator
that does this efficiently for you (although I know I've done something
similar in the past). This iterator would scan the rows of your table,
returning just the columns you want.

00021ccaac30 meta:size []1807
00021cdaac30 meta:size []656
00021cfaac30 meta:size []565

Then, we could put the summing combiner on top of that iterator to
sum those and get back a single key. The row in the key you return
should be the last row you included in the sum. This way, if a retry
happens under the hood by the batchscanner, you'll resume where you
left off and won't double-count things.

(you could even do things like sum a maximum of N rows before
returning back some intermediate count to better parallelize things)

00021cfaac30 meta:size []3028

So, each ScanSession (what the batchscanner is doing underneath
the hood) would return you a value, over which your client would do a
final summation.

The final stack would be {(data from accumulo) -> SKVI to project
columns -> summing combiner} -> final summation, where {...} denotes
work done server-side. This is one of those things that really
shines with the Accumulo API.


On 3/19/14, 6:40 PM, Russ Weeks wrote:

Hi, Josh,

Thanks very much for your response. I think I get what you're
saying,
but it's kind of blowing my mind.

Are you saying that if I first set up an iterator that took my
key/value
pairs like,

00021ccaac30 meta:size []1807
00021ccaac30 meta:source []data2
00021cdaac30 meta:filename []doc02985453
00021cdaac30 meta:size []656
00021cdaac30 meta:source []data2
00021cfaac30 meta:filename []doc04484522
00021cfaac30 meta:size []565
00021cfaac30 meta:source []data2
00021dcaac30 meta:filename []doc03342958

And emitted something like,

0 meta:size [] 1807
0 meta:size [] 656
0 meta:size [] 565

And then applied a SummingCombiner at a lower priority than that
iterator, then... it should work, right?

I'll give it a try.

Regards,
-Russ


On Wed, Mar 19, 2014 at 3:33 PM, Josh Elser josh.el...@gmail.com wrote:

 Russ,

 Remember that data is distributed across multiple nodes in
 your cluster by tablet.

 A tablet, at the very minimum, will contain one row. Another
 way to say the same thing is that a row will never be split across
 multiple tablets. The only guarantee you get from Accumulo here is
 that you can use a combiner to do your combination across one row.

 However, when you combine (pun not intended) another SKVI
with the
 Combiner, you can do more merging of that intermediate