Re: NOT operator in visibility string
If I were to add a configurable switch that enables/disables the NOT operator, would that increase the likelihood of this patch being accepted? I could make it default to 'disabled'.

-- View this message in context: http://apache-accumulo.1065345.n5.nabble.com/NOT-operator-in-visibility-string-tp7949p8310.html
Sent from the Users mailing list archive at Nabble.com.
Re: NOT operator in visibility string
For scenarios 1-3, that's not a scalable approach. Every time one of these operations occurs, you're re-creating *double* the size of the original dataset (your workspace) by deleting and then re-inserting every key-value. I believe this has already been said, but this is the job of an authorization service, and cell-level visibilities do not appropriately solve this problem.

On 3/20/14, 10:36 AM, sfeng88 wrote:
I currently work on the same project as Jeff. To elaborate on Jeff's posts, here are some more detailed scenarios of why the NOT operator would be helpful.

scenario 1 - hide a value from users using workspace1
temporarily add !workspace1 to data you want to exclude
row1 cf1-col1:value1 []
     cf1-col2:value2 [a]
row2 cf1-col1:value3 [b]
     cf1-col2:value4 [b !workspace1]
1) user A scans the table providing authorizations [a,b] -- sees all 4 values
2) user B scans the table providing authorization [a] -- sees value1, value2
3) user A, when using workspace1, scans the table providing authorizations [a,b,workspace1] -- is prevented from seeing value4

scenario 2 - hide workspace data from other users
add workspace2 to private data
row1 cf1-col1:value1 []
     cf1-col2:value2 [a]
row2 cf1-col1:value3 [b]
     cf1-col2:value4 [b !workspace1]
row3 cf1-col1:value5 [workspace2]
     cf1-col2:value6 [a b workspace2]
4) user A scans the table providing authorizations [a,b] -- sees the 4 original values
5) user B scans the table providing authorization [a] -- sees value1, value2
6) user A, when using workspace2, scans the table providing authorizations [a,b,workspace2] -- sees all 6 values

scenario 3 - publish workspace data
remove workspace2 from data no longer required to be private
row1 cf1-col1:value1 []
     cf1-col2:value2 [a]
row2 cf1-col1:value3 [b]
     cf1-col2:value4 [b !workspace1]
row3 cf1-col1:value5 [workspace2] -- change to []
     cf1-col2:value6 [a b workspace2] -- change to [a b]
7) user A scans the table providing authorizations [a,b] -- sees the 4 original values and newly published value5, value6
8) user B scans the table providing authorization [a] -- sees value1, value2 and newly published value5

scenario 4 - fail-secure data
no additional data is returned when no authorizations are provided
row1 cf1-col1:value1 []
     cf1-col2:value2 [a]
row2 cf1-col1:value3 [b]
     cf1-col2:value4 [b !workspace1]
row3 cf1-col1:value5 [workspace2]
     cf1-col2:value6 [a b workspace2]
9) user A scans the table with no authorizations -- sees only the unmarked value1 (3 fewer than allowed in scenario 2)
10) user B scans the table with no authorizations -- sees only the unmarked value1 (1 fewer than allowed in scenario 2)

-- View this message in context: http://apache-accumulo.1065345.n5.nabble.com/NOT-operator-in-visibility-string-tp7949p8314.html
Sent from the Users mailing list archive at Nabble.com.
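For readers following along: Accumulo's real ColumnVisibility grammar uses & and | and has no NOT operator, which is exactly what this thread is debating. The self-contained sketch below is NOT Accumulo code; it is a toy evaluator that shows the semantics the scenarios above assume, treating a space-separated label list as a conjunction and a hypothetical "!" prefix as negation.

```java
import java.util.Set;

// Toy evaluator for the hypothetical NOT semantics in the scenarios above.
// Not Accumulo's ColumnVisibility: the real grammar uses & and | and has
// no '!'. Here a space-separated label list is a conjunction, and '!x'
// means the scan's authorizations must NOT contain x.
public class NotVisibilitySketch {
    public static boolean visible(String label, Set<String> auths) {
        if (label.isEmpty())
            return true; // unmarked data is always visible
        for (String token : label.split("\\s+")) {
            if (token.startsWith("!")) {
                // negated label: hidden if the authorization is present
                if (auths.contains(token.substring(1)))
                    return false;
            } else if (!auths.contains(token)) {
                // required authorization missing
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // scenario 1: value4 is labeled [b !workspace1]
        System.out.println(visible("b !workspace1", Set.of("a", "b")));               // true
        System.out.println(visible("b !workspace1", Set.of("a", "b", "workspace1"))); // false
        // scenario 4: unmarked value1, no authorizations provided
        System.out.println(visible("", Set.of()));                                    // true
    }
}
```

This also makes the fail-secure wrinkle of scenario 4 concrete: with NOT, presenting *fewer* authorizations can reveal *more* data (an empty auth set satisfies every `!x` term), which is one of the objections raised in this thread.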
Re: NOT operator in visibility string
For scenarios 1-3, it is a very small dataset that we will be adding/hiding. In addition, we would rather not duplicate Accumulo's Authorization code into our project to filter what should or should not be hidden from the user, given our scenarios. Going through this entire thread, am I wrong to assume that Accumulo will not be accepting this patch? If so, please let us know so we can come up with alternative ways to solve our use cases.

-- View this message in context: http://apache-accumulo.1065345.n5.nabble.com/NOT-operator-in-visibility-string-tp7949p8317.html
Sent from the Users mailing list archive at Nabble.com.
Re: NOT operator in visibility string
I am curious about your intent when you write !workspace1: it is equivalent to every possible authorization except workspace1. I am wondering if you actually wanted something more restrictive, like the set of all possible workspaces except workspace1. Do you want to write something like b workspace[2-inf]? Where workspace[2-inf] is made-up syntax meaning all workspace labels followed by an integer GTE 2.

On Thu, Mar 20, 2014 at 10:36 AM, sfeng88 susan.f...@altamiracorp.com wrote:
[quoted scenarios 1-4 elided; see the previous message]
Re: NOT operator in visibility string
On Thu, Mar 20, 2014 at 10:32 AM, sfeng88 susan.f...@altamiracorp.com wrote:
For scenarios 1-3, it is a very small dataset that we will be adding/hiding. In addition, we would rather not duplicate Accumulo's Authorization code into our project to filter what should or should not be hidden from the user given our scenarios. Going through this entire thread, am I wrong to assume that Accumulo will not be accepting this patch? If so, please let us know so we can come up with alternative ways to solve our use cases.

Hey Susan! We haven't called a formal vote, but by my estimation consensus within the project is opposed to adding a NOT operator. After our pending releases are handled, I'm going to try to pull the reasoning into a more organized document, since I expect this will come up again. Since I'd like that document to include some examples that can be implemented as-is, even though NOT seems like an obvious choice, I'd be happy to help work through how your use cases can be handled without a NOT operator. Did you happen to already read my earlier approach[1]? I believe it can be used to accomplish all of the scenarios you presented.

-Sean

[1]: http://s.apache.org/35e
Re: Hadoop HA with Accumulo 1.5
Yes, removing the biggest SPOF from the entire Accumulo architecture is a good thing :) The usage of it with Accumulo is, given all of my testing, completely transparent. Once you configure HDFS correctly, there should be nothing additional you have to do with Accumulo except make sure instance.dfs.uri in accumulo-site.xml is up to date.

On 3/20/14, 6:17 PM, Ott, Charlie H. wrote:
So I was looking into the software configuration for HA in regard to HDFS clients utilizing the class "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider" (http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-High-Availability-Guide/cdh4hag_topic_2_3.html). So I was wondering, does Accumulo 1.5 benefit from the HA feature of having a failover namenode?
RE: Hadoop HA with Accumulo 1.5
We had an issue in our testing (https://issues.apache.org/jira/browse/ACCUMULO-2480). The root cause was a misconfiguration for automatic failover. The sshfence feature does not handle network failures, so you have to configure it with the shell(/bin/true) command as well (separated by a newline, unlike other Hadoop configuration property values). However, if you end up with a hiccup in the failover for some reason, you could run into ACCUMULO-2480. I had to restart the entire Accumulo database because different tservers started reporting the error. A restart of Accumulo did work and recovered with no issues.

-----Original Message-----
From: Josh Elser [mailto:josh.el...@gmail.com]
Sent: Thursday, March 20, 2014 7:06 PM
To: user@accumulo.apache.org
Subject: Re: Hadoop HA with Accumulo 1.5
[quoted message elided; see above]
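The fencing setup described above would look roughly like the following in hdfs-site.xml. This is a sketch: the property names are the standard Hadoop HA settings, but the key path is an example and should be verified against your Hadoop version's HA guide.

```xml
<!-- hdfs-site.xml: fencing for automatic NameNode failover.
     sshfence alone fails when the old active node is unreachable over
     the network, so fall back to shell(/bin/true); note the two methods
     are separated by a NEWLINE, not a comma, unlike most Hadoop
     multi-valued properties. -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence
shell(/bin/true)</value>
</property>
<property>
  <!-- example path; point this at the key sshfence should use -->
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hdfs/.ssh/id_rsa</value>
</property>
```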
Re: Combiner behaviour
Russ,

Close to it. I'll try to work up some actual code to what I'm suggesting.

On 3/20/14, 1:12 AM, Russ Weeks wrote:
Hi, Josh,
Thanks for walking me through this. This is my first stab at it:

import java.io.IOException;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.IteratorEnvironment;
import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
import org.apache.accumulo.core.iterators.WrappingIterator;

public class RowSummingCombiner extends WrappingIterator {
  private Key lastKey;
  private long sum;

  @Override
  public Key getTopKey() {
    if (lastKey == null)
      return super.getTopKey();
    return lastKey;
  }

  @Override
  public Value getTopValue() {
    lastKey = null;
    return new Value(Long.toString(sum).getBytes());
  }

  @Override
  public boolean hasTop() {
    return lastKey != null || super.hasTop();
  }

  @Override
  public void next() throws IOException {
    // Consume the whole source, accumulating the sum; remember the last
    // key seen so the summed value can be returned under it.
    while (super.hasTop()) {
      lastKey = super.getTopKey();
      if (!lastKey.isDeleted()) {
        sum += Long.parseLong(super.getTopValue().toString());
      }
      super.next();
    }
  }

  @Override
  public SortedKeyValueIterator<Key,Value> deepCopy(IteratorEnvironment env) {
    RowSummingCombiner instance = new RowSummingCombiner();
    instance.setSource(getSource().deepCopy(env));
    return instance;
  }
}

I restrict the scanner to the single CF/CQ that I'm interested in summing. The biggest disadvantage is that I can't utilize any of the logic in the Combiner class hierarchy for value decoding etc., because the logic to combine based on the common (row, cf, cq, vis) tuple is baked in at the top level of that hierarchy, and I don't see an easy way to plug in new behaviour. But each instance of the RowSummingCombiner returns its own sum, and then my client just has to add up a handful of values. Is this what you were getting at?

Regards,
-Russ

On Wed, Mar 19, 2014 at 3:51 PM, Josh Elser josh.el...@gmail.com wrote:
Ummm, you got the gist of it (I may have misspoke in what I initially said). My first thought was to make an iterator that will filter down to the columns that you want. It doesn't look like we have an iterator included in the core that will efficiently do this for you (although I know I've done something similar to this in the past).
This iterator would scan the rows of your table, returning just the columns you want:

00021ccaac30 meta:size [] 1807
00021cdaac30 meta:size [] 656
00021cfaac30 meta:size [] 565

Then, we could put the summing combiner on top of that iterator to sum those and get back a single key. The row in the key you return should be the last row you included in the sum. This way, if a retry happens under the hood by the batch scanner, you'll resume where you left off and won't double-count things. (You could even do things like sum a maximum of N rows before returning some intermediate count, to better parallelize things.)

00021cfaac30 meta:size [] 3028

So, each ScanSession (what the batch scanner is doing underneath the hood) would return you a value, and your client would do a final summation. The final stack would be {(data from Accumulo) -> SKVI to project columns -> summing combiner} -> final summation, where {...} denotes work done server-side. This is one of those things that really shines with the Accumulo API.

On 3/19/14, 6:40 PM, Russ Weeks wrote:
Hi, Josh,
Thanks very much for your response. I think I get what you're saying, but it's kind of blowing my mind. Are you saying that if I first set up an iterator that took my key/value pairs like,

00021ccaac30 meta:size [] 1807
00021ccaac30 meta:source [] data2
00021cdaac30 meta:filename [] doc02985453
00021cdaac30 meta:size [] 656
00021cdaac30 meta:source [] data2
00021cfaac30 meta:filename [] doc04484522
00021cfaac30 meta:size [] 565
00021cfaac30 meta:source [] data2
00021dcaac30 meta:filename [] doc03342958

And emitted something like,

0 meta:size [] 1807
0 meta:size [] 656
0 meta:size [] 565

And then applied a SummingCombiner at a lower priority than that iterator, then... it should work, right? I'll give it a try.
Regards,
-Russ

On Wed, Mar 19, 2014 at 3:33 PM, Josh Elser josh.el...@gmail.com wrote:
Russ,
Remember the distribution of data across multiple nodes in your cluster by tablet. A tablet, at the very minimum, will contain one row. Another way to say the same thing is that a row will never be split across multiple tablets. The only guarantee you get from Accumulo here is that you can use a combiner to do your combination across one row. However, when you combine (pun not intended) another SKVI with the Combiner, you can do more merging of that intermediate
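The two-stage pattern Josh describes (server-side partial sums per scan session, final addition client-side) can be sketched without any Accumulo dependencies. The `finalSum` helper below is illustrative, not Accumulo API; the partial sums stand in for what each ScanSession's iterator stack would return.

```java
import java.util.List;

// Illustration of the two-stage summation pattern from this thread:
// each server-side scan session returns an intermediate sum (faked here
// as plain longs), and the client performs the final addition.
public class TwoStageSum {
    // Final client-side step: add up the per-session partial sums.
    public static long finalSum(List<Long> partialSums) {
        long total = 0;
        for (long partial : partialSums) {
            total += partial;
        }
        return total;
    }

    public static void main(String[] args) {
        // e.g. the meta:size values 1807, 656, 565 from the example rows,
        // already combined server-side into two per-session partials.
        List<Long> partials = List.of(2463L, 565L); // 1807+656, 565
        System.out.println(finalSum(partials));     // 3028
    }
}
```

The result matches the single key `00021cfaac30 meta:size [] 3028` that the server-side stack produces in Josh's example; with a batch scanner the client would simply see one such partial per scan session instead of one.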