Re: AccumuloInputFormat and data locality for jobs that don't need keys sorted
Hi Mario,

In my experience, the performance/locality benefit from AccumuloInputFormat/AccumuloRowInputFormat tends to diminish as you add a lot of ranges. In some cases, I've found there's an efficiency curve you can experiment with: it's sometimes faster to throw out unwanted data locally than to use many ranges. I've been using Accumulo with classic MapReduce, YARN MapReduce, and Spark for a while, and this has held true on all of those platforms.

Good luck!
Marc

On Mon, Aug 1, 2016 at 6:55 PM, Mario Pastorelli <mario.pastore...@teralytics.ch> wrote:
> I would like to use an Accumulo table as input for a Spark job. Let me
> clarify that my job doesn't need keys sorted; Accumulo is purely used to
> filter the input data thanks to its index on the keys. The data that I
> need to process in Spark is still a small portion of the full dataset.
> I know that Accumulo provides the AccumuloInputFormat, but in my tests
> almost no task has data locality when I use this input format, which
> leads to poor performance. I'm not sure why this happens, but my guess is
> that the AccumuloInputFormat creates one task per range.
> I wonder if there is a way to tell the AccumuloInputFormat to split each
> range into the sub-ranges local to each tablet server, so that each task
> in Spark will read only data from the machine where it is running.
>
> Thanks for the help,
> Mario
>
> --
> Mario Pastorelli | TERALYTICS
> *software engineer*
>
> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
> phone: +41794381682
> email: mario.pastore...@teralytics.ch
> www.teralytics.net
>
> Company registration number: CH-020.3.037.709-7 | Trade register Canton Zurich
> Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz, Yann de Vries
>
> This e-mail message contains confidential information which is for the
> sole attention and use of the intended recipient. Please notify us at
> once if you think that it may not be intended for you and delete it
> immediately.
Re: Feedback about techniques for tuning batch scanning for my problem
Hi Mario,

Not sure where this plays into your data integrity, but have you looked into these settings in hdfs-site.xml?

    dfs.client.read.shortcircuit
    dfs.client.read.shortcircuit.skip.checksum
    dfs.domain.socket.path

These make for a somewhat dramatic increase in HDFS read performance if the data is distributed well enough around the cluster. I can't speak as much to the scanner params, but you may want to look into these as well.

Marc

On Thu, May 19, 2016 at 10:08 AM, Mario Pastorelli <mario.pastore...@teralytics.ch> wrote:
> Hey people,
> I'm trying to tune the query performance a bit to see how fast it can go,
> and I thought it would be great to have comments from the community. The
> problem that I'm trying to solve in Accumulo is the following: we want to
> store the entities that have been in a certain location on a certain day.
> The location is a Long and the entity id is a Long. I want to be able to
> scan ~1M rows in a few seconds, possibly less than one. Right now, I'm
> doing the following things:
>
> 1. I'm using a sharding byte at the start of the rowId to keep the data
>    in the same range distributed across the cluster
> 2. all the records are encoded; one single record is composed of:
>    1. rowId: 1 shard byte + 3 bytes for the day
>    2. column family: 8 bytes for the long corresponding to the hash of
>       the location
>    3. column qualifier: 8 bytes corresponding to the identifier of the
>       entity
>    4. value: 2 bytes for some additional information
> 3. I use a batch scanner because I don't need sorting and it's faster
>
> As expected, it takes a few seconds to scan 1M rows, but now I'm
> wondering if I can improve it. My ideas are the following:
>
> 1. set table.compaction.major.ratio to 1, because I don't care about the
>    ingestion performance and this should improve the query performance
> 2. pre-split tables to match the number of servers and then use a shard
>    byte as the first byte of the rowId. This should improve both writing
>    and reading the data, because both should work in parallel as far as
>    I understand
> 3. enable the bloom filter on the table
>
> Do you think those ideas make sense? Furthermore, I have two questions:
>
> 1. considering that a single entry is only 22 bytes but I'm going to
>    scan ~1M records per query, do you think I should change the
>    BatchScanner buffers somehow?
> 2. anything else to improve the scan speed? Again, I don't care about
>    the ingestion time
>
> Thanks for the help!
>
> --
> Mario Pastorelli | TERALYTICS
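The 22-byte record layout described in the quoted message can be sketched in plain Java. This is only an illustration of the byte widths from the thread (1 shard byte + 3-byte day for the row, 8-byte longs for family/qualifier); the class and method names, and the modulo shard-assignment rule, are assumptions, not Mario's actual code:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of the record layout from the thread. Only the byte
// widths come from the email; everything else is made up for illustration.
public class RecordLayout {

    // Row: 1 shard byte followed by the day packed big-endian into 3 bytes.
    // Sharding by locationHash % numShards is an assumed policy.
    static byte[] encodeRow(int numShards, long locationHash, int day) {
        byte shard = (byte) Math.abs(locationHash % numShards);
        return new byte[] {
            shard,
            (byte) ((day >>> 16) & 0xFF),
            (byte) ((day >>> 8) & 0xFF),
            (byte) (day & 0xFF)
        };
    }

    // Column family / qualifier: a long packed big-endian into 8 bytes.
    static byte[] encodeLong(long v) {
        return ByteBuffer.allocate(8).putLong(v).array();
    }

    // Recover the day from the last 3 bytes of the row.
    static int decodeDay(byte[] row) {
        return ((row[1] & 0xFF) << 16) | ((row[2] & 0xFF) << 8) | (row[3] & 0xFF);
    }
}
```

With a fixed 4-byte row, 8-byte family, 8-byte qualifier, and 2-byte value, each entry's user data is indeed 22 bytes, matching the figure in the question.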
Re: CHANGES files
I'm not sure if this has been covered already, but the Jetbrains family of products (IntelliJ, TeamCity, etc.) uses this mechanism for their detailed release notes, linked right from their download pages:

https://confluence.jetbrains.com/display/TW/TeamCity+9.0.4+%28build+32407%29+Release+Notes

Can the Accumulo JIRA be linked to in such a public way? If so, that can take care of most of the requirements, I'd wager.

Marc

On Wed, Jun 10, 2015 at 4:28 PM, Christopher <ctubb...@apache.org> wrote:
> On Wed, Jun 10, 2015 at 5:21 PM, Sean Busbey <bus...@cloudera.com> wrote:
>> On Wed, Jun 10, 2015 at 4:14 PM, Christopher <ctubb...@apache.org> wrote:
>>> If we're going to keep doing this, I'd like to have a really good
>>> reason for why we should (which is more convincing than a preference
>>> for grep over JIRA). I'm not coming up with such a reason.
>>
>> The reason I heard Josh give is accessibility for folks who use our
>> software but do not have access to our web pages, jira, nor our git
>> repository. I think that's a legitimate benefit, but like Josh I don't
>> know how much the file effectively gets used in those spaces currently.
>
> True. That's a good point. But... even if that were true for some users
> (we're obviously lacking data on that), there's still the concern about
> its accuracy, and whether an issue number and summary is adequate to
> convey anything meaningful to those users. And, even if it were minimally
> meaningful and 100% accurate (which they aren't), does this benefit
> outweigh the burden? In any case, these individuals (presuming they
> exist) can likely just as easily generate a static up-to-date report, as
> needed, from JIRA at the time they download the release.
spark with AccumuloRowInputFormat?
Has anyone done any testing with Spark and AccumuloRowInputFormat? I have no problem doing this for AccumuloInputFormat:

    JavaPairRDD<Key, Value> pairRDD = sparkContext.newAPIHadoopRDD(job.getConfiguration(),
            AccumuloInputFormat.class, Key.class, Value.class);

But I run into a snag trying to do a similar thing:

    JavaPairRDD<Text, PeekingIterator<Map.Entry<Key, Value>>> pairRDD =
            sparkContext.newAPIHadoopRDD(job.getConfiguration(),
                    AccumuloRowInputFormat.class, Text.class, PeekingIterator.class);

The compilation error is (big, sorry):

    Error:(141, 97) java: method newAPIHadoopRDD in class org.apache.spark.api.java.JavaSparkContext cannot be applied to given types;
      required: org.apache.hadoop.conf.Configuration,java.lang.Class<F>,java.lang.Class<K>,java.lang.Class<V>
      found: org.apache.hadoop.conf.Configuration,java.lang.Class<org.apache.accumulo.core.client.mapreduce.AccumuloRowInputFormat>,java.lang.Class<org.apache.hadoop.io.Text>,java.lang.Class<org.apache.accumulo.core.util.PeekingIterator>
      reason: inferred type does not conform to declared bound(s)
        inferred: org.apache.accumulo.core.client.mapreduce.AccumuloRowInputFormat
        bound(s): org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.io.Text,org.apache.accumulo.core.util.PeekingIterator>

I've tried a few things. The signature of the function is:

    public <K, V, F extends org.apache.hadoop.mapreduce.InputFormat<K, V>>
    JavaPairRDD<K, V> newAPIHadoopRDD(Configuration conf, Class<F> fClass, Class<K> kClass, Class<V> vClass)

I guess it's having trouble with the format extending InputFormatBase with its own additional generic parameters (the Map.Entry inside PeekingIterator). This may be an issue to chase with Spark vs. Accumulo, unless something can be tweaked on the Accumulo side, or I could wrap the InputFormat with my own somehow. Accumulo 1.6.1, Spark 1.3.1, JDK 7u71.

Stopping short of this, can anyone think of a good way to use AccumuloInputFormat to get what I'm getting from the Row version in a performant way? It doesn't necessarily have to be an iterator approach, but I'd need all my values with the key in one consuming function. I'm looking into ways to do it in Spark functions, but trying to avoid any major performance hits.

Thanks,
Marc

p.s. The summit was absolutely great, thank you all for having it!
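The bound failure above comes from passing the raw PeekingIterator.class where a Class of the parameterized value type is needed. A common workaround is an unchecked cast of the class tokens. The snippet below is a self-contained illustration of the problem and the cast; every class in it is a simplified stand-in for the real Spark/Hadoop/Accumulo types, not the actual APIs:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class GenericsWorkaround {
    // Stand-in for org.apache.hadoop.mapreduce.InputFormat<K, V>.
    static class InputFormat<K, V> {}

    // Stand-in for org.apache.accumulo.core.util.PeekingIterator<T>.
    static class PeekingIterator<T> implements Iterator<T> {
        public boolean hasNext() { return false; }
        public T next() { throw new NoSuchElementException(); }
    }

    // Stand-in for AccumuloRowInputFormat: its value type carries its own generics.
    static class RowInputFormat extends InputFormat<String, PeekingIterator<Integer>> {}

    // Stand-in for JavaSparkContext.newAPIHadoopRDD's type bound.
    static <K, V, F extends InputFormat<K, V>> String newRDD(Class<F> f, Class<K> k, Class<V> v) {
        return f.getSimpleName();
    }

    @SuppressWarnings("unchecked")
    static String demo() {
        // This would NOT compile: PeekingIterator.class is a raw
        // Class<PeekingIterator>, so F's bound InputFormat<K, V> can't be met:
        //   newRDD(RowInputFormat.class, String.class, PeekingIterator.class);

        // Workaround: erase the class tokens and cast them back with the
        // exact parameterized types. Unchecked, but safe at runtime since
        // class tokens are erased anyway.
        Class<InputFormat<String, PeekingIterator<Integer>>> fClass =
                (Class<InputFormat<String, PeekingIterator<Integer>>>) (Class<?>) RowInputFormat.class;
        Class<PeekingIterator<Integer>> vClass =
                (Class<PeekingIterator<Integer>>) (Class<?>) PeekingIterator.class;
        return newRDD(fClass, String.class, vClass);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the real call, the same double cast applied to AccumuloRowInputFormat.class and PeekingIterator.class may satisfy newAPIHadoopRDD's bound; whether that is acceptable style is a judgment call, since it trades a compile error for an unchecked warning.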
Re: spark with AccumuloRowInputFormat?
Hi Russ,

How exactly would this work regarding column qualifiers, etc., as those are part of the key? I apologize, but I'm not as familiar with the WholeRowIterator use model: does it consolidate based on the row key, and then return some Key+Value pair which has all the original information serialized? My rows aren't gigantic, but they can occasionally get into the 10s of MB.

On Mon, May 4, 2015 at 11:22 AM, Russ Weeks <rwe...@newbrightidea.com> wrote:
> Hi, Marc,
> If your rows are small, you can use the WholeRowIterator to get all the
> values with the key in one consuming function. If your rows are big but
> you know up-front that you'll only need a small part of each row, you
> could put a filter in front of the WholeRowIterator. I expect there's a
> performance hit (I haven't done any benchmarks myself) because of the
> extra serialization/deserialization, but it's a very convenient way of
> working with rows in Spark.
> Regards,
> -Russ
>
> On Mon, May 4, 2015 at 8:46 AM, Marc Reichman <mreich...@pixelforensics.com> wrote:
>> Has anyone done any testing with Spark and AccumuloRowInputFormat?
Re: spark with AccumuloRowInputFormat?
This is working very well, thanks Russ! For anyone ever stuck in this predicament: using the WholeRowIterator, I was able to get the same Iterator<Map.Entry<Key, Value>> that I get from the AccumuloRowInputFormat, as follows:

    ...
    IteratorSetting iteratorSetting = new IteratorSetting(1, WholeRowIterator.class);
    AccumuloInputFormat.addIterator(job, iteratorSetting);

    // setup RDD
    JavaPairRDD<Key, Value> pairRDD = sparkContext.newAPIHadoopRDD(job.getConfiguration(),
            AccumuloInputFormat.class, Key.class, Value.class);
    JavaRDD<List<MyResult>> result = pairRDD
            .map(new Function<Tuple2<Key, Value>, List<MyResult>>() {
                @Override
                public List<MyResult> call(Tuple2<Key, Value> keyValueTuple2) throws Exception {
                    SortedMap<Key, Value> wholeRow =
                            WholeRowIterator.decodeRow(keyValueTuple2._1, keyValueTuple2._2);
                    MyObject o = getMyObject(wholeRow.entrySet().iterator());
                    ...
                }
            });

Previously, I was doing the approach below, which required an additional stage of Spark calculations as well as a shuffle phase, wasn't nearly as quick, and also needed a helper class (AccumuloRowMapEntry, a very basic Map.Entry implementation):

    JavaRDD<List<MyResult>> result = pairRDD
            .mapToPair(new PairFunction<Tuple2<Key, Value>, Text, Map.Entry<Key, Value>>() {
                @Override
                public Tuple2<Text, Map.Entry<Key, Value>> call(Tuple2<Key, Value> keyValueTuple2) throws Exception {
                    return new Tuple2<Text, Map.Entry<Key, Value>>(keyValueTuple2._1.getRow(),
                            new AccumuloRowMapEntry(keyValueTuple2._1, keyValueTuple2._2));
                }
            })
            .groupByKey()
            .map(new Function<Tuple2<Text, Iterable<Map.Entry<Key, Value>>>, List<MyResult>>() {
                @Override
                public List<MyResult> call(Tuple2<Text, Iterable<Map.Entry<Key, Value>>> textIterableTuple2) throws Exception {
                    MyObject o = getMyObject(textIterableTuple2._2.iterator());
                    ...
                }
            });

Thanks again for all the help.
Marc

On Mon, May 4, 2015 at 12:23 PM, Russ Weeks <rwe...@newbrightidea.com> wrote:
> Yeah, exactly. When you put the WholeRowIterator on the scan, instead of
> seeing all the Key/Value pairs that make up a row, you'll see a single
> Key/Value pair. The only part of the Key that matters is the row id. The
> Value is an encoded map of the Key/Value pairs that constitute the row.
> Call the static method WholeRowIterator.decodeRow to get at this map. The
> decoded Keys have all the CF, CQ, timestamp and visibility data
> populated. I'm not sure if they have the row ID populated; either way,
> they all belong to the same row that was present in the original Key.
> -Russ
>
> On Mon, May 4, 2015 at 9:51 AM, Marc Reichman <mreich...@pixelforensics.com> wrote:
>> Hi Russ,
>> How exactly would this work regarding column qualifiers, etc., as those
>> are part of the key?
Re: spark with AccumuloRowInputFormat?
Thanks Josh. I will make that change to be safe, though in these experiments I use a maxVersions of 1 anyway. I look forward to seeing the definitive Accumulo + Spark guide some day; glad to help where I can if there are specific things to fill in.

On Mon, May 4, 2015 at 2:40 PM, Josh Elser <josh.el...@gmail.com> wrote:
> Thanks _so_ much for taking the time to write this up, Marc! It's a good
> example. One note: you probably want to use a priority greater than 20
> for the IteratorSetting. The VersioningIterator is set on Accumulo tables
> by default at priority 20. In most cases, you'd want to see the state of
> the table _after_ the VersioningIterator filters things.
>
> Marc Reichman wrote:
>> This is working very well, thanks Russ!
Re: OfflineScanner
Apologies for hijacking this, but is there any way to use an offline table clone with MapReduce and AccumuloInputFormat? That read speed increase sounds very appealing.

On Thu, Feb 19, 2015 at 9:27 AM, Josh Elser <josh.el...@gmail.com> wrote:
> Typically, if you're using the OfflineScanner, you'd clone the table you
> want to read and then take the clone offline. It's a simple (and fast)
> solution that doesn't interrupt the availability of the table. Doing the
> read offline will definitely be faster (maybe 20%; I'm not entirely sure
> on an accurate number and how it scales with nodes). The pain would be
> the extra work in creating the clone, offline'ing the table, and
> eventually deleting the clone when you're done with it. A little more
> work, but manageable.
>
> Ara Ebrahimi wrote:
>> Hi,
>> I'm trying to optimize a connector we've written for Presto. In some
>> cases we need to perform full table scans. This happens across all the
>> nodes, but each node is assigned to process only a sharded subset of
>> data. Each shard is hosted by only 1 RFile. I'm looking at the
>> AbstractInputFormat and OfflineIterator, and it seems like the code is
>> not that hard to use for this case. Is there any drawback? It seems like
>> if the table is offline then the OfflineIterator is used, which
>> apparently reads the RFiles directly and doesn't involve any RPC, and I
>> think should be significantly faster. Is it so? Is there any drawback to
>> using this while the table is not offline but no other app is messing
>> with the table?
>> Thanks,
>> Ara.
>>
>> This message is for the designated recipient only and may contain
>> privileged, proprietary, or otherwise confidential information. If you
>> have received it in error, please notify the sender immediately and
>> delete the original. Any other use of the e-mail by you is prohibited.
>> Thank you in advance for your cooperation.
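For the MapReduce question, the 1.6-era client API does expose an offline-scan switch on the input format, so the clone-then-offline recipe Josh describes can be wired into AccumuloInputFormat. A hedged sketch under those assumptions; the table and clone names are made up, and error handling/cleanup are elided:

```java
import java.util.Collections;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
import org.apache.hadoop.mapreduce.Job;

public class OfflineScanSetup {
    public static void configure(Connector conn, Job job) throws Exception {
        String clone = "mytable_offline_clone";  // hypothetical name

        // Clone with flush=true, no per-table property overrides/exclusions.
        conn.tableOperations().clone("mytable", clone,
                true,
                Collections.<String, String>emptyMap(),
                Collections.<String>emptySet());

        // Take the clone offline; wait=true blocks until it is fully offline.
        conn.tableOperations().offline(clone, true);

        AccumuloInputFormat.setInputTableName(job, clone);
        // Read the clone's RFiles directly, bypassing tablet server RPC.
        AccumuloInputFormat.setOfflineTableScan(job, true);

        // After the job completes: conn.tableOperations().delete(clone);
    }
}
```

The clone is cheap because it only references the source table's files; deleting it afterward does not affect the original table.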
Re: submission w/classpath without tool.sh?
So, mapreduce.application.classpath was the winner. It's possible that yarn.application.classpath would have worked as well. My main issue was that I was neglecting to include a copy of the XML files in the classpath, so my settings weren't being picked up (late-night epiphany). Passing the value as -Dmapreduce.application.classpath=... on the command line allowed it to take effect, and I was fine.

For remote clients, I have copied into a local classpath lib what I need to launch: the jar list output from accumulo classpath, and a set of the XML files needed to set the appropriate client-side mapreduce options to launch properly, including the classpath mentioned above but also the various memory-related settings in YARN/MR2.

Thanks for the help, Billie!

On Sat, Jan 24, 2015 at 7:51 AM, Billie Rinaldi <bil...@apache.org> wrote:
> You might have to set yarn.application.classpath in both the client and
> the server conf. At least that's what Slider does.
>
> On Jan 23, 2015 10:00 PM, Marc Reichman <mreich...@pixelforensics.com> wrote:
>> That's correct, I don't really want to have the client package up every
>> Accumulo and ZooKeeper jar I need in dcache or a fat jar or whatever,
>> just to run stuff from a remote client when the jars are all there. I
>> did try yarn.application.classpath, but I didn't spell out the whole
>> thing. Next try I will put all those jars in explicitly instead of the
>> dir wildcards. I will update how it goes.
>>
>> On Fri, Jan 23, 2015 at 5:19 PM, Billie Rinaldi <bil...@apache.org> wrote:
>>> You have all the jars your app needs on both the servers and the
>>> client (as opposed to wanting Yarn to distribute them)? Then
>>> yarn.application.classpath should be what you need. It looks like
>>> /etc/hadoop/conf,/some/lib/dir/*,/some/other/lib/dir/* etc. Is that
>>> what you're trying?
>>>
>>> On Fri, Jan 23, 2015 at 1:56 PM, Marc Reichman <mreich...@pixelforensics.com> wrote:
>>>> My apologies if this is covered somewhere; I've done a lot of
>>>> searching and come up dry. I am migrating a set of applications from
>>>> Hadoop 1.0.3/Accumulo 1.4.1 to Hadoop 2.6.0/Accumulo 1.6.1. The
>>>> applications are launched by my custom Java apps, using the Hadoop
>>>> Tool/Configured interface setup, not a big deal.
>>>> To run MR jobs with AccumuloInputFormat/OutputFormat, in 1.0 I could
>>>> use tool.sh to launch the programs, which worked great for local
>>>> on-cluster launching. I however needed to launch from remote hosts
>>>> (maybe even Windows ones), and I would bundle a large lib dir with
>>>> everything I needed on the client side, and fill out HADOOP_CLASSPATH
>>>> in hadoop-env.sh with everything I needed (basically copied the
>>>> output of accumulo classpath). This would work for remote
>>>> submissions, or even local ones, but specifically using my Java mains
>>>> to launch them without any Accumulo or Hadoop wrapper scripts.
>>>> In YARN MR 2.6 this doesn't seem to work. No matter what I do, I
>>>> can't seem to get a normal Java app to have the 2.x MR Application
>>>> Master pick up the Accumulo items in the classpath, and my jobs fail
>>>> with ClassNotFound exceptions. tool.sh works just fine, but again, I
>>>> need to be able to submit without that environment. I have tried (on
>>>> the cluster):
>>>>
>>>> - HADOOP_CLASSPATH in hadoop-env.sh
>>>> - HADOOP_CLASSPATH from .bashrc
>>>> - yarn.application.classpath in yarn-site.xml
>>>>
>>>> I don't mind using tool.sh locally, it's quite nice, but I need a
>>>> strategy to have the cluster set up so I can just launch java, set my
>>>> appropriate Hadoop configs for remote fs and yarn hosts, get my
>>>> Accumulo connections and in/out setup for mapreduce, and launch jobs
>>>> which have Accumulo awareness. Any ideas?
>>>> Thanks,
>>>> Marc
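For reference, the winning property can also live in mapred-site.xml instead of being passed on the command line. The snippet below is illustrative only: the jar paths are assumptions and would need to match the actual Hadoop/Accumulo/ZooKeeper install locations on the cluster nodes:

```xml
<!-- mapred-site.xml: example values only; adjust paths to your install. -->
<property>
  <name>mapreduce.application.classpath</name>
  <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,/opt/accumulo/lib/accumulo-core.jar,/opt/accumulo/lib/accumulo-fate.jar,/opt/accumulo/lib/accumulo-trace.jar,/opt/zookeeper/zookeeper.jar</value>
</property>
```

Setting it server-side avoids every remote client having to pass -Dmapreduce.application.classpath, at the cost of the cluster config pinning the jar layout.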
Re: submission w/classpath without tool.sh?
That's correct, I don't really want to have the client package up every Accumulo and ZooKeeper jar I need in dcache or a fat jar or whatever, just to run stuff from a remote client when the jars are all there. I did try yarn.application.classpath, but I didn't spell out the whole thing. Next try I will put all those jars in explicitly instead of the dir wildcards. I will update how it goes.

On Fri, Jan 23, 2015 at 5:19 PM, Billie Rinaldi <bil...@apache.org> wrote:
> You have all the jars your app needs on both the servers and the client
> (as opposed to wanting Yarn to distribute them)? Then
> yarn.application.classpath should be what you need. It looks like
> /etc/hadoop/conf,/some/lib/dir/*,/some/other/lib/dir/* etc. Is that what
> you're trying?
>
> On Fri, Jan 23, 2015 at 1:56 PM, Marc Reichman <mreich...@pixelforensics.com> wrote:
>> My apologies if this is covered somewhere; I've done a lot of searching
>> and come up dry. I am migrating a set of applications from Hadoop
>> 1.0.3/Accumulo 1.4.1 to Hadoop 2.6.0/Accumulo 1.6.1.
Hadoop Summit (San Jose June 3-5)
Will anyone be there? I wouldn't mind meeting up for a drink, talk about Accumulo, projects, etc. Looking forward to coming to my first Hadoop-based conference! Marc
accessing accumulo row in mapper setup method?
Hello,

I am running a search job of a single piece of query data against potential targets in an Accumulo table, using AccumuloRowInputFormat. In most cases, the query data itself is also in the same Accumulo table. To date, my client program has pulled the query data from Accumulo using a basic scanner, stored the data in HDFS, and added the file(s) in question to the distributed cache. My mapper then pulls the data from the distributed cache into a private class member in its setup method and uses it in all of the map calls.

I had a thought that maybe I'm spending a bit too much overhead on the client side doing this, and that my job submission performance is slow because of all of the HDFS I/O and distributed cache handling for arguably small files, in the 100-200k range max. Does it seem like a reasonable idea to skip the preparation on the client side, and have the mapper pull the data directly from Accumulo in its setup method instead?

Questions related to this:

1. Does this put a lot of pressure on the tabletserver which contains the data, to have many mappers hitting it at once during setup for the first wave?
2. Is there any way whatsoever for the mapper to use the existing client connection already being made? Or would I have to do the usual setup with my own zookeeper connection, and if so, does that make for a much worse performance impact?

Thanks,
Marc
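The "pull it in setup()" idea can be sketched as below. This is a hedged sketch only: the configuration keys, table name, and single-entry-row assumption are all made up, and each mapper opens its own ZooKeeper-backed connection (I'm not aware of a supported way to reuse the input format's internal client from user code):

```java
import java.io.IOException;
import java.util.Map;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.accumulo.core.util.PeekingIterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class QueryMapper extends Mapper<Text, PeekingIterator<Map.Entry<Key, Value>>, Text, Text> {
    private byte[] queryData;

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        try {
            // Separate client connection per mapper task.
            Connector conn = new ZooKeeperInstance(
                    conf.get("my.accumulo.instance"),      // hypothetical keys
                    conf.get("my.accumulo.zookeepers"))
                .getConnector(conf.get("my.accumulo.user"),
                    new PasswordToken(conf.get("my.accumulo.password")));

            Scanner scanner = conn.createScanner("querytable", Authorizations.EMPTY);
            scanner.setRange(Range.exact(conf.get("my.query.rowid")));
            for (Map.Entry<Key, Value> e : scanner) {
                queryData = e.getValue().get();  // assumes a single-entry row
            }
        } catch (Exception e) {
            throw new IOException("failed to load query data in setup()", e);
        }
    }
}
```

On the pressure question: all first-wave mappers would hit the one tablet hosting the query row at roughly the same time, so the scan fans out poorly by design; for files this small, that burst is the trade-off against the distributed cache's up-front cost.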
Re: Getting the IP Address
Just tested. Does not work.

On Wed, Aug 28, 2013 at 11:53 AM, Eric Newton <eric.new...@gmail.com> wrote:
> Does hostname -i work on a mac? Not being a mac user, I can't check.
> -Eric
>
> On Wed, Aug 28, 2013 at 11:38 AM, Ravi Mutyala <r...@hortonworks.com> wrote:
>> Hi,
>> I see from the accumulo-tracer init.d script that the IP is determined
>> by this logic:
>>
>>   ifconfig | grep inet[^6] | awk '{print $2}' | sed 's/addr://' | grep -v 0.0.0.0 | grep -v 127.0.0.1 | head -n 1
>>
>> Any reason for using this logic instead of hostname -i and a reverse DNS
>> lookup? I have a cluster where the order of NICs on one of the nodes is
>> different, and ifconfig returns an IP from a different subnet than on
>> the other nodes. But DNS and reverse DNS are properly configured.
>> Thanks
>>
>> CONFIDENTIALITY NOTICE: This message is intended for the use of the
>> individual or entity to which it is addressed and may contain
>> information that is confidential, privileged and exempt from disclosure
>> under applicable law. If the reader of this message is not the intended
>> recipient, you are hereby notified that any printing, copying,
>> dissemination, distribution, disclosure or forwarding of this
>> communication is strictly prohibited. If you have received this
>> communication in error, please contact the sender immediately and
>> delete it from your system. Thank You.
Re: Filtering on column qualifier
Extending looked like a bit of a boondoggle, because all of the useful fields in the class are private, not protected. I also ran into another architectural question: how does one pass a value (a la a constructor) into one of these classes? If I'm going to use this to filter based on a threshold, I'd need to pass that threshold in somehow. On Wed, Aug 21, 2013 at 9:49 AM, John Vines vi...@apache.org wrote: There's no way to extend the ColumnQualifierFilter via configuration, but it sounds like you are on top of it. You just need to extend the class, possibly copy a bit of code, and change the equality check to a compareTo after converting the Strings to Doubles. On Wed, Aug 21, 2013 at 10:00 AM, Marc Reichman mreich...@pixelforensics.com wrote: I have some data stored in Accumulo with some scores stored as column qualifiers (there was an older thread about this). I would like to find a way to do thresholding when retrieving the data without retrieving it all and then manually filtering out items below my threshold. I know I can fetch column qualifiers which are exact. I've seen the ColumnQualifierFilter, which I assume is what's in play when I fetch qualifiers. Is there a reasonable pattern to extend this and try to use it as a scan iterator so I can do things like greater-than against a value interpreted as a Double, vs. the string equality going on now? Thanks, Marc
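The change John describes, swapping the filter's equality check for a numeric comparison, boils down to a few lines. This standalone sketch shows just that comparison; in a real iterator it would live in the accept(Key, Value) method of a class extending org.apache.accumulo.core.iterators.Filter, with the threshold read from the iterator options map in init(). The Accumulo plumbing is omitted here so the logic stands alone:

```java
// Standalone model of the accept() logic for a threshold filter: parse the
// column qualifier as a double and keep entries at or above the threshold.
// Only the comparison is shown; the Accumulo Filter wiring is omitted.
public class ThresholdAccept {
    static boolean accept(String columnQualifier, double threshold) {
        try {
            return Double.parseDouble(columnQualifier) >= threshold;
        } catch (NumberFormatException e) {
            return false; // drop entries whose qualifier is not a number
        }
    }

    public static void main(String[] args) {
        System.out.println(accept("0.87", 0.5)); // prints true
        System.out.println(accept("0.25", 0.5)); // prints false
    }
}
```

Comparing parsed doubles rather than strings is the whole point: as strings, "9" sorts after "10", which is why the stock string-equality filter can't express a numeric threshold.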
Re: Filtering on column qualifier
I haven't considered that. Would that allow me to specify it in the client-side code and not worry about spreading JARs around? It is a very basic need; in my scan iterator loop right now I have:

String matchScoreString = key.getColumnQualifier().toString();
Double score = Double.parseDouble(matchScoreString);
if (threshold != null && threshold > score) {
    // TODO: figure out if this is possible to do via a data-local scan iterator
    continue;
}

What is the pattern for including a groovy snippet for a scan iterator? On Thu, Aug 22, 2013 at 11:16 AM, David Medinets david.medin...@gmail.com wrote: Have you thought of writing a filter class that takes some bit of groovy for execution inside the accept method, depending on how efficient you need to be and how changeable your constraints are. On Thu, Aug 22, 2013 at 10:19 AM, Marc Reichman mreich...@pixelforensics.com wrote: Extending looked like a bit of a boondoggle, because all of the useful fields in the class are private, not protected. I also ran into another architectural question: how does one pass a value (a la a constructor) into one of these classes? If I'm going to use this to filter based on a threshold, I'd need to pass that threshold in somehow. On Wed, Aug 21, 2013 at 9:49 AM, John Vines vi...@apache.org wrote: There's no way to extend the ColumnQualifierFilter via configuration, but it sounds like you are on top of it. You just need to extend the class, possibly copy a bit of code, and change the equality check to a compareTo after converting the Strings to Doubles. On Wed, Aug 21, 2013 at 10:00 AM, Marc Reichman mreich...@pixelforensics.com wrote: I have some data stored in Accumulo with some scores stored as column qualifiers (there was an older thread about this). I would like to find a way to do thresholding when retrieving the data without retrieving it all and then manually filtering out items below my threshold. I know I can fetch column qualifiers which are exact.
I've seen the ColumnQualifierFilter, which I assume is what's in play when I fetch qualifiers. Is there a reasonable pattern to extend this and try to use it as a scan iterator so I can do things like greater than a value which will be interpreted as a Double vs. the string equality going on now? Thanks, Marc
Re: Filtering on column qualifier
I apologize for my dense-ness, but could you walk me through this? Is there some form of existing scan iterator which interprets groovy? Or is this something I would build? On Thu, Aug 22, 2013 at 12:10 PM, David Medinets david.medin...@gmail.com wrote: The advantage is that you'd only write the iterator once and deploy it to the cluster. Then the groovy snippet changes its behavior. You'd save passing the data to your client code, but more work would be done by the accumulo cluster. On Thu, Aug 22, 2013 at 12:33 PM, Marc Reichman mreich...@pixelforensics.com wrote: I haven't considered that. Would that allow me to specify it in the client-side code and not worry about spreading JARs around? It is a very basic need; in my scan iterator loop right now I have:

String matchScoreString = key.getColumnQualifier().toString();
Double score = Double.parseDouble(matchScoreString);
if (threshold != null && threshold > score) {
    // TODO: figure out if this is possible to do via a data-local scan iterator
    continue;
}

What is the pattern for including a groovy snippet for a scan iterator? On Thu, Aug 22, 2013 at 11:16 AM, David Medinets david.medin...@gmail.com wrote: Have you thought of writing a filter class that takes some bit of groovy for execution inside the accept method, depending on how efficient you need to be and how changeable your constraints are. On Thu, Aug 22, 2013 at 10:19 AM, Marc Reichman mreich...@pixelforensics.com wrote: Extending looked like a bit of a boondoggle, because all of the useful fields in the class are private, not protected. I also ran into another architectural question: how does one pass a value (a la a constructor) into one of these classes? If I'm going to use this to filter based on a threshold, I'd need to pass that threshold in somehow. On Wed, Aug 21, 2013 at 9:49 AM, John Vines vi...@apache.org wrote: There's no way to extend the ColumnQualifierFilter via configuration, but it sounds like you are on top of it.
You just need to extend the class, possibly copy a bit of code, and change the equality check to a compareTo after converting the Strings to Doubles. On Wed, Aug 21, 2013 at 10:00 AM, Marc Reichman mreich...@pixelforensics.com wrote: I have some data stored in Accumulo with some scores stored as column qualifiers (there was an older thread about this). I would like to find a way to do thresholding when retrieving the data without retrieving it all and then manually filtering out items below my threshold. I know I can fetch column qualifiers which are exact. I've seen the ColumnQualifierFilter, which I assume is what's in play when I fetch qualifiers. Is there a reasonable pattern to extend this and try to use it as a scan iterator so I can do things like greater than a value which will be interpreted as a Double vs. the string equality going on now? Thanks, Marc
Filtering on column qualifier
I have some data stored in Accumulo with some scores stored as column qualifiers (there was an older thread about this). I would like to find a way to do thresholding when retrieving the data without retrieving it all and then manually filtering out items below my threshold. I know I can fetch column qualifiers which are exact. I've seen the ColumnQualifierFilter, which I assume is what's in play when I fetch qualifiers. Is there a reasonable pattern to extend this and try to use it as a scan iterator so I can do things like greater than a value which will be interpreted as a Double vs. the string equality going on now? Thanks, Marc
Re: deletion technique question
The 1.5 solution looks nice. I'm aware of the potential data-loss issue now, and the sort ordering is also an interesting angle, thank you. In my particular case, where I may not necessarily be aware of all permutations of column visibility of a given key but want to replace them all with a particular new visibility with the same data, how would I go about that? Is there a way to use a batchscanner (step 1 of the batchdeleter approach) to pull down all the permutations, then putDeletes for them and put what I want? In my case I'm pulling one copy of the data down first to verify I have it at the user's current scan auth, then using the #1 approach to clear it out and then put it in again with the vis I need. On Mon, May 13, 2013 at 10:05 AM, Keith Turner ke...@deenlo.com wrote: On Fri, May 10, 2013 at 12:39 PM, Marc Reichman mreich...@pixelforensics.com wrote: I have a table with rows which have 3 column values in one column family, and a column visibility. There are situations where I will want to replace the row content with a new column visibility; I understand that the visibility attributes are immutable, so I will have to delete and re-put. Am I better off doing: 1. BatchDeleter with authorizations to allow access, set range to the key in question, call delete, and then put in mutations with the new visibility 2. Create mutations with a putDelete followed by a put with the new visibility for each value 3. Something else entirely? In 1.5, you can use ACCUMULO-956 For option #2, can I simply do a putDelete on the column family/qualifier? Or do I need to know the old authorizations to put in a visibility expression with the putDelete? For all of these, can a client get up-to-the-minute results immediately after? Or does some kind of compaction need to occur first? If you send a mutation with a delete and put, the client will be able to see it after the batchwriter flushes or closes. No compaction needed. I am a little fuzzy on #1.
Will you delete everything in one pass (using the batchdeleter), and then do another pass writing data w/ updated colvis? If so, this would seem to imply that you are pulling the data from another source (other than the table the stuff was deleted from)? Make sure the method you choose is not susceptible to data loss in the event that the client dies. For example, if a client was reading a table and then writing a delete and update mutation for each key/val read, and the client died after some deletes were written but not the corresponding updates, then that data would not be seen to be transformed on the second run. When you change the colvis, you change the sort order. If you read a key K and change it to K', where K' sorts after K, and you insert K', it's possible that you may read it again: it's being inserted in front of the scanner's pointer. Because of buffering in the batch writer and scanner, this would not always occur, but it would occur occasionally. Something to be aware of.
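Keith's second caveat can be shown with a toy model. The column visibility is part of the key, so an entry rewritten with a new visibility becomes a different key K' that can sort after the original K, landing in front of a forward-moving scanner. Keys are modeled here as delimiter-joined strings, and GROUP_A/GROUP_B are made-up visibility labels; real Accumulo keys also order on timestamp, which this sketch ignores:

```java
// Toy model of visibility-dependent key ordering. A key is modeled as
// row\0cf\0cq\0vis; compareTo on the joined string stands in for
// Accumulo's lexicographic key comparison.
public class VisSortOrder {
    static String key(String row, String cf, String cq, String vis) {
        return row + '\0' + cf + '\0' + cq + '\0' + vis;
    }

    public static void main(String[] args) {
        String k      = key("row1", "fam", "score", "GROUP_A"); // original entry
        String kPrime = key("row1", "fam", "score", "GROUP_B"); // rewritten vis
        // GROUP_B sorts after GROUP_A, so K' lands ahead of a scanner that is
        // positioned just past K and may be read again on the same pass.
        System.out.println(k.compareTo(kPrime) < 0); // prints true
    }
}
```

Had the new visibility sorted before the old one, the rewritten entry would land behind the scanner instead and never be re-read, which is why the hazard only shows up for some visibility changes.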
deletion technique question
I have a table with rows which have 3 column values in one column family, and a column visibility. There are situations where I will want to replace the row content with a new column visibility; I understand that the visibility attributes are immutable, so I will have to delete and re-put. Am I better off doing: 1. BatchDeleter with authorizations to allow access, set range to the key in question, call delete, and then put in mutations with the new visibility 2. Create mutations with a putDelete followed by a put with the new visibility for each value 3. Something else entirely? For option #2, can I simply do a putDelete on the column family/qualifier? Or do I need to know the old authorizations to put in a visibility expression with the putDelete? For all of these, can a client get up-to-the-minute results immediately after? Or does some kind of compaction need to occur first?
Re: deletion technique question
The only limitation with the approach that I can see is that I may not know every permutation of visibility on a given key, and with the scan-driven approach I can use the user's entire authorization set as a way to get all of the rows for deletion. Thanks, Marc On Fri, May 10, 2013 at 2:19 PM, Christopher ctubb...@apache.org wrote: The BatchDeleter is essentially a BatchScanner with the SortedKeyIterator (which drops values from the returned entries... they aren't needed to delete), and a BatchWriter that inserts a delete entry in a mutation for every entry the scanner sees. You can, and should, select option 2, because you're better off sending two column updates in each mutation rather than send twice as many mutations, as you'd be doing for option 1. -- Christopher L Tubbs II http://gravatar.com/ctubbsii On Fri, May 10, 2013 at 12:39 PM, Marc Reichman mreich...@pixelforensics.com wrote: I have a table with rows which have 3 column values in one column family, and a column visibility. There are situations where I will want to replace the row content with a new column visibility; I understand that the visibility attributes are immutable, so I will have to delete and re-put. Am I better off doing: 1. BatchDeleter with authorizations to allow access, set range to the key in question, call delete, and then put in mutations with the new visibility 2. Create mutations with a putDelete followed by a put with the new visibility for each value 3. Something else entirely? For option #2, can I simply do a putDelete on the column family/qualifier? Or do I need to know the old authorizations to put in a visibility expression with the putDelete? For all of these, can a client get up-to-the-minute results immediately after? Or does some kind of compaction need to occur first?
Re: remote accumulo instance issue
These are from the client machine: (9997 on a tserver)

[mreichman@packers: ~]$ nmap -p 9997 192.168.1.162
Starting Nmap 5.51 ( http://nmap.org ) at 2013-05-08 16:35 ric
Nmap scan report for giants.home (192.168.1.162)
Host is up (0.0063s latency).
PORT STATE SERVICE
9997/tcp open unknown
MAC Address: 7A:79:C0:A8:01:A2 (Unknown)
Nmap done: 1 IP address (1 host up) scanned in 0.60 seconds

(2181 zookeeper on the master)

[mreichman@packers: ~]$ nmap -p 2181 192.168.1.160
Starting Nmap 5.51 ( http://nmap.org ) at 2013-05-08 16:35 ric
Nmap scan report for padres.home (192.168.1.160)
Host is up (0.0071s latency).
PORT STATE SERVICE
2181/tcp open unknown
MAC Address: 7A:79:C0:A8:01:A0 (Unknown)
Nmap done: 1 IP address (1 host up) scanned in 0.56 seconds

Any chance it could be anything related to DNS or reverse DNS? On Wed, May 8, 2013 at 10:25 AM, John Vines vi...@apache.org wrote: Is that remote instance behind a firewall or anything like that? On Wed, May 8, 2013 at 11:09 AM, Marc Reichman mreich...@pixelforensics.com wrote: I have seen this as ticket ACCUMULO-687 which has been marked resolved, but I still see this issue.
I am connecting to a remote accumulo instance to query and to launch mapreduce jobs using AccumuloRowInputFormat, and I'm seeing an error like: 91 [main-SendThread(padres.home:2181)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established to padres.home/192.168.1.160:2181, initiating session 166 [main-SendThread(padres.home:2181)] INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server padres.home/192.168.1.160:2181, sessionid = 0x13e7b48f9d17af7, negotiated timeout = 3 1889 [main] WARN org.apache.accumulo.core.client.impl.ServerClient - Failed to find an available server in the list of servers: [192.168.1.164:9997:9997 (12), 192.168.1.192:9997:9997 (12), 192.168.1.194:9997:9997 (12), 192.168.1.162:9997:9997 (12), 192.168.1.190:9997:9997 (12), 192.168.1.166:9997:9997 (12), 192.168.1.168:9997:9997 (12), 192.168.1.196:9997:9997 (12)] My zookeeper's tservers key looks like: [zk: localhost:2181(CONNECTED) 1] ls /accumulo/908a756e-1c81-4bea-a4de-675456499a10/tservers [192.168.1.164:9997, 192.168.1.192:9997, 192.168.1.194:9997, 192.168.1.162:9997, 192.168.1.190:9997, 192.168.1.166:9997, 192.168.1.168:9997, 192.168.1.196:9997] My masters and slaves file look like: [hadoop@padres conf]$ cat masters 192.168.1.160 [hadoop@padres conf]$ cat slaves 192.168.1.162 192.168.1.164 192.168.1.166 192.168.1.168 192.168.1.190 192.168.1.192 192.168.1.194 192.168.1.196 tracers, gc, and monitor are the same as masters. I have no issues executing on the master, but I would like to work from a remote host. The remote host is on a VPN, and its default resolver is NOT the resolver from the remote network. If I do reverse lookup over the VPN *using* the remote resolver it shows proper hostnames. My concern is that something is causing the host:port entry plus the port to come up with this concatenated view of host:port:port, which is obviously not going to work. What else can I try? I previously had hostnames in the masters/slaves/etc. 
files but now have the IPs. Should I re-init the instance to see if it changes anything in zookeeper?
Re: remote accumulo instance issue
All, My apologies. This seemed to be a JAR mismatch error. No more problems. Sorry for the drill. Marc On Wed, May 8, 2013 at 11:45 AM, Marc Reichman mreich...@pixelforensics.com wrote: 1.4.1., hadoop 1.0.3. Just for sanity, I ran 'accumulo classpath' on the cluster and am copying those exact files to my client side in case there was a mismatch somewhere. On Wed, May 8, 2013 at 11:43 AM, John Vines vi...@apache.org wrote: What version of Accumulo are you running? Sent from my phone, please pardon the typos and brevity. On May 8, 2013 12:38 PM, Marc Reichman mreich...@pixelforensics.com wrote: I can't find anything wrong with the networking. Here is the whole error with stack trace: 2057 [main] WARN org.apache.accumulo.core.client.impl.ServerClient - Failed to find an available server in the list of servers: [192.168.1.164:9997:9997 (12), 192.168.1.192:9997:9997 (12), 192.168.1.194:9997:9997 (12), 192.168.1.162:9997:9997 (12), 192.168.1.190:9997:9997 (12), 192.168.1.166:9997:9997 (12), 192.168.1.168:9997:9997 (12), 192.168.1.196:9997:9997 (12)] Exception in thread main java.lang.IncompatibleClassChangeError: Implementing class at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631) at java.lang.ClassLoader.defineClass(ClassLoader.java:615) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141) at java.net.URLClassLoader.defineClass(URLClassLoader.java:283) at java.net.URLClassLoader.access$000(URLClassLoader.java:58) at java.net.URLClassLoader$1.run(URLClassLoader.java:197) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) at org.apache.accumulo.core.client.impl.ServerClient.getConnection(ServerClient.java:146) at 
org.apache.accumulo.core.client.impl.ServerClient.getConnection(ServerClient.java:123) at org.apache.accumulo.core.client.impl.ServerClient.executeRaw(ServerClient.java:105) at org.apache.accumulo.core.client.impl.ServerClient.execute(ServerClient.java:71) at org.apache.accumulo.core.client.impl.ConnectorImpl.init(ConnectorImpl.java:75) at org.apache.accumulo.core.client.ZooKeeperInstance.getConnector(ZooKeeperInstance.java:218) at org.apache.accumulo.core.client.ZooKeeperInstance.getConnector(ZooKeeperInstance.java:206) Running on JDK 1.6.0_27 On Wed, May 8, 2013 at 10:38 AM, Keith Turner ke...@deenlo.com wrote: On Wed, May 8, 2013 at 11:09 AM, Marc Reichman mreich...@pixelforensics.com wrote: I have seen this as ticket ACCUMULO-687 which has been marked resolved, but I still see this issue. I am connecting to a remote accumulo instance to query and to launch mapreduce jobs using AccumuloRowInputFormat, and I'm seeing an error like: 91 [main-SendThread(padres.home:2181)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established to padres.home/192.168.1.160:2181, initiating session 166 [main-SendThread(padres.home:2181)] INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server padres.home/192.168.1.160:2181, sessionid = 0x13e7b48f9d17af7, negotiated timeout = 3 1889 [main] WARN org.apache.accumulo.core.client.impl.ServerClient - Failed to find an available server in the list of servers: [192.168.1.164:9997:9997 (12), 192.168.1.192:9997:9997 (12), 192.168.1.194:9997:9997 (12), 192.168.1.162:9997:9997 (12), 192.168.1.190:9997:9997 (12), 192.168.1.166:9997:9997 (12), 192.168.1.168:9997:9997 (12), 192.168.1.196:9997:9997 (12)] My zookeeper's tservers key looks like: [zk: localhost:2181(CONNECTED) 1] ls /accumulo/908a756e-1c81-4bea-a4de-675456499a10/tservers [192.168.1.164:9997, 192.168.1.192:9997, 192.168.1.194:9997, 192.168.1.162:9997, 192.168.1.190:9997, 192.168.1.166:9997, 192.168.1.168:9997, 192.168.1.196:9997] My masters and 
slaves files look like: [hadoop@padres conf]$ cat masters 192.168.1.160 [hadoop@padres conf]$ cat slaves 192.168.1.162 192.168.1.164 192.168.1.166 192.168.1.168 192.168.1.190 192.168.1.192 192.168.1.194 192.168.1.196 tracers, gc, and monitor are the same as masters. I have no issues executing on the master, but I would like to work from a remote host. The remote host is on a VPN, and its default resolver is NOT the resolver from the remote network. If I do reverse lookup over the VPN *using* the remote resolver it shows proper hostnames. My concern is that something is causing the host:port entry plus the port to come up with this concatenated view of host:port:port, which is obviously not going to work. The second port is nothing to worry about. It's created by concatenating what came from zookeeper
Change/modify column visibility of an existing row?
Is there a way via the Java API to modify the column visibility of an existing row without having to put a new column along-side? Or are those immutable? I realize I can delete and re-put the data with new visibility. Thanks, Marc -- http://saucyandbossy.wordpress.com
Re: Change/modify column visibility of an existing row?
Thank you. I felt that was the case and didn't see anything to sway me otherwise, but I figured I'd ask as it came up in the design of a tool using Accumulo. On Mon, Apr 22, 2013 at 11:05 AM, John Vines vi...@apache.org wrote: All Keys in Accumulo are immutable, including the visibility fields. So the only way to change one is to delete and insert. On Mon, Apr 22, 2013 at 12:00 PM, Marc Reichman marcreich...@gmail.com wrote: Is there a way via the Java API to modify the column visibility of an existing row without having to put a new column along-side? Or are those immutable? I realize I can delete and re-put the data with new visibility. Thanks, Marc -- http://saucyandbossy.wordpress.com -- http://saucyandbossy.wordpress.com
increase running scans in monitor?
Hello, I am running an Accumulo-based MR job using the AccumuloRowInputFormat on 1.4.1. Config is more or less default, using the native-standalone 3GB template, but with the tserver memory raised to 2GB in accumulo-env.sh from its default. accumulo-site.xml has tserver.memory.maps.max at 1G, tserver.cache.data.size at 50M, and tserver.cache.index.size at 512M. My tables are created with maxversions for all three types (scan, minc, majc) at 1 and compress type gz. I am finding, on an 8-node test cluster with 64 map task slots, that when a job is running, the 'Running Scans' count in the monitor is roughly 0-4 on average for each tablet server. Viewed at the table level, this puts the running scans anywhere from 4-24 on average. I would expect/hope the scans to be somewhere close to the map task count. To me, this means one of the following:

1. There is a configuration setting inhibiting the number of scans from accumulating (excuse the pun) to about the same amount as my map tasks
2. My map task job is cpu-intensive enough to introduce delays between scans and everything is fine
3. Some combination of 1 and 2.

On an alternate cluster, 40 nodes with 320 task slots, we haven't seen anywhere near full-capacity scanning with map tasks which have the same performance, and the problem seems much worse. I am experimenting with some of the readahead configuration variables for the tablet servers in the meantime, but haven't found any smoking guns yet. Thank you, Marc -- http://saucyandbossy.wordpress.com
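The readahead experiments mentioned above typically center on the tablet server's concurrent readahead cap. A hypothetical accumulo-site.xml fragment, for illustration only — the value 32 is not a recommendation, and the property and its default should be verified against the documentation for your Accumulo version:

```xml
<!-- Hypothetical tuning fragment: raise the cap on concurrent readahead
     scans per tablet server so more simultaneous map-task scans can make
     progress. Illustrative value; check your version's docs. -->
<property>
  <name>tserver.readahead.concurrent.max</name>
  <value>32</value>
</property>
```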
Re: increase running scans in monitor?
Hi Josh, Thanks for writing back. I am doing all explicit splits using addSplits in the Java API since the keyspace is easy to divide evenly. Depending on the table size for some of these experiments, I've had 128 splits, 256, 512, or 1024 splits. My jobs are executing properly, MR-wise, in the sense that I do have a proper amount of map tasks created (as the count of splits above, respectively). My concern is that the jobs may not be quite as busy as they can be, dataflow-wise and I think the Running Scans per table/tablet server seem to be good indicators of that. My data is a 32-byte key (an md5 value), and I have one column family with 3 columns which contain bigger data, anywhere from 50-100k to an occasional 10M-15M piece. On Tue, Apr 2, 2013 at 10:06 AM, Josh Elser josh.el...@gmail.com wrote: Hi Marc, How many tablets are in the table you're running MR over (see the monitor)? Might adding some more splits to your table (`addsplits` in the Accumulo shell) get you better parallelism? What does your data look like in your table? Lots of small rows? Few very large rows? On 4/2/13 10:56 AM, Marc Reichman wrote: Hello, I am running a accumulo-based MR job using the AccumuloRowInputFormat on 1.4.1. Config is more-or-less default, using the native-standalone 3GB template, but with the TServer memory put up to 2GB in accumulo-env.sh from its default. accumulo-site.xml has tserver.memory.maps.max at 1G, tserver.cache.data.size at 50M, and tserver.cache.index.size at 512M. My tables are created with maxversions for all three types (scan, minc, majc) at 1 and compress type as gz. I am finding, on an 8 node test cluster with 64 map task slots, that when a job is running, the 'Running Scans' count in the monitor is roughly 0-4 on average for each tablet server. When viewed at the table view, this puts the running scans anywhere from 4-24 on average. I would expect/hope the scans to be somewhere close to the map task count. To me, this means one of the following. 1. 
There is a configuration setting inhibiting the amount of scans from accumulating (excuse the pun) to about the same amount as my map tasks 2. My map task job is cpu-intensive enough to introduce delays between scans and everything is fine 3. Some combination of 1/2. On an alternate cluster, 40 nodes with 320 task slots, we haven't seen anywhere near full capacity scanning with map tasks which have the same performance, and the problem seems much worse. I am experimenting with some of the readahead configuration variables for the tablet servers in the meantime, but haven't found any smoking guns yet. Thank you, Marc -- http://saucyandbossy.wordpress.com -- http://saucyandbossy.wordpress.com
Re: increase running scans in monitor?
I apologize, I neglected to include row counts. For the above split sizes mentioned, there are roughly ~55K rows, ~300K rows, ~800K rows, and ~2M rows. I'm not necessarily hard-set on the idea that lower running scans are affecting my overall job time negatively, and I realize that my jobs themselves may simply be starving the tablet servers (cpu-wise). In my experiences thus-far, running all 8 CPU cores per node leads to an overall quicker job completion than pulling one core out of the mix to let accumulo itself have more breathing room. On Tue, Apr 2, 2013 at 10:20 AM, Marc Reichman marcreich...@gmail.comwrote: Hi Josh, Thanks for writing back. I am doing all explicit splits using addSplits in the Java API since the keyspace is easy to divide evenly. Depending on the table size for some of these experiments, I've had 128 splits, 256, 512, or 1024 splits. My jobs are executing properly, MR-wise, in the sense that I do have a proper amount of map tasks created (as the count of splits above, respectively). My concern is that the jobs may not be quite as busy as they can be, dataflow-wise and I think the Running Scans per table/tablet server seem to be good indicators of that. My data is a 32-byte key (an md5 value), and I have one column family with 3 columns which contain bigger data, anywhere from 50-100k to an occasional 10M-15M piece. On Tue, Apr 2, 2013 at 10:06 AM, Josh Elser josh.el...@gmail.com wrote: Hi Marc, How many tablets are in the table you're running MR over (see the monitor)? Might adding some more splits to your table (`addsplits` in the Accumulo shell) get you better parallelism? What does your data look like in your table? Lots of small rows? Few very large rows? On 4/2/13 10:56 AM, Marc Reichman wrote: Hello, I am running a accumulo-based MR job using the AccumuloRowInputFormat on 1.4.1. Config is more-or-less default, using the native-standalone 3GB template, but with the TServer memory put up to 2GB in accumulo-env.sh from its default. 
accumulo-site.xml has tserver.memory.maps.max at 1G, tserver.cache.data.size at 50M, and tserver.cache.index.size at 512M. My tables are created with maxversions for all three types (scan, minc, majc) at 1 and compress type as gz. I am finding, on an 8 node test cluster with 64 map task slots, that when a job is running, the 'Running Scans' count in the monitor is roughly 0-4 on average for each tablet server. When viewed at the table view, this puts the running scans anywhere from 4-24 on average. I would expect/hope the scans to be somewhere close to the map task count. To me, this means one of the following. 1. There is a configuration setting inhibiting the amount of scans from accumulating (excuse the pun) to about the same amount as my map tasks 2. My map task job is cpu-intensive enough to introduce delays between scans and everything is fine 3. Some combination of 1/2. On an alternate cluster, 40 nodes with 320 task slots, we haven't seen anywhere near full capacity scanning with map tasks which have the same performance, and the problem seems much worse. I am experimenting with some of the readahead configuration variables for the tablet servers in the meantime, but haven't found any smoking guns yet. Thank you, Marc -- http://saucyandbossy.wordpress.com -- http://saucyandbossy.wordpress.com -- http://saucyandbossy.wordpress.com