Re: AccumuloInputFormat and data locality for jobs that don't need keys sorted

2016-08-02 Thread Marc Reichman
Hi Mario,

In my experience, the performance/locality benefit from
AccumuloInputFormat/AccumuloRowInputFormat tends to diminish as you
add a lot of ranges. In some cases, I've found there's an efficiency curve
you can experiment with, where it's sometimes faster to just locally throw
out data vs. using many ranges. I've been using Accumulo with classic
MapReduce, YARN MapReduce, and Spark for a while, and this has held true on
all of those platforms.
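
A rough, untested sketch of the "throw out data locally" approach, assuming a
configured Job named job, a JavaSparkContext named sparkContext, and a
hypothetical wantedRows set of row ids (all names here are stand-ins):

final Set<String> wantedRows = new HashSet<String>(Arrays.asList("row1", "row2")); // hypothetical
// Scan one broad range instead of handing AccumuloInputFormat many small ranges.
AccumuloInputFormat.setRanges(job, Collections.singleton(new Range("a", "z")));    // hypothetical range

JavaPairRDD<Key, Value> all = sparkContext.newAPIHadoopRDD(
        job.getConfiguration(), AccumuloInputFormat.class, Key.class, Value.class);

JavaPairRDD<Key, Value> wanted = all.filter(new Function<Tuple2<Key, Value>, Boolean>() {
    @Override
    public Boolean call(Tuple2<Key, Value> kv) {
        // Locally throw out rows we don't care about.
        return wantedRows.contains(kv._1.getRow().toString());
    }
});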

Good luck!

Marc

On Mon, Aug 1, 2016 at 6:55 PM, Mario Pastorelli <
mario.pastore...@teralytics.ch> wrote:

> I would like to use an Accumulo table as input for a Spark job. Let me
> clarify that my job doesn't need keys sorted and Accumulo is purely used to
> filter the input data thanks to its index on the keys. The data that I
> need to process in Spark is still a small portion of the full dataset.
> I know that Accumulo provides the AccumuloInputFormat, but in my tests
> almost no task has data locality when I use this input format, which leads
> to poor performance. I'm not sure why this happens, but my guess is that
> the AccumuloInputFormat creates one task per range.
> I wonder if there is a way to tell the AccumuloInputFormat to split
> each range into the sub-ranges local to each tablet server, so that each
> task in Spark will read only data from the same machines where it is
> running.
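
A rough, untested sketch of one way to get per-tablet ranges with the client
API, assuming a Connector named connector, a table named "mytable", and a
configured Job named job (all hypothetical); TableOperations.splitRangeByTablets
breaks a range at tablet boundaries:

Set<Range> perTablet = connector.tableOperations()
        .splitRangeByTablets("mytable", new Range("startRow", "stopRow"), Integer.MAX_VALUE);
// Hand the per-tablet ranges to the input format, aiming for one split per tablet.
AccumuloInputFormat.setRanges(job, perTablet);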
>
> Thanks for the help,
> Mario
>
> --
> Mario Pastorelli | TERALYTICS
>
> *software engineer*
>
> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
> phone: +41794381682
> email: mario.pastore...@teralytics.ch
> www.teralytics.net
>
> Company registration number: CH-020.3.037.709-7 | Trade register Canton
> Zurich
> Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz, Yann
> de Vries
>
>


Re: Feedback about techniques for tuning batch scanning for my problem

2016-05-19 Thread Marc Reichman
Hi Mario,

Not sure where this plays into your data integrity, but have you looked
into these settings in hdfs-site.xml?
dfs.client.read.shortcircuit
dfs.client.read.shortcircuit.skip.checksum
dfs.domain.socket.path

These make for a somewhat dramatic increase in HDFS read performance if
the data is distributed well enough around the cluster.

I can't speak as much to the scanner params, but you may look into these as
well.

Marc

On Thu, May 19, 2016 at 10:08 AM, Mario Pastorelli <
mario.pastore...@teralytics.ch> wrote:

> Hey people,
> I'm trying to tune the query performance a bit to see how fast it can go,
> and I thought it would be great to have comments from the community. The
> problem that I'm trying to solve in Accumulo is the following: we want to
> store the entities that have been in a certain location on a certain day.
> The location is a Long and the entity id is a Long. I want to be able to
> scan ~1M rows in a few seconds, possibly less than one. Right now, I'm
> doing the following things:
>
>    1. I'm using a sharding byte at the start of the rowId to keep the
>    data in the same range distributed across the cluster
>    2. all the records are encoded; a single record is composed of (see
>    the sketch below)
>       1. rowId: 1 shard byte + 3 bytes for the day
>       2. column family: 8 bytes for the long corresponding to the hash of
>       the location
>       3. column qualifier: 8 bytes corresponding to the identifier of the
>       entity
>       4. value: 2 bytes for some additional information
>    3. I use a batch scanner because I don't need sorting and it's faster
> As expected, it takes a few seconds to scan 1M rows, but now I'm wondering if
> I can improve it. My ideas are the following (sketched in code below):
>
>    1. set table.compaction.major.ratio to 1 because I don't care about
>    the ingestion performance and this should improve the query performance
>    2. pre-split tables to match the number of servers and then use a shard
>    byte as the first byte of the rowId. This should improve both writing and
>    reading the data because both should work in parallel, as far as I understand
>    3. enable the bloom filter on the table
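
A rough, untested sketch of items 1-3 with the Java API, assuming a Connector
named connector, a table named "locations", and 16 shards (all hypothetical):

// 1. Trade ingest performance for query performance.
connector.tableOperations().setProperty("locations", "table.compaction.major.ratio", "1");
// 3. Enable the bloom filter.
connector.tableOperations().setProperty("locations", "table.bloom.enabled", "true");
// 2. Pre-split on the leading shard byte, one split per shard.
SortedSet<Text> splits = new TreeSet<Text>();
for (int shard = 1; shard < 16; shard++) {
    splits.add(new Text(new byte[] {(byte) shard}));
}
connector.tableOperations().addSplits("locations", splits);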
>
> Do you think those ideas make sense? Furthermore, I have two questions:
>
>1. considering that a single entry is only 22 bytes but I'm going to
>scan ~1M records per query, do you think I should change the BatchScanner
>buffers somehow?
>2. anything else to improve the scan speed? Again, I don't care about
>the ingestion time
>
> Thanks for the help!
>
> --
> Mario Pastorelli | TERALYTICS
>
> *software engineer*
>
> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
> phone: +41794381682
> email: mario.pastore...@teralytics.ch
> www.teralytics.net
>
> Company registration number: CH-020.3.037.709-7 | Trade register Canton
> Zurich
> Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz, Yann
> de Vries
>
>


Re: CHANGES files

2015-06-10 Thread Marc Reichman
I'm not sure if this has been covered already, but the Jetbrains family of
products (IntelliJ, TeamCity, etc.) uses this mechanism for their detailed
release notes, linked right from their download pages:
https://confluence.jetbrains.com/display/TW/TeamCity+9.0.4+%28build+32407%29+Release+Notes

Can the Accumulo JIRA be linked to in such a public way? If so, that can
take care of most of the requirements, I'd wager.

Marc

On Wed, Jun 10, 2015 at 4:28 PM, Christopher ctubb...@apache.org wrote:

 On Wed, Jun 10, 2015 at 5:21 PM, Sean Busbey bus...@cloudera.com wrote:
 
 
  On Wed, Jun 10, 2015 at 4:14 PM, Christopher ctubb...@apache.org
 wrote:
 
 
 
  If we're going to keep doing this, I'd like to have a really good
  reason for why we should (which is more convincing than a preference
  for grep over JIRA). I'm not coming up with such a reason.
 
 
 
  The reason I heard Josh give is accessibility for folks who use our software
  but do not have access to our web pages, JIRA, or our git repository.
 
  I think that's a legitimate benefit, but like Josh I don't know how much
 the
  file effectively gets used in those spaces currently.
 

 True. That's a good point. But... even if that were true for some
 users (we're obviously lacking data on that), there's still the
 concern about its accuracy, and whether an issue number and summary is
 adequate to convey anything meaningful to those users. And, even if it
 were minimally meaningful and 100% accurate (which they aren't), does
 this benefit outweigh the burden? In any case, these individuals
 (presuming they exist) can likely just as easily generate a static
 up-to-date report, as needed, from JIRA at the time they download the
 release.



spark with AccumuloRowInputFormat?

2015-05-04 Thread Marc Reichman
Has anyone done any testing with Spark and AccumuloRowInputFormat? I have
no problem doing this for AccumuloInputFormat:

JavaPairRDD<Key, Value> pairRDD =
sparkContext.newAPIHadoopRDD(job.getConfiguration(),
    AccumuloInputFormat.class,
    Key.class, Value.class);

But I run into a snag trying to do a similar thing:

JavaPairRDD<Text, PeekingIterator<Map.Entry<Key, Value>>> pairRDD =
sparkContext.newAPIHadoopRDD(job.getConfiguration(),
    AccumuloRowInputFormat.class,
    Text.class, PeekingIterator.class);

The compilation error is (big, sorry):

Error:(141, 97) java: method newAPIHadoopRDD in class
org.apache.spark.api.java.JavaSparkContext cannot be applied to given
types;
  required: 
org.apache.hadoop.conf.Configuration,java.lang.Class<F>,java.lang.Class<K>,java.lang.Class<V>
  found: 
org.apache.hadoop.conf.Configuration,java.lang.Class<org.apache.accumulo.core.client.mapreduce.AccumuloRowInputFormat>,java.lang.Class<org.apache.hadoop.io.Text>,java.lang.Class<org.apache.accumulo.core.util.PeekingIterator>
  reason: inferred type does not conform to declared bound(s)
inferred: org.apache.accumulo.core.client.mapreduce.AccumuloRowInputFormat
bound(s): 
org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.io.Text,org.apache.accumulo.core.util.PeekingIterator>

I've tried a few things, the signature of the function is:

public <K, V, F extends org.apache.hadoop.mapreduce.InputFormat<K, V>>
JavaPairRDD<K, V> newAPIHadoopRDD(Configuration conf, Class<F> fClass,
Class<K> kClass, Class<V> vClass)

I guess it's having trouble with the format extending InputFormatBase with
its own additional generic parameters (the Map.Entry inside
PeekingIterator).

This may be an issue to chase with Spark vs Accumulo, unless something can
be tweaked on the Accumulo side or I could wrap the InputFormat with my own
somehow.
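
One possible workaround, untested and offered only as a sketch, is to erase the
generics with a raw cast and accept the unchecked warnings rather than wrap the
InputFormat:

@SuppressWarnings({"unchecked", "rawtypes"})
JavaPairRDD<Text, PeekingIterator> pairRDD = sparkContext.newAPIHadoopRDD(
        job.getConfiguration(),
        (Class) AccumuloRowInputFormat.class,   // raw cast sidesteps the declared bound check
        Text.class,
        PeekingIterator.class);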

Accumulo 1.6.1, Spark 1.3.1, JDK 7u71.

Stopping short of this, can anyone think of a good way to use
AccumuloInputFormat to get what I'm getting from the Row version in a
performant way? It doesn't necessarily have to be an iterator approach, but
I'd need all my values with the key in one consuming function. I'm looking
into ways to do it in spark functions but trying to avoid any major
performance hits.

Thanks,

Marc

p.s. The summit was absolutely great, thank you all for having it!


Re: spark with AccumuloRowInputFormat?

2015-05-04 Thread Marc Reichman
Hi Russ,

How exactly would this work regarding column qualifiers, etc, as those are
part of the key? I apologize but I'm not as familiar with the
WholeRowIterator use model, does it consolidate based on the rowkey, and
then return some Key+Value value which has all the original information
serialized?

My rows aren't gigantic but they can occasionally get into the 10s of MB.

On Mon, May 4, 2015 at 11:22 AM, Russ Weeks rwe...@newbrightidea.com
wrote:

 Hi, Marc,

 If your rows are small you can use the WholeRowIterator to get all the
 values with the key in one consuming function. If your rows are big but you
 know up-front that you'll only need a small part of each row, you could put
 a filter in front of the WholeRowIterator.
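
A rough, untested sketch of that setup on the input format, with hypothetical
priorities above the default VersioningIterator (20) and a RegExFilter standing
in for whatever filter is appropriate:

// Filter first (a lower priority number runs closer to the data), then roll up rows.
IteratorSetting filter = new IteratorSetting(21, "cfFilter", RegExFilter.class);
RegExFilter.setRegexs(filter, null, "interestingFamily", null, null, false);
IteratorSetting wholeRow = new IteratorSetting(22, "wholeRow", WholeRowIterator.class);
AccumuloInputFormat.addIterator(job, filter);
AccumuloInputFormat.addIterator(job, wholeRow);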

 I expect there's a performance hit (I haven't done any benchmarks myself)
 because of the extra serialization/deserialization but it's a very
 convenient way of working with Rows in Spark.

 Regards,
 -Russ

 On Mon, May 4, 2015 at 8:46 AM, Marc Reichman 
 mreich...@pixelforensics.com wrote:

 Has anyone done any testing with Spark and AccumuloRowInputFormat? I have
 no problem doing this for AccumuloInputFormat:

 JavaPairRDD<Key, Value> pairRDD = 
 sparkContext.newAPIHadoopRDD(job.getConfiguration(),
     AccumuloInputFormat.class,
     Key.class, Value.class);

 But I run into a snag trying to do a similar thing:

 JavaPairRDD<Text, PeekingIterator<Map.Entry<Key, Value>>> pairRDD = 
 sparkContext.newAPIHadoopRDD(job.getConfiguration(),
     AccumuloRowInputFormat.class,
     Text.class, PeekingIterator.class);

 The compilation error is (big, sorry):

 Error:(141, 97) java: method newAPIHadoopRDD in class 
 org.apache.spark.api.java.JavaSparkContext cannot be applied to given types;
   required: 
 org.apache.hadoop.conf.Configuration,java.lang.Class<F>,java.lang.Class<K>,java.lang.Class<V>
   found: 
 org.apache.hadoop.conf.Configuration,java.lang.Class<org.apache.accumulo.core.client.mapreduce.AccumuloRowInputFormat>,java.lang.Class<org.apache.hadoop.io.Text>,java.lang.Class<org.apache.accumulo.core.util.PeekingIterator>
   reason: inferred type does not conform to declared bound(s)
 inferred: 
 org.apache.accumulo.core.client.mapreduce.AccumuloRowInputFormat
 bound(s): 
 org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.io.Text,org.apache.accumulo.core.util.PeekingIterator>

 I've tried a few things, the signature of the function is:

 public <K, V, F extends org.apache.hadoop.mapreduce.InputFormat<K, V>> 
 JavaPairRDD<K, V> newAPIHadoopRDD(Configuration conf, Class<F> fClass, 
 Class<K> kClass, Class<V> vClass)

 I guess it's having trouble with the format extending InputFormatBase
 with its own additional generic parameters (the Map.Entry inside
 PeekingIterator).

 This may be an issue to chase with Spark vs Accumulo, unless something
 can be tweaked on the Accumulo side or I could wrap the InputFormat with my
 own somehow.

 Accumulo 1.6.1, Spark 1.3.1, JDK 7u71.

 Stopping short of this, can anyone think of a good way to use
 AccumuloInputFormat to get what I'm getting from the Row version in a
 performant way? It doesn't necessarily have to be an iterator approach, but
 I'd need all my values with the key in one consuming function. I'm looking
 into ways to do it in spark functions but trying to avoid any major
 performance hits.

 Thanks,

 Marc

 p.s. The summit was absolutely great, thank you all for having it!





Re: spark with AccumuloRowInputFormat?

2015-05-04 Thread Marc Reichman
This is working very well, thanks Russ!

For anyone ever stuck in this predicament, using the WholeRowIterator, I
was able to get the same IteratorMap.EntryKey,Value that I can get
similarly to the AccumuloRowInputFormat as follows:

...

IteratorSetting iteratorSetting = new IteratorSetting(1,
WholeRowIterator.class);
AccumuloInputFormat.addIterator(job, iteratorSetting);

// setup RDD
JavaPairRDD<Key, Value> pairRDD =
sparkContext.newAPIHadoopRDD(job.getConfiguration(),
    AccumuloInputFormat.class,
    Key.class, Value.class);

JavaRDD<List<MyResult>> result = pairRDD
    .map(new Function<Tuple2<Key, Value>, List<MyResult>>() {
        @Override
        public List<MyResult> call(Tuple2<Key, Value> keyValueTuple2) throws Exception {
            SortedMap<Key, Value> wholeRow =
                WholeRowIterator.decodeRow(keyValueTuple2._1, keyValueTuple2._2);
            MyObject o = getMyObject(wholeRow.entrySet().iterator());
            *...*
        }
    });

Previously, I was doing this approach, which required an additional
stage of Spark calculations as well as a shuffle phase, and wasn't
nearly as quick, and also needed a helper class (AccumuloRowMapEntry,
very basic Map.Entry implementation):

JavaRDD<List<MyResult>> result = pairRDD
    .mapToPair(new PairFunction<Tuple2<Key, Value>, Text, Map.Entry<Key, Value>>() {
        @Override
        public Tuple2<Text, Map.Entry<Key, Value>> call(Tuple2<Key, Value> keyValueTuple2) throws Exception {
            return new Tuple2<Text, Map.Entry<Key, Value>>(keyValueTuple2._1.getRow(),
                new AccumuloRowMapEntry(keyValueTuple2._1, keyValueTuple2._2));
        }
    })
    .groupByKey()
    .map(new Function<Tuple2<Text, Iterable<Map.Entry<Key, Value>>>, List<MyResult>>() {
        @Override
        public List<MyResult> call(Tuple2<Text, Iterable<Map.Entry<Key, Value>>> textIterableTuple2) throws Exception {
            MyObject o = getMyObject(textIterableTuple2._2.iterator());
            *...*
        }
    });


Thanks again for all the help.

Marc


On Mon, May 4, 2015 at 12:23 PM, Russ Weeks rwe...@newbrightidea.com
wrote:

 Yeah, exactly. When you put the WholeRowIterator on the scan, instead of
 seeing all the Key,Value pairs that make up a row you'll see a single
 Key,Value pair. The only part of the Key that matters is the row id. The
 Value is an encoded map of the Key,Value pairs that constitute the row.
 Call the static method WholeRowIterator.decodeRow to get at this map.

 The decoded Keys have all the CF, CQ, timestamp and visibility data
 populated. I'm not sure if they have the row ID populated; either way, they
 all belong to the same row that was present in the original Key.

 -Russ


 On Mon, May 4, 2015 at 9:51 AM, Marc Reichman 
 mreich...@pixelforensics.com wrote:

 Hi Russ,

 How exactly would this work regarding column qualifiers, etc, as those
 are part of the key? I apologize but I'm not as familiar with the
 WholeRowIterator use model, does it consolidate based on the rowkey, and
 then return some Key+Value value which has all the original information
 serialized?

 My rows aren't gigantic but they can occasionally get into the 10s of MB.

 On Mon, May 4, 2015 at 11:22 AM, Russ Weeks rwe...@newbrightidea.com
 wrote:

 Hi, Marc,

 If your rows are small you can use the WholeRowIterator to get all the
 values with the key in one consuming function. If your rows are big but you
 know up-front that you'll only need a small part of each row, you could put
 a filter in front of the WholeRowIterator.

 I expect there's a performance hit (I haven't done any benchmarks
 myself) because of the extra serialization/deserialization but it's a very
 convenient way of working with Rows in Spark.

 Regards,
 -Russ

 On Mon, May 4, 2015 at 8:46 AM, Marc Reichman 
 mreich...@pixelforensics.com wrote:

 Has anyone done any testing with Spark and AccumuloRowInputFormat? I
 have no problem doing this for AccumuloInputFormat:

  JavaPairRDD<Key, Value> pairRDD = 
  sparkContext.newAPIHadoopRDD(job.getConfiguration(),
      AccumuloInputFormat.class,
      Key.class, Value.class);

  But I run into a snag trying to do a similar thing:

  JavaPairRDD<Text, PeekingIterator<Map.Entry<Key, Value>>> pairRDD = 
  sparkContext.newAPIHadoopRDD(job.getConfiguration(),
      AccumuloRowInputFormat.class,
      Text.class, PeekingIterator.class);

  The compilation error is (big, sorry):

  Error:(141, 97) java: method newAPIHadoopRDD in class 
  org.apache.spark.api.java.JavaSparkContext cannot be applied to given 
  types;
    required: 
  org.apache.hadoop.conf.Configuration,java.lang.Class<F>,java.lang.Class<K>,java.lang.Class<V>
    found: 
  org.apache.hadoop.conf.Configuration,java.lang.Class<org.apache.accumulo.core.client.mapreduce.AccumuloRowInputFormat>,java.lang.Class<org.apache.hadoop.io.Text>,java.lang.Class<org.apache.accumulo.core.util.PeekingIterator>
    reason: inferred type does not conform to declared bound(s)
  inferred

Re: spark with AccumuloRowInputFormat?

2015-05-04 Thread Marc Reichman
Thanks Josh. I will make that change to be safe, though in these
experiments I use a maxversions of 1 anyway.

I look forward to seeing the definitive Accumulo + Spark guide some day,
glad to help where I can if there are specific things to fill in.

On Mon, May 4, 2015 at 2:40 PM, Josh Elser josh.el...@gmail.com wrote:

 Thanks _so_ much for taking the time to write this up, Marc! It's a good
 example.

 One note, you probably want to use a priority greater than 20 for the
 IteratorSetting. The VersioningIterator is set on Accumulo tables by
 default at priority 20. In most cases, you'd want to see the state of the
 table _after_ the VersioningIterator filters things.
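
In the snippet quoted below, that would mean something like this (sketch):

IteratorSetting iteratorSetting = new IteratorSetting(21, WholeRowIterator.class);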

 Marc Reichman wrote:

 This is working very well, thanks Russ!

 For anyone ever stuck in this predicament, using the WholeRowIterator, I
 was able to get the same IteratorMap.EntryKey,Value that I can get
 similarly to the AccumuloRowInputFormat as follows:

 ...

  IteratorSetting iteratorSetting = new IteratorSetting(1,
  WholeRowIterator.class);
  AccumuloInputFormat.addIterator(job, iteratorSetting);

  // setup RDD
  JavaPairRDD<Key, Value> pairRDD =
  sparkContext.newAPIHadoopRDD(job.getConfiguration(),
      AccumuloInputFormat.class,
      Key.class, Value.class);

  JavaRDD<List<MyResult>> result = pairRDD
      .map(new Function<Tuple2<Key, Value>, List<MyResult>>() {
          @Override
          public List<MyResult> call(Tuple2<Key, Value> keyValueTuple2) throws Exception {
              SortedMap<Key, Value> wholeRow =
                  WholeRowIterator.decodeRow(keyValueTuple2._1, keyValueTuple2._2);
              MyObject o = getMyObject(wholeRow.entrySet().iterator());
              *...*
          }
      });

 Previously, I was doing this approach, which required an additional stage
 of Spark calculations as well as a shuffle phase, and wasn't nearly as
 quick, and also needed a helper class (AccumuloRowMapEntry, very basic
 Map.Entry implementation):

  JavaRDD<List<MyResult>> result = pairRDD
      .mapToPair(new PairFunction<Tuple2<Key, Value>, Text, Map.Entry<Key, Value>>() {
          @Override
          public Tuple2<Text, Map.Entry<Key, Value>> call(Tuple2<Key, Value> keyValueTuple2) throws Exception {
              return new Tuple2<Text, Map.Entry<Key, Value>>(keyValueTuple2._1.getRow(),
                  new AccumuloRowMapEntry(keyValueTuple2._1, keyValueTuple2._2));
          }
      })
      .groupByKey()
      .map(new Function<Tuple2<Text, Iterable<Map.Entry<Key, Value>>>, List<MyResult>>() {
          @Override
          public List<MyResult> call(Tuple2<Text, Iterable<Map.Entry<Key, Value>>> textIterableTuple2) throws Exception {
              MyObject o = getMyObject(textIterableTuple2._2.iterator());
              *...*
          }
      });


 Thanks again for all the help.

 Marc


 On Mon, May 4, 2015 at 12:23 PM, Russ Weeks rwe...@newbrightidea.com
 mailto:rwe...@newbrightidea.com wrote:

 Yeah, exactly. When you put the WholeRowIterator on the scan,
 instead of seeing all the Key,Value pairs that make up a row you'll
 see a single Key,Value pair. The only part of the Key that matters
 is the row id. The Value is an encoded map of the Key,Value pairs
 that constitute the row. Call the static method
 WholeRowIterator.decodeRow to get at this map.

 The decoded Keys have all the CF, CQ, timestamp and visibility data
 populated. I'm not sure if they have the row ID populated; either
 way, they all belong to the same row that was present in the
 original Key.

 -Russ


 On Mon, May 4, 2015 at 9:51 AM, Marc Reichman
 mreich...@pixelforensics.com mailto:mreich...@pixelforensics.com
 wrote:

 Hi Russ,

 How exactly would this work regarding column qualifiers, etc, as
 those are part of the key? I apologize but I'm not as familiar
 with the WholeRowIterator use model, does it consolidate based
 on the rowkey, and then return some Key+Value value which has
 all the original information serialized?

 My rows aren't gigantic but they can occasionally get into the
 10s of MB.

 On Mon, May 4, 2015 at 11:22 AM, Russ Weeks
 rwe...@newbrightidea.com mailto:rwe...@newbrightidea.com
 wrote:

 Hi, Marc,

 If your rows are small you can use the WholeRowIterator to
 get all the values with the key in one consuming function.
 If your rows are big but you know up-front that you'll only
 need a small part of each row, you could put a filter in
 front of the WholeRowIterator.

 I expect there's a performance hit (I haven't done any
 benchmarks myself) because of the extra
 serialization/deserialization but it's a very convenient way
 of working with Rows in Spark.

 Regards,
 -Russ

 On Mon, May 4, 2015 at 8:46 AM, Marc Reichman

Re: OfflineScanner

2015-02-19 Thread Marc Reichman
Apologies for hijacking this, but is there any way to use an offline table
clone with MapReduce and AccumuloInputFormat? That read speed increase
sounds very appealing.
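
A rough, untested sketch of that workflow, assuming a Connector named connector,
a configured Job named job, and hypothetical table names; the 1.6 mapreduce
InputFormatBase exposes an offline-scan switch:

// Clone, take the clone offline, and point the job at it.
connector.tableOperations().clone("mytable", "mytable_clone", true,
        Collections.<String, String>emptyMap(), Collections.<String>emptySet());
connector.tableOperations().offline("mytable_clone");
AccumuloInputFormat.setInputTableName(job, "mytable_clone");
AccumuloInputFormat.setOfflineTableScan(job, true);
// ... run the job, then clean up the clone:
connector.tableOperations().delete("mytable_clone");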

On Thu, Feb 19, 2015 at 9:27 AM, Josh Elser josh.el...@gmail.com wrote:

 Typically, if you're using the OfflineScanner, you'd clone the table you
 want to read and then take the clone offline. It's a simple (and fast)
 solution that doesn't interrupt the availability of the table.

 Doing the read offline will definitely be faster (maybe 20%; I'm not
 entirely sure of the exact number or how it scales with nodes). The pain
 would be the extra work in creating the clone, offline'ing the table, and
 eventually deleting the clone when you're done with it. A little more work,
 but manageable.


 Ara Ebrahimi wrote:

 Hi,

 I’m trying to optimize a connector we’ve written for Presto. In some
 cases we need to perform full table scans. This happens across all the
 nodes but each node is assigned to process only a sharded subset of data.
 Each shard is hosted by only 1 RFile. I’m looking at the
 AbstractInputFormat and OfflineIterator and it seems like the code is not
 that hard to use for this case. Is there any drawback? It seems like if the
 table is offline then OfflineIterator is used which apparently reads the
 RFiles directly and doesn’t involve any RPC and I think should be
 significantly faster. Is it so? Is there any drawback to using this while
 the table is not offline but no other app is messing with the table?

 Thanks,
 Ara.







Re: submission w/classpath without tool.sh?

2015-01-26 Thread Marc Reichman
So, mapreduce.application.classpath was the winner. It's possible that
yarn.application.classpath would have worked as well. My main issue was
that I was neglecting to include a copy of the XML files in the classpath, so
my settings weren't being picked up (a late-night epiphany). Passing the value
as -Dmapreduce.application.classpath=... on the command line allowed this to
take effect, and I was fine.
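
For reference, a rough, untested sketch of setting the same property
programmatically from the Tool's run() method; the paths shown are hypothetical
placeholders for the real jar list:

Configuration conf = getConf();
conf.set("mapreduce.application.classpath",
        "$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,"
        + "$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,"
        + "/opt/accumulo/lib/*,/opt/zookeeper/zookeeper.jar");   // hypothetical locations
Job job = Job.getInstance(conf);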

For remote clients, I have copied what I need to launch into a local classpath
lib: the jar list output from 'accumulo classpath', plus a set of the XML files
needed to set the appropriate client-side mapreduce options, including the
classpath mentioned above as well as the various memory-related settings in
YARN/MR2.

Thanks for the help Billie!

On Sat, Jan 24, 2015 at 7:51 AM, Billie Rinaldi bil...@apache.org wrote:

 You might have to set yarn.application.classpath in both the client and
 the server conf. At least that's what Slider does.
 On Jan 23, 2015 10:00 PM, Marc Reichman mreich...@pixelforensics.com
 wrote:

 That's correct, I don't really want to have the client have to package up
 every accumulo and zookeeper jar I need in dcache or a fat jar or whatever
 just to run stuff from a remote client when the jars are all there.

 I did try yarn.application.classpath, but I didn't spell out the whole
 thing. Next try I will take all those jars and put them in explicitly
 instead of the dir wildcards. I will update how it goes.

 On Fri, Jan 23, 2015 at 5:19 PM, Billie Rinaldi bil...@apache.org
 wrote:

 You have all the jars your app needs on both the servers and the client
 (as opposed to wanting Yarn to distribute them)?  Then
 yarn.application.classpath should be what you need.  It looks like
 /etc/hadoop/conf,/some/lib/dir/*,/some/other/lib/dir/* etc.  Is that what
 you're trying?

 On Fri, Jan 23, 2015 at 1:56 PM, Marc Reichman 
 mreich...@pixelforensics.com wrote:

 My apologies if this is covered somewhere, I've done a lot of searching
 and come up dry.

 I am migrating a set of applications from Hadoop 1.0.3/Accumulo 1.4.1
 to Hadoop 2.6.0/Accumulo 1.6.1. The applications are launched by my custom
 java apps, using the Hadoop Tool/Configured interface setup, not a big 
 deal.

 To run MR jobs with AccumuloInputFormat/OutputFormat, in 1.0 I could
 use tool.sh to launch the programs, which worked great for local on-cluster
 launching. I however needed to launch from remote hosts (maybe even Windows
 ones), and I would bundle a large lib dir with everything I needed on the
 client-side, and fill out HADOOP_CLASSPATH in hadoop-env.sh with everything
 I needed (basically copied the output of accumulo classpath). This would
 work for remote submissions, or even local ones, but specifically using my
 java mains to launch them without any accumulo or hadoop wrapper scripts.

 In YARN MR 2.6 this doesn't seem to work. No matter what I do, I can't
 seem to get a normal java app to have the 2.x MR Application Master pick up
 the accumulo items in the classpath, and my jobs fail with ClassNotFound
 exceptions. tool.sh works just fine, but again, I need to be able to submit
 without that environment.

 I have tried (on the cluster):
 HADOOP_CLASSPATH in hadoop-env.sh
 HADOOP_CLASSPATH from .bashrc
 yarn.application.classpath in yarn-site.xml

 I don't mind using tool.sh locally, it's quite nice, but I need a
 strategy to have the cluster setup so I can just launch java, set my
 appropriate hadoop configs for remote fs and yarn hosts, get my accumulo
 connections and in/out setup for mapreduce and launch jobs which have
 accumulo awareness.

 Any ideas?

 Thanks,
 Marc






Re: submission w/classpath without tool.sh?

2015-01-23 Thread Marc Reichman
That's correct, I don't really want to have the client have to package up
every accumulo and zookeeper jar I need in dcache or a fat jar or whatever
just to run stuff from a remote client when the jars are all there.

I did try yarn.application.classpath, but I didn't spell out the whole
thing. Next try I will take all those jars and put them in explicitly
instead of the dir wildcards. I will update how it goes.

On Fri, Jan 23, 2015 at 5:19 PM, Billie Rinaldi bil...@apache.org wrote:

 You have all the jars your app needs on both the servers and the client
 (as opposed to wanting Yarn to distribute them)?  Then
 yarn.application.classpath should be what you need.  It looks like
 /etc/hadoop/conf,/some/lib/dir/*,/some/other/lib/dir/* etc.  Is that what
 you're trying?

 On Fri, Jan 23, 2015 at 1:56 PM, Marc Reichman 
 mreich...@pixelforensics.com wrote:

 My apologies if this is covered somewhere, I've done a lot of searching
 and come up dry.

 I am migrating a set of applications from Hadoop 1.0.3/Accumulo 1.4.1 to
 Hadoop 2.6.0/Accumulo 1.6.1. The applications are launched by my custom
 java apps, using the Hadoop Tool/Configured interface setup, not a big deal.

 To run MR jobs with AccumuloInputFormat/OutputFormat, in 1.0 I could use
 tool.sh to launch the programs, which worked great for local on-cluster
 launching. I however needed to launch from remote hosts (maybe even Windows
 ones), and I would bundle a large lib dir with everything I needed on the
 client-side, and fill out HADOOP_CLASSPATH in hadoop-env.sh with everything
 I needed (basically copied the output of accumulo classpath). This would
 work for remote submissions, or even local ones, but specifically using my
 java mains to launch them without any accumulo or hadoop wrapper scripts.

 In YARN MR 2.6 this doesn't seem to work. No matter what I do, I can't
 seem to get a normal java app to have the 2.x MR Application Master pick up
 the accumulo items in the classpath, and my jobs fail with ClassNotFound
 exceptions. tool.sh works just fine, but again, I need to be able to submit
 without that environment.

 I have tried (on the cluster):
 HADOOP_CLASSPATH in hadoop-env.sh
 HADOOP_CLASSPATH from .bashrc
 yarn.application.classpath in yarn-site.xml

 I don't mind using tool.sh locally, it's quite nice, but I need a
 strategy to have the cluster setup so I can just launch java, set my
 appropriate hadoop configs for remote fs and yarn hosts, get my accumulo
 connections and in/out setup for mapreduce and launch jobs which have
 accumulo awareness.

 Any ideas?

 Thanks,
 Marc





Hadoop Summit (San Jose June 3-5)

2014-04-28 Thread Marc Reichman
Will anyone be there? I wouldn't mind meeting up for a drink, talk about
Accumulo, projects, etc.

Looking forward to coming to my first Hadoop-based conference!

Marc


accessing accumulo row in mapper setup method?

2013-09-02 Thread Marc Reichman
Hello,

I am running a search job of a single piece of query data against potential
targets in an accumulo table, using AccumuloRowInputFormat. In most cases,
the query data itself is also in the same accumulo table.

To date, my client program has pulled the query data from accumulo using a
basic scanner, stored the data into HDFS, and added the file(s) in question
to distributed cache. My mapper then pulls the data from distributed cache
into a private class member in its setup method and uses it in all of the
map calls.

I had a thought that maybe I'm spending a bit too much overhead on the
client side doing this, and that my job submission performance is slow
because of all of the HDFS I/O and distributed cache handling for arguably
small files, in the 100-200k range max.

Does it seem like a reasonable idea to skip the preparation on the
client-side, and have the mapper setup pull the data directly from accumulo
in its setup method instead?

Questions related to this:
1. Does this put a lot of pressure on the tabletserver which contains the
data, to have many mappers hitting at once during setup for the first wave?
2. Is there any way whatsoever for the mapper to use the existing client
connection already being made? Or would I have to do the usual setup with
my own zookeeper connection, and if so does that make for a much worse
performance impact?
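
A rough, untested sketch of pulling the query data in the mapper's setup
method, assuming the connection parameters and the query row id are passed
through the job Configuration under hypothetical keys:

private byte[] queryData;

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    try {
        Instance instance = new ZooKeeperInstance(conf.get("query.instance"), conf.get("query.zookeepers"));
        Connector connector = instance.getConnector(conf.get("query.user"), conf.get("query.password").getBytes());
        Scanner scanner = connector.createScanner(conf.get("query.table"), new Authorizations());
        scanner.setRange(new Range(conf.get("query.row")));
        for (Map.Entry<Key, Value> entry : scanner) {
            queryData = entry.getValue().get();   // single small query record expected
        }
    } catch (Exception e) {
        throw new IOException("could not load query data from Accumulo", e);
    }
}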

Thanks,
Marc


Re: Getting the IP Address

2013-08-28 Thread Marc Reichman
Just tested. Does not work.


On Wed, Aug 28, 2013 at 11:53 AM, Eric Newton eric.new...@gmail.com wrote:

 Does hostname -i work on a mac?  Not being a mac user, I can't check.

 -Eric



 On Wed, Aug 28, 2013 at 11:38 AM, Ravi Mutyala r...@hortonworks.comwrote:

 Hi,

 I see from the accumulo-tracer init.d script that IP is determined by
 this logic.

 ifconfig | grep inet[^6] | awk '{print $2}' | sed 's/addr://' | grep -v
 0.0.0.0 | grep -v 127.0.0.1 | head -n 1


 Any reason for using this logic instead of hostname -i and a reverse DNS
 lookup? I have a cluster where the order of NICs on one of the
 nodes is different, and ifconfig returns an IP from a different
 subnet than for the other nodes. But DNS and reverse DNS are properly
 configured.

 Thanks






Re: Filtering on column qualifier

2013-08-22 Thread Marc Reichman
Extending looked like a bit of a boondoggle, because all of the useful
fields in the class are private, not protected. I also ran into another
architectural question: how does one pass a value (a la a constructor) into
one of these classes? If I'm going to use this to filter based on a
threshold, I'd need to pass that threshold in somehow.
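
For what it's worth, a rough, untested sketch of a stand-alone threshold filter
(not an extension of ColumnQualifierFilter) that receives its threshold through
the iterator options; class and option names are hypothetical:

public class ScoreThresholdFilter extends Filter {
    private double threshold;

    @Override
    public void init(SortedKeyValueIterator<Key, Value> source,
                     Map<String, String> options, IteratorEnvironment env) throws IOException {
        super.init(source, options, env);
        threshold = Double.parseDouble(options.get("threshold"));
    }

    @Override
    public boolean accept(Key k, Value v) {
        // Keep entries whose column qualifier parses to a score at or above the threshold.
        return Double.parseDouble(k.getColumnQualifier().toString()) >= threshold;
    }
}

// Client side: the option travels with the IteratorSetting, no constructor needed.
IteratorSetting setting = new IteratorSetting(21, "scoreThreshold", ScoreThresholdFilter.class);
setting.addOption("threshold", "0.85");
scanner.addScanIterator(setting);

The class itself still has to be on the tablet servers' classpath, which is the
JAR-distribution concern discussed later in this thread.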




On Wed, Aug 21, 2013 at 9:49 AM, John Vines vi...@apache.org wrote:

 There's no way to extend the ColumnQualifierFilter via configuration, but
 it sounds like you are on top of it. You just need to extend the class,
 possibly copy a bit of code, and change the equality check to a compareTo
 after converting the Strings to Doubles.


 On Wed, Aug 21, 2013 at 10:00 AM, Marc Reichman 
 mreich...@pixelforensics.com wrote:

 I have some data stored in Accumulo with some scores stored as column
 qualifiers (there was an older thread about this). I would like to find a
 way to do thresholding when retrieving the data without retrieving it all
 and then manually filtering out items below my threshold.

 I know I can fetch column qualifiers which are exact.

 I've seen the ColumnQualifierFilter, which I assume is what's in play
 when I fetch qualifiers. Is there a reasonable pattern to extend this and
 try to use it as a scan iterator so I can do things like greater than a
 value which will be interpreted as a Double vs. the string equality going
 on now?

 Thanks,
 Marc





Re: Filtering on column qualifier

2013-08-22 Thread Marc Reichman
I haven't considered that. Would that allow me to specify it in the
client-side code and not worry about spreading JARs around? It is a very
basic need, in my scan iterator loop right now is:

String matchScoreString = key.getColumnQualifier().toString();
Double score = Double.parseDouble(matchScoreString);

if (threshold != null && threshold > score) {
    // TODO: figure out if this is possible to do via a
    // data-local scan iterator
    continue;
}

What is the pattern for including a groovy snippet for a scan iterator?


On Thu, Aug 22, 2013 at 11:16 AM, David Medinets
david.medin...@gmail.comwrote:

 Have you thought of writing a filter class that takes some bit of groovy
 for execution inside the accept method, depending on how efficient you need
 to be and how changeable your constraints are.


 On Thu, Aug 22, 2013 at 10:19 AM, Marc Reichman 
 mreich...@pixelforensics.com wrote:

 Extending looked like a bit of a boondoggle, because all of the useful
 fields in the class are private, not protected. I also ran into another
 architectural question, how does one pass a value (a-la constructor) into
 one of these classes? If I'm going to use this to filter based on a
 threshold, I'd need to pass that threshold in somehow.




 On Wed, Aug 21, 2013 at 9:49 AM, John Vines vi...@apache.org wrote:

 There's no way to extend the ColumnQualifierFilter via configuration, but
 it sounds like you are on top of it. You just need to extend the class,
 possibly copy a bit of code, and change the equality check to a compareTo
 after converting the Strings to Doubles.


 On Wed, Aug 21, 2013 at 10:00 AM, Marc Reichman 
 mreich...@pixelforensics.com wrote:

 I have some data stored in Accumulo with some scores stored as column
 qualifiers (there was an older thread about this). I would like to find a
 way to do thresholding when retrieving the data without retrieving it all
 and then manually filtering out items below my threshold.

 I know I can fetch column qualifiers which are exact.

 I've seen the ColumnQualifierFilter, which I assume is what's in play
 when I fetch qualifiers. Is there a reasonable pattern to extend this and
 try to use it as a scan iterator so I can do things like greater than a
 value which will be interpreted as a Double vs. the string equality going
 on now?

 Thanks,
 Marc







Re: Filtering on column qualifier

2013-08-22 Thread Marc Reichman
I apologize for my denseness, but could you walk me through this? Is there
some form of existing scan iterator which interprets Groovy? Or is this
something I would build?


On Thu, Aug 22, 2013 at 12:10 PM, David Medinets
david.medin...@gmail.comwrote:

 The advantage is that you'd only write the iterator once and deploy it to
 the cluster. Then the groovy snippet changes its behavior. You'd save
 passing the data to your client code, but more work would be done by the
 accumulo cluster.


 On Thu, Aug 22, 2013 at 12:33 PM, Marc Reichman 
 mreich...@pixelforensics.com wrote:

 I haven't considered that. Would that allow me to specify it in the
 client-side code and not worry about spreading JARs around? It is a very
 basic need, in my scan iterator loop right now is:

 String matchScoreString = key.getColumnQualifier().toString();
 Double score = Double.parseDouble(matchScoreString);

  if (threshold != null && threshold > score) {
      // TODO: figure out if this is possible to do via a
      // data-local scan iterator
      continue;
  }

 What is the pattern for including a groovy snippet for a scan iterator?


 On Thu, Aug 22, 2013 at 11:16 AM, David Medinets 
 david.medin...@gmail.com wrote:

 Have you thought of writing a filter class that takes some bit of groovy
 for execution inside the accept method, depending on how efficient you need
 to be and how changeable your constraints are.


 On Thu, Aug 22, 2013 at 10:19 AM, Marc Reichman 
 mreich...@pixelforensics.com wrote:

 Extending looked like a bit of a boondoggle, because all of the useful
 fields in the class are private, not protected. I also ran into another
 architectural question, how does one pass a value (a-la constructor) into
 one of these classes? If I'm going to use this to filter based on a
 threshold, I'd need to pass that threshold in somehow.




 On Wed, Aug 21, 2013 at 9:49 AM, John Vines vi...@apache.org wrote:

  There's no way to extend the ColumnQualifierFilter via configuration,
 but it sounds like you are on top of it. You just need to extend the 
 class,
 possibly copy a bit of code, and change the equality check to a compareTo
 after converting the Strings to Doubles.


 On Wed, Aug 21, 2013 at 10:00 AM, Marc Reichman 
 mreich...@pixelforensics.com wrote:

 I have some data stored in Accumulo with some scores stored as column
 qualifiers (there was an older thread about this). I would like to find a
 way to do thresholding when retrieving the data without retrieving it all
 and then manually filtering out items below my threshold.

 I know I can fetch column qualifiers which are exact.

 I've seen the ColumnQualifierFilter, which I assume is what's in play
 when I fetch qualifiers. Is there a reasonable pattern to extend this and
 try to use it as a scan iterator so I can do things like greater than a
 value which will be interpreted as a Double vs. the string equality going
 on now?

 Thanks,
 Marc









Filtering on column qualifier

2013-08-21 Thread Marc Reichman
I have some data stored in Accumulo with some scores stored as column
qualifiers (there was an older thread about this). I would like to find a
way to do thresholding when retrieving the data without retrieving it all
and then manually filtering out items below my threshold.

I know I can fetch column qualifiers which are exact.

I've seen the ColumnQualifierFilter, which I assume is what's in play when
I fetch qualifiers. Is there a reasonable pattern to extend this and try to
use it as a scan iterator so I can do things like greater than a value
which will be interpreted as a Double vs. the string equality going on now?

Thanks,
Marc


Re: deletion technique question

2013-05-13 Thread Marc Reichman
The 1.5 solution looks nice.

I'm aware of the potential data loss issue, and the sort ordering is also an
interesting angle, thank you.

In my particular case, where I may not necessarily be aware of all
permutations of column visibility for a given key but want to replace them
all with a particular new visibility (with the same data), how would I go
about that? Is there a way to use a BatchScanner (step 1 of the
BatchDeleter approach) to pull down all the permutations, then issue
putDeletes for them and put what I want?

In my case, I'm pulling one copy of the data down first to verify I have it
at the user's current scan auth, then using the #1 approach to clear it out
and then put it in again with the visibility I need.


On Mon, May 13, 2013 at 10:05 AM, Keith Turner ke...@deenlo.com wrote:




 On Fri, May 10, 2013 at 12:39 PM, Marc Reichman 
 mreich...@pixelforensics.com wrote:

 I have a table with rows which have 3 column values in one column family,
 and a column visibility.

 There are situations where I will want to replace the row content with a
 new column visibility; I understand that the visibility attributes are
 immutable, so I will have to delete and re-put.

 Am I better off doing:
 1. BatchDeleter with authorizations to allow access, set range to the key
 in question, call delete, and then put in mutations with the new visibility
 2. Create mutations with a putDelete followed by a put with the new
 visibility for each value
 3. Something else entirely?


 In 1.5, you can use ACCUMULO-956



 For option #2, can I simply do a putDelete on the column
 family/qualifier? Or do I need to know the old authorizations to put in a
 visibility expression with the putDelete?

 For all of these, can a client get up-to-the-minute results immediately
 after? Or does some kind of compaction need to occur first?


 If you send a mutation with a delete and put, the client will be able to
 see it after the batchwriter flushes or closes.  No compaction needed.
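
A rough, untested sketch of that single-mutation approach (option 2), assuming
the old visibility is known and a BatchWriter named writer; all names here are
hypothetical:

Mutation m = new Mutation(new Text("rowId"));
m.putDelete(new Text("fam"), new Text("qual"), oldVisibility);          // delete under the old colvis
m.put(new Text("fam"), new Text("qual"), newVisibility, new Value(valueBytes));
writer.addMutation(m);
writer.flush();   // visible to scanners after the flush; no compaction needed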

 I am a little fuzzy on #1. Will you delete everything in one pass (using
 the batchdeleter), and then do another pass writing data w/ updated colvis? If
 so, this would seem to imply that you are pulling the data from another
 source (other than the table the stuff was deleted from)?

 Make sure the method you choose is not susceptible to data loss in the
 event that the client dies. For example, suppose a client was reading a table
 and then writing a delete-and-update mutation for each key/val read. If
 the client died and some deletes were written, but not the corresponding
 updates, then that data would not be seen to be transformed on the second
 run.

 When you change the colvis, you change the sort order. Suppose you read a key
 K and change it to K', where K' sorts after K. If you insert K', it's
 possible that you may read it again, since it's being inserted in front of the
 scanner's pointer. Because of buffering in the batch writer and scanner, this
 would not always occur, but it would occur occasionally. Something to be aware
 of.






deletion technique question

2013-05-10 Thread Marc Reichman
I have a table with rows which have 3 column values in one column family,
and a column visibility.

There are situations where I will want to replace the row content with a
new column visibility; I understand that the visibility attributes are
immutable, so I will have to delete and re-put.

Am I better off doing:
1. BatchDeleter with authorizations to allow access, set range to the key
in question, call delete, and then put in mutations with the new visibility
2. Create mutations with a putDelete followed by a put with the new
visibility for each value
3. Something else entirely?

For option #2, can I simply do a putDelete on the column family/qualifier?
Or do I need to know the old authorizations to put in a visibility
expression with the putDelete?

For all of these, can a client get up-to-the-minute results immediately
after? Or does some kind of compaction need to occur first?


Re: deletion technique question

2013-05-10 Thread Marc Reichman
The only limitation with the approach that I can see is that I may not know
every permutation of visibility on a given key, and with the scan-driven
approach I can use the user's entire authorization set as a way to get all
of the rows for deletion.

Thanks,
Marc


On Fri, May 10, 2013 at 2:19 PM, Christopher ctubb...@apache.org wrote:

 The BatchDeleter is essentially a BatchScanner with the
 SortedKeyIterator (which drops values from the returned entries...
 they aren't needed to delete), and a BatchWriter that inserts a delete
 entry in a mutation for every entry the scanner sees.

  You can, and should, select option 2, because you're better off
  sending two column updates in each mutation rather than sending twice as
  many mutations, as you'd be doing for option 1.

 --
 Christopher L Tubbs II
 http://gravatar.com/ctubbsii


 On Fri, May 10, 2013 at 12:39 PM, Marc Reichman
 mreich...@pixelforensics.com wrote:
  I have a table with rows which have 3 column values in one column family,
  and a column visibility.
 
  There are situations where I will want to replace the row content with a
 new
  column visibility; I understand that the visibility attributes are
  immutable, so I will have to delete and re-put.
 
  Am I better off doing:
  1. BatchDeleter with authorizations to allow access, set range to the
 key in
  question, call delete, and then put in mutations with the new visibility
  2. Create mutations with a putDelete followed by a put with the new
  visibility for each value
  3. Something else entirely?
 
  For option #2, can I simply do a putDelete on the column
 family/qualifier?
  Or do I need to know the old authorizations to put in a visibility
  expression with the putDelete?
 
  For all of these, can a client get up-to-the-minute results immediately
  after? Or does some kind of compaction need to occur first?



Re: remote accumulo instance issue

2013-05-08 Thread Marc Reichman
These are from the client machine:
(9997 on a tserver)
[mreichman@packers: ~]$ nmap -p 9997 192.168.1.162

Starting Nmap 5.51 ( http://nmap.org ) at 2013-05-08 16:35 ric
Nmap scan report for giants.home (192.168.1.162)
Host is up (0.0063s latency).
PORT STATE SERVICE
9997/tcp open  unknown
MAC Address: 7A:79:C0:A8:01:A2 (Unknown)

Nmap done: 1 IP address (1 host up) scanned in 0.60 seconds

(2181 zookeeper on the master)
[mreichman@packers: ~]$ nmap -p 2181 192.168.1.160

Starting Nmap 5.51 ( http://nmap.org ) at 2013-05-08 16:35 ric
Nmap scan report for padres.home (192.168.1.160)
Host is up (0.0071s latency).
PORT STATE SERVICE
2181/tcp open  unknown
MAC Address: 7A:79:C0:A8:01:A0 (Unknown)

Nmap done: 1 IP address (1 host up) scanned in 0.56 seconds

Any chance it could be anything related to DNS or reverse DNS?


On Wed, May 8, 2013 at 10:25 AM, John Vines vi...@apache.org wrote:

 Is that remote instance behind a firewall or anything like that?


 On Wed, May 8, 2013 at 11:09 AM, Marc Reichman 
 mreich...@pixelforensics.com wrote:

 I have seen this as ticket ACCUMULO-687 which has been marked resolved,
 but I still see this issue.

 I am connecting to a remote accumulo instance to query and to launch
 mapreduce jobs using AccumuloRowInputFormat, and I'm seeing an error like:

 91 [main-SendThread(padres.home:2181)] INFO
 org.apache.zookeeper.ClientCnxn  - Socket connection established to
 padres.home/192.168.1.160:2181, initiating session
 166 [main-SendThread(padres.home:2181)] INFO
 org.apache.zookeeper.ClientCnxn  - Session establishment complete on server
 padres.home/192.168.1.160:2181, sessionid = 0x13e7b48f9d17af7,
 negotiated timeout = 3
 1889 [main] WARN org.apache.accumulo.core.client.impl.ServerClient  -
 Failed to find an available server in the list of servers:
 [192.168.1.164:9997:9997 (12), 192.168.1.192:9997:9997 (12),
 192.168.1.194:9997:9997 (12), 192.168.1.162:9997:9997 (12),
 192.168.1.190:9997:9997 (12), 192.168.1.166:9997:9997 (12),
 192.168.1.168:9997:9997 (12), 192.168.1.196:9997:9997 (12)]

 My zookeeper's tservers key looks like:
 [zk: localhost:2181(CONNECTED) 1] ls
 /accumulo/908a756e-1c81-4bea-a4de-675456499a10/tservers
 [192.168.1.164:9997, 192.168.1.192:9997, 192.168.1.194:9997,
 192.168.1.162:9997, 192.168.1.190:9997, 192.168.1.166:9997,
 192.168.1.168:9997, 192.168.1.196:9997]

 My masters and slaves file look like:
 [hadoop@padres conf]$ cat masters
 192.168.1.160
 [hadoop@padres conf]$ cat slaves
 192.168.1.162
 192.168.1.164
 192.168.1.166
 192.168.1.168
 192.168.1.190
 192.168.1.192
 192.168.1.194
 192.168.1.196

 tracers, gc, and monitor are the same as masters.

 I have no issues executing on the master, but I would like to work from a
 remote host. The remote host is on a VPN, and its default resolver is NOT
 the resolver from the remote network. If I do reverse lookup over the VPN
 *using* the remote resolver it shows proper hostnames.

 My concern is that something is causing the host:port entry plus the
 port to come up with this concatenated view of host:port:port, which is
 obviously not going to work.

 What else can I try? I previously had hostnames in the
 masters/slaves/etc. files but now have the IPs. Should I re-init the
 instance to see if it changes anything in zookeeper?





Re: remote accumulo instance issue

2013-05-08 Thread Marc Reichman
All,

My apologies. This seemed to be a JAR mismatch error. No more problems.
Sorry for the drill.

Marc


On Wed, May 8, 2013 at 11:45 AM, Marc Reichman mreich...@pixelforensics.com
 wrote:

 1.4.1., hadoop 1.0.3.

 Just for sanity, I ran 'accumulo classpath' on the cluster and am copying
 those exact files to my client side in case there was a mismatch somewhere.


 On Wed, May 8, 2013 at 11:43 AM, John Vines vi...@apache.org wrote:

 What version of Accumulo are you running?

 Sent from my phone, please pardon the typos and brevity.
 On May 8, 2013 12:38 PM, Marc Reichman mreich...@pixelforensics.com
 wrote:

 I can't find anything wrong with the networking. Here is the whole error
 with stack trace:
 2057 [main] WARN org.apache.accumulo.core.client.impl.ServerClient  -
 Failed to find an available server in the list of servers:
 [192.168.1.164:9997:9997 (12), 192.168.1.192:9997:9997 (12),
 192.168.1.194:9997:9997 (12), 192.168.1.162:9997:9997 (12),
 192.168.1.190:9997:9997 (12), 192.168.1.166:9997:9997 (12),
 192.168.1.168:9997:9997 (12), 192.168.1.196:9997:9997 (12)]
 Exception in thread "main" java.lang.IncompatibleClassChangeError:
 Implementing class
 at java.lang.ClassLoader.defineClass1(Native Method)
  at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
  at
 java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
  at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
  at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
 at
 org.apache.accumulo.core.client.impl.ServerClient.getConnection(ServerClient.java:146)
  at
 org.apache.accumulo.core.client.impl.ServerClient.getConnection(ServerClient.java:123)
 at
 org.apache.accumulo.core.client.impl.ServerClient.executeRaw(ServerClient.java:105)
  at
 org.apache.accumulo.core.client.impl.ServerClient.execute(ServerClient.java:71)
 at
 org.apache.accumulo.core.client.impl.ConnectorImpl.<init>(ConnectorImpl.java:75)
  at
 org.apache.accumulo.core.client.ZooKeeperInstance.getConnector(ZooKeeperInstance.java:218)
 at
 org.apache.accumulo.core.client.ZooKeeperInstance.getConnector(ZooKeeperInstance.java:206)

 Running on JDK 1.6.0_27


 On Wed, May 8, 2013 at 10:38 AM, Keith Turner ke...@deenlo.com wrote:




 On Wed, May 8, 2013 at 11:09 AM, Marc Reichman 
 mreich...@pixelforensics.com wrote:

 I have seen this as ticket ACCUMULO-687 which has been marked
 resolved, but I still see this issue.

 I am connecting to a remote accumulo instance to query and to launch
 mapreduce jobs using AccumuloRowInputFormat, and I'm seeing an error like:

 91 [main-SendThread(padres.home:2181)] INFO
 org.apache.zookeeper.ClientCnxn  - Socket connection established to
 padres.home/192.168.1.160:2181, initiating session
 166 [main-SendThread(padres.home:2181)] INFO
 org.apache.zookeeper.ClientCnxn  - Session establishment complete on 
 server
 padres.home/192.168.1.160:2181, sessionid = 0x13e7b48f9d17af7,
 negotiated timeout = 3
 1889 [main] WARN org.apache.accumulo.core.client.impl.ServerClient  -
 Failed to find an available server in the list of servers:
 [192.168.1.164:9997:9997 (12), 192.168.1.192:9997:9997 (12),
 192.168.1.194:9997:9997 (12), 192.168.1.162:9997:9997 (12),
 192.168.1.190:9997:9997 (12), 192.168.1.166:9997:9997 (12),
 192.168.1.168:9997:9997 (12), 192.168.1.196:9997:9997 (12)]

 My zookeeper's tservers key looks like:
 [zk: localhost:2181(CONNECTED) 1] ls
 /accumulo/908a756e-1c81-4bea-a4de-675456499a10/tservers
 [192.168.1.164:9997, 192.168.1.192:9997, 192.168.1.194:9997,
 192.168.1.162:9997, 192.168.1.190:9997, 192.168.1.166:9997,
 192.168.1.168:9997, 192.168.1.196:9997]

 My masters and slaves file look like:
 [hadoop@padres conf]$ cat masters
 192.168.1.160
 [hadoop@padres conf]$ cat slaves
 192.168.1.162
 192.168.1.164
 192.168.1.166
 192.168.1.168
 192.168.1.190
 192.168.1.192
 192.168.1.194
 192.168.1.196

 tracers, gc, and monitor are the same as masters.

 I have no issues executing on the master, but I would like to work
 from a remote host. The remote host is on a VPN, and its default resolver
 is NOT the resolver from the remote network. If I do reverse lookup over
 the VPN *using* the remote resolver it shows proper hostnames.

 My concern is that something is causing the host:port entry plus the
 port to come up with this concatenated view of host:port:port, which is
 obviously not going to work.


  The second port is nothing to worry about. It's created by concatenating
  what came from zookeeper

Change/modify column visibility of an existing row?

2013-04-22 Thread Marc Reichman
Is there a way via the Java API to modify the column visibility of an
existing row without having to put a new column alongside it? Or are those
immutable?

I realize I can delete and re-put the data with new visibility.

Thanks,
Marc

-- 
http://saucyandbossy.wordpress.com


Re: Change/modify column visibility of an existing row?

2013-04-22 Thread Marc Reichman
Thank you. I felt that was the case and didn't see anything to sway me, but I
figured I'd ask, as it came up in the design of a tool using Accumulo.


On Mon, Apr 22, 2013 at 11:05 AM, John Vines vi...@apache.org wrote:

 All Keys in Accumulo are immutable, including the visibility fields. So
 the only way to change is to delete and insert.




 On Mon, Apr 22, 2013 at 12:00 PM, Marc Reichman marcreich...@gmail.comwrote:

 Is there a way via the Java API to modify the column visibility of an
 existing row without having to put a new column along-side? Or are those
 immutable?

 I realize I can delete and re-put the data with new visibility.

 Thanks,
 Marc

 --
 http://saucyandbossy.wordpress.com





-- 
http://saucyandbossy.wordpress.com


increase running scans in monitor?

2013-04-02 Thread Marc Reichman
Hello,

I am running an Accumulo-based MR job using the AccumuloRowInputFormat on
1.4.1. Config is more-or-less default, using the native-standalone 3GB
template, but with the TServer memory put up to 2GB in accumulo-env.sh from
its default. accumulo-site.xml has tserver.memory.maps.max at 1G,
tserver.cache.data.size at 50M, and tserver.cache.index.size at 512M.

My tables are created with maxversions for all three types (scan, minc,
majc) at 1 and compress type as gz.
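For completeness, a minimal sketch of setting that table configuration
through the Java API; the property names are the standard table.* keys for
compression and the default VersioningIterator ("vers"), and the table name
is illustrative:

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.admin.TableOperations;

public class TableSetup {
    // Sets gz compression and caps the default VersioningIterator at one
    // version for the scan, minc, and majc scopes. Table name is illustrative.
    public static void configure(Connector conn, String table) throws Exception {
        TableOperations ops = conn.tableOperations();
        ops.setProperty(table, "table.file.compress.type", "gz");
        for (String scope : new String[] {"scan", "minc", "majc"}) {
            ops.setProperty(table, "table.iterator." + scope + ".vers.opt.maxVersions", "1");
        }
    }
}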

I am finding, on an 8-node test cluster with 64 map task slots, that when a
job is running, the 'Running Scans' count in the monitor is roughly 0-4 on
average for each tablet server. In the table view, this puts the running
scans anywhere from 4-24 on average. I would expect/hope the scans to be
somewhere close to the map task count. To me, this means one of the
following:
1. There is a configuration setting preventing the number of scans from
accumulating (excuse the pun) to about the same number as my map tasks.
2. My map task job is CPU-intensive enough to introduce delays between
scans, and everything is fine.
3. Some combination of 1 and 2.

On another cluster, 40 nodes with 320 task slots, we haven't seen anywhere
near full-capacity scanning with map tasks that have the same per-task
performance, and the problem seems much worse.

I am experimenting with some of the readahead configuration variables for
the tablet servers in the meantime, but haven't found any smoking guns yet.
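One of the likely knobs is the tserver-wide cap on concurrent scan
readahead. A minimal sketch of raising it from the Java API, assuming the
1.4-era property name tserver.readahead.concurrent.max (an instance-wide
setting, so it affects every tablet server):

import org.apache.accumulo.core.client.Connector;

public class ReadaheadTuning {
    // Raises the cap on concurrent scan readahead threads per tablet server so
    // more simultaneous scans can be served. Property name assumed from the
    // 1.4 defaults; verify against your version's documentation.
    public static void raiseReadahead(Connector conn, int max) throws Exception {
        conn.instanceOperations().setProperty("tserver.readahead.concurrent.max",
                Integer.toString(max));
    }
}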

Thank you,
Marc


-- 
http://saucyandbossy.wordpress.com


Re: increase running scans in monitor?

2013-04-02 Thread Marc Reichman
Hi Josh,

Thanks for writing back. I am doing all explicit splits using addSplits in
the Java API, since the keyspace is easy to divide evenly. Depending on the
table size for these experiments, I've used 128, 256, 512, or 1024 splits.
My jobs are executing properly, MR-wise, in the sense that I do get the
proper number of map tasks created (the same as the split counts above,
respectively). My concern is that the jobs may not be quite as busy as they
could be, dataflow-wise, and I think the 'Running Scans' counts per
table/tablet server are a good indicator of that.
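To make the even-split idea concrete, a minimal sketch of generating evenly
spaced split points over a hex-encoded (md5) keyspace and handing them to
addSplits; the four-character prefix width and table name are illustrative
assumptions:

import java.util.TreeSet;
import org.apache.accumulo.core.client.Connector;
import org.apache.hadoop.io.Text;

public class EvenSplits {
    // Adds N evenly spaced split points over a hex-encoded keyspace by
    // splitting on the first four hex characters of the row.
    public static void addEvenSplits(Connector conn, String table, int numSplits)
            throws Exception {
        TreeSet<Text> splits = new TreeSet<Text>();
        for (int i = 1; i <= numSplits; i++) {
            // distribute split points uniformly across the 0x0000-0xffff prefix range
            long prefix = ((long) i << 16) / (numSplits + 1);
            splits.add(new Text(String.format("%04x", prefix)));
        }
        conn.tableOperations().addSplits(table, splits);
    }
}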

My data uses a 32-byte key (an md5 value), and I have one column family
with three columns that contain larger data, anywhere from 50-100k up to an
occasional 10M-15M piece.


On Tue, Apr 2, 2013 at 10:06 AM, Josh Elser josh.el...@gmail.com wrote:

 Hi Marc,

 How many tablets are in the table you're running MR over (see the
 monitor)? Might adding some more splits to your table (`addsplits` in the
 Accumulo shell) get you better parallelism?

 What does your data look like in your table? Lots of small rows? Few very
 large rows?







-- 
http://saucyandbossy.wordpress.com


Re: increase running scans in monitor?

2013-04-02 Thread Marc Reichman
I apologize, I neglected to include row counts. For the split sizes
mentioned above, there are roughly ~55K, ~300K, ~800K, and ~2M rows,
respectively.

I'm not necessarily set on the idea that lower running scans are hurting my
overall job time, and I realize that my jobs themselves may simply be
starving the tablet servers (CPU-wise). In my experience thus far, running
all 8 CPU cores per node leads to quicker overall job completion than
pulling one core out of the mix to give Accumulo itself more breathing
room.






-- 
http://saucyandbossy.wordpress.com