Re: Accumulo Iterator painful development because TS don't pick up changes to Jars

2015-10-30 Thread Rob Povey
Thanks, we’ll give this a try next week and see if the issue is in fact the 
VFS Jar version. We do commonly change several iterator Jars 
simultaneously.

Rob Povey

From: dlmar...@comcast.net
Reply-To: user@accumulo.apache.org
Date: Friday, October 30, 2015 at 9:33 AM
To: user@accumulo.apache.org
Subject: Re: Accumulo Iterator painful development because TS don't pick up 
changes to Jars

Also, turn the logging on the tservers up to DEBUG for 
org.apache.accumulo.start.classloader.*. You should see a line in the log that 
starts with "monitoring "



From: dlmar...@comcast.net
To: user@accumulo.apache.org
Sent: Friday, October 30, 2015 12:22:53 PM
Subject: Re: Accumulo Iterator painful development because TS don't pick up 
changes to Jars

Try replacing the VFS jar in lib with a 2.1-SNAPSHOT. Several issues have been 
fixed in it; one of them was that if more than one monitored resource changed, 
some of the changes would be missed.




Re: Accumulo Iterator painful development because TS don't pick up changes to Jars

2015-10-30 Thread Rob Povey
Thanks for the help with this.

To be clear, I believe we are using the context class loader for each of the 
sets of tables, and we don’t see the jars reloaded reliably when they are 
changed. This behavior is the same whether I run a single-node setup on my 
local development machine or run on the cluster. Copying the jars into the 
lib/ext directory, however, always seems to pick up the change.

These are from my dev box, but the clusters look the same, just with many more 
contexts:

default | general.vfs.classpaths ......................... |
site    | general.vfs.context.classpath.rob_maanaNgram ... |
system  |    @override ................................... | hdfs://localhost/user/maana/rob/iterators/maanaNgram/maana-iterators-plugins-core_2.11-assembly.jar
site    | general.vfs.context.classpath.rob_maanaSearch .. |
system  |    @override ................................... | hdfs://localhost/user/maana/rob/iterators/maanaSearch/maana-iterators-core_2.11-1.0-SNAPSHOT-assembly.jar

And then the context is set on the table like this:

default | table.classpath.context ... |
table   |    @override .............. | rob_maanaNgram
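The two settings shown above (a system-level property that maps a context name to a jar location, and a table-level property that points a table at that context) can be generated from a naming convention. The following is only a sketch: the class `ContextProps`, its method names, and the prefix/application split are hypothetical; only the property names `general.vfs.context.classpath.<context>` and `table.classpath.context` and the example jar path come from the shell output above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: builds the two property settings shown in the shell output. */
final class ContextProps {
    static final String CONTEXT_PREFIX = "general.vfs.context.classpath.";
    static final String TABLE_CONTEXT = "table.classpath.context";

    /** Context name, e.g. ("rob", "maanaNgram") gives "rob_maanaNgram". */
    static String contextName(String prefix, String app) {
        return prefix + "_" + app;
    }

    /** System-level property mapping the context name to the jar location. */
    static Map<String, String> systemProps(String prefix, String app, String jarUri) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put(CONTEXT_PREFIX + contextName(prefix, app), jarUri);
        return props;
    }

    /** Table-level property pointing a table at the context. */
    static Map<String, String> tableProps(String prefix, String app) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put(TABLE_CONTEXT, contextName(prefix, app));
        return props;
    }

    public static void main(String[] args) {
        String jar = "hdfs://localhost/user/maana/rob/iterators/maanaNgram/"
                + "maana-iterators-plugins-core_2.11-assembly.jar";
        System.out.println(systemProps("rob", "maanaNgram", jar));
        System.out.println(tableProps("rob", "maanaNgram"));
    }
}
```

The resulting pairs would still have to be applied to the instance, for example with the shell's `config -s` (adding `-t <table>` for the table property) or, assuming the standard Java client, `InstanceOperations.setProperty` and `TableOperations.setProperty`.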



Below is most of accumulo-site.xml, minus the secrets and ZooKeeper sections. 
None of the iterators are in the configured classpaths.

  
<property>
  <name>tserver.memory.maps.max</name>
  <value>2G</value>
</property>

<property>
  <name>tserver.memory.maps.native.enabled</name>
  <value>true</value>
</property>

<property>
  <name>tserver.cache.data.size</name>
  <value>4G</value>
</property>

<property>
  <name>tserver.cache.index.size</name>
  <value>3G</value>
</property>

<property>
  <name>trace.token.property.password</name>
  <value>maana</value>
</property>

<property>
  <name>trace.user</name>
  <value>root</value>
</property>

<property>
  <name>tserver.sort.buffer.size</name>
  <value>200M</value>
</property>

<property>
  <name>tserver.walog.max.size</name>
  <value>1G</value>
</property>

<property>
  <name>general.classpaths</name>
  <value>
    $ACCUMULO_HOME/lib/accumulo-server.jar,
    $ACCUMULO_HOME/lib/accumulo-core.jar,
    $ACCUMULO_HOME/lib/accumulo-start.jar,
    $ACCUMULO_HOME/lib/accumulo-fate.jar,
    $ACCUMULO_HOME/lib/accumulo-proxy.jar,
    $ACCUMULO_HOME/lib/[^.].*.jar,
    $ZOOKEEPER_HOME/zookeeper[^.].*.jar,
    $HADOOP_CONF_DIR,
    $HADOOP_PREFIX/share/hadoop/common/[^.].*.jar,
    $HADOOP_PREFIX/share/hadoop/common/lib/(?!slf4j)[^.].*.jar,
    $HADOOP_PREFIX/share/hadoop/hdfs/[^.].*.jar,
    $HADOOP_PREFIX/share/hadoop/mapreduce/[^.].*.jar,
    $HADOOP_PREFIX/share/hadoop/yarn/[^.].*.jar,
    /usr/hdp/current/hadoop-client/[^.].*.jar,
    /usr/hdp/current/hadoop-client/lib/(?!slf4j)[^.].*.jar,
    /usr/hdp/current/hadoop-hdfs-client/[^.].*.jar,
    /usr/hdp/current/hadoop-mapreduce-client/[^.].*.jar,
    /usr/hdp/current/hadoop-yarn-client/[^.].*.jar,
    /usr/hdp/current/hadoop-yarn-client/lib/jersey.*.jar,
    $HADOOP_PREFIX/[^.].*.jar,
    $HADOOP_PREFIX/lib/(?!slf4j)[^.].*.jar,
    /usr/hdp/current/hive-client/lib/hive-accumulo-handler.jar,
  </value>
  <description>Classpaths that accumulo checks for updates and class files.</description>
</property>

<property>
  <name>instance.volumes</name>
  <value>hdfs://maana2/apps/accumulo</value>
</property>

On 10/29/15, 5:27 PM, "dlmar...@comcast.net" <dlmar...@comcast.net> wrote:

>So, without seeing your configuration, I would suggest trying something before 
>upgrading to 1.7. In 1.5 we changed the classloader so that it could load from 
>different locations. At the same time, we added the concept of classloader 
>contexts which are basically names for locations for jars. Table(s) can be 
>configured to use a classloader context allowing you to deploy server side 
>code for different applications in different locations. This new classloader 
>does "reload" jars on the classpath when they change, the same behavior as 
>the older classloader reading from lib/ext. You can read more about this 
>feature at [1].
>
>We currently depend on Commons VFS 2.0 for this feature. Some bugs have been 
>fixed and you will have a better experience if you replace the VFS jar in the 
>lib directory with a snapshot of the 2.1 release[2].
>
>[1] https://blogs.apache.org/accumulo/entry/the_accumulo_classloader
>[2] https://continuum-ci.apache.org/continuum/workingCopy.action?projectId=129=Apache+Commons+VFS=dist/target
>
>
>> -----Original Message-----
>> From: dlmar...@comcast.net [mailto:dlmar...@comcast.net]
>> Sent: Thursday, October 29, 2015 8:04 PM
>> To: user@accumulo.apache.org
>> Subject: RE: Accumulo Iterator painful development because TS don't pick up
>> changes to Jars
>> 
>> Can you provide the relevant classpath sections of your accumulo-site.xml
>> file?
>> 

Accumulo Iterator painful development because TS don't pick up changes to Jars

2015-10-29 Thread Rob Povey
Caveat: I’m still running 1.6.2 internally here, so things may have changed, 
or I could simply “be doing it wrong” or have missed the solution in the docs. 
It’s also probably not a typical use case.

This is not really an issue for most day-to-day development, but our internal 
testing process makes changing iterators a nightmare.
Before I start: I am aware of general.dynamic.classpaths, but it appears 
that wildcards are only respected at the file level, which is insufficient for 
our use case, as you’ll see later.

I’ll try to explain our internal test environment to help understand the issue.
We run daily (or more frequent) drops of our codebase against two internal 
clusters across a variety of data sources (most of them aren’t particularly 
large). 
To give some idea, I count 462 tables on one of the clusters, and each 
instance of the application uses 11 or so tables, of which 4 or so have a 
variety of iterators we’ve written.
To resolve the conflicts (our application predates namespaces), we prefix 
the tables and the table contexts and upload the iterators to subdirectories 
with matching names. 
To complicate matters further, many of the tables are dropped and new tables 
added at a pretty frightening rate, so having to change the configuration and 
restart servers to add a new path to the general.dynamic.classpaths property is 
something of a non-starter.

It all works fine until a build has a change in an iterator and is targeted 
against an existing table. The app correctly identifies and uploads the new 
jars, but Accumulo doesn’t pick up the change. In many cases I could 
live with it if simply dropping the tables and reingesting were sufficient, but 
short of ingesting into a new table name, even that doesn’t always pick up the 
new iterators.
We have currently resorted to manually tracking every iterator change (the rate 
of which has at least slowed down recently) and doing rolling restarts of 
tablet servers in off hours, but we often end up not knowing whether a bug is 
real or is caused by a TS having an old iterator loaded.

Is there a way to get the TS to watch an entire subtree for Jar changes?

Assuming there isn’t, when I get a few days without a looming deliverable I 
plan to migrate to 1.7 and, if that has the same issue, look at making and 
submitting a fix.


Rob Povey






On 10/28/15, 2:25 PM, "Josh Elser" <josh.el...@gmail.com> wrote:

>Rob Povey wrote:
>> However I’m pretty reticent right now to add any more iterators to our
>> project, they’ve been a test nightmare for us internally.
>
>Off-topic, I'd like to hear more about what is painful. Do you have the 
>time to fork off a thread and let us know how it hurts?


Re: Is there a sensible way to do this? Sequential Batch Scanner

2015-10-28 Thread Rob Povey
Unfortunately that’s pretty much what I’m doing now, and the results are large 
enough that pulling them back and sorting them causes fairly dramatic GC issues.
If I could get them in sorted order I would no longer need to retain them; I 
could just process them and discard them, eliminating my GC issues.
I think the way I’ll end up working around this in the short term is to pull 
pages of data from a batch scanner, sort those, then combine the paged results. 
That should be manageable.
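The paging workaround described above amounts to sorting each page on its own and then doing a k-way merge of the sorted pages. A minimal sketch in plain Java follows, assuming the results are any `Comparable` items; the class `PagedMerge` and its methods are hypothetical names, and the pages are kept in memory here purely for brevity (nothing in this message says where the pages would actually be held).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch: sort each page on its own, then k-way merge the sorted pages. */
final class PagedMerge {

    /** One cursor per sorted page, ordered by the element it currently points at. */
    private static final class Cursor<T extends Comparable<T>> implements Comparable<Cursor<T>> {
        final Iterator<T> rest;
        T head;

        Cursor(Iterator<T> it) {
            this.rest = it;
            this.head = it.next();
        }

        @Override
        public int compareTo(Cursor<T> other) {
            return head.compareTo(other.head);
        }
    }

    /** Splits unordered results into pages of at most pageSize items and sorts each page. */
    static <T extends Comparable<T>> List<List<T>> sortedPages(Iterator<T> results, int pageSize) {
        List<List<T>> pages = new ArrayList<>();
        List<T> page = new ArrayList<>(pageSize);
        while (results.hasNext()) {
            page.add(results.next());
            if (page.size() == pageSize) {
                Collections.sort(page);
                pages.add(page);
                page = new ArrayList<>(pageSize);
            }
        }
        if (!page.isEmpty()) {
            Collections.sort(page);
            pages.add(page);
        }
        return pages;
    }

    /** Merges already-sorted pages into one globally sorted list. */
    static <T extends Comparable<T>> List<T> merge(List<List<T>> pages) {
        PriorityQueue<Cursor<T>> heap = new PriorityQueue<>();
        for (List<T> p : pages) {
            if (!p.isEmpty()) {
                heap.add(new Cursor<>(p.iterator()));
            }
        }
        List<T> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor<T> c = heap.poll();   // cursor with the smallest current element
            out.add(c.head);
            if (c.rest.hasNext()) {
                c.head = c.rest.next();
                heap.add(c);             // re-insert with its next element
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Iterator<String> unordered = Arrays.asList("r5", "r1", "r4", "r2", "r3").iterator();
        System.out.println(merge(sortedPages(unordered, 2))); // [r1, r2, r3, r4, r5]
    }
}
```

In a real client the merge loop could hand each element to the processing step instead of appending it to `out`, so that only one element per page is live at a time.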

Rob Povey

From: Keith Turner <ke...@deenlo.com>
Reply-To: user@accumulo.apache.org
Date: Wednesday, October 28, 2015 at 8:04 AM
To: user@accumulo.apache.org
Subject: Re: Is there a sensible way to do this? Sequential Batch Scanner

Will the results always fit into memory? If so, you could put the results from 
the batch scanner into an ArrayList and sort it.





Re: Is there a sensible way to do this? Sequential Batch Scanner

2015-10-28 Thread Rob Povey
Thanks, I had thought about trying this, and it’s good to know it’s a viable 
solution.

However I’m pretty reticent right now to add any more iterators to our project, 
they’ve been a test nightmare for us internally.
Because of the way our internal process works, at any point in time we have 
many versions of our product running against a subset of tables in a single 
Accumulo instance, and at least in 1.6 there doesn’t appear to be a good way to 
have the tablet servers auto-reload the iterators when builds are updated (you 
can specify paths to watch, but it doesn't seem to deal with wildcards). Our 
internal servers have literally hundreds of tables that require different 
versions of iterators, so they are in differing HDFS paths.

Thanks

Rob Povey


From: Dylan Hutchison <dhutc...@uw.edu>
Reply-To: user@accumulo.apache.org
Date: Tuesday, October 27, 2015 at 4:35 PM
To: Accumulo User List <user@accumulo.apache.org>
Subject: Re: Is there a sensible way to do this? Sequential Batch Scanner

Hi Rob,

One solution is to use an Accumulo iterator.  Suppose you want to scan a set of 
non-overlapping ranges R.  Use a (non-batch) Scanner, with range spanning the 
least start key in R to the greatest end key in R, and a server-side iterator 
that works as follows:

  *   Pass R to the server-side iterator via iterator options.
  *   On a call to seek(Range r, ..., ...) in the iterator: let the iterator 
seek its parent for the first range in R that intersects with r.
  *   On a call to next(), if the current seek'ed range is finished, seek its 
parent to the next range in R that intersects with r, until no more ranges in R 
intersect with r.  At that point the scan is finished.
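The seek logic in the steps above can be simulated in memory to see why the results come back in order. This is only a sketch under stated assumptions: a `NavigableMap` stands in for the parent iterator, rows are plain strings, ranges are inclusive on both ends, and the class `MultiRangeSeekSketch` and its members are hypothetical. It is not an Accumulo SortedKeyValueIterator and it drains each range eagerly instead of advancing on next().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical in-memory simulation; a sorted map stands in for the parent iterator. */
final class MultiRangeSeekSketch {

    /** Inclusive row range [start, end]. */
    static final class RowRange {
        final String start;
        final String end;

        RowRange(String start, String end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Returns the rows of data that fall inside seekRange and inside one of the
     * target ranges R. targets must be sorted by start row and non-overlapping.
     */
    static List<String> scan(NavigableMap<String, String> data,
                             List<RowRange> targets,
                             RowRange seekRange) {
        List<String> rowsInOrder = new ArrayList<>();
        for (RowRange t : targets) {
            // Clip the target range to the range the scanner asked for.
            String lo = t.start.compareTo(seekRange.start) > 0 ? t.start : seekRange.start;
            String hi = t.end.compareTo(seekRange.end) < 0 ? t.end : seekRange.end;
            if (lo.compareTo(hi) > 0) {
                continue; // this target does not intersect the seek range
            }
            // "Seek the parent" to the clipped range and drain it in sorted order.
            rowsInOrder.addAll(data.subMap(lo, true, hi, true).keySet());
        }
        return rowsInOrder;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> data = new TreeMap<>();
        for (String row : Arrays.asList("a", "b", "c", "d", "e", "f", "g", "h")) {
            data.put(row, "value-" + row);
        }
        List<RowRange> r = Arrays.asList(new RowRange("b", "c"), new RowRange("f", "g"));
        System.out.println(scan(data, r, new RowRange("a", "f"))); // [b, c, f]
    }
}
```

Because the target ranges are visited in sorted order and do not overlap, concatenating the per-range results preserves the global sort order.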

The result is that you can scan a number of non-overlapping ranges with "one 
Scanner call" whose results come back in order.  We did this by moving "seek 
control" into the land of iterators.  One word of caution: if the number of 
ranges is very large, you might run into 
ACCUMULO-3710<https://issues.apache.org/jira/browse/ACCUMULO-3710> -- too many 
range objects get materialized at the tablet server, which results in an out of 
memory error.

I have implemented something like this in the Graphulo project under 
SeekFilterIterator<https://github.com/Accla/graphulo/blob/master/src/main/java/edu/mit/ll/graphulo/skvi/SeekFilterIterator.java>
 and its related classes.  Take a look at that if you want to try this idea, 
and feel free to follow up with questions.

Cheers, Dylan








Is there a sensible way to do this? Sequential Batch Scanner

2015-10-27 Thread Rob Povey
What I want is something that behaves like a BatchScanner (i.e., takes a 
collection of Ranges in a single RPC), but preserves the scan ordering.
I understand this would greatly impact performance, but in my case I can 
manually partition my request on the client, and send one request per tablet.
I can’t use Scanners, because in some cases I have tens of thousands of 
non-consecutive ranges.
If I use a single-threaded BatchScanner, and only request data from a single 
tablet, am I guaranteed ordering?
This appears to work correctly in my small tests (albeit slower than a single 
one-thread BatchScanner call), but I don’t really want to have to rely on it if 
the semantics aren’t guaranteed.
If not, is there another “efficient” way to do this?
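The client-side partitioning mentioned above (one request per tablet) can be sketched by grouping lookups by the tablet that owns them. This is a simplified sketch under stated assumptions: each range is represented only by its start row and is assumed not to cross a split point, rows are plain strings, and the class `TabletPartitioner` and the `LAST_TABLET` marker are hypothetical. It relies on a tablet covering the rows after the previous split point up to and including its own end row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch: group lookup rows by the tablet that holds them. */
final class TabletPartitioner {
    /** Map key used for the last tablet, which has no end row. */
    static final String LAST_TABLET = "<last>";

    /**
     * Groups rows by owning tablet, keyed by the tablet's end row.
     * splits holds the table's split points in sorted order.
     */
    static Map<String, List<String>> groupByTablet(TreeSet<String> splits, List<String> rows) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String row : rows) {
            // The owning tablet is the one whose end row is the smallest split >= row.
            String endRow = splits.ceiling(row);
            String tablet = (endRow == null) ? LAST_TABLET : endRow;
            groups.computeIfAbsent(tablet, k -> new ArrayList<>()).add(row);
        }
        return groups;
    }

    public static void main(String[] args) {
        TreeSet<String> splits = new TreeSet<>(Arrays.asList("g", "p"));
        List<String> rows = Arrays.asList("a", "g", "h", "z", "b");
        System.out.println(groupByTablet(splits, rows));
    }
}
```

Each group could then be sent as its own request. The split points would come from the table itself, for example via `TableOperations.listSplits` in the Java client; whether a per-tablet request then returns results in order is exactly the open question in this message, and the sketch does not answer it.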

Thanks

Rob Povey