Re: Making a RowCounterIterator

2016-07-15 Thread William Slacum
The iterator in the gist also counts cells/entries/KV pairs, not unique
rows. You'll want to have some way to skip to the next row value if you
want the count to be reflective of the number of rows being read.
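
For what it's worth, here is a minimal sketch (untested, in Java against WrappingIterator) of that idea: walk the seek'd range once, bump the count only when the row changes, and expose a single key/value pair holding the total, much like the seek-based approach suggested in the quoted reply below. For very wide rows it would be cheaper to re-seek the source to key.followingKey(PartialKey.ROW) rather than calling next() through every column, each tablet returns its own partial count that the client still has to sum, and the "rowCount" row used for the emitted key is an arbitrary choice.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Collection;

import org.apache.accumulo.core.data.ByteSequence;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.WrappingIterator;
import org.apache.hadoop.io.Text;

public class RowCounterIterator extends WrappingIterator {

  private Key topKey;
  private Value topValue;

  @Override
  public void seek(Range range, Collection<ByteSequence> families, boolean inclusive)
      throws IOException {
    super.seek(range, families, inclusive);
    long rows = 0;
    Text lastRow = null;
    while (getSource().hasTop()) {
      Text row = getSource().getTopKey().getRow();
      if (lastRow == null || !row.equals(lastRow)) {
        rows++;           // only count the first entry seen for each row
        lastRow = row;
      }
      getSource().next();
    }
    // Emit a single pair carrying this tablet's partial count.
    topKey = new Key(new Text("rowCount"));
    topValue = new Value(Long.toString(rows).getBytes(StandardCharsets.UTF_8));
  }

  @Override
  public void next() {
    // Only one pair is returned per seek; once it is consumed there is nothing left.
    topKey = null;
    topValue = null;
  }

  @Override
  public boolean hasTop() {
    return topKey != null;
  }

  @Override
  public Key getTopKey() {
    return topKey;
  }

  @Override
  public Value getTopValue() {
    return topValue;
  }
}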

On Fri, Jul 15, 2016 at 3:34 PM, Shawn Walker 
wrote:

> My read is that you're mistaking the sequence of calls Accumulo will be
> making to your iterator.  The sequence isn't quite the same as a Java
> iterator (initially positioned "before" the first element), and is more
> like a C++ iterator:
>
> 0. Accumulo calls seek(...)
> 1. Is there more data? Accumulo calls hasTop(). You return yes.
> 2. Ok, so there's data.  Accumulo calls getTopKey(), getTopValue() to
> retrieve the data. You return a key indicating 0 columns seen (since next()
> hasn't yet been called)
> 3. First datum done, Accumulo calls next()
> ...
>
> I imagine that if you pull the second item out of your scan result, it'll
> have the number you expect.  Alternately, you might consider performing the
> count computation during an override of the seek(...) method, instead of in
> the next(...) method.
>
> --
> Shawn Walker
>
>
>
> On Fri, Jul 15, 2016 at 2:24 PM, Mario Pastorelli <
> mario.pastore...@teralytics.ch> wrote:
>
>> I'm trying to create a RowCounterIterator that counts all the rows and
>> returns only one key-value with the counter inside. The problem is that I
>> can't get it to work. The Scala code is available in the gist
>> 
>> together with some pseudo-code of a test. The problem is that if I add an
>> entry to my table, this iterator will return 0 instead of 1 and apparently
>> the reason is that super.hasTop() is always false. I've tried without the
>> iterator and the scanner returns 1 element. Any idea what I'm doing
>> wrong here? Is WrappingIterator the right class to extend for this kind of
>> behaviour?
>>
>> Thanks,
>> Mario
>>
>> --
>> Mario Pastorelli | TERALYTICS
>>
>> *software engineer*
>>
>> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
>> phone: +41794381682
>> email: mario.pastore...@teralytics.ch
>> www.teralytics.net
>>
>> Company registration number: CH-020.3.037.709-7 | Trade register Canton
>> Zurich
>> Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz,
>> Yann de Vries
>>
>> This e-mail message contains confidential information which is for the
>> sole attention and use of the intended recipient. Please notify us at once
>> if you think that it may not be intended for you and delete it immediately.
>>
>
>


Re: Unable to import RFile produced by AccumuloFileOutputFormat

2016-07-08 Thread William Slacum
I wonder if the file isn't being decrypted properly. I don't see why it
would write out incompatible file versions.

On Fri, Jul 8, 2016 at 3:02 PM, Josh Elser  wrote:

> Interesting! I have not run into this one before.
>
> You could use `accumulo rfile-info`, but I'd guess that would net the same
> exception you see below.
>
> Let me see if I can dig a little into the code and come up with a
> plausible explanation.
>
>
> Russ Weeks wrote:
>
>> Hi, folks,
>>
>> Has anybody ever encountered a problem where the RFiles that are
>> generated by AccumuloFileOutputFormat can't be imported using
>> TableOperations.importDirectory?
>>
>> I'm seeing this problem very frequently for small RFiles and
>> occasionally for larger RFiles. The errors shown in the monitor's log UI
>> suggest a corrupt file, to me. For instance, the stack trace below shows
>> a case where the BCFileVersion was incorrect, but sometimes it will
>> complain about an invalid length, negative offset, or invalid codec.
>>
>> I'm using HDP Accumulo 1.7.0 (1.7.0.2.3.4.12-1) on an encrypted HDFS
>> volume, with Kerberos turned on. The RFiles are generated by
>> AccumuloFileOutputFormat from a Spark job.
>>
>> A very small RFile that exhibits this problem is available here:
>> http://firebar.newbrightidea.com/downloads/bad_rfiles/Iwaz.rf
>>
>> I'm pretty confident that the keys are being written to the RFile in
>> order. Are there any tools I could use to inspect the internal structure
>> of the RFile?
>>
>> Thanks,
>> -Russ
>>
>> Unable to find tablets that overlap file
>> hdfs://[redacted]/accumulo/data/tables/f/b-ze9/Izeb.rf
>> java.lang.RuntimeException: Incompatible BCFile fileBCFileVersion.
>> at
>>
>> org.apache.accumulo.core.file.rfile.bcfile.BCFile$Reader.<init>(BCFile.java:828)
>> at
>>
>> org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.init(CachableBlockFile.java:246)
>> at
>>
>> org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getBCFile(CachableBlockFile.java:257)
>> at
>>
>> org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.access$100(CachableBlockFile.java:137)
>> at
>>
>> org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader$MetaBlockLoader.get(CachableBlockFile.java:209)
>> at
>>
>> org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getBlock(CachableBlockFile.java:313)
>> at
>>
>> org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:368)
>> at
>>
>> org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:137)
>> at org.apache.accumulo.core.file.rfile.RFile$Reader.<init>(RFile.java:843)
>> at
>>
>> org.apache.accumulo.core.file.rfile.RFileOperations.openReader(RFileOperations.java:79)
>> at
>>
>> org.apache.accumulo.core.file.DispatchingFileFactory.openReader(DispatchingFileFactory.java:69)
>> at
>>
>> org.apache.accumulo.server.client.BulkImporter.findOverlappingTablets(BulkImporter.java:644)
>> at
>>
>> org.apache.accumulo.server.client.BulkImporter.findOverlappingTablets(BulkImporter.java:615)
>> at
>>
>> org.apache.accumulo.server.client.BulkImporter$1.run(BulkImporter.java:146)
>> at
>> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>> at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> at
>>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at
>> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>> at java.lang.Thread.run(Thread.java:745)
>>
>


Re: java.lang.NoClassDefFoundError with fields of custom Filter

2016-07-07 Thread William Slacum
You could also shade/relocate dependency classes within the uber/fat jar.
It has pitfalls but it is very easy to set up.
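
To make the failure mode concrete, here is a hypothetical lexicoder (the names are illustrative, not from the original code) whose static field drags Log4J in at class-initialization time. If log4j is missing from the tablet server classpath, the first use fails during static initialization and every later reference surfaces as "NoClassDefFoundError: Could not initialize class ...", which is why shading, an uber jar, or the classpath options discussed below all address it.

import org.apache.accumulo.core.client.lexicoder.Lexicoder;
import org.apache.log4j.Logger;

// Hypothetical example mirroring the situation described in this thread.
public class MyLexicoder implements Lexicoder<Integer> {

  // Static dependency: Log4J must be loadable when this class is initialized.
  private static final Logger LOG = Logger.getLogger(MyLexicoder.class);

  @Override
  public byte[] encode(Integer v) {
    LOG.debug("encoding " + v);
    return v.toString().getBytes();
  }

  @Override
  public Integer decode(byte[] b) {
    return Integer.parseInt(new String(b));
  }
}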

On Thursday, July 7, 2016, Massimilian Mattetti  wrote:

> Hi Jim,
>
> the approach of using namespaces from HDFS looks promising. I need to
> investigate a little into how it works, but I guess I will take your advice.
> Thank you.
>
> Cheers,
> Massimiliano
>
>
>
>
> From: James Hughes
> To: user@accumulo.apache.org
> Date: 07/07/2016 08:28 PM
> Subject: Re: java.lang.NoClassDefFoundError with fields of custom Filter
> --
>
>
>
> Hi Massimiliano,
>
> I'm a fan of producing uber jars for this kind of thing; we do that for
> GeoMesa.  There is one gotcha which can come up:  if you have several uber
> jars in lib/ext, they can collide in rather unexpected ways.
>
> There are two options to call out:
>
> First, Accumulo has support for loading jars from HDFS into namespaces.
> With that, you could have various namespaces for different versions or
> different collections of iterator projects.  If you are sharing a dev cloud
> with other projects or co-workers working on the same project that can be
> helpful since it would avoid restarts, etc.  Big thumbs-up for this
> approach!
>
> Second, rather than having an uber jar, you could build up zip files with
> the various jars you need for your iterators and unzip them in lib/ext.  If
> you did that for multiple competing iterator projects, you'd avoid
> duplication of code inside uber-jars.  Also, you'd be able to see if there
> are 8 versions of Log4J and Guava in lib/ext...;)  It wouldn't be as
> powerful as the namespace, but there's something nice about having a
> low-tech approach.
>
> Others will likely have varied experiences; I'm not sure if there's an
> established 'best practice' here.
>
> Cheers,
>
> Jim
>
> On Thu, Jul 7, 2016 at 12:56 PM, Massimilian Mattetti <
> *massi...@il.ibm.com*
> > wrote:
> Thanks for your prompt response, you are right Jim. There is a static
> dependency on Log4J in my lexicoder. Adding the Log4J jar to the classpath
> solved the problem.
> Would you suggest using an uber jar to avoid this kind of problem?
>
> Regards,
> Massimiliano
>
>
>
>
> From: James Hughes <jn...@virginia.edu>
> To: user@accumulo.apache.org
> Date: 07/07/2016 06:25 PM
> Subject: Re: java.lang.NoClassDefFoundError with fields of custom Filter
> --
>
>
>
>
> Hi Massimilian,
>
> As a quick note, your error says that it could not initialize class
> accumulo.lexicoders.MyLexicoder.  Did you provide all the dependencies for
> your class on Accumulo's classpath?
>
> That exception (or similar) can occur if there is a static block in your
> MyLexicoder class which can't run properly.
>
> Cheers,
>
> Jim
>
>
> On Thu, Jul 7, 2016 at 11:19 AM, Massimilian Mattetti <
> *massi...@il.ibm.com*
> > wrote:
> Hi,
>
> I have implemented a custom filter and a custom lexicoder. Both of these
> classes are packed in the same jar, which has been deployed under the
> directory $ACCUMULO_HOME/lib/ext on my Accumulo servers (version
> 1.7.1). The lexicoder is used by the filter to get the real object from the
> accumulo value and test some conditions on it. When I tried to scan the
> table applying this filter I got the following exception:
>
> Caused by: java.lang.NoClassDefFoundError: Could not initialize class
> accumulo.lexicoders.MyLexicoder
> at accumulo.filters.MyFilter.<init>(MyFilter.java:24)
> at sun.reflect.GeneratedConstructorAccessor9.newInstance(Unknown
> Source)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at java.lang.Class.newInstance(Class.java:442)
> at
> org.apache.accumulo.core.iterators.IteratorUtil.loadIterators(IteratorUtil.java:261)
> at
> org.apache.accumulo.core.iterators.IteratorUtil.loadIterators(IteratorUtil.java:237)
> at
> org.apache.accumulo.core.iterators.IteratorUtil.loadIterators(IteratorUtil.java:218)
> at
> org.apache.accumulo.core.iterators.IteratorUtil.loadIterators(IteratorUtil.java:205)
> at
> org.apache.accumulo.tserver.tablet.ScanDataSource.createIterator(ScanDataSource.java:193)
> at
> org.apache.accumulo.tserver.tablet.ScanDataSource.iterator(ScanDataSource.java:127)
> at
> org.apache.accumulo.core.iterators.system.SourceSwitchingIterator.seek(SourceSwitchingIterator.java:180)
> at
> 

Re: [ANNOUNCE] Fluo 1.0.0-beta-2 is released

2016-01-19 Thread William Slacum
Cool beans, Keith!

On Tue, Jan 19, 2016 at 11:30 AM, Keith Turner  wrote:

> The Fluo project is happy to announce a 1.0.0-beta-2[1] release which is
> the
> third release of Fluo and likely the final release before 1.0.0. Many
> improvements in this release were driven by the creation of two new Fluo
> related projects:
>
>   * Fluo recipes[2] is a collection of common development patterns
> designed to
> make Fluo application development easier. Creating Fluo recipes
> required
> new Fluo functionality and updates to the Fluo API. The first release
> of
> Fluo recipes has been made and is available in Maven Central.
>
>   * WebIndex[3] is an example Fluo application that indexes links to web
> pages
> in multiple ways. Webindex enabled the testing of Fluo on real data at
> scale.  It also inspired improvements to Fluo to allow it to work
> better
> with Apache Spark.
>
> Fluo is now at a point where its two cluster test suites, Webindex[3] and
> Stress[4], are running well for long periods on Amazon EC2. We invite early
> adopters to try out the beta-2 release and help flush out problems before
> 1.0.0.
>
> [1]: http://fluo.io/1.0.0-beta-2-release/
> [2]: https://github.com/fluo-io/fluo-recipes
> [3]: https://github.com/fluo-io/webindex
> [4]: https://github.com/fluo-io/fluo-stress
>
>


Re: compression of keys for a sequential scan over an inverted index

2015-10-26 Thread William Slacum
Thanks, Jonathan! I've wondered about specific numbers on this topic when
dealing with geohashes, so this is a very useful tool.

On Sun, Oct 25, 2015 at 11:22 AM, Jonathan Wonders 
wrote:

> I have been able to put some more thought into this over the weekend and
> make initial observations on tables I currently have populated.  Looking at
> the rfile-info for a few different tables, I noticed that one which has
> particularly small lexicographical deltas between keys costs an average of
> ~2.5 bits per key to store on disk.  All of the data is stored in the row
> component of the key and a full row is typically about 36 bytes.  I wrote a
> little utility to recreate ScanResult objects for batches of sequential
> key-value pairs returned from a scanner and then used the TCompactProtocol
> to write the ScanResult to a byte array.  Each key-value pair costs
> roughly 48 bytes which makes sense given that every row is different and
> there will be some space required for the timestamps, visibilities, and
> other bookkeeping info.
>
> Another table I looked at has larger lexicographical deltas between keys
> and costs roughly 5 bytes per key to store on disk.  This table is a
> reverse index with very large rows, each column within a row identifies
> data that resides in another table.  Each column is roughly 12 bytes
> uncompressed.  When encoded in a ScanResult, each key-value pair costs
> roughly 25 bytes which makes sense since the row cost should be negligible
> for large batch sizes and the overhead from timestamp, visibility, and
> other bookkeeping info is roughly the same as the other table.
>
> Since compression depends heavily on both table design and the actual
> data, it seemed the next logical step would be to create a tool that the
> community could use to easily measure the compression ratio for ScanResult 
> objects.
> So, I threw together a shell extension to wrap the utility that I
> previously described.  It measures compression ratio for the default
> strategy Key.compress as well as a few other simple strategies that seemed
> reasonable to test. The usage is almost the same as the scan command, it
> just prints out compression statistics rather than the data.
>
> It lives at https://github.com/jwonders/accumulo-experiments with
> branches for Accumulo 1.6.x, 1.7.x, and 1.8.x.
>
> Any feedback is welcome.  I hope others find this useful for understanding
> this particular aspect of scan performance.
>
> V/R
> Jonathan
>
>
> On Thu, Oct 22, 2015 at 4:37 PM, Jonathan Wonders 
> wrote:
>
>> Josh,
>>
>> Thanks for the information.  I did read through the discussion about
>> compression of visibility expressions and columns within RFiles a while
>> back which got me thinking about some of this. It makes sense that gzip or
>> lzo/snappy compression would have a very noticeable impact when there are
>> columns or visibility expressions that are not compressed with RLE even if
>> neighboring rows have very small lexicographical deltas.
>>
>> I will put some thought into designing an experiment to evaluate whether
>> or not there is any benefit to applying RLE during key-value transport from
>> server to client.  Even if it proves to be situationally beneficial, I
>> think it could be implemented as a common iterator similar to the
>> WholeRowIterator.
>>
>> Given the current compression strategy, I would expect better
>> server-to-client transport compression when retrieving a single row with many
>> columns
>>
>> : []
>>
>> compared to many lexicographically close rows.
>>
>> : []
>>
>> with the understanding that very large rows can lead to poor load
>> balancing.
>>
>> V/R
>> Jonathan
>>
>> On Thu, Oct 22, 2015 at 11:54 AM, Josh Elser 
>> wrote:
>>
>>> Jonathan Wonders wrote:
>>>
 I have been digging into some details of Accumulo to model the disk and
 network costs associated with various types of scan patterns and I have
 a few questions regarding compression.

 Assuming an inverted index table with rows following the pattern of

 

 and a scan that specifies an exact key and value so as to constrain the
 range, it seems that the dominant factor in network utilization would
 be sending key-value pairs from the tablet server to the client and a
 secondary factor would be transmitting data from non-local RFiles
 (assuming no caching).

>>>
>>> Sounds about right to me.
>>>
>>> Is my understanding correct that the on-disk compression of this type of
 table is predominantly a function of the average number of differing
 bits between adjacent ids?  Or, has anyone observed a significant
 improvement with gz or lzo vs no additional compression?  I'm
 considering running some experiments to measure the difference for a few
 types of ids (uuid, snowflake-like, content based hashes), but I'm
 curious if anyone else has done similar experiments.

Re: Watching for Changes with Write Ahead Log?

2015-10-01 Thread William Slacum
Soup gave a talk about something down this alley:
https://www.youtube.com/watch?v=aedejUXWrV0
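
For the connection-caching question that comes up later in this thread, here is a minimal sketch of the static-object idea suggested in the quoted messages below, with a placeholder EsClient standing in for whatever external connection is needed (the placeholder and class names are illustrative, not a recommendation). It relies only on JVM-wide statics, so it sidesteps the open questions about how many Constraint instances exist and how long they live; clean shutdown remains unsolved, as the thread notes.

import java.util.List;

import org.apache.accumulo.core.constraints.Constraint;
import org.apache.accumulo.core.data.Mutation;

public class IndexingConstraint implements Constraint {

  // Placeholder for the external client (e.g. an ElasticSearch connection).
  static class EsClient {
    void index(Mutation m) {
      // forward the mutation's updates to the external index
    }
  }

  private static volatile EsClient client;

  // Lazily create one shared client per tablet server JVM and reuse it across check() calls.
  private static EsClient client() {
    if (client == null) {
      synchronized (IndexingConstraint.class) {
        if (client == null) {
          client = new EsClient();
        }
      }
    }
    return client;
  }

  @Override
  public String getViolationDescription(short violationCode) {
    return "failed to forward mutation to the external index";
  }

  @Override
  public List<Short> check(Environment env, Mutation mutation) {
    client().index(mutation);
    return null; // null (or an empty list) means no violations
  }
}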

On Thu, Oct 1, 2015 at 2:58 PM, Keith Turner  wrote:

> Could possibly use a ThreadLocal containing a SoftReference
>
> Another place you could possibly put this code instead of in a constraint
> is in a minor compaction iterator.   Although you will still have the clean
> up problem since iterators do not have a close method.  There is an open
> ticket where Billie suggested that Accumulo can call close on any iterator
> that implements Closeable
>
> On Thu, Oct 1, 2015 at 2:48 PM, Parise, Jonathan <
> jonathan.par...@gd-ms.com> wrote:
>
>> I think this is one of those things where there really aren’t great
>> solutions.
>>
>>
>>
>> A static connection could work, but if multiple Constraint instances can
>> exist at the same time, it probably would not, since all of them would
>> be trying to use the same connection at the same time.
>>
>>
>>
>> ThreadLocal could possibly work better. The only question is how long
>> lived is the thread that calls the constraints? For example that thread
>> could be torn down as soon as the constraint is done. In that case the
>> performance would be no better than creating and tearing down everything
>> each time check() is called.
>>
>>
>>
>> This is why I am trying to understand the Constraint’s lifecycle, so I
>> can come up with the least bad way of solving this problem.
>>
>>
>>
>> Thanks for the ideas! I am just not sure I know enough about the
>> lifecycle of Constraints to understand if these suggestions would be
>> helpful.
>>
>>
>>
>> Jon Parise
>>
>>
>>
>> *From:* John Vines [mailto:vi...@apache.org]
>> *Sent:* Thursday, October 01, 2015 2:40 PM
>>
>> *To:* user@accumulo.apache.org
>> *Subject:* Re: Watching for Changes with Write Ahead Log?
>>
>>
>>
>> As dirty as it is, that sounds like a case for a static, or maybe thread
>> local, object
>>
>>
>>
>> On Thu, Oct 1, 2015, 7:19 PM Parise, Jonathan 
>> wrote:
>>
>> I have a few follow up questions in regard to constraints.
>>
>>
>>
>> What is the lifecycle of a constraint? What I mean by this is: are the
>> constraints somehow tied to Accumulo’s lifecycle or are they just
>> instantiated each time a mutation occurs and then disposed?
>>
>>
>>
>> Also, are there multiple instances of the same constraint class at any
>> time or do all mutation on a table go through the exact same constraint?
>>
>>
>>
>> My guess is that when a mutation comes in, a new constraint is made
>> through reflection. Then check() is called, the violation codes are parsed,
>> and the object is disposed/finalized.
>>
>>
>>
>> The reason I ask is that what I want to do is update my ElasticSearch
>> index each time I see a mutation on the table. However, I don’t want to
>> have to make a connection, send the data and then tear down the connection
>> each time. That’s a lot of unnecessary overhead and with all that overhead
>> happening on every mutation performance could be badly impacted.
>>
>>
>>
>> Is there some way to cache something like a connection and reuse it
>> between calls to the Constraint’s check() method? How would such a thing be
>> cleaned up if Accumulo is shut down?
>>
>>
>>
>>
>>
>> Thanks again,
>>
>>
>>
>> Jon
>>
>> *From:* Parise, Jonathan [mailto:jonathan.par...@gd-ms.com
>> ]
>> *Sent:* Wednesday, September 30, 2015 9:21 AM
>> *To:* user@accumulo.apache.org
>> *Subject:* RE: Watching for Changes with Write Ahead Log?
>>
>>
>>
>> In this particular case, I need to update some of my application state
>> when changes made by another system occur.
>>
>>
>>
>> I would need to do a few things to accomplish my goal.
>>
>>
>>
>> 1)  Be notified or see that a table had changed
>>
>> 2)  Checked that against changes I know my system has made
>>
>> 3)  If my system is not the originator of the change, update
>> internal state to reflect the change.
>>
>>
>>
>> Examples of state I may need to update include an ElasticSearch index and
>> also an in memory cache.
>>
>>
>>
>> I’m going to read up on constraints again and see if I can use them for
>> this purpose.
>>
>>
>>
>> Thanks!
>>
>>
>>
>> Jon
>>
>>
>>
>>
>>
>>
>>
>> *From:* Adam Fuchs [mailto:afu...@apache.org ]
>> *Sent:* Tuesday, September 29, 2015 5:46 PM
>> *To:* user@accumulo.apache.org
>> *Subject:* Re: Watching for Changes with Write Ahead Log?
>>
>>
>>
>> Jon,
>>
>>
>>
>> You might think about putting a constraint on your table. I think the API
>> for constraints is flexible enough for your purpose, but I'm not exactly
>> sure how you would want to manage the results / side effects of your
>> observations.
>>
>>
>>
>> Adam
>>
>>
>>
>>
>>
>> On Tue, Sep 29, 2015 at 5:41 PM, Parise, Jonathan <
>> jonathan.par...@gd-ms.com> wrote:
>>
>> Hi,
>>
>>
>>
>> I’m working on a system where generally changes to Accumulo will come
>> through that system. However, in some cases, 

Re: Question about configuring the linux niceness of tablet servers?

2015-08-17 Thread William Slacum
By Hadoop do you mean a Yarn NodeManager process?

On Mon, Aug 17, 2015 at 4:21 PM, Jeff Kubina jeff.kub...@gmail.com wrote:

 On each of the processing nodes in our cluster we have running 1) HDFS
 (datanode), 2) Accumulo (tablet server), and 3) Hadoop. Since Accumulo
 depends on the HDFS, and Hadoop depends on the HDFS and sometimes on
 Accumulo, we are considering setting the niceness of HDFS to 0 (the current
 value), Accumulo to 1, and Hadoop to 2 on each of the nodes. The objective
 is to improve the real time performance of Accumulo.

 Does anyone have experience configuring their cluster in a similar manner
 that they can share?  Are there any serious cons to doing this?




Origin of hive.auto.convert.sortmerge.join.noconditionaltask

2015-08-04 Thread William Slacum
Hi all,

I've had some questions from users regarding setting
`hive.auto.convert.sortmerge.join.noconditionaltask`. I see, in some
documentation from users and vendors, that it is recommended to set this
parameter. In neither Hive 0.12 nor 0.14 can I find in HiveConf where this
is actually defined and used. Am I correct in thinking that this is just
some cruft that's survived without verification?

Thanks!


Re: Origin of hive.auto.convert.sortmerge.join.noconditionaltask

2015-08-04 Thread William Slacum
You are correct sir!

On Tue, Aug 4, 2015 at 3:42 PM, Josh Elser josh.el...@gmail.com wrote:

 Might you have meant to send this to u...@hive.apache.org?


 William Slacum wrote:

 Hi all,

 I've had some questions from users regarding setting
 `hive.auto.convert.sortmerge.join.noconditionaltask`. I see, in some
 documentation from users and vendors, that it is recommended to set this
 parameter. In neither Hive 0.12 nor 0.14 can I find in HiveConf where
 this is actually defined and used. Am I correct in thinking that this is
 just some cruft that's survived without verification?

 Thanks!




Re: How to control Minor Compaction by programming

2015-07-30 Thread William Slacum
Swap out 1.5 in the previous link for the version you're probably using.

Which charts are you looking at for the compactions? Usually it's just the
number of compactions currently running for the system.

On Thu, Jul 30, 2015 at 7:10 PM, William Slacum wsla...@gmail.com wrote:

 See
 http://accumulo.apache.org/1.5/apidocs/org/apache/accumulo/core/client/admin/TableOperations.html#flush%28java.lang.String,%20org.apache.hadoop.io.Text,%20org.apache.hadoop.io.Text,%20boolean%29
 for minor compacting (aka flushing) a table via the API.


 On Thu, Jul 30, 2015 at 5:52 PM, Hai Pham htp0...@tigermail.auburn.edu
 wrote:

 Hi,


 Please share with me whether there is any way to initiate/control minor
 compaction programmatically (not from the shell). My situation is that when
 I ingest a large amount of data using the BatchWriter, minor compaction is
 triggered uncontrollably. The flush() method on BatchWriter does not seem
 intended for this purpose.

 I also tried playing around with the parameters in the documentation, but it
 did not seem to help much.


 Also, can you please explain the number 0, 1.0, 2.0, ... in charts (web
 monitoring) denoting the level of Minor Compaction and Major Compaction?


 Thank you!

 Hai Pham








Re: How to control Minor Compaction by programming

2015-07-30 Thread William Slacum
See
http://accumulo.apache.org/1.5/apidocs/org/apache/accumulo/core/client/admin/TableOperations.html#flush%28java.lang.String,%20org.apache.hadoop.io.Text,%20org.apache.hadoop.io.Text,%20boolean%29
for minor compacting (aka flushing) a table via the API.


On Thu, Jul 30, 2015 at 5:52 PM, Hai Pham htp0...@tigermail.auburn.edu
wrote:

 Hi,


 Please share with me whether there is any way to initiate/control minor
 compaction programmatically (not from the shell). My situation is that when
 I ingest a large amount of data using the BatchWriter, minor compaction is
 triggered uncontrollably. The flush() method on BatchWriter does not seem
 intended for this purpose.

 I also tried playing around with the parameters in the documentation, but it
 did not seem to help much.


 Also, can you please explain the number 0, 1.0, 2.0, ... in charts (web
 monitoring) denoting the level of Minor Compaction and Major Compaction?


 Thank you!

 Hai Pham







Re: AccumuloInputFormat with pyspark?

2015-07-15 Thread William Slacum
Look in ConfiguratorBase for how it converts enums to config keys. These
are the two methods that are used:

  /**
   * Provides a configuration key for a given feature enum, prefixed by the
implementingClass
   *
   * @param implementingClass
   *          the class whose name will be used as a prefix for the property configuration key
   * @param e
   *          the enum used to provide the unique part of the configuration key
   * @return the configuration key
   * @since 1.6.0
   */
  protected static String enumToConfKey(Class<?> implementingClass, Enum<?> e) {
    return implementingClass.getSimpleName() + "." + e.getDeclaringClass().getSimpleName() + "."
        + StringUtils.camelize(e.name().toLowerCase());
  }

  /**
   * Provides a configuration key for a given feature enum.
   *
   * @param e
   *          the enum used to provide the unique part of the configuration key
   * @return the configuration key
   */
  protected static String enumToConfKey(Enum<?> e) {
    return e.getDeclaringClass().getSimpleName() + "." + StringUtils.camelize(e.name().toLowerCase());
  }
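
Since those methods are protected, one hedged option (this helper is not part of Accumulo; it just mirrors the derivation above) is to reproduce the key construction yourself and print the strings to drop into the python dict. The keys come out in the form <InputFormatSimpleName>.<EnumClassSimpleName>.<CamelizedEnumName>; verify the exact enum classes and constants against the Accumulo version you are running.

public class ConfKeys {

  // Rough stand-in for StringUtils.camelize: "IS_CONFIGURED" -> "IsConfigured"
  static String camelize(String s) {
    StringBuilder out = new StringBuilder();
    for (String part : s.toLowerCase().split("_")) {
      if (!part.isEmpty()) {
        out.append(Character.toUpperCase(part.charAt(0))).append(part.substring(1));
      }
    }
    return out.toString();
  }

  // Mirrors enumToConfKey(Class, Enum) above.
  static String confKey(Class<?> implementingClass, Enum<?> e) {
    return implementingClass.getSimpleName() + "." + e.getDeclaringClass().getSimpleName()
        + "." + camelize(e.name());
  }
}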

On Wed, Jul 15, 2015 at 11:20 AM, Kina Winoto winoto.kin...@gmail.com
wrote:

 Has anyone used the python Spark API and AccumuloInputFormat?

 Using AccumuloInputFormat in scala and java within spark is
 straightforward, but the python spark API's newAPIHadoopRDD function takes
 in its configuration via a python dict (
 https://spark.apache.org/docs/1.1.0/api/python/pyspark.context.SparkContext-class.html#newAPIHadoopRDD)
 and there isn't an obvious set of job configuration keys to use. From
 looking at the Accumulo source, it seems job configuration values are
 stored with keys that are java enums and it's unclear to me what to use for
 configuration keys in my python dict.

 Any thoughts as to how to do this would be helpful!

 Thanks,

 Kina





Re: Abnormal behaviour of custom iterator in getting entries

2015-06-12 Thread William Slacum
What do you mean by multiple entries? Are you doing something similar to
the WholeRowIterator, which encodes all the entries for a given row into a
single key value?

Are you using any other iterators?

In general, calls to `hasTop()`, `getTopKey()` and `getTopValue()` should
not change the state of the iterator, so it should be safe to call them
repeatedly in between calls to `next()` and `seek()`.

On Fri, Jun 12, 2015 at 7:47 AM, shweta.agrawal shweta.agra...@orkash.com
wrote:

 Hi,

 I am making a custom iterator which returns multiple entries. For some
 entries the getTopValue function is called, but sometimes it is skipped. Because
 of this behaviour I am not getting all the entries at scan time that should be
 returned.

 I wrote the sequence of function calls to a text file, which is:
 hasTop
 getTopKey
 hasTop
 getTopKey
 getTopValue
 next
 hasTop

 Thanks
 Shweta





Re: Getting InterruptedException

2015-06-03 Thread William Slacum
What does your code look like?

I've seen issues where I have some code of the form:

BatchScanner s = connector.createBatchScanner(...);
for (Entry<Key,Value> e : s) { System.out.println(e); }

This usually results in an InterruptedException because the
TabletServerBatchReaderIterator doesn't seem to have a reference back to
the batch scanner implementation, so the scanner will get garbage
collected, which closes the batch scanner. If I change my code to the
following I can avoid the issue:

BatchScanner s = connector.createBatchScanner(...);
try {
  for (Entry<Key,Value> e : s) { System.out.println(e); }
} finally {
  s.close();
}

On Wed, Jun 3, 2015 at 7:57 AM, shweta.agrawal shweta.agra...@orkash.com
wrote:

 Hi all,

 I am reading data from one table through a batch scanner, doing some
 modification, and writing to another table.

 But while doing this I am getting the following error:

 Exception in thread "main" java.lang.RuntimeException:
 java.lang.InterruptedException
 at
 org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator$1.receive(TabletServerBatchReaderIterator.java:173)
 at
 org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:697)
 at
 org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator$QueryTask.run(TabletServerBatchReaderIterator.java:372)
 at
 org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at
 org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
 at
 org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.InterruptedException
 at
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
 at
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)
 at
 java.util.concurrent.ArrayBlockingQueue.put(ArrayBlockingQueue.java:324)
 at
 org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator$1.receive(TabletServerBatchReaderIterator.java:166)
 ... 8 more

 Can anyone tell me about this exception ?

 Thanks
 Shweta



Re: [ANNOUNCE] Fluo 1.0.0-alpha-1 Released

2014-10-09 Thread William Slacum
woohoo

Look forward to getting to use this!

On Thu, Oct 9, 2014 at 4:54 PM, Corey Nolet cjno...@gmail.com wrote:

 The Fluo project is happy to announce the 1.0.0-alpha-1 release of Fluo.

 Fluo is a transaction layer that enables incremental processing on top of
 Accumulo. It integrates into Yarn using Apache Twill.

 This is the first release of Fluo and is not ready for production use. We
 invite developers to try it out, play with the quickstart and examples, and
 contribute back in the form of bug reports, new feature requests, and pull
 requests.

 For more information, visit http://www.fluo.io.




Re: Using iterators to generate data

2014-08-30 Thread William Slacum
This comes up a bit, so maybe we should add it to the FAQ (or just have
better information about iterators in general). The short answer is that
it's usually not recommended, because there aren't strong guarantees about
the lifetime of an iterator (so we wouldn't know when to close any
resources held by an iterator instance, such as batch writer thread pools)
and there's 0 resource management related to tablet server-to-tablet server
communications.

Check out Fluo, made by our own Chief Keith Turner and Mike "The Trike"
Walch: https://github.com/fluo-io/fluo

It's an implementation of Google's percolator, which provides the
capability to handle new data server side as well as transactional
guarantees.


On Fri, Aug 29, 2014 at 5:09 PM, Russ Weeks rwe...@newbrightidea.com
wrote:

 There are plenty of examples of using custom iterators to filter or
 combine data at either the cell level or the row level. In these cases, the
 amount of data coming out of the iterator is less than the amount going in.
 What about going the other direction, using a custom iterator to generate
 new data based on the contents of a cell or a row? I guess this is also
 what a combiner does but bear with me...

 The immediately obvious use case is parsing. Suppose one cell in my row
 holds an XML document. I'd like to configure an iterator with an XPath
 expression to pull a field out of the document, so that I can leverage the
 distributed processing of the cluster instead of parsing the doc on the
 scanner-side.

 I'm sure there are constraints or things to watch out for, does anybody
 have any recommendations here? For instance, the generated cells would
 probably have to be in the same row as the input cells?

 I'm using MapReduce to satisfy all these use cases right now but I'm
 interested to know how much of my code could be ported to Iterators.

 Thanks!
 -Russ



Re: Optimal # proxy servers

2014-08-11 Thread William Slacum
Going through the proxy will always be an extra RPC step over using a Java
client. Eliminating that step, I think, would net the most benefit.


On Mon, Aug 11, 2014 at 12:16 AM, John R. Frank j...@diffeo.com wrote:


 Josh,

 Following up on this earlier post about the proxy:

 http://www.mail-archive.com/user%40accumulo.apache.org/msg03445.html



 On 4/14/14, 1:38 PM, Josh Elser wrote:

  If you care about maximizing your throughput, ingest is probably not
 desirable through the proxy (you can probably get ~10x faster using the
 Java BatchWriter API).


  Hrm. 10x may have been overstating too. 5x is probably more accurate.
 YMMV :)




 Is there something more than the extra network hop that makes the proxy
 slow?  The proxy exposes a BatchWriter interface:

 https://github.com/accumulo/pyaccumulo/blob/master/README.md#writing-mutations-with-a-batchwriter-batched-and-optimized-for-throughput

 So, we can batch up multiple requests through the proxy.  Is there
 something else that is only available (only possible?) by going direct
 instead of through the proxy?

 For example, is there a logical difference between what can be done with
 the Java BatchWriter API and this kind of batching loop running through the
 thrift proxy:

 https://github.com/diffeo/kvlayer/blob/master/kvlayer/_accumulo.py#L149

 (Note the crude handling of the max thrift message size.)

 If there is a logical difference, perhaps it would be worthwhile to
 translate the Java BatchWriter into C so there can be native support for
 C/C++/Python applications doing high-speed bulk ingest?


 Thanks for your thoughts on this.


 Regards,
 John



Re: 'scanner closed' error

2014-08-03 Thread William Slacum
I have seen issues if I don't have an explicit close on the batch scanner.
When I don't have the close, the gc ends up calling `finalize()` which
closes the thread pool. Basically, the work around is to manage the
lifetime of the instance yourself, rather than leave it up to fate.
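
In code, the workaround is just to keep the reference and close it yourself (tableName, auths, numQueryThreads, and ranges are assumed to be defined elsewhere; this is a sketch, not a drop-in fix for the iterator stack described below):

BatchScanner scanner = connector.createBatchScanner(tableName, auths, numQueryThreads);
try {
  scanner.setRanges(ranges);
  for (Map.Entry<Key,Value> entry : scanner) {
    // the enclosing try/finally keeps 'scanner' reachable, so it cannot be
    // garbage collected (and finalized) while this loop is still running
    System.out.println(entry);
  }
} finally {
  scanner.close(); // close explicitly instead of relying on finalize()
}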


On Sun, Aug 3, 2014 at 7:03 PM, Don Resnik 
don.res...@objectivesolutions.com wrote:

 Josh,

 Thanks for the response.  I did see that ticket in my initial research.
 If I understood correctly, that ticket makes it sound like the scanner was
 closed programmatically with threads still running, so there was not really
 an error.  In my case, the error is coming up well before the scanner has
 completed.  We did not intend to programmatically close the scanner where
 it is closing on us, but I will confirm this week that we do not have a
 condition that would cause the scanner to close prematurely with threads
 still running.

 Thanks,

 Don Resnik




 On Sun, Aug 3, 2014 at 5:05 PM, Josh Elser josh.el...@gmail.com wrote:

 Don,

 Does this describe your error?

 https://issues.apache.org/jira/browse/ACCUMULO-607


 On 8/3/14, 4:50 PM, Don Resnik wrote:



 I have some query logic that uses a stack of custom iterators with a
 batch scanner.  The query begins to return values but then stops with a
 'scanner closed' error.  The only reference I can find to scanner closed
 in the src is in TabletServerBatchReaderIterator.  I can see that the
 error is thrown when the query thread pool is shutdown, but I am not
 sure why this is happening.  This query logic works on a single node
 instance, but I get the scanner closed error when running on a multi-node
 cluster.

 So far the stack traces have not been very helpful and we are not sure
 where or how to troubleshoot this.  Any info on what conditions would
 lead to a scanner closed error and where to begin looking to resolve
 would be appreciated.

 Thanks,





Re: Z-Curve/Hilbert Curve

2014-07-24 Thread William Slacum
Quick google search yielded:

https://github.com/GeoLatte/geolatte-geom/blob/master/src/main/java/org/geolatte/geom/curve/MortonCode.java
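
If a dependency is not an option, here is a self-contained sketch of the basic 2D Morton/Z-curve encoding (this is generic bit interleaving, not code from the library above; the 31-bit scaling and hex formatting are arbitrary choices). Scaling lat/lon to unsigned integers, interleaving their bits, and zero-padding the hex output makes the lexicographic order of the strings match Z-order.

public class ZCurve {

  public static String encode(double lat, double lon) {
    long y = scale(lat, -90.0, 90.0);
    long x = scale(lon, -180.0, 180.0);
    long morton = interleave(x) | (interleave(y) << 1);
    // zero-padded hex so string order == numeric order
    return String.format("%016x", morton);
  }

  // Maps a value in [min, max] onto a 31-bit unsigned integer.
  private static long scale(double value, double min, double max) {
    double normalized = (value - min) / (max - min);
    return (long) (normalized * ((1L << 31) - 1));
  }

  // Spreads the lower 31 bits of v so there is a zero bit between each original bit.
  private static long interleave(long v) {
    v &= 0x7fffffffL;
    v = (v | (v << 16)) & 0x0000ffff0000ffffL;
    v = (v | (v << 8))  & 0x00ff00ff00ff00ffL;
    v = (v | (v << 4))  & 0x0f0f0f0f0f0f0f0fL;
    v = (v | (v << 2))  & 0x3333333333333333L;
    v = (v | (v << 1))  & 0x5555555555555555L;
    return v;
  }
}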


On Thu, Jul 24, 2014 at 10:10 AM, THORMAN, ROBERT D rt2...@att.com wrote:

  Can anyone share a Java method to convert lat/lon (decimal degrees) to
 Z-Curve (string)?  I need to order my geo-spatial data into lexical order.

  v/r
 Bob Thorman
 Principal Big Data Engineer
 AT&T Big Data CoE
 2900 W. Plano Parkway
 Plano, TX 75075
 972-658-1714





Re: Iterating/Aggregating/Combining Complex (Java POJO/Avro) Values

2014-07-15 Thread William Slacum
Herp... serves me right for not setting up a proper test case.

I think you need to override seek as well:

@Override
public void seek(...) throws IOException {
  super.seek(...);
  next();
}

I think I just realized the wrapping iterator could use some clean up,
because this isn't obvious. Basically after the wrapping iterator's seek is
called, it never calls the implementor's next() to actually set up the
first top key and value.



On Tue, Jul 15, 2014 at 9:50 AM, Michael Moss michael.m...@gmail.com
wrote:

 I set up debugging and am rethrowing the exception. What's strange is it
 appears that despite the iterator instance being properly set to
 iterator.Counter (my implementation), my breakpoints aren't being hit, only
 in the parent classes (Wrapping Iterator) and (SortedKeyValueIterator).

 I have two rows in the table, when I scan with no iterator:
 2014-07-15 06:46:26,577 [Audit   ] INFO : operation: permitted; user:
 root; action: scan; targetTable: pojo; authorizations: public,; range:
 (-inf,+inf); columns: []; iterators: []; iteratorOptions: {};
 2014-07-15 06:46:26,589 [tserver.TabletServer] DEBUG: ScanSess tid
 10.0.2.15:45073 8 2 entries in 0.01 secs, nbTimes = [7 7 7.00 1]

 When I scan with the iterator (0 entries?):
 2014-07-15 06:45:58,036 [Audit   ] INFO : operation: permitted; user:
 root; action: scan; targetTable: pojo; authorizations: public,; range:
 (-inf,+inf); columns: []; iterators: []; iteratorOptions: {};
 2014-07-15 06:45:58,047 [tserver.TabletServer] DEBUG: ScanSess tid
 10.0.2.15:44992 8 0 entries in 0.01 secs, nbTimes = [6 6 6.00 1]

 No exceptions otherwise. Really appreciate all the ongoing help.

 Best,

 -Mike


 On Mon, Jul 14, 2014 at 6:40 PM, William Slacum 
 wilhelm.von.cl...@accumulo.net wrote:

 Anything in your Tserver log? I think you should just rethrow that
 IOException on your source's next() method, since they're usually not
 recoverable (ie, just make Counter#next throw IOException)


 On Mon, Jul 14, 2014 at 5:48 PM, Josh Elser josh.el...@gmail.com wrote:

 A quick sanity check is to make sure you have data in the table and that
 you can read the data without your iterator (I've thought I had a bug
 because I didn't have proper visibilities more times than I'd like to
 admit).

 Alternatively, you can also enable remote-debugging via Eclipse into the
 TabletServer which might help you understand more of what's going on.

 Lots of articles on how to set this up [1]. In short, add -Xdebug
 -Xrunjdwp:transport=dt_socket,server=y,address=8000 to
 ACCUMULO_TSERVER_OPTS in accumulo-env.sh, restart the tserver, connect
 eclipse to 8000 via the Debug configuration menu, set a breakpoint in your
 init, seek and next methods, and `scan` in the shell.


 [1] http://javarevisited.blogspot.com/2011/02/how-to-setup-remote-debugging-in.html


 On 7/14/14, 5:33 PM, Michael Moss wrote:

 Hmm...Still doesn't return anything from the shell.

 http://pastebin.com/ndRhspf8

 Any thoughts? What's the best way to debug these?


 On Mon, Jul 14, 2014 at 5:14 PM, William Slacum
 wilhelm.von.cl...@accumulo.net mailto:wilhelm.von.cl...@accumulo.net
 

 wrote:

 Ah, an artifact of me just willy nilly writing an iterator :) Any
 reference to `this.source` should be replaced with
 `this.getSource()`. In `next()`, your workaround ends up calling
 `this.hasTop()` as the while loop condition. It will always return
 false because two lines up we set `top_key` to null. We need to make
 sure that the source iterator has a top, because we want to read
 data from it. We'll have to change the loop condition to
 `while(this.getSource().hasTop())`. On line 38 of your code we'll
 need to call `this.getSource().next()` instead of `this.next()`.

 The iterator interface is documented, but there hasn't been a
 definitive go-to for making one. I've been drafting a blog post, but
 since it doesn't exist yet, hopefully the following will suffice.

 The lifetime of an iterator is (usually) as follows:

 (1) A new instance is called via Class.newInstance (so a no-args
 constructor is needed)
 (2) Init is called. This allows users to configure the iterator, set
 its source, and possibly check the environment. We can also call
 `deepCopy` on the source if we want to have multiple sources (we'd
 do this if we wanted to do a merge read out of multiple column
 families within a row).
 (3) seek() is called. This gets our readers to the correct positions
 in the data that are within the scan range the user requested, as
 well as turning column families on or off. The name should be
 reminiscent of seeking to some key on disk.
 (4) hasTop() is called. If true, that means we have data, and the
 iterator has a key/value pair that can be retrieved by calling
 getTopKey() and getTopValue(). If false, we're done because there's
 no data to return.
 (5) next() is called. This will attempt to find a new

Re: Iterating/Aggregating/Combining Complex (Java POJO/Avro) Values

2014-07-14 Thread William Slacum
Hi Mike!

The Combiner interface is only for aggregating keys within a single row.
You can probably get away with implementing your combining logic in a
WrappingIterator that reads across all the rows in a given tablet.

To do some combine/fold/reduce operation, Accumulo needs the input type to
be the same as the output type. The combiner doesn't have a notion of a
present type (as you'd see in something like Algebird's Groups), but you
can use another iterator to perform your transformation.

If you wanted to extract the count field from your Avro object, you could
write a new Iterator that took your Avro object, extracted the desired
field, and returned it as its top value. You can then set this iterator as
the source of the aggregator, either programmatically or by wrapping
the source object passed to the aggregator in its
SortedKeyValueIterator#init call.

This is a bit inefficient as you'd have to serialize to a Value and then
immediately deserialize it in the iterator above it. You could mitigate
this by exposing a method that would get the extracted value before
serializing it.

This kind of counting also requires client side logic to do a final combine
operation, since the aggregations from all the tservers are partial results.

I believe that CountingIterator is not meant for user consumption, but I do
not know if it's related to your issue in trying to use it from the shell.
Iterators set through the shell, in previous versions of Accumulo, have a
requirement to implement OptionDescriber. Many default iterators do not
implement this, and thus can't be set in the shell.
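
As a rough illustration of the extraction idea (the Avro decoding is a placeholder; substitute your real deserialization): the iterator below passes keys through untouched and replaces each Value with just the count as a decimal string, so a SummingCombiner configured for STRING-encoded values can be stacked above it at a higher priority. The per-tablet sums still need the final client-side combine mentioned above.

import java.nio.charset.StandardCharsets;

import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.WrappingIterator;

public class CountExtractingIterator extends WrappingIterator {

  @Override
  public Value getTopValue() {
    long count = extractCount(super.getTopValue().get());
    return new Value(Long.toString(count).getBytes(StandardCharsets.UTF_8));
  }

  // Placeholder: deserialize the Avro object from these bytes and return its count field.
  private long extractCount(byte[] serializedAvro) {
    return 0L;
  }
}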



On Mon, Jul 14, 2014 at 2:44 PM, Michael Moss michael.m...@gmail.com
wrote:

 Hi, All.

 I'm curious what the best practices are around persisting complex
 types/data in Accumulo (and aggregating on fields within them).

 Let's say I have (row, column family, column qualifier, value):
 A foo  MyHugeAvroObject(count=2)
 A foo  MyHugeAvroObject(count=3)

 Let's say MyHugeAvroObject has a field Integer count with the values
 above.

 What is the best way to aggregate on row, column family, column qualifier
 by count? In my above example:
 A foo  5

 The TypedValueCombiner.typedReduce method can deserialize any V, in my
 case MyHugeAvroObject, but it needs to return a value of type V. What are
 the best practices for deeply nested/complex objects? It's not always
 straightforward to map a complex Avro type into Row - Column Family -
 Column Qualifier.

 Rather than using a TypedCombiner, I looked into using an Aggregator
 (which appears deprecated as of 1.4), which appears to let me return
 arbitrary values, but despite running setiter, my aggregator doesn't seem
 to do anything.

 I also tried looking at implementing a WrappingIterator, which also
 appears to allow me to return arbitrary values (such as Accumulo's
 CountingIterator), but I get cryptic errors when trying to setiter, I'm on
 Accumulo 1.6:

 root@dev kyt setiter -t kyt -scan -p 10 -n countingIter -class
 org.apache.accumulo.core.iterators.system.CountingIterator
 2014-07-14 11:12:55,623 [shell.Shell] ERROR:
 java.lang.IllegalArgumentException:
 org.apache.accumulo.core.iterators.system.CountingIterator

 This is odd because other included implementations of WrappingIterator
 seem to work (perhaps the implementation of CountingIterator is dated):
 root@dev kyt setiter -t kyt -scan -p 10 -n deletingIterator -class
 org.apache.accumulo.core.iterators.system.DeletingIterator
 The iterator class does not implement OptionDescriber. Consider this for
 better iterator configuration using this setiter command.
 Name for iterator (enter to skip):

 All in all, how can I aggregate simple values, like counters from rows
 with complex Avro objects as Values without having to add aggregation
 fields to these Value objects?

 Thanks!

 -Mike



Re: Iterating/Aggregating/Combining Complex (Java POJO/Avro) Values

2014-07-14 Thread William Slacum
For a bit of pseudocode, I'd probably make a class that did something akin
to: http://pastebin.com/pKqAeeCR

I wrote that up real quick in a text editor-- it won't compile or anything,
but should point you in the right direction.


On Mon, Jul 14, 2014 at 3:44 PM, William Slacum 
wilhelm.von.cl...@accumulo.net wrote:

 Hi Mike!

 The Combiner interface is only for aggregating keys within a single row.
 You can probably get away with implementing your combining logic in a
 WrappingIterator that reads across all the rows in a given tablet.

 To do some combine/fold/reduce operation, Accumulo needs the input type to
 be the same as the output type. The combiner doesn't have a notion of a
 present type (as you'd see in something like Algebird's Groups), but you
 can use another iterator to perform your transformation.

 If you wanted to extract the count field from your Avro object, you
 could write a new Iterator that took your Avro object, extracted the
 desired field, and returned it as its top value. You can then set this
 iterator as the source of the aggregator, either programmatically or by
 wrapping the source object passed to the aggregator in its
 SortedKeyValueIterator#init call.

 This is a bit inefficient as you'd have to serialize to a Value and then
 immediately deserialize it in the iterator above it. You could mitigate
 this by exposing a method that would get the extracted value before
 serializing it.

 This kind of counting also requires client side logic to do a final
 combine operation, since the aggregations from all the tservers are partial
 results.

 I believe that CountingIterator is not meant for user consumption, but I
 do not know if it's related to your issue in trying to use it from the
 shell. Iterators set through the shell, in previous versions of Accumulo,
 have a requirement to implement OptionDescriber. Many default iterators do
 not implement this, and thus can't be set in the shell.



 On Mon, Jul 14, 2014 at 2:44 PM, Michael Moss michael.m...@gmail.com
 wrote:

 Hi, All.

 I'm curious what the best practices are around persisting complex
 types/data in Accumulo (and aggregating on fields within them).

 Let's say I have (row, column family, column qualifier, value):
 A foo  MyHugeAvroObject(count=2)
 A foo  MyHugeAvroObject(count=3)

 Let's say MyHugeAvroObject has a field Integer count with the values
 above.

 What is the best way to aggregate on row, column family, column qualifier
 by count? In my above example:
 A foo  5

 The TypedValueCombiner.typedReduce method can deserialize any V, in my
 case MyHugeAvroObject, but it needs to return a value of type V. What are
 the best practices for deeply nested/complex objects? It's not always
 straightforward to map a complex Avro type into Row - Column Family -
 Column Qualifier.

 Rather than using a TypedCombiner, I looked into using an Aggregator
 (which appears deprecated as of 1.4), which appears to let me return
 arbitrary values, but despite running setiter, my aggregator doesn't seem
 to do anything.

 I also tried looking at implementing a WrappingIterator, which also
 appears to allow me to return arbitrary values (such as Accumulo's
 CountingIterator), but I get cryptic errors when trying to setiter, I'm on
 Accumulo 1.6:

 root@dev kyt setiter -t kyt -scan -p 10 -n countingIter -class
 org.apache.accumulo.core.iterators.system.CountingIterator
 2014-07-14 11:12:55,623 [shell.Shell] ERROR:
 java.lang.IllegalArgumentException:
 org.apache.accumulo.core.iterators.system.CountingIterator

 This is odd because other included implementations of WrappingIterator
 seem to work (perhaps the implementation of CountingIterator is dated):
 root@dev kyt setiter -t kyt -scan -p 10 -n deletingIterator -class
 org.apache.accumulo.core.iterators.system.DeletingIterator
 The iterator class does not implement OptionDescriber. Consider this for
 better iterator configuration using this setiter command.
 Name for iterator (enter to skip):

 All in all, how can I aggregate simple values, like counters from rows
 with complex Avro objects as Values without having to add aggregation
 fields to these Value objects?

 Thanks!

 -Mike





Re: Iterating/Aggregating/Combining Complex (Java POJO/Avro) Values

2014-07-14 Thread William Slacum
Ah, an artifact of me just willy nilly writing an iterator :) Any reference
to `this.source` should be replaced with `this.getSource()`. In `next()`,
your workaround ends up calling `this.hasTop()` as the while loop
condition. It will always return false because two lines up we set
`top_key` to null. We need to make sure that the source iterator has a top,
because we want to read data from it. We'll have to change the loop
condition to `while(this.getSource().hasTop())`. On line 38 of your code
we'll need to call `this.getSource().next()` instead of `this.next()`.

The iterator interface is documented, but there hasn't been a definitive
go-to for making one. I've been drafting a blog post, but since it doesn't
exist yet, hopefully the following will suffice.

The lifetime of an iterator is (usually) as follows:

(1) A new instance is called via Class.newInstance (so a no-args
constructor is needed)
(2) Init is called. This allows users to configure the iterator, set its
source, and possibly check the environment. We can also call `deepCopy` on
the source if we want to have multiple sources (we'd do this if we wanted
to do a merge read out of multiple column families within a row).
(3) seek() is called. This gets our readers to the correct positions in the
data that are within the scan range the user requested, as well as turning
column families on or off. The name should be reminiscent of seeking to some
key on disk.
(4) hasTop() is called. If true, that means we have data, and the iterator
has a key/value pair that can be retrieved by calling getTopKey() and
getTopValue(). If false, we're done because there's no data to return.
(5) next() is called. This will attempt to find a new top key and value. We go
back to (4) to see if next was successful in finding a new top key/value
and will repeat until the client is satisfied or hasTop() returns false.

You can kind of make a state machine out of those steps where we loop
between (4) and (5) until there's no data. There are more advanced
workflows where next() can be reading from multiple sources, as well as
seeking them to different positions in the tablet.
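
Tying those steps together, here is a skeleton (untested) of the usual shape: do the work in seek()/next(), cache the result in fields, and keep hasTop()/getTopKey()/getTopValue() free of side effects. The findTop() body is where a real implementation would read, transform, or aggregate source entries.

import java.io.IOException;
import java.util.Collection;

import org.apache.accumulo.core.data.ByteSequence;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.WrappingIterator;

public class SkeletonIterator extends WrappingIterator {

  private Key topKey;
  private Value topValue;

  @Override
  public void seek(Range range, Collection<ByteSequence> columnFamilies, boolean inclusive)
      throws IOException {
    super.seek(range, columnFamilies, inclusive); // step (3): position the source
    findTop();                                    // prime the first top key/value
  }

  @Override
  public void next() throws IOException {         // step (5): advance to the next result
    findTop();
  }

  @Override
  public boolean hasTop() {                       // step (4): is there data to return?
    return topKey != null;
  }

  @Override
  public Key getTopKey() {
    return topKey;
  }

  @Override
  public Value getTopValue() {
    return topValue;
  }

  // Builds the next key/value pair exactly once per call; a real iterator might combine or
  // skip several source entries here instead of copying one through.
  private void findTop() throws IOException {
    topKey = null;
    topValue = null;
    if (getSource().hasTop()) {
      topKey = new Key(getSource().getTopKey());
      topValue = new Value(getSource().getTopValue().get());
      getSource().next();
    }
  }
}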


On Mon, Jul 14, 2014 at 4:51 PM, Michael Moss michael.m...@gmail.com
wrote:

 Thanks, William. I was just hitting you up for an example :)

 I adapted your pseudocode (http://pastebin.com/ufPJq0g3), but noticed
 that this.source in your example didn't have visibility. Did I worked
 around it correctly?

 When I add my iterator to my table and run scan from the shell, it returns
 nothing - what should I expect here? In general I've found the iterator
 interface pretty confusing and haven't spent the time wrapping my head
 around it yet. Any documentation or examples (beyond what I could find on
 the site or in the code) appreciated!

 root@dev table pojo
 root@dev pojo listiter -scan -t pojo
 -
 -Iterator counter, scan scope options:
 -iteratorPriority = 10
 -iteratorClassName = iterators.Counter
 -
 root@dev pojo scan
 root@dev pojo

 Best,

 -Mike




 On Mon, Jul 14, 2014 at 4:07 PM, William Slacum 
 wilhelm.von.cl...@accumulo.net wrote:

 For a bit of pseudocode, I'd probably make a class that did something
 akin to: http://pastebin.com/pKqAeeCR

 I wrote that up real quick in a text editor-- it won't compile or
 anything, but should point you in the right direction.


 On Mon, Jul 14, 2014 at 3:44 PM, William Slacum 
 wilhelm.von.cl...@accumulo.net wrote:

 Hi Mike!

 The Combiner interface is only for aggregating keys within a single row.
 You can probably get away with implementing your combining logic in a
 WrappingIterator that reads across all the rows in a given tablet.

 To do some combine/fold/reduce operation, Accumulo needs the input type
 to be the same as the output type. The combiner doesn't have a notion of a
 present type (as you'd see in something like Algebird's Groups), but you
 can use another iterator to perform your transformation.

 If you wanted to extract the count field from your Avro object, you
 could write a new Iterator that took your Avro object, extracted the
 desired field, and returned it as its top value. You can then set this
 iterator as the source of the aggregator, either programmatically or by
 wrapping the source object passed to the aggregator in its
 SortedKeyValueIterator#init call.

 This is a bit inefficient as you'd have to serialize to a Value and then
 immediately deserialize it in the iterator above it. You could mitigate
 this by exposing a method that would get the extracted value before
 serializing it.

 This kind of counting also requires client side logic to do a final
 combine operation, since the aggregations from all the tservers are partial
 results.

 I believe that CountingIterator is not meant for user consumption, but I
 do not know if it's related to your issue in trying to use it from the
 shell. Iterators set through the shell, in previous versions of Accumulo,
 have a requirement to implement

Re: Iterating/Aggregating/Combining Complex (Java POJO/Avro) Values

2014-07-14 Thread William Slacum
Anything in your Tserver log? I think you should just rethrow that
IOException on your source's next() method, since they're usually not
recoverable (ie, just make Counter#next throw IOException)


On Mon, Jul 14, 2014 at 5:48 PM, Josh Elser josh.el...@gmail.com wrote:

 A quick sanity check is to make sure you have data in the table and that
 you can read the data without your iterator (I've thought I had a bug
 because I didn't have proper visibilities more times than I'd like to
 admit).

 Alternatively, you can also enable remote-debugging via Eclipse into the
 TabletServer which might help you understand more of what's going on.

 Lots of articles on how to set this up [1]. In short, add -Xdebug
 -Xrunjdwp:transport=dt_socket,server=y,address=8000 to
 ACCUMULO_TSERVER_OPTS in accumulo-env.sh, restart the tserver, connect
 eclipse to 8000 via the Debug configuration menu, set a breakpoint in your
 init, seek and next methods, and `scan` in the shell.


 [1] http://javarevisited.blogspot.com/2011/02/how-to-setup-
 remote-debugging-in.html


 On 7/14/14, 5:33 PM, Michael Moss wrote:

 Hmm...Still doesn't return anything from the shell.

 http://pastebin.com/ndRhspf8

 Any thoughts? What's the best way to debug these?


 On Mon, Jul 14, 2014 at 5:14 PM, William Slacum
 wilhelm.von.cl...@accumulo.net mailto:wilhelm.von.cl...@accumulo.net

 wrote:

 Ah, an artifact of me just willy nilly writing an iterator :) Any
 reference to `this.source` should be replaced with
 `this.getSource()`. In `next()`, your workaround ends up calling
 `this.hasTop()` as the while loop condition. It will always return
 false because two lines up we set `top_key` to null. We need to make
 sure that the source iterator has a top, because we want to read
 data from it. We'll have to change the loop condition to
 `while(this.getSource().hasTop())`. On line 38 of your code we'll
 need to call `this.getSource().next()` instead of `this.next()`.
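 For reference, a rough, compile-ready sketch of that corrected pattern (not the pastebin code verbatim; the synthetic "count" key and doing the counting in seek() rather than next() are my own choices for the example):

 import java.io.IOException;
 import java.util.Collection;

 import org.apache.accumulo.core.data.ByteSequence;
 import org.apache.accumulo.core.data.Key;
 import org.apache.accumulo.core.data.Range;
 import org.apache.accumulo.core.data.Value;
 import org.apache.accumulo.core.iterators.WrappingIterator;
 import org.apache.hadoop.io.Text;

 public class Counter extends WrappingIterator {
   private Key topKey;
   private Value topValue;

   @Override
   public void seek(Range range, Collection<ByteSequence> columnFamilies,
       boolean inclusive) throws IOException {
     super.seek(range, columnFamilies, inclusive);  // positions the wrapped source
     long count = 0;
     while (getSource().hasTop()) {  // getSource(), since 'source' is not visible
       count++;
       getSource().next();           // advance the source, not this iterator
     }
     topKey = new Key(new Text("count"));
     topValue = new Value(Long.toString(count).getBytes());
   }

   @Override
   public boolean hasTop() { return topKey != null; }

   @Override
   public Key getTopKey() { return topKey; }

   @Override
   public Value getTopValue() { return topValue; }

   @Override
   public void next() { topKey = null; topValue = null; }  // only one synthetic entry to return
 }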

 The iterator interface is documented, but there hasn't been a
 definitive go-to for making one. I've been drafting a blog post, but
 since it doesn't exist yet, hopefully the following will suffice.

 The lifetime of an iterator is (usually) as follows:

 (1) A new instance is called via Class.newInstance (so a no-args
 constructor is needed)
 (2) Init is called. This allows users to configure the iterator, set
 its source, and possibly check the environment. We can also call
 `deepCopy` on the source if we want to have multiple sources (we'd
 do this if we wanted to do a merge read out of multiple column
 families within a row).
 (3) seek() is called. This gets our readers to the correct positions
 in the data that are within the scan range the user requested, as
 well as turning column families on or off. The name should be
 reminiscent of seeking to some key on disk.
 (4) hasTop() is called. If true, that means we have data, and the
 iterator has a key/value pair that can be retrieved by calling
 getTopKey() and getTopValue(). If fasle, we're done because there's
 no data to return.
 (5) next() is called. This will attempt to find a new top key and
 value. We go back to (4) to see if next was successful in finding a
 new top key/value and will repeat until the client is satisfied or
 hasTop() returns false.

 You can kind of make a state machine out of those steps where we
 loop between (4) and (5) until there's no data. There are more
 advanced workflows where next() can be reading from multiple
 sources, as well as seeking them to different positions in the tablet.


 On Mon, Jul 14, 2014 at 4:51 PM, Michael Moss
 michael.m...@gmail.com mailto:michael.m...@gmail.com wrote:

 Thanks, William. I was just hitting you up for an example :)

 I adapted your pseudocode (http://pastebin.com/ufPJq0g3), but
 noticed that this.source in your example didn't have
 visibility. Did I work around it correctly?

 When I add my iterator to my table and run scan from the shell,
 it returns nothing - what should I expect here? In general I've
 found the iterator interface pretty confusing and haven't spent
 the time wrapping my head around it yet. Any documentation or
 examples (beyond what I could find on the site or in the code)
 appreciated!

 root@dev table pojo
 root@dev pojo listiter -scan -t pojo
 -
 -    Iterator counter, scan scope options:
 -        iteratorPriority = 10
 -        iteratorClassName = iterators.Counter
 -
 root@dev pojo scan
 root@dev pojo


 Best,

 -Mike




 On Mon, Jul 14, 2014 at 4:07 PM, William Slacum
 wilhelm.von.cl...@accumulo.net
 mailto:wilhelm.von.cl...@accumulo.net wrote:

 For a bit of psuedocode, I'd

Re: Forgot SECRET, how to delete zookeeper nodes?

2014-07-13 Thread William Slacum
If the zookeeper data is gone, your best bet is try and identify which
directories under /accumulo/tables points to which tables you had. You can
then bulk import the files into a new instance's tables.
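As a rough sketch of that approach (all paths and the table name below are examples, and note that importDirectory moves the RFiles it loads, so copy them to a staging directory first rather than pointing it at the old instance's files directly):

import org.apache.accumulo.core.client.Connector;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecoverTables {
  // list the old instance's table ids so you can figure out which is which
  public static void listOldTableDirs(Configuration conf) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    for (FileStatus tableDir : fs.listStatus(new Path("/accumulo/tables"))) {
      System.out.println("old table id: " + tableDir.getPath().getName());
    }
  }

  // once the RFiles for one old table are copied into stagingDir, bulk load them
  public static void bulkLoad(Connector conn, String stagingDir, String failDir)
      throws Exception {
    conn.tableOperations().importDirectory("recovered_table", stagingDir, failDir, false);
  }
}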


On Sun, Jul 13, 2014 at 11:54 PM, Vicky Kak vicky@gmail.com wrote:

 I am not sure if the tables could be recovered seamlessly, the tables are
 stored in undelying hdfs.
 I was thinking of using
 http://accumulo.apache.org/1.6/examples/bulkIngest.html to recover the
 tables, the better would be if we could update the zookeeper data pointing
 to the existing hdfs table data.
 I don't have more information about it as of now, we need someone else to
 help us here.


 On Mon, Jul 14, 2014 at 9:06 AM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 It's too deleted... so the only option I have is to delete the zookeeper
 nodes and reinitialize accumulo.

 You're right, I deleted the zk nodes and now Accumulo complains nonode
 error.

 Can I recover the tables for a new instance?

 Jianshi


 On Mon, Jul 14, 2014 at 11:28 AM, Vicky Kak vicky@gmail.com wrote:

 Can't you get the secret from the corresponding accumulo-site.xml or
 this is too deleted?

 Deletion from the zookeeper should be done using the rmr /accumulo
 command, you will have to use zkCli.sh to use zookeeper client. I have been
 doing this sometime back, have not used it recently.
  I would not recommend deleting the information in zookeeper unless
  there is no other option; you may lose the data IMO.



 On Mon, Jul 14, 2014 at 8:40 AM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Clusters got updated and user home files lost... I tried to reinstall
 accumulo but I forgot the secret I put before.

 So how can I delete /accumulo in Zookeeper?

 Or is there a way to rename instance_id?

 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/





 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/





Re: Mapreduce output format killing tablet servers

2014-06-25 Thread William Slacum
I had a similar thread going on and am currently rummaging through the
batch writer code (as well as pontificating on how the tablet server
handles multiple write clients for the tablet).

What is your ingest skew like? Is it uniform? How quickly do splits occur?
I've seen, at relatively low scale, doing live ingest become problematic.

Have you looked into using file output? One of our committers, Cory, has a
library that can handle writing to multiple tables/files. You can peek
here: https://github.com/calrissian/accumulo-recipes (doing a `find . -name
'Group*'` will give you the classes you need). I had to do some massaging
to get them to work properly and am happy to share what I had to do if this
becomes a route you're interested in.


On Wed, Jun 25, 2014 at 2:10 PM, Sean Busbey bus...@cloudera.com wrote:

 What version of Accumulo?

 What version of Hadoop?

 What does your server memory and per-role allocation look like?

 Can you paste the tserver debug log?



 On Wed, Jun 25, 2014 at 1:01 PM, Jacob Rust jr...@clearedgeit.com wrote:

 I am trying to create an inverted text index for a table using accumulo
 input/output format in a java mapreduce program.  When the job reaches the
 reduce phase and creates the table / tries to write to it the tablet
 servers begin to die.

 Now when I do a start-all.sh the tablet servers start for about a minute
 and then die again. Any idea as to why the mapreduce job is killing the
 tablet servers and/or how to bring the tablet servers back up without
 failing?

 This is on a 12 node cluster with low quality hardware.

 The java code I am running is here http://pastebin.com/ti7Qz19m

  The log files on each tablet server only display the startup
 information, no errors. The log files on the master server show these
 errors http://pastebin.com/LymiTfB7




 --
 Jacob Rust
 Software Intern




 --
 Sean



Re: BatchWriter woes

2014-06-24 Thread William Slacum
I can try to confirm that, but the monitor isn't showing any failures
during ingest. By half dead do you mean the master thinks it is alive,
but in actuality it isn't?


On Fri, Jun 20, 2014 at 10:32 AM, Keith Turner ke...@deenlo.com wrote:




 On Thu, Jun 19, 2014 at 11:57 PM, William Slacum 
 wilhelm.von.cl...@accumulo.net wrote:

 I'm finding some ingest jobs I have running in a bit of a sticky sitch:

 I have a MapReduce job that reads a table, transforms the entries,
 creates an inverted index, and writes out mutations to two tables. The
 cluster size is in the tens of nodes, and I usually have 32 mappers running.

 The batch writer configs are:
 - memory buffer: 128MB
 - max latency: 5 minutes
 - threads: 32
 - timeout: default Long.MAX_VALUE

 I know we're on Accumulo 1.5.0 and I believe using CDH 4.5.0, Zookeeper
 3.3.6.

 I'm noticing an ingest pattern of usually ok rates for the cluster (in
 the 100K+ entries per second), but after some time they start to drop off
 to ~10K E/s. Sometimes this happens when a round of compactions kicks off
 (usually major, not minor), sometimes not. Eventually, the mappers will
 timeout. We have them set to timeout after 10 minutes of not reporting
 status.

 I added a bit of probing/profiling, and noticed that there's an
 exponential growth in per entry processing time in the mapper. They're of
 pretty uniform size, so there should not be much variance in the times. The
 times go from single milliseconds, to hundreds of milliseconds, to seconds,
 to minutes.

 If I jstack a mapper, it's sitting in TabletServerBatchWriter#waitRTE. It
 should only enter that method if the batch writer has (a) too much data
 buffered or (b) the user requested a flush. I'm inferring that (a) is the
 case, because there is no explicit TabletServerBatchWriter#flush() call.

 We did notice that there was a send thread trying to send to a dead
 server. We can't ssh to the IP it was trying to send to, and have verified
 manually that it's not listed in the current tablet servers. We did notice
 that the master log is reporting that a recovery on a WAL associated with
 that IP is under way. Looking back, the master had been reporting that
 message for about a day and a half. The message was similar to the one
 described in https://issues.apache.org/jira/browse/ACCUMULO-1364 . I do
 not know the significance of this as it relates to my jobs.


 Do you think it's trying to write to a half dead server?  Does that server
 still have locations in the metadata table?



 I did some digging in TabletServerBatchWriter, and the only thing I can
 kind of see happening is that if SendTask#sendMutationsToTabletServer
 receives a TException, it rethrows it as an IOException, then SendTask#send
 will catch that exception and add the mutations to the failures collection.
 Since the timeout is Long.MAX_VALUE, I think it's possible this loop can
 continue forever or until some outside force kills the entire process.

 Does this seem coherent? Is there anything else that could cause this?

 I'm on the track of converting the code over to using bulk ingest, but I
 think there's an issue with a vanilla BatchWriter that I would just be
 getting around instead of actually fixing.

 Also, I'd love to provide logs, but there's a high amount of friction in
 getting them, so I won't be able to deliver on that front.





Re: Meaning of < in METADATA table [SEC=UNOFFICIAL]

2014-06-24 Thread William Slacum
'<' is a byte used for doing an ordering on rows that share the same prefix.

There was a presentation floating around on the specifics of the metadata
table at one point. I believe that helps tablet information sort before the
last tablet, which is suffixed with '~', to force it to sort after the
other tablets. We'll probably get an Eric or Keith email soon laying down
the law, but that's what I remember.


On Tue, Jun 24, 2014 at 9:57 PM, Dickson, Matt MR 
matt.dick...@defence.gov.au wrote:

  *UNOFFICIAL*
 When looking up rfile references in the metadata table we normally see
 *tableid*;range for the rowid.   I've noticed some rowids are
 *tableid*< eg. 3p<

 Is this because the table is small and hasn't been split or some other
 reason?



Re: How does Accumulo compare to HBase

2014-06-23 Thread William Slacum
I think first and foremost, how has writing your application been? Is it
something you can easily onboard other people for? Does it seem stable
enough? If you can answer those questions positively, I think you have a
winning situation.

The big three Hadoop vendors (Cloudera, Hortonworks and MapR) all provide
some level of support for Accumulo, so it has the pedigree of other members
of the Hadoop ecosystem.

Regarding the performance, I think Mike's presentation needs some context.
He can definitely provide more context than the rest of us (and possibly
Sean or Bill), but I think one thing he was driving home is that out of
the box, Accumulo is configured to run on someone's laptop. There are
adjustments to be made when running at any scale greater than a dev machine
and they may not be documented clearly.


On Mon, Jun 23, 2014 at 8:16 PM, Tejinder S Luthra tslut...@us.ibm.com
wrote:

 Mike did a pretty good presentation on performance comparison between
 Accumulo / HBase. Again not official IMO but is pretty detailed in the
 approach take and apples-apples comparison
 http://www.slideshare.net/AccumuloSummit/10-30-drob




 From: Jeremy Kepner kep...@ll.mit.edu
 To: user@accumulo.apache.org
 Date: 06/23/2014 07:42 PM
 Subject: Re: How does Accumulo compare to HBase
 --



 Performance is probably the largest difference between Accumulo and HBase.

 Accumulo can ingest/scan at a rate of 800K entries/sec/node.
 This performance scales well into the hundreds of nodes to deliver
 100M+ entries/sec.

 There are no recent HBase benchmarks and none in the peer-reviewed
 literature.
 Old data suggests that HBase performance is ~1% of Accumulo performance.

 In short, one can often replace a 20+ node database with
 a single node Accumulo database.

 On Tue, Jun 24, 2014 at 01:55:54AM +0800, Jianshi Huang wrote:
  Er... basically I need to explain to my manager why choosing Accumulo,
  instead of HBase.
 
  So what are the pros and cons of Accumulo vs. HBase? (btw HBase 0.98 also
  got cell-level security, modeled after Accumulo)
 
  --
  Jianshi Huang
 
  LinkedIn: jianshi
  Twitter: @jshuang
  Github  Blog: http://huangjs.github.com/





BatchWriter woes

2014-06-19 Thread William Slacum
I'm finding some ingest jobs I have running in a bit of a sticky sitch:

I have a MapReduce job that reads a table, transforms the entries, creates
an inverted index, and writes out mutations to two tables. The cluster size
is in the tens of nodes, and I usually have 32 mappers running.

The batch writer configs are:
- memory buffer: 128MB
- max latency: 5 minutes
- threads: 32
- timeout: default Long.MAX_VALUE

I know we're on Accumulo 1.5.0 and I believe using CDH 4.5.0, Zookeeper
3.3.6.

I'm noticing an ingest pattern of usually ok rates for the cluster (in the
100K+ entries per second), but after some time they start to drop off to
~10K E/s. Sometimes this happens when a round of compactions kicks off
(usually major, not minor), sometimes not. Eventually, the mappers will
timeout. We have them set to timeout after 10 minutes of not reporting
status.

I added a bit of probing/profiling, and noticed that there's an exponential
growth in per entry processing time in the mapper. They're of pretty
uniform size, so there should not be much variance in the times. The times
go from single milliseconds, to hundreds of milliseconds, to seconds, to
minutes.

If I jstack a mapper, it's sitting in TabletServerBatchWriter#waitRTE. It
should only enter that method if the batch writer has (a) too much data
buffered or (b) the user requested a flush. I'm inferring that (a) is the
case, because there is no explicit TabletServerBatchWriter#flush() call.

We did notice that there was a send thread trying to send to a dead server.
We can't ssh to the IP it was trying to send to, and have verified manually
that it's not listed in the current tablet servers. We did notice that the
master log is reporting that a recovery on a WAL associated with that IP is
under way. Looking back, the master had been reporting that message for
about a day and a half. The message was similar to the one described in
https://issues.apache.org/jira/browse/ACCUMULO-1364 . I do not know the
significance of this as it relates to my jobs.

I did some digging in TabletServerBatchWriter, and the only thing I can
kind of see happening is that if SendTask#sendMutationsToTabletServer
receives a TException, it rethrows it as an IOException, then SendTask#send
will catch that exception and add the mutations to the failures collection.
Since the timeout is Long.MAX_VALUE, I think it's possible this loop can
continue forever or until some outside force kills the entire process.

Does this seem coherent? Is there anything else that could cause this?

I'm on the track of converting the code over to using bulk ingest, but I
think there's an issue with a vanilla BatchWriter that I would just be
getting around instead of actually fixing.

Also, I'd love to provide logs, but there's a high amount of friction in
getting them, so I won't be able to deliver on that front.


Re: [DISCUSS] Should we support upgrading 1.4 - 1.6 w/o going through 1.5?

2014-06-16 Thread William Slacum
How much of this is a standalone utility? I think a magic button approach
would be good for this case.


On Mon, Jun 16, 2014 at 5:24 PM, Sean Busbey bus...@cloudera.com wrote:

 In an effort to get more users off of our now unsupported 1.4 release,
 should we support upgrading directly to 1.6 without going through a 1.5
 upgrade?

 More directly for those on user@: would you be more likely to upgrade off
 of 1.4 if you could do so directly to 1.6?

 We have this working locally at Cloudera as a part of our CDH integration
 (we shipped 1.4 and we're planning to ship 1.6 next).

 We can get into implementation details on a jira if there's positive
 consensus, but the changes weren't very complicated. They're mostly

 * forward porting and consolidating some upgrade code
 * additions to the README for instructions

 Personally, I can see the both sides of the argument. On the plus side,
 anything to get more users off of 1.4 is a good thing. On the negative
 side, it means we have the 1.4 related upgrade code sitting in a supported
 code branch longer.

 Thoughts?

 --
 Sean



Re: Unable to load Iterator with setscaniter and setshelliter

2014-06-15 Thread William Slacum
Wouldn't the iterator have to be on the classpath for the JVM that launches
the shell command?


On Sun, Jun 15, 2014 at 9:02 AM, Vicky Kak vicky@gmail.com wrote:


 setiter -n MyIterator -p 10 -scan -minc -majc -class
 com.codebits.d4m.iterator.MyIterator
 scan

 The above line fails for me with the similar kind of error i.e
 ClassNotFoundException

 root@accumulo atest setiter -n MyIterator -p 10 -scan -minc -majc -class
 org.dallaybatta.MyIterator
 2014-06-15 18:20:18,061 [shell.Shell] ERROR:
 org.apache.accumulo.shell.ShellCommandException: Command could not be
 initialized (Unable to load org.dallaybatta.MyIterator; class not found.)


 My hdfs contains the corresponding jars but it yet fails.
 After digging a code for a while I figured that the error is coming from
 org.apache.accumulo.shell.commands.SetIterCommand::execute

 try {
   clazz = classloader.loadClass(className).asSubclass(SortedKeyValueIterator.class);
   untypedInstance = clazz.newInstance();
 } catch (ClassNotFoundException e) {
   StringBuilder msg = new StringBuilder("Unable to load ").append(className);
   if (className.indexOf('.') < 0) {
     msg.append("; did you use a fully qualified package name?");
   } else {
     msg.append("; class not found.");
   }
   throw new ShellCommandException(ErrorCode.INITIALIZATION_FAILURE, msg.toString());
 } catch (InstantiationException e) {

 Typically the ClassNotFoundException can also appear if the dependent
 classes are not present; here SortedKeyValueIterator could be the use case.
 I moved the accumulo core to the classpath folder in the hdfs but still
 could not get it working. Maybe some other dependent classes are required;
 this needs more time to analyse in this direction.

 The document states the following so it should ideally work,
 although the VFS classloader allows for classpath manipulation using a
 variety of schemes including URLs and HDFS URIs.

 I find it strange that your test and my tests results differ as you are
 able to set Iterator for a single row but I am not.

 Thanks,
 Vicky




 On Sun, Jun 15, 2014 at 8:25 AM, David Medinets david.medin...@gmail.com
 wrote:

 I'm sure that I'm overlooking something simple. I can load my iterator
 using setiter but not with setscaniter or setshelliter.

 Here is my do-nothing iterator:

 public class MyIterator extends WrappingIterator implements
 OptionDescriber {

 @Override
 public IteratorOptions describeOptions() {
 String name = "dummy";
 String description = "Dummy Description";
 Map<String, String> namedOptions = new HashMap<String, String>();
 List<String> unnamedOptionDescriptions = null;
 return new IteratorOptions(name, description, namedOptions,
 unnamedOptionDescriptions);
 }

 @Override
 public boolean validateOptions(Map<String, String> options) {
 return true;
 }

 }

 I copy the jar file out to HDFS:

 hadoop fs -mkdir /user/vagrant/d4m/classpath
 hadoop fs -put /vagrant/schema/target/d4m_schema-0.0.1-SNAPSHOT.jar
 /user/vagrant/classpath

 I set the table-specific classpath context:

 createtable atest
 table atest
 insert row cf cq value
 config -s
 general.vfs.context.classpath.d4m=hdfs://affy-master:9000/user/vagrant/classpath
 config -t atest -s table.classpath.context=d4m

 Now I can configure the iterator and scan over the single row without a
 problem:

 setiter -n MyIterator -p 10 -scan -minc -majc -class
 com.codebits.d4m.iterator.MyIterator
 scan
 deleteiter -n MyIterator -scan -minc -majc

 However, the setscaniter commands fails:

 root@instance atest setscaniter -n MyIterator -p 10 -class
 com.codebits.d4m.iterator.MyIterator
 2014-06-15 02:54:14,098 [shell.Shell] WARN : Deprecated, use setshelliter
 Dummy Description
 2014-06-15 02:54:14,126 [shell.Shell] ERROR:
 org.apache.accumulo.core.util.shell.ShellCommandException: Command could
 not be initialized (Unable to load com.codebits.d4m.iterator.MyIterator)

 As does the setshelliter:

 root@instance atest setshelliter -pn d4m -n MyIterator -p 10 -class
 com.codebits.d4m.iterator.MyIterator
 Dummy Description
 2014-06-15 02:55:07,025 [shell.Shell] ERROR:
 org.apache.accumulo.core.util.shell.ShellCommandException: Command could
 not be initialized (Unable to load com.codebits.d4m.iterator.MyIterator)

 I don't see any messages in the log files.

 Any suggestions to resolve these issues?





Re: Unable to load Iterator with setscaniter and setshelliter

2014-06-15 Thread William Slacum
@Josh-- it seems dangerous to have the shell start loading classes from a
location that was specified for a table only. It could make sense, when in
a table context, to have the shell also look in any configured VFS
locations and then drop the class loader once out of that context.


On Sun, Jun 15, 2014 at 11:03 AM, dlmarion dlmar...@comcast.net wrote:

 general.vfs.classpaths is not set?
 I saw in an earlier email that you are setting the context classloader.
 There are some places where the context manager was not being used when it
 should. This could be one of those cases. I would submit a jira ticket with
 all the information and I will try to look at it in the next day or so. Be
 sure to include classpath settings, list of files in HDFS for those
 locations, and the commands that are failing.


 Sent via the Samsung GALAXY S®4, an ATT 4G LTE smartphone


  Original message 
 From: David Medinets
 Date:06/15/2014 10:53 AM (GMT-05:00)
 To: accumulo-user
 Subject: Re: Unable to load Iterator with setscaniter and setshelliter

 The classpath settings in accumulo-site.xml are the following (which I
 think is the default):

 property
   namegeneral.classpaths/name
   value
 $ACCUMULO_HOME/server/target/classes/,
 $ACCUMULO_HOME/core/target/classes/,
 $ACCUMULO_HOME/start/target/classes/,
 $ACCUMULO_HOME/examples/target/classes/,
 $ACCUMULO_HOME/lib/[^.].$ACCUMULO_VERSION.jar,
 $ACCUMULO_HOME/lib/[^.].*.jar,
 $ZOOKEEPER_HOME/zookeeper[^.].*.jar,
 $HADOOP_HOME/conf,
 $HADOOP_HOME/[^.].*.jar,
 $HADOOP_HOME/lib/[^.].*.jar,
   /value
   descriptionClasspaths that accumulo checks for updates and class
 files.
   When using the Security Manager, please remove the
 .../target/classes/ values.
   /description
 /property


 On Sun, Jun 15, 2014 at 10:49 AM, dlmarion dlmar...@comcast.net wrote:

 What does your classpath settings look like in accumulo-site.xml. I
 recently made some fixes in 1.6.1-Snapshot where the context classloader
 was not being used in all cases. I don't think this case was affected though.


 Sent via the Samsung GALAXY S®4, an ATT 4G LTE smartphone


  Original message 
 From: Josh Elser
 Date:06/15/2014 10:31 AM (GMT-05:00)
 To: user@accumulo.apache.org
 Subject: Re: Unable to load Iterator with setscaniter and setshelliter

 Naw, the commons-vfs loader should be loading those resources using a
 second classloader.

 Maybe it's just a problem with the HDFS code? Does it work if you have
 the jar with your iterator in lib/ or lib/ext? Or, maybe something is wrong
 like you defined a private constructor which is throwing that Exception?
 On Jun 15, 2014 8:50 AM, William Slacum wilhelm.von.cl...@accumulo.net
 wrote:

 Wouldn't the iterator have to be on the classpath for the JVM that
 launches the shell command?


 On Sun, Jun 15, 2014 at 9:02 AM, Vicky Kak vicky@gmail.com wrote:


 setiter -n MyIterator -p 10 -scan -minc -majc -class
 com.codebits.d4m.iterator.MyIterator
 scan

 The above line fails for me with the similar kind of error i.e
 ClassNotFoundException

 root@accumulo atest setiter -n MyIterator -p 10 -scan -minc -majc
 -class org.dallaybatta.MyIterator
 2014-06-15 18:20:18,061 [shell.Shell] ERROR:
 org.apache.accumulo.shell.ShellCommandException: Command could not be
 initialized (Unable to load org.dallaybatta.MyIterator; class not found.)


 My hdfs contains the corresponding jars but it yet fails.
 After digging a code for a while I figured that the error is coming
 from org.apache.accumulo.shell.commands.SetIterCommand::execute

 try {
   clazz = classloader.loadClass(className).asSubclass(SortedKeyValueIterator.class);
   untypedInstance = clazz.newInstance();
 } catch (ClassNotFoundException e) {
   StringBuilder msg = new StringBuilder("Unable to load ").append(className);
   if (className.indexOf('.') < 0) {
     msg.append("; did you use a fully qualified package name?");
   } else {
     msg.append("; class not found.");
   }
   throw new ShellCommandException(ErrorCode.INITIALIZATION_FAILURE, msg.toString());
 } catch (InstantiationException e) {

 Typically the ClassNotFoundException can also appear if the dependent
 classes are not present; here SortedKeyValueIterator could be the use case.
 I moved the accumulo core to the classpath folder in the hdfs but still
 could not get it working. Maybe some other dependent classes are required;
 this needs more time to analyse in this direction.

 The document states the following so it should ideally work,
 although the VFS classloader allows for classpath manipulation using a
 variety of schemes including URLs and HDFS URIs.

 I find it strange that your test and my tests results differ as you are
 able to set Iterator for a single row

Re: Improving Batchscanner Performance

2014-05-20 Thread William Slacum
By blocking, we mean you have to complete the entire index look up before
fetching your records.

Conceptually, instead of returning a `Collection<Text> rows`, return an
`Iterator<Text> rows` and consume them in batches as the first look up
produces them. That way record look ups can occur in parallel with index
look ups.
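A bare-bones sketch of that pipelining (table names, batch sizes, and the poison-pill shutdown are all assumptions, and error handling is omitted). The consumer starts looking up records after the first batch of row IDs arrives, instead of waiting for the whole index scan to finish:

import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class PipelinedLookup {
  private static final Text DONE = new Text();  // poison pill marking the end of the index scan

  public static void run(final Connector conn, final List<Range> indexRanges,
      final Text term) throws Exception {
    final BlockingQueue<Text> rowIds = new ArrayBlockingQueue<Text>(10000);

    // producer: scan the index table and hand row IDs over as they arrive
    Thread producer = new Thread(new Runnable() {
      public void run() {
        try {
          BatchScanner index = conn.createBatchScanner("indexTable", new Authorizations(), 10);
          index.setRanges(indexRanges);
          index.fetchColumnFamily(term);
          for (Entry<Key, Value> e : index) {
            rowIds.put(e.getKey().getColumnQualifier());  // qualifier holds the record's row ID
          }
          index.close();
          rowIds.put(DONE);
        } catch (Exception e) {
          throw new RuntimeException(e);
        }
      }
    });
    producer.start();

    // consumer: look up records in batches while the index scan is still producing
    List<Range> batch = new ArrayList<Range>();
    Text rowId;
    while ((rowId = rowIds.take()) != DONE) {
      batch.add(new Range(rowId));
      if (batch.size() == 1000) {
        fetchRecords(conn, batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      fetchRecords(conn, batch);
    }
    producer.join();
  }

  private static void fetchRecords(Connector conn, List<Range> ranges) throws Exception {
    BatchScanner data = conn.createBatchScanner("dataTable", new Authorizations(), 10);
    data.setRanges(new ArrayList<Range>(ranges));
    for (Entry<Key, Value> e : data) {
      // consume e.getValue().get() here
    }
    data.close();
  }
}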


On Tue, May 20, 2014 at 1:51 PM, Slater, David M.
david.sla...@jhuapl.eduwrote:

 Hi Josh,

 I should have clarified - I am using a batchscanner for both lookups. I
 had thought of putting it into two different threads, but the first scan is
 typically an order of magnitude faster than the second.

 The logic for upperbounding the results returned is outside of the method
 I provided. Since there is a one-to-one relationship between rowIDs and
 records on the second scan, I just limit the number of rows I send to this
 method.

 As for blocking, I'm not sure exactly what you mean. I complete the first
 scan in its entirety, which  before entering this method with the
 collection of Text rowIDs. The method for that is:

 public Collection<Text> getRowIDs(Collection<Range> ranges, Text term,
     String tablename, int queryThreads, int limit) throws TableNotFoundException {
   Set<Text> guids = new HashSet<Text>();
   if (!ranges.isEmpty()) {
     BatchScanner scanner = conn.createBatchScanner(tablename,
         new Authorizations(), queryThreads);
     scanner.setRanges(ranges);
     scanner.fetchColumnFamily(term);
     for (Map.Entry<Key, Value> entry : scanner) {
       guids.add(entry.getKey().getColumnQualifier());
       if (guids.size() > limit) {
         return null;
       }
     }
     scanner.close();
   }
   return guids;
 }

 Essentially, my query does:
 Collection<Text> rows = getRowIDs(new Range(minRow, maxRow), new
 Text(index), mytable, 10, 1);
 Collection<byte[]> data = getRowData(rows, mytable, 10);


 -Original Message-
 From: Josh Elser [mailto:josh.el...@gmail.com]
 Sent: Tuesday, May 20, 2014 1:32 PM
 To: user@accumulo.apache.org
 Subject: Re: Improving Batchscanner Performance

 Hi David,

 Absolutely. What you have here is a classic producer-consumer model.
 Your BatchScanner is producing results, which you then consume by your
 scanner, and ultimately return those results to the client.

 The problem with your below implementation is that you're not going to be
 polling your batchscanner as aggressively as you could be. You are blocking
 while you fetch each of those new Ranges from the Scanner before
 fetching new ranges. Have you considered splitting up the BatchScanner and
 Scanner code into two different threads?

 You could easily use a ArrayBlockingQueue (or similar) to pass results
 from the BatchScanner to the Scanner. I would imagine that this would give
 you a fair improvement in performance.

 Also, it doesn't appear that there's a reason you can't use a BatchScanner
 for both lookups?

 One final warning, your current implementation could also hog heap very
 badly if your batchscanner returns too many records. The producer/consumer
 I proposed should help here a little bit, but you should still be asserting
 upper-bounds to avoid running out of heap space in your client.

 On 5/20/14, 1:10 PM, Slater, David M. wrote:
  Hey everyone,
 
  I'm trying to improve the query performance of batchscans on my data
 table. I first scan over index tables, which returns a set of rowIDs that
 correspond to the records I am interested in. This set of records is fairly
 randomly (and uniformly) distributed across a large number of tablets, due
 to the randomness of the UID and the query itself. Then I want to scan over
 my data table, which is setup as follows:
  row      colFam   colQual   value
  rowUID   --       --        byte[] of data
 
  These records are fairly small (100s of bytes), but numerous (I may
 return 5 or more). The method I use to obtain this follows.
 Essentially, I turn the rows returned from the first query into a set of
 ranges to input into the batchscanner, and then return those rows,
 retrieving the value from them.
 
  // returns the data associated with the given collection of rows
   public Collection<byte[]> getRowData(Collection<Text> rows, Text dataType,
       String tablename, int queryThreads) throws TableNotFoundException {
     List<byte[]> values = new ArrayList<byte[]>(rows.size());
     if (!rows.isEmpty()) {
       BatchScanner scanner = conn.createBatchScanner(tablename,
           new Authorizations(), queryThreads);
       List<Range> ranges = new ArrayList<Range>();
       for (Text row : rows) {
         ranges.add(new Range(row));
       }
       scanner.setRanges(ranges);
       for (Map.Entry<Key, Value> entry : scanner) {
         values.add(entry.getValue().get());
       }
   

Re: Delete All Data In Table

2014-05-12 Thread William Slacum
You could save the splits, delete the table, then reapply the splits.
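Something along these lines, assuming a client API with listSplits()/addSplits() (1.5+; on 1.4 the older getSplits() call is the equivalent). Note that per-table settings such as iterators and locality groups are also lost by delete/create and would need to be reapplied the same way:

import java.util.Collection;
import java.util.TreeSet;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.admin.TableOperations;
import org.apache.hadoop.io.Text;

public class TruncateKeepSplits {
  public static void truncate(Connector conn, String table) throws Exception {
    TableOperations ops = conn.tableOperations();
    Collection<Text> splits = ops.listSplits(table);  // save the split points
    ops.delete(table);                                // drops all data quickly
    ops.create(table);
    ops.addSplits(table, new TreeSet<Text>(splits));  // put the splits back
  }
}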


On Mon, May 12, 2014 at 9:23 AM, BlackJack76 justin@gmail.com wrote:

 Besides using the tableOperations to deleteRows or delete the table
 entirely,
 what is the fastest way to delete all data in a table?  I am currently
 using
 a BatchDeleter but it is extremely slow when I have a large amount of data.
 Any better options?

 I don't want to use the tableOperations because both the deleteRows and
 delete blow away the splits.  I would like to keep the splits in place.

 Appreciate your thoughts!




 --
 View this message in context:
 http://apache-accumulo.1065345.n5.nabble.com/Delete-All-Data-In-Table-tp9748.html
 Sent from the Users mailing list archive at Nabble.com.



Re: Common Big Data Architecture Writeup

2014-04-29 Thread William Slacum
You could do mutations or bulk loading. As long as you can phrase your data
in terms of keys and values, you can store it in Accumulo.


On Tue, Apr 29, 2014 at 1:48 PM, Geoffry Roberts threadedb...@gmail.comwrote:

 David started this thread yesterday.  Since then I have read everything, I
 think,  and like what I see.   I still have a question.  To populate an
 Accumulo database, using the D4M schema, it would appear one would do so
 using Mutation objects et. al.  just as if one were not using D4M schema.
  Am I correct?  All the examples appear to focus on the analytic side of
 things.

 Thanks


 On Mon, Apr 28, 2014 at 9:12 PM, Kepner, Jeremy - 0553 - MITLL 
 kep...@ll.mit.edu wrote:

 D4M is two things:

 (1) A set of software for doing analytics.
 (2) A schema for ingesting and indexing diverse data into a NoSQL
 database like Accumulo

 It hits two parts of the Common Big Data Architecture.
 The CBDA is merely a restating of the obvious components a system needs
 to effective at processing Big Data.
 It can be implemented with a variety of technologies.

 Regards.  -Jeremy


 On Apr 28, 2014, at 9:08 PM, Chris Bennight ch...@slowcar.net
  wrote:

 I'm not getting what exactly the Common Big Data Architecture is?

 Is it just a term that describes any system that has the 7 components
 Jeremy mentioned (fs, ingest, DB, analytics, web services, resource
 scheduler, elastic compute)?   If so, what's the significance of naming
 this collection?

 And how exactly is D4M related to this?   (I understand it (D4M) hits a
 subset of those features, but don't think it encompasses all of those)

 Apologies if these are obtuse questions, I just feel like Im not
 comprehending what information is trying to be conveyed?




 On Mon, Apr 28, 2014 at 8:51 PM, Jeremy Kepner kep...@ll.mit.edu wrote:

 No problem.  I am glad to start getting the definitions out there.
 Great work on the page.  I think it helps clarify things a lot.

 On Mon, Apr 28, 2014 at 08:45:47PM -0400, David Medinets wrote:
 Sorry for my misunderstanding. I've updated the github project and
 moved
 it to [1]https://github.com/medined/D4M_Schema.
 
 On Mon, Apr 28, 2014 at 5:36 PM, Jeremy Kepner [2]
 kep...@ll.mit.edu
 wrote:
 
   David's well written example is illustrating the D4M Schema
   ([3]
 http://ieee-hpec.org/2013/index_htm_files/11-Kepner-D4Mschema-IEEE-HPEC.pdf
 ).
 
   The Common Big Data Architecture is a broad description that
 encompasses
   many
   big data systems and consists of 7 components: filesystem, ingest
   processes,
   databases, analytic processes, web services, resource scheduler,
 and
   elastic computing.  A reference will most likely appear in IEEE
 HPEC
   2014.
 
   Accumulo is the database of choice in many CBDA systems.
 
   The D4M schema is used in many Accumulo systems.
 
   On Mon, Apr 28, 2014 at 05:23:00PM -0400, David Medinets wrote:
   [1][4]
 https://github.com/medined/Common-Big-Data-Architecture -
   This project
   provides simple examples of the CBDA which is used by the
 D4M 2.0
   software.
   
References
   
   Visible links
   1. [5]
 https://github.com/medined/Common-Big-Data-Architecture
 
  References
 
 Visible links
 1. https://github.com/medined/D4M_Schema
 2. mailto:kep...@ll.mit.edu
 3.
 http://ieee-hpec.org/2013/index_htm_files/11-Kepner-D4Mschema-IEEE-HPEC.pdf
 4. https://github.com/medined/Common-Big-Data-Architecture
 5. https://github.com/medined/Common-Big-Data-Architecture






 --
 There are ways and there are ways,

 Geoffry Roberts



Re: Write to table from Accumulo iterator

2014-04-25 Thread William Slacum
Our own Keith Turner is trying to make this possible with Accismus (
https://github.com/keith-turner/Accismus). I don't know the current state
of it, but I believe it's still in the early stages.

I've always been under the impression that launching a scanner or writer
from within an iterator is discouraged, as it can cause deadlock in the
system if it is under heavy load.

 If it doesn't meet your needs, I'd recommend writing a daemon process that
identifies new documents via a scanner and filter, then write indices for
it. It's more network bound than doing it in an iterator, but it's safer.
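A rough sketch of that daemon approach, assuming the 1.5+ BatchWriterConfig client API (table names, the whitespace tokenizer, and the term/doc index layout are all made up for illustration; the "new documents only" filtering is only hinted at in a comment):

import java.util.Map.Entry;

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class IndexingDaemon {
  public static void indexOnce(Connector conn) throws Exception {
    Scanner docs = conn.createScanner("docs", new Authorizations());
    // a filter iterator could be attached here so only not-yet-indexed documents come back
    BatchWriter writer = conn.createBatchWriter("termIndex", new BatchWriterConfig());
    for (Entry<Key, Value> doc : docs) {
      String docId = doc.getKey().getRow().toString();
      for (String token : new String(doc.getValue().get()).split("\\s+")) {
        Mutation m = new Mutation(new Text(token));                         // term as the row
        m.put(new Text("index"), new Text(docId), new Value(new byte[0]));  // doc id as the qualifier
        writer.addMutation(m);
      }
    }
    writer.close();
  }
}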



On Fri, Apr 25, 2014 at 11:29 PM, David Medinets
david.medin...@gmail.comwrote:

 Can you change the ingest process to token on ingest?


 On Fri, Apr 25, 2014 at 10:45 PM, BlackJack76 justin@gmail.comwrote:

 Sure thing.  Basically, I am attempting to index a document.  When I find
 the
 document, I want to insert the tokens directly back into the table.  I
 want
 to do it directly from the seek routine so that I don't need to return
 anything back to the client.

 For example, seek may locate the document that has the following sentence:

 The quick brown fox

 From there, I tokenize the document and want to insert the individual
 tokens
 back into Accumulo (i.e., The, quick, brown, and fox all
 as
 separate mutations).



 --
 View this message in context:
 http://apache-accumulo.1065345.n5.nabble.com/Write-to-table-from-Accumulo-iterator-tp9412p9414.html
 Sent from the Users mailing list archive at Nabble.com.





Re: Embedded Mutations: Is this kind of thing done?

2014-04-24 Thread William Slacum
Depending on your table schema, you'll probably want to translate an object
graph into multiple mutations.


On Thu, Apr 24, 2014 at 8:40 PM, David Medinets david.medin...@gmail.comwrote:

 If the sub-document changes, you'll need to search the values of every
 Accumulo entry?


 On Thu, Apr 24, 2014 at 5:31 PM, Geoffry Roberts 
 threadedb...@gmail.comwrote:

 The use case is, I am walking a complex object graph and persisting what
 I find there.  Said object graph in my case is always EMF (eclipse modeling
 framework) compliant.  An EMF graph can have in if references to--brace
 yourself--a non-cross document containment reference.  When using Mongo,
 these were persisted as a DBObject embedded into a containing DBObject.
  I'm trying to decide whether I want to follow suit.

 Any thoughts?


 On Thu, Apr 24, 2014 at 4:03 PM, Sean Busbey bus...@cloudera.com wrote:

 Can you describe the use case more? Do you know what the purpose for the
 embedded changes are?


 On Thu, Apr 24, 2014 at 2:59 PM, Geoffry Roberts threadedb...@gmail.com
  wrote:

 All,

 I am in the throws of converting some(else's) code from MongoDB to
 Accumulo.  I am seeing a situation where one DBObject is being embedded
 into another DBObject.  I see that Mutation supports a method called
 getRow()  that returns a byte array.  I gather I can use this to achieve a
 similar result if I were so inclined.

 Am I so inclined?  i.e. Is this the way we do things in Accumulo?

 DBObject, roughly speaking, is Mongo's counterpart to Mutation.

 Thanks mucho

 --
 There are ways and there are ways,

 Geoffry Roberts




 --
 Sean




 --
 There are ways and there are ways,

 Geoffry Roberts





Re: bulk ingest without mapred

2014-04-08 Thread William Slacum
 java.io.FileNotFoundException: File does not exist:
bulk/entities_fails/failures

sticks out to me. it looks like a relative path. where does that directory
exist on your file system?
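For what it's worth, a small sketch of making sure the directories handed to importDirectory are absolute, fully-qualified HDFS paths rather than paths relative to the client's working directory (the paths below are examples only):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BulkPaths {
  public static String failureDir(Configuration conf) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    // "bulk/entities_fails/failures" resolves relative to the client's working
    // directory; qualify it so the master resolves the same absolute location
    Path failures = fs.makeQualified(new Path("/user/paul/bulk/entities_fails/failures"));
    fs.mkdirs(failures);
    return failures.toString();  // e.g. hdfs://namenode:8020/user/paul/bulk/entities_fails/failures
  }
}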


On Tue, Apr 8, 2014 at 9:40 AM, pdread paul.r...@siginttech.com wrote:

 Hi

 I interface to an accumulo cloud (100s of nodes) which I don't maintain.
 I'll try and keep this short, the interface App is used to ingest millions
 of docs/week from various streams, some are required near real time. A
 problem came up where the tservers would not stay up and our ingest would
 halt. Now the admins are working on fixing this but I'm not optimistic.
 Others who have run into this tell me its the use of Mutations that is
 causing the problem and it will go away if I do bulk ingest. However
 mapreduce is way to slow to spin up and does not map to our arch.

 So here is what I have been trying to do. After much research I think I
 should be able to bulk ingest if I create the RFile and feed this to
 TableOperations.importDirectory(). I can create the RFile ok, at least I
 think so, and I create the failure directory using Hadoop's file system. I
 check that the failure directory is there and is a directory, but when I
 feed it to the import I get an error over on the accumulo master log that it
 cannot find the failure directory. Now the interesting thing is I have
 traced the code through the accumulo client; it checks successfully for the
 load file and the failure directory. What am I doing wrong?

 First the client error:

 org.apache.accumulo.core.client.AccumuloException: Internal error
 processing
 waitForTableOperation
 at

 org.apache.accumulo.core.client.admin.TableOperationsImpl.doTableOperation(TableOperationsImpl.java:290)
 at

 org.apache.accumulo.core.client.admin.TableOperationsImpl.doTableOperation(TableOperationsImpl.java:258)
 at

 org.apache.accumulo.core.client.admin.TableOperationsImpl.importDirectory(TableOperationsImpl.java:945)
 at

 airs.medr.accumulo.server.table.EntityTable.writeEntities(EntityTable.java:130)

 Now the master log exception:

 2014-04-08 08:33:50,609 [thrift.MasterClientService$Processor] ERROR:
 Internal error processing waitForTableOperation
 java.lang.RuntimeException: java.io.FileNotFoundException: File does not
 exist: bulk/entities_fails/failures
 at

 org.apache.accumulo.server.master.Master$MasterClientServiceHandler.waitForTableOperation(Master.java:1053)
 at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
 at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at

 org.apache.accumulo.cloudtrace.instrument.thrift.TraceWrap$1.invoke(TraceWrap.java:59)
 at $Proxy6.waitForTableOperation(Unknown Source)
 at

 org.apache.accumulo.core.master.thrift.MasterClientService$Processor$waitForTableOperation.process(MasterClientService.java:2004)
 at

 org.apache.accumulo.core.master.thrift.MasterClientService$Processor.process(MasterClientService.java:1472)
 at

 org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:154)
 at

 org.apache.thrift.server.TNonblockingServer$FrameBuffer.invoke(TNonblockingServer.java:631)
 at

 org.apache.accumulo.server.util.TServerUtils$THsHaServer$Invocation.run(TServerUtils.java:202)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at
 org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
 at java.lang.Thread.run(Thread.java:662)
 Caused by: java.io.FileNotFoundException: File does not exist:
 bulk/entities_fails/failures
 at

 org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:528)
 at

 org.apache.accumulo.server.trace.TraceFileSystem.getFileStatus(TraceFileSystem.java:797)
 at

 org.apache.accumulo.server.master.tableOps.BulkImport.call(BulkImport.java:157)
 at

 org.apache.accumulo.server.master.tableOps.BulkImport.call(BulkImport.java:110)
 at

 org.apache.accumulo.server.master.tableOps.TraceRepo.call(TraceRepo.java:65)
 at
 org.apache.accumulo.server.fate.Fate$TransactionRunner.run(Fate.java:65)


 Thoughts?

 Thanks

 Paul




 --
 View this message in context:
 http://apache-accumulo.1065345.n5.nabble.com/bulk-ingest-without-mapred-tp8904.html
 Sent from the Users mailing list archive at Nabble.com.



Re: bulk ingest without mapred

2014-04-08 Thread William Slacum
The extension is .rf. Are you using an RFile.Writer?


On Tue, Apr 8, 2014 at 1:29 PM, pdread paul.r...@siginttech.com wrote:

 Josh

 As I had stated in one of my previous posts I am using FileSystem. I am
 using the code from the MapReduce bulk ingest without the MapReduce. I did
 feed the TableOperations.importDirectory a load directory and that is
 where it found the entities.txt, in that load directory. So now the only
 question that remains is what is the proper extension for the RFile. The
 entities.txt is an RFile which I created with the appropriate Key/Value
 pairs that should load/match my table.

 Thanks

 Paul



 --
 View this message in context:
 http://apache-accumulo.1065345.n5.nabble.com/bulk-ingest-without-mapred-tp8904p8922.html
 Sent from the Users mailing list archive at Nabble.com.



Re: NOT operator in visibility string

2014-03-08 Thread William Slacum
Thanks, Joe!


On Fri, Mar 7, 2014 at 2:01 PM, joeferner joe.m.fer...@gmail.com wrote:

 Submitted the patch here:  ACCUMULO-2439
 https://issues.apache.org/jira/browse/ACCUMULO-2439




 --
 View this message in context:
 http://apache-accumulo.1065345.n5.nabble.com/NOT-operator-in-visibility-string-tp7949p7993.html
 Sent from the Users mailing list archive at Nabble.com.



Re: Synchronized Access to ZooCache Causing Threads to Block

2014-02-12 Thread William Slacum
FWIW you can probably avoid the scan by making your insert idempotent aside
from the timestamp and let versioning handle deduplication.
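As a small illustration of that idea (names are made up; the point is that the row/family/qualifier are deterministic and no explicit timestamp is set, so re-writing the same record just creates another version that the default VersioningIterator collapses):

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class IdempotentWriter {
  public static void write(Connector conn, String recordId, byte[] payload) throws Exception {
    BatchWriter bw = conn.createBatchWriter("records", new BatchWriterConfig());
    Mutation m = new Mutation(new Text(recordId));                     // deterministic row
    m.put(new Text("data"), new Text("payload"), new Value(payload));  // deterministic cf/cq
    bw.addMutation(m);                                                 // no read-before-write needed
    bw.close();
  }
}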


On Wed, Feb 12, 2014 at 1:19 PM, Ariel Valentin ar...@arielvalentin.comwrote:

 Sorry but I am not at liberty to be specific about our business problem.

 Typical usage is multiple clients writing data to tables, which scan to
 avoid duplicate entries.

 Ariel Valentin
 e-mail: ar...@arielvalentin.com

 website: http://blog.arielvalentin.com
 skype: ariel.s.valentin
 twitter: arielvalentin
 linkedin: http://www.linkedin.com/profile/view?id=8996534
 ---
 *simplicity *communication
 *feedback *courage *respect


 On Wed, Feb 12, 2014 at 10:59 AM, Josh Elser josh.el...@gmail.com wrote:

 Also, I forgot this part before:

 The ZooCache instance that's used *typically* comes from the Instance
 object that your Connector was created from. In other words, if you create
 multiple Instances (ZooKeeperInstance, usually), you can get multiple
 ZooCaches which means that concurrent calls to methods off of those objects
 should not block one another (createScanner off of connector1 from
 instance1 should not block createScanner off of connector2 from instance2).

 That should be something quick you can play with if you so desire.


 On 2/12/14, 9:57 AM, Josh Elser wrote:

 Yep, you'll likely also block on BatchScanner, anything in
 TableOperations, and a host of other things.

 For scanners, there's likely a standing recommendation to amortize the
 use of those objects (if you want to look up 5 range, don't make 5
 scanners).

 Creating a cache per ensemble member would likely require some kind
 of paxos implementation to provide consistency, which is highly
 undesirable.

 One thing I'm curious about is the impact of removing ZooCache
 altogether from things like the client api and see what happens. I don't
 have a good way to measure that impact off the top of my head though.

 Anyways, is this causing you problems in your usage of the api? Could
 you elaborate a bit more on the specifics?

 On Feb 12, 2014 4:48 AM, Ariel Valentin ar...@arielvalentin.com
 mailto:ar...@arielvalentin.com wrote:

 I have run into a problem related to ACCUMULO-1833, which appears to
 have addressed the issue for MutliTableBatchWriter; however I am
 seeing this issue on the scanner side also:

 394750-http-/192.168.220.196:8080-35 daemon prio=10 tid=0x7f3108038000 nid=0x538a waiting for monitor entry [0x7f31287d1000]
 394878:   java.lang.Thread.State: BLOCKED (on object monitor)
 394933- at org.apache.accumulo.fate.zookeeper.ZooCache.getInstance(ZooCache.java:301)
 395012- - waiting to lock <0xfa64f5b8> (a java.lang.Class for org.apache.accumulo.fate.zookeeper.ZooCache)
 395120- at org.apache.accumulo.core.client.impl.Tables.getZooCache(Tables.java:40)
 395196- at org.apache.accumulo.core.client.impl.Tables.getMap(Tables.java:44)
 395267- at org.apache.accumulo.core.client.impl.Tables.getNameToIdMap(Tables.java:78)
 395346- at org.apache.accumulo.core.client.impl.Tables.getTableId(Tables.java:64)
 395421- at org.apache.accumulo.core.client.impl.ConnectorImpl.getTableId(ConnectorImpl.java:75)
 395510- at org.apache.accumulo.core.client.impl.ConnectorImpl.createScanner(ConnectorImpl.java:137)

 I have not spent enough time reasoning about the code to understand
 all of the nuances but I am interested in knowing if there are any
 mitigating strategies for dealing with this thread contention e.g.
 would creating a cache entry for each member of the Zookeeper
 ensemble help relieve the strain? use multiple classloaders? or is
 my only option to spawn multiple JVMs?

 Thanks,

 Ariel Valentin
 e-mail: ar...@arielvalentin.com mailto:ar...@arielvalentin.com

 website: http://blog.arielvalentin.com
 skype: ariel.s.valentin
 twitter: arielvalentin
 linkedin: http://www.linkedin.com/profile/view?id=8996534
 ---
 *simplicity *communication
 *feedback *courage *respect





Re: scanner question in regards to columns loaded

2014-01-26 Thread William Slacum
Filters (and more generally, iterators) are executed on the server. There
is an option to run them client side. See
http://accumulo.apache.org/1.4/apidocs/org/apache/accumulo/core/client/ClientSideIteratorScanner.html

Using fetchColumnFamily will return only keys that have specific column
family values, not rows.

If I have a few keys in a table:

row1 family1: qualifier1
row1 family2: qualifier2
row2 family1: qualifier1

Let's say I call `scanner.fetchColumnFamily(family1)`. My scanner will
return:

row1 family1: qualifier1
row2 family1: qualifier1

Now let's say I want to do a scan, but call
`scanner.fetchColumnFamily(family2)`. My scanner will return:

row1 family2: qualifier2

If you want whole rows that contain specific column families, then I
believe you'd have to write a custom iterator using the RowFilter
http://accumulo.apache.org/1.4/apidocs/org/apache/accumulo/core/iterators/user/RowFilter.html
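A minimal sketch of such a RowFilter (class and family names are examples only):

import java.io.IOException;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
import org.apache.accumulo.core.iterators.user.RowFilter;
import org.apache.hadoop.io.Text;

public class HasFamilyFilter extends RowFilter {
  private static final Text WANTED = new Text("family1");  // example family from above

  @Override
  public boolean acceptRow(SortedKeyValueIterator<Key, Value> rowIterator) throws IOException {
    while (rowIterator.hasTop()) {
      if (rowIterator.getTopKey().getColumnFamily().equals(WANTED)) {
        return true;   // the row contains the family, so the whole row is kept
      }
      rowIterator.next();
    }
    return false;      // never saw the family; drop the row
  }
}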


On Sun, Jan 26, 2014 at 7:39 PM, Jamie Johnson jej2...@gmail.com wrote:

 After a little reading...if I use fetchColumnFamily does that skip any
 rows that do not have the column family?
 On Jan 26, 2014 7:27 PM, Jamie Johnson jej2...@gmail.com wrote:

 Thanks for the ideas.  Filters are client side right?

 I need to read the documentation more as I don't know how to just query a
 column family.  Would it be possible to get all terms that start with a
 particular value?  I was thinking that we would need a special prefix for
 this but if something could be done without needing it that would work well.
 On Jan 26, 2014 5:44 PM, Christopher ctubb...@apache.org wrote:

 Ah, I see. Well, you could do that with a custom filter (iterator),
 but otherwise, no, not unless you had some other special per-term
 entry to query (rather than per-term/document pair). The design of
 this kind of table though, seems focused on finding documents which
 contain the given terms, though, not listing all terms seen. If you
 need that additional feature and don't want to write a custom filter,
 you could achieve that by putting a special entry in its own row for
 each term, in addition to the entries per-term/document pair, as in:

  RowID           ColumnFamily   Column Qualifier   Value
  term1           term           -                  -
  term1=doc_id2   index          count              5

 Then, you could list terms by querying the term column family
 without getting duplicates. And, you could get decent performance with
 this scan if you put the term column family and the index column
 family in separate locality groups. You could even make this entry an
 aggregated count for all documents (see documentation for combiners),
 in case you want corpus-wide term frequencies (for something like
 TF-IDF computations).

 --
 Christopher L Tubbs II
 http://gravatar.com/ctubbsii


 On Sun, Jan 26, 2014 at 7:55 AM, Jamie Johnson jej2...@gmail.com
 wrote:
  I mean if a user asked for all terms that started with term is there
 a way
  to get term1 and term2 just once while scanning or would I get each
 twice,
  once for each docid and need to filter client side?
 
  On Jan 26, 2014 1:33 AM, Christopher ctubb...@apache.org wrote:
 
  If you use the Range constructor that takes two arguments, then yes,
  you'd get two entries. However, count would come before doc_id,
  though, because the qualifier is part of the Key, and therefore, part
  of the sort order. There's also a Range constructor that allows you to
  specify whether you want the startKey and endKey to be inclusive or
  exclusive.
 
  I don't know of a specific document that outlines various strategies
  that I can link to. Perhaps I'll put one together, when I get some
  spare time, if nobody else does. I think most people do a lot of
  experimentation to figure out which strategies work best.
 
  I'm not entirely sure what you mean about getting an iterator over
  all terms without duplicates. I'm assuming you don't mean duplicate
  versions of a single entry, which is handled by the
  VersioningIterator, which should be on new tables by default, and set
  to retain the recent 1 version, to support updates. With the scheme I
  suggested, your table would look something like the following,
  instead:
 
  RowID           ColumnFamily   Column Qualifier   Value
  term1=doc_id1   index          count              10
  term1=doc_id2   index          count              5
  term2=doc_id3   index          count              3
  term3=doc_id1   index          count              12
 
  With this scheme, you'd have only a single entry (a count) for each
  row, and a single row for each term/document combination, so you
  wouldn't have any duplicate counts for any given term/document. If
  that's what you mean by duplicates...
 
 
  --
  Christopher L Tubbs II
  http://gravatar.com/ctubbsii
 
 
  On Sat, Jan 25, 2014 at 12:19 AM, Jamie Johnson jej2...@gmail.com
 wrote:
   Thanks for the reply Chris.  Say I 

Re: ISAM file location vs. read performance

2014-01-12 Thread William Slacum
Some data on short circuit reads would be great to have.

I'm unsure of how correct the compaction leading to eventual locality
postulation is. It seems, to me at least, that in the case of a multi-block
file, the file system would eventually try to distribute those blocks
rather than leave them all on a single host.

One quick correction: not splittable means that the file can't be
processed (ie, MapReduce'd over) in chunks, not that the file won't be
split into blocks.



On Sun, Jan 12, 2014 at 1:58 PM, Arshak Navruzyan arsh...@gmail.com wrote:

 John,

 Thanks for the explanation.  I had to look up the HDFS block distribution
 documentation and it now makes complete sense.

 the 1st replica is placed on the local machine

 So since the compacted RFile is not splittable by HDFS, this ensures that
 the whole thing will be available where the Accumulo tablet is running.

 Maybe I can test out the shortcircuit reads and report back.

 Thanks,

 Arshak


 On Sun, Jan 12, 2014 at 9:36 AM, John Vines vi...@apache.org wrote:

 So I'm not certain on our performance with short circuit reads, aside
 from them being better.

 But because of the way hdfs writes get distributed, a tablet server has a
 strong probability of being a local read, so that is there. This is because
 a tserver with ultimately end up major compacting it's files, ensuring
 locality. So simply constantly ingesting will lead to eventual locality if
 it wasn't there before. It just so happens those reads go through a
 datanode, but not necessarily through the network.

 Sent from my phone, please pardon the typos and brevity.
 On Jan 12, 2014 12:29 PM, Arshak Navruzyan arsh...@gmail.com wrote:

 One aspect of Accumulo architecture is still unclear to me.  Would you
 achieve better scan performance if you could guarantee that the tablet and
 its ISAM file lived on the same node?  Guessing ISAM files are not
 splittable so they pretty much stay on one HDFS data node (plus the replica
 copy). Or is the theory that SATA and a 10 Gbps network provide more or less
 the same throughput?

 I generally understand that as the table grows and Accumulo creates more
 splits (tablets) you get better distribution over the cluster but seems
 like data location would still be important.   HBase folks seem to think
 that you can approx. double your throughput if you let the region server
 directly read the file (dfs.client.read.shortcircuit=true) as opposed to
 going through the data node. (
 http://files.meetup.com/1350427/hug_ebay_jdcryans.pdf).  Perhaps this
 is due more to HDFS overhead?

 I do get that one really nice thing about Accumulo's architecture is
 that it costs almost nothing to reassign tablet to a different tserver and
 this is a huge problem for other systems.






Re: How to remove entire row at the server side?

2013-11-05 Thread William Slacum
If an iterator is only set at scan time, then its logic will only be
applied when a client scans the table. The data will persist through major
and minor compaction and be visible if you scanned the RFile(s) backing the
table. Suppress is the better word in this case. Would you please open a
ticket pointing us where to update the documentation?

It looks like you'd want to implement a RowFilter for your use case. It has
the necessary hooks to avoid reading a whole row into memory, and it handles
the logic of determining whether or not to write keys that occur before the
column you're filtering on (at the cost of reading those keys twice).
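
One possible shape for such a RowFilter, assuming the expiry column is
data:expTs and its value is the ISO-8601 string shown in the example rows;
the class and formatting details are illustrative, not a tested implementation:

    import java.io.IOException;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.TimeZone;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
    import org.apache.accumulo.core.iterators.user.RowFilter;

    public class ExpiryRowFilter extends RowFilter {
      @Override
      public boolean acceptRow(SortedKeyValueIterator<Key, Value> row) throws IOException {
        while (row.hasTop()) {
          Key k = row.getTopKey();
          if (k.getColumnFamily().toString().equals("data")
              && k.getColumnQualifier().toString().equals("expTs")) {
            // ISO-8601 timestamps compare correctly as plain strings.
            SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
            fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
            String now = fmt.format(new Date());
            return row.getTopValue().toString().compareTo(now) > 0; // keep only unexpired rows
          }
          row.next();
        }
        return true; // no expTs set yet: keep the row
      }
    }

Attached to the table at the scan, minc, and majc scopes (and followed by a
compaction), something like this would suppress expired rows on reads and drop
them as the files are rewritten.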




On Tue, Nov 5, 2013 at 6:20 PM, Terry P. texpi...@gmail.com wrote:

 Greetings everyone,
 I'm looking at the AgeOffFilter as a base from which to write a
 server-side filter / iterator to purge rows when they have aged off based
 on the value of a specific column in the row (expiry datetime = now). So
 this differs from the AgeOffFilter in that the criterion for removal is
 from the same column in every row (not the Accumulo timestamp for an
 individual entry), and we need to remove the entire row not just individual
 entries. For example:

 Format: Key:CF:CQ:Value
 abc:data:title:My fantastic data
 abc:data:content:bytedata
 abc:data:creTs:2013-08-04T17:14:12Z
 abc:data:*expTs*:2013-11-04T17:14:12Z
 ... 6-8 more columns of data per row ...

 where *expTs* is the column to determine if the entire row should be
 removed based on whether its value is = NOW.

 This task seemed easy enough as a client program (and it is really), but a
 server-side iterator would be far more efficient than sending millions of
 rowkeys across the network just to delete them (we'll be deleting more than
 a million every hour).  But I'm struggling to get there.

 In looking at AgeOffFilter.java, is the magic in the AgeOffFilter class
 that removes (deletes) an entry from a table the fact that the accept
 method returns false, combined with the fact that the iterator would be set
 to run at -majc or -minc time and it is the compaction code that actually
 deletes the entry?  If set to run only at scan time, would AgeOffFilter
 simply not return the rows during the scan, but not delete them?  The
 wording in the iterator classes varies, with some saying "remove" and others
 "suppress", so it's not clear to me.

 If that's the case, then I think I know where to implement the logic. The
 question is, how can I remove all the entries for the row once the accept
 method has determined it meets the criteria?

 Or as Mike Drob mentioned in a prior post, will basing my class on the
 RowFilter class instead of just Filter make things easier?  Or the
 WholeRowIterator?  Just trying to find the simplest solution.

 Sorry for what may be obvious questions but I'm more of a DB Architect
 that does some coding, and not a Java programmer by trade. With all of the
 amazing things Accumulo does, honestly I was surprised when I couldn't find
 a way to delete rows in the shell by criteria other than the rowkey!  I'm
 more used to having a shell to 'delete from table where column = value'.

 But looking at it now, everyone's criteria for deletion will likely be
 different given the flexibility of a key=value store.  If our rowkey had
 the date/timestamp as a prefix, I know an easy deletemany command in the
 shell would do the trick -- but the nature of the data is such that
 initially no expiration timestamp is set, and there is no means to update
 the key from the client app when expiration timestamp finally gets set (too
 much rework on that common tool I'm afraid).

 Thanks in advance.



Re: [DISCUSS] Hadoop 2 and Accumulo 1.6.0

2013-10-23 Thread William Slacum
There wasn't any discussions in those tickets as to what Hadoop 2 provides
Accumulo. If we're going to still support 1, then any new features only
possible with 2 have to become optional until we ditch support for 1. Is
there anything people have in mind, feature wise, that Hadoop 2 would help
with?


On Wed, Oct 23, 2013 at 7:05 PM, Josh Elser josh.el...@gmail.com wrote:

 To ensure that we get broader community interaction than only on a Jira
 issue [1], I want to get community feedback about the version of Hadoop
 which the default, deployed Accumulo artifacts will be compiled against.

 Currently, Accumulo builds against a Hadoop-1 series release
 (1.5.1-SNAPSHOT and 1.6.0-SNAPSHOT build against 1.2.1, and 1.5.0 builds
 against 1.0.4). Last week, the Apache Hadoop community voted to release
 2.2.0 as GA (general availability) -- in other words, the Apache Hadoop
 community is calling Hadoop-2.2.0 stable.

 As has been discussed across various issues on Jira, this means a few
 different things for Accumulo. Most importantly, this serves as a
 recommendation by us that users should be trying to use Hadoop-2.2.0 with
 Accumulo 1.6.0. This does *not* mean that we do not support Hadoop1 ([2]
 1.2.1 specifically). Hadoop-1 support would still be guaranteed by us for
 1.6.0.

 - Josh

 [1] 
 https://issues.apache.org/**jira/browse/ACCUMULO-1419https://issues.apache.org/jira/browse/ACCUMULO-1419
 [2] 
 https://issues.apache.org/**jira/browse/ACCUMULO-1643https://issues.apache.org/jira/browse/ACCUMULO-1643



Re: Trouble with IntersectingIterator

2013-10-01 Thread William Slacum
That iterator is designed to be used with a sharded table format, wherein
the index and record each occur within the same row. See the Accumulo
examples page http://accumulo.apache.org/1.4/examples/shard.html


On Tue, Oct 1, 2013 at 3:35 PM, Heath Abelson habel...@netcentricinc.comwrote:

  I am attempting to get a very simple example working with the
 Intersecting Iterator. I made up some dummy objects for me to do this work:
 


 A scan on the “Mail” table looks like this:


 m1 mail:body [U(USA)]WTF?

 m1 mail:receiver [U(USA)]mgiordano

 m1 mail:sender [U(USA)]habelson

 m1 mail:sentTime [U(USA)]1380571500

 m1 mail:subject [U(USA)]Lunch

 m2 mail:body [U(USA)]I know right?

 m2 mail:receiver [U(USA)]jmarcolla

 m2 mail:sender [U(USA)]habelson

 m2 mail:sentTime [U(USA)]1380571502

 m2 mail:subject [U(USA)]Lunch

 m3 mail:body [U(USA)]exactly!

 m3 mail:receiver [U(USA)]habelson

 m3 mail:sender [U(USA)]mgiordano

 m3 mail:sentTime [U(USA)]1380571504

 m3 mail:subject [U(USA)]Lunch

 m4 mail:body [U(USA)]Dude!

 m4 mail:receiver [U(USA)]mcross

 m4 mail:sender [U(USA)]habelson

 m4 mail:sentTime [U(USA)]1380571506

 m4 mail:subject [U(USA)]Lunch

 m5 mail:body [U(USA)]Yeah

 m5 mail:receiver [U(USA)]habelson

 m5 mail:sender [U(USA)]mcross

 m5 mail:sentTime [U(USA)]1380571508

 m5 mail:subject [U(USA)]Lunch


 A scan on the “MailIndex” table looks like this:


 receiver habelson:m3 []habelson

 receiver habelson:m5 []habelson

 receiver jmarcolla:m2 []jmarcolla

 receiver mcross:m4 []mcross

 receiver mgiordano:m1 []mgiordano

 sender habelson:m1 []habelson

 sender habelson:m2 []habelson

 sender habelson:m4 []habelson

 sender mcross:m5 []mcross

 sender mgiordano:m3 []mgiordano

 sentTime 1380571500:m1 []1380571500

 sentTime 1380571502:m2 []1380571502

 sentTime 1380571504:m3 []1380571504

 sentTime 1380571506:m4 []1380571506

 sentTime 1380571508:m5 []1380571508

 subject Lunch:m1 []Lunch

 subject Lunch:m2 []Lunch

 subject Lunch:m3 []Lunch

 subject Lunch:m4 []Lunch

 subject Lunch:m5 []Lunch


 If I use an IntersectingIterator with a BatchScanner and pass it the terms
 “habelson”,”mgiordano” (or seemingly any pair of terms) I get zero results.
 If, instead, I use the same value as both terms (i.e.
 “habelson”,”habelson”) I properly get back the records that contain that
 value.


 My code is almost identical to the userguide example, and I am using
 Accumulo 1.4.3


 Any help would be appreciated


 Heath Abelson

 NetCentric Technology, Inc.

 3349 Route 138, Building A

 Wall, NJ  07719

 Phone: 732-544-0888 x159

 Email:  habel...@netcentricinc.com  




Re: Intersecting Iterators [SEC=UNCLASSIFIED]

2013-08-14 Thread William Slacum
Usually the intersecting iterator is used when you're modeling a document
partitioned table. That is, you have relatively few row values compared to
the number of documents you're storing (like, on the order of hundreds to
millions of documents in a single row). It looks like you have a single row
for each document, with field indices stored in the same row as the
document.

What I might suggest is something like:

Row: date
ColumnFamily (a): fi||field||data
ColumnQualifier (a): document-id
ColumnFamily (b): document Id
ColumnQualifier (b): field||data

I believe that having 1:1 mapping between shards/rows and document IDs can
cause significant overhead when it comes to scanning, because it will be
constantly seek'ing within the same RFile blocks.
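
A quick sketch of what one document's mutation could look like under that
layout; the method name and the empty cell values are assumptions for
illustration (the raw field text could go in the Value instead if preferred):

    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.io.Text;

    public class ShardedLayout {
      // One mutation per document: the shard (row) is the date, field-index
      // entries go under "fi||field||data" with the document id as qualifier,
      // and the stored fields go under the document id.
      public static Mutation forDocument(String date, String docId, String field, String data) {
        Mutation m = new Mutation(new Text(date));
        // (a) field index entry: cf = fi||field||data, cq = document id
        m.put(new Text("fi||" + field + "||" + data), new Text(docId), new Value(new byte[0]));
        // (b) document entry: cf = document id, cq = field||data
        m.put(new Text(docId), new Text(field + "||" + data), new Value(new byte[0]));
        return m;
      }
    }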


On Wed, Aug 14, 2013 at 12:50 AM, Williamson, Luke MR 1 
luke.williams...@defence.gov.au wrote:

 UNCLASSIFIED

 I have tried increasing the number of threads and it seems to guarantee
 that it will return before it hits the timeout but it is taking approx. 7
 minutes to complete. Looking at the accumulo manager page it appears that
 all the tablet servers get equally hit (around 16 per node) and start to
 return but a couple of tablet servers take longer than the others. This
 behaviour was indicated to potentially happen in the doco but I was hoping
 it wouldn't be taking this long.

 

 From: David Medinets [mailto:david.medin...@gmail.com]
 Sent: Wednesday, 14 August 2013 12:45
 To: accumulo-user
 Subject: Re: Intersecting Iterators [SEC=UNCLASSIFIED]


 I'm wondering about the 20 threads in the BatchScanner. Have you played
 with increasing it? I've seen that number go above 15 per accumulo node.
 Are you seeing the scans in the Accumulo monitor? Are the scans progressing
 through the Accumulo nodes?


 On Tue, Aug 13, 2013 at 9:58 PM, Williamson, Luke MR 1 
 luke.williams...@defence.gov.au wrote:


 UNCLASSIFIED

 Hi,

 I have field indexes that looks something like

 Row Id: date-UUID
 CF: fi||type||value
 CQ: date-UUID

 For example:

 20130814-550e8400-e29b-41d4-a716-44665544 fi||verb||run
 20130814-550e8400-e29b-41d4-a716-44665544
 20130814-550e8400-e29b-41d4-a716-44665544 page||58 line||16
 the boy can run up the hill

 From what I could determine from the doco and API I am executing
 the following code to perform an intersecting query on two values...

        Set<Range> shards = new HashSet<Range>();

        Text[] terms = {new Text("fi||type||value"), new Text("fi||type||value")};

        BatchScanner bs = conn.createBatchScanner(table, auths, 20);
        bs.setTimeout(360, TimeUnit.SECONDS);

        IteratorSetting iter = new IteratorSetting(20, "ii", IntersectingIterator.class);
        IntersectingIterator.setColumnFamilies(iter, terms);
        bs.addScanIterator(iter);

        bs.setRanges(Collections.singleton(new Range()));

        for (Entry<Key,Value> entry : bs) {
            shards.add(new Range(entry.getKey().getColumnQualifier()));
        }

 I then perform a second batch scan using the set of ranges
 returned by the above to get my actual results.

 My issues is that the intersecting query takes several minutes to
 return if at all (in some cases it times out). Is this expected? Is there
 some way to improve performance? Is there a better way to do this sort of
 query?

 Any guidance would be much appreciated.

 Thanks

 Luke


 IMPORTANT: This email remains the property of the Department of
 Defence and is subject to the jurisdiction of section 70 of the Crimes Act
 1914. If you have received this email in error, you are requested to
 contact the sender and delete the email.




 IMPORTANT: This email remains the property of the Department of Defence
 and is subject to the jurisdiction of section 70 of the Crimes Act 1914. If
 you have received this email in error, you are requested to contact the
 sender and delete the email.



Re: How to efficiently find lexicographically adjacent records?

2013-08-07 Thread William Slacum
Finding the keys after your hypothetical key is easy, as you can just make
it the first key in the range you pass to your Scanner. Since accumulo
doesn't do backwards scanning, you might have to consider having two tables
or sets of rows, one that sorts lexicographically and the other that sorts
in reverse lexicographic order.

There's also probably trickery you can do with the key extents and buffered
reading in an iterator to avoid having to write your data twice. I think it
would involve picking a tablet that would contain your key and expanding
your scans if you don't have enough data.
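
A rough sketch of both halves of that idea: successors via a forward scan
starting at k, and predecessors via a hypothetical second table ("table_rev")
whose row bytes are complemented so it sorts in reverse; names and the
complement scheme are assumptions for illustration:

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Map.Entry;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.io.Text;

    public class Neighbors {
      // Successors: plain forward scan starting at k, stopping after m entries.
      public static List<Key> successors(Connector conn, String table, Text k, int m)
          throws Exception {
        Scanner s = conn.createScanner(table, new Authorizations());
        s.setRange(new Range(k, null));  // from k to the end of the table
        List<Key> out = new ArrayList<Key>();
        Iterator<Entry<Key, Value>> it = s.iterator();
        while (it.hasNext() && out.size() < m) {
          out.add(it.next().getKey());
        }
        return out;
      }

      // Byte-wise complement used when writing and querying the reverse table,
      // so scanning it forward from complement(k) yields k's predecessors.
      public static byte[] complement(byte[] row) {
        byte[] rev = new byte[row.length];
        for (int i = 0; i < row.length; i++) {
          rev[i] = (byte) (0xff - (row[i] & 0xff));
        }
        return rev;
      }
    }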


On Wed, Aug 7, 2013 at 5:31 PM, Jeff Kubina jeff.kub...@gmail.com wrote:

 I have records <key; value> in an Accumulo table where the key is about a
 50-byte-long string. Given a new key k, I want to find the m records that
 would precede and succeed the record <k; v> if it were inserted into the
 table. Any ideas on how I can do this efficiently? The record <k; v> will
 eventually be inserted into the table.

 -Jeff




Re: Improving ingest performance [SEC=UNCLASSIFIED]

2013-07-24 Thread William Slacum
There can also be significant overhead in starting a MR job if you're using
`-libjars` for distributing your dependencies. This effect is more
pronounced as the number of nodes increases.  I would recommend looking
into the distributed cache (there's a quick description at
http://developer.yahoo.com/hadoop/tutorial/module5.html; googling some more
will probably get you details on the subject). This is especially helpful
if you plan on running the same job repeatedly without changing the
dependencies often.
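
For example, something along these lines on Hadoop 1.x; the jar paths are
placeholders and the rest of the job setup is omitted:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;

    public class CachedJarsJob {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "ingest");

        // Jars are copied to HDFS once and referenced from the distributed
        // cache on every run, instead of re-shipping them with -libjars.
        DistributedCache.addFileToClassPath(new Path("/libs/accumulo-core.jar"),
            job.getConfiguration());
        DistributedCache.addFileToClassPath(new Path("/libs/my-ingest-deps.jar"),
            job.getConfiguration());

        // ... set mapper, input/output formats, etc., then submit
        job.waitForCompletion(true);
      }
    }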


On Wed, Jul 24, 2013 at 10:35 AM, Jeremy Kepner kep...@ll.mit.edu wrote:

 (5,000,000,000 records) x (~10 entries/record) /
 ((12 nodes) x (70 minutes) x (60 seconds/minute))

 = ~100,000 entries/sec/node

 This is consistent with other published results

 On Wed, Jul 24, 2013 at 02:26:18AM -0400, Dickson, Matt MR wrote:
 UNCLASSIFIED
 
 Hi,
 
 I'm trying to improve ingest performance on a 12 node test cluster.
 Currently I'm loading 5 billion records in approximately 70 minutes
 which
 seems excessive.  Monitoring the job there are 2600 map jobs (there
 is no
 reduce stage, just the mapper) with 288 running at any one time.  The
 performance seems slowest in the early stages of the job prior to to
 min
 or maj compactions occuring.  Each server has 48 GB memory and
 currently
 the accumulo settings are based on the 3GB settings in the example
 config
 directory, ie tserver.memory.maps.max = 1GB,
 tserver.cache.index.site=50M
 and tserver.cache.index.site=512M.  All other settings on the table
 are
 default.
 
 Questions.
 
 1. What is Accumulo doing in the initial stage of a load and which
 configurations should I focus on to improve this?
 2. At what ingest rate should I consider using the bulk ingest process
 with rfiles?
 
 Thanks
 Matt
 
 IMPORTANT: This email remains the property of the Department of
 Defence
 and is subject to the jurisdiction of section 70 of the Crimes Act
 1914.
 If you have received this email in error, you are requested to
 contact the
 sender and delete the email.



Re: Accumulo / HBase migration

2013-07-09 Thread William Slacum
We could also just add a transformation from HFileReader ->
LocalityGroupReader, since I think HBase's storage model (forgive me if
there's a better term) maps pretty well to that.


On Tue, Jul 9, 2013 at 2:20 PM, dlmar...@comcast.net wrote:

 I believe that Brian Loss committed code in 1.5 for a column visibility
 correction iterator or something that you could use to do this. You could
 use that and compact the table after the import.

 --
 *From: *Donald Miner dmi...@clearedgeit.com
 *To: *user@accumulo.apache.org
 *Sent: *Tuesday, July 9, 2013 1:36:20 PM
 *Subject: *Re: Accumulo / HBase migration


 I did think about this. My naive answer is just by default ignore
 visibilities (meaning make everything public or make everything the same
 visibility). It would be interesting however to be able to insert a chunk
 of code that inferred the visibility from the record itself. That is, you'd
 have a function you can pass in that returns a ColumnVisibility and takes
 in a value/rowkey/etc.


 On Tue, Jul 9, 2013 at 1:31 PM, Kurt Christensen hoo...@hoodel.comwrote:


 I don't have a response to your question, but it seems to me that the big
 capability difference is visibility field. When doing bulk translations
 like this, do you just fill visibility with some default value?

 -- Kurt


 On 7/9/13 1:26 PM, Donald Miner wrote:

  Has anyone developed tools to migrate data from an existing HBase
 implementation to Accumulo? My team has done it manually in the past but
 it seems like it would be reasonable to write a process that handled the
 steps in a more automated fashion.

 Here are a few sample designs I've kicked around:

 HBase -> mapreduce -> mappers bulk write to accumulo -> Accumulo
 or
 HBase -> mapreduce -> tfiles via AccumuloFileOutputFormat -> Accumulo
 bulk load -> Accumulo
 or
 HBase -> bulk export -> map-only mapreduce to translate hfiles into
 tfiles (how hard would this be??) -> Accumulo bulk load -> Accumulo

 I guess this could be extended to go the other way around (and also
 include Cassandra perhaps).

 Maybe we'll start working on this soon. I just wanted to kick the idea
 out there to see if it's been done before or if anyone has some gut
 reactions to the process.

 -Don

 This communication is the property of ClearEdge IT Solutions, LLC and
 may contain confidential and/or privileged information. Any review,
 retransmissions, dissemination or other use of or taking of any action in
 reliance upon this information by persons or entities other than the
 intended recipient is prohibited. If you receive this communication in
 error, please immediately notify the sender and destroy all copies of the
 communication and any attachments.


 --

 Kurt Christensen
 P.O. Box 811
 Westminster, MD 21158-0811

 --**--**
 
 I'm not really a trouble maker. I just play one on TV.




 --
   *
 *Donald Miner
 Chief Technology Officer
 ClearEdge IT Solutions, LLC
 Cell: 443 799 7807
 www.clearedgeit.com

 This communication is the property of ClearEdge IT Solutions, LLC and may
 contain confidential and/or privileged information. Any review,
 retransmissions, dissemination or other use of or taking of any action in
 reliance upon this information by persons or entities other than the
 intended recipient is prohibited. If you receive this communication in
 error, please immediately notify the sender and destroy all copies of the
 communication and any attachments.



Re: Preferred method for a client to obtain a connector reference

2013-05-30 Thread William Slacum
There's an almost identical method that, instead of a CharSequence or
byte[], takes an AuthenticationToken object. If you're using user/password,
use a PasswordToken (I think that's the name of the object).
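
A minimal sketch against the 1.5/1.6-style client API; the instance name,
zookeeper quorum, and credentials are placeholders:

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Instance;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;

    public class GetConnector {
      public static void main(String[] args) throws Exception {
        // Locate the instance through zookeeper, then authenticate with a PasswordToken.
        Instance inst = new ZooKeeperInstance("myInstance", "zkhost:2181");
        Connector conn = inst.getConnector("username", new PasswordToken("password"));
        System.out.println(conn.whoami());
      }
    }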


On Thu, May 30, 2013 at 4:00 PM, Newman, Elise
enew...@integrity-apps.comwrote:

  Okay, I was just wondering if there was another preferred way, since
 Instance.getConnector is marked as deprecated.


 Thanks!


 From: Adam Fuchs [mailto:afu...@apache.org]
 Sent: Thursday, May 30, 2013 12:57 PM
 To: user@accumulo.apache.org
 Subject: Re: Preferred method for a client to obtain a connector
 reference


 Elise,

 You'll want to use instance.getConnector(...), where instance is probably
 a ZookeeperInstance.

 Cheers,
 Adam

 On May 30, 2013 3:20 PM, Newman, Elise enew...@integrity-apps.com
 wrote:

 Hello!

  

 Stupid question: What is the preferred way for a client to get a connector
 reference? The SimpleClient example uses
 org.apache.accumulo.core.client.Instance.getConnector, but that method
 appears to be deprecated (I’m using a snapshot of Accumulo 1.6).

  

 Thanks!

 Elise

  



Re: Wikisearch Performance Question

2013-05-21 Thread William Slacum
According to https://issues.apache.org/jira/browse/HADOOP-7823 , it should
possible to split bzip2 files in Hadoop 1.1.


On Tue, May 21, 2013 at 3:54 PM, Eric Newton eric.new...@gmail.com wrote:

 The files decompress remarkably fast, too. I seem to recall about 8
 minutes on our hardware.

 I could not get map/reduce to split on blocks in bzip'd files.

 That gave me a long tail since the English file is so much bigger.

 Uncompressing the files is the way to go.

 -Eric


 On Tue, May 21, 2013 at 2:58 PM, Josh Elser josh.el...@gmail.com wrote:

 You should see much better ingest performance having decompressed input.
 Hadoop will also 'naturally' handle the splits for you based on the HDFS
 block size.


 On 5/21/13 2:35 PM, Patrick Lynch wrote:

 I think your description is accurate, except that I split the single
 archive into a much greater number of pieces than the number of
 different archives I ingested. Specifically, I set numGroups to a higher
 number; I didn't split the archive by hand in HDFS. The archives are
 bzip2-ed, not gzip-ed. Will decompressing still have the same benefit?


 -Original Message-
 From: Josh Elser josh.el...@gmail.com
 To: user user@accumulo.apache.org
 Sent: Tue, May 21, 2013 2:16 pm
 Subject: Re: Wikisearch Performance Question

 Let me see if I understand what you're asking: you took one mediawiki
 archive and split it into n archives of size 1/n the original. You then
 took many n _different_ mediawiki archives and ingested those. You tried
 to get the speed of ingesting many different archives be as fast as
 splitting an original single archive?

 Are you using gzip'ed input files? Have you tried just decompressing the
 gzip into plaintext? Hadoop will naturally split uncompressed text and
 and give you nice balancing.

 I haven't looked at the ingest code in a long time. Not sure if it ever
 received much love.

 On 5/21/13 1:30 PM, Patrick Lynch wrote:

 user@accumulo,

 I was working with the Wikipedia Accumulo ingest examples, and I was
 trying to get the ingest of a single archive file to be as fast as
 ingesting multiple archives through parallelization. I increased the
 number of ways the job split the single archive so that all the servers
 could work on ingesting at the same time. What I noticed, however, was
 that having all the servers work on ingesting the same file was still
 not nearly as fast as using multiple ingest files. I was wondering if I
 could have some insight into the design of the Wikipedia ingest that
 could explain this phenomenon.

 Thank you for your time,
 Patrick Lynch






Iterators returning keys out of scan range

2013-05-01 Thread William Slacum
I was always under the impression there was a check, presumably on the
client side, that would end a scan session if a key was returned that was
not in the original scan range.

Say I scanned my table for the range [A, B], but I had an iterator that
returned only keys beginning with "C". I would expect that I wouldn't see
any data, and I'm reasonably certain that in some 1.3 variants this was the
case. However, I was able to drum up a test case that disproves this. A
similar test can be found here http://pastebin.com/g109eACC. It will
require some import magic to get running, but the gist is pretty simple. I
am running against Accumulo 1.4.2.

I'm hitting up the user list because I'd like to confirm:

1) Is it expected behavior that a scan should terminate once it receives a
key outside of its scan range?

2) If (1) is true, when did this change?

I'm actually incredibly glad it works the way it does for my needs, however
I believe we should document that doing this has several pitfalls and
possible remedies for those pitfalls.


Re: Iterators returning keys out of scan range

2013-05-01 Thread William Slacum
Sorry guys, I forgot add some methods to the iterator to make it work.

http://pastebin.com/pXR5veP6


On Wed, May 1, 2013 at 8:01 PM, William Slacum 
wilhelm.von.cl...@accumulo.net wrote:

 I was always under the impression there was a check, presumably on the
 client side, that would end a scan session if a key was returned that was
 not in the original scan range.

 Say I scanned my table for the range [A, B], but I had an iterator
 that returned only keys beginning with "C". I would expect that I wouldn't
 see any data, and I'm reasonably certain that in some 1.3 variants this was
 the case. However, I was able to drum up a test case that disproves this. A
 similar test can be found here http://pastebin.com/g109eACC. It will
 require some import magic to get running, but the gist is pretty simple. I
 am running against Accumulo 1.4.2.

 I'm hitting up the user list because I'd like to confirm:

 1) Is it expected behavior that a scan should terminate once it receives a
 key outside of its scan range?

 2) If (1) is true, when did this change?

 I'm actually incredibly glad it works the way it does for my needs,
 however I believe we should document that doing this has several pitfalls
 and possible remedies for those pitfalls.



Re: Iterator name already in use with AccumuloInputFormat?

2013-04-11 Thread William Slacum
And it uses the `IteratorSetting(int priority, Class<?> iterator)`
constructor, so the name of the iterator is derived from the class itself.
Naming your iterator explicitly should be a short-term fix. I created
ACCUMULO-1267 to make a smarter input format.

On Thu, Apr 11, 2013 at 2:22 PM, William Slacum 
wilhelm.von.cl...@accumulo.net wrote:

 I've noticed the InputFormat sticks one in the stack at some arbitrary
 height, such as 50.


 On Thu, Apr 11, 2013 at 2:10 PM, Chris Sigman cypri...@gmail.com wrote:

 I've just run a job for the second time where I've called addIterator
 with a RegExFilter, and it's saying that the filter name is already in use.
  When I try to scan using the shell though, the iterator's not there...
 what's going on?

 Thanks,
 --
 Chris





Re: [VOTE] accumulo-1.4.3 RC2

2013-03-18 Thread William Slacum
The build hangs in cloudtrace for me on Mac OS 10.7.5, oddly enough on a
TSocket creation. I thought it was due to me having Thrift 0.9 installed,
but I can't see it getting picked up when I try to build via `mvn -X...`,
only thrift-0.6.1. Anyone else run into the same thing?

I'm not too worried about it since I have witnessed the build succeeding on
OSX and was able to build it under Linux Mint in a VM.

On Sun, Mar 17, 2013 at 4:04 PM, Josh Elser josh.el...@gmail.com wrote:

 Good point, Keith. I reverted my change to CHANGES that I made last night
 in light of that.


 On Sunday, March 17, 2013, Keith Turner wrote:

 On Sun, Mar 17, 2013 at 12:24 AM, Josh Elser josh.el...@gmail.com
 wrote:
  Keith, I remember something similar, but I still have no idea why that
  should make any difference...
 
  Overall +1 for 1.4.3 RC2 as a 1.4.3 release
 
  Good
  * Checksums good (dist and src)
  * Sigs good (dist and src)
  * Tag builds and deploys with example conf
  * Ran functional tests and continuous ingest
  * Javadoc and source jars built in dist
  * License header check work on Linux
 
  Not good/could be better
  * Cross-ref CHANGES to Jira: CHANGES has no reference to ACCUMULO-1170
  (fixed in 1.4.3)

 I do not think its a big deal not having this bug in the 1.4.3 release
 notes.  The bug does not occur in any released version.   I made it a
 subtask of ACCUMULO-1062, its a bugfix to that 1.4.3 bugfix.

  * RAT check is still busted in OSX
  * One bundled presentation needs fixing (same as previously mentioned)
  * Noticed we don't build developer manual with releases, only user
 manual
  * Said developer manual still says version1.3
 
 
  On 03/13/2013 03:13 PM, Keith Turner wrote:
 
  On Wed, Mar 13, 2013 at 3:07 PM, John Vines vi...@apache.org wrote:
 
  My rat check did not have those 2 odp files.
 
  It seems like there was a difference when running the check on Mac vs
  Linux.  But maybe not, it's been a while.
 
 
  On Wed, Mar 13, 2013 at 2:37 PM, Josh Elser josh.el...@gmail.com
 wrote:
 
  I'm failing the rat:check, finding 55 files without licenses instead
 of
  53. Eric (anyone who doesn't fail the rat:check), can you
  cross-reference,
  please? [1]
 
  Also, test/system/continuous/ScaleTest.odp needs to be updated.
 
  Still poking around...
 
  [1]
!? CHANGES
!? conf/examples/1GB/native-standalone/gc
!? conf/examples/1GB/native-standalone/masters
!? conf/examples/1GB/native-standalone/monitor
!? conf/examples/1GB/native-standalone/slaves
!? conf/examples/1GB/native-standalone/tracers
!? conf/examples/1GB/standalone/gc
!? conf/examples/1GB/standalone/masters
!? conf/examples/1GB/standalone/monitor
!? conf/examples/1GB/standalone/slaves
!? conf/examples/1GB/standalone/tracers
!? conf/examples/2GB/native-standalone/gc
!? conf/examples/2GB/native-standalone/masters
!? conf/examples/2GB/native-standalone/monitor
!? conf/examples/2GB/native-standalone/slaves
!? conf/examples/2GB/native-standalone/tracers
!? conf/examples/2GB/standalone/gc
!? conf/examples/2GB/standalone/masters
!? conf/examples/2GB/standalone/monitor
!? conf/examples/2GB/standalone/slaves
!? conf/examples/2GB/standalone/tracers
!? conf/examples/3GB/native-standalone/gc
!? conf/examples/3GB/native-standalone/masters
!? conf/examples/3GB/native-standalone/monitor
!? conf/examples/3GB/native-standalone/slaves
!? conf/examples/3GB/native-standalone/tracers
!? conf/examples/3GB/standalone/gc
!? conf/examples/3GB/standalone/masters
!? conf/examples/3GB/standalone/monitor
!? conf/examples/3GB/standalone/slaves
!? conf/examples/3GB/standalone/tracers
!? conf/examples/512MB/native-standalone/gc
!? conf/examples/512MB/native-standalone/masters
!? conf/examples/512MB/native-standalone/monitor
!? conf/examples/512MB/native-standalone/slaves
!? conf/examples/512MB/native-standalone/tracers
!? conf/examples/512MB/standalone/gc
!? conf/examples/512MB/standalone/masters
!? conf/examples/512MB/standalone/monitor
!? conf/examples/512MB/standalone/slaves
!? conf/examples/512MB/standalone/tracers
!? docs/src/developer_manual/component_docs.odp
!? src/packages/deb/accumulo/conffile
!? test/system/auto/simple/_




Re: [VOTE] accumulo-1.4.3 RC2

2013-03-14 Thread William Slacum
As an aside, do we keep track of the ingest and query rates with each
release? I know Josh had a bit of a side project to do it nightly, but it'd
be interesting to check whether, as the project grows, we're making
noticeable trade-offs in performance.

On Thu, Mar 14, 2013 at 10:36 AM, Eric Newton eric.new...@gmail.com wrote:

 * all of the integration tests ran
 * overnight continuous ingest w/agitation verified (18B entries)


 On Wed, Mar 13, 2013 at 5:58 PM, Keith Turner ke...@deenlo.com wrote:

 +1

  * sigs and hashes are ok
  * src tarball eq tag
  * documentations looks good
  * native libs are ok
  * spot checked a few bug fixes that are supposed to be in 1.4.3
  * CHANGES has 1.4.3 tickets
  * was able to run instamo against staged repo and verify
 ACCUMULO-907... the repo link in the email was wrong, but the text for
 the link was correct

 On Wed, Mar 13, 2013 at 2:02 PM, Eric Newton eric.new...@gmail.com
 wrote:
  Please vote on releasing the following candidate as Apache Accumulo
 version
  1.4.3.
 
  In this release candidate:
 
  * fix for ACCUMULO-1170
  * javadocs in release artifact
  * fix for ACCUMULO-1173
 
  The src tar ball was generated by exporting:
https://svn.apache.org/repos/asf/accumulo/tags/1.4.3rc2
 
  To build the dist tar ball from the source, run the following command:
   src/assemble/build.sh
 
  Tarballs, checksums, signatures:
   http://people.apache.org/~ecn/1.4.3rc2
 
  Maven Staged Repository:
 
 https://repository.apache.org/content/repositories/orgapacheaccumulo-006
 
  Keys:
   http://www.apache.org/dist/accumulo/KEYS
 
  Changes:
https://svn.apache.org/repos/asf/accumulo/tags/1.4.3rc2/CHANGES
 
  The vote will be held open for the next 72 hours.
 
 





Re: Mappers for Accumulo

2013-03-11 Thread William Slacum
So you want both auto-adjusting and non-auto-adjusting behavior depending on
the size of a range? I suppose you could lift the code for doing the adjusting,
and do some introspection on the ranges (such as how many tablets do I have
in this range?) and apply it as necessary.
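
A sketch of that idea for the example below, splitting the logical range 2 to
5 into one Range per integer prefix. The static method signatures vary between
client versions, so treat this as an outline rather than the exact API:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
    import org.apache.accumulo.core.data.Range;
    import org.apache.hadoop.mapreduce.Job;

    public class PerPrefixRanges {
      // One Range per integer prefix, so each prefix can become its own mapper.
      // The rest of the job setup (connector info, table, mapper) is omitted.
      public static void configure(Job job) {
        List<Range> ranges = new ArrayList<Range>();
        for (int i = 2; i <= 5; i++) {
          // Rows look like "2_hello"; each range covers exactly one "<i>_" prefix.
          ranges.add(new Range(i + "_", i + "_\uffff"));
        }
        AccumuloInputFormat.setRanges(job.getConfiguration(), ranges);
        // Keep one split per supplied range rather than one per tablet per range.
        AccumuloInputFormat.disableAutoAdjustRanges(job.getConfiguration());
      }
    }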

On Mon, Mar 11, 2013 at 4:47 PM, Aji Janis aji1...@gmail.com wrote:

 So looks like doing a ListRange is what I need so that I can have a
 mapper per range. However, a more interesting scenario is one when given a
 big range I want to split it into multiple ranges. In other words if my
 rowid was 1_hello, 2_hello,  9_hello, 10_hello. And the range given was
 2 to 5. But i want one mapper per integer so 4 mappers in this case... any
 ideas on how I can accomplish that?


 Thanks all for suggestions.


 On Fri, Mar 8, 2013 at 7:02 PM, Keith Turner ke...@deenlo.com wrote:

 On Fri, Mar 8, 2013 at 4:17 PM, Aji Janis aji1...@gmail.com wrote:
  Thank you. Follow up question.
 
  Would this enforce one mapper per range even if all the data (from three
  ranges) is on one node/tablet?

 Look at disableAutoAdjustRanges(). This determines wether it creates a
 mapper per tablet per range OR per range.


 
 
 
  On Fri, Mar 8, 2013 at 1:17 PM, Mike Hugo m...@piragua.com wrote:
 
  See AccumuloInputFormat
 
  ArrayList<Range> ranges = new ArrayList<Range>();
  // populate array list of row ranges ...
  AccumuloInputFormat.setRanges(job, ranges);
 
 
  You should get one mapper per range.
 
 
 
 
  On Fri, Mar 8, 2013 at 12:11 PM, Aji Janis aji1...@gmail.com wrote:
 
  Hello,
 
   I am trying to figure out how I can configure the number of mappers (if
  it's even possible) based on an Accumulo row range. My Accumulo rowid uses
  the format:
 
  abc/1
  abc/2
  ...
  def/3
  
  xyz/13...
 
  If I want to specify three ranges: [abc/1 to abc/3], [def/1 to def/5],
  [jkl/13 to klm/15], and have one mapper work on one range, is there a way I
  can do this?? How do I even set up my mapreduce job to accept these
  ranges??? Thank you for all feedback.
 
 
 
 





Re: Running Helloworld from different host

2012-12-21 Thread William Slacum
On your accumulo master, what do you you in your conf/slaves file?

On Fri, Dec 21, 2012 at 9:43 AM, Kevin Pauli ke...@thepaulis.com wrote:

 Hi, I'm trying to get my first Accumulo environment setup to evaluate it.
  I've got it running within a CentOS VM, and I've setup the helloworld
 data.

 My CentOS guest IP is 192.168.254.130.  The helloworld examples work fine
 from an ssh attached to the guest OS.

 I'm now trying to access the helloworld data from my host OS, via the
 org.apache.accumulo.examples.simple.helloworld.ReadData program.  It gets
 this far:

 [11:04:41.868] INFO: Initiating client connection, connectString=
 192.168.254.130:2181 sessionTimeout=3
 watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@3ca55242

 [11:04:46.432] INFO: Opening socket connection to server
 192.168.254.130/192.168.254.130:2181. Will not attempt to authenticate
 using SASL (unknown error)
 [11:04:46.439] INFO: Socket connection established to
 192.168.254.130/192.168.254.130:2181, initiating session
 [11:04:46.459] INFO: Session establishment complete on server
 192.168.254.130/192.168.254.130:2181, sessionid = 0x13bba6809700691,
 negotiated timeout = 3
 WARN [main] (ServerClient.java:156) - Failed to find an available server
 in the list of servers: [127.0.0.1:9997:9997 (12)]

 I'm wondering where that value 127.0.0.1:9997:9997 is coming from.
  Is that Zookeeper trying to redirect me to the Accumulo server?  Do I want
 it to be 192.168.254.130:9997 instead, since from my host OS's
 perspective, 127.0.0.1 is the host OS localhost, not the guest OS?  If so,
 how do I configure accumulo's registration within Zookeeper?

 I have opened up port 9997 in the CentOS guest.

 Apologies as I am new to Zookeeper as well as Accumulo.

 --
 Regards,
 Kevin Pauli


Re: How to store numerics or dates as values in Accumulo?

2012-12-21 Thread William Slacum
Rya is a triple store backed by Accumulo:
http://www.deepdyve.com/lp/association-for-computing-machinery/rya-a-scalable-rdf-triple-store-for-the-clouds-7Xh905FY0y

On Fri, Dec 21, 2012 at 2:01 PM, Keith Turner ke...@deenlo.com wrote:

 Take a look at  the Typo Lexicoders.   A Lexicoder serializes data
 such that the serialized form sort correctly lexicographically.   Typo
 has Long, ULong, Double,  BigInteger Lexicoders.


 https://github.com/keith-turner/typo/tree/master/src/main/java/org/apache/accumulo/typo/encoders

 Keith
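
 To illustrate the basic idea behind a Lexicoder, here is a sketch that encodes
 a non-negative long as fixed-width big-endian bytes so that byte-wise
 (lexicographic) order matches numeric order; the Typo lexicoders additionally
 handle negatives, doubles, and BigIntegers:

    public class ULongEncoding {
      // 8 big-endian bytes: unsigned byte comparison of the encodings then
      // matches numeric order, which is the property a Lexicoder provides.
      public static byte[] encode(long v) {
        byte[] b = new byte[8];
        for (int i = 7; i >= 0; i--) {
          b[i] = (byte) (v & 0xff);
          v >>>= 8;
        }
        return b;
      }

      public static long decode(byte[] b) {
        long v = 0;
        for (int i = 0; i < 8; i++) {
          v = (v << 8) | (b[i] & 0xffL);
        }
        return v;
      }
    }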

 On Fri, Dec 21, 2012 at 4:23 PM, Kevin Pauli ke...@thepaulis.com wrote:
  What is the recommended way of storing numeric data in Accumulo?  It
 looks
  like Mutation.put takes only a CharSequence or a Value, and a Value can
 only
  take a byte[].
 
  --
  Regards,
  Kevin Pauli



Re: Satisfying Zookeper dependency when installing Accumulo in CentOS

2012-12-19 Thread William Slacum
Did you set ZOOKEEPER_HOME in the accumulo-env.sh script or your
environment?

On Wed, Dec 19, 2012 at 2:03 PM, Kevin Pauli ke...@thepaulis.com wrote:

 I'm trying to install Accumulo in CentOS.  I have installed the jdk and
 hadoop, but can't seem to make Accumulo install happy wrt zookeeper.

 I installed Zookeper according to the instructions here:
 http://zookeeper.apache.org/doc/r3.4.5/zookeeperStarted.html#sc_InstallingSingleMode

 And Zookeeper is running:

 $ sudo bin/zkServer.sh start
 JMX enabled by default
 Using config: /usr/lib/zookeeper-3.4.5/bin/../conf/zoo.cfg
 Starting zookeeper ... STARTED

 But when trying to install Accumulo, this is what I get:

 $ sudo rpm -ivh Downloads/accumulo-1.4.2-1.amd64.rpm
 error: Failed dependencies:
 zookeeper is needed by accumulo-1.4.2-1.amd64

 --
 Regards,
 Kevin Pauli



Re: Satisfying Zookeper dependency when installing Accumulo in CentOS

2012-12-19 Thread William Slacum
Nvm you're a step behind where I thought you were at. Turns out I'm of no
help :)

On Wed, Dec 19, 2012 at 2:06 PM, William Slacum 
wilhelm.von.cl...@accumulo.net wrote:

 Did you set ZOOKEEPER_HOME in the accumulo-env.sh script or your
 environment?


 On Wed, Dec 19, 2012 at 2:03 PM, Kevin Pauli ke...@thepaulis.com wrote:

 I'm trying to install Accumulo in CentOS.  I have installed the jdk and
 hadoop, but can't seem to make Accumulo install happy wrt zookeeper.

 I installed Zookeper according to the instructions here:
 http://zookeeper.apache.org/doc/r3.4.5/zookeeperStarted.html#sc_InstallingSingleMode

 And Zookeeper is running:

 $ sudo bin/zkServer.sh start
 JMX enabled by default
 Using config: /usr/lib/zookeeper-3.4.5/bin/../conf/zoo.cfg
 Starting zookeeper ... STARTED

 But when trying to install Accumulo, this is what I get:

 $ sudo rpm -ivh Downloads/accumulo-1.4.2-1.amd64.rpm
 error: Failed dependencies:
 zookeeper is needed by accumulo-1.4.2-1.amd64

 --
 Regards,
 Kevin Pauli





Re: Reduce task failing on job with error java.lang.IllegalStateException: Keys appended out-of-order

2012-12-06 Thread William Slacum
Column names with unpadded numbers don't sort in numeric order
lexicographically ('col16' sorts before 'col3'). You'll either need to encode
your numerics or zero pad them.
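
For example, zero-padded formatting of the column names restores the intended
order (the width of 3 is arbitrary):

    public class PaddedColumns {
      public static void main(String[] args) {
        // "col3" vs "col16" sort unexpectedly as strings; padding fixes that.
        for (int i : new int[] {3, 16}) {
          System.out.println(String.format("col%03d", i));  // col003, col016
        }
      }
    }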

On Thu, Dec 6, 2012 at 9:03 AM, Andrew Catterall 
catteralland...@googlemail.com wrote:

 Hi,


 I am trying to run a bulk ingest to import data into Accumulo but it is
 failing at the reduce task with the below error:



 java.lang.IllegalStateException: Keys appended out-of-order.  New key
 client@20121206123059@0014efca-d8e8-492e-83cb-e5b6b7c49f7a foo:col3 [myVis]
 9223372036854775807 false, previous key 
 client@20121206123059@0014efca-d8e8-492e-83cb-e5b6b7c49f7a
 foo:col16 [myVis] 9223372036854775807 false

 at
 org.apache.accumulo.core.file.rfile.RFile$Writer.append(RFile.java:378)



 Could this be caused by the order in which the writes are being done?


 -- Background


 The input file is a tab separated file.  A sample row would look like:

 Data1  Data2  Data3  Data4  Data5  …  DataN



 The map parses the data, for each row, into a Map<String, String>.  This
 will contain the following:

 Col1   Data1

 Col2   Data2

 Col3   Data3

 …

 ColN  DataN


 An outputKey is then generated for this row in the format
 client@timeStamp@randomUUID

 Then, for each entry in the Map<String, String>, an outputValue is generated
 in the format ColN|DataN

 The outputKey and outputValue are written to Context.



 This completes successfully, however, the reduce task fails.


 My ReduceClass is as follows:



    public static class ReduceClass extends Reducer<Text,Text,Key,Value> {

      public void reduce(Text key, Iterable<Text> keyValues, Context output)
          throws IOException, InterruptedException {

        // for each value belonging to the key
        for (Text keyValue : keyValues) {

          // split the keyValue into Col and Data
          String[] values = keyValue.toString().split("\\|");

          // Generate key
          Key outputKey = new Key(key, new Text("foo"), new Text(values[0]), new Text("myVis"));

          // Generate value
          Value outputValue = new Value(values[1].getBytes(), 0, values[1].length());

          // Write to context
          output.write(outputKey, outputValue);
        }
      }
    }




 -- Expected output



 I am expecting the contents of the Accumulo table to be as follows:



 client@20121206123059@0014efca-d8e8-492e-83cb-e5b6b7c49f7a foo:Col1 [myVis]
 Data1

 client@20121206123059@0014efca-d8e8-492e-83cb-e5b6b7c49f7a foo:Col2 [myVis]
 Data2

 client@20121206123059@0014efca-d8e8-492e-83cb-e5b6b7c49f7a foo:Col3 [myVis]
 Data3

 client@20121206123059@0014efca-d8e8-492e-83cb-e5b6b7c49f7a foo:Col4 [myVis]
 Data4

 client@20121206123059@0014efca-d8e8-492e-83cb-e5b6b7c49f7a foo:Col5 [myVis]
 Data5

 …

 client@20121206123059@0014efca-d8e8-492e-83cb-e5b6b7c49f7a foo:ColN [myVis]
 DataN





 Thanks,

 Andrew



Re: Performance of table with large number of column families

2012-11-09 Thread William Slacum
That shouldn't be a huge issue. How many rows/partitions do you have? How
many do you have to scan to find the specific column family/doc id you want?

On Fri, Nov 9, 2012 at 11:26 AM, Anthony Fox adfaccu...@gmail.com wrote:

 I have a table set up to use the intersecting iterator pattern.  The
 table has about 20M records which leads to 20M column families for the
 data section - 1 unique column family per record.  The index section of
 the table is not quite as large as the data section.  The rowkey is a
 random padded integer partition between 000 and 999.  I turned
 bloom filters on and used the ColumnFamilyFunctor to get performant
 column family scans without specifying a range like in the bloom filter
 examples in the README.  However, my column family scans (without any
 custom iterator) are still fairly slow - ~30 seconds for a column family
 batch scan of one record. I've also tried RowFunctor but I see similar
 performance.  Can anyone shed any light on the performance metrics I'm
 seeing?

 Thanks,
 Anthony




Re: Performance of table with large number of column families

2012-11-09 Thread William Slacum
I guess assuming you have 10M possible partitions, if you're using a
relatively uniform hash to generate your IDs, you'll average about 2 per
partition. Do you have any index for term/value to partition? This will
help you narrow down your search space to a subset of your partitions.

On Fri, Nov 9, 2012 at 11:39 AM, William Slacum 
wilhelm.von.cl...@accumulo.net wrote:

 That shouldn't be a huge issue. How many rows/partitions do you have? How
 many do you have to scan to find the specific column family/doc id you want?


 On Fri, Nov 9, 2012 at 11:26 AM, Anthony Fox adfaccu...@gmail.com wrote:

 I have a table set up to use the intersecting iterator pattern.  The
 table has about 20M records which leads to 20M column families for the
 data section - 1 unique column family per record.  The index section of
 the table is not quite as large as the data section.  The rowkey is a
 random padded integer partition between 000 and 999.  I turned
 bloom filters on and used the ColumnFamilyFunctor to get performant
 column family scans without specifying a range like in the bloom filter
 examples in the README.  However, my column family scans (without any
 custom iterator) are still fairly slow - ~30 seconds for a column family
 batch scan of one record. I've also tried RowFunctor but I see similar
 performance.  Can anyone shed any light on the performance metrics I'm
 seeing?

 Thanks,
 Anthony





Re: Performance of table with large number of column families

2012-11-09 Thread William Slacum
I'm more inclined to believe it's because you have to search across 10M
different rows to find any given column family, since they're randomly, and
possibly uniformly, distributed. How many tablets are you searching across?

On Fri, Nov 9, 2012 at 11:45 AM, Anthony Fox adfaccu...@gmail.com wrote:

 Yes, there are 10M possible partitions.  I do not have a hash from value
 to partition, the data is essentially randomly balanced across all the
 tablets.  Unlike the bloom filter and intersecting iterator examples, I do
 not have locality groups turned on and I have data in the cq and the value
 for both index entries and record entries.  Could this be the issue?  Each
 record entry has approximately 30 column qualifiers with data in the value
 for each.


 On Fri, Nov 9, 2012 at 11:41 AM, William Slacum 
 wilhelm.von.cl...@accumulo.net wrote:

 I guess assuming you have 10M possible partitions, if you're using a
 relatively uniform hash to generate your IDs, you'll average about 2 per
 partition. Do you have any index for term/value to partition? This will
 help you narrow down your search space to a subset of your partitions.


 On Fri, Nov 9, 2012 at 11:39 AM, William Slacum 
 wilhelm.von.cl...@accumulo.net wrote:

 That shouldn't be a huge issue. How many rows/partitions do you have?
 How many do you have to scan to find the specific column family/doc id you
 want?


 On Fri, Nov 9, 2012 at 11:26 AM, Anthony Fox adfaccu...@gmail.comwrote:

 I have a table set up to use the intersecting iterator pattern.  The
 table has about 20M records which leads to 20M column families for the
 data section - 1 unique column family per record.  The index section of
 the table is not quite as large as the data section.  The rowkey is a
 random padded integer partition between 000 and 999.  I turned
 bloom filters on and used the ColumnFamilyFunctor to get performant
 column family scans without specifying a range like in the bloom filter
 examples in the README.  However, my column family scans (without any
 custom iterator) are still fairly slow - ~30 seconds for a column family
 batch scan of one record. I've also tried RowFunctor but I see similar
 performance.  Can anyone shed any light on the performance metrics I'm
 seeing?

 Thanks,
 Anthony







Re: Performance of table with large number of column families

2012-11-09 Thread William Slacum
So that means you have roughly 312.5k rows per tablet, which means about
625k column families in any given tablet. The intersecting iterator will
work at a row per time, so I think at any given moment, it will be working
through 32 at a time and doing a linear scan through the RFile blocks. With
RFile indices, that check is usually pretty fast, but you're having to go
through 4 orders of magnitude more data sequentially than you can work on.
If you can experiment and re-ingest with a smaller number of tablets,
anywhere between 15 and 45, I think you will see better performance.

On Fri, Nov 9, 2012 at 11:53 AM, Anthony Fox adfaccu...@gmail.com wrote:

 Failed to answer the original question - 15 tablet servers, 32
 tablets/splits.


 On Fri, Nov 9, 2012 at 11:52 AM, Anthony Fox adfaccu...@gmail.com wrote:

 I've tried a number of different settings of table.split.threshold.  I
 started at 1G and bumped it down to 128M and the cf scan is still ~30
 seconds for both.  I've also used less rows - 0 to 9 and still see
 similar performance numbers.  I thought the column family bloom filter
 would help deal with large row space but sparsely populated column space.
  Is that correct?


 On Fri, Nov 9, 2012 at 11:49 AM, William Slacum 
 wilhelm.von.cl...@accumulo.net wrote:

 I'm more inclined to believe it's because you have to search across 10M
 different rows to find any given column family, since they're randomly, and
 possibly uniformly, distributed. How many tablets are you searching across?


 On Fri, Nov 9, 2012 at 11:45 AM, Anthony Fox adfaccu...@gmail.comwrote:

 Yes, there are 10M possible partitions.  I do not have a hash from
 value to partition, the data is essentially randomly balanced across all
 the tablets.  Unlike the bloom filter and intersecting iterator examples, I
 do not have locality groups turned on and I have data in the cq and the
 value for both index entries and record entries.  Could this be the issue?
  Each record entry has approximately 30 column qualifiers with data in the
 value for each.


 On Fri, Nov 9, 2012 at 11:41 AM, William Slacum 
 wilhelm.von.cl...@accumulo.net wrote:

 I guess assuming you have 10M possible partitions, if you're using a
 relatively uniform hash to generate your IDs, you'll average about 2 per
 partition. Do you have any index for term/value to partition? This will
 help you narrow down your search space to a subset of your partitions.


 On Fri, Nov 9, 2012 at 11:39 AM, William Slacum 
 wilhelm.von.cl...@accumulo.net wrote:

 That shouldn't be a huge issue. How many rows/partitions do you have?
 How many do you have to scan to find the specific column family/doc id 
 you
 want?


 On Fri, Nov 9, 2012 at 11:26 AM, Anthony Fox adfaccu...@gmail.comwrote:

 I have a table set up to use the intersecting iterator pattern.  The
 table has about 20M records which leads to 20M column families for the
 data section - 1 unique column family per record.  The index section of
 the table is not quite as large as the data section.  The rowkey is a
 random padded integer partition between 000 and 999.  I turned
 bloom filters on and used the ColumnFamilyFunctor to get performant
 column family scans without specifying a range like in the bloom filter
 examples in the README.  However, my column family scans (without any
 custom iterator) are still fairly slow - ~30 seconds for a column family
 batch scan of one record. I've also tried RowFunctor but I see similar
 performance.  Can anyone shed any light on the performance metrics I'm
 seeing?

 Thanks,
 Anthony










Re: Performance of table with large number of column families

2012-11-09 Thread William Slacum
When I said a smaller number of tablets, I really meant a smaller number of
rows :) My apologies.

So if you're searching for a random column family in a table, like with a
`scan -c cf` in the shell, it will start at row 0 and work sequentially
up to row 1000 until it finds the cf.

On Fri, Nov 9, 2012 at 12:11 PM, Anthony Fox adfaccu...@gmail.com wrote:

 This scan is without the intersecting iterator.  I'm just trying to pull
 back a single data record at the moment which corresponds to scanning for
 one column family.  I'll try with a smaller number of tablets, but is the
 computation effort the same for the scan I am doing?


 On Fri, Nov 9, 2012 at 12:02 PM, William Slacum 
 wilhelm.von.cl...@accumulo.net wrote:

 So that means you have roughly 312.5k rows per tablet, which means about
 725k column families in any given tablet. The intersecting iterator will
 work at a row per time, so I think at any given moment, it will be working
 through 32 at a time and doing a linear scan through the RFile blocks. With
 RFile indices, that check is usually pretty fast, but you're having go
 through 4 orders of magnitude more data sequentially than you can work on.
 If you can experiment and re-ingest with a smaller number of tablets,
 anywhere between 15 and 45, I think you will see better performance.

 On Fri, Nov 9, 2012 at 11:53 AM, Anthony Fox adfaccu...@gmail.comwrote:

 Failed to answer the original question - 15 tablet servers, 32
 tablets/splits.


 On Fri, Nov 9, 2012 at 11:52 AM, Anthony Fox adfaccu...@gmail.comwrote:

 I've tried a number of different settings of table.split.threshold.  I
 started at 1G and bumped it down to 128M and the cf scan is still ~30
 seconds for both.  I've also used less rows - 0 to 9 and still see
 similar performance numbers.  I thought the column family bloom filter
 would help deal with large row space but sparsely populated column space.
  Is that correct?


 On Fri, Nov 9, 2012 at 11:49 AM, William Slacum 
 wilhelm.von.cl...@accumulo.net wrote:

 I'm more inclined to believe it's because you have to search across
 10M different rows to find any given column family, since they're 
 randomly,
 and possibly uniformly, distributed. How many tablets are you searching
 across?


 On Fri, Nov 9, 2012 at 11:45 AM, Anthony Fox adfaccu...@gmail.comwrote:

 Yes, there are 10M possible partitions.  I do not have a hash from
 value to partition, the data is essentially randomly balanced across all
 the tablets.  Unlike the bloom filter and intersecting iterator 
 examples, I
 do not have locality groups turned on and I have data in the cq and the
 value for both index entries and record entries.  Could this be the 
 issue?
  Each record entry has approximately 30 column qualifiers with data in 
 the
 value for each.


 On Fri, Nov 9, 2012 at 11:41 AM, William Slacum 
 wilhelm.von.cl...@accumulo.net wrote:

 I guess assuming you have 10M possible partitions, if you're using a
 relatively uniform hash to generate your IDs, you'll average about 2 per
 partition. Do you have any index for term/value to partition? This will
 help you narrow down your search space to a subset of your partitions.


 On Fri, Nov 9, 2012 at 11:39 AM, William Slacum 
 wilhelm.von.cl...@accumulo.net wrote:

 That shouldn't be a huge issue. How many rows/partitions do you
 have? How many do you have to scan to find the specific column 
 family/doc
 id you want?


 On Fri, Nov 9, 2012 at 11:26 AM, Anthony Fox 
 adfaccu...@gmail.comwrote:

 I have a table set up to use the intersecting iterator pattern.  The
 table has about 20M records which leads to 20M column families for the
 data section - 1 unique column family per record.  The index section 
 of
 the table is not quite as large as the data section.  The rowkey is a
 random padded integer partition between 000 and 999.  I turned
 bloom filters on and used the ColumnFamilyFunctor to get performant
 column family scans without specifying a range like in the bloom 
 filter
 examples in the README.  However, my column family scans (without any
 custom iterator) are still fairly slow - ~30 seconds for a column 
 family
 batch scan of one record. I've also tried RowFunctor but I see similar
 performance.  Can anyone shed any light on the performance metrics I'm
 seeing?

 Thanks,
 Anthony












Re: Performance of table with large number of column families

2012-11-09 Thread William Slacum
I'll ask for someone to verify this comment for me (looking at you, John W
Vines), but the bloom filter helps when you have a discrete number of column
families that will appear across many rows.
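
For reference, enabling the column family bloom filter comes down to a couple of
table properties. A minimal sketch (the property names and the ColumnFamilyFunctor
class are as documented for Accumulo 1.4; the Connector and table name are assumed
to already exist):

import org.apache.accumulo.core.client.Connector;

public class EnableCfBloom {
    // Assumes "conn" is an existing Connector and "table" already exists.
    public static void enable(Connector conn, String table) throws Exception {
        // Turn bloom filters on for newly written RFiles...
        conn.tableOperations().setProperty(table, "table.bloom.enabled", "true");
        // ...and key the bloom filter on row + column family, so a cf lookup can
        // skip files that don't contain that family at all.
        conn.tableOperations().setProperty(table, "table.bloom.key.functor",
            "org.apache.accumulo.core.file.keyfunctor.ColumnFamilyFunctor");
    }
}

Note that bloom filters are built when RFiles are written, so files that existed
before the property was set won't have one until they are rewritten by a compaction.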

On Fri, Nov 9, 2012 at 12:18 PM, Anthony Fox adfaccu...@gmail.com wrote:

 Ah, ok, I was under the impression that this would be really fast since I
 have a column family bloom filter turned on.  Is this not correct?


 On Fri, Nov 9, 2012 at 12:15 PM, William Slacum 
 wilhelm.von.cl...@accumulo.net wrote:

 When I said a smaller number of tablets, I really meant a smaller number of rows :)
 My apologies.

 So if you're searching for a random column family in a table, like with a
 `scan -c cf` in the shell, it will start at row 0 and work sequentially
 up to row 1000 until it finds the cf.


 On Fri, Nov 9, 2012 at 12:11 PM, Anthony Fox adfaccu...@gmail.comwrote:

 This scan is without the intersecting iterator.  I'm just trying to pull
 back a single data record at the moment which corresponds to scanning for
 one column family.  I'll try with a smaller number of tablets, but is the
 computation effort the same for the scan I am doing?


 On Fri, Nov 9, 2012 at 12:02 PM, William Slacum 
 wilhelm.von.cl...@accumulo.net wrote:

 So that means you have roughly 312.5k rows per tablet, which means
 about 725k column families in any given tablet. The intersecting iterator
 will work on a row at a time, so I think at any given moment, it will be
 working through 32 at a time and doing a linear scan through the RFile
 blocks. With RFile indices, that check is usually pretty fast, but you're
 having to go through 4 orders of magnitude more data sequentially than you can
 work on. If you can experiment and re-ingest with a smaller number of
 tablets, anywhere between 15 and 45, I think you will see better
 performance.

 On Fri, Nov 9, 2012 at 11:53 AM, Anthony Fox adfaccu...@gmail.comwrote:

 Failed to answer the original question - 15 tablet servers, 32
 tablets/splits.


 On Fri, Nov 9, 2012 at 11:52 AM, Anthony Fox adfaccu...@gmail.comwrote:

 I've tried a number of different settings of table.split.threshold.
  I started at 1G and bumped it down to 128M and the cf scan is still ~30
 seconds for both.  I've also used less rows - 0 to 9 and still 
 see
 similar performance numbers.  I thought the column family bloom filter
 would help deal with large row space but sparsely populated column space.
  Is that correct?


 On Fri, Nov 9, 2012 at 11:49 AM, William Slacum 
 wilhelm.von.cl...@accumulo.net wrote:

 I'm more inclined to believe it's because you have to search across
 10M different rows to find any given column family, since they're 
 randomly,
 and possibly uniformly, distributed. How many tablets are you searching
 across?


 On Fri, Nov 9, 2012 at 11:45 AM, Anthony Fox 
 adfaccu...@gmail.comwrote:

 Yes, there are 10M possible partitions.  I do not have a hash from
 value to partition, the data is essentially randomly balanced across 
 all
 the tablets.  Unlike the bloom filter and intersecting iterator 
 examples, I
 do not have locality groups turned on and I have data in the cq and the
 value for both index entries and record entries.  Could this be the 
 issue?
  Each record entry has approximately 30 column qualifiers with data in 
 the
 value for each.


 On Fri, Nov 9, 2012 at 11:41 AM, William Slacum 
 wilhelm.von.cl...@accumulo.net wrote:

 I guess assuming you have 10M possible partitions, if you're using
 a relatively uniform hash to generate your IDs, you'll average about 
 2 per
 partition. Do you have any index for term/value to partition? This 
 will
 help you narrow down your search space to a subset of your partitions.


 On Fri, Nov 9, 2012 at 11:39 AM, William Slacum 
 wilhelm.von.cl...@accumulo.net wrote:

 That shouldn't be a huge issue. How many rows/partitions do you
 have? How many do you have to scan to find the specific column 
 family/doc
 id you want?


 On Fri, Nov 9, 2012 at 11:26 AM, Anthony Fox 
 adfaccu...@gmail.com wrote:

 I have a table set up to use the intersecting iterator pattern.  The
 table has about 20M records which leads to 20M column families for 
 the
 data section - 1 unique column family per record.  The index 
 section of
 the table is not quite as large as the data section.  The rowkey is 
 a
 random padded integer partition between 000 and 999.  I 
 turned
 bloom filters on and used the ColumnFamilyFunctor to get performant
 column family scans without specifying a range like in the bloom 
 filter
 examples in the README.  However, my column family scans (without 
 any
 custom iterator) are still fairly slow - ~30 seconds for a column 
 family
 batch scan of one record. I've also tried RowFunctor but I see 
 similar
 performance.  Can anyone shed any light on the performance metrics 
 I'm
 seeing?

 Thanks,
 Anthony














Re: thread safety of IndexedDocIterator

2012-11-05 Thread William Slacum
At one point, Keith had warned me against kicking off threads inside a scan
session. Is it possible we could have a discussion on the implications of
this?

On Mon, Nov 5, 2012 at 11:30 AM, Billie Rinaldi bil...@apache.org wrote:

 On Mon, Nov 5, 2012 at 11:24 AM, Sukant Hajra qn2b6c2...@snkmail.comwrote:

 We noticed that IndexedDocIterator.java has the following private static
 fields:

 private static Text indexColf = DEFAULT_INDEX_COLF;
 private static Text docColf = DEFAULT_DOC_COLF;

 The init method, which sets these, is synchronized.  Still, though, this
 synchronization doesn't seem enough to allow different runs of the iterator
 to use different values for indexColf and docColf.  One run will set the
 Colf variables one way atomically in the synchronized init method . . . and
 another run can immediately interleave in alternate Colf settings, which
 breaks the original iterator run.

 For now, we're not touching the indexColf and docColf, just leaving it as
 the
 defaults.

 We're not blocked by this.  We're just curious if there's a bug in this
 design.
 Also, if it's not a defect, we're interested in learning what system
 invariant
 of iterator execution makes this not a problem.


 Sounds like a bug.  Feel free to open a ticket!

 Billie




 Thanks,
 Sukant





Re: Accumulo Map Reduce is not distributed

2012-11-02 Thread William Slacum
What about the main method that calls ToolRunner.run? If you have 4 jobs
being created, then you're calling run(String[]) or runOneTable() 4 times.
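
For comparison, a driver main that launches the job exactly once would look roughly
like this (a sketch only; Accumulo_FE_MR_Job is the Tool implementation quoted
below, the Driver class name and everything else here are assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;

public class Driver {
    public static void main(String[] args) throws Exception {
        // A single ToolRunner.run call kicks off a single MR job. Calling this in
        // a loop (or calling runOneTable() repeatedly) would produce the behavior
        // described below: 4 jobs running one after the other.
        int exitCode = ToolRunner.run(new Configuration(), new Accumulo_FE_MR_Job(), args);
        System.exit(exitCode);
    }
}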

On Fri, Nov 2, 2012 at 5:21 PM, Cornish, Duane C.
duane.corn...@jhuapl.eduwrote:

 Thanks for the prompt response John!

 When I say that I’m pre-splitting my table, I mean I am using the
 tableOperations().addSplits(table,splits) command.  I have verified that
 this is correctly splitting my table into 4 tablets and it is being
 distributed across my cloud before I start my map reduce job.

 Now, I only kick off the job once, but it appears that 4 separate jobs run
 (one after the other).  The first one reaches 100% in its map phase (and
 based on my output only handled ¼ of the data), then the next job starts at
 0% and reaches 100%, and so on.  So I think I’m “only running one mapper
 at a time in an MR job that has 4 mappers total.”  I have 2 mapper slots
 per node.  My hadoop is set up so that one machine is the namenode and the
 other 3 are datanodes.  This gives me 6 slots total.  (This is not
 congruent to my accumulo where the master is also a slave – giving 4 total
 slaves).

 My map reduce job is not a chain job, so all 4 tablets should be able to
 run at the same time.

 Here is my job class code below:

 import org.apache.accumulo.core.security.Authorizations;
 import org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat;
 import org.apache.accumulo.core.client.mapreduce.AccumuloRowInputFormat;
 import org.apache.hadoop.conf.Configured;
 import org.apache.hadoop.io.DoubleWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.util.Tool;
 import org.apache.log4j.Level;

 public class Accumulo_FE_MR_Job extends Configured implements Tool {

     private void runOneTable() throws Exception {
         System.out.println("Running Map Reduce Feature Extraction Job");

         Job job = new Job(getConf(), getClass().getName());

         job.setJarByClass(getClass());
         job.setJobName("MRFE");

         job.setInputFormatClass(AccumuloRowInputFormat.class);
         AccumuloRowInputFormat.setZooKeeperInstance(job.getConfiguration(),
                 HMaxConstants.INSTANCE,
                 HMaxConstants.ZOO_SERVERS);

         AccumuloRowInputFormat.setInputInfo(job.getConfiguration(),
                 HMaxConstants.USER,
                 HMaxConstants.PASSWORD.getBytes(),
                 HMaxConstants.FEATLESS_IMG_TABLE,
                 new Authorizations());

         AccumuloRowInputFormat.setLogLevel(job.getConfiguration(), Level.FATAL);

         job.setMapperClass(AccumuloFEMapper.class);
         job.setMapOutputKeyClass(Text.class);
         job.setMapOutputValueClass(DoubleWritable.class);

         job.setNumReduceTasks(4);
         job.setReducerClass(AccumuloFEReducer.class);
         job.setOutputKeyClass(Text.class);
         job.setOutputValueClass(Text.class);

         job.setOutputFormatClass(AccumuloOutputFormat.class);
         AccumuloOutputFormat.setZooKeeperInstance(job.getConfiguration(),
                 HMaxConstants.INSTANCE,
                 HMaxConstants.ZOO_SERVERS);
         AccumuloOutputFormat.setOutputInfo(job.getConfiguration(),
                 HMaxConstants.USER,
                 HMaxConstants.PASSWORD.getBytes(),
                 true,
                 HMaxConstants.ALL_IMG_TABLE);

         AccumuloOutputFormat.setLogLevel(job.getConfiguration(), Level.FATAL);

         job.waitForCompletion(true);
         if (job.isSuccessful()) {
             System.err.println("Job Successful");
         } else {
             System.err.println("Job Unsuccessful");
         }
     }

     @Override
     public int run(String[] arg0) throws Exception {
         runOneTable();
         return 0;
     }
 }

 Thanks,

 Duane

 From: John Vines [mailto:vi...@apache.org]
 Sent: Friday, November 02, 2012 5:04 PM
 To: user@accumulo.apache.org
 Subject: Re: Accumulo Map Reduce is not distributed

 This sounds like an issue with how your MR environment is configured
 and/or how you're kicking off your mapreduce.

 Accumulo's input formats will automatically set the number of mappers to
 the number of tablets you have, so you should have seen your job go from 1
 mapper to 4. What you describe is you now do 4 MR

Re: Filter Implementation - Accumulo 1.3

2012-10-23 Thread William Slacum
Make sure that the class is available to the the tserver process. This is
done by putting the jar containing your class on all nodes under the
$ACCUMULO_HOME/lib/ext directory. If you put it under lib/ext, then you
won't need to stop and restart the process for the tserver to pick it up.

On Tue, Oct 23, 2012 at 10:15 AM, Eric Newton eric.new...@gmail.com wrote:

 Check the tablet server logs... you'll see the real problem using the
 filter in there.

 -Eric


 On Tue, Oct 23, 2012 at 9:54 AM, Victoria Bare 
 vbare.accum...@gmail.comwrote:

 Hello,

 I am currently using Accumulo 1.3 to implement a Filter.  Since I'm using
 1.3, I realize that the Filter class is not an iterator so I have created a
 MyFilter class that implements Filter to use when I initialize my Scanner.
  When I run my code, I am getting an  AccumuloServerException.

 I was referencing the posts from December 2011 on Filter Use to
 initialize my scanner with MyFilter.
 My scanner initialization currently appears as so:

 Instance zooInstance = new ZooKeeperInstance(instanceName, zooServers);
 Connector connector = zooInstance.getConnector(userName, password);
 Authorizations authorizations = new Authorizations();
 Scanner scanner = connector.createScanner(tableName, authorizations);

 scanner.setRange(range);

 scanner.setScanIterators(1,
     "org.apache.accumulo.core.iterators.FilteringIterator", "myFilter");
 scanner.setScanIteratorOption("myFilter", "0", "test.offsets.MyFilter");
 scanner.setScanIteratorOption("myFilter", "0.start", start);

 Iterator<Entry<Key,Value>> iterator = scanner.iterator();

 while (iterator.hasNext()) {  <--- Exception here

 ...

 }


 -

 public class MyFilter implements Filter {
     long startOfRange = 0;

     @Override
     public boolean accept(Key key, Value value) {
         String colqual = key.getColumnQualifier().toString();
         long end = Long.parseLong(colqual.substring(20, 39));
         if (end < startOfRange) {
             return false;
         }
         return true;
     }

     @Override
     public void init(Map<String, String> options) {
         if (options == null) {
             throw new IllegalArgumentException("'start' must be set for filter");
         }
         String start = options.get("start");
         if (start == null) {
             throw new IllegalArgumentException("'start' must be set for filter");
         }
         startOfRange = Long.parseLong(start);
     }
 }


 -

 The Exception that I'm receiving is:

 Exception in thread main java.lang.RuntimeException:
 org.apache.accumulo.core.client.impl.AccumuloServerException:
  at
 org.apache.accumulo.core.client.impl.ScannerIterator.hasNext(ScannerIterator.java)
 at test.offsets.TestFilter.getFilterEntrySetRange(TestFilter.java)
  at
 test.offsets.TestFilter.getAnalysisProductsByClassFilteredOffset(TestFilter.java)
 at test.offsets.TestFilter.main(TestFilter.java)


 -

 I was thinking that maybe the server couldn't find the MyFilter class, or
 maybe it was a permissions error, but I wasn't sure.  When I initialize my
 Scanner to use MyFilter, is it looking on the server for the file or in my
 project?

 Any assistance you can provide would be greatly appreciated, thanks!
 Tori





Re: [VOTE] accumulo-1.4.2 RC2

2012-10-22 Thread William Slacum
-1, since I'm running into the rat issue reported by Dave Medinets when
running build.sh.

On Mon, Oct 22, 2012 at 12:20 PM, Keith Turner ke...@deenlo.com wrote:

 On Mon, Oct 22, 2012 at 9:52 AM, Josh Elser josh.el...@gmail.com wrote:
  I agree. If it's not a quick fix, we should just revert the change and
 fix
  it properly in the next release.

 Since this is a bug introduced in 1.4.1, Christopher suggested rolling
 back the changes made in 1.4.1 in the ticket.  I like this idea and
 will take a stab at it today.

 
 
  On 10/19/12 5:32 PM, Christopher Tubbs wrote:
 
  I don't know that ACCUMULO-826 should be fixed before release, as I'm
  not sure there's a good fix without changing the API, and these issues
  may occur in several places in the MapReduce API.
 
  --
  Christopher L Tubbs II
  http://gravatar.com/ctubbsii
 
 
  On Fri, Oct 19, 2012 at 3:41 PM, Eric Newton eric.new...@gmail.com
  wrote:
 
  I agree.  And thanks for taking some time to test the candidate.
 
  It would be great if we could get some feedback from all the
 committers,
  and
  soon. I assume many of them will be busy in NY next week.
 
  If you look at the CHANGES for 1.4.2, there are some significant bug
  fixes.
  We want to make sure the final release doesn't contain any unexpected
  surprises like this.
 
  -Eric
 
  On Fri, Oct 19, 2012 at 3:33 PM, Keith Turner ke...@deenlo.com
 wrote:
 
  While testing 1.4.2rc2, I ran into ACCUMULO-826.   I think this is a
  pretty severe issue that occurs under a fairly common use case.   It
  sucks to have your M/R job die after a few hours because you killed
  the processes that started the job.   I am thinking its worth holding
  1.4.2 up in order to fix this issue, thoughts?
 
  On Thu, Oct 18, 2012 at 12:46 PM, Eric Newton eric.new...@gmail.com
  wrote:
 
  Please vote on releasing the following candidate as Apache Accumulo
  version 1.4.2.
 
  The src tar ball was generated by exporting:
 
  https://svn.apache.org/repos/asf/accumulo/tags/1.4.2rc2
 
  To build the dist tar ball from the source run the following command:
  src/assemble/build.sh
 
  Tarballs, checksums, signatures:
 http://people.apache.org/~ecn/1.4.2rc2
 
  Maven Staged Repository:
 
 
 
 https://repository.apache.org/content/repositories/orgapacheaccumulo-135
 
  Keys:
 http://www.apache.org/dist/accumulo/KEYS
 
  Changes:
 https://svn.apache.org/repos/asf/accumulo/tags/1.4.2rc2/CHANGES
 
  The vote will be held open for the next 72 hours.
 
  The only change from RC1 was ACCUMULO-823.
 
 
 



Re: [VOTE] accumulo-1.4.2 RC2

2012-10-22 Thread William Slacum
I replied to the thread David made, since I think Billie has run into the
same issue. I'm on OSX 10.7.5 and I believe
it's docs/src/developer_manual/component_docs.odp
and test/system/continuous/ScaleTest.odp, since she mentioned that some
versions seem not to care whether it's binary or not.

On Mon, Oct 22, 2012 at 10:09 PM, Eric Newton eric.new...@gmail.com wrote:

 Can you identify a file that is missing a license or has an incorrect
 license?

 I have run the build on RHEL 6, and Ubuntu 12.04.  In what environment
 does the build fail?

 -Eric


 On Mon, Oct 22, 2012 at 10:06 PM, William Slacum 
 wilhelm.von.cl...@accumulo.net wrote:

 -1, since I'm running into the rat issue reported by Dave Medinets when
 running build.sh.


 On Mon, Oct 22, 2012 at 12:20 PM, Keith Turner ke...@deenlo.com wrote:

 On Mon, Oct 22, 2012 at 9:52 AM, Josh Elser josh.el...@gmail.com
 wrote:
  I agree. If it's not a quick fix, we should just revert the change and
 fix
  it properly in the next release.

 Since this is a bug introduced in 1.4.1, Christopher suggested rolling
 back the changes made in 1.4.1 in the ticket.  I like this idea and
 will take a stab at it today.

 
 
  On 10/19/12 5:32 PM, Christopher Tubbs wrote:
 
  I don't know that ACCUMULO-826 should be fixed before release, as I'm
  not sure there's a good fix without changing the API, and these issues
  may occur in several places in the MapReduce API.
 
  --
  Christopher L Tubbs II
  http://gravatar.com/ctubbsii
 
 
  On Fri, Oct 19, 2012 at 3:41 PM, Eric Newton eric.new...@gmail.com
  wrote:
 
  I agree.  And thanks for taking some time to test the candidate.
 
  It would be great if we could get some feedback from all the
 committers,
  and
  soon. I assume many of them will be busy in NY next week.
 
  If you look at the CHANGES for 1.4.2, there are some significant bug
  fixes.
  We want to make sure the final release doesn't contain any unexpected
  surprises like this.
 
  -Eric
 
  On Fri, Oct 19, 2012 at 3:33 PM, Keith Turner ke...@deenlo.com
 wrote:
 
  While testing 1.4.2rc2, I ran into ACCUMULO-826.   I think this is a
  pretty severe issue that occurs under a fairly common use case.   It
  sucks to have your M/R job die after a few hours because you killed
  the processes that started the job.   I am thinking its worth
 holding
  1.4.2 up in order to fix this issue, thoughts?
 
  On Thu, Oct 18, 2012 at 12:46 PM, Eric Newton 
 eric.new...@gmail.com
  wrote:
 
  Please vote on releasing the following candidate as Apache Accumulo
  version 1.4.2.
 
  The src tar ball was generated by exporting:
 
  https://svn.apache.org/repos/asf/accumulo/tags/1.4.2rc2
 
  To build the dist tar ball from the source run the following
 command:
  src/assemble/build.sh
 
  Tarballs, checksums, signatures:
 http://people.apache.org/~ecn/1.4.2rc2
 
  Maven Staged Repository:
 
 
 
 https://repository.apache.org/content/repositories/orgapacheaccumulo-135
 
  Keys:
 http://www.apache.org/dist/accumulo/KEYS
 
  Changes:
 https://svn.apache.org/repos/asf/accumulo/tags/1.4.2rc2/CHANGES
 
  The vote will be held open for the next 72 hours.
 
  The only change from RC1 was ACCUMULO-823.
 
 
 






Re: [VOTE] accumulo-1.4.2 RC2

2012-10-22 Thread William Slacum
Thanks Mr. Moustache-- I ended up just kicking up the expected number of
files from 53 to 55. Should we put in an OS check in the script as a band
aid, since it seems OSX isn't playing nicely?

On Mon, Oct 22, 2012 at 10:41 PM, Michael Flester fles...@gmail.com wrote:

 Wilhelm --

 For the release of 1.3.6 a while back if I switched out the mvn
 rat:check line in
 assemble/build.sh with this:

   mvn org.apache.rat:apache-rat-plugin:check

 I could get the build to pass on OS X, otherwise the odp
 files would break it like you say. I just tested this on
 trunk and it did not make the build pass but I don't think
 the issue on trunk is related to odp files.


 On Mon, Oct 22, 2012 at 10:34 PM, William Slacum 
 wilhelm.von.cl...@accumulo.net wrote:

 I replied to the thread David made, since I think Billie has run into the
 same issue. I'm on OSX 10.7.5 and I believe
 it's docs/src/developer_manual/component_docs.odp
 and test/system/continuous/ScaleTest.odp, since she mentioned that some
 versions seem not care if it's binary or not.


 On Mon, Oct 22, 2012 at 10:09 PM, Eric Newton eric.new...@gmail.comwrote:

 Can you identify a file that is missing a license or has an incorrect
 license?

 I have run the build on RHEL 6, and Ubuntu 12.04.  In what environment
 does the build fail?

 -Eric


 On Mon, Oct 22, 2012 at 10:06 PM, William Slacum 
 wilhelm.von.cl...@accumulo.net wrote:

 -1, since I'm running into the rat issue reported by Dave Medinets when
 running build.sh.


 On Mon, Oct 22, 2012 at 12:20 PM, Keith Turner ke...@deenlo.comwrote:

 On Mon, Oct 22, 2012 at 9:52 AM, Josh Elser josh.el...@gmail.com
 wrote:
  I agree. If it's not a quick fix, we should just revert the change
 and fix
  it properly in the next release.

 Since this is a bug introduced in 1.4.1, Christopher suggested rolling
 back the changes made in 1.4.1 in the ticket.  I like this idea and
 will take a stab at it today.

 
 
  On 10/19/12 5:32 PM, Christopher Tubbs wrote:
 
  I don't know that ACCUMULO-826 should be fixed before release, as
 I'm
  not sure there's a good fix without changing the API, and these
 issues
  may occur in several places in the MapReduce API.
 
  --
  Christopher L Tubbs II
  http://gravatar.com/ctubbsii
 
 
  On Fri, Oct 19, 2012 at 3:41 PM, Eric Newton eric.new...@gmail.com
 
  wrote:
 
  I agree.  And thanks for taking some time to test the candidate.
 
  It would be great if we could get some feedback from all the
 committers,
  and
  soon. I assume many of them will be busy in NY next week.
 
  If you look at the CHANGES for 1.4.2, there are some significant
 bug
  fixes.
  We want to make sure the final release doesn't contain any
 unexpected
  surprises like this.
 
  -Eric
 
  On Fri, Oct 19, 2012 at 3:33 PM, Keith Turner ke...@deenlo.com
 wrote:
 
  While testing 1.4.2rc2, I ran into ACCUMULO-826.   I think this
 is a
  pretty severe issue that occurs under a fairly common use case.
 It
  sucks to have your M/R job die after a few hours because you
 killed
  the processes that started the job.   I am thinking its worth
 holding
  1.4.2 up in order to fix this issue, thoughts?
 
  On Thu, Oct 18, 2012 at 12:46 PM, Eric Newton 
 eric.new...@gmail.com
  wrote:
 
  Please vote on releasing the following candidate as Apache
 Accumulo
  version 1.4.2.
 
  The src tar ball was generated by exporting:
 
  https://svn.apache.org/repos/asf/accumulo/tags/1.4.2rc2
 
  To build the dist tar ball from the source run the following
 command:
  src/assemble/build.sh
 
  Tarballs, checksums, signatures:
 http://people.apache.org/~ecn/1.4.2rc2
 
  Maven Staged Repository:
 
 
 
 https://repository.apache.org/content/repositories/orgapacheaccumulo-135
 
  Keys:
 http://www.apache.org/dist/accumulo/KEYS
 
  Changes:
 
 https://svn.apache.org/repos/asf/accumulo/tags/1.4.2rc2/CHANGES
 
  The vote will be held open for the next 72 hours.
 
  The only change from RC1 was ACCUMULO-823.
 
 
 








Re: compressing values returned to scanner

2012-10-01 Thread William Slacum
If you aren't often looking at the data in the value on the tablet server
(like in an iterator), you can also pre-compress your values on ingest.
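
A minimal sketch of what that client-side compression could look like, using plain
java.util.zip (nothing here is Accumulo-specific and the class/method names are made
up; on ingest you'd wrap compress(raw) in a Value, and after a scan you'd call
decompress(value.get())):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ValueCompression {
    // Gzip a value's bytes before they go into a Mutation.
    public static byte[] compress(byte[] raw) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(bos);
        gz.write(raw);
        gz.close();
        return bos.toByteArray();
    }

    // Reverse the compression on the client after the scanner returns the value.
    public static byte[] decompress(byte[] compressed) throws Exception {
        GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed));
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = gz.read(buf)) != -1) {
            bos.write(buf, 0, n);
        }
        return bos.toByteArray();
    }
}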

On Mon, Oct 1, 2012 at 12:19 PM, Marc Parisi m...@accumulo.net wrote:

 You could compress the data in the value, and decompress the data upon
 receipt by the scanner.


 On Mon, Oct 1, 2012 at 3:03 PM, ameet kini ameetk...@gmail.com wrote:


 My understanding of compression in Accumulo 1.4.1 is that it is on by
 default and that data is decompressed by the tablet server, so data on the
 wire between server/client is decompressed. Is there a way to shift the
 decompression from happening on the server to the client? I have a use case
 where each Value in my table is relatively large (~ 8MB) and I can benefit
 from compression over the wire. I don't have any server side iterators, so
 the values don't need to be decompressed by the tablet server. Also, each
 scan returns a few rows, so client-side decompression can be fast.

 The only way I can think of now is to disable compression on that table,
 and handle compression/decompression in the application. But if there is a
 way to do this in Accumulo, I'd prefer that.

 Thanks,
 Ameet





Re: compressing values returned to scanner

2012-10-01 Thread William Slacum
Someone can correct me if I'm wrong, but I believe the file compression
option you quoted is for the RFiles in HDFS. You can enable compression
there and will still see some benefit even if you compress the values on
ingest.
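
For reference, the RFile codec is a per-table setting. A sketch of changing it (the
property name is from the Accumulo configuration docs, gz being the default; the
Connector, table name, and helper class here are assumptions):

import org.apache.accumulo.core.client.Connector;

public class SetRFileCompression {
    // codec is typically "gz", "lzo", or "none"; "none" disables on-disk compression.
    public static void set(Connector conn, String table, String codec) throws Exception {
        conn.tableOperations().setProperty(table, "table.file.compress.type", codec);
    }
}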

On Mon, Oct 1, 2012 at 12:40 PM, ameet kini ameetk...@gmail.com wrote:

 That is exactly my use case (ingest once, serve often, no server-side
 iterators).

 And I'm doing pre-compression on ingest. I was just looking to do away
 with app-level compression code. Not a biggie.

 Ameet


 On Mon, Oct 1, 2012 at 3:32 PM, William Slacum 
 wilhelm.von.cl...@accumulo.net wrote:

 If you aren't often looking at the data in the value on the tablet server
 (like in an iterator), you can also pre-compress your values on ingest.


 On Mon, Oct 1, 2012 at 12:19 PM, Marc Parisi m...@accumulo.net wrote:

 You could compress the data in the value, and decompress the data upon
 receipt by the scanner.


 On Mon, Oct 1, 2012 at 3:03 PM, ameet kini ameetk...@gmail.com wrote:


 My understanding of compression in Accumulo 1.4.1 is that it is on by
 default and that data is decompressed by the tablet server, so data on the
 wire between server/client is decompressed. Is there a way to shift the
 decompression from happening on the server to the client? I have a use case
 where each Value in my table is relatively large (~ 8MB) and I can benefit
 from compression over the wire. I don't have any server side iterators, so
 the values don't need to be decompressed by the tablet server. Also, each
 scan returns a few rows, so client-side decompression can be fast.

 The only way I can think of now is to disable compression on that
 table, and handle compression/decompression in the application. But if
 there is a way to do this in Accumulo, I'd prefer that.

 Thanks,
 Ameet







Re: sanity checking application WALogs make sense

2012-09-15 Thread William Slacum
I'm a bit confused as to what you mean by an iterator going down
mid-processing. If it goes down at all, then whatever scope it's running
in (minor compaction, major compaction, or scan) will most likely go down
as well (unless your iterator eats an exception and ignores errors). A
WALog shouldn't be deleted if whatever you were trying to do failed.

On Sat, Sep 15, 2012 at 1:44 AM, Sukant Hajra qn2b6c2...@snkmail.comwrote:

 Hi guys,

 We've been slowing inching towards using iterators more effectively.  The
 typical use case of indexed docs fit one of our needs and we wrote a
 prototype
 for it.

 We've recently realized that iterators are not just read-only, and that we
 can
 get more data-local functionality by taking advantage of their ability to
 mutate data as well.  We've only begun to think more of how this may
 assist us.
 A /lot/ of our critical data-accesses are slightly complex, but local to
 one
 row.  We have billions of entities in our system, so a simple bijection of
 entities to rows works our really well for us with respect to iterators.

 Up to this point, we've had an planned architecture that uses Kestrel for
 WALog
 and a messaging system like Akka pipelining work.  Akka would help us
 manage
 flowing work from the user to the log and from the log to orchestrations of
 Accumulo intra-row reads and writes.  The log just helps us get some faster
 response time without sacrificing too much reliability.

 Recently someone asked why use our own WALog when Accumulo has one
 natively in
 HDFS.  My response has been that Accumulo's WALog is at a lower level of
 granularity of mutations.  We want reliable orchestrations of mutations.
  Our
 orchestrations are idempotent, but we want something long the lines of
 at-least-once delivery for the entire orchestration.  If an iterator goes
 down
 mid-processing, I fear Accumulo's native WALog is insufficient to claim we
 have
 a reliable enough system.

 I could definitely go through source code to validate this opinion, but I
 thought I'd bounce this reasoning off the list first.

 Also, I'm sure we're not the only people using Accumulo in this way.
  Please
 feel to advise us if anyone's got other ideas for an architecture or feels
 we're thinking about the problem backwards.

 Thanks for your input,
 Sukant



Re: Running Accumulo straight from Memory

2012-09-11 Thread William Slacum
Woops- slow innurnet and didn't notice Eric's response.

On Tue, Sep 11, 2012 at 9:30 AM, William Slacum 
wilhelm.von.cl...@accumulo.net wrote:

 You could mount a RAM disk and point HDFS to it.


 On Tue, Sep 11, 2012 at 9:02 AM, Moore, Matthew J. 
 matthew.j.mo...@saic.com wrote:

 Has anyone run Accumulo on a single server straight from memory?
 Probably using something like a Fusion  IO drive.  We are trying to use it
 without using an SSD or any spinning discs.

 ** **

 *Matthew Moore*

 Systems Engineer

 SAIC, ISBU

 Columbia, MD

 410-312-2542

 ** **





Re: Custom Iterators

2012-08-22 Thread William Slacum
An "or" clause should be able to handle an enumeration of values, as that's
supported in a JEXL expression. It would not, however, surprise me if those
iterators could not handle multiple rows in a tablet. If you can reproduce
that, please file a ticket. There will be a large update occurring to the
Wiki example in the near future.

Do you have any specific questions about how you should structure your
iterator or the contract? Making a tutorial has been on my to-do list, but
we all know how to-do lists end up...

The big things to remember are:

1) The call order: Your iterator will be created via the default
constructor, init() will be called, then seek(). After seek() is called,
your iterator should have a top if there is data available. A client can
then call hasTop(), getTopKey() and getTopValue() to check and retrieve data
(similar to hasNext() and next()) and then next() to advance the pointer. (A
minimal skeleton illustrating this follows after point 3.)

2) Your iterator can be destroyed during a scan and then reconstructed,
being passed in the last key returned to the client as the start of the
range.

3) You can have multiple sources feed into a single iterator in a tree like
fashion by clone()'ing the source passed in to init.
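
To make the call order concrete, here's a minimal skeleton of a pass-through
iterator built on WrappingIterator (an illustration only; the class name
PassThroughIterator is made up, and the overrides just delegate to the source):

import java.io.IOException;
import java.util.Collection;
import java.util.Map;

import org.apache.accumulo.core.data.ByteSequence;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.IteratorEnvironment;
import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
import org.apache.accumulo.core.iterators.WrappingIterator;

public class PassThroughIterator extends WrappingIterator {

    // 1) Called once after the no-arg constructor; options come from the iterator
    //    configuration, source is the underlying data.
    @Override
    public void init(SortedKeyValueIterator<Key,Value> source,
            Map<String,String> options, IteratorEnvironment env) throws IOException {
        super.init(source, options, env);
    }

    // 2) Called before data is read; also called again if the scan session is torn
    //    down and rebuilt, with a range starting just after the last key the client
    //    received. Any state must be reconstructable from here.
    @Override
    public void seek(Range range, Collection<ByteSequence> columnFamilies,
            boolean inclusive) throws IOException {
        super.seek(range, columnFamilies, inclusive);
    }

    // 3) hasTop()/getTopKey()/getTopValue() (inherited from WrappingIterator) expose
    //    the current entry; next() advances to the following one.
    @Override
    public void next() throws IOException {
        super.next();
    }

    // Used when a single source needs to feed multiple iterator branches.
    @Override
    public SortedKeyValueIterator<Key,Value> deepCopy(IteratorEnvironment env) {
        PassThroughIterator copy = new PassThroughIterator();
        copy.setSource(getSource().deepCopy(env));
        return copy;
    }
}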

On Wed, Aug 22, 2012 at 1:41 PM, Cardon, Tejay E tejay.e.car...@lmco.comwrote:

  All,

 I’m interested in writing a custom iterator, and I’ve been looking for
 documentation on how to do so.  Thus far, I’ve not been able to find
 anything beyond the java docs in SortedKeyValueIterator and a few other
 sub-classes.  A few of the examples use Iterators, but provide no real info
 on how to properly implement one.  Is there anywhere to find general
 guidance on the iterator stack?

 ** **

 (If you’re interested)

 Specifically, for those that are curious, I’m trying to implement
 something similar to the wikisearch example, but with some key
  differences.  In my case, I've got a file with various attributes that are
  being indexed.  So for each file there are 5 attributes, and each attribute
 has a fixed number of possible values.  For example (totally made up):

 personID, gender, hair color, country, race, personRecord

 ** **

 Row:binID; ColFam:Attribute_AttributeValue; ColQ:PersonID; Val:blank

 AND
 Row:binID; ColFam:”D”; ColQ:personID; value:personRecord

 ** **

 A typical query would be:

 Give me the personRecord for all people with:

 Gender: male 

 Hair color: blond or brown 

 Country: USA or England or china or korea 

 Race: white or oriental

 ** **

 The existing Iterators used in the wikisearch example are unable to handle
 the “or” clauses in each attribute.

 The OrIterator doesn’t appear to handle the possibility of more than one row
 per tablet

 ** **

 Thanks,

 Tejay Cardon



Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connection

2012-08-16 Thread William Slacum
What does your TServer debug log say? Also, are you writing back out to
Accumulo?

To follow up on what Jim said, you can check the zookeeper log to see if max
connections is being hit. You may also want to check and see what your max
xceivers is set to for HDFS and check your Accumulo and HDFS logs to see if
it is mentioned.

On Thu, Aug 16, 2012 at 3:59 AM, Arjumand Bonhomme jum...@gmail.com wrote:

 Hello,

 I'm fairly new to both Accumulo and Hadoop, so I think my problem may be
 due to poor configuration on my part, but I'm running out of ideas.

 I'm running this on a mac laptop, with hadoop (hadoop-0.20.2 from cdh3u4)
 in pseudo-distributed mode.
 zookeeper version zookeeper-3.3.5 from cdh3u4
 I'm using the 1.4.1 release of accumulo with a configuration copied from
 conf/examples/512MB/standalone

 I've got a Map task that is using an accumulo table as the input.
 I'm fetching all rows, but just a single column family, that has hundreds
 or even thousands of different column qualifiers.
 The table has a SummingCombiner installed for the given the column family.

 The task runs fine at first, but after ~9-15K records (I print the record
 count to the console every 1K records), it hangs and the following messages
 are printed to the console where I'm running the job:
 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to read additional
 data from server sessionid 0x1392cc35b460d1c, likely server has closed
 socket, closing socket connection and attempting reconnect
 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Opening socket connection to
 server localhost/fe80:0:0:0:0:0:0:1%1:2181
 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Socket connection established
 to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session
 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to reconnect to
 ZooKeeper service, session 0x1392cc35b460d1c has expired, closing socket
 connection
 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: EventThread shut down
 12/08/16 02:57:10 INFO zookeeper.ZooKeeper: Initiating client connection,
 connectString=localhost sessionTimeout=3
 watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@32f5c51c
 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Opening socket connection to
 server localhost/0:0:0:0:0:0:0:1:2181
 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Socket connection established
 to localhost/0:0:0:0:0:0:0:1:2181, initiating session
 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Session establishment
 complete on server localhost/0:0:0:0:0:0:0:1:2181, sessionid =
 0x1392cc35b460d25, negotiated timeout = 3
 12/08/16 02:57:11 INFO mapred.LocalJobRunner:
 12/08/16 02:57:14 INFO mapred.LocalJobRunner:
 12/08/16 02:57:17 INFO mapred.LocalJobRunner:

 Sometimes the messages contain a stacktrace like this below:
 12/08/16 01:57:40 WARN zookeeper.ClientCnxn: Session 0x1392cc35b460b40 for
 server localhost/fe80:0:0:0:0:0:0:1%1:2181, unexpected error, closing
 socket connection and attempting reconnect
 java.io.IOException: Connection reset by peer
  at sun.nio.ch.FileDispatcher.read0(Native Method)
 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
 at sun.nio.ch.IOUtil.read(IOUtil.java:166)
  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
 at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:856)
  at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1154)
 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Opening socket connection to
 server localhost/127.0.0.1:2181
 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Socket connection established
 to localhost/127.0.0.1:2181, initiating session
 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Unable to reconnect to
 ZooKeeper service, session 0x1392cc35b460b40 has expired, closing socket
 connection
 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: EventThread shut down
 12/08/16 01:57:41 INFO zookeeper.ZooKeeper: Initiating client connection,
 connectString=localhost sessionTimeout=3
 watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@684a26e8
 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Opening socket connection to
 server localhost/fe80:0:0:0:0:0:0:1%1:2181
 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Socket connection established
 to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session
 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Session establishment
 complete on server localhost/fe80:0:0:0:0:0:0:1%1:2181, sessionid =
 0x1392cc35b460b46, negotiated timeout = 3


 I've poked through the logs in accumulo, and I've noticed that when it
 hangs, the following is written to the logger_HOSTNAME.debug.log file:
 16 03:29:46,332 [logger.LogService] DEBUG: event null None Disconnected
 16 03:29:47,248 [zookeeper.ZooSession] DEBUG: Session expired, state of
 current session : Expired
 16 03:29:47,248 [logger.LogService] DEBUG: event null None Expired
 16 03:29:47,249 [logger.LogService] WARN 

Re: [External] Re: Problem importing directory to Accumulo table

2012-07-17 Thread William Slacum
Did you configure hadoop to store your HDFS instance/data somewhere
other than /tmp? Look up the single node set up in the Hadoop docs.

On Tue, Jul 17, 2012 at 12:07 PM, Shrestha, Tejen [USA]
shrestha_te...@bah.com wrote:
 This is the error that was produced.

 java.io.FileNotFoundException: File /tmp/files does not exist.
 at
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.ja
 va:361)
 at
 org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:2
 45)
 at
 org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.
 java:509)
 at
 org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.ja
 va:644)
 at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
 at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
 at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
 at com.bah.applefox.plugins.loader.NGramLoader.run(NGramLoader.java:302)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at com.bah.applefox.ingest.Ingest.main(Ingest.java:133)



 On 7/17/12 12:50 PM, Eric Newton eric.new...@gmail.com wrote:

You will need to look in the master/tserver logs for the reason.

-Eric

On Tue, Jul 17, 2012 at 11:03 AM, Shrestha, Tejen [USA]
shrestha_te...@bah.com wrote:
 Below is the line I am using to do the Bulk Import:


 conn.tableOperations().importDirectory(table, dir, failureDir, false);


 Where conn is the connector to the ZooKeeper instance.  The problem is
the
 error: Internal error processing waitForTableOperation.



Re: [External] Re: Problem importing directory to Accumulo table

2012-07-17 Thread William Slacum
Also it looks like your app is storing something in /tmp/files, so you
may want to make sure that you mean to be looking on your local FS or
in HDFS.
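
A quick way to check both places from code (a sketch; the class name is made up,
the /tmp/files path is from the stack trace, and FileSystem.get() resolves against
whatever fs.default.name is configured to, which is HDFS in a pseudo-distributed
setup, while getLocal() is the local disk):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WhereIsIt {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path p = new Path("/tmp/files");
        // Prints whether the path exists in the default (usually HDFS) filesystem
        // versus on the local disk.
        System.out.println("in default FS? " + FileSystem.get(conf).exists(p));
        System.out.println("on local disk? " + FileSystem.getLocal(conf).exists(p));
    }
}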

On Tue, Jul 17, 2012 at 12:27 PM, William Slacum wsla...@gmail.com wrote:
 Did you configure hadoop to store your HDFS instance/data somewhere
 other than /tmp? Look up the single node set up in the Hadoop docs.

 On Tue, Jul 17, 2012 at 12:07 PM, Shrestha, Tejen [USA]
 shrestha_te...@bah.com wrote:
 This is the error that was produced.

 java.io.FileNotFoundException: File /tmp/files does not exist.
 at
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.ja
 va:361)
 at
 org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:2
 45)
 at
 org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.
 java:509)
 at
 org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.ja
 va:644)
 at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
 at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
 at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
 at com.bah.applefox.plugins.loader.NGramLoader.run(NGramLoader.java:302)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at com.bah.applefox.ingest.Ingest.main(Ingest.java:133)



 On 7/17/12 12:50 PM, Eric Newton eric.new...@gmail.com wrote:

You will need to look in the master/tserver logs for the reason.

-Eric

On Tue, Jul 17, 2012 at 11:03 AM, Shrestha, Tejen [USA]
shrestha_te...@bah.com wrote:
 Below is the line I am using to do the Bulk Import:


 conn.tableOperations().importDirectory(table, dir, failureDir, false);


 Where conn is the connector to the ZooKeeper instance.  The problem is
the
 error: Internal error processing waitForTableOperation.



Re: more questions about IndexedDocIterators

2012-07-16 Thread William Slacum
1) The class hierarchy is a little convoluted, but there doesn't seem to be
anything necessarily broken about the
FamilyIntersectingIterator/IndexedDocIterator that would prevent it from
being backported from trunk to a 1.3.x branch. AFAIK the
SortedKeyValueIterator interface has remained unchanged between the initial
1.3 release up through our current trunk.

2) I'm a little confused as to what you mean by sharding by document ID.
Does this mean that for any given key, the row portion is a document ID? As
far as reversing the timestamp, it seems reasonable if your queries are
primarily of the form give me documents within the past X time units.

3) What's your timestamp? If it's just a milliseconds-since-epoch
timestamp, it's not unheard of to encode numeric values into an ordering
that sorts lexicographically without just padding with zeroes. The
Wikipedia example has a NumberNormalizer that uses commons-lang to do this.
(A small sketch of a fixed-width encoding follows after point 4.) As for
hard numbers on performance with time and space, I don't have them. I would
imagine you will see a difference in space and possibly time if the
deserializing of the String is faster than what you're using now.

4) I'd like to see your source. Have you looked at the
IndexedDocIteratorTest to verify that it behaves properly? I'm surprised
that it's returning you an index column family. Was your sample client
running with the dummy negation you mentioned in #5?
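
Regarding #3, here's a small sketch of a fixed-width reverse timestamp using plain
hex rather than any Unicode trickery (an illustration only, not the wikisearch
NumberNormalizer; all names here are made up):

public class ReverseTimestamp {
    // Zero-padded hex of (Long.MAX_VALUE - millis) keeps a fixed width, so
    // lexicographic order matches reverse numeric order: newer sorts first.
    public static String encode(long millis) {
        return String.format("%016x", Long.MAX_VALUE - millis);
    }

    public static long decode(String encoded) {
        return Long.MAX_VALUE - Long.parseLong(encoded, 16);
    }

    public static void main(String[] args) {
        long older = 1342000000000L;
        long newer = 1343000000000L;
        // The newer timestamp encodes to a lexicographically smaller string.
        System.out.println(encode(newer).compareTo(encode(older)) < 0); // true
    }
}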

On Sun, Jul 15, 2012 at 7:05 PM, Sukant Hajra qn2b6c2...@snkmail.comwrote:

 Hi all,

 I have a mixed bag of questions to follow up on an earlier post inquiring
 about
 intersecting iterators now that I've done some prototyping:


 1. Do FamilyIntersectingIterators work in 1.3.4?
 

 Does anyone know if FamilyIntersectingIterators were useable as far back as
 1.3.4?  Or am I wasting my time on them at this old version (and need to
 upgrade)?

 I got a prototype of IndexedDocIterators working with Accumulo 1.4.1, but
 currently have a hung thread in my attempt to use a
 FamilyIntersectingIterator
 with Cloudbase 1.3.4.  Also, I noticed the API changed somewhat to remove
 some
 oddly designed static configuration.

 If FamilyIntersectingIterators were buggy, were there sufficient
 work-arounds
 to get some use out of them in 1.3.4?

 Unfortunately, I need to jump through some political/social hoops to
 upgrade,
 but if it's got to be done, then I'll do what I have to.


 2. Is this approach reasonable?
 ---

 We're trying to be clever with our use of indexed docs.  We're less
 interested
 in searching over a large corpus of data in parallel, and more interested
 in
 doing some server-side joins in a data-local way (to reduce client burden
 and
 network traffic).  So we're heavily sharding our documents (billions of
 shards) and using range constraints on the iterator to hone in on exactly
 one
 shard (new Range(shardId, shardId)).

 Let me give you a sense for what we're doing.  In one use case, we're using
 document-indexed iterators to accomodate both per-author and by-time
 accesses
 of a per-document commit log.  So we're sharding by document ID (and we
 have
 billions of documents).  Then we use the author ID as terms for each commit
 (one term per commit entry).  We use a reverse timestamp for the doc type,
 so
 we get back these entries in reverse time order.  In this way, we can scan
 the
 log for the entire document by time with plain iterators, and for a specific
 author with a document-indexed iterator (with a server-side join to the
 commit
 log entry).  Later on, we may index the log by other features with this
 approach.

 Is this strategy sane?  Is there precedent for doing it?  Is there a better
 alternative?


 3. Compressed reverse-timestamp using Unicode tricks?
 --

 I see code in Accumulo like

 // We're past the index column family, so return a term that will sort
 // lexicographically last.  The last unicode character should suffice
 return new Text(\uFFFD);

 which gets me thinking that I can probably pull off an impressively
 compressed,
 but still lexically ordered, reverse timestamp using Unicode trickery to
 get a
 gigantic radix.  Is there any precedence for this?  I'm a little worried
 about
 running into corner cases with Unicode encoding.  Otherwise, I think it
 feels
 like a simple algorithm that may not eat up much CPU in translation and
 might
 save disk space at scale.

 Or is this optimizing into the noise given compression Accumulo already
 does
 under the covers?


 4. Response from IndexedDocIterator not reflecting documentation
 

 I got back results in my prototype that don't line up with the
 documentation
 for a IndexedDocIterator.  For example, here's some data I put into a test
 table:

 r:shardId, cf:e\0docType, cq:docId, value:content
 r:shardId, cf:i, 

Re: Chain Jobs and Accumulo.

2012-07-16 Thread William Slacum
http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201203.mbox/%3ccaocnvr0osrawytau7lt+agf0bmmcwfhrgpj8_ga4u6mac2y...@mail.gmail.com%3E

It looks like the old API was given a second chance at life and is now
being billed as the stable API.

On Mon, Jul 16, 2012 at 2:39 PM, Billie J Rinaldi billie.j.rina...@ugov.gov
 wrote:

 On Monday, July 16, 2012 5:27:09 PM, Ed Kohlwey ekohl...@gmail.com
 wrote:
  I would suggest spending the effort porting chainmapper to the new API
  (mapreduce) since the old API will eventually be removed.

 I assumed that would be true since the old API was deprecated, and that is
 why we no longer support it.  However, the old API has been undeprecated
 since 0.20.205.0 and 1.0.0, which seems to indicate it's not going away.
  Does anyone know what the plan for it is?

 Billie


  Sent from my smartphone. Please excuse any typos or shorthand.
  On Jul 16, 2012 5:22 PM, Billie J Rinaldi 
  billie.j.rina...@ugov.gov  wrote:
 
 
  On Monday, July 16, 2012 5:02:52 PM, Juan Moreno 
  jwellington.mor...@gmail.com  wrote:
   Hi there, I have a use case where I need to use a Chain Mapper
   and/or
   Reducer. The problem is that
   the AccumuloInputFormat extends hadoop.mapreduce.InputFormat rather
   than implementing hadoop.mapred.InputFormat
  
  
   Trying to make use of org.apache.hadoop.mapred.lib.ChainMapper
   does not work because it requires the use of the mapred package. Is
   there a version of the AccumuloInputFormat which uses
   the hadoop.mapred package instead? Can InputFormatBase be rewritten
   with the newer API?
  
  
   Thanks!
   Juan
 
  I opened ACCUMULO-695 to add support for the old mapred API.
 
  Billie



Re: Chain Jobs and Accumulo.

2012-07-16 Thread William Slacum
mapred was deprecated as of 0.20.0 (
http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/InputFormat.html)
:)

On Mon, Jul 16, 2012 at 2:49 PM, Juan Moreno
jwellington.mor...@gmail.comwrote:

 The hadoop API is very confusing in that regard. Currently Accumulo runs
 atop 0.20 and in that version , mapred is the new one and mapreduce is the
 old one. InputFormat currently makes use of mapreduce.
 On Jul 16, 2012 5:40 PM, Billie J Rinaldi billie.j.rina...@ugov.gov
 wrote:

 On Monday, July 16, 2012 5:27:09 PM, Ed Kohlwey ekohl...@gmail.com
 wrote:
  I would suggest spending the effort porting chainmapper to the new API
  (mapreduce) since the old API will eventually be removed.

 I assumed that would be true since the old API was deprecated, and that
 is why we no longer support it.  However, the old API has been undeprecated
 since 0.20.205.0 and 1.0.0, which seems to indicate it's not going away.
  Does anyone know what the plan for it is?

 Billie


  Sent from my smartphone. Please excuse any typos or shorthand.
  On Jul 16, 2012 5:22 PM, Billie J Rinaldi 
  billie.j.rina...@ugov.gov  wrote:
 
 
  On Monday, July 16, 2012 5:02:52 PM, Juan Moreno 
  jwellington.mor...@gmail.com  wrote:
   Hi there, I have a use case where I need to use a Chain Mapper
   and/or
   Reducer. The problem is that
   the AccumuloInputFormat extends hadoop.mapreduce.InputFormat rather
   than implementing hadoop.mapred.InputFormat
  
  
   Trying to make use of org.apache.hadoop.mapred.lib.ChainMapper
   does not work because it requires the use of the mapred package. Is
   there a version of the AccumuloInputFormat which uses
   the hadoop.mapred package instead? Can InputFormatBase be rewritten
   with the newer API?
  
  
   Thanks!
   Juan
 
  I opened ACCUMULO-695 to add support for the old mapred API.
 
  Billie




Re: Chain Jobs and Accumulo.

2012-07-16 Thread William Slacum
You'd basically be doing a copy of the getSplits and getRecordReader
methods, except they'd be returning the mapred versions of those classes.

On Mon, Jul 16, 2012 at 3:13 PM, Juan Moreno
jwellington.mor...@gmail.comwrote:

 How hard would it be to implement own version using the mapred API ?

 Would I have to do something as complex as InputFormatBase ? (It's a
 mammoth class)
 On Jul 16, 2012 5:53 PM, William Slacum wilhelm.von.cl...@accumulo.net
 wrote:

 mapred was deprecated as of 0.20.0 (
 http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/InputFormat.html)
 :)

 On Mon, Jul 16, 2012 at 2:49 PM, Juan Moreno 
 jwellington.mor...@gmail.com wrote:

 The hadoop API is very confusing in that regard. Currently Accumulo runs
 atop 0.20 and in that version , mapred is the new one and mapreduce is the
 old one. InputFormat currently makes use of mapreduce.
  On Jul 16, 2012 5:40 PM, Billie J Rinaldi billie.j.rina...@ugov.gov
 wrote:

 On Monday, July 16, 2012 5:27:09 PM, Ed Kohlwey ekohl...@gmail.com
 wrote:
  I would suggest spending the effort porting chainmapper to the new API
  (mapreduce) since the old API will eventually be removed.

 I assumed that would be true since the old API was deprecated, and that
 is why we no longer support it.  However, the old API has been undeprecated
 since 0.20.205.0 and 1.0.0, which seems to indicate it's not going away.
  Does anyone know what the plan for it is?

 Billie


  Sent from my smartphone. Please excuse any typos or shorthand.
  On Jul 16, 2012 5:22 PM, Billie J Rinaldi 
  billie.j.rina...@ugov.gov  wrote:
 
 
  On Monday, July 16, 2012 5:02:52 PM, Juan Moreno 
  jwellington.mor...@gmail.com  wrote:
   Hi there, I have a use case where I need to use a Chain Mapper
   and/or
   Reducer. The problem is that
   the AccumuloInputFormat extends hadoop.mapreduce.InputFormat rather
   than implementing hadoop.mapred.InputFormat
  
  
   Trying to make use of org.apache.hadoop.mapred.lib.ChainMapper
   does not work because it requires the use of the mapred package. Is
   there a version of the AccumuloInputFormat which uses
   the hadoop.mapred package instead? Can InputFormatBase be rewritten
   with the newer API?
  
  
   Thanks!
   Juan
 
  I opened ACCUMULO-695 to add support for the old mapred API.
 
  Billie





Re: more questions about IndexedDocIterators

2012-07-15 Thread William Slacum
I'm on a phone, so excuse the lack of info/answers, but #5 is because the
IntersectingIterator is essentially a proof of concept piece of code.
There's no reason you shouldn't be able to do one term. The Wikipedia
example is able to handle single term queries. The code is a bit rough to
read, but should be a starting point.
On Jul 15, 2012 7:06 PM, Sukant Hajra qn2b6c2...@snkmail.com wrote:


Re: java.lang.VerifyError: Cannot inherit from final class

2012-07-14 Thread William Slacum
Looks like the stack trace is finishing up in the Thrift stuff-- I
wonder if you have a newer version of Thrift on the client?

On Sat, Jul 14, 2012 at 10:33 PM, Josh Elser josh.el...@gmail.com wrote:
 Can you post some more information about how you're running your program on
 your Windows client (specifically, the classpath)? Are you using something
 like Ant/Maven to manage dependencies? Also, what version of Java are you
 running on your Linode instance?



 On 07/14/2012 08:27 PM, David Medinets wrote:

 This was unexpected. My code is simple but ran into a problem. I am
 connecting from my Windows computer to the Linode Ubuntu-based server
 running Accumulo. The same code worked when run directly on the
 server. I copied all jar files from the Accumulo lib directory after I
 compiled Accumulo over to the Windows computer.

 Here is the code:

  String instanceName = development;
 String zooKeepers = zookeeper.affy.com;
  String user = root;
  byte[] pass = X.getBytes();
  String tableName = rope;

  ZooKeeperInstance instance = new
 ZooKeeperInstance(instanceName, zooKeepers);
  Connector connector = instance.getConnector(user, pass);

  if (!connector.tableOperations().exists(tableName)) {
  connector.tableOperations().create(tableName);
  }

 And here is the output:

 START: com.codebits.accumulo.CreateTable
 12/07/14 19:55:02 INFO zookeeper.ZooKeeper: Client
 environment:zookeeper.version=3.3.1-942149, built on 05/07/2010 17:14
 GMT
 12/07/14 19:55:02 INFO zookeeper.ZooKeeper: Client
 environment:host.name=aashi
 12/07/14 19:55:02 INFO zookeeper.ZooKeeper: Client
 environment:java.version=1.7.0_04
 12/07/14 19:55:02 INFO zookeeper.ZooKeeper: Client
 environment:java.vendor=Oracle Corporation
 12/07/14 19:55:02 INFO zookeeper.ZooKeeper: Client
 environment:java.home=C:\Program Files (x86)\Java\jdk1.7.0_04\jre
 12/07/14 19:55:02 INFO zookeeper.ZooKeeper: Client
 environment:java.class.path=...
 12/07/14 19:55:02 INFO zookeeper.ZooKeeper: Client
 environment:java.io.tmpdir=C:\Users\medined\AppData\Local\Temp\
 12/07/14 19:55:02 INFO zookeeper.ZooKeeper: Client
 environment:java.compiler=NA
 12/07/14 19:55:02 INFO zookeeper.ZooKeeper: Client
 environment:os.name=Windows 7
 12/07/14 19:55:02 INFO zookeeper.ZooKeeper: Client environment:os.arch=x86
 12/07/14 19:55:02 INFO zookeeper.ZooKeeper: Client
 environment:os.version=6.1
 12/07/14 19:55:02 INFO zookeeper.ZooKeeper: Client
 environment:user.name=medined
 12/07/14 19:55:02 INFO zookeeper.ZooKeeper: Client
 environment:user.home=C:\Users\medined
 12/07/14 19:55:02 INFO zookeeper.ZooKeeper: Client
 environment:user.dir=C:\eclipse_projects\accumulo_playground
 12/07/14 19:55:02 INFO zookeeper.ZooKeeper: Initiating client
 connection, connectString=zookeeper.affy.com sessionTimeout=3
 watcher=org.apache.accumulo.fate.zookeeper.ZooSession$ZooWatcher@11adeb7
 12/07/14 19:55:02 INFO zookeeper.ClientCnxn: Opening socket connection
 to server zookeeper.affy.com/66.175.213.65:2181
 12/07/14 19:55:02 INFO zookeeper.ClientCnxn: Socket connection
 established to zookeeper.affy.com/66.175.213.65:2181, initiating
 session
 12/07/14 19:55:02 INFO zookeeper.ClientCnxn: Session establishment
 complete on server zookeeper.affy.com/66.175.213.65:2181, sessionid =
 0x138823b14b64848, negotiated timeout = 3
 12/07/14 19:55:03 WARN impl.ServerClient: Failed to find an available
 server in the list of servers: [66.175.213.65:9997:9997 (12)]
 Exception in thread main java.lang.VerifyError: Cannot inherit from
 final class
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:791)
 at
 java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
 at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
 at
 org.apache.accumulo.core.util.ThriftUtil.<clinit>(ThriftUtil.java:79)
 at
 org.apache.accumulo.core.client.impl.ThriftTransportPool.createNewTransport(ThriftTransportPool.java:478)
 at
 org.apache.accumulo.core.client.impl.ThriftTransportPool.getAnyTransport(ThriftTransportPool.java:429)
 at
 org.apache.accumulo.core.client.impl.ServerClient.getConnection(ServerClient.java:144)
 at
 org.apache.accumulo.core.client.impl.ServerClient.getConnection(ServerClient.java:122)
 at
 

Re: Cleaning tablet server entries from zookeeper - would it be possible to add percent complete?

2012-07-10 Thread William Slacum
It can take a long time if your tablet server isn't responsive, you're
major compacting, or there's some other issue going on in your
ecosystem (ie, the NameNode/DataNode has barfed or even ZooKeeper
itself has locked up). Check your monitor to see what it's trying to
do and also check that HDFS is doing ok.

On Tue, Jul 10, 2012 at 10:32 PM, David Medinets
david.medin...@gmail.com wrote:
 I am trying to stop Accumulo with bin/stop-all.sh. The process seems
 to be taking a long time after displaying "Cleaning tablet server
 entries from zookeeper". Would it be possible to count the number of
 entries in ZooKeeper before clearing them? Then a percent-complete
 message could be displayed. Is it normal for this step to take a long
 time?
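
For anyone wanting to prototype the percent-complete idea, here is a
minimal sketch using the plain org.apache.zookeeper client. The znode
path below is a placeholder (the real layout is roughly
/accumulo/<instance-id>/tservers/<host:port>), and the loop body is only
a stand-in for the actual lock cleanup -- this illustrates counting
first, it is not the code stop-all.sh runs.

{{{
import java.util.List;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class TserverEntryCount {
  public static void main(String[] args) throws Exception {
    // Placeholder path; substitute the real instance id for your cluster.
    String tserversPath = "/accumulo/INSTANCE_ID/tservers";

    ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, new Watcher() {
      public void process(WatchedEvent event) { /* no-op watcher */ }
    });

    // Count the entries up front so progress can be reported as a percentage.
    List<String> tservers = zk.getChildren(tserversPath, false);
    int total = tservers.size();
    int done = 0;

    for (String tserver : tservers) {
      // A real cleanup would delete the lock node(s) under this entry here.
      done++;
      System.out.printf("cleaned %s, %d/%d (%d%%)%n", tserver, done, total, (100 * done) / total);
    }

    zk.close();
  }
}
}}}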


Re: querying the tablet server for given row (to get locality)?

2012-07-01 Thread William Slacum
A tablet will contain at minimum one row. So, if you shard/partition,
eventually your data will grow to the point that each tablet will
essentially be one row.
On Jul 1, 2012 2:17 PM, Sukant Hajra qn2b6c2...@snkmail.com wrote:

 I've been considering using a distributed messaging service (Akka in my
 case).
 To get some throughput on ingesting data, I was going to shard computation
 across multiple servers, but the backend is still Accumulo.

 What bothers me is that I don't know the mapping from row IDs to tablet
 servers, so every one of my nodes is talking ostensibly to every tablet
 server,
 which is a lot of needless network traffic.

 What I'd really like to do is collocate my computation on the relevant
 tablet
 server to get the same benefits of locality Accumulo gets with HDFS.

 I feel Accumulo has to have this information internally, but I haven't dug
 deeply into the source to see if it's exposed to Accumulo clients.  Is it
 there?  If it is exposed, is it supported?

 Thanks for the help,
 Sukant
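
For what it's worth, the row-to-tablet-server mapping can be read from
the metadata table on the client side. A rough sketch against the
1.4-era layout is below; the table name (!METADATA), the loc column
family, and the row format are internal details that can change between
releases, and the scanning user needs read access to that table, so
treat this as a locality hint rather than a supported API.

{{{
import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class TabletLocationSketch {
  // Prints which tablet server currently hosts each tablet of the given table.
  // Takes the table *id* (e.g. "1"), not the table name.
  public static void printLocations(Connector conn, String tableId) throws Exception {
    Scanner scanner = conn.createScanner("!METADATA", new Authorizations());

    // Metadata rows for a table look like "<tableId>;<endRow>", plus
    // "<tableId><" for the last (default) tablet.
    scanner.setRange(new Range(new Text(tableId + ";"), new Text(tableId + "<")));
    scanner.fetchColumnFamily(new Text("loc"));

    for (Entry<Key, Value> e : scanner) {
      System.out.println("tablet " + e.getKey().getRow()
          + " is hosted on " + new String(e.getValue().get()));
    }
  }
}
}}}

Tablets migrate, so any answer read this way is only advisory; the
computation still has to be able to talk to whichever server ends up
holding the tablet.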



Re: strategies beyond intersecting iterators?

2012-07-01 Thread William Slacum
By iterator stack I am referring to the Accumulo iterators. Resource
sharing among scan sessions is implemented by destroying a user scan
session and eventually recreating the iterator stack. The new stack is then
seek'd to the last key returned by the entire stack. If you were holding
some state, such as a set of keys, it would be rebuilt every time the stack
is created.
On Jul 1, 2012 5:55 PM, Sukant Hajra qn2b6c2...@snkmail.com wrote:

 Excerpts from William Slacum's message of Thu Jun 28 16:04:32 -0500 2012:
 
  You're pretty much on the spot regarding two aspects about the current
  IntersectingIterator:
 
  1- It's not really extensible (there are hooks for building doc IDs,
  but you still need the same `partition term: docId` key structure)
  2- Its main strength is that it can do the merges of sorted lists of
  doc IDs based on equality expressions (ie, `author==bob and
  day==20120627`)
 
  Fortunately, the logic isn't very complicated for re-creating the
  merging stuff. Personally, I think it's easy enough to separate the
  logic of joining N streams of iterator results from the actual
  scanning. Unfortunately, this would be left up to you to do at the
  moment :)
 
  You could do range searches by consuming sets of values and sorting
  all of the docIds in that range by throwing them into a TreeSet. That
  would let you emit doc IDs in a globally sorted order for the given
  range of terms.

 I understand everything above, I think.  Thanks for the prompt reply.

 This can get problematic if the range ends up being very large because
 your iterator stack may periodically be destroyed and rebuilt.

 This particular statement confused me.  When you said TreeSet, you're
 talking
 about a straight-forward in-memory collection from java.util or similar,
 right?

 Because I'm confused about which iterator stack may periodically be
 destroyed
 and rebuilt.  It sounds like we're talking about some garbage collection
 specific to Accumulo.  Am I missing something here?

 -Sukant
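
To make the "holding some state" point concrete, here is a bare-bones
sketch (not a production iterator) of an iterator that keeps a set of
keys it has already returned. The set only lives as long as a single
stack instance: when Accumulo tears the stack down and seeks a fresh
copy, or when a BatchScanner runs copies in several threads, each copy
starts with its own empty set.

{{{
import java.io.IOException;
import java.util.Collection;
import java.util.TreeSet;

import org.apache.accumulo.core.data.ByteSequence;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.iterators.WrappingIterator;

public class DedupSketchIterator extends WrappingIterator {
  // State held in memory for the lifetime of *this* stack instance only.
  private final TreeSet<String> seen = new TreeSet<String>();

  @Override
  public void seek(Range range, Collection<ByteSequence> families, boolean inclusive)
      throws IOException {
    super.seek(range, families, inclusive);
    skipSeen();
  }

  @Override
  public void next() throws IOException {
    super.next();
    skipSeen();
  }

  private void skipSeen() throws IOException {
    // Skip entries whose column qualifier this instance has already returned.
    // TreeSet.add returns false for duplicates, so the loop advances past them.
    while (super.hasTop() && !seen.add(super.getTopKey().getColumnQualifier().toString())) {
      super.next();
    }
  }
}
}}}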



Re: strategies beyond intersecting iterators?

2012-07-01 Thread William Slacum
You can think of the Intersecting (and Or) iterators as a tree of
merging keys.

So, let's assume we have the following index in a given partition. The
partition will have the row partitionN.

partitionN Bill: 1
partitionN Bill: 2
partitionN Bill: 3
partitionN Josh: 3
partitionN Josh: 4
partitionN Josh: 5
partitionN Sukant: 0
partitionN Sukant: 3
partitionN Sukant: 6

If I wanted to query for all documents that contained Bill, Josh and
Sukant, I'd set up an IntersectingIterator with three term sources. The
term sources would be created to look at one of {Bill, Josh, Sukant}
for their column family values. The column qualifiers contain the document
IDs of documents that contain the given term. This yields a setup where,
for a given term, we have a sorted list of document IDs.

To give a bit of a visualization, you can think of this structure in tree
form:


          Intersection
         /     |      \
     Sukant   Bill    Josh
      [0,      [1,     [3,
       3,       2,      4,
       6]       3]      5]


On our first pass, the IntersectingIterator will note that its children
point to the document IDs 0, 1 and 3. Since each list of doc IDs is sorted,
we can deduce that the earliest doc ID that could be a potential match is
3. So, it will seek the term sources for Sukant and Bill to at least
the key {row: partitionN, colf: term, colq: 3}. On the next pass,
we'll note that each term source is pointing to doc ID 3. This means we've
found an intersection, so the top level IntersectingIterator will return
docID 3.

When the session requests the next matching docID, the iterator will
advance each term source by calling next(). The IntersectingIterator now sees
its children are positioned at docIDs [6, null, 5] (the `null` value
arises because the Bill term source doesn't have a key beyond {row:
partitionN, colf: Bill, colq: 3}). This state means that the
intersection is done, because one of the term sources has exhausted its
possible values, so there's no doc ID that will occur in all three lists.
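
The same merge can be written out in plain Java, away from the iterator
machinery, to follow the walkthrough above step by step. This is a
sketch over in-memory sorted lists, not the IntersectingIterator
implementation itself:

{{{
import java.util.Arrays;
import java.util.List;

public class SortedListIntersection {
  // Intersect N sorted lists of doc IDs: take the largest ID any source currently
  // points at, "seek" the others forward to at least that ID, and emit when all
  // sources agree -- the same dance described above.
  static void intersect(List<List<Integer>> sources) {
    int[] pos = new int[sources.size()];
    outer:
    while (true) {
      int max = Integer.MIN_VALUE;
      for (int i = 0; i < sources.size(); i++) {
        if (pos[i] >= sources.get(i).size()) {
          return;                      // one source is exhausted: no more matches
        }
        max = Math.max(max, sources.get(i).get(pos[i]));
      }
      for (int i = 0; i < sources.size(); i++) {
        while (pos[i] < sources.get(i).size() && sources.get(i).get(pos[i]) < max) {
          pos[i]++;                    // seek this source to at least the candidate
        }
        if (pos[i] >= sources.get(i).size()) {
          return;
        }
        if (sources.get(i).get(pos[i]) > max) {
          continue outer;              // candidate overshot; recompute the max
        }
      }
      System.out.println("intersection at doc ID " + max);
      for (int i = 0; i < sources.size(); i++) {
        pos[i]++;                      // next(): advance every source past the match
      }
    }
  }

  public static void main(String[] args) {
    // The example above: Sukant [0,3,6], Bill [1,2,3], Josh [3,4,5] -> doc ID 3
    intersect(Arrays.asList(
        Arrays.asList(0, 3, 6),
        Arrays.asList(1, 2, 3),
        Arrays.asList(3, 4, 5)));
  }
}
}}}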

On Sun, Jul 1, 2012 at 11:57 PM, Sukant Hajra qn2b6c2...@snkmail.comwrote:

 Excerpts from Sukant Hajra's message of Thu Jun 28 15:49:11 -0500 2012:
 
  The Accumulo documentation alludes to the problem a little:
 
  If the results are unordered this is quite effective as the first
 results
  to arrive are as good as any others to the user.
 
  In our case, order matters because we want the last results without
 pulling in
  everything.

 Actually, I was just thinking about this a little.  I don't know if this is
 specified in the documentation, but is there /any/ reliable (deterministic)
 ordering for the values returned by intersecting iterators?

 If there is, would it be horribly ill-advised to rely on this ordering for
 application logic if we got clever with our schema?

 Also, if someone could reply with the exact algorithm for this ordering, it
 would help put less burden on us to reverse engineer and/or read the source
 code correctly.

 Thanks for your help,
 Sukant



Re: querying for relevant rows

2012-06-29 Thread William Slacum
You can use a BatchScanner and give it two ranges. It would look something
like:

ArrayList<Range> ranges = new ArrayList<Range>();
ranges.add(new Range(new Key(timestamp1)));
ranges.add(new Range(new Key(timestamp2)));

BatchScanner bs = con.createBatchScanner(...);

// set your iterators and filters

bs.setRanges(ranges);

for (Entry<Key, Value> e : bs) {
  // your stuff
}

On Fri, Jun 29, 2012 at 11:19 AM, Lam dnae...@gmail.com wrote:

 I'm using a timestamp as a key and the value is all the relevant data
 starting at that timestamp up to the timestamp represented by the key
 of the next row.

 When querying, I'm given a time span, consisting of a start and stop
 time.  I want to return all the relevant data within the time span, so
 I need to retrieve the appropriate rows (then filter the data for the
 given timespan).

 Example:
 In Accumulo:  (the format of the value is  letter.timestamp)
 key=1  value= {a.1 b.1 c.2 d.2}
 key=3  value= {m.3 n.4 o.5}
 key=6  value={x.6 y.6 z.7}

 Query:  timespan=[2 4]  (get all data from timestamp 2 to 4 inclusively)

 Desired result: retrieve key=1 and key=3, then filter out a.1, b.1, and
 o.5, and return the rest

 Problem: How do I know to retrieve key=1 and key=3 without scanning
 all the keys?

 Can I create a scanner that looks for the given start key=2 and go to
 the prior row (i.e. key=1)?

 --
 D. Lam



Re: querying for relevant rows

2012-06-29 Thread William Slacum
Oh, did I interpret this wrong? I originally thought all of the timestamps
would be enumerated as rows, but after re-reading, I kind of get the idea
that the rows are being used as markers in a skip list like fashion.

On Fri, Jun 29, 2012 at 11:52 AM, Adam Fuchs afu...@apache.org wrote:

 You can't scan backwards in Accumulo, but you probably don't need to. What
 you can do instead is use the last timestamp in the range as the key like
 this:

 key=2  value= {a.1 b.1 c.2 d.2}
 key=5  value= {m.3 n.4 o.5}
 key=7  value={x.6 y.6 z.7}

 As long as your ranges are non-overlapping, you can just stop when you get
 to the first key/value pair that starts after your given time range. If
 your ranges are overlapping then you will have to do a more complicated
 intersection between forward and reverse orderings to efficiently select
 ranges, or maybe use some type of hierarchical range intersection index
 akin to a binary space partitioning tree.

 Cheers,
 Adam
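
A sketch of the read path under that layout follows. It assumes the
rows are zero-padded timestamp strings (so they sort numerically), that
the value is the space-separated letter.timestamp list from the example
(the braces in the example are taken as set notation, not literal
bytes), and that blocks do not overlap -- all assumptions about the
application schema, not anything Accumulo requires.

{{{
import java.util.Map.Entry;

import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class TimeSpanQuery {
  // Rows are keyed by the *last* timestamp of the block they hold, so a scan
  // starting at the query start time lands on the first block that can contain it.
  static void query(Scanner scanner, long start, long end) {
    scanner.setRange(new Range(new Text(pad(start)), (Text) null));  // open-ended; we stop ourselves

    for (Entry<Key, Value> e : scanner) {
      boolean blockEntirelyAfterSpan = true;
      for (String item : new String(e.getValue().get()).split("\\s+")) {
        long ts = Long.parseLong(item.substring(item.indexOf('.') + 1));
        if (ts <= end) {
          blockEntirelyAfterSpan = false;
        }
        if (ts >= start && ts <= end) {
          System.out.println(item);    // inside the requested span
        }
      }
      if (blockEntirelyAfterSpan) {
        break;   // first block that starts after the span: nothing later can match
      }
    }
  }

  // Zero-pad so lexicographic row order matches numeric order (a schema choice).
  static String pad(long ts) {
    return String.format("%019d", ts);
  }
}
}}}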



 On Fri, Jun 29, 2012 at 2:19 PM, Lam dnae...@gmail.com wrote:

 I'm using a timestamp as a key and the value is all the relevant data
 starting at that timestamp up to the timestamp represented by the key
 of the next row.

 When querying, I'm given a time span, consisting of a start and stop
 time.  I want to return all the relevant data within the time span, so
 I need to retrieve the appropriate rows (then filter the data for the
 given timespan).

 Example:
 In Accumulo:  (the format of the value is  letter.timestamp)
 key=1  value= {a.1 b.1 c.2 d.2}
 key=3  value= {m.3 n.4 o.5}
 key=6  value={x.6 y.6 z.7}

 Query:  timespan=[2 4]  (get all data from timestamp 2 to 4 inclusively)

 Desired result: retrieve key=1 and key=3, then filter out a.1, b.1, and
 o.5, and return the rest

 Problem: How do I know to retrieve key=1 and key=3 without scanning
 all the keys?

 Can I create a scanner that looks for the given start key=2 and go to
 the prior row (i.e. key=1)?

 --
 D. Lam





Re: strategies beyond intersecting iterators?

2012-06-28 Thread William Slacum
You're pretty much on the spot regarding two aspects about the current
IntersectingIterator:

1- It's not really extensible (there are hooks for building doc IDs,
but you still need the same `partition term: docId` key structure)
2- Its main strength is that it can do the merges of sorted lists of
doc IDs based on equality expressions (ie, `author==bob and
day==20120627`)

Fortunately, the logic isn't very complicated for re-creating the
merging stuff. Personally, I think it's easy enough to separate the
logic of joining N streams of iterator results from the actual
scanning. Unfortunately, this would be left up to you to do at the
moment :)

You could do range searches by consuming sets of values and sorting
all of the docIds in that range by throwing them into a TreeSet. That
would let you emit doc IDs in a globally sorted order for the given
range of terms. This can get problematic if the range ends up being
very large because your iterator stack may periodically be destroyed
and rebuilt.
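
As a reference point, here is a sketch of what writing that
`partition term: docId` structure looks like -- the key shape the
IntersectingIterator's term sources expect. The table name, partition
count, and partitioning function are made up for the example.

{{{
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class ShardedIndexWriter {
  private static final int NUM_PARTITIONS = 16;   // made up for the example

  // One index entry per term: row = partition, colf = term, colq = docId.
  static void index(Connector conn, String docId, Iterable<String> terms) throws Exception {
    BatchWriter writer = conn.createBatchWriter("shardIndex", 10000000L, 60000L, 2);
    try {
      // All terms of a document land in the same partition, so an intersection
      // within that row only matches documents that really contain every term.
      String partition = String.format("partition%02d",
          Math.abs(docId.hashCode()) % NUM_PARTITIONS);
      Mutation m = new Mutation(new Text(partition));
      for (String term : terms) {
        m.put(new Text(term), new Text(docId), new Value(new byte[0]));
      }
      writer.addMutation(m);
    } finally {
      writer.close();
    }
  }
}
}}}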

On Thu, Jun 28, 2012 at 1:49 PM, Sukant Hajra qn2b6c2...@snkmail.com wrote:
 We're in a position right now, where we have a change list (like a transaction
 log) and we'd like to index the changes by author, but a typical query is:

    Show the last n changes for author Foo Bar

 or

    Show changes after Jan. 1st, 2012 for author Foo Bar

 Certainly, we can denormalize our data to facilitate this lookup.  But the
 idea of using intersecting iterators seems intriguing (to get a modicum of
 data-local server-side joining), though our ideas for shoe-horning the query
 into intersecting iterators seem really wonky or half-baked.  Largely, we're
 running into the restriction that intersecting iterators are based upon
 conjunctions of boolean statements about term equality.  What we'd really
 like is something a little more range-based.  The Accumulo documentation
 alludes to the problem a little:

    If the results are unordered this is quite effective as the first results
    to arrive are as good as any others to the user.

 In our case, order matters because we want the last results without pulling in
 everything.

 We looked at the code for intersecting iterators a little, and noticed that
 there's an inheritance design, but we're not convinced that it's really
 designed for extension and if it is, we're not sure if it can be extended to
 meet our needs gracefully.  If it can, we're really interested in any
 suggestions or prior work.

 Otherwise, we're open to the idea that there are Accumulo features we're
 just not aware of, beyond intersecting iterators, that are a better fit.

 It would be wonderful to have a technique to hedge against over-denormalizing
 our data for every variant of query we have to support.

 Thanks for your help,
 Sukant


Re: [External] Re: accumulo init not working

2012-06-18 Thread William Slacum
Did your NameNode start up correctly?

If on a local instance, you can verify this by running `jps -lm`. If
jps isn't on your path, it should be located in $JAVA_HOME/bin.

If the NameNode is not running, check your Hadoop logs. The log you
want should have namenode in the file name -- it should tell you what
went wrong. I commonly see this when setting up a new instance if I
forget to run `hadoop namenode -format`.

On Mon, Jun 18, 2012 at 11:29 PM, Shrestha, Tejen [USA]
shrestha_te...@bah.com wrote:
 Thank you for the quick reply.  You were right I had downloaded the source
 instead of the dist.
 I ran: mvn package && mvn assembly:single -N, as per the Accumulo README.
  I'm not getting the exception anymore but now I can't get it to connect for
 some reason.  Again, Hadoop and Zookeeper are running fine and this is the
 error that I get after $ACCUMULO/bin/accumulo init

 18 23:04:55,614 [ipc.Client] INFO : Retrying connect to server:
 localhost/127.0.0.1:9000. Already tried 0 time(s).
 18 23:04:56,618 [ipc.Client] INFO : Retrying connect to server:
 localhost/127.0.0.1:9000. Already tried 1 time(s).
 18 23:04:57,620 [ipc.Client] INFO : Retrying connect to server:
 localhost/127.0.0.1:9000. Already tried 2 time(s).
 18 23:04:58,621 [ipc.Client] INFO : Retrying connect to server:
 localhost/127.0.0.1:9000. Already tried 3 time(s).
 18 23:04:59,623 [ipc.Client] INFO : Retrying connect to server:
 localhost/127.0.0.1:9000. Already tried 4 time(s).
 18 23:05:00,625 [ipc.Client] INFO : Retrying connect to server:
 localhost/127.0.0.1:9000. Already tried 5 time(s).
 18 23:05:01,625 [ipc.Client] INFO : Retrying connect to server:
 localhost/127.0.0.1:9000. Already tried 6 time(s).
 18 23:05:02,627 [ipc.Client] INFO : Retrying connect to server:
 localhost/127.0.0.1:9000. Already tried 7 time(s).
 18 23:05:03,629 [ipc.Client] INFO : Retrying connect to server:
 localhost/127.0.0.1:9000. Already tried 8 time(s).
 18 23:05:04,631 [ipc.Client] INFO : Retrying connect to server:
 localhost/127.0.0.1:9000. Already tried 9 time(s).
 18 23:05:04,634 [util.Initialize] FATAL: java.net.ConnectException: Call to
 localhost/127.0.0.1:9000 failed on connection exception:
 java.net.ConnectException: Connection refused
 java.net.ConnectException: Call to localhost/127.0.0.1:9000 failed on
 connection exception: java.net.ConnectException: Connection refused
 at org.apache.hadoop.ipc.Client.wrapException(Client.java:767)
 at org.apache.hadoop.ipc.Client.call(Client.java:743)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
 at $Proxy0.getProtocolVersion(Unknown Source)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
 at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
 at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
 at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:170)
 at
 org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
 at org.apache.accumulo.core.file.FileUtil.getFileSystem(FileUtil.java:554)
 at org.apache.accumulo.server.util.Initialize.main(Initialize.java:426)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.accumulo.start.Main$1.run(Main.java:89)
 at java.lang.Thread.run(Thread.java:680)
 Caused by: java.net.ConnectException: Connection refused
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
 at
 org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
 at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
 at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:304)
 at org.apache.hadoop.ipc.Client$Connection.access$1700(Client.java:176)
 at org.apache.hadoop.ipc.Client.getConnection(Client.java:860)
 at org.apache.hadoop.ipc.Client.call(Client.java:720)
 ... 20 more
 Thread init died null
 java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.accumulo.start.Main$1.run(Main.java:89)
 at java.lang.Thread.run(Thread.java:680)
 Caused by: java.lang.RuntimeException: java.net.ConnectException: Call to
 localhost/127.0.0.1:9000 failed 

Re: Is it possible to use an iterator to aggregate results of a BatchScanner?

2012-06-11 Thread William Slacum
So, is a global sorting order required of your iterator? That's really
the key behavioral difference in terms of output when you're dealing
with a Scanner versus a BatchScanner.

Please correct me if I'm wrong about assuming you're trying to get a
distribution for the column families that appear in a given set of
ranges.

You can count the column qualifiers on a per tablet/row basis server
side using an Accumulo iterator, and as you iterate over your scanner,
you can merge those counts using a map.

{{{
BatchScanner scan = connector.createBatchScanner(...);
// set up a column family counting/skipping iterator

HashMap<Text, AtomicLong> cqCounts = new HashMap<Text, AtomicLong>();

for (Entry<Key, Value> e : scan) {
  AtomicLong cqCount = cqCounts.get(e.getKey().getColumnFamily());
  if (cqCount == null) {
    cqCount = new AtomicLong();
    cqCounts.put(e.getKey().getColumnFamily(), cqCount);
  }
  cqCount.addAndGet(Long.parseLong(new String(e.getValue().get())));
}
}}}

(please excuse any old/deprecated API's used)

On Mon, Jun 11, 2012 at 2:21 PM, Hunter Provyn hun...@ccri.com wrote:
 I have a SkippingIterator that skips entries with cq that it has seen
 before.
 It works on a Scanner, but on a BatchScanner the iterators from different
 threads don't communicate, so results within a single range are unique,
 but across the whole set of ranges they are not.
 I'd prefer to perform the aggregation within the iterators if possible, but
 I don't know how.

 Also, thanks for your previous help, William, Keith, Bob and David.


Re: how to use CountingIterator to count records?

2012-06-06 Thread William Slacum
You're kind of there. Essentially, you can think of your Scanner's
interactions with the TServers as a tree with a height of two. Your
Scanner is the root and its children are all of the TServers it
needs to interact with. The operation you'd want to do is
sum the number of records each of the children has.

In Accumulo terms, you can use something like a CountingIterator to
count the number of results on each TServer. You can then sum all of
those intermediate results to get a total count of results.
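
One way to wire that up with the public iterator API (rather than the
internal CountingIterator) is sketched below against the newer
org.apache.accumulo packages -- the 1.3/Cloudbase class names differ.
The iterator drains each seek'd range and hands back a single entry
whose value is the count; the client then sums the values it gets back
for each range. Draining a whole range inside seek() can bump into
scan-session timeouts on very large ranges, so treat it as an
illustration, not a drop-in.

{{{
import java.io.IOException;
import java.util.Collection;

import org.apache.accumulo.core.data.ByteSequence;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.WrappingIterator;

public class RangeCountSketchIterator extends WrappingIterator {
  private Key topKey;
  private Value topValue;

  @Override
  public void seek(Range range, Collection<ByteSequence> families, boolean inclusive)
      throws IOException {
    super.seek(range, families, inclusive);
    long count = 0;
    Key last = null;
    while (super.hasTop()) {        // drain the source for this range, counting as we go
      last = new Key(super.getTopKey());
      count++;
      super.next();
    }
    if (last == null) {
      topKey = null;                // nothing in this range
      topValue = null;
    } else {
      topKey = last;                // report the count under the last key seen
      topValue = new Value(Long.toString(count).getBytes());
    }
  }

  @Override
  public boolean hasTop() { return topKey != null; }

  @Override
  public Key getTopKey() { return topKey; }

  @Override
  public Value getTopValue() { return topValue; }

  @Override
  public void next() { topKey = null; topValue = null; }  // one entry per seek'd range
}
}}}

Client side, summing is just Long.parseLong(new String(e.getValue().get()))
added up over whatever the scanner returns.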

On Wed, Jun 6, 2012 at 10:39 AM, Hunter Provyn hun...@ccri.com wrote:
 I want to know the number of records a scanner has without actually getting
 the records from cloudbase.
 I've been looking at CountingIterator (1.3.4), which has a getCount()
 method.  However, I don't know how
 to access the instance to call getCount() on it, because the Cloudbase server
 just passes back the entries and doesn't expose the instance of the
 iterator.

 It is possible to use an AggregatingIterator to aggregate all entries into a
 single entry whose value is the number of entries.  But I was wondering if
 there was a better way that possibly makes use of the CountingIterator
 class.


