Re: Place a fat Jar file in ACCUMULO_HOME/lib/ext

2017-05-19 Thread Sven Hodapp
My accumulo-site.xml looks like this:

  
  <property>
    <name>general.classpaths</name>
    <value>
      <!-- Accumulo requirements -->
      $ACCUMULO_HOME/lib/accumulo-server.jar,
      $ACCUMULO_HOME/lib/accumulo-core.jar,
      $ACCUMULO_HOME/lib/accumulo-start.jar,
      $ACCUMULO_HOME/lib/accumulo-fate.jar,
      $ACCUMULO_HOME/lib/accumulo-proxy.jar,
      $ACCUMULO_HOME/lib/[^.].*.jar,
      <!-- ZooKeeper requirements -->
      $ZOOKEEPER_HOME/zookeeper[^.].*.jar,
      <!-- Common Hadoop requirements -->
      $HADOOP_CONF_DIR,
      <!-- Hadoop 2 requirements -->
      $HADOOP_PREFIX/share/hadoop/common/[^.].*.jar,
      $HADOOP_PREFIX/share/hadoop/common/lib/(?!slf4j)[^.].*.jar,
      $HADOOP_PREFIX/share/hadoop/hdfs/[^.].*.jar,
      $HADOOP_PREFIX/share/hadoop/mapreduce/[^.].*.jar,
      $HADOOP_PREFIX/share/hadoop/yarn/[^.].*.jar,
      $HADOOP_PREFIX/share/hadoop/yarn/lib/jersey.*.jar,
    </value>
    <description>Classpaths that Accumulo checks for updates and class files.</description>
  </property>
I think it's a standard general.classpaths configuration.

Regards,
Sven

-- 
Sven Hodapp, M.Sc.,
Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
Department of Bioinformatics
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
sven.hod...@scai.fraunhofer.de
www.scai.fraunhofer.de

- Original Message -
> From: "Dave Marion" <dlmar...@comcast.net>
> To: "Sven Hodapp" <sven.hod...@scai.fraunhofer.de>
> CC: "user" <user@accumulo.apache.org>
> Sent: Thursday, 18 May 2017 16:38:34
> Subject: Re: Place a fat Jar file in ACCUMULO_HOME/lib/ext

> What does your accumulo-site.xml file look like with respect to the 
> classloader
> setup?
> 
>> On May 18, 2017 at 9:35 AM Sven Hodapp <sven.hod...@scai.fraunhofer.de> 
>> wrote:
>>
>>
>> Hi Dave,
>>
>> thanks for your reply! No, the Jar file is perfectly fine.
>>
>> I've tried some other options and found a nice SBT (http://www.scala-sbt.org/)
>> plugin (https://github.com/xerial/sbt-pack) which collects all dependent
>> "thin" Jar files, including scala-library.jar, into `target/pack/lib`.
>>
>> Now I can just copy the Jars from the `target/pack/lib` folder into
>> `ACCUMULO_HOME/lib/ext` and Accumulo deploys this code without errors.
>>
>> Regards,
>> Sven
>>
>> --
>> Sven Hodapp, M.Sc.,
>> Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
>> Department of Bioinformatics
>> Schloss Birlinghoven, 53754 Sankt Augustin, Germany
>> sven.hod...@scai.fraunhofer.de
>> www.scai.fraunhofer.de
>>
>> - Original Message -
>> > From: "Dave Marion" <dlmar...@comcast.net>
>> > To: "Sven Hodapp" <sven.hod...@scai.fraunhofer.de>, "user"
>> > <user@accumulo.apache.org>
>> > Sent: Thursday, 18 May 2017 15:13:27
>> > Subject: Re: Place a fat Jar file in ACCUMULO_HOME/lib/ext
>>
>> > The cause makes me think that the jar is corrupt. Can you unzip the fat 
>> > jar on
>> > the command line?
>> >
>> > Caused by: java.util.zip.ZipException: error in opening zip file
>> > at java.util.zip.ZipFile.open(Native Method)
>> > at java.util.zip.ZipFile.<init>(ZipFile.java:219)
>> > at java.util.zip.ZipFile.<init>(ZipFile.java:149)
>> > at java.util.jar.JarFile.<init>(JarFile.java:166)
>> > at java.util.jar.JarFile.<init>(JarFile.java:130)
>> >
>> >> On May 18, 2017 at 8:50 AM Sven Hodapp <sven.hod...@scai.fraunhofer.de> 
>> >> wrote:
>> >>
>> >>
>> >> Hi everyone,
>> >>
>> I've tried to deploy my Iterator suite together with its dependencies in one
>> single (fat) Jar file.
>> But then I get errors like these:
>> >>
>> >> [vfs.AccumuloReloadingVFSClassLoader] ERROR: Could not open Jar file
>> >> "/export/accumulo/install/accumulo-1.8.0/lib/ext/my.jar".
>> >> org.apache.commons.vfs2.FileSystemException: Could not open Jar file
>> >> "/export/accumulo/install/accumulo-1.8.0/lib/ext/my.jar".
>> >> at org.apache.commons.vfs2.provider.jar.JarFileSystem.createZipFile(JarFileSystem.java:66)
>> >> at org.apache.commons.vfs2.provider.zip.ZipFileSystem.getZipFile(ZipFileSystem.java:141)
>> >> at org.apache.commons.vfs2.provider.jar.JarFileSystem.getZipFile(JarFileSystem.java:219)
>> >> at org.apache.commons.vfs2.provider.zip.ZipFileSystem.init(ZipFileSystem.java:87)
>> >> at org.apache.commons.vfs2.provider.AbstractVfsContainer.addComponent(AbstractVfsContainer.java:56)
>> 

Re: Place a fat Jar file in ACCUMULO_HOME/lib/ext

2017-05-18 Thread Sven Hodapp
Hi Dave,

thanks for your reply! No, the Jar file is perfectly fine.

I've tried some other options and found a nice SBT (http://www.scala-sbt.org/) 
plugin (https://github.com/xerial/sbt-pack) which collects all dependent 
"thin" Jar files, including scala-library.jar, into `target/pack/lib`.

Now I can just copy the Jars from the `target/pack/lib` folder into 
`ACCUMULO_HOME/lib/ext` and Accumulo deploys this code without errors.

Regards,
Sven

-- 
Sven Hodapp, M.Sc.,
Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
Department of Bioinformatics
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
sven.hod...@scai.fraunhofer.de
www.scai.fraunhofer.de

- Original Message -
> From: "Dave Marion" <dlmar...@comcast.net>
> To: "Sven Hodapp" <sven.hod...@scai.fraunhofer.de>, "user" 
> <user@accumulo.apache.org>
> Sent: Thursday, 18 May 2017 15:13:27
> Subject: Re: Place a fat Jar file in ACCUMULO_HOME/lib/ext

> The cause makes me think that the jar is corrupt. Can you unzip the fat jar on
> the command line?
> 
> Caused by: java.util.zip.ZipException: error in opening zip file
> at java.util.zip.ZipFile.open(Native Method)
> at java.util.zip.ZipFile.<init>(ZipFile.java:219)
> at java.util.zip.ZipFile.<init>(ZipFile.java:149)
> at java.util.jar.JarFile.<init>(JarFile.java:166)
> at java.util.jar.JarFile.<init>(JarFile.java:130)
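Dave's suggestion to unzip the jar by hand can also be checked programmatically. The root cause in the trace is thrown by `java.util.zip.ZipFile`, so a minimal sketch (class name and default path are placeholders, not from the thread) is to open the archive with that same class:

```java
import java.util.zip.ZipFile;

// Minimal integrity check for a fat jar: java.util.zip.ZipFile is the same
// class that fails in the stack trace above, so if this opens cleanly the
// archive itself is readable. Class name and default path are placeholders.
public class JarCheck {
    public static void main(String[] args) throws Exception {
        String path = args.length > 0 ? args[0] : "my.jar";
        try (ZipFile zf = new ZipFile(path)) {
            // ZipFile.<init> throws ZipException on a corrupt archive
            System.out.println(path + ": " + zf.size() + " entries, archive is readable");
        }
    }
}
```

Equivalently, `unzip -t my.jar` or `jar tf my.jar` on the command line. Note that a jar can be a perfectly valid zip and still fail to load for other reasons, as Sven reports later in the thread.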
> 
>> On May 18, 2017 at 8:50 AM Sven Hodapp <sven.hod...@scai.fraunhofer.de> 
>> wrote:
>>
>>
>> Hi everyone,
>>
>> I've tried to deploy my Iterator suite together with its dependencies in
>> one
>> single (fat) Jar file.
>> But then I get errors like these:
>>
>> [vfs.AccumuloReloadingVFSClassLoader] ERROR: Could not open Jar file
>> "/export/accumulo/install/accumulo-1.8.0/lib/ext/my.jar".
>> org.apache.commons.vfs2.FileSystemException: Could not open Jar file
>> "/export/accumulo/install/accumulo-1.8.0/lib/ext/my.jar".
>> at org.apache.commons.vfs2.provider.jar.JarFileSystem.createZipFile(JarFileSystem.java:66)
>> at org.apache.commons.vfs2.provider.zip.ZipFileSystem.getZipFile(ZipFileSystem.java:141)
>> at org.apache.commons.vfs2.provider.jar.JarFileSystem.getZipFile(JarFileSystem.java:219)
>> at org.apache.commons.vfs2.provider.zip.ZipFileSystem.init(ZipFileSystem.java:87)
>> at org.apache.commons.vfs2.provider.AbstractVfsContainer.addComponent(AbstractVfsContainer.java:56)
>> at org.apache.commons.vfs2.provider.AbstractFileProvider.addFileSystem(AbstractFileProvider.java:108)
>> at org.apache.commons.vfs2.provider.AbstractLayeredFileProvider.createFileSystem(AbstractLayeredFileProvider.java:88)
>> at org.apache.commons.vfs2.impl.DefaultFileSystemManager.createFileSystem(DefaultFileSystemManager.java:1022)
>> at org.apache.commons.vfs2.impl.DefaultFileSystemManager.createFileSystem(DefaultFileSystemManager.java:1042)
>> at org.apache.commons.vfs2.impl.VFSClassLoader.addFileObjects(VFSClassLoader.java:156)
>> at org.apache.commons.vfs2.impl.VFSClassLoader.<init>(VFSClassLoader.java:119)
>> at org.apache.accumulo.start.classloader.vfs.AccumuloReloadingVFSClassLoader$2.run(AccumuloReloadingVFSClassLoader.java:85)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:748)
>> Caused by: java.util.zip.ZipException: error in opening zip file
>> at java.util.zip.ZipFile.open(Native Method)
>> at java.util.zip.ZipFile.<init>(ZipFile.java:219)
>> at java.util.zip.ZipFile.<init>(ZipFile.java:149)
>> at java.util.jar.JarFile.<init>(JarFile.java:166)
>> at java.util.jar.JarFile.<init>(JarFile.java:130)
>> at org.apache.commons.vfs2.provider.jar.JarFileSystem.createZipFile(JarFileSystem.java:62)
>>
>> If I place "thin" Jars into the lib/ext folder, there are no problems.
>> But it is cumbersome to manually disassemble the dependency tree into
>> (many) "thin" Jar files...
>>
>> Does anybody have an idea how to fix this?
>>
>> Thanks and kind regards,
>> Sven
>>
>> --
>> Sven Hodapp, M.Sc.,
>> Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
>> Department of Bioinformatics
>> Schloss Birlinghoven, 53754 Sankt Augustin, Germany
>> sven.hod...@scai.fraunhofer.de
>> www.scai.fraunhofer.de


Place a fat Jar file in ACCUMULO_HOME/lib/ext

2017-05-18 Thread Sven Hodapp
Hi everyone,

I've tried to deploy my Iterator suite together with its dependencies in one 
single (fat) Jar file.
But then I get errors like these:

[vfs.AccumuloReloadingVFSClassLoader] ERROR: Could not open Jar file 
"/export/accumulo/install/accumulo-1.8.0/lib/ext/my.jar".
org.apache.commons.vfs2.FileSystemException: Could not open Jar file 
"/export/accumulo/install/accumulo-1.8.0/lib/ext/my.jar".
    at org.apache.commons.vfs2.provider.jar.JarFileSystem.createZipFile(JarFileSystem.java:66)
    at org.apache.commons.vfs2.provider.zip.ZipFileSystem.getZipFile(ZipFileSystem.java:141)
    at org.apache.commons.vfs2.provider.jar.JarFileSystem.getZipFile(JarFileSystem.java:219)
    at org.apache.commons.vfs2.provider.zip.ZipFileSystem.init(ZipFileSystem.java:87)
    at org.apache.commons.vfs2.provider.AbstractVfsContainer.addComponent(AbstractVfsContainer.java:56)
    at org.apache.commons.vfs2.provider.AbstractFileProvider.addFileSystem(AbstractFileProvider.java:108)
    at org.apache.commons.vfs2.provider.AbstractLayeredFileProvider.createFileSystem(AbstractLayeredFileProvider.java:88)
    at org.apache.commons.vfs2.impl.DefaultFileSystemManager.createFileSystem(DefaultFileSystemManager.java:1022)
    at org.apache.commons.vfs2.impl.DefaultFileSystemManager.createFileSystem(DefaultFileSystemManager.java:1042)
    at org.apache.commons.vfs2.impl.VFSClassLoader.addFileObjects(VFSClassLoader.java:156)
    at org.apache.commons.vfs2.impl.VFSClassLoader.<init>(VFSClassLoader.java:119)
    at org.apache.accumulo.start.classloader.vfs.AccumuloReloadingVFSClassLoader$2.run(AccumuloReloadingVFSClassLoader.java:85)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.zip.ZipException: error in opening zip file
    at java.util.zip.ZipFile.open(Native Method)
    at java.util.zip.ZipFile.<init>(ZipFile.java:219)
    at java.util.zip.ZipFile.<init>(ZipFile.java:149)
    at java.util.jar.JarFile.<init>(JarFile.java:166)
    at java.util.jar.JarFile.<init>(JarFile.java:130)
    at org.apache.commons.vfs2.provider.jar.JarFileSystem.createZipFile(JarFileSystem.java:62)

If I place "thin" Jars into the lib/ext folder, there are no problems.
But it is cumbersome to manually disassemble the dependency tree into 
(many) "thin" Jar files...

Does anybody have an idea how to fix this?

Thanks and kind regards,
Sven

-- 
Sven Hodapp, M.Sc.,
Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
Department of Bioinformatics
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
sven.hod...@scai.fraunhofer.de
www.scai.fraunhofer.de


Re: Accumulo Seek performance

2016-08-31 Thread Sven Hodapp
Hi Keith,

I've tried it with 1, 2, and 10 threads. Unfortunately there were no significant 
differences.
Maybe it's a problem with the table structure? For example, it may happen that 
one row id (e.g. a sentence) has several thousand column families. Can this 
affect the seek performance?

For my initial example there are about 3000 row ids to seek, which return 
about 500k entries. If I filter for specific column families (e.g. a document 
without annotations) it returns about 5k entries, but the seek time is only 
halved.
Are there too many column families to seek quickly?

Thanks!

Regards,
Sven

-- 
Sven Hodapp, M.Sc.,
Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
Department of Bioinformatics
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
sven.hod...@scai.fraunhofer.de
www.scai.fraunhofer.de

- Original Message -
> From: "Keith Turner" <ke...@deenlo.com>
> To: "user" <user@accumulo.apache.org>
> Sent: Monday, 29 August 2016 22:37:32
> Subject: Re: Accumulo Seek performance

> On Wed, Aug 24, 2016 at 9:22 AM, Sven Hodapp
> <sven.hod...@scai.fraunhofer.de> wrote:
>> Hi there,
>>
>> currently we're experimenting with a two-node Accumulo cluster (two tablet
>> servers) set up for document storage.
>> These documents are decomposed down to the sentence level.
>>
>> Now I'm using a BatchScanner to assemble the full document like this:
>>
>> val bscan = instance.createBatchScanner(ARTIFACTS, auths, 10) // the ARTIFACTS table currently hosts ~30GB data, ~200M entries on ~45 tablets
>> bscan.setRanges(ranges) // there are like 3000 Range.exact's in the ranges list
>> for (entry <- bscan.asScala) yield {
>>   val key = entry.getKey()
>>   val value = entry.getValue()
>>   // etc.
>> }
>>
>> For larger full documents (e.g. 3000 exact ranges), this operation takes
>> about 12 seconds.
>> But shorter documents are assembled blazingly fast...
>>
>> Is that too much for a BatchScanner / am I misusing the BatchScanner?
>> Is that a normal time for such a (seek) operation?
>> Can I do something to get better seek performance?
> 
> How many threads did you configure the batch scanner with and did you
> try varying this?
> 
>>
>> Note: I have already enabled bloom filtering on that table.
>>
>> Thank you for any advice!
>>
>> Regards,
>> Sven
>>
>> --
>> Sven Hodapp, M.Sc.,
>> Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
>> Department of Bioinformatics
>> Schloss Birlinghoven, 53754 Sankt Augustin, Germany
>> sven.hod...@scai.fraunhofer.de
>> www.scai.fraunhofer.de


Re: Accumulo Seek performance

2016-08-25 Thread Sven Hodapp
Hi Dave,

toList will exhaust the iterator. But all 6 iterators will be concurrently 
exhausted within the Future object 
(http://docs.scala-lang.org/overviews/core/futures.html).

Regards,
Sven

-- 
Sven Hodapp, M.Sc.,
Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
Department of Bioinformatics
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
sven.hod...@scai.fraunhofer.de
www.scai.fraunhofer.de

- Original Message -
> From: dlmar...@comcast.net
> To: "user" <user@accumulo.apache.org>
> Sent: Thursday, 25 August 2016 16:22:35
> Subject: Re: Accumulo Seek performance

> But does toList exhaust the first iterator() before going to the next?
> 
> - Dave
> 
> 
> - Original Message -
> 
> From: "Sven Hodapp" <sven.hod...@scai.fraunhofer.de>
> To: "user" <user@accumulo.apache.org>
> Sent: Thursday, August 25, 2016 9:42:00 AM
> Subject: Re: Accumulo Seek performance
> 
> Hi dlmarion,
> 
> toList should also call iterator(), and that is done independently for each
> batch scanner iterator in the context of the Future.
> 
> Regards,
> Sven
> 
> --
> Sven Hodapp, M.Sc.,
> Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
> Department of Bioinformatics
> Schloss Birlinghoven, 53754 Sankt Augustin, Germany
> sven.hod...@scai.fraunhofer.de
> www.scai.fraunhofer.de
> 
> - Original Message -
>> From: dlmar...@comcast.net
>> To: "user" <user@accumulo.apache.org>
>> Sent: Thursday, 25 August 2016 14:34:39
>> Subject: Re: Accumulo Seek performance
> 
>> Calling BatchScanner.iterator() is what starts the work on the server side. 
>> You
>> should do this first for all 6 batch scanners, then iterate over all of them 
>> in
>> parallel.
>> 
>> - Original Message -
>> 
>> From: "Sven Hodapp" <sven.hod...@scai.fraunhofer.de>
>> To: "user" <user@accumulo.apache.org>
>> Sent: Thursday, August 25, 2016 4:53:41 AM
>> Subject: Re: Accumulo Seek performance
>> 
>> Hi,
>> 
>> I've changed the code a little bit, so that it uses a thread pool (via the
>> Future):
>> 
>> val ranges500 = ranges.asScala.grouped(500) // this means 6 BatchScanners will be created
>> 
>> for (ranges <- ranges500) {
>>   val bscan = instance.createBatchScanner(ARTIFACTS, auths, 2)
>>   bscan.setRanges(ranges.asJava)
>>   Future {
>>     time("mult-scanner") {
>>       bscan.asScala.toList // toList forces the iteration of the iterator
>>     }
>>   }
>> }
>> 
>> Here are the results:
>> 
>> background log: info: mult-scanner time: 4807.289358 ms
>> background log: info: mult-scanner time: 4930.996522 ms
>> background log: info: mult-scanner time: 9510.010808 ms
>> background log: info: mult-scanner time: 11394.152391 ms
>> background log: info: mult-scanner time: 13297.247295 ms
>> background log: info: mult-scanner time: 14032.704837 ms
>> 
>> background log: info: single-scanner time: 15322.624393 ms
>> 
>> Every Future completes independently, but in return every batch scanner
>> iterator needs more time to complete. :(
>> Does this mean the batch scanners aren't really processed in parallel on the
>> server side?
>> Should I reconfigure something? Maybe the tablet servers haven't/can't allocate
>> enough threads or memory? (Each of the two nodes has 8 cores and 64GB of memory
>> and storage with ~300MB/s...)
>> 
>> Regards,
>> Sven
>> 
>> --
>> Sven Hodapp, M.Sc.,
>> Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
>> Department of Bioinformatics
>> Schloss Birlinghoven, 53754 Sankt Augustin, Germany
>> sven.hod...@scai.fraunhofer.de
>> www.scai.fraunhofer.de
>> 
>> - Original Message -
>>> From: "Josh Elser" <josh.el...@gmail.com>
>>> To: "user" <user@accumulo.apache.org>
>>> Sent: Wednesday, 24 August 2016 18:36:42
>>> Subject: Re: Accumulo Seek performance
>> 
>>> Ahh duh. Bad advice from me in the first place :)
>>> 
>>> Throw 'em in a threadpool locally.
>>> 
>>> dlmar...@comcast.net wrote:
>>>> Doesn't this use the 6 batch scanners serially?
>>>> 
>>>> 
>>>> *From: *"Sven Hodapp" <sven.hod...@scai.fraunhofer.de>
>>>> *To: *"user" <user@accumulo.apache.org>

Re: Accumulo Seek performance

2016-08-25 Thread Sven Hodapp
Hi dlmarion,

toList should also call iterator(), and that is done independently for each 
batch scanner iterator in the context of the Future.

Regards,
Sven

-- 
Sven Hodapp, M.Sc.,
Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
Department of Bioinformatics
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
sven.hod...@scai.fraunhofer.de
www.scai.fraunhofer.de

- Original Message -
> From: dlmar...@comcast.net
> To: "user" <user@accumulo.apache.org>
> Sent: Thursday, 25 August 2016 14:34:39
> Subject: Re: Accumulo Seek performance

> Calling BatchScanner.iterator() is what starts the work on the server side. 
> You
> should do this first for all 6 batch scanners, then iterate over all of them 
> in
> parallel.
> 
> ----- Original Message -
> 
> From: "Sven Hodapp" <sven.hod...@scai.fraunhofer.de>
> To: "user" <user@accumulo.apache.org>
> Sent: Thursday, August 25, 2016 4:53:41 AM
> Subject: Re: Accumulo Seek performance
> 
> Hi,
> 
> I've changed the code a little bit, so that it uses a thread pool (via the
> Future):
> 
> val ranges500 = ranges.asScala.grouped(500) // this means 6 BatchScanners will be created
> 
> for (ranges <- ranges500) {
>   val bscan = instance.createBatchScanner(ARTIFACTS, auths, 2)
>   bscan.setRanges(ranges.asJava)
>   Future {
>     time("mult-scanner") {
>       bscan.asScala.toList // toList forces the iteration of the iterator
>     }
>   }
> }
> 
> Here are the results:
> 
> background log: info: mult-scanner time: 4807.289358 ms
> background log: info: mult-scanner time: 4930.996522 ms
> background log: info: mult-scanner time: 9510.010808 ms
> background log: info: mult-scanner time: 11394.152391 ms
> background log: info: mult-scanner time: 13297.247295 ms
> background log: info: mult-scanner time: 14032.704837 ms
> 
> background log: info: single-scanner time: 15322.624393 ms
> 
> Every Future completes independently, but in return every batch scanner
> iterator needs more time to complete. :(
> Does this mean the batch scanners aren't really processed in parallel on the
> server side?
> Should I reconfigure something? Maybe the tablet servers haven't/can't allocate
> enough threads or memory? (Each of the two nodes has 8 cores and 64GB of memory
> and storage with ~300MB/s...)
> 
> Regards,
> Sven
> 
> --
> Sven Hodapp, M.Sc.,
> Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
> Department of Bioinformatics
> Schloss Birlinghoven, 53754 Sankt Augustin, Germany
> sven.hod...@scai.fraunhofer.de
> www.scai.fraunhofer.de
> 
> - Original Message -
>> From: "Josh Elser" <josh.el...@gmail.com>
>> To: "user" <user@accumulo.apache.org>
>> Sent: Wednesday, 24 August 2016 18:36:42
>> Subject: Re: Accumulo Seek performance
> 
>> Ahh duh. Bad advice from me in the first place :)
>> 
>> Throw 'em in a threadpool locally.
>> 
>> dlmar...@comcast.net wrote:
>>> Doesn't this use the 6 batch scanners serially?
>>> 
>>> 
>>> *From: *"Sven Hodapp" <sven.hod...@scai.fraunhofer.de>
>>> *To: *"user" <user@accumulo.apache.org>
>>> *Sent: *Wednesday, August 24, 2016 11:56:14 AM
>>> *Subject: *Re: Accumulo Seek performance
>>> 
>>> Hi Josh,
>>> 
>>> thanks for your reply!
>>> 
>>> I've tested your suggestion with an implementation like this:
>>> 
>>> val ranges500 = ranges.asScala.grouped(500) // this means 6 BatchScanners will be created
>>> 
>>> time("mult-scanner") {
>>>   for (ranges <- ranges500) {
>>>     val bscan = instance.createBatchScanner(ARTIFACTS, auths, 1)
>>>     bscan.setRanges(ranges.asJava)
>>>     for (entry <- bscan.asScala) yield {
>>>       entry.getKey()
>>>     }
>>>   }
>>> }
>>> 
>>> And the result is a bit disappointing:
>>> 
>>> background log: info: mult-scanner time: 18064.969281 ms
>>> background log: info: single-scanner time: 6527.482383 ms
>>> 
>>> Am I doing something wrong here?
>>> 
>>> 
>>> Regards,
>>> Sven
>>> 
>>> --
>>> Sven Hodapp, M.Sc.,
>>> Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
>>> Department of Bioinformatics
>>> Schloss Birlinghoven, 53754 Sankt Augustin, Germany
>>> sven.hod...@scai.fraunhofer.de
>>> www.scai.fraunhofer.de

Re: Accumulo Seek performance

2016-08-25 Thread Sven Hodapp
Hi,

I've changed the code a little bit, so that it uses a thread pool (via the 
Future):

val ranges500 = ranges.asScala.grouped(500)  // this means 6 BatchScanners will be created

for (ranges <- ranges500) {
  val bscan = instance.createBatchScanner(ARTIFACTS, auths, 2)
  bscan.setRanges(ranges.asJava)
  Future {
time("mult-scanner") {
  bscan.asScala.toList  // toList forces the iteration of the iterator
}
  }
}

Here are the results:

background log: info: mult-scanner time: 4807.289358 ms
background log: info: mult-scanner time: 4930.996522 ms
background log: info: mult-scanner time: 9510.010808 ms
background log: info: mult-scanner time: 11394.152391 ms
background log: info: mult-scanner time: 13297.247295 ms
background log: info: mult-scanner time: 14032.704837 ms

background log: info: single-scanner time: 15322.624393 ms

Every Future completes independently, but in return every batch scanner iterator 
needs more time to complete. :(
Does this mean the batch scanners aren't really processed in parallel on the server 
side?
Should I reconfigure something? Maybe the tablet servers haven't/can't allocate 
enough threads or memory? (Each of the two nodes has 8 cores and 64GB of memory 
and storage with ~300MB/s...)
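The client-side pattern being measured here (one task per BatchScanner, all iterators exhausted concurrently on a thread pool) can be reproduced as a self-contained sketch. Stand-in string iterators replace real BatchScanner results, and all names are hypothetical; it demonstrates only the client-side concurrency, since as the timings show the overall wall time is still bounded by how much parallelism the tablet servers themselves provide:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Drain several result iterators concurrently on a fixed-size thread pool,
// mirroring "one Future per BatchScanner". Stand-ins replace Accumulo types.
public class ParallelDrain {
    public static int drainAll(List<Iterator<String>> iterators, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger entriesSeen = new AtomicInteger();
        List<Future<?>> pending = new ArrayList<>();
        for (Iterator<String> it : iterators) {
            pending.add(pool.submit(() -> {
                while (it.hasNext()) {          // iterating is what pulls
                    it.next();                  // results from the server
                    entriesSeen.incrementAndGet();
                }
            }));
        }
        for (Future<?> f : pending) f.get();    // wait for every scanner
        pool.shutdown();
        return entriesSeen.get();
    }

    public static void main(String[] args) throws Exception {
        List<Iterator<String>> scanners = new ArrayList<>();
        for (int i = 0; i < 6; i++) {
            scanners.add(java.util.Arrays.asList("k1", "k2", "k3").iterator());
        }
        System.out.println(drainAll(scanners, 6) + " entries"); // 18 entries
    }
}
```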

Regards,
Sven

-- 
Sven Hodapp, M.Sc.,
Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
Department of Bioinformatics
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
sven.hod...@scai.fraunhofer.de
www.scai.fraunhofer.de

- Original Message -
> From: "Josh Elser" <josh.el...@gmail.com>
> To: "user" <user@accumulo.apache.org>
> Sent: Wednesday, 24 August 2016 18:36:42
> Subject: Re: Accumulo Seek performance

> Ahh duh. Bad advice from me in the first place :)
> 
> Throw 'em in a threadpool locally.
> 
> dlmar...@comcast.net wrote:
>> Doesn't this use the 6 batch scanners serially?
>>
>> ----
>> *From: *"Sven Hodapp" <sven.hod...@scai.fraunhofer.de>
>> *To: *"user" <user@accumulo.apache.org>
>> *Sent: *Wednesday, August 24, 2016 11:56:14 AM
>> *Subject: *Re: Accumulo Seek performance
>>
>> Hi Josh,
>>
>> thanks for your reply!
>>
>> I've tested your suggestion with an implementation like this:
>>
>> val ranges500 = ranges.asScala.grouped(500) // this means 6 BatchScanners will be created
>>
>> time("mult-scanner") {
>>   for (ranges <- ranges500) {
>>     val bscan = instance.createBatchScanner(ARTIFACTS, auths, 1)
>>     bscan.setRanges(ranges.asJava)
>>     for (entry <- bscan.asScala) yield {
>>       entry.getKey()
>>     }
>>   }
>> }
>>
>> And the result is a bit disappointing:
>>
>> background log: info: mult-scanner time: 18064.969281 ms
>> background log: info: single-scanner time: 6527.482383 ms
>>
>> Am I doing something wrong here?
>>
>>
>> Regards,
>> Sven
>>
>> --
>> Sven Hodapp, M.Sc.,
>> Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
>> Department of Bioinformatics
>> Schloss Birlinghoven, 53754 Sankt Augustin, Germany
>> sven.hod...@scai.fraunhofer.de
>> www.scai.fraunhofer.de
>>
>> - Original Message -
>>  > From: "Josh Elser" <josh.el...@gmail.com>
>>  > To: "user" <user@accumulo.apache.org>
>>  > Sent: Wednesday, 24 August 2016 16:33:37
>>  > Subject: Re: Accumulo Seek performance
>>
>>  > This reminded me of https://issues.apache.org/jira/browse/ACCUMULO-3710
>>  >
>>  > I don't feel like 3000 ranges is too many, but this isn't quantitative.
>>  >
>>  > IIRC, the BatchScanner will take each Range you provide, bin each Range
>>  > to the TabletServer(s) currently hosting the corresponding data, clip
>>  > (truncate) each Range to match the Tablet boundaries, and then does an
>>  > RPC to each TabletServer with just the Ranges hosted there.
>>  >
>>  > Inside the TabletServer, it will then have many Ranges, binned by Tablet
>>  > (KeyExtent, to be precise). This will spawn a
>>  > org.apache.accumulo.tserver.scan.LookupTask which will start collecting
>>  > results to send back to the client.
>>  >
>>  > The caveat here is that those ranges are processed serially on a
>>  > TabletServer. Maybe, you're swamping one TabletServer with lots of
>>  > Ranges that it could be processing in parallel.
>>  >
>>  > Could you experiment with using multiple BatchScanners and something
>>  > like Guava's Iterables.concat to make it appear like one Iterator?

Re: Accumulo Seek performance

2016-08-24 Thread Sven Hodapp
Hi Josh,

thanks for your reply!

I've tested your suggestion with an implementation like this:

val ranges500 = ranges.asScala.grouped(500)  // this means 6 BatchScanners will be created

time("mult-scanner") {
  for (ranges <- ranges500) {
val bscan = instance.createBatchScanner(ARTIFACTS, auths, 1)
bscan.setRanges(ranges.asJava)
for (entry <- bscan.asScala) yield {
  entry.getKey()
}
  }
}

And the result is a bit disappointing:

background log: info: mult-scanner time: 18064.969281 ms
background log: info: single-scanner time: 6527.482383 ms

Am I doing something wrong here?
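For reference, the `ranges.asScala.grouped(500)` call used above just splits the range list into fixed-size chunks, one chunk per BatchScanner. A sketch of the same batching in plain Java (integers standing in for Accumulo `Range` objects; the class name is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of Scala's grouped(n): split a list into chunks of at most n
// elements, one chunk per BatchScanner. The class name is hypothetical.
public class Batching {
    public static <T> List<List<T>> grouped(List<T> items, int size) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += size) {
            // subList is a view; copy it so batches outlive changes to items
            batches.add(new ArrayList<>(items.subList(i, Math.min(i + size, items.size()))));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> ranges = new ArrayList<>();
        for (int i = 0; i < 3000; i++) ranges.add(i); // stand-ins for Range.exact's
        System.out.println(grouped(ranges, 500).size() + " batches"); // 6 batches
    }
}
```

With ~3000 ranges and a chunk size of 500 this yields the 6 batches, and hence the 6 BatchScanners, discussed in this thread.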


Regards,
Sven

-- 
Sven Hodapp, M.Sc.,
Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
Department of Bioinformatics
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
sven.hod...@scai.fraunhofer.de
www.scai.fraunhofer.de

- Original Message -
> From: "Josh Elser" <josh.el...@gmail.com>
> To: "user" <user@accumulo.apache.org>
> Sent: Wednesday, 24 August 2016 16:33:37
> Subject: Re: Accumulo Seek performance

> This reminded me of https://issues.apache.org/jira/browse/ACCUMULO-3710
> 
> I don't feel like 3000 ranges is too many, but this isn't quantitative.
> 
> IIRC, the BatchScanner will take each Range you provide, bin each Range
> to the TabletServer(s) currently hosting the corresponding data, clip
> (truncate) each Range to match the Tablet boundaries, and then does an
> RPC to each TabletServer with just the Ranges hosted there.
> 
> Inside the TabletServer, it will then have many Ranges, binned by Tablet
> (KeyExtent, to be precise). This will spawn a
> org.apache.accumulo.tserver.scan.LookupTask which will start collecting
> results to send back to the client.
> 
> The caveat here is that those ranges are processed serially on a
> TabletServer. Maybe, you're swamping one TabletServer with lots of
> Ranges that it could be processing in parallel.
> 
> Could you experiment with using multiple BatchScanners and something
> like Guava's Iterables.concat to make it appear like one Iterator?
> 
> I'm curious if we should put an optimization into the BatchScanner
> itself to limit the number of ranges we send in one RPC to a
> TabletServer (e.g. one BatchScanner might open multiple
> MultiScanSessions to a TabletServer).
> 
> Sven Hodapp wrote:
>> Hi there,
>>
>> currently we're experimenting with a two-node Accumulo cluster (two tablet
>> servers) set up for document storage.
>> These documents are decomposed down to the sentence level.
>>
>> Now I'm using a BatchScanner to assemble the full document like this:
>>
>> val bscan = instance.createBatchScanner(ARTIFACTS, auths, 10) // the ARTIFACTS table currently hosts ~30GB data, ~200M entries on ~45 tablets
>> bscan.setRanges(ranges) // there are like 3000 Range.exact's in the ranges list
>> for (entry <- bscan.asScala) yield {
>>   val key = entry.getKey()
>>   val value = entry.getValue()
>>   // etc.
>> }
>>
>> For larger full documents (e.g. 3000 exact ranges), this operation takes
>> about 12 seconds.
>> But shorter documents are assembled blazingly fast...
>>
>> Is that too much for a BatchScanner / am I misusing the BatchScanner?
>> Is that a normal time for such a (seek) operation?
>> Can I do something to get better seek performance?
>>
>> Note: I have already enabled bloom filtering on that table.
>>
>> Thank you for any advice!
>>
>> Regards,
>> Sven


Accumulo Seek performance

2016-08-24 Thread Sven Hodapp
Hi there,

currently we're experimenting with a two-node Accumulo cluster (two tablet 
servers) set up for document storage.
These documents are decomposed down to the sentence level.

Now I'm using a BatchScanner to assemble the full document like this:

val bscan = instance.createBatchScanner(ARTIFACTS, auths, 10) // the ARTIFACTS table currently hosts ~30GB data, ~200M entries on ~45 tablets
bscan.setRanges(ranges) // there are like 3000 Range.exact's in the ranges list
for (entry <- bscan.asScala) yield {
  val key = entry.getKey()
  val value = entry.getValue()
  // etc.
}

For larger full documents (e.g. 3000 exact ranges), this operation takes about 
12 seconds.
But shorter documents are assembled blazingly fast...

Is that too much for a BatchScanner / am I misusing the BatchScanner?
Is that a normal time for such a (seek) operation?
Can I do something to get better seek performance?

Note: I have already enabled bloom filtering on that table.

Thank you for any advice!

Regards,
Sven

-- 
Sven Hodapp, M.Sc.,
Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
Department of Bioinformatics
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
sven.hod...@scai.fraunhofer.de
www.scai.fraunhofer.de


Re: tableOperations().create hangs

2016-05-19 Thread Sven Hodapp
Hi David,

I had the same issue. I found out that to modify a table's configuration from 
client code, the master server must be available.
You said that you have already altered the configuration, but maybe the master 
is nevertheless inaccessible to the client?

Regards,
Sven

-- 
Sven Hodapp, M.Sc.,
Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
Department of Bioinformatics
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
sven.hod...@scai.fraunhofer.de
www.scai.fraunhofer.de

- Original Message -
> From: "David Boyd" <db...@incadencecorp.com>
> To: "user" <user@accumulo.apache.org>
> CC: "Jing Yang" <jy...@incadencecorp.com>
> Sent: Thursday, 19 May 2016 15:22:26
> Subject: tableOperations().create hangs

> Hello:
> 
> I have the following code in one of my methods:
> 
> if (!dbConnector.tableOperations().exists(coalesceTable)) {
>     System.err.println("creating table " + coalesceTable);
>     dbConnector.tableOperations().create(coalesceTable);
>     System.err.println("created table " + coalesceTable);
> }
> 
> The exists call works just fine.  If the table exists (e.g. I created it
> from the accumulo shell)
> the code moves on and writes to the table just fine.
> 
> If the table does not exist the create call is called and the entire
> process hangs.
> 
> This same code works just fine using MiniAccumuloCluster in my junit test.
> 
> My accumulo is a single-node 1.6.1 instance running in a separate VM on
> my laptop.
> 
> I saw a similar thread
> (http://mail-archives.apache.org/mod_mbox/accumulo-dev/201310.mbox/%3c1382562693449-5858.p...@n5.nabble.com%3E)
> where the user set all the accumulo conf entries to the IP versus
> localhost, but I had already done that.  I reverified that nothing in my
> accumulo configuration
> uses localhost.
> 
> Any help would be appreciated.
> 
> 
> 
> --
> = mailto:db...@incadencecorp.com 
> David W. Boyd
> VP,  Data Solutions
> 10432 Balls Ford, Suite 240
> Manassas, VA 20109
> office:   +1-703-552-2862
> cell: +1-703-402-7908
> == http://www.incadencecorp.com/ 
> ISO/IEC JTC1 WG9, editor ISO/IEC 20547 Big Data Reference Architecture
> Chair ANSI/INCITS TC Big Data
> Co-chair NIST Big Data Public Working Group Reference Architecture
> First Robotic Mentor - FRC, FTC - www.iliterobotics.org
> Board Member- USSTEM Foundation - www.usstem.org
> 
> The information contained in this message may be privileged
> and/or confidential and protected from disclosure.
> If the reader of this message is not the intended recipient
> or an employee or agent responsible for delivering this message
> to the intended recipient, you are hereby notified that any
> dissemination, distribution or copying of this communication
> is strictly prohibited.  If you have received this communication
> in error, please notify the sender immediately by replying to
> this message and deleting the material from any computer.


Re: Accumulo 1.7.1 on Docker

2016-03-14 Thread Sven Hodapp
Hi Josh,

thanks for your answer.
Currently I haven't uploaded the Dockerfile... If you want, I'll upload it!
(At the moment it's only a Debian image with ssh, rsync, jdk7, the unzipped
distributions of Hadoop, ZooKeeper and Accumulo, and the set of environment
variables.)

Currently, and for testing, I start everything manually.
The steps I take are documented in /init.sh within the Docker container.
I've ensured that DFS and ZooKeeper are up before starting Accumulo.

Regards,
Sven

- Ursprüngliche Mail -
> Von: "Josh Elser" <josh.el...@gmail.com>
> An: "user" <user@accumulo.apache.org>
> Gesendet: Montag, 14. März 2016 15:41:16
> Betreff: Re: Accumulo 1.7.1 on Docker

> Hi Sven,
> 
> I can't seem to find the source for your Docker image (nor am I smart
> enough to figure out how to extract it). Any chance you can point us to
> that?
> 
> That said, I did run your image, and it looks like you didn't actually
> start Hadoop, ZooKeeper, or Accumulo.
> 
> Our documentation treats Hadoop and ZooKeeper primarily as
> prerequisites, so check their docs for good instructions.
> 
> For Accumulo -
> 
> Configuration:
> http://accumulo.apache.org/1.7/accumulo_user_manual.html#_installation
> Initialization:
> http://accumulo.apache.org/1.7/accumulo_user_manual.html#_initialization
> Starting:
> http://accumulo.apache.org/1.7/accumulo_user_manual.html#_starting_accumulo
> 
> Sven Hodapp wrote:
>> Dear reader,
>>
>> I'd like to create a docker image with accumulo 1.7.1.
>> I've installed it from scratch with hadoop-2.7.2, zookeeper-3.4.8 and
>> accumulo-1.7.1.
>> I'm going through the installation, and every time I end up like this:
>>
>> root@deebd8e29683:/# accumulo shell -u root
>> Password: **
>> 2016-03-13 13:18:31,627 [trace.DistributedTrace] INFO : SpanReceiver
>> org.apache.accumulo.tracer.ZooTraceClient was loaded successfully.
>> 2016-03-13 13:20:31,955 [impl.ServerClient] WARN : There are no tablet 
>> servers:
>> check that zookeeper and accumulo are running.
>>
>> Anybody got an idea what's wrong? I have no idea anymore...
>>
>> I'm currently sharing this Docker image on Docker Hub.
>> If you want to try it, simply run:
>>
>> docker run -it themerius/accumulo /bin/bash
>>
>> All binaries and configs are in there. In ./init.sh are my setup steps.
>>
>> Regards,
> > Sven


Accumulo 1.7.1 on Docker

2016-03-13 Thread Sven Hodapp
Dear reader,

I'd like to create a docker image with accumulo 1.7.1.
I've installed it from scratch with hadoop-2.7.2, zookeeper-3.4.8 and
accumulo-1.7.1.
I'm going through the installation, and every time I end up like this:

root@deebd8e29683:/# accumulo shell -u root
Password: **
2016-03-13 13:18:31,627 [trace.DistributedTrace] INFO : SpanReceiver 
org.apache.accumulo.tracer.ZooTraceClient was loaded successfully.
2016-03-13 13:20:31,955 [impl.ServerClient] WARN : There are no tablet servers: 
check that zookeeper and accumulo are running.

Anybody got an idea what's wrong? I have no idea anymore...

I'm currently sharing this Docker image on Docker Hub.
If you want to try it, simply run:

   docker run -it themerius/accumulo /bin/bash

All binaries and configs are in there. In ./init.sh are my setup steps.
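For context, the startup order looks roughly like this as a shell sketch (the script paths and environment variables are assumptions based on standard Hadoop/ZooKeeper/Accumulo layouts, not the exact contents of ./init.sh):

```shell
$HADOOP_PREFIX/sbin/start-dfs.sh           # HDFS first
$ZOOKEEPER_HOME/bin/zkServer.sh start      # then ZooKeeper
$ACCUMULO_HOME/bin/accumulo init           # one-time instance initialization
$ACCUMULO_HOME/bin/start-all.sh            # finally the Accumulo processes
```

If any of the first three steps fails silently, the shell's "There are no tablet servers" warning is the typical symptom.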

Regards,
Sven


Re: IntersectingIterator and Ranges

2015-12-18 Thread Sven Hodapp
Hi Billie,

I've read in the source code documentation the following:

This iterator is commonly used with BatchScanner or AccumuloInputFormat, to 
parallelize the search over all shardIDs.

Does this mean key1 and key2 (the shardIDs) should both be searched? Or is
this a misunderstanding?
Should the IndexedDocIterator also search across all shardIDs?
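For comparison, a sharded layout in the style that documentation describes might look like this (the shard names are made up): each shard row carries all terms of its documents, so the intersection still happens within a single row while the BatchScanner fans out over all shard rows in parallel.

```
shard01 : term1 : doc1
shard01 : term2 : doc1
shard01 : term3 : doc1
shard02 : term1 : doc7
shard02 : term4 : doc7
```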

Thanks!

Regards,
Sven

-- 
Sven Hodapp, M.Sc.,
Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
Department of Bioinformatics
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
sven.hod...@scai.fraunhofer.de
www.scai.fraunhofer.de

- Ursprüngliche Mail -
> Von: "Billie Rinaldi" <billie.rina...@gmail.com>
> An: "user" <user@accumulo.apache.org>
> Gesendet: Mittwoch, 18. November 2015 15:57:15
> Betreff: Re: IntersectingIterator and Ranges

> Yes, that is the correct behavior. The IntersectingIterator intersects
> columns within a row, on a single tablet server. To get the results you
> want, you should make sure all the terms for a document are inserted with
> the same key / row. In this case, all the doc1 entries should have key1 as
> their row.
> 
> Billie
> On Nov 18, 2015 7:08 AM, "Sven Hodapp" <sven.hod...@scai.fraunhofer.de>
> wrote:
> 
>> Hello together,
>>
>> Currently I'm using Accumulo 1.7 (currently a single node) with the
>> IntersectingIterator.
>> The current index schema for the IntersectingIterator looks like this, for
>> example:
>>
>> key1 : term1 : doc1
>> key1 : term2 : doc1
>> key2 : term3 : doc1
>>
>> I've noticed that I can't intersect terms which are in distinct key-ranges.
>> Is that the correct behavior, or am I doing something wrong?
>>
>> Extract of my code (Scala) as example:
>>
>> val bs = conn.createBatchScanner(tableName, authorizations,
>> numQueryThreads)
>> val terms = List(new Text("term1"), new Text("term2")).toArray
>>
>> val ii = new IteratorSetting(priority, name, iteratorClass)
>> IntersectingIterator.setColumnFamilies(ii, terms)
>> bs.addScanIterator(ii)
>>
>> bs.setRanges(Collections.singleton(new Range()))  // all ranges
>>
>> for (entry <- bs.asScala.take(100)) yield {
>>   entry.getKey.getColumnQualifier.toString
>> }
>>
>> This will yield "doc1" as expected.
>>
>> But if I choose the terms like this:
>>
>> // ...
>>     val terms = List(new Text("term1"), new Text("term3")).toArray
>> // ...
>>
>> It yields "null", but I would expect "doc1" here as well.
>> I've also tried setting a list of Range.exact,
>> but I also get "null".
>>
>> Am I doing something wrong?
>>
>> Thank you in advance!
>>
>> Regards,
>> Sven
>>
>> --
>> Sven Hodapp, M.Sc.,
>> Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
>> Department of Bioinformatics
>> Schloss Birlinghoven, 53754 Sankt Augustin, Germany
>> sven.hod...@scai.fraunhofer.de
>> www.scai.fraunhofer.de


Re: Multiple tablet servers on the same host

2015-12-11 Thread Sven Hodapp
Hi Josh,
HI Billie,

now it works, thank you both!

So in short:

1. Copy ACCUMULO_HOME to another distinct directory
2. export ACCUMULO_HOME=/path/to/new/directory
3. Edit conf/accumulo-site.xml in the new directory:

   <property>
     <name>tserver.port.client</name>
     <value>29997</value>
   </property>

   <property>
     <name>replication.receipt.service.port</name>
     <value>0</value>
   </property>

4. Run bin/tup.sh to start the additional tablet servers
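As a shell sketch, the four steps might look like this (the paths are examples; the port values are the ones used above):

```shell
# 1. copy the installation to a distinct directory
cp -r /opt/accumulo /opt/accumulo-2
# 2. point ACCUMULO_HOME at the copy
export ACCUMULO_HOME=/opt/accumulo-2
# 3. edit conf/accumulo-site.xml in the copy: set tserver.port.client to a
#    free port (e.g. 29997) and replication.receipt.service.port to 0
"$EDITOR" "$ACCUMULO_HOME/conf/accumulo-site.xml"
# 4. start the additional tablet servers
"$ACCUMULO_HOME/bin/tup.sh"
```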

Maybe you can recycle this for the user manual.

Regards,
Sven

-- 
Sven Hodapp, M.Sc.,
Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
Department of Bioinformatics
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
sven.hod...@scai.fraunhofer.de
www.scai.fraunhofer.de

- Ursprüngliche Mail -
> Von: "Josh Elser" <josh.el...@gmail.com>
> An: "user" <user@accumulo.apache.org>
> Gesendet: Donnerstag, 10. Dezember 2015 17:56:39
> Betreff: Re: Multiple tablet servers on the same host

> Sven Hodapp wrote:
>> I've read in the Accumulo Book (p. 496) that it should be possible to start 
>> on a
>> (fat) machine multiple tablet servers to scale (also) vertically. Sadly it's
>> not described how to do it. Also I didn't find anything about this issue in 
>> the
>> official documentation.
> 
> I've just created https://issues.apache.org/jira/browse/ACCUMULO-4072
> for us to get some documentation into the Accumulo User Manual on the
> matter. Thanks for letting us know we were missing this.


Re: Multiple tablet servers on the same host

2015-12-10 Thread Sven Hodapp
Hi Billie,

thanks for your reply! I've created a conf2 dir with tserver.port.client set to
another port. I've also made sure that ACCUMULO_CONF_DIR is set to conf2 in the
environment.
But it says "tablet server already running".

I'm using Accumulo 1.7.0, currently on a single node. But because the machine
is a big one (48 cores, 256 GB RAM), I'd like to test whether multiple tservers
on the same machine bring any benefit.
Currently the single tablet server yields about 100k writes per second. Or do
we perhaps have a disk bottleneck?

Regards,
Sven

-- 
Sven Hodapp, M.Sc.,
Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
Department of Bioinformatics
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
sven.hod...@scai.fraunhofer.de
www.scai.fraunhofer.de

- Ursprüngliche Mail -
> Von: "Billie Rinaldi" <billie.rina...@gmail.com>
> An: "user" <user@accumulo.apache.org>
> Gesendet: Mittwoch, 9. Dezember 2015 18:40:25
> Betreff: Re: Multiple tablet servers on the same host

> Hi Sven,
> 
> The tablet server port is configured in the accumulo-site.xml file.  Ports
> should not appear in the slaves file.  Thus, you would need a separate
> configuration directory for Accumulo to be able to start up a second set of
> tablet servers on a different port.  Let's say you copied your existing
> conf directory to a directory named conf2, with the tserver.port.client
> property set to a new value in the conf2/accumulo-site.xml file, and pushed
> this conf2 directory out to all your nodes.  Then you should be able to
> start the second set of tablet servers by running
> ACCUMULO_CONF_DIR=/path/to/conf2 $ACCUMULO_HOME/bin/tup.sh from one of the
> nodes.
> 
> That's how you could do this, but I'd encourage you to evaluate whether you
> actually need to do this based on your workload, hardware, version of
> Accumulo, etc.
> 
> Billie
> 
> On Wed, Dec 9, 2015 at 8:30 AM, Sven Hodapp <sven.hod...@scai.fraunhofer.de>
> wrote:
> 
>> Hi there,
>>
>> I've read in the Accumulo Book (p. 496) that it should be possible to
>> start on a (fat) machine multiple tablet servers to scale (also)
>> vertically. Sadly it's not described how to do it. Also I didn't find
>> anything about this issue in the official documentation.
>>
>> I thought it must be configured in ACCUMULO_HOME/conf/slaves.
>> So I've added here the same IP with distinct ports multiple times ... in
>> my naive thinking.
>> But *no* additional tablet servers are started.
>>
>> Or I am completely wrong here?
>>
>> Regards,
>> Sven
>>
>> --
>> Sven Hodapp, M.Sc.,
>> Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
>> Department of Bioinformatics
>> Schloss Birlinghoven, 53754 Sankt Augustin, Germany
>> sven.hod...@scai.fraunhofer.de
>> www.scai.fraunhofer.de


Re: Multiple tablet servers on the same host

2015-12-10 Thread Sven Hodapp
Hi Billie,

it seems that in Accumulo 1.7 there is no ACCUMULO_PID_DIR anymore; it is
determined in the start-server.sh script via ps.

So I've tried this:

$ export ACCUMULO_CONF_DIR="${ACCUMULO_CONF_DIR:-$ACCUMULO_HOME/conf2}"
$ bin/accumulo tserver --address serverIP

This starts the tserver on the port from conf2 (currently 29997), but it tries
to open another port which is already in use:

Unable to start TServer
org.apache.thrift.transport.TTransportException: Could not create 
ServerSocket on address serverIP:10002.

I think it is in use by the other tablet server...
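The follow-up summary elsewhere in this thread (2015-12-11) resolves exactly this conflict by setting replication.receipt.service.port to 0, which lets the OS pick a free port. The override in conf2/accumulo-site.xml looks like:

```xml
<property>
  <name>replication.receipt.service.port</name>
  <value>0</value>
</property>
```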

Regards,
Sven

-- 
Sven Hodapp, M.Sc.,
Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
Department of Bioinformatics
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
sven.hod...@scai.fraunhofer.de
www.scai.fraunhofer.de

- Ursprüngliche Mail -
> Von: "Billie Rinaldi" <billie.rina...@gmail.com>
> An: "user" <user@accumulo.apache.org>
> Gesendet: Donnerstag, 10. Dezember 2015 13:33:47
> Betreff: Re: Multiple tablet servers on the same host

> Ah, that's my fault. I forgot another change you should make. In
> conf2/accumulo-env.sh, change ACCUMULO_LOG_DIR and ACCUMULO_PID_DIR and see
> if that helps.
> On Dec 10, 2015 3:29 AM, "Sven Hodapp" <sven.hod...@scai.fraunhofer.de>
> wrote:
> 
> Hi Billie,
> 
> thanks for your reply! I've created a conf2 dir with tserver.port.client
> set to another port. I've also made sure that ACCUMULO_CONF_DIR is set to
> conf2 in the environment.
> But it says "tablet server already running".
> 
> I'm using Accumulo 1.7.0, currently on a single node. But because the
> machine is a big one (48 cores, 256 GB RAM), I'd like to test whether
> multiple tservers on the same machine bring any benefit.
> Currently the single tablet server yields about 100k writes per second. Or
> do we perhaps have a disk bottleneck?
> 
> Regards,
> Sven
> 
> --
> Sven Hodapp, M.Sc.,
> Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
> Department of Bioinformatics
> Schloss Birlinghoven, 53754 Sankt Augustin, Germany
> sven.hod...@scai.fraunhofer.de
> www.scai.fraunhofer.de
> 
> - Ursprüngliche Mail -
>> Von: "Billie Rinaldi" <billie.rina...@gmail.com>
>> An: "user" <user@accumulo.apache.org>
>> Gesendet: Mittwoch, 9. Dezember 2015 18:40:25
>> Betreff: Re: Multiple tablet servers on the same host
> 
>> Hi Sven,
>>
>> The tablet server port is configured in the accumulo-site.xml file.  Ports
>> should not appear in the slaves file.  Thus, you would need a separate
>> configuration directory for Accumulo to be able to start up a second set
> of
>> tablet servers on a different port.  Let's say you copied your existing
>> conf directory to a directory named conf2, with the tserver.port.client
>> property set to a new value in the conf2/accumulo-site.xml file, and
> pushed
>> this conf2 directory out to all your nodes.  Then you should be able to
>> start the second set of tablet servers by running
>> ACCUMULO_CONF_DIR=/path/to/conf2 $ACCUMULO_HOME/bin/tup.sh from one of the
>> nodes.
>>
>> That's how you could do this, but I'd encourage you to evaluate whether
> you
>> actually need to do this based on your workload, hardware, version of
>> Accumulo, etc.
>>
>> Billie
>>
>> On Wed, Dec 9, 2015 at 8:30 AM, Sven Hodapp <
> sven.hod...@scai.fraunhofer.de>
>> wrote:
>>
>>> Hi there,
>>>
>>> I've read in the Accumulo Book (p. 496) that it should be possible to
>>> start on a (fat) machine multiple tablet servers to scale (also)
>>> vertically. Sadly it's not described how to do it. Also I didn't find
>>> anything about this issue in the official documentation.
>>>
>>> I thought it must be configured in ACCUMULO_HOME/conf/slaves.
>>> So I've added here the same IP with distinct ports multiple times ... in
>>> my naive thinking.
>>> But *no* additional tablet servers are started.
>>>
>>> Or I am completely wrong here?
>>>
>>> Regards,
>>> Sven
>>>
>>> --
>>> Sven Hodapp, M.Sc.,
>>> Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
>>> Department of Bioinformatics
>>> Schloss Birlinghoven, 53754 Sankt Augustin, Germany
>>> sven.hod...@scai.fraunhofer.de
> >> www.scai.fraunhofer.de


Multiple tablet servers on the same host

2015-12-09 Thread Sven Hodapp
Hi there,

I've read in the Accumulo Book (p. 496) that it should be possible to start
multiple tablet servers on a (fat) machine to scale (also) vertically. Sadly
it's not described how to do it, and I didn't find anything about this issue in
the official documentation.

I thought it must be configured in ACCUMULO_HOME/conf/slaves.
So, in my naive thinking, I added the same IP with distinct ports multiple
times there.
But *no* additional tablet servers are started.

Or am I completely wrong here?

Regards,
Sven

-- 
Sven Hodapp, M.Sc.,
Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
Department of Bioinformatics
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
sven.hod...@scai.fraunhofer.de
www.scai.fraunhofer.de


Re: IntersectingIterator and Ranges

2015-11-19 Thread Sven Hodapp
Hi Billie,

thank you for that information!
So I must look for another solution.

Regards,
Sven

-- 
Sven Hodapp, M.Sc.,
Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
Department of Bioinformatics
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
sven.hod...@scai.fraunhofer.de
www.scai.fraunhofer.de

- Ursprüngliche Mail -
> Von: "Billie Rinaldi" <billie.rina...@gmail.com>
> An: "user" <user@accumulo.apache.org>
> Gesendet: Mittwoch, 18. November 2015 15:57:15
> Betreff: Re: IntersectingIterator and Ranges

> Yes, that is the correct behavior. The IntersectingIterator intersects
> columns within a row, on a single tablet server. To get the results you
> want, you should make sure all the terms for a document are inserted with
> the same key / row. In this case, all the doc1 entries should have key1 as
> their row.
> 
> Billie
> On Nov 18, 2015 7:08 AM, "Sven Hodapp" <sven.hod...@scai.fraunhofer.de>
> wrote:
> 
>> Hello together,
>>
>> Currently I'm using Accumulo 1.7 (currently a single node) with the
>> IntersectingIterator.
>> The current index schema for the IntersectingIterator looks like this, for
>> example:
>>
>> key1 : term1 : doc1
>> key1 : term2 : doc1
>> key2 : term3 : doc1
>>
>> I've noticed that I can't intersect terms which are in distinct key-ranges.
>> Is that the correct behavior, or am I doing something wrong?
>>
>> Extract of my code (Scala) as example:
>>
>> val bs = conn.createBatchScanner(tableName, authorizations,
>> numQueryThreads)
>> val terms = List(new Text("term1"), new Text("term2")).toArray
>>
>> val ii = new IteratorSetting(priority, name, iteratorClass)
>> IntersectingIterator.setColumnFamilies(ii, terms)
>> bs.addScanIterator(ii)
>>
>> bs.setRanges(Collections.singleton(new Range()))  // all ranges
>>
>> for (entry <- bs.asScala.take(100)) yield {
>>   entry.getKey.getColumnQualifier.toString
>> }
>>
>> This will yield "doc1" as expected.
>>
>> But if I choose the terms like this:
>>
>> // ...
>>     val terms = List(new Text("term1"), new Text("term3")).toArray
>> // ...
>>
>> It yields "null", but I would expect "doc1" here as well.
>> I've also tried setting a list of Range.exact,
>> but I also get "null".
>>
>> Am I doing something wrong?
>>
>> Thank you in advance!
>>
>> Regards,
>> Sven
>>
>> --
>> Sven Hodapp, M.Sc.,
>> Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
>> Department of Bioinformatics
>> Schloss Birlinghoven, 53754 Sankt Augustin, Germany
>> sven.hod...@scai.fraunhofer.de
>> www.scai.fraunhofer.de


IntersectingIterator and Ranges

2015-11-18 Thread Sven Hodapp
Hello together,

Currently I'm using Accumulo 1.7 (currently a single node) with the
IntersectingIterator.
The current index schema for the IntersectingIterator looks like this, for 
example:

key1 : term1 : doc1
key1 : term2 : doc1
key2 : term3 : doc1

I've noticed that I can't intersect terms which are in distinct key ranges.
Is that the correct behavior, or am I doing something wrong?

Extract of my code (Scala) as example:

val bs = conn.createBatchScanner(tableName, authorizations, numQueryThreads)
val terms = List(new Text("term1"), new Text("term2")).toArray

val ii = new IteratorSetting(priority, name, iteratorClass)
IntersectingIterator.setColumnFamilies(ii, terms)
bs.addScanIterator(ii)

bs.setRanges(Collections.singleton(new Range()))  // all ranges

for (entry <- bs.asScala.take(100)) yield {
  entry.getKey.getColumnQualifier.toString
}

This will yield "doc1" as expected.

But if I choose the terms like this:

// ...
val terms = List(new Text("term1"), new Text("term3")).toArray
// ...

It yields "null", but I would expect "doc1" here as well.
I've also tried setting a list of Range.exact,
but I also get "null".

Am I doing something wrong?

Thank you in advance!

Regards,
Sven

-- 
Sven Hodapp, M.Sc.,
Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
Department of Bioinformatics
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
sven.hod...@scai.fraunhofer.de
www.scai.fraunhofer.de


Re: Mini Accumulo Cluster reusing the directory

2015-09-22 Thread Sven Hodapp
Hi Keith,

for me the use cases are:

 * easy and portable development
 * in general testing
 * embedding Accumulo

Especially the last point is very interesting for me.
First, I can deliver a relatively lightweight application which has Accumulo
embedded (like a library).
And then, if the application runs well and gains many users, it's easy to
scale up to a full Accumulo installation!

Regards,
Sven

-- 
Sven Hodapp M.Sc.,
Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
Department of Bioinformatics
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
sven.hod...@scai.fraunhofer.de
www.scai.fraunhofer.de

- Ursprüngliche Mail -
> Von: "Keith Turner" <ke...@deenlo.com>
> An: "user" <user@accumulo.apache.org>
> Gesendet: Mittwoch, 16. September 2015 21:41:00
> Betreff: Re: Mini Accumulo Cluster reusing the directory

> Would you be able to provide more information about your use case?  Was
> wondering if other solutions could be of use, like configuring regular
> Accumulo to use the local filesystem.  This can be done, but care needs to
> be taken to make walogs work correctly.   If interested I could provide
> more info about this configuration.
> 
> On Wed, Sep 16, 2015 at 9:20 AM, Sven Hodapp <sven.hod...@scai.fraunhofer.de
>> wrote:
> 
>> Hi there,
>>
>> is it possible for MiniAccumuloCluster to reuse a given directory?
>> Sadly, I haven't found anything in the docs.
>>
>> I’ll fire up my instance like this:
>>
>>val dict = new File("/tmp/accumulo-mini-cluster")
>>    val accumulo = new MiniAccumuloCluster(dict, "test")
>>
>> If I restart my JVM, it raises an error like this:
>>
>>Exception in thread "main" java.lang.IllegalArgumentException:
>> Directory /tmp/accumulo-mini-cluster is not empty
>>
>> It would be nice if the data could survive a JVM restart and the folder
>> structure did not have to be constructed every time.
>>
>> Thanks a lot!
>>
>> Regards,
>> Sven
>>
>> --
>> Sven Hodapp M.Sc.,
>> Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
>> Department of Bioinformatics
>> Schloss Birlinghoven, 53754 Sankt Augustin, Germany
>> sven.hod...@scai.fraunhofer.de
>> www.scai.fraunhofer.de


Re: Mini Accumulo Cluster reusing the directory

2015-09-16 Thread Sven Hodapp
Hi Corey,

thanks for your reply and the link. Sounds good if that will be available in
the future!
Is Christopher's code deployed somewhere?

Currently I'm using version 1.7
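Until that lands, one blunt workaround (my own assumption, not something from the ticket: it avoids the IllegalArgumentException but does NOT preserve the data) is to clear the directory before constructing the cluster:

```scala
import java.io.File

// Recursively empty a directory so MiniAccumuloCluster accepts it again.
// Note: this throws the previous instance's data away; it only avoids the
// "Directory ... is not empty" error on JVM restart.
def clearDir(dir: File): Unit = {
  Option(dir.listFiles).getOrElse(Array.empty[File]).foreach { f =>
    if (f.isDirectory) clearDir(f)
    f.delete()
  }
}
```

So the data still doesn't survive a restart; it just saves recreating the directory structure by hand.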

Regards,
Sven

- Ursprüngliche Mail -
> Von: "Corey Nolet" <cjno...@gmail.com>
> An: "user" <user@accumulo.apache.org>
> Gesendet: Mittwoch, 16. September 2015 16:31:02
> Betreff: Re: Mini Accumulo Cluster reusing the directory

> Sven,
> 
> What version of Accumulo are you running? We have a ticket for this [1]
> which has had a lot of discussion on it. Christopher Tubbs mentioned that
> he had gotten this to work.
> 
> [1] https://issues.apache.org/jira/browse/ACCUMULO-1378
> 
> On Wed, Sep 16, 2015 at 9:20 AM, Sven Hodapp <sven.hod...@scai.fraunhofer.de
>> wrote:
> 
>> Hi there,
>>
>> is it possible for MiniAccumuloCluster to reuse a given directory?
>> Sadly, I haven't found anything in the docs.
>>
>> I’ll fire up my instance like this:
>>
>>val dict = new File("/tmp/accumulo-mini-cluster")
>>    val accumulo = new MiniAccumuloCluster(dict, "test")
>>
>> If I restart my JVM, it raises an error like this:
>>
>>Exception in thread "main" java.lang.IllegalArgumentException:
>> Directory /tmp/accumulo-mini-cluster is not empty
>>
>> It would be nice if the data could survive a JVM restart and the folder
>> structure did not have to be constructed every time.
>>
>> Thanks a lot!
>>
>> Regards,
>> Sven
>>
>> --
>> Sven Hodapp M.Sc.,
>> Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
>> Department of Bioinformatics
>> Schloss Birlinghoven, 53754 Sankt Augustin, Germany
>> sven.hod...@scai.fraunhofer.de
>> www.scai.fraunhofer.de