RE: Modify Keys within iterator

2016-09-30 Thread Dan Blum
What happens if you just use an empty range?

 

What I suspect is happening is that the scanner code sees a key beyond the 
range’s end key and stops. In general you can’t transform rows in keys for this 
reason, and you might have issues even if you don’t transform the rows if the 
keys end up out of order – see the comments in TransformingIterator.
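
To make the range problem concrete, here is a minimal sketch in plain Java. It is a simplified model, not Accumulo's scanner code: rows are compared as plain strings, and the RowRange class with its contains and accept methods are hypothetical names invented for this illustration. It uses the rows from the schema in the original question, where a scan over row r_1 produces keys whose transformed row is f_1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified, hypothetical stand-in for a scan range over row IDs (inclusive on both ends). */
final class RowRange {
  private final String startRow;
  private final String endRow;

  RowRange(String startRow, String endRow) {
    this.startRow = startRow;
    this.endRow = endRow;
  }

  /** True if the row sorts lexicographically within [startRow, endRow]. */
  boolean contains(String row) {
    return row.compareTo(startRow) >= 0 && row.compareTo(endRow) <= 0;
  }

  /** Models a reader that stops at the first key whose row is outside the requested range. */
  static List<String> accept(RowRange range, List<String> rowsFromIterator) {
    List<String> accepted = new ArrayList<>();
    for (String row : rowsFromIterator) {
      if (!range.contains(row)) {
        break; // assumption of this model: an out-of-range key ends the scan
      }
      accepted.add(row);
    }
    return accepted;
  }

  public static void main(String[] args) {
    RowRange range = new RowRange("r_1", "r_1");
    // Untransformed rows stay inside the range and are returned.
    System.out.println(accept(range, Arrays.asList("r_1", "r_1")));
    // Transformed rows (the old column qualifier f_1) fall outside it, so nothing comes back.
    System.out.println(accept(range, Arrays.asList("f_1", "f_1")));
  }
}
```

In this model the iterator does produce keys, but every one of them is rejected by the range check, which would match the "prints show the key changing, yet no output" symptom.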

 

From: Yamini Joshi [mailto:yamini.1...@gmail.com] 
Sent: Friday, September 30, 2016 1:37 PM
To: user@accumulo.apache.org
Subject: Re: Modify Keys within iterator

 

I am using pyaccumulo. Here's the code snippet:

rowIDs = ['r2', 'r10']

hashFilter = KeyModifyIterator(priority=10)
iterator.append(hashFilter)

for entry in self.dbconn.batch_scan(table, scanranges=(Range(srow=row,
        erow=row) for row in rowIDs), iterators=[hashFilter]):
    print entry

 




Best regards,
Yamini Joshi

 

On Fri, Sep 30, 2016 at 12:31 PM, Dan Blum <db...@bbn.com> wrote:

What code are you using to test the iterator, where you see no output?

 

From: Yamini Joshi [mailto:yamini.1...@gmail.com] 
Sent: Friday, September 30, 2016 1:26 PM
To: user@accumulo.apache.org
Subject: Modify Keys within iterator

 

Hello Everyone!

I am trying to write an iterator to modify keys within a table (at scan time). My 
use case is to select a few records that match a certain criterion and then 
modify them within the iterator (using the following class) for some other 
succeeding iterator/combiner. The problem is that this iterator does not return 
any records/keys. I added some primitive prints and found that the key (this.key) 
is changed, but the iterator outputs nothing. I'd appreciate it if someone 
could give me any insight. I'm sure I'm making a teeny tiny mistake somewhere.

Schema:

        row    colF   colQ   ts   Val

I/P:    r_1    f      f_1         v1
        r_1    fx     f_1         v1

O/P:    f_1    r      r_1         v1
        f_1    fx     fx          v1

 



import static java.nio.charset.StandardCharsets.UTF_8;

import java.io.IOException;
import java.util.Collection;
import java.util.Map;

import org.apache.accumulo.core.data.ByteSequence;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.IteratorEnvironment;
import org.apache.accumulo.core.iterators.SortedKeyValueIterator;

public class KeyModifyIterator implements SortedKeyValueIterator<Key,Value> {

  private SortedKeyValueIterator<Key,Value> source;
  private Key key;
  private Value value;

  @Override
  public void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options,
      IteratorEnvironment env) throws IOException {
    this.source = source;
  }

  @Override
  public boolean hasTop() {
    return key != null;
  }

  @Override
  public void next() throws IOException {
    if (source.hasTop()) {
      ByteSequence currentRow = source.getTopKey().getRowData();
      ByteSequence currentColf = source.getTopKey().getColumnFamilyData();
      ByteSequence currentColq = source.getTopKey().getColumnQualifierData();
      long ts = source.getTopKey().getTimestamp();
      String v = source.getTopValue().toString();
      System.out.println("Key = " + currentRow.toString() + " Cf = " + currentColf.toString()
          + " Cq = " + currentColq.toString() + " val = " + v);

      if (currentColf.toString().equals("fx")) {
        // fx entries: the old qualifier becomes the row; family and qualifier are both "fx"
        System.out.println("Updating fx");
        this.key = new Key(currentColq.toArray(), currentColf.toArray(), currentColf.toArray(),
            new byte[0], ts);
        this.value = new Value(v.getBytes(UTF_8));
      } else {
        // all other entries: swap row and qualifier, using the fixed family "r"
        System.out.println("Updating other");
        this.key = new Key(currentColq.toArray(), "r".getBytes(UTF_8), currentRow.toArray(),
            new byte[0], ts);
        this.value = new Value(v.getBytes(UTF_8));
        System.out.println(this.key.toString());
      }

      source.next();
    } else {
      this.key = null;
      this.value = null;
    }
  }

  @Override
  public void seek(Range range, Collection<ByteSequence> columnFamilies, boolean inclusive)
      throws IOException {
    source.seek(range, columnFamilies, inclusive);
    next();
  }

  @Override
  public Key getTopKey() {
    return key;
  }

  @Override
  public Value getTopValue() {
    return value;
  }

  @Override
  public SortedKeyValueIterator<Key,Value> deepCopy(IteratorEnvironment env) {
    return null; // not implemented
  }

}




Best regards,
Yamini Joshi

 



RE: Modify Keys within iterator

2016-09-30 Thread Dan Blum
What code are you using to test the iterator, where you see no output?

 




RE: Accumulo Seek performance

2016-09-12 Thread Dan Blum
I think the 450 ranges returned a total of about 7.5M entries, but the ranges 
were in fact quite small relative to the size of the table.

-Original Message-
From: Josh Elser [mailto:josh.el...@gmail.com] 
Sent: Monday, September 12, 2016 2:43 PM
To: user@accumulo.apache.org
Subject: Re: Accumulo Seek performance

What does a "large scan" mean here, Dan?

Sven's original problem statement was running many small/pointed Ranges 
(e.g. point lookups). My observation was that BatchScanners were slower 
than running each in a Scanner when using multiple BS's concurrently.

Dan Blum wrote:
> I tested a large scan on a 1.6.2 cluster with 11 tablet servers - using 
> Scanners was much slower than using a BatchScanner with 11 threads, by about 
> a 5:1 ratio. There were 450 ranges.
>
> -Original Message-
> From: Josh Elser [mailto:josh.el...@gmail.com]
> Sent: Monday, September 12, 2016 1:42 PM
> To: user@accumulo.apache.org
> Subject: Re: Accumulo Seek performance
>
> I had increased the readahead thread pool to 32 (from 16). I had also
> increased the minimum thread pool size from 20 to 40. I had 10 tablets
> with the data block cache turned on (probably only 256M tho).
>
> Each tablet had a single file (manually compacted). Did not observe
> cache rates.
>
> I've been working through this with Keith on IRC this morning too. Found
> that a single batchscanner (one partition) is faster than the Scanner.
> Two partitions and things started to slow down.
>
> Two interesting points to still pursue, IMO:
>
> 1. I saw that the tserver-side logging for MultiScanSess was near
> identical to the BatchScanner timings
> 2. The minimum server threads did not seem to be taking effect. Despite
> having the value set to 64, I only saw a few ClientPool threads in a
> jstack after running the test.
>

RE: Accumulo Seek performance

2016-09-12 Thread Dan Blum
I tested a large scan on a 1.6.2 cluster with 11 tablet servers - using 
Scanners was much slower than using a BatchScanner with 11 threads, by about a 
5:1 ratio. There were 450 ranges.

-Original Message-
From: Josh Elser [mailto:josh.el...@gmail.com] 
Sent: Monday, September 12, 2016 1:42 PM
To: user@accumulo.apache.org
Subject: Re: Accumulo Seek performance

I had increased the readahead thread pool to 32 (from 16). I had also 
increased the minimum thread pool size from 20 to 40. I had 10 tablets 
with the data block cache turned on (probably only 256M tho).

Each tablet had a single file (manually compacted). Did not observe 
cache rates.

I've been working through this with Keith on IRC this morning too. Found 
that a single batchscanner (one partition) is faster than the Scanner. 
Two partitions and things started to slow down.

Two interesting points to still pursue, IMO:

1. I saw that the tserver-side logging for MultiScanSess was near 
identical to the BatchScanner timings
2. The minimum server threads did not seem to be taking effect. Despite 
having the value set to 64, I only saw a few ClientPool threads in a 
jstack after running the test.

Adam Fuchs wrote:
> Sorry, Monday morning poor reading skills, I guess. :)
>
> So, 3000 ranges in 40 seconds with the BatchScanner. In my past
> experience HDFS seeks tend to take something like 10-100ms, and I would
> expect that time to dominate here. With 60 client threads your
> bottleneck should be the readahead pool, which I believe defaults to 16
> threads. If you get perfect index caching then you should be seeing
> something like 3000/16*50ms = 9,375ms. That's in the right ballpark, but
> it assumes no data cache hits. Do you have any idea of how many files
> you had per tablet after the ingest? Do you know what your cache hit
> rate was?
>
> Adam
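
The back-of-the-envelope estimate above can be written out as a small calculation. This is only a sketch of the arithmetic in the message; the class and method names are made up for illustration, and the 50 ms figure is the midpoint assumed in the estimate, not a measurement.

```java
/** Back-of-the-envelope seek-time estimate; all names here are illustrative. */
final class SeekEstimate {
  /** Estimated wall-clock milliseconds when the seeks are spread evenly over the worker threads. */
  static double estimateMillis(int ranges, int threads, double millisPerSeek) {
    return (double) ranges / threads * millisPerSeek;
  }

  public static void main(String[] args) {
    // 3000 ranges, 16 readahead threads, 50 ms per HDFS seek
    System.out.println(estimateMillis(3000, 16, 50.0)); // prints 9375.0
    // The same workload with the readahead pool raised to 32 threads
    System.out.println(estimateMillis(3000, 32, 50.0)); // prints 4687.5
  }
}
```

Under these assumptions, doubling the readahead pool would halve the estimate, which is why the pool size is the first setting to check.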
>
>
> On Mon, Sep 12, 2016 at 9:14 AM, Josh Elser wrote:
>
> 5 iterations, figured that would be apparent from the log messages :)
>
> The code is already posted in my original message.
>
> Adam Fuchs wrote:
>
> Josh,
>
> Two questions:
>
> 1. How many iterations did you do? I would like to see an absolute
> number of lookups per second to compare against other observations.
>
> 2. Can you post your code somewhere so I can run it?
>
> Thanks,
> Adam
>
>

RE: Accumulo Seek performance

2016-09-12 Thread Dan Blum
I am not sure - my recollection is that the 1.6.x code capped the number of 
threads requested at 1 per tablet (covered by the requested ranges), not 1 per 
tablet server.

-Original Message-
From: Josh Elser [mailto:josh.el...@gmail.com] 
Sent: Monday, September 12, 2016 10:58 AM
To: user@accumulo.apache.org
Subject: Re: Accumulo Seek performance

Good call. I kind of forgot about BatchScanner threads and trying to 
factor those in :). I guess doing one thread in the BatchScanners would 
be more accurate.

Although, I only had one TServer, so I don't *think* there would be any 
difference. I don't believe we have concurrent requests from one 
BatchScanner to one TServer.

Dylan Hutchison wrote:
> Nice setup Josh.  Thank you for putting together the tests.  A few
> questions:
>
> The serial scanner implementation uses 6 threads: one for each thread in
> the thread pool.
> The batch scanner implementation uses 60 threads: 10 for each thread in
> the thread pool, since the BatchScanner was configured with 10 threads
> and there are 10 (9?) tablets.
>
> Isn't 60 threads of communication naturally inefficient?  I wonder if we
> would see the same performance if we set each BatchScanner to use 1 or 2
> threads.
>
> Maybe this would motivate a /MultiTableBatchScanner/, which maintains a
> fixed number of threads across any number of concurrent scans, possibly
> to the same table.
>
>

RE: Accumulo Seek performance

2016-09-12 Thread Dan Blum
Is this a problem specific to 1.8.0, or is it likely to affect earlier versions?

-Original Message-
From: Josh Elser [mailto:josh.el...@gmail.com] 
Sent: Saturday, September 10, 2016 6:01 PM
To: user@accumulo.apache.org
Subject: Re: Accumulo Seek performance

Sven, et al:

So, it would appear that I have been able to reproduce this one (better 
late than never, I guess...). tl;dr Serially using Scanners to do point 
lookups instead of a BatchScanner is ~20x faster. This sounds like a 
pretty serious performance issue to me.

Here's a general outline for what I did.

* Accumulo 1.8.0
* Created a table with 1M rows, each row with 10 columns using YCSB 
(workloada)
* Split the table into 9 tablets
* Computed the set of all rows in the table

For a number of iterations:
* Shuffle this set of rows
* Choose the first N rows
* Construct an equivalent set of Ranges from the set of Rows, choosing a 
random column (0-9)
* Partition the N rows into X collections
* Submit X tasks to query one partition of the N rows (to a thread pool 
with X fixed threads)

I have two implementations of these tasks. One, where all ranges in a 
partition are executed via one BatchScanner. A second where each range is 
executed in serial using a Scanner. The numbers speak for themselves.

** BatchScanners **
2016-09-10 17:51:38,811 [joshelser.YcsbBatchScanner] INFO : Shuffled all 
rows
2016-09-10 17:51:38,843 [joshelser.YcsbBatchScanner] INFO : All ranges 
calculated: 3000 ranges found
2016-09-10 17:51:38,846 [joshelser.YcsbBatchScanner] INFO : Executing 6 
range partitions using a pool of 6 threads
2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO : Queries 
executed in 40178 ms
2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO : Executing 6 
range partitions using a pool of 6 threads
2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO : Queries 
executed in 42296 ms
2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO : Executing 6 
range partitions using a pool of 6 threads
2016-09-10 17:53:47,414 [joshelser.YcsbBatchScanner] INFO : Queries 
executed in 46094 ms
2016-09-10 17:53:47,415 [joshelser.YcsbBatchScanner] INFO : Executing 6 
range partitions using a pool of 6 threads
2016-09-10 17:54:35,118 [joshelser.YcsbBatchScanner] INFO : Queries 
executed in 47704 ms
2016-09-10 17:54:35,119 [joshelser.YcsbBatchScanner] INFO : Executing 6 
range partitions using a pool of 6 threads
2016-09-10 17:55:24,339 [joshelser.YcsbBatchScanner] INFO : Queries 
executed in 49221 ms

** Scanners **
2016-09-10 17:57:23,867 [joshelser.YcsbBatchScanner] INFO : Shuffled all 
rows
2016-09-10 17:57:23,898 [joshelser.YcsbBatchScanner] INFO : All ranges 
calculated: 3000 ranges found
2016-09-10 17:57:23,903 [joshelser.YcsbBatchScanner] INFO : Executing 6 
range partitions using a pool of 6 threads
2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO : Queries 
executed in 2833 ms
2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO : Executing 6 
range partitions using a pool of 6 threads
2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO : Queries 
executed in 2536 ms
2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO : Executing 6 
range partitions using a pool of 6 threads
2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO : Queries 
executed in 2150 ms
2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO : Executing 6 
range partitions using a pool of 6 threads
2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO : Queries 
executed in 2061 ms
2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO : Executing 6 
range partitions using a pool of 6 threads
2016-09-10 17:57:35,628 [joshelser.YcsbBatchScanner] INFO : Queries 
executed in 2140 ms

Query code is available https://github.com/joshelser/accumulo-range-binning
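
To put the log lines above into comparable terms, the following sketch averages the five "Queries executed in" timings for each implementation and converts them to lookups per second. The timings and the 3000-range count are copied from the logs; the class and method names are illustrative only.

```java
import java.util.Arrays;

/** Summarises the timings from the logs above; names are illustrative. */
final class LookupTimings {
  static final long[] BATCH_SCANNER_MS = {40178, 42296, 46094, 47704, 49221};
  static final long[] SCANNER_MS = {2833, 2536, 2150, 2061, 2140};
  static final int RANGES_PER_ITERATION = 3000;

  /** Arithmetic mean of the samples, or NaN for an empty array. */
  static double meanMillis(long[] samples) {
    return Arrays.stream(samples).average().orElse(Double.NaN);
  }

  /** Lookups per second for a given number of ranges and elapsed milliseconds. */
  static double lookupsPerSecond(int ranges, double millis) {
    return ranges / (millis / 1000.0);
  }

  public static void main(String[] args) {
    double batch = meanMillis(BATCH_SCANNER_MS);
    double serial = meanMillis(SCANNER_MS);
    System.out.printf("BatchScanner: %.1f ms mean, %.1f lookups/s%n",
        batch, lookupsPerSecond(RANGES_PER_ITERATION, batch));
    System.out.printf("Scanner:      %.1f ms mean, %.1f lookups/s%n",
        serial, lookupsPerSecond(RANGES_PER_ITERATION, serial));
    System.out.printf("Ratio:        %.1fx%n", batch / serial);
  }
}
```

On these figures the mean ratio comes out at roughly 19x, in line with the "~20x" quoted at the top of the message.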

Sven Hodapp wrote:
> Hi Keith,
>
> I've tried it with 1, 2 or 10 threads. Unfortunately there were no notable 
> differences.
> Maybe it's a problem with the table structure? For example, it may happen that 
> one row id (e.g. a sentence) has several thousand column families. Can this 
> affect the seek performance?
>
> So for my initial example there are about 3000 row ids to seek, which will 
> return about 500k entries. If I filter for specific column families (e.g. a 
> document without annotations) it will return about 5k entries, but the seek 
> time will only be halved.
> Are there too many column families to seek quickly?
>
> Thanks!
>
> Regards,
> Sven
>



Bug in TabletServerBatchWriter.logStats

2016-08-22 Thread Dan Blum
If no batches have been written logStats will throw a divide-by-zero
exception. Obviously, it would be good practice for the caller to not open a
BatchWriter in the first place until it knows something needs to be written,
but it should be simple enough to avoid this.
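
As an illustration of the kind of guard I mean, here is a minimal sketch. It is not the actual TabletServerBatchWriter code; the class, method and field names are hypothetical, and the only point is that the average is computed through a helper that checks for a zero count first.

```java
import java.util.Locale;

/** Hypothetical stats helper showing a zero-count guard; not Accumulo source code. */
final class BatchStats {
  /** Returns total / count, or 0.0 when count is zero, so callers never divide by zero. */
  static double safeAverage(long total, long count) {
    return count == 0 ? 0.0 : (double) total / count;
  }

  /** Formats a summary line that stays valid even if no batches were written. */
  static String summary(long totalMillis, long batches) {
    return String.format(Locale.ROOT, "batches=%d avgMillisPerBatch=%.2f",
        batches, safeAverage(totalMillis, batches));
  }

  public static void main(String[] args) {
    System.out.println(summary(0, 0));   // no batches written: no exception
    System.out.println(summary(10, 4));  // normal case
  }
}
```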



RE: Querying Accumulo Data Using Java

2016-04-28 Thread Dan Blum
You can’t pass objects because the iterator stack will be run in a different 
JVM; anything that isn’t a string or coercible to a string would need to be 
serialized. So if you have something complex you will need to serialize it into 
a string somehow and then deserialize it in your filter class, which is quite 
doable.
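
One way to do this, sketched with only the JDK: serialize the object with Java serialization, Base64-encode the bytes into a string, put that string in the options map, and reverse the steps on the other side. The OptionCodec class and the "wantedQualifiers" option name are made up for this example; the plain HashMap stands in for the iterator options, and on the server side the decode call would go in your filter's init method.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper that round-trips a Serializable object through a String option. */
final class OptionCodec {
  /** Serializes the object and returns it as a Base64 string. */
  static String encode(Serializable obj) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(obj);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return Base64.getEncoder().encodeToString(bytes.toByteArray());
  }

  /** Decodes a string produced by encode back into an object. */
  static Object decode(String encoded) {
    byte[] raw = Base64.getDecoder().decode(encoded);
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
      return in.readObject();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Map<String,String> options = new HashMap<>(); // stands in for the iterator options
    ArrayList<String> wanted = new ArrayList<>(Arrays.asList("bar", "baz"));
    options.put("wantedQualifiers", encode(wanted));

    // What the filter would do with the option it receives:
    Object restored = decode(options.get("wantedQualifiers"));
    System.out.println(restored);
  }
}
```

The class being deserialized has to be on the tablet server's classpath as well, so for anything non-trivial a simple text format (a delimited list, JSON) is often easier to deploy than Java serialization.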

 

From: Ben Craig [mailto:ben.cra...@gmail.com] 
Sent: Thursday, April 28, 2016 12:54 PM
To: user@accumulo.apache.org
Subject: Re: Querying Accumulo Data Using Java

 

Hey Josh,


Thanks for the response.  I was getting pretty lost trying to use the internal 
iterator.  I will try using fetchColumn on the scanner.  I have one last 
question: is there any possible way to pass a Java object to a custom filter, 
or are we limited to the property Map of <String,String>?  I think the 
String,String map will work in almost all cases I need, but I am more curious 
than anything.

 

Thanks,


Ben 

 

On Thu, Apr 28, 2016 at 1:34 PM, Josh Elser  wrote:

Hi Ben,

Looks like you're on the right track. Iterator priorities are a little obtuse 
at first glance; you probably want to change the 1 to 15 (we can touch on the 
"why" later).

As far as Iterators/Filters that you should feel comfortable using, check out: 
https://github.com/apache/accumulo/tree/master/core/src/main/java/org/apache/accumulo/core/iterators/user

The classes in the "system" package are not meant for public consumption (and 
are often used to implement internal functionality). This is probably why 
you're having a hard time figuring out how to use it.

Don't miss the methods on Scanner: fetchColumnFamily(Text), fetchColumn(Text, 
Text), and fetchColumn(Column). These are how you can easily do column family 
or column family + qualifier filtering.

For example, if you wanted to filter on the column family "foo" and the column 
qualifier "bar":

```
Scanner scan = connector.createScanner("table", auths);
scan.fetchColumn(new Text("foo"), new Text("bar"));
RowIterator rowIterator = new RowIterator(scan);
while (...) { ... }
```

The fetch*() methods are also cumulative. If you want to fetch multiple cf's 
(or cf+cq pairs), you can invoke the method multiple times.

Ben Craig wrote:

Hey Guys I'm new to Accumulo and trying to learn how to query data.  I
think I've got the basics down like:

//create a scanner
Scanner scan = connector.createScanner( "table", auths );

//create a filter
IteratorSetting itr1 = new IteratorSetting( 1, "TimeFilter",
AgeOffFilter.class );
itr1.addOption( TTL, Long.toString( DEFAULT_QUERY_TIME ) );
scan.addScanIterator( itr1 );

//iterate over the resulting rows
RowIterator rowIterator = new RowIterator( scan );
while ( rowIterator.hasNext() )
{
}

I've been playing around with some of the built-in filters and have been
able to apply multiple filters on top of each other.  I'm having issues
with some of the filters, where they take a complex Java object and
not just options.

For example ColumnQualifierFilter.java


When we use Iterator Settings the class is implicitly created, but if I
want to use the ColumnQualifierFilter I need to create one and pass it a
set of columns.  I've been playing around with it for a while and haven't
been able to learn how to use it properly.

The constructor takes a sorted key value iterator.  How do I get this
sorted key value iterator?  Do I start with a scanner or do you start
with another type of scanner?  Do I just make one?
new ArrayList>(); ? And the data goes
into it?



I've read through this Accumulo book but it just
shows how you can use the Scanner/Iterator Settings to query.

If anyone has any suggestions / documentation / examples it would be much
appreciated.

Thanks,

Ben

 



RE: Unable to get Mini to use native maps - 1.6.2

2016-02-24 Thread Dan Blum
Got it, thanks. The main problem we had turned out to be the way that MAC was 
being run, which was too clever by half (it was implemented by a former 
employee who shall remain nameless). Somehow it managed to filter out a small 
subset of logging messages including the ones produced by NativeMap – I am 
still not sure how.

 

From: Michael Wall [mailto:mjw...@gmail.com] 
Sent: Wednesday, February 24, 2016 4:11 PM
To: user@accumulo.apache.org
Subject: Re: Unable to get Mini to use native maps - 1.6.2

 

Dan,

 

I was unable to get Christopher's example to work as posted without steps 
mentioned earlier in the thread.  In the MiniAccumuloCluster code for version 
1.6.2, if I am reading it correctly, you must do a couple of things.

 

1 - build the native libs

2 - call config.setNativeLibPaths(somedirectory)

 

The ProcessBuilder that is used to run the MiniAccumuloCluster does not appear 
to reset the environment.  But it does overwrite whatever is in 
DYLD_LIBRARY_PATH and LD_LIBRARY_PATH with the value set in the config.  So 
you must call config.setNativeLibPaths and then use that config when you create 
the MiniAccumuloCluster, as far as I can tell.
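
The environment behaviour described here can be reproduced with plain java.lang.ProcessBuilder. The sketch below is not the MiniAccumuloCluster code; the class name, method name and path are placeholders. It only shows that a new ProcessBuilder starts with a copy of the parent's environment and that putting a value replaces whatever the two library path variables held before.

```java
import java.util.Map;

/** Illustrative only: how a child process environment inherits and overrides variables. */
final class ChildEnvDemo {
  /** Builds (but does not start) a child process whose library path variables are overwritten. */
  static ProcessBuilder withNativeLibPath(String nativeLibDir, String... command) {
    ProcessBuilder builder = new ProcessBuilder(command);
    // environment() starts out as a copy of the current process environment.
    Map<String,String> env = builder.environment();
    // put() replaces any inherited value rather than appending to it.
    env.put("LD_LIBRARY_PATH", nativeLibDir);   // Linux dynamic loader
    env.put("DYLD_LIBRARY_PATH", nativeLibDir); // macOS dynamic loader
    return builder;
  }

  public static void main(String[] args) {
    ProcessBuilder builder = withNativeLibPath("/path/to/native/libs", "java", "-version");
    System.out.println(builder.environment().get("LD_LIBRARY_PATH"));
    System.out.println(builder.environment().get("DYLD_LIBRARY_PATH"));
  }
}
```

So a library path exported in the shell that launches the test has no effect on the spawned tablet server if the launcher overwrites it like this; only the value passed through the config would reach the child.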

 

I created the following repo to show how I did it, 
https://github.com/mjwall/miniaccumulocluster-nativemaps.

 

Hope that helps.

 

Mike

 

 

On Tue, Feb 23, 2016 at 5:14 PM, Christopher <ctubb...@apache.org> wrote:

Looking at the NativeMap, it looks like it will always log some message at the 
INFO level if it successfully loaded the native maps, or at the ERROR level if 
it failed to do so (with some extra DEBUG messages while it searches the path).

I thought maybe there was a class loading race condition where 
NativeMap.isLoaded() returns false while it's still trying to load... that 
might still be a possibility (I'm not sure if this can happen with static 
initializer blocks?), but if it were, you'd still see the log messages about 
loading or not.

I can't see your code, so I don't know what's wrong, but something like the 
following should work fine:

 

MiniAccumuloConfig config = new MiniAccumuloConfig(new File("/path/to/miniDir"), "rootPassword");
HashMap<String,String> map = new HashMap<String,String>();
map.put(Property.TSERV_NATIVEMAP_ENABLED.getKey(), "true");
config.setSiteConfig(map);
MiniAccumuloCluster mini = new MiniAccumuloCluster(config);

 

On Tue, Feb 23, 2016 at 2:21 PM Dan Blum <db...@bbn.com> wrote:

In fact, we are calling that (in Groovy which is why I missed it before, not 
being that familiar with Groovy). I verified that the path is correct – doesn’t 
help.

 

From: Christopher [mailto:ctubb...@apache.org] 
Sent: Tuesday, February 23, 2016 2:06 PM


To: user@accumulo.apache.org
Subject: Re: Unable to get Mini to use native maps - 1.6.2

 

MiniAccumuloConfig has a method, called "setNativeLibPaths(String... 
nativePathItems)".

You should call that method with the absolute path for your compiled native map 
shared library file (.so), before you start Mini.

 

On Tue, Feb 23, 2016 at 2:03 PM Josh Elser <josh.el...@gmail.com> wrote:

MiniAccumuloCluster spawns its own processes, though. Calling
NativeMap.isLoaded() in your test JVM isn't proving anything.

That's why you need to call these methods on MAC, you would need to
check the TabletServer*.log file(s), and make sure that its
configuration is set up properly to find the .so.

Does that make sense? Did I misinterpret you?

Dan Blum wrote:
> I'll see what I can do, but there's no simple way to pull out something
> small we can share (and it would have to be a gradle project).
>
> I confirmed that the path is not the immediate issue by adding an explicit
> call to NativeMap.isLoaded() at the start of my test - that produces logging
> from NativeMap saying it can't find the library, which is what I expect.
> Without this call NativeMap still logs nothing so the setting that should
> cause it to be referenced is getting overridden somewhere. Calling
> InstanceOperations.getSiteConfiguration and getSystemConfiguration shows
> that the native maps are enabled, however.
>
> -----Original Message-----
> From: Josh Elser [mailto:josh.el...@gmail.com]
> Sent: Tuesday, February 23, 2016 12:56 PM
> To: user@accumulo.apache.org
> Subject: Re: Unable to get Mini to use native maps - 1.6.2
>
> Well, I'm near positive that 1.6.2 had native maps working, so there
> must be something unexpected happening :). MAC should be very close to
> what a real standalone instance is doing -- if you have the ability to
> share some end-to-end project with where you are seeing this, that'd be
> extremely helpful (e.g. a Maven project that we can just run would be
> superb).
>
> Dan Blum wrote:
>> I'll take a look but I don't think the path is the problem - NativeMap
>> should try to load the library regardles

RE: Unable to get Mini to use native maps - 1.6.2

2016-02-23 Thread Dan Blum
In fact, we are calling that (in Groovy which is why I missed it before, not 
being that familiar with Groovy). I verified that the path is correct – doesn’t 
help.

 

From: Christopher [mailto:ctubb...@apache.org] 
Sent: Tuesday, February 23, 2016 2:06 PM
To: user@accumulo.apache.org
Subject: Re: Unable to get Mini to use native maps - 1.6.2

 

MiniAccumuloConfig has a method, called "setNativeLibPaths(String... 
nativePathItems)".

You should call that method with the absolute path for your compiled native map 
shared library file (.so), before you start Mini.

 

On Tue, Feb 23, 2016 at 2:03 PM Josh Elser <josh.el...@gmail.com> wrote:

MiniAccumuloCluster spawns its own processes, though. Calling
NativeMap.isLoaded() in your test JVM isn't proving anything.

That's why you need to call these methods on MAC; you would need to
check the TabletServer*.log file(s), and make sure that its
configuration is set up properly to find the .so.

Does that make sense? Did I misinterpret you?

Dan Blum wrote:
> I'll see what I can do, but there's no simple way to pull out something
> small we can share (and it would have to be a gradle project).
>
> I confirmed that the path is not the immediate issue by adding an explicit
> call to NativeMap.isLoaded() at the start of my test - that produces logging
> from NativeMap saying it can't find the library, which is what I expect.
> Without this call NativeMap still logs nothing so the setting that should
> cause it to be referenced is getting overridden somewhere. Calling
> InstanceOperations.getSiteConfiguration and getSystemConfiguration shows
> that the native maps are enabled, however.
>
> -----Original Message-----
> From: Josh Elser [mailto:josh.el...@gmail.com]
> Sent: Tuesday, February 23, 2016 12:56 PM
> To: user@accumulo.apache.org
> Subject: Re: Unable to get Mini to use native maps - 1.6.2
>
> Well, I'm near positive that 1.6.2 had native maps working, so there
> must be something unexpected happening :). MAC should be very close to
> what a real standalone instance is doing -- if you have the ability to
> share some end-to-end project with where you are seeing this, that'd be
> extremely helpful (e.g. a Maven project that we can just run would be
> superb).
>
> Dan Blum wrote:
>> I'll take a look but I don't think the path is the problem - NativeMap
>> should try to load the library regardless of whether this path is set and
>> will log if it can't find it. This isn't happening.
>>
>> -----Original Message-----
>> From: Josh Elser [mailto:josh.el...@gmail.com]
>> Sent: Tuesday, February 23, 2016 12:27 PM
>> To: user@accumulo.apache.org
>> Subject: Re: Unable to get Mini to use native maps - 1.6.2
>>
>> Hi Dan,
>>
>> I'm seeing in our internal integration tests that we have some
>> configuration happening which (at least, intends to) configure the
>> native maps for the minicluster.
>>
>> If you're not familiar, the MiniAccumuloConfig and MiniAccumuloCluster
>> classes are thin wrappers around MiniAccumuloConfigImpl and
>> MiniAccumuloClusterImpl. There is a setNativeLibPaths method on
>> MiniAccumuloConfigImpl which you can use to provide the path to the
>> native library shared object (.so). You will probably have to switch
>> from MiniAccumuloConfig/MiniAccumuloCluster to
>> MiniAccumuloConfigImpl/MiniAccumuloClusterImpl to use the "hidden" methods.
>>
>> You could also look at MiniClusterHarness.java in >=1.7 if you want a
>> concrete example of how we initialize things for our tests.
>>
>> - Josh
>>
>> Dan Blum wrote:
>>> In order to test to make sure we don't have more code that needs a
>>> workaround for https://issues.apache.org/jira/browse/ACCUMULO-4148 I am
>>> trying again to enable the native maps for Mini, which we use for testing.
>>>
>>> I set tserver.memory.maps.native.enabled to true in the site XML, and this
>>> is getting picked up since I see this in the Mini logs:
>>>
>>> [server.Accumulo] INFO : tserver.memory.maps.native.enabled = true
>>>
>>> However, NativeMap should log something when it tries to load the library,
>>> whether it succeeds or fails, but it logs nothing. The obvious conclusion is
>>> that something about how MiniAccumuloCluster starts means that this setting
>>> is ignored or overridden, but I am not finding it. (I see the mergeProp call
>>> in MiniAccumuloConfigImpl.initialize which will set TSERV_NATIVEMAP_ENABLED
>>> to false, but that should only set it if it's not already in the properties,
>>> which it should be, and as far as I can tell the log message above is issued
>>> after this.)
>>>
>



RE: Unable to get Mini to use native maps - 1.6.2

2016-02-23 Thread Dan Blum
I'll see what I can do, but there's no simple way to pull out something
small we can share (and it would have to be a gradle project).

I confirmed that the path is not the immediate issue by adding an explicit
call to NativeMap.isLoaded() at the start of my test - that produces logging
from NativeMap saying it can't find the library, which is what I expect.
Without this call NativeMap still logs nothing so the setting that should
cause it to be referenced is getting overridden somewhere. Calling
InstanceOperations.getSiteConfiguration and getSystemConfiguration shows
that the native maps are enabled, however.

-----Original Message-----
From: Josh Elser [mailto:josh.el...@gmail.com] 
Sent: Tuesday, February 23, 2016 12:56 PM
To: user@accumulo.apache.org
Subject: Re: Unable to get Mini to use native maps - 1.6.2

Well, I'm near positive that 1.6.2 had native maps working, so there 
must be something unexpected happening :). MAC should be very close to 
what a real standalone instance is doing -- if you have the ability to 
share some end-to-end project with where you are seeing this, that'd be 
extremely helpful (e.g. a Maven project that we can just run would be 
superb).

Dan Blum wrote:
> I'll take a look but I don't think the path is the problem - NativeMap
> should try to load the library regardless of whether this path is set and
> will log if it can't find it. This isn't happening.
>
> -----Original Message-----
> From: Josh Elser [mailto:josh.el...@gmail.com]
> Sent: Tuesday, February 23, 2016 12:27 PM
> To: user@accumulo.apache.org
> Subject: Re: Unable to get Mini to use native maps - 1.6.2
>
> Hi Dan,
>
> I'm seeing in our internal integration tests that we have some
> configuration happening which (at least, intends to) configure the
> native maps for the minicluster.
>
> If you're not familiar, the MiniAccumuloConfig and MiniAccumuloCluster
> classes are thin wrappers around MiniAccumuloConfigImpl and
> MiniAccumuloClusterImpl. There is a setNativeLibPaths method on
> MiniAccumuloConfigImpl which you can use to provide the path to the
> native library shared object (.so). You will probably have to switch
> from MiniAccumuloConfig/MiniAccumuloCluster to
> MiniAccumuloConfigImpl/MiniAccumuloClusterImpl to use the "hidden" methods.
>
> You could also look at MiniClusterHarness.java in >=1.7 if you want a
> concrete example of how we initialize things for our tests.
>
> - Josh
>
> Dan Blum wrote:
>> In order to test to make sure we don't have more code that needs a
>> workaround for https://issues.apache.org/jira/browse/ACCUMULO-4148 I am
>> trying again to enable the native maps for Mini, which we use for testing.
>>
>> I set tserver.memory.maps.native.enabled to true in the site XML, and this
>> is getting picked up since I see this in the Mini logs:
>>
>> [server.Accumulo] INFO : tserver.memory.maps.native.enabled = true
>>
>> However, NativeMap should log something when it tries to load the library,
>> whether it succeeds or fails, but it logs nothing. The obvious conclusion is
>> that something about how MiniAccumuloCluster starts means that this setting
>> is ignored or overridden, but I am not finding it. (I see the mergeProp call
>> in MiniAccumuloConfigImpl.initialize which will set TSERV_NATIVEMAP_ENABLED
>> to false, but that should only set it if it's not already in the properties,
>> which it should be, and as far as I can tell the log message above is issued
>> after this.)
>>
>



RE: Unable to get Mini to use native maps - 1.6.2

2016-02-23 Thread Dan Blum
I'll take a look but I don't think the path is the problem - NativeMap
should try to load the library regardless of whether this path is set and
will log if it can't find it. This isn't happening.

-----Original Message-----
From: Josh Elser [mailto:josh.el...@gmail.com] 
Sent: Tuesday, February 23, 2016 12:27 PM
To: user@accumulo.apache.org
Subject: Re: Unable to get Mini to use native maps - 1.6.2

Hi Dan,

I'm seeing in our internal integration tests that we have some 
configuration happening which (at least, intends to) configure the 
native maps for the minicluster.

If you're not familiar, the MiniAccumuloConfig and MiniAccumuloCluster 
classes are thin wrappers around MiniAccumuloConfigImpl and 
MiniAccumuloClusterImpl. There is a setNativeLibPaths method on 
MiniAccumuloConfigImpl which you can use to provide the path to the 
native library shared object (.so). You will probably have to switch 
from MiniAccumuloConfig/MiniAccumuloCluster to 
MiniAccumuloConfigImpl/MiniAccumuloClusterImpl to use the "hidden" methods.

You could also look at MiniClusterHarness.java in >=1.7 if you want a 
concrete example of how we initialize things for our tests.

- Josh

Dan Blum wrote:
> In order to test to make sure we don't have more code that needs a
> workaround for https://issues.apache.org/jira/browse/ACCUMULO-4148 I am
> trying again to enable the native maps for Mini, which we use for testing.
>
> I set tserver.memory.maps.native.enabled to true in the site XML, and this
> is getting picked up since I see this in the Mini logs:
>
> [server.Accumulo] INFO : tserver.memory.maps.native.enabled = true
>
> However, NativeMap should log something when it tries to load the library,
> whether it succeeds or fails, but it logs nothing. The obvious conclusion is
> that something about how MiniAccumuloCluster starts means that this setting
> is ignored or overridden, but I am not finding it. (I see the mergeProp call
> in MiniAccumuloConfigImpl.initialize which will set TSERV_NATIVEMAP_ENABLED
> to false, but that should only set it if it's not already in the properties,
> which it should be, and as far as I can tell the log message above is issued
> after this.)
>



Unable to get Mini to use native maps - 1.6.2

2016-02-23 Thread Dan Blum
In order to test to make sure we don't have more code that needs a
workaround for https://issues.apache.org/jira/browse/ACCUMULO-4148 I am
trying again to enable the native maps for Mini, which we use for testing.

I set tserver.memory.maps.native.enabled to true in the site XML, and this
is getting picked up since I see this in the Mini logs:

[server.Accumulo] INFO : tserver.memory.maps.native.enabled = true

However, NativeMap should log something when it tries to load the library,
whether it succeeds or fails, but it logs nothing. The obvious conclusion is
that something about how MiniAccumuloCluster starts means that this setting
is ignored or overridden, but I am not finding it. (I see the mergeProp call
in MiniAccumuloConfigImpl.initialize which will set TSERV_NATIVEMAP_ENABLED
to false, but that should only set it if it's not already in the properties,
which it should be, and as far as I can tell the log message above is issued
after this.)
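
The "only set it if it's not already in the properties" behavior described above can be sketched as follows. This illustrates the semantics as described in the message and is not a copy of the Accumulo source; the class name MergePropSketch and the method signature are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only; not the Accumulo source.
final class MergePropSketch {

  static final String NATIVE_ENABLED = "tserver.memory.maps.native.enabled";

  // Applies a default only when the key is absent, so a user-supplied value wins.
  static void mergeProp(Map<String,String> siteConfig, String key, String defaultValue) {
    if (!siteConfig.containsKey(key)) {
      siteConfig.put(key, defaultValue);
    }
  }

  public static void main(String[] args) {
    // Case 1: the user already set the property, so the default is ignored.
    Map<String,String> userSet = new HashMap<>();
    userSet.put(NATIVE_ENABLED, "true");
    mergeProp(userSet, NATIVE_ENABLED, "false");
    System.out.println(userSet.get(NATIVE_ENABLED)); // prints "true"

    // Case 2: the property is missing, so the default is applied.
    Map<String,String> unset = new HashMap<>();
    mergeProp(unset, NATIVE_ENABLED, "false");
    System.out.println(unset.get(NATIVE_ENABLED)); // prints "false"
  }
}
```

Under these semantics, the value would end up false only if the property were missing from the site configuration at the time the merge runs, which is why the order of configuration loading matters here.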