Filtering on column qualifier

2013-08-21 Thread Marc Reichman
I have some data stored in Accumulo with some scores stored as column
qualifiers (there was an older thread about this). I would like to find a
way to do thresholding when retrieving the data without retrieving it all
and then manually filtering out items below my threshold.

I know I can fetch column qualifiers which are exact.

I've seen the ColumnQualifierFilter, which I assume is what's in play when
I fetch qualifiers. Is there a reasonable pattern to extend this and try to
use it as a scan iterator so I can do things like greater than a value
which will be interpreted as a Double vs. the string equality going on now?

Thanks,
Marc


Re: Filtering on column qualifier

2013-08-21 Thread John Vines
There's no way to extend the ColumnQualietyFilter via configuration, but it
sounds like you are on top of it. You just need to extend the class,
possibly copy a bit of code, and change the equality check to a compareTo
after converting the Strings to Doubles.


On Wed, Aug 21, 2013 at 10:00 AM, Marc Reichman 
mreich...@pixelforensics.com wrote:

 I have some data stored in Accumulo with some scores stored as column
 qualifiers (there was an older thread about this). I would like to find a
 way to do thresholding when retrieving the data without retrieving it all
 and then manually filtering out items below my threshold.

 I know I can fetch column qualifiers which are exact.

 I've seen the ColumnQualifierFilter, which I assume is what's in play when
 I fetch qualifiers. Is there a reasonable pattern to extend this and try to
 use it as a scan iterator so I can do things like greater than a value
 which will be interpreted as a Double vs. the string equality going on now?

 Thanks,
 Marc



Straggler problem in Accumulo BatchScans

2013-08-21 Thread Slater, David M.
Hey, I have a 7 node network running accumulo 1.4.1 and hadoop 1.0.4.

When I run large BatchScanner operations, the number of tablets scanned per 
node is not uniform, leading to the overloaded nodes taking much longer to 
finish than the others. For queries that require all of the scans to finish 
before returning, this is a major latency issue. What are some practical means 
of load-balancing this to reduce delay?

Is it possible for tablets to be hosted on multiple tablet servers, up to the 
replication factor of the underlying hdfs? Are there reasons this might be an 
undesirable design?

Thanks in advance,
David


Re: Straggler problem in Accumulo BatchScans

2013-08-21 Thread James Hughes
David,

Each tablet is hosted by one tablet server, and there's no way around
that.  (This is actually quite reasonably; otherwise, we would receive
duplicate results from multiple tablet servers.)

One strategy to deal with imbalanced data is to add a random partition
prefix to your row Ids.  This does complicate building queries, but in
general, you'll be able to leverage all of your nodes.  I did some testing
with the nodes of such random shard ids, and it seems like having 1-2x as
many shards as tablet servers worked pretty well.  (I'd suggest 2x in case
you ever grow your cloud.)

In particular, if you can reingest your data, prepend a random 01-14~ to
the beginning of each row Id, and see if that helps.  After that, you can
help Accumulo decide where it should split tablets with addSplits 01 02
etc 14 from the Accumulo shell (or programmatically with the addSplits).
After that, you can make sure that your 14+ splits are distributed across
the 7 nodes in a reasonable way.

I hope that helps,

Jim

http://accumulo.apache.org/1.4/apidocs/org/apache/accumulo/core/client/admin/TableOperations.html#addSplits%28java.lang.String,%20java.util.SortedSet%29



On Wed, Aug 21, 2013 at 7:09 PM, Slater, David M.
david.sla...@jhuapl.eduwrote:

 Hey, I have a 7 node network running accumulo 1.4.1 and hadoop 1.0.4.

 ** **

 When I run large BatchScanner operations, the number of tablets scanned
 per node is not uniform, leading to the overloaded nodes taking much longer
 to finish than the others. For queries that require all of the scans to
 finish before returning, this is a major latency issue. What are some
 practical means of load-balancing this to reduce delay?

 ** **

 Is it possible for tablets to be hosted on multiple tablet servers, up to
 the replication factor of the underlying hdfs? Are there reasons this might
 be an undesirable design?

 ** **

 Thanks in advance,
 David 



Re: Straggler problem in Accumulo BatchScans

2013-08-21 Thread Eric Newton
You can write your own balancer, and use it just for your table.

-Eric


On Wed, Aug 21, 2013 at 7:47 PM, Slater, David M.
david.sla...@jhuapl.eduwrote:

 Hi James,

 ** **

 I already had the data sharded into 7 partitions, and that works well to
 distribute the data into 7 tablets. (I have 2 GB tablet sizes, with about
 1.2 TB of data, so there are numerous tablets per server). The difficulty
 is that Accumulo seems to decide for itself where each tablet goes. When I
 only had 10 GB of data, it nicely divided into 7 tablets, one on each node.
 However, with dozens of tablets per tablet server, it assigns tablets to
 tablet servers orthogonally to my presplits. 

 ** **

 Is there a way to force Accumulo to keep specific ranges on specific
 nodes? If not, I suppose that I could have 4x or more shards per tablet
 server to ensure that it was more uniformly placed.

 ** **

 D

 ** **

 *From:* James Hughes [mailto:jn...@virginia.edu]
 *Sent:* Wednesday, August 21, 2013 7:29 PM
 *To:* user@accumulo.apache.org
 *Subject:* Re: Straggler problem in Accumulo BatchScans

 ** **

 David,

 Each tablet is hosted by one tablet server, and there's no way around
 that.  (This is actually quite reasonably; otherwise, we would receive
 duplicate results from multiple tablet servers.)  

 One strategy to deal with imbalanced data is to add a random partition
 prefix to your row Ids.  This does complicate building queries, but in
 general, you'll be able to leverage all of your nodes.  I did some testing
 with the nodes of such random shard ids, and it seems like having 1-2x as
 many shards as tablet servers worked pretty well.  (I'd suggest 2x in case
 you ever grow your cloud.)

 In particular, if you can reingest your data, prepend a random 01-14~ to
 the beginning of each row Id, and see if that helps.  After that, you can
 help Accumulo decide where it should split tablets with addSplits 01 02
 etc 14 from the Accumulo shell (or programmatically with the addSplits).
 After that, you can make sure that your 14+ splits are distributed across
 the 7 nodes in a reasonable way.  

 I hope that helps,

 Jim


 http://accumulo.apache.org/1.4/apidocs/org/apache/accumulo/core/client/admin/TableOperations.html#addSplits%28java.lang.String,%20java.util.SortedSet%29
 

 ** **

 On Wed, Aug 21, 2013 at 7:09 PM, Slater, David M. david.sla...@jhuapl.edu
 wrote:

 Hey, I have a 7 node network running accumulo 1.4.1 and hadoop 1.0.4.

  

 When I run large BatchScanner operations, the number of tablets scanned
 per node is not uniform, leading to the overloaded nodes taking much longer
 to finish than the others. For queries that require all of the scans to
 finish before returning, this is a major latency issue. What are some
 practical means of load-balancing this to reduce delay?

  

 Is it possible for tablets to be hosted on multiple tablet servers, up to
 the replication factor of the underlying hdfs? Are there reasons this might
 be an undesirable design?

  

 Thanks in advance,
 David 

 ** **



RE: Filtering on column qualifier

2013-08-21 Thread Slater, David M.
Would this require extending the BatchScanner as well to make use of the 
extended ColumnQualifierFilter, or could this be done with an unmodified 
BatchScanner?

From: John Vines [mailto:vi...@apache.org]
Sent: Wednesday, August 21, 2013 10:49 AM
To: user@accumulo.apache.org
Subject: Re: Filtering on column qualifier

There's no way to extend the ColumnQualietyFilter via configuration, but it 
sounds like you are on top of it. You just need to extend the class, possibly 
copy a bit of code, and change the equality check to a compareTo after 
converting the Strings to Doubles.

On Wed, Aug 21, 2013 at 10:00 AM, Marc Reichman 
mreich...@pixelforensics.commailto:mreich...@pixelforensics.com wrote:
I have some data stored in Accumulo with some scores stored as column 
qualifiers (there was an older thread about this). I would like to find a way 
to do thresholding when retrieving the data without retrieving it all and then 
manually filtering out items below my threshold.

I know I can fetch column qualifiers which are exact.

I've seen the ColumnQualifierFilter, which I assume is what's in play when I 
fetch qualifiers. Is there a reasonable pattern to extend this and try to use 
it as a scan iterator so I can do things like greater than a value which will 
be interpreted as a Double vs. the string equality going on now?

Thanks,
Marc



Re: Straggler problem in Accumulo BatchScans

2013-08-21 Thread Eric Newton
A new balancer is a plug-in class that instructs the Master process where
to place tablets.

If you know you need your tablets spread out over servers based on time
(row id), you can do that.  It's pretty common, in fact.

-Eric


On Wed, Aug 21, 2013 at 7:54 PM, Slater, David M.
david.sla...@jhuapl.eduwrote:

 Hi Dave,

 ** **

 The table is currently organizing netflow data with its rowID of
 timestamp_netflowRecordID, some columns corresponding to various netflow
 quantites, and one column representing the entire netflow in binary form.*
 ***

 ** **

 The table is about 1.2 TB, and I am scanning 5-40 GB per scan, which scans
 about 7-28 tablets.

 ** **

 What do you mean by a custom load balancer? Do you mean balancing the data
 on ingest, or balancing the query load? What would you recommend for
 balancing the query load if I can only retrieve the data from a particular
 tablet?

 ** **

 I’ve played with index/data caches, though I haven’t used readahead
 threads or max open files. Is that referring to rfiles? 

 ** **

 I’m noticing that most of the queries are CPU bound, and that read i/o is
 not being hit very hard. Is that a typical behavior for scans?

 ** **

 Thanks,
 David 

 ** **

 *From:* Dave Marion [mailto:dlmar...@comcast.net]
 *Sent:* Wednesday, August 21, 2013 7:29 PM
 *To:* user@accumulo.apache.org
 *Subject:* RE: Straggler problem in Accumulo BatchScans

 ** **

 How is the table organized?

 What percent of the table are you scanning in these large operations?

 Have you considered writing a custom load balancer?

 ** **

 I don’t think that a tablet can be hosted on multiple servers. But you
 might be able to play around with the index/data caches, readahead threads
 (concurrent queries), and max open files to achieve better performance.***
 *

 ** **

 *From:* Slater, David M. 
 [mailto:david.sla...@jhuapl.edudavid.sla...@jhuapl.edu]

 *Sent:* Wednesday, August 21, 2013 7:09 PM
 *To:* user@accumulo.apache.org
 *Subject:* Straggler problem in Accumulo BatchScans

 ** **

 Hey, I have a 7 node network running accumulo 1.4.1 and hadoop 1.0.4.

 ** **

 When I run large BatchScanner operations, the number of tablets scanned
 per node is not uniform, leading to the overloaded nodes taking much longer
 to finish than the others. For queries that require all of the scans to
 finish before returning, this is a major latency issue. What are some
 practical means of load-balancing this to reduce delay?

 ** **

 Is it possible for tablets to be hosted on multiple tablet servers, up to
 the replication factor of the underlying hdfs? Are there reasons this might
 be an undesirable design?

 ** **

 Thanks in advance,
 David 



RE: Straggler problem in Accumulo BatchScans

2013-08-21 Thread Slater, David M.
Thanks Eric,

Just to make sure I'm going in the right direction, this would involve 
extending the TabletBalancer class, correct? How do I add it to the table after 
that (and remove the old one)? I don't see it under the Connector's 
TableOperations().

Is using a load-balancer what you would recommend if I wanted to make sure that 
two different tables stored related information (e.g. data and indexes) on the 
same tablets?

Thanks,
David

From: Eric Newton [mailto:eric.new...@gmail.com]
Sent: Wednesday, August 21, 2013 8:03 PM
To: user@accumulo.apache.org
Subject: Re: Straggler problem in Accumulo BatchScans

A new balancer is a plug-in class that instructs the Master process where to 
place tablets.

If you know you need your tablets spread out over servers based on time (row 
id), you can do that.  It's pretty common, in fact.

-Eric

On Wed, Aug 21, 2013 at 7:54 PM, Slater, David M. 
david.sla...@jhuapl.edumailto:david.sla...@jhuapl.edu wrote:
Hi Dave,

The table is currently organizing netflow data with its rowID of 
timestamp_netflowRecordID, some columns corresponding to various netflow 
quantites, and one column representing the entire netflow in binary form.

The table is about 1.2 TB, and I am scanning 5-40 GB per scan, which scans 
about 7-28 tablets.

What do you mean by a custom load balancer? Do you mean balancing the data on 
ingest, or balancing the query load? What would you recommend for balancing the 
query load if I can only retrieve the data from a particular tablet?

I've played with index/data caches, though I haven't used readahead threads or 
max open files. Is that referring to rfiles?

I'm noticing that most of the queries are CPU bound, and that read i/o is not 
being hit very hard. Is that a typical behavior for scans?

Thanks,
David

From: Dave Marion [mailto:dlmar...@comcast.netmailto:dlmar...@comcast.net]
Sent: Wednesday, August 21, 2013 7:29 PM
To: user@accumulo.apache.orgmailto:user@accumulo.apache.org
Subject: RE: Straggler problem in Accumulo BatchScans

How is the table organized?
What percent of the table are you scanning in these large operations?
Have you considered writing a custom load balancer?

I don't think that a tablet can be hosted on multiple servers. But you might be 
able to play around with the index/data caches, readahead threads (concurrent 
queries), and max open files to achieve better performance.

From: Slater, David M. [mailto:david.sla...@jhuapl.edu]
Sent: Wednesday, August 21, 2013 7:09 PM
To: user@accumulo.apache.orgmailto:user@accumulo.apache.org
Subject: Straggler problem in Accumulo BatchScans

Hey, I have a 7 node network running accumulo 1.4.1 and hadoop 1.0.4.

When I run large BatchScanner operations, the number of tablets scanned per 
node is not uniform, leading to the overloaded nodes taking much longer to 
finish than the others. For queries that require all of the scans to finish 
before returning, this is a major latency issue. What are some practical means 
of load-balancing this to reduce delay?

Is it possible for tablets to be hosted on multiple tablet servers, up to the 
replication factor of the underlying hdfs? Are there reasons this might be an 
undesirable design?

Thanks in advance,
David



Re: Straggler problem in Accumulo BatchScans

2013-08-21 Thread dlmarion
You can set it in the shell on the table. Just override the default tablet 
balancer for the table. I think the master has to use the Table load balancer 
also if it is not set by default. 

- Original Message -

From: David M. Slater david.sla...@jhuapl.edu 
To: user@accumulo.apache.org 
Sent: Wednesday, August 21, 2013 8:12:46 PM 
Subject: RE: Straggler problem in Accumulo BatchScans 



Thanks Eric, 



Just to make sure I’m going in the right direction, this would involve 
extending the TabletBalancer class, correct? How do I add it to the table after 
that (and remove the old one)? I don’t see it under the Connector’s 
TableOperations(). 



Is using a load-balancer what you would recommend if I wanted to make sure that 
two different tables stored related information (e.g. data and indexes) on the 
same tablets? 



Thanks, 
David 



From: Eric Newton [mailto:eric.new...@gmail.com] 
Sent: Wednesday, August 21, 2013 8:03 PM 
To: user@accumulo.apache.org 
Subject: Re: Straggler problem in Accumulo BatchScans 




A new balancer is a plug-in class that instructs the Master process where to 
place tablets. 





If you know you need your tablets spread out over servers based on time (row 
id), you can do that. It's pretty common, in fact. 





-Eric 





On Wed, Aug 21, 2013 at 7:54 PM, Slater, David M.  david.sla...@jhuapl.edu  
wrote: 


Hi Dave, 



The table is currently organizing netflow data with its rowID of 
timestamp_netflowRecordID, some columns corresponding to various netflow 
quantites, and one column representing the entire netflow in binary form. 



The table is about 1.2 TB, and I am scanning 5-40 GB per scan, which scans 
about 7-28 tablets. 



What do you mean by a custom load balancer? Do you mean balancing the data on 
ingest, or balancing the query load? What would you recommend for balancing the 
query load if I can only retrieve the data from a particular tablet? 



I’ve played with index/data caches, though I haven’t used readahead threads or 
max open files. Is that referring to rfiles? 



I’m noticing that most of the queries are CPU bound, and that read i/o is not 
being hit very hard. Is that a typical behavior for scans? 



Thanks, 
David 




From: Dave Marion [mailto: dlmar...@comcast.net ] 


Sent: Wednesday, August 21, 2013 7:29 PM 
To: user@accumulo.apache.org 


Subject: RE: Straggler problem in Accumulo BatchScans 




How is the table organized? 

What percent of the table are you scanning in these large operations? 

Have you considered writing a custom load balancer? 



I don’t think that a tablet can be hosted on multiple servers. But you might be 
able to play around with the index/data caches, readahead threads (concurrent 
queries), and max open files to achieve better performance. 




From: Slater, David M. [ mailto:david.sla...@jhuapl.edu ] 
Sent: Wednesday, August 21, 2013 7:09 PM 
To: user@accumulo.apache.org 
Subject: Straggler problem in Accumulo BatchScans 




Hey, I have a 7 node network running accumulo 1.4.1 and hadoop 1.0.4. 



When I run large BatchScanner operations, the number of tablets scanned per 
node is not uniform, leading to the overloaded nodes taking much longer to 
finish than the others. For queries that require all of the scans to finish 
before returning, this is a major latency issue. What are some practical means 
of load-balancing this to reduce delay? 



Is it possible for tablets to be hosted on multiple tablet servers, up to the 
replication factor of the underlying hdfs? Are there reasons this might be an 
undesirable design? 



Thanks in advance, 
David 






Re: Filtering on column qualifier

2013-08-21 Thread John Vines
Nope, it's just a custom Filter, which is a type of iterator. You can
attach iterators to run on a scanner, you just need to make sure you deploy
the jar with the custom iterator to all of the tservers.


On Wed, Aug 21, 2013 at 7:58 PM, Slater, David M.
david.sla...@jhuapl.eduwrote:

 Would this require extending the BatchScanner as well to make use of the
 extended ColumnQualifierFilter, or could this be done with an unmodified
 BatchScanner?

 ** **

 *From:* John Vines [mailto:vi...@apache.org]
 *Sent:* Wednesday, August 21, 2013 10:49 AM
 *To:* user@accumulo.apache.org
 *Subject:* Re: Filtering on column qualifier

 ** **

 There's no way to extend the ColumnQualietyFilter via configuration, but
 it sounds like you are on top of it. You just need to extend the class,
 possibly copy a bit of code, and change the equality check to a compareTo
 after converting the Strings to Doubles.

 ** **

 On Wed, Aug 21, 2013 at 10:00 AM, Marc Reichman 
 mreich...@pixelforensics.com wrote:

 I have some data stored in Accumulo with some scores stored as column
 qualifiers (there was an older thread about this). I would like to find a
 way to do thresholding when retrieving the data without retrieving it all
 and then manually filtering out items below my threshold.

 ** **

 I know I can fetch column qualifiers which are exact.

 ** **

 I've seen the ColumnQualifierFilter, which I assume is what's in play when
 I fetch qualifiers. Is there a reasonable pattern to extend this and try to
 use it as a scan iterator so I can do things like greater than a value
 which will be interpreted as a Double vs. the string equality going on now?
 

 ** **

 Thanks,

 Marc

 ** **



du command giving null pointer exception [SEC=UNOFFICIAL]

2013-08-21 Thread Dickson, Matt MR
UNOFFICIAL

I'm trying to get the total disk space used by a table and am using the du 
tablename command but getting a NullPointerException, specifically

22 05:41:13,223 [shell:shell] ERROR: java.lang.RuntimeException: 
java.lang.NullPointerException

When run I'm in a table context on Accumulo 1.5.1.

Is this a bug or am I doing something wrong?

Thanks in advance,
Matt