[jira] [Commented] (HBASE-4435) Add Group By functionality using Coprocessors

2015-08-07 Thread nicu marasoiu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662049#comment-14662049
 ] 

nicu marasoiu commented on HBASE-4435:
--

Hi,

Is this still ongoing? I looked on GitHub, and it seems that only a single 
metric such as sum(column) is implemented, not multiple ones. The general case 
is of course group by (d1,..,dn) sum(c1) hyperlogUniq(c2), i.e. multiple metrics.

Thank you,
Nicu

 Add Group By functionality using Coprocessors
 -

 Key: HBASE-4435
 URL: https://issues.apache.org/jira/browse/HBASE-4435
 Project: HBase
  Issue Type: Improvement
  Components: Coprocessors
Reporter: Nichole Treadway
Priority: Minor
  Labels: by, coprocessors, group, hbase
 Attachments: HBASE-4435-v2.patch, HBase-4435.patch


 Adds Group By-like functionality to HBase, using the Coprocessor 
 framework. 
 It provides the ability to group the result set on one or more columns 
 (groupBy families). It computes statistics (max, min, sum, count, sum of 
 squares, number missing) for a second column, called the stats column. 
 To use, I've provided two implementations.
 1. In the first, you specify a single group-by column and a stats field:
   statsMap = gbc.getStats(tableName, scan, groupByFamily, 
 groupByQualifier, statsFamily, statsQualifier, statsFieldColumnInterpreter);
 The result is a map from the Group By column value (as a String) to a 
 GroupByStatsValues object. The GroupByStatsValues object holds the max, min, 
 sum, etc. of the stats column for that group.
 2. The second implementation allows you to specify a list of group-by columns 
 and a stats field. The List of group-by columns is expected to contain lists 
 of {column family, qualifier} pairs. 
   statsMap = gbc.getStats(tableName, scan, listOfGroupByColumns, 
 statsFamily, statsQualifier, statsFieldColumnInterpreter);
 The GroupByStatsValues code is adapted from the Solr Stats component.
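
A minimal sketch of the per-group statistics described above, in plain Java with no HBase dependency. The class GroupStatsSketch and its field and method names are hypothetical stand-ins, not the GroupByStatsValues class from the attached patches. The merge method is an assumption about how partial results from several regions could be combined.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a per-group statistics holder (names are assumptions).
class GroupStatsSketch {
    double min = Double.POSITIVE_INFINITY;
    double max = Double.NEGATIVE_INFINITY;
    double sum = 0.0;
    double sumOfSquares = 0.0;
    long count = 0;
    long missing = 0;

    // Fold one value of the stats column into this group; null means the column was absent.
    void accumulate(Double value) {
        if (value == null) {
            missing++;
            return;
        }
        min = Math.min(min, value);
        max = Math.max(max, value);
        sum += value;
        sumOfSquares += value * value;
        count++;
    }

    // Combine a partial result (for example, from another region) into this one.
    void merge(GroupStatsSketch other) {
        min = Math.min(min, other.min);
        max = Math.max(max, other.max);
        sum += other.sum;
        sumOfSquares += other.sumOfSquares;
        count += other.count;
        missing += other.missing;
    }

    // Group parallel arrays of group-by values and stats values into a map keyed by group.
    static Map<String, GroupStatsSketch> groupBy(String[] groupKeys, Double[] statsValues) {
        Map<String, GroupStatsSketch> result = new HashMap<>();
        for (int i = 0; i < groupKeys.length; i++) {
            result.computeIfAbsent(groupKeys[i], k -> new GroupStatsSketch())
                  .accumulate(statsValues[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, GroupStatsSketch> stats = groupBy(
            new String[] {"a", "b", "a", "a"},
            new Double[] {1.0, 5.0, 3.0, null});
        for (Map.Entry<String, GroupStatsSketch> e : stats.entrySet()) {
            GroupStatsSketch s = e.getValue();
            System.out.println(e.getKey() + ": count=" + s.count + " sum=" + s.sum
                + " min=" + s.min + " max=" + s.max + " missing=" + s.missing);
        }
    }
}
```

Because every field is either additive or a min/max, partial results merge associatively, which is what would let each region compute its own map before the client combines them.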



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-1512) Coprocessors: Support aggregate functions

2015-08-07 Thread nicu marasoiu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662054#comment-14662054
 ] 

nicu marasoiu commented on HBASE-1512:
--

Hi,

Do you know whether, for this issue or in general, there is a solution based 
on HBase coprocessors for:
1. multiple metric columns e.g. group by (d1,..,dn) sum(c1) sum(c2)
2. custom metric columns e.g. group by (d1,..,dn) sum(c1) hyperlogUniq(c2)
3. sharing the components with map-reduce to run the same query for larger 
inputs

Please advise,
Nicu
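
To make the question concrete, here is a minimal sketch of a multi-metric group by in plain Java. It is not an existing HBase or coprocessor API: the class MultiMetricGroupBySketch, the Metric interface, and all method names are hypothetical. The ExactDistinct metric uses a HashSet as an exact stand-in for an approximate unique counter such as hyperlogUniq; a real HyperLogLog sketch would replace it.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical illustration only; none of these names exist in HBase.
class MultiMetricGroupBySketch {
    // One pluggable metric over one column.
    interface Metric {
        void add(String value);
        Object result();
    }

    static class Sum implements Metric {
        private long total;
        public void add(String value) { total += Long.parseLong(value); }
        public Object result() { return Long.valueOf(total); }
    }

    // Exact distinct count; stands in for an approximate counter like HyperLogLog.
    static class ExactDistinct implements Metric {
        private final Set<String> seen = new HashSet<>();
        public void add(String value) { seen.add(value); }
        public Object result() { return Long.valueOf(seen.size()); }
    }

    // Groups rows on the dimension columns and applies metric i to column metricCols[i].
    static Map<List<String>, List<Object>> groupBy(
            List<String[]> rows, int[] dimCols, int[] metricCols,
            List<Supplier<Metric>> metricFactories) {
        Map<List<String>, List<Metric>> state = new LinkedHashMap<>();
        for (String[] row : rows) {
            List<String> key = new ArrayList<>();
            for (int d : dimCols) {
                key.add(row[d]);
            }
            List<Metric> metrics = state.get(key);
            if (metrics == null) {
                metrics = new ArrayList<>();
                for (Supplier<Metric> factory : metricFactories) {
                    metrics.add(factory.get());
                }
                state.put(key, metrics);
            }
            for (int i = 0; i < metricCols.length; i++) {
                metrics.get(i).add(row[metricCols[i]]);
            }
        }
        Map<List<String>, List<Object>> out = new LinkedHashMap<>();
        for (Map.Entry<List<String>, List<Metric>> e : state.entrySet()) {
            List<Object> results = new ArrayList<>();
            for (Metric m : e.getValue()) {
                results.add(m.result());
            }
            out.put(e.getKey(), results);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"us", "web", "10", "u1"});
        rows.add(new String[] {"us", "app", "7", "u2"});
        List<Supplier<Metric>> metrics = new ArrayList<>();
        metrics.add(Sum::new);            // sum(c1)
        metrics.add(ExactDistinct::new);  // stand-in for hyperlogUniq(c2)
        System.out.println(groupBy(rows, new int[] {0, 1}, new int[] {2, 3}, metrics));
    }
}
```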

 Coprocessors: Support aggregate functions
 -

 Key: HBASE-1512
 URL: https://issues.apache.org/jira/browse/HBASE-1512
 Project: HBase
  Issue Type: Sub-task
  Components: Coprocessors
Reporter: stack
Assignee: Himanshu Vashishtha
 Fix For: 0.92.0

 Attachments: 1512.zip, AggregateCpProtocol.java, 
 AggregateProtocolImpl.java, AggregationClient.java, ColumnInterpreter.java, 
 addendum_1512.txt, patch-1512-2.txt, patch-1512-3.txt, patch-1512-4.txt, 
 patch-1512-5.txt, patch-1512-6.txt, patch-1512-7.txt, patch-1512-8.txt, 
 patch-1512-9.txt, patch-1512.txt


 Chatting with jgray and holstad at the kitchen table about counts, sums, and 
 other aggregating facilities (generally, any facility that calculates some 
 meta info on your table), it seems like it wouldn't be too hard to make a 
 filter type that could run a function server-side and return ONLY the result 
 of the aggregation or whatever.
 For example, say you just want to count rows. Currently you scan, the server 
 returns all data to the client, and the client does the count by counting up 
 row keys.  A bunch of time and resources are wasted returning data that we're 
 not interested in.  With this new filter type, the counting would be done 
 server-side and the server would then make up a new result containing only the 
 count (kinda like mysql when you ask it to count, it returns a 'table' with a 
 count column whose value is the count of rows).  We could have it so the count 
 was just done per region and returned as such.  Or we could maybe make a small 
 change in the scanner too so that it aggregates the per-region counts.  
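
The per-region counting idea above can be sketched in plain Java as follows. This is a simulation under stated assumptions, not the filter or coprocessor code from the attached patches: the class RegionCountSketch and its methods are hypothetical, and each region is modelled as a simple list of row keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation: each inner list stands for the row keys held by one region.
class RegionCountSketch {
    // Stand-in for the server-side step: only the count leaves the "region server".
    static long countRegion(List<String> regionRowKeys) {
        return regionRowKeys.size();
    }

    // Stand-in for the client or scanner step: add up the per-region counts.
    static long countTable(List<List<String>> regions) {
        long total = 0;
        for (List<String> region : regions) {
            total += countRegion(region);
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<String>> regions = new ArrayList<>();
        regions.add(Arrays.asList("row-a", "row-b", "row-c"));
        regions.add(Arrays.asList("row-m"));
        regions.add(new ArrayList<String>());
        System.out.println("rows in table: " + countTable(regions));
    }
}
```

The point of the design is the data volume: one number per region crosses the wire instead of every row.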





[jira] [Commented] (HBASE-2939) Allow Client-Side Connection Pooling

2012-08-17 Thread nicu marasoiu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436951#comment-13436951
 ] 

nicu marasoiu commented on HBASE-2939:
--

Hi,

I am not sure why multiplexing does not work as well with multiple TCP 
connections. 
With one TCP connection, multiplexing pays off, peaking at 16 threads sharing 
the connection in my particular tests (remote client, 4kb batching of 2k 
puts).
With 16 TCP connections, however, the multiplexing benefit peaks at 2 threads 
sharing one TCP connection.
The relative benefit of sharing/multiplexing seems about the same when 2 
HTables share one socket. However, with hbase.client.ipc.pool.size=16, the 
benefit degrades rapidly when the multiplexing factor is above 2, and drops 
below the baseline when it is above 4.
I would like to understand the underlying reason. Perhaps there are locking and 
contention mechanisms preventing us from loading multiple connections in the 
same way we can multiplex a single one.

Here are my results with batched puts, for n HTable instances (threads) and m 
connections in the round-robin pool:

1 HTable: 1400 records in a fixed timeframe
2 HTables sharing the TCP socket: 1566 records in a fixed timeframe
8 HTables sharing the TCP socket: 6500 records
16 HTables sharing the TCP socket: 9200 records
32 HTables sharing the TCP socket: 6500 records
256 HTables sharing the TCP socket: 1340 records

16 HTables on 16 connections: 16753 records
32 HTables on 16 connections: 18661 records
64 HTables on 16 connections: 16800 records
128 HTables on 16 connections: 4300 records
256 HTables on 16 connections: 2434 records

By 16 HTables I mean 16 threads using an HTablePool.

Multiplexing performance appears to degrade much faster with multiple 
connections than with just one (the default, without pooling).
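
For readers comparing the two series, the multiplexing factor (threads per TCP connection) and the records written per HTable can be derived from the figures above with simple arithmetic. The sketch below only restates the reported numbers; the class MultiplexingSketch and its method names are hypothetical.

```java
// Hypothetical helper that derives ratios from the reported benchmark figures.
class MultiplexingSketch {
    // Threads (HTable instances) sharing each TCP connection.
    static int multiplexingFactor(int hTables, int connections) {
        return hTables / connections;
    }

    // Average records written per HTable within the fixed timeframe.
    static double recordsPerHTable(int records, int hTables) {
        return (double) records / hTables;
    }

    public static void main(String[] args) {
        // Each entry: {HTables, connections, records}, copied from the results above.
        int[][] runs = {
            {1, 1, 1400}, {2, 1, 1566}, {8, 1, 6500}, {16, 1, 9200},
            {32, 1, 6500}, {256, 1, 1340},
            {16, 16, 16753}, {32, 16, 18661}, {64, 16, 16800},
            {128, 16, 4300}, {256, 16, 2434}
        };
        for (int[] r : runs) {
            System.out.printf("%3d HTables, %2d connections: factor %3d, %.1f records per HTable%n",
                r[0], r[1], multiplexingFactor(r[0], r[1]), recordsPerHTable(r[2], r[0]));
        }
    }
}
```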

 Allow Client-Side Connection Pooling
 

 Key: HBASE-2939
 URL: https://issues.apache.org/jira/browse/HBASE-2939
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.89.20100621
Reporter: Karthick Sankarachary
Assignee: Karthick Sankarachary
Priority: Critical
 Fix For: 0.92.0

 Attachments: HBASE-2939-0.20.6.patch, HBASE-2939-LATEST.patch, 
 HBASE-2939.patch, HBASE-2939.patch, HBASE-2939.patch, HBASE-2939-V6.patch, 
 HBaseClient.java


 By design, the HBase RPC client multiplexes calls to a given region server 
 (or the master for that matter) over a single socket, access to which is 
 managed by a connection thread defined in the HBaseClient class. While this 
 approach may suffice for most cases, it tends to break down in the context of 
 a real-time, multi-threaded server, where latencies need to be lower and 
 throughputs higher. 
 In brief, the problem is that we dedicate one thread to handle all 
 client-side reads and writes for a given server, which in turn forces them to 
 share the same socket. As load increases, this is bound to serialize calls on 
 the client-side. In particular, when the rate at which calls are submitted to 
 the connection thread is greater than that at which the server responds, then 
 some of those calls will inevitably end up sitting idle, just waiting their 
 turn to go over the wire.
 In general, sharing sockets across multiple client threads is a good idea, 
 but limiting the number of such sockets to one may be overly restrictive for 
 certain cases. Here, we propose a way of defining multiple sockets per server 
 endpoint, access to which may be managed through either a load-balancing or 
 thread-local pool. To that end, we define the notion of a SharedMap, which 
 maps a key to a resource pool, and supports both of those pool types. 
 Specifically, we will apply that map in the HBaseClient, to associate 
 multiple connection threads with each server endpoint (denoted by a 
 connection id). 
  Currently, the SharedMap supports the following types of pools:
 * A ThreadLocalPool, which represents a pool that builds on the 
 ThreadLocal class. It essentially binds the resource to the thread from which 
 it is accessed.
 * A ReusablePool, which represents a pool that builds on the LinkedList 
 class. It essentially allows resources to be checked out, at which point it 
 is (temporarily) removed from the pool. When the resource is no longer 
 required, it should be returned to the pool in order to be reused.
 * A RoundRobinPool, which represents a pool that stores its resources in 
 an ArrayList. It load-balances access to its resources by returning a 
 different resource every time a given key is looked up.
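
The round-robin and reusable pools described in the list above can be sketched in plain Java as follows. These are simplified illustrations, not the classes from the attached patches: RoundRobinPoolSketch and ReusablePoolSketch, their method names, and the use of a Supplier to create resources are all assumptions. The thread-local variant is omitted because it maps directly onto java.lang.ThreadLocal.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical round-robin pool: fills up to maxSize, then rotates through its resources.
class RoundRobinPoolSketch<R> {
    private final List<R> resources = new ArrayList<>();
    private final int maxSize;
    private int next = 0;

    RoundRobinPoolSketch(int maxSize) {
        if (maxSize < 1) {
            throw new IllegalArgumentException("maxSize must be at least 1");
        }
        this.maxSize = maxSize;
    }

    // Creates a new resource while the pool is not full; afterwards returns them in turn.
    synchronized R get(Supplier<R> factory) {
        if (resources.size() < maxSize) {
            R created = factory.get();
            resources.add(created);
            return created;
        }
        R resource = resources.get(next);
        next = (next + 1) % resources.size();
        return resource;
    }
}

// Hypothetical reusable pool: a resource is removed on check-out and added back on return.
class ReusablePoolSketch<R> {
    private final LinkedList<R> idle = new LinkedList<>();

    // Returns an idle resource, or null when none is available.
    synchronized R checkOut() {
        return idle.poll();
    }

    synchronized void giveBack(R resource) {
        idle.add(resource);
    }

    synchronized int idleCount() {
        return idle.size();
    }
}
```

With a pool size of 2, for instance, the first two lookups create connections and every later lookup alternates between them, so callers spread across both sockets without having to coordinate.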
 To control the type and size of the connection pools, we give the user a 
 couple of parameters (viz. hbase.client.ipc.pool.type and 
 hbase.client.ipc.pool.size). In