Phoenix will parallelize within a region: SELECT count(1) FROM orders
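[Editor's note: the region-parallel aggregation James alludes to can be illustrated without a cluster. The sketch below is not Phoenix or HBase API — the class and method names are mine, and plain lists of row keys stand in for regions — it only shows the pattern: one counting task per region, partial counts merged at the client.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCount {

    // Each inner list stands in for the row keys held by one region.
    public static long countAllRegions(List<List<String>> regions) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, regions.size()));
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (List<String> region : regions) {
                // One "scan" task per region, running in parallel.
                partials.add(pool.submit(() -> (long) region.size()));
            }
            long total = 0;
            for (Future<Long> f : partials) {
                total += f.get();  // merge the partial counts
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<String>> regions = Arrays.asList(
                Arrays.asList("O-1", "O-2", "O-3"),
                Arrays.asList("O-4", "O-5"));
        System.out.println(countAllRegions(regions));  // prints 5
    }
}
```

With a single region (as in the table discussed below) there is only one task, so this kind of parallelism buys nothing — which is exactly Ted's point.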
I agree with Ted, though: even serially, 100,000 rows shouldn't take anywhere near 6 mins. You say > 100,000 rows. Can you tell us what it's < ?

Thanks,
James

On Apr 19, 2013, at 2:37 AM, "Ted Yu" <[email protected]> wrote:

> Since there is only one region in your table, using the aggregation
> coprocessor has no advantage.
> I think there may be some issue with your cluster - row count should finish
> within 6 minutes.
>
> Have you checked the server logs?
>
> Thanks
>
> On Apr 19, 2013, at 12:33 AM, Omkar Joshi <[email protected]> wrote:
>
>> Hi,
>>
>> I'm running a 2-node (VMs) Hadoop cluster, atop which HBase is running in
>> distributed mode.
>>
>> I have a table named ORDERS with >100000 rows.
>>
>> NOTE: Since my cluster is ultra-small, I didn't pre-split the table.
>>
>> ORDERS
>> rowkey        : ORDER_ID
>>
>> column family : ORDER_DETAILS
>> columns       : CUSTOMER_ID
>>                 PRODUCT_ID
>>                 REQUEST_DATE
>>                 PRODUCT_QUANTITY
>>                 PRICE
>>                 PAYMENT_MODE
>>
>> The Java client code to simply check the count of the records is:
>>
>> public long getTableCount(String tableName, String columnFamilyName) {
>>
>>     AggregationClient aggregationClient = new AggregationClient(config);
>>     Scan scan = new Scan();
>>     scan.addFamily(Bytes.toBytes(columnFamilyName));
>>     scan.setFilter(new FirstKeyOnlyFilter());
>>
>>     long rowCount = 0;
>>
>>     try {
>>         rowCount = aggregationClient.rowCount(Bytes.toBytes(tableName),
>>                 null, scan);
>>         System.out.println("No. of rows in " + tableName + " is "
>>                 + rowCount);
>>     } catch (Throwable e) {
>>         e.printStackTrace();
>>     }
>>
>>     return rowCount;
>> }
>>
>> It has been running for more than 6 minutes now :(
>>
>> What shall I do to speed up the execution to milliseconds (or at least a
>> couple of seconds)?
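[Editor's note: the FirstKeyOnlyFilter in the code above is the right call for counting — it tells each region server to return only the first cell of every row, so a wide row costs no more than a narrow one. A toy, in-memory sketch of what that saves (not the HBase API; a plain map stands in for the table, and all names are mine):]

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FirstKeyOnlyDemo {

    // Rows mapped to their column->value cells, standing in for an HBase table.
    // A FirstKeyOnlyFilter-style scan touches only one cell per row.
    public static long[] countRows(Map<String, Map<String, String>> table) {
        long rows = 0, cellsTouched = 0, cellsTotal = 0;
        for (Map<String, String> row : table.values()) {
            cellsTotal += row.size();
            if (!row.isEmpty()) {
                cellsTouched++;  // first key only; the remaining cells are skipped
                rows++;
            }
        }
        return new long[] { rows, cellsTouched, cellsTotal };
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> orders = new LinkedHashMap<>();
        Map<String, String> row = new LinkedHashMap<>();
        row.put("CUSTOMER_ID", "C1");
        row.put("PRODUCT_ID", "P9");
        row.put("PRICE", "10.5");
        orders.put("ORDER-1", row);
        orders.put("ORDER-2", new LinkedHashMap<>(row));
        long[] r = countRows(orders);
        System.out.println(r[0] + " rows, " + r[1] + " of " + r[2] + " cells touched");
        // prints: 2 rows, 2 of 6 cells touched
    }
}
```

The savings scale with row width; for the six-column ORDERS rows above, roughly five of every six cells never need to be materialized.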
>>
>> Regards,
>> Omkar Joshi
>>
>>
>> -----Original Message-----
>> From: Vedad Kirlic [mailto:[email protected]]
>> Sent: Thursday, April 18, 2013 12:22 AM
>> To: [email protected]
>> Subject: Re: Speeding up the row count
>>
>> Hi Omkar,
>>
>> If you are not interested in occurrences of a specific column (e.g. name,
>> email ...) and just want the total number of rows (regardless of their
>> content, i.e. columns), you should avoid adding any columns to the Scan.
>> In that case the coprocessor implementation behind AggregationClient will
>> add a FirstKeyOnlyFilter to the Scan to avoid loading unnecessary columns,
>> which should result in some speed-up.
>>
>> This is a similar approach to what the hbase shell 'count' implementation
>> does, although the reduction in overhead there is bigger, since data
>> transfer from region server to client (shell) is minimized, whereas with
>> the coprocessor the data never leaves the region server, so most of the
>> improvement should come from avoiding the loading of unnecessary files.
>> Not sure how this will apply to your particular case, given that the data
>> set per row seems to be rather small. Also, with AggregationClient you
>> will benefit if/when your tables span multiple regions. Essentially, the
>> performance of this approach will 'degrade' as your table gets bigger,
>> but only up to the point when it splits, from which point it should be
>> pretty constant. With this in mind, and given your type of data, you
>> might consider pre-splitting your tables.
>>
>> DISCLAIMER: this is mostly theoretical, since I'm not an expert in hbase
>> internals :), so your best bet is to try it - I'm too lazy to verify the
>> impact myself ;)
>>
>> Finally, if your case can tolerate eventual consistency between the
>> counters and the actual number of rows, you can, as already suggested,
>> have the RowCounter map reduce job run every once in a while, write the
>> counter(s) back to hbase, and read those when you need to obtain the
>> number of rows.
>>
>> Regards,
>> Vedad
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-hbase.679495.n3.nabble.com/Speeding-up-the-row-count-tp4042378p4042415.html
>> Sent from the HBase User mailing list archive at Nabble.com.
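[Editor's note: Vedad's last suggestion — a periodically refreshed counter that tolerates staleness — can be sketched without a cluster. Everything below is hypothetical: `refresh` stands in for running the RowCounter MapReduce job and writing its result back to a counter cell, and a plain list stands in for the table; reads then return the cached value instantly, possibly lagging the table until the next refresh.]

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class CachedRowCount {

    private final AtomicLong cached = new AtomicLong(0);

    // Stand-in for the RowCounter MapReduce job: a full scan of the table.
    private long fullScanCount(List<String> rowKeys) {
        return rowKeys.size();
    }

    // Run periodically (e.g. from a scheduled job); writes the fresh count back.
    public void refresh(List<String> rowKeys) {
        cached.set(fullScanCount(rowKeys));
    }

    // Cheap read; may lag behind the actual table until the next refresh.
    public long read() {
        return cached.get();
    }

    public static void main(String[] args) {
        CachedRowCount counter = new CachedRowCount();
        List<String> table = Arrays.asList("O-1", "O-2", "O-3");
        counter.refresh(table);
        System.out.println(counter.read());  // prints 3
    }
}
```

The trade-off is exactly the one Vedad names: reads cost milliseconds, at the price of the counter being only eventually consistent with the table.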
