Since there is only one region in your table, using the aggregation coprocessor has no advantage. I think there may be some issue with your cluster - a row count over ~100,000 rows should finish in well under 6 minutes.
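For a single-region table, a plain client-side scan should be roughly as fast as the coprocessor path. A minimal sketch (untested against your setup - assumes the 0.94-era client API and a reachable cluster; combining FirstKeyOnlyFilter with KeyOnlyFilter keeps the data crossing the wire close to zero):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;

public class PlainRowCount {
    public static long count(String tableName) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, tableName);
        Scan scan = new Scan();
        // First KV of each row only, with values stripped, so almost
        // nothing is shipped to the client.
        scan.setFilter(new FilterList(new FirstKeyOnlyFilter(),
                new KeyOnlyFilter()));
        scan.setCaching(1000); // fetch many rows per RPC round trip
        long rows = 0;
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result r : scanner) {
                rows++;
            }
        } finally {
            scanner.close();
            table.close();
        }
        return rows;
    }
}
```

If this also takes minutes, the bottleneck is the cluster, not the counting method.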
Have you checked server logs? Thanks

On Apr 19, 2013, at 12:33 AM, Omkar Joshi <[email protected]> wrote:

> Hi,
>
> I'm having a 2-node (VMs) Hadoop cluster atop which HBase is running in
> distributed mode.
>
> I'm having a table named ORDERS with >100000 rows.
>
> NOTE: Since my cluster is ultra-small, I didn't pre-split the table.
>
> ORDERS
> rowkey : ORDER_ID
>
> column family : ORDER_DETAILS
> columns : CUSTOMER_ID
>           PRODUCT_ID
>           REQUEST_DATE
>           PRODUCT_QUANTITY
>           PRICE
>           PAYMENT_MODE
>
> The java client code to simply check the count of the records is:
>
> public long getTableCount(String tableName, String columnFamilyName) {
>
>     AggregationClient aggregationClient = new AggregationClient(config);
>     Scan scan = new Scan();
>     scan.addFamily(Bytes.toBytes(columnFamilyName));
>     scan.setFilter(new FirstKeyOnlyFilter());
>
>     long rowCount = 0;
>
>     try {
>         rowCount = aggregationClient.rowCount(Bytes.toBytes(tableName),
>                 null, scan);
>         System.out.println("No. of rows in " + tableName + " is "
>                 + rowCount);
>     } catch (Throwable e) {
>         e.printStackTrace();
>     }
>
>     return rowCount;
> }
>
> It is running for more than 6 minutes now :(
>
> What shall I do to speed up the execution to milliseconds (at least a couple
> of seconds)?
>
> Regards,
> Omkar Joshi
>
>
> -----Original Message-----
> From: Vedad Kirlic [mailto:[email protected]]
> Sent: Thursday, April 18, 2013 12:22 AM
> To: [email protected]
> Subject: Re: Speeding up the row count
>
> Hi Omkar,
>
> If you are not interested in occurrences of a specific column (e.g. name,
> email ...), and just want the total number of rows regardless of their
> content (i.e. columns), you should avoid adding any columns to the Scan. In
> that case the coprocessor implementation behind AggregationClient will add a
> FirstKeyOnlyFilter to the Scan to avoid loading unnecessary columns, so this
> should result in some speed-up.
>
> This is a similar approach to what the hbase shell 'count' implementation
> does, although the reduction in overhead in that case is bigger, since data
> transfer from region server to client (shell) is minimized, whereas in the
> case of the coprocessor, data does not leave the region server, so most of
> the improvement should come from avoiding loading of unnecessary files. Not
> sure how this will apply to your particular case, given that the data set
> per row seems to be rather small. Also, in the case of AggregationClient you
> will benefit if/when your tables span multiple regions. Essentially, the
> performance of this approach will 'degrade' as your table gets bigger, but
> only up to the point when it splits, from which point it should be pretty
> constant. Having this in mind, and your type of data, you might consider
> pre-splitting your tables.
>
> DISCLAIMER: this is mostly theoretical, since I'm not an expert in hbase
> internals :), so your best bet is to try it - I'm too lazy to verify the
> impact myself ;)
>
> Finally, if your case can tolerate eventual consistency of the counters
> with the actual number of rows, you can, as already suggested, have the
> RowCounter map reduce job run every once in a while, write the counter(s)
> back to hbase, and read those when you need to obtain the number of rows.
>
> Regards,
> Vedad
>
>
>
> --
> View this message in context:
> http://apache-hbase.679495.n3.nabble.com/Speeding-up-the-row-count-tp4042378p4042415.html
> Sent from the HBase User mailing list archive at Nabble.com.
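Regarding Vedad's pre-splitting suggestion: if the table grows, pre-splitting at creation time would look roughly like this (a hedged sketch against the 0.94-era admin API; the split keys below are made-up examples - pick boundaries matching the real distribution of your ORDER_ID keys):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitOrders {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor("ORDERS");
        desc.addFamily(new HColumnDescriptor("ORDER_DETAILS"));
        // Hypothetical split points - each key becomes a region boundary,
        // so the coprocessor's rowCount runs on 4 regions in parallel
        // instead of 1.
        byte[][] splits = {
            Bytes.toBytes("ORDER_00025000"),
            Bytes.toBytes("ORDER_00050000"),
            Bytes.toBytes("ORDER_00075000")
        };
        admin.createTable(desc, splits);
        admin.close();
    }
}
```

For a 2-VM cluster with ~100k small rows, though, fixing whatever is stalling the scan matters far more than splitting.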
