Phoenix will parallelize within a region: SELECT count(1) FROM orders
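[Editor's note: the region-parallel aggregation James alludes to can be illustrated without a cluster. The sketch below is not Phoenix or HBase API — the class and method names are mine, and plain lists of row keys stand in for regions — it only shows the pattern: one counting task per region, partial counts merged at the client.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCount {

    // Each inner list stands in for the row keys held by one region.
    public static long countAllRegions(List<List<String>> regions) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, regions.size()));
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (List<String> region : regions) {
                // One "scan" task per region, running in parallel.
                partials.add(pool.submit(() -> (long) region.size()));
            }
            long total = 0;
            for (Future<Long> f : partials) {
                total += f.get();  // merge the partial counts
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<String>> regions = Arrays.asList(
                Arrays.asList("O-1", "O-2", "O-3"),
                Arrays.asList("O-4", "O-5"));
        System.out.println(countAllRegions(regions));  // prints 5
    }
}
```

With a single region (as in the table discussed below) there is only one task, so this kind of parallelism buys nothing — which is exactly Ted's point.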
I agree with Ted, though: even serially, 100,000 rows shouldn't take anywhere near 6 mins. You say > 100,000 rows. Can you tell us what it's < ?

Thanks,
James

On Apr 19, 2013, at 2:37 AM, "Ted Yu" <[email protected]> wrote:

> Since there is only one region in your table, using the aggregation
> coprocessor has no advantage.
> I think there may be some issue with your cluster - row count should finish
> within 6 minutes.
>
> Have you checked the server logs?
>
> Thanks
>
> On Apr 19, 2013, at 12:33 AM, Omkar Joshi <[email protected]> wrote:
>
>> Hi,
>>
>> I'm running a 2-node (VMs) Hadoop cluster, atop which HBase is running in
>> distributed mode.
>>
>> I have a table named ORDERS with >100000 rows.
>>
>> NOTE: Since my cluster is ultra-small, I didn't pre-split the table.
>>
>> ORDERS
>> rowkey        : ORDER_ID
>>
>> column family : ORDER_DETAILS
>> columns       : CUSTOMER_ID
>>                 PRODUCT_ID
>>                 REQUEST_DATE
>>                 PRODUCT_QUANTITY
>>                 PRICE
>>                 PAYMENT_MODE
>>
>> The Java client code to simply check the count of the records is:
>>
>> public long getTableCount(String tableName, String columnFamilyName) {
>>
>>     AggregationClient aggregationClient = new AggregationClient(config);
>>     Scan scan = new Scan();
>>     scan.addFamily(Bytes.toBytes(columnFamilyName));
>>     scan.setFilter(new FirstKeyOnlyFilter());
>>
>>     long rowCount = 0;
>>
>>     try {
>>         rowCount = aggregationClient.rowCount(Bytes.toBytes(tableName),
>>                 null, scan);
>>         System.out.println("No. of rows in " + tableName + " is "
>>                 + rowCount);
>>     } catch (Throwable e) {
>>         e.printStackTrace();
>>     }
>>
>>     return rowCount;
>> }
>>
>> It has been running for more than 6 minutes now :(
>>
>> What shall I do to speed up the execution to milliseconds (or at least a
>> couple of seconds)?
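[Editor's note: the FirstKeyOnlyFilter in the code above is the right call for counting — it tells each region server to return only the first cell of every row, so a wide row costs no more than a narrow one. A toy, in-memory sketch of what that saves (not the HBase API; a plain map stands in for the table, and all names are mine):]

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FirstKeyOnlyDemo {

    // Rows mapped to their column->value cells, standing in for an HBase table.
    // A FirstKeyOnlyFilter-style scan touches only one cell per row.
    public static long[] countRows(Map<String, Map<String, String>> table) {
        long rows = 0, cellsTouched = 0, cellsTotal = 0;
        for (Map<String, String> row : table.values()) {
            cellsTotal += row.size();
            if (!row.isEmpty()) {
                cellsTouched++;  // first key only; the remaining cells are skipped
                rows++;
            }
        }
        return new long[] { rows, cellsTouched, cellsTotal };
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> orders = new LinkedHashMap<>();
        Map<String, String> row = new LinkedHashMap<>();
        row.put("CUSTOMER_ID", "C1");
        row.put("PRODUCT_ID", "P9");
        row.put("PRICE", "10.5");
        orders.put("ORDER-1", row);
        orders.put("ORDER-2", new LinkedHashMap<>(row));
        long[] r = countRows(orders);
        System.out.println(r[0] + " rows, " + r[1] + " of " + r[2] + " cells touched");
        // prints: 2 rows, 2 of 6 cells touched
    }
}
```

The savings scale with row width; for the six-column ORDERS rows above, roughly five of every six cells never need to be materialized.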
>>
>> Regards,
>> Omkar Joshi
>>
>>
>> -----Original Message-----
>> From: Vedad Kirlic [mailto:[email protected]]
>> Sent: Thursday, April 18, 2013 12:22 AM
>> To: [email protected]
>> Subject: Re: Speeding up the row count
>>
>> Hi Omkar,
>>
>> If you are not interested in occurrences of a specific column (e.g. name,
>> email ...) and just want the total number of rows (regardless of their
>> content, i.e. columns), you should avoid adding any columns to the Scan.
>> In that case the coprocessor implementation behind AggregationClient will
>> add a FirstKeyOnlyFilter to the Scan to avoid loading unnecessary columns,
>> which should result in some speed-up.
>>
>> This is a similar approach to what the hbase shell 'count' implementation
>> does, although the reduction in overhead there is bigger, since data
>> transfer from region server to client (shell) is minimized, whereas with
>> the coprocessor the data never leaves the region server, so most of the
>> improvement should come from avoiding the loading of unnecessary files.
>> Not sure how this will apply to your particular case, given that the data
>> set per row seems to be rather small. Also, with AggregationClient you
>> will benefit if/when your tables span multiple regions. Essentially, the
>> performance of this approach will 'degrade' as your table gets bigger,
>> but only up to the point when it splits, from which point it should be
>> pretty constant. With this in mind, and given your type of data, you
>> might consider pre-splitting your tables.
>>
>> DISCLAIMER: this is mostly theoretical, since I'm not an expert in hbase
>> internals :), so your best bet is to try it - I'm too lazy to verify the
>> impact myself ;)
>>
>> Finally, if your case can tolerate eventual consistency between the
>> counters and the actual number of rows, you can, as already suggested,
>> have the RowCounter map reduce job run every once in a while, write the
>> counter(s) back to hbase, and read those when you need to obtain the
>> number of rows.
>>
>> Regards,
>> Vedad
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-hbase.679495.n3.nabble.com/Speeding-up-the-row-count-tp4042378p4042415.html
>> Sent from the HBase User mailing list archive at Nabble.com.
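[Editor's note: Vedad's last suggestion — a periodically refreshed counter that tolerates staleness — can be sketched without a cluster. Everything below is hypothetical: `refresh` stands in for running the RowCounter MapReduce job and writing its result back to a counter cell, and a plain list stands in for the table; reads then return the cached value instantly, possibly lagging the table until the next refresh.]

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class CachedRowCount {

    private final AtomicLong cached = new AtomicLong(0);

    // Stand-in for the RowCounter MapReduce job: a full scan of the table.
    private long fullScanCount(List<String> rowKeys) {
        return rowKeys.size();
    }

    // Run periodically (e.g. from a scheduled job); writes the fresh count back.
    public void refresh(List<String> rowKeys) {
        cached.set(fullScanCount(rowKeys));
    }

    // Cheap read; may lag behind the actual table until the next refresh.
    public long read() {
        return cached.get();
    }

    public static void main(String[] args) {
        CachedRowCount counter = new CachedRowCount();
        List<String> table = Arrays.asList("O-1", "O-2", "O-3");
        counter.refresh(table);
        System.out.println(counter.read());  // prints 3
    }
}
```

The trade-off is exactly the one Vedad names: reads cost milliseconds, at the price of the counter being only eventually consistent with the table.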
