Thanks James, I will investigate Apache Phoenix and get back to group with updates.
Regards, Riyaz On Sat, May 24, 2014 at 4:15 AM, James Taylor <[email protected]>wrote: > Hi Riyaz, > You can do this with a single SQL command using Apache Phoenix, a SQL > engine on top of HBase, and you'll get better performance than if you hand > coded it using the HBase client APIs. Depending on your current schema, you > may be able to run this command with no change to your data. Let's assume > you have an MD5 hash of the website and the date/time in the row key with > your website and counts in key values. That schema could be modeled like > this in Phoenix: > > CREATE VIEW WEBSITE_STATS ( > WEBSITE_MD5 BINARY(16) NOT NULL, > DATE_COLLECTED UNSIGNED_DATE NOT NULL, > WEBSITE_URL VARCHAR, > HIT_COUNT UNSIGNED_LONG, > CONSTRAINT PK PRIMARY KEY (WEBSITE_MD5, DATE_COLLECTED)); > > You could issue this create view statement and map directly to your HBase > table. I used the UNSIGNED types above as they match the serialization you > get when you use the HBase Bytes utility methods. Phoenix normalizes column > names by upper casing them, so if your column qualifiers are lower case, > you'd want to put the column names above in double quotes. > > Next, you'd create a table to hold the top10 information: > > CREATE TABLE WEBSITE_TOP10 ( > WEBSITE_URL VARCHAR PRIMARY KEY, > TOTAL_HIT_COUNT BIGINT); > > Then you'd run an UPSERT SELECT command like this: > > UPSERT INTO WEBSITE_TOP10 > SELECT WEBSITE_URL, SUM(HIT_COUNT) FROM WEBSITE_STATS > GROUP BY WEBSITE_URL > ORDER BY SUM(HIT_COUNT) > LIMIT 10; > > Phoenix will run the SELECT part of this in parallel on the client and use > a coprocessor on the server side to aggregate over the WEBSITE_URLs > returning the distinct set of urls per region with a final merge sort > happening o the client to compute the total sum. Then the client will hold > on to the top 10 rows it sees and upsert these into the WEBSITE_TOP10 > table. > > HTH, > James > > > > > > > On Fri, May 23, 2014 at 5:14 PM, Ted Yu <[email protected]> wrote: > > > Would the new HBase table reside in the same cluster as the original > table > > ? > > > > See this recent thread: http://search-hadoop.com/m/DHED4uBNqJ1 > > > > Cheers > > > > > > On Fri, May 23, 2014 at 2:49 PM, Shaikh Ahmed <[email protected]> > > wrote: > > > > > Hi, > > > > > > We have one huge HBase table with billions of rows. This table holds > the > > > information about websites and number of hits on that site in every 15 > > > minutes. > > > > > > Every website will have multiple records in data with different number > of > > > hit count and last updated timestamp. > > > > > > Now, we want to create another Hbase table which will contain > information > > > about only those TOP 10 websites which are having more number of hits. > > > > > > We are seeking help from experts to achieve this requirement. > > > How we can filter top 10 websites based on hits count from billions of > > > records and copy it into our new HBase table? > > > > > > I will greatly appreciate kind support from group members. > > > > > > Thanks in advance. > > > Regards, > > > Riyaz > > > > > >
