Hi all,

Could somebody please shed some light on this?

If it is a limitation of HBase that I can access only the column value
with the latest timestamp in a row of an HBase table, then I will have to
think about a different schema where each user event entry goes into a
different row.

Also, could somebody let me know whether the setTimeRange method of Get or
Scan can be used to access all the column values under all timestamps of a
particular row?

Thanks,
Narayanan

On Mon, Jul 11, 2011 at 5:15 PM, Narayanan K <[email protected]> wrote:

> Hi,
>
> I'll make my doubt a little more illustrative.
>
> The flat feed file (just a sample scenario to make my point clear) would
> have the user events at various times of a day.
>
> Eg:
>   UserID  Time   Url    Views  Timespent
>   1       05:27  a.com  1      20
>   2       05:34  b.com  2      12
>   1       06:00  a.com  1      18
>   3       06:02  c.com  3      56
>   1       07:03  a.com  2      10
>
> So these data would be dumped into an HBase table with the date
> 2011-07-01 as rowkey and columns User:UserID, Http:Url, Metrics:Views
> and Metrics:Timespent.
> So the next day, the rowkey will be incremented and all feeds of that day
> will be loaded into the table with the new rowkey 2011-07-02, and so on.
>
> Now I need to sum up the Timespent or the Views for user "1" for url
> "a.com" for the day, say 2011-07-01 (just a sample scenario I am thinking
> of), which means I need to sum up all the Timespent for a particular
> userID for a particular Url present in the row 2011-07-01.
>
> A GET on this table for this rowkey will just give me the latest entry
> in the row. But I need to be able to scan through all values in a row and
> sum them up.
> The output should be like below:
>
> Output to different table:
>
>   Date           User  url    TotalViews  TotalTS
>   2011-07-01 ->  1     a.com  4           48
>                  2     b.com  2           12
>                  3     c.com  3           56
>
>   2011-07-02 ->  so on.....
>
> I hope this makes my doubt a little clearer.
>
> Thanks,
> Narayanan
>
>
> On Mon, Jul 11, 2011 at 2:59 PM, Srikanth P.
> Shreenivas <[email protected]> wrote:
>
>> Columns in a table are identified by column-family:column-name.
>>
>> A column-name is a byte array, and you can assign a dynamic value.
>> So, in this case, you can have a table with variable columns where each
>> column represents one web site, and the cell value is the count of
>> views the user has made of that page.
>>
>> Rowkey   <==================== Columns ====================>
>>
>> User1    pageviews:www.yahoo.com    pageviews:www.google.com
>>          10                         20
>>
>>
>> So, if you do
>>
>> hbase> get "useractivity", "userid1", {COLUMN=>'pageviews:www.yahoo.com'}
>>
>> then you should get the desired value.
>>
>>
>> This solution has some gotchas too, though. Keep in mind that you can
>> query either particular columns or all columns in a Get request. If the
>> number of columns is too large, you risk an out-of-memory error when
>> doing an all-columns get. If you are going to query by column name (a
>> specific web site in this case), then you should be okay with this design.
>>
>> Alternatively, you can define your row key to contain the web site name.
>> For example, you can have one row per user per website.
>> So, your row key will look like "userID1-com.yahoo.www" (it is typically
>> suggested to use reverse domain names; see
>> http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable).
>> The row key too is just a byte array, and it is up to you to decide
>> what it should consist of.
>>
>>
>> Regards,
>> Srikanth
>>
>>
>> -----Original Message-----
>> From: Narayanan K [mailto:[email protected]]
>> Sent: Monday, July 11, 2011 2:46 PM
>> To: [email protected]
>> Subject: Re: Fetching and iterating through all column values belonging to
>> all Timestamps of a Row
>>
>> Hi Srikanth,
>>
>> Yes. Versions will help me if I have a fixed number of versions.
>>
>> But in my case, I will not know the number of versions beforehand.
>> The table will be populated from feed files using a MapReduce program.
>> Once loaded, all these will go into the same column family:column. Then I
>> would want to count the number of times a particular URI was accessed by
>> userid1.
>> For this, I need to be able to scan through all the versions loaded in
>> that rowkey and do a counter increment.
>>
>> How is this possible if I do not know the number of versions that is
>> getting loaded into a table rowkey, as it is a dynamic property (each
>> feed file may have a different number of records)?
>>
>> Is the setTimeRange method of GET and SCAN meant to do this? If so, why
>> am I not getting all the column values for a particular rowkey?
>>
>> Regards,
>> Narayanan
>>
>>
>> On Mon, Jul 11, 2011 at 12:28 PM, Srikanth P. Shreenivas <
>> [email protected]> wrote:
>>
>> > Hi Narayanan,
>> >
>> > I think you need to create the table with versions enabled.
>> >
>> > For example, if you need to store 5 versions, you can use create like
>> > this:
>> >
>> > hbase> create 'useractivity', {NAME => 'pageviews', VERSIONS => 5}
>> >
>> > hbase> put 'useractivity', 'userid1', 'pageviews:uri', 'http://www.allaboutdata.net'
>> > hbase> put 'useractivity', 'userid1', 'pageviews:uri', 'http://www.yahoo.co.in'
>> >
>> > hbase> get "useractivity", "userid1", {COLUMN=>'pageviews',VERSIONS=>2}
>> > COLUMN           CELL
>> >  pageviews:uri   timestamp=1310367267049, value=http://www.yahoo.co.in
>> >  pageviews:uri   timestamp=1310367221129, value=http://www.allaboutdata.net
>> > 2 row(s) in 0.0440 seconds
>> >
>> >
>> > One thing you need to watch out for is that VERSIONS is defined on the
>> > column family, and hence you cannot change it once you have defined
>> > your column family. This will work if your application wishes to store
>> > only a fixed number of versions. If that is not the case, you need to
>> > rethink your table design and achieve this some other way.
>> >
>> > Regards,
>> > Srikanth
>> >
>> > -----Original Message-----
>> > From: Narayanan K [mailto:[email protected]]
>> > Sent: Monday, July 11, 2011 11:07 AM
>> > To: [email protected]
>> > Subject: Fetching and iterating through all column values belonging to
>> > all Timestamps of a Row
>> >
>> > Hi all,
>> >
>> > I am using Hadoop 0.20.1 and HBase 0.20.
>> >
>> > Currently, I am trying to retrieve and iterate through all the column
>> > values of a particular rowkey in an HBase table.
>> > But I am able to retrieve *only* the cell+value having the *latest
>> > timestamp*.
>> >
>> > Eg:
>> >
>> > hbase> create 'useractivity', 'pageviews'
>> > hbase> put 'useractivity', 'userid1', 'pageviews:uri', 'http://www.allaboutdata.net'
>> > hbase> put 'useractivity', 'userid1', 'pageviews:uri', 'http://www.yahoo.co.in'
>> >
>> > hbase> get 'useractivity', 'userid1'
>> >
>> > is fetching only the "http://www.yahoo.co.in" column value, as it has
>> > the latest timestamp.
>> >
>> > I wanted to view both the values in the column uri.
>> >
>> > I tried the same with the Java API, with Get as well as Scan. But both
>> > of them gave me the same result: only the column value that was
>> > inserted last.
>> > I also read through some old archives and found I could setTimeRange on
>> > Get/Scan, which is also not solving my problem.
>> >
>> > get.setTimeRange(0, Long.MAX_VALUE); as in:
>> >
>> > HTable table = new HTable(new HBaseConfiguration(), "useractivity");
>> > Get get = new Get(Bytes.toBytes("userid1"));
>> > get.addFamily(Bytes.toBytes("pageviews"));
>> > get.setTimeRange(0, Long.MAX_VALUE);
>> > Result result = table.get(get);
>> > byte[] value = result.getValue(Bytes.toBytes("pageviews"),
>> >     Bytes.toBytes("uri"));
>> >
>> > System.out.println(Bytes.toString(value));
>> >
>> > This is fetching me only the column value with the latest timestamp.
>> >
>> > I tried the same with the Scan API, but I get the same result.
>> >
>> > Could you please let me know how I can retrieve all column values of
>> > all timestamps of a particular rowkey?
>> >
>> > Many Thanks,
>> > Narayanan
>> >
>
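
For reference, the per-day summing described in the sample scenario above can be sketched in plain Java, independent of how the events are read back from HBase. This is only a rough sketch under stated assumptions: the class FeedSummary, the method summarize and the String[] event layout are hypothetical names made up for illustration, and the hard-coded array stands in for values fetched from the table. As far as I know, Get and Scan also offer setMaxVersions() to request more than the latest cell, but the number returned is still capped by the VERSIONS setting of the column family, so it does not remove the need to think about the schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of HBase): sums Views and Timespent
// per (user, url) pair for the events of one day.
class FeedSummary {

    /**
     * Each event is {userId, url, views, timespent}.
     * Returns a map keyed by "userId url" whose value is
     * {totalViews, totalTimespent}, in first-seen order.
     */
    static Map<String, long[]> summarize(String[][] events) {
        Map<String, long[]> totals = new LinkedHashMap<>();
        for (String[] e : events) {
            String key = e[0] + " " + e[1];
            long[] t = totals.computeIfAbsent(key, k -> new long[2]);
            t[0] += Long.parseLong(e[2]); // views
            t[1] += Long.parseLong(e[3]); // timespent
        }
        return totals;
    }

    public static void main(String[] args) {
        // Sample events for 2011-07-01, copied from the feed file example.
        String[][] day = {
            {"1", "a.com", "1", "20"},
            {"2", "b.com", "2", "12"},
            {"1", "a.com", "1", "18"},
            {"3", "c.com", "3", "56"},
            {"1", "a.com", "2", "10"},
        };
        for (Map.Entry<String, long[]> en : summarize(day).entrySet()) {
            long[] t = en.getValue();
            System.out.println("2011-07-01 " + en.getKey() + " " + t[0] + " " + t[1]);
        }
    }
}
```

With the sample events this prints one line per user and url, with user 1 / a.com coming out as 4 views and 48 timespent, matching the expected output table in the thread.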
