Hi Narayanan,

If you think of an HBase table as a 2D table, then each cell is identified by a
row key and a column name (column family + column qualifier).
A third dimension is added by versions, which allow you to keep multiple copies
of a cell (up to a fixed maximum number).

Each cell has an associated timestamp.  You can specify a time range in Get or
Scan queries to filter cells written within a certain period.  By default you
will get only the latest version, even if the time range is wide.  If you want
more versions, you need to call setMaxVersions on the Scan or Get.
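
To make the interaction between the time range and the number of versions
concrete, here is a minimal in-memory sketch in plain Java.  It is not the
HBase client API: the class VersionedCellSketch and its read method are
hypothetical names, and the half-open time range (minimum inclusive, maximum
exclusive) is my understanding of how HBase treats time ranges rather than
something stated in this thread.  The timestamps and values are the ones from
the shell output quoted below.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory stand-in for a single versioned cell; NOT the HBase client API.
class VersionedCellSketch {
    // timestamp -> value, kept newest first.
    private final NavigableMap<Long, String> versions =
            new TreeMap<>(Collections.reverseOrder());

    void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    // Returns at most maxVersions values whose timestamp satisfies
    // minStamp <= timestamp < maxStamp (assumed half-open), newest first.
    List<String> read(long minStamp, long maxStamp, int maxVersions) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Long, String> e : versions.entrySet()) {
            if (out.size() >= maxVersions) {
                break;
            }
            long ts = e.getKey();
            if (ts >= minStamp && ts < maxStamp) {
                out.add(e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        VersionedCellSketch uri = new VersionedCellSketch();
        uri.put(1310367221129L, "http://www.allaboutdata.net");
        uri.put(1310367267049L, "http://www.yahoo.co.in");

        // Wide time range but only one version requested: newest value only.
        System.out.println(uri.read(0L, Long.MAX_VALUE, 1));
        // Same time range with all versions requested: both values.
        System.out.println(uri.read(0L, Long.MAX_VALUE, Integer.MAX_VALUE));
    }
}
```

The first read mirrors a Get with only a time range set, and the second
mirrors the same Get after also asking for more versions.  The sketch does not
model the VERSIONS limit of the column family, which caps how many versions
are stored in the first place.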

Given your requirement, you need to look at alternate table designs.  Maybe
use a row key like "DATE:USERID" so that you can fetch data for a given date
and a given user.  If you know the list of all users, then you can easily query
the rows for each user for a given date by making multiple GET calls, one per
user.
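
As a rough illustration of the "DATE:USERID" idea, the following plain Java
sketch keeps rows in a sorted map and sums Views and Timespent for one user
and one URL on one date, using the sample feed from your earlier mail.  It is
only an in-memory model, not HBase client code.  The names DateUserKeySketch,
Event, rowKey and totals are hypothetical, and storing several events under
one row assumes each event gets its own column qualifier (for example the
event time), which is a design choice I am assuming rather than one settled in
this thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for a table keyed by "DATE:USERID"; NOT the HBase client API.
class DateUserKeySketch {
    // One feed record. In a real table each event would need its own
    // column qualifier; that layout is an assumption of this sketch.
    static final class Event {
        final String url;
        final int views;
        final int timespent;

        Event(String url, int views, int timespent) {
            this.url = url;
            this.views = views;
            this.timespent = timespent;
        }
    }

    // Rows sorted by key, the way HBase orders rows.
    private final Map<String, List<Event>> rows = new TreeMap<>();

    static String rowKey(String date, String userId) {
        return date + ":" + userId;
    }

    void put(String date, String userId, Event e) {
        rows.computeIfAbsent(rowKey(date, userId), k -> new ArrayList<>()).add(e);
    }

    // Stand-in for one GET per (date, user): {totalViews, totalTimespent} for one URL.
    int[] totals(String date, String userId, String url) {
        int views = 0;
        int timespent = 0;
        List<Event> row = rows.getOrDefault(rowKey(date, userId), Collections.emptyList());
        for (Event e : row) {
            if (e.url.equals(url)) {
                views += e.views;
                timespent += e.timespent;
            }
        }
        return new int[] {views, timespent};
    }

    // Loads the five sample records for 2011-07-01 from the feed example.
    static DateUserKeySketch sample() {
        DateUserKeySketch t = new DateUserKeySketch();
        t.put("2011-07-01", "1", new Event("a.com", 1, 20));
        t.put("2011-07-01", "2", new Event("b.com", 2, 12));
        t.put("2011-07-01", "1", new Event("a.com", 1, 18));
        t.put("2011-07-01", "3", new Event("c.com", 3, 56));
        t.put("2011-07-01", "1", new Event("a.com", 2, 10));
        return t;
    }

    public static void main(String[] args) {
        int[] u1 = sample().totals("2011-07-01", "1", "a.com");
        System.out.println("views=" + u1[0] + " timespent=" + u1[1]);
    }
}
```

With the sample data, user 1 on a.com comes out as 4 views and 48 time spent,
which matches the totals in the output table you described.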

Details of client API: 
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html 
Some advanced topics here: 
http://ofps.oreilly.com/titles/9781449396107/advanced.html 

Regards,
Srikanth

-----Original Message-----
From: Narayanan K [mailto:[email protected]] 
Sent: Monday, July 11, 2011 10:09 PM
To: [email protected]
Subject: Re: Fetching and iterating through all column values belonging to all 
Timestamps of a Row

Hi all,

Could somebody please throw some light on this?

If it is a limitation in HBase that I can access only the column value
having the latest timestamp in a row of an HBase table, then I
will have to think about a different schema where each user event entry will
need to go into a different row.

Also, could somebody let me know if the setTimeRange method of Get or Scan can
be used to access all the column values falling under all timestamps of a
particular row?

Thanks,
Narayanan

On Mon, Jul 11, 2011 at 5:15 PM, Narayanan K <[email protected]> wrote:

> Hi,
>
> I'll make my doubt a little more illustrative.
>
> The flat feed file ( just a sample scenario to make my point clear) would
> have the user events at various times of a day.
>
> Eg:    UserID     Time      Url        Views     Timespent
>        1          05:27     a.com      1         20
>        2          05:34     b.com      2         12
>        1          06:00     a.com      1         18
>        3          06:02     c.com      3         56
>        1          07:03     a.com      2         10
>
> So this data would be loaded into an HBase table with the date *2011-07-01*
> as the rowkey and *User:UserID*, *Http:Url*, *Metrics:Views* and
> *Metrics:Timespent* as the columns.
> The next day, the rowkey will be incremented and all feeds of that day
> will be loaded into the table with the new rowkey *2011-07-02*, and so on.
>
> Now I need to sum up the Timespent or the Views for user "1" for url
> "a.com" for a day, say *2011-07-01* (just a sample scenario I am thinking
> of), which means I need to sum up all the Timespent values for a particular
> userID for a particular Url present in the row *2011-07-01*.
>
> A GET on this table for this rowkey will just give me the latest entry
> in the row. But I need to be able to scan through all values in a row and
> sum them up.
> The output should be like below:
>
> Output to different table:
>
> Date          ->   User   Url      TotalViews   TotalTS
> 2011-07-01    ->   1      a.com    4            48
>                    2      b.com    2            12
>                    3      c.com    3            56
>
> 2011-07-02   ->  so on.....
>
> I hope this makes my doubt a little clearer.
>
> Thanks,
> Narayanan
>
>
> On Mon, Jul 11, 2011 at 2:59 PM, Srikanth P. Shreenivas <
> [email protected]> wrote:
>
>> Columns in a table are identified by column-family:column-name.
>>
>> A column-name is a byte array, and you can assign a dynamic value.
>> So, in this case, you can have a table with variable columns where each
>> column represents one web site, and the cell value is the count of
>> views the user has made for that page.
>>
>> Rowkey   <==================== Columns ====================>
>>
>> User1    pageviews:www.yahoo.com     pageviews:www.google.com
>>          10                          20
>>
>>
>> So, if you do
>>
>> hbase> get "useractivity", "userid1", {COLUMN=>'pageviews:www.yahoo.com'}
>>
>> then, you should get desired value.
>>
>>
>> This solution has some gotchas too, though.  Keep in mind that you can
>> query either particular columns or all columns in a Get request.  If the
>> number of columns is too large, you risk an out-of-memory error when doing
>> an all-columns get.  If you are going to query by column name (a specific
>> web site in this case), then you should be okay with this design.
>>
>> Alternatively, you can define your row key to contain the web site name.
>> For example, you can have one row per user per website.
>> So, your row key will look like "userID1-com.yahoo.www" (it is typically
>> suggested to use reverse domain names; see
>> http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable).
>> The row key, too, is just a byte array, and it is up to you to decide
>> what it should consist of.
>>
>>
>> Regards,
>> Srikanth
>>
>>
>> -----Original Message-----
>> From: Narayanan K [mailto:[email protected]]
>> Sent: Monday, July 11, 2011 2:46 PM
>> To: [email protected]
>> Subject: Re: Fetching and iterating through all column values belonging to
>> all Timestamps of a Row
>>
>> Hi Srikanth,
>>
>> Yes. Versions will help me if I have a fixed number of versions.
>>
>> But in my case, I will not know the number of versions beforehand. The table
>> will be populated from feed files using a mapreduce program.
>> Once loaded, all these will go into the same column family:column. Then I
>> would want to count the number of times a particular URI was accessed by
>> userid1.
>> For this, I need to be able to scan through all the versions loaded in that
>> rowkey and do a counter increment.
>>
>> How is this possible if I do not know the number of versions that is getting
>> loaded into a table rowkey, as it is a dynamic property (each feed file may
>> have a different number of records)?
>>
>> Is the setTimeRange method of GET and SCAN meant to do this? If so, why am I
>> not getting all the column values for a particular rowkey?
>>
>> Regards,
>> Narayanan
>>
>>
>> On Mon, Jul 11, 2011 at 12:28 PM, Srikanth P. Shreenivas <
>> [email protected]> wrote:
>>
>> > Hi Narayanan,
>> >
>> > I think you need to create the table with versions enabled.
>> >
>> > For example, if you need to store 5 versions, you can use create like this:
>> >
>> > hbase> create 'useractivity', {NAME => 'pageviews', VERSIONS => 5}
>> >
>> > hbase> put 'useractivity', 'userid1', 'pageviews:uri', 'http://www.allaboutdata.net'
>> > hbase> put 'useractivity', 'userid1', 'pageviews:uri', 'http://www.yahoo.co.in'
>> >
>> > hbase> get "useractivity", "userid1", {COLUMN=>'pageviews',VERSIONS=>2}
>> > COLUMN                                        CELL
>> >  pageviews:uri                                timestamp=1310367267049, value=http://www.yahoo.co.in
>> >  pageviews:uri                                timestamp=1310367221129, value=http://www.allaboutdata.net
>> > 2 row(s) in 0.0440 seconds
>> >
>> >
>> > One thing you need to watch out for is that VERSIONS is defined on the
>> > column family, so changing it later means altering the column family
>> > definition.  This will work if your application wishes to store only a
>> > fixed number of versions.  If that is not the case, you need to rethink
>> > your table design and achieve this some other way.
>> >
>> > Regards,
>> > Srikanth
>> >
>> > -----Original Message-----
>> > From: Narayanan K [mailto:[email protected]]
>> > Sent: Monday, July 11, 2011 11:07 AM
>> > To: [email protected]
>> > Subject: Fetching and iterating through all column values belonging to
>> all
>> > Timestamps of a Row
>> >
>> > Hi all,
>> >
>> > I am using Hadoop 0.20.1 and HBase 0.20.
>> >
>> > Currently, I am trying to retrieve and iterate through all the column
>> > values of a particular rowkey in an HBase table.
>> > But I am able to retrieve *only* the cell+value having the *latest
>> > timestamp*.
>> >
>> > Eg:
>> >
>> > hbase> create 'useractivity', 'pageviews'
>> > hbase> put 'useractivity', 'userid1', 'pageviews:uri', 'http://www.allaboutdata.net'
>> > hbase> put 'useractivity', 'userid1', 'pageviews:uri', 'http://www.yahoo.co.in'
>> >
>> > hbase> get 'useractivity', 'userid1'
>> > is fetching only the "http://www.yahoo.co.in" column value, as it has the
>> > latest timestamp.
>> >
>> > I wanted to view both the values in the column *uri*.
>> >
>> > I tried the same with the Java API, with Get as well as Scan. But both of
>> > them gave me the same result: only the value that was inserted last.
>> > I also read through some old archives and found I could call setTimeRange
>> > on Get/Scan, but that is also not solving my problem.
>> >
>> > get.setTimeRange(0, Long.MAX_VALUE); as in:
>> >
>> >  HTable table = new HTable(new HBaseConfiguration(), "useractivity");
>> >  Get get = new Get(Bytes.toBytes("userid1"));
>> >  get.addFamily(Bytes.toBytes("pageviews"));
>> >  get.setTimeRange(0, Long.MAX_VALUE);
>> >  Result result = table.get(get);
>> >  byte[] value = result.getValue(Bytes.toBytes("pageviews"), Bytes.toBytes("uri"));
>> >
>> >  System.out.println(Bytes.toString(value));
>> >
>> > This fetches only the column value with the latest timestamp.
>> >
>> > I tried the same with the Scan API, but I get the same result.
>> >
>> > *Could you please let me know how I can retrieve all column values of all
>> > timestamps of a particular rowkey?*
>> >
>> > Many Thanks,
>> > Narayanan
>> >
>> >
>>
>
>
