Hi,
Let me illustrate my question with an example.
The flat feed file (just a sample scenario to make my point clear) would
contain user events recorded at various times of the day.
Eg:
UserID  Time   Url    Views  Timespent
1       05:27  a.com  1      20
2       05:34  b.com  2      12
1       06:00  a.com  1      18
3       06:02  c.com  3      56
1       07:03  a.com  2      10
This data would be loaded into an HBase table with the date as the rowkey
(*2011-07-01*) and the columns *User:UserID*, *Http:Url*, *Metrics:Views*
and *Metrics:Timespent*.
The next day, all feeds of that day will be loaded into the table under a
new rowkey, *2011-07-02*, and so on.
Now I need to sum up the Timespent or the Views for user "1" and url
"a.com" for a given day, say *2011-07-01* (just a sample scenario). That
is, I need to sum all the Timespent values for a particular UserID and a
particular Url stored in the row *2011-07-01*.
A GET on this table for this rowkey gives me only the latest entry written
to the row, but I need to be able to scan through all the values in the row
and sum them up.
The output, written to a different table, should look like this:
Date           User  Url    TotalViews  TotalTS
2011-07-01 ->  1     a.com  4           48
               2     b.com  2           12
               3     c.com  3           56
2011-07-02 ->  and so on...
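
To make the expected totals concrete, here is a minimal sketch of the
aggregation itself in plain Java (assuming Java 8 or later). It does not
touch HBase at all: the class DailyTotals, its Totals holder and the
"user url" map key are names made up for this illustration, and the events
are simply the sample feed rows from above held in memory.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of any HBase API: sums Views and Timespent
// per (user, url) pair for the events of a single day.
class DailyTotals {

    // Running totals for one (user, url) pair.
    static final class Totals {
        long views;
        long timespent;
    }

    // Each event is {userId, time, url, views, timespent}, as in the sample feed.
    // The map key is "userId url"; insertion order is kept for readable output.
    static Map<String, Totals> aggregate(String[][] events) {
        Map<String, Totals> byUserUrl = new LinkedHashMap<>();
        for (String[] event : events) {
            String key = event[0] + " " + event[2];
            Totals totals = byUserUrl.computeIfAbsent(key, k -> new Totals());
            totals.views += Long.parseLong(event[3]);
            totals.timespent += Long.parseLong(event[4]);
        }
        return byUserUrl;
    }

    public static void main(String[] args) {
        String[][] events = {
            {"1", "05:27", "a.com", "1", "20"},
            {"2", "05:34", "b.com", "2", "12"},
            {"1", "06:00", "a.com", "1", "18"},
            {"3", "06:02", "c.com", "3", "56"},
            {"1", "07:03", "a.com", "2", "10"},
        };
        for (Map.Entry<String, Totals> entry : aggregate(events).entrySet()) {
            Totals t = entry.getValue();
            System.out.println("2011-07-01 -> " + entry.getKey()
                    + " " + t.views + " " + t.timespent);
        }
    }
}
```

The same per-key summing would presumably sit in the reduce step of the
MapReduce job; the open question is only how to get every event of the day
out of the table, not how to add them up.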
I hope this makes my question a little clearer.
Thanks,
Narayanan
On Mon, Jul 11, 2011 at 2:59 PM, Srikanth P. Shreenivas <
[email protected]> wrote:
> Columns in a table are identified by column-family:column-name.
>
> A column-name is a byte array, and you can assign a dynamic value.
> So, in this case, you can have a table with variable columns, where each
> column represents one web site and the cell value is the number of times
> the user has viewed that page.
>
> Rowkey   <======================= Columns =======================>
>
> User1    pageviews:www.yahoo.com     pageviews:www.google.com
>          10                          20
>
>
> So, if you do
>
> hbase> get "useractivity", "userid1", {COLUMN=>'pageviews:www.yahoo.com'}
>
> then, you should get desired value.
>
>
> This solution has some gotchas too, though. Keep in mind that a Get
> request can query either particular columns or all columns. If the number
> of columns is very large, you risk an out-of-memory error when fetching
> all columns. If you are going to query by column name (a specific web
> site in this case), you should be okay with this design.
>
> Alternatively, you can define your row key to contain the web site name.
> For example, you can have one row per user per website.
> So, your row key will look like "userID1-com.yahoo.www" (It is typically
> suggested to use reverse domain names
> http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable).
> The row key, too, is just a byte array, and it is up to you to decide
> what it should consist of.
>
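
As a small illustration of the per-user, per-website row key suggested
above, the sketch below builds "userID1-com.yahoo.www" from a user id and a
host name. It is plain Java with no HBase dependency; the class RowKeys and
both method names are invented for this example, and the "-" separator is
taken from the example key in the text.

```java
// Hypothetical helper, not part of any HBase API: builds a
// "<userId>-<reversed domain>" row key as a String.
class RowKeys {

    // Reverses the dot-separated labels: "www.yahoo.com" -> "com.yahoo.www".
    static String reverseDomain(String host) {
        String[] labels = host.split("\\.");
        StringBuilder reversed = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            reversed.append(labels[i]);
            if (i > 0) {
                reversed.append('.');
            }
        }
        return reversed.toString();
    }

    // Joins the user id and the reversed domain with "-".
    static String userSiteKey(String userId, String host) {
        return userId + "-" + reverseDomain(host);
    }

    public static void main(String[] args) {
        System.out.println(userSiteKey("userID1", "www.yahoo.com"));
    }
}
```

With reversed domains, keys for hosts under the same domain sort next to
each other for a given user, which is the usual argument for this layout.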
>
> Regards,
> Srikanth
>
>
> -----Original Message-----
> From: Narayanan K [mailto:[email protected]]
> Sent: Monday, July 11, 2011 2:46 PM
> To: [email protected]
> Subject: Re: Fetching and iterating through all column values belonging to
> all Timestamps of a Row
>
> Hi Srikanth,
>
> Yes, versions would help if I had a fixed number of versions.
>
> But in my case, I will not know the number of versions beforehand. The
> table will be populated from feed files by a MapReduce program.
> Once loaded, all of these values will go into the same column
> family:column. I then want to count the number of times a particular URI
> was accessed by userid1.
> For this, I need to be able to scan through all the versions loaded under
> that rowkey and increment a counter.
>
> How is this possible if I do not know the number of versions being loaded
> under a rowkey, since it varies (each feed file may have a different
> number of records)?
>
> Is the setTimeRange method of Get and Scan meant to do this? If so, why
> am I not getting all the column values for a particular rowkey?
>
> Regards,
> Narayanan
>
>
> On Mon, Jul 11, 2011 at 12:28 PM, Srikanth P. Shreenivas <
> [email protected]> wrote:
>
> > Hi Narayanan,
> >
> > I think you need to create the table with versions enabled.
> >
> > For example, if you need to store 5 versions, you can use create like
> > this:
> >
> > hbase> create 'useractivity', {NAME => 'pageviews', VERSIONS => 5}
> >
> > hbase> put 'useractivity', 'userid1', 'pageviews:uri', 'http://www.allaboutdata.net'
> > hbase> put 'useractivity', 'userid1', 'pageviews:uri', 'http://www.yahoo.co.in'
> >
> > hbase> get "useractivity", "userid1", {COLUMN=>'pageviews',VERSIONS=>2}
> > COLUMN           CELL
> >  pageviews:uri   timestamp=1310367267049, value=http://www.yahoo.co.in
> >  pageviews:uri   timestamp=1310367221129, value=http://www.allaboutdata.net
> > 2 row(s) in 0.0440 seconds
> >
> >
> > One thing to watch out for is that VERSIONS is defined on the column
> > family, so it is a schema-level setting rather than something you can
> > choose per row or per put. This works if your application only needs to
> > store a fixed number of versions. If that is not the case, you need to
> > rethink your table design and achieve this some other way.
> >
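
To show why a fixed VERSIONS setting is a poor fit when the number of
writes per cell is unknown, here is a toy model in plain Java (assuming
Java 8 or later). It is NOT HBase code and does not use the HBase API: the
class VersionedCell and its methods are invented for this illustration, and
it trims old versions immediately on put, which is a simplification of what
a real store does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Toy model only, not HBase: one cell that retains at most maxVersions
// values, keyed by timestamp, newest first.
class VersionedCell {
    private final int maxVersions;
    // Reverse ordering means iteration runs from newest to oldest timestamp.
    private final TreeMap<Long, String> versions =
            new TreeMap<>(Collections.reverseOrder());

    VersionedCell(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    // Stores a value and drops the oldest ones beyond the retention limit.
    void put(long timestamp, String value) {
        versions.put(timestamp, value);
        while (versions.size() > maxVersions) {
            versions.pollLastEntry(); // last entry = oldest timestamp
        }
    }

    // Returns up to howMany values, newest first.
    List<String> get(int howMany) {
        List<String> out = new ArrayList<>();
        for (String value : versions.values()) {
            if (out.size() == howMany) {
                break;
            }
            out.add(value);
        }
        return out;
    }

    public static void main(String[] args) {
        VersionedCell uri = new VersionedCell(2);
        uri.put(1L, "http://www.allaboutdata.net");
        uri.put(2L, "http://www.yahoo.co.in");
        uri.put(3L, "http://www.google.com");
        // Only the two newest values survive, however many are requested.
        System.out.println(uri.get(5));
    }
}
```

In this model a third put with a limit of two silently loses the oldest
value, so counting accesses by counting versions would undercount.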
> > Regards,
> > Srikanth
> >
> > -----Original Message-----
> > From: Narayanan K [mailto:[email protected]]
> > Sent: Monday, July 11, 2011 11:07 AM
> > To: [email protected]
> > Subject: Fetching and iterating through all column values belonging to
> > all Timestamps of a Row
> >
> > Hi all,
> >
> > I am using Hadoop 0.20.1 and HBase 0.20.
> >
> > Currently, I am trying to retrieve and iterate through all the column
> > values of a particular rowkey in an HBase table.
> > But I am able to retrieve *only* the cell value with the *latest
> > timestamp*.
> >
> > Eg:
> >
> > hbase> create 'useractivity', 'pageviews'
> > hbase> put 'useractivity', 'userid1', 'pageviews:uri', 'http://www.allaboutdata.net'
> > hbase> put 'useractivity', 'userid1', 'pageviews:uri', 'http://www.yahoo.co.in'
> >
> > hbase> get 'useractivity', 'userid1'
> >
> > fetches only the "http://www.yahoo.co.in" column value, as it has the
> > latest timestamp.
> >
> > I want to see both values in the column *uri*.
> >
> > I tried the same with the Java API, using Get as well as Scan, but both
> > gave me the same result: only the value that was inserted last.
> > I also read through some old archives and found that I could call
> > setTimeRange on Get/Scan, but that does not solve my problem either.
> >
> > I used get.setTimeRange(0, Long.MAX_VALUE), as in:
> >
> > HTable table = new HTable(new HBaseConfiguration(), "useractivity");
> > Get get = new Get(Bytes.toBytes("userid1"));
> > get.addFamily(Bytes.toBytes("pageviews"));
> > get.setTimeRange(0, Long.MAX_VALUE);
> > Result result = table.get(get);
> > byte[] value = result.getValue(Bytes.toBytes("pageviews"),
> >         Bytes.toBytes("uri"));
> >
> > System.out.println(Bytes.toString(value));
> >
> > This fetches only the column value with the latest timestamp.
> >
> > I tried the same with the Scan API, but I get the same result.
> >
> > *Could you please let me know how I can retrieve all column values for
> > all timestamps of a particular rowkey?*
> >
> > Many Thanks,
> > Narayanan
> >
>