Hi,

Let me illustrate my question with an example.

The flat feed file (just a sample scenario to make my point clear) would
contain user events recorded at various times of the day.

Eg:

*UserID   Time    Url     Views   Timespent*
 1        05:27   a.com   1       20
 2        05:34   b.com   2       12
 1        06:00   a.com   1       18
 3        06:02   c.com   3       56
 1        07:03   a.com   2       10

This data would be loaded into an HBase table with the date as rowkey, e.g.
*2011-07-01*, and with columns *User:UserID*, *Http:Url*, *Metrics:Views*, and
*Metrics:Timespent*.
The next day the rowkey is incremented, and all of that day's feeds are
loaded into the table under the new rowkey *2011-07-02*, and so on.
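
To make the layout above concrete, here is a minimal sketch in plain Java (no HBase calls) of how one feed line would map onto the columns listed above. The class `FeedLineMapper`, the method `toCells`, and the whitespace-separated feed format are assumptions made for illustration only; the Time field is parsed but not stored, since no column is listed for it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: maps one feed line to column-name -> value pairs.
class FeedLineMapper {

    // Assumes five whitespace-separated fields: UserID Time Url Views Timespent.
    static Map<String, String> toCells(String feedLine) {
        String[] fields = feedLine.trim().split("\\s+");
        if (fields.length != 5) {
            throw new IllegalArgumentException("expected 5 fields: " + feedLine);
        }
        Map<String, String> cells = new LinkedHashMap<>();
        cells.put("User:UserID", fields[0]);
        // fields[1] is the time of day; the described schema has no column for it.
        cells.put("Http:Url", fields[2]);
        cells.put("Metrics:Views", fields[3]);
        cells.put("Metrics:Timespent", fields[4]);
        return cells;
    }

    public static void main(String[] args) {
        String rowKey = "2011-07-01"; // one row per day, as described above
        System.out.println(rowKey + " <- " + toCells("1  05:27  a.com  1  20"));
        System.out.println(rowKey + " <- " + toCells("1  06:00  a.com  1  18"));
    }
}
```

Every event of a day maps to the same rowkey and the same four column names, so later events land in the same cells as earlier ones. That is why a plain GET shows only the latest entry.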

Now I need to sum up the Timespent or the Views for user "1" and url "a.com"
for a given day, say *2011-07-01* (just a sample scenario). That means
summing all the Timespent values for a particular UserID and a particular Url
present in the row *2011-07-01*.

A GET on this table for this rowkey gives me only the latest entry in the
row, but I need to scan through all the values in the row and sum them up.
The output should be like below:

Output to different table:

*Date          ->   User   Url     TotalViews   TotalTS*
 2011-07-01    ->   1      a.com   4            48
                    2      b.com   2            12
                    3      c.com   3            56

2011-07-02   ->  so on.....
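
The aggregation described above can be sketched in plain Java, independent of how the events are read back from HBase. The names `DailyTotals`, `Event`, and `summarize` are hypothetical and used only for illustration; the sample events are the five feed lines from the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: none of these names come from the HBase API.
class DailyTotals {

    // One feed event for a single day.
    static final class Event {
        final String userId;
        final String url;
        final long views;
        final long timeSpent;

        Event(String userId, String url, long views, long timeSpent) {
            this.userId = userId;
            this.url = url;
            this.views = views;
            this.timeSpent = timeSpent;
        }
    }

    // Sums views and time spent per (user, url).
    // Returns "user|url" -> {totalViews, totalTimeSpent}, in first-seen order.
    static Map<String, long[]> summarize(List<Event> events) {
        Map<String, long[]> totals = new LinkedHashMap<>();
        for (Event e : events) {
            long[] t = totals.computeIfAbsent(e.userId + "|" + e.url, k -> new long[2]);
            t[0] += e.views;
            t[1] += e.timeSpent;
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Event> day = new ArrayList<>();
        day.add(new Event("1", "a.com", 1, 20));
        day.add(new Event("2", "b.com", 2, 12));
        day.add(new Event("1", "a.com", 1, 18));
        day.add(new Event("3", "c.com", 3, 56));
        day.add(new Event("1", "a.com", 2, 10));

        for (Map.Entry<String, long[]> entry : summarize(day).entrySet()) {
            System.out.println("2011-07-01 -> " + entry.getKey()
                    + " views=" + entry.getValue()[0]
                    + " timespent=" + entry.getValue()[1]);
        }
    }
}
```

The sketch only works if every event of the day is available to iterate over, which is exactly what a GET returning just the latest entry does not provide.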

I hope this makes my question a little clearer.

Thanks,
Narayanan

On Mon, Jul 11, 2011 at 2:59 PM, Srikanth P. Shreenivas <
[email protected]> wrote:

> Columns in a table are identified by column-family:column-name.
>
> A column-name is a byte array, and you can assign a dynamic value.
> So, in this case, you can have a table with variable columns, where each
> column represents one web site and the cell value is the count of
> views the user has made of that page.
>
> Rowkey  - <==========================  Columns
> ====================================>
>
> User1    pageviews:www.yahoo.com     pageviews:www.google.com
>         10                          20
>
>
> So, if you do
>
> hbase> get "useractivity", "userid1", {COLUMN=>'pageviews:www.yahoo.com'}
>
> then, you should get desired value.
>
>
> This solution has some gotchas too, though.  Keep in mind that a Get request
> can fetch either particular columns or all columns.  If the number of
> columns is very large, you risk an out-of-memory error when fetching all
> columns.  If you are going to query by column name (a specific web site
> in this case), then you should be okay with this design.
>
> Alternatively, you can define your row key to contain the web site name.
> For example, you can have one row per user per website.
> So, your row key will look like "userID1-com.yahoo.www" (it is typically
> suggested to use reverse domain names; see
> http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable).
> The row key, too, is just a byte array, and it is up to you to decide
> what it should consist of.
>
>
> Regards,
> Srikanth
>
>
> -----Original Message-----
> From: Narayanan K [mailto:[email protected]]
> Sent: Monday, July 11, 2011 2:46 PM
> To: [email protected]
> Subject: Re: Fetching and iterating through all column values belonging to
> all Timestamps of a Row
>
> Hi Srikanth,
>
> Yes. Versions will help me if I have a fixed number of versions.
>
> But in my case, I will not know the number of versions beforehand. The table
> will be populated from feed files using a MapReduce program.
> Once loaded, all of these will go into the same column family:column. Then I
> would want to count the number of times a particular URI was accessed by
> userid1.
> For this, I need to be able to scan through all the versions loaded in that
> rowkey and increment a counter.
>
> How is this possible if I do not know the number of versions being
> loaded into a table rowkey, since it is a dynamic property (each feed file may
> have a different number of records)?
>
> Is the setTimeRange method of GET and SCAN meant to do this? If so, why am I
> not getting all the column values for a particular rowkey?
>
> Regards,
> Narayanan
>
>
> On Mon, Jul 11, 2011 at 12:28 PM, Srikanth P. Shreenivas <
> [email protected]> wrote:
>
> > Hi Narayanan,
> >
> > I think you need to create the table with versions enabled.
> >
> > For example, if you need to store 5 versions, you can use create like this:
> >
> > hbase> create 'useractivity', {NAME => 'pageviews', VERSIONS => 5}
> >
> > hbase> put 'useractivity', 'userid1', 'pageviews:uri', 'http://www.allaboutdata.net'
> > hbase> put 'useractivity', 'userid1', 'pageviews:uri', 'http://www.yahoo.co.in'
> >
> > hbase> get "useractivity", "userid1", {COLUMN=>'pageviews',VERSIONS=>2}
> > COLUMN                                        CELL
> >  pageviews:uri                                timestamp=1310367267049, value=http://www.yahoo.co.in
> >  pageviews:uri                                timestamp=1310367221129, value=http://www.allaboutdata.net
> > 2 row(s) in 0.0440 seconds
> >
> >
> > One thing you need to watch out for is that VERSIONS is defined on the column
> > family, and hence, you cannot change it once you have defined your column
> > family.  This will work if your application needs to store only a fixed
> > number of versions.  If that is not the case, you need to
> > rethink your table design and achieve this some other way.
> >
> > Regards,
> > Srikanth
> >
> > -----Original Message-----
> > From: Narayanan K [mailto:[email protected]]
> > Sent: Monday, July 11, 2011 11:07 AM
> > To: [email protected]
> > Subject: Fetching and iterating through all column values belonging to all
> > Timestamps of a Row
> >
> > Hi all,
> >
> > I am using Hadoop - 0.20.1 and HBASE - 0.20.
> >
> > Currently, I am trying to retrieve and iterate through all the column values
> > of a particular rowkey in an HBase table.
> > But I am able to retrieve *only* the cell+value having the *latest timestamp*.
> >
> > Eg:
> >
> > hbase> create 'useractivity', 'pageviews'
> > hbase> put 'useractivity', 'userid1', 'pageviews:uri', 'http://www.allaboutdata.net'
> > hbase> put 'useractivity', 'userid1', 'pageviews:uri', 'http://www.yahoo.co.in'
> >
> > *hbase> get 'useractivity', 'userid1'*
> > is fetching only the "http://www.yahoo.co.in" column value, as it has the
> > latest timestamp.
> >
> > I wanted to view both the values in the column *uri*.
> >
> > I tried the same with the Java API, using Get as well as Scan, but both of
> > them gave me the same result: only the value that was
> > inserted last.
> > I also read through some old archives and found that I could call setTimeRange
> > on Get/Scan, but that does not solve my problem either.
> >
> > *get.setTimeRange(0, Long.MAX_VALUE);* as in:
> >
> >  HTable table = new HTable(new HBaseConfiguration(), "useractivity");
> >  Get get = new Get(Bytes.toBytes("userid1"));
> >  get.addFamily(Bytes.toBytes("pageviews"));
> >  get.setTimeRange(0, Long.MAX_VALUE);
> >  Result result = table.get(get);
> >  // getValue returns the most recent version of the cell
> >  byte[] value = result.getValue(Bytes.toBytes("pageviews"),
> >          Bytes.toBytes("uri"));
> >
> >  System.out.println(Bytes.toString(value));
> >
> >  This is fetching me only the column value with the latest timestamp.
> >
> > I tried the same with Scan API. But I get the same result.
> >
> > *Could you please let me know how I can retrieve all column values of all
> > timestamps of a particular rowkey??*
> >
> > Many Thanks,
> > Narayanan
> >
> > ________________________________
> >
> > http://www.mindtree.com/email/disclaimer.html
> >
>
