Thanks, Srikanth, for your replies.
I wrote some Java code that retrieves the values of all versions of each cell
using the setMaxVersions method of the Get API.
For the benefit of others looking for an example of setMaxVersions, the code
is below:
import java.io.IOException;
import java.util.Map.Entry;
import java.util.NavigableMap;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HbaseReadTest {

    public static void main(String[] args) {
        try {
            HTable table = new HTable(new HBaseConfiguration(), "useractivity");

            // Ask for every stored version of each cell, not just the latest one.
            Get get = new Get(Bytes.toBytes("userid1"));
            get.addFamily(Bytes.toBytes("pageviews"));
            get.setMaxVersions(Integer.MAX_VALUE);
            Result result = table.get(get);

            // Nested map: column family -> column qualifier -> timestamp -> value.
            NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> map =
                    result.getMap();
            if (map == null) {
                // getMap() returns null when the row has no cells.
                System.out.println("No data found for row userid1");
                return;
            }

            for (Entry<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> columnFamilyEntry
                    : map.entrySet()) {
                NavigableMap<byte[], NavigableMap<Long, byte[]>> columnMap =
                        columnFamilyEntry.getValue();
                for (Entry<byte[], NavigableMap<Long, byte[]>> columnEntry
                        : columnMap.entrySet()) {
                    NavigableMap<Long, byte[]> cellMap = columnEntry.getValue();
                    // One entry per stored version of this column.
                    for (Entry<Long, byte[]> cellEntry : cellMap.entrySet()) {
                        System.out.println(String.format("Key : %s, Value : %s",
                                Bytes.toString(columnEntry.getKey()),
                                Bytes.toString(cellEntry.getValue())));
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
With this done, we can work with the values obtained from all versions of
a single row and compute any statistics on them.
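
As a rough illustration of such a statistic, here is a minimal sketch that sums the values of all versions of one column. It is plain Java with no HBase dependency: the map stands in for the innermost timestamp-to-value map returned by result.getMap(). The class name VersionSum, the method sumVersions, and the assumption that every value was stored as a decimal string are hypothetical and only for illustration; the sample numbers are the Timespent values for user 1 on a.com from the feed example quoted further down.

```java
import java.nio.charset.StandardCharsets;
import java.util.NavigableMap;
import java.util.TreeMap;

public class VersionSum {

    // Sums the values of every version of one cell.
    // The map has the same shape as the innermost map of Result.getMap():
    // timestamp -> value bytes.
    // Assumption: each value was written as a UTF-8 decimal string such as "20".
    static long sumVersions(NavigableMap<Long, byte[]> versions) {
        long total = 0;
        for (byte[] value : versions.values()) {
            total += Long.parseLong(new String(value, StandardCharsets.UTF_8).trim());
        }
        return total;
    }

    public static void main(String[] args) {
        // Three versions of a hypothetical Timespent cell, keyed by made-up timestamps.
        NavigableMap<Long, byte[]> timespent = new TreeMap<>();
        timespent.put(1L, "20".getBytes(StandardCharsets.UTF_8));
        timespent.put(2L, "18".getBytes(StandardCharsets.UTF_8));
        timespent.put(3L, "10".getBytes(StandardCharsets.UTF_8));

        System.out.println(sumVersions(timespent)); // prints 48
    }
}
```

In the HBase program above, the same method could be called on cellMap inside the loop over columns, provided the stored values really are decimal strings.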
Thanks,
Narayanan K
([email protected])
On Mon, Jul 11, 2011 at 10:44 PM, Srikanth P. Shreenivas <
[email protected]> wrote:
> Hi Narayanan,
>
> If you think of HBase as a 2D table, then each cell is identified by a row
> key and a column name (column family + column qualifier).
> A third dimension is added by versions, which allow you to maintain
> multiple (a fixed number of) copies of a cell.
>
> Each cell has an associated timestamp. You can specify a time range in Get or
> Scan queries to filter cells that were created within a certain period. By
> default you will get only the latest version. If you want more versions, you
> need to call setMaxVersions on the Scan or Get.
>
> Given your requirement, you need to look at alternate table designs. Maybe
> use a row key like "DATE:USERID" so that you can fetch data for a given date
> and a given user. If you know the list of all users, then you can easily query
> the rows for each user for a given date by making multiple GET calls,
> one for each user.
>
> Details of client API:
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html
> Some advanced topics here:
> http://ofps.oreilly.com/titles/9781449396107/advanced.html
>
> Regards,
> Srikanth
>
> -----Original Message-----
> From: Narayanan K [mailto:[email protected]]
> Sent: Monday, July 11, 2011 10:09 PM
> To: [email protected]
> Subject: Re: Fetching and iterating through all column values belonging to
> all Timestamps of a Row
>
> Hi all,
>
> Could somebody please shed some light on this?
>
> If it is a limitation in HBase that I can access only the column value
> with the latest timestamp in a row of an HBase table, then I
> will have to think about a different schema where each user event entry
> goes into a different row.
>
> Also, could somebody let me know whether the setTimeRange method of Get or
> Scan can be used to access all the column values under all timestamps of a
> particular row?
>
> Thanks,
> Narayanan
>
> On Mon, Jul 11, 2011 at 5:15 PM, Narayanan K <[email protected]>
> wrote:
>
> > Hi,
> >
> > Let me illustrate my question a little more.
> >
> > The flat feed file (just a sample scenario to make my point clear) would
> > have the user events at various times of the day.
> >
> > Eg:
> > UserID  Time   Url    Views  Timespent
> > 1       05:27  a.com  1      20
> > 2       05:34  b.com  2      12
> > 1       06:00  a.com  1      18
> > 3       06:02  c.com  3      56
> > 1       07:03  a.com  2      10
> >
> > This data would be loaded into an HBase table with the date
> > 2011-07-01 as the rowkey and the columns User:UserID, Http:Url,
> > Metrics:Views, and Metrics:Timespent.
> > The next day, the rowkey will be incremented and all feeds of that day
> > will be loaded into the table with the new rowkey 2011-07-02, and so on.
> >
> > Now I need to sum up the Timespent or the Views for user "1" for url
> > "a.com" for a day, say 2011-07-01 (just a sample scenario I am thinking
> > of), which means I need to sum up all the Timespent values for a particular
> > userID and a particular Url present in the row 2011-07-01.
> >
> > A GET on this table for this rowkey will just give me the latest entry
> > in the row. But I need to be able to scan through all values in the row
> > and sum them up.
> > The output should be like below:
> >
> > Output to different table:
> >
> > Date       -> User  Url    TotalViews  TotalTS
> > 2011-07-01 -> 1     a.com  4           48
> >               2     b.com  2           12
> >               3     c.com  3           56
> >
> > 2011-07-02 -> so on.....
> >
> > I hope this makes my question a little clearer.
> >
> > Thanks,
> > Narayanan
> >
> >
> > On Mon, Jul 11, 2011 at 2:59 PM, Srikanth P. Shreenivas <
> > [email protected]> wrote:
> >
> >> Columns in a table are identified by column-family:column-name.
> >>
> >> A column-name is a byte array, and you can assign it a dynamic value.
> >> So, in this case, you can have a table with variable columns where each
> >> column represents one web site, and the cell value is the count of
> >> views by the user for that site.
> >>
> >> Rowkey   <======================= Columns =======================>
> >>
> >> User1    pageviews:www.yahoo.com    pageviews:www.google.com
> >>          10                         20
> >>
> >>
> >> So, if you do
> >>
> >> hbase> get "useractivity", "userid1", {COLUMN=>'pageviews:www.yahoo.com'}
> >>
> >> then you should get the desired value.
> >>
> >>
> >> This solution has some gotchas too, though. Keep in mind that you can
> >> query either particular columns or all columns in a Get request. If the
> >> number of columns is too large, you risk an out-of-memory error when
> >> doing an all-columns get. If you are going to query by column name (a
> >> specific web site in this case), then you should be okay with this design.
> >>
> >> Alternatively, you can define your row key to contain the web site name.
> >> For example, you can have one row per user per website.
> >> So, your row key will look like "userID1-com.yahoo.www" (it is typically
> >> suggested to use reverse domain names; see
> >> http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable ).
> >> The row key too is just a byte array, and it is up to you to decide
> >> what it should consist of.
> >>
> >>
> >> Regards,
> >> Srikanth
> >>
> >>
> >> -----Original Message-----
> >> From: Narayanan K [mailto:[email protected]]
> >> Sent: Monday, July 11, 2011 2:46 PM
> >> To: [email protected]
> >> Subject: Re: Fetching and iterating through all column values belonging
> to
> >> all Timestamps of a Row
> >>
> >> Hi Srikanth,
> >>
> >> Yes. Versions will help me if I have a fixed number of versions.
> >>
> >> But in my case, I will not know the number of versions beforehand. The
> >> table will be populated from feed files using a MapReduce program.
> >> Once loaded, all of these will go into the same column family:column. Then
> >> I would want to count the number of times a particular URI was accessed by
> >> userid1.
> >> For this, I need to be able to scan through all the versions loaded in
> >> that rowkey and increment a counter.
> >>
> >> How is this possible if I do not know the number of versions being
> >> loaded into a table rowkey, since it is a dynamic property (each feed file
> >> may have a different number of records)?
> >>
> >> Is the setTimeRange method of Get and Scan meant to do this? If so, why
> >> am I not getting all the column values for a particular rowkey?
> >>
> >> Regards,
> >> Narayanan
> >>
> >>
> >> On Mon, Jul 11, 2011 at 12:28 PM, Srikanth P. Shreenivas <
> >> [email protected]> wrote:
> >>
> >> > Hi Narayanan,
> >> >
> >> > I think you need to create the table with versions enabled.
> >> >
> >> > For example, if you need to store 5 versions, you can use create like
> >> > this:
> >> >
> >> > hbase> create 'useractivity', {NAME => 'pageviews', VERSIONS => 5}
> >> >
> >> > hbase> put 'useractivity', 'userid1', 'pageviews:uri', 'http://www.allaboutdata.net'
> >> > hbase> put 'useractivity', 'userid1', 'pageviews:uri', 'http://www.yahoo.co.in'
> >> >
> >> > hbase> get "useractivity", "userid1", {COLUMN=>'pageviews',VERSIONS=>2}
> >> > COLUMN               CELL
> >> >  pageviews:uri       timestamp=1310367267049, value=http://www.yahoo.co.in
> >> >  pageviews:uri       timestamp=1310367221129, value=http://www.allaboutdata.net
> >> > 2 row(s) in 0.0440 seconds
> >> >
> >> >
> >> > One thing you need to watch out for is that VERSIONS is defined on the
> >> > column family, and hence you cannot change it once you have defined
> >> > your column family. This will work if your application needs to store
> >> > only a fixed number of versions. If that is not the case, you need to
> >> > revisit your table design and achieve this some other way.
> >> >
> >> > Regards,
> >> > Srikanth
> >> >
> >> > -----Original Message-----
> >> > From: Narayanan K [mailto:[email protected]]
> >> > Sent: Monday, July 11, 2011 11:07 AM
> >> > To: [email protected]
> >> > Subject: Fetching and iterating through all column values belonging to
> >> all
> >> > Timestamps of a Row
> >> >
> >> > Hi all,
> >> >
> >> > I am using Hadoop - 0.20.1 and HBASE - 0.20.
> >> >
> >> > Currently, I am trying to retrieve and iterate through all the column
> >> > values of a particular rowkey in an HBase table.
> >> > But I am able to retrieve only the cell value with the latest
> >> > timestamp.
> >> >
> >> > Eg:
> >> >
> >> > hbase> create 'useractivity', 'pageviews'
> >> > hbase> put 'useractivity', 'userid1', 'pageviews:uri', 'http://www.allaboutdata.net'
> >> > hbase> put 'useractivity', 'userid1', 'pageviews:uri', 'http://www.yahoo.co.in'
> >> >
> >> > hbase> get 'useractivity', 'userid1'
> >> > is fetching only the "http://www.yahoo.co.in" column value, as it has
> >> > the latest timestamp.
> >> >
> >> > I wanted to view both the values in the column *uri*.
> >> >
> >> > I tried the same with the Java API, using Get as well as Scan. But
> >> > both of them gave me the same result: only the most recently
> >> > inserted value of the column.
> >> > I also read through some old archives and found that I could call
> >> > setTimeRange on Get/Scan, but that does not solve my problem either.
> >> >
> >> > I used get.setTimeRange(0, Long.MAX_VALUE); as in:
> >> >
> >> > HTable table = new HTable(new HBaseConfiguration(), "useractivity");
> >> > Get get = new Get(Bytes.toBytes("userid1"));
> >> > get.addFamily(Bytes.toBytes("pageviews"));
> >> > get.setTimeRange(0, Long.MAX_VALUE);
> >> > Result result = table.get(get);
> >> > byte[] value = result.getValue(Bytes.toBytes("pageviews"),
> >> >         Bytes.toBytes("uri"));
> >> >
> >> > System.out.println(Bytes.toString(value));
> >> >
> >> > This fetches only the column value with the latest timestamp.
> >> >
> >> > I tried the same with the Scan API, but I get the same result.
> >> >
> >> > Could you please let me know how I can retrieve all column values for
> >> > all timestamps of a particular rowkey?
> >> >
> >> > Many Thanks,
> >> > Narayanan
> >> >