Hi Jean-Marc, my replies are inline below.

On 12/06/2012 23:42, Jean-Marc Spaggiari wrote:
> Hi,
>
> I have read all the documentation here
> http://hbase.apache.org/book/book.html and I now have a few questions.
>
> I currently have a mysql table with millions of lines (4 for now, but
> it's growing by 4 million a month). It's running on a fast computer,
> but it's still way too slow when it's time to insert new data. The
> bigger the table is, the slower the inserts are, and soon, the
> application will not be working anymore.
>
> Here is what the table looks like:
> +--------------+--------------+------+-----+---------+-------+
> | Field        | Type         | Null | Key | Default | Extra |
> +--------------+--------------+------+-----+---------+-------+
> | IDLow        | bigint(20)   | NO   | PRI | NULL    |       |
> | IDHigh       | bigint(20)   | NO   | PRI | NULL    |       |
> | IDAux        | bigint(20)   | NO   | PRI | NULL    |       |
> | Value        | varchar(512) | NO   |     | NULL    |       |
> | lastUpdate   | bigint(20)   | NO   |     | 0       |       |
> | crc          | bigint(20)   | YES  |     | NULL    |       |
> | language     | int(11)      | YES  |     | NULL    |       |
> | status       | int(11)      | YES  |     | NULL    |       |
> | size         | int(11)      | YES  |     | NULL    |       |
> +--------------+--------------+------+-----+---------+-------+
>
> Now, I'm trying to design an HBase table to do the same thing.
> IDLow, IDHigh and IDAux are 64-bit numbers. So I will just convert
> them to a 24-byte array. That's fine. They are unique, and they are the
> main key for the inserts.
>
> lastUpdate, I don't think I will need it anymore since it seems to be
> already handled by HBase (timestamp)
>
> Value, crc, language, status and size are parameters I will have to be
> able to retrieve.
>
> Based on what I read in the documentation, I need to keep as few
> Column Families as possible. And the Column names as short as
> possible. So let's try to have just one, named "a".
>
> That will give me something like:
> create 'mytable', 'a'
>
> Then for the cells, I can insert that way:
> put 'mytable', 'row1', 'a:a', 'ID'
> Where value1 is my 24 bytes ID.
> put 'mytable', 'row1', 'a:b', 'VALUE'
> Which is my "value" parameter from my mysql table.
> put 'mytable', 'row1', 'a:c', 'CRC'
> Which will be my CRC
>
> And so on.
>
> Now, I have a few questions.
>
> 1) I was not able to find any reference to the primary key or primary
> index or similar in the HBase documentation. Is it automatically done
> on the first cell ("a" in the example)? Or on the row name ("row1" in
> the example)?

The only unique key available is the row ID (the row name, "row1" in
your example). You can use any byte[] as the row ID; often it is a
concatenation of multiple fields. Rows are sorted in lexicographic byte
order of the row ID, though, and that has consequences for the way you
scan your data to retrieve values.

STARTROW and STOPROW are the most efficient filter for a scan: they map
directly onto the way the data is sharded, so the scan is executed only
on the relevant regions of the cluster. So think about how you will
mainly search (which ranges).

Other filters are sent to all the regions. Column-based and
timestamp-based filters are still efficient, since they use the two
other dimensions of an HBase table. The least efficient filters are
those based on values, where you have to scan the whole data set. The
same goes for regexp or "contains"-like operators on the row ID (maybe
a Bloom filter can help here).

> 2) I will need to be able to parse the rows filtering on the "status"
> field. I searched for a way to add a secondary index but I was not
> able to find it. Any place I can look for that?

A secondary index in HBase is usually another HTable whose row IDs are
specially designed so that you can easily retrieve, by range, the row
IDs of the primary table. You have to maintain consistency between the
two tables on your own.

An HTable has 3 dimensions: row ID, column name and timestamp, and you
can try to use them. Maybe you could create a column for each status
value and add/remove binary data when the status changes, so that you
only have to filter on the column name.

> 3) Since all my entries will have a unique id, can I simply use that as
> the row title instead of "row1"?
> What are the pros and cons?

That sounds like the right way to do it.

> 4) If I don't have an index on a field will I still be able to filter
> on that field? Like select * from mytable where status = 0... With 50
> million lines, it might take some time... But is that still doable
> without reaching memory limitations or something like that?

The work will be split across the regions and streamed (you have to
iterate on the client side).

> 5) I would like to try my architecture on small computers before I go
> for bigger. What's the minimum memory I should have on each
> RegionServer (or DataNode) to start the application and load a few
> hundred thousand lines?

My advice is to use the default values first and change them when you
run into problems.

> I just ordered "HBase: The Definitive Guide" from amazon. I hope it will
> help me, but in the meantime, if you have some responses for me, they
> will be welcome.
>
> Thanks for your help.
>
> JM
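
To illustrate the row ID and STARTROW/STOPROW points above, here is a minimal sketch in plain Python (no HBase client involved). It assumes the three IDs are non-negative and packed big-endian in the order IDLow, IDHigh, IDAux, so that byte order matches numeric order; the function names are made up for the example.

```python
import struct

MAX_U64 = 0xFFFFFFFFFFFFFFFF

def make_row_key(id_low, id_high, id_aux):
    """Pack three unsigned 64-bit IDs into one 24-byte big-endian row key."""
    return struct.pack(">QQQ", id_low, id_high, id_aux)

def prefix_range(id_low):
    """Return (start_row, stop_row) covering every key that begins with id_low.

    The start row is inclusive and the stop row exclusive, as in an HBase scan.
    """
    start = struct.pack(">Q", id_low)
    if id_low == MAX_U64:
        return start, b""  # empty stop row: scan to the end of the table
    return start, struct.pack(">Q", id_low + 1)

# Sorting the raw bytes mimics how the table orders its rows.
keys = sorted(
    make_row_key(low, high, 0)
    for low, high in [(2, 1), (1, 300), (1, 2), (256, 0)]
)

# Range scan: keep only the rows whose IDLow is 1.
start, stop = prefix_range(1)
matches = [k for k in keys if start <= k < stop]
```

Big-endian packing is what makes the range scan work: with little-endian or signed values, rows that are numerically adjacent would not be adjacent in the table.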

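
The secondary-index idea from the answer to question 2 could look roughly like the following. This is only an in-memory simulation: `SortedTable` is a hypothetical stand-in for an HBase table, the index row ID is assumed to be a 4-byte non-negative status followed by the primary row ID, and the two writes are not atomic, which is exactly the consistency you have to maintain yourself.

```python
import bisect
import struct

class SortedTable:
    """Tiny in-memory stand-in for a table whose rows are sorted by key."""

    def __init__(self):
        self._keys = []
        self._rows = {}

    def put(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def get(self, key):
        return self._rows.get(key)

    def delete(self, key):
        if key in self._rows:
            del self._rows[key]
            self._keys.remove(key)

    def scan(self, start, stop):
        """Return rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]

def index_key(status, row_key):
    """Index row ID: big-endian status followed by the primary row ID."""
    return struct.pack(">I", status) + row_key

def set_status(primary, index, row_key, status):
    """Update the status and keep the index table in step (two separate writes)."""
    old = primary.get(row_key)
    if old is not None:
        index.delete(index_key(old, row_key))
    primary.put(row_key, status)
    index.put(index_key(status, row_key), b"")

def rows_with_status(index, status):
    """Range-scan the index and return the primary row IDs for one status."""
    start = struct.pack(">I", status)
    stop = struct.pack(">I", status + 1)
    return [k[4:] for k, _ in index.scan(start, stop)]

primary, index = SortedTable(), SortedTable()
set_status(primary, index, b"row-a", 0)
set_status(primary, index, b"row-b", 1)
set_status(primary, index, b"row-c", 0)
set_status(primary, index, b"row-b", 0)  # status change moves the index entry
pending = rows_with_status(index, 0)
```

With this layout, "where status = 0" becomes a start/stop range scan on the index table instead of a value filter over the whole primary table.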