On 6/12/2012 2:42 PM, Jean-Marc Spaggiari wrote:
Here is what the table looks like:
+--------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------+--------------+------+-----+---------+-------+
| IDLow | bigint(20) | NO | PRI | NULL | |
| IDHigh | bigint(20) | NO | PRI | NULL | |
| IDAux | bigint(20) | NO | PRI | NULL | |
| Value | varchar(512) | NO | | NULL | |
| lastUpdate | bigint(20) | NO | | 0 | |
| crc | bigint(20) | YES | | NULL | |
| language | int(11) | YES | | NULL | |
| status | int(11) | YES | | NULL | |
| size | int(11) | YES | | NULL | |
+--------------+--------------+------+-----+---------+-------+
Now, I'm trying to design an HBase table to do the same thing.
IDLow, IDHigh and IDAux are 64 bits numbers. So I will just convert
them to a 24 bytes array. That's fine. They are uniq, and they are the
main key for the inserts.
Based on what I read on the documentation, I need to keep as few
Column Familly as possible. And the Columns names as short as
possible. So let's try to have just one, named "a".
That will give me something like:
create 'mytable', 'a'
Then for the cells, I can insert that way:
put 'mytable', 'row1', 'a:a', 'ID'
Where value1 is my 24 bytes ID.
put 'mytable', 'row1', 'a:b', 'VALUE'
Which is my "value" parameter from my mysql table.
put 'mytable', 'row1', 'a:c', 'CRC'
Which will be my CRC
And so on.
Why not use ID as the row key? and put all the values at once as
separate columns in the same row.
Put put = new put(byte [] row)
.add(Bytes.toBytes("a"), Bytes.toBytes("CRC"), Bytes.toBytes(<crc>));
.add(Bytes.toBytes("a"), Bytes.toBytes("VALUE"),
Bytes.toBytes(<value>));
// where 'row' = your 24 bytes ID as above
etc.
then send the complete row as one atomic operation.
where <crc> is the value of the CRC column and so on. I'm showing the
Java bindings, but the principle applies to whichever client api you use.
Now, I have few questions.
1) I was not able to find any reference of the primary key or primary
index or similar in the HBase documentation. Is it automatically done
of the first cell ("a" in the example)? Or on the row name ("row1" in
the example)?
2) I will need to be able to parse the rows filtering on "status"
field. I searched for the way to add a secondary index but I was not
able to find it. Any place I can look for that?
3) Since all my entries will have unique id, can I simply use that as
the row title instead of "row1"? What's pros what's cons?
4) If I don't have an index on a field will I still be able to filter
on that field? Like select * from mytable where status = 0... With 50
millions lines, it might take some time... But is that still doable
without reaching memory limitation or something like that?
5) I would like to try my architecture on small computers before I go
for bigger. What's the minimum memory I should have on each
RegionServer (Or DataNode) to start the application and load few
hundred thousands lines?
I just order "HBase: The Definitive Guide" from amazon. I hope it will
help me, but in the meantime, if you have some responses for me, it
will be welcome.
Thanks for your help.
JM