Hi Stack,

Replies inline.
>> According to the HBase book, pre-splitting tables and doing manual
>> splits is a better long-term strategy than letting HBase handle it.
>
> It's good for getting a table off the ground, yes.
>
>> Since I do not know what the keys from the prod system are going to
>> look like, I am adding a machine number prefix to the row keys
>> and pre-splitting the tables based on the prefix (prefix 0 goes to
>> machine A, prefix 1 goes to machine B, etc.).
>
> You don't need to do an in-order scan of the data? What's the rest of
> your row key look like?

I need to be able to do this on 5-6 types of keys/dimensions. I have a
MapReduce job that runs periodically and creates the indexes on
separate tables for querying the data.

>> Once I decide to add more machines, I can always do a rolling split
>> and add more prefixes.
>
> Yes.
>
>> Is this a good strategy for pre-splitting the tables?
>
> So, you'll start out with one region per server?
>
> What do you think the rate of splitting will be like? Are you using
> default region size or have you bumped this up?

This prefix strategy should, I think, create one region per region
server. I have set the maximum region size to 2 GB for now; that is
just a number I picked.

This is a small proof-of-concept cluster running in parallel with some
of the other monolithic reporting infrastructure we have, and it will
only serve a fraction of the prod traffic to start with. Each machine
in the cluster has 120 GB of disk space, 8 GB of memory, and a
quad-core 2.66 GHz CPU. I am going to allocate around 80 GB of memory
for HBase use.

On a side note, I don't think I understand how to decide how many
regions per region server I need. If I were to create one region per
region server and set hbase.hregion.max.filesize to Long.MAX_VALUE, why
is that a bad thing? What kind of problems can I run into? If I were to
err on the side of too many regions, what are the
advantages/disadvantages there?

> St.Ack
>
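For concreteness, here is a rough sketch of how a prefix scheme like the one described above could map keys to pre-split regions. It is plain Python with no HBase client involved. The function names, the one-byte prefix, and the use of a CRC32 hash to pick the prefix are all assumptions made for illustration; the thread does not say how the prefix is actually chosen.

```python
import bisect
import zlib
from typing import List


def split_keys(num_prefixes: int) -> List[bytes]:
    """Split points for a table pre-split into num_prefixes regions.

    Assumes a one-byte prefix, so at most 256 prefixes are possible.
    """
    if not 1 <= num_prefixes <= 256:
        raise ValueError("num_prefixes must be between 1 and 256")
    # N regions need N - 1 split points: 0x01, 0x02, ...
    return [bytes([p]) for p in range(1, num_prefixes)]


def salted_key(row_key: bytes, num_prefixes: int) -> bytes:
    """Prepend a prefix byte chosen by a stable hash (an assumption)."""
    prefix = zlib.crc32(row_key) % num_prefixes
    return bytes([prefix]) + row_key


def region_for(key: bytes, splits: List[bytes]) -> int:
    """Index of the region whose [start, end) range contains key.

    Keys are compared lexicographically as bytes, and a region's
    start key is inclusive.
    """
    return bisect.bisect_right(splits, key)


if __name__ == "__main__":
    splits = split_keys(4)
    key = salted_key(b"user123", 4)
    print(splits, key, region_for(key, splits))
```

With a scheme like this, each prefix lands in its own region, but reading rows back in their original key order means issuing one scan per prefix and merging the results, which is the trade-off behind the in-order scan question above.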
