Plus,
since most of the time a client will display an area that does not fit in
500x500, Scan operations are required (Get is not enough).
So I'm worried that on scanning, a lot of irrelevant column data (cells that
have the same rowkey, which is the position on the grid) would be read into
the block cache, unless the columns are separated into individual column families.
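To make this concern concrete, here is a toy Python model (not HBase client code; the block size and cell layout are simplified assumptions) of why co-located columns pollute the cache: within one column family's file, cells are sorted by (rowkey, qualifier) and grouped into blocks, so reading one column loads whole blocks that also carry its neighbors.

```python
# Toy model of data blocks inside a single column family's file.
# Assumption: 3 cells per block, purely for illustration.
BLOCK_SIZE = 3

def build_blocks(cells):
    """Group sorted (row, qualifier) cells into fixed-size blocks."""
    cells = sorted(cells)
    return [cells[i:i + BLOCK_SIZE] for i in range(0, len(cells), BLOCK_SIZE)]

def blocks_read(blocks, row, qualifier):
    """Blocks that must be loaded (and cached) to read one cell."""
    return [b for b in blocks if (row, qualifier) in b]

cells = [(r, q) for r in ("row1", "row2") for q in ("colA", "colB", "colC")]
blocks = build_blocks(cells)
touched = blocks_read(blocks, "row1", "colA")
# The block holding ("row1", "colA") also carries colB and colC of the
# same row, so their bytes enter the block cache even though only colA
# was requested.
```

Giving each column its own family would give it its own files and blocks, which is exactly the trade-off being asked about.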


-----Original Message-----
From: innowireless TaeYun Kim [mailto:[email protected]] 
Sent: Tuesday, August 05, 2014 8:36 PM
To: [email protected]
Subject: RE: Question on the number of column families

Thank you for your reply.

I can decrease the size of the column values if it's not good for HBase.
BTW, the values are for points on grid cells on a map.
250000 is 500x500, and 500x500 roughly corresponds to the size of the client
screen that displays the values on a map.
Normally a client requests the values for the area that is displayed on the
screen.
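Since the rowkey is a grid position, one way (a hypothetical encoding; the thread does not specify the actual key format) to make a screen-sized rectangle scannable is a fixed-width big-endian key, so that keys sort in row-major order and each grid row is a contiguous key range:

```python
import struct

def rowkey(x, y):
    # Hypothetical rowkey: 4-byte big-endian x then y. Fixed-width
    # big-endian packing makes lexicographic byte order match numeric order.
    return struct.pack(">II", x, y)

def strip_ranges(x0, x1, y0, y1):
    # A single (start, stop) range only covers one contiguous run of keys,
    # so a 2-D rectangle needs one Scan range per x strip.
    return [(rowkey(x, y0), rowkey(x, y1)) for x in range(x0, x1)]

# A 500x500 viewport starting at (0, 0) would need 500 strip scans:
ranges = strip_ranges(0, 500, 0, 500)
```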


-----Original Message-----
From: Alok Kumar [mailto:[email protected]]
Sent: Tuesday, August 05, 2014 8:24 PM
To: [email protected]
Subject: Re: Question on the number of column families

Hi,

HBase creates HFiles per column family. Having 130 column families is really
not recommended.
It will increase the number of file pointers (open file count) underneath.

If you are sure which columns are "frequently" accessed by users, you could
consider putting them in one column family, and the "non-frequent" ones in
another.
BTW, a ~5MB column value is something to consider. We should wait for some
expert advice here!
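The open-file pressure can be sketched with rough arithmetic (the per-store HFile count of 3 is an illustrative assumption; real counts depend on flush and compaction settings), since each region keeps one store, and thus its own HFiles, per column family:

```python
def open_hfiles(regions, families, hfiles_per_store=3):
    # Each region has one store per column family, and each store keeps
    # some number of HFiles open (3 is an illustrative figure).
    return regions * families * hfiles_per_store

few = open_hfiles(regions=100, families=2)     # a 2-family schema
many = open_hfiles(regions=100, families=130)  # 130 families: 65x more files
```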


Thanks
Alok


On Tue, Aug 5, 2014 at 4:50 PM, innowireless TaeYun Kim < 
[email protected]> wrote:

> Plus,
> the size of the value of each field can be ~5MB, since up to 250000
> lines of the source data will be merged into one record to match the
> request pattern.
>
>
> -----Original Message-----
> From: innowireless TaeYun Kim [mailto:[email protected]]
> Sent: Tuesday, August 05, 2014 8:11 PM
> To: [email protected]
> Subject: Question on the number of column families
>
> Hi,
>
>
>
> According to http://hbase.apache.org/book/number.of.cfs.html, having
> more than 2~3 column families is strongly discouraged.
>
>
>
> BTW, in my case, records on a table have the following characteristics:
>
>
>
> - The table is read-only. It is bulk-loaded once. When new data is
> ready, a new table is created and the old table is deleted.
>
> - The size of the source data can be hundreds of gigabytes.
>
> - A record has about 130 fields.
>
> - The number of fields in a record is fixed.
>
> - The names of the fields are also fixed. (it's like a table in an RDBMS)
>
> - About 40 fields (it varies) mostly have values, while the other fields
> are mostly empty (null in an RDBMS).
>
> - It is unknown which field will be dense. It depends on the source data.
>
> - Fields are accessed independently. Normally a user requests just one
> field, but a user can request several fields.
>
> - The range of the range query is the same for all fields. (No wider,
> no narrower, regardless of the data density)
>
> To me, it seems that it would be more efficient to have one column
> family for each field, since it would cost less disk I/O, as only the
> needed column data would be read.
>
>
>
> Can the table have 130 column families for this case?
>
> Or must all the columns be in one column family?
>
>
>
> Thanks.
>
>
>
>
>


--
Alok Kumar
Email : [email protected]
http://sharepointorange.blogspot.in/
http://www.linkedin.com/in/alokawi
