I built a distributed FS on top of Cassandra a while ago, and I'm pretty
sure the same approach will work on HBase too.

1. Create two tables: file_info and file_data.
2. Break each file into chunks of size N (I used 256K).
3. In the file_info table, store the metadata, the SHA1 hash of the file
data, and the number of chunks.
4. Store the chunks in file_data with the key "sha1_hash"."chunk_number"
(see the write sketch below).
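
Rough, untested sketch of that write path (HBase Java client, 0.92-era
API; the table layout is the one above, but the column family/qualifier
names are just made up for illustration):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ChunkedWriter {
    static final int CHUNK_SIZE = 256 * 1024;  // 256K chunks

    public static void store(String path) throws Exception {
        // Hash the whole file first so chunks can be keyed by content.
        byte[] content = Files.readAllBytes(Paths.get(path));
        String sha1 = toHex(MessageDigest.getInstance("SHA-1").digest(content));
        int numChunks = (content.length + CHUNK_SIZE - 1) / CHUNK_SIZE;

        Configuration conf = HBaseConfiguration.create();
        HTable fileInfo = new HTable(conf, "file_info");
        HTable fileData = new HTable(conf, "file_data");

        // file_info row: keyed by file name, stores the hash and chunk count.
        Put meta = new Put(Bytes.toBytes(path));
        meta.add(Bytes.toBytes("info"), Bytes.toBytes("sha1"), Bytes.toBytes(sha1));
        meta.add(Bytes.toBytes("info"), Bytes.toBytes("chunks"), Bytes.toBytes(numChunks));
        fileInfo.put(meta);

        // file_data rows: one per chunk, keyed by "sha1_hash"."chunk_number".
        for (int i = 0; i < numChunks; i++) {
            int from = i * CHUNK_SIZE;
            int to = Math.min(content.length, from + CHUNK_SIZE);
            Put chunk = new Put(Bytes.toBytes(sha1 + "." + i));
            chunk.add(Bytes.toBytes("d"), Bytes.toBytes("bytes"),
                      Arrays.copyOfRange(content, from, to));
            fileData.put(chunk);
        }
        fileInfo.close();
        fileData.close();
    }

    static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}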

When reading a file out, you would first query the file_info table and
then do a get for each of the chunks and join them together to rebuild
the file (rough read sketch below).
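
Untested read-path sketch, same made-up schema as the write sketch above;
the per-chunk gets are independent, so they could just as well be issued
in parallel:

import java.io.ByteArrayOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ChunkedReader {
    public static byte[] read(String path) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable fileInfo = new HTable(conf, "file_info");
        HTable fileData = new HTable(conf, "file_data");

        // One get against file_info to learn the hash and chunk count.
        Result meta = fileInfo.get(new Get(Bytes.toBytes(path)));
        String sha1 = Bytes.toString(
                meta.getValue(Bytes.toBytes("info"), Bytes.toBytes("sha1")));
        int numChunks = Bytes.toInt(
                meta.getValue(Bytes.toBytes("info"), Bytes.toBytes("chunks")));

        // One get per chunk against file_data, concatenated in order.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < numChunks; i++) {
            Result chunk = fileData.get(new Get(Bytes.toBytes(sha1 + "." + i)));
            out.write(chunk.getValue(Bytes.toBytes("d"), Bytes.toBytes("bytes")));
        }
        fileInfo.close();
        fileData.close();
        return out.toByteArray();
    }
}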

Pros
* No file size limit
* You can retrieve individual chunks in parallel
* Auto de-duping of files, i.e., files with different names but the same
data will only be stored once.
Cons
* (file_size / chunk_size) + 1 reads for each file that you want to get.
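  (For example, with 256K chunks a 10 MB file costs 40 chunk reads plus
  the one file_info read.)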

Alok
On Mon, Mar 5, 2012 at 8:49 AM, Jacques <[email protected]> wrote:
> The namenode is limited in the number of blocks it can track.  Whether or
> not you change the block size would not have much impact on the problem.
> I think the limit is something like 150 million blocks. (Someone else can
> feel free to correct this.)  (It isn't exactly that simple, because it also
> depends on cluster recovery time, hardware, etc.)  HDFS Federation (in the
> 0.23 branch, I believe) would raise this limit at the cost of added
> complexity.
>
> If you're talking about a smaller scale (which most people are really
> focused on to start), just go with HDFS and don't worry about the
> limitation.
>
> If your product takes off, you'll have the engineering staff to start doing
> a three-way solution, e.g.:
>
> 1) Less than 15 MB: store in HBase
> 2) More than 15 MB: store the new file directly in HDFS with a pointer in
> HBase
> 3) Daily, combine the day's files into a single archive, update the HBase
> pointers, and delete the original individual files for that day
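
For what it's worth, a rough, untested sketch of the size split in 1)/2)
above (the table/column names and the 15 MB cutoff are just illustrative,
and the daily archiving in 3) is not shown):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class HybridStore {
    static final long THRESHOLD = 15L * 1024 * 1024;   // 15 MB cutoff

    public static void store(String name, byte[] data) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable files = new HTable(conf, "files");
        Put row = new Put(Bytes.toBytes(name));

        if (data.length <= THRESHOLD) {
            // 1) Small file: the bytes live in HBase itself.
            row.add(Bytes.toBytes("f"), Bytes.toBytes("data"), data);
        } else {
            // 2) Large file: write to HDFS, keep only the path in HBase.
            FileSystem fs = FileSystem.get(conf);
            Path p = new Path("/blobs/" + name);
            FSDataOutputStream out = fs.create(p);
            out.write(data);
            out.close();
            row.add(Bytes.toBytes("f"), Bytes.toBytes("hdfs_path"),
                    Bytes.toBytes(p.toString()));
        }
        files.put(row);
        files.close();
    }
}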
>
> Good luck,
> Jacques
>
>
> On Sun, Mar 4, 2012 at 9:12 PM, Rohit Kelkar <[email protected]> wrote:
>
>> Jacques, I agree that storing files (let's say greater than 15 MB) would
>> make the namenode run out of space. But what if I make my
>> blocksize=15MB?
>> I, too, am having the same issue that Konrad is mentioning, and I have
>> used exactly approach number 2 that he mentioned.
>>
>> - Rohit Kelkar
>>
>> On Mon, Mar 5, 2012 at 6:59 AM, Jacques <[email protected]> wrote:
>> >>>2) files bigger than 15MB are stored in HDFS and HBase keeps only some
>> > information about where the file is placed
>> >
>> > You're likely to run out of space due to the namenode's file count limit
>> > if you go with this solution as-is.  If you follow this path, you would
>> > probably need to either regularly combine files into a Hadoop Archive
>> > (HAR) file or something similar, or go with another backing filesystem
>> > (e.g. MapR) that can support hundreds of millions of files natively...
>> >
>> > On Sun, Mar 4, 2012 at 4:12 PM, Konrad Tendera <[email protected]>
>> wrote:
>> >
>> >> So, what should I use instead of HBase? I'm wondering about the
>> >> following solution:
>> >> 1) let's say our limit is 15MB - files up to this limit are worth
>> >> keeping in HBase
>> >>
>> >>
>> >> Is it an appropriate way to solve the problem? Or maybe I should use
>> >> http://www.lilyproject.org/ ?
>> >>
>> >>
>>
