I built a distributed FS on top of Cassandra a while ago; I'm pretty sure the same approach will work on HBase too.
1. Create two tables: file_info and file_data.
2. Break each file into chunks of size N (I used 256K).
3. In the file_info table, store the metadata, the SHA1 hash of the data, and the number of chunks.
4. Store the chunks in file_data with key "sha1_hash"."chunk_number".

When reading the file back out, you would first query the file_info table, then do a get for each of the chunks and join them together to rebuild the file (a rough sketch of this write/read path is included below the quoted thread).

Pros
* No file size limit
* You can retrieve individual chunks in parallel
* Auto de-duping of files, i.e. files with different names but the same data will only be stored once

Cons
* (file_size / chunk_size) + 1 reads for each file that you want to get

Alok

On Mon, Mar 5, 2012 at 8:49 AM, Jacques <[email protected]> wrote:

> The namenode is limited by the number of blocks, so whether you change the
> block size or not would not have much impact on the problem. I think the
> limit is something like 150 million blocks (someone else can feel free to
> correct this; it isn't exactly that simple, because it also depends on
> cluster recovery time, hardware, etc.). HDFS Federation (I believe in the
> 0.23 branch) would increase this number at the cost of added complexity.
>
> If you're talking about a smaller scale (which most people are really
> focused on to start), just go with HDFS and don't worry about the
> limitation.
>
> If your product takes off, you'll have the engineering staff to start
> doing a three-way solution, e.g.:
>
> 1) Less than 15 MB: store in HBase
> 2) More than 15 MB: store directly in HDFS with a pointer in HBase
> 3) Daily: combine the day's files into a single archive, update the HBase
> pointers, and delete the original individual files for that day
>
> Good luck,
> Jacques
>
>
> On Sun, Mar 4, 2012 at 9:12 PM, Rohit Kelkar <[email protected]> wrote:
>
>> Jacques, I agree that storing files (let's say greater than 15 MB) would
>> make the namenode run out of space. But what if I make my
>> blocksize = 15 MB?
>> I am having the same issue that Konrad is mentioning, and I have used
>> exactly the approach number 2 that he mentioned.
>>
>> - Rohit Kelkar
>>
>> On Mon, Mar 5, 2012 at 6:59 AM, Jacques <[email protected]> wrote:
>>
>> >> 2) files bigger than 15 MB are stored in HDFS and HBase keeps only
>> >> some information about where the file is placed
>> >
>> > You're likely to run out of space due to the namenode's file count
>> > limit if you went with this solution directly. If you follow this
>> > path, you would probably need to either regularly combine files into
>> > a Hadoop Archive (har) file or something similar, or go with another
>> > backing filesystem (e.g. MapR) that can support hundreds of millions
>> > of files natively...
>> >
>> > On Sun, Mar 4, 2012 at 4:12 PM, Konrad Tendera <[email protected]>
>> > wrote:
>> >
>> >> So, what should I use instead of HBase? I'm wondering about the
>> >> following solution:
>> >> 1) let's say our limit is 15 MB - files up to this limit are worth
>> >> keeping in HBase
>> >> 2) files bigger than 15 MB are stored in HDFS and HBase keeps only
>> >> some information about where the file is placed
>> >>
>> >> Is this an appropriate way to solve the problem? Or maybe I should
>> >> use http://www.lilyproject.org/ ?
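
Below is a minimal sketch of the chunked write/read path from the scheme at the top of this message, written against the HBase Java client (Connection/Table API). Only the two table names, the 256K chunk size, and the "sha1_hash"."chunk_number" row key come from the message; the column family "cf", the qualifiers "sha1"/"chunks"/"size", and keying file_info by file name are assumptions made for illustration, not a reference implementation.

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import java.security.MessageDigest;
import java.util.Arrays;

public class ChunkedFileStore {
    private static final int CHUNK_SIZE = 256 * 1024;      // 256K chunks, as in the scheme above
    private static final byte[] CF = Bytes.toBytes("cf");  // assumed column family

    // Write: store each chunk under "<sha1>.<n>", then the metadata row under the file name.
    public static void write(Connection conn, String fileName, byte[] data) throws Exception {
        String sha1 = hex(MessageDigest.getInstance("SHA-1").digest(data));
        int chunks = (data.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
        try (Table fileData = conn.getTable(TableName.valueOf("file_data"));
             Table fileInfo = conn.getTable(TableName.valueOf("file_info"))) {
            for (int i = 0; i < chunks; i++) {
                byte[] chunk = Arrays.copyOfRange(data, i * CHUNK_SIZE,
                        Math.min(data.length, (i + 1) * CHUNK_SIZE));
                Put p = new Put(Bytes.toBytes(sha1 + "." + i));
                p.addColumn(CF, Bytes.toBytes("data"), chunk);
                fileData.put(p);   // same data -> same row key -> stored once (de-duping)
            }
            Put meta = new Put(Bytes.toBytes(fileName));
            meta.addColumn(CF, Bytes.toBytes("sha1"), Bytes.toBytes(sha1));
            meta.addColumn(CF, Bytes.toBytes("chunks"), Bytes.toBytes(chunks));
            meta.addColumn(CF, Bytes.toBytes("size"), Bytes.toBytes(data.length));
            fileInfo.put(meta);
        }
    }

    // Read: one get on file_info, then one get per chunk, joined back together.
    public static byte[] read(Connection conn, String fileName) throws Exception {
        try (Table fileInfo = conn.getTable(TableName.valueOf("file_info"));
             Table fileData = conn.getTable(TableName.valueOf("file_data"))) {
            Result meta = fileInfo.get(new Get(Bytes.toBytes(fileName)));
            String sha1 = Bytes.toString(meta.getValue(CF, Bytes.toBytes("sha1")));
            int chunks = Bytes.toInt(meta.getValue(CF, Bytes.toBytes("chunks")));
            byte[] out = new byte[Bytes.toInt(meta.getValue(CF, Bytes.toBytes("size")))];
            int off = 0;
            for (int i = 0; i < chunks; i++) {   // these gets could also be issued in parallel
                byte[] chunk = fileData.get(new Get(Bytes.toBytes(sha1 + "." + i)))
                        .getValue(CF, Bytes.toBytes("data"));
                System.arraycopy(chunk, 0, out, off, chunk.length);
                off += chunk.length;
            }
            return out;
        }
    }

    private static String hex(byte[] b) {
        StringBuilder sb = new StringBuilder(2 * b.length);
        for (byte x : b) sb.append(String.format("%02x", x));
        return sb.toString();
    }
}

A Connection would typically come from ConnectionFactory.createConnection(HBaseConfiguration.create()), and both tables need to exist with the assumed "cf" family before use.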

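For the three-way layout Jacques suggests in the quoted thread, a rough sketch of the size-threshold routing (small files inline in HBase, large files in HDFS with only a pointer kept in HBase) might look like the following. The daily archive/compaction step is omitted, and the "files" table, "cf" family, and "/blobs" HDFS path layout are assumptions for illustration only.

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SizeThresholdStore {
    private static final long THRESHOLD = 15L * 1024 * 1024;   // 15 MB, as in the thread
    private static final byte[] CF = Bytes.toBytes("cf");      // assumed column family

    public static void store(Connection hbase, FileSystem hdfs, String name, byte[] data)
            throws Exception {
        try (Table files = hbase.getTable(TableName.valueOf("files"))) {
            Put row = new Put(Bytes.toBytes(name));
            if (data.length < THRESHOLD) {
                // Small file: keep the bytes inline in an HBase cell.
                row.addColumn(CF, Bytes.toBytes("data"), data);
            } else {
                // Large file: write directly to HDFS, keep only the pointer in HBase.
                Path path = new Path("/blobs/" + name);
                try (FSDataOutputStream out = hdfs.create(path)) {
                    out.write(data);
                }
                row.addColumn(CF, Bytes.toBytes("hdfs_path"), Bytes.toBytes(path.toString()));
            }
            files.put(row);
        }
    }
}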