Thanks for the suggestion, but I guess I cannot use MapR for my purpose. I’m
working on a non-commercial hobby project that one day I might make commercial. 
I believe what I want to use/build is simpler than a distributed file system 
because I don’t have to care about:


- Metadata
- Locking
- Hierarchies
- Access rights
- Lookups

So if anyone knows of a free, appropriately licensed alternative, I’d be happy to
use that.

From: Ted Dunning [mailto:[email protected]]
Sent: Wednesday, 13 July 2011 18:52
To: [email protected]
Subject: Re: Shared block storage via ZooKeeper

Simon,

What you are describing is (roughly) a general read-write distributed and 
replicated file system.  This is a hard problem if you want high performance, 
absolute consistency and significant amounts of failure tolerance.  Building 
such a system from scratch is a difficult proposition.

Frankly, it also sounds just like the filesystem component of MapR (conflict 
alert, I work for MapR Technologies).  You may have additional constraints on 
what you are looking for, but to meet the requirements that you have already 
stated, you should take a look at our offering.  I can imagine scenarios where 
this wouldn't be satisfactory, particularly if this is a homework assignment, 
but if you are simply trying to solve a real engineering problem, it should do 
very well.  I don't want to hijack this list with non-Zookeeper discussion so 
feel free to contact me directly for more pointers.

Ohh... I should mention MapR uses ZooKeeper prominently and is glad to do so.  
The strictness and durability of ZK make it ideal as the last-resort arbiter of 
coordination.  In many areas of our system the ZK trade-offs are not 
appropriate, especially where speed is critical, but then that isn't what ZK was 
designed for.  Using ZK appropriately gives extremely good results.
On Wed, Jul 13, 2011 at 5:15 AM, Simon Felix <[email protected]> wrote:
Thanks for the reply. I’ll try to clarify my question a bit. I want to simulate 
a single, fault-tolerant shared block storage device. This means everything 
should be replicated and consistent. All that system manages is (for example) 
one billion blocks, each containing exactly 4096 bytes. I do not need any 
metadata per block or locking. There will be multiple nodes, all reading and 
writing the data concurrently. If two nodes A and B write to the same block 
concurrently, I expect that afterwards every node has either version A or 
version B of the block.
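
To make it concrete, this is roughly the interface I have in mind; the names
and constants below are just for illustration, not an existing API:

import java.io.IOException;

// Hypothetical sketch of the block device I want to simulate.
public interface BlockStore {
    int BLOCK_SIZE = 4096;              // every block is exactly 4096 bytes
    long BLOCK_COUNT = 1_000_000_000L;  // roughly one billion blocks, addressed by index

    // Returns the current, consistent 4096-byte content of the block.
    byte[] read(long blockId) throws IOException;

    // Atomically replaces the block. If two nodes write the same block
    // concurrently, every replica must end up with one writer's version
    // in its entirety, never a mix of the two.
    void write(long blockId, byte[] data) throws IOException;
}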

I’m not sure which of these options is the easiest to implement and which will 
give me the highest performance.

#2: Cassandra: Would you store the data in multiple rows? Columns? How much 
data per column? I should probably ask the Cassandra people about this...
#3: BookKeeper: Every node writes data; I’d use BookKeeper as a write-ahead 
log. Was BookKeeper built for that kind of workload?
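
For #3, this is roughly what I picture, sketched with the BookKeeper client API
as I understand it (so please correct me if I'm misreading it):

import java.nio.ByteBuffer;
import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.BookKeeper.DigestType;
import org.apache.bookkeeper.client.LedgerHandle;

public class BlockWal {
    public static void main(String[] args) throws Exception {
        // Connect via the same ZooKeeper ensemble that the bookies register with.
        BookKeeper bk = new BookKeeper("zk1:2181,zk2:2181,zk3:2181");

        // One ledger per writer; each entry is one write-ahead log record.
        LedgerHandle wal = bk.createLedger(3, 2, DigestType.MAC, "wal-secret".getBytes());

        // Log "block 42 <- data" durably before applying it to the local store.
        long blockId = 42L;
        byte[] block = new byte[4096];
        ByteBuffer record = ByteBuffer.allocate(8 + block.length);
        record.putLong(blockId).put(block);
        wal.addEntry(record.array());   // replicated to a quorum once this returns

        wal.close();
        bk.close();
    }
}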

Has anyone else done something similar? I couldn’t find anything in the 
archives...


Simon


From: Flavio Junqueira [mailto:[email protected]]
Sent: Wednesday, 13 July 2011 14:01
To: [email protected]
Subject: Re: Shared block storage via ZooKeeper

Hi Simon, It is not entirely clear to me what you need ZooKeeper for in this 
case. Are the blocks replicated, and do you need to guarantee that updates are 
consistent across replicas?

Regarding your observations, I'm quite sure people will have an opinion, so here 
are my thoughts, which might not be representative of the whole community:
1- You're right, we do not recommend using ZooKeeper directly as the data 
store. ZooKeeper servers keep their state in memory.
2- Cassandra already provides replication. Are you trying to strengthen the 
guarantees of Cassandra? I don't get it...
3- It sounds right that you could use BK as a journal, but it is not clear which 
element is writing to the journal. Are you assuming a metadata manager such as 
the namenode of HDFS?
4- I'm not sure what this option means. Are you proposing that ZooKeeper manage 
the metadata of the file system? If so, I don't find it entirely unrealistic, 
since metadata updates are supposed to be small and the performance of 
ZooKeeper should be good enough for your case, but it might be awkward to have 
your block storage clients talking directly to ZooKeeper. Changes to metadata 
management would in this case mean rolling out a new version of the client 
application instead of just having the changes implemented on the service side.
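
To make option 4 a bit more concrete, here is a minimal sketch of what block
metadata in ZooKeeper could look like; the paths and record layout are invented
for illustration only:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class BlockMetadata {
    private final ZooKeeper zk;

    public BlockMetadata(ZooKeeper zk) { this.zk = zk; }

    // Record which servers hold the replicas of a block group, e.g. "nodeA,nodeC".
    // Assumes the parent path /blockstore/groups already exists.
    public void publishLocation(long blockGroup, String replicas)
            throws KeeperException, InterruptedException {
        String path = "/blockstore/groups/" + blockGroup;
        Stat stat = zk.exists(path, false);
        if (stat == null) {
            zk.create(path, replicas.getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } else {
            // Conditional update: fails with BADVERSION if another client changed
            // the metadata in the meantime, so the caller can re-read and retry.
            zk.setData(path, replicas.getBytes(), stat.getVersion());
        }
    }

    public String lookupLocation(long blockGroup)
            throws KeeperException, InterruptedException {
        return new String(zk.getData("/blockstore/groups/" + blockGroup, false, null));
    }
}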

-Flavio

On Jul 13, 2011, at 12:02 PM, Simon Felix wrote:

Hello everyone

What is the best way to build a distributed, shared storage system on top of
ZooKeeper? I'm talking about block storage in the terabyte-range (i.e. store
billions of 4k blocks). Consistency and Availability are important, as is
throughput (both read & write). I need at least 50 MB/s from 3 nodes with
two regular SATA drives each for my application.
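
As a rough back-of-envelope (assuming full 3-way replication, which is an
assumption on my part, not a fixed requirement):

public class Throughput {
    public static void main(String[] args) {
        final int BLOCK = 4096;                    // bytes per block
        final long TARGET = 50L * 1024 * 1024;     // 50 MB/s target, treated here as writes
        final int NODES = 3, DRIVES_PER_NODE = 2;
        final int REPLICAS = 3;                    // assumed: every node keeps a full copy

        long blocksPerSec = TARGET / BLOCK;        // = 12,800 logical block writes/s
        long perDrive = blocksPerSec * REPLICAS / (NODES * DRIVES_PER_NODE); // = 6,400/s

        System.out.println(blocksPerSec + " logical block writes/s");
        System.out.println(perDrive + " physical 4K writes/s per drive");
        // A SATA disk manages on the order of 100-200 random 4K writes/s, so the
        // writes have to be laid out sequentially (e.g. via a write-ahead log)
        // to get anywhere near this target.
    }
}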

Some options I came up with:
1. Use ZooKeeper directly as a data store (Not recommended according to the
docs - and it really leads to abysmally bad performance, I tested that)
2. Use Cassandra as data store
3. Use BookKeeper as a write-ahead log and implement my own underlying store
4. Use ZooKeeper to create my own (probably buggy...) data store

What would you recommend? Are there other options?

Cheers,
Simon

flavio junqueira
research scientist

[email protected]
direct +34 93-183-8828

avinguda diagonal 177, 8th floor, barcelona, 08018, es
phone (408) 349 3300    fax (408) 349 3301


