Re: [Zope-dev] NEO High Performance Distributed Fault Tolerant ZODB Storage
On Tuesday, April 6, 2010 at 18:54:24, Jim Fulton wrote:
> Is there a document you can point to that provides a description of
> the approach used?

A document is in the works; in the meantime, I'll try to describe the architecture briefly (it should be readable to anyone who knows a bit about TPC) and give some typical use cases.

NEO is distributed, with different types of nodes involved. The 3 main types are:

- Client nodes: what gets embedded in any application using NEO, so typically Zope + ZODB.

- Master nodes: one of them gets elected to become the primary master node, the others becoming secondary master nodes. The primary handles all centralised tasks, such as:
  - oid/tid generation
  - cluster consistency checking (at a high level: are there enough nodes to cover all data)
  - broadcasting cluster changes to all nodes (new storage node, storage node getting disconnected, etc.)
  When a primary master falls, the secondaries take over by restarting an election. There is no persistent data on a master node, aside from its configuration (the name of the NEO cluster it belongs to, the addresses of the other master nodes, and its own listening address).

- Storage nodes: they contain the object data, which is accessed through them and ultimately stored in a local storage backend (MySQL currently, but all NEO really needs is something that can guarantee some atomicity and limited index lookups).

Other nodes (existing or planned) are:

- Admin node: cluster monitoring (at a human level: book-keeping of cluster health).

- Control command: administrator CLI tool to trigger cluster actions. Technical note: for now, it uses the admin node as a stepping stone in all cases, though this could be avoided in some of them.

- Backup node: very similar technically to a storage node, but only dumping data out of storage nodes; probably a CLI tool rather than a daemon (not implemented yet).

Remarks:
- There is currently no security in NEO (cryptography/authentication).
- The expected cluster size is around 1000 (client + storage) nodes, probably more (it depends on the node type ratio and the usage pattern).

Some sample use cases:

Client wants data
The client connects to the primary master node (this happens once upon startup, and again if the primary master dies). During the initial handshake, the client receives the partition table, which tells it which storage nodes are part of the cluster and which one contains which object. Then the client connects to one of the storage nodes expected to contain the desired data, and asks to be served.

Client wants to commit
The client is already known to the cluster and enters TPC in ZODB: for each object, it sends the object data to all candidate storage nodes (found by looking them up in its partition table). Those storage nodes handle locking locally at the object level, a write lock being taken upon such a data transmission. Once the store phase is over, the vote phase starts; its result depends on whether any storage node refuses the data (the base version of an object not being the latest one at commit time, etc.), leading to conflict resolution and ultimately to a ConflictError exception. Then the client notifies the primary master of the ZODB decision (finish or abort), which in turn asks all involved storage nodes to make the changes persistent or to discard them (releasing the object-level write locks). If the client chose to finish, the involved storage nodes take a read lock on the objects (a barrier kind of use) and answer the master that they are done acquiring this lock. In turn, the master asks them to release all locks (the barrier effect is achieved, and the write locks get released to allow further transactions) and sends invalidation requests to all clients.
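Both use cases above hinge on the client-side partition table lookup. Here is a minimal sketch of what such a lookup could look like, treating oids as integers and assuming a simple modulo mapping from oid to partition; all names and the mapping itself are illustrative assumptions, not NEO's actual classes or wire protocol:

    # Hypothetical sketch of a client-side partition table lookup.
    # The modulo mapping and every name here are assumptions for
    # illustration; NEO's real implementation differs.

    UP = "UP"
    DOWN = "DOWN"

    class PartitionTable:
        def __init__(self, num_partitions, cells):
            # cells: dict mapping partition index -> list of
            # (storage_node_address, state) tuples
            self.num_partitions = num_partitions
            self.cells = cells

        def partition_for(self, oid):
            # Assume oids map to partitions by modulo; the actual
            # scheme is an implementation detail of NEO.
            return oid % self.num_partitions

        def storage_nodes_for(self, oid):
            # Return the addresses of storage nodes currently able to
            # serve this oid (skipping temporarily-down replicas).
            partition = self.partition_for(oid)
            return [addr for addr, state in self.cells[partition]
                    if state == UP]

    # Usage: the client picks any available replica to load from, and
    # sends stores to every replica of the object's partition.
    pt = PartitionTable(4, {
        0: [(("s1", 10000), UP), (("s2", 10000), UP)],
        1: [(("s2", 10000), UP)],
        2: [(("s3", 10000), DOWN), (("s1", 10000), UP)],
        3: [(("s3", 10000), UP)],
    })
    print(pt.storage_nodes_for(6))  # partition 2 -> [("s1", 10000)]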
New storage enters cluster
When a storage node enters an existing (and working) cluster, it is assigned some partitions in order to achieve load balancing. It will, by itself, connect to the storage nodes holding data for those partitions and start replicating them. Those partitions might or might not ultimately be dropped from their original container nodes, depending on the data replication constraints (how many copies of those partitions exist in the whole cluster).

Storage dies
When a storage node dies (gets disconnected, or a request times out), it is marked as temporarily down in the primary master's partition table (a change which is then broadcast to all other nodes). This means that the storage node might still have its data, but is currently unavailable. If the lost storage node contained the last copy of any partition, the cluster has lost a part of its content: the master asks all nodes to interrupt service (they stay running, but refuse to serve further requests until the cluster gets back to a running state). Note that currently, losing a storage node doesn't automatically trigger the action symmetric to adding one: its partitions are not reassigned to existing nodes. That can only be triggered by manual action, via the admin node and its command-line tool. This was chosen to avoid seeing the cluster waste time balancing data over and over.
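The failure handling just described boils down to a small piece of bookkeeping on the primary master: mark the node as down, rebroadcast the partition table, and suspend service if any partition has lost its last available replica. The sketch below illustrates that logic under assumed names and data structures; it is not NEO's code, and the broadcast/interrupt hooks are hypothetical:

    # Hypothetical sketch of the primary master's reaction to a lost
    # storage node, as described above. All names are illustrative.

    def on_storage_node_lost(cells, lost_node, broadcast, interrupt_service):
        """cells: dict partition -> dict storage_node -> state."""
        # 1. Mark the node temporarily down: it may still hold its
        #    data, but it is currently unreachable.
        for replicas in cells.values():
            if lost_node in replicas:
                replicas[lost_node] = "TEMPORARILY_DOWN"

        # 2. Broadcast the partition table change to all other nodes.
        broadcast(cells)

        # 3. If any partition lost its last available copy, part of the
        #    database is unreachable: stop serving requests until the
        #    cluster is brought back to a running state.
        for partition, replicas in cells.items():
            if not any(state == "UP" for state in replicas.values()):
                interrupt_service(partition)
                # Re-assigning lost partitions to other nodes is
                # deliberately left to a manual admin action, to avoid
                # rebalancing data over and over on transient failures.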
Re: [Zope-dev] NEO High Performance Distributed Fault Tolerant ZODB Storage
On Wed, Apr 7, 2010 at 8:50 AM, Vincent Pelletier <vinc...@nexedi.com> wrote:
> On Tuesday, April 6, 2010 at 18:54:24, Jim Fulton wrote:
>> Is there a document you can point to that provides a description of
>> the approach used?
>
> A document is in the works; in the meantime, I'll try to describe the
> architecture briefly (it should be readable to anyone who knows a bit
> about TPC) and give some typical use cases.
> ...

Thanks for sharing some details. The hints of the architecture are intriguing. I look forward to studying this in more detail when the document is available.

> ...
> You mean, a presentation of NEO (maybe what I wrote above would fit)?

Yes. More details would be good. Of course, I could look at the source, but given the license, I'm not highly motivated.

Jim

--
Jim Fulton
Re: [Zope-dev] NEO High Performance Distributed Fault Tolerant ZODB Storage
On Wed, Mar 31, 2010 at 4:32 AM, Vincent Pelletier <vinc...@nexedi.com> wrote:
> Hi,
>
> I would like to present you the NEOPPOD project, aiming at improving
> ZODB Storage scalability. The implementation is in rather good shape,
> although it fails a few ZODB tests at the moment (they are currently
> being worked on). Scalability is achieved by distributing data over
> multiple servers (replication and load balancing), with the ability
> to extend/reduce the cluster on-line.

Is there a document you can point to that provides a description of the approach used?

> Its code is available under the GPL,

That's unfortunate. Why not a less restrictive license?

> [1] http://www.neoppod.org/
> ...
> [3] http://www.myerp5.com/kb/enterprise-High.Performance.Zope/view

These seem to be very high level. You provide a link to the source, which is rather low level. Anything in between?

Jim

--
Jim Fulton
Re: [Zope-dev] NEO High Performance Distributed Fault Tolerant ZODB Storage
On Wednesday, March 31, 2010 at 18:32:31, you wrote:
> A few questions that you may want to add in a FAQ.

We started that page and will publish it very soon, based on most of the points you raised. Other pages are also being worked on, such as an overview of a simple NEO cluster.

> - Why not include ZODB 3.10/Python 2.6 as a goal of the project?
>   - I understand *today* the technologies use python 2.4 but ZODB
>     3.10/Plone 4/Zope 2.12 use python 2.6

We do indeed aim at supporting more recent versions of Python and Zope. Actually, your remark made us realise that our functional tests are currently (accidentally) running in a mixed 2.4/2.5 Python environment: the test process is started explicitly with 2.4, while the forked processes (master, storage and admin nodes) run on the default python, which is 2.5 (as of Debian stable; a sketch of this pitfall follows below). The standard Zope version at Nexedi is 2.8, which explains why we want to support it. We will switch to 2.12, as we have had ERP5 unit tests running on 2.12 for some weeks[1] now. NEO will move to 2.12 at the same time or earlier.

> - Maybe explain the goal of the project clearer: NEO provides
>   distributed, redundant and transactional storage designed for
>   petabytes of persistent (python?) objects.

Thanks, updated.

> - A buildout for NEO would lower the bar for evaluation

This is on our roadmap (...to be published along with the FAQ), but priority currently goes to 2 developments which might/will break compatibility: pack support (it required an undo rework, which was recently integrated; pack itself needs more unit testing prior to integration) and multi-export support (aka ZODB mountpoints, also in need of more testing before integration).

[1] http://mail.nexedi.com/pipermail/erp5-report/ (_z212 in subject)
--
Vincent Pelletier
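To make the mixed-interpreter pitfall concrete: a test runner launched with one Python can silently fork children under a different one if it spawns a bare `python` command. The sketch below is a minimal hypothetical example, not NEO's actual test harness; using `sys.executable` pins children to the parent's interpreter:

    # Hypothetical illustration of the mixed 2.4/2.5 pitfall described
    # above; not NEO's test harness.
    import subprocess
    import sys

    child_code = "import sys; print(sys.version_info[:2])"

    # Pitfall: a bare "python" resolves via $PATH to the system default
    # interpreter (assuming one exists there), which may differ from the
    # interpreter running this script, e.g. tests started with
    # python2.4 forking python2.5 children.
    subprocess.call(["python", "-c", child_code])

    # Fix: sys.executable is the interpreter running this process, so
    # the children are guaranteed to use the same version.
    subprocess.call([sys.executable, "-c", child_code])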
Re: [Zope-dev] NEO High Performance Distributed Fault Tolerant ZODB Storage
On Wed, Mar 31, 2010 at 3:32 AM, Vincent Pelletier <vinc...@nexedi.com> wrote:
> Hi,
>
> I would like to present you the NEOPPOD project, aiming at improving
> ZODB Storage scalability. The implementation is in rather good shape,
> although it fails a few ZODB tests at the moment (they are currently
> being worked on). Scalability is achieved by distributing data over
> multiple servers (replication and load balancing), with the ability
> to extend/reduce the cluster on-line.

Congrats!

A few questions that you may want to add in a FAQ:
- NEO replaces FileStorage? Maybe typo?
- What is the gain of using NEO over relstorage + mysql replication?
- Why not include ZODB 3.10/Python 2.6 as a goal of the project?
  - I understand *today* the technologies use python 2.4, but ZODB 3.10/Plone 4/Zope 2.12 use python 2.6
- NEO is a different protocol than ZEO?
- What is the Blob story with NEO?
- Any issues with 32bit vs 64bit?
- Backup/restore strategy of NEO?

Other notes:
- Maybe explain the goal of the project more clearly: NEO provides distributed, redundant and transactional storage designed for petabytes of persistent (python?) objects.
- A buildout for NEO would lower the bar for evaluation.
- How do you plan on storing petabytes in a single MySQL server, since that is the data structure backend for NEO?

Looking forward to reading the petri net article - please send an update when it comes out.

cheers
alan