Dear Tsz-Wo,

I understand these concerns.
As for the write amplification: this is probably similar to the amplification in systems such as HBase and Cassandra, where data is first written to a WAL before being written to SSTables. But of course, a bulk-loading mechanism that avoids this, as both HBase and Cassandra have, would be preferable. Ideally there would be a mechanism to coordinate this through Raft, perhaps something more fine-grained than a full snapshot installation but without the write amplification.

> For (1), Ratis has a feature to separate metadata and state machine data so
> only metadata is written to the Raft log and the state machine will manage
> the state machine data.

I will try to figure out how this works. I’ve seen such separation in the Ozone codebase, but haven’t fully wrapped my head around it yet.

> For (2), Ratis Streaming is designed to solve that problem. However, as
> mentioned previously, we do have a missing feature to replicate stream data
> outside the original stream.
>
> You may consider using Streaming only when the data size is in terabytes.
> When there is a failure or a Raft group membership change, use snapshot as a
> workaround. The terabytes-sized data still has to be broken down into
> tens/hundreds of gigabytes chunks.

This was my thinking as well, hence my initial inquiry.

Thanks a lot for thinking along!

Best regards,
Frens Jan


> On 31 Jan 2025, at 18:28, Tsz-Wo Nicholas Sze <[email protected]> wrote:
>
> Dear Frens Jan,
>
> > ... in principle terabytes for bulk loading RDF statements. ...
>
> This case definitely needs more attention since the original Raft design is
> not suitable for data-intensive applications. The problems include
> 1) Write amplification -- data will be written to both the Raft log and also
> the state machine storage.
> 2) A large number of transactions -- the terabytes have to be broken down
> into many small megabytes-sized chunks.
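For a sense of scale on problem (2): with the 4-MB chunk size Ozone uses (mentioned further down this thread), a terabyte-scale bulk load implies hundreds of thousands of Raft transactions. A minimal back-of-envelope sketch, plain Java with no Ratis dependency and purely illustrative sizes:

```java
public class ChunkMath {
    // Ceiling division: number of fixed-size chunks needed to cover totalBytes.
    static long chunkCount(long totalBytes, long chunkBytes) {
        return (totalBytes + chunkBytes - 1) / chunkBytes;
    }

    public static void main(String[] args) {
        long oneTiB = 1L << 40;   // terabyte-scale bulk load
        long chunk = 4L << 20;    // 4-MB chunk size, as Ozone uses
        // 2^40 / 2^22 = 2^18 = 262144 Raft transactions per TiB
        System.out.println(chunkCount(oneTiB, chunk));
    }
}
```

At roughly 260k transactions per terabyte, either batching chunks per log entry or keeping the bulk data out of the log entirely, as in (1), makes a real difference.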
>
> For (1), Ratis has a feature to separate metadata and state machine data so
> only metadata is written to the Raft log and the state machine will manage
> the state machine data.
>
> For (2), Ratis Streaming is designed to solve that problem. However, as
> mentioned previously, we do have a missing feature to replicate stream data
> outside the original stream.
>
> You may consider using Streaming only when the data size is in terabytes.
> When there is a failure or a Raft group membership change, use snapshot as a
> workaround. The terabytes-sized data still has to be broken down into
> tens/hundreds of gigabytes chunks.
>
> Hope it helps.
>
> Tsz-Wo
>
>
> On Thu, Jan 30, 2025 at 11:51 PM Frens Jan Rumph <[email protected]
> <mailto:[email protected]>> wrote:
>> Dear Tsz-Wo,
>>
>> The “objects” could range from mere bytes for management commands to, in
>> principle, terabytes for bulk loading RDF statements. So some type of
>> chunking would be necessary. RDF statements themselves aren’t typically
>> that large, so that should be doable. Or I could possibly treat them as
>> opaque binary uploads and only parse them on commit.
>>
>> Frens Jan
>>
>>
>>> On 30 Jan 2025, at 22:13, Tsz-Wo Nicholas Sze <[email protected]
>>> <mailto:[email protected]>> wrote:
>>>
>>> Hi Frens Jan,
>>>
>>> I agree that it is better to use the regular Raft replication. How big is
>>> the object size you are expecting? Ozone uses a 4-MB chunk size for
>>> creating large objects in megabytes or gigabytes.
>>>
>>> Tsz-Wo
>>>
>>>
>>> On Thu, Jan 30, 2025 at 10:53 AM Frens Jan Rumph <[email protected]
>>> <mailto:[email protected]>> wrote:
>>>> Dear Tsz-Wo,
>>>>
>>>> Thank you for getting back to me on this!
>>>>
>>>> I think that I understand. For a system like Ozone this makes sense, as
>>>> the data written can be retrieved from another node by reading it from
>>>> the object/file stored. I’m not sure how that would work for data that’s
>>>> overwritten, but that’s for another day.
>>>> For a database, something like this is probably not so easy, as the
>>>> contents of the stream aren’t reflected as-is in the state machine and
>>>> there is no (easy) way to identify the side effects the mutation has
>>>> caused.
>>>>
>>>> I’ll stick to using regular replication for now and devise some chunking
>>>> strategy. Perhaps later, for something like bulk loading, I’ll take
>>>> another look at streaming.
>>>>
>>>> Best regards,
>>>> Frens Jan
>>>>
>>>>
>>>>
>>>>> On 30 Jan 2025, at 18:38, Tsz-Wo Nicholas Sze <[email protected]
>>>>> <mailto:[email protected]>> wrote:
>>>>>
>>>>> Hi Frens,
>>>>>
>>>>> Thanks a lot for trying Ratis and the Streaming feature!
>>>>>
>>>>> > ... But what about when a node is added to a group? ...
>>>>>
>>>>> An existing stream does not support adding a new node dynamically. The
>>>>> streams created afterward will be able to write to the new node.
>>>>>
>>>>> > ... I reckon that replication/recovery of such stream needs to be done
>>>>> > ‘outside’ of Ratis; is that correct? ...
>>>>>
>>>>> You are right that Ratis does not replicate stream data outside the
>>>>> original stream. It currently assumes that all data is already
>>>>> replicated before the "link" transaction.
>>>>>
>>>>> In both cases, we need a missing feature for replicating stream data
>>>>> outside the original stream. It should be done when the link transaction
>>>>> happens -- if the data is missing, read it from another node. Without
>>>>> such a feature, a workaround is to use a snapshot -- when the stream
>>>>> data is missing, trigger a snapshot.
>>>>>
>>>>> Please feel free to let us know if you have more questions.
>>>>>
>>>>> Tsz-Wo
>>>>>
>>>>>
>>>>> On Tue, Jan 28, 2025 at 1:00 PM Frens Jan Rumph <[email protected]
>>>>> <mailto:[email protected]>> wrote:
>>>>>> Dear Ratis devs and users,
>>>>>>
>>>>>> I’m investigating use of Ratis for HA of RDF4J. In particular, I’m
>>>>>> wondering about implementation patterns/advice on the stream feature.
>>>>>> I’ve read e.g.
>>>>>> https://blog.cloudera.com/ozone-write-pipeline-v2-with-ratis-streaming/
>>>>>> and I’ve got a small prototype working. But as the javadoc of
>>>>>> org.apache.ratis.statemachine.StateMachine.DataApi#link indicates, the
>>>>>> stream _may_ not be available.
>>>>>>
>>>>>> I understand there might be error cases that need to be handled. But
>>>>>> what about when a node is added to a group? In my investigation so far
>>>>>> it seems that in that case, too, the stream is unavailable. I reckon
>>>>>> that replication/recovery of such a stream needs to be done ‘outside’
>>>>>> of Ratis; is that correct? I didn’t see such facilities in the file
>>>>>> store example:
>>>>>> https://github.com/apache/ratis/blob/master/ratis-examples/src/main/java/org/apache/ratis/examples/filestore/FileStoreStateMachine.java.
>>>>>> I’ve also tried to figure out how Ozone implements this, but that
>>>>>> codebase is a bit too big to easily wrap my head around how this would
>>>>>> work.
>>>>>>
>>>>>> I would appreciate any pointers with regard to this matter.
>>>>>>
>>>>>> Thanks!
>>>>>> Frens Jan
>>>>>>
>>>>>>
>>>>>> Award-winning OSINT partner for Law Enforcement and Defence.
>>>>>>
>>>>>> Frens Jan Rumph
>>>>>> Data platform engineering lead
>>>>>>
>>>>>> phone: +31 50 21 11 622
>>>>>> site: web-iq.com <https://web-iq.com/>
>>>>>> pgp: CEE2 A4F1 972E 78C0 F816 86BB D096 18E2 3AC0 16E0
>>>>>>
>>>>>> The content of this email is confidential and intended for the
>>>>>> recipient(s) specified in this message only. It is strictly forbidden
>>>>>> to share any part of this message with any third party, without a
>>>>>> written consent of the sender. If you received this message by mistake,
>>>>>> please reply to this message and follow with its deletion, so that we
>>>>>> can ensure such a mistake does not occur in the future.
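The link-time recovery discussed in this thread (if the stream data is missing when the "link" transaction is applied, read it from another node) could be sketched as below. This is plain Java with hypothetical names (`resolveAtLink`, `fetchFromPeer`); it is not the Ratis `StateMachine.DataApi`, which defines only the link callback itself, and a real implementation would wire this logic into that callback.

```java
import java.util.Optional;
import java.util.function.Function;

public class LinkRecovery {
    // Hypothetical sketch of link-time recovery: prefer locally received
    // stream data, and fall back to fetching the same bytes from a peer
    // (e.g. when this node joined the group after the stream was written).
    static byte[] resolveAtLink(String streamId,
                                Optional<byte[]> localData,
                                Function<String, byte[]> fetchFromPeer) {
        return localData.orElseGet(() -> fetchFromPeer.apply(streamId));
    }

    public static void main(String[] args) {
        // Node that received the stream: local data wins, no peer call needed.
        byte[] a = resolveAtLink("s1", Optional.of(new byte[]{1, 2, 3}), id -> new byte[0]);
        // Newly added node: stream data is missing locally, so fetch from a peer.
        byte[] b = resolveAtLink("s1", Optional.empty(), id -> new byte[]{1, 2, 3});
        System.out.println(a.length + " " + b.length);
    }
}
```

The snapshot workaround mentioned above is the degenerate case of the same idea: instead of fetching just the missing stream, the joining node pulls the leader's entire state.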