Dear Tsz-Wo,

I understand these concerns.

As for the write amplification: this is probably similar to the amplification 
in systems such as HBase and Cassandra, where data is first written to a WAL 
before being flushed to SSTables. But of course, a bulk-loading mechanism that 
avoids this, as both HBase and Cassandra offer, would be preferable. Ideally 
there would be a mechanism to coordinate this through Raft: perhaps something 
more fine-grained than a full snapshot installation, but without the 
write amplification.
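
To put a rough number on the cost (my own back-of-the-envelope arithmetic, 
not figures from Ratis): each replica writes the payload twice, once to the 
Raft log and once to state machine storage, so with 3-way replication one 
tebibyte of user data turns into roughly six tebibytes of disk writes:

```java
public class WriteAmplification {
    /** Total bytes physically written across the cluster when every
     *  replica persists the payload to both the Raft log and the
     *  state machine storage (the double-write discussed above). */
    static long clusterBytesWritten(long payloadBytes, int replicas) {
        long perReplica = 2L * payloadBytes; // once for the log, once for the state machine
        return perReplica * replicas;
    }

    public static void main(String[] args) {
        long oneTiB = 1L << 40;
        // With 3-way replication, 1 TiB of user data causes 6 TiB of disk writes.
        System.out.println(clusterBytesWritten(oneTiB, 3) / oneTiB); // prints 6
    }
}
```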


> For (1), Ratis has a feature to separate metadata and state machine data so 
> only metadata is written to Raft log and the state machine will manage the 
> state machine data.

I will try to figure out how this works. I’ve seen such a separation in the 
Ozone codebase, but haven’t fully wrapped my head around it yet.
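
In case it helps a future reader: here is how I currently picture that 
separation. This is a conceptual sketch with made-up names, not the actual 
Ratis API — the bulk payload goes straight into state machine storage, and 
only a small metadata record (id, length, checksum) would travel through the 
Raft log:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Conceptual sketch (hypothetical names, not the Ratis API) of the
 *  metadata / state-machine-data split discussed above. */
public class MetadataSplitSketch {
    /** What actually gets replicated via the Raft log: a pointer, not the data. */
    record Metadata(String objectId, long length, long checksum) {}

    private final Map<String, byte[]> stateMachineStorage = new HashMap<>();

    /** "Write path": persist the bulk payload locally, return the small
     *  record that would be appended to the Raft log. */
    Metadata write(String objectId, byte[] payload) {
        stateMachineStorage.put(objectId, payload); // bulk bytes stay outside the log
        java.util.zip.CRC32 crc = new java.util.zip.CRC32();
        crc.update(payload);
        return new Metadata(objectId, payload.length, crc.getValue());
    }

    public static void main(String[] args) {
        MetadataSplitSketch sm = new MetadataSplitSketch();
        byte[] payload = "lots of RDF statements...".getBytes(StandardCharsets.UTF_8);
        Metadata md = sm.write("bulk-load-001", payload);
        // Only a few dozen bytes of metadata enter the Raft log,
        // however large the payload is.
        System.out.println(md.length());
    }
}
```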



> For (2), Ratis Streaming is designed to solve that problem.  However, as 
> mentioned previously, we do have a missing feature to replicate stream data 
> outside the original stream.
> 
> You may consider using Streaming only when the data size is in terabytes.  
> When there is a failure or a Raft group membership change, use snapshot as a 
> workaround.  The terabyte-sized data still has to be broken down into 
> chunks of tens to hundreds of gigabytes.

This was my thinking as well. Hence my initial inquiry.
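
For the chunking itself, a minimal sketch of what I have in mind (the 4 MB 
figure is what Ozone reportedly uses; the helper is purely illustrative, not 
part of Ratis):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Minimal chunking sketch: break a large payload into fixed-size
 *  chunks so each chunk fits in a single Raft transaction. */
public class Chunker {
    static final int CHUNK_SIZE = 4 * 1024 * 1024; // 4 MB, as Ozone uses

    static List<byte[]> chunk(byte[] data, int chunkSize) {
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < data.length; off += chunkSize) {
            int end = Math.min(off + chunkSize, data.length);
            chunks.add(Arrays.copyOfRange(data, off, end));
        }
        return chunks;
    }

    public static void main(String[] args) {
        byte[] data = new byte[10 * 1024 * 1024 + 1]; // 10 MB + 1 byte
        List<byte[]> chunks = chunk(data, CHUNK_SIZE);
        System.out.println(chunks.size()); // 3 chunks: 4 MB, 4 MB, remainder
    }
}
```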


Thanks a lot for thinking this through with me!

Best regards,
Frens Jan




> On 31 Jan 2025, at 18:28, Tsz-Wo Nicholas Sze <[email protected]> wrote:
> 
> Dear Frens Jan,
> 
> > ... in principle terabytes for bulk loading RDF statements.  ...
> 
> This case definitely needs more attention since the original Raft design is 
> not suitable for data-intensive applications.  The problems include
> 1) Write amplification -- data will be written to both the Raft log and also 
> the state machine storage.
> 2) A large number of transactions -- the terabytes have to be broken down 
> into many small megabyte-sized chunks.
> 
> For (1), Ratis has a feature to separate metadata and state machine data so 
> only metadata is written to Raft log and the state machine will manage the 
> state machine data.
> 
> For (2), Ratis Streaming is designed to solve that problem.  However, as 
> mentioned previously, we do have a missing feature to replicate stream data 
> outside the original stream.
> 
> You may consider using Streaming only when the data size is in terabytes.  
> When there is a failure or a Raft group membership change, use snapshot as a 
> workaround.  The terabyte-sized data still has to be broken down into 
> chunks of tens to hundreds of gigabytes.
> 
> Hope it helps.
> 
> Tsz-Wo
> 
> 
> 
> On Thu, Jan 30, 2025 at 11:51 PM Frens Jan Rumph <[email protected] 
> <mailto:[email protected]>> wrote:
>> Dear Tsz-Wo,
>> 
>> The “objects” could range from mere bytes for management commands to, in 
>> principle, terabytes for bulk loading RDF statements, so some type of 
>> chunking would be necessary. RDF statements themselves aren’t typically 
>> that large, so that should be doable. Or I could possibly treat them as 
>> opaque binary uploads and only parse them on commit.
>> 
>> Frens Jan
>> 
>> 
>>> On 30 Jan 2025, at 22:13, Tsz-Wo Nicholas Sze <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> 
>>> Hi Frens Jan,
>>> 
>>> I agree that it is better to use the regular RAFT replication.  How big is 
>>> the object size you are expecting?  Ozone uses a 4-MB chunk size for 
>>> creating large objects in megabytes or gigabytes.
>>> 
>>> Tsz-Wo
>>> 
>>> 
>>> On Thu, Jan 30, 2025 at 10:53 AM Frens Jan Rumph <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>>> Dear Tsz-Wo,
>>>> 
>>>> Thank you for getting back to me on this!
>>>> 
>>>> I think that I understand. For a system like Ozone this makes sense, as 
>>>> the data written can be retrieved from another node by reading the stored 
>>>> object/file. I’m not sure how that would work for data that’s 
>>>> overwritten, but that’s for another day. For a database, something like 
>>>> this is probably not so easy, as the contents of the stream aren’t 
>>>> reflected as-is in the state machine and there is no (easy) way to 
>>>> identify the side effects the mutation has caused.
>>>> 
>>>> I’ll stick to using regular replication for now and devise some chunking 
>>>> strategy. Perhaps later for something like bulk-loading I’ll take another 
>>>> look at streaming.
>>>> 
>>>> Best regards,
>>>> Frens Jan
>>>> 
>>>> 
>>>> 
>>>>> On 30 Jan 2025, at 18:38, Tsz-Wo Nicholas Sze <[email protected] 
>>>>> <mailto:[email protected]>> wrote:
>>>>> 
>>>>> Hi Frens,
>>>>> 
>>>>> Thanks a lot for trying Ratis and the Streaming feature!
>>>>> 
>>>>> > ... But what about when a node is added to a group? ...
>>>>> 
>>>>> An existing stream does not support adding a new node dynamically.  The 
>>>>> streams created afterward will be able to write to the new node.
>>>>> 
>>>>> > ...  I reckon that replication/recovery of such stream needs to be done 
>>>>> > ‘outside’ of Ratis; is that correct? ...
>>>>> 
>>>>> You are right that Ratis does not replicate stream data outside the 
>>>>> original stream.  It currently assumes that all data is already 
>>>>> replicated before the "link" transaction.
>>>>> 
>>>>> In both cases, we need a missing feature for replicating stream data 
>>>>> outside the original stream.  It should be done when the link transaction 
>>>>> happens -- if the data is missing, read it from another node.  Without 
>>>>> such a feature, a workaround is to use snapshot -- when the stream data 
>>>>> is missing, trigger a snapshot.
>>>>> 
>>>>> Please feel free to let us know if you have more questions.
>>>>> 
>>>>> Tsz-Wo
>>>>> 
>>>>> 
>>>>> On Tue, Jan 28, 2025 at 1:00 PM Frens Jan Rumph <[email protected] 
>>>>> <mailto:[email protected]>> wrote:
>>>>>> Dear Ratis devs and users,
>>>>>> 
>>>>>> I’m investigating use of Ratis for HA of RDF4J. In particular I’m 
>>>>>> wondering about implementation patterns/advice on the stream feature. 
>>>>>> I’ve read e.g., 
>>>>>> https://blog.cloudera.com/ozone-write-pipeline-v2-with-ratis-streaming/ 
>>>>>> and I’ve got a small prototype working. But as the javadoc of 
>>>>>> org.apache.ratis.statemachine.StateMachine.DataApi#link indicates, the 
>>>>>> stream _may_ not be available.
>>>>>> 
>>>>>> I understand there might be error cases that need to be handled. But 
>>>>>> what about when a node is added to a group? In my investigation so far 
>>>>>> it seems that also in that case the stream is unavailable. I reckon that 
>>>>>> replication/recovery of such stream needs to be done ‘outside’ of Ratis; 
>>>>>> is that correct? I didn’t see such facilities in the file store example: 
>>>>>> https://github.com/apache/ratis/blob/master/ratis-examples/src/main/java/org/apache/ratis/examples/filestore/FileStoreStateMachine.java.
>>>>>>  I’ve also tried to figure out how Ozone implements this, but that 
>>>>>> codebase is a bit too big to easily wrap my head around how this would 
>>>>>> work.
>>>>>> 
>>>>>> I would appreciate any pointers with regards to this matter.
>>>>>> 
>>>>>> Thanks!
>>>>>> Frens Jan
>>>>>> 
>>>>>> 
>>>>>> Award-winning OSINT partner for Law Enforcement and Defence.
>>>>>> 
>>>>>> Frens Jan Rumph
>>>>>> Data platform engineering lead
>>>>>> 
>>>>>> phone: +31 50 21 11 622
>>>>>> web: web-iq.com <https://web-iq.com/>
>>>>>> pgp: CEE2 A4F1 972E 78C0 F816 86BB D096 18E2 3AC0 16E0
>>>>>> 
>>>>>> 
>>>> 
>> 
