Re: Implement ThreadIdGuessingAlgorithm for the distributed module

[email protected] Wed, 21 Jul 2021 04:08:28 -0700


On 21/07/2021 18:06, Quan tran hong wrote:
> Hi Benoit,
> Thanks for your great reviews.
>
>> Or we can just do several selects, one for each mimeMessageIds.I
> personaly likely prefer this option.
>
> This will make the upper logic layer more complicated. But I agree with
> your suggestion cause I would prefer balance and performance also.
>
>> Remember to do the delete of ThreadTable before threadtable_lookup,
> otherwise if you delete the pointer before the data you might end up in
> a case where the actual data is never deleted.
>
> Sure. Thanks for your reminder.
>
>> When/How do we call it?
> StoringThreadIdGuessingAlgorithm maybe?
Not really, this would only be called when appending messages.


Likely this would need to be hooked in DeleteMessageListener.
>
> Cheers,
> Quan
>
> Vào Th 4, 21 thg 7, 2021 vào lúc 17:03 [email protected] <
> [email protected]> đã viết:
>
>> Hello quan,
>>
>> Nice proposal! I think the Cassandra data model you propose is evolved
>> enough so that we start implementing it.
>>
>> Some comments inlined...
>>
>> On 21/07/2021 15:55, Quan tran hong wrote:
>>> Hi Benoit,
>>> I did have another try on this. Please have a look.
>>> CREATE threadtable
>>>
>>> CREATE TABLE ThreadTable (messageId timeuuid, threadId timeuuid, username
>>> text, mimeMessageId text, baseSubject text, PRIMARY KEY((username,
>>> mimemessageid), messageid));
>>>
>>> => Partition key: (username, mimemessageid), clustering key: messageid.
>>>
>>> [...]
>>>
>>> SELECT basesubject, threadId FROM threadtable WHERE username = 'quan'
>> AND
>>> mimeMessageId IN ('MimeMessageID2', 'MimeMessageID3');
>> This looks way better to me.
>>
>> IN usage is PRIMARY KEY is discouraged as it leads to coordination
>> across partitions.
>>
>> Read more for instance in
>>
>> https://stackoverflow.com/questions/55604857/cassandra-query-performance-using-in-clause-for-one-portion-of-the-composite-pa
>>
>> Either we should move "mimeMessageId" to the clustering key (but all the
>> messages of a user, including their subject would end up in a single
>> partition, which could be quite large... for instance 1 million messages
>> x size of the mime message ids and subject could be too much, as
>> partitions are recommended to be 100MBs at most).
>>
>> Or we can just do several selects, one for each mimeMessageIds.I
>> personaly likely prefer this option.
>>> [...]
>>>
>>> CREATE threadtable_lookup for deletion purpose
>>>
>>> Supposed when we delete a message, we would need to delete that message’s
>>> thread-related data in the threadtable also. I guess we just need
>> messageId
>>> to delete that message.
>>>
>>> We would need to have another table similar to the threadtable but looked
>>> up by messageId to get the needed params for deletion query.
>>>
>>> CREATE TABLE ThreadTable_lookup (messageId timeuuid, username text,
>>> mimeMessageIds set<text>, PRIMARY KEY(messageid));
>> The set should be frozen. We will never add or remove data in it, so we
>> do not need a CRDT (Commutative Replicated Data Type).
>>
>> Forbidding adding - removing individual elements avoids lots of issues,
>> and lead to a more compact storage.
>>
>>> To ease testing, I create a table with messageId’s type is int instead.
>>>
>>> We will insert the data as same as the original table.
>>>
>>>     insert into ThreadTable_lookup (messageId, username, mimeMessageIds)
>>> values (1, 'quan', {'MimeMessageID1', 'MimeMessageID2',
>> 'MimeMessageID3'});
>>> [...]
>>>
>>>
>>> SELECT * FROM threadtable_lookup;
>>>
>>>
>>> Now we will do a query selection by messageId to get the needed username,
>>> mimeMessageIds for original threadtable deletion.
>>>
>>> Supposed we want to delete message 4.
>>>
>>> SELECT username, mimemessageids FROM threadtable_lookup WHERE messageid
>> = 4;
>>>
>>> Then we will use these results to do a deletion query on threadtable.
>>>
>> Remember to do the delete of ThreadTable before threadtable_lookup,
>> otherwise if you delete the pointer before the data you might end up in
>> a case where the actual data is never deleted.
>>
>> The algorithm that you propose looks good.
>>
>> When/How do we call it?
>>
>>
>> Cheers,
>>
>> Benoit
>>> The data of messageId 4 deleted.
>>>
>>>
>>>
>>> Best regards,
>>>
>>> Quan
>>>
>>>
>>> Vào Th 3, 20 thg 7, 2021 vào lúc 18:39 [email protected] <
>>> [email protected]> đã viết:
>>>
>>>> Hello Quan,
>>>>
>>>> On 20/07/2021 17:24, Quan tran hong wrote:
>>>>> [...]
>>>>>
>>>>> SELECT threadId FROM threadtable WHERE username = 'quan' AND
>> baseSubject
>>>> =
>>>>> 'baseSubject1' AND mimeMessageId IN ('MimeMessageID2',
>> 'MimeMessageID3')
>>>>> LIMIT 1 ALLOW FILTERING;
>>>> ALLOW FILTERING should not be used as it will result in a full scan and
>>>> is thus a performance disaster.
>>>>
>>>> If you need it, this means you do not have the right table structure and
>>>> likely should rework the CREATE TABLE statement.
>>>>> => This new message should have this threadId.
>>>>> New unrelated message
>>>>>
>>>>> Assume that we do a query for a new unrelated message.
>>>>>
>>>>> SELECT threadId FROM threadtable WHERE username = 'quan' AND
>> baseSubject
>>>> =
>>>>> 'unrelatedBaseSubject' AND mimeMessageId IN ('MimeMessageID2',
>>>>> 'MimeMessageID3') LIMIT 1 ALLOW FILTERING;
>>>>>
>>>>> => This new message should have a new threadId.
>>>>> Insert new message data
>>>>>
>>>>> After having a threadId, we need to insert new message data into the
>>>> thread
>>>>> table.
>>>>>
>>>>> insert into ThreadTable (messageId, threadId, username, mimeMessageId,
>>>>> baseSubject) values (now(), 02294fe1-e941-11eb-a8ee-77de5498f1fa,
>> 'quan',
>>>>> 'MimeMessageID2', 'baseSubject1');
>>>>>
>>>>> insert into ThreadTable (messageId, threadId, username, mimeMessageId,
>>>>> baseSubject) values (now(), 02294fe1-e941-11eb-a8ee-77de5498f1fa,
>> 'quan',
>>>>> 'MimeMessageID3', 'baseSubject1');
>>>>> Conclusion
>>>>>
>>>>> I think this data model complies with the needed request for the
>> guessing
>>>>> algorithm problem, but it looks like still maybe there is room for
>>>>> improvement.
>>>> What Cassandra request do we use to delete the data in there?
>>>>
>>>>> Best Regards,
>>>>>
>>>>> Quan
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Vào Th 2, 19 thg 7, 2021 vào lúc 18:23 [email protected] <
>>>>> [email protected]> đã viết:
>>>>>
>>>>>> Hello Quan,
>>>>>>
>>>>>> On 19/07/2021 17:59, Quan tran hong wrote:
>>>>>>> Hi,
>>>>>>> I am starting to implement ThreadIdGuessingAlgorithm for the
>>>> distributed
>>>>>>> module. Because this is a breaking change and I am new to Cassandra
>>>> also,
>>>>>>> therefore I want to have some discussion with you about how to do
>> this.
>>>>>> As long as we introduce a new table there is no reason that it creates
>>>>>> breaking change, but getting the format right will ease our life down
>>>>>> the line.
>>>>>>> For the ones who did not catch up with this work, please have a look
>> at
>>>>>>> JMAP Threads specs [1] and my work related to this [2].
>>>>>>>
>>>>>>> So my ideas on how to do this:
>>>>>>> - Add a needed inputs Cassandra Table for guessing threadId
>> algorithm.
>>>>>>> Maybe a table likes:
>>>>>>>  CREATE TABLE ThreadRelatedTable (
>>>>>>> threadId       timeuuid,
>>>>>>> messageId      timeuuid,
>>>>>>> mimeMessageIds     SET<text>,
>>>>>>> subject       text,
>>>>>>> PRIMARY KEY (mimeMessageIds, subject)
>>>>>>> );
>>>>>>> - Whenever we guess threadId for a new message, we access this table
>>>> and
>>>>>> do
>>>>>>> the matching query to get related threadId(if there is) or decide new
>>>>>>> message should have a new threadId.
>>>>>>> - Whenever we save a new message, we save the thread-related data to
>>>> this
>>>>>>> table.
>>>>>>>
>>>>>>> This is my first come-up idea. Please express your thoughts about
>> this.
>>>>>> Collections are an advanced data modeling tool, that should be used
>> with
>>>>>> caution. I am not sure using it in a PRIMARY KEY is a good idea. I am
>>>>>> not sure that does what you want (the full primary key should be
>>>>>> specified to know which node hold the data.
>>>>>>
>>>>>> Also, once you found the message related to a thread you want to
>>>>>> validate that the subject matches. This can be done on application
>> side
>>>>>> (James), and avoids complicated data model.
>>>>>>
>>>>>> I encourage you to validate your data model using a Cassandra in
>> docker
>>>>>> and executing CQL commands locally with CQLSH tool to simulate the
>>>>>> queries you whish to do, and learn about your data model before even
>>>>>> starting to implement it. IMO sharing CQL commands for creating the
>>>>>> table, inserting data in it, and retrieving data from it would be a
>>>>>> great follow up to this email.
>>>>>>
>>>>>> How would you populate the data of that table?
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Benoit
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Quan
>>>>>>>
>>>>>>> [1] https://jmap.io/spec-mail.html#threads
>>>>>>> [2] https://issues.apache.org/jira/browse/JAMES-3516
>>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>> For additional commands, e-mail: [email protected]
>>>>>>
>>>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>>>>
>>>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: Implement ThreadIdGuessingAlgorithm for the distributed module

Reply via email to