Re: Implement ThreadIdGuessingAlgorithm for the distributed module

Quan tran hong Wed, 21 Jul 2021 04:06:30 -0700

Hi Benoit,
Thanks for your great reviews.

> Or we can just do several selects, one for each mimeMessageIds.I
personaly likely prefer this option.


This will make the upper logic layer more complicated. But I agree with
your suggestion cause I would prefer balance and performance also.

> Remember to do the delete of ThreadTable before threadtable_lookup,
otherwise if you delete the pointer before the data you might end up in
a case where the actual data is never deleted.

Sure. Thanks for your reminder.

> When/How do we call it?
StoringThreadIdGuessingAlgorithm maybe?

Cheers,
Quan

Vào Th 4, 21 thg 7, 2021 vào lúc 17:03 [email protected] <
[email protected]> đã viết:

> Hello quan,
>
> Nice proposal! I think the Cassandra data model you propose is evolved
> enough so that we start implementing it.
>
> Some comments inlined...
>
> On 21/07/2021 15:55, Quan tran hong wrote:
> > Hi Benoit,
> > I did have another try on this. Please have a look.
> > CREATE threadtable
> >
> > CREATE TABLE ThreadTable (messageId timeuuid, threadId timeuuid, username
> > text, mimeMessageId text, baseSubject text, PRIMARY KEY((username,
> > mimemessageid), messageid));
> >
> > => Partition key: (username, mimemessageid), clustering key: messageid.
> >
> > [...]
> >
> > SELECT basesubject, threadId FROM threadtable WHERE username = 'quan'
> AND
> > mimeMessageId IN ('MimeMessageID2', 'MimeMessageID3');
> This looks way better to me.
>
> IN usage is PRIMARY KEY is discouraged as it leads to coordination
> across partitions.
>
> Read more for instance in
>
> https://stackoverflow.com/questions/55604857/cassandra-query-performance-using-in-clause-for-one-portion-of-the-composite-pa
>
> Either we should move "mimeMessageId" to the clustering key (but all the
> messages of a user, including their subject would end up in a single
> partition, which could be quite large... for instance 1 million messages
> x size of the mime message ids and subject could be too much, as
> partitions are recommended to be 100MBs at most).
>
> Or we can just do several selects, one for each mimeMessageIds.I
> personaly likely prefer this option.
> > [...]
> >
> > CREATE threadtable_lookup for deletion purpose
> >
> > Supposed when we delete a message, we would need to delete that message’s
> > thread-related data in the threadtable also. I guess we just need
> messageId
> > to delete that message.
> >
> > We would need to have another table similar to the threadtable but looked
> > up by messageId to get the needed params for deletion query.
> >
> > CREATE TABLE ThreadTable_lookup (messageId timeuuid, username text,
> > mimeMessageIds set<text>, PRIMARY KEY(messageid));
> The set should be frozen. We will never add or remove data in it, so we
> do not need a CRDT (Commutative Replicated Data Type).
>
> Forbidding adding - removing individual elements avoids lots of issues,
> and lead to a more compact storage.
>
> > To ease testing, I create a table with messageId’s type is int instead.
> >
> > We will insert the data as same as the original table.
> >
> >     insert into ThreadTable_lookup (messageId, username, mimeMessageIds)
> > values (1, 'quan', {'MimeMessageID1', 'MimeMessageID2',
> 'MimeMessageID3'});
> >
> > [...]
> >
> >
> > SELECT * FROM threadtable_lookup;
> >
> >
> > Now we will do a query selection by messageId to get the needed username,
> > mimeMessageIds for original threadtable deletion.
> >
> > Supposed we want to delete message 4.
> >
> > SELECT username, mimemessageids FROM threadtable_lookup WHERE messageid
> = 4;
> >
> >
> > Then we will use these results to do a deletion query on threadtable.
> >
> Remember to do the delete of ThreadTable before threadtable_lookup,
> otherwise if you delete the pointer before the data you might end up in
> a case where the actual data is never deleted.
>
> The algorithm that you propose looks good.
>
> When/How do we call it?
>
>
> Cheers,
>
> Benoit
> > The data of messageId 4 deleted.
> >
> >
> >
> > Best regards,
> >
> > Quan
> >
> >
> > Vào Th 3, 20 thg 7, 2021 vào lúc 18:39 [email protected] <
> > [email protected]> đã viết:
> >
> >> Hello Quan,
> >>
> >> On 20/07/2021 17:24, Quan tran hong wrote:
> >>> [...]
> >>>
> >>> SELECT threadId FROM threadtable WHERE username = 'quan' AND
> baseSubject
> >> =
> >>> 'baseSubject1' AND mimeMessageId IN ('MimeMessageID2',
> 'MimeMessageID3')
> >>> LIMIT 1 ALLOW FILTERING;
> >> ALLOW FILTERING should not be used as it will result in a full scan and
> >> is thus a performance disaster.
> >>
> >> If you need it, this means you do not have the right table structure and
> >> likely should rework the CREATE TABLE statement.
> >>> => This new message should have this threadId.
> >>> New unrelated message
> >>>
> >>> Assume that we do a query for a new unrelated message.
> >>>
> >>> SELECT threadId FROM threadtable WHERE username = 'quan' AND
> baseSubject
> >> =
> >>> 'unrelatedBaseSubject' AND mimeMessageId IN ('MimeMessageID2',
> >>> 'MimeMessageID3') LIMIT 1 ALLOW FILTERING;
> >>>
> >>> => This new message should have a new threadId.
> >>> Insert new message data
> >>>
> >>> After having a threadId, we need to insert new message data into the
> >> thread
> >>> table.
> >>>
> >>> insert into ThreadTable (messageId, threadId, username, mimeMessageId,
> >>> baseSubject) values (now(), 02294fe1-e941-11eb-a8ee-77de5498f1fa,
> 'quan',
> >>> 'MimeMessageID2', 'baseSubject1');
> >>>
> >>> insert into ThreadTable (messageId, threadId, username, mimeMessageId,
> >>> baseSubject) values (now(), 02294fe1-e941-11eb-a8ee-77de5498f1fa,
> 'quan',
> >>> 'MimeMessageID3', 'baseSubject1');
> >>> Conclusion
> >>>
> >>> I think this data model complies with the needed request for the
> guessing
> >>> algorithm problem, but it looks like still maybe there is room for
> >>> improvement.
> >> What Cassandra request do we use to delete the data in there?
> >>
> >>>
> >>> Best Regards,
> >>>
> >>> Quan
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> Vào Th 2, 19 thg 7, 2021 vào lúc 18:23 [email protected] <
> >>> [email protected]> đã viết:
> >>>
> >>>> Hello Quan,
> >>>>
> >>>> On 19/07/2021 17:59, Quan tran hong wrote:
> >>>>> Hi,
> >>>>> I am starting to implement ThreadIdGuessingAlgorithm for the
> >> distributed
> >>>>> module. Because this is a breaking change and I am new to Cassandra
> >> also,
> >>>>> therefore I want to have some discussion with you about how to do
> this.
> >>>> As long as we introduce a new table there is no reason that it creates
> >>>> breaking change, but getting the format right will ease our life down
> >>>> the line.
> >>>>> For the ones who did not catch up with this work, please have a look
> at
> >>>>> JMAP Threads specs [1] and my work related to this [2].
> >>>>>
> >>>>> So my ideas on how to do this:
> >>>>> - Add a needed inputs Cassandra Table for guessing threadId
> algorithm.
> >>>>> Maybe a table likes:
> >>>>>  CREATE TABLE ThreadRelatedTable (
> >>>>> threadId       timeuuid,
> >>>>> messageId      timeuuid,
> >>>>> mimeMessageIds     SET<text>,
> >>>>> subject       text,
> >>>>> PRIMARY KEY (mimeMessageIds, subject)
> >>>>> );
> >>>>> - Whenever we guess threadId for a new message, we access this table
> >> and
> >>>> do
> >>>>> the matching query to get related threadId(if there is) or decide new
> >>>>> message should have a new threadId.
> >>>>> - Whenever we save a new message, we save the thread-related data to
> >> this
> >>>>> table.
> >>>>>
> >>>>> This is my first come-up idea. Please express your thoughts about
> this.
> >>>> Collections are an advanced data modeling tool, that should be used
> with
> >>>> caution. I am not sure using it in a PRIMARY KEY is a good idea. I am
> >>>> not sure that does what you want (the full primary key should be
> >>>> specified to know which node hold the data.
> >>>>
> >>>> Also, once you found the message related to a thread you want to
> >>>> validate that the subject matches. This can be done on application
> side
> >>>> (James), and avoids complicated data model.
> >>>>
> >>>> I encourage you to validate your data model using a Cassandra in
> docker
> >>>> and executing CQL commands locally with CQLSH tool to simulate the
> >>>> queries you whish to do, and learn about your data model before even
> >>>> starting to implement it. IMO sharing CQL commands for creating the
> >>>> table, inserting data in it, and retrieving data from it would be a
> >>>> great follow up to this email.
> >>>>
> >>>> How would you populate the data of that table?
> >>>>
> >>>> Best regards,
> >>>>
> >>>> Benoit
> >>>>> Best regards,
> >>>>>
> >>>>> Quan
> >>>>>
> >>>>> [1] https://jmap.io/spec-mail.html#threads
> >>>>> [2] https://issues.apache.org/jira/browse/JAMES-3516
> >>>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: [email protected]
> >>>> For additional commands, e-mail: [email protected]
> >>>>
> >>>>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
> >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

Re: Implement ThreadIdGuessingAlgorithm for the distributed module

Reply via email to