Re: Implement ThreadIdGuessingAlgorithm for the distributed module

Quan tran hong Thu, 22 Jul 2021 00:31:50 -0700

Hi Benoit,

Following the approval of the table's design, I will start to implement
*threadtable *and start with define the table and its DAO layer. I will
come back for the deletion part feature finish the guessing threadid
feature.


Best regards,

Quan

Vào Th 4, 21 thg 7, 2021 vào lúc 18:08 btell...@linagora.com <
btell...@linagora.com> đã viết:

>
> On 21/07/2021 18:06, Quan tran hong wrote:
> > Hi Benoit,
> > Thanks for your great reviews.
> >
> >> Or we can just do several selects, one for each mimeMessageIds.I
> > personaly likely prefer this option.
> >
> > This will make the upper logic layer more complicated. But I agree with
> > your suggestion cause I would prefer balance and performance also.
> >
> >> Remember to do the delete of ThreadTable before threadtable_lookup,
> > otherwise if you delete the pointer before the data you might end up in
> > a case where the actual data is never deleted.
> >
> > Sure. Thanks for your reminder.
> >
> >> When/How do we call it?
> > StoringThreadIdGuessingAlgorithm maybe?
> Not really, this would only be called when appending messages.
>
> Likely this would need to be hooked in DeleteMessageListener.
> >
> > Cheers,
> > Quan
> >
> > Vào Th 4, 21 thg 7, 2021 vào lúc 17:03 btell...@apache.org <
> > btell...@apache.org> đã viết:
> >
> >> Hello quan,
> >>
> >> Nice proposal! I think the Cassandra data model you propose is evolved
> >> enough so that we start implementing it.
> >>
> >> Some comments inlined...
> >>
> >> On 21/07/2021 15:55, Quan tran hong wrote:
> >>> Hi Benoit,
> >>> I did have another try on this. Please have a look.
> >>> CREATE threadtable
> >>>
> >>> CREATE TABLE ThreadTable (messageId timeuuid, threadId timeuuid,
> username
> >>> text, mimeMessageId text, baseSubject text, PRIMARY KEY((username,
> >>> mimemessageid), messageid));
> >>>
> >>> => Partition key: (username, mimemessageid), clustering key: messageid.
> >>>
> >>> [...]
> >>>
> >>> SELECT basesubject, threadId FROM threadtable WHERE username = 'quan'
> >> AND
> >>> mimeMessageId IN ('MimeMessageID2', 'MimeMessageID3');
> >> This looks way better to me.
> >>
> >> IN usage is PRIMARY KEY is discouraged as it leads to coordination
> >> across partitions.
> >>
> >> Read more for instance in
> >>
> >>
> https://stackoverflow.com/questions/55604857/cassandra-query-performance-using-in-clause-for-one-portion-of-the-composite-pa
> >>
> >> Either we should move "mimeMessageId" to the clustering key (but all the
> >> messages of a user, including their subject would end up in a single
> >> partition, which could be quite large... for instance 1 million messages
> >> x size of the mime message ids and subject could be too much, as
> >> partitions are recommended to be 100MBs at most).
> >>
> >> Or we can just do several selects, one for each mimeMessageIds.I
> >> personaly likely prefer this option.
> >>> [...]
> >>>
> >>> CREATE threadtable_lookup for deletion purpose
> >>>
> >>> Supposed when we delete a message, we would need to delete that
> message’s
> >>> thread-related data in the threadtable also. I guess we just need
> >> messageId
> >>> to delete that message.
> >>>
> >>> We would need to have another table similar to the threadtable but
> looked
> >>> up by messageId to get the needed params for deletion query.
> >>>
> >>> CREATE TABLE ThreadTable_lookup (messageId timeuuid, username text,
> >>> mimeMessageIds set<text>, PRIMARY KEY(messageid));
> >> The set should be frozen. We will never add or remove data in it, so we
> >> do not need a CRDT (Commutative Replicated Data Type).
> >>
> >> Forbidding adding - removing individual elements avoids lots of issues,
> >> and lead to a more compact storage.
> >>
> >>> To ease testing, I create a table with messageId’s type is int instead.
> >>>
> >>> We will insert the data as same as the original table.
> >>>
> >>>     insert into ThreadTable_lookup (messageId, username,
> mimeMessageIds)
> >>> values (1, 'quan', {'MimeMessageID1', 'MimeMessageID2',
> >> 'MimeMessageID3'});
> >>> [...]
> >>>
> >>>
> >>> SELECT * FROM threadtable_lookup;
> >>>
> >>>
> >>> Now we will do a query selection by messageId to get the needed
> username,
> >>> mimeMessageIds for original threadtable deletion.
> >>>
> >>> Supposed we want to delete message 4.
> >>>
> >>> SELECT username, mimemessageids FROM threadtable_lookup WHERE messageid
> >> = 4;
> >>>
> >>> Then we will use these results to do a deletion query on threadtable.
> >>>
> >> Remember to do the delete of ThreadTable before threadtable_lookup,
> >> otherwise if you delete the pointer before the data you might end up in
> >> a case where the actual data is never deleted.
> >>
> >> The algorithm that you propose looks good.
> >>
> >> When/How do we call it?
> >>
> >>
> >> Cheers,
> >>
> >> Benoit
> >>> The data of messageId 4 deleted.
> >>>
> >>>
> >>>
> >>> Best regards,
> >>>
> >>> Quan
> >>>
> >>>
> >>> Vào Th 3, 20 thg 7, 2021 vào lúc 18:39 btell...@apache.org <
> >>> btell...@apache.org> đã viết:
> >>>
> >>>> Hello Quan,
> >>>>
> >>>> On 20/07/2021 17:24, Quan tran hong wrote:
> >>>>> [...]
> >>>>>
> >>>>> SELECT threadId FROM threadtable WHERE username = 'quan' AND
> >> baseSubject
> >>>> =
> >>>>> 'baseSubject1' AND mimeMessageId IN ('MimeMessageID2',
> >> 'MimeMessageID3')
> >>>>> LIMIT 1 ALLOW FILTERING;
> >>>> ALLOW FILTERING should not be used as it will result in a full scan
> and
> >>>> is thus a performance disaster.
> >>>>
> >>>> If you need it, this means you do not have the right table structure
> and
> >>>> likely should rework the CREATE TABLE statement.
> >>>>> => This new message should have this threadId.
> >>>>> New unrelated message
> >>>>>
> >>>>> Assume that we do a query for a new unrelated message.
> >>>>>
> >>>>> SELECT threadId FROM threadtable WHERE username = 'quan' AND
> >> baseSubject
> >>>> =
> >>>>> 'unrelatedBaseSubject' AND mimeMessageId IN ('MimeMessageID2',
> >>>>> 'MimeMessageID3') LIMIT 1 ALLOW FILTERING;
> >>>>>
> >>>>> => This new message should have a new threadId.
> >>>>> Insert new message data
> >>>>>
> >>>>> After having a threadId, we need to insert new message data into the
> >>>> thread
> >>>>> table.
> >>>>>
> >>>>> insert into ThreadTable (messageId, threadId, username,
> mimeMessageId,
> >>>>> baseSubject) values (now(), 02294fe1-e941-11eb-a8ee-77de5498f1fa,
> >> 'quan',
> >>>>> 'MimeMessageID2', 'baseSubject1');
> >>>>>
> >>>>> insert into ThreadTable (messageId, threadId, username,
> mimeMessageId,
> >>>>> baseSubject) values (now(), 02294fe1-e941-11eb-a8ee-77de5498f1fa,
> >> 'quan',
> >>>>> 'MimeMessageID3', 'baseSubject1');
> >>>>> Conclusion
> >>>>>
> >>>>> I think this data model complies with the needed request for the
> >> guessing
> >>>>> algorithm problem, but it looks like still maybe there is room for
> >>>>> improvement.
> >>>> What Cassandra request do we use to delete the data in there?
> >>>>
> >>>>> Best Regards,
> >>>>>
> >>>>> Quan
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> Vào Th 2, 19 thg 7, 2021 vào lúc 18:23 btell...@apache.org <
> >>>>> btell...@apache.org> đã viết:
> >>>>>
> >>>>>> Hello Quan,
> >>>>>>
> >>>>>> On 19/07/2021 17:59, Quan tran hong wrote:
> >>>>>>> Hi,
> >>>>>>> I am starting to implement ThreadIdGuessingAlgorithm for the
> >>>> distributed
> >>>>>>> module. Because this is a breaking change and I am new to Cassandra
> >>>> also,
> >>>>>>> therefore I want to have some discussion with you about how to do
> >> this.
> >>>>>> As long as we introduce a new table there is no reason that it
> creates
> >>>>>> breaking change, but getting the format right will ease our life
> down
> >>>>>> the line.
> >>>>>>> For the ones who did not catch up with this work, please have a
> look
> >> at
> >>>>>>> JMAP Threads specs [1] and my work related to this [2].
> >>>>>>>
> >>>>>>> So my ideas on how to do this:
> >>>>>>> - Add a needed inputs Cassandra Table for guessing threadId
> >> algorithm.
> >>>>>>> Maybe a table likes:
> >>>>>>>  CREATE TABLE ThreadRelatedTable (
> >>>>>>> threadId       timeuuid,
> >>>>>>> messageId      timeuuid,
> >>>>>>> mimeMessageIds     SET<text>,
> >>>>>>> subject       text,
> >>>>>>> PRIMARY KEY (mimeMessageIds, subject)
> >>>>>>> );
> >>>>>>> - Whenever we guess threadId for a new message, we access this
> table
> >>>> and
> >>>>>> do
> >>>>>>> the matching query to get related threadId(if there is) or decide
> new
> >>>>>>> message should have a new threadId.
> >>>>>>> - Whenever we save a new message, we save the thread-related data
> to
> >>>> this
> >>>>>>> table.
> >>>>>>>
> >>>>>>> This is my first come-up idea. Please express your thoughts about
> >> this.
> >>>>>> Collections are an advanced data modeling tool, that should be used
> >> with
> >>>>>> caution. I am not sure using it in a PRIMARY KEY is a good idea. I
> am
> >>>>>> not sure that does what you want (the full primary key should be
> >>>>>> specified to know which node hold the data.
> >>>>>>
> >>>>>> Also, once you found the message related to a thread you want to
> >>>>>> validate that the subject matches. This can be done on application
> >> side
> >>>>>> (James), and avoids complicated data model.
> >>>>>>
> >>>>>> I encourage you to validate your data model using a Cassandra in
> >> docker
> >>>>>> and executing CQL commands locally with CQLSH tool to simulate the
> >>>>>> queries you whish to do, and learn about your data model before even
> >>>>>> starting to implement it. IMO sharing CQL commands for creating the
> >>>>>> table, inserting data in it, and retrieving data from it would be a
> >>>>>> great follow up to this email.
> >>>>>>
> >>>>>> How would you populate the data of that table?
> >>>>>>
> >>>>>> Best regards,
> >>>>>>
> >>>>>> Benoit
> >>>>>>> Best regards,
> >>>>>>>
> >>>>>>> Quan
> >>>>>>>
> >>>>>>> [1] https://jmap.io/spec-mail.html#threads
> >>>>>>> [2] https://issues.apache.org/jira/browse/JAMES-3516
> >>>>>>>
> >>>>>>
> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org
> >>>>>> For additional commands, e-mail: server-dev-h...@james.apache.org
> >>>>>>
> >>>>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org
> >>>> For additional commands, e-mail: server-dev-h...@james.apache.org
> >>>>
> >>>>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org
> >> For additional commands, e-mail: server-dev-h...@james.apache.org
> >>
> >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org
> For additional commands, e-mail: server-dev-h...@james.apache.org
>
>

Re: Implement ThreadIdGuessingAlgorithm for the distributed module

Reply via email to