Re: Implement ThreadIdGuessingAlgorithm for the distributed module

Quan tran hong Tue, 20 Jul 2021 03:24:35 -0700

Hi Benoit,
Following your suggestion, I ran some experiments today. Please have a
look.
I have split this into a few sections so that it is easy to read.
Create table


CREATE TABLE ThreadTable (
  messageId timeuuid, threadId timeuuid, username text,
  mimeMessageId text, baseSubject text,
  PRIMARY KEY (messageId, mimeMessageId)
);

=> Partition key: messageId, clustering key: mimeMessageId.
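
To make the consequence of this key choice concrete, here is a minimal Python sketch (not Cassandra code) that models the table as a dictionary keyed by (messageId, mimeMessageId). The names `table`, `insert_row`, `get_row` and `scan_by_username` are made up for illustration. A lookup by the full primary key is direct, while a lookup by username has to visit every row, which is why the SELECT queries further down need ALLOW FILTERING.

```python
import uuid

# Hypothetical in-memory stand-in for ThreadTable:
# key = (messageId, mimeMessageId), value = the remaining columns.
table = {}

def insert_row(message_id, thread_id, username, mime_message_id, base_subject):
    table[(message_id, mime_message_id)] = {
        "threadId": thread_id,
        "username": username,
        "baseSubject": base_subject,
    }

def get_row(message_id, mime_message_id):
    # Direct lookup: the full primary key is known.
    return table.get((message_id, mime_message_id))

def scan_by_username(username):
    # No key is known, so every row has to be inspected.
    return [row for row in table.values() if row["username"] == username]

m1, t1 = uuid.uuid1(), uuid.uuid1()
m2, t2 = uuid.uuid1(), uuid.uuid1()
insert_row(m1, t1, "quan", "MimeMessageID1", "baseSubject1")
insert_row(m2, t2, "benoit", "MimeMessageID5", "baseSubject1")

print(get_row(m1, "MimeMessageID1")["threadId"] == t1)  # True
print(len(scan_by_username("quan")))  # 1
```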
Insert data

I will add:

   - 2 related messages and 1 unrelated message for user ‘quan’
   - 1 message for user ‘benoit’ (which seems related to some messages of
     user ‘quan’)

// insert message1 data for ‘quan’

INSERT INTO ThreadTable
  (messageId, threadId, username, mimeMessageId, baseSubject)
VALUES (now(), now(), 'quan', 'MimeMessageID1', 'baseSubject1');

INSERT INTO ThreadTable
  (messageId, threadId, username, mimeMessageId, baseSubject)
VALUES (now(), now(), 'quan', 'MimeMessageID2', 'baseSubject1');

INSERT INTO ThreadTable
  (messageId, threadId, username, mimeMessageId, baseSubject)
VALUES (now(), now(), 'quan', 'MimeMessageID3', 'baseSubject1');

// insert message2 data (related to message1) for ‘quan’

INSERT INTO ThreadTable
  (messageId, threadId, username, mimeMessageId, baseSubject)
VALUES (now(), now(), 'quan', 'MimeMessageID1', 'baseSubject1');

// insert message3 data (not related to any message) for ‘quan’

INSERT INTO ThreadTable
  (messageId, threadId, username, mimeMessageId, baseSubject)
VALUES (now(), now(), 'quan', 'MimeMessageID4', 'baseSubject2');

// insert message4 data (related to message1 but for ‘benoit’)

INSERT INTO ThreadTable
  (messageId, threadId, username, mimeMessageId, baseSubject)
VALUES (now(), now(), 'benoit', 'MimeMessageID5', 'baseSubject1');

INSERT INTO ThreadTable
  (messageId, threadId, username, mimeMessageId, baseSubject)
VALUES (now(), now(), 'benoit', 'MimeMessageID1', 'baseSubject1');
Select all data

The query for guessing new messages' threadId
New related message

For example, suppose a new message arrives for user ‘quan’ with the
following header fields:

   - Set of mimeMessageIds (after combining values): {‘MimeMessageID2’,
     ‘MimeMessageID3’}
   - Base subject line (after stripping): “baseSubject1”

This message is expected to be related to 2 existing messages of ‘quan’.

We need to query one row related to this new message (if there is one).

SELECT threadId FROM ThreadTable
WHERE username = 'quan' AND baseSubject = 'baseSubject1'
  AND mimeMessageId IN ('MimeMessageID2', 'MimeMessageID3')
LIMIT 1 ALLOW FILTERING;

=> The new message should be given the threadId returned by this query.
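
The intended semantics of this query can be sketched in plain Python. This is only a rough model, not something James or Cassandra provides: the function `find_related_thread_id`, the row dictionaries and the placeholder thread ids are all hypothetical.

```python
# Hypothetical rows, mirroring the columns of ThreadTable.
rows = [
    {"threadId": "thread-A", "username": "quan",
     "mimeMessageId": "MimeMessageID2", "baseSubject": "baseSubject1"},
    {"threadId": "thread-B", "username": "quan",
     "mimeMessageId": "MimeMessageID4", "baseSubject": "baseSubject2"},
    {"threadId": "thread-C", "username": "benoit",
     "mimeMessageId": "MimeMessageID2", "baseSubject": "baseSubject1"},
]

def find_related_thread_id(rows, username, base_subject, mime_message_ids):
    """Return the threadId of the first row matching all three conditions, else None."""
    wanted = set(mime_message_ids)
    for row in rows:
        if (row["username"] == username
                and row["baseSubject"] == base_subject
                and row["mimeMessageId"] in wanted):
            return row["threadId"]  # like LIMIT 1: stop at the first match
    return None

print(find_related_thread_id(
    rows, "quan", "baseSubject1", {"MimeMessageID2", "MimeMessageID3"}))
# prints thread-A
```

Restricting on username keeps the row of ‘benoit’ out of the result even though it carries the same mimeMessageId and base subject.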
New unrelated message

Now assume that we run the query for a new, unrelated message.

SELECT threadId FROM ThreadTable
WHERE username = 'quan' AND baseSubject = 'unrelatedBaseSubject'
  AND mimeMessageId IN ('MimeMessageID2', 'MimeMessageID3')
LIMIT 1 ALLOW FILTERING;

=> No row matches, so this new message should get a new threadId.
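
The decision between reusing a threadId and creating a new one could look roughly like the sketch below. The helper `guess_thread_id` is hypothetical, and I am assuming that a version 1 UUID generated on the application side is an acceptable value for a timeuuid column.

```python
import uuid

def guess_thread_id(matching_thread_ids):
    """Reuse the first matching threadId, or create a fresh time-based UUID."""
    for thread_id in matching_thread_ids:
        return thread_id
    # Nothing matched: the message starts a new thread.
    return uuid.uuid1()  # version 1 (time-based) UUID

existing = uuid.UUID("02294fe1-e941-11eb-a8ee-77de5498f1fa")
print(guess_thread_id([existing]) == existing)  # True
print(guess_thread_id([]).version)              # 1
```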
Insert new message data

Once we have a threadId, we need to insert the new message's data into the
thread table.

INSERT INTO ThreadTable
  (messageId, threadId, username, mimeMessageId, baseSubject)
VALUES (now(), 02294fe1-e941-11eb-a8ee-77de5498f1fa, 'quan',
  'MimeMessageID2', 'baseSubject1');

INSERT INTO ThreadTable
  (messageId, threadId, username, mimeMessageId, baseSubject)
VALUES (now(), 02294fe1-e941-11eb-a8ee-77de5498f1fa, 'quan',
  'MimeMessageID3', 'baseSubject1');
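
One thing to note: now() produces a new value on every call, so the two statements above store two different messageId values. Assuming the rows that belong to one message should share a single messageId (this is my assumption, not something validated yet), the application would generate it once and reuse it. A minimal Python sketch, where `rows_for_message` is a hypothetical helper:

```python
import uuid

def rows_for_message(message_id, thread_id, username, base_subject, mime_message_ids):
    """Build one ThreadTable row per mimeMessageId, all sharing messageId and threadId."""
    return [
        {"messageId": message_id, "threadId": thread_id, "username": username,
         "mimeMessageId": mime_id, "baseSubject": base_subject}
        for mime_id in sorted(mime_message_ids)
    ]

thread_id = uuid.UUID("02294fe1-e941-11eb-a8ee-77de5498f1fa")
new_rows = rows_for_message(uuid.uuid1(), thread_id, "quan", "baseSubject1",
                            {"MimeMessageID2", "MimeMessageID3"})
for row in new_rows:
    print(row["mimeMessageId"], row["threadId"])
```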
Conclusion

I think this data model meets the needs of the threadId guessing
algorithm, but there is probably still room for improvement.


Best Regards,

Quan





On Mon, 19 Jul 2021 at 18:23, [email protected] <
[email protected]> wrote:

> Hello Quan,
>
> On 19/07/2021 17:59, Quan tran hong wrote:
> > Hi,
> > I am starting to implement ThreadIdGuessingAlgorithm for the distributed
> > module. Because this is a breaking change and I am also new to Cassandra,
> > I would like to discuss with you how to do this.
> As long as we introduce a new table, there is no reason for this to be a
> breaking change, but getting the format right will make our life easier
> down the line.
> >
> > For those who have not followed this work, please have a look at the
> > JMAP Threads specs [1] and my work related to this [2].
> >
> > So my ideas on how to do this:
> > - Add a Cassandra table holding the inputs needed by the threadId
> > guessing algorithm. Maybe a table like:
> >  CREATE TABLE ThreadRelatedTable (
> > threadId       timeuuid,
> > messageId      timeuuid,
> > mimeMessageIds     SET<text>,
> > subject       text,
> > PRIMARY KEY (mimeMessageIds, subject)
> > );
> > - Whenever we guess the threadId for a new message, we access this table
> > and run the matching query to get the related threadId (if there is one),
> > or decide that the new message should have a new threadId.
> > - Whenever we save a new message, we save the thread-related data to this
> > table.
> >
> > This is my first idea. Please share your thoughts on it.
> Collections are an advanced data modeling tool that should be used with
> caution. I am not sure using one in a PRIMARY KEY is a good idea, and I am
> not sure it does what you want (the full primary key should be
> specified to know which node holds the data).
>
> Also, once you have found the message related to a thread, you want to
> validate that the subject matches. This can be done on the application
> side (James), and avoids a complicated data model.
>
> I encourage you to validate your data model using Cassandra in Docker
> and executing CQL commands locally with the cqlsh tool to simulate the
> queries you wish to run, and to learn about your data model before even
> starting to implement it. IMO sharing CQL commands for creating the
> table, inserting data in it, and retrieving data from it would be a
> great follow-up to this email.
>
> How would you populate the data of that table?
>
> Best regards,
>
> Benoit
> >
> > Best regards,
> >
> > Quan
> >
> > [1] https://jmap.io/spec-mail.html#threads
> > [2] https://issues.apache.org/jira/browse/JAMES-3516
> >
