Hi Benoit, Thanks for your great reviews. > Or we can just do several selects, one for each mimeMessageIds.I personaly likely prefer this option.
This will make the upper logic layer more complicated. But I agree with your suggestion cause I would prefer balance and performance also. > Remember to do the delete of ThreadTable before threadtable_lookup, otherwise if you delete the pointer before the data you might end up in a case where the actual data is never deleted. Sure. Thanks for your reminder. > When/How do we call it? StoringThreadIdGuessingAlgorithm maybe? Cheers, Quan Vào Th 4, 21 thg 7, 2021 vào lúc 17:03 btell...@apache.org < btell...@apache.org> đã viết: > Hello quan, > > Nice proposal! I think the Cassandra data model you propose is evolved > enough so that we start implementing it. > > Some comments inlined... > > On 21/07/2021 15:55, Quan tran hong wrote: > > Hi Benoit, > > I did have another try on this. Please have a look. > > CREATE threadtable > > > > CREATE TABLE ThreadTable (messageId timeuuid, threadId timeuuid, username > > text, mimeMessageId text, baseSubject text, PRIMARY KEY((username, > > mimemessageid), messageid)); > > > > => Partition key: (username, mimemessageid), clustering key: messageid. > > > > [...] > > > > SELECT basesubject, threadId FROM threadtable WHERE username = 'quan' > AND > > mimeMessageId IN ('MimeMessageID2', 'MimeMessageID3'); > This looks way better to me. > > IN usage is PRIMARY KEY is discouraged as it leads to coordination > across partitions. > > Read more for instance in > > https://stackoverflow.com/questions/55604857/cassandra-query-performance-using-in-clause-for-one-portion-of-the-composite-pa > > Either we should move "mimeMessageId" to the clustering key (but all the > messages of a user, including their subject would end up in a single > partition, which could be quite large... for instance 1 million messages > x size of the mime message ids and subject could be too much, as > partitions are recommended to be 100MBs at most). > > Or we can just do several selects, one for each mimeMessageIds.I > personaly likely prefer this option. > > [...] > > > > CREATE threadtable_lookup for deletion purpose > > > > Supposed when we delete a message, we would need to delete that message’s > > thread-related data in the threadtable also. I guess we just need > messageId > > to delete that message. > > > > We would need to have another table similar to the threadtable but looked > > up by messageId to get the needed params for deletion query. > > > > CREATE TABLE ThreadTable_lookup (messageId timeuuid, username text, > > mimeMessageIds set<text>, PRIMARY KEY(messageid)); > The set should be frozen. We will never add or remove data in it, so we > do not need a CRDT (Commutative Replicated Data Type). > > Forbidding adding - removing individual elements avoids lots of issues, > and lead to a more compact storage. > > > To ease testing, I create a table with messageId’s type is int instead. > > > > We will insert the data as same as the original table. > > > > insert into ThreadTable_lookup (messageId, username, mimeMessageIds) > > values (1, 'quan', {'MimeMessageID1', 'MimeMessageID2', > 'MimeMessageID3'}); > > > > [...] > > > > > > SELECT * FROM threadtable_lookup; > > > > > > Now we will do a query selection by messageId to get the needed username, > > mimeMessageIds for original threadtable deletion. > > > > Supposed we want to delete message 4. > > > > SELECT username, mimemessageids FROM threadtable_lookup WHERE messageid > = 4; > > > > > > Then we will use these results to do a deletion query on threadtable. > > > Remember to do the delete of ThreadTable before threadtable_lookup, > otherwise if you delete the pointer before the data you might end up in > a case where the actual data is never deleted. > > The algorithm that you propose looks good. > > When/How do we call it? > > > Cheers, > > Benoit > > The data of messageId 4 deleted. > > > > > > > > Best regards, > > > > Quan > > > > > > Vào Th 3, 20 thg 7, 2021 vào lúc 18:39 btell...@apache.org < > > btell...@apache.org> đã viết: > > > >> Hello Quan, > >> > >> On 20/07/2021 17:24, Quan tran hong wrote: > >>> [...] > >>> > >>> SELECT threadId FROM threadtable WHERE username = 'quan' AND > baseSubject > >> = > >>> 'baseSubject1' AND mimeMessageId IN ('MimeMessageID2', > 'MimeMessageID3') > >>> LIMIT 1 ALLOW FILTERING; > >> ALLOW FILTERING should not be used as it will result in a full scan and > >> is thus a performance disaster. > >> > >> If you need it, this means you do not have the right table structure and > >> likely should rework the CREATE TABLE statement. > >>> => This new message should have this threadId. > >>> New unrelated message > >>> > >>> Assume that we do a query for a new unrelated message. > >>> > >>> SELECT threadId FROM threadtable WHERE username = 'quan' AND > baseSubject > >> = > >>> 'unrelatedBaseSubject' AND mimeMessageId IN ('MimeMessageID2', > >>> 'MimeMessageID3') LIMIT 1 ALLOW FILTERING; > >>> > >>> => This new message should have a new threadId. > >>> Insert new message data > >>> > >>> After having a threadId, we need to insert new message data into the > >> thread > >>> table. > >>> > >>> insert into ThreadTable (messageId, threadId, username, mimeMessageId, > >>> baseSubject) values (now(), 02294fe1-e941-11eb-a8ee-77de5498f1fa, > 'quan', > >>> 'MimeMessageID2', 'baseSubject1'); > >>> > >>> insert into ThreadTable (messageId, threadId, username, mimeMessageId, > >>> baseSubject) values (now(), 02294fe1-e941-11eb-a8ee-77de5498f1fa, > 'quan', > >>> 'MimeMessageID3', 'baseSubject1'); > >>> Conclusion > >>> > >>> I think this data model complies with the needed request for the > guessing > >>> algorithm problem, but it looks like still maybe there is room for > >>> improvement. > >> What Cassandra request do we use to delete the data in there? > >> > >>> > >>> Best Regards, > >>> > >>> Quan > >>> > >>> > >>> > >>> > >>> > >>> Vào Th 2, 19 thg 7, 2021 vào lúc 18:23 btell...@apache.org < > >>> btell...@apache.org> đã viết: > >>> > >>>> Hello Quan, > >>>> > >>>> On 19/07/2021 17:59, Quan tran hong wrote: > >>>>> Hi, > >>>>> I am starting to implement ThreadIdGuessingAlgorithm for the > >> distributed > >>>>> module. Because this is a breaking change and I am new to Cassandra > >> also, > >>>>> therefore I want to have some discussion with you about how to do > this. > >>>> As long as we introduce a new table there is no reason that it creates > >>>> breaking change, but getting the format right will ease our life down > >>>> the line. > >>>>> For the ones who did not catch up with this work, please have a look > at > >>>>> JMAP Threads specs [1] and my work related to this [2]. > >>>>> > >>>>> So my ideas on how to do this: > >>>>> - Add a needed inputs Cassandra Table for guessing threadId > algorithm. > >>>>> Maybe a table likes: > >>>>> CREATE TABLE ThreadRelatedTable ( > >>>>> threadId timeuuid, > >>>>> messageId timeuuid, > >>>>> mimeMessageIds SET<text>, > >>>>> subject text, > >>>>> PRIMARY KEY (mimeMessageIds, subject) > >>>>> ); > >>>>> - Whenever we guess threadId for a new message, we access this table > >> and > >>>> do > >>>>> the matching query to get related threadId(if there is) or decide new > >>>>> message should have a new threadId. > >>>>> - Whenever we save a new message, we save the thread-related data to > >> this > >>>>> table. > >>>>> > >>>>> This is my first come-up idea. Please express your thoughts about > this. > >>>> Collections are an advanced data modeling tool, that should be used > with > >>>> caution. I am not sure using it in a PRIMARY KEY is a good idea. I am > >>>> not sure that does what you want (the full primary key should be > >>>> specified to know which node hold the data. > >>>> > >>>> Also, once you found the message related to a thread you want to > >>>> validate that the subject matches. This can be done on application > side > >>>> (James), and avoids complicated data model. > >>>> > >>>> I encourage you to validate your data model using a Cassandra in > docker > >>>> and executing CQL commands locally with CQLSH tool to simulate the > >>>> queries you whish to do, and learn about your data model before even > >>>> starting to implement it. IMO sharing CQL commands for creating the > >>>> table, inserting data in it, and retrieving data from it would be a > >>>> great follow up to this email. > >>>> > >>>> How would you populate the data of that table? > >>>> > >>>> Best regards, > >>>> > >>>> Benoit > >>>>> Best regards, > >>>>> > >>>>> Quan > >>>>> > >>>>> [1] https://jmap.io/spec-mail.html#threads > >>>>> [2] https://issues.apache.org/jira/browse/JAMES-3516 > >>>>> > >>>> --------------------------------------------------------------------- > >>>> To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org > >>>> For additional commands, e-mail: server-dev-h...@james.apache.org > >>>> > >>>> > >> --------------------------------------------------------------------- > >> To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org > >> For additional commands, e-mail: server-dev-h...@james.apache.org > >> > >> > > --------------------------------------------------------------------- > To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org > For additional commands, e-mail: server-dev-h...@james.apache.org > >