On 21/07/2021 18:06, Quan tran hong wrote: > Hi Benoit, > Thanks for your great reviews. > >> Or we can just do several selects, one for each mimeMessageIds.I > personaly likely prefer this option. > > This will make the upper logic layer more complicated. But I agree with > your suggestion cause I would prefer balance and performance also. > >> Remember to do the delete of ThreadTable before threadtable_lookup, > otherwise if you delete the pointer before the data you might end up in > a case where the actual data is never deleted. > > Sure. Thanks for your reminder. > >> When/How do we call it? > StoringThreadIdGuessingAlgorithm maybe? Not really, this would only be called when appending messages.
Likely this would need to be hooked in DeleteMessageListener. > > Cheers, > Quan > > Vào Th 4, 21 thg 7, 2021 vào lúc 17:03 btell...@apache.org < > btell...@apache.org> đã viết: > >> Hello quan, >> >> Nice proposal! I think the Cassandra data model you propose is evolved >> enough so that we start implementing it. >> >> Some comments inlined... >> >> On 21/07/2021 15:55, Quan tran hong wrote: >>> Hi Benoit, >>> I did have another try on this. Please have a look. >>> CREATE threadtable >>> >>> CREATE TABLE ThreadTable (messageId timeuuid, threadId timeuuid, username >>> text, mimeMessageId text, baseSubject text, PRIMARY KEY((username, >>> mimemessageid), messageid)); >>> >>> => Partition key: (username, mimemessageid), clustering key: messageid. >>> >>> [...] >>> >>> SELECT basesubject, threadId FROM threadtable WHERE username = 'quan' >> AND >>> mimeMessageId IN ('MimeMessageID2', 'MimeMessageID3'); >> This looks way better to me. >> >> IN usage is PRIMARY KEY is discouraged as it leads to coordination >> across partitions. >> >> Read more for instance in >> >> https://stackoverflow.com/questions/55604857/cassandra-query-performance-using-in-clause-for-one-portion-of-the-composite-pa >> >> Either we should move "mimeMessageId" to the clustering key (but all the >> messages of a user, including their subject would end up in a single >> partition, which could be quite large... for instance 1 million messages >> x size of the mime message ids and subject could be too much, as >> partitions are recommended to be 100MBs at most). >> >> Or we can just do several selects, one for each mimeMessageIds.I >> personaly likely prefer this option. >>> [...] >>> >>> CREATE threadtable_lookup for deletion purpose >>> >>> Supposed when we delete a message, we would need to delete that message’s >>> thread-related data in the threadtable also. I guess we just need >> messageId >>> to delete that message. >>> >>> We would need to have another table similar to the threadtable but looked >>> up by messageId to get the needed params for deletion query. >>> >>> CREATE TABLE ThreadTable_lookup (messageId timeuuid, username text, >>> mimeMessageIds set<text>, PRIMARY KEY(messageid)); >> The set should be frozen. We will never add or remove data in it, so we >> do not need a CRDT (Commutative Replicated Data Type). >> >> Forbidding adding - removing individual elements avoids lots of issues, >> and lead to a more compact storage. >> >>> To ease testing, I create a table with messageId’s type is int instead. >>> >>> We will insert the data as same as the original table. >>> >>> insert into ThreadTable_lookup (messageId, username, mimeMessageIds) >>> values (1, 'quan', {'MimeMessageID1', 'MimeMessageID2', >> 'MimeMessageID3'}); >>> [...] >>> >>> >>> SELECT * FROM threadtable_lookup; >>> >>> >>> Now we will do a query selection by messageId to get the needed username, >>> mimeMessageIds for original threadtable deletion. >>> >>> Supposed we want to delete message 4. >>> >>> SELECT username, mimemessageids FROM threadtable_lookup WHERE messageid >> = 4; >>> >>> Then we will use these results to do a deletion query on threadtable. >>> >> Remember to do the delete of ThreadTable before threadtable_lookup, >> otherwise if you delete the pointer before the data you might end up in >> a case where the actual data is never deleted. >> >> The algorithm that you propose looks good. >> >> When/How do we call it? >> >> >> Cheers, >> >> Benoit >>> The data of messageId 4 deleted. >>> >>> >>> >>> Best regards, >>> >>> Quan >>> >>> >>> Vào Th 3, 20 thg 7, 2021 vào lúc 18:39 btell...@apache.org < >>> btell...@apache.org> đã viết: >>> >>>> Hello Quan, >>>> >>>> On 20/07/2021 17:24, Quan tran hong wrote: >>>>> [...] >>>>> >>>>> SELECT threadId FROM threadtable WHERE username = 'quan' AND >> baseSubject >>>> = >>>>> 'baseSubject1' AND mimeMessageId IN ('MimeMessageID2', >> 'MimeMessageID3') >>>>> LIMIT 1 ALLOW FILTERING; >>>> ALLOW FILTERING should not be used as it will result in a full scan and >>>> is thus a performance disaster. >>>> >>>> If you need it, this means you do not have the right table structure and >>>> likely should rework the CREATE TABLE statement. >>>>> => This new message should have this threadId. >>>>> New unrelated message >>>>> >>>>> Assume that we do a query for a new unrelated message. >>>>> >>>>> SELECT threadId FROM threadtable WHERE username = 'quan' AND >> baseSubject >>>> = >>>>> 'unrelatedBaseSubject' AND mimeMessageId IN ('MimeMessageID2', >>>>> 'MimeMessageID3') LIMIT 1 ALLOW FILTERING; >>>>> >>>>> => This new message should have a new threadId. >>>>> Insert new message data >>>>> >>>>> After having a threadId, we need to insert new message data into the >>>> thread >>>>> table. >>>>> >>>>> insert into ThreadTable (messageId, threadId, username, mimeMessageId, >>>>> baseSubject) values (now(), 02294fe1-e941-11eb-a8ee-77de5498f1fa, >> 'quan', >>>>> 'MimeMessageID2', 'baseSubject1'); >>>>> >>>>> insert into ThreadTable (messageId, threadId, username, mimeMessageId, >>>>> baseSubject) values (now(), 02294fe1-e941-11eb-a8ee-77de5498f1fa, >> 'quan', >>>>> 'MimeMessageID3', 'baseSubject1'); >>>>> Conclusion >>>>> >>>>> I think this data model complies with the needed request for the >> guessing >>>>> algorithm problem, but it looks like still maybe there is room for >>>>> improvement. >>>> What Cassandra request do we use to delete the data in there? >>>> >>>>> Best Regards, >>>>> >>>>> Quan >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> Vào Th 2, 19 thg 7, 2021 vào lúc 18:23 btell...@apache.org < >>>>> btell...@apache.org> đã viết: >>>>> >>>>>> Hello Quan, >>>>>> >>>>>> On 19/07/2021 17:59, Quan tran hong wrote: >>>>>>> Hi, >>>>>>> I am starting to implement ThreadIdGuessingAlgorithm for the >>>> distributed >>>>>>> module. Because this is a breaking change and I am new to Cassandra >>>> also, >>>>>>> therefore I want to have some discussion with you about how to do >> this. >>>>>> As long as we introduce a new table there is no reason that it creates >>>>>> breaking change, but getting the format right will ease our life down >>>>>> the line. >>>>>>> For the ones who did not catch up with this work, please have a look >> at >>>>>>> JMAP Threads specs [1] and my work related to this [2]. >>>>>>> >>>>>>> So my ideas on how to do this: >>>>>>> - Add a needed inputs Cassandra Table for guessing threadId >> algorithm. >>>>>>> Maybe a table likes: >>>>>>> CREATE TABLE ThreadRelatedTable ( >>>>>>> threadId timeuuid, >>>>>>> messageId timeuuid, >>>>>>> mimeMessageIds SET<text>, >>>>>>> subject text, >>>>>>> PRIMARY KEY (mimeMessageIds, subject) >>>>>>> ); >>>>>>> - Whenever we guess threadId for a new message, we access this table >>>> and >>>>>> do >>>>>>> the matching query to get related threadId(if there is) or decide new >>>>>>> message should have a new threadId. >>>>>>> - Whenever we save a new message, we save the thread-related data to >>>> this >>>>>>> table. >>>>>>> >>>>>>> This is my first come-up idea. Please express your thoughts about >> this. >>>>>> Collections are an advanced data modeling tool, that should be used >> with >>>>>> caution. I am not sure using it in a PRIMARY KEY is a good idea. I am >>>>>> not sure that does what you want (the full primary key should be >>>>>> specified to know which node hold the data. >>>>>> >>>>>> Also, once you found the message related to a thread you want to >>>>>> validate that the subject matches. This can be done on application >> side >>>>>> (James), and avoids complicated data model. >>>>>> >>>>>> I encourage you to validate your data model using a Cassandra in >> docker >>>>>> and executing CQL commands locally with CQLSH tool to simulate the >>>>>> queries you whish to do, and learn about your data model before even >>>>>> starting to implement it. IMO sharing CQL commands for creating the >>>>>> table, inserting data in it, and retrieving data from it would be a >>>>>> great follow up to this email. >>>>>> >>>>>> How would you populate the data of that table? >>>>>> >>>>>> Best regards, >>>>>> >>>>>> Benoit >>>>>>> Best regards, >>>>>>> >>>>>>> Quan >>>>>>> >>>>>>> [1] https://jmap.io/spec-mail.html#threads >>>>>>> [2] https://issues.apache.org/jira/browse/JAMES-3516 >>>>>>> >>>>>> --------------------------------------------------------------------- >>>>>> To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org >>>>>> For additional commands, e-mail: server-dev-h...@james.apache.org >>>>>> >>>>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org >>>> For additional commands, e-mail: server-dev-h...@james.apache.org >>>> >>>> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org >> For additional commands, e-mail: server-dev-h...@james.apache.org >> >> --------------------------------------------------------------------- To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org For additional commands, e-mail: server-dev-h...@james.apache.org