[notmuch] Idea for storing tags

2010-01-13 Thread Scott Morrison

On 2010-01-12, at 8:24 PM, martin f krafft wrote:

> also sprach Scott Morrison  [2010.01.12.1711 +1300]:
>> 1.  synchronization of tag data with emails -- if they are in
>> a subfolder then it presents the issue of maintaining this
>> subfolder when managing emails (moving, deleting, duplicating etc)
>> and any .tag-folder-unaware clients are likely to cause a breakage
>> in tagdata/message association.  One way of doing this is to have
>> a global .tag folder.
> 
> A global .tag folder indexed by e.g. message ID, as you state later,
> would probably allow for this. Or a file-per-tag design. We'd have
> to think carefully about pros and cons for each.
> 
> When thinking about this, I always have to remind myself that we are
> targeting this at a design that has indexed search. If that weren't
> the case, searches would be incredibly expensive.
> 
> Maybe a better approach would be content addressing (see below).


Content hashing -- good idea (and not something that had occurred to me before) --
better than Message-Id, as I believe there are still some MUAs/MTAs that allow
messages without Message-Ids.  The only potential issue is that it then becomes
critical to preserve the message source against encoding changes, though that
shouldn't be too hard to avoid.
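
To make the encoding concern concrete, here is a minimal sketch of a tag store keyed by content hash. Everything in it is an assumption for illustration: the helper names, the in-memory dict standing in for the tag database, and the choice of SHA-256 (Git's object store, mentioned in the quoted text, has traditionally used SHA-1).

```python
import hashlib

# Hypothetical in-memory tag store: content hash -> set of tags.
tag_db = {}

def content_key(raw_message: bytes) -> str:
    """Hash the exact bytes of the message source (assumed key scheme)."""
    return hashlib.sha256(raw_message).hexdigest()

def add_tag(raw_message: bytes, tag: str) -> None:
    tag_db.setdefault(content_key(raw_message), set()).add(tag)

def tags_for(raw_message: bytes) -> set:
    return tag_db.get(content_key(raw_message), set())

original = b"From: a@example.org\r\nSubject: hi\r\n\r\nHello\r\n"
add_tag(original, "inbox")

# A byte-identical duplicate maps to the same key, so it shares the tags.
duplicate = bytes(original)

# The same mail after a line-ending rewrite has a different key,
# so its tags can no longer be found.
reencoded = original.replace(b"\r\n", b"\n")
```

Under this scheme any client or server that rewrites line endings or re-encodes the body silently detaches the tags, which is why the message source would have to be preserved byte for byte.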

> 
>> 2. what happens if that message is archived or moved to an
>> exclusively local cache -- eg. Mail.app on OS X can easily move
>> IMAP messages to a folder resident on the computer?
> 
> Well, if the target can store tags, then ideally the MUA should know
> how to transfer them along.
> 
> Maybe the right thing to do would be to use extended attributes
> (which are stored in the inode!), even if they may not be
> universally supported yet. If our solution scales, then this might
> lead to a significant increase in xattr adoption.
The problem with anything that is not universally supported is that, for a
package that is meant to appeal to a wide user base, most users don't know and
don't care about the particulars of this IMAP server vs. that IMAP server.  All
they know is that for some reason it doesn't work with account X -- which leads
to support headaches.

> 
>> 3. what happens with duplicates of emails -- I would assume that
>> the message id would be the key to match the tag data to the
>> message.  In this system a duplicate of a message could not have
>> a different set of tags from the original (not that this would
>> necessarily be desirable.)
> 
> Duplicates need folders, and tags and folders are somewhat at odds
> with each other. I mean, you can represent a folder hierarchy with
> tags (and more), and if you have tags and folders, you are
> potentially introducing a level of confusion/ambiguity that we don't
> want in the first place. Maybe the ideal solution doesn't need
> folders anymore (and IMAP-compatible (Maildir) subfolders have
> always been a hack anyway).
> 
> There are also two types of duplicates: copies and links. The former
> can diverge, the latter can't. I don't really see a reason for
> either. It's not like you need to copy a mail before you edit it,
> and I don't see a real reason for linking, assuming that the primary
> means of browsing will be tag-searches anyway.
> 
> Duplicates always make me think of content addressing, like Git's
> object cache. We could store the content hash of a message in its
> filename, and also use the hash to index into the tag database.
> I think that would be much cleaner than message IDs, and would make
> handling true duplicates (links) much easier, while copies (diverged
> ex-duplicates) would also be taken care of automatically.

I agree that conceptually duplicates should be buried but end users do have 
"peculiar" organization systems.

> 
> -snip-

>> The performance issue is very real -- because it means that
>> somehow messages have to be rewritten to the IMAP server -- IMAP
>> doesn't have a mechanism AFAIK for updates.
> 
> Not even UIDPLUS?
> http://wiki.dovecot.org/FeatUIDPLUS


[notmuch] Idea for storing tags

2010-01-11 Thread Scott Morrison

Thought you would be interested in my experiences and thoughts from actually 
doing this kind of stuff.  

For my software MailTags (www.indev.ca/MailTags.html), I looked at all these
options and decided to go with storing tags in headers (as JSON-formatted data
in the X-MailTags header).
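
As a rough illustration of the tags-in-headers approach, the sketch below writes and reads a JSON payload in an X-MailTags header using Python's standard email package. Only the header name comes from this post; the JSON layout (`{"tags": [...]}`) and the helper names are assumptions and not the actual MailTags format.

```python
import json
from email import message_from_string

HEADER = "X-MailTags"  # header name from the post; the JSON layout is assumed

def write_tags(msg, tags):
    """Replace any existing tag header with a JSON-encoded tag list."""
    del msg[HEADER]  # does nothing if the header is absent
    msg[HEADER] = json.dumps({"tags": sorted(tags)})

def read_tags(msg):
    """Return the set of tags stored in the header, or an empty set."""
    value = msg[HEADER]
    return set(json.loads(value)["tags"]) if value else set()

msg = message_from_string("From: a@example.org\nSubject: hi\n\nHello\n")
write_tags(msg, {"work", "todo"})

# Serialise and re-parse to check that the tags survive a round trip.
roundtrip = message_from_string(msg.as_string())
```

The sketch keeps the payload short; a long tag list would be folded across several header lines on output, and a real implementation would need to cope with that when reading the value back.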

I have thought seriously about using pseudo emails stored in a specially named 
directory but feel there are a couple of issues with this.
1. synchronization of tag data with emails -- if they are in a
subfolder then it presents the issue of maintaining this subfolder when
managing emails (moving, deleting, duplicating etc.), and any clients unaware
of the .tag folder are likely to cause a breakage in the tag data/message
association.  One way of handling this is to have a global .tag folder.

2. what happens if that message is archived or moved to an exclusively
local cache -- e.g. Mail.app on OS X can easily move IMAP messages to a folder
resident on the computer?

3. what happens with duplicates of emails -- I would assume that the
message id would be the key to match the tag data to the message.  In this
system a duplicate of a message could not have a different set of tags from the
original (not that this would necessarily be desirable).


As I mentioned, I went with tags in headers -- though this has its own 
drawbacks.
Your mention of potential leakage (aka inadvertent disclosure of tag
data) is real -- but only if the client used to bounce/forward is not the one
used to tag the message (one would assume that if a client can tag, it can know
to exclude the tags in a bounce).  Mail.app -- which I am plugging into -- does
not forward headers, though it will include all headers in a bounce -- but
chances are you aren't tagging messages you are bouncing. :)
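
For what it is worth, excluding the tags in a tag-aware client could be as small as the sketch below, which drops the tag header from the raw message before it is bounced. The function name and the list of private headers are assumptions for illustration; this is not how Mail.app or MailTags does it.

```python
from email import message_from_bytes

PRIVATE_HEADERS = ("X-MailTags",)  # assumed list; extend as needed

def strip_private_headers(raw: bytes) -> bytes:
    """Return the message with all private tag headers removed."""
    msg = message_from_bytes(raw)
    for name in PRIVATE_HEADERS:
        del msg[name]  # case-insensitive; no error if the header is absent
    return msg.as_bytes()

tagged = (
    b"From: boss@example.org\n"
    b"Subject: report\n"
    b'X-MailTags: {"tags": ["from-idiots"]}\n'
    b"\n"
    b"Please review.\n"
)
outgoing = strip_private_headers(tagged)
```

This only helps when the bouncing client is the one doing the stripping; a tag-unaware client would still send the header along.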

The performance issue is very real -- because it means that somehow
messages have to be rewritten to the IMAP server -- IMAP doesn't have a
mechanism AFAIK for updates.  Additionally, IMAP doesn't have a mechanism for
simply replacing one message's data with another's -- a new message must be
written, the old message must be deleted, the message's IMAP UID will change,
and the client will have to deal with this, especially if it is caching the
messages.
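
The append-then-delete sequence described above can be sketched with Python's imaplib as follows. This is an illustration under assumptions, not MailTags code: `replace_message` is a hypothetical helper, and `conn` is assumed to be an `imaplib.IMAP4` connection (or a look-alike) with the mailbox already selected read-write.

```python
import imaplib  # conn is expected to behave like imaplib.IMAP4

def replace_message(conn, mailbox, old_uid, new_raw):
    """Emulate an update: APPEND the new copy, then delete the old one."""
    # Write the rewritten message first, so a failure never loses the mail.
    typ, data = conn.append(mailbox, None, None, new_raw)
    if typ != "OK":
        raise RuntimeError("APPEND failed; old message left in place: %r" % (data,))
    # Flag the old copy as deleted and purge it.
    conn.uid("STORE", old_uid, "+FLAGS", r"(\Deleted)")
    # Note: a plain EXPUNGE removes every \Deleted message in the mailbox.
    conn.expunge()
```

The new copy receives a fresh UID, so any client cache keyed on the old UID has to be refreshed. A server advertising UIDPLUS would additionally report the new UID in its APPEND response and allow a targeted UID EXPUNGE; both are left out of this sketch.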

Also Gmail IMAP is an issue -- Gmail IMAP is not IMAP -- it simply
doesn't work like a true IMAP server -- writes to folders in Gmail IMAP are
translated to database updates that associate a single record of the
message with the folder it was "written" to.  Changing headers on a Gmail IMAP
message simply will not work because it will reject the message as an update
of the single record (and not actually write the new data).

Still, tags in headers meant that I didn't have to worry about making sure that
the .tags folder is maintained appropriately (throughout moves and deletions),
and that the data is stored much closer to the message for data recovery, if it
is ever needed, and for archiving tags.  In any case, this is what I have
working -- though I am open to considering new approaches.

Scott

ps.  
also see my post to the mailtags-list from a few years back
http://lists.madduck.net/pipermail/mailtags/2007-August/msg00017.html

On 2010-01-11, at 5:19 PM, martin f krafft wrote:

> Folks, over in #notmuch, we just floated an idea that I'd like to
> get out to you. We've been debating storing tags for messages.
> Therefore I am cross-posting. Please forgive me.
> 
> So far, there are two approaches:
> 
> 1. External database, which has the downside of not being
>  synchronisable with standard IMAP, like the rest of your mail
>  (assuming you use IMAP). Also, it's possible for mailstore and
>  database to get out of sync.
> 
> 2. In-headers, which has the downside of leaking (e.g. when
>  bouncing), and incurs the risks associated with message rewrites
>  (which I think is pretty much ignorable, but it's still there).
>  Also, there's a performance issue, but in the context of an
>  indexer like notmuch, this is negligible.
> 
>  The leakage is real, though, and I think it makes in-headers
>  unusable. After all, I don't ever want anyone else to know that
>  I tag e-mails from my boss as "from-idiots", and I forward and
>  bounce mail on a regular basis. I could tell my MTA to remove
>  those headers, but I might forget to do that on a new system.
> 
> We also previously determined that IMAP keywords are pretty much
> useless as they are stored per mailbox, not per message, not
> standardised, and limited in their length anyway [0]. This also
> means that we don't really need to investigate sensibly storing tags
> in Maildir (e.g. with xattrs), because IMAP cannot transport them.
> 
> 0. http://lists.madduck.net/pipermail/mailtags/2007-August/msg00016.html
> 
> Seriously, who implemented IMAPv4rev1 and what sort of crack were
> they smoking??
> 
> I remember there was some KDE groupware contacts manager that used
> IMAP to synchronise