Re: [Evolution-hackers] A Camel API to get the filename of the cache, also a proposal to have one format to rule them all
Srinivasa Ragavan wrote:

Hey Philip, [I'm lagging in my mail-replies, still a lot to go, due to my 3 week vacation.]

On Fri, 2009-01-02 at 13:25 +0100, Philip Van Hoof wrote:

Hi there evos,

For an EPlugin that I'm working on I will need a Camel API to get the filename of the cache.

Sure and the patch seems fine to me, but the Exchange portion of the patch is missing. It should be similar/simple.

I will attach a patch that adds this API. The EPlugin that I'm developing is available at Bug# 565091 and more information about it can be found at http://live.gnome.org/Evolution/Metadata. I added a bug for tracking this request: http://bugzilla.gnome.org/show_bug.cgi?id=566279

I know that for maildir (cur, tmp, new) and mbox (seek position) it's a little bit controversial to return a filename. For maildir I always use the cur-file one and for mbox I added /!seek_pos to the end of the returned filename.

The reason why I need this is that for indexing already cached E-mails, Tracker will MIME parse what we can MIME parse. For example, filenames and Exif data of attached images are stolen out of the cached items, to be made searchable. We don't want to require Evolution to eat all the code involved in indexing massive amounts of file formats. Best thing we can do right now is to simply pass the filenames over IPC.

We STRONGLY recommend to the Evolution team to:

a) migrate away the IMAP specific data cache (see c to store separate parts)

I thought we already store parts separate. Is it just about the encoding or more than that? I seriously don't have an idea on this. Maybe Fejj, Sankar, Matt can add on it.

migrating away from the IMAP specific data cache would be good.

b) migrate away the mbox data cache (the all-in-one file crap)

I'm all for it. Once I thought of doing this, but the options were like Maildir or a format of one mbox file per mail in a distributed folder [CamelDataCache sort of format, like imap4/GW/Exchange]. But IIRC Fejj had some concern like, Local still might be good to be held in a 'standards' way. I know it hurts us on expunge/mailbox rewrite etc.

what mbox data cache? CamelDataCache would probably be the best cache to use for IMAP.

And to c) invent a better storage format that doesn't store the attachments in server's (usually) Base64 encoding. The one format to rule them all.

Instead store the encoded attachments in decoded format (original file format). This will reduce diskspace (encoding increases diskspace usage) and will make it easier to scan the original file for XMP and Exif information.

Don't try to gzip or whatever anything. None of that makes any sense (original files are usually compressed ideally already). For example: devices that want to compress have filesystems that do this for you. Don't be silly trying to do this yourself.

By storing the encoded version the only thing you currently gain is that the "view E-mail source" feature doesn't need to recode the attachments. This ain't a much-used feature. It doesn't have to be fast, at all. No it doesn't. Really it doesn't.

Is thatz it? I need some other opinions, I don't have many thoughts here. Sankar, Matt, Fejj?

this can cause problems if you need to verify signed parts because re-encoding them might not result in the same output.

For Maildir I recommend wasting diskspace by storing both the original Maildir format and in parallel store the attachments separately. Maildir ain't accessible by current Evolution's UI, by the way.

For MBox I recommend TO STOP USING THIS BROKEN FORMAT. It's insane with today's mailboxes that easily grow to 3 gigabytes in size per user.

I second your thoughts for MBox stuff.

Eh, I think mbox works fine but I can understand wanting to move to Maildir which is also fine :-)

Once all start using the CamelDataCache API, implementing that new format and implementing converters won't be very hard. For existing CamelDataCache users it's just one format to convert. For IMAP, mbox, Maildir and mh it's indeed a few extra formats to handle using a conversion. Won't kill you to implement that, and, I'll help.

Thatz so nice of you to help us :-)

-Srini

Jeff

___
Evolution-hackers mailing list
Evolution-hackers@gnome.org
http://mail.gnome.org/mailman/listinfo/evolution-hackers
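A minimal sketch of how a consumer of the proposed API might take apart the mbox form of the returned name, where /!seek_pos is appended to the mbox file path. The function name split_mbox_cache_name and the assumption that seek_pos is a plain decimal byte offset are illustrative only; neither is confirmed by the patch discussed above.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of Camel: split "<mbox path>/!<seek_pos>"
 * into the mbox file path and the byte offset of the message.
 * Assumes seek_pos is written as plain decimal digits.
 * Returns 0 on success, -1 if there is no valid "/!<digits>" suffix
 * or if the path does not fit into path_out. */
static int
split_mbox_cache_name (const char *name, char *path_out, size_t path_size,
                       long *offset_out)
{
	const char *marker = NULL;
	const char *p;
	char *end;
	long offset;
	size_t path_len;

	assert (name != NULL && path_out != NULL && offset_out != NULL);

	/* Use the last "/!" so an earlier "/!" inside the path is left alone. */
	for (p = strstr (name, "/!"); p != NULL; p = strstr (p + 1, "/!"))
		marker = p;
	if (marker == NULL)
		return -1;

	/* The suffix must start with a digit (no sign, no whitespace). */
	if (marker[2] < '0' || marker[2] > '9')
		return -1;

	errno = 0;
	offset = strtol (marker + 2, &end, 10);
	if (errno != 0 || *end != '\0')
		return -1;

	path_len = (size_t) (marker - name);
	if (path_len + 1 > path_size)
		return -1;

	memcpy (path_out, name, path_len);
	path_out[path_len] = '\0';
	*offset_out = offset;
	return 0;
}
```

An indexer would then open the returned path and fseek() to the offset before handing the stream to a MIME parser. A maildir name carries no such suffix, so the helper returns -1 and the name can be used as a file path directly.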
Re: [Evolution-hackers] A Camel API to get the filename of the cache, also a proposal to have one format to rule them all
Philip Van Hoof wrote:

On Mon, 2009-01-05 at 08:25 -0500, Jeffrey Stedfast wrote:

migrating away from the IMAP specific data cache would be good.

Yes. I think IMAP and the local providers are the only ones that are still using a specialized datacache. The IMAP4 one, for example, ain't using a specialized one.

b) migrate away the mbox data cache (the all-in-one file crap)

I'm all for it. Once I thought of doing this, but the options were like Maildir or a format of one mbox file per mail in a distributed folder [CamelDataCache sort of format, like imap4/GW/Exchange]. But IIRC Fejj had some concern like, Local still might be good to be held in a 'standards' way. I know it hurts us on expunge/mailbox rewrite etc.

what mbox data cache? CamelDataCache would probably be the best cache to use for IMAP.

Although I would change CamelDataCache to store individual MIME parts as separate files instead of files that look like a single-mail MBox file.

it's really just the raw message/rfc822 format, not really mbox - there's no From line for example. that doesn't need to be part of the cache logic. that can be part of the key.

I would also decode the separate MIME parts before storing if the original E-mail had them encoded (which is usually the case, and always for binary attachments). This is to make it easier for metadata engines to index the MIME parts, and to allow such engines to do this efficiently. Perhaps also to reduce disk-space, as encoded consumes more disk-space, but that is for me just a nice side-effect.

So my format would create a directory for each E-mail, or prefix each MIME part with the uid. Perhaps

    INBOX/subfolders/temp/1.               // headers+multipart container
    INBOX/subfolders/temp/1.1              // multipart container
    INBOX/subfolders/temp/1.1.1            // text/plain
    INBOX/subfolders/temp/1.1.2            // text/html
    INBOX/subfolders/temp/1.2.1            // inline JPeg attachment
    INBOX/subfolders/temp/1.BODYSTRUCTURE  // Bodystructure of the E-mail
    INBOX/subfolders/temp/1.ENVELOPE       // Top envelope of the E-mail

sure, this can be done with the key tho. instead of using the uid as the key, use uid.1 or uid.1.2 etc

ps. Perhaps I would store 1.BODYSTRUCTURE in the database instead. I would probably store 1.ENVELOPE in the database (like how it is now).

yea, I think it makes sense to store BODYSTRUCTURE in the folder summary.

I would probably on top of storing BODYSTRUCTURE and ENVELOPE in the database also store them in separate files. Even if most filesystems will consume 4k or more (sector or block size) for those mini files.

To get the JPeg attachment:

    $ cp INBOX/subfolders/temp/1.2.1 ~/mommy.jpeg
    $ exif INBOX/subfolders/temp/1.2.1
    EXIF tags in 'INBOX/subfolders/temp/1.2.1' ('Intel' byte order):
    --------------------+--------------------------------
    Tag                 |Value
    --------------------+--------------------------------
    Image Description   |Mommy with cake at birthday
    Manufacturer        |SONY
    Model               |DSC-T33
    ...

    $ tracker-search -s EMails birthday
    Results:
      email://u...@server/INBOX/temp/1
      email://u...@server/INBOX/temp/1#2.1
      ~/mommy.jpeg

[CUT]

this can cause problems if you need to verify signed parts because re-encoding them might not result in the same output.

Ok, for signatures I guess we can make an exception and keep them encoded in their original format then.

For Maildir I recommend wasting diskspace by storing both the original Maildir format and in parallel store the attachments separately. Maildir ain't accessible by current Evolution's UI, by the way.

For MBox I recommend TO STOP USING THIS BROKEN FORMAT. It's insane with today's mailboxes that easily grow to 3 gigabytes in size per user.

I second your thoughts for MBox stuff.

Eh, I think mbox works fine but I can understand wanting to move to Maildir which is also fine :-)

Maildir doesn't store individual MIME parts separately. So Maildir is equally hard to handle for metadata engines as MBox is. Only difference with MBox is that we need to seek() to some location.

So Maildir doesn't make it possible for us to let app developers implement indexing plugins easily, like a typical exif extractor.

I guess, but they could just link with gmime or camel :p

Jeff
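The "use uid.1 or uid.1.2 as the key" suggestion can be sketched as a small key builder. This is only an illustration: build_part_key is a hypothetical name, and how such a key would be passed to CamelDataCache is not shown here because the thread does not specify it.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of Camel: build a per-part cache key of the
 * form "<uid>.<n1>.<n2>..." from a message uid and the position of a MIME
 * part inside the message (for example {2, 1} for part 2.1).
 * A depth of 0 yields just the uid.
 * Returns 0 on success, -1 if the key does not fit into key_size bytes. */
static int
build_part_key (const char *uid, const unsigned int *part_path, size_t depth,
                char *key, size_t key_size)
{
	size_t used;
	size_t i;
	int n;

	assert (uid != NULL && key != NULL);
	assert (depth == 0 || part_path != NULL);

	used = strlen (uid);
	if (used + 1 > key_size)
		return -1;
	memcpy (key, uid, used + 1);

	for (i = 0; i < depth; i++) {
		/* snprintf reports the length it wanted, so truncation is detectable. */
		n = snprintf (key + used, key_size - used, ".%u", part_path[i]);
		if (n < 0 || (size_t) n >= key_size - used)
			return -1;
		used += (size_t) n;
	}

	return 0;
}
```

With uid "1" and part path {2, 1} this produces "1.2.1", which matches the file name used for the inline JPEG in the listing above.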
Re: [Evolution-hackers] camel-folder-summary.c - 64bit-ness ...
Michael Meeks wrote:

Hi guys,

I was just trying to reproduce some migration performance tests with my mbox and summary data rsync'd from a 32bit machine to a 64bit machine. Surprisingly this appears to crash immediately.

that's not good :(

Looking at the camel-file-utils.c code I was surprised to see simultaneously an apparent concern for network byte ordering:

    camel_file_util_encode_fixed_int32 (FILE *out, gint32 value)
    {
        guint32 save;

        save = g_htonl (value);
        if (fwrite (&save, sizeof (save), 1, out) != 1)
            return -1;
        return 0;
    }

and also things like:

    CFU_ENCODE_T(time_t)

that appear to generate data based on the sizeof the platform's time_t - on my 64bit machine time_t is 8 bytes, on 32bit it is only 4.

yea, unfortunately the old summary format wasn't designed with 32/64 bit compat. Now that a lot of people are moving from 32bit to 64bit as they upgrade to 64bit x86's, it would probably be good to look into. Although, to be fair, summary files can be re-generated pretty easily. Unfortunately, for IMAP, while it may be easy, it's not very fast :(

Presumably this summary code is made obsolete by the new SQLite summary code - and modulo some data as to what architecture a file was written by, it's perhaps less than obvious how to fix this.

nod. the only idea I can come up with is having some logic in the loading code that tries to figure out why loading failed and to see if it might have something to do with 32bit vs 64bit int sizes in the summary file. Not sure how doable it is.

Also - why we're not using fgetc_unlocked in these tight loops I don't know.

isn't that a GNUism? To be honest, I didn't even know the function existed until a year or so ago when I was looking into Mono vs Java I/O performance based on the Debian Language Shootout tests. I happened to look at the C implementation and saw fgets_unlocked() and looked into it. IIRC, replacing it with fgets() didn't make any noticeable difference in performance. I just figured it was a micro-optimization ;-)

I guess I need an old evo. version to re-build all my summaries for 64bit now; or am I barking up the wrong tree ?

I would imagine so, yea.

Jeff
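The time_t problem described above goes away if the on-disk width is fixed instead of following sizeof (time_t). The following is a sketch only, not the Camel code and not a fix anyone in the thread proposed: pack_time64 and unpack_time64 are hypothetical names, and the choice of 8 big-endian bytes is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper, not part of Camel: store a time_t as exactly 8 bytes
 * in big-endian (network) order, whatever sizeof (time_t) is on the machine
 * doing the writing. */
static void
pack_time64 (time_t value, unsigned char out[8])
{
	uint64_t v = (uint64_t) (int64_t) value;
	int i;

	assert (out != NULL);

	for (i = 7; i >= 0; i--) {
		out[i] = (unsigned char) (v & 0xff);
		v >>= 8;
	}
}

/* Inverse of pack_time64. On a platform with a 4-byte time_t, values that
 * do not fit are narrowed by the final cast. The uint64_t to int64_t
 * conversion assumes a two's complement platform. */
static time_t
unpack_time64 (const unsigned char in[8])
{
	uint64_t v = 0;
	int i;

	assert (in != NULL);

	for (i = 0; i < 8; i++)
		v = (v << 8) | in[i];

	return (time_t) (int64_t) v;
}
```

A writer would fwrite() the 8 packed bytes and a reader would fread() 8 bytes and unpack them, so a file written on a 32bit machine has the same layout as one written on a 64bit machine. This does not help with summary files that already exist in the old format.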
Re: [Evolution-hackers] A Camel API to get the filename of the cache, also a proposal to have one format to rule them all
On Mon, 2009-01-05 at 09:41 -0500, Jeffrey Stedfast wrote:

Philip Van Hoof wrote:

On Mon, 2009-01-05 at 08:25 -0500, Jeffrey Stedfast wrote:

Maildir doesn't store individual MIME parts separately. So Maildir is equally hard to handle for metadata engines as MBox is. Only difference with MBox is that we need to seek() to some location.

So Maildir doesn't make it possible for us to let app developers implement indexing plugins easily, like a typical exif extractor.

I guess, but they could just link with gmime or camel :p

Which is what Tracker is doing at this moment. But for various reasons we still end up copying the E-mail's decoded attachments to /tmp, then scan them with the indexer's plugins, and then unlink() the files.

Suffice to say that this ain't ideal when scanning 10,000 E-mails that way.

Much more efficient for us would be to simply enter evo's caches and read the MIME parts as normal, already decoded files.

I also think such a format would improve some of Evolution's own features:

o. For example, making a thumbnail of an image could use the platform's infrastructure, and see it being cached using the thumbnail-spec.

   Less code

o. Another feature is the Save as feature for attachments. Instead of having to open a GFile and using CamelStream converted to a GOutputStream and decode-streaming it to that stream to save the attachment on the filesystem, you just copy the file.

   Less code

o. Inline image viewers: Instead of having to plug the decoded memory of the attachment into a blob of memory, you just use any image viewer.

   Less code

o. Inline attached images for text/html MIME part viewers: right now migrating GtkHTML to WebKit or GtkMozEmbed is hard because GtkHTML had implemented some special thing that allows it to get itself a blob of memory fed as pixmap buffer for images whose src attribute starts with cid.

   Less code

I'm not even sure if WebKit and GtkMozEmbed support rendering blobs of memory. Although I have been asking the developers of the respective components about this at nearly each conference where I meet them. They all promised to at least offer some sort of infrastructure for this. Lots of promises ;)

After thinking about it very hard, and quite a lot, I didn't find any good reason to store attachments in Base64 encoding. I only found reasons why you would want to store it decoded:

Less code, same features

The only argument for storing in Base64 encoding could be the feature: View the source of this E-mail.

You can perform the Base64 encode as the E-mail becomes visible in the E-mail source viewer, so it's not a good reason (let's say this introduces 5 lines of camel_stream_* code).

You could say: because we want to use a standard for our storage:

- Maildir can't work on Windows because the author of the spec refuses to change the character ':' into '!' for the filenames. Which renders his entire specification completely useless. Windows is not irrelevant, it's being used a lot. Ignoring it is like carving the word stupid on your head with a knife.

  But fine, let him. We are free to ignore his spec, right?

  Maildir also doesn't specify storing MIME parts as separate files.

- MBox is just broken. You can't put 3Gigs of data in one file, require a rewrite of that file each time you want to remove 1kb of data out of it and have no index on it (this, at least, is something Maildir got right by letting the kernel's FS take care of that: atomic renames and DIR is quite good as an index). An MBox file is a ticking timebomb waiting to get corrupted.

  MBox also doesn't specify storing MIME parts as separate files.

- What other formats do we have? Is there one so called standard format that stores MIME parts as individual decoded files?

  Because if not then just like the Maildir-guy I'll quickly make a website and give it a name. And then let's all start calling it a standard. Problem solved?

  It's not that Maildir is really that much more than that. A website that describes a broken way of storing E-mails.

  Well, ok, a few IMAP server guys decided to use that specification to shut up people who say that IMAP servers that store in a binary format are not compatible with their freedom religions.

  Of course that's an ill-educated point of view, but who cares. Freedom! *sigh*

--
Philip Van Hoof, freelance software developer
home: me at pvanhoof dot be
gnome: pvanhoof at gnome dot org
http://pvanhoof.be/blog
http://codeminded.be
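To illustrate the claim that Base64 can be produced on demand for a view-source feature, here is a self-contained encoder sketch in plain C. It deliberately does not use the camel_stream_* API mentioned above; base64_encode and base64_encoded_len are hypothetical names. The output is unwrapped RFC 4648 Base64, so it will generally not be byte-identical to what the sending mailer produced (line lengths may differ), which is the signed-parts concern raised earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the Base64 text for n input bytes, excluding the trailing NUL:
 * 4 output characters for every 3 input bytes, rounded up. This 4/3 ratio
 * is the disk-space overhead of storing attachments encoded. */
static size_t
base64_encoded_len (size_t n)
{
	return ((n + 2) / 3) * 4;
}

/* Hypothetical helper: encode n bytes as RFC 4648 Base64 without line
 * wrapping. out must have room for base64_encoded_len (n) + 1 bytes. */
static void
base64_encode (const unsigned char *in, size_t n, char *out)
{
	static const char alphabet[] =
		"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
	size_t i;
	size_t o = 0;

	assert (out != NULL && (in != NULL || n == 0));

	/* Full 3-byte groups become 4 characters each. */
	for (i = 0; i + 2 < n; i += 3) {
		out[o++] = alphabet[in[i] >> 2];
		out[o++] = alphabet[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)];
		out[o++] = alphabet[((in[i + 1] & 0x0f) << 2) | (in[i + 2] >> 6)];
		out[o++] = alphabet[in[i + 2] & 0x3f];
	}

	/* One or two leftover bytes are padded with '='. */
	if (i < n) {
		out[o++] = alphabet[in[i] >> 2];
		if (i + 1 < n) {
			out[o++] = alphabet[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)];
			out[o++] = alphabet[(in[i + 1] & 0x0f) << 2];
		} else {
			out[o++] = alphabet[(in[i] & 0x03) << 4];
			out[o++] = '=';
		}
		out[o++] = '=';
	}

	out[o] = '\0';
}
```

A source viewer built this way would read the decoded part from the cache, encode it, and wrap the lines itself. Parts that must verify against a signature would still need to be kept in their original encoded form.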