Re: [Evolution-hackers] A Camel API to get the filename of the cache, also a proposal to have one format to rule them all

2009-01-06 Thread Jeffrey Stedfast
Srinivasa Ragavan wrote:
 Hey Philip,

 [I'm lagging in my mail-replies, still a lot to go, due to my 3 week
 vacation.]

 On Fri, 2009-01-02 at 13:25 +0100, Philip Van Hoof wrote:
   
 Hi there evos,

 For an EPlugin that I'm working on I will need a Camel API to get the
 filename of the cache.
 

 Sure and the patch seems fine to me, but the Exchange portion of the
 patch is missing. It should be similar/simple.
   
 I will attach a patch that adds this API. The EPlugin that I'm developing is
 available at Bug# 565091 and more information about it can be found at

 http://live.gnome.org/Evolution/Metadata.


 I added a bug for tracking this request:

 http://bugzilla.gnome.org/show_bug.cgi?id=566279

 I know that for maildir (cur, tmp, new) and mbox (seek position) it's a
 little bit controversial to return a filename. For maildir I always use
 the cur-file one and for mbox I added /!seek_pos to the end of the
 returned filename. 
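
 A consumer of the proposed API would have to split such an mbox name back
 into a file path and a seek position. The following is only a sketch of how
 that could look, assuming the returned string has the shape
 "<path>/!<decimal offset>" as described above. The function name
 split_mbox_cache_name is hypothetical and not part of Camel or of the patch.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: split "<path>/!<offset>" into its two parts.
 * Returns 0 on success, -1 if the name has no valid "/!<digits>" suffix
 * or if the path does not fit into the caller's buffer. */
static int
split_mbox_cache_name (const char *name, char *path, size_t path_size,
                       long long *offset)
{
	const char *sep = NULL, *p = name;
	char *end;
	long long value;
	size_t n;

	/* Use the last "/!" so that earlier path components are left alone. */
	while ((p = strstr (p, "/!")) != NULL) {
		sep = p;
		p += 2;
	}
	if (sep == NULL || !isdigit ((unsigned char) sep[2]))
		return -1;

	errno = 0;
	value = strtoll (sep + 2, &end, 10);
	if (errno != 0 || *end != '\0')
		return -1;

	n = (size_t) (sep - name);
	if (n + 1 > path_size)
		return -1;
	memcpy (path, name, n);
	path[n] = '\0';
	*offset = value;
	return 0;
}
```

 The offset is parsed as long long because the mailboxes discussed in this
 thread can exceed what a 32-bit long can address.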

 The reason why I need this is that for indexing already cached E-mails,
 Tracker will MIME parse what we can MIME parse. For example, filenames
 and Exif data of attached images are stolen out of the cached items, to
 be made searchable.

 We don't want to require Evolution to eat all the code involved in
 indexing massive amounts of file formats. Best thing we can do right now
 is to simply pass the filenames over IPC.

 We STRONGLY recommend to the Evolution team to:

 a) migrate away the IMAP specific data cache (see c to store separate parts)
 
 I thought we already store parts separately. Is it just about the encoding
 or more than that? I seriously don't have an idea on this. Maybe Fejj,
 Sankar, Matt can add to it.
   

migrating away from the IMAP specific data cache would be good.

   
 b) migrate away the mbox data cache (the all-in-one file crap)
 
 I'm all for it. Once I thought of doing this, but the options were like
 Maildir or a format of one mbox file per mail in a distributed folder
 [CamelDataCache sort of format, like imap4/GW/Exchange]. But IIRC Fejj,
 had some concern like, Local still might be good to be held in a
 'standards' way. I know it hurts us on expunge/mailbox rewrite etc.
   

what mbox data cache? CamelDataCache would probably be the best cache to
use for IMAP.

   
 And to

 c) invent a better storage format that doesn't store the attachments in
 server's (usually) Base64 encoding. The one format to rule them all.

 Instead store the encoded attachments in decoded format (original file
 format). This will reduce diskspace (encoding increases diskspace usage)
 and will make it more easy to scan the original file for XMP and Exif
 information. Don't try to gzip or whatever anything. None of that makes
 any sense (original files are usually compressed ideally already).
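
 To put a rough number on the disk-space claim: Base64 turns every 3 raw
 bytes into 4 characters, and MIME bodies add a CRLF per output line. The
 sketch below just does that arithmetic. The function names are made up for
 illustration, and the 76-character line length used in the example is the
 MIME maximum, not something measured from Evolution's cache.

```c
#include <stddef.h>

/* Base64 characters needed for n raw bytes (padding included, no line breaks). */
static size_t
b64_chars (size_t n)
{
	return ((n + 2) / 3) * 4;
}

/* Bytes on disk when the Base64 text is wrapped at line_len characters,
 * assuming each line, including the last one, ends in CRLF. */
static size_t
b64_wrapped_size (size_t n, size_t line_len)
{
	size_t chars = b64_chars (n);
	size_t lines;

	if (line_len == 0)
		return chars;
	lines = (chars + line_len - 1) / line_len;
	return chars + 2 * lines;
}
```

 For a 570-byte attachment this gives 760 characters, or 780 bytes when
 wrapped at 76 columns, so the encoded copy is a bit more than a third
 larger than the original.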

 For example: devices that want to compress have filesystems that do this
 for you. Don't be silly trying to do this yourself.

 By storing the encoded version the only thing you currently gain is that
 the "view E-mail source" feature doesn't need to re-encode the attachments.

 This ain't a much-used feature. It doesn't have to be fast, at all.

 No it doesn't. Really it doesn't.
 
 Is that it? I need some other opinions, I don't have many thoughts
 here. Sankar, Matt, Fejj?
   

this can cause problems if you need to verify signed parts because
re-encoding them might not result in the same output.
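
One concrete way this goes wrong is line wrapping: the same bytes can be
Base64-encoded with different line lengths, and a signature made over the
encoded form as it was sent will not match a re-encoding that wraps
differently. The sketch below is a plain C illustration with a made-up
helper name (b64_encode_wrapped), not Camel or GMime code.

```c
#include <stddef.h>

/* Encode len bytes as Base64 into out, inserting CRLF after every `wrap`
 * output characters (wrap == 0 disables wrapping). out must be large
 * enough for the result plus a terminating NUL. Returns the number of
 * bytes written, not counting the NUL. */
static size_t
b64_encode_wrapped (const unsigned char *in, size_t len, size_t wrap, char *out)
{
	static const char tbl[] =
		"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
	size_t i, k, o = 0, col = 0;

	for (i = 0; i < len; i += 3) {
		size_t rem = len - i;
		unsigned long v = (unsigned long) in[i] << 16;
		char quad[4];

		if (rem > 1)
			v |= (unsigned long) in[i + 1] << 8;
		if (rem > 2)
			v |= (unsigned long) in[i + 2];

		quad[0] = tbl[(v >> 18) & 0x3f];
		quad[1] = tbl[(v >> 12) & 0x3f];
		quad[2] = rem > 1 ? tbl[(v >> 6) & 0x3f] : '=';
		quad[3] = rem > 2 ? tbl[v & 0x3f] : '=';

		for (k = 0; k < 4; k++) {
			if (wrap != 0 && col == wrap) {
				out[o++] = '\r';
				out[o++] = '\n';
				col = 0;
			}
			out[o++] = quad[k];
			col++;
		}
	}
	out[o] = '\0';
	return o;
}
```

Encoding the 12 bytes of "hello world!" with wrap 76 and with wrap 8 yields
two different byte sequences that decode to the same content, which is
exactly the case where a stored-decoded part could no longer be verified.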

 For Maildir I recommend wasting diskspace by storing both the original
 Maildir format and in parallel store the attachments separately.

 Maildir ain't accessible by current Evolution's UI, by the way.

 For MBox I recommend TO STOP USING THIS BROKEN FORMAT. It's insane with
 today's mailboxes that easily grow to 3 gigabytes in size per user.
 
 I second your thoughts for MBox stuff. 
   

Eh, I think mbox works fine but I can understand wanting to move to
Maildir which is also fine :-)

   
 Once all start using the CamelDataCache API, implementing that new
 format and implementing converters won't be very hard. 

 For existing CamelDataCache users it's just one format to convert. For
 IMAP, mbox, Maildir and mh it's indeed a few extra formats to handle
 using a conversion. Won't kill you to implement that, and I'll help.
 

 That's so nice of you to help us :-)

 -Srini


   

Jeff
___
Evolution-hackers mailing list
Evolution-hackers@gnome.org
http://mail.gnome.org/mailman/listinfo/evolution-hackers


Re: [Evolution-hackers] A Camel API to get the filename of the cache, also a proposal to have one format to rule them all

2009-01-06 Thread Jeffrey Stedfast
Philip Van Hoof wrote:
 On Mon, 2009-01-05 at 08:25 -0500, Jeffrey Stedfast wrote:

   
 migrating away from the IMAP specific data cache would be good.
 

 Yes. I think IMAP and the local providers are the only ones that are
 still using a specialized datacache.

 The IMAP4 one, for example, ain't using a specialized one.

   
 b) migrate away the mbox data cache (the all-in-one file crap)
 
 
 I'm all for it. Once I thought of doing this, but the options were like
 Maildir or a format of one mbox file per mail in a distributed folder
 [CamelDataCache sort of format, like imap4/GW/Exchange]. But IIRC Fejj,
 had some concern like, Local still might be good to be held in a
 'standards' way. I know it hurts us on expunge/mailbox rewrite etc.
   
   
 what mbox data cache? CamelDataCache would probably be the best cache to
 use for IMAP.
 

 Although I would change CamelDataCache to store individual MIME parts as
 separate files instead of files that look like a single-mail MBox file.
   
it's really just the raw message/rfc822 format, not really mbox -
there's no "From " line for example.

that doesn't need to be part of the cache logic. that can be part of the
key.

 I would also decode the separate MIME parts before storing if the
 original E-mail had them encoded (which is usually the case, and always
 for binary attachments). This to make it more easy for metadata engines
 to index the MIME parts, and to allow such to do this efficiently. 

 Perhaps also to reduce disk-space, as encoded consumes more disk-space,
 but that is for me just a nice side-effect.

 So my format would create a directory for each E-mail, or prefix each
 MIME part with the uid. Perhaps

 INBOX/subfolders/temp/1.  // headers+multipart container
 INBOX/subfolders/temp/1.1 // multipart container
 INBOX/subfolders/temp/1.1.1   // text/plain
 INBOX/subfolders/temp/1.1.2   // text/html
 INBOX/subfolders/temp/1.2.1   // inline JPeg attachment
 INBOX/subfolders/temp/1.BODYSTRUCTURE // Bodystructure of the E-mail
 INBOX/subfolders/temp/1.ENVELOPE  // Top envelope of the E-mail
   

sure, this can be done with the key tho. instead of using the uid as the
key, use uid.1 or uid.1.2 etc
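
A minimal sketch of building such a key, assuming the part is identified by
a dotted specifier like "1.2" or a label like "BODYSTRUCTURE". The helper
name make_part_key is hypothetical; it only produces the string and does not
call any CamelDataCache function.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: build "<uid>" or "<uid>.<part_spec>" in key.
 * Returns 0 on success, -1 if the result would not fit. */
static int
make_part_key (const char *uid, const char *part_spec,
               char *key, size_t key_size)
{
	int n;

	if (part_spec == NULL || *part_spec == '\0')
		n = snprintf (key, key_size, "%s", uid);
	else
		n = snprintf (key, key_size, "%s.%s", uid, part_spec);

	return (n < 0 || (size_t) n >= key_size) ? -1 : 0;
}
```

With uid "1" and part "2.1" this gives the key "1.2.1", which matches the
inline JPEG entry in the layout quoted above.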

 ps. Perhaps I would store 1.BODYSTRUCTURE in the database instead. I
 would probably store 1.ENVELOPE in the database (like how it is now).
   
yea, I think it makes sense to store BODYSTRUCTURE in the folder summary.

 I would probably on top of storing BODYSTRUCTURE and ENVELOPE in the
 database also store them in separate files. Even if most filesystems
 will consume 4k or more (sector or block size) for those mini files.

 To get the JPeg attachment:

 $ cp INBOX/subfolders/temp/1.2.1 ~/mommy.jpeg

 $ exif INBOX/subfolders/temp/1.2.1
 EXIF tags in 'INBOX/subfolders/temp/1.2.1' ('Intel' byte order):
 --------------------+----------------------------
 Tag                 |Value
 --------------------+----------------------------
 Image Description   |Mommy with cake at birthday
 Manufacturer        |SONY
 Model               |DSC-T33
 ...

 $ tracker-search -s EMails birthday
 Results:
   email://u...@server/INBOX/temp/1
   email://u...@server/INBOX/temp/1#2.1
   ~/mommy.jpeg


 [CUT]

   
 this can cause problems if you need to verify signed parts because
 re-encoding them might not result in the same output.
 

 Ok, for signatures I guess we can make an exception and keep them
 encoded in their original format then.

   
 For Maildir I recommend wasting diskspace by storing both the original
 Maildir format and in parallel store the attachments separately.

 Maildir ain't accessible by current Evolution's UI, by the way.

 For MBox I recommend TO STOP USING THIS BROKEN FORMAT. It's insane with
 today's mailboxes that easily grow to 3 gigabytes in size per user.
 
 
 I second your thoughts for MBox stuff. 
   
   
 Eh, I think mbox works fine but I can understand wanting to move to
 Maildir which is also fine :-)
 

 Maildir doesn't store individual MIME parts separately. So Maildir is
 equally hard to handle for metadata engines as MBox is. Only difference
 with MBox is that we need to seek() to some location.

 So Maildir doesn't make it possible for us to let app developers
 implement indexing plugins easily, like a typical exif extractor.
   

I guess, but they could just link with gmime or camel :p

Jeff


Re: [Evolution-hackers] camel-folder-summary.c - 64bit-ness ...

2009-01-06 Thread Jeffrey Stedfast
Michael Meeks wrote:
 Hi guys,

   I was just trying to reproduce some migration performance tests with my
 mbox and summary data rsync'd from a 32bit machine to a 64bit machine.

   Surprisingly this appears to crash immediately.

that's not good :(

  Looking at the
 camel-file-utils.c code I was surprised to see simultaneously an
 apparent concern for network byte ordering:

 int
 camel_file_util_encode_fixed_int32 (FILE *out, gint32 value)
 {
    guint32 save;

    /* convert to network (big-endian) byte order before writing */
    save = g_htonl (value);
    if (fwrite (&save, sizeof (save), 1, out) != 1)
        return -1;
    return 0;
 }

   and also things like:

 CFU_ENCODE_T(time_t)

   that appear to generate data based on the sizeof the platform's time_t
 - on my 64bit machine time_t is 8 bytes, on 32bit it is only 4.
   

yea, unfortunately the old summary format wasn't designed with 32/64 bit
compat. Now that a lot of people are moving from 32bit to 64bit as they
upgrade to 64bit x86's, it would probably be good to look into.
Although, to be fair, summary files can be re-generated pretty easily.
Unfortunately, for IMAP, while it may be easy, it's not very fast :(
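
For illustration only, a width-independent encoding could always write
time values as 8 big-endian bytes, whatever sizeof (time_t) is on the
machine. The function names below are hypothetical and this is not the
existing summary format, so files written this way would not be readable
by the current loader.

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Write value as exactly 8 bytes, most significant byte first. */
static int
encode_time_fixed64 (FILE *out, time_t value)
{
	uint64_t v = (uint64_t) (int64_t) value;
	unsigned char buf[8];
	int i;

	for (i = 0; i < 8; i++)
		buf[i] = (unsigned char) (v >> (56 - 8 * i));

	return fwrite (buf, sizeof (buf), 1, out) == 1 ? 0 : -1;
}

/* Read 8 big-endian bytes back into a time_t. On a platform with a
 * 32-bit time_t, values outside its range are truncated. */
static int
decode_time_fixed64 (FILE *in, time_t *dest)
{
	unsigned char buf[8];
	uint64_t v = 0;
	int i;

	if (fread (buf, sizeof (buf), 1, in) != 1)
		return -1;

	for (i = 0; i < 8; i++)
		v = (v << 8) | buf[i];

	*dest = (time_t) (int64_t) v;
	return 0;
}
```

Because the on-disk width is fixed, a file written on a 32bit machine and
one written on a 64bit machine would have the same layout.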

   Presumably this summary code is made obsolete by the new SQLite summary
 code - and modulo some data as to what architecture a file was written
 by it's perhaps less than obvious how to fix this.

nod. the only idea I can come up with is having some logic in the
loading code that tries to figure out why loading failed and to see if
it might have something to do with 32bit vs 64bit int sizes in the
summary file.

Not sure how doable it is.

  Also - why we're not
 using fgetc_unlocked in these tight loops I don't know.
   

isn't that a GNUism? To be honest, I didn't even know the function
existed until a year or so ago when I was looking into Mono vs Java I/O
performance based on the Debian Language Shootout tests. I happened to
look at the C implementation and saw fgets_unlocked() and looked into
it. IIRC, replacing it with fgets() didn't make any noticeable
difference in performance. I just figured it was a micro-optimization ;-)

   I guess I need an old evo. version to re-build all my summaries for
 64bit now; or am I barking up the wrong tree ?
   

I would imagine so, yea.

Jeff



Re: [Evolution-hackers] A Camel API to get the filename of the cache, also a proposal to have one format to rule them all

2009-01-06 Thread Philip Van Hoof
On Mon, 2009-01-05 at 09:41 -0500, Jeffrey Stedfast wrote:
 Philip Van Hoof wrote:
  On Mon, 2009-01-05 at 08:25 -0500, Jeffrey Stedfast wrote:
 
 
  Maildir doesn't store individual MIME parts separately. So Maildir is
  equally hard to handle for metadata engines as MBox is. Only difference
  with MBox is that we need to seek() to some location.
 
  So Maildir doesn't make it possible for us to let app developers
  implement indexing plugins easily, like a typical exif extractor.

 
 I guess, but they could just link with gmime or camel :p

Which is what Tracker is doing at this moment. But for various reasons
we still end up copying the E-mail's decoded attachments to /tmp, then
scan them with the indexer's plugins, and then unlink() the files.

Suffice to say that this ain't ideal when scanning 10,000 E-mails that
way. Much more efficient for us would be to simply enter evo's caches
and read the MIME parts as normal already decoded files.

I also think such a format would improve some of Evolution's own
features:

o. For example, making a thumbnail of an image could use the platform's
   infrastructure, and see it being cached using the thumbnail-spec.

   Less code

o. Another feature is the "Save as" feature for attachments. Instead of
   having to open a GFile and using CamelStream converted to a
   GOutputStream and decode-streaming it to that stream to save the
   attachment on the filesystem, you just copy the file.

   Less code

o. Inline image viewers: Instead of having to plug the decoded memory of
   the attachment into a blob of memory, you just use any image viewer. 

   Less code

o. Inline attached images for text/html MIME part viewers: right now
   migrating GtkHTML to WebKit or GtkMozEmbed is hard because GtkHTML
   had implemented some special thing that allows it to get itself a
   blob of memory fed as pixmap buffer for images whose src attribute
   starts with "cid:".

   Less code


I'm not even sure if WebKit and GtkMozEmbed support rendering blobs of
memory. Although I have been asking the developers of the respective
components about this at nearly every conference where I meet them. They
all promised to at least offer some sort of infrastructure for this.

Lots of promises ;)

After thinking about it very hard, and quite a lot, I didn't find any
good reason to store attachments in Base64 encoding. I only found
reasons why you would want to store it decoded: Less code, same features

The only argument for storing in Base64 encoding could be the feature
"View the source of this E-mail". But since you can perform the Base64
encode as the E-mail becomes visible in the E-mail source viewer, it's not
a good reason (let's say this introduces 5 lines of camel_stream_* code).

You could say: because we want to use a standard for our storage:

 - Maildir can't work on Windows because the author of the spec refuses
   to change the character ':' into '!' for the filenames. Which renders
   his entire specification completely useless. Windows is not
   irrelevant, it's being used a lot. Ignoring it is like carving the
   word stupid on your head with a knife.

   But fine, let him. We are free to ignore his spec, right?

   Maildir also doesn't specify storing MIME parts as separate files.

 - MBox is just broken. You can't put 3Gigs of data in one file, require
   a rewrite of that file each time you want to remove 1kb of data out
   of it and have no index on it (this, at least, is something Maildir
   got right by letting the kernel's FS take care of that: atomic
   renames and DIR is quite good as an index).

   An MBox file is a ticking timebomb waiting to get corrupted.

   MBox also doesn't specify storing MIME parts as separate files.

 - What other formats do we have? Is there one so called standard
   format that stores MIME parts as individual decoded files?

   Because if not then just like the Maildir-guy I'll quickly make a
   website and give it a name. And then let's all start calling it a
   standard. Problem solved? It's not that Maildir is really that much
   more than that. A website that describes a broken way of storing
   E-mails.

   Well, ok, a few IMAP server guys decided to use that specification to
   shut up people who say that IMAP servers that store in a binary
   format are not compatible with their freedom religions. Of course
   that's an ill-educated point of view, but who cares. Freedom! *sigh*

-- 
Philip Van Hoof, freelance software developer
home: me at pvanhoof dot be 
gnome: pvanhoof at gnome dot org 
http://pvanhoof.be/blog
http://codeminded.be

