Re: [Evolution-hackers] Moving the struct instance heap space to mmap

2006-09-11 Thread Philip Van Hoof
On Mon, 2006-09-11 at 12:07 +0200, Till Adam wrote:

Excluding some people and mailing lists.

 Having attempted various ways of implementing on-disk and in-memory 
 indexing, the last and current being mmap'd binary on-disk structures 
 (with a lovely collection of data integrity, robustness and performance 
 issues, (NFS, anyone?)), and having long battled huge memory footprints 
 in our applications, we (the KDEPIM developers) have decided that a 
 database approach to the indexing (and the caching) problem is probably 
 our best bet. 

 That's the heart of Akonadi, the PIM data storage server we are 
 currently implementing as part of KDE 4.

What interests me about Akonadi is whether it will be possible to
implement a libtinymail-akonadi when the time is right for it, and
whether that would make projects like Qtopia interested (because I am
interested in their needs when it comes to E-mail client application
development).

I'm probably the least religious-about-it person in G-town;
cooperation would mean benefits for both projects.

 Any and all input from any of you would be very much welcomed. We have a 
 lot of respect for the Evolution team, both present and past, and as 
 someone who's done a lot of work on KMail's IMAP implementation I have 
 particular respect for you, Michael. I feel your pain, dude ;). I've 
 also been following Philip's work on tinymail with interest, because 
 it's the kind of work I would love to do on KMail but lack the time, at 
 present.

While doing tinymail ...

I've concluded that the most important thing for an E-mail client on a
mobile device is to consume as little memory as possible per non-
visible summary item. The summary is the information shown to the user
when he opens a folder (the from, received, to, subject, read state,
indication of attachments, and the metadata needed for threaded sorting,
per message in the folder).

Non-visible data means data that can't be read by the user of the
device/software because the view isn't showing it to the user (it
doesn't fit on the screen or window where the view is).

This of course means that the view should pull what it needs from the
model; the model shouldn't push what it has to the view. For example,
each time a row becomes visible, the view pulls the information for it
from the model (the model being a list-model in this case). A view
pulls row by row, not a full list.

What matters is that the model can implement a method for handing the
view a single item on request.

A database cursor in the model is suitable for this as I can let the
view tell the model: I want the next or the previous item. This is
exactly what a cursor is about: iterating the result of a query.
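
To illustrate, a minimal sketch of such a pull-style cursor interface
(hypothetical names, not tinymail's or any database's actual API):

#include <glib.h>

typedef struct _SummaryCursor SummaryCursor;

typedef struct {
    const char *from;
    const char *subject;
    guint32     flags;
} SummaryItem;

/* The view calls these as rows scroll into sight; the model never pushes. */
gboolean summary_cursor_next (SummaryCursor *cursor, SummaryItem *item);
gboolean summary_cursor_prev (SummaryCursor *cursor, SummaryItem *item);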

I also need a way to go to the nth item while knowing the current
location. By that I mean that the application can store its current
location as state, and when it later needs the nth item it can
efficiently choose the walking direction (nexts or prevs).

Currently this is implemented by doing enough nexts or prevs until the
nth location is reached. A database cursor can also be used for this.

If I can skip-sequential walk the cursor, it's even better. If the
database has support for traveling a cursor to the nth location by
itself, in a very efficient way, then that would be utterly cool.

The current model in tinymail is indeed implemented using an iterator.
It simply stores a position state and walks next/prev to reach the nth
location in an in-memory list of headers (a doubly linked list, with an
external instance keeping the current position, called an iterator).
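
Roughly (a sketch, not the actual tinymail code), such an iterator over
a doubly linked list of headers looks like this, walking from the stored
position in whichever direction is needed:

#include <glib.h>

typedef struct {
    GList *current;  /* node the iterator is standing on */
    gint   pos;      /* index of that node in the list */
} HeaderIter;

static gpointer
header_iter_nth (HeaderIter *iter, gint nth)
{
    while (iter->pos < nth && iter->current->next != NULL) {
        iter->current = iter->current->next;
        iter->pos++;
    }
    while (iter->pos > nth && iter->current->prev != NULL) {
        iter->current = iter->current->prev;
        iter->pos--;
    }
    return iter->current->data;
}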

The view (GtkTreeView) will indeed hammer that model with "give me the
nth item of that list". This happens hundreds of times per second when
scrolling fast, and even more when sorting (the compare function of a
qsort asks for the nth item).

 In short, this discussion reminds me a lot of similar ones we had over 
 in K-town which eventually led to Akonadi, so I couldn't resist 
 pitching it to you a bit. My apologies for hijacking the thread. If you 
 have more concrete questions as to the kinds of problems we are facing 
 with memory mapped index files, feel free to ask in PM or on 
 evolution-hackers, which I'm subscribed to.


-- 
Philip Van Hoof, software developer at x-tend 
home: me at pvanhoof dot be 
gnome: pvanhoof at gnome dot org 
work: vanhoof at x-tend dot be 
http://www.pvanhoof.be - http://www.x-tend.be



Re: [Evolution-hackers] Moving the struct instance heap space to mmap

2006-09-08 Thread Eero Tamminen
Hi,

ext Federico Mena Quintero wrote:
 gpointer
 gimme_ptr_to_subject_for_message_number (int n)
 {
return pointers_to_message_summaries[n] + SUBJECT_OFFSET;
 }

Functions that do nothing else besides a single array or struct lookup
are hopefully static inlines in some header, so that the binary is
not bloated with:
- unnecessary global function resolving (both a performance and a memory
   issue, as can be read from Drepper's DSO paper)
- code for pushing and popping the args to/from the stack

(I'm not sure about x86, but if I remember correctly, on 68k the inline
version of this function would have been just one asm instruction...)
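
A minimal sketch of that inline variant, reusing the names from
Federico's example (the SUBJECT_OFFSET value here is purely illustrative):

/* summary-access.h (sketch): a static inline avoids the PLT indirection
   and call overhead of an exported function. */
#include <glib.h>

extern gpointer *pointers_to_message_summaries;
#define SUBJECT_OFFSET 8   /* illustrative value */

static inline gpointer
gimme_ptr_to_subject_for_message_number (int n)
{
    return (char *) pointers_to_message_summaries[n] + SUBJECT_OFFSET;
}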


The matter is a bit trickier if one wants to export this API from a
shared library, as then details about the library's internal structures
(offsets, array names, etc.) would be compiled into the binaries using
the library.  This could be handled with library versioning, but one
needs to be more careful with changes, and updates will more often
require rebuilding the dependent applications.
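
For comparison, a rough sketch of the exported (non-inline) variant;
SummarySet and its fields are hypothetical names, not an existing API:

#include <glib.h>

typedef struct {
    gpointer *items;           /* base pointer of each summary record */
    gsize     subject_offset;  /* offset of the subject within a record */
} SummarySet;

/* Exported from the library: callers never see the struct layout, so
   offsets can change without rebuilding dependent applications. */
gpointer
summary_set_get_subject (SummarySet *set, int n)
{
    return (char *) set->items[n] + set->subject_offset;
}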


- Eero


Re: [Evolution-hackers] Moving the struct instance heap space to mmap

2006-09-08 Thread Philip Van Hoof
On Thu, 2006-09-07 at 21:48 -0500, Federico Mena Quintero wrote:
 On Thu, 2006-09-07 at 21:14 +0200, Philip Van Hoof wrote:
 
  The *new*/*extra* idea is to create a second index file which contains
  the offsets to the pointers in the camel summary file. Then mmap also
  that file. Extra because the idea will build on top of the existing
 
 Ummm, but this won't reduce your working set by very much, will it?
 
 I haven't looked at the details, but can't you just keep an array in
 memory with pointers to the *start* of each summary block, and then
 compute the other pointers on demand?

Yes, this sounds possible. Each member of the summary, however, would
need a length (or a lot of strlens are needed per access).

The SUBJECT_OFFSET would be variable because the other members in the
summary aren't fixed-size. Unless you use, for example, fixed-size
strings (which would make the mmapped file grow a lot). Consider that
you would have to support the CC field of spam E-mail: 500 E-mail
addresses :), forcing you to make that CC field roughly 2k in size per item.

Or a subject field of ~200 bytes per item, whereas some subjects would
probably be less than 10 bytes. Even for mmap that would be wasteful.

But it would be possible, with the offset to the start of the record, to
strlen the strings and that way calculate the position of the next item.

Or to put the length of the strings in a little sub-index per record (a
little bit like reiserfs, hehe). Or to put the length of the string in
front of the string, like pstrings, to avoid the strlen (note that this
makes each string 32 bits larger, unless you encode the integer which
makes data alignment more difficult).
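
For illustration, a sketch of walking such a record when each string
carries a 32-bit length prefix (a made-up layout, not the actual summary
format):

#include <stdint.h>
#include <string.h>

/* Each field is a 32-bit length followed by the string bytes; reaching
   the k-th field then costs k length reads instead of k strlens. */
static const char *
record_field (const unsigned char *record, int field, uint32_t *len_out)
{
    const unsigned char *p = record;
    uint32_t len;
    int i;

    for (i = 0; i < field; i++) {
        memcpy (&len, p, sizeof (len));   /* memcpy avoids unaligned reads */
        p += sizeof (len) + len;
    }
    memcpy (&len, p, sizeof (len));
    *len_out = len;
    return (const char *) (p + sizeof (len));
}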


Thanks a lot for your input Federico.

-- 
Philip Van Hoof, software developer at x-tend 
home: me at pvanhoof dot be 
gnome: pvanhoof at gnome dot org 
work: vanhoof at x-tend dot be 
http://www.pvanhoof.be - http://www.x-tend.be



Re: [Evolution-hackers] Moving the struct instance heap space to mmap

2006-09-08 Thread Philip Van Hoof
On Fri, 2006-09-08 at 11:22 -0500, Federico Mena Quintero wrote:
 On Fri, 2006-09-08 at 10:37 +0200, Philip Van Hoof wrote:

 1. What is the maximum length of those strings?  What is the maximum
 length of one message header in the summary file?  In the message info
 struct, can you use short ints for offsets instead of full pointers?  Or
 even single bytes for lengths, if the strings are really short?

At the moment the file is written using encoded integers. The encoding
basically looks at the 0x80 bit to see whether or not the integer ends
on the current byte. In memory it's expanded to a normal 4-byte integer
(which is one reason why an ls of the summary file shows less than the
amount of memory being used).
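
For illustration, one way such an encoded integer can be decoded (seven
payload bits per byte, with the 0x80 bit marking that another byte
follows; the real summary format's polarity and byte order may differ):

#include <stdint.h>

static uint32_t
decode_uint (const unsigned char **pp)
{
    const unsigned char *p = *pp;
    uint32_t value = 0;
    unsigned char c;

    do {
        c = *p++;
        value = (value << 7) | (c & 0x7f);   /* accumulate 7 bits per byte */
    } while (c & 0x80);                      /* 0x80 set: more bytes follow */

    *pp = p;
    return value;
}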

I'm sure that indeed an unsigned char could be used instead; very few
strings are larger than 256 bytes. Otherwise two unsigned chars, or 16
bits. Any string whose length doesn't fit in 16 bits is probably an
error rather than a summary string coming from an E-mail.

It does, however, feel a bit like a micro-optimization. But if you
multiply it by the number of headers, it indeed becomes more
significant.
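
For a rough, purely illustrative sense of scale: with 50,000 messages
and six such length fields per summary item, shrinking each length from
4 bytes to 1 saves 50,000 x 6 x 3 bytes, i.e. roughly 900 KB.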

 2. What are the access patterns?  When the array of summaries gets
 accessed, do you need all the fields, or only some of them?  I.e. you may
 need to access some flags directly, but you may be able to afford
 computing some lengths by hand if you are doing an uncommon operation.

Only some of them. Removing unneeded pointers from CamelMessageInfoBase
is on my todo list. I tried it once and only partly succeeded (something
became unstable for some strange reason with the NNTP provider).

 3. Even if some things get accessed frequently, some others may not be.
 All my message folders show the default columns:  Flags, From, Subject,
 Date --- these are all fields in CamelMessageInfoBase.  However, some
 others don't get displayed at all:  mlist, to, cc, etc.  Can we remove
 those pointers from the struct, and compute them on demand?  [Those
 fields may be used when filtering, especially mlist.  But filtering is
 pretty much only done when you fetch new mail (isn't it?), so you can
 maybe compute that field on demand instead of keeping it around.]
 
 The goal is to reduce the working set.  Moving all that stuff to another
 mmap() cache just moves your working set from the heap to elsewhere :)

Correct. But the kernel does an awesome job of paging it out, even if
you don't have a swap partition (as on a mobile device). And an even
better job if your mmap is read-only (which it is, in this case).
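
A sketch of such a read-only mapping (a fragment; error handling
omitted, filename hypothetical):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>

int fd = open ("folder.summary.idx", O_RDONLY);
struct stat st;
fstat (fd, &st);
/* PROT_READ + MAP_SHARED: under memory pressure the kernel can simply
   drop these clean pages and later re-read them from the file, no swap
   space needed. */
unsigned char *istart = mmap (NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);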

The third idea sounds a little bit like the disk summary idea. Jeffrey
and Michael can probably explain that one better than me.


-- 
Philip Van Hoof, software developer at x-tend 
home: me at pvanhoof dot be 
gnome: pvanhoof at gnome dot org 
work: vanhoof at x-tend dot be 
http://www.pvanhoof.be - http://www.x-tend.be



Re: [Evolution-hackers] Moving the struct instance heap space to mmap

2006-09-07 Thread Philip Van Hoof
On Thu, 2006-09-07 at 21:14 +0200, Philip Van Hoof wrote:

 typedef struct {
 CamelFolderSummary *fs;
 int nth;
 MemoryMessageInfo *m;
 } CamelMessageInfo;
 
 
 #define camel_message_info_from(x) \
 ((x)->m ? (x)->m->from : (x)->fs->sstart + *((x)->fs->istart \
  + (sizeof (int) * 4 * (x)->nth) + 1))
 #define camel_message_info_subject(x) \
 ((x)->m ? (x)->m->subject : (x)->fs->sstart + *((x)->fs->istart \
  + (sizeof (int) * 4 * (x)->nth) + 2))
 #define camel_message_info_to(x) \
 ((x)->m ? (x)->m->to : (x)->fs->sstart + *((x)->fs->istart \
  + (sizeof (int) * 4 * (x)->nth) + 3))

Note of course that once the offsets are stored in the mmapped index
file, you can have a lot more than just the from, subject and to
members at almost no cost (only the mmap and file size grow). Whereas
today the entire structure size for a camel-message-info grows to more
than these 120 bytes.

That is the core idea here: reduce the size of the camel-message-info
type so that each instance is relatively small. Each instance does a
two-stage lookup in two mmapped files: once to know the offset and once
to get a pointer (at that offset) to the actual data. Each instance
that isn't yet written to disk has a normal in-memory copy using the
MemoryMessageInfo struct.
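
For clarity, a sketch of what that in-memory copy could hold (fields
inferred from the macros above; the real struct may differ):

#include <glib.h>

typedef struct {
    char    *from;
    char    *subject;
    char    *to;
    guint32  flags;
} MemoryMessageInfo;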

Instead of 120 bytes, this camel-message-info would consume 12 bytes
(a pointer, an int and a pointer, aligned on four bytes, as on x86
i386). That's 10 times less per item.

ps. I'm indeed interested in your input/ideas on this.

-- 
Philip Van Hoof, software developer at x-tend 
home: me at pvanhoof dot be 
gnome: pvanhoof at gnome dot org 
work: vanhoof at x-tend dot be 
http://www.pvanhoof.be - http://www.x-tend.be



Re: [Evolution-hackers] Moving the struct instance heap space to mmap

2006-09-07 Thread Philip Van Hoof
On Thu, 2006-09-07 at 21:14 +0200, Philip Van Hoof wrote:

 To read message n, you would simply do something like:
 
 from = sstart + *(istart + (sizeof (int) * 4 * n) + 1)
 subject = sstart + *(istart + (sizeof (int) * 4 * n) + 2)
 to = sstart + *(istart + (sizeof (int) * 4 * n) + 3)
 flags = sstart + *(istart + (sizeof (int) * 4 * n) + 4)
 

No no no no no. I meant (something like) this of course:


unsigned char *sstart = (unsigned char *) mmap (..), *istart = (unsigned char *) mmap (..);
#define AMOUNT_OF_OFFSETS_PER_RECORD 4
#define AOO AMOUNT_OF_OFFSETS_PER_RECORD

from    = sstart + *(uint32_t *) (istart + (sizeof (uint32_t) * AOO * n) + 0 * sizeof (uint32_t));
subject = sstart + *(uint32_t *) (istart + (sizeof (uint32_t) * AOO * n) + 1 * sizeof (uint32_t));
to      = sstart + *(uint32_t *) (istart + (sizeof (uint32_t) * AOO * n) + 2 * sizeof (uint32_t));
flags   = sstart + *(uint32_t *) (istart + (sizeof (uint32_t) * AOO * n) + 3 * sizeof (uint32_t));

Too many pointers in my head ;)

-- 
Philip Van Hoof, software developer at x-tend 
home: me at pvanhoof dot be 
gnome: pvanhoof at gnome dot org 
work: vanhoof at x-tend dot be 
http://www.pvanhoof.be - http://www.x-tend.be



Re: [Evolution-hackers] Moving the struct instance heap space to mmap

2006-09-07 Thread Federico Mena Quintero
On Thu, 2006-09-07 at 21:14 +0200, Philip Van Hoof wrote:

 The *new*/*extra* idea is to create a second index file which contains
 the offsets to the pointers in the camel summary file. Then mmap also
 that file. Extra because the idea will build on top of the existing

Ummm, but this won't reduce your working set by very much, will it?

I haven't looked at the details, but can't you just keep an array in
memory with pointers to the *start* of each summary block, and then
compute the other pointers on demand?

Sort of

gpointer *pointers_to_message_summaries;
gpointer base;

base = mmap (summary);

pointers_to_message_summaries = g_new (gpointer, num_messages);

pointers_to_message_summaries[0] = base;
pointers_to_message_summaries[1] = ...;

and then a set of functions

gpointer
gimme_ptr_to_subject_for_message_number (int n)
{
   return pointers_to_message_summaries[n] + SUBJECT_OFFSET;
}

etc.  Then, instead of 120 bytes per message, you'd have just
sizeof(pointer) and a little more code to compute the offsets on demand.

  Federico



Re: [Evolution-hackers] Moving the struct instance heap space to mmap

2006-09-07 Thread Michael Zucchi
Hmm, will it really reduce memory usage though? I'm not so sure. Remember
that although the summary file contains all strings on disk, in memory
they are actually unique-ized. For a lot of load types (e.g. mailing
lists) this saves a tremendous amount of memory, even after all the
overhead of the hashtable to manage it. Address and subject strings in
particular. And an mmap'd file isn't terribly different in performance
from malloc'd memory: once the latter is set up, in both cases the kernel
is the one loading it off disk (swap, or filesystem).

(An mmap'd file IS just memory, so if mmap'd file size + in-memory
support tables > in-memory version, then you haven't generally won, even
if 'ps' shows you have.)

Note also that all of the summary items, and all of the strings thus
stored, use special allocators which reduce, significantly, the overhead
of malloc on these small allocations. In earlier versions I tried a
different string allocator for each messageinfo which mimics much of
what you're doing here, but in memory. Strings were stored in an array
structure, but I only stored a single pointer to the base of the array,
packed the strings into a NUL-separated right-sized buffer, and searched
each time I needed to access them; the overhead was only about 5 bytes
(a pointer to the 'array' plus an array 'limit'). Performance was fine,
but memory usage was significantly larger than the current model, where
strings are 'unique-ized' - like 30% or more, iirc.

Having extra indices on disk means you're really just writing a database
table, with all of the associated complexity, consistency and coherency
issues that implies. And unless you do something 'proper' like a btree
with a unique key, you're going to end up storing plenty of extra support
stuff in memory anyway, e.g. a hashtable to find items by uid, the sorted
list of messages for the view, and so on. We did a lot to reduce
per-message overhead at the camel level, but it's an area of diminishing
returns - the UI table view will still take as much memory, etc.
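
For illustration, the kind of unique-izing Michael describes, sketched
with a plain GHashTable (a simplification; Camel's real string pooling
differs):

#include <glib.h>

static GHashTable *string_pool;   /* interned string -> itself */

/* Return a shared copy of str: thousands of identical From/Subject
   strings collapse into a single allocation plus the hashtable overhead. */
static const char *
intern_string (const char *str)
{
    const char *shared;

    if (string_pool == NULL)
        string_pool = g_hash_table_new (g_str_hash, g_str_equal);

    shared = g_hash_table_lookup (string_pool, str);
    if (shared == NULL) {
        char *copy = g_strdup (str);
        g_hash_table_insert (string_pool, copy, copy);
        shared = copy;
    }
    return shared;
}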
On 08/09/06, Philip Van Hoof [EMAIL PROTECTED] wrote:
 [quoted message snipped; see Philip's mail of 2006-09-07 earlier in this thread]
