Re: [Evolution-hackers] [Evolution] Beagle and Tracker, letting Evolution feed those beasts RDF triples instead

2008-12-10 Thread Michael Meeks
Hi Philip,

On Wed, 2008-12-10 at 12:49 +0100, Philip Van Hoof wrote:
> > What does the lifecycle for the data in that Unset store look like ?
>
> I think the LifeCycle is best described by this document:
> 
> http://live.gnome.org/MetadataOnRemovableDevices
>
> It specifies a metadata cache format for removable devices in Turtle
> format. 

I had not read that before; I just read it - and, as you say, here is
how things are removed:

> For your information when reading the document: The removal of a
> resource has a special notation using blank resources <> <>, and the
> removal of a predicate (of a field of a resource) uses the notation
>  <>.

Sure - so, that is fine - it's a representational detail of how
removals are stored. My concern is not that we can't represent removals
well - but that the life-cycle of that removal information is undefined.

Say eg. we install beagle, and tracker - but we never run beagle. Then
we have two parties that have registered an interest in changes. If we
run beagle only every year or so - we need to know all mails that were
deleted since a year ago. Unfortunately, perhaps we never run it again.
Does that mean we endlessly accumulate in some monster journal a huge
list of 'UnSets' ?
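To make the concern concrete, here is a minimal sketch in Python (not Evolution code) of a deletion journal with registered subscribers. All names (`DeletionJournal`, `register`, `ack`, `trim`) are hypothetical, and it assumes that an entry may only be dropped once every registered subscriber has acknowledged it.

```python
class DeletionJournal:
    """Hypothetical journal of deleted UIDs shared by several subscribers."""

    def __init__(self):
        self.entries = []   # list of (sequence number, uid), oldest first
        self.next_seq = 1
        self.acked = {}     # subscriber name -> highest sequence number seen

    def register(self, name):
        # A new subscriber has seen nothing yet.
        self.acked[name] = 0

    def log_deletion(self, uid):
        self.entries.append((self.next_seq, uid))
        self.next_seq += 1

    def ack(self, name, seq):
        # The subscriber confirms it has processed everything up to seq.
        self.acked[name] = max(self.acked[name], seq)

    def trim(self):
        # Only entries that every subscriber has acknowledged can go.
        low = min(self.acked.values(), default=self.next_seq - 1)
        self.entries = [(s, u) for (s, u) in self.entries if s > low]
        return len(self.entries)


journal = DeletionJournal()
journal.register("tracker")
journal.register("beagle")      # installed, but never run
for uid in range(1, 1001):
    journal.log_deletion(uid)
journal.ack("tracker", 1000)    # tracker keeps up
remaining = journal.trim()      # beagle never acks, so nothing is dropped
```

Under that assumption the subscriber that never runs pins the whole journal, which is the unbounded growth described above.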

>   For a cache it's important to know the "modified" timestamp so that
>   you know whether your copy of the metadata is most recent, or the
>   cache's metadata about the resource is most recent.

Sure - I buy the timestamp thing; that's all great.

> - When a resource got deleted then the RDF store wants to know about
>   this as soon as possible. Asynchronously (like if the RDF store,
>   being a subscriber, joins the subscription after the deletion took
>   place) this also counts: as soon as possible. Preferably immediately
>   after the subscription.

Sure - so my problem is the life-cycle of the store of deletion
information: how long do we grow that list for, if people eg. turn off
the search client after finding it chews more resource than they had
hoped on their small machine :-)

>   With IMAP there's a trick that you can do: you can assume that a hole
>   in the UIDSET meant that some sort of deleting occurred.

Sounds interesting.

> I think, anyway, that it would make sense for Evolution to start doing
> two things in the CamelDB:

Agreed.

>   * Log all deletions (just the UID should suffice), if the service
> reuses UIDs then upon effective reuse of the UID, this log's UID
> deletion should be removed from the log. Else whoever depends on
> this log for knowing about effective deletions loses the E-mail
> that reused the UID.

So there is at least some bound to the growth of the deleted UID
log ;-) which is the size / likelihood of re-use in the UID space.

It's hard to think of solutions that are that satisfying; but - perhaps
something like cropping the deletion log-size at a percentage of stored
mail size, with some "log overflow" type message to flag that; or having
some arbitrary size bound on it, or more carefully disabling logging
when search services are disabled, or ... having only a single client,
or warning the user that they should run their search service some more,
or perhaps even coupling the indexing piece more closely to the mailer
itself somehow.
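One of these options, cropping the log and flagging the overflow, could look roughly like the sketch below. It is Python rather than Camel code, and the names (`BoundedDeletionLog`, `changes_for_subscriber`) as well as the 10% default are made up for illustration. It also bounds the log by message count rather than by stored mail size, which is a simplification.

```python
from collections import deque


class BoundedDeletionLog:
    """Hypothetical deletion log capped at a fraction of the stored mail count."""

    def __init__(self, stored_mails, max_fraction=0.10, min_entries=1):
        self.limit = max(min_entries, int(stored_mails * max_fraction))
        self.uids = deque()
        self.overflowed = False  # set once any entry has been dropped

    def log_deletion(self, uid):
        self.uids.append(uid)
        while len(self.uids) > self.limit:
            self.uids.popleft()      # drop the oldest entry
            self.overflowed = True   # consumers can no longer trust the log

    def changes_for_subscriber(self):
        # After an overflow the log is incomplete, so the subscriber is told
        # to re-scan everything instead of replaying deletions.
        if self.overflowed:
            return ("resync", [])
        return ("deleted", list(self.uids))


log = BoundedDeletionLog(stored_mails=50, max_fraction=0.10)  # limit: 5 entries
for uid in range(1, 5):
    log.log_deletion(uid)
before = log.changes_for_subscriber()
for uid in range(5, 8):
    log.log_deletion(uid)
after = log.changes_for_subscriber()
```

The trade-off is that a subscriber that stays away too long pays for a full re-index, while the mailer's storage cost stays bounded.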

HTH,

Michael.

-- 
 [EMAIL PROTECTED]  <><, Pseudo Engineer, itinerant idiot

___
Evolution-hackers mailing list
Evolution-hackers@gnome.org
http://mail.gnome.org/mailman/listinfo/evolution-hackers


Re: [Evolution-hackers] [Evolution] Beagle and Tracker, letting Evolution feed those beasts RDF triples instead

2008-12-10 Thread Philip Van Hoof
On Wed, 2008-12-10 at 11:12 +0000, Michael Meeks wrote:
> Hi Philip,
> 
> On Tue, 2008-12-09 at 19:59 +0100, Philip Van Hoof wrote:
> > > http://live.gnome.org/Evolution/Metadata
> > 
> > For early visitors of that page, refresh because I have added/changed
> > quite a lot of it already.
> 
>   Looks really good.
> 
>   The only thing that I don't quite understand (the perennial problem
> with asynchronous interfaces), is the memory issue: it seems we need to
> store all Unset information on deleted mails somewhere [ unless you are
> a womble like me that keeps ~all mail forever ;-].
> 
>   What does the lifecycle for the data in that Unset store look like ?
> [ I assume that as/when you re-connect to the service you're as
> likely to get an UnsetMany as a SetMany ]. What if that data starts to
> grow larger than the remaining data it describes ? ;-) [ depending on
> how we do Junk mail filtering of course that might be quite a common
> occurrence for some ].

I think the LifeCycle is best described by this document:

http://live.gnome.org/MetadataOnRemovableDevices

It specifies a metadata cache format for removable devices in Turtle
format. 

For your information when reading the document: The removal of a
resource has a special notation using blank resources <> <>, and the
removal of a predicate (of a field of a resource) uses the notation
 <>.

Although cached metadata on a removable device is not the exact same
use-case, the life-cycle of what the RDF store (or the metadata engine)
wants is the same:

- When a new resource is created or one of its predicates (one of its
  fields) is being updated, it just wants to know about these updates or
  creates. An update is the same as a create if the resource didn't
  exist before.

  For a cache it's important to know the "modified" timestamp so that
  you know whether your copy of the metadata is most recent, or the
  cache's metadata about the resource is most recent.

  For Evolution (for E-mail clients) we can simplify this as "whenever a
  Set or a SetMany happens, we assume time() to be that date". That's
  because we can assume the E-mail client to have top-most priority in
  all cases (being the benevolent dictator about metadata about E-mails,
  it knows best what we should swallow and when we should swallow its
  updates - we should not make up our own minds and decisions about it)

- When a resource got deleted then the RDF store wants to know about
  this as soon as possible. Asynchronously (like if the RDF store,
  being a subscriber, joins the subscription after the deletion took
  place) this also counts: as soon as possible. Preferably immediately
  after the subscription.

  Right now I don't think Evolution is keeping state about deleted UIDs.

  With IMAP there's a trick that you can do: you can assume that a hole
  in the UIDSET meant that some sort of deleting occurred. That's
  because IMAP ~ specifies that the server can't reuse UIDs (some
  IMAP servers might not respect this, and those are also broken in
  Evolution afaik - or at least require a workaround that makes
  Evolution basically perform like a POP client for IMAP when
  synchronizing).

  With POP I don't think you can make any such assumptions.

- Removing the predicate from a resource (the field of a resource) ain't
  needed for E-mail. Luckily E-mail is a mostly read-only storage. With
  exception of fields like . Maybe if we want to support
  removing a flag or a custom-flag at some point we might need to add
  something to the API to indicate the removal of a field of a resource.

  For example it's not possible that the CC or the TO list of an E-mail
  changes. Because E-mails, once stored, are read-only in that aspect.
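The IMAP trick from the second item can be sketched as a plain set difference. This is an illustrative Python snippet, not Camel code. The function name `uids_deleted_since` is hypothetical, and the sketch assumes the server never reuses a UID and that the folder's UIDVALIDITY has not changed between the two listings.

```python
def uids_deleted_since(known_uids, server_uids):
    """Return locally known UIDs that the server no longer lists.

    Assumes the server never reuses a UID, so a hole can only mean that
    the message was expunged.
    """
    return sorted(set(known_uids) - set(server_uids))


known = [1, 2, 3, 4, 5, 6]          # UIDs in the local summary
on_server = [1, 2, 4, 6, 7, 8]      # UIDs reported by the server now
deleted = uids_deleted_since(known, on_server)       # [3, 5]
new = sorted(set(on_server) - set(known))            # [7, 8]
```

With POP no such guarantee exists, so the same difference could not be trusted there.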


I think, anyway, that it would make sense for Evolution to start doing
two things in the CamelDB:

  * Log all deletions (just the UID should suffice), if the service
reuses UIDs then upon effective reuse of the UID, this log's UID
deletion should be removed from the log. Else whoever depends on
this log for knowing about effective deletions loses the E-mail
that reused the UID.

 * Record the timestamp for each record in the summary table. This
   timestamp would store the time() when the record got added, and maybe
   also (preferably separately) the time() when the E-mail's flags were
   last changed.

With those two additions to the schema of the CamelDB it would I think
be possible to make a plugin that implements the service as proposed on
the wiki page.
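As a rough illustration of those two additions, here is a sketch using Python's sqlite3 module. The table and column names (`summary`, `deleted_uids`, `added_at`, `flags_changed`) are invented for this example and do not reflect the real CamelDB schema.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-in for a per-folder summary table.
    CREATE TABLE summary (
        uid            TEXT PRIMARY KEY,
        subject        TEXT,
        added_at       INTEGER NOT NULL,   -- time() when the record was added
        flags_changed  INTEGER             -- time() of the last flag change
    );
    -- Hypothetical deletion log: just the UID plus when it was deleted.
    CREATE TABLE deleted_uids (
        uid         TEXT PRIMARY KEY,
        deleted_at  INTEGER NOT NULL
    );
""")


def add_mail(uid, subject, now=None):
    now = int(time.time()) if now is None else now
    with conn:
        # A reused UID must leave the deletion log, or subscribers would
        # wrongly treat the new message as deleted.
        conn.execute("DELETE FROM deleted_uids WHERE uid = ?", (uid,))
        conn.execute(
            "INSERT INTO summary (uid, subject, added_at) VALUES (?, ?, ?)",
            (uid, subject, now),
        )


def delete_mail(uid, now=None):
    now = int(time.time()) if now is None else now
    with conn:
        conn.execute("DELETE FROM summary WHERE uid = ?", (uid,))
        conn.execute(
            "INSERT OR REPLACE INTO deleted_uids (uid, deleted_at) VALUES (?, ?)",
            (uid, now),
        )


def deleted_since(timestamp):
    rows = conn.execute(
        "SELECT uid FROM deleted_uids WHERE deleted_at > ? ORDER BY uid",
        (timestamp,),
    )
    return [uid for (uid,) in rows]


add_mail("17", "hello", now=100)
delete_mail("17", now=200)
logged = deleted_since(0)        # the deletion is visible to subscribers
add_mail("17", "reused uid", now=300)
after_reuse = deleted_since(0)   # the log entry went away with the reuse
```

A subscriber that remembers when it last synchronized could then ask for `deleted_since(last_sync)` and for summary rows whose timestamps are newer than `last_sync`.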

Matthew Barnes replied on IRC that we should start storing those
timestamps anyhow. I also think it's a good idea. I was planning to
discuss this with psankar and srag too.

If we'd change the schema then we will also need to implement a
migration path from the old schema to the new.

Using temporary tables you can simulate MySQL's ALTER TABLE in SQLite.

BEGIN TRANSACTION;

CREATE TEMPORARY TABLE temp_table AS SELECT * FROM orig_table;

DROP TABLE orig_table;

CREATE TABLE orig_table (
  ...
 
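The message is cut off here in the archive. As a hedged illustration of the copy, drop and recreate approach it outlines, here is a sketch with Python's sqlite3 module. The columns (`uid`, `subject`, `added_at`) and the `orig_backup` table name are invented, and the real CamelDB migration may look quite different. Note that SQLite has no `SELECT ... INTO`; `CREATE TABLE ... AS SELECT` plays that role, and for merely adding a column SQLite's `ALTER TABLE ... ADD COLUMN` may already be enough.

```python
import sqlite3
import time

# isolation_level=None puts the connection in autocommit mode, so the
# transaction below is controlled by explicit BEGIN / COMMIT / ROLLBACK.
conn = sqlite3.connect(":memory:", isolation_level=None)

# Hypothetical old schema, with two rows to migrate.
conn.execute("CREATE TABLE orig_table (uid TEXT PRIMARY KEY, subject TEXT)")
conn.executemany(
    "INSERT INTO orig_table (uid, subject) VALUES (?, ?)",
    [("1", "first"), ("2", "second")],
)

now = int(time.time())
conn.execute("BEGIN")
try:
    # 1. Park the old rows in a temporary table.
    conn.execute("CREATE TEMPORARY TABLE orig_backup AS SELECT * FROM orig_table")
    # 2. Drop the old table and recreate it with the new column.
    conn.execute("DROP TABLE orig_table")
    conn.execute(
        "CREATE TABLE orig_table ("
        " uid TEXT PRIMARY KEY, subject TEXT, added_at INTEGER NOT NULL)"
    )
    # 3. Copy the rows back; old rows get the migration time as timestamp.
    conn.execute(
        "INSERT INTO orig_table (uid, subject, added_at)"
        " SELECT uid, subject, ? FROM orig_backup",
        (now,),
    )
    conn.execute("DROP TABLE orig_backup")
    conn.execute("COMMIT")
except Exception:
    conn.execute("ROLLBACK")
    raise

rows = conn.execute(
    "SELECT uid, subject, added_at FROM orig_table ORDER BY uid"
).fetchall()
columns = [info[1] for info in conn.execute("PRAGMA table_info(orig_table)")]
```

Running the whole migration inside one transaction means a failure part-way leaves the old table untouched.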

Re: [Evolution-hackers] [Evolution] Beagle and Tracker, letting Evolution feed those beasts RDF triples instead

2008-12-10 Thread Michael Meeks
Hi Philip,

On Tue, 2008-12-09 at 19:59 +0100, Philip Van Hoof wrote:
> > http://live.gnome.org/Evolution/Metadata
> 
> For early visitors of that page, refresh because I have added/changed
> quite a lot of it already.

Looks really good.

The only thing that I don't quite understand (the perennial problem
with asynchronous interfaces), is the memory issue: it seems we need to
store all Unset information on deleted mails somewhere [ unless you are
a womble like me that keeps ~all mail forever ;-].

What does the lifecycle for the data in that Unset store look like ?
[ I assume that as/when you re-connect to the service you're as
likely to get an UnsetMany as a SetMany ]. What if that data starts to
grow larger than the remaining data it describes ? ;-) [ depending on
how we do Junk mail filtering of course that might be quite a common
occurrence for some ].

Thanks,

Michael.

-- 
 [EMAIL PROTECTED]  <><, Pseudo Engineer, itinerant idiot

