Re: [Evolution-hackers] [Evolution] Beagle and Tracker, letting Evolution feed those beasts RDF triples instead

2008-12-15 Thread Philip Van Hoof
On Wed, 2008-12-10 at 16:37 +, Michael Meeks wrote:

[CUT]

   So there is at least some bound to the growth of the deleted UUID
 log ;-) which is the size / likelihood of re-use in the UUID space.
 
   It's hard to think of solutions that are that satisfying; but - perhaps
 something like cropping the deletion log-size at a percentage of stored
 mail size, with some log overflow type message to flag that; or having
 some arbitrary size bound on it, or more carefully disabling logging
 when search services are disabled, or ... having only a single client,
 or warning the user that they should run their search service some more,
 or perhaps even coupling the indexing piece more closely to the mailer
 itself somehow.

After some discussion on IRC we decided to add a Cleanup method to the
registrar's interface. This method will be called whenever Evolution has
a reason to believe that the `last_checkout` date, as passed when the
registrar registered itself, has become (or is) too old.

After the Cleanup, Evolution will do a re-import using mostly SetMany.

I have updated http://live.gnome.org/Evolution/Metadata for this.
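For illustration, here is roughly how I picture that handshake (a
Python sketch with hypothetical names and a made-up staleness
threshold; the real interface is D-Bus, as specified on the wiki page):

```python
import time

# Hypothetical staleness threshold; the real value would be
# Evolution's call, not this sketch's.
STALE_AFTER = 30 * 24 * 3600


class Registrar:
    """Stand-in for a metadata engine (e.g. Tracker) on the bus."""

    def __init__(self, last_checkout):
        self.last_checkout = last_checkout
        self.store = {}

    def Cleanup(self):
        # Evolution calls this when last_checkout is too old; the
        # registrar drops its copy and awaits a full re-import.
        self.store.clear()

    def SetMany(self, many):
        self.store.update(many)


def register(registrar, now):
    # Sketch of Evolution's side: stale registrar gets a Cleanup,
    # then a re-import done mostly with SetMany.
    if now - registrar.last_checkout > STALE_AFTER:
        registrar.Cleanup()
        registrar.SetMany({"email:uid1": {"nmo:isRead": True}})


reg = Registrar(last_checkout=0)
register(reg, now=int(time.time()))
print(len(reg.store))  # 1
```

The sketch only shows the Cleanup-then-SetMany ordering; the actual
staleness policy and method signatures are whatever the wiki page says.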


-- 
Philip Van Hoof, freelance software developer
home: me at pvanhoof dot be 
gnome: pvanhoof at gnome dot org 
http://pvanhoof.be/blog
http://codeminded.be

___
Evolution-hackers mailing list
Evolution-hackers@gnome.org
http://mail.gnome.org/mailman/listinfo/evolution-hackers


Re: [Evolution-hackers] [Evolution] Beagle and Tracker, letting Evolution feed those beasts RDF triples instead

2008-12-10 Thread Philip Van Hoof
On Wed, 2008-12-10 at 11:12 +, Michael Meeks wrote:
 Hi Philip,
 
 On Tue, 2008-12-09 at 19:59 +0100, Philip Van Hoof wrote:
   http://live.gnome.org/Evolution/Metadata
  
  For early visitors of that page, refresh because I have added/changed
  quite a lot of it already.
 
   Looks really good.
 
   The only thing that I don't quite understand (the perennial problem
 with asynchronous interfaces), is the memory issue: it seems we need to
 store all Unset information on deleted mails somewhere [ unless you are
 a womble like me that keeps ~all mail forever ;-].
 
   What does the lifecycle for the data in that Unset store look like ?
 [ I assume that as/when you re-connect to the service you're as much
 likely to get an UnsetMany as a SetMany ]. What if that data starts to
 grow larger than the remaining data it describes ? ;-) [ depending on
 how we do Junk mail filtering of course that might be quite a common
 occurrence for some ].

I think the LifeCycle is best described by this document:

http://live.gnome.org/MetadataOnRemovableDevices

It specifies a metadata cache format for removable devices in Turtle
format. 

For your information when reading the document: the removal of a
resource has a special notation using blank resources, and the
removal of a predicate (of a field of a resource) uses the notation
pfx:predicate. (The exact Turtle syntax is on the wiki page.)

Although cached metadata on a removable device is not the exact same
use-case, the life-cycle of what the RDF store (or the metadata engine)
wants is the same:

- When a new resource is created or one of its predicates (one of its
  fields) is being updated, it just wants to know about these updates or
  creates. An update is the same as a create if the resource didn't
  exist before.

  For a cache it's important to know the modified timestamp so that
  you know whether your copy of the metadata is the most recent, or
  whether the cache's copy about the resource is the most recent.

  For Evolution (for E-mail clients) we can simplify this: whenever a
  Set or a SetMany happens, we assume time() to be that date. That's
  because we can assume the E-mail client has top-most priority in
  all cases (being the benevolent dictator of metadata about E-mails,
  it knows best what we should swallow and when we should swallow its
  updates - we should not make up our own minds about it).

- When a resource got deleted then the RDF store wants to know about
  this as soon as possible. Asynchronously (like if the RDF store,
  being a subscriber, joins the subscription after the deletion took
  place) this also counts: as soon as possible. Preferably immediately
  after the subscription.

  Right now I don't think Evolution is keeping state about deleted UIDs.

  With IMAP there's a trick that you can do: you can assume that a hole
  in the UIDSET means that some sort of deletion occurred. That's
  because IMAP is specified such that the server can't reuse UIDs (some
  IMAP servers might not respect this, and those are also broken in
  Evolution afaik - or at least require a workaround that makes
  Evolution basically perform like a POP client for IMAP when
  synchronizing).

  With POP I don't think you can make any such assumptions.

- Removing a predicate from a resource (a field of a resource) isn't
  needed for E-mail. Luckily E-mail is a mostly read-only storage, with
  the exception of fields like nmo:isRead. Maybe if we want to support
  removing a flag or a custom flag at some point we might need to add
  something to the API to indicate the removal of a field of a resource.

  For example it's not possible that the CC or the TO list of an E-mail
  changes, because E-mails, once stored, are read-only in that respect.
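The IMAP UID-hole trick above can be sketched like this (illustrative
Python; `known_uids` stands for the UIDs in the local summary,
`server_uids` for what the server currently reports):

```python
def deleted_uids(known_uids, server_uids):
    """UIDs from the local summary that no longer exist on the server.

    IMAP (RFC 3501) says a server must not reuse UIDs within a mailbox
    as long as UIDVALIDITY is unchanged, so a hole in the server's UID
    set implies those messages were deleted. No such assumption is
    safe for POP.
    """
    return sorted(set(known_uids) - set(server_uids))


# We knew UIDs 1..6; the server now reports 1, 2, 4, 6:
print(deleted_uids([1, 2, 3, 4, 5, 6], [1, 2, 4, 6]))  # [3, 5]
```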


I think, anyway, that it would make sense for Evolution to start doing
two things in the CamelDB:

  * Log all deletions (just the UID should suffice). If the service
    reuses UIDs, then upon effective reuse of a UID, that UID's
    deletion entry should be removed from the log. Else whoever
    depends on this log for knowing about effective deletions loses
    the E-mail.

 * Record a timestamp for each record in the summary table. This
   timestamp would store the time() when the record got added, and
   maybe also (preferably separately) the time() when the E-mail's
   flags last changed.
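A minimal sketch of those two additions (SQLite driven from Python for
illustration; the table and column names are made up, not the actual
CamelDB schema):

```python
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE summary (
    uid TEXT PRIMARY KEY,
    subject TEXT,
    created INTEGER,        -- time() when the record was added
    flags_changed INTEGER   -- time() of the last flags change
);
CREATE TABLE deletion_log (
    uid TEXT PRIMARY KEY    -- just the UID should suffice
);
""")

now = int(time.time())
db.execute("INSERT INTO summary VALUES (?, ?, ?, ?)", ("1001", "Hi", now, now))

# On deletion: drop the summary row and log the UID.
db.execute("DELETE FROM summary WHERE uid = '1001'")
db.execute("INSERT INTO deletion_log VALUES ('1001')")

# If the service ever reuses the UID, the logged deletion must go,
# else subscribers would treat the new E-mail as deleted.
db.execute("INSERT INTO summary VALUES (?, ?, ?, ?)", ("1001", "Re: Hi", now, now))
db.execute("DELETE FROM deletion_log WHERE uid = '1001'")

print(db.execute("SELECT COUNT(*) FROM deletion_log").fetchone()[0])  # 0
```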

With those two additions to the schema of the CamelDB it would, I
think, be possible to make a plugin that implements the service as
proposed on the wiki page.

Matthew Barnes replied on IRC that we should start storing those
timestamps anyhow. I also think it's a good idea. I was planning to
discuss this with psankar and srag too.

If we change the schema then we will also need to implement a
migration path from the old schema to the new.

Using a temporary table you can simulate MySQL's full ALTER TABLE in
SQLite (SQLite's own ALTER TABLE only supports renaming tables and
adding columns):

BEGIN TRANSACTION;

CREATE TEMPORARY TABLE tmp_table AS SELECT * FROM orig_table;

DROP TABLE orig_table;

CREATE TABLE orig_table (
  ...
  created INTEGER,
  ...
);

INSERT INTO orig_table SELECT ... FROM tmp_table;

DROP TABLE tmp_table;

COMMIT;

Re: [Evolution-hackers] [Evolution] Beagle and Tracker, letting Evolution feed those beasts RDF triples instead

2008-12-10 Thread Michael Meeks
Hi Philip,

On Wed, 2008-12-10 at 12:49 +0100, Philip Van Hoof wrote:
  What does the lifecycle for the data in that Unset store look like ?

 I think the LifeCycle is best described by this document:
 
 http://live.gnome.org/MetadataOnRemovableDevices

 It specifies a metadata cache format for removable devices in Turtle
 format. 

Not read that before; I just read it - and, as you say here is how
things are removed:

 For your information when reading the document: the removal of a
 resource has a special notation using blank resources, and the
 removal of a predicate (of a field of a resource) uses the notation
 pfx:predicate.

Sure - so, that is fine - it's a representational detail of how
removals are stored. My concern is not that we can't represent removals
well - but that the life-cycle of that removal information is undefined.

Say eg. we install beagle, and tracker - but we never run beagle. Then
we have two parties that have registered an interest in changes. If we
run beagle only every year or so - we need to know all mails that were
deleted since a year ago. Unfortunately, perhaps we never run it again.
Does that mean we endlessly accumulate in some monster journal a huge
list of 'UnSets' ?

   For a cache it's important to know the modified timestamp so that
   you know whether your copy of the metadata is the most recent, or
   whether the cache's copy about the resource is the most recent.

Sure - I buy the timestamp thing; that's all great.

 - When a resource got deleted then the RDF store wants to know about
   this as soon as possible. Asynchronously (like if the RDF store,
   being a subscriber, joins the subscription after the deletion took
   place) this also counts: as soon as possible. Preferably immediately
   after the subscription.

Sure - so my problem is the life-cycle of the store of deletion
information: how long do we grow that list for, if people eg. turn off
the search client after finding it chews more resource than they had
hoped on their small machine :-)

   With IMAP there's a trick that you can do: you can assume that a hole
   in the UIDSET means that some sort of deletion occurred.

Sounds interesting.

 I think, anyway, that it would make sense for Evolution to start doing
 two things in the CamelDB:

Agreed.

   * Log all deletions (just the UID should suffice). If the service
     reuses UIDs, then upon effective reuse of a UID, that UID's
     deletion entry should be removed from the log. Else whoever
     depends on this log for knowing about effective deletions loses
     the E-mail.

So there is at least some bound to the growth of the deleted UUID
log ;-) which is the size / likelihood of re-use in the UUID space.

It's hard to think of solutions that are that satisfying; but - perhaps
something like cropping the deletion log-size at a percentage of stored
mail size, with some log overflow type message to flag that; or having
some arbitrary size bound on it, or more carefully disabling logging
when search services are disabled, or ... having only a single client,
or warning the user that they should run their search service some more,
or perhaps even coupling the indexing piece more closely to the mailer
itself somehow.
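The percentage-cropping idea could look something like this (a sketch
only; the ratio and the overflow flag are illustrative, not a
worked-out design):

```python
def prune_deletion_log(log, mail_count, max_ratio=0.10):
    """Crop the deletion log to a fraction of the stored-mail count.

    Returns (kept, overflowed). If entries were dropped, subscribers
    must be told the log overflowed, so they fall back to a full
    resync instead of trusting a now-partial log.
    """
    limit = max(1, int(mail_count * max_ratio))
    if len(log) <= limit:
        return list(log), False
    return list(log)[-limit:], True  # keep only the most recent entries


kept, overflowed = prune_deletion_log(list(range(50)), mail_count=100)
print(len(kept), overflowed)  # 10 True
```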

HTH,

Michael.

-- 
 [EMAIL PROTECTED]  , Pseudo Engineer, itinerant idiot



Re: [Evolution-hackers] [Evolution] Beagle and Tracker, letting Evolution feed those beasts RDF triples instead

2008-12-09 Thread Philip Van Hoof
On Tue, 2008-12-09 at 18:00 +0530, Sankar wrote:

Hey Sankar,

I'm writing a plugin that will implement the Manager class as
described here. Tracker will then implement being a Registrar.

http://live.gnome.org/Evolution/Metadata

I will be using camel-db.h, as you hinted to me on IRC, to implement
the features in a well-performing way (direct SQLite access).

I will start working on this plugin tomorrow or next week. At the same
time I will be implementing support for it in Tracker, which will serve
as a prototype for other metadata engines.

I hope to inspire people from the Evolution team, and from the different
metadata engines, to comment on the proposed D-Bus API.

Let's try to get this valuable metadata out of those damn E-mail clients
and let's try to get it right this time. Not ad-hoc, but right.

If the namespace should be translated from org.gnome to org.freedesktop
we can of course do this afterwards. The metadata.Manager part would
also have to be renamed to a better name. But in the end implementation
would not be affected a lot by such renames. Meanwhile we can prototype
it in GNOME's Evolution D-Bus namespace.

The reason for all this prototyping is that we wouldn't like to release
a Tracker that doesn't support Evolution's new summary format.

This time we're intent on getting it right, rather than just hacking
around the new summary format by fixing something that wrongly
interpreted Evolution's cache by itself instead of letting Evolution
tell us about it ...

* Well we could do this, but really ... let's just get it right now that
  I can spend time on this. At least that's my point of view on this.

* Other apps trying to read Evolution's caches externally is just never
  going to be generic for all E-mail clients, and it's not really
  right. For example: file locking, (now that it's SQLite based)
  caring about transactions being held by Evolution and all that stuff,
  and caring about the possibility of Evolution changing the database
  schema.

* It's just not very nice to do it that way in my opinion: it adds an
  unasked-for burden on the Evolution team too: having to negotiate
  with us when you want to change the schema of the database. Else you
  will break a lot of people's desktops unannounced. Evolution would
  need to provide a mechanism to tell us about the version of the
  schema, for example. And we would have to implement things in
  Tracker that deal with every version of Evolution's cache format.

  One big spaghetti mess distributed over multiple projects.

  So, let's just do it right 


 On Mon, 2008-12-08 at 18:59 +0100, Philip Van Hoof wrote:
  All metadata engines are nowadays working on a method to let them get
  their metadata fed by external applications.
  Such APIs come down to storing RDF triples. An RDF triple comes down
  to a URI, a property and a value.
  
  For example (in Turtle format, which is SPARQL's inline format and
  the W3C's typical RDF storage format):
  We'd like to make an Evolution plugin that does this for Tracker. 
  
  Obviously, supporting Beagle would be as easy as letting software
  like Beagle become an implementer of prox's InsertRDFTriples; the
  same code and the same Evolution plugin would then work for it, too.
  
  I just don't know which EPlugin hooks I should use. Iterating all
  accounts and foreach account all folders and foreach folder all
  CamelMessageInfo instances is trivial and I know how to do this.
  
  What I don't know is what reliable hooks are for:
  
* Application started
 
 org.gnome.evolution.shell.events:1.0 - es-event.c - 
 
 sample plugin:
 groupwise-account-setup/org-gnome-gw-account-setup.eplug.xml 
 
 
* Account added
 
 org.gnome.evolution.mail.config:1.0 
 
 sample plugin:
 groupwise-account-setup/org-gnome-gw-account-setup.eplug.xml 
 
 For account-added: id = org.gnome.evolution.mail.config.accountDruid
 For account-edited: id = org.gnome.evolution.mail.config.accountEditor
 
* Account removed
 
 You may have to write a new hook
 
* Folder created
* Folder deleted
* Folder moved
* Message deleted (expunged)
* Message flagged for removal 
* Message flagged as Read and as Unread
* Message flagged (generic)
* Message moved (ie. deleted + created)
* New message received
  * Full message 
  * Just the ENVELOPE
  
 
 If you try to update your metadata for every one of the above
 operations, it may be overkill in terms of performance (and I believe
 more disk access as well for updating your metadata store). You can
 add a new hook that fires whenever any change is made to the summary
 DB and listen to that. All the above changes will eventually have to
 reach the summary DB for them to be valid.
 
 
 However, I personally believe:
 
 More and more applications are using sqlite (Firefox and Evolution
 being my two most used apps). So, it may be a better idea to directly
 map the tables in an sqlite database into the search applications'
 data-store (Beagle, Tracker etc.) instead of depending on the
 applications to give the up-to-date data.

Re: [Evolution-hackers] [Evolution] Beagle and Tracker, letting Evolution feed those beasts RDF triples instead

2008-12-09 Thread Philip Van Hoof
On Tue, 2008-12-09 at 13:59 +0100, Philip Van Hoof wrote:
 On Tue, 2008-12-09 at 18:00 +0530, Sankar wrote:
 
 Hey Sankar,
 
 I'm writing a plugin that will implement the Manager class as
 described here. Tracker will then implement being a Registrar.
 
 http://live.gnome.org/Evolution/Metadata

For early visitors of that page, refresh because I have added/changed
quite a lot of it already.

This wiki page also serves as the description of the proposal. 

An experienced developer should get quite a good idea of what will be
needed in Evolution:

Keeping timestamps around for each message, so that I can do a
variation of camel_db_read_message_info_records that accepts a since
timestamp.

For example:

camel_db_message_infos_that_changed_since (db, since, callback, userd)
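A sketch of what such a variation would do underneath (illustrative
Python over SQLite; `modified` is the hypothetical timestamp column
discussed here, and the real function would of course live in C in
camel-db.c):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE summary (uid TEXT PRIMARY KEY, modified INTEGER)")
db.executemany("INSERT INTO summary VALUES (?, ?)",
               [("a", 100), ("b", 200), ("c", 300)])


def message_infos_that_changed_since(db, since, callback, user_data):
    # Visit only the records touched after `since`, instead of
    # walking the whole summary as a full read would.
    for uid, modified in db.execute(
            "SELECT uid, modified FROM summary"
            " WHERE modified > ? ORDER BY modified", (since,)):
        callback(uid, modified, user_data)


changed = []
message_infos_that_changed_since(db, 150,
                                 lambda uid, ts, ud: ud.append(uid), changed)
print(changed)  # ['b', 'c']
```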

Something less easy is keeping track of deleted ones too. This would be
needed for the Unset and UnsetMany calls. The only thing that has to be
kept around is the UID. With direct access to IMAP I could implement
this by searching for holes in the UID sets. For POP there's no real
other way than to just store all the UIDs or long-uids that ever got
expunged/popped/deleted and then locally deleted.

This is of course important for accurately cleaning up metadata engines
that want to be aware of removed resources (removed E-mails).

This was also a painful part back when we manually parsed the summary
files: we had to scan all existing items to check whether each was
still in the original summary file. This meant having to parse-all,
scan-all, process-all each time we started up.

Not very nice for desktop-startup time :-(

Especially on mobile you want fast startup and as few things as
possible to do before becoming operational. A metadata engine is of
course no good if it still has inaccuracies, like metadata about data
that has long been removed from its stores.

Anyway, Evolution can easily log this, as deletions are either driven
by IMAP's IDLE, unsolicited EXPUNGE events or NOTIFY, or by its
synchronization with POP, or Evolution itself was responsible for
deleting the E-mail (or apply this logic to E-mail protocol super X
and E-mail protocol mega Y).


Let me know what you guys think ...

-- 
Philip Van Hoof, freelance software developer
home: me at pvanhoof dot be 
gnome: pvanhoof at gnome dot org 
http://pvanhoof.be/blog
http://codeminded.be



Re: [Evolution-hackers] [Evolution] Beagle and Tracker, letting Evolution feed those beasts RDF triples instead

2008-12-08 Thread Sankar
On Mon, 2008-12-08 at 18:59 +0100, Philip Van Hoof wrote:
 All metadata engines are nowadays working on a method to let them get
 their metadata fed by external applications.
 Such APIs come down to storing RDF triples. An RDF triple comes down
 to a URI, a property and a value.
 
 For example (in Turtle format, which is SPARQL's inline format and
 the W3C's typical RDF storage format):
 We'd like to make an Evolution plugin that does this for Tracker. 
 
 Obviously, supporting Beagle would be as easy as letting software
 like Beagle become an implementer of prox's InsertRDFTriples; the
 same code and the same Evolution plugin would then work for it, too.
 
 I just don't know which EPlugin hooks I should use. Iterating all
 accounts and foreach account all folders and foreach folder all
 CamelMessageInfo instances is trivial and I know how to do this.
 
 What I don't know is what reliable hooks are for:
 
   * Application started

org.gnome.evolution.shell.events:1.0 - es-event.c - 

sample plugin:
groupwise-account-setup/org-gnome-gw-account-setup.eplug.xml 


   * Account added

org.gnome.evolution.mail.config:1.0 

sample plugin:
groupwise-account-setup/org-gnome-gw-account-setup.eplug.xml 

For account-added: id = org.gnome.evolution.mail.config.accountDruid
For account-edited: id = org.gnome.evolution.mail.config.accountEditor

   * Account removed

You may have to write a new hook

   * Folder created
   * Folder deleted
   * Folder moved
   * Message deleted (expunged)
   * Message flagged for removal 
   * Message flagged as Read and as Unread
   * Message flagged (generic)
   * Message moved (ie. deleted + created)
   * New message received
 * Full message 
 * Just the ENVELOPE
 

If you try to update your metadata for every one of the above
operations, it may be overkill in terms of performance (and I believe
more disk access as well for updating your metadata store). You can
add a new hook that fires whenever any change is made to the summary
DB and listen to that. All the above changes will eventually have to
reach the summary DB for them to be valid.


However, I personally believe:

More and more applications are using sqlite (Firefox and Evolution
being my two most used apps). So, it may be a better idea to directly
map the tables in an sqlite database into the search applications'
data-store (Beagle, Tracker etc.) instead of depending on the
applications to give the up-to-date data.

When we implemented the on-disk summary for Evolution, we removed the
meta-summary code (used by Beagle). We had to provide a way to help
Beagle / Tracker know about modified/new mails, so they could (re)index
those mails. Some suggested that we should add a DATETIME field
containing the time-stamp of when each record was last modified or
created. However, this, in addition to bloating the database, also
does not provide any information about deleted records.

If, inside the sqlite db, we have a special table comprising
(table-name, primary keys of the last N records modified/added,
time-added), any search application can make use of this and update
its Lucene (or whatever) data.
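Such a change table could even be maintained by SQLite triggers, so the
end-user application would not need to cooperate at all (an
illustrative sketch; all table, column and trigger names are made up):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE summary (uid TEXT PRIMARY KEY, subject TEXT);
CREATE TABLE change_log (
    tbl TEXT, pk TEXT, op TEXT,
    time_added INTEGER DEFAULT (strftime('%s','now'))
);
CREATE TRIGGER summary_add AFTER INSERT ON summary BEGIN
    INSERT INTO change_log (tbl, pk, op) VALUES ('summary', NEW.uid, 'add');
END;
CREATE TRIGGER summary_del AFTER DELETE ON summary BEGIN
    INSERT INTO change_log (tbl, pk, op) VALUES ('summary', OLD.uid, 'del');
END;
""")

db.execute("INSERT INTO summary VALUES ('1', 'Hello')")
db.execute("DELETE FROM summary WHERE uid = '1'")

# The indexer polls change_log when the machine is idle, then prunes
# the rows it has consumed.
print([(pk, op) for tbl, pk, op, t in db.execute("SELECT * FROM change_log")])
```

Note that this also answers the deleted-records problem mentioned
above, since deletions land in the log too.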

It may not be the neatest approach, but what I want to say is: instead
of depending on the end-user applications (which use sqlite) to hand
over the data, search applications should be able to get the data from
the db itself. This also provides additional benefits like
creating/updating search indices when the machine is idle, instead of
choking the applications while they are running, etc.

My 0.2 EUROes ;-)

--
Sankar
