[notmuch] Missing messages breaking threads

2009-12-22 Thread Olly Betts
Carl Worth writes:
> We don't have any concept of versioning yet, but it would obviously be
> easy to have a new version document with an increasing integer.

Adding a magic document for this isn't ideal as you have to make sure
it can't appear in search results, etc.

This is just the sort of thing which Xapian's "user metadata" is there
for.  It's essentially a key/value store which is versioned along with
the rest of the Xapian database.  So to set it:

  database.set_metadata("version", "1");

And to read (and default if not set):

  string version = database.get_metadata("version");
  if (version.empty()) version = "0";

Cheers,
   Olly
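
A minimal sketch of how a version string read this way might be
interpreted, assuming the value is stored as a plain decimal integer.
The names kCurrentSchemaVersion, parse_schema_version and
decide_schema_action are hypothetical and belong to neither Xapian nor
notmuch. The only behaviour taken from the message above is that
get_metadata() yields an empty string for a key that was never set.

```cpp
#include <charconv>
#include <optional>
#include <string>
#include <system_error>

// Schema version this build understands (hypothetical value).
constexpr unsigned kCurrentSchemaVersion = 1;

// Interpret the raw metadata value. An empty string means the key was
// never set, which is treated as version 0 (a pre-versioning database).
// Returns std::nullopt if the value is not a plain non-negative integer.
std::optional<unsigned> parse_schema_version(const std::string& raw) {
    if (raw.empty()) return 0u;
    unsigned value = 0;
    const char* first = raw.data();
    const char* last = raw.data() + raw.size();
    auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc() || ptr != last) return std::nullopt;
    return value;
}

enum class SchemaAction { UpToDate, UpgradeInPlace, TooNew, Corrupt };

// Decide what to do with a database given its stored version string.
SchemaAction decide_schema_action(const std::string& raw) {
    const std::optional<unsigned> version = parse_schema_version(raw);
    if (!version) return SchemaAction::Corrupt;
    if (*version == kCurrentSchemaVersion) return SchemaAction::UpToDate;
    if (*version < kCurrentSchemaVersion) return SchemaAction::UpgradeInPlace;
    return SchemaAction::TooNew;  // written by a newer program
}
```

In use, the string returned by database.get_metadata("version") would
be passed to decide_schema_action(), and "1" written back with
set_metadata() once an upgrade has completed.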



[notmuch] Missing messages breaking threads

2009-12-22 Thread Carl Worth
On Tue, 22 Dec 2009 22:48:25 +0000 (UTC), Olly Betts wrote:
> This is just the sort of thing which Xapian's "user metadata" is there
> for.  It's essentially a key/value store which is versioned along with
> the rest of the Xapian database.  So to set it:
> 
>   database.set_metadata("version", "1");
> 
> And to read (and default if not set):
> 
>   string version = database.get_metadata("version");
>   if (version.empty()) version = "0";

Thanks, Olly!

That is exactly what we'll want here, and is much better than a magic
document.

-Carl (grateful to have a Xapian expert keeping watch on the list)



[notmuch] Missing messages breaking threads

2009-12-18 Thread James Westby
On Fri, 18 Dec 2009 12:52:58 -0800, Carl Worth wrote:
> On Fri, 18 Dec 2009 19:53:13 +, James Westby wrote:
> Oh, I was assuming you wouldn't index any text. The UI can add "missing
> message" for a document with no filename, for example.

Works for me.

> > So, to summarise, I should first look at storing filesizes, then
> > the collision code to make it index further when the filesize grows,
> > and then finally the code to add documents for missing messages?
> 
> Some of the code areas to be touched will be changing soon, (at least as
> far as when filenames appear and disappear). Hopefully I'll have
> something posted for that sooner rather than later to avoid having to
> redo too much work.

That would be great. I'm learning all the code anyway, so there's not
a whole lot of knowledge being thrown away.

I've just sent an initial cut at the first step.

> > The only thing I am unclear on is how to handle existing databases?
> > Do we have any concept of versioning? Or should I just assume that
> > filesize: may not be in the document and act appropriately?
> 
> My current, outstanding patch is going to be the first trigger for a
> "flag day" where we'll all need to rewrite our databases.
> 
> We don't have any concept of versioning yet, but it would obviously be
> easy to have a new version document with an increasing integer.
> 
> But even with my current patch I'm considering doing a graceful upgrade
> of the database in-place rather than making the user do something like a
> dump, delete, rebuild, restore. That would give a much better experience
> than "Your database is out-of-date, please rebuild it", so we'll see if
> I pursue that in the end.

That sounds nice, I'd certainly prefer this sort of thing as it evolves.

Thanks,

James
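
A rough sketch of the last step in that plan, finding which message IDs
would need a placeholder document. It assumes the referenced IDs have
already been extracted from the References and In-Reply-To headers and
that the set of IDs already in the database is available in memory;
missing_references is a hypothetical helper, not notmuch code.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Return the referenced message IDs that have no document yet, in the
// order they were first seen and without duplicates. Each of these
// would get an empty placeholder document (no filename, no indexed text).
std::vector<std::string> missing_references(
    const std::vector<std::string>& referenced_ids,
    const std::unordered_set<std::string>& known_ids) {
    std::vector<std::string> missing;
    std::unordered_set<std::string> seen;
    for (const std::string& id : referenced_ids) {
        if (known_ids.count(id) == 0 && seen.insert(id).second) {
            missing.push_back(id);
        }
    }
    return missing;
}
```

A real implementation would query the database per ID rather than hold
every known ID in a set; the set just keeps the example self-contained.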


[notmuch] Missing messages breaking threads

2009-12-18 Thread James Westby
On Fri, 18 Dec 2009 11:41:18 -0800, Carl Worth wrote:
> On Fri, 18 Dec 2009 19:02:21 +, James Westby wrote:
> > Therefore I'd like to fix this. The obvious way is to
> > introduce documents into the db for each id we see, and
> > threading should then naturally work better.
> 
> That sounds like a fine idea.

Good, at least I'm not totally off the map.

> > The only issue I see with doing this is with mail delays.
> > Once we do this we will sometimes receive a message that
> > already has a dummy document. What happens currently with
> > message-id collisions?
> 
> The current message-ID collision logic is pretty brain-dead. It just
> says "Oh, I've seen a file with this message before, so I'll skip this
> additional file".
> 
> But I'm just putting the finishing touches on a patch that instead does:
> 
>   Oh, and here's an additional filename for that message ID. Add
>   that too, please.
> 
> Beyond that, all we would need to do as well is to also index the new
> content. I don't want to do useless re-indexing when files just get
> renamed. So maybe all we need to do is to save the filesize of the
> last-indexed file for a document and then when we encounter a file with
> the same message ID and a larger file size, then index it as well?

I would say different file size, but I imagine larger is the majority
of interesting cases.

> That would even take care of providing the opportunity to index
> additional mailing-list-added content for messages also sent directly
> via CC.
> 
> The file-size heuristic wouldn't be perfect for these other cases. I
> guess we save a list of sha-1 sums for indexed files or so, (assuming
> that's cheaper than just re-indexing---before the Xapian Defect 250 fix
> I'm sure it is, but after I'm not sure---we maybe should just always
> re-index---but I think I have seen the TermGenerator appear in profiles
> of indexing runs.)

I'm not sure this is needed too much, but would obviously be
correct.

On Xapian Defect 250, I have a very slow spinning disk, and it was hitting
me hard, making processing my inbox far too slow. I built Xapian SVN
with the patch from the bug and it is now lightning fast, so
consider this another endorsement. I also tried the supplemental
patch and it showed no further improvement for notmuch tag.

> >   * When we get a message-id conflict check for dummy:True
> > and replace the document if it is there.
> > 
> > How does this sound?
> 
> That sounds fine. It's the same as what I propose above with
> "filesize:0" instead of "dummy:true".

That works. However, we would want the old content to go away
in these cases, wouldn't we?

Or do we not index whatever dummy text we add? Or do we not
even put it in? Or not even show it at all? I was just thinking
of having "Missing messages..." showing up as the start of
the thread, but maybe it's not needed.

> > There could be an issue with synthesising too many threads
> > and then ending up having to try and put a message in two
> > threads? I see there is code for merging threads, would that
> > handle this?
> 
> It should, yes.
> 
> The current logic is that a message can only appear in a single
> thread. So if a message has children or parents with distinct thread IDs
> then those threads are merged.
> 
> I can imagine some strange cross-posting scenario where one could argue
> that the merging shouldn't happen, but I'm not sure we want to try to
> respect that.

Fair enough.

So, to summarise, I should first look at storing filesizes, then
the collision code to make it index further when the filesize grows,
and then finally the code to add documents for missing messages?

The only thing I am unclear on is how to handle existing databases?
Do we have any concept of versioning? Or should I just assume that
filesize: may not be in the document and act appropriately?

Thanks,

James
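
The collision handling discussed in this message could be sketched
roughly as below. This is only an illustration under stated
assumptions: the stored size of the last-indexed file is the sole state
kept per message, a size of 0 marks a placeholder for a message not yet
seen (the "filesize:0" convention above), and IndexedMessage,
CollisionAction and on_message_file are hypothetical names.

```cpp
#include <cstdint>
#include <optional>

// What the database already holds for a given message ID.
struct IndexedMessage {
    std::uint64_t indexed_filesize;  // 0 marks a placeholder with no file
};

enum class CollisionAction {
    IndexNew,               // message ID never seen: index normally
    FillPlaceholder,        // replace the dummy document with real content
    AddFilenameAndReindex,  // larger file: record filename, index content
    AddFilenameOnly         // same or smaller file: just record the filename
};

// Decide how to treat a newly found file for a message ID.
CollisionAction on_message_file(const std::optional<IndexedMessage>& existing,
                                std::uint64_t new_filesize) {
    if (!existing) return CollisionAction::IndexNew;
    if (existing->indexed_filesize == 0) return CollisionAction::FillPlaceholder;
    if (new_filesize > existing->indexed_filesize) {
        return CollisionAction::AddFilenameAndReindex;
    }
    return CollisionAction::AddFilenameOnly;
}
```

Switching the comparison from ">" to "!=" would give the "different
file size" variant suggested above instead of "larger".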



[notmuch] Missing messages breaking threads

2009-12-18 Thread Carl Worth
On Fri, 18 Dec 2009 19:53:13 +, James Westby wrote:
> Or do we not index whatever dummy text we add? Or do we not
> even put it in? Or not even show it at all? I was just thinking
> of having "Missing messages..." showing up as the start of
> the thread, but maybe it's not needed.

Oh, I was assuming you wouldn't index any text. The UI can add "missing
message" for a document with no filename, for example.

> So, to summarise, I should first look at storing filesizes, then
> the collision code to make it index further when the filesize grows,
> and then finally the code to add documents for missing messages?

Some of the code areas to be touched will be changing soon, (at least as
far as when filenames appear and disappear). Hopefully I'll have
something posted for that sooner rather than later to avoid having to
redo too much work.

> The only thing I am unclear on is how to handle existing databases?
> Do we have any concept of versioning? Or should I just assume that
> filesize: may not be in the document and act appropriately?

My current, outstanding patch is going to be the first trigger for a
"flag day" where we'll all need to rewrite our databases.

We don't have any concept of versioning yet, but it would obviously be
easy to have a new version document with an increasing integer.

But even with my current patch I'm considering doing a graceful upgrade
of the database in-place rather than making the user do something like a
dump, delete, rebuild, restore. That would give a much better experience
than "Your database is out-of-date, please rebuild it", so we'll see if
I pursue that in the end.

-Carl





[notmuch] Missing messages breaking threads

2009-12-18 Thread Carl Worth
On Fri, 18 Dec 2009 19:02:21 +, James Westby wrote:
> I like the architecture of notmuch, and have just switched
> to using it as my primary client, so thanks.

You're quite welcome, James. Welcome to notmuch!

> Therefore I'd like to fix this. The obvious way is to
> introduce documents into the db for each id we see, and
> threading should then naturally work better.

That sounds like a fine idea.

> The only issue I see with doing this is with mail delays.
> Once we do this we will sometimes receive a message that
> already has a dummy document. What happens currently with
> message-id collisions?

The current message-ID collision logic is pretty brain-dead. It just
says "Oh, I've seen a file with this message before, so I'll skip this
additional file".

But I'm just putting the finishing touches on a patch that instead does:

  Oh, and here's an additional filename for that message ID. Add
  that too, please.

Beyond that, all we would need to do as well is to also index the new
content. I don't want to do useless re-indexing when files just get
renamed. So maybe all we need to do is to save the filesize of the
last-indexed file for a document and then when we encounter a file with
the same message ID and a larger file size, then index it as well?

That would even take care of providing the opportunity to index
additional mailing-list-added content for messages also sent directly
via CC.

The file-size heuristic wouldn't be perfect for these other cases. I
guess we could save a list of SHA-1 sums for indexed files or so,
assuming that's cheaper than just re-indexing. Before the Xapian Defect
250 fix I'm sure it is, but after it I'm not sure. Maybe we should just
always re-index, though I think I have seen the TermGenerator appear in
profiles of indexing runs.

>   * When we get a message-id conflict check for dummy:True
> and replace the document if it is there.
> 
> How does this sound?

That sounds fine. It's the same as what I propose above with
"filesize:0" instead of "dummy:true".

> There could be an issue with synthesising too many threads
> and then ending up having to try and put a message in two
> threads? I see there is code for merging threads, would that
> handle this?

It should, yes.

The current logic is that a message can only appear in a single
thread. So if a message has children or parents with distinct thread IDs
then those threads are merged.

I can imagine some strange cross-posting scenario where one could argue
that the merging shouldn't happen, but I'm not sure we want to try to
respect that.

-Carl
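
The merging rule described above (a message belongs to exactly one
thread, so distinct thread IDs among its parents and children collapse
into one) can be illustrated with a small union-find structure. This is
a sketch of the idea only, not how notmuch stores threads; ThreadSet
and its member names are hypothetical.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

class ThreadSet {
public:
    // Canonical thread ID for `id`, registering it if unseen.
    std::string find(const std::string& id) {
        auto it = parent_.find(id);
        if (it == parent_.end()) {
            parent_.emplace(id, id);
            return id;
        }
        if (it->second == id) return id;
        const std::string parent = it->second;  // copy before recursing
        const std::string root = find(parent);
        parent_[id] = root;  // path compression
        return root;
    }

    // Merge the threads containing `a` and `b`; returns the surviving ID.
    std::string merge(const std::string& a, const std::string& b) {
        const std::string root_a = find(a);
        const std::string root_b = find(b);
        if (root_a != root_b) parent_[root_b] = root_a;
        return root_a;
    }

    // Thread for a new message: merge the threads of all related
    // messages, or start a thread with `fresh_id` if there are none.
    std::string thread_for_message(const std::vector<std::string>& related,
                                   const std::string& fresh_id) {
        if (related.empty()) return find(fresh_id);
        std::string root = find(related.front());
        for (const std::string& id : related) root = merge(root, id);
        return root;
    }

private:
    std::unordered_map<std::string, std::string> parent_;
};
```

With placeholder documents in play, a late-arriving message whose
parent and child were filed under different thread IDs would pass both
IDs as `related` and end up joining them into one thread.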


