[notmuch] Rather simple optimization for notmuch tag

2009-12-24 Thread Olly Betts
Mark Anderson writes:
> On Wed, 23 Dec 2009 03:45:14 +, Olly Betts wrote:
> > Handling a combination of removals and additions is trickier, but probably
> > possible, although the more tags you are dealing with, the less profitable
> > the filtering is likely to be (as the filter is likely to cull fewer
> > documents yet be more expensive to evaluate).
> 
> But the transform is pretty simple, I think that any combination of
> additions and removals could be transformed according to the following
> formula.
> 
> notmuch tag +a1 +a2 +a3 -d1 -d2 -d3 
> 
> would transform to something like:
> 
>  and ( not(a1) or not(a2) or not(a3) or d1 or d2 or d3)

Note that Xapian doesn't really have a "not" operator (because of how it
works - by storing the documents indexing each term - rather than because
nobody's implemented it), so it isn't quite as simple as the above.

There is a posting list for "all documents" (which is very efficient if
the document ids form a contiguous range; if they don't, it's as efficient
as a term which matches all those documents for the chert backend, but not
so great for the default flint backend in 1.0.x), and you can combine this
with the "AND_NOT" operator to give the equivalent of a "NOT" operator.

So I think the example above is probably best expressed as:

(  AND ( ( ALL AND_NOT (a1 AND a2 AND a3) ) OR d1 OR d2 OR d3 )

But my point wasn't that I doubted it could be handled, but that it becomes
less worthwhile as the number of tags increases (and at some point will
become slower).

> There are certainly may be much more optimal ways to do it depending on
> the specific corpus of the database, considering if the tags a1 and a2
> and a3 are usually added as one tag, or if the addition is done
> individually, because if I know that a3 implies a1 and a2, the first 3
> terms could be combined to not(a1 and a2 and a3), or I could just
> exclude a3 tagged messages for nearly the same effect, with expected
> performance improvements.

I think you always can combine them like that.  The documents that don't
need looking at are precisely those which already have all three tags
(i.e. a1 AND a2 AND a3), so those that do are "NOT" that expression.

Cheers,
Olly



[notmuch] Rather simple optimization for notmuch tag

2009-12-23 Thread Mark Anderson
On Wed, 23 Dec 2009 03:45:14 +, Olly Betts  wrote:
> Carl Worth writes:
> > On Fri, 18 Dec 2009 00:49:00 -0700, Mark Anderson wrote:
> > > I was updating my poll script that tags messages, and a common idiom is
> > > to put
> > >  tag +mytag  and not tag:mytag
> > > 
> > > I don't know anything about efficiency, but for the simple single-tag
> > > case, couldn't we imply the "and not tag:mytag" from the +mytag action
> > > list for the tag command?
> > 
> > On one level, it really shouldn't be a performance issue to tag messages
> > that already have a particular tag. (And in fact, the recently proposed
> > patches to fix Xapian defect 250 even address this I think.)
> 
> Applying a filter up-front like this is likely to still help I think as it
> avoids Xapian having to reverse-engineer this information internally.

That's good to hear.

> Actually, you could do this with multiple tags - you just need to build
> a filter for documents which might be affected.
> 
> So if you're adding tags a1 and a2, you want:  AND_NOT (a1 AND a2)
> since documents which already have tags a1 and a2 can be ignored.
> 
> If you're removing d1 and d2, then the filter is:  AND (d1 OR d2)
> since documents have to be tagged d1 or d2 in order for the removal to
> do anything.
> 
> Handling a combination of removals and additions is trickier, but probably
> possible, although the more tags you are dealing with, the less profitable
> the filtering is likely to be (as the filter is likely to cull fewer
> documents yet be more expensive to evaluate).

But the transform is pretty simple, I think that any combination of
additions and removals could be transformed according to the following
formula.

notmuch tag +a1 +a2 +a3 -d1 -d2 -d3 

would transform to something like:

 and ( not(a1) or not(a2) or not(a3) or d1 or d2 or d3) 

There are certainly may be much more optimal ways to do it depending on
the specific corpus of the database, considering if the tags a1 and a2
and a3 are usually added as one tag, or if the addition is done
individually, because if I know that a3 implies a1 and a2, the first 3
terms could be combined to not(a1 and a2 and a3), or I could just
exclude a3 tagged messages for nearly the same effect, with expected
performance improvements.

Unfortunately this requires that I know more about how the tags are used
than I ever want notmuch to deal with.  Perhaps a follow-on or parallel
project with less emphasis on minimalism.


This looks pretty good to me.  Easy to implement and not likely to break
things.  I've been wondering about whether there should be a repository
of mail added to the notmuch git so that we can start testing these
kinds of features on a consistent body of mail.

I doubt that I'll be the one to write this, since I don't have any time
set aside for real coding, but if it takes long enough, I'll probably
pick this one up eventually.

-Mark



[notmuch] Rather simple optimization for notmuch tag

2009-12-23 Thread Olly Betts
Carl Worth writes:
> On Fri, 18 Dec 2009 00:49:00 -0700, Mark Anderson wrote:
> > I was updating my poll script that tags messages, and a common idiom is
> > to put
> >  tag +mytag  and not tag:mytag
> > 
> > I don't know anything about efficiency, but for the simple single-tag
> > case, couldn't we imply the "and not tag:mytag" from the +mytag action
> > list for the tag command?
> 
> On one level, it really shouldn't be a performance issue to tag messages
> that already have a particular tag. (And in fact, the recently proposed
> patches to fix Xapian defect 250 even address this I think.)

Applying a filter up-front like this is likely to still help I think as it
avoids Xapian having to reverse-engineer this information internally.

> One potential snag with both ideas is that the "notmuch tag"
> command-line as currently implemented allows for multiple tag additions
> and removals with a single search. So the optimization here couldn't be
> used unless there was just a single tag action.

Actually, you could do this with multiple tags - you just need to build
a filter for documents which might be affected.

So if you're adding tags a1 and a2, you want:  AND_NOT (a1 AND a2)
since documents which already have tags a1 and a2 can be ignored.

If you're removing d1 and d2, then the filter is:  AND (d1 OR d2)
since documents have to be tagged d1 or d2 in order for the removal to
do anything.

Handling a combination of removals and additions is trickier, but probably
possible, although the more tags you are dealing with, the less profitable
the filtering is likely to be (as the filter is likely to cull fewer
documents yet be more expensive to evaluate).

Cheers,
Olly



[notmuch] Rather simple optimization for notmuch tag

2009-12-18 Thread Carl Worth
On Fri, 18 Dec 2009 00:49:00 -0700, Mark Anderson  
wrote:
> I was updating my poll script that tags messages, and a common idiom is
> to put
>  tag +mytag  and not tag:mytag
> 
> I don't know anything about efficiency, but for the simple single-tag
> case, couldn't we imply the "and not tag:mytag" from the +mytag action
> list for the tag command?

On one level, it really shouldn't be a performance issue to tag messages
that already have a particular tag. (And in fact, the recently proposed
patches to fix Xapian defect 250 even address this I think.)

In the meantime, it is fairly annoying to have to type this, and yes,
the tag command could infer that and append it to the search string
automatically. That's a good idea, really.

> The similar (dual?, rusty math terminology, beware of Math-tetanus) case
> of "tag -mytag  and tag:mytag" could be similarly optimized,
> since the tag removal action ought to be a null action in the case that
> the search terms matched on a thread or message, but the tag to be
> removed isn't attached to the message/thread returned.

Yes, that would work too.

One potential snag with both ideas is that the "notmuch tag"
command-line as currently implemented allows for multiple tag additions
and removals with a single search. So the optimization here couldn't be
used unless there was just a single tag action.

So that's another reason to really just want the lower-level
optimization to be in place.

-Carl

-- next part --
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: 



Re: [notmuch] Rather simple optimization for notmuch tag

2009-12-18 Thread Carl Worth
On Fri, 18 Dec 2009 00:49:00 -0700, Mark Anderson markr.ander...@amd.com 
wrote:
 I was updating my poll script that tags messages, and a common idiom is
 to put
  tag +mytag search_terms and not tag:mytag
 
 I don't know anything about efficiency, but for the simple single-tag
 case, couldn't we imply the and not tag:mytag from the +mytag action
 list for the tag command?

On one level, it really shouldn't be a performance issue to tag messages
that already have a particular tag. (And in fact, the recently proposed
patches to fix Xapian defect 250 even address this I think.)

In the meantime, it is fairly annoying to have to type this, and yes,
the tag command could infer that and append it to the search string
automatically. That's a good idea, really.

 The similar (dual?, rusty math terminology, beware of Math-tetanus) case
 of tag -mytag search-terms and tag:mytag could be similarly optimized,
 since the tag removal action ought to be a null action in the case that
 the search terms matched on a thread or message, but the tag to be
 removed isn't attached to the message/thread returned.

Yes, that would work too.

One potential snag with both ideas is that the notmuch tag
command-line as currently implemented allows for multiple tag additions
and removals with a single search. So the optimization here couldn't be
used unless there was just a single tag action.

So that's another reason to really just want the lower-level
optimization to be in place.

-Carl



pgpX3lcvcBjhJ.pgp
Description: PGP signature
___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch