[RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag

2012-01-31 Thread Austin Clements
This is looking really good.  I think this overall approach is
significantly better than the initial exclude support and the UI
aspects look like they should be much more pleasant.

Quoth Mark Walters on Jan 29 at  6:36 pm:
> 
> Ok I now have a patch set which might be complete enough to be worth
> reviewing. It is essentially complete and appears to work.
> 
> Things that still need doing: 
>updating the test suite. The series changes notmuch-show to
>output the exclude flag so several tests need updating. Of
>course, the new functionality needs some tests too.
>
>emacs/notmuch.el I think it would be nice to hide (make
>invisible) threads with no matching non-excluded messages (with a
>toggle for visibility) but that is definitely beyond my elisp
>skills.

I'm happy to do that once this goes in.

> The first patch of the series is not really part of the series: it adds
> a --do-not-exclude option to tell the command not to exclude. I think
> this is useful anyway, but it also simplifies behaviour decisions with
> the excludes. For example notmuch count will only count matching
> non-excluded messages but notmuch count --do-not-exclude will count 
> all matching messages excluded or not.
> 
> One outstanding issue is that raised in
> id:"20120124025331.GZ16740 at mit.edu". I will need to think about
> that. 
> 
> Best wishes
> 
> Mark
> 


Re: [RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag

2012-01-30 Thread Austin Clements
This is looking really good.  I think this overall approach is
significantly better than the initial exclude support and the UI
aspects look like they should be much more pleasant.

Quoth Mark Walters on Jan 29 at  6:36 pm:
 
 Ok I now have a patch set which might be complete enough to be worth
 reviewing. It is essentially complete and appears to work.
 
 Things that still need doing: 
updating the test suite. The series changes notmuch-show to
output the exclude flag so several tests need updating. Of
course, the new functionality needs some tests too.

emacs/notmuch.el I think it would be nice to hide (make
invisible) threads with no matching non-excluded messages (with a
toggle for visibility) but that is definitely beyond my elisp
skills.

I'm happy to do that once this goes in.

 The first patch of the series is not really part of the series: it adds
 a --do-not-exclude option to tell the command not to exclude. I think
 this is useful anyway, but it also simplifies behaviour decisions with
 the excludes. For example notmuch count will only count matching
 non-excluded messages but notmuch count --do-not-exclude will count 
 all matching messages excluded or not.
 
 One outstanding issue is that raised in
 id:20120124025331.gz16...@mit.edu. I will need to think about
 that. 
 
 Best wishes
 
 Mark
 
___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


[RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag

2012-01-29 Thread Mark Walters

Ok I now have a patch set which might be complete enough to be worth
reviewing. It is essentially complete and appears to work.

Things that still need doing: 
   updating the test suite. The series changes notmuch-show to
   output the exclude flag so several tests need updating. Of
   course, the new functionality needs some tests too.

   emacs/notmuch.el I think it would be nice to hide (make
   invisible) threads with no matching non-excluded messages (with a
   toggle for visibility) but that is definitely beyond my elisp
   skills.

The first patch of the series is not really part of the series: it adds
a --do-not-exclude option to tell the command not to exclude. I think
this is useful anyway, but it also simplifies behaviour decisions with
the excludes. For example notmuch count will only count matching
non-excluded messages but notmuch count --do-not-exclude will count 
all matching messages excluded or not.

One outstanding issue is that raised in
id:"20120124025331.GZ16740 at mit.edu". I will need to think about
that. 

Best wishes

Mark



[RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag

2012-01-29 Thread Mark Walters

> The cli stuff needs thought (about what it should do rather than
> how to do it).

Ok I have thought about the cli interface. My thoughts are as follows:

count/search/show should all have a --do-not-exclude option.

notmuch count:

messages: just output count of matching not-excluded

threads: count threads matching in a non-excluded message

notmuch search: 

default/summary should output line as in the current patch with
number matching being number non-excluded matching (so some will
be zero)

messages/files/tags should be matching not-excluded

threads:  show thread ids of messages matching in a non-excluded message

notmuch show: 

raw and part both deal with a single message so should output it
regardless of exclude flags.

text/json can give out the results including the exclude flag

mbox only include matching not-excluded

The rationale is that all formats which can return an exclude flag do;
those that cannot omit the excluded results since the caller can just
call without setting the excludes if they want the full results.

Does that seem reasonable?

Best wishes

Mark


[RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag

2012-01-28 Thread Mark Walters
On Sat, 28 Jan 2012 13:33:40 -0500, Austin Clements  wrote:
> Quoth Mark Walters on Jan 28 at 10:51 am:
> > 
> > > > exclude_query = _notmuch_exclude_tags (query, final_query);
> > > >  
> > > > -   final_query = Xapian::Query (Xapian::Query::OP_AND_NOT,
> > > > -final_query, exclude_query);
> > > > +   enquire.set_weighting_scheme (Xapian::BoolWeight());
> > > > +   enquire.set_query (exclude_query);
> > > > +
> > > > +   mset = enquire.get_mset (0, notmuch->xapian_db->get_doccount 
> > > > ());
> > > > +
> > > > +   GArray *excluded_doc_ids = g_array_new (FALSE, FALSE, sizeof 
> > > > (unsigned int));
> > > > +
> > > > +   for (iterator = mset.begin (); iterator != mset.end (); 
> > > > iterator++)
> > > > +   {
> > > > +   unsigned int doc_id = *iterator;
> > > > +   g_array_append_val (excluded_doc_ids, doc_id);
> > > > +   }
> > > > +   messages->base.excluded_doc_ids = talloc (query, 
> > > > _notmuch_doc_id_set);
> > > > +   _notmuch_doc_id_set_init (query, 
> > > > messages->base.excluded_doc_ids,
> > > > + excluded_doc_ids);
> > > 
> > > This might be inefficient for message-only queries, since it will
> > > fetch *all* excluded docids.  This highlights a basic difference
> > > between message and thread search: thread search can return messages
> > > that don't match the original query and hence needs to know all
> > > potentially excluded messages, while message search can only return
> > > messages that match the original query.
> > 
> > I now have some benchmarks (not run enough times to be hugely accurate
> > so ignore minor differences). The full results are below. The summary
> > is:
> > 
> > Large-archive = 1 100 000 messages in 290 000 threads (about 10 years of
> > lkml). I mark 1 000 000 deleted
> > Small-archive = 70 000 messages in 35 000 threads. 10 000 marked
> > deleted.
> > 
> > Doing the initial exclude work on the big collection takes about 0.8s
> > and on the small collection about 0.01s. So any query to the big
> > collection takes at least 0.8s longer and this all occurs before any
> > results appear.
> 
> Interesting.  Do you know where that time is spent?
> 
> Also, it might be reasonable to assume that no more than, say, 10% of
> a person's mail store is excluded, but maybe that depends on how
> people use this feature.
> 
> > I then implemented the exclude doing it once for each thread query in
> > _notmuch_create_thread. Roughly this made any query 50% slower.
> 
> That's not terrible.
> 
> > In normal front end use even the 0.8s is not totally unusable, but it is
> > totally unacceptable in the backend where a user might do something like
> > 
> > for i in ` notmuch search --output=threads  from:xxx ` ; 
> > do 
> >notmuch search --output=messages $i; 
> > done
> > 
> > to list all messages in all matching threads.
> > 
> > So I think my conclusions are:
> > 
> > (1) message only queries must be done without the full exclude.
> > (2) thread queries which only match one message should not do the full
> > exclude
> > (3) it would be nice to switch between the two approaches depending on
> > size but I don't see how to do that without extra(!) queries
> > (4) One possible might be do something that say does thirty threads with
> > the by thread method and then if not finished does the full exclude.
> > (5) thread-by-thread might be best for  Jani's limit-match 
> > id:"1327692900-22926-1-git-send-email-jani at nikula.org" 
> > 
> > Obviously, anything setting an exclude flag like this will be slower
> > (since it is doing more work): the question is are either of these (or a
> > combination like (4) above) acceptable?
> 
> Or only mark matched messages as excluded.
> 
> Here's another idea (actually, a rehash of an old idea).  For message
> search do two queries, the original query and " AND
> ", and use this to keep everything in order and mark excluded
> messages.  For thread search, use message search results so it's easy
> to both sort by unexcluded messages and include fully-excluded
> threads, but compute the excluded flag (either just for unmatched
> messages or for all messages) by examining each message's tags
> directly (which thread_add_message already iterates over, so this is
> easy and won't add any overhead).  If the excluded query is fast,
> which I think it will be, I think this should get the best of all
> worlds and be fairly straightforward to implement (no asymmetries
> between the queries used for message and thread search).  It would be
> easy and worth it to run the excluded query by hand on your test
> corpus; I suspect it will be much faster than 0.8s because the query
> already uses "Tmail", which is huge and doesn't seem to slow things
> down.

I have tried your suggestion (still marking all messages) and it does
seem the way to go: the difference in speed is small from master is
small: between 0 and 10% 

[RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag

2012-01-28 Thread Austin Clements
Quoth Mark Walters on Jan 28 at 10:51 am:
> 
> > >   exclude_query = _notmuch_exclude_tags (query, final_query);
> > >  
> > > - final_query = Xapian::Query (Xapian::Query::OP_AND_NOT,
> > > -  final_query, exclude_query);
> > > + enquire.set_weighting_scheme (Xapian::BoolWeight());
> > > + enquire.set_query (exclude_query);
> > > +
> > > + mset = enquire.get_mset (0, notmuch->xapian_db->get_doccount ());
> > > +
> > > + GArray *excluded_doc_ids = g_array_new (FALSE, FALSE, sizeof (unsigned 
> > > int));
> > > +
> > > + for (iterator = mset.begin (); iterator != mset.end (); iterator++)
> > > + {
> > > + unsigned int doc_id = *iterator;
> > > + g_array_append_val (excluded_doc_ids, doc_id);
> > > + }
> > > + messages->base.excluded_doc_ids = talloc (query, _notmuch_doc_id_set);
> > > + _notmuch_doc_id_set_init (query, messages->base.excluded_doc_ids,
> > > +   excluded_doc_ids);
> > 
> > This might be inefficient for message-only queries, since it will
> > fetch *all* excluded docids.  This highlights a basic difference
> > between message and thread search: thread search can return messages
> > that don't match the original query and hence needs to know all
> > potentially excluded messages, while message search can only return
> > messages that match the original query.
> 
> I now have some benchmarks (not run enough times to be hugely accurate
> so ignore minor differences). The full results are below. The summary
> is:
> 
> Large-archive = 1 100 000 messages in 290 000 threads (about 10 years of
> lkml). I mark 1 000 000 deleted
> Small-archive = 70 000 messages in 35 000 threads. 10 000 marked
> deleted.
> 
> Doing the initial exclude work on the big collection takes about 0.8s
> and on the small collection about 0.01s. So any query to the big
> collection takes at least 0.8s longer and this all occurs before any
> results appear.

Interesting.  Do you know where that time is spent?

Also, it might be reasonable to assume that no more than, say, 10% of
a person's mail store is excluded, but maybe that depends on how
people use this feature.

> I then implemented the exclude doing it once for each thread query in
> _notmuch_create_thread. Roughly this made any query 50% slower.

That's not terrible.

> In normal front end use even the 0.8s is not totally unusable, but it is
> totally unacceptable in the backend where a user might do something like
> 
> for i in ` notmuch search --output=threads  from:xxx ` ; 
> do 
>notmuch search --output=messages $i; 
> done
> 
> to list all messages in all matching threads.
> 
> So I think my conclusions are:
> 
> (1) message only queries must be done without the full exclude.
> (2) thread queries which only match one message should not do the full
> exclude
> (3) it would be nice to switch between the two approaches depending on
> size but I don't see how to do that without extra(!) queries
> (4) One possible might be do something that say does thirty threads with
> the by thread method and then if not finished does the full exclude.
> (5) thread-by-thread might be best for  Jani's limit-match 
> id:"1327692900-22926-1-git-send-email-jani at nikula.org" 
> 
> Obviously, anything setting an exclude flag like this will be slower
> (since it is doing more work): the question is are either of these (or a
> combination like (4) above) acceptable?

Or only mark matched messages as excluded.

Here's another idea (actually, a rehash of an old idea).  For message
search do two queries, the original query and " AND
", and use this to keep everything in order and mark excluded
messages.  For thread search, use message search results so it's easy
to both sort by unexcluded messages and include fully-excluded
threads, but compute the excluded flag (either just for unmatched
messages or for all messages) by examining each message's tags
directly (which thread_add_message already iterates over, so this is
easy and won't add any overhead).  If the excluded query is fast,
which I think it will be, I think this should get the best of all
worlds and be fairly straightforward to implement (no asymmetries
between the queries used for message and thread search).  It would be
easy and worth it to run the excluded query by hand on your test
corpus; I suspect it will be much faster than 0.8s because the query
already uses "Tmail", which is huge and doesn't seem to slow things
down.

> I now have a mostly working implementation from library to
> emacs frontend and I do like the overall outcome.

Awesome.

> The complete benchmarks are below
> 
> Best wishes
> 
> Mark
> 
> LARGE COLLECTION is 1,100,000 messages 290,000 threads 1,000,000 deleted
> SMALL COLLECTION is 70,000 messages in 35,000 threads 10,000 deleted
> 
> benchmarks: all times in seconds, x/y/z means a query which matches x
> threads with y matching messages and z messages in total. Ig or ignore
> means with the tag-exclude turned off (i.e. with a query 

[RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag

2012-01-28 Thread Mark Walters

> > exclude_query = _notmuch_exclude_tags (query, final_query);
> >  
> > -   final_query = Xapian::Query (Xapian::Query::OP_AND_NOT,
> > -final_query, exclude_query);
> > +   enquire.set_weighting_scheme (Xapian::BoolWeight());
> > +   enquire.set_query (exclude_query);
> > +
> > +   mset = enquire.get_mset (0, notmuch->xapian_db->get_doccount ());
> > +
> > +   GArray *excluded_doc_ids = g_array_new (FALSE, FALSE, sizeof (unsigned 
> > int));
> > +
> > +   for (iterator = mset.begin (); iterator != mset.end (); iterator++)
> > +   {
> > +   unsigned int doc_id = *iterator;
> > +   g_array_append_val (excluded_doc_ids, doc_id);
> > +   }
> > +   messages->base.excluded_doc_ids = talloc (query, _notmuch_doc_id_set);
> > +   _notmuch_doc_id_set_init (query, messages->base.excluded_doc_ids,
> > + excluded_doc_ids);
> 
> This might be inefficient for message-only queries, since it will
> fetch *all* excluded docids.  This highlights a basic difference
> between message and thread search: thread search can return messages
> that don't match the original query and hence needs to know all
> potentially excluded messages, while message search can only return
> messages that match the original query.

I now have some benchmarks (not run enough times to be hugely accurate
so ignore minor differences). The full results are below. The summary
is:

Large-archive = 1 100 000 messages in 290 000 threads (about 10 years of
lkml). I mark 1 000 000 deleted
Small-archive = 70 000 messages in 35 000 threads. 10 000 marked
deleted.

Doing the initial exclude work on the big collection takes about 0.8s
and on the small collection about 0.01s. So any query to the big
collection takes at least 0.8s longer and this all occurs before any
results appear.

I then implemented the exclude doing it once for each thread query in
_notmuch_create_thread. Roughly this made any query 50% slower.

In normal front end use even the 0.8s is not totally unusable, but it is
totally unacceptable in the backend where a user might do something like

for i in ` notmuch search --output=threads  from:xxx ` ; 
do 
   notmuch search --output=messages $i; 
done

to list all messages in all matching threads.

So I think my conclusions are:

(1) message only queries must be done without the full exclude.
(2) thread queries which only match one message should not do the full
exclude
(3) it would be nice to switch between the two approaches depending on
size but I don't see how to do that without extra(!) queries
(4) One possible might be do something that say does thirty threads with
the by thread method and then if not finished does the full exclude.
(5) thread-by-thread might be best for  Jani's limit-match 
id:"1327692900-22926-1-git-send-email-jani at nikula.org" 

Obviously, anything setting an exclude flag like this will be slower
(since it is doing more work): the question is are either of these (or a
combination like (4) above) acceptable?

I now have a mostly working implementation from library to
emacs frontend and I do like the overall outcome.

The complete benchmarks are below

Best wishes

Mark

LARGE COLLECTION is 1,100,000 messages 290,000 threads 1,000,000 deleted
SMALL COLLECTION is 70,000 messages in 35,000 threads 10,000 deleted

benchmarks: all times in seconds, x/y/z means a query which matches x
threads with y matching messages and z messages in total. Ig or ignore
means with the tag-exclude turned off (i.e. with a query matching the
excluded tag). list all messages is the time for the for loop listed
above giving all message-ids for all messages in any thread matching a
query.

Finally the three columns are master with exclude code disabled,
thread-thread is doing excludes once per thread construction, and
in-advance does all the exclude work in advance as in the patches I posted.

In most cases the benchmark is the average of a lot of runs so the
database should have been as cached as one could hope.

master-(all)thread-thread   in-advance
LARGE COLLECTION
show single message 0.016   0.018   0.78
search single message   0.015   0.016   0.78
search single with tag  0.015   0.015   0.009
945/2627/2
query ignore2.9 n/a 3
query   2.9 4.2 3.8
list all messages (ig)  13  n/a 13
list all messages   13  14  12mins
4754/13000/11
query ignore15.9n/a 17
query   15.922  17.6
only messages   1.251.261.9
177/483/1752
query   0.3 0.421.1

search '*'  20mins  28mins  21.5mins


SMALL COLLECTION
1500/2800/5600

Re: [RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag

2012-01-28 Thread Austin Clements
Quoth Mark Walters on Jan 28 at 10:51 am:
 
 exclude_query = _notmuch_exclude_tags (query, final_query);

   - final_query = Xapian::Query (Xapian::Query::OP_AND_NOT,
   -  final_query, exclude_query);
   + enquire.set_weighting_scheme (Xapian::BoolWeight());
   + enquire.set_query (exclude_query);
   +
   + mset = enquire.get_mset (0, notmuch-xapian_db-get_doccount ());
   +
   + GArray *excluded_doc_ids = g_array_new (FALSE, FALSE, sizeof (unsigned 
   int));
   +
   + for (iterator = mset.begin (); iterator != mset.end (); iterator++)
   + {
   + unsigned int doc_id = *iterator;
   + g_array_append_val (excluded_doc_ids, doc_id);
   + }
   + messages-base.excluded_doc_ids = talloc (query, _notmuch_doc_id_set);
   + _notmuch_doc_id_set_init (query, messages-base.excluded_doc_ids,
   +   excluded_doc_ids);
  
  This might be inefficient for message-only queries, since it will
  fetch *all* excluded docids.  This highlights a basic difference
  between message and thread search: thread search can return messages
  that don't match the original query and hence needs to know all
  potentially excluded messages, while message search can only return
  messages that match the original query.
 
 I now have some benchmarks (not run enough times to be hugely accurate
 so ignore minor differences). The full results are below. The summary
 is:
 
 Large-archive = 1 100 000 messages in 290 000 threads (about 10 years of
 lkml). I mark 1 000 000 deleted
 Small-archive = 70 000 messages in 35 000 threads. 10 000 marked
 deleted.
 
 Doing the initial exclude work on the big collection takes about 0.8s
 and on the small collection about 0.01s. So any query to the big
 collection takes at least 0.8s longer and this all occurs before any
 results appear.

Interesting.  Do you know where that time is spent?

Also, it might be reasonable to assume that no more than, say, 10% of
a person's mail store is excluded, but maybe that depends on how
people use this feature.

 I then implemented the exclude doing it once for each thread query in
 _notmuch_create_thread. Roughly this made any query 50% slower.

That's not terrible.

 In normal front end use even the 0.8s is not totally unusable, but it is
 totally unacceptable in the backend where a user might do something like
 
 for i in ` notmuch search --output=threads  from:xxx ` ; 
 do 
notmuch search --output=messages $i; 
 done
 
 to list all messages in all matching threads.
 
 So I think my conclusions are:
 
 (1) message only queries must be done without the full exclude.
 (2) thread queries which only match one message should not do the full
 exclude
 (3) it would be nice to switch between the two approaches depending on
 size but I don't see how to do that without extra(!) queries
 (4) One possible might be do something that say does thirty threads with
 the by thread method and then if not finished does the full exclude.
 (5) thread-by-thread might be best for  Jani's limit-match 
 id:1327692900-22926-1-git-send-email-j...@nikula.org 
 
 Obviously, anything setting an exclude flag like this will be slower
 (since it is doing more work): the question is are either of these (or a
 combination like (4) above) acceptable?

Or only mark matched messages as excluded.

Here's another idea (actually, a rehash of an old idea).  For message
search do two queries, the original query and original AND
exclude, and use this to keep everything in order and mark excluded
messages.  For thread search, use message search results so it's easy
to both sort by unexcluded messages and include fully-excluded
threads, but compute the excluded flag (either just for unmatched
messages or for all messages) by examining each message's tags
directly (which thread_add_message already iterates over, so this is
easy and won't add any overhead).  If the excluded query is fast,
which I think it will be, I think this should get the best of all
worlds and be fairly straightforward to implement (no asymmetries
between the queries used for message and thread search).  It would be
easy and worth it to run the excluded query by hand on your test
corpus; I suspect it will be much faster than 0.8s because the query
already uses Tmail, which is huge and doesn't seem to slow things
down.

 I now have a mostly working implementation from library to
 emacs frontend and I do like the overall outcome.

Awesome.

 The complete benchmarks are below
 
 Best wishes
 
 Mark
 
 LARGE COLLECTION is 1,100,000 messages 290,000 threads 1,000,000 deleted
 SMALL COLLECTION is 70,000 messages in 35,000 threads 10,000 deleted
 
 benchmarks: all times in seconds, x/y/z means a query which matches x
 threads with y matching messages and z messages in total. Ig or ignore
 means with the tag-exclude turned off (i.e. with a query matching the
 excluded tag). list all messages is the time for the for loop listed
 above giving all message-ids for all messages in 

Re: [RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag

2012-01-28 Thread Mark Walters
On Sat, 28 Jan 2012 13:33:40 -0500, Austin Clements amdra...@mit.edu wrote:
 Quoth Mark Walters on Jan 28 at 10:51 am:
  
exclude_query = _notmuch_exclude_tags (query, final_query);
 
-   final_query = Xapian::Query (Xapian::Query::OP_AND_NOT,
-final_query, exclude_query);
+   enquire.set_weighting_scheme (Xapian::BoolWeight());
+   enquire.set_query (exclude_query);
+
+   mset = enquire.get_mset (0, notmuch-xapian_db-get_doccount 
());
+
+   GArray *excluded_doc_ids = g_array_new (FALSE, FALSE, sizeof 
(unsigned int));
+
+   for (iterator = mset.begin (); iterator != mset.end (); 
iterator++)
+   {
+   unsigned int doc_id = *iterator;
+   g_array_append_val (excluded_doc_ids, doc_id);
+   }
+   messages-base.excluded_doc_ids = talloc (query, 
_notmuch_doc_id_set);
+   _notmuch_doc_id_set_init (query, 
messages-base.excluded_doc_ids,
+ excluded_doc_ids);
   
   This might be inefficient for message-only queries, since it will
   fetch *all* excluded docids.  This highlights a basic difference
   between message and thread search: thread search can return messages
   that don't match the original query and hence needs to know all
   potentially excluded messages, while message search can only return
   messages that match the original query.
  
  I now have some benchmarks (not run enough times to be hugely accurate
  so ignore minor differences). The full results are below. The summary
  is:
  
  Large-archive = 1 100 000 messages in 290 000 threads (about 10 years of
  lkml). I mark 1 000 000 deleted
  Small-archive = 70 000 messages in 35 000 threads. 10 000 marked
  deleted.
  
  Doing the initial exclude work on the big collection takes about 0.8s
  and on the small collection about 0.01s. So any query to the big
  collection takes at least 0.8s longer and this all occurs before any
  results appear.
 
 Interesting.  Do you know where that time is spent?
 
 Also, it might be reasonable to assume that no more than, say, 10% of
 a person's mail store is excluded, but maybe that depends on how
 people use this feature.
 
  I then implemented the exclude doing it once for each thread query in
  _notmuch_create_thread. Roughly this made any query 50% slower.
 
 That's not terrible.
 
  In normal front end use even the 0.8s is not totally unusable, but it is
  totally unacceptable in the backend where a user might do something like
  
  for i in ` notmuch search --output=threads  from:xxx ` ; 
  do 
 notmuch search --output=messages $i; 
  done
  
  to list all messages in all matching threads.
  
  So I think my conclusions are:
  
  (1) message only queries must be done without the full exclude.
  (2) thread queries which only match one message should not do the full
  exclude
  (3) it would be nice to switch between the two approaches depending on
  size but I don't see how to do that without extra(!) queries
  (4) One possible might be do something that say does thirty threads with
  the by thread method and then if not finished does the full exclude.
  (5) thread-by-thread might be best for  Jani's limit-match 
  id:1327692900-22926-1-git-send-email-j...@nikula.org 
  
  Obviously, anything setting an exclude flag like this will be slower
  (since it is doing more work): the question is are either of these (or a
  combination like (4) above) acceptable?
 
 Or only mark matched messages as excluded.
 
 Here's another idea (actually, a rehash of an old idea).  For message
 search do two queries, the original query and original AND
 exclude, and use this to keep everything in order and mark excluded
 messages.  For thread search, use message search results so it's easy
 to both sort by unexcluded messages and include fully-excluded
 threads, but compute the excluded flag (either just for unmatched
 messages or for all messages) by examining each message's tags
 directly (which thread_add_message already iterates over, so this is
 easy and won't add any overhead).  If the excluded query is fast,
 which I think it will be, I think this should get the best of all
 worlds and be fairly straightforward to implement (no asymmetries
 between the queries used for message and thread search).  It would be
 easy and worth it to run the excluded query by hand on your test
 corpus; I suspect it will be much faster than 0.8s because the query
 already uses Tmail, which is huge and doesn't seem to slow things
 down.

I have tried your suggestion (still marking all messages) and it does
seem the way to go: the difference in speed is small from master is
small: between 0 and 10% for most of the tests. 

The code seems to work and I will post it in reply to this thread. 

The library code is reasonable (although whether messages matching an
exclude tag that has been specified in the query 

[RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag

2012-01-24 Thread Mark Walters
On Mon, 23 Jan 2012 21:45:21 -0500, Austin Clements  wrote:
> The overall structure of this series looks great.  There's obviously a
> lot of clean up to do, but I'll reply with a few high-level comments.
> 
> Quoth Mark Walters on Jan 24 at  1:18 am:
> > Form excluded doc_ids set and use that to exclude messages.
> > Should be no functional change.
> > 
> > ---
> >  lib/notmuch-private.h |1 +
> >  lib/query.cc  |   28 ++--
> >  2 files changed, 27 insertions(+), 2 deletions(-)
> > 
> > diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
> > index 7bf153e..e791bb0 100644
> > --- a/lib/notmuch-private.h
> > +++ b/lib/notmuch-private.h
> > @@ -401,6 +401,7 @@ typedef struct _notmuch_message_list {
> >   */
> >  struct visible _notmuch_messages {
> >  notmuch_bool_t is_of_list_type;
> > +notmuch_doc_id_set_t *excluded_doc_ids;
> >  notmuch_message_node_t *iterator;
> >  };
> >  
> > diff --git a/lib/query.cc b/lib/query.cc
> > index c25b301..92fa834 100644
> > --- a/lib/query.cc
> > +++ b/lib/query.cc
> > @@ -57,6 +57,11 @@ struct visible _notmuch_threads {
> >  notmuch_doc_id_set_t match_set;
> >  };
> >  
> > +static notmuch_bool_t
> > +_notmuch_doc_id_set_init (void *ctx,
> > + notmuch_doc_id_set_t *doc_ids,
> > + GArray *arr);
> > +
> >  notmuch_query_t *
> >  notmuch_query_create (notmuch_database_t *notmuch,
> >   const char *query_string)
> > @@ -173,6 +178,7 @@ notmuch_query_search_messages (notmuch_query_t *query)
> >"mail"));
> > Xapian::Query string_query, final_query, exclude_query;
> > Xapian::MSet mset;
> > +   Xapian::MSetIterator iterator;
> > unsigned int flags = (Xapian::QueryParser::FLAG_BOOLEAN |
> >   Xapian::QueryParser::FLAG_PHRASE |
> >   Xapian::QueryParser::FLAG_LOVEHATE |
> > @@ -193,8 +199,21 @@ notmuch_query_search_messages (notmuch_query_t *query)
> >  
> > exclude_query = _notmuch_exclude_tags (query, final_query);
> >  
> > -   final_query = Xapian::Query (Xapian::Query::OP_AND_NOT,
> > -final_query, exclude_query);
> > +   enquire.set_weighting_scheme (Xapian::BoolWeight());
> > +   enquire.set_query (exclude_query);
> > +
> > +   mset = enquire.get_mset (0, notmuch->xapian_db->get_doccount ());
> > +
> > +   GArray *excluded_doc_ids = g_array_new (FALSE, FALSE, sizeof (unsigned 
> > int));
> > +
> > +   for (iterator = mset.begin (); iterator != mset.end (); iterator++)
> > +   {
> > +   unsigned int doc_id = *iterator;
> > +   g_array_append_val (excluded_doc_ids, doc_id);
> > +   }
> > +   messages->base.excluded_doc_ids = talloc (query, _notmuch_doc_id_set);
> > +   _notmuch_doc_id_set_init (query, messages->base.excluded_doc_ids,
> > + excluded_doc_ids);
> 
> This might be inefficient for message-only queries, since it will
> fetch *all* excluded docids.  This highlights a basic difference
> between message and thread search: thread search can return messages
> that don't match the original query and hence needs to know all
> potentially excluded messages, while message search can only return
> messages that match the original query.
> 
> It's entirely possible this doesn't matter because Xapian probably
> still needs to fetch the full posting lists of the excluded terms, but
> it would be worth doing a quick/hacky benchmark to verify this, with
> enough excluded messages to make the cost non-trivial.
> 
> If it does matter, you could pass in a flag indicating if the exclude
> query should be limited by the original query or not.  Or you could do
> the limited exclude query in notmuch_query_search_messages and a
> separate open-ended exclude query in notmuch_query_search_threads.

Yes I will benchmark that: I am just importing a large archive into
notmuch for testing. 

> > enquire.set_weighting_scheme (Xapian::BoolWeight());
> >  
> > @@ -294,6 +313,11 @@ _notmuch_mset_messages_move_to_next 
> > (notmuch_messages_t *messages)
> >  mset_messages = (notmuch_mset_messages_t *) messages;
> >  
> >  mset_messages->iterator++;
> > +
> > +while ((mset_messages->iterator != mset_messages->iterator_end) &&
> > +  (_notmuch_doc_id_set_contains (messages->excluded_doc_ids,
> > + *mset_messages->iterator)))
> > +   mset_messages->iterator++;
> 
> This seemed a little weird, since you remove it in the next patch.  Is
> this just to keep the tests happy?  (If so, it would be worth
> mentioning in the commit message; other reviewers will definitely have
> the same question.)

Essentially just to keep tests happy: or rather to try and make it easy
for a reviewer to see that the individual patch does not make any
functional change.

Best wishes 

Mark




[RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag

2012-01-24 Thread Mark Walters
Form excluded doc_ids set and use that to exclude messages.
Should be no functional change.

---
 lib/notmuch-private.h |1 +
 lib/query.cc  |   28 ++--
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
index 7bf153e..e791bb0 100644
--- a/lib/notmuch-private.h
+++ b/lib/notmuch-private.h
@@ -401,6 +401,7 @@ typedef struct _notmuch_message_list {
  */
 struct visible _notmuch_messages {
 notmuch_bool_t is_of_list_type;
+notmuch_doc_id_set_t *excluded_doc_ids;
 notmuch_message_node_t *iterator;
 };

diff --git a/lib/query.cc b/lib/query.cc
index c25b301..92fa834 100644
--- a/lib/query.cc
+++ b/lib/query.cc
@@ -57,6 +57,11 @@ struct visible _notmuch_threads {
 notmuch_doc_id_set_t match_set;
 };

+static notmuch_bool_t
+_notmuch_doc_id_set_init (void *ctx,
+ notmuch_doc_id_set_t *doc_ids,
+ GArray *arr);
+
 notmuch_query_t *
 notmuch_query_create (notmuch_database_t *notmuch,
  const char *query_string)
@@ -173,6 +178,7 @@ notmuch_query_search_messages (notmuch_query_t *query)
   "mail"));
Xapian::Query string_query, final_query, exclude_query;
Xapian::MSet mset;
+   Xapian::MSetIterator iterator;
unsigned int flags = (Xapian::QueryParser::FLAG_BOOLEAN |
  Xapian::QueryParser::FLAG_PHRASE |
  Xapian::QueryParser::FLAG_LOVEHATE |
@@ -193,8 +199,21 @@ notmuch_query_search_messages (notmuch_query_t *query)

exclude_query = _notmuch_exclude_tags (query, final_query);

-   final_query = Xapian::Query (Xapian::Query::OP_AND_NOT,
-final_query, exclude_query);
+   enquire.set_weighting_scheme (Xapian::BoolWeight());
+   enquire.set_query (exclude_query);
+
+   mset = enquire.get_mset (0, notmuch->xapian_db->get_doccount ());
+
+   GArray *excluded_doc_ids = g_array_new (FALSE, FALSE, sizeof (unsigned 
int));
+
+   for (iterator = mset.begin (); iterator != mset.end (); iterator++)
+   {
+   unsigned int doc_id = *iterator;
+   g_array_append_val (excluded_doc_ids, doc_id);
+   }
+   messages->base.excluded_doc_ids = talloc (query, _notmuch_doc_id_set);
+   _notmuch_doc_id_set_init (query, messages->base.excluded_doc_ids,
+ excluded_doc_ids);

enquire.set_weighting_scheme (Xapian::BoolWeight());

@@ -294,6 +313,11 @@ _notmuch_mset_messages_move_to_next (notmuch_messages_t 
*messages)
 mset_messages = (notmuch_mset_messages_t *) messages;

 mset_messages->iterator++;
+
+while ((mset_messages->iterator != mset_messages->iterator_end) &&
+  (_notmuch_doc_id_set_contains (messages->excluded_doc_ids,
+ *mset_messages->iterator)))
+   mset_messages->iterator++;
 }

 static notmuch_bool_t
-- 
1.7.2.3



[RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag

2012-01-24 Thread Mark Walters
Form excluded doc_ids set and use that to exclude messages.
Should be no functional change.

---
 lib/notmuch-private.h |1 +
 lib/query.cc  |   28 ++--
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
index 7bf153e..e791bb0 100644
--- a/lib/notmuch-private.h
+++ b/lib/notmuch-private.h
@@ -401,6 +401,7 @@ typedef struct _notmuch_message_list {
  */
 struct visible _notmuch_messages {
 notmuch_bool_t is_of_list_type;
+notmuch_doc_id_set_t *excluded_doc_ids;
 notmuch_message_node_t *iterator;
 };
 
diff --git a/lib/query.cc b/lib/query.cc
index c25b301..92fa834 100644
--- a/lib/query.cc
+++ b/lib/query.cc
@@ -57,6 +57,11 @@ struct visible _notmuch_threads {
 notmuch_doc_id_set_t match_set;
 };
 
+static notmuch_bool_t
+_notmuch_doc_id_set_init (void *ctx,
+ notmuch_doc_id_set_t *doc_ids,
+ GArray *arr);
+
 notmuch_query_t *
 notmuch_query_create (notmuch_database_t *notmuch,
  const char *query_string)
@@ -173,6 +178,7 @@ notmuch_query_search_messages (notmuch_query_t *query)
   mail));
Xapian::Query string_query, final_query, exclude_query;
Xapian::MSet mset;
+   Xapian::MSetIterator iterator;
unsigned int flags = (Xapian::QueryParser::FLAG_BOOLEAN |
  Xapian::QueryParser::FLAG_PHRASE |
  Xapian::QueryParser::FLAG_LOVEHATE |
@@ -193,8 +199,21 @@ notmuch_query_search_messages (notmuch_query_t *query)
 
exclude_query = _notmuch_exclude_tags (query, final_query);
 
-   final_query = Xapian::Query (Xapian::Query::OP_AND_NOT,
-final_query, exclude_query);
+   enquire.set_weighting_scheme (Xapian::BoolWeight());
+   enquire.set_query (exclude_query);
+
+   mset = enquire.get_mset (0, notmuch-xapian_db-get_doccount ());
+
+   GArray *excluded_doc_ids = g_array_new (FALSE, FALSE, sizeof (unsigned 
int));
+
+   for (iterator = mset.begin (); iterator != mset.end (); iterator++)
+   {
+   unsigned int doc_id = *iterator;
+   g_array_append_val (excluded_doc_ids, doc_id);
+   }
+   messages-base.excluded_doc_ids = talloc (query, _notmuch_doc_id_set);
+   _notmuch_doc_id_set_init (query, messages-base.excluded_doc_ids,
+ excluded_doc_ids);
 
enquire.set_weighting_scheme (Xapian::BoolWeight());
 
@@ -294,6 +313,11 @@ _notmuch_mset_messages_move_to_next (notmuch_messages_t 
*messages)
 mset_messages = (notmuch_mset_messages_t *) messages;
 
 mset_messages-iterator++;
+
+while ((mset_messages-iterator != mset_messages-iterator_end) 
+  (_notmuch_doc_id_set_contains (messages-excluded_doc_ids,
+ *mset_messages-iterator)))
+   mset_messages-iterator++;
 }
 
 static notmuch_bool_t
-- 
1.7.2.3

___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


Re: [RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag

2012-01-24 Thread Austin Clements
The overall structure of this series looks great.  There's obviously a
lot of clean up to do, but I'll reply with a few high-level comments.

Quoth Mark Walters on Jan 24 at  1:18 am:
 Form excluded doc_ids set and use that to exclude messages.
 Should be no functional change.
 
 ---
  lib/notmuch-private.h |1 +
  lib/query.cc  |   28 ++--
  2 files changed, 27 insertions(+), 2 deletions(-)
 
 diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
 index 7bf153e..e791bb0 100644
 --- a/lib/notmuch-private.h
 +++ b/lib/notmuch-private.h
 @@ -401,6 +401,7 @@ typedef struct _notmuch_message_list {
   */
  struct visible _notmuch_messages {
  notmuch_bool_t is_of_list_type;
 +notmuch_doc_id_set_t *excluded_doc_ids;
  notmuch_message_node_t *iterator;
  };
  
 diff --git a/lib/query.cc b/lib/query.cc
 index c25b301..92fa834 100644
 --- a/lib/query.cc
 +++ b/lib/query.cc
 @@ -57,6 +57,11 @@ struct visible _notmuch_threads {
  notmuch_doc_id_set_t match_set;
  };
  
 +static notmuch_bool_t
 +_notmuch_doc_id_set_init (void *ctx,
 +   notmuch_doc_id_set_t *doc_ids,
 +   GArray *arr);
 +
  notmuch_query_t *
  notmuch_query_create (notmuch_database_t *notmuch,
 const char *query_string)
 @@ -173,6 +178,7 @@ notmuch_query_search_messages (notmuch_query_t *query)
  mail));
   Xapian::Query string_query, final_query, exclude_query;
   Xapian::MSet mset;
 + Xapian::MSetIterator iterator;
   unsigned int flags = (Xapian::QueryParser::FLAG_BOOLEAN |
 Xapian::QueryParser::FLAG_PHRASE |
 Xapian::QueryParser::FLAG_LOVEHATE |
 @@ -193,8 +199,21 @@ notmuch_query_search_messages (notmuch_query_t *query)
  
   exclude_query = _notmuch_exclude_tags (query, final_query);
  
 - final_query = Xapian::Query (Xapian::Query::OP_AND_NOT,
 -  final_query, exclude_query);
 + enquire.set_weighting_scheme (Xapian::BoolWeight());
 + enquire.set_query (exclude_query);
 +
 + mset = enquire.get_mset (0, notmuch-xapian_db-get_doccount ());
 +
 + GArray *excluded_doc_ids = g_array_new (FALSE, FALSE, sizeof (unsigned 
 int));
 +
 + for (iterator = mset.begin (); iterator != mset.end (); iterator++)
 + {
 + unsigned int doc_id = *iterator;
 + g_array_append_val (excluded_doc_ids, doc_id);
 + }
 + messages-base.excluded_doc_ids = talloc (query, _notmuch_doc_id_set);
 + _notmuch_doc_id_set_init (query, messages-base.excluded_doc_ids,
 +   excluded_doc_ids);

This might be inefficient for message-only queries, since it will
fetch *all* excluded docids.  This highlights a basic difference
between message and thread search: thread search can return messages
that don't match the original query and hence needs to know all
potentially excluded messages, while message search can only return
messages that match the original query.

It's entirely possible this doesn't matter because Xapian probably
still needs to fetch the full posting lists of the excluded terms, but
it would be worth doing a quick/hacky benchmark to verify this, with
enough excluded messages to make the cost non-trivial.

If it does matter, you could pass in a flag indicating if the exclude
query should be limited by the original query or not.  Or you could do
the limited exclude query in notmuch_query_search_messages and a
separate open-ended exclude query in notmuch_query_search_threads.

  
   enquire.set_weighting_scheme (Xapian::BoolWeight());
  
 @@ -294,6 +313,11 @@ _notmuch_mset_messages_move_to_next (notmuch_messages_t 
 *messages)
  mset_messages = (notmuch_mset_messages_t *) messages;
  
  mset_messages-iterator++;
 +
 +while ((mset_messages-iterator != mset_messages-iterator_end) 
 +(_notmuch_doc_id_set_contains (messages-excluded_doc_ids,
 +   *mset_messages-iterator)))
 + mset_messages-iterator++;

This seemed a little weird, since you remove it in the next patch.  Is
this just to keep the tests happy?  (If so, it would be worth
mentioning in the commit message; other reviewers will definitely have
the same question.)

  }
  
  static notmuch_bool_t
___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


Re: [RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag

2012-01-24 Thread Mark Walters
On Mon, 23 Jan 2012 21:45:21 -0500, Austin Clements amdra...@mit.edu wrote:
 The overall structure of this series looks great.  There's obviously a
 lot of clean up to do, but I'll reply with a few high-level comments.
 
 Quoth Mark Walters on Jan 24 at  1:18 am:
  Form excluded doc_ids set and use that to exclude messages.
  Should be no functional change.
  
  ---
   lib/notmuch-private.h |1 +
   lib/query.cc  |   28 ++--
   2 files changed, 27 insertions(+), 2 deletions(-)
  
  diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
  index 7bf153e..e791bb0 100644
  --- a/lib/notmuch-private.h
  +++ b/lib/notmuch-private.h
  @@ -401,6 +401,7 @@ typedef struct _notmuch_message_list {
*/
   struct visible _notmuch_messages {
   notmuch_bool_t is_of_list_type;
  +notmuch_doc_id_set_t *excluded_doc_ids;
   notmuch_message_node_t *iterator;
   };
   
  diff --git a/lib/query.cc b/lib/query.cc
  index c25b301..92fa834 100644
  --- a/lib/query.cc
  +++ b/lib/query.cc
  @@ -57,6 +57,11 @@ struct visible _notmuch_threads {
   notmuch_doc_id_set_t match_set;
   };
   
  +static notmuch_bool_t
  +_notmuch_doc_id_set_init (void *ctx,
  + notmuch_doc_id_set_t *doc_ids,
  + GArray *arr);
  +
   notmuch_query_t *
   notmuch_query_create (notmuch_database_t *notmuch,
const char *query_string)
  @@ -173,6 +178,7 @@ notmuch_query_search_messages (notmuch_query_t *query)
 mail));
  Xapian::Query string_query, final_query, exclude_query;
  Xapian::MSet mset;
  +   Xapian::MSetIterator iterator;
  unsigned int flags = (Xapian::QueryParser::FLAG_BOOLEAN |
Xapian::QueryParser::FLAG_PHRASE |
Xapian::QueryParser::FLAG_LOVEHATE |
  @@ -193,8 +199,21 @@ notmuch_query_search_messages (notmuch_query_t *query)
   
  exclude_query = _notmuch_exclude_tags (query, final_query);
   
  -   final_query = Xapian::Query (Xapian::Query::OP_AND_NOT,
  -final_query, exclude_query);
  +   enquire.set_weighting_scheme (Xapian::BoolWeight());
  +   enquire.set_query (exclude_query);
  +
  +   mset = enquire.get_mset (0, notmuch-xapian_db-get_doccount ());
  +
  +   GArray *excluded_doc_ids = g_array_new (FALSE, FALSE, sizeof (unsigned 
  int));
  +
  +   for (iterator = mset.begin (); iterator != mset.end (); iterator++)
  +   {
  +   unsigned int doc_id = *iterator;
  +   g_array_append_val (excluded_doc_ids, doc_id);
  +   }
  +   messages-base.excluded_doc_ids = talloc (query, _notmuch_doc_id_set);
  +   _notmuch_doc_id_set_init (query, messages-base.excluded_doc_ids,
  + excluded_doc_ids);
 
 This might be inefficient for message-only queries, since it will
 fetch *all* excluded docids.  This highlights a basic difference
 between message and thread search: thread search can return messages
 that don't match the original query and hence needs to know all
 potentially excluded messages, while message search can only return
 messages that match the original query.
 
 It's entirely possible this doesn't matter because Xapian probably
 still needs to fetch the full posting lists of the excluded terms, but
 it would be worth doing a quick/hacky benchmark to verify this, with
 enough excluded messages to make the cost non-trivial.
 
 If it does matter, you could pass in a flag indicating if the exclude
 query should be limited by the original query or not.  Or you could do
 the limited exclude query in notmuch_query_search_messages and a
 separate open-ended exclude query in notmuch_query_search_threads.

Yes I will benchmark that: I am just importing a large archive into
notmuch for testing. 

  enquire.set_weighting_scheme (Xapian::BoolWeight());
   
  @@ -294,6 +313,11 @@ _notmuch_mset_messages_move_to_next 
  (notmuch_messages_t *messages)
   mset_messages = (notmuch_mset_messages_t *) messages;
   
   mset_messages-iterator++;
  +
  +while ((mset_messages-iterator != mset_messages-iterator_end) 
  +  (_notmuch_doc_id_set_contains (messages-excluded_doc_ids,
  + *mset_messages-iterator)))
  +   mset_messages-iterator++;
 
 This seemed a little weird, since you remove it in the next patch.  Is
 this just to keep the tests happy?  (If so, it would be worth
 mentioning in the commit message; other reviewers will definitely have
 the same question.)

Essentially just to keep tests happy: or rather to try and make it easy
for a reviewer to see that the individual patch does not make any
functional change.

Best wishes 

Mark


___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


[RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag

2012-01-23 Thread Austin Clements
The overall structure of this series looks great.  There's obviously a
lot of clean up to do, but I'll reply with a few high-level comments.

Quoth Mark Walters on Jan 24 at  1:18 am:
> Form excluded doc_ids set and use that to exclude messages.
> Should be no functional change.
> 
> ---
>  lib/notmuch-private.h |1 +
>  lib/query.cc  |   28 ++--
>  2 files changed, 27 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
> index 7bf153e..e791bb0 100644
> --- a/lib/notmuch-private.h
> +++ b/lib/notmuch-private.h
> @@ -401,6 +401,7 @@ typedef struct _notmuch_message_list {
>   */
>  struct visible _notmuch_messages {
>  notmuch_bool_t is_of_list_type;
> +notmuch_doc_id_set_t *excluded_doc_ids;
>  notmuch_message_node_t *iterator;
>  };
>  
> diff --git a/lib/query.cc b/lib/query.cc
> index c25b301..92fa834 100644
> --- a/lib/query.cc
> +++ b/lib/query.cc
> @@ -57,6 +57,11 @@ struct visible _notmuch_threads {
>  notmuch_doc_id_set_t match_set;
>  };
>  
> +static notmuch_bool_t
> +_notmuch_doc_id_set_init (void *ctx,
> +   notmuch_doc_id_set_t *doc_ids,
> +   GArray *arr);
> +
>  notmuch_query_t *
>  notmuch_query_create (notmuch_database_t *notmuch,
> const char *query_string)
> @@ -173,6 +178,7 @@ notmuch_query_search_messages (notmuch_query_t *query)
>  "mail"));
>   Xapian::Query string_query, final_query, exclude_query;
>   Xapian::MSet mset;
> + Xapian::MSetIterator iterator;
>   unsigned int flags = (Xapian::QueryParser::FLAG_BOOLEAN |
> Xapian::QueryParser::FLAG_PHRASE |
> Xapian::QueryParser::FLAG_LOVEHATE |
> @@ -193,8 +199,21 @@ notmuch_query_search_messages (notmuch_query_t *query)
>  
>   exclude_query = _notmuch_exclude_tags (query, final_query);
>  
> - final_query = Xapian::Query (Xapian::Query::OP_AND_NOT,
> -  final_query, exclude_query);
> + enquire.set_weighting_scheme (Xapian::BoolWeight());
> + enquire.set_query (exclude_query);
> +
> + mset = enquire.get_mset (0, notmuch->xapian_db->get_doccount ());
> +
> + GArray *excluded_doc_ids = g_array_new (FALSE, FALSE, sizeof (unsigned 
> int));
> +
> + for (iterator = mset.begin (); iterator != mset.end (); iterator++)
> + {
> + unsigned int doc_id = *iterator;
> + g_array_append_val (excluded_doc_ids, doc_id);
> + }
> + messages->base.excluded_doc_ids = talloc (query, _notmuch_doc_id_set);
> + _notmuch_doc_id_set_init (query, messages->base.excluded_doc_ids,
> +   excluded_doc_ids);

This might be inefficient for message-only queries, since it will
fetch *all* excluded docids.  This highlights a basic difference
between message and thread search: thread search can return messages
that don't match the original query and hence needs to know all
potentially excluded messages, while message search can only return
messages that match the original query.

It's entirely possible this doesn't matter because Xapian probably
still needs to fetch the full posting lists of the excluded terms, but
it would be worth doing a quick/hacky benchmark to verify this, with
enough excluded messages to make the cost non-trivial.

If it does matter, you could pass in a flag indicating if the exclude
query should be limited by the original query or not.  Or you could do
the limited exclude query in notmuch_query_search_messages and a
separate open-ended exclude query in notmuch_query_search_threads.

>  
>   enquire.set_weighting_scheme (Xapian::BoolWeight());
>  
> @@ -294,6 +313,11 @@ _notmuch_mset_messages_move_to_next (notmuch_messages_t 
> *messages)
>  mset_messages = (notmuch_mset_messages_t *) messages;
>  
>  mset_messages->iterator++;
> +
> +while ((mset_messages->iterator != mset_messages->iterator_end) &&
> +(_notmuch_doc_id_set_contains (messages->excluded_doc_ids,
> +   *mset_messages->iterator)))
> + mset_messages->iterator++;

This seemed a little weird, since you remove it in the next patch.  Is
this just to keep the tests happy?  (If so, it would be worth
mentioning in the commit message; other reviewers will definitely have
the same question.)

>  }
>  
>  static notmuch_bool_t