[RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag
This is looking really good. I think this overall approach is significantly better than the initial exclude support and the UI aspects look like they should be much more pleasant. Quoth Mark Walters on Jan 29 at 6:36 pm: > > Ok I now have a patch set which might be complete enough to be worth > reviewing. It is essentially complete and appears to work. > > Things that still need doing: >updating the test suite. The series changes notmuch-show to >output the exclude flag so several tests need updating. Of >course, the new functionality needs some tests too. > >emacs/notmuch.el I think it would be nice to hide (make >invisible) threads with no matching non-excluded messages (with a >toggle for visibility) but that is definitely beyond my elisp >skills. I'm happy to do that once this goes in. > The first patch of the series is not really part of the series: it adds > a --do-not-exclude option to tell the command not to exclude. I think > this is useful anyway, but it also simplifies behaviour decisions with > the excludes. For example notmuch count will only count matching > non-excluded messages but notmuch count --do-not-exclude will count > all matching messages excluded or not. > > One outstanding issue is that raised in > id:"20120124025331.GZ16740 at mit.edu". I will need to think about > that. > > Best wishes > > Mark >
Re: [RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag
This is looking really good. I think this overall approach is significantly better than the initial exclude support and the UI aspects look like they should be much more pleasant. Quoth Mark Walters on Jan 29 at 6:36 pm: Ok I now have a patch set which might be complete enough to be worth reviewing. It is essentially complete and appears to work. Things that still need doing: updating the test suite. The series changes notmuch-show to output the exclude flag so several tests need updating. Of course, the new functionality needs some tests too. emacs/notmuch.el I think it would be nice to hide (make invisible) threads with no matching non-excluded messages (with a toggle for visibility) but that is definitely beyond my elisp skills. I'm happy to do that once this goes in. The first patch of the series is not really part of the series: it adds a --do-not-exclude option to tell the command not to exclude. I think this is useful anyway, but it also simplifies behaviour decisions with the excludes. For example notmuch count will only count matching non-excluded messages but notmuch count --do-not-exclude will count all matching messages excluded or not. One outstanding issue is that raised in id:20120124025331.gz16...@mit.edu. I will need to think about that. Best wishes Mark ___ notmuch mailing list notmuch@notmuchmail.org http://notmuchmail.org/mailman/listinfo/notmuch
[RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag
Ok I now have a patch set which might be complete enough to be worth reviewing. It is essentially complete and appears to work. Things that still need doing: updating the test suite. The series changes notmuch-show to output the exclude flag so several tests need updating. Of course, the new functionality needs some tests too. emacs/notmuch.el I think it would be nice to hide (make invisible) threads with no matching non-excluded messages (with a toggle for visibility) but that is definitely beyond my elisp skills. The first patch of the series is not really part of the series: it adds a --do-not-exclude option to tell the command not to exclude. I think this is useful anyway, but it also simplifies behaviour decisions with the excludes. For example notmuch count will only count matching non-excluded messages but notmuch count --do-not-exclude will count all matching messages excluded or not. One outstanding issue is that raised in id:"20120124025331.GZ16740 at mit.edu". I will need to think about that. Best wishes Mark
[RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag
> The cli stuff needs thought (about what it should do rather than > how to do it). Ok I have thought about the cli interface. My thoughts are as follows: count/search/show should all have a --do-not-exclude option. notmuch count: messages: just output count of matching not-excluded threads: count threads matching in a non-excluded message notmuch search: default/summary should output line as in the current patch with number matching being number non-excluded matching (so some will be zero) messages/files/tags should be matching not-excluded threads: show thread ids of messages matching in a non-excluded message notmuch show: raw and part both deal with a single message so should output it regardless of exclude flags. text/json can give out the results including the exclude flag mbox only include matching not-excluded The rationale is that all formats which can return an exclude flag do; those that cannot omit the excluded results since the caller can just call without setting the excludes if they want the full results. Does that seem reasonable? Best wishes Mark
[RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag
On Sat, 28 Jan 2012 13:33:40 -0500, Austin Clements wrote: > Quoth Mark Walters on Jan 28 at 10:51 am: > > > > > > exclude_query = _notmuch_exclude_tags (query, final_query); > > > > > > > > - final_query = Xapian::Query (Xapian::Query::OP_AND_NOT, > > > > -final_query, exclude_query); > > > > + enquire.set_weighting_scheme (Xapian::BoolWeight()); > > > > + enquire.set_query (exclude_query); > > > > + > > > > + mset = enquire.get_mset (0, notmuch->xapian_db->get_doccount > > > > ()); > > > > + > > > > + GArray *excluded_doc_ids = g_array_new (FALSE, FALSE, sizeof > > > > (unsigned int)); > > > > + > > > > + for (iterator = mset.begin (); iterator != mset.end (); > > > > iterator++) > > > > + { > > > > + unsigned int doc_id = *iterator; > > > > + g_array_append_val (excluded_doc_ids, doc_id); > > > > + } > > > > + messages->base.excluded_doc_ids = talloc (query, > > > > _notmuch_doc_id_set); > > > > + _notmuch_doc_id_set_init (query, > > > > messages->base.excluded_doc_ids, > > > > + excluded_doc_ids); > > > > > > This might be inefficient for message-only queries, since it will > > > fetch *all* excluded docids. This highlights a basic difference > > > between message and thread search: thread search can return messages > > > that don't match the original query and hence needs to know all > > > potentially excluded messages, while message search can only return > > > messages that match the original query. > > > > I now have some benchmarks (not run enough times to be hugely accurate > > so ignore minor differences). The full results are below. The summary > > is: > > > > Large-archive = 1 100 000 messages in 290 000 threads (about 10 years of > > lkml). I mark 1 000 000 deleted > > Small-archive = 70 000 messages in 35 000 threads. 10 000 marked > > deleted. > > > > Doing the initial exclude work on the big collection takes about 0.8s > > and on the small collection about 0.01s. So any query to the big > > collection takes at least 0.8s longer and this all occurs before any > > results appear. > > Interesting. Do you know where that time is spent? > > Also, it might be reasonable to assume that no more than, say, 10% of > a person's mail store is excluded, but maybe that depends on how > people use this feature. > > > I then implemented the exclude doing it once for each thread query in > > _notmuch_create_thread. Roughly this made any query 50% slower. > > That's not terrible. > > > In normal front end use even the 0.8s is not totally unusable, but it is > > totally unacceptable in the backend where a user might do something like > > > > for i in ` notmuch search --output=threads from:xxx ` ; > > do > >notmuch search --output=messages $i; > > done > > > > to list all messages in all matching threads. > > > > So I think my conclusions are: > > > > (1) message only queries must be done without the full exclude. > > (2) thread queries which only match one message should not do the full > > exclude > > (3) it would be nice to switch between the two approaches depending on > > size but I don't see how to do that without extra(!) queries > > (4) One possible might be do something that say does thirty threads with > > the by thread method and then if not finished does the full exclude. > > (5) thread-by-thread might be best for Jani's limit-match > > id:"1327692900-22926-1-git-send-email-jani at nikula.org" > > > > Obviously, anything setting an exclude flag like this will be slower > > (since it is doing more work): the question is are either of these (or a > > combination like (4) above) acceptable? > > Or only mark matched messages as excluded. > > Here's another idea (actually, a rehash of an old idea). For message > search do two queries, the original query and " AND > ", and use this to keep everything in order and mark excluded > messages. For thread search, use message search results so it's easy > to both sort by unexcluded messages and include fully-excluded > threads, but compute the excluded flag (either just for unmatched > messages or for all messages) by examining each message's tags > directly (which thread_add_message already iterates over, so this is > easy and won't add any overhead). If the excluded query is fast, > which I think it will be, I think this should get the best of all > worlds and be fairly straightforward to implement (no asymmetries > between the queries used for message and thread search). It would be > easy and worth it to run the excluded query by hand on your test > corpus; I suspect it will be much faster than 0.8s because the query > already uses "Tmail", which is huge and doesn't seem to slow things > down. I have tried your suggestion (still marking all messages) and it does seem the way to go: the difference in speed is small from master is small: between 0 and 10%
[RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag
Quoth Mark Walters on Jan 28 at 10:51 am: > > > > exclude_query = _notmuch_exclude_tags (query, final_query); > > > > > > - final_query = Xapian::Query (Xapian::Query::OP_AND_NOT, > > > - final_query, exclude_query); > > > + enquire.set_weighting_scheme (Xapian::BoolWeight()); > > > + enquire.set_query (exclude_query); > > > + > > > + mset = enquire.get_mset (0, notmuch->xapian_db->get_doccount ()); > > > + > > > + GArray *excluded_doc_ids = g_array_new (FALSE, FALSE, sizeof (unsigned > > > int)); > > > + > > > + for (iterator = mset.begin (); iterator != mset.end (); iterator++) > > > + { > > > + unsigned int doc_id = *iterator; > > > + g_array_append_val (excluded_doc_ids, doc_id); > > > + } > > > + messages->base.excluded_doc_ids = talloc (query, _notmuch_doc_id_set); > > > + _notmuch_doc_id_set_init (query, messages->base.excluded_doc_ids, > > > + excluded_doc_ids); > > > > This might be inefficient for message-only queries, since it will > > fetch *all* excluded docids. This highlights a basic difference > > between message and thread search: thread search can return messages > > that don't match the original query and hence needs to know all > > potentially excluded messages, while message search can only return > > messages that match the original query. > > I now have some benchmarks (not run enough times to be hugely accurate > so ignore minor differences). The full results are below. The summary > is: > > Large-archive = 1 100 000 messages in 290 000 threads (about 10 years of > lkml). I mark 1 000 000 deleted > Small-archive = 70 000 messages in 35 000 threads. 10 000 marked > deleted. > > Doing the initial exclude work on the big collection takes about 0.8s > and on the small collection about 0.01s. So any query to the big > collection takes at least 0.8s longer and this all occurs before any > results appear. Interesting. Do you know where that time is spent? Also, it might be reasonable to assume that no more than, say, 10% of a person's mail store is excluded, but maybe that depends on how people use this feature. > I then implemented the exclude doing it once for each thread query in > _notmuch_create_thread. Roughly this made any query 50% slower. That's not terrible. > In normal front end use even the 0.8s is not totally unusable, but it is > totally unacceptable in the backend where a user might do something like > > for i in ` notmuch search --output=threads from:xxx ` ; > do >notmuch search --output=messages $i; > done > > to list all messages in all matching threads. > > So I think my conclusions are: > > (1) message only queries must be done without the full exclude. > (2) thread queries which only match one message should not do the full > exclude > (3) it would be nice to switch between the two approaches depending on > size but I don't see how to do that without extra(!) queries > (4) One possible might be do something that say does thirty threads with > the by thread method and then if not finished does the full exclude. > (5) thread-by-thread might be best for Jani's limit-match > id:"1327692900-22926-1-git-send-email-jani at nikula.org" > > Obviously, anything setting an exclude flag like this will be slower > (since it is doing more work): the question is are either of these (or a > combination like (4) above) acceptable? Or only mark matched messages as excluded. Here's another idea (actually, a rehash of an old idea). For message search do two queries, the original query and " AND ", and use this to keep everything in order and mark excluded messages. For thread search, use message search results so it's easy to both sort by unexcluded messages and include fully-excluded threads, but compute the excluded flag (either just for unmatched messages or for all messages) by examining each message's tags directly (which thread_add_message already iterates over, so this is easy and won't add any overhead). If the excluded query is fast, which I think it will be, I think this should get the best of all worlds and be fairly straightforward to implement (no asymmetries between the queries used for message and thread search). It would be easy and worth it to run the excluded query by hand on your test corpus; I suspect it will be much faster than 0.8s because the query already uses "Tmail", which is huge and doesn't seem to slow things down. > I now have a mostly working implementation from library to > emacs frontend and I do like the overall outcome. Awesome. > The complete benchmarks are below > > Best wishes > > Mark > > LARGE COLLECTION is 1,100,000 messages 290,000 threads 1,000,000 deleted > SMALL COLLECTION is 70,000 messages in 35,000 threads 10,000 deleted > > benchmarks: all times in seconds, x/y/z means a query which matches x > threads with y matching messages and z messages in total. Ig or ignore > means with the tag-exclude turned off (i.e. with a query
[RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag
> > exclude_query = _notmuch_exclude_tags (query, final_query); > > > > - final_query = Xapian::Query (Xapian::Query::OP_AND_NOT, > > -final_query, exclude_query); > > + enquire.set_weighting_scheme (Xapian::BoolWeight()); > > + enquire.set_query (exclude_query); > > + > > + mset = enquire.get_mset (0, notmuch->xapian_db->get_doccount ()); > > + > > + GArray *excluded_doc_ids = g_array_new (FALSE, FALSE, sizeof (unsigned > > int)); > > + > > + for (iterator = mset.begin (); iterator != mset.end (); iterator++) > > + { > > + unsigned int doc_id = *iterator; > > + g_array_append_val (excluded_doc_ids, doc_id); > > + } > > + messages->base.excluded_doc_ids = talloc (query, _notmuch_doc_id_set); > > + _notmuch_doc_id_set_init (query, messages->base.excluded_doc_ids, > > + excluded_doc_ids); > > This might be inefficient for message-only queries, since it will > fetch *all* excluded docids. This highlights a basic difference > between message and thread search: thread search can return messages > that don't match the original query and hence needs to know all > potentially excluded messages, while message search can only return > messages that match the original query. I now have some benchmarks (not run enough times to be hugely accurate so ignore minor differences). The full results are below. The summary is: Large-archive = 1 100 000 messages in 290 000 threads (about 10 years of lkml). I mark 1 000 000 deleted Small-archive = 70 000 messages in 35 000 threads. 10 000 marked deleted. Doing the initial exclude work on the big collection takes about 0.8s and on the small collection about 0.01s. So any query to the big collection takes at least 0.8s longer and this all occurs before any results appear. I then implemented the exclude doing it once for each thread query in _notmuch_create_thread. Roughly this made any query 50% slower. In normal front end use even the 0.8s is not totally unusable, but it is totally unacceptable in the backend where a user might do something like for i in ` notmuch search --output=threads from:xxx ` ; do notmuch search --output=messages $i; done to list all messages in all matching threads. So I think my conclusions are: (1) message only queries must be done without the full exclude. (2) thread queries which only match one message should not do the full exclude (3) it would be nice to switch between the two approaches depending on size but I don't see how to do that without extra(!) queries (4) One possible might be do something that say does thirty threads with the by thread method and then if not finished does the full exclude. (5) thread-by-thread might be best for Jani's limit-match id:"1327692900-22926-1-git-send-email-jani at nikula.org" Obviously, anything setting an exclude flag like this will be slower (since it is doing more work): the question is are either of these (or a combination like (4) above) acceptable? I now have a mostly working implementation from library to emacs frontend and I do like the overall outcome. The complete benchmarks are below Best wishes Mark LARGE COLLECTION is 1,100,000 messages 290,000 threads 1,000,000 deleted SMALL COLLECTION is 70,000 messages in 35,000 threads 10,000 deleted benchmarks: all times in seconds, x/y/z means a query which matches x threads with y matching messages and z messages in total. Ig or ignore means with the tag-exclude turned off (i.e. with a query matching the excluded tag). list all messages is the time for the for loop listed above giving all message-ids for all messages in any thread matching a query. Finally the three columns are master with exclude code disabled, thread-thread is doing excludes once per thread construction, and in-advance does all the exclude work in advance as in the patches I posted. In most cases the benchmark is the average of a lot of runs so the database should have been as cached as one could hope. master-(all)thread-thread in-advance LARGE COLLECTION show single message 0.016 0.018 0.78 search single message 0.015 0.016 0.78 search single with tag 0.015 0.015 0.009 945/2627/2 query ignore2.9 n/a 3 query 2.9 4.2 3.8 list all messages (ig) 13 n/a 13 list all messages 13 14 12mins 4754/13000/11 query ignore15.9n/a 17 query 15.922 17.6 only messages 1.251.261.9 177/483/1752 query 0.3 0.421.1 search '*' 20mins 28mins 21.5mins SMALL COLLECTION 1500/2800/5600
Re: [RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag
Quoth Mark Walters on Jan 28 at 10:51 am: exclude_query = _notmuch_exclude_tags (query, final_query); - final_query = Xapian::Query (Xapian::Query::OP_AND_NOT, - final_query, exclude_query); + enquire.set_weighting_scheme (Xapian::BoolWeight()); + enquire.set_query (exclude_query); + + mset = enquire.get_mset (0, notmuch-xapian_db-get_doccount ()); + + GArray *excluded_doc_ids = g_array_new (FALSE, FALSE, sizeof (unsigned int)); + + for (iterator = mset.begin (); iterator != mset.end (); iterator++) + { + unsigned int doc_id = *iterator; + g_array_append_val (excluded_doc_ids, doc_id); + } + messages-base.excluded_doc_ids = talloc (query, _notmuch_doc_id_set); + _notmuch_doc_id_set_init (query, messages-base.excluded_doc_ids, + excluded_doc_ids); This might be inefficient for message-only queries, since it will fetch *all* excluded docids. This highlights a basic difference between message and thread search: thread search can return messages that don't match the original query and hence needs to know all potentially excluded messages, while message search can only return messages that match the original query. I now have some benchmarks (not run enough times to be hugely accurate so ignore minor differences). The full results are below. The summary is: Large-archive = 1 100 000 messages in 290 000 threads (about 10 years of lkml). I mark 1 000 000 deleted Small-archive = 70 000 messages in 35 000 threads. 10 000 marked deleted. Doing the initial exclude work on the big collection takes about 0.8s and on the small collection about 0.01s. So any query to the big collection takes at least 0.8s longer and this all occurs before any results appear. Interesting. Do you know where that time is spent? Also, it might be reasonable to assume that no more than, say, 10% of a person's mail store is excluded, but maybe that depends on how people use this feature. I then implemented the exclude doing it once for each thread query in _notmuch_create_thread. Roughly this made any query 50% slower. That's not terrible. In normal front end use even the 0.8s is not totally unusable, but it is totally unacceptable in the backend where a user might do something like for i in ` notmuch search --output=threads from:xxx ` ; do notmuch search --output=messages $i; done to list all messages in all matching threads. So I think my conclusions are: (1) message only queries must be done without the full exclude. (2) thread queries which only match one message should not do the full exclude (3) it would be nice to switch between the two approaches depending on size but I don't see how to do that without extra(!) queries (4) One possible might be do something that say does thirty threads with the by thread method and then if not finished does the full exclude. (5) thread-by-thread might be best for Jani's limit-match id:1327692900-22926-1-git-send-email-j...@nikula.org Obviously, anything setting an exclude flag like this will be slower (since it is doing more work): the question is are either of these (or a combination like (4) above) acceptable? Or only mark matched messages as excluded. Here's another idea (actually, a rehash of an old idea). For message search do two queries, the original query and original AND exclude, and use this to keep everything in order and mark excluded messages. For thread search, use message search results so it's easy to both sort by unexcluded messages and include fully-excluded threads, but compute the excluded flag (either just for unmatched messages or for all messages) by examining each message's tags directly (which thread_add_message already iterates over, so this is easy and won't add any overhead). If the excluded query is fast, which I think it will be, I think this should get the best of all worlds and be fairly straightforward to implement (no asymmetries between the queries used for message and thread search). It would be easy and worth it to run the excluded query by hand on your test corpus; I suspect it will be much faster than 0.8s because the query already uses Tmail, which is huge and doesn't seem to slow things down. I now have a mostly working implementation from library to emacs frontend and I do like the overall outcome. Awesome. The complete benchmarks are below Best wishes Mark LARGE COLLECTION is 1,100,000 messages 290,000 threads 1,000,000 deleted SMALL COLLECTION is 70,000 messages in 35,000 threads 10,000 deleted benchmarks: all times in seconds, x/y/z means a query which matches x threads with y matching messages and z messages in total. Ig or ignore means with the tag-exclude turned off (i.e. with a query matching the excluded tag). list all messages is the time for the for loop listed above giving all message-ids for all messages in
Re: [RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag
On Sat, 28 Jan 2012 13:33:40 -0500, Austin Clements amdra...@mit.edu wrote: Quoth Mark Walters on Jan 28 at 10:51 am: exclude_query = _notmuch_exclude_tags (query, final_query); - final_query = Xapian::Query (Xapian::Query::OP_AND_NOT, -final_query, exclude_query); + enquire.set_weighting_scheme (Xapian::BoolWeight()); + enquire.set_query (exclude_query); + + mset = enquire.get_mset (0, notmuch-xapian_db-get_doccount ()); + + GArray *excluded_doc_ids = g_array_new (FALSE, FALSE, sizeof (unsigned int)); + + for (iterator = mset.begin (); iterator != mset.end (); iterator++) + { + unsigned int doc_id = *iterator; + g_array_append_val (excluded_doc_ids, doc_id); + } + messages-base.excluded_doc_ids = talloc (query, _notmuch_doc_id_set); + _notmuch_doc_id_set_init (query, messages-base.excluded_doc_ids, + excluded_doc_ids); This might be inefficient for message-only queries, since it will fetch *all* excluded docids. This highlights a basic difference between message and thread search: thread search can return messages that don't match the original query and hence needs to know all potentially excluded messages, while message search can only return messages that match the original query. I now have some benchmarks (not run enough times to be hugely accurate so ignore minor differences). The full results are below. The summary is: Large-archive = 1 100 000 messages in 290 000 threads (about 10 years of lkml). I mark 1 000 000 deleted Small-archive = 70 000 messages in 35 000 threads. 10 000 marked deleted. Doing the initial exclude work on the big collection takes about 0.8s and on the small collection about 0.01s. So any query to the big collection takes at least 0.8s longer and this all occurs before any results appear. Interesting. Do you know where that time is spent? Also, it might be reasonable to assume that no more than, say, 10% of a person's mail store is excluded, but maybe that depends on how people use this feature. I then implemented the exclude doing it once for each thread query in _notmuch_create_thread. Roughly this made any query 50% slower. That's not terrible. In normal front end use even the 0.8s is not totally unusable, but it is totally unacceptable in the backend where a user might do something like for i in ` notmuch search --output=threads from:xxx ` ; do notmuch search --output=messages $i; done to list all messages in all matching threads. So I think my conclusions are: (1) message only queries must be done without the full exclude. (2) thread queries which only match one message should not do the full exclude (3) it would be nice to switch between the two approaches depending on size but I don't see how to do that without extra(!) queries (4) One possible might be do something that say does thirty threads with the by thread method and then if not finished does the full exclude. (5) thread-by-thread might be best for Jani's limit-match id:1327692900-22926-1-git-send-email-j...@nikula.org Obviously, anything setting an exclude flag like this will be slower (since it is doing more work): the question is are either of these (or a combination like (4) above) acceptable? Or only mark matched messages as excluded. Here's another idea (actually, a rehash of an old idea). For message search do two queries, the original query and original AND exclude, and use this to keep everything in order and mark excluded messages. For thread search, use message search results so it's easy to both sort by unexcluded messages and include fully-excluded threads, but compute the excluded flag (either just for unmatched messages or for all messages) by examining each message's tags directly (which thread_add_message already iterates over, so this is easy and won't add any overhead). If the excluded query is fast, which I think it will be, I think this should get the best of all worlds and be fairly straightforward to implement (no asymmetries between the queries used for message and thread search). It would be easy and worth it to run the excluded query by hand on your test corpus; I suspect it will be much faster than 0.8s because the query already uses Tmail, which is huge and doesn't seem to slow things down. I have tried your suggestion (still marking all messages) and it does seem the way to go: the difference in speed is small from master is small: between 0 and 10% for most of the tests. The code seems to work and I will post it in reply to this thread. The library code is reasonable (although whether messages matching an exclude tag that has been specified in the query
[RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag
On Mon, 23 Jan 2012 21:45:21 -0500, Austin Clements wrote: > The overall structure of this series looks great. There's obviously a > lot of clean up to do, but I'll reply with a few high-level comments. > > Quoth Mark Walters on Jan 24 at 1:18 am: > > Form excluded doc_ids set and use that to exclude messages. > > Should be no functional change. > > > > --- > > lib/notmuch-private.h |1 + > > lib/query.cc | 28 ++-- > > 2 files changed, 27 insertions(+), 2 deletions(-) > > > > diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h > > index 7bf153e..e791bb0 100644 > > --- a/lib/notmuch-private.h > > +++ b/lib/notmuch-private.h > > @@ -401,6 +401,7 @@ typedef struct _notmuch_message_list { > > */ > > struct visible _notmuch_messages { > > notmuch_bool_t is_of_list_type; > > +notmuch_doc_id_set_t *excluded_doc_ids; > > notmuch_message_node_t *iterator; > > }; > > > > diff --git a/lib/query.cc b/lib/query.cc > > index c25b301..92fa834 100644 > > --- a/lib/query.cc > > +++ b/lib/query.cc > > @@ -57,6 +57,11 @@ struct visible _notmuch_threads { > > notmuch_doc_id_set_t match_set; > > }; > > > > +static notmuch_bool_t > > +_notmuch_doc_id_set_init (void *ctx, > > + notmuch_doc_id_set_t *doc_ids, > > + GArray *arr); > > + > > notmuch_query_t * > > notmuch_query_create (notmuch_database_t *notmuch, > > const char *query_string) > > @@ -173,6 +178,7 @@ notmuch_query_search_messages (notmuch_query_t *query) > >"mail")); > > Xapian::Query string_query, final_query, exclude_query; > > Xapian::MSet mset; > > + Xapian::MSetIterator iterator; > > unsigned int flags = (Xapian::QueryParser::FLAG_BOOLEAN | > > Xapian::QueryParser::FLAG_PHRASE | > > Xapian::QueryParser::FLAG_LOVEHATE | > > @@ -193,8 +199,21 @@ notmuch_query_search_messages (notmuch_query_t *query) > > > > exclude_query = _notmuch_exclude_tags (query, final_query); > > > > - final_query = Xapian::Query (Xapian::Query::OP_AND_NOT, > > -final_query, exclude_query); > > + enquire.set_weighting_scheme (Xapian::BoolWeight()); > > + enquire.set_query (exclude_query); > > + > > + mset = enquire.get_mset (0, notmuch->xapian_db->get_doccount ()); > > + > > + GArray *excluded_doc_ids = g_array_new (FALSE, FALSE, sizeof (unsigned > > int)); > > + > > + for (iterator = mset.begin (); iterator != mset.end (); iterator++) > > + { > > + unsigned int doc_id = *iterator; > > + g_array_append_val (excluded_doc_ids, doc_id); > > + } > > + messages->base.excluded_doc_ids = talloc (query, _notmuch_doc_id_set); > > + _notmuch_doc_id_set_init (query, messages->base.excluded_doc_ids, > > + excluded_doc_ids); > > This might be inefficient for message-only queries, since it will > fetch *all* excluded docids. This highlights a basic difference > between message and thread search: thread search can return messages > that don't match the original query and hence needs to know all > potentially excluded messages, while message search can only return > messages that match the original query. > > It's entirely possible this doesn't matter because Xapian probably > still needs to fetch the full posting lists of the excluded terms, but > it would be worth doing a quick/hacky benchmark to verify this, with > enough excluded messages to make the cost non-trivial. > > If it does matter, you could pass in a flag indicating if the exclude > query should be limited by the original query or not. Or you could do > the limited exclude query in notmuch_query_search_messages and a > separate open-ended exclude query in notmuch_query_search_threads. Yes I will benchmark that: I am just importing a large archive into notmuch for testing. > > enquire.set_weighting_scheme (Xapian::BoolWeight()); > > > > @@ -294,6 +313,11 @@ _notmuch_mset_messages_move_to_next > > (notmuch_messages_t *messages) > > mset_messages = (notmuch_mset_messages_t *) messages; > > > > mset_messages->iterator++; > > + > > +while ((mset_messages->iterator != mset_messages->iterator_end) && > > + (_notmuch_doc_id_set_contains (messages->excluded_doc_ids, > > + *mset_messages->iterator))) > > + mset_messages->iterator++; > > This seemed a little weird, since you remove it in the next patch. Is > this just to keep the tests happy? (If so, it would be worth > mentioning in the commit message; other reviewers will definitely have > the same question.) Essentially just to keep tests happy: or rather to try and make it easy for a reviewer to see that the individual patch does not make any functional change. Best wishes Mark
[RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag
Form excluded doc_ids set and use that to exclude messages. Should be no functional change. --- lib/notmuch-private.h |1 + lib/query.cc | 28 ++-- 2 files changed, 27 insertions(+), 2 deletions(-) diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h index 7bf153e..e791bb0 100644 --- a/lib/notmuch-private.h +++ b/lib/notmuch-private.h @@ -401,6 +401,7 @@ typedef struct _notmuch_message_list { */ struct visible _notmuch_messages { notmuch_bool_t is_of_list_type; +notmuch_doc_id_set_t *excluded_doc_ids; notmuch_message_node_t *iterator; }; diff --git a/lib/query.cc b/lib/query.cc index c25b301..92fa834 100644 --- a/lib/query.cc +++ b/lib/query.cc @@ -57,6 +57,11 @@ struct visible _notmuch_threads { notmuch_doc_id_set_t match_set; }; +static notmuch_bool_t +_notmuch_doc_id_set_init (void *ctx, + notmuch_doc_id_set_t *doc_ids, + GArray *arr); + notmuch_query_t * notmuch_query_create (notmuch_database_t *notmuch, const char *query_string) @@ -173,6 +178,7 @@ notmuch_query_search_messages (notmuch_query_t *query) "mail")); Xapian::Query string_query, final_query, exclude_query; Xapian::MSet mset; + Xapian::MSetIterator iterator; unsigned int flags = (Xapian::QueryParser::FLAG_BOOLEAN | Xapian::QueryParser::FLAG_PHRASE | Xapian::QueryParser::FLAG_LOVEHATE | @@ -193,8 +199,21 @@ notmuch_query_search_messages (notmuch_query_t *query) exclude_query = _notmuch_exclude_tags (query, final_query); - final_query = Xapian::Query (Xapian::Query::OP_AND_NOT, -final_query, exclude_query); + enquire.set_weighting_scheme (Xapian::BoolWeight()); + enquire.set_query (exclude_query); + + mset = enquire.get_mset (0, notmuch->xapian_db->get_doccount ()); + + GArray *excluded_doc_ids = g_array_new (FALSE, FALSE, sizeof (unsigned int)); + + for (iterator = mset.begin (); iterator != mset.end (); iterator++) + { + unsigned int doc_id = *iterator; + g_array_append_val (excluded_doc_ids, doc_id); + } + messages->base.excluded_doc_ids = talloc (query, _notmuch_doc_id_set); + _notmuch_doc_id_set_init (query, messages->base.excluded_doc_ids, + excluded_doc_ids); enquire.set_weighting_scheme (Xapian::BoolWeight()); @@ -294,6 +313,11 @@ _notmuch_mset_messages_move_to_next (notmuch_messages_t *messages) mset_messages = (notmuch_mset_messages_t *) messages; mset_messages->iterator++; + +while ((mset_messages->iterator != mset_messages->iterator_end) && + (_notmuch_doc_id_set_contains (messages->excluded_doc_ids, + *mset_messages->iterator))) + mset_messages->iterator++; } static notmuch_bool_t -- 1.7.2.3
[RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag
Form excluded doc_ids set and use that to exclude messages. Should be no functional change. --- lib/notmuch-private.h |1 + lib/query.cc | 28 ++-- 2 files changed, 27 insertions(+), 2 deletions(-) diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h index 7bf153e..e791bb0 100644 --- a/lib/notmuch-private.h +++ b/lib/notmuch-private.h @@ -401,6 +401,7 @@ typedef struct _notmuch_message_list { */ struct visible _notmuch_messages { notmuch_bool_t is_of_list_type; +notmuch_doc_id_set_t *excluded_doc_ids; notmuch_message_node_t *iterator; }; diff --git a/lib/query.cc b/lib/query.cc index c25b301..92fa834 100644 --- a/lib/query.cc +++ b/lib/query.cc @@ -57,6 +57,11 @@ struct visible _notmuch_threads { notmuch_doc_id_set_t match_set; }; +static notmuch_bool_t +_notmuch_doc_id_set_init (void *ctx, + notmuch_doc_id_set_t *doc_ids, + GArray *arr); + notmuch_query_t * notmuch_query_create (notmuch_database_t *notmuch, const char *query_string) @@ -173,6 +178,7 @@ notmuch_query_search_messages (notmuch_query_t *query) mail)); Xapian::Query string_query, final_query, exclude_query; Xapian::MSet mset; + Xapian::MSetIterator iterator; unsigned int flags = (Xapian::QueryParser::FLAG_BOOLEAN | Xapian::QueryParser::FLAG_PHRASE | Xapian::QueryParser::FLAG_LOVEHATE | @@ -193,8 +199,21 @@ notmuch_query_search_messages (notmuch_query_t *query) exclude_query = _notmuch_exclude_tags (query, final_query); - final_query = Xapian::Query (Xapian::Query::OP_AND_NOT, -final_query, exclude_query); + enquire.set_weighting_scheme (Xapian::BoolWeight()); + enquire.set_query (exclude_query); + + mset = enquire.get_mset (0, notmuch-xapian_db-get_doccount ()); + + GArray *excluded_doc_ids = g_array_new (FALSE, FALSE, sizeof (unsigned int)); + + for (iterator = mset.begin (); iterator != mset.end (); iterator++) + { + unsigned int doc_id = *iterator; + g_array_append_val (excluded_doc_ids, doc_id); + } + messages-base.excluded_doc_ids = talloc (query, _notmuch_doc_id_set); + _notmuch_doc_id_set_init (query, messages-base.excluded_doc_ids, + excluded_doc_ids); enquire.set_weighting_scheme (Xapian::BoolWeight()); @@ -294,6 +313,11 @@ _notmuch_mset_messages_move_to_next (notmuch_messages_t *messages) mset_messages = (notmuch_mset_messages_t *) messages; mset_messages-iterator++; + +while ((mset_messages-iterator != mset_messages-iterator_end) + (_notmuch_doc_id_set_contains (messages-excluded_doc_ids, + *mset_messages-iterator))) + mset_messages-iterator++; } static notmuch_bool_t -- 1.7.2.3 ___ notmuch mailing list notmuch@notmuchmail.org http://notmuchmail.org/mailman/listinfo/notmuch
Re: [RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag
The overall structure of this series looks great. There's obviously a lot of clean up to do, but I'll reply with a few high-level comments. Quoth Mark Walters on Jan 24 at 1:18 am: Form excluded doc_ids set and use that to exclude messages. Should be no functional change. --- lib/notmuch-private.h |1 + lib/query.cc | 28 ++-- 2 files changed, 27 insertions(+), 2 deletions(-) diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h index 7bf153e..e791bb0 100644 --- a/lib/notmuch-private.h +++ b/lib/notmuch-private.h @@ -401,6 +401,7 @@ typedef struct _notmuch_message_list { */ struct visible _notmuch_messages { notmuch_bool_t is_of_list_type; +notmuch_doc_id_set_t *excluded_doc_ids; notmuch_message_node_t *iterator; }; diff --git a/lib/query.cc b/lib/query.cc index c25b301..92fa834 100644 --- a/lib/query.cc +++ b/lib/query.cc @@ -57,6 +57,11 @@ struct visible _notmuch_threads { notmuch_doc_id_set_t match_set; }; +static notmuch_bool_t +_notmuch_doc_id_set_init (void *ctx, + notmuch_doc_id_set_t *doc_ids, + GArray *arr); + notmuch_query_t * notmuch_query_create (notmuch_database_t *notmuch, const char *query_string) @@ -173,6 +178,7 @@ notmuch_query_search_messages (notmuch_query_t *query) mail)); Xapian::Query string_query, final_query, exclude_query; Xapian::MSet mset; + Xapian::MSetIterator iterator; unsigned int flags = (Xapian::QueryParser::FLAG_BOOLEAN | Xapian::QueryParser::FLAG_PHRASE | Xapian::QueryParser::FLAG_LOVEHATE | @@ -193,8 +199,21 @@ notmuch_query_search_messages (notmuch_query_t *query) exclude_query = _notmuch_exclude_tags (query, final_query); - final_query = Xapian::Query (Xapian::Query::OP_AND_NOT, - final_query, exclude_query); + enquire.set_weighting_scheme (Xapian::BoolWeight()); + enquire.set_query (exclude_query); + + mset = enquire.get_mset (0, notmuch-xapian_db-get_doccount ()); + + GArray *excluded_doc_ids = g_array_new (FALSE, FALSE, sizeof (unsigned int)); + + for (iterator = mset.begin (); iterator != mset.end (); iterator++) + { + unsigned int doc_id = *iterator; + g_array_append_val (excluded_doc_ids, doc_id); + } + messages-base.excluded_doc_ids = talloc (query, _notmuch_doc_id_set); + _notmuch_doc_id_set_init (query, messages-base.excluded_doc_ids, + excluded_doc_ids); This might be inefficient for message-only queries, since it will fetch *all* excluded docids. This highlights a basic difference between message and thread search: thread search can return messages that don't match the original query and hence needs to know all potentially excluded messages, while message search can only return messages that match the original query. It's entirely possible this doesn't matter because Xapian probably still needs to fetch the full posting lists of the excluded terms, but it would be worth doing a quick/hacky benchmark to verify this, with enough excluded messages to make the cost non-trivial. If it does matter, you could pass in a flag indicating if the exclude query should be limited by the original query or not. Or you could do the limited exclude query in notmuch_query_search_messages and a separate open-ended exclude query in notmuch_query_search_threads. enquire.set_weighting_scheme (Xapian::BoolWeight()); @@ -294,6 +313,11 @@ _notmuch_mset_messages_move_to_next (notmuch_messages_t *messages) mset_messages = (notmuch_mset_messages_t *) messages; mset_messages-iterator++; + +while ((mset_messages-iterator != mset_messages-iterator_end) +(_notmuch_doc_id_set_contains (messages-excluded_doc_ids, + *mset_messages-iterator))) + mset_messages-iterator++; This seemed a little weird, since you remove it in the next patch. Is this just to keep the tests happy? (If so, it would be worth mentioning in the commit message; other reviewers will definitely have the same question.) } static notmuch_bool_t ___ notmuch mailing list notmuch@notmuchmail.org http://notmuchmail.org/mailman/listinfo/notmuch
Re: [RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag
On Mon, 23 Jan 2012 21:45:21 -0500, Austin Clements amdra...@mit.edu wrote: The overall structure of this series looks great. There's obviously a lot of clean up to do, but I'll reply with a few high-level comments. Quoth Mark Walters on Jan 24 at 1:18 am: Form excluded doc_ids set and use that to exclude messages. Should be no functional change. --- lib/notmuch-private.h |1 + lib/query.cc | 28 ++-- 2 files changed, 27 insertions(+), 2 deletions(-) diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h index 7bf153e..e791bb0 100644 --- a/lib/notmuch-private.h +++ b/lib/notmuch-private.h @@ -401,6 +401,7 @@ typedef struct _notmuch_message_list { */ struct visible _notmuch_messages { notmuch_bool_t is_of_list_type; +notmuch_doc_id_set_t *excluded_doc_ids; notmuch_message_node_t *iterator; }; diff --git a/lib/query.cc b/lib/query.cc index c25b301..92fa834 100644 --- a/lib/query.cc +++ b/lib/query.cc @@ -57,6 +57,11 @@ struct visible _notmuch_threads { notmuch_doc_id_set_t match_set; }; +static notmuch_bool_t +_notmuch_doc_id_set_init (void *ctx, + notmuch_doc_id_set_t *doc_ids, + GArray *arr); + notmuch_query_t * notmuch_query_create (notmuch_database_t *notmuch, const char *query_string) @@ -173,6 +178,7 @@ notmuch_query_search_messages (notmuch_query_t *query) mail)); Xapian::Query string_query, final_query, exclude_query; Xapian::MSet mset; + Xapian::MSetIterator iterator; unsigned int flags = (Xapian::QueryParser::FLAG_BOOLEAN | Xapian::QueryParser::FLAG_PHRASE | Xapian::QueryParser::FLAG_LOVEHATE | @@ -193,8 +199,21 @@ notmuch_query_search_messages (notmuch_query_t *query) exclude_query = _notmuch_exclude_tags (query, final_query); - final_query = Xapian::Query (Xapian::Query::OP_AND_NOT, -final_query, exclude_query); + enquire.set_weighting_scheme (Xapian::BoolWeight()); + enquire.set_query (exclude_query); + + mset = enquire.get_mset (0, notmuch-xapian_db-get_doccount ()); + + GArray *excluded_doc_ids = g_array_new (FALSE, FALSE, sizeof (unsigned int)); + + for (iterator = mset.begin (); iterator != mset.end (); iterator++) + { + unsigned int doc_id = *iterator; + g_array_append_val (excluded_doc_ids, doc_id); + } + messages-base.excluded_doc_ids = talloc (query, _notmuch_doc_id_set); + _notmuch_doc_id_set_init (query, messages-base.excluded_doc_ids, + excluded_doc_ids); This might be inefficient for message-only queries, since it will fetch *all* excluded docids. This highlights a basic difference between message and thread search: thread search can return messages that don't match the original query and hence needs to know all potentially excluded messages, while message search can only return messages that match the original query. It's entirely possible this doesn't matter because Xapian probably still needs to fetch the full posting lists of the excluded terms, but it would be worth doing a quick/hacky benchmark to verify this, with enough excluded messages to make the cost non-trivial. If it does matter, you could pass in a flag indicating if the exclude query should be limited by the original query or not. Or you could do the limited exclude query in notmuch_query_search_messages and a separate open-ended exclude query in notmuch_query_search_threads. Yes I will benchmark that: I am just importing a large archive into notmuch for testing. enquire.set_weighting_scheme (Xapian::BoolWeight()); @@ -294,6 +313,11 @@ _notmuch_mset_messages_move_to_next (notmuch_messages_t *messages) mset_messages = (notmuch_mset_messages_t *) messages; mset_messages-iterator++; + +while ((mset_messages-iterator != mset_messages-iterator_end) + (_notmuch_doc_id_set_contains (messages-excluded_doc_ids, + *mset_messages-iterator))) + mset_messages-iterator++; This seemed a little weird, since you remove it in the next patch. Is this just to keep the tests happy? (If so, it would be worth mentioning in the commit message; other reviewers will definitely have the same question.) Essentially just to keep tests happy: or rather to try and make it easy for a reviewer to see that the individual patch does not make any functional change. Best wishes Mark ___ notmuch mailing list notmuch@notmuchmail.org http://notmuchmail.org/mailman/listinfo/notmuch
[RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag
The overall structure of this series looks great. There's obviously a lot of clean up to do, but I'll reply with a few high-level comments. Quoth Mark Walters on Jan 24 at 1:18 am: > Form excluded doc_ids set and use that to exclude messages. > Should be no functional change. > > --- > lib/notmuch-private.h |1 + > lib/query.cc | 28 ++-- > 2 files changed, 27 insertions(+), 2 deletions(-) > > diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h > index 7bf153e..e791bb0 100644 > --- a/lib/notmuch-private.h > +++ b/lib/notmuch-private.h > @@ -401,6 +401,7 @@ typedef struct _notmuch_message_list { > */ > struct visible _notmuch_messages { > notmuch_bool_t is_of_list_type; > +notmuch_doc_id_set_t *excluded_doc_ids; > notmuch_message_node_t *iterator; > }; > > diff --git a/lib/query.cc b/lib/query.cc > index c25b301..92fa834 100644 > --- a/lib/query.cc > +++ b/lib/query.cc > @@ -57,6 +57,11 @@ struct visible _notmuch_threads { > notmuch_doc_id_set_t match_set; > }; > > +static notmuch_bool_t > +_notmuch_doc_id_set_init (void *ctx, > + notmuch_doc_id_set_t *doc_ids, > + GArray *arr); > + > notmuch_query_t * > notmuch_query_create (notmuch_database_t *notmuch, > const char *query_string) > @@ -173,6 +178,7 @@ notmuch_query_search_messages (notmuch_query_t *query) > "mail")); > Xapian::Query string_query, final_query, exclude_query; > Xapian::MSet mset; > + Xapian::MSetIterator iterator; > unsigned int flags = (Xapian::QueryParser::FLAG_BOOLEAN | > Xapian::QueryParser::FLAG_PHRASE | > Xapian::QueryParser::FLAG_LOVEHATE | > @@ -193,8 +199,21 @@ notmuch_query_search_messages (notmuch_query_t *query) > > exclude_query = _notmuch_exclude_tags (query, final_query); > > - final_query = Xapian::Query (Xapian::Query::OP_AND_NOT, > - final_query, exclude_query); > + enquire.set_weighting_scheme (Xapian::BoolWeight()); > + enquire.set_query (exclude_query); > + > + mset = enquire.get_mset (0, notmuch->xapian_db->get_doccount ()); > + > + GArray *excluded_doc_ids = g_array_new (FALSE, FALSE, sizeof (unsigned > int)); > + > + for (iterator = mset.begin (); iterator != mset.end (); iterator++) > + { > + unsigned int doc_id = *iterator; > + g_array_append_val (excluded_doc_ids, doc_id); > + } > + messages->base.excluded_doc_ids = talloc (query, _notmuch_doc_id_set); > + _notmuch_doc_id_set_init (query, messages->base.excluded_doc_ids, > + excluded_doc_ids); This might be inefficient for message-only queries, since it will fetch *all* excluded docids. This highlights a basic difference between message and thread search: thread search can return messages that don't match the original query and hence needs to know all potentially excluded messages, while message search can only return messages that match the original query. It's entirely possible this doesn't matter because Xapian probably still needs to fetch the full posting lists of the excluded terms, but it would be worth doing a quick/hacky benchmark to verify this, with enough excluded messages to make the cost non-trivial. If it does matter, you could pass in a flag indicating if the exclude query should be limited by the original query or not. Or you could do the limited exclude query in notmuch_query_search_messages and a separate open-ended exclude query in notmuch_query_search_threads. > > enquire.set_weighting_scheme (Xapian::BoolWeight()); > > @@ -294,6 +313,11 @@ _notmuch_mset_messages_move_to_next (notmuch_messages_t > *messages) > mset_messages = (notmuch_mset_messages_t *) messages; > > mset_messages->iterator++; > + > +while ((mset_messages->iterator != mset_messages->iterator_end) && > +(_notmuch_doc_id_set_contains (messages->excluded_doc_ids, > + *mset_messages->iterator))) > + mset_messages->iterator++; This seemed a little weird, since you remove it in the next patch. Is this just to keep the tests happy? (If so, it would be worth mentioning in the commit message; other reviewers will definitely have the same question.) > } > > static notmuch_bool_t