Re: how to search for hyphenated words? (was: how to search for Morse code?)

2019-03-13 Thread Matt Armstrong
David Bremner  writes:

> Matt Armstrong  writes:
>
>> Carl Worth  writes:
>>
>>> Hi Gregor,
>>>
>>> The trick here is that when notmuch is indexing body text it feeds it
>>> into a Xapian function that parses the text by finding "terms" in the
>>> text. And this parser considers both punctuation and whitespace as
>>> separators between terms.
>>
>> I notice that Xapian supports something called "phrase searches",
>> documented as:
>>
>>   "A phrase surrounded with double quotes ("") matches documents
>>   containing that exact phrase. Hyphenated words are also treated as
>>   phrases, as are cases such as filenames and email addresses
>>   (e.g. /etc/passwd or presid...@whitehouse.gov)."
>>
>> I assume that this particular Xapian feature is unavailable in notmuch?
>> If so, I wonder if enabling has ever been considered?
>
> It is enabled, and documented in notmuch-search-terms(7). Unfortunately
> I don't think it's related to the original request. The mention of
> hyphenated words is about the input to the query parser, not
> (necessarily) the retrieved text.

Ah, so it boils down to the Xapian definition of "exact phrase."
Notably, "exact phrase" is not "identical sequence of characters" as
some people might expect.

Quick tests with various search engines reveal their phrase search as
operating the same way.  E.g. searching for "org notmuch" finds all
sorts of results:

  org-notmuch.el
  notmuchmail.org/notmuch-emacs/
  to:devicet...@vger.kernel.org notmuch tag +inbox +unread -new
  (require 'org-notmuch nil t)
  https://notmuchmail.org/notmuch-emacs/. *
  imaps://mail.example.org/Notmuch/search

For what it is worth, I have taken to using periods as separators in the
notmuch phrase searches I use in scripts and even interactively.  Periods
avoid the confusing issues that come with nesting double quotes inside
shell quoting, and the query always remains a single shell "word."  They
are also, most often, clearly not the exact content I'm searching for, so
they make it clear that the match algorithm is inexact.  E.g.

  subject:notmuch.is.wonderful

instead of:

  subject:"notmuch is wonderful"
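
As a rough illustration of the shell "word" point, here is a minimal
sketch that uses Python's shlex module as a stand-in for a POSIX shell.
The helper name shell_words is made up for this example, and shlex only
approximates what a real shell does.

```python
import shlex

def shell_words(command_line):
    """Return the argument list a POSIX-like shell would hand to the program."""
    return shlex.split(command_line)

command_lines = [
    # Periods: no quoting needed, the query is one word.
    'notmuch search subject:notmuch.is.wonderful',
    # Bare double quotes: the shell removes them before notmuch runs.
    'notmuch search subject:"notmuch is wonderful"',
    # Single quotes around double quotes: the double quotes survive.
    "notmuch search 'subject:\"notmuch is wonderful\"'",
]

for line in command_lines:
    print(shell_words(line))
```

In the second case the double quotes never reach notmuch, which is the
kind of confusion the period form sidesteps.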
___
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch


Re: how to search for hyphenated words? (was: how to search for Morse code?)

2019-03-12 Thread Gregor Zattler
Hi David,
* David Bremner  [2019-03-12; 07:41]:
> Gregor Zattler  writes:
>
>
>> From: root@len.workgroup (Cron Daemon)
>> Subject: Cron  ~/bin/mailwiederdurchschleusen
>> To: root@localhost
>> Date: Fri, 29 Dec 2017 17:00:09 +0100
>>
>> Date: Thu, 28 Dec 2017 21:04:52 -0500
>> From: Maxim Cournoyer 
>> To: help-gnu-em...@gnu.org
>> Subject: Re: Gnus and emails sent by me
>> --
>> Date: Thu, 28 Dec 2017 22:00:56 -0400
>> From: David Bremner 
>> To: David Edmondson , notmuch@notmuchmail.org
>> Subject: Re: Xapian exception leading to database corruption
>> --
>
> The line
>
> To: David Edmondson , notmuch@notmuchmail.org
>
> contains the phrase "org notmuch". You can see this more easily by
> stripping all the punctuation.


Thanks, now I see (the light :-)

Ciao; Gregor
-- 
 -... --- .-. . -.. ..--.. ...-.-



Re: how to search for hyphenated words? (was: how to search for Morse code?)

2019-03-12 Thread Carl Worth
On Tue, Mar 12 2019, Gregor Zattler wrote:
> what I do not understand is that it doesn't matter if I search for
>
> org-notmuch
>
> or
>
> "org-notmuch"
>
> '"org-notmuch"'
>
> or even
>
> org ADJ/1 notmuch

Correct. All four of those forms give you phrase searches (so a
term "org" followed immediately by a term "notmuch").

> a typical example of a matched message is the attached one.
> Somehow the search matches the address of this very mailing list
> in the body of the email (I assume).

No, I don't think you are seeing a match on the mailing-list address
itself, (which has "notmuch" two terms before "org").

> Therefore I wonder why notmuch matches 581 messages, not 16795
> messages or 77 messages.

David showed you one example from the message you copied:

> To: David Edmondson , notmuch@notmuchmail.org

And I showed one earlier in the thread.

In each case, the message includes "org" followed (after some amount of
punctuation and whitespace, perhaps including newlines) by "notmuch".

-Carl




Re: how to search for hyphenated words? (was: how to search for Morse code?)

2019-03-12 Thread David Bremner
Gregor Zattler  writes:


> From: root@len.workgroup (Cron Daemon)
> Subject: Cron  ~/bin/mailwiederdurchschleusen
> To: root@localhost
> Date: Fri, 29 Dec 2017 17:00:09 +0100
>
> Date: Thu, 28 Dec 2017 21:04:52 -0500
> From: Maxim Cournoyer 
> To: help-gnu-em...@gnu.org
> Subject: Re: Gnus and emails sent by me
> --
> Date: Thu, 28 Dec 2017 22:00:56 -0400
> From: David Bremner 
> To: David Edmondson , notmuch@notmuchmail.org
> Subject: Re: Xapian exception leading to database corruption
> --

The line

To: David Edmondson , notmuch@notmuchmail.org

contains the phrase "org notmuch". You can see this more easily by
stripping all the punctuation.


Re: how to search for hyphenated words? (was: how to search for Morse code?)

2019-03-12 Thread Gregor Zattler
Hi David, Matt, Carl, notmuch developers,
* David Bremner  [2019-03-11; 22:13]:
> Matt Armstrong  writes:
>> Carl Worth  writes:
>>> The trick here is that when notmuch is indexing body text it feeds it
>>> into a Xapian function that parses the text by finding "terms" in the
>>> text. And this parser considers both punctuation and whitespace as
>>> separators between terms.
>>
>> I notice that Xapian supports something called "phrase searches",
>> documented as:
>>
>>   "A phrase surrounded with double quotes ("") matches documents
>>   containing that exact phrase. Hyphenated words are also treated as
>>   phrases, as are cases such as filenames and email addresses
>>   (e.g. /etc/passwd or presid...@whitehouse.gov)."
>>
>> I assume that this particular Xapian feature is unavailable in notmuch?
>> If so, I wonder if enabling has ever been considered?
>
> It is enabled, and documented in notmuch-search-terms(7). Unfortunately
> I don't think it's related to the original request. The mention of
> hyphenated words is about the input to the query parser, not
> (necessarily) the retrieved text.

what I do not understand is that it doesn't matter if I search for

org-notmuch

or

"org-notmuch"

'"org-notmuch"'

or even

org ADJ/1 notmuch

$ notmuch count --output=messages '"org-notmuch"'
581
$ notmuch count --output=messages 'org-notmuch'
581
$ notmuch count --output=messages org-notmuch
581
$ notmuch count --output=messages org ADJ/1 notmuch
581

a typical example of a matched message is the attached one.
Somehow the search matches the address of this very mailing list
in the body of the email (I assume).


But obviously there are many more emails with this address in
them:

$ notmuch count --output=messages 'notmuch@notmuchmail.org'
27396
$ notmuch count --output=messages '"notmuch@notmuchmail.org"'
27396

Or with a naive search (no decoding of possible base64 encoded
parts) there are

$ find /home/grfz/Mail/~ml/emacs-orgm...@gnu.org \
    /home/grfz/Mail/~ml/notmuch@notmuchmail.org* -type f -print0 \
  | xargs -0r grep -l -- 'notmuch@notmuchmail.org' \
  | xargs -I {} sh -c "cat {} | sed -e '1,/^$/ d' | grep -c notmuch@notmuchmail.org" \
  | egrep -c "1|2|3|4|5|6|7|8|9"
16795

emails with the address at least once in the body.


Therefore I wonder why notmuch matches 581 messages.



A naive search for org-notmuch on the files (no decoding of
possible base64 encoded parts) only shows 79 files (77 unique
emails):

mkdir -vp /tmp/test/{cur,new,tmp}

$ find /home/grfz/Mail/~ml/emacs-orgm...@gnu.org \
    /home/grfz/Mail/~ml/notmuch@notmuchmail.org* -type f -print0 \
  | xargs -0r grep -l -- 'org-notmuch' \
  | xargs ln -vs --target-directory=/tmp/kolp/cur/ | wc -l
79
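
Since the grep above does not decode base64 (or quoted-printable) parts,
a literal search that does decode them might look like the following
minimal sketch. It uses only Python's standard email package; the
function name body_contains and the sample message are made up for
illustration, and walking the maildir files is left out.

```python
import base64
from email import message_from_bytes, policy

def body_contains(raw_message, needle):
    """True if any decoded text part of the message body contains needle."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and non-text attachments
        try:
            text = part.get_content()  # undoes base64 / quoted-printable
        except (LookupError, UnicodeError, ValueError):
            continue  # unknown charset or undecodable payload
        if needle in text:
            return True
    return False

# Hypothetical message whose body is base64 encoded.
sample = (
    b"From: a@example.org\r\n"
    b"To: b@example.org\r\n"
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    + base64.b64encode(b"see org-notmuch.el for details\n")
    + b"\r\n"
)

print(b"org-notmuch" in sample)              # what a plain grep would see
print(body_contains(sample, "org-notmuch"))  # after decoding the body
```

Unlike the index search, this compares exact characters, so it would not
report "example.org notmuch" as a hit for "org-notmuch".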


Therefore I wonder why notmuch matches 581 messages, not 16795
messages or 77 messages.


Somehow these numbers do not fit!?


Ciao; Gregor
-- 
 -... --- .-. . -.. ..--.. ...-.-
--- Begin Message ---
Date: Thu, 28 Dec 2017 21:04:52 -0500
From: Maxim Cournoyer 
To: help-gnu-em...@gnu.org
Subject: Re: Gnus and emails sent by me
--
Date: Thu, 28 Dec 2017 22:00:56 -0400
From: David Bremner 
To: David Edmondson , notmuch@notmuchmail.org
Subject: Re: Xapian exception leading to database corruption
--

--- End Message ---


Re: how to search for hyphenated words? (was: how to search for Morse code?)

2019-03-11 Thread David Bremner
Matt Armstrong  writes:

> Carl Worth  writes:
>
>> Hi Gregor,
>>
>> The trick here is that when notmuch is indexing body text it feeds it
>> into a Xapian function that parses the text by finding "terms" in the
>> text. And this parser considers both punctuation and whitespace as
>> separators between terms.
>
> I notice that Xapian supports something called "phrase searches",
> documented as:
>
>   "A phrase surrounded with double quotes ("") matches documents
>   containing that exact phrase. Hyphenated words are also treated as
>   phrases, as are cases such as filenames and email addresses
>   (e.g. /etc/passwd or presid...@whitehouse.gov)."
>
> I assume that this particular Xapian feature is unavailable in notmuch?
> If so, I wonder if enabling has ever been considered?

It is enabled, and documented in notmuch-search-terms(7). Unfortunately
I don't think it's related to the original request. The mention of
hyphenated words is about the input to the query parser, not
(necessarily) the retrieved text.

d



Re: how to search for hyphenated words? (was: how to search for Morse code?)

2019-03-11 Thread David Bremner
Gregor Zattler  writes:

> Hi David, notmuch developers,
> * David Bremner  [2019-03-10; 20:22]:
>> Gregor Zattler  writes:
>>> How would one search for hyphenated words with notmuch?
>>>
>>
>> In special cases, explained in notmuch-search-terms(7), one can use
>> regexp searches, which are slower, but don't drop punctuation.
>
> thanks, this works for the subject: field, which helps a lot.
>
> Regexes do not work on the body of messages and I assume they
> will not work with the upcoming "body:" field?

That's correct.

d


Re: how to search for hyphenated words? (was: how to search for Morse code?)

2019-03-11 Thread Gregor Zattler
Hi David, notmuch developers,
* David Bremner  [2019-03-10; 20:22]:
> Gregor Zattler  writes:
>> How would one search for hyphenated words with notmuch?
>>
>
> In special cases, explained in notmuch-search-terms(7), one can use
> regexp searches, which are slower, but don't drop punctuation.

thanks, this works for the subject: field, which helps a lot.

Regexes do not work on the body of messages and I assume they
will not work with the upcoming "body:" field?


Thanks for your attention, Gregor




Re: how to search for hyphenated words? (was: how to search for Morse code?)

2019-03-10 Thread David Bremner
Gregor Zattler  writes:

>
> How would one search for hyphenated words with notmuch?
>

In special cases, explained in notmuch-search-terms(7), one can use
regexp searches, which are slower, but don't drop punctuation.
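
To see why a regexp search keeps punctuation while a term search does
not, here is a small sketch in Python. It is only a model: the terms()
helper is a simplification of real term parsing, the subject lines are
invented, and Python's re dialect may differ from the one notmuch uses.
If I read notmuch-search-terms(7) correctly, the notmuch form of the
regexp query would be subject:/org-notmuch/.

```python
import re

def terms(text):
    """Lowercased alphanumeric runs; a simplification of real term parsing."""
    return re.findall(r"[a-z0-9]+", text.lower())

def phrase_match(text, phrase):
    """True if the terms of phrase occur consecutively in text."""
    doc, want = terms(text), terms(phrase)
    n = len(want)
    return n > 0 and any(doc[i:i + n] == want
                         for i in range(len(doc) - n + 1))

def regex_match(text, pattern):
    """True if pattern matches anywhere in the raw, unparsed text."""
    return re.search(pattern, text) is not None

# Hypothetical subject lines, invented for this example.
subjects = [
    "Re: org-notmuch links broken",
    "Fwd: bug at example.org: notmuch crashes",
]

for subject in subjects:
    print(phrase_match(subject, "org-notmuch"),
          regex_match(subject, r"org-notmuch"),
          subject)
```

Both subjects pass the term-based phrase match, but only the first one
passes the regexp, because the regexp sees the hyphen.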

d


Re: how to search for hyphenated words? (was: how to search for Morse code?)

2019-03-08 Thread Carl Worth
Hi Gregor,

The trick here is that when notmuch is indexing body text it feeds it
into a Xapian function that parses the text by finding "terms" in the
text. And this parser considers both punctuation and whitespace as
separators between terms.

So your messages are not being indexed in a way to let you distinguish
between "org notmuch" and "org-notmuch".

(Of note, the query parser applies the same parsing to your query---so
that even when you think you're typing an exact phrase like
"org-notmuch" that gets parsed into separate terms "org" and "notmuch"
for searching.)
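
A toy model of that parsing, as a hedged sketch: the terms() function
below is not Xapian's parser (which has more rules than this), it simply
treats every run of non-alphanumeric characters as a separator and
lowercases the result.

```python
import re

def terms(text):
    """Split text into lowercase terms; any run of characters other than
    ASCII letters and digits counts as a separator (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())

queries = ['org-notmuch', '"org-notmuch"', 'org notmuch', 'Org.Notmuch']

for query in queries:
    print(repr(query), "->", terms(query))
```

Under this model every one of those inputs reduces to the same two
terms, which is why the indexed text and the query cannot tell them
apart.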

> all these resulted in very many hits most or all of which do not
> contain the string "org-notmuch", one found email was e.g.
>
> id:20180904105723.15564-3-da...@tethera.net

That message does contain the following:

   +test_emacs '(notmuch-tree "id:000-real-r...@example.org")
   +   (notmuch-test-wait)

Where you will notice that there's a term "org" followed (after some
punctuation and whitespace separators) by a term "notmuch".
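
The same point as a sketch with term positions, assuming a simplified
tokenizer (alphanumeric runs only) rather than Xapian's real one. The
helper names positional_index and adjacent are made up; the snippet is
the text quoted above, with the address left elided as it appears here.

```python
import re
from collections import defaultdict

def positional_index(text):
    """Map each term to the list of positions at which it occurs."""
    index = defaultdict(list)
    for pos, term in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
        index[term].append(pos)
    return index

def adjacent(index, first, second):
    """True if some occurrence of second directly follows first."""
    later = set(index.get(second, []))
    return any(pos + 1 in later for pos in index.get(first, []))

snippet = """+test_emacs '(notmuch-tree "id:000-real-r...@example.org")
+   (notmuch-test-wait)"""

index = positional_index(snippet)
print(index["org"], index["notmuch"])
print(adjacent(index, "org", "notmuch"))
print(adjacent(index, "notmuch", "org"))
```

The closing quote, parenthesis, newline, plus sign and spaces between
"org" and "notmuch" contribute no positions, so the two terms end up
next to each other.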

> How would one search for hyphenated words with notmuch?

You would need to arrange to have the indexer consider the hyphen as a
letter-like character to be made part of terms. Or be extra clever and
index something like "notmuch-test-wait" in multiple ways (such as a
single term "notmuch-test-wait" as well as three adjacent terms
"notmuch", "test", and "wait" as notmuch is doing currently).

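A rough sketch of the "index it both ways" idea, purely illustrative:
this is not how notmuch indexes today, the function name index_terms is
made up, and term positions are ignored. Each alphanumeric word becomes
a term, and each full hyphen-joined compound becomes one extra term.

```python
import re

def index_terms(text):
    """Return the set of terms for text: every alphanumeric word, plus
    every full hyphen-joined compound as a single additional term."""
    lowered = text.lower()
    words = set(re.findall(r"[a-z0-9]+", lowered))
    compounds = set(re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)+", lowered))
    return words | compounds

print(sorted(index_terms("+   (notmuch-test-wait)")))
print("org-notmuch" in index_terms("org-notmuch.el"))
print("org-notmuch" in index_terms("example.org notmuch"))
```

With the compound stored as its own term, a search for the single term
"org-notmuch" would no longer match "example.org notmuch". Partial
compounds such as "notmuch-test" are not indexed in this sketch.
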
-Carl

