storing From and Subject in xapian

2011-05-16 Thread Sebastian Spaeth
On Sat, 14 May 2011 21:37:25 -0400, Austin Clements  wrote:
> I wonder if a better approach would be to use
> notmuch_message_get_header everywhere, rather than introducing
> _notmuch_message_get_header_value, and have it simply recognize
> headers that can be retrieved directly from the database.  Then
> library callers could take advantage of this optimization and it could
> be trivially extended to other headers in the future.

+1, this is what the python bindings would prefer ;)

Sebastian


storing From and Subject in xapian

2011-05-16 Thread Istvan Marko
Austin Clements  writes:

> I wonder if a better approach would be to use
> notmuch_message_get_header everywhere, rather than introducing
> _notmuch_message_get_header_value, and have it simply recognize
> headers that can be retrieved directly from the database.  Then
> library callers could take advantage of this optimization and it could
> be trivially extended to other headers in the future.

That's a good idea, updated patch below. This version also has fallback
handling for database entries that don't have the new header value
fields.

I couldn't find a way to have the Xapian API differentiate between
undefined and blank value fields so empty subject lines are encoded as a
single space.
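
The single-space convention can be sketched as follows. This is only an illustration under stated assumptions, not the attached patch: `ValueSlots`, `SLOT_SUBJECT`, `store_header` and `load_header` are hypothetical names, and a `std::map` stands in for a Xapian document's value slots.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical in-memory stand-in for one document's value slots. As the
// message above describes, an unset slot and an empty one read back the
// same, so "never stored" needs a sentinel to differ from "stored, blank".
using ValueSlots = std::map<unsigned, std::string>;

const unsigned SLOT_SUBJECT = 3;         // illustrative slot number only
const std::string EMPTY_SENTINEL = " ";  // single space marks a blank header

// Store a header value, encoding a genuinely empty header as the sentinel.
void store_header(ValueSlots &slots, unsigned slot, const std::string &text) {
    slots[slot] = text.empty() ? EMPTY_SENTINEL : text;
}

// Returns false when the slot was never filled, in which case the caller
// should fall back to parsing the message file. Otherwise writes the decoded
// header text to 'out' and returns true.
bool load_header(const ValueSlots &slots, unsigned slot, std::string &out) {
    auto it = slots.find(slot);
    const std::string raw = (it == slots.end()) ? std::string() : it->second;
    if (raw.empty())
        return false;  // old database entry without the new value field
    out = (raw == EMPTY_SENTINEL) ? std::string() : raw;
    return true;
}
```

One consequence of this encoding is that a Subject consisting of exactly one space reads back as empty, which is probably harmless for display purposes.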

Also, the address completion discussion made me think that maybe a value
field containing To/Cc/Bcc could be added too to avoid message file
parsing for the address search case but I haven't tried implementing
that yet.

[Attachment: notmuch-value3.patch (text/x-patch, 3296 bytes)]

-- 
Istvan


storing From and Subject in xapian

2011-05-15 Thread servilio
On 12 May 2011 04:39, Istvan Marko  wrote:
> Stewart Smith  writes:
>
>> Would it be possible to progressively fill the DB with the new data?
>>
>> i.e.
>>
>> if Subject/From not in db for message
>>    add Subject/From for this message to DB.
>
> I started looking into this but then realized that notmuch search opens
> the database in read-only mode so it cannot make updates. It might be
> desirable to keep it that way for safety and locking reasons.

What about the following:

- increase NOTMUCH_DATABASE_VERSION[1]
- update notmuch_database_upgrade[2] to fill in the new data for the
documents missing it
- include an upgradedb command that wraps notmuch_database_upgrade[2]
- have notmuch search print a warning when the DB version is older than
the runtime's and suggest running upgradedb
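
The steps listed above could look roughly like this. It is a sketch against an in-memory model, not notmuch's real API: `Database`, `Document`, `CURRENT_VERSION`, `needs_upgrade` and `upgrade` are all hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory model; none of these names are notmuch's real API.
const unsigned CURRENT_VERSION = 2;  // stands in for NOTMUCH_DATABASE_VERSION

struct Document {
    std::string subject_from_file;              // what parsing the file yields
    std::map<std::string, std::string> values;  // what the database stores
};

struct Database {
    unsigned version = 1;
    bool read_only = true;
    std::vector<Document> documents;
};

// What "notmuch search" could check before running: true means the caller
// should print a warning suggesting the upgrade command.
bool needs_upgrade(const Database &db) {
    return db.version < CURRENT_VERSION;
}

// What an "upgradedb" command could wrap: fill in the missing values for
// every document, then record the new version. Returns documents updated.
int upgrade(Database &db) {
    assert(!db.read_only && "upgrade needs a writable database");
    int updated = 0;
    for (Document &doc : db.documents) {
        if (doc.values.count("subject") == 0) {
            doc.values["subject"] = doc.subject_from_file;
            ++updated;
        }
    }
    db.version = CURRENT_VERSION;
    return updated;
}
```

Keeping the write in a separate, explicit command means the search path can stay read-only and only needs the cheap version comparison.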

Regards,

Servilio

[1] http://git.notmuchmail.org/git/notmuch/blob/HEAD:/lib/database.cc#l39
[2] http://git.notmuchmail.org/git/notmuch/blob/HEAD:/lib/database.cc#l765


storing From and Subject in xapian

2011-05-14 Thread Austin Clements
I wonder if a better approach would be to use
notmuch_message_get_header everywhere, rather than introducing
_notmuch_message_get_header_value, and have it simply recognize
headers that can be retrieved directly from the database.  Then
library callers could take advantage of this optimization and it could
be trivially extended to other headers in the future.
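
A minimal sketch of that single-entry-point idea, using a simplified model rather than the notmuch library: `Message`, `stored`, `parse_file` and `get_header` here are hypothetical stand-ins, and header names are matched exactly for brevity.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Hypothetical model of a message: 'stored' simulates headers kept in the
// database, 'parse_file' simulates the slow path that opens the mail file.
struct Message {
    std::map<std::string, std::string> stored;
    std::function<std::string(const std::string &)> parse_file;
    int file_parses = 0;  // counts how often the slow path ran
};

// One public lookup: serve the header from the database when it is there,
// otherwise fall back to parsing the file. Callers never need to know which
// headers are stored, so storing more headers later changes nothing for them.
std::string get_header(Message &message, const std::string &name) {
    assert(message.parse_file && "a file parser must be supplied");
    auto it = message.stored.find(name);
    if (it != message.stored.end() && !it->second.empty())
        return it->second;
    ++message.file_parses;
    return message.parse_file(name);
}
```

With this shape, bindings and other library callers get the faster path without any change on their side.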

On Tue, May 3, 2011 at 11:40 PM, Istvan Marko  wrote:
> I have been looking at the I/O patterns of "notmuch search" with the
> default output format and noticed that it has to parse the maildir file
> of every matched message to get the From and Subject headers. I figured
> that this must be slowing things down, especially when the files are not
> in the filesystem cache.
>
> So I wanted to see how much difference would it make to have the From
> and Subject stored in xapian to avoid this parsing.
>
> With the attached patch I get a speedup of 2x with cached and almost 10x
> with uncached files for searches with many matches.
>
> The attached patch is only intended as proof of concept. I am not
> familiar with xapian so I wasn't sure if this kind of data should be
> stored as terms, values or data. I went with values simply because I saw
> that message-id and timestamp were already stored that way. Perhaps the
> data type would be more appropriate since the fields are not used for
> searching or sorting. Oh and for some reason I get blank Subject for
> about 1% of the matches.
>
>
> Is there a downside to this approach? The only one I see is that the
> xapian db size increases by about 1% but to me the speed increase would
> be well worth it.


storing From and Subject in xapian

2011-05-12 Thread Istvan Marko
Stewart Smith  writes:

> Would it be possible to progressively fill the DB with the new data?
>
> i.e.
>
> if Subject/From not in db for message
>    add Subject/From for this message to DB.

I started looking into this but then realized that notmuch search opens
the database in read-only mode so it cannot make updates. It might be
desirable to keep it that way for safety and locking reasons.
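
To make the constraint concrete, here is a sketch of progressive filling that respects a read-only handle. Everything in it is an assumption made for illustration: `Database`, `parse_subject_from_file` and `get_subject` are hypothetical and only model the behaviour discussed here.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical model, not notmuch's API: a database handle that is either
// read-only (as "notmuch search" opens it) or writable.
struct Database {
    bool read_only = true;
    std::map<std::string, std::string> subject_by_id;  // stored Subject values
    int writes = 0;
};

// Stand-in for parsing the mail file on disk (the slow path).
std::string parse_subject_from_file(const std::string &message_id) {
    return "subject of " + message_id;
}

// Progressive fill: use the stored value if present; otherwise parse the
// file and, only when the handle allows writes, store the result for next
// time. With a read-only handle the fallback still works but nothing is
// cached, so every search pays the parsing cost again.
std::string get_subject(Database &db, const std::string &message_id) {
    auto it = db.subject_by_id.find(message_id);
    if (it != db.subject_by_id.end())
        return it->second;
    std::string subject = parse_subject_from_file(message_id);
    if (!db.read_only) {
        db.subject_by_id[message_id] = subject;
        ++db.writes;
    }
    return subject;
}
```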

-- 
Istvan


storing From and Subject in xapian

2011-05-11 Thread Stewart Smith
On Sun, 08 May 2011 22:24:37 -0700, Istvan Marko  wrote:
> Jameson Graef Rollins  writes:
> 
> > Unless I hear a strong positive response I'll hold off on considering it
> > for 0.6, and suggest instead targeting it for 0.7.
> 
> I would say wait until 0.7 at least.
> 
> An important thing missing is fallback to the old method for messages
> where the Subject/From VALUE fields don't exist. Otherwise people will
> get blank results until they rebuild their database.

Would it be possible to progressively fill the DB with the new data?

i.e.

if Subject/From not in db for message
   add Subject/From for this message to DB.

?

That'd be awesome from my pov (having just rebuilt my database in chert
format and that took FOREVER).

-- 
Stewart Smith


storing From and Subject in xapian

2011-05-08 Thread Istvan Marko
Jameson Graef Rollins  writes:

> Unless I hear a strong positive response I'll hold off on considering it
> for 0.6, and suggest instead targeting it for 0.7.

I would say wait until 0.7 at least.

An important thing missing is fallback to the old method for messages
where the Subject/From VALUE fields don't exist. Otherwise people will
get blank results until they rebuild their database.

-- 
Istvan


storing From and Subject in xapian

2011-05-08 Thread Jameson Graef Rollins
On Wed, 4 May 2011 21:48:39 -0400, Austin Clements  wrote:
> This is awesome.  What was your machine configuration?

Does anyone else have any opinions about this patch?  It seems reasonable
to me (other than a couple errant comments that were left in and should
be removed).  It seems worth the slight increase in database size for
such a nice performance improvement.

Unless I hear a strong positive response I'll hold off on considering it
for 0.6, and suggest instead targeting it for 0.7.

jamie.


storing From and Subject in xapian

2011-05-05 Thread Istvan Marko
Austin Clements  writes:

> This is awesome.  What was your machine configuration?

Reasonably modern linux box, Core i5. Both the xapian db and the mail
files are on the same 7200 RPM SATA drive, ext4 filesystem.

I guess the SSD might explain why your uncached results are not as
bad as mine.

My test search matches 8800 messages grouped into 5550 threads.

With the patch, cached results go from 2.5 secs to 1.5; uncached goes
from 40 secs to 6.

Thanks for the clue on the missing subject lines, your change does
indeed fix the problem!

-- 
Istvan


storing From and Subject in xapian

2011-05-04 Thread Austin Clements
On Wed, May 4, 2011 at 9:48 PM, Austin Clements  wrote:
> As another data point, with a probably very different configuration (8
> year old P4, new SSD), my test query was 1.9X faster uncached and 1.6X
> faster cached.  It also produced 60% fewer disk reads.  I saw the same
> 1% increase in database size.

Oops, the email was on an SSD, but the database was on a separate
spinning disk.  With them both on the SSD, it's 2.1X faster uncached.


storing From and Subject in xapian

2011-05-04 Thread Austin Clements
This is awesome.  What was your machine configuration?

As another data point, with a probably very different configuration (8
year old P4, new SSD), my test query was 1.9X faster uncached and 1.6X
faster cached.  It also produced 60% fewer disk reads.  I saw the same
1% increase in database size.

BTW, the reason you're missing some of the subjects is that the char*
returned from _notmuch_message_get_header_value goes out of scope as
soon as that function returns.  A simple fix is to replace
    return value.c_str();
with
    return talloc_strdup (message, value.c_str ());
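
The lifetime problem can be reproduced without Xapian or talloc. In the sketch below, which is an illustration and not the patch itself, a hypothetical `Message` struct owns the copied strings and so plays the role of the talloc context `message`; the `owned` vector is a swapped-in substitute for `talloc_strdup`, and all names are assumptions.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a message object that owns its strings, playing
// the role that the talloc context 'message' plays in the suggested fix.
struct Message {
    std::map<unsigned, std::string> slots;            // simulated value slots
    std::vector<std::unique_ptr<std::string>> owned;  // copies kept alive here
};

// Broken pattern from the thread, left as a comment on purpose:
//     std::string value = ...;   // local temporary
//     return value.c_str();      // buffer dies with 'value' at function exit

// Fixed pattern: copy the value into storage owned by the message, so the
// returned pointer stays valid for as long as the message does.
const char *get_header_value(Message &message, unsigned slot) {
    auto it = message.slots.find(slot);
    std::string value = (it != message.slots.end()) ? it->second : "";
    message.owned.push_back(std::make_unique<std::string>(value));
    return message.owned.back()->c_str();
}
```

Because reading a dangling pointer is undefined behaviour, it often still appears to work, which would be consistent with only about 1% of subjects coming back blank.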

Values are probably the right place to store this information (though
I've never been completely clear on the difference between document
data and values).  Terms would be indexed, which is both unnecessary
(unless there's a reason to do *exact* matches on from and subject?)
and would result in more database expansion.

On Tue, May 3, 2011 at 11:40 PM, Istvan Marko  wrote:
>
> I have been looking at the I/O patterns of "notmuch search" with the
> default output format and noticed that it has to parse the maildir file
> of every matched message to get the From and Subject headers. I figured
> that this must be slowing things down, especially when the files are not
> in the filesystem cache.
>
> So I wanted to see how much difference would it make to have the From
> and Subject stored in xapian to avoid this parsing.
>
> With the attached patch I get a speedup of 2x with cached and almost 10x
> with uncached files for searches with many matches.
>
> The attached patch is only intended as proof of concept. I am not
> familiar with xapian so I wasn't sure if this kind of data should be
> stored as terms, values or data. I went with values simply because I saw
> that message-id and timestamp were already stored that way. Perhaps the
> data type would be more appropriate since the fields are not used for
> searching or sorting. Oh and for some reason I get blank Subject for
> about 1% of the matches.
>
>
> Is there a downside to this approach? The only one I see is that the
> xapian db size increases by about 1% but to me the speed increase would
> be well worth it.


storing From and Subject in xapian

2011-05-03 Thread Istvan Marko

I have been looking at the I/O patterns of "notmuch search" with the
default output format and noticed that it has to parse the maildir file
of every matched message to get the From and Subject headers. I figured
that this must be slowing things down, especially when the files are not
in the filesystem cache.

So I wanted to see how much difference would it make to have the From
and Subject stored in xapian to avoid this parsing. 

With the attached patch I get a speedup of 2x with cached and almost 10x
with uncached files for searches with many matches.

The attached patch is only intended as proof of concept. I am not
familiar with xapian so I wasn't sure if this kind of data should be
stored as terms, values or data. I went with values simply because I saw
that message-id and timestamp were already stored that way. Perhaps the
data type would be more appropriate since the fields are not used for
searching or sorting. Oh and for some reason I get blank Subject for
about 1% of the matches.


Is there a downside to this approach? The only one I see is that the
xapian db size increases by about 1% but to me the speed increase would
be well worth it.


[Attachment: notmuch-xapian-headers.patch (text/x-patch, 4003 bytes)]

-- 
Istvan