[notmuch] 25 minutes load time with emacs -f notmuch

2009-11-23 Thread Carl Worth
On Sun, 22 Nov 2009 10:15:39 -0500, Brett Viren wrote:
> On Sun, Nov 22, 2009 at 3:36 AM, Mike Hommey wrote:
> But, here is one that looks I/O bound:
> 
>  notmuch tag -unread tag:inbox
> 
> I have my home directory on an encfs volume and I see it and notmuch
> competing for CPU when viewing "top".

Yes. The "notmuch tag" command currently does much more IO than it
really should.

This is Xapian bug 250. Please see:

id:874oon4pgv.fsf@yoom.home.cworth.org

for some details and thoughts on the bug from me and some pointers on
how one could go about fixing it.

-Carl


[notmuch] 25 minutes load time with emacs -f notmuch

2009-11-22 Thread Brett Viren
On Sun, Nov 22, 2009 at 3:36 AM, Mike Hommey wrote:

> A good test, if you have enough memory, would be to put your mailbox in
> a tmpfs, and see how fast that imports.

(Oops, forgot to reply to the list.)

I don't see any function calls related to I/O on the call graph.

But, here is one that looks I/O bound:

 notmuch tag -unread tag:inbox

I have my home directory on an encfs volume and I see it and notmuch
competing for CPU when viewing "top".

-Brett.


[notmuch] 25 minutes load time with emacs -f notmuch

2009-11-22 Thread Mike Hommey
On Sat, Nov 21, 2009 at 05:36:18PM -0500, Brett Viren wrote:
> On Sat, Nov 21, 2009 at 12:07 PM, Carl Worth wrote:
> 
> > Though, frankly, I think we need to fix "notmuch new" to do much better
> > than 40 files/sec.
> 
> Just a "me too".
> 
> Processed 130871 total files in 38m 7s (57 files/sec.).
> Added 102723 new messages to the database (not much, really).
> 
> This was ~2GB of mail on a 2.5GHz CPU.  That seems pretty reasonable
> to me but I'd like to rerun the "notmuch new" under google perftools
> to see if there are any obvious bottlenecks that might be cleaned up.

FWIW, my 90k+ messages mailbox was imported at a pace of 130 files/sec,
and my CPU is "only" 2.2GHz, but I have an SSD. A good share of the
bottlenecks is "simply" I/O. Don't forget having a lot of small files
sucks I/O wise, as files are most likely spread all over the disk.
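
For comparing the rates quoted in this thread, here is a small Python
sketch of the arithmetic. The function name is made up for illustration;
the figures are the ones reported above (130871 files in 38m 7s, and
130 files/sec on the SSD).

```python
def files_per_sec(total_files: int, minutes: int, seconds: int) -> float:
    """Average import rate of a "notmuch new" run."""
    return total_files / (minutes * 60 + seconds)


# Brett's run: 130871 files in 38m 7s.
brett_rate = files_per_sec(130871, 38, 7)

# Minutes the same mailbox would need at the 130 files/sec seen on the SSD.
ssd_minutes = 130871 / 130 / 60

print(f"{brett_rate:.1f} files/sec; about {ssd_minutes:.1f} min at 130 files/sec")
```

This only compares averages; it says nothing about where the time goes.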

A good test, if you have enough memory, would be to put your mailbox in
a tmpfs, and see how fast that imports.

Mike


[notmuch] 25 minutes load time with emacs -f notmuch

2009-11-22 Thread Carl Worth
On Sat, 21 Nov 2009 17:36:18 -0500, Brett Viren wrote:
> Processed 130871 total files in 38m 7s (57 files/sec.).
> Added 102723 new messages to the database (not much, really).

Just be glad that you have so little mail. ;-)

> This was ~2GB of mail on a 2.5GHz CPU.  That seems pretty reasonable
> to me but I'd like to rerun the "notmuch new" under google perftools
> to see if there are any obvious bottlenecks that might be cleaned up.

To me, here are the obvious things to fix after looking at a profile:

  1. We're spending a *lot* of time searching in the Xapian database.

But our initial indexing operation should only be *writing* data into
the database, so what's this searching about?

Well, at each new message, we're looking up the ID from its In-Reply-To
header to find a thread-ID to link to, and then we're looking up all of
the IDs from its References header to find thread IDs that need to be
merged with ours. So both parent and child lookups.

And since those are taking a bunch of time, I think it might make sense
to just keep a hashtable mapping message-ID -> thread-ID and do lookups
in that, (should have plenty of memory on current machines even with
lots of mail).
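
A minimal sketch of that idea in Python, just to show the shape of it
(notmuch itself is C, and the class and method names here are
hypothetical, not anything in the notmuch source):

```python
from typing import Dict, Iterable, List, Optional


class ThreadLookup:
    """In-memory message-ID -> thread-ID map, standing in for Xapian searches."""

    def __init__(self) -> None:
        self._thread_of: Dict[str, str] = {}

    def record(self, message_id: str, thread_id: str) -> None:
        """Remember which thread a message was filed under."""
        self._thread_of[message_id] = thread_id

    def related_threads(self, in_reply_to: Optional[str],
                        references: Iterable[str]) -> List[str]:
        """Thread IDs of already-seen relatives, in first-seen order, no duplicates."""
        candidates = ([in_reply_to] if in_reply_to else []) + list(references)
        found: List[str] = []
        for message_id in candidates:
            thread_id = self._thread_of.get(message_id)
            if thread_id is not None and thread_id not in found:
                found.append(thread_id)
        return found
```

If related_threads() returns more than one ID, those are the threads
that would need merging; an empty list means a new thread ID is needed.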

  2. We're hitting the slow Xapian document updates for thread-ID
  merging.

Whenever we find a child that was already in the database with one
thread ID that should have ours, we simply want to set its thread ID to
ours. But as we've talked about recently, Xapian has a bug (defect 250)
that makes it much more expensive than it should be to update a single
term.

So, we could do a first pass over the messages to find all their thread
IDs and get them to settle down before doing any indexing in a separate,
second pass.

Step (2) should help even if we don't do step (1), but clearly we can do
both.
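
Here is a rough Python sketch of the two-pass idea, assuming each
message has already been reduced to its message ID plus the IDs it
references. It uses a union-find structure for the first pass, which is
my choice for the illustration and not something notmuch does today;
resolve_threads, index_all and add_to_index are hypothetical names.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Message = Tuple[str, List[str]]  # (message_id, referenced message IDs)


def resolve_threads(messages: Iterable[Message]) -> Dict[str, str]:
    """Pass 1: group message IDs into threads without touching the database."""
    parent: Dict[str, str] = {}

    def find(message_id: str) -> str:
        parent.setdefault(message_id, message_id)
        while parent[message_id] != message_id:
            parent[message_id] = parent[parent[message_id]]  # path halving
            message_id = parent[message_id]
        return message_id

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    messages = list(messages)
    for message_id, refs in messages:
        find(message_id)
        for ref in refs:
            union(message_id, ref)

    # The thread ID is the representative message ID of each set.
    return {message_id: find(message_id) for message_id, _ in messages}


def index_all(messages: Iterable[Message],
              add_to_index: Callable[[str, str], None]) -> None:
    """Pass 2: write each message once, with its final thread ID."""
    messages = list(messages)
    threads = resolve_threads(messages)
    for message_id, _ in messages:
        add_to_index(message_id, threads[message_id])
```

Since every thread ID has settled before anything is written, no
document needs to be updated after the fact.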

It would be great if anyone wants to take a look at either or both of
these, otherwise I will when I can.

-Carl


[notmuch] 25 minutes load time with emacs -f notmuch

2009-11-21 Thread Carl Worth
On Sat, 21 Nov 2009 20:36:06 +0100, Stefan Schmidt wrote:
> Yup, I had the repo on my disk a week before Keith blogged about it. Just nice
> that it was going crazy that fast and people start using it and contributing to
> it.

Yes, it's quite fun.

> > Though, frankly, I think we need to fix "notmuch new" to do much better
> > than 40 files/sec.
> 
> As a sidenote. That one is on a notebook with a slow 5400 disk and crypt + lvm +
> ext3 on top. Perhaps I should put some money back for an X25 SSD. ;)

Sure. But I think we can still do a lot better even on your machine. :-)

> I have to admit it took me some time. Something like below should help?

Thanks so much! I committed this, (and then added a bit more
documentation on top of it).

> I think that's what I will try to get working here. Sounds the nearest solution
> to my problem. That in combination with the just merged tags-based-on-folders
> patch should make me a lot happier. :)

Well, do note that I just reverted that patch too. :-/

So you might want to cherry-pick it back (or even add the configuration
option that will let us push it back out again).

-Carl


[notmuch] 25 minutes load time with emacs -f notmuch

2009-11-21 Thread Stefan Schmidt
Hello.

On Sat, 2009-11-21 at 18:07, Carl Worth wrote:
> On Sat, 21 Nov 2009 15:51:11 +0100, Stefan Schmidt wrote:
> > Disclaimer: I'm using vim, in combination with mutt for email, for years, but
> > never dealt with emacs. Please have this in mind and spot any emacs user errors
> > in this report. :)
> 
> Hi Stefan, welcome to Notmuch! And don't worry, we don't discriminate
> (too much) against non-emacs users around here.

:)

> > I have first seen notmuch several weeks ago as it seems a silent project. Being
> > more than happy now that it evolves quickly and a real developer community
> > builds around it.
> 
> Yes. Notmuch was a silent project since it was just something that I was
> doing for myself. I was always writing it as free software, and even had
> a public git repository available, but hadn't advertised it at all yet.

Yup, I had the repo on my disk a week before Keith blogged about it. Just nice
that it was going crazy that fast and people start using it and contributing to
it.

> > But now to my problem. Getting my mail indexed was easy enough:
> >
> > stefan@excalibur:~$ du -chs not-much-mail/
> > 1.5G	not-much-mail/
> > 1.5G	total
> > stefan@excalibur:~$ time notmuch new
> > Found 103677 total files.
> > Processed 103677 total files in 42m 30s (40 files/sec.).
> > Added 100899 new messages to the database (not much, really).
> 
> Good. I'm glad that went fairly smoothly for you.
> 
> Though, frankly, I think we need to fix "notmuch new" to do much better
> than 40 files/sec.

As a sidenote. That one is on a notebook with a slow 5400 disk and crypt + lvm +
ext3 on top. Perhaps I should put some money back for an X25 SSD. ;)

> > I put (require 'notmuch) in my ~/.emacs and start emacs with the -f notmuch
> > option to enter the notmuch mode.
> 
> I'm glad you've figured that much out. I feel bad that that's not even
> in the documentation anywhere yet.

I have to admit it took me some time. Something like below should help?

> > What happens then is that a notmuch process gets started and emacs
> > waits for the return.
> 
> OK. This is a known shortcoming. As Bdale supposes, this problem is from
> notmuch trying to load and construct every thread in your
> database. There are actually several different bugs/missing features
> here that should be addressed:
> 
>   * "notmuch new" should look at the R flag in maildir files to
> determine that they are read and do not need to be marked as "inbox"
> and "unread"

I think that's what I will try to get working here. Sounds the nearest solution
to my problem. That in combination with the just merged tags-based-on-folders
patch should make me a lot happier. :)


From 8f95e039e98addd0f4be7c31e41e534f1b519a5d Mon Sep 17 00:00:00 2001
From: Stefan Schmidt 
Date: Sat, 21 Nov 2009 20:31:55 +0100
Subject: [PATCH] INSTALL: emacs install documentation.

Write down the steps needed to install and actually use notmuch in emacs. Should
help emacs newbies.

Signed-off-by: Stefan Schmidt 
---
 INSTALL |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/INSTALL b/INSTALL
index de268b6..64b8e36 100644
--- a/INSTALL
+++ b/INSTALL
@@ -14,6 +14,14 @@ Notmuch are satisfied. If they are not, the configure script will
 notice that and provide instructions on where to obtain the necessary
 dependencies.

+notmuch.el installation
+---
+Installing the notmuch.el emacs lisp function systemwide:
+
+   sudo make install-emacs
+
+Each user needs to add (require 'notmuch) in his ~/.emacs to activate it.
+
 Dependencies
 
 Notmuch depends on three libraries: Xapian, GMime 2.4, and Talloc
-- 
1.6.5.3

regards
Stefan Schmidt


[notmuch] 25 minutes load time with emacs -f notmuch

2009-11-21 Thread Stefan Schmidt
Hello.

On Sat, 2009-11-21 at 18:26, Carl Worth wrote:
> On Sat, 21 Nov 2009 16:36:55 +0100, Stefan Schmidt wrote:
> 
> > In my case only 80 messages were printed before the gap. All of them had a wrong
> > year in the timestamp. 1900 and 1970. Maybe notmuch just comes into a bad state
> > with these dates?
> 
> I don't think the bogus dates are throwing anything off. It's more
> likely that you just have a number of messages with no Date header on
> them at all. And for such messages, notmuch just chooses a time_t value
> of 0 so you'll see whatever that 0 maps to on your system---a date of
> 1970 there is not surprising. :-)

Yeah, I found that removing the offending messages and re-running it changed
nothing. Time to look at the source. :)

regards
Stefan Schmidt


[notmuch] 25 minutes load time with emacs -f notmuch

2009-11-21 Thread Stefan Schmidt
Hello.

On Sat, 2009-11-21 at 18:16, Carl Worth wrote:
> On Sat, 21 Nov 2009 08:12:52 -0700, Bdale Garbee wrote:
> > I haven't figured out how to quickly tag everything as already read or
> > archived or whatever .. can someone who knows more about what's going on
> > confirm my hypothesis and if so, suggest the best approach to getting to
> > a happier state?
> 
> See my message up-thread. The only reasonable ways all really do involve
> at least a little bit of C-code hacking to either prevent those tags
> from getting put there by "notmuch new" or to make it easier to get them
> off afterwards.

Let's see if I come up with something here.

> And I can't help but apologize. I've known about all these issues, and
> wouldn't have invited people to try things out in the current state. But
> it was nice of Keith to share this with everyone. And it's nice of all
> you to come take a look at things.

Getting it out now was a good move. It had enough code to actually do something
useful and many people waited for something like this. The increasing number of
contributors in such a short time shows it very well. :)

regards
Stefan Schmidt


[notmuch] 25 minutes load time with emacs -f notmuch

2009-11-21 Thread Carl Worth
On Sat, 21 Nov 2009 16:36:55 +0100, Stefan Schmidt wrote:
> I executed "/usr/local/bin/notmuch search --sort=oldest-first tag:inbox" by hand
> and from the 21 minutes it took it stayed around 20 in a state where no new
> messages were printed and then suddenly all the rest comes up.

That's actually the expected behavior currently.

It used to be that "notmuch search" on the command line wouldn't present
any results until everything was available.

I recently threw in a hack to present the first 100 thread results
quickly and only then does it sit and spin before all the results are
available. I suppose it wouldn't be any harder for it to keep returning
chunks of 100 threads at a time, (though this will slow down the final
result a bit---perhaps not significantly).

And I wouldn't really mind any slowdown there anyway, since any *real*
interface should be calling "notmuch search" in small chunks anyway.

So I'll go ahead and do that.

> In my case only 80 messages were printed before the gap. All of them had a wrong
> year in the timestamp. 1900 and 1970. Maybe notmuch just comes into a bad state
> with these dates?

I don't think the bogus dates are throwing anything off. It's more
likely that you just have a number of messages with no Date header on
them at all. And for such messages, notmuch just chooses a time_t value
of 0 so you'll see whatever that 0 maps to on your system---a date of
1970 there is not surprising. :-)
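
To make the time_t point concrete, here is what 0 maps to, shown in
Python rather than C purely for brevity:

```python
from datetime import datetime, timezone

# A message without a Date header is stored with time_t 0: the Unix epoch.
epoch = datetime.fromtimestamp(0, tz=timezone.utc)
print(epoch.isoformat())  # 1970-01-01T00:00:00+00:00
```

Displayed in a local timezone west of UTC, the same value shows up as
the last day of 1969 instead.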

> I will remove these mails and re-generate the notmuch index to test this out
> after dinner later today.

See my other mail. You may want to tweak the behavior of "notmuch new"
before running it again. (I would not expect the results to be any
different from running it again with no change.)

-Carl


[notmuch] 25 minutes load time with emacs -f notmuch

2009-11-21 Thread Carl Worth
On Sat, 21 Nov 2009 08:12:52 -0700, Bdale Garbee wrote:
> I haven't figured out how to quickly tag everything as already read or
> archived or whatever .. can someone who knows more about what's going on
> confirm my hypothesis and if so, suggest the best approach to getting to
> a happier state?

See my message up-thread. The only reasonable ways all really do involve
at least a little bit of C-code hacking to either prevent those tags
from getting put there by "notmuch new" or to make it easier to get them
off afterwards.

I'm hoping everyone with this problem will happen to choose a different
solution and we'll get a nice flood of patches to improve things. :-)

And I can't help but apologize. I've known about all these issues, and
wouldn't have invited people to try things out in the current state. But
it was nice of Keith to share this with everyone. And it's nice of all
you to come take a look at things.

So, I'll just ask for a little patience, and we'll hopefully have a nice
system soon.

-Carl


[notmuch] 25 minutes load time with emacs -f notmuch

2009-11-21 Thread Carl Worth
On Sat, 21 Nov 2009 15:51:11 +0100, Stefan Schmidt wrote:
> Disclaimer: I'm using vim, in combination with mutt for email, for years, but
> never dealt with emacs. Please have this in mind and spot any emacs user errors
> in this report. :)

Hi Stefan, welcome to Notmuch! And don't worry, we don't discriminate
(too much) against non-emacs users around here.

> I have first seen notmuch several weeks ago as it seems a silent project. Being
> more than happy now that it evolves quickly and a real developer community
> builds around it.

Yes. Notmuch was a silent project since it was just something that I was
doing for myself. I was always writing it as free software, and even had
a public git repository available, but hadn't advertised it at all yet.

And Keith did rather catch me off guard by announcing it. But I can't
complain as we have gotten a nice community started already, and it's
great to have other people writing the code that I intended to
write. :-)

But it's also true that some obvious problems just aren't taken care of
yet.

> But now to my problem. Getting my mail indexed was easy enough:
> 
> stefan@excalibur:~$ du -chs not-much-mail/
> 1.5G	not-much-mail/
> 1.5G	total
> stefan@excalibur:~$ time notmuch new
> Found 103677 total files.
> Processed 103677 total files in 42m 30s (40 files/sec.).
> Added 100899 new messages to the database (not much, really).

Good. I'm glad that went fairly smoothly for you.

Though, frankly, I think we need to fix "notmuch new" to do much better
than 40 files/sec. One plan I have for this is to not use the database
to search for message IDs when adding many messages---but to instead
just use a hash-table (seeded from any messages already in the
database). This would allow us to do all thread resolution before
indexing messages, without having to do the N different searches, and
also means we'd avoid continually rewriting documents when merging
thread IDs.

> I put (require 'notmuch) in my ~/.emacs and start emacs with the -f notmuch
> option to enter the notmuch mode.

I'm glad you've figured that much out. I feel bad that that's not even
in the documentation anywhere yet.

> What happens then is that a notmuch process gets started and emacs
> waits for the return.

OK. This is a known shortcoming. As Bdale supposes, this problem is from
notmuch trying to load and construct every thread in your
database. There are actually several different bugs/missing features
here that should be addressed:

  * "notmuch new" should look at the R flag in maildir files to
determine that they are read and do not need to be marked as "inbox"
and "unread"

  * "notmuch setup" should prompt for some date range, ("last 2 months"
by default?) before which no messages will be considered unread.
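
As a sketch of the first item, assuming the usual maildir filename
convention where flags follow a ":2," suffix. Note that in that
convention "S" means seen and "R" means replied, so this sketch treats
either one as evidence that the message has been read. The function
names and the tagging policy are hypothetical, and this is Python
rather than the C that "notmuch new" would need.

```python
import os
from typing import Set


def maildir_flags(path: str) -> Set[str]:
    """Return the single-letter flags from a maildir filename's ":2," suffix."""
    name = os.path.basename(path)
    _, sep, info = name.rpartition(":2,")
    return set(info) if sep else set()


def initial_tags(path: str) -> Set[str]:
    """Hypothetical policy: seen or replied messages get no inbox/unread tags."""
    if maildir_flags(path) & {"S", "R"}:
        return set()
    return {"inbox", "unread"}
```

Files still sitting in new/ normally have no ":2," suffix at all, so
they keep both tags under this policy.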

Either of those two fixes would have prevented your particular
problem. But it's still easy to generate searches that return large
numbers of results. So there's some more to do:

  * The emacs code needs to call "notmuch search" with the --first and
--max-threads options to get a limited set of results, (one or two
screenfuls). You should be able to test this at the command line and
see that it returns results quickly. Then, of course, we'd like the
emacs code to fill in subsequent screenfuls as you page.
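
The paging logic could look roughly like the following, written in
Python instead of elisp just to show the loop. The "--first=N" and
"--max-threads=N" spellings and the one-line-per-thread output are
assumptions on my part; search_argv and search_pages are hypothetical
helper names.

```python
import subprocess
from typing import Callable, Iterator, List, Optional


def search_argv(query: str, first: int, max_threads: int) -> List[str]:
    """Build the command line for one page of results."""
    return ["notmuch", "search", f"--first={first}",
            f"--max-threads={max_threads}", query]


def search_pages(query: str, page_size: int = 50,
                 run: Optional[Callable[[List[str]], str]] = None
                 ) -> Iterator[List[str]]:
    """Yield successive pages of result lines until a short page comes back."""
    if run is None:
        run = lambda argv: subprocess.run(
            argv, capture_output=True, text=True, check=True).stdout
    first = 0
    while True:
        lines = run(search_argv(query, first, page_size)).splitlines()
        if lines:
            yield lines
        if len(lines) < page_size:
            return
        first += page_size
```

A caller would take only the first page for the initial screen and ask
the generator for the next one when the user pages down.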

But none of that helps you right now. What you need is to retroactively
remove all of the "inbox" and "unread" tags from messages older than
some time period. So then there's another missing feature:

  * We need to support date-range-based searches. If we had that you
could just do:

notmuch tag -inbox -unread until:"2 months ago"

But we don't quite have this yet. Xapian does have support for a
slightly less convenient date range specification:

1970-01-01..2009-09-21

but it turns out that we can't even use that just yet, since to make
that work we would have to have dates saved as YYYYMMDD strings for
each message, (where instead we have time_t values stored serialized
into a string that will sort correctly). So we need a new
ValueRangeProcessor class to map to timestamps, and then we'll need
some fancy parsing to do things like "2 months ago".
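
To illustrate the mapping such a class would have to do, here is a
Python sketch that turns a date range like the one above into a pair of
time_t values, plus one way to serialize a timestamp so that byte order
matches numeric order. Both function names are hypothetical, the dates
are assumed to be UTC, and the packing shown is not Xapian's own
serialization.

```python
import calendar
import struct
from datetime import datetime, timedelta
from typing import Tuple


def parse_range(spec: str) -> Tuple[int, int]:
    """Map "YYYY-MM-DD..YYYY-MM-DD" to inclusive (start, end) UTC time_t values."""
    begin, end = spec.split("..")
    start = datetime.strptime(begin, "%Y-%m-%d")
    stop = datetime.strptime(end, "%Y-%m-%d") + timedelta(days=1)
    return (calendar.timegm(start.timetuple()),
            calendar.timegm(stop.timetuple()) - 1)


def sortable(timestamp: int) -> bytes:
    """Fixed-width big-endian bytes; compares like the number for timestamp >= 0."""
    return struct.pack(">Q", timestamp)
```

The end of the range is the last second of the end date, so a message
sent on 2009-09-21 at any time of day still matches.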

So, what's the best thing to do today if you want to start playing with
notmuch? I think you could pick one of the above to work on, (a quick
hack to "notmuch new" and a re-import might do the trick). Or you might
just remove the inbox and unread tags from all messages and then just
let messages that are actually *new* in the future get tagged into the
inbox by "notmuch new". Oh, but then there's another missing feature:

  * We need a syntax to specify a search string that should match all
messages. Then you could do:

notmuch tag -inbox -unread 

Yikes! So many bugs and missing features. How is anyone actually using
this system? Well, Keith and I were able to get past all this by simply
doing a "notmuch restore" based on tags we 

[notmuch] 25 minutes load time with emacs -f notmuch

2009-11-21 Thread Brett Viren
On Sat, Nov 21, 2009 at 12:07 PM, Carl Worth wrote:

> Though, frankly, I think we need to fix "notmuch new" to do much better
> than 40 files/sec.

Just a "me too".

Processed 130871 total files in 38m 7s (57 files/sec.).
Added 102723 new messages to the database (not much, really).

This was ~2GB of mail on a 2.5GHz CPU.  That seems pretty reasonable
to me but I'd like to rerun the "notmuch new" under google perftools
to see if there are any obvious bottlenecks that might be cleaned up.

How can I purge the index?  I can't locate it.

-Brett.


[notmuch] 25 minutes load time with emacs -f notmuch

2009-11-21 Thread Stefan Schmidt
Hello.

On Sat, 2009-11-21 at 08:12, Bdale Garbee wrote:
> On Sat, 2009-11-21 at 15:51 +0100, Stefan Schmidt wrote:
> 
> > Sadly that takes around 25 minutes here on an Intel Core2Duo notebook (Thinkpad
> > X200s). I tried this several times now. CPU load was low (~10%) during this time
> > so it is mostly IO bound.
> 
> I see the same behavior on my notebook.  
> 
> I gather from talking to keithp that things like the 'state of already
> being read' aren't being picked up from the file names in the local
> Maildir yet.  Thus I suspect it's a fairly unusual / worst case scenario
> trying to start up with 178k (in my case) supposedly-unread messages
> tagged inbox.

Using the read flag during notmuch new would indeed be nice. But some further
testing brings some doubts that it is an overload due to too many unread
messages.

I executed "/usr/local/bin/notmuch search --sort=oldest-first tag:inbox" by hand
and from the 21 minutes it took it stayed around 20 in a state where no new
messages were printed and then suddenly all the rest comes up.

In my case only 80 messages were printed before the gap. All of them had a wrong
year in the timestamp. 1900 and 1970. Maybe notmuch just comes into a bad state
with these dates?

Bdale, can you confirm this for your case?

I will remove these mails and re-generate the notmuch index to test this out
after dinner later today.

regards
Stefan Schmidt


[notmuch] 25 minutes load time with emacs -f notmuch

2009-11-21 Thread Stefan Schmidt
Hello.

Disclaimer: I'm using vim, in combination with mutt for email, for years, but
never dealt with emacs. Please have this in mind and spot any emacs user errors
in this report. :)

I have first seen notmuch several weeks ago as it seems a silent project. Being
more than happy now that it evolves quickly and a real developer community
builds around it.

But now to my problem. Getting my mail indexed was easy enough:

stefan@excalibur:~$ du -chs not-much-mail/
1.5G	not-much-mail/
1.5G	total
stefan@excalibur:~$ time notmuch new
Found 103677 total files.
Processed 103677 total files in 42m 30s (40 files/sec.).
Added 100899 new messages to the database (not much, really).

Tip: If you have any sub-directories that are archives (that is,
they will never receive new mail), marking these directories as
read-only (chmod u-w /path/to/dir) will make "notmuch new"
much more efficient (it won't even look in those directories).

real	43m0.943s
user	22m46.513s
sys	0m39.418s
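
As a rough reading of those timings, the share of wall-clock time the
import spent on the CPU can be worked out like this (a Python sketch;
to_seconds is just a helper for the illustration):

```python
def to_seconds(minutes: int, seconds: float) -> float:
    """Convert the XmY.YYYs format printed by "time" to seconds."""
    return minutes * 60 + seconds


real = to_seconds(43, 0.943)
cpu = to_seconds(22, 46.513) + to_seconds(0, 39.418)  # user + sys
cpu_share = cpu / real

print(f"CPU busy for {cpu_share:.0%} of the wall-clock time")
```

That comes to a bit more than half, so the rest of the "notmuch new"
run was presumably spent waiting on the disk.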


I put (require 'notmuch) in my ~/.emacs and start emacs with the -f notmuch
option to enter the notmuch mode. What happens then is that a notmuch process
gets started and emacs waits for the return.

23649 pts/1    SN+    0:00  |   \_ emacs -f notmuch
23651 ?        RNs    0:03  |       \_ /usr/local/bin/notmuch search --sort=oldest-first tag:inbox

Sadly that takes around 25 minutes here on an Intel Core2Duo notebook (Thinkpad
X200s). I tried this several times now. CPU load was low (~10%) during this time
so it is mostly IO bound.

I checked that I don't have any big files like mutt header caches left and all
my mail is stored in maildir format directly from offlineimap. I'm more than
happy to test any patches on this issue or do some debugging myself if I get
some hints where to look.

regards
Stefan Schmidt


[notmuch] 25 minutes load time with emacs -f notmuch

2009-11-21 Thread Bdale Garbee
On Sat, 2009-11-21 at 15:51 +0100, Stefan Schmidt wrote:

> Sadly that takes around 25 minutes here on an Intel Core2Duo notebook (Thinkpad
> X200s). I tried this several times now. CPU load was low (~10%) during this time
> so it is mostly IO bound.

I see the same behavior on my notebook.  

I gather from talking to keithp that things like the 'state of already
being read' aren't being picked up from the file names in the local
Maildir yet.  Thus I suspect it's a fairly unusual / worst case scenario
trying to start up with 178k (in my case) supposedly-unread messages
tagged inbox.

I haven't figured out how to quickly tag everything as already read or
archived or whatever .. can someone who knows more about what's going on
confirm my hypothesis and if so, suggest the best approach to getting to
a happier state?

Bdale