[notmuch] notmuch new: Memory problem (with uuencoded content)

2010-02-08 Thread Michal Sojka
On Saturday 06 of February 2010 22:45:32 Carl Worth wrote:
> On Sat, 6 Feb 2010 11:40:18 +0100, Michal Sojka wrote:
> > It is straightforward to convert your current test script to Git's
> > framework. If you are interested I'll do it.
> 
> Yes, I'd be quite interested in seeing that. Thanks for your
> contributions, and sorry I missed (or haven't yet gotten to) the patch
> you sent earlier.

Hi Carl,

I did the conversion of the test script. I'll post it to thread
id:87ljf8pvxx.fsf@yoom.home.cworth.org, where it is more appropriate.

Michal


[notmuch] notmuch new: Memory problem (with uuencoded content)

2010-02-06 Thread Carl Worth
On Sat, 6 Feb 2010 11:40:18 +0100, Michal Sojka  wrote:
> I've just looked at your notmuch-test commits. Did you notice my patches
> which port Git's test framework for use with notmuch?

Hi Michal,

Ah, my mistake!

That's what I get for working through my backlog chronologically. ;-)

> That framework has the 
> same spirit as yours (shell scripting, easy to use) but compared to your 
> current test script it has some nice features:

All of these features do sound very nice.

> It is straightforward to convert your current test script to Git's framework. 
> If you are interested I'll do it.

Yes, I'd be quite interested in seeing that. Thanks for your
contributions, and sorry I missed (or haven't yet gotten to) the patch
you sent earlier.

-Carl



[notmuch] notmuch new: Memory problem (with uuencoded content)

2010-02-06 Thread Michal Sojka
On Friday 05 of February 2010 19:59:12 Carl Worth wrote:
> Of course, I also pushed a set of tests to the test suite for this, (and
> some new "notmuch search" tests while I was at it).

Hi Carl,

I've just looked at your notmuch-test commits. Did you notice my patches
which port Git's test framework for use with notmuch? That framework has the
same spirit as yours (shell scripting, easy to use) but compared to your 
current test script it has some nice features:

- Test suite is split into several files. Therefore you do not need to run the
whole test suite when you are working in one area of notmuch.
- If some test fails, the executed commands are displayed automatically, so
you can immediately see what the problem was.
- Working directory for each test has a fixed name based on the name of the 
script (no $$) so you know where to look if some test fails.
- You can decide whether you want to stop on the first failure or complete the 
whole test suite.
- At the end the results are summarized so you do not need to watch the output 
of the test suite.

It is straightforward to convert your current test script to Git's framework. 
If you are interested I'll do it.

Michal


[notmuch] notmuch new: Memory problem (with uuencoded content)

2010-02-05 Thread Carl Worth
On Thu, 26 Nov 2009 11:16:21 -0800, Carl Worth  wrote:
> Clearly, some experimenting is needed. Dominik, if you can share the
> large file, (with either me alone or with the whole list), a pointer to
> where we could download it would be appreciated.

Dominik replied to me privately and described a way for me to create a
file that replicates the bug. Here's a recipe I came up with from his
description:

mkdir tmp
cd tmp/
echo [database]$'\n'path=mail > notmuch-config
mkdir mail
echo From: Me$'\n'To: You$'\n'Subject: uuencode$'\n' > mail/msg
dd if=/dev/urandom of=blob bs=1024 count=10240
uuencode blob < blob >> mail/msg
NOTMUCH_CONFIG=notmuch-config notmuch new

So that's a 10MB blob of random data which uuencodes to a ~14MB mail
file. And notmuch (before a patch I just pushed) chews on it for quite a
while, consuming several hundred MB of memory and resulting finally in a
76MB Xapian database (with chert).

I'm not sure if there is a Xapian bug there or not, (or perhaps a bug in
how notmuch is using Xapian to generate the terms for this large of an
email message).

But the thing that's obvious to me is that indexing encoded data like
this doesn't make any sense at all. So I've just pushed a set of patches
to notmuch to make it detect uuencoded data within a mail message and
ignore it.
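
To illustrate the idea of detecting uuencoded data and skipping it, here
is a minimal sketch in shell. It is not the code that was pushed to
notmuch. The function name strip_uuencoded is hypothetical, and the
sketch assumes a block starts at a line of the form "begin MODE NAME"
and runs through the next line that is exactly "end".

```shell
#!/bin/sh
# Hypothetical helper, not part of notmuch: copy a message from stdin to
# stdout, dropping every uuencoded block it contains.
strip_uuencoded() {
    awk '
        # A uuencoded block opens with "begin <octal mode> <name>".
        !skipping && /^begin [0-7][0-7][0-7][0-7]? ./ { skipping = 1; next }
        # It closes with a line that is exactly "end".
        skipping && /^end$/ { skipping = 0; next }
        # Everything outside a block is passed through unchanged.
        !skipping { print }
    '
}
```

For the recipe above, "strip_uuencoded < mail/msg | wc -c" should show
that only the headers would remain to be indexed.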

Of course, I also pushed a set of tests to the test suite for this, (and
some new "notmuch search" tests while I was at it).

-Carl



[notmuch] notmuch new: Memory problem

2009-11-26 Thread Carl Worth
On Thu, 26 Nov 2009 10:46:54 -0800, Carl Worth  wrote:
> So perhaps the new configuration option we want is a limit on message
> size? Rather than ignoring large files entirely, notmuch could just stop
> indexing messages past the configured limit?

Having just written that, I don't think it's actually an interesting
option.

Instead of working around the bug, we should just find out what the bug
actually is. It could be that Xapian's TermGenerator is just going nuts
here. Or it could be that Xapian is just trying to hold too much data in
memory instead of flushing it out to disk.

Currently, notmuch doesn't ever call any explicit Xapian flush. Instead,
we rely on the default behavior which is that Xapian will flush to disk
after every batch of 10,000 documents added. So it's possible that all
that's actually needed here is for notmuch to notice that it just
indexed a huge file, and then explicitly flush to avoid Xapian using too
much memory. Or, perhaps better, Xapian could be fixed to automatically
flush if its memory usage gets "too big", (if the missing flush is
actually what's needed here).
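
One way to experiment with the flush behavior without changing notmuch
is Xapian's XAPIAN_FLUSH_THRESHOLD environment variable, which sets the
number of documents per automatic flush. The wrapper below is only a
sketch, and the name with_flush_threshold is hypothetical. Whether a
lower threshold helps when a single huge message is the problem is an
open question, since the threshold counts documents, not bytes.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command with Xapian's automatic flush
# threshold set to $1 documents. The remaining arguments are the command.
with_flush_threshold() {
    threshold=$1
    shift
    # env passes the variable to the command only, not to this shell.
    env XAPIAN_FLUSH_THRESHOLD="$threshold" "$@"
}
```

For example: with_flush_threshold 1000 notmuch new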

Clearly, some experimenting is needed. Dominik, if you can share the
large file, (with either me alone or with the whole list), a pointer to
where we could download it would be appreciated.

-Carl


[notmuch] notmuch new: Memory problem

2009-11-26 Thread Carl Worth
On Wed, 25 Nov 2009 10:39:57 +0100, Dominik Epple  wrote:
> So the problem stems indeed from too many too large files being
> present. (I actually found some being as large as 40M, not just 2.4M,
> as written in previous mails.)

That's very good to know.

And I'm glad you at least have things working smoothly now.

So perhaps the new configuration option we want is a limit on message
size? Rather than ignoring large files entirely, notmuch could just stop
indexing messages past the configured limit?

-Carl


[notmuch] notmuch new: Memory problem

2009-11-25 Thread Dominik Epple
Hello,

I repeated the procedure (mb2md, notmuch new), but before, I saved all
those large emails with backup logs into a separate folder which I
deleted before "notmuch new". Then, "notmuch new" works as expected.
So the problem stems indeed from too many too large files being
present. (I actually found some being as large as 40M, not just 2.4M,
as written in previous mails.)
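
For anyone wanting to check their own mail store for such files, a
short sketch with find follows. The function name large_messages is
hypothetical, and the size limit is given in bytes. The .notmuch
directory is skipped so that the Xapian database itself is not listed.

```shell
#!/bin/sh
# Hypothetical helper: list regular files under directory $1 that are
# larger than $2 bytes, without descending into any .notmuch directory.
large_messages() {
    find "$1" -path '*/.notmuch' -prune -o -type f -size +"$2"c -print
}
```

For example, "large_messages Maildir 1048576" lists every file larger
than 1 MiB.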

Regards
Dominik


2009/11/23 Dominik Epple:
> Hi,
>
> 2009/11/20 Carl Worth:
>> On Fri, 20 Nov 2009 09:56:50 +0100, Dominik Epple wrote:
>>> Is there a problem with the number of my mails? I currently have over
>>> 40.000 Mails... they live currently in mbox files, I created a Maildir
>>> with mb2md-3.20.pl.
>>
>> I'm suspecting that you have some big files in there, (such as indexes
>> from some other mail program). We had code in notmuch to detect and
>> ignore these, but a recent bug had broken that.
>>
>> I just fixed this code as of the below commit. So please update and try
>> again and let us know if things work any better.
>
> Ok, one of the problems seems to be solved. One can learn from the
> info: output that the code actually ignores non-email data. These
> files are small and fragments of real mail. Obviously the mb2md code
> made errors there.
>
> But I ran into a different issue. I have a lot of files in the Maildir
> which contain base64 encoded binary data. (Some remote site sends me
> its daily backup logs.) Those files are all 2.4 megabytes in size.
> By adding some debug code to notmuch-new.c, I found out that the
> program becomes very slow and consumes a lot of memory when adding
> these files. I just killed it when it consumed 2 GByte again.
>
> So as you suspected, the problem seems to stem from large files. But
> those large files are not indices or stuff like that from different
> mail programs, but they are valid emails which contain a lot of
> (encoded) binary data.
>
> Perhaps we should be able to configure notmuch so that it ignores
> all mails that match a specific pattern (like "Subject: Backup logs
> from.*")
>
> Regards
> Dominik
>


[notmuch] notmuch new: Memory problem

2009-11-23 Thread Dominik Epple
Hi,

2009/11/20 Carl Worth:
> On Fri, 20 Nov 2009 09:56:50 +0100, Dominik Epple wrote:
>> Is there a problem with the number of my mails? I currently have over
>> 40.000 Mails... they live currently in mbox files, I created a Maildir
>> with mb2md-3.20.pl.
>
> I'm suspecting that you have some big files in there, (such as indexes
> from some other mail program). We had code in notmuch to detect and
> ignore these, but a recent bug had broken that.
>
> I just fixed this code as of the below commit. So please update and try
> again and let us know if things work any better.

Ok, one of the problems seems to be solved. One can learn from the
info: output that the code actually ignores non-email data. These
files are small and fragments of real mail. Obviously the mb2md code
made errors there.

But I ran into a different issue. I have a lot of files in the Maildir
which contain base64 encoded binary data. (Some remote site sends me
its daily backup logs.) Those files are all 2.4 megabytes in size.
By adding some debug code to notmuch-new.c, I found out that the
program becomes very slow and consumes a lot of memory when adding
these files. I just killed it when it consumed 2 GByte again.

So as you suspected, the problem seems to stem from large files. But
those large files are not indices or stuff like that from different
mail programs, but they are valid emails which contain a lot of
(encoded) binary data.

Perhaps we should be able to configure notmuch so that it ignores
all mails that match a specific pattern (like "Subject: Backup logs
from.*")
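
Until such an option exists, the matching files can at least be found
from the shell and moved aside before running "notmuch new". The sketch
below is an illustration only. The function name messages_with_header
is hypothetical, it looks only at the header block (everything before
the first empty line), it does not handle folded header lines, and it
assumes file names contain no newlines.

```shell
#!/bin/sh
# Hypothetical helper: print the path of every file under directory $1
# whose header block has a line matching the extended regex $2.
messages_with_header() {
    find "$1" -type f | while IFS= read -r f; do
        # sed stops at the first empty line, so the body is never searched.
        if sed '/^$/q' "$f" | grep -q -E "^$2"; then
            printf '%s\n' "$f"
        fi
    done
}
```

For example: messages_with_header Maildir 'Subject: Backup logs from.*'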

Regards
Dominik


[notmuch] notmuch new: Memory problem

2009-11-23 Thread Dominik Epple
Hi,

Thanks for your help. Here is the information you requested:

2009/11/20 Carl Worth :
> I'm curious how big your .notmuch directory ended up after this
> operation. (And how that compares in size to the total size of your
> collection of mail.)

I guess you mean these directories:

$ du -sh Maildir
2,8G	Maildir
$ cd Maildir
$ du -sh .notmuch
1,1G	.notmuch

> That's definitely not too much mail. I think you should expect "notmuch
> new" currently to index on the order of 10 - 100 messages/sec.
>
> Your "notmuch new" process should have been reporting a count once per
> second as it progressed, (at least until things went wrong). How far did
> you see that go?

It started quickly, but its speed decreased, and I interrupted it at
some 4000 messages, if I remember correctly.

Regards
Dominik


[notmuch] notmuch new: Memory problem

2009-11-20 Thread Carl Worth
On Fri, 20 Nov 2009 09:56:50 +0100, Dominik Epple  wrote:
> Is there a problem with the number of my mails? I currently have over
> 40.000 Mails... they live currently in mbox files, I created a Maildir
> with mb2md-3.20.pl.

I'm suspecting that you have some big files in there, (such as indexes
from some other mail program). We had code in notmuch to detect and
ignore these, but a recent bug had broken that.

I just fixed this code as of the below commit. So please update and try
again and let us know if things work any better.

Thanks for your patience!

-Carl

commit 3ae12b1e286d1c0041a2e3957cb01daa2981dad9
Author: Carl Worth 
Date:   Fri Nov 20 21:46:37 2009 +0100

    add_message: Re-fix handling of non-mail files.

    More fallout from _get_header now returning "" for missing headers.

    The bug here is that we would no longer detect that a file is not an
    email message and give up on it like we should.

    And this time, I actually audited all callers to
    notmuch_message_get_header, so hopefully we're done fixing this
    bug over and over.


[notmuch] notmuch new: Memory problem

2009-11-20 Thread Carl Worth
On Fri, 20 Nov 2009 09:56:50 +0100, Dominik Epple  wrote:
> I am strongly interested in giving notmuch a try.

Welcome to notmuch, Dominik! I'm sorry your initial attempt to use it
hasn't been quite as smooth as we might like.

>   But I fail setting
> it up. The problem is that during "notmuch new", memory consumption
> and system load increases to values that make my system unusable. I
> then killed "notmuch new" at a memory consumption of 2.7G and at a
> system load of 7.

Yikes. That really sounds like something ran out of control consuming
memory. I certainly haven't seen anything like that before.

> After hitting Ctrl-C, it says "Stopping" but does not stop. I then
> killed "notmuch new" after some minutes with signal KILL.

After "Stopping" gets printed, the notmuch code won't be doing any more
work. It is expected that it will take some time after that message is
printed before notmuch will actually exit. The extra time is to wait for
Xapian to flush out to disk data that notmuch has already provided to
it.

I'm curious how big your .notmuch directory ended up after this
operation. (And how that compares in size to the total size of your
collection of mail.)

> Is there a problem with the number of my mails? I currently have over
> 40.000 Mails... they live currently in mbox files, I created a Maildir
> with mb2md-3.20.pl.

That's definitely not too much mail. I think you should expect "notmuch
new" currently to index on the order of 10 - 100 messages/sec.

Your "notmuch new" process should have been reporting a count once per
second as it progressed, (at least until things went wrong). How far did
you see that go?

I'm wondering if there's a particular file (or files) that are
triggering the bad behavior. Maybe we need a debug option for "notmuch
new" to print the filenames of messages as they are being processed.

-Carl


[notmuch] notmuch new: Memory problem

2009-11-20 Thread Dominik Epple
Hi,

I am strongly interested in giving notmuch a try. But I fail setting
it up. The problem is that during "notmuch new", memory consumption
and system load increases to values that make my system unusable. I
then killed "notmuch new" at a memory consumption of 2.7G and at a
system load of 7.

After hitting Ctrl-C, it says "Stopping" but does not stop. I then
killed "notmuch new" after some minutes with signal KILL.

Is there a problem with the number of my mails? I currently have over
40.000 Mails... they live currently in mbox files, I created a Maildir
with mb2md-3.20.pl.

OS is SuSE Linux 11.1, kernel 2.6.27.29-0.1-default, notmuch pulled
today from git, compiled manually, dependencies also downloaded and
installed manually, in the following versions:

gmime-2.4.11.tar.bz2
talloc-2.0.0.tar.gz
xapian-core-1.0.17.tar.gz

Any help?

Thanks
Dominik