Re: flist -- "Killed" -- oom (*not* 1.8 related)

2023-03-05 Thread Greg Minshall
Ken,

> It did occur to me that if you did a folder -pack in the MH-E index
> folder and you had a numeric sub-folder then your sub-folder would
> change its name and I am not sure what that would do to the MH-E index.

i *suspect* it would mean that i wouldn't be able to track down a
previous search by looking at the folder names, and not much more.  i
*suspect*.

cheers, Greg



Re: flist -- "Killed" -- oom (*not* 1.8 related)

2023-03-04 Thread Ken Hornstein
>In the meantime, an occasional folder(1) -pack might solve the problem
>manually.

It did occur to me that if you did a folder -pack in the MH-E index
folder and you had a numeric sub-folder then your sub-folder would
change its name and I am not sure what that would do to the MH-E index.

--Ken



Re: flist -- "Killed" -- oom (*not* 1.8 related)

2023-03-04 Thread David Levine
Ken wrote:

> This does suggest to me we should probably change the internal API
> so sparse message ranges are handled better;

Yes, that would be a nice enhancement.

In the meantime, an occasional folder(1) -pack might solve the problem manually.

David



Re: flist -- "Killed" -- oom (*not* 1.8 related)

2023-03-03 Thread Greg Minshall
Ken,

> The bottom line is nmh (and MH before it) is just not going to perform
> well with billion-sized gaps in message numbers and fixing that is going
> to be very very hard.

understood.  thanks for the explanation.

cheers, Greg



Re: flist -- "Killed" -- oom (*not* 1.8 related)

2023-03-03 Thread Ken Hornstein
>while actual bytes of memory on my laptop are semi-precious, addresses
>in the address space are much less so.  here's somebody who uses mmap(2)
>to allocate a huge chunk of address space, and then madvise(2) (a call i
>think i've never used) to have that chunk backed by (lots and lots of)
>zeroes.
>
>https://robert.ocallahan.org/2016/06/managing-vast-sparse-memory-on-linux.html
>

While that is interesting, I see some issues:

- MAP_NORESERVE is a Linux-specific feature of mmap(), as far as I can tell.
  I'm not opposed to OS-specific features but we'd need to think about it
  carefully.
- As for whether it would help ... well, it depends on what you are doing.
  In the specific case of flist(1), it would probably help because one
  of the things folder_read() does is count up the total number of messages
  in a folder (mp->nummsg) and that's what flist uses.  But if you tried
  to use scan(1) on that folder, well ... what scan(1) does is start
  at "lowmsg" and call does_exist() on every number between "lowmsg"
  and "highmsg" to determine if that message exists.  And does_exist()
  uses the msgstat array to see if a message exists, so you'd be
  reading every single msgstat array member.

The bottom line is nmh (and MH before it) is just not going to perform
well with billion-sized gaps in message numbers and fixing that is going
to be very very hard.

--Ken



Re: flist -- "Killed" -- oom (*not* 1.8 related)

2023-03-02 Thread Greg Minshall
Robert, et al., thanks very much.

possibly mh-e could add something like a comma before integers.  i'll
ask and look.

on the more general issue, you all know a lot more about all of this
than me.  but ... :)

while actual bytes of memory on my laptop are semi-precious, addresses
in the address space are much less so.  here's somebody who uses mmap(2)
to allocate a huge chunk of address space, and then madvise(2) (a call i
think i've never used) to have that chunk backed by (lots and lots of)
zeroes.

https://robert.ocallahan.org/2016/06/managing-vast-sparse-memory-on-linux.html


i get the sense that nmh will only (after maybe zeroing the array, which
would be eliminated in this scenario!) access locations in the array
corresponding to actual "messages" found in the directory.  so, this
sparse array should stay sparse, right?

this wouldn't solve all problems -- places where the difference between
the largest and smallest "message" numbers is greater than the size of
the address space (+/-).

but, i wonder if it might help.  maybe combined with those places where
d_type is supported (and not "unknown") in directory entries.  at least
as a temporary fix?

and, if the mmap(2) call fails, that's maybe a way to provide a more
graceful termination than lots of sluggishness, then OOM.

cheers, Greg



Re: flist -- "Killed" -- oom (*not* 1.8 related)

2023-03-02 Thread George Michaelson
I've always done sortm -verbose after big delete fests. verbose because I
love watching the towers of hanoi shuffle along.

lots of GUI mail systems have 'compact mailbox' command options. I assumed
that everyone did periodic tidyup anyway.

I'm not saying this isn't a problem. But, I seriously wonder how BIG a
problem this is. If you can renumber out of it, then isn't that a viable
work-around?

-G


Re: flist -- "Killed" -- oom (*not* 1.8 related)

2023-03-02 Thread Ken Hornstein
>a folder with the highest message number of "N" will cause the array to be
>configured to support N messages, even if there are many fewer (perhaps
>even one) messages

No, that's not correct.  If you have a single message in a folder with a
count of 100, you only get one entry allocated.  The number of entries
allocated is based on the difference between the lowest and highest
message number.

>Scale the array based upon the number of directory entries in the folder.
>This will over commit due to subfolders being counted, scratch files, and
>deleted messages. It seems this would only over commit in interesting cases
>by 3x (baseline of 1 covers the messages, the 2nd set is scratch files and
>deleted messages, and 3 is subfolders). Short of malicious actions, you'd
>end up with, maybe 5x (message, extracted parts of the message, deleted
>message, folders that look like message numbers). If you want more
>compactness, you take pains to dump the stuff that isn't a message number
>(the aforementioned extracted parts and deleted messages).

It's not filesystem internals that is the issue, it's (n)mh internals.

Right now the msgstats array is indexed by taking the message number and
subtracting the value of the lowest message number.  Obviously there are
much better ways to deal with this, but all of the nmh code directly
accesses the msgstats array.  And of course time is not infinite so
someone who HAS time would have to roll up their sleeves to fix it.

(A general assumption is that there are few holes in nmh message
numbers and this is reflected in more locations than just this).

--Ken



Re: flist -- "Killed" -- oom (*not* 1.8 related)

2023-03-02 Thread Howard Bampton
On Thu, Mar 2, 2023 at 11:27 AM Ken Hornstein  wrote:

>
>
> This does suggest to me we should probably change the internal API
> so sparse message ranges are handled better; right now all of the
> programs access the folder structure members directly and assume that
> there will be a msgstat structure in every location in the array.
> Sigh.  One more thing to add to the list.
>
>
If I understand the problem correctly:

- a folder with the highest message number of "N" will cause the array to
  be configured to support N messages, even if there are many fewer
  (perhaps even one) messages
- stat()ing every file in a folder to make sure it is a file (message)
  instead of a directory (folder) is very expensive (and harms the
  performance of other programs where this isn't important and is thus a
  no-go)

I assume we want "close enough" scaling, not perfect. Would not the
following work well enough?

Scale the array based upon the number of directory entries in the folder.
This will over commit due to subfolders being counted, scratch files, and
deleted messages. It seems this would only over commit in interesting cases
by 3x (baseline of 1 covers the messages, the 2nd set is scratch files and
deleted messages, and 3 is subfolders). Short of malicious actions, you'd
end up with, maybe 5x (message, extracted parts of the message, deleted
message, folders that look like message numbers). If you want more
compactness, you take pains to dump the stuff that isn't a message number
(the aforementioned extracted parts and deleted messages).

Or am I missing something about filesystem internals?


Re: flist -- "Killed" -- oom (*not* 1.8 related)

2023-03-02 Thread Ken Hornstein
>From Ken's description above, these 111 messages would allocate almost
>800,000 msgstat structures.  I don't know how huge the message numbers
>get in the results folder, but six digits is common.  I don't recall if
>I've seen seven digit or larger message numbers.

I see Conrad pointed out that if you set "sort=date+" in your .mairixc
then this resolves this issue (but I do not know if that has negative
side effects or if that interacts badly with MH-E).

This does suggest to me we should probably change the internal API
so sparse message ranges are handled better; right now all of the
programs access the folder structure members directly and assume that
there will be a msgstat structure in every location in the array.
Sigh.  One more thing to add to the list.

--Ken



Re: flist -- "Killed" -- oom (*not* 1.8 related)

2023-03-02 Thread Simon Burge
Hi Conrad,

Conrad Hughes wrote:

> Simon> Possibly somewhat related, Greg mentioned he uses mairix for
> Simon> search.  mairix produces very "sparse" results folders.
>
> I use mairix and have never witnessed this.  A quick experiment shows
> that it's because I use
>
>   sort=date+
>
> in my .mairixrc.  At a guess, the default unsorted numbering system must
> use the emails' positions in mairix's own index, which could obviously
> get quite high, given a big archive.
>
> An odd choice.  Try using "sort=date+" if that's acceptable.

Ahh, I see that there have been two "recent" commits to mairix:

  17 Jan 2020 - MH search results are sequentially numbered from 1. 
  17 Jan 2020 - Add "sort=date+" option, and renumber MH results.

Unfortunately(?) I'm still using the latest release version which
is 0.24 from 14 Aug 2017 :/

Cheers,
Simon.



Re: flist -- "Killed" -- oom (*not* 1.8 related)

2023-03-02 Thread Robert Elz
Date:Thu, 02 Mar 2023 13:33:01 +
From:Ralph Corderoy 
Message-ID:  <20230302133301.8900121...@orac.inputplus.co.uk>

  | The real issue is nmh doesn't forbid folders named with just decimal
  | digits and even creates them when requested.  MH-E is set a bad example.

That's true, but nmh doesn't just create folders on a whim, only when
the user requests it, and if the user requests a folder name that looks
like a message number, well, it is a problem of their own causing...
mh-e is (apparently; I don't use it, or anything emacsish) doing it
behind the user's back, in a sense - and really has no need to.

kre




Re: flist -- "Killed" -- oom (*not* 1.8 related)

2023-03-02 Thread Conrad Hughes
Simon> Possibly somewhat related, Greg mentioned he uses mairix for
Simon> search.  mairix produces very "sparse" results folders.

I use mairix and have never witnessed this.  A quick experiment shows
that it's because I use

  sort=date+

in my .mairixrc.  At a guess, the default unsorted numbering system must
use the emails' positions in mairix's own index, which could obviously
get quite high, given a big archive.

An odd choice.  Try using "sort=date+" if that's acceptable.

Conrad



Re: flist -- "Killed" -- oom (*not* 1.8 related)

2023-03-02 Thread Simon Burge
Ken Hornstein wrote:

> Exactly HOW many messages are in mhe-index?
>
> Ah, I think I see what's happening.  That line is this:
>
>   mp->msgstats = mh_xmalloc (MSGSTATSIZE(mp));
>
> MSGSTATSIZE is defined as:
>
> #define MSGSTATSIZE(mp) ((mp)->num_msgstats * sizeof *(mp)->msgstats)
>
> num_msgstats is set by the previous line:
>
> mp->num_msgstats = MSGSTATNUM (mp->lowoff, mp->hghoff);
>
> Which is defined as:
>
> #define MSGSTATNUM(lo, hi) ((size_t) ((hi) - (lo) + 1))
>
> So ... the summary here is that nmh (and MH before it) allocates a
> "message status" element for every possible message.  The possible
> number of messages is the range between the LOWEST message number and
> the HIGHEST message number.  So if you just had 1000 and 1002 in
> a folder, it would allocate 3 elements.  But if you had 1 and 1000000,
> it would allocate a million elements.  A msgstat structure is an array
> of "struct bvector" which might be ... 8 + 8 + 16 bytes per message on
> a 64 bit platform.  That suggests there are either 1320920404 messages
> in that folder (1.2 billion) or there's a huge message number gap (that
> has come up before when someone had a huge gap; my memory is the
> consensus was you just had to deal with it).

Possibly somewhat related, Greg mentioned he uses mairix for search.
mairix produces very "sparse" results folders.  For example:

thoreau 52115> mairix caffeine
Matched 111 messages
thoreau 52116> f +vfolder
vfolder+ has 111 messages  (47-782143); cur=650783.

From Ken's description above, these 111 messages would allocate almost
800,000 msgstat structures.  I don't know how huge the message numbers
get in the results folder, but six digits is common.  I don't recall if
I've seen seven digit or larger message numbers.

Cheers,
Simon.



Re: flist -- "Killed" -- oom (*not* 1.8 related)

2023-03-02 Thread Ralph Corderoy
Hi Ken,

> > > I think we have to push this back on the MH-E people
...
> >$ refile +31415
> >$ folder +31415
> >31415+ has 1 message   (1-1).
>
> I'm aware of that, but what happens if you have a subfolder that is
> all numeric?  I believe all of the nmh tools will treat that subfolder
> as a message

$ ref +3/1/4/1/5
Create folder "/home/ralph/mail/3/1/4/1/5"? yes
$ folder +3/1/4/1/5
3/1/4/1/5+ has 1 message   (1-1).
$ scan -forma %{from} +3/1/4/1/5 1
Ken Hornstein 
$ scan -forma %{from} +3/1/4/1
scan: unable to read: Is a directory
scan: scan() botch (-3)
$

> (that's the real issue).

The real issue is nmh doesn't forbid folders named with just decimal
digits and even creates them when requested.  MH-E is set a bad example.

-- 
Cheers, Ralph.



Re: flist -- "Killed" -- oom (*not* 1.8 related)

2023-03-02 Thread Ken Hornstein
>> I think we have to push this back on the MH-E people; Robert's
>> suggestion to add a non-numeric prefix to directories it creates sounds
>> like the best answer to me.
>
>$ refile +31415
>$ folder +31415
>31415+ has 1 message   (1-1).

I'm aware of that, but what happens if you have a subfolder that is all
numeric?  I believe all of the nmh tools will treat that subfolder as
a message (that's the real issue).

--Ken



Re: flist -- "Killed" -- oom (*not* 1.8 related)

2023-03-02 Thread Ralph Corderoy
Hi,

Ken wrote:
> I think we have to push this back on the MH-E people; Robert's
> suggestion to add a non-numeric prefix to directories it creates sounds
> like the best answer to me.

$ refile +31415 

Re: flist -- "Killed" -- oom (*not* 1.8 related)

2023-03-02 Thread Ken Hornstein
>it seems that at some point i had done a search for 74600607886815 (your
>basic "magic number" :).  mh-e, i guess, had created a directory with
>that number as its name (it uses the search term to name subfolders
>under the normal mhe-index folder).  and, i guess, flist decided that
>(under the ~/Mail/MHE-INDEX folder) was a message number?
>
>does that make sense?  i guess mh-e could not create such subfolders
>with names consisting only of decimal integers (i have some
>hexadecimal-named folders which don't seem to give a problem).  or, i
>could not search for such.  or, maybe flist (or, nmh in general?) could
>not think that a directory was a message?

The loop in folder_read() that is scanning for messages is this:

while ((dp = readdir (dd))) {
if ((msgnum = m_atoi (dp->d_name)) && msgnum > 0) {
[...]

So if the directory entry is a positive decimal integer, nmh (and MH
before it) considers it a message.  Robert already explained the issues
involved; stat()ing every file to determine if it was a file or not
would be prohibitively slow (and this would affect every nmh program;
almost everything calls folder_read()), and using d_type isn't portable.

I think we have to push this back on the MH-E people; Robert's
suggestion to add a non-numeric prefix to directories it creates sounds
like the best answer to me.

--Ken



Re: flist -- "Killed" -- oom (*not* 1.8 related)

2023-03-02 Thread Robert Elz
Date:Thu, 02 Mar 2023 09:49:23 +0300
From:Greg Minshall 
Message-ID:  <814300.1677739763@archlinux>

  | bash archlinux (master): {49603} ls -a 74600607886815/

That would do it.

  | and, i guess, flist decided that
  | (under the ~/Mail/MHE-INDEX folder) was a message number?
  |
  | does that make sense?

yes.

  | i guess mh-e could not create such subfolders
  | with names consisting only of decimal integers (i have some
  | hexadecimal-named folders which don't seem to give a problem).

That, or have the index put in some tree outside your mh mail tree.

  | or, i could not search for such.

That sounds a bit draconian

  | or, maybe flist (or, nmh in general?) could
  | not think that a directory was a message?

On a standard system that would require a stat() of every directory
entry named like a potential message number.  If that were done,
flist would need to be renamed slist instead.

On filesystems that support d_type in the directory entries it would
be possible, but would need the stat() fallback whenever it sees
DT_UNKNOWN, which on many systems is likely to be always.

Better would be to fix mhe to always add a 1 char non-numeric
prefix to directories it creates (perhaps '_', even ' '), and then
there is no confusion any more.

kre



Re: flist -- "Killed" -- oom (*not* 1.8 related)

2023-03-01 Thread Greg Minshall
Ken,

> Exactly HOW many messages are in mhe-index?

exactly

bash archlinux (master): {49536} find ~/MHE-INDEX/ -type f | wc
 173477  173477 7838026


but, is that that many?  in my overall Mail directory

bash archlinux (master): {49516} find ~/Mail/ -type f | wc
1499737 1499739 63526598

i.e., an order of magnitude more.

nevertheless, if i put (what is now) ~/MHE-INDEX back under ~/Mail, i
get the very large malloc failure.

hmm...

bingo.

i'm sitting in ~/Mail/MHE-INDEX:

bash archlinux (master): {49603} ls -a 74600607886815/
./  ../  .mhe_index
bash archlinux (master): {49605} mkdir ~/BAD-MHE-INDEX
bash archlinux (master): {49606} mv 74600607886815/ !$
mv 74600607886815/ ~/BAD-MHE-INDEX
bash archlinux (master): {49607} flist -all | wc
    181    1629   12308
bash archlinux (master): {49608} mv ~/BAD-MHE-INDEX/74600607886815/ .
bash archlinux (master): {49609} flist -all | wc
flist: malloc failed, size wanted: 9207885440
  0   0   0


it seems that at some point i had done a search for 74600607886815 (your
basic "magic number" :).  mh-e, i guess, had created a directory with
that number as its name (it uses the search term to name subfolders
under the normal mhe-index folder).  and, i guess, flist decided that
(under the ~/Mail/MHE-INDEX folder) was a message number?

does that make sense?  i guess mh-e could not create such subfolders
with names consisting only of decimal integers (i have some
hexadecimal-named folders which don't seem to give a problem).  or, i
could not search for such.  or, maybe flist (or, nmh in general?) could
not think that a directory was a message?

cheers, Greg



Re: flist -- "Killed" -- oom (*not* 1.8 related)

2023-03-01 Thread Ken Hornstein
>ah, great! yes, that works. and, yes, to my ignorant eye, it appears
>that the call from `folder_read()` to `mh_xmalloc()` is where we are
>going south.
>[...]
>#2  0xf898 in mh_xmalloc (size=42269452928) at sbr/utils.c:38
>#3  0xacf6 in folder_read (name=0x555d5400 
>"/home/minshall/Mail/mhe-index", lockflag=0) at sbr/folder_read.c:138

Exactly HOW many messages are in mhe-index?

Ah, I think I see what's happening.  That line is this:

mp->msgstats = mh_xmalloc (MSGSTATSIZE(mp));

MSGSTATSIZE is defined as:

#define MSGSTATSIZE(mp) ((mp)->num_msgstats * sizeof *(mp)->msgstats)

num_msgstats is set by the previous line:

mp->num_msgstats = MSGSTATNUM (mp->lowoff, mp->hghoff);

Which is defined as:

#define MSGSTATNUM(lo, hi) ((size_t) ((hi) - (lo) + 1))

So ... the summary here is that nmh (and MH before it) allocates a
"message status" element for every possible message.  The possible
number of messages is the range between the LOWEST message number and
the HIGHEST message number.  So if you just had 1000 and 1002 in
a folder, it would allocate 3 elements.  But if you had 1 and 1000000,
it would allocate a million elements.  A msgstat structure is an array
of "struct bvector" which might be ... 8 + 8 + 16 bytes per message on
a 64 bit platform.  That suggests there are either 1320920404 messages
in that folder (1.2 billion) or there's a huge message number gap (that
has come up before when someone had a huge gap; my memory is the
consensus was you just had to deal with it).

In general nmh will try to handle messages and folders up to the virtual
memory limit and it seems like you reached it.

--Ken



Re: flist -- "Killed" -- oom (*not* 1.8 related)

2023-03-01 Thread Greg Minshall
Ralph,

> To watch progress, try
> 
> strace -e openat flist -recurse -sequence xyz -all >/dev/null

thanks.  Ken's hint about ulimit, and questioning the number of folders,
seems to have cleared things up.

> Also, is there anything odd about your tree of directories?
> Does it have cycles?  Symlinks?  Is it all on local disk?

just for reference: other than the number of folders (if 3000 is odd),
no.  no cycles, no symlinks, all on local (non-spinning, nvram) "disk".

cheers, Greg



Re: flist -- "Killed" -- oom (*not* 1.8 related)

2023-03-01 Thread Greg Minshall
hi, Ken,

many thanks for the reply, and the sparkling idea at the end!

> Is there a long delay when you run flist?

yes, there is a long delay, as the system does some N^m processing.  the
system gets sluggish as always when there is a huge memory crunch.

>  Do you have a lot of folders?  Like a huge number?

well, huge is in the eye of the beholder (or, the malloc'r?):

bash archlinux: {49487} find ~/Mail -type d | wc
   2710    2710  113826


i run mh-e with emacs.  *most* of the folders come from synthetic
folders created when searching (using mairix, in my case) my e-mail
archives.

if i move those folders out, i have fewer

bash archlinux: {49489} find ~/Mail -type d | grep -v mhe-index | wc
   1577    1577   61380

and, flist works just fine, thank you very much.

i guess flist doesn't have a way of specifying folders *not* to search.
so, i will just "garbage-collect" most of these search-related folders
(by removing them).

>  I see that there are arrays allocated based on the number of folders
> you have.  I am just trying to figure out if there is a number of
> small allocations or large ones.  You could also disable OOM
> completely; I suspect flist will just segfault when it hits the limit.

> Oh, wait, I see that using limit/ulimit and setting the "datasize"
> limit should cause a SIGSEGV when it hits that limit.  So if you set
> that below the OOM limit that should make it easier to debug things.

ah, great!  yes, that works.  and, yes, to my ignorant eye, it appears
that the call from `folder_read()` to `mh_xmalloc()` is where we are
going south.


(gdb) run -all
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /home/minshall/src/import/nmh/git/nmh/uip/flist -all
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
flist: malloc failed, size wanted: 42269452928

Breakpoint 1, __GI_exit (status=1) at exit.c:137
137 {
(gdb) where
#0  __GI_exit (status=1) at exit.c:137
#1  0xa339 in adios (what=0x0, fmt=0x55565178 "malloc failed, 
size wanted: %zu") at sbr/error.c:56
#2  0xf898 in mh_xmalloc (size=42269452928) at sbr/utils.c:38
#3  0xacf6 in folder_read (name=0x555d5400 
"/home/minshall/Mail/mhe-index", lockflag=0) at sbr/folder_read.c:138
#4  0x8440 in AddFolder (name=0x7fffa7a0 "mhe-index", force=1) 
at uip/flist.c:452
#5  0x8362 in BuildFolderListRecurse (dirName=0x55564176 ".", 
s=0x7fffb7d0, searchdepth=0) at uip/flist.c:428
#6  0x805e in BuildFolderList (dirName=0x55564176 ".", 
searchdepth=0) at uip/flist.c:352
#7  0x7fab in ScanFolders () at uip/flist.c:323
#8  0x7cdd in main (argc=2, argv=0x7fffda18) at uip/flist.c:244


presumably, "fixing this" isn't a priority, especially as i guess it
would require considerable changes to the code.

but, i'm glad to know what it is.  (and, i hope i remember it!)

cheers, Greg



Re: flist -- "Killed" -- oom (*not* 1.8 related)

2023-03-01 Thread Ralph Corderoy
Hi Greg,

> $ valgrind flist -sequence xyz -all
...
> ==729147== Warning: set address range perms: large range [0x59c8e040, 
> 0xa313d8ac0) (undefined)

To watch progress, try

strace -e openat flist -recurse -sequence xyz -all >/dev/null

Also, is there anything odd about your tree of directories?
Does it have cycles?  Symlinks?  Is it all on local disk?

-- 
Cheers, Ralph.



Re: flist -- "Killed" -- oom (*not* 1.8 related)

2023-02-28 Thread Ken Hornstein
>> If you run under the debugger, you should stop when you receive the
>> signal from the OOM process.
>
>thanks.  OOM is a pretty strange way to die...

Sigh, I guess I was thinking that ptrace() would be able to catch a
process killed by SIGKILL, but I guess not.

Is there a long delay when you run flist?  Do you have a lot of folders?
Like a huge number?  I see that there are arrays allocated based on the
number of folders you have.  I am just trying to figure out if there is
a number of small allocations or large ones.  You could also disable
OOM completely; I suspect flist will just segfault when it hits the limit.

Oh, wait, I see that using limit/ulimit and setting the "datasize" limit
should cause a SIGSEGV when it hits that limit.  So if you set that below
the OOM limit that should make it easier to debug things.

--Ken



Re: flist -- "Killed" -- oom (*not* 1.8 related)

2023-02-28 Thread Greg Minshall
Ken,

> If you run under the debugger, you should stop when you receive the signal
> from the OOM process.

thanks.  OOM is a pretty strange way to die...

run -sequence xyz -all
Starting program: /usr/bin/flist -sequence xyz -all
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".

Program terminated with signal SIGKILL, Killed.
The program no longer exists.
(gdb) where
No stack.


as for valgrind (which really is a wonderful contribution to software
development!)

bash archlinux (master): {50084} valgrind flist -sequence xyz -all
==729147== Memcheck, a memory error detector
==729147== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==729147== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==729147== Command: flist -sequence xyz -all
==729147==
==729147== Warning: set address range perms: large range [0x59c8e040, 
0xa313d8ac0) (undefined)
Killed
bash archlinux (master): {50085}

again, the way the OOM system kills you i guess doesn't leave even
footprints.


> Otherwise you could run under valgrind (it should be available in the
> packaging system for your distro) which should very quickly tell you
> where memory is leaking.

the "set address range perms" does seem large.  one stackoverflow
description:

https://stackoverflow.com/a/13561685/1527747

so, maybe that is

bash archlinux (master): {50088} dc -e '10o 16i A313D8AC0 59C8E040 - p'
42269452928

hmm ... 42GB of memory?

if that jogs anyone's memory, please let me know.  otherwise (assuming
this condition persists), i'll eventually try looking at the code and
etc.  (if anyone thinks of any syscalls that flist and friends might be
making that might change these memory permissions, that would be of use,
as my memory of memory is slight.)

cheers, Greg



Re: flist -- "Killed" -- oom (*not* 1.8 related)

2023-02-27 Thread Ken Hornstein
>and, if not, any thoughts on how to debug?  if i build "cc -g", any
>thoughts on where to set breakpoints, or where to insert printf's, to
>try to track this down?

If you run under the debugger, you should stop when you receive the signal
from the OOM process.

That MIGHT be useful _if_ you hit the limit in the routine that is causing
the memory leak, which is likely but not guaranteed.  Otherwise you could
run under valgrind (it should be available in the packaging system for
your distro) which should very quickly tell you where memory is leaking.

--Ken



flist -- "Killed" -- oom (*not* 1.8 related)

2023-02-27 Thread Greg Minshall
hi.

for some years now, i occasionally get in a situation where flist
gobbles up lots of memory, and is then killed by the kernel's "oom"
procedure.  eventually this goes away, though i'm not exactly sure how
that happens either.


bash archlinux (master): {50071} flist -sequence unseen -all -recurse
Killed
bash archlinux (master): {50072} flist -sequence unseen -all
Killed
bash archlinux (master): {50073} flist -sequence unseen
drafts+ has 0 in sequence unseen; out of 1
bash archlinux (master): {50074} flist -sequence xyz -all
Killed
bash archlinux (master): {50075} flist -version
flist -- nmh-1.7.1 built 2022-11-25 04:27:57 + on archlinux


any ideas?

and, if not, any thoughts on how to debug?  if i build "cc -g", any
thoughts on where to set breakpoints, or where to insert printf's, to
try to track this down?

cheers, Greg