Re: Stuck in Needbuf state, trying to understand (6.7)

2020-07-14 Thread Bob Beck
On Mon, Jun 29, 2020 at 03:56:43PM -0400, sven falempin wrote:
> On Mon, Jun 29, 2020 at 12:58 PM sven falempin 
> wrote:
>
> It works in the original problematic setup.
> 
> Will it go to base?
> 

Yes.

revision 1.201
date: 2020/07/14 06:02:50;  author: beck;  state: Exp;  lines: +9 -3;  
commitid: G6yRUUYskLjLY0oH;
Do not convert the NOCACHE buffers that come from a vnd strategy routine
into more delayed writes if the vnd is mounted from a file on an MNT_ASYNC
filesystem. This prevents a situation where the cleaner cannot clean
delayed writes out without making more delayed writes, and we end up
waiting for the syncer to spit things out occasionally when it runs.

noticed and reported by sven falempin  on tech,
who validated this fixes his issue.

ok krw@
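
For reference, after this commit the check in bwrite() (sys/kern/vfs_bio.c)
amounts to roughly the sketch below, reconstructed from the diff Bob posted on
2020-06-29 (quoted further down in this thread); the exact text of rev 1.201
as committed may differ in detail.

	/*
	 * Sketch of the bwrite() hunk, reconstructed from the diff quoted
	 * below in this thread (not copied from rev 1.201 itself).  On an
	 * MNT_ASYNC mount, bwrite() normally downgrades itself to a delayed
	 * write; B_NOCACHE buffers, such as those coming from vnd writing
	 * into its backing file, are now exempted and stay synchronous.
	 */
	async = ISSET(bp->b_flags, B_ASYNC);
	if (!async && mp && ISSET(mp->mnt_flag, MNT_ASYNC)) {
		/*
		 * Don't convert writes from VND on async filesystems
		 * that already have delayed writes in the upper layer.
		 */
		if (!ISSET(bp->b_flags, B_NOCACHE)) {
			bdwrite(bp);
			return (0);
		}
	}
	/* ...otherwise fall through and issue the write now... */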



Re: Stuck in Needbuf state, trying to understand (6.7)

2020-06-29 Thread sven falempin
On Mon, Jun 29, 2020 at 12:58 PM sven falempin 
wrote:

>
>
> On Mon, Jun 29, 2020 at 12:12 PM Bob Beck  wrote:
>
>>
>> > Awesome, thanks!
>> >
>> > I will test that ASAP;
>> > do not hesitate to slay the dragon,
>> > I heard bathing in the blood pool is good for the skin.
>> >
>> > A little concern: I did the test without the MFS and ran into issues;
>> > anyway, I'll get back to you (or the list?) when I have a test report with
>> > the patched kernel.
>>
>> Yes; however, you didn't tell me what options you had on the filesystem
>> mounted when you did the test without MFS, because it matters. If you had
>> your filesystem mounted ASYNC it would have exhibited the same behavior.
>> The issue is due to the async mount, which MFS does by default, not
>> strictly to do with MFS.
>>
>>
> The tmpfs was just not mounted, so it was the options of the underlying /home:
>
> /dev/sd0d on /home type ffs (local, nodev, nosuid)
>
> So far the above patch improves the situation drastically.
>
> I will now perform a test on the original device.
>
>
>
It works in the original problematic setup.

Will it go to base?

-- 
--
-
Knowing is not enough; we must apply. Willing is not enough; we must do


Re: Stuck in Needbuf state, trying to understand (6.7)

2020-06-29 Thread sven falempin
On Mon, Jun 29, 2020 at 12:12 PM Bob Beck  wrote:

>
> > Awesome, thanks!
> >
> > I will test that ASAP;
> > do not hesitate to slay the dragon,
> > I heard bathing in the blood pool is good for the skin.
> >
> > A little concern: I did the test without the MFS and ran into issues;
> > anyway, I'll get back to you (or the list?) when I have a test report with
> > the patched kernel.
>
> Yes; however, you didn't tell me what options you had on the filesystem
> mounted when you did the test without MFS, because it matters. If you had
> your filesystem mounted ASYNC it would have exhibited the same behavior.
> The issue is due to the async mount, which MFS does by default, not
> strictly to do with MFS.
>
>
The tmpfs was just not mounted, so it was the options of the underlying /home:

/dev/sd0d on /home type ffs (local, nodev, nosuid)

So far the above patch improves the situation drastically.

I will now perform a test on the original device.


Re: Stuck in Needbuf state, trying to understand (6.7)

2020-06-29 Thread Bob Beck


> Awesome, thanks!
> 
> I will test that ASAP;
> do not hesitate to slay the dragon,
> I heard bathing in the blood pool is good for the skin.
> 
> A little concern: I did the test without the MFS and ran into issues;
> anyway, I'll get back to you (or the list?) when I have a test report with
> the patched kernel.

Yes; however, you didn't tell me what options you had on the filesystem
mounted when you did the test without MFS, because it matters. If you had your
filesystem mounted ASYNC it would have exhibited the same behavior.  The issue
is due to the async mount, which MFS does by default, not strictly to do with
MFS.


> 
> Again thanks for helping.
> 
> -- 
> --
> -
> Knowing is not enough; we must apply. Willing is not enough; we must do



Re: Stuck in Needbuf state, trying to understand (6.7)

2020-06-29 Thread sven falempin
On Mon, Jun 29, 2020 at 11:44 AM Bob Beck  wrote:

> On Sun, Jun 28, 2020 at 12:18:06PM -0400, sven falempin wrote:
> > On Sun, Jun 28, 2020 at 2:40 AM Bryan Linton  wrote:
> >
> > > On 2020-06-27 19:29:31, Bob Beck  wrote:
> > > >
> > > > No.
> > > >
> > > > I know *exactly* what needbuf is but to attempt to diagnose what your
> > > > problem is we need exact details. especially:
> > > >
> > > > 1) The configuration of your system including all the details of the
> > > filesystems
> > > > you have mounted, all options used, etc.
> > > >
> > > > 2) The script you are using to generate the problem (Not a
> paraphrasing
> > > of what
> > > > you think the script does) What filesystems it is using.
> > > >
> > >
> > > Not the OP, but this problem sounds almost exactly like the bug I
> > > reported last year.
> > >
> > > There is a detailed list of steps I used to reproduce the bug in
> > > the following bug report.
> > >
> > > https://marc.info/?l=openbsd-bugs&m=156412299418191
> > >
> > > I was even able to bisect and identify the commit which first
> > > caused the breakage for me.
> > >
> > >
> > > ---8<---
> > >
> > > CVSROOT:/cvs
> > > Module name:src
> > > Changes by: b...@cvs.openbsd.org2019/05/08 06:40:57
> > >
> > > Modified files:
> > > sys/kern   : vfs_bio.c vfs_biomem.c
> > >
> > > Log message:
> > > Modify the buffer cache to always flip recovered DMA buffers high.
> > >
> > > This also modifies the backoff logic to only back off what is requested
> > > and not a "minimum" amount. Tested by me, benno@, tedu@ and a ports build
> > > by naddy@.
> > >
> > > ok tedu@
> > >
> > > ---8<---
> > >
> > > However, I have since migrated away from using vnd(4)s since I was
> > > able to find other solutions that worked for my use cases.  So I
> > > may not be able to provide much additional information other than
> > > what is contained in the above bug report.
> > >
> > > --
> > > Bryan
> > >
> > > >
> > > >
> > >
> > >
> > Reproduction of BUG.
> >
> >
> > # optional
> > mkdir /tmpfs
> > mount_mfs -o rw -s 2500M swap /tmpfs  # I mounted through fstab, so this
> > line is not tested
> > #the bug
> > /bin/dd if=/dev/zero of=/tmpfs/img.dd count=0 bs=1 seek=25
> > vnconfig vnd3 /tmpfs/img.dd
> > printf "a a\n\n\n\nw\nq\n" | disklabel -E vnd3
> > newfs vnd3a
> > mount /dev/vnd3a /mnt
> > cd /tmp && ftp https://cdn.openbsd.org/pub/OpenBSD/6.7/amd64/base67.tgz
> > cd /mnt
> > # the hang will occur here (the mkdir was in needbuf state for a while ...)
> > for v in 1 2 3 4 5 6 7 8 9; do mkdir /tmp/$v; tar xzvf /tmp/base67.tgz -C
> > /mnt/$v; done
> >
> > Ready to test patches.
> >
> >
>
> So, your problem is that you have your vnd created in an mfs
> filesystem; when I run your test with the vnd backed by a regular
> filesystem (with softdep even) it works fine.
>
> The trouble happens when your VND has buffers cached in its
> "filesystem" but is not flushing them out to the underlying file
> (vnode) that you have your vnd backed by.  On normal filesystems this
> works fine, since vnd tells the lower layer to not cache the writes
> and to do them synchronously, to avoid an explosion of delayed writes
> and buffer dependencies.
>
> The problem happens when we convert synchronous bwrites to
> asynchronous bdwrites if the filesystem is mounted ASYNC, which,
> curiously, MFS always is (I don't know why, it doesn't really make any
> sense, and I might even look at changing that).  All the writes you do
> end up being delayed and chewing up more buffer space, and they are
> all tied to one vnode (your image).  Once you exhaust the buffer
> space, the cleaner runs, but as you have noticed it can't clean out
> your vnode until the syncer runs (every 60 seconds).  This is why your
> thing "takes a long time", and things stall in needbuf.  Softdep has
> deep dark voodoo in it to avoid this problem, and therefore when I use
> a softdep filesystem instead of an ASYNC filesystem it works.
>
> Anyway, what's below fixes your issue on my machine. I'm not sure I'm
> happy that it's the final fix, but it does fix it.  There are many
> other dragons lurking in here.
>
> Index: sys/kern/vfs_bio.c
> ===
> RCS file: /cvs/src/sys/kern/vfs_bio.c,v
> retrieving revision 1.200
> diff -u -p -u -p -r1.200 vfs_bio.c
> --- sys/kern/vfs_bio.c  29 Apr 2020 02:25:48 -  1.200
> +++ sys/kern/vfs_bio.c  29 Jun 2020 15:18:21 -
> @@ -706,8 +706,14 @@ bwrite(struct buf *bp)
>  */
> async = ISSET(bp->b_flags, B_ASYNC);
> if (!async && mp && ISSET(mp->mnt_flag, MNT_ASYNC)) {
> -   bdwrite(bp);
> -   return (0);
> +   /*
> +* Don't convert writes from VND on async filesystems
> +* that already have delayed writes in the upper layer.
> +*/
> +   if (!ISSET(bp->b_flags, B_NOCACHE)) {
> +   bdwrite(bp);
> + 

Re: Stuck in Needbuf state, trying to understand (6.7)

2020-06-29 Thread Bob Beck
On Sun, Jun 28, 2020 at 12:18:06PM -0400, sven falempin wrote:
> On Sun, Jun 28, 2020 at 2:40 AM Bryan Linton  wrote:
> 
> > On 2020-06-27 19:29:31, Bob Beck  wrote:
> > >
> > > No.
> > >
> > > I know *exactly* what needbuf is but to attempt to diagnose what your
> > > problem is we need exact details. especially:
> > >
> > > 1) The configuration of your system including all the details of the
> > filesystems
> > > you have mounted, all options used, etc.
> > >
> > > 2) The script you are using to generate the problem (Not a paraphrasing
> > of what
> > > you think the script does) What filesystems it is using.
> > >
> >
> > Not the OP, but this problem sounds almost exactly like the bug I
> > reported last year.
> >
> > There is a detailed list of steps I used to reproduce the bug in
> > the following bug report.
> >
> > https://marc.info/?l=openbsd-bugs&m=156412299418191
> >
> > I was even able to bisect and identify the commit which first
> > caused the breakage for me.
> >
> >
> > ---8<---
> >
> > CVSROOT:/cvs
> > Module name:src
> > Changes by: b...@cvs.openbsd.org2019/05/08 06:40:57
> >
> > Modified files:
> > sys/kern   : vfs_bio.c vfs_biomem.c
> >
> > Log message:
> > Modify the buffer cache to always flip recovered DMA buffers high.
> >
> > This also modifies the backoff logic to only back off what is requested
> > and not a "minimum" amount. Tested by me, benno@, tedu@ and a ports build
> > by naddy@.
> >
> > ok tedu@
> >
> > ---8<---
> >
> > However, I have since migrated away from using vnd(4)s since I was
> > able to find other solutions that worked for my use cases.  So I
> > may not be able to provide much additional information other than
> > what is contained in the above bug report.
> >
> > --
> > Bryan
> >
> > >
> > >
> >
> >
> Reproduction of BUG.
> 
> 
> # optional
> mkdir /tmpfs
> mount_mfs -o rw -s 2500M swap /tmpfs  # I mounted through fstab, so this line
> is not tested
> #the bug
> /bin/dd if=/dev/zero of=/tmpfs/img.dd count=0 bs=1 seek=25
> vnconfig vnd3 /tmpfs/img.dd
> printf "a a\n\n\n\nw\nq\n" | disklabel -E vnd3
> newfs vnd3a
> mount /dev/vnd3a /mnt
> cd /tmp && ftp https://cdn.openbsd.org/pub/OpenBSD/6.7/amd64/base67.tgz
> cd /mnt
> # the hang will occur here (the mkdir was in needbuf state for a while ...)
> for v in 1 2 3 4 5 6 7 8 9; do mkdir /tmp/$v; tar xzvf /tmp/base67.tgz -C
> /mnt/$v; done
> 
> Ready to test patches.
> 
> 

So, your problem is that you have your vnd created in an mfs
filesystem; when I run your test with the vnd backed by a regular
filesystem (with softdep even) it works fine.

The trouble happens when your VND has buffers cached in its
"filesystem" but is not flushing them out to the underlying file
(vnode) that you have your vnd backed by.  On normal filesystems this
works fine, since vnd tells the lower layer to not cache the writes
and to do them synchronously, to avoid an explosion of delayed writes
and buffer dependencies.

The problem happens when we convert synchronous bwrites to
asynchronous bdwrites if the filesystem is mounted ASYNC, which,
curiously, MFS always is (I don't know why, it doesn't really make any
sense, and I might even look at changing that).  All the writes you do
end up being delayed and chewing up more buffer space, and they are
all tied to one vnode (your image).  Once you exhaust the buffer
space, the cleaner runs, but as you have noticed it can't clean out
your vnode until the syncer runs (every 60 seconds).  This is why your
thing "takes a long time", and things stall in needbuf.  Softdep has
deep dark voodoo in it to avoid this problem, and therefore when I use
a softdep filesystem instead of an ASYNC filesystem it works.
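
(As a purely hypothetical illustration of the pattern described above, an
upper layer that wants its writes to bypass caching and delaying could do
something like the sketch below; the helper name is made up and this is not
the actual vnd(4) code.)

	/*
	 * Hypothetical kernel-context sketch only, not the vnd(4) source.
	 * The upper layer marks the buffer B_NOCACHE and issues a
	 * synchronous bwrite(), so the data does not linger as a delayed
	 * write pinned to the backing vnode.
	 */
	int
	imgfile_write(struct buf *bp)
	{
		SET(bp->b_flags, B_NOCACHE);	/* don't keep this data cached */
		CLR(bp->b_flags, B_ASYNC);	/* wait for the write to finish */
		return (bwrite(bp));
	}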

Anyway, what's below fixes your issue on my machine. I'm not sure I'm
happy that it's the final fix, but it does fix it.  There are many other
dragons lurking in here.

Index: sys/kern/vfs_bio.c
===
RCS file: /cvs/src/sys/kern/vfs_bio.c,v
retrieving revision 1.200
diff -u -p -u -p -r1.200 vfs_bio.c
--- sys/kern/vfs_bio.c  29 Apr 2020 02:25:48 -  1.200
+++ sys/kern/vfs_bio.c  29 Jun 2020 15:18:21 -
@@ -706,8 +706,14 @@ bwrite(struct buf *bp)
 */
async = ISSET(bp->b_flags, B_ASYNC);
if (!async && mp && ISSET(mp->mnt_flag, MNT_ASYNC)) {
-   bdwrite(bp);
-   return (0);
+   /*
+* Don't convert writes from VND on async filesystems
+* that already have delayed writes in the upper layer.
+*/
+   if (!ISSET(bp->b_flags, B_NOCACHE)) {
+   bdwrite(bp);
+   return (0);
+   }
}
 
/*



Re: Stuck in Needbuf state, trying to understand (6.7)

2020-06-28 Thread sven falempin
On Sun, Jun 28, 2020 at 2:40 AM Bryan Linton  wrote:

> On 2020-06-27 19:29:31, Bob Beck  wrote:
> >
> > No.
> >
> > I know *exactly* what needbuf is but to attempt to diagnose what your
> > problem is we need exact details. especially:
> >
> > 1) The configuration of your system including all the details of the
> filesystems
> > you have mounted, all options used, etc.
> >
> > 2) The script you are using to generate the problem (Not a paraphrasing
> of what
> > you think the script does) What filesystems it is using.
> >
>
> Not the OP, but this problem sounds almost exactly like the bug I
> reported last year.
>
> There is a detailed list of steps I used to reproduce the bug in
> the following bug report.
>
> https://marc.info/?l=openbsd-bugs&m=156412299418191
>
> I was even able to bisect and identify the commit which first
> caused the breakage for me.
>
>
> ---8<---
>
> CVSROOT:/cvs
> Module name:src
> Changes by: b...@cvs.openbsd.org2019/05/08 06:40:57
>
> Modified files:
> sys/kern   : vfs_bio.c vfs_biomem.c
>
> Log message:
> Modify the buffer cache to always flip recovered DMA buffers high.
>
> This also modifies the backoff logic to only back off what is requested
> and not a "minimum" amount. Tested by me, benno@, tedu@ and a ports build
> by naddy@.
>
> ok tedu@
>
> ---8<---
>
> However, I have since migrated away from using vnd(4)s since I was
> able to find other solutions that worked for my use cases.  So I
> may not be able to provide much additional information other than
> what is contained in the above bug report.
>
> --
> Bryan
>
> >
> >
>
>
Reproduction of BUG.


# optional
mkdir /tmpfs
mount_mfs -o rw -s 2500M swap /tmpfs  # I mounted through fstab, so this line is not tested
#the bug
/bin/dd if=/dev/zero of=/tmpfs/img.dd count=0 bs=1 seek=25
vnconfig vnd3 /tmpfs/img.dd
printf "a a\n\n\n\nw\nq\n" | disklabel -E vnd3
newfs vnd3a
mount /dev/vnd3a /mnt
cd /tmp && ftp https://cdn.openbsd.org/pub/OpenBSD/6.7/amd64/base67.tgz
cd /mnt
# the hang will occur here (the mkdir was in needbuf state for a while ...)
for v in 1 2 3 4 5 6 7 8 9; do mkdir /tmp/$v; tar xzvf /tmp/base67.tgz -C
/mnt/$v; done

Ready to test patches.



-- 
--
-
Knowing is not enough; we must apply. Willing is not enough; we must do


Re: Stuck in Needbuf state, trying to understand (6.7)

2020-06-28 Thread Bryan Linton
On 2020-06-27 19:29:31, Bob Beck  wrote:
> 
> No. 
> 
> I know *exactly* what needbuf is, but to attempt to diagnose what your
> problem is we need exact details, especially:
>
> 1) The configuration of your system, including all the details of the
> filesystems you have mounted, all options used, etc.
>
> 2) The script you are using to generate the problem (not a paraphrasing of
> what you think the script does), and what filesystems it is using.
> 

Not the OP, but this problem sounds almost exactly like the bug I
reported last year.

There is a detailed list of steps I used to reproduce the bug in
the following bug report.

https://marc.info/?l=openbsd-bugs&m=156412299418191

I was even able to bisect and identify the commit which first
caused the breakage for me.


---8<---

CVSROOT:        /cvs
Module name:    src
Changes by:     b...@cvs.openbsd.org    2019/05/08 06:40:57

Modified files:
sys/kern   : vfs_bio.c vfs_biomem.c

Log message:
Modify the buffer cache to always flip recovered DMA buffers high.

This also modifies the backoff logic to only back off what is requested
and not a "minimum" amount. Tested by me, benno@, tedu@ and a ports build
by naddy@.

ok tedu@

---8<---
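
(For context, the "backoff" mentioned in the log is the buffer cache releasing
pages when memory is wanted elsewhere; the change makes it release only the
amount requested rather than a fixed minimum. A purely hypothetical sketch of
that idea, not the OpenBSD code:)

	/*
	 * Hypothetical sketch only, not the OpenBSD implementation.
	 * "Backing off" means the buffer cache giving pages back under
	 * memory pressure.  Old behaviour: always release at least some
	 * fixed minimum.  New behaviour: release only what was requested.
	 */
	long
	cache_backoff(long requested, long cached)
	{
		long release = (requested < cached) ? requested : cached;
		return (release);	/* pages given back to the system */
	}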

However, I have since migrated away from using vnd(4)s since I was
able to find other solutions that worked for my use cases.  So I 
may not be able to provide much additional information other than
what is contained in the above bug report.

-- 
Bryan

> 
> 
> On Sat, Jun 27, 2020 at 08:09:18PM -0400, sven falempin wrote:
> > On Fri, Jun 26, 2020 at 7:35 PM sven falempin 
> > wrote:
> > 
> > >
> > >
> > > On Fri, Jun 26, 2020 at 5:22 PM Stuart Henderson 
> > > wrote:
> > >
> > >> On 2020/06/26 15:30, sven falempin wrote:
> > >> > behavior confirmed on current.
> > >> >
> > >> > Once the process stalls,  ( could be anything writing to the vnconfig
> > >> disk,
> > >> > cp , umount )
> > >> > a few other calls like df , or ps, etc may hang, never the same
> > >> > sp or mp kernel, reproduced on today's snapshots.
> > >>
> > >> vnconfig is used as part of "make release", many builds are done every
> > >> week using this so it's not a general problem with vnconfig.
> > >>
> > >> Can you show some commands or a script to trigger the behaviour?
> > >>
> > >
> > > the perl script use the system to call :
> > >
> > > vnconfig.
> > > mount.
> > > umount. <- saw hanged
> > > cp.<- saw hanged
> > > tar.<- saw hanged
> > > svn up.<- saw hanged
> > > and dd.
> > > newfs.
> > >
> > > really nothing fancy, only stuff writing to disk got stuck.
> > >
> > > At one point it does a chroot but it never hangs near that , most of the
> > > time it hangs before.
> > >
> > > The script has been used like 1000 times on 6.0 and maybe twice more on
> > > 6.4.
> > >
> > > I have absolutely no idea what the 'needbuf' of top is .
> > >
> > > the script hangs at random position , always writing into vnconfig.
> > >
> > > I have no idea how to reproduce outside the perl script , so maybe it is
> > > related
> > > to some devious perl stdin/stdout buffer .
> > >
> > > Nevertheless there's like a 5% chance that's the script will work( slowly 
> > > )
> > >
> > > Most of the system call are inside a routine to log
> > >
> > > sub debug_system {
> > >   $logger->debug('running: '.join(' ', @_));
> > >   return system(@_);
> > > }
> > >
> > > so i can easily put things inside to try to understand the issue.
> > >
> > > It is really a strange behavior, and the device must be shut down
> > > electrically.
> > > Something really odd, i run syslogd on a buffer, and syslogc buffer is
> > > stuck too
> > > when the device stuck (but it supposed to be mostly already allocated
> > > memory ).
> > >
> > > It's really like the vm does not want to give anymore bucket (<- i
> > > don't know what i m talking about here,
> > > but i looks like that anything that doesn't malloc is ok , computer reply
> > > to ping , can do a few things for a while , and then complete
> > > hang )
> > >
> > > I ran the 6.7 release on a VM somewhere and another device with many perl
> > > script and they work.
> > >
> > > Only this fails 95% of the time and is VERY VERY slow when ok.
> > > compared to what i saw in /usr/src the vnconfig is big ,  ( forgot to copy
> > > df -h  ),
> > > like 2GB
> > >
> > 
> > 
> > I put ktrace in front of the perl system call,
> >
> > and I was able to recover an 800MB trace
> > 
> > $ kdump -f ./trace.out | tail -20
> > kdump: realloc: Cannot allocate memory
> >  25955 UNKNOWN(1634890859)
> >  72466 ? CALL  syscall()
> > 
> > 
> > could that be of some use ?
> > 
> > 
> > -- 
> > --
> > -
> > Knowing is not enough; we must apply. Willing is not enough; we must do
> 
> 



Re: Stuck in Needbuf state, trying to understand (6.7)

2020-06-27 Thread Bob Beck


No. 

I know *exactly* what needbuf is, but to attempt to diagnose what your
problem is we need exact details, especially:

1) The configuration of your system, including all the details of the
filesystems you have mounted, all options used, etc.

2) The script you are using to generate the problem (not a paraphrasing of
what you think the script does), and what filesystems it is using.



On Sat, Jun 27, 2020 at 08:09:18PM -0400, sven falempin wrote:
> On Fri, Jun 26, 2020 at 7:35 PM sven falempin 
> wrote:
> 
> >
> >
> > On Fri, Jun 26, 2020 at 5:22 PM Stuart Henderson 
> > wrote:
> >
> >> On 2020/06/26 15:30, sven falempin wrote:
> >> > behavior confirmed on current.
> >> >
> >> > Once the process stalls,  ( could be anything writing to the vnconfig
> >> disk,
> >> > cp , umount )
> >> > a few other calls like df , or ps, etc may hang, never the same
> >> > sp or mp kernel, reproduced on today's snapshots.
> >>
> >> vnconfig is used as part of "make release", many builds are done every
> >> week using this so it's not a general problem with vnconfig.
> >>
> >> Can you show some commands or a script to trigger the behaviour?
> >>
> >
> > the perl script use the system to call :
> >
> > vnconfig.
> > mount.
> > umount. <- saw hanged
> > cp.<- saw hanged
> > tar.<- saw hanged
> > svn up.<- saw hanged
> > and dd.
> > newfs.
> >
> > really nothing fancy, only stuff writing to disk got stuck.
> >
> > At one point it does a chroot but it never hangs near that , most of the
> > time it hangs before.
> >
> > The script has been used like 1000 times on 6.0 and maybe twice more on
> > 6.4.
> >
> > I have absolutely no idea what the 'needbuf' of top is .
> >
> > the script hangs at random position , always writing into vnconfig.
> >
> > I have no idea how to reproduce outside the perl script , so maybe it is
> > related
> > to some devious perl stdin/stdout buffer .
> >
> > Nevertheless there's like a 5% chance that's the script will work( slowly )
> >
> > Most of the system call are inside a routine to log
> >
> > sub debug_system {
> >   $logger->debug('running: '.join(' ', @_));
> >   return system(@_);
> > }
> >
> > so i can easily put things inside to try to understand the issue.
> >
> > It is really a strange behavior, and the device must be shut down
> > electrically.
> > Something really odd, i run syslogd on a buffer, and syslogc buffer is
> > stuck too
> > when the device stuck (but it supposed to be mostly already allocated
> > memory ).
> >
> > It's really like the vm does not want to give anymore bucket (<- i
> > don't know what i m talking about here,
> > but i looks like that anything that doesn't malloc is ok , computer reply
> > to ping , can do a few things for a while , and then complete
> > hang )
> >
> > I ran the 6.7 release on a VM somewhere and another device with many perl
> > script and they work.
> >
> > Only this fails 95% of the time and is VERY VERY slow when ok.
> > compared to what i saw in /usr/src the vnconfig is big ,  ( forgot to copy
> > df -h  ),
> > like 2GB
> >
> 
> 
> I put ktrace in front of the perl system call,
>
> and I was able to recover an 800MB trace
> 
> $ kdump -f ./trace.out | tail -20
> kdump: realloc: Cannot allocate memory
>  25955 UNKNOWN(1634890859)
>  72466 ? CALL  syscall()
> 
> 
> could that be of some use ?
> 
> 
> -- 
> --
> -
> Knowing is not enough; we must apply. Willing is not enough; we must do



Re: Stuck in Needbuf state, trying to understand (6.7)

2020-06-27 Thread sven falempin
On Fri, Jun 26, 2020 at 7:35 PM sven falempin 
wrote:

>
>
> On Fri, Jun 26, 2020 at 5:22 PM Stuart Henderson 
> wrote:
>
>> On 2020/06/26 15:30, sven falempin wrote:
>> > behavior confirmed on current.
>> >
>> > Once the process stalls,  ( could be anything writing to the vnconfig
>> disk,
>> > cp , umount )
>> > a few other calls like df , or ps, etc may hang, never the same
>> > sp or mp kernel, reproduced on today's snapshots.
>>
>> vnconfig is used as part of "make release", many builds are done every
>> week using this so it's not a general problem with vnconfig.
>>
>> Can you show some commands or a script to trigger the behaviour?
>>
>
> the perl script use the system to call :
>
> vnconfig.
> mount.
> umount. <- saw hanged
> cp.<- saw hanged
> tar.<- saw hanged
> svn up.<- saw hanged
> and dd.
> newfs.
>
> really nothing fancy, only stuff writing to disk got stuck.
>
> At one point it does a chroot but it never hangs near that , most of the
> time it hangs before.
>
> The script has been used like 1000 times on 6.0 and maybe twice more on
> 6.4.
>
> I have absolutely no idea what the 'needbuf' of top is .
>
> the script hangs at random position , always writing into vnconfig.
>
> I have no idea how to reproduce outside the perl script , so maybe it is
> related
> to some devious perl stdin/stdout buffer .
>
> Nevertheless there's like a 5% chance that's the script will work( slowly )
>
> Most of the system call are inside a routine to log
>
> sub debug_system {
>   $logger->debug('running: '.join(' ', @_));
>   return system(@_);
> }
>
> so i can easily put things inside to try to understand the issue.
>
> It is really a strange behavior, and the device must be shut down
> electrically.
> Something really odd, i run syslogd on a buffer, and syslogc buffer is
> stuck too
> when the device stuck (but it supposed to be mostly already allocated
> memory ).
>
> It's really like the vm does not want to give anymore bucket (<- i
> don't know what i m talking about here,
> but i looks like that anything that doesn't malloc is ok , computer reply
> to ping , can do a few things for a while , and then complete
> hang )
>
> I ran the 6.7 release on a VM somewhere and another device with many perl
> script and they work.
>
> Only this fails 95% of the time and is VERY VERY slow when ok.
> compared to what i saw in /usr/src the vnconfig is big ,  ( forgot to copy
> df -h  ),
> like 2GB
>


I put ktrace in front of the perl system call,

and I was able to recover an 800MB trace

$ kdump -f ./trace.out | tail -20
kdump: realloc: Cannot allocate memory
 25955 UNKNOWN(1634890859)
 72466 ▒▒▒ CALL  syscall()


Could that be of some use?


-- 
--
-
Knowing is not enough; we must apply. Willing is not enough; we must do


Re: Stuck in Needbuf state, trying to understand (6.7)

2020-06-26 Thread sven falempin
On Fri, Jun 26, 2020 at 5:22 PM Stuart Henderson 
wrote:

> On 2020/06/26 15:30, sven falempin wrote:
> > behavior confirmed on current.
> >
> > Once the process stalls,  ( could be anything writing to the vnconfig
> disk,
> > cp , umount )
> > a few other calls like df , or ps, etc may hang, never the same
> > sp or mp kernel, reproduced on today's snapshots.
>
> vnconfig is used as part of "make release", many builds are done every
> week using this so it's not a general problem with vnconfig.
>
> Can you show some commands or a script to trigger the behaviour?
>

The perl script uses system() to call:

vnconfig
mount
umount   <- saw it hang
cp       <- saw it hang
tar      <- saw it hang
svn up   <- saw it hang
dd
newfs

Really nothing fancy; only stuff writing to disk got stuck.

At one point it does a chroot, but it never hangs near that; most of the
time it hangs before.

The script has been used like 1000 times on 6.0 and maybe twice more on 6.4.

I have absolutely no idea what the 'needbuf' state in top is.

The script hangs at a random position, always while writing into the
vnconfig disk.

I have no idea how to reproduce this outside the perl script, so maybe it is
related to some devious perl stdin/stdout buffering.

Nevertheless, there's like a 5% chance that the script will work (slowly).

Most of the system calls are inside a logging routine:

sub debug_system {
  $logger->debug('running: '.join(' ', @_));
  return system(@_);
}

so I can easily put things inside it to try to understand the issue.

It is really strange behavior, and the device must be shut down
electrically.  Something really odd: I run syslogd on a memory buffer, and
the syslogc buffer is stuck too when the device is stuck (but that is
supposed to be mostly already-allocated memory).

It's really like the vm does not want to give out any more buckets (<- I
don't know what I'm talking about here, but it looks like anything that
doesn't malloc is OK; the computer replies to ping and can do a few things
for a while, and then hangs completely).

I ran the 6.7 release on a VM somewhere and on another device with many
perl scripts, and they work.

Only this one fails 95% of the time and is VERY VERY slow when it does work.
Compared to what I saw in /usr/src, the vnconfig image is big (forgot to copy
df -h), like 2GB.

-- 
--
-
Knowing is not enough; we must apply. Willing is not enough; we must do


Re: Stuck in Needbuf state, trying to understand (6.7)

2020-06-26 Thread Stuart Henderson
On 2020/06/26 15:30, sven falempin wrote:
> Behavior confirmed on -current.
>
> Once the process stalls (it could be anything writing to the vnconfig disk:
> cp, umount), a few other calls like df or ps may hang as well, never the
> same ones; SP or MP kernel, reproduced on today's snapshots.

vnconfig is used as part of "make release"; many builds are done every
week using it, so this is not a general problem with vnconfig.

Can you show some commands or a script to trigger the behaviour?