Re: kernel panic in gsync_wait

2016-11-09 Thread Olaf Buddenhagen
Hi,

On Fri, Nov 04, 2016 at 12:14:27PM -1000, Brent W. Baccala wrote:

> How do we want to handle fixed bugs, like this gsync problem?  Should we
> open and close a bug report, just so it's documented in the bug database?
> I can open the bug, of course, but I don't have permission to close it...

Nah, the tracker is not supposed to be a complete history of all issues
ever... It's just to keep track of known open issues.

-antrik-



Re: kernel panic in gsync_wait

2016-11-04 Thread Brent W. Baccala
On Fri, Nov 4, 2016 at 12:00 AM, Samuel Thibault 
wrote:

> Brent W. Baccala, on Thu 03 Nov 2016 15:51:04 -1000, wrote:
> > I see... so there must be fallback code (option 2 on my list); I just
> haven't
> > found it.
> >
> > Where is KERN_FAILURE handled in user space?
>
> It's not. Gsync_wait just returns, and the while loop just tries to take
> the lock again.
>

Of course; it's designed to replace a spinlock!  Very clever.

Clever can cause problems, though.  We need to document this.

I've booted a new kernel and the weird ext2fs data corruption problems have
gone away.

I'm currently working on a test case where I create a small ramdisk and
fill it to exhaustion.  'dd' doesn't cleanly error out, though; it hangs.
I'm working through the error handling code in libpager (and friends) and
have already found one problem, but there are others.

I guess I should file it as a bug.

How do we want to handle fixed bugs, like this gsync problem?  Should we
open and close a bug report, just so it's documented in the bug database?
I can open the bug, of course, but I don't have permission to close it...

agape
brent


Re: kernel panic in gsync_wait

2016-11-04 Thread Samuel Thibault
Brent W. Baccala, on Thu 03 Nov 2016 15:51:04 -1000, wrote:
> I see... so there must be fallback code (option 2 on my list); I just haven't
> found it.
> 
> Where is KERN_FAILURE handled in user space?

It's not. Gsync_wait just returns, and the while loop just tries to take
the lock again.

Samuel



Re: kernel panic in gsync_wait

2016-11-03 Thread Brent W. Baccala
On Thu, Nov 3, 2016 at 2:18 PM, Samuel Thibault 
wrote:

> Brent W. Baccala, on Thu 03 Nov 2016 14:12:41 -1000, wrote:
> > I suspect that this ultimately affects just about every program on a
> > Hurd system.
>
> Sure, it's used internally by glibc. But see the commit I made to
> gnumach: that makes gnumach return an error. glibc thus doesn't actually
> wait, just spinlock, i.e. behave correctly, just not so efficiently.
>

I see... so there must be fallback code (option 2 on my list); I just
haven't found it.

Where is KERN_FAILURE handled in user space?

agape
brent


Re: kernel panic in gsync_wait

2016-11-03 Thread Samuel Thibault
Brent W. Baccala, on Thu 03 Nov 2016 14:12:41 -1000, wrote:
> I suspect that this ultimately affects just about every program on a
> Hurd system.

Sure, it's used internally by glibc. But see the commit I made to
gnumach: that makes gnumach return an error. glibc thus doesn't actually
wait, just spinlock, i.e. behave correctly, just not so efficiently.

Samuel



Re: kernel panic in gsync_wait

2016-11-03 Thread Brent W. Baccala
Hi -

I've been trying to figure what to do with the gsync code.  It causes
undefined behavior and occasional kernel panics when rpctrace is used on
gsync_wait and gsync_wake calls.  I suspect that this ultimately affects
just about every program on a Hurd system.  Even if a program doesn't call
locking primitives directly, they're used by the C library, right?

On Sun, Oct 30, 2016 at 10:16 PM, Kalle Olavi Niemitalo  wrote:

> "Brent W. Baccala"  writes:
>
> > Even if I'm right about the nature of this bug, I don't understand
> gnumach
> > well enough to know how a task should access another task's memory.
>
> vm_copy apparently supports such access; code from there could be
> reused.  But if rpctrace uses gsync_wait on the address space of
> another task and the page has been paged out, then the call could
> end up blocking for the pager, and I don't think you want that.
>

I studied vm_copy and came up with some code, but gsync_wake needs to
actually modify the memory location, so setting up a copy doesn't work.
Plus, vm_copy makes a mapping that is visible to user space and leaves the
user space code responsible for deallocating it.

As for blocking for the pager, I think the code does that already.  As
innocuous as this line of code looks:

 *(unsigned int *)addr = val;

it can trigger a page fault and block the RPC, right?

Here's what I think gsync_wait and gsync_wake need to do if their task
argument isn't the current task:

1. lookup the address in the other task's pagemap, flag the map entry
'shared', and insert a copy of it into the kernel pagemap,

2. access the memory location, triggering a page fault which actually
creates the physical mapping, since Mach doesn't create physical maps until
required, then

3a. either remove the pagemap entry from the kernel map when we're done, or
3b. leave it for future use.

If we choose option (3a), then every wrapped call to gsync_{wait,wake} will
trigger a page fault.  If we choose option (3b), then step 1 needs to be
modified to search the kernel pagemap to see if it already has a copy of
the map entry in question.  I have to think some more to figure how that
mapping ultimately gets removed, and of course there's a memory exhaustion
issue if we add lots of new entries to the kernel map on a 32-bit system.

Another possibility would be to reject the RPC back to user space, and put
the old locking code back into glibc as a fallback.

All this brings us back to...

On Sun, Oct 30, 2016 at 10:16 PM, Kalle Olavi Niemitalo  wrote:

>
> Could gsync_wait be removed from gnumach.defs and replaced with
> only a trap that does not take a task_t parameter and cannot be
> intercepted by rpctrace?
>

I don't think we have anything else quite like gsync - kernel code that
directly accesses user space memory in a different task from the one that
trapped.  Our choices seem to be:

1. add a lot of new complexity to the gsync kernel routines
2. add fallback code to glibc
3. add a new system call as Kalle proposed

Comments?

agape
brent


Re: kernel panic in gsync_wait

2016-10-31 Thread Kalle Olavi Niemitalo
"Brent W. Baccala"  writes:

> My new and improved rpctrace is generating kernel panics when run on
> ext2fs.  This happens when rpctrace calls gsync_wait, with ext2fs as the
> 'task' argument.

Could gsync_wait be removed from gnumach.defs and replaced with
only a trap that does not take a task_t parameter and cannot be
intercepted by rpctrace?
(If the gsync code is not deleted altogether because of the
license conflict.)

> Even if I'm right about the nature of this bug, I don't understand gnumach
> well enough to know how a task should access another task's memory.

vm_copy apparently supports such access; code from there could be
reused.  But if rpctrace uses gsync_wait on the address space of
another task and the page has been paged out, then the call could
end up blocking for the pager, and I don't think you want that.



kernel panic in gsync_wait

2016-10-31 Thread Brent W. Baccala
Hi -

My new and improved rpctrace is generating kernel panics when run on
ext2fs.  This happens when rpctrace calls gsync_wait, with ext2fs as the
'task' argument.

Could you guys look at gnumach's kern/gsync.c, at line 212?  It looks to me
like that code tacitly assumes that the 'addr' it's accessing is in the
memory space of the task calling the RPC, instead of the task passed in as
the first argument.

Even if I'm right about the nature of this bug, I don't understand gnumach
well enough to know how a task should access another task's memory.

agape
brent