Re: CVS commit: src/sys/arch

2010-08-17 Thread Jean-Yves Migeon

On Tue, 17 Aug 2010 18:18:22 +0100, David Laight  wrote:
>> Unifying sizes of certain types up to 64-bits would be a good choice
>> from engineering standpoint.  Actually measuring the overhead would
>> answer whether it is worth.
> 
> Why not use a 64bit type that is the union of a 64bit number and two
> 32bit numbers.
> A kernel that uses 32bit paddr_t would use tha approriate (for the
> endianness) 32bit value, having initialised the unused half to zero.
> 
> That would remove almost everything except the structure size overhead

Which is precisely the "problem" here :)

> while maintaining binary compatibility.

I don't think so; modules that manipulates 32 bits bus_addr_t will likely
fail with PAE when you cross the 4GB boundary. I don't see how the union
could solve that.

-- 
Jean-Yves Migeon
jeanyves.mig...@free.fr


Re: CVS commit: src/sys/arch

2010-08-17 Thread David Laight
On Tue, Aug 17, 2010 at 04:08:10PM +0100, Mindaugas Rasiukevicius wrote:
> 
> Unifying sizes of certain types up to 64-bits would be a good choice
> from engineering standpoint.  Actually measuring the overhead would
> answer whether it is worth.

Why not use a 64bit type that is the union of a 64bit number and two
32bit numbers.
A kernel that uses 32bit paddr_t would use tha approriate (for the
endianness) 32bit value, having initialised the unused half to zero.

That would remove almost everything except the structure size overhead
while maintaining binary compatibility.

(Embedding the integer type in a struct/union also makes any places
where it is used incorrectly rather more obvious.)

David

-- 
David Laight: da...@l8s.co.uk


Re: CVS commit: src/sys/arch

2010-08-17 Thread Jean-Yves Migeon

On Tue, 17 Aug 2010 14:26:53 +, Andrew Doran  wrote:
> Hi,
> 
> Why are any types other than in the pmap different between PAE and !PAE?
> 
> When this was originally proposed I asked for stuff like paddr_t to be
64
> bits no matter what the kernel compile settings where.
> 
> Thanks.


I need more time!



Let's do this /gradually/. There are already regressions with PAE that
needs fixing before moving further, especially "bugs" that people reported
to me I cannot reproduce with my hosts (a.out breakage with Xen PAE + NX,
and invalid TLB entries that only happens 'randomly' with some AMD 64
Athlon CPUs).

That, and also:
- rmind@ branch work on uvm/pmap. Such a change will interfere with his
work.
- the issue is wider, bus_addr_t will move to "all 64 bits" too
(performance impact?)
- p{d,t}_entry_t vs paddr_t size mix up in pmap, which is bad in the !PAE
case (32 vs 64 bits). Breaking i386 pmap will not be particularly good for
my life expectancy.
- I want to merge xensuspend before entering another bug hunting session
:) One at a time is enough :)
- this is orthogonal to the kvm(3) issue: having 64 bits in-kernel paddr_t
will not automagically solve the "all PAs are unsigned long" assumption of
kvm(3).


-- 
Jean-Yves Migeon
jeanyves.mig...@free.fr


Re: CVS commit: src/sys/kern

2010-08-17 Thread Mindaugas Rasiukevicius
Juergen Hannken-Illjes  wrote:
> > > > <...>
> > > > 
> > > > It's not obvious from your commit message what prevents it from
> > > > gaining a reference after the lock is dropped.
> > > 
> > > The interlock is taken before the freelist lock is dropped and vnodes
> > > on the free lists should never appear on other lists.
> > 
> > Not even the name cache?
> > 
> > (I can't remember off the top of my head, but IIRC there's something
> > sneaky about it)
> 
> Name cache does vtryget() then vget(). As the usecount is zero, vtryget()
> fails and vget() gets called with the interlock held and needs the
> freelist lock to proceed.
> 
> I still don't see a race here.

It would be quite cool to expand the comment in the top of vfs_subr.c, with
sections like "Vnode reclamation mechanism", "Vnode life-cycle", etc and
thus document our knowledge.  Now, each time we re-read and re-discover the
logic and potential corner cases.  And, I believe, not all corner cases.

Just thinking out load. :)

-- 
Mindaugas


Re: CVS commit: src/sys/arch

2010-08-17 Thread Mindaugas Rasiukevicius
"Christoph Egger"  wrote:
> 
> > Hi,
> > 
> > Why are any types other than in the pmap different between
> > PAE and !PAE?
> > 
> > When this was originally proposed I asked for stuff like
> > paddr_t to be 64 bits no matter what the kernel compile
> > settings where.
> 
> Interesting to hear that from your side. When asked for this
> a debate about overhead started.

Unifying sizes of certain types up to 64-bits would be a good choice
from engineering standpoint.  Actually measuring the overhead would
answer whether it is worth.

My bet is that the overhead would be quite low on modern x86 machines.

> 
> Christoph

-- 
Mindaugas


Re: CVS commit: src/sys/arch

2010-08-17 Thread Christoph Egger

> Hi,
> 
> Why are any types other than in the pmap different between
> PAE and !PAE?
> 
> When this was originally proposed I asked for stuff like
> paddr_t to be 64 bits no matter what the kernel compile
> settings where.

Interesting to hear that from your side. When asked for this
a debate about overhead started.

Christoph

 
> Thanks.
> 
> 
> On Mon, Aug 16, 2010 at 10:45:10PM +0200, Jean-Yves Migeon wrote:
> > On 16.08.2010 22:22, matthew green wrote:
> > >> Module Name: src
> > >> Committed By:jym
> > >> Date:Mon Aug 16 19:39:06 UTC 2010
> > >>
> > >> Modified Files:
> > >>  src/sys/arch/i386/i386: machdep.c
> > >>  src/sys/arch/x86/include: cpu.h
> > >>
> > >> Log Message:
> > >> Add machdep.pae sysctl(7) for i386. Thanks to Paul and Joerg for
> their
> > >> reviews.
> > >>
> > >> In kernel, it matches the 'i386_use_pae' variable (0: kernel does not
> use
> > >> PAE, 1: kernel uses PAE). Will be used by i386 kvm(3) to know the
> functions
> > >> that should get called for VA => PA translations.
> > > 
> > > does this work for core files as well?
> > >
> > > if not, wouldn't this feature be better done use ugly kvm/nlist so
> > > that it works with the same code on dead kernels?  getting a 0/1
> > > value via kvm shouldn't really be considered any real problem.
> > 
> > That's the purpose; I don't know of any other way. The sysctl is only
> > there for practical purposes: cpuctl(8) can indicate that the CPU
> > supports PAE, but there is no easy way to know if it is active or not
> > (unless playing with config -x together with the booted kernel, not very
> > practical...). Hence, the sysctl machdep.
> > 
> > > (why the extern in cpu.h?  i doesn't seem necessary.)
> > 
> > Yes; I wanted to place it at the same level as i386_use_fxsave.
> > i386_use_pae may get use elsewhere eventually, so I added it to cpu.h.
> > 
> > -- 
> > Jean-Yves Migeon
> > jeanyves.mig...@free.fr


Re: CVS commit: src/sys/kern

2010-08-17 Thread Antti Kantee
On Tue Aug 17 2010 at 16:26:10 +0200, Juergen Hannken-Illjes wrote:
> > > > It's not obvious from your commit message what prevents it from gaining
> > > > a reference after the lock is dropped.
> > > 
> > > The interlock is taken before the freelist lock is dropped and vnodes on
> > > the free lists should never appear on other lists.
> > 
> > Not even the name cache?
> > 
> > (I can't remember off the top of my head, but IIRC there's something
> > sneaky about it)
> 
> Name cache does vtryget() then vget(). As the usecount is zero, vtryget()
> fails and vget() gets called with the interlock held and needs the freelist
> lock to proceed.
> 
> I still don't see a race here.

Ok, it wasn't clear what you meant by "other lists".

Your commit message makes sense now.


Re: CVS commit: src/sys/kern

2010-08-17 Thread Juergen Hannken-Illjes
On Tue, Aug 17, 2010 at 04:45:36PM +0300, Antti Kantee wrote:
> On Tue Aug 17 2010 at 15:41:11 +0200, Juergen Hannken-Illjes wrote:
> > > > vp->v_freelisthd = NULL;
> > > > mutex_exit(&vnode_free_list_lock);
> > > >  
> > > > -   if (vp->v_usecount != 0) {
> > > > -   /*
> > > > -* was referenced again before we got the interlock
> > > > -* Don't return to freelist - the holder of the last
> > > > -* reference will destroy it.
> > > > -*/
> > > > -   mutex_exit(&vp->v_interlock);
> > > > -   mutex_enter(&vnode_free_list_lock);
> > > > -   goto retry;
> > > > -   }
> > > > +   KASSERT(vp->v_usecount == 0);
> > > 
> > > It's not obvious from your commit message what prevents it from gaining
> > > a reference after the lock is dropped.
> > 
> > The interlock is taken before the freelist lock is dropped and vnodes on
> > the free lists should never appear on other lists.
> 
> Not even the name cache?
> 
> (I can't remember off the top of my head, but IIRC there's something
> sneaky about it)

Name cache does vtryget() then vget(). As the usecount is zero, vtryget()
fails and vget() gets called with the interlock held and needs the freelist
lock to proceed.

I still don't see a race here.

-- 
Juergen Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)


Re: CVS commit: src/sys/arch

2010-08-17 Thread Andrew Doran
Hi,

Why are any types other than in the pmap different between PAE and !PAE?

When this was originally proposed I asked for stuff like paddr_t to be 64
bits no matter what the kernel compile settings where.

Thanks.


On Mon, Aug 16, 2010 at 10:45:10PM +0200, Jean-Yves Migeon wrote:
> On 16.08.2010 22:22, matthew green wrote:
> >> Module Name:   src
> >> Committed By:  jym
> >> Date:  Mon Aug 16 19:39:06 UTC 2010
> >>
> >> Modified Files:
> >>src/sys/arch/i386/i386: machdep.c
> >>src/sys/arch/x86/include: cpu.h
> >>
> >> Log Message:
> >> Add machdep.pae sysctl(7) for i386. Thanks to Paul and Joerg for their
> >> reviews.
> >>
> >> In kernel, it matches the 'i386_use_pae' variable (0: kernel does not use
> >> PAE, 1: kernel uses PAE). Will be used by i386 kvm(3) to know the functions
> >> that should get called for VA => PA translations.
> > 
> > does this work for core files as well?
> >
> > if not, wouldn't this feature be better done use ugly kvm/nlist so
> > that it works with the same code on dead kernels?  getting a 0/1
> > value via kvm shouldn't really be considered any real problem.
> 
> That's the purpose; I don't know of any other way. The sysctl is only
> there for practical purposes: cpuctl(8) can indicate that the CPU
> supports PAE, but there is no easy way to know if it is active or not
> (unless playing with config -x together with the booted kernel, not very
> practical...). Hence, the sysctl machdep.
> 
> > (why the extern in cpu.h?  i doesn't seem necessary.)
> 
> Yes; I wanted to place it at the same level as i386_use_fxsave.
> i386_use_pae may get use elsewhere eventually, so I added it to cpu.h.
> 
> -- 
> Jean-Yves Migeon
> jeanyves.mig...@free.fr


Re: CVS commit: src/sys/kern

2010-08-17 Thread Andrew Doran
On Tue, Aug 17, 2010 at 04:45:36PM +0300, Antti Kantee wrote:
> On Tue Aug 17 2010 at 15:41:11 +0200, Juergen Hannken-Illjes wrote:
> > > > vp->v_freelisthd = NULL;
> > > > mutex_exit(&vnode_free_list_lock);
> > > >  
> > > > -   if (vp->v_usecount != 0) {
> > > > -   /*
> > > > -* was referenced again before we got the interlock
> > > > -* Don't return to freelist - the holder of the last
> > > > -* reference will destroy it.
> > > > -*/
> > > > -   mutex_exit(&vp->v_interlock);
> > > > -   mutex_enter(&vnode_free_list_lock);
> > > > -   goto retry;
> > > > -   }
> > > > +   KASSERT(vp->v_usecount == 0);
> > > 
> > > It's not obvious from your commit message what prevents it from gaining
> > > a reference after the lock is dropped.
> > 
> > The interlock is taken before the freelist lock is dropped and vnodes on
> > the free lists should never appear on other lists.
> 
> Not even the name cache?
> 
> (I can't remember off the top of my head, but IIRC there's something
> sneaky about it)

They can still appear on mount lists I think (it has been a while)..
Are these changes in diff format anywhere?



Re: CVS commit: src/sys/kern

2010-08-17 Thread Antti Kantee
On Tue Aug 17 2010 at 15:41:11 +0200, Juergen Hannken-Illjes wrote:
> > >   vp->v_freelisthd = NULL;
> > >   mutex_exit(&vnode_free_list_lock);
> > >  
> > > - if (vp->v_usecount != 0) {
> > > - /*
> > > -  * was referenced again before we got the interlock
> > > -  * Don't return to freelist - the holder of the last
> > > -  * reference will destroy it.
> > > -  */
> > > - mutex_exit(&vp->v_interlock);
> > > - mutex_enter(&vnode_free_list_lock);
> > > - goto retry;
> > > - }
> > > + KASSERT(vp->v_usecount == 0);
> > 
> > It's not obvious from your commit message what prevents it from gaining
> > a reference after the lock is dropped.
> 
> The interlock is taken before the freelist lock is dropped and vnodes on
> the free lists should never appear on other lists.

Not even the name cache?

(I can't remember off the top of my head, but IIRC there's something
sneaky about it)


Re: CVS commit: src/sys/kern

2010-08-17 Thread Juergen Hannken-Illjes
On Tue, Aug 17, 2010 at 04:35:31PM +0300, Antti Kantee wrote:
> On Tue Aug 17 2010 at 13:17:48 +, Juergen Hannken-Illjes wrote:
> > Module Name:src
> > Committed By:   hannken
> > Date:   Tue Aug 17 13:17:48 UTC 2010
> > 
> > Modified Files:
> > src/sys/kern: vfs_subr.c
> > 
> > Log Message:
> > Now that ffs on disk inodes get freed in the reclaim routine it is no longer
> > necessary for vget() to handle VI_INACTNOW as a special case.  Remove this
> > check and its support in vrelel().
> 
> Hi,
> 
> I didn't fully review your changes, so just to make sure: it's still
> possible to gain a reference during inactive, right?

Yes.

> > Getting another reference while the freelist is locked is an error.  Replace
> > the check with a KASSERT.
> > 
> 
> > vp->v_freelisthd = NULL;
> > mutex_exit(&vnode_free_list_lock);
> >  
> > -   if (vp->v_usecount != 0) {
> > -   /*
> > -* was referenced again before we got the interlock
> > -* Don't return to freelist - the holder of the last
> > -* reference will destroy it.
> > -*/
> > -   mutex_exit(&vp->v_interlock);
> > -   mutex_enter(&vnode_free_list_lock);
> > -   goto retry;
> > -   }
> > +   KASSERT(vp->v_usecount == 0);
> 
> It's not obvious from your commit message what prevents it from gaining
> a reference after the lock is dropped.

The interlock is taken before the freelist lock is dropped and vnodes on
the free lists should never appear on other lists.

-- 
Juergen Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)


Re: CVS commit: src/sys/kern

2010-08-17 Thread Antti Kantee
On Tue Aug 17 2010 at 13:17:48 +, Juergen Hannken-Illjes wrote:
> Module Name:  src
> Committed By: hannken
> Date: Tue Aug 17 13:17:48 UTC 2010
> 
> Modified Files:
>   src/sys/kern: vfs_subr.c
> 
> Log Message:
> Now that ffs on disk inodes get freed in the reclaim routine it is no longer
> necessary for vget() to handle VI_INACTNOW as a special case.  Remove this
> check and its support in vrelel().

Hi,

I didn't fully review your changes, so just to make sure: it's still
possible to gain a reference during inactive, right?

> Getting another reference while the freelist is locked is an error.  Replace
> the check with a KASSERT.
> 

>   vp->v_freelisthd = NULL;
>   mutex_exit(&vnode_free_list_lock);
>  
> - if (vp->v_usecount != 0) {
> - /*
> -  * was referenced again before we got the interlock
> -  * Don't return to freelist - the holder of the last
> -  * reference will destroy it.
> -  */
> - mutex_exit(&vp->v_interlock);
> - mutex_enter(&vnode_free_list_lock);
> - goto retry;
> - }
> + KASSERT(vp->v_usecount == 0);

It's not obvious from your commit message what prevents it from gaining
a reference after the lock is dropped.


Re: CVS commit: src/sys/arch

2010-08-17 Thread Joerg Sonnenberger
On Tue, Aug 17, 2010 at 12:57:21AM +0200, Jean-Yves Migeon wrote:
> - kvm(3) API will have to move from using 'unsigned long' to 'paddr_t'. I
> am currently checking that for all architectures except i386,
> sizeof(paddr_t) == sizeof(u_long).

No. The kvm(3) API should be uint64_t or similar exclusively.

Joerg


re: CVS commit: src/sys/arch

2010-08-17 Thread matthew green

> >> am currently checking that for all architectures except i386,
> >> sizeof(paddr_t) == sizeof(u_long).
> > 
> > sparc and 32 bit sparc64 kernels use 64 bit paddr_t.  there are some
> > mips ports that do too.  you can't assume the above equality at all,
> > not since "paddr_t" was introduced a decade+ ago.
> 
> Eeek. I can't see how I can fix that then. Would it be acceptable to say
> "the affected functions are private to kvm(3), so don't bother with this
> anyway?"

kvm is about as special case as we get in userland.

> >> - i386 will use 64 bits paddr_t for kvm(3), so we will have to discuss
> on
> z> how to make it work without too much fuzz with pre-64 bits i386 dumps.
> > 
> > i think this seems the sanest way.  make userland see the 64 bit
> > definition in all cases, and have the libkvm code DTRT.
> 
> Thing is, kvm(3) has two functions (kvm_kvatop and kvm_pa2off) that use
> PAs internally and assume they are unsigned long. I can't move i386 to 64
> bits PA without having to tweak kvm_private.h prototypes, which will affect
> all ports in turn.
> 
> IMHO, the path of least resistance would be to move to paddr_t.

i think that this sounds sane.

> It is only there for convenience, eg. when someone wants to know easily if
> its system is running under PAE.

ok.

.mrg.


re: CVS commit: src/sys/arch

2010-08-17 Thread Jean-Yves Migeon

On Tue, 17 Aug 2010 09:59:54 +1000, matthew green 
wrote:
>> For kernel core dumps, at the present time, no; this needs thinking,
and
>> testing:
>> - kvm(3) API will have to move from using 'unsigned long' to 'paddr_t'.
I
>> am currently checking that for all architectures except i386,
>> sizeof(paddr_t) == sizeof(u_long).
> 
> sparc and 32 bit sparc64 kernels use 64 bit paddr_t.  there are some
> mips ports that do too.  you can't assume the above equality at all,
> not since "paddr_t" was introduced a decade+ ago.

Eeek. I can't see how I can fix that then. Would it be acceptable to say
"the affected functions are private to kvm(3), so don't bother with this
anyway?"

>> - i386 will use 64 bits paddr_t for kvm(3), so we will have to discuss
on
z> how to make it work without too much fuzz with pre-64 bits i386 dumps.
> 
> i think this seems the sanest way.  make userland see the 64 bit
> definition in all cases, and have the libkvm code DTRT.

Thing is, kvm(3) has two functions (kvm_kvatop and kvm_pa2off) that use
PAs internally and assume they are unsigned long. I can't move i386 to 64
bits PA without having to tweak kvm_private.h prototypes, which will affect
all ports in turn.

IMHO, the path of least resistance would be to move to paddr_t.

>> - sparse dumps for memory above 4GB.
>> 
>> Eventually, PAE and !PAE kernels core files should be handled by kvm,
if
>> that's what you are asking.
> 
> right.  i was suggesting that if you simply used nlist(3) to find out
the
> value of "i386_use_pae" then the same code will work for live debugging
> as well as core dumps.
> 
> since core file debugging basically requires this method, and it works
> just as well for live debugging, the sysctl seems like something we
don't
> need to do.

I insist: the sysctl(7) was never meant to be used for checking whether
PAE is enabled for "live" kvm(3) session. I admit, my commit message is not
very clear: the " Will be used by i386 kvm(3) to know the functions that
should [...]" sentence applies to i386_use_pae, not machdep.pae.

It is only there for convenience, eg. when someone wants to know easily if
its system is running under PAE.

-- 
Jean-Yves Migeon
jeanyves.mig...@free.fr