Re: syscalls implementation

2004-08-27 Thread Terry Lambert
Mmaist wrote:
 Hi!
 I was wondering where the syscalls implementation is in the FreeBSD source tree.
 I would like to know, especially, where
 
 int kldload(const char*);
 
 is located. sys/kern/kern_linker.c contains
 
 int
 kldload(struct thread *, struct kldload_args *)
 
 and I need to see what is called between them.

This is probably the wrong list.  You probably want -questions.

Here's the simplest answer you will probably understand, given that the
question you are really asking is about understanding how system calls
work, and where baby system calls come from.  8-).

The system calls are in code that's generated from the file
/usr/src/sys/kern/syscalls.master by /usr/src/sys/kern/makesyscalls.sh,
which is an awk script encapsulated in a shell script.

In user space, the system calls are stubs in a library that traps into
the vector code generated from syscalls.master in the kernel.  This
code is located in /usr/src/lib/libc.  In the case of system calls
that are unwrapped (like kldload(2)), the calls are generated assembly
code that comes from running a script against the same syscalls.master
file described earlier; see src/lib/libc/sys/Makefile.inc for the exact
place in the source tree that these stubs are being generated from.

The way a system call happens is that the arguments are pushed on the
stack (or put into registers, depending), and then a trap is issued by
attempting to execute a supervisor-only instruction in user code.  This
effectively generates a fault which is then serviced by a fault handler
in the kernel which recognizes the particular trap as special, and
treats it as a system call.
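
A minimal user space sketch of the same idea: instead of calling the
named libc stub, enter the kernel by system call number through
syscall(2).  The helper name raw_getpid() is made up for illustration;
SYS_getpid and syscall() are the standard interfaces.

```c
#include <sys/types.h>
#include <sys/syscall.h>   /* SYS_getpid */
#include <unistd.h>        /* syscall(), getpid() */

/*
 * Hypothetical helper: issue the getpid system call by number.
 * syscall() loads the number and arguments the way the generated
 * stub would, then executes the trap into the kernel.
 */
pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}
```

Both raw_getpid() and getpid() should end up in the same kernel
function, so they return the same value.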

Now when the special trap code in the kernel itself is activated, it
packages up the arguments in a structure of a known size (known to the
code in init_sysent.c, generated from syscalls.master).

For the Intel version of the trap code, see /usr/src/sys/i386/i386/trap.c
and the related assembly code there.  The current thread and the packaged
argument pointer are passed to the system call.

Technically, it makes a lot of things difficult that the current thread
is passed into the system call as part of its context, rather than obtained
when needed, and cached locally, if necessary, but since it's handy at the
time, it's passed in.  On non-register-poor architectures, the current
thread is usually made available in a register, so the cost in obtaining it
later is actually less than a memory dereference of a computed stack offset,
as it is/was (depends on the version, architecture, etc.) in FreeBSD because
it's being passed in.  One of the major things this makes difficult is
accurate proxy credential representation at various points in the kernel (see
the NFS server source code for examples of the contorted logic this makes
necessary).

The packaged argument structures are defined in the (also generated from
the syscalls.master) file /usr/src/sys/sys/sysproto.h.  For the kldload
system call, this looks like:

struct kldload_args {
  char file_l_[PADL_(const char *)]; const char * file; char file_r_[PADR_(const char *)];
};

This argument structure is actually a descriptor.  A descriptor is used to
ensure packing and alignment, so that the user stack save area can be
coerced directly to the structure type, and dereferenced in the function.

The descriptor contains the information necessary to line the structure
contents up with the user stack area and/or register spill area, where the
arguments were stored in user space before the trap.  Mostly, you can just
look at the middle element to know what the argument is, for each line of
arguments.  In this case, it's a const char * file.  This matches what
it was in user space when you made the call.
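
A rough user space sketch of how such padding lines arguments up with
register-sized slots.  The PADL_/PADR_ macros below are modeled on the
ones in sysproto.h but are written from memory, so treat them as an
assumption; reg_t, struct demo_args and unpack_args() are made-up names.
Zero-length arrays are a compiler extension accepted by gcc and clang.

```c
#include <stdint.h>
#include <string.h>

typedef intptr_t reg_t;   /* hypothetical stand-in for the kernel's register_t */

/* Bytes needed to pad type t out to one register-sized slot. */
#define PAD_(t) (sizeof(reg_t) <= sizeof(t) ? 0 : sizeof(reg_t) - sizeof(t))

#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#define PADL_(t) PAD_(t)  /* big endian: value sits at the end of the slot */
#define PADR_(t) 0
#else
#define PADL_(t) 0        /* little endian: value sits at the start of the slot */
#define PADR_(t) PAD_(t)
#endif

/* Hypothetical two-argument descriptor: one slot per argument. */
struct demo_args {
    char fd_l_[PADL_(int)]; int fd; char fd_r_[PADR_(int)];
    char file_l_[PADL_(const char *)]; const char *file; char file_r_[PADR_(const char *)];
};

/* Copy a save area of register-sized slots into the descriptor. */
struct demo_args unpack_args(const reg_t slots[2])
{
    struct demo_args a;
    memcpy(&a, slots, sizeof(a));
    return a;
}
```

Each argument occupies exactly one slot regardless of its own size, so
the middle element of each line reads back the value that was stored in
the corresponding slot.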

So the function you are seeing in the kernel:

int kldload(struct thread *, struct kldload_args *);

*is* the system call you saw in user space:

int kldload(const char*);

with the trap-added thread pointer and a pointer to the packed up save area
containing the same char * value you passed in user space.

Note that since the user and kernel space are not necessarily in core at
the same time (maybe the page your pointer pointed to was swapped out),
you have to use copyin/uiomove (or copyinstr for a NUL-terminated string
such as this path) in the function in the kernel to copy the path in
before it can be used in the kernel address space.

Probably you should not be hacking in this area until you understand the
code operation a little better, since unless you know what you are doing,
most changes you could possibly make will leave you with a dead system.

-- Terry
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: api for sharing memory from kernel to userspace?

2004-05-21 Thread Terry Lambert
Alfred Perlstein [EMAIL PROTECTED] writes:
 I need to share about 100megs of memory between kernel and userspace.
 
 The memory can not be paged and should appear contig in the process's
 address space.  Any suggestions?
 
 I need a way to either:
 map user memory into the kernel's address space.
 map kernel memory into the user's address space.
 
 I was looking at pmap_qenter() but it didn't seem attractive because
 it's for short term mappings, this mapping will exist for quite a
 while.

Given your non-paged requirement, the allocation needs to take place
in the kernel.

Visible in a single process, or in all of them?

If all of them, set the PG_U bit; it will have the same address in user
space as it does in the kernel, and be visible to all processes.

If only one of them, the best general solution is to use a pseudo-device,
and support mmap() on it.
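
From the user side, the pseudo-device approach looks roughly like the
sketch below.  The function name map_device() is made up, and the device
node you pass in would be whatever your driver creates; the driver side
(the mmap entry point that hands back the wired kernel pages) is not
shown.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map len bytes of the device at path, shared and
 * read/write.  Returns NULL on failure.
 */
void *map_device(const char *path, size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping outlives the descriptor */
    return p == MAP_FAILED ? NULL : p;
}
```

For a quick check without a custom driver, /dev/zero is a pseudo-device
that supports mmap(), e.g. map_device("/dev/zero", 4096).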

-- Terry


Re: FreeBSD VFS System?

2003-12-19 Thread Terry Lambert
Ryan Sommers wrote:
 Are there any good web resources or books on the VFS system that the
 FreeBSD kernel uses? I'm guessing it might have originated from the
 4.4BSD(?) interface. I've been attempting to read through the source
 code for different system calls (ie mkdir, rmdir, mount/umount) and
 haven't been able to get very far because of the substantial layers of
 indirection involved. For this reason I was looking at picking apart the
 different subsystems involved and was wondering if there was anything
 any more annotated then the source code itself.

The VFS stacking code came from the FICUS project out of UCLA.  Here are
all their papers on FS stacking:

ftp://ftp.cs.ucla.edu/pub/ficus/usenix_summer_90.ps.gz
ftp://ftp.cs.ucla.edu/pub/ficus/OLD_TECHREPORTS/ucla_csd_900044.ps.gz
ftp://ftp.cs.ucla.edu/pub/ficus/OLD_TECHREPORTS/ucla_csd_910007.ps.gz
ftp://ftp.cs.ucla.edu/pub/ficus/WorkObjOrOpSys_90.ps.gz
ftp://ftp.cs.ucla.edu/pub/ficus/heidemann_thesis.ps.gz
ftp://ftp.cs.ucla.edu/pub/ficus/ucla_csd_930019.ps.gz

-- Terry


Re: Machines with >= 4GB of RAM

2003-12-19 Thread Terry Lambert
Andrew Kinney wrote:
 On 17 Dec 2003 at 15:44, Julian Elischer wrote:
 snip
  options  KVA_PAGES=512
 
  may be a start, but is it still required, and do I have to change
  anything else to match it? (where does the Makefile work out where to
  link the kernel for?)
Is a value of 512 enough for a machine with 16GB of RAM?
 
  Any hints, (even a better google search string) appreciated.
 
 We have a 4GB machine running 4.8-RELEASE, so we aren't using PAE,
 but we had to make changes similar to what you're asking about for a
 different reason.
 
 Your requirements will vary depending on the version of FreeBSD
 you're running, but in general, increasing KVA_PAGES will help
 considerably with stability on large memory machines.  It should be
 noted that releases prior to 4.8 required more changes than just
 KVA_PAGES, but the documentation is a bit muddied on that subject.

In general, the kmem_map size and other kernel memory usage, including
page tables necessary to reference the full memory, end up taking more
than 1G, so the 3G user:1G kernel ratio that's the default for older
FreeBSD won't work at all.  I usually recommend that people make it
1G user:3G kernel; you can get away with 2G:2G if you aren't going to
be allocating lots of mbufs or supporting lots of open sockets, etc.,
but in general most people throw 4G+ into a box because they plan on
building a network server and then throwing some serious load on it.
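
The arithmetic behind those ratios, as a small sketch.  It assumes the
non-PAE i386 case, where each KVA_PAGES unit is one page directory entry
covering 4MB, so 256 gives 1G of kernel address space and 512 gives 2G.
The function names are made up.

```c
#include <stdint.h>

#define PDE_SPAN_BYTES (4ULL * 1024 * 1024)          /* assumed: 4MB per unit, no PAE */
#define VA_SPACE_BYTES (4ULL * 1024 * 1024 * 1024)   /* 32-bit virtual address space */

/* Kernel virtual address space selected by a KVA_PAGES value. */
uint64_t kva_bytes(unsigned kva_pages)
{
    return (uint64_t)kva_pages * PDE_SPAN_BYTES;
}

/* What is left of the 4G address space for user processes. */
uint64_t user_va_bytes(unsigned kva_pages)
{
    return VA_SPACE_BYTES - kva_bytes(kva_pages);
}
```

Under that assumption, the 1G user:3G kernel split corresponds to
KVA_PAGES=768 and the 2G:2G split to KVA_PAGES=512.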


 I don't know if it is required, but we rebuilt the world after
 changing KVA_PAGES just to make sure that any hidden dependencies on
 that value were handled in things other than the kernel.

Depends.  The normal case where this will be required is for prebuilt
kernel modules.  The only user space code I'm aware of which cares is
the Linux threads package (and anything that links against it), since
the threads mailboxes are in a fixed location a priori known to both
the kernel module and the threads library, and the location has to be
changed when the KVA space changes, since it assumes a 3:1 or whatever
was in effect when it was compiled.


 As far as 512 being a large enough setting for a 16GB machine, that
 depends entirely on what you plan to do with the machine and its
 usage pattern of various system resources.

In my personal experience, the kernel and data structures consume over
1G in a 4G box, due to the auto-tuning cruft trying to be smarter than
it actually is, and making bad decisions.

In the 4.7/4.8 time frame, this was catastrophic with more than 4G,
since it didn't stop scaling at 4G (the kernel can only address 4G,
without PAE, no matter what, since pointers are 32 bits).  Scaling
above that point tries to allocate more memory for the kernel than
the kernel is capable of addressing.


 For instance, on our 4GB machine, it does a lot of heavy web serving,
 databases, and email.  We needed the 2GB KVA on that machine because
 of large numbers of files, large network buffers, and some weirdness
 relating to Apache and pv entries.  If your usage patterns were
 similar and you wanted to make full use of the 16GB without getting
 trap 12 panics, then 2GB KVA may be inadequate.

Older boxes won't even boot with 3:1 if you jam 4G in them, period.

-- Terry


Re: question about _exit() function

2003-11-28 Thread Terry Lambert
rmkml wrote:
 is the _exit() function safe for a thread ?
 my program use vfork() and then execve in a thread context.
 The documentation mentions that the process has to call _exit() in case
 of failure.
 But this _exit() is really safe for the parent thread ?

The behaviour is undefined in the failure case, but only if you
have established a pthread_atfork() handler that does anything in
the child.

In general, the more correct approach is to use posix_spawn(), but
FreeBSD doesn't support posix_spawn() (funny; this has come up now
twice within 60 messages of each other, while there was a very
long time when it was not pertinent at all...).

POSIX also seriously discourages the use of vfork(), and recommends
fork() instead, for threaded programs.

Note that the fork() only *ever* creates a single thread, so it
will only replicate the currently running thread and its address
space, rather than all currently running threads in the parent.

You said in another message that this is on 4.8.  I think that the
behaviour will not be quite what you expect, in that case, and that
it'll be better in -current, but might still not be what you expect
there, either (depends on what you are expecting).  See also:

http://www.opengroup.org/onlinepubs/007904975/functions/fork.html

-- Terry


Re: question about _exit() function

2003-11-28 Thread Terry Lambert
rmkml wrote:
 Thanks a lot for the answer. I will change vfork() with fork().
 
 Another question: in the man page of vfork() it is mentioned that
 the fork() function has to use _exit(0) too when something wrong with the
 execve() happens!

I can see how you might read it this way, but that's not really
correct.

The purpose of _exit() instead of exit() is to avoid calling the
atexit() handler, the per thread data cleanup handlers, and the
cancellation routines.
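
The usual shape of this, as a minimal sketch using fork() rather than
vfork().  The function name spawn_and_wait() and the exit status 127
for "exec failed" are choices made for the example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run argv[0] in a child process and return its
 * exit status, or -1 on error or abnormal termination.
 */
int spawn_and_wait(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        execv(argv[0], argv);
        /*
         * Only reached if the exec failed.  _exit() leaves without
         * running atexit() handlers or flushing stdio buffers that
         * were inherited from the parent.
         */
        _exit(127);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A failed exec shows up in the parent as status 127; a successful one
returns whatever the new program exits with.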

In the case of a vfork(), it's undefined as to whether you will
be operating against resources currently allocated and appropriately
reference counted to the child, or whether you are operating against
that of the parent.  In the case of fork(), you are guaranteed to
operate in the context of the child.

Which you should use is dependent on your application.  The normal
operation of a fork() in a threaded program is to duplicate only
the calling thread.  If you have registered atexit(), per thread
data cleanup handlers, or cancellation routines (or, in some cases,
signal handlers), then when you call exit(), these things will be
invoked.

Consider that the thread forking has perhaps the responsibility of
cleaning up a shared memory segment created by a task.  You do *not*
want to do this cleanup (which happens on an interface that you are
manually resource tracking) in the child process in this case, since
it could rip the shared memory segment out from under other processes
which are using it.

Rather than calling _exit() to avoid this, you probably want to call
exit()... however, you must deal with detaching your registered
handlers and avoiding your manual cleanup process.  The correct way
to do this is to disable them in the child by utilizing the function
pthread_atfork() to disentangle them at fork time.
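
A sketch of that disentangling, under the assumption that the manually
tracked resource is a file removed at exit.  All names here
(claim_resource, owns_resource, and so on) are invented for the example;
only pthread_atfork(), atexit(), fork() and exit() are real interfaces.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static char resource_path[256];        /* hypothetical shared resource */
static volatile int owns_resource;     /* nonzero only in the owning process */

/* atexit() handler: only the owner tears the resource down. */
static void cleanup_resource(void)
{
    if (owns_resource)
        unlink(resource_path);
}

/* pthread_atfork() child handler: the child gives up ownership. */
static void disown_in_child(void)
{
    owns_resource = 0;
}

/* Record ownership and register both handlers.  Returns 0 on success. */
int claim_resource(const char *path)
{
    if (strlen(path) >= sizeof(resource_path))
        return -1;
    strcpy(resource_path, path);
    owns_resource = 1;
    if (pthread_atfork(NULL, NULL, disown_in_child) != 0)
        return -1;
    return atexit(cleanup_resource);
}

/* Fork a child that calls exit(), so its atexit() handlers do run. */
int fork_and_exit_child(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        exit(0);

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child still runs cleanup_resource() on exit(), but since the
atfork handler cleared owns_resource in the child, the resource
survives until the parent itself exits.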

See:
http://www.opengroup.org/onlinepubs/007904975/functions/pthread_atfork.html


 Is the child a real process or because of the thread context a part of
 the parent process, so a new thread.

It is a real process; the only thread running in this real process
is a copy of the thread that was running at the time of the fork().
This is the primary reason that the POSIX documentation cautions against
vfork(), why it only permits either an _exit() or an execve() afterward,
and why you should avoid vfork().


 In this case a pthread_exit() may be a better solution.
 Is that point of view completely wrong ?

You likely do not want to use pthread_exit() in this case, since you
are the only thread in a process.  If you were to use this with vfork(),
the combination would likely cause a program malfunction.  If you were
to use it with fork(), the combination may or may not be a problem.  In
the second case, the reason for this is that there exists the possibility
that when you create the new process, it will link with object files with
statically declared instances whose .init routines create worker threads.
For example, the Netscape LDAP library is not threads-safe, so at one
point I wrote a wrapper library that queued requests to a single worker
thread that was created at the time the library reference was instanced
via a .init section.  Calls into the library queued requests, which were
then serialized to the Netscape library, and responses were sent back out.
The result looked like an LDAP client library that could be reentered,
but which would serialize work to the non-thread-reentrant library under
the covers.

So the reason pthread_exit() should not be used is that you may not be
the only thread running, and if so, there will be no pthread_join()
called to reap your thread, and the other threads will continue
indefinitely: you can't depend on not creating threads as a side effect
of using various libraries or shared objects.


 Currently, in some indeterminate case, a part of my program freezes just
 after the vfork().
 So, I try to understand what may cause the calling thread of vfork() to
 freeze ...

Without more detail, this is hard to pinpoint, since I don't really
know what you mean by freeze, and there appears to be a language
barrier to a precise explanation.

Most likely, this is an interaction between the user space scheduler
and the vfork().  Realize that in the 4.x series of FreeBSD, the pthreads
are implemented with a user space scheduler.  This means that following
a vfork(), since there is only one schedulable entity, the process, all
threads are suspended until your _exit() or execve() call (assuming that
these do the right thing in the vfork() case interacting with threads,
and POSIX says that it's undefined if this will happen).  If you want
other threads in your main application to run concurrently with the child,
you *must* use fork() instead of vfork().

-- Terry


Re: getpwnam with md5 encrypted passwds

2003-11-27 Thread Terry Lambert
Clifton Royston wrote:
   If you will need to do authentication after your program drops
 privileges, your best course is probably to go through PAM, to install
 a separate daemon which implements a PAM-supported protocol and which
 runs with privileges, and then to enable that protocol as a PAM
 authentication method for your application.

[ ... RADIUS example with LDAP mention ... ]

Sounds like a good approach, though I'll point out that had
you tried LDAP, you would have been hard-put to use LDAP as a
proxy protocol to another authentication base (a PAM backend
for an LDAP server, while not quite impossible, would be very
hard).

How did you avoid the recursion problem of the RADIUS server
trying to authenticate via pam_radius to the RADIUS server
trying to authenticate ...?

-- Terry




Re: getpwnam with md5 encrypted passwds

2003-11-27 Thread Terry Lambert
Peter Pentchev wrote:
 On Wed, Nov 26, 2003 at 02:21:04PM +0100, Kai Mosebach wrote:
  Looks interesting ... is this method also usable, when i dropped my privs ?
 
 I think Terry meant pam_authenticate() (not pan), but to answer your
 question: no, when you drop your privileges, you do not have access to
 at least the system's password database (/etc/spwd.db, generated from
 /etc/passwd and /etc/master.passwd by pwd_mkdb(8)).  If this will be any
 consolation, getpwnam() won't return a password field when you have
 dropped root privileges either.

Peter is correct on both counts.  If I had not seen his reply
first, I would have made the same reply.  You cannot crypt
something you cannot read.

-- Terry




Re: NFS Flags Oddity

2003-11-27 Thread Terry Lambert
Kris Kirby wrote:
 FreeBSD (4.9-RC) doesn't appear to export schg flags over NFS.  You've
 got to shell in locally to the machine to move the schg flags; ls -lao
 doesn't report them over NFS, but does list them locally.

Non-local flags are not defined, so they are not permitted to
be exported over NFS.

You'll find the same thing with the number of bits in major
and minor number, etc..  For a long time (until Julian added
the first devfs to FreeBSD), it was not possible to NFS-boot
a FreeBSD box off of e.g. an Alpha running TRU64 UNIX, for
example.

-- Terry




Re: getpwnam with md5 encrypted passwds

2003-11-26 Thread Terry Lambert
[EMAIL PROTECTED] wrote:
 i am trying to validate a given user password against my local passwd-file with
 this piece of code :
 
 if (!( pwd = getpwnam ( user ))) {
 log(ERROR,"User %s not known",user);
 stat=NOUSER;
 }
 if (!strcmp( crypt(pass,pwd->pw_name), pwd->pw_passwd) ) {
 log(DEBUG|MISC,"HURRAY : %s authenticated\n", user);
 stat = AUTHED;
 }

I know you have the fix for the crypt of the wrong field, but the
proper thing to do is probably to use pan_authenticate() so that
you are insensitive to the authentication method being used, rather
than crypting and comparing it yourself.

-- Terry


Re: secure file flag?

2003-11-25 Thread Terry Lambert
Wes Peters wrote:
 On Tuesday 18 November 2003 16:31, Rayson Ho wrote:
  e.g. when deleting a secure file, the OS will overwrite the file
  with random data.
 
 Better to overwrite it with a more secure pattern.  See ports/
 sysutils/obliterate for references.  It has been mentioned before that
 this could be done on in the kernel, obliterating blocks in the VM
 rather than zeroing them.  I hadn't thought of applying at the file or
 filesystem level.

The DOD has a specific pattern it requires, to consider the deletion
to be secure.


 The closest we have is the 'rm -P' command and the above-mentioned
 obliterate command.  The overwrite pattern used in 'rm -P' is not
 likely to be effective against a dedicated inspection of the disk; the
 one in obliterate somewhat more so.

On most modern drives, nothing is likely to be effective, without OS
support all the way down to the driver and hardware flags level.
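
For reference, the kind of user-level overwrite under discussion looks
roughly like the sketch below.  The three passes (0xff, 0x00, 0xff) are
what I recall the rm(1) manual page describing for -P, so treat the
pattern as an assumption; the function names are made up.  Note that
nothing here controls seek direction, write caching, or sector sparing.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Write one pass of a fixed byte pattern over the first size bytes. */
static int overwrite_pass(int fd, off_t size, unsigned char pattern)
{
    unsigned char buf[4096];

    memset(buf, pattern, sizeof(buf));
    if (lseek(fd, 0, SEEK_SET) < 0)
        return -1;
    while (size > 0) {
        size_t chunk = size < (off_t)sizeof(buf) ? (size_t)size : sizeof(buf);
        ssize_t n = write(fd, buf, chunk);
        if (n <= 0)
            return -1;
        size -= n;
    }
    return fsync(fd);   /* ask the OS to push the pass out to the device */
}

/* Hypothetical helper: overwrite a file in place, three passes. */
int overwrite_file(const char *path)
{
    static const unsigned char patterns[] = { 0xff, 0x00, 0xff };
    struct stat st;
    int fd = open(path, O_WRONLY);

    if (fd < 0)
        return -1;
    int rc = fstat(fd, &st);
    for (size_t i = 0; rc == 0 && i < sizeof(patterns); i++)
        rc = overwrite_pass(fd, st.st_size, patterns[i]);
    close(fd);
    return rc;
}
```

The caller would unlink() the file afterward.  The fsync() only hands
the data to the drive; what the drive does with it is its own business.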

The DOD specified pattern is only effective if you have separate
control of the seek and the write.  The reason for this is that you
must take head hysteresis into account since if you did a seek in
for the initial write and a seek out for the erase, you will end up
with a small strip of bits that are readable, even if they are much
smaller than a standard track width, since there is jitter in the
head positioning that depends on the side of the track you are coming
from.

So in reality, you also need to control sector sparing and write
caching, as well, to avoid track caching, if possible, and seeks for
sector sparing which are hidden from the OS trying to invoke the
write pattern: you need to turn both of these off, if you can.  If
you can't, you need to buy a different disk, and turn both of these
off, if you can.  If you can't, you are going to need to write your
own disk firmware.

You also need to deal with not writing to tracks at one end or the
other of the disk, since you can't seek to them from the opposite
direction, which means you have no way to write the pattern you are
expected to write.  This generally means that the end tracks need
to be treated as scratch landing zones, and you only ever write
pattern data to them, and then only because that's the way to get
the disk head onto the track so you can seek back to the track that
you really want to erase.

In a track-caching world, this tends to be not useful, unless you
can determine the physical geometry of the disk, and treat tracks as
separate entities.

Finally, if you have a track-caching disk, it's likely that the way
it operates is to just seek in and start writing.  That will mean
that in order to avoid a thin stripe of your old bits, you have to
treat tracks as single entities, and that means that if you have a
track that shares data with several files, and you want to scribble
over one of them effectively, you have to scribble over everything
effectively, and then put the data for the file(s) you didn't want
to erase back on the track.


 This sounds like an interesting file flag.  Would you expect the process
 to block on the unlink(2) call while the overwrite takes place, or for
 this to happen in a kernel thread?  The former seems pretty straight-
 forward, hacking at ffs_blkfree.  The latter I really wouldn't know how
 to begin without (a lot) more study.

You would have to do the former, or you would not pass Common Criteria
evaluation, if that's what you are aiming for.

The normal way this is handled in government secure facilities is a 2U
rack unit containing thermite charges.

The normal way this is handled in a commercial secure facility is mostly
to put the disks in a crusher.

If this is something other than that, I doubt anyone would be willing
to spend US$60,000/MB to have someone recover your porn.  You are
likely safe enough with PHK's somewhat insecure disk encryption thingy.


As an overall note, you might want to contact Michal Serrachio off
list; he has a solution to this problem which he might be willing to
license to you for a fee.

-- Terry


Re: Library libgcc_pic.a missing on 5.1?

2003-11-25 Thread Terry Lambert
Jim Durham wrote:
 Is libgcc_pic.a not supposed to be on 5.1?

Is the one in /usr/lib not good enough for you?  8-) 8-).

-- Terry


Re: Conflict between sys/sysproto.h stdio.h ... ?

2003-11-20 Thread Terry Lambert
lucy loo wrote:
 I am writing a kernel loadable module to reimplement some system calls.
 I have included sys/sysproto.h, sys/systm.h, etc. -- very standard
 header files for kld implementation.

So far...

 I also want to do file i/o in this module, therefore I need to include
 stdio.h. But it obviously conflicts with those sys/..., and make
 won't pass.  Anyone knows how to fix this?

You cannot use libc functions in the kernel.  The kernel does not
link against libc.  It is not an application, it is a kernel.

There are some libc functions which are provided in the kernel;
there are other libc functions for which there are similar kernel
functions of the same name (e.g. printf), and there are some
libc functions -- quoted because they aren't really there, but
you can use them -- that are inlined by the compiler.

Programming in the kernel environment is not the same as programming
in the normal applications environment (the POSIX environment).

-- Terry


Re: non-root process and PID files

2003-11-15 Thread Terry Lambert
Jos Backus wrote:
 On Fri, Nov 14, 2003 at 01:45:45AM -0800, Terry Lambert wrote:
  OK.  We already have one of those.  We call it init.  8-).
 
 Feature-wise init and svscan/supervise don't quite match; svscan has more
 features, one of which being that it doesn't use a single control file which
 if you screw it can render your system unusable. Even SysV init has more
 (useful) features than ours. Also, init is kinda hard to use by non-root users
 for that reason. People have been known to successfully replace init with
 svscan, btw.

The main feature svscan lacks for me is the right license.  8-).

Practically, init does what you want: monitors and restarts things
that die.

This whole discussion, though, is pretty stupid, since the things
you are worrying about keeping running should be written to not die
in the first place, and if they *are* dying, it's generally dumb
to restart them thinking that they will magically not die again,
since, having been restarted, they are now among the blessed.

8-) 8-).


 This is really off-topic. But sure, there are instances where this approach is
 less than robust. So what? Using pidfiles doesn't solve this problem either.

Not using PID files won't fix it either.  So portraying PID files
as being a bad way to solve this problem (which is what people were
originally complaining about) is really dumb: PID files work very
well for resolving the problem that they were intended to resolve.

You just need to create them when you are first started, which is
done by root, unless you are SUID to some other UID, in which case
it's done by that ID instead.  After that, the same UID has privileges
on the file, and the problem is entirely solved.
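
A minimal sketch of the PID file lifecycle: create it at startup, read
it back later, and probe the recorded process with signal 0.  The
function names and the file mode are choices made for the example.

```c
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Create (or truncate) the PID file and record our own PID in it. */
int write_pidfile(const char *path)
{
    char buf[32];
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;
    int len = snprintf(buf, sizeof(buf), "%ld\n", (long)getpid());
    ssize_t n = write(fd, buf, (size_t)len);
    close(fd);
    return n == len ? 0 : -1;
}

/* Read the PID back; returns -1 if the file is missing or malformed. */
pid_t read_pidfile(const char *path)
{
    long pid = -1;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;
    if (fscanf(fp, "%ld", &pid) != 1)
        pid = -1;
    fclose(fp);
    return (pid_t)pid;
}

/*
 * Signal 0 performs the permission and existence checks without
 * delivering anything.  A process owned by another UID reports
 * failure here (EPERM) even though it exists.
 */
int pid_is_alive(pid_t pid)
{
    return pid > 0 && kill(pid, 0) == 0;
}
```

If the daemon drops privileges after startup, creating the file before
the drop and leaving it owned by the target UID keeps later rewrites
working.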

[ ... ]

 Also, software can fail in ways unaccounted
 for. But this is really off-topic.

It can't fail that way if you account for it failing that way.  8-).

But see the above: restarting sshd because it core dumps when you
try to login using putty isn't going to magically make attempting
to login via putty not core dump it.


  Anyway, FreeBSD has steadfastly disliked the concept of a registry,
  ever since Microsoft implemented it in Windows95; it's one of the
  biggest NIH items of all time.
 
 I think it would help if config files had a common structure and could be
 queried for interesting service metadata like dependencies. See the Arusha
 project for an example.

This is totally bogus.  Data interfaces are the same things that
screw us over with proc size mismatch messages every time the
proc structure changes.  The only reasonable interface is one
that provides procedural accessor/mutator functions to abstract
the format of the interned vs. the externed data, so that it then
becomes *impossible* to get a proc size mismatch.
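
To make the distinction concrete, here is a small sketch of a
procedural interface over an opaque structure.  Everything in it
(struct proc_view and its functions) is invented for illustration and
is not an existing FreeBSD interface.

```c
#include <stdlib.h>

/* --- public interface: all that consumers ever see --- */
struct proc_view;                       /* layout deliberately hidden */
struct proc_view *proc_view_create(long pid, int prio);
long proc_view_pid(const struct proc_view *pv);
int  proc_view_priority(const struct proc_view *pv);
void proc_view_set_priority(struct proc_view *pv, int prio);
void proc_view_destroy(struct proc_view *pv);

/* --- private implementation: free to change between versions --- */
struct proc_view {
    int  prio;
    long pid;
};

struct proc_view *proc_view_create(long pid, int prio)
{
    struct proc_view *pv = malloc(sizeof(*pv));
    if (pv != NULL) {
        pv->pid = pid;
        pv->prio = prio;
    }
    return pv;
}

long proc_view_pid(const struct proc_view *pv)      { return pv->pid; }
int  proc_view_priority(const struct proc_view *pv) { return pv->prio; }

void proc_view_set_priority(struct proc_view *pv, int prio)
{
    pv->prio = prio;
}

void proc_view_destroy(struct proc_view *pv)
{
    free(pv);
}
```

Because callers never use sizeof or field offsets of the structure,
reordering or adding fields in the private part cannot produce a size
mismatch on their side.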

FreeBSD currently continues to have the proc size mismatch problem
because it currently insists on continuing to use data interfaces.

FreeBSD continues to use data interfaces in this area because it can
not use procedural interfaces to operate against system dumps.

FreeBSD can't use procedural interfaces to operate against system
dumps because it does not take advantage of the ELF format to add
an ELF section to the kernel image to contain the shared library
for the procedural interface to use against the kernel (with an
internal, but therefore hidden, data interface), so that there is
never a discrepancy.


 Anyway, we were talking about the use of pidfiles versus using a process
 monitor. I'm simply claiming that using a process monitor is far superior.

And I'm simply claiming that they solve different problems, and
that complaining about not having a solution to the problem that
a process monitor solves (to wit: restarting buggy programs that
should not be crashing in the first place) is OK, but complaining
that PID files don't solve the same problem is incredibly bogus.

They solve different problems, and you can't simply replace PID
files with process monitors, and continue to solve the problems
that PID files solve that process monitors don't solve.

-- Terry


Re: non-root process and PID files

2003-11-14 Thread Terry Lambert
Jos Backus wrote:
 On Thu, Nov 13, 2003 at 02:45:18AM -0800, Terry Lambert wrote:
   Why use pid files at all if you could be using a process supervisor instead?
 
  Who supervises the supervisor?
 
 Heh. The supervisor should be small and robust, like init. Has init died on
 you recently? Do you want to solve this problem or find Nirvana? If the
 latter, don't use computers.

OK.  We already have one of those.  We call it init.  8-).


  There are also the small issues of ordering (the reason you can't
  just run everything out of /etc/ttys via init in the first place),
 
 Sure. Hard to get right but not unsolvable. No reason you can't use process
 monitoring with something like rcNG.

We tried very hard to do this on the InterJet.  We still ended up
shooting most things in the head with large caliber bullets each
time the dial on demand interface went up or down because we did
not have the idea of hard and soft dependencies.  Even if we had
had them, though, we would still have been SOL, since many of the
Open Source programs we used cached information when they started.
Because of this, the data could get stale.

For example, say I ran sendmail and bound it to the external port
(or INADDR_ANY).  What is the host name that I should claim to the
remote host when I answer with the 220 greeting message?  What
should I use for the argument to the HELO or EHLO for outbound
SMTP connections so that the name I use matches the name the remote
host gets on its crosscheck for the canonical name of the machine
contacting it via a gethostbyaddr(getpeername())?

Basically, you end up with a system where you either can't cache
data, or where the cache has to be shared, or where you implement
a generic notification mechanism.  No matter how you slice it,
though, you're talking about rewriting millions of lines of code.

Caching Considered Harmful.



  multiple instances,
 
 /service/smtpd.{external,internal}

Yeah, we did this, so that we could shoot only half the processes
in the head on link up/down.

It sucked.  We almost shipped a product that wouldn't have worked
when we did the DNS split, because the dependency graph had to be
manually managed, and wasn't.


  and removing human error from adding and removing new things to be
  monitored.
 
 That's a generic problem with any type of change management.

Not really.  If your configuration changes all happened in a
centralized data repository, and nobody cached anything, but got
their information from that central repository, and the interface
to the repository was a system interface (so the system could
cache on your behalf so performance didn't degrade unbearably),
THEN you might have something.  After you rewrote millions of
lines of Open Source code to use your registry instead of working
the way it currently works, which is everyone has their own poop
files.  If you are lucky, hitting them over the head with a
shovel (SIGHUP) works, and you don't have to kill and resurrect
them (you just have to wait a long time before the services become
usable again, e.g. DNS reading its config files).

Anyway, FreeBSD has steadfastly disliked the concept of a registry,
ever since Microsoft implemented it in Windows 95; it's one of the
biggest NIH items of all time.

-- Terry


Re: non-root process and PID files

2003-11-13 Thread Terry Lambert
Jos Backus wrote:
 On Mon, Oct 27, 2003 at 10:31:18AM -0500, Dan Langille wrote:
  If a process starts up and does a setuid, should it be writing the
  PID file before or after the setuid?
 
  Two methods exist AFAIK:
 
  1 - write your PID immediately, and the file is chown root:wheel
  2 - write your PID to /var/run/myapp/myapp.pid where /var/run/myapp/
  is chown myapp:myapp
 
  Of the two, I think #1 is cleaner as it does not require another
  directory with special permissions.
 
  Any suggestions?
 
 Why use pid files at all if you could be using a process supervisor instead?

Who supervises the supervisor?  Sure, you can take the English
Bobby approach (init dies, the kernel yells "Help me, human, or
I shall yell 'Help me, human!' again", and tries to start software
that will never start over and over), but that solves nothing;
you would be amazed at the number of people who want MacOS X to
try to restart init, instead of panicking, when init can't be
started in the first place, or won't stay running if it was.

So this doesn't solve the origin of authority problem.

The problem being solved is avoiding running multiple instances
of roles... so actually, it would be better if the file were
named e.g. smtp.pid, rather than sendmail.pid, which would
step on the toes of everyone who wanted to use their program name
as part of the file name to make it harder to use someone else's
software to replace their software.
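
A minimal sketch of claiming a role through its PID file, assuming
flock(2) advisory locking is acceptable for the purpose.  The function
name claim_role() and any path passed to it are hypothetical, not
something from this thread.  The lock, rather than the mere existence of
the file, is what says the role is taken, so a stale file left behind by
a crashed process does not block the next instance.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Claim a role by locking its PID file (for example "smtp.pid").
 * Returns an open, locked fd on success, or -1 if another instance
 * already holds the role or the file cannot be opened or written.
 * The caller keeps the fd open for the life of the process.
 */
int
claim_role(const char *pidfile_path)
{
    char buf[32];
    int fd, len;

    fd = open(pidfile_path, O_RDWR | O_CREAT, 0644);
    if (fd == -1)
        return (-1);

    /* Non-blocking exclusive lock: fails if the role is taken. */
    if (flock(fd, LOCK_EX | LOCK_NB) == -1) {
        close(fd);
        return (-1);
    }

    /* Replace any stale contents with our own PID. */
    len = snprintf(buf, sizeof(buf), "%ld\n", (long)getpid());
    if (ftruncate(fd, 0) == -1 || write(fd, buf, (size_t)len) != len) {
        close(fd);
        return (-1);
    }
    return (fd);
}
```

A daemon would call this once at startup and exit if it returns -1.
Closing the descriptor (or exiting) drops the lock, which frees the role.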

There are also the small issues of ordering (the reason you can't
just run everything out of /etc/ttys via init in the first place),
multiple instances, and removing human error from adding and
removing new things to be monitored.

-- Terry


Re: kqueue, NOTE_EOF

2003-11-13 Thread Terry Lambert
Jaromir Dolecek wrote:
 marius aamodt eriksen wrote:
  in order to be able to preserve consistent semantics across poll,
  select, and kqueue (EVFILT_READ), i propose the following change: on
  EVFILT_READ, add an fflag NOTE_EOF which will return when the file
  pointer *is* at the end of the file (effectively always returning on
  EVFILT_READ, but setting the NOTE_EOF flag when it is at the end).
 
  specifically, this allows libevent[1] to behave consistently across
  underlying polling infrastructures (this has become a practical
  issue).
 
 I'm not sure I understand what is the exact issue.
 
 Why would this be necessary or what does this exactly solve? AFAIK
 poll() doesn't set any flags in this case either, so I don't
 see how this is inconsistent.
 
 BTW, shouldn't the EOF flag be cleared when the file is extended?

It solves the half-close-on-socket issue which occurs on an HTTP/1.0
request or an HTTP/1.1 non-pipelined/terminal request that occurs on
most HTTP connections as a result of the client closing their side
of the socket connection, but the server being expected to provide a
response to the request on the same socket.

You need this to be able to distinguish, after getting an EOF, if you
need to provide a response to the request, or you need to drop it,
based on additional processing you choose to do in your application.
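
The half-close condition itself can be sketched in a few lines.  This is
only an illustration of the situation being described, not of the
proposed NOTE_EOF flag: the function names are hypothetical, and a
caller would typically hand them the two ends of a local socketpair()
standing in for a real TCP connection.

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Client side: send a request, then half-close so the server sees EOF
 * while the client can still read the response.
 */
int
send_and_half_close(int sock, const char *request, size_t len)
{
    if (write(sock, request, len) != (ssize_t)len)
        return (-1);
    return (shutdown(sock, SHUT_WR));
}

/*
 * Server side: drain the request until EOF, then answer on the same
 * socket.  Returns the number of response bytes written, or -1.
 * EOF on read here means only that the peer has finished sending.
 */
ssize_t
answer_after_eof(int sock, const char *response, size_t len)
{
    char buf[256];
    ssize_t n;

    while ((n = read(sock, buf, sizeof(buf))) > 0)
        ;   /* a real server would parse the request here */
    if (n == -1)
        return (-1);

    /* n == 0: the peer half-closed; our sending direction still works. */
    return (write(sock, response, len));
}
```

The point of the sketch is that read() returning 0 does not by itself
say whether a response is still owed; the server has to decide that from
what it read before the EOF.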

Doing it this way saves you 33%, 50%, or 66% of the required system
calls to detect the edge conditions, depending on your I/O model and
when they hit.

It is useful for HTTP servers, HTTP Proxy servers, L7 load balancers,
load balancers that implement Direct Server Return for requests, and
in a number of other common cases having to do with networking (e.g.
transcoding proxies for cell phones or other users requiring it, FTP
control vs. data channels in the non-passive case, ssh/rcmd/etc., as
just a couple of select examples).

I rather expect that it would be singularly useless for socketpair(),
pipe(), named pipe (FIFO), or file operations, but that's not what we
are talking about when we talk about event libraries that deal with
input sources/sinks.

-- Terry


Re: Confused about HyperThreading and Performance

2003-11-13 Thread Terry Lambert
Daniel Ellard wrote:
 Can someone point me at some non-marketing documentation about
 hyperthreading on the latest Intel chips?  I'm seeing some strange
 performance measurements and I would like to figure out what they
 mean.

Go out to Intel's web site's developer section, and look for
SMT.  There is a lot of literature.

The reason you are seeing a performance drop is contention for
shared resources that the scheduler doesn't know are shared.

SPECmark and similar benchmarks tend to get worse numbers on
every OS when SMT is enabled, due to contention.

The only model which will work is a hierarchical affinity and
negaffinity model, and I am not aware of an OS that is not also
treating the hardware as NUMA which works this way (and none of
those run on Intel chips; mostly, they run on real NUMA hardware).

-- Terry


Re: Kylix in FreeBSD

2003-11-07 Thread Terry Lambert
Rod Person wrote:
 On Thursday 06 November 2003 09:09 am, It was written:
  If you futs with getting Kylix to run under FreeBSD, don't forget the
  special glibc requirements that some versions of Kylix have. Maybe you
  should probably simply replace the entire /compat userland with the
  userland of a distro that Kylix supports _with_ kylix extra patches
  installed?
 
 Have you tried this? Since Kylix came out I have tried to get it to run on
 FreeBSD and various Linux distros. A few days ago I got kylix to run on SuSE
 8.2 (from the kylix newsgroups this seems to be the best distro for it).
 NetBSDs Linux emulation is based on SuSE, isn't it? But, I found no postings
 related to Kylix on NetBSD. My next wondering is would NetBSD Linux emu run
 under FreeBSD and would this run kylix?

Since all new development in Kylix is apparently officially
stalled, now would be a good time to do the porting work, since
it's no longer a moving target...

http://www.linuxworld.com.au/nindex.php?id=122384005eid=-50

-- Terry


Re: FreeBSD 5.1-p10 reproducible crash with Apache2

2003-11-07 Thread Terry Lambert
Mike Silbersack wrote:
 On Wed, 5 Nov 2003, Branko F. Gračnar wrote:
  I tried today with yesterday's -CURRENT. Same symptoms. No kernel panic,
  just lockup.
 
 Ok, submit a PR with clear details on how to recreate the problem, and
 we'll see if someone can take a look into it.  I'm too busy to look at it,
 but at least putting it in a PR will ensure that it doesn't get too lost.
 Once the PR is filed, you might want to try asking on the freebsd-threads
 list; it sounds like the issue might be thread-related.
 
 (Note that your original e-mail might contain enough detail, I'm not
 certain; I just skimmed it.  Filing a good PR is important either way,
 mailing list messages get easily lost.)

Is gdb good enough in FreeBSD that you can break to the kernel
debugger with GDB enabled, and dump out the stacks for all
threads currently in the kernel for all processes?

The way to find this, if it's a threads related issue, is to do
exactly that, and then look to see if there's something like a
close in one thread of an fd being used in a blocking operation
in another thread.

-- Terry


Re: spambouncer tags much freebsd list mail as spam

2003-11-03 Thread Terry Lambert
C. Kukulies wrote:
 I installed the spambouncer.org procmail script and before I was switching
 the behaviour from SILENT to COMPLAIN I took a look at my spam.incoming folder
 and found a lot of messages from freebsd-bugs and freebsd-mobile in there.
 
 Both lists are not directed to folders prior to spambouncer coming into effect
 so they are trapped by spambouncer and I suspect that other freebsd lists
 would be trapped as well.
 
 Anyone experienced similar?

The spambouncer script makes the same incorrect assumption that
the EarthLink SPAM stuff makes, which is that any mail not explicitly
addressed to you is SPAM.  In other words, they expect mailing lists
to violate the draft RFC that prohibits header rewriting by mailing
lists, and they expect all mailing list servers to eat the overhead
of expanding each message to a single recipient message, instead of
sending the messages in bulk if the destination domain is identical.

At least the spambouncer script can be modified to respect the
Sender: header, which EarthLink fails to respect in their list of
allowed senders.  This is pretty much how you should modify the
spambouncer code to handle mailing lists (and how EarthLink should
modify their anti-SPAM system, as well, but probably won't).

-- Terry


Re: O_NOACCESS?

2003-11-01 Thread Terry Lambert
andi payn wrote:
 Now hold on. The standard (by which I assume you mean POSIX? or one of the UNIX
 standards?) doesn't say that you can't have an additional flag called
 O_NOACCESS with whatever value and meaning you want.

A strictly conforming implementation can not expose things into
the namespace that are not defined by the standard to be in the
namespace, when a manifest constant specifying a conformance level
is in scope.

This is the main reason for things like _types.h.


 Obviously, code that relies on such a flag will be non-portable, since
 no standard defines such a flag, but that's fine, since the intended
 uses (writing a FreeBSD-specific backend for fam, for example) aren't
 expected to be portable anyway.

Not just not portable, but fails to conform to standards.


 If O_NOACCESS happens to be == O_ACCMODE on FreeBSD--just as it is on
 linux--and if that happens to also be == O_WRONLY | O_RDWR (with no
 other flags set), I don't see how that changes anything.

Other than the security issues it raises, you mean, right?


[ ... ]
  In which case, your example is (O_RDWR|O_WRONLY) == O_RDWR.  The
  standard does not indicate whether the implementation is to use
  bits, or sequential manifest constants, only that the bits that
  make up the constants be in the range covered by O_ACCMODE.
 
 First, again, this is intended to be used for non-portable code, and
 therefore, the fact that this happens not to be true on FreeBSD means
 it's irrelevant that it could be true elsewhere. Especially since, if
 O_NOACCESS were added to FreeBSD, it would still fail to exist entirely
 on other platforms, which means it matters little what value it might
 have if it did exist--code written to use O_NOACCESS won't compile on
 platforms without O_NOACCESS.

You need to look at the implementation of VOP_OPEN in FreeBSD;
specifically, you need to look at the fact that fp->f_flags is
passed as one of its parameters, and that the FS is permitted
to interpret these flags in an FS-dependent fashion.

Then you need to look at the fact that FreeBSD supports loadable FS
types, and that there are third party FS's that proxy operations
over the network, which can include a network version of the
flags, and conversion back and forth could therefore end up
being ambiguous.

Really, it needs to be a bitmap internally in FreeBSD, as well,
but that's a big step.


 Second, any platform that defines O_NOACCESS could do so differently. On
 FreeBSD, as on linux, the most sensible definition is O_NOACCESS ==
 O_WRONLY | O_RDWR == 3. Or a platform that defined O_RDONLY as 1 and
 O_WRONLY as 2, the most sensible definition would be O_NOACCESS == 0.

I pray this flag never gets adopted outside of Linux...


  In fact, externally, they are bits, but internally, in the kernel,
  they are manifest constants.
 
 Yes, FFLAGS and OFLAGS convert between the two. If you look at how this
 works in the linux kernel, you'll see that O_RDONLY (0) converts to
 FREAD (1); O_WRONLY (1) to FWRITE (2); O_RDWR (2) to FREAD | FWRITE (3);
 and O_NOACCESS (3) to 0. This could be done the same way in FreeBSD.*
 
 * Actually, this is a tiny lie; linux has a 2-bit internal access flags
 value which it derives in this way, and uses the original passed-in
 flags for everything except access. FreeBSD instead just adds 1, relying
 on the fact that the lower 2 bits will never be 3, and therefore all of
 the other bits will stay the same. This means that enabling this value
 would make the FFLAGS and OFLAGS macros slightly more complicated on
 FreeBSD.

It would be more useful to intern them as a bitmap, IMO, and get rid
of the conversion.  The problem is compatibility with historical
source code passing literal constants instead of manifest values.


  The most useful thing you could do with this, IMO, is open a directory
  for fchdir().
 
 Except that you can already do exactly this with chdir(). But I can see
 that you might at some point want to check the directory before
 chdir'ing to it, or pass an fd down into some function instead of a
 string, and this would be useful in such a case.

Or deal with issues of privilege granted merely by open.  For
example, on FreeBSD, an implementation of this would permit any
normal user to do INB/OUTB to any I/O port on any hardware on the
machine.

This is a can of worms.


  Of course, allowing this on directories for which you
  are normally denied read/write permissions would be a neat way to
  escape from chroot'ed environments and compromise a host system...
 
 How would it allow that? If you can open files outside your chroot
 environment--even files you would otherwise have read access to--it's
 not much of a chroot!

Mounted procfs within a chrooted environment.  Admittedly, FreeBSD
is moving away from procfs, but on Linux, it's a serious issue,
since such basic utilities as ps and so on won't work without it.


   Having O_NOACCESS would be useful for the fam port, for porting pieces
   of lilo, and probably for other things I 

Re: O_NOACCESS?

2003-11-01 Thread Terry Lambert
M. Warner Losh wrote:
 Rewind units on tape drives?  If there's no access check done, and I
 open the rewind unit as joe-smoe?  The close code is what does the
 rewind, and you don't have enough knowledge to know if the tape was
 opened r/w there.

Which brings up the idea of passing fp->fd_flags to VOP_CLOSE()...

-- Terry


Re: O_NOACCESS?

2003-10-31 Thread Terry Lambert
andi payn wrote:
 As far as I can tell, FreeBSD doesn't have anything equivalent to
 linux's O_NOACCESS (which is not in any of the standard headers, but
 it's equal to O_WRONLY | O_RDWR, or O_ACCMODE). In linux, this can be
 used to say, give me an fd for this file, but don't try to open it for
 reading or writing or anything else.

The standard does not permit this.

First off, O_ACCMODE is a bitmask, and is guaranteed to be
inclusive of the bits for O_RDONLY, O_WRONLY, and O_RDWR, but
*not* guaranteed to not be inclusive of additional bits,
reserved or locally defined but outside the _POSIX_SOURCE
namespace.  By using this value as a parameter, you could very
well be setting many more bits, and you could be setting bits
for local implementation options that you really, really do
not want to set.

Second, the standard is ambiguous as to how O_RDWR is defined;
it is perfectly permissible to define these values as:

#define O_RDONLY        1                       /* the read bit */
#define O_WRONLY        2                       /* the write bit */
#define O_RDWR          (O_RDONLY|O_WRONLY)     /* read + write */

In which case, your example is (O_RDWR|O_WRONLY) == O_RDWR.  The
standard does not indicate whether the implementation is to use
bits, or sequential manifest constants, only that the bits that
make up the constants be in the range covered by O_ACCMODE.

In fact, externally, they are bits, but internally, in the kernel,
they are manifest constants.
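
The difference between the two encodings can be checked with a small
sketch.  The BIT_ and SEQ_ names are hypothetical stand-ins chosen so as
not to clash with the real <fcntl.h> constants; the SEQ_ values follow
the traditional 0/1/2 numbering, with 3 as the access-mode mask.

```c
/* Hypothetical bit-style encoding, as in the definitions above. */
enum { BIT_RDONLY = 1, BIT_WRONLY = 2, BIT_RDWR = BIT_RDONLY | BIT_WRONLY };

/* Sequential encoding: consecutive values, plus the mask covering them. */
enum { SEQ_RDONLY = 0, SEQ_WRONLY = 1, SEQ_RDWR = 2, SEQ_ACCMODE = 3 };

/* What "O_WRONLY | O_RDWR" evaluates to under the bit-style encoding. */
int
noaccess_bit_style(void)
{
    return (BIT_WRONLY | BIT_RDWR);
}

/* What "O_WRONLY | O_RDWR" evaluates to under the sequential encoding. */
int
noaccess_sequential(void)
{
    return (SEQ_WRONLY | SEQ_RDWR);
}
```

Under the bit-style encoding the expression is indistinguishable from
plain read/write; only under the sequential encoding does it produce a
fourth value (equal to the mask) that could be given its own meaning.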


 This allows you to get an fd to pass to fcntl (e.g., for dnotify), or
 call ioctl's on, etc.--even if you don't have either read or write
 access to the file. The obvious question is, Why should this ever be
 allowed? Well, if you can stat the file, why can't you, e.g., ask
 kevent to monitor it?

The most useful thing you could do with this, IMO, is open a directory
for fchdir().  Of course, allowing this on directories for which you
are normally denied read/write permissions would be a neat way to
escape from chroot'ed environments and compromise a host system...


 In FreeBSD, this doesn't work; you just get EINVAL.
 
 Having O_NOACCESS would be useful for the fam port, for porting pieces
 of lilo, and probably for other things I haven't thought of yet. (I
 believe that either this was added to linux to support lilo, or the open
 syscall just happened to work this way, and once the lilo developers
 discovered this and took advantage of it, it's been retained that way
 ever since to keep lilo working.)

The latter is most likely.  In any case, this would not be allowed
by GEOM for the purpose to which LILO is trying to put it, unless
you were to modify GEOM to add a control path for parents of
already opened devices.  If you did this, you might as well just
add a proper set of abstract fcntl's to GEOM, and get rid of all
the raw disk crap in user space, and unbreak disklabel and the other
stuff that GEOM broke when it went in.


 On the other hand, BSD has done without it for many years, and there's
 probably a good reason it's never been added. So, what is that good
 reason?

fcntl.h:
#define FFLAGS(oflags)  ((oflags) + 1)


 I don't think there's a backwards-compatibility issue.

Unfortunately, yes, there is.  The values are not bits, internally
to the kernel.  The conversion to internal form merely adds 1, it
doesn't shift the values.
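
A small sketch of why the add-one conversion leaves no room for a fourth
access value.  The DEMO_ constants are copies made for illustration;
their values are believed to match FreeBSD's <fcntl.h> but should be
treated as assumptions and checked against the header.

```c
/* Values copied under demo names so the sketch is self-contained. */
#define DEMO_O_RDONLY   0x0000
#define DEMO_O_WRONLY   0x0001
#define DEMO_O_RDWR     0x0002
#define DEMO_O_ACCMODE  0x0003
#define DEMO_O_NONBLOCK 0x0004  /* first flag above the access bits */

#define DEMO_FREAD      0x0001
#define DEMO_FWRITE     0x0002

/* Same shape as the fcntl.h macro quoted above. */
#define DEMO_FFLAGS(oflags) ((oflags) + 1)

/* Convert open(2)-style flags to the kernel-internal form. */
int
demo_fflags(int oflags)
{
    return (DEMO_FFLAGS(oflags));
}
```

With these values, 0, 1 and 2 map to FREAD, FWRITE and FREAD|FWRITE, but
3 maps to 4: the carry clears both access bits and lands on the bit
used for non-blocking I/O, so the result is not "no access" but a
different flag entirely.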


-- Terry


Re: kevent and related stuff

2003-10-31 Thread Terry Lambert
andi payn wrote:
 First, let me mention that I'm not nearly as experienced coding for *BSD
 as for linux, so I may ask some stupid questions.
 
 I've been looking at the fam port, and this has brought up a whole slew
 of questions. I'm not sure if all of them are appropriate to this list,
 but I don't know who else to ask, so
 
 First, some background: On Irix and Linux, fam works by asking the
 kernel to send it a signal whenever the specified accesses occur. On
 FreeBSD, since there is no imon interface and no dnotify fcntl, it
 instead works by periodically stating all of the files it's
 watching--which is obviously not as good. The fam FAQ suggests that
 FreeBSD users should adapt fam to use the kevent interface.

Yes.  The file access monitor tool is the classic argument.
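
For reference, the stat-polling fallback described in the question can
be sketched as below.  This is not taken from the fam sources; the
function name and the choice of which fields to compare are assumptions
made for illustration.

```c
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Compare the file's current metadata against a saved snapshot and
 * refresh the snapshot.  Returns 1 if it changed, 0 if not, and -1 if
 * the file could not be examined (the snapshot is then left alone).
 */
int
poll_file_changed(const char *path, struct stat *snapshot)
{
    struct stat now;
    int changed;

    if (stat(path, &now) == -1)
        return (-1);

    changed = now.st_mtime != snapshot->st_mtime ||
        now.st_size != snapshot->st_size ||
        now.st_ino != snapshot->st_ino;
    *snapshot = now;
    return (changed);
}
```

Every watched file costs one stat() per polling interval whether or not
anything happened, and changes between two polls collapse into a single
report, which is why an event interface is attractive here.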


 I looked into kevent, and it seems like there are a number of problems
 that lead me to suspect that this is a really stupid idea. And yet, I'd
 assume that someone on the fam team at SGI and/or one of the outside fam
 developers would know FreeBSD at least as well as me. Therefore, I'm
 guessing I'm missing something here. So, any ideas anyone can offer
 would be very helpful.
 
 So, here's the questions I have:
 
 * I think (but I'm not sure) that kevent doesn't notify at all if the
 only change to a file is its ATIME. If I'm right, this makes kevent
 completely useless for fam. Adding a NOTE_ACCESS or something similar
 would fix this. Since I'm pretty new to FreeBSD: What process do I go
 through to figure out whether anyone else wants this, whether the
 interface I've come up with is acceptable, etc.? And, once I write the
 code, do I submit it as a pr?

You add it, submit it as a PR, if send-pr will work from your
machine properly, discuss it on the lists, and if someone with
a commit bit has the time and likes the idea, it will be committed.


 * The kevent mechanism doesn't monitor directories in a sufficient way
 to make fam happy. If you change a file in a directory that you're
 watching, unlike imon or dnotify, kevent won't see anything worth
 reporting at all. This means that for directory monitoring, kevent is
 useless as-is. Again, if I wanted to patch kevent to provide this
 additional notification, would others want this?

I'm not sure that this is correct, unless you are talking about
monitoring all files in a directory by merely monitoring the
directory.  If you make a modification to the file metadata (e.g.
add a link or rename it), then you will be notified that the
directory has changed.

The argument against subhierarchy monitoring is that it will, by
definition, stop at the directory level, and it can not be
successfully implemented for all FS types.


 * When two equivalent events appear in the queue, kevent aggregates
 them. This means that if there are two updates to a file since the last
 time you checked, you'll only see the most recent one. For some uses of
 fam (keeping a folder window up to date), this is what you want; for
 others (keeping track of how often a file is read), this is useless. The
 only solution I can think of is to add an additional flag, or some other
 way to specify that you want duplicated events.

This is the classic edge triggered vs. level triggered argument
that Linux people bring up every time someone suggests they implement
kqueue in Linux.

This is easily fixable: you separate the flag from the data, adding
an additional argument to KNOTE().  This also has the side effect
of removing the restriction on the PID size, which is imposed by
the limited number of bits left over for representing the PID.

This is a trivial change, and I've done it several times.

The way this works is that you establish, via definition of the
udata argument, a contract between the kernel and the user space
over what the udata means.  The additional argument to KNOTE can
then be used by the per event note handling code in the kernel
to fill out a udata structure with as much data as you want to
give it, and to identify the place in user space to copy it out
to.

For example, you could set up an accept filter to accept up to 10
connections at a time, and return the fd's into the user space
structure's int [10] array and fill out the int count value with
how many were returned.

For your case, you could use it to copy out each and every event
instance, rather than aggregating the events.


 * Unlike imon and dnotify, kevent doesn't provide any kind of callback
 mechanism; instead, you have to poll the queue for events. Would it be
 useful to specify another flag/parameter that would tell the kernel to
 signal the monitoring process whenever an event is available? (It would
 certainly make the fam code easier to write--but if it's not useful
 anywhere else, that's probably not enough.)

You can SIGPOLL on the event descriptor returned by kqueue().  You
can use it in a select() or poll() call.  You can pass it to another
kqueue() as an EVFILT_READ event.

Sending signals (callbacks) is probably the 

Re: non-root process and PID files

2003-10-31 Thread Terry Lambert
Nielsen wrote:
 Christopher Vance wrote:
  May I suggest a different feature: the ability to mark an open file
  (not just its fd) 'remove on close', with permission checked at mark
  time rather than close time (this status forgotten if not permitted
  when set) and the unlink actually done at close time only if the file
  has exactly one link and one open file instance at that time.
 
 WinNT (2K etc...) has this capability. Not saying that this makes it a
 good idea though.

In all Windows supported FS's, there is no separation between the
inode and the directory entry referencing the inode, so you'd expect
them to be able to do this from an open file reference, since they
can always get the directory entry back.

In response to another post: saving the path at startup time is no
good, since if someone moves the file (if it's open) or removes it
preemptively (and it's closed), then there's a race window in
which another instance of the program may start, and the program
exiting removes the wrong file (or gets a permission denied error).

All this hacking to solve the problem of harmless old files lying
around in /var/run is fraught with peril...

-- Terry


Re: freeing data segment

2003-10-30 Thread Terry Lambert
Vinod R. Kashyap wrote:
 I have this huge data structure in the data segment of my scsi driver.  This
 data structure is initialized at driver build time, and is used only during
 driver
 initialization.  I am trying to find out if I can free-up the memory it
 occupies,
 once I am done with the driver initialization.  Does anyone know how to do
 this?

You can either implement it in a separate ELF section that's marked
init only (this is a defined ELF attribute), and then fix FreeBSD
to honor discarding of such sections (FreeBSD doesn't implement
very much of the capabilities of ELF), or...

You can make two drivers, make the init driver depend on the other
driver, load the init driver, have its init routine call an entry
point in the other driver to give it a callback into itself, do the
callback to do the actual initialization, and then unload the init
driver.

Convoluted, but it works (I used it for a firmware download in a
GigE driver at one point).
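
For the first option, placing the data in its own section might look
like the sketch below.  It assumes a GCC-compatible compiler; the
section name "demo_init_data" and the other identifiers are
hypothetical.  Nothing in this fragment causes the section to be
discarded: as noted above, the loader would have to be taught to do
that.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Place init-only data in its own named section.  "demo_init_data" is
 * a hypothetical section name; nothing here makes a loader discard it.
 */
#define DEMO_INITDATA   __attribute__((section("demo_init_data")))

static const uint8_t demo_firmware[] DEMO_INITDATA = {
    0x01, 0x02, 0x03, 0x04
};

/* Stand-in for driver initialization: walk the table exactly once. */
uint32_t
demo_driver_init(void)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < sizeof(demo_firmware); i++)
        sum += demo_firmware[i];
    return (sum);
}
```

Grouping the data this way at least keeps it contiguous, so that a
loader which did honor such a section could release it in one piece
once initialization is complete.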

-- Terry


Re: non-root process and PID files

2003-10-30 Thread Terry Lambert
Christopher Vance wrote:
 You can already mark a fd 'close on exec'.
 
 May I suggest a different feature: the ability to mark an open file
 (not just its fd) 'remove on close', with permission checked at mark
 time rather than close time (this status forgotten if not permitted
 when set) and the unlink actually done at close time only if the file
 has exactly one link and one open file instance at that time.

If all you have is an fd, you can not get from an fd to a path
without an exhaustive search of the disk, in most FS's.

Also, leaving the path present permits someone to hard-link it
to some other file, to make it stay around.  Since /var has a
/var/tmp, this would be a real danger, I think.

-- Terry


Re: Fine grained locking (was: FreeBSD mail list etiquitte)

2003-10-28 Thread Terry Lambert
Robert Watson wrote:
 On Sat, 25 Oct 2003, Matthew Dillon wrote:
  It's a lot easier lockup path than the direction 5.x is going, and
  a whole lot more maintainable IMHO because most of the coding doesn't
  have to worry about mutexes or LORs or anything like that.
 
 You still have to be pretty careful, though, with relying on implicit
 synchronization, because while it works well deep in a subsystem, it can
 break down on subsystem boundaries.  One of the challenges I've been
 bumping into recently when working with Darwin has been the split between
 their Giant kernel lock, and their network lock.  To give a high level
 summary of the architecture, basically they have two Funnels, which behave
 similarly to the Giant lock in -STABLE/-CURRENT: when you block, the lock
 is released, allowing other threads to enter the kernel, and regained when
 the thread starts to execute again. They then have fine-grained locking
 for the Mach-derived components, such as memory allocation, VM, et al.
 
 Deep in a particular subsystem -- say, the network stack, all works fine.
 The problem is at the boundaries, where structures are shared between
 multiple compartments.  I.e., process credentials are referenced by both
 halves  of the Darwin BSD kernel code, and are insufficiently protected
 in the current implementation (they have a write lock, but no read lock,
 so it looks like it should be possible to get stale references with
 pointers accessed in a read form under two different locks). Similarly,
 there's the potential for serious problems at the surprisingly frequently
 occurring boundaries between the network subsystem and remainder of the
 kernel: file descriptor related code, fifos, BPF, et al.  By making use of
 two large subsystem locks, they do simplify locking inside the subsystem,
 but it's based on a web of implicit assumptions and boundary
 synchronization that carries most of the risks of explicit locking.

Are you maybe looking at the Jaguar (10.2) code base here?

I personally fixed a number of these types of issues.  Mostly,
these types of issues come down to the fact that funnels, like
the BGL in FreeBSD, are actually code locks, not data locks.

FWIW, it looks like the 10.3 (Panther) code is available from the
Darwin (7.0) project now:

http://developer.apple.com/darwin/

Mixing code and data locks is where most problems happen; it's one
of the reasons that the BGL and the push-down to fine grain locking
in FreeBSD is, I think, the wrong way to go.  The destination is
right, but the path is probably wrong, and things will get much
worse with the FreeBSD approach before they get any better.  Part
of the problem is FreeBSD's mutex implementation is complicated,
and permits recursion (thereby tacitly encouraging recursion).  I
personally think that should result in a panic -- though you could
maybe make an argument for it on the basis of pool mutexes, where
the same mutex pool is used for all your locks.  But FreeBSD doesn't
really do that effectively, so it's kind of a lose/lose situation.

It's really, really hard to put multiple code locks into a kernel,
and get things right.  That's the problem with the FreeBSD approach:
having a BGL in the first place implicitly turns all your data
locks that have to coexist with the BGL code lock into code locks
as well.  I think the right approach would have been to start with
a single pool mutex implementation, allowing recursion, and a macro
wrapper for grabbing/releasing locks so that the locks could be
substituted for per-data item locks instead of pool locks, and the
per-data item locks would *disallow* recursion.  A smart approach to
that would be to have a structure prefix on all kernel structures,
such that you could abuse a cast and use a single mutex implementation
for all data items.
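
A minimal sketch of the structure-prefix idea, assuming C11 atomics in
place of a real kernel mutex.  Every name here (lk_prefix, LK_WORD,
lk_tryenter, cred_like, LK_USE_POOL) is hypothetical and not taken from
FreeBSD or Darwin.  The point is only that one macro decides whether an
object is protected by a shared pool lock or by its own lock, so the
callers do not change when the substitution is made.  Recursion counting
for the pool variant is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical prefix that every lockable structure starts with. */
struct lk_prefix {
    atomic_flag lk_busy;            /* per-item lock word */
};
#define LK_PREFIX_INIT { ATOMIC_FLAG_INIT }

/* Shared pool used while code is still being converted. */
#define LK_POOL_SIZE 4
atomic_flag lk_pool[LK_POOL_SIZE] = {
    ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT
};

/* The one place that decides which lock protects an object. */
#ifdef LK_USE_POOL
#define LK_WORD(obj) (&lk_pool[((uintptr_t)(obj) >> 4) % LK_POOL_SIZE])
#else
#define LK_WORD(obj) (&((struct lk_prefix *)(obj))->lk_busy)
#endif

/* Try to take the lock; false means it is already held. */
bool lk_tryenter(void *obj)
{
    assert(obj != NULL);
    return !atomic_flag_test_and_set(LK_WORD(obj));
}

/* Release the lock taken by lk_tryenter(). */
void lk_exit(void *obj)
{
    assert(obj != NULL);
    atomic_flag_clear(LK_WORD(obj));
}

/* Example structure; the prefix must be the first member so that the
 * cast in LK_WORD() is valid. */
struct cred_like {
    struct lk_prefix lk;
    int              uid;
};
```

In per-item mode a second lk_tryenter() on the same object fails, which
is where a kernel could panic instead of silently recursing.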


 It's also worth noting that there have been some serious bugs associated
 with a lack of explicit synchronization in the non-concurrent kernel model
 used in RELENG_4 (and a host of other early UNIX systems relying on a
 single kernel lock).  These have to do with unexpected blocking deep in a
 function call stack, where it's not anticipated by a developer writing
 source code higher in the stack, resulting in race conditions.  In the
 past, there have been a number of exploitable security vulnerabilities due
 to races opened up in low memory conditions, during paging, etc.

I think this is a general problem.  I noted early on that FreeBSD
was going to have issues in this regard once there was real kernel
threading support.  The KSE model is much less prone to triggering
the race conditions, but any model where the same process can be in
the kernel multiple times is problematic.  Particularly since the
BSD code doesn't really support the concept of cancellation of a
blocking system call in progress (you don't have a means of doing
the wakeup to fail the tsleep's with a recognizable error code that
could mean cancellation to back out state).  It's 

Re: non-root process and PID files

2003-10-28 Thread Terry Lambert
Leo Bicknell wrote:
 Dan Langille wrote:
  Any suggestions?
 
 Here's a slightly backwards concept.
 
 We're all familiar with how you can open a file, remove it from the
 directory, and not have it go away until the application closes
 it.  Well, extend those semantics to the namespace.
 
 That is, have a directory where any name that does not exist can be
 opened RW, any name that does exist can be opened RO.  A file is
 automatically removed when no one has an open descriptor to it anymore.


This is a somewhat neat idea.  However, it would open a pretty
big race window, and you could denial-of-service a server by
creating a PID file belonging to some server, and leaving it
there with a bogus PID in it, and anything that was watching
the file R/O to kill -0 it to check if the process needs to be
restarted would always think the process needs to be restarted.
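
For reference, the "kill -0" liveness check a watcher would run against
the PID read from such a file can be sketched as below.  It assumes a
POSIX system; pid_alive is a hypothetical helper name.  Signal 0
delivers nothing and only performs the existence and permission checks,
so a bogus PID planted in the file makes this return false every time.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* True if a process with this PID currently exists. */
bool pid_alive(pid_t pid)
{
    if (pid <= 0)
        return false;       /* 0 and negatives address process groups */
    if (kill(pid, 0) == 0)
        return true;        /* exists, and we may signal it */
    /* With signal 0 and pid > 0 only these two errors are expected. */
    assert(errno == EPERM || errno == ESRCH);
    return errno == EPERM;  /* exists, but belongs to someone else */
}
```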

8-).

Basically, all your processes would end up needing to be SUID
root, at least initially, which would mean breaking most mail
server software.  They'd need that so that you could deny any
create except by root to keep ordinary users from DOS'ing a
daemon.

-- Terry
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Some mmap observations compared to Linux 2.6/OpenBSD

2003-10-25 Thread Terry Lambert
Ted Unangst wrote:
 On Fri, 24 Oct 2003, Michel TALON wrote:
  What is more interesting is to look at the actual benchmark results in
  http://bulk.fefe.de/scalability/
  in particular the section about mmap benchmarks, the only one where
  OpenBSD shines. However as soon as touching pages is benchmarked
  OpenBSD fails very much.
 
 look closer.  openbsd's touch page times are identical to what you'd
 expect a disk access to be.  the pages aren't cached, they're read from
 disk.  so compared to systems that don't read from disk, it looks pretty
 bad.  a 5 line patch to fix the benchmark so that the file actually is
 cached on openbsd results in performance much in line with freebsd/linux.

Why does the benchmark need to be fixed for OpenBSD and not
for any other platform?

My point here is that a benchmark measures what it measures, and
if you don't like what it measures, making it measure something
else is not a fix for the problem that it was originally intended
to show.

Microbenchmarks are pretty dumb, in general, because they are not
representative of real world performance on a given fixed load,
and are totally useless for predicting performance under a mixed
load.

That said, if this microbenchmark bothers you, fix OpenBSD.

I know that Linux has some very good scores on LMBench, and that
optimizing the code to produce good numbers on that test set has
pessimized performance in a number of areas under real world
conditions.

Unless there's an obvious win to be had without additional cost,
it's best to take the numbers with a grain of salt.

THAT said, it's probably a good idea for the other BSD's to use
the red/black code from OpenBSD as a guide for their own code.

-- Terry


Re: FreeBSD mail list etiquette

2003-10-25 Thread Terry Lambert
John-Mark Gurney wrote:
 Wes Peters wrote this message on Thu, Oct 23, 2003 at 01:43 -0700:
  Kip Macy, other DragonFlyBSD developers, and anyone else wishing to
  contribute are invited to join and participate in the open FreeBSD mail
  lists, sharing code, design information, research and test results, etc.
  according to their own will.  We welcome input from everyone, including
  constructive criticism of weaknesses or flaws in FreeBSD.
 
 And patches (against FreeBSD) are highly encouraged.  It rarely helps
 to simply point out flaws (or showing how X OS runs soo much better than
 FreeBSD, why are you guys even running FreeBSD?) w/o showing code to fix it.
 
 Note: I am not speaking as an official representative of FreeBSD, just as
 a developer who has too little time to try to code up a patch for code I
 haven't seen.  And considering that DragonFlyBSD is based upon FreeBSD
 coming up with said patches should be trivial.


First off, I really appreciate the mmap() discussion which has
taken place.  Someone has done a lot of work to create benchmarks,
which, while being microbenchmarks, are a hell of a lot more
useful than most of their kind.

Further, they've pointed out where to get code to get comparable
results in FreeBSD, licensed under a two clause BSD license, which
means the only issue facing anyone is one of trivial integration.

Second, Kip Macy and Matt Dillon have done some excellent work on
the checkpointing code.  It's basically ELF-based, and requires
only small changes to the exec to set up the process for being
able to be checkpointed and restarted.

Again, the license is a two clause BSD license, and again, the only
work necessary to get this over to FreeBSD is integration.


When someone offers you a gift, you don't jump down their throat
with jack-boots on, complaining about how the gift is wrapped or
what color it is; you shut the hell up about any complaints and
say "Thank you."

If the wrapping bothers you, well, you're going to remove it anyway.

If the color bothers you, wait until they leave and paint the damn
thing.  If they come for a visit, they will be much more likely to
be happy that you put it on display on the mantle than unhappy that
you changed its color.


Frankly, FreeBSD has too many cooks, and not enough bottle washers;
this is a euphemism for saying that all anyone with a commit bit
seems to want to do any more is write new code, and no one is
willing to take on the integration and maintenance tasks.  In Linux,
this work is done by Linus, Alan Cox, and a couple of other people.
People get commit bits so that they can do integration, and so that
patches don't sit in bug databases for 6 years unintegrated.

The problem with this imbalance, is that you seem to be unwilling
to hire bottle washers, and people willing to wash bottles when
there are no clean bottles left are never given any respect, and
certainly not the level of respect accorded to cooks.

You guys need to get your heads out, and give out some commit bits
to some people willing to do the dirty work of integration of the
code people are donating, and of closing out bug database entries
where code is provided, and writing code that demonstrates the bug
database problem and coming up with a fix and integrating *that*,
where patches aren't provided.

-- Terry


Re: Benchmarking kqueue() performance?

2003-10-17 Thread Terry Lambert
Lev Walkin wrote:
 One of the most comprehensive sites about that problem is:
 
 http://www.kegel.com/c10k.html

That's about scaling to a large number of connections, not about
kqueue() vs. select performance.

The biggest problem with a large number of connections, at least
as far as FreeBSD is concerned, is the TCP timer implementation
using a callout wheel, since any expiring timer has to traverse
every bucket in the chain, instead of stopping at the first one
that's unexpired (see the BSD 4.2/4.3 timers for an example of
the right way to do it).
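
A minimal sketch of the "stop at the first unexpired entry" behaviour,
assuming the older timers referred to above are a sorted delta list
(each entry stores only the ticks remaining after its predecessor).
The names callout_insert and callout_tick are hypothetical, and this is
not the actual 4.2/4.3 BSD code.

```c
#include <assert.h>
#include <stddef.h>

struct callout {
    struct callout *next;
    int             delta;  /* ticks after the previous entry fires */
    int             fired;  /* set once the timer has expired */
};

/* Insert a timer due in `ticks` ticks, keeping the list sorted. */
void callout_insert(struct callout **head, struct callout *c, int ticks)
{
    assert(ticks > 0);
    struct callout **pp = head;

    /* Walk past everything that expires no later than the new entry. */
    while (*pp != NULL && (*pp)->delta <= ticks) {
        ticks -= (*pp)->delta;
        pp = &(*pp)->next;
    }
    c->delta = ticks;
    c->fired = 0;
    c->next = *pp;
    if (c->next != NULL)
        c->next->delta -= ticks;    /* successor is now relative to us */
    *pp = c;
}

/* One clock tick.  Only the head is aged; expired entries are popped
 * and the walk stops at the first one that is still pending.
 * Returns the number of entries examined. */
int callout_tick(struct callout **head)
{
    int examined = 0;

    if (*head == NULL)
        return 0;
    (*head)->delta--;
    while (*head != NULL) {
        examined++;
        if ((*head)->delta > 0)
            break;                  /* first unexpired entry: stop */
        (*head)->fired = 1;
        *head = (*head)->next;
    }
    return examined;
}
```

The cost of a tick is proportional to the number of timers that
actually expire, plus one, regardless of how many are pending.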

FWIW: I've had a FreeBSD box with a static page server on it up
to 1.6M simultaneous connections with very little work, so 10K
is pretty trivial in comparison.

For doing real work, and giving 1G to a server process and 512M
to caching, this number drops to ~250K connections, but that's
still 25 times what he claims is some insurmountable barrier.

BTW, the company for which I did this work is still shipping
real product that handles those loads on a FreeBSD box, FWIW.

-- Terry


Re: GEOM Gate.

2003-10-15 Thread Terry Lambert
Wilko Bulte wrote:
 On Tue, Oct 14, 2003 at 10:44:14PM +0200, Oldach, Helge wrote:
  From: Richard Tobin [mailto:[EMAIL PROTECTED]
Ok, GEOM Gate is ready for testing.
For those who don't know what it is, they can read README:
  
   Aaargh!  It's the return of nd(4) from SunOS.
 
  Excuse me?
 
  # uname -a
  SunOS galaxy 4.1.4 18 sun4m
 
 Too new..

Yeah... Think Sun2 systems

http://www.netbsd.org/Documentation/network/netboot/nd.html

-- Terry


Re: Dynamic reads without locking.

2003-10-09 Thread Terry Lambert
Harti Brandt wrote:
 You need to lock when reading if you insist on consistent data. Even a
 simple read may be non-atomic (this should be the case for 64bit
 operations on all our platforms). So you need to do
 
 mtx_lock(foo_mtx);
 bar = foo;
 mtx_unlock(foo_mtx);
 
 if foo is a datatype that is not guaranteed to be read atomically. For
 8-bit data you should be safe without the lock on any architecture. I'm
 not sure for 16 and 32 bit, but for 64-bit you need the lock for all
 our architectures, I think.
 
 If you don't care about occasionally reading false data (for statistics or
 such stuff) you can go without the lock.

For a read-before-write type read, this is always true.

For an interlock read, this is also true, if you intend to use
the data you read as a timing-dependent one-shot (e.g. for a
condition variable, etc.).

For certain uses, however, it's safe to not lock before the
read *on Intel architectures*.  This can go out the window on
SPARC or PPC, or any architecture where there is no guarantee
that there won't be speculative execution or out-of-order
execution without explicit synchronization.   For instance,
on a PPC, there are two instructions that are used to implement
a mutex, isync and dsync, one which is used to check it,
and the other which is used to take it.  Using these, it's a
lot less expensive to implement a mutex, and you are guaranteed
that after you take a mutex, there's no speculative execution
coherency issues left outstanding.
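
In portable code today, one way to avoid hand-picking barrier
instructions per architecture is C11 <stdatomic.h>.  This is a
swapped-in technique, not what the FreeBSD mutex code of the time did;
stat64, stat64_set and stat64_get are hypothetical names.  The sketch
publishes a 64-bit value so that a reader never sees a torn value, with
release/acquire ordering standing in for the explicit synchronization
discussed above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 64-bit value shared between CPUs. */
struct stat64 {
    _Atomic uint64_t value;
};

/* Writer side: the store is indivisible, and earlier writes by this
 * thread are visible to anyone who observes the new value. */
void stat64_set(struct stat64 *s, uint64_t v)
{
    assert(s != NULL);
    atomic_store_explicit(&s->value, v, memory_order_release);
}

/* Reader side: no lock is taken, and the result is never a mix of
 * two different stores. */
uint64_t stat64_get(struct stat64 *s)
{
    assert(s != NULL);
    return atomic_load_explicit(&s->value, memory_order_acquire);
}
```

On targets without a native 64-bit atomic load the compiler may fall
back to a hidden lock, which is the case the quoted text warns about.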

For things with an explicit periodicity, with no explicit
guarantee of order of operation, reading without a lock will
do no damage.

For example, consider the case of a push-model migration of
processes from one CPU to another, and a (in the general case)
lockless implementation of a per CPU scheduler queue with
explicit migration requirements -- this avoids the FreeBSD issue
of having to take a lock each time you enter the scheduler,
and avoids the IPI that results, as well as implying implicit
CPU affinity; here's how it works:

Each CPU has 3 values associated with it:

    int load
    ptr migration queue
    ptr run queue

When it's time to schedule a process on a given CPU, here is the
order of operation:

1)  Compare migration queue ptr to NULL; this is a
    lockless operation.

2)  If the ptr is non-NULL,
    a)  take a lock on the migration queue.
    b)  move items on it to the per CPU run queue
        -- LIFO; writing the per CPU run queue is
        a lockless operation, since no other CPU
        is permitted access to it (no pull of
        work for migration).
    c)  NULL the migration queue head of the now
        empty migration queue.
    d)  drop the migration queue lock.

3)  Take the entry off the top of my run queue; this is
    what I'm going to run next.

4)  Compare my 'load' value against that of other CPUs;
    since a CPU is the only one permitted to write its
    own load value, this is a lockless operation for the
    read of all other CPU's 'load' value, so long as that
    value is accessible in a single bus/memory cycle.

5)  If my load is significantly higher than the lowest of
    all other CPU's, consider migrating work to the lowest
    of all other CPU's:
    a)  Locate any candidates for migration; this
        is the current top of my own run queue *after*
        the removal of the next thing I'm going to run,
        *without* the don't migrate bit set.
    b)  If a candidate exists,
        i)  Remove the item from the local run
            queue (lockless).
        ii) Take the lock on the target CPU
            migration queue.
        iii) Put the item on the target migration
            queue.
        iv) Drop the migration queue lock.

6)  Recalculate my local 'load' value for the benefit of
    other CPUs.

7)  Start executing the item obtained in step #3.
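
The steps above can be sketched roughly as follows.  This is an
illustration under assumptions, not scheduler code from any BSD: C11
atomics stand in for the hardware guarantees, 'load' is taken to be the
run queue length, only the top of the run queue is considered as a
migration candidate, and all names (cpu_schedule, runq_push, mig_enter,
slack, and so on) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct task {
    struct task *next;
    bool         no_migrate;        /* the "don't migrate" bit */
};

struct cpu {
    atomic_int           load;      /* written only by the owning CPU */
    struct task *_Atomic mig_head;  /* migration queue head */
    atomic_flag          mig_busy;  /* lock for the migration queue */
    struct task         *runq;      /* private run queue, never locked */
    int                  nrun;      /* private length of runq */
};

void cpu_init(struct cpu *c)
{
    atomic_init(&c->load, 0);
    atomic_init(&c->mig_head, (struct task *)NULL);
    atomic_flag_clear(&c->mig_busy);
    c->runq = NULL;
    c->nrun = 0;
}

/* Owner-only: put a task on top of the private run queue. */
void runq_push(struct cpu *c, struct task *t)
{
    t->next = c->runq;
    c->runq = t;
    c->nrun++;
}

/* Owner-only: take the top task off the private run queue. */
struct task *runq_pop(struct cpu *c)
{
    struct task *t = c->runq;
    if (t != NULL) {
        c->runq = t->next;
        c->nrun--;
    }
    return t;
}

void mig_enter(struct cpu *c)
{
    while (atomic_flag_test_and_set_explicit(&c->mig_busy,
                                             memory_order_acquire))
        ;                           /* spin */
}

void mig_leave(struct cpu *c)
{
    atomic_flag_clear_explicit(&c->mig_busy, memory_order_release);
}

/* One scheduling pass for cpus[self]; returns the task to run, or NULL. */
struct task *cpu_schedule(struct cpu *cpus, int ncpu, int self, int slack)
{
    struct cpu *me = &cpus[self];
    assert(self >= 0 && self < ncpu);

    /* Steps 1-2: lockless NULL test, then drain under the lock. */
    if (atomic_load(&me->mig_head) != NULL) {
        mig_enter(me);
        struct task *t = atomic_load(&me->mig_head);
        while (t != NULL) {
            struct task *after = t->next;
            runq_push(me, t);
            t = after;
        }
        atomic_store(&me->mig_head, (struct task *)NULL);
        mig_leave(me);
    }

    /* Step 3: this is what runs next. */
    struct task *next = runq_pop(me);

    /* Step 4: lockless read of every other CPU's published load. */
    int low = -1, lowload = 0;
    for (int i = 0; i < ncpu; i++) {
        int l = atomic_load(&cpus[i].load);
        if (i != self && (low < 0 || l < lowload)) {
            low = i;
            lowload = l;
        }
    }

    /* Step 5: push the new top of the run queue to the idlest CPU. */
    struct task *cand = me->runq;
    if (low >= 0 && cand != NULL && !cand->no_migrate &&
        me->nrun - lowload > slack) {
        runq_pop(me);
        mig_enter(&cpus[low]);
        cand->next = atomic_load(&cpus[low].mig_head);
        atomic_store(&cpus[low].mig_head, cand);
        mig_leave(&cpus[low]);
    }

    /* Step 6: publish my load; step 7 is up to the caller. */
    atomic_store(&me->load, me->nrun);
    return next;
}
```

The only lock in the sketch is the one on a migration queue, and it is
taken only when work actually moves between CPUs.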

Because the operation occurs periodically, and because you skip
items that are marked non-migrate (you could have a bounded depth
to the search, if so desired), there is the ability to implement
explicit CPU affinity, and there is also the ability to implement
implicit CPU affinity (since there is hysteresis in the act of
deciding to push work off to another CPU).

The purpose of LIFO ordering, and examining the migration queue
before examining the local run queue *after* the LIFO ordering of
inserts of work from other CPUs is that the work on the other CPU
that pushed to you was at the top of its run queue, and this
avoids penalizing migrated objects one scheduler cycle latency.

Note that all this works, even in a decoherent system on the
migration queue examination, since the worst case is you delay
a migration for up to two scheduling cycles (one for the 'load'
value on the 

Re: Dynamic reads without locking.

2003-10-09 Thread Terry Lambert
Frank Mayhar wrote:
 The other thing is that the unlocked reads about which I assume Jeffrey
 Hsu was speaking can only be used in very specific cases, where one has
 control over both the write and the read.  If you have to handle unmodified
 third-party modules, you have no choice but to do locking, for the reason
 you indicate.  On the other hand, you can indeed make such a rule as you
 describe:  For this global datum, we always either write or read it in a
 single operation, we never do a write/read/modify/write.  Hey, if it's
 your kernel, you can make the rules you want to make!  But it's better
 to not allow third-party modules to use those global data.

I'm pretty sure that Jeffrey is aware of read-modify-write issues
with atomic vs. idempotent multi-instruction operations, since he
generally knows what he's doing.  8-).

Probably the most interesting cases, from his perspective in the
network stack, are queue insertions and removals for singly
linked queues, where the insertion OR removal can be done atomically
without taking locks (but not both, except on fully MESI coherent
systems without speculative execution).

FreeBSD's use of queue structures is sometimes overkill, and the
macro order of operations as they are currently defined prevent
non-locking operations, even where they would be safe, if the
order of operation were reversed (e.g. for a simple singly linked
tail queue).
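
As an illustration of a lock-free insertion on a singly linked list,
here is a sketch using a C11 compare-and-swap loop.  This is a modern
stand-in, not the sys/queue.h macros; slist_push and slist_take_all are
hypothetical names.  Insertion is lock-free, and the consumer sidesteps
the harder lock-free removal by detaching the whole list in one atomic
exchange.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node {
    struct node *next;
    int          value;
};

struct slist {
    struct node *_Atomic head;
};

/* Lock-free insertion at the head.  The new node's link is written
 * before the head is swung to it, so a reader never follows a
 * half-built node. */
void slist_push(struct slist *l, struct node *n)
{
    assert(n != NULL);
    struct node *old = atomic_load_explicit(&l->head,
                                            memory_order_relaxed);
    do {
        n->next = old;      /* old is refreshed when the CAS fails */
    } while (!atomic_compare_exchange_weak_explicit(&l->head, &old, n,
                 memory_order_release, memory_order_relaxed));
}

/* Detach every node at once; the caller then owns the chain. */
struct node *slist_take_all(struct slist *l)
{
    return atomic_exchange_explicit(&l->head, (struct node *)NULL,
                                    memory_order_acquire);
}
```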

-- Terry


Re: Recovery from mbuf cluster exhaustion

2003-10-09 Thread Terry Lambert
Peter Bozarov wrote:
[ ... ]
 What I can't seem to figure out is how to flush out the
 stale mbufs/clusters. I can close down all network
 interfaces, and kill/restart most of the processes that I
 presume use up the mbufs. At a given point, there can't
 possibly be any processes that are hogging the mbuf
 clusters. Yet, a while later, this is what the pool
 looks like
 
 [grid] ~ $ netstat -m
 4305/4944/18240 mbufs in use (current/peak/max):
  4305 mbufs allocated to data
 4304/4560/4560 mbuf clusters in use (current/peak/max)
 10356 Kbytes allocated to network (75% of mb_map in use)
 8832 requests for memory denied
 1 requests for memory delayed
 0 calls to protocol drain routines
 [grid] ~ $
 
 A few clusters have been freed. But not much. Now, if
 (presumably) no clusters are being used by a process,
 should they not be released by the kernel? Alternatively,
 how can I enforce this (short of rebooting the machine,
 which is *not* the solution I'm looking for)?

Wait for 2*MSL for the network connections to go away.  Assuming
the other end is still there, and not some network loading device
that effectively SYN-floods and establishes real connections
(e.g. a Web Avalanche or similar product).

Doing a netstat -a will show you a list of active connections,
of which I'm sure you have more than a few hanging around, even
though you killed the process that opened them.

You will see a number of bytes in the receive queue or transmit
queue columns, and these will indicate the amount of data that
you have pending in queues that's either not being read by your
application, or that your application has written, but which
cannot be sent because the other side of the connection has been
shut off, lost network connectivity, died, or intentionally
started a transfer with no intention of actually reading the
data you were going to send (e.g. the Microsoft WAST web tool
for benchmarking does this, and so does httpload).

Probably, you need more mbuf clusters, and therefore more mbufs
as well.

-- Terry


Re: HZ = 1000 slows down application

2003-10-08 Thread Terry Lambert
Luigi Rizzo wrote:
 On Tue, Oct 07, 2003 at 06:17:04PM -0400, [EMAIL PROTECTED] wrote:
  We did some intensive profiling of our application. It does not seem like
  we are depending on clock ticks for any calculations.
 
  On the other hand we notice that our slow iterations happen almost at the
  same instant as microuptime went backward messages in the system log. We
 
 if this is the case, probably your code at some point computes a
 time difference which turns out negative (or if it is unsigned, it
 becomes very very large) upon those events, thus causing some loop
 to explode.
 It should be easy to check if this is the case, and just ignore
 those outliers rather than trying to figure out why the clock
 goes backward. I used to see the same microuptime went backwards
 msg on some of my 400MHz boxes, even without NTP enabled.
 Maybe a buggy timer, not sure which timecounter was used on that
 box (i read some time ago that the cpu on the soekris4801 has a
 weird TSC implementation where the upper 32 bits change when the
 lower 32 bits are 0xfffd, who knows what other bugs might be
 in other hardware...)

FWIW: Internally, MacOS X supports monotime, which is a
monotonically increasing time counter, guaranteed to not go
backwards.  That avoids problems exactly like what you are
describing.  FreeBSD should consider supporting a monotime.
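
From user space, the same effect can be sketched with the POSIX
CLOCK_MONOTONIC clock, assuming the system provides it; this is not the
MacOS X internal interface mentioned above.  monotonic_us and
elapsed_us are hypothetical helper names.  The second helper shows the
defensive clamp Luigi suggests for code that must keep using a clock
that can step backwards.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Microseconds from a clock that is specified not to step backwards. */
int64_t monotonic_us(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void)rc;
    return (int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

/* Elapsed time between two readings of any clock.  A backwards step
 * is reported as 0 rather than as a negative value, or as a huge one
 * if the arithmetic were done unsigned. */
int64_t elapsed_us(int64_t start, int64_t end)
{
    return end >= start ? end - start : 0;
}
```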

-- Terry


Re: netisr

2003-10-08 Thread Terry Lambert
Giovanni P. Tirloni wrote:
  I'm studying the network stack and now I'm confronted with something
  called netisr. It seems ether_demux puts the packet in a netisr queue
  instead of passing it directly to ip_input (if that was the packet's
  type). Is this derived from LRP ?

No.  NETISR is a software interrupt that runs when software
interrupts run, which is to say, when the SPL is lowered as
a result of returning from the last nested hardware interrupt,
which means on hardware and clock interrupts.

It is completely antithetical to LRP, which, despite the name
Lazy Receiver Processing, is mostly only lazy about direct
dispatch.


  I've read their paper and it looks
  like their network channel (ni) but I'm not sure.

No.  There are two LRP papers from Rice University.  The first
was against FreeBSD 3.2, and dealt with just the idea of LRP.
The second makes things much more complicated than they need to
be, in order to introduce a concept called ResCon, or resource
containers.  Neither set of code is really production, because
it uses an alternate protocol family definition and a separately
hacked uipc_socket.c and uipc_socket2.c, along with the totally
parallel TCP/IP stack.

The closest you can come to LRP in FreeBSD 5.x is to use the
sysctl's for direct dispatch, which will, in fact, directly
call ip_input() from interrupt.

This isn't a full LRP, since it doesn't add a receive buffer
into the proc structure for per-process enqueuing of packets.
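
The difference between the queued NETISR path and direct dispatch can
be sketched in a few lines.  This is a toy model, not the FreeBSD code:
netisr_sketch, demux, swi_net and proto_input are hypothetical names,
and the 'direct' flag merely stands in for the sysctl mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define IPQ_MAX 8

struct pkt {
    int id;
};

struct netisr_sketch {
    bool        direct;         /* stand-in for the direct dispatch knob */
    bool        swi_pending;    /* software interrupt has been requested */
    struct pkt *queue[IPQ_MAX];
    int         qlen;
    int         delivered;      /* packets seen by the protocol input */
    int         dropped;        /* packets lost to a full queue */
};

/* Stand-in for ip_input(). */
void proto_input(struct netisr_sketch *n, struct pkt *p)
{
    assert(p != NULL);
    n->delivered++;
}

/* Called from the (simulated) hardware interrupt, like ether_demux(). */
void demux(struct netisr_sketch *n, struct pkt *p)
{
    if (n->direct) {
        proto_input(n, p);      /* processed at interrupt time */
        return;
    }
    if (n->qlen == IPQ_MAX) {
        n->dropped++;           /* queue full: packet is lost */
        return;
    }
    n->queue[n->qlen++] = p;
    n->swi_pending = true;      /* ask for the software interrupt */
}

/* Runs later, once the interrupt priority level has been lowered. */
void swi_net(struct netisr_sketch *n)
{
    if (!n->swi_pending)
        return;
    n->swi_pending = false;
    for (int i = 0; i < n->qlen; i++)
        proto_input(n, n->queue[i]);
    n->qlen = 0;
}
```

In the queued mode nothing reaches the protocol code until swi_net()
runs, which is the transition boundary discussed in this thread.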

When I implemented the 3.x version of Rice's LRP in FreeBSD 4.3,
I avoided this hack.  The main reason for the hack was to deal
with accepting connections, since at interrupt, without a proc
structure, there was no way to deal with the socket creation for
the accept, due to a lack of an appropriate credential.  The
sneaky approach I used for this was to create the accept socket
using the cred that was present on the listen socket on which
the connection had come in.  For this to be at all useful, you
need to extend kevent accept filters to allow creation of accepted
descriptor instances in the process context, and throw them onto
the kqueue that was set up against the listen socket.

I recommend that if you want to play with LRP, you add an attribute
flag to the protocol stack structure to indicate an LRP capable vs.
an LRP incapable stack, and then implement it inline, rather than
as a separate thing.  I also recommend that if you do this, you do
it using the Rice 3.x code, and ignore the ResCon stuff, which I
think is an interesting idea, but which adds very little in the way
of real value to things (though it does add overhead).

  Where can I find more information on this?

If you are asking about NETISR, then I recommend W. Richard Stevens'
books, specifically TCP/IP Illustrated, Volume 2: The Implementation.

If you are asking about LRP, then any search engine search for Lazy
Receiver Processing should turn up the two Rice University and the
one Duke University reference, as well as dozens of other references,
including the one to the Lucent/IBM work on QLinux (which has some
other neat information on things like WFS Queueing and other things
that are actually necessary to avoid the potential for livelock at
the user/kernel boundary).

FWIW: Interrupt threads, as they are used in 5.x, are pretty much
antithetical to an LRP implementation, since you can still end up
live-locked under heavy load (or denial of service attack), which
is why you wanted LRP in the first place: to make progress under a
load heavier than you could possibly respond to in a reasonable
time.  The problem is that the requeueing to the interrupt thread adds
back in the same type of transition boundary you were trying to take
out by getting rid of NETISR in the first place.

-- Terry


Re: Changing the NAT IP on demand?

2003-10-07 Thread Terry Lambert
Julian Elischer wrote:
 On Mon, 6 Oct 2003, Leo Bicknell wrote:
  In a message written on Sun, Oct 05, 2003 at 08:11:05PM -0600, Nick Rogness wrote:
   In addition to keeping your NAT translations (as suggested by
   Wes), you need to also keep routes for those entries as well, so
   that preserved traffic remains to route out the right ISP even if
   a switch occurs.
 
  You're right, however I would go with a different mechanism, but one
  I've also never tried to do.  What you want is routing based on the
  source address of the packet, not the destination as per usual.  You
  want to be able to say source a.a.a.a goes out link A.  I've never
  tried to do it on FreeBSD (it's easy on say Cisco's, with a bit of a
  performance hit on some platforms).
 
 this is very easy using the ipfw 'fwd' rule..

Actually, it's very hard; the 'fwd' rule doesn't quite cut it
for things like, for example, NFS.  It also fails to work with
aliases, when you want the packet sent from a specific IP on a
given interface.  There are some workarounds, like binding it
locally to an IP, but that's not so good when you are wanting
to be able to change IP addresses, as in the case in point.

Cisco really does routing differently.

One thing that would be handy is a socket type that was a TCP
stream socket, but which was bound to an interface, rather than
to a specific IP address.  This only solves some of the problems,
like not having to restart your already listening servers when
the IP address changes out from under it (e.g. the "kick sendmail
in the head" issue we had with the InterJet I's & II's when they
were running dialup in dial-on-demand mode).

Really, you need to have routing implemented like it's implemented
in Cisco's, and associate the reverse route with the last packet
you received from a given IP address (you always have this because
you had to have it to do the handshake).  The FreeBSD routing code
tends to get the routing wrong some of the time, particularly in
picking the local address to send a packet from.

The problem with the ipfw forwarding is that you don't a priori
know who's going to be talking to you, so you can't really make
preestablished rules for the forwarding for every possible IP
address that's non-local and across one link or the other.  I
suppose you could establish rules when you saw packets, but that
would require running as root, and hacking all your servers to
do the right thing anytime you got a connection.

-- Terry


Re: Why is PCE not set in CR4?

2003-10-02 Thread Terry Lambert
Bruce M Simpson wrote:
 Now that I think on this a bit more, a sysctl might be a better place to
 put this, but it seemed to belong with the i386_vm86() bits, rather than
 polluting initcpu.c right away.

The important thing is to allow the kernel to intermediate and
control allocation of counters to applications, so where you put
it is less important than that it be a procedural interface.  A
sysctl can be a procedural interface, but it's kind of ugly.

-- Terry
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Why is PCE not set in CR4?

2003-10-02 Thread Terry Lambert
Bruce M Simpson wrote:
 On Wed, Oct 01, 2003 at 11:39:36AM +0200, Grumble wrote:
  However, I am not allowed to use the RDPMC instruction from ring 3
  because the PCE (Performance-monitoring Counters Enable) bit is not set.
  
  You can do it with /dev/perfmon. man 4 perfmon.
 
  I have read the perfmon documentation and source code. For several
  reasons, I do not think it is totally adequate in my situation.
[ ... ]
 
 This is an extension to the i386_vm86() syscall which will let you turn
 PCE on and off if you're the superuser.

I like this a lot better.

To answer the inevitable question of why: PCE counters are a
scarce resource, and the kernel needs to run interference on
their allocation and deallocation by user space applications, to
avoid collisions between applications; this is the same reason
we have AGP and sound card device drivers in the kernel.

I'm not sure if restricting this to root users is exactly
necessary, but it can't hurt, given that there is a performance
denial of service possible otherwise.

-- Terry


Re: has anyone installed 5.1 from a SCSI CD?

2003-09-30 Thread Terry Lambert
Peter Jeremy wrote:
 On Sun, Sep 28, 2003 at 06:14:25PM -0400, Sergey Babkin wrote:
 BTW, I have another related issue too: since at least 4.7
 all the disk device nodes have character device entries in /dev.
 
 'block' vs 'character' has nothing to do with random or sequential
 access and any application that thinks it does is broken.  Any
 application that directly accesses devices must understand all the
 various quirks - ability to seek, block size(s) supported, side-
 effects of operations etc.

As opposed to the kernel understanding them, and representing the
classes of devices uniformly to the user level software.


 Yes, block devices must be random access,
 but character devices can be either random or sequential-only
 depending on the physical device.

But character devices can't be random-only.  Therefore, you
can assume the ability to perform random access on block devices,
and you can assume character devices require sequential access,
and your software will just work, without a lot of buffer
copying around in user space.


 The only purpose for block devices was to provide a cache for disk
 devices.  It makes far more sense for this caching to be tightly
 coupled into the filesystem code where the cache characteristics
 can be better controlled.

Actually, there are a number of other uses for this.  The number
one use is for devices whose physical block size is not an even
power of two less than or equal to the page size.  The primary
place you see this is in reading audio tracks directly off CD's.

Another place this is useful is in the UDF file system that Apple
was prepared to donate to the FreeBSD project several years ago.
DVD's are recorded in two discrete areas, one of which is an
index section, recorded as ISO9660, and one of which is recorded
as UDF.  By providing two distinct devices to access the drive,
it was possible to mount the character device as ISO9660, and
then access the UDF data via the block device.  Again, we are
talking about physical block sizes of which the page size is not
an even power of 2 multiple.

Another use for these devices is to avoid the need for some form
of intermediary blocking program (e.g. dd, etc.) for accessing
historical archives on tape devices.  Traditional blocking on
tape devices is 20K, and by enforcing this at the block device
layer, it's possible to deal with streaming of tape devices without
adding an unacceptable CPU overhead.

Another issue is Linux emulation; Linux itself has only block
devices, not character, and when things are the right size
and alignment, the block devices pass through and act like
character devices.  However... this means that Linux software
which depends on this behaviour will not run on FreeBSD under
emulation.

Finally, block devices serve the function of double-buffering a
device for unaligned and/or non-physical block sized accesses.
The benefit to this is that you do not need to replicate code in
*every single user space program* in order to deal with buffering
issues.  There has historically been a lot of pain involved in
maintaining disk utilities, and in porting new FS's to FreeBSD,
as a result of the lack of block devices to deal with issues like
this.
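
To make that buffering burden concrete, here is a minimal sketch of
what each user space program ends up carrying when nothing below it
does the double-buffering: an unaligned read assembled from whole,
aligned sector reads through a bounce buffer.  The function name, the
fixed 512-byte sector size, and the use of pread(2) are assumptions
made for illustration; real devices report their own sector size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SECTOR_SIZE 512		/* assumed physical block size */

/*
 * Read len bytes starting at an arbitrary byte offset, issuing only
 * sector-aligned, sector-sized reads against fd.  Returns the number
 * of bytes copied into dst (short at end of device), or -1 on error.
 */
ssize_t
unaligned_read(int fd, void *dst, size_t len, off_t off)
{
	unsigned char bounce[SECTOR_SIZE];
	unsigned char *out = dst;
	size_t done = 0;

	assert(off >= 0);
	while (done < len) {
		off_t pos = off + (off_t)done;
		off_t base = pos - (pos % SECTOR_SIZE);	/* aligned start */
		size_t skip = (size_t)(pos - base);	/* bytes to discard */
		size_t want = SECTOR_SIZE - skip;
		ssize_t got;

		if (want > len - done)
			want = len - done;
		got = pread(fd, bounce, SECTOR_SIZE, base);
		if (got < 0)
			return (-1);
		if ((size_t)got <= skip)
			break;			/* nothing usable: end of device */
		if ((size_t)got - skip < want)
			want = (size_t)got - skip;
		memcpy(out + done, bounce + skip, want);
		done += want;
		if ((size_t)got < SECTOR_SIZE)
			break;			/* short sector: end of device */
	}
	return ((ssize_t)done);
}
```

Every utility that touches the raw device needs some variant of this
loop, plus the matching read-modify-write path for unaligned writes.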

I'll agree that the change has been mostly harmless -- additional
pain, rather than actually obstructing code from being written
(except that Apple didn't donate the UDF code, and it took years to
reproduce it, of course; FreeBSD doesn't appear to have suffered
anything other than a migration of FS developers to other platforms).

On the other hand, a lot of the promised benefits of this change
never really materialized; for example, even though it's more
efficient in theory, Linux performance still surpasses FreeBSD
performance when it comes to raw device I/O (and Linux has only
*block* devices).  We still have to use a hack (foot shooting)
to allow us to edit disklabels, rather than using an ioctl() to
retrieve them or rewrite them as necessary, etc., and thus use
user space utilities to do the work that belongs below an abstract
layer in the kernel.

I'm not saying that FreeBSD should switch to the Linux model -- though
doing so would benefit Linux emulation, and, as Linux demonstrates,
it does not have to mean a loss of performance -- but to paint it as
something everyone agreed upon or even something everyone has
grown to accept is also not a fair characterization.

-- Terry
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: user malloc from kernel

2003-09-30 Thread Terry Lambert
earthman wrote:
 how to allocate some memory chunk
 in user space memory from kernel code?
 how to do it correctly?

If your intent is to allocate a chunk of memory which is shared
between your kernel and a single process in user space, the
normal way of doing this is to allocate the memory to a device
driver in the kernel, and then support mmap() on it to establish
a user space mapping for the kernel allocated memory.  In general,
you must do this so that the memory is wired down in the kernel
address space, so that if you attempt to access it in the kernel
while the process you are interested in sharing with is swapped
out, you do not segfault and trap-12 (page not present) panic
your kernel.

If your intent is to share memory with every process in user
space (e.g. similar to what some OS's use to implement zero
system call gettimeofday() functions, etc.), then you want to
allocate the memory in kernel space (still), make sure it's on
a page boundary, and set the PG_G and PG_U bits on the page(s)
in question.

-- Terry


Re: user malloc from kernel

2003-09-30 Thread Terry Lambert
Pawel Jakub Dawidek wrote:
 On Mon, Sep 29, 2003 at 06:56:13PM +0300, Peter Pentchev wrote:
 + I mean, won't the application's memory manager attempt to allocate the
 + next chunk of memory right over the region that you have stolen with
 + this brk(2) invocation?  Thus, when the application tries to write into
 + its newly-allocated memory, it will overwrite the data that the kernel
 + has placed there, and any attempt to access the kernel's data later will
 + fail in wonderfully unpredictable ways :)
 
 I'm not sure if newly allocated memory will overwrite memory allocated
 in kernel, but for sure process is able to write to this memory.
 
 Sometime ago I proposed model which will allow to remove all copyin(9)
 calls and many copyout(9), but I'm not so skilled in VM to implement it.

You probably need two pages; one R/O in user space and R/W in
kernel space, and one R/W in both user and kernel space.  The
copyin() elimination would use the R/W page.  Frankly, I have
to say that you aren't saving much by eliminating copyin() this
way: most of your overhead is going to be data copies with
pointers, and it doesn't really matter where you get the pointers
into the kernel; the bummer is going to be copying around the
data pointed to by the pointers.

For the copyout, you'd probably get a rather larger benefit if
you could implement getpid(), getuid(), getgid(), getppid(), and
so on, in user space entirely, just by referencing the common
read-only page.

You could probably also benefit significantly by deobfuscating
the timer code and using a flip-flop timer and externalizing
the calibration information in a single globally read-only
page (PG_G, PG_U, R/O mapping one place, kernel-only R/W mapping
another), and then using it to implement a zero system call
gettimeofday() operation (there's really no need to have a huge
list of timers, if updates are effectively atomic at the clock
interrupt, and you use a flip-flop pointer to only two contexts
instead of a huge number of them).
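
A minimal sketch of the flip-flop idea, in portable C11 rather than
kernel code: the single writer (standing in for the clock interrupt)
fills the inactive context and then flips, and a reader retries only
if a flip happened while it was copying.  The structure layout, the
field names, and the generation counter used to detect a flip are all
assumptions for illustration; a strictly conforming version would
also make the snapshot fields atomic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

struct timectx {		/* hypothetical calibration snapshot */
	uint64_t base_sec;
	uint64_t base_nsec;
};

struct timepage {		/* would live in the shared R/O page */
	_Atomic unsigned gen;	/* bumped per flip; low bit picks the slot */
	struct timectx ctx[2];	/* the two flip-flop contexts */
};

/* Writer side: one caller only (the clock interrupt in the real thing). */
void
timepage_update(struct timepage *tp, uint64_t sec, uint64_t nsec)
{
	unsigned g;
	struct timectx *next;

	assert(tp != NULL);
	g = atomic_load_explicit(&tp->gen, memory_order_relaxed);
	next = &tp->ctx[(g + 1) & 1];	/* the slot readers are NOT using */
	next->base_sec = sec;
	next->base_nsec = nsec;
	/* Publish: readers that load gen after this see the new slot. */
	atomic_store_explicit(&tp->gen, g + 1, memory_order_release);
}

/* Reader side: no system call, no lock; retry if a flip raced the copy. */
struct timectx
timepage_read(struct timepage *tp)
{
	struct timectx snap;
	unsigned before, after;

	assert(tp != NULL);
	do {
		before = atomic_load_explicit(&tp->gen, memory_order_acquire);
		snap = tp->ctx[before & 1];
		atomic_thread_fence(memory_order_acquire);
		after = atomic_load_explicit(&tp->gen, memory_order_relaxed);
	} while (after != before);
	return (snap);
}
```

With updates happening once per clock tick, the retry path should
almost never be taken, which is the point of having only two contexts.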

Specifically, you could find yourself with a huge performance
improvement in anything that has to log in the Apache/SQUID
styles, which require a *lot* of logging, which would mean a
*lot* of system calls.

You could also use a knote for this, which is only returned
when other knote's are returned, and not otherwise, but that
would be a lot less friendly to third party source code that
was not specifically adulterated for FreeBSD.

-- Terry


Re: [PATCH] : libc_r/uthread/uthread_write.c

2003-09-19 Thread Terry Lambert
Daniel Eischen wrote:
   If you are using libkse or
   libthr, you will get a partial byte count and not zero because
   the tape driver returns the (partial) bytes written.  So exiting
   the loop in libc_r and returning 0 would only seem to correct
   the problem for libc_r.
 
 If there is a difference, it could be because libc_r is using non-blocking
 IO behind the scenes, and sa(4) may be returning partial byte count
 in the non-blocking case and 0 (or -1 and ENOSPC) in the blocking case
 (which is what you'd get using libkse/libthr).

I would think that for non-block multiple and/or non-block-aligned
writes, there's no way to avoid the fault-in penalty for the need
to do read-before-write, so there will always be some unavoidable
stalls.

-- Terry


Re: TCP information

2003-09-19 Thread Terry Lambert
Dan Nelson wrote:
  These types of statistics aren't kept.
 
  They usually do not make it into commercial product distributions for
  performance reasons, and because every byte added to a tcpcb
  structure is one byte less that can be used for something else. In
  practice, adding 134 bytes of statistics to a tcpcb would double its
  size and halve the number of simultaneous connections you would be
  able to support with the same amount of RAM in a given machine (as
  one example), if all of that memory had to come out of the same
  space, all other things being equal.
 
 tcpcb is currently 236 bytes though, and I don't imagine adding another
 8 bytes for an unsigned long dropped packets counter is going to kill
 him.

236 is too large.  We do stupid things like not compressing the
state.  For example, there is state that is unique to a listen
socket and state that is unique to a connecting socket: this
state should be in a union, so that tcpcb's are smaller.  The
kqueue bloat, particularly that for accept filters is another
issue.  So is the bloated credential and other information, most
of which belongs in application-specific extension data chains
that are *only* used when the application is active vs. the TCP
connection (e.g. when IPSec is active, when kqueues have been
registered, etc.).

In 4.x, the structure size was 134 bytes (maybe 136; depends on
which 4.x, I guess).  The extra 100 bytes are cruft.  Removing
the cruft and compressing the state with a union would get you
just under 128 bytes, so the current structure is almost 100%
additional bloat for features that are rarely used, or are
used, but are generally only in effect on a small number of the
open sockets you are dealing with; very very annoying.
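
To illustrate the union trick on something small: the sketch below
uses made-up structures, not the real tcpcb, and every field name in
it is hypothetical.  It only shows the mechanism, that state which is
mutually exclusive can share storage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* State only a listening socket needs (hypothetical fields). */
struct listen_state {
	uint32_t backlog;
	uint32_t qlen;
	uint32_t qlimit;
	uint32_t accept_filter_id;
};

/* State only a connecting/connected socket needs (hypothetical fields). */
struct conn_state {
	uint32_t snd_una;
	uint32_t snd_nxt;
	uint32_t rcv_nxt;
	uint32_t rtt_est;
};

/* Uncompressed: every control block pays for both sets of state. */
struct pcb_flat {
	uint16_t flags;
	struct listen_state l;
	struct conn_state c;
};

/* Compressed: a socket is one or the other, so the storage overlaps. */
struct pcb_packed {
	uint16_t flags;
	union {
		struct listen_state l;
		struct conn_state c;
	} u;
};

static_assert(sizeof(struct pcb_packed) < sizeof(struct pcb_flat),
    "the union must make the control block smaller");

/* Bytes saved per control block by overlapping the two states. */
size_t
pcb_bytes_saved(void)
{
	return (sizeof(struct pcb_flat) - sizeof(struct pcb_packed));
}
```

The saving per socket is the size of the smaller of the two state
blocks, multiplied by however many sockets you have open.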


 Deepak: if you really want stats, try adding a struct tcpstat to tcpcb
 and hack all the netinet/tcp* code to update those whenever the global
 tcpstat gets updated.  You'll get all the info that netstat -s prints,
 for each socket.  *That* will definitely double the size of struct tcpcb :)

The statistics gathering really should be macrotized, and a
macro declaration added for this.  You could then make it a
compile-time option as to whether or not you gather the stats
(default to off!).  Assuming some FreeBSD committer is willing
to stick the macros in the headers and the instrumentation
points.
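
A rough sketch of what such macros could look like.  None of these
names exist in the FreeBSD tree; CONN_STATS, struct conn, and the
CONN_STAT_* macros are invented for the example.  The sketch turns
the option on so that it has something to show; in a real tree the
default would be 0, as argued above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifndef CONN_STATS
#define CONN_STATS 1	/* on for this sketch only; default to 0 in real use */
#endif

struct conn_stats {
	uint32_t retransmits;
	uint32_t drops;
};

struct conn {
	int state;
#if CONN_STATS
	struct conn_stats stats;	/* costs nothing when compiled out */
#endif
};

#if CONN_STATS
#define CONN_STAT_INC(c, field)	((void)((c)->stats.field++))
#define CONN_STAT_GET(c, field)	((c)->stats.field)
#else
#define CONN_STAT_INC(c, field)	((void)0)
#define CONN_STAT_GET(c, field)	(0U)
#endif

/* An instrumentation point: the macro vanishes when stats are off. */
void
conn_on_retransmit(struct conn *c)
{
	assert(c != NULL);
	c->state = 1;
	CONN_STAT_INC(c, retransmits);
}
```

Because the instrumentation points only ever use the macros, the
packet path contains no statistics code at all when the option is off.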

If you did the extension structure chaining trick, noted above,
you could even make it runtime adjustable; however, you would
need to (1) add a timestamp to the structure to indicate the
start time for statistics gathering and (2) walk the list of
open sockets to add an extension for each of the already open
sockets in the system.  You could even have a separate set of
commands (I would suggest a pseudo device driver for doing it)
to enable/start/stop/disable, so you can leave dormant extension
structures lying around to control sample intervals separated by
non-sample intervals of indeterminate length.
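
The runtime-adjustable variant might look something like the sketch
below: a pointer indirection per connection, a start timestamp, and a
walk over the already-open connections when gathering is enabled.
The list layout and all of the names are assumptions for
illustration, and locking is left out entirely.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

struct conn_ext_stats {
	time_t started;		/* (1) when this sample interval began */
	int enabled;		/* dormant extensions stay attached */
	uint32_t retransmits;
};

struct conn {
	struct conn *next;		/* list of open connections */
	struct conn_ext_stats *ext;	/* NULL until stats are requested */
};

/* (2) Walk the open connections and attach/arm an extension on each. */
int
stats_enable_all(struct conn *head, time_t now)
{
	for (struct conn *c = head; c != NULL; c = c->next) {
		if (c->ext == NULL) {
			c->ext = calloc(1, sizeof(*c->ext));
			if (c->ext == NULL)
				return (-1);
		}
		c->ext->started = now;
		c->ext->enabled = 1;
	}
	return (0);
}

/* Stop sampling but leave the dormant extensions in place. */
void
stats_stop_all(struct conn *head)
{
	for (struct conn *c = head; c != NULL; c = c->next)
		if (c->ext != NULL)
			c->ext->enabled = 0;
}

/* Instrumentation point: one pointer test when stats are off. */
void
conn_count_retransmit(struct conn *c)
{
	assert(c != NULL);
	if (c->ext != NULL && c->ext->enabled)
		c->ext->retransmits++;
}
```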

Either way, though, I think you would want it to be off by
default, just like you want the IPSEC to be off by default,
given that it soaks up a huge default object per socket just
by being compiled in, even if the socket never actually uses
the feature.  8-(.

-- Terry


Re: TCP information

2003-09-19 Thread Terry Lambert
Deepak Jain wrote:
 If the tcpcb struct were expanded/changed and the various increments were
 added in the appropriate packet pushing code, this would work right? Is
 there something non-obvious that one would need to worry about to undertake
 such a project?

Your overhead would be slightly higher when doing statistics,
because you would need to store more information for each of
the statistics you wanted to gather.

The main reasonable objections to doing this by default would
be (1) added overhead and (2) increased per-connection costs.

The first of these is an issue for everybody.

The second of these would be an issue for connections which
remain idle for significant fractions of time.  I'd call 20%
or more of the time idle significant for this purpose, so
you could include FTP control channels, HTTP persistent
connections, IMAP4 connections, database entry screens, and
pretty much anything else that was client/server, had a slow
human on one end of the client, and a persistent connection
to the server on the other.

See my other posting on how to do this at a slightly higher
cost, but only when it's enabled, via a pointer indirection,
or at equal cost without one, as a compile-time option.  I
think that approach would be better for your purposes,
particularly if you wanted to offload the code maintenance,
rather than reintegrating a lot of patches for each release
you wanted to run on.

-- Terry


Re: TCP information

2003-09-18 Thread Terry Lambert
Deepak Jain wrote:
 Is there a utility/hack/patch that would allow a diligent sysadmin to obtain
 which specific TCP connections are generating retransmits and receiving
 packet drops? netstat will show me drops on an interface, but not on a
 specific source/dest pair?
 
 I am guessing something like a netstat -n, but instead of showing send/rec
 queues it shows retransmit or packet drops? Would there be much interest in
 this feature if we were to build it ourselves?

These types of statistics aren't kept.

Generally, they are used only by network researchers, who hack
their stacks to get them.

They usually do not make it into commercial product distributions
for performance reasons, and because every byte added to a tcpcb
structure is one byte less that can be used for something else.
In practice, adding 134 bytes of statistics to a tcpcb would
double its size and halve the number of simultaneous connections
you would be able to support with the same amount of RAM in a
given machine (as one example), if all of that memory had to
come out of the same space, all other things being equal.
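
The arithmetic behind that, as a small sketch.  The helper name and
the pool size in the usage note are made up; the 134-byte figure is
the one used above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * How many control blocks of pcb_bytes each fit in a fixed pool,
 * all other things being equal.
 */
uint64_t
max_connections(uint64_t pool_bytes, uint64_t pcb_bytes)
{
	assert(pcb_bytes > 0);
	return (pool_bytes / pcb_bytes);
}
```

For a pool of 134,000,000 bytes, max_connections(134000000, 134) is
1,000,000 connections, and max_connections(134000000, 134 + 134) is
500,000: the same memory, half the connections.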

-- Terry


Re: Machine wedges solid after one serial-port source-line addition...

2003-09-18 Thread Terry Lambert
Barry Bouwsma wrote:
 You see, what I'm attempting to do, without knowing what I'm doing,
 is to implement the TIOCMIWAIT ioctl that apparently exists in Linux,
 to notify a userland program that there's been a status change on one
 or more of the modem status lines, and eliminate the need to poll the
 status line in question, cutting that program's cost to run by a factor
 of about 20 in the testing I did before the machine would wedge.
 
 I did all this offline, with no examples to follow, but now I have
 something to look at and see if I have the general idea.  So I should
 probably shut up and study it.
 
 So, since a printf() is right out, is it safe for me (as a non-
 programmer, so forgive my ignorance of the basics) to simply use
 little more than a wakeup() in its place?  Or does that, or the
 tsleep() corresponding, need some sort of careful handling to avoid
 the lockups I've experienced?

I remember wakeup() being bad.  Taking any time to do anything
at all more than just queueing data and going away is probably
bad.

If it were my project, I'd mirror the values out to a status
structure that's only written at interrupt, and read and reset
at software interrupt, and then use the soft interrupt handler
to raise the signals/send the wakeup/whatever and then reset
the flags bits to zero via a call down that synchronizes like
a baud rate or FIFO depth change (e.g. like the mouse line
discipline does to set the FIFO depth to avoid jerky mouse
movement).
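
As a rough user space model of that split: the interrupt side only
ORs changed bits into a status word and leaves, and the soft
interrupt side collects and clears them in one step.  The bit names
and structure are hypothetical, and the C11 atomic exchange here is a
stand-in for the synchronizing call down described above, not what
the sio driver actually does.

```c
#include <assert.h>
#include <stdatomic.h>

#define MSR_DCD	0x01u	/* hypothetical bit assignments */
#define MSR_RI	0x02u
#define MSR_CTS	0x04u

struct line_status {
	_Atomic unsigned changed;	/* modem lines that changed state */
};

/* Hard interrupt side: record the change and get out. */
void
status_note_change(struct line_status *ls, unsigned bits)
{
	assert(bits != 0u);
	atomic_fetch_or_explicit(&ls->changed, bits, memory_order_release);
}

/*
 * Soft interrupt side: fetch and reset in a single atomic step, so a
 * change arriving in between is never lost.  The caller then raises
 * signals or issues the wakeup for whatever bits came back.
 */
unsigned
status_collect(struct line_status *ls)
{
	return (atomic_exchange_explicit(&ls->changed, 0u,
	    memory_order_acquire));
}
```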

Bruce Evans is the authority in this area; you would be well
advised to consult him directly.  He may even already have
code to do something similar to what you want (I think I
remember code to signal a program on an RI going high, but
I could be mistaken).

-- Terry


Re: Any workarounds for Verisign .com/.net highjacking?

2003-09-17 Thread Terry Lambert
Clifton Royston wrote:
   For those who don't know what I'm talking about, try executing host
 thisdomainhasneverexistedandneverwill.com, or any other domain you'd
 care to make up in .com or .net.  Verisign has abused the trust placed
 in them to operate a root name server, by creating wildcard A records
 directly under .com and .net, which point to Verisign's search
 website.

If you get their A record in your resolver, pretend you got the
standard error instead.  It's a really easy resolver hack.
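
The core of such a hack is a single comparison, sketched below as a
standalone function rather than as a patch to any particular
resolver.  The function name and result codes are invented for the
example, and the wildcard address is passed in as text because it is
an assumption: use whatever address a lookup of a made-up name
returns for you.

```c
#include <assert.h>
#include <stddef.h>
#include <arpa/inet.h>
#include <netinet/in.h>

enum lookup_result {
	LOOKUP_OK = 0,		/* pass the answer through */
	LOOKUP_NOT_FOUND = 1	/* pretend the name did not resolve */
};

/*
 * Decide what to do with an A record answer, given the textual
 * address the wildcard record is known to point at.
 */
enum lookup_result
filter_wildcard(const struct in_addr *answer, const char *wildcard_text)
{
	struct in_addr wildcard;

	assert(answer != NULL && wildcard_text != NULL);
	if (inet_pton(AF_INET, wildcard_text, &wildcard) != 1)
		return (LOOKUP_OK);	/* unparsable filter: change nothing */
	if (answer->s_addr == wildcard.s_addr)
		return (LOOKUP_NOT_FOUND);
	return (LOOKUP_OK);
}
```

A resolver would call this on each A record in a reply and return its
usual host-not-found error when LOOKUP_NOT_FOUND comes back.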

-- Terry


Re: pppoe - nmap - No buffer space available

2003-09-17 Thread Terry Lambert
[EMAIL PROTECTED] wrote:
 sendto in send_tcp_raw: sendto(3, packet, 40, 0, X.X.X.X, 16) = No buffer
 space available

Your interface is down.  This happens all the time.

If you use PPP on a dialup modem with a normal net connection,
and unplug the modem while you are doing a ping, you will see
the same thing.

The easiest fix is don't send packets out routes that transit
interfaces which are not up.

-- Terry


Re: tty layer and lbolt sleeps

2003-09-17 Thread Terry Lambert
Mike Durian wrote:
 I'm trying to implement a serial protocol that is timing sensitive.
 I'm noticing things like drains and reads and blocking until the
 next kernel tick.  I believe this is due to the lbolt sleeps
 in the tty.c code.
 
 It looks like I can avoid these sleeps if isbackground() returns
 false, however I can't figure out how to make this happen.  The
 process is running in the foreground and my attempts to play
 with the process group haven't helped.
 
 Can anyone explain what is happening and nudge me towards a fix?

You need your process to become a process group leader, and then
you need the serial port you are interested in to become the
controlling tty for your process.

The first is accomplished with setpgid(2); the second is accomplished
with setsid(2) and open(2) (the open must not specify O_NOCTTY).  You
can move around after that by calling tcsetpgrp(3).
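
A minimal sketch of that sequence, under some assumptions: the
function name is made up, and the explicit TIOCSCTTY ioctl is an
addition beyond what is described above, included because, as far as
I know, it is what login_tty(3) does after setsid(2) to claim the
terminal.  Treat it as a starting point, not a tested recipe for any
particular serial driver.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <termios.h>
#include <unistd.h>

/*
 * Open the tty at path and make it the controlling terminal of a new
 * session led by the calling process.  Returns the descriptor, or -1.
 * setsid(2) fails if the caller already leads a process group, so
 * this is normally called in a freshly forked child.
 */
int
open_as_ctty(const char *path)
{
	int fd;

	assert(path != NULL);
	fd = open(path, O_RDWR);	/* deliberately no O_NOCTTY */
	if (fd < 0)
		return (-1);
	if (setsid() < 0) {		/* new session, no controlling tty */
		close(fd);
		return (-1);
	}
	if (ioctl(fd, TIOCSCTTY, 0) < 0) {	/* claim fd as controlling tty */
		close(fd);
		return (-1);
	}
	return (fd);
}
```

After this succeeds, the process is a session and process group
leader with the port as its controlling tty.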

You can only have one controlling tty per process, so if you wanted
to, for example, have a terminal emulation program that would quit
when you turned off your terminal (on-to-off transition of DTR) *and*
you *also* wanted it to receive SIGHUP when you got an on-to-off DCD
transition from a modem, you would need two processes.

See also the source code for getty(8) and the library utility
function login_tty(3).

-- Terry


Re: Machine wedges solid after one serial-port source-line addition...

2003-09-16 Thread Terry Lambert
Barry Bouwsma wrote:
 Would anyone care to explain why the following simple patch could be
 enough to wedge my machine solid?  (My original hack-patches without
 any console printf() debuggery did the same thing within seconds, as
 well...)  All it does is notify the console whenever a serial port DCD
 PPS signal transition is detected, as follows (patch against 4.foo; I
 haven't tried this with 5.bar or later -- also, not a real patch as I've
 included context and snipped my comments) :
[ ... ]
 I'm wondering if it's something really blindingly obvious that I should
 be but am not aware of, or something I gotta work on to track down.

You are calling printf() from a fast interrupt handler.

You shouldn't call printf() from any interrupt handler, and
particularly you shouldn't call it from something that can
and will have a FIFO overrun well before the printf() gets
back.

If you need to communicate information to a console log (or
wherever), then you should enqueue the information on the
status change, and wake up some thread to do the actual
processing of the information out to the console.
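
The enqueue half of that can be as small as a single-producer,
single-consumer ring, sketched here in portable C11 rather than as
driver code.  The queue size, the names, and the one-byte event
encoding are assumptions; the wakeup of the consuming thread is left
out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define EVQ_SIZE 64u	/* must be a power of two */

static_assert((EVQ_SIZE & (EVQ_SIZE - 1)) == 0, "EVQ_SIZE: power of two");

struct evq {
	_Atomic unsigned head;	/* advanced only by the interrupt side */
	_Atomic unsigned tail;	/* advanced only by the thread side */
	uint8_t ev[EVQ_SIZE];
};

/* Interrupt side: never blocks.  Returns 0, or -1 if full (dropped). */
int
evq_put(struct evq *q, uint8_t ev)
{
	unsigned head = atomic_load_explicit(&q->head, memory_order_relaxed);
	unsigned tail = atomic_load_explicit(&q->tail, memory_order_acquire);

	if (head - tail == EVQ_SIZE)
		return (-1);
	q->ev[head % EVQ_SIZE] = ev;
	atomic_store_explicit(&q->head, head + 1, memory_order_release);
	return (0);
}

/* Thread side: returns 1 and fills *out, or 0 if the queue is empty. */
int
evq_get(struct evq *q, uint8_t *out)
{
	unsigned tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
	unsigned head = atomic_load_explicit(&q->head, memory_order_acquire);

	assert(out != NULL);
	if (head == tail)
		return (0);
	*out = q->ev[tail % EVQ_SIZE];
	atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
	return (1);
}
```

The thread that drains the queue is the one that gets to call
printf(), at its leisure.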

-- Terry


Re: 20TB Storage System (fsck????)

2003-09-05 Thread Terry Lambert
Geoff Buckingham wrote:
 On Thu, Sep 04, 2003 at 01:12:45AM -0700, Terry Lambert wrote:
  Yes.  Limit the number of CG bitmaps you examine simultaneously,
  and make the operation multiple pass over the disk.  This is not
  that hard a modification to fsck, and it can be done fairly
  quickly by anyone who understands the code.  The cost in time to
  fsck the disk will go up inversely proportionally to the amount
  of RAM it's allowed to use, which is limited to the UVA size
  minus the fsck program size itself, and the fsck buffers used for
  things like FS metadata for a given file/directory.
 
 Pardon my ignorance but does the number of inodes in the filesystem have a
 significant impact on the memory requirement of fsck?

I can't answer empirically, but extrapolating from the empirical
data that I *do* have, the time is going to go up proportional
to the number of blocks in use, and the number of blocks in
use is going to equal the average number of blocks per file
times the number of files, and given that there is one inode
per file, you will bound the amount of blocks by bounding the
number of inodes.

This makes the answer yes, indirectly.  What passes get run
really depend on how your FS is configured.  By default, a
background fsck will only check for blocks that are marked as
used in the CG bitmaps that are not actually used; so this is
a CG bitmap vs. all inodes direct and indirect block lists
consistency check only.

Most of the incremental or multipass techniques I've discussed
on the mailing list assume either a full fsck, or that you are
able to lock individual CGs, or at least ranges on the disk, if
you wish to do a BG check; or that you read-only the entire
disk until you are done, and maintain a list of needs update
items (this can be very compact, since it can be run-length
encoded or otherwise highly compressed).

If you read the fsck manual page and understand what it means,
you can get some idea of what parameters affect it in what
phases:

 Inconsistencies checked are as follows:
 1.   Blocks claimed by more than one inode or the free map.

Every inode needs to be scanned to see what blocks in the free
map should not be in the free map.  The free map is the set
of bits in the set of all cylinder group bitmaps.

Cross-checking multiple references for directories is a process
of combinatorial math.  You take N inodes 2 at a time and compare
them.  A trick that is possible, if you are willing to rebuild
the CG bitmaps in core, or are willing to double the space for
the set you are examining would be to examine a range, and at
the same time keep a shadow.  Zero the shadow, and pass the list
of inodes once, setting bits in the shadow.  If you go to set a
bit and it's already set, *then* you go back and find out who it
was who had the bit set.  This is probably an OK trade-off,
particularly if you maintain a list of this-file-this-suspect-bit,
and then pass the FS again (large numbers of cross-linked blocks
are rare).
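
A sketch of the shadow trick over a flattened list of (inode, block)
claims.  This is not fsck code: the claim structure, the function
name, and the idea of handing the claims over as an array are all
simplifications made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct claim {			/* "this inode says it owns this block" */
	uint32_t inode;
	uint32_t block;
};

/*
 * One pass over the claims.  shadow must hold (nblocks + 7) / 8 bytes.
 * A claim on a block whose shadow bit is already set is recorded as a
 * this-file-this-suspect-bit entry (up to max_suspects) for a later
 * pass to resolve.  Returns the number of duplicate claims seen.
 */
size_t
find_dup_claims(const struct claim *claims, size_t nclaims,
    uint8_t *shadow, uint32_t nblocks,
    struct claim *suspects, size_t max_suspects)
{
	size_t ndup = 0;

	memset(shadow, 0, ((size_t)nblocks + 7) / 8);	/* zero the shadow */
	for (size_t i = 0; i < nclaims; i++) {
		uint32_t b = claims[i].block;
		uint8_t mask;

		assert(b < nblocks);	/* out-of-range is a separate check */
		mask = (uint8_t)(1u << (b % 8));
		if (shadow[b / 8] & mask) {
			if (ndup < max_suspects)
				suspects[ndup] = claims[i];
			ndup++;
		} else {
			shadow[b / 8] |= mask;
		}
	}
	return (ndup);
}
```

Only the suspects need the expensive "who else has this block"
search, and there are normally very few of them.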

The second of these operations is as expensive as:

#inodes_used*(#inodes_used-1)*(#indirect_blocks**2-1)

 2.   Blocks claimed by an inode outside the range of the filesystem.

What this really should say is which are outside the range.  In
other words, bogus block numbers.  This is a compare that can be
made during a direct linear search.

 3.   Incorrect link counts.

This is a directory entry vs. inode count.  The expense of this
operation depends on whether you are directory-entry-major or
inode-major on your pass, and the relative number of directory entries vs.
inodes.  For most FS's, the number of entries is going to be
~15% higher than the number of inodes; this is because of the
hard links to directories from their parents, and to parent
directories from their child directories.  This number could be
much, much higher on an FS with a large number of hard links
per file.

The thing you have to worry about is tracking the number of
hard links per inode, and whether you can do this all in memory
(e.g. with a linear array of integers of the same type size as
the link count, whose length is equal to the number of inodes
available in the system), or whether you have to break the job
up and pass over the directory structure multiple times.  If
you can't keep all the items in memory, and must make multiple
passes, then it's better to be inode-major; otherwise, it's
better to be directory-major.
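
The all-in-memory, directory-major case can be sketched as follows.
Again this is an illustration, not fsck code: the directory walk is
reduced to an array of inode numbers, one per directory entry, and
the names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Count references per inode in one pass over the directory entries,
 * then compare against the link count stored in each inode.
 * counted is the linear array described above: one integer per inode,
 * the same width as the on-disk link count.
 * Returns the number of inodes whose link count is wrong.
 */
size_t
check_link_counts(const uint32_t *dirent_inodes, size_t ndirents,
    const uint16_t *stored_nlink, uint16_t *counted, size_t ninodes)
{
	size_t bad = 0;

	memset(counted, 0, ninodes * sizeof(counted[0]));
	for (size_t i = 0; i < ndirents; i++) {
		assert(dirent_inodes[i] < ninodes);
		counted[dirent_inodes[i]]++;
	}
	for (size_t ino = 0; ino < ninodes; ino++)
		if (counted[ino] != stored_nlink[ino])
			bad++;
	return (bad);
}
```

When the counted array does not fit in memory, the same loop has to
be run once per inode range, which is where the multiple passes over
the directory structure come from.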

 4.   Size checks:
Directory size not a multiple of DIRBLKSIZ.

Simple check; can be done during one of the single linear passes.

Partially truncated file.

Also a linear check, but somewhat harder to handle.

 5.   Bad inode format.

Self-inconsistent contents on inodes.

 6.   Blocks not accounted for anywhere.

Every block in the free map that's not there and should be,
because it's not claimed by a directory or inode.  The free map
is the set of bits in the set of all cylinder group bitmaps.
This is the background fsck

Re: 20TB Storage System

2003-09-05 Thread Terry Lambert
David Gilbert wrote:
  Poul-Henning == Poul-Henning Kamp [EMAIL PROTECTED] writes:
 Poul-Henning I am not sure I would advocate 64k blocks yet.
 Poul-Henning I tend to stick with 32k block, 4k fragment myself.
 
 That reminds me... has anyone thought of designing the system to have
 more than 8 frags per block?  Increasingly, for large file
 performance, we're pushing up the block size dramatically.  This is
 with the assumption that large disks will contain large files.

My assumptions on the previous two statements by Poul are:

1)  You cannot trust that a short will be treated as an
unsigned 16 bit value in all cases, so values that
are between 32768 and 65535 may be treated incorrectly.

2)  A fully populated block bitmap byte, which means a divide
by 8, is necessary to avoid potential division errors.

In other words, he's afraid that the sign bit and/or the block
size bitmap used by frags may be treated incorrectly.

I have to agree with both those observations.  A number of people
have, historically, reported issues with a divisor other than 8,
and the worry about the sign bit is common sense, given the many
historical issues faced by other OS's when it comes to 64K block
sizes.
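
The sign bit worry in (1) is easy to demonstrate in isolation.  The
two helpers below are hypothetical and have nothing to do with the
actual FFS code; they only show what happens when a size in the
32768..65535 range passes through a signed 16 bit variable.  The
result of the narrowing conversion is implementation-defined in C,
but on the usual two's complement machines it comes out negative.

```c
#include <assert.h>
#include <stdint.h>

/* A size that takes a detour through a signed short. */
int32_t
size_via_short(uint32_t size)
{
	int16_t s;

	assert(size <= 65535u);
	s = (int16_t)size;	/* the sign bit gets set for 32768..65535 */
	return (s);		/* sign-extended on the way back out */
}

/* The same detour through an unsigned short is harmless. */
int32_t
size_via_ushort(uint32_t size)
{
	uint16_t u;

	assert(size <= 65535u);
	u = (uint16_t)size;
	return (u);
}
```

Any code path that does the first when it meant the second is fine at
16K and broken at 32K and above.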

-- Terry


Re: 20TB Storage System (fsck????)

2003-09-04 Thread Terry Lambert
Max Clark wrote:
 Ohh, that's an interesting snag. I was under the impression that 5.x w/ PAE
 could address more than 4GB of Ram.

The kernel being able to address the RAM does not mean that
the KVA+UVA space is larger than 4G.  At best, you could take
the uiomove/copyin/copyout performance hit, and move both of
these to 4G, each, rather than 4G total.  That still limits you
to 4G.


 If fsck requires 700K for each 1GB of Disk, we are talking about 7GB of Ram
 for 10TB of disk. Is this correct? Will PAE not function correctly to give
 me 8GB of Ram? To check 10TB of disk?

No, it will not.


 Is there anyway to bypass this requirement and split fsck into smaller
 chunks? Being able to fsck my disk is kinda important.

Yes.  Limit the number of CG bitmaps you examine simultaneously,
and make the operation multiple pass over the disk.  This is not
that hard a modification to fsck, and it can be done fairly
quickly by anyone who understands the code.  The cost in time to
fsck the disk will go up inversely proportionally to the amount
of RAM it's allowed to use, which is limited to the UVA size
minus the fsck program size itself, and the fsck buffers used for
things like FS metadata for a given file/directory.
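
As back-of-the-envelope arithmetic, using the 700K per 1GB figure
quoted above: the helpers below are made up for the example, and the
2GB budget in the usage note is an assumed number, not a measured
one.

```c
#include <assert.h>
#include <stdint.h>

/* fsck memory needed, in KB, at kb_per_gb KB of RAM per GB of disk. */
uint64_t
fsck_kb_needed(uint64_t disk_gb, uint64_t kb_per_gb)
{
	return (disk_gb * kb_per_gb);
}

/* Passes over the disk if only budget_kb may be held at once. */
uint64_t
fsck_passes(uint64_t need_kb, uint64_t budget_kb)
{
	assert(budget_kb > 0);
	return ((need_kb + budget_kb - 1) / budget_kb);	/* round up */
}
```

For 10TB (10240GB) of disk, fsck_kb_needed(10240, 700) is 7,168,000
KB, roughly the 7GB mentioned above.  If fsck were allowed 2GB
(2,097,152 KB) of that at a time, fsck_passes() says four passes.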


 I have zero experience with either itanium or opteron. What is the current
 status of support for these processors in FreeBSD? What would the preferred
 CPU be? Will there be PCI cards that I would not be able to use in either of
 these systems?

I have no idea whether these systems support a larger UVA size,
or how much memory you could jam into them...

-- Terry


Re: Ugly Huge BSD Monster

2003-09-01 Thread Terry Lambert
Denis Troshin wrote:
 Almost  every  package  I  install requires a few other packages. This
 'idea   of   using   dependent  packages'  turns  FreeBSD  (and  other
 unix-systems) to an ugly monster.

You're right.  The authors of the offending software packages
should not do that.  It's going to be incredibly hard to get
the FSF to quit using libiberty, getline, gdb, etc., though.


 For  example, I don't need Perl or Python but a few packages I install
 require them.

Don't install those packages?

Provide patches that remove the dependencies, if they are
trivial?

Rewrite the software from scratch, if the dependencies turn
out to be non-trivial?


 Does exist a programming under unix without these dependencies?

Sure.  Anything you are willing to write that doesn't do that.


 P.S.  Under Windows it is possible to write not bad applications which
 depend  just  on  libraries (KERNEL32, USER32, GDI32).  And these libs
 exist on every base system!!!

I beg to differ.  InstallShield has a tendency to install the
NT version of CTL3D.DLL over top of the Windows 95/98 version,
breaking things utterly (as one example).

Also, CRTL32.DLL no longer ships with the base system, but it
is required for a lot of runtime executable code.  It was left
out of the base system in order to force people to distribute
it, and that was done to impose license restrictions on where
the resulting code can be run (i.e. it's free to redistribute
with your applications, so long as you only run them on a
Microsoft OS -- see the VisualDevStudio license next time you
get a chance).


 Is it possible in unix?
 
 Before I thought that unix programs very compact, but they are huge!

You're using the wrong programs.  I'm going to guess you are
installing Gnome or KDE or something like that that has a huge
dependency list because it wants to have a huge feature list.

-- Terry


Re: Looking for detailed documentation: Install to existing filesystem

2003-08-27 Thread Terry Lambert
Charles Howse wrote:
 I'm a hobbyist, and for my personal education, I would like to learn how
 to install FBSD from an existing filesystem, rather than from FTP or CD.
 
 My intention is to copy the files to a directory on the second HDD of my
 present FBSD system, and point sysinstall to that partition/directory
 during the install.

This is fairly easy.  What you do is copy the files to a directory
on a second disk drive on the machine, and point sysinstall to that
partition/directory during the install.

8-).



 I can't seem to find any detailed instructions.  The handbook just
 glosses over it, saying follow the instructions on the screen in
 sysinstall.  I've Googled for days and can only find other people asking
 the same question and talking about their failures.

The easiest thing to do is follow the instructions in sysinstall.

The reason the handbook says this is that the order of dialogs, etc.,
in sysinstall varies from version to version of FreeBSD.

Here is an example of detailed instructions for a particular
version of FreeBSD.  If you have a different version, you will
need to use instructions matching the version of sysinstall you
happen to be using:

C
return
13*cursor down
space
7*cursor up
space
7
(selects option 6 from Choose Installation Media dialog because
 sysinstall is stupid)
return
Enter the pathname of the directory containing the installation files
return
Q
cursor right
(Cancel)
return
U
(for Upgrade)
return
(disclaimer screen)
return   
return
(to begin upgrade)


 I need to know:
 
 1) What files do I need to have on the partition from which I will be
 installing?

Everything from the first CDROM, in the same directory hierarchy.
It's easiest if you just mount a CDROM, or mount a vnconfig'ed
CDROM ISO image as a cd9660 FS.


 2) ftp address and directory where I can find those files.

ftp://ftp.freebsd.org/

The exact subdirectory depends on which particular version you
are trying to upgrade to.


 3) Can FBSD install from the .iso files?

Yes, if you vnconfig them and mount them up as FS's so that to
the system they appear as CDROMs, instead of ISOs.

 4) A link to a tutorial or howto would really be nice. If none exists, I
 might consider writing one once I figure out how to do it properly.

It's at least slightly different for every version of the system
(see above).  That's also why you didn't find one.

PS: You should also vnconfig the floppy image on the vnconfig'ed
ISO image, and pull sysinstall off there, instead of using your
local copy.  The sysinstall program has some string lists which
are hard-coded, and may also vary from version to version.  As it
is a crunchgen'ed file, you will need to name it sysinstall
for it to work.  I suggest copying it to /tmp, and running it
from there.
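
The naming requirement follows from how a crunched binary works:
several programs are linked into one executable, and the one that
runs is chosen from the name it was invoked under.  A minimal sketch
of that dispatch in C, where the table entries and the stub functions
are hypothetical and are not taken from real crunchgen output:

```c
#include <stddef.h>
#include <string.h>

typedef int (*entry_fn)(void);

/* Hypothetical stand-ins for programs linked into the crunched binary. */
static int sysinstall_main(void) { return 0; }
static int sh_main(void) { return 1; }

static const struct {
    const char *name;   /* name the binary must be invoked as */
    entry_fn entry;     /* program to run for that name */
} stubs[] = {
    { "sysinstall", sysinstall_main },
    { "sh", sh_main },
};

/* Return the last path component of argv[0]. */
static const char *base_name(const char *path)
{
    const char *slash = strrchr(path, '/');
    return slash != NULL ? slash + 1 : path;
}

/* Map the invoked name to an entry point; NULL if the name is unknown. */
entry_fn lookup_stub(const char *argv0)
{
    const char *name = base_name(argv0);

    for (size_t i = 0; i < sizeof(stubs) / sizeof(stubs[0]); i++)
        if (strcmp(stubs[i].name, name) == 0)
            return stubs[i].entry;
    return NULL;
}
```

With this shape, a copy saved as /tmp/sysinstall resolves to the
installer, while the same file under any other name matches nothing.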

PPS: I've posted detailed instructions on doing this at least
three times in the past three years, since I needed to upgrade
over NFS to a machine without anything but a local copy of
FreeBSD that could be booted; start looking around June of 2001
in the -current and -hackers archives.

-- Terry
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: GEOM Gate.

2003-08-19 Thread Terry Lambert
Attila Nagy wrote:
 Terry Lambert wrote:
  It works on firewire and it works on a dual port RAID array (as a
  separate box containing the RAID array).
 
 What does 'it' means? I guess it's not UFS, but the pure ability of
 sharing a device on a bus, connected to more than one adapters.

The "it" was the subject of the previous sentence, which you diked
out; that's how prepositional phrases work in English.  8-).  In
other words, multiple access to the same device from one or more
SCSI controllers.


  SAN and NAS are also options, but of course, you still have to have
  an FS that can deal with it, and an external locking protocol.
 
 Right, we were talking about FreeBSD, which lacks such a filesystem :(

I've said it before, and I'll say it again: porting GFS would be
a really trivial amount of work, taking almost no creativity to do;
the last time this subject came up and Sistina was offering to
change their license, I ported all the user space utilities in under
a day.  I didn't finish off the whole FS port because I lacked the
necessary disk drives and FreeBSD lacked the necessary controller
driver for those disk drives, and the active maintainers claimed
that they had a port in progress.

These types of things are primarily busy-work and a way to spend
money on hardware I'll likely never use in a production environment
to end up with code under a license that prevents me from using it
in a commercial product.  That makes doing the work very uninteresting.

-- Terry


Re: GEOM Gate.

2003-08-16 Thread Terry Lambert
Attila Nagy wrote:
 Pawel Jakub Dawidek wrote:
  It'll be, but probably in read-write mode on one machine and read-only
  mode on rest machines, because you don't export file systems here, but
  disk devices.
 
 This doesn't work on a shared SCSI bus, so I suspect sharing the device
 on the net won't help.

It works on firewire and it works on a dual port RAID array (as a
separate box containing the RAID array).

It's supposed to work on SCSI III, but the vendors can't quit their
arguing and jockeying for advantage long enough to approve the
range locking specification (which is why GFS uses a network daemon).

SAN and NAS are also options, but of course, you still have to have
an FS that can deal with it, and an external locking protocol.

-- Terry


Re: a.out binaries

2003-08-15 Thread Terry Lambert
S.Gopinath wrote:
  I'm required to run a.out binaries like foxplus
  in a recent Intel based hardware. I have chosen
  FreeBSD 5.1 and successfuly installed. But I could
  not run a.out binaries like Foxplus. I tried it by
  load ibcs modules and aout modules in /boot/kernel
  directory. My foxplus did not work.
 
  I require your suggestions regarding this.
 
  I may not use FreeBSD 2.1 version as I require
  driver for Adaptec 7902 (Ultra Wide SCSI 320).

The a.out file format is a file format, not an ABI definition.

In the particular case you are talking about, the a.out file
you are trying to run is an SCO Xenix ABI binary, rather than
a 386BSD/FreeBSD 1.x/2.x binary.

In order to support this, you would need to support the Xenix
system call entry points, and to recognize the file as a Xenix
a.out binary, and to trap to a Xenix system call entry table,
rather than a FreeBSD or some other entry point table.  This
is actually not a very difficult thing to do (for example, you
do not have to worry about shared libraries), but it requires
kernel modifications in order to support it.
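
To illustrate the split between file format and ABI, here is a
minimal user-space sketch of the recognition step: the image header
only selects which system call entry table the process would be bound
to.  The enum and function names are hypothetical, and the 0x0206
magic value is an assumption based on what file(1) reports as
"Microsoft a.out"; none of this is existing FreeBSD kernel code.

```c
#include <stddef.h>
#include <string.h>

/* Which system call entry table an image would be bound to. */
enum abi { ABI_UNKNOWN, ABI_ELF, ABI_XENIX };

/*
 * Assumption: the first two bytes, read little-endian, are 0x0206 for
 * the images file(1) labels "Microsoft a.out".  Illustrative only.
 */
#define XOUT_MAGIC 0x0206u

/* Classify an image from its first few header bytes. */
enum abi classify_image(const unsigned char *hdr, size_t len)
{
    if (len >= 4 && memcmp(hdr, "\177ELF", 4) == 0)
        return ABI_ELF;
    if (len >= 2 && ((unsigned)hdr[0] | ((unsigned)hdr[1] << 8)) == XOUT_MAGIC)
        return ABI_XENIX;
    return ABI_UNKNOWN;
}
```

Recognizing the header is the easy part; the kernel work described
above is the entry table that a Xenix result would have to select.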

This is not supported at this time.

If you have a Xenix system with development tools on it, a
competent kernel programmer who knew both platforms and had
access to the tools could probably have you up and running in
a few weeks, at most.

-- Terry


Re: IP Network Multipathing failover on FreeBSD..??

2003-08-15 Thread Terry Lambert
maillist bsd wrote:
 Is it there have IP Network Multipathing failover on FreeBSD..??  how to do so??

Look for VRRP in /usr/ports/net.

-- Terry


Re: your mail

2003-08-15 Thread Terry Lambert
Kris Kennaway wrote:
 On Thu, Aug 14, 2003 at 03:14:27PM +0530, S.Gopinath wrote:
   I'm required to run a.out binaries like foxplus
   in a recent Intel based hardware. I have chosen
   FreeBSD 5.1 and successfuly installed. But I could
   not run a.out binaries like Foxplus. I tried it by
   load ibcs modules and aout modules in /boot/kernel
   directory. My foxplus did not work.
 
 Do you have the FreeBSD 2.x/3.x/4.x compatibility libraries installed
 (via sysinstall, or by setting the appropriate COMPAT* variable in
 /etc/make.conf)?
 
 If so, then please post an exact description of the commands you have
 run and the errors you receive.

FoxPro is the later version of FoxBase, which was a Xenix clone
of the Ashton-Tate dBase III software.

The a.out binaries he's talking about are for one of the Xenix
platforms that FoxPro ran on.  The only ones I'm aware of that
it ran on that were a.out were SCO Xenix 2.1.[23], 3.x, and
Altos Xenix (386 platforms: Altos 686 and 886).

He might also mean the Intel 320's running Intel Xenix, but I'm
pretty sure that FoxBase was the only thing that ran on those
platforms.

-- Terry


Re: Fw: your mail

2003-08-15 Thread Terry Lambert
S.Gopinath wrote:
  $ foxplus
  /usr/lib/foxplus/no87: 1: Syntax error: newline unexpected (expecting ))
  /usr/lib/foxplus/foxplus.pr: 1: Syntax error: word unexpected (expecting
  ))
  $ file /usr/lib/foxplus/no87
  /usr/lib/foxplus/no87: Microsoft a.out separate pure segmented
 word-swapped
  V2.3 V3.0 286 small model executable Large Text Large Data
  $ file /usr/lib/foxplus/foxplus.pr
  /usr/lib/foxplus/foxplus.pr: Microsoft a.out separate pure segmented
  word-swapped not-stripped V2.3 V3.0 386 small model executable not
 stripped

This is not a Microsoft Xenix binary, contrary to what file
claims.

This is an SCO Xenix binary.  It's not going to work unless you
make an execution class loader for the a.out magic number, add
the system call entry point mechanism for it, and then emulate
the system calls we don't support, and provide stub-calls for the
ones we do support, with appropriate translation of any manifest
constant values that differ between FreeBSD and Xenix.

Do you have a Xenix system with the development kit?  This would
be a trivial project for someone who knew FreeBSD and had access
to a Xenix development system.

-- Terry


Re: Ultra ATA card doesn't seem to provide Ultra speeds.

2003-08-02 Thread Terry Lambert
John-Mark Gurney wrote:
 Ruben de Groot wrote this message on Fri, Aug 01, 2003 at 10:15 +0200:
  On Fri, Aug 01, 2003 at 04:33:08AM +0200, mh typed:
  The following comparison is probably bogus, but can anybody explain the
  huge difference?
 
 It's called micro optimization.  Linux feels the need to special case
 /dev/zero to /dev/null, and instead of even reading/writing the data,
 It just ignores the user request, (or does something like set the pages
 in the user space to be zero'd.
 
 Also, dual procs won't help your performance when you run a single
 process like this.

They will if you interleave the page zero'ing they do on both
CPU's... 8^p.

-- Terry


Re: [patch] Re: getfsent(3) and spaces in fstab

2003-08-02 Thread Terry Lambert
Simon Barner wrote:
 The attached patch will allow blanks and tabs for file systems and
 path names, as long as the are protected by a '\'.
 
 For the old fstab style, blanks and tabs are not allowed as delimiters
 (as it was in the old implementation).

You need to add '\\' to the delimiter list, so that it is not
skipped.

You need to add '\\' to the list of characters that can be escaped,
or you've just traded the inability to specify '\t' or ' ' for an
inability to specify '\\'.
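
A minimal sketch of a field scanner with both properties: the
backslash is examined rather than skipped, and a backslash can escape
itself as well as a blank or a tab.  The function name next_field and
its interface are hypothetical; this is not the code from the patch
under discussion.

```c
#include <stddef.h>

/*
 * Copy the next blank/tab-delimited field from *pp into out, treating
 * backslash-blank, backslash-tab, and backslash-backslash as escaped
 * literal characters.  Advances *pp past the field.  Returns the field
 * length, or -1 if the field does not fit, the escape is unknown, or
 * the line ends in a lone backslash.
 */
int next_field(const char **pp, char *out, size_t outsz)
{
    const char *p = *pp;
    size_t n = 0;

    if (outsz == 0)
        return -1;

    while (*p == ' ' || *p == '\t')     /* skip leading delimiters */
        p++;

    while (*p != '\0' && *p != ' ' && *p != '\t') {
        if (*p == '\\') {
            p++;                        /* look at the escaped character */
            if (*p != ' ' && *p != '\t' && *p != '\\')
                return -1;              /* unknown or dangling escape */
        }
        if (n + 1 >= outsz)
            return -1;                  /* no room for char plus NUL */
        out[n++] = *p++;
    }
    out[n] = '\0';
    *pp = p;
    return (int)n;
}
```

Each decision needs only the one character after the backslash, so a
path such as /mnt/test\ 1 comes back as a single field with the blank
intact.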

-- Terry


Re: Assembly Syscall Question

2003-08-02 Thread Terry Lambert
Matthew Dillon wrote:
 I think the ultimate performance solution is to have some explicitly
 shared memory between kerneland and userland and store the arguments,
 error code, and return value there.  Being a fairly small package of
 memory multi-threading would not be an issue as each thread would have
 its own slot.

You need 8K for this, minimally: 4K that's RO user/RW kernel and
4K that's RW user/RW kernel.  You can use it for things like zero
system call getpid/getuid/etc.

It's also worth a page doubly mapped, with the second mapping with
the PG_G bit set on it (to make it RO visible to user space at the
same place in all programs) to hold the timecounter information;
the current timecounter implementation, with a scad of structures,
is both wasteful and unnecessary, given that pointer assigns are
atomic, so you can implement it with only two, which only take a small
part of the page.  Doing this, you can use a pointer reference and
a structure assign, and a compare-pointer-afterwards to make a zero
system call gettimeofday() and other calls (consider the benefits
to Apache, Squid, and other programs that have hard logging with
timestamp requirements).
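
The read protocol can be sketched in user-space C as follows.  The
struct layouts and function names are hypothetical, the shared page
mapping itself is not shown, and C11 atomics stand in for the "pointer
assigns are atomic" assumption; treat it as an illustration of the
idea rather than a proposed implementation.

```c
#include <stdatomic.h>
#include <stdint.h>

struct timeinfo {                 /* hypothetical layout of one snapshot */
    int64_t sec;
    int64_t usec;
};

struct timepage {                 /* stands in for the shared RO page */
    struct timeinfo slot[2];      /* only two structures are needed */
    _Atomic(struct timeinfo *) current;   /* points at the valid slot */
};

/* Writer (the kernel in the real scheme): fill the idle slot, publish. */
void timepage_update(struct timepage *tp, int64_t sec, int64_t usec)
{
    struct timeinfo *cur = atomic_load(&tp->current);
    struct timeinfo *next =
        (cur == &tp->slot[0]) ? &tp->slot[1] : &tp->slot[0];

    next->sec = sec;
    next->usec = usec;
    atomic_store(&tp->current, next);     /* single atomic pointer assign */
}

/* Reader: pointer reference, structure assign, compare pointer after. */
struct timeinfo timepage_read(struct timepage *tp)
{
    struct timeinfo *before;
    struct timeinfo snap;

    do {
        before = atomic_load(&tp->current);
        snap = *before;                   /* may overlap with an update */
    } while (before != atomic_load(&tp->current));
    return snap;
}
```

One caveat of this sketch: with only two slots, two updates during a
single read bring the pointer back to its starting value, so the final
compare would not notice; it assumes updates are rare relative to the
time a read takes.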

I've also been toying with the idea of putting environp ** in a COW
page, and dealing with changes as a fixup operation in the fault
handler (really, environp needs to die, to make way for logical name
tables; it persists because POSIX and SuS demand that it persist).

So, Matt... how does the modified message based system call
interface fare in a before-and-after with lmbench?  8-) 8-).

-- Terry


Re: Assembly Syscall Question

2003-08-01 Thread Terry Lambert
Ryan Sommers wrote:
 When making a system call to the kernel why is it necessary to push the
 syscall value onto the stack when you don't call another function?

The stack is visible in both user space and kernel space; in
general, the register space won't be, unless you are on an
architecture with an abundance of registers that doesn't do a
save/restore on trap entries.

By pushing it onto the stack, you are *positive* that the value
is visible.

There is also the (small) possibility that the C compiler will
take advantage of the calling conventions to assume that a
value will not change over a system call.  Short of declaring
that all registers are volatile, you can't really guarantee
that the registers pushed in will have the values after the
call that they had before the call, unless you save and restore
all of them (which is more expensive than the copyin, for system
calls with 3 arguments or less -- which is most of them; cost,
of course, will vary by architecture).

Personally, I like to look at the Linux register-based passing
mechanism in the same light that they look at the FreeBSD use
of the MMU hardware to assist VM, at the cost of increased
FreeBSD VM system complexity (i.e. they think our VM is too
convoluted, and we think their system calls are too convoluted).

-- Terry


Re: getfsent(3) and spaces in fstab

2003-08-01 Thread Terry Lambert
Chris BeHanna wrote:
 What about
 
 test%201/mnt/test%201   ufs ro  0   0
 
 ?
 Ugly, yes, but that's how a lot of tools escape spaces.

% is almost infinitely more likely in a path than \; better
to use the \ than the % mechanism.  Also, the parser can
be LALR single token lookahead, whereas % requires pushing state
(2 token lookahead).

-- Terry


Re: usb (ucom) driver code comments..?

2003-07-31 Thread Terry Lambert
M. Warner Losh wrote:
 In message: [EMAIL PROTECTED]
 JacobRhoden [EMAIL PROTECTED] writes:
 : I am trying to get a device working which uses ucom, and the ucom code has no
 : comments whatsoever, I am able to work bits out, I was wondering if there was
 : any sort of documentation whatsoever on this area?
 
 Use the source luke.  However, you'll need to look closely at
 sys/kern/tty* as well.  Looking at something like umodem that
 implements a ucom plug in might be useful too.  You might check out
 the handbook too, but I think that the usb docs there don't
 specifically cover usb.

There was also a patch posted to the -current or -hackers mailing
list around about the end of May that switched 0 and 1 and
made this sort of thing work for another USB device.

-- Terry


Re: gcc segfault on -CURRENT (cvs yesterday)

2003-07-31 Thread Terry Lambert
Kai Mosebach wrote:
 Trying to compile sapdb fails on a -CURRENT system build yesterday.
 
 On a system from 22.July it compiled fine.
 
 Any ideas ?

This is pretty ugly, but put a space before the ::'s on that
line.

-- Terry


Re: Console serial speed

2003-07-29 Thread Terry Lambert
Russell Cattelan wrote:
 On Sat, 2003-07-26 at 07:12, Daniel Lang wrote:
  Bruce M Simpson wrote on Sat, Jul 26, 2003 at 10:06:36AM +0100:
   On Fri, Jul 25, 2003 at 01:06:28PM -0500, Russell Cattelan wrote:
How does one set the serial speed of the console.
  
   Does specifying BOOT_COMCONSOLE_SPEED=57600 not work?
 
  No, I've experienced the same problem years ago.
  The funny thing is, that it worked on some machines,
  while it didn't on others.
 
  I worked around the problem by putting
  machdep.conspeed=38400
  in /etc/sysctl.conf, so the speed is reset to the right
  speed, once the system is up.
 
  Of course this doesn't work for boot2, loader or the kernel
  itself. These three components seem to set their console speed
  in some cases arbitrarly.
 Yes this seems to be the case.
 No matter how hard I try to convince sio.c to default to 57600
 (I even went as far at setting
 static volatile speed_t comdefaultrate = CONSPEED;
 to 57600)
 so there would be no confusion.
 I still end up with random console speed each time.
 
 The boot loader did pick up the speed from /etc/make.conf and
 does come up every time at 57600.

You also need to modify the /etc/ttys entry to specify std.57600
(see the examples for /dev/ttyd[0-3]) and to set:

options CONSPEED=57600

in your config file.

-- Terry


Re: vpo in ECP

2003-07-29 Thread Terry Lambert
Paulo Roberto wrote:
 Sorry hackers, I have posted this to [EMAIL PROTECTED], but got no
 answer...
 
 I did set my mainboard BIOS to use ECP transfer mode (dma 3  irq 7). I
 edited my kernel to:
 
 device ppc0 at isa? flags 0x8 irq 7
 
 (is there a way to declare the dma I want to use? config complains if I
 add dma xyz)
 and when I boot I get:

Not sure if it's actually meaningful for the driver, but the
config file syntax is:

device ppc0 at isa? flags 0x8 irq 7 drq 3


 Jul  1 10:36:42 delta /kernel: ppc0: Parallel port at port
 0x378-0x37f irq 7 flags 0x8 on isa0
 Jul  1 10:36:42 delta /kernel: ppc0: Generic chipset (ECP-only) in ECP
 mode
 Jul  1 10:36:42 delta /kernel: ppc0: FIFO with 16/16/16 bytes threshold
 Jul  1 10:36:42 delta /kernel: ppi0: Parallel I/O on ppbus0
 Jul  1 10:36:42 delta /kernel: imm0: NIBBLE mode unavailable!
 
 and my zip-100 drive does not get recognized. Is it possible to use vpo
 in ECP mode?? EPP and compatible modes are just too damn slow.

man ppc.  The answer is it depends on your chipset; clearly,
it's not had anyone write explicit support for it yet.  The manual
page goes into some good detail on how to fix this; specifically,
you should look at the "Adding support to a new chipset" section.

See also man vpo; it discusses how to force specific modes for
your given chipset.  Probably, you should add specific support for
your chipset instead, so that other people who end up with the
same chipset don't end up having to repeat your problems.

-- Terry


Re: BCM4401 Support for FreeBSD

2003-07-29 Thread Terry Lambert
Joe Marcus Clarke wrote:
 On Mon, 2003-07-28 at 12:18, Aeefyu wrote:
  i.e. Broadcom 440x NIC support for FreeBSD 4.x and 5.x (as found on
  latest Dell's Notebooks - mine is a 8500)
 
  Would anyone  be so kind to enlighten me on the the current status?
  Last I heard of developments being made was end of June.
 
 This was forwarded to me from Greg Lehey.  The dcm driver works okay for
 me, but I had to hack it for some new bus dma changes.  I have noticed a
 few issues of slowness with it when using it in a more normal sense
 (i.e. using it to read mail, ssh to machines, etc.).


Most likely, the interactive performance is a result of the RX
packet coalescing, with a timer set to too long an interval.  In
general, it's probably an easy adjustment to make.

I notice that he also says he didn't know what to add into some
of the timer and media routines, so that could be part of the
issue as well (e.g. if it's getting wedged and having to unwedge
itself).

BTW: As a rule of thumb, if you don't know what to put in a timer
or media routine, put a printf() there.  It will likely annoy
someone into fixing it.  8-).

-- Terry


Re: Probing for devices

2003-07-25 Thread Terry Lambert
Geoff Glasson wrote:
 I'm trying to port the Linux i810 Direct Rendering Interface ( DRI ) kernel
 module to FreeBSD.  I have reached the point where the thing compiles, and I
 can load it as a kernel module, but it can't find the graphics device.
 
 Through a process of elimination I have come to the conclusion that once the
 AGP kernel module probes and attaches to the i810 graphics device, nothing
 else can attach to it.  When I read the section on PCI devices it implied (
 to me at least ) that multiple kernel modules should be able to attach to the
 same device.  I have tried to get it to work without any success.

How would you expect the hardware to act if both drivers were
being accessed simultaneously?

In general, multiple device attaches are only possible on
multifunction devices, where the driver claims by function,
rather than merely by PCI ID.

You will likely need to have the two drivers intentionally
cooperate with each other on the management of the device,
in order to accomplish your goal.

If your driver is an actual port of the Linux driver, rather
than a rewrite, licensing dictates that it talk to the FreeBSD
driver and ask it to step out of the way so it can do what it
wants to do, and that it act gracefully if the FreeBSD driver
refuses because someone is using it.  Ideally, your driver
would leave the FreeBSD driver in place, and ask it to do some
of the device management functions you need on behalf of your
driver, rather than your driver attempting to do them directly.

If your driver is in the exact same ecological niche (e.g. it
provides the same AGP services the FreeBSD driver does), then
you will need to ensure that only one driver is ever loaded
into the kernel at a time.

-- Terry


Re: Where / how to begin the FreeBSD development journey?

2003-07-23 Thread Terry Lambert
Shawn wrote:
  On Tue, 2003-07-22 at 02:04, Terry Lambert wrote:
  There's a wide range of options, from the expensive to the free
  online stuff.  At the high end, we have:
 
  $1300 https://www.mckusick.com/courses/introorderform.html
  $1500 https://www.mckusick.com/courses/advorderform.html
 
 Sadly, he doesn't teach these classes anymore (due to lack of students),
 and I don't live anywhere near there. Also, the price is indeed
 prohibitive.

It's the price for the video tapes of the class.  I agree that
it's prohibitive.  He states at the link, though, that he gives
a 50% discount to students, and is willing to work with hardship
cases to negotiate a fair price.  No harm in asking him about it.


http://www.vsi.ru/library/Programmer/fbsdkern/
 
 Nice to know, the information in the Developer's Handbook seems to be
 far more recent and thorough although admittedly PCI DMA information is
 currently out of date according to the document.

The steady state of all documentation is "needs to be updated"...

-- Terry


Re: Where / how to begin the FreeBSD development journey?

2003-07-22 Thread Terry Lambert
Shawn wrote:
 I did peruse the bugs list at the FreeBSD web site curious as to what
 the current outstanding issue list was, and felt compelled to see if
 there was anything left open that I might put my hand to and felt a bit
 overwhelmed. I noticed that there are over 2,000 some entries with some
 dating as far back as 1996. So, I wasn't exactly sure where one would
 begin there either.

There's a wide range of options, from the expensive to the free
online stuff.  At the high end, we have:

$1300   https://www.mckusick.com/courses/introorderform.html
$1500   https://www.mckusick.com/courses/advorderform.html

At the low end, we have things like:

http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/index.html
http://www.vsi.ru/library/Programmer/fbsdkern/


You could also look at the Blue Prints column at the web site

http://www.daemonnews.org/

http://search.atomz.com/search/?sp-q=blueprintssp-k=Monthly+Ezinesp-a=sp10015f36

There is also a new users section there:

http://www.daemonnews.org/new2bsd/

And there is always the mailing list archives:

http://docs.freebsd.org/mail/

-- Terry


Re: Communications kernel - userland

2003-07-21 Thread Terry Lambert
Robert Watson wrote:
 Of these approaches, my favorite are writing directly to a file, and using
 a psuedo-device, depending on the requirements.  They have fairly
 well-defined security semantics (especially if you properly cache the
 open-time credentials in the file case).  I don't really like the Fifo
 case as it has to re-look-up the fifo each time, and has some odd blocking
 semantics.  Sockets, as I said, involve a lot of special casing, so unless
 you're already dealing with network code, you probably don't want to drag
 it into the mix.  If you're creating big new infrastructure for a feature,
 I suppose you could also hook it up as a first class object at the file
 descriptor level, in the style of kqueue.  If it's relatively minor event
 data, you could hook up a new kqueue event type.  You could also just use
 a special-purpose system call or sysctl if you don't mind a lot of context
 switching and lack of buffering.

I like setting the PG_G bit on the page involved, which maps it
into the address space of all processes.  8-).

-- Terry


Re: running 5.1-RELEASE with no procfs mounted (lockups?)

2003-07-18 Thread Terry Lambert
Pawel Jakub Dawidek wrote:
 + truss   Relies on the event model of procfs; there have been some
 +  initial patches and discussion of migrating truss to ptrace() but
 +  I don't think we have anything very usable yet.  I'd be happy to
 +  be corrected on this. :-)
 
 Hmm, why to change this behaviour? Is there any functionality that
 ktrace(1) doesn't provide?

It can interactively run in another window, giving you realtime
updates on what's happening up to the point of a kernel crash.
With ktrace, you are relatively screwed.

Another good example is that it dumps out information that ktrace
can't, because of where it synchronizes.  Some people recently
have been seeing EAGAIN when they haven't expected it, with
the process exiting immediately after that, with no real clue
as to where in the code it's happening (e.g. which system call);
truss will show this, if run in another terminal window, but
ktrace will not (yes, I know it should; it doesn't.  If you can't
reconcile this with how you think ktrace should work, then fix it).

-- Terry


Re: complicated downgrade

2003-07-18 Thread Terry Lambert
Valentin Nechayev wrote:
 I need to downgrade a remote FreeBSD system from 5.1-release to 4.8-release
 remotely without any local help (except possible hitting Reset).
 Don't ask why the collocation provider is too ugly and too far from me; it's
 given and unchangeable. This system never was 4.* (began from 5.0-DP2).

I have done this before.

The best way to deal with this is to create a 5.1 system locally,
and deal with all the problems that come up there, transplanting
the resulting scripts to the system in question.

Your biggest problems are going to be the creation of the /dev,
which will need to occur in an rc.local on reboot, replacing the
disklabel boot code, and the changes to the conf file for ssh to
operate correctly (you will likely need to regenerate keys).

If you can't remotely NFS mount a CDROM, a lot of the work is
going to be getting access to installation media.

-- Terry


Re: MSDOSFS patch of dirty flag (Darwin Import)

2003-07-18 Thread Terry Lambert
Jun Su wrote:
 I began to import some code from Darwin msdosfs. Here
 is my first patch about the dirty flag. I patched the
 msdosfs kernel module and fsck_msdos to enable the
 flag. Can someone test it and checked in? Must I
 submit a PR?
 
 From my own option, the new features of Darwin's
 msdosfs are dirty flag, adv_lock and unicode name. I
 will check them in the next week. Do these features
 have chance to commit?

Not if you never get them published.

If you want to send attachments to the mailing list and
have them get through, you need to send them as text/plain.

The best way to see why your patches are not making it to
the mailing list is to look at your last patch posting, and
see what's different between your signature and your file
attachments, and make your file attachments look like your
signature attachment (since it got through).

On a side note, you probably do not want to crosspost between
the -hackers and -current lists so much.  A lot of us are
subscribed to both of them, so we get two copies.  For some
of us, this triggers SPAM filtering, and we never see your
posts, unless we save SPAM to a folder instead of deleting it,
and go look at it occasionally.

-- Terry


Re: Communications kernel - userland

2003-07-18 Thread Terry Lambert
Marc Ramirez wrote:
 I asked this in -questions, but got no response; sorry for the repost.
 
 I have a device driver that needs to make requests for data from a
 userland daemon.  What's the preferred method for doing this in 4.8R and
 5.1R?  I'm assuming the answer is Unix-domain sockets...

It depends on the application.  In most cases these are set up
as request/response protocols.

In that case, the best method is to use an ioctl() or fcntl()
(which you use depends on what in the kernel is talking to
userland), and then return to user space with the request.
The userland then makes another call back down with the response,
and the next wait-for-request.  This saves you fully 50% of the
protection domain crossing system calls from an ordinary callback,
and it saves you 300% of the protection domain crossings of what
you would need for a pipe/FIFO/unix-domain-socket.

E.g.:

        user                            kernel
        --------------------------------------------------------------
REQ1                                    make_req()
                                        sleep_waiting_for_available()
        ioctl(fd, MY_GETREQ, req)
                                        sleep_waiting_for_req()
                                        copyout()
                                        sleep_waiting_for_rsp()
        ioctl(fd, MY_RSPREQ, req)
                                        sleep_waiting_for_req()
                                        copyin()
                                        ...
REQ2                                    make_req()
                                        copyout()
                                        sleep_waiting_for_rsp()
        ioctl(fd, MY_RSPREQ, req)
                                        sleep_waiting_for_req()
                                        copyin()
                                        ...
        ...
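
The user side of the exchange diagrammed above can be sketched as a
loop that issues one call to pick up the first request and thereafter
one call per request, each delivering a response and waiting for the
next request.  Only the names MY_GETREQ and MY_RSPREQ come from the
diagram; the command values, struct my_req, and the in-memory
fake_xfer() standing in for the driver are hypothetical, so that the
loop can be exercised without a device.

```c
#include <stddef.h>

/* Hypothetical request record; the command values are made up. */
struct my_req {
    int  id;
    char payload[64];
};
#define MY_GETREQ 1UL
#define MY_RSPREQ 2UL

/* Same shape as a thin wrapper around ioctl(2) on the driver's fd. */
typedef int (*xfer_fn)(int fd, unsigned long cmd, struct my_req *req);

/*
 * Daemon loop: fetch the first request, then use a single call to both
 * deliver each response and wait for the next request.  Returns the
 * number of requests handled before xfer reports an error.
 */
int serve(int fd, xfer_fn xfer, void (*handle)(struct my_req *))
{
    struct my_req req;
    int handled = 0;

    if (xfer(fd, MY_GETREQ, &req) != 0)       /* wait for first request */
        return handled;
    for (;;) {
        handle(&req);                         /* build response in place */
        handled++;
        if (xfer(fd, MY_RSPREQ, &req) != 0)   /* respond, wait for next */
            return handled;
    }
}

/* In-memory stand-in for the kernel side of the conversation. */
static int fake_pending = 3;     /* requests the "kernel" still has */
static int fake_last_rsp = -1;   /* id of the last response received */

int fake_xfer(int fd, unsigned long cmd, struct my_req *req)
{
    (void)fd;
    if (cmd == MY_RSPREQ)
        fake_last_rsp = req->id;              /* the "copyin" step */
    if (fake_pending == 0)
        return -1;                            /* nothing left to ask */
    req->id = fake_pending--;                 /* the "copyout" step */
    return 0;
}

void echo_handler(struct my_req *req)
{
    req->payload[0] = '\0';      /* a real daemon would compute a reply */
}
```

In a real daemon the xfer argument would be a wrapper that calls
ioctl(fd, cmd, req) on the driver's device node, with the command
codes taken from the driver's own header.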

-- Terry


Re: running 5.1-RELEASE with no procfs mounted (lockups?)

2003-07-18 Thread Terry Lambert
John Baldwin wrote:
 Since ktrace logs all syscall entries and exits, it should seem that
 a kdump after the process had exited would show which syscall returned
 EAGAIN quite easily.

This works if the process exits after the EAGAIN; that would only
work for the specific error that people are seeing currently.  If
the process does what it's supposed to do when it sees EAGAIN, and
repeats the call, you could get in a tight loop.

The ktrace output could be examined after killing the process,
but until the process exits, there's really no output that can
be examined using kdump.

The problem is that ktrace/kdump rendezvous at a file; truss does
not, so it has some capabilities that ktrace does not.  In some
circumstances (e.g. a system crash, where kdump doesn't get a
chance to get at the file, because it's either cleaned up or,
when it's not cleaned up, not even fully written) ktrace loses
utterly.

This is not to say that it's not a useful tool (I use it myself);
just that truss has some utility in situations where a tighter
coupling between the tracing and the display of the trace
information is useful.

My second example is a much better case; my first one was mostly
designed for the current discussion about EAGAIN, whereas truss
is most useful compared to ktrace when there is an actual system
crash and/or an application that doesn't exit [ab]normally, and so
never gives you a synchronized trace file to play with.

It's really all about loosely coupled synchronization (ktrace) vs.
tightly coupled synchronization (truss).

With truss, you can even expect that in many circumstances, you
will at least get boundary information, even in the face of a
system crash -- this is a situation that ktrace would lose for
sure, if the crash couldn't sync.

-- Terry


Re: USB device programming with ugen [Solved]

2003-07-15 Thread Terry Lambert
Martin wrote:
 Seems it was my fault. I found a solution in forums. I had to try
 out many things until someone has pointed me to BIOS settings and
 assigning interrupt to USB. I noticed it was off and enabled it.
 It works now. :)
 
 (E.a.: no delays while accessing ugen0 and no freeze with X11.)
 
 Martin
 
 PS.: perhaps someone should mention it somewhere in the FAQ.

The attach should have failed, if it didn't have an interrupt.

-- Terry


Re: portupgrade

2003-07-12 Thread Terry Lambert
Andrew Konstantinov wrote:
   I've written a simple script to make my life easier, but there is a
 problem with that script and I can't figure out the source of that problem.
[ ... ]
   The problem is simple. Whenever this script confronts a program which
 needs to be upgraded, portupgrade removes the old version and then script
 terminates without any error messages, while several instances of bash
 and linker continue to run in the background for several seconds, and then
 also terminate.

Try using the real shell instead of bash, and let us know if the
problem still happens.

-- Terry


Re: What ever happened with this? eXperimental bandwidthdelayproduct code

2003-07-10 Thread Terry Lambert
Dan Nelson wrote:
 In the last episode (Jul 09), Max Clark said:
   600/8*.220 = 165Kbytes or 1.32Mbit/s
 
  I understand the BDP concept and the calculation to then generate the
  tcp window sizes. What I don't understand is this...
 
  How in the world is a windows 2000 box running commercial software
  able to push this link to 625KByte/s (5Mbit/s)
 
 Perhaps it defaults to a larger window size?  You can easily verify
 this with tcpdump or ethereal.

It's guaranteed that they default to a smaller MSL than the
standard permits.  It's smaller by a factor of 10; to get the
same effect in FreeBSD:

sysctl net.inet.tcp.msl=3000

I *do not* recommend mucking with this timer in order to
reduce latency; there are a number of nasty session restart
and other attacks you can do using this and taking advantage
of intimate knowledge of the TCP state machine implementation
of state transitions, and it's easier to DOS attack your
machine because it's a 10 times shorter trip to run you up
past your tolerable latency.  I would much more recommend the
approaches referenced by my other posting, and listed in the
recent FreeBSD-performance mailing list discussion.

Note that one of the things Microsoft is specifically required
to do when running certain benchmarks is set their registry
values to push the MSL back to the standards mandated value.

-- Terry


Re: What ever happened with this? eXperimental bandwidthdelayproduct code

2003-07-10 Thread Terry Lambert
Max Clark wrote:
 :) hehe...
 
 Okay, let's say how do I force my machine to think it doesn't have any
 latency and saturate a 6Mbit/s link even though the link has 220ms latency?

See the recent discussion on the FreeBSD-performance mailing list.
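
As a rough worked example of the bandwidth-delay product behind the
question (a sketch using the 6Mbit/s and 220ms figures quoted above,
not something taken from the referenced discussion), the amount of
data that has to be in flight to keep the link full is bandwidth times
round-trip time.  The function name bdp_bytes is made up for this
illustration.

```c
#include <stdint.h>

/*
 * Bandwidth-delay product in bytes: the minimum window needed to keep
 * a link of bits_per_sec busy across a round trip of rtt_ms.
 */
uint64_t
bdp_bytes(uint64_t bits_per_sec, uint64_t rtt_ms)
{
    /* bits/s * ms = millibits; /8 gives millibytes, /1000 gives bytes. */
    return bits_per_sec * rtt_ms / 8 / 1000;
}
```

bdp_bytes(6000000, 220) gives 165000, i.e. about 165 Kbytes of window
for this link, which is far more than a 64 Kbyte unscaled TCP window.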

-- Terry


Re: raw socket programming SOLVED

2003-07-08 Thread Terry Lambert
David A. Gobeille wrote:
 Shouldn't the #included files themselves #include headers they are
 dependant on?  With the use of #ifndef and #define in the headers to
 keep them from being #included more than once?
 
 It seems silly(more work) for the programmer to have to arrange
 everything in a specific order.

The order is documented.

Maybe we could have one big header file that declares everything
in the right order so that no one has to think at all?  8-) 8-).

In general, FreeBSD headers are independent.  Unless you have
compiler support for precompiled header files, there is a
significant reduction in compile time for not having promiscuous
headers (headers which include other headers).  There are also
issues having to do with whether or not certain functions or
manifest constants end up in the namespace, and conflict with
user header definitions.  By including only a minimal set of the
headers needed to compile the program, you reduce the possibility
of such conflicts occurring.

As another point: the craftsman should know his tools; in other
words, you should be aware of which headers are necessary, because
you know what you are doing (or are able to figure it out, and
remember it for the next time you need the information).

Finally, POSIX compliance -- strict compliance -- requires that
certain headers *not* define values into the implementation space.
This is particularly problematic when it comes to inline functions
wrapped by macros (for example) or compiler built-ins (for another
example) when you want to get a specific implementation, or when
the compiler built-in is the only way to do something correctly,
as in the recent issues FreeBSD had with alloca() and stack
alignment with certain cputype values.

-- Terry


Re: 5 Advanced networking questions

2003-07-08 Thread Terry Lambert
Socketd wrote:
 Ok, anyway to prevent sending ICMP's when ttl = 0? Or do I need a
 firewall?

I guess you want to do this so that you can break path MTU
discovery and fail to properly exchange packets with the DF
bit set in the headers, and which don't take into account
intermediate links with smaller MTUs, like VPNs or PPPOE
links?

What exactly are you getting from disabling ICMP, besides a
broken network connection to some systems you may wish to be
able to exchange packets with?

-- Terry


Re: kernel hacking

2003-07-08 Thread Terry Lambert
Sandeep Kumar Davu wrote:
 I was making changes to 4.5 source code. I tried to recompile the kernel.
 it compiles well but is not able to link it.
 I used the function inet_aton in uipc_socket.c
 This is the error i got.
 
 uipc_socket.o(.text+0xid8): undefined refernce to '__inet_aton'
 
 I added all the header files that were required.
 
 Can anyone tell what is missing.

You are trying to call a libc function from within the kernel.

In general, you cannot use /usr/include headers in kernel
code, only /usr/include/sys headers.  This is because the
kernel is not linked against libc.

Mostly, you should not be dealing with strings in the kernel;
if you are adding a kernel entry point to be called from user
space, you should convert the address into a sockaddr_in
before you pass it into the kernel, instead.
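
A minimal sketch of doing that conversion in user space, before any
call into the kernel.  The function name make_sockaddr is made up for
this example; inet_aton(3) is the libc routine from the question, used
here where it belongs.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>

/*
 * Convert a dotted-quad string and port into a sockaddr_in.
 * Returns 0 on success, -1 if the string is not a valid address.
 * (On BSD you would normally also set sin->sin_len.)
 */
int
make_sockaddr(const char *dotted, unsigned short port,
    struct sockaddr_in *sin)
{
    memset(sin, 0, sizeof(*sin));
    sin->sin_family = AF_INET;
    sin->sin_port = htons(port);
    /* inet_aton() returns 1 on success and 0 on failure. */
    if (inet_aton(dotted, &sin->sin_addr) == 0)
        return -1;
    return 0;
}
```

The kernel entry point then receives a fixed-size binary structure and
never has to parse a string.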

-- Terry


Re: 5 Advanced networking questions

2003-07-08 Thread Terry Lambert
Socketd wrote:
  I guess you want to do this so that you can break path MTU
  discovery and fail to properly exchange packets with the DF
  bit set in the headers, and which don't take into account
  intermediate links with smaller MTUs, like VPNs or PPPOE
  links?
 
  What exactly are you getting from disabling ICMP, besides a
  broken network connection to some systems you may wish to be
  able to exchange packets with?
 
 I don't want to disable ICMP, just don't want to respond when ttl=0,
 meaning when my firewall/gateway is on a traceroute path.

You should specifically modify the ICMP code to not respond
to echo datagrams, or when ttl == 0, then, and work it that
way.  In other words, it's time to hack your network stack
to specifically add that feature.

-- Terry


Re: Adding second-level resolution to cron(8).

2003-07-08 Thread Terry Lambert
Rich Morin wrote:
 I have a project for which I need a generalized time-based scheduling
 daemon.  cron(8) is almost ideal, but it only has minute-level resolution.
 So, I'm thinking about modifying cron to add second-level resolution.
 
 Before I start, I thought I'd ask a few questions:
 
 *  Has someone already done this?

There are a couple of programs that can do this, but they
aren't cron, per se.

 *  Is it a Really Bad Idea for some reason?

One really bad thing about using cron as your starting point
is that each time it wakes up, it stats the crontab file to
see if it has changed (historical cron implementations needed
to be sent a SIGHUP instead).  It's also lazy about time
changes.

The net effect of the first is that it wildly spins updating
atime on the file; this is bad from a number of perspectives,
but the worst thing about it is that there are a limited number
of duty cycles on things like flash memory ATA devices that
you would use in an embedded system, so you end up having to
do a lot of work to get a copy of the crontab into a ramdisk.

The net effect of the second is that cron would, in effect,
need to increase its rate of stat to once a second, which
means sixty times less MTBF, even if we are talking about a disk
inode write, rather than a flash device.  Again, the fix would
have to be to move the crontab somewhere else.

Really, this type of thing needs a timer set to go off on or
after a specific time, rather than using an interval timer, or
the code needs to get a lot smarter about calculating when the
next work needs to be done, and use a kqueue to watch the file
for modifications, instead, so it can use a select to implement
both the watching and the interval timer.  It also needs to be
smarter about not rereading the file every time it needs to run,
and using a cached copy instead.
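
The "calculate when the next work needs to be done" part can be
sketched as below.  This is only the arithmetic, under the assumption
of a single job that runs every `period` seconds; the function names
are made up, and the file watching is left out entirely.

```c
#include <time.h>

/*
 * Absolute time of the next run: the first multiple of `period`
 * seconds strictly after `now`.  `period` must be greater than zero.
 */
time_t
next_run(time_t now, time_t period)
{
    return now - (now % period) + period;
}

/* How long to sleep from `now` so as to wake at that absolute time. */
time_t
sleep_for(time_t now, time_t period)
{
    return next_run(now, period) - now;
}
```

Sleeping until an absolute deadline, rather than for a fixed interval,
means a late wakeup does not push every later run back as well.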

Unfortunately, cron is not a happy program.  8-(.

-- Terry


Re: 5 Advanced networking questions

2003-07-08 Thread Terry Lambert
Socketd wrote:
 On Tue, 08 Jul 2003 04:17:04 -0700
 Terry Lambert [EMAIL PROTECTED] wrote:
   I don't want to disable ICMP, just don't want to respond when ttl=0,
   meaning when my firewall/gateway is on a traceroute path.
 
  You should specifically modify the ICMP code to not respond
  to echo datagrams, or when ttl == 0, then, and work it that
  way.  In other words, it's time to hack your network stack
  to specifically add that feature.
 
 Hmm, why not just use a firewall?

Because most firewalls, even commercial ones, don't block the
ICMP messages you appear to be interested in blocking.

You appeared to want to turn your FreeBSD box into what's
normally called a stealth system: one that doesn't respond
at all to external probe attempts.  So it looked like you
were trying to *write* a firewall, or at least find a set
of rules that would let your FreeBSD box act as a stealth
one.

The current FreeBSD doesn't support stealth; it's generally
something you do to stop network finger-printing and/or to use
as a base for launching your own attacks and/or in an attempt
to protect a Windows box that can't protect itself very well.

If you want the feature in FreeBSD, you are going to need to
hack some code.  If you are willing to go out and spend money
on a stealth firewall box, well, you should feel free to do
that, too; if you do, I recommend SunScreen from Sun Microsystems,
though in general, I don't recommend using stealth firewalls,
since they break a number of end-to-end guarantees:

http://wwws.sun.com/software/securenet/index.html

If you want a real firewall, I recommend the Cisco PIX:

http://www.cisco.com/warp/public/cc/pd/fw/sqfw500/

I also recommend reading about the drawbacks of using stealth
firewalls, to help decide whether you want to avoid attackers
by hiding from them, or avoid attackers by having working firewall
software which has been usefully audited, instead.  8-).

http://web.proetus.com/reference/stealthfw/

If you just want to avoid ICMP echo datagrams, I'd change my filter
criteria from what you are asking (TTL==0) to ICMP type, and filter
packets of type 11 and 0 using the ipfw icmptypes option on your
filter type.  It's not the same thing as a stealth firewall, but
it is good enough to handle your initial complaint, which was the
ability to traceroute.  Then you wouldn't need to buy another
machine.
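
The selection logic behind filtering by ICMP type can be sketched as a
bitmask test.  This is an illustration of the idea only, not ipfw code
and not a rule you can load; the names below are made up, and the two
type numbers are the ones from RFC 792.

```c
#include <stdint.h>

/* ICMP type numbers as assigned in RFC 792. */
enum {
    ICMP_TYPE_ECHO_REPLY = 0,
    ICMP_TYPE_TIME_EXCEEDED = 11
};

/* One bit per ICMP type that should be dropped. */
static const uint32_t drop_mask =
    (1u << ICMP_TYPE_ECHO_REPLY) | (1u << ICMP_TYPE_TIME_EXCEEDED);

/* Returns 1 if a packet of this ICMP type should be dropped, else 0. */
int
icmp_should_drop(unsigned int type)
{
    return type < 32 && ((drop_mask >> type) & 1u);
}
```

Note that type 3 (destination unreachable) is deliberately not in the
mask, so path MTU discovery keeps working.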

-- Terry


Re: how to call a syscall in a kernel module?

2003-07-06 Thread Terry Lambert
zhuyi wrote:
 Dear all:
 How to call a syscall in a kernel module? In Linux, you can add two
 line into your source code.
 
 #define __KERNEL_SYSCALLS__
 #include <linux/unistd.h>

Most system calls call copyin or copyinstr or uiomove, which
assume that the data is in the user process that is active at the
time of the system call.  This basically boils down to UIO_SYSSPACE
vs. UIO_USERSPACE.  Because of this, it's generally not possible to
call system calls from within the kernel.

It's probably reasonable to turn all the system calls into wrappers;
there would be an additional function call worth of overhead for all
system calls, but the benefit would be to enable calling of all calls
from kernel space.  The overhead is probably worth it (AIX works this
way, and it doesn't seem to hurt them at all).
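
The wrapper idea can be sketched in ordinary C.  Everything here is a
userland stand-in written for illustration: fake_copyinstr() only
imitates what copyinstr(9) does (and has a simpler signature), and the
pathlen "system call", its two function names, and MY_PATH_MAX are
hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MY_PATH_MAX 64  /* hypothetical limit for this sketch */

/* Stand-in for copyinstr(9): copy a NUL-terminated string "in". */
int
fake_copyinstr(const char *uaddr, char *kaddr, size_t len)
{
    size_t n = strlen(uaddr) + 1;

    if (n > len)
        return ENAMETOOLONG;
    memcpy(kaddr, uaddr, n);
    return 0;
}

/* Core: touches only kernel-space data, so kernel code can call it. */
int
kern_pathlen(const char *kpath, size_t *lenp)
{
    *lenp = strlen(kpath);
    return 0;
}

/* Wrapper: the system call entry point; copy in, then call the core. */
int
sys_pathlen(const char *upath, size_t *lenp)
{
    char kpath[MY_PATH_MAX];
    int error;

    error = fake_copyinstr(upath, kpath, sizeof(kpath));
    if (error != 0)
        return error;
    return kern_pathlen(kpath, lenp);
}
```

The extra function call in sys_pathlen() is the overhead mentioned
above; in exchange, a kernel module could call kern_pathlen() directly
with a kernel-space string.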

-- Terry


Re: recovering data from a truncated vn-file possible?

2003-07-06 Thread Terry Lambert
Josh Brooks wrote:
 Long story short, I have a 4gig vn-backed filesystem.  The file backing it
 is now missing the last 750megs ... I can vnconfig it, but when I fsck it
 I see:

Probably the first thing you'll want to do is write a small program
to open the file and write a zero at the last byte of the original
size, restoring the missing 750M and making the device the right
size.  Most of the recovery tools, including fsck, go into
convulsions if the device size shrinks on them.  So the first thing
you want to do is change the size back to what it should be.
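
A minimal sketch of such a program's core, assuming you know the
original size in bytes.  The function name extend_file is made up for
this example; it refuses to touch a file that is already at least that
large, so it cannot overwrite existing data.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Extend `path` to `size` bytes by writing a single zero byte at
 * offset size - 1.  The gap in between reads back as zeros.
 * Returns 0 on success (or if the file is already big enough),
 * -1 on error.
 */
int
extend_file(const char *path, off_t size)
{
    const char zero = 0;
    struct stat st;
    int fd, ok;

    if (size <= 0)
        return -1;
    fd = open(path, O_WRONLY);
    if (fd == -1)
        return -1;
    if (fstat(fd, &st) == -1) {
        close(fd);
        return -1;
    }
    if (st.st_size >= size) {       /* nothing to do; leave data alone */
        close(fd);
        return 0;
    }
    ok = lseek(fd, size - 1, SEEK_SET) != (off_t)-1 &&
        write(fd, &zero, 1) == 1;
    if (close(fd) == -1)
        ok = 0;
    return ok ? 0 : -1;
}
```

Work on a copy of the backing file if you can; the sketch only fixes
the size, and fsck still has to cope with whatever the zeroed tail
used to hold.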

-- Terry


Re: recovering data from a truncated vn-file possible?

2003-07-06 Thread Terry Lambert
Paul Schenkeveld wrote:
 $ truncate -s original_size file

Bah!  Why use a utility, when you can write a program?

-- Terry


Re: non-32bit-relocation

2003-06-28 Thread Terry Lambert
Andrew wrote:
 I wrote a small assembly program to send a string to the floppy.
 I'm not familiar with nasm, the assembly language compiler I'm using and
 even less familiar with the as program.
 
 basically, other code aside, the nasm compiler says, when using -f elf, that
 it does not support non-32-bit relocations.  Ok, I'm not an assembly expert
 and less familiar with 'elf' so I need some help.

There are a number of assembly language resources for FreeBSD; here
are a couple of the better ones:

http://www.int80h.org/bsdasm/
http://linuxassembly.org/resources.html


 The instructions were
 
 mov es, seg buffer
 mov bx, [buffer]
 int 13h.
 
 Can anyone tell me how to do this on a FreeBSD machine?

You either don't do this at all, or you use the vm86() system
call to do the work in a virtual machine.  In general, this
will not work for this particular use, since INT 0x13 is used
to do disk I/O, and the BIOS interface does not own the disk,
the OS disk driver does.

To do what you want to do, you should probably call the open(2)
system call on the disk device from assembly, and then call the
write(2) system call from assembly, instead of trying to use
INT 0x13.

Here's a little program to write some garbage to /dev/fd0; I call
it foo.s.  Don't run it against a floppy you care about.  Compile
it with cc -o foo foo.s; it will create a dynamically linked
program, unless you tell it not to (e.g. cc -static -o foo foo.s).

crt0 is involved here so that you can mix C and assembly; if you
don't want that, follow the first link above to the tutorial.

Here's the code:


#
#   Assembly program to write a string to the floppy
#

#
# Here are my strings
#
        .section .rodata

# /dev/fd0 + NUL
.device_to_open:
        .byte 0x2f,0x64,0x65,0x76,0x2f,0x66,0x64,0x30,0x0

# my string
.string_to_write:
        .byte 0x6d,0x79,0x20,0x73,0x74,0x72,0x69,0x6e,0x67

        #
        # Code goes in text section
        #
        .text
        .p2align 2,0x90

        # main() -- called by crt0's _main() function
        .globl main
        .type main,@function
main:
        pushl %ebp              # main is a function; this is the
        movl %esp,%ebp          # entry preamble for a function with
        subl $8,%esp            # no parameters
        addl $-4,%esp           #

        # fd = open(char *path, int flags, int mode)
        pushl $0                # mode 0 (not used; callee might care though)
        pushl $2                # flags; 2 = O_RDWR
        pushl $.device_to_open  # path string; must be NUL terminated
        call open               # call open function
        addl $16,%esp
        movl %eax,%eax
        movl %eax,fd            # system call return is in %eax (-1 == error)
        addl $-4,%esp

        # write(int fd, char *buf, int len)
        pushl $9                # len = 9 bytes
        pushl $.string_to_write # buf; does not need NUL termination
        movl fd,%eax            # fd
        pushl %eax
        call write              # call write function
        addl $16,%esp
        addl $-12,%esp

        # close(fd)
        movl fd,%eax            # fd
        pushl %eax
        call close              # call close function
        addl $16,%esp

        # just return... we could have called exit, but _main will do that
        leave
        ret
.finish:
        .size main,.finish-main # text segment size

        .comm fd,4,4            # global variable 'fd'

        #
        # done.
        #
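
For comparison, the same open/write/close sequence can be sketched in
C.  This is not a translation produced by any tool, just the
equivalent calls; the function name write_string is made up, and
unlike the assembly it checks whether open(2) failed.

```c
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Open `path` read/write, write `len` bytes from `buf`, and close.
 * Returns the number of bytes written, or -1 on error.
 */
ssize_t
write_string(const char *path, const char *buf, size_t len)
{
    ssize_t n;
    int fd;

    fd = open(path, O_RDWR);    /* O_RDWR is the 2 pushed in the assembly */
    if (fd == -1)
        return -1;
    n = write(fd, buf, len);
    close(fd);
    return n;
}
```

The assembly version corresponds to write_string("/dev/fd0",
"my string", 9).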


-- Terry


Re: TODO list?

2003-06-28 Thread Terry Lambert
Simon L. Nielsen wrote:
 On 2003.06.27 16:10:13 -0700, Joshua Oreman wrote:
  I currently have a lot of free time and I was wondering whether there was
  a TODO list of some sort for bugs that need fixing in FreeBSD. I really
  want to help the project, and I think such a list would make it much
  easier to do so. If there's no official TODO list, could someone point
  out some things? I know C/C++, but I'm very unfamiliar with the kernel.
 
 Great :-) There is always plenty to do.  I would suggest looking at the
 PR system and at the 'Contributing to FreeBSD' article which can be
 found at
 http://www.freebsd.org/doc/en_US.ISO8859-1/articles/contributing/index.html
 
 Hope you find something interesting to spend some time on.


Give him a commit bit, and he can quickly grind through all the
PRs that already have diffs attached to them, and have just sat
there forever.  All he'd need to do is verify that there is a
problem that is being fixed, and that the code doesn't look like
it would cause damage.  If it ends up causing damage anyway, the fix
can always be backed out later.  Making send-pr actually result in
code changes would probably be the most valuable thing anyone could
do for the project, and it would give him a chance to read and to
understand a lot of diverse code, in the process, to get up to speed
on writing his own fixes for PRs without fixes attached.

Just my $0.02...

-- Terry


Re: Page Coloring Defines in vm_page.h

2003-06-25 Thread Terry Lambert
Bruce M Simpson wrote:
 Something occurred to me whilst I was re-reading the 'Design Elements'
 article over the weekend; our page coloring algorithm, as it stands,
 might not be optimal for non-Intel CPUs.

Actually, it does very well for most architectures, and was originally
developed into product for SPARC and MIPS (by Sun and MIPS Inc.,
respectively).

There are 29 research papers that specifically mention page coloring;
here is the citation:

http://citeseer.nj.nec.com/cs?q=page+coloring&cs=1

I'm going to claim that this is one of the better ones (;^)), mostly
because it has some nice real statistics, and represents a rather fun
approach to the problem, which doesn't actually involve page coloring,
and touches on the rather neat idea of doing cache affinity as part of
the scheduler operation:

http://citeseer.nj.nec.com/410298.html

Also, it has a couple of nice references to related works you aren't
going to find online: you will need to find yourself a technical
library; references 1, 6, 8, 9, and 12 especially.


- Why is this important? The vm code as it stands assumes that we
  colour for L2 cache. This might not be optimal for all architectures,
  or, indeed, forthcoming x86 chips.

The code previously colored for both L1 and L2.  It turns out
that the penalty you pay for L1 set overlap is relatively low,
due to the difference in size between L1 and L2.  See also:

http://citeseer.nj.nec.com/kessler92page.html

   - The defines in vm_page.h seem to assume a 4-way set associative
 cache organization.

Yes.  This was the most common L2 cache hardware arrangement at
the time.
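
To make the role of associativity concrete, here is the textbook
calculation of page colors for a physically indexed cache.  This is a
sketch of the general technique, not the code or the constants in
vm_page.h; the function names and the example sizes are assumptions.

```c
#include <stddef.h>

/*
 * Number of page colors: how many distinct page-sized slots a single
 * way of the cache holds.
 */
size_t
cache_colors(size_t cache_bytes, size_t ways, size_t page_bytes)
{
    return cache_bytes / (ways * page_bytes);
}

/*
 * Color of a physical page.  Pages with the same color compete for
 * the same cache sets.
 */
size_t
page_color(size_t page_frame_number, size_t ncolors)
{
    return page_frame_number % ncolors;
}
```

For example, a 256K 4-way cache with 4K pages has
cache_colors(256 * 1024, 4, 4096) = 16 colors; change the
associativity and the number of colors changes with it, which is why
the defines bake in an assumption about the cache organization.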

There were a couple of good postings on page coloring on the
FreeBSD lists back in the mid 1990's by John Dyson, who was the
original implementor of both the page coloring code and the
unified VM and buffer cache code, which Poul was complaining was
incomplete, in the recent FS stacking thread on -arch.  You can
probably find them in the archives on Minnie (Warren Toomey's
archival machine).


   - If someone could fill me in on how the primes are arrived at that
 would be very helpful.

It's an attempt to get a perfect hash without collision; page
coloring relies on avoiding cache line overlap, if at all possible
(sometimes it isn't, which is why the page coloring compiler work
and the cache affinity scheduler are such intriguing ideas, to me
at least, though the compiler work would probably be incredibly
hard to tune in any useful fashion to work across an entire CPU
family, rather than specific CPU instances).

-- Terry

