Re: 2.4.0-test6 network socket problems

2000-10-14 Thread J. Scott Kasten


Thanks Alan, you're exactly right.  I'm charged with the task of finding
lots of nasties like that in our old code base where a number of things
were just hacked in down and dirty.  Our embedded environment moved from
XINU on an SH2/SH3 with no MMU support and a BSD protocol stack we hacked
in ourselves to the Linux kernel, MIPS, and an MMU.  Coding standards truly
have had to step up a couple of levels due to the growing complexity of the
environment, but eventually it will be worth the pain.

> 
> 
>   alarm(1)
>   [sudden swap frenzy]
>   alarm is delivered.. do nothing
>   read
> 
> blocks forever. You need to make clever use of siglongjmp to avoid that one
> occurring or use select/poll.
> 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test6 network socket problems

2000-10-13 Thread Alan Cox

> I've found the problem.  This type of loop does not work:
> 
> do {
>     alarm(t);
>     read(fd);
>     if (EINT)
>         exception();
>     else
>         alarm(0);
> } while (data);
> 
> There are some semantics here that differ from other *nix where this
> works.  The read() won't come out when the alarm comes, and the socket
> will effectively become broken.

The restart or continue behaviour is undefined unless you use sigaction()
to control your signal behaviour (see POSIX.1 or SuS). Even then your code
is buggy on every OS I know.

Suppose this happens:


alarm(1)
[sudden swap frenzy]
alarm is delivered.. do nothing
read

blocks forever. You need to make clever use of siglongjmp to avoid that one
occurring or use select/poll.





Re: 2.4.0-test6 network socket problems

2000-10-13 Thread Richard B. Johnson

On Fri, 13 Oct 2000, J. Scott Kasten wrote:

> I've found the problem.  This type of loop does not work:
> 
> do {
>     alarm(t);
>     read(fd);
>     if (EINT)
>         exception();
>     else
>         alarm(0);
> } while (data);
> 
> There are some semantics here that differ from other *nix where this
> works.  The read() won't come out when the alarm comes, and the socket
> will effectively become broken.
> 
> Instead, it appears that I needed to use select(), which probably would
> have been better in the first place anyway.
> 
> Thanks to anyone that took the time to look at this.
> 

You can certainly use select() and it's 'better' and more useful.
However, the problem is the default nature of the way signal() in the 'C'
runtime library sets up the handler.

You should use sigaction() without the SA_RESTART flag. This lets
a signal interrupt a blocked system call, the resulting errno being EINTR.


Cheers,
Dick Johnson

Penguin : Linux version 2.2.17 on an i686 machine (801.18 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.





Re: 2.4.0-test6 network socket problems

2000-10-13 Thread J. Scott Kasten

I've found the problem.  This type of loop does not work:

do {
alarm(t);
read(fd);
if (EINT)
   exception();
else
   alarm(0);
} while (data);

There are some semantics here that differ from other *nix where this
works.  The read() won't come out when the alarm comes, and the socket
will effectively become broken.

Instead, it appears that I needed to use select(), which probably would
have been better in the first place anyway.
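
A minimal sketch of what the select() version might look like follows.
The helper name read_with_timeout(), the millisecond parameter, and the
use of ETIMEDOUT to report expiry are assumptions for this example, not
the code actually used here. No signal is involved, so there is nothing
to be delivered at the wrong moment.

```c
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to timeout_ms for fd to become readable,
 * then read once.  Returns the byte count from read() (0 on EOF), or -1
 * with errno == ETIMEDOUT if nothing arrived in time.
 * fd must be smaller than FD_SETSIZE.
 */
static ssize_t read_with_timeout(int fd, void *buf, size_t len, long timeout_ms)
{
    fd_set rfds;
    struct timeval tv;
    int ready;

    do {
        /* select() may modify both arguments, so rebuild them each pass. */
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;
        ready = select(fd + 1, &rfds, NULL, NULL, &tv);
        /* Simplification: an unrelated signal restarts the full timeout. */
    } while (ready < 0 && errno == EINTR);

    if (ready < 0)
        return -1;
    if (ready == 0) {
        errno = ETIMEDOUT;
        return -1;
    }
    return read(fd, buf, len);
}
```

A caller would loop on read_with_timeout(fd, buf, sizeof buf, 5000) and
treat ETIMEDOUT as the retransmit or give-up condition that the alarm
used to signal.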

Thanks to anyone that took the time to look at this.

-S-



> I'm working with test6 on an embedded QED MIPS arch in big endian mode.  I
> have run into some bizarre socket problems that appear to affect both udp
> and tcp transport.  Applications actively using sockets (examples, ftp,
> tftp, others...) will unexpectedly stop receiving data on the socket, even
> though data is present.  The process will be forever sleeping on the read
> even though data is queued up.  To illustrate my point, I've dug deep into
> the udp code (net/ipv4/udp.c) and the datagram core (net/core/datagram.c)
> researching the simple tftp example.




2.4.0-test6 network socket problems

2000-10-13 Thread J. Scott Kasten


I'm working with test6 on an embedded QED MIPS arch in big endian mode.  I
have run into some bizarre socket problems that appear to affect both udp
and tcp transport.  Applications actively using sockets (examples, ftp,
tftp, others...) will unexpectedly stop receiving data on the socket, even
though data is present.  The process will be forever sleeping on the read
even though data is queued up.  To illustrate my point, I've dug deep into
the udp code (net/ipv4/udp.c) and the datagram core (net/core/datagram.c)
researching the simple tftp example.

After much debugging, here is what I know:

I have followed the packets in from the network driver all the way to
udp_rcv() in udp.c.  I see it do the sk lookup and drop it off in
udp_queue_rcv_skb().  Everything is fine on that end.

On the process end, I've been watching in the udp_recvmsg() function, also
in udp.c.  Under normal operation, I see it pick up the data from the
correct skb and return.  When the rare condition that causes failure
occurs, skb_recv_datagram() returns a NULL and err is set to -ERESTARTSYS.
It is only when the process gets hung on that socket that I see this
happen.  It never revisits this portion of the code again.  However, until
the sender stops transmitting data because of ACK timeouts, I see the
packets continue to pile up on the udp_rcv() side without incident.

I further looked at datagram.c to see what the skb_recv_datagram() was
doing.  It was spinning through the do {} while() loop waiting for
wait_for_packet() to hand it something.  It is in that routine that the
error code is generated.  The signal_pending() function returns true
and sock_intr_errno() returns the -ERESTARTSYS code, which gets passed
back down the chain from here.

The structure of the tftp code that I'm working with is such that it does
a generic blocking read() on the socket file handle and uses an alarm to
wake up when the critical timeout is reached.  Not the most glorious code,
but it demonstrates a problem nonetheless.  The read() never returns an
EINTR or EAGAIN or anything.  It's just hung.  I'm assuming that the
signal_pending() return comes from my alarm(), which means that the
process had already been sitting on that socket for a while not seeing the
data that is clearly already present.  Thus, there may be two problems
here, the signal not returning, and data trapped in the skb.

I would appreciate it if anyone more familiar with this code could point
me better to what I should be looking at, or at least explain what should
be happening that isn't.

TIA,

-S-

--

J. Scott Kasten
Email: jsk AT tetracon-eng DOT net

"In most cases, all an argument proves
 is that two people were present.."




