Unfortunately, I can't make it work on AIX.
But I have successfully reproduced the problem with a simple piece of code:
a servlet that, on its first call, starts n threads, each of which opens a
localhost URL every s seconds (a kind of scheduler) pointing at the same
servlet that launched the threads (subsequent calls just return some
text/plain data).
So, with n=50, the process hung after less than 18 hours.
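The client side of this reproduction could look roughly like the sketch below. The servlet itself is left out (it needs the servlet API), and the class name `LoopbackPoller`, its method names and the timeout values are hypothetical, not taken from the actual sample code. It targets a current JDK (lambdas, `setConnectTimeout`/`setReadTimeout`), not JRE 1.2/1.3. What it illustrates is that each request uses timeouts and always closes its response stream, so a stalled or failed request cannot hold a socket indefinitely.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the "scheduler" threads: n threads, each requesting
// the same localhost URL every periodMillis milliseconds.
class LoopbackPoller {

    /** Requests url once, drains the body and always closes the stream; returns bytes read. */
    static int fetchOnce(URL url, int timeoutMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(timeoutMillis); // do not block forever on connect
        conn.setReadTimeout(timeoutMillis);    // ...or on a response that never arrives
        InputStream in = null;
        try {
            in = conn.getInputStream();
            byte[] buf = new byte[4096];
            int total = 0;
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
            return total;
        } catch (IOException e) {
            // On HTTP errors the body is on the error stream; close it as well.
            InputStream err = conn.getErrorStream();
            if (err != null) {
                err.close();
            }
            throw e;
        } finally {
            if (in != null) {
                in.close(); // frees the socket or returns it to the keep-alive cache
            }
        }
    }

    /** Starts n daemon threads polling url; stop them with Thread.interrupt(). */
    static Thread[] startPollers(URL url, int n, long periodMillis,
                                 AtomicInteger ok, AtomicInteger failed) {
        Thread[] threads = new Thread[n];
        for (int i = 0; i < n; i++) {
            threads[i] = new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        fetchOnce(url, 10_000);
                        ok.incrementAndGet();
                    } catch (IOException e) {
                        failed.incrementAndGet();
                    }
                    try {
                        Thread.sleep(periodMillis);
                    } catch (InterruptedException e) {
                        return; // interrupted: stop polling
                    }
                }
            }, "poller-" + i);
            threads[i].setDaemon(true);
            threads[i].start();
        }
        return threads;
    }
}
```

With n = 50 and a period of s seconds, this would be started as `startPollers(url, 50, s * 1000L, ok, failed)`, where `url` points at the servlet.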
I tried to use 'lsof' (suspecting some resource starvation), and it gives:
===
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 47238 desis cwd VDIR 33,48 512 14498 /bnp/desis_fs (/dev/lv_desis)
java 47238 desis 0r VCHR 2,2 0t0 8482 /dev/null
java 47238 desis 1u VCHR 23,2 0t46340 8870 /dev/pts/2
java 47238 desis 2u VCHR 23,2 0t46340 8870 /dev/pts/2
java 47238 desis 3r VREG 33,48 7517162 14496 /bnp/desis_fs (/dev/lv_desis)
java 47238 desis 4r VREG 33,48 3511645 14492 /bnp/desis_fs (/dev/lv_desis)
java 47238 desis 5r VREG 33,48 375769 14339 /bnp/desis_fs (/dev/lv_desis)
java 47238 desis 6r VREG 33,48 43606 14340 /bnp/desis_fs (/dev/lv_desis)
java 47238 desis 7r VREG 33,48 330474 6298 /bnp/desis_fs (/dev/lv_desis)
java 47238 desis 8r VREG 33,48 217958 6303 /bnp/desis_fs (/dev/lv_desis)
java 47238 desis 9r VREG 33,48 5618 6296 /bnp/desis_fs (/dev/lv_desis)
java 47238 desis 10r VREG 33,48 136133 6299 /bnp/desis_fs (/dev/lv_desis)
java 47238 desis 11r VREG 33,48 40813 6294 /bnp/desis_fs (/dev/lv_desis)
java 47238 desis 12r VREG 33,48 431743 6297 /bnp/desis_fs (/dev/lv_desis)
java 47238 desis 13u IPv4 0x70327edc 0t0 TCP *:11223 (LISTEN)
java 47238 desis 14w VREG 33,52 758 2050 /bnp/desis_fs/logs (/dev/lv_desis_logs)
java 47238 desis 15w VREG 33,52 382 2051 /bnp/desis_fs/logs (/dev/lv_desis_logs)
java 47238 desis 16w VREG 33,52 197 2052 /bnp/desis_fs/logs (/dev/lv_desis_logs)
java 47238 desis 17u IPv4 0x703c4adc 0t0 TCP *:11225 (LISTEN)
java 47238 desis 18u IPv4 0x70429edc 0t0 TCP loopback:11223->loopback:64245 (ESTABLISHED)
java 47238 desis 19u IPv4 0t0 TCP no PCB, CANTSENDMORE, CANTRCVMORE
java 47238 desis 20u IPv4 0x703ad100 0t0 UDP *:38797
java 47238 desis 21u IPv4 0x702352dc 0t0 TCP loopback:64252->loopback:11223 (ESTABLISHED)
java 47238 desis 22u IPv4 0x702522dc 0t0 TCP loopback:64253->loopback:11223 (ESTABLISHED)
java 47238 desis 23u IPv4 0x702f76dc 0t0 TCP loopback:64245->loopback:11223 (ESTABLISHED)
java 47238 desis 24u IPv4 0x702c6edc 0t0 TCP loopback:64254->loopback:11223 (ESTABLISHED)
java 47238 desis 25u IPv4 0x703a22dc 0t0 TCP loopback:11223->loopback:64244 (ESTABLISHED)
java 47238 desis 26u IPv4 0x704ddedc 0t0 TCP loopback:11223->loopback:64247 (ESTABLISHED)
java 47238 desis 27u IPv4 0x704e52dc 0t0 TCP loopback:64244->loopback:11223 (ESTABLISHED)
java 47238 desis 29u IPv4 0x7041f6dc 0t0 TCP loopback:11223->loopback:64249 (ESTABLISHED)
java 47238 desis 31u IPv4 0x703e2edc 0t0 TCP loopback:64255->loopback:11223 (ESTABLISHED)
java 47238 desis 32u IPv4 0x70393edc 0t0 TCP loopback:64247->loopback:11223 (ESTABLISHED)
java 47238 desis 33u IPv4 0x703a3adc 0t0 TCP loopback:64248->loopback:11223 (CLOSE_WAIT)
java 47238 desis 34u IPv4 0x704ddadc 0t0 TCP loopback:64249->loopback:11223 (ESTABLISHED)
java 47238 desis 35u IPv4 0x70252adc 0t0 TCP loopback:64250->loopback:11223 (ESTABLISHED)
java 47238 desis 36u IPv4 0x702baedc 0t0 TCP loopback:64251->loopback:11223 (ESTABLISHED)
java 47238 desis 37u IPv4 0x702346dc 0t0 TCP loopback:11223->loopback:64250 (ESTABLISHED)
java 47238 desis 38u IPv4 0x702c66dc 0t0 TCP loopback:11223->loopback:64251 (ESTABLISHED)
java 47238 desis 39u IPv4 0x702416dc 0t0 TCP loopback:11223->loopback:64252 (ESTABLISHED)
java 47238 desis 40u IPv4 0x700752dc 0t0 TCP loopback:11223->loopback:64253 (ESTABLISHED)
java 47238 desis 41u IPv4 0x702a92dc 0t0 TCP loopback:11223->loopback:64254 (ESTABLISHED)
java 47238 desis 42u IPv4 0x700702dc 0t0 TCP loopback:11223->loopback:64255 (ESTABLISHED)
java 47238 desis 43u IPv4 0x70228edc 0t0 TCP loopback:64256->loopback:11223 (ESTABLISHED)
java 47238 desis 44u IPv4 0x702526dc 0t0 TCP loopback:11223->loopback:64256 (ESTABLISHED)
===
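To watch for the kind of resource starvation suspected here, one could tally the TCP states across successive `lsof` snapshots. The sketch below (the class name `LsofStates` is hypothetical) assumes the column layout shown above, where the state appears in parentheses at the end of the NAME column, and it counts the AIX "no PCB" entries separately. A socket in CLOSE_WAIT is one whose peer has already closed the connection while this process has not yet closed its own end.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts TCP socket states in lsof output lines.
class LsofStates {

    // The state is printed in parentheses at the end of the NAME column, e.g. "(CLOSE_WAIT)".
    private static final Pattern STATE = Pattern.compile("\\(([A-Z_0-9]+)\\)\\s*$");

    static Map<String, Integer> tally(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            if (!line.contains(" TCP ")) {
                continue; // skip regular files, UDP sockets, headers
            }
            String key;
            if (line.contains("no PCB")) {
                key = "NO_PCB"; // shut down but not closed (see the lsof FAQ)
            } else {
                Matcher m = STATE.matcher(line);
                if (!m.find()) {
                    continue;
                }
                key = m.group(1);
            }
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }
}
```

Fed the lines of a few snapshots taken some minutes apart, a CLOSE_WAIT or NO_PCB count that keeps growing would suggest sockets that are never closed.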
I am not a socket specialist, and I just wonder whether the "no PCB" line means
something interesting, even after reading the lsof FAQ:
===
4.8 What does lsof mean when it says, "no PCB, CANTSENDMORE,
CANTRCVMORE" in a socket file's NAME column?
When an AIX application calls shutdown(2) on an open socket
file, but hasn't called close(2) on the file, the file will
remain visible to lsof as an open socket file without any
extended protocol information.
Lsof reports that state in the NAME column by saying that
there is "no PCB" (Protocol Control Block) for the protocol
(e.g., TCP in the NODE column). If the open socket file
has the state variables SO_CANTSENDMORE and SO_CANTRCVMORE
set -- i.e., from the shutdown(2) call -- lsof reports them
with the CANTSENDMORE and CANTRCVMORE notes in the NAME
column.
===
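The FAQ's distinction between shutdown(2) and close(2) has a Java counterpart: `Socket.shutdownInput()`/`shutdownOutput()` roughly correspond to shutdown(2), while only `Socket.close()` releases the descriptor. The helper class below (`SocketRelease`, hypothetical) contrasts the two; according to the FAQ, a socket that has only been shut down is the kind of file lsof reports on AIX as "no PCB, CANTSENDMORE, CANTRCVMORE".

```java
import java.io.IOException;
import java.net.Socket;

// Hypothetical helpers contrasting shutdown (half-close) with close.
class SocketRelease {

    /** Shuts down both directions but keeps the file descriptor open. */
    static void shutdownOnly(Socket s) throws IOException {
        s.shutdownOutput(); // peer sees end-of-stream
        s.shutdownInput();  // further reads return end-of-stream
    }

    /** Shuts down whatever is still open, then always closes the socket. */
    static void shutdownAndClose(Socket s) throws IOException {
        try {
            if (!s.isOutputShutdown()) {
                s.shutdownOutput();
            }
            if (!s.isInputShutdown()) {
                s.shutdownInput();
            }
        } finally {
            s.close(); // releases the descriptor; without this it stays listed by lsof
        }
    }
}
```

Putting `close()` in a `finally` block means the descriptor is released even when the shutdown calls themselves fail.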
Any help?
Of course, I can provide the sample code (less than 5 KB of source).
Thanks in advance
--
Joseph
[EMAIL PROTECTED] - 21/09/2001 02:22
Please reply to [EMAIL PROTECTED]
To: tomcat-user
Subject: Re: tomcat 3.2.3 hangs, AIX 4.3.3 (jre 1.2.2 and 1.3.0)
Bug #1006 suggests running:
kill -3 <process-id>
*** The bug report also says that this command is for the IBM JVM and Linux 6.2
(if I remember correctly) ***
This gives a stack-trace dump of all the threads. Looking at it may give
you a clue as to where and why the threads are hanging.
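On later JVMs a similar dump can also be produced from inside the application. The sketch below relies on `Thread.getAllStackTraces()`, which only exists from Java 5 on, so it would not run on the JRE 1.2.2/1.3.0 discussed in this thread, where `kill -3` remains the option. The class name `ThreadDump` is hypothetical.

```java
import java.util.Map;

// Hypothetical helper: a minimal in-process equivalent of a kill -3 thread dump.
class ThreadDump {

    static String dump() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            // One header line per thread: name and state (RUNNABLE, BLOCKED, WAITING, ...).
            sb.append('"').append(t.getName()).append("\" state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }
}
```

Threads that stay BLOCKED or WAITING at the same frame across several dumps are the usual first suspects for a hang.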
On Thu, 20 Sep 2001 17:00:05 +0200
[EMAIL PROTECTED] wrote:
>
> Hello...
>
> I have a problem which I can't resolve: tomcat "hangs" after some time (it
> may be 3 minutes or 2 hours).
> I mean tomcat is running ("ps" shows it), but no answer is received to
> requests once tomcat hangs.
>
> Strangely, the problem occurs only on _some_ of the AIX 4.3.3 machines where
> I run tomcat.
> It happens with either IBM's JRE 1.2.2 (several different builds) or 1.3.0,
> and tomcat 3.2.3.
> Some machines, with the SAME JRE/tomcat/code, run with no problem :-(
> And I have not seen it on WinNT4.
> Note that I got a javacore twice, if somebody can interpret it...
>
> I searched the bug list, and it looks like bug 1006
> (<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=1006>),
> or like the problems described in
> <http://www.mail-archive.com/[email protected]/msg19652.html>
> or in
> <http://www.mail-archive.com/[email protected]/msg16676.html>.
>
> Since bug 1006 is said to be fixed, I don't understand what is happening.
>
> Can somebody help, or give tips on how to diagnose it?
>
> Thanks
> --
> Joseph Vallot
>
simon colston
[EMAIL PROTECTED]
This message and any attachments (the "message") is
intended solely for the addressees and is confidential.
If you receive this message in error, please delete it and
immediately notify the sender. Any use not in accord with
its purpose, any dissemination or disclosure, either whole
or partial, is prohibited except formal approval. The internet
can not guarantee the integrity of this message.
BNP PARIBAS (and its subsidiaries) shall (will) not
therefore be liable for the message if modified.