Re: Befuddling deadlock (lock held but not in stack?)

2019-02-26 Thread Todd Lipcon
In case anyone else is running into similar issues, we're currently
thinking it might be this JVM bug:
https://bugs.openjdk.java.net/browse/JDK-8215355

(very similar symptoms)

-Todd

On Thu, Oct 12, 2017 at 1:12 PM Kirk Pepperdine  wrote:

>
> > On Oct 12, 2017, at 8:13 PM, Martin Thompson  wrote:
> >
> > I agree with Gil here.
>
> I think this was my original position. I was taken aback to suggest that
> it wasn’t interesting enough to be discussed here. Stack traces can be
> quite confusing but they can be key to understanding what’s going on.
>
> Regards,
> Kirk
>
> --
> You received this message because you are subscribed to the Google Groups
> "mechanical-sympathy" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to mechanical-sympathy+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to mechanical-sympathy+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Befuddling deadlock (lock held but not in stack?)

2017-10-12 Thread Kirk Pepperdine

> On Oct 12, 2017, at 8:13 PM, Martin Thompson  wrote:
> 
> I agree with Gil here.

I think this was my original position. I was taken aback to suggest that it 
wasn’t interesting enough to be discussed here. Stack traces can be quite 
confusing but they can be key to understanding what’s going on.

Regards,
Kirk

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to mechanical-sympathy+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Befuddling deadlock (lock held but not in stack?)

2017-10-12 Thread Martin Thompson
I agree with Gil here. I often reject posts via moderation which range from 
things that should be in the lazy section for Stackoverflow or recruiters 
taking the piss. This question is from someone who has made good 
contributions to previous discussions and is posing a question that is 
trying to get to the bottom of something which is "befuddling" to someone 
trying to understand the mechanisms of a platform. To paraphrase Jackie 
Stewart who coined the mechanical sympathy phrase, to work in harmony with 
the machine one must understand the machine. If this was a question about 
the syntax of the Java language then that would be different.

In my own experiences I've often found discussions about managed runtimes 
and VMs in languages other than Java, which while not directly applicable, 
can be informative and give insights into the thinking behind the design of 
the "machine" on which we run. I don't write code in Pony or Rust for a 
living but studying the underlying implementations gives real insights into 
the possible future directions for memory management which is core to the 
"machine" for example.

I feel it is also valid to question the appropriateness and context for 
discussions. It is healthy to check our assumptions. We all win when 
discussions are civil and present opportunities to learn.

Martin...

On Thursday, 12 October 2017 17:09:06 UTC+1, Gil Tene wrote:
>
> The "machine" we run on is not just the hardware. It's also the BIOS, the 
> hypervisor, the kernel, the container system, the system libraries, and the 
> various runtimes. Stuff that is "interesting" and "strange" about how the 
> machine seems to behave is very appropriate to this group, IMO. And 
> mysterious deadlock behaviors and related issues at the machine level (e.g. 
> an apparent deadlock where no lock owner appears to exist, as is the case 
> discussed here) is certainly an "interesting" machine behavior. Java for 
> not. libposix or not, Linux or not. C#/C++/Rust or not. The fact that much 
> of concurrency work happens to be done in Java does make Java and JVMs a 
> more common context in which these issues are discussed, but the same can 
> be said about Linux, even tho this is not a Linux support group.
>
> E.g. the discussion we had a while back about Linux's futex wakeup bug 
>  
> (where certain kernel versions failed to wake up futex's, creating [among 
> other things] apparent deadlocks in non-deadlocking code) was presumably 
> appropriate for this group. I see Todd's query as no different. It is not a 
> "help! my program has a deadlock" question. It is an observed "deadlock 
> that isn't a deadlock" question. It may be a bug in tooling/reporting (e.g. 
> tooling might be deducing the deadlock based on non-atomic sampling of 
> stack state as some have suggested here), or it may be a bug in the lock 
> mechanisms. e.g. failed wakeup at the JVM level. Either way, it would be 
> just as interesting and relevant here as a failed wakeup or wrong lock 
> state intrumentation at the Linux kernel level. or at the .NET CLR level, 
> or at the golang runtime level, etc...
>
> On Wednesday, October 11, 2017 at 6:06:16 AM UTC-7, Jarkko Miettinen wrote:
>>
>> keskiviikko 11. lokakuuta 2017 11.27.17 UTC+3 Avi Kivity kirjoitti:
>>>
>>> If this is not off topic, what is the topic of this group?
>>>
>>>
>>> Is it a Java support group, or a "coding in a way that exploits the way 
>>> the hardware works" group?
>>>
>>
>>
>> I have to agree here with Avi. Better to have a group for "coding in a 
>> way that exploits the way the hardware works" and another group for Java 
>> support. Otherwise there will be a lot of discussion of no real relation to 
>> the topic except that people exploiting mechanical sympathy might have run 
>> into such problems.
>>
>> (I would also be interested in a Java support group for the level of 
>> problems that have been posted in this group before.)
>>
>>
>>>
>>> On 10/11/2017 10:29 AM, Kirk Pepperdine wrote:
>>>
>>> Not at all off topic… first, thread dumps lie like a rug… and here is 
>>> why… 
>>>
>>> for each thread {
>>> safe point
>>> create stack trace for that thread
>>> release threads from safe point
>>> }
>>>
>>> And while rugs may attempt to cover the debris that you’ve swept under 
>>> them, that debris leaves a clearly visible lump that suggests that you have 
>>> a congestion problem on locks in both sun.security.provider.Sun and 
>>> java.lang.Class…. What could possibly go wrong?
>>>
>>>
>>> Kind regards,
>>> Kirk
>>>
>>> On Oct 11, 2017, at 3:05 AM, Todd Lipcon  wrote:
>>>
>>> Hey folks, 
>>>
>>> Apologies for the slightly off-topic post, since this isn't performance 
>>> related, but I hope I'll be excused since this might be interesting to the 
>>> group members.
>>>
>>> We're recently facing an issue where a JVM is deadlocking in some SSL 
>>> code. The resulting 

Re: Befuddling deadlock (lock held but not in stack?)

2017-10-12 Thread Gil Tene
The "machine" we run on is not just the hardware. It's also the BIOS, the 
hypervisor, the kernel, the container system, the system libraries, and the 
various runtimes. Stuff that is "interesting" and "strange" about how the 
machine seems to behave is very appropriate to this group, IMO. And 
mysterious deadlock behaviors and related issues the machine level (e.g. an 
apparent deadlock where no lock owner appears to exist, as is the case 
discussed here) is certainly an "interesting" machine behavior. Java for 
not. libposix or not, Linux or not. C#/C++/Rust or not. The fact that much 
of concurrency work happens to be done in Java does make Java and JVMs a 
more common context in which these issues are discussed, but the same can 
be said about Linux, even tho this is not a Linux support group.

E.g. the discussion we had a while back about Linux's futex wakeup bug 
 
(where certain kernel versions failed to wake up futex's, creating ()among 
other things) apparent deadlocks in non-deadlocking code) was presumably 
appropriate for this group. I see Todd's query as no different. It is not 
"my program has a deadlock" question. It is an observed "deadlock that 
isn't a deadlock" question. It may be a bug in tooling/reporting (e.g. 
tooling might be deducing the deadlock based on non-atomic sampling of 
stack state as some have suggested here), or it may be a bug in the lock 
mechanisms. e.g. failed wakeup at the JVM level. Either way, to would be 
just as interesting and relevant here as a failed wakeup or wrong lock 
state intrumentation at the Linux kernel level. or at the .NET CLR level, 
or at the golang runtime level, etc...

On Wednesday, October 11, 2017 at 6:06:16 AM UTC-7, Jarkko Miettinen wrote:
>
> keskiviikko 11. lokakuuta 2017 11.27.17 UTC+3 Avi Kivity kirjoitti:
>>
>> If this is not off topic, what is the topic of this group?
>>
>>
>> Is it a Java support group, or a "coding in a way that exploits the way 
>> the hardware works" group?
>>
>
>
> I have to agree here with Avi. Better to have a group for "coding in a way 
> that exploits the way the hardware works" and another group for Java 
> support. Otherwise there will be a lot of discussion of no real relation to 
> the topic except that people exploiting mechanical sympathy might have run 
> into such problems.
>
> (I would also be interested in a Java support group for the level of 
> problems that have been posted in this group before.)
>
>
>>
>> On 10/11/2017 10:29 AM, Kirk Pepperdine wrote:
>>
>> Not at all off topic… first, thread dumps lie like a rug… and here is 
>> why… 
>>
>> for each thread {
>> safe point
>> create stack trace for that thread
>> release threads from safe point
>> }
>>
>> And while rugs may attempt to cover the debris that you’ve swept under 
>> them, that debris leaves a clearly visible lump that suggests that you have 
>> a congestion problem on locks in both sun.security.provider.Sun and 
>> java.lang.Class…. What could possibly go wrong?
>>
>>
>> Kind regards,
>> Kirk
>>
>> On Oct 11, 2017, at 3:05 AM, Todd Lipcon  wrote:
>>
>> Hey folks, 
>>
>> Apologies for the slightly off-topic post, since this isn't performance 
>> related, but I hope I'll be excused since this might be interesting to the 
>> group members.
>>
>> We're recently facing an issue where a JVM is deadlocking in some SSL 
>> code. The resulting jstack report is bizarre -- in the deadlock analysis 
>> section it indicates that one of the locks is held by some thread, but in 
>> that thread's stack, it doesn't show the lock anywhere. Was curious if 
>> anyone had any ideas on how a lock might be "held but not held".
>>
>> jstack output is as follows (with other threads and irrelevant bottom 
>> stack frames removed):
>>
>> Found one Java-level deadlock:
>> =
>> "Thread-38190":
>>   waiting to lock monitor 0x267f2628 (object 0x802ba7f8, 
>> a sun.security.provider.Sun),
>>   which is held by "New I/O worker #1810850"
>> "New I/O worker #1810850":
>>   waiting to lock monitor 0x7482f5f8 (object 0x80ac88f0, 
>> a java.lang.Class),
>>   which is held by "New I/O worker #1810853"
>> "New I/O worker #1810853":
>>   waiting to lock monitor 0x267f2628 (object 0x802ba7f8, 
>> a sun.security.provider.Sun),
>>   which is held by "New I/O worker #1810850"
>>
>> Java stack information for the threads listed above:
>> ===
>> "Thread-38190":
>> at java.security.Provider.getService(Provider.java:1035)
>> - waiting to lock <0x802ba7f8> (a 
>> sun.security.provider.Sun)
>> at sun.security.jca.ProviderList.getService(ProviderList.java:332)
>> at sun.security.jca.GetInstance.getInstance(GetInstance.java:157)
>> at javax.net.ssl.SSLContext.getInstance(SSLContext.java:156)
>> at 
>> 

Re: Befuddling deadlock (lock held but not in stack?)

2017-10-11 Thread Henri Tremblay
I will go for live lock as well. But yes, stack dumps lie. I think jstack
-F might be harsher and give a different result.

But that said, you still have a deadlock. As mentioned

"New I/O worker #1810850" waiting on 0x80ac88f0 is held by "New I/O
worker #1810853" <-- waiting for one another
"New I/O worker #1810853" waiting on 0x802ba7f8 is held by "New I/O
worker #1810850" <-- waiting for one another

My question would be: Why is kudu wanting to get so much SSLContexts? Can't
it reuse them? (I don't know the api well, so maybe it's normal)


On 11 October 2017 at 06:48, Kirk Pepperdine  wrote:

> My vote is for live lock.. but that would be conditional on seeing all the
> stack traces.
>
> — Kirk
>
> On Oct 11, 2017, at 12:07 PM, Todd Lipcon  wrote:
>
> On Wed, Oct 11, 2017 at 12:29 AM, Kirk Pepperdine 
> wrote:
>
>> Not at all off topic… first, thread dumps lie like a rug… and here is why…
>>
>> for each thread {
>> safe point
>> create stack trace for that thread
>> release threads from safe point
>> }
>>
>> And while rugs may attempt to cover the debris that you’ve swept under
>> them, that debris leaves a clearly visible lump that suggests that you have
>> a congestion problem on locks in both sun.security.provider.Sun and
>> java.lang.Class…. What could possibly go wrong?
>>
>
> Interesting. I remember investigating this back in the Java 6 days and at
> that time 'jstack -l' caused a global safepoint before iterating over the
> threads, and then only resumed upon completion (implemented by
> VM_ThreadDump in vm_operations.cpp). It appears that in JDK 7 this was
> switched over to the behavior that you describe.
>
> Either way, I'm looking at about 10 stack traces taken over a period of 3
> minutes and they have 100% identical "java-level deadlock" output. So I
> suppose it's possible that they are live-locked on the same sets of locks
> in the exact same threads but seems fairly unlikely.
>
> As for the congestion problem, I did find this JDK bug the other day:
> https://bugs.openjdk.java.net/browse/JDK-8133070 which aims to reduce
> contention, but no mention of a deadlock. We'll probably ask the customer
> to bump to a newer JDK with this fix and see if it magically goes away.
>
> -Todd
>
> --
> You received this message because you are subscribed to the Google Groups
> "mechanical-sympathy" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to mechanical-sympathy+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "mechanical-sympathy" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to mechanical-sympathy+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to mechanical-sympathy+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Befuddling deadlock (lock held but not in stack?)

2017-10-11 Thread Todd Lipcon
On Wed, Oct 11, 2017 at 12:29 AM, Kirk Pepperdine  wrote:

> Not at all off topic… first, thread dumps lie like a rug… and here is why…
>
> for each thread {
> safe point
> create stack trace for that thread
> release threads from safe point
> }
>
> And while rugs may attempt to cover the debris that you’ve swept under
> them, that debris leaves a clearly visible lump that suggests that you have
> a congestion problem on locks in both sun.security.provider.Sun and
> java.lang.Class…. What could possibly go wrong?
>

Interesting. I remember investigating this back in the Java 6 days and at
that time 'jstack -l' caused a global safepoint before iterating over the
threads, and then only resumed upon completion (implemented by
VM_ThreadDump in vm_operations.cpp). It appears that in JDK 7 this was
switched over to the behavior that you describe.

Either way, I'm looking at about 10 stack traces taken over a period of 3
minutes and they have 100% identical "java-level deadlock" output. So I
suppose it's possible that they are live-locked on the same sets of locks
in the exact same threads but seems fairly unlikely.

As for the congestion problem, I did find this JDK bug the other day:
https://bugs.openjdk.java.net/browse/JDK-8133070 which aims to reduce
contention, but no mention of a deadlock. We'll probably ask the customer
to bump to a newer JDK with this fix and see if it magically goes away.

-Todd

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to mechanical-sympathy+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Befuddling deadlock (lock held but not in stack?)

2017-10-11 Thread Kirk Pepperdine
Not at all off topic… first, thread dumps lie like a rug… and here is why…

for each thread {
safe point
create stack trace for that thread
release threads from safe point
}

And while rugs may attempt to cover the debris that you’ve swept under them, 
that debris leaves a clearly visible lump that suggests that you have a 
congestion problem on locks in both sun.security.provider.Sun and 
java.lang.Class…. What could possibly go wrong?


Kind regards,
Kirk

> On Oct 11, 2017, at 3:05 AM, Todd Lipcon  wrote:
> 
> Hey folks,
> 
> Apologies for the slightly off-topic post, since this isn't performance 
> related, but I hope I'll be excused since this might be interesting to the 
> group members.
> 
> We're recently facing an issue where a JVM is deadlocking in some SSL code. 
> The resulting jstack report is bizarre -- in the deadlock analysis section it 
> indicates that one of the locks is held by some thread, but in that thread's 
> stack, it doesn't show the lock anywhere. Was curious if anyone had any ideas 
> on how a lock might be "held but not held".
> 
> jstack output is as follows (with other threads and irrelevant bottom stack 
> frames removed):
> 
> Found one Java-level deadlock:
> =
> "Thread-38190":
>   waiting to lock monitor 0x267f2628 (object 0x802ba7f8, a 
> sun.security.provider.Sun),
>   which is held by "New I/O worker #1810850"
> "New I/O worker #1810850":
>   waiting to lock monitor 0x7482f5f8 (object 0x80ac88f0, a 
> java.lang.Class),
>   which is held by "New I/O worker #1810853"
> "New I/O worker #1810853":
>   waiting to lock monitor 0x267f2628 (object 0x802ba7f8, a 
> sun.security.provider.Sun),
>   which is held by "New I/O worker #1810850"
> 
> Java stack information for the threads listed above:
> ===
> "Thread-38190":
> at java.security.Provider.getService(Provider.java:1035)
> - waiting to lock <0x802ba7f8> (a sun.security.provider.Sun)
> at sun.security.jca.ProviderList.getService(ProviderList.java:332)
> at sun.security.jca.GetInstance.getInstance(GetInstance.java:157)
> at javax.net.ssl.SSLContext.getInstance(SSLContext.java:156)
> at 
> org.apache.kudu.client.SecurityContext.(SecurityContext.java:84)
> ...
> "New I/O worker #1810850":
> at 
> sun.security.ssl.CipherSuite$BulkCipher.isAvailable(CipherSuite.java:542)
> - waiting to lock <0x80ac88f0> (a java.lang.Class for 
> sun.security.ssl.CipherSuite$BulkCipher)
> at 
> sun.security.ssl.CipherSuite$BulkCipher.isAvailable(CipherSuite.java:527)
> at sun.security.ssl.CipherSuite.isAvailable(CipherSuite.java:194)
> at sun.security.ssl.Handshaker.getActiveProtocols(Handshaker.java:712)
> at sun.security.ssl.Handshaker.activate(Handshaker.java:498)
> at 
> sun.security.ssl.SSLEngineImpl.kickstartHandshake(SSLEngineImpl.java:729)
> - locked <0x8eca76f8> (a sun.security.ssl.SSLEngineImpl)
> at 
> sun.security.ssl.SSLEngineImpl.beginHandshake(SSLEngineImpl.java:756)
> at 
> org.apache.kudu.client.shaded.org.jboss.netty.handler.ssl.SslHandler.handshake(SslHandler.java:361)
> - locked <0x8eca8298> (a java.lang.Object)
> at 
> org.apache.kudu.client.Negotiator.startTlsHandshake(Negotiator.java:432)
> ...
> "New I/O worker #1810853":
> at java.security.Provider.getService(Provider.java:1035)
> - waiting to lock <0x802ba7f8> (a sun.security.provider.Sun)
> at 
> sun.security.jca.ProviderList$ServiceList.tryGet(ProviderList.java:444)
> at 
> sun.security.jca.ProviderList$ServiceList.access$200(ProviderList.java:376)
> at 
> sun.security.jca.ProviderList$ServiceList$1.hasNext(ProviderList.java:486)
> at javax.crypto.Cipher.getInstance(Cipher.java:513)
> at sun.security.ssl.JsseJce.getCipher(JsseJce.java:229)
> at sun.security.ssl.CipherBox.(CipherBox.java:179)
> at sun.security.ssl.CipherBox.newCipherBox(CipherBox.java:263)
> at 
> sun.security.ssl.CipherSuite$BulkCipher.newCipher(CipherSuite.java:505)
> at 
> sun.security.ssl.CipherSuite$BulkCipher.isAvailable(CipherSuite.java:572)
> - locked <0x80ac88f0> (a java.lang.Class for 
> sun.security.ssl.CipherSuite$BulkCipher)
> at 
> sun.security.ssl.CipherSuite$BulkCipher.isAvailable(CipherSuite.java:527)
> at sun.security.ssl.CipherSuite.isAvailable(CipherSuite.java:194)
> at 
> sun.security.ssl.SSLContextImpl.getApplicableCipherSuiteList(SSLContextImpl.java:346)
> at 
> sun.security.ssl.SSLContextImpl.getDefaultCipherSuiteList(SSLContextImpl.java:297)
> - locked <0x8ebd1880> (a 
> sun.security.ssl.SSLContextImpl$TLSContext)
> at