On Wed, Nov 28, 2018 at 8:59 AM David Holmes <david.hol...@oracle.com> wrote: > > On 28/11/2018 5:46 pm, Thomas Stüfe wrote: > > Are you sure? > > No. I'm completely confusing myself - too many different "thread id's" :(
A cheerful multitude :) What throws me off usually is not that we have two of them, but that, on Linux, the generalized thread id (OSThread::thread_id()) is the kernel tid. That just feels weird, since pthread_t is what we use 99% of all cases, and the kernel tid only in those rare cases where we access the proc file system or similar. Well I guess this is all historical, probably pre-NPTL? In the AIX port we did it the other way around: made pthread_t the default "thread_id" and use special accessors for the kernel tid (OSThread::kernel_thread_id()) for those rare cases where we really need it. > > > On Linux, pthread id is set in the parent thread, after pthread_create > > returns. Only the kernel thread id is set by the child (I find this > > duality confusing). thread_cpu_time uses pthread_id. So, on Linux, it > > may or may not be set before the thread stack boundaries are > > initialized, depending on whether the native entry function ran before > > the parent thread continued. > > Yes you are right. The code is still broken but for a different reason. > > I may just fix this all in JDK-8214097 - Rework thread initialization > and teardown logic - by only adding to the list after the thread has run > its own initialization logic. Then the guard I just added to the > management code can be deleted again. > I agree, the way to go long term is to make sure that no uninitialized or cleaned up threads are in the threadlist. I also think your small is fine for now. I was just worried that we start using Thread::stack_size() > 0 as a synonym for something like "Thread::fully_initialized_and_still_alive()" and that once we change it, we need to hunt down all cases of "misused Thread::stack_size()". Therefore I would have prepared something like: Thread::is_alive() { return _osthread != NULL && _osthread->thread_id != 0 && _osthread->pthread_id != 0 && _stack_base != NULL; } and given that this is platform-specific, probably something like: Thread::is_alive() { return _osthread != NULL && _osthread->is->alive() && _stack_base != NULL; } OSThread::is_alive() (unix) { return _pthread_id != 0 && _thread_id != 0; } and at the end of call_run(), resetting Thread::_osthread to NULL or similar. But I am fine with your change, if you want to check it in as it is. ..Thomas > Thanks, > David > > > ..Thomas > > > > On Wed, Nov 28, 2018 at 8:35 AM David Holmes <david.hol...@oracle.com> > > wrote: > >> > >> #$%$#@! I missed the fact that the thread id is set _after_ we do > >> record_stack_base_and-size(). I thought that was in call_run() not > >> native_thread_entry(). > >> > >> <sigh> New bug being filed. > >> > >> Thanks, > >> David > >> > >> On 28/11/2018 5:26 pm, David Holmes wrote: > >>> Hi Thomas, > >>> > >>> On 28/11/2018 4:30 pm, Thomas Stüfe wrote: > >>>> Hi David, > >>>> > >>>> I admit I still do not really get it.. > >>>> > >>>> E.g. for GC threads: > >>>> > >>>> We have > >>>> > >>>> static ConcurrentMarkSweepThread* > >>>> ConcurrentMarkSweepThread::start(CMSCollector* collector); > >>>> > >>>> which creates a ConcurrentMarkSweepThread and then calls > >>>> os::create_thread() to hook it up with an OSThread and start it. > >>>> > >>>> ConcurrentMarkSweepThread derives from NonJavaThread, which in its > >>>> constructor adds the thread object to the list. > >>>> > >>>> So there is a time gap where the NJT is part of the list, but > >>>> Thread::_osthread is still NULL. > >>>> > >>>> In ThreadTimesClosure::do_thread(), we call > >>>> os::thread_cpu_time(thread)->fast_cpu_time(thread)->os::Linux::pthread_getcpuclockid(thread->osthread()->pthread_id(), > >>>> > >>>> &clockid); > >>>> > >>>> Should we then not crash when dereferencing the NULL > >>>> osthread()->pthread_id()? Why do we crash inside > >>>> pthread_getcpuclockid? > >>>> > >>>> If I look further into os::create_thread(), I see that there is > >>>> another, smaller time gap where we create OSThread and anchor it into > >>>> the Thread object: > >>>> > >>>> thread->set_osthread(osthread); > >>>> > >>>> and then later, after pthread_create is thru, set its thread id: > >>>> > >>>> // Store pthread info into the OSThread > >>>> osthread->set_pthread_id(tid); > >>>> > >>>> When OSThread is created, we call OSThread::pd_initialize() and set > >>>> its _threadid to 0. We do this in the constructor, before anchoring > >>>> the OSThread in its Thread. > >>>> > >>>> So for my understanding, we should have two situations: > >>>> > >>>> (1)- NJT is in list, but its _osthread member is NULL. In that case I > >>>> would expect a different crash. > >>>> (2)- NJT is in list, _osthread is set, but its _thread_id is NULL. > >>> > >>> Yes both situations may be possible. In the related bug report only the > >>> 0 thread_id was observed. > >>> > >>>> (Modulo any concurrency issues, e.g. could the > >>>> "thread->set_osthread(osthread);" be visible before OSThread is > >>>> initialized? > >>>> > >>>> Depending on what is happening, a fix for (1) would probably be a > >>>> querying for osthread==NULL, a fix for (2) would be to guard os::--- > >>>> functions - well, at least os::thread_cpu_time() - to disregard > >>>> Threads with pthread_t == 0. Both fixes seem better to me than > >>>> querying the stacksize() when walking the list - that seems a bit > >>>> awkward. > >>> > >>> I find it awkward that the list is populated with partially constructed > >>> threads in the first place. This particular code is only interested in > >>> live threads that can safely be queried. The stack_size() check is a > >>> surrogate for "this is a live thread that can be queried". > >>> > >>>> -- > >>>> P.s. > >>>> > >>>> ConcurrentGCThread::ConcurrentGCThread() : > >>>> _should_terminate(false), _has_terminated(false) { > >>>> }; > >>>> > >>>> I was surprised to see no invocation to the base class ctor in the > >>>> initializer list. I was not aware that this was even possible. For > >>>> code clearness, I would prefer the call to the base class ctor to be > >>>> explicit.) > >>> > >>> I assume it is implicit. But yes it should be explicit. > >>> > >>> Cheers, > >>> David > >>> > >>>> -- > >>>> > >>>> Cheers, Thomas > >>>> > >>>> > >>>> On Wed, Nov 28, 2018 at 3:58 AM David Holmes <david.hol...@oracle.com> > >>>> wrote: > >>>>> > >>>>> Sorry for the delayed response. > >>>>> > >>>>> On 21/11/2018 3:01 pm, Kim Barrett wrote: > >>>>>>> On Nov 20, 2018, at 3:50 AM, David Holmes <david.hol...@oracle.com> > >>>>>>> wrote: > >>>>>>> > >>>>>>> After discussions with Kim I've decided to split out the NJT list > >>>>>>> update into a separate RFE: > >>>>>>> > >>>>>>> https://bugs.openjdk.java.net/browse/JDK-8214097 > >>>>>>> > >>>>>>> So only the change in management.cpp needs reviewing and testing. > >>>>>>> > >>>>>>> Updated webrev: > >>>>>>> > >>>>>>> http://cr.openjdk.java.net/~dholmes/8212207/webrev.v2/ > >>>>>> > >>>>>> Looks good. > >>>>> > >>>>> Thanks Kim. > >>>>> > >>>>> I've decided to stick with this simple fix for NJTs only. > >>>>> > >>>>> David > >>>>> > >>>>>>> > >>>>>>> Thanks, > >>>>>>> David > >>>>>>> > >>>>>>> On 20/11/2018 10:01 am, David Holmes wrote: > >>>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8212207 > >>>>>>>> webrev: http://cr.openjdk.java.net/~dholmes/8212207/webrev/ > >>>>>>>> There is an internal management API that reports CPU times for > >>>>>>>> NonJavaThreads (NJTs). That functionality requires a valid/live > >>>>>>>> target thread so that we can use its pthread_t identity to obtain > >>>>>>>> its CPU clock via pthread_getcpuclockid(). > >>>>>>>> There is an iteration mechanism for NJTs in which the NJT is > >>>>>>>> registered during its constructor and de-registered during its > >>>>>>>> destructor. A thread that has only been constructed has not yet > >>>>>>>> executed and so is not a valid target for this management API. > >>>>>>>> This seems to be the cause of failures reported in this bug (and > >>>>>>>> JDK-8213434). Registering a NJT only when it starts executing is > >>>>>>>> an appealing fix for this, but that impacts all current users of > >>>>>>>> the NJT list and straight-away causes a problem with the > >>>>>>>> BarrierSet initialization logic. So I don't attempt that. > >>>>>>>> Instead the first part of the fix is for > >>>>>>>> ThreadTimesClosure::do_thread to skip threads that have not yet > >>>>>>>> executed - which we can recognize by seeing an uninitialized (i.e. > >>>>>>>> zero) stackbase. > >>>>>>>> A second part of the fix, which can be deferred to a separate RFE > >>>>>>>> for NJT lifecycle management if desired, tackles the problem of > >>>>>>>> encountering a terminated thread during iteration - which can also > >>>>>>>> lead to SEGVs. This can arise because NJT's are not actually > >>>>>>>> "destructed", even if they terminate, and so they never get > >>>>>>>> removed from the NJT list. Calling destructors is problematic > >>>>>>>> because the code using these NJTs assume they are always valid. So > >>>>>>>> the fix in this case is to move the de-registering from the NJT > >>>>>>>> list out of the destructor and into the Thread::call_run() method > >>>>>>>> so it is done before a thread actually terminates. This can be > >>>>>>>> considered a first step in cleaning up the NJT lifecycle, where > >>>>>>>> the remaining steps touch on a lot of areas and so need to be > >>>>>>>> handled separately e.g. see JDK-8087340 for shutting down WorkGang > >>>>>>>> GC worker threads. > >>>>>>>> Testing: tiers 1 -3 > >>>>>>>> I should point out that I've been unable to reproduce this failure > >>>>>>>> locally, even after thousands of runs. I'm hoping Zhengyu can test > >>>>>>>> this in the conditions reported in JDK-8213434. > >>>>>>>> Thanks, > >>>>>>>> David > >>>>>> > >>>>>>