Re: Debugging info for a stuck SelectHiveQL processor
Great info Koji, thank you very much. Will do.

patw

On Thu, Jul 25, 2019 at 9:40 PM Koji Kawamura wrote:
> Hi Pat,
>
> I recommend getting a thread-dump when you encounter the situation next
> time. Thread-dump shows what each thread is doing, including the stuck
> SelectHiveQL thread.
Re: Debugging info for a stuck SelectHiveQL processor
Hi Pat,

I recommend getting a thread dump the next time you encounter the situation. A thread dump shows what each thread is doing, including the stuck SelectHiveQL thread.

You can get a thread dump by executing:

    ${NIFI_HOME}/bin/nifi.sh dump <dump-file-name>

The thread stack traces are then written to the specified file. The file contains many entries like the one below:

    "Timer-Driven Process Thread-8" Id=71 WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@1b3abf12
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)

Once you get the thread dump, please share it with us for further investigation.

Thanks,
Koji

On Fri, Jul 26, 2019 at 1:57 AM Pat White wrote:
> Would like to ask for suggestions on debugging SelectHiveQL processors,
> we've seen a very odd error mode twice now, where a SelectHiveQL processor
> which had been running fine suddenly becomes "stuck".
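For anyone following along: once a dump file exists, it can be tedious to find the one thread that matters among hundreds of entries. Below is a minimal Python sketch for filtering it. It assumes each thread entry starts with a header shaped like the sample in this thread ("name" Id=N STATE ...) followed by "at ..." frame lines; the real dump layout may differ between NiFi versions. The second thread in SAMPLE, including its SelectHiveQL frame, is invented purely for illustration and is not taken from any real dump.

```python
import re
import sys
from typing import Dict, List

# Assumed header shape, based only on the sample entry quoted in this thread:
#   "Timer-Driven Process Thread-8" Id=71 WAITING on ...
HEADER = re.compile(r'^"(?P<name>[^"]+)"\s+Id=(?P<id>\d+)\s+(?P<state>[A-Z_]+)')


def parse_thread_dump(text: str) -> List[Dict]:
    """Split dump text into one record per thread (name, id, state, frames)."""
    threads: List[Dict] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            current = {
                "name": match.group("name"),
                "id": int(match.group("id")),
                "state": match.group("state"),
                "frames": [],
            }
            threads.append(current)
        elif current is not None and line.startswith("at "):
            # Keep the frame text without the leading "at ".
            current["frames"].append(line[3:])
    return threads


def threads_mentioning(threads: List[Dict], needle: str) -> List[Dict]:
    """Return the threads that have at least one stack frame containing needle."""
    return [t for t in threads if any(needle in f for f in t["frames"])]


# First entry is the sample from this thread; the second is HYPOTHETICAL.
SAMPLE = '''"Timer-Driven Process Thread-8" Id=71 WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@1b3abf12
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)

"Timer-Driven Process Thread-3" Id=66 RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at org.apache.nifi.processors.hive.SelectHiveQL.onTrigger(SelectHiveQL.java)
'''

if __name__ == "__main__":
    # Usage: python find_stuck.py <dump-file-name> [needle]
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
            dump_text = fh.read()
    else:
        dump_text = SAMPLE
    needle = sys.argv[2] if len(sys.argv) > 2 else "SelectHiveQL"
    for thread in threads_mentioning(parse_thread_dump(dump_text), needle):
        print(f'{thread["name"]} (Id={thread["id"]}) is {thread["state"]}')
        for frame in thread["frames"]:
            print(f"    at {frame}")
```

Filtering on the processor class name rather than the thread name is deliberate: the "Timer-Driven Process Thread-N" pool is shared, so the thread name alone does not say which processor a thread is serving. The top frames of the matching thread should show what the stuck processor is waiting on.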
Debugging info for a stuck SelectHiveQL processor
Hi Folks,

Would like to ask for suggestions on debugging SelectHiveQL processors. We've seen a very odd error mode twice now, where a SelectHiveQL processor which had been running fine suddenly becomes "stuck". This is on 1.6.0, so a bit dated compared to 1.9.2, but I'm still very puzzled by the lack of error indications.

Symptom: the processor is running fine and continues to report 'running' on the canvas, but the input port begins to queue up and show backlogs. Stopping the processor in the canvas reports success and shows 'stopped', but trying to start it again gets the popup "No eligible components are selected. Please select the components to be stopped.". Making sure the processor is clearly selected reports the same error. The only way to get it unstuck is to restart the primary. This appears to kill the affected threads and allows the processor to begin running again; at that point it's ok again.

The issue appears directly related to the processor itself, as opposed to, say, the ConnectionPool. On that, we tried restarting the ConnectionPool being used; the stop attempt hangs on the affected processor, to the point that the stop fails. Another oddity: we tried stopping upstream objects of the affected processor, and they report "cannot be disabled because it is referenced by 1 components that are currently running", even though the canvas clearly shows that processor as stopped.

What's really strange is the lack of error indications anywhere. We see nothing in the logs at all regarding the affected processor until the primary restart. Then we see the start event when the processor is coming back online: "StandardProcessScheduler Starting SelectHiveQL id=".

Appreciate any suggestions on additional logging or other resources that would help debug. Thanks!

patw