Hi Pat,
I recommend taking a thread dump the next time you encounter this situation.
A thread dump shows what each thread is doing, including the stuck
SelectHiveQL thread.
You can take a thread dump by executing:
${NIFI_HOME}/bin/nifi.sh dump dump-file-name
The thread stack traces are then written to the specified file.
Many of the entries will look like this:
"Timer-Driven Process Thread-8" Id=71 WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@1b3abf12
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
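Since most entries are idle threads like the one above, it may help to filter the dump before reading it. Below is a minimal Python sketch, not part of NiFi, that keeps only the thread entries mentioning a keyword. It assumes each entry starts with a line beginning with a double quote (the thread name), as in the sample above, and that the stuck thread has "SelectHiveQL" somewhere in its stack frames. The function names are hypothetical.

```python
from typing import List


def split_thread_blocks(dump_text: str) -> List[str]:
    """Split a thread dump into one text block per thread.

    Assumption: every thread entry starts with a line that begins with a
    double quote (the quoted thread name). Lines before the first such
    line (for example, a file header) are ignored.
    """
    blocks: List[str] = []
    current: List[str] = []
    for line in dump_text.splitlines():
        if line.startswith('"'):
            # A new thread header closes the previous block, if any.
            if current:
                blocks.append("\n".join(current).rstrip())
            current = [line]
        elif current:
            # Stack frame or blank line belonging to the current thread.
            current.append(line)
    if current:
        blocks.append("\n".join(current).rstrip())
    return blocks


def find_threads(dump_text: str, keyword: str) -> List[str]:
    """Return the thread blocks whose text contains the keyword."""
    return [b for b in split_thread_blocks(dump_text) if keyword in b]


# Example usage (the file name is whatever was passed to the dump command):
#
#     with open("dump-file-name") as f:
#         for block in find_threads(f.read(), "SelectHiveQL"):
#             print(block, end="\n\n")
```

This only narrows down what to read first. The full dump is still the most useful thing to share, because the thread holding a lock may not mention SelectHiveQL at all.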
Once you get the thread dump, please share it with us for further investigation.
Thanks,
Koji
On Fri, Jul 26, 2019 at 1:57 AM Pat White <[email protected]> wrote:
>
> Hi Folks,
>
> I would like to ask for suggestions on debugging SelectHiveQL processors.
> We've seen a very odd error mode twice now, where a SelectHiveQL processor
> which had been running fine suddenly becomes "stuck". This is on 1.6.0, so a
> bit dated compared to 1.9.2, but I'm still very puzzled at the lack of error
> indications.
>
> Symptom: the processor is running fine and continues to report 'running' on
> the canvas, but the input port begins to queue up and show backlogs. Stopping
> the processor in the canvas reports success and shows 'stopped', but trying to
> start it again gets the popup "No eligible components are selected. Please
> select the components to be stopped." Making sure the processor is clearly
> selected gives the same error. The only way to get it unstuck is to restart
> the primary; this appears to kill the affected threads and allows the
> processor to begin running again, and at that point it's ok again.
>
> The issue appears directly related to the processor itself, as opposed to,
> say, the ConnectionPool. On that, we tried restarting the ConnectionPool being
> used; the stop attempt hangs on the affected processor, to the point that the
> stop fails. Another oddity: we tried stopping upstream objects of the affected
> processor, and they report "cannot be disabled because it is referenced by 1
> components that are currently running", even though the canvas clearly shows
> that processor as stopped.
>
> What's really strange is the lack of error indications anywhere. We see
> nothing in the logs at all regarding the affected processor until the primary
> restart. Then we see the start event when the processor is coming back online:
> "StandardProcessScheduler Starting SelectHiveQL id=".
>
> Appreciate any suggestions on additional logging or other resources that
> would help debug. Thanks!
>
> patw