Re: flowfiles stuck in load balanced queue; nifi 1.8

2019-02-08 Thread Boris Tyukin
The new feature is indeed amazing, but we are in the same boat and, based on
what we've heard, we do not want to risk losing flowfiles and we certainly want
to avoid restarts of NiFi.

We decided to wait for 1.9. I hope it will be out pretty soon based on the
frequency of previous releases.



On Fri, Feb 8, 2019 at 9:55 AM Woodhead, Chad  wrote:

> My team is about to start using load balanced queues in 1.8, and one thing
> I wanted to understand before we do is: if we run into this issue and we
> follow Dan’s workaround of disconnecting the node and then restarting the
> node, do the flowfiles end up moving through the rest of the flow or
> do they get lost/dropped?
>
>
>
> -Chad
>
>
>
> From: dan young
> Reply-To: "users@nifi.apache.org"
> Date: Thursday, January 17, 2019 at 7:49 PM
> To: NiFi Mailing List
> Subject: Re: flowfiles stuck in load balanced queue; nifi 1.8
>
>
>
>
>
>
> Ok, sounds great. This is really frustrating and I don't want to go back
> to RPG if possible, although that has been rock solid. Will keep an eye out
> for 1.9!
>
>
>
> Regards
>
>
>
> Dano
>
>
>
> On Thu, Jan 17, 2019, 5:17 PM Mark Payne wrote:
> Hey Dan,
>
>
>
> This can happen even within a process group; it is just much more likely
> when the destination of the connection is a Port or a Funnel, because those
> components don’t really do any work: they just push the FlowFile to the next
> connection, and that makes them super fast.
>
>
>
> There are a few different PRs awaiting review (unrelated to this) that I’d
> like to see merged in very soon, and then I think it’s probably time to
> start talking about a 1.9.0 release. There are several bug fixes, especially
> related to the load-balanced connections, and enough new features that I
> think it’s worth considering a release soon.
>
> Sent from my iPhone
>
>
> On Jan 17, 2019, at 6:59 PM, dan young  wrote:
>
> Hello Mark,
>
>
>
> We're seeing "stuck" flow files again, this time within a PG...see
> attached screen shots :(
>
>
>
>
>
>
>
> On Fri, Dec 28, 2018 at 8:43 AM Mark Payne  wrote:
>
> Dan, et al,
>
>
>
> Great news! I was able to replicate this issue finally, by creating a
> Load-Balanced connection between two Process Groups/Ports instead of between
> two processors. The fact that it's between two Ports does not, in and of
> itself, matter. But there is a race condition, and Ports do no actual
> processing of the FlowFile (they simply pull it from one queue and transfer
> it to another). As a result, because it is extremely fast, it is more likely
> to trigger the race condition.
>
> So I created a JIRA [1] and have submitted a PR for it.
>
> Interestingly, while there is no real workaround that is fool-proof, until
> this fix is in and released you could choose to update your flow so that the
> connection between Process Groups is not load balanced and instead the
> connection between the Input Port and the first Processor is load balanced.
> Again, this is not fool-proof, because it could affect the Load Balanced
> Connection even if it is connected to a Processor, but it is less likely to
> do so, so you would likely see the issue occur far less often.
>
> Thank you so much for sticking with us all as we diagnose this and figure it
> all out - we would not have been able to figure it out without you spending
> the time to debug the issue!
>
> Thanks
> -Mark
>
> [1] https://issues.apache.org/jira/browse/NIFI-5919
>
>
>
>
>
> On Dec 26, 2018, at 10:31 PM, dan young  wrote:
>
>
>
> Hello Mark,
>
>
>
> I just stopped the destination processor, and then disconnected the node
> in question (nifi1-1). Once I disconnected the node, the flow file in the
> load balance connection disappeared from the queue.  After that, I
> reconnected the node (with the downstream processor disconnected) and once
> the node successfully rejoined the cluster, the flowfile showed up in the
> queue again. After this, I started the connected downstream processor, but
> the flowfile stays in the queue. The only way to clear the queue is if I
> actually restart the node.  If I disconnect the node, and

Re: flowfiles stuck in load balanced queue; nifi 1.8

2019-02-08 Thread dan young
Looking forward to this fix!  Thanx for all the hard work on NiFi

Regards,

Dano


On Fri, Feb 8, 2019 at 7:58 AM Mark Payne  wrote:

> Chad,
>
> Upon restart, they will continue on; there is no known data loss situation.
>
> That said, we are preparing to assemble version 1.9 of NiFi now, so I would
> guess that it will be voted on, perhaps as early as next week. So it may
> (or may not)
> make sense for you, depending on your situation, to wait for the 1.9
> release.
>
> Thanks
> -Mark
>
>
> On Feb 8, 2019, at 9:55 AM, Woodhead, Chad  wrote:
>
> My team is about to start using load balanced queues in 1.8, and one thing
> I wanted to understand before we do is: if we run into this issue and we
> follow Dan’s workaround of disconnecting the node and then restarting the
> node, do the flowfiles end up moving through the rest of the flow or
> do they get lost/dropped?
>
> -Chad
>
> From: dan young
> Reply-To: "users@nifi.apache.org"
> Date: Thursday, January 17, 2019 at 7:49 PM
> To: NiFi Mailing List
> Subject: Re: flowfiles stuck in load balanced queue; nifi 1.8
>
>
> Ok, sounds great. This is really frustrating and I don't want to go back
> to RPG if possible, although that has been rock solid. Will keep an eye out
> for 1.9!
>
> Regards
>
> Dano
>
> On Thu, Jan 17, 2019, 5:17 PM Mark Payne wrote:
> Hey Dan,
>
> This can happen even within a process group; it is just much more likely
> when the destination of the connection is a Port or a Funnel, because those
> components don’t really do any work: they just push the FlowFile to the next
> connection, and that makes them super fast.
>
>
> There are a few different PRs awaiting review (unrelated to this) that I’d
> like to see merged in very soon, and then I think it’s probably time to
> start talking about a 1.9.0 release. There are several bug fixes, especially
> related to the load-balanced connections, and enough new features that I
> think it’s worth considering a release soon.
> Sent from my iPhone
>
>
> On Jan 17, 2019, at 6:59 PM, dan young  wrote:
>
> Hello Mark,
>
> We're seeing "stuck" flow files again, this time within a PG...see
> attached screen shots :(
>
>
>
> On Fri, Dec 28, 2018 at 8:43 AM Mark Payne  wrote:
>
> Dan, et al,
>
> Great news! I was able to replicate this issue finally, by creating a
> Load-Balanced connection
> between two Process Groups/Ports instead of between two processors. The
> fact that it's between
> two Ports does not, in and of itself, matter. But there is a race
> condition, and Ports do no actual
> Processing of the FlowFile (simply pull it from one queue and transfer it
> to another). As a result, because
> it is extremely fast, it is more likely to trigger the race condition.
>
> So I created a JIRA [1] and have submitted a PR for it.
>
> Interestingly, while there is no real workaround that is fool-proof, until
> this fix is in and released, you could
> choose to update your flow so that the connection between Process Groups
> is not load balanced and instead
> the connection between the Input Port and the first Processor is load
> balanced. Again, this is not fool-proof,
> because it could affect the Load Balanced Connection even if it is
> connected to a Processor, but it is less likely
> to do so, so you would likely see the issue occur far less often.
>
> Thank you so much for sticking with us all as we diagnose this and figure
> it all out - would not have been able to
> figure it out without you spending the time to debug the issue!
>
> Thanks
> -Mark
>
> [1] https://issues.apache.org/jira/browse/NIFI-5919
>
>
>
> On Dec 26, 2018, at 10:31 PM, dan young  wrote:
>
> Hello Mark,
>
> I just stopped the destination processor, and then disconnected the node
> in question (nifi1-1). Once I disconnected the node, the flow file in the
> load balance connection disappeared from the queue.  After that, I
> reconnected the node (with the downstream processor disconnected) and once
> the node successfully rejoined the cluster, the flowfile showed up in the
> queue again. After this, I started the connected downstream processor, but
> the flowfile stays in the queue. The only way to clear the queue is if I
> actually restart the node.  If I disconnect t

Re: flowfiles stuck in load balanced queue; nifi 1.8

2019-02-08 Thread Mark Payne
Chad,

Upon restart, they will continue on; there is no known data loss situation.

That said, we are preparing to assemble version 1.9 of NiFi now, so I would
guess that it will be voted on, perhaps as early as next week. So it may (or 
may not)
make sense for you, depending on your situation, to wait for the 1.9 release.

Thanks
-Mark


On Feb 8, 2019, at 9:55 AM, Woodhead, Chad <chad.woodh...@ncr.com> wrote:

My team is about to start using load balanced queues in 1.8, and one thing I
wanted to understand before we do is: if we run into this issue and we follow
Dan’s workaround of disconnecting the node and then restarting the node,
do the flowfiles end up moving through the rest of the flow or do they get
lost/dropped?

-Chad

From: dan young <danoyo...@gmail.com>
Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
Date: Thursday, January 17, 2019 at 7:49 PM
To: NiFi Mailing List <users@nifi.apache.org>
Subject: Re: flowfiles stuck in load balanced queue; nifi 1.8



Ok, sounds great. This is really frustrating and I don't want to go back to RPG 
if possible, although that has been rock solid. Will keep an eye out for 1.9!

Regards

Dano

On Thu, Jan 17, 2019, 5:17 PM Mark Payne <marka...@hotmail.com> wrote:
Hey Dan,

This can happen even within a process group; it is just much more likely when
the destination of the connection is a Port or a Funnel, because those
components don’t really do any work: they just push the FlowFile to the next
connection, and that makes them super fast.

There are a few different PRs awaiting review (unrelated to this)
that I’d like to see merged in very soon, and then I think it’s probably time to
start talking about a 1.9.0 release. There are several bug fixes, especially
related to the load-balanced connections, and enough new features that I think
it’s worth considering a release soon.
Sent from my iPhone

On Jan 17, 2019, at 6:59 PM, dan young <danoyo...@gmail.com> wrote:
Hello Mark,

We're seeing "stuck" flow files again, this time within a PG...see attached 
screen shots :(



On Fri, Dec 28, 2018 at 8:43 AM Mark Payne <marka...@hotmail.com> wrote:
Dan, et al,

Great news! I was able to replicate this issue finally, by creating a 
Load-Balanced connection
between two Process Groups/Ports instead of between two processors. The fact 
that it's between
two Ports does not, in and of itself, matter. But there is a race condition, 
and Ports do no actual
Processing of the FlowFile (simply pull it from one queue and transfer it to 
another). As a result, because
it is extremely fast, it is more likely to trigger the race condition.

So I created a JIRA [1] and have submitted a PR for it.

Interestingly, while there is no real workaround that is fool-proof, until this 
fix is in and released, you could
choose to update your flow so that the connection between Process Groups is not 
load balanced and instead
the connection between the Input Port and the first Processor is load balanced. 
Again, this is not fool-proof,
because it could affect the Load Balanced Connection even if it is connected to 
a Processor, but it is less likely
to do so, so you would likely see the issue occur far less often.

Thank you so much for sticking with us all as we diagnose this and figure it 
all out - would not have been able to
figure it out without you spending the time to debug the issue!

Thanks
-Mark

[1] 
https://issues.apache.org/jira/browse/NIFI-5919



On Dec 26, 2018, at 10:31 PM, dan young <danoyo...@gmail.com> wrote:

Hello Mark,

I just stopped the destination processor, and then disconnected the node in 
question (nifi1-1). Once I disconnected the node, the flow file in the load 
balance connection disappeared from the queue.  After that, I reconnected the 
node (with the downstream processor disconnected) and once the node 
successfully rejoined the cluster, the flowfile showed up in the queue again. 
After this, I started the connected downstream processor, but the flowfile 
stays in the queue. The only way to clear the queue is if I actually restart 
the node.  If I disconnect the node, and then restart that node, the flowfile 
is no longer present in the queue.

Regards,

Dano


On Wed, Dec 26, 2018 at 6:13 PM Mark Payne <marka...@hotmail.com> wrote:
Ok, I just wanted to confirm that when you said “once it rejoins the cluster 
that flow file is gone” that you mean “the flowfile did not exist on the 
sys

Re: flowfiles stuck in load balanced queue; nifi 1.8

2019-02-08 Thread Woodhead, Chad
My team is about to start using load balanced queues in 1.8, and one thing I
wanted to understand before we do is: if we run into this issue and we follow
Dan’s workaround of disconnecting the node and then restarting the node,
do the flowfiles end up moving through the rest of the flow or do they get
lost/dropped?

-Chad

From: dan young 
Reply-To: "users@nifi.apache.org" 
Date: Thursday, January 17, 2019 at 7:49 PM
To: NiFi Mailing List 
Subject: Re: flowfiles stuck in load balanced queue; nifi 1.8



Ok, sounds great. This is really frustrating and I don't want to go back to RPG 
if possible, although that has been rock solid. Will keep an eye out for 1.9!

Regards

Dano

On Thu, Jan 17, 2019, 5:17 PM Mark Payne <marka...@hotmail.com> wrote:
Hey Dan,

This can happen even within a process group; it is just much more likely when
the destination of the connection is a Port or a Funnel, because those
components don’t really do any work: they just push the FlowFile to the next
connection, and that makes them super fast.

There are a few different PRs awaiting review (unrelated to this)
that I’d like to see merged in very soon, and then I think it’s probably time to
start talking about a 1.9.0 release. There are several bug fixes, especially
related to the load-balanced connections, and enough new features that I think
it’s worth considering a release soon.
Sent from my iPhone

On Jan 17, 2019, at 6:59 PM, dan young <danoyo...@gmail.com> wrote:
Hello Mark,

We're seeing "stuck" flow files again, this time within a PG...see attached 
screen shots :(



On Fri, Dec 28, 2018 at 8:43 AM Mark Payne <marka...@hotmail.com> wrote:
Dan, et al,

Great news! I was able to replicate this issue finally, by creating a 
Load-Balanced connection
between two Process Groups/Ports instead of between two processors. The fact 
that it's between
two Ports does not, in and of itself, matter. But there is a race condition, 
and Ports do no actual
Processing of the FlowFile (simply pull it from one queue and transfer it to 
another). As a result, because
it is extremely fast, it is more likely to trigger the race condition.

So I created a JIRA [1] and have submitted a PR for it.

Interestingly, while there is no real workaround that is fool-proof, until this 
fix is in and released, you could
choose to update your flow so that the connection between Process Groups is not 
load balanced and instead
the connection between the Input Port and the first Processor is load balanced. 
Again, this is not fool-proof,
because it could affect the Load Balanced Connection even if it is connected to 
a Processor, but it is less likely
to do so, so you would likely see the issue occur far less often.

Thank you so much for sticking with us all as we diagnose this and figure it 
all out - would not have been able to
figure it out without you spending the time to debug the issue!

Thanks
-Mark

[1] 
https://issues.apache.org/jira/browse/NIFI-5919



On Dec 26, 2018, at 10:31 PM, dan young <danoyo...@gmail.com> wrote:

Hello Mark,

I just stopped the destination processor, and then disconnected the node in 
question (nifi1-1). Once I disconnected the node, the flow file in the load 
balance connection disappeared from the queue.  After that, I reconnected the 
node (with the downstream processor disconnected) and once the node 
successfully rejoined the cluster, the flowfile showed up in the queue again. 
After this, I started the connected downstream processor, but the flowfile 
stays in the queue. The only way to clear the queue is if I actually restart 
the node.  If I disconnect the node, and then restart that node, the flowfile 
is no longer present in the queue.

Regards,

Dano


On Wed, Dec 26, 2018 at 6:13 PM Mark Payne <marka...@hotmail.com> wrote:
Ok, I just wanted to confirm that when you said “once it rejoins the cluster 
that flow file is gone” that you mean “the flowfile did not exist on the 
system” and NOT “the queue size was 0 by the time that I looked at the UI.” 
I.e., is it possible that the FlowFile did exist, was restored, and then was 
processed before you looked at the UI? Or the FlowFile definitely did not exist 
after the node was restarted? That’s why I was suggesting that you restart with 
the connection’s source and destination stopped. Just to make sure that the 
FlowFile didn’t just get processed quickly on restart.
Sent from my iPhone

On Dec 26, 2018, at 7:55 PM, dan young <danoyo...@gmail.com> wrote:
Heya Mark,

If we res

Re: flowfiles stuck in load balanced queue; nifi 1.8

2019-01-17 Thread dan young
y the queue, it will
>>> just say that there are no flow files in the queue.
>>>
>>>
>>> On Wed, Dec 26, 2018, 5:22 PM Mark Payne wrote:
>>>> Hey Dan,
>>>>
>>>> Thanks, this is super useful! So, the following section is the damning
>>>> part of the JSON:
>>>>
>>>>   {
>>>> "totalFlowFileCount": 1,
>>>> "totalByteCount": 975890,
>>>> "nodeIdentifier": "nifi1-1:9443",
>>>> "localQueuePartition": {
>>>>   "totalFlowFileCount": 0,
>>>>   "totalByteCount": 0,
>>>>   "activeQueueFlowFileCount": 0,
>>>>   "activeQueueByteCount": 0,
>>>>   "swapFlowFileCount": 0,
>>>>   "swapByteCount": 0,
>>>>   "swapFiles": 0,
>>>>   "inFlightFlowFileCount": 0,
>>>>   "inFlightByteCount": 0,
>>>>   "allActiveQueueFlowFilesPenalized": false,
>>>>   "anyActiveQueueFlowFilesPenalized": false
>>>> },
>>>> "remoteQueuePartitions": [
>>>>   {
>>>> "totalFlowFileCount": 0,
>>>> "totalByteCount": 0,
>>>> "activeQueueFlowFileCount": 0,
>>>> "activeQueueByteCount": 0,
>>>> "swapFlowFileCount": 0,
>>>> "swapByteCount": 0,
>>>> "swapFiles": 0,
>>>> "inFlightFlowFileCount": 0,
>>>> "inFlightByteCount": 0,
>>>> "nodeIdentifier": "nifi2-1:9443"
>>>>   },
>>>>   {
>>>> "totalFlowFileCount": 0,
>>>> "totalByteCount": 0,
>>>> "activeQueueFlowFileCount": 0,
>>>> "activeQueueByteCount": 0,
>>>> "swapFlowFileCount": 0,
>>>> "swapByteCount": 0,
>>>> "swapFiles": 0,
>>>>         "inFlightFlowFileCount": 0,
>>>> "inFlightByteCount": 0,
>>>> "nodeIdentifier": "nifi3-1:9443"
>>>>   }
>>>> ]
>>>>   }
>>>>
>>>> It indicates that node nifi1-1 is showing a queue size of 1 FlowFile, 
>>>> 975890
>>>> bytes. But it also shows that the FlowFile is not in the "local partition"
>>>> or either of the two "remote partitions." So that leaves us with two
>>>> possibilities:
>>>>
>>>> 1) The Queue's Count is wrong, because it somehow did not get
>>>> decremented (perhaps a threading bug?)
>>>>
>>>> Or
>>>>
>>>> 2) The Count is correct and the FlowFile exists, but somehow the
>>>> reference to the FlowFile was lost by the FlowFile Queue (again, perhaps a
>>>> threading bug?)
>>>>
>>>> If possible, I would like for you to stop both the source and destination of
>>>> that connection and then restart node nifi1-1. Once it has restarted, check
>>>> if the FlowFile is still in the connection. That will tell us which of the
>>>> two above scenarios is taking place. If the FlowFile exists upon restart,
>>>> then the Queue somehow lost the handle to it. If the FlowFile does not
>>>> exist in the connection upon restart (I'm guessing this will be the case),
>>>> then it indicates that somehow the count is incorrect.
>>>>
>>>> Many thanks
>>>> -Mark
>>>>
>>>> --
>>>> *From:* dan young 
>>>> *Sent:* Wednesday, December 26, 2018 9:18 AM
>>>> *To:* NiFi Mailing List
>>>> *Subject:* Re: flowfiles stuck in load balanced queue; nifi 1.8
>>>>
>>>> Heya Mark,
>>>>
>>>> So I added a Log Attribute Processor and routed the connection that had

Re: flowfiles stuck in load balanced queue; nifi 1.8

2019-01-17 Thread Mark Payne
"inFlightByteCount": 0,
  "allActiveQueueFlowFilesPenalized": false,
  "anyActiveQueueFlowFilesPenalized": false
},
"remoteQueuePartitions": [
  {
"totalFlowFileCount": 0,
"totalByteCount": 0,
"activeQueueFlowFileCount": 0,
"activeQueueByteCount": 0,
"swapFlowFileCount": 0,
"swapByteCount": 0,
"swapFiles": 0,
"inFlightFlowFileCount": 0,
"inFlightByteCount": 0,
"nodeIdentifier": "nifi2-1:9443"
  },
  {
"totalFlowFileCount": 0,
"totalByteCount": 0,
"activeQueueFlowFileCount": 0,
"activeQueueByteCount": 0,
"swapFlowFileCount": 0,
"swapByteCount": 0,
"swapFiles": 0,
"inFlightFlowFileCount": 0,
"inFlightByteCount": 0,
"nodeIdentifier": "nifi3-1:9443"
  }
]
  }

It indicates that node nifi1-1 is showing a queue size of 1 FlowFile, 975890 
bytes. But it also shows that the FlowFile is not in the "local partition" or 
either of the two "remote partitions." So that leaves us with two possibilities:

1) The Queue's Count is wrong, because it somehow did not get decremented 
(perhaps a threading bug?)

Or

2) The Count is correct and the FlowFile exists, but somehow the reference to 
the FlowFile was lost by the FlowFile Queue (again, perhaps a threading bug?)

If possible, I would like for you to stop both the source and destination of that
connection and then restart node nifi1-1. Once it has restarted, check if the 
FlowFile is still in the connection. That will tell us which of the two above 
scenarios is taking place. If the FlowFile exists upon restart, then the Queue 
somehow lost the handle to it. If the FlowFile does not exist in the connection 
upon restart (I'm guessing this will be the case), then it indicates that 
somehow the count is incorrect.

Many thanks
-Mark


From: dan young <danoyo...@gmail.com>
Sent: Wednesday, December 26, 2018 9:18 AM
To: NiFi Mailing List
Subject: Re: flowfiles stuck in load balanced queue; nifi 1.8

Heya Mark,

So I added a Log Attribute Processor and routed the connection that had the 
"stuck" flowfile to it.   I ran a get diagnostics to the Log Attribute 
processor before I started it, and then ran another diagnostics after I started 
it.  The flowfile stayed in the load balanced connection/queue.  I've attached 
both files.  Please LMK if this helps.

Regards,

Dano


On Mon, Dec 24, 2018 at 10:35 AM Mark Payne <marka...@hotmail.com> wrote:
Dan,

You would want to get diagnostics for the processor that is the 
source/destination of the connection - not the FlowFile. But if your connection
is connecting 2 process groups then both its source and destination are Ports, 
not Processors. So the easiest thing to do would be to drop a “dummy processor” 
into the flow between the 2 groups, drag the Connection to that processor, get 
diagnostics for the processor, and then drag it back to where it was. Does that 
make sense? Sorry for the hassle.

Thanks
-Mark

Sent from my iPhone

On Dec 24, 2018, at 11:40 AM, dan young <danoyo...@gmail.com> wrote:

Hello Bryan,

Thank you, that was the ticket!

Mark, I was able to run the diagnostics for a processor that's downstream from 
the connection where the flowfile appears to be "stuck". I'm not sure what 
processor is the source of this particular "stuck" flowfile since we have a 
number of upstream processor groups (PG) that feed into a funnel.  This funnel 
is then connected to a downstream PG. It is this connection between the funnel 
and a downstream PG where the flowfile is stuck. I might reduce the upstream 
"load balanced connections" between the various PGs to just one so I can narrow 
where we need to run diagnostics. If this isn't the correct processor to be
gathering diagnostics, please LMK where else I should look or other diagnostics 
to run...

I've also attached the output (nifi-api/connections/{id}) of the get for that 
connection where the flowfile appears to be "stuck"
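
(For reference, the request behind that attachment looks roughly like the sketch
below; the host, port, and connection UUID are placeholders for whatever your
cluster and the stuck connection actually use, and the bearer token is obtained
as described further down in the thread.)

  curl -k -H 'Authorization: Bearer <token>' \
    'https://nifi1-1:9443/nifi-api/connections/<connection-id>'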

On Sun, Dec 23, 2018 at 8:36 PM Bryan Bende <bbe...@gmail.com> wrote:
You’ll need to get the token that was obtained when you logged in to the SSO 
and submit it on the curl requests the same way the UI is doing on all requests.

You should be able to open Chrome dev tools while in

Re: flowfiles stuck in load balanced queue; nifi 1.8

2019-01-05 Thread Mark Payne
ount": 0,
"swapByteCount": 0,
"swapFiles": 0,
"inFlightFlowFileCount": 0,
"inFlightByteCount": 0,
"nodeIdentifier": "nifi2-1:9443"
  },
  {
"totalFlowFileCount": 0,
"totalByteCount": 0,
"activeQueueFlowFileCount": 0,
"activeQueueByteCount": 0,
"swapFlowFileCount": 0,
"swapByteCount": 0,
"swapFiles": 0,
"inFlightFlowFileCount": 0,
"inFlightByteCount": 0,
"nodeIdentifier": "nifi3-1:9443"
  }
]
  }

It indicates that node nifi1-1 is showing a queue size of 1 FlowFile, 975890 
bytes. But it also shows that the FlowFile is not in the "local partition" or 
either of the two "remote partitions." So that leaves us with two possibilities:

1) The Queue's Count is wrong, because it somehow did not get decremented 
(perhaps a threading bug?)

Or

2) The Count is correct and the FlowFile exists, but somehow the reference to 
the FlowFile was lost by the FlowFile Queue (again, perhaps a threading bug?)

If possible, I would like for you to stop both the source and destination of that
connection and then restart node nifi1-1. Once it has restarted, check if the 
FlowFile is still in the connection. That will tell us which of the two above 
scenarios is taking place. If the FlowFile exists upon restart, then the Queue 
somehow lost the handle to it. If the FlowFile does not exist in the connection 
upon restart (I'm guessing this will be the case), then it indicates that 
somehow the count is incorrect.

Many thanks
-Mark


From: dan young <danoyo...@gmail.com>
Sent: Wednesday, December 26, 2018 9:18 AM
To: NiFi Mailing List
Subject: Re: flowfiles stuck in load balanced queue; nifi 1.8

Heya Mark,

So I added a Log Attribute Processor and routed the connection that had the 
"stuck" flowfile to it.   I ran a get diagnostics to the Log Attribute 
processor before I started it, and then ran another diagnostics after I started 
it.  The flowfile stayed in the load balanced connection/queue.  I've attached 
both files.  Please LMK if this helps.

Regards,

Dano


On Mon, Dec 24, 2018 at 10:35 AM Mark Payne <marka...@hotmail.com> wrote:
Dan,

You would want to get diagnostics for the processor that is the 
source/destination of the connection - not the FlowFile. But if your connection
is connecting 2 process groups then both its source and destination are Ports, 
not Processors. So the easiest thing to do would be to drop a “dummy processor” 
into the flow between the 2 groups, drag the Connection to that processor, get 
diagnostics for the processor, and then drag it back to where it was. Does that 
make sense? Sorry for the hassle.

Thanks
-Mark

Sent from my iPhone

On Dec 24, 2018, at 11:40 AM, dan young <danoyo...@gmail.com> wrote:

Hello Bryan,

Thank you, that was the ticket!

Mark, I was able to run the diagnostics for a processor that's downstream from 
the connection where the flowfile appears to be "stuck". I'm not sure what 
processor is the source of this particular "stuck" flowfile since we have a 
number of upstream processor groups (PG) that feed into a funnel.  This funnel 
is then connected to a downstream PG. It is this connection between the funnel 
and a downstream PG where the flowfile is stuck. I might reduce the upstream 
"load balanced connections" between the various PGs to just one so I can narrow 
where we need to run diagnostics. If this isn't the correct processor to be
gathering diagnostics, please LMK where else I should look or other diagnostics 
to run...

I've also attached the output (nifi-api/connections/{id}) of the get for that 
connection where the flowfile appears to be "stuck"

On Sun, Dec 23, 2018 at 8:36 PM Bryan Bende <bbe...@gmail.com> wrote:
You’ll need to get the token that was obtained when you logged in to the SSO 
and submit it on the curl requests the same way the UI is doing on all requests.

You should be able to open Chrome dev tools while in the UI and look at
one of the request/responses and copy the value of the 'Authorization' header,
which should be in the form 'Bearer <token>'.

Then send this on the curl command by specifying a header of -H 'Authorization: 
Bearer <token>'
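
Putting those two steps together, a minimal sketch (the token value and the node
address are placeholders; -k is only needed if the cluster uses certificates your
curl does not trust):

  # Token copied from the Authorization header in Chrome dev tools
  TOKEN='<paste-token-here>'
  # Quick sanity check that the token works against the REST API
  curl -k -H "Authorization: Bearer $TOKEN" 'https://nifi1-1:9443/nifi-api/flow/about'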

On Sun, Dec 23, 2018 at 6:28 PM dan young <danoyo...@gmail.com> wrote:
I forgot to mention that we're using the OpenID Connect SSO. Is there a way
to run these commands via curl when we have the cluster configured this way? If
so would

Re: flowfiles stuck in load balanced queue; nifi 1.8

2019-01-05 Thread dan young
,
>>>   "anyActiveQueueFlowFilesPenalized": false
>>> },
>>> "remoteQueuePartitions": [
>>>   {
>>> "totalFlowFileCount": 0,
>>> "totalByteCount": 0,
>>> "activeQueueFlowFileCount": 0,
>>> "activeQueueByteCount": 0,
>>> "swapFlowFileCount": 0,
>>> "swapByteCount": 0,
>>> "swapFiles": 0,
>>> "inFlightFlowFileCount": 0,
>>> "inFlightByteCount": 0,
>>> "nodeIdentifier": "nifi2-1:9443"
>>>   },
>>>   {
>>> "totalFlowFileCount": 0,
>>> "totalByteCount": 0,
>>> "activeQueueFlowFileCount": 0,
>>> "activeQueueByteCount": 0,
>>> "swapFlowFileCount": 0,
>>> "swapByteCount": 0,
>>> "swapFiles": 0,
>>> "inFlightFlowFileCount": 0,
>>> "inFlightByteCount": 0,
>>> "nodeIdentifier": "nifi3-1:9443"
>>>   }
>>> ]
>>>   }
>>>
>>> It indicates that node nifi1-1 is showing a queue size of 1 FlowFile, 975890
>>> bytes. But it also shows that the FlowFile is not in the "local partition"
>>> or either of the two "remote partitions." So that leaves us with two
>>> possibilities:
>>>
>>> 1) The Queue's Count is wrong, because it somehow did not get
>>> decremented (perhaps a threading bug?)
>>>
>>> Or
>>>
>>> 2) The Count is correct and the FlowFile exists, but somehow the
>>> reference to the FlowFile was lost by the FlowFile Queue (again, perhaps a
>>> threading bug?)
>>>
>>> If possible, I would like for you to stop both the source and destination of
>>> that connection and then restart node nifi1-1. Once it has restarted, check
>>> if the FlowFile is still in the connection. That will tell us which of the
>>> two above scenarios is taking place. If the FlowFile exists upon restart,
>>> then the Queue somehow lost the handle to it. If the FlowFile does not
>>> exist in the connection upon restart (I'm guessing this will be the case),
>>> then it indicates that somehow the count is incorrect.
>>>
>>> Many thanks
>>> -Mark
>>>
>>> --
>>> *From:* dan young 
>>> *Sent:* Wednesday, December 26, 2018 9:18 AM
>>> *To:* NiFi Mailing List
>>> *Subject:* Re: flowfiles stuck in load balanced queue; nifi 1.8
>>>
>>> Heya Mark,
>>>
>>> So I added a Log Attribute Processor and routed the connection that had
>>> the "stuck" flowfile to it.   I ran a get diagnostics to the Log Attribute
>>> processor before I started it, and then ran another diagnostics after I
>>> started it.  The flowfile stayed in the load balanced connection/queue.
>>> I've attached both files.  Please LMK if this helps.
>>>
>>> Regards,
>>>
>>> Dano
>>>
>>>
>>> On Mon, Dec 24, 2018 at 10:35 AM Mark Payne 
>>> wrote:
>>>
>>> Dan,
>>>
>>> You would want to get diagnostics for the processor that is the
>>> source/destination of the connection - not the FlowFile. But if your
>>> connection is connecting 2 process groups then both its source and
>>> destination are Ports, not Processors. So the easiest thing to do would be
>>> to drop a “dummy processor” into the flow between the 2 groups, drag the
>>> Connection to that processor, get diagnostics for the processor, and then
>>> drag it back to where it was. Does that make sense? Sorry for the hassle.
>>>
>>> Thanks
>>> -Mark
>>>
>>> Sent from my iPhone
>>>
>>> On Dec 24, 2018, at 11:40 AM, dan young  wrote:
>>>
>>> Hello Bryan,
>>>
>>> Thank you, that was the ticket!
>>>
>>> Mark, I was able to run the diagnostics for a processor that's
>>> downstream from the connection where the flowfile appears to be "stuck".

Re: flowfiles stuck in load balanced queue; nifi 1.8

2018-12-28 Thread dan young
},
>>> "remoteQueuePartitions": [
>>>   {
>>> "totalFlowFileCount": 0,
>>> "totalByteCount": 0,
>>> "activeQueueFlowFileCount": 0,
>>> "activeQueueByteCount": 0,
>>> "swapFlowFileCount": 0,
>>> "swapByteCount": 0,
>>> "swapFiles": 0,
>>> "inFlightFlowFileCount": 0,
>>> "inFlightByteCount": 0,
>>> "nodeIdentifier": "nifi2-1:9443"
>>>   },
>>>   {
>>> "totalFlowFileCount": 0,
>>> "totalByteCount": 0,
>>> "activeQueueFlowFileCount": 0,
>>> "activeQueueByteCount": 0,
>>> "swapFlowFileCount": 0,
>>> "swapByteCount": 0,
>>> "swapFiles": 0,
>>> "inFlightFlowFileCount": 0,
>>> "inFlightByteCount": 0,
>>> "nodeIdentifier": "nifi3-1:9443"
>>>   }
>>> ]
>>>   }
>>>
>>> It indicates that node nifi1-1 is showing a queue size of 1 FlowFile, 975890
>>> bytes. But it also shows that the FlowFile is not in the "local partition"
>>> or either of the two "remote partitions." So that leaves us with two
>>> possibilities:
>>>
>>> 1) The Queue's Count is wrong, because it somehow did not get
>>> decremented (perhaps a threading bug?)
>>>
>>> Or
>>>
>>> 2) The Count is correct and the FlowFile exists, but somehow the
>>> reference to the FlowFile was lost by the FlowFile Queue (again, perhaps a
>>> threading bug?)
>>>
>>> If possible, I would like for you to stop both the source and destination of
>>> that connection and then restart node nifi1-1. Once it has restarted, check
>>> if the FlowFile is still in the connection. That will tell us which of the
>>> two above scenarios is taking place. If the FlowFile exists upon restart,
>>> then the Queue somehow lost the handle to it. If the FlowFile does not
>>> exist in the connection upon restart (I'm guessing this will be the case),
>>> then it indicates that somehow the count is incorrect.
>>>
>>> Many thanks
>>> -Mark
>>>
>>> --
>>> *From:* dan young 
>>> *Sent:* Wednesday, December 26, 2018 9:18 AM
>>> *To:* NiFi Mailing List
>>> *Subject:* Re: flowfiles stuck in load balanced queue; nifi 1.8
>>>
>>> Heya Mark,
>>>
>>> So I added a Log Attribute Processor and routed the connection that had
>>> the "stuck" flowfile to it.   I ran a get diagnostics to the Log Attribute
>>> processor before I started it, and then ran another diagnostics after I
>>> started it.  The flowfile stayed in the load balanced connection/queue.
>>> I've attached both files.  Please LMK if this helps.
>>>
>>> Regards,
>>>
>>> Dano
>>>
>>>
>>> On Mon, Dec 24, 2018 at 10:35 AM Mark Payne 
>>> wrote:
>>>
>>> Dan,
>>>
>>> You would want to get diagnostics for the processor that is the
>>> source/destination of the connection - not the FlowFile. But if your
>>> connection is connecting 2 process groups then both its source and
>>> destination are Ports, not Processors. So the easiest thing to do would be
>>> to drop a “dummy processor” into the flow between the 2 groups, drag the
>>> Connection to that processor, get diagnostics for the processor, and then
>>> drag it back to where it was. Does that make sense? Sorry for the hassle.
>>>
>>> Thanks
>>> -Mark
>>>
>>> Sent from my iPhone
>>>
>>> On Dec 24, 2018, at 11:40 AM, dan young  wrote:
>>>
>>> Hello Bryan,
>>>
>>> Thank you, that was the ticket!
>>>
>>> Mark, I was able to run the diagnostics for a processor that's
>>> downstream from the connection where the flowfile appears to be "stuck".
>>> I'm not sure what processor is the source of this particular "stuck"
>>> flowfile since

Re: flowfiles stuck in load balanced queue; nifi 1.8

2018-12-28 Thread Boris Tyukin
  "anyActiveQueueFlowFilesPenalized": false
>>> },
>>> "remoteQueuePartitions": [
>>>   {
>>> "totalFlowFileCount": 0,
>>> "totalByteCount": 0,
>>> "activeQueueFlowFileCount": 0,
>>> "activeQueueByteCount": 0,
>>> "swapFlowFileCount": 0,
>>> "swapByteCount": 0,
>>> "swapFiles": 0,
>>> "inFlightFlowFileCount": 0,
>>> "inFlightByteCount": 0,
>>> "nodeIdentifier": "nifi2-1:9443"
>>>   },
>>>   {
>>> "totalFlowFileCount": 0,
>>> "totalByteCount": 0,
>>> "activeQueueFlowFileCount": 0,
>>> "activeQueueByteCount": 0,
>>> "swapFlowFileCount": 0,
>>> "swapByteCount": 0,
>>> "swapFiles": 0,
>>> "inFlightFlowFileCount": 0,
>>> "inFlightByteCount": 0,
>>> "nodeIdentifier": "nifi3-1:9443"
>>>   }
>>> ]
>>>   }
>>>
>>> It indicates that node nifi1-1 is showing a queue size of 1 FlowFile, 975890
>>> bytes. But it also shows that the FlowFile is not in the "local partition"
>>> or either of the two "remote partitions." So that leaves us with two
>>> possibilities:
>>>
>>> 1) The Queue's Count is wrong, because it somehow did not get
>>> decremented (perhaps a threading bug?)
>>>
>>> Or
>>>
>>> 2) The Count is correct and the FlowFile exists, but somehow the
>>> reference to the FlowFile was lost by the FlowFile Queue (again, perhaps a
>>> threading bug?)
>>>
>>> If possible, I would like for you to stop both the source and destination of
>>> that connection and then restart node nifi1-1. Once it has restarted, check
>>> if the FlowFile is still in the connection. That will tell us which of the
>>> two above scenarios is taking place. If the FlowFile exists upon restart,
>>> then the Queue somehow lost the handle to it. If the FlowFile does not
>>> exist in the connection upon restart (I'm guessing this will be the case),
>>> then it indicates that somehow the count is incorrect.
>>>
>>> Many thanks
>>> -Mark
>>>
>>> --
>>> *From:* dan young 
>>> *Sent:* Wednesday, December 26, 2018 9:18 AM
>>> *To:* NiFi Mailing List
>>> *Subject:* Re: flowfiles stuck in load balanced queue; nifi 1.8
>>>
>>> Heya Mark,
>>>
>>> So I added a Log Attribute Processor and routed the connection that had
>>> the "stuck" flowfile to it.   I ran a get diagnostics to the Log Attribute
>>> processor before I started it, and then ran another diagnostics after I
>>> started it.  The flowfile stayed in the load balanced connection/queue.
>>> I've attached both files.  Please LMK if this helps.
>>>
>>> Regards,
>>>
>>> Dano
>>>
>>>
>>> On Mon, Dec 24, 2018 at 10:35 AM Mark Payne 
>>> wrote:
>>>
>>> Dan,
>>>
>>> You would want to get diagnostics for the processor that is the
>>> source/destination of the connection - not the FlowFile. But if your
>>> connection is connecting 2 process groups then both its source and
>>> destination are Ports, not Processors. So the easiest thing to do would be
>>> to drop a “dummy processor” into the flow between the 2 groups, drag the
>>> Connection to that processor, get diagnostics for the processor, and then
>>> drag it back to where it was. Does that make sense? Sorry for the hassle.
>>>
>>> Thanks
>>> -Mark
>>>
>>> Sent from my iPhone
>>>
>>> On Dec 24, 2018, at 11:40 AM, dan young  wrote:
>>>
>>> Hello Bryan,
>>>
>>> Thank you, that was the ticket!
>>>
>>> Mark, I was able to run the diagnostics for a processor that's
>>> downstream from the connection where the flowfile appears to be "stuck".
>>> I

Re: flowfiles stuck in load balanced queue; nifi 1.8

2018-12-28 Thread dan young
 },
>>> "remoteQueuePartitions": [
>>>   {
>>> "totalFlowFileCount": 0,
>>> "totalByteCount": 0,
>>> "activeQueueFlowFileCount": 0,
>>> "activeQueueByteCount": 0,
>>> "swapFlowFileCount": 0,
>>> "swapByteCount": 0,
>>> "swapFiles": 0,
>>> "inFlightFlowFileCount": 0,
>>> "inFlightByteCount": 0,
>>> "nodeIdentifier": "nifi2-1:9443"
>>>   },
>>>   {
>>> "totalFlowFileCount": 0,
>>> "totalByteCount": 0,
>>> "activeQueueFlowFileCount": 0,
>>> "activeQueueByteCount": 0,
>>> "swapFlowFileCount": 0,
>>> "swapByteCount": 0,
>>> "swapFiles": 0,
>>> "inFlightFlowFileCount": 0,
>>> "inFlightByteCount": 0,
>>> "nodeIdentifier": "nifi3-1:9443"
>>>   }
>>> ]
>>>   }
>>>
>>> It indicates that node nifi1-1 is showing a queue size of 1 FlowFile, 975890
>>> bytes. But it also shows that the FlowFile is not in the "local partition"
>>> or either of the two "remote partitions." So that leaves us with two
>>> possibilities:
>>>
>>> 1) The Queue's Count is wrong, because it somehow did not get
>>> decremented (perhaps a threading bug?)
>>>
>>> Or
>>>
>>> 2) The Count is correct and the FlowFile exists, but somehow the
>>> reference to the FlowFile was lost by the FlowFile Queue (again, perhaps a
>>> threading bug?)
>>>
>>> If possible, I would like for you to stop both the source and destination of
>>> that connection and then restart node nifi1-1. Once it has restarted, check
>>> if the FlowFile is still in the connection. That will tell us which of the
>>> two above scenarios is taking place. If the FlowFile exists upon restart,
>>> then the Queue somehow lost the handle to it. If the FlowFile does not
>>> exist in the connection upon restart (I'm guessing this will be the case),
>>> then it indicates that somehow the count is incorrect.
>>>
>>> Many thanks
>>> -Mark
>>>
>>> --
>>> *From:* dan young 
>>> *Sent:* Wednesday, December 26, 2018 9:18 AM
>>> *To:* NiFi Mailing List
>>> *Subject:* Re: flowfiles stuck in load balanced queue; nifi 1.8
>>>
>>> Heya Mark,
>>>
>>> So I added a Log Attribute Processor and routed the connection that had
>>> the "stuck" flowfile to it.   I ran a get diagnostics to the Log Attribute
>>> processor before I started it, and then ran another diagnostics after I
>>> started it.  The flowfile stayed in the load balanced connection/queue.
>>> I've attached both files.  Please LMK if this helps.
>>>
>>> Regards,
>>>
>>> Dano
>>>
>>>
>>> On Mon, Dec 24, 2018 at 10:35 AM Mark Payne 
>>> wrote:
>>>
>>> Dan,
>>>
>>> You would want to get diagnostics for the processor that is the
>>> source/destination of the connection - not the FlowFile. But if your
>>> connection is connecting 2 process groups then both its source and
>>> destination are Ports, not Processors. So the easiest thing to do would be
>>> to drop a “dummy processor” into the flow between the 2 groups, drag the
>>> Connection to that processor, get diagnostics for the processor, and then
>>> drag it back to where it was. Does that make sense? Sorry for the hassle.
>>>
>>> Thanks
>>> -Mark
>>>
>>> Sent from my iPhone
>>>
>>> On Dec 24, 2018, at 11:40 AM, dan young  wrote:
>>>
>>> Hello Bryan,
>>>
>>> Thank you, that was the ticket!
>>>
>>> Mark, I was able to run the diagnostics for a processor that's
>>> downstream from the connection where the flowfile appears to be "stuck".
>>> I'm not sure what processor is the source of this particular "stuck"
>>> flowfile since we

Re: flowfiles stuck in load balanced queue; nifi 1.8

2018-12-28 Thread Mark Payne
,
"activeQueueByteCount": 0,
"swapFlowFileCount": 0,
"swapByteCount": 0,
"swapFiles": 0,
"inFlightFlowFileCount": 0,
"inFlightByteCount": 0,
"nodeIdentifier": "nifi3-1:9443"
  }
]
  }

It indicates that node nifi1-1 is showing a queue size of 1 FlowFile, 975890 
bytes. But it also shows that the FlowFile is not in the "local partition" or 
either of the two "remote partitions." So that leaves us with two possibilities:

1) The Queue's Count is wrong, because it somehow did not get decremented 
(perhaps a threading bug?)

Or

2) The Count is correct and the FlowFile exists, but somehow the reference to 
the FlowFile was lost by the FlowFile Queue (again, perhaps a threading bug?)

If possible, I would like for you to stop both the source and destination of that
connection and then restart node nifi1-1. Once it has restarted, check if the 
FlowFile is still in the connection. That will tell us which of the two above 
scenarios is taking place. If the FlowFile exists upon restart, then the Queue 
somehow lost the handle to it. If the FlowFile does not exist in the connection 
upon restart (I'm guessing this will be the case), then it indicates that 
somehow the count is incorrect.

Many thanks
-Mark


From: dan young <danoyo...@gmail.com>
Sent: Wednesday, December 26, 2018 9:18 AM
To: NiFi Mailing List
Subject: Re: flowfiles stuck in load balanced queue; nifi 1.8

Heya Mark,

So I added a Log Attribute Processor and routed the connection that had the 
"stuck" flowfile to it.   I ran a get diagnostics to the Log Attribute 
processor before I started it, and then ran another diagnostics after I started 
it.  The flowfile stayed in the load balanced connection/queue.  I've attached 
both files.  Please LMK if this helps.

Regards,

Dano


On Mon, Dec 24, 2018 at 10:35 AM Mark Payne <marka...@hotmail.com> wrote:
Dan,

You would want to get diagnostics for the processor that is the 
source/destination of the connection - not the FlowFile. But if your connection
is connecting 2 process groups then both its source and destination are Ports, 
not Processors. So the easiest thing to do would be to drop a “dummy processor” 
into the flow between the 2 groups, drag the Connection to that processor, get 
diagnostics for the processor, and then drag it back to where it was. Does that 
make sense? Sorry for the hassle.

Thanks
-Mark

Sent from my iPhone

On Dec 24, 2018, at 11:40 AM, dan young <danoyo...@gmail.com> wrote:

Hello Bryan,

Thank you, that was the ticket!

Mark, I was able to run the diagnostics for a processor that's downstream from 
the connection where the flowfile appears to be "stuck". I'm not sure what 
processor is the source of this particular "stuck" flowfile since we have a 
number of upstream processor groups (PG) that feed into a funnel.  This funnel 
is then connected to a downstream PG. It is this connection between the funnel 
and a downstream PG where the flowfile is stuck. I might reduce the upstream 
"load balanced connections" between the various PGs to just one so I can narrow 
where we need to run diagnostics. If this isn't the correct processor to be
gathering diagnostics, please LMK where else I should look or other diagnostics 
to run...

I've also attached the output (nifi-api/connections/{id}) of the get for that 
connection where the flowfile appears to be "stuck"

On Sun, Dec 23, 2018 at 8:36 PM Bryan Bende <bbe...@gmail.com> wrote:
You’ll need to get the token that was obtained when you logged in to the SSO 
and submit it on the curl requests the same way the UI is doing on all requests.

You should be able to open Chrome dev tools while in the UI and look at
one of the request/responses and copy the value of the 'Authorization' header,
which should be in the form 'Bearer <token>'.

Then send this on the curl command by specifying a header of -H 'Authorization: 
Bearer <token>'

On Sun, Dec 23, 2018 at 6:28 PM dan young <danoyo...@gmail.com> wrote:
I forgot to mention that we're using the OpenID Connect SSO. Is there a way
to run these commands via curl when we have the cluster configured this way? If
so, would anyone be able to provide some insight/examples?

Happy Holidays!

Regards,

Dano

On Sun, Dec 23, 2018 at 3:53 PM dan young <danoyo...@gmail.com> wrote:
This is what I'm seeing in the logs when I try to access the 
nifi-api/flow/about for example...

2018-12-23 22:51:45,579 INFO [NiFi Web Server-24201] 
o.a.n.w.s.NiFiAuthenticationFilter Authentication success for 
d...@looker.com
2018-1

Re: flowfiles stuck in load balanced queue; nifi 1.8

2018-12-26 Thread Mark Payne
you that while I may be the only one from the NiFi side 
who's been engaging on debugging
this, I am far from the only one who cares about it! :) This is a pretty big 
new feature that was added to the latest
release, so understandably there are probably not yet a lot of people who 
understand the code well enough to
debug. I have tried replicating the issue, but have not been successful. I have 
a 3-node cluster that ran for well over
a month without a restart, and I've also tried restarting it every few hours
for a couple of days. It has about 8 different
load-balanced connections, with varying data sizes and volumes. I've not been 
able to get into this situation, though,
unfortunately.

But yes, I think that we've seen this issue arise from each of the two of you 
and one other on the mailing list, so it
is certainly something that we need to nail down ASAP. Unfortunately, debugging 
an issue that involves communication
between multiple nodes is often difficult to fully understand, so it may not be 
a trivial task to debug.

Dano, if you are able to get to the diagnostics, as Josef mentioned, that is 
likely to be pretty helpful. Off the top of my head,
there are a few possibilities that are coming to mind, as to what kind of bug 
could cause such behavior:

1) Perhaps there really is no flowfile in the queue, but we somehow 
miscalculated the size of the queue. The diagnostics
info would tell us whether or not this is the case. It will look into the 
queues themselves to determine how many FlowFiles are
destined for each node in the cluster, rather than just returning the 
pre-calculated count. Failing that, you could also stop the source
and destination of the queue, restart the node, and then see if the FlowFile is 
entirely gone from the queue on restart, or if it remains
in the queue. If it is gone, then that likely indicates that the pre-computed 
count is somehow off.

2) We are having trouble communicating with the node that we are trying to send 
the data to. I would expect some sort of ERROR
log messages in this case.

3) The node is properly sending the FlowFile to where it needs to go, but for 
some reason the receiving node is then re-distributing it
to another node in the cluster, which then re-distributes it again, so that it 
never ends in the correct destination. I think this is unlikely
and would be easy to verify by looking at the "Summary" table [1] and doing the 
"Cluster view" and constantly refreshing for a few seconds
to see if the queue changes on any node in the cluster.

4) For some entirely unknown reason, there exists a bug that causes the node to 
simply see the FlowFile and just skip over it
entirely.

For additional logging, we can enable DEBUG logging on
org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClientTask:
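
A minimal sketch of what that logger entry looks like in conf/logback.xml
(assuming the standard layout; the shipped logback.xml normally has
change-scanning enabled, so a restart should not be required):

  <!-- DEBUG logging for the async load-balance client task -->
  <logger name="org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClientTask" level="DEBUG"/>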


With that DEBUG logging turned on, it may or may not generate a lot of DEBUG 
logs. If it does not, then that in and of itself tells us something.
If it does generate a lot of DEBUG logs, then it would be good to see what it's 
dumping out in the logs.

And a big Thank You to you guys for staying engaged on this and your 
willingness to dig in!

Thanks!
-Mark

[1] https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Summary_Page


On Dec 19, 2018, at 2:18 AM, <josef.zahn...@swisscom.com> wrote:

Hi Dano

Seems that the problem has been seen by a few people, but until now nobody from
the NiFi team really cared about it – except Mark Payne. He mentioned the part
below with the diagnostics; however, in my case this doesn’t even work (I tried it
on a standalone unsecured cluster as well as on a secured cluster)! Can you get the
diagnostics on your cluster?

I guess in the end we have to open a Jira ticket to narrow it down.

Cheers Josef


One thing that I would recommend, to get more information, is to go to the REST 
endpoint (in your browser is fine)
/nifi-api/processors/<processor-id>/diagnostics

Where <processor-id> is the UUID of either the source or the destination of the
Connection in question. This gives us
a lot of information about the internals of Connection. The easiest way to get 
that Processor ID is to just click on the
processor on the canvas and look at the Operate palette on the left-hand side. 
You can copy & paste from there. If you
then send the diagnostics information to us, we can analyze that to help 
understand what's happening.
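
As a rough curl equivalent of hitting that endpoint in the browser (a sketch;
the node address and processor UUID are placeholders, and the Authorization
header only applies on a secured cluster):

  curl -k -H 'Authorization: Bearer <token>' \
    'https://nifi1-1:9443/nifi-api/processors/<processor-id>/diagnostics'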



From: dan young <danoyo...@gmail.com>
Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
Date: Wednesday, 19 December 2018 at 05:28
To: NiFi Mailing List <users@nifi.apache.org>
Subject: flowfiles stuck in load balanced queue; nifi 1.8

We're seeing this more frequently where flowfiles seem to be stuck in a load 
balanced queue.  The only resolution is to disconnect the node and then restart 
that node.  After this, the flowfile disappears from the queue.  Any ideas on 
what might be going on here or what additional information I might be able to 
provide to debug this?

I've attached another thread dump and some screen shots


Regards,

Dano

--
Sent from Gmail Mobile





Re: flowfiles stuck in load balanced queue; nifi 1.8

2018-12-26 Thread dan young
Hello Mark,

I just stopped the destination processor, and then disconnected the node in
question (nifi1-1). Once I disconnected the node, the flow file in the load
balance connection disappeared from the queue.  After that, I reconnected
the node (with the downstream processor disconnected) and once the node
successfully rejoined the cluster, the flowfile showed up in the queue
again. After this, I started the connected downstream processor, but the
flowfile stays in the queue. The only way to clear the queue is if I
actually restart the node.  If I disconnect the node, and then restart that
node, the flowfile is no longer present in the queue.

Regards,

Dano


On Wed, Dec 26, 2018 at 6:13 PM Mark Payne  wrote:

> Ok, I just wanted to confirm that when you said “once it rejoins the
> cluster that flow file is gone” that you mean “the flowfile did not exist
> on the system” and NOT “the queue size was 0 by the time that I looked at
> the UI.” I.e., is it possible that the FlowFile did exist, was restored,
> and then was processed before you looked at the UI? Or the FlowFile
> definitely did not exist after the node was restarted? That’s why I was
> suggesting that you restart with the connection’s source and destination
> stopped. Just to make sure that the FlowFile didn’t just get processed
> quickly on restart.
>
> Sent from my iPhone
>
> On Dec 26, 2018, at 7:55 PM, dan young  wrote:
>
> Heya Mark,
>
> If we restart the node, that "stuck" flowfile will disappear. This is the
> only way so far to clear out the flowfile. I usually disconnect the node,
> then once it's disconnected I restart nifi, and then once it rejoins the
> cluster that flow file is gone. If we try to empty the queue, it will just
> say that there no flow files in the queue.
>
>
> On Wed, Dec 26, 2018, 5:22 PM Mark Payne wrote:
>> Hey Dan,
>>
>> Thanks, this is super useful! So, the following section is the damning
>> part of the JSON:
>>
>>   {
>> "totalFlowFileCount": 1,
>> "totalByteCount": 975890,
>> "nodeIdentifier": "nifi1-1:9443",
>> "localQueuePartition": {
>>   "totalFlowFileCount": 0,
>>   "totalByteCount": 0,
>>   "activeQueueFlowFileCount": 0,
>>   "activeQueueByteCount": 0,
>>   "swapFlowFileCount": 0,
>>   "swapByteCount": 0,
>>   "swapFiles": 0,
>>   "inFlightFlowFileCount": 0,
>>   "inFlightByteCount": 0,
>>   "allActiveQueueFlowFilesPenalized": false,
>>   "anyActiveQueueFlowFilesPenalized": false
>> },
>> "remoteQueuePartitions": [
>>   {
>> "totalFlowFileCount": 0,
>> "totalByteCount": 0,
>> "activeQueueFlowFileCount": 0,
>> "activeQueueByteCount": 0,
>> "swapFlowFileCount": 0,
>> "swapByteCount": 0,
>> "swapFiles": 0,
>> "inFlightFlowFileCount": 0,
>> "inFlightByteCount": 0,
>> "nodeIdentifier": "nifi2-1:9443"
>>   },
>>   {
>> "totalFlowFileCount": 0,
>> "totalByteCount": 0,
>> "activeQueueFlowFileCount": 0,
>> "activeQueueByteCount": 0,
>> "swapFlowFileCount": 0,
>> "swapByteCount": 0,
>> "swapFiles": 0,
>> "inFlightFlowFileCount": 0,
>> "inFlightByteCount": 0,
>> "nodeIdentifier": "nifi3-1:9443"
>>   }
>> ]
>>   }
>>
>> It indicates that node nifi1-1 is showing a queue size of 1 FlowFile, 975890
>> bytes. But it also shows that the FlowFile is not in the "local partition"
>> or either of the two "remote partitions." So that leaves us with two
>> possibilities:
>>
>> 1) The Queue's Count is wrong, because it somehow did not get decremented
>> (perhaps a threading bug?)
>>
>> Or
>>
>> 2) The Count is correct and the FlowFile exists, but somehow the
>> reference to the FlowFile was lost by the FlowFile Queue (again, perhaps a threading bug?)

Re: flowfiles stuck in load balanced queue; nifi 1.8

2018-12-26 Thread Mark Payne
Ok, I just wanted to confirm that when you said “once it rejoins the cluster 
that flow file is gone” that you mean “the flowfile did not exist on the 
system” and NOT “the queue size was 0 by the time that I looked at the UI.” 
I.e., is it possible that the FlowFile did exist, was restored, and then was 
processed before you looked at the UI? Or the FlowFile definitely did not exist 
after the node was restarted? That’s why I was suggesting that you restart with 
the connection’s source and destination stopped. Just to make sure that the 
FlowFile didn’t just get processed quickly on restart.

Sent from my iPhone

On Dec 26, 2018, at 7:55 PM, dan young <danoyo...@gmail.com> wrote:

Heya Mark,

If we restart the node, that "stuck" flowfile will disappear. This is the only 
way so far to clear out the flowfile. I usually disconnect the node, then once 
it's disconnected I restart nifi, and then once it rejoins the cluster that 
flow file is gone. If we try to empty the queue, it will just say that there are no
flow files in the queue.


On Wed, Dec 26, 2018, 5:22 PM Mark Payne <marka...@hotmail.com> wrote:
Hey Dan,

Thanks, this is super useful! So, the following section is the damning part of 
the JSON:

  {
"totalFlowFileCount": 1,
"totalByteCount": 975890,
"nodeIdentifier": "nifi1-1:9443",
"localQueuePartition": {
  "totalFlowFileCount": 0,
  "totalByteCount": 0,
  "activeQueueFlowFileCount": 0,
  "activeQueueByteCount": 0,
  "swapFlowFileCount": 0,
  "swapByteCount": 0,
  "swapFiles": 0,
  "inFlightFlowFileCount": 0,
  "inFlightByteCount": 0,
  "allActiveQueueFlowFilesPenalized": false,
  "anyActiveQueueFlowFilesPenalized": false
},
"remoteQueuePartitions": [
  {
"totalFlowFileCount": 0,
"totalByteCount": 0,
"activeQueueFlowFileCount": 0,
"activeQueueByteCount": 0,
"swapFlowFileCount": 0,
"swapByteCount": 0,
"swapFiles": 0,
"inFlightFlowFileCount": 0,
"inFlightByteCount": 0,
"nodeIdentifier": "nifi2-1:9443"
  },
  {
"totalFlowFileCount": 0,
"totalByteCount": 0,
"activeQueueFlowFileCount": 0,
"activeQueueByteCount": 0,
"swapFlowFileCount": 0,
"swapByteCount": 0,
"swapFiles": 0,
"inFlightFlowFileCount": 0,
"inFlightByteCount": 0,
"nodeIdentifier": "nifi3-1:9443"
  }
]
  }

It indicates that node nifi1-1 is showing a queue size of 1 FlowFile, 975890 
bytes. But it also shows that the FlowFile is not in the "local partition" or 
either of the two "remote partitions." So that leaves us with two possibilities:

1) The Queue's Count is wrong, because it somehow did not get decremented 
(perhaps a threading bug?)

Or

2) The Count is correct and the FlowFile exists, but somehow the reference to 
the FlowFile was lost by the FlowFile Queue (again, perhaps a threading bug?)

If possible, I would like for you to stop both the source and destination of that
connection and then restart node nifi1-1. Once it has restarted, check if the 
FlowFile is still in the connection. That will tell us which of the two above 
scenarios is taking place. If the FlowFile exists upon restart, then the Queue 
somehow lost the handle to it. If the FlowFile does not exist in the connection 
upon restart (I'm guessing this will be the case), then it indicates that 
somehow the count is incorrect.
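To make that mismatch easy to spot, here is a small sketch that sums the partition counts for each per-node entry and compares them against the reported total; it assumes the entries have been saved to a local file (the diagnostics.json name is a placeholder) in the shape of the JSON fragment above.

import json

def find_stuck_counts(node_snapshots):
    # Each entry is assumed to look like the per-node JSON fragment above:
    # totalFlowFileCount, localQueuePartition, remoteQueuePartitions.
    for node in node_snapshots:
        local = node["localQueuePartition"]["totalFlowFileCount"]
        remote = sum(p["totalFlowFileCount"] for p in node["remoteQueuePartitions"])
        total = node["totalFlowFileCount"]
        if total != local + remote:
            print(f"{node['nodeIdentifier']}: queue reports {total} FlowFile(s) "
                  f"but the partitions only account for {local + remote}")

with open("diagnostics.json") as f:  # placeholder path for the saved snapshots
    data = json.load(f)
    find_stuck_counts(data if isinstance(data, list) else [data])

For the nifi1-1 snapshot above, this reports a total of 1 FlowFile against partitions that account for 0, which is exactly the symptom being discussed.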

Many thanks
-Mark


From: dan young <danoyo...@gmail.com>
Sent: Wednesday, December 26, 2018 9:18 AM
To: NiFi Mailing List
Subject: Re: flowfiles stuck in load balanced queue; nifi 1.8

Heya Mark,

So I added a Log Attribute Processor and routed the connection that had the 
"stuck" flowfile to it.   I ran a get diagnostics to the Log Attribute 
processor before I started it, and then ran another diagnostics after I started 
it.  The flowfile stayed in the load balanced connection/queue.  I've attached 
both files.  Please LMK if this helps.

Regards,

Dano


On Mon, Dec 24, 2018 at 10:35 AM Mark Payne <marka...@hotmail.com> wrote:
Dan,


Re: flowfiles stuck in load balanced queue; nifi 1.8

2018-12-26 Thread dan young
Heya Mark,

If we restart the node, that "stuck" flowfile will disappear. This is the
only way so far to clear out the flowfile. I usually disconnect the node,
then once it's disconnected I restart nifi, and then once it rejoins the
cluster that flow file is gone. If we try to empty the queue, it will just
say that there are no flow files in the queue.
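For what it's worth, the same "empty queue" attempt can be made against the REST API, which is what the UI does under the covers; below is a minimal sketch, assuming an unsecured node and placeholder host and connection UUID values. On one of these stuck connections it would be expected to finish while reporting that nothing was actually dropped.

import json
import time
import urllib.request

NIFI_URL = "http://nifi1-1:8080"                         # placeholder host/port
CONNECTION_ID = "00000000-0000-0000-0000-000000000000"   # UUID of the load-balanced connection

# Ask NiFi to empty the queue; this mirrors the "Empty queue" action in the UI.
create = urllib.request.Request(
    f"{NIFI_URL}/nifi-api/flowfile-queues/{CONNECTION_ID}/drop-requests", method="POST")
with urllib.request.urlopen(create) as resp:
    drop_request = json.load(resp)["dropRequest"]

# Poll until the drop request reports that it has finished, then dump the result,
# which includes how many FlowFiles were actually dropped.
while not drop_request.get("finished"):
    time.sleep(1)
    status_url = (f"{NIFI_URL}/nifi-api/flowfile-queues/{CONNECTION_ID}"
                  f"/drop-requests/{drop_request['id']}")
    with urllib.request.urlopen(status_url) as resp:
        drop_request = json.load(resp)["dropRequest"]

print(json.dumps(drop_request, indent=2))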


On Wed, Dec 26, 2018, 5:22 PM Mark Payne wrote:
> Hey Dan,
>
> Thanks, this is super useful! So, the following section is the damning
> part of the JSON:
>
>   {
> "totalFlowFileCount": 1,
> "totalByteCount": 975890,
> "nodeIdentifier": "nifi1-1:9443",
> "localQueuePartition": {
>   "totalFlowFileCount": 0,
>   "totalByteCount": 0,
>   "activeQueueFlowFileCount": 0,
>   "activeQueueByteCount": 0,
>   "swapFlowFileCount": 0,
>   "swapByteCount": 0,
>   "swapFiles": 0,
>   "inFlightFlowFileCount": 0,
>   "inFlightByteCount": 0,
>   "allActiveQueueFlowFilesPenalized": false,
>   "anyActiveQueueFlowFilesPenalized": false
> },
> "remoteQueuePartitions": [
>   {
> "totalFlowFileCount": 0,
> "totalByteCount": 0,
> "activeQueueFlowFileCount": 0,
> "activeQueueByteCount": 0,
> "swapFlowFileCount": 0,
> "swapByteCount": 0,
> "swapFiles": 0,
> "inFlightFlowFileCount": 0,
> "inFlightByteCount": 0,
> "nodeIdentifier": "nifi2-1:9443"
>   },
>   {
> "totalFlowFileCount": 0,
> "totalByteCount": 0,
> "activeQueueFlowFileCount": 0,
> "activeQueueByteCount": 0,
> "swapFlowFileCount": 0,
> "swapByteCount": 0,
> "swapFiles": 0,
> "inFlightFlowFileCount": 0,
> "inFlightByteCount": 0,
> "nodeIdentifier": "nifi3-1:9443"
>   }
> ]
>   }
>
> It indicates that node nifi1-1 is showing a queue size of 1 FlowFile, 975890
> bytes. But it also shows that the FlowFile is not in the "local partition"
> or either of the two "remote partitions." So that leaves us with two
> possibilities:
>
> 1) The Queue's Count is wrong, because it somehow did not get decremented
> (perhaps a threading bug?)
>
> Or
>
> 2) The Count is correct and the FlowFile exists, but somehow the reference
> to the FlowFile was lost by the FlowFile Queue (again, perhaps a threading
> bug?)
>
> If possible, I would like for you to stop both the source and destination of
> that connection and then restart node nifi1-1. Once it has restarted, check
> if the FlowFile is still in the connection. That will tell us which of the
> two above scenarios is taking place. If the FlowFile exists upon restart,
> then the Queue somehow lost the handle to it. If the FlowFile does not
> exist in the connection upon restart (I'm guessing this will be the case),
> then it indicates that somehow the count is incorrect.
>
> Many thanks
> -Mark
>
> --
> *From:* dan young 
> *Sent:* Wednesday, December 26, 2018 9:18 AM
> *To:* NiFi Mailing List
> *Subject:* Re: flowfiles stuck in load balanced queue; nifi 1.8
>
> Heya Mark,
>
> So I added a Log Attribute Processor and routed the connection that had
> the "stuck" flowfile to it.   I ran a get diagnostics to the Log Attribute
> processor before I started it, and then ran another diagnostics after I
> started it.  The flowfile stayed in the load balanced connection/queue.
> I've attached both files.  Please LMK if this helps.
>
> Regards,
>
> Dano
>
>
> On Mon, Dec 24, 2018 at 10:35 AM Mark Payne  wrote:
>
> Dan,
>
> You would want to get diagnostics for the processor that is the
> source/destination of the connection - not the FlowFile. But if your
> connection is connecting 2 process groups then both its source and
> destination are Ports, not Processors. So the easiest thing to do would be
> to drop a “dummy processor” into the flow between the 2 groups, drag the Connection to that processor, get diagnostics for the processor, and then drag it back to where it was.

Re: flowfiles stuck in load balanced queue; nifi 1.8

2018-12-26 Thread Mark Payne
Hey Dan,

Thanks, this is super useful! So, the following section is the damning part of 
the JSON:

  {
"totalFlowFileCount": 1,
"totalByteCount": 975890,
"nodeIdentifier": "nifi1-1:9443",
"localQueuePartition": {
  "totalFlowFileCount": 0,
  "totalByteCount": 0,
  "activeQueueFlowFileCount": 0,
  "activeQueueByteCount": 0,
  "swapFlowFileCount": 0,
  "swapByteCount": 0,
  "swapFiles": 0,
  "inFlightFlowFileCount": 0,
  "inFlightByteCount": 0,
  "allActiveQueueFlowFilesPenalized": false,
  "anyActiveQueueFlowFilesPenalized": false
},
"remoteQueuePartitions": [
  {
"totalFlowFileCount": 0,
"totalByteCount": 0,
"activeQueueFlowFileCount": 0,
"activeQueueByteCount": 0,
"swapFlowFileCount": 0,
"swapByteCount": 0,
"swapFiles": 0,
"inFlightFlowFileCount": 0,
"inFlightByteCount": 0,
"nodeIdentifier": "nifi2-1:9443"
  },
  {
"totalFlowFileCount": 0,
"totalByteCount": 0,
"activeQueueFlowFileCount": 0,
"activeQueueByteCount": 0,
"swapFlowFileCount": 0,
"swapByteCount": 0,
"swapFiles": 0,
"inFlightFlowFileCount": 0,
"inFlightByteCount": 0,
"nodeIdentifier": "nifi3-1:9443"
  }
]
  }

It indicates that node nifi1-1 is showing a queue size of 1 FlowFile, 975890 
bytes. But it also shows that the FlowFile is not in the "local partition" or 
either of the two "remote partitions." So that leaves us with two possibilities:

1) The Queue's Count is wrong, because it somehow did not get decremented 
(perhaps a threading bug?)

Or

2) The Count is correct and the FlowFile exists, but somehow the reference to 
the FlowFile was lost by the FlowFile Queue (again, perhaps a threading bug?)

If possible, I would like for you to stop both the source and destination of that
connection and then restart node nifi1-1. Once it has restarted, check if the 
FlowFile is still in the connection. That will tell us which of the two above 
scenarios is taking place. If the FlowFile exists upon restart, then the Queue 
somehow lost the handle to it. If the FlowFile does not exist in the connection 
upon restart (I'm guessing this will be the case), then it indicates that 
somehow the count is incorrect.

Many thanks
-Mark


From: dan young 
Sent: Wednesday, December 26, 2018 9:18 AM
To: NiFi Mailing List
Subject: Re: flowfiles stuck in load balanced queue; nifi 1.8

Heya Mark,

So I added a Log Attribute Processor and routed the connection that had the 
"stuck" flowfile to it.   I ran a get diagnostics to the Log Attribute 
processor before I started it, and then ran another diagnostics after I started 
it.  The flowfile stayed in the load balanced connection/queue.  I've attached 
both files.  Please LMK if this helps.

Regards,

Dano


On Mon, Dec 24, 2018 at 10:35 AM Mark Payne <marka...@hotmail.com> wrote:
Dan,

You would want to get diagnostics for the processor that is the 
source/destination of the connection - not the FlowFile. But if your connection
is connecting 2 process groups then both its source and destination are Ports, 
not Processors. So the easiest thing to do would be to drop a “dummy processor” 
into the flow between the 2 groups, drag the Connection to that processor, get 
diagnostics for the processor, and then drag it back to where it was. Does that 
make sense? Sorry for the hassle.

Thanks
-Mark

Sent from my iPhone

On Dec 24, 2018, at 11:40 AM, dan young <danoyo...@gmail.com> wrote:

Hello Bryan,

Thank you, that was the ticket!

Mark, I was able to run the diagnostics for a processor that's downstream from 
the connection where the flowfile appears to be "stuck". I'm not sure what 
processor is the source of this particular "stuck" flowfile since we have a 
number of upstream processor groups (PG) that feed into a funnel.  This funnel 
is then connected to a downstream PG. It is this connection between the funnel 
and a downstream PG where the flowfile is stuck. I might reduce the upstream 
"load balanced conn

Re: flowfiles stuck in load balanced queue; nifi 1.8

2018-12-24 Thread Mark Payne
between multiple nodes is often difficult to fully understand, so it may not be 
a trivial task to debug.

Dano, if you are able to get to the diagnostics, as Josef mentioned, that is 
likely to be pretty helpful. Off the top of my head,
there are a few possibilities that are coming to mind, as to what kind of bug 
could cause such behavior:

1) Perhaps there really is no flowfile in the queue, but we somehow 
miscalculated the size of the queue. The diagnostics
info would tell us whether or not this is the case. It will look into the 
queues themselves to determine how many FlowFiles are
destined for each node in the cluster, rather than just returning the 
pre-calculated count. Failing that, you could also stop the source
and destination of the queue, restart the node, and then see if the FlowFile is 
entirely gone from the queue on restart, or if it remains
in the queue. If it is gone, then that likely indicates that the pre-computed 
count is somehow off.

2) We are having trouble communicating with the node that we are trying to send 
the data to. I would expect some sort of ERROR
log messages in this case.

3) The node is properly sending the FlowFile to where it needs to go, but for 
some reason the receiving node is then re-distributing it
to another node in the cluster, which then re-distributes it again, so that it 
never ends in the correct destination. I think this is unlikely
and would be easy to verify by looking at the "Summary" table [1] and doing the 
"Cluster view" and constantly refreshing for a few seconds
to see if the queue changes on any node in the cluster.

4) For some entirely unknown reason, there exists a bug that causes the node to 
simply see the FlowFile and just skip over it
entirely.

For additional logging, we can enable DEBUG logging on
org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClientTask:
<logger name="org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClientTask" level="DEBUG" />


With that DEBUG logging turned on, it may or may not generate a lot of DEBUG 
logs. If it does not, then that in and of itself tells us something.
If it does generate a lot of DEBUG logs, then it would be good to see what it's 
dumping out in the logs.
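Once that logger is in place, here is a short sketch for pulling just those DEBUG lines back out of the application log, assuming the default logs/nifi-app.log location under the NiFi install directory:

from pathlib import Path

LOG_FILE = Path("logs/nifi-app.log")   # assumed default location; adjust to your install

# Print only the load-balance client task DEBUG lines so they are easy to eyeball.
with LOG_FILE.open(errors="replace") as log:
    for line in log:
        if "NioAsyncLoadBalanceClientTask" in line and "DEBUG" in line:
            print(line.rstrip())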

And a big Thank You to you guys for staying engaged on this and your 
willingness to dig in!

Thanks!
-Mark

[1] https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Summary_Page


On Dec 19, 2018, at 2:18 AM, <josef.zahn...@swisscom.com> wrote:

Hi Dano

Seems that the problem has been seen by a few people but until now nobody from 
NiFi team really cared about it – except Mark Payne. He mentioned the part 
below with the diagnostics, however in my case this doesn’t even work (tried it 
on standalone unsecured cluster as well as on secured cluster)! Can you get the 
diagnostics on your cluster?

I guess at the end we have to open a Jira ticket to narrow it down.

Cheers Josef


One thing that I would recommend, to get more information, is to go to the REST 
endpoint (in your browser is fine)
/nifi-api/processors/<processor-id>/diagnostics

Where <processor-id> is the UUID of either the source or the destination of the
Connection in question. This gives us
a lot of information about the internals of Connection. The easiest way to get 
that Processor ID is to just click on the
processor on the canvas and look at the Operate palette on the left-hand side. 
You can copy & paste from there. If you
then send the diagnostics information to us, we can analyze that to help 
understand what's happening.



From: dan young <danoyo...@gmail.com>
Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
Date: Wednesday, 19 December 2018 at 05:28
To: NiFi Mailing List <users@nifi.apache.org>
Subject: flowfiles stuck in load balanced queue; nifi 1.8

We're seeing this more frequently where flowfiles seem to be stuck in a load 
balanced queue.  The only resolution is to disconnect the node and then restart 
that node.  After this, the flowfile disappears from the queue.  Any ideas on 
what might be going on here or what additional information I might be able to 
provide to debug this?

I've attached another thread dump and some screen shots


Regards,

Dano

--
Sent from Gmail Mobile





Re: flowfiles stuck in load balanced queue; nifi 1.8

2018-12-23 Thread Bryan Bende
>>>> FlowFile is entirely gone from the queue on restart, or if it remains
>>>> in the queue. If it is gone, then that likely indicates that the
>>>> pre-computed count is somehow off.
>>>>
>>>> 2) We are having trouble communicating with the node that we are trying
>>>> to send the data to. I would expect some sort of ERROR
>>>> log messages in this case.
>>>>
>>>> 3) The node is properly sending the FlowFile to where it needs to go,
>>>> but for some reason the receiving node is then re-distributing it
>>>> to another node in the cluster, which then re-distributes it again, so
>>>> that it never ends in the correct destination. I think this is unlikely
>>>> and would be easy to verify by looking at the "Summary" table [1] and
>>>> doing the "Cluster view" and constantly refreshing for a few seconds
>>>> to see if the queue changes on any node in the cluster.
>>>>
>>>> 4) For some entirely unknown reason, there exists a bug that causes the
>>>> node to simply see the FlowFile and just skip over it
>>>> entirely.
>>>>
>>>> For additional logging, we can enable DEBUG logging on
>>>> org.apache.nifi.controller.queue.clustered.client.async.nio.
>>>> NioAsyncLoadBalanceClientTask:
>>>> <logger name="org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClientTask" level="DEBUG" />
>>>>
>>>> With that DEBUG logging turned on, it may or may not generate a lot of
>>>> DEBUG logs. If it does not, then that in and of itself tells us something.
>>>> If it does generate a lot of DEBUG logs, then it would be good to see
>>>> what it's dumping out in the logs.
>>>>
>>>> And a big Thank You to you guys for staying engaged on this and your
>>>> willingness to dig in!
>>>>
>>>> Thanks!
>>>> -Mark
>>>>
>>>> [1]
>>>> https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Summary_Page
>>>>
>>>>
>>>> On Dec 19, 2018, at 2:18 AM,  <
>>>> josef.zahn...@swisscom.com> wrote:
>>>>
>>>> Hi Dano
>>>>
>>>> Seems that the problem has been seen by a few people but until now
>>>> nobody from NiFi team really cared about it – except Mark Payne. He
>>>> mentioned the part below with the diagnostics, however in my case this
>>>> doesn’t even work (tried it on standalone unsecured cluster as well as on
>>>> secured cluster)! Can you get the diagnostics on your cluster?
>>>>
>>>> I guess at the end we have to open a Jira ticket to narrow it down.
>>>>
>>>> Cheers Josef
>>>>
>>>>
>>>> One thing that I would recommend, to get more information, is to go to
>>>> the REST endpoint (in your browser is fine)
>>>> /nifi-api/processors/<processor-id>/diagnostics
>>>>
>>>> Where <processor-id> is the UUID of either the source or the
>>>> destination of the Connection in question. This gives us
>>>> a lot of information about the internals of Connection. The easiest way
>>>> to get that Processor ID is to just click on the
>>>> processor on the canvas and look at the Operate palette on the
>>>> left-hand side. You can copy & paste from there. If you
>>>> then send the diagnostics information to us, we can analyze that to
>>>> help understand what's happening.
>>>>
>>>>
>>>>
>>>> *From: *dan young 
>>>> *Reply-To: *"users@nifi.apache.org" 
>>>> *Date: *Wednesday, 19 December 2018 at 05:28
>>>> *To: *NiFi Mailing List 
>>>> *Subject: *flowfiles stuck in load balanced queue; nifi 1.8
>>>>
>>>> We're seeing this more frequently where flowfiles seem to be stuck in a
>>>> load balanced queue.  The only resolution is to disconnect the node and
>>>> then restart that node.  After this, the flowfile disappears from the
>>>> queue.  Any ideas on what might be going on here or what additional
>>>> information I might be able to provide to debug this?
>>>>
>>>> I've attached another thread dump and some screen shots
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Dano
>>>>
>>>>
>>>> --
Sent from Gmail Mobile


Re: flowfiles stuck in load balanced queue; nifi 1.8

2018-12-23 Thread dan young
>>> doing the "Cluster view" and constantly refreshing for a few seconds
>>> to see if the queue changes on any node in the cluster.
>>>
>>> 4) For some entirely unknown reason, there exists a bug that causes the
>>> node to simply see the FlowFile and just skip over it
>>> entirely.
>>>
>>> For additional logging, we can enable DEBUG logging on
>>> org.apache.nifi.controller.queue.clustered.client.async.nio.
>>> NioAsyncLoadBalanceClientTask:
>>> <logger name="org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClientTask" level="DEBUG" />
>>>
>>> With that DEBUG logging turned on, it may or may not generate a lot of
>>> DEBUG logs. If it does not, then that in and of itself tells us something.
>>> If it does generate a lot of DEBUG logs, then it would be good to see
>>> what it's dumping out in the logs.
>>>
>>> And a big Thank You to you guys for staying engaged on this and your
>>> willingness to dig in!
>>>
>>> Thanks!
>>> -Mark
>>>
>>> [1]
>>> https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Summary_Page
>>>
>>>
>>> On Dec 19, 2018, at 2:18 AM,  <
>>> josef.zahn...@swisscom.com> wrote:
>>>
>>> Hi Dano
>>>
>>> Seems that the problem has been seen by a few people but until now
>>> nobody from NiFi team really cared about it – except Mark Payne. He
>>> mentioned the part below with the diagnostics, however in my case this
>>> doesn’t even work (tried it on standalone unsecured cluster as well as on
>>> secured cluster)! Can you get the diagnostics on your cluster?
>>>
>>> I guess at the end we have to open a Jira ticket to narrow it down.
>>>
>>> Cheers Josef
>>>
>>>
>>> One thing that I would recommend, to get more information, is to go to
>>> the REST endpoint (in your browser is fine)
>>> /nifi-api/processors/<processor-id>/diagnostics
>>>
>>> Where <processor-id> is the UUID of either the source or the destination
>>> of the Connection in question. This gives us
>>> a lot of information about the internals of Connection. The easiest way
>>> to get that Processor ID is to just click on the
>>> processor on the canvas and look at the Operate palette on the left-hand
>>> side. You can copy & paste from there. If you
>>> then send the diagnostics information to us, we can analyze that to help
>>> understand what's happening.
>>>
>>>
>>>
>>> *From: *dan young 
>>> *Reply-To: *"users@nifi.apache.org" 
>>> *Date: *Wednesday, 19 December 2018 at 05:28
>>> *To: *NiFi Mailing List 
>>> *Subject: *flowfiles stuck in load balanced queue; nifi 1.8
>>>
>>> We're seeing this more frequently where flowfiles seem to be stuck in a
>>> load balanced queue.  The only resolution is to disconnect the node and
>>> then restart that node.  After this, the flowfile disappears from the
>>> queue.  Any ideas on what might be going on here or what additional
>>> information I might be able to provide to debug this?
>>>
>>> I've attached another thread dump and some screen shots
>>>
>>>
>>> Regards,
>>>
>>> Dano
>>>
>>>
>>>


Re: flowfiles stuck in load balanced queue; nifi 1.8

2018-12-23 Thread dan young
s not, then that in and of itself tells us something.
>> If it does generate a lot of DEBUG logs, then it would be good to see
>> what it's dumping out in the logs.
>>
>> And a big Thank You to you guys for staying engaged on this and your
>> willingness to dig in!
>>
>> Thanks!
>> -Mark
>>
>> [1]
>> https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Summary_Page
>>
>>
>> On Dec 19, 2018, at 2:18 AM,  <
>> josef.zahn...@swisscom.com> wrote:
>>
>> Hi Dano
>>
>> Seems that the problem has been seen by a few people but until now nobody
>> from NiFi team really cared about it – except Mark Payne. He mentioned the
>> part below with the diagnostics, however in my case this doesn’t even work
>> (tried it on standalone unsecured cluster as well as on secured cluster)!
>> Can you get the diagnostics on your cluster?
>>
>> I guess at the end we have to open a Jira ticket to narrow it down.
>>
>> Cheers Josef
>>
>>
>> One thing that I would recommend, to get more information, is to go to
>> the REST endpoint (in your browser is fine)
>> /nifi-api/processors/<processor-id>/diagnostics
>>
>> Where <processor-id> is the UUID of either the source or the destination
>> of the Connection in question. This gives us
>> a lot of information about the internals of Connection. The easiest way
>> to get that Processor ID is to just click on the
>> processor on the canvas and look at the Operate palette on the left-hand
>> side. You can copy & paste from there. If you
>> then send the diagnostics information to us, we can analyze that to help
>> understand what's happening.
>>
>>
>>
>> *From: *dan young 
>> *Reply-To: *"users@nifi.apache.org" 
>> *Date: *Wednesday, 19 December 2018 at 05:28
>> *To: *NiFi Mailing List 
>> *Subject: *flowfiles stuck in load balanced queue; nifi 1.8
>>
>> We're seeing this more frequently where flowfiles seem to be stuck in a
>> load balanced queue.  The only resolution is to disconnect the node and
>> then restart that node.  After this, the flowfile disappears from the
>> queue.  Any ideas on what might be going on here or what additional
>> information I might be able to provide to debug this?
>>
>> I've attached another thread dump and some screen shots
>>
>>
>> Regards,
>>
>> Dano
>>
>>
>>


Re: flowfiles stuck in load balanced queue; nifi 1.8

2018-12-23 Thread dan young
s well as on secured cluster)!
> Can you get the diagnostics on your cluster?
>
> I guess at the end we have to open a Jira ticket to narrow it down.
>
> Cheers Josef
>
>
> One thing that I would recommend, to get more information, is to go to the
> REST endpoint (in your browser is fine)
> /nifi-api/processors/<processor-id>/diagnostics
>
> Where <processor-id> is the UUID of either the source or the destination
> of the Connection in question. This gives us
> a lot of information about the internals of Connection. The easiest way to
> get that Processor ID is to just click on the
> processor on the canvas and look at the Operate palette on the left-hand
> side. You can copy & paste from there. If you
> then send the diagnostics information to us, we can analyze that to help
> understand what's happening.
>
>
>
> *From: *dan young 
> *Reply-To: *"users@nifi.apache.org" 
> *Date: *Wednesday, 19 December 2018 at 05:28
> *To: *NiFi Mailing List 
> *Subject: *flowfiles stuck in load balanced queue; nifi 1.8
>
> We're seeing this more frequently where flowfiles seem to be stuck in a
> load balanced queue.  The only resolution is to disconnect the node and
> then restart that node.  After this, the flowfile disappears from the
> queue.  Any ideas on what might be going on here or what additional
> information I might be able to provide to debug this?
>
> I've attached another thread dump and some screen shots
>
>
> Regards,
>
> Dano
>
>
>


Re: flowfiles stuck in load balanced queue; nifi 1.8

2018-12-19 Thread dan young
user-guide.html#Summary_Page
>>
>>
>> On Dec 19, 2018, at 2:18 AM,  <
>> josef.zahn...@swisscom.com> wrote:
>>
>> Hi Dano
>>
>> Seems that the problem has been seen by a few people but until now nobody
>> from NiFi team really cared about it – except Mark Payne. He mentioned the
>> part below with the diagnostics, however in my case this doesn’t even work
>> (tried it on standalone unsecured cluster as well as on secured cluster)!
>> Can you get the diagnostics on your cluster?
>>
>> I guess at the end we have to open a Jira ticket to narrow it down.
>>
>> Cheers Josef
>>
>>
>> One thing that I would recommend, to get more information, is to go to
>> the REST endpoint (in your browser is fine)
>> /nifi-api/processors/<processor-id>/diagnostics
>>
>> Where <processor-id> is the UUID of either the source or the destination
>> of the Connection in question. This gives us
>> a lot of information about the internals of Connection. The easiest way
>> to get that Processor ID is to just click on the
>> processor on the canvas and look at the Operate palette on the left-hand
>> side. You can copy & paste from there. If you
>> then send the diagnostics information to us, we can analyze that to help
>> understand what's happening.
>>
>>
>>
>> *From: *dan young 
>> *Reply-To: *"users@nifi.apache.org" 
>> *Date: *Wednesday, 19 December 2018 at 05:28
>> *To: *NiFi Mailing List 
>> *Subject: *flowfiles stuck in load balanced queue; nifi 1.8
>>
>> We're seeing this more frequently where flowfiles seem to be stuck in a
>> load balanced queue.  The only resolution is to disconnect the node and
>> then restart that node.  After this, the flowfile disappears from the
>> queue.  Any ideas on what might be going on here or what additional
>> information I might be able to provide to debug this?
>>
>> I've attached another thread dump and some screen shots
>>
>>
>> Regards,
>>
>> Dano
>>
>>
>>
>


Re: flowfiles stuck in load balanced queue; nifi 1.8

2018-12-19 Thread Mark Payne
the source or the destination of the 
Connection in question. This gives us
a lot of information about the internals of Connection. The easiest way to get 
that Processor ID is to just click on the
processor on the canvas and look at the Operate palette on the left-hand side. 
You can copy & paste from there. If you
then send the diagnostics information to us, we can analyze that to help 
understand what's happening.



From: dan young <danoyo...@gmail.com>
Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
Date: Wednesday, 19 December 2018 at 05:28
To: NiFi Mailing List <users@nifi.apache.org>
Subject: flowfiles stuck in load balanced queue; nifi 1.8

We're seeing this more frequently where flowfiles seem to be stuck in a load 
balanced queue.  The only resolution is to disconnect the node and then restart 
that node.  After this, the flowfile disappears from the queue.  Any ideas on 
what might be going on here or what additional information I might be able to 
provide to debug this?

I've attached another thread dump and some screen shots


Regards,

Dano




Re: flowfiles stuck in load balanced queue; nifi 1.8

2018-12-19 Thread dan young
on. This gives us
> a lot of information about the internals of Connection. The easiest way to
> get that Processor ID is to just click on the
> processor on the canvas and look at the Operate palette on the left-hand
> side. You can copy & paste from there. If you
> then send the diagnostics information to us, we can analyze that to help
> understand what's happening.
>
>
>
> *From: *dan young 
> *Reply-To: *"users@nifi.apache.org" 
> *Date: *Wednesday, 19 December 2018 at 05:28
> *To: *NiFi Mailing List 
> *Subject: *flowfiles stuck in load balanced queue; nifi 1.8
>
> We're seeing this more frequently where flowfiles seem to be stuck in a
> load balanced queue.  The only resolution is to disconnect the node and
> then restart that node.  After this, the flowfile disappears from the
> queue.  Any ideas on what might be going on here or what additional
> information I might be able to provide to debug this?
>
> I've attached another thread dump and some screen shots
>
>
> Regards,
>
> Dano
>
>
>


Re: flowfiles stuck in load balanced queue; nifi 1.8

2018-12-19 Thread Boris Tyukin
we were about to start using this feature but I guess we would have to wait
since so many people are having issues with it and there are still no comments
from NiFi developers who implemented it...Thanks for the heads up guys

On Tue, Dec 18, 2018 at 11:27 PM dan young  wrote:

> We're seeing this more frequently where flowfiles seem to be stuck in a
> load balanced queue.  The only resolution is to disconnect the node and
> then restart that node.  After this, the flowfile disappears from the
> queue.  Any ideas on what might be going on here or what additional
> information I might be able to provide to debug this?
>
> I've attached another thread dump and some screen shots
>
>
> Regards,
>
> Dano
>
>


Re: flowfiles stuck in load balanced queue; nifi 1.8

2018-12-19 Thread Mark Payne
Hey Josef, Dano,

Firstly, let me assure you that while I may be the only one from the NiFi side 
who's been engaging on debugging
this, I am far from the only one who cares about it! :) This is a pretty big 
new feature that was added to the latest
release, so understandably there are probably not yet a lot of people who 
understand the code well enough to
debug. I have tried replicating the issue, but have not been successful. I have 
a 3-node cluster that ran for well over
a month without a restart, and i've also tried restarting it every few hours 
for a couple of days. It has about 8 different
load-balanced connections, with varying data sizes and volumes. I've not been 
able to get into this situation, though,
unfortunately.

But yes, I think that we've seen this issue arise from each of the two of you 
and one other on the mailing list, so it
is certainly something that we need to nail down ASAP. Unfortunately, debugging 
an issue that involves communication
between multiple nodes is often difficult to fully understand, so it may not be 
a trivial task to debug.

Dano, if you are able to get to the diagnostics, as Josef mentioned, that is 
likely to be pretty helpful. Off the top of my head,
there are a few possibilities that are coming to mind, as to what kind of bug 
could cause such behavior:

1) Perhaps there really is no flowfile in the queue, but we somehow 
miscalculated the size of the queue. The diagnostics
info would tell us whether or not this is the case. It will look into the 
queues themselves to determine how many FlowFiles are
destined for each node in the cluster, rather than just returning the 
pre-calculated count. Failing that, you could also stop the source
and destination of the queue, restart the node, and then see if the FlowFile is 
entirely gone from the queue on restart, or if it remains
in the queue. If it is gone, then that likely indicates that the pre-computed 
count is somehow off.

2) We are having trouble communicating with the node that we are trying to send 
the data to. I would expect some sort of ERROR
log messages in this case.

3) The node is properly sending the FlowFile to where it needs to go, but for 
some reason the receiving node is then re-distributing it
to another node in the cluster, which then re-distributes it again, so that it 
never ends in the correct destination. I think this is unlikely
and would be easy to verify by looking at the "Summary" table [1] and doing the 
"Cluster view" and constantly refreshing for a few seconds
to see if the queue changes on any node in the cluster.

4) For some entirely unknown reason, there exists a bug that causes the node to 
simply see the FlowFile and just skip over it
entirely.

For additional logging, we can enable DEBUG logging on
org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClientTask:
<logger name="org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClientTask" level="DEBUG" />


With that DEBUG logging turned on, it may or may not generate a lot of DEBUG 
logs. If it does not, then that in and of itself tells us something.
If it does generate a lot of DEBUG logs, then it would be good to see what it's 
dumping out in the logs.

And a big Thank You to you guys for staying engaged on this and your 
willingness to dig in!

Thanks!
-Mark

[1] https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Summary_Page


On Dec 19, 2018, at 2:18 AM, <josef.zahn...@swisscom.com> wrote:

Hi Dano

Seems that the problem has been seen by a few people but until now nobody from 
NiFi team really cared about it – except Mark Payne. He mentioned the part 
below with the diagnostics, however in my case this doesn’t even work (tried it 
on standalone unsecured cluster as well as on secured cluster)! Can you get the 
diagnostics on your cluster?

I guess at the end we have to open a Jira ticket to narrow it down.

Cheers Josef


One thing that I would recommend, to get more information, is to go to the REST 
endpoint (in your browser is fine)
/nifi-api/processors/<processor-id>/diagnostics

Where <processor-id> is the UUID of either the source or the destination of the
Connection in question. This gives us
a lot of information about the internals of Connection. The easiest way to get 
that Processor ID is to just click on the
processor on the canvas and look at the Operate palette on the left-hand side. 
You can copy & paste from there. If you
then send the diagnostics information to us, we can analyze that to help 
understand what's happening.



From: dan young <danoyo...@gmail.com>
Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
Date: Wednesday, 19 December 2018 at 05:28
To: NiFi Mailing List <users@nifi.apache.org>
Subject: flowfiles stuck in load balanced queue; nifi 1.8

We're seeing this more frequently where flowfiles seem to be stuck in a load 
balanced queue.  The only resolution is to disconnect the node and then restart that node.

Re: flowfiles stuck in load balanced queue; nifi 1.8

2018-12-18 Thread Josef.Zahner1
Hi Dano

Seems that the problem has been seen by a few people but until now nobody from 
NiFi team really cared about it – except Mark Payne. He mentioned the part 
below with the diagnostics, however in my case this doesn’t even work (tried it 
on standalone unsecured cluster as well as on secured cluster)! Can you get the 
diagnostics on your cluster?

I guess at the end we have to open a Jira ticket to narrow it down.

Cheers Josef


One thing that I would recommend, to get more information, is to go to the REST 
endpoint (in your browser is fine)
/nifi-api/processors/<processor-id>/diagnostics

Where <processor-id> is the UUID of either the source or the destination of the
Connection in question. This gives us
a lot of information about the internals of Connection. The easiest way to get 
that Processor ID is to just click on the
processor on the canvas and look at the Operate palette on the left-hand side. 
You can copy & paste from there. If you
then send the diagnostics information to us, we can analyze that to help 
understand what's happening.



From: dan young 
Reply-To: "users@nifi.apache.org" 
Date: Wednesday, 19 December 2018 at 05:28
To: NiFi Mailing List 
Subject: flowfiles stuck in load balanced queue; nifi 1.8

We're seeing this more frequently where flowfiles seem to be stuck in a load 
balanced queue.  The only resolution is to disconnect the node and then restart 
that node.  After this, the flowfile disappears from the queue.  Any ideas on 
what might be going on here or what additional information I might be able to 
provide to debug this?

I've attached another thread dump and some screen shots


Regards,

Dano


