RE: Content repository filling up

2017-04-25 Thread Olav Jordens
Hi Mark,

Thanks so much for looking into this. I think you are correct in that my large 
files are being trapped by the smaller ones in the claim. If NiFi is not 
honouring nifi.content.claim.max.flow.files, is it honouring 
nifi.content.claim.max.appendable.size? If it is, then what is the smallest 
value I can set for this parameter that would essentially force one FlowFile 
per claim? My overall workload is on the order of about 1000 large files per 
day (> 1 GB) and about 100 000 small files (< 1 KB) per day.
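
For reference, both settings live in conf/nifi.properties. A minimal sketch with purely illustrative values (not recommendations), noting that per Mark's reply below the max.flow.files cap is not currently honoured:

  # conf/nifi.properties -- content claim tuning (illustrative values only)
  # Stop appending to a content claim once it reaches this size:
  nifi.content.claim.max.appendable.size=1 KB
  # Cap on FlowFiles sharing one claim (ignored until NIFI-3736 is fixed):
  nifi.content.claim.max.flow.files=1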

Cheers,
Olav




From: Mark Payne [mailto:marka...@hotmail.com]
Sent: Wednesday, 26 April 2017 1:59 a.m.
To: users@nifi.apache.org
Subject: Re: Content repository filling up

Olav,

It does look like we are no longer honoring those properties. I created a JIRA 
[1] to address this.
The way that it is working right now, it will write to a single Content Claim 
until that Content Claim exceeds
1 MB. It will then stop 'recycling' the content claim. What you may be seeing, 
though, is that you have say 100
Content Claims each consisting of one small FlowFile and then one really big 
FlowFile. We should certainly
be honoring the "nifi.content.claim.max.flow.files" property but it looks like 
that was lost in a refactoring that
happened. My apologies for that. Thanks for bringing this to our attention!
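
To make that packing behavior concrete, here is a hedged sketch of the recycling decision described above (illustrative only, not NiFi's actual content-repository code); the FlowFile-count check is the one that was lost in the refactoring:

  // Illustrative sketch only -- not NiFi's actual implementation.
  public class ClaimRecyclingSketch {
      static boolean canRecycleClaim(final long claimBytes, final int flowFilesInClaim,
                                     final long maxAppendableBytes, final int maxFlowFilesPerClaim) {
          if (claimBytes >= maxAppendableBytes) {
              return false; // claim is full by size; start a new one
          }
          if (flowFilesInClaim >= maxFlowFilesPerClaim) {
              return false; // the per-claim FlowFile cap that NIFI-3736 restores
          }
          return true; // keep appending new content to the same claim
      }

      public static void main(String[] args) {
          // e.g. a 1.5 MB claim holding 3 FlowFiles, limits of 1 MB / 100 FlowFiles:
          System.out.println(canRecycleClaim(1_500_000L, 3, 1_000_000L, 100)); // false
      }
  }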

Thanks
-Mark

[1] https://issues.apache.org/jira/browse/NIFI-3736



On Apr 24, 2017, at 11:19 PM, Olav Jordens wrote:

Hi Joe,

Yes indeed – this is a vexing problem. The flow is in production with many, many 
components for the various independent workflows. I have a few SplitJson processors, 
but not many SplitTexts as I recall. The strange thing is, the very large (> 1 GB) 
files are part of an independent workflow that has been behaving very well. 
Recently I created a new workflow that looks in a database to determine which 
SQL statements to run. It appears to have caused the latest round of problems. This 
workflow generates small ‘task’ files containing the JSON spec, which determines 
which temporary Hive tables are inserted into (permanent) ORC tables. The 
FlowFile content is transformed from JSON content to the HQL required to do 
the job. So this flow generates lots of small files and does not ingest 
anything large at all. However, as the small task files have queued up, it 
appears that these small files are in the same claims as some of the larger 1 GB 
files, which are in a completely independent flow within the same NiFi instance.

I use back pressure quite a bit – a very useful feature. The new flow mentioned 
above does not yet have back pressure built in, but I am still surprised to see 
a queue of a few thousand small files jam up the entire instance. I could 
certainly use back pressure to prevent this, but I would also like to 
understand why it has happened.

The other issue with testing is that this is a production instance, and I need 
to be careful about what stays running…

Anyway, I will try the lower value for max.appendable.size and let you know how 
it goes.

Thanks,
Olav


From: Joe Witt [mailto:joe.w...@gmail.com]
Sent: Tuesday, 25 April 2017 3:01 p.m.
To: users@nifi.apache.org
Subject: Re: FW: Content repository filling up

Olav

Yeah, you could give that other property a go as well to tighten the constraints 
on how you want the content repository to behave.  I'm starting to wonder if 
there are other gremlins at play here.

The dump writes to the logs/nifi-bootstrap.log by default.

Do you have SplitText processors in the flow?  Can you list the processors in 
use and how many FlowFiles are in the flow before it needs to get into this 
draining state?  Are you taking advantage of backpressure?  This is a little 
tough to help with via email, so the more detail about your flow that you can 
share the better.

Thanks
Joe

On Mon, Apr 24, 2017 at 10:43 PM, Olav Jordens wrote:
Hi Joe,

When restarting, the content repository stays high. What always works is to 
drain out the flow without letting new flow files be created, and then as I see 
the small queued files work their way through and disappear, large numbers of 
gigabytes drop off the content repository. This made me suspect that the small 
files are locking the claims. Which files are generated by bin/nifi.sh dump?

If the settings are a hint rather than a rule, does it make sense to reduce the 
max.appendable.size from 10 MB down to, say, 1 KB?

Thanks,
Olav


From: Joe Witt [mailto:joe.w...@gmail.com]
Sent: Tuesday, 25 April 2017 2:16 p.m.

To: users@nifi.apache.org
Subject: Re: FW: Content repository filling up

Olav,

I agree that sounds like unexpected behavior, but it is possible the 
implementation of the content repository automatically keeps together data for 
some small chunk of data size and treats that value more like a hint than an 
absolute rule.

Re: NPE in ListenSyslog processor

2017-04-25 Thread Andrew Grande
I wonder if the cause of the zero-length messages is the health check from the
F5 balancer. Worth verifying with your team.

Andrew

On Tue, Apr 25, 2017, 3:15 PM Andy LoPresto  wrote:

> PR 1694 [1] is available for this issue.
>
> [1] https://github.com/apache/nifi/pull/1694
>
> Andy LoPresto
> alopre...@apache.org
> alopresto.apa...@gmail.com
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>
> On Apr 25, 2017, at 10:07 AM, Conrad Crampton wrote:
>
> Hi,
> Thanks for the swift reply (as usual).
> NIFI-3738 created [1].
>
> I have passed over to infrastructure to try and establish cause of the
> zero length datagrams, but at least I now know there isn’t anything
> fundamentally wrong here and can (safely) ignore the errors.
>
> Thanks
> Conrad
>
>
> [1] https://issues.apache.org/jira/browse/NIFI-3738
>
> On 25/04/2017, 17:46, "Bryan Bende"  wrote:
>
>Hi Conrad,
>
>Line 431 of ListenSyslog has the following code:
>
>if (!valid || !event.isValid())
>
>So to get an NPE there means event must be null, and event comes from
> this code:
>
>boolean valid = true;
>try {
>    event = parser.parseEvent(rawSyslogEvent.getData(), sender);
>} catch (final ProcessException pe) {
>    getLogger().warn("Failed to parse Syslog event; routing to invalid");
>    valid = false;
>}
>
>The parser returns null if the bytes sent in are null or length 0.
>
>We should be checking if (!valid || event == null || !event.isValid())
>to avoid this case, and I think a similar situation exists in the
>ParseSyslog processor. It appears this would only happen if parsing
>messages is enabled in ListenSyslog.
>
>Do you want to create a JIRA for this?
>
>The other question is why you are ending up with these 0-length
>messages, but that one I am not sure about. In the case of UDP, it's
>just reading from a datagram channel into a byte buffer and passing
>those bytes along, so I think it means it's receiving a 0-byte
>datagram from the sender.
>
>Thanks,
>
>Bryan
>
>
>On Tue, Apr 25, 2017 at 12:31 PM, Conrad Crampton
> wrote:
>
> Hi,
>
> Been away for a bit from this community due to other work pressures, but
> picking up NiFi again and successfully upgraded to 1.1.2 (apart from
> screwing up one of the nodes temporarily).
>
> So, with the renewed interest in log processing our infrastructure team has
> put in an F5 load balancer to distribute the syslog traffic I am collecting
> to my 6 node cluster. This is to stop one node being the only workhorse for
> receiving syslog traffic. I had previously used the ‘standard’ pattern of
> having the ListenSyslog processor connect to a RPG and then the rest of my
> data processing flow receive via a local port – to effectively distribute
> the processing load. I was finding though that the single node was getting
> too many warnings about buffers and sockets being full, etc. – hence the
> external load balancing.
>
>
>
> I am no load balancing expert, but what I believe happens is the F5 load
> balancer receives syslog traffic (over UDP) then distributes this load to
> all NiFi nodes (gives a bit of syslog traffic to each, I believe). All
> appears fine, but then I start getting NPEs in my node logs thus:
>
>
>
> 2017-04-25 17:16:34,832 ERROR [Timer-Driven Process Thread-7]
> o.a.n.processors.standard.ListenSyslog
> ListenSyslog[id=0a932c37-0158-1000--656754bf]
> ListenSyslog[id=0a932c37-0158-1000--656754bf] failed to process due
> to java.lang.NullPointerException; rolling back session:
> java.lang.NullPointerException
>
> 2017-04-25 17:16:34,833 ERROR [Timer-Driven Process Thread-7]
> o.a.n.processors.standard.ListenSyslog
>
> java.lang.NullPointerException: null
>
>at org.apache.nifi.processors.standard.ListenSyslog.onTrigger(ListenSyslog.java:431) ~[nifi-standard-processors-1.1.2.jar:1.1.2]
>
>at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27) ~[nifi-api-1.1.2.jar:1.1.2]
>
>at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1099) [nifi-framework-core-1.1.2.jar:1.1.2]
>
>at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:136) [nifi-framework-core-1.1.2.jar:1.1.2]
>
>at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:47) [nifi-framework-core-1.1.2.jar:1.1.2]
>
>at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:132) [nifi-framework-core-1.1.2.jar:1.1.2]
>
>at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_51]
>
>at 

Re: How to identify MiNiFi source edge devices

2017-04-25 Thread Andy LoPresto
Jeff,

That’s a great explanation and a common thought exercise scenario we’ve used 
when planning other features of MiNiFi. I think what Andre suggested below 
would be the easiest and most successful way to accomplish what you are looking 
for. UpdateAttribute will let you get as specific as you want by pulling the 
hostname or a value from the variable registry (or you could even run a command on 
the host or read from a file on the system to get some unique identifier), and then 
all of your downstream processors have access to that attribute. You can also 
filter provenance data within NiFi using that discriminator.
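
As a concrete sketch, a single UpdateAttribute property is enough (the attribute name device.id here is just an example, not a NiFi convention):

  Property     Value
  device.id    ${hostname(true)}

Every FlowFile that passes through then carries device.id, and downstream processors (PutDynamoDB, RouteOnAttribute, etc.) can reference it as ${device.id}.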

Good luck.


Andy LoPresto
alopre...@apache.org
alopresto.apa...@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Apr 25, 2017, at 9:57 AM, Jeff Zemerick  wrote:
> 
> Aldrin,
> 
> To simplify it, the situation is analogous to a deployment of temperature 
> sensors. Each sensor has a unique ID that is assigned by us at deployment 
> time and each sensor periodically adds a new row to a database table that is 
> stored on the sensor. Each sensor uses the same database schema so if you 
> combined all the rows you couldn't tell which rows originated from which 
> sensor. In NiFi, I need to do different things based on where the data 
> originated and I need to associate the sensor's ID with its data. (Such as 
> inserting the data into DynamoDB with the sensor ID as the Hash key and a 
> timestamp as the Range key.) The goal is to use the same MiNiFi configuration 
> for all devices.
> 
> I can easily use the ExecuteSQL processor to grab the new rows. But I need 
> some way to attach an attribute to the data that identifies where it 
> originated. That was what led to the initial question in this thread. The 
> Variable Registry along with the UpdateAttribute processor appears to satisfy 
> that need more cleanly than a custom processor.
> 
> I hope that explains the situation a bit!
> 
> Thanks,
> Jeff
> 
> 
> 
> On Tue, Apr 25, 2017 at 11:17 AM, Aldrin Piri wrote:
> Jeff,
> 
> Could you expand upon what a device id is in your case?  Something intrinsic 
> to the device? The agent?  Are these generated and assigned during 
> provisioning?   How are you making use of these when the data arrives at its 
> desired destination?
> 
> What you are expressing is certainly a common need.  Would welcome any 
> perspective on what your deployment looks like such that we can frame uses 
> people are rolling out to guide assumptions that get made during our 
> development and design processes.
> 
> Thanks for diving in and exploring!
> --Aldrin
> 
> 
> On Tue, Apr 25, 2017 at 11:05 AM, Andre wrote:
> Jeff,
> 
> That would be my next suggestion. :-)
> 
> Cheers
> 
> On Wed, Apr 26, 2017 at 1:04 AM, Jeff Zemerick wrote:
> It is possible. I will take a look to see if the hostname is sufficient for 
> the device ID.
> 
> I just learned about the Variable Registry. It seems if I use the Variable 
> Registry to store the device ID it would be available to the UpdateAttribute 
> processor. Is that correct?
> 
> Thanks,
> Jeff
> 
> 
> On Tue, Apr 25, 2017 at 10:48 AM, Andre wrote:
> Jeff,
> 
> Would it be feasible for you to use UpdateAttribute (which I believe is part of 
> MiNiFi core processors) and use the ${hostname(true)} Expression language 
> function?
> 
> More about it can be found here:
> 
> https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#hostname
>  
> 
> 
> Cheers
> 
> On Wed, Apr 26, 2017 at 12:39 AM, Jeff Zemerick wrote:
> When processing data in NiFi that was received via MiNiFi edge devices I need 
> to be able to identify the source of the data. All of the data on the edge 
> devices will be pulled from a database and will not contain any data that 
> self-identifies the source. My attempt to solve this was to write a processor 
> that reads a configuration file on the edge device to get its device ID and 
> put that ID as an attribute in the flowfile. This appears to work, but, I was 
> wondering if there is a more recommended approach?
> 
> Thanks,
> Jeff
> 
> 
> 
> 
> 





Re: How to identify MiNiFi source edge devices

2017-04-25 Thread Jeff Zemerick
Aldrin,

To simplify it, the situation is analogous to a deployment of temperature
sensors. Each sensor has a unique ID that is assigned by us at deployment
time and each sensor periodically adds a new row to a database table that
is stored on the sensor. Each sensor uses the same database schema so if
you combined all the rows you couldn't tell which rows originated from
which sensor. In NiFi, I need to do different things based on where the
data originated and I need to associate the sensor's ID with its data.
(Such as inserting the data into DynamoDB with the sensor ID as the Hash
key and a timestamp as the Range key.) The goal is to use the same MiNiFi
configuration for all devices.

I can easily use the ExecuteSQL processor to grab the new rows. But I need
some way to attach an attribute to the data that identifies where it
originated. That was what led to the initial question in this thread. The
Variable Registry along with the UpdateAttribute processor appears to
satisfy that need more cleanly than a custom processor.

I hope that explains the situation a bit!

Thanks,
Jeff



On Tue, Apr 25, 2017 at 11:17 AM, Aldrin Piri  wrote:

> Jeff,
>
> Could you expand upon what a device id is in your case?  Something
> intrinsic to the device? The agent?  Are these generated and assigned
> during provisioning?   How are you making use of these when the data
> arrives at its desired destination?
>
> What you are expressing is certainly a common need.  Would welcome any
> perspective on what your deployment looks like such that we can frame uses
> people are rolling out to guide assumptions that get made during our
> development and design processes.
>
> Thanks for diving in and exploring!
> --Aldrin
>
>
> On Tue, Apr 25, 2017 at 11:05 AM, Andre  wrote:
>
>> Jeff,
>>
>> That would be my next suggestion. :-)
>>
>> Cheers
>>
>> On Wed, Apr 26, 2017 at 1:04 AM, Jeff Zemerick 
>> wrote:
>>
>>> It is possible. I will take a look to see if the hostname is sufficient
>>> for the device ID.
>>>
>>> I just learned about the Variable Registry. It seems if I use the
>>> Variable Registry to store the device ID it would be available to the
>>> UpdateAttribute processor. Is that correct?
>>>
>>> Thanks,
>>> Jeff
>>>
>>>
>>> On Tue, Apr 25, 2017 at 10:48 AM, Andre  wrote:
>>>
 Jeff,

 Would it be feasible for you to use UpdateAttribute (which I believe is
 part of MiNiFi core processors) and use the ${hostname(true)} Expression
 language function?

 More about it can be found here:

 https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#hostname

 Cheers

 On Wed, Apr 26, 2017 at 12:39 AM, Jeff Zemerick 
 wrote:

> When processing data in NiFi that was received via MiNiFi edge devices
> I need to be able to identify the source of the data. All of the data on
> the edge devices will be pulled from a database and will not contain any
> data that self-identifies the source. My attempt to solve this was to 
> write
> a processor that reads a configuration file on the edge device to get its
> device ID and put that ID as an attribute in the flowfile. This appears to
> work, but, I was wondering if there is a more recommended approach?
>
> Thanks,
> Jeff
>


>>>
>>
>


Re: NPE in ListenSyslog processor

2017-04-25 Thread Conrad Crampton
Hi,
Thanks for the swift reply (as usual).
NIFI-3738 created [1].

I have passed this over to infrastructure to try and establish the cause of the 
zero-length datagrams, but at least I now know there isn’t anything fundamentally 
wrong here and can (safely) ignore the errors.

Thanks
Conrad


[1] https://issues.apache.org/jira/browse/NIFI-3738

On 25/04/2017, 17:46, "Bryan Bende"  wrote:

Hi Conrad,

Line 431 of ListenSyslog has the following code:

if (!valid || !event.isValid())

So to get an NPE there means event must be null, and event comes from this 
code:

boolean valid = true;
try {
    event = parser.parseEvent(rawSyslogEvent.getData(), sender);
} catch (final ProcessException pe) {
    getLogger().warn("Failed to parse Syslog event; routing to invalid");
    valid = false;
}

The parser returns null if the bytes sent in are null or length 0.

We should be checking if (!valid || event == null || !event.isValid())
to avoid this case, and I think a similar situation exists in the
ParseSyslog processor. It appears this would only happen if parsing
messages is enabled in ListenSyslog.

Do you want to create a JIRA for this?

The other question is why you are ending up with these 0-length
messages, but that one I am not sure about. In the case of UDP, it's
just reading from a datagram channel into a byte buffer and passing
those bytes along, so I think it means it's receiving a 0-byte
datagram from the sender.

Thanks,

Bryan


On Tue, Apr 25, 2017 at 12:31 PM, Conrad Crampton
 wrote:
> Hi,
>
> Been away for a bit from this community due to other work pressures, but
> picking up NiFi again and successfully upgraded to 1.1.2 (apart from
> screwing up one of the nodes temporarily).
>
> So, with the renewed interest in log processing our infrastructure team has
> put in an F5 load balancer to distribute the syslog traffic I am collecting
> to my 6 node cluster. This is to stop one node being the only workhorse for
> receiving syslog traffic. I had previously used the ‘standard’ pattern of
> having the ListenSyslog processor connect to an RPG and then the rest of my
> data processing flow receive via a local port – to effectively distribute
> the processing load. I was finding, though, that the single node was getting
> too many warnings about buffers and sockets being full, etc. – hence the
> external load balancing.
>
>
>
> I am no load balancing expert, but what I believe happens is the F5 load
> balancer receives syslog traffic (over UDP) then distributes this load to
> all NiFi nodes (gives a bit of syslog traffic to each, I believe). All
> appears fine, but then I start getting NPEs in my node logs thus:
>
>
>
> 2017-04-25 17:16:34,832 ERROR [Timer-Driven Process Thread-7]
> o.a.n.processors.standard.ListenSyslog
> ListenSyslog[id=0a932c37-0158-1000--656754bf]
> ListenSyslog[id=0a932c37-0158-1000--656754bf] failed to process due
> to java.lang.NullPointerException; rolling back session:
> java.lang.NullPointerException
>
> 2017-04-25 17:16:34,833 ERROR [Timer-Driven Process Thread-7]
> o.a.n.processors.standard.ListenSyslog
>
> java.lang.NullPointerException: null
>
> at org.apache.nifi.processors.standard.ListenSyslog.onTrigger(ListenSyslog.java:431) ~[nifi-standard-processors-1.1.2.jar:1.1.2]
>
> at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27) ~[nifi-api-1.1.2.jar:1.1.2]
>
> at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1099) [nifi-framework-core-1.1.2.jar:1.1.2]
>
> at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:136) [nifi-framework-core-1.1.2.jar:1.1.2]
>
> at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:47) [nifi-framework-core-1.1.2.jar:1.1.2]
>
> at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:132) [nifi-framework-core-1.1.2.jar:1.1.2]
>
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_51]
>
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_51]
>
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_51]
>
> at 

NPE in ListenSyslog processor

2017-04-25 Thread Conrad Crampton
Hi,
Been away for a bit from this community due to other work pressures, but 
picking up NiFi again and successfully upgraded to 1.1.2 (apart from screwing 
up one of the nodes temporarily).
So, with the renewed interest in log processing our infrastructure team has put 
in an F5 load balancer to distribute the syslog traffic I am collecting to my 6 
node cluster. This is to stop one node being the only workhorse for receiving 
syslog traffic. I had previously used the ‘standard’ pattern of having the 
ListenSyslog processor connect to an RPG and then the rest of my data processing 
flow receive via a local port – to effectively distribute the processing load. 
I was finding, though, that the single node was getting too many warnings about 
buffers and sockets being full, etc. – hence the external load balancing.

I am no load balancing expert, but what I believe happens is the F5 load 
balancer receives syslog traffic (over UDP) then distributes this load to all 
NiFi nodes (gives a bit of syslog traffic to each, I believe). All appears fine, 
but then I start getting NPEs in my node logs thus:

2017-04-25 17:16:34,832 ERROR [Timer-Driven Process Thread-7] 
o.a.n.processors.standard.ListenSyslog 
ListenSyslog[id=0a932c37-0158-1000--656754bf] 
ListenSyslog[id=0a932c37-0158-1000--656754bf] failed to process due to 
java.lang.NullPointerException; rolling back session: 
java.lang.NullPointerException
2017-04-25 17:16:34,833 ERROR [Timer-Driven Process Thread-7] 
o.a.n.processors.standard.ListenSyslog
java.lang.NullPointerException: null
    at org.apache.nifi.processors.standard.ListenSyslog.onTrigger(ListenSyslog.java:431) ~[nifi-standard-processors-1.1.2.jar:1.1.2]
    at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27) ~[nifi-api-1.1.2.jar:1.1.2]
    at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1099) [nifi-framework-core-1.1.2.jar:1.1.2]
    at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:136) [nifi-framework-core-1.1.2.jar:1.1.2]
    at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:47) [nifi-framework-core-1.1.2.jar:1.1.2]
    at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:132) [nifi-framework-core-1.1.2.jar:1.1.2]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_51]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_51]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_51]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_51]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_51]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_51]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_51]
2017-04-25 17:16:34,833 ERROR [Timer-Driven Process Thread-7] 
o.a.n.processors.standard.ListenSyslog 
ListenSyslog[id=0a932c37-0158-1000--656754bf] 
ListenSyslog[id=0a932c37-0158-1000--656754bf] failed to process session 
due to java.lang.NullPointerException: java.lang.NullPointerException
2017-04-25 17:16:34,833 ERROR [Timer-Driven Process Thread-7] 
o.a.n.processors.standard.ListenSyslog
java.lang.NullPointerException: null
    at org.apache.nifi.processors.standard.ListenSyslog.onTrigger(ListenSyslog.java:431) ~[na:na]
    at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27) ~[nifi-api-1.1.2.jar:1.1.2]
    at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1099) ~[nifi-framework-core-1.1.2.jar:1.1.2]
    at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:136) [nifi-framework-core-1.1.2.jar:1.1.2]
    at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:47) [nifi-framework-core-1.1.2.jar:1.1.2]
    at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:132) [nifi-framework-core-1.1.2.jar:1.1.2]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_51]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_51]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_51]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_51]
    at 

Re: NPE in ListenSyslog processor

2017-04-25 Thread Bryan Bende
Hi Conrad,

Line 431 of ListenSyslog has the following code:

if (!valid || !event.isValid())

So to get an NPE there means event must be null, and event comes from this code:

boolean valid = true;
try {
    event = parser.parseEvent(rawSyslogEvent.getData(), sender);
} catch (final ProcessException pe) {
    getLogger().warn("Failed to parse Syslog event; routing to invalid");
    valid = false;
}

The parser returns null if the bytes sent in are null or length 0.

We should be checking if (!valid || event == null || !event.isValid())
to avoid this case, and I think a similar situation exists in the
ParseSyslog processor. It appears this would only happen if parsing
messages is enabled in ListenSyslog.
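
A minimal, self-contained sketch of that null-safe check (illustrative only; SyslogEventLike stands in for NiFi's event type, and the merged fix may differ in detail):

  public class NullSafeSyslogCheck {
      interface SyslogEventLike {
          boolean isValid();
      }

      static boolean shouldRouteToInvalid(final boolean parsedOk, final SyslogEventLike event) {
          // parseEvent(...) returns null for null/zero-length input, so the event
          // must be null-checked before isValid() is dereferenced.
          return !parsedOk || event == null || !event.isValid();
      }

      public static void main(String[] args) {
          System.out.println(shouldRouteToInvalid(true, null)); // true -- no NPE
      }
  }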

Do you want to create a JIRA for this?

The other question is why you are ending up with these 0-length
messages, but that one I am not sure about. In the case of UDP, it's
just reading from a datagram channel into a byte buffer and passing
those bytes along, so I think it means it's receiving a 0-byte
datagram from the sender.
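
For anyone who wants to reproduce this locally, a zero-length UDP datagram is easy to send from plain Java (illustrative snippet; the host and port are assumptions for a local ListenSyslog):

  import java.net.DatagramPacket;
  import java.net.DatagramSocket;
  import java.net.InetAddress;

  public class EmptyDatagram {
      public static void main(String[] args) throws Exception {
          try (DatagramSocket socket = new DatagramSocket()) {
              // zero bytes, like a load balancer health check might send
              socket.send(new DatagramPacket(new byte[0], 0,
                      InetAddress.getByName("localhost"), 514));
          }
      }
  }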

Thanks,

Bryan


On Tue, Apr 25, 2017 at 12:31 PM, Conrad Crampton
 wrote:
> Hi,
>
> Been away for a bit from this community due to other work pressures, but
> picking up NiFi again and successfully upgraded to 1.1.2 (apart from
> screwing up one of the nodes temporarily).
>
> So, with the renewed interest in log processing our infrastructure team has
> put in an F5 load balancer to distribute the syslog traffic I am collecting
> to my 6 node cluster. This is to stop one node being the only workhorse for
> receiving syslog traffic. I had previously used the ‘standard’ pattern of
> having the ListenSyslog processor connect to a RPG and then the rest of my
> data processing flow receive via a local port – to effectively distribute
> the processing load. I was finding though that the single node was getting
> too many warnings about buffers and sockets being full, etc. – hence the external
> load balancing.
>
>
>
> I am no load balancing expert, but what I believe happens is the F5 load
> balancer receives syslog traffic (over UDP) then distributes this load to
> all NiFi nodes (gives a bit of syslog traffic to each, I believe). All
> appears fine, but then I start getting NPEs in my node logs thus:
>
>
>
> 2017-04-25 17:16:34,832 ERROR [Timer-Driven Process Thread-7]
> o.a.n.processors.standard.ListenSyslog
> ListenSyslog[id=0a932c37-0158-1000--656754bf]
> ListenSyslog[id=0a932c37-0158-1000--656754bf] failed to process due
> to java.lang.NullPointerException; rolling back session:
> java.lang.NullPointerException
>
> 2017-04-25 17:16:34,833 ERROR [Timer-Driven Process Thread-7]
> o.a.n.processors.standard.ListenSyslog
>
> java.lang.NullPointerException: null
>
> at
> org.apache.nifi.processors.standard.ListenSyslog.onTrigger(ListenSyslog.java:431)
> ~[nifi-standard-processors-1.1.2.jar:1.1.2]
>
> at
> org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
> ~[nifi-api-1.1.2.jar:1.1.2]
>
> at
> org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1099)
> [nifi-framework-core-1.1.2.jar:1.1.2]
>
> at
> org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:136)
> [nifi-framework-core-1.1.2.jar:1.1.2]
>
> at
> org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:47)
> [nifi-framework-core-1.1.2.jar:1.1.2]
>
> at
> org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:132)
> [nifi-framework-core-1.1.2.jar:1.1.2]
>
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> [na:1.8.0_51]
>
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> [na:1.8.0_51]
>
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> [na:1.8.0_51]
>
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> [na:1.8.0_51]
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [na:1.8.0_51]
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [na:1.8.0_51]
>
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_51]
>
> 2017-04-25 17:16:34,833 ERROR [Timer-Driven Process Thread-7]
> o.a.n.processors.standard.ListenSyslog
> ListenSyslog[id=0a932c37-0158-1000--656754bf]
> ListenSyslog[id=0a932c37-0158-1000--656754bf] failed to process
> session due to java.lang.NullPointerException:
> java.lang.NullPointerException
>
> 2017-04-25 17:16:34,833 ERROR [Timer-Driven Process Thread-7]
> o.a.n.processors.standard.ListenSyslog
>
> java.lang.NullPointerException: null
>
> at
> 

Re: NPE in ListenSyslog processor

2017-04-25 Thread Andy LoPresto
PR 1694 [1] is available for this issue.

[1] https://github.com/apache/nifi/pull/1694 


Andy LoPresto
alopre...@apache.org
alopresto.apa...@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Apr 25, 2017, at 10:07 AM, Conrad Crampton wrote:
> 
> Hi,
> Thanks for the swift reply (as usual).
> NIFI-3738 created [1].
> 
> I have passed over to infrastructure to try and establish cause of the zero 
> length datagrams, but at least I now know there isn’t anything fundamentally 
> wrong here and can (safely) ignore the errors.
> 
> Thanks
> Conrad
> 
> 
> [1] https://issues.apache.org/jira/browse/NIFI-3738
> 
> On 25/04/2017, 17:46, "Bryan Bende"  wrote:
> 
>Hi Conrad,
> 
>Line 431 of ListenSyslog has the following code:
> 
>if (!valid || !event.isValid())
> 
>So to get an NPE there means event must be null, and event comes from this 
> code:
> 
>boolean valid = true;
>try {
>    event = parser.parseEvent(rawSyslogEvent.getData(), sender);
>} catch (final ProcessException pe) {
>    getLogger().warn("Failed to parse Syslog event; routing to invalid");
>    valid = false;
>}
> 
>The parser returns null if the bytes sent in are null or length 0.
> 
>We should be checking if (!valid || event == null || !event.isValid())
>to avoid this case, and I think a similar situation exists in the
>ParseSyslog processor. It appears this would only happen if parsing
>messages is enabled in ListenSyslog.
> 
>Do you want to create a JIRA for this?
> 
>The other question is why you are ending up with these 0-length
>messages, but that one I am not sure about. In the case of UDP, it's
>just reading from a datagram channel into a byte buffer and passing
>those bytes along, so I think it means it's receiving a 0-byte
>datagram from the sender.
> 
>Thanks,
> 
>Bryan
> 
> 
>On Tue, Apr 25, 2017 at 12:31 PM, Conrad Crampton
> wrote:
>> Hi,
>> 
>> Been away for a bit from this community due to other work pressures, but
>> picking up NiFi again and successfully upgraded to 1.1.2 (apart from
>> screwing up one of the nodes temporarily).
>> 
>> So, with the renewed interest in log processing our infrastructure team has
>> put in an F5 load balancer to distribute the syslog traffic I am collecting
>> to my 6 node cluster. This is to stop one node being the only workhorse for
>> receiving syslog traffic. I had previously used the ‘standard’ pattern of
>> having the ListenSyslog processor connect to a RPG and then the rest of my
>> data processing flow receive via a local port – to effectively distribute
>> the processing load. I was finding though that the single node was getting
>> too many warnings about buffers and sockets being full, etc. – hence the external
>> load balancing.
>> 
>> 
>> 
>> I am no load balancing expert, but what I believe happens is the F5 load
>> balancer receives syslog traffic (over UDP) then distributes this load to
>> all NiFi nodes (gives a bit of syslog traffic to each, I believe). All
>> appears fine, but then I start getting NPEs in my node logs thus:
>> 
>> 
>> 
>> 2017-04-25 17:16:34,832 ERROR [Timer-Driven Process Thread-7]
>> o.a.n.processors.standard.ListenSyslog
>> ListenSyslog[id=0a932c37-0158-1000--656754bf]
>> ListenSyslog[id=0a932c37-0158-1000--656754bf] failed to process due
>> to java.lang.NullPointerException; rolling back session:
>> java.lang.NullPointerException
>> 
>> 2017-04-25 17:16:34,833 ERROR [Timer-Driven Process Thread-7]
>> o.a.n.processors.standard.ListenSyslog
>> 
>> java.lang.NullPointerException: null
>> 
>>at
>> org.apache.nifi.processors.standard.ListenSyslog.onTrigger(ListenSyslog.java:431)
>> ~[nifi-standard-processors-1.1.2.jar:1.1.2]
>> 
>>at
>> org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
>> ~[nifi-api-1.1.2.jar:1.1.2]
>> 
>>at
>> org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1099)
>> [nifi-framework-core-1.1.2.jar:1.1.2]
>> 
>>at
>> org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:136)
>> [nifi-framework-core-1.1.2.jar:1.1.2]
>> 
>>at
>> org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:47)
>> [nifi-framework-core-1.1.2.jar:1.1.2]
>> 
>>at
>> org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:132)
>> [nifi-framework-core-1.1.2.jar:1.1.2]
>> 
>>at
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> [na:1.8.0_51]
>> 
>>at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>> [na:1.8.0_51]
>> 
>>at
>> 

Re: Content repository filling up

2017-04-25 Thread Mark Payne
Olav,

It does look like we are no longer honoring those properties. I created a JIRA 
[1] to address this.
The way that it is working right now, it will write to a single Content Claim 
until that Content Claim exceeds
1 MB. It will then stop 'recycling' the content claim. What you may be seeing, 
though, is that you have say 100
Content Claims each consisting of one small FlowFile and then one really big 
FlowFile. We should certainly
be honoring the "nifi.content.claim.max.flow.files" property but it looks like 
that was lost in a refactoring that
happened. My apologies for that. Thanks for bringing this to our attention!

Thanks
-Mark

[1] https://issues.apache.org/jira/browse/NIFI-3736



On Apr 24, 2017, at 11:19 PM, Olav Jordens wrote:

Hi Joe,

Yes indeed – this is a vexing problem. The flow is in production with many, many 
components for the various independent workflows. I have a few SplitJson processors, 
but not many SplitTexts as I recall. The strange thing is, the very large (> 1 GB) 
files are part of an independent workflow that has been behaving very well. 
Recently I created a new workflow that looks in a database to determine which 
SQL statements to run. It appears to have caused the latest round of problems. This 
workflow generates small ‘task’ files containing the JSON spec, which determines 
which temporary Hive tables are inserted into (permanent) ORC tables. The 
FlowFile content is transformed from JSON content to the HQL required to do 
the job. So this flow generates lots of small files and does not ingest 
anything large at all. However, as the small task files have queued up, it 
appears that these small files are in the same claims as some of the larger 1 GB 
files, which are in a completely independent flow within the same NiFi instance.

I use back pressure quite a bit – a very useful feature. The new flow mentioned 
above does not yet have back pressure built in, but I am still surprised to see 
a queue of a few thousand small files jam up the entire instance. I could 
certainly use back pressure to prevent this, but I would also like to 
understand why it has happened.

The other issue with testing is that this is a production instance, and I need 
to be careful about what stays running…

Anyway, I will try the lower value for max.appendable.size and let you know how 
it goes.

Thanks,
Olav


From: Joe Witt [mailto:joe.w...@gmail.com]
Sent: Tuesday, 25 April 2017 3:01 p.m.
To: users@nifi.apache.org
Subject: Re: FW: Content repository filling up

Olav

Yeah, you could give that other property a go as well to tighten the constraints 
on how you want the content repository to behave.  I'm starting to wonder if 
there are other gremlins at play here.

The dump writes to the logs/nifi-bootstrap.log by default.

Do you have SplitText processors in the flow?  Can you list the processors in 
use and how many FlowFiles are in the flow before it needs to get into this 
draining state?  Are you taking advantage of backpressure?  This is a little 
tough to help with via email, so the more detail about your flow that you can 
share the better.

Thanks
Joe

On Mon, Apr 24, 2017 at 10:43 PM, Olav Jordens wrote:
Hi Joe,

When restarting, the content repository stays high. What always works is to 
drain out the flow without letting new flow files be created, and then as I see 
the small queued files work their way through and disappear, large numbers of 
gigabytes drop off the content repository. This made me suspect that the small 
files are locking the claims. Which files are generated by bin/nifi.sh dump?

If the settings are a hint rather than a rule, does it make sense to reduce the 
max.appendable.size from 10 MB down to, say, 1 KB?

Thanks,
Olav


From: Joe Witt [mailto:joe.w...@gmail.com]
Sent: Tuesday, 25 April 2017 2:16 p.m.

To: users@nifi.apache.org
Subject: Re: FW: Content repository filling up

Olav,

I agree that sounds like unexpected behavior, but it is possible the 
implementation of the content repository automatically keeps together data for 
some small chunk of data size and treats that value more like a hint than an 
absolute rule.  If this is something that is pretty easy to repeat/reproduce, 
could you try to generate some thread dumps while it is running?  As you're 
describing it, this sounds like there should not be data sitting around in the 
content repository.  Do you find after a NiFi restart that a bunch of files get 
deleted out of the content repository?

To generate a thread dump run 'bin/nifi.sh dump'

Thanks
Joe

On Mon, Apr 24, 2017 at 10:10 PM, Olav Jordens wrote:
Hi Joe,

Thanks so much for your quick response. My content repository has about 400GB 
storage, and 

How to identify MiNiFi source edge devices

2017-04-25 Thread Jeff Zemerick
When processing data in NiFi that was received via MiNiFi edge devices I
need to be able to identify the source of the data. All of the data on the
edge devices will be pulled from a database and will not contain any data
that self-identifies the source. My attempt to solve this was to write a
processor that reads a configuration file on the edge device to get its
device ID and put that ID as an attribute in the FlowFile. This appears to
work, but I was wondering if there is a more recommended approach?

Thanks,
Jeff


Re: How to identify MiNiFi source edge devices

2017-04-25 Thread Andre
Jeff,

Would it be feasible for you to use UpdateAttribute (which I believe is part
of MiNiFi core processors) and use the ${hostname(true)} Expression
language function?

More about it can be found here:

https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#hostname

Cheers

On Wed, Apr 26, 2017 at 12:39 AM, Jeff Zemerick 
wrote:

> When processing data in NiFi that was received via MiNiFi edge devices I
> need to be able to identify the source of the data. All of the data on the
> edge devices will be pulled from a database and will not contain any data
> that self-identifies the source. My attempt to solve this was to write a
> processor that reads a configuration file on the edge device to get its
> device ID and put that ID as an attribute in the flowfile. This appears to
> work, but, I was wondering if there is a more recommended approach?
>
> Thanks,
> Jeff
>


Re: How to identify MiNiFi source edge devices

2017-04-25 Thread Jeff Zemerick
It is possible. I will take a look to see if the hostname is sufficient for
the device ID.

I just learned about the Variable Registry. It seems if I use the Variable
Registry to store the device ID it would be available to the
UpdateAttribute processor. Is that correct?
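
For reference, the NiFi-side wiring looks roughly like this (the file name and key are illustrative, and the MiNiFi equivalent may differ):

  # conf/nifi.properties -- point the variable registry at a custom properties file
  nifi.variable.registry.properties=./conf/device.properties

  # conf/device.properties -- baked per device at provisioning time
  device.id=sensor-0042

UpdateAttribute can then set an attribute to ${device.id} on every FlowFile.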

Thanks,
Jeff


On Tue, Apr 25, 2017 at 10:48 AM, Andre  wrote:

> Jeff,
>
> Would it be feasible for you to use UpdateAttribute (which I believe is part
> of MiNiFi core processors) and use the ${hostname(true)} Expression
> language function?
>
> More about it can be found here:
>
> https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#hostname
>
> Cheers
>
> On Wed, Apr 26, 2017 at 12:39 AM, Jeff Zemerick 
> wrote:
>
>> When processing data in NiFi that was received via MiNiFi edge devices I
>> need to be able to identify the source of the data. All of the data on the
>> edge devices will be pulled from a database and will not contain any data
>> that self-identifies the source. My attempt to solve this was to write a
>> processor that reads a configuration file on the edge device to get its
>> device ID and put that ID as an attribute in the flowfile. This appears to
>> work, but, I was wondering if there is a more recommended approach?
>>
>> Thanks,
>> Jeff
>>
>
>


Re: How to identify MiNiFi source edge devices

2017-04-25 Thread Andre
Jeff,

That would be my next suggestion. :-)

Cheers

On Wed, Apr 26, 2017 at 1:04 AM, Jeff Zemerick  wrote:

> It is possible. I will take a look to see if the hostname is sufficient
> for the device ID.
>
> I just learned about the Variable Registry. It seems if I use the Variable
> Registry to store the device ID it would be available to the
> UpdateAttribute processor. Is that correct?
>
> Thanks,
> Jeff
>
>
> On Tue, Apr 25, 2017 at 10:48 AM, Andre  wrote:
>
>> Jeff,
>>
>> Would it be feasible for you to use UpdateAttribute (which I believe is part
>> of MiNiFi core processors) and use the ${hostname(true)} Expression
>> language function?
>>
>> More about it can be found here:
>>
>> https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#hostname
>>
>> Cheers
>>
>> On Wed, Apr 26, 2017 at 12:39 AM, Jeff Zemerick 
>> wrote:
>>
>>> When processing data in NiFi that was received via MiNiFi edge devices I
>>> need to be able to identify the source of the data. All of the data on the
>>> edge devices will be pulled from a database and will not contain any data
>>> that self-identifies the source. My attempt to solve this was to write a
>>> processor that reads a configuration file on the edge device to get its
>>> device ID and put that ID as an attribute in the flowfile. This appears to
>>> work, but, I was wondering if there is a more recommended approach?
>>>
>>> Thanks,
>>> Jeff
>>>
>>
>>
>


Re: How to identify MiNiFi source edge devices

2017-04-25 Thread Aldrin Piri
Jeff,

Could you expand upon what a device id is in your case?  Something
intrinsic to the device? The agent?  Are these generated and assigned
during provisioning?   How are you making use of these when the data
arrives at its desired destination?

What you are expressing is certainly a common need.  We would welcome any
perspective on what your deployment looks like, so that we can frame the uses
people are rolling out and guide the assumptions made during our
development and design processes.

Thanks for diving in and exploring!
--Aldrin


On Tue, Apr 25, 2017 at 11:05 AM, Andre  wrote:

> Jeff,
>
> That would be my next suggestion. :-)
>
> Cheers
>
> On Wed, Apr 26, 2017 at 1:04 AM, Jeff Zemerick 
> wrote:
>
>> It is possible. I will take a look to see if the hostname is sufficient
>> for the device ID.
>>
>> I just learned about the Variable Registry. It seems if I use the
>> Variable Registry to store the device ID it would be available to the
>> UpdateAttribute processor. Is that correct?
>>
>> Thanks,
>> Jeff
>>
>>
>> On Tue, Apr 25, 2017 at 10:48 AM, Andre  wrote:
>>
>>> Jeff,
>>>
>>> Would it be feasible for you to use UpdateAttribute (which I believe is
>>> part of MiNiFi core processors) and use the ${hostname(true)} Expression
>>> language function?
>>>
>>> More about it can be found here:
>>>
>>> https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#hostname
>>>
>>> Cheers
>>>
>>> On Wed, Apr 26, 2017 at 12:39 AM, Jeff Zemerick 
>>> wrote:
>>>
 When processing data in NiFi that was received via MiNiFi edge devices
 I need to be able to identify the source of the data. All of the data on
 the edge devices will be pulled from a database and will not contain any
 data that self-identifies the source. My attempt to solve this was to write
 a processor that reads a configuration file on the edge device to get its
 device ID and put that ID as an attribute in the flowfile. This appears to
 work, but, I was wondering if there is a more recommended approach?

 Thanks,
 Jeff

>>>
>>>
>>
>