Re: Optimizing Performance of Apache NiFi's Network Listening Processors

Bryan Bende Tue, 06 Aug 2019 07:24:42 -0700

OK, that makes sense. There are basically two options to make it efficient:

A) You can use ListenSyslog with batching, followed by ValidateRecord
with one of the syslog record readers  [1][2].


B) You can use ListenTCPRecord with a syslog record reader.

A will probably work better for a larger number of TCP connections, B
would work better for a smaller number of connections.

One challenge with both of them is that there isn't a syslog record
writer, so you would probably have to use the
FreeFormTextRecordSetWriter [3] with an expression that rewrites the
message using the record fields, like "${hostname} ${body}" if you
wanted to rewrite each message with the hostname and body.
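
As a rough illustration of the validate-then-rewrite step described above,
here is a minimal Python sketch. It is not NiFi code and does not use any
NiFi API. The regular expression is a simplified RFC 3164-style pattern
chosen for this example, the function name rewrite_batch is hypothetical,
and the Python template "{hostname} {body}" only stands in for the
"${hostname} ${body}" expression mentioned above.

```python
import re

# Simplified RFC 3164-style pattern: <PRI>TIMESTAMP HOSTNAME BODY.
# Assumption for illustration only; this is not the grammar that
# NiFi's syslog record readers actually use.
SYSLOG_3164 = re.compile(
    r"^<(?P<priority>\d{1,3})>"
    r"(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s"
    r"(?P<hostname>\S+)\s"
    r"(?P<body>.*)$"
)


def rewrite_batch(batch, template="{hostname} {body}"):
    """Split a newline-delimited batch, validate each line, and
    rewrite the valid ones using the parsed fields.

    Returns (valid, invalid): rewritten lines and rejected raw lines.
    """
    valid, invalid = [], []
    for line in batch.splitlines():
        if not line.strip():
            continue  # skip empty lines between messages
        match = SYSLOG_3164.match(line)
        if match is None:
            invalid.append(line)  # analogous to routing to "invalid"
        else:
            # Unused fields (priority, timestamp) are ignored by format().
            valid.append(template.format(**match.groupdict()))
    return valid, invalid


batch = (
    "<34>Oct 11 22:14:15 mymachine su: 'su root' failed for lonvick\n"
    "not a syslog message\n"
    "<13>Feb  5 17:32:18 10.0.0.99 Use the BFG!\n"
)
valid, invalid = rewrite_batch(batch)
for line in valid:
    print(line)
```

The point of the sketch is that many messages stay together in one batch
while still being validated and rewritten individually, which is the
idea behind pairing batching with record-based processing.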

[1] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-record-serialization-services-nar/1.9.2/org.apache.nifi.syslog.SyslogReader/index.html
[2] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-record-serialization-services-nar/1.9.2/org.apache.nifi.syslog.Syslog5424Reader/index.html
[3] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-record-serialization-services-nar/1.9.2/org.apache.nifi.text.FreeFormTextRecordSetWriter/index.html

On Tue, Aug 6, 2019 at 10:08 AM Clay Teahouse <[email protected]> wrote:
>
> Hello Bryan,
>
> I am ingesting millions of syslog records from various data sources. I need 
> to make sure the format is valid and then prefix each message with the host 
> name (from syslog header) and some other metadata and push the records to 
> various consumers.
>
> thanks
> Clay
>
> On Tue, Aug 6, 2019 at 6:26 AM Bryan Bende <[email protected]> wrote:
>>
>> Can you describe what you want to do with each message?
>>
>> Right now I’m not following why you need to parse them.
>>
>> On Tue, Aug 6, 2019 at 6:40 AM Clay Teahouse <[email protected]> wrote:
>>>
>>> Bryan,
>>> Understood, but wouldn't then this processor be inefficient if you are 
>>> dealing with a very large number of syslog messages, if you don't have the 
>>> batching option? I suppose we could have had the option of parsing each 
>>> syslog record in a batch and then writing the syslog message along with the 
>>> syslog headers to the flowfile content.
>>> thanks
>>> Clay
>>>
>>> On Mon, Aug 5, 2019 at 12:12 PM Bryan Bende <[email protected]> wrote:
>>>>
>>>> Clay,
>>>>
>>>> You can only parse when it's 1 message per flow file because parsing
>>>> adds all the field/value pairs as flow file attributes, which wouldn't
>>>> really make sense when you have, say, 1k messages with all different
>>>> values for those fields.
>>>>
>>>> -Bryan
>>>>
>>>> On Mon, Aug 5, 2019 at 11:25 AM Clay Teahouse <[email protected]> 
>>>> wrote:
>>>> >
>>>> > Hi Edward, Bryan
>>>> > One more question regarding ListenSyslog. Is it possible to set batch 
>>>> > size > 1 with parse set to true? I am ingesting a very high volume of 
>>>> > syslog records and want to avoid flowfiles containing only one record 
>>>> > but at the same time, I want to be able to parse the records. Is there a 
>>>> > way around this?
>>>> >
>>>> > thanks
>>>> > Clay
>>>> >
>>>> > On Fri, Aug 2, 2019 at 8:50 AM Edward Armes <[email protected]> 
>>>> > wrote:
>>>> >>
>>>> >> Hi Clay,
>>>> >>
>>>> >> As Bryan said, the actual connections are managed by a selector, which 
>>>> >> cycles through the connections. Once a connection has data to receive, 
>>>> >> the selector hands it to a thread in the TCP receiving thread pool. That 
>>>> >> thread does some basic TCP processing and puts the data into a buffer, 
>>>> >> which the associated ListenSyslog processor instance consumes when the 
>>>> >> framework executes it.
>>>> >>
>>>> >> Just so you're aware, setting the maximum number of connections to 4,000 
>>>> >> does create a thread pool of 4,000 threads, but in reality these threads 
>>>> >> don't exist until the selector creates one to run on the pool. So, in 
>>>> >> short, unless a single NiFi server gets 4,000 syslog messages in a very 
>>>> >> short space of time (< 1 microsecond), I can't see it being an issue.
>>>> >>
>>>> >> Edward
>>>> >>
>>>> >> On Fri, Aug 2, 2019 at 2:06 PM Bryan Bende <[email protected]> wrote:
>>>> >>>
>>>> >>> The actual connections themselves are managed with a selector, so if
>>>> >>> all the connections are idle there should only be one thread for the
>>>> >>> socket.
>>>> >>>
>>>> >>> As soon as a connection has something available to read, a thread
>>>> >>> is spawned to read from the connection until either no more data is
>>>> >>> available or it is closed.
>>>> >>>
>>>> >>> On Fri, Aug 2, 2019 at 7:18 AM Clay Teahouse <[email protected]> 
>>>> >>> wrote:
>>>> >>> >
>>>> >>> > Hello Edward,
>>>> >>> > So, if I have to listen to 32,000 TCP connections and I have only 
>>>> >>> > 80 cores, and I configure each ListenSyslog instance for 4,000 
>>>> >>> > connections, doesn't each spawn 4,000 threads behind the scenes? The 
>>>> >>> > TCP connections will be idle most of the time.
>>>> >>> >
>>>> >>> > thanks
>>>> >>> > Clay
>>>> >>> >
>>>> >>> >
>>>> >>> > On Fri, Aug 2, 2019 at 6:10 AM Edward Armes <[email protected]> 
>>>> >>> > wrote:
>>>> >>> >>
>>>> >>> >> Hi Clay,
>>>> >>> >>
>>>> >>> >> Because NiFi uses a thread pool for its own threading underneath, 
>>>> >>> >> and each processor instance runs in its own thread, I don't see any 
>>>> >>> >> reason why not. One thing to note is that the ListenTCP processor 
>>>> >>> >> appears to have been written such that it gets all the requests 
>>>> >>> >> that have been received on that socket and processes them until 
>>>> >>> >> either it has no more requests left to process or that instance of 
>>>> >>> >> the processor is no longer scheduled to run.
>>>> >>> >>
>>>> >>> >> Hope that helps
>>>> >>> >>
>>>> >>> >> Edward
>>>> >>> >>
>>>> >>> >> On Fri, Aug 2, 2019 at 11:28 AM Clay Teahouse 
>>>> >>> >> <[email protected]> wrote:
>>>> >>> >>>
>>>> >>> >>> Hello All,
>>>> >>> >>>
>>>> >>> >>> I need to listen to and process thousands of persistent TCP 
>>>> >>> >>> connections. I have 10 nodes, each having 8 cores.
>>>> >>> >>> My understanding is that with existing NiFi listening processors, 
>>>> >>> >>> such as ListenSyslog, a thread is utilized for each TCP connection. 
>>>> >>> >>> Does this scale? Do I need to write a custom processor that 
>>>> >>> >>> utilizes a thread pool for reading the data from the socket and 
>>>> >>> >>> processing them?
>>>> >>> >>>
>>>> >>> >>> thanks
>>>> >>> >>> Clay
>>
>> --
>> Sent from Gmail Mobile
