RE: NiFi Performance Analysis Clarification

2018-06-13 Thread V, Prashanth (Nokia - IN/Bangalore)
I am updating and adding a few fields in the CSV; hence I used UpdateRecord. Thanks & Regards, Prashanth

Re: NiFi Performance Analysis Clarification

2018-06-13 Thread Mark Payne
Prashanth, Also of note, are you actually updating any fields in the CSV that you receive with UpdateRecord / your custom processor? Or are you just using that to convert the CSV to Avro? If the latter, you can actually just remove this processor from your flow entirely and simply use

Re: NiFi Performance Analysis Clarification

2018-06-13 Thread Mark Payne
Prashanth, "will it will it spread out the stop-the-world time across the intervals. In that case, my average would fall to same figures right? It's hard to say - you'd have to give it a try and see if it improves. There are a lot of different optimizations, both at the JVM and the Operating

RE: NiFi Performance Analysis Clarification

2018-06-13 Thread V, Prashanth (Nokia - IN/Bangalore)
Mark, Thanks for the reply. Please find the comments inline. Thanks & Regards, Prashanth

RE: NiFi Performance Analysis Clarification

2018-06-13 Thread V, Prashanth (Nokia - IN/Bangalore)
Joe, Thanks for the reply. Please find the answers inline. Thanks & Regards, Prashanth

Re: Fun with DistributeLoad

2018-06-13 Thread Martijn Dekkers
Mark, That sounds great, thanks!

Re: Fun with DistributeLoad

2018-06-13 Thread Mark Payne
Martijn, "As an aside, does DistributeLoad use backpressure to know what processor is / is not available?" - It depends on the value that you set for the Processor's "Distribution Strategy." The default is Round Robin, which means that if any of the connections applies Back Pressure, then

Re: Fun with DistributeLoad

2018-06-13 Thread Kevin Doran
Thanks for the additional details. It sounds like you have already explored alternatives quite a bit and have found the best path. :) Looks like Mark has some good advice for making this flow manageable, so if this is working for you, I'd take his suggestions where they make sense and run with

Re: Fun with DistributeLoad

2018-06-13 Thread Martijn Dekkers
Hi Mark! > Typically when I come across a set of processors like this, I go with an approach like https://imgur.com/a/3Zh3FeN So we have a DistributeLoad going to one of 24 different PutS3Object processors. Each processor's 'failure' relationship is then routed to a funnel, and that funnel

Re: Fun with DistributeLoad

2018-06-13 Thread Mark Payne
Martijn, Typically when I come across a set of processors like this, I go with an approach like https://imgur.com/a/3Zh3FeN So we have a DistributeLoad going to one of 24 different PutS3Object processors. Each processor's 'failure' relationship is then routed to a funnel, and that funnel just

Re: Fun with DistributeLoad

2018-06-13 Thread Martijn Dekkers
Hi Kevin! Thanks for your reply. > Can you share more about the details of what your DistributeLoad process group is doing and how the 24 endpoints of the particular S3-compatible storage service work? Are they fixed or could they change? Just hoping to understand the constraints

Re: Fun with DistributeLoad

2018-06-13 Thread Kevin Doran
Hi Martijn, Can you share more about the details of what your DistributeLoad process group is doing and how the 24 endpoints of the particular S3-compatible storage service work? Are they fixed or could they change? Just hoping to understand the constraints you have to work within.

RE: NiFi Performance Analysis Clarification

2018-06-13 Thread V, Prashanth (Nokia - IN/Bangalore)
Hi Jeremy, With the built-in processor (UpdateRecord) and the CsvReader & AvroRecordSetWriter controller services, I can send an average of ~50MBps to Kafka. With a custom processor I created for my business logic, doing the Avro conversion internally (not using controller services), I can push an average of ~80Mbps.
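
As a rough sanity check on these figures (assuming both numbers are MB/s, and using the ~100 GB input folder mentioned elsewhere in the thread), the end-to-end difference works out to roughly:

    100 GB ≈ 100,000 MB
    at ~50 MB/s: 100,000 / 50 ≈ 2,000 s ≈ 33 minutes
    at ~80 MB/s: 100,000 / 80 ≈ 1,250 s ≈ 21 minutes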

Re: NiFi Performance Analysis Clarification

2018-06-13 Thread Mark Payne
Prashanth, Whenever the FlowFile Repository performs a Checkpoint, it has to ensure that it has flushed all data to disk before continuing, so it performs an fsync() call so that any data buffered by the Operating System is flushed to disk as well. If you're using the same physical drive /
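
For anyone wanting to check the drive layout Mark describes, the repository locations live in nifi.properties; a minimal sketch with the illustrative default paths (verify the exact values in your own install):

    # nifi.properties - repository locations (illustrative defaults)
    nifi.flowfile.repository.directory=./flowfile_repository
    nifi.content.repository.directory.default=./content_repository
    nifi.provenance.repository.directory.default=./provenance_repository

Putting the FlowFile repository on a different physical drive from the content and provenance repositories is one way to keep its fsync() from competing with the other repositories' I/O.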

Re: NiFi Performance Analysis Clarification

2018-06-13 Thread Joe Witt
Prashanth, I strongly recommend you reduce your JVM heap size for NiFi to 2 or 4 GB, and no more than 8 GB. The flow, well configured, will certainly not need anywhere near that much, and the more RAM you give it the more work GC has to do (some GCs are different and can be tuned/etc.. but ...that is
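
For reference, the heap is configured in NiFi's conf/bootstrap.conf; a minimal sketch of a 2 GB heap along the lines Joe suggests (the java.arg indices shown match the stock file, but confirm against your own copy):

    # conf/bootstrap.conf - JVM memory settings (example values)
    java.arg.2=-Xms2g
    java.arg.3=-Xmx2g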

Re: NiFi Performance Analysis Clarification

2018-06-13 Thread Jeremy Dyer
Prashanth - just out of curiosity, could you share the average size of those Avro files you are pushing to Kafka? It would be nice to know for some other benchmark tests I am doing. Thanks, Jeremy Dyer

Re: NiFi Performance Analysis Clarification

2018-06-13 Thread Mike Thomsen
Relevant: http://www.idata.co.il/2016/09/moving-binary-data-with-kafka/ If you're throwing 1MB and bigger files at Kafka, that's probably where your slowdown is occurring. Particularly if you're running a single node or just two nodes. Kafka was designed to process extremely high volumes of small
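
If records around or above 1 MB are unavoidable, these are the standard Kafka size limits involved; the values below are illustrative only (the defaults are roughly 1 MB), not a recommendation:

    # Broker (server.properties)
    message.max.bytes=2097152
    replica.fetch.max.bytes=2097152
    # Producer
    max.request.size=2097152
    # Consumer
    max.partition.fetch.bytes=2097152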

RE: NiFi Performance Analysis Clarification

2018-06-13 Thread V, Prashanth (Nokia - IN/Bangalore)
Please find the answers inline. Thanks & Regards, Prashanth > What's the version of NiFi you're using? 1.6.0 > What are

Re: NiFi Performance Analysis Clarification

2018-06-13 Thread Pierre Villard
Hi, What's the version of NiFi you're using? What are the file systems you're using for the repositories? I think that changing the heap won't make any difference in this case. I'd keep it to something like 8GB (unless you're doing very specific stuff that is memory-consuming) and let the

RE: NiFi Performance Analysis Clarification

2018-06-13 Thread V, Prashanth (Nokia - IN/Bangalore)
Hi Mike, I am retrieving many small CSV files, each of size 1MB (total folder size around ~100GB). In the update step, I am doing some enrichment on the ingress CSV. Anyway, my flow itself doesn't have anything to do with the stop-the-world time, right? Can you please tell me about FlowFile checkpointing related

Re: NiFi Performance Analysis Clarification

2018-06-13 Thread Mike Thomsen
What are you retrieving (particularly size) and what happens in the "update" step? Thanks, Mike

RE: NiFi Performance Analysis Clarification

2018-06-13 Thread V, Prashanth (Nokia - IN/Bangalore)
Hi Team, I am doing some performance testing in NiFi. The workflow is GetSFTP -> update -> PutKafka. I want to tune my setup to achieve high throughput without much queuing, but my throughput average drops while the FlowFile repository is checkpointing. I believe stop-the-world is happening during that
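
For readers following the checkpointing discussion, the cadence of FlowFile repository checkpoints is set in nifi.properties; a minimal sketch (values shown are the usual 1.x defaults, but check your own file):

    # nifi.properties - FlowFile repository checkpointing
    nifi.flowfile.repository.checkpoint.interval=2 mins
    nifi.flowfile.repository.always.sync=false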

Re: Fun with DistributeLoad

2018-06-13 Thread Martijn Dekkers
See https://imgur.com/c4MDP7o - the success routes are not yet in place; each PutS3Object needs to be routed to a success-handling set of processors. Thanks, Martijn

Re: Fun with DistributeLoad

2018-06-13 Thread Sivaprasanna
Is it possible to share screenshots of the flow that feels cluttered? I have a hard time picturing how the PutS3Object processors are routed to failures and successes. A picture would certainly help. Thanks.

Re: Fun with DistributeLoad

2018-06-13 Thread Martijn Dekkers
Thanks, I already use process groups specifically for the PutS3Object processors. However, with 24 of those, all needing a failure and a success connection, the screen is very cluttered. Thanks, Martijn

Re: Fun with DistributeLoad

2018-06-13 Thread Sivaprasanna
Martijn, One clean-up approach that comes immediately to mind is to use 'Process Groups', which let you group together processors that perform a related sequence of actions. You can think of them as 'functions' or 'methods' in programming terms. And since you mentioned that you are

Fun with DistributeLoad

2018-06-13 Thread Martijn Dekkers
All, I have a more general question. We will be uploading files to an S3-compatible storage system. In our case, this system presents 24 endpoints to upload to. Given the volume of data we are sending to this device, we want to avoid using a load balancer like HAProxy for some use cases, to avoid