Thank you very much, Jeff and Joe. Based on your discussion, I will consider each processor on its own merits, and in terms of what my priorities are for my workflow at that point, before making changes that trade off latency against throughput.
Jim

On Fri, Apr 7, 2017 at 2:40 PM, Joe Witt <[email protected]> wrote:
> Jeff,
>
> Not really, though - your comment is true, or at least it could be.
> What I mean is that the slider is really a way to let the user 'hint
> to the framework' their preference and how dedicated to that
> preference they are. We will continue to use that information to
> make under-the-covers tweaks to improve performance. What you
> mention is a very good example. What I mentioned is another good
> example. There will be more. But instead of giving the user tons of
> knobs to tune, we're trying to roll that up to "hey, all things
> being equal, if you could have lowest latency, or give up some
> milliseconds of latency and instead get higher throughput, which do
> you prefer?" That simple question gives us a lot of options to work
> with and tune under the covers.
>
> Thanks
> joe
>
> On Fri, Apr 7, 2017 at 2:37 PM, Jeff <[email protected]> wrote:
> > It looks like the way I think about it might be a bit off base. :)
> >
> > On Fri, Apr 7, 2017 at 2:31 PM Joe Witt <[email protected]> wrote:
> >>
> >> The concept of run duration there is one of the ways we allow
> >> users to hint to the framework what their preference is. In
> >> general all users want the thing to 'go fast'. But what 'fast'
> >> means for you is throughput, and what fast means for someone else
> >> is low latency.
> >>
> >> What this really means under the covers at this point is that for
> >> processors which are willing to delegate to the framework the
> >> responsibility of 'when to commit what they've done in a
> >> transactional sense', the framework can use that knowledge to
> >> automatically combine one or more transactions into a single
> >> transaction. This has the effect of trading off some very small
> >> latency for what is arguably higher throughput, because it means
> >> we can do a single write to our flowfile repository instead of
> >> many.
> >> This reduces the burden on various locks, the file
> >> system/interrupts, etc. It is in general just a bit more friendly
> >> and does indeed have the effect of higher throughput.
> >>
> >> Now, with regard to what the default value should be: we cannot
> >> really know whether one prefers, generically speaking, to have
> >> the system operate in a more latency-sensitive or more
> >> throughput-sensitive way. Further, it isn't really that tight of
> >> a relationship. Also, consider that a given NiFi cluster can have
> >> and handle flows from numerous teams and organizations at the
> >> same time, each with its own needs, interests, and preferences.
> >> So, we allow it to be selected.
> >>
> >> As to the question about some processors supporting it and some
> >> not: the reason for this is simply that sometimes the processor
> >> cannot and is not willing to let the framework choose when to
> >> commit the session. Why? Because it might have operations which
> >> are not 'side effect free', meaning once it has done something,
> >> the environment has been altered in ways that cannot be recovered
> >> from. Take for example a processor which sends data via SFTP.
> >> Once a given file is sent we cannot 'unsend' it, nor can we
> >> simply repeat that process without a side effect. By allowing the
> >> framework to handle it for the processor, the point is that the
> >> operation can be easily undone/redone within the confines of NiFi
> >> without having changed some external system's state. So, this is
> >> a really important thing to appreciate.
> >>
> >> Thanks
> >> Joe
> >>
> >> On Fri, Apr 7, 2017 at 2:18 PM, Jeff <[email protected]> wrote:
> >> > James,
> >> >
> >> > The way I look at it (abstractly speaking) is that the slider
> >> > represents how long a processor will be able to use a thread to
> >> > work on flowfiles (from its inbound queue, allowing onTrigger
> >> > to run more times to generate more outbound flowfiles, etc.).
> >> > Moving that slider towards higher throughput, the processor
> >> > will do more work, but will hog that thread for a longer period
> >> > of time before another processor can use it. So, overall
> >> > latency could go up, because flowfiles will sit in other queues
> >> > for possibly longer periods of time before another processor
> >> > gets a thread to start doing work, but that particular
> >> > processor will probably see higher throughput.
> >> >
> >> > That's in pretty general terms, though.
> >> >
> >> > On Fri, Apr 7, 2017 at 9:49 AM James McMahon
> >> > <[email protected]> wrote:
> >> >>
> >> >> I see that some processors provide a slider to set a balance
> >> >> between Latency and Throughput. Not all processors provide
> >> >> this, but some do. They seem to be inversely related.
> >> >>
> >> >> I also notice that the default appears to be lower latency,
> >> >> implying also lower throughput. Why is that the default? I
> >> >> would think that, this being a workflow, maximizing throughput
> >> >> would be the ultimate goal. Yet it seems that the processors
> >> >> opt for defaults of lowest latency, lowest throughput.
> >> >>
> >> >> What is the relationship between Latency and Throughput? Do
> >> >> most folks in the user group typically go in and change that
> >> >> to Highest on throughput? Is that something to avoid because
> >> >> of demands on CPU, RAM, and disk IO?
> >> >>
> >> >> Thanks very much. -Jim
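[Editor's note] The trade-off Joe describes — merging several per-flowfile commits into one repository write when a processor delegates commit timing to the framework (in NiFi, processors opt in via the @SupportsBatching annotation) — can be illustrated with a toy model. This is a standalone sketch of the batching idea, not NiFi's actual internals; the class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model: batching several flowfile updates into one (expensive)
// repository write trades a little commit latency for fewer writes.
class BatchingRepository {
    private final int batchSize;                       // updates merged per write
    private final List<String> pending = new ArrayList<>();
    private int repositoryWrites = 0;                  // count of actual writes

    BatchingRepository(int batchSize) { this.batchSize = batchSize; }

    // Record one flowfile update; flush once the batch is full.
    void commit(String update) {
        pending.add(update);
        if (pending.size() >= batchSize) flush();
    }

    // One repository write covers every pending update.
    void flush() {
        if (!pending.isEmpty()) {
            repositoryWrites++;
            pending.clear();
        }
    }

    int writes() { return repositoryWrites; }

    public static void main(String[] args) {
        // Latency-leaning: commit every update individually.
        BatchingRepository lowLatency = new BatchingRepository(1);
        // Throughput-leaning: merge up to 25 updates per write.
        BatchingRepository highThroughput = new BatchingRepository(25);

        for (int i = 0; i < 100; i++) {
            lowLatency.commit("flowfile-" + i);
            highThroughput.commit("flowfile-" + i);
        }
        lowLatency.flush();
        highThroughput.flush();

        System.out.println("low-latency writes:     " + lowLatency.writes());     // 100
        System.out.println("high-throughput writes: " + highThroughput.writes()); // 4
    }
}
```

With 100 updates, the latency-leaning setting performs 100 repository writes while the throughput-leaning setting performs only 4; each individual update in the batched case may wait slightly longer before it is durable, which is exactly the latency/throughput dial the slider exposes.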
