Hi Mark,

1. My FlowFiles come from a ListenUDP with a batch size of 100, so most of the time each FlowFile is multiple lines long. Yet after RouteText I get as many FlowFiles as lines, regardless of the grouping parameter. Is that expected?
2. I have opened a JIRA: https://issues.apache.org/jira/browse/NIFI-2169

I have a few questions about operating on FlowFiles that contain many lines, assuming I manage to get RouteText to output multi-line FlowFiles:

a) Can ExtractText, ReplaceText, PutMongo, ConvertJSONtoSQL, and PutSQL operate on each individual line within a FlowFile? (Those are basically all the components in my flow.)
b) Is the Satisfies Expression ${filename:contains('new'):and(filename:contains('2016'))} going to perform better than the regex .*new.*2016.* ?
c) I have a lot of data coming in (1000 UDP packets a second), and yes, the provenance database has been getting crammed, because we have 6 processors handling this flow before the data exits NiFi. Are there any optimizations I could apply out of the box?

Thanks,
Stephane

On Fri, Jul 1, 2016 at 10:48 PM Mark Payne <[email protected]> wrote:

> Hi Stephane,
>
> For #1, when you say that you get as many outputs as lines of text, are you sending in FlowFiles that are only one line of text each? The Processor does not aggregate multiple FlowFiles together, so if you are sending in 1-line FlowFiles, it can only route that FlowFile in 1-line outputs.
>
> Re #2: The regular expression is compiled every time. This is done, though, because the Regex allows the Expression Language to be used, so the Regex could actually be different for each FlowFile. That being said, it could certainly be improved by either (a) pre-compiling in the case that no Expression Language is used and/or (b) caching up to, say, 10 Regexes once they are compiled. Do you mind filing a JIRA to improve the efficiency of this processor?
>
> Also, when you say that the processor is having trouble keeping up with a batch size of 1, there are a few thoughts that come to mind:
>
> * How many concurrent tasks do you have assigned to the processor? Have you tried increasing it?
> * When processing text in NiFi it is generally going to be much more efficient to process a single FlowFile with many lines, instead of many small FlowFiles, due to the expense of the Data Provenance that has to be generated. There are some things that we can do to improve efficiency of the data provenance as well, but those improvements have generally been made 'high' priority rather than 'extremely high priority' :) so I would expect to see them coming out possibly toward the end of this year, after 1.0 and a few other major features come out.
> * Rather than using a Regular Expression, the "Satisfies Expression" Matching Strategy is likely to be more efficient in many cases if it is able to provide the routing logic that you need. It also tends to be easier to read than regular expressions, which is nice when you (or someone else) goes back later to modify the flow.
>
> Please let me know if anything here doesn't make sense or if you have any more questions.
>
> Thanks!
> -Mark
>
> On Jun 30, 2016, at 9:04 PM, Stéphane Maarek <[email protected]> wrote:
> >
> > Hi,
> >
> > I have a question regarding RouteText. The processor works just fine for me, but maybe I'm missing a couple of subtleties:
> >
> > 1) I have a regex to group data by (a pair of IDs), but what do I use the grouping attribute for? I still get as many outputs as lines.
> > 2) My data is coming from a ListenUDP. If my batch size is 1, RouteText has a lot of trouble processing all the data. I would guess that it compiles the regex every time it is executed, is that correct? When I increase the batch size to 100, RouteText processes everything well. I was wondering if there could be some sort of optimization in RouteText to keep the regex compiled regardless of the state of the processor?
> >
> > Thanks a lot!
> > Stephane
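
For reference, here is a rough sketch of option (b) from Mark's reply quoted above: keeping a handful of compiled regexes around instead of recompiling for every FlowFile. This is plain Java, not NiFi code. `PatternCache` and its methods are hypothetical names made up for illustration, and this is not how RouteText is actually implemented. The cap of 10 is simply the figure mentioned in the reply.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of NiFi): a small LRU cache of compiled patterns. */
final class PatternCache {
    private final int maxEntries;
    private final Map<String, Pattern> cache;

    PatternCache(final int maxEntries) {
        this.maxEntries = maxEntries;
        // accessOrder=true keeps entries ordered from least to most recently used,
        // so removeEldestEntry evicts the least recently used regex.
        this.cache = new LinkedHashMap<String, Pattern>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(final Map.Entry<String, Pattern> eldest) {
                return size() > PatternCache.this.maxEntries;
            }
        };
    }

    /** Returns a compiled Pattern, compiling only when the regex is not cached yet. */
    synchronized Pattern get(final String regex) {
        Pattern pattern = cache.get(regex);
        if (pattern == null) {
            pattern = Pattern.compile(regex);
            cache.put(regex, pattern);
        }
        return pattern;
    }

    /** Number of patterns currently cached. */
    synchronized int size() {
        return cache.size();
    }
}
```

With something like `new PatternCache(10)`, the regex string produced after Expression Language evaluation would be the cache key, so a regex that is the same for every FlowFile would be compiled once, while regexes that differ per FlowFile would still work and simply rotate through the cache.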
