Hi Mark,

1. The FlowFiles come from a ListenUDP with a batch size of 100, so most
of the time they are multiple lines long. Yet, after RouteText, I get as
many FlowFiles as lines, regardless of the grouping parameter. Is that
expected?

2. I have opened a JIRA: https://issues.apache.org/jira/browse/NIFI-2169

I have a few questions.
Given that it's better to operate on text that has many lines, and
assuming I manage to get RouteText to output many lines:
a) Can ExtractText, ReplaceText, PutMongo, ConvertJSONtoSQL, PutSQL
operate on each individual line within a FlowFile? (That's basically all
the components in my flow.)
b) Is Satisfies Expression with
${filename:contains('new'):and(${filename:contains('2016')})} going to
perform better than the regex .*new.*2016.* ?
c) I have a lot of data coming in (1000 UDP packets a second), and yes, the
provenance database has been getting overloaded because we have 6 processors
handling this flow before the data exits NiFi. Are there any optimizations I
could apply out of the box?
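
To make b) concrete, here is a minimal plain-Java sketch of the two rules as
I understand them. It is not NiFi code: RouteRuleSketch, viaContains and
viaRegex are names I made up for illustration, and it says nothing about the
actual cost inside RouteText. It only shows that the two rules are not
strictly equivalent, since the regex requires 'new' to appear before '2016'.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not NiFi code: plain-Java stand-ins for the two rules in b).
class RouteRuleSketch {
    // Compiled once and reused, rather than once per evaluation.
    static final Pattern NEW_THEN_2016 = Pattern.compile(".*new.*2016.*");

    // Stand-in for the contains/and expression: order of the two substrings is irrelevant.
    static boolean viaContains(String filename) {
        return filename.contains("new") && filename.contains("2016");
    }

    // Stand-in for the regex rule: matches() requires the whole string to match,
    // so "new" must occur somewhere before "2016".
    static boolean viaRegex(String filename) {
        return NEW_THEN_2016.matcher(filename).matches();
    }

    public static void main(String[] args) {
        String[] names = {"new_2016.log", "2016_new.log", "old_2016.log"};
        for (String name : names) {
            System.out.println(name + " contains=" + viaContains(name)
                    + " regex=" + viaRegex(name));
        }
    }
}
```

With these stand-ins, "2016_new.log" passes the contains rule but not the
regex, so whichever option is faster, I would need to pick the one whose
semantics I actually want.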

Thanks,
Stephane

On Fri, Jul 1, 2016 at 10:48 PM Mark Payne <[email protected]> wrote:

> Hi Stephane,
>
> For #1, when you say that you get as many outputs as lines of text, are you
> sending in FlowFiles that are only
> one line of text each? The Processor does not aggregate multiple FlowFiles
> together, so if you are sending in
> 1-line FlowFiles, it can only route that FlowFile in 1-line outputs.
>
> Re #2: The regular expression is compiled every time. This is done,
> though, because the Regex allows the Expression
> Language to be used, so the Regex could actually be different for each
> FlowFile. That being said, it could certainly be
> improved by either (a) pre-compiling in the case that no Expression
> Language is used and/or (b) cache up to say 10
> Regex'es once they are compiled. Do you mind filing a JIRA to improve the
> efficiency of this processor?
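
To check that I understand option (b), here is a minimal sketch of a small
cache of compiled patterns. This is not RouteText's actual code: the class
name CompiledRegexCache and its methods are hypothetical, and the eviction
policy (least recently used) is my own assumption. The limit is passed in,
so 10 would match the number suggested above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch, not NiFi code: a bounded cache of compiled regexes.
class CompiledRegexCache {
    private final Map<String, Pattern> cache;

    CompiledRegexCache(final int maxEntries) {
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<String, Pattern>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Pattern> eldest) {
                // Evict the least recently used pattern once the limit is exceeded.
                return size() > maxEntries;
            }
        };
    }

    // synchronized because several concurrent tasks could share one cache.
    synchronized Pattern get(String regex) {
        // Compile only on a cache miss; otherwise reuse the stored Pattern.
        return cache.computeIfAbsent(regex, Pattern::compile);
    }

    synchronized int size() {
        return cache.size();
    }
}
```

The regex string here would be the value after Expression Language
evaluation, so FlowFiles that resolve to the same regex would share one
compiled Pattern.
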
>
> Also, when you say that the processor is having trouble keeping up with a
> batch size of 1, there are a few thoughts that
> come to mind:
>
> * How many concurrent tasks do you have assigned to the processor? Have
> you tried increasing it?
> * When processing text in NiFi it is generally going to be much more
> efficient to process a single FlowFile with many lines,
> instead of many small FlowFiles, due to the expense of the Data Provenance
> that has to be generated. There are some things
> that we can do to improve efficiency of the data provenance as well, but
> those improvements have generally been made
> 'high' priority rather than 'extremely high priority' :) so I would expect
> to see them coming out possibly toward the end of this year,
> after 1.0 and a few other major features come out.
> * Rather than using a Regular Expression, the "Satisfies Expression"
> Matching Strategy is likely to be more efficient in many cases
> if it is able to provide the routing logic that you need. It also tends to
> be easier to read than regular expressions, which is nice when
> you (or someone else) go back later to modify the flow.
>
> Please let me know if anything here doesn't make sense or if you have any
> more questions.
>
> Thanks!
> -Mark
>
>
> > On Jun 30, 2016, at 9:04 PM, Stéphane Maarek <[email protected]>
> wrote:
> >
> > Hi,
> >
> > I have a question regarding RouteText. The processor works just fine for
> me but maybe I'm missing a couple subtleties:
> >
> > 1) I have a regex to group data by (a pair of IDs), but what do I use
> the grouping attribute for? I still get as many outputs as lines
> > 2) My data is coming from a ListenUDP. If my batch size is 1, RouteText
is having a lot of trouble processing all the data. I would guess that it
compiles the regex every time it is executed, is that correct? When I increase
the batch size to 100, RouteText processes everything well. I was wondering
if there could be some sort of optimization in RouteText to keep the
regex compiled regardless of the state of the processor?
> >
> >
> > Thanks a lot!
> > Stephane
>
>
