Re: RouteText questions (regex, grouping, performance)

Mark Payne Wed, 06 Jul 2016 08:32:09 -0700

Stephane,

Most of the Processors that you mention would require that you split 
your data up into one-line chunks.


When you indicate that the expression you would use is 
"${filename:contains('new'):and(filename:contains('2016'))}"
that looks like you are routing only on the attributes, not on the content of 
the text itself. If this is the case, you should
use RouteOnAttribute, as it will be much more efficient than RouteText. In 
general, though, that expression would be
much more efficient than using a regex to match against .*new.*2016.*
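
As a rough illustration of the two approaches (written in Python, not in NiFi
Expression Language; the function names are hypothetical and only model the
logic), the substring checks and the regex can be compared like this:

```python
import re


def matches_by_contains(filename: str) -> bool:
    """Two plain substring checks, analogous to the two :contains() calls."""
    return "new" in filename and "2016" in filename


# Compiled once and reused for every call.
NEW_THEN_2016 = re.compile(r".*new.*2016.*", re.DOTALL)


def matches_by_regex(filename: str) -> bool:
    """Whole-string match against .*new.*2016.*"""
    return NEW_THEN_2016.fullmatch(filename) is not None


if __name__ == "__main__":
    for name in ("new_report_2016.csv", "2016_new_report.csv", "old_2015.csv"):
        print(name, matches_by_contains(name), matches_by_regex(name))
```

Note that the two are not strictly equivalent: the regex requires 'new' to
appear before '2016', while the two contains checks accept either order, so
they disagree on a name such as 2016_new_report.csv.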

So I would certainly recommend using RouteOnAttribute and using the Expression 
Language to route based on attributes.
You can also just add two different properties:

containsNew = ${filename:contains('new')}
is2016 = ${filename:contains('2016')}

And then set the Routing Strategy to Route to 'matched' if all match. This will 
help make the processor's configuration easier
to understand if you look at it again in the future.

Ingesting 1000 packets per second should not be a problem at all on a single 
node. Some things to consider:

- Ideally, you would have a separate disk for your content repo, your flowfile 
repo, and your prov repo.

- You may want to change the log level to WARN for processors (by adding to 
your conf/logback.xml <logger name="org.apache.nifi.processors" level="WARN" />)
  This may or may not make a difference, depending on how resource constrained 
your disks are.

- Making the change above to use RouteOnAttribute will certainly help alleviate 
pressure on both your CPU and your disk.

- If you don't have enough disks to separate out each of your repositories, 
I would recommend at least putting the prov repo on its own disk.

- If you do have enough disks, you can stripe the content repo and your prov 
repo across multiple disks to scale vertically, and you'll
  see much better performance this way.
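
For the repository suggestions above, a minimal sketch of the relevant
conf/nifi.properties entries might look like the following. The /diskN mount
points are assumptions made up for illustration, and the suffixes after
"directory." (content1, provenance1, ...) are arbitrary labels; adjust both to
your own layout.

```properties
# FlowFile repository on its own disk (hypothetical mount point)
nifi.flowfile.repository.directory=/disk1/flowfile_repository

# Content repository spread over two disks; these entries replace the
# single nifi.content.repository.directory.default entry
nifi.content.repository.directory.content1=/disk2/content_repository
nifi.content.repository.directory.content2=/disk3/content_repository

# Provenance repository spread over two disks; these entries replace the
# single nifi.provenance.repository.directory.default entry
nifi.provenance.repository.directory.provenance1=/disk4/provenance_repository
nifi.provenance.repository.directory.provenance2=/disk5/provenance_repository
```

If only one extra disk is available, pointing just the provenance repository
directory at it is the smaller change.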


Thanks
-Mark


> On Jul 3, 2016, at 8:27 PM, Stéphane Maarek wrote:
> 
> Hi Mark,
> 
> 1. I send FlowFiles coming from a ListenUDP, with a batch size of 100. So most 
> of the time, the FlowFiles are multiple lines long. Yet, after RouteText, 
> I get as many FlowFiles as lines, regardless of the grouping parameter. 
> Is that expected?
> 
> 2. I have opened a JIRA: https://issues.apache.org/jira/browse/NIFI-2169
> 
> I have a few questions:
> Regarding the fact that it's better to operate on text that has many lines, 
> and if I manage to get RouteText to output many lines:
> a) Can ExtractText, ReplaceText, PutMongo, ConvertJSONtoSQL, PutSQL operate 
> on each individual line within a FlowFile? (that's basically all the 
> components in my flow)
> b) Is the Satisfies Expression 
> ${filename:contains('new'):and(filename:contains('2016'))} going to perform 
> better than the regex .*new.*2016.* ?
> c) I have a lot of data coming in (1000 UDP packets a second), and yes, the 
> provenance database has been cramming because we have 6 processors dealing 
> with this flow before the data exits NiFi. Are there any optimizations I could 
> apply out of the box?
> 
> Thanks,
> Stephane
> 
> On Fri, Jul 1, 2016 at 10:48 PM Mark Payne wrote:
> Hi Stephane,
> 
> For #1, when you say that you get as many outputs as lines of text, are you 
> sending in FlowFiles that are only
> one line of text each? The Processor does not aggregate multiple FlowFiles 
> together, so if you are sending in
> 1-line FlowFiles, it can only route that FlowFile in 1-line outputs.
> 
> Re #2: The regular expression is compiled every time. This is done, though, 
> because the Regex allows the Expression
> Language to be used, so the Regex could actually be different for each 
> FlowFile. That being said, it could certainly be
> improved by either (a) pre-compiling in the case that no Expression Language 
> is used and/or (b) caching up to, say, 10 
> regexes once they are compiled. Do you mind filing a JIRA to improve the 
> efficiency of this processor?
> 
> Also, when you say that the processor is having trouble keeping up with a 
> batch size of 1, there are a few thoughts that
> come to mind:
> 
> * How many concurrent tasks do you have assigned to the processor? Have you 
> tried increasing it?
> * When processing text in NiFi it is generally going to be much more 
> efficient to process a single FlowFile with many lines,
> instead of many small FlowFiles, due to the expense of the Data Provenance 
> that has to be generated. There are some things
> that we can do to improve efficiency of the data provenance as well, but 
> those improvements have generally been made
> 'high' priority rather than 'extremely high priority' :) so I would expect to 
> see them coming out possibly toward the end of this year,
> after 1.0 and a few other major features come out.
> * Rather than using a Regular Expression, the "Satisfies Expression" Matching 
> Strategy is likely to be more efficient in many cases
> if it is able to provide the routing logic that you need. It also tends to be 
> easier to read than regular expressions, which is nice when
> you (or someone else) goes back later to modify the flow.
> 
> Please let me know if anything here doesn't make sense or if you have any 
> more questions.
> 
> Thanks!
> -Mark
> 
> 
> > On Jun 30, 2016, at 9:04 PM, Stéphane Maarek wrote:
> >
> > Hi,
> >
> > I have a question regarding RouteText. The processor works just fine for me 
> > but maybe I'm missing a couple subtleties:
> >
> > 1) I have a regex to group data by (a pair of IDs), but what do I use the 
> > grouping attribute for? I still get as many outputs as lines
> > 2) My data is coming from a ListenUDP. If my batch size is 1, RouteText is 
> > having a lot of trouble processing all the data. I would guess that it 
> > compiles the regex every time it is executed; is that correct? When I increase 
> > the batch size to 100, RouteText processes everything well. I was wondering 
> > if there could be some sort of optimization in RouteText to keep the 
> > regex compiled regardless of the state of the processor?
> >
> >
> > Thanks a lot!
> > Stephane
>
