Re: RouteText questions (regex, grouping, performance)

Mark Payne Thu, 07 Jul 2016 05:37:07 -0700

Stephane,

Excellent. In this case, it may require a bit of experimentation, but I would
think that #1 would perform better in most cases. #2 would have to read the
data only once but would require a lot of CPU to evaluate that regex
(evaluating .* in a regex is super expensive). SplitText would have to read the
data again to scan for new-line characters, but if your system has a reasonable
amount of RAM, chances are that the data will still be in your operating
system's disk cache anyway, so you will not actually end up reading the content
from disk for SplitText. So I think SplitText will yield better performance
for you.
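
As a rough illustration of the two options (plain Python rather than NiFi code; the function names are made up for this sketch, and it only models the behaviour, not how RouteText or SplitText are implemented): when every line is unique, grouping on a captured (.*) ends up with one group per line, which is the same set of pieces a plain newline split produces, except that each line goes through the regex engine first.

```python
import re

def split_lines(text):
    """Option 1: a plain scan for line boundaries, no regex involved."""
    return text.splitlines()

def group_by_capture(text, grouping_regex=r"(.*)"):
    """Option 2: bucket lines by whatever the grouping regex captures."""
    pattern = re.compile(grouping_regex)
    groups = {}
    for line in text.splitlines():
        match = pattern.fullmatch(line)
        # Lines with identical captures share a bucket; non-matching lines share None.
        key = match.groups() if match else None
        groups.setdefault(key, []).append(line)
    return list(groups.values())

sample = "id=1 a\nid=2 b\nid=3 c"
print(split_lines(sample))
print(group_by_capture(sample))
```

With unique lines both calls yield three one-line pieces, so the difference is only in how much work is done per line.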


Thanks
-Mark


> On Jul 6, 2016, at 9:17 PM, Stéphane Maarek <[email protected]> wrote:
> 
> Hi Mark,
> 
> Thanks a lot for the insights. I'm using RouteText because I needed the 
> ${line} attribute. I've separated my disks and added the logging you 
> recommended. 
> Final question, which I guess is a small optimization:
> Is it better to use
> 1) RouteText with an empty grouping field, followed by a line-splitting 
> processor such as SplitText, OR
> 2) RouteText with the grouping field set to (.*), so that, as my lines are 
> unique, they come out already split?
> 
> Thanks!
> Stephane
> 
> On Thu, Jul 7, 2016 at 1:31 AM Mark Payne <[email protected]> wrote:
> Stephane,
> 
> Most of the Processors that you mention there would require that you split 
> your data up into one-line chunks.
> 
> When you indicate that the expression you would use is 
> "${filename:contains('new'):and(filename:contains('2016'))}"
> that looks like you are routing only on the attributes, not on the content of 
> the text itself. If this is the case, you should
> use RouteOnAttribute, as it will be much more efficient than RouteText. In 
> general, though, that expression would be
> much more efficient than using a regex to match against .*new.*2016.*
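
A small sketch of the comparison in plain Python (not the NiFi Expression Language or Java's regex engine; the function names are invented for illustration). It also shows that the two checks are not strictly equivalent: the regex requires 'new' to appear before '2016', while the two substring tests do not care about order.

```python
import re

# Compiled once up front; the pattern is the one quoted in the thread.
PATTERN = re.compile(r".*new.*2016.*")

def matches_regex(filename):
    """Whole-string regex match: 'new' must be followed later by '2016'."""
    return PATTERN.fullmatch(filename) is not None

def matches_contains(filename):
    """Two plain substring scans, independent of order."""
    return "new" in filename and "2016" in filename

for name in ("new_data_2016.csv", "2016_new.csv", "old_2015.csv"):
    print(name, matches_regex(name), matches_contains(name))
```

If the ordering matters for the routing decision, the substring version would need an extra position check.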
> 
> So I would certainly recommend using RouteOnAttribute and using the 
> Expression Language to route based on attributes.
> You can also just add two different properties:
> 
> containsNew = ${filename:contains('new')}
> is2016 = ${filename:contains('2016')}
> 
> And then set the Routing Strategy to "Route to 'matched' if all match". This 
> will help make the processor's configuration easier
> to understand if you look at it again in the future.
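
The idea of several named rules that must all hold can be sketched like this in plain Python. This only models the routing decision; the rule table, the route function, and the 'matched'/'unmatched' labels are assumptions made for the sketch, not NiFi code.

```python
# One named predicate per property, mirroring containsNew and is2016 above.
RULES = {
    "containsNew": lambda attrs: "new" in attrs.get("filename", ""),
    "is2016": lambda attrs: "2016" in attrs.get("filename", ""),
}

def route(attributes, rules=RULES):
    """Return 'matched' only if every named rule holds for the attributes."""
    if all(rule(attributes) for rule in rules.values()):
        return "matched"
    return "unmatched"

print(route({"filename": "new_events_2016.log"}))
print(route({"filename": "new_events_2015.log"}))
```

Keeping each condition under its own name is what makes the configuration easy to read later: a rule can be added or dropped without touching the others.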
> 
> Ingesting 1000 packets per second should not be a problem at all on a single 
> node. Some things to consider:
> 
> - Ideally, you would have a separate disk for your content repo, your 
> flowfile repo, and your prov repo.
> 
> - You may want to change the log level to WARN for processors by adding this 
> to your conf/logback.xml:
>     <logger name="org.apache.nifi.processors" level="WARN" />
>   This may or may not make a difference, depending on how resource 
> constrained your disks are.
> 
> - Making the change above to use RouteOnAttribute will certainly help 
> alleviate pressure on both your CPU and your disk.
> 
> - If you don't have enough disks to separate out each of your repositories, 
> I would recommend at least putting the prov repo on its own disk.
> 
> - If you do have enough disks, you can stripe the content repo and your prov 
> repo across multiple disks to scale vertically, and you'll
>   see much better performance this way.
> 
> 
> Thanks
> -Mark
> 
> 
>> On Jul 3, 2016, at 8:27 PM, Stéphane Maarek <[email protected]> wrote:
>> 
>> Hi Mark,
>> 
>> 1. My FlowFiles come in through a ListenUDP with a batch size of 100, so 
>> most of the time the FlowFiles are multiple lines long. Yet, after the 
>> RouteText, I get as many FlowFiles as lines, regardless of the grouping 
>> parameter. Is that expected?
>> 
>> 2. I have opened a JIRA: https://issues.apache.org/jira/browse/NIFI-2169
>> 
>> I have a few questions:
>> Regarding the fact that it's better to operate on text that has many lines, 
>> and assuming I manage to get RouteText to output many lines:
>> a) Can ExtractText, ReplaceText, PutMongo, ConvertJSONToSQL, PutSQL operate 
>> on each individual line within a flowfile? (that's basically all the 
>> components in my flow)
>> b) Is the Satisfies Expression strategy with 
>> ${filename:contains('new'):and(filename:contains('2016'))} going to perform 
>> better than the regex .*new.*2016.* ?
>> c) I have a lot of data coming in (1000 UDP packets a second), and yes, the 
>> provenance database has been cramming because we have 6 processors dealing 
>> with this flow before the data exits NiFi. Are there any optimizations I 
>> could apply out of the box?
>> 
>> Thanks,
>> Stephane
>> 
>> On Fri, Jul 1, 2016 at 10:48 PM Mark Payne <[email protected]> wrote:
>> Hi Stephane,
>> 
>> For #1, when you say that you get as many outputs as lines of text, are you 
>> sending in FlowFiles that are only
>> one line of text each? The Processor does not aggregate multiple FlowFiles 
>> together, so if you are sending in
>> 1-line FlowFiles, it can only route that FlowFile in 1-line outputs.
>> 
>> Re #2: The regular expression is compiled every time. This is done, though, 
>> because the Regex allows the Expression
>> Language to be used, so the Regex could actually be different for each 
>> FlowFile. That being said, it could certainly be
>> improved by either (a) pre-compiling in the case that no Expression Language 
>> is used and/or (b) caching up to, say, 10
>> regexes once they are compiled. Do you mind filing a JIRA to improve the 
>> efficiency of this processor?
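
Idea (b) can be sketched with a small bounded cache. This is plain Python with an invented helper name, shown only to illustrate the approach; it is not the code used by the processor, and Python's own re module already keeps an internal cache of its own.

```python
import re
from functools import lru_cache

@lru_cache(maxsize=10)
def compiled(pattern):
    """Compile a regex once and reuse it while it stays among the 10 most recent."""
    return re.compile(pattern)

# The pattern text may differ per item (as with Expression Language),
# but repeated patterns hit the cache instead of being recompiled.
for pattern, line in [(r"id=\d+", "id=42 ok"), (r"id=\d+", "id=7 ok"), (r"err", "no match")]:
    print(pattern, bool(compiled(pattern).search(line)))

print(compiled.cache_info())
```

Bounding the cache keeps memory predictable when the pattern really is different for every item, at the cost of recompiling patterns that have been evicted.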
>> 
>> Also, when you say that the processor is having trouble keeping up with a 
>> batch size of 1, there are a few thoughts that
>> come to mind:
>> 
>> * How many concurrent tasks do you have assigned to the processor? Have you 
>> tried increasing it?
>> * When processing text in NiFi it is generally going to be much more 
>> efficient to process a single FlowFile with many lines,
>> instead of many small FlowFiles, due to the expense of the Data Provenance 
>> that has to be generated. There are some things
>> that we can do to improve efficiency of the data provenance as well, but 
>> those improvements have generally been made
>> 'high' priority rather than 'extremely high priority' :) so I would expect 
>> to see them coming out possibly toward the end of this year,
>> after 1.0 and a few other major features come out.
>> * Rather than using a Regular Expression, the "Satisfies Expression" 
>> Matching Strategy is likely to be more efficient in many cases
>> if it is able to provide the routing logic that you need. It also tends to 
>> be easier to read than regular expressions, which is nice when
>> you (or someone else) go back later to modify the flow.
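
To put rough numbers on the provenance point, here is a back-of-the-envelope calculation using figures mentioned elsewhere in this thread (1000 UDP packets a second, 6 processors, batch sizes of 1 and 100). It assumes, purely for illustration, one provenance event per FlowFile per processor and no splitting along the way; real event counts depend on the processors involved.

```python
def provenance_events_per_second(packets_per_second, batch_size, processors):
    """Estimate events/s assuming one event per FlowFile per processor."""
    flowfiles_per_second = packets_per_second / batch_size
    return flowfiles_per_second * processors

# Batch size 1: every packet becomes its own FlowFile.
print(provenance_events_per_second(1000, 1, 6))
# Batch size 100: a hundred packets share one FlowFile.
print(provenance_events_per_second(1000, 100, 6))
```

Under these assumptions the event rate drops by the same factor as the batch size, which is why keeping many lines in one FlowFile helps.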
>> 
>> Please let me know if anything here doesn't make sense or if you have any 
>> more questions.
>> 
>> Thanks!
>> -Mark
>> 
>> 
>> > On Jun 30, 2016, at 9:04 PM, Stéphane Maarek <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > I have a question regarding RouteText. The processor works just fine for 
>> > me but maybe I'm missing a couple subtleties:
>> >
>> > 1) I have a regex to group data by (a pair of IDs), but what do I use the 
>> > grouping attribute for? I still get as many outputs as lines
>> > 2) My data is coming from a ListenUDP. If my batch size is 1, RouteText is 
>> > having a lot of trouble processing all the data. I would guess that it 
>> > compiles the regex every time it is executed, is that correct? When I 
>> > increase the batch size to 100, RouteText processes everything well. I was 
>> > wondering if there could be some sort of optimization in RouteText to 
>> > keep the regex compiled regardless of the state of the processor?
>> >
>> >
>> > Thanks a lot!
>> > Stephane
>> 
>
