This [toy] example was mainly to learn how the system works, and I'm glad
I used it because the multi-line regex caught me off guard.  Honestly,
developing these regexes is always a painful combination of regexpal,
grep, and other hackery.  More detail about the regex engine would be
nice to have in the docs, but longer term, if NiFi is going to require
high-fidelity content matches, then pulling in regexpal/regexr would be a
good thing.

The [real] case(s) I'm going to be working on once I understand this
better deal with a variety of log data - not all of it created equal -
which is why I started here.  The current case is log data where multiple
applications *may* write to the same log file (syslog) with different
payloads (strings, JSON, etc.).  I'd rather not build the routing
functions in code outside of NiFi (rsyslog/Python/Kafka/Spark/etc.); I'd
prefer to use the security/provenance mechanisms that are part of this
system and pipeline that data out to other places - whether they be
files, Hive, HBase, websockets, etc.

This should be simple to implement since it's just stdout, and not too
dissimilar to reading from a Kafka topic.  In fact, that's probably the
path I'll go down initially - but I'd like some solution for those files
that don't fit this model.

What I'd like to do is provide a list of regexes, RouteOnMatch, and
perform some action like inserting into Hive/HBase, sending to
Spark/Kafka, or handing off to other processors.  Imagine a critical
syslog alert that MUST go to a critical service-desk queue, as opposed to
*.info, which might just pipe into a Hive table for exploratory analysis
later.
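To make that concrete, here's a rough Python sketch of the routing I have
in mind (the patterns and destination names are made up for illustration;
in NiFi this would presumably be relationships on something like
RouteOnContent rather than code):

```python
import re

# Hypothetical routing table: first matching pattern wins.
ROUTES = [
    (re.compile(r"\bcrit(ical)?\b", re.IGNORECASE), "service_desk_queue"),
    (re.compile(r"\binfo\b", re.IGNORECASE), "hive_table"),
]

def route(line, default="unmatched"):
    """Return the destination name for a single log line."""
    for pattern, destination in ROUTES:
        if pattern.search(line):
            return destination
    return default

print(route("Sep  8 14:00:01 host app: CRITICAL disk failure"))  # service_desk_queue
print(route("Sep  8 14:00:02 host app: info user logged in"))    # hive_table
```

First-match-wins ordering matters here: the critical pattern has to sit
ahead of the catch-all *.info style pattern.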

I know RouteOnContent has this capability, and I will most likely pipe in
syslog/Kafka data initially - but as I said, not all log files are equal,
and there may be some that just get read in a la carte, where
discriminating between line 10 and line 100 may be important.  I also
think that adding an attribute and then sending it along could be
short-circuited with a simple "match and forward" mechanism, rather than
copying content into attributes - which again goes back to the
hit-or-miss regex machine.
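To illustrate the distinction I'm drawing, a toy Python sketch (not
NiFi's actual API): "match and forward" passes matching content through
untouched, while the ExtractText route copies the matched text into
attribute-like key/values first:

```python
import re

R_LINE = re.compile(r'^"R"')  # lines that look like "R" records

def match_and_forward(lines):
    """Route matching lines onward unchanged; no capture groups involved."""
    return [ln for ln in lines if R_LINE.match(ln)]

def extract_to_attributes(lines):
    """ExtractText-style: copy matched text into 'attributes' (a dict here)."""
    results = []
    for ln in lines:
        m = re.match(r'^("R.*)$', ln)
        if m:
            results.append({"regex.0": m.group(0), "regex.1": m.group(1)})
    return results

sample = ['"H","USA","BP"', '"R","1","TB","CLM"', '"R","2","TB","CLM"']
print(match_and_forward(sample))  # the two "R" lines, unchanged
```

The first function never touches the content; the second duplicates it
into attributes, which is the copying step I'd like to skip.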

I don't know about the impact of multiple FlowFiles, but is there an
accumulator that will let me take N lines and combine them into a single
FlowFile?
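What I'm picturing, as a toy sketch (the batch size and newline joining
are illustrative only; Mark's MergeContent suggestion below is presumably
the NiFi way to do this):

```python
def accumulate(lines, n):
    """Group an iterable of lines into batches of up to n, each joined
    into a single newline-delimited blob (one 'FlowFile' per batch)."""
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) == n:
            yield "\n".join(batch)
            batch = []
    if batch:  # flush any leftover lines as a final, smaller batch
        yield "\n".join(batch)

print(list(accumulate(["a", "b", "c", "d", "e"], 2)))
# ['a\nb', 'c\nd', 'e']
```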

-Chris

On Tue, Sep 8, 2015 at 3:00 PM, Bryan Bende <[email protected]> wrote:

> Chris,
>
> After you extract the lines you are interested in, what do you want to do
> with the data after that? Are you delivering it to another system?
> Performing more processing on the data?
>
> Just want to make sure we fully understand the scenario so we can offer
> the best possible solution.
>
> Thanks,
>
> Bryan
>
> On Tue, Sep 8, 2015 at 2:43 PM, Christopher Wilson <[email protected]>
> wrote:
>
>> I've moved the ball a bit closer to the goal - I enabled DOTALL Mode and
>> increased the Capture Group Length to 4096.  That grabs everything from the
>> first line beginning with "R" to some of the "S"'s.
>>
>> Having a bit of trouble terminating the regex though.
>>
>> Once I get that sorted I'll post the result, but I have to say that the
>> capture group length could be problematic "in the wild".  In a perfect
>> world you would know the length up front - but I can see plenty of cases
>> where that's not going to be the case.
>>
>> -Chris
>>
>> On Tue, Sep 8, 2015 at 2:05 PM, Mark Payne <[email protected]> wrote:
>>
>>> Agreed. Bryan's suggestion will give you the ability to match each line
>>> against the regex,
>>> rather than trying to match the entire file. It would result in a new
>>> FlowFile for each line of
>>> text, though, as he said. But if you need to rebuild a single file,
>>> those could potentially be
>>> merged together using a MergeContent processor, as well.
>>>
>>> ________________________________
>>> > Date: Tue, 8 Sep 2015 13:03:08 -0400
>>> > Subject: Re: ExtractText usage
>>> > From: [email protected]
>>> > To: [email protected]
>>> >
>>> > Chris,
>>> >
>>> > I think the issue is that ExtractText is not reading the file line by
>>> > line, and then applying your pattern to each line. It is applying the
>>> > pattern to the whole content of the file so you would need a regex that
>>> > repeated the pattern you were looking for so that it captured multiple
>>> > times.
>>> >
>>> > When I tested your example, it was actually extracting the first match
>>> > 3 times which I think is because of the following...
>>> > - It always puts the first match in the property base name, in this
>>> > case "regex",
>>> > - then it puts the entire match in index 0, in this case regex.0, and
>>> > in this case it is only matching the first occurrence
>>> > - and then all of the matches would be in order after that, starting
>>> > with index 1; in this case there is only 1 match, so it is just regex.1
>>> >
>>> > Another solution that might be simpler is to put a SplitText processor
>>> > between GetFile and ExtractText, and set the Line Split Count to 1.
>>> > This will send 1 line at a time to your ExtractText processor, which
>>> > would then match only the lines starting with 'R'.
>>> > The downside is that all of the lines with 'R' would be in different
>>> > FlowFiles, but this may or may not matter depending on what you want
>>> > to do with them afterward.
>>> >
>>> > -Bryan
>>> >
>>> >
>>> > On Tue, Sep 8, 2015 at 12:12 PM, Christopher Wilson
>>> > <[email protected]<mailto:[email protected]>> wrote:
>>> > I'm trying to read a directory of .csv files which have 3 different
>>> > schemas/list types (not my idea). The descriptor is in the first
>>> > column of the csv file. I'm reading the files in using GetFile and
>>> > passing them into ExtractText, but I'm only getting the first 3 (of 8)
>>> > lines matching my first regex. What I want to do is grab all the lines
>>> > beginning with "R" and dump them off to a file (for now). My end goal
>>> > would be to loop through these, grabbing lines, or blocks of lines, by
>>> > regex, and routing them downstream based on that regex.
>>> >
>>> > Details and first 11 lines of a sample file below.
>>> >
>>> > Thanks in advance.
>>> >
>>> > -Chris
>>> >
>>> > NiFi version: 0.2.1
>>> > OS: Ubuntu 14.01
>>> > JVM: java-1.7.0-openjdk-amd64
>>> >
>>> > ExtractText:
>>> >
>>> > Enable Multiline = True
>>> > Enable Unix Lines Mode = True
>>> > regex = ^("R.*)$
>>> >
>>> >
>>> > "H","USA","BP","20140502","9","D","BP"
>>> > "R","1","TB","CLM"," "," ","3U"," ","47000","0","47000","0"," ","0","
>>> > ","0"," ","0"," ","0"," ","0"," ","0"," ","0","25000","25000","
>>> > ","650","F","D","D","6"," "," "," ","1:20PM ","1:51PM ","0122"," ","Clm
>>> > 25000","Fast","","16","87","
>>> > ","","","64","117.39","2266","4648","11129","0","0","
>>> > ","","112089","Good","Cloudy","","","Y"
>>> > "R","2","TB","CLM"," ","B","3U"," ","34000","0","34000","0"," ","0","
>>> > ","0"," ","0"," ","0"," ","0"," ","0"," ","0","25000","25000","
>>> > ","600","F","D","D","7"," "," "," ","1:51PM ","2:22PM ","0151"," ","Clm
>>> > 25000N2L","Fast","","16","79","
>>> > ","","","64","112.36","2444","4803","10003","0","0","
>>> > ","","261868","Poor","Cloudy","","","Y"
>>> > "R","3","TB","STK","S"," ","3U","
>>> > ","100000","0","100000","0","A","100000"," ","0"," ","0"," ","0","
>>> > ","0"," ","0"," ","0","0","0"," ","600","F","D","D","6"," ","Affirmed
>>> > Success S.","AfrmdScsB","2:22PM ","2:53PM ","0222","
>>> > ","AfrmdScsB100k","Fast","","16","88","
>>> > ","","","64","110.54","2323","4618","5810","0","0","
>>> > ","","259015","5","Clear","","","Y"
>>> > "R","4","TB","MCL"," "," ","3U"," ","49200","0","49200","0"," ","0","
>>> > ","0"," ","0"," ","0"," ","0"," ","0"," ","0","40000","40000","
>>> > ","850","F","D","D","8"," "," "," ","2:53PM ","3:24PM ","0253"," ","Md
>>> > 40000","Fast","Y","30","72","
>>> > ","","","64","145.58","2425","4829","11358","13909","0","
>>> > ","","260343","9","Clear","0","","Y"
>>> > "R","5","TB","ALW"," "," ","3U"," ","77000","0","77000","0"," ","0","
>>> > ","0"," ","0"," ","0"," ","0"," ","0"," ","0","0","0","
>>> > ","900","F","D","D","7"," "," "," ","3:24PM ","3:55PM ","0325"," ","Alw
>>> > 77000N1X","Fast","Y","30","74","
>>> > ","","","64","151.69","2330","4643","11156","13832","0","
>>> > ","","302065","Good","Clear","","","Y"
>>> > "R","6","TB","MSW","S","B","3U"," ","60000","1200","60000","0","
>>> > ","0"," ","0"," ","0"," ","0"," ","0"," ","0"," ","0","0","0","
>>> > ","800","F","D","D","5"," "," "," ","3:55PM ","4:26PM ","0355"," ","Md
>>> > Sp Wt 58k","Fast","","30","61","
>>> > ","","","64","140.64","2481","4931","11477","0","0","
>>> > ","","161404","Good","Clear","","","Y"
>>> > "R","7","TB","CLM"," ","B","3U"," ","40000","0","40000","0"," ","0","
>>> > ","0"," ","0"," ","0"," ","0"," ","0"," ","0","20000","20000","
>>> > ","800","F","D","D","6"," "," "," ","4:26PM ","4:57PM ","0427"," ","Clm
>>> > 20000","Fast","","30","68","
>>> > ","","","64","139.31","2337","4770","11402","0","0","
>>> > ","","344306","Good","Clear","","","Y"
>>> > "R","8","TB","ALW"," ","B","3U"," ","77000","0","77000","0"," ","0","
>>> > ","0"," ","0"," ","0"," ","0"," ","0"," ","0","0","0","
>>> > ","850","F","D","D","7"," "," "," ","4:57PM ","5:28PM ","0457"," ","Alw
>>> > 77000N1X","Fast","","30","76","
>>> > ","","","64","144.76","2416","4847","11365","13836","0","
>>> > ","","213021","Good","Clear","","","Y"
>>> > "R","9","TB","STR"," "," ","3U"," ","60000","0","60000","0"," ","0","
>>> > ","0"," ","0"," ","0"," ","0"," ","0"," ","0","0","40000","
>>> > ","700","F","D","D","8"," "," "," ","5:28PM "," ","0528"," ","Alw
>>> > 40000s","Fast","Y","16","81","
>>> > ","","","64","124.66","2339","4740","11211","0","0","
>>> > ","","332649","6,8","Clear","0","","Y"
>>> >
>>> > "S","1","000008813341TB","Coolusive","20100124","KY","TB","Colt","Bay","Ice
>>> > Cool Kitty","2003","TB","Elusive Quality","1993","TB","Tomorrows
>>> > Cat","1995","TB","Gone
>>> >
>>> > West","1984","TB","122","0","L","","28200","Velasquez","Cornelio","H.","
>>> > ","Jacobson","David"," ","Drawing Away Stable and Jacobson, David","
>>> > "," ","265","N","
>>> >
>>> > ","0","N","5","5","3","3","4","0","0","1","1","1","10","200","0","0","100","75","510","320","0","0","0","0","N","25000","4w
>>> > into lane, held","chase 2o turn, bid 4w turning for home,took over,
>>> > held
>>> >
>>> > sway","7.30","3.80","2.70","Y","000000002103TE","TE","Barbara","Robert","
>>> > ","000001976480O6","O6","Averill","Bradley","E.","
>>> > ","N","0","N","","0","","87","Lansdon B. Robbins & Kevin
>>> > Callahan","000000257611TE","000000002695JE"
>>> >
>>> >
>>>
>>>
>>
>>
>
