This [toy] example was mostly a way to learn how the system works, and I'm glad I used it because the multi-line regex caught me off guard. To be honest, developing these patterns is always a painful combination of regexpal, grep, and other hackery. Details about the regex engine would have been nice to have in the docs going in, and longer term, if NiFi is going to require high-fidelity content matches, then pulling in something like regexpal/regexr would be a good thing.
The [real] cases I'm going to be working on once I understand this better deal with a variety of log data - not all of it created equal - which is why I started here. The current case is log data where multiple applications *may* write to the same log file (syslog) with different payloads (strings, JSON, etc.). I'd rather not build the routing functions in code outside of NiFi (rsyslog/Python/Kafka/Spark/etc.); I want to use the security/provenance mechanisms that are part of this system and pipe that data out to other places - whether they be files, Hive, HBase, websockets, etc. This should be simple to implement since it's just stdout and not too dissimilar to reading from a Kafka topic. In fact, that's probably the path I'll go down initially - but I'd like some solution for the files that don't fit this model.

What I'd like to do is provide a list of regexes, RouteOnMatch, and perform some action like inserting into Hive/HBase, sending to Spark/Kafka, or handing off to other processors. Imagine a critical syslog alert that MUST go to a critical service desk queue, as opposed to *.info, which might just pipe into a Hive table for exploratory analysis later. I know RouteOnContent has this capability, and I will most likely pipe syslog/Kafka data initially - but as I said, not all log files are equal, and there may be some that just get read in a la carte, where discriminating between line 10 and line 100 may be important.

I also think that adding an attribute and then sending it along could be short-circuited with a simple "match and forward" mechanism, rather than copying content into attributes, which again goes back into the hit-or-miss regex machine. I don't know about the impact of multiple FlowFiles, but is there an accumulator that will allow me to take N lines and combine them into a single FlowFile?
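To sketch what I mean by "match and forward" (this is just illustrative Python, not NiFi - the patterns and destination names are made up):

```python
import re

# Hypothetical routing table: first matching pattern wins.
# The patterns and destination names are placeholders, not real config.
ROUTES = [
    (re.compile(r"\bcrit(ical)?\b", re.IGNORECASE), "service_desk_queue"),
    (re.compile(r"\.info\b"), "hive_table"),
]

def route_line(line, default="unmatched"):
    """Return the destination for the first pattern that matches the line."""
    for pattern, destination in ROUTES:
        if pattern.search(line):
            return destination
    return default

# A critical syslog alert vs. an informational line.
print(route_line("kernel: CRITICAL: disk failure on /dev/sda"))  # service_desk_queue
print(route_line("app[123]: *.info routine heartbeat"))          # hive_table
```

The point is that the line is only matched and forwarded - its content never has to be copied into attributes along the way.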
-Chris

On Tue, Sep 8, 2015 at 3:00 PM, Bryan Bende <[email protected]> wrote:

> Chris,
>
> After you extract the lines you are interested in, what do you want to do
> with the data after that? Are you delivering to another system? Performing
> more processing on the data?
>
> Just want to make sure we fully understand the scenario so we can offer
> the best possible solution.
>
> Thanks,
>
> Bryan
>
> On Tue, Sep 8, 2015 at 2:43 PM, Christopher Wilson <[email protected]> wrote:
>
>> I've moved the ball a bit closer to the goal - I enabled DOTALL Mode and
>> increased the Capture Group Length to 4096. That grabs everything from the
>> first line beginning with "R" to some of the "S"'s.
>>
>> Having a bit of trouble terminating the regex, though.
>>
>> Once I get that sorted I'll post the result, but I have to say that the
>> capture group length could be problematic "in the wild". In a perfect
>> world you would know the length up front - but I can see plenty of cases
>> where that's not going to be the case.
>>
>> -Chris
>>
>> On Tue, Sep 8, 2015 at 2:05 PM, Mark Payne <[email protected]> wrote:
>>
>>> Agreed. Bryan's suggestion will give you the ability to match each line
>>> against the regex, rather than trying to match the entire file. It would
>>> result in a new FlowFile for each line of text, though, as he said. But
>>> if you need to rebuild a single file, those could potentially be merged
>>> together using a MergeContent processor, as well.
>>>
>>> ________________________________
>>> > Date: Tue, 8 Sep 2015 13:03:08 -0400
>>> > Subject: Re: ExtractText usage
>>> > From: [email protected]
>>> > To: [email protected]
>>> >
>>> > Chris,
>>> >
>>> > I think the issue is that ExtractText is not reading the file line by
>>> > line and then applying your pattern to each line.
>>> > It is applying the pattern to the whole content of the file, so you
>>> > would need a regex that repeats the pattern you are looking for so
>>> > that it captures multiple times.
>>> >
>>> > When I tested your example, it was actually extracting the first match
>>> > 3 times, which I think is because of the following...
>>> > - It always puts the first match in the property base name, in this
>>> > case "regex",
>>> > - then it puts the entire match in index 0, in this case regex.0, and
>>> > in this case it is only matching the first occurrence,
>>> > - and then all of the matches would be in order after that, starting
>>> > with index 1; in this case there is only 1 match, so it is just regex.1.
>>> >
>>> > Another solution that might be simpler is to put a SplitText processor
>>> > between GetFile and ExtractText, and set the Line Split Count to 1.
>>> > This will send one line at a time to your ExtractText processor, which
>>> > would then match only the lines starting with 'R'.
>>> > The downside is that all of the lines with 'R' would be in different
>>> > FlowFiles, but this may or may not matter depending on what you want
>>> > to do with them after.
>>> >
>>> > -Bryan
>>> >
>>> >
>>> > On Tue, Sep 8, 2015 at 12:12 PM, Christopher Wilson
>>> > <[email protected]> wrote:
>>> > I'm trying to read a directory of .csv files which have 3 different
>>> > schemas/list types (not my idea). The descriptor is in the first
>>> > column of the csv file. I'm reading the files in using GetFile and
>>> > passing them into ExtractText, but I'm only getting the first 3 (of 8)
>>> > lines matching my first regex. What I want to do is grab all the lines
>>> > beginning with "R" and dump them off to a file (for now). My end goal
>>> > would be to loop through these grabbed lines, or blocks of lines, by
>>> > regex and route them downstream based on that regex.
>>> >
>>> > Details and the first 11 lines of a sample file below.
>>> >
>>> > Thanks in advance.
>>> >
>>> > -Chris
>>> >
>>> > NiFi version: 0.2.1
>>> > OS: Ubuntu 14.01
>>> > JVM: java-1.7.0-openjdk-amd64
>>> >
>>> > ExtractText:
>>> >
>>> > Enable Multiline = True
>>> > Enable Unix Lines Mode = True
>>> > regex = ^("R.*)$
>>> >
>>> > "H","USA","BP","20140502","9","D","BP"
>>> > "R","1","TB","CLM"," "," ","3U"," ","47000","0","47000","0"," ","0"," ","0"," ","0"," ","0"," ","0"," ","0"," ","0","25000","25000"," ","650","F","D","D","6"," "," "," ","1:20PM ","1:51PM ","0122"," ","Clm 25000","Fast","","16","87"," ","","","64","117.39","2266","4648","11129","0","0"," ","","112089","Good","Cloudy","","","Y"
>>> > "R","2","TB","CLM"," ","B","3U"," ","34000","0","34000","0"," ","0"," ","0"," ","0"," ","0"," ","0"," ","0"," ","0","25000","25000"," ","600","F","D","D","7"," "," "," ","1:51PM ","2:22PM ","0151"," ","Clm 25000N2L","Fast","","16","79"," ","","","64","112.36","2444","4803","10003","0","0"," ","","261868","Poor","Cloudy","","","Y"
>>> > "R","3","TB","STK","S"," ","3U"," ","100000","0","100000","0","A","100000"," ","0"," ","0"," ","0"," ","0"," ","0"," ","0","0","0"," ","600","F","D","D","6"," ","Affirmed Success S.","AfrmdScsB","2:22PM ","2:53PM ","0222"," ","AfrmdScsB100k","Fast","","16","88"," ","","","64","110.54","2323","4618","5810","0","0"," ","","259015","5","Clear","","","Y"
>>> > "R","4","TB","MCL"," "," ","3U"," ","49200","0","49200","0"," ","0"," ","0"," ","0"," ","0"," ","0"," ","0"," ","0","40000","40000"," ","850","F","D","D","8"," "," "," ","2:53PM ","3:24PM ","0253"," ","Md 40000","Fast","Y","30","72"," ","","","64","145.58","2425","4829","11358","13909","0"," ","","260343","9","Clear","0","","Y"
>>> > "R","5","TB","ALW"," "," ","3U"," ","77000","0","77000","0"," ","0"," ","0"," ","0"," ","0"," ","0"," ","0"," ","0","0","0"," ","900","F","D","D","7"," "," "," ","3:24PM ","3:55PM ","0325"," ","Alw 77000N1X","Fast","Y","30","74"," ","","","64","151.69","2330","4643","11156","13832","0"," ","","302065","Good","Clear","","","Y"
>>> > "R","6","TB","MSW","S","B","3U"," ","60000","1200","60000","0"," ","0"," ","0"," ","0"," ","0"," ","0"," ","0"," ","0","0","0"," ","800","F","D","D","5"," "," "," ","3:55PM ","4:26PM ","0355"," ","Md Sp Wt 58k","Fast","","30","61"," ","","","64","140.64","2481","4931","11477","0","0"," ","","161404","Good","Clear","","","Y"
>>> > "R","7","TB","CLM"," ","B","3U"," ","40000","0","40000","0"," ","0"," ","0"," ","0"," ","0"," ","0"," ","0","20000","20000"," ","800","F","D","D","6"," "," "," ","4:26PM ","4:57PM ","0427"," ","Clm 20000","Fast","","30","68"," ","","","64","139.31","2337","4770","11402","0","0"," ","","344306","Good","Clear","","","Y"
>>> > "R","8","TB","ALW"," ","B","3U"," ","77000","0","77000","0"," ","0"," ","0"," ","0"," ","0"," ","0"," ","0"," ","0","0","0"," ","850","F","D","D","7"," "," "," ","4:57PM ","5:28PM ","0457"," ","Alw 77000N1X","Fast","","30","76"," ","","","64","144.76","2416","4847","11365","13836","0"," ","","213021","Good","Clear","","","Y"
>>> > "R","9","TB","STR"," "," ","3U"," ","60000","0","60000","0"," ","0"," ","0"," ","0"," ","0"," ","0"," ","0"," ","0","0","40000"," ","700","F","D","D","8"," "," "," ","5:28PM "," ","0528"," ","Alw 40000s","Fast","Y","16","81"," ","","","64","124.66","2339","4740","11211","0","0"," ","","332649","6,8","Clear","0","","Y"
>>> > "S","1","000008813341TB","Coolusive","20100124","KY","TB","Colt","Bay","Ice Cool Kitty","2003","TB","Elusive Quality","1993","TB","Tomorrows Cat","1995","TB","Gone West","1984","TB","122","0","L","","28200","Velasquez","Cornelio","H."," ","Jacobson","David"," ","Drawing Away Stable and Jacobson, David"," "," ","265","N"," ","0","N","5","5","3","3","4","0","0","1","1","1","10","200","0","0","100","75","510","320","0","0","0","0","N","25000","4w into lane, held","chase 2o turn, bid 4w turning for home,took over, held sway","7.30","3.80","2.70","Y","000000002103TE","TE","Barbara","Robert"," ","000001976480O6","O6","Averill","Bradley","E."," ","N","0","N","","0","","87","Lansdon B. Robbins & Kevin Callahan","000000257611TE","000000002695JE"
