Hello, Mark,

I think it should be possible to do this using UpdateAttribute in advanced mode: define a condition for each of the different formats, and once the particular format type is identified, extract the appropriate substring into a new attribute - or into the filename attribute if you want to normalize naming. If I remember correctly, there is no support for extracting regex capture groups in the NiFi Expression Language in 0.3.0.
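
To make the per-format idea concrete, here is a rough sketch of the logic in Python (not NiFi Expression Language): one rule per filename format, and the first rule that matches produces the normalized yyyyMMddHHmmss value. The regex patterns and the names RULES and normalize_timestamp are my own guesses based on your three example filenames, and I am assuming that epoch values should be rendered in UTC and that missing seconds are padded with 00.

```python
import re
from datetime import datetime, timezone


def _from_compact(m):
    # Already yyyyMMddHHmmss; use as-is.
    return m.group(1)


def _from_epoch_ms(m):
    # Unix time in milliseconds. Assumption: render in UTC.
    dt = datetime.fromtimestamp(int(m.group(1)) // 1000, tz=timezone.utc)
    return dt.strftime("%Y%m%d%H%M%S")


def _from_date_and_time(m):
    # yyyyMMdd_HHmm. Assumption: pad the missing seconds with "00".
    return m.group(1) + m.group(2) + "00"


# One (condition, action) pair per format, like one rule per format
# in UpdateAttribute advanced mode. Patterns are hypothetical.
RULES = [
    (re.compile(r"^[A-Za-z]+_\d+_(\d{14})_\d+\.csv$"), _from_compact),
    (re.compile(r"^[A-Za-z]+_\d+_(\d{13})\.csv$"), _from_epoch_ms),
    (re.compile(r"^[A-Za-z]+_(\d{8})_(\d{4})\.csv$"), _from_date_and_time),
]


def normalize_timestamp(filename):
    """Return the timestamp as yyyyMMddHHmmss, or None if no rule matches."""
    for pattern, handler in RULES:
        m = pattern.match(filename)
        if m:
            return handler(m)
    return None


if __name__ == "__main__":
    for name in (
        "Priority_002_20151104123456_00.csv",
        "ABC_02_1447586912344.csv",
        "XYZ_20151104_1234.csv",
    ):
        print(name, "->", normalize_timestamp(name))
```

In UpdateAttribute each rule would become a condition on the filename attribute plus an action that sets the new attribute, so the Python above is only meant to show the shape of the rules, not the exact expressions.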

Hope this helps J

On Wed, Nov 4, 2015 at 7:04 AM, Mark Petronic <[email protected]> wrote:

> Looking for some help on best way to extract a field from a filename. I
> need to parse out the date from the core filename attribute set by the
> UnpackContent processor. I am unzipping files that contain many CSV files
> and these CSV file names vary in format but each has a timestamp included
> in the filename. Example formats are:
>
> Priority_002_20151104123456_00.csv (20151104123456 is yyyyMMddHHmmss)
> ABC_02_1447586912344.csv (1447586912344 is Unix time in ms)
> XYZ_20151104_1234.csv (20151104_1234 is yyyyMMdd_HHmm)
>
> So, there are various forms to deal with. I need to normalize these into
> yyyyMMddHHmmss. A regex with capture groups would be perfect but I cannot
> quite figure out how to do it. ExtractText does regex with capture groups
> but only against flowfile contents and these are attributes.
> UpdateAttribute only supports expression language and that does not have
> regex based extracts of capture groups.
>
> In Python, I would just do something like:
>
> date, time = re.search(r"XYZ_(\d+)_(\d+)\.csv",
> "XYZ_20151104_1234.csv").groups()
>
> Then I could use the expression language format or toDate functions to
> normalize the dates
>
> I know I could use a utility script with ExecuteStreamCommand that I could
> call with the filepath and get back the tokens but was looking for an
> internal way to do it without forking out as there are a lot of archives in
> each zip and that would add to latency in heavy loads.
>
> Any thoughts?
>
> Thanks!
