I'm looking for some help on the best way to extract a field from a filename. I need to parse the date out of the core filename attribute set by the UnpackContent processor. I am unzipping archives that contain many CSV files; the CSV file names vary in format, but each includes a timestamp. Example formats are:

Priority_002_20151104123456_00.csv (20151104123456 is yyyyMMddHHmmss)
ABC_02_1447586912344.csv (1447586912344 is Unix time in ms)
XYZ_20151104_1234.csv (20151104_1234 is yyyyMMdd_HHmm)

So there are various forms to deal with, and I need to normalize them all into yyyyMMddHHmmss.

A regex with capture groups would be perfect, but I cannot quite figure out how to do it. ExtractText does regex with capture groups, but only against flowfile contents, and these are attributes. UpdateAttribute only supports the expression language, which does not have regex-based extraction of capture groups.

In Python, I would just do something like:

    date, time = re.search(r"XYZ_(\d+)_(\d+)\.csv", "XYZ_20151104_1234.csv").groups()

Then I could use the expression language format or toDate functions to normalize the dates.

I know I could use a utility script with ExecuteStreamCommand that I could call with the filepath and get back the tokens, but I was looking for an internal way to do it without forking out, as there are a lot of archives in each zip and that would add latency under heavy load.

Any thoughts? Thanks!
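
For reference, here is a minimal Python sketch of the normalization I am after, covering only the three example formats above. The function name normalize_timestamp and the patterns are hypothetical, made up for illustration. It also assumes that the Unix time should be rendered in UTC and that the missing seconds in the yyyyMMdd_HHmm form can be filled with 00; neither is settled yet.

```python
import re
from datetime import datetime, timezone

# Hypothetical patterns, one per example filename format listed above.
FULL_RE = re.compile(r"_(\d{14})_\d+\.csv$")           # yyyyMMddHHmmss
EPOCH_MS_RE = re.compile(r"_(\d{13})\.csv$")           # Unix time in ms
DATE_TIME_RE = re.compile(r"_(\d{8})_(\d{4})\.csv$")   # yyyyMMdd_HHmm


def normalize_timestamp(filename):
    """Return the timestamp in filename as yyyyMMddHHmmss, or None if no pattern matches."""
    m = FULL_RE.search(filename)
    if m:
        # Already in the target format.
        return m.group(1)

    m = EPOCH_MS_RE.search(filename)
    if m:
        # Assumption: render epoch milliseconds in UTC, dropping the millisecond part.
        seconds = int(m.group(1)) // 1000
        return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y%m%d%H%M%S")

    m = DATE_TIME_RE.search(filename)
    if m:
        # Assumption: pad the missing seconds with "00".
        return m.group(1) + m.group(2) + "00"

    return None


if __name__ == "__main__":
    for name in (
        "Priority_002_20151104123456_00.csv",
        "ABC_02_1447586912344.csv",
        "XYZ_20151104_1234.csv",
    ):
        print(name, normalize_timestamp(name))
```

The goal is to get the same result inside the flow, working on the filename attribute, without calling out to an external script like this one.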
