Suggestion on how to parse field out of filename

Mark Petronic Tue, 03 Nov 2015 22:04:45 -0800

Looking for some help on the best way to extract a field from a filename. I
need to parse the date out of the core filename attribute set by the
UnpackContent processor. I am unzipping files that contain many CSV files,
and the CSV file names vary in format, but each includes a timestamp.
Example formats are:


Priority_002_20151104123456_00.csv  (20151104123456 is yyyyMMddHHmmss)
ABC_02_1447586912344.csv (1447586912344 is Unix time in ms)
XYZ_20151104_1234.csv (20151104_1234 is yyyyMMdd_HHmm)

So, there are various forms to deal with. I need to normalize these into
yyyyMMddHHmmss. A regex with capture groups would be perfect, but I cannot
quite figure out how to do it. ExtractText does regex with capture groups,
but only against flowfile contents, and these are attributes.
UpdateAttribute only supports the expression language, and that does not
have regex-based extraction of capture groups.

In Python, I would just do something like:

import re
date, time = re.search(r"XYZ_(\d+)_(\d+)\.csv",
                       "XYZ_20151104_1234.csv").groups()
# date == "20151104", time == "1234"

Then I could use the expression language format or toDate functions to
normalize the dates.
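
As a rough illustration of the target behaviour only (this is plain Python,
not a NiFi solution), normalizing the three example formats above could be
sketched as follows. The function name and the patterns are hypothetical,
the patterns are guessed from the three sample names, and the Unix
millisecond value is assumed to be interpreted as UTC:

```python
import re
from datetime import datetime, timezone

# Hypothetical patterns, one per example filename format listed above.
_PATTERNS = [
    (re.compile(r"_(\d{14})_\d+\.csv$"), "compact"),      # yyyyMMddHHmmss
    (re.compile(r"_(\d{13})\.csv$"), "epoch_ms"),         # Unix time in ms
    (re.compile(r"_(\d{8})_(\d{4})\.csv$"), "date_hhmm"), # yyyyMMdd_HHmm
]


def normalize_timestamp(filename):
    """Return the timestamp in filename as yyyyMMddHHmmss, or None."""
    for pattern, kind in _PATTERNS:
        match = pattern.search(filename)
        if match is None:
            continue
        if kind == "compact":
            # Already in the target layout.
            return match.group(1)
        if kind == "epoch_ms":
            # Assumption: epoch values are rendered in UTC.
            seconds = int(match.group(1)) // 1000
            moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
            return moment.strftime("%Y%m%d%H%M%S")
        # yyyyMMdd_HHmm: join date and time, pad seconds with "00".
        return match.group(1) + match.group(2) + "00"
    return None
```

Under these assumptions, normalize_timestamp("XYZ_20151104_1234.csv") gives
"20151104123400", and a name matching none of the patterns gives None.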

I know I could use a utility script with ExecuteStreamCommand that I could
call with the filepath and get back the tokens but was looking for an
internal way to do it without forking out as there are a lot of archives in
each zip and that would add to latency in heavy loads.

Any thoughts?

Thanks!
