Re: Suggestion on how to parse field out of filename

Mark Petronic Wed, 04 Nov 2015 07:03:56 -0800

Thanks guys. replaceAll looks like the solution. I read through that doc
numerous times and cannot believe I missed that one. LOL. Juan, I was
looking at UpdateAttribute advanced mode as well, but just got stuck on how
to do the regex in that context using expression language.


Appreciate the help guys. Now I can have a happy day getting this working
 :)
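
For anyone finding this thread later, here is a rough Python sketch of the
normalization I am after, just to pin down the logic before translating it
into UpdateAttribute rules. The regex patterns, the normalize() helper and
the choice of UTC for the Unix-time case are my own assumptions for
illustration, not anything NiFi provides. The re.sub line is meant as the
Python analogue of what replaceAll should do, assuming replaceAll accepts
back-references to capture groups in the replacement string.

```python
import re
from datetime import datetime, timezone

# Python analogue of a replaceAll with back-references:
# keep only the two captured digit runs from the XYZ-style name.
xyz_digits = re.sub(r"XYZ_(\d+)_(\d+)\.csv", r"\1\2", "XYZ_20151104_1234.csv")

# Hypothetical patterns, one per filename format quoted in this thread.
FULL_TS = re.compile(r"^.*_(\d{14})_\d+\.csv$")      # yyyyMMddHHmmss
EPOCH_MS = re.compile(r"^.*_(\d{13})\.csv$")         # Unix time in ms
DATE_TIME = re.compile(r"^.*_(\d{8})_(\d{4})\.csv$")  # yyyyMMdd_HHmm


def normalize(filename):
    """Return the timestamp in filename as yyyyMMddHHmmss, or None."""
    m = FULL_TS.match(filename)
    if m:
        return m.group(1)

    m = EPOCH_MS.match(filename)
    if m:
        # Assumption: interpret the epoch value as UTC.
        seconds = int(m.group(1)) // 1000
        dt = datetime.fromtimestamp(seconds, tz=timezone.utc)
        return dt.strftime("%Y%m%d%H%M%S")

    m = DATE_TIME.match(filename)
    if m:
        # No seconds in this format, so pad with "00".
        return m.group(1) + m.group(2) + "00"

    return None
```

In UpdateAttribute advanced mode, each of the three patterns would presumably
become its own rule condition, with the matching action writing the
normalized value into a new attribute.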

On Wed, Nov 4, 2015 at 7:04 AM, Ryan Ward <[email protected]> wrote:

> Mark,
>
> Take a look at the replaceAll function. Juan is correct, you will want to
> use UpdateAttribute in advanced mode.
>
>
> https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#replaceall
>
> Ryan
>
> On Wed, Nov 4, 2015 at 6:52 AM, Juan Jose Escobar <
> [email protected]> wrote:
>
>>
>> Hello, Mark,
>>
>> I think it should be possible to do it using UpdateAttribute in advanced
>> mode: define a condition for each of the different formats, and once the
>> particular format type is identified, get the appropriate substring into a
>> new attribute - or into the filename attribute if you want to normalize
>> naming. If I remember correctly, there is no support for extracting regex
>> groups in NiFi Expression Language in 0.3.0.
>>
>> Hope this helps
>>
>> J
>>
>> On Wed, Nov 4, 2015 at 7:04 AM, Mark Petronic <[email protected]>
>> wrote:
>>
>>> Looking for some help on best way to extract a field from a filename. I
>>> need to parse out the date from the core filename attribute set by the
>>> UnpackContent processor. I am unzipping files that contain many CSV files
>>> and these CSV file names vary in format but each has a timestamp included
>>> in the filename. Example formats are:
>>>
>>> Priority_002_20151104123456_00.csv  (20151104123456 is yyyyMMddHHmmss)
>>> ABC_02_1447586912344.csv (1447586912344 is Unix time in ms)
>>> XYZ_20151104_1234.csv (20151104_1234 is yyyyMMdd_HHmm)
>>>
>>> So, there are various forms to deal with. I need to normalize these into
>>> yyyyMMddHHmmss. A regex with capture groups would be perfect but I cannot
>>> quite figure out how to do it. ExtractText does regex with capture groups
>>> but only against flowfile contents and these are attributes.
>>> UpdateAttribute only supports expression language and that does not have
>>> regex-based extraction of capture groups.
>>>
>>> In Python, I would just do something like:
>>>
>>> import re
>>> date, time = re.search(r"XYZ_(\d+)_(\d+)\.csv",
>>>                        "XYZ_20151104_1234.csv").groups()
>>>
>>> Then I could use the expression language format or toDate functions to
>>> normalize the dates.
>>>
>>> I know I could use a utility script with ExecuteStreamCommand that I
>>> could call with the filepath and get back the tokens but was looking for an
>>> internal way to do it without forking out as there are a lot of archives in
>>> each zip and that would add to latency in heavy loads.
>>>
>>> Any thoughts?
>>>
>>> Thanks!
>>>
>>>
>>
>
