[jira] [Comment Edited] (NIFI-2072) Support named captures in ExtractText

2020-07-07 Thread Malthe Borch (Jira)


[ 
https://issues.apache.org/jira/browse/NIFI-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152904#comment-17152904
 ] 

Malthe Borch edited comment on NIFI-2072 at 7/7/20, 5:14 PM:
-

I would be happy then with "Enable named group support".

In terms of what happens if an unnamed capture group is used, I think it would 
be better to either:

- Allow it. I often enough see named captures mixed with unnamed ones, simply 
because the author has not bothered to use a non-capturing group.
- Implement a validation step that scans the expression for unnamed capture 
groups (i.e. those that are not named and not non-capturing). It would then be 
an error to use a regex that has unnamed capture groups.
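
As a rough illustration of the second option, the scan could look something like the sketch below. This is not NiFi code: the class name {{UnnamedGroupScanner}} and the method {{findUnnamedGroups}} are made up for the example. It assumes Java regex syntax, where any group opening with "(?" is either named or non-capturing, and it deliberately ignores comments mode ((?x) with "#" comments).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of NiFi. */
final class UnnamedGroupScanner {

    /** Returns the offset of the opening parenthesis of every unnamed capturing group. */
    static List<Integer> findUnnamedGroups(String regex) {
        List<Integer> offsets = new ArrayList<>();
        boolean inQuote = false; // true while inside a \Q...\E literal section
        int classDepth = 0;      // nesting depth of [...] character classes
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            boolean hasNext = i + 1 < regex.length();
            if (inQuote) {
                // Only \E ends a quoted section; everything else is literal.
                if (c == '\\' && hasNext && regex.charAt(i + 1) == 'E') {
                    inQuote = false;
                    i++;
                }
                continue;
            }
            if (c == '\\') {
                if (hasNext && regex.charAt(i + 1) == 'Q') {
                    inQuote = true;
                }
                i++; // skip the escaped character, e.g. "\("
                continue;
            }
            if (c == '[') {
                classDepth++;
            } else if (c == ']' && classDepth > 0) {
                classDepth--;
            } else if (c == '(' && classDepth == 0) {
                // "(?" starts a named group, a non-capturing group, a lookaround
                // or an inline flag; a bare "(" is an unnamed capturing group.
                if (!(hasNext && regex.charAt(i + 1) == '?')) {
                    offsets.add(i);
                }
            }
        }
        return offsets;
    }
}
```

A validator could then reject the property value whenever the returned list is non-empty and report the offsets in the error message.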


was (Author: malthe):
I would be happy then with "Enable named group support".

In terms of what happens if an unnamed capture group is used, I think it would 
be better to either:

- Allow it.
- Implement a validation step that scans the expression for unnamed capture 
groups (i.e. those that are not named and not non-capturing).

> Support named captures in ExtractText
> -
>
> Key: NIFI-2072
> URL: https://issues.apache.org/jira/browse/NIFI-2072
> Project: Apache NiFi
>  Issue Type: Improvement
>Reporter: Joey Frazee
>Assignee: Otto Fowler
>Priority: Major
>  Labels: extracttext
>
> ExtractText currently captures and creates attributes using numeric indices 
> (e.g., attribute.name.0, attribute.name.1, etc.) whether or not the capture 
> groups are named, i.e., patterns like (?<name>\w+).
> In addition to being more faithful to the provided regexes, named captures 
> could help simplify data flows because you wouldn't have to add superfluous 
> UpdateAttribute steps which are just renaming the indexed captures to more 
> interpretable names.
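
For context, a small sketch of what java.util.regex itself exposes: a named group is still numbered, so the same capture can be read either by index or by name. The class {{NamedVersusIndexed}} and the group name "word" are chosen only for this example and do not come from ExtractText.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustration only: one capture read by index and by name. */
final class NamedVersusIndexed {

    /** Returns {byIndex, byName} for the first match, or an empty array if there is none. */
    static String[] firstWord(String input) {
        Matcher m = Pattern.compile("(?<word>\\w+)").matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        // group(1) and group("word") refer to the same capturing group.
        return new String[] { m.group(1), m.group("word") };
    }
}
```

An index-based attribute such as attribute.name.1 and a name-based one such as attribute.name.word would therefore carry the same value; only the attribute key differs.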



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (NIFI-2072) Support named captures in ExtractText

2020-07-06 Thread Malthe Borch (Jira)


[ 
https://issues.apache.org/jira/browse/NIFI-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151827#comment-17151827
 ] 

Malthe Borch edited comment on NIFI-2072 at 7/6/20, 7:20 AM:
-

Is it really necessary to _enable_ named capture groups rather than just use 
them? If I don't want named capture groups, I suppose I am just not going to 
name them, opting instead for enumerated ones.


was (Author: malthe):
Is it really necessary to _enable _named capture group rather than just use 
them? If I don't want a named capture group, I suppose I am just not going to 
name them, opting instead for enumerated ones.



[jira] [Comment Edited] (NIFI-2072) Support named captures in ExtractText

2020-07-06 Thread Malthe Borch (Jira)


[ 
https://issues.apache.org/jira/browse/NIFI-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151827#comment-17151827
 ] 

Malthe Borch edited comment on NIFI-2072 at 7/6/20, 7:20 AM:
-

Is it really necessary to _enable _named capture group rather than just use 
them? If I don't want a named capture group, I suppose I am just not going to 
name them, opting instead for enumerated ones.


was (Author: malthe):
Is it really necessary to enable named capture group rather than just use them? 
If I don't want a named capture group, I suppose I am just not going to name 
them, opting instead for enumerated ones.



[jira] [Comment Edited] (NIFI-2072) Support named captures in ExtractText

2020-06-30 Thread Otto Fowler (Jira)


[ 
https://issues.apache.org/jira/browse/NIFI-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148956#comment-17148956
 ] 

Otto Fowler edited comment on NIFI-2072 at 6/30/20, 9:21 PM:
-

[~pvillard]

Something like this?  The restriction on the property to enable is: if you 
want named groups, all your capturing groups MUST be named.  You can't mix named 
and unnamed captures.  If they don't match, it falls back to the old way.

I haven't written the validation for that yet, either.


{code:java}
final String SAMPLE_STRING = "foo\r\nbar1\r\nbar2\r\nbar3\r\nhello\r\nworld\r\n";

@Test
public void testProcessorWithGroupNames() throws Exception {
    final TestRunner testRunner = TestRunners.newTestRunner(new ExtractText());

    // Every capturing group is named; (?:...) is non-capturing and is not reported.
    testRunner.setProperty("regex.result1", "(?s)(?<all>.*)");
    testRunner.setProperty("regex.result2", "(?s).*(?<bar1>bar1).*");
    testRunner.setProperty("regex.result3", "(?s).*?(?<bar1>bar\\d).*");
    testRunner.setProperty("regex.result4", "(?s).*?(?:bar\\d).*?(?<bar2>bar\\d).*?(?<bar3>bar3).*");
    testRunner.setProperty("regex.result5", "(?s).*(?<bar3>bar\\d).*");
    testRunner.setProperty("regex.result6", "(?s)^(?<all>.*)$");
    testRunner.setProperty("regex.result7", "(?s)(?<miss>XXX)");
    testRunner.setProperty(ENABLE_NAMED_GROUPS, "true");
    testRunner.enqueue(SAMPLE_STRING.getBytes("UTF-8"));
    testRunner.run();

    testRunner.assertAllFlowFilesTransferred(ExtractText.REL_MATCH, 1);
    final MockFlowFile out = testRunner.getFlowFilesForRelationship(ExtractText.REL_MATCH).get(0);
    java.util.Map<String, String> attributes = out.getAttributes();

    // Attributes are keyed by <property name>.<group name>.
    out.assertAttributeEquals("regex.result1.all", SAMPLE_STRING);
    out.assertAttributeEquals("regex.result2.bar1", "bar1");
    out.assertAttributeEquals("regex.result3.bar1", "bar1");
    out.assertAttributeEquals("regex.result4.bar2", "bar2");
    out.assertAttributeEquals("regex.result4.bar3", "bar3");
    out.assertAttributeEquals("regex.result5.bar3", "bar3");
    out.assertAttributeEquals("regex.result6.all", SAMPLE_STRING);
    // result7 never matches, so no attribute is written for it.
    out.assertAttributeEquals("regex.result7.miss", null);
}
{code}
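
The rule described above (use names only when every capturing group is named, otherwise fall back to indices) could be sketched along these lines. This is a simplified illustration under stated assumptions, not the processor's actual code: {{NamedGroupSketch}}, {{groupNames}} and {{extract}} are hypothetical names, the group names are recovered by scanning the pattern text (which can be fooled by an escaped "\(" followed by "?<x>"), and the index-based branch simply writes groups 0..n.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the "all groups named, else fall back" rule. */
final class NamedGroupSketch {

    // Start of a named group such as "(?<all>"; lookbehinds "(?<=" and "(?<!" do not match.
    private static final Pattern NAMED_GROUP =
            Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");

    /** Collects group names in order of appearance in the pattern text. */
    static List<String> groupNames(String regex) {
        List<String> names = new ArrayList<>();
        Matcher m = NAMED_GROUP.matcher(regex);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    /** Returns the attributes one property would produce for the given content. */
    static Map<String, String> extract(String property, String regex,
                                       String content, boolean enableNamedGroups) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher matcher = Pattern.compile(regex).matcher(content);
        if (!matcher.find()) {
            return attributes; // no match, no attributes
        }
        List<String> names = groupNames(regex);
        // Names are used only if every capturing group has one.
        boolean useNames = enableNamedGroups
                && !names.isEmpty()
                && names.size() == matcher.groupCount();
        if (useNames) {
            for (String name : names) {
                attributes.put(property + "." + name, matcher.group(name));
            }
        } else {
            // Fallback: index-based keys, e.g. regex.result1.0, regex.result1.1
            for (int i = 0; i <= matcher.groupCount(); i++) {
                attributes.put(property + "." + i, matcher.group(i));
            }
        }
        return attributes;
    }
}
```

With the regex.result4 pattern from the test this yields regex.result4.bar2 and regex.result4.bar3, while a pattern that mixes named and unnamed groups produces only numbered keys.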



was (Author: ottobackwards):
[~pvillard]

Something like this?  The restriction on the property to enable is:  if you 
want name groups, all your capturing groups MUST be named.  You can't mix named 
and unnamed captures.


{code:java}
final String SAMPLE_STRING = "foo\r\nbar1\r\nbar2\r\nbar3\r\nhello\r\nworld\r\n";

@Test
public void testProcessorWithGroupNames() throws Exception {
    final TestRunner testRunner = TestRunners.newTestRunner(new ExtractText());

    // Every capturing group is named; (?:...) is non-capturing and is not reported.
    testRunner.setProperty("regex.result1", "(?s)(?<all>.*)");
    testRunner.setProperty("regex.result2", "(?s).*(?<bar1>bar1).*");
    testRunner.setProperty("regex.result3", "(?s).*?(?<bar1>bar\\d).*");
    testRunner.setProperty("regex.result4", "(?s).*?(?:bar\\d).*?(?<bar2>bar\\d).*?(?<bar3>bar3).*");
    testRunner.setProperty("regex.result5", "(?s).*(?<bar3>bar\\d).*");
    testRunner.setProperty("regex.result6", "(?s)^(?<all>.*)$");
    testRunner.setProperty("regex.result7", "(?s)(?<miss>XXX)");
    testRunner.setProperty(ENABLE_NAMED_GROUPS, "true");
    testRunner.enqueue(SAMPLE_STRING.getBytes("UTF-8"));
    testRunner.run();

    testRunner.assertAllFlowFilesTransferred(ExtractText.REL_MATCH, 1);
    final MockFlowFile out = testRunner.getFlowFilesForRelationship(ExtractText.REL_MATCH).get(0);
    java.util.Map<String, String> attributes = out.getAttributes();

    // Attributes are keyed by <property name>.<group name>.
    out.assertAttributeEquals("regex.result1.all", SAMPLE_STRING);
    out.assertAttributeEquals("regex.result2.bar1", "bar1");
    out.assertAttributeEquals("regex.result3.bar1", "bar1");
    out.assertAttributeEquals("regex.result4.bar2", "bar2");
    out.assertAttributeEquals("regex.result4.bar3", "bar3");
    out.assertAttributeEquals("regex.result5.bar3", "bar3");
    out.assertAttributeEquals("regex.result6.all", SAMPLE_STRING);
    // result7 never matches, so no attribute is written for it.
    out.assertAttributeEquals("regex.result7.miss", null);
}
{code}



[jira] [Comment Edited] (NIFI-2072) Support named captures in ExtractText

2020-06-26 Thread Malthe Borch (Jira)


[ 
https://issues.apache.org/jira/browse/NIFI-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17146104#comment-17146104
 ] 

Malthe Borch edited comment on NIFI-2072 at 6/26/20, 7:53 AM:
--

[~pvillard] did you ever make any headway with this or is it open for work, 
assuming that you are still happy with the suggested behavior?

I did not know about the {{ExtractGrok}} processor before I saw this issue. In 
terms of usability, I think it still does make sense to improve {{ExtractText}} 
to support named capturing groups.

I bet most users will not be familiar with Grok and not immediately understand 
that it might be useful to them.


was (Author: malthe):
[~pvillard] did you ever make any headway with this or is it open for work, 
assuming that you are still happy with the suggested behavior?
