[jira] [Comment Edited] (NIFI-2072) Support named captures in ExtractText
[ https://issues.apache.org/jira/browse/NIFI-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152904#comment-17152904 ]

Malthe Borch edited comment on NIFI-2072 at 7/7/20, 5:14 PM:
-------------------------------------------------------------

I would be happy then with "Enable named group support".

In terms of what happens if an unnamed capture group is used, I think it would be better to either:
- Allow it. I often enough see named captures mixed with unnamed ones, simply because the author has not bothered to use a non-capturing group.
- Implement a validation step that scans the expression for unnamed capture groups (i.e. those that are neither named nor non-capturing). It would then be an error to use a regex that has unnamed capture groups.

was (Author: malthe):
I would be happy then with "Enable named group support".

In terms of what happens if an unnamed capture group is used, I think it would be better to either:
- Allow it.
- Implement a validation step that scans the expression for unnamed capture groups (i.e. those that are not named and not non-capturing).

> Support named captures in ExtractText
> -------------------------------------
>
>                 Key: NIFI-2072
>                 URL: https://issues.apache.org/jira/browse/NIFI-2072
>             Project: Apache NiFi
>          Issue Type: Improvement
>            Reporter: Joey Frazee
>            Assignee: Otto Fowler
>            Priority: Major
>              Labels: extracttext
>
> ExtractText currently captures and creates attributes using numeric indices
> (e.g., attribute.name.0, attribute.name.1, etc.) whether or not the capture
> groups are named, i.e., patterns like (?<name>\w+).
> In addition to being more faithful to the provided regexes, named captures
> could help simplify data flows because you wouldn't have to add superfluous
> UpdateAttribute steps which are just renaming the indexed captures to more
> interpretable names.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
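The validation step suggested in the comment above could look roughly like the following. This is only a sketch, not the code proposed for ExtractText: the class and method names (UnnamedGroupCheck, findUnnamedGroups) are hypothetical, and the scan assumes java.util.regex syntax, where an opening parenthesis that is not followed by "?" starts an unnamed capturing group. It skips escaped characters, \Q...\E sections and character classes, but it does not handle comments in (?x) mode or a literal "]" placed first in a character class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of NiFi.
final class UnnamedGroupCheck {

    /** Returns the offset of every '(' that opens an unnamed capturing group. */
    static List<Integer> findUnnamedGroups(String regex) {
        List<Integer> offsets = new ArrayList<>();
        int classDepth = 0;      // nesting depth inside [...] character classes
        boolean quoted = false;  // true while inside a \Q...\E literal section
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (quoted) {
                if (regex.startsWith("\\E", i)) {
                    quoted = false;
                    i++; // skip the 'E'
                }
                continue;
            }
            if (c == '\\') {
                if (regex.startsWith("\\Q", i)) {
                    quoted = true;
                }
                i++; // skip the escaped character
                continue;
            }
            if (c == '[') {
                classDepth++;
            } else if (c == ']' && classDepth > 0) {
                classDepth--;
            } else if (c == '(' && classDepth == 0) {
                // "(?" introduces named groups, non-capturing groups, flags and
                // lookarounds; a bare "(" is an unnamed capturing group.
                boolean special = i + 1 < regex.length() && regex.charAt(i + 1) == '?';
                if (!special) {
                    offsets.add(i);
                }
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        // prints "[]"
        System.out.println(findUnnamedGroups("(?s).*?(?:bar\\d).*?(?<bar2>bar\\d)"));
        // prints "[14]"
        System.out.println(findUnnamedGroups("(?<first>\\w+) (\\w+)"));
    }
}
```

A validator built on this could reject the property value whenever the returned list is non-empty, or only warn if mixing named and unnamed groups is allowed.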
[jira] [Comment Edited] (NIFI-2072) Support named captures in ExtractText
[ https://issues.apache.org/jira/browse/NIFI-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151827#comment-17151827 ]

Malthe Borch edited comment on NIFI-2072 at 7/6/20, 7:20 AM:
-------------------------------------------------------------

Is it really necessary to _enable_ named capture group rather than just use them?

If I don't want a named capture group, I suppose I am just not going to name them, opting instead for enumerated ones.

was (Author: malthe):
Is it really necessary to _enable _named capture group rather than just use them?

If I don't want a named capture group, I suppose I am just not going to name them, opting instead for enumerated ones.
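As background to the question above: in java.util.regex a named group also keeps its positional number, so name-based and index-based lookups can coexist on the same match. The snippet below is a small illustration of that behaviour only; the class name NamedAndNumberedGroups and the sample pattern are made up for this example, and it assumes nothing about how ExtractText itself is implemented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of NiFi.
final class NamedAndNumberedGroups {

    /** Describes the first match both by group name and by group index. */
    static String describe(String regex, String groupName, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return "no match";
        }
        StringBuilder sb = new StringBuilder();
        sb.append(groupName).append('=').append(m.group(groupName));
        // Named groups are counted together with unnamed ones, left to right.
        for (int i = 1; i <= m.groupCount(); i++) {
            sb.append(", ").append(i).append('=').append(m.group(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints "word=bar, 1=bar, 2=42"
        System.out.println(describe("(?<word>[a-z]+)-(\\d+)", "word", "bar-42"));
    }
}
```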
[jira] [Comment Edited] (NIFI-2072) Support named captures in ExtractText
[ https://issues.apache.org/jira/browse/NIFI-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151827#comment-17151827 ]

Malthe Borch edited comment on NIFI-2072 at 7/6/20, 7:20 AM:
-------------------------------------------------------------

Is it really necessary to _enable _named capture group rather than just use them?

If I don't want a named capture group, I suppose I am just not going to name them, opting instead for enumerated ones.

was (Author: malthe):
Is it really necessary to enable named capture group rather than just use them?

If I don't want a named capture group, I suppose I am just not going to name them, opting instead for enumerated ones.
[jira] [Comment Edited] (NIFI-2072) Support named captures in ExtractText
[ https://issues.apache.org/jira/browse/NIFI-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148956#comment-17148956 ]

Otto Fowler edited comment on NIFI-2072 at 6/30/20, 9:21 PM:
-------------------------------------------------------------

[~pvillard] Something like this?

The restriction on the property to enable is: if you want named groups, all your capturing groups MUST be named. You can't mix named and unnamed captures. If they don't match, it falls back to the old way. But I haven't written the verification yet either.

{code:java}
final String SAMPLE_STRING = "foo\r\nbar1\r\nbar2\r\nbar3\r\nhello\r\nworld\r\n";

@Test
public void testProcessorWithGroupNames() throws Exception {
    final TestRunner testRunner = TestRunners.newTestRunner(new ExtractText());

    // Every capturing group is named; (?:...) groups are non-capturing.
    testRunner.setProperty("regex.result1", "(?s)(?<all>.*)");
    testRunner.setProperty("regex.result2", "(?s).*(?<bar1>bar1).*");
    testRunner.setProperty("regex.result3", "(?s).*?(?<bar1>bar\\d).*");
    testRunner.setProperty("regex.result4", "(?s).*?(?:bar\\d).*?(?<bar2>bar\\d).*?(?<bar3>bar3).*");
    testRunner.setProperty("regex.result5", "(?s).*(?<bar3>bar\\d).*");
    testRunner.setProperty("regex.result6", "(?s)^(?<all>.*)$");
    testRunner.setProperty("regex.result7", "(?s)(?<miss>XXX)");
    testRunner.setProperty(ENABLE_NAMED_GROUPS, "true");

    testRunner.enqueue(SAMPLE_STRING.getBytes("UTF-8"));
    testRunner.run();

    testRunner.assertAllFlowFilesTransferred(ExtractText.REL_MATCH, 1);
    final MockFlowFile out = testRunner.getFlowFilesForRelationship(ExtractText.REL_MATCH).get(0);
    java.util.Map<String, String> attributes = out.getAttributes();

    // Attributes are keyed by <property name>.<group name> instead of by index.
    out.assertAttributeEquals("regex.result1.all", SAMPLE_STRING);
    out.assertAttributeEquals("regex.result2.bar1", "bar1");
    out.assertAttributeEquals("regex.result3.bar1", "bar1");
    out.assertAttributeEquals("regex.result4.bar2", "bar2");
    out.assertAttributeEquals("regex.result4.bar3", "bar3");
    out.assertAttributeEquals("regex.result5.bar3", "bar3");
    out.assertAttributeEquals("regex.result6.all", SAMPLE_STRING);
    // result7 never matches, so its attribute must be absent.
    out.assertAttributeEquals("regex.result7.miss", null);
}
{code}

was (Author: ottobackwards):
[~pvillard] Something like this?

The restriction on the property to enable is: if you want name groups, all your capturing groups MUST be named. You can't mix named and unnamed captures.

{code:java}
final String SAMPLE_STRING = "foo\r\nbar1\r\nbar2\r\nbar3\r\nhello\r\nworld\r\n";

@Test
public void testProcessorWithGroupNames() throws Exception {
    final TestRunner testRunner = TestRunners.newTestRunner(new ExtractText());

    testRunner.setProperty("regex.result1", "(?s)(?<all>.*)");
    testRunner.setProperty("regex.result2", "(?s).*(?<bar1>bar1).*");
    testRunner.setProperty("regex.result3", "(?s).*?(?<bar1>bar\\d).*");
    testRunner.setProperty("regex.result4", "(?s).*?(?:bar\\d).*?(?<bar2>bar\\d).*?(?<bar3>bar3).*");
    testRunner.setProperty("regex.result5", "(?s).*(?<bar3>bar\\d).*");
    testRunner.setProperty("regex.result6", "(?s)^(?<all>.*)$");
    testRunner.setProperty("regex.result7", "(?s)(?<miss>XXX)");
    testRunner.setProperty(ENABLE_NAMED_GROUPS, "true");

    testRunner.enqueue(SAMPLE_STRING.getBytes("UTF-8"));
    testRunner.run();

    testRunner.assertAllFlowFilesTransferred(ExtractText.REL_MATCH, 1);
    final MockFlowFile out = testRunner.getFlowFilesForRelationship(ExtractText.REL_MATCH).get(0);
    java.util.Map<String, String> attributes = out.getAttributes();

    out.assertAttributeEquals("regex.result1.all", SAMPLE_STRING);
    out.assertAttributeEquals("regex.result2.bar1", "bar1");
    out.assertAttributeEquals("regex.result3.bar1", "bar1");
    out.assertAttributeEquals("regex.result4.bar2", "bar2");
    out.assertAttributeEquals("regex.result4.bar3", "bar3");
    out.assertAttributeEquals("regex.result5.bar3", "bar3");
    out.assertAttributeEquals("regex.result6.all", SAMPLE_STRING);
    out.assertAttributeEquals("regex.result7.miss", null);
}
{code}
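To make the attribute naming exercised by the test above concrete, here is a rough sketch of how named groups could be turned into "<property name>.<group name>" entries with plain java.util.regex. It is an illustration under assumptions, not the ExtractText implementation: the class NamedGroupAttributes and its extract method are hypothetical, only the first match is considered, and group names are recovered by scanning the pattern text for "(?<name>", which would be fooled by an escaped parenthesis such as "\(?<x>".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of NiFi.
final class NamedGroupAttributes {

    // Java group names start with a letter, followed by letters or digits.
    private static final Pattern GROUP_NAME =
            Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");

    /** Maps each named group of the first match to "prefix.groupName". */
    static Map<String, String> extract(String prefix, String regex, String content) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher m = Pattern.compile(regex).matcher(content);
        if (!m.find()) {
            return attributes; // no match, so no attributes
        }
        Matcher names = GROUP_NAME.matcher(regex);
        while (names.find()) {
            String name = names.group(1);
            String value = m.group(name);
            if (value != null) { // groups that did not participate are skipped
                attributes.put(prefix + "." + name, value);
            }
        }
        return attributes;
    }

    public static void main(String[] args) {
        String sample = "foo\r\nbar1\r\nbar2\r\nbar3\r\nhello\r\nworld\r\n";
        // prints "{regex.result4.bar2=bar2, regex.result4.bar3=bar3}"
        System.out.println(extract("regex.result4",
                "(?s).*?(?:bar\\d).*?(?<bar2>bar\\d).*?(?<bar3>bar3).*", sample));
        // prints "{}"
        System.out.println(extract("regex.result7", "(?s)(?<miss>XXX)", sample));
    }
}
```

Scanning the pattern text keeps the sketch usable on older Java versions; newer JDKs expose the group names directly, which would avoid the scan.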
[jira] [Comment Edited] (NIFI-2072) Support named captures in ExtractText
[ https://issues.apache.org/jira/browse/NIFI-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17146104#comment-17146104 ]

Malthe Borch edited comment on NIFI-2072 at 6/26/20, 7:53 AM:
--------------------------------------------------------------

[~pvillard] did you ever make any headway with this or is it open for work, assuming that you are still happy with the suggested behavior?

I did not know about the {{ExtractGrok}} processor before I saw this issue. In terms of usability, I think it still does make sense to improve {{ExtractText}} to support named capturing groups. I bet most users will not be familiar with Grok and not immediately understand that it might be useful to them.

was (Author: malthe):
[~pvillard] did you ever make any headway with this or is it open for work, assuming that you are still happy with the suggested behavior?