[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885052#comment-16885052 ]

ASF GitHub Bot commented on DRILL-7293:
---

arina-ielchiieva commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Convert the regex ("log") plugin to use EVF
> ---
>
> Key: DRILL-7293
> URL: https://issues.apache.org/jira/browse/DRILL-7293
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.16.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Priority: Major
> Fix For: 1.17.0
>
> The "log" plugin (which uses a regex to define the row format) is the subject
> of Chapter 12 of the Learning Apache Drill book (though the version in the
> book is simpler than the one in the master branch.)
> The recently-completed "Enhanced Vector Framework" (EVF, AKA the "row set
> framework") gives Drill control over the size of batches created by readers,
> and allows readers to use the recently-added provided schema mechanism.
> We wish to use the log reader as an example for how to convert a Drill format
> plugin to use the EVF so that other developers can convert their own plugins.
> This PR provides the first set of log plugin changes to enable us to publish
> a tutorial on the EVF.

-- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885051#comment-16885051 ]

arina-ielchiieva commented on issue #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#issuecomment-511349056

+1
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884788#comment-16884788 ]

paul-rogers commented on issue #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#issuecomment-511243796

Rebased on master. Addressed review comments. Squashed commits.
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881342#comment-16881342 ]

arina-ielchiieva commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r301666390

## File path: exec/java-exec/src/test/java/org/apache/drill/exec/store/log/TestLogReader.java ##
@@ -607,4 +611,65 @@ public void testSchemaOnlyWithMissingCols() throws Exception {
       client.resetSession(ExecConstants.STORE_TABLE_USE_SCHEMA_FILE);
     }
   }
+
+  @Test
+  public void testEmptyPattern() throws Exception {
+    String tablePath = buildTable(tableFuncDir, "tf", "emptyRegex",
+        "sample.logf", "/regex/simple.log1");
+    try {
+      String sql = "SELECT * FROM %s";
+      client.queryBuilder().sql(sql, tablePath).run();
+    } catch (Exception e) {
+      assertTrue(e.getMessage().contains("Regex property is required"));
+    }
+  }
+
+  /**
+   * Test the ability to use table functions to specify the regex.
+   */
+
+  @Test
+  public void testTableFunction() throws Exception {
+    String tablePath = buildTable(tableFuncDir, "tf", "table1",
+        "sample.logf", "/regex/simple.log1");
+
+    // Run a query using a table function.
+
+    String escaped = DATE_ONLY_PATTERN.replace("\\", "");
+    String sql = "SELECT * FROM table(%s(type => '%s', regex => '%s', maxErrors => 10))";
+    // String sql = "SELECT * FROM %s";
+    RowSet results = client.queryBuilder().sql(sql, tablePath, LogFormatPlugin.PLUGIN_NAME, escaped).rowSet();
+
+    // Verify that the returned data used the schema.
+
+    BatchSchema expectedSchema = new SchemaBuilder()
+        .addNullable("field_0", MinorType.VARCHAR)
+        .addNullable("field_1", MinorType.VARCHAR)
+        .addNullable("field_2", MinorType.VARCHAR)
+        .build();
+
+    RowSet expected = client.rowSetBuilder(expectedSchema)
+        .addRow("2017", "12", "17")
+        .addRow("2017", "12", "18")
+        .addRow("2017", "12", "19")
+        .build();
+
+    RowSetUtilities.verify(expected, results);
+  }
+
+  @Test
+  public void testTableFunctionNoGroups() throws Exception {
+    String tablePath = buildTable(tableFuncDir, "tf", "noGroups",
+        "sample.logf", "/regex/simple.log1");
+
+    // Use a table function to pass in a regex without a group.
+
+    try {
+      String sql = "SELECT * FROM table(%s(type => '%s', regex => '''foo'''))";
+      client.queryBuilder().sql(sql, tablePath, LogFormatPlugin.PLUGIN_NAME).run();

Review comment:
Same here.
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881341#comment-16881341 ]

arina-ielchiieva commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r301667671

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/log/LogBatchReader.java ##
@@ -43,60 +44,135 @@
   public static final String RAW_LINE_COL_NAME = "_raw";
   public static final String UNMATCHED_LINE_COL_NAME = "_unmatched_rows";

-  private final LogFormatConfig formatConfig;
-  private final Pattern pattern;
-  private final TupleMetadata schema;
-  private final int maxErrors;
+  public static class LogReaderConfig {
+    protected final LogFormatPlugin plugin;
+    protected final Pattern pattern;
+    protected final TupleMetadata schema;
+    protected final boolean asArray;
+    protected final int groupCount;
+    protected final int maxErrors;
+
+    public LogReaderConfig(LogFormatPlugin plugin, Pattern pattern,
+        TupleMetadata schema, boolean asArray,
+        int groupCount, int maxErrors) {
+      this.plugin = plugin;
+      this.pattern = pattern;
+      this.schema = schema;
+      this.asArray = asArray;
+      this.groupCount = groupCount;
+      this.maxErrors = maxErrors;
+    }
+  }
+
+  /**
+   * Write group values to value vectors.
+   */
+
+  private interface VectorWriter {
+    void loadVectors(Matcher m);
+  }
+
+  /**
+   * Write group values to individual scalar columns.
+   */
+
+  private static class ScalarGroupWriter implements VectorWriter {
+
+    private final TupleWriter rowWriter;
+
+    public ScalarGroupWriter(TupleWriter rowWriter) {
+      this.rowWriter = rowWriter;
+    }
+
+    @Override
+    public void loadVectors(Matcher m) {
+      for (int i = 0; i < m.groupCount(); i++) {
+        String value = m.group(i + 1);
+        if (value != null) {
+          rowWriter.scalar(i).setString(value);
+        }
+      }
+    }
+  }
+
+  /**
+   * Write group values to the columns[] array.
+   */
+
+  private static class ColumnsArrayWriter implements VectorWriter {
+
+    private final ScalarWriter elementWriter;
+
+    public ColumnsArrayWriter(TupleWriter rowWriter) {
+      elementWriter = rowWriter.array(0).scalar();
+    }
+
+    @Override
+    public void loadVectors(Matcher m) {
+      for (int i = 0; i < m.groupCount(); i++) {
+        String value = m.group(i + 1);
+        elementWriter.setString(value == null ? "" : value);
+      }
+    }
+  }
+
+  private final LogReaderConfig config;
   private FileSplit split;
   private BufferedReader reader;
-  private int capturingGroups;
   private ResultSetLoader loader;
+  private VectorWriter vectorWriter;
   private ScalarWriter rawColWriter;
   private ScalarWriter unmatchedColWriter;
   private boolean saveMatchedRows;
   private int lineNumber;
   private int errorCount;

-  public LogBatchReader(LogFormatConfig formatConfig, Pattern pattern,
-      TupleMetadata schema, int maxErrors) {
-    this.formatConfig = formatConfig;
-    this.pattern = pattern;
-    this.schema = schema;
-    this.maxErrors = maxErrors;
+  public LogBatchReader(LogReaderConfig config) {
+    this.config = config;
   }

   @Override
   public boolean open(FileSchemaNegotiator negotiator) {
     split = negotiator.split();
-    setupPattern();
-    negotiator.setTableSchema(schema, true);
+    negotiator.setTableSchema(config.schema, true);
     loader = negotiator.build();
     bindColumns(loader.writer());
     openFile(negotiator);
     return true;
   }

-  private void setupPattern() {
-    // Turns out the only way to learn the capturing group count
-    // is to create a matcher. We do so with a dummy string.
-
-    Matcher m = pattern.matcher("dummy");
-    capturingGroups = m.groupCount();
-  }
-
   private void bindColumns(RowSetLoader writer) {
-    for (int i = 0; i < capturingGroups; i++) {
-      saveMatchedRows |= writer.scalar(i).isProjected();
-    }
     rawColWriter = writer.scalar(RAW_LINE_COL_NAME);
-    saveMatchedRows |= rawColWriter.isProjected();
     unmatchedColWriter = writer.scalar(UNMATCHED_LINE_COL_NAME);
+    saveMatchedRows = rawColWriter.isProjected();

     // If no match-case columns are projected, and the unmatched
     // columns is unprojected, then we want to count (matched)
     // rows.
     saveMatchedRows |= !unmatchedColWriter.isProjected();
+
+    // This reader is unusual: it can save only unmatched rows,

Review comment:
Not quite sure I understand meaning of such reader.
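Two details in the diff above are easy to miss: the capturing-group count is read from a `Matcher` (the removed `setupPattern()` did this with a dummy string), and the writers loop over `m.group(i + 1)` because group 0 is the whole match. The following is a minimal sketch of both points using only `java.util.regex`. The class name, the sample pattern, and the `List<String>` standing in for Drill's column writers are assumptions for illustration and are not taken from the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCountSketch {

  // Pattern has no groupCount() method; the count is read from a Matcher,
  // which can be created against any input, even one that never matches.
  static int groupCount(Pattern pattern) {
    return pattern.matcher("dummy").groupCount();
  }

  // Collect capturing groups 1..n. Group 0 is the entire match, so it is
  // skipped. Optional groups that did not participate come back as null and
  // are stored as empty strings, mirroring the ColumnsArrayWriter branch.
  static List<String> groups(Pattern pattern, String line) {
    List<String> values = new ArrayList<>();
    Matcher m = pattern.matcher(line);
    if (!m.matches()) {
      return values; // unmatched line: nothing to write
    }
    for (int i = 0; i < m.groupCount(); i++) {
      String value = m.group(i + 1);
      values.add(value == null ? "" : value);
    }
    return values;
  }

  public static void main(String[] args) {
    // Hypothetical pattern: three date parts plus an optional trailing group.
    Pattern p = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})( .*)?");
    System.out.println(groupCount(p));
    System.out.println(groups(p, "2017-12-17"));
  }
}
```

Computing the count once and passing it in, as `LogReaderConfig.groupCount` appears to do, avoids creating a throwaway matcher in every reader.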
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881340#comment-16881340 ]

arina-ielchiieva commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r301666308

## File path: exec/java-exec/src/test/java/org/apache/drill/exec/store/log/TestLogReader.java ##
@@ -607,4 +611,65 @@ public void testSchemaOnlyWithMissingCols() throws Exception {
       client.resetSession(ExecConstants.STORE_TABLE_USE_SCHEMA_FILE);
     }
   }
+
+  @Test
+  public void testEmptyPattern() throws Exception {
+    String tablePath = buildTable(tableFuncDir, "tf", "emptyRegex",
+        "sample.logf", "/regex/simple.log1");
+    try {
+      String sql = "SELECT * FROM %s";
+      client.queryBuilder().sql(sql, tablePath).run();

Review comment:
Please add `fail()`
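The `fail()` request closes a common gap in try/catch tests: if the query unexpectedly succeeds, the `catch` block never runs and the test passes without checking anything. Here is a minimal, self-contained sketch of the pattern. `runQuery` and the class name are hypothetical stand-ins for the Drill test client, and a local `fail` helper replaces JUnit's `fail()` so the snippet runs without a test framework.

```java
class ExpectedFailureSketch {

  // Hypothetical stand-in for client.queryBuilder().sql(...).run().
  static void runQuery(String regex) throws Exception {
    if (regex == null || regex.isEmpty()) {
      throw new IllegalStateException("Regex property is required");
    }
  }

  // Stand-in for JUnit's fail(): it throws AssertionError, which is an Error
  // and not an Exception, so the catch clause below does not swallow it.
  static void fail(String message) {
    throw new AssertionError(message);
  }

  // Returns normally only if the query failed with the expected message.
  static void expectRegexRequired(String regex) {
    try {
      runQuery(regex);
      fail("Expected the query to fail");
    } catch (Exception e) {
      if (!e.getMessage().contains("Regex property is required")) {
        throw new AssertionError("Unexpected message: " + e.getMessage());
      }
    }
  }

  public static void main(String[] args) {
    expectRegexRequired(""); // passes: the expected error is raised
    try {
      expectRegexRequired("(\\d+)"); // query succeeds, so the check must fail
      System.out.println("missed");
    } catch (AssertionError e) {
      System.out.println("caught: " + e.getMessage());
    }
  }
}
```

Without the `fail(...)` line, the second call in `main` would return silently, which is the situation the review comment is guarding against.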
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875900#comment-16875900 ]

paul-rogers commented on issue #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#issuecomment-507084448

Rebased on latest master. One unrelated test fails:

```
[ERROR] Errors:
[ERROR] TestDynamicUDFSupport.testDropFunction » UserRemote VALIDATION ERROR: From lin...
```

This test fails about 50% of the time, so it is probably not related to this change. The failure prevents running subsequent tests. But, mock data source support is likely not used by the other packages.
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16873852#comment-16873852 ]

paul-rogers commented on issue #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#issuecomment-506209593

@arina-ielchiieva, added a unit test to show that the schema-only table function works.

Tried to create a test that combined a "plugin" table function with the "schema" attribute. This failed due to the unfortunate use of "schema" as a plugin property name. You've pointed out this issue all along; I finally understood why it was a problem. Still, I'm reluctant to change the config property name for fear of breaking compatibility.

As it turns out, this limitation is only a minor nuisance, since the only reason to combine the two kinds of table functions is to specify the regex property. A unit test shows that the regex can be specified as a table property instead.

Also, went ahead and added support for the `columns` column. If no schema is provided (not in the plugin config, not in a table function, not in a provided schema), then rather than creating a set of dummy fields `field_0`, `field_1`, etc., the plugin now follows the text format plugin and puts the fields into the `columns` array. The dummy fields are still used if the user specifies at least one column schema but the regex has more groups than specified columns.

This means that, if the user uses a table function to specify just the regex, the user gets a reasonable result: the fields come back in the `columns` array. Unit tests show the new `columns` array support.
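The column-layout rule described in this comment can be sketched as a small decision function. This is an illustration of the rule as stated in the comment, not the plugin's code: the class and method names are hypothetical, and numbering the dummy fields by group index (`field_1` for the second group, and so on) is an assumption based on the `field_0`, `field_1` naming mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class ColumnLayoutSketch {

  // Returns the output column names for a regex with groupCount capturing
  // groups, given the column names the user declared (possibly none).
  static List<String> outputColumns(List<String> declared, int groupCount) {
    List<String> names = new ArrayList<>();
    if (declared.isEmpty()) {
      // No schema from any source: a single array column named "columns",
      // as the text format plugin does.
      names.add("columns");
      return names;
    }
    for (int i = 0; i < groupCount; i++) {
      // Declared names come first; groups beyond them get dummy names.
      // (Assumption: the dummy name uses the group index.)
      names.add(i < declared.size() ? declared.get(i) : "field_" + i);
    }
    return names;
  }

  public static void main(String[] args) {
    System.out.println(outputColumns(Collections.emptyList(), 3));
    System.out.println(outputColumns(Arrays.asList("year"), 3));
  }
}
```

Declared columns beyond the number of regex groups are simply ignored in this sketch; how the plugin treats that case is not stated in the thread.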
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872983#comment-16872983 ]

arina-ielchiieva commented on issue #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#issuecomment-505737730

@paul-rogers this was introduced in https://issues.apache.org/jira/browse/DRILL-6965 as part of the schema provisioning project. The Jira has the description and carries the doc-impacting label, so hopefully it will be documented some day.

Yes, you are right: plugin configs are initialized the usual way and, once the `DrillTable` instance is created, it is enriched with the schema from the schema parameter. All the magic is done in the `org.apache.drill.exec.store.AbstractSchema` class.
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872979#comment-16872979 ]

paul-rogers commented on issue #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#issuecomment-505735209

@arina-ielchiieva, thanks, I didn't realize we'd extended the table functions. Out of curiosity, how does the schema form know which plugin config to use? Or, does this form create a schema object and use the normal path for the plugin config? Might we have this documented somewhere?
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872963#comment-16872963 ]

arina-ielchiieva commented on issue #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#issuecomment-505733281

@paul-rogers thanks for making the changes. Regarding the List parameter in a table function: yes, it does not work and it's a known issue; format plugins should not use lists. However, in my previous comments I asked you to try the schema parameter as a String, not as a List. Adding examples of the queries I expect should work once again:

1. SELECT * FROM table(dfs.tf.noGroups( type => 'logRegex', regex => '(\\d\\d\\d\\d)-(\\d\\d)-(\\d\\d) .*', schema=>'inline=(month varchar)'))
2. select * from table(t(schema=>'inline=(col1 varchar)'))

Examples of the tests can be found in `org.apache.drill.TestSchemaWithTableFunction`.
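For tests that build such table-function queries from Java, the two forms can be assembled with small string helpers. This is a sketch only. The helper and class names are hypothetical, the parameter names (`type`, `regex`, `schema`) and the `inline=(...)` syntax are copied from the queries in this thread, and the quoting rule shown is plain SQL string-literal quoting (single quotes doubled). The two forms are kept separate because a later comment in this thread reports that combining the plugin form with the `schema` attribute failed for the log plugin.

```java
class TableFunctionSqlSketch {

  // Standard SQL string-literal quoting: wrap in single quotes and double
  // any embedded single quote. Backslashes are passed through untouched.
  static String literal(String value) {
    return "'" + value.replace("'", "''") + "'";
  }

  // Schema-only form, e.g. table(t(schema => 'inline=(col1 varchar)')).
  static String withInlineSchema(String table, String columns) {
    return String.format("SELECT * FROM table(%s(schema => %s))",
        table, literal("inline=(" + columns + ")"));
  }

  // Plugin form: the format type plus a regex passed as a table property.
  static String withRegex(String table, String type, String regex) {
    return String.format("SELECT * FROM table(%s(type => %s, regex => %s))",
        table, literal(type), literal(regex));
  }

  public static void main(String[] args) {
    System.out.println(withInlineSchema("t", "col1 varchar"));
    System.out.println(withRegex("dfs.tf.noGroups", "logRegex", "(\\d+)-(\\d+)"));
  }
}
```

How backslashes in the regex must be escaped inside the SQL literal is not settled by this thread, so the sketch leaves them exactly as the caller supplies them.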
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872948#comment-16872948 ]

ASF GitHub Bot commented on DRILL-7293:
---

paul-rogers commented on issue #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#issuecomment-505730216

One solution to the schema issue for table functions is to use the `columns` trick from the text reader: if no schema is provided, then instead of creating a set of `field_n` columns, create a single `columns` array column. Specifically, if there is no schema defined for the table, and no schema in the plugin config (perhaps because the plugin config was created via a table function), then just use `columns`.

If I get some time, I'll try this out. With the EVF, this might actually be pretty simple. It might be best to add such a feature via another PR.
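The `columns` fallback described in the comment above can be illustrated with a small, self-contained sketch in plain Java. This is not Drill or EVF code: the class `ColumnsFallbackSketch`, the method `parseLine`, and the use of a `Map` as a stand-in for a row are all assumptions made for illustration. When no field names are supplied, the regex groups go into a single `columns` list; otherwise each group is stored under its configured name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ColumnsFallbackSketch {

  /**
   * Hypothetical helper (not part of Drill): parses one log line.
   * Returns null when the line does not match the pattern.
   */
  public static Map<String, Object> parseLine(
      Pattern pattern, List<String> fieldNames, String line) {
    Matcher m = pattern.matcher(line);
    if (!m.matches()) {
      return null;
    }
    Map<String, Object> row = new LinkedHashMap<>();
    if (fieldNames.isEmpty()) {
      // No schema at all: emit a single array-like "columns" value.
      List<String> columns = new ArrayList<>();
      for (int i = 1; i <= m.groupCount(); i++) {
        columns.add(m.group(i));
      }
      row.put("columns", columns);
    } else {
      for (int i = 1; i <= m.groupCount(); i++) {
        // Assumption: groups without a configured name get field_<n>.
        String name = i <= fieldNames.size() ? fieldNames.get(i - 1) : "field_" + i;
        row.put(name, m.group(i));
      }
    }
    return row;
  }
}
```

With the date regex from this thread and an empty name list, a line such as `2019-06-25 some text` would yield one `columns` entry holding the three captured groups, so a query could address them by array index instead of by generated field names.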
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872907#comment-16872907 ]

ASF GitHub Bot commented on DRILL-7293:
---

paul-rogers commented on issue #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#issuecomment-505714454

@arina-ielchiieva, I was able to get the plugin to work for this query:

```
SELECT * FROM table(dfs.tf.table1(
  type => 'logRegex',
  regex => '(\\d\\d\\d\\d)-(\\d\\d)-(\\d\\d) .*',
  maxErrors => 10))
```

To do this, I had to fix some of the issues described in DRILL-7298. In particular, DRILL-6672 notes that table functions are not able to call `setFoo()` methods as Jackson can, so table functions only work if the format plugin config fields are `public`. They were not public for the log format plugin, so I changed them to `public` to get the above query to work.

If we look at the code in [`FormatPluginOptionsDescriptor.createConfigForTable()`](https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FormatPluginOptionsDescriptor.java#L123), we'll see that there is nothing that would handle the `values` syntax suggested in your note. The only supported types are Java primitives.

When I tried this query:

```
SELECT * FROM table(dfs.tf.noGroups(
  type => 'logRegex',
  regex => '(\\d\\d\\d\\d)-(\\d\\d)-(\\d\\d) .*',
  `schema`=>values('month', 'VARCHAR')))
```

I got this result:

```
PARSE ERROR: Encountered "values" at line 1, column 115.
SQL Query: SELECT * FROM table(dfs.tf.noGroups(type => 'logRegex', regex => '(\\d\\d\\d\\d)-(\\d\\d)-(\\d\\d) .*', `schema`=>values('month', 'VARCHAR')))
^
```

So, it looks like the `values` trick does not work. Even if it did, the code to produce the values argument would use some kind of Java collection, which would not match the `List` of the `schema` field.
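The point about `public` fields can be sketched in plain Java. The example below is an illustration under stated assumptions, not the code in `FormatPluginOptionsDescriptor`: the class `PublicFieldBindingSketch`, its nested `SketchConfig`, and the `bind` method are hypothetical names. It binds named parameters using `Class.getField()`, which only sees public fields, so a value that is reachable only through a setter cannot be set this way.

```java
import java.lang.reflect.Field;
import java.util.Map;

public class PublicFieldBindingSketch {

  /** Hypothetical stand-in for a format plugin config; not Drill's real class. */
  public static class SketchConfig {
    public String regex;     // public: visible to Class.getField()
    public int maxErrors;    // public: visible to Class.getField()
    private String hidden;   // private: reachable only through the setter

    public void setHidden(String value) { this.hidden = value; }
    public String getHidden() { return hidden; }
  }

  /**
   * Binds named parameters to public fields by reflection.
   * Throws IllegalArgumentException for names with no public field.
   */
  public static SketchConfig bind(Map<String, Object> params) {
    SketchConfig config = new SketchConfig();
    for (Map.Entry<String, Object> e : params.entrySet()) {
      try {
        // getField() returns public fields only; private ones are not found.
        Field field = SketchConfig.class.getField(e.getKey());
        field.set(config, e.getValue());
      } catch (NoSuchFieldException ex) {
        throw new IllegalArgumentException("No public field named " + e.getKey(), ex);
      } catch (IllegalAccessException ex) {
        throw new IllegalStateException(ex);
      }
    }
    return config;
  }
}
```

In this sketch, `regex` and `maxErrors` bind successfully, while `hidden` is rejected even though a `setHidden()` method exists. That mirrors the behaviour described in the comment: a setter-aware binder such as Jackson could populate the field, a field-only binder cannot.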
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870914#comment-16870914 ]

ASF GitHub Bot commented on DRILL-7293:
---

arina-ielchiieva commented on issue #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#issuecomment-504905705

@paul-rogers it is still unclear to me whether you have tried the following query on log plugin data: `select * from table(t(schema=>'inline=(col1 varchar)'))`, where `t` is a table with log plugin data. Did you try it? I suppose it should work.
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867012#comment-16867012 ]

ASF GitHub Bot commented on DRILL-7293:
---

paul-rogers commented on issue #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#issuecomment-503297370

Rebased on latest master.
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866352#comment-16866352 ]

ASF GitHub Bot commented on DRILL-7293:
---

arina-ielchiieva commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r294674117

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/log/README.md ##
@@ -143,6 +143,14 @@ cardinality.
 You may find it helpful to specify the regex and column names via the plugin
 config, types via the `CREATE SCHEMA` command.
+## Table Functions
+
+Log files come in many forms. It would be very convenient to use Drill table

Review comment:
I guess the initial choice of a list property did not take into account that it does not work with table functions. I don't think you can fix this backward-compatibility issue in ZK, but since this plugin is a role model for others, I think it should have a proper configuration, so changing the `schema` property to be a `String` instead of a `List` might be reasonable. Or can we have both properties, one as a String and another as a `List`? We can indicate in the release notes that the log plugin has changed and that the config in ZK must be updated.
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866348#comment-16866348 ]

ASF GitHub Bot commented on DRILL-7293:
---

arina-ielchiieva commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r294670491

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/log/README.md ##
@@ -129,19 +129,62 @@
 Drill 1.16 introduced the `CREATE SCHEMA` command to allow you to define the
 schema for your table. This plugin was created earlier. Here is how the two
 schema systems interact.
+### Plugin Config Provides Regex and Field Names
+
+The first way to use the provided schema is just to define column types.
+In this use case, the plugin config provides the physical layout (pattern
+and column names), the provided schema provides data types and default
+values (for missing columns.)
+
+In this case:
+
 * The plugin config must provide the regex.
-* The plugin config should provide the list of column names. (If not provided,
+* The plugin config provides the list of column names. (If not provided,
 the names will be `field_1`, `field_2`, etc.)
-* The plugin config can provide a type for each field. Text data from the regex
-is converted to a nullable column of the specified type.
-* The table can provide a schema via `CREATE SCHEMA`. If so, the column names
-in the schema must match those in the plugin config. The types in the provided
-schema are used instead of those specified in the plugin config. The schema
+* The plugin config should not provide column types.
+* The table provides a schema via `CREATE SCHEMA`. Column names
+in the schema must match those in the plugin config by name. The types in the
+provided schema are used instead of those specified in the plugin config. The schema
 allows you to specify the data type, and either nullable or `not null`
 cardinality.
-You may find it helpful to specify the regex and column names via the plugin
-config, types via the `CREATE SCHEMA` command.
+### Provided Schema Provides The Regex
+
+Another way to use the provided schema is to define an empty plugin config; don't
+even provide the regex. Use table properties to define the regex (and the maximum
+error count, if desired.)
+
+In this case:
+
+* Set the table property `drill.regex.regex` to the desired pattern.

Review comment:
I think using `drill.logRegex.regex` will be fine.
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866249#comment-16866249 ]

ASF GitHub Bot commented on DRILL-7293:
---

paul-rogers commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r294615119

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/log/README.md ##
@@ -129,19 +129,62 @@
 Drill 1.16 introduced the `CREATE SCHEMA` command to allow you to define the
 schema for your table. This plugin was created earlier. Here is how the two
 schema systems interact.
+### Plugin Config Provides Regex and Field Names
+
+The first way to use the provided schema is just to define column types.
+In this use case, the plugin config provides the physical layout (pattern
+and column names), the provided schema provides data types and default
+values (for missing columns.)
+
+In this case:
+
 * The plugin config must provide the regex.
-* The plugin config should provide the list of column names. (If not provided,
+* The plugin config provides the list of column names. (If not provided,
 the names will be `field_1`, `field_2`, etc.)
-* The plugin config can provide a type for each field. Text data from the regex
-is converted to a nullable column of the specified type.
-* The table can provide a schema via `CREATE SCHEMA`. If so, the column names
-in the schema must match those in the plugin config. The types in the provided
-schema are used instead of those specified in the plugin config. The schema
+* The plugin config should not provide column types.
+* The table provides a schema via `CREATE SCHEMA`. Column names
+in the schema must match those in the plugin config by name. The types in the
+provided schema are used instead of those specified in the plugin config. The schema
 allows you to specify the data type, and either nullable or `not null`
 cardinality.
-You may find it helpful to specify the regex and column names via the plugin
-config, types via the `CREATE SCHEMA` command.
+### Provided Schema Provides The Regex
+
+Another way to use the provided schema is to define an empty plugin config; don't
+even provide the regex. Use table properties to define the regex (and the maximum
+error count, if desired.)
+
+In this case:
+
+* Set the table property `drill.regex.regex` to the desired pattern.

Review comment:
Agree, it is pretty awkward. The saving grace is that I did, I believe, change "regex" to "logRegex" as you suggested. That is, the second item is the plugin "type" name.

When we worked on the text reader, I had first tried to choose good names for the third item. You rightly pointed out that it might be easier to remember if we simply use the existing config field names, which is what I did here.

So, even if the names are awkward, the pattern we've evolved is:

```
drill..
```

That said, I'm open to suggestions if there is a better way to handle these names; now is the time to make improvements before folks deploy schema files with the names.
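The naming pattern discussed in this comment (the literal prefix `drill`, then the plugin "type" name, then the existing config field name) can be written out as a tiny Java sketch. The class `TablePropertyKeySketch` and its `key` method are hypothetical helpers for illustration only; they are not part of Drill, and the only key confirmed in this thread is `drill.logRegex.regex`.

```java
public class TablePropertyKeySketch {

  /** First item of every key, as used in the property names in this thread. */
  public static final String PREFIX = "drill";

  /**
   * Builds a table property key from the plugin type name and the
   * name of an existing plugin config field (hypothetical helper).
   */
  public static String key(String pluginType, String configField) {
    return PREFIX + "." + pluginType + "." + configField;
  }
}
```

Under this pattern, `key("logRegex", "regex")` gives `drill.logRegex.regex`. A key for the maximum error count would presumably follow the same shape with the `maxErrors` config field, though that exact property name is an assumption here.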
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866246#comment-16866246 ]

ASF GitHub Bot commented on DRILL-7293:
---

paul-rogers commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r294614546

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/log/README.md ##
@@ -143,6 +143,14 @@ cardinality.
 You may find it helpful to specify the regex and column names via the plugin
 config, types via the `CREATE SCHEMA` command.
+## Table Functions
+
+Log files come in many forms. It would be very convenient to use Drill table

Review comment:
As I recall, Drill does not have a good way to deal with changes to the schema of a storage plugin. Some time back, I remember struggling to understand why my server would not start, only to eventually learn that some plugin or other changed its config and so Drill failed when trying to load the existing config from ZK. Has this been fixed?

If we change schema to a string, we'd need to run code to convert old configs. Also, we'd have the problem of what to do with the type property. We could not easily convert an existing config into a table schema.

Given these uncertainties, my thought was to leave the config alone and try to fit in the provided schema as best we can on top of the existing config. Can you suggest a better approach?
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863791#comment-16863791 ]

ASF GitHub Bot commented on DRILL-7293:
---

arina-ielchiieva commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293692425

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/log/README.md ##
@@ -129,19 +129,62 @@ Drill 1.16 introduced the `CREATE SCHEMA` command to allow you to define the
 schema for your table. This plugin was created earlier. Here is how the two schema
 systems interact.
 
+### Plugin Config Provides Regex and Field Names
+
+The first way to use the provided schema is just to define column types.
+In this use case, the plugin config provides the physical layout (pattern
+and column names), the provided schema provides data types and default
+values (for missing columns.)
+
+In this case:
+
 * The plugin config must provide the regex.
-* The plugin config should provide the list of column names. (If not provided,
+* The plugin config provides the list of column names. (If not provided,
 the names will be `field_1`, `field_2`, etc.)
-* The plugin config can provide a type for each field. Text data from the regex
-is converted to a nullable column of the specified type.
-* The table can provide a schema via `CREATE SCHEMA`. If so, the column names
-in the schema must match those in the plugin config. The types in the provided
-schema are used instead of those specified in the plugin config. The schema
+* The plugin config should not provide column types.
+* The table provides a schema via `CREATE SCHEMA`. Column names
+in the schema must match those in the plugin config by name. The types in the
+provided schema are used instead of those specified in the plugin config. The schema
 allows you to specify the data type, and either nullable or `not null`
 cardinality.
 
-You may find it helpful to specify the regex and column names via the plugin
-config, types via the `CREATE SCHEMA` command.
+### Provided Schema Provides The Regex
+
+Another way to use the provided schema is to define an empty plugin config; don't
+even provide the regex. Use table properties to define the regex (and the maximum
+error count, if desired.)
+
+In this case:
+
+* Set the table property `drill.regex.regex` to the desired pattern.

Review comment:
I think we should use a different name; `drill.regex.regex` looks awkward. Maybe `drill.regex.pattern` or something like this?
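The second mode described in the README hunk above (an empty plugin config, with the regex supplied as a table property) amounts to a lookup with a fallback. The following is only a minimal sketch under stated assumptions, not the PR's code: `RegexSourceSketch`, `resolveRegex`, and the plain `Map` standing in for table properties are hypothetical; the key `drill.regex.regex` is the name quoted in the diff and may be renamed as a result of this review; and the precedence shown (plugin config first, table property second) is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

public class RegexSourceSketch {
  // Property name as quoted in the README hunk; the review proposes renaming it.
  public static final String REGEX_PROP = "drill.regex.regex";

  // Returns the regex to use for a table. Assumed precedence: a non-empty
  // plugin-config regex wins; otherwise fall back to the table property.
  public static String resolveRegex(String configRegex, Map<String, String> tableProps) {
    if (configRegex != null && !configRegex.isEmpty()) {
      return configRegex;
    }
    String fromProps = tableProps.get(REGEX_PROP);
    if (fromProps == null || fromProps.isEmpty()) {
      throw new IllegalStateException(
          "No regex in the plugin config or in table property " + REGEX_PROP);
    }
    return fromProps;
  }

  public static void main(String[] args) {
    Map<String, String> props = new HashMap<>();
    props.put(REGEX_PROP, "(\\d+) (.*)");
    // Empty plugin config: the table property supplies the pattern.
    System.out.println(resolveRegex("", props));
    // Non-empty plugin config: under the assumed precedence, it is used as-is.
    System.out.println(resolveRegex("(\\w+)", props));
  }
}
```

Keeping the key in one constant means a rename such as the `drill.regex.pattern` suggested above would touch a single line.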
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863788#comment-16863788 ]

ASF GitHub Bot commented on DRILL-7293:
---

arina-ielchiieva commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293690640

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/log/LogFormatField.java ##
@@ -18,35 +18,31 @@
 package org.apache.drill.exec.store.log;
 
+import org.apache.drill.shaded.guava.com.google.common.annotations.VisibleForTesting;
+
 import com.fasterxml.jackson.annotation.JsonInclude;
 import com.fasterxml.jackson.annotation.JsonTypeName;
+
+/**
+ * The three configuration options for a field are:
+ *
+ * The field name
+ * The data type (fieldType). Field type defaults to VARCHAR

Review comment:
Extra space before `Field`
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863789#comment-16863789 ]

ASF GitHub Bot commented on DRILL-7293:
---

arina-ielchiieva commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293691586

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/log/README.md ##
@@ -143,6 +143,14 @@ cardinality.
 You may find it helpful to specify the regex and column names via the plugin
 config, types via the `CREATE SCHEMA` command.
 
+## Table Functions
+
+Log files come in many forms. It would be very convenient to use Drill table

Review comment:
Table functions will work for all log format properties except lists. Knowing that lists are not supported, does it make sense to replace the list `schema` parameter with a String and rename it, to avoid a clash with the `schema` parameter used for schema provisioning?
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863790#comment-16863790 ]

ASF GitHub Bot commented on DRILL-7293:
---

arina-ielchiieva commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293692739

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/log/LogFormatPlugin.java ##
@@ -47,23 +47,30 @@
 import org.slf4j.LoggerFactory;
 
 public class LogFormatPlugin extends EasyFormatPlugin {
-  public static final String PLUGIN_NAME = "logRegex";
   private static final Logger logger = LoggerFactory.getLogger(LogFormatPlugin.class);
 
+  public static final String PLUGIN_NAME = "logRegex";
+  public static final String PROP_PREFIX = TupleMetadata.DRILL_PROP_PREFIX + "regex.";

Review comment:
Since these properties are `log`-specific, should we add `log` to the property names, as we did for the `text` properties?
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863780#comment-16863780 ]

ASF GitHub Bot commented on DRILL-7293:
---

arina-ielchiieva commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293690297

## File path: exec/java-exec/src/test/java/org/apache/drill/exec/store/log/TestLogReader.java ##
@@ -389,4 +416,56 @@ public void testRawUMNoSchema() throws RpcException {
     RowSetUtilities.verify(expected, results);
   }
+
+  @Test
+  public void testProvidedSchema() throws Exception {

Review comment:
`select * from table(t(schema=>'inline=(col1 varchar)'))` should work regardless of the format properties. But the log format has its own `schema` property, so I am wondering whether there will be a clash or whether the `schema` parameter will be resolved correctly, since the log format declares it as a list and schema provisioning takes a string.

Now that the log format supports schema provisioning, the above query should apply the schema to log files. Could you please check this query?
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863659#comment-16863659 ]

ASF GitHub Bot commented on DRILL-7293:
---

paul-rogers commented on issue #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#issuecomment-501956556

Added the ability to specify the regex (and column schema) in the provided schema. Defined a table property for the regex. Although we can't (yet) use table properties to define the schema, we can now use `CREATE SCHEMA` to define both the regex and the schema.
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863654#comment-16863654 ]

ASF GitHub Bot commented on DRILL-7293:
---

paul-rogers commented on issue #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#issuecomment-501956415

This PR is now failing due to the Protobuf errors. Thanks @vvysotskyi for fixing them. I'll rebase onto that fix once it is committed and the reviewers have approved the commits.
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863601#comment-16863601 ]

ASF GitHub Bot commented on DRILL-7293:
---

paul-rogers commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293633490

## File path: exec/java-exec/src/test/java/org/apache/drill/exec/store/log/TestLogReader.java ##
@@ -389,4 +416,56 @@ public void testRawUMNoSchema() throws RpcException {
     RowSetUtilities.verify(expected, results);
   }
+
+  @Test
+  public void testProvidedSchema() throws Exception {

Review comment:
Short answer: it does not seem to work. I tried this in the past and found that table functions take only simple values (numbers, strings), not lists. Since this plugin uses a list, I never could figure out how to use it with table functions. In particular, how would the table function know how to create the instance of `LogFormatField` within the list? Am I missing something?

This plugin, in particular, would very much benefit from the use of a table function so that the user does not have to define a new plugin config for each new file type. If there is a way to make this work, we can add the test and describe the answer in the README file.
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862957#comment-16862957 ]

ASF GitHub Bot commented on DRILL-7293:
---

arina-ielchiieva commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293324471

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/log/README.md ##
@@ -11,26 +18,50 @@ If you wanted to analyze log files such as the MySQL log sample shown below usin
 070917 16:29:01 21 Query select * from location
 070917 16:29:12 21 Query select * from location where id = 1 LIMIT 1
 ```
-This plugin will allow you to configure Drill to directly query logfiles of any configuration.
+
+Using this plugin, you can configure Drill to directly query log files of
+any configuration.
 
 ## Configuration Options
 
-* **`type`**: This tells Drill which extension to use. In this case, it must be `logRegex`. This field is mandatory.
-* **`regex`**: This is the regular expression which defines how the log file lines will be split. You must enclose the parts of the regex in grouping parentheses that you wish to extract. Note that this plugin uses Java regular expressions and requires that shortcuts such as `\d` have an additional slash: ie `\\d`. This field is mandatory.
-* **`extension`**: This option tells Drill which file extensions should be mapped to this configuration. Note that you can have multiple configurations of this plugin to allow you to query various log files. This field is mandatory.
-* **`maxErrors`**: Log files can be inconsistent and messy. The `maxErrors` variable allows you to set how many errors the reader will ignore before halting execution and throwing an error. Defaults to 10.
-* **`schema`**: The `schema` field is where you define the structure of the log file. This section is optional. If you do not define a schema, all fields will be assigned a column name of `field_n` where `n` is the index of the field. The undefined fields will be assigned a default data type of `VARCHAR`.
+
+* **`type`**: This tells Drill which extension to use. In this case, it must

Review comment:
```suggestion
* **`type`**: This tells Drill which extension to use. In this case, it must
```
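The `regex` option quoted in the hunk above relies on two Java regex details: capturing groups mark the parts of a line to extract, and shortcuts such as `\d` must be written `\\d` when the pattern lives in a string literal. Here is a minimal sketch applied to the MySQL sample lines from the README. The class name `LogLineRegexSketch`, the method `split`, and the pattern itself are illustrative assumptions and are not taken from the plugin.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineRegexSketch {
  // Hypothetical pattern for the sample lines: date, time, id, command, text.
  // Each backslash is doubled because the pattern sits in a string literal.
  public static final Pattern MYSQL_LINE = Pattern.compile(
      "(\\d{6})\\s(\\d{2}:\\d{2}:\\d{2})\\s+(\\d+)\\s(\\w+)\\s+(.*)");

  // Returns one string per capturing group, or an empty list when the
  // line does not match the pattern at all.
  public static List<String> split(String line) {
    List<String> fields = new ArrayList<>();
    Matcher m = MYSQL_LINE.matcher(line);
    if (m.matches()) {
      for (int i = 1; i <= m.groupCount(); i++) {
        fields.add(m.group(i));
      }
    }
    return fields;
  }

  public static void main(String[] args) {
    System.out.println(split("070917 16:29:01 21 Query select * from location"));
    System.out.println(split("this line does not match"));
  }
}
```

Each capturing group corresponds to one column, which is why the README asks you to parenthesize only the parts you wish to extract.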
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862966#comment-16862966 ]

ASF GitHub Bot commented on DRILL-7293:
---

arina-ielchiieva commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293327844

## File path: exec/java-exec/src/test/java/org/apache/drill/exec/store/log/TestLogReader.java ##
@@ -389,4 +416,56 @@ public void testRawUMNoSchema() throws RpcException {
     RowSetUtilities.verify(expected, results);
   }
+
+  @Test
+  public void testProvidedSchema() throws Exception {

Review comment:
Could you please add unit tests to check how this format plugin works with the `schema` parameter in a table function?
Example: `org.apache.drill.TestSchemaWithTableFunction`

We might need to check two cases:
`select * from table(t(schema=>'inline=(col1 varchar)'))`
`select * from table(t(type=>'logRegex', schema=>'inline=(col1 varchar)'))`
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862968#comment-16862968 ]

ASF GitHub Bot commented on DRILL-7293:
---

arina-ielchiieva commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293324511

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/log/README.md ##
@@ -11,26 +18,50 @@ If you wanted to analyze log files such as the MySQL log sample shown below usin
 070917 16:29:01 21 Query select * from location
 070917 16:29:12 21 Query select * from location where id = 1 LIMIT 1
 ```
-This plugin will allow you to configure Drill to directly query logfiles of any configuration.
+
+Using this plugin, you can configure Drill to directly query log files of
+any configuration.
 
 ## Configuration Options
 
-* **`type`**: This tells Drill which extension to use. In this case, it must be `logRegex`. This field is mandatory.
-* **`regex`**: This is the regular expression which defines how the log file lines will be split. You must enclose the parts of the regex in grouping parentheses that you wish to extract. Note that this plugin uses Java regular expressions and requires that shortcuts such as `\d` have an additional slash: ie `\\d`. This field is mandatory.
-* **`extension`**: This option tells Drill which file extensions should be mapped to this configuration. Note that you can have multiple configurations of this plugin to allow you to query various log files. This field is mandatory.
-* **`maxErrors`**: Log files can be inconsistent and messy. The `maxErrors` variable allows you to set how many errors the reader will ignore before halting execution and throwing an error. Defaults to 10.
-* **`schema`**: The `schema` field is where you define the structure of the log file. This section is optional. If you do not define a schema, all fields will be assigned a column name of `field_n` where `n` is the index of the field. The undefined fields will be assigned a default data type of `VARCHAR`.
+
+* **`type`**: This tells Drill which extension to use. In this case, it must
+be `logRegex`. This field is mandatory.

Review comment:
```suggestion
be `logRegex`. This field is mandatory.
```
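The `maxErrors` option in the hunk above describes an error budget: lines that do not match the regex are tolerated up to a limit, after which the read fails. A rough sketch of that idea follows. `MaxErrorsSketch` and `countMatchingLines` are hypothetical names, and the exact boundary (failing once the count exceeds `maxErrors`) and the exception type are assumptions, not the plugin's documented behavior.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class MaxErrorsSketch {
  // Counts lines that match the pattern. Non-matching lines are skipped
  // until more than maxErrors of them have been seen (assumed boundary),
  // at which point the scan is abandoned with an exception.
  public static int countMatchingLines(List<String> lines, Pattern pattern, int maxErrors) {
    int matched = 0;
    int errors = 0;
    for (String line : lines) {
      if (pattern.matcher(line).matches()) {
        matched++;
      } else if (++errors > maxErrors) {
        throw new IllegalStateException(
            "Too many unmatched lines: " + errors + " (maxErrors = " + maxErrors + ")");
      }
    }
    return matched;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList("1", "oops", "2");
    Pattern digits = Pattern.compile("(\\d+)");
    // One bad line is within a budget of 10, so the scan completes.
    System.out.println(countMatchingLines(lines, digits, 10));
  }
}
```

A budget of this kind lets a query survive the occasional malformed line while still failing fast when the regex is simply wrong for the file.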
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862955#comment-16862955 ]

ASF GitHub Bot commented on DRILL-7293:
---

arina-ielchiieva commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293323024

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/log/LogBatchReader.java ##
@@ -0,0 +1,210 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.store.log;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+import java.util.regex.PatternSyntaxException;
+
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileSchemaNegotiator;
+import org.apache.drill.exec.physical.impl.scan.framework.ManagedReader;
+import org.apache.drill.exec.physical.rowSet.ResultSetLoader;
+import org.apache.drill.exec.physical.rowSet.RowSetLoader;
+import org.apache.drill.exec.record.metadata.TupleMetadata;
+import org.apache.drill.exec.vector.accessor.ScalarWriter;
+import org.apache.drill.shaded.guava.com.google.common.base.Charsets;
+import org.apache.hadoop.mapred.FileSplit;
+
+public class LogBatchReader implements ManagedReader {
+
+  private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(LogBatchReader.class);

Review comment:
No need to use fully qualified class names here.
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862967#comment-16862967 ]

ASF GitHub Bot commented on DRILL-7293:
---

arina-ielchiieva commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293327274

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/log/LogFormatPlugin.java ##
@@ -18,86 +18,224 @@
 package org.apache.drill.exec.store.log;
 
-import java.io.IOException;
-import org.apache.drill.exec.planner.common.DrillStatsTable;
-import org.apache.drill.shaded.guava.com.google.common.collect.Lists;
+import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+import java.util.regex.PatternSyntaxException;
+
 import org.apache.drill.common.exceptions.ExecutionSetupException;
-import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.exceptions.UserException;
 import org.apache.drill.common.logical.StoragePluginConfig;
-import org.apache.drill.exec.ops.FragmentContext;
-import org.apache.drill.exec.proto.UserBitShared;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.common.types.Types;
+import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileReaderFactory;
+import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileScanBuilder;
+import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileSchemaNegotiator;
+import org.apache.drill.exec.physical.impl.scan.framework.ManagedReader;
+import org.apache.drill.exec.proto.UserBitShared.CoreOperatorType;
+import org.apache.drill.exec.record.metadata.ColumnMetadata;
+import org.apache.drill.exec.record.metadata.SchemaBuilder;
+import org.apache.drill.exec.record.metadata.TupleMetadata;
 import org.apache.drill.exec.server.DrillbitContext;
-import org.apache.drill.exec.store.RecordReader;
-import org.apache.drill.exec.store.RecordWriter;
-import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.exec.server.options.OptionManager;
 import org.apache.drill.exec.store.dfs.easy.EasyFormatPlugin;
-import org.apache.drill.exec.store.dfs.easy.EasyWriter;
-import org.apache.drill.exec.store.dfs.easy.FileWork;
+import org.apache.drill.exec.store.dfs.easy.EasySubScan;
+import org.apache.drill.shaded.guava.com.google.common.base.Strings;
+import org.apache.drill.shaded.guava.com.google.common.collect.Lists;
 import org.apache.hadoop.conf.Configuration;
-import java.util.List;
-import org.apache.hadoop.fs.FileSystem;
-import org.apache.hadoop.fs.Path;
-
 public class LogFormatPlugin extends EasyFormatPlugin {
+  public static final String PLUGIN_NAME = "logRegex";
+  private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(LogFormatPlugin.class);
+
+  private static class LogReaderFactory extends FileReaderFactory {
+    private final LogFormatPlugin plugin;
+    private final Pattern pattern;
+    private final TupleMetadata schema;
+
+    public LogReaderFactory(LogFormatPlugin plugin, Pattern pattern, TupleMetadata schema) {
+      this.plugin = plugin;
+      this.pattern = pattern;
+      this.schema = schema;
+    }
 
-  public static final String DEFAULT_NAME = "logRegex";
-  private final LogFormatConfig formatConfig;
+    @Override
+    public ManagedReader newReader() {
+      return new LogBatchReader(plugin.getConfig(), pattern, schema);
+    }
+  }
 
   public LogFormatPlugin(String name, DrillbitContext context,
                          Configuration fsConf, StoragePluginConfig storageConfig,
                          LogFormatConfig formatConfig) {
-    super(name, context, fsConf, storageConfig, formatConfig,
-        true, // readable
-        false, // writable
-        true, // blockSplittable
-        true, // compressible
-        Lists.newArrayList(formatConfig.getExtension()),
-        DEFAULT_NAME);
-    this.formatConfig = formatConfig;
+    super(name, easyConfig(fsConf, formatConfig), context, storageConfig, formatConfig);
   }
 
-  @Override
-  public RecordReader getRecordReader(FragmentContext context,
-      DrillFileSystem dfs, FileWork fileWork, List columns,
-      String userName) throws ExecutionSetupException {
-    return new LogRecordReader(context, dfs, fileWork,
-        columns, userName, formatConfig);
+  private static EasyFormatConfig easyConfig(Configuration fsConf, LogFormatConfig pluginConfig) {
+    EasyFormatConfig config = new EasyFormatConfig();
+    config.readable = true;
+    config.writable = false;
+    // Should be block splitable, but logic not yet implemented.
+    config.blockSplittable = false;
+    config.compressible = true;
+    config.supportsProjectPushdown = true;
+    config.extensions
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862958#comment-16862958 ]

ASF GitHub Bot commented on DRILL-7293:
---

arina-ielchiieva commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293324364

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/log/README.md ##
@@ -1,8 +1,15 @@
 # Drill Regex/Logfile Plugin
-Plugin for Apache Drill that allows Drill to read and query arbitrary files where the schema can be defined by a regex. The original intent was for this to be used for log files, however, it can be used for any structured data.
+
+Plugin for Apache Drill that allows Drill to read and query arbitrary files
+where the schema can be defined by a regex. The original intent was for this

Review comment:
```suggestion
where the schema can be defined by a regex. The original intent was for this
```
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862961#comment-16862961 ]

ASF GitHub Bot commented on DRILL-7293:
---

arina-ielchiieva commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293325049

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/log/README.md ##
@@ -11,26 +18,50 @@ If you wanted to analyze log files such as the MySQL log sample shown below usin
 070917 16:29:01 21 Query select * from location
 070917 16:29:12 21 Query select * from location where id = 1 LIMIT 1
 ```
-This plugin will allow you to configure Drill to directly query logfiles of any configuration.
+
+Using this plugin, you can configure Drill to directly query log files of
+any configuration.
 ## Configuration Options
-* **`type`**: This tells Drill which extension to use. In this case, it must be `logRegex`. This field is mandatory.
-* **`regex`**: This is the regular expression which defines how the log file lines will be split. You must enclose the parts of the regex in grouping parentheses that you wish to extract. Note that this plugin uses Java regular expressions and requires that shortcuts such as `\d` have an additional slash: ie `\\d`. This field is mandatory.
-* **`extension`**: This option tells Drill which file extensions should be mapped to this configuration. Note that you can have multiple configurations of this plugin to allow you to query various log files. This field is mandatory.
-* **`maxErrors`**: Log files can be inconsistent and messy. The `maxErrors` variable allows you to set how many errors the reader will ignore before halting execution and throwing an error. Defaults to 10.
-* **`schema`**: The `schema` field is where you define the structure of the log file. This section is optional. If you do not define a schema, all fields will be assigned a column name of `field_n` where `n` is the index of the field. The undefined fields will be assigned a default data type of `VARCHAR`.
+
+* **`type`**: This tells Drill which extension to use. In this case, it must
+be `logRegex`. This field is mandatory.
+* **`regex`**: This is the regular expression which defines how the log file
+lines will be split. You must enclose the parts of the regex in grouping

Review comment:
Looks like everywhere in the doc there are two spaces before sentences instead of one. Could you please check and fix?
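The README text quoted in the diff above says that the `regex` option uses Java regular expressions, that each part to extract must be wrapped in grouping parentheses, and that shortcuts such as `\d` need a doubled backslash. A minimal sketch of that behavior in plain Java follows. The class name `LogRegexSketch`, the `parse` helper, and the pattern itself are hypothetical illustrations written for the MySQL sample lines; they are not taken from the plugin or the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogRegexSketch {

  // Hypothetical pattern for the MySQL sample lines (an assumption, not the
  // plugin's own regex). Each capturing group becomes one extracted field.
  // In a Java string literal the backslash is doubled: \\d, \\s, \\w.
  static final Pattern LINE = Pattern.compile(
      "(\\d{6})\\s+(\\d{2}:\\d{2}:\\d{2})\\s+(\\d+)\\s+(\\w+)\\s+(.*)");

  // Returns one string per capturing group, or an empty list when the
  // line does not match the pattern at all.
  static List<String> parse(String line) {
    List<String> fields = new ArrayList<>();
    Matcher m = LINE.matcher(line);
    if (m.matches()) {
      // Group 0 is the whole match, so the extracted fields start at 1.
      for (int i = 1; i <= m.groupCount(); i++) {
        fields.add(m.group(i));
      }
    }
    return fields;
  }

  public static void main(String[] args) {
    List<String> fields =
        parse("070917 16:29:12 21 Query select * from location where id = 1 LIMIT 1");
    for (int i = 0; i < fields.size(); i++) {
      System.out.println("group " + (i + 1) + " = " + fields.get(i));
    }
  }
}
```

In this sketch a non-matching line simply yields an empty list. How the plugin itself treats such lines (the `maxErrors` option in the README) is not modeled here.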
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862962#comment-16862962 ] ASF GitHub Bot commented on DRILL-7293: --- arina-ielchiieva commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF URL: https://github.com/apache/drill/pull/1807#discussion_r293326614 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/log/LogBatchReader.java ## @@ -0,0 +1,210 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.drill.exec.store.log; + +import java.io.BufferedReader; +import java.io.IOException; +import java.io.InputStream; +import java.io.InputStreamReader; +import java.util.regex.Matcher; +import java.util.regex.Pattern; +import java.util.regex.PatternSyntaxException; + +import org.apache.drill.common.exceptions.UserException; +import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileSchemaNegotiator; +import org.apache.drill.exec.physical.impl.scan.framework.ManagedReader; +import org.apache.drill.exec.physical.rowSet.ResultSetLoader; +import org.apache.drill.exec.physical.rowSet.RowSetLoader; +import org.apache.drill.exec.record.metadata.TupleMetadata; +import org.apache.drill.exec.vector.accessor.ScalarWriter; +import org.apache.drill.shaded.guava.com.google.common.base.Charsets; +import org.apache.hadoop.mapred.FileSplit; + +public class LogBatchReader implements ManagedReader { + + private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(LogBatchReader.class); + public static final String RAW_LINE_COL_NAME = "_raw"; + public static final String UNMATCHED_LINE_COL_NAME = "_unmatched_rows"; + + private FileSplit split; + private final LogFormatConfig formatConfig; + private final Pattern pattern; + private final TupleMetadata schema; + private BufferedReader reader; + private int capturingGroups; + private ResultSetLoader loader; + private ScalarWriter rawColWriter; + private ScalarWriter unmatchedColWriter; + private boolean saveMatchedRows; + private int maxErrors; + private int lineNumber; + private int errorCount; + + public LogBatchReader(LogFormatConfig formatConfig, Pattern pattern, TupleMetadata schema) { +this.formatConfig = formatConfig; +this.maxErrors = Math.max(0, formatConfig.getMaxErrors()); +this.pattern = pattern; +this.schema = schema; + } + + @Override + public boolean open(FileSchemaNegotiator negotiator) { +split = negotiator.split(); +setupPattern(); 
+negotiator.setTableSchema(schema, true); +loader = negotiator.build(); +bindColumns(loader.writer()); +openFile(negotiator); +return true; + } + + private void setupPattern() { +try { + Matcher m = pattern.matcher("test"); + capturingGroups = m.groupCount(); +} catch (PatternSyntaxException e) { + throw UserException + .validationError(e) + .message("Failed to parse regex: \"%s\"", formatConfig.getRegex()) + .build(logger); +} + } + + private void bindColumns(RowSetLoader writer) { +for (int i = 0; i < capturingGroups; i++) { + saveMatchedRows |= writer.scalar(i).isProjected(); +} +rawColWriter = writer.scalar(RAW_LINE_COL_NAME); +saveMatchedRows |= rawColWriter.isProjected(); +unmatchedColWriter = writer.scalar(UNMATCHED_LINE_COL_NAME); + +// If no match-case columns are projected, and the unmatched +// columns is unprojected, then we want to count (matched) +// rows. + +saveMatchedRows |= !unmatchedColWriter.isProjected(); + } + + private void openFile(FileSchemaNegotiator negotiator) { +InputStream in; +try { + in = negotiator.fileSystem().open(split.getPath()); +} catch (Exception e) { + throw UserException + .dataReadError(e) + .message("Failed to open open input file: %s", split.getPath()) + .addContext("User name", negotiator.userName()) + .build(logger); +} +reader = new BufferedReader(new InputStreamReader(in, Charsets.UTF_8)); + } + + @Override + public boolean next() { +
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862959#comment-16862959 ] ASF GitHub Bot commented on DRILL-7293: --- arina-ielchiieva commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF URL: https://github.com/apache/drill/pull/1807#discussion_r293323694 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/log/LogFormatPlugin.java ## @@ -18,86 +18,224 @@ package org.apache.drill.exec.store.log; -import java.io.IOException; -import org.apache.drill.exec.planner.common.DrillStatsTable; -import org.apache.drill.shaded.guava.com.google.common.collect.Lists; +import java.util.List; +import java.util.regex.Matcher; +import java.util.regex.Pattern; +import java.util.regex.PatternSyntaxException; + import org.apache.drill.common.exceptions.ExecutionSetupException; -import org.apache.drill.common.expression.SchemaPath; +import org.apache.drill.common.exceptions.UserException; import org.apache.drill.common.logical.StoragePluginConfig; -import org.apache.drill.exec.ops.FragmentContext; -import org.apache.drill.exec.proto.UserBitShared; +import org.apache.drill.common.types.TypeProtos.MinorType; +import org.apache.drill.common.types.Types; +import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileReaderFactory; +import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileScanBuilder; +import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileSchemaNegotiator; +import org.apache.drill.exec.physical.impl.scan.framework.ManagedReader; +import org.apache.drill.exec.proto.UserBitShared.CoreOperatorType; +import org.apache.drill.exec.record.metadata.ColumnMetadata; +import org.apache.drill.exec.record.metadata.SchemaBuilder; +import org.apache.drill.exec.record.metadata.TupleMetadata; import org.apache.drill.exec.server.DrillbitContext; -import org.apache.drill.exec.store.RecordReader; -import 
org.apache.drill.exec.store.RecordWriter; -import org.apache.drill.exec.store.dfs.DrillFileSystem; +import org.apache.drill.exec.server.options.OptionManager; import org.apache.drill.exec.store.dfs.easy.EasyFormatPlugin; -import org.apache.drill.exec.store.dfs.easy.EasyWriter; -import org.apache.drill.exec.store.dfs.easy.FileWork; +import org.apache.drill.exec.store.dfs.easy.EasySubScan; +import org.apache.drill.shaded.guava.com.google.common.base.Strings; +import org.apache.drill.shaded.guava.com.google.common.collect.Lists; import org.apache.hadoop.conf.Configuration; -import java.util.List; -import org.apache.hadoop.fs.FileSystem; -import org.apache.hadoop.fs.Path; - public class LogFormatPlugin extends EasyFormatPlugin { + public static final String PLUGIN_NAME = "logRegex"; + private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(LogFormatPlugin.class); + + private static class LogReaderFactory extends FileReaderFactory { +private final LogFormatPlugin plugin; +private final Pattern pattern; +private final TupleMetadata schema; + +public LogReaderFactory(LogFormatPlugin plugin, Pattern pattern, TupleMetadata schema) { + this.plugin = plugin; + this.pattern = pattern; + this.schema = schema; +} - public static final String DEFAULT_NAME = "logRegex"; - private final LogFormatConfig formatConfig; +@Override +public ManagedReader newReader() { + return new LogBatchReader(plugin.getConfig(), pattern, schema); +} + } public LogFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig storageConfig, LogFormatConfig formatConfig) { -super(name, context, fsConf, storageConfig, formatConfig, -true, // readable -false, // writable -true, // blockSplittable -true, // compressible -Lists.newArrayList(formatConfig.getExtension()), -DEFAULT_NAME); -this.formatConfig = formatConfig; +super(name, easyConfig(fsConf, formatConfig), context, storageConfig, formatConfig); } - @Override - public RecordReader 
getRecordReader(FragmentContext context, - DrillFileSystem dfs, FileWork fileWork, List columns, - String userName) throws ExecutionSetupException { -return new LogRecordReader(context, dfs, fileWork, -columns, userName, formatConfig); + private static EasyFormatConfig easyConfig(Configuration fsConf, LogFormatConfig pluginConfig) { +EasyFormatConfig config = new EasyFormatConfig(); +config.readable = true; +config.writable = false; +// Should be block splitable, but logic not yet implemented. +config.blockSplittable = false; +config.compressible = true; +config.supportsProjectPushdown = true; +config.extensions
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862969#comment-16862969 ] ASF GitHub Bot commented on DRILL-7293: --- arina-ielchiieva commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF URL: https://github.com/apache/drill/pull/1807#discussion_r293326805 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/log/LogBatchReader.java ## @@ -0,0 +1,210 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.drill.exec.store.log; + +import java.io.BufferedReader; +import java.io.IOException; +import java.io.InputStream; +import java.io.InputStreamReader; +import java.util.regex.Matcher; +import java.util.regex.Pattern; +import java.util.regex.PatternSyntaxException; + +import org.apache.drill.common.exceptions.UserException; +import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileSchemaNegotiator; +import org.apache.drill.exec.physical.impl.scan.framework.ManagedReader; +import org.apache.drill.exec.physical.rowSet.ResultSetLoader; +import org.apache.drill.exec.physical.rowSet.RowSetLoader; +import org.apache.drill.exec.record.metadata.TupleMetadata; +import org.apache.drill.exec.vector.accessor.ScalarWriter; +import org.apache.drill.shaded.guava.com.google.common.base.Charsets; +import org.apache.hadoop.mapred.FileSplit; + +public class LogBatchReader implements ManagedReader { + + private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(LogBatchReader.class); + public static final String RAW_LINE_COL_NAME = "_raw"; + public static final String UNMATCHED_LINE_COL_NAME = "_unmatched_rows"; + + private FileSplit split; + private final LogFormatConfig formatConfig; + private final Pattern pattern; + private final TupleMetadata schema; + private BufferedReader reader; + private int capturingGroups; + private ResultSetLoader loader; + private ScalarWriter rawColWriter; + private ScalarWriter unmatchedColWriter; + private boolean saveMatchedRows; + private int maxErrors; + private int lineNumber; + private int errorCount; + + public LogBatchReader(LogFormatConfig formatConfig, Pattern pattern, TupleMetadata schema) { +this.formatConfig = formatConfig; +this.maxErrors = Math.max(0, formatConfig.getMaxErrors()); +this.pattern = pattern; +this.schema = schema; + } + + @Override + public boolean open(FileSchemaNegotiator negotiator) { +split = negotiator.split(); +setupPattern(); 
+negotiator.setTableSchema(schema, true); +loader = negotiator.build(); +bindColumns(loader.writer()); +openFile(negotiator); +return true; + } + + private void setupPattern() { +try { + Matcher m = pattern.matcher("test"); + capturingGroups = m.groupCount(); +} catch (PatternSyntaxException e) { + throw UserException + .validationError(e) + .message("Failed to parse regex: \"%s\"", formatConfig.getRegex()) + .build(logger); +} + } + + private void bindColumns(RowSetLoader writer) { +for (int i = 0; i < capturingGroups; i++) { + saveMatchedRows |= writer.scalar(i).isProjected(); +} +rawColWriter = writer.scalar(RAW_LINE_COL_NAME); +saveMatchedRows |= rawColWriter.isProjected(); +unmatchedColWriter = writer.scalar(UNMATCHED_LINE_COL_NAME); + +// If no match-case columns are projected, and the unmatched +// columns is unprojected, then we want to count (matched) +// rows. + +saveMatchedRows |= !unmatchedColWriter.isProjected(); + } + + private void openFile(FileSchemaNegotiator negotiator) { +InputStream in; +try { + in = negotiator.fileSystem().open(split.getPath()); +} catch (Exception e) { + throw UserException + .dataReadError(e) + .message("Failed to open open input file: %s", split.getPath()) + .addContext("User name", negotiator.userName()) + .build(logger); +} +reader = new BufferedReader(new InputStreamReader(in, Charsets.UTF_8)); + } + + @Override + public boolean next() { +
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862956#comment-16862956 ] ASF GitHub Bot commented on DRILL-7293: --- arina-ielchiieva commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF URL: https://github.com/apache/drill/pull/1807#discussion_r293323517 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/log/LogFormatPlugin.java ## @@ -18,86 +18,224 @@ package org.apache.drill.exec.store.log; -import java.io.IOException; -import org.apache.drill.exec.planner.common.DrillStatsTable; -import org.apache.drill.shaded.guava.com.google.common.collect.Lists; +import java.util.List; +import java.util.regex.Matcher; +import java.util.regex.Pattern; +import java.util.regex.PatternSyntaxException; + import org.apache.drill.common.exceptions.ExecutionSetupException; -import org.apache.drill.common.expression.SchemaPath; +import org.apache.drill.common.exceptions.UserException; import org.apache.drill.common.logical.StoragePluginConfig; -import org.apache.drill.exec.ops.FragmentContext; -import org.apache.drill.exec.proto.UserBitShared; +import org.apache.drill.common.types.TypeProtos.MinorType; +import org.apache.drill.common.types.Types; +import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileReaderFactory; +import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileScanBuilder; +import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileSchemaNegotiator; +import org.apache.drill.exec.physical.impl.scan.framework.ManagedReader; +import org.apache.drill.exec.proto.UserBitShared.CoreOperatorType; +import org.apache.drill.exec.record.metadata.ColumnMetadata; +import org.apache.drill.exec.record.metadata.SchemaBuilder; +import org.apache.drill.exec.record.metadata.TupleMetadata; import org.apache.drill.exec.server.DrillbitContext; -import org.apache.drill.exec.store.RecordReader; -import 
org.apache.drill.exec.store.RecordWriter; -import org.apache.drill.exec.store.dfs.DrillFileSystem; +import org.apache.drill.exec.server.options.OptionManager; import org.apache.drill.exec.store.dfs.easy.EasyFormatPlugin; -import org.apache.drill.exec.store.dfs.easy.EasyWriter; -import org.apache.drill.exec.store.dfs.easy.FileWork; +import org.apache.drill.exec.store.dfs.easy.EasySubScan; +import org.apache.drill.shaded.guava.com.google.common.base.Strings; +import org.apache.drill.shaded.guava.com.google.common.collect.Lists; import org.apache.hadoop.conf.Configuration; -import java.util.List; -import org.apache.hadoop.fs.FileSystem; -import org.apache.hadoop.fs.Path; - public class LogFormatPlugin extends EasyFormatPlugin { + public static final String PLUGIN_NAME = "logRegex"; + private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(LogFormatPlugin.class); + + private static class LogReaderFactory extends FileReaderFactory { +private final LogFormatPlugin plugin; +private final Pattern pattern; +private final TupleMetadata schema; + +public LogReaderFactory(LogFormatPlugin plugin, Pattern pattern, TupleMetadata schema) { + this.plugin = plugin; + this.pattern = pattern; + this.schema = schema; +} - public static final String DEFAULT_NAME = "logRegex"; - private final LogFormatConfig formatConfig; +@Override +public ManagedReader newReader() { + return new LogBatchReader(plugin.getConfig(), pattern, schema); +} + } public LogFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig storageConfig, LogFormatConfig formatConfig) { -super(name, context, fsConf, storageConfig, formatConfig, -true, // readable -false, // writable -true, // blockSplittable -true, // compressible -Lists.newArrayList(formatConfig.getExtension()), -DEFAULT_NAME); -this.formatConfig = formatConfig; +super(name, easyConfig(fsConf, formatConfig), context, storageConfig, formatConfig); } - @Override - public RecordReader 
getRecordReader(FragmentContext context, - DrillFileSystem dfs, FileWork fileWork, List columns, - String userName) throws ExecutionSetupException { -return new LogRecordReader(context, dfs, fileWork, -columns, userName, formatConfig); + private static EasyFormatConfig easyConfig(Configuration fsConf, LogFormatConfig pluginConfig) { +EasyFormatConfig config = new EasyFormatConfig(); +config.readable = true; +config.writable = false; +// Should be block splitable, but logic not yet implemented. +config.blockSplittable = false; +config.compressible = true; +config.supportsProjectPushdown = true; +config.extensions
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862965#comment-16862965 ] ASF GitHub Bot commented on DRILL-7293: --- arina-ielchiieva commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF URL: https://github.com/apache/drill/pull/1807#discussion_r29332 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/log/LogBatchReader.java ## @@ -0,0 +1,210 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.drill.exec.store.log; + +import java.io.BufferedReader; +import java.io.IOException; +import java.io.InputStream; +import java.io.InputStreamReader; +import java.util.regex.Matcher; +import java.util.regex.Pattern; +import java.util.regex.PatternSyntaxException; + +import org.apache.drill.common.exceptions.UserException; +import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileSchemaNegotiator; +import org.apache.drill.exec.physical.impl.scan.framework.ManagedReader; +import org.apache.drill.exec.physical.rowSet.ResultSetLoader; +import org.apache.drill.exec.physical.rowSet.RowSetLoader; +import org.apache.drill.exec.record.metadata.TupleMetadata; +import org.apache.drill.exec.vector.accessor.ScalarWriter; +import org.apache.drill.shaded.guava.com.google.common.base.Charsets; +import org.apache.hadoop.mapred.FileSplit; + +public class LogBatchReader implements ManagedReader { + + private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(LogBatchReader.class); + public static final String RAW_LINE_COL_NAME = "_raw"; + public static final String UNMATCHED_LINE_COL_NAME = "_unmatched_rows"; + + private FileSplit split; + private final LogFormatConfig formatConfig; + private final Pattern pattern; + private final TupleMetadata schema; + private BufferedReader reader; + private int capturingGroups; + private ResultSetLoader loader; + private ScalarWriter rawColWriter; + private ScalarWriter unmatchedColWriter; + private boolean saveMatchedRows; + private int maxErrors; + private int lineNumber; + private int errorCount; + + public LogBatchReader(LogFormatConfig formatConfig, Pattern pattern, TupleMetadata schema) { +this.formatConfig = formatConfig; +this.maxErrors = Math.max(0, formatConfig.getMaxErrors()); +this.pattern = pattern; +this.schema = schema; + } + + @Override + public boolean open(FileSchemaNegotiator negotiator) { +split = negotiator.split(); +setupPattern(); 
+negotiator.setTableSchema(schema, true); +loader = negotiator.build(); +bindColumns(loader.writer()); +openFile(negotiator); +return true; + } + + private void setupPattern() { +try { + Matcher m = pattern.matcher("test"); + capturingGroups = m.groupCount(); +} catch (PatternSyntaxException e) { + throw UserException + .validationError(e) + .message("Failed to parse regex: \"%s\"", formatConfig.getRegex()) + .build(logger); +} + } + + private void bindColumns(RowSetLoader writer) { +for (int i = 0; i < capturingGroups; i++) { + saveMatchedRows |= writer.scalar(i).isProjected(); +} +rawColWriter = writer.scalar(RAW_LINE_COL_NAME); +saveMatchedRows |= rawColWriter.isProjected(); +unmatchedColWriter = writer.scalar(UNMATCHED_LINE_COL_NAME); + +// If no match-case columns are projected, and the unmatched +// columns is unprojected, then we want to count (matched) +// rows. + +saveMatchedRows |= !unmatchedColWriter.isProjected(); + } + + private void openFile(FileSchemaNegotiator negotiator) { +InputStream in; +try { + in = negotiator.fileSystem().open(split.getPath()); +} catch (Exception e) { + throw UserException + .dataReadError(e) + .message("Failed to open open input file: %s", split.getPath()) + .addContext("User name", negotiator.userName()) + .build(logger); +} +reader = new BufferedReader(new InputStreamReader(in, Charsets.UTF_8)); + } + + @Override + public boolean next() { +
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862964#comment-16862964 ]

ASF GitHub Bot commented on DRILL-7293:
---

arina-ielchiieva commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293327227

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/log/LogFormatPlugin.java ##
@@ -18,86 +18,224 @@
 package org.apache.drill.exec.store.log;
-import java.io.IOException;
-import org.apache.drill.exec.planner.common.DrillStatsTable;
-import org.apache.drill.shaded.guava.com.google.common.collect.Lists;
+import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+import java.util.regex.PatternSyntaxException;
+
 import org.apache.drill.common.exceptions.ExecutionSetupException;
-import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.exceptions.UserException;
 import org.apache.drill.common.logical.StoragePluginConfig;
-import org.apache.drill.exec.ops.FragmentContext;
-import org.apache.drill.exec.proto.UserBitShared;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.common.types.Types;
+import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileReaderFactory;
+import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileScanBuilder;
+import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileSchemaNegotiator;
+import org.apache.drill.exec.physical.impl.scan.framework.ManagedReader;
+import org.apache.drill.exec.proto.UserBitShared.CoreOperatorType;
+import org.apache.drill.exec.record.metadata.ColumnMetadata;
+import org.apache.drill.exec.record.metadata.SchemaBuilder;
+import org.apache.drill.exec.record.metadata.TupleMetadata;
 import org.apache.drill.exec.server.DrillbitContext;
-import org.apache.drill.exec.store.RecordReader;
-import org.apache.drill.exec.store.RecordWriter;
-import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.exec.server.options.OptionManager;
 import org.apache.drill.exec.store.dfs.easy.EasyFormatPlugin;
-import org.apache.drill.exec.store.dfs.easy.EasyWriter;
-import org.apache.drill.exec.store.dfs.easy.FileWork;
+import org.apache.drill.exec.store.dfs.easy.EasySubScan;
+import org.apache.drill.shaded.guava.com.google.common.base.Strings;
+import org.apache.drill.shaded.guava.com.google.common.collect.Lists;
 import org.apache.hadoop.conf.Configuration;
-import java.util.List;
-import org.apache.hadoop.fs.FileSystem;
-import org.apache.hadoop.fs.Path;
-
 public class LogFormatPlugin extends EasyFormatPlugin {
+  public static final String PLUGIN_NAME = "logRegex";
+  private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(LogFormatPlugin.class);
+
+  private static class LogReaderFactory extends FileReaderFactory {
+    private final LogFormatPlugin plugin;
+    private final Pattern pattern;
+    private final TupleMetadata schema;
+
+    public LogReaderFactory(LogFormatPlugin plugin, Pattern pattern, TupleMetadata schema) {
+      this.plugin = plugin;
+      this.pattern = pattern;
+      this.schema = schema;
+    }
-  public static final String DEFAULT_NAME = "logRegex";
-  private final LogFormatConfig formatConfig;
+    @Override
+    public ManagedReader newReader() {
+      return new LogBatchReader(plugin.getConfig(), pattern, schema);
+    }
+  }
   public LogFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig storageConfig, LogFormatConfig formatConfig) {
-    super(name, context, fsConf, storageConfig, formatConfig,
-        true, // readable
-        false, // writable
-        true, // blockSplittable
-        true, // compressible
-        Lists.newArrayList(formatConfig.getExtension()),
-        DEFAULT_NAME);
-    this.formatConfig = formatConfig;
+    super(name, easyConfig(fsConf, formatConfig), context, storageConfig, formatConfig);
   }
-  @Override
-  public RecordReader getRecordReader(FragmentContext context,
-      DrillFileSystem dfs, FileWork fileWork, List columns,
-      String userName) throws ExecutionSetupException {
-    return new LogRecordReader(context, dfs, fileWork,
-        columns, userName, formatConfig);
+  private static EasyFormatConfig easyConfig(Configuration fsConf, LogFormatConfig pluginConfig) {
+    EasyFormatConfig config = new EasyFormatConfig();
+    config.readable = true;
+    config.writable = false;
+    // Should be block splitable, but logic not yet implemented.
+    config.blockSplittable = false;
+    config.compressible = true;
+    config.supportsProjectPushdown = true;
+    config.extensions
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862963#comment-16862963 ]

ASF GitHub Bot commented on DRILL-7293:
---

arina-ielchiieva commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293326946

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/log/LogBatchReader.java ##
@@ -0,0 +1,210 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.store.log;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+import java.util.regex.PatternSyntaxException;
+
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileSchemaNegotiator;
+import org.apache.drill.exec.physical.impl.scan.framework.ManagedReader;
+import org.apache.drill.exec.physical.rowSet.ResultSetLoader;
+import org.apache.drill.exec.physical.rowSet.RowSetLoader;
+import org.apache.drill.exec.record.metadata.TupleMetadata;
+import org.apache.drill.exec.vector.accessor.ScalarWriter;
+import org.apache.drill.shaded.guava.com.google.common.base.Charsets;
+import org.apache.hadoop.mapred.FileSplit;
+
+public class LogBatchReader implements ManagedReader {
+
+  private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(LogBatchReader.class);
+  public static final String RAW_LINE_COL_NAME = "_raw";
+  public static final String UNMATCHED_LINE_COL_NAME = "_unmatched_rows";
+
+  private FileSplit split;
+  private final LogFormatConfig formatConfig;
+  private final Pattern pattern;
+  private final TupleMetadata schema;
+  private BufferedReader reader;
+  private int capturingGroups;
+  private ResultSetLoader loader;
+  private ScalarWriter rawColWriter;
+  private ScalarWriter unmatchedColWriter;
+  private boolean saveMatchedRows;
+  private int maxErrors;
+  private int lineNumber;
+  private int errorCount;
+
+  public LogBatchReader(LogFormatConfig formatConfig, Pattern pattern, TupleMetadata schema) {
+    this.formatConfig = formatConfig;
+    this.maxErrors = Math.max(0, formatConfig.getMaxErrors());
+    this.pattern = pattern;
+    this.schema = schema;
+  }
+
+  @Override
+  public boolean open(FileSchemaNegotiator negotiator) {
+    split = negotiator.split();
+    setupPattern();
+    negotiator.setTableSchema(schema, true);
+    loader = negotiator.build();
+    bindColumns(loader.writer());
+    openFile(negotiator);
+    return true;
+  }
+
+  private void setupPattern() {
+    try {
+      Matcher m = pattern.matcher("test");
+      capturingGroups = m.groupCount();
+    } catch (PatternSyntaxException e) {
+      throw UserException
+          .validationError(e)
+          .message("Failed to parse regex: \"%s\"", formatConfig.getRegex())
+          .build(logger);
+    }
+  }
+
+  private void bindColumns(RowSetLoader writer) {
+    for (int i = 0; i < capturingGroups; i++) {
+      saveMatchedRows |= writer.scalar(i).isProjected();
+    }
+    rawColWriter = writer.scalar(RAW_LINE_COL_NAME);
+    saveMatchedRows |= rawColWriter.isProjected();
+    unmatchedColWriter = writer.scalar(UNMATCHED_LINE_COL_NAME);
+
+    // If no match-case columns are projected, and the unmatched
+    // columns is unprojected, then we want to count (matched)
+    // rows.
+
+    saveMatchedRows |= !unmatchedColWriter.isProjected();
+  }
+
+  private void openFile(FileSchemaNegotiator negotiator) {
+    InputStream in;
+    try {
+      in = negotiator.fileSystem().open(split.getPath());
+    } catch (Exception e) {
+      throw UserException
+          .dataReadError(e)
+          .message("Failed to open open input file: %s", split.getPath())

Review comment:
```suggestion
          .message("Failed to open input file: %s", split.getPath())
```
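For readers following the regex handling in the diff above, here is a minimal, self-contained sketch of the same idea using only `java.util.regex`. It is not the plugin's code: `RegexLineSketch` and its methods are hypothetical names, and the use of `Matcher.matches()` is an assumption, since the reader's `next()` method is not shown in this thread. One point worth noting: in the JDK it is `Pattern.compile()` that throws `PatternSyntaxException`, so the sketch catches it there; `Pattern.matcher()` on an already-compiled pattern does not throw it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical illustration class; not part of Drill.
final class RegexLineSketch {

  // Compile the regex. Pattern.compile() is where a malformed regex is reported.
  static Pattern compile(String regex) {
    try {
      return Pattern.compile(regex);
    } catch (PatternSyntaxException e) {
      throw new IllegalArgumentException("Failed to parse regex: \"" + regex + "\"", e);
    }
  }

  // The number of capturing groups is available without running a match.
  static int groupCount(Pattern pattern) {
    return pattern.matcher("").groupCount();
  }

  // Returns one String[] per matching line; lines that do not match are
  // collected in 'unmatched' (assumption: whole-line matching via matches()).
  static List<String[]> parse(Pattern pattern, List<String> lines, List<String> unmatched) {
    int groups = groupCount(pattern);
    List<String[]> rows = new ArrayList<>();
    for (String line : lines) {
      Matcher m = pattern.matcher(line);
      if (!m.matches()) {
        unmatched.add(line);
        continue;
      }
      String[] row = new String[groups];
      for (int i = 0; i < groups; i++) {
        row[i] = m.group(i + 1); // group 0 is the whole match, so columns start at 1
      }
      rows.add(row);
    }
    return rows;
  }
}
```

Each capturing group maps to one output column, which mirrors how the diff sizes its column loop from `capturingGroups`.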
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862960#comment-16862960 ]

ASF GitHub Bot commented on DRILL-7293:
---

arina-ielchiieva commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293323284

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/log/LogFormatPlugin.java ##
@@ -18,86 +18,224 @@
 package org.apache.drill.exec.store.log;
-import java.io.IOException;
-import org.apache.drill.exec.planner.common.DrillStatsTable;
-import org.apache.drill.shaded.guava.com.google.common.collect.Lists;
+import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+import java.util.regex.PatternSyntaxException;
+
 import org.apache.drill.common.exceptions.ExecutionSetupException;
-import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.exceptions.UserException;
 import org.apache.drill.common.logical.StoragePluginConfig;
-import org.apache.drill.exec.ops.FragmentContext;
-import org.apache.drill.exec.proto.UserBitShared;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.common.types.Types;
+import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileReaderFactory;
+import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileScanBuilder;
+import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileSchemaNegotiator;
+import org.apache.drill.exec.physical.impl.scan.framework.ManagedReader;
+import org.apache.drill.exec.proto.UserBitShared.CoreOperatorType;
+import org.apache.drill.exec.record.metadata.ColumnMetadata;
+import org.apache.drill.exec.record.metadata.SchemaBuilder;
+import org.apache.drill.exec.record.metadata.TupleMetadata;
 import org.apache.drill.exec.server.DrillbitContext;
-import org.apache.drill.exec.store.RecordReader;
-import org.apache.drill.exec.store.RecordWriter;
-import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.exec.server.options.OptionManager;
 import org.apache.drill.exec.store.dfs.easy.EasyFormatPlugin;
-import org.apache.drill.exec.store.dfs.easy.EasyWriter;
-import org.apache.drill.exec.store.dfs.easy.FileWork;
+import org.apache.drill.exec.store.dfs.easy.EasySubScan;
+import org.apache.drill.shaded.guava.com.google.common.base.Strings;
+import org.apache.drill.shaded.guava.com.google.common.collect.Lists;
 import org.apache.hadoop.conf.Configuration;
-import java.util.List;
-import org.apache.hadoop.fs.FileSystem;
-import org.apache.hadoop.fs.Path;
-
 public class LogFormatPlugin extends EasyFormatPlugin {
+  public static final String PLUGIN_NAME = "logRegex";
+  private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(LogFormatPlugin.class);

Review comment:
Same here.
[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF
[ https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862618#comment-16862618 ]

ASF GitHub Bot commented on DRILL-7293:
---

paul-rogers commented on pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807

Converts the log format plugin (which uses a regex for parsing) to work with the Enhanced Vector Framework (EVF).

This commit provides the basic conversion:

* Use the plugin config object to pass config to the Easy framework.
* Use the EVF scan mechanism in place of the legacy "ScanBatch" mechanism.
* Minor code and README cleanup.

This commit corresponds to the Basic Tutorial steps in the EVF tutorial.
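As a rough illustration of what moving to the EVF scan mechanism implies for a reader, the sketch below shows the division of labor the issue describes: the framework decides when a batch is full, and the reader only fills rows until told to stop. Everything here is a hypothetical stand-in written for this note (`BatchSketch`, `LineReaderSketch`, the fixed row limit); the real EVF classes appear in the diff imports and have a richer API that is not reproduced here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a framework-owned batch: the framework,
// not the reader, chooses the limit (a simple row count in this sketch).
final class BatchSketch {
  private final int rowLimit;
  private final List<String> rows = new ArrayList<>();

  BatchSketch(int rowLimit) {
    this.rowLimit = rowLimit;
  }

  boolean isFull() {
    return rows.size() >= rowLimit;
  }

  void addRow(String row) {
    rows.add(row);
  }

  List<String> rows() {
    return rows;
  }
}

// Hypothetical reader: fills whatever batch it is handed, one line per row.
final class LineReaderSketch {
  private final BufferedReader reader;

  LineReaderSketch(BufferedReader reader) {
    this.reader = reader;
  }

  // Fills the batch until it is full or input ends.
  // Returns false once the input is exhausted, true if more data may remain.
  boolean next(BatchSketch batch) {
    try {
      while (!batch.isFull()) {
        String line = reader.readLine();
        if (line == null) {
          return false;
        }
        batch.addRow(line);
      }
      return true;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The design point is that the reader never chooses a batch size itself; it polls the batch and stops when asked, which is what lets the framework enforce size limits uniformly across readers.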