[jira] [Commented] (DRILL-7641) Convert Excel Reader to Use Streaming Reader
[ https://issues.apache.org/jira/browse/DRILL-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071256#comment-17071256 ]

ASF GitHub Bot commented on DRILL-7641:
---

asfgit commented on pull request #2024: DRILL-7641: Convert Excel Reader to use Streaming Reader
URL: https://github.com/apache/drill/pull/2024

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Convert Excel Reader to Use Streaming Reader
>
> Key: DRILL-7641
> URL: https://issues.apache.org/jira/browse/DRILL-7641
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Text CSV
> Affects Versions: 1.17.0
> Reporter: Charles Givre
> Assignee: Charles Givre
> Priority: Major
> Labels: ready-to-commit
> Fix For: 1.18.0
>
> The current implementation of the Excel reader uses the Apache POI reader,
> which uses excessive amounts of memory. As a result, attempting to read large
> Excel files will cause out of memory errors.
> This PR converts the format plugin to use a streaming reader, based still on
> the POI library. The documentation for the streaming reader can be found
> here. [1]
> All unit tests pass and I tested the plugin with some large Excel files on my
> computer.
> [1]: [https://github.com/pjfanning/excel-streaming-reader]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (DRILL-7641) Convert Excel Reader to Use Streaming Reader
[ https://issues.apache.org/jira/browse/DRILL-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068687#comment-17068687 ]

ASF GitHub Bot commented on DRILL-7641:
---

arina-ielchiieva commented on issue #2024: DRILL-7641: Convert Excel Reader to use Streaming Reader
URL: https://github.com/apache/drill/pull/2024#issuecomment-604996784

+1, LGTM
[jira] [Commented] (DRILL-7641) Convert Excel Reader to Use Streaming Reader
[ https://issues.apache.org/jira/browse/DRILL-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068685#comment-17068685 ]

ASF GitHub Bot commented on DRILL-7641:
---

cgivre commented on issue #2024: DRILL-7641: Convert Excel Reader to use Streaming Reader
URL: https://github.com/apache/drill/pull/2024#issuecomment-604993669

@arina-ielchiieva Thanks for the review. Commits squashed.
[jira] [Commented] (DRILL-7641) Convert Excel Reader to Use Streaming Reader
[ https://issues.apache.org/jira/browse/DRILL-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068684#comment-17068684 ]

ASF GitHub Bot commented on DRILL-7641:
---

cgivre commented on pull request #2024: DRILL-7641: Convert Excel Reader to use Streaming Reader
URL: https://github.com/apache/drill/pull/2024#discussion_r399254035

##
File path: contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -134,121 +131,131 @@ public ExcelBatchReader(ExcelReaderConfig readerConfig) {
   @Override
   public boolean open(FileSchemaNegotiator negotiator) {
     split = negotiator.split();
-    loader = negotiator.build();
+    ResultSetLoader loader = negotiator.build();
     rowWriter = loader.writer();
     openFile(negotiator);
     defineSchema();
     return true;
   }

+  /**
+   * This method opens the Excel file, initializes the Streaming Excel Reader, and initializes the sheet variable.
+   * @param negotiator The Drill file negotiator object that represents the file system
+   */
   private void openFile(FileScanFramework.FileSchemaNegotiator negotiator) {
     try {
       fsStream = negotiator.fileSystem().openPossiblyCompressedStream(split.getPath());
-      workbook = new XSSFWorkbook(fsStream);
+
+      // Open streaming reader
+      workbook = StreamingReader.builder()
+        .rowCacheSize(ROW_CACHE_SIZE)
+        .bufferSize(BUFFER_SIZE)
+        .open(fsStream);
     } catch (Exception e) {
       throw UserException
         .dataReadError(e)
         .message("Failed to open open input file: %s", split.getPath().toString())
-        .message(e.getMessage())
+        .addContext(e.getMessage())
         .build(logger);
     }
-
-    // Evaluate formulae
-    evaluator = workbook.getCreationHelper().createFormulaEvaluator();
-
-    workbook.setMissingCellPolicy(Row.MissingCellPolicy.CREATE_NULL_AS_BLANK);
     sheet = getSheet();
   }

   /**
    * This function defines the schema from the header row.
-   * @return TupleMedata of the discovered schema
    */
-  private TupleMetadata defineSchema() {
+  private void defineSchema() {
     SchemaBuilder builder = new SchemaBuilder();
-    return getColumnHeaders(builder);
+    getColumnHeaders(builder);
   }

-  private TupleMetadata getColumnHeaders(SchemaBuilder builder) {
+  private void getColumnHeaders(SchemaBuilder builder) {
     //Get the field names
-    int columnCount = 0;
+    int columnCount;

-    // Case for empty sheet.
-    if (sheet.getFirstRowNum() == 0 && sheet.getLastRowNum() == 0) {
-      return builder.buildSchema();
+    // Case for empty sheet
+    if (sheet.getLastRowNum() == 0) {
+      builder.buildSchema();
+      return;
     }

+    rowIterator = sheet.iterator();
+
     // Get the number of columns.
     columnCount = getColumnCount();

-    excelFieldNames = new ArrayList<>(columnCount);
-    cellWriterArray = new ArrayList<>(columnCount);
-    rowIterator = sheet.iterator();
+    excelFieldNames = new ArrayList<>();
+    cellWriterArray = new ArrayList<>();

     //If there are no headers, create columns names of field_n
     if (readerConfig.headerRow == -1) {
       String missingFieldName;
-      for (int i = 0; i < columnCount; i++) {
+      int i = 0;
+
+      for(Cell c : currentRow) {

Review comment:
Fixed
[jira] [Commented] (DRILL-7641) Convert Excel Reader to Use Streaming Reader
[ https://issues.apache.org/jira/browse/DRILL-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068682#comment-17068682 ]

ASF GitHub Bot commented on DRILL-7641:
---

arina-ielchiieva commented on pull request #2024: DRILL-7641: Convert Excel Reader to use Streaming Reader
URL: https://github.com/apache/drill/pull/2024#discussion_r399250290

##
File path: contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -134,121 +131,131 @@ public ExcelBatchReader(ExcelReaderConfig readerConfig) {
   @Override
   public boolean open(FileSchemaNegotiator negotiator) {
     split = negotiator.split();
-    loader = negotiator.build();
+    ResultSetLoader loader = negotiator.build();
     rowWriter = loader.writer();
     openFile(negotiator);
     defineSchema();
     return true;
   }

+  /**
+   * This method opens the Excel file, initializes the Streaming Excel Reader, and initializes the sheet variable.
+   * @param negotiator The Drill file negotiator object that represents the file system
+   */
   private void openFile(FileScanFramework.FileSchemaNegotiator negotiator) {
     try {
       fsStream = negotiator.fileSystem().openPossiblyCompressedStream(split.getPath());
-      workbook = new XSSFWorkbook(fsStream);
+
+      // Open streaming reader
+      workbook = StreamingReader.builder()
+        .rowCacheSize(ROW_CACHE_SIZE)
+        .bufferSize(BUFFER_SIZE)
+        .open(fsStream);
     } catch (Exception e) {
       throw UserException
         .dataReadError(e)
         .message("Failed to open open input file: %s", split.getPath().toString())
-        .message(e.getMessage())
+        .addContext(e.getMessage())
         .build(logger);
     }
-
-    // Evaluate formulae
-    evaluator = workbook.getCreationHelper().createFormulaEvaluator();
-
-    workbook.setMissingCellPolicy(Row.MissingCellPolicy.CREATE_NULL_AS_BLANK);
     sheet = getSheet();
   }

   /**
    * This function defines the schema from the header row.
-   * @return TupleMedata of the discovered schema
    */
-  private TupleMetadata defineSchema() {
+  private void defineSchema() {
     SchemaBuilder builder = new SchemaBuilder();
-    return getColumnHeaders(builder);
+    getColumnHeaders(builder);
   }

-  private TupleMetadata getColumnHeaders(SchemaBuilder builder) {
+  private void getColumnHeaders(SchemaBuilder builder) {
     //Get the field names
-    int columnCount = 0;
+    int columnCount;

-    // Case for empty sheet.
-    if (sheet.getFirstRowNum() == 0 && sheet.getLastRowNum() == 0) {
-      return builder.buildSchema();
+    // Case for empty sheet
+    if (sheet.getLastRowNum() == 0) {
+      builder.buildSchema();
+      return;
     }

+    rowIterator = sheet.iterator();
+
     // Get the number of columns.
     columnCount = getColumnCount();

-    excelFieldNames = new ArrayList<>(columnCount);
-    cellWriterArray = new ArrayList<>(columnCount);
-    rowIterator = sheet.iterator();
+    excelFieldNames = new ArrayList<>();
+    cellWriterArray = new ArrayList<>();

     //If there are no headers, create columns names of field_n
     if (readerConfig.headerRow == -1) {
       String missingFieldName;
-      for (int i = 0; i < columnCount; i++) {
+      int i = 0;
+
+      for(Cell c : currentRow) {

Review comment:
```suggestion
      for (Cell c : currentRow) {
```
[jira] [Commented] (DRILL-7641) Convert Excel Reader to Use Streaming Reader
[ https://issues.apache.org/jira/browse/DRILL-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068677#comment-17068677 ]

ASF GitHub Bot commented on DRILL-7641:
---

cgivre commented on issue #2024: DRILL-7641: Convert Excel Reader to use Streaming Reader
URL: https://github.com/apache/drill/pull/2024#issuecomment-604989138

@arina-ielchiieva Thank you for the review. I addressed all your comments and rebased to latest master.
[jira] [Commented] (DRILL-7641) Convert Excel Reader to Use Streaming Reader
[ https://issues.apache.org/jira/browse/DRILL-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068674#comment-17068674 ]

ASF GitHub Bot commented on DRILL-7641:
---

cgivre commented on pull request #2024: DRILL-7641: Convert Excel Reader to use Streaming Reader
URL: https://github.com/apache/drill/pull/2024#discussion_r399247726

##
File path: contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -134,121 +126,131 @@ public ExcelBatchReader(ExcelReaderConfig readerConfig) {
   @Override
   public boolean open(FileSchemaNegotiator negotiator) {
     split = negotiator.split();
-    loader = negotiator.build();
+    ResultSetLoader loader = negotiator.build();
     rowWriter = loader.writer();
     openFile(negotiator);
     defineSchema();
     return true;
   }

+  /**
+   * This method opens the Excel file, initializes the Streaming Excel Reader, and initializes the sheet variable.
+   * @param negotiator The Drill file negotiator object that represents the file system
+   */
   private void openFile(FileScanFramework.FileSchemaNegotiator negotiator) {
     try {
       fsStream = negotiator.fileSystem().openPossiblyCompressedStream(split.getPath());
-      workbook = new XSSFWorkbook(fsStream);
+
+      // Open streaming reader
+      workbook = StreamingReader.builder()
+        .rowCacheSize(100) // Possible configuration option?

Review comment:
Converted to constants.
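Editor's note: the hunk above passes a row cache size to the streaming reader's builder. The general idea of holding only a bounded number of rows in memory, rather than the whole sheet, can be sketched with plain Java. This is a minimal sketch under stated assumptions, not the excel-streaming-reader implementation: the class BoundedRowCache, its largestFill() accessor, and the use of a generic Iterator as the row source are all hypothetical, and the value 100 is simply the literal shown in the diff.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical illustration: an iterator that buffers at most cacheSize rows. */
final class BoundedRowCache<T> implements Iterator<T> {

  // Assumed value, taken from the literal in the quoted diff.
  static final int ROW_CACHE_SIZE = 100;

  private final Iterator<T> source;
  private final int cacheSize;
  // ArrayDeque rejects null elements, so rows are assumed to be non-null.
  private final Deque<T> cache = new ArrayDeque<>();
  private int largestFill = 0;

  BoundedRowCache(Iterator<T> source, int cacheSize) {
    if (cacheSize < 1) {
      throw new IllegalArgumentException("cacheSize must be at least 1");
    }
    this.source = source;
    this.cacheSize = cacheSize;
  }

  /** Pulls rows from the source until the cache is full or the source is exhausted. */
  private void refill() {
    while (cache.size() < cacheSize && source.hasNext()) {
      cache.addLast(source.next());
    }
    largestFill = Math.max(largestFill, cache.size());
  }

  @Override
  public boolean hasNext() {
    if (cache.isEmpty()) {
      refill();
    }
    return !cache.isEmpty();
  }

  @Override
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException("no more rows");
    }
    return cache.removeFirst();
  }

  /** Largest number of rows that were ever buffered at once. */
  int largestFill() {
    return largestFill;
  }

  public static void main(String[] args) {
    BoundedRowCache<String> rows =
        new BoundedRowCache<>(Arrays.asList("r1", "r2", "r3").iterator(), ROW_CACHE_SIZE);
    while (rows.hasNext()) {
      System.out.println(rows.next());
    }
  }
}
```

Memory use in such a design is bounded by the cache size rather than by the number of rows in the file, which is the property the issue description is after.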
[jira] [Commented] (DRILL-7641) Convert Excel Reader to Use Streaming Reader
[ https://issues.apache.org/jira/browse/DRILL-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068670#comment-17068670 ]

ASF GitHub Bot commented on DRILL-7641:
---

cgivre commented on pull request #2024: DRILL-7641: Convert Excel Reader to use Streaming Reader
URL: https://github.com/apache/drill/pull/2024#discussion_r399246782

##
File path: contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -267,14 +269,21 @@ private XSSFSheet getSheet() {
   /**
    * Returns the column count. There are a few gotchas here in that we have to know the header row and count the physical number of cells
-   * in that row. Since the user can define the header row,
+   * in that row. This function also has to move the rowIterator object to the first row of data.
    * @return The number of actual columns
    */
   private int getColumnCount() {
+    // Initialize
+    currentRow = rowIterator.next();
     int rowNumber = readerConfig.headerRow > 0 ? sheet.getFirstRowNum() : 0;
-    XSSFRow sheetRow = sheet.getRow(rowNumber);
-    return sheetRow != null ? sheetRow.getPhysicalNumberOfCells() : 0;
+    // If the headerRow is greater than zero, advance the iterator to the first row of data
+    // This is unfortunately necessary since the streaming reader eliminated the getRow() method.
+    for(int i = 1; i < rowNumber; i++) {

Review comment:
Fixed
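Editor's note: the comment in the hunk above says the iterator has to be advanced by hand because the streaming reader has no getRow() method. That constraint can be sketched with plain Java collections. This is a minimal sketch, not the code in the PR: RowSkipper and advanceToRow are hypothetical names, and a List<String> stands in for a spreadsheet row.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical helper; a List<String> stands in for a spreadsheet row. */
final class RowSkipper {

  private RowSkipper() {
  }

  /**
   * Advances a forward-only iterator and returns the row at the given
   * zero-based index. Rows before it are consumed and cannot be revisited.
   */
  static <T> T advanceToRow(Iterator<T> rows, int rowIndex) {
    if (rowIndex < 0) {
      throw new IllegalArgumentException("rowIndex must be >= 0");
    }
    T current = null;
    for (int i = 0; i <= rowIndex; i++) {
      if (!rows.hasNext()) {
        throw new NoSuchElementException("fewer than " + (rowIndex + 1) + " rows available");
      }
      current = rows.next();
    }
    return current;
  }

  public static void main(String[] args) {
    List<List<String>> sheet = Arrays.asList(
        Arrays.asList("report title"),
        Arrays.asList("id", "name"),
        Arrays.asList("1", "Bob"));
    Iterator<List<String>> rows = sheet.iterator();

    // Header is on the second row (index 1); the title row is skipped.
    List<String> header = advanceToRow(rows, 1);
    System.out.println("columns: " + header.size());

    // The iterator is now positioned on the first data row.
    System.out.println("first data row: " + rows.next());
  }
}
```

After the call, the same iterator continues from the first data row, so the header lookup and the data scan share one forward pass over the sheet.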
[jira] [Commented] (DRILL-7641) Convert Excel Reader to Use Streaming Reader
[ https://issues.apache.org/jira/browse/DRILL-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068671#comment-17068671 ]

ASF GitHub Bot commented on DRILL-7641:
---

cgivre commented on pull request #2024: DRILL-7641: Convert Excel Reader to use Streaming Reader
URL: https://github.com/apache/drill/pull/2024#discussion_r399246903

##
File path: contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -134,121 +126,131 @@ public ExcelBatchReader(ExcelReaderConfig readerConfig) {
   @Override
   public boolean open(FileSchemaNegotiator negotiator) {
     split = negotiator.split();
-    loader = negotiator.build();
+    ResultSetLoader loader = negotiator.build();
     rowWriter = loader.writer();
     openFile(negotiator);
     defineSchema();
     return true;
   }

+  /**
+   * This method opens the Excel file, initializes the Streaming Excel Reader, and initializes the sheet variable.
+   * @param negotiator The Drill file negotiator object that represents the file system
+   */
   private void openFile(FileScanFramework.FileSchemaNegotiator negotiator) {
     try {
       fsStream = negotiator.fileSystem().openPossiblyCompressedStream(split.getPath());
-      workbook = new XSSFWorkbook(fsStream);
+
+      // Open streaming reader
+      workbook = StreamingReader.builder()
+        .rowCacheSize(100) // Possible configuration option?
+        .bufferSize(4096) // Possible configuration option?
+        .open(fsStream);
     } catch (Exception e) {
       throw UserException
         .dataReadError(e)
         .message("Failed to open open input file: %s", split.getPath().toString())
         .message(e.getMessage())

Review comment:
Changed to addContext()
[jira] [Commented] (DRILL-7641) Convert Excel Reader to Use Streaming Reader
[ https://issues.apache.org/jira/browse/DRILL-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068667#comment-17068667 ]

ASF GitHub Bot commented on DRILL-7641:
---

cgivre commented on pull request #2024: DRILL-7641: Convert Excel Reader to use Streaming Reader
URL: https://github.com/apache/drill/pull/2024#discussion_r399246616

##
File path: contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -68,20 +66,18 @@
   private final ExcelReaderConfig readerConfig;

-  private XSSFSheet sheet;
+  private Sheet sheet;

-  private XSSFWorkbook workbook;
+  private Row currentRow;

-  private InputStream fsStream;
+  private Workbook workbook;

-  private FormulaEvaluator evaluator;
+  private InputStream fsStream;

   private ArrayList excelFieldNames;

Review comment:
Fixed
[jira] [Commented] (DRILL-7641) Convert Excel Reader to Use Streaming Reader
[ https://issues.apache.org/jira/browse/DRILL-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068668#comment-17068668 ]

ASF GitHub Bot commented on DRILL-7641:
---

cgivre commented on pull request #2024: DRILL-7641: Convert Excel Reader to use Streaming Reader
URL: https://github.com/apache/drill/pull/2024#discussion_r399246672

##
File path: contrib/format-excel/src/test/java/org/apache/drill/exec/store/excel/TestExcelFormat.java
##
@@ -339,11 +339,12 @@ public void testInconsistentDataQuery() throws Exception {
     testBuilder()
       .sqlQuery(sql)
-      .ordered().baselineColumns("col1", "col2")
-      .baselineValues("1.0", "Bob")
-      .baselineValues("2.0", "Steve")
-      .baselineValues("3.0", "Anne")
-      .baselineValues("Bob", "3.0")
+      .ordered()

Review comment:
Fixed
[jira] [Commented] (DRILL-7641) Convert Excel Reader to Use Streaming Reader
[ https://issues.apache.org/jira/browse/DRILL-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068669#comment-17068669 ]

ASF GitHub Bot commented on DRILL-7641:
---

cgivre commented on pull request #2024: DRILL-7641: Convert Excel Reader to use Streaming Reader
URL: https://github.com/apache/drill/pull/2024#discussion_r399246726

##
File path: contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -289,83 +298,78 @@ public boolean next() {
   }

   private boolean nextLine(RowSetLoader rowWriter) {
-    if( sheet.getFirstRowNum() == 0 && sheet.getLastRowNum() == 0) {
+    if(sheet.getLastRowNum() == 0) {

Review comment:
Fixed
[jira] [Commented] (DRILL-7641) Convert Excel Reader to Use Streaming Reader
[ https://issues.apache.org/jira/browse/DRILL-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068649#comment-17068649 ]

ASF GitHub Bot commented on DRILL-7641:
---

arina-ielchiieva commented on pull request #2024: DRILL-7641 Convert Excel Reader to use Streaming Reader
URL: https://github.com/apache/drill/pull/2024#discussion_r399236022

##
File path: contrib/format-excel/src/test/java/org/apache/drill/exec/store/excel/TestExcelFormat.java
##
@@ -339,11 +339,12 @@ public void testInconsistentDataQuery() throws Exception {
     testBuilder()
       .sqlQuery(sql)
-      .ordered().baselineColumns("col1", "col2")
-      .baselineValues("1.0", "Bob")
-      .baselineValues("2.0", "Steve")
-      .baselineValues("3.0", "Anne")
-      .baselineValues("Bob", "3.0")
+      .ordered()

Review comment:
Your test does not include an ORDER BY, so the order of the returned results is unpredictable. Please change to unordered here and in the other tests above.
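Editor's note: the review comment above asks for an unordered comparison because a query without ORDER BY guarantees no row order. What an order-insensitive comparison has to do can be sketched outside Drill's test framework. This is a minimal sketch under stated assumptions: UnorderedRows and sameRowsIgnoringOrder are hypothetical names and are not part of Drill's test builder; rows are modelled as List<String>, using the baseline values quoted in the diff.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that compares two result sets as multisets of rows. */
final class UnorderedRows {

  private UnorderedRows() {
  }

  /** Counts how many times each distinct row occurs. */
  private static Map<List<String>, Integer> countRows(List<List<String>> rows) {
    Map<List<String>, Integer> counts = new HashMap<>();
    for (List<String> row : rows) {
      counts.merge(row, 1, Integer::sum);
    }
    return counts;
  }

  /** True when both lists hold the same rows the same number of times, in any order. */
  static boolean sameRowsIgnoringOrder(List<List<String>> expected, List<List<String>> actual) {
    return countRows(expected).equals(countRows(actual));
  }

  public static void main(String[] args) {
    List<List<String>> expected = Arrays.asList(
        Arrays.asList("1.0", "Bob"),
        Arrays.asList("2.0", "Steve"),
        Arrays.asList("3.0", "Anne"),
        Arrays.asList("Bob", "3.0"));
    // Same rows, delivered in a different order.
    List<List<String>> actual = Arrays.asList(
        Arrays.asList("Bob", "3.0"),
        Arrays.asList("3.0", "Anne"),
        Arrays.asList("1.0", "Bob"),
        Arrays.asList("2.0", "Steve"));

    System.out.println("ordered comparison:   " + expected.equals(actual));
    System.out.println("unordered comparison: " + sameRowsIgnoringOrder(expected, actual));
  }
}
```

Counting occurrences rather than using a plain set keeps duplicate rows significant, so a result that drops or repeats a row still fails the comparison.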
[jira] [Commented] (DRILL-7641) Convert Excel Reader to Use Streaming Reader
[ https://issues.apache.org/jira/browse/DRILL-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068647#comment-17068647 ]

ASF GitHub Bot commented on DRILL-7641:
---

arina-ielchiieva commented on pull request #2024: DRILL-7641 Convert Excel Reader to use Streaming Reader
URL: https://github.com/apache/drill/pull/2024#discussion_r399234204

##
File path: contrib/format-excel/pom.xml
##
@@ -64,6 +64,11 @@
       <version>${project.version}</version>
       <scope>test</scope>
     </dependency>
+    <dependency>
+      <groupId>com.github.pjfanning</groupId>
+      <artifactId>excel-streaming-reader</artifactId>
+      <version>2.3.1</version>

Review comment:
Please use the latest version of the reader: https://mvnrepository.com/artifact/com.github.pjfanning/excel-streaming-reader
[jira] [Commented] (DRILL-7641) Convert Excel Reader to Use Streaming Reader
[ https://issues.apache.org/jira/browse/DRILL-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068650#comment-17068650 ]

ASF GitHub Bot commented on DRILL-7641:
---

arina-ielchiieva commented on pull request #2024: DRILL-7641 Convert Excel Reader to use Streaming Reader
URL: https://github.com/apache/drill/pull/2024#discussion_r399237321

## File path: contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java ##
@@ -134,121 +126,131 @@ public ExcelBatchReader(ExcelReaderConfig readerConfig) {
   @Override
   public boolean open(FileSchemaNegotiator negotiator) {
     split = negotiator.split();
-    loader = negotiator.build();
+    ResultSetLoader loader = negotiator.build();
     rowWriter = loader.writer();
     openFile(negotiator);
     defineSchema();
     return true;
   }

+  /**
+   * This method opens the Excel file, initializes the Streaming Excel Reader, and initializes the sheet variable.
+   * @param negotiator The Drill file negotiator object that represents the file system
+   */
   private void openFile(FileScanFramework.FileSchemaNegotiator negotiator) {
     try {
       fsStream = negotiator.fileSystem().openPossiblyCompressedStream(split.getPath());
-      workbook = new XSSFWorkbook(fsStream);
+
+      // Open streaming reader
+      workbook = StreamingReader.builder()
+        .rowCacheSize(100) // Possible configuration option?
+        .bufferSize(4096) // Possible configuration option?
+        .open(fsStream);
     } catch (Exception e) {
       throw UserException
         .dataReadError(e)
         .message("Failed to open open input file: %s", split.getPath().toString())
         .message(e.getMessage())

Review comment: Can't have two messages, use addContext for the second message instead.
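The review comment about using one message plus context can be illustrated with a small, self-contained sketch. `ReadErrorBuilder` is a hypothetical stand-in written for this example, not Drill's `UserException` builder, and its "first message wins" behaviour is an assumption made only to show why a second `message(...)` call is the wrong tool for supplementary detail.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an error builder; NOT Drill's UserException API.
final class ReadErrorBuilder {
  private String message;                                  // the single primary message
  private final List<String> context = new ArrayList<>();  // any number of detail lines

  ReadErrorBuilder message(String format, Object... args) {
    // Assumption for this sketch: only the first message is kept.
    if (message == null) {
      message = args.length == 0 ? format : String.format(format, args);
    }
    return this;
  }

  ReadErrorBuilder addContext(String detail) {
    context.add(detail);
    return this;
  }

  String build() {
    StringBuilder sb = new StringBuilder(message == null ? "" : message);
    for (String detail : context) {
      sb.append('\n').append(detail);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Exception cause = new java.io.IOException("stream closed");
    String text = new ReadErrorBuilder()
        .message("Failed to open input file: %s", "/data/large.xlsx")
        .addContext(cause.getMessage())   // detail goes into context, not a second message
        .build();
    System.out.println(text);
  }
}
```

With this shape, the file path stays in the primary message and the underlying exception text is attached as context, so neither piece of information displaces the other.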
[jira] [Commented] (DRILL-7641) Convert Excel Reader to Use Streaming Reader
[ https://issues.apache.org/jira/browse/DRILL-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068646#comment-17068646 ]

ASF GitHub Bot commented on DRILL-7641:
---

arina-ielchiieva commented on pull request #2024: DRILL-7641 Convert Excel Reader to use Streaming Reader
URL: https://github.com/apache/drill/pull/2024#discussion_r399234441

## File path: contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java ##
@@ -68,20 +66,18 @@
   private final ExcelReaderConfig readerConfig;

-  private XSSFSheet sheet;
+  private Sheet sheet;

-  private XSSFWorkbook workbook;
+  private Row currentRow;

-  private InputStream fsStream;
+  private Workbook workbook;

-  private FormulaEvaluator evaluator;
+  private InputStream fsStream;

   private ArrayList excelFieldNames;

Review comment: Use List instead of ArrayList here and in other fields below.
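A minimal sketch of the "declare the interface, not the implementation" advice from the review follows. The class name `FieldNames` and the `String` element type are assumptions for illustration; the diff shows only a raw `ArrayList` field.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical holder class; only the field declaration style matters here.
final class FieldNames {
  // Declared as List so callers and later refactorings do not depend on ArrayList.
  private final List<String> excelFieldNames;

  FieldNames(List<String> names) {
    this.excelFieldNames = names;
  }

  void add(String name) {
    excelFieldNames.add(name);
  }

  int count() {
    return excelFieldNames.size();
  }

  public static void main(String[] args) {
    // Either implementation can be supplied without touching the field type.
    FieldNames a = new FieldNames(new ArrayList<>());
    FieldNames b = new FieldNames(new LinkedList<>());
    a.add("col1");
    b.add("col1");
    System.out.println(a.count() == b.count());
  }
}
```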
[jira] [Commented] (DRILL-7641) Convert Excel Reader to Use Streaming Reader
[ https://issues.apache.org/jira/browse/DRILL-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068651#comment-17068651 ]

ASF GitHub Bot commented on DRILL-7641:
---

arina-ielchiieva commented on pull request #2024: DRILL-7641 Convert Excel Reader to use Streaming Reader
URL: https://github.com/apache/drill/pull/2024#discussion_r399236600

## File path: contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java ##
@@ -267,14 +269,21 @@ private XSSFSheet getSheet() {
   /**
    * Returns the column count. There are a few gotchas here in that we have to know the header row and count the physical number of cells
-   * in that row. Since the user can define the header row,
+   * in that row. This function also has to move the rowIterator object to the first row of data.
    * @return The number of actual columns
    */
   private int getColumnCount() {
+    // Initialize
+    currentRow = rowIterator.next();
     int rowNumber = readerConfig.headerRow > 0 ? sheet.getFirstRowNum() : 0;
-    XSSFRow sheetRow = sheet.getRow(rowNumber);
-    return sheetRow != null ? sheetRow.getPhysicalNumberOfCells() : 0;
+    // If the headerRow is greater than zero, advance the iterator to the first row of data
+    // This is unfortunately necessary since the streaming reader eliminated the getRow() method.
+    for(int i = 1; i < rowNumber; i++) {

Review comment:
```suggestion
    for (int i = 1; i < rowNumber; i++) {
```
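Because the diff notes that the streaming reader has no `getRow()` and rows must be reached through an iterator, the general pattern can be sketched with plain `java.util` types. `RowSkipper.rowAt` is a hypothetical helper, the zero-based index is an assumption, and this is not the code from the pull request.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical helper showing forward-only access to the n-th row.
final class RowSkipper {
  /**
   * Consumes rows up to and including the zero-based index and returns that row,
   * or null if the iterator runs out first. The iterator is left positioned on
   * the row that follows, i.e. the first data row after a header.
   */
  static <T> T rowAt(Iterator<T> rows, int index) {
    T current = null;
    for (int i = 0; i <= index; i++) {
      if (!rows.hasNext()) {
        return null;
      }
      current = rows.next();
    }
    return current;
  }

  public static void main(String[] args) {
    Iterator<String> rows = Arrays.asList("title", "header", "data1", "data2").iterator();
    String header = rowAt(rows, 1);   // skip the title row, stop on the header row
    String firstData = rows.next();   // iterator now yields the data rows
    System.out.println(header + " / " + firstData);
  }
}
```

The trade-off is that the iterator cannot be rewound, so any code that needs the header row later has to keep a reference to it.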
[jira] [Commented] (DRILL-7641) Convert Excel Reader to Use Streaming Reader
[ https://issues.apache.org/jira/browse/DRILL-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068652#comment-17068652 ]

ASF GitHub Bot commented on DRILL-7641:
---

arina-ielchiieva commented on pull request #2024: DRILL-7641 Convert Excel Reader to use Streaming Reader
URL: https://github.com/apache/drill/pull/2024#discussion_r399237699

## File path: contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java ##
@@ -134,121 +126,131 @@ public ExcelBatchReader(ExcelReaderConfig readerConfig) {
   @Override
   public boolean open(FileSchemaNegotiator negotiator) {
     split = negotiator.split();
-    loader = negotiator.build();
+    ResultSetLoader loader = negotiator.build();
     rowWriter = loader.writer();
     openFile(negotiator);
     defineSchema();
     return true;
   }

+  /**
+   * This method opens the Excel file, initializes the Streaming Excel Reader, and initializes the sheet variable.
+   * @param negotiator The Drill file negotiator object that represents the file system
+   */
   private void openFile(FileScanFramework.FileSchemaNegotiator negotiator) {
     try {
       fsStream = negotiator.fileSystem().openPossiblyCompressedStream(split.getPath());
-      workbook = new XSSFWorkbook(fsStream);
+
+      // Open streaming reader
+      workbook = StreamingReader.builder()
+        .rowCacheSize(100) // Possible configuration option?
+        .bufferSize(4096) // Possible configuration option?

Review comment: Please either make values as conf options or create constants.
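One way to act on the request for constants or configuration options is sketched below. The class and constant names are hypothetical; only the values 100 and 4096 come from the diff, and the fallback rule in `orDefault` is an assumption about how an optional setting might be handled.

```java
// Hypothetical settings holder; names are illustrative, values come from the diff.
final class StreamingReaderSettings {
  static final int DEFAULT_ROW_CACHE_SIZE = 100;   // rows kept in memory at a time
  static final int DEFAULT_BUFFER_SIZE = 4096;     // bytes used when reading the stream

  /** Returns the configured value when it is present and positive, else the fallback. */
  static int orDefault(Integer configured, int fallback) {
    return (configured != null && configured > 0) ? configured : fallback;
  }

  public static void main(String[] args) {
    Integer userRowCache = null;   // nothing configured: use the constant
    Integer userBuffer = 8192;     // explicit override
    int rowCacheSize = orDefault(userRowCache, DEFAULT_ROW_CACHE_SIZE);
    int bufferSize = orDefault(userBuffer, DEFAULT_BUFFER_SIZE);
    // These values would then be passed to the builder's rowCacheSize(...) and bufferSize(...) calls.
    System.out.println(rowCacheSize + " " + bufferSize);
  }
}
```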
[jira] [Commented] (DRILL-7641) Convert Excel Reader to Use Streaming Reader
[ https://issues.apache.org/jira/browse/DRILL-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068648#comment-17068648 ]

ASF GitHub Bot commented on DRILL-7641:
---

arina-ielchiieva commented on pull request #2024: DRILL-7641 Convert Excel Reader to use Streaming Reader
URL: https://github.com/apache/drill/pull/2024#discussion_r399236555

## File path: contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java ##
@@ -289,83 +298,78 @@ public boolean next() {
   }

   private boolean nextLine(RowSetLoader rowWriter) {
-    if( sheet.getFirstRowNum() == 0 && sheet.getLastRowNum() == 0) {
+    if(sheet.getLastRowNum() == 0) {

Review comment:
```suggestion
    if (sheet.getLastRowNum() == 0) {
```
[jira] [Commented] (DRILL-7641) Convert Excel Reader to Use Streaming Reader
[ https://issues.apache.org/jira/browse/DRILL-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066212#comment-17066212 ]

Arina Ielchiieva commented on DRILL-7641:
-

[~cgivre] sorry made this by mistake.
[jira] [Commented] (DRILL-7641) Convert Excel Reader to Use Streaming Reader
[ https://issues.apache.org/jira/browse/DRILL-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066199#comment-17066199 ]

Charles Givre commented on DRILL-7641:
--

[~arina], Why was the fix version changed? I believe this PR is reviewable and is not a major change. I don't think this will be a major task to get this into Drill 1.18.
[jira] [Commented] (DRILL-7641) Convert Excel Reader to Use Streaming Reader
[ https://issues.apache.org/jira/browse/DRILL-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058303#comment-17058303 ]

ASF GitHub Bot commented on DRILL-7641:
---

cgivre commented on pull request #2024: DRILL-7641 Convert Excel Reader to use Streaming Reader
URL: https://github.com/apache/drill/pull/2024

# [DRILL-7641](https://issues.apache.org/jira/browse/DRILL-7641): Convert Excel Reader to use Streaming Reader

## Description
The current implementation of the Excel reader uses the Apache POI reader, which uses excessive amounts of memory. As a result, attempting to read large Excel files will cause out-of-memory errors.

This PR converts the format plugin to use a streaming reader, still based on the POI library. The documentation for the streaming reader can be found here [1]. This library was billed as a drop-in replacement for the POI reader; however, I had to make some minor changes to the batch reader to get this to work. Minor code cleanup as well.

[1]: https://github.com/pjfanning/excel-streaming-reader

## Documentation
No user visible changes.

## Testing
All unit tests from the original plugin pass. Additionally, I tested this with large Excel files on my local machine and Drill was able to query them, whereas before this PR Drill would run out of memory.