[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader

2023-06-15 Thread via GitHub


exceptionfactory commented on code in PR #7194:
URL: https://github.com/apache/nifi/pull/7194#discussion_r1231194122


##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/pom.xml:
##
@@ -75,6 +75,11 @@
 com.fasterxml.jackson.dataformat
 jackson-dataformat-csv
 
+
+com.github.pjfanning
+excel-streaming-reader

Review Comment:
   This dependency can be removed now that the ExcelReader is in a different 
module.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader

2023-06-05 Thread via GitHub


exceptionfactory commented on code in PR #7194:
URL: https://github.com/apache/nifi/pull/7194#discussion_r1218567911


##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/RowIterator.java:
##
@@ -0,0 +1,147 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nifi.excel;
+
+import com.github.pjfanning.xlsx.StreamingReader;
+import org.apache.nifi.logging.ComponentLog;
+import org.apache.nifi.processor.exception.ProcessException;
+import org.apache.poi.ss.usermodel.Row;
+import org.apache.poi.ss.usermodel.Sheet;
+import org.apache.poi.ss.usermodel.Workbook;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+public class RowIterator implements Iterator, Closeable {
+private final Workbook workbook;
+private final Iterator sheets;
+private final Map requiredSheets;
+private final int firstRow;
+private final ComponentLog logger;
+private Sheet currentSheet;
+private Iterator currentRows;
+private Row currentRow;
+
+public RowIterator(InputStream in, List requiredSheets, int 
firstRow, ComponentLog logger) {
+this.workbook = StreamingReader.builder()
+.rowCacheSize(100)
+.bufferSize(4096)
+.open(in);
+this.sheets = this.workbook.iterator();
+this.requiredSheets = requiredSheets != null ? requiredSheets.stream()
+.collect(Collectors.toMap(key -> key, value -> Boolean.FALSE)) 
: new HashMap<>();
+this.firstRow = firstRow;
+this.logger = logger;
+}
+
+@Override
+public boolean hasNext() {
+setCurrent();

Review Comment:
   Although it may not be the most intuitive approach, having `hasNext()` throw 
an `UnsupportedOperationException` and changing callers to just use `next()` 
could work, although it raises a question of logic handling for the calling 
class. Another option could be folding the functionality into the calling class.
   
   I think it would be cleaner in the end to rework the RowIterator, in 
addition to removing the `public` visibility from the class. I could take a 
look at reworking RowIterator and pushing the changes to this branch if you 
think the changes are too involved.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader

2023-06-05 Thread via GitHub


exceptionfactory commented on code in PR #7194:
URL: https://github.com/apache/nifi/pull/7194#discussion_r1218506840


##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/RowIterator.java:
##
@@ -0,0 +1,147 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nifi.excel;
+
+import com.github.pjfanning.xlsx.StreamingReader;
+import org.apache.nifi.logging.ComponentLog;
+import org.apache.nifi.processor.exception.ProcessException;
+import org.apache.poi.ss.usermodel.Row;
+import org.apache.poi.ss.usermodel.Sheet;
+import org.apache.poi.ss.usermodel.Workbook;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+public class RowIterator implements Iterator, Closeable {
+private final Workbook workbook;
+private final Iterator sheets;
+private final Map requiredSheets;
+private final int firstRow;
+private final ComponentLog logger;
+private Sheet currentSheet;
+private Iterator currentRows;
+private Row currentRow;
+
+public RowIterator(InputStream in, List requiredSheets, int 
firstRow, ComponentLog logger) {
+this.workbook = StreamingReader.builder()
+.rowCacheSize(100)
+.bufferSize(4096)
+.open(in);
+this.sheets = this.workbook.iterator();
+this.requiredSheets = requiredSheets != null ? requiredSheets.stream()
+.collect(Collectors.toMap(key -> key, value -> Boolean.FALSE)) 
: new HashMap<>();
+this.firstRow = firstRow;
+this.logger = logger;
+}
+
+@Override
+public boolean hasNext() {
+setCurrent();

Review Comment:
   It should be possible to call `hasNext()` in any order. Taking a closer 
look, it seems like the current implementation of `setCurrent` needs further 
refinement because it sets `currentSheet` to `null` on every invocation. One 
general approach could be to set the initial sheet and row in the constructor.
   
   In general, it should be possible to call `hasNext()` and unlimited number 
of times without advancing the iterator, and calling `next()` should advance 
the iterator forward. When all sheets and rows have been read, `next()` should 
throw an `NoSuchElementException` if called again.



##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/RowIterator.java:
##
@@ -0,0 +1,147 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nifi.excel;
+
+import com.github.pjfanning.xlsx.StreamingReader;
+import org.apache.nifi.logging.ComponentLog;
+import org.apache.nifi.processor.exception.ProcessException;
+import org.apache.poi.ss.usermodel.Row;
+import org.apache.poi.ss.usermodel.Sheet;
+import org.apache.poi.ss.usermodel.Workbook;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+public class RowIterator implements Iterator, Closeable {
+private final Workbook workbook;
+private final Iterator 

[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader

2023-06-05 Thread via GitHub


exceptionfactory commented on code in PR #7194:
URL: https://github.com/apache/nifi/pull/7194#discussion_r1218506840


##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/RowIterator.java:
##
@@ -0,0 +1,147 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nifi.excel;
+
+import com.github.pjfanning.xlsx.StreamingReader;
+import org.apache.nifi.logging.ComponentLog;
+import org.apache.nifi.processor.exception.ProcessException;
+import org.apache.poi.ss.usermodel.Row;
+import org.apache.poi.ss.usermodel.Sheet;
+import org.apache.poi.ss.usermodel.Workbook;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+public class RowIterator implements Iterator, Closeable {
+private final Workbook workbook;
+private final Iterator sheets;
+private final Map requiredSheets;
+private final int firstRow;
+private final ComponentLog logger;
+private Sheet currentSheet;
+private Iterator currentRows;
+private Row currentRow;
+
+public RowIterator(InputStream in, List requiredSheets, int 
firstRow, ComponentLog logger) {
+this.workbook = StreamingReader.builder()
+.rowCacheSize(100)
+.bufferSize(4096)
+.open(in);
+this.sheets = this.workbook.iterator();
+this.requiredSheets = requiredSheets != null ? requiredSheets.stream()
+.collect(Collectors.toMap(key -> key, value -> Boolean.FALSE)) 
: new HashMap<>();
+this.firstRow = firstRow;
+this.logger = logger;
+}
+
+@Override
+public boolean hasNext() {
+setCurrent();

Review Comment:
   It should be possible to call `hasNext()` in any order. Taking a closer 
look, it seems like the current implementation of `setCurrent` needs further 
refinement because it sets `currentSheet` to `null` on every invocation. One 
general approach could be to set the initial sheet and row in the constructor.
   
   In general, it should be possible to call `hasNext()` and unlimited number 
of times without advancing the iterator, and it calling `next()` should advance 
the iterator forward. When all sheets and rows have been read, `next()` should 
throw an `NoSuchElementException` if called again.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader

2023-06-05 Thread via GitHub


exceptionfactory commented on code in PR #7194:
URL: https://github.com/apache/nifi/pull/7194#discussion_r1218506840


##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/RowIterator.java:
##
@@ -0,0 +1,147 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nifi.excel;
+
+import com.github.pjfanning.xlsx.StreamingReader;
+import org.apache.nifi.logging.ComponentLog;
+import org.apache.nifi.processor.exception.ProcessException;
+import org.apache.poi.ss.usermodel.Row;
+import org.apache.poi.ss.usermodel.Sheet;
+import org.apache.poi.ss.usermodel.Workbook;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+public class RowIterator implements Iterator, Closeable {
+private final Workbook workbook;
+private final Iterator sheets;
+private final Map requiredSheets;
+private final int firstRow;
+private final ComponentLog logger;
+private Sheet currentSheet;
+private Iterator currentRows;
+private Row currentRow;
+
+public RowIterator(InputStream in, List requiredSheets, int 
firstRow, ComponentLog logger) {
+this.workbook = StreamingReader.builder()
+.rowCacheSize(100)
+.bufferSize(4096)
+.open(in);
+this.sheets = this.workbook.iterator();
+this.requiredSheets = requiredSheets != null ? requiredSheets.stream()
+.collect(Collectors.toMap(key -> key, value -> Boolean.FALSE)) 
: new HashMap<>();
+this.firstRow = firstRow;
+this.logger = logger;
+}
+
+@Override
+public boolean hasNext() {
+setCurrent();

Review Comment:
   It should be possible to call `hasNext()` in any order. Taking a closer 
look, it seems like the current implementation of `setCurrent` needs further 
refinement because it sets `currentSheet` to `null` on every invocation. One 
general approach could be to set the initial sheet and row in the constructor.
   
   In general, it should be possible to call `hasNext()` and unlimited number 
of times without advancing the iterator, and it calling `next()` should advance 
the iterator forward. When all sheets and rows have been read, `next()` should 
throw an `IllegalStateException` if called again.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader

2023-06-05 Thread via GitHub


exceptionfactory commented on code in PR #7194:
URL: https://github.com/apache/nifi/pull/7194#discussion_r1218310754


##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/ExcelSchemaInference.java:
##
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nifi.excel;
+
+import org.apache.nifi.schema.inference.FieldTypeInference;
+import org.apache.nifi.schema.inference.RecordSource;
+import org.apache.nifi.schema.inference.SchemaInferenceEngine;
+import org.apache.nifi.schema.inference.TimeValueInference;
+import org.apache.nifi.serialization.SimpleRecordSchema;
+import org.apache.nifi.serialization.record.DataType;
+import org.apache.nifi.serialization.record.RecordField;
+import org.apache.nifi.serialization.record.RecordSchema;
+import org.apache.nifi.util.SchemaInferenceUtil;
+import org.apache.poi.ss.usermodel.Cell;
+import org.apache.poi.ss.usermodel.DataFormatter;
+import org.apache.poi.ss.usermodel.Row;
+
+import java.io.IOException;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Locale;
+import java.util.Map;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+
+public class ExcelSchemaInference implements SchemaInferenceEngine {
+private final TimeValueInference timeValueInference;
+private final DataFormatter dataFormatter;
+
+public ExcelSchemaInference(TimeValueInference timeValueInference) {
+this(timeValueInference, null);
+}
+
+public ExcelSchemaInference(TimeValueInference timeValueInference, Locale 
locale) {
+this.timeValueInference = timeValueInference;
+this.dataFormatter = locale == null ? new DataFormatter() : new 
DataFormatter(locale);
+}
+
+@Override
+public RecordSchema inferSchema(RecordSource recordSource) throws 
IOException {
+final Map typeMap = new LinkedHashMap<>();
+Row row;
+while((row = recordSource.next()) != null) {
+inferSchema(row, typeMap);
+}
+return createSchema(typeMap);
+}
+
+private void inferSchema(final Row row, final Map typeMap) {
+if (ExcelUtils.hasCells(row)) {
+IntStream.range(0, row.getLastCellNum())
+.forEach(index -> {
+final Cell cell = row.getCell(index);
+final String fieldName = Integer.toString(index);

Review Comment:
   What do you think of prefixing this field name with `column_`? So the first 
field name would be `column_0`? That would be easier to work with in absence of 
a defined schema.



##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/RowIterator.java:
##
@@ -0,0 +1,147 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nifi.excel;
+
+import com.github.pjfanning.xlsx.StreamingReader;
+import org.apache.nifi.logging.ComponentLog;
+import org.apache.nifi.processor.exception.ProcessException;
+import org.apache.poi.ss.usermodel.Row;
+import org.apache.poi.ss.usermodel.Sheet;
+import org.apache.poi.ss.usermodel.Workbook;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.List;

[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader

2023-05-31 Thread via GitHub


exceptionfactory commented on code in PR #7194:
URL: https://github.com/apache/nifi/pull/7194#discussion_r1212254476


##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/ExcelHeaderSchemaStrategy.java:
##
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nifi.excel;
+
+import org.apache.nifi.context.PropertyContext;
+import org.apache.nifi.logging.ComponentLog;
+import org.apache.nifi.schema.access.SchemaAccessStrategy;
+import org.apache.nifi.schema.access.SchemaField;
+import org.apache.nifi.schema.access.SchemaNotFoundException;
+import org.apache.nifi.schema.inference.FieldTypeInference;
+import org.apache.nifi.schema.inference.RecordSource;
+import org.apache.nifi.schema.inference.TimeValueInference;
+import org.apache.nifi.serialization.DateTimeUtils;
+import org.apache.nifi.serialization.SimpleRecordSchema;
+import org.apache.nifi.serialization.record.DataType;
+import org.apache.nifi.serialization.record.RecordField;
+import org.apache.nifi.serialization.record.RecordSchema;
+import org.apache.nifi.util.SchemaInferenceUtil;
+import org.apache.poi.ss.usermodel.Cell;
+import org.apache.poi.ss.usermodel.DataFormatter;
+import org.apache.poi.ss.usermodel.Row;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.EnumSet;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Locale;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+
+public class ExcelHeaderSchemaStrategy implements SchemaAccessStrategy {
+private static final Set schemaFields = 
EnumSet.noneOf(SchemaField.class);
+
+private final PropertyContext context;
+
+private final DataFormatter dataFormatter;
+
+private final ComponentLog logger;
+
+public ExcelHeaderSchemaStrategy(PropertyContext context, ComponentLog 
logger) {
+this(context, logger, null);
+}
+
+public ExcelHeaderSchemaStrategy(PropertyContext context, ComponentLog 
logger, Locale locale) {
+this.context = context;
+this.logger = logger;
+this.dataFormatter = locale == null ? new DataFormatter() : new 
DataFormatter(locale);
+}
+
+@Override
+public RecordSchema getSchema(Map variables, InputStream 
contentStream, RecordSchema readSchema) throws SchemaNotFoundException, 
IOException {
+if (this.context == null) {
+throw new SchemaNotFoundException("Schema Access Strategy intended 
only for validation purposes and cannot obtain schema");
+}
+
+String errMsg = "Failed to read Header line from Excel worksheet";
+RecordSource recordSource;
+try {
+recordSource = new ExcelRecordSource(contentStream, context, 
variables, logger);
+} catch (Exception e) {
+throw new SchemaNotFoundException(errMsg, e);
+}
+
+Row headerRow = recordSource.next();
+if (!ExcelUtils.hasCells(headerRow)) {
+throw new SchemaNotFoundException("The chosen header line in the 
Excel worksheet had no cells");
+}
+
+try {
+String dateFormat = 
context.getProperty(DateTimeUtils.DATE_FORMAT).getValue();
+String timeFormat = 
context.getProperty(DateTimeUtils.TIME_FORMAT).getValue();
+String timestampFormat = 
context.getProperty(DateTimeUtils.TIMESTAMP_FORMAT).getValue();
+final TimeValueInference timeValueInference = new 
TimeValueInference(dateFormat, timeFormat, timestampFormat);
+final Map typeMap = new 
LinkedHashMap<>();
+IntStream.range(0, headerRow.getLastCellNum())
+.forEach(index -> {
+final Cell cell = headerRow.getCell(index);
+final String fieldName = Integer.toString(index);

Review Comment:
   The string typing is also a problem as you noted. Theoretically that could 
also be changed to follow an approach similar to the Infer Schema strategy, 
reading all rows to determine potential values. In the interest of completing a 
first 

[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader

2023-05-31 Thread via GitHub


exceptionfactory commented on code in PR #7194:
URL: https://github.com/apache/nifi/pull/7194#discussion_r1212240524


##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/ExcelHeaderSchemaStrategy.java:
##
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nifi.excel;
+
+import org.apache.nifi.context.PropertyContext;
+import org.apache.nifi.logging.ComponentLog;
+import org.apache.nifi.schema.access.SchemaAccessStrategy;
+import org.apache.nifi.schema.access.SchemaField;
+import org.apache.nifi.schema.access.SchemaNotFoundException;
+import org.apache.nifi.schema.inference.FieldTypeInference;
+import org.apache.nifi.schema.inference.RecordSource;
+import org.apache.nifi.schema.inference.TimeValueInference;
+import org.apache.nifi.serialization.DateTimeUtils;
+import org.apache.nifi.serialization.SimpleRecordSchema;
+import org.apache.nifi.serialization.record.DataType;
+import org.apache.nifi.serialization.record.RecordField;
+import org.apache.nifi.serialization.record.RecordSchema;
+import org.apache.nifi.util.SchemaInferenceUtil;
+import org.apache.poi.ss.usermodel.Cell;
+import org.apache.poi.ss.usermodel.DataFormatter;
+import org.apache.poi.ss.usermodel.Row;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.EnumSet;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Locale;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+
+public class ExcelHeaderSchemaStrategy implements SchemaAccessStrategy {
+private static final Set schemaFields = 
EnumSet.noneOf(SchemaField.class);
+
+private final PropertyContext context;
+
+private final DataFormatter dataFormatter;
+
+private final ComponentLog logger;
+
+public ExcelHeaderSchemaStrategy(PropertyContext context, ComponentLog 
logger) {
+this(context, logger, null);
+}
+
+public ExcelHeaderSchemaStrategy(PropertyContext context, ComponentLog 
logger, Locale locale) {
+this.context = context;
+this.logger = logger;
+this.dataFormatter = locale == null ? new DataFormatter() : new 
DataFormatter(locale);
+}
+
+@Override
+public RecordSchema getSchema(Map variables, InputStream 
contentStream, RecordSchema readSchema) throws SchemaNotFoundException, 
IOException {
+if (this.context == null) {
+throw new SchemaNotFoundException("Schema Access Strategy intended 
only for validation purposes and cannot obtain schema");
+}
+
+String errMsg = "Failed to read Header line from Excel worksheet";
+RecordSource recordSource;
+try {
+recordSource = new ExcelRecordSource(contentStream, context, 
variables, logger);
+} catch (Exception e) {
+throw new SchemaNotFoundException(errMsg, e);
+}
+
+Row headerRow = recordSource.next();
+if (!ExcelUtils.hasCells(headerRow)) {
+throw new SchemaNotFoundException("The chosen header line in the 
Excel worksheet had no cells");
+}
+
+try {
+String dateFormat = 
context.getProperty(DateTimeUtils.DATE_FORMAT).getValue();
+String timeFormat = 
context.getProperty(DateTimeUtils.TIME_FORMAT).getValue();
+String timestampFormat = 
context.getProperty(DateTimeUtils.TIMESTAMP_FORMAT).getValue();
+final TimeValueInference timeValueInference = new 
TimeValueInference(dateFormat, timeFormat, timestampFormat);
+final Map typeMap = new 
LinkedHashMap<>();
+IntStream.range(0, headerRow.getLastCellNum())
+.forEach(index -> {
+final Cell cell = headerRow.getCell(index);
+final String fieldName = Integer.toString(index);

Review Comment:
   @dan-s1, although the header column does not have to be a string, the 
expectation for this strategy is that the first row cell values should be used 
as the field names, analogous to the CSV implementation. In its current state, 
I agree the 

[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader

2023-05-31 Thread via GitHub


exceptionfactory commented on code in PR #7194:
URL: https://github.com/apache/nifi/pull/7194#discussion_r1212152365


##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/ExcelHeaderSchemaStrategy.java:
##
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nifi.excel;
+
+import org.apache.nifi.context.PropertyContext;
+import org.apache.nifi.logging.ComponentLog;
+import org.apache.nifi.schema.access.SchemaAccessStrategy;
+import org.apache.nifi.schema.access.SchemaField;
+import org.apache.nifi.schema.access.SchemaNotFoundException;
+import org.apache.nifi.schema.inference.FieldTypeInference;
+import org.apache.nifi.schema.inference.RecordSource;
+import org.apache.nifi.schema.inference.TimeValueInference;
+import org.apache.nifi.serialization.DateTimeUtils;
+import org.apache.nifi.serialization.SimpleRecordSchema;
+import org.apache.nifi.serialization.record.DataType;
+import org.apache.nifi.serialization.record.RecordField;
+import org.apache.nifi.serialization.record.RecordSchema;
+import org.apache.nifi.util.SchemaInferenceUtil;
+import org.apache.poi.ss.usermodel.Cell;
+import org.apache.poi.ss.usermodel.DataFormatter;
+import org.apache.poi.ss.usermodel.Row;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.EnumSet;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Locale;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+
+public class ExcelHeaderSchemaStrategy implements SchemaAccessStrategy {
+private static final Set schemaFields = 
EnumSet.noneOf(SchemaField.class);
+
+private final PropertyContext context;
+
+private final DataFormatter dataFormatter;
+
+private final ComponentLog logger;
+
+public ExcelHeaderSchemaStrategy(PropertyContext context, ComponentLog 
logger) {
+this(context, logger, null);
+}
+
+public ExcelHeaderSchemaStrategy(PropertyContext context, ComponentLog 
logger, Locale locale) {
+this.context = context;
+this.logger = logger;
+this.dataFormatter = locale == null ? new DataFormatter() : new 
DataFormatter(locale);
+}
+
+@Override
+public RecordSchema getSchema(Map variables, InputStream 
contentStream, RecordSchema readSchema) throws SchemaNotFoundException, 
IOException {
+if (this.context == null) {
+throw new SchemaNotFoundException("Schema Access Strategy intended 
only for validation purposes and cannot obtain schema");
+}
+
+String errMsg = "Failed to read Header line from Excel worksheet";
+RecordSource recordSource;
+try {
+recordSource = new ExcelRecordSource(contentStream, context, 
variables, logger);
+} catch (Exception e) {
+throw new SchemaNotFoundException(errMsg, e);
+}
+
+Row headerRow = recordSource.next();
+if (!ExcelUtils.hasCells(headerRow)) {
+throw new SchemaNotFoundException("The chosen header line in the 
Excel worksheet had no cells");
+}
+
+try {
+String dateFormat = 
context.getProperty(DateTimeUtils.DATE_FORMAT).getValue();
+String timeFormat = 
context.getProperty(DateTimeUtils.TIME_FORMAT).getValue();
+String timestampFormat = 
context.getProperty(DateTimeUtils.TIMESTAMP_FORMAT).getValue();
+final TimeValueInference timeValueInference = new 
TimeValueInference(dateFormat, timeFormat, timestampFormat);
+final Map typeMap = new 
LinkedHashMap<>();
+IntStream.range(0, headerRow.getLastCellNum())
+.forEach(index -> {
+final Cell cell = headerRow.getCell(index);
+final String fieldName = Integer.toString(index);

Review Comment:
   This approach uses the field index as the field name, instead of using the 
cell value as the field name, which would be the expected behavior.



##

[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader

2023-05-19 Thread via GitHub


exceptionfactory commented on code in PR #7194:
URL: https://github.com/apache/nifi/pull/7194#discussion_r1199111371


##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/RowIterator.java:
##
@@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */package org.apache.nifi.excel;
+
+import com.github.pjfanning.xlsx.StreamingReader;
+import org.apache.nifi.logging.ComponentLog;
+import org.apache.poi.ss.usermodel.Row;
+import org.apache.poi.ss.usermodel.Sheet;
+import org.apache.poi.ss.usermodel.Workbook;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+public class RowIterator implements Iterator, Closeable {
+private final Workbook workbook;
+private final Iterator sheets;
+private Sheet currentSheet;
+private Iterator currentRows;
+private final Map desiredSheets;
+private final int firstRow;
+private ComponentLog logger;
+private boolean log;
+private Row currentRow;
+
+public RowIterator(InputStream in, List desiredSheets, int 
firstRow) {
+this(in, desiredSheets, firstRow, null);
+}
+
+public RowIterator(InputStream in, List desiredSheets, int 
firstRow, ComponentLog logger) {
+this.workbook = StreamingReader.builder()
+.rowCacheSize(100)
+.bufferSize(4096)
+.open(in);
+this.sheets = this.workbook.iterator();
+this.desiredSheets = desiredSheets != null ? desiredSheets.stream()
+.collect(Collectors.toMap(key -> key, value -> Boolean.FALSE)) 
: new HashMap<>();
+this.firstRow = firstRow;
+this.logger = logger;
+this.log = logger != null;
+}
+
+@Override
+public boolean hasNext() {
+setCurrent();
+boolean next = currentRow != null;
+if(!next) {
+String sheetsNotFound = getSheetsNotFound(desiredSheets);
+if (!sheetsNotFound.isEmpty() && log) {
+logger.warn("Excel sheet(s) not found: {}", sheetsNotFound);
+}

Review Comment:
   It could be something else, probably `ProcessException` would be a good 
option as a more specific subclass of `RuntimeException`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader

2023-05-19 Thread via GitHub


exceptionfactory commented on code in PR #7194:
URL: https://github.com/apache/nifi/pull/7194#discussion_r1199060548


##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/RowIterator.java:
##
@@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */package org.apache.nifi.excel;
+
+import com.github.pjfanning.xlsx.StreamingReader;
+import org.apache.nifi.logging.ComponentLog;
+import org.apache.poi.ss.usermodel.Row;
+import org.apache.poi.ss.usermodel.Sheet;
+import org.apache.poi.ss.usermodel.Workbook;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+public class RowIterator implements Iterator, Closeable {
+private final Workbook workbook;
+private final Iterator sheets;
+private Sheet currentSheet;
+private Iterator currentRows;
+private final Map desiredSheets;
+private final int firstRow;
+private ComponentLog logger;
+private boolean log;
+private Row currentRow;
+
+public RowIterator(InputStream in, List desiredSheets, int 
firstRow) {
+this(in, desiredSheets, firstRow, null);
+}
+
+public RowIterator(InputStream in, List desiredSheets, int 
firstRow, ComponentLog logger) {
+this.workbook = StreamingReader.builder()
+.rowCacheSize(100)
+.bufferSize(4096)
+.open(in);
+this.sheets = this.workbook.iterator();
+this.desiredSheets = desiredSheets != null ? desiredSheets.stream()
+.collect(Collectors.toMap(key -> key, value -> Boolean.FALSE)) 
: new HashMap<>();
+this.firstRow = firstRow;
+this.logger = logger;
+this.log = logger != null;
+}
+
+@Override
+public boolean hasNext() {
+setCurrent();
+boolean next = currentRow != null;
+if(!next) {
+String sheetsNotFound = getSheetsNotFound(desiredSheets);
+if (!sheetsNotFound.isEmpty() && log) {
+logger.warn("Excel sheet(s) not found: {}", sheetsNotFound);
+}

Review Comment:
   You're right, I was thinking more of Record Writers and Processors that 
generate record-oriented files. With that in mind, it looks like there isn't a 
good way to signal a problem of this kind, other than throwing an exception.
   
   With that in mind, I think the property should be changed to `Required 
Sheets`, and the description should indicate that the absence of these Sheets 
will throw an exception. We could theoretically introduce a `Sheet Processing 
Strategy` of some kind, but I think having the single `Required Sheets` 
property is the best way to go forward for now. How does that sound? In that 
case, instead of logging a warning, this condition would throw an `IOException`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader

2023-05-18 Thread via GitHub


exceptionfactory commented on code in PR #7194:
URL: https://github.com/apache/nifi/pull/7194#discussion_r1198284268


##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/RowIterator.java:
##
@@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */package org.apache.nifi.excel;
+
+import com.github.pjfanning.xlsx.StreamingReader;
+import org.apache.nifi.logging.ComponentLog;
+import org.apache.poi.ss.usermodel.Row;
+import org.apache.poi.ss.usermodel.Sheet;
+import org.apache.poi.ss.usermodel.Workbook;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+public class RowIterator implements Iterator, Closeable {
+private final Workbook workbook;
+private final Iterator sheets;
+private Sheet currentSheet;
+private Iterator currentRows;
+private final Map desiredSheets;
+private final int firstRow;
+private ComponentLog logger;
+private boolean log;
+private Row currentRow;
+
+public RowIterator(InputStream in, List desiredSheets, int 
firstRow) {
+this(in, desiredSheets, firstRow, null);
+}
+
+public RowIterator(InputStream in, List desiredSheets, int 
firstRow, ComponentLog logger) {
+this.workbook = StreamingReader.builder()
+.rowCacheSize(100)
+.bufferSize(4096)
+.open(in);
+this.sheets = this.workbook.iterator();
+this.desiredSheets = desiredSheets != null ? desiredSheets.stream()
+.collect(Collectors.toMap(key -> key, value -> Boolean.FALSE)) 
: new HashMap<>();
+this.firstRow = firstRow;
+this.logger = logger;
+this.log = logger != null;
+}
+
+@Override
+public boolean hasNext() {
+setCurrent();
+boolean next = currentRow != null;
+if(!next) {
+String sheetsNotFound = getSheetsNotFound(desiredSheets);
+if (!sheetsNotFound.isEmpty() && log) {
+logger.warn("Excel sheet(s) not found: {}", sheetsNotFound);
+}
+}
+return next;
+}
+
+private void setCurrent() {
+currentRow = getNextRow();
+if (currentRow != null) {
+return;
+}
+
+currentSheet = null;
+currentRows = null;
+while (sheets.hasNext()) {
+currentSheet = sheets.next();
+if (isIterateOverAllSheets() || 
hasSheet(currentSheet.getSheetName())) {
+currentRows = currentSheet.iterator();
+currentRow = getNextRow();
+if (currentRow != null) {
+return;
+}
+}
+}
+}
+
+private Row getNextRow() {
+while (currentRows != null && !hasExhaustedRows()) {
+Row tempCurrentRow = currentRows.next();
+if (!isSkip(tempCurrentRow)) {
+return tempCurrentRow;
+}
+}
+return null;
+}
+
+private boolean hasExhaustedRows() {
+boolean exhausted = !currentRows.hasNext();
+if (log && exhausted) {
+logger.info("Exhausted all rows from sheet {}", 
currentSheet.getSheetName());
+}
+return exhausted;
+}
+
+private boolean isSkip(Row row) {
+return row.getRowNum() < firstRow;
+}
+
+private boolean isIterateOverAllSheets() {
+boolean iterateAllSheets = desiredSheets.isEmpty();
+if (iterateAllSheets && log) {
+logger.info("Advanced to sheet {}", currentSheet.getSheetName());
+}
+return iterateAllSheets;
+}
+
+private boolean hasSheet(String name) {
+boolean sheetByName = !desiredSheets.isEmpty()
+&& desiredSheets.keySet().stream()
+.anyMatch(desiredSheet -> desiredSheet.equalsIgnoreCase(name));
+if (sheetByName) {
+desiredSheets.put(name, Boolean.TRUE);
+}
+return sheetByName;
+}
+
+ 

[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader

2023-05-18 Thread via GitHub


exceptionfactory commented on code in PR #7194:
URL: https://github.com/apache/nifi/pull/7194#discussion_r1198272978


##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/RowIterator.java:
##
@@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */package org.apache.nifi.excel;
+
+import com.github.pjfanning.xlsx.StreamingReader;
+import org.apache.nifi.logging.ComponentLog;
+import org.apache.poi.ss.usermodel.Row;
+import org.apache.poi.ss.usermodel.Sheet;
+import org.apache.poi.ss.usermodel.Workbook;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+public class RowIterator implements Iterator, Closeable {
+private final Workbook workbook;
+private final Iterator sheets;
+private Sheet currentSheet;
+private Iterator currentRows;
+private final Map desiredSheets;
+private final int firstRow;
+private ComponentLog logger;
+private boolean log;
+private Row currentRow;
+
+public RowIterator(InputStream in, List desiredSheets, int 
firstRow) {
+this(in, desiredSheets, firstRow, null);
+}
+
+public RowIterator(InputStream in, List desiredSheets, int 
firstRow, ComponentLog logger) {
+this.workbook = StreamingReader.builder()
+.rowCacheSize(100)
+.bufferSize(4096)
+.open(in);
+this.sheets = this.workbook.iterator();
+this.desiredSheets = desiredSheets != null ? desiredSheets.stream()
+.collect(Collectors.toMap(key -> key, value -> Boolean.FALSE)) 
: new HashMap<>();
+this.firstRow = firstRow;
+this.logger = logger;
+this.log = logger != null;
+}
+
+@Override
+public boolean hasNext() {
+setCurrent();
+boolean next = currentRow != null;
+if(!next) {
+String sheetsNotFound = getSheetsNotFound(desiredSheets);
+if (!sheetsNotFound.isEmpty() && log) {
+logger.warn("Excel sheet(s) not found: {}", sheetsNotFound);
+}
+}
+return next;
+}
+
+private void setCurrent() {
+currentRow = getNextRow();
+if (currentRow != null) {
+return;
+}
+
+currentSheet = null;
+currentRows = null;
+while (sheets.hasNext()) {
+currentSheet = sheets.next();
+if (isIterateOverAllSheets() || 
hasSheet(currentSheet.getSheetName())) {
+currentRows = currentSheet.iterator();
+currentRow = getNextRow();
+if (currentRow != null) {
+return;
+}
+}
+}
+}
+
+private Row getNextRow() {
+while (currentRows != null && !hasExhaustedRows()) {
+Row tempCurrentRow = currentRows.next();
+if (!isSkip(tempCurrentRow)) {
+return tempCurrentRow;
+}
+}
+return null;
+}
+
+private boolean hasExhaustedRows() {
+boolean exhausted = !currentRows.hasNext();
+if (log && exhausted) {
+logger.info("Exhausted all rows from sheet {}", 
currentSheet.getSheetName());
+}
+return exhausted;
+}
+
+private boolean isSkip(Row row) {
+return row.getRowNum() < firstRow;
+}
+
+private boolean isIterateOverAllSheets() {
+boolean iterateAllSheets = desiredSheets.isEmpty();
+if (iterateAllSheets && log) {
+logger.info("Advanced to sheet {}", currentSheet.getSheetName());
+}
+return iterateAllSheets;
+}
+
+private boolean hasSheet(String name) {
+boolean sheetByName = !desiredSheets.isEmpty()
+&& desiredSheets.keySet().stream()
+.anyMatch(desiredSheet -> desiredSheet.equalsIgnoreCase(name));
+if (sheetByName) {
+desiredSheets.put(name, Boolean.TRUE);
+}
+return sheetByName;
+}
+
+ 

[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader

2023-05-18 Thread via GitHub


exceptionfactory commented on code in PR #7194:
URL: https://github.com/apache/nifi/pull/7194#discussion_r1198269634


##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/ExcelReader.java:
##
@@ -0,0 +1,193 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nifi.excel;
+
+import org.apache.commons.lang3.StringUtils;
+import org.apache.nifi.annotation.documentation.CapabilityDescription;
+import org.apache.nifi.annotation.documentation.Tags;
+import org.apache.nifi.annotation.lifecycle.OnEnabled;
+import org.apache.nifi.components.AllowableValue;
+import org.apache.nifi.components.PropertyDescriptor;
+import org.apache.nifi.context.PropertyContext;
+import org.apache.nifi.controller.ConfigurationContext;
+import org.apache.nifi.expression.ExpressionLanguageScope;
+import org.apache.nifi.logging.ComponentLog;
+import org.apache.nifi.processor.util.StandardValidators;
+import org.apache.nifi.schema.access.SchemaAccessStrategy;
+import org.apache.nifi.schema.access.SchemaNotFoundException;
+import org.apache.nifi.schema.inference.InferSchemaAccessStrategy;
+import org.apache.nifi.schema.inference.RecordSourceFactory;
+import org.apache.nifi.schema.inference.SchemaInferenceEngine;
+import org.apache.nifi.schema.inference.SchemaInferenceUtil;
+import org.apache.nifi.schema.inference.TimeValueInference;
+import org.apache.nifi.schemaregistry.services.SchemaRegistry;
+import org.apache.nifi.serialization.DateTimeUtils;
+import org.apache.nifi.serialization.MalformedRecordException;
+import org.apache.nifi.serialization.RecordReader;
+import org.apache.nifi.serialization.RecordReaderFactory;
+import org.apache.nifi.serialization.SchemaRegistryService;
+import org.apache.nifi.serialization.record.RecordSchema;
+import org.apache.nifi.stream.io.NonCloseableInputStream;
+import org.apache.poi.ss.usermodel.Row;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.atomic.AtomicReferenceArray;
+import java.util.stream.IntStream;
+
+@Tags({"excel", "spreadsheet", "xlsx", "parse", "record", "row", "reader", 
"values", "cell"})
+@CapabilityDescription("Parses a Microsoft Excel document returning each row 
in each sheet as a separate record. "
++ "This reader allows for inferring a schema either based on the first 
line of an Excel sheet if a 'header line' is "
++ "present or from all the desired sheets, or providing an explicit 
schema "
++ "for interpreting the values. See Controller Service's Usage for 
further documentation. "
++ "This reader is currently only capable of processing .xlsx "
++ "(XSSF 2007 OOXML file format) Excel documents and not older .xls 
(HSSF '97(-2007) file format) documents.)")
+public class ExcelReader extends SchemaRegistryService implements 
RecordReaderFactory {
+
+private static final AllowableValue HEADER_DERIVED = new 
AllowableValue("excel-header-derived", "Use fields From Header",
+"The first chosen row of the Excel sheet is a header row that 
contains the columns representative of all the rows " +
+"in the desired sheets. The schema will be derived by 
using those columns in the header.");
+public static final PropertyDescriptor DESIRED_SHEETS = new 
PropertyDescriptor
+.Builder().name("extract-sheets")
+.displayName("Sheets to Extract")
+.description("Comma separated list of Excel document sheet names 
whose rows should be extracted from the excel document. If this property" +
+" is left blank then all the rows from all the sheets will 
be extracted from the Excel document. The list of names is case in-sensitive. 
Any sheets not" +
+" specified in this value will be ignored. A bulletin will 
be generated if a specified sheet(s) are not found.")
+.required(false)
+
.expressionLanguageSupported(ExpressionLanguageScope.FLOWFILE_ATTRIBUTES)
+.addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
+  

[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader

2023-05-18 Thread via GitHub


exceptionfactory commented on code in PR #7194:
URL: https://github.com/apache/nifi/pull/7194#discussion_r1198210056


##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/ExcelUtils.java:
##
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nifi.excel;
+
+import org.apache.poi.ss.usermodel.Row;
+
+public class ExcelUtils {

Review Comment:
   Thanks, it is fine for now, leaving it as is sounds good.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader

2023-05-18 Thread via GitHub


exceptionfactory commented on code in PR #7194:
URL: https://github.com/apache/nifi/pull/7194#discussion_r1198205592


##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/RowIterator.java:
##
@@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */package org.apache.nifi.excel;
+
+import com.github.pjfanning.xlsx.StreamingReader;
+import org.apache.nifi.logging.ComponentLog;
+import org.apache.poi.ss.usermodel.Row;
+import org.apache.poi.ss.usermodel.Sheet;
+import org.apache.poi.ss.usermodel.Workbook;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+public class RowIterator implements Iterator, Closeable {
+private final Workbook workbook;
+private final Iterator sheets;
+private Sheet currentSheet;
+private Iterator currentRows;
+private final Map desiredSheets;
+private final int firstRow;
+private ComponentLog logger;
+private boolean log;
+private Row currentRow;
+
+public RowIterator(InputStream in, List desiredSheets, int 
firstRow) {
+this(in, desiredSheets, firstRow, null);
+}
+
+public RowIterator(InputStream in, List desiredSheets, int 
firstRow, ComponentLog logger) {
+this.workbook = StreamingReader.builder()
+.rowCacheSize(100)
+.bufferSize(4096)
+.open(in);
+this.sheets = this.workbook.iterator();
+this.desiredSheets = desiredSheets != null ? desiredSheets.stream()
+.collect(Collectors.toMap(key -> key, value -> Boolean.FALSE)) 
: new HashMap<>();
+this.firstRow = firstRow;
+this.logger = logger;
+this.log = logger != null;
+}
+
+@Override
+public boolean hasNext() {
+setCurrent();
+boolean next = currentRow != null;
+if(!next) {
+String sheetsNotFound = getSheetsNotFound(desiredSheets);
+if (!sheetsNotFound.isEmpty() && log) {
+logger.warn("Excel sheet(s) not found: {}", sheetsNotFound);
+}
+}
+return next;
+}
+
+private void setCurrent() {
+currentRow = getNextRow();
+if (currentRow != null) {
+return;
+}
+
+currentSheet = null;
+currentRows = null;
+while (sheets.hasNext()) {
+currentSheet = sheets.next();
+if (isIterateOverAllSheets() || 
hasSheet(currentSheet.getSheetName())) {
+currentRows = currentSheet.iterator();
+currentRow = getNextRow();
+if (currentRow != null) {
+return;
+}
+}
+}
+}
+
+private Row getNextRow() {
+while (currentRows != null && !hasExhaustedRows()) {
+Row tempCurrentRow = currentRows.next();
+if (!isSkip(tempCurrentRow)) {
+return tempCurrentRow;
+}
+}
+return null;
+}
+
+private boolean hasExhaustedRows() {
+boolean exhausted = !currentRows.hasNext();
+if (log && exhausted) {
+logger.info("Exhausted all rows from sheet {}", 
currentSheet.getSheetName());
+}
+return exhausted;
+}
+
+private boolean isSkip(Row row) {
+return row.getRowNum() < firstRow;
+}
+
+private boolean isIterateOverAllSheets() {
+boolean iterateAllSheets = desiredSheets.isEmpty();
+if (iterateAllSheets && log) {
+logger.info("Advanced to sheet {}", currentSheet.getSheetName());
+}
+return iterateAllSheets;
+}
+
+private boolean hasSheet(String name) {
+boolean sheetByName = !desiredSheets.isEmpty()
+&& desiredSheets.keySet().stream()
+.anyMatch(desiredSheet -> desiredSheet.equalsIgnoreCase(name));
+if (sheetByName) {
+desiredSheets.put(name, Boolean.TRUE);
+}
+return sheetByName;
+}
+
+ 

[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader

2023-05-18 Thread via GitHub


exceptionfactory commented on code in PR #7194:
URL: https://github.com/apache/nifi/pull/7194#discussion_r1198205061


##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/RowIterator.java:
##
@@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */package org.apache.nifi.excel;
+
+import com.github.pjfanning.xlsx.StreamingReader;
+import org.apache.nifi.logging.ComponentLog;
+import org.apache.poi.ss.usermodel.Row;
+import org.apache.poi.ss.usermodel.Sheet;
+import org.apache.poi.ss.usermodel.Workbook;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+public class RowIterator implements Iterator, Closeable {
+private final Workbook workbook;
+private final Iterator sheets;
+private Sheet currentSheet;
+private Iterator currentRows;
+private final Map desiredSheets;
+private final int firstRow;
+private ComponentLog logger;
+private boolean log;
+private Row currentRow;
+
+public RowIterator(InputStream in, List desiredSheets, int 
firstRow) {
+this(in, desiredSheets, firstRow, null);
+}
+
+public RowIterator(InputStream in, List desiredSheets, int 
firstRow, ComponentLog logger) {
+this.workbook = StreamingReader.builder()
+.rowCacheSize(100)
+.bufferSize(4096)
+.open(in);
+this.sheets = this.workbook.iterator();
+this.desiredSheets = desiredSheets != null ? desiredSheets.stream()
+.collect(Collectors.toMap(key -> key, value -> Boolean.FALSE)) 
: new HashMap<>();
+this.firstRow = firstRow;
+this.logger = logger;
+this.log = logger != null;
+}
+
+@Override
+public boolean hasNext() {
+setCurrent();
+boolean next = currentRow != null;
+if(!next) {
+String sheetsNotFound = getSheetsNotFound(desiredSheets);
+if (!sheetsNotFound.isEmpty() && log) {
+logger.warn("Excel sheet(s) not found: {}", sheetsNotFound);
+}

Review Comment:
   The problem with the warning is that it is not actionable in terms of flow 
handling. Using the `record.count` attribute provides the opportunity to warn 
if a flow configuration expects to see records in all cases.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader

2023-05-16 Thread via GitHub


exceptionfactory commented on code in PR #7194:
URL: https://github.com/apache/nifi/pull/7194#discussion_r1195809425


##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/ExcelRecordReader.java:
##
@@ -0,0 +1,221 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nifi.excel;
+
+import org.apache.nifi.logging.ComponentLog;
+import org.apache.nifi.serialization.MalformedRecordException;
+import org.apache.nifi.serialization.RecordReader;
+import org.apache.nifi.serialization.record.DataType;
+import org.apache.nifi.serialization.record.MapRecord;
+import org.apache.nifi.serialization.record.Record;
+import org.apache.nifi.serialization.record.RecordField;
+import org.apache.nifi.serialization.record.RecordSchema;
+import org.apache.nifi.serialization.record.util.DataTypeUtils;
+import org.apache.poi.ss.usermodel.Cell;
+import org.apache.poi.ss.usermodel.DateUtil;
+import org.apache.poi.ss.usermodel.Row;
+
+import java.io.IOException;
+import java.text.DateFormat;
+import java.util.ArrayList;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.function.Supplier;
+import java.util.stream.IntStream;
+
+import static org.apache.commons.lang3.StringUtils.isEmpty;
+
+public class ExcelRecordReader implements RecordReader {
+private final RowIterator rowIterator;
+private ComponentLog logger;
+private final RecordSchema schema;
+private final List desiredSheets;
+private final Supplier LAZY_DATE_FORMAT;
+private final Supplier LAZY_TIME_FORMAT;
+private final Supplier LAZY_TIMESTAMP_FORMAT;
+private final String dateFormat;
+private final String timeFormat;
+private final String timestampFormat;
+
+public ExcelRecordReader(ExcelRecordReaderArgs args) throws 
MalformedRecordException {
+this.logger = args.getLogger();
+this.schema = args.getSchema();
+this.desiredSheets = new ArrayList<>();
+if (args.getDesiredSheets() != null && 
args.getDesiredSheets().length() > 0) {
+IntStream.range(0, args.getDesiredSheets().length())
+.forEach(index -> 
this.desiredSheets.add(args.getDesiredSheets().get(index)));
+}
+
+if (isEmpty(args.getDateFormat())) {
+this.dateFormat = null;
+LAZY_DATE_FORMAT = null;
+} else {
+this.dateFormat = args.getDateFormat();
+LAZY_DATE_FORMAT = () -> DataTypeUtils.getDateFormat(dateFormat);
+}
+
+if (isEmpty(args.getTimeFormat())) {
+this.timeFormat = null;
+LAZY_TIME_FORMAT = null;
+} else {
+this.timeFormat = args.getTimeFormat();
+LAZY_TIME_FORMAT = () -> DataTypeUtils.getDateFormat(timeFormat);
+}
+
+if (isEmpty(args.getTimestampFormat())) {
+this.timestampFormat = null;
+LAZY_TIMESTAMP_FORMAT = null;
+} else {
+this.timestampFormat = args.getTimestampFormat();
+LAZY_TIMESTAMP_FORMAT = () -> 
DataTypeUtils.getDateFormat(timestampFormat);
+}
+
+try {
+this.rowIterator = new RowIterator(args.getInputStream(), 
desiredSheets, args.getFirstRow(), logger);
+} catch (RuntimeException e) {
+String msg = "Error occurred while processing record file";
+logger.error(msg, e);
+throw new MalformedRecordException(msg, e);
+}
+}
+
+@Override
+public Record nextRecord(boolean coerceTypes, boolean dropUnknownFields) 
throws IOException, MalformedRecordException {
+try {
+if (rowIterator.hasNext()) {
+Row currentRow = rowIterator.next();
+Map currentRowValues = 
getCurrentRowValues(currentRow, coerceTypes, dropUnknownFields);
+return new MapRecord(schema, currentRowValues);
+}
+} catch (Exception e) {
+throw new MalformedRecordException("Error while getting next 
record", e);
+}
+return null;
+}
+
+private Map getCurrentRowValues(Row 

[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader

2023-05-16 Thread via GitHub


exceptionfactory commented on code in PR #7194:
URL: https://github.com/apache/nifi/pull/7194#discussion_r1195712821


##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/RowIterator.java:
##
@@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */package org.apache.nifi.excel;
+
+import com.github.pjfanning.xlsx.StreamingReader;
+import org.apache.nifi.logging.ComponentLog;
+import org.apache.poi.ss.usermodel.Row;
+import org.apache.poi.ss.usermodel.Sheet;
+import org.apache.poi.ss.usermodel.Workbook;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+public class RowIterator implements Iterator, Closeable {
+private final Workbook workbook;
+private final Iterator sheets;
+private Sheet currentSheet;
+private Iterator currentRows;
+private final Map desiredSheets;
+private final int firstRow;
+private ComponentLog logger;
+private boolean log;
+private Row currentRow;
+
+public RowIterator(InputStream in, List desiredSheets, int 
firstRow) {
+this(in, desiredSheets, firstRow, null);
+}
+
+public RowIterator(InputStream in, List desiredSheets, int 
firstRow, ComponentLog logger) {
+this.workbook = StreamingReader.builder()
+.rowCacheSize(100)
+.bufferSize(4096)
+.open(in);
+this.sheets = this.workbook.iterator();
+this.desiredSheets = desiredSheets != null ? desiredSheets.stream()
+.collect(Collectors.toMap(key -> key, value -> Boolean.FALSE)) 
: new HashMap<>();
+this.firstRow = firstRow;
+this.logger = logger;
+this.log = logger != null;
+}
+
+@Override
+public boolean hasNext() {
+setCurrent();
+boolean next = currentRow != null;
+if(!next) {
+String sheetsNotFound = getSheetsNotFound(desiredSheets);
+if (!sheetsNotFound.isEmpty() && log) {
+logger.warn("Excel sheet(s) not found: {}", sheetsNotFound);
+}

Review Comment:
   Some of the other Record Readers add a record.count attribute, which is 
useful for tracking. That's probably a better option than logging a warning.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader

2023-05-16 Thread via GitHub


exceptionfactory commented on code in PR #7194:
URL: https://github.com/apache/nifi/pull/7194#discussion_r1195705310


##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/ExcelRecordReader.java:
##
@@ -0,0 +1,221 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nifi.excel;
+
+import org.apache.nifi.logging.ComponentLog;
+import org.apache.nifi.serialization.MalformedRecordException;
+import org.apache.nifi.serialization.RecordReader;
+import org.apache.nifi.serialization.record.DataType;
+import org.apache.nifi.serialization.record.MapRecord;
+import org.apache.nifi.serialization.record.Record;
+import org.apache.nifi.serialization.record.RecordField;
+import org.apache.nifi.serialization.record.RecordSchema;
+import org.apache.nifi.serialization.record.util.DataTypeUtils;
+import org.apache.poi.ss.usermodel.Cell;
+import org.apache.poi.ss.usermodel.DateUtil;
+import org.apache.poi.ss.usermodel.Row;
+
+import java.io.IOException;
+import java.text.DateFormat;
+import java.util.ArrayList;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.function.Supplier;
+import java.util.stream.IntStream;
+
+import static org.apache.commons.lang3.StringUtils.isEmpty;
+
+public class ExcelRecordReader implements RecordReader {
+private final RowIterator rowIterator;
+private ComponentLog logger;
+private final RecordSchema schema;
+private final List desiredSheets;
+private final Supplier LAZY_DATE_FORMAT;
+private final Supplier LAZY_TIME_FORMAT;
+private final Supplier LAZY_TIMESTAMP_FORMAT;
+private final String dateFormat;
+private final String timeFormat;
+private final String timestampFormat;
+
+public ExcelRecordReader(ExcelRecordReaderArgs args) throws 
MalformedRecordException {
+this.logger = args.getLogger();
+this.schema = args.getSchema();
+this.desiredSheets = new ArrayList<>();
+if (args.getDesiredSheets() != null && 
args.getDesiredSheets().length() > 0) {
+IntStream.range(0, args.getDesiredSheets().length())
+.forEach(index -> 
this.desiredSheets.add(args.getDesiredSheets().get(index)));
+}
+
+if (isEmpty(args.getDateFormat())) {
+this.dateFormat = null;
+LAZY_DATE_FORMAT = null;
+} else {
+this.dateFormat = args.getDateFormat();
+LAZY_DATE_FORMAT = () -> DataTypeUtils.getDateFormat(dateFormat);
+}
+
+if (isEmpty(args.getTimeFormat())) {
+this.timeFormat = null;
+LAZY_TIME_FORMAT = null;
+} else {
+this.timeFormat = args.getTimeFormat();
+LAZY_TIME_FORMAT = () -> DataTypeUtils.getDateFormat(timeFormat);
+}
+
+if (isEmpty(args.getTimestampFormat())) {
+this.timestampFormat = null;
+LAZY_TIMESTAMP_FORMAT = null;
+} else {
+this.timestampFormat = args.getTimestampFormat();
+LAZY_TIMESTAMP_FORMAT = () -> 
DataTypeUtils.getDateFormat(timestampFormat);
+}
+
+try {
+this.rowIterator = new RowIterator(args.getInputStream(), 
desiredSheets, args.getFirstRow(), logger);
+} catch (RuntimeException e) {
+String msg = "Error occurred while processing record file";
+logger.error(msg, e);
+throw new MalformedRecordException(msg, e);
+}
+}
+
+@Override
+public Record nextRecord(boolean coerceTypes, boolean dropUnknownFields) 
throws IOException, MalformedRecordException {
+try {
+if (rowIterator.hasNext()) {
+Row currentRow = rowIterator.next();
+Map currentRowValues = 
getCurrentRowValues(currentRow, coerceTypes, dropUnknownFields);
+return new MapRecord(schema, currentRowValues);
+}
+} catch (Exception e) {
+throw new MalformedRecordException("Error while getting next 
record", e);
+}
+return null;
+}
+
+private Map getCurrentRowValues(Row 

[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader

2023-05-16 Thread via GitHub


exceptionfactory commented on code in PR #7194:
URL: https://github.com/apache/nifi/pull/7194#discussion_r1195649993


##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/ExcelRecordReader.java:
##
@@ -0,0 +1,221 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nifi.excel;
+
+import org.apache.nifi.logging.ComponentLog;
+import org.apache.nifi.serialization.MalformedRecordException;
+import org.apache.nifi.serialization.RecordReader;
+import org.apache.nifi.serialization.record.DataType;
+import org.apache.nifi.serialization.record.MapRecord;
+import org.apache.nifi.serialization.record.Record;
+import org.apache.nifi.serialization.record.RecordField;
+import org.apache.nifi.serialization.record.RecordSchema;
+import org.apache.nifi.serialization.record.util.DataTypeUtils;
+import org.apache.poi.ss.usermodel.Cell;
+import org.apache.poi.ss.usermodel.DateUtil;
+import org.apache.poi.ss.usermodel.Row;
+
+import java.io.IOException;
+import java.text.DateFormat;
+import java.util.ArrayList;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.function.Supplier;
+import java.util.stream.IntStream;
+
+import static org.apache.commons.lang3.StringUtils.isEmpty;
+
+public class ExcelRecordReader implements RecordReader {
+private final RowIterator rowIterator;
+private ComponentLog logger;
+private final RecordSchema schema;
+private final List desiredSheets;
+private final Supplier LAZY_DATE_FORMAT;
+private final Supplier LAZY_TIME_FORMAT;
+private final Supplier LAZY_TIMESTAMP_FORMAT;
+private final String dateFormat;
+private final String timeFormat;
+private final String timestampFormat;
+
+public ExcelRecordReader(ExcelRecordReaderArgs args) throws 
MalformedRecordException {
+this.logger = args.getLogger();
+this.schema = args.getSchema();
+this.desiredSheets = new ArrayList<>();
+if (args.getDesiredSheets() != null && 
args.getDesiredSheets().length() > 0) {
+IntStream.range(0, args.getDesiredSheets().length())
+.forEach(index -> 
this.desiredSheets.add(args.getDesiredSheets().get(index)));
+}
+
+if (isEmpty(args.getDateFormat())) {
+this.dateFormat = null;
+LAZY_DATE_FORMAT = null;
+} else {
+this.dateFormat = args.getDateFormat();
+LAZY_DATE_FORMAT = () -> DataTypeUtils.getDateFormat(dateFormat);
+}
+
+if (isEmpty(args.getTimeFormat())) {
+this.timeFormat = null;
+LAZY_TIME_FORMAT = null;
+} else {
+this.timeFormat = args.getTimeFormat();
+LAZY_TIME_FORMAT = () -> DataTypeUtils.getDateFormat(timeFormat);
+}
+
+if (isEmpty(args.getTimestampFormat())) {
+this.timestampFormat = null;
+LAZY_TIMESTAMP_FORMAT = null;
+} else {
+this.timestampFormat = args.getTimestampFormat();
+LAZY_TIMESTAMP_FORMAT = () -> 
DataTypeUtils.getDateFormat(timestampFormat);
+}
+
+try {
+this.rowIterator = new RowIterator(args.getInputStream(), 
desiredSheets, args.getFirstRow(), logger);
+} catch (RuntimeException e) {
+String msg = "Error occurred while processing record file";
+logger.error(msg, e);
+throw new MalformedRecordException(msg, e);
+}
+}
+
+@Override
+public Record nextRecord(boolean coerceTypes, boolean dropUnknownFields) 
throws IOException, MalformedRecordException {
+try {
+if (rowIterator.hasNext()) {
+Row currentRow = rowIterator.next();
+Map currentRowValues = 
getCurrentRowValues(currentRow, coerceTypes, dropUnknownFields);
+return new MapRecord(schema, currentRowValues);
+}
+} catch (Exception e) {
+throw new MalformedRecordException("Error while getting next 
record", e);
+}
+return null;
+}
+
+private Map getCurrentRowValues(Row 

[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader

2023-05-15 Thread via GitHub


exceptionfactory commented on code in PR #7194:
URL: https://github.com/apache/nifi/pull/7194#discussion_r1194517265


##
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/ExcelReader.java:
##
@@ -0,0 +1,193 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nifi.excel;
+
+import org.apache.commons.lang3.StringUtils;
+import org.apache.nifi.annotation.documentation.CapabilityDescription;
+import org.apache.nifi.annotation.documentation.Tags;
+import org.apache.nifi.annotation.lifecycle.OnEnabled;
+import org.apache.nifi.components.AllowableValue;
+import org.apache.nifi.components.PropertyDescriptor;
+import org.apache.nifi.context.PropertyContext;
+import org.apache.nifi.controller.ConfigurationContext;
+import org.apache.nifi.expression.ExpressionLanguageScope;
+import org.apache.nifi.logging.ComponentLog;
+import org.apache.nifi.processor.util.StandardValidators;
+import org.apache.nifi.schema.access.SchemaAccessStrategy;
+import org.apache.nifi.schema.access.SchemaNotFoundException;
+import org.apache.nifi.schema.inference.InferSchemaAccessStrategy;
+import org.apache.nifi.schema.inference.RecordSourceFactory;
+import org.apache.nifi.schema.inference.SchemaInferenceEngine;
+import org.apache.nifi.schema.inference.SchemaInferenceUtil;
+import org.apache.nifi.schema.inference.TimeValueInference;
+import org.apache.nifi.schemaregistry.services.SchemaRegistry;
+import org.apache.nifi.serialization.DateTimeUtils;
+import org.apache.nifi.serialization.MalformedRecordException;
+import org.apache.nifi.serialization.RecordReader;
+import org.apache.nifi.serialization.RecordReaderFactory;
+import org.apache.nifi.serialization.SchemaRegistryService;
+import org.apache.nifi.serialization.record.RecordSchema;
+import org.apache.nifi.stream.io.NonCloseableInputStream;
+import org.apache.poi.ss.usermodel.Row;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.atomic.AtomicReferenceArray;
+import java.util.stream.IntStream;
+
+@Tags({"excel", "spreadsheet", "xlsx", "parse", "record", "row", "reader", 
"values", "cell"})
+@CapabilityDescription("Parses a Microsoft Excel document returning each row 
in each sheet as a separate record. "
++ "This reader allows for inferring a schema either based on the first 
line of an Excel sheet if a 'header line' is "
++ "present or from all the desired sheets, or providing an explicit 
schema "
++ "for interpreting the values. See Controller Service's Usage for 
further documentation. "
++ "This reader is currently only capable of processing .xlsx "
++ "(XSSF 2007 OOXML file format) Excel documents and not older .xls 
(HSSF '97(-2007) file format) documents.)")
+public class ExcelReader extends SchemaRegistryService implements 
RecordReaderFactory {
+
+private static final AllowableValue HEADER_DERIVED = new 
AllowableValue("excel-header-derived", "Use fields From Header",
+"The first chosen row of the Excel sheet is a header row that 
contains the columns representative of all the rows " +
+"in the desired sheets. The schema will be derived by 
using those columns in the header.");
+public static final PropertyDescriptor DESIRED_SHEETS = new 
PropertyDescriptor
+.Builder().name("extract-sheets")
+.displayName("Sheets to Extract")
+.description("Comma separated list of Excel document sheet names 
whose rows should be extracted from the excel document. If this property" +
+" is left blank then all the rows from all the sheets will 
be extracted from the Excel document. The list of names is case in-sensitive. 
Any sheets not" +
+" specified in this value will be ignored. A bulletin will 
be generated if a specified sheet(s) are not found.")
+.required(false)
+
.expressionLanguageSupported(ExpressionLanguageScope.FLOWFILE_ATTRIBUTES)
+.addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
+