[jira] [Commented] (NIFI-4496) Improve performance of CSVReader

2017-12-22 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301480#comment-16301480
 ] 

ASF subversion and git services commented on NIFI-4496:
---

Commit 14d2291db87d8ea160f538c10de31ac69fc996ae in nifi's branch 
refs/heads/master from [~ca9mbu]
[ https://git-wip-us.apache.org/repos/asf?p=nifi.git;h=14d2291 ]

NIFI-4496: Added JacksonCSVRecordReader to allow choice of CSV parser. This 
closes #2245.


> Improve performance of CSVReader
> 
>
> Key: NIFI-4496
> URL: https://issues.apache.org/jira/browse/NIFI-4496
> Project: Apache NiFi
>  Issue Type: Improvement
>  Components: Extensions
>Reporter: Matt Burgess
>Assignee: Matt Burgess
> Fix For: 1.5.0
>
>
> During some throughput testing, it was noted that the CSVReader was not as 
> fast as desired, processing less than 50k records per second. A look at [this 
> benchmark|https://github.com/uniVocity/csv-parsers-comparison] implies that 
> the Apache Commons CSV parser (used by CSVReader) is quite slow compared to 
> others.
> From that benchmark it appears that CSVReader could be enhanced by using a 
> different CSV parser under the hood. Perhaps Jackson is the best choice, as 
> it is fast when values are quoted, and is a mature and maintained codebase.
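
For reference, the records-per-second figure quoted in the description is simple arithmetic over a timed run. The sketch below is illustrative only: `ThroughputSketch`, `recordsPerSecond`, and `countRecords` are hypothetical names, and the naive `split` stands in for a real CSV parser (it does not handle quoting), so the number it prints says nothing about Commons CSV or Jackson.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical harness: shows how a records-per-second figure is derived from a timed run.
class ThroughputSketch {

    // Records per second for a run that handled `records` rows in `elapsedNanos` nanoseconds.
    static double recordsPerSecond(long records, long elapsedNanos) {
        return records / (elapsedNanos / (double) TimeUnit.SECONDS.toNanos(1));
    }

    // Naive stand-in for a parser: splits on newlines and commas, counts non-empty rows.
    static long countRecords(String csv) {
        long count = 0;
        for (String line : csv.split("\n")) {
            if (!line.isEmpty() && line.split(",").length > 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Build 100,000 two-column rows in memory as sample input.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100_000; i++) {
            sb.append(i).append(",name").append(i).append('\n');
        }
        long start = System.nanoTime();
        long records = countRecords(sb.toString());
        long elapsed = System.nanoTime() - start;
        System.out.printf("%d records, %.0f records/s%n", records, recordsPerSecond(records, elapsed));
    }
}
```

By this arithmetic, 100,000 records in two seconds is exactly the 50k records per second mentioned above.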



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NIFI-4496) Improve performance of CSVReader

2017-12-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301481#comment-16301481
 ] 

ASF GitHub Bot commented on NIFI-4496:
--

Github user asfgit closed the pull request at:

https://github.com/apache/nifi/pull/2245




[jira] [Commented] (NIFI-4496) Improve performance of CSVReader

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298635#comment-16298635
 ] 

ASF GitHub Bot commented on NIFI-4496:
--

Github user mattyb149 commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2245#discussion_r158051846
  
--- Diff: 
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/csv/JacksonCSVRecordReader.java
 ---
@@ -0,0 +1,257 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nifi.csv;
+
+import com.fasterxml.jackson.databind.MappingIterator;
+import com.fasterxml.jackson.databind.ObjectReader;
+import com.fasterxml.jackson.dataformat.csv.CsvMapper;
+import com.fasterxml.jackson.dataformat.csv.CsvParser;
+import com.fasterxml.jackson.dataformat.csv.CsvSchema;
+import org.apache.commons.csv.CSVFormat;
+import org.apache.commons.io.input.BOMInputStream;
+import org.apache.commons.lang3.CharUtils;
+import org.apache.commons.lang3.StringUtils;
+import org.apache.nifi.logging.ComponentLog;
+import org.apache.nifi.serialization.MalformedRecordException;
+import org.apache.nifi.serialization.RecordReader;
+import org.apache.nifi.serialization.record.DataType;
+import org.apache.nifi.serialization.record.MapRecord;
+import org.apache.nifi.serialization.record.Record;
+import org.apache.nifi.serialization.record.RecordSchema;
+import org.apache.nifi.serialization.record.util.DataTypeUtils;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.Reader;
+import java.text.DateFormat;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Optional;
+import java.util.function.Supplier;
+
+
+public class JacksonCSVRecordReader implements RecordReader {
+private final RecordSchema schema;
+
+private final Supplier<DateFormat> LAZY_DATE_FORMAT;
+private final Supplier<DateFormat> LAZY_TIME_FORMAT;
+private final Supplier<DateFormat> LAZY_TIMESTAMP_FORMAT;
+
+private final ComponentLog logger;
+private final boolean hasHeader;
+private final boolean ignoreHeader;
+private final MappingIterator<String[]> recordStream;
+private List<String> rawFieldNames = null;
+
+private volatile static CsvMapper mapper = new 
CsvMapper().enable(CsvParser.Feature.WRAP_AS_ARRAY);
+
+public JacksonCSVRecordReader(final InputStream in, final ComponentLog 
logger, final RecordSchema schema, final CSVFormat csvFormat, final boolean 
hasHeader, final boolean ignoreHeader,
+  final String dateFormat, final String 
timeFormat, final String timestampFormat, final String encoding) throws 
IOException {
+
+this.schema = schema;
+this.logger = logger;
+this.hasHeader = hasHeader;
+this.ignoreHeader = ignoreHeader;
+final DateFormat df = dateFormat == null ? null : 
DataTypeUtils.getDateFormat(dateFormat);
+final DateFormat tf = timeFormat == null ? null : 
DataTypeUtils.getDateFormat(timeFormat);
+final DateFormat tsf = timestampFormat == null ? null : 
DataTypeUtils.getDateFormat(timestampFormat);
+
+LAZY_DATE_FORMAT = () -> df;
+LAZY_TIME_FORMAT = () -> tf;
+LAZY_TIMESTAMP_FORMAT = () -> tsf;
+
+final Reader reader = new InputStreamReader(new 
BOMInputStream(in));
+
+CsvSchema.Builder csvSchemaBuilder = CsvSchema.builder()
+.setColumnSeparator(csvFormat.getDelimiter())
+.setLineSeparator(csvFormat.getRecordSeparator())
+// Can only use comments in Jackson CSV if the correct 
marker is set
+.setAllowComments("#" 

[jira] [Commented] (NIFI-4496) Improve performance of CSVReader

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298633#comment-16298633
 ] 

ASF GitHub Bot commented on NIFI-4496:
--

Github user markap14 commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2245#discussion_r158051319
  
--- Diff: 
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/csv/JacksonCSVRecordReader.java
 ---
@@ -0,0 +1,257 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nifi.csv;
+
+import com.fasterxml.jackson.databind.MappingIterator;
+import com.fasterxml.jackson.databind.ObjectReader;
+import com.fasterxml.jackson.dataformat.csv.CsvMapper;
+import com.fasterxml.jackson.dataformat.csv.CsvParser;
+import com.fasterxml.jackson.dataformat.csv.CsvSchema;
+import org.apache.commons.csv.CSVFormat;
+import org.apache.commons.io.input.BOMInputStream;
+import org.apache.commons.lang3.CharUtils;
+import org.apache.commons.lang3.StringUtils;
+import org.apache.nifi.logging.ComponentLog;
+import org.apache.nifi.serialization.MalformedRecordException;
+import org.apache.nifi.serialization.RecordReader;
+import org.apache.nifi.serialization.record.DataType;
+import org.apache.nifi.serialization.record.MapRecord;
+import org.apache.nifi.serialization.record.Record;
+import org.apache.nifi.serialization.record.RecordSchema;
+import org.apache.nifi.serialization.record.util.DataTypeUtils;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.Reader;
+import java.text.DateFormat;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Optional;
+import java.util.function.Supplier;
+
+
+public class JacksonCSVRecordReader implements RecordReader {
+private final RecordSchema schema;
+
+private final Supplier<DateFormat> LAZY_DATE_FORMAT;
+private final Supplier<DateFormat> LAZY_TIME_FORMAT;
+private final Supplier<DateFormat> LAZY_TIMESTAMP_FORMAT;
+
+private final ComponentLog logger;
+private final boolean hasHeader;
+private final boolean ignoreHeader;
+private final MappingIterator<String[]> recordStream;
+private List<String> rawFieldNames = null;
+
+private volatile static CsvMapper mapper = new 
CsvMapper().enable(CsvParser.Feature.WRAP_AS_ARRAY);
+
+public JacksonCSVRecordReader(final InputStream in, final ComponentLog 
logger, final RecordSchema schema, final CSVFormat csvFormat, final boolean 
hasHeader, final boolean ignoreHeader,
+  final String dateFormat, final String 
timeFormat, final String timestampFormat, final String encoding) throws 
IOException {
+
+this.schema = schema;
+this.logger = logger;
+this.hasHeader = hasHeader;
+this.ignoreHeader = ignoreHeader;
+final DateFormat df = dateFormat == null ? null : 
DataTypeUtils.getDateFormat(dateFormat);
+final DateFormat tf = timeFormat == null ? null : 
DataTypeUtils.getDateFormat(timeFormat);
+final DateFormat tsf = timestampFormat == null ? null : 
DataTypeUtils.getDateFormat(timestampFormat);
+
+LAZY_DATE_FORMAT = () -> df;
+LAZY_TIME_FORMAT = () -> tf;
+LAZY_TIMESTAMP_FORMAT = () -> tsf;
+
+final Reader reader = new InputStreamReader(new 
BOMInputStream(in));
+
+CsvSchema.Builder csvSchemaBuilder = CsvSchema.builder()
+.setColumnSeparator(csvFormat.getDelimiter())
+.setLineSeparator(csvFormat.getRecordSeparator())
+// Can only use comments in Jackson CSV if the correct 
marker is set
+.setAllowComments("#" 

[jira] [Commented] (NIFI-4496) Improve performance of CSVReader

2017-12-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292927#comment-16292927
 ] 

ASF GitHub Bot commented on NIFI-4496:
--

Github user mattyb149 commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2245#discussion_r157261623
  
--- Diff: 
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/csv/JacksonCSVRecordReader.java
 ---
@@ -136,7 +134,7 @@ public Record nextRecord(final boolean coerceTypes, 
final boolean dropUnknownFie
 
 // If the first record is the header names (and we're using 
them), store those off for use in creating the value map on the next iterations
 if (rawFieldNames == null) {
-if (hasHeader && ignoreHeader) {
+if (!hasHeader || ignoreHeader) {
 rawFieldNames = schema.getFieldNames();
 } else {
 rawFieldNames = Arrays.stream(csvRecord).map((a) -> {
--- End diff --

Who knows lol. I'll try asList() instead




[jira] [Commented] (NIFI-4496) Improve performance of CSVReader

2017-12-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292922#comment-16292922
 ] 

ASF GitHub Bot commented on NIFI-4496:
--

Github user markap14 commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2245#discussion_r157261143
  
--- Diff: 
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/csv/JacksonCSVRecordReader.java
 ---
@@ -136,7 +134,7 @@ public Record nextRecord(final boolean coerceTypes, 
final boolean dropUnknownFie
 
 // If the first record is the header names (and we're using 
them), store those off for use in creating the value map on the next iterations
 if (rawFieldNames == null) {
-if (hasHeader && ignoreHeader) {
+if (!hasHeader || ignoreHeader) {
 rawFieldNames = schema.getFieldNames();
 } else {
 rawFieldNames = Arrays.stream(csvRecord).map((a) -> {
--- End diff --

I'm not sure that I understand the logic here... was this perhaps due to 
some refactoring and got overlooked, or is this actually doing something that's 
just not obvious to me? Seems this could just be done as 
`Arrays.asList(csvRecord)`
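
The diff excerpt cuts off before the lambda body, so what the original `map` call did is not visible here. Purely as an illustration of the suggestion, the sketch below assumes an identity mapping and shows that it yields the same list contents as `Arrays.asList`; the class name `HeaderListDemo` and its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical comparison of the two ways to turn a parsed header row into a list.
class HeaderListDemo {

    // Assumed shape of the original: stream the row and map each element to itself.
    static List<String> viaStream(String[] csvRecord) {
        return Arrays.stream(csvRecord).map(a -> a).collect(Collectors.toList());
    }

    // The reviewer's suggestion: wrap the array directly.
    static List<String> viaAsList(String[] csvRecord) {
        return Arrays.asList(csvRecord);
    }

    public static void main(String[] args) {
        String[] csvRecord = {"id", "name", "balance"};
        // Both lists hold the same elements in the same order.
        System.out.println(viaStream(csvRecord).equals(viaAsList(csvRecord)));
    }
}
```

One difference worth keeping in mind: `Arrays.asList` returns a fixed-size view backed by the array, so it cannot be grown and later writes to the array show through, whereas the stream version copies the elements.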




[jira] [Commented] (NIFI-4496) Improve performance of CSVReader

2017-12-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292819#comment-16292819
 ] 

ASF GitHub Bot commented on NIFI-4496:
--

Github user mattyb149 commented on the issue:

https://github.com/apache/nifi/pull/2245
  
@jdye64 I think I fixed the issue you were seeing. We have to do most of 
the schema resolution/management manually; Jackson's methods for handling that 
don't seem to work for what we need. So I removed the setting of column names 
on the parser, because having the column names set made the parser expect an 
actual array with [] surrounding the line (weird, right?). Then for files 
without headers, I needed to make sure we used the schema field names, so I 
had to adjust the logic where "rawFieldNames" is generated. Mind taking a look 
at this latest version? Please and thanks!
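
The header-handling rule described in this comment can be sketched in plain Java. This is a minimal illustration, not the code from the PR: `FieldNameResolver` and `resolveFieldNames` are hypothetical names, and the schema is reduced to a list of strings. It assumes the condition shown in the diff elsewhere in this thread (`!hasHeader || ignoreHeader`): the schema's field names are used when the file has no header or the header is ignored, and the first row supplies the names otherwise.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating the rawFieldNames decision discussed above.
class FieldNameResolver {

    // schemaFieldNames stands in for schema.getFieldNames(); firstRow is the first parsed CSV row.
    static List<String> resolveFieldNames(boolean hasHeader, boolean ignoreHeader,
                                          List<String> schemaFieldNames, String[] firstRow) {
        if (!hasHeader || ignoreHeader) {
            // No header line, or a header that should be ignored: fall back to the schema.
            return schemaFieldNames;
        }
        // Header present and in use: the first row supplies the column names.
        return Arrays.asList(firstRow);
    }

    public static void main(String[] args) {
        List<String> schema = Arrays.asList("id", "name");
        String[] row = {"ID", "NAME"};
        System.out.println(resolveFieldNames(true, false, schema, row));  // header row is used
        System.out.println(resolveFieldNames(false, false, schema, row)); // schema names are used
        System.out.println(resolveFieldNames(true, true, schema, row));   // schema names are used
    }
}
```

Under the earlier condition (`hasHeader && ignoreHeader`), a file without a header would have fallen into the second branch and taken its first data row as column names, which is the case the comment says needed adjusting.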




[jira] [Commented] (NIFI-4496) Improve performance of CSVReader

2017-11-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242089#comment-16242089
 ] 

ASF GitHub Bot commented on NIFI-4496:
--

Github user jdye64 commented on the issue:

https://github.com/apache/nifi/pull/2245
  
@mattyb149 I'm seeing invalid output when I run an existing flow with 
this PR. I had an existing flow that used ConvertRecord and Apache Commons CSV. 
That was working fine and giving me the output I expected. However, when I 
switched to using the Jackson implementation, all of the output was empty. I 
have attached a screenshot from my debugger session in hopes that it will help 
shed some light on what is going on.

https://user-images.githubusercontent.com/2127235/32498256-32f8ffc6-c39d-11e7-86dd-cde8f7d3a758.png





[jira] [Commented] (NIFI-4496) Improve performance of CSVReader

2017-11-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16234547#comment-16234547
 ] 

ASF GitHub Bot commented on NIFI-4496:
--

Github user andrewmlim commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2245#discussion_r148347696
  
--- Diff: 
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/csv/CSVReader.java
 ---
@@ -54,6 +54,26 @@
 "The first non-comment line of the CSV file is a header line that 
contains the names of the columns. The schema will be derived by using the "
 + "column names in the header and assuming that all columns 
are of type String.");
 
+// CSV parsers
+public static final AllowableValue APACHE_COMMONS_CSV = new 
AllowableValue("commons-csv", "Apache Commons CSV",
+"The CSV parser implementation from the Apache Commons CSV 
library.");
+
+public static final AllowableValue JACKSON_CSV = new 
AllowableValue("jackson-csv", "Jackson CSV",
+"The CSV parser implementation from the Jackson Dataformats 
library");
+
+
+public static final PropertyDescriptor CSV_PARSER = new 
PropertyDescriptor.Builder()
+.name("csv-reader-csv-parser")
+.displayName("CSV Parser")
+.description("Specifies which parser to use to read CSV 
records. NOTE: Different parsers may support different subsets of 
functionality, "
++ "and/or exhibit different levels of performance.")
--- End diff --

Suggest changing the NOTE to:

Different parsers may support different subsets of functionality and may 
also exhibit different levels of performance.
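
To make the property's role concrete, here is a rough sketch of resolving the configured value to one of the two parsers. It deliberately avoids NiFi's `AllowableValue` and `PropertyDescriptor` classes; the enum `CsvParserChoice`, its `fromValue` method, and the demo class are hypothetical, and only the value strings and display names are taken from the diff above.

```java
// Hypothetical stand-in for the two allowable values shown in the diff;
// this is not NiFi's AllowableValue or PropertyDescriptor API.
enum CsvParserChoice {
    COMMONS_CSV("commons-csv", "Apache Commons CSV"),
    JACKSON_CSV("jackson-csv", "Jackson CSV");

    final String value;
    final String displayName;

    CsvParserChoice(String value, String displayName) {
        this.value = value;
        this.displayName = displayName;
    }

    // Resolve a configured property value to a parser choice, rejecting unknown values.
    static CsvParserChoice fromValue(String configured) {
        for (CsvParserChoice choice : values()) {
            if (choice.value.equals(configured)) {
                return choice;
            }
        }
        throw new IllegalArgumentException("Unknown CSV parser: " + configured);
    }
}

class CsvParserChoiceDemo {
    public static void main(String[] args) {
        // A reader factory could branch on the resolved choice to build the matching record reader.
        System.out.println(CsvParserChoice.fromValue("jackson-csv").displayName);
    }
}
```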




[jira] [Commented] (NIFI-4496) Improve performance of CSVReader

2017-11-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16234544#comment-16234544
 ] 

ASF GitHub Bot commented on NIFI-4496:
--

Github user andrewmlim commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2245#discussion_r148347427
  
--- Diff: 
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/csv/CSVReader.java
 ---
@@ -54,6 +54,26 @@
 "The first non-comment line of the CSV file is a header line that 
contains the names of the columns. The schema will be derived by using the "
 + "column names in the header and assuming that all columns 
are of type String.");
 
+// CSV parsers
+public static final AllowableValue APACHE_COMMONS_CSV = new 
AllowableValue("commons-csv", "Apache Commons CSV",
+"The CSV parser implementation from the Apache Commons CSV 
library.");
+
+public static final AllowableValue JACKSON_CSV = new 
AllowableValue("jackson-csv", "Jackson CSV",
+"The CSV parser implementation from the Jackson Dataformats 
library");
--- End diff --

Need a period (.) after library to be consistent.




[jira] [Commented] (NIFI-4496) Improve performance of CSVReader

2017-11-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16234252#comment-16234252
 ] 

ASF GitHub Bot commented on NIFI-4496:
--

GitHub user mattyb149 opened a pull request:

https://github.com/apache/nifi/pull/2245

NIFI-4496: Added JacksonCSVRecordReader to allow choice of CSV parser

Thank you for submitting a contribution to Apache NiFi.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

### For all changes:
- [x] Is there a JIRA ticket associated with this PR? Is it referenced 
 in the commit message?

- [x] Does your PR title start with NIFI-XXXX where XXXX is the JIRA number 
you are trying to resolve? Pay particular attention to the hyphen "-" character.

- [x] Has your PR been rebased against the latest commit within the target 
branch (typically master)?

- [x] Is your initial contribution a single, squashed commit?

### For code changes:
- [x] Have you ensured that the full suite of tests is executed via mvn 
-Pcontrib-check clean install at the root nifi folder?
- [x] Have you written or updated unit tests to verify your changes?
- [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
- [ ] If applicable, have you updated the LICENSE file, including the main 
LICENSE file under nifi-assembly?
- [ ] If applicable, have you updated the NOTICE file, including the main 
NOTICE file found under nifi-assembly?
- [x] If adding new Properties, have you added .displayName in addition to 
.name (programmatic access) for each of the new properties?

### For documentation related changes:
- [x] Have you ensured that format looks appropriate for the output in 
which it is rendered?

### Note:
Please ensure that once the PR is submitted, you check travis-ci for build 
issues and submit an update to your PR as soon as possible.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mattyb149/nifi NIFI-4496

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nifi/pull/2245.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2245


commit 15040f4f67a785ab16894992ffeca7d7847f62f1
Author: Matthew Burgess 
Date:   2017-11-01T15:50:06Z

NIFI-4496: Added JacksonCSVRecordReader to allow choice of CSV parser




> Improve performance of CSVReader
> 
>
> Key: NIFI-4496
> URL: https://issues.apache.org/jira/browse/NIFI-4496
> Project: Apache NiFi
>  Issue Type: Improvement
>  Components: Extensions
>Reporter: Matt Burgess
>Assignee: Matt Burgess
>Priority: Major
>
> During some throughput testing, it was noted that the CSVReader was not as 
> fast as desired, processing less than 50k records per second. A look at [this 
> benchmark|https://github.com/uniVocity/csv-parsers-comparison] implies that 
> the Apache Commons CSV parser (used by CSVReader) is quite slow compared to 
> others.
> From that benchmark it appears that CSVReader could be enhanced by using a 
> different CSV parser under the hood. Perhaps Jackson is the best choice, as 
> it is fast when values are quoted, and is a mature and maintained codebase.


