[jira] [Commented] (DRILL-3178) csv reader should allow newlines inside quotes

2016-11-01 Thread Krystal (JIRA)

[ https://issues.apache.org/jira/browse/DRILL-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15626977#comment-15626977 ]

Krystal commented on DRILL-3178:


commit id: 83513daf0903e0d94fcaad7b1ae4e8ad6272b494

Using data from comment #2, verified that data gets returned as expected.

select * from `drill-3178/drill3178.csv`;
+------------------+
| columns          |
+------------------+
| ["1","line1"]    |
| ["2","line2\n"]  |
| ["3","line3"]    |
+------------------+
3 rows selected (0.158 seconds)

select columns[0], columns[1] from `drill-3178/drill3178.csv`;
+---------+---------+
| EXPR$0  | EXPR$1  |
+---------+---------+
| 1       | line1   |
| 2       | line2
  |
| 3       | line3   |
+---------+---------+

select columns[0],columns[1] from `drill-3178/drill3178.csv` where columns[0] > 1 order by columns[1] desc;
+---------+---------+
| EXPR$0  | EXPR$1  |
+---------+---------+
| 3       | line3   |
| 2       | line2
  |
+---------+---------+
2 rows selected (0.373 seconds)


> csv reader should allow newlines inside quotes 
> ---
>
>                 Key: DRILL-3178
>                 URL: https://issues.apache.org/jira/browse/DRILL-3178
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Text & CSV
>    Affects Versions: 1.0.0
>         Environment: Ubuntu Trusty 14.04.2 LTS
>            Reporter: Neal McBurnett
>            Assignee: F Méthot
>             Fix For: 1.9.0
>
>         Attachments: drill-3178.patch
>
>
> When reading a csv file which contains newlines within quoted strings, e.g. 
> via
> select * from dfs.`/tmp/q.csv`;
> Drill 1.0 says:
> Error: SYSTEM ERROR: com.univocity.parsers.common.TextParsingException:  
> Error processing input: Cannot use newline character within quoted string
> But many tools produce csv files with newlines in quoted strings.  Drill 
> should be able to handle them.
> Workaround: the csvquote program (https://github.com/dbro/csvquote) can 
> encode embedded commas and newlines, and even decode them later if desired.
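
The behaviour requested in the description above can be illustrated with a
small stand-alone sketch. This is not Drill's parser: the class and method
names are hypothetical, and it assumes a comma delimiter, a double-quote quote
character, quotes escaped by doubling, and "\n" as the record terminator. It
only shows that a newline seen while inside quotes is kept as field content
rather than treated as the end of a record.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not Drill's TextReader.
final class QuotedNewlineSketch {

    /** Splits CSV text into records of fields, keeping newlines that occur inside quotes. */
    static List<List<String>> parse(String text) {
        List<List<String>> records = new ArrayList<>();
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < text.length() && text.charAt(i + 1) == '"') {
                        field.append('"'); // a doubled quote is an escaped quote
                        i++;
                    } else {
                        inQuotes = false;  // closing quote
                    }
                } else {
                    field.append(c);       // includes '\n' inside quotes
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                fields.add(field.toString());
                field.setLength(0);
            } else if (c == '\n') {
                // An unquoted newline ends the current record.
                fields.add(field.toString());
                field.setLength(0);
                records.add(fields);
                fields = new ArrayList<>();
            } else {
                field.append(c);
            }
        }
        // Flush a final record that is not terminated by a newline.
        if (field.length() > 0 || !fields.isEmpty()) {
            fields.add(field.toString());
            records.add(fields);
        }
        return records;
    }

    public static void main(String[] args) {
        // Same shape as the three-row sample file discussed in this issue.
        String csv = "1,\"line1\"\n2,\"line2\n\"\n3,\"line3\"\n";
        System.out.println(parse(csv).size()); // prints 3
    }
}
```

With the three-row sample, the second record's second field comes back as
"line2" followed by a newline, which matches the output shown in the
verification comment.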



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-3178) csv reader should allow newlines inside quotes

2016-10-18 Thread ASF GitHub Bot (JIRA)

[ https://issues.apache.org/jira/browse/DRILL-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15587056#comment-15587056 ]

ASF GitHub Bot commented on DRILL-3178:
---

Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/593




[jira] [Commented] (DRILL-3178) csv reader should allow newlines inside quotes

2016-10-14 Thread ASF GitHub Bot (JIRA)

[ https://issues.apache.org/jira/browse/DRILL-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15577086#comment-15577086 ]

ASF GitHub Bot commented on DRILL-3178:
---

Github user parthchandra commented on the issue:

https://github.com/apache/drill/pull/593
  
+1




[jira] [Commented] (DRILL-3178) csv reader should allow newlines inside quotes

2016-10-07 Thread ASF GitHub Bot (JIRA)

[ https://issues.apache.org/jira/browse/DRILL-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15557029#comment-15557029 ]

ASF GitHub Bot commented on DRILL-3178:
---

Github user fmethot commented on the issue:

https://github.com/apache/drill/pull/593
  
Thanks for the comments so far. See my newer changes: I suggest that we
remove the flag and add an extra method instead.
nextChar no longer checks a boolean; instead, there is an extra method call
(readNext() -> readNextNoNewLineCheck()).
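
A rough sketch of that design, for readers following along. Only the method
names readNext and readNextNoNewLineCheck come from the comment above; the
class, its fields, and the "\r\n" normalization shown here are assumptions
made for illustration and are not the actual TextInput code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical, simplified stand-in for a text input class; not Drill's TextInput.
final class InputSketch {
    private final byte[] buffer;
    private int pos = 0;
    private long lineCount = 0;

    InputSketch(String text) {
        this.buffer = text.getBytes(StandardCharsets.UTF_8);
    }

    boolean hasNext() {
        return pos < buffer.length;
    }

    /** Default path: collapses "\r\n" into a single normalized '\n' and counts lines. */
    byte readNext() {
        byte b = buffer[pos++];
        if (b == '\r' && pos < buffer.length && buffer[pos] == '\n') {
            pos++;
            lineCount++;
            return '\n';
        }
        if (b == '\n') {
            lineCount++;
        }
        return b;
    }

    /** Quoted-value path: returns the raw byte with no newline handling at all. */
    byte readNextNoNewLineCheck() {
        return buffer[pos++];
    }

    long lineCount() {
        return lineCount;
    }
}
```

The caller decides which method to use (the second one only while inside a
quoted value), so the common path does not test a mode flag on every
character and there is no state to restore afterwards.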





[jira] [Commented] (DRILL-3178) csv reader should allow newlines inside quotes

2016-10-06 Thread ASF GitHub Bot (JIRA)

[ https://issues.apache.org/jira/browse/DRILL-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15553523#comment-15553523 ]

ASF GitHub Bot commented on DRILL-3178:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/593#discussion_r82303872
  
--- Diff: exec/java-exec/src/test/resources/store/text/WithQuotedCrLf.tbl ---
@@ -0,0 +1,6 @@
+"a
+1"|a|a
+a|"a
+2"|a
+a|a|"a
+3"
--- End diff --

Is there an issue with git converting Windows-style newlines (\r\n) into 
Unix-style (\n) when this file is checked in & out? Will that mess up the test? 
Should the test generate this file to handle this particular special case?
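
One way to sidestep the line-ending concern raised above would be to write the
fixture at test time with explicit "\r\n" bytes, so nothing depends on how git
checks the file out. The sketch below uses only java.nio; the class and method
names are hypothetical and not part of the pull request. It assumes every line
break in the fixture, both inside and outside quotes, is "\r\n" and that the
file ends with a line break, neither of which the diff confirms.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; not part of the pull request.
final class CrLfFixture {

    // Same rows as WithQuotedCrLf.tbl, with every line break spelled out as "\r\n".
    static final String CONTENT =
        "\"a\r\n1\"|a|a\r\n" +
        "a|\"a\r\n2\"|a\r\n" +
        "a|a|\"a\r\n3\"\r\n";

    /** Writes the fixture to a fresh temporary file and returns its path. */
    static Path write() {
        try {
            Path file = Files.createTempFile("WithQuotedCrLf", ".tbl");
            Files.write(file, CONTENT.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A test could call CrLfFixture.write() in its setup and point the query at the
returned path instead of a checked-in resource.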




[jira] [Commented] (DRILL-3178) csv reader should allow newlines inside quotes

2016-10-06 Thread ASF GitHub Bot (JIRA)

[ https://issues.apache.org/jira/browse/DRILL-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15553520#comment-15553520 ]

ASF GitHub Bot commented on DRILL-3178:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/593#discussion_r82304834
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/compliant/TextReader.java ---
@@ -231,33 +231,34 @@ private void parseQuotedValue(byte prev) throws IOException {
 final TextInput input = this.input;
 final byte quote = this.quote;
 
-ch = input.nextChar();
+try {
+  input.setMonitorForNewLine(false);
+  ch = input.nextChar();
 
-while (!(prev == quote && (ch == delimiter || ch == newLine || isWhite(ch)))) {
-  if (ch != quote) {
-if (prev == quote) { // unescaped quote detected
-  if (parseUnescapedQuotes) {
-output.append(quote);
-output.append(ch);
-parseQuotedValue(ch);
-break;
-  } else {
-throw new TextParsingException(
-context,
-"Unescaped quote character '"
-+ quote
-+ "' inside quoted value of CSV field. To allow unescaped quotes, set 'parseUnescapedQuotes' to 'true' in the CSV parser settings. Cannot parse CSV input.");
+  while (!(prev == quote && (ch == delimiter || ch == newLine || isWhite(ch)))) {
+if (ch != quote) {
+  if (prev == quote) { // unescaped quote detected
+if (parseUnescapedQuotes) {
+  output.append(quote);
+  output.append(ch);
+  parseQuotedValue(ch);
+  break;
+} else {
+  throw new TextParsingException(context, "Unescaped quote character '" + quote + "' inside quoted value of CSV field. To allow unescaped quotes, set 'parseUnescapedQuotes' to 'true' in the CSV parser settings. Cannot parse CSV input.");
+}
   }
+  output.append(ch);
+  prev = ch;
+} else if (prev == quoteEscape) {
+  output.append(quote);
+  prev = NULL_BYTE;
+} else {
+  prev = ch;
 }
-output.append(ch);
-prev = ch;
-  } else if (prev == quoteEscape) {
-output.append(quote);
-prev = NULL_BYTE;
-  } else {
-prev = ch;
+ch = input.nextChar();
   }
-  ch = input.nextChar();
+} finally {
--- End diff --

I see why it is done in finally. However, as noted above, I'm not sure that 
pushing this kind of flag into the getChar function is the optimal approach...




[jira] [Commented] (DRILL-3178) csv reader should allow newlines inside quotes

2016-10-06 Thread ASF GitHub Bot (JIRA)

[ https://issues.apache.org/jira/browse/DRILL-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15553522#comment-15553522 ]

ASF GitHub Bot commented on DRILL-3178:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/593#discussion_r82303401
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/compliant/TextReader.java ---
@@ -231,33 +231,34 @@ private void parseQuotedValue(byte prev) throws IOException {
 final TextInput input = this.input;
 final byte quote = this.quote;
 
-ch = input.nextChar();
+try {
+  input.setMonitorForNewLine(false);
--- End diff --

Seems an overly complex way to do the parsing. Is there any reason we want 
to capture the original newline character rather than the normalized one?

If we need to capture the original one, then a cleaner way to do that is to 
keep track of the start & end position of the current token (character), and 
provide a method to return that block as a string. Then, scan for a close 
quote, reading characters & special-casing any newlines.

If we want to include newlines in quoted strings sometimes, but not other 
times, then the check logic can be a bit more complex.

But, the proposed solution of making newlines not be newlines seems a bit 
odd...
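
For illustration, the position-tracking alternative described above might be
sketched as follows. Every name here is hypothetical, and the sketch assumes a
double-quote quote character with no escaped quotes inside the value. It
records where the quoted value starts, scans for the closing quote while
special-casing newlines, and returns the original bytes as a string.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the position-tracking idea; all names are illustrative.
final class QuotedScanSketch {
    private final byte[] buffer;
    private int pos;          // index of the next unread byte
    private int newlinesSeen; // '\n' bytes found inside quoted values

    /** startPos is the index of the first byte after the opening quote. */
    QuotedScanSketch(String text, int startPos) {
        this.buffer = text.getBytes(StandardCharsets.UTF_8);
        this.pos = startPos;
    }

    /** Scans to the closing quote and returns the untouched bytes before it. */
    String readQuoted() {
        int start = pos;
        while (pos < buffer.length) {
            byte b = buffer[pos];
            if (b == '"') {
                String raw = new String(buffer, start, pos - start, StandardCharsets.UTF_8);
                pos++; // step past the closing quote
                return raw;
            }
            if (b == '\n') {
                newlinesSeen++; // special case: count it, but keep the original bytes
            }
            pos++;
        }
        throw new IllegalStateException("Unterminated quoted value");
    }

    int position() {
        return pos;
    }

    int newlinesSeen() {
        return newlinesSeen;
    }
}
```

Because the value is sliced out of the buffer by position, a "\r\n" inside the
quotes survives unchanged, while the counter still lets the reader keep line
numbers accurate for error messages.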




[jira] [Commented] (DRILL-3178) csv reader should allow newlines inside quotes

2016-10-06 Thread ASF GitHub Bot (JIRA)

[ https://issues.apache.org/jira/browse/DRILL-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15553521#comment-15553521 ]

ASF GitHub Bot commented on DRILL-3178:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/593#discussion_r82296690
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/compliant/TextInput.java ---
@@ -88,6 +88,11 @@
   private boolean endFound = false;
 
   /**
+   * Switch for enabling/disabling new line detection
--- End diff --

Explain a bit more? Presumably, we already "monitor" and "detect" new lines
in some way. What, specifically, does this add? Presumably, it sets the mode to
enable new line detection within quotes (the title of the Jira entry)?




[jira] [Commented] (DRILL-3178) csv reader should allow newlines inside quotes

2016-09-02 Thread F Méthot (JIRA)

[ https://issues.apache.org/jira/browse/DRILL-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15458600#comment-15458600 ]

F Méthot commented on DRILL-3178:
-

With 1.7 build, for this file:

> cat data/3428.csv
1,"line1"
2,"line2
"
3,"line3"

I get:

> select * from my_dfs.`/root/data/3428.csv`;
Error: DATA_READ ERROR: Error processing input: Cannot use newline character within quoted string, line=3, char=22. Content parsed: [ ]

Failure while reading file file:///root/data/3428.csv. Happened at or shortly before byte position 22.
Fragment 0:0

[Error Id: 49a05427-e763-4cca-9f97-e4b4308ecb75 on perfnode206.perf.lab:31010] (state=,code=0)





[jira] [Commented] (DRILL-3178) csv reader should allow newlines inside quotes

2016-03-29 Thread Daniel Reznick (JIRA)

[ https://issues.apache.org/jira/browse/DRILL-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216384#comment-15216384 ]

Daniel Reznick commented on DRILL-3178:
---

As Drill is meant for working with data in place, having to pre-process files
before using them with Drill is counter-productive. Drill should work hard to
read data as is when possible and, as noted, many other tools both read and
write delimited content with newlines in quoted fields.
