[GitHub] flink issue #2060: [FLINK-3921] StringParser encoding

2016-11-23 Thread greghogan
Github user greghogan commented on the issue:

https://github.com/apache/flink/pull/2060
  
I am working on merging this ..


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink issue #2060: [FLINK-3921] StringParser encoding

2016-11-21 Thread rekhajoshm
Github user rekhajoshm commented on the issue:

https://github.com/apache/flink/pull/2060
  
Agree @greghogan , we started this just for StringParser encoding :-)
I am in tight schedule, however did a quick update for comments, please 
have a look.thank you.

- Removed unused methods on setCommentPrefix
- Added charset functionality for user to set charset encoding on CsvReader
- Updated test 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink issue #2060: [FLINK-3921] StringParser encoding

2016-11-16 Thread greghogan
Github user greghogan commented on the issue:

https://github.com/apache/flink/pull/2060
  
@rekhajoshm have you had a chance to look again at this PR? The 1.2 feature 
freeze will be in a few weeks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink issue #2060: [FLINK-3921] StringParser encoding

2016-09-20 Thread greghogan
Github user greghogan commented on the issue:

https://github.com/apache/flink/pull/2060
  
Apologies for the long delay. I'd like to attempt to summarize this ticket 
and pull request to validate my understanding.

Previously StringParser was using the system encoding and 
`GenericCsvInputFormat` was using UTF-8 for the delimiter and an overloadable 
UTF-8 for the comment prefix.

StringParser's quoteCharacter remains a `byte` with no encoding.

Now GenericCsvInputFormat can be configured with a charset which is used 
for the delimiter, comment prefix, and field parsers (only used in 
StringParser).

Should `setCommentPrefix(String commentPrefix, Charset charset)` and 
`setCommentPrefix(String commentPrefix, String charsetName)` be removed from 
`GenericCsvInputFormat`? Would different encodings be used on the same file?

Allow the user to set the character encoding in `CsvReader` which would be 
applied in `CsvReader.configureInputFormat`?

Are the new tests checking the encoding? The test strings are using using 
characters common to UTF-8 and ASCII. We could instead use one of the UTF-16 
encodings from 
https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink issue #2060: [FLINK-3921] StringParser encoding

2016-09-09 Thread rekhajoshm
Github user rekhajoshm commented on the issue:

https://github.com/apache/flink/pull/2060
  
Thanks @greghogan for your review. Please have a look. thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink issue #2060: [FLINK-3921] StringParser encoding

2016-09-08 Thread rekhajoshm
Github user rekhajoshm commented on the issue:

https://github.com/apache/flink/pull/2060
  
Hi @greghogan sorry, lost thread on this, updated for Preconditions check 
now. Unless i am missing something, the enableQuotedStringParsing() is present 
in subclasses as it is slightly different than the parent and involves 
if-else-casting to be invoked.But in case of setCharset it is same for all 
possible subclasses and does not need to be.Hope you agree.Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink issue #2060: [FLINK-3921] StringParser encoding

2016-09-02 Thread greghogan
Github user greghogan commented on the issue:

https://github.com/apache/flink/pull/2060
  
Hi @rekhajoshm have you had a chance to look at the last two comments?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink issue #2060: [FLINK-3921] StringParser encoding

2016-07-18 Thread greghogan
Github user greghogan commented on the issue:

https://github.com/apache/flink/pull/2060
  
`StringParser` and `StringValueParser` have the method 
`enableQuotedStringParsing` which is called by `GenericCsvInputFormat.open` 
only for these types. We should consider doing the same for `setCharset` rather 
than adding it to `FieldParser`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink issue #2060: [FLINK-3921] StringParser encoding

2016-06-13 Thread rekhajoshm
Github user rekhajoshm commented on the issue:

https://github.com/apache/flink/pull/2060
  
Sorry for delay, got busy. 
Thank you @StephanEwen for your review.updated.Please have a look.thanks cc 
@greghogan 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink issue #2060: [FLINK-3921] StringParser encoding

2016-06-07 Thread StephanEwen
Github user StephanEwen commented on the issue:

https://github.com/apache/flink/pull/2060
  
Let's make this configurable on `GenericCsvInputFormat`, with the default 
of "UTF-8" (this is what we use in other places as the default, too).

The charset affects the
  - delimiter
  - comments
  - parsers

I think the `FieldParser` should have a `setCharset()` method. That way, we 
need not pass the charset to every method call.

This would also need a test.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink issue #2060: [FLINK-3921] StringParser encoding

2016-06-04 Thread rekhajoshm
Github user rekhajoshm commented on the issue:

https://github.com/apache/flink/pull/2060
  
Thanks @greghogan for your inputs. updated. Please have a look. 
Please note though the configurable Charset will have no impact on 
StringValueParser(or other parsers) except StringParser.In SVParser we 
explicitly call setValueAscii(), and hence works with ascii only. @StephanEwen 
@fhueske 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink issue #2060: [FLINK-3921] StringParser encoding

2016-06-03 Thread greghogan
Github user greghogan commented on the issue:

https://github.com/apache/flink/pull/2060
  
The only internal usage of `StringParser` is from `GenericCsvInputFormat`. 
Should we make the encoding configurable in `GenericCsvInputFormat` with a 
default of US-ASCII? This could then be overridden in an additional constructor 
of `StringParser`.

Should the same changes be made to `StringValueParser`?

@rekhajoshm @fhueske @StephanEwen :+1: :-1:?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---