[jira] [Updated] (CSV-196) Store the info of whether a field is enclosed by quotes

2016-09-24 Thread Benedikt Ritter (JIRA)

 [ 
https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedikt Ritter updated CSV-196:

Fix Version/s: Patch Needed

> Store the info of whether a field is enclosed by quotes
> ---
>
> Key: CSV-196
> URL: https://issues.apache.org/jira/browse/CSV-196
> Project: Commons CSV
>  Issue Type: Improvement
>  Components: Parser
>Affects Versions: 1.4
>Reporter: Matt Sun
>  Labels: easyfix, features, patch
> Fix For: Patch Needed
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> It would be good for the CSVParser class to record whether a field 
> was enclosed by quotes in the original source file.
> For example, for this data sample:
> A, B, C
> a1, "b1", c1
> CSVParser gives us the record a1, b1, c1. This is helpful because the 
> double quotes are parsed away, but the information about the original data 
> is lost at the same time: we can't tell from the returned CSVRecord whether 
> the original field was enclosed by double quotes or not.
> In our use case, we are integrating Apache Hadoop APIs with Commons CSV. CSV 
> is one kind of input for Hadoop jobs, which should support splitting input 
> data. To accurately split a CSV file into pieces, we need to count the bytes 
> of data that CSVParser actually read. CSVParser has no accurate information 
> about whether a field was enclosed by quotes, nor does it store the raw data 
> of the original source, so downstream users of CSVParser cannot get this 
> information.
> Suggested fix: extend the token/CSVRecord with a boolean field 
> indicating whether the column was enclosed by quotes. While Lexer is doing 
> getNextToken, set the flag if a field is encapsulated and successfully parsed.
> I found another reported issue with a similar request, but it was marked as 
> resolved: [CSV-91] 
> https://issues.apache.org/jira/browse/CSV-91?jql=project%20%3D%20CSV%20AND%20text%20~%20%22with%20quotes%22
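
The suggested fix quoted above could look roughly like the following Java
sketch. It is not Commons CSV code: the class QuotedFieldSketch, the Field
type and parseLine are hypothetical names used only for illustration, and the
sketch assumes a comma delimiter, a double-quote character escaped by
doubling, and that spaces around a field are ignored.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of Commons CSV. */
final class QuotedFieldSketch {

    /** One parsed field plus whether it was enclosed by quotes in the source. */
    static final class Field {
        final String value;
        final boolean quoted;

        Field(String value, boolean quoted) {
            this.value = value;
            this.quoted = quoted;
        }
    }

    /** Splits one line on commas, honoring double quotes and "" escapes. */
    static List<Field> parseLine(String line) {
        List<Field> fields = new ArrayList<>();
        final int n = line.length();
        int i = 0;
        while (true) {
            // Assumption: spaces before a field are ignored.
            while (i < n && line.charAt(i) == ' ') {
                i++;
            }
            StringBuilder sb = new StringBuilder();
            boolean quoted = false;
            if (i < n && line.charAt(i) == '"') {
                i++; // skip the opening quote
                boolean closed = false;
                while (i < n) {
                    char c = line.charAt(i);
                    if (c == '"' && i + 1 < n && line.charAt(i + 1) == '"') {
                        sb.append('"'); // doubled quote is an escaped quote
                        i += 2;
                    } else if (c == '"') {
                        i++; // closing quote
                        closed = true;
                        break;
                    } else {
                        sb.append(c);
                        i++;
                    }
                }
                if (!closed) {
                    throw new IllegalArgumentException("unterminated quoted field");
                }
                // Set the flag only once the encapsulated field parsed successfully.
                quoted = true;
                while (i < n && line.charAt(i) == ' ') {
                    i++; // ignore spaces between the closing quote and the delimiter
                }
                if (i < n && line.charAt(i) != ',') {
                    throw new IllegalArgumentException("unexpected character after closing quote");
                }
            } else {
                while (i < n && line.charAt(i) != ',') {
                    sb.append(line.charAt(i));
                    i++;
                }
                // Assumption: trailing spaces of an unquoted field are ignored.
                int end = sb.length();
                while (end > 0 && sb.charAt(end - 1) == ' ') {
                    end--;
                }
                sb.setLength(end);
            }
            fields.add(new Field(sb.toString(), quoted));
            if (i >= n) {
                break;
            }
            i++; // skip the delimiter
        }
        return fields;
    }
}
```

For the sample row a1, "b1", c1, parseLine returns three fields with the
values a1, b1 and c1, and only the second one has quoted set.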



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CSV-196) Store the info of whether a field is enclosed by quotes

2016-09-22 Thread Matt Sun (JIRA)

 [ 
https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Sun updated CSV-196:
-
Description: 
It would be good for the CSVParser class to record whether a field 
was enclosed by quotes in the original source file.
For example, for this data sample:

A, B, C
a1, "b1", c1

CSVParser gives us the record a1, b1, c1. This is helpful because the double 
quotes are parsed away, but the information about the original data is lost at 
the same time: we can't tell from the returned CSVRecord whether the original 
field was enclosed by double quotes or not.

In our use case, we are integrating Apache Hadoop APIs with Commons CSV. CSV 
is one kind of input for Hadoop jobs, which should support splitting input data. 
To accurately split a CSV file into pieces, we need to count the bytes of data 
that CSVParser actually read. CSVParser has no accurate information about whether 
a field was enclosed by quotes, nor does it store the raw data of the original 
source, so downstream users of CSVParser cannot get this information.

Suggested fix: extend the token/CSVRecord with a boolean field indicating 
whether the column was enclosed by quotes. While Lexer is doing getNextToken, 
set the flag if a field is encapsulated and successfully parsed.

I found another reported issue with a similar request, but it was marked as 
resolved: [CSV-91] 
https://issues.apache.org/jira/browse/CSV-91?jql=project%20%3D%20CSV%20AND%20text%20~%20%22with%20quotes%22
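
To illustrate why the byte-counting use case above needs the quote flag, here
is a minimal Java sketch that reconstructs the encoded length of one record
from its parsed values. It is not Commons CSV or Hadoop code:
RecordByteCountSketch and encodedLength are hypothetical names, and the sketch
assumes UTF-8 input, a one-byte comma delimiter, double quotes escaped by
doubling, and no whitespace outside the fields.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration only; not part of Commons CSV or Hadoop. */
final class RecordByteCountSketch {

    /**
     * Estimates how many bytes one record occupied in the source, given the
     * parsed values and a per-field flag saying whether it was quoted.
     */
    static long encodedLength(String[] values, boolean[] quoted, int lineTerminatorBytes) {
        if (values.length != quoted.length) {
            throw new IllegalArgumentException("values and quoted must have the same length");
        }
        long total = 0;
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                total += 1; // one byte for the ',' delimiter
            }
            String v = values[i];
            total += v.getBytes(StandardCharsets.UTF_8).length;
            if (quoted[i]) {
                total += 2; // the enclosing pair of quotes
                for (int k = 0; k < v.length(); k++) {
                    if (v.charAt(k) == '"') {
                        total += 1; // each embedded quote was doubled in the source
                    }
                }
            }
        }
        return total + lineTerminatorBytes;
    }
}
```

For the source line a1,"b1",c1 followed by a one-byte line terminator, the
values {a1, b1, c1} with flags {false, true, false} give 11 bytes. If the flag
is lost and every field is treated as unquoted, the same values give 9, so the
split offsets drift by two bytes for each quoted field.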

  was:
It will be good to have CSVParser class to store the info of whether a field 
was enclosed by quotes in the original source file.
For example, for this data sample:

A, B, C
a1, "b1", c1

CSVParser gives us record a1, b1, c1, which is helpful because it parsed double 
quotes, but we also lost the information of original data at the same time. We 
can tell from the CSVRecord returned whether the original data is enclosed by 
double quotes or not.

In our use case, we are integrating Apache Hadoop APIs with Commons CSV.  CSV 
is one kind of input of Hadoop, which should splitting input data. To 
accurately split a CSV file into pieces, the program needs to count the bytes 
of  data CSVParser actually read. CSVParser doesn't have accurate information 
of whether a field was enclosed by quotes, neither does it store raw data of 
the original source. Downstream users of commons CSVParser is not able to get 
those info.

To suggest a fix: Extend the token/CSVRecord to have a field indicating whether 
the column was enclosed by quotes. While Lexer is doing getNextToken, set the 
flag if a field is encapsulated and successfully parsed.

I find another issue reported, but it was marked as resolved: [CSV91] 
https://issues.apache.org/jira/browse/CSV-91?jql=project%20%3D%20CSV%20AND%20text%20~%20%22with%20quotes%22


> Store the info of whether a field is enclosed by quotes
> ---
>
> Key: CSV-196
> URL: https://issues.apache.org/jira/browse/CSV-196
> Project: Commons CSV
>  Issue Type: Improvement
>  Components: Parser
>Affects Versions: 1.4
>Reporter: Matt Sun
>  Labels: easyfix, features, patch
>   Original Estimate: 48h
>  Remaining Estimate: 48h





[jira] [Updated] (CSV-196) Store the info of whether a field is enclosed by quotes

2016-09-22 Thread Matt Sun (JIRA)

 [ 
https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Sun updated CSV-196:
-
Priority: Major  (was: Minor)

> Store the info of whether a field is enclosed by quotes
> ---
>
> Key: CSV-196
> URL: https://issues.apache.org/jira/browse/CSV-196
> Project: Commons CSV
>  Issue Type: Improvement
>  Components: Parser
>Affects Versions: 1.4
>Reporter: Matt Sun
>  Labels: easyfix, features, patch
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> It would be good for the CSVParser class to record whether a field 
> was enclosed by quotes in the original source file.
> For example, for this data sample:
> A, B, C
> a1, "b1", c1
> CSVParser gives us the record a1, b1, c1. This is helpful because the 
> double quotes are parsed away, but the information about the original data 
> is lost at the same time: we can't tell from the returned CSVRecord whether 
> the original field was enclosed by double quotes or not.
> In our use case, we are integrating Apache Hadoop APIs with Commons CSV. CSV 
> is one kind of input for Hadoop, which should support splitting input data. To 
> accurately split a CSV file into pieces, the program needs to count the bytes 
> of data that CSVParser actually read. CSVParser has no accurate information 
> about whether a field was enclosed by quotes, nor does it store the raw data 
> of the original source, so downstream users of CSVParser cannot get this 
> information.
> Suggested fix: extend the token/CSVRecord with a field indicating 
> whether the column was enclosed by quotes. While Lexer is doing getNextToken, 
> set the flag if a field is encapsulated and successfully parsed.
> I found another reported issue, but it was marked as resolved: [CSV-91] 
> https://issues.apache.org/jira/browse/CSV-91?jql=project%20%3D%20CSV%20AND%20text%20~%20%22with%20quotes%22


