GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/15138

    [SPARK-17583][SQL] Remove unused rowSeparator variable and set 
auto-expanding buffer as default for maxCharsPerColumn option in CSV

    ## What changes were proposed in this pull request?
    
    This PR includes the changes below:
    
    1. Upgrade Univocity library from 2.1.1 to 2.2.1
    
      This includes some performance improvements and also enables the 
auto-expanding buffer for the `maxCharsPerColumn` option in CSV. Please refer to 
the [release notes](https://github.com/uniVocity/univocity-parsers/releases).
    
    2. Remove the unused `rowSeparator` variable in `CSVOptions`
    
      We have this unused variable in 
[CSVOptions.scala#L127](https://github.com/apache/spark/blob/29952ed096fd2a0a19079933ff691671d6f00835/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L127)
 but it can cause confusion by giving the impression that `\r\n` is not 
handled. For example, there is an open issue about this, 
[SPARK-17227](https://issues.apache.org/jira/browse/SPARK-17227), describing 
this variable.
    
      This variable is effectively unused because we rely on 
`LineRecordReader` in Hadoop, which handles only `\n` and `\r\n`.
    
    3. Set the default value of `maxCharsPerColumn` to auto-expanding
    
      We currently set 1000000 as the maximum length of each column. It would 
be more sensible to allow auto-expanding by default rather than a fixed length.
    
      For reference, using `-1` for this is described in the release notes for 
[2.2.0](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.2.0).
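    
      As a rough illustration of item 3 from the user's side, the sketch
      below reads the same CSV file twice: once with `maxCharsPerColumn`
      set to `-1` (the auto-expanding behaviour proposed as the default)
      and once with the previous fixed limit of 1000000 set explicitly.
      The object name, the temporary file and its contents are made up
      for the example, and it assumes a Spark 2.x build that already
      bundles Univocity 2.2.0 or later.
    
      ```scala
      import java.nio.file.Files
      import org.apache.spark.sql.SparkSession

      // Hypothetical example object; not part of this patch.
      object MaxCharsPerColumnExample {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder()
            .master("local[1]")
            .appName("maxCharsPerColumn-example")
            .getOrCreate()

          // Write a tiny CSV file so the example is self-contained.
          val path = Files.createTempFile("example", ".csv")
          Files.write(path, "id,text\n1,hello\n".getBytes("UTF-8"))

          // -1 asks the parser for an auto-expanding buffer (no fixed limit).
          val unlimited = spark.read
            .option("header", "true")
            .option("maxCharsPerColumn", "-1")
            .csv(path.toString)

          // Explicitly keep the previous fixed limit of 1000000 characters.
          val capped = spark.read
            .option("header", "true")
            .option("maxCharsPerColumn", "1000000")
            .csv(path.toString)

          unlimited.show()
          capped.show()
          spark.stop()
        }
      }
      ```
    
      Users who prefer a hard cap on column length could keep passing an
      explicit value, as in the second read above.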
    
    ## How was this patch tested?


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-17583

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15138.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15138
    
----
commit ee04aadf9dcca349f5045faef8973fccb964b511
Author: hyukjinkwon <gurwls...@gmail.com>
Date:   2016-09-18T09:06:27Z

    Remove unused rowSeparator variable and set auto-expanding buffer as 
default for maxCharsPerColumn option in CSV

----

