Nicholas Chammas created SPARK-47180:
----------------------------------------

             Summary: Migrate CSV parsing off of Univocity
                 Key: SPARK-47180
                 URL: https://issues.apache.org/jira/browse/SPARK-47180
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Nicholas Chammas


Univocity appears to be unmaintained.

As of February 2024:
 * The last release was [more than 3 years ago|https://github.com/uniVocity/univocity-parsers/releases].
 * The last commit to {{master}} was [almost 3 years ago|https://github.com/uniVocity/univocity-parsers/commits/master/].
 * The website is [down|https://github.com/uniVocity/univocity-parsers/issues/506].
 * There are [multiple|https://github.com/uniVocity/univocity-parsers/issues/494] [open|https://github.com/uniVocity/univocity-parsers/issues/495] [bugs|https://github.com/uniVocity/univocity-parsers/issues/499] on the tracker with no indication that anyone cares.

It's not urgent, but we should consider migrating to an actively maintained CSV 
library in the JVM ecosystem.

A number of candidate libraries are [listed on Maven Repository|https://mvnrepository.com/open-source/csv-libraries].

[jackson-dataformats-text|https://github.com/FasterXML/jackson-dataformats-text] looks interesting. We already use FasterXML's Jackson to parse JSON, so perhaps 
we should use it to parse CSV as well.

I'm guessing we chose Univocity back in the day because it was the fastest CSV 
library on the JVM. However, the last performance benchmark comparing it to 
others was [from February 2018|https://github.com/uniVocity/csv-parsers-comparison/blob/5548b52f2cc27eb19c11464e9a331491e8ad4ba6/README.md#statistics-updated-28th-of-february-2018], 
so this may no longer be true.



