Nicholas Chammas created SPARK-47180:
----------------------------------------

             Summary: Migrate CSV parsing off of Univocity
                 Key: SPARK-47180
                 URL: https://issues.apache.org/jira/browse/SPARK-47180
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Nicholas Chammas


Univocity appears to be unmaintained.

As of February 2024:
 * The last release was [more than 3 years ago|https://github.com/uniVocity/univocity-parsers/releases].
 * The last commit to {{master}} was [almost 3 years ago|https://github.com/uniVocity/univocity-parsers/commits/master/].
 * The website is [down|https://github.com/uniVocity/univocity-parsers/issues/506].
 * There are [multiple|https://github.com/uniVocity/univocity-parsers/issues/494] [open|https://github.com/uniVocity/univocity-parsers/issues/495] [bugs|https://github.com/uniVocity/univocity-parsers/issues/499] on the tracker with no indication that anyone cares.

It's not urgent, but we should consider migrating to an actively maintained CSV library in the JVM ecosystem.

There are a bunch of libraries [listed here on this Maven Repository|https://mvnrepository.com/open-source/csv-libraries].

[jackson-dataformats-text|https://github.com/FasterXML/jackson-dataformats-text] looks interesting. I know we already use FasterXML to parse JSON. Perhaps we should use them to parse CSV as well.

I'm guessing we chose Univocity back in the day because it was the fastest CSV library on the JVM. However, the last performance benchmark comparing it to others was [from February 2018|https://github.com/uniVocity/csv-parsers-comparison/blob/5548b52f2cc27eb19c11464e9a331491e8ad4ba6/README.md#statistics-updated-28th-of-february-2018], so this may no longer be true.