[ https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
antonkulaga reopened SPARK-28547:
---------------------------------

I did not see any solutions.

> Make it work for wide (> 10K columns data)
> ------------------------------------------
>
>                 Key: SPARK-28547
>                 URL: https://issues.apache.org/jira/browse/SPARK-28547
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.4, 2.4.3
>         Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per
> node, 32 cores (tried different configurations of executors)
>            Reporter: antonkulaga
>            Priority: Critical
>
> Spark is super-slow for all wide data (when there are >15K columns and >15K
> rows). Most genomics/transcriptomics data is wide because the number of
> genes is usually >20K, and so is the number of samples. The very popular GTEx
> dataset is a good example (see for instance the RNA-Seq data at
> https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where gct is
> just a .tsv file with two comments in the beginning). Everything done on wide
> tables (even simple "describe" functions applied to all the gene columns)
> either takes hours or freezes (because of lost executors), irrespective of
> memory and number of cores. The same operations run quickly (in minutes)
> and reliably in pure pandas, without any Spark involved.
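
For readers unfamiliar with the file layout mentioned in the description, here is a minimal sketch of reading a gct-style file and computing "describe"-like statistics per column. It uses only the Python standard library. It assumes, as the description says, that the file is a tab-separated table preceded by two leading lines. It also assumes that the first two columns are identifiers rather than numeric values. The function names `read_gct` and `describe_columns`, the column names, and the sample values are all hypothetical and are not taken from the GTEx files.

```python
import csv
import statistics


def read_gct(text):
    """Skip the two leading lines, then parse the rest as a tab-separated table."""
    lines = text.splitlines()
    reader = csv.reader(lines[2:], delimiter="\t")
    header = next(reader)
    rows = [row for row in reader if row]  # ignore blank lines
    return header, rows


def describe_columns(header, rows, first_numeric=2):
    """Compute count/mean/min/max for every column from index first_numeric on.

    first_numeric=2 is an assumption: the first two columns hold identifiers.
    """
    stats = {}
    for idx in range(first_numeric, len(header)):
        values = [float(row[idx]) for row in rows]
        stats[header[idx]] = {
            "count": len(values),
            "mean": statistics.mean(values),
            "min": min(values),
            "max": max(values),
        }
    return stats


# Tiny made-up sample: two leading lines, a header row, and two data rows.
SAMPLE = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "G1\tgeneA\t1.0\t2.0\t3.0\n"
    "G2\tgeneB\t3.0\t4.0\t5.0\n"
)

header, rows = read_gct(SAMPLE)
stats = describe_columns(header, rows)
```

This only illustrates the file layout and the kind of per-column summary being discussed; it says nothing about why Spark is slow on such tables.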