[jira] [Commented] (SPARK-28547) Make it work for wide (> 10K columns data)

2019-10-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949561#comment-16949561
 ] 

Hyukjin Kwon commented on SPARK-28547:
--

[~antonkulaga], if you're unable to elaborate on what to target and fix in a JIRA, 
it might be better to start from the dev mailing list to develop the idea and work 
out where to fix it.

{quote}
If I see that even super-simple tasks like describe, or simple transformations 
(like taking the log of each gene expression value), fail,
{quote}

As a user, it seems you will definitely have a reproducer. Feel free to make a 
minimised reproducer and open another ticket that targets specific cases.
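
For reference, a minimised reproducer could look roughly like the sketch below: 
it synthesizes a wide DataFrame (no external data needed) and times describe on 
it. The sizes and session setup are placeholders, not taken from this ticket.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Minimised-reproducer sketch: synthesize a wide DataFrame and time describe.
// nCols/nRows are placeholders; the ticket talks about 10-20K columns.
val spark = SparkSession.builder().appName("wide-describe-repro").getOrCreate()

val nCols = 10000
val nRows = 1000
val cols = (0 until nCols).map(i => (rand() * 100).as(s"c$i"))
val wide = spark.range(nRows).select(cols: _*)

val start = System.nanoTime()
wide.describe().show(5)  // describe() with no args covers all numeric columns
println(s"describe on $nCols columns took ${(System.nanoTime() - start) / 1e9} s")
{code}

Scaling nCols up toward the 15K mentioned in the description should show whether 
the slowdown grows with column count alone.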


> Make it work for wide (> 10K columns data)
> --
>
> Key: SPARK-28547
> URL: https://issues.apache.org/jira/browse/SPARK-28547
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per 
> node, 32 cores (tried different configurations of executors)
>Reporter: antonkulaga
>Priority: Critical
>
> Spark is super-slow for all wide data (when there are >15K columns and >15K 
> rows). Most genomics/transcriptomics data is wide, because the number of 
> genes is usually >20K and the number of samples as well. The very popular 
> GTEx dataset is a good example (see for instance the RNA-Seq data at 
> https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data, where a .gct 
> is just a .tsv file with two comment lines at the beginning). Everything 
> done on wide tables (even a simple describe applied to all the gene 
> columns) either takes hours or freezes (because of lost executors), 
> irrespective of memory and number of cores, while the same operations work 
> fast (minutes) in pure pandas (without any Spark involved).
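
Since the description above notes that a .gct is just a TSV with two leading 
comment lines, loading it might look like this sketch (the path and options are 
assumptions, not from the report):

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch of loading a .gct: drop the two leading metadata lines ("#1.2" and
// the dimensions line), then parse the rest as a headered TSV. The path is
// illustrative.
val spark = SparkSession.builder().appName("gct-load").getOrCreate()
import spark.implicits._

val body = spark.read.textFile("/data/gtex_rnaseq.gct").rdd
  .zipWithIndex()
  .filter { case (_, i) => i >= 2 }  // skip the two comment/metadata lines
  .map(_._1)
  .toDS()

val df = spark.read
  .option("sep", "\t")
  .option("header", "true")
  .option("inferSchema", "true")  // schema inference is itself costly on wide files
  .csv(body)
{code}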






[jira] [Commented] (SPARK-28547) Make it work for wide (> 10K columns data)

2019-10-11 Thread antonkulaga (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949230#comment-16949230
 ] 

antonkulaga commented on SPARK-28547:
-

>I bet there is room for improvement, but ten thousand columns are just 
>inherently slow given how metadata, query plans, etc. are handled.
>You'd at least need to help narrow down where the slowdown is and why, and 
>even better if you can propose a class of fix. As it is, I'd close this.

[~srowen] I am not a Spark developer, I am a Spark user, so I cannot say where 
the bottleneck is. If I see that even super-simple tasks like describe, or 
simple transformations (like taking the log of each gene expression value), 
fail, I report it as a performance problem. As I am a bioinformatician, most of 
my work is about dealing with gene expressions (thousands of samples * tens of 
thousands of genes), so this makes Spark unusable for me for most use-cases. If 
operations that take seconds in a pandas dataframe (without any Spark involved) 
take many hours or freeze in a Spark dataframe, there is something inherently 
wrong with how Spark handles the data, and something you should investigate for 
Spark 3.0.
If you want to narrow it down, can it be "make dataframe.describe work for a 15K 
* 15K dataframe and take less than 20 minutes to complete"?
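
In the meantime, one possible workaround, assuming the gene columns are all 
DoubleType, is to sidestep describe's per-column query plan and compute the 
summaries with MLlib's colStats over an RDD[Vector]; a rough sketch:

{code:scala}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Workaround sketch, assuming df's gene columns are all DoubleType: compute
// per-column summaries with MLlib's colStats over an RDD[Vector], which
// avoids building one aggregate expression per column the way describe does.
val geneCols = df.columns  // adjust to the numeric columns only
val rows = df.rdd.map { row =>
  Vectors.dense(geneCols.indices.map(row.getDouble).toArray)
}
val summary = Statistics.colStats(rows)
println(summary.mean)      // per-column mean
println(summary.variance)  // per-column variance
println(summary.max)
println(summary.min)
{code}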







[jira] [Commented] (SPARK-28547) Make it work for wide (> 10K columns data)

2019-10-10 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948907#comment-16948907
 ] 

Sean R. Owen commented on SPARK-28547:
--

I agree, this is too open-ended. It's not clear whether it's a general problem 
or specific to a usage pattern, a SQL query, a data type or distribution. Often 
I find that use cases for "10K columns" are really use cases for "a big 
array-valued column".

I bet there is room for improvement, but ten thousand columns are just 
inherently slow given how metadata, query plans, etc. are handled.

You'd at least need to help narrow down where the slowdown is and why, and 
even better if you can propose a class of fix. As it is, I'd close this.
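
To illustrate the array-valued-column pattern: pack the gene columns into one 
array so the schema and query plan stay small, then compute per-gene statistics 
by position. A sketch, with made-up column names ("sample_id", "expression"):

{code:scala}
import org.apache.spark.sql.functions._

// Sketch of the "one big array-valued column" pattern. Assumes all gene
// columns share a numeric type so they can be packed into one array.
val geneCols = df.columns.filterNot(_ == "sample_id")
val packed = df.select(
  col("sample_id"),
  array(geneCols.map(col): _*).as("expression")
)

// Per-gene statistics by array position instead of by named column:
val perGene = packed
  .select(posexplode(col("expression")).as(Seq("gene_idx", "value")))
  .groupBy("gene_idx")
  .agg(mean("value"), stddev("value"), min("value"), max("value"))
{code}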







[jira] [Commented] (SPARK-28547) Make it work for wide (> 10K columns data)

2019-10-10 Thread antonkulaga (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948897#comment-16948897
 ] 

antonkulaga commented on SPARK-28547:
-

[~hyukjin.kwon] what is not clear to you? I think it is really clear that 
Spark performs miserably (freezing or taking many hours) whenever the data 
frame has 10-20K or more columns, and I gave the GTEx dataset as an example 
(though any gene or transcript expression dataset will do to demonstrate it). 
In many fields (like a big part of bioinformatics) wide data frames are 
common, and right now Spark is totally useless there.







[jira] [Commented] (SPARK-28547) Make it work for wide (> 10K columns data)

2019-07-31 Thread antonkulaga (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897194#comment-16897194
 ] 

antonkulaga commented on SPARK-28547:
-

[~maropu] I think I was quite clear: even describe is slow as hell. So the 
easiest way to reproduce is just to run describe on all the numeric columns in 
GTEx.
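
Concretely, that reproduction might look like the sketch below, where df is 
assumed to be the DataFrame loaded from the GTEx .gct file:

{code:scala}
import org.apache.spark.sql.types.NumericType

// Reproduction sketch: select every numeric column of the loaded GTEx frame,
// run describe over all of them, and time it. df is assumed from earlier.
val numericCols = df.schema.fields
  .filter(_.dataType.isInstanceOf[NumericType])
  .map(_.name)

val t0 = System.nanoTime()
df.describe(numericCols: _*).show()
println(s"describe over ${numericCols.length} columns took ${(System.nanoTime() - t0) / 1e9} s")
{code}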







[jira] [Commented] (SPARK-28547) Make it work for wide (> 10K columns data)

2019-07-28 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894839#comment-16894839
 ] 

Takeshi Yamamuro commented on SPARK-28547:
--

You need to ask on the dev mailing list first to narrow down the issue; we can 
do nothing based on the current description.



