Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-22 Thread makatun
Manu, thank you very much for your response. 1. Your post helps to further optimize Spark jobs for wide data (https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015). The suggested change of code: df.select(df.columns.map { col => df(col).isNotNull }: _*)
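The contrast between the two approaches can be sketched as follows. This is a minimal illustration, not code from the thread: the helper names and the "_notNull" column suffix are assumptions, and it presumes an existing DataFrame `df`.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Slow: every withColumn call triggers a fresh analysis of the whole
// logical plan, so total cost grows polynomially with the column count.
def perColumn(df: DataFrame): DataFrame =
  df.columns.foldLeft(df) { (acc, c) =>
    acc.withColumn(c + "_notNull", acc(c).isNotNull)
  }

// Fast: a single select builds all derived columns in one plan update,
// so the plan is analyzed only once.
def singleSelect(df: DataFrame): DataFrame =
  df.select(df.columns.map(c => col(c).isNotNull.alias(c + "_notNull")): _*)
```

Both produce the same columns; only the number of plan re-analyses differs.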

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-20 Thread Manu Zhang
Hi Makatun, For 2, I guess `cache` will break up the logical plan and force it to be analyzed. For 3, I have a similar observation here: https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015. Each `withColumn` forces the logical plan to be re-analyzed, which is not free.

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-20 Thread antonkulaga
makatun, did you try to test something more complex, like dataframe.describe or PCA?

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-20 Thread makatun
Hi Marco, many thanks for pointing out the related Spark commit. According to its description, it introduces an indexed (instead of linear) search over columns in LogicalPlan.resolve(...). We have performed the tests on the current Spark master branch and would like to share the results. There are some

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-14 Thread antonkulaga
Is it not going to be backported to 2.3.2? I am totally blocked by this issue in one of my projects.

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-10 Thread Marco Gaido
Hi Makatun, I think your problem has been solved in https://issues.apache.org/jira/browse/SPARK-16406, which is going to be in Spark 2.4. Please try the current master, where you should see the problem disappear. Thanks, Marco

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-09 Thread makatun
Here are the images missing in the previous mail. My apologies.

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-09 Thread makatun
Following the discussion and recommendations at the link you provided, we ran tests with constraint propagation disabled, using the following option: spark.conf.set(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key, false). The resulting measurements are on the plot:
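For completeness, a minimal sketch of how this option can be set, assuming a local SparkSession (the app name and master are illustrative; `SQLConf.CONSTRAINT_PROPAGATION_ENABLED` resolves to the key `spark.sql.constraintPropagation.enabled`):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.internal.SQLConf

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("wide-df-test")
  .getOrCreate()

// Disable constraint propagation during optimization; with thousands of
// columns, the propagated constraint set grows quickly and its cost can
// dominate plan analysis time.
spark.conf.set(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key, "false")
```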

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-08 Thread makatun
Steve, thank you for your response. We have tested spark.read with various options. The difference in performance is very small. In particular, schema inference has virtually no effect in the tested case (the testing files have just a few rows). Moreover, the complexity of spark.read remains

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-07 Thread Steve Loughran
CSV with schema inference is a full read of the data, so that could be one of the problems. Do it at most once, print out the schema and use it from then on during ingress, and use something else for persistence.
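The "infer once, reuse the schema" advice can be sketched as below. This is an illustration, not code from the thread: the file path and options are assumptions.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("csv-schema-once")
  .getOrCreate()

// Pay the full-scan cost of schema inference exactly once...
val inferred = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/wide.csv")
  .schema

// ...then pass the explicit schema so subsequent reads skip the
// inference pass entirely.
val df = spark.read
  .option("header", "true")
  .schema(inferred)
  .csv("data/wide.csv")
```

In practice the inferred schema can also be printed (`inferred.json`) and stored, so later jobs never infer at all; a columnar format such as Parquet, which carries its own schema, avoids the issue for persisted data.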

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-07 Thread 0xF0F0F0
This (and related JIRA tickets) might shed some light on the problem: http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-td20803.html

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-06 Thread antonkulaga
I have the same problem with gene expression data (GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct.gz, from gtex_analysis_v7/rna_seq_data/), where I have tens of thousands of genes as

[Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-06 Thread makatun
It is well known that wide tables are not the most efficient way to organize data. However, sometimes we have to deal with extremely wide tables featuring thousands of columns. For example, loading data from legacy systems. We have performed an investigation of how the number of columns affects