Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-22 Thread makatun
Manu, thank you very much for your response. 1. Your post helps to further optimize Spark jobs for wide data (https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015). The suggested code change: df.select(df.columns.map { col => df(col).isNotNull }: _*)
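For readers of this thread, here is a minimal sketch contrasting the two styles discussed: a chain of withColumn calls versus the single select suggested above. The object name WideSelectSketch, the helper names viaWithColumn and viaSelect, the sample data, and the .as(name) alias (added so both variants yield the same column names) are illustrative assumptions and do not come from the original message or the linked post.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object WideSelectSketch {
  // Column-by-column variant: every withColumn call returns a new DataFrame
  // with one more projection layered on top of the previous plan.
  def viaWithColumn(df: DataFrame): DataFrame =
    df.columns.foldLeft(df) { (acc, name) =>
      acc.withColumn(name, acc(name).isNotNull)
    }

  // Single-projection variant, following the snippet quoted in the message.
  // The alias is an addition so the output keeps the original column names.
  def viaSelect(df: DataFrame): DataFrame =
    df.select(df.columns.map { name => df(name).isNotNull.as(name) }: _*)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("wide-select-sketch")
      .getOrCreate()
    import spark.implicits._

    // Tiny hypothetical input; real wide tables would have thousands of columns.
    val df = Seq[(Option[Int], Option[String])](
      (Some(1), None),
      (None, Some("x"))
    ).toDF("a", "b")

    // Both variants should produce the same rows; only the way the plan is built differs.
    assert(viaWithColumn(df).collect().toSeq == viaSelect(df).collect().toSeq)

    spark.stop()
  }
}
```

The sketch only checks that the two variants agree on a small input; it makes no claim about timings, which depend on the Spark version and the number of columns.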

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-20 Thread makatun
Hi Marco, many thanks for pointing to the related Spark commit. According to the description, it introduces indexed (instead of linear) search over columns in LogicalPlan.resolve(...). We have performed the tests on the current Spark master branch and would like to share the results. There are some

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-09 Thread makatun
Here are the images missing in the previous mail. My apologies.

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-09 Thread makatun
Following the discussion and recommendations at the link you provided, we ran tests with constraint propagation disabled, using the following option: spark.conf.set(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key, false) The resulting measurements are shown in the plot:

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-08 Thread makatun
Steve, thank you for your response. We have tested spark.read with various options. The difference in performance is very small. In particular, schema inference has virtually no effect in the tested case (the test files have just a few rows). Moreover, the complexity of spark.read remains

[Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-06 Thread makatun
It is well known that wide tables are not the most efficient way to organize data. However, sometimes we have to deal with extremely wide tables featuring thousands of columns, for example when loading data from legacy systems. *We have performed an investigation of how the number of columns affects