from:"makatun"

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-22 Thread makatun

Manu, thank you very much for your response. 1. Your post helps to further optimize the spark jobs for wide data. (https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015) The suggested change of code: df.select(df.columns.map { col => df(col).isNotNull }: _*)
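For readers following the thread, here is a minimal sketch of the two styles being contrasted, written for spark-shell or a Scala script. The local session settings, the three-column sample data, and the names `viaWithColumn` and `viaSelect` are assumptions for illustration and do not come from the original test code. The `.as(name)` alias is also an addition, used only so that both variants produce the same column names.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Local session for illustration only; master and app name are assumptions.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("wide-select-sketch")
  .getOrCreate()
import spark.implicits._

// Tiny stand-in for a wide table (hypothetical data).
val df: DataFrame =
  Seq[(Int, String, Double)]((1, "a", 2.0), (2, null, 3.0)).toDF("c1", "c2", "c3")

// Column by column: each withColumn call returns a new DataFrame,
// so a table with N columns goes through N intermediate DataFrames.
val viaWithColumn: DataFrame = df.columns.foldLeft(df) { (acc, name) =>
  acc.withColumn(name, acc(name).isNotNull)
}

// Single projection: all columns are transformed in one select,
// as in the suggested change quoted above.
val viaSelect: DataFrame =
  df.select(df.columns.map { name => df(name).isNotNull.as(name) }: _*)
```

Both variants should yield the same rows; the difference discussed in the thread concerns how much planning work Spark does on the driver as the number of columns grows.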

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-20 Thread makatun

Hi Marco, many thanks for pointing out the related Spark commit. According to the description, it introduces indexed (instead of linear) search over columns in LogicalPlan.resolve(...). We have performed the tests on the current Spark master branch and would like to share the results. There are some

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-09 Thread makatun

Here are the images missing from the previous mail. My apologies.

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-09 Thread makatun

Following the discussion and recommendations at the link you provided, we ran tests with constraint propagation disabled, using the following option: spark.conf.set(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key, false) The resulting measurements are on the plot:

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-08 Thread makatun

Steve, thank you for your response. We have tested spark.read with various options. The difference in performance is very small. In particular, schema inference has virtually no effect in the tested case (the test files have just a few rows). Moreover, the complexity of spark.read remains

[Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-06 Thread makatun

It is well known that wide tables are not the most efficient way to organize data. However, sometimes we have to deal with extremely wide tables featuring thousands of columns, for example when loading data from legacy systems. We have performed an investigation of how the number of columns affects

6 matches
