Hi,

Using Scala, Spark version 2.3.0 (also 2.2.0): I've come across two main ways to create a DataFrame from a sequence. The more common:

    spark.sparkContext.parallelize(0 until 10).toDF("value")    *good*

and the less common (but still prevalent):

    (0 until 10).toDF("value")    *bad*

The latter results in much worse performance (for example, in df.agg(mean("value")).collect()). I don't know whether this is a bug, or a misunderstanding on my part that these two are equivalent. The latter appears to use the implicit method localSeqToDatasetHolder, while the former uses rddToDatasetHolder.

The difference in the physical plans is that the *good* one looks like:
    *(1) SerializeFromObject [input[0, int, false] AS value#2]
    +- Scan ExternalRDDScan[obj#1]
The *bad* one looks like:

    LocalTableScan [value#1]
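For anyone wanting to reproduce this, here is a minimal, self-contained sketch (assuming spark-sql 2.3.x on the classpath and a local master) that builds the DataFrame both ways and prints the two physical plans, so the ExternalRDDScan vs. LocalTableScan difference is visible directly. The object and method names are my own, not from the thread:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.mean

object PlanComparison {
  // Builds the same ten-row DataFrame via both implicits and returns
  // the mean of "value" computed through each plan (both should be 4.5).
  def means(): (Double, Double) = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("plan-comparison")
      .getOrCreate()
    import spark.implicits._

    // rddToDatasetHolder path: Seq -> RDD first, then toDF.
    val viaRdd = spark.sparkContext.parallelize(0 until 10).toDF("value")
    viaRdd.explain() // prints SerializeFromObject +- Scan ExternalRDDScan

    // localSeqToDatasetHolder path: driver-side Seq becomes a LocalRelation.
    val viaSeq = (0 until 10).toDF("value")
    viaSeq.explain() // prints LocalTableScan

    val m1 = viaRdd.agg(mean("value")).collect()(0).getDouble(0)
    val m2 = viaSeq.agg(mean("value")).collect()(0).getDouble(0)
    spark.stop()
    (m1, m2)
  }

  def main(args: Array[String]): Unit = println(means())
}
```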
Even if this is not a bug, it would be great to learn more about what is going on here and why I see such a huge performance difference. I've tried to find resources that would help me understand this, but I've struggled to get anywhere. (Looking at the source code I can follow /what/ is going on to generate these plans, but I haven't found the /why/.)

Many thanks,
Matt
--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/