Re: Creating DataFrame with the implicit localSeqToDatasetHolder has bad performance

2018-03-12 Thread msinton
I think I understand that in the second case the DataFrame is created as a
local relation, so the data lives in the memory of the driver and is
serialized as part of each Task that gets sent to the executors.
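
One way to see this (a minimal check, assuming a SparkSession in scope as
`spark`) is that the rows are already materialized inside the logical plan
itself, before any job runs:

    import spark.implicits._

    val df = (0 until 10).toDF("value")
    // The data is embedded in the plan as a LocalRelation...
    println(df.queryExecution.optimizedPlan)  // LocalRelation [value#...]
    // ...which becomes a LocalTableScan at the physical level
    df.explain()                              // LocalTableScan [value#...]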

Still, I think the implicit conversion here is something that others could
also misunderstand. Maybe it would be better if it were not part of
spark.implicits? Or at least a warning could be added to the developer
guides.
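
For reference, a minimal sketch of ways to keep the data off the driver
(Spark 2.3.x; the 10M row count below is only an illustrative figure):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Goes through rddToDatasetHolder: the sequence is partitioned across
    // executors instead of being embedded in the plan on the driver
    val viaRdd = spark.sparkContext.parallelize(0 until 10000000).toDF("value")

    // For numeric sequences, spark.range generates the rows on the
    // executors without materializing a local Seq at all
    val viaRange = spark.range(0, 10000000).toDF("value")

    viaRdd.explain()    // Scan ExternalRDDScan, not LocalTableScan
    viaRange.explain()  // Range (0, 10000000, ...), also distributed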



Creating DataFrame with the implicit localSeqToDatasetHolder has bad performance

2018-03-12 Thread msinton
Hi,

Using Scala, Spark version 2.3.0 (also 2.2.0): I've come across two main
ways to create a DataFrame from a sequence. The more common:

    spark.sparkContext.parallelize(0 until 10).toDF("value")    *good*

and the less common (but still prevalent):

    (0 until 10).toDF("value")    *bad*

The latter results in much worse performance (for example in
df.agg(mean("value")).collect()). Is it a bug, or my misunderstanding, that
these two are treated as equivalent? The latter appears to use the implicit
method localSeqToDatasetHolder, while the former uses rddToDatasetHolder.

The difference in the physical plans is that the *good* one looks like:

    *(1) SerializeFromObject [input[0, int, false] AS value#2]
    +- Scan ExternalRDDScan[obj#1]

while the *bad* one looks like:

    LocalTableScan [value#1]
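
(In case anyone wants to reproduce this, a minimal snippet; it assumes a
local SparkSession, and the row count is only illustrative, not the size I
benchmarked with:)

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.mean

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val n = 10000000  // illustrative; larger sequences make the gap obvious

    val good = spark.sparkContext.parallelize(0 until n).toDF("value")
    val bad  = (0 until n).toDF("value")

    good.explain()  // *(1) SerializeFromObject ... Scan ExternalRDDScan
    bad.explain()   // LocalTableScan [value#...]

    good.agg(mean("value")).collect()  // fast in my tests
    bad.agg(mean("value")).collect()   // much slower in my tests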
Even if this is not a bug, it would be great to learn more about what is
going on here and why I see such a huge performance difference. I've tried
to find resources that would help me understand this, but I've struggled to
get anywhere. (Looking at the source code I can follow /what/ is going on
to generate these plans, but I haven't found the /why/.)

Many thanks,
Matt


