fil wrote
> - Python functions like groupCount; these get reflected from their Python
> AST and converted into a Spark DAG? Presumably if I try and do something
> non-convertible this transformation process will throw an error? In other
> words this runs in the JVM.

Further to this - it seems that the Python function does run on each node
in the cluster, in Python worker processes outside the JVM. Presumably this
means that writing it in Scala would be far more performant, since it would
avoid serializing data back and forth between the JVM and the Python
workers.

Could I write groupCount() in Scala, and then use it from PySpark? Care to
supply an example? I'm finding them hard to find :)
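
Here's roughly what I imagine the Scala side would look like - an untested
sketch, where the package name and the wiring are my guesses rather than
anything I've found documented:

    package com.example

    import org.apache.spark.SparkContext._       // pair-RDD implicits (reduceByKey)
    import org.apache.spark.api.java.JavaRDD

    object GroupCount {
      // Count how many times each string occurs in the RDD.
      def groupCount(rdd: JavaRDD[String]): JavaRDD[(String, Long)] =
        JavaRDD.fromRDD(rdd.rdd.map(s => (s, 1L)).reduceByKey(_ + _))
    }

I gather I'd ship that in a JAR (e.g. spark-submit --jars groupcount.jar)
and then reach it from PySpark through the py4j gateway, along the lines of
sc._jvm.com.example.GroupCount.groupCount(jrdd). The part I can't work out
is the RDD hand-off: PySpark RDDs seem to live in the JVM as pickled bytes,
so presumably the Scala side would see JavaRDD[Array[Byte]] rather than
JavaRDD[String]? Is there a standard way to bridge that?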


fil wrote
> - I had considered that "partitions" were batches of distributable work,
> and generally large. Presumably the above is OK with small groups (eg.
> average size < 10) - this won't kill performance?

I'm still a bit confused about the dual meaning of "partition": work
segmentation versus key groups. Care to clarify, anyone? When does
"partition" describe a chunk of data shipped to a node in the cluster
(i.e. large), and when does it describe a group of items sharing a key
(i.e. small)?
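
To make the question concrete, here's how I currently picture the two
senses - a spark-shell sketch, where the path and the sizes are just my
mental model and may well be wrong:

    import org.apache.spark.SparkContext._   // pair-RDD implicits (groupByKey)

    // Sense 1: physical partitions - the units of parallel work, one task
    // per partition, typically large.
    val lines = sc.textFile("hdfs:///data.txt", 8)  // ask for >= 8 partitions
    println(lines.partitions.length)                // number of physical chunks

    // Sense 2: key groups - the logical, per-key collections produced by
    // groupByKey; there could be millions of tiny groups spread across
    // those few physical partitions.
    val groups = lines.map(l => (l.split(",")(0), l)).groupByKey()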



