fil wrote
> - Python functions like groupCount; these get reflected from their Python
> AST and converted into a Spark DAG? Presumably if I try and do something
> non-convertible this transformation process will throw an error? In other
> words this runs in the JVM.
Further to this: it seems that Python does run on each node in the cluster, meaning it runs outside the JVM. Presumably this means that writing this in Scala would be far more performant. Could I write groupCount() in Scala, and then use it from PySpark? Care to supply an example? I'm finding them hard to find :)

fil wrote
> - I had considered that "partitions" were batches of distributable work,
> and generally large. Presumably the above is OK with small groups (eg.
> average size < 10) - this won't kill performance?

I'm still a bit confused about the dual meaning of "partition": work segmentation versus key groups. Can anyone clarify when "partition" describes a chunk of data sent to a node in the cluster (i.e. large), and when it describes a group of items within the data (i.e. small)?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Segmented-fold-count-tp12278p12342.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
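To illustrate the two senses of "partition" asked about above, here is a minimal pure-Python sketch (no Spark involved; the function names `split_into_partitions` and `group_counts` are hypothetical, invented for illustration). Sense 1 is a handful of large chunks of the dataset, one per worker; sense 2 is the many small consecutive key groups that a segmented count produces. Spark's actual partitioning (hash/range partitioners) is more involved than this round-robin split.

```python
from itertools import groupby
from operator import itemgetter

def split_into_partitions(records, n_partitions):
    """Sense 1: a partition as a chunk of the dataset assigned to one
    worker -- few of them, each relatively large (round-robin split
    here, purely for illustration)."""
    return [records[i::n_partitions] for i in range(n_partitions)]

def group_counts(records):
    """Sense 2: groups of consecutive items sharing a key (a segmented
    count) -- potentially many of them, each possibly tiny."""
    return [(key, sum(1 for _ in group))
            for key, group in groupby(records, key=itemgetter(0))]

records = [("a", 1), ("a", 2), ("b", 3), ("b", 4), ("b", 5), ("c", 6)]

print(split_into_partitions(records, 2))  # 2 large-ish work chunks
print(group_counts(records))              # [('a', 2), ('b', 3), ('c', 1)]
```

On calling Scala from PySpark: one commonly cited route is to package the Scala function into a jar, ship it with `--jars`, and reach it through the py4j gateway (`sc._jvm.your.package.Object.method(...)`), though that gateway attribute is not a documented public API.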