Thanks for the quick answer, Holden! Are there any other PySpark behaviors like this that are hard to debug through the UI or toDebugString?
On Wed, May 10, 2017 at 7:18 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

> In PySpark the filter and then map steps are combined into a single
> transformation from the JVM point of view. This allows us to avoid copying
> the data back to Scala in between the filter and the map steps. The
> debugging experience is certainly much harder in PySpark and I think is an
> interesting area for those interested in contributing :)
>
> On Wed, May 10, 2017 at 7:33 AM pklemenkov <pklemen...@gmail.com> wrote:
>
>> This Scala code:
>>
>> scala> val logs = sc.textFile("big_data_specialization/log.txt").
>>      |   filter(x => !x.contains("INFO")).
>>      |   map(x => (x.split("\t")(1), 1)).
>>      |   reduceByKey((x, y) => x + y)
>>
>> generated an obvious lineage:
>>
>> (2) ShuffledRDD[4] at reduceByKey at <console>:27 []
>>  +-(2) MapPartitionsRDD[3] at map at <console>:26 []
>>     |  MapPartitionsRDD[2] at filter at <console>:25 []
>>     |  big_data_specialization/log.txt MapPartitionsRDD[1] at textFile at <console>:24 []
>>     |  big_data_specialization/log.txt HadoopRDD[0] at textFile at <console>:24 []
>>
>> But this Python code:
>>
>> logs = sc.textFile("../log.txt")\
>>     .filter(lambda x: 'INFO' not in x)\
>>     .map(lambda x: (x.split('\t')[1], 1))\
>>     .reduceByKey(lambda x, y: x + y)
>>
>> generated something strange which is hard to follow:
>>
>> (2) PythonRDD[13] at RDD at PythonRDD.scala:48 []
>>  |  MapPartitionsRDD[12] at mapPartitions at PythonRDD.scala:422 []
>>  |  ShuffledRDD[11] at partitionBy at NativeMethodAccessorImpl.java:0 []
>>  +-(2) PairwiseRDD[10] at reduceByKey at <ipython-input-9-d6a34e0335b0>:1 []
>>     |  PythonRDD[9] at reduceByKey at <ipython-input-9-d6a34e0335b0>:1 []
>>     |  ../log.txt MapPartitionsRDD[8] at textFile at NativeMethodAccessorImpl.java:0 []
>>     |  ../log.txt HadoopRDD[7] at textFile at NativeMethodAccessorImpl.java:0 []
>>
>> Why is that? Does PySpark do some optimizations under the hood? This debug
>> string is really useless for debugging.
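For anyone following along, here is a minimal pure-Python sketch of the fusion Holden describes: PySpark chains the Python-side filter and map into one function that runs over each partition, so the JVM only ever sees a single PythonRDD step instead of separate filter and map RDDs. The names below (pipeline, filter_step, map_step) are illustrative only, not PySpark internals.

```python
def pipeline(*steps):
    """Compose per-partition transformations into one fused callable,
    analogous to how PySpark pipelines Python functions before handing
    a single function to the JVM side."""
    def fused(iterator):
        for step in steps:
            iterator = step(iterator)
        return iterator
    return fused

def filter_step(lines):
    # mirrors: .filter(lambda x: 'INFO' not in x)
    return (x for x in lines if 'INFO' not in x)

def map_step(lines):
    # mirrors: .map(lambda x: (x.split('\t')[1], 1))
    return ((x.split('\t')[1], 1) for x in lines)

fused = pipeline(filter_step, map_step)

# One "partition" of log lines:
partition = ['a\tERROR', 'b\tINFO', 'c\tWARN']
print(list(fused(iter(partition))))  # [('ERROR', 1), ('WARN', 1)]
```

Because the two lambdas execute as one fused function inside a single mapPartitions-style pass, the per-step boundaries visible in the Scala lineage simply don't exist on the JVM side, which is why toDebugString can't show them.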
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Core-Python-and-Scala-generate-different-DAGs-for-identical-code-tp28674.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau

--
Yours faithfully,
Pavel Klemenkov.