Thanks for the quick answer, Holden! Are there any other PySpark behaviors like this that are hard to debug through the UI or toDebugString?
On Wed, May 10, 2017 at 7:18 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

> In PySpark the filter and then map steps are combined into a single
> transformation from the JVM point of view. This allows us to avoid copying
> the data back to Scala in between the filter and the map steps. The
> debugging experience is certainly much harder in PySpark and I think is an
> interesting area for those interested in contributing :)
>
> On Wed, May 10, 2017 at 7:33 AM pklemenkov <pklemen...@gmail.com> wrote:
>
>> This Scala code:
>>
>> scala> val logs = sc.textFile("big_data_specialization/log.txt").
>>      |   filter(x => !x.contains("INFO")).
>>      |   map(x => (x.split("\t")(1), 1)).
>>      |   reduceByKey((x, y) => x + y)
>>
>> generated an obvious lineage:
>>
>> (2) ShuffledRDD[4] at reduceByKey at <console>:27 []
>>  +-(2) MapPartitionsRDD[3] at map at <console>:26 []
>>     |  MapPartitionsRDD[2] at filter at <console>:25 []
>>     |  big_data_specialization/log.txt MapPartitionsRDD[1] at textFile at <console>:24 []
>>     |  big_data_specialization/log.txt HadoopRDD[0] at textFile at <console>:24 []
>>
>> But this Python code:
>>
>> logs = sc.textFile("../log.txt")\
>>     .filter(lambda x: 'INFO' not in x)\
>>     .map(lambda x: (x.split('\t')[1], 1))\
>>     .reduceByKey(lambda x, y: x + y)
>>
>> generated something strange which is hard to follow:
>>
>> (2) PythonRDD[13] at RDD at PythonRDD.scala:48 []
>>  |  MapPartitionsRDD[12] at mapPartitions at PythonRDD.scala:422 []
>>  |  ShuffledRDD[11] at partitionBy at NativeMethodAccessorImpl.java:0 []
>>  +-(2) PairwiseRDD[10] at reduceByKey at <ipython-input-9-d6a34e0335b0>:1 []
>>     |  PythonRDD[9] at reduceByKey at <ipython-input-9-d6a34e0335b0>:1 []
>>     |  ../log.txt MapPartitionsRDD[8] at textFile at NativeMethodAccessorImpl.java:0 []
>>     |  ../log.txt HadoopRDD[7] at textFile at NativeMethodAccessorImpl.java:0 []
>>
>> Why is that? Does PySpark do some optimizations under the hood? This debug
>> string is really useless for debugging.
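For anyone following along, here is a minimal pure-Python sketch of the fusion Holden describes: PySpark chains the Python-side filter and map into one function that runs over each partition, so the JVM only ever sees a single PythonRDD step instead of separate filter and map RDDs. The names below (pipeline, filter_step, map_step) are illustrative only, not PySpark internals.

```python
def pipeline(*steps):
    """Compose per-partition transformations into one fused callable,
    analogous to how PySpark pipelines Python functions before handing
    a single function to the JVM side."""
    def fused(iterator):
        for step in steps:
            iterator = step(iterator)
        return iterator
    return fused

def filter_step(lines):
    # mirrors: .filter(lambda x: 'INFO' not in x)
    return (x for x in lines if 'INFO' not in x)

def map_step(lines):
    # mirrors: .map(lambda x: (x.split('\t')[1], 1))
    return ((x.split('\t')[1], 1) for x in lines)

fused = pipeline(filter_step, map_step)

# One "partition" of log lines:
partition = ['a\tERROR', 'b\tINFO', 'c\tWARN']
print(list(fused(iter(partition))))  # [('ERROR', 1), ('WARN', 1)]
```

Because the two lambdas execute as one fused function inside a single mapPartitions-style pass, the per-step boundaries visible in the Scala lineage simply don't exist on the JVM side, which is why toDebugString can't show them.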
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Core-Python-and-Scala-generate-different-DAGs-for-identical-code-tp28674.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau

--
Yours faithfully,
Pavel Klemenkov.