[jira] [Commented] (SPARK-3662) Importing pandas breaks included pi.py example
[ https://issues.apache.org/jira/browse/SPARK-3662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169424#comment-14169424 ]

Sean Owen commented on SPARK-3662:
----------------------------------

[~esamanas] Do you have a suggested change here, beyond just disambiguating the imports in your example? Or a different example that doesn't involve the import collision? It sounds like the modified example is then misunderstood to refer to a pandas random function, not the Python one; that is simply a matter of namespace collision, and is why pandas is dragged in. This example seems to fall down before it demonstrates anything else.

Importing pandas breaks included pi.py example
----------------------------------------------

Key: SPARK-3662
URL: https://issues.apache.org/jira/browse/SPARK-3662
Project: Spark
Issue Type: Bug
Components: PySpark, YARN
Affects Versions: 1.1.0
Environment: Xubuntu 14.04. Yarn cluster running on Ubuntu 12.04.
Reporter: Evan Samanas

If I add "import pandas" at the top of the included pi.py example and submit using spark-submit --master yarn-client, I get this stack trace:

{code}
Traceback (most recent call last):
  File "/home/evan/pub_src/spark-1.1.0/examples/src/main/python/pi.py", line 39, in <module>
    count = sc.parallelize(xrange(1, n+1), slices).map(f).reduce(add)
  File "/home/evan/pub_src/spark/python/pyspark/rdd.py", line 759, in reduce
    vals = self.mapPartitions(func).collect()
  File "/home/evan/pub_src/spark/python/pyspark/rdd.py", line 723, in collect
    bytesInJava = self._jrdd.collect().iterator()
  File "/home/evan/pub_src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/home/evan/pub_src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError
14/09/23 15:51:58 INFO TaskSetManager: Lost task 2.3 in stage 0.0 (TID 10) on executor SERVERNAMEREMOVED: org.apache.spark.api.python.PythonException (Traceback (most recent call last):
  File "/yarn/nm/usercache/evan/filecache/173/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/worker.py", line 75, in main
    command = pickleSer._read_with_length(infile)
  File "/yarn/nm/usercache/evan/filecache/173/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py", line 150, in _read_with_length
    return self.loads(obj)
ImportError: No module named algos
{code}

The example works fine if I move the statement "from random import random" from the top into the function (def f(_)) defined in the example. As near as I can tell, random is getting confused with a function of the same name within pandas.algos.

Submitting the same script using --master local works, but prints a distressing amount of random characters to stdout or stderr and messes up my terminal:

{code}
... @J@J@J@J@J@J@J@J@J@J@J@J@J@J ... [long run of garbled binary output trimmed] ...
14/09/23 15:42:09 INFO SparkContext: Job finished: reduce at /home/evan/pub_src/spark-1.1.0/examples/src/main/python/pi_sframe.py:38, took 11.276879779 s
... [more garbled binary output trimmed] ...
Pi is roughly 3.146136
{code}

No idea if that's related, but thought I'd include it.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail:
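The workaround described above (moving "from random import random" into the mapped function) can be sketched without a cluster. This is a plain-Python rendering of the pi.py estimator, not the Spark job itself: the driver/worker split is elided, and reduce/map stand in for the RDD operations.

```python
from operator import add
from functools import reduce

def f(_):
    # Workaround from the report: importing inside the function means the
    # name `random` is resolved where f actually runs, instead of shipping
    # a module-level reference that can be shadowed (e.g. by pandas.algos).
    from random import random
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 < 1 else 0

n = 100000
# In pi.py this line is: sc.parallelize(xrange(1, n+1), slices).map(f).reduce(add)
count = reduce(add, map(f, range(1, n + 1)))
print("Pi is roughly %f" % (4.0 * count / n))
```

With the import at module level instead, the pure-Python version behaves identically; the difference only matters once the function is serialized and shipped to a worker whose environment differs from the driver's.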
[jira] [Commented] (SPARK-3662) Importing pandas breaks included pi.py example
[ https://issues.apache.org/jira/browse/SPARK-3662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146017#comment-14146017 ]

Sean Owen commented on SPARK-3662:
----------------------------------

Maybe I am missing something, but does this just mean you can't import pandas in its entirety? If you're modifying the example, you should import only what you need from pandas. Or it may be that you do need to modify the "import random", indeed, to accommodate the other modifications you want to make. But what is the problem with the included example? It runs fine without modifications, no?
[jira] [Commented] (SPARK-3662) Importing pandas breaks included pi.py example
[ https://issues.apache.org/jira/browse/SPARK-3662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146639#comment-14146639 ]

Evan Samanas commented on SPARK-3662:
-------------------------------------

I wouldn't focus on the example, on the fact that I modified it, or on whether I should be importing only a small portion of pandas. The issue here is that Spark breaks in this case because of a name collision; modifying the example is simply the one reproducer I've found.

I was modifying the example to learn how Spark ships Python code to the cluster. In this case, I expected pandas to be imported only in the driver program and not by any workers. The workers do not have pandas installed, so the expected behavior is that the example runs to completion, and an ImportError would mean the workers are importing things they don't need for the task at hand.

The way I expected Spark to work IS actually how Spark works: modules are imported by workers only if a function passed to them uses those modules. But this error showed me false evidence to the contrary. I'm assuming the error is in Spark's modifications to CloudPickle, not in the way the example is set up.
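The distinction Evan draws comes down to how the mapped function is serialized. A minimal illustration with the standard library's pickle, which is not what Spark uses but shows the baseline behavior: a top-level function pickles as a reference (module plus name), so the deserializing side must re-import that module and re-resolve the names the function uses. Spark's CloudPickle variant instead ships the function body together with its global references; the report's hypothesis is that one such reference (`random`) was recorded as belonging to pandas.algos once pandas had been imported in the driver.

```python
import pickle
from random import random  # module-level import, as in the original pi.py

def f(_):
    # f's body refers to the global name `random`, bound at import time.
    return random()

# Standard pickle stores a top-level function by reference, not by value:
# the payload names the defining module and the function, and unpickling
# re-resolves both in the receiving process.
payload = pickle.dumps(f)
restored = pickle.loads(payload)

# In the same process, the reference resolves back to the identical object.
assert restored is f
```

A serializer that instead captures `f.__globals__` entries by value (as CloudPickle-style serializers do, so that workers need not have the driver's script importable) has to decide which module each global came from, which is exactly where a shadowed name could be mis-attributed.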