[jira] [Commented] (SPARK-3662) Importing pandas breaks included pi.py example

2014-10-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169424#comment-14169424
 ] 

Sean Owen commented on SPARK-3662:
--

[~esamanas] Do you have a suggested change here, beyond just disambiguating the 
imports in your example? Or a different example that doesn't involve an import 
collision? It sounds like, in the modified example, "random" is misread as 
referring to something in pandas rather than to the Python standard library; 
that is simply a namespace collision, and it is why pandas gets dragged in at 
all. This example seems to fall down before it demonstrates anything else.
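
For reference, a sketch of the modification under discussion (reconstructed from 
the stock pi.py shipped with Spark 1.1.0; the pandas import is the reporter's 
addition):

{code}
import pandas                  # added at module level; the YARN workers lack pandas
from random import random     # stdlib random(); per the report, pandas.algos
                              # also exposes a name `random`

def f(_):
    # pi.py's sampling function: the pickled closure has to re-resolve the
    # global name `random` on each worker when the task runs
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 < 1 else 0
{code}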

 Importing pandas breaks included pi.py example
 --

 Key: SPARK-3662
 URL: https://issues.apache.org/jira/browse/SPARK-3662
 Project: Spark
  Issue Type: Bug
  Components: PySpark, YARN
Affects Versions: 1.1.0
 Environment: Xubuntu 14.04.  Yarn cluster running on Ubuntu 12.04.
Reporter: Evan Samanas

 If I add "import pandas" at the top of the included pi.py example and submit 
 using "spark-submit --master yarn-client", I get this stack trace:
 {code}
 Traceback (most recent call last):
   File "/home/evan/pub_src/spark-1.1.0/examples/src/main/python/pi.py", line 39, in <module>
     count = sc.parallelize(xrange(1, n+1), slices).map(f).reduce(add)
   File "/home/evan/pub_src/spark/python/pyspark/rdd.py", line 759, in reduce
     vals = self.mapPartitions(func).collect()
   File "/home/evan/pub_src/spark/python/pyspark/rdd.py", line 723, in collect
     bytesInJava = self._jrdd.collect().iterator()
   File "/home/evan/pub_src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
   File "/home/evan/pub_src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
 py4j.protocol.Py4JJavaError
 14/09/23 15:51:58 INFO TaskSetManager: Lost task 2.3 in stage 0.0 (TID 10) on executor SERVERNAMEREMOVED: org.apache.spark.api.python.PythonException (Traceback (most recent call last):
   File "/yarn/nm/usercache/evan/filecache/173/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/worker.py", line 75, in main
     command = pickleSer._read_with_length(infile)
   File "/yarn/nm/usercache/evan/filecache/173/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py", line 150, in _read_with_length
     return self.loads(obj)
 ImportError: No module named algos
 {code}
 The example works fine if I move the statement "from random import random" 
 from the top of the file into the function (def f(_)) defined in the example.  
 Near as I can tell, "random" is getting confused with a function of the same 
 name within pandas.algos.
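 A sketch of that working variant (assuming the stock pi.py structure; only the 
 placement of the import changes):
 {code}
 import pandas                    # still imported at the top, driver-side only

 def f(_):
     from random import random   # moved inside f: resolved on the worker at call
     x = random() * 2 - 1         # time, so no pandas name is captured in the pickle
     y = random() * 2 - 1
     return 1 if x ** 2 + y ** 2 < 1 else 0
 {code}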
 Submitting the same script using "--master local" works, but it dumps a 
 distressing amount of random characters to stdout or stderr and messes up my 
 terminal:
 {code}
 ...
 @J@J@J@J@J@J@J@J@J@J@J@J@J@JJ@J@J@J@J 
 [... many screenfuls of raw binary/pickled bytes elided ...]
 14/09/23 15:42:09 INFO SparkContext: Job finished: reduce at 
 /home/evan/pub_src/spark-1.1.0/examples/src/main/python/pi_sframe.py:38, took 
 11.276879779 s
 [... more raw binary bytes elided ...]
 Pi is roughly 3.146136
 {code}
 No idea if that's related, but thought I'd include it.




[jira] [Commented] (SPARK-3662) Importing pandas breaks included pi.py example

2014-09-24 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146017#comment-14146017
 ] 

Sean Owen commented on SPARK-3662:
--

Maybe I'm missing something, but does this just mean you can't import pandas 
wholesale? If you're modifying the example, you should import only what you need 
from pandas. Or it may be that you do indeed need to modify the "import random" 
to accommodate the other modifications you want to make.

But what is the problem with the included example? It runs fine without 
modifications, no?




[jira] [Commented] (SPARK-3662) Importing pandas breaks included pi.py example

2014-09-24 Thread Evan Samanas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146639#comment-14146639
 ] 

Evan Samanas commented on SPARK-3662:
-

I wouldn't focus on the example, the fact that I modified it, or whether I should 
be importing a smaller portion of pandas.  The issue here is that Spark breaks in 
this case because of a name collision.  Modifying the example is simply the one 
reproducer I've found.

I was modifying the example to learn how Spark ships Python code to the 
cluster.  In this case, I expected pandas to be imported only in the driver 
program and not by any workers.  The workers do not have pandas installed, so the 
expected behavior is that the example runs to completion; an ImportError would 
mean the workers are importing things they don't need for the task at hand.

The way I expected Spark to work IS actually how Spark works: modules are 
imported by a worker only if a function shipped to it uses them. But this error 
showed me false evidence to the contrary.  I'm assuming the bug is in Spark's 
modifications to CloudPickle, not in the way the example is set up.
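
A minimal sketch of the mechanism in question, using the standalone cloudpickle 
package rather than the modified copy PySpark bundles, so this only approximates 
the worker path:

{code}
import pickle
import cloudpickle

from random import random      # module-level import, as in pi.py

def f(_):
    return random() * 2 - 1    # `random` is a global from f's point of view

blob = cloudpickle.dumps(f)    # records, per global, which module to re-import it from
g = pickle.loads(blob)         # deserializing on a worker triggers that re-import
print(g(None))
{code}

If the pickler records the wrong home module for "random" (say, pandas.algos 
instead of the stdlib random module, once pandas has been imported in the 
driver), a worker without pandas then fails inside loads() with exactly the 
ImportError above.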
