[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots

2018-10-23 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661502#comment-16661502
 ] 

Bryan Cutler commented on SPARK-22809:
--

Sure, I probably shouldn't have tested out of the branches. Running tests again 
from IPython with Python 3.6.6:

 

*v2.2.2* - Error is raised

*v2.3.2* - Working

*v2.4.0-rc4* - Working

 

>From those results, it seems like SPARK-21070 most likely fixed it

> pyspark is sensitive to imports with dots
> -
>
> Key: SPARK-22809
> URL: https://issues.apache.org/jira/browse/SPARK-22809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Cricket Temple
>Assignee: holdenk
>Priority: Major
> Fix For: 2.4.0
>
>
> User code can fail with dotted imports.  Here's a repro script.
> {noformat}
> import numpy as np
> import pandas as pd
> import pyspark
> import scipy.interpolate
> import scipy.interpolate as scipy_interpolate
> import py4j
> scipy_interpolate2 = scipy.interpolate
> sc = pyspark.SparkContext()
> spark_session = pyspark.SQLContext(sc)
> ###
> # The details of this dataset are irrelevant  #
> # Sorry if you'd have preferred something more boring #
> ###
> x__ = np.linspace(0,10,1000)
> freq__ = np.arange(1,5)
> x_, freq_ = np.ix_(x__, freq__)
> y = np.sin(x_ * freq_).ravel()
> x = (x_ * np.ones(freq_.shape)).ravel()
> freq = (np.ones(x_.shape) * freq_).ravel()
> df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
> df_sk = spark_session.createDataFrame(df_pd)
> assert(df_sk.toPandas() == df_pd).all().all()
> try:
> import matplotlib.pyplot as plt
> for f, data in df_pd.groupby("freq"):
> plt.plot(*data[['x','y']].values.T)
> plt.show()
> except:
> print("I guess we can't plot anything")
> def mymap(x, interp_fn):
> df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
> return interp_fn(df.x.values, df.y.values)(np.pi)
> df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> try:
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> raise Excpetion("Not going to reach this line")
> except py4j.protocol.Py4JJavaError, e:
> print("See?")
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate2.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> # But now it works!
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots

2018-10-23 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661444#comment-16661444
 ] 

Dongjoon Hyun commented on SPARK-22809:
---

Hi, [~bryanc]. It seems that the test occurs in `branch-2.2`. Could you confirm 
2.3.2, too?

> pyspark is sensitive to imports with dots
> -
>
> Key: SPARK-22809
> URL: https://issues.apache.org/jira/browse/SPARK-22809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Cricket Temple
>Assignee: holdenk
>Priority: Major
> Fix For: 2.4.0
>
>
> User code can fail with dotted imports.  Here's a repro script.
> {noformat}
> import numpy as np
> import pandas as pd
> import pyspark
> import scipy.interpolate
> import scipy.interpolate as scipy_interpolate
> import py4j
> scipy_interpolate2 = scipy.interpolate
> sc = pyspark.SparkContext()
> spark_session = pyspark.SQLContext(sc)
> ###
> # The details of this dataset are irrelevant  #
> # Sorry if you'd have preferred something more boring #
> ###
> x__ = np.linspace(0,10,1000)
> freq__ = np.arange(1,5)
> x_, freq_ = np.ix_(x__, freq__)
> y = np.sin(x_ * freq_).ravel()
> x = (x_ * np.ones(freq_.shape)).ravel()
> freq = (np.ones(x_.shape) * freq_).ravel()
> df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
> df_sk = spark_session.createDataFrame(df_pd)
> assert(df_sk.toPandas() == df_pd).all().all()
> try:
> import matplotlib.pyplot as plt
> for f, data in df_pd.groupby("freq"):
> plt.plot(*data[['x','y']].values.T)
> plt.show()
> except:
> print("I guess we can't plot anything")
> def mymap(x, interp_fn):
> df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
> return interp_fn(df.x.values, df.y.values)(np.pi)
> df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> try:
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> raise Excpetion("Not going to reach this line")
> except py4j.protocol.Py4JJavaError, e:
> print("See?")
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate2.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> # But now it works!
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots

2018-10-23 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661418#comment-16661418
 ] 

Bryan Cutler commented on SPARK-22809:
--

I confirmed that I could reproduce in IPython with Spark branch-2.3 and did not 
have the issue with branch-2.4. I think we can close this issue
{noformat}
    __
 / __/__  ___ _/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.1-SNAPSHOT
  /_/

Using Python version 3.6.6 (default, Oct 12 2018 14:08:43)
SparkSession available as 'spark'.

In [1]: import pyspark.cloudpickle
   ...: import pyspark
   ...: import py4j
   ...: rdd = sc.parallelize([(1,2)])
   ...: import scipy.interpolate
     

In [2]: import scipy.interpolate
   ...: def foo(*ards, **kwd):
   ...: scipy.interpolate.interp1d
   ...: try:
   ...: rdd.mapValues(foo).collect()
   ...: except py4j.protocol.Py4JJavaError as err:
   ...: print("it errored")
   ...: import scipy.interpolate as scipy_interpolate
   ...: def bar(*ards, **kwd):
   ...: scipy_interpolate.interp1d
   ...: rdd.mapValues(bar).collect()
   ...: print("worked")
   ...: rdd.mapValues(foo).collect()
   ...: print("worked") 
     
worked  
worked{noformat}

{noformat}
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.3-SNAPSHOT
  /_/

Using Python version 3.6.6 (default, Oct 12 2018 14:08:43)
SparkSession available as 'spark'.

In [1]: import pyspark.cloudpickle 
   ...: import pyspark 
   ...: import py4j 
   ...: rdd = sc.parallelize([(1,2)]) 
   ...: import scipy.interpolate
 

In [2]: import scipy.interpolate 
   ...: def foo(*ards, **kwd): 
   ...: scipy.interpolate.interp1d 
   ...: try: 
   ...: rdd.mapValues(foo).collect() 
   ...: except py4j.protocol.Py4JJavaError as err: 
   ...: print("it errored") 
   ...: import scipy.interpolate as scipy_interpolate 
   ...: def bar(*ards, **kwd): 
   ...: scipy_interpolate.interp1d 
   ...: rdd.mapValues(bar).collect() 
   ...: print("worked") 
   ...: rdd.mapValues(foo).collect() 
   ...: print("worked") 
 
18/10/23 15:39:54 ERROR Executor: Exception in task 7.0 in stage 0.0 (TID 7)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
196, in main
process()
  File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
191, in process
serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/serializers.py", 
line 268, in dump_stream
vs = list(itertools.islice(iterator, batch))
  File "/home/bryan/git/spark/python/pyspark/rdd.py", line 1951, in 
map_values_fn = lambda kv: (kv[0], f(kv[1]))
  File "", line 3, in foo
AttributeError: module 'scipy' has no attribute 'interpolate'

at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:197)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:238)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:156)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:67)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:344)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[Stage 0:>  (0 + 8) / 
8]18/10/23 15:39:54 WARN TaskSetManager: Lost task 7.0 in stage 0.0 (TID 7, 
localhost, executor driver): org.apache.spark.api.python.PythonException: 
Traceback (most recent call last):
  File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
196, in main
process()
  File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
191, in process
serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/serializers.

[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots

2018-10-22 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659950#comment-16659950
 ] 

Wenchen Fan commented on SPARK-22809:
-

is this issue fixed? Anyway I'm removing the target version, since it's not a 
blocker and we are not able to fix it before 2.4.0.

> pyspark is sensitive to imports with dots
> -
>
> Key: SPARK-22809
> URL: https://issues.apache.org/jira/browse/SPARK-22809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Cricket Temple
>Assignee: holdenk
>Priority: Major
>
> User code can fail with dotted imports.  Here's a repro script.
> {noformat}
> import numpy as np
> import pandas as pd
> import pyspark
> import scipy.interpolate
> import scipy.interpolate as scipy_interpolate
> import py4j
> scipy_interpolate2 = scipy.interpolate
> sc = pyspark.SparkContext()
> spark_session = pyspark.SQLContext(sc)
> ###
> # The details of this dataset are irrelevant  #
> # Sorry if you'd have preferred something more boring #
> ###
> x__ = np.linspace(0,10,1000)
> freq__ = np.arange(1,5)
> x_, freq_ = np.ix_(x__, freq__)
> y = np.sin(x_ * freq_).ravel()
> x = (x_ * np.ones(freq_.shape)).ravel()
> freq = (np.ones(x_.shape) * freq_).ravel()
> df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
> df_sk = spark_session.createDataFrame(df_pd)
> assert(df_sk.toPandas() == df_pd).all().all()
> try:
> import matplotlib.pyplot as plt
> for f, data in df_pd.groupby("freq"):
> plt.plot(*data[['x','y']].values.T)
> plt.show()
> except:
> print("I guess we can't plot anything")
> def mymap(x, interp_fn):
> df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
> return interp_fn(df.x.values, df.y.values)(np.pi)
> df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> try:
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> raise Excpetion("Not going to reach this line")
> except py4j.protocol.Py4JJavaError, e:
> print("See?")
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate2.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> # But now it works!
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots

2018-05-25 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491107#comment-16491107
 ] 

Marcelo Vanzin commented on SPARK-22809:


I'm removing 2.3.1 since it doesn't seem there's any activity here. Please 
re-add if you plan to work on this for that release.

> pyspark is sensitive to imports with dots
> -
>
> Key: SPARK-22809
> URL: https://issues.apache.org/jira/browse/SPARK-22809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Cricket Temple
>Assignee: holdenk
>Priority: Major
>
> User code can fail with dotted imports.  Here's a repro script.
> {noformat}
> import numpy as np
> import pandas as pd
> import pyspark
> import scipy.interpolate
> import scipy.interpolate as scipy_interpolate
> import py4j
> scipy_interpolate2 = scipy.interpolate
> sc = pyspark.SparkContext()
> spark_session = pyspark.SQLContext(sc)
> ###
> # The details of this dataset are irrelevant  #
> # Sorry if you'd have preferred something more boring #
> ###
> x__ = np.linspace(0,10,1000)
> freq__ = np.arange(1,5)
> x_, freq_ = np.ix_(x__, freq__)
> y = np.sin(x_ * freq_).ravel()
> x = (x_ * np.ones(freq_.shape)).ravel()
> freq = (np.ones(x_.shape) * freq_).ravel()
> df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
> df_sk = spark_session.createDataFrame(df_pd)
> assert(df_sk.toPandas() == df_pd).all().all()
> try:
> import matplotlib.pyplot as plt
> for f, data in df_pd.groupby("freq"):
> plt.plot(*data[['x','y']].values.T)
> plt.show()
> except:
> print("I guess we can't plot anything")
> def mymap(x, interp_fn):
> df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
> return interp_fn(df.x.values, df.y.values)(np.pi)
> df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> try:
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> raise Excpetion("Not going to reach this line")
> except py4j.protocol.Py4JJavaError, e:
> print("See?")
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate2.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> # But now it works!
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots

2018-03-27 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416038#comment-16416038
 ] 

holdenk commented on SPARK-22809:
-

This _should_ be resolved by SPARK-23169 but I'll double check when I've got 
some cycles set aside this Friday..

> pyspark is sensitive to imports with dots
> -
>
> Key: SPARK-22809
> URL: https://issues.apache.org/jira/browse/SPARK-22809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Cricket Temple
>Assignee: holdenk
>Priority: Major
>
> User code can fail with dotted imports.  Here's a repro script.
> {noformat}
> import numpy as np
> import pandas as pd
> import pyspark
> import scipy.interpolate
> import scipy.interpolate as scipy_interpolate
> import py4j
> scipy_interpolate2 = scipy.interpolate
> sc = pyspark.SparkContext()
> spark_session = pyspark.SQLContext(sc)
> ###
> # The details of this dataset are irrelevant  #
> # Sorry if you'd have preferred something more boring #
> ###
> x__ = np.linspace(0,10,1000)
> freq__ = np.arange(1,5)
> x_, freq_ = np.ix_(x__, freq__)
> y = np.sin(x_ * freq_).ravel()
> x = (x_ * np.ones(freq_.shape)).ravel()
> freq = (np.ones(x_.shape) * freq_).ravel()
> df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
> df_sk = spark_session.createDataFrame(df_pd)
> assert(df_sk.toPandas() == df_pd).all().all()
> try:
> import matplotlib.pyplot as plt
> for f, data in df_pd.groupby("freq"):
> plt.plot(*data[['x','y']].values.T)
> plt.show()
> except:
> print("I guess we can't plot anything")
> def mymap(x, interp_fn):
> df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
> return interp_fn(df.x.values, df.y.values)(np.pi)
> df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> try:
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> raise Excpetion("Not going to reach this line")
> except py4j.protocol.Py4JJavaError, e:
> print("See?")
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate2.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> # But now it works!
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots

2018-01-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340344#comment-16340344
 ] 

Sean Owen commented on SPARK-22809:
---

Might duplicate SPARK-23159 but I wasn't sure.

> pyspark is sensitive to imports with dots
> -
>
> Key: SPARK-22809
> URL: https://issues.apache.org/jira/browse/SPARK-22809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Cricket Temple
>Assignee: holdenk
>Priority: Major
>
> User code can fail with dotted imports.  Here's a repro script.
> {noformat}
> import numpy as np
> import pandas as pd
> import pyspark
> import scipy.interpolate
> import scipy.interpolate as scipy_interpolate
> import py4j
> scipy_interpolate2 = scipy.interpolate
> sc = pyspark.SparkContext()
> spark_session = pyspark.SQLContext(sc)
> ###
> # The details of this dataset are irrelevant  #
> # Sorry if you'd have preferred something more boring #
> ###
> x__ = np.linspace(0,10,1000)
> freq__ = np.arange(1,5)
> x_, freq_ = np.ix_(x__, freq__)
> y = np.sin(x_ * freq_).ravel()
> x = (x_ * np.ones(freq_.shape)).ravel()
> freq = (np.ones(x_.shape) * freq_).ravel()
> df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
> df_sk = spark_session.createDataFrame(df_pd)
> assert(df_sk.toPandas() == df_pd).all().all()
> try:
> import matplotlib.pyplot as plt
> for f, data in df_pd.groupby("freq"):
> plt.plot(*data[['x','y']].values.T)
> plt.show()
> except:
> print("I guess we can't plot anything")
> def mymap(x, interp_fn):
> df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
> return interp_fn(df.x.values, df.y.values)(np.pi)
> df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> try:
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> raise Excpetion("Not going to reach this line")
> except py4j.protocol.Py4JJavaError, e:
> print("See?")
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate2.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> # But now it works!
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots

2018-01-23 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336983#comment-16336983
 ] 

holdenk commented on SPARK-22809:
-

oh wait it should work in 0.4.2, I'll poke at that PR more.

> pyspark is sensitive to imports with dots
> -
>
> Key: SPARK-22809
> URL: https://issues.apache.org/jira/browse/SPARK-22809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Cricket Temple
>Assignee: holdenk
>Priority: Major
>
> User code can fail with dotted imports.  Here's a repro script.
> {noformat}
> import numpy as np
> import pandas as pd
> import pyspark
> import scipy.interpolate
> import scipy.interpolate as scipy_interpolate
> import py4j
> scipy_interpolate2 = scipy.interpolate
> sc = pyspark.SparkContext()
> spark_session = pyspark.SQLContext(sc)
> ###
> # The details of this dataset are irrelevant  #
> # Sorry if you'd have preferred something more boring #
> ###
> x__ = np.linspace(0,10,1000)
> freq__ = np.arange(1,5)
> x_, freq_ = np.ix_(x__, freq__)
> y = np.sin(x_ * freq_).ravel()
> x = (x_ * np.ones(freq_.shape)).ravel()
> freq = (np.ones(x_.shape) * freq_).ravel()
> df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
> df_sk = spark_session.createDataFrame(df_pd)
> assert(df_sk.toPandas() == df_pd).all().all()
> try:
> import matplotlib.pyplot as plt
> for f, data in df_pd.groupby("freq"):
> plt.plot(*data[['x','y']].values.T)
> plt.show()
> except:
> print("I guess we can't plot anything")
> def mymap(x, interp_fn):
> df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
> return interp_fn(df.x.values, df.y.values)(np.pi)
> df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> try:
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> raise Excpetion("Not going to reach this line")
> except py4j.protocol.Py4JJavaError, e:
> print("See?")
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate2.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> # But now it works!
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots

2018-01-23 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336979#comment-16336979
 ] 

holdenk commented on SPARK-22809:
-

[~ueshin]: we can push this out to 2.3.1 given we are already in the RC process 
for 2.3.0. I'm not sure if this is fixed in the 0.4.2 upgrade (I think it would 
be in 0.5.2 but we're only partially upgrading). Already looking at the upgrade 
PR.

> pyspark is sensitive to imports with dots
> -
>
> Key: SPARK-22809
> URL: https://issues.apache.org/jira/browse/SPARK-22809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Cricket Temple
>Assignee: holdenk
>Priority: Major
>
> User code can fail with dotted imports.  Here's a repro script.
> {noformat}
> import numpy as np
> import pandas as pd
> import pyspark
> import scipy.interpolate
> import scipy.interpolate as scipy_interpolate
> import py4j
> scipy_interpolate2 = scipy.interpolate
> sc = pyspark.SparkContext()
> spark_session = pyspark.SQLContext(sc)
> ###
> # The details of this dataset are irrelevant  #
> # Sorry if you'd have preferred something more boring #
> ###
> x__ = np.linspace(0,10,1000)
> freq__ = np.arange(1,5)
> x_, freq_ = np.ix_(x__, freq__)
> y = np.sin(x_ * freq_).ravel()
> x = (x_ * np.ones(freq_.shape)).ravel()
> freq = (np.ones(x_.shape) * freq_).ravel()
> df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
> df_sk = spark_session.createDataFrame(df_pd)
> assert(df_sk.toPandas() == df_pd).all().all()
> try:
> import matplotlib.pyplot as plt
> for f, data in df_pd.groupby("freq"):
> plt.plot(*data[['x','y']].values.T)
> plt.show()
> except:
> print("I guess we can't plot anything")
> def mymap(x, interp_fn):
> df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
> return interp_fn(df.x.values, df.y.values)(np.pi)
> df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> try:
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> raise Excpetion("Not going to reach this line")
> except py4j.protocol.Py4JJavaError, e:
> print("See?")
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate2.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> # But now it works!
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots

2018-01-23 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336881#comment-16336881
 ] 

Takuya Ueshin commented on SPARK-22809:
---

I agree with [~dongjoon].

[~CricketScience] Ah, I tried in Jupyter but it should work, right?

Btw, will [https://github.com/apache/spark/pull/20373] fix this issue? Could 
someone take a look at the pr please?

> pyspark is sensitive to imports with dots
> -
>
> Key: SPARK-22809
> URL: https://issues.apache.org/jira/browse/SPARK-22809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Cricket Temple
>Assignee: holdenk
>Priority: Major
>
> User code can fail with dotted imports.  Here's a repro script.
> {noformat}
> import numpy as np
> import pandas as pd
> import pyspark
> import scipy.interpolate
> import scipy.interpolate as scipy_interpolate
> import py4j
> scipy_interpolate2 = scipy.interpolate
> sc = pyspark.SparkContext()
> spark_session = pyspark.SQLContext(sc)
> ###
> # The details of this dataset are irrelevant  #
> # Sorry if you'd have preferred something more boring #
> ###
> x__ = np.linspace(0,10,1000)
> freq__ = np.arange(1,5)
> x_, freq_ = np.ix_(x__, freq__)
> y = np.sin(x_ * freq_).ravel()
> x = (x_ * np.ones(freq_.shape)).ravel()
> freq = (np.ones(x_.shape) * freq_).ravel()
> df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
> df_sk = spark_session.createDataFrame(df_pd)
> assert(df_sk.toPandas() == df_pd).all().all()
> try:
> import matplotlib.pyplot as plt
> for f, data in df_pd.groupby("freq"):
> plt.plot(*data[['x','y']].values.T)
> plt.show()
> except:
> print("I guess we can't plot anything")
> def mymap(x, interp_fn):
> df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
> return interp_fn(df.x.values, df.y.values)(np.pi)
> df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> try:
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> raise Excpetion("Not going to reach this line")
> except py4j.protocol.Py4JJavaError, e:
> print("See?")
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate2.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> # But now it works!
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots

2018-01-23 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336494#comment-16336494
 ] 

Dongjoon Hyun commented on SPARK-22809:
---

Since this is not a regression, can we move the target version from 2.3.0 into 
2.3.1, [~holdenk] and [~ueshin]?

> pyspark is sensitive to imports with dots
> -
>
> Key: SPARK-22809
> URL: https://issues.apache.org/jira/browse/SPARK-22809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Cricket Temple
>Assignee: holdenk
>Priority: Major
>
> User code can fail with dotted imports.  Here's a repro script.
> {noformat}
> import numpy as np
> import pandas as pd
> import pyspark
> import scipy.interpolate
> import scipy.interpolate as scipy_interpolate
> import py4j
> scipy_interpolate2 = scipy.interpolate
> sc = pyspark.SparkContext()
> spark_session = pyspark.SQLContext(sc)
> ###
> # The details of this dataset are irrelevant  #
> # Sorry if you'd have preferred something more boring #
> ###
> x__ = np.linspace(0,10,1000)
> freq__ = np.arange(1,5)
> x_, freq_ = np.ix_(x__, freq__)
> y = np.sin(x_ * freq_).ravel()
> x = (x_ * np.ones(freq_.shape)).ravel()
> freq = (np.ones(x_.shape) * freq_).ravel()
> df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
> df_sk = spark_session.createDataFrame(df_pd)
> assert(df_sk.toPandas() == df_pd).all().all()
> try:
> import matplotlib.pyplot as plt
> for f, data in df_pd.groupby("freq"):
> plt.plot(*data[['x','y']].values.T)
> plt.show()
> except:
> print("I guess we can't plot anything")
> def mymap(x, interp_fn):
> df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
> return interp_fn(df.x.values, df.y.values)(np.pi)
> df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> try:
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> raise Excpetion("Not going to reach this line")
> except py4j.protocol.Py4JJavaError, e:
> print("See?")
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate2.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> # But now it works!
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots

2018-01-23 Thread Cricket Temple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336070#comment-16336070
 ] 

Cricket Temple commented on SPARK-22809:


You've got to run it in ipython or zeppelin




> pyspark is sensitive to imports with dots
> -
>
> Key: SPARK-22809
> URL: https://issues.apache.org/jira/browse/SPARK-22809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Cricket Temple
>Assignee: holdenk
>Priority: Major
>
> User code can fail with dotted imports.  Here's a repro script.
> {noformat}
> import numpy as np
> import pandas as pd
> import pyspark
> import scipy.interpolate
> import scipy.interpolate as scipy_interpolate
> import py4j
> scipy_interpolate2 = scipy.interpolate
> sc = pyspark.SparkContext()
> spark_session = pyspark.SQLContext(sc)
> ###
> # The details of this dataset are irrelevant  #
> # Sorry if you'd have preferred something more boring #
> ###
> x__ = np.linspace(0,10,1000)
> freq__ = np.arange(1,5)
> x_, freq_ = np.ix_(x__, freq__)
> y = np.sin(x_ * freq_).ravel()
> x = (x_ * np.ones(freq_.shape)).ravel()
> freq = (np.ones(x_.shape) * freq_).ravel()
> df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
> df_sk = spark_session.createDataFrame(df_pd)
> assert(df_sk.toPandas() == df_pd).all().all()
> try:
> import matplotlib.pyplot as plt
> for f, data in df_pd.groupby("freq"):
> plt.plot(*data[['x','y']].values.T)
> plt.show()
> except:
> print("I guess we can't plot anything")
> def mymap(x, interp_fn):
> df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
> return interp_fn(df.x.values, df.y.values)(np.pi)
> df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> try:
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> raise Excpetion("Not going to reach this line")
> except py4j.protocol.Py4JJavaError, e:
> print("See?")
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate2.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> # But now it works!
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots

2018-01-23 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16335536#comment-16335536
 ] 

Takuya Ueshin commented on SPARK-22809:
---

[~holdenk] Hi, any updates on this?
 I might miss something, but I couldn't reproduce the problem with the current 
master in my local environment.
 Thanks!

> pyspark is sensitive to imports with dots
> -
>
> Key: SPARK-22809
> URL: https://issues.apache.org/jira/browse/SPARK-22809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Cricket Temple
>Assignee: holdenk
>Priority: Major
>
> User code can fail with dotted imports.  Here's a repro script.
> {noformat}
> import numpy as np
> import pandas as pd
> import pyspark
> import scipy.interpolate
> import scipy.interpolate as scipy_interpolate
> import py4j
> scipy_interpolate2 = scipy.interpolate
> sc = pyspark.SparkContext()
> spark_session = pyspark.SQLContext(sc)
> ###
> # The details of this dataset are irrelevant  #
> # Sorry if you'd have preferred something more boring #
> ###
> x__ = np.linspace(0,10,1000)
> freq__ = np.arange(1,5)
> x_, freq_ = np.ix_(x__, freq__)
> y = np.sin(x_ * freq_).ravel()
> x = (x_ * np.ones(freq_.shape)).ravel()
> freq = (np.ones(x_.shape) * freq_).ravel()
> df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
> df_sk = spark_session.createDataFrame(df_pd)
> assert(df_sk.toPandas() == df_pd).all().all()
> try:
> import matplotlib.pyplot as plt
> for f, data in df_pd.groupby("freq"):
> plt.plot(*data[['x','y']].values.T)
> plt.show()
> except:
> print("I guess we can't plot anything")
> def mymap(x, interp_fn):
> df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
> return interp_fn(df.x.values, df.y.values)(np.pi)
> df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> try:
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> raise Excpetion("Not going to reach this line")
> except py4j.protocol.Py4JJavaError, e:
> print("See?")
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate2.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> # But now it works!
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots

2018-01-05 Thread Cricket Temple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314318#comment-16314318
 ] 

Cricket Temple commented on SPARK-22809:


Much shorter version

{code:python}
import cloudpickle
import pyspark
import py4j
sc = pyspark.SparkContext()
rdd = sc.parallelize([(1,2)])

import scipy.interpolate
def foo(*ards, **kwd):
scipy.interpolate.interp1d
try:
rdd.mapValues(foo).collect()
except py4j.protocol.Py4JJavaError, err:
print("it errored")

import scipy.interpolate as scipy_interpolate
def bar(*ards, **kwd):
scipy_interpolate.interp1d
rdd.mapValues(bar).collect()
print("worked")
rdd.mapValues(foo).collect()
print("worked")
{code}


> pyspark is sensitive to imports with dots
> -
>
> Key: SPARK-22809
> URL: https://issues.apache.org/jira/browse/SPARK-22809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Cricket Temple
>Assignee: holdenk
>
> User code can fail with dotted imports.  Here's a repro script.
> {noformat}
> import numpy as np
> import pandas as pd
> import pyspark
> import scipy.interpolate
> import scipy.interpolate as scipy_interpolate
> import py4j
> scipy_interpolate2 = scipy.interpolate
> sc = pyspark.SparkContext()
> spark_session = pyspark.SQLContext(sc)
> ###
> # The details of this dataset are irrelevant  #
> # Sorry if you'd have preferred something more boring #
> ###
> x__ = np.linspace(0,10,1000)
> freq__ = np.arange(1,5)
> x_, freq_ = np.ix_(x__, freq__)
> y = np.sin(x_ * freq_).ravel()
> x = (x_ * np.ones(freq_.shape)).ravel()
> freq = (np.ones(x_.shape) * freq_).ravel()
> df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
> df_sk = spark_session.createDataFrame(df_pd)
> assert(df_sk.toPandas() == df_pd).all().all()
> try:
> import matplotlib.pyplot as plt
> for f, data in df_pd.groupby("freq"):
> plt.plot(*data[['x','y']].values.T)
> plt.show()
> except:
> print("I guess we can't plot anything")
> def mymap(x, interp_fn):
> df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
> return interp_fn(df.x.values, df.y.values)(np.pi)
> df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> try:
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> raise Excpetion("Not going to reach this line")
> except py4j.protocol.Py4JJavaError, e:
> print("See?")
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate2.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> # But now it works!
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots

2018-01-05 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314308#comment-16314308
 ] 

holdenk commented on SPARK-22809:
-

Note we used a simple 
{code:python}
def foo(x):
return scipy.interpolate.interp1d(x)
cloudpickle.dumps(foo)
{code}

To do the dump and verify the issue.

> pyspark is sensitive to imports with dots
> -
>
> Key: SPARK-22809
> URL: https://issues.apache.org/jira/browse/SPARK-22809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Cricket Temple
>Assignee: holdenk
>
> User code can fail with dotted imports.  Here's a repro script.
> {noformat}
> import numpy as np
> import pandas as pd
> import pyspark
> import scipy.interpolate
> import scipy.interpolate as scipy_interpolate
> import py4j
> scipy_interpolate2 = scipy.interpolate
> sc = pyspark.SparkContext()
> spark_session = pyspark.SQLContext(sc)
> ###
> # The details of this dataset are irrelevant  #
> # Sorry if you'd have preferred something more boring #
> ###
> x__ = np.linspace(0,10,1000)
> freq__ = np.arange(1,5)
> x_, freq_ = np.ix_(x__, freq__)
> y = np.sin(x_ * freq_).ravel()
> x = (x_ * np.ones(freq_.shape)).ravel()
> freq = (np.ones(x_.shape) * freq_).ravel()
> df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
> df_sk = spark_session.createDataFrame(df_pd)
> assert(df_sk.toPandas() == df_pd).all().all()
> try:
> import matplotlib.pyplot as plt
> for f, data in df_pd.groupby("freq"):
> plt.plot(*data[['x','y']].values.T)
> plt.show()
> except:
> print("I guess we can't plot anything")
> def mymap(x, interp_fn):
> df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
> return interp_fn(df.x.values, df.y.values)(np.pi)
> df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> try:
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> raise Excpetion("Not going to reach this line")
> except py4j.protocol.Py4JJavaError, e:
> print("See?")
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate2.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> # But now it works!
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots

2017-12-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293835#comment-16293835
 ] 

Sean Owen commented on SPARK-22809:
---

What is the error? shouldn't this fail on any dot import if so, and can we just 
see that error?
I still don't see how this is a Spark issue vs Python interpreter issue, but 
even there, not sure why this import type would be an issue.
I don't think it's an import thing.

> pyspark is sensitive to imports with dots
> -
>
> Key: SPARK-22809
> URL: https://issues.apache.org/jira/browse/SPARK-22809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Cricket Temple
>
> User code can fail with dotted imports.  Here's a repro script.
> {noformat}
> import numpy as np
> import pandas as pd
> import pyspark
> import scipy.interpolate
> import scipy.interpolate as scipy_interpolate
> import py4j
> scipy_interpolate2 = scipy.interpolate
> sc = pyspark.SparkContext()
> spark_session = pyspark.SQLContext(sc)
> ###
> # The details of this dataset are irrelevant  #
> # Sorry if you'd have preferred something more boring #
> ###
> x__ = np.linspace(0,10,1000)
> freq__ = np.arange(1,5)
> x_, freq_ = np.ix_(x__, freq__)
> y = np.sin(x_ * freq_).ravel()
> x = (x_ * np.ones(freq_.shape)).ravel()
> freq = (np.ones(x_.shape) * freq_).ravel()
> df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
> df_sk = spark_session.createDataFrame(df_pd)
> assert(df_sk.toPandas() == df_pd).all().all()
> try:
> import matplotlib.pyplot as plt
> for f, data in df_pd.groupby("freq"):
> plt.plot(*data[['x','y']].values.T)
> plt.show()
> except:
> print("I guess we can't plot anything")
> def mymap(x, interp_fn):
> df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
> return interp_fn(df.x.values, df.y.values)(np.pi)
> df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> try:
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> raise Excpetion("Not going to reach this line")
> except py4j.protocol.Py4JJavaError, e:
> print("See?")
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate2.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> # But now it works!
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots

2017-12-15 Thread Cricket Temple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293209#comment-16293209
 ] 

Cricket Temple commented on SPARK-22809:


Outputs: When I run it, it plots a picture and prints "See?"

This is certainly "unexpected behavior" for me.
{noformat}
> import a.b
> import a.b as a_b
> a_b2 = a.b
> your_function(a_b)
Yay!
>your_function(a.b)
Boo!
>your_function(a_b2)
Yay!
>your_function(a.b)
Yay!
{noformat}

The problem is that when people port code to pyspark they're going to have 
errors until they go through and update imports to avoid this.  If it's 
possible to trigger this from a library (which I don't know if it is), that 
might be hard to work around.


> pyspark is sensitive to imports with dots
> -
>
> Key: SPARK-22809
> URL: https://issues.apache.org/jira/browse/SPARK-22809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Cricket Temple
>
> User code can fail with dotted imports.  Here's a repro script.
> {noformat}
> import numpy as np
> import pandas as pd
> import pyspark
> import scipy.interpolate
> import scipy.interpolate as scipy_interpolate
> import py4j
> scipy_interpolate2 = scipy.interpolate
> sc = pyspark.SparkContext()
> spark_session = pyspark.SQLContext(sc)
> ###
> # The details of this dataset are irrelevant  #
> # Sorry if you'd have preferred something more boring #
> ###
> x__ = np.linspace(0,10,1000)
> freq__ = np.arange(1,5)
> x_, freq_ = np.ix_(x__, freq__)
> y = np.sin(x_ * freq_).ravel()
> x = (x_ * np.ones(freq_.shape)).ravel()
> freq = (np.ones(x_.shape) * freq_).ravel()
> df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
> df_sk = spark_session.createDataFrame(df_pd)
> assert(df_sk.toPandas() == df_pd).all().all()
> try:
> import matplotlib.pyplot as plt
> for f, data in df_pd.groupby("freq"):
> plt.plot(*data[['x','y']].values.T)
> plt.show()
> except:
> print("I guess we can't plot anything")
> def mymap(x, interp_fn):
> df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
> return interp_fn(df.x.values, df.y.values)(np.pi)
> df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> try:
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> raise Excpetion("Not going to reach this line")
> except py4j.protocol.Py4JJavaError, e:
> print("See?")
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate2.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> # But now it works!
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots

2017-12-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293103#comment-16293103
 ] 

Sean Owen commented on SPARK-22809:
---

It's not clear what the problem is? what does this output?
Spark isn't managing imports, so I am not yet clear how this is a Spark-related 
issue.

> pyspark is sensitive to imports with dots
> -
>
> Key: SPARK-22809
> URL: https://issues.apache.org/jira/browse/SPARK-22809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Cricket Temple
>
> User code can fail with dotted imports.  Here's a repro script.
> {noformat}
> import numpy as np
> import pandas as pd
> import pyspark
> import scipy.interpolate
> import scipy.interpolate as scipy_interpolate
> import py4j
> sc = pyspark.SparkContext()
> spark_session = pyspark.SQLContext(sc)
> ###
> # The details of this dataset are irrelevant  #
> # Sorry if you'd have preferred something more boring #
> ###
> x__ = np.linspace(0,10,1000)
> freq__ = np.arange(1,5)
> x_, freq_ = np.ix_(x__, freq__)
> y = np.sin(x_ * freq_).ravel()
> x = (x_ * np.ones(freq_.shape)).ravel()
> freq = (np.ones(x_.shape) * freq_).ravel()
> df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
> df_sk = spark_session.createDataFrame(df_pd)
> assert(df_sk.toPandas() == df_pd).all().all()
> try:
> import matplotlib.pyplot as plt
> for f, data in df_pd.groupby("freq"):
> plt.plot(*data[['x','y']].values.T)
> plt.show()
> except:
> print("I guess we can't plot anything")
> def mymap(x, interp_fn):
> df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
> return interp_fn(df.x.values, df.y.values)(np.pi)
> df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> try:
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> assert(False, "Not going to reach this line")
> except py4j.protocol.Py4JJavaError, e:
> print("See?")
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org