[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots

2018-01-23 Thread Cricket Temple (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336070#comment-16336070 ]

Cricket Temple commented on SPARK-22809:


You've got to run it in IPython or Zeppelin.




> pyspark is sensitive to imports with dots
> -----------------------------------------
>
>                  Key: SPARK-22809
>                  URL: https://issues.apache.org/jira/browse/SPARK-22809
>              Project: Spark
>           Issue Type: Bug
>           Components: PySpark
>     Affects Versions: 2.2.0, 2.2.1
>             Reporter: Cricket Temple
>             Assignee: holdenk
>             Priority: Major
>
> User code can fail with dotted imports.  Here's a repro script.
> {noformat}
> import numpy as np
> import pandas as pd
> import pyspark
> import scipy.interpolate
> import scipy.interpolate as scipy_interpolate
> import py4j
>
> scipy_interpolate2 = scipy.interpolate
>
> sc = pyspark.SparkContext()
> spark_session = pyspark.SQLContext(sc)
>
> #######################################################
> # The details of this dataset are irrelevant          #
> # Sorry if you'd have preferred something more boring #
> #######################################################
> x__ = np.linspace(0, 10, 1000)
> freq__ = np.arange(1, 5)
> x_, freq_ = np.ix_(x__, freq__)
> y = np.sin(x_ * freq_).ravel()
> x = (x_ * np.ones(freq_.shape)).ravel()
> freq = (np.ones(x_.shape) * freq_).ravel()
> df_pd = pd.DataFrame(np.stack([x, y, freq]).T, columns=['x', 'y', 'freq'])
> df_sk = spark_session.createDataFrame(df_pd)
> assert (df_sk.toPandas() == df_pd).all().all()
>
> try:
>     import matplotlib.pyplot as plt
>     for f, data in df_pd.groupby("freq"):
>         plt.plot(*data[['x', 'y']].values.T)
>     plt.show()
> except Exception:
>     print("I guess we can't plot anything")
>
> def mymap(x, interp_fn):
>     df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
>     return interp_fn(df.x.values, df.y.values)(np.pi)
>
> df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()
>
> # Works: the mapped function references the submodule through an alias.
> result = df_by_freq.mapValues(lambda x: mymap(x, scipy_interpolate.interp1d)).collect()
> assert np.allclose(np.array(list(zip(*result))[1]), np.zeros(len(freq__)), atol=1e-6)
>
> # Fails: the same call referenced through the dotted package path.
> try:
>     result = df_by_freq.mapValues(lambda x: mymap(x, scipy.interpolate.interp1d)).collect()
>     raise Exception("Not going to reach this line")
> except py4j.protocol.Py4JJavaError as e:
>     print("See?")
>
> # Works: a plain name bound to the submodule object.
> result = df_by_freq.mapValues(lambda x: mymap(x, scipy_interpolate2.interp1d)).collect()
> assert np.allclose(np.array(list(zip(*result))[1]), np.zeros(len(freq__)), atol=1e-6)
>
> # But now the dotted form works too!
> result = df_by_freq.mapValues(lambda x: mymap(x, scipy.interpolate.interp1d)).collect()
> assert np.allclose(np.array(list(zip(*result))[1]), np.zeros(len(freq__)), atol=1e-6)
> {noformat}





[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots

2018-01-05 Thread Cricket Temple (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314318#comment-16314318 ]

Cricket Temple commented on SPARK-22809:


Much shorter version:

{code:python}
import cloudpickle
import pyspark
import py4j

sc = pyspark.SparkContext()
rdd = sc.parallelize([(1, 2)])

import scipy.interpolate

def foo(*args, **kwargs):
    scipy.interpolate.interp1d

try:
    rdd.mapValues(foo).collect()
except py4j.protocol.Py4JJavaError as err:
    print("it errored")

import scipy.interpolate as scipy_interpolate

def bar(*args, **kwargs):
    scipy_interpolate.interp1d

rdd.mapValues(bar).collect()
print("worked")

# Once bar has run on the workers, foo works there too.
rdd.mapValues(foo).collect()
print("worked")
{code}
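
For what it's worth, here is a Spark-free sketch of what I'd guess is going on (an assumption based on how cloudpickle restores a function's module globals, not a confirmed diagnosis; the helper script and its names are my own, hypothetical). The pickled function's only global is the top-level package scipy, and unpickling it in a fresh interpreter re-imports just that package, so the submodule attribute is missing there:

{code:python}
import subprocess
import sys

import cloudpickle
import scipy.interpolate


def uses_dotted_import():
    # cloudpickle captures the global name ``scipy`` here -- the top-level
    # package object, not the ``scipy.interpolate`` submodule.
    return scipy.interpolate.interp1d


payload = cloudpickle.dumps(uses_dotted_import)

# Unpickle in a clean interpreter, roughly what a fresh executor does.
child_src = (
    "import pickle, sys\n"
    "fn = pickle.loads(sys.stdin.buffer.read())\n"
    "try:\n"
    "    fn()\n"
    "    print('worked')\n"
    "except AttributeError as exc:\n"
    "    print('failed: %s' % exc)\n"
)
proc = subprocess.run([sys.executable, "-c", child_src],
                      input=payload, stdout=subprocess.PIPE)
print(proc.stdout.decode().strip())
{code}

On the cloudpickle bundled with Spark 2.2 I'd expect the failure branch; newer cloudpickle releases learned to track imported submodules, so there this sketch should print "worked". It would also explain why foo succeeds after bar has run: once the aliased import has executed on a worker, scipy.interpolate sits in that worker's sys.modules, and reused workers then resolve the dotted form as well.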


> pyspark is sensitive to imports with dots
> -----------------------------------------
>
>                  Key: SPARK-22809
>                  URL: https://issues.apache.org/jira/browse/SPARK-22809
>              Project: Spark
>           Issue Type: Bug
>           Components: PySpark
>     Affects Versions: 2.2.0, 2.2.1
>             Reporter: Cricket Temple
>             Assignee: holdenk
>
> User code can fail with dotted imports.  (Same repro script as quoted in full above.)





[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots

2017-12-15 Thread Cricket Temple (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293209#comment-16293209 ]

Cricket Temple commented on SPARK-22809:


Outputs: When I run it, it plots a picture and prints "See?"

This is certainly "unexpected behavior" for me.
{noformat}
> import a.b
> import a.b as a_b
> a_b2 = a.b
> your_function(a_b)
Yay!
> your_function(a.b)
Boo!
> your_function(a_b2)
Yay!
> your_function(a.b)
Yay!
{noformat}

The problem is that when people port code to pyspark, they're going to hit errors until they go through and update their imports to avoid this pattern.  If it's possible to trigger this from inside a library (I don't know whether it is), that could be hard to work around.
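
In the meantime, two workaround sketches that follow from the behavior above (my own suggestions, assuming the problem is limited to how a closure's module globals are pickled; the function names are made up for illustration, mirroring mymap from the repro):

{code:python}
import numpy as np
import pandas as pd

# Option 1: alias the submodule at import time, so the mapped function's
# global is the submodule object itself rather than the top-level package.
import scipy.interpolate as scipy_interpolate

def mymap_alias(rows):
    df = pd.DataFrame.from_records([row.asDict() for row in rows])
    return scipy_interpolate.interp1d(df.x.values, df.y.values)(np.pi)

# Option 2: import inside the function, so each executor performs the
# full dotted import itself when the function first runs.
def mymap_local_import(rows):
    import scipy.interpolate
    df = pd.DataFrame.from_records([row.asDict() for row in rows])
    return scipy.interpolate.interp1d(df.x.values, df.y.values)(np.pi)
{code}

Either one can be passed to df_by_freq.mapValues(...) in place of the failing lambda; option 2 also keeps ported code independent of worker reuse, since the import happens wherever the function actually executes.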


> pyspark is sensitive to imports with dots
> -----------------------------------------
>
>                  Key: SPARK-22809
>                  URL: https://issues.apache.org/jira/browse/SPARK-22809
>              Project: Spark
>           Issue Type: Bug
>           Components: PySpark
>     Affects Versions: 2.2.0
>             Reporter: Cricket Temple
>
> User code can fail with dotted imports.  (Same repro script as quoted in full above.)





[jira] [Updated] (SPARK-22809) pyspark is sensitive to imports with dots

2017-12-15 Thread Cricket Temple (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cricket Temple updated SPARK-22809:
-----------------------------------
Description: 
User code can fail with dotted imports.  Here's a repro script.

{noformat}
import numpy as np
import pandas as pd
import pyspark
import scipy.interpolate
import scipy.interpolate as scipy_interpolate
import py4j

scipy_interpolate2 = scipy.interpolate

sc = pyspark.SparkContext()
spark_session = pyspark.SQLContext(sc)

#######################################################
# The details of this dataset are irrelevant          #
# Sorry if you'd have preferred something more boring #
#######################################################
x__ = np.linspace(0, 10, 1000)
freq__ = np.arange(1, 5)
x_, freq_ = np.ix_(x__, freq__)
y = np.sin(x_ * freq_).ravel()
x = (x_ * np.ones(freq_.shape)).ravel()
freq = (np.ones(x_.shape) * freq_).ravel()
df_pd = pd.DataFrame(np.stack([x, y, freq]).T, columns=['x', 'y', 'freq'])
df_sk = spark_session.createDataFrame(df_pd)
assert (df_sk.toPandas() == df_pd).all().all()

try:
    import matplotlib.pyplot as plt
    for f, data in df_pd.groupby("freq"):
        plt.plot(*data[['x', 'y']].values.T)
    plt.show()
except Exception:
    print("I guess we can't plot anything")

def mymap(x, interp_fn):
    df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
    return interp_fn(df.x.values, df.y.values)(np.pi)

df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()

# Works: the mapped function references the submodule through an alias.
result = df_by_freq.mapValues(lambda x: mymap(x, scipy_interpolate.interp1d)).collect()
assert np.allclose(np.array(list(zip(*result))[1]), np.zeros(len(freq__)), atol=1e-6)

# Fails: the same call referenced through the dotted package path.
try:
    result = df_by_freq.mapValues(lambda x: mymap(x, scipy.interpolate.interp1d)).collect()
    raise Exception("Not going to reach this line")
except py4j.protocol.Py4JJavaError as e:
    print("See?")

# Works: a plain name bound to the submodule object.
result = df_by_freq.mapValues(lambda x: mymap(x, scipy_interpolate2.interp1d)).collect()
assert np.allclose(np.array(list(zip(*result))[1]), np.zeros(len(freq__)), atol=1e-6)

# But now the dotted form works too!
result = df_by_freq.mapValues(lambda x: mymap(x, scipy.interpolate.interp1d)).collect()
assert np.allclose(np.array(list(zip(*result))[1]), np.zeros(len(freq__)), atol=1e-6)
{noformat}

  was:
User code can fail with dotted imports.  Here's a repro script.

{noformat}
import numpy as np
import pandas as pd
import pyspark
import scipy.interpolate
import scipy.interpolate as scipy_interpolate
import py4j

sc = pyspark.SparkContext()
spark_session = pyspark.SQLContext(sc)

#######################################################
# The details of this dataset are irrelevant          #
# Sorry if you'd have preferred something more boring #
#######################################################
x__ = np.linspace(0, 10, 1000)
freq__ = np.arange(1, 5)
x_, freq_ = np.ix_(x__, freq__)
y = np.sin(x_ * freq_).ravel()
x = (x_ * np.ones(freq_.shape)).ravel()
freq = (np.ones(x_.shape) * freq_).ravel()
df_pd = pd.DataFrame(np.stack([x, y, freq]).T, columns=['x', 'y', 'freq'])
df_sk = spark_session.createDataFrame(df_pd)
assert (df_sk.toPandas() == df_pd).all().all()

try:
    import matplotlib.pyplot as plt
    for f, data in df_pd.groupby("freq"):
        plt.plot(*data[['x', 'y']].values.T)
    plt.show()
except Exception:
    print("I guess we can't plot anything")

def mymap(x, interp_fn):
    df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
    return interp_fn(df.x.values, df.y.values)(np.pi)

df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()

result = df_by_freq.mapValues(lambda x: mymap(x, scipy_interpolate.interp1d)).collect()
assert np.allclose(np.array(list(zip(*result))[1]), np.zeros(len(freq__)), atol=1e-6)
try:
    result = df_by_freq.mapValues(lambda x: mymap(x, scipy.interpolate.interp1d)).collect()
    assert False, "Not going to reach this line"
except py4j.protocol.Py4JJavaError as e:
    print("See?")
{noformat}


> pyspark is sensitive to imports with dots
> -----------------------------------------
>
>                  Key: SPARK-22809
>                  URL: https://issues.apache.org/jira/browse/SPARK-22809
>              Project: Spark
>           Issue Type: Bug
>           Components: PySpark
>     Affects Versions: 2.2.0
>             Reporter: Cricket Temple
>
> User code can fail with dotted imports.  (Same repro script as quoted in full above.)

[jira] [Created] (SPARK-22809) pyspark is sensitive to imports with dots

2017-12-15 Thread Cricket Temple (JIRA)
Cricket Temple created SPARK-22809:
-----------------------------------

             Summary: pyspark is sensitive to imports with dots
                 Key: SPARK-22809
                 URL: https://issues.apache.org/jira/browse/SPARK-22809
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.2.0
            Reporter: Cricket Temple


User code can fail with dotted imports.  Here's a repro script.

{noformat}
import numpy as np
import pandas as pd
import pyspark
import scipy.interpolate
import scipy.interpolate as scipy_interpolate
import py4j

sc = pyspark.SparkContext()
spark_session = pyspark.SQLContext(sc)

#######################################################
# The details of this dataset are irrelevant          #
# Sorry if you'd have preferred something more boring #
#######################################################
x__ = np.linspace(0, 10, 1000)
freq__ = np.arange(1, 5)
x_, freq_ = np.ix_(x__, freq__)
y = np.sin(x_ * freq_).ravel()
x = (x_ * np.ones(freq_.shape)).ravel()
freq = (np.ones(x_.shape) * freq_).ravel()
df_pd = pd.DataFrame(np.stack([x, y, freq]).T, columns=['x', 'y', 'freq'])
df_sk = spark_session.createDataFrame(df_pd)
assert (df_sk.toPandas() == df_pd).all().all()

try:
    import matplotlib.pyplot as plt
    for f, data in df_pd.groupby("freq"):
        plt.plot(*data[['x', 'y']].values.T)
    plt.show()
except Exception:
    print("I guess we can't plot anything")

def mymap(x, interp_fn):
    df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
    return interp_fn(df.x.values, df.y.values)(np.pi)

df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()

result = df_by_freq.mapValues(lambda x: mymap(x, scipy_interpolate.interp1d)).collect()
assert np.allclose(np.array(list(zip(*result))[1]), np.zeros(len(freq__)), atol=1e-6)
try:
    result = df_by_freq.mapValues(lambda x: mymap(x, scipy.interpolate.interp1d)).collect()
    assert False, "Not going to reach this line"
except py4j.protocol.Py4JJavaError as e:
    print("See?")
{noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org