[jira] [Commented] (SPARK-22947) SPIP: as-of join in Spark SQL

2022-02-22 Thread Li Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17496142#comment-17496142
 ] 

Li Jin commented on SPARK-22947:


For those who are interested, I will no longer work on this SPIP. Currently 
we use two approaches to solve this problem:

(1) We have a basic implementation internally and have been using it for a 
couple of years now:

[https://github.com/icexelloss/spark/commit/31025075a2c1f3bd4c4aa9367b164984e3bf65bb]

(2) We also use a pandas cogroup UDF + pd.merge_asof to solve this.

If anyone is interested in taking (1) and making it officially supported, I am 
happy to help, but I likely won't do it myself.
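
For reference, a minimal sketch of approach (2) using the cogrouped pandas UDF API 
in Spark 3.x is below. The column names ("id", "time", "quantity", "price"), the 
output schema and the one-day tolerance are illustrative assumptions, not our 
internal code:

{code:java}
import pandas as pd

def asof_merge(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    # pd.merge_asof requires both sides to be sorted by the "on" column
    return pd.merge_asof(
        left.sort_values("time"),
        right.sort_values("time"),
        on="time",
        by="id",
        tolerance=pd.Timedelta("1d"))

# dfA/dfB are Spark DataFrames with an "id" key and a timestamp column "time"
result = (
    dfA.groupBy("id")
       .cogroup(dfB.groupBy("id"))
       .applyInPandas(
           asof_merge,
           schema="id long, time timestamp, quantity double, price double"))
{code}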

> SPIP: as-of join in Spark SQL
> -
>
> Key: SPARK-22947
> URL: https://issues.apache.org/jira/browse/SPARK-22947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Li Jin
>Priority: Major
> Attachments: SPIP_ as-of join in Spark SQL (1).pdf
>
>
> h2. Background and Motivation
> Time series analysis is one of the most common forms of analysis on financial data. In 
> time series analysis, as-of join is a very common operation. Supporting as-of 
> join in Spark SQL will allow many use cases of using Spark SQL for time 
> series analysis.
> As-of join is “join on time” with inexact time matching criteria. Various 
> libraries have implemented as-of join or similar functionality:
> Kdb: https://code.kx.com/wiki/Reference/aj
> Pandas: 
> http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof
> R: This functionality is called “Last Observation Carried Forward”
> https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf
> JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin
> Flint: https://github.com/twosigma/flint#temporal-join-functions
> This proposal advocates introducing new API in Spark SQL to support as-of 
> join.
> h2. Target Personas
> Data scientists, data engineers
> h2. Goals
> * New API in Spark SQL that allows as-of join
> * As-of join of multiple tables (>2) should be performant, because it’s very 
> common that users need to join multiple data sources together for further 
> analysis.
> * Define Distribution, Partitioning and shuffle strategy for ordered time 
> series data
> h2. Non-Goals
> These are out of scope for this SPIP and should be considered in a future 
> SPIP as improvements to Spark’s time series analysis ability:
> * Utilize partition information from the data source, i.e., the begin/end of each 
> partition, to reduce sorting/shuffling
> * Define an API for users to implement an as-of join time spec on a business calendar 
> (e.g., look back one business day; this is very common in financial data 
> analysis because of market calendars)
> * Support broadcast join
> h2. Proposed API Changes
> h3. TimeContext
> TimeContext is an object that defines the time scope of the analysis; it has a 
> begin time (inclusive) and an end time (exclusive). Users should be able to 
> change the time scope of the analysis (e.g., from one month to five years) by 
> just changing the TimeContext. 
> To the Spark engine, TimeContext is a hint that:
> * can be used to repartition data for the join
> * serves as a predicate that can be pushed down to the storage layer
> Time context is similar to filtering time by begin/end; the main difference 
> is that the time context can be expanded based on the operation taken (see the 
> example in as-of join).
> Time context example:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> {code}
> h3. asofJoin
> h4. Use Case A (join without key)
> Join two DataFrames on time, with one day lookback:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, quantity
> 20160101, 100
> 20160102, 50
> 20160104, -50
> 20160105, 100
> dfB:
> time, price
> 20151231, 100.0
> 20160104, 105.0
> 20160105, 102.0
> output:
> time, quantity, price
> 20160101, 100, 100.0
> 20160102, 50, null
> 20160104, -50, 105.0
> 20160105, 100, 102.0
> {code}
> Note that row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. This 
> is an important illustration of the time context - it is able to expand the 
> context to 20151231 on dfB because of the one-day lookback.
> h4. Use Case B (join with key)
> To join on time and another key (for instance, id), we use “by” to specify 
> the key.
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = 
> JoinSpec(timeContext).on("time").by("id").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> 

[jira] [Commented] (SPARK-33057) Cannot use filter with window operations

2020-10-16 Thread Li Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215468#comment-17215468
 ] 

Li Jin commented on SPARK-33057:


I agree this is an improvement rather than a bug.

That said, I am not sure why this is marked "Won't Fix"/"Invalid".

Yes, the second snippet adds a projection and the first one doesn't. However, I think 
logically they are doing the same thing - "filtering based on the output of 
a window operation". Whether the output of the window operation is assigned to 
a new field or not doesn't seem to change the logical meaning. 

> Cannot use filter with window operations
> 
>
> Key: SPARK-33057
> URL: https://issues.apache.org/jira/browse/SPARK-33057
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Li Jin
>Priority: Major
>
> Currently, trying to use filter with a window operation will fail:
>  
> {code:java}
> df = spark.range(100)
> win = Window.partitionBy().orderBy('id')
> df.filter(F.rank().over(win) > 10).show()
> {code}
> Error:
> {code:java}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/Users/icexelloss/opt/miniconda3/envs/ibis-dev-spark-3/lib/python3.8/site-packages/pyspark/sql/dataframe.py",
>  line 1461, in filter
>     jdf = self._jdf.filter(condition._jc)
>   File 
> "/Users/icexelloss/opt/miniconda3/envs/ibis-dev-spark-3/lib/python3.8/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
>  line 1304, in __call__
>   File 
> "/Users/icexelloss/opt/miniconda3/envs/ibis-dev-spark-3/lib/python3.8/site-packages/pyspark/sql/utils.py",
>  line 134, in deco
>     raise_from(converted)
>   File "<string>", line 3, in raise_from
> pyspark.sql.utils.AnalysisException: It is not allowed to use window 
> functions inside WHERE clause;{code}
> Although the code is the same as the code below, which works:
> {code:java}
> df = spark.range(100)
> win = Window.partitionBy().orderBy('id')
> df = df.withColumn('rank', F.rank().over(win))
> df = df[df['rank'] > 10]
> df = df.drop('rank'){code}
>  






[jira] [Updated] (SPARK-33057) Cannot use filter with window operations

2020-10-02 Thread Li Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-33057:
---
Description: 
Currently, trying to use filter with a window operation will fail:

 
{code:java}
df = spark.range(100)
win = Window.partitionBy().orderBy('id')
df.filter(F.rank().over(win) > 10).show()
{code}
Error:
{code:java}
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/Users/icexelloss/opt/miniconda3/envs/ibis-dev-spark-3/lib/python3.8/site-packages/pyspark/sql/dataframe.py",
 line 1461, in filter
    jdf = self._jdf.filter(condition._jc)
  File 
"/Users/icexelloss/opt/miniconda3/envs/ibis-dev-spark-3/lib/python3.8/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
 line 1304, in __call__
  File 
"/Users/icexelloss/opt/miniconda3/envs/ibis-dev-spark-3/lib/python3.8/site-packages/pyspark/sql/utils.py",
 line 134, in deco
    raise_from(converted)
  File "", line 3, in raise_from
pyspark.sql.utils.AnalysisException: It is not allowed to use window functions 
inside WHERE clause;{code}
Although the code is the same as the code below, which works:
{code:java}
df = spark.range(100)
win = Window.partitionBy().orderBy('id')
df = df.withColumn('rank', F.rank().over(win))
df = df[df['rank'] > 10]
df = df.drop('rank'){code}
 

  was:
Currently, trying to use filter with a window operation will fail:

 
{code:java}
df = spark.range(100)
win = Window.partitionBy().orderBy('id')
df.filter(F.rank().over(win) > 10).show()
{code}
Error:
{code:java}
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/Users/icexelloss/opt/miniconda3/envs/ibis-dev-spark-3/lib/python3.8/site-packages/pyspark/sql/dataframe.py",
 line 1461, in filter
    jdf = self._jdf.filter(condition._jc)
  File 
"/Users/icexelloss/opt/miniconda3/envs/ibis-dev-spark-3/lib/python3.8/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
 line 1304, in __call__
  File 
"/Users/icexelloss/opt/miniconda3/envs/ibis-dev-spark-3/lib/python3.8/site-packages/pyspark/sql/utils.py",
 line 134, in deco
    raise_from(converted)
  File "", line 3, in raise_from
pyspark.sql.utils.AnalysisException: It is not allowed to use window functions 
inside WHERE clause;{code}
Although the code is the same as:
{code:java}
df = spark.range(100)
win = Window.partitionBy().orderBy('id')
df = df.withColumn('rank', F.rank().over(win))
df = df[df['rank'] > 10]
df = df.drop('rank'){code}
 


> Cannot use filter with window operations
> 
>
> Key: SPARK-33057
> URL: https://issues.apache.org/jira/browse/SPARK-33057
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Li Jin
>Priority: Major
>
> Currently, trying to use filter with a window operation will fail:
>  
> {code:java}
> df = spark.range(100)
> win = Window.partitionBy().orderBy('id')
> df.filter(F.rank().over(win) > 10).show()
> {code}
> Error:
> {code:java}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/Users/icexelloss/opt/miniconda3/envs/ibis-dev-spark-3/lib/python3.8/site-packages/pyspark/sql/dataframe.py",
>  line 1461, in filter
>     jdf = self._jdf.filter(condition._jc)
>   File 
> "/Users/icexelloss/opt/miniconda3/envs/ibis-dev-spark-3/lib/python3.8/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
>  line 1304, in __call__
>   File 
> "/Users/icexelloss/opt/miniconda3/envs/ibis-dev-spark-3/lib/python3.8/site-packages/pyspark/sql/utils.py",
>  line 134, in deco
>     raise_from(converted)
>   File "<string>", line 3, in raise_from
> pyspark.sql.utils.AnalysisException: It is not allowed to use window 
> functions inside WHERE clause;{code}
> Although the code is the same as the code below, which works:
> {code:java}
> df = spark.range(100)
> win = Window.partitionBy().orderBy('id')
> df = df.withColumn('rank', F.rank().over(win))
> df = df[df['rank'] > 10]
> df = df.drop('rank'){code}
>  






[jira] [Created] (SPARK-33057) Cannot use filter with window operations

2020-10-02 Thread Li Jin (Jira)
Li Jin created SPARK-33057:
--

 Summary: Cannot use filter with window operations
 Key: SPARK-33057
 URL: https://issues.apache.org/jira/browse/SPARK-33057
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1
Reporter: Li Jin


Currently, trying to use filter with a window operation will fail:

 
{code:java}
df = spark.range(100)
win = Window.partitionBy().orderBy('id')
df.filter(F.rank().over(win) > 10).show()
{code}
Error:
{code:java}
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/Users/icexelloss/opt/miniconda3/envs/ibis-dev-spark-3/lib/python3.8/site-packages/pyspark/sql/dataframe.py",
 line 1461, in filter
    jdf = self._jdf.filter(condition._jc)
  File 
"/Users/icexelloss/opt/miniconda3/envs/ibis-dev-spark-3/lib/python3.8/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
 line 1304, in __call__
  File 
"/Users/icexelloss/opt/miniconda3/envs/ibis-dev-spark-3/lib/python3.8/site-packages/pyspark/sql/utils.py",
 line 134, in deco
    raise_from(converted)
  File "", line 3, in raise_from
pyspark.sql.utils.AnalysisException: It is not allowed to use window functions 
inside WHERE clause;{code}
Although the code is the same as:
{code:java}
df = spark.range(100)
win = Window.partitionBy().orderBy('id')
df = df.withColumn('rank', F.rank().over(win))
df = df[df['rank'] > 10]
df = df.drop('rank'){code}
 






[jira] [Comment Edited] (SPARK-28482) Data incomplete when using pandas udf in Python 3

2019-07-30 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896132#comment-16896132
 ] 

Li Jin edited comment on SPARK-28482 at 7/30/19 1:28 PM:
-

Hi [~jiangyu1211], thank you for the bug report.
  
 From the description, it looks like an issue in either Spark or Arrow. Since 
it's not clear where the problem is, there needs to be a person leading the 
investigation. It seems that you have done a very good investigation so far; I 
think it'd be great if you continue to chase down the issue with the help of 
both communities. 
  
 As for the problem itself, it seems like the issue happens within the Arrow 
stream format. Since the JVM writes the complete Arrow stream format to the 
socket, one thing I suggest you do is to write the Arrow stream format data to 
a file and try reading it with pyarrow to see if you can reproduce it.
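
For example, a rough sketch (assuming the bytes were dumped to a file such as 
/tmp/arrow_stream.bin; the path and the use of the pyarrow.ipc API are 
illustrative):

{code:java}
import pyarrow as pa

# Read the dumped Arrow streaming-format bytes back and count the rows,
# to check whether the stream itself is already incomplete.
with pa.OSFile("/tmp/arrow_stream.bin", "rb") as source:
    reader = pa.ipc.open_stream(source)
    batches = list(reader)

print("num batches:", len(batches))
print("num rows:", sum(b.num_rows for b in batches))
{code}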


was (Author: icexelloss):
Hi [~jiangyu1211], thank you for the bug report.
  
 From the description, it looks like an issue in either Spark or Arrow. Since 
it's not clear where the problem is, there needs to be a person leading the 
investigation. It seems that you have done a very good investigation so far; I 
think it'd be great if you can lead the investigation with the help of 
both communities. 
  
 To the problem itself, it seems like the issue happens within the Arrow 
stream format. Since the JVM writes the complete Arrow stream format to the 
socket, one thing I suggest you do is to write the Arrow stream format data to 
a file and try reading it with pyarrow to see if you can reproduce it.

> Data incomplete when using pandas udf in Python 3
> -
>
> Key: SPARK-28482
> URL: https://issues.apache.org/jira/browse/SPARK-28482
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3, 2.4.3
> Environment: centos 7.4   
> pyarrow 0.10.0 0.14.0
> python 2.7 3.5 3.6
>Reporter: jiangyu
>Priority: Major
> Attachments: py2.7.png, py3.6.png, test.csv, test.py, worker.png
>
>
> Hi,
>   
>  Since Spark 2.3.x, pandas udf has been introduced as the default ser/des method 
> when using udf. However, an issue arises with Python >= 3.5.x.
>  We use pandas udf to process batches of data, but we find the data is 
> incomplete in Python 3.x. At first, I thought the processing logic might be wrong, 
> so I changed the code to a very simple one, and it had the same problem. After 
> investigating for a week, I found it is related to pyarrow.
>   
>  *Reproduce procedure:*
> 1. prepare data
>  The data has seven columns, a, b, c, d, e, f and g; the data type is Integer
>  a,b,c,d,e,f,g
>  1,2,3,4,5,6,7
>  1,2,3,4,5,6,7
>  1,2,3,4,5,6,7
>  1,2,3,4,5,6,7
>   Produce 100,000 rows and name the file test.csv, upload it to HDFS, then load 
> it and repartition it to 1 partition.
>   
> {code:java}
> df=spark.read.format('csv').option("header","true").load('/test.csv')
> df=df.select(*(col(c).cast("int").alias(c) for c in df.columns))
> df=df.repartition(1)
> spark_context = SparkContext.getOrCreate() {code}
>  
>  2.register pandas udf
>   
> {code:java}
> def add_func(a,b,c,d,e,f,g):
> print('iterator one time')
> return a
> add = pandas_udf(add_func, returnType=IntegerType())
> df_result=df.select(add(col("a"),col("b"),col("c"),col("d"),col("e"),col("f"),col("g"))){code}
>  
>  3.apply pandas udf
>   
> {code:java}
> def trigger_func(iterator):
>       yield iterator
> df_result.rdd.foreachPartition(trigger_func){code}
>  
>  4. Execute it in pyspark (local or yarn)
>  Run it with conf spark.sql.execution.arrow.maxRecordsPerBatch=10. As 
> mentioned before, the total row number is 100,000, so it should print "iterator 
> one time" 10 times.
>  (1)Python 2.7 envs:
>   
> {code:java}
> PYSPARK_PYTHON=/usr/lib/conda/envs/py2.7/bin/python pyspark --conf 
> spark.sql.execution.arrow.maxRecordsPerBatch=10 --conf 
> spark.executor.pyspark.memory=2g --conf 
> spark.sql.execution.arrow.enabled=true --executor-cores 1{code}
>  
>  !py2.7.png!   
>  The result is right: the message is printed 10 times.
>  
>  
> (2)Python 3.5 or 3.6 envs:
> {code:java}
> PYSPARK_PYTHON=/usr/lib/conda/envs/python3.6/bin/python pyspark --conf 
> spark.sql.execution.arrow.maxRecordsPerBatch=10 --conf 
> spark.executor.pyspark.memory=2g --conf 
> spark.sql.execution.arrow.enabled=true --executor-cores{code}
>  
> !py3.6.png!
> The data is incomplete. The exception is printed by the Spark JVM side using 
> logging that we added; I will explain it later.
>   
>   
> h3. *Investigation*
> The “process done” message is added in worker.py.
>  !worker.png!
>  In order to get the exception, we changed the Spark code under 
> core/src/main/scala/org/apache/spark/util/Utils.scala, adding the following code to 
> print the exception.
>   
>  
> {code:java}
> @@ -1362,6 

[jira] [Comment Edited] (SPARK-28482) Data incomplete when using pandas udf in Python 3

2019-07-30 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896132#comment-16896132
 ] 

Li Jin edited comment on SPARK-28482 at 7/30/19 1:28 PM:
-

Hi [~jiangyu1211], thank you for the bug report.
  
 From the description, it looks like an issue in either Spark or Arrow. Since 
it's not clear where the problem is, there needs to be a person leading the 
investigation. It seems that you have done a very good investigation so far; I 
think it'd be great if you can lead the investigation with the help of both 
communities. 
  
 As for the problem itself, it seems like the issue happens within the Arrow 
stream format. Since the JVM writes the complete Arrow stream format to the 
socket, one thing I suggest you do is to write the Arrow stream format data to 
a file and try reading it with pyarrow to see if you can reproduce it.


was (Author: icexelloss):
Hi [~jiangyu1211], thank you for the bug report.
 
From the description, it looks like an issue in either Spark or Arrow. Since 
it's not clear where the problem is, there needs to be a person leading the 
investigation. It seems that you have done a pretty good investigation so far; I 
think it'd be great if you can lead the investigation with the help of both 
communities. 
 
To the problem itself, it seems like the issue happens within the Arrow stream 
format. Since the JVM writes the complete Arrow stream format to the socket, one 
thing I suggest you do is to write the Arrow stream format data to a file and 
try reading it with pyarrow to see if you can reproduce it.

> Data incomplete when using pandas udf in Python 3
> -
>
> Key: SPARK-28482
> URL: https://issues.apache.org/jira/browse/SPARK-28482
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3, 2.4.3
> Environment: centos 7.4   
> pyarrow 0.10.0 0.14.0
> python 2.7 3.5 3.6
>Reporter: jiangyu
>Priority: Major
> Attachments: py2.7.png, py3.6.png, test.csv, test.py, worker.png
>
>
> Hi,
>   
>  Since Spark 2.3.x, pandas udf has been introduced as the default ser/des method 
> when using udf. However, an issue arises with Python >= 3.5.x.
>  We use pandas udf to process batches of data, but we find the data is 
> incomplete in Python 3.x. At first, I thought the processing logic might be wrong, 
> so I changed the code to a very simple one, and it had the same problem. After 
> investigating for a week, I found it is related to pyarrow.
>   
>  *Reproduce procedure:*
> 1. prepare data
>  The data has seven columns, a, b, c, d, e, f and g; the data type is Integer
>  a,b,c,d,e,f,g
>  1,2,3,4,5,6,7
>  1,2,3,4,5,6,7
>  1,2,3,4,5,6,7
>  1,2,3,4,5,6,7
>   Produce 100,000 rows and name the file test.csv, upload it to HDFS, then load 
> it and repartition it to 1 partition.
>   
> {code:java}
> df=spark.read.format('csv').option("header","true").load('/test.csv')
> df=df.select(*(col(c).cast("int").alias(c) for c in df.columns))
> df=df.repartition(1)
> spark_context = SparkContext.getOrCreate() {code}
>  
>  2.register pandas udf
>   
> {code:java}
> def add_func(a,b,c,d,e,f,g):
> print('iterator one time')
> return a
> add = pandas_udf(add_func, returnType=IntegerType())
> df_result=df.select(add(col("a"),col("b"),col("c"),col("d"),col("e"),col("f"),col("g"))){code}
>  
>  3.apply pandas udf
>   
> {code:java}
> def trigger_func(iterator):
>       yield iterator
> df_result.rdd.foreachPartition(trigger_func){code}
>  
>  4. Execute it in pyspark (local or yarn)
>  Run it with conf spark.sql.execution.arrow.maxRecordsPerBatch=10. As 
> mentioned before, the total row number is 100,000, so it should print "iterator 
> one time" 10 times.
>  (1)Python 2.7 envs:
>   
> {code:java}
> PYSPARK_PYTHON=/usr/lib/conda/envs/py2.7/bin/python pyspark --conf 
> spark.sql.execution.arrow.maxRecordsPerBatch=10 --conf 
> spark.executor.pyspark.memory=2g --conf 
> spark.sql.execution.arrow.enabled=true --executor-cores 1{code}
>  
>  !py2.7.png!   
>  The result is right: the message is printed 10 times.
>  
>  
> (2)Python 3.5 or 3.6 envs:
> {code:java}
> PYSPARK_PYTHON=/usr/lib/conda/envs/python3.6/bin/python pyspark --conf 
> spark.sql.execution.arrow.maxRecordsPerBatch=10 --conf 
> spark.executor.pyspark.memory=2g --conf 
> spark.sql.execution.arrow.enabled=true --executor-cores{code}
>  
> !py3.6.png!
> The data is incomplete. The exception is printed by the Spark JVM side using 
> logging that we added; I will explain it later.
>   
>   
> h3. *Investigation*
> The “process done” message is added in worker.py.
>  !worker.png!
>  In order to get the exception, we changed the Spark code under 
> core/src/main/scala/org/apache/spark/util/Utils.scala, adding the following code to 
> print the exception.
>   
>  
> {code:java}
> @@ -1362,6 

[jira] [Commented] (SPARK-28482) Data incomplete when using pandas udf in Python 3

2019-07-30 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896132#comment-16896132
 ] 

Li Jin commented on SPARK-28482:


Hi [~jiangyu1211], thank you for the bug report.
 
From the description, it looks like an issue in either Spark or Arrow. Since 
it's not clear where the problem is, there needs to be a person leading the 
investigation. It seems that you have done a pretty good investigation so far; I 
think it'd be great if you can lead the investigation with the help of both 
communities. 
 
As for the problem itself, it seems like the issue happens within the Arrow stream 
format. Since the JVM writes the complete Arrow stream format to the socket, one 
thing I suggest you do is to write the Arrow stream format data to a file and 
try reading it with pyarrow to see if you can reproduce it.

> Data incomplete when using pandas udf in Python 3
> -
>
> Key: SPARK-28482
> URL: https://issues.apache.org/jira/browse/SPARK-28482
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3, 2.4.3
> Environment: centos 7.4   
> pyarrow 0.10.0 0.14.0
> python 2.7 3.5 3.6
>Reporter: jiangyu
>Priority: Major
> Attachments: py2.7.png, py3.6.png, test.csv, test.py, worker.png
>
>
> Hi,
>   
>  Since Spark 2.3.x, pandas udf has been introduced as the default ser/des method 
> when using udf. However, an issue arises with Python >= 3.5.x.
>  We use pandas udf to process batches of data, but we find the data is 
> incomplete in Python 3.x. At first, I thought the processing logic might be wrong, 
> so I changed the code to a very simple one, and it had the same problem. After 
> investigating for a week, I found it is related to pyarrow.
>   
>  *Reproduce procedure:*
> 1. prepare data
>  The data has seven columns, a, b, c, d, e, f and g; the data type is Integer
>  a,b,c,d,e,f,g
>  1,2,3,4,5,6,7
>  1,2,3,4,5,6,7
>  1,2,3,4,5,6,7
>  1,2,3,4,5,6,7
>   Produce 100,000 rows and name the file test.csv, upload it to HDFS, then load 
> it and repartition it to 1 partition.
>   
> {code:java}
> df=spark.read.format('csv').option("header","true").load('/test.csv')
> df=df.select(*(col(c).cast("int").alias(c) for c in df.columns))
> df=df.repartition(1)
> spark_context = SparkContext.getOrCreate() {code}
>  
>  2.register pandas udf
>   
> {code:java}
> def add_func(a,b,c,d,e,f,g):
> print('iterator one time')
> return a
> add = pandas_udf(add_func, returnType=IntegerType())
> df_result=df.select(add(col("a"),col("b"),col("c"),col("d"),col("e"),col("f"),col("g"))){code}
>  
>  3.apply pandas udf
>   
> {code:java}
> def trigger_func(iterator):
>       yield iterator
> df_result.rdd.foreachPartition(trigger_func){code}
>  
>  4. Execute it in pyspark (local or yarn)
>  Run it with conf spark.sql.execution.arrow.maxRecordsPerBatch=10. As 
> mentioned before, the total row number is 100,000, so it should print "iterator 
> one time" 10 times.
>  (1)Python 2.7 envs:
>   
> {code:java}
> PYSPARK_PYTHON=/usr/lib/conda/envs/py2.7/bin/python pyspark --conf 
> spark.sql.execution.arrow.maxRecordsPerBatch=10 --conf 
> spark.executor.pyspark.memory=2g --conf 
> spark.sql.execution.arrow.enabled=true --executor-cores 1{code}
>  
>  !py2.7.png!   
>  The result is right: the message is printed 10 times.
>  
>  
> (2)Python 3.5 or 3.6 envs:
> {code:java}
> PYSPARK_PYTHON=/usr/lib/conda/envs/python3.6/bin/python pyspark --conf 
> spark.sql.execution.arrow.maxRecordsPerBatch=10 --conf 
> spark.executor.pyspark.memory=2g --conf 
> spark.sql.execution.arrow.enabled=true --executor-cores{code}
>  
> !py3.6.png!
> The data is incomplete. The exception is printed by the Spark JVM side using 
> logging that we added; I will explain it later.
>   
>   
> h3. *Investigation*
> The “process done” message is added in worker.py.
>  !worker.png!
>  In order to get the exception, we changed the Spark code under 
> core/src/main/scala/org/apache/spark/util/Utils.scala, adding the following code to 
> print the exception.
>   
>  
> {code:java}
> @@ -1362,6 +1362,8 @@ private[spark] object Utils extends Logging {
>  case t: Throwable =>
>  // Purposefully not using NonFatal, because even fatal exceptions
>  // we don't want to have our finallyBlock suppress
> + logInfo(t.getLocalizedMessage)
> + t.printStackTrace()
>  originalThrowable = t
>  throw originalThrowable
>  } finally {{code}
>  
>  
>  It seems that pyspark gets the data from the JVM, but pyarrow gets the data 
> incomplete. The pyarrow side thinks the data is finished and then shuts down the 
> socket. At the same time, the JVM side still writes to the same socket, but 
> gets a socket-close exception.
>  The pyarrow part is in ipc.pxi:
>   
> {code:java}
> cdef class _RecordBatchReader:
>  cdef:
>  shared_ptr[CRecordBatchReader] reader
>  shared_ptr[InputStream] 

[jira] [Commented] (SPARK-28502) Error with struct conversion while using pandas_udf

2019-07-26 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894134#comment-16894134
 ] 

Li Jin commented on SPARK-28502:


Hmm... I think this has something to do with the timezone. Can you try setting 
{{spark.sql.session.timeZone}} to "UTC"?

> Error with struct conversion while using pandas_udf
> ---
>
> Key: SPARK-28502
> URL: https://issues.apache.org/jira/browse/SPARK-28502
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
> Environment: OS: Ubuntu
> Python: 3.6
>Reporter: Nasir Ali
>Priority: Minor
>
> What I am trying to do: group data based on time intervals (e.g., a 15-day 
> window) and perform some operations on the dataframe using (pandas) UDFs. I don't 
> know if there is a better/cleaner way to do it.
> Below is the sample code that I tried and the error message I am getting.
>  
> {code:java}
> df = sparkSession.createDataFrame([(17.00, "2018-03-10T15:27:18+00:00"),
> (13.00, "2018-03-11T12:27:18+00:00"),
> (25.00, "2018-03-12T11:27:18+00:00"),
> (20.00, "2018-03-13T15:27:18+00:00"),
> (17.00, "2018-03-14T12:27:18+00:00"),
> (99.00, "2018-03-15T11:27:18+00:00"),
> (156.00, "2018-03-22T11:27:18+00:00"),
> (17.00, "2018-03-31T11:27:18+00:00"),
> (25.00, "2018-03-15T11:27:18+00:00"),
> (25.00, "2018-03-16T11:27:18+00:00")
> ],
>["id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
> schema = StructType([
> StructField("id", IntegerType()),
> StructField("ts", TimestampType())
> ])
> @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
> def some_udf(df):
> # some computation
> return df
> df.groupby('id', F.window("ts", "15 days")).apply(some_udf).show()
> {code}
> This throws following exception:
> {code:java}
> TypeError: Unsupported type in conversion from Arrow: struct<start: timestamp[us, tz=America/Chicago], end: timestamp[us, tz=America/Chicago]>
> {code}
>  
> However, if I use builtin agg method then it works all fine. For example,
> {code:java}
> df.groupby('id', F.window("ts", "15 days")).mean().show(truncate=False)
> {code}
> Output
> {code:java}
> +-----+------------------------------------------+-------+
> |id   |window                                    |avg(id)|
> +-----+------------------------------------------+-------+
> |13.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|13.0   |
> |17.0 |[2018-03-20 00:00:00, 2018-04-04 00:00:00]|17.0   |
> |156.0|[2018-03-20 00:00:00, 2018-04-04 00:00:00]|156.0  |
> |99.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|99.0   |
> |20.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|20.0   |
> |17.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|17.0   |
> |25.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|25.0   |
> +-----+------------------------------------------+-------+
> {code}






[jira] [Updated] (SPARK-28422) GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause

2019-07-17 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-28422:
---
Description: 
 
{code:java}
@pandas_udf('double', PandasUDFType.GROUPED_AGG)
def max_udf(v):
return v.max()

df = spark.range(0, 100)
df.udf.register('max_udf', max_udf)
df.createTempView('table')

# A. This works
df.agg(max_udf(df['id'])).show()

# B. This doesn't work
spark.sql("select max_udf(id) from table").show(){code}
 

 

Query plan:

A:
{code:java}
== Parsed Logical Plan ==

'Aggregate [max_udf('id) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Analyzed Logical Plan ==

max_udf(id): double

Aggregate [max_udf(id#64L) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Optimized Logical Plan ==

Aggregate [max_udf(id#64L) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Physical Plan ==

!AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]

+- Exchange SinglePartition

   +- *(1) Range (0, 1000, step=1, splits=4)
{code}
B:
{code:java}
== Parsed Logical Plan ==

'Project [unresolvedalias('max_udf('id), None)]

+- 'UnresolvedRelation [table]




== Analyzed Logical Plan ==

max_udf(id): double

Project [max_udf(id#0L) AS max_udf(id)#136]

+- SubqueryAlias `table`

   +- Range (0, 100, step=1, splits=Some(4))




== Optimized Logical Plan ==

Project [max_udf(id#0L) AS max_udf(id)#136]

+- Range (0, 100, step=1, splits=Some(4))




== Physical Plan ==

*(1) Project [max_udf(id#0L) AS max_udf(id)#136]

+- *(1) Range (0, 100, step=1, splits=4)
{code}
 

  was:
 
{code:java}
@pandas_udf('double', PandasUDFType.GROUPED_AGG)
def max_udf(v):
return v.max()

df = spark.range(0, 100)
df.udf.register('max_udf', max_udf)
df.createTempView('table')

# A. This works
df.agg(max_udf(df['id'])).show()

# B. This doesn't work
spark.sql("select max_udf(id) from table").show(){code}
 

 

Query plan:

A:
{code:java}
== Parsed Logical Plan ==

'Aggregate [max_udf('id) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Analyzed Logical Plan ==

max_udf(id): double

Aggregate [max_udf(id#64L) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Optimized Logical Plan ==

Aggregate [max_udf(id#64L) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Physical Plan ==

!AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]

+- Exchange SinglePartition

   +- *(1) Range (0, 1000, step=1, splits=4)
{code}
B:
{code:java}
== Parsed Logical Plan ==

'Project [unresolvedalias('max_udf('id), None)]

+- 'UnresolvedRelation [table]




== Analyzed Logical Plan ==

max_udf(id): double

Project [max_udf(id#0L) AS max_udf(id)#136]

+- SubqueryAlias `table`

   +- Range (0, 100, step=1, splits=Some(4))




== Optimized Logical Plan ==

Project [max_udf(id#0L) AS max_udf(id)#136]

+- Range (0, 100, step=1, splits=Some(4))




== Physical Plan ==

*(1) Project [max_udf(id#0L) AS max_udf(id)#136]

+- *(1) Range (0, 100, step=1, splits=4)
{code}
Maybe related to subquery?


> GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause
> ---
>
> Key: SPARK-28422
> URL: https://issues.apache.org/jira/browse/SPARK-28422
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.3
>Reporter: Li Jin
>Priority: Major
>
>  
> {code:java}
> @pandas_udf('double', PandasUDFType.GROUPED_AGG)
> def max_udf(v):
> return v.max()
> df = spark.range(0, 100)
> df.udf.register('max_udf', max_udf)
> df.createTempView('table')
> # A. This works
> df.agg(max_udf(df['id'])).show()
> # B. This doesn't work
> spark.sql("select max_udf(id) from table").show(){code}
>  
>  
> Query plan:
> A:
> {code:java}
> == Parsed Logical Plan ==
> 'Aggregate [max_udf('id) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Analyzed Logical Plan ==
> max_udf(id): double
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Physical Plan ==
> !AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]
> +- Exchange SinglePartition
>    +- *(1) Range (0, 1000, step=1, splits=4)
> {code}
> B:
> {code:java}
> == Parsed Logical Plan ==
> 'Project [unresolvedalias('max_udf('id), None)]
> +- 'UnresolvedRelation [table]
> == Analyzed Logical Plan ==
> max_udf(id): double
> Project [max_udf(id#0L) AS max_udf(id)#136]
> +- SubqueryAlias `table`
>    +- Range (0, 100, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Project [max_udf(id#0L) AS max_udf(id)#136]
> +- Range (0, 100, step=1, splits=Some(4))
> == Physical Plan ==
> 

[jira] [Updated] (SPARK-28422) GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause

2019-07-17 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-28422:
---
Summary: GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by 
clause  (was: GROUPED_AGG pandas_udf doesn't work with spark.sql without group by 
clause)

> GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause
> ---
>
> Key: SPARK-28422
> URL: https://issues.apache.org/jira/browse/SPARK-28422
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.3
>Reporter: Li Jin
>Priority: Major
>
>  
> {code:java}
> @pandas_udf('double', PandasUDFType.GROUPED_AGG)
> def max_udf(v):
> return v.max()
> df = spark.range(0, 100)
> df.udf.register('max_udf', max_udf)
> df.createTempView('table')
> # A. This works
> df.agg(max_udf(df['id'])).show()
> # B. This doesn't work
> spark.sql("select max_udf(id) from table"){code}
>  
>  
> Query plan:
> A:
> {code:java}
> == Parsed Logical Plan ==
> 'Aggregate [max_udf('id) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Analyzed Logical Plan ==
> max_udf(id): double
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Physical Plan ==
> !AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]
> +- Exchange SinglePartition
>    +- *(1) Range (0, 1000, step=1, splits=4)
> {code}
> B:
> {code:java}
> == Parsed Logical Plan ==
> 'Project [unresolvedalias('max_udf('id), None)]
> +- 'UnresolvedRelation [table]
> == Analyzed Logical Plan ==
> max_udf(id): double
> Project [max_udf(id#0L) AS max_udf(id)#136]
> +- SubqueryAlias `table`
>    +- Range (0, 100, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Project [max_udf(id#0L) AS max_udf(id)#136]
> +- Range (0, 100, step=1, splits=Some(4))
> == Physical Plan ==
> *(1) Project [max_udf(id#0L) AS max_udf(id)#136]
> +- *(1) Range (0, 100, step=1, splits=4)
> {code}
> Maybe related to subquery?






[jira] [Updated] (SPARK-28422) GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause

2019-07-17 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-28422:
---
Description: 
 
{code:java}
@pandas_udf('double', PandasUDFType.GROUPED_AGG)
def max_udf(v):
return v.max()

df = spark.range(0, 100)
df.udf.register('max_udf', max_udf)
df.createTempView('table')

# A. This works
df.agg(max_udf(df['id'])).show()

# B. This doesn't work
spark.sql("select max_udf(id) from table").show(){code}
 

 

Query plan:

A:
{code:java}
== Parsed Logical Plan ==

'Aggregate [max_udf('id) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Analyzed Logical Plan ==

max_udf(id): double

Aggregate [max_udf(id#64L) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Optimized Logical Plan ==

Aggregate [max_udf(id#64L) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Physical Plan ==

!AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]

+- Exchange SinglePartition

   +- *(1) Range (0, 1000, step=1, splits=4)
{code}
B:
{code:java}
== Parsed Logical Plan ==

'Project [unresolvedalias('max_udf('id), None)]

+- 'UnresolvedRelation [table]




== Analyzed Logical Plan ==

max_udf(id): double

Project [max_udf(id#0L) AS max_udf(id)#136]

+- SubqueryAlias `table`

   +- Range (0, 100, step=1, splits=Some(4))




== Optimized Logical Plan ==

Project [max_udf(id#0L) AS max_udf(id)#136]

+- Range (0, 100, step=1, splits=Some(4))




== Physical Plan ==

*(1) Project [max_udf(id#0L) AS max_udf(id)#136]

+- *(1) Range (0, 100, step=1, splits=4)
{code}
Maybe related to subquery?

  was:
 
{code:java}
@pandas_udf('double', PandasUDFType.GROUPED_AGG)
def max_udf(v):
return v.max()

df = spark.range(0, 100)
df.udf.register('max_udf', max_udf)
df.createTempView('table')

# A. This works
df.agg(max_udf(df['id'])).show()

# B. This doesn't work
spark.sql("select max_udf(id) from table"){code}
 

 

Query plan:

A:
{code:java}
== Parsed Logical Plan ==

'Aggregate [max_udf('id) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Analyzed Logical Plan ==

max_udf(id): double

Aggregate [max_udf(id#64L) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Optimized Logical Plan ==

Aggregate [max_udf(id#64L) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Physical Plan ==

!AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]

+- Exchange SinglePartition

   +- *(1) Range (0, 1000, step=1, splits=4)
{code}
B:
{code:java}
== Parsed Logical Plan ==

'Project [unresolvedalias('max_udf('id), None)]

+- 'UnresolvedRelation [table]




== Analyzed Logical Plan ==

max_udf(id): double

Project [max_udf(id#0L) AS max_udf(id)#136]

+- SubqueryAlias `table`

   +- Range (0, 100, step=1, splits=Some(4))




== Optimized Logical Plan ==

Project [max_udf(id#0L) AS max_udf(id)#136]

+- Range (0, 100, step=1, splits=Some(4))




== Physical Plan ==

*(1) Project [max_udf(id#0L) AS max_udf(id)#136]

+- *(1) Range (0, 100, step=1, splits=4)
{code}
Maybe related to subquery?


> GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause
> ---
>
> Key: SPARK-28422
> URL: https://issues.apache.org/jira/browse/SPARK-28422
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.3
>Reporter: Li Jin
>Priority: Major
>
>  
> {code:java}
> @pandas_udf('double', PandasUDFType.GROUPED_AGG)
> def max_udf(v):
> return v.max()
> df = spark.range(0, 100)
> df.udf.register('max_udf', max_udf)
> df.createTempView('table')
> # A. This works
> df.agg(max_udf(df['id'])).show()
> # B. This doesn't work
> spark.sql("select max_udf(id) from table").show(){code}
>  
>  
> Query plan:
> A:
> {code:java}
> == Parsed Logical Plan ==
> 'Aggregate [max_udf('id) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Analyzed Logical Plan ==
> max_udf(id): double
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Physical Plan ==
> !AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]
> +- Exchange SinglePartition
>    +- *(1) Range (0, 1000, step=1, splits=4)
> {code}
> B:
> {code:java}
> == Parsed Logical Plan ==
> 'Project [unresolvedalias('max_udf('id), None)]
> +- 'UnresolvedRelation [table]
> == Analyzed Logical Plan ==
> max_udf(id): double
> Project [max_udf(id#0L) AS max_udf(id)#136]
> +- SubqueryAlias `table`
>    +- Range (0, 100, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Project [max_udf(id#0L) AS max_udf(id)#136]
> +- Range (0, 100, step=1, splits=Some(4))
> == 

[jira] [Created] (SPARK-28422) GROUPED_AGG pandas_udf doesn't work with spark.sql without group by clause

2019-07-17 Thread Li Jin (JIRA)
Li Jin created SPARK-28422:
--

 Summary: GROUPED_AGG pandas_udf doesn't work with spark.sql without 
group by clause
 Key: SPARK-28422
 URL: https://issues.apache.org/jira/browse/SPARK-28422
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.4.3
Reporter: Li Jin


 
{code:java}
@pandas_udf('double', PandasUDFType.GROUPED_AGG)
def max_udf(v):
return v.max()

df = spark.range(0, 100)
df.udf.register('max_udf', max_udf)
df.createTempView('table')

# A. This works
df.agg(max_udf(df['id'])).show()

# B. This doesn't work
spark.sql("select max_udf(id) from table"){code}
 

 

Query plan:

A:
{code:java}
== Parsed Logical Plan ==

'Aggregate [max_udf('id) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Analyzed Logical Plan ==

max_udf(id): double

Aggregate [max_udf(id#64L) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Optimized Logical Plan ==

Aggregate [max_udf(id#64L) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Physical Plan ==

!AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]

+- Exchange SinglePartition

   +- *(1) Range (0, 1000, step=1, splits=4)
{code}
B:
{code:java}
== Parsed Logical Plan ==

'Project [unresolvedalias('max_udf('id), None)]

+- 'UnresolvedRelation [table]




== Analyzed Logical Plan ==

max_udf(id): double

Project [max_udf(id#0L) AS max_udf(id)#136]

+- SubqueryAlias `table`

   +- Range (0, 100, step=1, splits=Some(4))




== Optimized Logical Plan ==

Project [max_udf(id#0L) AS max_udf(id)#136]

+- Range (0, 100, step=1, splits=Some(4))




== Physical Plan ==

*(1) Project [max_udf(id#0L) AS max_udf(id)#136]

+- *(1) Range (0, 100, step=1, splits=4)
{code}
Maybe related to subquery?






[jira] [Comment Edited] (SPARK-28006) User-defined grouped transform pandas_udf for window operations

2019-06-13 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863186#comment-16863186
 ] 

Li Jin edited comment on SPARK-28006 at 6/13/19 3:36 PM:
-

Hi [~viirya] good questions!

>> Can we use pandas agg udfs as window function?

Pandas agg udfs as window functions are supported, with both unbounded and 
bounded windows.

>> Because the proposed GROUPED_XFORM udf calculates output values for all rows 
>> in the group, looks like the proposed GROUPED_XFORM udf can only use window 
>> frame (UnboundedPreceding, UnboundedFollowing)

This is correct. It is really using unbounded window as groups here (because 
there is no groupby transform API in Spark sql).
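
For example, a minimal sketch (assuming Spark 2.4+ and a DataFrame with 'id' and 'v' 
columns; the names and the 3-row bounded frame are illustrative):

{code:java}
from pyspark.sql import Window
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('double', PandasUDFType.GROUPED_AGG)
def mean_udf(v):
    return v.mean()

# Unbounded frame: the whole partition
w_unbounded = Window.partitionBy('id')

# Bounded frame: current row plus the two preceding rows
w_bounded = Window.partitionBy('id').orderBy('v').rowsBetween(-2, 0)

df.withColumn('mean_all', mean_udf(df['v']).over(w_unbounded)) \
  .withColumn('mean_last3', mean_udf(df['v']).over(w_bounded)) \
  .show()
{code}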


was (Author: icexelloss):
Hi [~viirya] good questions:

>> Can we use pandas agg udfs as window function?

Pandas agg udfs as window functions are supported, with both unbounded and 
bounded windows.

>> Because the proposed GROUPED_XFORM udf calculates output values for all rows 
>> in the group, looks like the proposed GROUPED_XFORM udf can only use window 
>> frame (UnboundedPreceding, UnboundedFollowing)

This is correct. It is really using unbounded window as groups here (because 
there is no groupby transform API in Spark sql).

> User-defined grouped transform pandas_udf for window operations
> ---
>
> Key: SPARK-28006
> URL: https://issues.apache.org/jira/browse/SPARK-28006
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Li Jin
>Priority: Major
>
> Currently, pandas_udf supports a "grouped aggregate" type that can be used with 
> unbounded and bounded windows. There is another set of use cases that can 
> benefit from a "grouped transform" type pandas_udf.
> Grouped transform is defined as a N -> N mapping over a group. For example, 
> "compute zscore for values in the group using the grouped mean and grouped 
> stdev", or "rank the values in the group".
> Currently, in order to do this, user needs to use "grouped apply", for 
> example:
> {code:java}
> @pandas_udf(schema, GROUPED_MAP)
> def subtract_mean(pdf):
> v = pdf['v']
> pdf['v'] = v - v.mean()
> return pdf
> df.groupby('id').apply(subtract_mean)
> # +---+----+
> # | id|   v|
> # +---+----+
> # |  1|-0.5|
> # |  1| 0.5|
> # |  2|-3.0|
> # |  2|-1.0|
> # |  2| 4.0|
> # +---+----+{code}
> This approach has a few downsides:
>  * Specifying the full return schema is complicated for the user although the 
> function only changes one column.
> * Hard-coding the column name 'v' inside the udf makes the udf less reusable.
>  * The entire dataframe is serialized to pass to Python although only one 
> column is needed.
> Here we propose a new type of pandas_udf to work with these types of use 
> cases:
> {code:java}
> df = spark.createDataFrame(
> [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
> ("id", "v"))
> @pandas_udf('double', GROUPED_XFORM)
> def subtract_mean(v):
> return v - v.mean()
> w = Window.partitionBy('id')
> df = df.withColumn('v', subtract_mean(df['v']).over(w))
> # +---+----+
> # | id|   v|
> # +---+----+
> # |  1|-0.5|
> # |  1| 0.5|
> # |  2|-3.0|
> # |  2|-1.0|
> # |  2| 4.0|
> # +---+----+{code}
> Which addresses the above downsides.
>  * The user only needs to specify the output type of a single column.
>  * The column being zscored is decoupled from the udf implementation
>  * We only need to send one column to Python worker and concat the result 
> with the original dataframe (this is what grouped aggregate is doing already)
>  
>  






[jira] [Commented] (SPARK-28006) User-defined grouped transform pandas_udf for window operations

2019-06-13 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863186#comment-16863186
 ] 

Li Jin commented on SPARK-28006:


Hi [~viirya] good questions:

>> Can we use pandas agg udfs as window function?

Pandas agg udfs as window functions are supported, with both unbounded and 
bounded windows.

>> Because the proposed GROUPED_XFORM udf calculates output values for all rows 
>> in the group, looks like the proposed GROUPED_XFORM udf can only use window 
>> frame (UnboundedPreceding, UnboundedFollowing)

This is correct. It is really using unbounded window as groups here (because 
there is no groupby transform API in Spark sql).

> User-defined grouped transform pandas_udf for window operations
> ---
>
> Key: SPARK-28006
> URL: https://issues.apache.org/jira/browse/SPARK-28006
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Li Jin
>Priority: Major
>
> Currently, pandas_udf supports a "grouped aggregate" type that can be used with 
> unbounded and bounded windows. There is another set of use cases that can 
> benefit from a "grouped transform" type pandas_udf.
> Grouped transform is defined as a N -> N mapping over a group. For example, 
> "compute zscore for values in the group using the grouped mean and grouped 
> stdev", or "rank the values in the group".
> Currently, in order to do this, user needs to use "grouped apply", for 
> example:
> {code:java}
> @pandas_udf(schema, GROUPED_MAP)
> def subtract_mean(pdf):
> v = pdf['v']
> pdf['v'] = v - v.mean()
> return pdf
> df.groupby('id').apply(subtract_mean)
> # +---+----+
> # | id|   v|
> # +---+----+
> # |  1|-0.5|
> # |  1| 0.5|
> # |  2|-3.0|
> # |  2|-1.0|
> # |  2| 4.0|
> # +---+----+{code}
> This approach has a few downsides:
>  * Specifying the full return schema is complicated for the user although the 
> function only changes one column.
> * Hard-coding the column name 'v' inside the udf makes the udf less reusable.
>  * The entire dataframe is serialized to pass to Python although only one 
> column is needed.
> Here we propose a new type of pandas_udf to work with these types of use 
> cases:
> {code:java}
> df = spark.createDataFrame(
> [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
> ("id", "v"))
> @pandas_udf('double', GROUPED_XFORM)
> def subtract_mean(v):
> return v - v.mean()
> w = Window.partitionBy('id')
> df = df.withColumn('v', subtract_mean(df['v']).over(w))
> # +---+----+
> # | id|   v|
> # +---+----+
> # |  1|-0.5|
> # |  1| 0.5|
> # |  2|-3.0|
> # |  2|-1.0|
> # |  2| 4.0|
> # +---+----+{code}
> Which addresses the above downsides.
>  * The user only needs to specify the output type of a single column.
>  * The column being zscored is decoupled from the udf implementation
>  * We only need to send one column to Python worker and concat the result 
> with the original dataframe (this is what grouped aggregate is doing already)
>  
>  






[jira] [Commented] (SPARK-27463) Support Dataframe Cogroup via Pandas UDFs

2019-06-13 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863177#comment-16863177
 ] 

Li Jin commented on SPARK-27463:


Yeah, I think the exact spelling of the API can go either way. I think the 
current options are close enough that we don't need to commit to either one at 
this moment, and they shouldn't affect the implementation too much.

[~d80tb7] [~hyukjin.kwon] How about we start working towards a PR that 
implements one of the proposed APIs and go from there?

> Support Dataframe Cogroup via Pandas UDFs 
> --
>
> Key: SPARK-27463
> URL: https://issues.apache.org/jira/browse/SPARK-27463
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Chris Martin
>Priority: Major
>
> Recent work on Pandas UDFs in Spark, has allowed for improved 
> interoperability between Pandas and Spark.  This proposal aims to extend this 
> by introducing a new Pandas UDF type which would allow for a cogroup 
> operation to be applied to two PySpark DataFrames.
> Full details are in the google document linked below.
>  






[jira] [Updated] (SPARK-28006) User-defined grouped transform pandas_udf for window operations

2019-06-12 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-28006:
---
Description: 
Currently, pandas_udf supports a "grouped aggregate" type that can be used with 
unbounded and bounded windows. There is another set of use cases that can 
benefit from a "grouped transform" type pandas_udf.

Grouped transform is defined as a N -> N mapping over a group. For example, 
"compute zscore for values in the group using the grouped mean and grouped 
stdev", or "rank the values in the group".

Currently, in order to do this, user needs to use "grouped apply", for example:
{code:java}
@pandas_udf(schema, GROUPED_MAP)
def subtract_mean(pdf):
v = pdf['v']
pdf['v'] = v - v.mean()
return pdf

df.groupby('id').apply(subtract_mean)
# +---+----+
# | id|   v|
# +---+----+
# |  1|-0.5|
# |  1| 0.5|
# |  2|-3.0|
# |  2|-1.0|
# |  2| 4.0|
# +---+----+{code}
This approach has a few downsides:
 * Specifying the full return schema is complicated for the user although the 
function only changes one column.
 * Hard-coding the column name 'v' inside the udf makes the udf less reusable.
 * The entire dataframe is serialized to pass to Python although only one 
column is needed.

Here we propose a new type of pandas_udf to work with these types of use cases:
{code:java}
df = spark.createDataFrame(
[(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
("id", "v"))

@pandas_udf('double', GROUPED_XFORM)
def subtract_mean(v):
return v - v.mean()

w = Window.partitionBy('id')

df = df.withColumn('v', subtract_mean(df['v']).over(w))
# +---+----+
# | id|   v|
# +---+----+
# |  1|-0.5|
# |  1| 0.5|
# |  2|-3.0|
# |  2|-1.0|
# |  2| 4.0|
# +---+----+{code}
Which addresses the above downsides.
 * The user only needs to specify the output type of a single column.
 * The column being zscored is decoupled from the udf implementation
 * We only need to send one column to Python worker and concat the result with 
the original dataframe (this is what grouped aggregate is doing already)

 

 

  was:
Currently, pandas_udf supports a "grouped aggregate" type that can be used with 
unbounded and bounded windows. There is another set of use cases that can 
benefit from a "grouped transform" type pandas_udf.

Grouped transform is defined as a N -> N mapping over a group. For example, 
"compute zscore for values in the group using the grouped mean and grouped 
stdev", or "rank the values in the group".

Currently, in order to do this, user needs to use "grouped apply", for example:
{code:java}
@pandas_udf(schema, GROUPED_MAP)
def subtract_mean(pdf):
v = pdf['v']
pdf['v'] = v - v.mean()
return pdf

df.groupby('id').apply(subtract_mean)
# +---+----+
# | id|   v|
# +---+----+
# |  1|-0.5|
# |  1| 0.5|
# |  2|-3.0|
# |  2|-1.0|
# |  2| 4.0|
# +---+----+{code}
This approach has a few downsides:
 * Specifying the full return schema is complicated for the user although the 
function only changes one column.
 * Hard-coding the column name 'v' inside the udf makes the udf less reusable.
 * The entire dataframe is serialized to pass to Python although only one 
column is needed.

Here we propose a new type of pandas_udf to work with these types of use cases:
{code:java}
df = spark.createDataFrame(
[(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
("id", "v"))

@pandas_udf('double', GROUPED_XFORM)
def subtract_mean(v):
return v - v.mean() / v.std()

w = Window.partitionBy('id')

df = df.withColumn('v', subtract_mean(df['v']).over(w))
# +---++
# | id|   v|
# +---++
# |  1|-0.5|
# |  1| 0.5|
# |  2|-3.0|
# |  2|-1.0|
# |  2| 4.0|
# +---++{code}
Which addresses the above downsides.
 * The user only needs to specify the output type of a single column.
 * The column being zscored is decoupled from the udf implementation
 * We only need to send one column to Python worker and concat the result with 
the original dataframe (this is what grouped aggregate is doing already)

 

 


> User-defined grouped transform pandas_udf for window operations
> ---
>
> Key: SPARK-28006
> URL: https://issues.apache.org/jira/browse/SPARK-28006
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Li Jin
>Priority: Major
>
> Currently, pandas_udf supports "grouped aggregate" type that can be used with 
> bounded and unbounded windows. There is another set of use cases that can 
> benefit from a "grouped transform" type pandas_udf.
> Grouped transform is defined as a N -> N mapping over a group. For example, 
> "compute zscore for values in the group using the grouped mean and grouped 
> stdev", or "rank the values in the group".
> Currently, in order to do this, user needs to use "grouped apply", for 
> example:
> 

[jira] [Commented] (SPARK-28006) User-defined grouped transform pandas_udf for window operations

2019-06-12 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862505#comment-16862505
 ] 

Li Jin commented on SPARK-28006:


Thanks [~hyukjin.kwon] for the comments! I updated the description to include 
input/outputs.

Yeah, I don't think the Scala window functions currently have such a type. The 
analogy of this in pandas is groupby transform (hence the name "grouped 
transform" udf):

 
{code:java}
>>> df = pd.DataFrame({'id': [1, 1, 2, 2, 2], 'value': [1., 2., 3., 5., 10.]})

>>> df
   id  value

0   1    1.0
1   1    2.0
2   2    3.0
3   2    5.0
4   2   10.0

>>> df['value_demean'] = df.groupby('id')['value'].transform(lambda x: x - 
>>> x.mean())
>>> df

   id  value  value_demean
0   1    1.0          -0.5
1   1    2.0           0.5
2   2    3.0          -3.0
3   2    5.0          -1.0
4   2   10.0           4.0
{code}
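The other example from the description, ranking the values within each group, 
follows the same pandas pattern (a small illustrative sketch using the same df):
{code:java}
>>> df['value_rank'] = df.groupby('id')['value'].rank()
>>> df
   id  value  value_demean  value_rank
0   1    1.0          -0.5         1.0
1   1    2.0           0.5         2.0
2   2    3.0          -3.0         1.0
3   2    5.0          -1.0         2.0
4   2   10.0           4.0         3.0
{code}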
 

> User-defined grouped transform pandas_udf for window operations
> ---
>
> Key: SPARK-28006
> URL: https://issues.apache.org/jira/browse/SPARK-28006
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Li Jin
>Priority: Major
>
> Currently, pandas_udf supports "grouped aggregate" type that can be used with 
> bounded and unbounded windows. There is another set of use cases that can 
> benefit from a "grouped transform" type pandas_udf.
> Grouped transform is defined as a N -> N mapping over a group. For example, 
> "compute zscore for values in the group using the grouped mean and grouped 
> stdev", or "rank the values in the group".
> Currently, in order to do this, user needs to use "grouped apply", for 
> example:
> {code:java}
> @pandas_udf(schema, GROUPED_MAP)
> def subtract_mean(pdf):
> v = pdf['v']
> pdf['v'] = v - v.mean()
> return pdf
> df.groupby('id').apply(subtract_mean)
> # +---++
> # | id|   v|
> # +---++
> # |  1|-0.5|
> # |  1| 0.5|
> # |  2|-3.0|
> # |  2|-1.0|
> # |  2| 4.0|
> # +---++{code}
> This approach has a few downside:
>  * Specifying the full return schema is complicated for the user although the 
> function only changes one column.
>  * The column name 'v' inside as part of the udf, makes the udf less reusable.
>  * The entire dataframe is serialized to pass to Python although only one 
> column is needed.
> Here we propose a new type of pandas_udf to work with these types of use 
> cases:
> {code:java}
> df = spark.createDataFrame(
> [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
> ("id", "v"))
> @pandas_udf('double', GROUPED_XFORM)
> def subtract_mean(v):
> return v - v.mean() / v.std()
> w = Window.partitionBy('id')
> df = df.withColumn('v', subtract_mean(df['v']).over(w))
> # +---++
> # | id|   v|
> # +---++
> # |  1|-0.5|
> # |  1| 0.5|
> # |  2|-3.0|
> # |  2|-1.0|
> # |  2| 4.0|
> # +---++{code}
> Which addresses the above downsides.
>  * The user only needs to specify the output type of a single column.
>  * The column being zscored is decoupled from the udf implementation
>  * We only need to send one column to Python worker and concat the result 
> with the original dataframe (this is what grouped aggregate is doing already)
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28006) User-defined grouped transform pandas_udf for window operations

2019-06-12 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-28006:
---
Description: 
Currently, pandas_udf supports the "grouped aggregate" type, which can be used with 
bounded and unbounded windows. There is another set of use cases that can 
benefit from a "grouped transform" type pandas_udf.

Grouped transform is defined as an N -> N mapping over a group. For example, 
"compute the zscore of values in the group using the grouped mean and grouped 
stdev", or "rank the values in the group".

Currently, in order to do this, the user needs to use "grouped apply", for example:
{code:java}
@pandas_udf(schema, GROUPED_MAP)
def subtract_mean(pdf):
    v = pdf['v']
    pdf['v'] = v - v.mean()
    return pdf

df.groupby('id').apply(subtract_mean)
# +---++
# | id|   v|
# +---++
# |  1|-0.5|
# |  1| 0.5|
# |  2|-3.0|
# |  2|-1.0|
# |  2| 4.0|
# +---++{code}
This approach has a few downsides:
 * Specifying the full return schema is complicated for the user, even though the 
function only changes one column.
 * Hard-coding the column name 'v' inside the udf makes the udf less reusable.
 * The entire dataframe is serialized and passed to Python, although only one 
column is needed.

Here we propose a new type of pandas_udf to work with these types of use cases:
{code:java}
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

@pandas_udf('double', GROUPED_XFORM)
def subtract_mean(v):
    return v - v.mean() / v.std()

w = Window.partitionBy('id')

df = df.withColumn('v', subtract_mean(df['v']).over(w))
# +---++
# | id|   v|
# +---++
# |  1|-0.5|
# |  1| 0.5|
# |  2|-3.0|
# |  2|-1.0|
# |  2| 4.0|
# +---++{code}
This addresses the above downsides:
 * The user only needs to specify the output type of a single column.
 * The column being transformed is decoupled from the udf implementation.
 * We only need to send one column to the Python worker and concat the result with 
the original dataframe (this is what grouped aggregate does already).

 

 

  was:
Currently, pandas_udf supports "grouped aggregate" type that can be used with 
bounded and unbounded windows. There is another set of use cases that can 
benefit from a "grouped transform" type pandas_udf.

Grouped transform is defined as a N -> N mapping over a group. For example, 
"compute zscore for values in the group using the grouped mean and grouped 
stdev", or "rank the values in the group".

 

Currently, in order to do this, user needs to use "grouped apply", for example:

 
{code:java}
@pandas_udf(schema, GROUPED_MAP)
def zscore(pdf):
v = pdf['v']
pdf['v'] = v - v.mean() / v.std()
return pdf

df.groupby('id').apply(zscore){code}
This approach has a few downside:
 * Specifying the full return schema is complicated for the user although the 
function only changes one column.
 * The column name 'v' inside as part of the udf, makes the udf less reusable.
 * The entire dataframe is serialized to pass to Python although only one 
column is needed.

Here we propose a new type of pandas_udf to work with these types of use cases:
{code:java}
@pandas_udf('double', GROUPED_XFORM)
def zscore(v):
return v - v.mean() / v.std()

w = Window.partitionBy('id')

df = df.withColumn('v_zscore', zscore(df['v']).over(w)){code}
Which addresses the above downsides.
 * The user only needs to specify the output type of a single column.
 * The column being zscored is decoupled from the udf implementation
 * We only need to send one column to Python worker and concat the result with 
the original dataframe (this is what grouped aggregate is doing already)

 

 


> User-defined grouped transform pandas_udf for window operations
> ---
>
> Key: SPARK-28006
> URL: https://issues.apache.org/jira/browse/SPARK-28006
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Li Jin
>Priority: Major
>
> Currently, pandas_udf supports "grouped aggregate" type that can be used with 
> bounded and unbounded windows. There is another set of use cases that can 
> benefit from a "grouped transform" type pandas_udf.
> Grouped transform is defined as a N -> N mapping over a group. For example, 
> "compute zscore for values in the group using the grouped mean and grouped 
> stdev", or "rank the values in the group".
> Currently, in order to do this, user needs to use "grouped apply", for 
> example:
> {code:java}
> @pandas_udf(schema, GROUPED_MAP)
> def subtract_mean(pdf):
> v = pdf['v']
> pdf['v'] = v - v.mean()
> return pdf
> df.groupby('id').apply(subtract_mean)
> # +---++
> # | id|   v|
> # +---++
> # |  1|-0.5|
> # |  1| 0.5|
> # |  2|-3.0|
> # |  2|-1.0|
> # |  2| 4.0|
> # +---++{code}
> This approach has 

[jira] [Updated] (SPARK-28006) User-defined grouped transform pandas_udf for window operations

2019-06-12 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-28006:
---
Description: 
Currently, pandas_udf supports the "grouped aggregate" type, which can be used with 
bounded and unbounded windows. There is another set of use cases that can 
benefit from a "grouped transform" type pandas_udf.

Grouped transform is defined as an N -> N mapping over a group. For example, 
"compute the zscore of values in the group using the grouped mean and grouped 
stdev", or "rank the values in the group".

 

Currently, in order to do this, the user needs to use "grouped apply", for example:
{code:java}
@pandas_udf(schema, GROUPED_MAP)
def zscore(pdf):
    v = pdf['v']
    pdf['v'] = (v - v.mean()) / v.std()
    return pdf

df.groupby('id').apply(zscore){code}
This approach has a few downsides:
 * Specifying the full return schema is complicated for the user, even though the 
function only changes one column.
 * Hard-coding the column name 'v' inside the udf makes the udf less reusable.
 * The entire dataframe is serialized and passed to Python, although only one 
column is needed.

Here we propose a new type of pandas_udf to work with these types of use cases:
{code:java}
@pandas_udf('double', GROUPED_XFORM)
def zscore(v):
    return (v - v.mean()) / v.std()

w = Window.partitionBy('id')

df = df.withColumn('v_zscore', zscore(df['v']).over(w)){code}
This addresses the above downsides:
 * The user only needs to specify the output type of a single column.
 * The column being zscored is decoupled from the udf implementation.
 * We only need to send one column to the Python worker and concat the result with 
the original dataframe (this is what grouped aggregate does already).

 

 

  was:
Currently, pandas_udf supports "grouped aggregate" type that can be used with 
bounded and unbounded windows. There is another set of use cases that can 
benefit from a "grouped transform" type pandas_udf.

Grouped transform is defined as a N -> N mapping over a group. For example, 
"compute zscore for values in the group using the grouped mean and grouped 
stdev", or "rank the values in the group".

 

Currently, in order to do this, user needs to use "grouped apply", for example:

 
{code:java}
@pandas_udf(schema, GROUPED_MAP)
def zscore(pdf):
v = pdf['v']
pdf['v'] = v - v.mean() / v.std()
return pdf

df.groupby('id').apply(zscore){code}
This approach has a few downside:

 
 * Specifying the full return schema is complicated for the user although the 
function only changes one column.
 * The column name 'v' inside as part of the udf, makes the udf less reusable.
 * The entire dataframe is serialized to pass to Python although only one 
column is needed.

Here we propose a new type of pandas_udf to work with these types of use cases:
{code:java}
@pandas_udf('double', GROUPED_XFORM)
def zscore(v):
return v - v.mean() / v.std()

w = Window.partitionBy('id')

df = df.withColumn('v_zscore', zscore(df['v']).over(w)){code}
Which addresses the above downsides.
 * The user only needs to specify the output type of a single column.
 * The column being zscored is decoupled from the udf implementation
 * We only need to send one column to Python worker and concat the result with 
the original dataframe (this is what grouped aggregate is doing already)

 

 


> User-defined grouped transform pandas_udf for window operations
> ---
>
> Key: SPARK-28006
> URL: https://issues.apache.org/jira/browse/SPARK-28006
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Li Jin
>Priority: Major
>
> Currently, pandas_udf supports "grouped aggregate" type that can be used with 
> bounded and unbounded windows. There is another set of use cases that can 
> benefit from a "grouped transform" type pandas_udf.
> Grouped transform is defined as a N -> N mapping over a group. For example, 
> "compute zscore for values in the group using the grouped mean and grouped 
> stdev", or "rank the values in the group".
>  
> Currently, in order to do this, user needs to use "grouped apply", for 
> example:
>  
> {code:java}
> @pandas_udf(schema, GROUPED_MAP)
> def zscore(pdf):
> v = pdf['v']
> pdf['v'] = v - v.mean() / v.std()
> return pdf
> df.groupby('id').apply(zscore){code}
> This approach has a few downside:
>  * Specifying the full return schema is complicated for the user although the 
> function only changes one column.
>  * The column name 'v' inside as part of the udf, makes the udf less reusable.
>  * The entire dataframe is serialized to pass to Python although only one 
> column is needed.
> Here we propose a new type of pandas_udf to work with these types of use 
> cases:
> {code:java}
> @pandas_udf('double', GROUPED_XFORM)
> def zscore(v):
>   

[jira] [Commented] (SPARK-27463) Support Dataframe Cogroup via Pandas UDFs

2019-06-12 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862287#comment-16862287
 ] 

Li Jin commented on SPARK-27463:


I think one way to design this API is to mimic the existing Dataset cogroup API:
{code:java}
gdf1 = df1.groupByKey('id')
gdf2 = df2.groupByKey('id')

result = gdf1.cogroup(gdf2).apply(my_pandas_udf){code}
Although KeyValueGroupedDataset and groupByKey aren't really exposed to pyspark 
(or maybe they don't apply to pyspark because of typing?), another way to go 
about this is to use RelationalGroupedDataset:
{code:java}
gdf1 = df1.groupBy('id')
gdf2 = df2.groupBy('id')

result = gdf1.cogroup(gdf2).apply(my_pandas_udf){code}
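
In either form, the udf itself would presumably receive the two groups as pandas 
DataFrames and return a single pandas DataFrame. A minimal sketch, assuming that 
signature (the exact contract is part of the open design discussion) and assuming 
both groups carry a sorted 'time' column:
{code:java}
import pandas as pd

# Hypothetical cogroup udf: one pandas DataFrame per side for a given key,
# combined here with an as-of style merge on 'time'.
def my_pandas_udf(pdf1, pdf2):
    return pd.merge_asof(pdf1, pdf2, on='time')
{code}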

> Support Dataframe Cogroup via Pandas UDFs 
> --
>
> Key: SPARK-27463
> URL: https://issues.apache.org/jira/browse/SPARK-27463
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Chris Martin
>Priority: Major
>
> Recent work on Pandas UDFs in Spark, has allowed for improved 
> interoperability between Pandas and Spark.  This proposal aims to extend this 
> by introducing a new Pandas UDF type which would allow for a cogroup 
> operation to be applied to two PySpark DataFrames.
> Full details are in the google document linked below.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27463) Support Dataframe Cogroup via Pandas UDFs

2019-06-12 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862225#comment-16862225
 ] 

Li Jin commented on SPARK-27463:


For cogroup, I don't think there is an analogous API in pandas. There is an 
analogous Spark Scala API in KeyValueGroupedDataset:
{code:java}
val gdf1 = df1.groupByKey('id')
val gdf2 = df2.groupByKey('id')

val result = gdf1.cogroup(gdf2)(my_scala_udf){code}
 

> Support Dataframe Cogroup via Pandas UDFs 
> --
>
> Key: SPARK-27463
> URL: https://issues.apache.org/jira/browse/SPARK-27463
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Chris Martin
>Priority: Major
>
> Recent work on Pandas UDFs in Spark, has allowed for improved 
> interoperability between Pandas and Spark.  This proposal aims to extend this 
> by introducing a new Pandas UDF type which would allow for a cogroup 
> operation to be applied to two PySpark DataFrames.
> Full details are in the google document linked below.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28006) User-defined grouped transform pandas_udf for window operations

2019-06-11 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861559#comment-16861559
 ] 

Li Jin commented on SPARK-28006:


cc [~hyukjin.kwon] [~LI,Xiao] [~ueshin] [~bryanc]

I think code-wise this is pretty simple, but since this is adding a new pandas 
udf type I'd like to get some feedback on this.

> User-defined grouped transform pandas_udf for window operations
> ---
>
> Key: SPARK-28006
> URL: https://issues.apache.org/jira/browse/SPARK-28006
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Li Jin
>Priority: Major
>
> Currently, pandas_udf supports "grouped aggregate" type that can be used with 
> bounded and unbounded windows. There is another set of use cases that can 
> benefit from a "grouped transform" type pandas_udf.
> Grouped transform is defined as a N -> N mapping over a group. For example, 
> "compute zscore for values in the group using the grouped mean and grouped 
> stdev", or "rank the values in the group".
>  
> Currently, in order to do this, user needs to use "grouped apply", for 
> example:
>  
> {code:java}
> @pandas_udf(schema, GROUPED_MAP)
> def zscore(pdf):
> v = pdf['v']
> pdf['v'] = v - v.mean() / v.std()
> return pdf
> df.groupby('id').apply(zscore){code}
> This approach has a few downside:
>  
>  * Specifying the full return schema is complicated for the user although the 
> function only changes one column.
>  * The column name 'v' inside as part of the udf, makes the udf less reusable.
>  * The entire dataframe is serialized to pass to Python although only one 
> column is needed.
> Here we propose a new type of pandas_udf to work with these types of use 
> cases:
> {code:java}
> @pandas_udf('double', GROUPED_XFORM)
> def zscore(v):
> return v - v.mean() / v.std()
> w = Window.partitionBy('id')
> df = df.withColumn('v_zscore', zscore(df['v']).over(w)){code}
> Which addresses the above downsides.
>  * The user only needs to specify the output type of a single column.
>  * The column being zscored is decoupled from the udf implementation
>  * We only need to send one column to Python worker and concat the result 
> with the original dataframe (this is what grouped aggregate is doing already)
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28006) User-defined grouped transform pandas_udf for window operations

2019-06-11 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-28006:
---
Description: 
Currently, pandas_udf supports the "grouped aggregate" type, which can be used with 
bounded and unbounded windows. There is another set of use cases that can 
benefit from a "grouped transform" type pandas_udf.

Grouped transform is defined as an N -> N mapping over a group. For example, 
"compute the zscore of values in the group using the grouped mean and grouped 
stdev", or "rank the values in the group".

 

Currently, in order to do this, the user needs to use "grouped apply", for example:
{code:java}
@pandas_udf(schema, GROUPED_MAP)
def zscore(pdf):
    v = pdf['v']
    pdf['v'] = (v - v.mean()) / v.std()
    return pdf

df.groupby('id').apply(zscore){code}
This approach has a few downsides:
 * Specifying the full return schema is complicated for the user, even though the 
function only changes one column.
 * Hard-coding the column name 'v' inside the udf makes the udf less reusable.
 * The entire dataframe is serialized and passed to Python, although only one 
column is needed.

Here we propose a new type of pandas_udf to work with these types of use cases:
{code:java}
@pandas_udf('double', GROUPED_XFORM)
def zscore(v):
    return (v - v.mean()) / v.std()

w = Window.partitionBy('id')

df = df.withColumn('v_zscore', zscore(df['v']).over(w)){code}
This addresses the above downsides:
 * The user only needs to specify the output type of a single column.
 * The column being zscored is decoupled from the udf implementation.
 * We only need to send one column to the Python worker and concat the result with 
the original dataframe (this is what grouped aggregate does already).

 

 

  was:
Currently, pandas_udf supports "grouped aggregate" type that can be used with 
bounded and unbounded windows. There is another set of use cases that can 
benefit from a "grouped transform" type pandas_udf.

Grouped transform is defined as a N -> N mapping over a group. For example, 
"compute zscore for values in the group using the grouped mean and grouped 
stdev", or "rank the values in the group".

 

Currently, in order to do this, user needs to use "grouped apply", for example:

 
{code:java}
@pandas_udf(schema, GROUPED_MAP)
def zscore(pdf):
v = pdf['v']
pdf['v'] = v - v.mean() / v.std()
return pdf

df.groupby('id').apply(zscore){code}
This approach has a few downside:

 
 * Specifying the full return schema is complicated for the user although the 
function only changes one column.
 * The column name 'v' inside as part of the udf, makes the udf less reusable.
 * The entire dataframe is serialized to pass to Python although only one 
column is needed.

Here we propose a new type of pandas_udf to work with these types of use cases:
{code:java}
@pandas_udf('double', GROUPED_XFORM)
def zscore(v):
return v - v.mean() / v.std()

w = Window.partitionBy('id')

df = df.withColumn('v_zscore', zscore(df['v']).over(w)){code}
Which addresses the above downsides.

 

 


> User-defined grouped transform pandas_udf for window operations
> ---
>
> Key: SPARK-28006
> URL: https://issues.apache.org/jira/browse/SPARK-28006
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Li Jin
>Priority: Major
>
> Currently, pandas_udf supports "grouped aggregate" type that can be used with 
> bounded and unbounded windows. There is another set of use cases that can 
> benefit from a "grouped transform" type pandas_udf.
> Grouped transform is defined as a N -> N mapping over a group. For example, 
> "compute zscore for values in the group using the grouped mean and grouped 
> stdev", or "rank the values in the group".
>  
> Currently, in order to do this, user needs to use "grouped apply", for 
> example:
>  
> {code:java}
> @pandas_udf(schema, GROUPED_MAP)
> def zscore(pdf):
> v = pdf['v']
> pdf['v'] = v - v.mean() / v.std()
> return pdf
> df.groupby('id').apply(zscore){code}
> This approach has a few downside:
>  
>  * Specifying the full return schema is complicated for the user although the 
> function only changes one column.
>  * The column name 'v' inside as part of the udf, makes the udf less reusable.
>  * The entire dataframe is serialized to pass to Python although only one 
> column is needed.
> Here we propose a new type of pandas_udf to work with these types of use 
> cases:
> {code:java}
> @pandas_udf('double', GROUPED_XFORM)
> def zscore(v):
> return v - v.mean() / v.std()
> w = Window.partitionBy('id')
> df = df.withColumn('v_zscore', zscore(df['v']).over(w)){code}
> Which addresses the above downsides.
>  * The user only needs to specify the output type of a single column.
>  * The column being zscored is decoupled 

[jira] [Created] (SPARK-28006) User-defined grouped transform pandas_udf for window operations

2019-06-11 Thread Li Jin (JIRA)
Li Jin created SPARK-28006:
--

 Summary: User-defined grouped transform pandas_udf for window 
operations
 Key: SPARK-28006
 URL: https://issues.apache.org/jira/browse/SPARK-28006
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 2.4.3
Reporter: Li Jin


Currently, pandas_udf supports the "grouped aggregate" type, which can be used with 
bounded and unbounded windows. There is another set of use cases that can 
benefit from a "grouped transform" type pandas_udf.

Grouped transform is defined as an N -> N mapping over a group. For example, 
"compute the zscore of values in the group using the grouped mean and grouped 
stdev", or "rank the values in the group".

 

Currently, in order to do this, the user needs to use "grouped apply", for example:
{code:java}
@pandas_udf(schema, GROUPED_MAP)
def zscore(pdf):
    v = pdf['v']
    pdf['v'] = (v - v.mean()) / v.std()
    return pdf

df.groupby('id').apply(zscore){code}
This approach has a few downsides:
 * Specifying the full return schema is complicated for the user, even though the 
function only changes one column.
 * Hard-coding the column name 'v' inside the udf makes the udf less reusable.
 * The entire dataframe is serialized and passed to Python, although only one 
column is needed.

Here we propose a new type of pandas_udf to work with these types of use cases:
{code:java}
@pandas_udf('double', GROUPED_XFORM)
def zscore(v):
    return (v - v.mean()) / v.std()

w = Window.partitionBy('id')

df = df.withColumn('v_zscore', zscore(df['v']).over(w)){code}
This addresses the above downsides.

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28003) spark.createDataFrame with Arrow doesn't work with pandas.NaT

2019-06-11 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-28003:
---
Affects Version/s: (was: 2.3.2)
   2.3.3

> spark.createDataFrame with Arrow doesn't work with pandas.NaT 
> --
>
> Key: SPARK-28003
> URL: https://issues.apache.org/jira/browse/SPARK-28003
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.3, 2.4.3
>Reporter: Li Jin
>Priority: Major
>
> {code:java}
> import pandas as pd
> dt1 = [pd.NaT, pd.Timestamp('2019-06-11')] * 100
> pdf1 = pd.DataFrame({'time': dt1})
> df1 = self.spark.createDataFrame(pdf1)
> {code}
> The example above doesn't work with arrow enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28003) spark.createDataFrame with Arrow doesn't work with pandas.NaT

2019-06-11 Thread Li Jin (JIRA)
Li Jin created SPARK-28003:
--

 Summary: spark.createDataFrame with Arrow doesn't work with 
pandas.NaT 
 Key: SPARK-28003
 URL: https://issues.apache.org/jira/browse/SPARK-28003
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.4.0
Reporter: Li Jin


{code:java}
import pandas as pd
dt1 = [pd.NaT, pd.Timestamp('2019-06-11')] * 100
pdf1 = pd.DataFrame({'time': dt1})

df1 = self.spark.createDataFrame(pdf1)
{code}
The example above doesn't work with arrow enabled.
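
A possible interim workaround, assuming the failure is limited to the Arrow 
conversion path, is to fall back to the non-Arrow path for this conversion:
{code:java}
# Sketch of a workaround: disable Arrow for createDataFrame so the plain
# (slower) conversion path is used instead of the failing Arrow path.
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
df1 = spark.createDataFrame(pdf1)
{code}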



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28003) spark.createDataFrame with Arrow doesn't work with pandas.NaT

2019-06-11 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-28003:
---
Affects Version/s: (was: 2.4.0)
   2.3.2
   2.4.3

> spark.createDataFrame with Arrow doesn't work with pandas.NaT 
> --
>
> Key: SPARK-28003
> URL: https://issues.apache.org/jira/browse/SPARK-28003
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.3
>Reporter: Li Jin
>Priority: Major
>
> {code:java}
> import pandas as pd
> dt1 = [pd.NaT, pd.Timestamp('2019-06-11')] * 100
> pdf1 = pd.DataFrame({'time': dt1})
> df1 = self.spark.createDataFrame(pdf1)
> {code}
> The example above doesn't work with arrow enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27538) sparksql could not start in jdk11, exception org.datanucleus.exceptions.NucleusException: The java type java.lang.Long (jdbc-type='', sql-type="") cant be mapped for t

2019-05-29 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851185#comment-16851185
 ] 

Li Jin commented on SPARK-27538:


[~hyukjin.kwon] I saw you closed this. I wonder if this should be a sub Jira 
under SPARK-24417. As far as I can tell there is no Jira tracking this.

> sparksql could not start in jdk11, exception 
> org.datanucleus.exceptions.NucleusException: The java type java.lang.Long 
> (jdbc-type='', sql-type="") cant be mapped for this datastore. No mapping is 
> available.
> --
>
> Key: SPARK-27538
> URL: https://issues.apache.org/jira/browse/SPARK-27538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.1
> Environment: Machine:aarch64
> OS:Red Hat Enterprise Linux Server release 7.4
> Kernel:4.11.0-44.el7a
> spark version: spark-2.4.1
> java:openjdk version "11.0.2" 2019-01-15
>   OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.2+9)
> scala:2.11.12
>Reporter: dingwei2019
>Priority: Major
>  Labels: features
>
> [root@172-19-18-8 spark-2.4.1-bin-hadoop2.7-bak]# bin/spark-sql 
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
> (file:/home/dingwei/spark-2.4.1-bin-x86/spark-2.4.1-bin-hadoop2.7-bak/jars/spark-unsafe_2.11-2.4.1.jar)
>  to method java.nio.Bits.unaligned()
> WARNING: Please consider reporting this to the maintainers of 
> org.apache.spark.unsafe.Platform
> WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> 2019-04-22 11:27:34,419 WARN util.NativeCodeLoader: Unable to load 
> native-hadoop library for your platform... using builtin-java classes where 
> applicable
> 2019-04-22 11:27:35,306 INFO metastore.HiveMetaStore: 0: Opening raw store 
> with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 2019-04-22 11:27:35,330 INFO metastore.ObjectStore: ObjectStore, initialize 
> called
> 2019-04-22 11:27:35,492 INFO DataNucleus.Persistence: Property 
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 2019-04-22 11:27:35,492 INFO DataNucleus.Persistence: Property 
> datanucleus.cache.level2 unknown - will be ignored
> 2019-04-22 11:27:37,012 INFO metastore.ObjectStore: Setting MetaStore object 
> pin classes with 
> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
> 2019-04-22 11:27:37,638 WARN DataNucleus.Query: Query for candidates of 
> org.apache.hadoop.hive.metastore.model.MDatabase and subclasses resulted in 
> no possible candidates
> The java type java.lang.Long (jdbc-type="", sql-type="") cant be mapped for 
> this datastore. No mapping is available.
> org.datanucleus.exceptions.NucleusException: The java type java.lang.Long 
> (jdbc-type="", sql-type="") cant be mapped for this datastore. No mapping is 
> available.
>  at 
> org.datanucleus.store.rdbms.mapping.RDBMSMappingManager.getDatastoreMappingClass(RDBMSMappingManager.java:1215)
>  at 
> org.datanucleus.store.rdbms.mapping.RDBMSMappingManager.createDatastoreMapping(RDBMSMappingManager.java:1378)
>  at 
> org.datanucleus.store.rdbms.table.AbstractClassTable.addDatastoreId(AbstractClassTable.java:392)
>  at 
> org.datanucleus.store.rdbms.table.ClassTable.initializePK(ClassTable.java:1087)
>  at 
> org.datanucleus.store.rdbms.table.ClassTable.preInitialize(ClassTable.java:247)
>  at 
> org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.addClassTable(RDBMSStoreManager.java:3118)
>  at 
> org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.addClassTables(RDBMSStoreManager.java:2909)
>  at 
> org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.addClassTablesAndValidate(RDBMSStoreManager.java:3182)
>  at 
> org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.run(RDBMSStoreManager.java:2841)
>  at 
> org.datanucleus.store.rdbms.AbstractSchemaTransaction.execute(AbstractSchemaTransaction.java:122)
>  at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.addClasses(RDBMSStoreManager.java:1605)
>  at 
> org.datanucleus.store.AbstractStoreManager.addClass(AbstractStoreManager.java:954)
>  at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.getDatastoreClass(RDBMSStoreManager.java:679)
>  at 
> org.datanucleus.store.rdbms.query.RDBMSQueryUtils.getStatementForCandidates(RDBMSQueryUtils.java:408)
>  at 
> org.datanucleus.store.rdbms.query.JDOQLQuery.compileQueryFull(JDOQLQuery.java:947)
>  at 
> 

[jira] [Commented] (SPARK-26410) Support per Pandas UDF configuration

2018-12-21 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16726824#comment-16726824
 ] 

Li Jin commented on SPARK-26410:


One thing we want to think about is whether or not to mix batches of different 
sizes inside a single Eval node. It feels a bit complicated, and it's probably a 
little simpler to group UDFs of the same batch size into one Eval node, at some 
performance cost. I am not very certain on this and am curious about what other 
people think.
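
For context, the kind of per-UDF declaration being discussed in the issue might 
look like the sketch below. This is purely illustrative: the maxRecordsPerBatch 
argument does not exist today, and the exact spelling is part of what this JIRA 
would need to decide.
{code:java}
# Hypothetical API sketch only: a per-UDF batch size declared at definition time,
# which the planner could use when deciding which UDFs can share an Eval node.
@pandas_udf('double', PandasUDFType.SCALAR, maxRecordsPerBatch=100)
def predict1(features):
    return features * 2.0
{code}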

> Support per Pandas UDF configuration
> 
>
> Key: SPARK-26410
> URL: https://issues.apache.org/jira/browse/SPARK-26410
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> We use a "maxRecordsPerBatch" conf to control the batch sizes. However, the 
> "right" batch size usually depends on the task itself. It would be nice if 
> user can configure the batch size when they declare the Pandas UDF.
> This is orthogonal to SPARK-23258 (using max buffer size instead of row 
> count).
> Besides API, we should also discuss how to merge Pandas UDFs of different 
> configurations. For example,
> {code}
> df.select(predict1(col("features")), predict2(col("features")))
> {code}
> when predict1 requests 100 rows per batch, while predict2 requests 120 rows 
> per batch.
> cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26410) Support per Pandas UDF configuration

2018-12-21 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16726820#comment-16726820
 ] 

Li Jin commented on SPARK-26410:


Thanks for the explanation. I think it makes sense to have the batch size as one of 
the parameters of a scalar Pandas UDF. 

I also remember seeing cases where the default batch size is too large for a 
particular column that happens to be a large array column. This sounds similar 
to your image example.

> Support per Pandas UDF configuration
> 
>
> Key: SPARK-26410
> URL: https://issues.apache.org/jira/browse/SPARK-26410
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> We use a "maxRecordsPerBatch" conf to control the batch sizes. However, the 
> "right" batch size usually depends on the task itself. It would be nice if 
> user can configure the batch size when they declare the Pandas UDF.
> This is orthogonal to SPARK-23258 (using max buffer size instead of row 
> count).
> Besides API, we should also discuss how to merge Pandas UDFs of different 
> configurations. For example,
> {code}
> df.select(predict1(col("features")), predict2(col("features")))
> {code}
> when predict1 requests 100 rows per batch, while predict2 requests 120 rows 
> per batch.
> cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames for the entire partition

2018-12-20 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725957#comment-16725957
 ] 

Li Jin commented on SPARK-26412:


So this is similar to the mapPartitions API in Scala, but instead of an 
iterator of records, here we want an iterator of pd.DataFrames?
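
A minimal sketch of what such a udf could look like, following that analogy 
(hypothetical API, with load_model standing in for whatever model-loading helper 
the user has): the function receives an iterator of pd.DataFrames for the whole 
partition, so expensive setup only happens once per Python worker.
{code:java}
# Hypothetical iterator-based udf: 'batches' is an iterator of pd.DataFrames
# covering one partition; the model is loaded once and reused for every batch.
def predict(batches):
    model = load_model('/path/to/model')  # hypothetical helper, loaded once
    for pdf in batches:
        yield model.predict(pdf)
{code}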

> Allow Pandas UDF to take an iterator of pd.DataFrames for the entire partition
> --
>
> Key: SPARK-26412
> URL: https://issues.apache.org/jira/browse/SPARK-26412
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Pandas UDF is the ideal connection between PySpark and DL model inference 
> workload. However, user needs to load the model file first to make 
> predictions. It is common to see models of size ~100MB or bigger. If the 
> Pandas UDF execution is limited to batch scope, user need to repeatedly load 
> the same model for every batch in the same python worker process, which is 
> inefficient. I created this JIRA to discuss possible solutions.
> Essentially we need to support "start()" and "finish()" besides "apply". We 
> can either provide those interfaces or simply provide users the iterator of 
> batches in pd.DataFrame and let user code handle it.
> cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26410) Support per Pandas UDF configuration

2018-12-20 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725948#comment-16725948
 ] 

Li Jin commented on SPARK-26410:


I am curious why a user would want to configure maxRecordsPerBatch? As far as I 
remember, this is just a performance/memory-usage conf. In your example, what's 
the reason that the user wants 100 rows per batch for one UDF and 120 rows per 
batch for another?

> Support per Pandas UDF configuration
> 
>
> Key: SPARK-26410
> URL: https://issues.apache.org/jira/browse/SPARK-26410
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> We use a "maxRecordsPerBatch" conf to control the batch sizes. However, the 
> "right" batch size usually depends on the task itself. It would be nice if 
> user can configure the batch size when they declare the Pandas UDF.
> This is orthogonal to SPARK-23258 (using max buffer size instead of row 
> count).
> Besides API, we should also discuss how to merge Pandas UDFs of different 
> configurations. For example,
> {code}
> df.select(predict1(col("features")), predict2(col("features")))
> {code}
> when predict1 requests 100 rows per batch, while predict2 requests 120 rows 
> per batch.
> cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26364) Clean up import statements in pandas udf tests

2018-12-13 Thread Li Jin (JIRA)
Li Jin created SPARK-26364:
--

 Summary: Clean up import statements in pandas udf tests
 Key: SPARK-26364
 URL: https://issues.apache.org/jira/browse/SPARK-26364
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 2.4.0
Reporter: Li Jin


Per discussion [https://github.com/apache/spark/pull/22305/files#r241215618] we 
should clean up the import statements in test_pandas_udf* and move them to the 
top. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26328) Use GenerateOrdering for group key comparison in WindowExec

2018-12-10 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin resolved SPARK-26328.

Resolution: Not A Problem

> Use GenerateOrdering for group key comparison in WindowExec
> ---
>
> Key: SPARK-26328
> URL: https://issues.apache.org/jira/browse/SPARK-26328
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Li Jin
>Priority: Major
>
> Per discussion here:
> [https://github.com/apache/spark/pull/22305/commits/850541873df1101485a85632b85ba474dac67d56#r239312302]
> It seems better to use GenerateOrdering for group key comparison in 
> WindowExec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26328) Use GenerateOrdering for group key comparison in WindowExec

2018-12-10 Thread Li Jin (JIRA)
Li Jin created SPARK-26328:
--

 Summary: Use GenerateOrdering for group key comparison in 
WindowExec
 Key: SPARK-26328
 URL: https://issues.apache.org/jira/browse/SPARK-26328
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Li Jin


Per discussion here:

[https://github.com/apache/spark/pull/22305/commits/850541873df1101485a85632b85ba474dac67d56#r239312302]

It seems better to use GenerateOrdering for group key comparison in WindowExec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25640) Clarify/Improve EvalType for grouped aggregate and window aggregate

2018-10-04 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-25640:
---
Description: 
Currently, grouped aggregate and window aggregate use different EvalTypes; 
however, they map to the same user-facing type PandasUDFType.GROUPED_MAP.

It makes sense to have one user-facing type, because it 
(PandasUDFType.GROUPED_MAP) can be used in both groupby and window operations.

However, the mismatch between PandasUDFType and EvalType can be confusing to 
developers. We should clarify and/or improve this.

See discussion at: 
https://github.com/apache/spark/pull/22620#discussion_r222452544

  was:
Currently, grouped aggregate and window aggregate uses different EvalType, 
however, they map to the same user facing type PandasUDFType.GROUPED_MAP.

It makes sense to have one user facing type because it 
(PandasUDFType.GROUPED_MAP) can be used in both groupby and window operation.

However, the mismatching between PandasUDFType and EvalType can be confusing to 
developers. We should clarify and/or improve this.


> Clarify/Improve EvalType for grouped aggregate and window aggregate
> ---
>
> Key: SPARK-25640
> URL: https://issues.apache.org/jira/browse/SPARK-25640
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Li Jin
>Priority: Major
>
> Currently, grouped aggregate and window aggregate uses different EvalType, 
> however, they map to the same user facing type PandasUDFType.GROUPED_MAP.
> It makes sense to have one user facing type because it 
> (PandasUDFType.GROUPED_MAP) can be used in both groupby and window operation.
> However, the mismatching between PandasUDFType and EvalType can be confusing 
> to developers. We should clarify and/or improve this.
> See discussion at: 
> https://github.com/apache/spark/pull/22620#discussion_r222452544



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25640) Clarify/Improve EvalType for grouped aggregate and window aggregate

2018-10-04 Thread Li Jin (JIRA)
Li Jin created SPARK-25640:
--

 Summary: Clarify/Improve EvalType for grouped aggregate and window 
aggregate
 Key: SPARK-25640
 URL: https://issues.apache.org/jira/browse/SPARK-25640
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 2.4.0
Reporter: Li Jin


Currently, grouped aggregate and window aggregate use different EvalTypes; 
however, they map to the same user-facing type PandasUDFType.GROUPED_MAP.

It makes sense to have one user-facing type, because it 
(PandasUDFType.GROUPED_MAP) can be used in both groupby and window operations.

However, the mismatch between PandasUDFType and EvalType can be confusing to 
developers. We should clarify and/or improve this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25213) DataSourceV2 doesn't seem to produce unsafe rows

2018-08-28 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594973#comment-16594973
 ] 

Li Jin commented on SPARK-25213:


This is resolved by https://github.com/apache/spark/pull/22104

> DataSourceV2 doesn't seem to produce unsafe rows 
> -
>
> Key: SPARK-25213
> URL: https://issues.apache.org/jira/browse/SPARK-25213
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Li Jin
>Assignee: Li Jin
>Priority: Major
> Fix For: 2.4.0
>
>
> Reproduce (Need to compile test-classes):
> bin/pyspark --driver-class-path sql/core/target/scala-2.11/test-classes
> {code:java}
> datasource_v2_df = spark.read \
> .format("org.apache.spark.sql.sources.v2.SimpleDataSourceV2") 
> \
> .load()
> result = datasource_v2_df.withColumn('x', udf(lambda x: x, 
> 'int')(datasource_v2_df['i']))
> result.show()
> {code}
> The above code fails with:
> {code:java}
> Caused by: java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:127)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:126)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
> {code}
> Seems like Data Source V2 doesn't produce unsafeRows here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25216) Provide better error message when a column contains dot and needs backticks quote

2018-08-23 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-25216:
---
Description: 
The current error message often fails to make it clear to a new Spark user that a 
column name containing "." needs to be quoted with backticks. 

For example, consider the following code:
{code:java}
spark.range(0, 1).toDF('a.b')['a.b']{code}
the current message looks like:
{code:java}
Cannot resolve column name "a.b" among (a.b)
{code}
We could improve the error message to, e.g: 
{code:java}
Cannot resolve column name "a.b" among (a.b). Try adding backticks to the 
column name, i.e., `a.b`;
{code}
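(For reference, the lookup that presumably works today, with backticks added, is:)
{code:java}
spark.range(0, 1).toDF('a.b')['`a.b`']
{code}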
 

  was:
The current error message is  often confusing to a new Spark user that a column 
containing "." needs backticks quote. 

For example, the current message looks like:
{code:java}
Cannot resolve column name "a.b" among (a.b)
{code}
We could improve the error message to, e.g: 
{code:java}
Cannot resolve column name "a.b" among (a.b). Try adding backticks to the 
column name, i.e., `a.b`;
{code}
 


> Provide better error message when a column contains dot and needs backticks 
> quote
> -
>
> Key: SPARK-25216
> URL: https://issues.apache.org/jira/browse/SPARK-25216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Li Jin
>Priority: Major
>
> The current error message is  often confusing to a new Spark user that a 
> column containing "." needs backticks quote. 
> For example, consider the following code:
> {code:java}
> spark.range(0, 1).toDF('a.b')['a.b']{code}
> the current message looks like:
> {code:java}
> Cannot resolve column name "a.b" among (a.b)
> {code}
> We could improve the error message to, e.g: 
> {code:java}
> Cannot resolve column name "a.b" among (a.b). Try adding backticks to the 
> column name, i.e., `a.b`;
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25216) Provide better error message when a column contains dot and needs backticks quote

2018-08-23 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-25216:
---
Description: 
The current error message often fails to make it clear to a new Spark user that a 
column name containing "." needs to be quoted with backticks. 

For example, the current message looks like:
{code:java}
Cannot resolve column name "a.b" among (a.b)
{code}
We could improve the error message to, e.g: 
{code:java}
Cannot resolve column name "a.b" among (a.b). Try adding backticks to the 
column name, i.e., `a.b`;
{code}
 

  was:
The current error message is  often confusing to a new Spark user that a column 
containing "." needs backticks quote. 

For example, the current message looks like:

 
{code:java}
Cannot resolve column name "a.b" among (a.b)
{code}
 

We could improve the error message to, e.g:

 
{code:java}
Cannot resolve column name "a.b" among (a.b). Try adding backticks to the 
column name, i.e., `a.b`;
{code}
 

 

 


> Provide better error message when a column contains dot and needs backticks 
> quote
> -
>
> Key: SPARK-25216
> URL: https://issues.apache.org/jira/browse/SPARK-25216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Li Jin
>Priority: Major
>
> The current error message is  often confusing to a new Spark user that a 
> column containing "." needs backticks quote. 
> For example, the current message looks like:
> {code:java}
> Cannot resolve column name "a.b" among (a.b)
> {code}
> We could improve the error message to, e.g: 
> {code:java}
> Cannot resolve column name "a.b" among (a.b). Try adding backticks to the 
> column name, i.e., `a.b`;
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25216) Provide better error message when a column contains dot and needs backticks quote

2018-08-23 Thread Li Jin (JIRA)
Li Jin created SPARK-25216:
--

 Summary: Provide better error message when a column contains dot 
and needs backticks quote
 Key: SPARK-25216
 URL: https://issues.apache.org/jira/browse/SPARK-25216
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Li Jin


The current error message often fails to make it clear to a new Spark user that a 
column name containing "." needs to be quoted with backticks. 

For example, the current message looks like:

 
{code:java}
Cannot resolve column name "a.b" among (a.b)
{code}
 

We could improve the error message to, e.g:

 
{code:java}
Cannot resolve column name "a.b" among (a.b). Try adding backticks to the 
column name, i.e., `a.b`;
{code}
 

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25213) DataSourceV2 doesn't seem to produce unsafe rows

2018-08-23 Thread Li Jin (JIRA)
Li Jin created SPARK-25213:
--

 Summary: DataSourceV2 doesn't seem to produce unsafe rows 
 Key: SPARK-25213
 URL: https://issues.apache.org/jira/browse/SPARK-25213
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Li Jin


Reproduce (Need to compile test-classes):

bin/pyspark --driver-class-path sql/core/target/scala-2.11/test-classes
{code:java}
datasource_v2_df = spark.read \
    .format("org.apache.spark.sql.sources.v2.SimpleDataSourceV2") \
    .load()
result = datasource_v2_df.withColumn('x', udf(lambda x: x, 
'int')(datasource_v2_df['i']))
result.show()
{code}
The above code fails with:
{code:java}
Caused by: java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast to 
org.apache.spark.sql.catalyst.expressions.UnsafeRow
at 
org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:127)
at 
org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:126)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
{code}
Seems like Data Source V2 doesn't produce unsafeRows here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24561) User-defined window functions with pandas udf (bounded window)

2018-08-14 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579965#comment-16579965
 ] 

Li Jin commented on SPARK-24561:


I am looking into this. Early investigation: 
https://docs.google.com/document/d/14EjeY5z4-NC27-SmIP9CsMPCANeTcvxN44a7SIJtZPc/edit

> User-defined window functions with pandas udf (bounded window)
> --
>
> Key: SPARK-24561
> URL: https://issues.apache.org/jira/browse/SPARK-24561
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Li Jin
>Priority: Major
>







[jira] [Updated] (SPARK-24721) Failed to use PythonUDF with literal inputs in filter with data sources

2018-08-14 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-24721:
---
Component/s: SQL

> Failed to use PythonUDF with literal inputs in filter with data sources
> ---
>
> Key: SPARK-24721
> URL: https://issues.apache.org/jira/browse/SPARK-24721
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> import random
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def random_probability(label):
>     if label == 1.0:
>         return random.uniform(0.5, 1.0)
>     else:
>         return random.uniform(0.0, 0.4999)
> def randomize_label(ratio):
>     if random.random() >= ratio:
>         return 1.0
>     else:
>         return 0.0
> random_probability = udf(random_probability, DoubleType())
> randomize_label = udf(randomize_label, DoubleType())
> spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3")
> babydf = spark.read.csv("/tmp/tab3")
> data_modified_label = babydf.withColumn(
>   'random_label', randomize_label(lit(1 - 0.1))
> )
> data_modified_random = data_modified_label.withColumn(
>   'random_probability', 
>   random_probability(col('random_label'))
> )
> data_modified_label.filter(col('random_label') == 0).show()
> {code}
> The above code will generate the following exception:
> {code}
> Py4JJavaError: An error occurred while calling o446.showString.
> : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), 
> requires attributes from more than one child.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> 

[jira] [Updated] (SPARK-24721) Failed to use PythonUDF with literal inputs in filter with data sources

2018-08-14 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-24721:
---
Issue Type: Bug  (was: Sub-task)
Parent: (was: SPARK-22216)

> Failed to use PythonUDF with literal inputs in filter with data sources
> ---
>
> Key: SPARK-24721
> URL: https://issues.apache.org/jira/browse/SPARK-24721
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> import random
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def random_probability(label):
>     if label == 1.0:
>         return random.uniform(0.5, 1.0)
>     else:
>         return random.uniform(0.0, 0.4999)
> def randomize_label(ratio):
>     if random.random() >= ratio:
>         return 1.0
>     else:
>         return 0.0
> random_probability = udf(random_probability, DoubleType())
> randomize_label = udf(randomize_label, DoubleType())
> spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3")
> babydf = spark.read.csv("/tmp/tab3")
> data_modified_label = babydf.withColumn(
>   'random_label', randomize_label(lit(1 - 0.1))
> )
> data_modified_random = data_modified_label.withColumn(
>   'random_probability', 
>   random_probability(col('random_label'))
> )
> data_modified_label.filter(col('random_label') == 0).show()
> {code}
> The above code will generate the following exception:
> {code}
> Py4JJavaError: An error occurred while calling o446.showString.
> : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), 
> requires attributes from more than one child.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> 

[jira] [Comment Edited] (SPARK-24721) Failed to use PythonUDF with literal inputs in filter with data sources

2018-08-14 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579956#comment-16579956
 ] 

Li Jin edited comment on SPARK-24721 at 8/14/18 3:26 PM:
-

Updated Jira title to reflect the actual issue


was (Author: icexelloss):
Updates Jira title to reflect the actual issue

> Failed to use PythonUDF with literal inputs in filter with data sources
> ---
>
> Key: SPARK-24721
> URL: https://issues.apache.org/jira/browse/SPARK-24721
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> import random
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def random_probability(label):
>     if label == 1.0:
>         return random.uniform(0.5, 1.0)
>     else:
>         return random.uniform(0.0, 0.4999)
> def randomize_label(ratio):
>     if random.random() >= ratio:
>         return 1.0
>     else:
>         return 0.0
> random_probability = udf(random_probability, DoubleType())
> randomize_label = udf(randomize_label, DoubleType())
> spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3")
> babydf = spark.read.csv("/tmp/tab3")
> data_modified_label = babydf.withColumn(
>   'random_label', randomize_label(lit(1 - 0.1))
> )
> data_modified_random = data_modified_label.withColumn(
>   'random_probability', 
>   random_probability(col('random_label'))
> )
> data_modified_label.filter(col('random_label') == 0).show()
> {code}
> The above code will generate the following exception:
> {code}
> Py4JJavaError: An error occurred while calling o446.showString.
> : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), 
> requires attributes from more than one child.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> 

[jira] [Commented] (SPARK-24721) Failed to use PythonUDF with literal inputs in filter with data sources

2018-08-14 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579956#comment-16579956
 ] 

Li Jin commented on SPARK-24721:


Updates Jira title to reflect the actual issue

> Failed to use PythonUDF with literal inputs in filter with data sources
> ---
>
> Key: SPARK-24721
> URL: https://issues.apache.org/jira/browse/SPARK-24721
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> import random
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def random_probability(label):
>     if label == 1.0:
>         return random.uniform(0.5, 1.0)
>     else:
>         return random.uniform(0.0, 0.4999)
> def randomize_label(ratio):
>     if random.random() >= ratio:
>         return 1.0
>     else:
>         return 0.0
> random_probability = udf(random_probability, DoubleType())
> randomize_label = udf(randomize_label, DoubleType())
> spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3")
> babydf = spark.read.csv("/tmp/tab3")
> data_modified_label = babydf.withColumn(
>   'random_label', randomize_label(lit(1 - 0.1))
> )
> data_modified_random = data_modified_label.withColumn(
>   'random_probability', 
>   random_probability(col('random_label'))
> )
> data_modified_label.filter(col('random_label') == 0).show()
> {code}
> The above code will generate the following exception:
> {code}
> Py4JJavaError: An error occurred while calling o446.showString.
> : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), 
> requires attributes from more than one child.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> 

[jira] [Updated] (SPARK-24721) Failed to use PythonUDF with literal inputs in filter with data sources

2018-08-14 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-24721:
---
Summary: Failed to use PythonUDF with literal inputs in filter with data 
sources  (was: Failed to call PythonUDF whose input is the output of another 
PythonUDF)

> Failed to use PythonUDF with literal inputs in filter with data sources
> ---
>
> Key: SPARK-24721
> URL: https://issues.apache.org/jira/browse/SPARK-24721
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> import random
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def random_probability(label):
>     if label == 1.0:
>         return random.uniform(0.5, 1.0)
>     else:
>         return random.uniform(0.0, 0.4999)
> def randomize_label(ratio):
>     if random.random() >= ratio:
>         return 1.0
>     else:
>         return 0.0
> random_probability = udf(random_probability, DoubleType())
> randomize_label = udf(randomize_label, DoubleType())
> spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3")
> babydf = spark.read.csv("/tmp/tab3")
> data_modified_label = babydf.withColumn(
>   'random_label', randomize_label(lit(1 - 0.1))
> )
> data_modified_random = data_modified_label.withColumn(
>   'random_probability', 
>   random_probability(col('random_label'))
> )
> data_modified_label.filter(col('random_label') == 0).show()
> {code}
> The above code will generate the following exception:
> {code}
> Py4JJavaError: An error occurred while calling o446.showString.
> : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), 
> requires attributes from more than one child.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> 

[jira] [Comment Edited] (SPARK-24721) Failed to call PythonUDF whose input is the output of another PythonUDF

2018-07-27 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560337#comment-16560337
 ] 

Li Jin edited comment on SPARK-24721 at 7/27/18 9:18 PM:
-

I think the issue is that the UDF is being pushed down to the PartitionFilters of 
the FileScan physical node, and then the ExtractPythonUDFs rule throws the exception 
(this is the Spark plan before executing the ExtractPythonUDFs rule):
{code:java}
Project [_c0#17, (1) AS v1#20]
+- Filter ((1) = 0)
   +- FileScan csv [_c0#17] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:/tmp/tab3], PartitionFilters: [((1) = 0)], 
PushedFilters: [], ReadSchema: struct<_c0:string>
{code}
I am not familiar with how PartitionFilters pushdown is supposed to work. 
[~smilegator] and [~cloud_fan], could you maybe point me in the right 
direction? Should we not push down the filter  to FileScan? Or 
should we ignore it in the ExtractPythonUDFs rule?

 


was (Author: icexelloss):
I think the issue is that the UDF is being pushed down to the PartitionFilters of 
the FileScan physical node (this is the Spark plan before executing the 
ExtractPythonUDFs rule):

 
{code:java}
Project [_c0#17, (1) AS v1#20]
+- Filter ((1) = 0)
   +- FileScan csv [_c0#17] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:/tmp/tab3], PartitionFilters: [((1) = 0)], 
PushedFilters: [], ReadSchema: struct<_c0:string>
{code}
I am not familiar with how PartitionFilters pushdown is supposed to work. 
[~smilegator] and [~cloud_fan], could you maybe point me in the right 
direction? Should we not push down the filter  to FileScan? Or 
should we ignore it in the ExtractPythonUDFs rule?

 

> Failed to call PythonUDF whose input is the output of another PythonUDF
> ---
>
> Key: SPARK-24721
> URL: https://issues.apache.org/jira/browse/SPARK-24721
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> import random
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def random_probability(label):
>     if label == 1.0:
>         return random.uniform(0.5, 1.0)
>     else:
>         return random.uniform(0.0, 0.4999)
> def randomize_label(ratio):
>     if random.random() >= ratio:
>         return 1.0
>     else:
>         return 0.0
> random_probability = udf(random_probability, DoubleType())
> randomize_label = udf(randomize_label, DoubleType())
> spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3")
> babydf = spark.read.csv("/tmp/tab3")
> data_modified_label = babydf.withColumn(
>   'random_label', randomize_label(lit(1 - 0.1))
> )
> data_modified_random = data_modified_label.withColumn(
>   'random_probability', 
>   random_probability(col('random_label'))
> )
> data_modified_label.filter(col('random_label') == 0).show()
> {code}
> The above code will generate the following exception:
> {code}
> Py4JJavaError: An error occurred while calling o446.showString.
> : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), 
> requires attributes from more than one child.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> 

[jira] [Commented] (SPARK-24721) Failed to call PythonUDF whose input is the output of another PythonUDF

2018-07-27 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560337#comment-16560337
 ] 

Li Jin commented on SPARK-24721:


I think the issue is that the UDF is being pushed down to the PartitionFilters of 
the FileScan physical node (this is the Spark plan before executing the 
ExtractPythonUDFs rule):

 
{code:java}
Project [_c0#17, (1) AS v1#20]
+- Filter ((1) = 0)
   +- FileScan csv [_c0#17] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:/tmp/tab3], PartitionFilters: [((1) = 0)], 
PushedFilters: [], ReadSchema: struct<_c0:string>
{code}
I am not familiar with how PartitionFilters pushdown is supposed to work. 
[~smilegator] and [~cloud_fan], could you maybe point me in the right 
direction? Should we not push down the filter  to FileScan? Or 
should we ignore it in the ExtractPythonUDFs rule?
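As a hedged sanity check (my addition, not from the thread, reusing the same pyspark shell session): the same literal-input UDF filter appears to plan fine when the input is not a file source, which is consistent with the PartitionFilters pushdown diagnosis above:
{code:python}
# Hedged contrast sketch: spark.range involves no FileScan, so there is no
# PartitionFilters slot for the Python UDF predicate to be pushed into.
from pyspark.sql.functions import udf, lit

df = spark.range(1)
df2 = df.withColumn('v1', udf(lambda x: x, 'int')(lit(1)))
df2.filter(df2['v1'] == 0).explain()  # plans without the "Invalid PythonUDF" error
{code}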

 

> Failed to call PythonUDF whose input is the output of another PythonUDF
> ---
>
> Key: SPARK-24721
> URL: https://issues.apache.org/jira/browse/SPARK-24721
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> import random
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def random_probability(label):
>     if label == 1.0:
>         return random.uniform(0.5, 1.0)
>     else:
>         return random.uniform(0.0, 0.4999)
> def randomize_label(ratio):
>     if random.random() >= ratio:
>         return 1.0
>     else:
>         return 0.0
> random_probability = udf(random_probability, DoubleType())
> randomize_label = udf(randomize_label, DoubleType())
> spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3")
> babydf = spark.read.csv("/tmp/tab3")
> data_modified_label = babydf.withColumn(
>   'random_label', randomize_label(lit(1 - 0.1))
> )
> data_modified_random = data_modified_label.withColumn(
>   'random_probability', 
>   random_probability(col('random_label'))
> )
> data_modified_label.filter(col('random_label') == 0).show()
> {code}
> The above code will generate the following exception:
> {code}
> Py4JJavaError: An error occurred while calling o446.showString.
> : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), 
> requires attributes from more than one child.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> 

[jira] [Commented] (SPARK-24721) Failed to call PythonUDF whose input is the output of another PythonUDF

2018-07-27 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560283#comment-16560283
 ] 

Li Jin commented on SPARK-24721:


{code:java}
from pyspark.sql.functions import udf, lit, col

spark.range(1).write.mode("overwrite").format('csv').save("/tmp/tab3")
df = spark.read.csv('/tmp/tab3')
df2 = df.withColumn('v1', udf(lambda x: x, 'int')(lit(1)))
df2 = df2.filter(df2['v1'] == 0)

df2.explain()
{code}
This is a simpler reproduction.

> Failed to call PythonUDF whose input is the output of another PythonUDF
> ---
>
> Key: SPARK-24721
> URL: https://issues.apache.org/jira/browse/SPARK-24721
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> import random
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def random_probability(label):
>     if label == 1.0:
>         return random.uniform(0.5, 1.0)
>     else:
>         return random.uniform(0.0, 0.4999)
> def randomize_label(ratio):
>     if random.random() >= ratio:
>         return 1.0
>     else:
>         return 0.0
> random_probability = udf(random_probability, DoubleType())
> randomize_label = udf(randomize_label, DoubleType())
> spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3")
> babydf = spark.read.csv("/tmp/tab3")
> data_modified_label = babydf.withColumn(
>   'random_label', randomize_label(lit(1 - 0.1))
> )
> data_modified_random = data_modified_label.withColumn(
>   'random_probability', 
>   random_probability(col('random_label'))
> )
> data_modified_label.filter(col('random_label') == 0).show()
> {code}
> The above code will generate the following exception:
> {code}
> Py4JJavaError: An error occurred while calling o446.showString.
> : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), 
> requires attributes from more than one child.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> 

[jira] [Updated] (SPARK-24624) Can not mix vectorized and non-vectorized UDFs

2018-07-15 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-24624:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-22216

> Can not mix vectorized and non-vectorized UDFs
> --
>
> Key: SPARK-24624
> URL: https://issues.apache.org/jira/browse/SPARK-24624
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> In the current implementation, we have a limitation: users are unable to mix 
> vectorized and non-vectorized UDFs in the same Project. This becomes worse because 
> our optimizer can combine consecutive Projects into a single one. For 
> example, 
> {code}
> applied_df = df.withColumn('regular', my_regular_udf('total', 
> 'qty')).withColumn('pandas', my_pandas_udf('total', 'qty'))
> {code}
> This returns the following error:
> {code}
> IllegalArgumentException: Can not mix vectorized and non-vectorized UDFs
> java.lang.IllegalArgumentException: Can not mix vectorized and non-vectorized 
> UDFs
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$6.apply(ExtractPythonUDFs.scala:170)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$6.apply(ExtractPythonUDFs.scala:146)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at scala.collection.immutable.List.foreach(List.scala:381)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>  at scala.collection.immutable.List.map(List.scala:285)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:146)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:118)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$6.apply(TreeNode.scala:312)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$6.apply(TreeNode.scala:312)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$8.apply(TreeNode.scala:331)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:329)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$8.apply(TreeNode.scala:331)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:329)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:114)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:94)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:113)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:113)
>  at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>  at scala.collection.immutable.List.foldLeft(List.scala:84)
>  at 
> org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:113)
>  at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:100)
>  at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:99)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3312)
>  at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:2750)
>  ...
> {code}






[jira] [Updated] (SPARK-24721) Failed to call PythonUDF whose input is the output of another PythonUDF

2018-07-15 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-24721:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-22216

> Failed to call PythonUDF whose input is the output of another PythonUDF
> ---
>
> Key: SPARK-24721
> URL: https://issues.apache.org/jira/browse/SPARK-24721
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> import random
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def random_probability(label):
>     if label == 1.0:
>         return random.uniform(0.5, 1.0)
>     else:
>         return random.uniform(0.0, 0.4999)
> def randomize_label(ratio):
>     if random.random() >= ratio:
>         return 1.0
>     else:
>         return 0.0
> random_probability = udf(random_probability, DoubleType())
> randomize_label = udf(randomize_label, DoubleType())
> spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3")
> babydf = spark.read.csv("/tmp/tab3")
> data_modified_label = babydf.withColumn(
>   'random_label', randomize_label(lit(1 - 0.1))
> )
> data_modified_random = data_modified_label.withColumn(
>   'random_probability', 
>   random_probability(col('random_label'))
> )
> data_modified_label.filter(col('random_label') == 0).show()
> {code}
> The above code will generate the following exception:
> {code}
> Py4JJavaError: An error occurred while calling o446.showString.
> : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), 
> requires attributes from more than one child.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> 

[jira] [Commented] (SPARK-24721) Failed to call PythonUDF whose input is the output of another PythonUDF

2018-07-15 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544656#comment-16544656
 ] 

Li Jin commented on SPARK-24721:


I am currently traveling but will try to take a look when I get back

> Failed to call PythonUDF whose input is the output of another PythonUDF
> ---
>
> Key: SPARK-24721
> URL: https://issues.apache.org/jira/browse/SPARK-24721
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> import random
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def random_probability(label):
>     if label == 1.0:
>         return random.uniform(0.5, 1.0)
>     else:
>         return random.uniform(0.0, 0.4999)
> def randomize_label(ratio):
>     if random.random() >= ratio:
>         return 1.0
>     else:
>         return 0.0
> random_probability = udf(random_probability, DoubleType())
> randomize_label = udf(randomize_label, DoubleType())
> spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3")
> babydf = spark.read.csv("/tmp/tab3")
> data_modified_label = babydf.withColumn(
>   'random_label', randomize_label(lit(1 - 0.1))
> )
> data_modified_random = data_modified_label.withColumn(
>   'random_probability', 
>   random_probability(col('random_label'))
> )
> data_modified_label.filter(col('random_label') == 0).show()
> {code}
> The above code will generate the following exception:
> {code}
> Py4JJavaError: An error occurred while calling o446.showString.
> : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), 
> requires attributes from more than one child.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> 

[jira] [Updated] (SPARK-24796) Support GROUPED_AGG_PANDAS_UDF in Pivot

2018-07-15 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-24796:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-22216

> Support GROUPED_AGG_PANDAS_UDF in Pivot
> ---
>
> Key: SPARK-24796
> URL: https://issues.apache.org/jira/browse/SPARK-24796
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Major
>
> Currently, Grouped Agg Pandas UDF is not supported in Pivot. It would be nice 
> to support it. 
> {code}
> # create input dataframe
> from pyspark.sql import Row
> data = [
>   Row(id=123, total=200.0, qty=3, name='item1'),
>   Row(id=124, total=1500.0, qty=1, name='item2'),
>   Row(id=125, total=203.5, qty=2, name='item3'),
>   Row(id=126, total=200.0, qty=500, name='item1'),
> ]
> df = spark.createDataFrame(data)
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> @pandas_udf('double', PandasUDFType.GROUPED_AGG)
> def pandas_avg(v):
>return v.mean()
> from pyspark.sql.functions import col, sum
>   
> applied_df = 
> df.groupby('id').pivot('name').agg(pandas_avg('total').alias('mean'))
> applied_df.show()
> {code}
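A hedged workaround sketch (my addition, not from the issue, reusing df and pandas_avg from the snippet above): run the grouped-agg pandas UDF in a plain groupby first, then pivot the pre-aggregated values with a builtin aggregate such as first():
{code:python}
# Hedged workaround sketch: aggregate per (id, name) with the pandas UDF, then pivot
# the already-aggregated means with first(), which Pivot does support.
from pyspark.sql.functions import first

per_group = df.groupby('id', 'name').agg(pandas_avg('total').alias('mean'))
pivoted = per_group.groupby('id').pivot('name').agg(first('mean'))
pivoted.show()
{code}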






[jira] [Commented] (SPARK-24796) Support GROUPED_AGG_PANDAS_UDF in Pivot

2018-07-15 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544655#comment-16544655
 ] 

Li Jin commented on SPARK-24796:


Sorry I am traveling now but I will try to take a look when I get back

> Support GROUPED_AGG_PANDAS_UDF in Pivot
> ---
>
> Key: SPARK-24796
> URL: https://issues.apache.org/jira/browse/SPARK-24796
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Major
>
> Currently, Grouped Agg Pandas UDF is not supported in Pivot. It would be nice 
> to support it. 
> {code}
> # create input dataframe
> from pyspark.sql import Row
> data = [
>   Row(id=123, total=200.0, qty=3, name='item1'),
>   Row(id=124, total=1500.0, qty=1, name='item2'),
>   Row(id=125, total=203.5, qty=2, name='item3'),
>   Row(id=126, total=200.0, qty=500, name='item1'),
> ]
> df = spark.createDataFrame(data)
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> @pandas_udf('double', PandasUDFType.GROUPED_AGG)
> def pandas_avg(v):
>return v.mean()
> from pyspark.sql.functions import col, sum
>   
> applied_df = 
> df.groupby('id').pivot('name').agg(pandas_avg('total').alias('mean'))
> applied_df.show()
> {code}






[jira] [Commented] (SPARK-24760) Pandas UDF does not handle NaN correctly

2018-07-09 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16537662#comment-16537662
 ] 

Li Jin commented on SPARK-24760:


I think the issue here is that the output schema for the UDF is not correctly 
specified.

You specified the output schema of the UDF to be the same as the input's, which means 
the column "new" is not nullable, but you are returning a null value from your 
UDF. The returned data doesn't match the declared schema, hence the exception.

Does that make sense?
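If that diagnosis is right, a minimal sketch of the fix (my wording, not from the thread) is to declare the output schema explicitly with a nullable "new" column instead of reusing df_tmp.schema, where lit(1.0) made "new" non-nullable:
{code:python}
# Hedged sketch: explicit output schema with nullable 'new', so null/NaN results
# from the UDF match the declared schema instead of violating a non-null column.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
from pyspark.sql.functions import pandas_udf, PandasUDFType

out_schema = StructType([
    StructField('key', StringType(), True),
    StructField('value', IntegerType(), True),
    StructField('new', DoubleType(), True),   # nullable, unlike the lit(1.0) column
])

@pandas_udf(out_schema, PandasUDFType.GROUPED_MAP)
def func(pdf):
    pdf['new'] = pdf['value'].where(pdf['value'] > 0, float('nan'))
    return pdf

df.groupby('key').apply(func).show()
{code}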

 

> Pandas UDF does not handle NaN correctly
> 
>
> Key: SPARK-24760
> URL: https://issues.apache.org/jira/browse/SPARK-24760
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0, 2.3.1
> Environment: Spark 2.3.1
> Pandas 0.23.1
>Reporter: Mortada Mehyar
>Priority: Minor
>
> I noticed that having `NaN` values when using the new Pandas UDF feature 
> triggers a JVM exception. Not sure if this is an issue with PySpark or 
> PyArrow. Here is a somewhat contrived example to showcase the problem.
> {code}
> In [1]: import pandas as pd
>...: from pyspark.sql.functions import lit, pandas_udf, PandasUDFType
> In [2]: d = [{'key': 'a', 'value': 1},
>  {'key': 'a', 'value': 2},
>  {'key': 'b', 'value': 3},
>  {'key': 'b', 'value': -2}]
> df = spark.createDataFrame(d, "key: string, value: int")
> df.show()
> +---+-+
> |key|value|
> +---+-+
> |  a|1|
> |  a|2|
> |  b|3|
> |  b|   -2|
> +---+-+
> In [3]: df_tmp = df.withColumn('new', lit(1.0))  # add a DoubleType column
> df_tmp.printSchema()
> root
>  |-- key: string (nullable = true)
>  |-- value: integer (nullable = true)
>  |-- new: double (nullable = false)
> {code}
> And the Pandas UDF simply creates a new column where negative values are set 
> to a particular float, in this case INF, and it works fine:
> {code}
> In [4]: @pandas_udf(df_tmp.schema, PandasUDFType.GROUPED_MAP)
>...: def func(pdf):
>...: pdf['new'] = pdf['value'].where(pdf['value'] > 0, float('inf'))
>...: return pdf
> In [5]: df.groupby('key').apply(func).show()
> +---+-+--+
> |key|value|new|
> +---+-+--+
> |  b|3|   3.0|
> |  b|   -2|  Infinity|
> |  a|1|   1.0|
> |  a|2|   2.0|
> +---+-+--+
> {code}
> However if we set this value to NaN then it triggers an exception:
> {code}
> In [6]: @pandas_udf(df_tmp.schema, PandasUDFType.GROUPED_MAP)
>...: def func(pdf):
>...: pdf['new'] = pdf['value'].where(pdf['value'] > 0, float('nan'))
>...: return pdf
>...:
>...: df.groupby('key').apply(func).show()
> [Stage 23:==> (73 + 2) / 
> 75]2018-07-07 16:26:27 ERROR Executor:91 - Exception in task 36.0 in stage 
> 23.0 (TID 414)
> java.lang.IllegalStateException: Value at index is null
>   at org.apache.arrow.vector.Float8Vector.get(Float8Vector.java:98)
>   at 
> org.apache.spark.sql.vectorized.ArrowColumnVector$DoubleAccessor.getDouble(ArrowColumnVector.java:344)
>   at 
> org.apache.spark.sql.vectorized.ArrowColumnVector.getDouble(ArrowColumnVector.java:99)
>   at 
> org.apache.spark.sql.execution.vectorized.MutableColumnarRow.getDouble(MutableColumnarRow.java:126)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at 

[jira] [Commented] (SPARK-24721) Failed to call PythonUDF whose input is the output of another PythonUDF

2018-07-02 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530462#comment-16530462
 ] 

Li Jin commented on SPARK-24721:


Yep I can take a look

> Failed to call PythonUDF whose input is the output of another PythonUDF
> ---
>
> Key: SPARK-24721
> URL: https://issues.apache.org/jira/browse/SPARK-24721
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> import random
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def random_probability(label):
>     if label == 1.0:
>         return random.uniform(0.5, 1.0)
>     else:
>         return random.uniform(0.0, 0.4999)
> 
> def randomize_label(ratio):
>     if random.random() >= ratio:
>         return 1.0
>     else:
>         return 0.0
> random_probability = udf(random_probability, DoubleType())
> randomize_label = udf(randomize_label, DoubleType())
> spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3")
> babydf = spark.read.csv("/tmp/tab3")
> data_modified_label = babydf.withColumn(
>   'random_label', randomize_label(lit(1 - 0.1))
> )
> data_modified_random = data_modified_label.withColumn(
>   'random_probability', 
>   random_probability(col('random_label'))
> )
> data_modified_label.filter(col('random_label') == 0).show()
> {code}
> The above code will generate the following exception:
> {code}
> Py4JJavaError: An error occurred while calling o446.showString.
> : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), 
> requires attributes from more than one child.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> 

[jira] [Commented] (SPARK-24624) Can not mix vectorized and non-vectorized UDFs

2018-06-21 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519590#comment-16519590
 ] 

Li Jin commented on SPARK-24624:


I can take a look at this one

> Can not mix vectorized and non-vectorized UDFs
> --
>
> Key: SPARK-24624
> URL: https://issues.apache.org/jira/browse/SPARK-24624
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> In the current implementation we have a limitation: users are unable to mix 
> vectorized and non-vectorized UDFs in the same Project. This becomes worse since 
> our optimizer can combine consecutive Projects into a single one. For 
> example, 
> {code}
> applied_df = df.withColumn('regular', my_regular_udf('total', 'qty')).withColumn('pandas', my_pandas_udf('total', 'qty'))
> {code}
> This returns the following error:
> {code}
> IllegalArgumentException: Can not mix vectorized and non-vectorized UDFs
> java.lang.IllegalArgumentException: Can not mix vectorized and non-vectorized 
> UDFs
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$6.apply(ExtractPythonUDFs.scala:170)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$6.apply(ExtractPythonUDFs.scala:146)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at scala.collection.immutable.List.foreach(List.scala:381)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>  at scala.collection.immutable.List.map(List.scala:285)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:146)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:118)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$6.apply(TreeNode.scala:312)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$6.apply(TreeNode.scala:312)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$8.apply(TreeNode.scala:331)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:329)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$8.apply(TreeNode.scala:331)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:329)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:114)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:94)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:113)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:113)
>  at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>  at scala.collection.immutable.List.foldLeft(List.scala:84)
>  at 
> org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:113)
>  at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:100)
>  at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:99)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3312)
>  at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:2750)
>  ...
> {code}
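
One way to sidestep the limitation until it is lifted (my suggestion, not from the 
ticket, and assuming a DataFrame with total and qty columns like the snippet above) 
is to express both columns with the same UDF flavor, e.g. rewrite the row-at-a-time 
UDF as a scalar pandas UDF so the collapsed Project contains only vectorized UDFs:

{code}
# Hypothetical sketch: both UDFs are scalar pandas UDFs, so the optimizer can
# collapse the two Projects without mixing vectorized and non-vectorized UDFs.
# The UDF bodies are made up for illustration.
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('double', PandasUDFType.SCALAR)
def my_regular_udf(total, qty):      # formerly a plain row-at-a-time @udf
    return total / qty

@pandas_udf('double', PandasUDFType.SCALAR)
def my_pandas_udf(total, qty):
    return total * qty

applied_df = (df.withColumn('regular', my_regular_udf('total', 'qty'))
                .withColumn('pandas', my_pandas_udf('total', 'qty')))
{code}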



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24578) Reading remote cache block behavior changes and causes timeout issue

2018-06-18 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16515879#comment-16515879
 ] 

Li Jin edited comment on SPARK-24578 at 6/18/18 3:24 PM:
-

cc @gatorsmile [~cloud_fan]

We found this when switching from 2.2.1 to 2.3.0 in one of our applications. 
The implication is pretty bad - the timeouts significantly hurt performance 
(some jobs went from 20s to several minutes). This could affect other 
Spark 2.3 users too because it's pretty easy to reproduce.


was (Author: icexelloss):
cc @gatorsmile

We found this when switching from 2.2.1 to 2.3.0 in one of our applications. 
The implication is pretty bad - the time outs significantly hurt the 
performance (20s to several minutes for some jobs). This could affect other 
Spark 2.3 users too because it's pretty easy to reproduce.

> Reading remote cache block behavior changes and causes timeout issue
> 
>
> Key: SPARK-24578
> URL: https://issues.apache.org/jira/browse/SPARK-24578
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Wenbo Zhao
>Priority: Major
>
> After Spark 2.3, we observed lots of errors like the following in some of our 
> production jobs
> {code:java}
> 18/06/15 20:59:42 ERROR TransportRequestHandler: Error sending result 
> ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=91672904003, 
> chunkIndex=0}, 
> buffer=org.apache.spark.storage.BlockManagerManagedBuffer@783a9324} to 
> /172.22.18.7:60865; closing connection
> java.io.IOException: Broken pipe
> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
> at 
> org.apache.spark.network.protocol.MessageWithHeader.writeNioBuffer(MessageWithHeader.java:156)
> at 
> org.apache.spark.network.protocol.MessageWithHeader.copyByteBuf(MessageWithHeader.java:142)
> at 
> org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:123)
> at 
> io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:355)
> at 
> io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:224)
> at 
> io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:382)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
> at 
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:362)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:901)
> at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1321)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
> at 
> io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
> at 
> io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
> at 
> io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
> at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
> at 
> io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
> at 
> io.netty.channel.DefaultChannelPipeline.flush(DefaultChannelPipeline.java:983)
> at io.netty.channel.AbstractChannel.flush(AbstractChannel.java:248)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$1.run(AbstractNioByteChannel.java:284)
> at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> {code}
>  
> Here is a small reproducible for a small cluster of 2 executors (say host-1 
> 

[jira] [Commented] (SPARK-24578) Reading remote cache block behavior changes and causes timeout issue

2018-06-18 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16515879#comment-16515879
 ] 

Li Jin commented on SPARK-24578:


cc @gatorsmile

We found this when switching from 2.2.1 to 2.3.0 in one of our applications. 
The implication is pretty bad - the timeouts significantly hurt performance 
(some jobs went from 20s to several minutes). This could affect other 
Spark 2.3 users too because it's pretty easy to reproduce.

> Reading remote cache block behavior changes and causes timeout issue
> 
>
> Key: SPARK-24578
> URL: https://issues.apache.org/jira/browse/SPARK-24578
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Wenbo Zhao
>Priority: Major
>
> After Spark 2.3, we observed lots of errors like the following in some of our 
> production jobs
> {code:java}
> 18/06/15 20:59:42 ERROR TransportRequestHandler: Error sending result 
> ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=91672904003, 
> chunkIndex=0}, 
> buffer=org.apache.spark.storage.BlockManagerManagedBuffer@783a9324} to 
> /172.22.18.7:60865; closing connection
> java.io.IOException: Broken pipe
> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
> at 
> org.apache.spark.network.protocol.MessageWithHeader.writeNioBuffer(MessageWithHeader.java:156)
> at 
> org.apache.spark.network.protocol.MessageWithHeader.copyByteBuf(MessageWithHeader.java:142)
> at 
> org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:123)
> at 
> io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:355)
> at 
> io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:224)
> at 
> io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:382)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
> at 
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:362)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:901)
> at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1321)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
> at 
> io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
> at 
> io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
> at 
> io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
> at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
> at 
> io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
> at 
> io.netty.channel.DefaultChannelPipeline.flush(DefaultChannelPipeline.java:983)
> at io.netty.channel.AbstractChannel.flush(AbstractChannel.java:248)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$1.run(AbstractNioByteChannel.java:284)
> at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> {code}
>  
> Here is a small reproducer for a small cluster of 2 executors (say host-1 
> and host-2), each with 8 cores. The memory of the driver and executors is 
> not an important factor here as long as it is big enough, say 20G. 
> {code:java}
> val n = 1
> val df0 = sc.parallelize(1 to n).toDF
> val df = df0.withColumn("x0", rand()).withColumn("x0", rand()
> ).withColumn("x1", rand()
> ).withColumn("x2", rand()
> ).withColumn("x3", rand()
> ).withColumn("x4", 

[jira] [Updated] (SPARK-24578) Reading remote cache block behavior changes and causes timeout issue

2018-06-18 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-24578:
---
Component/s: (was: Input/Output)
 Spark Core

> Reading remote cache block behavior changes and causes timeout issue
> 
>
> Key: SPARK-24578
> URL: https://issues.apache.org/jira/browse/SPARK-24578
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Wenbo Zhao
>Priority: Major
>
> After Spark 2.3, we observed lots of errors like the following in some of our 
> production jobs
> {code:java}
> 18/06/15 20:59:42 ERROR TransportRequestHandler: Error sending result 
> ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=91672904003, 
> chunkIndex=0}, 
> buffer=org.apache.spark.storage.BlockManagerManagedBuffer@783a9324} to 
> /172.22.18.7:60865; closing connection
> java.io.IOException: Broken pipe
> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
> at 
> org.apache.spark.network.protocol.MessageWithHeader.writeNioBuffer(MessageWithHeader.java:156)
> at 
> org.apache.spark.network.protocol.MessageWithHeader.copyByteBuf(MessageWithHeader.java:142)
> at 
> org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:123)
> at 
> io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:355)
> at 
> io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:224)
> at 
> io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:382)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
> at 
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:362)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:901)
> at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1321)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
> at 
> io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
> at 
> io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
> at 
> io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
> at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
> at 
> io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
> at 
> io.netty.channel.DefaultChannelPipeline.flush(DefaultChannelPipeline.java:983)
> at io.netty.channel.AbstractChannel.flush(AbstractChannel.java:248)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$1.run(AbstractNioByteChannel.java:284)
> at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> {code}
>  
> Here is a small reproducer for a small cluster of 2 executors (say host-1 
> and host-2), each with 8 cores (the memory of the driver and executors is not an 
> important factor here as long as it is big enough, say 20G).
> {code:java}
> val n = 1
> val df0 = sc.parallelize(1 to n).toDF
> val df = df0.withColumn("x0", rand()).withColumn("x0", rand()
> ).withColumn("x1", rand()
> ).withColumn("x2", rand()
> ).withColumn("x3", rand()
> ).withColumn("x4", rand()
> ).withColumn("x5", rand()
> ).withColumn("x6", rand()
> ).withColumn("x7", rand()
> ).withColumn("x8", rand()
> ).withColumn("x9", rand())
> df.cache; df.count
> (1 to 10).toArray.par.map { i => println(i); 
> df.groupBy("x1").agg(count("value")).show() }
> {code}
>  
> 

[jira] [Commented] (SPARK-24563) Allow running PySpark shell without Hive

2018-06-14 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512844#comment-16512844
 ] 

Li Jin commented on SPARK-24563:


Will submit a PR soon

> Allow running PySpark shell without Hive
> 
>
> Key: SPARK-24563
> URL: https://issues.apache.org/jira/browse/SPARK-24563
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Li Jin
>Priority: Major
>
> A previous commit: 
> [https://github.com/apache/spark/commit/b3417b731d4e323398a0d7ec6e86405f4464f4f9#diff-3b5463566251d5b09fd328738a9e9bc5]
> disallows running PySpark shell without Hive.
> Per discussion on mailing list, the behavior change is unintended.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24563) Allow running PySpark shell without Hive

2018-06-14 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-24563:
---
Description: 
A previous commit: 

[https://github.com/apache/spark/commit/b3417b731d4e323398a0d7ec6e86405f4464f4f9#diff-3b5463566251d5b09fd328738a9e9bc5]

disallows running PySpark shell without Hive.

Per discussion on mailing list, the behavior change is unintended.

  was:
A previous commit: 

[https://github.com/apache/spark/commit/b3417b731d4e323398a0d7ec6e86405f4464f4f9#diff-3b5463566251d5b09fd328738a9e9bc5]

Disallows running PySpark shell without Hive. Per discussion on mailing list, 
the behavior change is unintended.


> Allow running PySpark shell without Hive
> 
>
> Key: SPARK-24563
> URL: https://issues.apache.org/jira/browse/SPARK-24563
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Li Jin
>Priority: Major
>
> A previous commit: 
> [https://github.com/apache/spark/commit/b3417b731d4e323398a0d7ec6e86405f4464f4f9#diff-3b5463566251d5b09fd328738a9e9bc5]
> disallows running PySpark shell without Hive.
> Per discussion on mailing list, the behavior change is unintended.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24563) Allow running PySpark shell without Hive

2018-06-14 Thread Li Jin (JIRA)
Li Jin created SPARK-24563:
--

 Summary: Allow running PySpark shell without Hive
 Key: SPARK-24563
 URL: https://issues.apache.org/jira/browse/SPARK-24563
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.0
Reporter: Li Jin


A previous commit: 

[https://github.com/apache/spark/commit/b3417b731d4e323398a0d7ec6e86405f4464f4f9#diff-3b5463566251d5b09fd328738a9e9bc5]

Disallows running PySpark shell without Hive. Per discussion on mailing list, 
the behavior change is unintended.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22239) User-defined window functions with pandas udf (unbounded window)

2018-06-14 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-22239:
---
Description: 
Window functions are another place we can benefit from vectorized UDFs and add 
another useful function to the pandas_udf suite.

Example usage (preliminary):
{code:java}
w = Window.partitionBy('id').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

@pandas_udf(DoubleType())
def mean_udf(v):
    return v.mean()

df.withColumn('v_mean', mean_udf(df.v1).over(w))
{code}

  was:
Window function is another place we can benefit from vectored udf and add 
another useful function to the pandas_udf suite.

Example usage (preliminary):

{code:java}
w = Window.partitionBy('id').orderBy('time').rangeBetween(-200, 0)

@pandas_udf(DoubleType())
def ema(v1):
    return v1.ewm(alpha=0.5).mean().iloc[-1]

df.withColumn('v1_ema', ema(df.v1).over(w))
{code}




> User-defined window functions with pandas udf (unbounded window)
> 
>
> Key: SPARK-22239
> URL: https://issues.apache.org/jira/browse/SPARK-22239
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
> Environment: 
>Reporter: Li Jin
>Assignee: Li Jin
>Priority: Major
> Fix For: 2.4.0
>
>
> Window functions are another place we can benefit from vectorized UDFs and add 
> another useful function to the pandas_udf suite.
> Example usage (preliminary):
> {code:java}
> w = Window.partitionBy('id').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
> @pandas_udf(DoubleType())
> def mean_udf(v):
>     return v.mean()
> df.withColumn('v_mean', mean_udf(df.v1).over(w))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24561) User-defined window functions with pandas udf (bounded window)

2018-06-14 Thread Li Jin (JIRA)
Li Jin created SPARK-24561:
--

 Summary: User-defined window functions with pandas udf (bounded 
window)
 Key: SPARK-24561
 URL: https://issues.apache.org/jira/browse/SPARK-24561
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 2.3.1
Reporter: Li Jin






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22239) User-defined window functions with pandas udf (unbounded window)

2018-06-14 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-22239:
---
Summary: User-defined window functions with pandas udf (unbounded window)  
(was: User-defined window functions with pandas udf)

> User-defined window functions with pandas udf (unbounded window)
> 
>
> Key: SPARK-22239
> URL: https://issues.apache.org/jira/browse/SPARK-22239
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
> Environment: 
>Reporter: Li Jin
>Assignee: Li Jin
>Priority: Major
> Fix For: 2.4.0
>
>
> Window functions are another place we can benefit from vectorized UDFs and add 
> another useful function to the pandas_udf suite.
> Example usage (preliminary):
> {code:java}
> w = Window.partitionBy('id').orderBy('time').rangeBetween(-200, 0)
> @pandas_udf(DoubleType())
> def ema(v1):
>     return v1.ewm(alpha=0.5).mean().iloc[-1]
> df.withColumn('v1_ema', ema(df.v1).over(w))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22239) User-defined window functions with pandas udf

2018-06-13 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511495#comment-16511495
 ] 

Li Jin commented on SPARK-22239:


[~hyukjin.kwon] I actually don't think this Jira is done. The PR only resolves 
the unbounded window case; do you mind if I edit the Jira to reflect what's 
done and create a new Jira for the rolling window case?

> User-defined window functions with pandas udf
> -
>
> Key: SPARK-22239
> URL: https://issues.apache.org/jira/browse/SPARK-22239
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
> Environment: 
>Reporter: Li Jin
>Assignee: Li Jin
>Priority: Major
> Fix For: 2.4.0
>
>
> Window functions are another place we can benefit from vectorized UDFs and add 
> another useful function to the pandas_udf suite.
> Example usage (preliminary):
> {code:java}
> w = Window.partitionBy('id').orderBy('time').rangeBetween(-200, 0)
> @pandas_udf(DoubleType())
> def ema(v1):
>     return v1.ewm(alpha=0.5).mean().iloc[-1]
> df.withColumn('v1_ema', ema(df.v1).over(w))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24521) Fix ineffective test in CachedTableSuite

2018-06-11 Thread Li Jin (JIRA)
Li Jin created SPARK-24521:
--

 Summary: Fix ineffective test in CachedTableSuite
 Key: SPARK-24521
 URL: https://issues.apache.org/jira/browse/SPARK-24521
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.3.1
Reporter: Li Jin


test("withColumn doesn't invalidate cached dataframe") in CachedTableSuite 
doesn't not work because:

The UDF is executed and test count incremented when "df.cache()" is called and 
the subsequent "df.collect()" has no effect on the test result. 

link:

https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala#L86



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24258) SPIP: Improve PySpark support for ML Matrix and Vector types

2018-06-08 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506399#comment-16506399
 ] 

Li Jin commented on SPARK-24258:


I ran into [~mengxr] and chatted about this. It seems a good first step is to make 
the tensor type a first-class type in Spark DataFrames. For operations, there are 
concerns about having to add many linear algebra functions to the Spark 
codebase, so it's not clear whether that's a good idea.

Any thoughts?

> SPIP: Improve PySpark support for ML Matrix and Vector types
> 
>
> Key: SPARK-24258
> URL: https://issues.apache.org/jira/browse/SPARK-24258
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Leif Walsh
>Priority: Major
>
> h1. Background and Motivation:
> In Spark ML ({{pyspark.ml.linalg}}), there are four column types you can 
> construct, {{SparseVector}}, {{DenseVector}}, {{SparseMatrix}}, and 
> {{DenseMatrix}}.  In PySpark, you can construct one of these vectors with 
> {{VectorAssembler}}, and then you can run python UDFs on these columns, and 
> use {{toArray()}} to get numpy ndarrays and do things with them.  They also 
> have a small native API where you can compute {{dot()}}, {{norm()}}, and a 
> few other things with them (I think these are computed in scala, not python, 
> could be wrong).
> For statistical applications, having the ability to manipulate columns of 
> matrix and vector values (from here on, I will use the term tensor to refer 
> to arrays of arbitrary dimensionality, matrices are 2-tensors and vectors are 
> 1-tensors) would be powerful.  For example, you could use PySpark to reshape 
> your data in parallel, assemble some matrices from your raw data, and then 
> run some statistical computation on them using UDFs leveraging python 
> libraries like statsmodels, numpy, tensorflow, and scikit-learn.
> I propose enriching the {{pyspark.ml.linalg}} types in the following ways:
> # Expand the set of column operations one can apply to tensor columns beyond 
> the few functions currently available on these types.  Ideally, the API 
> should aim to be as wide as the numpy ndarray API, but would wrap Breeze 
> operations.  For example, we should provide {{DenseVector.outerProduct()}} so 
> that a user could write something like {{df.withColumn("XtX", 
> df["X"].outerProduct(df["X"]))}}.
> # Make sure all ser/de mechanisms (including Arrow) understand these types, 
> and faithfully represent them as natural types in all languages (in scala and 
> java, Breeze objects, in python, numpy ndarrays rather than the 
> pyspark.ml.linalg types that wrap them, in SparkR, I'm not sure what, but 
> something natural) when applying UDFs or collecting with {{toPandas()}}.
> # Improve the construction of these types from scalar columns.  The 
> {{VectorAssembler}} API is not very ergonomic.  I propose something like 
> {{df.withColumn("predictors", Vector.of(df["feature1"], df["feature2"], 
> df["feature3"]))}}.
> h1. Target Personas:
> Data scientists, machine learning practitioners, machine learning library 
> developers.
> h1. Goals:
> This would allow users to do more statistical computation in Spark natively, 
> and would allow users to apply python statistical computation to data in 
> Spark using UDFs.
> h1. Non-Goals:
> I suppose one non-goal is to reimplement something like statsmodels using 
> Breeze data structures and computation.  That could be seen as an effort to 
> enrich Spark ML itself, but is out of scope of this effort.  This effort is 
> just to make it possible and easy to apply existing python libraries to 
> tensor values in parallel.
> h1. Proposed API Changes:
> Add the above APIs to PySpark and the other language bindings.  I think the 
> list is:
> # {{pyspark.ml.linalg.Vector.of(*columns)}}
> # {{pyspark.ml.linalg.Matrix.of( provide this>)}}
> # For each of the matrix and vector types in {{pyspark.ml.linalg}}, add more 
> methods like {{outerProduct}}, {{matmul}}, {{kron}}, etc.  
> https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.linalg.html has a 
> good list to look at.
> Also, change python UDFs so that these tensor types are passed to the python 
> function not as \{Sparse,Dense\}\{Matrix,Vector\} objects that wrap 
> {{numpy.ndarray}}, but as {{numpy.ndarray}} objects by themselves, and 
> interpret return values that are {{numpy.ndarray}} objects back into the 
> spark types.
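
To make the current friction concrete, here is a hypothetical sketch (my illustration, 
not part of the SPIP) of what users typically fall back to today: a plain row-at-a-time 
Python UDF plus toArray(), since the Arrow-based pandas UDF path does not understand 
the ml Vector types.

{code}
# Sketch only: computing a per-row quadratic form X.X with today's API.
# The column and function names are made up for illustration.
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

df = spark.createDataFrame(
    [(Vectors.dense([1.0, 2.0]),), (Vectors.dense([3.0, 4.0]),)], ["X"])

# Each value arrives as a pyspark.ml.linalg Vector; convert to numpy via toArray()
xtx = udf(lambda v: float(v.toArray().dot(v.toArray())), DoubleType())

df.withColumn("XtX", xtx("X")).show()
{code}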



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data when the analyzed plans are different after re-analyzing the plans

2018-05-30 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495695#comment-16495695
 ] 

Li Jin commented on SPARK-24373:


[~smilegator] Thank you for the suggestion.

> "df.cache() df.count()" no longer eagerly caches data when the analyzed plans 
> are different after re-analyzing the plans
> 
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Assignee: Marco Gaido
>Priority: Blocker
> Fix For: 2.3.1, 2.4.0
>
>
> Here is the code to reproduce in local mode
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
> scala> df1.count
> res1: Long = 1 
> scala> df1.count
> res2: Long = 1
> scala> df1.count
> res3: Long = 1
> {code}
>  
> in Spark 2.2, you could see it prints "". 
> In the above example, when you do explain. You could see
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
> but the InMemoryTableScan is missing in the following explain()
> {code:java}
> scala> df1.groupBy().count().explain(true)
> == Parsed Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
> value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> count: bigint
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
> null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
> +- Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#175L])
> +- *(1) Project
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22947) SPIP: as-of join in Spark SQL

2018-05-29 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16493815#comment-16493815
 ] 

Li Jin edited comment on SPARK-22947 at 5/29/18 4:34 PM:
-

Hi [~TomaszGaweda], thanks for your interest! Yes, I am willing to work on this 
but need some help and engagement from Spark committers and PMCs to help move 
the discussion forward. 
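
For anyone who wants to experiment with the semantics locally, the Use Case A example 
quoted below can be sketched with pandas; this is just an illustration of the expected 
behavior, not part of the proposed Spark API:

{code}
# Minimal pandas sketch of the as-of join with a one-day lookback, assuming
# pandas >= 0.19 (merge_asof). Mirrors the dfA/dfB example in the SPIP.
import pandas as pd

dfA = pd.DataFrame({"time": pd.to_datetime(["20160101", "20160102", "20160104", "20160105"]),
                    "quantity": [100, 50, -50, 100]})
dfB = pd.DataFrame({"time": pd.to_datetime(["20151231", "20160104", "20160105"]),
                    "price": [100.0, 105.0, 102.0]})

# Backward (as-of) match within a 1-day tolerance; 20160102 finds no match.
out = pd.merge_asof(dfA, dfB, on="time", tolerance=pd.Timedelta(days=1))
print(out)
#         time  quantity  price
# 0 2016-01-01       100  100.0
# 1 2016-01-02        50    NaN
# 2 2016-01-04       -50  105.0
# 3 2016-01-05       100  102.0
{code}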


was (Author: icexelloss):
Hi [~TomaszGaweda] thanks for your interest! Yes I am willing to work on this 
needs some help and engagement from Spark committers and PMCs to help move 
discussion forward. 

> SPIP: as-of join in Spark SQL
> -
>
> Key: SPARK-22947
> URL: https://issues.apache.org/jira/browse/SPARK-22947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Li Jin
>Priority: Major
> Attachments: SPIP_ as-of join in Spark SQL (1).pdf
>
>
> h2. Background and Motivation
> Time series analysis is one of the most common analysis on financial data. In 
> time series analysis, as-of join is a very common operation. Supporting as-of 
> join in Spark SQL will allow many use cases of using Spark SQL for time 
> series analysis.
> As-of join is “join on time” with inexact time matching criteria. Various 
> library has implemented asof join or similar functionality:
> Kdb: https://code.kx.com/wiki/Reference/aj
> Pandas: 
> http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof
> R: This functionality is called “Last Observation Carried Forward”
> https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf
> JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin
> Flint: https://github.com/twosigma/flint#temporal-join-functions
> This proposal advocates introducing new API in Spark SQL to support as-of 
> join.
> h2. Target Personas
> Data scientists, data engineers
> h2. Goals
> * New API in Spark SQL that allows as-of join
> * As-of join of multiple table (>2) should be performant, because it’s very 
> common that users need to join multiple data sources together for further 
> analysis.
> * Define Distribution, Partitioning and shuffle strategy for ordered time 
> series data
> h2. Non-Goals
> These are out of scope for the existing SPIP, should be considered in future 
> SPIP as improvement to Spark’s time series analysis ability:
> * Utilize partition information from data source, i.e, begin/end of each 
> partition to reduce sorting/shuffling
> * Define API for user to implement asof join time spec in business calendar 
> (i.e. lookback one business day, this is very common in financial data 
> analysis because of market calendars)
> * Support broadcast join
> h2. Proposed API Changes
> h3. TimeContext
> TimeContext is an object that defines the time scope of the analysis, it has 
> begin time (inclusive) and end time (exclusive). User should be able to 
> change the time scope of the analysis (i.e, from one month to five year) by 
> just changing the TimeContext. 
> To Spark engine, TimeContext is a hint that:
> can be used to repartition data for join
> serve as a predicate that can be pushed down to storage layer
> Time context is similar to filtering time by begin/end, the main difference 
> is that time context can be expanded based on the operation taken (see 
> example in as-of join).
> Time context example:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> {code}
> h3. asofJoin
> h4. User Case A (join without key)
> Join two DataFrames on time, with one day lookback:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, quantity
> 20160101, 100
> 20160102, 50
> 20160104, -50
> 20160105, 100
> dfB:
> time, price
> 20151231, 100.0
> 20160104, 105.0
> 20160105, 102.0
> output:
> time, quantity, price
> 20160101, 100, 100.0
> 20160102, 50, null
> 20160104, -50, 105.0
> 20160105, 100, 102.0
> {code}
> Note row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. This 
> is an important illustration of the time context - it is able to expand the 
> context to 20151231 on dfB because of the 1 day lookback.
> h4. Use Case B (join with key)
> To join on time and another key (for instance, id), we use “by” to specify 
> the key.
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = 
> JoinSpec(timeContext).on("time").by("id").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, id, quantity
> 20160101, 1, 100
> 20160101, 2, 50
> 

[jira] [Comment Edited] (SPARK-22947) SPIP: as-of join in Spark SQL

2018-05-29 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16493815#comment-16493815
 ] 

Li Jin edited comment on SPARK-22947 at 5/29/18 4:34 PM:
-

Hi [~TomaszGaweda], thanks for your interest! Yes, I am willing to work on this 
but need some help and involvement from Spark committers and PMCs to help move 
the discussion forward. 


was (Author: icexelloss):
Hi [~TomaszGaweda] thanks for your interest! Yes I am willing to work on this 
but needs some help and engagement from Spark committers and PMCs to help move 
discussion forward. 

> SPIP: as-of join in Spark SQL
> -
>
> Key: SPARK-22947
> URL: https://issues.apache.org/jira/browse/SPARK-22947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Li Jin
>Priority: Major
> Attachments: SPIP_ as-of join in Spark SQL (1).pdf
>
>
> h2. Background and Motivation
> Time series analysis is one of the most common analysis on financial data. In 
> time series analysis, as-of join is a very common operation. Supporting as-of 
> join in Spark SQL will allow many use cases of using Spark SQL for time 
> series analysis.
> As-of join is “join on time” with inexact time matching criteria. Various 
> library has implemented asof join or similar functionality:
> Kdb: https://code.kx.com/wiki/Reference/aj
> Pandas: 
> http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof
> R: This functionality is called “Last Observation Carried Forward”
> https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf
> JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin
> Flint: https://github.com/twosigma/flint#temporal-join-functions
> This proposal advocates introducing new API in Spark SQL to support as-of 
> join.
> h2. Target Personas
> Data scientists, data engineers
> h2. Goals
> * New API in Spark SQL that allows as-of join
> * As-of join of multiple table (>2) should be performant, because it’s very 
> common that users need to join multiple data sources together for further 
> analysis.
> * Define Distribution, Partitioning and shuffle strategy for ordered time 
> series data
> h2. Non-Goals
> These are out of scope for the existing SPIP, should be considered in future 
> SPIP as improvement to Spark’s time series analysis ability:
> * Utilize partition information from data source, i.e, begin/end of each 
> partition to reduce sorting/shuffling
> * Define API for user to implement asof join time spec in business calendar 
> (i.e. lookback one business day, this is very common in financial data 
> analysis because of market calendars)
> * Support broadcast join
> h2. Proposed API Changes
> h3. TimeContext
> TimeContext is an object that defines the time scope of the analysis, it has 
> begin time (inclusive) and end time (exclusive). User should be able to 
> change the time scope of the analysis (i.e, from one month to five year) by 
> just changing the TimeContext. 
> To Spark engine, TimeContext is a hint that:
> can be used to repartition data for join
> serve as a predicate that can be pushed down to storage layer
> Time context is similar to filtering time by begin/end, the main difference 
> is that time context can be expanded based on the operation taken (see 
> example in as-of join).
> Time context example:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> {code}
> h3. asofJoin
> h4. User Case A (join without key)
> Join two DataFrames on time, with one day lookback:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, quantity
> 20160101, 100
> 20160102, 50
> 20160104, -50
> 20160105, 100
> dfB:
> time, price
> 20151231, 100.0
> 20160104, 105.0
> 20160105, 102.0
> output:
> time, quantity, price
> 20160101, 100, 100.0
> 20160102, 50, null
> 20160104, -50, 105.0
> 20160105, 100, 102.0
> {code}
> Note row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. This 
> is an important illustration of the time context - it is able to expand the 
> context to 20151231 on dfB because of the 1 day lookback.
> h4. Use Case B (join with key)
> To join on time and another key (for instance, id), we use “by” to specify 
> the key.
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = 
> JoinSpec(timeContext).on("time").by("id").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, id, quantity
> 20160101, 1, 100
> 20160101, 2, 50
> 

[jira] [Commented] (SPARK-22947) SPIP: as-of join in Spark SQL

2018-05-29 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16493815#comment-16493815
 ] 

Li Jin commented on SPARK-22947:


Hi [~TomaszGaweda], thanks for your interest! Yes, I am willing to work on this 
but need some help and engagement from Spark committers and PMCs to help move 
the discussion forward. 

> SPIP: as-of join in Spark SQL
> -
>
> Key: SPARK-22947
> URL: https://issues.apache.org/jira/browse/SPARK-22947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Li Jin
>Priority: Major
> Attachments: SPIP_ as-of join in Spark SQL (1).pdf
>
>
> h2. Background and Motivation
> Time series analysis is one of the most common analyses of financial data. In 
> time series analysis, as-of join is a very common operation. Supporting as-of 
> join in Spark SQL will enable many use cases of using Spark SQL for time 
> series analysis.
> As-of join is “join on time” with inexact time matching criteria. Various 
> libraries have implemented as-of join or similar functionality:
> Kdb: https://code.kx.com/wiki/Reference/aj
> Pandas: 
> http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof
> R: This functionality is called “Last Observation Carried Forward”
> https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf
> JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin
> Flint: https://github.com/twosigma/flint#temporal-join-functions
> This proposal advocates introducing a new API in Spark SQL to support as-of 
> join.
> h2. Target Personas
> Data scientists, data engineers
> h2. Goals
> * New API in Spark SQL that allows as-of join
> * As-of join of multiple tables (>2) should be performant, because it’s very 
> common that users need to join multiple data sources together for further 
> analysis.
> * Define Distribution, Partitioning and shuffle strategy for ordered time 
> series data
> h2. Non-Goals
> These are out of scope for this SPIP and should be considered in future 
> SPIPs as improvements to Spark’s time series analysis ability:
> * Utilize partition information from the data source, i.e., the begin/end of 
> each partition, to reduce sorting/shuffling
> * Define an API for users to implement as-of join time specs on a business 
> calendar (e.g., look back one business day; this is very common in financial 
> data analysis because of market calendars)
> * Support broadcast join
> h2. Proposed API Changes
> h3. TimeContext
> TimeContext is an object that defines the time scope of the analysis; it has 
> a begin time (inclusive) and an end time (exclusive). The user should be able 
> to change the time scope of the analysis (e.g., from one month to five years) 
> by just changing the TimeContext. 
> To the Spark engine, TimeContext is a hint that:
> * can be used to repartition data for the join
> * can serve as a predicate that can be pushed down to the storage layer
> A time context is similar to filtering time by begin/end; the main difference 
> is that a time context can be expanded based on the operation taken (see the 
> example in as-of join).
> Time context example:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> {code}
> h3. asofJoin
> h4. Use Case A (join without key)
> Join two DataFrames on time, with one day lookback:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, quantity
> 20160101, 100
> 20160102, 50
> 20160104, -50
> 20160105, 100
> dfB:
> time, price
> 20151231, 100.0
> 20160104, 105.0
> 20160105, 102.0
> output:
> time, quantity, price
> 20160101, 100, 100.0
> 20160102, 50, null
> 20160104, -50, 105.0
> 20160105, 100, 102.0
> {code}
> Note that row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. 
> This is an important illustration of the time context: it is able to expand 
> the context to 20151231 on dfB because of the one-day lookback.
> h4. Use Case B (join with key)
> To join on time and another key (for instance, id), we use “by” to specify 
> the key.
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = 
> JoinSpec(timeContext).on("time").by("id").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, id, quantity
> 20160101, 1, 100
> 20160101, 2, 50
> 20160102, 1, -50
> 20160102, 2, 50
> dfB:
> time, id, price
> 20151231, 1, 100.0
> 20160102, 1, 105.0
> 20160102, 2, 195.0
> Output:
> time, id, quantity, price
> 20160101, 1, 100, 100.0
> 20160101, 2, 50, null
> 20160102, 1, -50, 105.0
> 20160102, 2, 50, 195.0
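
For anyone who wants to play with the semantics of Use Case B before a Spark 
API exists, here is a minimal, single-machine sketch using pandas' merge_asof. 
It only illustrates the intended behavior on the example data above; it is not 
the proposed Spark API, and the variable names are assumed:

{code:python}
import pandas as pd

# Example data from Use Case B above.
dfA = pd.DataFrame({
    "time": pd.to_datetime(["20160101", "20160101", "20160102", "20160102"]),
    "id": [1, 2, 1, 2],
    "quantity": [100, 50, -50, 50],
})
dfB = pd.DataFrame({
    "time": pd.to_datetime(["20151231", "20160102", "20160102"]),
    "id": [1, 1, 2],
    "price": [100.0, 105.0, 195.0],
})

# merge_asof requires both inputs to be sorted by the "on" column.
dfA = dfA.sort_values("time")
dfB = dfB.sort_values("time")

# Backward as-of join on time, per id, with a one-day lookback tolerance,
# mirroring JoinSpec(...).on("time").by("id").tolerance("-1day").
result = pd.merge_asof(dfA, dfB, on="time", by="id",
                       tolerance=pd.Timedelta("1 day"), direction="backward")
print(result)
{code}

On the example data this reproduces the expected output: the (20160101, id=2) 
row gets a null price because there is no dfB row for id=2 within the one-day 
lookback.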

[jira] [Comment Edited] (SPARK-22947) SPIP: as-of join in Spark SQL

2018-05-29 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449935#comment-16449935
 ] 

Li Jin edited comment on SPARK-22947 at 5/29/18 4:33 PM:
-

I came across this blog today:

[https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html]

And realized the Ad Monetization example in the blog pretty much describes the 
as-of join case in streaming mode.

 

 


was (Author: icexelloss):
I came across this blog today:

[https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html]

And realized the Ad Monetization example in the blog pretty much describes the 
as-of join case in streaming mode.

 

 

> SPIP: as-of join in Spark SQL
> -
>
> Key: SPARK-22947
> URL: https://issues.apache.org/jira/browse/SPARK-22947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Li Jin
>Priority: Major
> Attachments: SPIP_ as-of join in Spark SQL (1).pdf
>
>
> h2. Background and Motivation
> Time series analysis is one of the most common analyses of financial data. In 
> time series analysis, as-of join is a very common operation. Supporting as-of 
> join in Spark SQL will enable many use cases of using Spark SQL for time 
> series analysis.
> As-of join is “join on time” with inexact time matching criteria. Various 
> libraries have implemented as-of join or similar functionality:
> Kdb: https://code.kx.com/wiki/Reference/aj
> Pandas: 
> http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof
> R: This functionality is called “Last Observation Carried Forward”
> https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf
> JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin
> Flint: https://github.com/twosigma/flint#temporal-join-functions
> This proposal advocates introducing a new API in Spark SQL to support as-of 
> join.
> h2. Target Personas
> Data scientists, data engineers
> h2. Goals
> * New API in Spark SQL that allows as-of join
> * As-of join of multiple tables (>2) should be performant, because it’s very 
> common that users need to join multiple data sources together for further 
> analysis.
> * Define Distribution, Partitioning and shuffle strategy for ordered time 
> series data
> h2. Non-Goals
> These are out of scope for this SPIP and should be considered in future 
> SPIPs as improvements to Spark’s time series analysis ability:
> * Utilize partition information from the data source, i.e., the begin/end of 
> each partition, to reduce sorting/shuffling
> * Define an API for users to implement as-of join time specs on a business 
> calendar (e.g., look back one business day; this is very common in financial 
> data analysis because of market calendars)
> * Support broadcast join
> h2. Proposed API Changes
> h3. TimeContext
> TimeContext is an object that defines the time scope of the analysis; it has 
> a begin time (inclusive) and an end time (exclusive). The user should be able 
> to change the time scope of the analysis (e.g., from one month to five years) 
> by just changing the TimeContext. 
> To the Spark engine, TimeContext is a hint that:
> * can be used to repartition data for the join
> * can serve as a predicate that can be pushed down to the storage layer
> A time context is similar to filtering time by begin/end; the main difference 
> is that a time context can be expanded based on the operation taken (see the 
> example in as-of join).
> Time context example:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> {code}
> h3. asofJoin
> h4. Use Case A (join without key)
> Join two DataFrames on time, with one day lookback:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, quantity
> 20160101, 100
> 20160102, 50
> 20160104, -50
> 20160105, 100
> dfB:
> time, price
> 20151231, 100.0
> 20160104, 105.0
> 20160105, 102.0
> output:
> time, quantity, price
> 20160101, 100, 100.0
> 20160102, 50, null
> 20160104, -50, 105.0
> 20160105, 100, 102.0
> {code}
> Note that row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. 
> This is an important illustration of the time context: it is able to expand 
> the context to 20151231 on dfB because of the one-day lookback.
> h4. Use Case B (join with key)
> To join on time and another key (for instance, id), we use “by” to specify 
> the key.
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = 
> JoinSpec(timeContext).on("time").by("id").tolerance("-1day")
> result = 

[jira] [Commented] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data when the analyzed plans are different after re-analyzing the plans

2018-05-25 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491309#comment-16491309
 ] 

Li Jin commented on SPARK-24373:


[~smilegator] do you mean that adding AnalysisBarrier to 
RelationalGroupedDataset and KeyValueGroupedDataset could lead to new bugs?

> "df.cache() df.count()" no longer eagerly caches data when the analyzed plans 
> are different after re-analyzing the plans
> 
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Priority: Blocker
>
> Here is the code to reproduce in local mode
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
> scala> df1.count
> res1: Long = 1 
> scala> df1.count
> res2: Long = 1
> scala> df1.count
> res3: Long = 1
> {code}
>  
> In Spark 2.2, you could see that it prints "". 
> In the above example, when you do explain(), you could see:
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
> but the InMemoryTableScan is missing in the following explain():
> {code:java}
> scala> df1.groupBy().count().explain(true)
> == Parsed Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
> value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> count: bigint
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
> null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
> +- Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#175L])
> +- *(1) Project
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
>  
>  






[jira] [Commented] (SPARK-24324) Pandas Grouped Map UserDefinedFunction mixes column labels

2018-05-25 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491125#comment-16491125
 ] 

Li Jin commented on SPARK-24324:


Moved under Spark-22216 for better ticket organization.

> Pandas Grouped Map UserDefinedFunction mixes column labels
> --
>
> Key: SPARK-24324
> URL: https://issues.apache.org/jira/browse/SPARK-24324
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Python (using virtualenv):
> {noformat}
> $ python --version 
> Python 3.6.5
> {noformat}
> Modules installed:
> {noformat}
> arrow==0.12.1
> backcall==0.1.0
> bleach==2.1.3
> chardet==3.0.4
> decorator==4.3.0
> entrypoints==0.2.3
> findspark==1.2.0
> html5lib==1.0.1
> ipdb==0.11
> ipykernel==4.8.2
> ipython==6.3.1
> ipython-genutils==0.2.0
> ipywidgets==7.2.1
> jedi==0.12.0
> Jinja2==2.10
> jsonschema==2.6.0
> jupyter==1.0.0
> jupyter-client==5.2.3
> jupyter-console==5.2.0
> jupyter-core==4.4.0
> MarkupSafe==1.0
> mistune==0.8.3
> nbconvert==5.3.1
> nbformat==4.4.0
> notebook==5.5.0
> numpy==1.14.3
> pandas==0.22.0
> pandocfilters==1.4.2
> parso==0.2.0
> pbr==3.1.1
> pexpect==4.5.0
> pickleshare==0.7.4
> progressbar2==3.37.1
> prompt-toolkit==1.0.15
> ptyprocess==0.5.2
> pyarrow==0.9.0
> Pygments==2.2.0
> python-dateutil==2.7.2
> python-utils==2.3.0
> pytz==2018.4
> pyzmq==17.0.0
> qtconsole==4.3.1
> Send2Trash==1.5.0
> simplegeneric==0.8.1
> six==1.11.0
> SQLAlchemy==1.2.7
> stevedore==1.28.0
> terminado==0.8.1
> testpath==0.3.1
> tornado==5.0.2
> traitlets==4.3.2
> virtualenv==15.1.0
> virtualenv-clone==0.2.6
> virtualenvwrapper==4.7.2
> wcwidth==0.1.7
> webencodings==0.5.1
> widgetsnbextension==3.2.1
> {noformat}
>  
> Java:
> {noformat}
> $ java -version 
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> System:
> {noformat}
> $ lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 16.04.4 LTS
> Release:  16.04
> Codename: xenial
> {noformat}
>Reporter: Cristian Consonni
>Priority: Major
>
> I am working on Wikipedia page views (see [task T188041 on Wikimedia's 
> Phabricator|https://phabricator.wikimedia.org/T188041]). For simplicity, let's 
> say that these are the data:
> {noformat}
>
> {noformat}
> For each combination of (lang, page, day(timestamp)) I need to transform the 
> views for each hour:
> {noformat}
> 00:00 -> A
> 01:00 -> B
> ...
> {noformat}
> and concatenate the number of views for that hour.  So, if a page got 5 views 
> at 00:00 and 7 views at 01:00 it would become:
> {noformat}
> A5B7
> {noformat}
>  
> I have written a UDF called {code:python}concat_hours{code}
> However, the function is mixing the columns and I am not sure what is going 
> on. I wrote here a minimal complete example that reproduces the issue on my 
> system (the details of my environment are above).
> {code:python}
> #!/usr/bin/env python3
> # coding: utf-8
> input_data = b"""en Albert_Camus 20071210-00 150
> en Albert_Camus 20071210-01 148
> en Albert_Camus 20071210-02 197
> en Albert_Camus 20071211-20 145
> en Albert_Camus 20071211-21 131
> en Albert_Camus 20071211-22 154
> en Albert_Camus 20071211-230001 142
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-040001 1
> en Albert_Caquot 20071211-06 1
> en Albert_Caquot 20071211-08 1
> en Albert_Caquot 20071211-15 3
> en Albert_Caquot 20071211-21 1"""
> import tempfile
> fp = tempfile.NamedTemporaryFile()
> fp.write(input_data)
> fp.seek(0)
> import findspark
> findspark.init()
> import pyspark
> from pyspark.sql.types import StructType, StructField
> from pyspark.sql.types import StringType, IntegerType, TimestampType
> from pyspark.sql import functions
> sc = pyspark.SparkContext(appName="udf_example")
> sqlctx = pyspark.SQLContext(sc)
> schema = StructType([StructField("lang", StringType(), False),
>  StructField("page", StringType(), False),
>  StructField("timestamp", TimestampType(), False),
>  StructField("views", IntegerType(), False)])
> df = sqlctx.read.csv(fp.name,
>  header=False,
>  schema=schema,
>  timestampFormat="MMdd-HHmmss",
>  sep=' ')
> df.count()
> df.dtypes
> df.show()
> new_schema = StructType([StructField("lang", StringType(), False),
>  StructField("page", StringType(), False),
>  StructField("day", StringType(), False),
>  StructField("enc", StringType(), False)])
> from 
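
The quoted reproduction is cut off above. For readers trying to follow the 
report, here is one hypothetical way the described concat_hours transformation 
could be written as a Spark 2.3 grouped-map pandas UDF; it is a sketch based 
only on the prose above (hours mapped to letters A, B, C, ... with the views 
concatenated per day), not the reporter's actual code:

{code:python}
import string
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType, date_format
from pyspark.sql.types import StructType, StructField, StringType

new_schema = StructType([StructField("lang", StringType(), False),
                         StructField("page", StringType(), False),
                         StructField("day", StringType(), False),
                         StructField("enc", StringType(), False)])

@pandas_udf(new_schema, PandasUDFType.GROUPED_MAP)
def concat_hours(pdf):
    # One group = one (lang, page, day). Encode each hour's views as
    # <letter><views>, where hour 00 -> A, 01 -> B, and so on.
    pdf = pdf.sort_values("timestamp")
    hours = pdf["timestamp"].dt.hour
    enc = "".join(string.ascii_uppercase[h] + str(v)
                  for h, v in zip(hours, pdf["views"]))
    return pd.DataFrame([[pdf["lang"].iloc[0], pdf["page"].iloc[0],
                          pdf["day"].iloc[0], enc]],
                        columns=["lang", "page", "day", "enc"])

# Group by plain columns so the UDF sees the grouping values directly;
# df is the DataFrame read from the sample file above.
df2 = df.withColumn("day", date_format("timestamp", "yyyy-MM-dd"))
result = df2.groupby("lang", "page", "day").apply(concat_hours)
{code}

Note that the returned pandas DataFrame lists its columns in the same order as 
new_schema, which matters for the column-mixing behavior this ticket is about.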

[jira] [Updated] (SPARK-24324) Pandas Grouped Map UserDefinedFunction mixes column labels

2018-05-25 Thread Li Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-24324:
---
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-22216

> Pandas Grouped Map UserDefinedFunction mixes column labels
> --
>
> Key: SPARK-24324
> URL: https://issues.apache.org/jira/browse/SPARK-24324
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Python (using virtualenv):
> {noformat}
> $ python --version 
> Python 3.6.5
> {noformat}
> Modules installed:
> {noformat}
> arrow==0.12.1
> backcall==0.1.0
> bleach==2.1.3
> chardet==3.0.4
> decorator==4.3.0
> entrypoints==0.2.3
> findspark==1.2.0
> html5lib==1.0.1
> ipdb==0.11
> ipykernel==4.8.2
> ipython==6.3.1
> ipython-genutils==0.2.0
> ipywidgets==7.2.1
> jedi==0.12.0
> Jinja2==2.10
> jsonschema==2.6.0
> jupyter==1.0.0
> jupyter-client==5.2.3
> jupyter-console==5.2.0
> jupyter-core==4.4.0
> MarkupSafe==1.0
> mistune==0.8.3
> nbconvert==5.3.1
> nbformat==4.4.0
> notebook==5.5.0
> numpy==1.14.3
> pandas==0.22.0
> pandocfilters==1.4.2
> parso==0.2.0
> pbr==3.1.1
> pexpect==4.5.0
> pickleshare==0.7.4
> progressbar2==3.37.1
> prompt-toolkit==1.0.15
> ptyprocess==0.5.2
> pyarrow==0.9.0
> Pygments==2.2.0
> python-dateutil==2.7.2
> python-utils==2.3.0
> pytz==2018.4
> pyzmq==17.0.0
> qtconsole==4.3.1
> Send2Trash==1.5.0
> simplegeneric==0.8.1
> six==1.11.0
> SQLAlchemy==1.2.7
> stevedore==1.28.0
> terminado==0.8.1
> testpath==0.3.1
> tornado==5.0.2
> traitlets==4.3.2
> virtualenv==15.1.0
> virtualenv-clone==0.2.6
> virtualenvwrapper==4.7.2
> wcwidth==0.1.7
> webencodings==0.5.1
> widgetsnbextension==3.2.1
> {noformat}
>  
> Java:
> {noformat}
> $ java -version 
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> System:
> {noformat}
> $ lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 16.04.4 LTS
> Release:  16.04
> Codename: xenial
> {noformat}
>Reporter: Cristian Consonni
>Priority: Major
>
> I am working on Wikipedia page views (see [task T188041 on Wikimedia's 
> Phabricator|https://phabricator.wikimedia.org/T188041]). For simplicity, let's 
> say that these are the data:
> {noformat}
>
> {noformat}
> For each combination of (lang, page, day(timestamp)) I need to transform the 
> views for each hour:
> {noformat}
> 00:00 -> A
> 01:00 -> B
> ...
> {noformat}
> and concatenate the number of views for that hour.  So, if a page got 5 views 
> at 00:00 and 7 views at 01:00 it would become:
> {noformat}
> A5B7
> {noformat}
>  
> I have written a UDF called {code:python}concat_hours{code}
> However, the function is mixing the columns and I am not sure what is going 
> on. I wrote here a minimal complete example that reproduces the issue on my 
> system (the details of my environment are above).
> {code:python}
> #!/usr/bin/env python3
> # coding: utf-8
> input_data = b"""en Albert_Camus 20071210-00 150
> en Albert_Camus 20071210-01 148
> en Albert_Camus 20071210-02 197
> en Albert_Camus 20071211-20 145
> en Albert_Camus 20071211-21 131
> en Albert_Camus 20071211-22 154
> en Albert_Camus 20071211-230001 142
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-040001 1
> en Albert_Caquot 20071211-06 1
> en Albert_Caquot 20071211-08 1
> en Albert_Caquot 20071211-15 3
> en Albert_Caquot 20071211-21 1"""
> import tempfile
> fp = tempfile.NamedTemporaryFile()
> fp.write(input_data)
> fp.seek(0)
> import findspark
> findspark.init()
> import pyspark
> from pyspark.sql.types import StructType, StructField
> from pyspark.sql.types import StringType, IntegerType, TimestampType
> from pyspark.sql import functions
> sc = pyspark.SparkContext(appName="udf_example")
> sqlctx = pyspark.SQLContext(sc)
> schema = StructType([StructField("lang", StringType(), False),
>  StructField("page", StringType(), False),
>  StructField("timestamp", TimestampType(), False),
>  StructField("views", IntegerType(), False)])
> df = sqlctx.read.csv(fp.name,
>  header=False,
>  schema=schema,
>  timestampFormat="MMdd-HHmmss",
>  sep=' ')
> df.count()
> df.dtypes
> df.show()
> new_schema = StructType([StructField("lang", StringType(), False),
>  StructField("page", StringType(), False),
>  StructField("day", StringType(), False),
>  StructField("enc", StringType(), False)])
> from pyspark.sql.functions import 

[jira] [Commented] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data

2018-05-25 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491086#comment-16491086
 ] 

Li Jin commented on SPARK-24373:


We use groupBy() and pivot().
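
For readers following along, here is a minimal PySpark sketch of the kind of 
groupBy()/pivot() pipeline being referred to, together with the eager-caching 
idiom discussed in this ticket. The data and column names are made up; only 
the pattern matters:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("2018-01-01", "AAPL", 100.0),
     ("2018-01-01", "MSFT", 80.0),
     ("2018-01-02", "AAPL", 101.0)],
    ["date", "ticker", "price"])

# Pivot to one row per date and one column per ticker.
wide = df.groupBy("date").pivot("ticker").agg(F.first("price"))

# The "cache then count" idiom that is expected to materialize the cache.
wide.cache()
wide.count()
wide.show()
{code}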

> "df.cache() df.count()" no longer eagerly caches data
> -
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Priority: Major
>
> Here is the code to reproduce in local mode
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
> scala> df1.count
> res1: Long = 1 
> scala> df1.count
> res2: Long = 1
> scala> df1.count
> res3: Long = 1
> {code}
>  
> In Spark 2.2, you could see that it prints "". 
> In the above example, when you do explain(), you could see:
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
> but the InMemoryTableScan is missing in the following explain():
> {code:java}
> scala> df1.groupBy().count().explain(true)
> == Parsed Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
> value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> count: bigint
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
> null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
> +- Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#175L])
> +- *(1) Project
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
>  
>  






[jira] [Comment Edited] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data

2018-05-24 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489060#comment-16489060
 ] 

Li Jin edited comment on SPARK-24373 at 5/24/18 9:00 PM:
-

Here is a reproduction:
{code:java}
val myUDF = udf((x: Long) => { println(""); x + 1 })

val df1 = spark.range(0, 1).toDF("s").select(myUDF($"s"))
df1.cache()
df1.count()
// No  printed
{code}
It appears the issue is related to UDF:
{code:java}
val df1 = spark.range(0, 1).toDF("s").select(myUDF($"s"))
df1.cache()
df1.groupBy().count().explain()

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
  +- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
+- *(1) Project
  +- *(1) Range (0, 1, step=1, splits=2)
{code}
Without the UDF, count() materializes the cache:
{code:java}
val df1 = spark.range(0, 1).toDF("s").select($"s" + 1)
df1.cache()
df1.groupBy().count().explain()

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)])
  +- Exchange SinglePartition
+- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
  +- *(1) InMemoryTableScan
+- InMemoryRelation [(s + 1)#179L], 
CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 
replicas),*(1) Project [(id#175L + 1) AS (s + 1)#179L]
  +- *(1) Range (0, 1, step=1, splits=2) ,None)
+- *(1) Project [(id#175L + 1) AS (s + 1)#179L]
   +- *(1) Range (0, 1, step=1, splits=2)

{code}
 


was (Author: icexelloss):
This is a reproduce:
{code:java}
val myUDF = udf((x: Long) => { println(""); x + 1 })

val df1 = spark.range(0, 1).toDF("s").select(myUDF($"s"))
df1.cache()
df1.count()
// No  printed
{code}
It appears the issue is related to UDF:

 
{code:java}
val df1 = spark.range(0, 1).toDF("s").select(myUDF($"s"))
df1.cache()
df1.groupBy().count().explain()

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
  +- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
+- *(1) Project
  +- *(1) Range (0, 1, step=1, splits=2)
{code}
Without UDF it uses "count" materialize cache:

 
{code:java}
val df1 = spark.range(0, 1).toDF("s").select($"s" + 1)
df1.cache()
df1.groupBy().count().explain()

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)])
  +- Exchange SinglePartition
+- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
  +- *(1) InMemoryTableScan
+- InMemoryRelation [(s + 1)#179L], 
CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 
replicas),*(1) Project [(id#175L + 1) AS (s + 1)#179L]
  +- *(1) Range (0, 1, step=1, splits=2) ,None)
+- *(1) Project [(id#175L + 1) AS (s + 1)#179L]
   +- *(1) Range (0, 1, step=1, splits=2)

{code}
 

 

 

> "df.cache() df.count()" no longer eagerly caches data
> -
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Priority: Major
>
> Here is the code to reproduce in local mode
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
> scala> df1.count
> res1: Long = 1 
> scala> df1.count
> res2: Long = 1
> scala> df1.count
> res3: Long = 1
> {code}
>  
> In Spark 2.2, you could see that it prints "". 
> In the above example, when you do explain(), you could see:
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, 

[jira] [Comment Edited] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data

2018-05-24 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489060#comment-16489060
 ] 

Li Jin edited comment on SPARK-24373 at 5/24/18 9:00 PM:
-

Here is a reproduction:
{code:java}
val myUDF = udf((x: Long) => { println(""); x + 1 })

val df1 = spark.range(0, 1).toDF("s").select(myUDF($"s"))
df1.cache()
df1.count()
// No  printed
{code}
It appears the issue is related to UDF:

 
{code:java}
val df1 = spark.range(0, 1).toDF("s").select(myUDF($"s"))
df1.cache()
df1.groupBy().count().explain()

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
  +- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
+- *(1) Project
  +- *(1) Range (0, 1, step=1, splits=2)
{code}
Without the UDF, count() materializes the cache:

 
{code:java}
val df1 = spark.range(0, 1).toDF("s").select($"s" + 1)
df1.cache()
df1.groupBy().count().explain()

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)])
  +- Exchange SinglePartition
+- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
  +- *(1) InMemoryTableScan
+- InMemoryRelation [(s + 1)#179L], 
CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 
replicas),*(1) Project [(id#175L + 1) AS (s + 1)#179L]
  +- *(1) Range (0, 1, step=1, splits=2) ,None)
+- *(1) Project [(id#175L + 1) AS (s + 1)#179L]
   +- *(1) Range (0, 1, step=1, splits=2)

{code}
 

 

 


was (Author: icexelloss):
This is a reproduce:
{code:java}
val myUDF = udf((x: Long) => { println(""); x + 1 })

val df1 = spark.range(0, 1).toDF("s").select(myUDF($"s"))
df1.cache()
df1.count()
// No  printed{code}

> "df.cache() df.count()" no longer eagerly caches data
> -
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Priority: Major
>
> Here is the code to reproduce in local mode
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
> scala> df1.count
> res1: Long = 1 
> scala> df1.count
> res2: Long = 1
> scala> df1.count
> res3: Long = 1
> {code}
>  
> In Spark 2.2, you could see that it prints "". 
> In the above example, when you do explain(), you could see:
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
> but the InMemoryTableScan is missing in the following explain():
> {code:java}
> scala> df1.groupBy().count().explain(true)
> == Parsed Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
> value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> count: bigint
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
> null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)], 

[jira] [Comment Edited] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data

2018-05-24 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489060#comment-16489060
 ] 

Li Jin edited comment on SPARK-24373 at 5/24/18 8:51 PM:
-

Here is a reproduction:
{code:java}
val myUDF = udf((x: Long) => { println(""); x + 1 })

val df1 = spark.range(0, 1).toDF("s").select(myUDF($"s"))
df1.cache()
df1.count()
// No  printed{code}


was (Author: icexelloss):
This is a reproduce in unit test:
{code:java}
test("cache and count") {
  var evalCount = 0
  val myUDF = udf((x: String) => { evalCount += 1; "result" })
  val df = spark.range(0, 1).select(myUDF($"id"))
  df.cache()
  df.count()
  assert(evalCount === 1)

  df.count()
  assert(evalCount === 1)
}
{code}

> "df.cache() df.count()" no longer eagerly caches data
> -
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Priority: Major
>
> Here is the code to reproduce in local mode
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
> scala> df1.count
> res1: Long = 1 
> scala> df1.count
> res2: Long = 1
> scala> df1.count
> res3: Long = 1
> {code}
>  
> In Spark 2.2, you could see that it prints "". 
> In the above example, when you do explain(), you could see:
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
> but the InMemoryTableScan is missing in the following explain():
> {code:java}
> scala> df1.groupBy().count().explain(true)
> == Parsed Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
> value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> count: bigint
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
> null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
> +- Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#175L])
> +- *(1) Project
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
>  
>  






[jira] [Commented] (SPARK-24324) Pandas Grouped Map UserDefinedFunction mixes column labels

2018-05-24 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489759#comment-16489759
 ] 

Li Jin commented on SPARK-24324:


This is a dup of https://issues.apache.org/jira/browse/SPARK-23929; I am now 
convinced this behavior should be fixed.
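
A note for anyone hitting this: as discussed in SPARK-23929, the grouped-map 
result columns appear to be assigned to the declared schema by position in 
Spark 2.3, so a practical workaround until the behavior is fixed is to return 
the pandas DataFrame with its columns in exactly the schema's order. A small 
hypothetical sketch (the UDF and column names here are made up):

{code:python}
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField("lang", StringType(), False),
                     StructField("page", StringType(), False),
                     StructField("enc", StringType(), False)])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def summarize(pdf):
    out = pd.DataFrame({"enc": ["A5B7"],
                        "lang": [pdf["lang"].iloc[0]],
                        "page": [pdf["page"].iloc[0]]})
    # Reorder to match the declared schema exactly before returning;
    # otherwise the labels and the data can end up mixed.
    return out[["lang", "page", "enc"]]
{code}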

> Pandas Grouped Map UserDefinedFunction mixes column labels
> --
>
> Key: SPARK-24324
> URL: https://issues.apache.org/jira/browse/SPARK-24324
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Python (using virtualenv):
> {noformat}
> $ python --version 
> Python 3.6.5
> {noformat}
> Modules installed:
> {noformat}
> arrow==0.12.1
> backcall==0.1.0
> bleach==2.1.3
> chardet==3.0.4
> decorator==4.3.0
> entrypoints==0.2.3
> findspark==1.2.0
> html5lib==1.0.1
> ipdb==0.11
> ipykernel==4.8.2
> ipython==6.3.1
> ipython-genutils==0.2.0
> ipywidgets==7.2.1
> jedi==0.12.0
> Jinja2==2.10
> jsonschema==2.6.0
> jupyter==1.0.0
> jupyter-client==5.2.3
> jupyter-console==5.2.0
> jupyter-core==4.4.0
> MarkupSafe==1.0
> mistune==0.8.3
> nbconvert==5.3.1
> nbformat==4.4.0
> notebook==5.5.0
> numpy==1.14.3
> pandas==0.22.0
> pandocfilters==1.4.2
> parso==0.2.0
> pbr==3.1.1
> pexpect==4.5.0
> pickleshare==0.7.4
> progressbar2==3.37.1
> prompt-toolkit==1.0.15
> ptyprocess==0.5.2
> pyarrow==0.9.0
> Pygments==2.2.0
> python-dateutil==2.7.2
> python-utils==2.3.0
> pytz==2018.4
> pyzmq==17.0.0
> qtconsole==4.3.1
> Send2Trash==1.5.0
> simplegeneric==0.8.1
> six==1.11.0
> SQLAlchemy==1.2.7
> stevedore==1.28.0
> terminado==0.8.1
> testpath==0.3.1
> tornado==5.0.2
> traitlets==4.3.2
> virtualenv==15.1.0
> virtualenv-clone==0.2.6
> virtualenvwrapper==4.7.2
> wcwidth==0.1.7
> webencodings==0.5.1
> widgetsnbextension==3.2.1
> {noformat}
>  
> Java:
> {noformat}
> $ java -version 
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> System:
> {noformat}
> $ lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 16.04.4 LTS
> Release:  16.04
> Codename: xenial
> {noformat}
>Reporter: Cristian Consonni
>Priority: Major
>
> I am working on Wikipedia page views (see [task T188041 on Wikimedia's 
> Phabricator|https://phabricator.wikimedia.org/T188041]). For simplicity, let's 
> say that these are the data:
> {noformat}
>
> {noformat}
> For each combination of (lang, page, day(timestamp)) I need to transform the 
> views for each hour:
> {noformat}
> 00:00 -> A
> 01:00 -> B
> ...
> {noformat}
> and concatenate the number of views for that hour.  So, if a page got 5 views 
> at 00:00 and 7 views at 01:00 it would become:
> {noformat}
> A5B7
> {noformat}
>  
> I have written a UDF called {code:python}concat_hours{code}
> However, the function is mixing the columns and I am not sure what is going 
> on. I wrote here a minimal complete example that reproduces the issue on my 
> system (the details of my environment are above).
> {code:python}
> #!/usr/bin/env python3
> # coding: utf-8
> input_data = b"""en Albert_Camus 20071210-00 150
> en Albert_Camus 20071210-01 148
> en Albert_Camus 20071210-02 197
> en Albert_Camus 20071211-20 145
> en Albert_Camus 20071211-21 131
> en Albert_Camus 20071211-22 154
> en Albert_Camus 20071211-230001 142
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-040001 1
> en Albert_Caquot 20071211-06 1
> en Albert_Caquot 20071211-08 1
> en Albert_Caquot 20071211-15 3
> en Albert_Caquot 20071211-21 1"""
> import tempfile
> fp = tempfile.NamedTemporaryFile()
> fp.write(input_data)
> fp.seek(0)
> import findspark
> findspark.init()
> import pyspark
> from pyspark.sql.types import StructType, StructField
> from pyspark.sql.types import StringType, IntegerType, TimestampType
> from pyspark.sql import functions
> sc = pyspark.SparkContext(appName="udf_example")
> sqlctx = pyspark.SQLContext(sc)
> schema = StructType([StructField("lang", StringType(), False),
>  StructField("page", StringType(), False),
>  StructField("timestamp", TimestampType(), False),
>  StructField("views", IntegerType(), False)])
> df = sqlctx.read.csv(fp.name,
>  header=False,
>  schema=schema,
>  timestampFormat="MMdd-HHmmss",
>  sep=' ')
> df.count()
> df.dtypes
> df.show()
> new_schema = StructType([StructField("lang", StringType(), False),
>  StructField("page", StringType(), False),
>  StructField("day", StringType(), False),
>  

[jira] [Commented] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data

2018-05-24 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489060#comment-16489060
 ] 

Li Jin commented on SPARK-24373:


Here is a reproduction in a unit test:
{code:java}
test("cache and count") {
  var evalCount = 0
  val myUDF = udf((x: String) => { evalCount += 1; "result" })
  val df = spark.range(0, 1).select(myUDF($"id"))
  df.cache()
  df.count()
  assert(evalCount === 1)

  df.count()
  assert(evalCount === 1)
}
{code}

> "df.cache() df.count()" no longer eagerly caches data
> -
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Priority: Major
>
> Here is the code to reproduce in local mode
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
> scala> df1.count
> res1: Long = 1 
> scala> df1.count
> res2: Long = 1
> scala> df1.count
> res3: Long = 1
> {code}
>  
> In Spark 2.2, you could see that it prints "". 
> In the above example, when you do explain(), you could see:
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
> but the InMemoryTableScan is missing in the following explain():
> {code:java}
> scala> df1.groupBy().count().explain(true)
> == Parsed Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
> value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> count: bigint
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
> null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
> +- Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#175L])
> +- *(1) Project
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
>  
>  






[jira] [Updated] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data

2018-05-23 Thread Li Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-24373:
---
Summary: "df.cache() df.count()" no longer eagerly caches data  (was: Spark 
Dataset groupby.agg/count doesn't respect cache with UDF)

> "df.cache() df.count()" no longer eagerly caches data
> -
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Priority: Major
>
> Here is the code to reproduce in local mode
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
> scala> df1.count
> res1: Long = 1 
> scala> df1.count
> res2: Long = 1
> scala> df1.count
> res3: Long = 1
> {code}
>  
> In Spark 2.2, you could see that it prints "". 
> In the above example, when you do explain(), you could see:
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
> but the InMemoryTableScan is missing in the following explain():
> {code:java}
> scala> df1.groupBy().count().explain(true)
> == Parsed Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
> value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> count: bigint
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
> null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
> +- Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#175L])
> +- *(1) Project
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
>  
>  






[jira] [Comment Edited] (SPARK-24373) Spark Dataset groupby.agg/count doesn't respect cache with UDF

2018-05-23 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488109#comment-16488109
 ] 

Li Jin edited comment on SPARK-24373 at 5/23/18 11:18 PM:
--

We found that after upgrading to Spark 2.3, many of our production systems run 
slower. This was found to be caused by "df.cache(); df.count()" no longer 
eagerly caching the df correctly. I think this might be a regression from 2.2. 

I think anyone who uses "df.cache(); df.count()" to cache data eagerly will be 
affected.


was (Author: icexelloss):
I think this might be a regression from 2.2

Any one uses  "df.cache() df.count()" to cache data eagerly will be affected.

> Spark Dataset groupby.agg/count doesn't respect cache with UDF
> --
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Priority: Major
>
> Here is the code to reproduce in local mode
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
> scala> df1.count
> res1: Long = 1 
> scala> df1.count
> res2: Long = 1
> scala> df1.count
> res3: Long = 1
> {code}
>  
> In Spark 2.2, you could see that it prints "". 
> In the above example, when you do explain(), you could see:
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
> but the InMemoryTableScan is missing in the following explain():
> {code:java}
> scala> df1.groupBy().count().explain(true)
> == Parsed Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
> value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> count: bigint
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
> null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
> +- Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#175L])
> +- *(1) Project
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
>  
>  






[jira] [Updated] (SPARK-24373) Spark Dataset groupby.agg/count doesn't respect cache with UDF

2018-05-23 Thread Li Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-24373:
---
Component/s: (was: Input/Output)
 SQL

> Spark Dataset groupby.agg/count doesn't respect cache with UDF
> --
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Priority: Major
>
> Here is the code to reproduce in local mode
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
> scala> df1.count
> res1: Long = 1 
> scala> df1.count
> res2: Long = 1
> scala> df1.count
> res3: Long = 1
> {code}
>  
> In Spark 2.2, you could see it prints "". 
> In the above example, when you do explain(), you could see:
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
> but the InMemoryTableScan is missing in the following explain():
> {code:java}
> scala> df1.groupBy().count().explain(true)
> == Parsed Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
> value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> count: bigint
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
> null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
> +- Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#175L])
> +- *(1) Project
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24373) Spark Dataset groupby.agg/count doesn't respect cache with UDF

2018-05-23 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488109#comment-16488109
 ] 

Li Jin commented on SPARK-24373:


I think this might be a regression from 2.2.

Anyone who uses "df.cache(); df.count()" to cache data eagerly will be affected.

> Spark Dataset groupby.agg/count doesn't respect cache with UDF
> --
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Priority: Major
>
> Here is the code to reproduce in local mode
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
> scala> df1.count
> res1: Long = 1 
> scala> df1.count
> res2: Long = 1
> scala> df1.count
> res3: Long = 1
> {code}
>  
> In Spark 2.2, you could see it prints "". 
> In the above example, when you do explain(), you could see:
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
> but the InMemoryTableScan is missing in the following explain():
> {code:java}
> scala> df1.groupBy().count().explain(true)
> == Parsed Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
> value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> count: bigint
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
> null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
> +- Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#175L])
> +- *(1) Project
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24334) Race condition in ArrowPythonRunner causes unclean shutdown of Arrow memory allocator

2018-05-22 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484086#comment-16484086
 ] 

Li Jin commented on SPARK-24334:


[~pi3ni0] did it happen for you when your UDF throws an exception?

> Race condition in ArrowPythonRunner causes unclean shutdown of Arrow memory 
> allocator
> -
>
> Key: SPARK-24334
> URL: https://issues.apache.org/jira/browse/SPARK-24334
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Priority: Major
>
> Currently, ArrowPythonRunner has two threads that free the Arrow vector 
> schema root and allocator: the main writer thread and the task completion 
> listener thread. 
> Having both threads do the cleanup leads to weird cases (e.g., negative ref 
> count, NPE, and memory leak exceptions) when an exception is thrown from the 
> user function.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24334) Race condition in ArrowPythonRunner causes unclean shutdown of Arrow memory allocator

2018-05-21 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483004#comment-16483004
 ] 

Li Jin commented on SPARK-24334:


I have done some investigation and will submit a PR soon.

> Race condition in ArrowPythonRunner causes unclean shutdown of Arrow memory 
> allocator
> -
>
> Key: SPARK-24334
> URL: https://issues.apache.org/jira/browse/SPARK-24334
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Priority: Major
>
> Currently, ArrowPythonRunner has two threads that free the Arrow vector 
> schema root and allocator: the main writer thread and the task completion 
> listener thread. 
> Having both threads do the cleanup leads to weird cases (e.g., negative ref 
> count, NPE, and memory leak exceptions) when an exception is thrown from the 
> user function.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24334) Race condition in ArrowPythonRunner causes unclean shutdown of Arrow memory allocator

2018-05-21 Thread Li Jin (JIRA)
Li Jin created SPARK-24334:
--

 Summary: Race condition in ArrowPythonRunner causes unclean 
shutdown of Arrow memory allocator
 Key: SPARK-24334
 URL: https://issues.apache.org/jira/browse/SPARK-24334
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Li Jin


Currently, ArrowPythonRunner has two threads that free the Arrow vector schema 
root and allocator: the main writer thread and the task completion listener 
thread. 

Having both threads do the cleanup leads to weird cases (e.g., negative ref 
count, NPE, and memory leak exceptions) when an exception is thrown from the 
user function.
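
As an illustration of the direction a fix could take (not the actual patch), the cleanup can be funneled through a single idempotent close so that whichever thread gets there first wins; sketched here in Python for brevity:

{code:java}
import threading

class CloseOnce:
    """Generic 'close exactly once' guard; both the writer thread and the
    task-completion listener can call close() safely."""
    def __init__(self, close_fn):
        self._close_fn = close_fn
        self._lock = threading.Lock()
        self._closed = False

    def close(self):
        with self._lock:
            if self._closed:
                return          # another thread already cleaned up
            self._closed = True
        self._close_fn()
{code}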

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22239) User-defined window functions with pandas udf

2018-04-25 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452977#comment-16452977
 ] 

Li Jin commented on SPARK-22239:


[~hvanhovell], I have done a bit more research on UDFs over rolling windows 
and posted my results here:

[https://docs.google.com/document/d/14EjeY5z4-NC27-SmIP9CsMPCANeTcvxN44a7SIJtZPc/edit?usp=sharing]

TL;DR: I think we can implement this efficiently by computing the window indices in 
the JVM, passing the indices along with the window data to Python, and doing the 
rolling computation over the indices in Python. I have not addressed the issue of 
splitting a window partition into smaller batches, but I think that is doable as 
well. Would you be interested in taking a look and letting me know what you think?
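
To make the idea concrete, here is a rough pandas/numpy sketch (the helper and its names are hypothetical) of doing the rolling computation in Python over per-row window boundaries that are assumed to have been computed on the JVM side:

{code:java}
import numpy as np
import pandas as pd

def rolling_apply(values, begin, end, func):
    # begin[i]/end[i] are the (assumed JVM-computed) bounds of row i's window.
    v = values.values
    out = np.empty(len(v))
    for i in range(len(v)):
        out[i] = func(v[begin[i]:end[i]])
    return pd.Series(out, index=values.index)

# Example: trailing 3-row mean over a 6-row partition.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
begin = np.maximum(np.arange(len(s)) - 2, 0)
end = np.arange(len(s)) + 1
print(rolling_apply(s, begin, end, np.mean))
{code}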

> User-defined window functions with pandas udf
> -
>
> Key: SPARK-22239
> URL: https://issues.apache.org/jira/browse/SPARK-22239
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
> Environment: 
>Reporter: Li Jin
>Priority: Major
>
> Window functions are another place where we can benefit from vectorized UDFs and 
> add another useful function to the pandas_udf suite.
> Example usage (preliminary):
> {code:java}
> w = Window.partitionBy('id').orderBy('time').rangeBetween(-200, 0)
> @pandas_udf(DoubleType())
> def ema(v1):
> return v1.ewm(alpha=0.5).mean().iloc[-1]
> df.withColumn('v1_ema', ema(df.v1).over(w))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23929) pandas_udf schema mapped by position and not by name

2018-04-25 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452355#comment-16452355
 ] 

Li Jin commented on SPARK-23929:


[~tr3w] does using OrderedDict help in your case?
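
For readers hitting the same thing: since the GROUPED_MAP result is matched to the declared schema by position in 2.3, one workaround sketch is to order the returned columns explicitly (adapted from the reporter's example quoted below):

{code:java}
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def normalize(pdf):
    v = pdf.v
    out = pdf.assign(v=(v - v.mean()) / v.std())
    return out[["id", "v"]]   # column order matches the schema string above

# df.groupby("id").apply(normalize).show()
{code}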

> pandas_udf schema mapped by position and not by name
> 
>
> Key: SPARK-23929
> URL: https://issues.apache.org/jira/browse/SPARK-23929
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: PySpark
> Spark 2.3.0
>  
>Reporter: Omri
>Priority: Major
>
> The return struct of a pandas_udf should be mapped to the provided schema by 
> name. Currently, that is not the case.
> Consider these two examples, where the only change is the order of the fields 
> in the provided schema struct:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show() 
> {code}
> and this one:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show()
> {code}
> The results should be the same but they are different:
> For the first code:
> {code:java}
> +---+---+
> |  v| id|
> +---+---+
> |1.0|  0|
> |1.0|  0|
> |2.0|  0|
> |2.0|  0|
> |2.0|  1|
> +---+---+
> {code}
> For the second code:
> {code:java}
> +---+---+
> | id|  v|
> +---+---+
> |  1|-0.7071067811865475|
> |  1| 0.7071067811865475|
> |  2|-0.8320502943378437|
> |  2|-0.2773500981126146|
> |  2| 1.1094003924504583|
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22947) SPIP: as-of join in Spark SQL

2018-04-24 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449935#comment-16449935
 ] 

Li Jin edited comment on SPARK-22947 at 4/24/18 2:16 PM:
-

I came across this blog today:

[https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html]

And realized the Ad Monetization example in the blog pretty much describes the 
as-of join case in streaming mode.

 

 


was (Author: icexelloss):
I came across this blog today:

[https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html]

And realized the Ad Monetization problem pretty much describes the as-of join case 
in streaming mode.

 

 

> SPIP: as-of join in Spark SQL
> -
>
> Key: SPARK-22947
> URL: https://issues.apache.org/jira/browse/SPARK-22947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Li Jin
>Priority: Major
> Attachments: SPIP_ as-of join in Spark SQL (1).pdf
>
>
> h2. Background and Motivation
> Time series analysis is one of the most common analysis on financial data. In 
> time series analysis, as-of join is a very common operation. Supporting as-of 
> join in Spark SQL will allow many use cases of using Spark SQL for time 
> series analysis.
> As-of join is “join on time” with inexact time matching criteria. Various 
> library has implemented asof join or similar functionality:
> Kdb: https://code.kx.com/wiki/Reference/aj
> Pandas: 
> http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof
> R: This functionality is called “Last Observation Carried Forward”
> https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf
> JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin
> Flint: https://github.com/twosigma/flint#temporal-join-functions
> This proposal advocates introducing new API in Spark SQL to support as-of 
> join.
> h2. Target Personas
> Data scientists, data engineers
> h2. Goals
> * New API in Spark SQL that allows as-of join
> * As-of join of multiple table (>2) should be performant, because it’s very 
> common that users need to join multiple data sources together for further 
> analysis.
> * Define Distribution, Partitioning and shuffle strategy for ordered time 
> series data
> h2. Non-Goals
> These are out of scope for the existing SPIP, should be considered in future 
> SPIP as improvement to Spark’s time series analysis ability:
> * Utilize partition information from data source, i.e, begin/end of each 
> partition to reduce sorting/shuffling
> * Define API for user to implement asof join time spec in business calendar 
> (i.e. lookback one business day, this is very common in financial data 
> analysis because of market calendars)
> * Support broadcast join
> h2. Proposed API Changes
> h3. TimeContext
> TimeContext is an object that defines the time scope of the analysis, it has 
> begin time (inclusive) and end time (exclusive). User should be able to 
> change the time scope of the analysis (i.e, from one month to five year) by 
> just changing the TimeContext. 
> To Spark engine, TimeContext is a hint that:
> can be used to repartition data for join
> serve as a predicate that can be pushed down to storage layer
> Time context is similar to filtering time by begin/end, the main difference 
> is that time context can be expanded based on the operation taken (see 
> example in as-of join).
> Time context example:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> {code}
> h3. asofJoin
> h4. Use Case A (join without key)
> Join two DataFrames on time, with one day lookback:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, quantity
> 20160101, 100
> 20160102, 50
> 20160104, -50
> 20160105, 100
> dfB:
> time, price
> 20151231, 100.0
> 20160104, 105.0
> 20160105, 102.0
> output:
> time, quantity, price
> 20160101, 100, 100.0
> 20160102, 50, null
> 20160104, -50, 105.0
> 20160105, 100, 102.0
> {code}
> Note row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. This 
> is an important illustration of the time context - it is able to expand the 
> context to 20151231 on dfB because of the 1 day lookback.
> h4. Use Case B (join with key)
> To join on time and another key (for instance, id), we use “by” to specify 
> the key.
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = 
> JoinSpec(timeContext).on("time").by("id").tolerance("-1day")
> result = dfA.asofJoin(dfB, 

[jira] [Commented] (SPARK-22947) SPIP: as-of join in Spark SQL

2018-04-24 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449935#comment-16449935
 ] 

Li Jin commented on SPARK-22947:


I came across this blog today:

[https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html]

And realized the Ad Monetization problem pretty much describes the as-of join case 
in streaming mode.

 

 

> SPIP: as-of join in Spark SQL
> -
>
> Key: SPARK-22947
> URL: https://issues.apache.org/jira/browse/SPARK-22947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Li Jin
>Priority: Major
> Attachments: SPIP_ as-of join in Spark SQL (1).pdf
>
>
> h2. Background and Motivation
> Time series analysis is one of the most common analysis on financial data. In 
> time series analysis, as-of join is a very common operation. Supporting as-of 
> join in Spark SQL will allow many use cases of using Spark SQL for time 
> series analysis.
> As-of join is “join on time” with inexact time matching criteria. Various 
> library has implemented asof join or similar functionality:
> Kdb: https://code.kx.com/wiki/Reference/aj
> Pandas: 
> http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof
> R: This functionality is called “Last Observation Carried Forward”
> https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf
> JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin
> Flint: https://github.com/twosigma/flint#temporal-join-functions
> This proposal advocates introducing new API in Spark SQL to support as-of 
> join.
> h2. Target Personas
> Data scientists, data engineers
> h2. Goals
> * New API in Spark SQL that allows as-of join
> * As-of join of multiple table (>2) should be performant, because it’s very 
> common that users need to join multiple data sources together for further 
> analysis.
> * Define Distribution, Partitioning and shuffle strategy for ordered time 
> series data
> h2. Non-Goals
> These are out of scope for the existing SPIP, should be considered in future 
> SPIP as improvement to Spark’s time series analysis ability:
> * Utilize partition information from data source, i.e, begin/end of each 
> partition to reduce sorting/shuffling
> * Define API for user to implement asof join time spec in business calendar 
> (i.e. lookback one business day, this is very common in financial data 
> analysis because of market calendars)
> * Support broadcast join
> h2. Proposed API Changes
> h3. TimeContext
> TimeContext is an object that defines the time scope of the analysis, it has 
> begin time (inclusive) and end time (exclusive). User should be able to 
> change the time scope of the analysis (i.e, from one month to five year) by 
> just changing the TimeContext. 
> To Spark engine, TimeContext is a hint that:
> can be used to repartition data for join
> serve as a predicate that can be pushed down to storage layer
> Time context is similar to filtering time by begin/end, the main difference 
> is that time context can be expanded based on the operation taken (see 
> example in as-of join).
> Time context example:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> {code}
> h3. asofJoin
> h4. Use Case A (join without key)
> Join two DataFrames on time, with one day lookback:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, quantity
> 20160101, 100
> 20160102, 50
> 20160104, -50
> 20160105, 100
> dfB:
> time, price
> 20151231, 100.0
> 20160104, 105.0
> 20160105, 102.0
> output:
> time, quantity, price
> 20160101, 100, 100.0
> 20160102, 50, null
> 20160104, -50, 105.0
> 20160105, 100, 102.0
> {code}
> Note row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. This 
> is an important illustration of the time context - it is able to expand the 
> context to 20151231 on dfB because of the 1 day lookback.
> h4. Use Case B (join with key)
> To join on time and another key (for instance, id), we use “by” to specify 
> the key.
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = 
> JoinSpec(timeContext).on("time").by("id").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, id, quantity
> 20160101, 1, 100
> 20160101, 2, 50
> 20160102, 1, -50
> 20160102, 2, 50
> dfB:
> time, id, price
> 20151231, 1, 100.0
> 20150102, 1, 105.0
> 20150102, 2, 195.0
> Output:
> time, id, quantity, price
> 20160101, 1, 100, 100.0
> 20160101, 2, 

[jira] [Updated] (SPARK-24019) AnalysisException for Window function expression to compute derivative

2018-04-20 Thread Li Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-24019:
---
Component/s: (was: Spark Core)
 SQL

> AnalysisException for Window function expression to compute derivative
> --
>
> Key: SPARK-24019
> URL: https://issues.apache.org/jira/browse/SPARK-24019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: Ubuntu, spark 2.1.1, standalone.
>Reporter: Barry Becker
>Priority: Minor
>
> I am using spark 2.1.1 currently.
> I created an expression to compute the derivative of some series data using a 
> window function.
> I have a simple reproducible case of the error.
> I'm only filing this bug because the error message says "Please file a bug 
> report with this error message, stack trace, and the query."
> Here they are:
> {code:java}
> ((coalesce(lead(value#9, 1, null) windowspecdefinition(category#8, 
> sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING), 
> value#9) - coalesce(lag(value#9, 1, null) windowspecdefinition(category#8, 
> sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), 
> value#9)) / cast((coalesce(lead(sequence_num#7, 1, null) 
> windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 
> 1 FOLLOWING AND 1 FOLLOWING), sequence_num#7) - coalesce(lag(sequence_num#7, 
> 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, 
> ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), sequence_num#7)) as double)) 
> windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, RANGE 
> BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS derivative#14 has multiple 
> Window Specifications (ArrayBuffer(windowspecdefinition(category#8, 
> sequence_num#7 ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT 
> ROW), windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS 
> BETWEEN 1 FOLLOWING AND 1 FOLLOWING), windowspecdefinition(category#8, 
> sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING))).
> Please file a bug report with this error message, stack trace, and the query.;
> org.apache.spark.sql.AnalysisException: ((coalesce(lead(value#9, 1, null) 
> windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 
> 1 FOLLOWING AND 1 FOLLOWING), value#9) - coalesce(lag(value#9, 1, null) 
> windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 
> 1 PRECEDING AND 1 PRECEDING), value#9)) / cast((coalesce(lead(sequence_num#7, 
> 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, 
> ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING), sequence_num#7) - 
> coalesce(lag(sequence_num#7, 1, null) windowspecdefinition(category#8, 
> sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), 
> sequence_num#7)) as double)) windowspecdefinition(category#8, sequence_num#7 
> ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS 
> derivative#14 has multiple Window Specifications 
> (ArrayBuffer(windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, 
> RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), 
> windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 
> 1 FOLLOWING AND 1 FOLLOWING), windowspecdefinition(category#8, sequence_num#7 
> ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING))).
> Please file a bug report with this error message, stack trace, and the query.;
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$78.apply(Analyzer.scala:1772){code}
> And here is a simple unit test that can be used to reproduce the problem:
> {code:java}
> import com.mineset.spark.testsupport.SparkTestCase.SPARK_SESSION
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions._
> import org.scalatest.FunSuite
> import com.mineset.spark.testsupport.SparkTestCase._
> /**
> * Test to see that window functions work as expected on spark.
> * @author Barry Becker
> */
> class WindowFunctionSuite extends FunSuite {
> val simpleDf = createSimpleData()
> test("Window function for finding derivatives for 2 series") {
> val window =
> Window.partitionBy("category").orderBy("sequence_num")//.rangeBetween(-1, 1)
> // Consider the three sequential series points (Xlag, Ylag), (X, Y), (Xlead, 
> Ylead).
> // This defines the derivative as (Ylead - Ylag) / (Xlead - Xlag)
> // If the lead or lag points are null, then we fall back on using the 

[jira] [Comment Edited] (SPARK-23929) pandas_udf schema mapped by position and not by name

2018-04-15 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438935#comment-16438935
 ] 

Li Jin edited comment on SPARK-23929 at 4/16/18 2:40 AM:
-

I agree with [~hyukjin.kwon]. Seems like there is not a strong enough reason to 
change the behavior...


was (Author: icexelloss):
I agree with [~hyukjin.kwon]. Seems like there is not a strong enough reason to 
change the behavior.

> pandas_udf schema mapped by position and not by name
> 
>
> Key: SPARK-23929
> URL: https://issues.apache.org/jira/browse/SPARK-23929
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: PySpark
> Spark 2.3.0
>  
>Reporter: Omri
>Priority: Major
>
> The return struct of a pandas_udf should be mapped to the provided schema by 
> name. Currently, that is not the case.
> Consider these two examples, where the only change is the order of the fields 
> in the provided schema struct:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show() 
> {code}
> and this one:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show()
> {code}
> The results should be the same but they are different:
> For the first code:
> {code:java}
> +---+---+
> |  v| id|
> +---+---+
> |1.0|  0|
> |1.0|  0|
> |2.0|  0|
> |2.0|  0|
> |2.0|  1|
> +---+---+
> {code}
> For the second code:
> {code:java}
> +---+---+
> | id|  v|
> +---+---+
> |  1|-0.7071067811865475|
> |  1| 0.7071067811865475|
> |  2|-0.8320502943378437|
> |  2|-0.2773500981126146|
> |  2| 1.1094003924504583|
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23929) pandas_udf schema mapped by position and not by name

2018-04-15 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438935#comment-16438935
 ] 

Li Jin commented on SPARK-23929:


I agree with [~hyukjin.kwon]. Seems like there is not a strong enough reason to 
change the behavior.

> pandas_udf schema mapped by position and not by name
> 
>
> Key: SPARK-23929
> URL: https://issues.apache.org/jira/browse/SPARK-23929
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: PySpark
> Spark 2.3.0
>  
>Reporter: Omri
>Priority: Major
>
> The return struct of a pandas_udf should be mapped to the provided schema by 
> name. Currently, that is not the case.
> Consider these two examples, where the only change is the order of the fields 
> in the provided schema struct:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show() 
> {code}
> and this one:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show()
> {code}
> The results should be the same but they are different:
> For the first code:
> {code:java}
> +---+---+
> |  v| id|
> +---+---+
> |1.0|  0|
> |1.0|  0|
> |2.0|  0|
> |2.0|  0|
> |2.0|  1|
> +---+---+
> {code}
> For the second code:
> {code:java}
> +---+---+
> | id|  v|
> +---+---+
> |  1|-0.7071067811865475|
> |  1| 0.7071067811865475|
> |  2|-0.8320502943378437|
> |  2|-0.2773500981126146|
> |  2| 1.1094003924504583|
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23030) Decrease memory consumption with toPandas() collection using Arrow

2018-04-13 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438001#comment-16438001
 ] 

Li Jin commented on SPARK-23030:


Hey [~bryanc], did you by any chance make some progress on this? I guess what's 
tricky here is that you probably lose the parallelism if you stream each partition 
one by one?
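
For context, a minimal pyarrow sketch of the stream-format idea being discussed, which reads one record batch at a time instead of materializing the whole table (the source argument is hypothetical):

{code:java}
import pyarrow as pa

def iter_batches(source):
    # source: a file-like object or socket carrying Arrow stream-format data
    reader = pa.ipc.open_stream(source)
    for batch in reader:      # yields one pyarrow.RecordBatch at a time
        yield batch
{code}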

> Decrease memory consumption with toPandas() collection using Arrow
> --
>
> Key: SPARK-23030
> URL: https://issues.apache.org/jira/browse/SPARK-23030
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Currently with Arrow enabled, calling {{toPandas()}} results in a collection 
> of all partitions in the JVM in the form of batches of Arrow file format.  
> Once collected in the JVM, they are served to the Python driver process. 
> I believe using the Arrow stream format can help to optimize this and reduce 
> memory consumption in the JVM by only loading one record batch at a time 
> before sending it to Python.  This might also reduce the latency between 
> making the initial call in Python and receiving the first batch of records.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


