[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run

2018-08-29 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596859#comment-16596859
 ] 

Bryan Cutler commented on SPARK-25272:
--

This is a follow-up that is possible now that the Arrow streaming format is 
being used.

> Show some kind of test output to indicate pyarrow tests were run
> 
>
> Key: SPARK-25272
> URL: https://issues.apache.org/jira/browse/SPARK-25272
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> Right now, tests only output a status when they are skipped, and there is no 
> way to tell from the logs that pyarrow tests, like ArrowTests, have been run 
> except by the absence of a skipped message.  We can add a test that is 
> skipped when pyarrow is installed, which will produce output in our Jenkins 
> test runs.






[jira] [Created] (SPARK-25274) Improve `toPandas` with Arrow by sending out-of-order record batches

2018-08-29 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-25274:


 Summary: Improve `toPandas` with Arrow by sending out-of-order 
record batches
 Key: SPARK-25274
 URL: https://issues.apache.org/jira/browse/SPARK-25274
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, SQL
Affects Versions: 2.4.0
Reporter: Bryan Cutler


When executing {{toPandas}} with Arrow enabled, partitions that arrive in the 
JVM out of order must be buffered before they can be sent to Python. This 
causes excess memory usage in the driver JVM and increases the time it takes 
to complete, because data must sit in the JVM waiting for preceding partitions 
to come in.

This can be improved by sending out-of-order partitions to Python as soon as 
they arrive in the JVM, followed by a list of indices so that Python can 
assemble the data in the correct order. This way, data is not buffered in the 
JVM and there is no waiting on particular partitions, so performance improves.
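
Below is a minimal sketch of the reassembly idea on the Python side, assuming 
the batches stream over in arrival order and are followed by the list of 
partition indices they came from (the function and argument names are 
illustrative, not Spark's actual internals):

{code:python}
import pyarrow as pa

def assemble_in_order(batches, partition_indices):
    # batches[i] arrived i-th and belongs to partition partition_indices[i];
    # restore partition order before handing the data to pandas, so nothing
    # needs to be buffered on the JVM side.
    ordered = [batch for _, batch in
               sorted(zip(partition_indices, batches), key=lambda pair: pair[0])]
    return pa.Table.from_batches(ordered).to_pandas()
{code}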






[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run

2018-08-29 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596850#comment-16596850
 ] 

Bryan Cutler commented on SPARK-25272:
--

Yeah, that would be great to make it easier to view the Python test output in 
Jenkins.

> Show some kind of test output to indicate pyarrow tests were run
> 
>
> Key: SPARK-25272
> URL: https://issues.apache.org/jira/browse/SPARK-25272
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> Right now, tests only output a status when they are skipped, and there is no 
> way to tell from the logs that pyarrow tests, like ArrowTests, have been run 
> except by the absence of a skipped message.  We can add a test that is 
> skipped when pyarrow is installed, which will produce output in our Jenkins 
> test runs.






[jira] [Updated] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run

2018-08-29 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-25272:
-
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-22216

> Show some kind of test output to indicate pyarrow tests were run
> 
>
> Key: SPARK-25272
> URL: https://issues.apache.org/jira/browse/SPARK-25272
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> Right now, tests only output a status when they are skipped, and there is no 
> way to tell from the logs that pyarrow tests, like ArrowTests, have been run 
> except by the absence of a skipped message.  We can add a test that is 
> skipped when pyarrow is installed, which will produce output in our Jenkins 
> test runs.






[jira] [Created] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run

2018-08-29 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-25272:


 Summary: Show some kind of test output to indicate pyarrow tests 
were run
 Key: SPARK-25272
 URL: https://issues.apache.org/jira/browse/SPARK-25272
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Tests
Affects Versions: 2.4.0
Reporter: Bryan Cutler
Assignee: Bryan Cutler


Right now, tests only output a status when they are skipped, and there is no way 
to tell from the logs that pyarrow tests, like ArrowTests, have been run except 
by the absence of a skipped message.  We can add a test that is skipped when 
pyarrow is installed, which will produce output in our Jenkins test runs.
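
A minimal sketch of that idea (the class name and skip message are 
hypothetical, not the test that was actually added): a test that is skipped 
only when pyarrow is installed leaves a visible skip line in the Jenkins 
output, which confirms that the real pyarrow tests ran.

{code:python}
import unittest

try:
    import pyarrow  # noqa: F401
    _have_pyarrow = True
except ImportError:
    _have_pyarrow = False


class PyArrowTestsRanIndicator(unittest.TestCase):

    # Skipped only when pyarrow is present, so the skip message shows up in the
    # Jenkins log exactly when ArrowTests and the pandas_udf tests were run.
    @unittest.skipIf(_have_pyarrow, "pyarrow is installed; Arrow tests were run")
    def test_pyarrow_tests_were_skipped(self):
        # Runs only when pyarrow is missing; the Arrow tests themselves are
        # skipped elsewhere in that case, so there is nothing to assert here.
        pass


if __name__ == "__main__":
    unittest.main()
{code}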






[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-08-28 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16595835#comment-16595835
 ] 

Bryan Cutler commented on SPARK-23874:
--

[~smilegator] I linked the most relevant pyarrow issues above, which deal 
mostly with binary type support. If I get the time, I'll go through the 
changelog here https://arrow.apache.org/release/0.10.0.html and look for other 
relevant ones.

> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.0
>
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support ARROW-2141
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has a common interface for reset to clean up complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  






[jira] [Commented] (SPARK-25147) GroupedData.apply pandas_udf crashing

2018-08-22 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589177#comment-16589177
 ] 

Bryan Cutler commented on SPARK-25147:
--

Works for me on Linux with:
Python 3.6.6
pyarrow 0.10.0
pandas 0.23.4
numpy 1.14.3

Maybe this only happens on macOS?

> GroupedData.apply pandas_udf crashing
> -
>
> Key: SPARK-25147
> URL: https://issues.apache.org/jira/browse/SPARK-25147
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: OS: Mac OS 10.13.6
> Python: 2.7.15, 3.6.6
> PyArrow: 0.10.0
> Pandas: 0.23.4
> Numpy: 1.15.0
>Reporter: Mike Sukmanowsky
>Priority: Major
>
> Running the following example taken straight from the docs results in 
> {{org.apache.spark.SparkException: Python worker exited unexpectedly 
> (crashed)}} for reasons that aren't clear from any logs I can see:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
> spark = (
> SparkSession
> .builder
> .appName("pandas_udf")
> .getOrCreate()
> )
> df = spark.createDataFrame(
> [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
> ("id", "v")
> )
> @F.pandas_udf("id long, v double", F.PandasUDFType.GROUPED_MAP)
> def normalize(pdf):
> v = pdf.v
> return pdf.assign(v=(v - v.mean()) / v.std())
> (
> df
> .groupby("id")
> .apply(normalize)
> .show()
> )
> {code}
>  See output.log for 
> [stacktrace|https://gist.github.com/msukmanowsky/b9cb6700e8ccaf93f265962000403f28].
>  
>  
>  






[jira] [Commented] (SPARK-23698) Spark code contains numerous undefined names in Python 3

2018-08-22 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589143#comment-16589143
 ] 

Bryan Cutler commented on SPARK-23698:
--

Follow-up resolved by pull request 20838:
https://github.com/apache/spark/pull/20838


> Spark code contains numerous undefined names in Python 3
> 
>
> Key: SPARK-23698
> URL: https://issues.apache.org/jira/browse/SPARK-23698
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: cclauss
>Assignee: cclauss
>Priority: Minor
> Fix For: 2.4.0
>
>
> flake8 testing of https://github.com/apache/spark on Python 3.6.3
> $ *flake8 . --count --select=E901,E999,F821,F822,F823 --show-source 
> --statistics*
> ./dev/merge_spark_pr.py:98:14: F821 undefined name 'raw_input'
> result = raw_input("\n%s (y/n): " % prompt)
>  ^
> ./dev/merge_spark_pr.py:136:22: F821 undefined name 'raw_input'
> primary_author = raw_input(
>  ^
> ./dev/merge_spark_pr.py:186:16: F821 undefined name 'raw_input'
> pick_ref = raw_input("Enter a branch name [%s]: " % default_branch)
>^
> ./dev/merge_spark_pr.py:233:15: F821 undefined name 'raw_input'
> jira_id = raw_input("Enter a JIRA id [%s]: " % default_jira_id)
>   ^
> ./dev/merge_spark_pr.py:278:20: F821 undefined name 'raw_input'
> fix_versions = raw_input("Enter comma-separated fix version(s) [%s]: " % 
> default_fix_versions)
>^
> ./dev/merge_spark_pr.py:317:28: F821 undefined name 'raw_input'
> raw_assignee = raw_input(
>^
> ./dev/merge_spark_pr.py:430:14: F821 undefined name 'raw_input'
> pr_num = raw_input("Which pull request would you like to merge? (e.g. 
> 34): ")
>  ^
> ./dev/merge_spark_pr.py:442:18: F821 undefined name 'raw_input'
> result = raw_input("Would you like to use the modified title? (y/n): 
> ")
>  ^
> ./dev/merge_spark_pr.py:493:11: F821 undefined name 'raw_input'
> while raw_input("\n%s (y/n): " % pick_prompt).lower() == "y":
>   ^
> ./dev/create-release/releaseutils.py:58:16: F821 undefined name 'raw_input'
> response = raw_input("%s [y/n]: " % msg)
>^
> ./dev/create-release/releaseutils.py:152:38: F821 undefined name 'unicode'
> author = unidecode.unidecode(unicode(author, "UTF-8")).strip()
>  ^
> ./python/setup.py:37:11: F821 undefined name '__version__'
> VERSION = __version__
>   ^
> ./python/pyspark/cloudpickle.py:275:18: F821 undefined name 'buffer'
> dispatch[buffer] = save_buffer
>  ^
> ./python/pyspark/cloudpickle.py:807:18: F821 undefined name 'file'
> dispatch[file] = save_file
>  ^
> ./python/pyspark/sql/conf.py:61:61: F821 undefined name 'unicode'
> if not isinstance(obj, str) and not isinstance(obj, unicode):
> ^
> ./python/pyspark/sql/streaming.py:25:21: F821 undefined name 'long'
> intlike = (int, long)
> ^
> ./python/pyspark/streaming/dstream.py:405:35: F821 undefined name 'long'
> return self._sc._jvm.Time(long(timestamp * 1000))
>   ^
> ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:21:10: F821 
> undefined name 'xrange'
> for i in xrange(50):
>  ^
> ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:22:14: F821 
> undefined name 'xrange'
> for j in xrange(5):
>  ^
> ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:23:18: F821 
> undefined name 'xrange'
> for k in xrange(20022):
>  ^
> 20    F821 undefined name 'raw_input'
> 20






[jira] [Resolved] (SPARK-25105) Importing all of pyspark.sql.functions should bring PandasUDFType in as well

2018-08-22 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-25105.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22100
[https://github.com/apache/spark/pull/22100]

> Importing all of pyspark.sql.functions should bring PandasUDFType in as well
> 
>
> Key: SPARK-25105
> URL: https://issues.apache.org/jira/browse/SPARK-25105
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Assignee: kevin yu
>Priority: Trivial
> Fix For: 2.4.0
>
>
>  
> {code:java}
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
> NameError: name 'PandasUDFType' is not defined
>  
> {code}
> When explicitly imported it works fine:
> {code:java}
>  
> >>> from pyspark.sql.functions import PandasUDFType
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> {code}
>  
> We just need to make sure it's included in __all__.
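
An illustrative sketch of that fix (not the exact change that was merged): 
export the name from pyspark/sql/functions.py so a star import picks it up.

{code:python}
# In pyspark/sql/functions.py (illustrative; the real __all__ list is much longer):
__all__ = [
    # ... existing exported names ...
    "pandas_udf",
    "PandasUDFType",
]

# After the fix, a star import is enough:
# >>> from pyspark.sql.functions import *
# >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
{code}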






[jira] [Assigned] (SPARK-25105) Importing all of pyspark.sql.functions should bring PandasUDFType in as well

2018-08-22 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-25105:


Assignee: kevin yu

> Importing all of pyspark.sql.functions should bring PandasUDFType in as well
> 
>
> Key: SPARK-25105
> URL: https://issues.apache.org/jira/browse/SPARK-25105
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Assignee: kevin yu
>Priority: Trivial
> Fix For: 2.4.0
>
>
>  
> {code:java}
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
> NameError: name 'PandasUDFType' is not defined
>  
> {code}
> When explicitly imported it works fine:
> {code:java}
>  
> >>> from pyspark.sql.functions import PandasUDFType
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> {code}
>  
> We just need to make sure it's included in __all__.






[jira] [Commented] (SPARK-25179) Document the features that require Pyarrow 0.10

2018-08-21 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16587882#comment-16587882
 ] 

Bryan Cutler commented on SPARK-25179:
--

I can work on this, but I probably can't get to it right away, though.

> Document the features that require Pyarrow 0.10
> ---
>
> Key: SPARK-25179
> URL: https://issues.apache.org/jira/browse/SPARK-25179
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Document the features that require Pyarrow 0.10. For 
> example, https://github.com/apache/spark/pull/20725
>Reporter: Xiao Li
>Priority: Major
>
> binary type support requires pyarrow 0.10.0






[jira] [Updated] (SPARK-25179) Document the features that require Pyarrow 0.10

2018-08-21 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-25179:
-
Description: binary type support requires pyarrow 0.10.0

> Document the features that require Pyarrow 0.10
> ---
>
> Key: SPARK-25179
> URL: https://issues.apache.org/jira/browse/SPARK-25179
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Document the features that require Pyarrow 0.10. For 
> example, https://github.com/apache/spark/pull/20725
>Reporter: Xiao Li
>Priority: Major
>
> binary type support requires pyarrow 0.10.0






[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-08-21 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16587770#comment-16587770
 ] 

Bryan Cutler commented on SPARK-23874:
--

Yes, I still would recommend users upgrade pyarrow if possible. There were a 
large number of fixes and improvements made since 0.8.0.

> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.0
>
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support ARROW-2141
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has a common interface for reset to clean up complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  






[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-08-20 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586612#comment-16586612
 ] 

Bryan Cutler commented on SPARK-23874:
--

For the Python fixes, yes, the user would have to upgrade pyarrow. In Spark 
there is no pyarrow upgrade as such, just a bump of the minimum supported 
version, which determines what gets installed with pyspark and is still 
currently 0.8.0. Since we upgraded the Java library, those fixes are included.

> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.0
>
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support ARROW-2141
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has a common interface for reset to clean up complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  






[jira] [Commented] (SPARK-21375) Add date and timestamp support to ArrowConverters for toPandas() collection

2018-08-20 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586369#comment-16586369
 ] 

Bryan Cutler commented on SPARK-21375:
--

Hi [~ewohlstadter], the timestamp values should be in UTC.  I'm not sure if 
that's what Hive uses or not, but if not then you will need to do a conversion 
there. Then when building an Arrow TimeStampMicroTZVector, use the conf 
"spark.sql.session.timeZone" for the timezone, which should make things 
consistent. Hope that helps.

> Add date and timestamp support to ArrowConverters for toPandas() collection
> ---
>
> Key: SPARK-21375
> URL: https://issues.apache.org/jira/browse/SPARK-21375
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.3.0
>
>
> Date and timestamp are not yet supported in DataFrame.toPandas() using 
> ArrowConverters.  These are common types for data analysis used in both Spark 
> and Pandas and should be supported.
> There is a discrepancy with the way that PySpark and Arrow store timestamps, 
> without timezone specified, internally.  PySpark takes a UTC timestamp that 
> is adjusted to local time and Arrow is in UTC time.  Hopefully there is a 
> clean way to resolve this.
> Spark internal storage spec:
> * *DateType* stored as days
> * *Timestamp* stored as microseconds 
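
A worked illustration of that storage spec (my own example, not from the 
issue; it assumes the reference point is the Unix epoch, 1970-01-01, with 
timestamps in UTC):

{code:python}
import datetime

# DateType: days since the Unix epoch.
epoch_date = datetime.date(1970, 1, 1)
days = (datetime.date(2018, 8, 20) - epoch_date).days  # -> 17763

# TimestampType: microseconds since the Unix epoch, in UTC.
ts = datetime.datetime(2018, 8, 20, 12, 0, tzinfo=datetime.timezone.utc)
micros = int(ts.timestamp() * 1000000)

print(days, micros)
{code}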






[jira] [Updated] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters

2018-08-17 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-21187:
-
Description: 
This is to track adding the remaining type support in Arrow Converters. 
Currently, only primitive data types are supported.

Remaining types:
 * -*Date*-
 * -*Timestamp*-
 * *Complex*: Struct, -Array-, Arrays of Date/Timestamps, Map
 * -*Decimal*-
 * -*Binary*-

Some things to do before closing this out:
 * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
values as BigDecimal)-
 * -Need to add some user docs-
 * -Make sure Python tests are thorough-
 * Check into complex type support mentioned in comments by [~leif], should we 
support multi-indexing?

  was:
This is to track adding the remaining type support in Arrow Converters. 
Currently, only primitive data types are supported.

Remaining types:
 * -*Date*-
 * -*Timestamp*-
 * *Complex*: Struct, -Array-, Arrays of Date/Timestamps, Map
 * -*Decimal*-
 * *Binary* - in pyspark

Some things to do before closing this out:
 * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
values as BigDecimal)-
 * -Need to add some user docs-
 * -Make sure Python tests are thorough-
 * Check into complex type support mentioned in comments by [~leif], should we 
support multi-indexing?


> Complete support for remaining Spark data types in Arrow Converters
> ---
>
> Key: SPARK-21187
> URL: https://issues.apache.org/jira/browse/SPARK-21187
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> This is to track adding the remaining type support in Arrow Converters. 
> Currently, only primitive data types are supported.
> Remaining types:
>  * -*Date*-
>  * -*Timestamp*-
>  * *Complex*: Struct, -Array-, Arrays of Date/Timestamps, Map
>  * -*Decimal*-
>  * -*Binary*-
> Some things to do before closing this out:
>  * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
> values as BigDecimal)-
>  * -Need to add some user docs-
>  * -Make sure Python tests are thorough-
>  * Check into complex type support mentioned in comments by [~leif], should 
> we support multi-indexing?






[jira] [Resolved] (SPARK-23555) Add BinaryType support for Arrow in PySpark

2018-08-17 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-23555.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20725
[https://github.com/apache/spark/pull/20725]

> Add BinaryType support for Arrow in PySpark
> ---
>
> Key: SPARK-23555
> URL: https://issues.apache.org/jira/browse/SPARK-23555
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.0
>
>
> BinaryType is supported with Arrow in Scala, but not yet in Python.  It 
> should be added for all codepaths using Arrow.
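
A small usage sketch of what this enables (my own illustration, assuming a 
SparkSession named {{spark}} and pyarrow 0.10.0 or newer installed):

{code:python}
# With BinaryType handled by the Arrow code path, a binary column can be
# collected through the Arrow-accelerated toPandas().
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.createDataFrame([(bytearray(b"ab"),), (bytearray(b"cd"),)], ["b"])
pdf = df.toPandas()  # column 'b' comes back as Python bytes objects
print(pdf)
{code}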






[jira] [Assigned] (SPARK-23555) Add BinaryType support for Arrow in PySpark

2018-08-17 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-23555:


Assignee: Bryan Cutler

> Add BinaryType support for Arrow in PySpark
> ---
>
> Key: SPARK-23555
> URL: https://issues.apache.org/jira/browse/SPARK-23555
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.0
>
>
> BinaryType is supported with Arrow in Scala, but not yet in Python.  It 
> should be added for all codepaths using Arrow.






[jira] [Resolved] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-08-14 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-23874.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21939
[https://github.com/apache/spark/pull/21939]

> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.0
>
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support ARROW-2141
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has a common interface for reset to clean up complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  






[jira] [Updated] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-08-13 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-23874:
-
Description: 
Version 0.10.0 will allow for the following improvements and bug fixes:
 * Allow for adding BinaryType support ARROW-2141
 * Bug fix related to array serialization ARROW-1973
 * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
 * Python bytearrays are supported as input to pyarrow ARROW-2141
 * Java has a common interface for reset to clean up complex vectors in Spark 
ArrowWriter ARROW-1962
 * Cleanup pyarrow type equality checks ARROW-2423
 * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
ARROW-2645
 * Improved low level handling of messages for RecordBatch ARROW-2704

 

 

  was:
Version 0.10.0 will allow for the following improvements and bug fixes:
 * Allow for adding BinaryType support SPARK-23555
 * Bug fix related to array serialization ARROW-1973
 * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
 * Python bytearrays are supported as input to pyarrow ARROW-2141
 * Java has a common interface for reset to clean up complex vectors in Spark 
ArrowWriter ARROW-1962
 * Cleanup pyarrow type equality checks ARROW-2423
 * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
ARROW-2645
 * Improved low level handling of messages for RecordBatch ARROW-2704

 

 


> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support ARROW-2141
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has a common interface for reset to clean up complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  






[jira] [Updated] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-08-13 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-23874:
-
Description: 
Version 0.10.0 will allow for the following improvements and bug fixes:
 * Allow for adding BinaryType support SPARK-23555
 * Bug fix related to array serialization ARROW-1973
 * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
 * Python bytearrays are supported as input to pyarrow ARROW-2141
 * Java has a common interface for reset to clean up complex vectors in Spark 
ArrowWriter ARROW-1962
 * Cleanup pyarrow type equality checks ARROW-2423
 * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
ARROW-2645
 * Improved low level handling of messages for RecordBatch ARROW-2704

 

 

  was:
Version 0.10.0 will allow for the following improvements and bug fixes:
 * Allow for adding BinaryType support
 * Bug fix related to array serialization ARROW-1973
 * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
 * Python bytearrays are supported as input to pyarrow ARROW-2141
 * Java has a common interface for reset to clean up complex vectors in Spark 
ArrowWriter ARROW-1962
 * Cleanup pyarrow type equality checks ARROW-2423
 * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
ARROW-2645
 * Improved low level handling of messages for RecordBatch ARROW-2704

 

 


> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support SPARK-23555
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has a common interface for reset to clean up complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  






[jira] [Comment Edited] (SPARK-25060) PySpark UDF in case statement is always run

2018-08-08 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16573866#comment-16573866
 ] 

Bryan Cutler edited comment on SPARK-25060 at 8/8/18 9:03 PM:
--

I believe this was brought up 
[here|https://lists.apache.org/thread.html/a83e58dceed7b9b277070d42a5c85b319cb3ab77a2e2f00c4a7f3deb@]
 cc [~LI,Xiao] 


was (Author: bryanc):
I believe this was brought up here 
https://lists.apache.org/thread.html/a83e58dceed7b9b277070d42a5c85b319cb3ab77a2e2f00c4a7f3deb@
 cc [~LI,Xiao] 

> PySpark UDF in case statement is always run
> ---
>
> Key: SPARK-25060
> URL: https://issues.apache.org/jira/browse/SPARK-25060
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Ryan Blue
>Priority: Major
>
> When evaluating a case statement with a python UDF, Spark will always run the 
> UDF even if the case doesn't use the branch with the UDF call. Here's a repro 
> case:
> {code:lang=python}
> from pyspark.sql.types import StringType
> def fail_if_x(s):
> assert s != 'x'
> return s
> spark.udf.register("fail_if_x", fail_if_x, StringType())
> df = spark.createDataFrame([(1, 'x'), (2, 'y')], ['id', 'str'])
> df.registerTempTable("data")
> spark.sql("select id, case when str <> 'x' then fail_if_x(str) else null end 
> from data").show()
> {code}
> This produces the following error:
> {code}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last): 
>   File 
> "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py",
>  line 189, in main 
> process() 
>   File 
> "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py",
>  line 184, in process 
> serializer.dump_stream(func(split_index, iterator), outfile) 
>   File 
> "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py",
>  line 104, in  
> func = lambda _, it: map(mapper, it) 
>   File "", line 1, in  
>   File 
> "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py",
>  line 71, in  
> return lambda *a: f(*a) 
>   File "", line 4, in fail_if_x 
> AssertionError
> {code}
> This is because Python UDFs are extracted from expressions and run in the 
> BatchEvalPython node inserted as the child of the expression node:
> {code}
> == Physical Plan ==
> CollectLimit 21
> +- *Project [id#0L, CASE WHEN NOT (str#1 = x) THEN pythonUDF0#14 ELSE null 
> END AS CASE WHEN (NOT (str = x)) THEN fail_if_x(str) ELSE CAST(NULL AS 
> STRING) END#6]
>+- BatchEvalPython [fail_if_x(str#1)], [id#0L, str#1, pythonUDF0#14]
>   +- Scan ExistingRDD[id#0L,str#1]
> {code}
> This doesn't affect correctness, but the behavior doesn't match the Scala API 
> where case can be used to avoid passing data that will cause a UDF to fail 
> into the UDF.
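
A user-side mitigation sketch (my own suggestion, reusing the {{spark}} session 
and {{data}} temp table from the repro above; it does not change the planner 
behavior): guard against the bad input inside the UDF itself, since 
BatchEvalPython may evaluate it on rows the CASE branch would discard.

{code:python}
from pyspark.sql.types import StringType

def fail_if_x_tolerant(s):
    # Tolerate the rows the CASE branch was meant to filter out.
    return None if s == 'x' else s

spark.udf.register("fail_if_x_tolerant", fail_if_x_tolerant, StringType())
spark.sql("select id, case when str <> 'x' then fail_if_x_tolerant(str) "
          "else null end from data").show()
{code}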






[jira] [Commented] (SPARK-25060) PySpark UDF in case statement is always run

2018-08-08 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16573866#comment-16573866
 ] 

Bryan Cutler commented on SPARK-25060:
--

I believe this was brought up here 
https://lists.apache.org/thread.html/a83e58dceed7b9b277070d42a5c85b319cb3ab77a2e2f00c4a7f3deb@
 cc [~LI,Xiao] 

> PySpark UDF in case statement is always run
> ---
>
> Key: SPARK-25060
> URL: https://issues.apache.org/jira/browse/SPARK-25060
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Ryan Blue
>Priority: Major
>
> When evaluating a case statement with a python UDF, Spark will always run the 
> UDF even if the case doesn't use the branch with the UDF call. Here's a repro 
> case:
> {code:lang=python}
> from pyspark.sql.types import StringType
> def fail_if_x(s):
> assert s != 'x'
> return s
> spark.udf.register("fail_if_x", fail_if_x, StringType())
> df = spark.createDataFrame([(1, 'x'), (2, 'y')], ['id', 'str'])
> df.registerTempTable("data")
> spark.sql("select id, case when str <> 'x' then fail_if_x(str) else null end 
> from data").show()
> {code}
> This produces the following error:
> {code}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last): 
>   File 
> "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py",
>  line 189, in main 
> process() 
>   File 
> "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py",
>  line 184, in process 
> serializer.dump_stream(func(split_index, iterator), outfile) 
>   File 
> "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py",
>  line 104, in  
> func = lambda _, it: map(mapper, it) 
>   File "", line 1, in  
>   File 
> "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py",
>  line 71, in  
> return lambda *a: f(*a) 
>   File "", line 4, in fail_if_x 
> AssertionError
> {code}
> This is because Python UDFs are extracted from expressions and run in the 
> BatchEvalPython node inserted as the child of the expression node:
> {code}
> == Physical Plan ==
> CollectLimit 21
> +- *Project [id#0L, CASE WHEN NOT (str#1 = x) THEN pythonUDF0#14 ELSE null 
> END AS CASE WHEN (NOT (str = x)) THEN fail_if_x(str) ELSE CAST(NULL AS 
> STRING) END#6]
>+- BatchEvalPython [fail_if_x(str#1)], [id#0L, str#1, pythonUDF0#14]
>   +- Scan ExistingRDD[id#0L,str#1]
> {code}
> This doesn't affect correctness, but the behavior doesn't match the Scala API 
> where case can be used to avoid passing data that will cause a UDF to fail 
> into the UDF.






[jira] [Resolved] (SPARK-24976) Allow None for Decimal type conversion (specific to PyArrow 0.9.0)

2018-07-31 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-24976.
--
   Resolution: Fixed
Fix Version/s: 2.3.2
   2.4.0

Issue resolved by pull request 21928
[https://github.com/apache/spark/pull/21928]

> Allow None for Decimal type conversion (specific to PyArrow 0.9.0)
> --
>
> Key: SPARK-24976
> URL: https://issues.apache.org/jira/browse/SPARK-24976
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0, 2.3.2
>
>
> See https://jira.apache.org/jira/browse/ARROW-2432
> If we use Arrow 0.9.0, the test case (None as decimal) fails as below:
> {code}
> Traceback (most recent call last):
>   File "/.../spark/python/pyspark/sql/tests.py", line 4672, in 
> test_vectorized_udf_null_decimal
> self.assertEquals(df.collect(), res.collect())
>   File "/.../spark/python/pyspark/sql/dataframe.py", line 533, in collect
> sock_info = self._jdf.collectToPython()
>   File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 
> 1257, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
> 328, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o51.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 
> in stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 
> (TID 7, localhost, executor driver): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/.../spark/python/pyspark/worker.py", line 320, in main
> process()
>   File "/.../spark/python/pyspark/worker.py", line 315, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/.../spark/python/pyspark/serializers.py", line 274, in dump_stream
> batch = _create_batch(series, self._timezone)
>   File "/.../spark/python/pyspark/serializers.py", line 243, in _create_batch
> arrs = [create_array(s, t) for s, t in series]
>   File "/.../spark/python/pyspark/serializers.py", line 241, in create_array
> return pa.Array.from_pandas(s, mask=mask, type=t)
>   File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 77, in pyarrow.lib.check_status
> ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
> object of type NoneType but can only handle these types: decimal.Decimal
> {code} 






[jira] [Assigned] (SPARK-24976) Allow None for Decimal type conversion (specific to PyArrow 0.9.0)

2018-07-31 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-24976:


Assignee: Hyukjin Kwon

> Allow None for Decimal type conversion (specific to PyArrow 0.9.0)
> --
>
> Key: SPARK-24976
> URL: https://issues.apache.org/jira/browse/SPARK-24976
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.3.2, 2.4.0
>
>
> See https://jira.apache.org/jira/browse/ARROW-2432
> If we use Arrow 0.9.0, the test case (None as decimal) fails as below:
> {code}
> Traceback (most recent call last):
>   File "/.../spark/python/pyspark/sql/tests.py", line 4672, in 
> test_vectorized_udf_null_decimal
> self.assertEquals(df.collect(), res.collect())
>   File "/.../spark/python/pyspark/sql/dataframe.py", line 533, in collect
> sock_info = self._jdf.collectToPython()
>   File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 
> 1257, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
> 328, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o51.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 
> in stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 
> (TID 7, localhost, executor driver): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/.../spark/python/pyspark/worker.py", line 320, in main
> process()
>   File "/.../spark/python/pyspark/worker.py", line 315, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/.../spark/python/pyspark/serializers.py", line 274, in dump_stream
> batch = _create_batch(series, self._timezone)
>   File "/.../spark/python/pyspark/serializers.py", line 243, in _create_batch
> arrs = [create_array(s, t) for s, t in series]
>   File "/.../spark/python/pyspark/serializers.py", line 241, in create_array
> return pa.Array.from_pandas(s, mask=mask, type=t)
>   File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 77, in pyarrow.lib.check_status
> ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
> object of type NoneType but can only handle these types: decimal.Decimal
> {code} 






[jira] [Commented] (SPARK-24915) Calling SparkSession.createDataFrame with schema can throw exception

2018-07-25 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16556126#comment-16556126
 ] 

Bryan Cutler commented on SPARK-24915:
--

Hi [~stspencer], I've been trying to fix similar issues, but this is a little 
different since the StructType makes _needSerializeAnyField==True, as you 
pointed out.  I agree that the current behavior is very confusing and should be 
fixed, but the related issue had to be pushed back to Spark 3.0 because it 
causes a behavior change.  Hopefully we can improve both of these issues.  Until 
then, in case you're not aware, the intended way to define Row data when you 
care about specific field positioning is like this:

 
{code:java}
In [10]: MyRow = Row("field2", "field1")

In [11]: data = [
...: MyRow(Row(sub_field='world'), "hello")
...: ]

In [12]: df = spark.createDataFrame(data, schema=schema)

In [13]: df.show()
+-------+------+
| field2|field1|
+-------+------+
|[world]| hello|
+-------+------+
{code}

Hope that helps.

> Calling SparkSession.createDataFrame with schema can throw exception
> 
>
> Key: SPARK-24915
> URL: https://issues.apache.org/jira/browse/SPARK-24915
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: Python 3.6.3
> PySpark 2.3.1 (installed via pip)
> OSX 10.12.6
>Reporter: Stephen Spencer
>Priority: Major
>
> There seems to be a bug in PySpark when using the PySparkSQL session to 
> create a dataframe with a pre-defined schema.
> Code to reproduce the error:
> {code:java}
> from pyspark import SparkConf, SparkContext
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StructType, StructField, StringType, Row
> conf = SparkConf().setMaster("local").setAppName("repro") 
> context = SparkContext(conf=conf) 
> session = SparkSession(context)
> # Construct schema (the order of fields is important)
> schema = StructType([
> StructField('field2', StructType([StructField('sub_field', StringType(), 
> False)]), False),
> StructField('field1', StringType(), False),
> ])
> # Create data to populate data frame
> data = [
> Row(field1="Hello", field2=Row(sub_field='world'))
> ]
> # Attempt to create the data frame supplying the schema
> # this will throw a ValueError
> df = session.createDataFrame(data, schema=schema)
> df.show(){code}
> Running this throws a ValueError
> {noformat}
> Traceback (most recent call last):
> File "schema_bug.py", line 18, in 
> df = session.createDataFrame(data, schema=schema)
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py",
>  line 691, in createDataFrame
> rdd, schema = self._createFromLocal(map(prepare, data), schema)
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py",
>  line 423, in _createFromLocal
> data = [schema.toInternal(row) for row in data]
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py",
>  line 423, in 
> data = [schema.toInternal(row) for row in data]
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py",
>  line 601, in toInternal
> for f, v, c in zip(self.fields, obj, self._needConversion))
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py",
>  line 601, in 
> for f, v, c in zip(self.fields, obj, self._needConversion))
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py",
>  line 439, in toInternal
> return self.dataType.toInternal(obj)
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py",
>  line 619, in toInternal
> raise ValueError("Unexpected tuple %r with StructType" % obj)
> ValueError: Unexpected tuple 'Hello' with StructType{noformat}
> The problem seems to be here:
> https://github.com/apache/spark/blob/3d5c61e5fd24f07302e39b5d61294da79aa0c2f9/python/pyspark/sql/types.py#L603
> specifically the bit
> {code:java}
> zip(self.fields, obj, self._needConversion)
> {code}
> This zip statement seems to assume that obj and self.fields are ordered in 
> the same way, so that the elements of obj will correspond to the right fields 
> in the schema. However, this is not true: a Row orders its elements 
> alphabetically, but the fields in the schema are in whatever order they are 
> specified. In this example, field2 is being initialised with the field1 
> element 'Hello'. If you re-order the fields in the schema to go (field1, 
> field2), the given example works without error.
> The schema in the repro is specifically designed to elicit the problem, the 
> fields ar

[jira] [Commented] (SPARK-24644) Pyarrow exception while running pandas_udf on pyspark 2.3.1

2018-07-16 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16545589#comment-16545589
 ] 

Bryan Cutler commented on SPARK-24644:
--

[~helkhalfi], the error in the stack trace is coming from pandas internals and 
it looks like you are using a pretty old version, so my guess is that you need 
to upgrade pandas to solve this.  For Spark, we currently test pyarrow with 
pandas 0.19.2 and I would recommend at least that version or higher.
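
A quick way to confirm the versions in play before upgrading (my own snippet, 
not part of the original comment):

{code:python}
import pandas as pd
import pyarrow as pa

# Spark currently tests its Arrow code paths against pandas 0.19.2 or newer,
# so anything older than that is a likely culprit here.
print("pandas:", pd.__version__)
print("pyarrow:", pa.__version__)
{code}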

> Pyarrow exception while running pandas_udf on pyspark 2.3.1
> ---
>
> Key: SPARK-24644
> URL: https://issues.apache.org/jira/browse/SPARK-24644
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: os: centos
> pyspark 2.3.1
> spark 2.3.1
> pyarrow >= 0.8.0
>Reporter: Hichame El Khalfi
>Priority: Major
>
> Hello,
> When I try to run a `pandas_udf` on my spark dataframe, I get this error
>  
> {code:java}
>   File 
> "/mnt/ephemeral3/yarn/nm/usercache/user/appcache/application_1524574803975_205774/container_e280_1524574803975_205774_01_44/pyspark.zip/pyspark/serializers.py",
>  line 280, in load_stream
> pdf = batch.to_pandas()
>   File "pyarrow/table.pxi", line 677, in pyarrow.lib.RecordBatch.to_pandas 
> (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:43226)
> return Table.from_batches([self]).to_pandas(nthreads=nthreads)
>   File "pyarrow/table.pxi", line 1043, in pyarrow.lib.Table.to_pandas 
> (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:46331)
> mgr = pdcompat.table_to_blockmanager(options, self, memory_pool,
>   File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line 
> 528, in table_to_blockmanager
> blocks = _table_to_blocks(options, block_table, nthreads, memory_pool)
>   File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line 
> 622, in _table_to_blocks
> return [_reconstruct_block(item) for item in result]
>   File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line 
> 446, in _reconstruct_block
> block = _int.make_block(block_arr, placement=placement)
> TypeError: make_block() takes at least 3 arguments (2 given)
> {code}
>  
>  More than happy to provide any additional information






[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-07-16 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16545519#comment-16545519
 ] 

Bryan Cutler commented on SPARK-23874:
--

[~smilegator], we are aiming to have the Arrow 0.10.0 release soon, and I will 
pick this up again when that happens. So I think it's possible we could have 
the upgrade done before the code freeze.

> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has a common interface for reset to clean up complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  






[jira] [Commented] (SPARK-24632) Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers for persistence

2018-07-15 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16544703#comment-16544703
 ] 

Bryan Cutler commented on SPARK-24632:
--

Hi [~josephkb], would you mind clarifying why there needs to be an additional 
trait in Scala to point to Python class paths, instead of something to override 
the line
{code:java}
stage_name = java_stage.getClass().getName().replace("org.apache.spark", 
"pyspark")
{code}
in wrapper.py?  Ideally the Scala classes should not be aware of Python at all, 
and when loading, the Python estimators/models should be able to create the 
Java object and wrap it as long as the line above produces the correct class 
prefix.  Thanks!
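To make the question concrete, here is a rough sketch of what an overridable 
mapping could look like on the Python side.  The helper name, the prefix-map 
argument, and the example package names are all made up, not pyspark APIs:
{code:python}
# Hypothetical helper (not part of pyspark) illustrating the mapping that
# wrapper.py performs: turn a JVM class name into the module path of its
# Python wrapper.  Third-party packages could register their own prefixes.
def java_to_python_class(java_class_name, prefix_map=None):
    prefix_map = prefix_map or {"org.apache.spark": "pyspark"}
    for java_prefix, python_prefix in prefix_map.items():
        if java_class_name.startswith(java_prefix):
            return python_prefix + java_class_name[len(java_prefix):]
    return java_class_name

# Built-in wrappers resolve as before...
print(java_to_python_class("org.apache.spark.ml.feature.Binarizer"))
# ...and a 3rd-party library could supply its own prefix mapping without
# any Scala-side trait.
print(java_to_python_class("com.example.ml.MyEstimator",
                           {"com.example.ml": "example_ml"}))
{code}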

> Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers 
> for persistence
> --
>
> Key: SPARK-24632
> URL: https://issues.apache.org/jira/browse/SPARK-24632
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
>
> This is a follow-up for [SPARK-17025], which allowed users to implement 
> Python PipelineStages in 3rd-party libraries, include them in Pipelines, and 
> use Pipeline persistence.  This task is to make it easier for 3rd-party 
> libraries to have PipelineStages written in Java and then to use pyspark.ml 
> abstractions to create wrappers around those Java classes.  This is currently 
> possible, except that users hit bugs around persistence.
> I spent a bit of time thinking about this and wrote up thoughts and a proposal 
> in the doc linked below.  Summary of the proposal:
> Require that 3rd-party libraries whose Java classes have Python wrappers 
> implement a trait which provides the corresponding Python classpath in some 
> field:
> {code}
> trait PythonWrappable {
>   def pythonClassPath: String = …
> }
> MyJavaType extends PythonWrappable
> {code}
> This will not be required for MLlib wrappers, which we can handle specially.
> One issue for this task will be that we may have trouble writing unit tests.  
> They would ideally test a Java class + Python wrapper class pair sitting 
> outside of pyspark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24760) Pandas UDF does not handle NaN correctly

2018-07-12 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541967#comment-16541967
 ] 

Bryan Cutler commented on SPARK-24760:
--

Yeah, createDataFrame is inconsistent with pandas_udf here, but you could argue 
it the other way: the example with Arrow disabled is the incorrect one.  If NaN 
values are introduced into the Pandas DataFrame not manually but, say, as a 
result of reindexing, then they represent NULL values and the user would expect 
to see them as NULLs in Spark.  For example:
{code:java}
 
 In [19]: pdf = pd.DataFrame({"x": [1.0, 2.0, 4.0]}, index=[0, 1, 
3]).reindex(range(4))
 
 In [20]: pdf
 Out[20]:
   x
 0 1.0
 1 2.0
 2 NaN
 3 4.0
 
 In [21]: pdf.isnull()
 Out[21]: 
 x
 0 False
 1 False
 2 True
 3 False

 In [23]: spark.conf.set("spark.sql.execution.arrow.enabled", True)
 
 In [24]: spark.createDataFrame(pdf).show()
 +----+
 |   x|
 +----+
 | 1.0|
 | 2.0|
 |null|
 | 4.0|
 +----+

{code}
[~ueshin] and [~hyukjin.kwon], what do you think the correct behavior should be 
for createDataFrame from a Pandas DataFrame that has NaN values?
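(As an aside, a sketch of how a user can make the intent explicit today, 
assuming an active {{spark}} session: convert NaN to None before calling 
createDataFrame, so both the Arrow and non-Arrow paths should see a missing 
value.  This is just an illustration, I haven't verified it on every version.)
{code:python}
import pandas as pd

pdf = pd.DataFrame({"x": [1.0, 2.0, 4.0]}, index=[0, 1, 3]).reindex(range(4))

# Replace NaN with None in an object-typed frame so the missing row is
# unambiguous before it reaches Spark.
pdf_obj = pdf.astype(object).where(pd.notnull(pdf), None)
spark.createDataFrame(pdf_obj).show()
{code}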

> Pandas UDF does not handle NaN correctly
> 
>
> Key: SPARK-24760
> URL: https://issues.apache.org/jira/browse/SPARK-24760
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0, 2.3.1
> Environment: Spark 2.3.1
> Pandas 0.23.1
>Reporter: Mortada Mehyar
>Priority: Minor
>
> I noticed that having `NaN` values when using the new Pandas UDF feature 
> triggers a JVM exception. Not sure if this is an issue with PySpark or 
> PyArrow. Here is a somewhat contrived example to showcase the problem.
> {code}
> In [1]: import pandas as pd
>...: from pyspark.sql.functions import lit, pandas_udf, PandasUDFType
> In [2]: d = [{'key': 'a', 'value': 1},
>  {'key': 'a', 'value': 2},
>  {'key': 'b', 'value': 3},
>  {'key': 'b', 'value': -2}]
> df = spark.createDataFrame(d, "key: string, value: int")
> df.show()
> +---+-----+
> |key|value|
> +---+-----+
> |  a|    1|
> |  a|    2|
> |  b|    3|
> |  b|   -2|
> +---+-----+
> In [3]: df_tmp = df.withColumn('new', lit(1.0))  # add a DoubleType column
> df_tmp.printSchema()
> root
>  |-- key: string (nullable = true)
>  |-- value: integer (nullable = true)
>  |-- new: double (nullable = false)
> {code}
> And the Pandas UDF is simply creating a new column where negative values 
> would be set to a particular float, in this case INF and it works fine
> {code}
> In [4]: @pandas_udf(df_tmp.schema, PandasUDFType.GROUPED_MAP)
>...: def func(pdf):
>...: pdf['new'] = pdf['value'].where(pdf['value'] > 0, float('inf'))
>...: return pdf
> In [5]: df.groupby('key').apply(func).show()
> +---+-----+----------+
> |key|value|       new|
> +---+-----+----------+
> |  b|    3|       3.0|
> |  b|   -2|  Infinity|
> |  a|    1|       1.0|
> |  a|    2|       2.0|
> +---+-----+----------+
> {code}
> However if we set this value to NaN then it triggers an exception:
> {code}
> In [6]: @pandas_udf(df_tmp.schema, PandasUDFType.GROUPED_MAP)
>...: def func(pdf):
>...: pdf['new'] = pdf['value'].where(pdf['value'] > 0, float('nan'))
>...: return pdf
>...:
>...: df.groupby('key').apply(func).show()
> [Stage 23:==> (73 + 2) / 
> 75]2018-07-07 16:26:27 ERROR Executor:91 - Exception in task 36.0 in stage 
> 23.0 (TID 414)
> java.lang.IllegalStateException: Value at index is null
>   at org.apache.arrow.vector.Float8Vector.get(Float8Vector.java:98)
>   at 
> org.apache.spark.sql.vectorized.ArrowColumnVector$DoubleAccessor.getDouble(ArrowColumnVector.java:344)
>   at 
> org.apache.spark.sql.vectorized.ArrowColumnVector.getDouble(ArrowColumnVector.java:99)
>   at 
> org.apache.spark.sql.execution.vectorized.MutableColumnarRow.getDouble(MutableColumnarRow.java:126)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.s

[jira] [Comment Edited] (SPARK-24760) Pandas UDF does not handle NaN correctly

2018-07-10 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16538924#comment-16538924
 ] 

Bryan Cutler edited comment on SPARK-24760 at 7/10/18 4:48 PM:
---

Pandas uses NaNs as a special value that it interprets as a NULL or a missing 
value
{code}
In [35]: s = pd.Series([1, None, float('nan')], dtype='O')

In [36]: s
Out[36]:
0   1
1    None
2 NaN
dtype: object

In [37]: s.isnull()
Out[37]:
0    False
1 True
2 True
dtype: bool{code}
I'm not sure if there is a way to change the above behavior in Pandas. The only 
other possibility I can think of is to specify your own NULL mask when pyspark 
converts Pandas data to Arrow, although that doesn't seem to work with NaNs in 
Arrow currently:
{code}
In [39]: pa.Array.from_pandas(s, mask=[False, True, False], type=pa.float64())
Out[39]: 

[
  1.0,
  NA,
  NA
]
{code}


was (Author: bryanc):
Pandas uses NaNs as a special value that it interprets as a NULL or a missing 
value
{code:python}
In [35]: s = pd.Series([1, None, float('nan')], dtype='O')

In [36]: s
Out[36]:
0   1
1    None
2 NaN
dtype: object

In [37]: s.isnull()
Out[37]:
0    False
1 True
2 True
dtype: bool{code}

I'm not sure if there is a way to change the above behavior in Pandas.  The 
only other possibility I can think of is to specify your own NULL mask when 
pyspark converts Pandas data to Arrow.  This doesn't seem to be supported in 
Arrow currently

{code:python}
In [39]: pa.Array.from_pandas(s, mask=[False, True, False], type=pa.float64())
Out[39]: 

[
  1.0,
  NA,
  NA
]
{code}

> Pandas UDF does not handle NaN correctly
> 
>
> Key: SPARK-24760
> URL: https://issues.apache.org/jira/browse/SPARK-24760
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0, 2.3.1
> Environment: Spark 2.3.1
> Pandas 0.23.1
>Reporter: Mortada Mehyar
>Priority: Minor
>
> I noticed that having `NaN` values when using the new Pandas UDF feature 
> triggers a JVM exception. Not sure if this is an issue with PySpark or 
> PyArrow. Here is a somewhat contrived example to showcase the problem.
> {code}
> In [1]: import pandas as pd
>...: from pyspark.sql.functions import lit, pandas_udf, PandasUDFType
> In [2]: d = [{'key': 'a', 'value': 1},
>  {'key': 'a', 'value': 2},
>  {'key': 'b', 'value': 3},
>  {'key': 'b', 'value': -2}]
> df = spark.createDataFrame(d, "key: string, value: int")
> df.show()
> +---+-----+
> |key|value|
> +---+-----+
> |  a|    1|
> |  a|    2|
> |  b|    3|
> |  b|   -2|
> +---+-----+
> In [3]: df_tmp = df.withColumn('new', lit(1.0))  # add a DoubleType column
> df_tmp.printSchema()
> root
>  |-- key: string (nullable = true)
>  |-- value: integer (nullable = true)
>  |-- new: double (nullable = false)
> {code}
> And the Pandas UDF is simply creating a new column where negative values 
> would be set to a particular float, in this case INF and it works fine
> {code}
> In [4]: @pandas_udf(df_tmp.schema, PandasUDFType.GROUPED_MAP)
>...: def func(pdf):
>...: pdf['new'] = pdf['value'].where(pdf['value'] > 0, float('inf'))
>...: return pdf
> In [5]: df.groupby('key').apply(func).show()
> +---+-----+----------+
> |key|value|       new|
> +---+-----+----------+
> |  b|    3|       3.0|
> |  b|   -2|  Infinity|
> |  a|    1|       1.0|
> |  a|    2|       2.0|
> +---+-----+----------+
> {code}
> However if we set this value to NaN then it triggers an exception:
> {code}
> In [6]: @pandas_udf(df_tmp.schema, PandasUDFType.GROUPED_MAP)
>...: def func(pdf):
>...: pdf['new'] = pdf['value'].where(pdf['value'] > 0, float('nan'))
>...: return pdf
>...:
>...: df.groupby('key').apply(func).show()
> [Stage 23:==> (73 + 2) / 
> 75]2018-07-07 16:26:27 ERROR Executor:91 - Exception in task 36.0 in stage 
> 23.0 (TID 414)
> java.lang.IllegalStateException: Value at index is null
>   at org.apache.arrow.vector.Float8Vector.get(Float8Vector.java:98)
>   at 
> org.apache.spark.sql.vectorized.ArrowColumnVector$DoubleAccessor.getDouble(ArrowColumnVector.java:344)
>   at 
> org.apache.spark.sql.vectorized.ArrowColumnVector.getDouble(ArrowColumnVector.java:99)
>   at 
> org.apache.spark.sql.execution.vectorized.MutableColumnarRow.getDouble(MutableColumnarRow.java:126)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.sql.catalyst.

[jira] [Commented] (SPARK-24760) Pandas UDF does not handle NaN correctly

2018-07-10 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16538924#comment-16538924
 ] 

Bryan Cutler commented on SPARK-24760:
--

Pandas uses NaNs as a special value that it interprets as a NULL or a missing 
value
{code:python}
In [35]: s = pd.Series([1, None, float('nan')], dtype='O')

In [36]: s
Out[36]:
0   1
1    None
2 NaN
dtype: object

In [37]: s.isnull()
Out[37]:
0    False
1 True
2 True
dtype: bool{code}

I'm not sure if there is a way to change the above behavior in Pandas.  The 
only other possibility I can think of is to specify your own NULL mask when 
pyspark converts Pandas data to Arrow.  This doesn't seem to be supported in 
Arrow currently

{code:python}
In [39]: pa.Array.from_pandas(s, mask=[False, True, False], type=pa.float64())
Out[39]: 

[
  1.0,
  NA,
  NA
]
{code}

> Pandas UDF does not handle NaN correctly
> 
>
> Key: SPARK-24760
> URL: https://issues.apache.org/jira/browse/SPARK-24760
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0, 2.3.1
> Environment: Spark 2.3.1
> Pandas 0.23.1
>Reporter: Mortada Mehyar
>Priority: Minor
>
> I noticed that having `NaN` values when using the new Pandas UDF feature 
> triggers a JVM exception. Not sure if this is an issue with PySpark or 
> PyArrow. Here is a somewhat contrived example to showcase the problem.
> {code}
> In [1]: import pandas as pd
>...: from pyspark.sql.functions import lit, pandas_udf, PandasUDFType
> In [2]: d = [{'key': 'a', 'value': 1},
>  {'key': 'a', 'value': 2},
>  {'key': 'b', 'value': 3},
>  {'key': 'b', 'value': -2}]
> df = spark.createDataFrame(d, "key: string, value: int")
> df.show()
> +---+-----+
> |key|value|
> +---+-----+
> |  a|    1|
> |  a|    2|
> |  b|    3|
> |  b|   -2|
> +---+-----+
> In [3]: df_tmp = df.withColumn('new', lit(1.0))  # add a DoubleType column
> df_tmp.printSchema()
> root
>  |-- key: string (nullable = true)
>  |-- value: integer (nullable = true)
>  |-- new: double (nullable = false)
> {code}
> And the Pandas UDF is simply creating a new column where negative values 
> would be set to a particular float, in this case INF and it works fine
> {code}
> In [4]: @pandas_udf(df_tmp.schema, PandasUDFType.GROUPED_MAP)
>...: def func(pdf):
>...: pdf['new'] = pdf['value'].where(pdf['value'] > 0, float('inf'))
>...: return pdf
> In [5]: df.groupby('key').apply(func).show()
> +---+-----+----------+
> |key|value|       new|
> +---+-----+----------+
> |  b|    3|       3.0|
> |  b|   -2|  Infinity|
> |  a|    1|       1.0|
> |  a|    2|       2.0|
> +---+-----+----------+
> {code}
> However if we set this value to NaN then it triggers an exception:
> {code}
> In [6]: @pandas_udf(df_tmp.schema, PandasUDFType.GROUPED_MAP)
>...: def func(pdf):
>...: pdf['new'] = pdf['value'].where(pdf['value'] > 0, float('nan'))
>...: return pdf
>...:
>...: df.groupby('key').apply(func).show()
> [Stage 23:==> (73 + 2) / 
> 75]2018-07-07 16:26:27 ERROR Executor:91 - Exception in task 36.0 in stage 
> 23.0 (TID 414)
> java.lang.IllegalStateException: Value at index is null
>   at org.apache.arrow.vector.Float8Vector.get(Float8Vector.java:98)
>   at 
> org.apache.spark.sql.vectorized.ArrowColumnVector$DoubleAccessor.getDouble(ArrowColumnVector.java:344)
>   at 
> org.apache.spark.sql.vectorized.ArrowColumnVector.getDouble(ArrowColumnVector.java:99)
>   at 
> org.apache.spark.sql.execution.vectorized.MutableColumnarRow.getDouble(MutableColumnarRow.java:126)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at 
> org.apache

[jira] [Resolved] (SPARK-24760) Pandas UDF does not handle NaN correctly

2018-07-09 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-24760.
--
Resolution: Not A Problem

> Pandas UDF does not handle NaN correctly
> 
>
> Key: SPARK-24760
> URL: https://issues.apache.org/jira/browse/SPARK-24760
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0, 2.3.1
> Environment: Spark 2.3.1
> Pandas 0.23.1
>Reporter: Mortada Mehyar
>Priority: Minor
>
> I noticed that having `NaN` values when using the new Pandas UDF feature 
> triggers a JVM exception. Not sure if this is an issue with PySpark or 
> PyArrow. Here is a somewhat contrived example to showcase the problem.
> {code}
> In [1]: import pandas as pd
>...: from pyspark.sql.functions import lit, pandas_udf, PandasUDFType
> In [2]: d = [{'key': 'a', 'value': 1},
>  {'key': 'a', 'value': 2},
>  {'key': 'b', 'value': 3},
>  {'key': 'b', 'value': -2}]
> df = spark.createDataFrame(d, "key: string, value: int")
> df.show()
> +---+-----+
> |key|value|
> +---+-----+
> |  a|    1|
> |  a|    2|
> |  b|    3|
> |  b|   -2|
> +---+-----+
> In [3]: df_tmp = df.withColumn('new', lit(1.0))  # add a DoubleType column
> df_tmp.printSchema()
> root
>  |-- key: string (nullable = true)
>  |-- value: integer (nullable = true)
>  |-- new: double (nullable = false)
> {code}
> And the Pandas UDF is simply creating a new column where negative values 
> would be set to a particular float, in this case INF and it works fine
> {code}
> In [4]: @pandas_udf(df_tmp.schema, PandasUDFType.GROUPED_MAP)
>...: def func(pdf):
>...: pdf['new'] = pdf['value'].where(pdf['value'] > 0, float('inf'))
>...: return pdf
> In [5]: df.groupby('key').apply(func).show()
> +---+-----+----------+
> |key|value|       new|
> +---+-----+----------+
> |  b|    3|       3.0|
> |  b|   -2|  Infinity|
> |  a|    1|       1.0|
> |  a|    2|       2.0|
> +---+-----+----------+
> {code}
> However if we set this value to NaN then it triggers an exception:
> {code}
> In [6]: @pandas_udf(df_tmp.schema, PandasUDFType.GROUPED_MAP)
>...: def func(pdf):
>...: pdf['new'] = pdf['value'].where(pdf['value'] > 0, float('nan'))
>...: return pdf
>...:
>...: df.groupby('key').apply(func).show()
> [Stage 23:==> (73 + 2) / 
> 75]2018-07-07 16:26:27 ERROR Executor:91 - Exception in task 36.0 in stage 
> 23.0 (TID 414)
> java.lang.IllegalStateException: Value at index is null
>   at org.apache.arrow.vector.Float8Vector.get(Float8Vector.java:98)
>   at 
> org.apache.spark.sql.vectorized.ArrowColumnVector$DoubleAccessor.getDouble(ArrowColumnVector.java:344)
>   at 
> org.apache.spark.sql.vectorized.ArrowColumnVector.getDouble(ArrowColumnVector.java:99)
>   at 
> org.apache.spark.sql.execution.vectorized.MutableColumnarRow.getDouble(MutableColumnarRow.java:126)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadP

[jira] [Commented] (SPARK-24760) Pandas UDF does not handle NaN correctly

2018-07-09 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537893#comment-16537893
 ] 

Bryan Cutler commented on SPARK-24760:
--

Pandas interprets NaN as missing data for numeric values (see 
[https://pandas.pydata.org/pandas-docs/stable/missing_data.html#inserting-missing-data]),
 which then becomes a NULL value in pyarrow and Spark.  I think you might be 
able to configure Pandas to use another sentinel value for this.  I'll close 
this since it's not a Spark issue.
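For example, a minimal sketch of that sentinel approach for the UDF in the 
report below (the sentinel value is arbitrary, and {{df}}/{{df_tmp}} are the 
DataFrames from the reproduction):
{code:python}
from pyspark.sql.functions import pandas_udf, PandasUDFType

SENTINEL = -1.0  # arbitrary stand-in for "no value"; unlike NaN it is not
                 # treated as missing when the data goes through Arrow

@pandas_udf(df_tmp.schema, PandasUDFType.GROUPED_MAP)
def func(pdf):
    pdf['new'] = pdf['value'].where(pdf['value'] > 0, SENTINEL)
    return pdf

df.groupby('key').apply(func).show()
{code}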

> Pandas UDF does not handle NaN correctly
> 
>
> Key: SPARK-24760
> URL: https://issues.apache.org/jira/browse/SPARK-24760
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0, 2.3.1
> Environment: Spark 2.3.1
> Pandas 0.23.1
>Reporter: Mortada Mehyar
>Priority: Minor
>
> I noticed that having `NaN` values when using the new Pandas UDF feature 
> triggers a JVM exception. Not sure if this is an issue with PySpark or 
> PyArrow. Here is a somewhat contrived example to showcase the problem.
> {code}
> In [1]: import pandas as pd
>...: from pyspark.sql.functions import lit, pandas_udf, PandasUDFType
> In [2]: d = [{'key': 'a', 'value': 1},
>  {'key': 'a', 'value': 2},
>  {'key': 'b', 'value': 3},
>  {'key': 'b', 'value': -2}]
> df = spark.createDataFrame(d, "key: string, value: int")
> df.show()
> +---+-----+
> |key|value|
> +---+-----+
> |  a|    1|
> |  a|    2|
> |  b|    3|
> |  b|   -2|
> +---+-----+
> In [3]: df_tmp = df.withColumn('new', lit(1.0))  # add a DoubleType column
> df_tmp.printSchema()
> root
>  |-- key: string (nullable = true)
>  |-- value: integer (nullable = true)
>  |-- new: double (nullable = false)
> {code}
> And the Pandas UDF is simply creating a new column where negative values 
> would be set to a particular float, in this case INF and it works fine
> {code}
> In [4]: @pandas_udf(df_tmp.schema, PandasUDFType.GROUPED_MAP)
>...: def func(pdf):
>...: pdf['new'] = pdf['value'].where(pdf['value'] > 0, float('inf'))
>...: return pdf
> In [5]: df.groupby('key').apply(func).show()
> +---+-----+----------+
> |key|value|       new|
> +---+-----+----------+
> |  b|    3|       3.0|
> |  b|   -2|  Infinity|
> |  a|    1|       1.0|
> |  a|    2|       2.0|
> +---+-----+----------+
> {code}
> However if we set this value to NaN then it triggers an exception:
> {code}
> In [6]: @pandas_udf(df_tmp.schema, PandasUDFType.GROUPED_MAP)
>...: def func(pdf):
>...: pdf['new'] = pdf['value'].where(pdf['value'] > 0, float('nan'))
>...: return pdf
>...:
>...: df.groupby('key').apply(func).show()
> [Stage 23:==> (73 + 2) / 
> 75]2018-07-07 16:26:27 ERROR Executor:91 - Exception in task 36.0 in stage 
> 23.0 (TID 414)
> java.lang.IllegalStateException: Value at index is null
>   at org.apache.arrow.vector.Float8Vector.get(Float8Vector.java:98)
>   at 
> org.apache.spark.sql.vectorized.ArrowColumnVector$DoubleAccessor.getDouble(ArrowColumnVector.java:344)
>   at 
> org.apache.spark.sql.vectorized.ArrowColumnVector.getDouble(ArrowColumnVector.java:99)
>   at 
> org.apache.spark.sql.execution.vectorized.MutableColumnarRow.getDouble(MutableColumnarRow.java:126)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.compute

[jira] [Created] (SPARK-24735) Improve exception when mixing pandas_udf types

2018-07-03 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-24735:


 Summary: Improve exception when mixing pandas_udf types
 Key: SPARK-24735
 URL: https://issues.apache.org/jira/browse/SPARK-24735
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 2.3.0
Reporter: Bryan Cutler


From the discussion here 
https://github.com/apache/spark/pull/21650#discussion_r199203674, mixing up 
Pandas UDF types, like using GROUPED_MAP as a SCALAR {{foo = pandas_udf(lambda 
x: x, 'v int', PandasUDFType.GROUPED_MAP)}} produces an exception which is 
hard to understand.  It should tell the user that the UDF type is wrong.  This 
is the full output:

{code}
>>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
>>> df.select(foo(df['v'])).show()
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/dataframe.py", 
line 353, in show
print(self._jdf.showString(n, 20, vertical))
  File 
"/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
 line 1257, in __call__
  File 
"/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/utils.py", line 
63, in deco
return f(*a, **kw)
  File 
"/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
 line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o257.showString.
: java.lang.UnsupportedOperationException: Cannot evaluate expression: 
(input[0, bigint, false])
at 
org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:261)
at 
org.apache.spark.sql.catalyst.expressions.PythonUDF.doGenCode(PythonUDF.scala:50)
at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
at scala.Option.getOrElse(Option.scala:121)
...
{code}
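For reference, a short sketch of the intended pairing, assuming {{df}} is a 
DataFrame with a bigint column {{v}} like the one in the traceback above:
{code:python}
from pyspark.sql.functions import pandas_udf, PandasUDFType

# SCALAR UDFs are the ones meant for projections such as select().
scalar_identity = pandas_udf(lambda s: s, 'long', PandasUDFType.SCALAR)
df.select(scalar_identity(df['v'])).show()

# GROUPED_MAP UDFs must go through groupBy().apply(); calling one inside
# select(), as in the snippet above, is the misuse that should raise a
# clearer error.
grouped_identity = pandas_udf(lambda pdf: pdf, 'v long', PandasUDFType.GROUPED_MAP)
df.groupBy('v').apply(grouped_identity).show()
{code}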



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24439) Add distanceMeasure to BisectingKMeans in PySpark

2018-06-28 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-24439.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21557
[https://github.com/apache/spark/pull/21557]

> Add distanceMeasure to BisectingKMeans in PySpark
> -
>
> Key: SPARK-24439
> URL: https://issues.apache.org/jira/browse/SPARK-24439
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 2.4.0
>
>
> https://issues.apache.org/jira/browse/SPARK-23412 added distanceMeasure to 
> BisectingKMeans. We will do the same for PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24439) Add distanceMeasure to BisectingKMeans in PySpark

2018-06-28 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-24439:


Assignee: Huaxin Gao

> Add distanceMeasure to BisectingKMeans in PySpark
> -
>
> Key: SPARK-24439
> URL: https://issues.apache.org/jira/browse/SPARK-24439
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
>
> https://issues.apache.org/jira/browse/SPARK-23412 added distanceMeasure to 
> BisectingKMeans. We will do the same for PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23858) Need to apply pyarrow adjustments to complex types with DateType/TimestampType

2018-06-25 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16522916#comment-16522916
 ] 

Bryan Cutler commented on SPARK-23858:
--

[~semanticbeeng] sorry, there aren't any failing tests I can point you to.  Here 
is one place where this adjustment needs to happen, and I think there is maybe 
one more place: 
https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1724. 
Right now, I believe these cases are set to fail fast when they are encountered 
in the schema.
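To make it concrete, here is a rough sketch of the kind of element-wise 
adjustment needed for an ArrayType of timestamps.  This is not the actual 
pyspark code; the helper name and the timezone handling are simplified:
{code:python}
import pandas as pd

def localize_array_of_timestamps(series, timezone):
    # Hypothetical helper: reuse the scalar timestamp adjustment on every
    # element of a list-valued pandas Series coming back from Arrow.
    def convert_list(values):
        if values is None:
            return None
        return [ts.tz_localize('UTC').tz_convert(timezone).tz_localize(None)
                if ts is not None else None
                for ts in values]
    return series.apply(convert_list)

s = pd.Series([[pd.Timestamp('2018-06-25 12:00:00')], None])
print(localize_array_of_timestamps(s, 'America/Los_Angeles'))
{code}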

> Need to apply pyarrow adjustments to complex types with 
> DateType/TimestampType 
> ---
>
> Key: SPARK-23858
> URL: https://issues.apache.org/jira/browse/SPARK-23858
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Currently, ArrayTypes with DateType and TimestampType need to perform the 
> same adjustments as simple types, e.g.  
> {{_check_series_localize_timestamps}}, and that should work for nested types 
> as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24579) SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks

2018-06-25 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16522869#comment-16522869
 ] 

Bryan Cutler commented on SPARK-24579:
--

I left some comments on the shared doc, overall sounds pretty good and I'm 
looking forward to seeing this in Spark!

> SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
> 
>
> Key: SPARK-24579
> URL: https://issues.apache.org/jira/browse/SPARK-24579
> Project: Spark
>  Issue Type: Epic
>  Components: ML, PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen
> Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange 
> between Apache Spark and DL%2FAI Frameworks .pdf
>
>
> (see attached SPIP pdf for more details)
> At the crossroads of big data and AI, we see both the success of Apache Spark 
> as a unified
> analytics engine and the rise of AI frameworks like TensorFlow and Apache 
> MXNet (incubating).
> Both big data and AI are indispensable components to drive business 
> innovation and there have
> been multiple attempts from both communities to bring them together.
> We saw efforts from AI community to implement data solutions for AI 
> frameworks like tf.data and tf.Transform. However, with 50+ data sources and 
> built-in SQL, DataFrames, and Streaming features, Spark remains the community 
> choice for big data. This is why we saw many efforts to integrate DL/AI 
> frameworks with Spark to leverage its power, for example, TFRecords data 
> source for Spark, TensorFlowOnSpark, TensorFrames, etc. As part of Project 
> Hydrogen, this SPIP takes a different angle at Spark + AI unification.
> None of the integrations are possible without exchanging data between Spark 
> and external DL/AI frameworks. And the performance matters. However, there 
> doesn’t exist a standard way to exchange data and hence implementation and 
> performance optimization fall into pieces. For example, TensorFlowOnSpark 
> uses Hadoop InputFormat/OutputFormat for TensorFlow’s TFRecords to load and 
> save data and pass the RDD records to TensorFlow in Python. And TensorFrames 
> converts Spark DataFrames Rows to/from TensorFlow Tensors using TensorFlow’s 
> Java API. How can we reduce the complexity?
> The proposal here is to standardize the data exchange interface (or format) 
> between Spark and DL/AI frameworks and optimize data conversion from/to this 
> interface.  So DL/AI frameworks can leverage Spark to load data virtually 
> from anywhere without spending extra effort building complex data solutions, 
> like reading features from a production data warehouse or streaming model 
> inference. Spark users can use DL/AI frameworks without learning specific 
> data APIs implemented there. And developers from both sides can work on 
> performance optimizations independently given the interface itself doesn’t 
> introduce big overhead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-06-13 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-23874:
-
Description: 
Version 0.10.0 will allow for the following improvements and bug fixes:
 * Allow for adding BinaryType support
 * Bug fix related to array serialization ARROW-1973
 * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
 * Python bytearrays are supported as input to pyarrow ARROW-2141
 * Java has a common interface for reset to clean up complex vectors in Spark 
ArrowWriter ARROW-1962
 * Cleanup pyarrow type equality checks ARROW-2423
 * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
ARROW-2645
 * Improved low level handling of messages for RecordBatch ARROW-2704

 

 

  was:
Version 0.10.0 will allow for the following improvements and bug fixes:
 * Allow for adding BinaryType support
 * Bug fix related to array serialization ARROW-1973
 * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
 * Python bytearrays are supported as input to pyarrow ARROW-2141
 * Java has a common interface for reset to clean up complex vectors in Spark 
ArrowWriter ARROW-1962
 * Cleanup pyarrow type equality checks ARROW-2423
 * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
ARROW-2645

 

 


> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has a common interface for reset to clean up complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24554) Add MapType Support for Arrow in PySpark

2018-06-13 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16511691#comment-16511691
 ] 

Bryan Cutler commented on SPARK-24554:
--

There is still work to be done to add a Map logical type to Arrow C++/Python 
and Java.  Then the Arrow integration tests (linked) can be completed, and at 
that point we can add this to Spark.

> Add MapType Support for Arrow in PySpark
> 
>
> Key: SPARK-24554
> URL: https://issues.apache.org/jira/browse/SPARK-24554
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.1
>Reporter: Bryan Cutler
>Priority: Major
>
> Add support for MapType in Arrow related classes in Scala/Java and pyarrow 
> functionality in Python.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24554) Add MapType Support for Arrow in PySpark

2018-06-13 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-24554:


 Summary: Add MapType Support for Arrow in PySpark
 Key: SPARK-24554
 URL: https://issues.apache.org/jira/browse/SPARK-24554
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, SQL
Affects Versions: 2.3.1
Reporter: Bryan Cutler


Add support for MapType in Arrow related classes in Scala/Java and pyarrow 
functionality in Python.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24444) Improve pandas_udf GROUPED_MAP docs to explain column assignment

2018-05-31 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-24444:
-
Target Version/s: 2.3.1, 2.4.0  (was: 2.3.1)

> Improve pandas_udf GROUPED_MAP docs to explain column assignment
> 
>
> Key: SPARK-24444
> URL: https://issues.apache.org/jira/browse/SPARK-24444
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> There have been several bugs regarding this and a clean solution still 
> changes some behavior.  Until this can be resolved, improve docs to explain 
> that columns are assigned by position.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters

2018-05-31 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16497244#comment-16497244
 ] 

Bryan Cutler commented on SPARK-21187:
--

Hi [~teddy.choi], MapType still needs some work to be done in Arrow before we 
can add the Spark implementation. If you are able to help out on that front, 
that would be great!

> Complete support for remaining Spark data types in Arrow Converters
> ---
>
> Key: SPARK-21187
> URL: https://issues.apache.org/jira/browse/SPARK-21187
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> This is to track adding the remaining type support in Arrow Converters. 
> Currently, only primitive data types are supported.
> Remaining types:
>  * -*Date*-
>  * -*Timestamp*-
>  * *Complex*: Struct, -Array-, Arrays of Date/Timestamps, Map
>  * -*Decimal*-
>  * *Binary* - in pyspark
> Some things to do before closing this out:
>  * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
> values as BigDecimal)-
>  * -Need to add some user docs-
>  * -Make sure Python tests are thorough-
>  * Check into complex type support mentioned in comments by [~leif], should 
> we support multi-indexing?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24444) Improve pandas_udf GROUPED_MAP docs to explain column assignment

2018-05-31 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-24444:


 Summary: Improve pandas_udf GROUPED_MAP docs to explain column 
assignment
 Key: SPARK-24444
 URL: https://issues.apache.org/jira/browse/SPARK-24444
 Project: Spark
  Issue Type: Documentation
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Bryan Cutler
Assignee: Bryan Cutler


There have been several bugs regarding this and a clean solution still changes 
some behavior.  Until this can be resolved, improve docs to explain that 
columns are assigned by position.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-05-31 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-23874:
-
Description: 
Version 0.10.0 will allow for the following improvements and bug fixes:
 * Allow for adding BinaryType support
 * Bug fix related to array serialization ARROW-1973
 * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
 * Python bytearrays are supported as input to pyarrow ARROW-2141
 * Java has a common interface for reset to clean up complex vectors in Spark 
ArrowWriter ARROW-1962
 * Cleanup pyarrow type equality checks ARROW-2423
 * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
ARROW-2645

 

 

  was:
Version 0.10.0 will allow for the following improvements and bug fixes:
 * Allow for adding BinaryType support
 * Bug fix related to array serialization 
[ARROW-1973|https://issues.apache.org/jira/browse/ARROW-1973]
 * Python2 str will be made into an Arrow string instead of bytes 
[ARROW-2101|https://issues.apache.org/jira/browse/ARROW-2101]
 * Python bytearrays are supported as input to pyarrow 
[ARROW-2141|https://issues.apache.org/jira/browse/ARROW-2141]
 * Java has a common interface for reset to clean up complex vectors in Spark 
ArrowWriter [ARROW-1962|https://issues.apache.org/jira/browse/ARROW-1962]
 * Cleanup pyarrow type equality checks 
[ARROW-2423|https://issues.apache.org/jira/browse/ARROW-2423]

 

 


> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has a common interface for reset to clean up complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23161) Add missing APIs to Python GBTClassifier

2018-05-30 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-23161.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21413
[https://github.com/apache/spark/pull/21413]

> Add missing APIs to Python GBTClassifier
> 
>
> Key: SPARK-23161
> URL: https://issues.apache.org/jira/browse/SPARK-23161
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Huaxin Gao
>Priority: Minor
>  Labels: starter
> Fix For: 2.4.0
>
>
> GBTClassifier is missing \{{featureSubsetStrategy}}.  This should be moved to 
> {{TreeEnsembleParams}}, as in Scala, and it will then be part of GBTs.
> GBTClassificationModel is missing {{numClasses}}. It should inherit from 
> {{JavaClassificationModel}} instead of prediction model which will give it 
> this param.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23161) Add missing APIs to Python GBTClassifier

2018-05-30 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-23161:


Assignee: Huaxin Gao

> Add missing APIs to Python GBTClassifier
> 
>
> Key: SPARK-23161
> URL: https://issues.apache.org/jira/browse/SPARK-23161
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Huaxin Gao
>Priority: Minor
>  Labels: starter
> Fix For: 2.4.0
>
>
> GBTClassifier is missing \{{featureSubsetStrategy}}.  This should be moved to 
> {{TreeEnsembleParams}}, as in Scala, and it will then be part of GBTs.
> GBTClassificationModel is missing {{numClasses}}. It should inherit from 
> {{JavaClassificationModel}} instead of prediction model which will give it 
> this param.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24392) Mark pandas_udf as Experimental

2018-05-29 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-24392:
-
Fix Version/s: 2.4.0

> Mark pandas_udf as Experimental
> ---
>
> Key: SPARK-24392
> URL: https://issues.apache.org/jira/browse/SPARK-24392
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Blocker
> Fix For: 2.3.1, 2.4.0
>
>
> This functionality is still evolving and has introduced some bugs.  It was 
> an oversight to not mark it as experimental before it was released in 2.3.0.  
> Not sure if it is a good idea to change this after the fact, but I'm opening 
> this to continue discussion from 
> https://github.com/apache/spark/pull/21427#issuecomment-391967423



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24392) Mark pandas_udf as Experimental

2018-05-25 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491305#comment-16491305
 ] 

Bryan Cutler edited comment on SPARK-24392 at 5/25/18 9:53 PM:
---

Targeting 2.3.1, will try to resolve today to not hold up the release.


was (Author: bryanc):
Targeting 2.3.1

> Mark pandas_udf as Experimental
> ---
>
> Key: SPARK-24392
> URL: https://issues.apache.org/jira/browse/SPARK-24392
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Blocker
> Fix For: 2.3.1
>
>
> This functionality is still evolving and has introduced some bugs.  It was 
> an oversight to not mark it as experimental before it was released in 2.3.0.  
> Not sure if it is a good idea to change this after the fact, but I'm opening 
> this to continue discussion from 
> https://github.com/apache/spark/pull/21427#issuecomment-391967423



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24392) Mark pandas_udf as Experimental

2018-05-25 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491305#comment-16491305
 ] 

Bryan Cutler commented on SPARK-24392:
--

Targeting 2.3.1

> Mark pandas_udf as Experimental
> ---
>
> Key: SPARK-24392
> URL: https://issues.apache.org/jira/browse/SPARK-24392
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Blocker
> Fix For: 2.3.1
>
>
> This functionality is still evolving and has introduced some bugs.  It was 
> an oversight to not mark it as experimental before it was released in 2.3.0.  
> Not sure if it is a good idea to change this after the fact, but I'm opening 
> this to continue discussion from 
> https://github.com/apache/spark/pull/21427#issuecomment-391967423



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24392) Mark pandas_udf as Experimental

2018-05-25 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-24392:
-
Fix Version/s: 2.3.1

> Mark pandas_udf as Experimental
> ---
>
> Key: SPARK-24392
> URL: https://issues.apache.org/jira/browse/SPARK-24392
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Blocker
> Fix For: 2.3.1
>
>
> This functionality is still evolving and has introduced some bugs.  It was 
> an oversight to not mark it as experimental before it was released in 2.3.0.  
> Not sure if it is a good idea to change this after the fact, but I'm opening 
> this to continue discussion from 
> https://github.com/apache/spark/pull/21427#issuecomment-391967423



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24392) Mark pandas_udf as Experimental

2018-05-25 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-24392:
-
Priority: Blocker  (was: Critical)

> Mark pandas_udf as Experimental
> ---
>
> Key: SPARK-24392
> URL: https://issues.apache.org/jira/browse/SPARK-24392
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Blocker
>
> This functionality is still evolving and has introduced some bugs.  It was 
> an oversight to not mark it as experimental before it was released in 2.3.0.  
> Not sure if it is a good idea to change this after the fact, but I'm opening 
> this to continue discussion from 
> https://github.com/apache/spark/pull/21427#issuecomment-391967423



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24392) Mark pandas_udf as Experimental

2018-05-25 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-24392:


 Summary: Mark pandas_udf as Experimental
 Key: SPARK-24392
 URL: https://issues.apache.org/jira/browse/SPARK-24392
 Project: Spark
  Issue Type: Task
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Bryan Cutler


This functionality is still evolving and has introduced some bugs.  It was an 
oversight to not mark it as experimental before it was released in 2.3.0.  Not 
sure if it is a good idea to change this after the fact, but I'm opening this 
to continue discussion from 
https://github.com/apache/spark/pull/21427#issuecomment-391967423



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24324) Pandas Grouped Map UserDefinedFunction mixes column labels

2018-05-24 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-24324:
-
Summary: Pandas Grouped Map UserDefinedFunction mixes column labels  (was: 
UserDefinedFunction mixes column labels)

> Pandas Grouped Map UserDefinedFunction mixes column labels
> --
>
> Key: SPARK-24324
> URL: https://issues.apache.org/jira/browse/SPARK-24324
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Python (using virtualenv):
> {noformat}
> $ python --version 
> Python 3.6.5
> {noformat}
> Modules installed:
> {noformat}
> arrow==0.12.1
> backcall==0.1.0
> bleach==2.1.3
> chardet==3.0.4
> decorator==4.3.0
> entrypoints==0.2.3
> findspark==1.2.0
> html5lib==1.0.1
> ipdb==0.11
> ipykernel==4.8.2
> ipython==6.3.1
> ipython-genutils==0.2.0
> ipywidgets==7.2.1
> jedi==0.12.0
> Jinja2==2.10
> jsonschema==2.6.0
> jupyter==1.0.0
> jupyter-client==5.2.3
> jupyter-console==5.2.0
> jupyter-core==4.4.0
> MarkupSafe==1.0
> mistune==0.8.3
> nbconvert==5.3.1
> nbformat==4.4.0
> notebook==5.5.0
> numpy==1.14.3
> pandas==0.22.0
> pandocfilters==1.4.2
> parso==0.2.0
> pbr==3.1.1
> pexpect==4.5.0
> pickleshare==0.7.4
> progressbar2==3.37.1
> prompt-toolkit==1.0.15
> ptyprocess==0.5.2
> pyarrow==0.9.0
> Pygments==2.2.0
> python-dateutil==2.7.2
> python-utils==2.3.0
> pytz==2018.4
> pyzmq==17.0.0
> qtconsole==4.3.1
> Send2Trash==1.5.0
> simplegeneric==0.8.1
> six==1.11.0
> SQLAlchemy==1.2.7
> stevedore==1.28.0
> terminado==0.8.1
> testpath==0.3.1
> tornado==5.0.2
> traitlets==4.3.2
> virtualenv==15.1.0
> virtualenv-clone==0.2.6
> virtualenvwrapper==4.7.2
> wcwidth==0.1.7
> webencodings==0.5.1
> widgetsnbextension==3.2.1
> {noformat}
>  
> Java:
> {noformat}
> $ java -version 
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> System:
> {noformat}
> $ lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 16.04.4 LTS
> Release:  16.04
> Codename: xenial
> {noformat}
>Reporter: Cristian Consonni
>Priority: Major
>
> I am working on Wikipedia page views (see [task T188041 on Wikimedia's 
> Phabricator|https://phabricator.wikimedia.org/T188041]). For simplicity, let's 
> say that these are the data:
> {noformat}
> <lang> <page> <timestamp> <views>
> {noformat}
> For each combination of (lang, page, day(timestamp)) I need to transform  the 
>  views for each hour:
> {noformat}
> 00:00 -> A
> 01:00 -> B
> ...
> {noformat}
> and concatenate the number of views for that hour.  So, if a page got 5 views 
> at 00:00 and 7 views at 01:00 it would become:
> {noformat}
> A5B7
> {noformat}
>  
> I have written a UDF called {code:python}concat_hours{code}
> However, the function is mixing the columns and I am not sure what is going 
> on. I wrote here a minimal complete example that reproduces the issue on my 
> system (the details of my environment are above).
> {code:python}
> #!/usr/bin/env python3
> # coding: utf-8
> input_data = b"""en Albert_Camus 20071210-00 150
> en Albert_Camus 20071210-01 148
> en Albert_Camus 20071210-02 197
> en Albert_Camus 20071211-20 145
> en Albert_Camus 20071211-21 131
> en Albert_Camus 20071211-22 154
> en Albert_Camus 20071211-230001 142
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-040001 1
> en Albert_Caquot 20071211-06 1
> en Albert_Caquot 20071211-08 1
> en Albert_Caquot 20071211-15 3
> en Albert_Caquot 20071211-21 1"""
> import tempfile
> fp = tempfile.NamedTemporaryFile()
> fp.write(input_data)
> fp.seek(0)
> import findspark
> findspark.init()
> import pyspark
> from pyspark.sql.types import StructType, StructField
> from pyspark.sql.types import StringType, IntegerType, TimestampType
> from pyspark.sql import functions
> sc = pyspark.SparkContext(appName="udf_example")
> sqlctx = pyspark.SQLContext(sc)
> schema = StructType([StructField("lang", StringType(), False),
>  StructField("page", StringType(), False),
>  StructField("timestamp", TimestampType(), False),
>  StructField("views", IntegerType(), False)])
> df = sqlctx.read.csv(fp.name,
>  header=False,
>  schema=schema,
>  timestampFormat="MMdd-HHmmss",
>  sep=' ')
> df.count()
> df.dtypes
> df.show()
> new_schema = StructType([StructField("lang", StringType(), False),
>  StructField("page", StringType(), False),
>  StructField("day", StringType(), False),
>  StructField("enc", Stri

[jira] [Commented] (SPARK-24324) UserDefinedFunction mixes column labels

2018-05-24 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489533#comment-16489533
 ] 

Bryan Cutler commented on SPARK-24324:
--

I was able to reproduce this. The problem is that when PySpark processes the result 
from the UDF, it iterates over the columns by index, not by name.  Since you are 
returning a new pandas.DataFrame built from a dict, there is no guarantee of the 
column order when pandas makes the DataFrame (this is just the nature of the 
Python dict for Python < 3.6, I believe).  I can submit a fix that should not 
change any behaviors, I think.  As a workaround, you could write your UDF to 
make the pandas.DataFrame using an OrderedDict like so:

 
{code:python}
import pandas as pd
from collections import OrderedDict
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf(new_schema, PandasUDFType.GROUPED_MAP)
def myudf(x):
    foo = 'foo'
    return pd.DataFrame(OrderedDict([('lang', x.lang), ('page', x.page), ('foo', foo)]))
{code}
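
To see why the order flips in the first place, here is a small standalone 
illustration (assuming Python < 3.6, where a plain dict does not preserve 
insertion order):

{code:python}
import pandas as pd

# On Python < 3.6 the dict below carries no ordering guarantee, so pandas may
# lay the columns out in a different order than they were written.
pdf = pd.DataFrame({'lang': ['en'], 'page': ['Albert_Camus'], 'foo': ['foo']})
print(list(pdf.columns))  # not guaranteed to be ['lang', 'page', 'foo']
{code}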

> UserDefinedFunction mixes column labels
> ---
>
> Key: SPARK-24324
> URL: https://issues.apache.org/jira/browse/SPARK-24324
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Python (using virtualenv):
> {noformat}
> $ python --version 
> Python 3.6.5
> {noformat}
> Modules installed:
> {noformat}
> arrow==0.12.1
> backcall==0.1.0
> bleach==2.1.3
> chardet==3.0.4
> decorator==4.3.0
> entrypoints==0.2.3
> findspark==1.2.0
> html5lib==1.0.1
> ipdb==0.11
> ipykernel==4.8.2
> ipython==6.3.1
> ipython-genutils==0.2.0
> ipywidgets==7.2.1
> jedi==0.12.0
> Jinja2==2.10
> jsonschema==2.6.0
> jupyter==1.0.0
> jupyter-client==5.2.3
> jupyter-console==5.2.0
> jupyter-core==4.4.0
> MarkupSafe==1.0
> mistune==0.8.3
> nbconvert==5.3.1
> nbformat==4.4.0
> notebook==5.5.0
> numpy==1.14.3
> pandas==0.22.0
> pandocfilters==1.4.2
> parso==0.2.0
> pbr==3.1.1
> pexpect==4.5.0
> pickleshare==0.7.4
> progressbar2==3.37.1
> prompt-toolkit==1.0.15
> ptyprocess==0.5.2
> pyarrow==0.9.0
> Pygments==2.2.0
> python-dateutil==2.7.2
> python-utils==2.3.0
> pytz==2018.4
> pyzmq==17.0.0
> qtconsole==4.3.1
> Send2Trash==1.5.0
> simplegeneric==0.8.1
> six==1.11.0
> SQLAlchemy==1.2.7
> stevedore==1.28.0
> terminado==0.8.1
> testpath==0.3.1
> tornado==5.0.2
> traitlets==4.3.2
> virtualenv==15.1.0
> virtualenv-clone==0.2.6
> virtualenvwrapper==4.7.2
> wcwidth==0.1.7
> webencodings==0.5.1
> widgetsnbextension==3.2.1
> {noformat}
>  
> Java:
> {noformat}
> $ java -version 
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> System:
> {noformat}
> $ lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 16.04.4 LTS
> Release:  16.04
> Codename: xenial
> {noformat}
>Reporter: Cristian Consonni
>Priority: Major
>
> I am working on Wikipedia page views (see [task T188041 on Wikimedia's 
> Phabricator|https://phabricator.wikimedia.org/T188041]). For simplicity, let's 
> say that these are the data:
> {noformat}
>
> {noformat}
> For each combination of (lang, page, day(timestamp)) I need to transform  the 
>  views for each hour:
> {noformat}
> 00:00 -> A
> 01:00 -> B
> ...
> {noformat}
> and concatenate the number of views for that hour.  So, if a page got 5 views 
> at 00:00 and 7 views at 01:00 it would become:
> {noformat}
> A5B7
> {noformat}
>  
> I have written a UDF called {code:python}concat_hours{code}
> However, the function is mixing the columns and I am not sure what is going 
> on. I wrote here a minimal complete example that reproduces the issue on my 
> system (the details of my environment are above).
> {code:python}
> #!/usr/bin/env python3
> # coding: utf-8
> input_data = b"""en Albert_Camus 20071210-00 150
> en Albert_Camus 20071210-01 148
> en Albert_Camus 20071210-02 197
> en Albert_Camus 20071211-20 145
> en Albert_Camus 20071211-21 131
> en Albert_Camus 20071211-22 154
> en Albert_Camus 20071211-230001 142
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-040001 1
> en Albert_Caquot 20071211-06 1
> en Albert_Caquot 20071211-08 1
> en Albert_Caquot 20071211-15 3
> en Albert_Caquot 20071211-21 1"""
> import tempfile
> fp = tempfile.NamedTemporaryFile()
> fp.write(input_data)
> fp.seek(0)
> import findspark
> findspark.init()
> import pyspark
> from pyspark.sql.types import StructType, StructField
> from pyspark.sql.types import StringType, IntegerType, TimestampType
> from pyspark.sql import functions
> sc = pyspark.SparkContext(appName="udf_example")
> sqlctx = pyspark.SQLContext(sc)
> schema = StructType([StructField("lang", StringType(), False),
>  S

[jira] [Created] (SPARK-24319) run-example can not print usage

2018-05-18 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-24319:


 Summary: run-example can not print usage
 Key: SPARK-24319
 URL: https://issues.apache.org/jira/browse/SPARK-24319
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.4.0
Reporter: Bryan Cutler


Running "bin/run-example" with no args or with "–help" will not print usage and 
just gives the error
{noformat}
$ bin/run-example
Exception in thread "main" java.lang.IllegalArgumentException: Missing 
application resource.
    at 
org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:241)
    at 
org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitArgs(SparkSubmitCommandBuilder.java:181)
    at 
org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitCommand(SparkSubmitCommandBuilder.java:296)
    at 
org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:162)
    at org.apache.spark.launcher.Main.main(Main.java:86){noformat}

it looks like there is an env var in the script that shows usage, but it's 
getting preempted by something else



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24303) Update cloudpickle to v0.4.4

2018-05-18 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-24303.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

> Update cloudpickle to v0.4.4
> 
>
> Key: SPARK-24303
> URL: https://issues.apache.org/jira/browse/SPARK-24303
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.4.0
>
>
cloudpickle 0.4.4 is released - 
https://github.com/cloudpipe/cloudpickle/releases/tag/v0.4.4
The main difference is that we are now able to pickle the root logger.
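
As a quick, hedged illustration of what that enables (a standalone sketch, not 
taken from the release notes): a function that closes over the root logger can 
now be serialized with the cloudpickle copy bundled in PySpark.

{code:python}
import logging
import pickle
from pyspark import cloudpickle  # the copy bundled with PySpark

logger = logging.getLogger()  # the root logger

def log_and_double(x):
    logger.info("doubling %s", x)
    return 2 * x

# cloudpickle < 0.4.4 could not serialize this closure because of the root
# logger reference; 0.4.4 handles it.
payload = cloudpickle.dumps(log_and_double)
restored = pickle.loads(payload)
{code}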



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24303) Update cloudpickle to v0.4.4

2018-05-18 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16480912#comment-16480912
 ] 

Bryan Cutler commented on SPARK-24303:
--

Issue resolved by pull request 21350

https://github.com/apache/spark/pull/21350

> Update cloudpickle to v0.4.4
> 
>
> Key: SPARK-24303
> URL: https://issues.apache.org/jira/browse/SPARK-24303
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.4.0
>
>
> cloudpickle 0.4.4 is released - 
> https://github.com/cloudpipe/cloudpickle/releases/tag/v0.4.4
> The main difference is that we are now able to pickle the root logger.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24303) Update cloudpickle to v0.4.4

2018-05-18 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-24303:


Assignee: Hyukjin Kwon

> Update cloudpickle to v0.4.4
> 
>
> Key: SPARK-24303
> URL: https://issues.apache.org/jira/browse/SPARK-24303
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> cloudpickle 0.4.4 is released - 
> https://github.com/cloudpipe/cloudpickle/releases/tag/v0.4.4
> The main difference is that we are now able to pickle the root logger.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters

2018-05-14 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16474664#comment-16474664
 ] 

Bryan Cutler commented on SPARK-21187:
--

Hi [~ewohlstadter], thanks for the interest!  The Map type needs some work in 
Arrow before it is fully supported, and then it can be implemented for Spark.  
We are making this a requirement for Arrow 1.0, if not before then.  As for the 
interval type, I don't believe it is an external type for Spark SQL so it 
wasn't planned.

> Complete support for remaining Spark data types in Arrow Converters
> ---
>
> Key: SPARK-21187
> URL: https://issues.apache.org/jira/browse/SPARK-21187
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> This is to track adding the remaining type support in Arrow Converters. 
> Currently, only primitive data types are supported.
> Remaining types:
>  * -*Date*-
>  * -*Timestamp*-
>  * *Complex*: Struct, -Array-, Arrays of Date/Timestamps, Map
>  * -*Decimal*-
>  * *Binary* - in pyspark
> Some things to do before closing this out:
>  * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
> values as BigDecimal)-
>  * -Need to add some user docs-
>  * -Make sure Python tests are thorough-
>  * Check into complex type support mentioned in comments by [~leif], should 
> we support multi-indexing?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-05-14 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-23874:
-
Description: 
Version 0.10.0 will allow for the following improvements and bug fixes:
 * Allow for adding BinaryType support
 * Bug fix related to array serialization 
[ARROW-1973|https://issues.apache.org/jira/browse/ARROW-1973]
 * Python2 str will be made into an Arrow string instead of bytes 
[ARROW-2101|https://issues.apache.org/jira/browse/ARROW-2101]
 * Python bytearrays are supported as input to pyarrow (see the sketch after this list) 
[ARROW-2141|https://issues.apache.org/jira/browse/ARROW-2141]
 * Java has common interface for reset to cleanup complex vectors in Spark 
ArrowWriter [ARROW-1962|https://issues.apache.org/jira/browse/ARROW-1962]
 * Cleanup pyarrow type equality checks 
[ARROW-2423|https://issues.apache.org/jira/browse/ARROW-2423]
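
A minimal pyarrow-only sketch (not part of the issue; assumes pyarrow 0.10.0 is 
installed) of the two Python input-related points above:

{code:python}
import pyarrow as pa

# On Python 2, a plain str now becomes an Arrow string column instead of
# binary data (ARROW-2101)
pa.array(['a', 'b'])           # -> type: string

# bytearray values are accepted as input and stored as binary data (ARROW-2141)
pa.array([bytearray(b'ab')])   # -> type: binary
{code}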

 

 

  was:
Version 0.9.0 of apache arrow comes with a bug fix related to array 
serialization. 
https://issues.apache.org/jira/browse/ARROW-1973


> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support
>  * Bug fix related to array serialization 
> [ARROW-1973|https://issues.apache.org/jira/browse/ARROW-1973]
>  * Python2 str will be made into an Arrow string instead of bytes 
> [ARROW-2101|https://issues.apache.org/jira/browse/ARROW-2101]
>  * Python bytearrays are supported as input to pyarrow 
> [ARROW-2141|https://issues.apache.org/jira/browse/ARROW-2141]
>  * Java has common interface for reset to cleanup complex vectors in Spark 
> ArrowWriter [ARROW-1962|https://issues.apache.org/jira/browse/ARROW-1962]
>  * Cleanup pyarrow type equality checks 
> [ARROW-2423|https://issues.apache.org/jira/browse/ARROW-2423]
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22232) Row objects in pyspark created using the `Row(**kwars)` syntax do not get serialized/deserialized properly

2018-05-11 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472714#comment-16472714
 ] 

Bryan Cutler commented on SPARK-22232:
--

I'm closing the PR for now, will reopen for Spark 3.0.0.  The fix makes the 
behavior consistent but might cause a breaking change for some users.  It's not 
necessary to put the fix in and safeguard with a config because there are 
workarounds.  For example, this is a workaround for the code in the description:

{code}
from pyspark.sql.types import *
from pyspark.sql import *

UnsortedRow = Row("a", "c", "b")

def toRow(i):
return UnsortedRow("1", 3.0, 2)

schema = StructType([
  StructField("a", StringType(),  False),
  StructField("c", FloatType(), False),
  StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

print rdd.repartition(3).toDF(schema).take(2)
# [Row(a=u'1', c=3.0, b=2), Row(a=u'1', c=3.0, b=2)]
{code}
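
For reference, a quick illustration of the underlying behavior (run against 
PySpark 2.x): {{Row(**kwargs)}} sorts its fields alphabetically by name, which 
is why pairing it with a positionally-ordered schema goes wrong once the data 
is shuffled.

{code}
from pyspark.sql import Row

r = Row(a="a", c=3.0, b=2)
print r            # Row(a='a', b=2, c=3.0) -- fields re-sorted alphabetically
print r.c, r[1]    # 3.0 2 -- access by name is safe, access by position is not
{code}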

> Row objects in pyspark created using the `Row(**kwars)` syntax do not get 
> serialized/deserialized properly
> --
>
> Key: SPARK-22232
> URL: https://issues.apache.org/jira/browse/SPARK-22232
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>Priority: Major
>
> The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should 
> be accessed by field name, not by position because {{Row.__new__}} sorts the 
> fields alphabetically by name. It seems like this promise is not being 
> honored when these Row objects are shuffled. I've included an example to help 
> reproduce the issue.
> {code:none}
> from pyspark.sql.types import *
> from pyspark.sql import *
> def toRow(i):
>   return Row(a="a", c=3.0, b=2)
> schema = StructType([
>   # Putting fields in alphabetical order masks the issue
>   StructField("a", StringType(),  False),
>   StructField("c", FloatType(), False),
>   StructField("b", IntegerType(), False),
> ])
> rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))
> # As long as we don't shuffle things work fine.
> print rdd.toDF(schema).take(2)
> # If we introduce a shuffle we have issues
> print rdd.repartition(3).toDF(schema).take(2)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23161) Add missing APIs to Python GBTClassifier

2018-05-07 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-23161:
-
Description: 
GBTClassifier is missing {{featureSubsetStrategy}}.  This should be moved to 
{{TreeEnsembleParams}}, as in Scala, and it will then be part of GBTs.

GBTClassificationModel is missing {{numClasses}}. It should inherit from 
{{JavaClassificationModel}} instead of prediction model which will give it this 
param.

  was:
GBTClassifier is missing \{{featureSubsetStrategy}}.  This should be moved 
{{TreeEnsembleParams}}, as in Scala, and it will then be part of GBTs.

GBTClassificationModel is missing {{numClasses}}. It should inherit from 
{{JavaClassificationModel}} instead of prediction model which will give it this 
param.


> Add missing APIs to Python GBTClassifier
> 
>
> Key: SPARK-23161
> URL: https://issues.apache.org/jira/browse/SPARK-23161
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Minor
>  Labels: starter
>
> GBTClassifier is missing {{featureSubsetStrategy}}.  This should be moved to 
> {{TreeEnsembleParams}}, as in Scala, and it will then be part of GBTs.
> GBTClassificationModel is missing {{numClasses}}. It should inherit from 
> {{JavaClassificationModel}} instead of prediction model which will give it 
> this param.
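
A hypothetical sketch of the Python API this issue is asking for (neither the 
param nor the attribute exists in PySpark yet, which is the point of the issue):

{code:python}
from pyspark.ml.classification import GBTClassifier

# featureSubsetStrategy would come from TreeEnsembleParams, as in Scala
gbt = GBTClassifier(maxIter=5, featureSubsetStrategy="all")

# train_df is assumed to be a DataFrame with 'features' and 'label' columns
model = gbt.fit(train_df)

# numClasses would come from inheriting JavaClassificationModel
print(model.numClasses)
{code}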



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24044) Explicitly print out skipped tests from unittest module

2018-04-26 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-24044.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21107
[https://github.com/apache/spark/pull/21107]

> Explicitly print out skipped tests from unittest module
> ---
>
> Key: SPARK-24044
> URL: https://issues.apache.org/jira/browse/SPARK-24044
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0
>
>
> There was an actual issue, SPARK-23300, which we fixed by manually checking 
> whether the package is installed. That approach required duplicated code and 
> could only check dependencies, while there are many other conditions to 
> handle, for example Python-version-specific skips or other packages like 
> NumPy.  I think this is something we should fix.
> The `unittest` module can print out the skip messages, but so far these were 
> swallowed by our own testing script.
> This PR proposes to remove the duplicated dependency-checking logic and print 
> out all skipped tests from unittest, sorted, as below:
> {code}
> Skipped tests in pyspark.sql.tests with pypy:
> test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) 
> ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
> test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) 
> ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
> ...
> Skipped tests in pyspark.sql.tests with python3:
> test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) 
> ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
> test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) 
> ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
> ...
> {code}
> Actual format can be a bit varied per the discussion in the PR. Please check 
> out the PR for exact format.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24044) Explicitly print out skipped tests from unittest module

2018-04-26 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-24044:


Assignee: Hyukjin Kwon

> Explicitly print out skipped tests from unittest module
> ---
>
> Key: SPARK-24044
> URL: https://issues.apache.org/jira/browse/SPARK-24044
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0
>
>
> There was an actual issue, SPARK-23300, which we fixed by manually checking 
> whether the package is installed. That approach required duplicated code and 
> could only check dependencies, while there are many other conditions to 
> handle, for example Python-version-specific skips or other packages like 
> NumPy.  I think this is something we should fix.
> The `unittest` module can print out the skip messages, but so far these were 
> swallowed by our own testing script.
> This PR proposes to remove the duplicated dependency-checking logic and print 
> out all skipped tests from unittest, sorted, as below:
> {code}
> Skipped tests in pyspark.sql.tests with pypy:
> test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) 
> ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
> test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) 
> ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
> ...
> Skipped tests in pyspark.sql.tests with python3:
> test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) 
> ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
> test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) 
> ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
> ...
> {code}
> Actual format can be a bit varied per the discussion in the PR. Please check 
> out the PR for exact format.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24057) put the real data type in the AssertionError message

2018-04-26 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-24057.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21159
[https://github.com/apache/spark/pull/21159]

> put the real data type in the AssertionError message
> 
>
> Key: SPARK-24057
> URL: https://issues.apache.org/jira/browse/SPARK-24057
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 2.4.0
>
>
> I had a wrong data type in one of my tests and got the following message: 
> /spark/python/pyspark/sql/types.py", line 405, in __init__
> assert isinstance(dataType, DataType), "dataType should be DataType"
> AssertionError: dataType should be DataType
> I checked types.py,  line 405, in __init__, it has
> {{assert isinstance(dataType, DataType), "dataType should be DataType"}}
> I think it meant to be 
> {{assert isinstance(dataType, DataType), "%s should be %s"  % (dataType, 
> DataType)}}
> so the error message will be something like 
> {{AssertionError: <the actual dataType value> should be <class 'pyspark.sql.types.DataType'>}}
> There are a couple of other places that have the same problem. 
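
A standalone illustration of the difference (the class here is a stand-in, not 
the real pyspark.sql.types.DataType):

{code:python}
class DataType(object):
    pass

dataType = "string"  # wrong on purpose: a str instead of a DataType instance

# proposed form: the message names the offending value and the expected class
assert isinstance(dataType, DataType), "%s should be %s" % (dataType, DataType)
# AssertionError: string should be <class '__main__.DataType'>
{code}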



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24057) put the real data type in the AssertionError message

2018-04-26 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-24057:


Assignee: Huaxin Gao

> put the real data type in the AssertionError message
> 
>
> Key: SPARK-24057
> URL: https://issues.apache.org/jira/browse/SPARK-24057
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 2.4.0
>
>
> I had a wrong data type in one of my tests and got the following message: 
> /spark/python/pyspark/sql/types.py", line 405, in __init__
> assert isinstance(dataType, DataType), "dataType should be DataType"
> AssertionError: dataType should be DataType
> I checked types.py,  line 405, in __init__, it has
> {{assert isinstance(dataType, DataType), "dataType should be DataType"}}
> I think it meant to be 
> {{assert isinstance(dataType, DataType), "%s should be %s"  % (dataType, 
> DataType)}}
> so the error message will be something like 
> {{AssertionError: <the actual dataType value> should be <class 'pyspark.sql.types.DataType'>}}
> There are a couple of other places that have the same problem. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-04-25 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16452666#comment-16452666
 ] 

Bryan Cutler commented on SPARK-23874:
--

[~smilegator] the Arrow community decided to put their efforts towards a 0.10.0 
release instead of 0.9.1.  Yes, it's possible there could be a regression, but 
it's also possible that a bug was fixed that just hasn't surfaced yet.  I'll 
pick this up again after the release, do thorough testing, and we can decide if 
we will do the upgrade then.

> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> Version 0.9.0 of apache arrow comes with a bug fix related to array 
> serialization. 
> https://issues.apache.org/jira/browse/ARROW-1973



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17508) Setting weightCol to None in ML library causes an error

2018-04-18 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-17508.
--
Resolution: Won't Fix

Resolving this for now unless there is more interest in the fix
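
For anyone who lands here, a minimal sketch of the workaround (adapted from the 
description below): simply leave {{weightCol}} unset instead of passing None; 
unweighted fitting already treats every row as having weight 1.0.

{code:python}
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

# `spark` is assumed to be an existing SparkSession
df = spark.createDataFrame(
    [(1.0, 1.0, Vectors.dense(1.0)),
     (0.0, 1.0, Vectors.dense(-1.0))],
    ["label", "weight", "features"])

lr = LogisticRegression(maxIter=5, regParam=0.0)  # no weightCol at all
model = lr.fit(df)
{code}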

> Setting weightCol to None in ML library causes an error
> ---
>
> Key: SPARK-17508
> URL: https://issues.apache.org/jira/browse/SPARK-17508
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Evan Zamir
>Priority: Minor
>
> The following code runs without error:
> {code}
> spark = SparkSession.builder.appName('WeightBug').getOrCreate()
> df = spark.createDataFrame(
> [
> (1.0, 1.0, Vectors.dense(1.0)),
> (0.0, 1.0, Vectors.dense(-1.0))
> ],
> ["label", "weight", "features"])
> lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight")
> model = lr.fit(df)
> {code}
> My expectation from reading the documentation is that setting weightCol=None 
> should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors:
> Traceback (most recent call last):
>   File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in 
> 
> model = lr.fit(df)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line 
> 64, in fit
> return self._fit(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 213, in _fit
> java_model = self._fit_java(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 210, in _fit_java
> return self._java_obj.fit(dataset._jdf)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  line 933, in __call__
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
>  line 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit.
> : java.lang.NullPointerException
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:211)
>   at java.lang.Thread.run(Thread.java:745)
> Process finished with exit code 1



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23030) Decrease memory consumption with toPandas() collection using Arrow

2018-04-16 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439727#comment-16439727
 ] 

Bryan Cutler commented on SPARK-23030:
--

Hi [~icexelloss], I have something working, just need to write it up then we 
can discuss on the PR.  You're right though, if we want to keep collection as 
fast as possible, it must be fully asynchronous.  Then unfortunately there is 
no way to avoid the worst case of having all data in the JVM driver memory.  I 
did improve the average case and got a little speedup, so hopefully it will be 
worth it.  I'll put up a PR soon.
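
For background, a pyarrow-only sketch of the stream format the description 
refers to (independent of Spark; assumes a reasonably recent pyarrow): record 
batches are written and read back incrementally instead of materializing a 
whole file.

{code:python}
import pyarrow as pa

batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], ['x'])

# write two batches into an in-memory stream
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.write_batch(batch)
writer.close()

# read them back one record batch at a time
reader = pa.RecordBatchStreamReader(pa.BufferReader(sink.getvalue()))
first = reader.read_next_batch()   # only this batch has been loaded so far
second = reader.read_next_batch()
{code}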

> Decrease memory consumption with toPandas() collection using Arrow
> --
>
> Key: SPARK-23030
> URL: https://issues.apache.org/jira/browse/SPARK-23030
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Currently with Arrow enabled, calling {{toPandas()}} results in a collection 
> of all partitions in the JVM in the form of batches of Arrow file format.  
> Once collected in the JVM, they are served to the Python driver process. 
> I believe using the Arrow stream format can help to optimize this and reduce 
> memory consumption in the JVM by only loading one record batch at a time 
> before sending it to Python.  This might also reduce the latency between 
> making the initial call in Python and receiving the first batch of records.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23929) pandas_udf schema mapped by position and not by name

2018-04-09 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431014#comment-16431014
 ] 

Bryan Cutler commented on SPARK-23929:
--

cc [~icexelloss]

> pandas_udf schema mapped by position and not by name
> 
>
> Key: SPARK-23929
> URL: https://issues.apache.org/jira/browse/SPARK-23929
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: PySpark
> Spark 2.3.0
>  
>Reporter: Omri
>Priority: Major
>
> The return struct of a pandas_udf should be mapped to the provided schema by 
> name. Currently it's not the case.
> Consider these two examples, where the only change is the order of the fields 
> in the provided schema struct:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show() 
> {code}
> and this one:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show()
> {code}
> The results should be the same but they are different:
> For the first code:
> {code:java}
> +---+---+
> |  v| id|
> +---+---+
> |1.0|  0|
> |1.0|  0|
> |2.0|  0|
> |2.0|  0|
> |2.0|  1|
> +---+---+
> {code}
> For the second code:
> {code:java}
> +---+---+
> | id|  v|
> +---+---+
> |  1|-0.7071067811865475|
> |  1| 0.7071067811865475|
> |  2|-0.8320502943378437|
> |  2|-0.2773500981126146|
> |  2| 1.1094003924504583|
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23883) Error with conversion to arrow while using pandas_udf

2018-04-06 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429123#comment-16429123
 ] 

Bryan Cutler commented on SPARK-23883:
--

I think the problem might be that since the {{pandas_udf}} is for a 
groupby-apply, you need to specify the functionType as PandasUDFType.GROUPED_MAP

for example, @pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)

see 
[https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]
 for usage

Can you try that and see if it works?
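
For reference, a hedged sketch of the suggested declaration, reusing the names 
from the report below with the distance logic stubbed out:

{code:python}
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("CarId int, Distance float", PandasUDFType.GROUPED_MAP)
def totalDistance(oneCar):
    # real haversine computation omitted; return one row per CarId group
    return pd.DataFrame([[oneCar["CarId"].iloc[0], 0.0]],
                        columns=["CarId", "Distance"])

# distancePerCar = df.groupBy("CarId").apply(totalDistance)
{code}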

> Error with conversion to arrow while using pandas_udf
> -
>
> Key: SPARK-23883
> URL: https://issues.apache.org/jira/browse/SPARK-23883
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Spark 2.3.0
> Python 3.5
> Java 1.8.0_161-b12
>Reporter: Omri
>Priority: Major
>
> Hi,
> I have code that works on Databricks but doesn't work on a local Spark 
> installation.
> This is the code I'm running:
> {code:java}
> from pyspark.sql.functions import pandas_udf
> import pandas as pd
> import numpy as np
> from pyspark.sql.types import *
> schema = StructType([
>   StructField("Distance", FloatType()),
>   StructField("CarId", IntegerType())
> ])
> def haversine(lon1, lat1, lon2, lat2):
> #Calculate distance, return scalar
> return 3.5 # Removed logic to facilitate reading
> @pandas_udf(schema)
> def totalDistance(oneCar):
> dist = haversine(oneCar.Longtitude.shift(1),
>  oneCar.Latitude.shift(1),
>  oneCar.loc[1:, 'Longitude'], 
>  oneCar.loc[1:, 'Latitude'])
> return 
> pd.DataFrame({"CarId":oneCar['CarId'].iloc[0],"Distance":np.sum(dist)},index 
> = [0])
> ## Calculate the overall distance made by each car
> distancePerCar= df.groupBy('CarId').apply(totalDistance)
> {code}
> I'm getting this exception, about Arrow not able to deal with this input:
> {noformat}
> ---
> TypeError Traceback (most recent call last)
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in 
> returnType(self)
> 114 try:
> --> 115 to_arrow_type(self._returnType_placeholder)
> 116 except TypeError:
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\types.py in 
> to_arrow_type(dt)
>1641 else:
> -> 1642 raise TypeError("Unsupported type in conversion to Arrow: " + 
> str(dt))
>1643 return arrow_type
> TypeError: Unsupported type in conversion to Arrow: 
> StructType(List(StructField(CarId,IntegerType,true),StructField(Distance,FloatType,true)))
> During handling of the above exception, another exception occurred:
> NotImplementedError   Traceback (most recent call last)
>  in ()
>  18 km = 6367 * c
>  19 return km
> ---> 20 @pandas_udf("CarId: int, Distance: float")
>  21 def totalDistance(oneUser):
>  22 dist = haversine(oneUser.Longtitude.shift(1), 
> oneUser.Latitude.shift(1),
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in 
> _create_udf(f, returnType, evalType)
>  62 udf_obj = UserDefinedFunction(
>  63 f, returnType=returnType, name=None, evalType=evalType, 
> deterministic=True)
> ---> 64 return udf_obj._wrapped()
>  65 
>  66 
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in 
> _wrapped(self)
> 184 
> 185 wrapper.func = self.func
> --> 186 wrapper.returnType = self.returnType
> 187 wrapper.evalType = self.evalType
> 188 wrapper.deterministic = self.deterministic
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in 
> returnType(self)
> 117 raise NotImplementedError(
> 118 "Invalid returnType with scalar Pandas UDFs: %s 
> is "
> --> 119 "not supported" % 
> str(self._returnType_placeholder))
> 120 elif self.evalType == 
> PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF:
> 121 if isinstance(self._returnType_placeholder, StructType):
> NotImplementedError: Invalid returnType with scalar Pandas UDFs: 
> StructType(List(StructField(CarId,IntegerType,true),StructField(Distance,FloatType,true)))
>  is not supported{noformat}
> I've also tried changing the schema to
> {code:java}
> @pandas_udf("") {code}
> and
> {code:java}
> @pandas_udf("CarId:int,Distance:float"){code}
>  
> As mentioned, this is working on a DataBricks instance in Azure, but not 
> locally.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (SPARK-23871) add python api for VectorAssembler handleInvalid

2018-04-06 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-23871:
-
Component/s: PySpark

> add python api for VectorAssembler handleInvalid
> 
>
> Key: SPARK-23871
> URL: https://issues.apache.org/jira/browse/SPARK-23871
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23828) PySpark StringIndexerModel should have constructor from labels

2018-04-06 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-23828:


Assignee: Huaxin Gao

> PySpark StringIndexerModel should have constructor from labels
> --
>
> Key: SPARK-23828
> URL: https://issues.apache.org/jira/browse/SPARK-23828
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 2.4.0
>
>
> The Scala StringIndexerModel has an alternate constructor that will create 
> the model from an array of label strings.  This should be in the Python API 
> also, maybe as a classmethod
> {code:java}
> model = StringIndexerModel.from_labels(['a', 'b']){code}
> See SPARK-15009 which added a similar constructor to CountVectorizerModel



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23828) PySpark StringIndexerModel should have constructor from labels

2018-04-06 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-23828.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20968
[https://github.com/apache/spark/pull/20968]

> PySpark StringIndexerModel should have constructor from labels
> --
>
> Key: SPARK-23828
> URL: https://issues.apache.org/jira/browse/SPARK-23828
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Minor
> Fix For: 2.4.0
>
>
> The Scala StringIndexerModel has an alternate constructor that will create 
> the model from an array of label strings.  This should be in the Python API 
> also, maybe as a classmethod
> {code:java}
> model = StringIndexerModel.from_labels(['a', 'b']){code}
> See SPARK-15009 which added a similar constructor to CountVectorizerModel



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.9.0

2018-04-05 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16427712#comment-16427712
 ] 

Bryan Cutler commented on SPARK-23874:
--

I can work on this.  I wasn't able to recreate the linked issue within Spark, 
but maybe I just wasn't hitting the right place so it would be good to upgrade 
anyway.  I'll try to get the code changes done next week and then help 
coordinate the Jenkins upgrades.

> Upgrade apache/arrow to 0.9.0
> -
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Version 0.9.0 of apache arrow comes with a bug fix related to array 
> serialization. 
> https://issues.apache.org/jira/browse/ARROW-1973



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23836) Support returning StructType to the level support in GroupedMap Arrow's "scalar" UDFS (or similar)

2018-04-03 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16424567#comment-16424567
 ] 

Bryan Cutler commented on SPARK-23836:
--

So does this mean returning a pandas.DataFrame in a scalar pandas udf?

> Support returning StructType to the level support in GroupedMap Arrow's 
> "scalar" UDFS (or similar)
> --
>
> Key: SPARK-23836
> URL: https://issues.apache.org/jira/browse/SPARK-23836
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Major
>
> Currently not all of the supported types can be returned from the scalar 
> pandas UDF type. This means if someone wants to return a struct type doing a 
> map operation right now they either have to do a "junk" groupBy or use the 
> non-vectorized results.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23858) Need to apply pyarrow adjustments to complex types with DateType/TimestampType

2018-04-03 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-23858:
-
Summary: Need to apply pyarrow adjustments to complex types with 
DateType/TimestampType   (was: Need to apply adjustments to complex types with 
DateType/TimestampType )

> Need to apply pyarrow adjustments to complex types with 
> DateType/TimestampType 
> ---
>
> Key: SPARK-23858
> URL: https://issues.apache.org/jira/browse/SPARK-23858
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Currently, ArrayTypes with DateType and TimestampType need to perform the 
> same adjustments as simple types, e.g.  
> {{_check_series_localize_timestamps}}, and that should work for nested types 
> as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters

2018-04-03 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-21187:
-
Description: 
This is to track adding the remaining type support in Arrow Converters. 
Currently, only primitive data types are supported.

Remaining types:
 * -*Date*-
 * -*Timestamp*-
 * *Complex*: Struct, -Array-, Arrays of Date/Timestamps, Map
 * -*Decimal*-
 * *Binary* - in pyspark

Some things to do before closing this out:
 * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
values as BigDecimal)-
 * -Need to add some user docs-
 * -Make sure Python tests are thorough-
 * Check into complex type support mentioned in comments by [~leif], should we 
support multi-indexing?

  was:
This is to track adding the remaining type support in Arrow Converters. 
Currently, only primitive data types are supported. '

Remaining types:
 * -*Date*-
 * -*Timestamp*-
 * *Complex*: Struct, -Array-, Map
 * -*Decimal*-
 * *Binary* - in pyspark

Some things to do before closing this out:
 * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
values as BigDecimal)-
 * -Need to add some user docs-
 * -Make sure Python tests are thorough-
 * Check into complex type support mentioned in comments by [~leif], should we 
support mulit-indexing?


> Complete support for remaining Spark data types in Arrow Converters
> ---
>
> Key: SPARK-21187
> URL: https://issues.apache.org/jira/browse/SPARK-21187
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> This is to track adding the remaining type support in Arrow Converters. 
> Currently, only primitive data types are supported.
> Remaining types:
>  * -*Date*-
>  * -*Timestamp*-
>  * *Complex*: Struct, -Array-, Arrays of Date/Timestamps, Map
>  * -*Decimal*-
>  * *Binary* - in pyspark
> Some things to do before closing this out:
>  * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
> values as BigDecimal)-
>  * -Need to add some user docs-
>  * -Make sure Python tests are thorough-
>  * Check into complex type support mentioned in comments by [~leif], should 
> we support multi-indexing?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23858) Need to apply adjustments to complex types with DateType/TimestampType

2018-04-03 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-23858:


 Summary: Need to apply adjustments to complex types with 
DateType/TimestampType 
 Key: SPARK-23858
 URL: https://issues.apache.org/jira/browse/SPARK-23858
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, SQL
Affects Versions: 2.3.0
Reporter: Bryan Cutler


Currently, ArrayTypes with DateType and TimestampType need to perform the same 
adjustments as simple types, e.g.  {{_check_series_localize_timestamps}}, and 
that should work for nested types as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23828) PySpark StringIndexerModel should have constructor from labels

2018-04-02 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16422729#comment-16422729
 ] 

Bryan Cutler commented on SPARK-23828:
--

No I'm not working on it, please go ahead [~huaxingao], thanks!

> PySpark StringIndexerModel should have constructor from labels
> --
>
> Key: SPARK-23828
> URL: https://issues.apache.org/jira/browse/SPARK-23828
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Minor
>
> The Scala StringIndexerModel has an alternate constructor that will create 
> the model from an array of label strings.  This should be in the Python API 
> also, maybe as a classmethod
> {code:java}
> model = StringIndexerModel.from_labels(['a', 'b']){code}
> See SPARK-15009 which added a similar constructor to CountVectorizerModel



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23828) PySpark StringIndexerModel should have constructor from labels

2018-03-29 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-23828:


 Summary: PySpark StringIndexerModel should have constructor from 
labels
 Key: SPARK-23828
 URL: https://issues.apache.org/jira/browse/SPARK-23828
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.3.0
Reporter: Bryan Cutler


The Scala StringIndexerModel has an alternate constructor that will create the 
model from an array of label strings.  This should be in the Python API also, 
maybe as a classmethod
{code:java}
model = StringIndexerModel.from_labels(['a', 'b']){code}
See SPARK-15009 which added a similar constructor to CountVectorizerModel



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py

2018-03-29 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-22711.
--
Resolution: Workaround

Closing this because it seems wordnet is not serializable with Cloudpickle and 
it's preferable to import inside the executor function rather than globally.
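
A minimal sketch of that workaround (the function name is illustrative, and the 
NLTK wordnet corpus still has to be available on each executor):

{code:python}
def count_synsets(word):
    # Importing here means the import happens in the executor's Python process,
    # so cloudpickle never has to serialize the wordnet corpus reader itself.
    from nltk.corpus import wordnet
    return (word, len(wordnet.synsets(word)))

# RDD2 = RDD1.map(count_synsets).reduceByKey(lambda c, v: c + v)
{code}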

> _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from 
> cloudpickle.py
> ---
>
> Key: SPARK-22711
> URL: https://issues.apache.org/jira/browse/SPARK-22711
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 2.2.0, 2.2.1
> Environment: Ubuntu pseudo distributed installation of Spark 2.2.0
>Reporter: Prateek
>Priority: Major
> Attachments: Jira_Spark_minimized_code.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> When I submit a PySpark program with the spark-submit command, this error is 
> thrown. It happens for code like the following:
> RDD2 = RDD1.map(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or 
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduce(lambda c,v :c+v)
> Traceback (most recent call last):
>   File "/home/prateek/Project/textrank.py", line 299, in 
> summaryRDD = sentenceTokensReduceRDD.map(lambda m: 
> get_summary(m)).reduceByKey(lambda c,v :c+v)
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1608, 
> in reduceByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1846, 
> in combineByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1783, 
> in partitionBy
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, 
> in _jrdd
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2388, 
> in _wrap_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2374, 
> in _prepare_for_python_RDD
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 
> 460, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 704, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 148, in dump
>   File "/usr/lib/python3.5/pickle.py", line 408, in dump
> self.save(obj)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 740, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>

[jira] [Updated] (SPARK-23704) PySpark access of individual trees in random forest is slow

2018-03-29 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-23704:
-
Component/s: PySpark

> PySpark access of individual trees in random forest is slow
> ---
>
> Key: SPARK-23704
> URL: https://issues.apache.org/jira/browse/SPARK-23704
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.1
> Environment: PySpark 2.2.1 / Windows 10
>Reporter: Julian King
>Priority: Minor
>
> Making predictions from a RandomForestClassifier in PySpark is much faster 
> than making predictions from an individual tree contained in the .trees 
> attribute.
> In fact, the model.transform call without an action is more than 10x slower 
> for an individual tree than the model.transform call for the random forest 
> model.
> See 
> [https://stackoverflow.com/questions/49297470/slow-individual-tree-access-for-random-forest-in-pyspark]
>  for an example with timings.
> Ideally:
>  * Getting a prediction from a single tree should be comparable to or faster 
> than getting predictions from the whole forest
>  * Getting all the predictions from all the individual trees should be 
> comparable in speed to getting the predictions from the random forest
>  
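> A minimal timing sketch of the comparison (the data, column names, and 
> numTrees below are illustrative, not taken from the linked report):
> {code:python}
> import time
> from pyspark.ml.classification import RandomForestClassifier
> from pyspark.ml.linalg import Vectors
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.getOrCreate()
> df = spark.createDataFrame(
>     [(Vectors.dense([0.0, 1.0]), 0.0), (Vectors.dense([1.0, 0.0]), 1.0)] * 100,
>     ["features", "label"])
> model = RandomForestClassifier(numTrees=20).fit(df)
>
> start = time.time()
> model.transform(df)           # whole forest; transform is lazy, no action run
> print("forest transform:      %.3fs" % (time.time() - start))
>
> start = time.time()
> model.trees[0].transform(df)  # a single tree taken from the .trees attribute
> print("single tree transform: %.3fs" % (time.time() - start))
> {code}
> On an affected version the second timing is reported to be an order of 
> magnitude larger, even though neither call triggers an action.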



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23699) PySpark should raise same Error when Arrow fallback is disabled

2018-03-27 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-23699.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20839
[https://github.com/apache/spark/pull/20839]

> PySpark should raise same Error when Arrow fallback is disabled
> ---
>
> Key: SPARK-23699
> URL: https://issues.apache.org/jira/browse/SPARK-23699
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Minor
> Fix For: 2.4.0
>
>
> When a schema or import error is encountered when using Arrow for 
> createDataFrame or toPandas and fallback is disabled, a RuntimeError is 
> raised.  It would be better to raise the same type of error.
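> A hedged repro sketch (the config names are the ones used around Spark 
> 2.3/2.4, and the MapType column is just one example of a type Arrow cannot 
> convert here):
> {code:python}
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.getOrCreate()
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> spark.conf.set("spark.sql.execution.arrow.fallback.enabled", "false")
>
> df = spark.createDataFrame([({"a": 1},)], ["m"])  # MapType is not Arrow-convertible
> df.toPandas()  # previously surfaced as RuntimeError; with the fix it raises
>                # the original unsupported-type error instead
> {code}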



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23699) PySpark should raise same Error when Arrow fallback is disabled

2018-03-27 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-23699:


Assignee: Bryan Cutler

> PySpark should raise same Error when Arrow fallback is disabled
> ---
>
> Key: SPARK-23699
> URL: https://issues.apache.org/jira/browse/SPARK-23699
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Minor
>
> When a schema or import error is encountered when using Arrow for 
> createDataFrame or toPandas and fallback is disabled, a RuntimeError is 
> raised.  It would be better to raise the same type of error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23162) PySpark ML LinearRegressionSummary missing r2adj

2018-03-26 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-23162.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20842
[https://github.com/apache/spark/pull/20842]

> PySpark ML LinearRegressionSummary missing r2adj
> 
>
> Key: SPARK-23162
> URL: https://issues.apache.org/jira/browse/SPARK-23162
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: kevin yu
>Priority: Minor
>  Labels: starter
> Fix For: 2.4.0
>
>
> Missing the Python API for {{r2adj}} in {{LinearRegressionSummary}}
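> A small usage sketch of what the added property would look like from Python 
> (toy data; Spark >= 2.4 assumed):
> {code:python}
> from pyspark.ml.linalg import Vectors
> from pyspark.ml.regression import LinearRegression
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.getOrCreate()
> train = spark.createDataFrame(
>     [(Vectors.dense([1.0]), 1.0), (Vectors.dense([2.0]), 2.1),
>      (Vectors.dense([3.0]), 2.9)],
>     ["features", "label"])
> model = LinearRegression().fit(train)
>
> print(model.summary.r2)     # already exposed in Python
> print(model.summary.r2adj)  # the property this issue adds, mirroring Scala
> {code}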



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23162) PySpark ML LinearRegressionSummary missing r2adj

2018-03-26 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-23162:


Assignee: kevin yu

> PySpark ML LinearRegressionSummary missing r2adj
> 
>
> Key: SPARK-23162
> URL: https://issues.apache.org/jira/browse/SPARK-23162
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: kevin yu
>Priority: Minor
>  Labels: starter
>
> Missing the Python API for {{r2adj}} in {{LinearRegressionSummary}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23615) Add maxDF Parameter to Python CountVectorizer

2018-03-23 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-23615.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20777
[https://github.com/apache/spark/pull/20777]

> Add maxDF Parameter to Python CountVectorizer
> -
>
> Key: SPARK-23615
> URL: https://issues.apache.org/jira/browse/SPARK-23615
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 2.4.0
>
>
> The maxDF parameter is for filtering out frequently occurring terms.  This 
> param was recently added to the Scala CountVectorizer and needs to be added 
> to Python also.
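> A hedged usage sketch once the param lands in Python (toy documents; Spark 
> >= 2.4 assumed):
> {code:python}
> from pyspark.ml.feature import CountVectorizer
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.getOrCreate()
> df = spark.createDataFrame([(["a", "b", "c"],), (["a", "b"],), (["a"],)],
>                            ["words"])
>
> # maxDF as a fraction: drop terms appearing in more than 80% of the documents
> cv = CountVectorizer(inputCol="words", outputCol="features", maxDF=0.8)
> model = cv.fit(df)
> print(model.vocabulary)  # "a" (in 3 of 3 docs) is expected to be filtered out
> {code}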



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23615) Add maxDF Parameter to Python CountVectorizer

2018-03-23 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-23615:


Assignee: Huaxin Gao

> Add maxDF Parameter to Python CountVectorizer
> -
>
> Key: SPARK-23615
> URL: https://issues.apache.org/jira/browse/SPARK-23615
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 2.4.0
>
>
> The maxDF parameter is for filtering out frequently occurring terms.  This 
> param was recently added to the Scala CountVectorizer and needs to be added 
> to Python also.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23234) ML python test failure due to default outputCol

2018-03-20 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-23234.
--
Resolution: Duplicate

> ML python test failure due to default outputCol
> ---
>
> Key: SPARK-23234
> URL: https://issues.apache.org/jira/browse/SPARK-23234
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Major
>
> SPARK-22799 and SPARK-22797 are causing valid Python test failures. The 
> reason is that Python sets the default params with set(), so they are not 
> considered defaults but params passed by the user.
> This means that an outputCol is set not as a default but as a real value.
> Anyway, this is a misbehavior of the Python API which can cause serious 
> problems, and I'd suggest rethinking the way this is done.
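> A small illustration of the default-vs-set distinction described above, 
> reusing Bucketizer as an example stage (hypothetical session; output names 
> are uid-based and will differ):
> {code:python}
> from pyspark.ml.feature import Bucketizer
>
> b = Bucketizer()
> # When outputCol is handled as a true default, isSet should be False and
> # getOutputCol should fall back to the uid-based default name.  A value
> # injected via set() would instead make isSet return True, i.e. look like
> # a user-supplied param.
> print(b.isSet(b.outputCol))   # expected: False for a genuine default
> print(b.getOutputCol())       # e.g. u'Bucketizer_..._output'
> {code}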



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23244) Incorrect handling of default values when deserializing python wrappers of scala transformers

2018-03-20 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407510#comment-16407510
 ] 

Bryan Cutler commented on SPARK-23244:
--

Just to clarify, the PySpark save/load is just a wrapper that makes the same 
calls in Java, so fixing it on the Java side will address the root cause of 
the issue.

> Incorrect handling of default values when deserializing python wrappers of 
> scala transformers
> -
>
> Key: SPARK-23244
> URL: https://issues.apache.org/jira/browse/SPARK-23244
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.1
>Reporter: Tomas Nykodym
>Priority: Minor
>
> Default values are not handled properly when serializing/deserializing Python 
> transformers that are wrappers around Scala objects. It looks like, after 
> deserialization, the default values that were based on the uid are not 
> properly restored, and values that were not set end up set to their 
> (original) default values.
> Here's a simple code example using Bucketizer:
> {code:python}
> >>> from pyspark.ml.feature import Bucketizer
> >>> a = Bucketizer() 
> >>> a.save("bucketizer0")
> >>> b = load("bucketizer0") 
> >>> a._defaultParamMap[a.outputCol]
> u'Bucketizer_440bb49206c148989db7__output'
> >>> b._defaultParamMap[b.outputCol]
> u'Bucketizer_41cf9afbc559ca2bfc9a__output'
> >>> a.isSet(a.outputCol)
> False 
> >>> b.isSet(b.outputCol)
> True
> >>> a.getOutputCol()
> u'Bucketizer_440bb49206c148989db7__output'
> >>> b.getOutputCol()
> u'Bucketizer_440bb49206c148989db7__output'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23244) Incorrect handling of default values when deserializing python wrappers of scala transformers

2018-03-20 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-23244.
--
Resolution: Duplicate

> Incorrect handling of default values when deserializing python wrappers of 
> scala transformers
> -
>
> Key: SPARK-23244
> URL: https://issues.apache.org/jira/browse/SPARK-23244
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.1
>Reporter: Tomas Nykodym
>Priority: Minor
>
> Default values are not handled properly when serializing/deserializing Python 
> transformers that are wrappers around Scala objects. It looks like, after 
> deserialization, the default values that were based on the uid are not 
> properly restored, and values that were not set end up set to their 
> (original) default values.
> Here's a simple code example using Bucketizer:
> {code:python}
> >>> from pyspark.ml.feature import Bucketizer
> >>> a = Bucketizer() 
> >>> a.save("bucketizer0")
> >>> b = load("bucketizer0") 
> >>> a._defaultParamMap[a.outputCol]
> u'Bucketizer_440bb49206c148989db7__output'
> >>> b._defaultParamMap[b.outputCol]
> u'Bucketizer_41cf9afbc559ca2bfc9a__output'
> >>> a.isSet(a.outputCol)
> False 
> >>> b.isSet(b.outputCol)
> True
> >>> a.getOutputCol()
> u'Bucketizer_440bb49206c148989db7__output'
> >>> b.getOutputCol()
> u'Bucketizer_440bb49206c148989db7__output'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23244) Incorrect handling of default values when deserializing python wrappers of scala transformers

2018-03-20 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407507#comment-16407507
 ] 

Bryan Cutler commented on SPARK-23244:
--

I looked into this and it is a little bit different because with save/load, 
params are only transferred from Java to Python.  So the actual problem is in 
Scala:
{code:java}
scala> import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.ml.feature.Bucketizer

scala> val a = new Bucketizer()
a: org.apache.spark.ml.feature.Bucketizer = bucketizer_30c66d09db18

scala> a.isSet(a.outputCol)
res2: Boolean = false

scala> a.save("bucketizer0")

scala> val b = Bucketizer.load("bucketizer0")
b: org.apache.spark.ml.feature.Bucketizer = bucketizer_30c66d09db18

scala> b.isSet(b.outputCol)
res4: Boolean = true{code}
It seems this is being worked on in SPARK-23455, so I'll still close this as a 
duplicate

> Incorrect handling of default values when deserializing python wrappers of 
> scala transformers
> -
>
> Key: SPARK-23244
> URL: https://issues.apache.org/jira/browse/SPARK-23244
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.1
>Reporter: Tomas Nykodym
>Priority: Minor
>
> Default values are not handled properly when serializing/deserializing Python 
> transformers that are wrappers around Scala objects. It looks like, after 
> deserialization, the default values that were based on the uid are not 
> properly restored, and values that were not set end up set to their 
> (original) default values.
> Here's a simple code example using Bucketizer:
> {code:python}
> >>> from pyspark.ml.feature import Bucketizer
> >>> a = Bucketizer() 
> >>> a.save("bucketizer0")
> >>> b = load("bucketizer0") 
> >>> a._defaultParamMap[a.outputCol]
> u'Bucketizer_440bb49206c148989db7__output'
> >>> b._defaultParamMap[b.outputCol]
> u'Bucketizer_41cf9afbc559ca2bfc9a__output'
> >>> a.isSet(a.outputCol)
> False 
> >>> b.isSet(b.outputCol)
> True
> >>> a.getOutputCol()
> u'Bucketizer_440bb49206c148989db7__output'
> >>> b.getOutputCol()
> u'Bucketizer_440bb49206c148989db7__output'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23691) Use sql_conf util in PySpark tests where possible

2018-03-19 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16405806#comment-16405806
 ] 

Bryan Cutler commented on SPARK-23691:
--

Thanks [~hyukjin.kwon]!

> Use sql_conf util in PySpark tests where possible
> -
>
> Key: SPARK-23691
> URL: https://issues.apache.org/jira/browse/SPARK-23691
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.4.0
>
>
> https://github.com/apache/spark/commit/d6632d185e147fcbe6724545488ad80dce20277e
>  added a useful util
> {code}
> @contextmanager
> def sql_conf(self, pairs):
> ...
> {code}
> to allow configuration set/unset within a block:
> {code}
> with self.sql_conf({"spark.blah.blah.blah": "blah"}):
> # test codes
> {code}
> It would be nicer if we used it. 
> Note that there already appear to be a few places that change configs in 
> tests without restoring the original values in the unittest classes.
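> For reference, a hedged sketch of what such a helper typically looks like 
> (self.spark here is assumed to be the test class's SparkSession; this is a 
> sketch, not the exact Spark source):
> {code:python}
> from contextlib import contextmanager
>
> @contextmanager
> def sql_conf(self, pairs):
>     """Temporarily set SQL confs, restoring (or unsetting) them afterwards."""
>     keys, new_values = zip(*pairs.items())
>     old_values = [self.spark.conf.get(k, None) for k in keys]
>     try:
>         for k, v in zip(keys, new_values):
>             self.spark.conf.set(k, v)
>         yield
>     finally:
>         for k, old in zip(keys, old_values):
>             if old is None:
>                 self.spark.conf.unset(k)
>             else:
>                 self.spark.conf.set(k, old)
> {code}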



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23691) Use sql_conf util in PySpark tests where possible

2018-03-19 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-23691.
--
   Resolution: Fixed
 Assignee: Hyukjin Kwon
Fix Version/s: 2.4.0

Issue resolved by pull request 20830
https://github.com/apache/spark/pull/20830

> Use sql_conf util in PySpark tests where possible
> -
>
> Key: SPARK-23691
> URL: https://issues.apache.org/jira/browse/SPARK-23691
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.4.0
>
>
> https://github.com/apache/spark/commit/d6632d185e147fcbe6724545488ad80dce20277e
>  added a useful util
> {code}
> @contextmanager
> def sql_conf(self, pairs):
> ...
> {code}
> to allow configuration set/unset within a block:
> {code}
> with self.sql_conf({"spark.blah.blah.blah": "blah"}):
> # test codes
> {code}
> It would be nicer if we used it. 
> Note that there already appear to be a few places that change configs in 
> tests without restoring the original values in the unittest classes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23700) Cleanup unused imports

2018-03-15 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-23700:


 Summary: Cleanup unused imports
 Key: SPARK-23700
 URL: https://issues.apache.org/jira/browse/SPARK-23700
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.4.0
Reporter: Bryan Cutler


I've noticed a fair amount of unused imports in PySpark; I'll take a look 
through and try to clean them up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23699) PySpark should raise same Error when Arrow fallback is disabled

2018-03-15 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-23699:


 Summary: PySpark should raise same Error when Arrow fallback is 
disabled
 Key: SPARK-23699
 URL: https://issues.apache.org/jira/browse/SPARK-23699
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 2.4.0
Reporter: Bryan Cutler


When a schema or import error is encountered when using Arrow for 
createDataFrame or toPandas and fallback is disabled, a RuntimeError is raised. 
 It would be better to raise the same type of error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23615) Add maxDF Parameter to Python CountVectorizer

2018-03-08 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391577#comment-16391577
 ] 

Bryan Cutler commented on SPARK-23615:
--

Sure, go ahead

> Add maxDF Parameter to Python CountVectorizer
> -
>
> Key: SPARK-23615
> URL: https://issues.apache.org/jira/browse/SPARK-23615
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Priority: Minor
>
> The maxDF parameter is for filtering out frequently occurring terms.  This 
> param was recently added to the Scala CountVectorizer and needs to be added 
> to Python also.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


