[jira] [Commented] (SPARK-23912) High-order function: array_distinct(x) → array

2018-04-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434926#comment-16434926
 ] 

Apache Spark commented on SPARK-23912:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/21050

> High-order function: array_distinct(x) → array
> --
>
> Key: SPARK-23912
> URL: https://issues.apache.org/jira/browse/SPARK-23912
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Remove duplicate values from the array x.
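
For reference, a usage sketch once the proposed function is available (semantics per the Presto 
doc above; the exact Spark SQL signature is an assumption until the PR is merged):

{code}
// Hypothetical usage of the proposed function; expected result per Presto semantics: [1, 2, 3]
spark.sql("SELECT array_distinct(array(1, 2, 2, 3))").show()
{code}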






[jira] [Assigned] (SPARK-23912) High-order function: array_distinct(x) → array

2018-04-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23912:


Assignee: (was: Apache Spark)

> High-order function: array_distinct(x) → array
> --
>
> Key: SPARK-23912
> URL: https://issues.apache.org/jira/browse/SPARK-23912
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Remove duplicate values from the array x.






[jira] [Assigned] (SPARK-23912) High-order function: array_distinct(x) → array

2018-04-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23912:


Assignee: Apache Spark

> High-order function: array_distinct(x) → array
> --
>
> Key: SPARK-23912
> URL: https://issues.apache.org/jira/browse/SPARK-23912
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Remove duplicate values from the array x.






[jira] [Assigned] (SPARK-23957) Sorts in subqueries are redundant and can be removed

2018-04-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23957:


Assignee: Apache Spark

> Sorts in subqueries are redundant and can be removed
> 
>
> Key: SPARK-23957
> URL: https://issues.apache.org/jira/browse/SPARK-23957
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Henry Robinson
>Assignee: Apache Spark
>Priority: Major
>
> Unless combined with a {{LIMIT}}, there's no correctness reason that planned 
> and optimized subqueries should have any sort operators (since the result of 
> the subquery is an unordered collection of tuples). 
> For example:
> {{SELECT count(1) FROM (select id FROM dft ORDER by id)}}
> has the following plan:
> {code:java}
> == Physical Plan ==
> *(3) HashAggregate(keys=[], functions=[count(1)])
> +- Exchange SinglePartition
>+- *(2) HashAggregate(keys=[], functions=[partial_count(1)])
>   +- *(2) Project
>  +- *(2) Sort [id#0L ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
>+- *(1) Project [id#0L]
>   +- *(1) FileScan parquet [id#0L] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: 
> [], PushedFilters: [], ReadSchema: struct
> {code}
> ... but the sort operator is redundant.
> Less intuitively, the sort is also redundant in selections from an ordered 
> subquery:
> {{SELECT * FROM (SELECT id FROM dft ORDER BY id)}}
> has plan:
> {code:java}
> == Physical Plan ==
> *(2) Sort [id#0L ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
>+- *(1) Project [id#0L]
>   +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], 
> PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> {code}
> ... but again, since the subquery returns a bag of tuples, the sort is 
> unnecessary.
> We should consider adding an optimizer rule that removes a sort inside a 
> subquery. SPARK-23375 is related, but removes sorts that are functionally 
> redundant because they perform the same ordering.
>   
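
The plans quoted above can be reproduced from the Scala shell along these lines (the {{dft}} 
view and its contents are illustrative assumptions):

{code}
// Build a small table named like the one in the report, then inspect the plans.
spark.range(1000000).toDF("id").createOrReplaceTempView("dft")
spark.sql("SELECT count(1) FROM (SELECT id FROM dft ORDER BY id)").explain()
spark.sql("SELECT * FROM (SELECT id FROM dft ORDER BY id)").explain()
{code}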






[jira] [Commented] (SPARK-23957) Sorts in subqueries are redundant and can be removed

2018-04-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434918#comment-16434918
 ] 

Apache Spark commented on SPARK-23957:
--

User 'henryr' has created a pull request for this issue:
https://github.com/apache/spark/pull/21049

> Sorts in subqueries are redundant and can be removed
> 
>
> Key: SPARK-23957
> URL: https://issues.apache.org/jira/browse/SPARK-23957
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Henry Robinson
>Priority: Major
>
> Unless combined with a {{LIMIT}}, there's no correctness reason that planned 
> and optimized subqueries should have any sort operators (since the result of 
> the subquery is an unordered collection of tuples). 
> For example:
> {{SELECT count(1) FROM (select id FROM dft ORDER by id)}}
> has the following plan:
> {code:java}
> == Physical Plan ==
> *(3) HashAggregate(keys=[], functions=[count(1)])
> +- Exchange SinglePartition
>+- *(2) HashAggregate(keys=[], functions=[partial_count(1)])
>   +- *(2) Project
>  +- *(2) Sort [id#0L ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
>+- *(1) Project [id#0L]
>   +- *(1) FileScan parquet [id#0L] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: 
> [], PushedFilters: [], ReadSchema: struct
> {code}
> ... but the sort operator is redundant.
> Less intuitively, the sort is also redundant in selections from an ordered 
> subquery:
> {{SELECT * FROM (SELECT id FROM dft ORDER BY id)}}
> has plan:
> {code:java}
> == Physical Plan ==
> *(2) Sort [id#0L ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
>+- *(1) Project [id#0L]
>   +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], 
> PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> {code}
> ... but again, since the subquery returns a bag of tuples, the sort is 
> unnecessary.
> We should consider adding an optimizer rule that removes a sort inside a 
> subquery. SPARK-23375 is related, but removes sorts that are functionally 
> redundant because they perform the same ordering.
>   






[jira] [Assigned] (SPARK-23957) Sorts in subqueries are redundant and can be removed

2018-04-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23957:


Assignee: (was: Apache Spark)

> Sorts in subqueries are redundant and can be removed
> 
>
> Key: SPARK-23957
> URL: https://issues.apache.org/jira/browse/SPARK-23957
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Henry Robinson
>Priority: Major
>
> Unless combined with a {{LIMIT}}, there's no correctness reason that planned 
> and optimized subqueries should have any sort operators (since the result of 
> the subquery is an unordered collection of tuples). 
> For example:
> {{SELECT count(1) FROM (select id FROM dft ORDER by id)}}
> has the following plan:
> {code:java}
> == Physical Plan ==
> *(3) HashAggregate(keys=[], functions=[count(1)])
> +- Exchange SinglePartition
>+- *(2) HashAggregate(keys=[], functions=[partial_count(1)])
>   +- *(2) Project
>  +- *(2) Sort [id#0L ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
>+- *(1) Project [id#0L]
>   +- *(1) FileScan parquet [id#0L] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: 
> [], PushedFilters: [], ReadSchema: struct
> {code}
> ... but the sort operator is redundant.
> Less intuitively, the sort is also redundant in selections from an ordered 
> subquery:
> {{SELECT * FROM (SELECT id FROM dft ORDER BY id)}}
> has plan:
> {code:java}
> == Physical Plan ==
> *(2) Sort [id#0L ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
>+- *(1) Project [id#0L]
>   +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], 
> PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> {code}
> ... but again, since the subquery returns a bag of tuples, the sort is 
> unnecessary.
> We should consider adding an optimizer rule that removes a sort inside a 
> subquery. SPARK-23375 is related, but removes sorts that are functionally 
> redundant because they perform the same ordering.
>   






[jira] [Resolved] (SPARK-23958) HadoopRdd filters empty files to avoid generating empty tasks that affect Spark computing performance.

2018-04-11 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23958.
--
Resolution: Duplicate

> HadoopRdd filters empty files to avoid generating empty tasks that affect 
> Spark computing performance.
> -
>
> Key: SPARK-23958
> URL: https://issues.apache.org/jira/browse/SPARK-23958
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Priority: Minor
>
> HadoopRDD should filter out empty files (files whose length is zero) to avoid 
> generating empty tasks that degrade Spark's computing performance.
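
A user-side sketch of the idea (skipping zero-length files before building an RDD); the input 
path and the use of the Hadoop FileSystem API here are illustrative assumptions:

{code}
import org.apache.hadoop.fs.{FileSystem, Path}

// List the input directory and keep only non-empty files so no empty tasks are scheduled.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val nonEmptyFiles = fs.listStatus(new Path("/data/input"))
  .filter(_.getLen > 0)
  .map(_.getPath.toString)
val rdd = spark.sparkContext.textFile(nonEmptyFiles.mkString(","))
{code}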






[jira] [Resolved] (SPARK-23950) Coalescing an empty dataframe to 1 partition

2018-04-11 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23950.
--
Resolution: Cannot Reproduce

> Coalescing an empty dataframe to 1 partition
> 
>
> Key: SPARK-23950
> URL: https://issues.apache.org/jira/browse/SPARK-23950
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
> Environment: Operating System: Windows 7
> Tested in Jupyter notebooks using Python 2.7.14 and Python 3.6.3.
> Hardware specs not relevant to the issue.
>Reporter: João Neves
>Priority: Major
>
> Coalescing an empty dataframe to 1 partition returns an error.
> The funny thing is that coalescing an empty dataframe to 2 or more partitions 
> seems to work.
> The test case is the following:
> {code}
> from pyspark.sql.types import StructType
> df = spark.createDataFrame(spark.sparkContext.emptyRDD(), StructType([]))
> print(df.coalesce(2).count())
> print(df.coalesce(3).count())
> print(df.coalesce(4).count())
> df.coalesce(1).count(){code}
> Output:
> {code:java}
> 0
> 0
> 0
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
> 7 print(df.coalesce(4).count())
> 8 
> > 9 print(df.coalesce(1).count())
> C:\spark-2.2.1-bin-hadoop2.7\python\pyspark\sql\dataframe.py in count(self)
> 425 2
> 426 """
> --> 427 return int(self._jdf.count())
> 428 
> 429 @ignore_unicode_prefix
> c:\python36\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
> 1131 answer = self.gateway_client.send_command(command)
> 1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, self.name)
> 1134 
> 1135 for temp_arg in temp_args:
> C:\spark-2.2.1-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
> 61 def deco(*a, **kw):
> 62 try:
> ---> 63 return f(*a, **kw)
> 64 except py4j.protocol.Py4JJavaError as e:
> 65 s = e.java_exception.toString()
> c:\python36\lib\site-packages\py4j\protocol.py in get_return_value(answer, 
> gateway_client, target_id, name)
> 317 raise Py4JJavaError(
> 318 "An error occurred while calling {0}{1}{2}.\n".
> --> 319 format(target_id, ".", name), value)
> 320 else:
> 321 raise Py4JError(
> Py4JJavaError: An error occurred while calling o176.count.
> : java.util.NoSuchElementException: next on empty iterator
> at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
> at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
> at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
> at scala.collection.IterableLike$class.head(IterableLike.scala:107)
> at 
> scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:186)
> at 
> scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
> at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:186)
> at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2435)
> at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2434)
> at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
> at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)
> at org.apache.spark.sql.Dataset.count(Dataset.scala:2434)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:280)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:214)
> at java.lang.Thread.run(Unknown Source){code}
> Shouldn't this be consistent?
> Thank you very much.






[jira] [Resolved] (SPARK-23965) make python/py4j-src-0.x.y.zip file name Spark version-independent

2018-04-11 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23965.
--
Resolution: Won't Fix

> make python/py4j-src-0.x.y.zip file name Spark version-independent
> --
>
> Key: SPARK-23965
> URL: https://issues.apache.org/jira/browse/SPARK-23965
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.1, 2.3.0, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> After each Spark release (which normally ships a slightly newer version of 
> py4j), we have to adjust our PySpark applications' PYTHONPATH to 
> point to the correct version of python/py4j-src-0.9.2.zip. 
> For example, python/py4j-src-0.9.2.zip changes to python/py4j-src-0.9.6.zip in 
> the next release, then to something else, etc. 
> Possible solutions: it would be great to either
>  - rename `python/py4j-src-0.x.y.zip` to `python/py4j-src-latest.zip` or 
> `python/py4j-src-current.zip`
>  - or make a symlink `py4j-src-current.zip` in the Spark distribution pointing 
> to whatever version Spark ships with.
> In either case, if this were solved, we wouldn't have to adjust 
> PYTHONPATH during upgrades like Spark 2.2 to 2.3. 
> Thanks.






[jira] [Resolved] (SPARK-23955) typo in parameter name 'rawPredicition'

2018-04-11 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23955.
--
Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/21030

> typo in parameter name 'rawPredicition'
> ---
>
> Key: SPARK-23955
> URL: https://issues.apache.org/jira/browse/SPARK-23955
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: John Bauer
>Priority: Trivial
>
> The classifier.py MultilayerPerceptronClassifier.__init__ API call had the typo 
> rawPredicition instead of rawPrediction; the typo is 
> also present in the docs.






[jira] [Commented] (SPARK-23965) make python/py4j-src-0.x.y.zip file name Spark version-independent

2018-04-11 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434810#comment-16434810
 ] 

Hyukjin Kwon commented on SPARK-23965:
--

I would leave this resolved. I don't think there's a strong reason to rename or 
make a link, IMHO.

> make python/py4j-src-0.x.y.zip file name Spark version-independent
> --
>
> Key: SPARK-23965
> URL: https://issues.apache.org/jira/browse/SPARK-23965
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.1, 2.3.0, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> After each Spark release (which normally ships a slightly newer version of 
> py4j), we have to adjust our PySpark applications' PYTHONPATH to 
> point to the correct version of python/py4j-src-0.9.2.zip. 
> For example, python/py4j-src-0.9.2.zip changes to python/py4j-src-0.9.6.zip in 
> the next release, then to something else, etc. 
> Possible solutions: it would be great to either
>  - rename `python/py4j-src-0.x.y.zip` to `python/py4j-src-latest.zip` or 
> `python/py4j-src-current.zip`
>  - or make a symlink `py4j-src-current.zip` in the Spark distribution pointing 
> to whatever version Spark ships with.
> In either case, if this were solved, we wouldn't have to adjust 
> PYTHONPATH during upgrades like Spark 2.2 to 2.3. 
> Thanks.






[jira] [Commented] (SPARK-23965) make python/py4j-src-0.x.y.zip file name Spark version-independent

2018-04-11 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434809#comment-16434809
 ] 

Hyukjin Kwon commented on SPARK-23965:
--

I think that sounds like we are going to make the third-party library more dependent 
on Spark itself. 

Another simple solution I used a long while ago:

{code}
export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
{code}


> make python/py4j-src-0.x.y.zip file name Spark version-independent
> --
>
> Key: SPARK-23965
> URL: https://issues.apache.org/jira/browse/SPARK-23965
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.1, 2.3.0, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> After each Spark release (which normally ships a slightly newer version of 
> py4j), we have to adjust our PySpark applications' PYTHONPATH to 
> point to the correct version of python/py4j-src-0.9.2.zip. 
> For example, python/py4j-src-0.9.2.zip changes to python/py4j-src-0.9.6.zip in 
> the next release, then to something else, etc. 
> Possible solutions: it would be great to either
>  - rename `python/py4j-src-0.x.y.zip` to `python/py4j-src-latest.zip` or 
> `python/py4j-src-current.zip`
>  - or make a symlink `py4j-src-current.zip` in the Spark distribution pointing 
> to whatever version Spark ships with.
> In either case, if this were solved, we wouldn't have to adjust 
> PYTHONPATH during upgrades like Spark 2.2 to 2.3. 
> Thanks.






[jira] [Updated] (SPARK-23956) Use effective RPC port in AM registration

2018-04-11 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated SPARK-23956:
--
Priority: Minor  (was: Major)

> Use effective RPC port in AM registration 
> --
>
> Key: SPARK-23956
> URL: https://issues.apache.org/jira/browse/SPARK-23956
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Gera Shegalov
>Priority: Minor
>
> AMs should use their real RPC port in the AM registration, for better 
> diagnostics in the Application Report.
> {code}
> 18/04/10 14:56:21 INFO Client:
> client token: N/A
> diagnostics: N/A
> ApplicationMaster host: localhost
> ApplicationMaster RPC port: 58338
> queue: default
> start time: 1523397373659
> final status: UNDEFINED
> tracking URL: http://localhost:8088/proxy/application_1523370127531_0016/
> {code}
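
A hedged sketch of what "use the effective RPC port" could look like against the YARN client 
API (the registerApplicationMaster call exists in Hadoop's AMRMClient; the surrounding names 
are assumptions, not the actual Spark patch):

{code}
import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest

// Register the AM with the RPC port it actually bound, so the Application Report
// shows a meaningful value instead of a placeholder.
def registerAm(amClient: AMRMClient[ContainerRequest],
               rpcHost: String, rpcPort: Int, trackingUrl: String): Unit = {
  amClient.registerApplicationMaster(rpcHost, rpcPort, trackingUrl)
}
{code}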






[jira] [Commented] (SPARK-23961) pyspark toLocalIterator throws an exception

2018-04-11 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434799#comment-16434799
 ] 

Hyukjin Kwon commented on SPARK-23961:
--

FWIW, I ran into this issue a while ago too (I gave up on using it at the 
time and forgot to debug it further).

> pyspark toLocalIterator throws an exception
> ---
>
> Key: SPARK-23961
> URL: https://issues.apache.org/jira/browse/SPARK-23961
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
>Reporter: Michel Lemay
>Priority: Minor
>  Labels: DataFrame, pyspark
>
> Given a dataframe, call toLocalIterator. If we do not consume all records, 
> it will throw: 
> {quote}ERROR PythonRDD: Error while sending iterator
>  java.net.SocketException: Connection reset by peer: socket write error
>  at java.net.SocketOutputStream.socketWrite0(Native Method)
>  at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
>  at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
>  at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
>  at java.io.DataOutputStream.write(DataOutputStream.java:107)
>  at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
>  at 
> org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:497)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>  at 
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
>  at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:705)
>  at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705)
>  at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
>  at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:706)
> {quote}
>  
> To reproduce, here is a simple pyspark shell script that shows the error:
> {quote}import itertools
>  df = spark.read.parquet("large parquet folder").cache()
> print(df.count())
>  b = df.toLocalIterator()
>  print(len(list(itertools.islice(b, 20))))
>  b = None # Make the iterator go out of scope.  Throws here.
> {quote}
>  
> Observations:
>  * Consuming all records does not throw.  Taking only a subset of the 
> partitions creates the error.
>  * In another experiment, doing the same on a regular RDD works if we 
> cache/materialize it. If we do not cache the RDD, it throws similarly.
>  * It works in the Scala shell (a Scala sketch follows below).
>  
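
For comparison, a Scala-side equivalent of the partial consumption above (the parquet path is 
a placeholder):

{code}
// Read, cache, then consume only the first 20 rows of the local iterator.
val df = spark.read.parquet("/path/to/large_parquet").cache()
println(df.count())
val it = df.toLocalIterator()
var taken = 0
while (it.hasNext && taken < 20) { it.next(); taken += 1 }
println(taken)
{code}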






[jira] [Commented] (SPARK-23920) High-order function: array_remove(x, element) → array

2018-04-11 Thread Huaxin Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434786#comment-16434786
 ] 

Huaxin Gao commented on SPARK-23920:


I will work on this. Thanks!

> High-order function: array_remove(x, element) → array
> -
>
> Key: SPARK-23920
> URL: https://issues.apache.org/jira/browse/SPARK-23920
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Remove all elements that equal element from array x.
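
A usage sketch once the proposed function is available (semantics per the Presto doc above; 
the exact Spark SQL signature is an assumption):

{code}
// Hypothetical usage of the proposed function; expected result per Presto semantics: [1, 3]
spark.sql("SELECT array_remove(array(1, 2, 2, 3), 2)").show()
{code}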






[jira] [Commented] (SPARK-23966) Refactoring all checkpoint file writing logic in a common interface

2018-04-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434682#comment-16434682
 ] 

Apache Spark commented on SPARK-23966:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/21048

> Refactoring all checkpoint file writing logic in a common interface
> ---
>
> Key: SPARK-23966
> URL: https://issues.apache.org/jira/browse/SPARK-23966
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>
> Checkpoint files (offset log files, state store files) in Structured 
> Streaming must be written atomically such that no partial files are generated 
> (would break fault-tolerance guarantees). Currently, there are 3 locations 
> which try to do this individually, and in some cases, incorrectly.
>  # HDFSOffsetMetadataLog - This uses a FileManager interface to use any 
> implementation of `FileSystem` or `FileContext` APIs. It preferably loads 
> `FileContext` implementation as FileContext of HDFS has atomic renames.
>  # HDFSBackedStateStore (aka in-memory state store)
>  ## Writing a version.delta file - This uses FileSystem APIs only to perform 
> a rename. This is incorrect as rename is not atomic in HDFS FileSystem 
> implementation.
>  ## Writing a snapshot file - Same as above.
> Current problems:
>  # State Store behavior is incorrect - 
>  # Inflexible - Some file systems provide mechanisms other than 
> write-to-temp-file-and-rename for writing atomically and more efficiently. 
> For example, with S3 you can write directly to the final file and it will be 
> made visible only when the entire file is written and closed correctly. Any 
> failure can be made to terminate the writing without making any partial files 
> visible in S3. The current code does not abstract out this mechanism enough 
> that it can be customized. 
> Solution:
>  # Introduce a common interface that all 3 cases above can use to write 
> checkpoint files atomically. 
>  # This interface must provide the necessary interfaces that allow 
> customization of the write-and-rename mechanism.
>  
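
A minimal sketch of the kind of interface the description proposes (the names here are 
illustrative assumptions, not the classes actually introduced by the PR):

{code}
import java.io.OutputStream
import org.apache.hadoop.fs.Path

// A stream whose contents become visible atomically on close(), or not at all on cancel().
abstract class CancellableStream extends OutputStream {
  def cancel(): Unit
}

// Common entry point for the three call sites; implementations may use
// write-to-temp-and-rename (HDFS FileContext) or a direct single PUT (S3).
trait CheckpointFileWriter {
  def createAtomic(path: Path): CancellableStream
}
{code}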






[jira] [Assigned] (SPARK-23966) Refactoring all checkpoint file writing logic in a common interface

2018-04-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23966:


Assignee: Tathagata Das  (was: Apache Spark)

> Refactoring all checkpoint file writing logic in a common interface
> ---
>
> Key: SPARK-23966
> URL: https://issues.apache.org/jira/browse/SPARK-23966
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>
> Checkpoint files (offset log files, state store files) in Structured 
> Streaming must be written atomically such that no partial files are generated 
> (would break fault-tolerance guarantees). Currently, there are 3 locations 
> which try to do this individually, and in some cases, incorrectly.
>  # HDFSOffsetMetadataLog - This uses a FileManager interface to use any 
> implementation of `FileSystem` or `FileContext` APIs. It preferably loads 
> `FileContext` implementation as FileContext of HDFS has atomic renames.
>  # HDFSBackedStateStore (aka in-memory state store)
>  ## Writing a version.delta file - This uses FileSystem APIs only to perform 
> a rename. This is incorrect as rename is not atomic in HDFS FileSystem 
> implementation.
>  ## Writing a snapshot file - Same as above.
> Current problems:
>  # State Store behavior is incorrect - 
>  # Inflexible - Some file systems provide mechanisms other than 
> write-to-temp-file-and-rename for writing atomically and more efficiently. 
> For example, with S3 you can write directly to the final file and it will be 
> made visible only when the entire file is written and closed correctly. Any 
> failure can be made to terminate the writing without making any partial files 
> visible in S3. The current code does not abstract out this mechanism enough 
> that it can be customized. 
> Solution:
>  # Introduce a common interface that all 3 cases above can use to write 
> checkpoint files atomically. 
>  # This interface must provide the necessary interfaces that allow 
> customization of the write-and-rename mechanism.
>  






[jira] [Assigned] (SPARK-23966) Refactoring all checkpoint file writing logic in a common interface

2018-04-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23966:


Assignee: Apache Spark  (was: Tathagata Das)

> Refactoring all checkpoint file writing logic in a common interface
> ---
>
> Key: SPARK-23966
> URL: https://issues.apache.org/jira/browse/SPARK-23966
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Major
>
> Checkpoint files (offset log files, state store files) in Structured 
> Streaming must be written atomically such that no partial files are generated 
> (would break fault-tolerance guarantees). Currently, there are 3 locations 
> which try to do this individually, and in some cases, incorrectly.
>  # HDFSOffsetMetadataLog - This uses a FileManager interface to use any 
> implementation of `FileSystem` or `FileContext` APIs. It preferably loads 
> `FileContext` implementation as FileContext of HDFS has atomic renames.
>  # HDFSBackedStateStore (aka in-memory state store)
>  ## Writing a version.delta file - This uses FileSystem APIs only to perform 
> a rename. This is incorrect as rename is not atomic in HDFS FileSystem 
> implementation.
>  ## Writing a snapshot file - Same as above.
> Current problems:
>  # State Store behavior is incorrect - 
>  # Inflexible - Some file systems provide mechanisms other than 
> write-to-temp-file-and-rename for writing atomically and more efficiently. 
> For example, with S3 you can write directly to the final file and it will be 
> made visible only when the entire file is written and closed correctly. Any 
> failure can be made to terminate the writing without making any partial files 
> visible in S3. The current code does not abstract out this mechanism enough 
> that it can be customized. 
> Solution:
>  # Introduce a common interface that all 3 cases above can use to write 
> checkpoint files atomically. 
>  # This interface must provide the necessary interfaces that allow 
> customization of the write-and-rename mechanism.
>  






[jira] [Created] (SPARK-23966) Refactoring all checkpoint file writing logic in a common interface

2018-04-11 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-23966:
-

 Summary: Refactoring all checkpoint file writing logic in a common 
interface
 Key: SPARK-23966
 URL: https://issues.apache.org/jira/browse/SPARK-23966
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Tathagata Das
Assignee: Tathagata Das


Checkpoint files (offset log files, state store files) in Structured Streaming 
must be written atomically such that no partial files are generated (would 
break fault-tolerance guarantees). Currently, there are 3 locations which try 
to do this individually, and in some cases, incorrectly.
 # HDFSOffsetMetadataLog - This uses a FileManager interface to use any 
implementation of `FileSystem` or `FileContext` APIs. It preferably loads 
`FileContext` implementation as FileContext of HDFS has atomic renames.
 # HDFSBackedStateStore (aka in-memory state store)
 ## Writing a version.delta file - This uses FileSystem APIs only to perform a 
rename. This is incorrect as rename is not atomic in HDFS FileSystem 
implementation.
 ## Writing a snapshot file - Same as above.

Current problems:
 # State Store behavior is incorrect - 
 # Inflexible - Some file systems provide mechanisms other than 
write-to-temp-file-and-rename for writing atomically and more efficiently. For 
example, with S3 you can write directly to the final file and it will be made 
visible only when the entire file is written and closed correctly. Any failure 
can be made to terminate the writing without making any partial files visible 
in S3. The current code does not abstract out this mechanism enough that it can 
be customized. 

Solution:
 # Introduce a common interface that all 3 cases above can use to write 
checkpoint files atomically. 
 # This interface must provide the necessary interfaces that allow 
customization of the write-and-rename mechanism.

 






[jira] [Assigned] (SPARK-23956) Use effective RPC port in AM registration

2018-04-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23956:


Assignee: (was: Apache Spark)

> Use effective RPC port in AM registration 
> --
>
> Key: SPARK-23956
> URL: https://issues.apache.org/jira/browse/SPARK-23956
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Gera Shegalov
>Priority: Major
>
> AMs should use their real RPC port in the AM registration, for better 
> diagnostics in the Application Report.
> {code}
> 18/04/10 14:56:21 INFO Client:
> client token: N/A
> diagnostics: N/A
> ApplicationMaster host: localhost
> ApplicationMaster RPC port: 58338
> queue: default
> start time: 1523397373659
> final status: UNDEFINED
> tracking URL: http://localhost:8088/proxy/application_1523370127531_0016/
> {code}






[jira] [Commented] (SPARK-23956) Use effective RPC port in AM registration

2018-04-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434595#comment-16434595
 ] 

Apache Spark commented on SPARK-23956:
--

User 'gerashegalov' has created a pull request for this issue:
https://github.com/apache/spark/pull/21047

> Use effective RPC port in AM registration 
> --
>
> Key: SPARK-23956
> URL: https://issues.apache.org/jira/browse/SPARK-23956
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Gera Shegalov
>Priority: Major
>
> AMs should use their real RPC port in the AM registration, for better 
> diagnostics in the Application Report.
> {code}
> 18/04/10 14:56:21 INFO Client:
> client token: N/A
> diagnostics: N/A
> ApplicationMaster host: localhost
> ApplicationMaster RPC port: 58338
> queue: default
> start time: 1523397373659
> final status: UNDEFINED
> tracking URL: http://localhost:8088/proxy/application_1523370127531_0016/
> {code}






[jira] [Assigned] (SPARK-23956) Use effective RPC port in AM registration

2018-04-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23956:


Assignee: Apache Spark

> Use effective RPC port in AM registration 
> --
>
> Key: SPARK-23956
> URL: https://issues.apache.org/jira/browse/SPARK-23956
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Gera Shegalov
>Assignee: Apache Spark
>Priority: Major
>
> AMs should use their real RPC port in the AM registration, for better 
> diagnostics in the Application Report.
> {code}
> 18/04/10 14:56:21 INFO Client:
> client token: N/A
> diagnostics: N/A
> ApplicationMaster host: localhost
> ApplicationMaster RPC port: 58338
> queue: default
> start time: 1523397373659
> final status: UNDEFINED
> tracking URL: http://localhost:8088/proxy/application_1523370127531_0016/
> {code}






[jira] [Updated] (SPARK-23965) make python/py4j-src-0.x.y.zip file name Spark version-independent

2018-04-11 Thread Ruslan Dautkhanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-23965:
--
Description: 
After each Spark release (which normally ships a slightly newer version 
of py4j), we have to adjust our PySpark applications' PYTHONPATH to point to 
the correct version of python/py4j-src-0.9.2.zip. 

For example, python/py4j-src-0.9.2.zip changes to python/py4j-src-0.9.6.zip in the 
next release, then to something else, etc. 

Possible solutions: it would be great to either
 - rename `python/py4j-src-0.x.y.zip` to `python/py4j-src-latest.zip` or 
`python/py4j-src-current.zip`
 - or make a symlink `py4j-src-current.zip` in the Spark distribution pointing to 
whatever version Spark ships with.

In either case, if this were solved, we wouldn't have to adjust PYTHONPATH 
during upgrades like Spark 2.2 to 2.3. 

Thanks.

  was:
After each Spark release (which normally ships a slightly newer version 
of py4j), we have to adjust our PySpark applications' PYTHONPATH to point to 
the correct version of python/py4j-src-0.9.2.zip. 

For example, python/py4j-src-0.9.2.zip changes to python/py4j-src-0.9.6.zip in the 
next release, then to something else, etc. 

Possible solutions: it would be great to either
- rename `python/py4j-src-0.x.y.zip` to `python/py4j-src-latest.zip` or 
`python/py4j-src-current.zip` 
- make a symlink `py4j-src-current.zip` in the Spark distribution pointing to 
whatever version Spark ships with.

Thanks.


> make python/py4j-src-0.x.y.zip file name Spark version-independent
> --
>
> Key: SPARK-23965
> URL: https://issues.apache.org/jira/browse/SPARK-23965
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.1, 2.3.0, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> After each Spark release (which normally ships a slightly newer version of 
> py4j), we have to adjust our PySpark applications' PYTHONPATH to 
> point to the correct version of python/py4j-src-0.9.2.zip. 
> For example, python/py4j-src-0.9.2.zip changes to python/py4j-src-0.9.6.zip in 
> the next release, then to something else, etc. 
> Possible solutions: it would be great to either
>  - rename `python/py4j-src-0.x.y.zip` to `python/py4j-src-latest.zip` or 
> `python/py4j-src-current.zip`
>  - or make a symlink `py4j-src-current.zip` in the Spark distribution pointing 
> to whatever version Spark ships with.
> In either case, if this were solved, we wouldn't have to adjust 
> PYTHONPATH during upgrades like Spark 2.2 to 2.3. 
> Thanks.






[jira] [Commented] (SPARK-23955) typo in parameter name 'rawPredicition'

2018-04-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434555#comment-16434555
 ] 

Apache Spark commented on SPARK-23955:
--

User 'codeforfun15' has created a pull request for this issue:
https://github.com/apache/spark/pull/21046

> typo in parameter name 'rawPredicition'
> ---
>
> Key: SPARK-23955
> URL: https://issues.apache.org/jira/browse/SPARK-23955
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: John Bauer
>Priority: Trivial
>
> The classifier.py MultilayerPerceptronClassifier.__init__ API call had the typo 
> rawPredicition instead of rawPrediction; the typo is 
> also present in the docs.






[jira] [Assigned] (SPARK-23955) typo in parameter name 'rawPredicition'

2018-04-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23955:


Assignee: Apache Spark

> typo in parameter name 'rawPredicition'
> ---
>
> Key: SPARK-23955
> URL: https://issues.apache.org/jira/browse/SPARK-23955
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: John Bauer
>Assignee: Apache Spark
>Priority: Trivial
>
> The classifier.py MultilayerPerceptronClassifier.__init__ API call had the typo 
> rawPredicition instead of rawPrediction; the typo is 
> also present in the docs.






[jira] [Assigned] (SPARK-23955) typo in parameter name 'rawPredicition'

2018-04-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23955:


Assignee: (was: Apache Spark)

> typo in parameter name 'rawPredicition'
> ---
>
> Key: SPARK-23955
> URL: https://issues.apache.org/jira/browse/SPARK-23955
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: John Bauer
>Priority: Trivial
>
> The classifier.py MultilayerPerceptronClassifier.__init__ API call had the typo 
> rawPredicition instead of rawPrediction; the typo is 
> also present in the docs.






[jira] [Created] (SPARK-23965) make python/py4j-src-0.x.y.zip file name Spark version-independent

2018-04-11 Thread Ruslan Dautkhanov (JIRA)
Ruslan Dautkhanov created SPARK-23965:
-

 Summary: make python/py4j-src-0.x.y.zip file name Spark 
version-independent
 Key: SPARK-23965
 URL: https://issues.apache.org/jira/browse/SPARK-23965
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.3.0, 2.2.1, 2.4.0
Reporter: Ruslan Dautkhanov


After each Spark release (which normally ships a slightly newer version 
of py4j), we have to adjust our PySpark applications' PYTHONPATH to point to 
the correct version of python/py4j-src-0.9.2.zip. 

For example, python/py4j-src-0.9.2.zip changes to python/py4j-src-0.9.6.zip in the 
next release, then to something else, etc. 

Possible solutions: it would be great to either
- rename `python/py4j-src-0.x.y.zip` to `python/py4j-src-latest.zip` or 
`python/py4j-src-current.zip` 
- make a symlink `py4j-src-current.zip` in the Spark distribution pointing to 
whatever version Spark ships with.

Thanks.






[jira] [Comment Edited] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values

2018-04-11 Thread Bruce Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431403#comment-16431403
 ] 

Bruce Robbins edited comment on SPARK-23715 at 4/11/18 8:09 PM:


I've been convinced this is worth fixing, at least for String input values, 
since a user was actually seeing wrong results despite specifying a datetime 
value with a UTC timezone.

One way to fix this is to create a new expression type for converting string 
values to timestamp values. The Analyzer would place this expression as a left 
child of FromUTCTimestamp, if needed. This new expression type would be more 
aware of FromUTCTimestamp's expectations than a general purpose Cast expression 
(for example, it could reject string datetime values that contain an explicit 
timezone).

Any opinions? [~cloud_fan] [~smilegator]?


was (Author: bersprockets):
I've been convinced this is worth fixing, at least for String input values, 
since a user was actually seeing wrong results despite specifying a datetime 
value with a UTC timezone.

One way to fix this is to create a new expression type for converting string 
values to timestamp values. The Analyzer would place this expression as a left 
child of FromUTCTimestamp, if needed. This new expression type would be more 
aware of FromUTCTimestamp's expectations than a general purpose Cast expression 
(for example, it could reject string datetime values that contain an explicit 
timezone).

Any opinions?

> from_utc_timestamp returns incorrect results for some UTC date/time values
> --
>
> Key: SPARK-23715
> URL: https://issues.apache.org/jira/browse/SPARK-23715
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bruce Robbins
>Priority: Major
>
> This produces the expected answer:
> {noformat}
> df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" 
> ).as("dt")).show
> +---+
> | dt|
> +---+
> |2018-03-13 07:18:23|
> +---+
> {noformat}
> However, the equivalent UTC input (but with an explicit timezone) produces a 
> wrong answer:
> {noformat}
> df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" 
> ).as("dt")).show
> +---+
> | dt|
> +---+
> |2018-03-13 00:18:23|
> +---+
> {noformat}
> Additionally, the equivalent Unix time (1520921903, which is also 
> "2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer:
> {noformat}
> df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" 
> ).as("dt")).show
> +---+
> | dt|
> +---+
> |2018-03-13 00:18:23|
> +---+
> {noformat}
> These issues stem from the fact that the FromUTCTimestamp expression, despite 
> its name, expects the input to be in the user's local timezone. There is some 
> magic under the covers to make things work (mostly) as the user expects.
> As an example, let's say a user in Los Angeles issues the following:
> {noformat}
> df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" 
> ).as("dt")).show
> {noformat}
> FromUTCTimestamp gets as input a Timestamp (long) value representing
> {noformat}
> 2018-03-13T06:18:23-07:00 (long value 1520947103000000)
> {noformat}
> What FromUTCTimestamp needs instead is
> {noformat}
> 2018-03-13T06:18:23+00:00 (long value 1520921903000000)
> {noformat}
> So, it applies the local timezone's offset to the input timestamp to get the 
> correct value (1520947103000000 minus 7 hours is 1520921903000000). Then it 
> can process the value and produce the expected output.
> When the user explicitly specifies a time zone, FromUTCTimestamp's 
> assumptions break down. The input is no longer in the local time zone. 
> Because of the way input data is implicitly casted, FromUTCTimestamp never 
> knows whether the input data had an explicit timezone.
> Here are some gory details:
> There is sometimes a mismatch in expectations between the (string => 
> timestamp) cast and FromUTCTimestamp. Also, since the FromUTCTimestamp 
> expression never sees the actual input string (the cast "intercepts" the 
> input and converts it to a long timestamp before FromUTCTimestamp uses the 
> value), FromUTCTimestamp cannot reject any input value that would exercise 
> this mismatch in expectations.
> There is a similar mismatch in expectations in the (integer => timestamp) 
> cast and FromUTCTimestamp. As a result, Unix time input almost always 
> produces incorrect output.
> h3. When things work as expected for String input:
> When from_utc_timestamp is passed a string time value with no time zone, 
> DateTimeUtils.stringToTimestamp (called from a Cast 

[jira] [Commented] (SPARK-23914) High-order function: array_union(x, y) → array

2018-04-11 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434494#comment-16434494
 ] 

Kazuaki Ishizaki commented on SPARK-23914:
--

I will work on this, thank you.

> High-order function: array_union(x, y) → array
> --
>
> Key: SPARK-23914
> URL: https://issues.apache.org/jira/browse/SPARK-23914
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns an array of the elements in the union of x and y, without duplicates.
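
A usage sketch once the proposed function is available (semantics per the Presto doc above; 
the exact Spark SQL signature is an assumption):

{code}
// Hypothetical usage of the proposed function; expected result per Presto semantics: [1, 2, 3]
spark.sql("SELECT array_union(array(1, 2), array(2, 3))").show()
{code}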






[jira] [Commented] (SPARK-23913) High-order function: array_intersect(x, y) → array

2018-04-11 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434491#comment-16434491
 ] 

Kazuaki Ishizaki commented on SPARK-23913:
--

I will work on this, thank you.

> High-order function: array_intersect(x, y) → array
> --
>
> Key: SPARK-23913
> URL: https://issues.apache.org/jira/browse/SPARK-23913
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns an array of the elements in the intersection of x and y, without 
> duplicates.
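
A usage sketch once the proposed function is available (semantics per the Presto doc above; 
the exact Spark SQL signature is an assumption):

{code}
// Hypothetical usage of the proposed function; expected result per Presto semantics: [2]
spark.sql("SELECT array_intersect(array(1, 2), array(2, 3))").show()
{code}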






[jira] [Commented] (SPARK-23964) why does Spillable wait for 32 elements?

2018-04-11 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434479#comment-16434479
 ] 

Thomas Graves commented on SPARK-23964:
---

I'm not sure. I'm trying to figure out whether there are performance implications 
here; perhaps there are, but they come at the cost of not being accurate about memory 
usage. In deployments with fixed-size containers this is very important. 
If you wait for 32 elements, it may cause you to acquire a bigger chunk of memory at 
once instead of several smaller allocations.

I would think the only check you need is currentMemory >= myMemoryThreshold. 
The initial threshold is 5MB right now, but all it does is ask for more 
memory; only when it can't get memory does it spill. And the initial threshold 
is configurable, so you can always make it bigger. 

I'm going to run some performance tests to see what happens, but I would 
like to know if anyone has other background.  
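
A minimal sketch of the two checks being compared (illustrative only, not the actual 
Spillable code):

{code}
// Current gate quoted in the issue: only every 32nd record is even eligible to spill.
def currentGate(elementsRead: Long, currentMemory: Long, threshold: Long): Boolean =
  elementsRead % 32 == 0 && currentMemory >= threshold

// Alternative suggested above: check the threshold on every record, trading a little
// bookkeeping overhead for more accurate memory accounting.
def proposedGate(currentMemory: Long, threshold: Long): Boolean =
  currentMemory >= threshold
{code}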

> why does Spillable wait for 32 elements?
> 
>
> Key: SPARK-23964
> URL: https://issues.apache.org/jira/browse/SPARK-23964
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Thomas Graves
>Priority: Major
>
> The Spillable class has a check in maybeSpill that decides when it tries to acquire 
> more memory and whether it should spill:
> if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
> before it looks to see if it should spill.  
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/Spillable.scala#L83]
> I'm wondering why it has the elementsRead % 32 in it.  If I have a small 
> number of huge elements, this can easily cause an OOM before we actually 
> spill.  
> I saw a few conversations on this and one related Jira: 
> https://issues.apache.org/jira/browse/SPARK-4456 , but I've never seen an 
> answer to this.
> Does anyone have history on this?






[jira] [Assigned] (SPARK-23931) High-order function: zip(array1, array2[, ...]) → array

2018-04-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23931:


Assignee: Apache Spark

> High-order function: zip(array1, array2[, ...]) → array
> 
>
> Key: SPARK-23931
> URL: https://issues.apache.org/jira/browse/SPARK-23931
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Merges the given arrays, element-wise, into a single array of rows. The M-th 
> element of the N-th argument will be the N-th field of the M-th output 
> element. If the arguments have an uneven length, missing values are filled 
> with NULL.
> {noformat}
> SELECT zip(ARRAY[1, 2], ARRAY['1b', null, '3b']); -- [ROW(1, '1b'), ROW(2, 
> null), ROW(null, '3b')]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23931) High-order function: zip(array1, array2[, ...]) → array

2018-04-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434475#comment-16434475
 ] 

Apache Spark commented on SPARK-23931:
--

User 'DylanGuedes' has created a pull request for this issue:
https://github.com/apache/spark/pull/21045

> High-order function: zip(array1, array2[, ...]) → array
> 
>
> Key: SPARK-23931
> URL: https://issues.apache.org/jira/browse/SPARK-23931
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Merges the given arrays, element-wise, into a single array of rows. The M-th 
> element of the N-th argument will be the N-th field of the M-th output 
> element. If the arguments have an uneven length, missing values are filled 
> with NULL.
> {noformat}
> SELECT zip(ARRAY[1, 2], ARRAY['1b', null, '3b']); -- [ROW(1, '1b'), ROW(2, 
> null), ROW(null, '3b')]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23931) High-order function: zip(array1, array2[, ...]) → array

2018-04-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23931:


Assignee: (was: Apache Spark)

> High-order function: zip(array1, array2[, ...]) → array
> 
>
> Key: SPARK-23931
> URL: https://issues.apache.org/jira/browse/SPARK-23931
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Merges the given arrays, element-wise, into a single array of rows. The M-th 
> element of the N-th argument will be the N-th field of the M-th output 
> element. If the arguments have an uneven length, missing values are filled 
> with NULL.
> {noformat}
> SELECT zip(ARRAY[1, 2], ARRAY['1b', null, '3b']); -- [ROW(1, '1b'), ROW(2, 
> null), ROW(null, '3b')]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23964) why does Spillable wait for 32 elements?

2018-04-11 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434460#comment-16434460
 ] 

Reynold Xin commented on SPARK-23964:
-

Was it trying to reduce overhead?

 

> why does Spillable wait for 32 elements?
> 
>
> Key: SPARK-23964
> URL: https://issues.apache.org/jira/browse/SPARK-23964
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Thomas Graves
>Priority: Major
>
> The spillable class has a check in maybeSpill as to when it tries to acquire 
> more memory and determine if it should spill:
> if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
> Before it looks to see if it should spill.  
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/Spillable.scala#L83]
> I'm wondering why it has the elementsRead %32  in it?  If I have a small 
> number of elements that are huge this can easily cause OOM before we actually 
> spill.  
> I saw a few conversations on this and one Jira related: 
> https://issues.apache.org/jira/browse/SPARK-4456 . but I've never seen an 
> answer to this.
> anyone have history on this?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23964) why does Spillable wait for 32 elements?

2018-04-11 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-23964:
--
Description: 
The spillable class has a check in maybeSpill as to when it tries to acquire 
more memory and determine if it should spill:

if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {

Before it looks to see if it should spill.  

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/Spillable.scala#L83]

I'm wondering why it has the elementsRead %32  in it?  If I have a small number 
of elements that are huge this can easily cause OOM before we actually spill.  

I saw a few conversations on this and one Jira related: 
https://issues.apache.org/jira/browse/SPARK-4456 . but I've never seen an 
answer to this.

anyone have history on this?

  was:
The spillable class has a check:

if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {

Before it looks to see if it should spill.  

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/Spillable.scala#L83]

I'm wondering why it has the elementsRead %32  in it?  If I have a small number 
of elements that are huge this can easily cause OOM before we actually spill.  

I saw a few conversations on this and one Jira related: 
https://issues.apache.org/jira/browse/SPARK-4456 . but I've never seen an 
answer to this.

anyone have history on this?


> why does Spillable wait for 32 elements?
> 
>
> Key: SPARK-23964
> URL: https://issues.apache.org/jira/browse/SPARK-23964
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Thomas Graves
>Priority: Major
>
> The spillable class has a check in maybeSpill as to when it tries to acquire 
> more memory and determine if it should spill:
> if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
> Before it looks to see if it should spill.  
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/Spillable.scala#L83]
> I'm wondering why it has the elementsRead %32  in it?  If I have a small 
> number of elements that are huge this can easily cause OOM before we actually 
> spill.  
> I saw a few conversations on this and one Jira related: 
> https://issues.apache.org/jira/browse/SPARK-4456 . but I've never seen an 
> answer to this.
> anyone have history on this?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23964) why does Spillable wait for 32 elements?

2018-04-11 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434454#comment-16434454
 ] 

Thomas Graves commented on SPARK-23964:
---

[~andrewor14]  [~matei] [~r...@databricks.com]

 

A few related threads:

[https://github.com/apache/spark/pull/3302]

[https://github.com/apache/spark/pull/3656]

https://github.com/apache/spark/commit/3be92cdac30cf488e09dbdaaa70e5c4cdaa9a099

> why does Spillable wait for 32 elements?
> 
>
> Key: SPARK-23964
> URL: https://issues.apache.org/jira/browse/SPARK-23964
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Thomas Graves
>Priority: Major
>
> The spillable class has a check:
> if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
> Before it looks to see if it should spill.  
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/Spillable.scala#L83]
> I'm wondering why it has the elementsRead %32  in it?  If I have a small 
> number of elements that are huge this can easily cause OOM before we actually 
> spill.  
> I saw a few conversations on this and one Jira related: 
> https://issues.apache.org/jira/browse/SPARK-4456 . but I've never seen an 
> answer to this.
> anyone have history on this?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9312) The OneVsRest model does not provide confidence factor(not probability) along with the prediction

2018-04-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434447#comment-16434447
 ] 

Apache Spark commented on SPARK-9312:
-

User 'ludatabricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/21044

> The OneVsRest model does not provide confidence factor(not probability) along 
> with the prediction
> -
>
> Key: SPARK-9312
> URL: https://issues.apache.org/jira/browse/SPARK-9312
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.4.0, 1.4.1
>Reporter: Badari Madhav
>Priority: Major
>  Labels: features
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23964) why does Spillable wait for 32 elements?

2018-04-11 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-23964:
--
Description: 
The spillable class has a check:

if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {

Before it looks to see if it should spill.  

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/Spillable.scala#L83]

I'm wondering why it has the elementsRead %32  in it?  If I have a small number 
of elements that are huge this can easily cause OOM before we actually spill.  

I saw a few conversations on this and one Jira related: 
https://issues.apache.org/jira/browse/SPARK-4456 . but I've never seen an 
answer to this.

anyone have history on this?

> why does Spillable wait for 32 elements?
> 
>
> Key: SPARK-23964
> URL: https://issues.apache.org/jira/browse/SPARK-23964
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Thomas Graves
>Priority: Major
>
> The spillable class has a check:
> if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
> Before it looks to see if it should spill.  
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/Spillable.scala#L83]
> I'm wondering why it has the elementsRead %32  in it?  If I have a small 
> number of elements that are huge this can easily cause OOM before we actually 
> spill.  
> I saw a few conversations on this and one Jira related: 
> https://issues.apache.org/jira/browse/SPARK-4456 . but I've never seen an 
> answer to this.
> anyone have history on this?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23964) why does Spillable wait for 32 elements?

2018-04-11 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-23964:
--
Environment: (was: The spillable class has a check:

if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {

Before it looks to see if it should spill.  

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/Spillable.scala#L83]

I'm wondering why it has the elementsRead %32  in it?  If I have a small number 
of elements that are huge this can easily cause OOM before we actually spill.  

I saw a few conversations on this and one Jira related: 
https://issues.apache.org/jira/browse/SPARK-4456 . but I've never seen an 
answer to this.

anyone have history on this?)

> why does Spillable wait for 32 elements?
> 
>
> Key: SPARK-23964
> URL: https://issues.apache.org/jira/browse/SPARK-23964
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Thomas Graves
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23964) why does Spillable wait for 32 elements?

2018-04-11 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-23964:
-

 Summary: why does Spillable wait for 32 elements?
 Key: SPARK-23964
 URL: https://issues.apache.org/jira/browse/SPARK-23964
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.1
 Environment: The spillable class has a check:

if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {

Before it looks to see if it should spill.  

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/Spillable.scala#L83]

I'm wondering why it has the elementsRead %32  in it?  If I have a small number 
of elements that are huge this can easily cause OOM before we actually spill.  

I saw a few conversations on this and one Jira related: 
https://issues.apache.org/jira/browse/SPARK-4456 . but I've never seen an 
answer to this.

anyone have history on this?
Reporter: Thomas Graves






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature, A-M

2018-04-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-22883.
---
   Resolution: Fixed
Fix Version/s: 2.3.1

Issue resolved by pull request 21042
[https://github.com/apache/spark/pull/21042]

> ML test for StructuredStreaming: spark.ml.feature, A-M
> --
>
> Key: SPARK-22883
> URL: https://issues.apache.org/jira/browse/SPARK-22883
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
> Fix For: 2.4.0, 2.3.1
>
>
> *For featurizers with names from A - M*
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23961) pyspark toLocalIterator throws an exception

2018-04-11 Thread Michel Lemay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michel Lemay updated SPARK-23961:
-
Description: 
Given a DataFrame, use toLocalIterator. If we do not consume all records, it 
will throw:
{quote}ERROR PythonRDD: Error while sending iterator
 java.net.SocketException: Connection reset by peer: socket write error
 at java.net.SocketOutputStream.socketWrite0(Native Method)
 at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
 at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
 at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
 at java.io.DataOutputStream.write(DataOutputStream.java:107)
 at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
 at 
org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:497)
 at 
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509)
 at 
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509)
 at scala.collection.Iterator$class.foreach(Iterator.scala:893)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
 at 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
 at 
org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:705)
 at 
org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705)
 at 
org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
 at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:706)
{quote}
 

To reproduce, here is a simple pyspark shell script that shows the error:
{quote}import itertools
df = spark.read.parquet("large parquet folder").cache()
print(df.count())
b = df.toLocalIterator()
print(len(list(itertools.islice(b, 20))))
b = None  # Make the iterator go out of scope.  Throws here.
{quote}
 

Observations:
 * Consuming all records does not throw; taking only a subset of the partitions 
creates the error.
 * In another experiment, doing the same on a regular RDD works if we 
cache/materialize it. If we do not cache the RDD, it throws similarly.
 * It works in the Scala shell (see the sketch below).
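
As a rough Scala-shell analogue of the repro above (run in spark-shell, where a
SparkSession named spark is available; the parquet path is the same placeholder
used in the script):

{noformat}
import scala.collection.JavaConverters._

val df = spark.read.parquet("large parquet folder")
val it = df.toLocalIterator().asScala   // java.util.Iterator[Row] -> Scala Iterator
val first20 = it.take(20).toList        // consume only part of the iterator
println(first20.length)
{noformat}

Per the observation above, partially consuming the iterator this way does not
produce the "Connection reset" error on the Scala side.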

 

  was:
Given a dataframe, take it's rdd and use toLocalIterator. If we do not consume 
all records, it will throw: 
{quote}ERROR PythonRDD: Error while sending iterator
java.net.SocketException: Connection reset by peer: socket write error
 at java.net.SocketOutputStream.socketWrite0(Native Method)
 at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
 at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
 at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
 at java.io.DataOutputStream.write(DataOutputStream.java:107)
 at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
 at 
org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:497)
 at 
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509)
 at 
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509)
 at scala.collection.Iterator$class.foreach(Iterator.scala:893)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
 at 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
 at 
org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:705)
 at 
org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705)
 at 
org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
 at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:706)
{quote}
 

To reproduce, here is a simple pyspark shell script that show the error:
{quote}import itertools
df = spark.read.parquet("large parquet folder")
cachedRDD = df.rdd.cache()
print(cachedRDD.count()) # materialize
b = cachedRDD.toLocalIterator()
print(len(list(itertools.islice(b, 20))))
b = None # Make the iterator goes out of scope.  Throws here.
{quote}
 

Observations:
 * Consuming all records do not throw.  Taking only a subset of the partitions 
create the error.
 * In another experiment, doing the same on a regular RDD works if we 
cache/materialize it. If we do not cache the RDD, it throws similarly.
 * It works in scala shell

 


> pyspark toLocalIterator throws an exception
> ---
>
> Key: SPARK-23961
> URL: https://issues.apache.org/jira/browse/SPARK-23961
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
>

[jira] [Assigned] (SPARK-23963) Queries on text-based Hive tables grow disproportionately slower as the number of columns increase

2018-04-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23963:


Assignee: (was: Apache Spark)

> Queries on text-based Hive tables grow disproportionately slower as the 
> number of columns increase
> --
>
> Key: SPARK-23963
> URL: https://issues.apache.org/jira/browse/SPARK-23963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> TableReader gets disproportionately slower as the number of columns in the 
> query increase.
> For example, reading a table with 6000 columns is 4 times more expensive per 
> record than reading a table with 3000 columns, rather than twice as expensive.
> The increase in processing time is due to several Lists (fieldRefs, 
> fieldOrdinals, and unwrappers), each of which the reader accesses by column 
> number for each column in a record. Because each List has O\(n\) time for 
> lookup by column number, these lookups grow increasingly expensive as the 
> column count increases.
> When I patched the code to change those 3 Lists to Arrays, the query times 
> became proportional.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23963) Queries on text-based Hive tables grow disproportionately slower as the number of columns increase

2018-04-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434265#comment-16434265
 ] 

Apache Spark commented on SPARK-23963:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/21043

> Queries on text-based Hive tables grow disproportionately slower as the 
> number of columns increase
> --
>
> Key: SPARK-23963
> URL: https://issues.apache.org/jira/browse/SPARK-23963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> TableReader gets disproportionately slower as the number of columns in the 
> query increase.
> For example, reading a table with 6000 columns is 4 times more expensive per 
> record than reading a table with 3000 columns, rather than twice as expensive.
> The increase in processing time is due to several Lists (fieldRefs, 
> fieldOrdinals, and unwrappers), each of which the reader accesses by column 
> number for each column in a record. Because each List has O\(n\) time for 
> lookup by column number, these lookups grow increasingly expensive as the 
> column count increases.
> When I patched the code to change those 3 Lists to Arrays, the query times 
> became proportional.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23963) Queries on text-based Hive tables grow disproportionately slower as the number of columns increase

2018-04-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23963:


Assignee: Apache Spark

> Queries on text-based Hive tables grow disproportionately slower as the 
> number of columns increase
> --
>
> Key: SPARK-23963
> URL: https://issues.apache.org/jira/browse/SPARK-23963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bruce Robbins
>Assignee: Apache Spark
>Priority: Minor
>
> TableReader gets disproportionately slower as the number of columns in the 
> query increase.
> For example, reading a table with 6000 columns is 4 times more expensive per 
> record than reading a table with 3000 columns, rather than twice as expensive.
> The increase in processing time is due to several Lists (fieldRefs, 
> fieldOrdinals, and unwrappers), each of which the reader accesses by column 
> number for each column in a record. Because each List has O\(n\) time for 
> lookup by column number, these lookups grow increasingly expensive as the 
> column count increases.
> When I patched the code to change those 3 Lists to Arrays, the query times 
> became proportional.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23948) Trigger mapstage's job listener in submitMissingTasks

2018-04-11 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-23948:
-
Component/s: Scheduler

> Trigger mapstage's job listener in submitMissingTasks
> -
>
> Key: SPARK-23948
> URL: https://issues.apache.org/jira/browse/SPARK-23948
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler, Spark Core
>Affects Versions: 2.3.0
>Reporter: jin xing
>Priority: Major
>
> When SparkContext submits a map stage via "submitMapStage" to the DAGScheduler, 
> "markMapStageJobAsFinished" is called only in 
> (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L933
>  and   
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1314).
> But consider the following scenario:
> 1. stage0 and stage1 are both "ShuffleMapStage"s, and stage1 depends on stage0;
> 2. we submit stage1 via "submitMapStage";
> 3. while stage1 is running, a "FetchFailed" happens, and stage0 and stage1 get 
> resubmitted as stage0_1 and stage1_1;
> 4. while stage0_1 is running, speculative tasks from the old stage1 come back as 
> succeeded, but stage1 is not in "runningStages". So even though all splits 
> (including the speculative tasks) in stage1 succeeded, stage1's job listener is 
> not called;
> 5. stage0_1 finishes and stage1_1 starts running. In "submitMissingTasks" there 
> are no missing tasks, but in the current code the job listener is not triggered.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature, A-M

2018-04-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22883:
--
Fix Version/s: 2.4.0

> ML test for StructuredStreaming: spark.ml.feature, A-M
> --
>
> Key: SPARK-22883
> URL: https://issues.apache.org/jira/browse/SPARK-22883
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
> Fix For: 2.4.0
>
>
> *For featurizers with names from A - M*
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature, A-M

2018-04-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22883:
--
Target Version/s: 2.3.1, 2.4.0

> ML test for StructuredStreaming: spark.ml.feature, A-M
> --
>
> Key: SPARK-22883
> URL: https://issues.apache.org/jira/browse/SPARK-22883
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
> Fix For: 2.4.0
>
>
> *For featurizers with names from A - M*
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature, A-M

2018-04-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434238#comment-16434238
 ] 

Apache Spark commented on SPARK-22883:
--

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/21042

> ML test for StructuredStreaming: spark.ml.feature, A-M
> --
>
> Key: SPARK-22883
> URL: https://issues.apache.org/jira/browse/SPARK-22883
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
>
> *For featurizers with names from A - M*
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23734) InvalidSchemaException While Saving ALSModel

2018-04-11 Thread Stanley Poon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434219#comment-16434219
 ] 

Stanley Poon edited comment on SPARK-23734 at 4/11/18 4:58 PM:
---

[~viirya] Thank you for checking into this. I added the Spark release details 
where this is reproducible - v2.3.0-rc5 released on Feb 22, 2018.

And will verify that it is fixed in the next release.


was (Author: spoon):
[~viirya] Thank you for checking into this. I added the Spark release details 
where this is reproducible. And will verify that it is fixed in the next 
release.

> InvalidSchemaException While Saving ALSModel
> 
>
> Key: SPARK-23734
> URL: https://issues.apache.org/jira/browse/SPARK-23734
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
> Environment: macOS 10.13.2
> Scala 2.11.8
> Spark 2.3.0  v2.3.0-rc5
>Reporter: Stanley Poon
>Priority: Major
>  Labels: ALS, parquet, persistence
>
> After fitting an ALSModel, get following error while saving the model:
> Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can 
> not be empty. Parquet does not support empty group without leaves. Empty 
> group: spark_schema
> Exactly the same code ran ok on 2.2.1.
> Same issue also occurs on other ALSModels we have.
> h2. *To reproduce*
> Get ALSExample: 
> [https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/ALSExample.scala]
>  and add the following line to save the model right before "spark.stop".
> {quote}   model.write.overwrite().save("SparkExampleALSModel") 
> {quote}
> h2. Stack Trace
> Exception in thread "main" java.lang.ExceptionInInitializerError
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.setSchema(ParquetWriteSupport.scala:444)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.prepareWrite(ParquetFileFormat.scala:112)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:140)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
> at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
> at 
> org.apache.spark.ml.recommendation.ALSModel$ALSModelWriter.saveImpl(ALS.scala:510)
> at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:103)
> at com.vitalmove.model.ALSExample$.main(ALSExample.scala:83)
> at com.vitalmove.model.ALSExample.main(ALSExample.scala)
> Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can 
> not be empty. Parquet does not support empty group without leaves. Empty 
> group: spark_schema
> at org.apache.parquet.schema.GroupType.<init>(GroupType.java:92)
> at org.apache.parquet.schema.GroupType.<init>(GroupType.java:48)
> at 

[jira] [Updated] (SPARK-23734) InvalidSchemaException While Saving ALSModel

2018-04-11 Thread Stanley Poon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanley Poon updated SPARK-23734:
-
Environment: 
macOS 10.13.2

Scala 2.11.8

Spark 2.3.0  v2.3.0-rc5 (Feb 22 2018)

  was:
macOS 10.13.2

Scala 2.11.8

Spark 2.3.0  v2.3.0-rc5


> InvalidSchemaException While Saving ALSModel
> 
>
> Key: SPARK-23734
> URL: https://issues.apache.org/jira/browse/SPARK-23734
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
> Environment: macOS 10.13.2
> Scala 2.11.8
> Spark 2.3.0  v2.3.0-rc5 (Feb 22 2018)
>Reporter: Stanley Poon
>Priority: Major
>  Labels: ALS, parquet, persistence
>
> After fitting an ALSModel, get following error while saving the model:
> Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can 
> not be empty. Parquet does not support empty group without leaves. Empty 
> group: spark_schema
> Exactly the same code ran ok on 2.2.1.
> Same issue also occurs on other ALSModels we have.
> h2. *To reproduce*
> Get ALSExample: 
> [https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/ALSExample.scala]
>  and add the following line to save the model right before "spark.stop".
> {quote}   model.write.overwrite().save("SparkExampleALSModel") 
> {quote}
> h2. Stack Trace
> Exception in thread "main" java.lang.ExceptionInInitializerError
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.setSchema(ParquetWriteSupport.scala:444)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.prepareWrite(ParquetFileFormat.scala:112)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:140)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
> at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
> at 
> org.apache.spark.ml.recommendation.ALSModel$ALSModelWriter.saveImpl(ALS.scala:510)
> at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:103)
> at com.vitalmove.model.ALSExample$.main(ALSExample.scala:83)
> at com.vitalmove.model.ALSExample.main(ALSExample.scala)
> Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can 
> not be empty. Parquet does not support empty group without leaves. Empty 
> group: spark_schema
> at org.apache.parquet.schema.GroupType.<init>(GroupType.java:92)
> at org.apache.parquet.schema.GroupType.<init>(GroupType.java:48)
> at org.apache.parquet.schema.MessageType.<init>(MessageType.java:50)
> at org.apache.parquet.schema.Types$MessageTypeBuilder.named(Types.java:1256)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.<init>(ParquetSchemaConverter.scala:567)
> at 
> 

[jira] [Commented] (SPARK-23734) InvalidSchemaException While Saving ALSModel

2018-04-11 Thread Stanley Poon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434219#comment-16434219
 ] 

Stanley Poon commented on SPARK-23734:
--

[~viirya] Thank you for checking into this. I added the Spark release details 
where this is reproducible. And will verify that it is fixed in the next 
release.

> InvalidSchemaException While Saving ALSModel
> 
>
> Key: SPARK-23734
> URL: https://issues.apache.org/jira/browse/SPARK-23734
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
> Environment: macOS 10.13.2
> Scala 2.11.8
> Spark 2.3.0  v2.3.0-rc5
>Reporter: Stanley Poon
>Priority: Major
>  Labels: ALS, parquet, persistence
>
> After fitting an ALSModel, get following error while saving the model:
> Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can 
> not be empty. Parquet does not support empty group without leaves. Empty 
> group: spark_schema
> Exactly the same code ran ok on 2.2.1.
> Same issue also occurs on other ALSModels we have.
> h2. *To reproduce*
> Get ALSExample: 
> [https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/ALSExample.scala]
>  and add the following line to save the model right before "spark.stop".
> {quote}   model.write.overwrite().save("SparkExampleALSModel") 
> {quote}
> h2. Stack Trace
> Exception in thread "main" java.lang.ExceptionInInitializerError
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.setSchema(ParquetWriteSupport.scala:444)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.prepareWrite(ParquetFileFormat.scala:112)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:140)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
> at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
> at 
> org.apache.spark.ml.recommendation.ALSModel$ALSModelWriter.saveImpl(ALS.scala:510)
> at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:103)
> at com.vitalmove.model.ALSExample$.main(ALSExample.scala:83)
> at com.vitalmove.model.ALSExample.main(ALSExample.scala)
> Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can 
> not be empty. Parquet does not support empty group without leaves. Empty 
> group: spark_schema
> at org.apache.parquet.schema.GroupType.<init>(GroupType.java:92)
> at org.apache.parquet.schema.GroupType.<init>(GroupType.java:48)
> at org.apache.parquet.schema.MessageType.<init>(MessageType.java:50)
> at org.apache.parquet.schema.Types$MessageTypeBuilder.named(Types.java:1256)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.<init>(ParquetSchemaConverter.scala:567)
> at 
> 

[jira] [Updated] (SPARK-23734) InvalidSchemaException While Saving ALSModel

2018-04-11 Thread Stanley Poon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanley Poon updated SPARK-23734:
-
Environment: 
macOS 10.13.2

Scala 2.11.8

Spark 2.3.0  v2.3.0-rc5

  was:
macOS 10.13.2

Scala 2.11.8

Spark 2.3.0


> InvalidSchemaException While Saving ALSModel
> 
>
> Key: SPARK-23734
> URL: https://issues.apache.org/jira/browse/SPARK-23734
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
> Environment: macOS 10.13.2
> Scala 2.11.8
> Spark 2.3.0  v2.3.0-rc5
>Reporter: Stanley Poon
>Priority: Major
>  Labels: ALS, parquet, persistence
>
> After fitting an ALSModel, get following error while saving the model:
> Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can 
> not be empty. Parquet does not support empty group without leaves. Empty 
> group: spark_schema
> Exactly the same code ran ok on 2.2.1.
> Same issue also occurs on other ALSModels we have.
> h2. *To reproduce*
> Get ALSExample: 
> [https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/ALSExample.scala]
>  and add the following line to save the model right before "spark.stop".
> {quote}   model.write.overwrite().save("SparkExampleALSModel") 
> {quote}
> h2. Stack Trace
> Exception in thread "main" java.lang.ExceptionInInitializerError
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.setSchema(ParquetWriteSupport.scala:444)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.prepareWrite(ParquetFileFormat.scala:112)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:140)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
> at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
> at 
> org.apache.spark.ml.recommendation.ALSModel$ALSModelWriter.saveImpl(ALS.scala:510)
> at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:103)
> at com.vitalmove.model.ALSExample$.main(ALSExample.scala:83)
> at com.vitalmove.model.ALSExample.main(ALSExample.scala)
> Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can 
> not be empty. Parquet does not support empty group without leaves. Empty 
> group: spark_schema
> at org.apache.parquet.schema.GroupType.<init>(GroupType.java:92)
> at org.apache.parquet.schema.GroupType.<init>(GroupType.java:48)
> at org.apache.parquet.schema.MessageType.<init>(MessageType.java:50)
> at org.apache.parquet.schema.Types$MessageTypeBuilder.named(Types.java:1256)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.<init>(ParquetSchemaConverter.scala:567)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.<clinit>(ParquetSchemaConverter.scala)
>  



--
This 

[jira] [Commented] (SPARK-23936) High-order function: map_concat(map1<K, V>, map2<K, V>, ..., mapN<K, V>) → map<K,V>

2018-04-11 Thread Marek Novotny (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434185#comment-16434185
 ] 

Marek Novotny commented on SPARK-23936:
---

Shouldn't we overload the _concat_ function for maps instead of introducing 
_map_concat_? 
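
For context, the last-one-wins union described in the ticket matches what plain
Scala map concatenation already does; a minimal illustration (Scala collections
only, not the proposed SQL function):

{noformat}
val m1 = Map("a" -> 1, "b" -> 2)
val m2 = Map("b" -> 20, "c" -> 3)
val merged = m1 ++ m2   // Map(a -> 1, b -> 20, c -> 3): "b" comes from the last map
{noformat}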

> High-order function: map_concat(map1<K, V>, map2<K, V>, ..., mapN<K, V>) → 
> map<K,V>
> ---
>
> Key: SPARK-23936
> URL: https://issues.apache.org/jira/browse/SPARK-23936
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref:  https://prestodb.io/docs/current/functions/map.html
> Returns the union of all the given maps. If a key is found in multiple given 
> maps, that key’s value in the resulting map comes from the last one of those 
> maps.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23963) Queries on text-based Hive tables grow disproportionately slower as the number of columns increase

2018-04-11 Thread Bruce Robbins (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-23963:
--
Description: 
TableReader gets disproportionately slower as the number of columns in the 
query increase.

For example, reading a table with 6000 columns is 4 times more expensive per 
record than reading a table with 3000 columns, rather than twice as expensive.

The increase in processing time is due to several Lists (fieldRefs, 
fieldOrdinals, and unwrappers), each of which the reader accesses by column 
number for each column in a record. Because each List has O(n) time for lookup 
by column number, these lookups grow increasingly expensive as the column count 
increases.

When I patched the code to change those 3 Lists to Arrays, the query times 
became proportional.

 

 

 

 

  was:
TableReader gets disproportionately slower as the number of columns in the 
query increase.

For example, reading a table with 6000 columns is 4 times more expensive per 
record than reading a table with 3000 columns, rather than twice as expensive.

The increase in processing time is due to several Lists (fieldRefs, 
fieldOrdinals, and unwrappers), each of which the reader accesses by column 
number for each column in a record. Because each List has O(n) time for lookup 
by column number, these lookups grow increasingly expensive as the column count 
increases.

When I patched the code to change those 3 Lists to Arrays, the query times 
became proportional.

 

 

 

 


> Queries on text-based Hive tables grow disproportionately slower as the 
> number of columns increase
> --
>
> Key: SPARK-23963
> URL: https://issues.apache.org/jira/browse/SPARK-23963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> TableReader gets disproportionately slower as the number of columns in the 
> query increase.
> For example, reading a table with 6000 columns is 4 times more expensive per 
> record than reading a table with 3000 columns, rather than twice as expensive.
> The increase in processing time is due to several Lists (fieldRefs, 
> fieldOrdinals, and unwrappers), each of which the reader accesses by column 
> number for each column in a record. Because each List has O(n) time for 
> lookup by column number, these lookups grow increasingly expensive as the 
> column count increases.
> When I patched the code to change those 3 Lists to Arrays, the query times 
> became proportional.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23963) Queries on text-based Hive tables grow disproportionately slower as the number of columns increase

2018-04-11 Thread Bruce Robbins (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-23963:
--
Description: 
TableReader gets disproportionately slower as the number of columns in the 
query increase.

For example, reading a table with 6000 columns is 4 times more expensive per 
record than reading a table with 3000 columns, rather than twice as expensive.

The increase in processing time is due to several Lists (fieldRefs, 
fieldOrdinals, and unwrappers), each of which the reader accesses by column 
number for each column in a record. Because each List has O\(n\) time for 
lookup by column number, these lookups grow increasingly expensive as the 
column count increases.

When I patched the code to change those 3 Lists to Arrays, the query times 
became proportional.

 

 

 

 

  was:
TableReader gets disproportionately slower as the number of columns in the 
query increase.

For example, reading a table with 6000 columns is 4 times more expensive per 
record than reading a table with 3000 columns, rather than twice as expensive.

The increase in processing time is due to several Lists (fieldRefs, 
fieldOrdinals, and unwrappers), each of which the reader accesses by column 
number for each column in a record. Because each List has O(n) time for lookup 
by column number, these lookups grow increasingly expensive as the column count 
increases.

When I patched the code to change those 3 Lists to Arrays, the query times 
became proportional.

 

 

 

 


> Queries on text-based Hive tables grow disproportionately slower as the 
> number of columns increase
> --
>
> Key: SPARK-23963
> URL: https://issues.apache.org/jira/browse/SPARK-23963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> TableReader gets disproportionately slower as the number of columns in the 
> query increase.
> For example, reading a table with 6000 columns is 4 times more expensive per 
> record than reading a table with 3000 columns, rather than twice as expensive.
> The increase in processing time is due to several Lists (fieldRefs, 
> fieldOrdinals, and unwrappers), each of which the reader accesses by column 
> number for each column in a record. Because each List has O\(n\) time for 
> lookup by column number, these lookups grow increasingly expensive as the 
> column count increases.
> When I patched the code to change those 3 Lists to Arrays, the query times 
> became proportional.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23963) Queries on text-based Hive tables grow disproportionately slower as the number of columns increase

2018-04-11 Thread Bruce Robbins (JIRA)
Bruce Robbins created SPARK-23963:
-

 Summary: Queries on text-based Hive tables grow disproportionately 
slower as the number of columns increase
 Key: SPARK-23963
 URL: https://issues.apache.org/jira/browse/SPARK-23963
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Bruce Robbins


TableReader gets disproportionately slower as the number of columns in the 
query increases.

For example, reading a table with 6000 columns is 4 times more expensive per 
record than reading a table with 3000 columns, rather than twice as expensive.

The increase in processing time is due to several Lists (fieldRefs, 
fieldOrdinals, and unwrappers), each of which the reader accesses by column 
number for each column in a record. Because each List has O(n) time for lookup 
by column number, these lookups grow increasingly expensive as the column count 
increases.

When I patched the code to change those 3 Lists to Arrays, the query times 
became proportional to the number of columns.
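
To make the complexity claim concrete, here is a minimal, self-contained sketch (illustrative only; these are not the TableReader internals, and the sizes are arbitrary) showing why positional access on a Scala List degrades with the column count while an Array does not:

{code:scala}
// Illustrative benchmark only: a Scala List has O(n) positional access, an Array O(1).
object ListVsArrayLookup {
  def time[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body
    println(f"$label%-6s ${(System.nanoTime() - start) / 1e6}%.1f ms")
    result
  }

  def main(args: Array[String]): Unit = {
    val numColumns = 6000
    val numRecords = 200
    val fieldRefsList: List[Int] = (0 until numColumns).toList
    val fieldRefsArray: Array[Int] = fieldRefsList.toArray

    // Simulates the reader touching every column of every record by ordinal.
    time("List") {
      var sum = 0L
      for (_ <- 0 until numRecords; i <- 0 until numColumns) sum += fieldRefsList(i)  // walks i nodes each time
      sum
    }
    time("Array") {
      var sum = 0L
      for (_ <- 0 until numRecords; i <- 0 until numColumns) sum += fieldRefsArray(i) // constant-time index
      sum
    }
  }
}
{code}

With 6000 columns the List version walks thousands of nodes on an average lookup, which is exactly the superlinear growth described above; the Array version stays flat per access.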

 

 

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23962) Flaky tests from SQLMetricsTestUtils.currentExecutionIds

2018-04-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434077#comment-16434077
 ] 

Apache Spark commented on SPARK-23962:
--

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/21041

> Flaky tests from SQLMetricsTestUtils.currentExecutionIds
> 
>
> Key: SPARK-23962
> URL: https://issues.apache.org/jira/browse/SPARK-23962
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Minor
> Attachments: unit-tests.log
>
>
> I've seen 
> {{org.apache.spark.sql.execution.metric.SQLMetricsSuite.SortMergeJoin 
> metrics}} fail 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4150/testReport/org.apache.spark.sql.execution.metric/SQLMetricsSuite/SortMergeJoin_metrics/
>  
> with
> {noformat}
> Error Message
> org.scalatest.exceptions.TestFailedException: 2 did not equal 1
> Stacktrace
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 2 did 
> not equal 1
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetricsTestUtils$class.getSparkPlanMetrics(SQLMetricsTestUtils.scala:146)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetricsSuite.getSparkPlanMetrics(SQLMetricsSuite.scala:33)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetricsTestUtils$class.testSparkPlanMetrics(SQLMetricsTestUtils.scala:187)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetricsSuite.testSparkPlanMetrics(SQLMetricsSuite.scala:33)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetricsSuite$$anonfun$7$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SQLMetricsSuite.scala:188)
>   at 
> org.apache.spark.sql.test.SQLTestUtilsBase$class.withTempView(SQLTestUtils.scala:260)
> ...
> {noformat}
> I believe this is because {{SQLMetricsTestUtils.currentExecutionId()}} is 
> racing with the listener bus.
> I'll attach trimmed logs as well



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23962) Flaky tests from SQLMetricsTestUtils.currentExecutionIds

2018-04-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23962:


Assignee: Apache Spark

> Flaky tests from SQLMetricsTestUtils.currentExecutionIds
> 
>
> Key: SPARK-23962
> URL: https://issues.apache.org/jira/browse/SPARK-23962
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Apache Spark
>Priority: Minor
> Attachments: unit-tests.log
>
>
> I've seen 
> {{org.apache.spark.sql.execution.metric.SQLMetricsSuite.SortMergeJoin 
> metrics}} fail 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4150/testReport/org.apache.spark.sql.execution.metric/SQLMetricsSuite/SortMergeJoin_metrics/
>  
> with
> {noformat}
> Error Message
> org.scalatest.exceptions.TestFailedException: 2 did not equal 1
> Stacktrace
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 2 did 
> not equal 1
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetricsTestUtils$class.getSparkPlanMetrics(SQLMetricsTestUtils.scala:146)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetricsSuite.getSparkPlanMetrics(SQLMetricsSuite.scala:33)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetricsTestUtils$class.testSparkPlanMetrics(SQLMetricsTestUtils.scala:187)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetricsSuite.testSparkPlanMetrics(SQLMetricsSuite.scala:33)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetricsSuite$$anonfun$7$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SQLMetricsSuite.scala:188)
>   at 
> org.apache.spark.sql.test.SQLTestUtilsBase$class.withTempView(SQLTestUtils.scala:260)
> ...
> {noformat}
> I believe this is because {{SQLMetricsTestUtils.currentExecutionId()}} is 
> racing with the listener bus.
> I'll attach trimmed logs as well



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23962) Flaky tests from SQLMetricsTestUtils.currentExecutionIds

2018-04-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23962:


Assignee: (was: Apache Spark)

> Flaky tests from SQLMetricsTestUtils.currentExecutionIds
> 
>
> Key: SPARK-23962
> URL: https://issues.apache.org/jira/browse/SPARK-23962
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Minor
> Attachments: unit-tests.log
>
>
> I've seen 
> {{org.apache.spark.sql.execution.metric.SQLMetricsSuite.SortMergeJoin 
> metrics}} fail 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4150/testReport/org.apache.spark.sql.execution.metric/SQLMetricsSuite/SortMergeJoin_metrics/
>  
> with
> {noformat}
> Error Message
> org.scalatest.exceptions.TestFailedException: 2 did not equal 1
> Stacktrace
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 2 did 
> not equal 1
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetricsTestUtils$class.getSparkPlanMetrics(SQLMetricsTestUtils.scala:146)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetricsSuite.getSparkPlanMetrics(SQLMetricsSuite.scala:33)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetricsTestUtils$class.testSparkPlanMetrics(SQLMetricsTestUtils.scala:187)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetricsSuite.testSparkPlanMetrics(SQLMetricsSuite.scala:33)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetricsSuite$$anonfun$7$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SQLMetricsSuite.scala:188)
>   at 
> org.apache.spark.sql.test.SQLTestUtilsBase$class.withTempView(SQLTestUtils.scala:260)
> ...
> {noformat}
> I believe this is because {{SQLMetricsTestUtils.currentExecutionId()}} is 
> racing with the listener bus.
> I'll attach trimmed logs as well



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22941) Allow SparkSubmit to throw exceptions instead of exiting / printing errors.

2018-04-11 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-22941:


Assignee: Marcelo Vanzin

> Allow SparkSubmit to throw exceptions instead of exiting / printing errors.
> ---
>
> Key: SPARK-22941
> URL: https://issues.apache.org/jira/browse/SPARK-22941
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Major
> Fix For: 2.4.0
>
>
> {{SparkSubmit}} now can be called by the {{SparkLauncher}} library (see 
> SPARK-11035). But if the caller provides incorrect or inconsistent parameters 
> to the app, {{SparkSubmit}} will print errors to the output and call 
> {{System.exit}}, which is not very user friendly in this code path.
> We should modify {{SparkSubmit}} to be more friendly when called this way, 
> while still maintaining the old behavior when called from the command line.
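
The usual shape of such a change, sketched below with purely hypothetical names rather than SparkSubmit's real internals, is to route fatal argument errors through an overridable hook so that a library caller gets an exception while the command line keeps printing and exiting:

{code:scala}
// Hypothetical sketch of the pattern, not Spark's actual code.
class SparkSubmitArgumentException(message: String) extends RuntimeException(message)

object SubmitErrorHandling {
  // Command-line default: print the error and terminate the JVM.
  @volatile var exitFn: (String, Int) => Unit = { (msg, code) =>
    System.err.println(msg)
    System.exit(code)
  }

  // A library caller (e.g. via SparkLauncher) switches to exceptions instead.
  def useInProcessErrors(): Unit = {
    exitFn = (msg, _) => throw new SparkSubmitArgumentException(msg)
  }

  def fail(msg: String, code: Int = 1): Unit = exitFn(msg, code)
}
{code}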



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22941) Allow SparkSubmit to throw exceptions instead of exiting / printing errors.

2018-04-11 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-22941.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20925
[https://github.com/apache/spark/pull/20925]

> Allow SparkSubmit to throw exceptions instead of exiting / printing errors.
> ---
>
> Key: SPARK-22941
> URL: https://issues.apache.org/jira/browse/SPARK-22941
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Major
> Fix For: 2.4.0
>
>
> {{SparkSubmit}} now can be called by the {{SparkLauncher}} library (see 
> SPARK-11035). But if the caller provides incorrect or inconsistent parameters 
> to the app, {{SparkSubmit}} will print errors to the output and call 
> {{System.exit}}, which is not very user friendly in this code path.
> We should modify {{SparkSubmit}} to be more friendly when called this way, 
> while still maintaining the old behavior when called from the command line.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23962) Flaky tests from SQLMetricsTestUtils.currentExecutionIds

2018-04-11 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-23962:


 Summary: Flaky tests from SQLMetricsTestUtils.currentExecutionIds
 Key: SPARK-23962
 URL: https://issues.apache.org/jira/browse/SPARK-23962
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Affects Versions: 2.4.0
Reporter: Imran Rashid
 Attachments: unit-tests.log

I've seen {{org.apache.spark.sql.execution.metric.SQLMetricsSuite.SortMergeJoin 
metrics}} fail 
https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4150/testReport/org.apache.spark.sql.execution.metric/SQLMetricsSuite/SortMergeJoin_metrics/
 
with

{noformat}
Error Message
org.scalatest.exceptions.TestFailedException: 2 did not equal 1

Stacktrace
sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 2 did not 
equal 1
at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
at 
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
at 
org.apache.spark.sql.execution.metric.SQLMetricsTestUtils$class.getSparkPlanMetrics(SQLMetricsTestUtils.scala:146)
at 
org.apache.spark.sql.execution.metric.SQLMetricsSuite.getSparkPlanMetrics(SQLMetricsSuite.scala:33)
at 
org.apache.spark.sql.execution.metric.SQLMetricsTestUtils$class.testSparkPlanMetrics(SQLMetricsTestUtils.scala:187)
at 
org.apache.spark.sql.execution.metric.SQLMetricsSuite.testSparkPlanMetrics(SQLMetricsSuite.scala:33)
at 
org.apache.spark.sql.execution.metric.SQLMetricsSuite$$anonfun$7$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SQLMetricsSuite.scala:188)
at 
org.apache.spark.sql.test.SQLTestUtilsBase$class.withTempView(SQLTestUtils.scala:260)
...
{noformat}

I believe this is because {{SQLMetricsTestUtils.currentExecutionId()}} is 
racing with the listener bus.
I'll attach trimmed logs as well
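
One generic way to make this kind of assertion tolerant of listener-bus lag, shown purely as a sketch rather than as the fix in the linked PR, is to retry the metrics lookup with ScalaTest's Eventually until the expected execution count appears or a timeout expires ({{awaitExecutionIds}} and the {{snapshot}} thunk are hypothetical stand-ins for the existing test utilities):

{code:scala}
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

object MetricsTestPatience {
  // Hypothetical helper: poll until the listener bus has delivered the expected
  // number of execution IDs, instead of asserting on the very first snapshot.
  def awaitExecutionIds(expected: Int)(snapshot: => Set[Long]): Set[Long] =
    eventually(timeout(10.seconds), interval(100.millis)) {
      val ids = snapshot
      assert(ids.size == expected, s"saw ${ids.size} executions, expected $expected")
      ids
    }
}
{code}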



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23962) Flaky tests from SQLMetricsTestUtils.currentExecutionIds

2018-04-11 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-23962:
-
Attachment: unit-tests.log

> Flaky tests from SQLMetricsTestUtils.currentExecutionIds
> 
>
> Key: SPARK-23962
> URL: https://issues.apache.org/jira/browse/SPARK-23962
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Minor
> Attachments: unit-tests.log
>
>
> I've seen 
> {{org.apache.spark.sql.execution.metric.SQLMetricsSuite.SortMergeJoin 
> metrics}} fail 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4150/testReport/org.apache.spark.sql.execution.metric/SQLMetricsSuite/SortMergeJoin_metrics/
>  
> with
> {noformat}
> Error Message
> org.scalatest.exceptions.TestFailedException: 2 did not equal 1
> Stacktrace
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 2 did 
> not equal 1
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetricsTestUtils$class.getSparkPlanMetrics(SQLMetricsTestUtils.scala:146)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetricsSuite.getSparkPlanMetrics(SQLMetricsSuite.scala:33)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetricsTestUtils$class.testSparkPlanMetrics(SQLMetricsTestUtils.scala:187)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetricsSuite.testSparkPlanMetrics(SQLMetricsSuite.scala:33)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetricsSuite$$anonfun$7$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SQLMetricsSuite.scala:188)
>   at 
> org.apache.spark.sql.test.SQLTestUtilsBase$class.withTempView(SQLTestUtils.scala:260)
> ...
> {noformat}
> I believe this is because {{SQLMetricsTestUtils.currentExecutionId()}} is 
> racing with the listener bus.
> I'll attach trimmed logs as well



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23959) UnresolvedException with DataSet created from Seq.empty since Spark 2.3.0

2018-04-11 Thread Sam De Backer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam De Backer updated SPARK-23959:
--
Description: 
The following snippet works fine in Spark 2.2.1 but gives a rather cryptic 
runtime exception in Spark 2.3.0:
{code:java}
import sparkSession.implicits._
import org.apache.spark.sql.functions._

case class X(xid: Long, yid: Int)
case class Y(yid: Int, zid: Long)
case class Z(zid: Long, b: Boolean)

val xs = Seq(X(1L, 10)).toDS()
val ys = Seq(Y(10, 100L)).toDS()
val zs = Seq.empty[Z].toDS()

val j = xs
  .join(ys, "yid")
  .join(zs, Seq("zid"), "left")
  .withColumn("BAM", when('b, "B").otherwise("NB"))

j.show(){code}
In Spark 2.2.1 it prints to the console
{noformat}
+---+---+---+----+---+
|zid|yid|xid|   b|BAM|
+---+---+---+----+---+
|100| 10|  1|null| NB|
+---+---+---+----+---+{noformat}
In Spark 2.3.0 it results in:
{noformat}
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
dataType on unresolved object, tree: 'BAM
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
at 
org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
at 
org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:296)
at org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:435)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:157)
...{noformat}
The culprit really seems to be the DataSet being created from an empty Seq[Z]. 
When you change that to something that will also result in an empty DataSet[Z], 
it works as in Spark 2.2.1, e.g.
{code:java}
val zs = Seq(Z(10L, true)).toDS().filter('zid < Long.MinValue){code}

  was:
The following snippet works fine in Spark 2.2.1 but gives a rather cryptic 
runtime exception in Spark 2.3.0:
{code:java}
import sparkSession.implicits._
import org.apache.spark.sql.functions._

case class X(xid: Long, yid: Int)
case class Y(yid: Int, zid: Long)
case class Z(zid: Long, b: Boolean)

val xs = Seq(X(1L, 10)).toDS()
val ys = Seq(Y(10, 100L)).toDS()
val zs = Seq.empty[Z].toDS()

val j = xs
  .join(ys, "yid")
  .join(zs, Seq("zid"), "left")
  .withColumn("BAM", when('b, "B").otherwise("NB"))

j.show(){code}
In Spark 2.2.1 it prints to the console
{noformat}
+---+---+---+----+---+
|zid|yid|xid|   b|BAM|
+---+---+---+----+---+
|100| 10|  1|null| NB|
+---+---+---+----+---+{noformat}
In Spark 2.3.0 it results in:
{noformat}
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
dataType on unresolved object, tree: 'BAM
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
at 
org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
at 
org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:296)
at org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:435)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:157)
...{noformat}
The culprit really seems to be DataSet being created from an empty Seq[Z]. When 
you change that to something that will also result in an empty DataSet[Z] it 
works as in Spark 2.2.1, e.g.
{code:java}
val zs = Seq(Z(10L, true)).toDS().filter('zid === Long.MinValue){code}


> UnresolvedException with DataSet created from Seq.empty since Spark 2.3.0
> -
>
> Key: SPARK-23959
> URL: https://issues.apache.org/jira/browse/SPARK-23959
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Sam De Backer
>Priority: Major
>
> The following snippet works fine in Spark 2.2.1 but gives a rather cryptic 
> runtime exception in Spark 2.3.0:
> {code:java}
> import sparkSession.implicits._
> import org.apache.spark.sql.functions._
> case class X(xid: Long, yid: Int)
> case class Y(yid: Int, zid: Long)
> case class Z(zid: Long, b: Boolean)
> val xs = Seq(X(1L, 10)).toDS()
> val ys = Seq(Y(10, 100L)).toDS()
> val zs = Seq.empty[Z].toDS()
> 

[jira] [Resolved] (SPARK-6951) History server slow startup if the event log directory is large

2018-04-11 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-6951.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20952
[https://github.com/apache/spark/pull/20952]

> History server slow startup if the event log directory is large
> ---
>
> Key: SPARK-6951
> URL: https://issues.apache.org/jira/browse/SPARK-6951
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.3.0
>Reporter: Matt Cheah
>Assignee: Marcelo Vanzin
>Priority: Major
> Fix For: 2.4.0
>
>
> I started my history server, then navigated to the web UI where I expected to 
> be able to view some completed applications, but the webpage was not 
> available. It turned out that the History Server was not finished parsing all 
> of the event logs in the event log directory that I had specified. I had 
> accumulated a lot of event logs from months of running Spark, so it would 
> have taken a very long time for the History Server to crunch through them 
> all. I purged the event log directory and started from scratch, and the UI 
> loaded immediately.
> We should have a pagination strategy or parse the directory lazily to avoid 
> needing to wait after starting the history server.
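
As an illustration of the "parse lazily" idea only (this is not the approach of the linked PR, and it uses the local filesystem API rather than the Hadoop FileSystem the history server actually reads from), the directory listing can be exposed as a lazy iterator so that only the entries actually requested are parsed:

{code:scala}
import java.nio.file.{Files, Path, Paths}
import scala.collection.JavaConverters._

object LazyEventLogListing {
  // Sketch: expose the event-log directory as a lazy iterator so startup does not
  // block on replaying every historical application before the UI is served.
  def eventLogFiles(dir: String): Iterator[Path] =
    Files.list(Paths.get(dir)).iterator().asScala

  // Usage sketch (parseSummary is hypothetical): only the first page is touched.
  // eventLogFiles("/var/log/spark-events").take(50).foreach(parseSummary)
}
{code}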



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6951) History server slow startup if the event log directory is large

2018-04-11 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-6951:
---

Assignee: Marcelo Vanzin

> History server slow startup if the event log directory is large
> ---
>
> Key: SPARK-6951
> URL: https://issues.apache.org/jira/browse/SPARK-6951
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.3.0
>Reporter: Matt Cheah
>Assignee: Marcelo Vanzin
>Priority: Major
> Fix For: 2.4.0
>
>
> I started my history server, then navigated to the web UI where I expected to 
> be able to view some completed applications, but the webpage was not 
> available. It turned out that the History Server was not finished parsing all 
> of the event logs in the event log directory that I had specified. I had 
> accumulated a lot of event logs from months of running Spark, so it would 
> have taken a very long time for the History Server to crunch through them 
> all. I purged the event log directory and started from scratch, and the UI 
> loaded immediately.
> We should have a pagination strategy or parse the directory lazily to avoid 
> needing to wait after starting the history server.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12105) Add a DataFrame.show() with argument for output PrintStream

2018-04-11 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-12105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434015#comment-16434015
 ] 

Tomasz Gawęda commented on SPARK-12105:
---

+1, It's quite a common question on StackOverflow: "how to print show() result 
in a custom PrintStream".

> Add a DataFrame.show() with argument for output PrintStream
> ---
>
> Key: SPARK-12105
> URL: https://issues.apache.org/jira/browse/SPARK-12105
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Dean Wampler
>Priority: Minor
>
> It would be nice to send the output of DataFrame.show(...) to a different 
> output stream than stdout, including just capturing the string itself. This 
> is useful, e.g., for testing. Actually, it would be sufficient and perhaps 
> better to just make DataFrame.showString a public method, 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12105) Add a DataFrame.show() with argument for output PrintStream

2018-04-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12105:


Assignee: Apache Spark

> Add a DataFrame.show() with argument for output PrintStream
> ---
>
> Key: SPARK-12105
> URL: https://issues.apache.org/jira/browse/SPARK-12105
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Dean Wampler
>Assignee: Apache Spark
>Priority: Minor
>
> It would be nice to send the output of DataFrame.show(...) to a different 
> output stream than stdout, including just capturing the string itself. This 
> is useful, e.g., for testing. Actually, it would be sufficient and perhaps 
> better to just make DataFrame.showString a public method, 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12105) Add a DataFrame.show() with argument for output PrintStream

2018-04-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12105:


Assignee: (was: Apache Spark)

> Add a DataFrame.show() with argument for output PrintStream
> ---
>
> Key: SPARK-12105
> URL: https://issues.apache.org/jira/browse/SPARK-12105
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Dean Wampler
>Priority: Minor
>
> It would be nice to send the output of DataFrame.show(...) to a different 
> output stream than stdout, including just capturing the string itself. This 
> is useful, e.g., for testing. Actually, it would be sufficient and perhaps 
> better to just make DataFrame.showString a public method, 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23960) Mark HashAggregateExec.bufVars as transient

2018-04-11 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23960:
---

Assignee: Kris Mok

> Mark HashAggregateExec.bufVars as transient
> ---
>
> Key: SPARK-23960
> URL: https://issues.apache.org/jira/browse/SPARK-23960
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kris Mok
>Assignee: Kris Mok
>Priority: Minor
> Fix For: 2.4.0
>
>
> {{HashAggregateExec.bufVars}} is only used during codegen for global 
> aggregation. Specifically it's only used while {{doProduceWithoutKeys()}} is 
> on the stack.
> Currently, if an {{HashAggregateExec}} is ever captured for serialization, 
> the {{bufVars}} would be needlessly serialized.
> This ticket proposes a minor change to mark the {{bufVars}} field as 
> transient to avoid serializing it. Also, null out this field at the end of 
> {{doProduceWithoutKeys()}} to reduce its lifecycle so that the 
> {{Seq[ExprCode]}} being referenced can be GC'd sooner.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23960) Mark HashAggregateExec.bufVars as transient

2018-04-11 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23960.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21039
[https://github.com/apache/spark/pull/21039]

> Mark HashAggregateExec.bufVars as transient
> ---
>
> Key: SPARK-23960
> URL: https://issues.apache.org/jira/browse/SPARK-23960
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kris Mok
>Assignee: Kris Mok
>Priority: Minor
> Fix For: 2.4.0
>
>
> {{HashAggregateExec.bufVars}} is only used during codegen for global 
> aggregation. Specifically it's only used while {{doProduceWithoutKeys()}} is 
> on the stack.
> Currently, if an {{HashAggregateExec}} is ever captured for serialization, 
> the {{bufVars}} would be needlessly serialized.
> This ticket proposes a minor change to mark the {{bufVars}} field as 
> transient to avoid serializing it. Also, null out this field at the end of 
> {{doProduceWithoutKeys()}} to reduce its lifecycle so that the 
> {{Seq[ExprCode]}} being referenced can be GC'd sooner.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23927) High-order function: sequence

2018-04-11 Thread Alex Wajda (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433926#comment-16433926
 ] 

Alex Wajda commented on SPARK-23927:


I will take this one.
Thanks.

> High-order function: sequence
> -
>
> Key: SPARK-23927
> URL: https://issues.apache.org/jira/browse/SPARK-23927
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> * sequence(start, stop) → array
> Generate a sequence of integers from start to stop, incrementing by 1 if 
> start is less than or equal to stop, otherwise -1.
> * sequence(start, stop, step) → array
> Generate a sequence of integers from start to stop, incrementing by step.
> * sequence(start, stop) → array
> Generate a sequence of dates from start date to stop date, incrementing by 1 
> day if start date is less than or equal to stop date, otherwise -1 day.
> * sequence(start, stop, step) → array
> Generate a sequence of dates from start to stop, incrementing by step. The 
> type of step can be either INTERVAL DAY TO SECOND or INTERVAL YEAR TO MONTH.
> * sequence(start, stop, step) → array
> Generate a sequence of timestamps from start to stop, incrementing by step. 
> The type of step can be either INTERVAL DAY TO SECOND or INTERVAL YEAR TO 
> MONTH.
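
A reference implementation of the integer case, given only to pin down the semantics quoted above (this is not Spark code, and the eventual SQL signature is up to the implementer):

{code:scala}
object SequenceSemantics {
  // Sketch of sequence(start, stop[, step]) for integers, per the Presto description:
  // step defaults to +1 or -1 depending on the direction from start to stop.
  def sequence(start: Long, stop: Long, step: Option[Long] = None): Seq[Long] = {
    val s = step.getOrElse(if (start <= stop) 1L else -1L)
    require(s != 0, "step must not be zero")
    start to stop by s
  }

  // sequence(1, 5)          == Seq(1, 2, 3, 4, 5)
  // sequence(5, 1)          == Seq(5, 4, 3, 2, 1)
  // sequence(1, 9, Some(4)) == Seq(1, 5, 9)
}
{code}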



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23961) pyspark toLocalIterator throws an exception

2018-04-11 Thread Michel Lemay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michel Lemay updated SPARK-23961:
-
Issue Type: Bug  (was: Improvement)

> pyspark toLocalIterator throws an exception
> ---
>
> Key: SPARK-23961
> URL: https://issues.apache.org/jira/browse/SPARK-23961
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
>Reporter: Michel Lemay
>Priority: Minor
>  Labels: DataFrame, pyspark
>
> Given a dataframe, take its rdd and use toLocalIterator. If we do not 
> consume all records, it will throw: 
> {quote}ERROR PythonRDD: Error while sending iterator
> java.net.SocketException: Connection reset by peer: socket write error
>  at java.net.SocketOutputStream.socketWrite0(Native Method)
>  at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
>  at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
>  at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
>  at java.io.DataOutputStream.write(DataOutputStream.java:107)
>  at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
>  at 
> org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:497)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>  at 
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
>  at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:705)
>  at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705)
>  at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
>  at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:706)
> {quote}
>  
> To reproduce, here is a simple pyspark shell script that shows the error:
> {quote}import itertools
> df = spark.read.parquet("large parquet folder")
> cachedRDD = df.rdd.cache()
> print(cachedRDD.count()) # materialize
> b = cachedRDD.toLocalIterator()
> print(len(list(itertools.islice(b, 20))))
> b = None # Make the iterator go out of scope.  Throws here.
> {quote}
>  
> Observations:
>  * Consuming all records does not throw.  Taking only a subset of the 
> partitions creates the error.
>  * In another experiment, doing the same on a regular RDD works if we 
> cache/materialize it. If we do not cache the RDD, it throws similarly.
>  * It works in the Scala shell.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23961) pyspark toLocalIterator throws an exception

2018-04-11 Thread Michel Lemay (JIRA)
Michel Lemay created SPARK-23961:


 Summary: pyspark toLocalIterator throws an exception
 Key: SPARK-23961
 URL: https://issues.apache.org/jira/browse/SPARK-23961
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.2.1
Reporter: Michel Lemay


Given a dataframe, take its rdd and use toLocalIterator. If we do not consume 
all records, it will throw: 
{quote}ERROR PythonRDD: Error while sending iterator
java.net.SocketException: Connection reset by peer: socket write error
 at java.net.SocketOutputStream.socketWrite0(Native Method)
 at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
 at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
 at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
 at java.io.DataOutputStream.write(DataOutputStream.java:107)
 at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
 at 
org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:497)
 at 
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509)
 at 
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509)
 at scala.collection.Iterator$class.foreach(Iterator.scala:893)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
 at 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
 at 
org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:705)
 at 
org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705)
 at 
org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
 at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:706)
{quote}
 

To reproduce, here is a simple pyspark shell script that shows the error:
{quote}import itertools
df = spark.read.parquet("large parquet folder")
cachedRDD = df.rdd.cache()
print(cachedRDD.count()) # materialize
b = cachedRDD.toLocalIterator()
print(len(list(itertools.islice(b, 20))))
b = None # Make the iterator go out of scope.  Throws here.
{quote}
 

Observations:
 * Consuming all records does not throw.  Taking only a subset of the partitions 
creates the error.
 * In another experiment, doing the same on a regular RDD works if we 
cache/materialize it. If we do not cache the RDD, it throws similarly.
 * It works in the Scala shell.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23930) High-order function: slice(x, start, length) → array

2018-04-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23930:


Assignee: Apache Spark

> High-order function: slice(x, start, length) → array
> 
>
> Key: SPARK-23930
> URL: https://issues.apache.org/jira/browse/SPARK-23930
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Subsets array x starting from index start (or starting from the end if start 
> is negative) with a length of length.
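
For reference, the described semantics can be pinned down with a small sketch (illustration only, not the proposed Spark implementation; indices are 1-based and a negative start counts from the end, per the Presto doc):

{code:scala}
object SliceSemantics {
  // Sketch of slice(x, start, length) per the Presto description.
  def slice[T](x: Seq[T], start: Int, length: Int): Seq[T] = {
    require(start != 0, "SQL array indices start at 1, so start must be non-zero")
    require(length >= 0, "length must be non-negative")
    val from = if (start > 0) start - 1 else x.length + start
    x.slice(from, from + length)
  }

  // slice(Seq(1, 2, 3, 4), 2, 2)  == Seq(2, 3)
  // slice(Seq(1, 2, 3, 4), -2, 2) == Seq(3, 4)
}
{code}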



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23930) High-order function: slice(x, start, length) → array

2018-04-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433837#comment-16433837
 ] 

Apache Spark commented on SPARK-23930:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/21040

> High-order function: slice(x, start, length) → array
> 
>
> Key: SPARK-23930
> URL: https://issues.apache.org/jira/browse/SPARK-23930
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Subsets array x starting from index start (or starting from the end if start 
> is negative) with a length of length.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23930) High-order function: slice(x, start, length) → array

2018-04-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23930:


Assignee: (was: Apache Spark)

> High-order function: slice(x, start, length) → array
> 
>
> Key: SPARK-23930
> URL: https://issues.apache.org/jira/browse/SPARK-23930
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Subsets array x starting from index start (or starting from the end if start 
> is negative) with a length of length.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23951) Use java classed in ExprValue and simplify a bunch of stuff

2018-04-11 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23951.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21026
[https://github.com/apache/spark/pull/21026]

> Use java classed in ExprValue and simplify a bunch of stuff
> ---
>
> Key: SPARK-23951
> URL: https://issues.apache.org/jira/browse/SPARK-23951
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23960) Mark HashAggregateExec.bufVars as transient

2018-04-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23960:


Assignee: Apache Spark

> Mark HashAggregateExec.bufVars as transient
> ---
>
> Key: SPARK-23960
> URL: https://issues.apache.org/jira/browse/SPARK-23960
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kris Mok
>Assignee: Apache Spark
>Priority: Minor
>
> {{HashAggregateExec.bufVars}} is only used during codegen for global 
> aggregation. Specifically it's only used while {{doProduceWithoutKeys()}} is 
> on the stack.
> Currently, if an {{HashAggregateExec}} is ever captured for serialization, 
> the {{bufVars}} would be needlessly serialized.
> This ticket proposes a minor change to mark the {{bufVars}} field as 
> transient to avoid serializing it. Also, null out this field at the end of 
> {{doProduceWithoutKeys()}} to reduce its lifecycle so that the 
> {{Seq[ExprCode]}} being referenced can be GC'd sooner.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23960) Mark HashAggregateExec.bufVars as transient

2018-04-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23960:


Assignee: (was: Apache Spark)

> Mark HashAggregateExec.bufVars as transient
> ---
>
> Key: SPARK-23960
> URL: https://issues.apache.org/jira/browse/SPARK-23960
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kris Mok
>Priority: Minor
>
> {{HashAggregateExec.bufVars}} is only used during codegen for global 
> aggregation. Specifically it's only used while {{doProduceWithoutKeys()}} is 
> on the stack.
> Currently, if an {{HashAggregateExec}} is ever captured for serialization, 
> the {{bufVars}} would be needlessly serialized.
> This ticket proposes a minor change to mark the {{bufVars}} field as 
> transient to avoid serializing it. Also, null out this field at the end of 
> {{doProduceWithoutKeys()}} to reduce its lifecycle so that the 
> {{Seq[ExprCode]}} being referenced can be GC'd sooner.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23960) Mark HashAggregateExec.bufVars as transient

2018-04-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433672#comment-16433672
 ] 

Apache Spark commented on SPARK-23960:
--

User 'rednaxelafx' has created a pull request for this issue:
https://github.com/apache/spark/pull/21039

> Mark HashAggregateExec.bufVars as transient
> ---
>
> Key: SPARK-23960
> URL: https://issues.apache.org/jira/browse/SPARK-23960
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kris Mok
>Priority: Minor
>
> {{HashAggregateExec.bufVars}} is only used during codegen for global 
> aggregation. Specifically it's only used while {{doProduceWithoutKeys()}} is 
> on the stack.
> Currently, if an {{HashAggregateExec}} is ever captured for serialization, 
> the {{bufVars}} would be needlessly serialized.
> This ticket proposes a minor change to mark the {{bufVars}} field as 
> transient to avoid serializing it. Also, null out this field at the end of 
> {{doProduceWithoutKeys()}} to reduce its lifecycle so that the 
> {{Seq[ExprCode]}} being referenced can be GC'd sooner.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23960) Mark HashAggregateExec.bufVars as transient

2018-04-11 Thread Kris Mok (JIRA)
Kris Mok created SPARK-23960:


 Summary: Mark HashAggregateExec.bufVars as transient
 Key: SPARK-23960
 URL: https://issues.apache.org/jira/browse/SPARK-23960
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Kris Mok


{{HashAggregateExec.bufVars}} is only used during codegen for global 
aggregation. Specifically it's only used while {{doProduceWithoutKeys()}} is on 
the stack.
Currently, if an {{HashAggregateExec}} is ever captured for serialization, the 
{{bufVars}} would be needlessly serialized.

This ticket proposes a minor change to mark the {{bufVars}} field as transient 
to avoid serializing it. Also, null out this field at the end of 
{{doProduceWithoutKeys()}} to reduce its lifecycle so that the 
{{Seq[ExprCode]}} being referenced can be GC'd sooner.
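
Reduced to a generic sketch (this is not the real HashAggregateExec, just the serialization mechanics being proposed), the change looks like this:

{code:scala}
// Simplified illustration: a codegen-only field marked @transient is skipped by
// Java serialization, and nulling it after use lets the referenced objects be GC'd.
class CodegenNode extends Serializable {
  @transient private var bufVars: Seq[String] = _

  def doProduceWithoutKeys(): String = {
    bufVars = Seq("agg_bufValue_0", "agg_bufValue_1") // only needed while generating code
    val generated = bufVars.mkString("; ")
    bufVars = null                                    // shorten the field's lifecycle
    generated
  }
}
{code}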



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23959) UnresolvedException with DataSet created from Seq.empty since Spark 2.3.0

2018-04-11 Thread Sam De Backer (JIRA)
Sam De Backer created SPARK-23959:
-

 Summary: UnresolvedException with DataSet created from Seq.empty 
since Spark 2.3.0
 Key: SPARK-23959
 URL: https://issues.apache.org/jira/browse/SPARK-23959
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Sam De Backer


The following snippet works fine in Spark 2.2.1 but gives a rather cryptic 
runtime exception in Spark 2.3.0:
{code:java}
import sparkSession.implicits._
import org.apache.spark.sql.functions._

case class X(xid: Long, yid: Int)
case class Y(yid: Int, zid: Long)
case class Z(zid: Long, b: Boolean)

val xs = Seq(X(1L, 10)).toDS()
val ys = Seq(Y(10, 100L)).toDS()
val zs = Seq.empty[Z].toDS()

val j = xs
  .join(ys, "yid")
  .join(zs, Seq("zid"), "left")
  .withColumn("BAM", when('b, "B").otherwise("NB"))

j.show(){code}
In Spark 2.2.1 it prints to the console
{noformat}
+---+---+---+----+---+
|zid|yid|xid|   b|BAM|
+---+---+---+----+---+
|100| 10|  1|null| NB|
+---+---+---+----+---+{noformat}
In Spark 2.3.0 it results in:
{noformat}
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
dataType on unresolved object, tree: 'BAM
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
at 
org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
at 
org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:296)
at org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:435)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:157)
...{noformat}
The culprit really seems to be the DataSet being created from an empty Seq[Z]. 
When you change that to something that will also result in an empty DataSet[Z], 
it works as in Spark 2.2.1, e.g.
{code:java}
val zs = Seq(Z(10L, true)).toDS().filter('zid === Long.MinValue){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23930) High-order function: slice(x, start, length) → array

2018-04-11 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433550#comment-16433550
 ] 

Marco Gaido commented on SPARK-23930:
-

I am working on this.

> High-order function: slice(x, start, length) → array
> 
>
> Key: SPARK-23930
> URL: https://issues.apache.org/jira/browse/SPARK-23930
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Subsets array x starting from index start (or starting from the end if start 
> is negative) with a length of length.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2

2018-04-11 Thread Jepson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433536#comment-16433536
 ] 

Jepson commented on SPARK-22968:


[~apachespark] Thank you very much. 

> java.lang.IllegalStateException: No current assignment for partition kssh-2
> ---
>
> Key: SPARK-22968
> URL: https://issues.apache.org/jira/browse/SPARK-22968
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: Kafka:  0.10.0  (CDH5.12.0)
> Apache Spark 2.1.1
> Spark streaming+Kafka
>Reporter: Jepson
>Priority: Major
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> *Kafka Broker:*
> {code:java}
>message.max.bytes : 2621440  
> {code}
> *Spark Streaming+Kafka Code:*
> {code:java}
> , "max.partition.fetch.bytes" -> (5242880: java.lang.Integer) //default: 
> 1048576
> , "request.timeout.ms" -> (9: java.lang.Integer) //default: 6
> , "session.timeout.ms" -> (6: java.lang.Integer) //default: 3
> , "heartbeat.interval.ms" -> (5000: java.lang.Integer)
> , "receive.buffer.bytes" -> (10485760: java.lang.Integer)
> {code}
> *Error message:*
> {code:java}
> 8/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously 
> assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, 
> kssh-1] for group use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group 
> use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined 
> group use_a_separate_group_id_for_each_stream with generation 4
> 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned 
> partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group 
> use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 ERROR scheduler.JobScheduler: Error generating jobs for 
> time 1515116907000 ms
> java.lang.IllegalStateException: No current assignment for partition kssh-2
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231)
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>   at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 

[jira] [Assigned] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2

2018-04-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22968:


Assignee: (was: Apache Spark)

> java.lang.IllegalStateException: No current assignment for partition kssh-2
> ---
>
> Key: SPARK-22968
> URL: https://issues.apache.org/jira/browse/SPARK-22968
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: Kafka:  0.10.0  (CDH5.12.0)
> Apache Spark 2.1.1
> Spark streaming+Kafka
>Reporter: Jepson
>Priority: Major
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> *Kafka Broker:*
> {code:java}
>message.max.bytes : 2621440  
> {code}
> *Spark Streaming+Kafka Code:*
> {code:java}
> , "max.partition.fetch.bytes" -> (5242880: java.lang.Integer) //default: 
> 1048576
> , "request.timeout.ms" -> (9: java.lang.Integer) //default: 6
> , "session.timeout.ms" -> (6: java.lang.Integer) //default: 3
> , "heartbeat.interval.ms" -> (5000: java.lang.Integer)
> , "receive.buffer.bytes" -> (10485760: java.lang.Integer)
> {code}
> *Error message:*
> {code:java}
> 8/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously 
> assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, 
> kssh-1] for group use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group 
> use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined 
> group use_a_separate_group_id_for_each_stream with generation 4
> 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned 
> partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group 
> use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 ERROR scheduler.JobScheduler: Error generating jobs for 
> time 1515116907000 ms
> java.lang.IllegalStateException: No current assignment for partition kssh-2
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231)
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>   at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 
> 
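
Editor's sketch for readers following this ticket: a minimal example of how the consumer settings quoted above are typically wired into a 0.10 direct stream. The broker list, topic name, batch interval, and object names are illustrative placeholders, not taken from the reporter's job.

{code:java}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KsshStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kssh-direct-stream")
    val ssc  = new StreamingContext(conf, Seconds(5))      // batch interval: placeholder

    // Mirrors the settings quoted in the ticket; bootstrap.servers and group.id are placeholders.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"         -> "broker1:9092,broker2:9092",
      "key.deserializer"          -> classOf[StringDeserializer],
      "value.deserializer"        -> classOf[StringDeserializer],
      "group.id"                  -> "use_a_separate_group_id_for_each_stream",
      "max.partition.fetch.bytes" -> (5242880: java.lang.Integer),
      "request.timeout.ms"        -> (90000: java.lang.Integer),
      "session.timeout.ms"        -> (60000: java.lang.Integer),
      "heartbeat.interval.ms"     -> (5000: java.lang.Integer),
      "receive.buffer.bytes"      -> (10485760: java.lang.Integer)
    )

    // Direct stream over the 8-partition topic seen in the log (kssh-0 .. kssh-7).
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Seq("kssh"), kafkaParams)
    )

    stream.foreachRDD { rdd =>
      // Placeholder output action; the IllegalStateException in this ticket is raised on the
      // driver while offsets for the next batch are computed, not inside this user code.
      println(s"records in batch: ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}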

[jira] [Assigned] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2

2018-04-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22968:


Assignee: Apache Spark

> java.lang.IllegalStateException: No current assignment for partition kssh-2
> ---
>
> Key: SPARK-22968
> URL: https://issues.apache.org/jira/browse/SPARK-22968
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: Kafka:  0.10.0  (CDH5.12.0)
> Apache Spark 2.1.1
> Spark streaming+Kafka
>Reporter: Jepson
>Assignee: Apache Spark
>Priority: Major
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> *Kafka Broker:*
> {code:java}
>message.max.bytes : 2621440  
> {code}
> *Spark Streaming+Kafka Code:*
> {code:java}
> , "max.partition.fetch.bytes" -> (5242880: java.lang.Integer) //default: 
> 1048576
> , "request.timeout.ms" -> (9: java.lang.Integer) //default: 6
> , "session.timeout.ms" -> (6: java.lang.Integer) //default: 3
> , "heartbeat.interval.ms" -> (5000: java.lang.Integer)
> , "receive.buffer.bytes" -> (10485760: java.lang.Integer)
> {code}
> *Error message:*
> {code:java}
> 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously 
> assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, 
> kssh-1] for group use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group 
> use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined 
> group use_a_separate_group_id_for_each_stream with generation 4
> 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned 
> partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group 
> use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 ERROR scheduler.JobScheduler: Error generating jobs for 
> time 1515116907000 ms
> java.lang.IllegalStateException: No current assignment for partition kssh-2
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231)
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>   at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
>   at scala.util.Try$.apply(Try.scala:192)
>  

[jira] [Commented] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2

2018-04-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433510#comment-16433510
 ] 

Apache Spark commented on SPARK-22968:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/21038

> java.lang.IllegalStateException: No current assignment for partition kssh-2
> ---
>
> Key: SPARK-22968
> URL: https://issues.apache.org/jira/browse/SPARK-22968
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: Kafka:  0.10.0  (CDH5.12.0)
> Apache Spark 2.1.1
> Spark streaming+Kafka
>Reporter: Jepson
>Priority: Major
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> *Kafka Broker:*
> {code:java}
>message.max.bytes : 2621440  
> {code}
> *Spark Streaming+Kafka Code:*
> {code:java}
> , "max.partition.fetch.bytes" -> (5242880: java.lang.Integer) //default: 
> 1048576
> , "request.timeout.ms" -> (9: java.lang.Integer) //default: 6
> , "session.timeout.ms" -> (6: java.lang.Integer) //default: 3
> , "heartbeat.interval.ms" -> (5000: java.lang.Integer)
> , "receive.buffer.bytes" -> (10485760: java.lang.Integer)
> {code}
> *Error message:*
> {code:java}
> 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously 
> assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, 
> kssh-1] for group use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group 
> use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined 
> group use_a_separate_group_id_for_each_stream with generation 4
> 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned 
> partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group 
> use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 ERROR scheduler.JobScheduler: Error generating jobs for 
> time 1515116907000 ms
> java.lang.IllegalStateException: No current assignment for partition kssh-2
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231)
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>   at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
>   at 
> 
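
The trace above shows the failure path: DirectKafkaInputDStream.latestOffsets asks its cached KafkaConsumer to seekToEnd on the partitions it was tracking, but after the group rebalance only kssh-7, kssh-4, kssh-6 and kssh-5 remain assigned, so seeking kssh-2 throws. The sketch below only illustrates that defensive check; the names safeSeekToEnd and trackedPartitions are made up for the example, and this is not the change in the linked pull request.

{code:java}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

object RebalanceSafeSeek {
  // `consumer` is an already-subscribed KafkaConsumer; `trackedPartitions` is whatever the
  // caller remembered from the previous batch (kssh-0 .. kssh-7 before the rebalance).
  def safeSeekToEnd(consumer: KafkaConsumer[String, String],
                    trackedPartitions: Set[TopicPartition]): Unit = {
    // After a rebalance the consumer may own fewer partitions than were remembered.
    val currentlyAssigned = consumer.assignment().asScala.toSet
    val lost = trackedPartitions -- currentlyAssigned
    if (lost.nonEmpty) {
      // Seeking these would raise "IllegalStateException: No current assignment for partition ...".
      println(s"skipping partitions lost in rebalance: ${lost.mkString(", ")}")
    }
    consumer.seekToEnd((trackedPartitions & currentlyAssigned).asJava)
  }
}
{code}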

[jira] [Assigned] (SPARK-23919) High-order function: array_position(x, element) → bigint

2018-04-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23919:


Assignee: Apache Spark

> High-order function: array_position(x, element) → bigint
> 
>
> Key: SPARK-23919
> URL: https://issues.apache.org/jira/browse/SPARK-23919
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns the position of the first occurrence of the element in array x (or 0 
> if not found).
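
For readers skimming the ticket, the Presto semantics being ported are 1-based indexing with 0 for a missing element. Once the linked work lands, usage from Spark SQL might look like the sketch below; it assumes a SparkSession named spark on a build that already contains the new function, which 2.3.0 does not.

{code:java}
// Assumes a SparkSession `spark` built from a branch that already includes array_position.
val found = spark.sql("SELECT array_position(array(3, 4, 5), 5) AS pos")
found.show()     // pos = 3: positions are 1-based, following the Presto reference

val missing = spark.sql("SELECT array_position(array(3, 4, 5), 7) AS pos")
missing.show()   // pos = 0 when the element does not occur in the array
{code}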



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23919) High-order function: array_position(x, element) → bigint

2018-04-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433460#comment-16433460
 ] 

Apache Spark commented on SPARK-23919:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21037

> High-order function: array_position(x, element) → bigint
> 
>
> Key: SPARK-23919
> URL: https://issues.apache.org/jira/browse/SPARK-23919
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns the position of the first occurrence of the element in array x (or 0 
> if not found).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23919) High-order function: array_position(x, element) → bigint

2018-04-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23919:


Assignee: (was: Apache Spark)

> High-order function: array_position(x, element) → bigint
> 
>
> Key: SPARK-23919
> URL: https://issues.apache.org/jira/browse/SPARK-23919
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns the position of the first occurrence of the element in array x (or 0 
> if not found).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org