[jira] [Updated] (SPARK-23705) dataframe.groupBy() may inadvertently receive sequence of non-distinct strings

2018-03-15 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-23705:
-
Target Version/s:   (was: 2.3.0)

Let's avoid setting a target version; it is usually reserved for committers.

> dataframe.groupBy() may inadvertently receive sequence of non-distinct strings
> --
>
> Key: SPARK-23705
> URL: https://issues.apache.org/jira/browse/SPARK-23705
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Khoa Tran
>Priority: Minor
>  Labels: beginner, easyfix, features, newbie, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> {code:java}
> package org.apache.spark.sql
> .
> .
> .
> class Dataset[T] private[sql](
> .
> .
> .
> def groupBy(col1: String, cols: String*): RelationalGroupedDataset = {
>   val colNames: Seq[String] = col1 +: cols
>   RelationalGroupedDataset(
> toDF(), colNames.map(colName => resolve(colName)), 
> RelationalGroupedDataset.GroupByType)
> }
> {code}
> `colNames` should have `.distinct` appended when used in `groupBy`.
>  
> Not sure whether the community agrees with this, or whether it is up to users 
> to perform the distinct operation themselves.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23696) StructType.fromString swallows exceptions from DataType.fromJson

2018-03-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401479#comment-16401479
 ] 

Hyukjin Kwon commented on SPARK-23696:
--

Why don't we just use {{DataType.fromJson}} directly if you need to catch the 
exception?
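
For illustration, a minimal sketch of that suggestion (the helper name and error 
message are made up; only {{DataType.fromJson}} is the existing API):

{code}
// Hedged sketch: parse the JSON schema string with DataType.fromJson directly so
// a malformed string fails with the parser's own exception instead of being
// hidden behind the legacy-parser fallback. parseStructType is a made-up helper.
import org.apache.spark.sql.types.{DataType, StructType}

def parseStructType(json: String): StructType =
  DataType.fromJson(json) match {
    case s: StructType => s
    case other => throw new IllegalArgumentException(
      s"Expected a StructType but got ${other.simpleString}")
  }
{code}

Any parse failure from {{DataType.fromJson}} then surfaces unchanged, which is the 
behaviour the report is asking for.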

> StructType.fromString swallows exceptions from DataType.fromJson
> 
>
> Key: SPARK-23696
> URL: https://issues.apache.org/jira/browse/SPARK-23696
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Simeon H.K. Fitch
>Priority: Trivial
>
> `StructType.fromString` swallows exceptions from `DataType.fromJson`, 
> assuming they indicate that `LegacyTypeStringParser.parse` should be called 
> instead. When that also fails (because it throws an exception), the error 
> message that gets generated does not reflect the true problem at hand, 
> effectively swallowing the exception from `DataType.fromJson`. This makes 
> debugging Parquet schema issues more difficult.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23162) PySpark ML LinearRegressionSummary missing r2adj

2018-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23162:


Assignee: Apache Spark

> PySpark ML LinearRegressionSummary missing r2adj
> 
>
> Key: SPARK-23162
> URL: https://issues.apache.org/jira/browse/SPARK-23162
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>
> Missing the Python API for {{r2adj}} in {{LinearRegressionSummary}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23162) PySpark ML LinearRegressionSummary missing r2adj

2018-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23162:


Assignee: (was: Apache Spark)

> PySpark ML LinearRegressionSummary missing r2adj
> 
>
> Key: SPARK-23162
> URL: https://issues.apache.org/jira/browse/SPARK-23162
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Minor
>  Labels: starter
>
> Missing the Python API for {{r2adj}} in {{LinearRegressionSummary}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23162) PySpark ML LinearRegressionSummary missing r2adj

2018-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401458#comment-16401458
 ] 

Apache Spark commented on SPARK-23162:
--

User 'kevinyu98' has created a pull request for this issue:
https://github.com/apache/spark/pull/20842

> PySpark ML LinearRegressionSummary missing r2adj
> 
>
> Key: SPARK-23162
> URL: https://issues.apache.org/jira/browse/SPARK-23162
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Minor
>  Labels: starter
>
> Missing the Python API for {{r2adj}} in {{LinearRegressionSummary}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23706) spark.conf.get(value, default=None) should produce None in PySpark

2018-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23706:


Assignee: Apache Spark

> spark.conf.get(value, default=None) should produce None in PySpark
> --
>
> Key: SPARK-23706
> URL: https://issues.apache.org/jira/browse/SPARK-23706
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> Scala:
> {code}
> scala> spark.conf.get("hey")
> java.util.NoSuchElementException: hey
>   at 
> org.apache.spark.sql.internal.SQLConf$$anonfun$getConfString$2.apply(SQLConf.scala:1600)
>   at 
> org.apache.spark.sql.internal.SQLConf$$anonfun$getConfString$2.apply(SQLConf.scala:1600)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.internal.SQLConf.getConfString(SQLConf.scala:1600)
>   at org.apache.spark.sql.RuntimeConfig.get(RuntimeConfig.scala:74)
>   ... 49 elided
> scala> spark.conf.get("hey", null)
> res1: String = null
> scala> spark.conf.get("spark.sql.sources.partitionOverwriteMode", null)
> res2: String = null
> {code}
> Python:
> {code}
> >>> spark.conf.get("hey")
> ...
> py4j.protocol.Py4JJavaError: An error occurred while calling o30.get.
> : java.util.NoSuchElementException: hey
> ...
> >>> spark.conf.get("hey", None)
> ...
> py4j.protocol.Py4JJavaError: An error occurred while calling o30.get.
> : java.util.NoSuchElementException: hey
> ...
> >>> spark.conf.get("spark.sql.sources.partitionOverwriteMode", None)
> u'STATIC'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23706) spark.conf.get(value, default=None) should produce None in PySpark

2018-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401451#comment-16401451
 ] 

Apache Spark commented on SPARK-23706:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/20841

> spark.conf.get(value, default=None) should produce None in PySpark
> --
>
> Key: SPARK-23706
> URL: https://issues.apache.org/jira/browse/SPARK-23706
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Scala:
> {code}
> scala> spark.conf.get("hey")
> java.util.NoSuchElementException: hey
>   at 
> org.apache.spark.sql.internal.SQLConf$$anonfun$getConfString$2.apply(SQLConf.scala:1600)
>   at 
> org.apache.spark.sql.internal.SQLConf$$anonfun$getConfString$2.apply(SQLConf.scala:1600)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.internal.SQLConf.getConfString(SQLConf.scala:1600)
>   at org.apache.spark.sql.RuntimeConfig.get(RuntimeConfig.scala:74)
>   ... 49 elided
> scala> spark.conf.get("hey", null)
> res1: String = null
> scala> spark.conf.get("spark.sql.sources.partitionOverwriteMode", null)
> res2: String = null
> {code}
> Python:
> {code}
> >>> spark.conf.get("hey")
> ...
> py4j.protocol.Py4JJavaError: An error occurred while calling o30.get.
> : java.util.NoSuchElementException: hey
> ...
> >>> spark.conf.get("hey", None)
> ...
> py4j.protocol.Py4JJavaError: An error occurred while calling o30.get.
> : java.util.NoSuchElementException: hey
> ...
> >>> spark.conf.get("spark.sql.sources.partitionOverwriteMode", None)
> u'STATIC'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23706) spark.conf.get(value, default=None) should produce None in PySpark

2018-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23706:


Assignee: (was: Apache Spark)

> spark.conf.get(value, default=None) should produce None in PySpark
> --
>
> Key: SPARK-23706
> URL: https://issues.apache.org/jira/browse/SPARK-23706
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Scala:
> {code}
> scala> spark.conf.get("hey")
> java.util.NoSuchElementException: hey
>   at 
> org.apache.spark.sql.internal.SQLConf$$anonfun$getConfString$2.apply(SQLConf.scala:1600)
>   at 
> org.apache.spark.sql.internal.SQLConf$$anonfun$getConfString$2.apply(SQLConf.scala:1600)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.internal.SQLConf.getConfString(SQLConf.scala:1600)
>   at org.apache.spark.sql.RuntimeConfig.get(RuntimeConfig.scala:74)
>   ... 49 elided
> scala> spark.conf.get("hey", null)
> res1: String = null
> scala> spark.conf.get("spark.sql.sources.partitionOverwriteMode", null)
> res2: String = null
> {code}
> Python:
> {code}
> >>> spark.conf.get("hey")
> ...
> py4j.protocol.Py4JJavaError: An error occurred while calling o30.get.
> : java.util.NoSuchElementException: hey
> ...
> >>> spark.conf.get("hey", None)
> ...
> py4j.protocol.Py4JJavaError: An error occurred while calling o30.get.
> : java.util.NoSuchElementException: hey
> ...
> >>> spark.conf.get("spark.sql.sources.partitionOverwriteMode", None)
> u'STATIC'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23706) spark.conf.get(value, default=None) should produce None in PySpark

2018-03-15 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-23706:


 Summary: spark.conf.get(value, default=None) should produce None 
in PySpark
 Key: SPARK-23706
 URL: https://issues.apache.org/jira/browse/SPARK-23706
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Hyukjin Kwon


Scala:

{code}
scala> spark.conf.get("hey")
java.util.NoSuchElementException: hey
  at 
org.apache.spark.sql.internal.SQLConf$$anonfun$getConfString$2.apply(SQLConf.scala:1600)
  at 
org.apache.spark.sql.internal.SQLConf$$anonfun$getConfString$2.apply(SQLConf.scala:1600)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.internal.SQLConf.getConfString(SQLConf.scala:1600)
  at org.apache.spark.sql.RuntimeConfig.get(RuntimeConfig.scala:74)
  ... 49 elided

scala> spark.conf.get("hey", null)
res1: String = null

scala> spark.conf.get("spark.sql.sources.partitionOverwriteMode", null)
res2: String = null
{code}

Python:

{code}
>>> spark.conf.get("hey")
...
py4j.protocol.Py4JJavaError: An error occurred while calling o30.get.
: java.util.NoSuchElementException: hey
...

>>> spark.conf.get("hey", None)
...
py4j.protocol.Py4JJavaError: An error occurred while calling o30.get.
: java.util.NoSuchElementException: hey
...

>>> spark.conf.get("spark.sql.sources.partitionOverwriteMode", None)
u'STATIC'
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23705) dataframe.groupBy() may inadvertently receive sequence of non-distinct strings

2018-03-15 Thread Khoa Tran (JIRA)
Khoa Tran created SPARK-23705:
-

 Summary: dataframe.groupBy() may inadvertently receive sequence of 
non-distinct strings
 Key: SPARK-23705
 URL: https://issues.apache.org/jira/browse/SPARK-23705
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Khoa Tran


{code:java}
package org.apache.spark.sql
.
.
.
class Dataset[T] private[sql](
.
.
.
def groupBy(col1: String, cols: String*): RelationalGroupedDataset = {
  val colNames: Seq[String] = col1 +: cols
  RelationalGroupedDataset(
toDF(), colNames.map(colName => resolve(colName)), 
RelationalGroupedDataset.GroupByType)
}
{code}
`colNames` should have `.distinct` appended when used in `groupBy`.

Not sure whether the community agrees with this, or whether it is up to users to 
perform the distinct operation themselves.
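
A minimal sketch of what the suggested change could look like inside the same 
method (illustrative only, not a committed patch):

{code:java}
// Hedged sketch of the proposed tweak: de-duplicate the grouping column names
// before resolving them, so a repeated string does not yield duplicate grouping
// expressions. Illustrative only; not the actual Spark change.
def groupBy(col1: String, cols: String*): RelationalGroupedDataset = {
  val colNames: Seq[String] = (col1 +: cols).distinct
  RelationalGroupedDataset(
    toDF(), colNames.map(colName => resolve(colName)),
    RelationalGroupedDataset.GroupByType)
}
{code}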



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23685) Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)

2018-03-15 Thread sirisha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401428#comment-16401428
 ] 

sirisha edited comment on SPARK-23685 at 3/16/18 3:47 AM:
--

[~apachespark] Can anyone please guide me on how to assign this story to 
myself?  I do not see an option to assign it to myself.


was (Author: sindiri):
[~apachespark] Can anyone please guide me on how to assign this pull request to 
myself?  I do not see an option to assign it to myself.

> Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive 
> Offsets (i.e. Log Compaction)
> -
>
> Key: SPARK-23685
> URL: https://issues.apache.org/jira/browse/SPARK-23685
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: sirisha
>Priority: Major
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the 
> next requested offset will frequently not be offset+1. The logic in 
> KafkaSourceRDD & CachedKafkaConsumer assumes that the next offset will always 
> be an increment of 1. If not, it throws the exception below:
>  
> "Cannot fetch records in [5589, 5693) (GroupId: XXX, TopicPartition:). 
> Some data may have been lost because they are not available in Kafka any 
> more; either the data was aged out by Kafka or the topic may have been 
> deleted before all the data in the topic was processed. If you don't want 
> your streaming query to fail on such cases, set the source option 
> "failOnDataLoss" to "false". "
>  
> FYI: This bug is related to https://issues.apache.org/jira/browse/SPARK-17147
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23685) Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)

2018-03-15 Thread sirisha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401428#comment-16401428
 ] 

sirisha commented on SPARK-23685:
-

[~apachespark] Can anyone please guide me on how to assign this pull request to 
myself?  I do not see an option to assign it to myself.

> Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive 
> Offsets (i.e. Log Compaction)
> -
>
> Key: SPARK-23685
> URL: https://issues.apache.org/jira/browse/SPARK-23685
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: sirisha
>Priority: Major
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the 
> next requested offset will frequently not be offset+1. The logic in 
> KafkaSourceRDD & CachedKafkaConsumer assumes that the next offset will always 
> be an increment of 1. If not, it throws the exception below:
>  
> "Cannot fetch records in [5589, 5693) (GroupId: XXX, TopicPartition:). 
> Some data may have been lost because they are not available in Kafka any 
> more; either the data was aged out by Kafka or the topic may have been 
> deleted before all the data in the topic was processed. If you don't want 
> your streaming query to fail on such cases, set the source option 
> "failOnDataLoss" to "false". "
>  
> FYI: This bug is related to https://issues.apache.org/jira/browse/SPARK-17147
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23673) PySpark dayofweek does not conform with ISO 8601

2018-03-15 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401392#comment-16401392
 ] 

Kazuaki Ishizaki commented on SPARK-23673:
--

In Spark, `dayofweek` comes from SQL. The result value is based on the [ODBC 
standard|https://mariadb.com/kb/en/library/dayofweek/].
Would it be better to prepare another function that is compatible with Pandas? 
cc [~ueshin] [~bryanc]
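
For illustration, such a helper could be expressed on top of the existing 
function; {{isoDayOfWeek}} below is a made-up name, not an existing API:

{code}
// Hedged sketch: map the existing dayofweek result (1 = Sunday ... 7 = Saturday)
// onto the ISO-8601 numbering (1 = Monday ... 7 = Sunday).
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.dayofweek

def isoDayOfWeek(c: Column): Column = ((dayofweek(c) + 5) % 7) + 1

// e.g. Sunday: (1 + 5) % 7 + 1 = 7; Monday: (2 + 5) % 7 + 1 = 1
{code}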

> PySpark dayofweek does not conform with ISO 8601
> 
>
> Key: SPARK-23673
> URL: https://issues.apache.org/jira/browse/SPARK-23673
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.0
>Reporter: Ivan SPM
>Priority: Minor
>
> The new function dayofweek, which returns 1 for Sunday and 7 for Saturday, does 
> not conform to ISO 8601, which states that the first day of the week is 
> Monday, so 1 = Monday and 7 = Sunday. This behavior also differs from 
> Pandas, which uses 0 = Monday and 6 = Sunday, but Pandas is at least 
> consistent with the ordering of ISO 8601.
> [https://en.wikipedia.org/wiki/ISO_8601#Week_dates]
> (Also reported by Antonio Pedro Vieira (aps.vieir...@gmail.com))



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22390) Aggregate push down

2018-03-15 Thread Huaxin Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401378#comment-16401378
 ] 

Huaxin Gao edited comment on SPARK-22390 at 3/16/18 1:56 AM:
-

[~cloud_fan], I am working on Aggregate push down design doc and prototype. 
Could you please review the doc? Thanks a lot! 

[https://docs.google.com/document/d/1X3EVX-jyMv76KuZfX_VjQFmXeAmW3xYHe3M8DlGkbKQ/edit]


was (Author: huaxingao):
[~cloud_fan], I am working on Aggregate push down design doc and prototype. 
Could you please review the doc? Thanks a lot!  
[https://docs.google.com/document/d/1X3EVX-jyMv76KuZfX_VjQFmXeAmW3xYHe3M8DlGkbKQ/edit|http://example.com]

> Aggregate push down
> ---
>
> Key: SPARK-22390
> URL: https://issues.apache.org/jira/browse/SPARK-22390
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22390) Aggregate push down

2018-03-15 Thread Huaxin Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401378#comment-16401378
 ] 

Huaxin Gao commented on SPARK-22390:


[~cloud_fan], I am working on Aggregate push down design doc and prototype. 
Could you please review the doc? Thanks a lot!  
[https://docs.google.com/document/d/1X3EVX-jyMv76KuZfX_VjQFmXeAmW3xYHe3M8DlGkbKQ/edit|http://example.com]

> Aggregate push down
> ---
>
> Key: SPARK-22390
> URL: https://issues.apache.org/jira/browse/SPARK-22390
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23651) Add a check for host name

2018-03-15 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian resolved SPARK-23651.
-
Resolution: Fixed

> Add a  check for host name
> --
>
> Key: SPARK-23651
> URL: https://issues.apache.org/jira/browse/SPARK-23651
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Minor
>
> I encountered an error like this:
> _org.apache.spark.SparkException: Invalid Spark URL: 
> spark://HeartbeatReceiver@ci_164:42849_
>     _at 
> org.apache.spark.rpc.RpcEndpointAddress$.apply(RpcEndpointAddress.scala:66)_
>     _at 
> org.apache.spark.rpc.netty.NettyRpcEnv.asyncSetupEndpointRefByURI(NettyRpcEnv.scala:134)_
>     _at 
> org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)_
>     _at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)_
>     _at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:32)_
>     _at org.apache.spark.executor.Executor.(Executor.scala:155)_
>     _at 
> org.apache.spark.scheduler.local.LocalEndpoint.(LocalSchedulerBackend.scala:59)_
>     _at 
> org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:126)_
>     _at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)_
>  
> I didn't know why this _URL_ (spark://HeartbeatReceiver@ci_164:42849) was 
> invalid, so I think we should give a clearer error message for this case.
>  
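
A hedged sketch of the kind of up-front check the summary asks for (the helper 
name and the allowed-character pattern are assumptions, not the merged fix):

{code}
// Hedged sketch: validate the host name early with an explicit message instead
// of letting RpcEndpointAddress fail with a bare "Invalid Spark URL". Host names
// containing '_' (like "ci_164") are a typical trigger of this failure.
def checkHostName(host: String): Unit = {
  require(host != null && host.matches("[A-Za-z0-9][A-Za-z0-9.-]*"),
    s"Invalid host name '$host': only letters, digits, '.' and '-' are allowed")
}
{code}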



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23704) PySpark access of individual trees in random forest is slow

2018-03-15 Thread Julian King (JIRA)
Julian King created SPARK-23704:
---

 Summary: PySpark access of individual trees in random forest is 
slow
 Key: SPARK-23704
 URL: https://issues.apache.org/jira/browse/SPARK-23704
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.1
 Environment: PySpark 2.2.1 / Windows 10
Reporter: Julian King


Making predictions from a RandomForestClassifier in PySpark is much faster than 
making predictions from an individual tree contained in its .trees attribute.

In fact, the model.transform call (without an action) is more than 10x slower for 
an individual tree than the model.transform call for the random forest model.

See 
[https://stackoverflow.com/questions/49297470/slow-individual-tree-access-for-random-forest-in-pyspark]
 for example with timing.

Ideally:
 * Getting a prediction from a single tree should be comparable to or faster 
than getting a prediction from the whole forest
 * Getting all the predictions from all the individual trees should be 
comparable in speed to getting the predictions from the random forest

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23670) Memory leak of SparkPlanGraphWrapper in sparkUI

2018-03-15 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-23670:
--

Assignee: Myroslav Lisniak

> Memory leak of SparkPlanGraphWrapper in sparkUI
> ---
>
> Key: SPARK-23670
> URL: https://issues.apache.org/jira/browse/SPARK-23670
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Myroslav Lisniak
>Assignee: Myroslav Lisniak
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
> Attachments: heap.png
>
>
> Memory leak on the driver for a long-running application. We have an 
> application using Structured Streaming that runs for 48 hours, but the driver 
> fails with an out-of-memory error after 25 hours. After investigating a heap 
> dump, we found that most of the memory was occupied by a large number of 
> *SparkPlanGraphWrapper* objects inside *InMemoryStore*.
>  The application was run with the option: 
> --driver-memory 4G



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23670) Memory leak of SparkPlanGraphWrapper in sparkUI

2018-03-15 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23670.

   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0

Issue resolved by pull request 20813
[https://github.com/apache/spark/pull/20813]

> Memory leak of SparkPlanGraphWrapper in sparkUI
> ---
>
> Key: SPARK-23670
> URL: https://issues.apache.org/jira/browse/SPARK-23670
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Myroslav Lisniak
>Priority: Major
> Fix For: 2.4.0, 2.3.1
>
> Attachments: heap.png
>
>
> Memory leak on the driver for a long-running application. We have an 
> application using Structured Streaming that runs for 48 hours, but the driver 
> fails with an out-of-memory error after 25 hours. After investigating a heap 
> dump, we found that most of the memory was occupied by a large number of 
> *SparkPlanGraphWrapper* objects inside *InMemoryStore*.
>  The application was run with the option: 
> --driver-memory 4G



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23608) SHS needs synchronization between attachSparkUI and detachSparkUI functions

2018-03-15 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-23608:
--

Assignee: Ye Zhou

> SHS needs synchronization between attachSparkUI and detachSparkUI functions
> ---
>
> Key: SPARK-23608
> URL: https://issues.apache.org/jira/browse/SPARK-23608
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0
>Reporter: Ye Zhou
>Assignee: Ye Zhou
>Priority: Minor
> Fix For: 2.3.1, 2.4.0
>
>
> We continuously hit an issue with the SHS after it has been running for a while 
> and serving some REST API calls: it suddenly shows an empty home page with 0 
> applications. This is caused by unexpected JSON data returned from the REST 
> call "api/v1/applications?limit=8000", which returns the home page HTML 
> instead of a list of application summaries. Some other REST calls that ask for 
> detailed application information also return the home page HTML, while some 
> REST calls still work. We have to restart the SHS to resolve the issue.
> We attached a remote debugger to the problematic process and inspected the 
> attached jetty handler tree in the web server. We found that the jetty handler 
> added by "attachHandler(ApiRootResource.getServletHandler(this))" is not in 
> the tree, along with some other handlers. Without the root resource servlet 
> handler, the SHS cannot correctly serve either UI or REST calls; it returns 
> the HistoryServerPage HTML directly to the user because it cannot find a 
> handler for the request.
> The Spark History Server has to attach a SparkUI in order to serve user 
> requests. An application's SparkUI gets attached when the application details 
> are loaded into the Guava cache: while attaching the SparkUI, the SHS attaches 
> all of its jetty handlers to the current web service. When the data is cleared 
> out of the Guava cache, the SHS detaches all of the application's SparkUI 
> jetty handlers. Because of the asynchronous nature of the Guava cache, the 
> clear-out is not synchronized with loading into the cache, so the clear-out 
> that triggers detachSparkUI may be detaching handlers while attachSparkUI is 
> still attaching them.
> After adding synchronization between attachSparkUI and detachSparkUI in the 
> history server, this issue never happens again.
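
A hedged sketch of the synchronization described above (the lock, method 
signatures and bodies are placeholders, not the actual pull request):

{code}
// Hedged illustration: guard attach/detach of an application's UI handlers with
// a shared lock so a Guava-cache eviction cannot detach handlers while another
// thread is still attaching them. Bodies are placeholders.
val uiAttachLock = new Object

def attachSparkUI(appId: String): Unit = uiAttachLock.synchronized {
  // attach all of the application's jetty handlers to the web service
}

def detachSparkUI(appId: String): Unit = uiAttachLock.synchronized {
  // detach the application's jetty handlers when its cache entry is evicted
}
{code}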



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23608) SHS needs synchronization between attachSparkUI and detachSparkUI functions

2018-03-15 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23608.

   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0

Issue resolved by pull request 20744
[https://github.com/apache/spark/pull/20744]

> SHS needs synchronization between attachSparkUI and detachSparkUI functions
> ---
>
> Key: SPARK-23608
> URL: https://issues.apache.org/jira/browse/SPARK-23608
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0
>Reporter: Ye Zhou
>Priority: Minor
> Fix For: 2.4.0, 2.3.1
>
>
> We continuously hit an issue with the SHS after it has been running for a while 
> and serving some REST API calls: it suddenly shows an empty home page with 0 
> applications. This is caused by unexpected JSON data returned from the REST 
> call "api/v1/applications?limit=8000", which returns the home page HTML 
> instead of a list of application summaries. Some other REST calls that ask for 
> detailed application information also return the home page HTML, while some 
> REST calls still work. We have to restart the SHS to resolve the issue.
> We attached a remote debugger to the problematic process and inspected the 
> attached jetty handler tree in the web server. We found that the jetty handler 
> added by "attachHandler(ApiRootResource.getServletHandler(this))" is not in 
> the tree, along with some other handlers. Without the root resource servlet 
> handler, the SHS cannot correctly serve either UI or REST calls; it returns 
> the HistoryServerPage HTML directly to the user because it cannot find a 
> handler for the request.
> The Spark History Server has to attach a SparkUI in order to serve user 
> requests. An application's SparkUI gets attached when the application details 
> are loaded into the Guava cache: while attaching the SparkUI, the SHS attaches 
> all of its jetty handlers to the current web service. When the data is cleared 
> out of the Guava cache, the SHS detaches all of the application's SparkUI 
> jetty handlers. Because of the asynchronous nature of the Guava cache, the 
> clear-out is not synchronized with loading into the cache, so the clear-out 
> that triggers detachSparkUI may be detaching handlers while attachSparkUI is 
> still attaching them.
> After adding synchronization between attachSparkUI and detachSparkUI in the 
> history server, this issue never happens again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23671) SHS is ignoring number of replay threads

2018-03-15 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-23671:
--

Assignee: Marcelo Vanzin

> SHS is ignoring number of replay threads
> 
>
> Key: SPARK-23671
> URL: https://issues.apache.org/jira/browse/SPARK-23671
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Critical
> Fix For: 2.3.1, 2.4.0
>
>
> I mistakenly flipped a condition in a previous change and the SHS is now 
> basically doing single-threaded parsing of event logs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23671) SHS is ignoring number of replay threads

2018-03-15 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23671.

   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0

Issue resolved by pull request 20814
[https://github.com/apache/spark/pull/20814]

> SHS is ignoring number of replay threads
> 
>
> Key: SPARK-23671
> URL: https://issues.apache.org/jira/browse/SPARK-23671
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Critical
> Fix For: 2.4.0, 2.3.1
>
>
> I mistakenly flipped a condition in a previous change and the SHS is now 
> basically doing single-threaded parsing of event logs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8008) JDBC data source can overload the external database system due to high concurrency

2018-03-15 Thread Jo Desmet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401300#comment-16401300
 ] 

Jo Desmet commented on SPARK-8008:
--

Too bad that this issue is not considered high priority. Too many times I run 
into the problem of needing to process billions of records. The only way to 
handle this is to create a huge number of partitions and then throttle using 
spark.executor.cores. However, this setting effectively throttles my entire RDD, 
not just the portion that loads from the database. It would be hugely beneficial 
if I could restrict not only the number of partitions but also the task 
concurrency at any point in my RDD.
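
For reference, a hedged example of the existing knob for capping JDBC read 
concurrency (the URL, table and bounds are placeholders); it bounds the number of 
concurrent connections, but, as noted above, it does not decouple the load stage 
from the rest of the job:

{code}
// numPartitions caps how many partitions (and hence concurrent JDBC connections)
// the source uses when reading with partitionColumn/lowerBound/upperBound.
// Connection details below are placeholders.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")
  .option("dbtable", "big_table")
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "1000000000")
  .option("numPartitions", "16")   // bound on concurrent reads
  .load()
{code}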

> JDBC data source can overload the external database system due to high 
> concurrency
> --
>
> Key: SPARK-8008
> URL: https://issues.apache.org/jira/browse/SPARK-8008
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Rene Treffer
>Priority: Major
>
> Spark tries to load as many partitions as possible in parallel, which can in 
> turn overload the database although it would be possible to load all 
> partitions given a lower concurrency.
> It would be nice to either limit the maximum concurrency or to at least warn 
> about this behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23703) Collapse sequential watermarks

2018-03-15 Thread Jose Torres (JIRA)
Jose Torres created SPARK-23703:
---

 Summary: Collapse sequential watermarks 
 Key: SPARK-23703
 URL: https://issues.apache.org/jira/browse/SPARK-23703
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Jose Torres


When there are two sequential EventTimeWatermark nodes in a query plan, the 
topmost one overrides the column tracking metadata from its children, but 
leaves the nodes themselves untouched. When there is no intervening stateful 
operation to consume the watermark, we should remove the lower node entirely.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23702) Forbid watermarks on both sides of a streaming aggregate

2018-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23702:


Assignee: (was: Apache Spark)

> Forbid watermarks on both sides of a streaming aggregate
> 
>
> Key: SPARK-23702
> URL: https://issues.apache.org/jira/browse/SPARK-23702
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23702) Forbid watermarks on both sides of a streaming aggregate

2018-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401295#comment-16401295
 ] 

Apache Spark commented on SPARK-23702:
--

User 'jose-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/20840

> Forbid watermarks on both sides of a streaming aggregate
> 
>
> Key: SPARK-23702
> URL: https://issues.apache.org/jira/browse/SPARK-23702
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23702) Forbid watermarks on both sides of a streaming aggregate

2018-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23702:


Assignee: Apache Spark

> Forbid watermarks on both sides of a streaming aggregate
> 
>
> Key: SPARK-23702
> URL: https://issues.apache.org/jira/browse/SPARK-23702
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23658) InProcessAppHandle uses the wrong class in getLogger

2018-03-15 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23658.

   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0

Issue resolved by pull request 20815
[https://github.com/apache/spark/pull/20815]

> InProcessAppHandle uses the wrong class in getLogger
> 
>
> Key: SPARK-23658
> URL: https://issues.apache.org/jira/browse/SPARK-23658
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Minor
> Fix For: 2.4.0, 2.3.1
>
>
> {{InProcessAppHandle}} uses {{ChildProcAppHandle}} as the class in 
> {{getLogger}}; it should just use its own name.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23658) InProcessAppHandle uses the wrong class in getLogger

2018-03-15 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-23658:
--

Assignee: Sahil Takiar

> InProcessAppHandle uses the wrong class in getLogger
> 
>
> Key: SPARK-23658
> URL: https://issues.apache.org/jira/browse/SPARK-23658
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Minor
> Fix For: 2.3.1, 2.4.0
>
>
> {{InProcessAppHandle}} uses {{ChildProcAppHandle}} as the class in 
> {{getLogger}}; it should just use its own name.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23702) Forbid watermarks on both sides of a streaming aggregate

2018-03-15 Thread Jose Torres (JIRA)
Jose Torres created SPARK-23702:
---

 Summary: Forbid watermarks on both sides of a streaming aggregate
 Key: SPARK-23702
 URL: https://issues.apache.org/jira/browse/SPARK-23702
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Jose Torres






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23701) Multiple sequential watermarks are not supported

2018-03-15 Thread Jose Torres (JIRA)
Jose Torres created SPARK-23701:
---

 Summary: Multiple sequential watermarks are not supported
 Key: SPARK-23701
 URL: https://issues.apache.org/jira/browse/SPARK-23701
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Jose Torres


In 2.3, we allowed query plans with multiple watermarks to run in order to enable 
stream-stream joins, but we have only implemented the functionality for 
watermarks feeding in parallel into a join operator. It currently won't work 
(and would require in-depth changes) if the watermarks are sequential in the 
plan.
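
For illustration, a hedged sketch of a plan shape with sequential watermarks 
(the input DataFrame, column names and intervals are made up):

{code}
// A second withWatermark applied downstream of an aggregation that already
// consumed a watermark; the two watermark nodes end up sequential in the plan.
// `events` is assumed to be a streaming DataFrame with an `eventTime` column.
import org.apache.spark.sql.functions.{col, window}

val counts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window(col("eventTime"), "5 minutes"))
  .count()
  .select(col("window.end").as("windowEnd"), col("count"))
  .withWatermark("windowEnd", "10 minutes")   // second, sequential watermark node
{code}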



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23700) Cleanup unused imports

2018-03-15 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-23700:


 Summary: Cleanup unused imports
 Key: SPARK-23700
 URL: https://issues.apache.org/jira/browse/SPARK-23700
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.4.0
Reporter: Bryan Cutler


I've noticed a fair number of unused imports in PySpark; I'll take a look 
through and try to clean them up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23698) Spark code contains numerous undefined names in Python 3

2018-03-15 Thread cclauss (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401252#comment-16401252
 ] 

cclauss commented on SPARK-23698:
-

A PR to fix for 17 of the 20 issues is at 
https://github.com/apache/spark/pull/20838/files

> Spark code contains numerous undefined names in Python 3
> 
>
> Key: SPARK-23698
> URL: https://issues.apache.org/jira/browse/SPARK-23698
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: cclauss
>Priority: Minor
>
> flake8 testing of https://github.com/apache/spark on Python 3.6.3
> $ *flake8 . --count --select=E901,E999,F821,F822,F823 --show-source 
> --statistics*
> ./dev/merge_spark_pr.py:98:14: F821 undefined name 'raw_input'
> result = raw_input("\n%s (y/n): " % prompt)
>  ^
> ./dev/merge_spark_pr.py:136:22: F821 undefined name 'raw_input'
> primary_author = raw_input(
>  ^
> ./dev/merge_spark_pr.py:186:16: F821 undefined name 'raw_input'
> pick_ref = raw_input("Enter a branch name [%s]: " % default_branch)
>^
> ./dev/merge_spark_pr.py:233:15: F821 undefined name 'raw_input'
> jira_id = raw_input("Enter a JIRA id [%s]: " % default_jira_id)
>   ^
> ./dev/merge_spark_pr.py:278:20: F821 undefined name 'raw_input'
> fix_versions = raw_input("Enter comma-separated fix version(s) [%s]: " % 
> default_fix_versions)
>^
> ./dev/merge_spark_pr.py:317:28: F821 undefined name 'raw_input'
> raw_assignee = raw_input(
>^
> ./dev/merge_spark_pr.py:430:14: F821 undefined name 'raw_input'
> pr_num = raw_input("Which pull request would you like to merge? (e.g. 
> 34): ")
>  ^
> ./dev/merge_spark_pr.py:442:18: F821 undefined name 'raw_input'
> result = raw_input("Would you like to use the modified title? (y/n): 
> ")
>  ^
> ./dev/merge_spark_pr.py:493:11: F821 undefined name 'raw_input'
> while raw_input("\n%s (y/n): " % pick_prompt).lower() == "y":
>   ^
> ./dev/create-release/releaseutils.py:58:16: F821 undefined name 'raw_input'
> response = raw_input("%s [y/n]: " % msg)
>^
> ./dev/create-release/releaseutils.py:152:38: F821 undefined name 'unicode'
> author = unidecode.unidecode(unicode(author, "UTF-8")).strip()
>  ^
> ./python/setup.py:37:11: F821 undefined name '__version__'
> VERSION = __version__
>   ^
> ./python/pyspark/cloudpickle.py:275:18: F821 undefined name 'buffer'
> dispatch[buffer] = save_buffer
>  ^
> ./python/pyspark/cloudpickle.py:807:18: F821 undefined name 'file'
> dispatch[file] = save_file
>  ^
> ./python/pyspark/sql/conf.py:61:61: F821 undefined name 'unicode'
> if not isinstance(obj, str) and not isinstance(obj, unicode):
> ^
> ./python/pyspark/sql/streaming.py:25:21: F821 undefined name 'long'
> intlike = (int, long)
> ^
> ./python/pyspark/streaming/dstream.py:405:35: F821 undefined name 'long'
> return self._sc._jvm.Time(long(timestamp * 1000))
>   ^
> ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:21:10: F821 
> undefined name 'xrange'
> for i in xrange(50):
>  ^
> ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:22:14: F821 
> undefined name 'xrange'
> for j in xrange(5):
>  ^
> ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:23:18: F821 
> undefined name 'xrange'
> for k in xrange(20022):
>  ^
> 20    F821 undefined name 'raw_input'
> 20



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-23698) Spark code contains numerous undefined names in Python 3

2018-03-15 Thread cclauss (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cclauss updated SPARK-23698:

Comment: was deleted

(was: A PR to fix for 17 of the 20 issues is at 
https://github.com/apache/spark/pull/20838/files)

> Spark code contains numerous undefined names in Python 3
> 
>
> Key: SPARK-23698
> URL: https://issues.apache.org/jira/browse/SPARK-23698
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: cclauss
>Priority: Minor
>
> flake8 testing of https://github.com/apache/spark on Python 3.6.3
> $ *flake8 . --count --select=E901,E999,F821,F822,F823 --show-source 
> --statistics*
> ./dev/merge_spark_pr.py:98:14: F821 undefined name 'raw_input'
> result = raw_input("\n%s (y/n): " % prompt)
>  ^
> ./dev/merge_spark_pr.py:136:22: F821 undefined name 'raw_input'
> primary_author = raw_input(
>  ^
> ./dev/merge_spark_pr.py:186:16: F821 undefined name 'raw_input'
> pick_ref = raw_input("Enter a branch name [%s]: " % default_branch)
>^
> ./dev/merge_spark_pr.py:233:15: F821 undefined name 'raw_input'
> jira_id = raw_input("Enter a JIRA id [%s]: " % default_jira_id)
>   ^
> ./dev/merge_spark_pr.py:278:20: F821 undefined name 'raw_input'
> fix_versions = raw_input("Enter comma-separated fix version(s) [%s]: " % 
> default_fix_versions)
>^
> ./dev/merge_spark_pr.py:317:28: F821 undefined name 'raw_input'
> raw_assignee = raw_input(
>^
> ./dev/merge_spark_pr.py:430:14: F821 undefined name 'raw_input'
> pr_num = raw_input("Which pull request would you like to merge? (e.g. 
> 34): ")
>  ^
> ./dev/merge_spark_pr.py:442:18: F821 undefined name 'raw_input'
> result = raw_input("Would you like to use the modified title? (y/n): 
> ")
>  ^
> ./dev/merge_spark_pr.py:493:11: F821 undefined name 'raw_input'
> while raw_input("\n%s (y/n): " % pick_prompt).lower() == "y":
>   ^
> ./dev/create-release/releaseutils.py:58:16: F821 undefined name 'raw_input'
> response = raw_input("%s [y/n]: " % msg)
>^
> ./dev/create-release/releaseutils.py:152:38: F821 undefined name 'unicode'
> author = unidecode.unidecode(unicode(author, "UTF-8")).strip()
>  ^
> ./python/setup.py:37:11: F821 undefined name '__version__'
> VERSION = __version__
>   ^
> ./python/pyspark/cloudpickle.py:275:18: F821 undefined name 'buffer'
> dispatch[buffer] = save_buffer
>  ^
> ./python/pyspark/cloudpickle.py:807:18: F821 undefined name 'file'
> dispatch[file] = save_file
>  ^
> ./python/pyspark/sql/conf.py:61:61: F821 undefined name 'unicode'
> if not isinstance(obj, str) and not isinstance(obj, unicode):
> ^
> ./python/pyspark/sql/streaming.py:25:21: F821 undefined name 'long'
> intlike = (int, long)
> ^
> ./python/pyspark/streaming/dstream.py:405:35: F821 undefined name 'long'
> return self._sc._jvm.Time(long(timestamp * 1000))
>   ^
> ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:21:10: F821 
> undefined name 'xrange'
> for i in xrange(50):
>  ^
> ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:22:14: F821 
> undefined name 'xrange'
> for j in xrange(5):
>  ^
> ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:23:18: F821 
> undefined name 'xrange'
> for k in xrange(20022):
>  ^
> 20    F821 undefined name 'raw_input'
> 20



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23699) PySpark should raise same Error when Arrow fallback is disabled

2018-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23699:


Assignee: Apache Spark

> PySpark should raise same Error when Arrow fallback is disabled
> ---
>
> Key: SPARK-23699
> URL: https://issues.apache.org/jira/browse/SPARK-23699
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Minor
>
> When a schema or import error is encountered when using Arrow for 
> createDataFrame or toPandas and fallback is disabled, a RuntimeError is 
> raised.  It would be better to raise the same type of error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23699) PySpark should raise same Error when Arrow fallback is disabled

2018-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23699:


Assignee: (was: Apache Spark)

> PySpark should raise same Error when Arrow fallback is disabled
> ---
>
> Key: SPARK-23699
> URL: https://issues.apache.org/jira/browse/SPARK-23699
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Priority: Minor
>
> When a schema or import error is encountered when using Arrow for 
> createDataFrame or toPandas and fallback is disabled, a RuntimeError is 
> raised.  It would be better to raise the same type of error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23699) PySpark should raise same Error when Arrow fallback is disabled

2018-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401250#comment-16401250
 ] 

Apache Spark commented on SPARK-23699:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/20839

> PySpark should raise same Error when Arrow fallback is disabled
> ---
>
> Key: SPARK-23699
> URL: https://issues.apache.org/jira/browse/SPARK-23699
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Priority: Minor
>
> When a schema or import error is encountered when using Arrow for 
> createDataFrame or toPandas and fallback is disabled, a RuntimeError is 
> raised.  It would be better to raise the same type of error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23698) Spark code contains numerous undefined names in Python 3

2018-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23698:


Assignee: Apache Spark

> Spark code contains numerous undefined names in Python 3
> 
>
> Key: SPARK-23698
> URL: https://issues.apache.org/jira/browse/SPARK-23698
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: cclauss
>Assignee: Apache Spark
>Priority: Minor
>
> flake8 testing of https://github.com/apache/spark on Python 3.6.3
> $ *flake8 . --count --select=E901,E999,F821,F822,F823 --show-source 
> --statistics*
> ./dev/merge_spark_pr.py:98:14: F821 undefined name 'raw_input'
> result = raw_input("\n%s (y/n): " % prompt)
>  ^
> ./dev/merge_spark_pr.py:136:22: F821 undefined name 'raw_input'
> primary_author = raw_input(
>  ^
> ./dev/merge_spark_pr.py:186:16: F821 undefined name 'raw_input'
> pick_ref = raw_input("Enter a branch name [%s]: " % default_branch)
>^
> ./dev/merge_spark_pr.py:233:15: F821 undefined name 'raw_input'
> jira_id = raw_input("Enter a JIRA id [%s]: " % default_jira_id)
>   ^
> ./dev/merge_spark_pr.py:278:20: F821 undefined name 'raw_input'
> fix_versions = raw_input("Enter comma-separated fix version(s) [%s]: " % 
> default_fix_versions)
>^
> ./dev/merge_spark_pr.py:317:28: F821 undefined name 'raw_input'
> raw_assignee = raw_input(
>^
> ./dev/merge_spark_pr.py:430:14: F821 undefined name 'raw_input'
> pr_num = raw_input("Which pull request would you like to merge? (e.g. 
> 34): ")
>  ^
> ./dev/merge_spark_pr.py:442:18: F821 undefined name 'raw_input'
> result = raw_input("Would you like to use the modified title? (y/n): 
> ")
>  ^
> ./dev/merge_spark_pr.py:493:11: F821 undefined name 'raw_input'
> while raw_input("\n%s (y/n): " % pick_prompt).lower() == "y":
>   ^
> ./dev/create-release/releaseutils.py:58:16: F821 undefined name 'raw_input'
> response = raw_input("%s [y/n]: " % msg)
>^
> ./dev/create-release/releaseutils.py:152:38: F821 undefined name 'unicode'
> author = unidecode.unidecode(unicode(author, "UTF-8")).strip()
>  ^
> ./python/setup.py:37:11: F821 undefined name '__version__'
> VERSION = __version__
>   ^
> ./python/pyspark/cloudpickle.py:275:18: F821 undefined name 'buffer'
> dispatch[buffer] = save_buffer
>  ^
> ./python/pyspark/cloudpickle.py:807:18: F821 undefined name 'file'
> dispatch[file] = save_file
>  ^
> ./python/pyspark/sql/conf.py:61:61: F821 undefined name 'unicode'
> if not isinstance(obj, str) and not isinstance(obj, unicode):
> ^
> ./python/pyspark/sql/streaming.py:25:21: F821 undefined name 'long'
> intlike = (int, long)
> ^
> ./python/pyspark/streaming/dstream.py:405:35: F821 undefined name 'long'
> return self._sc._jvm.Time(long(timestamp * 1000))
>   ^
> ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:21:10: F821 
> undefined name 'xrange'
> for i in xrange(50):
>  ^
> ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:22:14: F821 
> undefined name 'xrange'
> for j in xrange(5):
>  ^
> ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:23:18: F821 
> undefined name 'xrange'
> for k in xrange(20022):
>  ^
> 20F821 undefined name 'raw_input'
> 20



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23698) Spark code contains numerous undefined names in Python 3

2018-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23698:


Assignee: (was: Apache Spark)

> Spark code contains numerous undefined names in Python 3
> 
>
> Key: SPARK-23698
> URL: https://issues.apache.org/jira/browse/SPARK-23698
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: cclauss
>Priority: Minor
>
> flake8 testing of https://github.com/apache/spark on Python 3.6.3
> $ *flake8 . --count --select=E901,E999,F821,F822,F823 --show-source 
> --statistics*
> ./dev/merge_spark_pr.py:98:14: F821 undefined name 'raw_input'
> result = raw_input("\n%s (y/n): " % prompt)
>  ^
> ./dev/merge_spark_pr.py:136:22: F821 undefined name 'raw_input'
> primary_author = raw_input(
>  ^
> ./dev/merge_spark_pr.py:186:16: F821 undefined name 'raw_input'
> pick_ref = raw_input("Enter a branch name [%s]: " % default_branch)
>^
> ./dev/merge_spark_pr.py:233:15: F821 undefined name 'raw_input'
> jira_id = raw_input("Enter a JIRA id [%s]: " % default_jira_id)
>   ^
> ./dev/merge_spark_pr.py:278:20: F821 undefined name 'raw_input'
> fix_versions = raw_input("Enter comma-separated fix version(s) [%s]: " % 
> default_fix_versions)
>^
> ./dev/merge_spark_pr.py:317:28: F821 undefined name 'raw_input'
> raw_assignee = raw_input(
>^
> ./dev/merge_spark_pr.py:430:14: F821 undefined name 'raw_input'
> pr_num = raw_input("Which pull request would you like to merge? (e.g. 
> 34): ")
>  ^
> ./dev/merge_spark_pr.py:442:18: F821 undefined name 'raw_input'
> result = raw_input("Would you like to use the modified title? (y/n): 
> ")
>  ^
> ./dev/merge_spark_pr.py:493:11: F821 undefined name 'raw_input'
> while raw_input("\n%s (y/n): " % pick_prompt).lower() == "y":
>   ^
> ./dev/create-release/releaseutils.py:58:16: F821 undefined name 'raw_input'
> response = raw_input("%s [y/n]: " % msg)
>^
> ./dev/create-release/releaseutils.py:152:38: F821 undefined name 'unicode'
> author = unidecode.unidecode(unicode(author, "UTF-8")).strip()
>  ^
> ./python/setup.py:37:11: F821 undefined name '__version__'
> VERSION = __version__
>   ^
> ./python/pyspark/cloudpickle.py:275:18: F821 undefined name 'buffer'
> dispatch[buffer] = save_buffer
>  ^
> ./python/pyspark/cloudpickle.py:807:18: F821 undefined name 'file'
> dispatch[file] = save_file
>  ^
> ./python/pyspark/sql/conf.py:61:61: F821 undefined name 'unicode'
> if not isinstance(obj, str) and not isinstance(obj, unicode):
> ^
> ./python/pyspark/sql/streaming.py:25:21: F821 undefined name 'long'
> intlike = (int, long)
> ^
> ./python/pyspark/streaming/dstream.py:405:35: F821 undefined name 'long'
> return self._sc._jvm.Time(long(timestamp * 1000))
>   ^
> ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:21:10: F821 
> undefined name 'xrange'
> for i in xrange(50):
>  ^
> ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:22:14: F821 
> undefined name 'xrange'
> for j in xrange(5):
>  ^
> ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:23:18: F821 
> undefined name 'xrange'
> for k in xrange(20022):
>  ^
> 20F821 undefined name 'raw_input'
> 20



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23698) Spark code contains numerous undefined names in Python 3

2018-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401247#comment-16401247
 ] 

Apache Spark commented on SPARK-23698:
--

User 'cclauss' has created a pull request for this issue:
https://github.com/apache/spark/pull/20838

> Spark code contains numerous undefined names in Python 3
> 
>
> Key: SPARK-23698
> URL: https://issues.apache.org/jira/browse/SPARK-23698
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: cclauss
>Priority: Minor
>
> flake8 testing of https://github.com/apache/spark on Python 3.6.3
> $ *flake8 . --count --select=E901,E999,F821,F822,F823 --show-source 
> --statistics*
> ./dev/merge_spark_pr.py:98:14: F821 undefined name 'raw_input'
> result = raw_input("\n%s (y/n): " % prompt)
>  ^
> ./dev/merge_spark_pr.py:136:22: F821 undefined name 'raw_input'
> primary_author = raw_input(
>  ^
> ./dev/merge_spark_pr.py:186:16: F821 undefined name 'raw_input'
> pick_ref = raw_input("Enter a branch name [%s]: " % default_branch)
>^
> ./dev/merge_spark_pr.py:233:15: F821 undefined name 'raw_input'
> jira_id = raw_input("Enter a JIRA id [%s]: " % default_jira_id)
>   ^
> ./dev/merge_spark_pr.py:278:20: F821 undefined name 'raw_input'
> fix_versions = raw_input("Enter comma-separated fix version(s) [%s]: " % 
> default_fix_versions)
>^
> ./dev/merge_spark_pr.py:317:28: F821 undefined name 'raw_input'
> raw_assignee = raw_input(
>^
> ./dev/merge_spark_pr.py:430:14: F821 undefined name 'raw_input'
> pr_num = raw_input("Which pull request would you like to merge? (e.g. 
> 34): ")
>  ^
> ./dev/merge_spark_pr.py:442:18: F821 undefined name 'raw_input'
> result = raw_input("Would you like to use the modified title? (y/n): 
> ")
>  ^
> ./dev/merge_spark_pr.py:493:11: F821 undefined name 'raw_input'
> while raw_input("\n%s (y/n): " % pick_prompt).lower() == "y":
>   ^
> ./dev/create-release/releaseutils.py:58:16: F821 undefined name 'raw_input'
> response = raw_input("%s [y/n]: " % msg)
>^
> ./dev/create-release/releaseutils.py:152:38: F821 undefined name 'unicode'
> author = unidecode.unidecode(unicode(author, "UTF-8")).strip()
>  ^
> ./python/setup.py:37:11: F821 undefined name '__version__'
> VERSION = __version__
>   ^
> ./python/pyspark/cloudpickle.py:275:18: F821 undefined name 'buffer'
> dispatch[buffer] = save_buffer
>  ^
> ./python/pyspark/cloudpickle.py:807:18: F821 undefined name 'file'
> dispatch[file] = save_file
>  ^
> ./python/pyspark/sql/conf.py:61:61: F821 undefined name 'unicode'
> if not isinstance(obj, str) and not isinstance(obj, unicode):
> ^
> ./python/pyspark/sql/streaming.py:25:21: F821 undefined name 'long'
> intlike = (int, long)
> ^
> ./python/pyspark/streaming/dstream.py:405:35: F821 undefined name 'long'
> return self._sc._jvm.Time(long(timestamp * 1000))
>   ^
> ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:21:10: F821 
> undefined name 'xrange'
> for i in xrange(50):
>  ^
> ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:22:14: F821 
> undefined name 'xrange'
> for j in xrange(5):
>  ^
> ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:23:18: F821 
> undefined name 'xrange'
> for k in xrange(20022):
>  ^
> 20F821 undefined name 'raw_input'
> 20



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23699) PySpark should raise same Error when Arrow fallback is disabled

2018-03-15 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-23699:


 Summary: PySpark should raise same Error when Arrow fallback is 
disabled
 Key: SPARK-23699
 URL: https://issues.apache.org/jira/browse/SPARK-23699
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 2.4.0
Reporter: Bryan Cutler


When a schema or import error is encountered when using Arrow for 
createDataFrame or toPandas and fallback is disabled, a RuntimeError is raised. 
 It would be better to raise the same type of error.
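For readers skimming the archive, a minimal sketch of the behaviour being requested is below. The convert_with_arrow / convert_without_arrow helpers and the to_pandas_safely wrapper are hypothetical stand-ins for the real conversion paths; this is illustrative only and is not the code from the linked pull request.

{code:python}
import warnings

def convert_with_arrow(df):
    # Hypothetical stand-in for the Arrow conversion path; fails here on purpose.
    raise ImportError("pyarrow is not installed")

def convert_without_arrow(df):
    # Hypothetical stand-in for the non-Arrow conversion path.
    return df

def to_pandas_safely(df, fallback_enabled):
    try:
        return convert_with_arrow(df)
    except Exception as err:
        if fallback_enabled:
            warnings.warn("Arrow optimization failed, falling back: %s" % err)
            return convert_without_arrow(df)
        # Re-raise the original error type (ImportError here) rather than
        # wrapping it in a generic RuntimeError, so callers can catch it directly.
        raise

print(to_pandas_safely("rows", fallback_enabled=True))
{code}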



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23632) sparkR.session() error with spark packages - JVM is not ready after 10 seconds

2018-03-15 Thread Jaehyeon Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401210#comment-16401210
 ] 

Jaehyeon Kim commented on SPARK-23632:
--

I've looked into it further and found it'd be better to have the package download 
and the session start separated if it takes a long time to download a package and 
its dependencies.

A quick check shows the following doesn't throw an error and the jars are 
downloaded to _~/.ivy2/jars_.

{code:java}
echo 'print("done")' > pkg_install.R
/usr/local/spark-2.2.1/bin/spark-submit \
--master spark://master:7077 \
--packages org.apache.hadoop:hadoop-aws:2.8.2 \
pkg_install.R
{code}

Let me come back with a structured example.


> sparkR.session() error with spark packages - JVM is not ready after 10 seconds
> --
>
> Key: SPARK-23632
> URL: https://issues.apache.org/jira/browse/SPARK-23632
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
>Reporter: Jaehyeon Kim
>Priority: Minor
>
> Hi
> When I execute _sparkR.session()_ with _org.apache.hadoop:hadoop-aws:2.8.2_ 
> as following,
> {code:java}
> library(SparkR, lib.loc=file.path(Sys.getenv('SPARK_HOME'),'R', 'lib'))
> ext_opts <- '-Dhttp.proxyHost=10.74.1.25 -Dhttp.proxyPort=8080 
> -Dhttps.proxyHost=10.74.1.25 -Dhttps.proxyPort=8080'
> sparkR.session(master = "spark://master:7077",
>appName = 'ml demo',
>sparkConfig = list(spark.driver.memory = '2g'), 
>sparkPackages = 'org.apache.hadoop:hadoop-aws:2.8.2',
>spark.driver.extraJavaOptions = ext_opts)
> {code}
> I see *JVM is not ready after 10 seconds* error. Below shows some of the log 
> messages.
> {code:java}
> Ivy Default Cache set to: /home/rstudio/.ivy2/cache
> The jars for the packages stored in: /home/rstudio/.ivy2/jars
> :: loading settings :: url = 
> jar:file:/usr/local/spark-2.2.1/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
> org.apache.hadoop#hadoop-aws added as a dependency
> :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
>   confs: [default]
>   found org.apache.hadoop#hadoop-aws;2.8.2 in central
> ...
> ...
>   found javax.servlet.jsp#jsp-api;2.1 in central
> Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap,  : 
>   JVM is not ready after 10 seconds
> ...
> ...
>   found joda-time#joda-time;2.9.4 in central
> downloading 
> https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.8.2/hadoop-aws-2.8.2.jar
>  ...
> ...
> ...
>   xmlenc#xmlenc;0.52 from central in [default]
>   -
>   |  |modules||   artifacts   |
>   |   conf   | number| search|dwnlded|evicted|| number|dwnlded|
>   -
>   |  default |   76  |   76  |   76  |   0   ||   76  |   76  |
>   -
> :: retrieving :: org.apache.spark#spark-submit-parent
>   confs: [default]
>   76 artifacts copied, 0 already retrieved (27334kB/56ms)
> {code}
> It's fine if I re-execute it after the package and its dependencies are 
> downloaded.
> I consider it's because of this part - 
> https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L181
> {code:java}
> if (!file.exists(path)) {
>   stop("JVM is not ready after 10 seconds")
> }
> {code}
> Just wondering if it may be possible to update this so that a user can determine 
> how long to wait?
> Thanks.
> Regards
> Jaehyeon



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23698) Spark code contains numerous undefined names in Python 3

2018-03-15 Thread cclauss (JIRA)
cclauss created SPARK-23698:
---

 Summary: Spark code contains numerous undefined names in Python 3
 Key: SPARK-23698
 URL: https://issues.apache.org/jira/browse/SPARK-23698
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.0
Reporter: cclauss


flake8 testing of https://github.com/apache/spark on Python 3.6.3

$ *flake8 . --count --select=E901,E999,F821,F822,F823 --show-source 
--statistics*

./dev/merge_spark_pr.py:98:14: F821 undefined name 'raw_input'
result = raw_input("\n%s (y/n): " % prompt)
 ^
./dev/merge_spark_pr.py:136:22: F821 undefined name 'raw_input'
primary_author = raw_input(
 ^
./dev/merge_spark_pr.py:186:16: F821 undefined name 'raw_input'
pick_ref = raw_input("Enter a branch name [%s]: " % default_branch)
   ^
./dev/merge_spark_pr.py:233:15: F821 undefined name 'raw_input'
jira_id = raw_input("Enter a JIRA id [%s]: " % default_jira_id)
  ^
./dev/merge_spark_pr.py:278:20: F821 undefined name 'raw_input'
fix_versions = raw_input("Enter comma-separated fix version(s) [%s]: " % 
default_fix_versions)
   ^
./dev/merge_spark_pr.py:317:28: F821 undefined name 'raw_input'
raw_assignee = raw_input(
   ^
./dev/merge_spark_pr.py:430:14: F821 undefined name 'raw_input'
pr_num = raw_input("Which pull request would you like to merge? (e.g. 34): 
")
 ^
./dev/merge_spark_pr.py:442:18: F821 undefined name 'raw_input'
result = raw_input("Would you like to use the modified title? (y/n): ")
 ^
./dev/merge_spark_pr.py:493:11: F821 undefined name 'raw_input'
while raw_input("\n%s (y/n): " % pick_prompt).lower() == "y":
  ^
./dev/create-release/releaseutils.py:58:16: F821 undefined name 'raw_input'
response = raw_input("%s [y/n]: " % msg)
   ^
./dev/create-release/releaseutils.py:152:38: F821 undefined name 'unicode'
author = unidecode.unidecode(unicode(author, "UTF-8")).strip()
 ^
./python/setup.py:37:11: F821 undefined name '__version__'
VERSION = __version__
  ^
./python/pyspark/cloudpickle.py:275:18: F821 undefined name 'buffer'
dispatch[buffer] = save_buffer
 ^
./python/pyspark/cloudpickle.py:807:18: F821 undefined name 'file'
dispatch[file] = save_file
 ^
./python/pyspark/sql/conf.py:61:61: F821 undefined name 'unicode'
if not isinstance(obj, str) and not isinstance(obj, unicode):
^
./python/pyspark/sql/streaming.py:25:21: F821 undefined name 'long'
intlike = (int, long)
^
./python/pyspark/streaming/dstream.py:405:35: F821 undefined name 'long'
return self._sc._jvm.Time(long(timestamp * 1000))
  ^
./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:21:10: F821 
undefined name 'xrange'
for i in xrange(50):
 ^
./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:22:14: F821 
undefined name 'xrange'
for j in xrange(5):
 ^
./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:23:18: F821 
undefined name 'xrange'
for k in xrange(20022):
 ^
20     F821 undefined name 'raw_input'
20
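As an aside for readers of the archive: the conventional way to keep these names defined under both major Python versions is a small compatibility shim like the sketch below. It is illustrative only and not necessarily the fix adopted in the pull request.

{code:python}
import sys

# Shims for the names flagged by flake8 above; on Python 2 the builtins
# already exist, so nothing needs to be redefined there.
if sys.version_info[0] >= 3:
    raw_input = input   # raw_input() was renamed to input() in Python 3
    xrange = range      # xrange() was removed; range() is already lazy
    long = int          # the separate long type was merged into int
    unicode = str       # str is always unicode in Python 3

print([long(i) for i in xrange(3)])   # works unchanged on Python 2 and 3
print(isinstance(u"spark", unicode))  # True on both major versions
{code}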



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23697) Accumulators of Spark 1.x no longer work with Spark 2.x

2018-03-15 Thread Sergey Zhemzhitsky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Zhemzhitsky updated SPARK-23697:
---
Description: 
I've noticed that accumulators of Spark 1.x no longer work with Spark 2.x 
failing with
 java.lang.AssertionError: assertion failed: copyAndReset must return a zero 
value copy

It happens while serializing an accumulator 
[here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L165]
{code:java}
val copyAcc = copyAndReset()
assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code}
... although copyAndReset returns zero-value copy for sure, just consider the 
accumulator below
{code:java}
val concatParam = new AccumulatorParam[jl.StringBuilder] {
  override def zero(initialValue: jl.StringBuilder): jl.StringBuilder = new 
jl.StringBuilder()
  override def addInPlace(r1: jl.StringBuilder, r2: jl.StringBuilder): 
jl.StringBuilder = r1.append(r2)
}{code}
So, Spark treats zero value as non-zero due to how 
[isZero|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L489]
 is implemented in LegacyAccumulatorWrapper.
{code:java}
override def isZero: Boolean = _value == param.zero(initialValue){code}
All this means that the values to be accumulated must implement equals and 
hashCode, otherwise isZero is very likely to always return false.

So I'm wondering whether the assertion 
{code:java}
assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code}
is really necessary and whether it can be safely removed from there?

If not - is it ok to just override writeReplace for LegacyAccumulatorWrapper to 
prevent such failures?

  was:
I've noticed that accumulators of Spark 1.x no longer work with Spark 2.x 
failing with
 java.lang.AssertionError: assertion failed: copyAndReset must return a zero 
value copy

It happens while serializing an accumulator 
[here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L165]
{code:java}
val copyAcc = copyAndReset()
assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code}
... although copyAndReset returns zero-value copy for sure, just consider the 
accumulator below
{code:java}
val concatParam = new AccumulatorParam[jl.StringBuilder] {
  override def zero(initialValue: jl.StringBuilder): jl.StringBuilder = new 
jl.StringBuilder()
  override def addInPlace(r1: jl.StringBuilder, r2: jl.StringBuilder): 
jl.StringBuilder = r1.append(r2)
}{code}
So, Spark treats zero value as non-zero due to how 
[isZero|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L489]
 is implemented in LegacyAccumulatorWrapper.
{code:java}
override def isZero: Boolean = _value == param.zero(initialValue){code}
All this means that the values to be accumulated must implement equals and 
hashCode, otherwise isZero is very likely to always return false.

So I'm wondering whether the assertion 
{code:java}
assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code}
is really necessary and whether it can be safely removed from there?

If not - is it ok to just override writeReplace for LegacyAccumulatorWrapper to 
prevent such failures?


> Accumulators of Spark 1.x no longer work with Spark 2.x
> ---
>
> Key: SPARK-23697
> URL: https://issues.apache.org/jira/browse/SPARK-23697
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1
> Environment: Spark 2.2.0
> Scala 2.11
>Reporter: Sergey Zhemzhitsky
>Priority: Major
>
> I've noticed that accumulators of Spark 1.x no longer work with Spark 2.x 
> failing with
>  java.lang.AssertionError: assertion failed: copyAndReset must return a zero 
> value copy
> It happens while serializing an accumulator 
> [here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L165]
> {code:java}
> val copyAcc = copyAndReset()
> assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code}
> ... although copyAndReset returns zero-value copy for sure, just consider the 
> accumulator below
> {code:java}
> val concatParam = new AccumulatorParam[jl.StringBuilder] {
>   override def zero(initialValue: jl.StringBuilder): jl.StringBuilder = new 
> jl.StringBuilder()
>   override def addInPlace(r1: jl.StringBuilder, r2: jl.StringBuilder): 
> jl.StringBuilder = r1.append(r2)
> }{code}
> So, Spark treats zero value as non-zero due to how 
> 

[jira] [Updated] (SPARK-23697) Accumulators of Spark 1.x no longer work with Spark 2.x

2018-03-15 Thread Sergey Zhemzhitsky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Zhemzhitsky updated SPARK-23697:
---
Description: 
I've noticed that accumulators of Spark 1.x no longer work with Spark 2.x 
failing with
 java.lang.AssertionError: assertion failed: copyAndReset must return a zero 
value copy

It happens while serializing an accumulator 
[here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L165]
{code:java}
val copyAcc = copyAndReset()
assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code}
... although copyAndReset returns zero-value copy for sure, just consider the 
accumulator below
{code:java}
val concatParam = new AccumulatorParam[jl.StringBuilder] {
  override def zero(initialValue: jl.StringBuilder): jl.StringBuilder = new 
jl.StringBuilder()
  override def addInPlace(r1: jl.StringBuilder, r2: jl.StringBuilder): 
jl.StringBuilder = r1.append(r2)
}{code}
So, Spark treats zero value as non-zero due to how 
[isZero|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L489]
 is implemented in LegacyAccumulatorWrapper.
{code:java}
override def isZero: Boolean = _value == param.zero(initialValue){code}
All this means that the values to be accumulated must implement equals and 
hashCode, otherwise isZero is very likely to always return false.

So I'm wondering whether the assertion 
{code:java}
assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code}
is really necessary and whether it can be safely removed from there?

If not - is it ok to just override writeReplace for LegacyAccumulatorWrapper to 
prevent such failures?

  was:
I've noticed that accumulators of Spark 1.x no longer work with Spark 2.x 
failing with
 java.lang.AssertionError: assertion failed: copyAndReset must return a zero 
value copy

It happens while serializing an accumulator 
[here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L165]
{code:java}
val copyAcc = copyAndReset()
assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code}
... although copyAndReset returns zero-value copy for sure, just consider the 
accumulator below
{code:java}
val concatParam = new AccumulatorParam[jl.StringBuilder] {
  override def zero(initialValue: jl.StringBuilder): jl.StringBuilder = new 
jl.StringBuilder()
  override def addInPlace(r1: jl.StringBuilder, r2: jl.StringBuilder): 
jl.StringBuilder = r1.append(r2)
}{code}
So, Spark treats zero value as non-zero due to how 
[isZero|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L489]
 is implemented in LegacyAccumulatorWrapper.
{code:java}
override def isZero: Boolean = _value == param.zero(initialValue){code}
All this means that the values to be accumulated must implement equals and 
hashCode, otherwise isZero is very likely to always return false.

So I'm wondering whether the assertion 
{code:java}
assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code}
is really necessary and whether it can be safely removed from there. 


> Accumulators of Spark 1.x no longer work with Spark 2.x
> ---
>
> Key: SPARK-23697
> URL: https://issues.apache.org/jira/browse/SPARK-23697
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1
> Environment: Spark 2.2.0
> Scala 2.11
>Reporter: Sergey Zhemzhitsky
>Priority: Major
>
> I've noticed that accumulators of Spark 1.x no longer work with Spark 2.x 
> failing with
>  java.lang.AssertionError: assertion failed: copyAndReset must return a zero 
> value copy
> It happens while serializing an accumulator 
> [here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L165]
> {code:java}
> val copyAcc = copyAndReset()
> assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code}
> ... although copyAndReset returns zero-value copy for sure, just consider the 
> accumulator below
> {code:java}
> val concatParam = new AccumulatorParam[jl.StringBuilder] {
>   override def zero(initialValue: jl.StringBuilder): jl.StringBuilder = new 
> jl.StringBuilder()
>   override def addInPlace(r1: jl.StringBuilder, r2: jl.StringBuilder): 
> jl.StringBuilder = r1.append(r2)
> }{code}
> So, Spark treats zero value as non-zero due to how 
> 

[jira] [Updated] (SPARK-23697) Accumulators of Spark 1.x no longer work with Spark 2.x

2018-03-15 Thread Sergey Zhemzhitsky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Zhemzhitsky updated SPARK-23697:
---
Description: 
I've noticed that accumulators of Spark 1.x no longer work with Spark 2.x 
failing with
 java.lang.AssertionError: assertion failed: copyAndReset must return a zero 
value copy

It happens while serializing an accumulator 
[here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L165]
{code:java}
val copyAcc = copyAndReset()
assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code}
... although copyAndReset returns zero-value copy for sure, just consider the 
accumulator below
{code:java}
val concatParam = new AccumulatorParam[jl.StringBuilder] {
  override def zero(initialValue: jl.StringBuilder): jl.StringBuilder = new 
jl.StringBuilder()
  override def addInPlace(r1: jl.StringBuilder, r2: jl.StringBuilder): 
jl.StringBuilder = r1.append(r2)
}{code}
So, Spark treats zero value as non-zero due to how 
[isZero|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L489]
 is implemented in LegacyAccumulatorWrapper.
{code:java}
override def isZero: Boolean = _value == param.zero(initialValue){code}
All this means that the values to be accumulated must implement equals and 
hashCode, otherwise isZero is very likely to always return false.

So I'm wondering whether the assertion 
{code:java}
assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code}
is really necessary and whether it can be safely removed from there. 

  was:
I've noticed that accumulators of Spark 1.x no longer work with Spark 2.x 
failing with
 java.lang.AssertionError: assertion failed: copyAndReset must return a zero 
value copy

It happens while serializing an accumulator 
[here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L165]
{code:java}
val copyAcc = copyAndReset()
assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code}
... although copyAndReset returns zero-value copy for sure, just consider the 
accumulator below
{code:java}
val concatParam = new AccumulatorParam[jl.StringBuilder] {
  override def zero(initialValue: jl.StringBuilder): jl.StringBuilder = new 
jl.StringBuilder()
  override def addInPlace(r1: jl.StringBuilder, r2: jl.StringBuilder): 
jl.StringBuilder = r1.append(r2)
}{code}
So, Spark treats zero value as non-zero due to how 
[isZero|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L489]
 is implemented in LegacyAccumulatorWrapper
{code:java}
override def isZero: Boolean = _value == param.zero(initialValue){code}
All this means that the values to be accumulated must implement equals and 
hashCode, otherwise isZero is very likely to always return false.

So I'm wondering whether the assertion 
{code:java}
assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code}
is really necessary and whether it can be safely removed from there. 


> Accumulators of Spark 1.x no longer work with Spark 2.x
> ---
>
> Key: SPARK-23697
> URL: https://issues.apache.org/jira/browse/SPARK-23697
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1
> Environment: Spark 2.2.0
> Scala 2.11
>Reporter: Sergey Zhemzhitsky
>Priority: Major
>
> I've noticed that accumulators of Spark 1.x no longer work with Spark 2.x 
> failing with
>  java.lang.AssertionError: assertion failed: copyAndReset must return a zero 
> value copy
> It happens while serializing an accumulator 
> [here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L165]
> {code:java}
> val copyAcc = copyAndReset()
> assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code}
> ... although copyAndReset returns zero-value copy for sure, just consider the 
> accumulator below
> {code:java}
> val concatParam = new AccumulatorParam[jl.StringBuilder] {
>   override def zero(initialValue: jl.StringBuilder): jl.StringBuilder = new 
> jl.StringBuilder()
>   override def addInPlace(r1: jl.StringBuilder, r2: jl.StringBuilder): 
> jl.StringBuilder = r1.append(r2)
> }{code}
> So, Spark treats zero value as non-zero due to how 
> [isZero|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L489]
>  is implemented in LegacyAccumulatorWrapper.
> {code:java}
> override def isZero: 

[jira] [Updated] (SPARK-23697) Accumulators of Spark 1.x no longer work with Spark 2.x

2018-03-15 Thread Sergey Zhemzhitsky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Zhemzhitsky updated SPARK-23697:
---
Description: 
I've noticed that accumulators of Spark 1.x no longer work with Spark 2.x 
failing with
 java.lang.AssertionError: assertion failed: copyAndReset must return a zero 
value copy

It happens while serializing an accumulator 
[here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L165]
{code:java}
val copyAcc = copyAndReset()
assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code}
... although copyAndReset returns zero-value copy for sure, just consider the 
accumulator below
{code:java}
val concatParam = new AccumulatorParam[jl.StringBuilder] {
  override def zero(initialValue: jl.StringBuilder): jl.StringBuilder = new 
jl.StringBuilder()
  override def addInPlace(r1: jl.StringBuilder, r2: jl.StringBuilder): 
jl.StringBuilder = r1.append(r2)
}{code}
So, Spark treats zero value as non-zero due to how 
[isZero|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L489]
 is implemented in LegacyAccumulatorWrapper
{code:java}
override def isZero: Boolean = _value == param.zero(initialValue){code}
All this means that the values to be accumulated must implement equals and 
hashCode, otherwise isZero is very likely to always return false.

So I'm wondering whether the assertion 
{code:java}
assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code}
is really necessary and whether it can be safely removed from there. 

  was:
I've noticed that accumulators of Spark 1.x no longer work with Spark 2.x 
failing with
java.lang.AssertionError: assertion failed: copyAndReset must return a zero 
value copy

It happens while serializing an accumulator 
[here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L165]
{code:java}
val copyAcc = copyAndReset()
assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code}
... although copyAndReset returns zero-value copy for sure, just consider the 
accumulator below
{code:java}
val concatParam = new AccumulatorParam[jl.StringBuilder] {
  override def zero(initialValue: jl.StringBuilder): jl.StringBuilder = new 
jl.StringBuilder()
  override def addInPlace(r1: jl.StringBuilder, r2: jl.StringBuilder): 
jl.StringBuilder = r1.append(r2)
}{code}

So, Spark treats zero value as non-zero due to how 
[isZero|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L489]
 is implemented in LegacyAccumulatorWrapper
{code:java}
override def isZero: Boolean = _value == param.zero(initialValue){code}

All this means that the values to be accumulated must implement equals and 
hashCode, otherwise `isZero` is very likely to always return false.

So I'm wondering whether the assertion 
{code:java}
assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code}
is really necessary and whether it can be safely removed from there. 


> Accumulators of Spark 1.x no longer work with Spark 2.x
> ---
>
> Key: SPARK-23697
> URL: https://issues.apache.org/jira/browse/SPARK-23697
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1
> Environment: Spark 2.2.0
> Scala 2.11
>Reporter: Sergey Zhemzhitsky
>Priority: Major
>
> I've noticed that accumulators of Spark 1.x no longer work with Spark 2.x 
> failing with
>  java.lang.AssertionError: assertion failed: copyAndReset must return a zero 
> value copy
> It happens while serializing an accumulator 
> [here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L165]
> {code:java}
> val copyAcc = copyAndReset()
> assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code}
> ... although copyAndReset returns zero-value copy for sure, just consider the 
> accumulator below
> {code:java}
> val concatParam = new AccumulatorParam[jl.StringBuilder] {
>   override def zero(initialValue: jl.StringBuilder): jl.StringBuilder = new 
> jl.StringBuilder()
>   override def addInPlace(r1: jl.StringBuilder, r2: jl.StringBuilder): 
> jl.StringBuilder = r1.append(r2)
> }{code}
> So, Spark treats zero value as non-zero due to how 
> [isZero|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L489]
>  is implemented in LegacyAccumulatorWrapper
> {code:java}
> override def isZero: 

[jira] [Created] (SPARK-23697) Accumulators of Spark 1.x no longer work with Spark 2.x

2018-03-15 Thread Sergey Zhemzhitsky (JIRA)
Sergey Zhemzhitsky created SPARK-23697:
--

 Summary: Accumulators of Spark 1.x no longer work with Spark 2.x
 Key: SPARK-23697
 URL: https://issues.apache.org/jira/browse/SPARK-23697
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.1, 2.2.0
 Environment: Spark 2.2.0
Scala 2.11
Reporter: Sergey Zhemzhitsky


I've noticed that accumulators of Spark 1.x no longer work with Spark 2.x 
failing with
java.lang.AssertionError: assertion failed: copyAndReset must return a zero 
value copy

It happens while serializing an accumulator 
[here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L165]
{code:java}
val copyAcc = copyAndReset()
assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code}
... although copyAndReset returns zero-value copy for sure, just consider the 
accumulator below
{code:java}
val concatParam = new AccumulatorParam[jl.StringBuilder] {
  override def zero(initialValue: jl.StringBuilder): jl.StringBuilder = new 
jl.StringBuilder()
  override def addInPlace(r1: jl.StringBuilder, r2: jl.StringBuilder): 
jl.StringBuilder = r1.append(r2)
}{code}

So, Spark treats zero value as non-zero due to how 
[isZero|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L489]
 is implemented in LegacyAccumulatorWrapper
{code:java}
override def isZero: Boolean = _value == param.zero(initialValue){code}

All this means that the values to be accumulated must implement equals and 
hashCode, otherwise `isZero` is very likely to always return false.

So I'm wondering whether the assertion 
{code:java}
assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code}
is really necessary and whether it can be safely removed from there. 
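To make the equals/hashCode point concrete, here is a small illustrative sketch (in Python rather than Scala, purely as an analogy): a buffer type that compares by reference, like java.lang.StringBuilder, never equals a freshly constructed zero value, whereas adding value equality makes an isZero-style check behave as expected.

{code:python}
class Buf(object):
    # No __eq__/__hash__, so comparison falls back to identity,
    # analogous to java.lang.StringBuilder.
    def __init__(self):
        self.parts = []

print(Buf() == Buf())   # False: two empty buffers are not "equal"

class ValueBuf(Buf):
    # Value equality makes a check like `_value == param.zero(initialValue)` pass.
    def __eq__(self, other):
        return isinstance(other, Buf) and self.parts == other.parts
    def __hash__(self):
        return hash(tuple(self.parts))

print(ValueBuf() == ValueBuf())   # True: empty compares equal to empty
{code}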



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23684) mode append function not working

2018-03-15 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23684.
--
Resolution: Duplicate

> mode append function not working 
> -
>
> Key: SPARK-23684
> URL: https://issues.apache.org/jira/browse/SPARK-23684
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.2.0
>Reporter: Evan Zamir
>Priority: Minor
>
> {{df.write.mode('append').jdbc(url, table, properties=\{"driver": 
> "org.postgresql.Driver"}) }}
> produces the following error and does not write to existing table:
> {{2018-03-14 11:00:08,332 root ERROR An error occurred while calling 
> o894.jdbc.}}
> {{: scala.MatchError: null}}
> {{ at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:62)}}
> {{ at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:472)}}
> {{ at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)}}
> {{ at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)}}
> {{ at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)}}
> {{ at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)}}
> {{ at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)}}
> {{ at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)}}
> {{ at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)}}
> {{ at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)}}
> {{ at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)}}
> {{ at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)}}
> {{ at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)}}
> {{ at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)}}
> {{ at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)}}
> {{ at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)}}
> {{ at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:461)}}
> {{ at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)}}
> {{ at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)}}
> {{ at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)}}
> {{ at java.lang.reflect.Method.invoke(Method.java:498)}}
> {{ at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)}}
> {{ at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)}}
> {{ at py4j.Gateway.invoke(Gateway.java:280)}}
> {{ at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)}}
> {{ at py4j.commands.CallCommand.execute(CallCommand.java:79)}}
> {{ at py4j.GatewayConnection.run(GatewayConnection.java:214)}}
> {{ at java.lang.Thread.run(Thread.java:745)}}
> However,
> {{df.write.jdbc(url, table, properties=\{"driver": 
> "org.postgresql.Driver"},mode='append')}}
> does not produce an error and adds a row to an existing table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7131) Move tree,forest implementation from spark.mllib to spark.ml

2018-03-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401071#comment-16401071
 ] 

Joseph K. Bradley commented on SPARK-7131:
--

CCing people watching this JIRA about https://github.com/apache/spark/pull/20786
In that PR, we want to make LeafNode and InternalNode into traits (not classes) 
in order to split Regression from Classification nodes (to have stronger 
typing).  Will this break anyone's code outside of org.apache.spark.ml?  I 
doubt it since the node constructors are still private, but I wanted to CC 
people.  Thanks!

> Move tree,forest implementation from spark.mllib to spark.ml
> 
>
> Key: SPARK-7131
> URL: https://issues.apache.org/jira/browse/SPARK-7131
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
> Fix For: 1.5.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> We want to change and improve the spark.ml API for trees and ensembles, but 
> we cannot change the old API in spark.mllib.  To support the changes we want 
> to make, we should move the implementation from spark.mllib to spark.ml.  We 
> will generalize and modify it, but will also ensure that we do not change the 
> behavior of the old API.
> There are several steps to this:
> 1. Copy the implementation over to spark.ml and change the spark.ml classes 
> to use that implementation, rather than calling the spark.mllib 
> implementation.  The current spark.ml tests will ensure that the 2 
> implementations learn exactly the same models.  Note: This should include 
> performance testing to make sure the updated code does not have any 
> regressions. --> *UPDATE*: I have run tests using spark-perf, and there were 
> no regressions.
> 2. Remove the spark.mllib implementation, and make the spark.mllib APIs 
> wrappers around the spark.ml implementation.  The spark.ml tests will again 
> ensure that we do not change any behavior.
> 3. Move the unit tests to spark.ml, and change the spark.mllib unit tests to 
> verify model equivalence.
> This JIRA is now for step 1 only.  Steps 2 and 3 will be in separate JIRAs.
> After these updates, we can more safely generalize and improve the spark.ml 
> implementation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20169) Groupby Bug with Sparksql

2018-03-15 Thread Dylan Guedes (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401029#comment-16401029
 ] 

Dylan Guedes commented on SPARK-20169:
--

Hi,

I also reproduced it in v2.3 and master. I think it is something related to the 
String type, because if I cast the jr dataframe column to long it works fine; 
however, if I cast it to String, the bug still happens.

I don't know the Catalyst codebase that well (I've never touched it, actually). Do 
you have a suggestion for where to start looking after I call _jdf? I don't know 
how to follow the trace once the call crosses into the JVM.

Thank you!
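For anyone trying this, a self-contained sketch of the cast workaround described above is below. It uses an in-memory copy of the data from the issue description (cast to strings to mimic the CSV read) and a local SparkSession; whether the wrong counts actually reproduce depends on the Spark version.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[2]").getOrCreate()

# Same edges as the commented-out line in the issue, typed as strings to
# mimic reading graph.csv with header=True.
e = spark.createDataFrame(
    [(1, 2), (1, 3), (1, 4), (2, 1), (3, 1), (4, 1)], ["src", "dst"]
).select(col("src").cast("string").alias("src"),
         col("dst").cast("string").alias("dst"))
r = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["src"]) \
         .select(col("src").cast("string").alias("src"))

r1 = e.join(r, "src").groupBy("dst").count().withColumnRenamed("dst", "src")
jr = e.join(r1, "src")

# Workaround mentioned above: cast the grouping column to long before the
# second groupBy so the rows with dst = 1 collapse into a single group.
jr.withColumn("dst", col("dst").cast("long")).groupBy("dst").count().show()
{code}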

> Groupby Bug with Sparksql
> -
>
> Key: SPARK-20169
> URL: https://issues.apache.org/jira/browse/SPARK-20169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Bin Wu
>Priority: Major
>
> We find a potential bug in Catalyst optimizer which cannot correctly 
> process "groupby". You can reproduce it by following simple example:
> =
> from pyspark.sql.functions import *
> #e=sc.parallelize([(1,2),(1,3),(1,4),(2,1),(3,1),(4,1)]).toDF(["src","dst"])
> e = spark.read.csv("graph.csv", header=True)
> r = sc.parallelize([(1,),(2,),(3,),(4,)]).toDF(['src'])
> r1 = e.join(r, 'src').groupBy('dst').count().withColumnRenamed('dst','src')
> jr = e.join(r1, 'src')
> jr.show()
> r2 = jr.groupBy('dst').count()
> r2.show()
> =
> FYI, "graph.csv" contains exactly the same data as the commented line.
> You can find that jr is:
> |src|dst|count|
> |  3|  1|1|
> |  1|  4|3|
> |  1|  3|3|
> |  1|  2|3|
> |  4|  1|1|
> |  2|  1|1|
> But, after the last groupBy, the 3 rows with dst = 1 are not grouped together:
> |dst|count|
> |  1|1|
> |  4|1|
> |  3|1|
> |  2|1|
> |  1|1|
> |  1|1|
> If we build jr directly from raw data (commented line), this error will not 
> show up.  So 
> we suspect  that there is a bug in the Catalyst optimizer when multiple joins 
> and groupBy's 
> are being optimized. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23686) Make better usage of org.apache.spark.ml.util.Instrumentation

2018-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23686:


Assignee: (was: Apache Spark)

> Make better usage of org.apache.spark.ml.util.Instrumentation
> -
>
> Key: SPARK-23686
> URL: https://issues.apache.org/jira/browse/SPARK-23686
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Major
>
> This Jira is a bit high level and might require subtasks or other jiras for 
> more specific tasks.
> I've noticed that we don't make the best usage of the instrumentation class. 
> Specifically sometimes we bypass the instrumentation class and use the 
> debugger instead. For example, 
> [https://github.com/apache/spark/blob/9b9827759af2ca3eea146a6032f9165f640ce152/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L143]
> Also there are some things that might be useful to log in the instrumentation 
> class that we currently don't. For example:
> number of training examples
> mean/var of label (regression)
> I know computing these things can be expensive in some cases, but especially 
> when this data is already available we can log it for free. For example, 
> Logistic Regression Summarizer computes some useful data including numRows 
> that we don't log.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23686) Make better usage of org.apache.spark.ml.util.Instrumentation

2018-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16400903#comment-16400903
 ] 

Apache Spark commented on SPARK-23686:
--

User 'MrBago' has created a pull request for this issue:
https://github.com/apache/spark/pull/20837

> Make better usage of org.apache.spark.ml.util.Instrumentation
> -
>
> Key: SPARK-23686
> URL: https://issues.apache.org/jira/browse/SPARK-23686
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Major
>
> This Jira is a bit high level and might require subtasks or other jiras for 
> more specific tasks.
> I've noticed that we don't make the best usage of the instrumentation class. 
> Specifically sometimes we bypass the instrumentation class and use the 
> debugger instead. For example, 
> [https://github.com/apache/spark/blob/9b9827759af2ca3eea146a6032f9165f640ce152/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L143]
> Also there are some things that might be useful to log in the instrumentation 
> class that we currently don't. For example:
> number of training examples
> mean/var of label (regression)
> I know computing these things can be expensive in some cases, but especially 
> when this data is already available we can log it for free. For example, 
> Logistic Regression Summarizer computes some useful data including numRows 
> that we don't log.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23686) Make better usage of org.apache.spark.ml.util.Instrumentation

2018-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23686:


Assignee: Apache Spark

> Make better usage of org.apache.spark.ml.util.Instrumentation
> -
>
> Key: SPARK-23686
> URL: https://issues.apache.org/jira/browse/SPARK-23686
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Assignee: Apache Spark
>Priority: Major
>
> This Jira is a bit high level and might require subtasks or other jiras for 
> more specific tasks.
> I've noticed that we don't make the best usage of the instrumentation class. 
> Specifically sometimes we bypass the instrumentation class and use the 
> debugger instead. For example, 
> [https://github.com/apache/spark/blob/9b9827759af2ca3eea146a6032f9165f640ce152/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L143]
> Also there are some things that might be useful to log in the instrumentation 
> class that we currently don't. For example:
> number of training examples
> mean/var of label (regression)
> I know computing these things can be expensive in some cases, but especially 
> when this data is already available we can log it for free. For example, 
> Logistic Regression Summarizer computes some useful data including numRows 
> that we don't log.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23695) Confusing error message for PySpark's Kinesis tests when its jar is missing but enabled

2018-03-15 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-23695.
---
   Resolution: Fixed
Fix Version/s: 2.4.0
   2.3.1

Issue resolved by pull request 20834
https://github.com/apache/spark/pull/20834

> Confusing error message for PySpark's Kinesis tests when its jar is missing 
> but enabled
> ---
>
> Key: SPARK-23695
> URL: https://issues.apache.org/jira/browse/SPARK-23695
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Trivial
> Fix For: 2.3.1, 2.4.0
>
>
> Currently if its jar is missing but the Kinesis tests are enabled:
> {code}
> ENABLE_KINESIS_TESTS=1 SPARK_TESTING=1 bin/pyspark pyspark.streaming.tests
> Skipped test_flume_stream (enable by setting environment variable 
> ENABLE_FLUME_TESTS=1Skipped test_kafka_stream (enable by setting environment 
> variable ENABLE_KAFKA_0_8_TESTS=1Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py",
>  line 174, in _run_module_as_main
> "__main__", fname, loader, pkg_name)
>   File 
> "/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py",
>  line 72, in _run_code
> exec code in run_globals
>   File "/.../spark/python/pyspark/streaming/tests.py", line 1572, in 
> % kinesis_asl_assembly_dir) +
> NameError: name 'kinesis_asl_assembly_dir' is not defined
> {code}
> It shows a confusing error message. Seems a mistake.
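As an illustration of the kind of fix involved (not the actual patch in pull request 20834, and with assumed directory layout): resolving the search directory before it is interpolated into the message turns the NameError into the intended, readable error.

{code:python}
import glob
import os

# Illustrative pattern only: compute the assembly directory up front so the
# error message can always reference it, even when no jar is found.
spark_home = os.environ.get("SPARK_HOME", ".")
kinesis_asl_assembly_dir = os.path.join(spark_home, "external", "kinesis-asl-assembly")

jars = glob.glob(os.path.join(kinesis_asl_assembly_dir, "target", "**", "*assembly*.jar"),
                 recursive=True)
if not jars:
    raise RuntimeError(
        "Failed to find Kinesis assembly jar under %s; build the kinesis-asl-assembly "
        "module first, or leave ENABLE_KINESIS_TESTS unset to skip these tests."
        % kinesis_asl_assembly_dir)
{code}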



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23695) Confusing error message for PySpark's Kinesis tests when its jar is missing but enabled

2018-03-15 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-23695:
-

Assignee: Hyukjin Kwon

> Confusing error message for PySpark's Kinesis tests when its jar is missing 
> but enabled
> ---
>
> Key: SPARK-23695
> URL: https://issues.apache.org/jira/browse/SPARK-23695
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Trivial
>
> Currently if its jar is missing but the Kinesis tests are enabled:
> {code}
> ENABLE_KINESIS_TESTS=1 SPARK_TESTING=1 bin/pyspark pyspark.streaming.tests
> Skipped test_flume_stream (enable by setting environment variable 
> ENABLE_FLUME_TESTS=1Skipped test_kafka_stream (enable by setting environment 
> variable ENABLE_KAFKA_0_8_TESTS=1Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py",
>  line 174, in _run_module_as_main
> "__main__", fname, loader, pkg_name)
>   File 
> "/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py",
>  line 72, in _run_code
> exec code in run_globals
>   File "/.../spark/python/pyspark/streaming/tests.py", line 1572, in 
> % kinesis_asl_assembly_dir) +
> NameError: name 'kinesis_asl_assembly_dir' is not defined
> {code}
> It shows a confusing error message. Seems a mistake.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4038) Outlier Detection Algorithm for MLlib

2018-03-15 Thread Gustavo Orair (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16400839#comment-16400839
 ] 

Gustavo Orair commented on SPARK-4038:
--

There is a paper that discusses multiple strategies for distance-based outlier 
detection:
 * [http://www2.cs.uh.edu/~ceick/7362/T1-9.pdf]

The paper proposes an adaptive parallel algorithm that uses ranking strategies to 
select a pool of candidates and runs a local neighbor search over that pool before 
examining the data for outliers, which looks promising for distributed computation.



> Outlier Detection Algorithm for MLlib
> -
>
> Key: SPARK-4038
> URL: https://issues.apache.org/jira/browse/SPARK-4038
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Ashutosh Trivedi
>Priority: Minor
>
> The aim of this JIRA is to discuss about which parallel outlier detection 
> algorithms can be included in MLlib. 
> The one which I am familiar with is Attribute Value Frequency (AVF). It 
> scales linearly with the number of data points and attributes, and relies on 
> a single data scan. It is not distance based and is well suited for categorical 
> data. In the original paper a parallel version is also given, which is not 
> complicated to implement. I am working on the implementation and will soon submit 
> the initial code for review.
> Here is the Link for the paper
> http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4410382
> As pointed out by Xiangrui in discussion 
> http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-td8880.html
> There are other algorithms also. Lets discuss about which will be more 
> general and easily paralleled.
>
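
As a concrete illustration of the AVF idea above, here is a minimal Spark sketch (not the reporter's submitted code; the input DataFrame and the categorical column names are assumed to be supplied by the caller). It computes, per column, how frequent each row's value is, and scores a row by the average of those frequencies; the lowest-scoring rows are the candidate outliers.

{code:scala}
import org.apache.spark.sql.{DataFrame, functions => F}

// Attribute Value Frequency: score(row) = mean over columns of freq(value in that column).
// Low scores indicate rows whose attribute values are rare, i.e. candidate outliers.
def avfScores(df: DataFrame, cols: Seq[String]): DataFrame = {
  val withFreqs = cols.foldLeft(df) { (acc, c) =>
    // Frequency table for column c, joined back as an extra column "<c>_freq".
    val freqs = acc.groupBy(c).agg(F.count(F.lit(1)).as(s"${c}_freq"))
    acc.join(freqs, Seq(c))
  }
  withFreqs.withColumn(
    "avf_score",
    cols.map(c => F.col(s"${c}_freq")).reduce(_ + _) / F.lit(cols.size))
}

// Usage: avfScores(df, Seq("colour", "shape")).orderBy("avf_score").show()
{code}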



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23684) mode append function not working

2018-03-15 Thread Evan Zamir (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16400823#comment-16400823
 ] 

Evan Zamir commented on SPARK-23684:


Yes, you're right. Feel free to close this.

> mode append function not working 
> -
>
> Key: SPARK-23684
> URL: https://issues.apache.org/jira/browse/SPARK-23684
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.2.0
>Reporter: Evan Zamir
>Priority: Minor
>
> {{df.write.mode('append').jdbc(url, table, properties=\{"driver": 
> "org.postgresql.Driver"}) }}
> produces the following error and does not write to an existing table:
> {{2018-03-14 11:00:08,332 root ERROR An error occurred while calling 
> o894.jdbc.}}
> {{: scala.MatchError: null}}
> {{ at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:62)}}
> {{ at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:472)}}
> {{ at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)}}
> {{ at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)}}
> {{ at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)}}
> {{ at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)}}
> {{ at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)}}
> {{ at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)}}
> {{ at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)}}
> {{ at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)}}
> {{ at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)}}
> {{ at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)}}
> {{ at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)}}
> {{ at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)}}
> {{ at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)}}
> {{ at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)}}
> {{ at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:461)}}
> {{ at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)}}
> {{ at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)}}
> {{ at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)}}
> {{ at java.lang.reflect.Method.invoke(Method.java:498)}}
> {{ at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)}}
> {{ at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)}}
> {{ at py4j.Gateway.invoke(Gateway.java:280)}}
> {{ at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)}}
> {{ at py4j.commands.CallCommand.execute(CallCommand.java:79)}}
> {{ at py4j.GatewayConnection.run(GatewayConnection.java:214)}}
> {{ at java.lang.Thread.run(Thread.java:745)}}
> However,
> {{df.write.jdbc(url, table, properties=\{"driver": 
> "org.postgresql.Driver"},mode='append')}}
> does not produce an error and adds a row to an existing table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23685) Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)

2018-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16400755#comment-16400755
 ] 

Apache Spark commented on SPARK-23685:
--

User 'sirishaSindri' has created a pull request for this issue:
https://github.com/apache/spark/pull/20836

> Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive 
> Offsets (i.e. Log Compaction)
> -
>
> Key: SPARK-23685
> URL: https://issues.apache.org/jira/browse/SPARK-23685
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: sirisha
>Priority: Major
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the 
> next requested offset will frequently not be offset+1. The logic in 
> KafkaSourceRDD & CachedKafkaConsumer assumes that the next offset will always 
> be an increment of 1. If not, it throws the exception below:
>  
> "Cannot fetch records in [5589, 5693) (GroupId: XXX, TopicPartition:). 
> Some data may have been lost because they are not available in Kafka any 
> more; either the data was aged out by Kafka or the topic may have been 
> deleted before all the data in the topic was processed. If you don't want 
> your streaming query to fail on such cases, set the source option 
> "failOnDataLoss" to "false". "
>  
> FYI: This bug is related to https://issues.apache.org/jira/browse/SPARK-17147
>  
>  
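
To make the offset assumption concrete, here is a small sketch (assumed shape, not the actual KafkaSourceRDD/CachedKafkaConsumer code): on a compacted topic the next available record can legitimately sit past expectedOffset + 1, so treating any gap as data loss raises the error quoted above, while a compaction-tolerant reader would simply accept the record at the next available offset.

{code:scala}
import org.apache.kafka.clients.consumer.ConsumerRecord

// Sketch of the two behaviours: strict (gap == data loss) vs. tolerant (skip the gap).
def nextRecord(
    records: Iterator[ConsumerRecord[Array[Byte], Array[Byte]]],
    expectedOffset: Long,
    failOnGap: Boolean): ConsumerRecord[Array[Byte], Array[Byte]] = {
  val record = records.next()
  if (failOnGap && record.offset != expectedOffset) {
    // Strict assumption described in the report: any gap is surfaced as an error.
    throw new IllegalStateException(
      s"Cannot fetch record at offset $expectedOffset, got ${record.offset}")
  }
  // Tolerant behaviour for compacted topics: accept whatever offset comes next.
  record
}
{code}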



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23685) Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)

2018-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23685:


Assignee: Apache Spark

> Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive 
> Offsets (i.e. Log Compaction)
> -
>
> Key: SPARK-23685
> URL: https://issues.apache.org/jira/browse/SPARK-23685
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: sirisha
>Assignee: Apache Spark
>Priority: Major
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the 
> next requested offset will frequently not be offset+1. The logic in 
> KafkaSourceRDD & CachedKafkaConsumer assumes that the next offset will always 
> be an increment of 1. If not, it throws the exception below:
>  
> "Cannot fetch records in [5589, 5693) (GroupId: XXX, TopicPartition:). 
> Some data may have been lost because they are not available in Kafka any 
> more; either the data was aged out by Kafka or the topic may have been 
> deleted before all the data in the topic was processed. If you don't want 
> your streaming query to fail on such cases, set the source option 
> "failOnDataLoss" to "false". "
>  
> FYI: This bug is related to https://issues.apache.org/jira/browse/SPARK-17147
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23685) Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)

2018-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23685:


Assignee: (was: Apache Spark)

> Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive 
> Offsets (i.e. Log Compaction)
> -
>
> Key: SPARK-23685
> URL: https://issues.apache.org/jira/browse/SPARK-23685
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: sirisha
>Priority: Major
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the 
> next requested offset will frequently not be offset+1. The logic in 
> KafkaSourceRDD & CachedKafkaConsumer assumes that the next offset will always 
> be an increment of 1. If not, it throws the exception below:
>  
> "Cannot fetch records in [5589, 5693) (GroupId: XXX, TopicPartition:). 
> Some data may have been lost because they are not available in Kafka any 
> more; either the data was aged out by Kafka or the topic may have been 
> deleted before all the data in the topic was processed. If you don't want 
> your streaming query to fail on such cases, set the source option 
> "failOnDataLoss" to "false". "
>  
> FYI: This bug is related to https://issues.apache.org/jira/browse/SPARK-17147
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23696) StructType.fromString swallows exceptions from DataType.fromJson

2018-03-15 Thread Simeon H.K. Fitch (JIRA)
Simeon H.K. Fitch created SPARK-23696:
-

 Summary: StructType.fromString swallows exceptions from 
DataType.fromJson
 Key: SPARK-23696
 URL: https://issues.apache.org/jira/browse/SPARK-23696
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.2.1
Reporter: Simeon H.K. Fitch


`StructType.fromString` swallows exceptions from `DataType.fromJson`, assuming 
they are an indication that `LegacyTypeStringParser.parse` should be called 
instead. When that also fails (because it throws an exception), an error message is 
generated that does not reflect the true problem at hand, effectively 
swallowing the exception from `DataType.fromJson`. This makes debugging Parquet 
schema issues more difficult.
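
A minimal sketch of the behaviour being asked for (assumed names, not the actual Spark source): keep the legacy-parser fallback, but when it also fails, re-throw with the original `DataType.fromJson` failure attached as the cause so Parquet schema problems surface with their real message. The `legacyParse` argument stands in for the legacy-format parser mentioned above.

{code:scala}
import scala.util.control.NonFatal
import org.apache.spark.sql.types.{DataType, StructType}

def fromStringPreservingCause(raw: String)(legacyParse: String => StructType): StructType =
  try {
    DataType.fromJson(raw) match {
      case s: StructType => s
      case other => throw new IllegalArgumentException(s"Expected a StructType, got $other")
    }
  } catch {
    case NonFatal(original) =>
      try legacyParse(raw) catch {
        case NonFatal(_) =>
          // Surface the first failure instead of swallowing it.
          throw new IllegalArgumentException(s"Failed to parse schema: $raw", original)
      }
  }
{code}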



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23693) SQL function uuid()

2018-03-15 Thread Arseniy Tashoyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arseniy Tashoyan updated SPARK-23693:
-
Description: 
Add function uuid() to org.apache.spark.sql.functions that returns [Universally 
Unique ID|https://en.wikipedia.org/wiki/Universally_unique_identifier].

Sometimes it is necessary to uniquely identify each row in a DataFrame.

Currently the following ways are available:
 * monotonically_increasing_id() function
 * row_number() function over some window
 * convert the DataFrame to RDD and zipWithIndex()

All these approaches do not work when appending this DataFrame to another 
DataFrame (union). Collisions may occur - two rows in different DataFrames may 
have the same ID. Re-generating IDs on the resulting DataFrame is not an 
option, because some data in some other system may already refer to old IDs.

The proposed solution is to add a new function:
{code}
def uuid(): Column{code}
that returns the String representation of a UUID.

UUID is represented as a 128-bit number (two long numbers). Such numbers are 
not supported in Scala or Java. In addition, some storage systems do not 
support 128-bit numbers (Parquet's largest numeric type is INT96). This is the 
reason for the uuid() function to return String.

I already have a simple implementation based on 
[java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I 
can share it as a PR.

  was:
Add function uuid() to org.apache.spark.sql.functions that returns [Universally 
Unique ID|https://en.wikipedia.org/wiki/Universally_unique_identifier].

Sometimes it is necessary to uniquely identify each row in a DataFrame.

Currently the following ways are available:
 * monotonically_increasing_id() function
 * row_number() function over some window
 * convert the DataFrame to RDD and zipWithIndex()

All these approaches do not work when appending this DataFrame to another 
DataFrame (union). Collisions may occur - two rows in different DataFrames may 
have the same ID. Re-generating IDs on the resulting DataFrame is not an 
option, because some data in some other system may already refer to old IDs.

The proposed solution is to add new function:
{code:scala}
def uuid(): String{code}
that returns String representation of UUID.

UUID is represented as a 128-bit number (two long numbers). Such numbers are 
not supported in Scala or Java. In addition, some storage systems do not 
support 128-bit numbers (Parquet's largest numeric type is INT96). This is the 
reason for the uuid() function to return String.

I already have a simple implementation based on 
[java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I 
can share it as a PR.


> SQL function uuid()
> ---
>
> Key: SPARK-23693
> URL: https://issues.apache.org/jira/browse/SPARK-23693
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Arseniy Tashoyan
>Priority: Minor
>
> Add function uuid() to org.apache.spark.sql.functions that returns 
> [Universally Unique 
> ID|https://en.wikipedia.org/wiki/Universally_unique_identifier].
> Sometimes it is necessary to uniquely identify each row in a DataFrame.
> Currently the following ways are available:
>  * monotonically_increasing_id() function
>  * row_number() function over some window
>  * convert the DataFrame to RDD and zipWithIndex()
> All these approaches do not work when appending this DataFrame to another 
> DataFrame (union). Collisions may occur - two rows in different DataFrames 
> may have the same ID. Re-generating IDs on the resulting DataFrame is not an 
> option, because some data in some other system may already refer to old IDs.
> The proposed solution is to add a new function:
> {code}
> def uuid(): Column{code}
> that returns the String representation of a UUID.
> UUID is represented as a 128-bit number (two long numbers). Such numbers are 
> not supported in Scala or Java. In addition, some storage systems do not 
> support 128-bit numbers (Parquet's largest numeric type is INT96). This is 
> the reason for the uuid() function to return String.
> I already have a simple implementation based on 
> [java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I 
> can share it as a PR.
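
Until such a built-in exists, a common workaround is a non-deterministic UDF around java.util.UUID (a minimal sketch using APIs available in Spark 2.3, not the proposed implementation), which yields a fresh String UUID per row and so avoids the union collision problem described above:

{code:scala}
import java.util.UUID
import org.apache.spark.sql.functions.udf

// Marking the UDF non-deterministic keeps the optimizer from collapsing or
// re-evaluating it in ways that could duplicate or change the generated IDs.
val uuidUdf = udf(() => UUID.randomUUID().toString).asNondeterministic()

// Usage: tag each DataFrame with row IDs before the union.
// val tagged = df.withColumn("row_id", uuidUdf())
{code}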



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23693) SQL function uuid()

2018-03-15 Thread Arseniy Tashoyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arseniy Tashoyan updated SPARK-23693:
-
Description: 
Add function uuid() to org.apache.spark.sql.functions that returns [Universally 
Unique ID|https://en.wikipedia.org/wiki/Universally_unique_identifier].

Sometimes it is necessary to uniquely identify each row in a DataFrame.

Currently the following ways are available:
 * monotonically_increasing_id() function
 * row_number() function over some window
 * convert the DataFrame to RDD and zipWithIndex()

All these approaches do not work when appending this DataFrame to another 
DataFrame (union). Collisions may occur - two rows in different DataFrames may 
have the same ID. Re-generating IDs on the resulting DataFrame is not an 
option, because some data in some other system may already refer to old IDs.

The proposed solution is to add a new function:
{code:scala}
def uuid(): Column
{code}
that returns the String representation of a UUID.

UUID is represented as a 128-bit number (two long numbers). Such numbers are 
not supported in Scala or Java. In addition, some storage systems do not 
support 128-bit numbers (Parquet's largest numeric type is INT96). This is the 
reason for the uuid() function to return String.

I already have a simple implementation based on 
[java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I 
can share it as a PR.

  was:
Add function uuid() to org.apache.spark.sql.functions that returns [Universally 
Unique ID|https://en.wikipedia.org/wiki/Universally_unique_identifier].

Sometimes it is necessary to uniquely identify each row in a DataFrame.

Currently the following ways are available:
 * monotonically_increasing_id() function
 * row_number() function over some window
 * convert the DataFrame to RDD and zipWithIndex()

All these approaches do not work when appending this DataFrame to another 
DataFrame (union). Collisions may occur - two rows in different DataFrames may 
have the same ID. Re-generating IDs on the resulting DataFrame is not an 
option, because some data in some other system may already refer to old IDs.

The proposed solution is to add new function:
{code}
def uuid(): Column{code}
that returns String representation of UUID.

UUID is represented as a 128-bit number (two long numbers). Such numbers are 
not supported in Scala or Java. In addition, some storage systems do not 
support 128-bit numbers (Parquet's largest numeric type is INT96). This is the 
reason for the uuid() function to return String.

I already have a simple implementation based on 
[java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I 
can share it as a PR.


> SQL function uuid()
> ---
>
> Key: SPARK-23693
> URL: https://issues.apache.org/jira/browse/SPARK-23693
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Arseniy Tashoyan
>Priority: Minor
>
> Add function uuid() to org.apache.spark.sql.functions that returns 
> [Universally Unique 
> ID|https://en.wikipedia.org/wiki/Universally_unique_identifier].
> Sometimes it is necessary to uniquely identify each row in a DataFrame.
> Currently the following ways are available:
>  * monotonically_increasing_id() function
>  * row_number() function over some window
>  * convert the DataFrame to RDD and zipWithIndex()
> All these approaches do not work when appending this DataFrame to another 
> DataFrame (union). Collisions may occur - two rows in different DataFrames 
> may have the same ID. Re-generating IDs on the resulting DataFrame is not an 
> option, because some data in some other system may already refer to old IDs.
> The proposed solution is to add a new function:
> {code:scala}
> def uuid(): Column
> {code}
> that returns the String representation of a UUID.
> UUID is represented as a 128-bit number (two long numbers). Such numbers are 
> not supported in Scala or Java. In addition, some storage systems do not 
> support 128-bit numbers (Parquet's largest numeric type is INT96). This is 
> the reason for the uuid() function to return String.
> I already have a simple implementation based on 
> [java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I 
> can share it as a PR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23695) Confusing error message for PySpark's Kinesis tests when its jar is missing but enabled

2018-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16400325#comment-16400325
 ] 

Apache Spark commented on SPARK-23695:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/20834

> Confusing error message for PySpark's Kinesis tests when its jar is missing 
> but enabled
> ---
>
> Key: SPARK-23695
> URL: https://issues.apache.org/jira/browse/SPARK-23695
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> Currently if its jar is missing but the Kinesis tests are enabled:
> {code}
> ENABLE_KINESIS_TESTS=1 SPARK_TESTING=1 bin/pyspark pyspark.streaming.tests
> Skipped test_flume_stream (enable by setting environment variable 
> ENABLE_FLUME_TESTS=1Skipped test_kafka_stream (enable by setting environment 
> variable ENABLE_KAFKA_0_8_TESTS=1Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py",
>  line 174, in _run_module_as_main
> "__main__", fname, loader, pkg_name)
>   File 
> "/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py",
>  line 72, in _run_code
> exec code in run_globals
>   File "/.../spark/python/pyspark/streaming/tests.py", line 1572, in 
> % kinesis_asl_assembly_dir) +
> NameError: name 'kinesis_asl_assembly_dir' is not defined
> {code}
> It shows a confusing error message. Seems a mistake.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23695) Confusing error message for PySpark's Kinesis tests when its jar is missing but enabled

2018-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23695:


Assignee: Apache Spark

> Confusing error message for PySpark's Kinesis tests when its jar is missing 
> but enabled
> ---
>
> Key: SPARK-23695
> URL: https://issues.apache.org/jira/browse/SPARK-23695
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Trivial
>
> Currently if its jar is missing but the Kinesis tests are enabled:
> {code}
> ENABLE_KINESIS_TESTS=1 SPARK_TESTING=1 bin/pyspark pyspark.streaming.tests
> Skipped test_flume_stream (enable by setting environment variable 
> ENABLE_FLUME_TESTS=1Skipped test_kafka_stream (enable by setting environment 
> variable ENABLE_KAFKA_0_8_TESTS=1Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py",
>  line 174, in _run_module_as_main
> "__main__", fname, loader, pkg_name)
>   File 
> "/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py",
>  line 72, in _run_code
> exec code in run_globals
>   File "/.../spark/python/pyspark/streaming/tests.py", line 1572, in 
> % kinesis_asl_assembly_dir) +
> NameError: name 'kinesis_asl_assembly_dir' is not defined
> {code}
> It shows a confusing error message. Seems a mistake.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23695) Confusing error message for PySpark's Kinesis tests when its jar is missing but enabled

2018-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23695:


Assignee: (was: Apache Spark)

> Confusing error message for PySpark's Kinesis tests when its jar is missing 
> but enabled
> ---
>
> Key: SPARK-23695
> URL: https://issues.apache.org/jira/browse/SPARK-23695
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> Currently if its jar is missing but the Kinesis tests are enabled:
> {code}
> ENABLE_KINESIS_TESTS=1 SPARK_TESTING=1 bin/pyspark pyspark.streaming.tests
> Skipped test_flume_stream (enable by setting environment variable 
> ENABLE_FLUME_TESTS=1Skipped test_kafka_stream (enable by setting environment 
> variable ENABLE_KAFKA_0_8_TESTS=1Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py",
>  line 174, in _run_module_as_main
> "__main__", fname, loader, pkg_name)
>   File 
> "/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py",
>  line 72, in _run_code
> exec code in run_globals
>   File "/.../spark/python/pyspark/streaming/tests.py", line 1572, in 
> % kinesis_asl_assembly_dir) +
> NameError: name 'kinesis_asl_assembly_dir' is not defined
> {code}
> It shows a confusing error message. Seems a mistake.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23695) Confusing error message for PySpark's Kinesis tests when its jar is missing but enabled

2018-03-15 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-23695:


 Summary: Confusing error message for PySpark's Kinesis tests when 
its jar is missing but enabled
 Key: SPARK-23695
 URL: https://issues.apache.org/jira/browse/SPARK-23695
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Hyukjin Kwon


Currently if its jar is missing but the Kinesis tests are enabled:

{code}
ENABLE_KINESIS_TESTS=1 SPARK_TESTING=1 bin/pyspark pyspark.streaming.tests
Skipped test_flume_stream (enable by setting environment variable 
ENABLE_FLUME_TESTS=1Skipped test_kafka_stream (enable by setting environment 
variable ENABLE_KAFKA_0_8_TESTS=1Traceback (most recent call last):
  File 
"/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py",
 line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
  File 
"/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py",
 line 72, in _run_code
exec code in run_globals
  File "/.../spark/python/pyspark/streaming/tests.py", line 1572, in 
% kinesis_asl_assembly_dir) +
NameError: name 'kinesis_asl_assembly_dir' is not defined
{code}

It shows a confusing error message. Seems a mistake.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23694) The staging directory should be under hive.exec.stagingdir if we set hive.exec.stagingdir but not under the table directory

2018-03-15 Thread Yifeng Dong (JIRA)
Yifeng Dong created SPARK-23694:
---

 Summary: The staging directory should be under hive.exec.stagingdir 
if we set hive.exec.stagingdir but not under the table directory 
 Key: SPARK-23694
 URL: https://issues.apache.org/jira/browse/SPARK-23694
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Yifeng Dong


When we set hive.exec.stagingdir to a location that is not under the table 
directory, for example /tmp/hive-staging, I think the staging directory should 
be created under /tmp/hive-staging, not under /tmp/ as /tmp/hive-staging_xxx.
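
For reference, a configuration sketch (assuming Hive support is enabled and that the option is forwarded to the embedded Hive client) showing how the option in question is typically set; the issue is about where Spark then places the hive-staging_xxx directories, not about how the option is supplied:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("staging-dir-example")
  // hive.exec.stagingdir is the Hive option the report refers to.
  .config("hive.exec.stagingdir", "/tmp/hive-staging")
  .enableHiveSupport()
  .getOrCreate()
{code}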



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23693) SQL function uuid()

2018-03-15 Thread Arseniy Tashoyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arseniy Tashoyan updated SPARK-23693:
-
Description: 
Add function uuid() to org.apache.spark.sql.functions that returns [Universally 
Unique ID|https://en.wikipedia.org/wiki/Universally_unique_identifier].

Sometimes it is necessary to uniquely identify each row in a DataFrame.

Currently the following ways are available:
 * monotonically_increasing_id() function
 * row_number() function over some window
 * convert the DataFrame to RDD and zipWithIndex()

All these approaches do not work when appending this DataFrame to another 
DataFrame (union). Collisions may occur - two rows in different DataFrames may 
have the same ID. Re-generating IDs on the resulting DataFrame is not an 
option, because some data in some other system may already refer to old IDs.

The proposed solution is to add a new function:
{code:scala}
def uuid(): String{code}
that returns the String representation of a UUID.

UUID is represented as a 128-bit number (two long numbers). Such numbers are 
not supported in Scala or Java. In addition, some storage systems do not 
support 128-bit numbers (Parquet's largest numeric type is INT96). This is the 
reason for the uuid() function to return String.

I already have a simple implementation based on 
[java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I 
can share it as a PR.

  was:
Add function uuid() to org.apache.spark.sql.functions that returns [Universally 
Unique ID|https://en.wikipedia.org/wiki/Universally_unique_identifier].

Sometimes it is necessary to uniquely identify each row in a DataFrame.

Currently the following ways are available:
 * monotonically_increasing_id() function
 * row_number() function over some window
 * convert the DataFrame to RDD and zipWithIndex()

All these approaches do not work when appending this DataFrame to another 
DataFrame (union). Collisions may occur - two rows in different DataFrames may 
have the same ID. Re-generating IDs on the resulting DataFrame is not an 
option, because some data in some other system may already refer to old IDs.

The proposed solution is to add new function:
{code:java}
def uuid(): String{code}
that returns String representation of UUID.

UUID is represented as a 128-bit number (two long numbers). Such numbers are 
not supported in Scala or Java. In addition, some storage systems do not 
support 128-bit numbers (Parquet's largest numeric type is INT96). This is the 
reason for the uuid() function to return String.

I already have a simple implementation based on 
[java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I 
can share it as a PR.


> SQL function uuid()
> ---
>
> Key: SPARK-23693
> URL: https://issues.apache.org/jira/browse/SPARK-23693
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Arseniy Tashoyan
>Priority: Minor
>
> Add function uuid() to org.apache.spark.sql.functions that returns 
> [Universally Unique 
> ID|https://en.wikipedia.org/wiki/Universally_unique_identifier].
> Sometimes it is necessary to uniquely identify each row in a DataFrame.
> Currently the following ways are available:
>  * monotonically_increasing_id() function
>  * row_number() function over some window
>  * convert the DataFrame to RDD and zipWithIndex()
> All these approaches do not work when appending this DataFrame to another 
> DataFrame (union). Collisions may occur - two rows in different DataFrames 
> may have the same ID. Re-generating IDs on the resulting DataFrame is not an 
> option, because some data in some other system may already refer to old IDs.
> The proposed solution is to add a new function:
> {code:scala}
> def uuid(): String{code}
> that returns the String representation of a UUID.
> UUID is represented as a 128-bit number (two long numbers). Such numbers are 
> not supported in Scala or Java. In addition, some storage systems do not 
> support 128-bit numbers (Parquet's largest numeric type is INT96). This is 
> the reason for the uuid() function to return String.
> I already have a simple implementation based on 
> [java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I 
> can share it as a PR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23693) SQL function uuid()

2018-03-15 Thread Arseniy Tashoyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arseniy Tashoyan updated SPARK-23693:
-
Description: 
Add function uuid() to org.apache.spark.sql.functions that returns [Universally 
Unique ID|https://en.wikipedia.org/wiki/Universally_unique_identifier].

Sometimes it is necessary to uniquely identify each row in a DataFrame.

Currently the following ways are available:
 * monotonically_increasing_id() function
 * row_number() function over some window
 * convert the DataFrame to RDD and zipWithIndex()

All these approaches do not work when appending this DataFrame to another 
DataFrame (union). Collisions may occur - two rows in different DataFrames may 
have the same ID. Re-generating IDs on the resulting DataFrame is not an 
option, because some data in some other system may already refer to old IDs.

The proposed solution is to add a new function:
{code:java}
def uuid(): String{code}
that returns the String representation of a UUID.

UUID is represented as a 128-bit number (two long numbers). Such numbers are 
not supported in Scala or Java. In addition, some storage systems do not 
support 128-bit numbers (Parquet's largest numeric type is INT96). This is the 
reason for the uuid() function to return String.

I already have a simple implementation based on 
[java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I 
can share it as a PR.

  was:
Add function uuid() to org.apache.spark.sql.functions that returns [Universally 
Unique ID|https://en.wikipedia.org/wiki/Universally_unique_identifier].

Sometimes it is necessary to uniquely identify each row in a DataFrame.

Currently the following ways are available:
 * monotonically_increasing_id() function
 * row_number() function over some window
 * convert the DataFrame to RDD and zipWithIndex()

All these approaches do not work when appending this DataFrame to another 
DataFrame (union). Collisions may occur - two rows in different DataFrames may 
have the same ID. Re-generating IDs on the resulting DataFrame is not an 
option, because some data in some other system may already refer to old IDs.

The proposed solution is to add new function:

def uuid(): String

that returns String representation of UUID.

UUID is represented as a 128-bit number (two long numbers). Such numbers are 
not supported in Scala or Java. In addition, some storage systems do not 
support 128-bit numbers (Parquet's largest numeric type is INT96). This is the 
reason for the uuid() function to return String.

I already have a simple implementation based on 
[java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I 
can share it as a PR.


> SQL function uuid()
> ---
>
> Key: SPARK-23693
> URL: https://issues.apache.org/jira/browse/SPARK-23693
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Arseniy Tashoyan
>Priority: Minor
>
> Add function uuid() to org.apache.spark.sql.functions that returns 
> [Universally Unique 
> ID|https://en.wikipedia.org/wiki/Universally_unique_identifier].
> Sometimes it is necessary to uniquely identify each row in a DataFrame.
> Currently the following ways are available:
>  * monotonically_increasing_id() function
>  * row_number() function over some window
>  * convert the DataFrame to RDD and zipWithIndex()
> All these approaches do not work when appending this DataFrame to another 
> DataFrame (union). Collisions may occur - two rows in different DataFrames 
> may have the same ID. Re-generating IDs on the resulting DataFrame is not an 
> option, because some data in some other system may already refer to old IDs.
> The proposed solution is to add a new function:
> {code:java}
> def uuid(): String{code}
> that returns the String representation of a UUID.
> UUID is represented as a 128-bit number (two long numbers). Such numbers are 
> not supported in Scala or Java. In addition, some storage systems do not 
> support 128-bit numbers (Parquet's largest numeric type is INT96). This is 
> the reason for the uuid() function to return String.
> I already have a simple implementation based on 
> [java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I 
> can share it as a PR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23693) SQL function uuid()

2018-03-15 Thread Arseniy Tashoyan (JIRA)
Arseniy Tashoyan created SPARK-23693:


 Summary: SQL function uuid()
 Key: SPARK-23693
 URL: https://issues.apache.org/jira/browse/SPARK-23693
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0, 2.2.1
Reporter: Arseniy Tashoyan


Add function uuid() to org.apache.spark.sql.functions that returns [Universally 
Unique ID|https://en.wikipedia.org/wiki/Universally_unique_identifier].

Sometimes it is necessary to uniquely identify each row in a DataFrame.

Currently the following ways are available:
 * monotonically_increasing_id() function
 * row_number() function over some window
 * convert the DataFrame to RDD and zipWithIndex()

All these approaches do not work when appending this DataFrame to another 
DataFrame (union). Collisions may occur - two rows in different DataFrames may 
have the same ID. Re-generating IDs on the resulting DataFrame is not an 
option, because some data in some other system may already refer to old IDs.

The proposed solution is to add a new function:

def uuid(): String

that returns the String representation of a UUID.

UUID is represented as a 128-bit number (two long numbers). Such numbers are 
not supported in Scala or Java. In addition, some storage systems do not 
support 128-bit numbers (Parquet's largest numeric type is INT96). This is the 
reason for the uuid() function to return String.

I already have a simple implementation based on 
[java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I 
can share it as a PR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23683) FileCommitProtocol.instantiate to require 3-arg constructor for dynamic partition overwrite

2018-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16400177#comment-16400177
 ] 

Apache Spark commented on SPARK-23683:
--

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/20824

> FileCommitProtocol.instantiate to require 3-arg constructor for dynamic 
> partition overwrite
> ---
>
> Key: SPARK-23683
> URL: https://issues.apache.org/jira/browse/SPARK-23683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Steve Loughran
>Priority: Major
>
> With SPARK-20236, {{FileCommitProtocol.instantiate()}} looks for a three-argument 
> constructor, passing in the {{dynamicPartitionOverwrite}} parameter. 
> If there is no such constructor, it falls back to the classic two-arg one.
> When {{InsertIntoHadoopFsRelationCommand}} passes down that 
> {{dynamicPartitionOverwrite}} flag to  {{FileCommitProtocol.instantiate()}}, 
> it _assumes_ that the instantiated protocol supports the specific 
> requirements of dynamic partition overwrite. It does not notice when this 
> does not hold, and so the output generated may be incorrect.
> Proposed: when dynamicPartitionOverwrite == true, require the protocol 
> implementation to have a 3-arg constructor.
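
A reflective sketch of that proposal (assumed shape, not the Spark implementation): prefer the three-argument constructor, and when dynamic partition overwrite is requested but only the two-argument constructor exists, fail fast instead of silently instantiating a protocol that cannot honour the flag.

{code:scala}
def instantiate(
    className: String,
    jobId: String,
    outputPath: String,
    dynamicPartitionOverwrite: Boolean): AnyRef = {
  val clazz = Class.forName(className)
  clazz.getConstructors.find(_.getParameterCount == 3) match {
    case Some(ctor) =>
      // Three-arg constructor: (jobId, outputPath, dynamicPartitionOverwrite).
      ctor.newInstance(jobId, outputPath, java.lang.Boolean.valueOf(dynamicPartitionOverwrite))
    case None if dynamicPartitionOverwrite =>
      // Fail fast rather than produce incorrect output with an unaware protocol.
      throw new IllegalArgumentException(
        s"$className must provide a 3-arg constructor to support dynamic partition overwrite")
    case None =>
      // Classic two-arg constructor: (jobId, outputPath).
      clazz.getConstructor(classOf[String], classOf[String]).newInstance(jobId, outputPath)
  }
}
{code}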



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23683) FileCommitProtocol.instantiate to require 3-arg constructor for dynamic partition overwrite

2018-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23683:


Assignee: Apache Spark

> FileCommitProtocol.instantiate to require 3-arg constructor for dynamic 
> partition overwrite
> ---
>
> Key: SPARK-23683
> URL: https://issues.apache.org/jira/browse/SPARK-23683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Steve Loughran
>Assignee: Apache Spark
>Priority: Major
>
> With SPARK-20236, {{FileCommitProtocol.instantiate()}} looks for a three-argument 
> constructor, passing in the {{dynamicPartitionOverwrite}} parameter. 
> If there is no such constructor, it falls back to the classic two-arg one.
> When {{InsertIntoHadoopFsRelationCommand}} passes down that 
> {{dynamicPartitionOverwrite}} flag to  {{FileCommitProtocol.instantiate()}}, 
> it _assumes_ that the instantiated protocol supports the specific 
> requirements of dynamic partition overwrite. It does not notice when this 
> does not hold, and so the output generated may be incorrect.
> Proposed: when dynamicPartitionOverwrite == true, require the protocol 
> implementation to have a 3-arg constructor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23683) FileCommitProtocol.instantiate to require 3-arg constructor for dynamic partition overwrite

2018-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23683:


Assignee: (was: Apache Spark)

> FileCommitProtocol.instantiate to require 3-arg constructor for dynamic 
> partition overwrite
> ---
>
> Key: SPARK-23683
> URL: https://issues.apache.org/jira/browse/SPARK-23683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Steve Loughran
>Priority: Major
>
> With SPARK-20236, {{FileCommitProtocol.instantiate()}} looks for a three-argument 
> constructor, passing in the {{dynamicPartitionOverwrite}} parameter. 
> If there is no such constructor, it falls back to the classic two-arg one.
> When {{InsertIntoHadoopFsRelationCommand}} passes down that 
> {{dynamicPartitionOverwrite}} flag to  {{FileCommitProtocol.instantiate()}}, 
> it _assumes_ that the instantiated protocol supports the specific 
> requirements of dynamic partition overwrite. It does not notice when this 
> does not hold, and so the output generated may be incorrect.
> Proposed: when dynamicPartitionOverwrite == true, require the protocol 
> implementation to have a 3-arg constructor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23692) Print metadata of files when infer schema failed

2018-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16400106#comment-16400106
 ] 

Apache Spark commented on SPARK-23692:
--

User 'caneGuy' has created a pull request for this issue:
https://github.com/apache/spark/pull/20833

> Print metadata of files when infer schema failed
> 
>
> Key: SPARK-23692
> URL: https://issues.apache.org/jira/browse/SPARK-23692
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: zhoukang
>Priority: Minor
>
> A trivial modification.
> Currently, when there are no input files to infer a schema from, we throw the 
> exception below.
> For some users it may be misleading. If we printed the files' metadata, the 
> error would be clearer.
> {code:java}
> Caused by: org.apache.spark.sql.AnalysisException: Unable to infer schema for 
> Parquet. It must be specified manually.;
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189)
>  at scala.Option.getOrElse(Option.scala:121)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:188)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
>  at 
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>  at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:441)
>  at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:425)
>  at 
> com.xiaomi.matrix.pipeline.jobspark.importer.MatrixAdEventDailyImportJob.(MatrixAdEventDailyImportJob.scala:18)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23692) Print metadata of files when infer schema failed

2018-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23692:


Assignee: (was: Apache Spark)

> Print metadata of files when infer schema failed
> 
>
> Key: SPARK-23692
> URL: https://issues.apache.org/jira/browse/SPARK-23692
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: zhoukang
>Priority: Minor
>
> A trivial modification.
> Currently, when there are no input files to infer a schema from, we throw the 
> exception below.
> For some users it may be misleading. If we printed the files' metadata, the 
> error would be clearer.
> {code:java}
> Caused by: org.apache.spark.sql.AnalysisException: Unable to infer schema for 
> Parquet. It must be specified manually.;
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189)
>  at scala.Option.getOrElse(Option.scala:121)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:188)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
>  at 
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>  at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:441)
>  at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:425)
>  at 
> com.xiaomi.matrix.pipeline.jobspark.importer.MatrixAdEventDailyImportJob.(MatrixAdEventDailyImportJob.scala:18)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23692) Print metadata of files when infer schema failed

2018-03-15 Thread zhoukang (JIRA)
zhoukang created SPARK-23692:


 Summary: Print metadata of files when infer schema failed
 Key: SPARK-23692
 URL: https://issues.apache.org/jira/browse/SPARK-23692
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: zhoukang


A trivial modification.
Currently, when there are no input files to infer a schema from, we throw the 
exception below.
For some users it may be misleading. If we printed the files' metadata, the 
error would be clearer.

{code:java}
Caused by: org.apache.spark.sql.AnalysisException: Unable to infer schema for 
Parquet. It must be specified manually.;
 at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189)
 at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189)
 at scala.Option.getOrElse(Option.scala:121)
 at 
org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:188)
 at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
 at 
org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:441)
 at 
org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:425)
 at 
com.xiaomi.matrix.pipeline.jobspark.importer.MatrixAdEventDailyImportJob.(MatrixAdEventDailyImportJob.scala:18)
{code}
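
A minimal repro sketch for the situation described above (the path is hypothetical): pointing the Parquet reader at a directory with no readable Parquet files leaves nothing to infer a schema from, and the read fails with the AnalysisException quoted in the stack trace. Printing the listed files' metadata alongside that error is what this issue proposes.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("infer-schema-repro")
  .getOrCreate()

// Hypothetical path containing no Parquet files: schema inference fails here
// with "Unable to infer schema for Parquet. It must be specified manually."
val df = spark.read.parquet("/tmp/empty-dir")
{code}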




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23692) Print metadata of files when infer schema failed

2018-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23692:


Assignee: Apache Spark

> Print metadata of files when infer schema failed
> 
>
> Key: SPARK-23692
> URL: https://issues.apache.org/jira/browse/SPARK-23692
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: zhoukang
>Assignee: Apache Spark
>Priority: Minor
>
> A trivial modification.
> Currently, when there are no input files to infer a schema from, we throw the 
> exception below.
> For some users it may be misleading. If we printed the files' metadata, the 
> error would be clearer.
> {code:java}
> Caused by: org.apache.spark.sql.AnalysisException: Unable to infer schema for 
> Parquet. It must be specified manually.;
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189)
>  at scala.Option.getOrElse(Option.scala:121)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:188)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
>  at 
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>  at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:441)
>  at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:425)
>  at 
> com.xiaomi.matrix.pipeline.jobspark.importer.MatrixAdEventDailyImportJob.(MatrixAdEventDailyImportJob.scala:18)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20536) Extend ColumnName to create StructFields with explicit nullable

2018-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16400021#comment-16400021
 ] 

Apache Spark commented on SPARK-20536:
--

User 'efimpoberezkin' has created a pull request for this issue:
https://github.com/apache/spark/pull/20832

> Extend ColumnName to create StructFields with explicit nullable
> ---
>
> Key: SPARK-20536
> URL: https://issues.apache.org/jira/browse/SPARK-20536
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> {{ColumnName}} defines methods to create {{StructFields}}.
> It'd be very user-friendly if there were methods to create {{StructFields}} 
> with an explicit {{nullable}} property (currently implicitly {{true}}).
> That could look as follows:
> {code}
> // E.g. def int: StructField = StructField(name, IntegerType)
> def int(nullable: Boolean): StructField = StructField(name, IntegerType, 
> nullable)
> // or (untested)
> def int(nullable: Boolean): StructField = int.copy(nullable = nullable)
> {code}
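
For comparison with the proposal above, this is what users write today: the nullable flag has to be spelled out through StructField directly rather than through the ColumnName shorthand (field names here are purely illustrative).

{code:scala}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Current workaround: build the fields explicitly when nullable must be false.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("count", IntegerType) // nullable defaults to true
))
{code}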



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20536) Extend ColumnName to create StructFields with explicit nullable

2018-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20536:


Assignee: (was: Apache Spark)

> Extend ColumnName to create StructFields with explicit nullable
> ---
>
> Key: SPARK-20536
> URL: https://issues.apache.org/jira/browse/SPARK-20536
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> {{ColumnName}} defines methods to create {{StructFields}}.
> It'd be very user-friendly if there were methods to create {{StructFields}} 
> with an explicit {{nullable}} property (currently implicitly {{true}}).
> That could look as follows:
> {code}
> // E.g. def int: StructField = StructField(name, IntegerType)
> def int(nullable: Boolean): StructField = StructField(name, IntegerType, 
> nullable)
> // or (untested)
> def int(nullable: Boolean): StructField = int.copy(nullable = nullable)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20536) Extend ColumnName to create StructFields with explicit nullable

2018-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20536:


Assignee: Apache Spark

> Extend ColumnName to create StructFields with explicit nullable
> ---
>
> Key: SPARK-20536
> URL: https://issues.apache.org/jira/browse/SPARK-20536
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Assignee: Apache Spark
>Priority: Trivial
>
> {{ColumnName}} defines methods to create {{StructFields}}.
> It'd be very user-friendly if there were methods to create {{StructFields}} 
> with an explicit {{nullable}} property (currently implicitly {{true}}).
> That could look as follows:
> {code}
> // E.g. def int: StructField = StructField(name, IntegerType)
> def int(nullable: Boolean): StructField = StructField(name, IntegerType, 
> nullable)
> // or (untested)
> def int(nullable: Boolean): StructField = int.copy(nullable = nullable)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23533) Add support for changing ContinuousDataReader's startOffset

2018-03-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-23533:


Assignee: Li Yuanjian

> Add support for changing ContinuousDataReader's startOffset
> ---
>
> Key: SPARK-23533
> URL: https://issues.apache.org/jira/browse/SPARK-23533
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Li Yuanjian
>Assignee: Li Yuanjian
>Priority: Major
> Fix For: 2.4.0
>
>
> As discussed in [https://github.com/apache/spark/pull/20675], we need to add a 
> new interface `ContinuousDataReaderFactory` to support the requirement of 
> setting the start offset in Continuous Processing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23533) Add support for changing ContinuousDataReader's startOffset

2018-03-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-23533.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20689
[https://github.com/apache/spark/pull/20689]

> Add support for changing ContinuousDataReader's startOffset
> ---
>
> Key: SPARK-23533
> URL: https://issues.apache.org/jira/browse/SPARK-23533
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Li Yuanjian
>Assignee: Li Yuanjian
>Priority: Major
> Fix For: 2.4.0
>
>
> As discussed in [https://github.com/apache/spark/pull/20675], we need to add a 
> new interface `ContinuousDataReaderFactory` to support setting the start offset 
> in Continuous Processing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23614) Union produces incorrect results when caching is used

2018-03-15 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-23614:

Component/s: (was: Spark Core)
 SQL

> Union produces incorrect results when caching is used
> -
>
> Key: SPARK-23614
> URL: https://issues.apache.org/jira/browse/SPARK-23614
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Morten Hornbech
>Priority: Major
>
> We just upgraded from 2.2 to 2.3 and our test suite caught this error:
> {code:java}
> case class TestData(x: Int, y: Int, z: Int)
> val frame = session.createDataset(Seq(TestData(1, 2, 3), TestData(4, 5, 6))).cache()
> val group1 = frame.groupBy("x").agg(min(col("y")) as "value")
> val group2 = frame.groupBy("x").agg(min(col("z")) as "value")
> group1.union(group2).show()
> // +---+-----+
> // |  x|value|
> // +---+-----+
> // |  1|    2|
> // |  4|    5|
> // |  1|    2|
> // |  4|    5|
> // +---+-----+
> group2.union(group1).show()
> // +---+-----+
> // |  x|value|
> // +---+-----+
> // |  1|    3|
> // |  4|    6|
> // |  1|    3|
> // |  4|    6|
> // +---+-----+
> {code}
> The error disappears if the first data frame is not cached or if the two 
> groupBy calls use separate copies. I'm not sure exactly what happens inside 
> Spark, but errors that produce incorrect results rather than exceptions 
> always concern me.
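
A workaround sketch based on that observation, reusing the {{TestData}} case class from the snippet above and assuming {{session}} is the active SparkSession; it sidesteps the symptom rather than fixing the underlying planner issue:

{code}
import session.implicits._
import org.apache.spark.sql.functions.{col, min}

// Give each aggregation its own cached copy so the two branches of the
// union do not share a single cached plan.
val data = Seq(TestData(1, 2, 3), TestData(4, 5, 6))
val frameY = session.createDataset(data).cache()
val frameZ = session.createDataset(data).cache()

val group1 = frameY.groupBy("x").agg(min(col("y")) as "value")
val group2 = frameZ.groupBy("x").agg(min(col("z")) as "value")

// Expected: the min(y) rows followed by the min(z) rows.
group1.union(group2).show()
{code}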



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23614) Union produces incorrect results when caching is used

2018-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23614:


Assignee: (was: Apache Spark)

> Union produces incorrect results when caching is used
> -
>
> Key: SPARK-23614
> URL: https://issues.apache.org/jira/browse/SPARK-23614
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Morten Hornbech
>Priority: Major
>
> We just upgraded from 2.2 to 2.3 and our test suite caught this error:
> {code:java}
> case class TestData(x: Int, y: Int, z: Int)
> val frame = session.createDataset(Seq(TestData(1, 2, 3), TestData(4, 5, 6))).cache()
> val group1 = frame.groupBy("x").agg(min(col("y")) as "value")
> val group2 = frame.groupBy("x").agg(min(col("z")) as "value")
> group1.union(group2).show()
> // +---+-----+
> // |  x|value|
> // +---+-----+
> // |  1|    2|
> // |  4|    5|
> // |  1|    2|
> // |  4|    5|
> // +---+-----+
> group2.union(group1).show()
> // +---+-----+
> // |  x|value|
> // +---+-----+
> // |  1|    3|
> // |  4|    6|
> // |  1|    3|
> // |  4|    6|
> // +---+-----+
> {code}
> The error disappears if the first data frame is not cached or if the two 
> groupBy calls use separate copies. I'm not sure exactly what happens inside 
> Spark, but errors that produce incorrect results rather than exceptions 
> always concern me.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23614) Union produces incorrect results when caching is used

2018-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23614:


Assignee: Apache Spark

> Union produces incorrect results when caching is used
> -
>
> Key: SPARK-23614
> URL: https://issues.apache.org/jira/browse/SPARK-23614
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Morten Hornbech
>Assignee: Apache Spark
>Priority: Major
>
> We just upgraded from 2.2 to 2.3 and our test suite caught this error:
> {code:java}
> case class TestData(x: Int, y: Int, z: Int)
> val frame = session.createDataset(Seq(TestData(1, 2, 3), TestData(4, 5, 6))).cache()
> val group1 = frame.groupBy("x").agg(min(col("y")) as "value")
> val group2 = frame.groupBy("x").agg(min(col("z")) as "value")
> group1.union(group2).show()
> // +---+-----+
> // |  x|value|
> // +---+-----+
> // |  1|    2|
> // |  4|    5|
> // |  1|    2|
> // |  4|    5|
> // +---+-----+
> group2.union(group1).show()
> // +---+-----+
> // |  x|value|
> // +---+-----+
> // |  1|    3|
> // |  4|    6|
> // |  1|    3|
> // |  4|    6|
> // +---+-----+
> {code}
> The error disappears if the first data frame is not cached or if the two 
> groupBy calls use separate copies. I'm not sure exactly what happens inside 
> Spark, but errors that produce incorrect results rather than exceptions 
> always concern me.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23614) Union produces incorrect results when caching is used

2018-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16399985#comment-16399985
 ] 

Apache Spark commented on SPARK-23614:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/20831

> Union produces incorrect results when caching is used
> -
>
> Key: SPARK-23614
> URL: https://issues.apache.org/jira/browse/SPARK-23614
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Morten Hornbech
>Priority: Major
>
> We just upgraded from 2.2 to 2.3 and our test suite caught this error:
> {code:java}
> case class TestData(x: Int, y: Int, z: Int)
> val frame = session.createDataset(Seq(TestData(1, 2, 3), TestData(4, 5, 6))).cache()
> val group1 = frame.groupBy("x").agg(min(col("y")) as "value")
> val group2 = frame.groupBy("x").agg(min(col("z")) as "value")
> group1.union(group2).show()
> // +---+-----+
> // |  x|value|
> // +---+-----+
> // |  1|    2|
> // |  4|    5|
> // |  1|    2|
> // |  4|    5|
> // +---+-----+
> group2.union(group1).show()
> // +---+-----+
> // |  x|value|
> // +---+-----+
> // |  1|    3|
> // |  4|    6|
> // |  1|    3|
> // |  4|    6|
> // +---+-----+
> {code}
> The error disappears if the first data frame is not cached or if the two 
> groupBy calls use separate copies. I'm not sure exactly what happens inside 
> Spark, but errors that produce incorrect results rather than exceptions 
> always concern me.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23677) Selecting columns from joined DataFrames with the same origin yields wrong results

2018-03-15 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16399969#comment-16399969
 ] 

Takeshi Yamamuro commented on SPARK-23677:
--

Do you mean this ticket: SPARK-14948? I think this is a well-known issue.

> Selecting columns from joined DataFrames with the same origin yields wrong 
> results
> --
>
> Key: SPARK-23677
> URL: https://issues.apache.org/jira/browse/SPARK-23677
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Martin Mauch
>Priority: Major
>
> When joining two DataFrames derived from the same origin DataFrame and later 
> selecting columns from the join, Spark can't distinguish between the columns 
> and gives a wrong (or at least very surprising) result. One can work around 
> this using expr.
> Here is a minimal example:
>  
> {code:java}
> import spark.implicits._
> val edf = Seq((1), (2), (3), (4), (5)).toDF("num")
> val big = edf.where(edf("num") > 2).alias("big")
> val small = edf.where(edf("num") < 4).alias("small")
> small.join(big, expr("big.num == (small.num + 1)")).select(small("num"), big("num")).show()
> // +---+---+
> // |num|num|
> // +---+---+
> // |  2|  2|
> // |  3|  3|
> // +---+---+
> small.join(big, expr("big.num == (small.num + 1)")).select(expr("small.num"), expr("big.num")).show()
> // +---+---+
> // |num|num|
> // +---+---+
> // |  2|  3|
> // |  3|  4|
> // +---+---+
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org