[jira] [Assigned] (SPARK-18048) If expression behaves differently if true and false expression are interchanged in case of different data types.

2016-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18048:


Assignee: Apache Spark

> If expression behaves differently if true and false expression are 
> interchanged in case of different data types.
> 
>
> Key: SPARK-18048
> URL: https://issues.apache.org/jira/browse/SPARK-18048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Priyanka Garg
>Assignee: Apache Spark
>
> If expression behaves differently if true and false expression are 
> interchanged in case of different data types.
> For eg. 
>   checkEvaluation(
>   If(Literal.create(true, BooleanType),
> Literal.create(identity(1), DateType),
> Literal.create(identity(2L), TimestampType)),
>   identity(1)) throws an error, while
>  checkEvaluation(
>   If(Literal.create(true, BooleanType),
> Literal.create(identity(1L), TimestampType),
> Literal.create(identity(2), DateType)),
>   identity(1L)) works fine.
> The reason is that the If expression's dataType only considers
> trueValue.dataType.
> Also, 
>   checkEvaluation(
>   If(Literal.create(true, BooleanType),
> Literal.create(identity(1), DateType),
> Literal.create(identity(2L), TimestampType)),
>   identity(1))
>  breaks only for generated mutable projection and unsafe projection; all other
> projection types work fine.
> Either both should work or neither should.
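For illustration, here is a minimal Scala sketch of the asymmetry described above. It is an assumption about the mechanism, not the actual Spark implementation or the fix in the linked PR: if the result type is taken from the true branch alone, swapping the branches changes the result type, so one ordering can fail while the other appears to work.

{code}
import org.apache.spark.sql.types._

// Hypothetical helpers, for illustration only -- neither exists in Spark.
// Taking the result type from the "true" branch alone makes the outcome depend on
// branch order (DateType vs. TimestampType in the example above).
def asymmetricResultType(trueType: DataType, falseType: DataType): DataType =
  trueType

// A symmetric check treats both orderings the same: either both are accepted
// (matching types) or both are rejected (mismatched types).
def symmetricResultType(trueType: DataType, falseType: DataType): Option[DataType] =
  if (trueType == falseType) Some(trueType) else None

// symmetricResultType(DateType, TimestampType) == None, in either order.
{code}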






[jira] [Commented] (SPARK-18048) If expression behaves differently if true and false expression are interchanged in case of different data types.

2016-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15601126#comment-15601126
 ] 

Apache Spark commented on SPARK-18048:
--

User 'priyankagargnitk' has created a pull request for this issue:
https://github.com/apache/spark/pull/15609

> If expression behaves differently if true and false expression are 
> interchanged in case of different data types.
> 
>
> Key: SPARK-18048
> URL: https://issues.apache.org/jira/browse/SPARK-18048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Priyanka Garg
>
> If expression behaves differently if true and false expression are 
> interchanged in case of different data types.
> For eg. 
>   checkEvaluation(
>   If(Literal.create(true, BooleanType),
> Literal.create(identity(1), DateType),
> Literal.create(identity(2L), TimestampType)),
>   identity(1)) throws an error, while
>  checkEvaluation(
>   If(Literal.create(true, BooleanType),
> Literal.create(identity(1L), TimestampType),
> Literal.create(identity(2), DateType)),
>   identity(1L)) works fine.
> The reason is that the If expression's dataType only considers
> trueValue.dataType.
> Also, 
>   checkEvaluation(
>   If(Literal.create(true, BooleanType),
> Literal.create(identity(1), DateType),
> Literal.create(identity(2L), TimestampType)),
>   identity(1))
>  breaks only for generated mutable projection and unsafe projection; all other
> projection types work fine.
> Either both should work or neither should.






[jira] [Assigned] (SPARK-18048) If expression behaves differently if true and false expression are interchanged in case of different data types.

2016-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18048:


Assignee: (was: Apache Spark)

> If expression behaves differently if true and false expression are 
> interchanged in case of different data types.
> 
>
> Key: SPARK-18048
> URL: https://issues.apache.org/jira/browse/SPARK-18048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Priyanka Garg
>
> If expression behaves differently if true and false expression are 
> interchanged in case of different data types.
> For eg. 
>   checkEvaluation(
>   If(Literal.create(true, BooleanType),
> Literal.create(identity(1), DateType),
> Literal.create(identity(2L), TimestampType)),
>   identity(1)) throws an error, while
>  checkEvaluation(
>   If(Literal.create(true, BooleanType),
> Literal.create(identity(1L), TimestampType),
> Literal.create(identity(2), DateType)),
>   identity(1L)) works fine.
> The reason is that the If expression's dataType only considers
> trueValue.dataType.
> Also, 
>   checkEvaluation(
>   If(Literal.create(true, BooleanType),
> Literal.create(identity(1), DateType),
> Literal.create(identity(2L), TimestampType)),
>   identity(1))
>  breaks only for generated mutable projection and unsafe projection; all other
> projection types work fine.
> Either both should work or neither should.






[jira] [Updated] (SPARK-18048) If expression behaves differently if true and false expression are interchanged in case of different data types.

2016-10-23 Thread Priyanka Garg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Priyanka Garg updated SPARK-18048:
--
Description: 
If expression behaves differently if true and false expression are interchanged 
in case of different data types.

For eg. 
  checkEvaluation(
  If(Literal.create(true, BooleanType),
Literal.create(identity(1), DateType),
Literal.create(identity(2L), TimestampType)),
  identity(1)) throws an error, while

 checkEvaluation(
  If(Literal.create(true, BooleanType),
Literal.create(identity(1L), TimestampType),
Literal.create(identity(2), DateType)),
  identity(1L)) works fine.

The reason is that the If expression's dataType only considers
trueValue.dataType.

Also, 
  checkEvaluation(
  If(Literal.create(true, BooleanType),
Literal.create(identity(1), DateType),
Literal.create(identity(2L), TimestampType)),
  identity(1))
 breaks only for generated mutable projection and unsafe projection; all other
projection types work fine.

Either both should work or neither should.

  was:
If expression behaves differently if true and false expression are interchanged 
in case of different data types.

For eg. 
If(Literal.create(geo != null, BooleanType),
Literal.create(null, DateType),
Literal.create(null, TimestampType)) is throwing error while 

If(Literal.create(geo != null, BooleanType),
Literal.create(null, TimestampType),
Literal.create(null, DateType )) works fine.

The reason for the same is that the If expression 's datatype only considers 
trueValue.dataType.

Also, 
If(Literal.create(geo != null, BooleanType),
Literal.create(null, DateType),
Literal.create(null, TimestampType))
 is breaking only in case of Generated mutable Projection and Unsafe 
projection. For all other types its working fine.

Either both should work or none should work


> If expression behaves differently if true and false expression are 
> interchanged in case of different data types.
> 
>
> Key: SPARK-18048
> URL: https://issues.apache.org/jira/browse/SPARK-18048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Priyanka Garg
>
> If expression behaves differently if true and false expression are 
> interchanged in case of different data types.
> For eg. 
>   checkEvaluation(
>   If(Literal.create(true, BooleanType),
> Literal.create(identity(1), DateType),
> Literal.create(identity(2L), TimestampType)),
>   identity(1)) throws an error, while
>  checkEvaluation(
>   If(Literal.create(true, BooleanType),
> Literal.create(identity(1L), TimestampType),
> Literal.create(identity(2), DateType)),
>   identity(1L)) works fine.
> The reason is that the If expression's dataType only considers
> trueValue.dataType.
> Also, 
>   checkEvaluation(
>   If(Literal.create(true, BooleanType),
> Literal.create(identity(1), DateType),
> Literal.create(identity(2L), TimestampType)),
>   identity(1))
>  breaks only for generated mutable projection and unsafe projection; all other
> projection types work fine.
> Either both should work or neither should.






[jira] [Updated] (SPARK-18048) If expression behaves differently if true and false expression are interchanged in case of different data types.

2016-10-23 Thread Priyanka Garg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Priyanka Garg updated SPARK-18048:
--
Description: 
If expression behaves differently if true and false expression are interchanged 
in case of different data types.

For eg. 
If(Literal.create(geo != null, BooleanType),
Literal.create(null, DateType),
Literal.create(null, TimestampType)) throws an error, while

If(Literal.create(geo != null, BooleanType),
Literal.create(null, TimestampType),
Literal.create(null, DateType )) works fine.

The reason is that the If expression's dataType only considers
trueValue.dataType.

Also, 
If(Literal.create(geo != null, BooleanType),
Literal.create(null, DateType),
Literal.create(null, TimestampType))
 breaks only for generated mutable projection and unsafe projection; all other
projection types work fine.

Either both should work or neither should.

  was:
If expression behaves differently if true and false expression are interchanged 
in case of different data types.

For eg. 
If(Literal.create(geo != null, BooleanType),
Literal.create(null, DateType),
Literal.create(null, TimestampType)) is throwing error while 

If(Literal.create(geo != null, BooleanType),
Literal.create(null, TimestampType),
Literal.create(null, DateType )) works fine.

The reason for the same is that the If expression 's datatype only considers 
trueValue.dataType.

Also, 
If(Literal.create(geo != null, BooleanType),
Literal.create(null, DateType),
Literal.create(null, TimestampType))
 is breaking only in case of Generated mutable Projection and Unsafe 
projection. For all other types its working fine.


> If expression behaves differently if true and false expression are 
> interchanged in case of different data types.
> 
>
> Key: SPARK-18048
> URL: https://issues.apache.org/jira/browse/SPARK-18048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Priyanka Garg
>
> If expression behaves differently if true and false expression are 
> interchanged in case of different data types.
> For eg. 
> If(Literal.create(geo != null, BooleanType),
> Literal.create(null, DateType),
> Literal.create(null, TimestampType)) throws an error, while
> If(Literal.create(geo != null, BooleanType),
> Literal.create(null, TimestampType),
> Literal.create(null, DateType )) works fine.
> The reason is that the If expression's dataType only considers
> trueValue.dataType.
> Also, 
> If(Literal.create(geo != null, BooleanType),
> Literal.create(null, DateType),
> Literal.create(null, TimestampType))
>  breaks only for generated mutable projection and unsafe projection; all other
> projection types work fine.
> Either both should work or neither should.






[jira] [Assigned] (SPARK-17838) Strict type checking for arguments with a better messages across APIs.

2016-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17838:


Assignee: Apache Spark

> Strict type checking for arguments with a better messages across APIs.
> --
>
> Key: SPARK-17838
> URL: https://issues.apache.org/jira/browse/SPARK-17838
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>
> It seems there should be more strict type checking for arguments in SparkR 
> APIs. This was discussed in several PRs. 
> https://github.com/apache/spark/pull/15239#discussion_r82445435
> Roughly, there seem to be three cases, as below:
> The first case below was described in 
> https://github.com/apache/spark/pull/15239#discussion_r82445435
> - Check for {{zero-length variable name}}
> Some of the other cases below were handled in 
> https://github.com/apache/spark/pull/15231#discussion_r80417904
> - Catch the exception from the JVM and format it nicely
> - Strictly check types before calling the JVM in SparkR






[jira] [Assigned] (SPARK-17838) Strict type checking for arguments with a better messages across APIs.

2016-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17838:


Assignee: (was: Apache Spark)

> Strict type checking for arguments with a better messages across APIs.
> --
>
> Key: SPARK-17838
> URL: https://issues.apache.org/jira/browse/SPARK-17838
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Hyukjin Kwon
>
> It seems there should be more strict type checking for arguments in SparkR 
> APIs. This was discussed in several PRs. 
> https://github.com/apache/spark/pull/15239#discussion_r82445435
> Roughly, there seem to be three cases, as below:
> The first case below was described in 
> https://github.com/apache/spark/pull/15239#discussion_r82445435
> - Check for {{zero-length variable name}}
> Some of the other cases below were handled in 
> https://github.com/apache/spark/pull/15231#discussion_r80417904
> - Catch the exception from the JVM and format it nicely
> - Strictly check types before calling the JVM in SparkR






[jira] [Commented] (SPARK-17838) Strict type checking for arguments with a better messages across APIs.

2016-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15601064#comment-15601064
 ] 

Apache Spark commented on SPARK-17838:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/15608

> Strict type checking for arguments with a better messages across APIs.
> --
>
> Key: SPARK-17838
> URL: https://issues.apache.org/jira/browse/SPARK-17838
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Hyukjin Kwon
>
> It seems there should be more strict type checking for arguments in SparkR 
> APIs. This was discussed in several PRs. 
> https://github.com/apache/spark/pull/15239#discussion_r82445435
> Roughly, there seem to be three cases, as below:
> The first case below was described in 
> https://github.com/apache/spark/pull/15239#discussion_r82445435
> - Check for {{zero-length variable name}}
> Some of the other cases below were handled in 
> https://github.com/apache/spark/pull/15231#discussion_r80417904
> - Catch the exception from the JVM and format it nicely
> - Strictly check types before calling the JVM in SparkR






[jira] [Commented] (SPARK-16137) Random Forest wrapper in SparkR

2016-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15601037#comment-15601037
 ] 

Apache Spark commented on SPARK-16137:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/15607

> Random Forest wrapper in SparkR
> ---
>
> Key: SPARK-16137
> URL: https://issues.apache.org/jira/browse/SPARK-16137
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Kai Jiang
>
> Implement a wrapper in SparkR to support Random Forest.






[jira] [Assigned] (SPARK-18070) binary operator should not consider nullability when comparing input types

2016-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18070:


Assignee: Apache Spark  (was: Wenchen Fan)

> binary operator should not consider nullability when comparing input types
> --
>
> Key: SPARK-18070
> URL: https://issues.apache.org/jira/browse/SPARK-18070
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>
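As a rough illustration of what the title asks for, the sketch below (an assumption for illustration, not the actual patch) compares two Spark SQL data types while ignoring nullability flags, so that, for example, ArrayType(IntegerType, containsNull = true) and ArrayType(IntegerType, containsNull = false) count as the same input type:

{code}
import org.apache.spark.sql.types._

// Illustrative helper only: structural type equality that ignores nullability
// (containsNull, valueContainsNull, and StructField.nullable).
def sameTypeIgnoringNullability(a: DataType, b: DataType): Boolean = (a, b) match {
  case (ArrayType(ae, _), ArrayType(be, _)) =>
    sameTypeIgnoringNullability(ae, be)
  case (MapType(ak, av, _), MapType(bk, bv, _)) =>
    sameTypeIgnoringNullability(ak, bk) && sameTypeIgnoringNullability(av, bv)
  case (StructType(aFields), StructType(bFields)) =>
    aFields.length == bFields.length &&
      aFields.zip(bFields).forall { case (af, bf) =>
        af.name == bf.name && sameTypeIgnoringNullability(af.dataType, bf.dataType)
      }
  case _ => a == b
}

// sameTypeIgnoringNullability(ArrayType(IntegerType, true), ArrayType(IntegerType, false)) == true
{code}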







[jira] [Commented] (SPARK-18070) binary operator should not consider nullability when comparing input types

2016-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15601009#comment-15601009
 ] 

Apache Spark commented on SPARK-18070:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/15606

> binary operator should not consider nullability when comparing input types
> --
>
> Key: SPARK-18070
> URL: https://issues.apache.org/jira/browse/SPARK-18070
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Assigned] (SPARK-18070) binary operator should not consider nullability when comparing input types

2016-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18070:


Assignee: Wenchen Fan  (was: Apache Spark)

> binary operator should not consider nullability when comparing input types
> --
>
> Key: SPARK-18070
> URL: https://issues.apache.org/jira/browse/SPARK-18070
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Created] (SPARK-18070) binary operator should not consider nullability when comparing input types

2016-10-23 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-18070:
---

 Summary: binary operator should not consider nullability when 
comparing input types
 Key: SPARK-18070
 URL: https://issues.apache.org/jira/browse/SPARK-18070
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan
Assignee: Wenchen Fan









[jira] [Closed] (SPARK-12180) DataFrame.join() in PySpark gives misleading exception when column name exists on both side

2016-10-23 Thread Jagadeesan A S (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jagadeesan A S closed SPARK-12180.
--
Resolution: Cannot Reproduce

> DataFrame.join() in PySpark gives misleading exception when column name 
> exists on both side
> ---
>
> Key: SPARK-12180
> URL: https://issues.apache.org/jira/browse/SPARK-12180
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
>Reporter: Daniel Thomas
>
> When joining two DataFrames on a column 'session_uuid' I got the following 
> exception, because both DataFrames had a column called 'at'. The exception is 
> misleading about the cause and about which column causes the problem. Renaming 
> the column fixed the exception.
> ---
> Py4JJavaError Traceback (most recent call last)
> /Applications/spark-1.5.2-bin-hadoop2.4/python/pyspark/sql/utils.py in 
> deco(*a, **kw)
>  35 try:
> ---> 36 return f(*a, **kw)
>  37 except py4j.protocol.Py4JJavaError as e:
> /Applications/spark-1.5.2-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> Py4JJavaError: An error occurred while calling o484.join.
> : org.apache.spark.sql.AnalysisException: resolved attribute(s) 
> session_uuid#3278 missing from 
> uuid_x#9078,total_session_sec#9115L,at#3248,session_uuid#9114,uuid#9117,at#9084
>  in operator !Join Inner, Some((uuid_x#9078 = session_uuid#3278));
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> During handling of the above exception, another exception occurred:
> AnalysisException Traceback (most recent call last)
>  in ()
>   1 sel_starts = starts.select('uuid', 'at').withColumnRenamed('uuid', 
> 'uuid_x')#.withColumnRenamed('at', 'at_x')
>   2 sel_closes = closes.select('uuid', 'at', 'session_uuid', 
> 'total_session_sec')
> > 3 start_close = sel_starts.join(sel_closes, sel_starts['uuid_x'] == 
> sel_closes['session_uuid'])
>   4 start_close.cache()
>   5 start_close.take(1)
> /Applications/spark-1.5.2-bin-hadoop2.4/python/pyspark/sql/dataframe.py in 
> join(self, other, on, how)
> 579 on = on[0]
> 580 if how is None:
> --> 581 jdf = self._jdf.join(other._jdf, on._jc, "inner")
> 582 else:
> 583 assert isinstance(how, basestring), "how should be 
> basestring"
> /Applications/spark-1.5.2-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
> 536 answer = self.gateway_client.send_command(command)
> 537 return_value = get_return_va

[jira] [Commented] (SPARK-15487) Spark Master UI to reverse proxy Application and Workers UI

2016-10-23 Thread Gurvinder (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600938#comment-15600938
 ] 

Gurvinder commented on SPARK-15487:
---

The setting needs to be set for all of the application, master, and worker 
components, as mentioned in the documentation.

"spark.ui.reverseProxy: Enable running Spark Master as reverse proxy for worker 
and application UIs. In this mode, Spark master will reverse proxy the worker 
and application UIs to enable access without requiring direct access to their 
hosts. Use it with caution, as worker and application UI will not be accessible 
directly, you will only be able to access them through spark master/proxy 
public URL. This setting affects all the workers and application UIs running in 
the cluster and must be set on all the workers, drivers and masters."

I set this in spark-defaults.conf on all the components and it works. The other 
parameter, spark.ui.reverseProxyUrl, is required if you are running the master 
itself behind a proxy; in that case its value must be the FQDN of the proxy. Hope 
that helps.
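For reference, a minimal spark-defaults.conf sketch of the two settings mentioned above (the reverse-proxy URL is a placeholder, and is only needed when the master itself sits behind a proxy):

{code}
spark.ui.reverseProxy      true
# Only needed when the master itself is behind a proxy (placeholder URL):
spark.ui.reverseProxyUrl   https://proxy.example.com
{code}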

> Spark Master UI to reverse proxy Application and Workers UI
> ---
>
> Key: SPARK-15487
> URL: https://issues.apache.org/jira/browse/SPARK-15487
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Gurvinder
>Assignee: Gurvinder
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently, when running in standalone mode, the Spark UI's links to workers and 
> application drivers point to internal/protected network endpoints. So to access 
> the worker/application UIs, the user's machine has to connect to a VPN or have 
> direct access to the internal network.
> Therefore the proposal is to make the Spark master UI reverse proxy this 
> information back to the user, so only the Spark master UI needs to be opened up 
> to the internet.
> The minimal change can be done by adding another route, e.g. 
> http://spark-master.com/target// so that when a request goes to target, 
> ProxyServlet kicks in, takes the  and forwards the request to it, 
> and sends the response back to the user.
> More information about the discussion of this feature can be found in this 
> mailing list thread: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-kubernetes-tc17599.html






[jira] [Resolved] (SPARK-12451) Regexp functions don't support patterns containing '*/'

2016-10-23 Thread Jagadeesan A S (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jagadeesan A S resolved SPARK-12451.

Resolution: Duplicate

> Regexp functions don't support patterns containing '*/'
> ---
>
> Key: SPARK-12451
> URL: https://issues.apache.org/jira/browse/SPARK-12451
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: William Dee
>
> When using the regexp functions in Spark SQL, patterns containing '*/' create 
> runtime errors in the auto generated code. This is due to the fact that the 
> code generator creates a multiline comment containing, amongst other things, 
> the pattern.
> Here is an excerpt from my stacktrace to illustrate: (Helpfully, the stack 
> trace includes all of the auto-generated code)
> {code}
> Caused by: org.codehaus.commons.compiler.CompileException: Line 232, Column 
> 54: Unexpected token "," in primary
>   at org.codehaus.janino.Parser.compileException(Parser.java:3125)
>   at org.codehaus.janino.Parser.parsePrimary(Parser.java:2512)
>   at org.codehaus.janino.Parser.parseUnaryExpression(Parser.java:2252)
>   at 
> org.codehaus.janino.Parser.parseMultiplicativeExpression(Parser.java:2211)
>   at org.codehaus.janino.Parser.parseAdditiveExpression(Parser.java:2190)
>   at org.codehaus.janino.Parser.parseShiftExpression(Parser.java:2169)
>   at 
> org.codehaus.janino.Parser.parseRelationalExpression(Parser.java:2072)
>   at org.codehaus.janino.Parser.parseEqualityExpression(Parser.java:2046)
>   at org.codehaus.janino.Parser.parseAndExpression(Parser.java:2025)
>   at 
> org.codehaus.janino.Parser.parseExclusiveOrExpression(Parser.java:2004)
>   at 
> org.codehaus.janino.Parser.parseInclusiveOrExpression(Parser.java:1983)
>   at 
> org.codehaus.janino.Parser.parseConditionalAndExpression(Parser.java:1962)
>   at 
> org.codehaus.janino.Parser.parseConditionalOrExpression(Parser.java:1941)
>   at 
> org.codehaus.janino.Parser.parseConditionalExpression(Parser.java:1922)
>   at 
> org.codehaus.janino.Parser.parseAssignmentExpression(Parser.java:1901)
>   at org.codehaus.janino.Parser.parseExpression(Parser.java:1886)
>   at org.codehaus.janino.Parser.parseBlockStatement(Parser.java:1149)
>   at org.codehaus.janino.Parser.parseBlockStatements(Parser.java:1085)
>   at 
> org.codehaus.janino.Parser.parseMethodDeclarationRest(Parser.java:938)
>   at org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:620)
>   at org.codehaus.janino.Parser.parseClassBody(Parser.java:515)
>   at org.codehaus.janino.Parser.parseClassDeclarationRest(Parser.java:481)
>   at org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:577)
>   at 
> org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:229)
>   at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:192)
>   at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:84)
>   at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:77)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:387)
> ... line 232 ...
> /* regexp_replace(input[46, StringType],^.*/,) */
> 
> /* input[46, StringType] */
> 
> boolean isNull31 = i.isNullAt(46);
> UTF8String primitive32 = isNull31 ? null : (i.getUTF8String(46));
> 
> boolean isNull24 = true;
> UTF8String primitive25 = null;
> if (!isNull31) {
>   /* ^.*/ */
>   
>   /* expression: ^.*/ */
>   Object obj35 = expressions[4].eval(i);
>   boolean isNull33 = obj35 == null;
>   UTF8String primitive34 = null;
>   if (!isNull33) {
> primitive34 = (UTF8String) obj35;
>   }
> ...
> {code}
> Note the multiple multiline comments; these obviously break when the regex 
> pattern contains the end-of-comment token '*/'.
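One illustrative fix direction (an assumption for illustration, not necessarily what was actually done for the duplicate issue): any text the code generator embeds inside a /* ... */ comment needs the end-of-comment token neutralized first, e.g.:

{code}
// Hypothetical helper, for illustration only: escape the end-of-comment token so a
// pattern such as "^.*/" cannot terminate the generated /* ... */ comment early.
def toCommentSafe(text: String): String =
  text.replace("*/", "*\\/")

// toCommentSafe("regexp_replace(input[46, StringType], ^.*/, )")
//   returns "regexp_replace(input[46, StringType], ^.*\/, )"
{code}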






[jira] [Commented] (SPARK-18062) ProbabilisticClassificationModel.normalizeToProbabilitiesInPlace should return probabilities when given all-0 vector

2016-10-23 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600926#comment-15600926
 ] 

Peng Meng commented on SPARK-18062:
---

This relates to how to interpret an all-0 rawPrediction: are all classes 
impossible, or are all classes equally likely? If the latter, 
ProbabilisticClassificationModel.normalizeToProbabilitiesInPlace should return 
a valid probability vector with the uniform distribution.
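A minimal sketch of the proposed behavior, assuming the raw predictions are held in a plain Array[Double] (this is not the actual normalizeToProbabilitiesInPlace implementation): an all-0 input falls back to the uniform distribution instead of staying all-0.

{code}
// Illustration only: normalize raw scores to probabilities, falling back to the
// uniform distribution when every raw score is 0.
def normalizeToProbabilities(raw: Array[Double]): Array[Double] = {
  val sum = raw.sum
  if (sum == 0.0) Array.fill(raw.length)(1.0 / raw.length)
  else raw.map(_ / sum)
}

// normalizeToProbabilities(Array(0.0, 0.0, 0.0)) == Array(1.0/3, 1.0/3, 1.0/3)
{code}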


> ProbabilisticClassificationModel.normalizeToProbabilitiesInPlace should 
> return probabilities when given all-0 vector
> 
>
> Key: SPARK-18062
> URL: https://issues.apache.org/jira/browse/SPARK-18062
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Trivial
>
> {{ProbabilisticClassificationModel.normalizeToProbabilitiesInPlace}} returns 
> a vector of all-0 when given a rawPrediction vector of all-0.  It should 
> return a valid probability vector with the uniform distribution.
> Note: This will be a *behavior change* but it should be very minor and affect 
> few if any users.  But we should note it in release notes.






[jira] [Assigned] (SPARK-18067) SortMergeJoin adds shuffle if join predicates have non partitioned columns

2016-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18067:


Assignee: (was: Apache Spark)

> SortMergeJoin adds shuffle if join predicates have non partitioned columns
> --
>
> Key: SPARK-18067
> URL: https://issues.apache.org/jira/browse/SPARK-18067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Paul Jones
>Priority: Minor
>
> Basic setup
> {code}
> scala> case class Data1(key: String, value1: Int)
> scala> case class Data2(key: String, value2: Int)
> scala> val partition1 = sc.parallelize(1 to 10).map(x => Data1(s"$x", x))
> .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> scala> val partition2 = sc.parallelize(1 to 10).map(x => Data2(s"$x", x))
> .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> {code}
> Join on key
> {code}
> scala> partition1.join(partition2, "key").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [key#0], [key#12]
>:- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>+- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> And we get a super efficient join with no shuffle.
> But if we add a filter our join gets less efficient and we end up with a 
> shuffle.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" === 
> $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [value1#1,key#0], [value2#13,key#12]
>:- Sort [value1#1 ASC,key#0 ASC], false, 0
>:  +- TungstenExchange hashpartitioning(value1#1,key#0,200), None
>: +- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>+- Sort [value2#13 ASC,key#12 ASC], false, 0
>   +- TungstenExchange hashpartitioning(value2#13,key#12,200), None
>  +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> And we can avoid the shuffle if we use a filter statement that can't be pushed 
> into the join.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" >= 
> $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- Filter (value1#1 >= value2#13)
>+- SortMergeJoin [key#0], [key#12]
>   :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>   +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> What's the best way to avoid the filter pushdown here??






[jira] [Commented] (SPARK-18067) SortMergeJoin adds shuffle if join predicates have non partitioned columns

2016-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600923#comment-15600923
 ] 

Apache Spark commented on SPARK-18067:
--

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/15605

> SortMergeJoin adds shuffle if join predicates have non partitioned columns
> --
>
> Key: SPARK-18067
> URL: https://issues.apache.org/jira/browse/SPARK-18067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Paul Jones
>Priority: Minor
>
> Basic setup
> {code}
> scala> case class Data1(key: String, value1: Int)
> scala> case class Data2(key: String, value2: Int)
> scala> val partition1 = sc.parallelize(1 to 10).map(x => Data1(s"$x", x))
> .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> scala> val partition2 = sc.parallelize(1 to 10).map(x => Data2(s"$x", x))
> .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> {code}
> Join on key
> {code}
> scala> partition1.join(partition2, "key").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [key#0], [key#12]
>:- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>+- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> And we get a super efficient join with no shuffle.
> But if we add a filter our join gets less efficient and we end up with a 
> shuffle.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" === 
> $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [value1#1,key#0], [value2#13,key#12]
>:- Sort [value1#1 ASC,key#0 ASC], false, 0
>:  +- TungstenExchange hashpartitioning(value1#1,key#0,200), None
>: +- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>+- Sort [value2#13 ASC,key#12 ASC], false, 0
>   +- TungstenExchange hashpartitioning(value2#13,key#12,200), None
>  +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> And we can avoid the shuffle if we use a filter statement that can't be pushed 
> into the join.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" >= 
> $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- Filter (value1#1 >= value2#13)
>+- SortMergeJoin [key#0], [key#12]
>   :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>   +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> What's the best way to avoid the filter pushdown here??






[jira] [Assigned] (SPARK-18067) SortMergeJoin adds shuffle if join predicates have non partitioned columns

2016-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18067:


Assignee: Apache Spark

> SortMergeJoin adds shuffle if join predicates have non partitioned columns
> --
>
> Key: SPARK-18067
> URL: https://issues.apache.org/jira/browse/SPARK-18067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Paul Jones
>Assignee: Apache Spark
>Priority: Minor
>
> Basic setup
> {code}
> scala> case class Data1(key: String, value1: Int)
> scala> case class Data2(key: String, value2: Int)
> scala> val partition1 = sc.parallelize(1 to 10).map(x => Data1(s"$x", x))
> .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> scala> val partition2 = sc.parallelize(1 to 10).map(x => Data2(s"$x", x))
> .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> {code}
> Join on key
> {code}
> scala> partition1.join(partition2, "key").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [key#0], [key#12]
>:- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>+- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> And we get a super efficient join with no shuffle.
> But if we add a filter our join gets less efficient and we end up with a 
> shuffle.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" === 
> $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [value1#1,key#0], [value2#13,key#12]
>:- Sort [value1#1 ASC,key#0 ASC], false, 0
>:  +- TungstenExchange hashpartitioning(value1#1,key#0,200), None
>: +- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>+- Sort [value2#13 ASC,key#12 ASC], false, 0
>   +- TungstenExchange hashpartitioning(value2#13,key#12,200), None
>  +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> And we can avoid the shuffle if we use a filter statement that can't be pushed 
> into the join.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" >= 
> $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- Filter (value1#1 >= value2#13)
>+- SortMergeJoin [key#0], [key#12]
>   :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>   +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> What's the best way to avoid the filter pushdown here??






[jira] [Updated] (SPARK-18067) SortMergeJoin adds shuffle if join predicates have non partitioned columns

2016-10-23 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated SPARK-18067:

Summary: SortMergeJoin adds shuffle if join predicates have non partitioned 
columns  (was: Adding filter after SortMergeJoin creates unnecessary shuffle)

> SortMergeJoin adds shuffle if join predicates have non partitioned columns
> --
>
> Key: SPARK-18067
> URL: https://issues.apache.org/jira/browse/SPARK-18067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Paul Jones
>Priority: Minor
>
> Basic setup
> {code}
> scala> case class Data1(key: String, value1: Int)
> scala> case class Data2(key: String, value2: Int)
> scala> val partition1 = sc.parallelize(1 to 10).map(x => Data1(s"$x", x))
> .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> scala> val partition2 = sc.parallelize(1 to 10).map(x => Data2(s"$x", x))
> .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> {code}
> Join on key
> {code}
> scala> partition1.join(partition2, "key").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [key#0], [key#12]
>:- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>+- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> And we get a super efficient join with no shuffle.
> But if we add a filter our join gets less efficient and we end up with a 
> shuffle.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" === 
> $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [value1#1,key#0], [value2#13,key#12]
>:- Sort [value1#1 ASC,key#0 ASC], false, 0
>:  +- TungstenExchange hashpartitioning(value1#1,key#0,200), None
>: +- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>+- Sort [value2#13 ASC,key#12 ASC], false, 0
>   +- TungstenExchange hashpartitioning(value2#13,key#12,200), None
>  +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> And we can avoid the shuffle if we use a filter statement that can't be pushed 
> into the join.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" >= 
> $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- Filter (value1#1 >= value2#13)
>+- SortMergeJoin [key#0], [key#12]
>   :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>   +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> What's the best way to avoid the filter pushdown here??






[jira] [Comment Edited] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-23 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600728#comment-15600728
 ] 

Liwei Lin edited comment on SPARK-16845 at 10/24/16 3:52 AM:
-

Hi [~dondrake] the other solution is still under discussion & in progress. It'd 
be super helpful if you could create and provide the "explode-union-parquet" 
reproducer which causes the fix to fail. Thank you!


was (Author: lwlin):
The other solution is still under discussion & in progress. It'd be super 
helpful if you could create and provide the "explode-union-parquet" reproducer 
which causes the fix to fail. Thank you!

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table (400 columns); when I try fitting the training data on all 
> columns, the fatal error below occurs.
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)






[jira] [Assigned] (SPARK-18069) Many examples in Python docstrings are incomplete

2016-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18069:


Assignee: Apache Spark

> Many examples in Python docstrings are incomplete
> -
>
> Key: SPARK-18069
> URL: https://issues.apache.org/jira/browse/SPARK-18069
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.0.1
>Reporter: Mortada Mehyar
>Assignee: Apache Spark
>Priority: Minor
>
> A lot of the python API functions show example usage that is incomplete. The 
> docstring shows output without having the input DataFrame defined. It can be 
> quite confusing trying to understand and/or follow the example.
> For instance, the docstring for `DataFrame.dtypes()` is currently
> {code}
>  def dtypes(self):
>  """Returns all column names and their data types as a list.
>  
>  >>> df.dtypes
>  [('age', 'int'), ('name', 'string')]
>  """
> {code}
> when it should really be
> {code}
>  def dtypes(self):
>  """Returns all column names and their data types as a list.
>  
>  >>> df = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 
> 'age'])
>  >>> df.dtypes
>  [('age', 'int'), ('name', 'string')]
>  """
> {code}
> I have a pending PR for fixing many of these occurrences here: 
> https://github.com/apache/spark/pull/15053 






[jira] [Commented] (SPARK-18069) Many examples in Python docstrings are incomplete

2016-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600830#comment-15600830
 ] 

Apache Spark commented on SPARK-18069:
--

User 'mortada' has created a pull request for this issue:
https://github.com/apache/spark/pull/15053

> Many examples in Python docstrings are incomplete
> -
>
> Key: SPARK-18069
> URL: https://issues.apache.org/jira/browse/SPARK-18069
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.0.1
>Reporter: Mortada Mehyar
>Priority: Minor
>
> A lot of the python API functions show example usage that is incomplete. The 
> docstring shows output without having the input DataFrame defined. It can be 
> quite confusing trying to understand and/or follow the example.
> For instance, the docstring for `DataFrame.dtypes()` is currently
> {code}
>  def dtypes(self):
>  """Returns all column names and their data types as a list.
>  
>  >>> df.dtypes
>  [('age', 'int'), ('name', 'string')]
>  """
> {code}
> when it should really be
> {code}
>  def dtypes(self):
>  """Returns all column names and their data types as a list.
>  
>  >>> df = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 
> 'age'])
>  >>> df.dtypes
>  [('age', 'int'), ('name', 'string')]
>  """
> {code}
> I have a pending PR for fixing many of these occurrences here: 
> https://github.com/apache/spark/pull/15053 






[jira] [Assigned] (SPARK-18069) Many examples in Python docstrings are incomplete

2016-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18069:


Assignee: (was: Apache Spark)

> Many examples in Python docstrings are incomplete
> -
>
> Key: SPARK-18069
> URL: https://issues.apache.org/jira/browse/SPARK-18069
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.0.1
>Reporter: Mortada Mehyar
>Priority: Minor
>
> A lot of the python API functions show example usage that is incomplete. The 
> docstring shows output without having the input DataFrame defined. It can be 
> quite confusing trying to understand and/or follow the example.
> For instance, the docstring for `DataFrame.dtypes()` is currently
> {code}
>  def dtypes(self):
>  """Returns all column names and their data types as a list.
>  
>  >>> df.dtypes
>  [('age', 'int'), ('name', 'string')]
>  """
> {code}
> when it should really be
> {code}
>  def dtypes(self):
>  """Returns all column names and their data types as a list.
>  
>  >>> df = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 
> 'age'])
>  >>> df.dtypes
>  [('age', 'int'), ('name', 'string')]
>  """
> {code}
> I have a pending PR for fixing many of these occurrences here: 
> https://github.com/apache/spark/pull/15053 






[jira] [Created] (SPARK-18069) Many examples in Python docstrings are incomplete

2016-10-23 Thread Mortada Mehyar (JIRA)
Mortada Mehyar created SPARK-18069:
--

 Summary: Many examples in Python docstrings are incomplete
 Key: SPARK-18069
 URL: https://issues.apache.org/jira/browse/SPARK-18069
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.0.1
Reporter: Mortada Mehyar
Priority: Minor



A lot of the python API functions show example usage that is incomplete. The 
docstring shows output without having the input DataFrame defined. It can be 
quite confusing trying to understand and/or follow the example.

For instance, the docstring for `DataFrame.dtypes()` is currently


{code}
 def dtypes(self):
 """Returns all column names and their data types as a list.
 
 >>> df.dtypes
 [('age', 'int'), ('name', 'string')]
 """
{code}

when it should really be
{code}
 def dtypes(self):
 """Returns all column names and their data types as a list.
 
 >>> df = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 
'age'])
 >>> df.dtypes
 [('age', 'int'), ('name', 'string')]
 """
{code}

I have a pending PR for fixing many of these occurrences here: 
https://github.com/apache/spark/pull/15053 






[jira] [Comment Edited] (SPARK-18067) Adding filter after SortMergeJoin creates unnecessary shuffle

2016-10-23 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600824#comment-15600824
 ] 

Tejas Patil edited comment on SPARK-18067 at 10/24/16 3:35 AM:
---

[~hvanhovell] : Tagging you since you have context of this portion of the code.

I feel that createPartitioning()'s use of `HashPartitioning` is hurting here. 
`HashPartitioning` has stricter semantics, and it feels like we could use 
something else.

More explanation:
Both children are hash partitioned on `key`. Assume these are the partitions 
for the children:

||partitions||child 1||child 2||
|partition 0|[0, 0, 0, 3]|[0, 0, 3, 3]|
|partition 1|[1, 4, 4]|[4]|
|partition 2|[2, 2]|[2, 5, 5, 5]|

Since we have __all__ the same values of `key` in a given partition, we can 
evaluate other join predicates like (`value1` = `value2`) right there without 
needing any shuffle. 

What is currently being done, i.e. `HashPartitioning(key, value)`, expects rows 
with the same value of `pmod(hash(key, value))` to be in the same partition, 
and does not take advantage of the fact that rows with the same `key` are 
already packed together.


was (Author: tejasp):
[~hvanhovell] : Tagging you since you have context of this portion of the code.

I feel that createPartitioning()'s use of `HashPartitioning` is hurting here. 
`HashPartitioning` has stricter semantics and feels like we could have 
something else.

More explanation:
Both children are hash partitioned on `key`. Assume these are the partitions 
for the children:

||partitions||child 1||child 2||
|partition 0|[0, 0, 0, 3]|[0, 0, 3, 3]|
|partition 1|[1, 4, 4]|[4]|
|partition 2|[2, 2]|[2, 5, 5, 5]|

If we have *all* the same values of `key` in a given partition, then we can 
evaluate other join predicates like (`value1` = `value2`) right there without 
needing a shuffle. 

What is currently being done i.e. `HashPartitioning(key, value)` expects rows 
with same value of `pmod( hash(key, value))` to be in the same partition and 
does not take advantage of the fact that we already have rows with same `key` 
packed together.

> Adding filter after SortMergeJoin creates unnecessary shuffle
> -
>
> Key: SPARK-18067
> URL: https://issues.apache.org/jira/browse/SPARK-18067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Paul Jones
>Priority: Minor
>
> Basic setup
> {code}
> scala> case class Data1(key: String, value1: Int)
> scala> case class Data2(key: String, value2: Int)
> scala> val partition1 = sc.parallelize(1 to 10).map(x => Data1(s"$x", x))
> .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> scala> val partition2 = sc.parallelize(1 to 10).map(x => Data2(s"$x", x))
> .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> {code}
> Join on key
> {code}
> scala> partition1.join(partition2, "key").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [key#0], [key#12]
>:- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>+- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> And we get a super efficient join with no shuffle.
> But if we add a filter our join gets less efficient and we end up with a 
> shuffle.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" === 
> $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [value1#1,key#0], [value2#13,key#12]
>:- Sort [value1#1 ASC,key#0 ASC], false, 0
>:  +- TungstenExchange hashpartitioning(value1#1,key#0,200), None
>: +- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>+- Sort [value2#13 ASC,key#12 ASC], false, 0
>   +- TungstenExchange hashpartitioning(value2#13,key#12,200), None
>  +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> And we can avoid the shuffle if we use a filter statement that can't be 
> pushed into the join.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" >= 
> $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- Filter (value1#1 >= value2#13)
>+- SortMergeJoin [key#0], [key#12]
>   :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], tr

[jira] [Commented] (SPARK-18067) Adding filter after SortMergeJoin creates unnecessary shuffle

2016-10-23 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600824#comment-15600824
 ] 

Tejas Patil commented on SPARK-18067:
-

[~hvanhovell] : Tagging you since you have context of this portion of the code.

I feel that createPartitioning()'s use of `HashPartitioning` is what hurts here: 
`HashPartitioning` has stricter semantics than we need, and it feels like we 
could use something less strict.

More explanation:
Both children are hash partitioned on `key`. Assume these are the partitions 
for the children:

||partitions||child 1||child 2||
|partition 0|[0, 0, 0, 3]|[0, 0, 3, 3]|
|partition 1|[1, 4, 4]|[4]|
|partition 2|[2, 2]|[2, 5, 5, 5]|

If *all* rows with a given value of `key` are in the same partition, then we can 
evaluate other join predicates like (`value1` = `value2`) right there without 
needing a shuffle. 

What is currently being done, i.e. `HashPartitioning(key, value)`, expects rows 
with the same value of `pmod(hash(key, value))` to be in the same partition, 
and does not take advantage of the fact that rows with the same `key` are 
already packed together.
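
To make this concrete, here is a minimal sketch (not Spark internals; the hash 
function below is just a stand-in for Spark's Murmur3-based hashing) showing 
that a key-only hash partitioning already co-locates matching rows from both 
children, so the extra predicate could be evaluated per partition:

{code}
// Minimal illustration, not Spark's internal code.
val numPartitions = 200   // matches the plans above

def partitionId(key: String): Int = {
  val h = key.hashCode % numPartitions   // stand-in hash
  if (h < 0) h + numPartitions else h    // pmod semantics
}

// (key, value) rows from the two children:
val child1 = Seq(("4", 4), ("1", 1))
val child2 = Seq(("4", 40))

// The partition id depends only on `key`, so the "4" rows from both children land
// in the same partition, where `value1` = `value2` could be checked directly.
assert(partitionId(child1.head._1) == partitionId(child2.head._1))
{code}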

> Adding filter after SortMergeJoin creates unnecessary shuffle
> -
>
> Key: SPARK-18067
> URL: https://issues.apache.org/jira/browse/SPARK-18067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Paul Jones
>Priority: Minor
>
> Basic setup
> {code}
> scala> case class Data1(key: String, value1: Int)
> scala> case class Data2(key: String, value2: Int)
> scala> val partition1 = sc.parallelize(1 to 10).map(x => Data1(s"$x", x))
> .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> scala> val partition2 = sc.parallelize(1 to 10).map(x => Data2(s"$x", x))
> .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> {code}
> Join on key
> {code}
> scala> partition1.join(partition2, "key").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [key#0], [key#12]
>:- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>+- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> And we get a super efficient join with no shuffle.
> But if we add a filter our join gets less efficient and we end up with a 
> shuffle.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" === 
> $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [value1#1,key#0], [value2#13,key#12]
>:- Sort [value1#1 ASC,key#0 ASC], false, 0
>:  +- TungstenExchange hashpartitioning(value1#1,key#0,200), None
>: +- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>+- Sort [value2#13 ASC,key#12 ASC], false, 0
>   +- TungstenExchange hashpartitioning(value2#13,key#12,200), None
>  +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> And we can avoid the shuffle if we use a filter statement that can't be 
> pushed into the join.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" >= 
> $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- Filter (value1#1 >= value2#13)
>+- SortMergeJoin [key#0], [key#12]
>   :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>   +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> What's the best way to avoid the filter pushdown here??



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18067) Adding filter after SortMergeJoin creates unnecessary shuffle

2016-10-23 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600820#comment-15600820
 ] 

Tejas Patil commented on SPARK-18067:
-

The predicate `value1 === value2` is pushed down to the Join operator, which 
makes sense. However, the optimizer could later have avoided the shuffle.

Here is what's happening: 
https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L185

{noformat}
val useExistingPartitioning = children.zip(requiredChildDistributions).forall {
  case (child, distribution) =>
child.outputPartitioning.guarantees(
  createPartitioning(distribution, maxChildrenNumPartitions))
}
{noformat}

The child's `outputPartitioning` is `HashPartitioning(expressions = \[key\], 
numPartitions = 200)`, while the partitioning returned by `createPartitioning()` 
is `HashPartitioning(expressions = \[value, key\], numPartitions = 200)`.

Since they don't match, an extra shuffle is added.
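
A simplified sketch of why the check fails (stand-in case classes, not Spark's 
actual `Partitioning` hierarchy; the real `guarantees` logic is more involved):

{code}
// Stand-in model of the comparison done by EnsureRequirements; not Spark's classes.
case class HashPartitioning(expressions: Seq[String], numPartitions: Int) {
  // In this toy model a partitioning only guarantees another one with the same
  // expressions and the same number of partitions.
  def guarantees(required: HashPartitioning): Boolean =
    expressions == required.expressions && numPartitions == required.numPartitions
}

val childOutput = HashPartitioning(Seq("key"), 200)          // what the child provides
val required    = HashPartitioning(Seq("value", "key"), 200) // what createPartitioning() asks for

childOutput.guarantees(required)   // false, so an extra Exchange (shuffle) is inserted
{code}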








> Adding filter after SortMergeJoin creates unnecessary shuffle
> -
>
> Key: SPARK-18067
> URL: https://issues.apache.org/jira/browse/SPARK-18067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Paul Jones
>Priority: Minor
>
> Basic setup
> {code}
> scala> case class Data1(key: String, value1: Int)
> scala> case class Data2(key: String, value2: Int)
> scala> val partition1 = sc.parallelize(1 to 10).map(x => Data1(s"$x", x))
> .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> scala> val partition2 = sc.parallelize(1 to 10).map(x => Data2(s"$x", x))
> .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> {code}
> Join on key
> {code}
> scala> partition1.join(partition2, "key").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [key#0], [key#12]
>:- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>+- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> And we get a super efficient join with no shuffle.
> But if we add a filter our join gets less efficient and we end up with a 
> shuffle.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" === 
> $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [value1#1,key#0], [value2#13,key#12]
>:- Sort [value1#1 ASC,key#0 ASC], false, 0
>:  +- TungstenExchange hashpartitioning(value1#1,key#0,200), None
>: +- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>+- Sort [value2#13 ASC,key#12 ASC], false, 0
>   +- TungstenExchange hashpartitioning(value2#13,key#12,200), None
>  +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> And we can avoid the shuffle if we use a filter statement that can't be 
> pushed into the join.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" >= 
> $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- Filter (value1#1 >= value2#13)
>+- SortMergeJoin [key#0], [key#12]
>   :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>   +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> What's the best way to avoid the filter pushdown here??



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18067) Adding filter after SortMergeJoin creates unnecessary shuffle

2016-10-23 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600788#comment-15600788
 ] 

Tejas Patil edited comment on SPARK-18067 at 10/24/16 3:12 AM:
---

You could do :

{noformat}
val joinedOutput = partition1.join(partition2, "key").cache
joinedOutput.filter($"value1" >= $"value2").collect
{noformat}


was (Author: tejasp):
You could do :

{norformat}
val joinedOutput = partition1.join(partition2, "key").cache
joinedOutput.filter($"value1" >= $"value2").collect
{norformat}

> Adding filter after SortMergeJoin creates unnecessary shuffle
> -
>
> Key: SPARK-18067
> URL: https://issues.apache.org/jira/browse/SPARK-18067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Paul Jones
>Priority: Minor
>
> Basic setup
> {code}
> scala> case class Data1(key: String, value1: Int)
> scala> case class Data2(key: String, value2: Int)
> scala> val partition1 = sc.parallelize(1 to 10).map(x => Data1(s"$x", x))
> .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> scala> val partition2 = sc.parallelize(1 to 10).map(x => Data2(s"$x", x))
> .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> {code}
> Join on key
> {code}
> scala> partition1.join(partition2, "key").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [key#0], [key#12]
>:- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>+- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> And we get a super efficient join with no shuffle.
> But if we add a filter our join gets less efficient and we end up with a 
> shuffle.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" === 
> $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [value1#1,key#0], [value2#13,key#12]
>:- Sort [value1#1 ASC,key#0 ASC], false, 0
>:  +- TungstenExchange hashpartitioning(value1#1,key#0,200), None
>: +- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>+- Sort [value2#13 ASC,key#12 ASC], false, 0
>   +- TungstenExchange hashpartitioning(value2#13,key#12,200), None
>  +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> And we can avoid the shuffle if we use a filter statement that can't be 
> pushed into the join.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" >= 
> $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- Filter (value1#1 >= value2#13)
>+- SortMergeJoin [key#0], [key#12]
>   :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>   +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> What's the best way to avoid the filter pushdown here??



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18067) Adding filter after SortMergeJoin creates unnecessary shuffle

2016-10-23 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600788#comment-15600788
 ] 

Tejas Patil commented on SPARK-18067:
-

You could do :

{noformat}
val joinedOutput = partition1.join(partition2, "key").cache
joinedOutput.filter($"value1" >= $"value2").collect
{noformat}

> Adding filter after SortMergeJoin creates unnecessary shuffle
> -
>
> Key: SPARK-18067
> URL: https://issues.apache.org/jira/browse/SPARK-18067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Paul Jones
>Priority: Minor
>
> Basic setup
> {code}
> scala> case class Data1(key: String, value1: Int)
> scala> case class Data2(key: String, value2: Int)
> scala> val partition1 = sc.parallelize(1 to 10).map(x => Data1(s"$x", x))
> .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> scala> val partition2 = sc.parallelize(1 to 10).map(x => Data2(s"$x", x))
> .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> {code}
> Join on key
> {code}
> scala> partition1.join(partition2, "key").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [key#0], [key#12]
>:- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>+- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> And we get a super efficient join with no shuffle.
> But if we add a filter our join gets less efficient and we end up with a 
> shuffle.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" === 
> $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [value1#1,key#0], [value2#13,key#12]
>:- Sort [value1#1 ASC,key#0 ASC], false, 0
>:  +- TungstenExchange hashpartitioning(value1#1,key#0,200), None
>: +- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>+- Sort [value2#13 ASC,key#12 ASC], false, 0
>   +- TungstenExchange hashpartitioning(value2#13,key#12,200), None
>  +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> And we can avoid the shuffle if we use a filter statement that can't be 
> pushed into the join.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" >= 
> $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- Filter (value1#1 >= value2#13)
>+- SortMergeJoin [key#0], [key#12]
>   :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>   +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> What's the best way to avoid the filter pushdown here??



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates

2016-10-23 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600732#comment-15600732
 ] 

Stephane Maarek commented on SPARK-18068:
-

I see that TimestampType is a wrapper for java.sql.Timestamp.

It seems that it can't parse a string without seconds.

{code}
scala> import java.sql.Timestamp
import java.sql.Timestamp

scala> Timestamp.valueOf("2016-10-07T11:15Z")
java.lang.IllegalArgumentException: Timestamp format must be -mm-dd 
hh:mm:ss[.f]
  at java.sql.Timestamp.valueOf(Timestamp.java:204)
  ... 32 elided
{code}

A workaround would be to first parse the string using the Java 8 time API and 
then convert the result to a java.sql.Timestamp.
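
For example, something along these lines should work (just a sketch, assuming a 
Java 8+ runtime with java.time available):

{code}
// Parse the ISO 8601 string (seconds omitted) with java.time, then convert it
// to a java.sql.Timestamp.
import java.sql.Timestamp
import java.time.OffsetDateTime

val parsed = OffsetDateTime.parse("2016-10-07T11:15Z")   // accepts the missing seconds
val ts = Timestamp.from(parsed.toInstant)
{code}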

> Spark SQL doesn't parse some ISO 8601 formatted dates
> -
>
> Key: SPARK-18068
> URL: https://issues.apache.org/jira/browse/SPARK-18068
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Stephane Maarek
>
> The following fail, but shouldn't according to the ISO 8601 standard (seconds 
> can be omitted). Not sure where the issue lies (probably an external library?)
> {code}
> scala> sc.parallelize(Seq("2016-10-07T11:15Z"))
> res1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at 
> parallelize at :25
> scala> res1.toDF
> res2: org.apache.spark.sql.DataFrame = [value: string]
> scala> res2.select("value").show()
> +-+
> |value|
> +-+
> |2016-10-07T11:15Z|
> +-+
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> res2.select(col("value").cast(TimestampType)).show()
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}
> And the schema usage errors out right away:
> {code}
> scala> val jsonRDD = sc.parallelize(Seq("""{"tst":"2016-10-07T11:15Z"}"""))
> jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at 
> parallelize at :33
> scala> val schema = StructType(StructField("tst",TimestampType,true)::Nil)
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(tst,TimestampType,true))
> scala> val df = spark.read.schema(schema).json(jsonRDD)
> df: org.apache.spark.sql.DataFrame = [tst: timestamp]
> scala> df.show()
> 16/10/24 13:06:27 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23)
> java.lang.IllegalArgumentException: 2016-10-07T11:15Z
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown
>  Source)
>   at 
> javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
>   at 
> javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
>   at 
> javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:114)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertObject(JacksonParser.scala:215)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:182)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertRootField(JacksonParser.scala:73)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:288)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:285)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2366)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:285)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:280)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.sp

[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-23 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600728#comment-15600728
 ] 

Liwei Lin commented on SPARK-16845:
---

The other solution is still under discussion & in progress. It'd be super 
helpful if you could create and provide the "explode-union-parquet" reproducer 
which causes the fix to fail. Thank you!

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table (400 columns). When I try fitting the training data on all 
> columns, the following fatal error occurs:
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates

2016-10-23 Thread Stephane Maarek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephane Maarek updated SPARK-18068:

Priority: Major  (was: Critical)

> Spark SQL doesn't parse some ISO 8601 formatted dates
> -
>
> Key: SPARK-18068
> URL: https://issues.apache.org/jira/browse/SPARK-18068
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Stephane Maarek
>
> The following fail, but shouldn't according to the ISO 8601 standard (seconds 
> can be omitted). Not sure where the issue lies (probably an external library?)
> {code}
> scala> sc.parallelize(Seq("2016-10-07T11:15Z"))
> res1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at 
> parallelize at :25
> scala> res1.toDF
> res2: org.apache.spark.sql.DataFrame = [value: string]
> scala> res2.select("value").show()
> +-+
> |value|
> +-+
> |2016-10-07T11:15Z|
> +-+
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> res2.select(col("value").cast(TimestampType)).show()
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}
> And the schema usage errors out right away:
> {code}
> scala> val jsonRDD = sc.parallelize(Seq("""{"tst":"2016-10-07T11:15Z"}"""))
> jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at 
> parallelize at :33
> scala> val schema = StructType(StructField("tst",TimestampType,true)::Nil)
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(tst,TimestampType,true))
> scala> val df = spark.read.schema(schema).json(jsonRDD)
> df: org.apache.spark.sql.DataFrame = [tst: timestamp]
> scala> df.show()
> 16/10/24 13:06:27 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23)
> java.lang.IllegalArgumentException: 2016-10-07T11:15Z
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown
>  Source)
>   at 
> javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
>   at 
> javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
>   at 
> javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:114)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertObject(JacksonParser.scala:215)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:182)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertRootField(JacksonParser.scala:73)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:288)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:285)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2366)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:285)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:280)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java

[jira] [Commented] (SPARK-15487) Spark Master UI to reverse proxy Application and Workers UI

2016-10-23 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600710#comment-15600710
 ] 

Matthew Farrellee commented on SPARK-15487:
---

Try just setting the proxy URL to "/".

> Spark Master UI to reverse proxy Application and Workers UI
> ---
>
> Key: SPARK-15487
> URL: https://issues.apache.org/jira/browse/SPARK-15487
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Gurvinder
>Assignee: Gurvinder
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently when running in Standalone mode, Spark UI's link to workers and 
> application drivers are pointing to internal/protected network endpoints. So 
> to access workers/application UI user's machine has to connect to VPN or need 
> to have access to internal network directly.
> Therefore the proposal is to make Spark master UI reverse proxy this 
> information back to the user. So only Spark master UI needs to be opened up 
> to internet. 
> The minimal changes can be done by adding another route e.g. 
> http://spark-master.com/target// so when request goes to target, 
> ProxyServlet kicks in and takes the  and forwards the request to it 
> and send response back to user.
> More information about discussions for this features can be found on this 
> mailing list thread 
> http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-kubernetes-tc17599.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates

2016-10-23 Thread Stephane Maarek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephane Maarek updated SPARK-18068:

Description: 
The following fail, but shouldn't according to the ISO 8601 standard (seconds 
can be omitted). Not sure where the issue lies (probably an external library?)

{code}
scala> sc.parallelize(Seq("2016-10-07T11:15Z"))
res1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at 
parallelize at :25

scala> res1.toDF
res2: org.apache.spark.sql.DataFrame = [value: string]

scala> res2.select("value").show()
+-+
|value|
+-+
|2016-10-07T11:15Z|
+-+

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> res2.select(col("value").cast(TimestampType)).show()
+-+
|value|
+-+
| null|
+-+
{code}


And the schema usage errors out right away:

{code}
scala> val jsonRDD = sc.parallelize(Seq("""{"tst":"2016-10-07T11:15Z"}"""))
jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at 
parallelize at :33

scala> val schema = StructType(StructField("tst",TimestampType,true)::Nil)
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(tst,TimestampType,true))

scala> val df = spark.read.schema(schema).json(jsonRDD)
df: org.apache.spark.sql.DataFrame = [tst: timestamp]

scala> df.show()
16/10/24 13:06:27 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23)
java.lang.IllegalArgumentException: 2016-10-07T11:15Z
at 
org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
Source)
at 
org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown 
Source)
at 
org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown Source)
at 
org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown
 Source)
at 
javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
at 
javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
at 
javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
at 
org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:114)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertObject(JacksonParser.scala:215)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:182)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertRootField(JacksonParser.scala:73)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:288)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:285)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2366)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:285)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:280)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/10/24 13:06:27 WARN TaskSetManager: Lost task 6.0 in stage 5.0 (TID 23, 
localhost): java.lang.IllegalArgumentException: 2016-10-07T11:15Z
at 
org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
Source)
at 
org.apache.xerces.jaxp.datatype.XMLGrego

[jira] [Commented] (SPARK-13331) AES support for over-the-wire encryption

2016-10-23 Thread Junjie Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600705#comment-15600705
 ] 

Junjie Chen commented on SPARK-13331:
-

Hi [~vanzin]
I have updated the patch with the latest changes. Could you please help review 
it?

Due to an issue (CRYPTO-125) in Commons Crypto, the patch has to use two helper 
channels. Once that issue is fixed and released, I will remove these channels. 

> AES support for over-the-wire encryption
> 
>
> Key: SPARK-13331
> URL: https://issues.apache.org/jira/browse/SPARK-13331
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Dong Chen
>Priority: Minor
>
> In network/common, SASL with DIGEST­-MD5 authentication is used for 
> negotiating a secure communication channel. When SASL operation mode is 
> "auth­-conf", the data transferred on the network is encrypted. DIGEST-MD5 
> mechanism supports following encryption: 3DES, DES, and RC4. The negotiation 
> procedure will select one of them to encrypt / decrypt the data on the 
> channel.
> However, 3DES and RC4 are relatively slow. We could add code to the 
> negotiation so that it also supports AES, for better security and performance.
> The proposed solution is:
> When "auth-conf" is enabled, at the end of the original negotiation the 
> authentication succeeds and a secure channel is built. We could add one more 
> negotiation step: the client and server negotiate whether they both support 
> AES. If yes, the key and IV used by AES are generated by the server and sent 
> to the client through the already-secure channel. Then the encryption / 
> decryption handlers are switched to AES on both the client and server side. 
> Subsequent data transfer uses AES instead of the original encryption algorithm.
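
As a rough illustration of the proposal (a sketch using javax.crypto only; the 
cipher mode, key size, and negotiation details are assumptions, not the actual 
patch):

{code}
// Illustrative only: generate an AES key and IV on the server side and set up
// matching CTR-mode ciphers, as the proposal describes.
import java.security.SecureRandom
import javax.crypto.{Cipher, KeyGenerator}
import javax.crypto.spec.IvParameterSpec

val keyGen = KeyGenerator.getInstance("AES")
keyGen.init(128)                 // 128-bit key; the actual size would be configurable
val key = keyGen.generateKey()

val iv = new Array[Byte](16)     // 16-byte IV for AES/CTR
new SecureRandom().nextBytes(iv)

// After the key and IV are shared over the already-secure SASL channel, both sides
// would switch to AES handlers like these for subsequent traffic:
val encrypt = Cipher.getInstance("AES/CTR/NoPadding")
encrypt.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv))
val decrypt = Cipher.getInstance("AES/CTR/NoPadding")
decrypt.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv))

val roundTrip =
  new String(decrypt.doFinal(encrypt.doFinal("ping".getBytes("UTF-8"))), "UTF-8")
// roundTrip == "ping"
{code}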



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates

2016-10-23 Thread Stephane Maarek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephane Maarek updated SPARK-18068:

Description: 
The following fail, but shouldn't according to the ISO 8601 standard (seconds 
can be omitted). Not sure where the issue lies (probably an external library?)

{code:scala}
scala> sc.parallelize(Seq("2016-10-07T11:15Z"))
res1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at 
parallelize at :25

scala> res1.toDF
res2: org.apache.spark.sql.DataFrame = [value: string]

scala> res2.select("value").show()
+-+
|value|
+-+
|2016-10-07T11:15Z|
+-+

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> res2.select(col("value").cast(TimestampType)).show()
+-+
|value|
+-+
| null|
+-+
{code}


And the schema usage errors out right away:

{code:scala}
scala> val jsonRDD = sc.parallelize(Seq("""{"tst":"2016-10-07T11:15Z"}"""))
jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at 
parallelize at :33

scala> val schema = StructType(StructField("tst",TimestampType,true)::Nil)
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(tst,TimestampType,true))

scala> val df = spark.read.schema(schema).json(jsonRDD)
df: org.apache.spark.sql.DataFrame = [tst: timestamp]

scala> df.show()
16/10/24 13:06:27 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23)
java.lang.IllegalArgumentException: 2016-10-07T11:15Z
at 
org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
Source)
at 
org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown 
Source)
at 
org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown Source)
at 
org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown
 Source)
at 
javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
at 
javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
at 
javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
at 
org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:114)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertObject(JacksonParser.scala:215)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:182)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertRootField(JacksonParser.scala:73)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:288)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:285)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2366)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:285)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:280)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/10/24 13:06:27 WARN TaskSetManager: Lost task 6.0 in stage 5.0 (TID 23, 
localhost): java.lang.IllegalArgumentException: 2016-10-07T11:15Z
at 
org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
Source)
at 
org.apache.xerces.jaxp.datat

[jira] [Created] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates

2016-10-23 Thread Stephane Maarek (JIRA)
Stephane Maarek created SPARK-18068:
---

 Summary: Spark SQL doesn't parse some ISO 8601 formatted dates
 Key: SPARK-18068
 URL: https://issues.apache.org/jira/browse/SPARK-18068
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1
Reporter: Stephane Maarek
Priority: Critical


The following fail, but shouldn't according to the ISO 8601 standard (seconds 
can be omitted). Not sure where the issue lies (probably an external library?)

```
scala> sc.parallelize(Seq("2016-10-07T11:15Z"))
res1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at 
parallelize at :25

scala> res1.toDF
res2: org.apache.spark.sql.DataFrame = [value: string]

scala> res2.select("value").show()
+-+
|value|
+-+
|2016-10-07T11:15Z|
+-+

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> res2.select(col("value").cast(TimestampType)).show()
+-+
|value|
+-+
| null|
+-+
```

And the schema errors out right away:

```
scala> val jsonRDD = sc.parallelize(Seq("""{"tst":"2016-10-07T11:15Z"}"""))
jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at 
parallelize at :33

scala> val schema = StructType(StructField("tst",TimestampType,true)::Nil)
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(tst,TimestampType,true))

scala> val df = spark.read.schema(schema).json(jsonRDD)
df: org.apache.spark.sql.DataFrame = [tst: timestamp]

scala> df.show()
16/10/24 13:06:27 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23)
java.lang.IllegalArgumentException: 2016-10-07T11:15Z
at 
org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
Source)
at 
org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown 
Source)
at 
org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown Source)
at 
org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown
 Source)
at 
javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
at 
javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
at 
javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
at 
org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:114)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertObject(JacksonParser.scala:215)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:182)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertRootField(JacksonParser.scala:73)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:288)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:285)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2366)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:285)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:280)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/10/24 13:06:27 WARN TaskSetManager: Lost task 6.0 in stage 5.0 (TID 23, 
localhost): ja

[jira] [Commented] (SPARK-18064) Spark SQL can't load default config file

2016-10-23 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600483#comment-15600483
 ] 

Hyukjin Kwon commented on SPARK-18064:
--

Could you please fill in the JIRA description? I would like to reproduce this, 
so it would be great if you could provide some steps to check it.

> Spark SQL can't load default config file 
> -
>
> Key: SPARK-18064
> URL: https://issues.apache.org/jira/browse/SPARK-18064
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: darion yaphet
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18067) Adding filter after SortMergeJoin creates unnecessary shuffle

2016-10-23 Thread Paul Jones (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Jones updated SPARK-18067:
---
Description: 
Basic setup

{code}
scala> case class Data1(key: String, value1: Int)
scala> case class Data2(key: String, value2: Int)

scala> val partition1 = sc.parallelize(1 to 10).map(x => Data1(s"$x", x))
.toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
scala> val partition2 = sc.parallelize(1 to 10).map(x => Data2(s"$x", x))
.toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
{code}

Join on key
{code}
scala> partition1.join(partition2, "key").explain
== Physical Plan ==
Project [key#0,value1#1,value2#13]
+- SortMergeJoin [key#0], [key#12]
   :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
[key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
[key#0 ASC], false, 0, None
   +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
[key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), Sort 
[key#12 ASC], false, 0, None
{code}

And we get a super efficient join with no shuffle.

But if we add a filter our join gets less efficient and we end up with a 
shuffle.
{code}
scala> partition1.join(partition2, "key").filter($"value1" === 
$"value2").explain
== Physical Plan ==
Project [key#0,value1#1,value2#13]
+- SortMergeJoin [value1#1,key#0], [value2#13,key#12]
   :- Sort [value1#1 ASC,key#0 ASC], false, 0
   :  +- TungstenExchange hashpartitioning(value1#1,key#0,200), None
   : +- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
[key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
[key#0 ASC], false, 0, None
   +- Sort [value2#13 ASC,key#12 ASC], false, 0
  +- TungstenExchange hashpartitioning(value2#13,key#12,200), None
 +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
[key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), Sort 
[key#12 ASC], false, 0, None
{code}

And we can avoid the shuffle if we use a filter statement that can't be pushed 
into the join.
{code}
scala> partition1.join(partition2, "key").filter($"value1" >= $"value2").explain
== Physical Plan ==
Project [key#0,value1#1,value2#13]
+- Filter (value1#1 >= value2#13)
   +- SortMergeJoin [key#0], [key#12]
  :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
[key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
[key#0 ASC], false, 0, None
  +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
[key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), Sort 
[key#12 ASC], false, 0, None
{code}

What's the best way to avoid the filter pushdown here??

> Adding filter after SortMergeJoin creates unnecessary shuffle
> -
>
> Key: SPARK-18067
> URL: https://issues.apache.org/jira/browse/SPARK-18067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Paul Jones
>Priority: Minor
>
> Basic setup
> {code}
> scala> case class Data1(key: String, value1: Int)
> scala> case class Data2(key: String, value2: Int)
> scala> val partition1 = sc.parallelize(1 to 10).map(x => Data1(s"$x", x))
> .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> scala> val partition2 = sc.parallelize(1 to 10).map(x => Data2(s"$x", x))
> .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> {code}
> Join on key
> {code}
> scala> partition1.join(partition2, "key").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [key#0], [key#12]
>:- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>+- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> And we get a super efficient join with no shuffle.
> But if we add a filter our join gets less efficient and we end up with a 
> shuffle.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" === 
> $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [value1#1,key#0], [value2#13,key#12]
>:- Sort [value1#1 ASC,key#0 ASC], false, 0
>:  +- TungstenExchange hashpartitioning(value1#1,key#0,200), None
>: +- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>+- Sort [value2#13 ASC,key#12 ASC], false, 0
>   +- TungstenExchange hashpartitioning(v

[jira] [Created] (SPARK-18067) Adding filter after SortMergeJoin creates unnecessary shuffle

2016-10-23 Thread Paul Jones (JIRA)
Paul Jones created SPARK-18067:
--

 Summary: Adding filter after SortMergeJoin creates unnecessary 
shuffle
 Key: SPARK-18067
 URL: https://issues.apache.org/jira/browse/SPARK-18067
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.1
Reporter: Paul Jones
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15487) Spark Master UI to reverse proxy Application and Workers UI

2016-10-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-15487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600365#comment-15600365
 ] 

Cédric Hernalsteens commented on SPARK-15487:
-

While it works great to proxy to the slave nodes (running in Docker containers 
with ports not exposed), there seem to be some inconsistencies in the way the 
URIs are generated.

I specified

- SPARK_MASTER_OPTS=-Dspark.ui.reverseProxyUrl=http://dev-machine-1/spark/ 
-Dspark.ui.reverseProxy=true
- SPARK_WORKER_OPTS=-Dspark.ui.reverseProxyUrl=http://dev-machine-1/spark/ 
-Dspark.ui.reverseProxy=true

I was hoping to be able to reverse proxy the whole thing with nginx:

location /spark/ {
  proxy_pass http://spark-master:8080/;
  proxy_set_header X-Forwarded-Host $host;
  proxy_set_header X-Forwarded-Server $host;
  proxy_set_header X-Real-IP $remote_addr;
  proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}

However, in the Master web UI the URIs for the slave nodes are 
http://dev-machine-1/proxy/... and not http://dev-machine-1/spark/proxy/...

Then I decided to use an nginx directive " location /proxy/ ", which is a bit 
ugly, but then on the Worker web UI the "back to master" URI is 
http://dev-machine-1/spark , which apparently does take spark.ui.reverseProxyUrl 
into account.

Would someone be able to point out what I'm doing wrong, or confirm that this is 
an issue?

> Spark Master UI to reverse proxy Application and Workers UI
> ---
>
> Key: SPARK-15487
> URL: https://issues.apache.org/jira/browse/SPARK-15487
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Gurvinder
>Assignee: Gurvinder
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently, when running in Standalone mode, the Spark UI's links to workers and 
> application drivers point to internal/protected network endpoints. So to access 
> the workers/application UI, the user's machine has to connect to a VPN or have 
> direct access to the internal network.
> Therefore the proposal is to make the Spark master UI reverse proxy this 
> information back to the user, so that only the Spark master UI needs to be 
> opened up to the internet. 
> The minimal change can be done by adding another route e.g. 
> http://spark-master.com/target// so when a request goes to target, 
> ProxyServlet kicks in, takes the  and forwards the request to it, 
> and sends the response back to the user.
> More information about the discussions for this feature can be found in this 
> mailing list thread: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-kubernetes-tc17599.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18066) Add Pool usage policies test coverage for FIFO & FAIR Schedulers

2016-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18066:


Assignee: Apache Spark

> Add Pool usage policies test coverage for FIFO & FAIR Schedulers
> 
>
> Key: SPARK-18066
> URL: https://issues.apache.org/jira/browse/SPARK-18066
> Project: Spark
>  Issue Type: Test
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Eren Avsarogullari
>Assignee: Apache Spark
>Priority: Minor
>
> The following Pool usage cases need to have Unit test coverage :
> - FIFO Scheduler just uses *root pool* so even if *spark.scheduler.pool* 
> property is set, related pool is not created and *TaskSetManagers* are added 
> to root pool.
> - FAIR Scheduler uses the default pool when the *spark.scheduler.pool* property 
> is not set. This can happen when the Properties object is null or empty (*new 
> Properties()*) or points to the default pool (*spark.scheduler.pool*=_default_).
> - FAIR Scheduler creates a new pool with default values when the 
> *spark.scheduler.pool* property points to a _non-existent_ pool. This can 
> happen when the scheduler allocation file is not set or does not contain the 
> related pool.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18066) Add Pool usage policies test coverage for FIFO & FAIR Schedulers

2016-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18066:


Assignee: (was: Apache Spark)

> Add Pool usage policies test coverage for FIFO & FAIR Schedulers
> 
>
> Key: SPARK-18066
> URL: https://issues.apache.org/jira/browse/SPARK-18066
> Project: Spark
>  Issue Type: Test
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Eren Avsarogullari
>Priority: Minor
>
> The following Pool usage cases need to have Unit test coverage :
> - FIFO Scheduler just uses *root pool* so even if *spark.scheduler.pool* 
> property is set, related pool is not created and *TaskSetManagers* are added 
> to root pool.
> - FAIR Scheduler uses the default pool when the *spark.scheduler.pool* property 
> is not set. This can happen when the Properties object is null or empty (*new 
> Properties()*) or points to the default pool (*spark.scheduler.pool*=_default_).
> - FAIR Scheduler creates a new pool with default values when the 
> *spark.scheduler.pool* property points to a _non-existent_ pool. This can 
> happen when the scheduler allocation file is not set or does not contain the 
> related pool.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18066) Add Pool usage policies test coverage for FIFO & FAIR Schedulers

2016-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600310#comment-15600310
 ] 

Apache Spark commented on SPARK-18066:
--

User 'erenavsarogullari' has created a pull request for this issue:
https://github.com/apache/spark/pull/15604

> Add Pool usage policies test coverage for FIFO & FAIR Schedulers
> 
>
> Key: SPARK-18066
> URL: https://issues.apache.org/jira/browse/SPARK-18066
> Project: Spark
>  Issue Type: Test
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Eren Avsarogullari
>Priority: Minor
>
> The following Pool usage cases need to have Unit test coverage :
> - FIFO Scheduler just uses *root pool* so even if *spark.scheduler.pool* 
> property is set, related pool is not created and *TaskSetManagers* are added 
> to root pool.
> - FAIR Scheduler uses the default pool when the *spark.scheduler.pool* property 
> is not set. This can happen when the Properties object is null or empty (*new 
> Properties()*) or points to the default pool (*spark.scheduler.pool*=_default_).
> - FAIR Scheduler creates a new pool with default values when the 
> *spark.scheduler.pool* property points to a _non-existent_ pool. This can 
> happen when the scheduler allocation file is not set or does not contain the 
> related pool.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18066) Add Pool usage policies test coverage for FIFO & FAIR Schedulers

2016-10-23 Thread Eren Avsarogullari (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eren Avsarogullari updated SPARK-18066:
---
Summary: Add Pool usage policies test coverage for FIFO & FAIR Schedulers  
(was: Add Pool usage policies test coverage to FIFO & FAIR Schedulers)

> Add Pool usage policies test coverage for FIFO & FAIR Schedulers
> 
>
> Key: SPARK-18066
> URL: https://issues.apache.org/jira/browse/SPARK-18066
> Project: Spark
>  Issue Type: Test
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Eren Avsarogullari
>Priority: Minor
>
> The following Pool usage cases need to have Unit test coverage :
> - FIFO Scheduler just uses *root pool* so even if *spark.scheduler.pool* 
> property is set, related pool is not created and *TaskSetManagers* are added 
> to root pool.
> - FAIR Scheduler uses the default pool when the *spark.scheduler.pool* property 
> is not set. This can happen when the Properties object is null or empty (*new 
> Properties()*) or points to the default pool (*spark.scheduler.pool*=_default_).
> - FAIR Scheduler creates a new pool with default values when the 
> *spark.scheduler.pool* property points to a _non-existent_ pool. This can 
> happen when the scheduler allocation file is not set or does not contain the 
> related pool.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18066) Add Pool usage policies test coverage to FIFO & FAIR Schedulers

2016-10-23 Thread Eren Avsarogullari (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eren Avsarogullari updated SPARK-18066:
---
Description: 
The following Pool usage cases need to have Unit test coverage :

- FIFO Scheduler just uses *root pool* so even if *spark.scheduler.pool* 
property is set, related pool is not created and *TaskSetManagers* are added to 
root pool.
- FAIR Scheduler uses the default pool when the *spark.scheduler.pool* property 
is not set. This can happen when the Properties object is null or empty (*new 
Properties()*) or points to the default pool (*spark.scheduler.pool*=_default_).
- FAIR Scheduler creates a new pool with default values when the 
*spark.scheduler.pool* property points to a _non-existent_ pool. This can 
happen when the scheduler allocation file is not set or does not contain the 
related pool.

  was:
The following Pool usage cases need to have Unit test coverage :

- FIFO Scheduler just uses *root pool* so even if 
{code:java}spark.scheduler.pool{code} property is set, related pool is not 
created and {code:java}TaskSetManagers{code} are added to root pool.
- FAIR Scheduler uses the default pool when the spark.scheduler.pool property is 
not set. This can happen when the Properties object is null or empty (new 
Properties()) or points to the default pool (spark.scheduler.pool=default).
- FAIR Scheduler creates a new pool with default values when the 
spark.scheduler.pool property points to a non-existent pool. This can happen 
when the scheduler allocation file is not set or does not contain the related pool.


> Add Pool usage policies test coverage to FIFO & FAIR Schedulers
> ---
>
> Key: SPARK-18066
> URL: https://issues.apache.org/jira/browse/SPARK-18066
> Project: Spark
>  Issue Type: Test
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Eren Avsarogullari
>Priority: Minor
>
> The following Pool usage cases need to have Unit test coverage :
> - FIFO Scheduler just uses *root pool* so even if *spark.scheduler.pool* 
> property is set, related pool is not created and *TaskSetManagers* are added 
> to root pool.
> - FAIR Scheduler uses the default pool when the *spark.scheduler.pool* property 
> is not set. This can happen when the Properties object is null or empty (*new 
> Properties()*) or points to the default pool (*spark.scheduler.pool*=_default_).
> - FAIR Scheduler creates a new pool with default values when the 
> *spark.scheduler.pool* property points to a _non-existent_ pool. This can 
> happen when the scheduler allocation file is not set or does not contain the 
> related pool.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18066) Add Pool usage policies test coverage to FIFO & FAIR Schedulers

2016-10-23 Thread Eren Avsarogullari (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eren Avsarogullari updated SPARK-18066:
---
Description: 
The following Pool usage cases need to have Unit test coverage :

- FIFO Scheduler just uses *root pool* so even if 
{code:java}spark.scheduler.pool{code} property is set, related pool is not 
created and {code:java}TaskSetManagers{code} are added to root pool.
- FAIR Scheduler uses the default pool when the spark.scheduler.pool property is 
not set. This can happen when the Properties object is null or empty (new 
Properties()) or points to the default pool (spark.scheduler.pool=default).
- FAIR Scheduler creates a new pool with default values when the 
spark.scheduler.pool property points to a non-existent pool. This can happen 
when the scheduler allocation file is not set or does not contain the related pool.

  was:
The following Pool usage cases need to have Unit test coverage :

- FIFO Scheduler just uses root pool so even if spark.scheduler.pool property 
is set, related pool is not created and TaskSetManagers are added to root pool.
- FAIR Scheduler uses the default pool when the spark.scheduler.pool property is 
not set. This can happen when the Properties object is null or empty (new 
Properties()) or points to the default pool (spark.scheduler.pool=default).
- FAIR Scheduler creates a new pool with default values when the 
spark.scheduler.pool property points to a non-existent pool. This can happen 
when the scheduler allocation file is not set or does not contain the related pool.


> Add Pool usage policies test coverage to FIFO & FAIR Schedulers
> ---
>
> Key: SPARK-18066
> URL: https://issues.apache.org/jira/browse/SPARK-18066
> Project: Spark
>  Issue Type: Test
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Eren Avsarogullari
>Priority: Minor
>
> The following Pool usage cases need to have Unit test coverage :
> - FIFO Scheduler just uses *root pool* so even if 
> {code:java}spark.scheduler.pool{code} property is set, related pool is not 
> created and {code:java}TaskSetManagers{code} are added to root pool.
> - FAIR Scheduler uses the default pool when the spark.scheduler.pool property is 
> not set. This can happen when the Properties object is null or empty (new 
> Properties()) or points to the default pool (spark.scheduler.pool=default).
> - FAIR Scheduler creates a new pool with default values when the 
> spark.scheduler.pool property points to a non-existent pool. This can happen 
> when the scheduler allocation file is not set or does not contain the related pool.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18066) Add Pool usage policies test coverage to FIFO & FAIR Schedulers

2016-10-23 Thread Eren Avsarogullari (JIRA)
Eren Avsarogullari created SPARK-18066:
--

 Summary: Add Pool usage policies test coverage to FIFO & FAIR 
Schedulers
 Key: SPARK-18066
 URL: https://issues.apache.org/jira/browse/SPARK-18066
 Project: Spark
  Issue Type: Test
  Components: Scheduler
Affects Versions: 2.1.0
Reporter: Eren Avsarogullari
Priority: Minor


The following Pool usage cases need to have Unit test coverage :

- FIFO Scheduler just uses root pool so even if spark.scheduler.pool property 
is set, related pool is not created and TaskSetManagers are added to root pool.
- FAIR Scheduler uses the default pool when the spark.scheduler.pool property is 
not set. This can happen when the Properties object is null or empty (new 
Properties()) or points to the default pool (spark.scheduler.pool=default).
- FAIR Scheduler creates a new pool with default values when the 
spark.scheduler.pool property points to a non-existent pool. This can happen 
when the scheduler allocation file is not set or does not contain the related pool.
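
For reference, a short sketch of how these cases come up in practice (the pool 
name, app name and allocation-file path below are made up for illustration):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("pool-usage-example")
  .set("spark.scheduler.mode", "FAIR")                 // FIFO is the default
  // .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
val sc = new SparkContext(conf)

// Jobs submitted from this thread target "poolA". Under FIFO this is ignored
// (everything stays in the root pool); under FAIR, if "poolA" is not declared
// in an allocation file, it is created on the fly with default values.
sc.setLocalProperty("spark.scheduler.pool", "poolA")
sc.parallelize(1 to 100).count()

// Clearing the property sends subsequent jobs back to the default pool.
sc.setLocalProperty("spark.scheduler.pool", null)
{code}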



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18058) AnalysisException may be thrown when union two DFs whose struct fields have different nullability

2016-10-23 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-18058:
--
Fix Version/s: 2.0.2

> AnalysisException may be thrown when union two DFs whose struct fields have 
> different nullability
> -
>
> Key: SPARK-18058
> URL: https://issues.apache.org/jira/browse/SPARK-18058
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.1
>Reporter: Cheng Lian
>Assignee: Nan Zhu
> Fix For: 2.0.2, 2.1.0
>
>
> The following Spark shell snippet reproduces this issue:
> {code}
> spark.range(10).createOrReplaceTempView("t1")
> spark.range(10).map(i => i: 
> java.lang.Long).toDF("id").createOrReplaceTempView("t2")
> sql("SELECT struct(id) FROM t1 UNION ALL SELECT struct(id) FROM t2")
> {code}
> {noformat}
> org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
> with the compatible column types. StructType(StructField(id,LongType,true)) 
> <> StructType(StructField(id,LongType,false)) at the first column of the 
> second table;
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11$$anonfun$apply$12.apply(CheckAnalysis.scala:291)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11$$anonfun$apply$12.apply(CheckAnalysis.scala:289)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11.apply(CheckAnalysis.scala:289)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11.apply(CheckAnalysis.scala:278)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:278)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:132)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:61)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:573)
>   ... 50 elided
> {noformat}
> The reason is that we treat two {{StructType}}s as incompatible even if they 
> only differ from each other in field nullability.
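
Until this is fixed, one possible workaround sketch (using the DataFrame API 
and the {{t1}}/{{t2}} temp views from the snippet above; the alias {{s}} is an 
assumption) is to force both sides onto an explicitly nullable schema before 
the union:

{code}
import org.apache.spark.sql.types._

val df1 = spark.sql("SELECT struct(id) AS s FROM t1")
val df2 = spark.sql("SELECT struct(id) AS s FROM t2")

// Relax nullability to a common schema on both sides, then union.
val commonSchema = StructType(Seq(
  StructField("s", StructType(Seq(StructField("id", LongType, nullable = true))),
    nullable = true)))

val unioned = spark.createDataFrame(df1.rdd, commonSchema)
  .union(spark.createDataFrame(df2.rdd, commonSchema))
{code}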



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18065) Spark 2 allows filter/where on columns not in current schema

2016-10-23 Thread Matthew Scruggs (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600088#comment-15600088
 ] 

Matthew Scruggs commented on SPARK-18065:
-

Makes sense [~hvanhovell], but the change in behavior from 1.6 to 2 was a bit 
unexpected. This might be more of a documentation/API issue, since the behavior 
is valid but different from before.

> Spark 2 allows filter/where on columns not in current schema
> 
>
> Key: SPARK-18065
> URL: https://issues.apache.org/jira/browse/SPARK-18065
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Matthew Scruggs
>
> I noticed in Spark 2 (unlike 1.6) it's possible to use filter/where on a 
> DataFrame that previously had a column, but no longer has it in its schema 
> due to a select() operation.
> In Spark 1.6.2, in spark-shell, we see that an exception is thrown when 
> attempting to filter/where using the selected-out column:
> {code:title=Spark 1.6.2}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.6.2
>   /_/
> Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
> Type in expressions to have them evaluated.
> Type :help for more information.
> Spark context available as sc.
> SQL context available as sqlContext.
> scala> val df1 = sqlContext.createDataFrame(sc.parallelize(Seq((1, "one"), 
> (2, "two")))).selectExpr("_1 as id", "_2 as word")
> df1: org.apache.spark.sql.DataFrame = [id: int, word: string]
> scala> df1.show()
> +---++
> | id|word|
> +---++
> |  1| one|
> |  2| two|
> +---++
> scala> val df2 = df1.select("id")
> df2: org.apache.spark.sql.DataFrame = [id: int]
> scala> df2.printSchema()
> root
>  |-- id: integer (nullable = false)
> scala> df2.where("word = 'one'").show()
> org.apache.spark.sql.AnalysisException: cannot resolve 'word' given input 
> columns: [id];
> {code}
> However in Spark 2.0.0 and 2.0.1, we see that the same filter/where succeeds 
> (no AnalysisException) and seems to filter out data as if the column remains:
> {code:title=Spark 2.0.1}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.0.1
>   /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val df1 = sc.parallelize(Seq((1, "one"), (2, 
> "two"))).toDF().selectExpr("_1 as id", "_2 as word")
> df1: org.apache.spark.sql.DataFrame = [id: int, word: string]
> scala> df1.show()
> +---++
> | id|word|
> +---++
> |  1| one|
> |  2| two|
> +---++
> scala> val df2 = df1.select("id")
> df2: org.apache.spark.sql.DataFrame = [id: int]
> scala> df2.printSchema()
> root
>  |-- id: integer (nullable = false)
> scala> df2.where("word = 'one'").show()
> +---+
> | id|
> +---+
> |  1|
> +---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18058) AnalysisException may be thrown when union two DFs whose struct fields have different nullability

2016-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600060#comment-15600060
 ] 

Apache Spark commented on SPARK-18058:
--

User 'CodingCat' has created a pull request for this issue:
https://github.com/apache/spark/pull/15602

> AnalysisException may be thrown when union two DFs whose struct fields have 
> different nullability
> -
>
> Key: SPARK-18058
> URL: https://issues.apache.org/jira/browse/SPARK-18058
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.1
>Reporter: Cheng Lian
>Assignee: Nan Zhu
> Fix For: 2.1.0
>
>
> The following Spark shell snippet reproduces this issue:
> {code}
> spark.range(10).createOrReplaceTempView("t1")
> spark.range(10).map(i => i: 
> java.lang.Long).toDF("id").createOrReplaceTempView("t2")
> sql("SELECT struct(id) FROM t1 UNION ALL SELECT struct(id) FROM t2")
> {code}
> {noformat}
> org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
> with the compatible column types. StructType(StructField(id,LongType,true)) 
> <> StructType(StructField(id,LongType,false)) at the first column of the 
> second table;
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11$$anonfun$apply$12.apply(CheckAnalysis.scala:291)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11$$anonfun$apply$12.apply(CheckAnalysis.scala:289)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11.apply(CheckAnalysis.scala:289)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11.apply(CheckAnalysis.scala:278)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:278)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:132)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:61)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:573)
>   ... 50 elided
> {noformat}
> The reason is that we treat two {{StructType}}s as incompatible even if they 
> only differ from each other in field nullability.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18058) AnalysisException may be thrown when union two DFs whose struct fields have different nullability

2016-10-23 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-18058.
---
   Resolution: Fixed
 Assignee: Nan Zhu
Fix Version/s: 2.1.0

> AnalysisException may be thrown when union two DFs whose struct fields have 
> different nullability
> -
>
> Key: SPARK-18058
> URL: https://issues.apache.org/jira/browse/SPARK-18058
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.1
>Reporter: Cheng Lian
>Assignee: Nan Zhu
> Fix For: 2.1.0
>
>
> The following Spark shell snippet reproduces this issue:
> {code}
> spark.range(10).createOrReplaceTempView("t1")
> spark.range(10).map(i => i: 
> java.lang.Long).toDF("id").createOrReplaceTempView("t2")
> sql("SELECT struct(id) FROM t1 UNION ALL SELECT struct(id) FROM t2")
> {code}
> {noformat}
> org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
> with the compatible column types. StructType(StructField(id,LongType,true)) 
> <> StructType(StructField(id,LongType,false)) at the first column of the 
> second table;
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11$$anonfun$apply$12.apply(CheckAnalysis.scala:291)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11$$anonfun$apply$12.apply(CheckAnalysis.scala:289)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11.apply(CheckAnalysis.scala:289)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11.apply(CheckAnalysis.scala:278)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:278)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:132)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:61)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:573)
>   ... 50 elided
> {noformat}
> The reason is that we treat two {{StructType}}s as incompatible even if they 
> only differ from each other in field nullability.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17495) Hive hash implementation

2016-10-23 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15599977#comment-15599977
 ] 

Tejas Patil commented on SPARK-17495:
-

[~rxin] : Sorry about that. In my original PR I intentionally did not introduce 
any usages of the Hive hash function in the rest of the code, to keep the PR 
atomic. However, I did not intend to have the Jira closed.

There are two places off the top of my head where Hive hash can be used:
- When hash() is called as a function in the user query. I will work on this.
- When hash partitioning is done. Would it be possible to get 
https://github.com/apache/spark/pull/15300 reviewed? After that's in, it will 
allow me to do this in a meaningful way (at least that was my main objective 
behind this Jira).

> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.1.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing, this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility so that one can switch parts of applications 
> across the engines without observing regressions.
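
A quick way to see the Spark side of this difference (the returned values are 
whatever Murmur3 produces; Hive's own hash UDF gives different numbers for the 
same inputs):

{code}
// Spark SQL's hash() expression is Murmur3-based, and the same hash drives
// hashpartitioning, so bucket assignment differs from Hive's for the same rows.
spark.sql("SELECT hash(1), hash('abc')").show()
{code}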



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17495) Hive hash implementation

2016-10-23 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15599965#comment-15599965
 ] 

Tejas Patil commented on SPARK-17495:
-

[~hvanhovell] : There were two datatypes that I had added TODOs for in the code: 
Decimal and date-related types. I will submit a PR.

https://github.com/tejasapatil/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala#L631

> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.1.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing, this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility so that one can switch parts of applications 
> across the engines without observing regressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18065) Spark 2 allows filter/where on columns not in current schema

2016-10-23 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15599951#comment-15599951
 ] 

Herman van Hovell commented on SPARK-18065:
---

This is, unfortunately, not really a bug. The SQL spec allows you to order a 
result set based on a column that is not in the projection; see TPC-DS query 98 
for an example:
{noformat}
SELECT
  i_item_desc,
  i_category,
  i_class,
  i_current_price,
  sum(ss_ext_sales_price) AS itemrevenue,
  sum(ss_ext_sales_price) * 100 / sum(sum(ss_ext_sales_price))
  OVER
  (PARTITION BY i_class) AS revenueratio
FROM
  store_sales, item, date_dim
WHERE
  ss_item_sk = i_item_sk
AND i_category IN ('Sports', 'Books', 'Home')
AND ss_sold_date_sk = d_date_sk
AND d_date BETWEEN cast('1999-02-22' AS DATE)
  AND (cast('1999-02-22' AS DATE) + INTERVAL 30 days)
GROUP BY
  i_item_id, i_item_desc, i_category, i_class, i_current_price
ORDER BY
  i_category, i_class, i_item_id, i_item_desc, revenueratio
{noformat}

In Spark 1.6 we only resolved such a column if it was part of the child's 
child. In Spark 2.0 we search the entire child tree.
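
If the stricter 1.6-style behaviour is wanted, one workaround sketch (not an 
official switch; {{df2}} refers to the repro below) is to cut the logical 
lineage so the analyzer can no longer see the pruned column:

{code}
// df2 only has "id" in its schema, but its logical plan still carries "word".
// Rebuilding the DataFrame from its RDD and schema drops that lineage, so the
// filter fails with AnalysisException, as it did in 1.6.
val df3 = spark.createDataFrame(df2.rdd, df2.schema)
df3.where("word = 'one'").show()  // cannot resolve 'word' given input columns: [id]
{code}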

> Spark 2 allows filter/where on columns not in current schema
> 
>
> Key: SPARK-18065
> URL: https://issues.apache.org/jira/browse/SPARK-18065
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Matthew Scruggs
>
> I noticed in Spark 2 (unlike 1.6) it's possible to use filter/where on a 
> DataFrame that previously had a column, but no longer has it in its schema 
> due to a select() operation.
> In Spark 1.6.2, in spark-shell, we see that an exception is thrown when 
> attempting to filter/where using the selected-out column:
> {code:title=Spark 1.6.2}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.6.2
>   /_/
> Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
> Type in expressions to have them evaluated.
> Type :help for more information.
> Spark context available as sc.
> SQL context available as sqlContext.
> scala> val df1 = sqlContext.createDataFrame(sc.parallelize(Seq((1, "one"), 
> (2, "two")))).selectExpr("_1 as id", "_2 as word")
> df1: org.apache.spark.sql.DataFrame = [id: int, word: string]
> scala> df1.show()
> +---++
> | id|word|
> +---++
> |  1| one|
> |  2| two|
> +---++
> scala> val df2 = df1.select("id")
> df2: org.apache.spark.sql.DataFrame = [id: int]
> scala> df2.printSchema()
> root
>  |-- id: integer (nullable = false)
> scala> df2.where("word = 'one'").show()
> org.apache.spark.sql.AnalysisException: cannot resolve 'word' given input 
> columns: [id];
> {code}
> However in Spark 2.0.0 and 2.0.1, we see that the same filter/where succeeds 
> (no AnalysisException) and seems to filter out data as if the column remains:
> {code:title=Spark 2.0.1}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.0.1
>   /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val df1 = sc.parallelize(Seq((1, "one"), (2, 
> "two"))).toDF().selectExpr("_1 as id", "_2 as word")
> df1: org.apache.spark.sql.DataFrame = [id: int, word: string]
> scala> df1.show()
> +---++
> | id|word|
> +---++
> |  1| one|
> |  2| two|
> +---++
> scala> val df2 = df1.select("id")
> df2: org.apache.spark.sql.DataFrame = [id: int]
> scala> df2.printSchema()
> root
>  |-- id: integer (nullable = false)
> scala> df2.where("word = 'one'").show()
> +---+
> | id|
> +---+
> |  1|
> +---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15453) FileSourceScanExec to extract `outputOrdering` information

2016-10-23 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated SPARK-15453:

Summary: FileSourceScanExec to extract `outputOrdering` information  (was: 
Improve join planning for bucketed / sorted tables)

> FileSourceScanExec to extract `outputOrdering` information
> --
>
> Key: SPARK-15453
> URL: https://issues.apache.org/jira/browse/SPARK-15453
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.1.0
>
>
> The datasource API allows creation of bucketed and sorted tables, but performing 
> joins on such tables still does not utilize this metadata to produce an optimal 
> query plan.
> As shown below, the `Exchange` and `Sort` can be avoided if the tables are known 
> to be hashed and sorted on the relevant columns.
> {noformat}
> == Physical Plan ==
> WholeStageCodegen
> :  +- SortMergeJoin [j#20,k#21,i#22], [j#23,k#24,i#25], Inner, None
> : :- INPUT
> : +- INPUT
> :- WholeStageCodegen
> :  :  +- Sort [j#20 ASC,k#21 ASC,i#22 ASC], false, 0
> :  : +- INPUT
> :  +- Exchange hashpartitioning(j#20, k#21, i#22, 200), None
> : +- WholeStageCodegen
> ::  +- Project [j#20,k#21,i#22]
> :: +- Filter (isnotnull(k#21) && isnotnull(j#20))
> ::+- Scan orc default.table7[j#20,k#21,i#22] Format: ORC, 
> InputPaths: file:/XXX/table7, PushedFilters: [IsNotNull(k), 
> IsNotNull(j)], ReadSchema: struct
> +- WholeStageCodegen
>:  +- Sort [j#23 ASC,k#24 ASC,i#25 ASC], false, 0
>: +- INPUT
>+- Exchange hashpartitioning(j#23, k#24, i#25, 200), None
>   +- WholeStageCodegen
>  :  +- Project [j#23,k#24,i#25]
>  : +- Filter (isnotnull(k#24) && isnotnull(j#23))
>  :+- Scan orc default.table8[j#23,k#24,i#25] Format: ORC, 
> InputPaths: file:/XXX/table8, PushedFilters: [IsNotNull(k), 
> IsNotNull(j)], ReadSchema: struct
> {noformat}
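
For reference, tables of this shape can be produced with the datasource 
bucketing API (the DataFrame {{df}}, the bucket count and the ORC format are 
assumptions; column and table names follow the plan above):

{code}
// Write a bucketed + sorted table. If both join sides are bucketed and sorted
// on the join keys, the Exchange and Sort steps in the plan above become
// unnecessary once the scan exposes its ordering, which is what this issue targets.
df.write
  .bucketBy(8, "j", "k", "i")
  .sortBy("j", "k", "i")
  .format("orc")
  .saveAsTable("table7")
{code}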



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-23 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15599904#comment-15599904
 ] 

Saikat Kanjilal commented on SPARK-9487:


Ping on this. [~holdenk], can you let me know if I can move ahead with the above 
approach?
Thanks

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If an operation depends on partition IDs, e.g., a random number 
> generator, this will lead to different results in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.
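
A minimal illustration of why the worker-thread count matters (a made-up 
example, not taken from the test suites): any computation seeded by partition 
ID changes when the default parallelism changes.

{code}
// Under local[2] the data lands in 2 partitions, under local[4] in 4, so the
// per-partition seeds, and therefore the "random" output, differ.
val rdd = sc.parallelize(1 to 8)  // partition count follows the master's parallelism
val sampled = rdd.mapPartitionsWithIndex { (pid, iter) =>
  val rng = new scala.util.Random(pid)
  iter.map(x => (x, rng.nextInt(100)))
}.collect()
{code}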



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15487) Spark Master UI to reverse proxy Application and Workers UI

2016-10-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-15487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15599856#comment-15599856
 ] 

Cédric Hernalsteens edited comment on SPARK-15487 at 10/23/16 3:49 PM:
---

I'm glad to see that the PR went through.

I got it to work by setting 

-Dspark.ui.reverseProxy=true in SPARK_MASTER_OPTS.

Is this the correct way to proceed?


was (Author: chernals):
I'm glad to see that the PR went through.

However I'm not sure how this is supposed to work.

The Master web UI gives me this: 
worker-20161023152840-172.28.0.8-42846, linking to the internal docker IP.

So I tried to access the worker UI from

http://spark-master-public-ip/worker-20161023152840-172.28.0.8-42846
http://spark-master-public-ip/target/worker-20161023152840-172.28.0.8-42846
http://spark-master-public-ip/target/20161023152840-172.28.0.8-42846

(from what I read in the discussion the last one should have been correct)

However I'm stuck at the master's webui.

I'm running 2.1.0 latest nightly build 
http://people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/

> Spark Master UI to reverse proxy Application and Workers UI
> ---
>
> Key: SPARK-15487
> URL: https://issues.apache.org/jira/browse/SPARK-15487
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Gurvinder
>Assignee: Gurvinder
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently, when running in Standalone mode, the Spark UI's links to workers and 
> application drivers point to internal/protected network endpoints. So to access 
> the workers/application UI, the user's machine has to connect to a VPN or have 
> direct access to the internal network.
> Therefore the proposal is to make the Spark master UI reverse proxy this 
> information back to the user, so that only the Spark master UI needs to be 
> opened up to the internet. 
> The minimal change can be done by adding another route e.g. 
> http://spark-master.com/target// so when a request goes to target, 
> ProxyServlet kicks in, takes the  and forwards the request to it, 
> and sends the response back to the user.
> More information about the discussions for this feature can be found in this 
> mailing list thread: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-kubernetes-tc17599.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18065) Spark 2 allows filter/where on columns not in current schema

2016-10-23 Thread Matthew Scruggs (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Scruggs updated SPARK-18065:

Description: 
I noticed in Spark 2 (unlike 1.6) it's possible to use filter/where on a 
DataFrame that previously had a column, but no longer has it in its schema due 
to a select() operation.

In Spark 1.6.2, in spark-shell, we see that an exception is thrown when 
attempting to filter/where using the selected-out column:

{code:title=Spark 1.6.2}
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.2
  /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> val df1 = sqlContext.createDataFrame(sc.parallelize(Seq((1, "one"), (2, 
"two")))).selectExpr("_1 as id", "_2 as word")
df1: org.apache.spark.sql.DataFrame = [id: int, word: string]

scala> df1.show()
+---++
| id|word|
+---++
|  1| one|
|  2| two|
+---++


scala> val df2 = df1.select("id")
df2: org.apache.spark.sql.DataFrame = [id: int]

scala> df2.printSchema()
root
 |-- id: integer (nullable = false)


scala> df2.where("word = 'one'").show()
org.apache.spark.sql.AnalysisException: cannot resolve 'word' given input 
columns: [id];
{code}

However in Spark 2.0.0 and 2.0.1, we see that the same filter/where succeeds 
(no AnalysisException) and seems to filter out data as if the column remains:
{code:title=Spark 2.0.1}
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.1
  /_/
 
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df1 = sc.parallelize(Seq((1, "one"), (2, 
"two"))).toDF().selectExpr("_1 as id", "_2 as word")
df1: org.apache.spark.sql.DataFrame = [id: int, word: string]

scala> df1.show()
+---++
| id|word|
+---++
|  1| one|
|  2| two|
+---++


scala> val df2 = df1.select("id")
df2: org.apache.spark.sql.DataFrame = [id: int]


scala> df2.printSchema()
root
 |-- id: integer (nullable = false)


scala> df2.where("word = 'one'").show()
+---+
| id|
+---+
|  1|
+---+
{code}

  was:
I noticed in Spark 2 (unlike 1.6) it's possible to use filter/where on a 
DataFrame that previously had a column, but no longer has it in its schema due 
to a select() operation.

In Spark 1.6.2, in spark-shell, we see that an exception is thrown when 
attempting to filter/where using the selected-out column:

{code:title=Spark 1.6.2}
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.2
  /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> val df1 = sqlContext.createDataFrame(sc.parallelize(Seq((1, "one"), (2, 
"two")))).selectExpr("_1 as id", "_2 as word")
df1: org.apache.spark.sql.DataFrame = [id: int, word: string]

scala> df1.show()
+---++
| id|word|
+---++
|  1| one|
|  2| two|
+---++


scala> val df2 = df1.select("id")
df2: org.apache.spark.sql.DataFrame = [id: int]

scala> df2.printSchema()
root
 |-- id: integer (nullable = false)


scala> df2.where("word = 'one'").show()
org.apache.spark.sql.AnalysisException: cannot resolve 'word' given input 
columns: [id];
{code}

However in Spark 2.0.0 and 2.0.1, we see that the same code succeeds and seems 
to filter out data as if the column remains:
{code:title=Spark 2.0.1}
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.1
  /_/
 
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df1 = sc.parallelize(Seq((1, "one"), (2, 
"two"))).toDF().selectExpr("_1 as id", "_2 as word")
df1: org.apache.spark.sql.DataFrame = [id: int, word: string]

scala> df1.show()
+---++
| id|word|
+---++
|  1| one|
|  2| two|
+---++


scala> val df2 = df1.select("id")
df2: org.apache.spark.sql.DataFrame = [id: int]


scala> df2.printSchema()
root
 |-- id: integer (nullable = false)


scala> df2.where("word = 'one'").show()
+---+
| id|
+---+
|  1|
+---+
{code}


> Spark 2 allows filter/where on columns not in current schema
> 
>
> Key: SPARK-18065
> URL: https://issues.apache.org/jira/browse/SPARK-18065
> Project: Spark
>  Issue T

[jira] [Commented] (SPARK-15487) Spark Master UI to reverse proxy Application and Workers UI

2016-10-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-15487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15599856#comment-15599856
 ] 

Cédric Hernalsteens commented on SPARK-15487:
-

I'm glad to see that the PR went through.

However I'm not sure how this is supposed to work.

The Master web UI gives me this: 
worker-20161023152840-172.28.0.8-42846, linking to the internal docker IP.

So I tried to access the worker UI from

http://spark-master-public-ip/worker-20161023152840-172.28.0.8-42846
http://spark-master-public-ip/target/worker-20161023152840-172.28.0.8-42846
http://spark-master-public-ip/target/20161023152840-172.28.0.8-42846

(from what I read in the discussion the last one should have been correct)

However I'm stuck at the master's webui.

I'm running 2.1.0 latest nightly build 
http://people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/

> Spark Master UI to reverse proxy Application and Workers UI
> ---
>
> Key: SPARK-15487
> URL: https://issues.apache.org/jira/browse/SPARK-15487
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Gurvinder
>Assignee: Gurvinder
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently, when running in Standalone mode, the Spark UI's links to workers and 
> application drivers point to internal/protected network endpoints. So to access 
> the workers/application UI, the user's machine has to connect to a VPN or have 
> direct access to the internal network.
> Therefore the proposal is to make the Spark master UI reverse proxy this 
> information back to the user, so that only the Spark master UI needs to be 
> opened up to the internet. 
> The minimal change can be done by adding another route e.g. 
> http://spark-master.com/target// so when a request goes to target, 
> ProxyServlet kicks in, takes the  and forwards the request to it, 
> and sends the response back to the user.
> More information about the discussions for this feature can be found in this 
> mailing list thread: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-kubernetes-tc17599.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18065) Spark 2 allows filter/where on columns not in current schema

2016-10-23 Thread Matthew Scruggs (JIRA)
Matthew Scruggs created SPARK-18065:
---

 Summary: Spark 2 allows filter/where on columns not in current 
schema
 Key: SPARK-18065
 URL: https://issues.apache.org/jira/browse/SPARK-18065
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.1, 2.0.0
Reporter: Matthew Scruggs


I noticed in Spark 2 (unlike 1.6) it's possible to use filter/where on a 
DataFrame that previously had a column, but no longer has it in its schema due 
to a select() operation.

In Spark 1.6.2, in spark-shell, we see that an exception is thrown when 
attempting to filter/where using the selected-out column:

{code:title=Spark 1.6.2}
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.2
  /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> val df1 = sqlContext.createDataFrame(sc.parallelize(Seq((1, "one"), (2, 
"two")))).selectExpr("_1 as id", "_2 as word")
df1: org.apache.spark.sql.DataFrame = [id: int, word: string]

scala> df1.show()
+---++
| id|word|
+---++
|  1| one|
|  2| two|
+---++


scala> val df2 = df1.select("id")
df2: org.apache.spark.sql.DataFrame = [id: int]

scala> df2.printSchema()
root
 |-- id: integer (nullable = false)


scala> df2.where("word = 'one'").show()
org.apache.spark.sql.AnalysisException: cannot resolve 'word' given input 
columns: [id];
{code}

However in Spark 2.0.0 and 2.0.1, we see that the same code succeeds and seems 
to filter out data as if the column remains:
{code:title=Spark 2.0.1}
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.1
  /_/
 
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df1 = sc.parallelize(Seq((1, "one"), (2, 
"two"))).toDF().selectExpr("_1 as id", "_2 as word")
df1: org.apache.spark.sql.DataFrame = [id: int, word: string]

scala> df1.show()
+---++
| id|word|
+---++
|  1| one|
|  2| two|
+---++


scala> val df2 = df1.select("id")
df2: org.apache.spark.sql.DataFrame = [id: int]


scala> df2.printSchema()
root
 |-- id: integer (nullable = false)


scala> df2.where("word = 'one'").show()
+---+
| id|
+---+
|  1|
+---+
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18064) Spark SQL can't load default config file

2016-10-23 Thread darion yaphet (JIRA)
darion yaphet created SPARK-18064:
-

 Summary: Spark SQL can't load default config file 
 Key: SPARK-18064
 URL: https://issues.apache.org/jira/browse/SPARK-18064
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: darion yaphet






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16881) Migrate Mesos configs to use ConfigEntry

2016-10-23 Thread Sandeep Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15599682#comment-15599682
 ] 

Sandeep Singh commented on SPARK-16881:
---

I can work on this.

> Migrate Mesos configs to use ConfigEntry
> 
>
> Key: SPARK-16881
> URL: https://issues.apache.org/jira/browse/SPARK-16881
> Project: Spark
>  Issue Type: Task
>  Components: Mesos
>Affects Versions: 2.0.0
>Reporter: Michael Gummelt
>Priority: Minor
>
> https://github.com/apache/spark/pull/14414#discussion_r73032190
> We'd like to migrate Mesos' use of config vars to the new ConfigEntry class 
> so we can a) define all our configs in one place like YARN does, and b) take 
> advantage of features like default handling and generics.
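
Roughly, the target pattern looks like this (a sketch only: ConfigBuilder is a 
package-private Spark-internal API, the object name is hypothetical, and 
spark.mesos.coarse is just one existing Mesos setting used for illustration):

{code}
// Must live inside Spark's own source tree to see the private[spark] API.
import org.apache.spark.internal.config.ConfigBuilder

private[spark] object MesosConfigExample {
  // One typed definition with a default, instead of ad-hoc
  // conf.getBoolean("spark.mesos.coarse", true) calls scattered around the code.
  val COARSE_MODE = ConfigBuilder("spark.mesos.coarse")
    .doc("Whether to run Mesos in coarse-grained mode.")
    .booleanConf
    .createWithDefault(true)
}
{code}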



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17123) Performing set operations that combine string and date / timestamp columns may result in generated projection code which doesn't compile

2016-10-23 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-17123:
--
Fix Version/s: 2.0.2

> Performing set operations that combine string and date / timestamp columns 
> may result in generated projection code which doesn't compile
> 
>
> Key: SPARK-17123
> URL: https://issues.apache.org/jira/browse/SPARK-17123
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.0.2, 2.1.0
>
>
> The following example program causes SpecificSafeProjection code generation 
> to produce Java code which doesn't compile:
> {code}
> import org.apache.spark.sql.types._
> spark.sql("set spark.sql.codegen.fallback=false")
> val dateDF = spark.createDataFrame(sc.parallelize(Seq(Row(new 
> java.sql.Date(0)))), StructType(StructField("value", DateType) :: Nil))
> val longDF = sc.parallelize(Seq(new java.sql.Date(0).toString)).toDF
> dateDF.union(longDF).collect()
> {code}
> This fails at runtime with the following error:
> {code}
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 28, Column 107: No applicable constructor/method found 
> for actual parameters "org.apache.spark.unsafe.types.UTF8String"; candidates 
> are: "public static java.sql.Date 
> org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)"
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificSafeProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificSafeProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private MutableRow mutableRow;
> /* 009 */   private Object[] values;
> /* 010 */   private org.apache.spark.sql.types.StructType schema;
> /* 011 */
> /* 012 */
> /* 013 */   public SpecificSafeProjection(Object[] references) {
> /* 014 */ this.references = references;
> /* 015 */ mutableRow = (MutableRow) references[references.length - 1];
> /* 016 */
> /* 017 */ this.schema = (org.apache.spark.sql.types.StructType) 
> references[0];
> /* 018 */   }
> /* 019 */
> /* 020 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 021 */ InternalRow i = (InternalRow) _i;
> /* 022 */
> /* 023 */ values = new Object[1];
> /* 024 */
> /* 025 */ boolean isNull2 = i.isNullAt(0);
> /* 026 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0));
> /* 027 */ boolean isNull1 = isNull2;
> /* 028 */ final java.sql.Date value1 = isNull1 ? null : 
> org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(value2);
> /* 029 */ isNull1 = value1 == null;
> /* 030 */ if (isNull1) {
> /* 031 */   values[0] = null;
> /* 032 */ } else {
> /* 033 */   values[0] = value1;
> /* 034 */ }
> /* 035 */
> /* 036 */ final org.apache.spark.sql.Row value = new 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema(values, 
> schema);
> /* 037 */ if (false) {
> /* 038 */   mutableRow.setNullAt(0);
> /* 039 */ } else {
> /* 040 */
> /* 041 */   mutableRow.update(0, value);
> /* 042 */ }
> /* 043 */
> /* 044 */ return mutableRow;
> /* 045 */   }
> /* 046 */ }
> {code}
> Here, the invocation of {{DateTimeUtils.toJavaDate}} is incorrect because the 
> generated code tries to call it with a UTF8String while the method expects an 
> int instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18027) .sparkStaging not clean on RM ApplicationNotFoundException

2016-10-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15599551#comment-15599551
 ] 

Sean Owen commented on SPARK-18027:
---

OK, though Spark will consider the app failed in this case no matter what. Is 
it consistent not to clean it up? It won't recover on the Spark side regardless.

> .sparkStaging not clean on RM ApplicationNotFoundException
> --
>
> Key: SPARK-18027
> URL: https://issues.apache.org/jira/browse/SPARK-18027
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: David Shar
>Priority: Minor
>
> Hi,
> It seems that SPARK-7705 didn't fix all issues with .sparkStaging folder 
> cleanup.
> In Client.scala's monitorApplication: 
> {code}
> val report: ApplicationReport =
>   try {
>     getApplicationReport(appId)
>   } catch {
>     case e: ApplicationNotFoundException =>
>       logError(s"Application $appId not found.")
>       return (YarnApplicationState.KILLED, FinalApplicationStatus.KILLED)
>     case NonFatal(e) =>
>       logError(s"Failed to contact YARN for application $appId.", e)
>       return (YarnApplicationState.FAILED, FinalApplicationStatus.FAILED)
>   }
> 
> if (state == YarnApplicationState.FINISHED ||
>     state == YarnApplicationState.FAILED ||
>     state == YarnApplicationState.KILLED) {
>   cleanupStagingDir(appId)
>   return (state, report.getFinalApplicationStatus)
> }
> {code}
> In the case of ApplicationNotFoundException, we don't clean up the 
> .sparkStaging folder.
> I believe we should call cleanupStagingDir(appId) in the catch clause above.
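
As a rough sketch of that proposal (same names as in the snippet above; untested), the ApplicationNotFoundException branch would become:

{code}
case e: ApplicationNotFoundException =>
  logError(s"Application $appId not found.")
  cleanupStagingDir(appId)  // proposed addition: also clean up .sparkStaging here
  return (YarnApplicationState.KILLED, FinalApplicationStatus.KILLED)
{code}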



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18045) Move `HiveDataFrameAnalyticsSuite` to package `sql`

2016-10-23 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-18045.
---
   Resolution: Fixed
 Assignee: Jiang Xingbo
Fix Version/s: 2.1.0

> Move `HiveDataFrameAnalyticsSuite` to package `sql`
> ---
>
> Key: SPARK-18045
> URL: https://issues.apache.org/jira/browse/SPARK-18045
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Minor
> Fix For: 2.1.0
>
>
> The test suite `HiveDataFrameAnalyticsSuite` has nothing to do with Hive, so 
> we should move it to package `sql`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18038) Move output partitioning definition from UnaryNodeExec to its children

2016-10-23 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-18038.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

> Move output partitioning definition from UnaryNodeExec to its children
> --
>
> Key: SPARK-18038
> URL: https://issues.apache.org/jira/browse/SPARK-18038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Trivial
> Fix For: 2.1.0
>
>
> This was a suggestion by [~rxin] in one of the dev list discussions: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Project-not-preserving-child-partitioning-td19417.html
> {noformat}
> I think this is very risky because preserving output partitioning should not 
> be a property of UnaryNodeExec (e.g. exchange).
> It would be better (safer) to move the output partitioning definition into 
> each of the operator and remove it from UnaryExecNode.
> {noformat}
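
To make the suggested direction concrete, here is a small self-contained toy model (deliberately not Spark's real classes): the unary base class stops claiming to preserve its child's partitioning, and each concrete operator declares its own behaviour.

{code}
// Toy model only -- simplified stand-ins for SparkPlan / UnaryExecNode.
sealed trait Partitioning
case object UnknownPartitioning extends Partitioning
case class HashPartitioning(numPartitions: Int) extends Partitioning

sealed trait PlanNode { def outputPartitioning: Partitioning }

case class Scan(parts: Int) extends PlanNode {
  def outputPartitioning: Partitioning = HashPartitioning(parts)
}

// No blanket "preserve the child's partitioning" default here, because that
// would be wrong for an operator like exchange.
abstract class UnaryNode extends PlanNode { def child: PlanNode }

// Project really does preserve partitioning, so it says so explicitly.
case class Project(child: PlanNode) extends UnaryNode {
  def outputPartitioning: Partitioning = child.outputPartitioning
}

// Exchange repartitions its input, so it defines its own partitioning.
case class Exchange(child: PlanNode, parts: Int) extends UnaryNode {
  def outputPartitioning: Partitioning = HashPartitioning(parts)
}

object Demo extends App {
  println(Project(Scan(4)).outputPartitioning)                      // HashPartitioning(4)
  println(Exchange(Project(Scan(4)), parts = 8).outputPartitioning) // HashPartitioning(8)
}
{code}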



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17495) Hive hash implementation

2016-10-23 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15599523#comment-15599523
 ] 

Herman van Hovell commented on SPARK-17495:
---

Oh? Which data types are we missing?

> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.1.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility so that one can switch parts of applications 
> across the engines without observing regressions.
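
For a concrete feel of the mismatch, a minimal sketch (it assumes a running SparkSession named spark, and it assumes -- unverified here -- that Hive's bucketing hash of an INT is the value itself):

{code}
// Spark's hash() is Murmur3-based, so its bucket for the value 42 will in
// general disagree with Hive's bucket for the same value.
val sparkBucket = spark.sql("SELECT pmod(hash(42), 8)").first().getInt(0)
val hiveBucket  = 42 % 8  // assumed Hive behaviour for INT, see note above
println(s"spark bucket = $sparkBucket, hive bucket = $hiveBucket")
{code}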



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17123) Performing set operations that combine string and date / timestamp columns may result in generated projection code which doesn't compile

2016-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15599444#comment-15599444
 ] 

Apache Spark commented on SPARK-17123:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/15601

> Performing set operations that combine string and date / timestamp columns 
> may result in generated projection code which doesn't compile
> 
>
> Key: SPARK-17123
> URL: https://issues.apache.org/jira/browse/SPARK-17123
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.1.0
>
>
> The following example program causes SpecificSafeProjection code generation 
> to produce Java code which doesn't compile:
> {code}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> spark.sql("set spark.sql.codegen.fallback=false")
> val dateDF = spark.createDataFrame(
>   sc.parallelize(Seq(Row(new java.sql.Date(0)))),
>   StructType(StructField("value", DateType) :: Nil))
> val longDF = sc.parallelize(Seq(new java.sql.Date(0).toString)).toDF
> dateDF.union(longDF).collect()
> {code}
> This fails at runtime with the following error:
> {code}
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 28, Column 107: No applicable constructor/method found 
> for actual parameters "org.apache.spark.unsafe.types.UTF8String"; candidates 
> are: "public static java.sql.Date 
> org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)"
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificSafeProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificSafeProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private MutableRow mutableRow;
> /* 009 */   private Object[] values;
> /* 010 */   private org.apache.spark.sql.types.StructType schema;
> /* 011 */
> /* 012 */
> /* 013 */   public SpecificSafeProjection(Object[] references) {
> /* 014 */ this.references = references;
> /* 015 */ mutableRow = (MutableRow) references[references.length - 1];
> /* 016 */
> /* 017 */ this.schema = (org.apache.spark.sql.types.StructType) 
> references[0];
> /* 018 */   }
> /* 019 */
> /* 020 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 021 */ InternalRow i = (InternalRow) _i;
> /* 022 */
> /* 023 */ values = new Object[1];
> /* 024 */
> /* 025 */ boolean isNull2 = i.isNullAt(0);
> /* 026 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0));
> /* 027 */ boolean isNull1 = isNull2;
> /* 028 */ final java.sql.Date value1 = isNull1 ? null : 
> org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(value2);
> /* 029 */ isNull1 = value1 == null;
> /* 030 */ if (isNull1) {
> /* 031 */   values[0] = null;
> /* 032 */ } else {
> /* 033 */   values[0] = value1;
> /* 034 */ }
> /* 035 */
> /* 036 */ final org.apache.spark.sql.Row value = new 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema(values, 
> schema);
> /* 037 */ if (false) {
> /* 038 */   mutableRow.setNullAt(0);
> /* 039 */ } else {
> /* 040 */
> /* 041 */   mutableRow.update(0, value);
> /* 042 */ }
> /* 043 */
> /* 044 */ return mutableRow;
> /* 045 */   }
> /* 046 */ }
> {code}
> Here, the invocation of {{DateTimeUtils.toJavaDate}} is incorrect because the 
> generated code tries to call it with a UTF8String while the method expects an 
> int instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.1.0

2016-10-23 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15599386#comment-15599386
 ] 

Liwei Lin commented on SPARK-18057:
---

Cool!

> Update structured streaming kafka from 10.0.1 to 10.1.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here: 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15233) Spark task metrics should include hdfs read write latency

2016-10-23 Thread Yuance Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15599290#comment-15599290
 ] 

Yuance Li commented on SPARK-15233:
---

Any updates on this issue? Thanks.

> Spark task metrics should include hdfs read write latency
> -
>
> Key: SPARK-15233
> URL: https://issues.apache.org/jira/browse/SPARK-15233
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.6.1
>Reporter: Sital Kedia
>Priority: Minor
>
> Currently the Spark task metrics do not include HDFS read/write latency. It 
> would be very useful to have those to find bottlenecks in a query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17495) Hive hash implementation

2016-10-23 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15599246#comment-15599246
 ] 

Reynold Xin commented on SPARK-17495:
-

[~tejasp] I am going to reopen this. I just realized the committed 
implementation doesn't work for all data types, and is mostly dead code. Can 
we push it to completion by implementing the remaining data types?


> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.1.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility so that one can switch parts of applications 
> across the engines without observing regressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-17495) Hive hash implementation

2016-10-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reopened SPARK-17495:
-

> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.1.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility so that one can switch parts of applications 
> across the engines without observing regressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org