[jira] [Assigned] (SPARK-18048) If expression behaves differently if true and false expression are interchanged in case of different data types.
[ https://issues.apache.org/jira/browse/SPARK-18048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18048: Assignee: Apache Spark

Key: SPARK-18048
URL: https://issues.apache.org/jira/browse/SPARK-18048
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Priyanka Garg
Assignee: Apache Spark

The If expression behaves differently if the true and false expressions are interchanged when they have different data types. For example:

checkEvaluation(
  If(Literal.create(true, BooleanType),
    Literal.create(identity(1), DateType),
    Literal.create(identity(2L), TimestampType)),
  identity(1))

throws an error, while

checkEvaluation(
  If(Literal.create(true, BooleanType),
    Literal.create(identity(1L), TimestampType),
    Literal.create(identity(2), DateType)),
  identity(1L))

works fine. The reason is that the If expression's dataType only considers trueValue.dataType.

Also, the first checkEvaluation call above breaks only for generated mutable projection and unsafe projection; for all other projection types it works fine. Either both should work or neither should.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
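The asymmetry described in the report can be sketched outside Spark. The snippet below is plain Python, not Catalyst's actual classes; the `Expr` class and both function names are invented for illustration only:

```python
# Plain-Python sketch (invented names, not Spark's Catalyst classes) of why
# deriving an If expression's result type only from trueValue is asymmetric:
# swapping the branches changes the declared type, so one branch order can
# appear to type-check while the mirrored order fails.

class Expr:
    def __init__(self, value, data_type):
        self.value = value
        self.data_type = data_type

def if_type_buggy(true_expr, false_expr):
    # The reported behavior: only trueValue.dataType is consulted.
    return true_expr.data_type

def if_type_checked(true_expr, false_expr):
    # A symmetric alternative: require both branches to agree first, so the
    # result does not depend on which branch happens to come first.
    if true_expr.data_type != false_expr.data_type:
        raise TypeError("differing types: %s vs %s"
                        % (true_expr.data_type, false_expr.data_type))
    return true_expr.data_type

date_branch = Expr(1, "DateType")
ts_branch = Expr(2, "TimestampType")

# Swapping the branches silently changes the buggy result type.
assert if_type_buggy(date_branch, ts_branch) == "DateType"
assert if_type_buggy(ts_branch, date_branch) == "TimestampType"
```

With the symmetric check, both branch orders fail identically instead of one passing by accident, which matches the "either both should work or none should work" expectation in the report.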
[jira] [Commented] (SPARK-18048) If expression behaves differently if true and false expression are interchanged in case of different data types.
[ https://issues.apache.org/jira/browse/SPARK-18048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15601126#comment-15601126 ] Apache Spark commented on SPARK-18048: User 'priyankagargnitk' has created a pull request for this issue: https://github.com/apache/spark/pull/15609
[jira] [Assigned] (SPARK-18048) If expression behaves differently if true and false expression are interchanged in case of different data types.
[ https://issues.apache.org/jira/browse/SPARK-18048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18048: Assignee: (was: Apache Spark)
[jira] [Updated] (SPARK-18048) If expression behaves differently if true and false expression are interchanged in case of different data types.
[ https://issues.apache.org/jira/browse/SPARK-18048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Priyanka Garg updated SPARK-18048: Description: (updated to the checkEvaluation examples quoted in the issue description above)

was:

If expression behaves differently if the true and false expressions are interchanged in case of different data types. For example:

If(Literal.create(geo != null, BooleanType),
  Literal.create(null, DateType),
  Literal.create(null, TimestampType))

throws an error, while

If(Literal.create(geo != null, BooleanType),
  Literal.create(null, TimestampType),
  Literal.create(null, DateType))

works fine. The reason is that the If expression's dataType only considers trueValue.dataType. Also, the first If example above breaks only for generated mutable projection and unsafe projection; for all other projection types it works fine. Either both should work or neither should.
[jira] [Updated] (SPARK-18048) If expression behaves differently if true and false expression are interchanged in case of different data types.
[ https://issues.apache.org/jira/browse/SPARK-18048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Priyanka Garg updated SPARK-18048: Description: (minor edit of the geo != null examples quoted earlier; re-added the closing line "Either both should work or none should work")
[jira] [Assigned] (SPARK-17838) Strict type checking for arguments with a better messages across APIs.
[ https://issues.apache.org/jira/browse/SPARK-17838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17838: Assignee: Apache Spark

Key: SPARK-17838
URL: https://issues.apache.org/jira/browse/SPARK-17838
Project: Spark
Issue Type: Improvement
Components: SparkR
Reporter: Hyukjin Kwon
Assignee: Apache Spark

It seems there should be stricter type checking for arguments in SparkR APIs. This was discussed in several PRs. Roughly, there are three cases:

The first case was described in https://github.com/apache/spark/pull/15239#discussion_r82445435
- Check for {{zero-length variable name}}

The other cases were handled in https://github.com/apache/spark/pull/15231#discussion_r80417904
- Catch the exception from the JVM and format it nicely
- Strictly check types before calling into the JVM from SparkR
[jira] [Assigned] (SPARK-17838) Strict type checking for arguments with a better messages across APIs.
[ https://issues.apache.org/jira/browse/SPARK-17838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17838: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-17838) Strict type checking for arguments with a better messages across APIs.
[ https://issues.apache.org/jira/browse/SPARK-17838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15601064#comment-15601064 ] Apache Spark commented on SPARK-17838: User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/15608
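The "check types strictly before calling the JVM" idea from SPARK-17838 can be sketched in plain Python. The SparkR work itself is in R, and the helper below is entirely hypothetical; it only illustrates the pattern of failing fast with a readable message instead of surfacing a cryptic JVM-side error:

```python
# Hypothetical helper (not SparkR's actual API) illustrating strict argument
# checking with a clear message, performed before any call crosses into the JVM.

def check_type(value, expected_type, name):
    if not isinstance(value, expected_type):
        raise TypeError("argument '%s' should be %s, got %s"
                        % (name, expected_type.__name__, type(value).__name__))
    return value

def set_app_name(name):
    # Validate eagerly: a TypeError naming the bad argument here is far more
    # useful than an exception formatted on the JVM side.
    check_type(name, str, "name")
    return "app=" + name

assert set_app_name("demo") == "app=demo"
```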
[jira] [Commented] (SPARK-16137) Random Forest wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15601037#comment-15601037 ] Apache Spark commented on SPARK-16137: User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/15607

Key: SPARK-16137
URL: https://issues.apache.org/jira/browse/SPARK-16137
Project: Spark
Issue Type: Sub-task
Components: ML, SparkR
Affects Versions: 2.1.0
Reporter: Kai Jiang

Implement a wrapper in SparkR to support Random Forest.
[jira] [Assigned] (SPARK-18070) binary operator should not consider nullability when comparing input types
[ https://issues.apache.org/jira/browse/SPARK-18070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18070: Assignee: Apache Spark (was: Wenchen Fan)

Key: SPARK-18070
URL: https://issues.apache.org/jira/browse/SPARK-18070
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Wenchen Fan
Assignee: Apache Spark
[jira] [Commented] (SPARK-18070) binary operator should not consider nullability when comparing input types
[ https://issues.apache.org/jira/browse/SPARK-18070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15601009#comment-15601009 ] Apache Spark commented on SPARK-18070: User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/15606
[jira] [Assigned] (SPARK-18070) binary operator should not consider nullability when comparing input types
[ https://issues.apache.org/jira/browse/SPARK-18070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18070: Assignee: Wenchen Fan (was: Apache Spark)
[jira] [Created] (SPARK-18070) binary operator should not consider nullability when comparing input types
Wenchen Fan created SPARK-18070:

Summary: binary operator should not consider nullability when comparing input types
Key: SPARK-18070
URL: https://issues.apache.org/jira/browse/SPARK-18070
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Wenchen Fan
Assignee: Wenchen Fan
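The behavior the title asks for can be illustrated with a small sketch: compare two input types only after stripping their nullability flags. The dict-based type encoding below is invented for the example and is not Catalyst's DataType representation:

```python
# Sketch of comparing data types while ignoring nullability. Types are modeled
# as nested dicts carrying an optional "nullable" flag; this encoding is an
# assumption made for illustration, not Spark's actual DataType classes.

def strip_nullability(dt):
    # Recursively drop every "nullable" flag so only the shape remains.
    if isinstance(dt, dict):
        return {k: strip_nullability(v) for k, v in dt.items() if k != "nullable"}
    if isinstance(dt, list):
        return [strip_nullability(x) for x in dt]
    return dt

def same_type(a, b):
    # Equal up to nullability, which is what a binary operator's input-type
    # check should care about.
    return strip_nullability(a) == strip_nullability(b)

nullable_int = {"type": "integer", "nullable": True}
required_int = {"type": "integer", "nullable": False}

# Nullability alone should not make the comparison fail...
assert same_type(nullable_int, required_int)
# ...but a genuinely different type still should.
assert not same_type(nullable_int, {"type": "long", "nullable": True})
```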
[jira] [Closed] (SPARK-12180) DataFrame.join() in PySpark gives misleading exception when column name exists on both side
[ https://issues.apache.org/jira/browse/SPARK-12180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jagadeesan A S closed SPARK-12180. Resolution: Cannot Reproduce

Key: SPARK-12180
URL: https://issues.apache.org/jira/browse/SPARK-12180
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 1.5.2
Reporter: Daniel Thomas

When joining two DataFrames on a column 'session_uuid' I got the following exception, because both DataFrames had a column called 'at'. The exception is misleading about both the cause and the column causing the problem; renaming the column fixed it.

---
Py4JJavaError Traceback (most recent call last)
/Applications/spark-1.5.2-bin-hadoop2.4/python/pyspark/sql/utils.py in deco(*a, **kw)
35 try:
---> 36 return f(*a, **kw)
37 except py4j.protocol.Py4JJavaError as e:
/Applications/spark-1.5.2-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:
Py4JJavaError: An error occurred while calling o484.join.
: org.apache.spark.sql.AnalysisException: resolved attribute(s) session_uuid#3278 missing from uuid_x#9078,total_session_sec#9115L,at#3248,session_uuid#9114,uuid#9117,at#9084 in operator !Join Inner, Some((uuid_x#9078 = session_uuid#3278));
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)

During handling of the above exception, another exception occurred:

AnalysisException Traceback (most recent call last)
in ()
1 sel_starts = starts.select('uuid', 'at').withColumnRenamed('uuid', 'uuid_x')#.withColumnRenamed('at', 'at_x')
2 sel_closes = closes.select('uuid', 'at', 'session_uuid', 'total_session_sec')
----> 3 start_close = sel_starts.join(sel_closes, sel_starts['uuid_x'] == sel_closes['session_uuid'])
4 start_close.cache()
5 start_close.take(1)
/Applications/spark-1.5.2-bin-hadoop2.4/python/pyspark/sql/dataframe.py in join(self, other, on, how)
579 on = on[0]
580 if how is None:
--> 581 jdf = self._jdf.join(other._jdf, on._jc, "inner")
582 else:
583 assert isinstance(how, basestring), "how should be basestring"
/Applications/spark-1.5.2-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_va
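The workaround the reporter found, renaming the shared 'at' column before joining, can be made systematic by detecting overlapping column names up front. A plain-Python sketch with a hypothetical helper name (the column lists mirror the ones in the traceback above):

```python
# Hypothetical pre-join check: find column names present on both sides but not
# used as join keys. Spark's AnalysisException in this report pointed at the
# wrong column entirely, so checking before the join gives a clearer signal.

def overlapping_columns(left_cols, right_cols, join_keys=()):
    shared = set(left_cols) & set(right_cols)
    return sorted(shared - set(join_keys))

left = ["uuid_x", "at"]
right = ["uuid", "at", "session_uuid", "total_session_sec"]

# 'at' exists on both sides and is not a join key, so it should be renamed
# (e.g. to 'at_x', as the commented-out withColumnRenamed call would do)
# before performing the join.
assert overlapping_columns(left, right) == ["at"]
```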
[jira] [Commented] (SPARK-15487) Spark Master UI to reverse proxy Application and Workers UI
[ https://issues.apache.org/jira/browse/SPARK-15487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600938#comment-15600938 ] Gurvinder commented on SPARK-15487: The setting needs to be applied to all components (application, master, and worker), as described in the documentation:

"spark.ui.reverseProxy: Enable running Spark Master as reverse proxy for worker and application UIs. In this mode, Spark master will reverse proxy the worker and application UIs to enable access without requiring direct access to their hosts. Use it with caution, as worker and application UI will not be accessible directly, you will only be able to access them through spark master/proxy public URL. This setting affects all the workers and application UIs running in the cluster and must be set on all the workers, drivers and masters."

I set this in spark-defaults.conf on all the components and it works. The other parameter, spark.ui.reverseProxyUrl, is required if you are running the master itself behind a proxy; in that case its value must be the FQDN of the proxy. Hope that helps.

Key: SPARK-15487
URL: https://issues.apache.org/jira/browse/SPARK-15487
Project: Spark
Issue Type: Improvement
Components: Web UI
Affects Versions: 1.6.0, 1.6.1
Reporter: Gurvinder
Assignee: Gurvinder
Priority: Minor
Fix For: 2.1.0

Currently, when running in standalone mode, the Spark UI's links to workers and application drivers point to internal/protected network endpoints. To access the worker/application UIs, a user's machine therefore has to connect to a VPN or have direct access to the internal network.

The proposal is to have the Spark master UI reverse proxy this information back to the user, so that only the Spark master UI needs to be exposed to the internet.

A minimal change can be made by adding another route, e.g. http://spark-master.com/target//, so that when a request goes to target, ProxyServlet kicks in, forwards the request, and sends the response back to the user.

More information about the discussion of this feature can be found in this mailing list thread: http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-kubernetes-tc17599.html
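Gurvinder's advice above corresponds to a spark-defaults.conf fragment like the following. The hostname is a placeholder; only the two property names come from the comment and documentation quoted above:

```
# spark-defaults.conf, deployed identically on masters, workers, and drivers
spark.ui.reverseProxy      true

# Only needed when the master itself sits behind a reverse proxy; the value
# must point at the proxy (placeholder URL shown).
spark.ui.reverseProxyUrl   https://proxy.example.com
```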
[jira] [Resolved] (SPARK-12451) Regexp functions don't support patterns containing '*/'
[ https://issues.apache.org/jira/browse/SPARK-12451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jagadeesan A S resolved SPARK-12451. Resolution: Duplicate > Regexp functions don't support patterns containing '*/' > --- > > Key: SPARK-12451 > URL: https://issues.apache.org/jira/browse/SPARK-12451 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: William Dee > > When using the regexp functions in Spark SQL, patterns containing '*/' create > runtime errors in the auto generated code. This is due to the fact that the > code generator creates a multiline comment containing, amongst other things, > the pattern. > Here is an excerpt from my stacktrace to illustrate: (Helpfully, the stack > trace includes all of the auto-generated code) > {code} > Caused by: org.codehaus.commons.compiler.CompileException: Line 232, Column > 54: Unexpected token "," in primary > at org.codehaus.janino.Parser.compileException(Parser.java:3125) > at org.codehaus.janino.Parser.parsePrimary(Parser.java:2512) > at org.codehaus.janino.Parser.parseUnaryExpression(Parser.java:2252) > at > org.codehaus.janino.Parser.parseMultiplicativeExpression(Parser.java:2211) > at org.codehaus.janino.Parser.parseAdditiveExpression(Parser.java:2190) > at org.codehaus.janino.Parser.parseShiftExpression(Parser.java:2169) > at > org.codehaus.janino.Parser.parseRelationalExpression(Parser.java:2072) > at org.codehaus.janino.Parser.parseEqualityExpression(Parser.java:2046) > at org.codehaus.janino.Parser.parseAndExpression(Parser.java:2025) > at > org.codehaus.janino.Parser.parseExclusiveOrExpression(Parser.java:2004) > at > org.codehaus.janino.Parser.parseInclusiveOrExpression(Parser.java:1983) > at > org.codehaus.janino.Parser.parseConditionalAndExpression(Parser.java:1962) > at > org.codehaus.janino.Parser.parseConditionalOrExpression(Parser.java:1941) > at > org.codehaus.janino.Parser.parseConditionalExpression(Parser.java:1922) > at > 
org.codehaus.janino.Parser.parseAssignmentExpression(Parser.java:1901) > at org.codehaus.janino.Parser.parseExpression(Parser.java:1886) > at org.codehaus.janino.Parser.parseBlockStatement(Parser.java:1149) > at org.codehaus.janino.Parser.parseBlockStatements(Parser.java:1085) > at > org.codehaus.janino.Parser.parseMethodDeclarationRest(Parser.java:938) > at org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:620) > at org.codehaus.janino.Parser.parseClassBody(Parser.java:515) > at org.codehaus.janino.Parser.parseClassDeclarationRest(Parser.java:481) > at org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:577) > at > org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:229) > at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:192) > at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:84) > at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:77) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:387) > ... line 232 ... > /* regexp_replace(input[46, StringType],^.*/,) */ > > /* input[46, StringType] */ > > boolean isNull31 = i.isNullAt(46); > UTF8String primitive32 = isNull31 ? null : (i.getUTF8String(46)); > > boolean isNull24 = true; > UTF8String primitive25 = null; > if (!isNull31) { > /* ^.*/ */ > > /* expression: ^.*/ */ > Object obj35 = expressions[4].eval(i); > boolean isNull33 = obj35 == null; > UTF8String primitive34 = null; > if (!isNull33) { > primitive34 = (UTF8String) obj35; > } > ... > {code} > Note the multiple multiline comments, these obviously break when the regex > pattern contains the end-of-comment token '*/' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
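The failure mode above, user text terminating a generated block comment early, is easy to reproduce and to guard against. Below is a plain-Python sketch of one possible mitigation (escaping the end-of-comment token before embedding the pattern in a comment); this is an illustration of the idea, not Spark's actual fix:

```python
# A regex pattern containing "*/" closes a /* ... */ comment early, which is
# exactly what broke the generated Java code in this report. One possible
# mitigation: escape the end-of-comment token before embedding untrusted text.

def comment_safe(text):
    # Break up "*/" so it can no longer terminate the surrounding comment.
    return text.replace("*/", "*\\/")

pattern = "^.*/"  # the pattern from the stack trace above

unsafe = "/* regexp_replace(input, " + pattern + ") */"
safe = "/* regexp_replace(input, " + comment_safe(pattern) + ") */"

# The unsafe comment closes early; the safe one closes only at the very end.
assert unsafe.index("*/") < len(unsafe) - 2
assert safe.index("*/") == len(safe) - 2
```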
[jira] [Commented] (SPARK-18062) ProbabilisticClassificationModel.normalizeToProbabilitiesInPlace should return probabilities when given all-0 vector
[ https://issues.apache.org/jira/browse/SPARK-18062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600926#comment-15600926 ] Peng Meng commented on SPARK-18062: This relates to how an all-0 rawPrediction should be interpreted: are all classes impossible, or are all classes equally likely? If the latter, ProbabilisticClassificationModel.normalizeToProbabilitiesInPlace should return a valid probability vector with the uniform distribution.

Key: SPARK-18062
URL: https://issues.apache.org/jira/browse/SPARK-18062
Project: Spark
Issue Type: Improvement
Components: ML
Reporter: Joseph K. Bradley
Priority: Trivial

{{ProbabilisticClassificationModel.normalizeToProbabilitiesInPlace}} returns a vector of all-0 when given a rawPrediction vector of all-0. It should return a valid probability vector with the uniform distribution.

Note: This will be a *behavior change*, but it should be very minor and affect few if any users. We should note it in the release notes.
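The proposed behavior can be sketched in a few lines of plain Python. Spark's actual method mutates an MLlib vector in place; this standalone function only illustrates the uniform-distribution fallback being discussed:

```python
# Sketch of the proposed normalizeToProbabilitiesInPlace behavior: an all-zero
# rawPrediction is read as "all classes equally likely" and mapped to the
# uniform distribution instead of being returned as all zeros.

def normalize_to_probabilities(raw):
    total = float(sum(raw))
    if total == 0.0:
        n = len(raw)
        return [1.0 / n] * n  # uniform fallback for the all-0 vector
    return [v / total for v in raw]

# All-zero input now yields a valid probability vector.
assert normalize_to_probabilities([0.0, 0.0, 0.0, 0.0]) == [0.25, 0.25, 0.25, 0.25]
# Nonzero input is normalized as before.
assert normalize_to_probabilities([1.0, 3.0]) == [0.25, 0.75]
```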
[jira] [Assigned] (SPARK-18067) SortMergeJoin adds shuffle if join predicates have non partitioned columns
[ https://issues.apache.org/jira/browse/SPARK-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18067: Assignee: (was: Apache Spark) > SortMergeJoin adds shuffle if join predicates have non partitioned columns > -- > > Key: SPARK-18067 > URL: https://issues.apache.org/jira/browse/SPARK-18067 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Paul Jones >Priority: Minor > > Basic setup > {code} > scala> case class Data1(key: String, value1: Int) > scala> case class Data2(key: String, value2: Int) > scala> val partition1 = sc.parallelize(1 to 10).map(x => Data1(s"$x", x)) > .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache > scala> val partition2 = sc.parallelize(1 to 10).map(x => Data2(s"$x", x)) > .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache > {code} > Join on key > {code} > scala> partition1.join(partition2, "key").explain > == Physical Plan == > Project [key#0,value1#1,value2#13] > +- SortMergeJoin [key#0], [key#12] >:- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation > [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort > [key#0 ASC], false, 0, None >+- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation > [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), > Sort [key#12 ASC], false, 0, None > {code} > And we get a super efficient join with no shuffle. > But if we add a filter our join gets less efficient and we end up with a > shuffle. 
> {code} > scala> partition1.join(partition2, "key").filter($"value1" === > $"value2").explain > == Physical Plan == > Project [key#0,value1#1,value2#13] > +- SortMergeJoin [value1#1,key#0], [value2#13,key#12] >:- Sort [value1#1 ASC,key#0 ASC], false, 0 >: +- TungstenExchange hashpartitioning(value1#1,key#0,200), None >: +- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation > [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort > [key#0 ASC], false, 0, None >+- Sort [value2#13 ASC,key#12 ASC], false, 0 > +- TungstenExchange hashpartitioning(value2#13,key#12,200), None > +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation > [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), > Sort [key#12 ASC], false, 0, None > {code} > And we can avoid the shuffle if we use a filter statement that can't be pushed > into the join. > {code} > scala> partition1.join(partition2, "key").filter($"value1" >= > $"value2").explain > == Physical Plan == > Project [key#0,value1#1,value2#13] > +- Filter (value1#1 >= value2#13) >+- SortMergeJoin [key#0], [key#12] > :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation > [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort > [key#0 ASC], false, 0, None > +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation > [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), > Sort [key#12 ASC], false, 0, None > {code} > What's the best way to avoid the filter pushdown here?? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18067) SortMergeJoin adds shuffle if join predicates have non partitioned columns
[ https://issues.apache.org/jira/browse/SPARK-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600923#comment-15600923 ] Apache Spark commented on SPARK-18067: -- User 'tejasapatil' has created a pull request for this issue: https://github.com/apache/spark/pull/15605 > SortMergeJoin adds shuffle if join predicates have non partitioned columns > -- > > Key: SPARK-18067 > URL: https://issues.apache.org/jira/browse/SPARK-18067 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Paul Jones >Priority: Minor > > Basic setup > {code} > scala> case class Data1(key: String, value1: Int) > scala> case class Data2(key: String, value2: Int) > scala> val partition1 = sc.parallelize(1 to 10).map(x => Data1(s"$x", x)) > .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache > scala> val partition2 = sc.parallelize(1 to 10).map(x => Data2(s"$x", x)) > .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache > {code} > Join on key > {code} > scala> partition1.join(partition2, "key").explain > == Physical Plan == > Project [key#0,value1#1,value2#13] > +- SortMergeJoin [key#0], [key#12] >:- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation > [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort > [key#0 ASC], false, 0, None >+- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation > [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), > Sort [key#12 ASC], false, 0, None > {code} > And we get a super efficient join with no shuffle. > But if we add a filter our join gets less efficient and we end up with a > shuffle. 
> {code} > scala> partition1.join(partition2, "key").filter($"value1" === > $"value2").explain > == Physical Plan == > Project [key#0,value1#1,value2#13] > +- SortMergeJoin [value1#1,key#0], [value2#13,key#12] >:- Sort [value1#1 ASC,key#0 ASC], false, 0 >: +- TungstenExchange hashpartitioning(value1#1,key#0,200), None >: +- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation > [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort > [key#0 ASC], false, 0, None >+- Sort [value2#13 ASC,key#12 ASC], false, 0 > +- TungstenExchange hashpartitioning(value2#13,key#12,200), None > +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation > [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), > Sort [key#12 ASC], false, 0, None > {code} > And we can avoid the shuffle if use a filter statement that can't be pushed > in the join. > {code} > scala> partition1.join(partition2, "key").filter($"value1" >= > $"value2").explain > == Physical Plan == > Project [key#0,value1#1,value2#13] > +- Filter (value1#1 >= value2#13) >+- SortMergeJoin [key#0], [key#12] > :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation > [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort > [key#0 ASC], false, 0, None > +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation > [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), > Sort [key#12 ASC], false, 0, None > {code} > What's the best way to avoid the filter pushdown here?? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18067) SortMergeJoin adds shuffle if join predicates have non partitioned columns
[ https://issues.apache.org/jira/browse/SPARK-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18067: Assignee: Apache Spark > SortMergeJoin adds shuffle if join predicates have non partitioned columns > -- > > Key: SPARK-18067 > URL: https://issues.apache.org/jira/browse/SPARK-18067 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Paul Jones >Assignee: Apache Spark >Priority: Minor > > Basic setup > {code} > scala> case class Data1(key: String, value1: Int) > scala> case class Data2(key: String, value2: Int) > scala> val partition1 = sc.parallelize(1 to 10).map(x => Data1(s"$x", x)) > .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache > scala> val partition2 = sc.parallelize(1 to 10).map(x => Data2(s"$x", x)) > .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache > {code} > Join on key > {code} > scala> partition1.join(partition2, "key").explain > == Physical Plan == > Project [key#0,value1#1,value2#13] > +- SortMergeJoin [key#0], [key#12] >:- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation > [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort > [key#0 ASC], false, 0, None >+- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation > [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), > Sort [key#12 ASC], false, 0, None > {code} > And we get a super efficient join with no shuffle. > But if we add a filter our join gets less efficient and we end up with a > shuffle. 
> {code} > scala> partition1.join(partition2, "key").filter($"value1" === > $"value2").explain > == Physical Plan == > Project [key#0,value1#1,value2#13] > +- SortMergeJoin [value1#1,key#0], [value2#13,key#12] >:- Sort [value1#1 ASC,key#0 ASC], false, 0 >: +- TungstenExchange hashpartitioning(value1#1,key#0,200), None >: +- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation > [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort > [key#0 ASC], false, 0, None >+- Sort [value2#13 ASC,key#12 ASC], false, 0 > +- TungstenExchange hashpartitioning(value2#13,key#12,200), None > +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation > [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), > Sort [key#12 ASC], false, 0, None > {code} > And we can avoid the shuffle if use a filter statement that can't be pushed > in the join. > {code} > scala> partition1.join(partition2, "key").filter($"value1" >= > $"value2").explain > == Physical Plan == > Project [key#0,value1#1,value2#13] > +- Filter (value1#1 >= value2#13) >+- SortMergeJoin [key#0], [key#12] > :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation > [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort > [key#0 ASC], false, 0, None > +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation > [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), > Sort [key#12 ASC], false, 0, None > {code} > What's the best way to avoid the filter pushdown here?? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18067) SortMergeJoin adds shuffle if join predicates have non partitioned columns
[ https://issues.apache.org/jira/browse/SPARK-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated SPARK-18067: Summary: SortMergeJoin adds shuffle if join predicates have non partitioned columns (was: Adding filter after SortMergeJoin creates unnecessary shuffle) > SortMergeJoin adds shuffle if join predicates have non partitioned columns > -- > > Key: SPARK-18067 > URL: https://issues.apache.org/jira/browse/SPARK-18067 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Paul Jones >Priority: Minor > > Basic setup > {code} > scala> case class Data1(key: String, value1: Int) > scala> case class Data2(key: String, value2: Int) > scala> val partition1 = sc.parallelize(1 to 10).map(x => Data1(s"$x", x)) > .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache > scala> val partition2 = sc.parallelize(1 to 10).map(x => Data2(s"$x", x)) > .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache > {code} > Join on key > {code} > scala> partition1.join(partition2, "key").explain > == Physical Plan == > Project [key#0,value1#1,value2#13] > +- SortMergeJoin [key#0], [key#12] >:- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation > [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort > [key#0 ASC], false, 0, None >+- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation > [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), > Sort [key#12 ASC], false, 0, None > {code} > And we get a super efficient join with no shuffle. > But if we add a filter our join gets less efficient and we end up with a > shuffle. 
> {code} > scala> partition1.join(partition2, "key").filter($"value1" === > $"value2").explain > == Physical Plan == > Project [key#0,value1#1,value2#13] > +- SortMergeJoin [value1#1,key#0], [value2#13,key#12] >:- Sort [value1#1 ASC,key#0 ASC], false, 0 >: +- TungstenExchange hashpartitioning(value1#1,key#0,200), None >: +- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation > [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort > [key#0 ASC], false, 0, None >+- Sort [value2#13 ASC,key#12 ASC], false, 0 > +- TungstenExchange hashpartitioning(value2#13,key#12,200), None > +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation > [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), > Sort [key#12 ASC], false, 0, None > {code} > And we can avoid the shuffle if use a filter statement that can't be pushed > in the join. > {code} > scala> partition1.join(partition2, "key").filter($"value1" >= > $"value2").explain > == Physical Plan == > Project [key#0,value1#1,value2#13] > +- Filter (value1#1 >= value2#13) >+- SortMergeJoin [key#0], [key#12] > :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation > [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort > [key#0 ASC], false, 0, None > +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation > [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), > Sort [key#12 ASC], false, 0, None > {code} > What's the best way to avoid the filter pushdown here?? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600728#comment-15600728 ] Liwei Lin edited comment on SPARK-16845 at 10/24/16 3:52 AM: - Hi [~dondrake] the other solution is still under discussion & in progress. It'd be super helpful if you could create and provide the "explode-union-parquet" reproducer which causes the fix to fail. Thank you! was (Author: lwlin): The other solution is still under discussion & in progress. It'd be super helpful if you could create and provide the "explode-union-parquet" reproducer which causes the fix to fail. Thank you! > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > - > > Key: SPARK-16845 > URL: https://issues.apache.org/jira/browse/SPARK-16845 > Project: Spark > Issue Type: Bug > Components: Java API, ML, MLlib >Affects Versions: 2.0.0 >Reporter: hejie > > I have a wide table(400 columns), when I try fitting the traindata on all > columns, the fatal error occurs. > ... 46 more > Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" > of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) > at org.codehaus.janino.CodeContext.write(CodeContext.java:854) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
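For context on the error above: the JVM rejects any single method whose bytecode exceeds 64 KB, so one monolithic generated compare method over ~400 columns overflows the limit. A common mitigation is to split the generated snippets across many small helper methods and have the top-level method dispatch to them. The Python sketch below only illustrates the splitting idea; the method names, signatures, and chunking threshold are hypothetical, not Spark's actual codegen:

```python
def split_expressions(exprs, chunk_size=2):
    """Group generated code snippets into helper 'methods' so no single
    method body grows beyond a size limit (the JVM caps each method at
    64 KB of bytecode). chunk_size is an illustrative stand-in for a
    real byte-size threshold."""
    chunks = [exprs[i:i + chunk_size] for i in range(0, len(exprs), chunk_size)]
    methods = []
    for n, chunk in enumerate(chunks):
        body = "\n    ".join(chunk)
        methods.append(
            f"private int compare_{n}(InternalRow a, InternalRow b) {{\n    {body}\n}}"
        )
    # The top-level method just calls the helpers in order.
    dispatcher = "\n".join(
        f"if ((c = compare_{n}(a, b)) != 0) return c;" for n in range(len(chunks))
    )
    return methods, dispatcher

methods, dispatcher = split_expressions([f"// compare column {i}" for i in range(5)])
print(len(methods))  # 3 small helper methods instead of one oversized body
```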
[jira] [Assigned] (SPARK-18069) Many examples in Python docstrings are incomplete
[ https://issues.apache.org/jira/browse/SPARK-18069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18069: Assignee: Apache Spark > Many examples in Python docstrings are incomplete > - > > Key: SPARK-18069 > URL: https://issues.apache.org/jira/browse/SPARK-18069 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.0.1 >Reporter: Mortada Mehyar >Assignee: Apache Spark >Priority: Minor > > A lot of the python API functions show example usage that is incomplete. The > docstring shows output without having the input DataFrame defined. It can be > quite confusing trying to understand and/or follow the example. > For instance, the docstring for `DataFrame.dtypes()` is currently > {code} > def dtypes(self): > """Returns all column names and their data types as a list. > > >>> df.dtypes > [('age', 'int'), ('name', 'string')] > """ > {code} > when it should really be > {code} > def dtypes(self): > """Returns all column names and their data types as a list. > > >>> df = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', > 'age']) > >>> df.dtypes > [('age', 'int'), ('name', 'string')] > """ > {code} > I have a pending PR for fixing many of these occurrences here: > https://github.com/apache/spark/pull/15053 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18069) Many examples in Python docstrings are incomplete
[ https://issues.apache.org/jira/browse/SPARK-18069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600830#comment-15600830 ] Apache Spark commented on SPARK-18069: -- User 'mortada' has created a pull request for this issue: https://github.com/apache/spark/pull/15053 > Many examples in Python docstrings are incomplete > - > > Key: SPARK-18069 > URL: https://issues.apache.org/jira/browse/SPARK-18069 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.0.1 >Reporter: Mortada Mehyar >Priority: Minor > > A lot of the python API functions show example usage that is incomplete. The > docstring shows output without having the input DataFrame defined. It can be > quite confusing trying to understand and/or follow the example. > For instance, the docstring for `DataFrame.dtypes()` is currently > {code} > def dtypes(self): > """Returns all column names and their data types as a list. > > >>> df.dtypes > [('age', 'int'), ('name', 'string')] > """ > {code} > when it should really be > {code} > def dtypes(self): > """Returns all column names and their data types as a list. > > >>> df = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', > 'age']) > >>> df.dtypes > [('age', 'int'), ('name', 'string')] > """ > {code} > I have a pending PR for fixing many of these occurrences here: > https://github.com/apache/spark/pull/15053 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18069) Many examples in Python docstrings are incomplete
[ https://issues.apache.org/jira/browse/SPARK-18069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18069: Assignee: (was: Apache Spark) > Many examples in Python docstrings are incomplete > - > > Key: SPARK-18069 > URL: https://issues.apache.org/jira/browse/SPARK-18069 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.0.1 >Reporter: Mortada Mehyar >Priority: Minor > > A lot of the python API functions show example usage that is incomplete. The > docstring shows output without having the input DataFrame defined. It can be > quite confusing trying to understand and/or follow the example. > For instance, the docstring for `DataFrame.dtypes()` is currently > {code} > def dtypes(self): > """Returns all column names and their data types as a list. > > >>> df.dtypes > [('age', 'int'), ('name', 'string')] > """ > {code} > when it should really be > {code} > def dtypes(self): > """Returns all column names and their data types as a list. > > >>> df = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', > 'age']) > >>> df.dtypes > [('age', 'int'), ('name', 'string')] > """ > {code} > I have a pending PR for fixing many of these occurrences here: > https://github.com/apache/spark/pull/15053 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18069) Many examples in Python docstrings are incomplete
Mortada Mehyar created SPARK-18069: -- Summary: Many examples in Python docstrings are incomplete Key: SPARK-18069 URL: https://issues.apache.org/jira/browse/SPARK-18069 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 2.0.1 Reporter: Mortada Mehyar Priority: Minor A lot of the python API functions show example usage that is incomplete. The docstring shows output without having the input DataFrame defined. It can be quite confusing trying to understand and/or follow the example. For instance, the docstring for `DataFrame.dtypes()` is currently {code} def dtypes(self): """Returns all column names and their data types as a list. >>> df.dtypes [('age', 'int'), ('name', 'string')] """ {code} when it should really be {code} def dtypes(self): """Returns all column names and their data types as a list. >>> df = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age']) >>> df.dtypes [('age', 'int'), ('name', 'string')] """ {code} I have a pending PR for fixing many of these occurrences here: https://github.com/apache/spark/pull/15053 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18067) Adding filter after SortMergeJoin creates unnecessary shuffle
[ https://issues.apache.org/jira/browse/SPARK-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600824#comment-15600824 ] Tejas Patil edited comment on SPARK-18067 at 10/24/16 3:35 AM: --- [~hvanhovell] : Tagging you since you have context of this portion of the code. I feel that createPartitioning()'s use of `HashPartitioning` is hurting here. `HashPartitioning` has stricter semantics and feels like we could have something else. More explanation: Both children are hash partitioned on `key`. Assume these are the partitions for the children: ||partitions||child 1||child 2|| |partition 0|[0, 0, 0, 3]|[0, 0, 3, 3]| |partition 1|[1, 4, 4]|[4]| |partition 2|[2, 2]|[2, 5, 5, 5]| Since we have __all__ the same values of `key` in a given partition, we can evaluate other join predicates like (`value1` = `value2`) right there without needing any shuffle. What is currently being done i.e. `HashPartitioning(key, value)` expects rows with same value of `pmod( hash(key, value))` to be in the same partition and does not take advantage of the fact that we already have rows with same `key` packed together. was (Author: tejasp): [~hvanhovell] : Tagging you since you have context of this portion of the code. I feel that createPartitioning()'s use of `HashPartitioning` is hurting here. `HashPartitioning` has stricter semantics and feels like we could have something else. More explanation: Both children are hash partitioned on `key`. Assume these are the partitions for the children: ||partitions||child 1||child 2|| |partition 0|[0, 0, 0, 3]|[0, 0, 3, 3]| |partition 1|[1, 4, 4]|[4]| |partition 2|[2, 2]|[2, 5, 5, 5]| If we have *all* the same values of `key` in a given partition, then we can evaluate other join predicates like (`value1` = `value2`) right there without needing a shuffle. What is currently being done i.e. 
`HashPartitioning(key, value)` expects rows with same value of `pmod( hash(key, value))` to be in the same partition and does not take advantage of the fact that we already have rows with same `key` packed together. > Adding filter after SortMergeJoin creates unnecessary shuffle > - > > Key: SPARK-18067 > URL: https://issues.apache.org/jira/browse/SPARK-18067 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Paul Jones >Priority: Minor > > Basic setup > {code} > scala> case class Data1(key: String, value1: Int) > scala> case class Data2(key: String, value2: Int) > scala> val partition1 = sc.parallelize(1 to 10).map(x => Data1(s"$x", x)) > .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache > scala> val partition2 = sc.parallelize(1 to 10).map(x => Data2(s"$x", x)) > .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache > {code} > Join on key > {code} > scala> partition1.join(partition2, "key").explain > == Physical Plan == > Project [key#0,value1#1,value2#13] > +- SortMergeJoin [key#0], [key#12] >:- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation > [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort > [key#0 ASC], false, 0, None >+- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation > [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), > Sort [key#12 ASC], false, 0, None > {code} > And we get a super efficient join with no shuffle. > But if we add a filter our join gets less efficient and we end up with a > shuffle. 
> {code} > scala> partition1.join(partition2, "key").filter($"value1" === > $"value2").explain > == Physical Plan == > Project [key#0,value1#1,value2#13] > +- SortMergeJoin [value1#1,key#0], [value2#13,key#12] >:- Sort [value1#1 ASC,key#0 ASC], false, 0 >: +- TungstenExchange hashpartitioning(value1#1,key#0,200), None >: +- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation > [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort > [key#0 ASC], false, 0, None >+- Sort [value2#13 ASC,key#12 ASC], false, 0 > +- TungstenExchange hashpartitioning(value2#13,key#12,200), None > +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation > [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), > Sort [key#12 ASC], false, 0, None > {code} > And we can avoid the shuffle if use a filter statement that can't be pushed > in the join. > {code} > scala> partition1.join(partition2, "key").filter($"value1" >= > $"value2").explain > == Physical Plan == > Project [key#0,value1#1,value2#13] > +- Filter (value1#1 >= value2#13) >+- SortMergeJoin [key#0], [key#12] > :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation > [key#0,value1#1], tr
[jira] [Commented] (SPARK-18067) Adding filter after SortMergeJoin creates unnecessary shuffle
[ https://issues.apache.org/jira/browse/SPARK-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600824#comment-15600824 ] Tejas Patil commented on SPARK-18067: - [~hvanhovell] : Tagging you since you have context of this portion of the code. I feel that createPartitioning()'s use of `HashPartitioning` is hurting here. `HashPartitioning` has stricter semantics and feels like we could have something else. More explanation: Both children are hash partitioned on `key`. Assume these are the partitions for the children: ||partitions||child 1||child 2|| |partition 0|[0, 0, 0, 3]|[0, 0, 3, 3]| |partition 1|[1, 4, 4]|[4]| |partition 2|[2, 2]|[2, 5, 5, 5]| If we have *all* the same values of `key` in a given partition, then we can evaluate other join predicates like (`value1` = `value2`) right there without needing a shuffle. What is currently being done i.e. `HashPartitioning(key, value)` expects rows with same value of `pmod( hash(key, value))` to be in the same partition and does not take advantage of the fact that we already have rows with same `key` packed together. 
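The colocation argument in the comment above can be illustrated without Spark. In the plain-Python sketch below, `pmod(hash(...))` mimics how HashPartitioning maps a row to a partition: partitioning on `key` alone keeps every row of a key in one partition, while partitioning on `(key, value)` gives no such guarantee, which is why requiring it forces a shuffle:

```python
def pmod(h: int, n: int) -> int:
    # Positive modulus, as used when mapping a hash code to a partition.
    return ((h % n) + n) % n

rows = [("a", 1), ("a", 2), ("a", 3), ("b", 1)]
num_partitions = 4

# Partitioning on `key`: all rows sharing a key land in one partition,
# so extra predicates such as value1 = value2 could be checked locally
# during the merge join, with no further exchange.
parts_key = {pmod(hash(k), num_partitions) for k, _ in rows if k == "a"}
print(len(parts_key))  # 1 -> every 'a' row is colocated

# Partitioning on (key, value): rows with the same key may scatter
# across partitions, so an operator requiring this layout cannot reuse
# the existing key-based partitioning.
parts_kv = {pmod(hash((k, v)), num_partitions) for k, v in rows if k == "a"}
print(len(parts_kv))  # colocation of 'a' rows is no longer guaranteed
```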
> Adding filter after SortMergeJoin creates unnecessary shuffle > - > > Key: SPARK-18067 > URL: https://issues.apache.org/jira/browse/SPARK-18067 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Paul Jones >Priority: Minor > > Basic setup > {code} > scala> case class Data1(key: String, value1: Int) > scala> case class Data2(key: String, value2: Int) > scala> val partition1 = sc.parallelize(1 to 10).map(x => Data1(s"$x", x)) > .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache > scala> val partition2 = sc.parallelize(1 to 10).map(x => Data2(s"$x", x)) > .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache > {code} > Join on key > {code} > scala> partition1.join(partition2, "key").explain > == Physical Plan == > Project [key#0,value1#1,value2#13] > +- SortMergeJoin [key#0], [key#12] >:- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation > [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort > [key#0 ASC], false, 0, None >+- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation > [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), > Sort [key#12 ASC], false, 0, None > {code} > And we get a super efficient join with no shuffle. > But if we add a filter our join gets less efficient and we end up with a > shuffle. 
> {code} > scala> partition1.join(partition2, "key").filter($"value1" === > $"value2").explain > == Physical Plan == > Project [key#0,value1#1,value2#13] > +- SortMergeJoin [value1#1,key#0], [value2#13,key#12] >:- Sort [value1#1 ASC,key#0 ASC], false, 0 >: +- TungstenExchange hashpartitioning(value1#1,key#0,200), None >: +- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation > [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort > [key#0 ASC], false, 0, None >+- Sort [value2#13 ASC,key#12 ASC], false, 0 > +- TungstenExchange hashpartitioning(value2#13,key#12,200), None > +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation > [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), > Sort [key#12 ASC], false, 0, None > {code} > And we can avoid the shuffle if use a filter statement that can't be pushed > in the join. > {code} > scala> partition1.join(partition2, "key").filter($"value1" >= > $"value2").explain > == Physical Plan == > Project [key#0,value1#1,value2#13] > +- Filter (value1#1 >= value2#13) >+- SortMergeJoin [key#0], [key#12] > :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation > [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort > [key#0 ASC], false, 0, None > +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation > [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), > Sort [key#12 ASC], false, 0, None > {code} > What's the best way to avoid the filter pushdown here?? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18067) Adding filter after SortMergeJoin creates unnecessary shuffle
[ https://issues.apache.org/jira/browse/SPARK-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600820#comment-15600820 ] Tejas Patil commented on SPARK-18067: - The predicate `value1 === value2` is pushed down to the Join operator which makes sense. Although, later the optimizer could have avoided a shuffle. Here is whats happening: https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L185 {noformat} val useExistingPartitioning = children.zip(requiredChildDistributions).forall { case (child, distribution) => child.outputPartitioning.guarantees( createPartitioning(distribution, maxChildrenNumPartitions)) } {noformat} Child's `outputPartitioning` is `HashPartitioning(expressions = \[key\], numPartitions = 200)`. Partitioning returned by `createPartitioning()` is `HashPartitioning(expressions = \[value, key\], numPartitions = 200)`. Since they don't match, an extra shuffle is added. 
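The failing `useExistingPartitioning` check can be modeled in a few lines. Below is a simplified Python model of that comparison; plain structural equality of the expressions and partition count stands in for Spark's real, more involved `Partitioning.guarantees`, which is an assumption made only to show why the mismatch triggers an exchange:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HashPartitioning:
    # Simplified model: an ordered tuple of partitioning expressions
    # plus a partition count.
    expressions: tuple
    num_partitions: int

    def guarantees(self, required: "HashPartitioning") -> bool:
        # Structural equality is enough to reproduce the mismatch
        # described in the comment above (a deliberate simplification).
        return self == required

child = HashPartitioning(("key",), 200)
required = HashPartitioning(("value", "key"), 200)

# The child is already hash-partitioned on `key`, but the requirement
# derived from the join predicates is (value, key) -- no guarantee,
# so an extra shuffle (TungstenExchange) is inserted.
print(child.guarantees(required))                         # False
print(child.guarantees(HashPartitioning(("key",), 200)))  # True
```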
> Adding filter after SortMergeJoin creates unnecessary shuffle > - > > Key: SPARK-18067 > URL: https://issues.apache.org/jira/browse/SPARK-18067 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Paul Jones >Priority: Minor > > Basic setup > {code} > scala> case class Data1(key: String, value1: Int) > scala> case class Data2(key: String, value2: Int) > scala> val partition1 = sc.parallelize(1 to 10).map(x => Data1(s"$x", x)) > .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache > scala> val partition2 = sc.parallelize(1 to 10).map(x => Data2(s"$x", x)) > .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache > {code} > Join on key > {code} > scala> partition1.join(partition2, "key").explain > == Physical Plan == > Project [key#0,value1#1,value2#13] > +- SortMergeJoin [key#0], [key#12] >:- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation > [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort > [key#0 ASC], false, 0, None >+- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation > [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), > Sort [key#12 ASC], false, 0, None > {code} > And we get a super efficient join with no shuffle. > But if we add a filter our join gets less efficient and we end up with a > shuffle. 
> {code} > scala> partition1.join(partition2, "key").filter($"value1" === > $"value2").explain > == Physical Plan == > Project [key#0,value1#1,value2#13] > +- SortMergeJoin [value1#1,key#0], [value2#13,key#12] >:- Sort [value1#1 ASC,key#0 ASC], false, 0 >: +- TungstenExchange hashpartitioning(value1#1,key#0,200), None >: +- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation > [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort > [key#0 ASC], false, 0, None >+- Sort [value2#13 ASC,key#12 ASC], false, 0 > +- TungstenExchange hashpartitioning(value2#13,key#12,200), None > +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation > [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), > Sort [key#12 ASC], false, 0, None > {code} > And we can avoid the shuffle if we use a filter statement that can't be pushed > into the join. > {code} > scala> partition1.join(partition2, "key").filter($"value1" >= > $"value2").explain > == Physical Plan == > Project [key#0,value1#1,value2#13] > +- Filter (value1#1 >= value2#13) >+- SortMergeJoin [key#0], [key#12] > :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation > [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort > [key#0 ASC], false, 0, None > +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation > [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), > Sort [key#12 ASC], false, 0, None > {code} > What's the best way to avoid the filter pushdown here? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
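Tejas Patil's point above can be illustrated outside of Spark: hashing the same row by the expression list [key] versus [value1, key] generally yields different partition ids, so `HashPartitioning([key])` cannot guarantee `HashPartitioning([value1, key])` and the planner inserts an exchange. A minimal, hedged sketch in Java — `Objects.hash` stands in for Spark's Murmur3-based partitioning hash (an assumption for illustration only), and 200 matches the default `spark.sql.shuffle.partitions`:

```java
import java.util.Objects;

public class PartitioningMismatch {
    static final int NUM_PARTITIONS = 200;

    // Toy stand-in for hash partitioning: the partition id is the hash of
    // the partitioning expressions modulo the partition count. Spark uses
    // a Murmur3-based hash; Objects.hash is used here only for illustration.
    static int partition(Object... partitioningExprs) {
        return Math.floorMod(Objects.hash(partitioningExprs), NUM_PARTITIONS);
    }

    public static void main(String[] args) {
        String key = "7";
        int value1 = 7;
        // Child's existing partitioning: by [key].
        int byKey = partition(key);
        // Required distribution after the pushed-down predicate: by [value1, key].
        int byValueKey = partition(value1, key);
        // The two expression lists route the same row to (in general)
        // different partitions, so the existing partitioning gives no
        // guarantee and an extra shuffle must be added.
        System.out.println("by [key]: " + byKey + ", by [value1, key]: " + byValueKey);
    }
}
```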
[jira] [Comment Edited] (SPARK-18067) Adding filter after SortMergeJoin creates unnecessary shuffle
[ https://issues.apache.org/jira/browse/SPARK-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600788#comment-15600788 ] Tejas Patil edited comment on SPARK-18067 at 10/24/16 3:12 AM: --- You could do: {noformat} val joinedOutput = partition1.join(partition2, "key").cache joinedOutput.filter($"value1" >= $"value2").collect {noformat} was (Author: tejasp): You could do : {norformat} val joinedOutput = partition1.join(partition2, "key").cache joinedOutput.filter($"value1" >= $"value2").collect {norformat}
[jira] [Commented] (SPARK-18067) Adding filter after SortMergeJoin creates unnecessary shuffle
[ https://issues.apache.org/jira/browse/SPARK-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600788#comment-15600788 ] Tejas Patil commented on SPARK-18067: - You could do: {noformat} val joinedOutput = partition1.join(partition2, "key").cache joinedOutput.filter($"value1" >= $"value2").collect {noformat}
[jira] [Commented] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates
[ https://issues.apache.org/jira/browse/SPARK-18068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600732#comment-15600732 ] Stephane Maarek commented on SPARK-18068: - I see TimestampType is a wrapper for java.sql.Timestamp. It seems that it can't parse a string without seconds. {code} scala> import java.sql.Timestamp import java.sql.Timestamp scala> Timestamp.valueOf("2016-10-07T11:15Z") java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff] at java.sql.Timestamp.valueOf(Timestamp.java:204) ... 32 elided {code} A workaround would be to first convert to a date using the Java 8 time API and then pass it to the java.sql.Timestamp class. > Spark SQL doesn't parse some ISO 8601 formatted dates > - > > Key: SPARK-18068 > URL: https://issues.apache.org/jira/browse/SPARK-18068 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Stephane Maarek > > The following fail, but shouldn't according to the ISO 8601 standard (seconds > can be omitted). Not sure where the issue lies (probably an external library?) 
> {code} > scala> sc.parallelize(Seq("2016-10-07T11:15Z")) > res1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at > parallelize at :25 > scala> res1.toDF > res2: org.apache.spark.sql.DataFrame = [value: string] > scala> res2.select("value").show() > +-+ > |value| > +-+ > |2016-10-07T11:15Z| > +-+ > scala> import org.apache.spark.sql.types._ > import org.apache.spark.sql.types._ > scala> res2.select(col("value").cast(TimestampType)).show() > +-+ > |value| > +-+ > | null| > +-+ > {code} > And the schema usage errors out right away: > {code} > scala> val jsonRDD = sc.parallelize(Seq("""{"tst":"2016-10-07T11:15Z"}""")) > jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at > parallelize at :33 > scala> val schema = StructType(StructField("tst",TimestampType,true)::Nil) > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(tst,TimestampType,true)) > scala> val df = spark.read.schema(schema).json(jsonRDD) > df: org.apache.spark.sql.DataFrame = [tst: timestamp] > scala> df.show() > 16/10/24 13:06:27 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23) > java.lang.IllegalArgumentException: 2016-10-07T11:15Z > at > org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown > Source) > at > org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown > Source) > at > org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown > Source) > at > org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown > Source) > at > javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422) > at > javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417) > at > javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140) > at > 
org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:114) > at > org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertObject(JacksonParser.scala:215) > at > org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:182) > at > org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertRootField(JacksonParser.scala:73) > at > org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:288) > at > org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:285) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2366) > at > org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:285) > at > org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:280) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.sp
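The workaround Stephane sketches above can be made concrete. This is a hedged illustration, not Spark's actual fix: `parseIso` is a hypothetical helper name; the point is that `java.time`'s ISO 8601 parser accepts a timestamp whose seconds field is omitted, while `java.sql.Timestamp.valueOf` does not.

```java
import java.sql.Timestamp;
import java.time.OffsetDateTime;

public class IsoTimestampWorkaround {
    // Hypothetical helper: parse an ISO 8601 timestamp (seconds optional)
    // with java.time, then convert to java.sql.Timestamp.
    static Timestamp parseIso(String s) {
        OffsetDateTime odt = OffsetDateTime.parse(s); // accepts "2016-10-07T11:15Z"
        return Timestamp.from(odt.toInstant());
    }

    public static void main(String[] args) {
        // Timestamp.valueOf rejects the string from the bug report outright...
        try {
            Timestamp.valueOf("2016-10-07T11:15Z");
        } catch (IllegalArgumentException e) {
            System.out.println("valueOf failed: " + e.getMessage());
        }
        // ...but going through java.time works.
        System.out.println(parseIso("2016-10-07T11:15Z").toInstant());
        // prints 2016-10-07T11:15:00Z
    }
}
```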
[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600728#comment-15600728 ] Liwei Lin commented on SPARK-16845: --- The other solution is still under discussion & in progress. It'd be super helpful if you could create and provide the "explode-union-parquet" reproducer which causes the fix to fail. Thank you! > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > - > > Key: SPARK-16845 > URL: https://issues.apache.org/jira/browse/SPARK-16845 > Project: Spark > Issue Type: Bug > Components: Java API, ML, MLlib >Affects Versions: 2.0.0 >Reporter: hejie > > I have a wide table (400 columns); when I try fitting the train data on all > columns, this fatal error occurs. > ... 46 more > Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" > of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) > at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
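For context on the error being reported: the JVM rejects any single method whose bytecode exceeds 64 KB, and a generated ordering that inlines comparisons for all 400 columns into one `compare` method can exceed that limit. The usual mitigation is to split the work into many small units. The sketch below shows the splitting pattern using plain `Comparator` composition; it illustrates the principle only and is not Spark's actual generated code:

```java
import java.util.Comparator;

public class SplitOrdering {
    // Build a row ordering over many columns from small per-column pieces
    // instead of one giant inlined method. Each thenComparingInt step is a
    // tiny unit of work, mirroring how code generators split expressions
    // across helper methods to stay under the JVM's 64 KB method limit.
    static Comparator<int[]> orderingOver(int numColumns) {
        Comparator<int[]> ordering = (a, b) -> 0; // neutral base comparator
        for (int i = 0; i < numColumns; i++) {
            final int col = i;
            ordering = ordering.thenComparingInt(row -> row[col]);
        }
        return ordering;
    }

    public static void main(String[] args) {
        Comparator<int[]> ordering = orderingOver(400);
        int[] left = new int[400];
        int[] right = new int[400];
        right[399] = 1; // rows differ only in the last column
        System.out.println(ordering.compare(left, right) < 0); // prints true
    }
}
```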
[jira] [Updated] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates
[ https://issues.apache.org/jira/browse/SPARK-18068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephane Maarek updated SPARK-18068: Priority: Major (was: Critical)
[jira] [Commented] (SPARK-15487) Spark Master UI to reverse proxy Application and Workers UI
[ https://issues.apache.org/jira/browse/SPARK-15487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600710#comment-15600710 ] Matthew Farrellee commented on SPARK-15487: --- try just setting the proxy url to "/" > Spark Master UI to reverse proxy Application and Workers UI > --- > > Key: SPARK-15487 > URL: https://issues.apache.org/jira/browse/SPARK-15487 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.6.0, 1.6.1 >Reporter: Gurvinder >Assignee: Gurvinder >Priority: Minor > Fix For: 2.1.0 > > > Currently, when running in Standalone mode, the Spark UI's links to workers and > application drivers point to internal/protected network endpoints. So > to access the workers' or applications' UI, the user's machine has to connect to a VPN or needs > direct access to the internal network. > Therefore the proposal is to make the Spark master UI reverse proxy this > information back to the user, so only the Spark master UI needs to be opened up > to the internet. > The minimal change can be done by adding another route, e.g. > http://spark-master.com/target/<id>/, so when a request goes to target, the > ProxyServlet kicks in, takes the <id>, forwards the request to it, > and sends the response back to the user. > More information about the discussion of this feature can be found in this > mailing list thread: > http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-kubernetes-tc17599.html
[jira] [Updated] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates
[ https://issues.apache.org/jira/browse/SPARK-18068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephane Maarek updated SPARK-18068: Description: (same as the description quoted above)
[jira] [Commented] (SPARK-13331) AES support for over-the-wire encryption
[ https://issues.apache.org/jira/browse/SPARK-13331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600705#comment-15600705 ] Junjie Chen commented on SPARK-13331: - Hi [~vanzin] I have updated the patch; could you please help review it? Due to an issue (CRYPTO-125) in Commons Crypto, the patch has to use two helper channels. Once it is fixed and released, I will remove these channels. > AES support for over-the-wire encryption > > > Key: SPARK-13331 > URL: https://issues.apache.org/jira/browse/SPARK-13331 > Project: Spark > Issue Type: Improvement > Components: Deploy >Reporter: Dong Chen >Priority: Minor > > In network/common, SASL with DIGEST-MD5 authentication is used for > negotiating a secure communication channel. When the SASL operation mode is > "auth-conf", the data transferred on the network is encrypted. The DIGEST-MD5 > mechanism supports the following encryption algorithms: 3DES, DES, and RC4. The negotiation > procedure selects one of them to encrypt / decrypt the data on the > channel. > However, 3DES and RC4 are relatively slow. We could add code to the > negotiation to support AES for better security and performance. > The proposed solution is: > When "auth-conf" is enabled, at the end of the original negotiation, the > authentication succeeds and a secure channel is built. We could add one more > negotiation step: client and server negotiate whether they both support AES. > If yes, the key and IV used by AES will be generated by the server and sent to the > client through the already-secure channel. Then the encryption / > decryption handlers are updated to AES on both the client and server side. Subsequent data > transfer will use AES instead of the original encryption algorithm.
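The proposal's final step above (the server generates the AES key and IV, ships them over the already-secure channel, and both sides then swap their handlers to AES) maps directly onto the standard JCE API. A hedged sketch — the cipher transformation `AES/CTR/NoPadding` is an assumption chosen for illustration; the actual patch may negotiate a different mode:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

public class AesChannelSketch {
    public static void main(String[] args) throws Exception {
        // Server side of the proposed extra negotiation step: generate the
        // AES key and IV that would be sent to the client over the channel
        // already secured by DIGEST-MD5 "auth-conf".
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(128);
        SecretKey key = keyGen.generateKey();
        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);

        // The sender swaps its encryption handler to AES...
        Cipher enc = Cipher.getInstance("AES/CTR/NoPadding");
        enc.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        byte[] onTheWire = enc.doFinal("block data".getBytes(StandardCharsets.UTF_8));

        // ...and the receiver decrypts with the same key and IV.
        Cipher dec = Cipher.getInstance("AES/CTR/NoPadding");
        dec.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
        System.out.println(new String(dec.doFinal(onTheWire), StandardCharsets.UTF_8));
        // prints block data
    }
}
```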
[jira] [Updated] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates
[ https://issues.apache.org/jira/browse/SPARK-18068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephane Maarek updated SPARK-18068: Description: (same as the description quoted above, now using {code:scala} markers)
[jira] [Created] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates
Stephane Maarek created SPARK-18068:
---

Summary: Spark SQL doesn't parse some ISO 8601 formatted dates
Key: SPARK-18068
URL: https://issues.apache.org/jira/browse/SPARK-18068
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.0.1
Reporter: Stephane Maarek
Priority: Critical

The following fail, but shouldn't according to the ISO 8601 standard (seconds can be omitted). Not sure where the issue lies (probably an external library?)

```
scala> sc.parallelize(Seq("2016-10-07T11:15Z"))
res1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:25

scala> res1.toDF
res2: org.apache.spark.sql.DataFrame = [value: string]

scala> res2.select("value").show()
+-----------------+
|            value|
+-----------------+
|2016-10-07T11:15Z|
+-----------------+

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> res2.select(col("value").cast(TimestampType)).show()
+-----+
|value|
+-----+
| null|
+-----+
```

And the schema errors out right away:

```
scala> val jsonRDD = sc.parallelize(Seq("""{"tst":"2016-10-07T11:15Z"}"""))
jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at parallelize at <console>:33

scala> val schema = StructType(StructField("tst",TimestampType,true)::Nil)
schema: org.apache.spark.sql.types.StructType = StructType(StructField(tst,TimestampType,true))

scala> val df = spark.read.schema(schema).json(jsonRDD)
df: org.apache.spark.sql.DataFrame = [tst: timestamp]

scala> df.show()
16/10/24 13:06:27 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23)
java.lang.IllegalArgumentException: 2016-10-07T11:15Z
  at org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown Source)
  at org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown Source)
  at org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.<init>(Unknown Source)
  at org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown Source)
  at javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
  at javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
  at javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
  at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
  at org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:114)
  at org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertObject(JacksonParser.scala:215)
  at org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:182)
  at org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertRootField(JacksonParser.scala:73)
  at org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:288)
  at org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:285)
  at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2366)
  at org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:285)
  at org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:280)
  at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
  at org.apache.spark.scheduler.Task.run(Task.scala:85)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
16/10/24 13:06:27 WARN TaskSetManager: Lost task 6.0 in stage 5.0 (TID 23, localhost): ja
```
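The reporter's reading of ISO 8601 is correct: seconds are a lower-order component that may be omitted. Outside Spark, a minimal Python sketch (stdlib only, not Spark or Catalyst code) showing that a minutes-precision pattern accepts the reported timestamp while a seconds-required pattern, like the `javax.xml.bind` path in the stack trace above, rejects it:

```python
from datetime import datetime

TS = "2016-10-07T11:15Z"  # the timestamp from the report: no seconds field

# Lenient pattern: minutes precision, UTC designator "Z" accepted by %z (Python 3.7+).
parsed = datetime.strptime(TS, "%Y-%m-%dT%H:%M%z")
print(parsed.isoformat())  # 2016-10-07T11:15:00+00:00

# A seconds-required pattern rejects the very same string, mirroring the
# IllegalArgumentException thrown by the strict XML datetime parser above.
try:
    datetime.strptime(TS, "%Y-%m-%dT%H:%M:%S%z")
except ValueError as err:
    print("rejected:", err)
```

A lenient parser would normalize the missing seconds to `:00` rather than fail, which is what the report argues Spark's cast and JSON reader should do.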
[jira] [Commented] (SPARK-18064) Spark SQL can't load default config file
[ https://issues.apache.org/jira/browse/SPARK-18064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600483#comment-15600483 ] Hyukjin Kwon commented on SPARK-18064: -- Could you please fill up the JIRA description? I would like to reproduce this. It'd be great if there are some steps to check this. > Spark SQL can't load default config file > - > > Key: SPARK-18064 > URL: https://issues.apache.org/jira/browse/SPARK-18064 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: darion yaphet > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18067) Adding filter after SortMergeJoin creates unnecessary shuffle
[ https://issues.apache.org/jira/browse/SPARK-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Jones updated SPARK-18067:
---

Description:

Basic setup
{code}
scala> case class Data1(key: String, value1: Int)
scala> case class Data2(key: String, value2: Int)
scala> val partition1 = sc.parallelize(1 to 10).map(x => Data1(s"$x", x))
  .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
scala> val partition2 = sc.parallelize(1 to 10).map(x => Data2(s"$x", x))
  .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
{code}

Join on key
{code}
scala> partition1.join(partition2, "key").explain
== Physical Plan ==
Project [key#0,value1#1,value2#13]
+- SortMergeJoin [key#0], [key#12]
   :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort [key#0 ASC], false, 0, None
   +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), Sort [key#12 ASC], false, 0, None
{code}
And we get a super efficient join with no shuffle. But if we add a filter, our join gets less efficient and we end up with a shuffle.
{code}
scala> partition1.join(partition2, "key").filter($"value1" === $"value2").explain
== Physical Plan ==
Project [key#0,value1#1,value2#13]
+- SortMergeJoin [value1#1,key#0], [value2#13,key#12]
   :- Sort [value1#1 ASC,key#0 ASC], false, 0
   :  +- TungstenExchange hashpartitioning(value1#1,key#0,200), None
   :     +- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort [key#0 ASC], false, 0, None
   +- Sort [value2#13 ASC,key#12 ASC], false, 0
      +- TungstenExchange hashpartitioning(value2#13,key#12,200), None
         +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), Sort [key#12 ASC], false, 0, None
{code}
And we can avoid the shuffle if we use a filter statement that can't be pushed into the join.
{code}
scala> partition1.join(partition2, "key").filter($"value1" >= $"value2").explain
== Physical Plan ==
Project [key#0,value1#1,value2#13]
+- Filter (value1#1 >= value2#13)
   +- SortMergeJoin [key#0], [key#12]
      :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort [key#0 ASC], false, 0, None
      +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), Sort [key#12 ASC], false, 0, None
{code}
What's the best way to avoid the filter pushdown here?
> Adding filter after SortMergeJoin creates unnecessary shuffle > - > > Key: SPARK-18067 > URL: https://issues.apache.org/jira/browse/SPARK-18067 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Paul Jones >Priority: Minor > > Basic setup > {code} > scala> case class Data1(key: String, value1: Int) > scala> case class Data2(key: String, value2: Int) > scala> val partition1 = sc.parallelize(1 to 10).map(x => Data1(s"$x", x)) > .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache > scala> val partition2 = sc.parallelize(1 to 10).map(x => Data2(s"$x", x)) > .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache > {code} > Join on key > {code} > scala> partition1.join(partition2, "key").explain > == Physical Plan == > Project [key#0,value1#1,value2#13] > +- SortMergeJoin [key#0], [key#12] >:- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation > [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort > [key#0 ASC], false, 0, None >+- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation > [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), > Sort [key#12 ASC], false, 0, None > {code} > And we get a super efficient join with no shuffle. > But if we add a filter our join gets less efficient and we end up with a > shuffle. > {code} > scala> partition1.join(partition2, "key").filter($"value1" === > $"value2").explain > == Physical Plan == > Project [key#0,value1#1,value2#13] > +- SortMergeJoin [value1#1,key#0], [value2#13,key#12] >:- Sort [value1#1 ASC,key#0 ASC], false, 0 >: +- TungstenExchange hashpartitioning(value1#1,key#0,200), None >: +- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation > [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort > [key#0 ASC], false, 0, None >+- Sort [value2#13 ASC,key#12 ASC], false, 0 > +- TungstenExchange hashpartitioning(v
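The plans above suggest what is going on: the equality predicate is folded into the join condition, so the join keys grow from `(key)` to `(value1, key)`/`(value2, key)`, which no longer matches the existing `(key)` partitioning and forces a `TungstenExchange`. A rough Python model of that key-selection decision (hypothetical function and names, not Catalyst's actual rule):

```python
# Hypothetical model: an equality predicate between columns of opposite join
# sides is absorbed as an extra join-key pair; any other predicate stays as a
# post-join Filter, leaving the original (key) partitioning usable.
def join_keys(base_keys, predicate):
    """predicate: (op, left_col, right_col) -> (left_keys, right_keys, residual)."""
    op, lcol, rcol = predicate
    if op == "==":
        # Extra key pair: output partitioning no longer matches base_keys alone,
        # so a shuffle is inserted, as in the second plan above.
        return base_keys + [lcol], base_keys + [rcol], None
    return base_keys, base_keys, predicate  # e.g. ">=" stays above the join

print(join_keys(["key"], ("==", "value1", "value2")))
# (['key', 'value1'], ['key', 'value2'], None)
print(join_keys(["key"], (">=", "value1", "value2")))
# (['key'], ['key'], ('>=', 'value1', 'value2'))
```

This also matches the reporter's observation that a non-equality filter (`>=`) keeps the shuffle-free plan: it cannot become a join key, so it stays as a `Filter` on top of the `SortMergeJoin`.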
[jira] [Created] (SPARK-18067) Adding filter after SortMergeJoin creates unnecessary shuffle
Paul Jones created SPARK-18067: -- Summary: Adding filter after SortMergeJoin creates unnecessary shuffle Key: SPARK-18067 URL: https://issues.apache.org/jira/browse/SPARK-18067 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.1 Reporter: Paul Jones Priority: Minor
[jira] [Commented] (SPARK-15487) Spark Master UI to reverse proxy Application and Workers UI
[ https://issues.apache.org/jira/browse/SPARK-15487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600365#comment-15600365 ] Cédric Hernalsteens commented on SPARK-15487:
-
While it works great to proxy to the slave nodes (running in docker containers with ports not exposed), there seem to be some inconsistencies in the way the URIs are generated. I specified
- SPARK_MASTER_OPTS=-Dspark.ui.reverseProxyUrl=http://dev-machine-1/spark/ -Dspark.ui.reverseProxy=true
- SPARK_WORKER_OPTS=-Dspark.ui.reverseProxyUrl=http://dev-machine-1/spark/ -Dspark.ui.reverseProxy=true
hoping to be able to reverse proxy the whole thing with nginx:
location /spark/ {
    proxy_pass http://spark-master:8080/;
    proxy_set_header X-Forwarded-Host $host;
    proxy_set_header X-Forwarded-Server $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
However, in the Master web UI the URIs for the slave nodes are http://dev-machine-1/proxy/... and not http://dev-machine-1/spark/proxy/... Then I decided to use an nginx "location /proxy/" directive, which is a bit ugly, but then on the Worker web UI the URI "back to master" is http://dev-machine-1/spark , apparently taking spark.ui.reverseProxyUrl into account. Would someone be able to point to what I'm doing wrong, or confirm this is an issue? > Spark Master UI to reverse proxy Application and Workers UI > --- > > Key: SPARK-15487 > URL: https://issues.apache.org/jira/browse/SPARK-15487 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.6.0, 1.6.1 >Reporter: Gurvinder >Assignee: Gurvinder >Priority: Minor > Fix For: 2.1.0 > > > Currently when running in Standalone mode, Spark UI's link to workers and > application drivers are pointing to internal/protected network endpoints. So > to access workers/application UI user's machine has to connect to VPN or need > to have access to internal network directly. 
> Therefore the proposal is to make Spark master UI reverse proxy this > information back to the user. So only Spark master UI needs to be opened up > to internet. > The minimal changes can be done by adding another route e.g. > http://spark-master.com/target// so when request goes to target, > ProxyServlet kicks in and takes the and forwards the request to it > and send response back to user. > More information about discussions for this features can be found on this > mailing list thread > http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-kubernetes-tc17599.html
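The inconsistency the commenter describes is easy to state concretely: worker links on the Master UI appear to be built from the absolute `/proxy/` path instead of being prefixed with the configured `spark.ui.reverseProxyUrl`, while the Worker UI's "back to master" link does apply the prefix. A small Python sketch of what consistent prefixing would look like (hypothetical helper and worker ID, not Spark's actual code):

```python
def proxied_uri(reverse_proxy_url, path):
    """Prefix a generated UI path with the configured reverseProxyUrl.

    With reverseProxyUrl = http://dev-machine-1/spark/, a worker link
    /proxy/worker-1234 should become http://dev-machine-1/spark/proxy/worker-1234,
    not http://dev-machine-1/proxy/worker-1234 as observed on the Master UI.
    """
    return reverse_proxy_url.rstrip("/") + "/" + path.lstrip("/")

print(proxied_uri("http://dev-machine-1/spark/", "/proxy/worker-1234"))
# http://dev-machine-1/spark/proxy/worker-1234
```

If every UI component applied this prefix uniformly, the single `location /spark/` nginx block from the comment would suffice, with no extra `location /proxy/` workaround.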
[jira] [Assigned] (SPARK-18066) Add Pool usage policies test coverage for FIFO & FAIR Schedulers
[ https://issues.apache.org/jira/browse/SPARK-18066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18066: Assignee: Apache Spark > Add Pool usage policies test coverage for FIFO & FAIR Schedulers > > > Key: SPARK-18066 > URL: https://issues.apache.org/jira/browse/SPARK-18066 > Project: Spark > Issue Type: Test > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Eren Avsarogullari >Assignee: Apache Spark >Priority: Minor > > The following Pool usage cases need to have unit test coverage: > - FIFO Scheduler just uses the *root pool*, so even if the *spark.scheduler.pool* > property is set, the related pool is not created and *TaskSetManagers* are added > to the root pool. > - FAIR Scheduler uses the default pool when the *spark.scheduler.pool* property is > not set. This can happen when the Properties object is null or empty (*new > Properties()*) or points to the default pool (*spark.scheduler.pool*=_default_). > - FAIR Scheduler creates a new pool with default values when the > *spark.scheduler.pool* property points to a _non-existent_ pool. This can happen > when the scheduler allocation file is not set or does not contain the related pool.
[jira] [Assigned] (SPARK-18066) Add Pool usage policies test coverage for FIFO & FAIR Schedulers
[ https://issues.apache.org/jira/browse/SPARK-18066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18066: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-18066) Add Pool usage policies test coverage for FIFO & FAIR Schedulers
[ https://issues.apache.org/jira/browse/SPARK-18066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600310#comment-15600310 ] Apache Spark commented on SPARK-18066: -- User 'erenavsarogullari' has created a pull request for this issue: https://github.com/apache/spark/pull/15604
[jira] [Updated] (SPARK-18066) Add Pool usage policies test coverage for FIFO & FAIR Schedulers
[ https://issues.apache.org/jira/browse/SPARK-18066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eren Avsarogullari updated SPARK-18066: --- Summary: Add Pool usage policies test coverage for FIFO & FAIR Schedulers (was: Add Pool usage policies test coverage to FIFO & FAIR Schedulers)
[jira] [Updated] (SPARK-18066) Add Pool usage policies test coverage to FIFO & FAIR Schedulers
[ https://issues.apache.org/jira/browse/SPARK-18066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eren Avsarogullari updated SPARK-18066: --- Description: The following Pool usage cases need to have Unit test coverage : - FIFO Scheduler just uses *root pool* so even if *spark.scheduler.pool* property is set, related pool is not created and *TaskSetManagers* are added to root pool. - FAIR Scheduler uses default pool when *spark.scheduler.pool* property is not set. This can be happened when Properties object is null or empty(*new Properties()*) or points default pool(*spark.scheduler.pool*=_default_). - FAIR Scheduler creates a new pool with default values when *spark.scheduler.pool* property points _non-existent_ pool. This can be happened when scheduler allocation file is not set or it does not contain related pool. was: The following Pool usage cases need to have Unit test coverage : - FIFO Scheduler just uses *root pool* so even if {code:java}spark.scheduler.pool{code} property is set, related pool is not created and {code:java}TaskSetManagers{code} are added to root pool. - FAIR Scheduler uses default pool when spark.scheduler.pool property is not set. This can be happened when Properties object is null or empty(new Properties()) or points default pool(spark.scheduler.pool=default). - FAIR Scheduler creates a new pool with default values when spark.scheduler.pool property points non-existent pool. This can be happened when scheduler allocation file is not set or it does not contain related pool. 
[jira] [Updated] (SPARK-18066) Add Pool usage policies test coverage to FIFO & FAIR Schedulers
[ https://issues.apache.org/jira/browse/SPARK-18066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eren Avsarogullari updated SPARK-18066: --- Description: The following Pool usage cases need to have Unit test coverage : - FIFO Scheduler just uses *root pool* so even if {code:java}spark.scheduler.pool{code} property is set, related pool is not created and {code:java}TaskSetManagers{code} are added to root pool. - FAIR Scheduler uses default pool when spark.scheduler.pool property is not set. This can be happened when Properties object is null or empty(new Properties()) or points default pool(spark.scheduler.pool=default). - FAIR Scheduler creates a new pool with default values when spark.scheduler.pool property points non-existent pool. This can be happened when scheduler allocation file is not set or it does not contain related pool. was: The following Pool usage cases need to have Unit test coverage : - FIFO Scheduler just uses root pool so even if spark.scheduler.pool property is set, related pool is not created and TaskSetManagers are added to root pool. - FAIR Scheduler uses default pool when spark.scheduler.pool property is not set. This can be happened when Properties object is null or empty(new Properties()) or points default pool(spark.scheduler.pool=default). - FAIR Scheduler creates a new pool with default values when spark.scheduler.pool property points non-existent pool. This can be happened when scheduler allocation file is not set or it does not contain related pool. 
[jira] [Created] (SPARK-18066) Add Pool usage policies test coverage to FIFO & FAIR Schedulers
Eren Avsarogullari created SPARK-18066: -- Summary: Add Pool usage policies test coverage to FIFO & FAIR Schedulers Key: SPARK-18066 URL: https://issues.apache.org/jira/browse/SPARK-18066 Project: Spark Issue Type: Test Components: Scheduler Affects Versions: 2.1.0 Reporter: Eren Avsarogullari Priority: Minor The following Pool usage cases need to have unit test coverage: - FIFO Scheduler just uses the root pool, so even if the spark.scheduler.pool property is set, the related pool is not created and TaskSetManagers are added to the root pool. - FAIR Scheduler uses the default pool when the spark.scheduler.pool property is not set. This can happen when the Properties object is null or empty (new Properties()) or points to the default pool (spark.scheduler.pool=default). - FAIR Scheduler creates a new pool with default values when the spark.scheduler.pool property points to a non-existent pool. This can happen when the scheduler allocation file is not set or does not contain the related pool.
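The three cases to be covered reduce to a small fallback decision. A Python sketch of that behavior as described in the issue (hypothetical structure and default values are assumptions, not the Scala scheduler code):

```python
def select_pool(mode, prop, pools):
    """Model of the pool-selection fallback described above.

    mode:  "FIFO" or "FAIR"
    prop:  value of spark.scheduler.pool, or None when unset
    pools: pools declared in the scheduler allocation file (mutated on demand)
    """
    if mode == "FIFO":
        return "root"  # property is ignored; TaskSetManagers go to the root pool
    if not prop or prop == "default":
        return "default"  # unset/empty Properties, or explicitly the default pool
    if prop not in pools:
        # Non-existent pool: create it with default values (assumed defaults here).
        pools[prop] = {"schedulingMode": "FIFO", "minShare": 0, "weight": 1}
    return prop

pools = {}
print(select_pool("FIFO", "myPool", pools))  # root
print(select_pool("FAIR", None, pools))      # default
print(select_pool("FAIR", "myPool", pools))  # myPool (created with defaults)
```

Each branch of this function corresponds to one of the three unit-test cases the issue asks for.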
[jira] [Updated] (SPARK-18058) AnalysisException may be thrown when union two DFs whose struct fields have different nullability
[ https://issues.apache.org/jira/browse/SPARK-18058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-18058: -- Fix Version/s: 2.0.2 > AnalysisException may be thrown when union two DFs whose struct fields have > different nullability > - > > Key: SPARK-18058 > URL: https://issues.apache.org/jira/browse/SPARK-18058 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 2.0.1 >Reporter: Cheng Lian >Assignee: Nan Zhu > Fix For: 2.0.2, 2.1.0 > > > The following Spark shell snippet reproduces this issue: > {code} > spark.range(10).createOrReplaceTempView("t1") > spark.range(10).map(i => i: > java.lang.Long).toDF("id").createOrReplaceTempView("t2") > sql("SELECT struct(id) FROM t1 UNION ALL SELECT struct(id) FROM t2") > {code} > {noformat} > org.apache.spark.sql.AnalysisException: Union can only be performed on tables > with the compatible column types. StructType(StructField(id,LongType,true)) > <> StructType(StructField(id,LongType,false)) at the first column of the > second table; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11$$anonfun$apply$12.apply(CheckAnalysis.scala:291) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11$$anonfun$apply$12.apply(CheckAnalysis.scala:289) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11.apply(CheckAnalysis.scala:289) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11.apply(CheckAnalysis.scala:278) > at scala.collection.immutable.List.foreach(List.scala:381) > at > 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:278) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:132) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:61) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:573) > ... 50 elided > {noformat} > The reason is that we treat two {{StructType}} as incompatible even if they only differ from each other in field nullability.
[jira] [Commented] (SPARK-18065) Spark 2 allows filter/where on columns not in current schema
[ https://issues.apache.org/jira/browse/SPARK-18065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600088#comment-15600088 ] Matthew Scruggs commented on SPARK-18065: - Makes sense [~hvanhovell], but the change in behavior from 1.6 to 2 was a bit unexpected. This might be more of a documentation/API issue since the behavior is valid but different than before. > Spark 2 allows filter/where on columns not in current schema > > > Key: SPARK-18065 > URL: https://issues.apache.org/jira/browse/SPARK-18065 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0, 2.0.1 >Reporter: Matthew Scruggs > > I noticed in Spark 2 (unlike 1.6) it's possible to use filter/where on a > DataFrame that previously had a column, but no longer has it in its schema > due to a select() operation. > In Spark 1.6.2, in spark-shell, we see that an exception is thrown when > attempting to filter/where using the selected-out column: > {code:title=Spark 1.6.2} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 1.6.2 > /_/ > Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60) > Type in expressions to have them evaluated. > Type :help for more information. > Spark context available as sc. > SQL context available as sqlContext. 
> scala> val df1 = sqlContext.createDataFrame(sc.parallelize(Seq((1, "one"), > (2, "two")))).selectExpr("_1 as id", "_2 as word") > df1: org.apache.spark.sql.DataFrame = [id: int, word: string] > scala> df1.show() > +---+----+ > | id|word| > +---+----+ > | 1| one| > | 2| two| > +---+----+ > scala> val df2 = df1.select("id") > df2: org.apache.spark.sql.DataFrame = [id: int] > scala> df2.printSchema() > root > |-- id: integer (nullable = false) > scala> df2.where("word = 'one'").show() > org.apache.spark.sql.AnalysisException: cannot resolve 'word' given input > columns: [id]; > {code} > However in Spark 2.0.0 and 2.0.1, we see that the same filter/where succeeds > (no AnalysisException) and seems to filter out data as if the column remains: > {code:title=Spark 2.0.1} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.0.1 > /_/ > > Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60) > Type in expressions to have them evaluated. > Type :help for more information. > scala> val df1 = sc.parallelize(Seq((1, "one"), (2, > "two"))).toDF().selectExpr("_1 as id", "_2 as word") > df1: org.apache.spark.sql.DataFrame = [id: int, word: string] > scala> df1.show() > +---+----+ > | id|word| > +---+----+ > | 1| one| > | 2| two| > +---+----+ > scala> val df2 = df1.select("id") > df2: org.apache.spark.sql.DataFrame = [id: int] > scala> df2.printSchema() > root > |-- id: integer (nullable = false) > scala> df2.where("word = 'one'").show() > +---+ > | id| > +---+ > | 1| > +---+ > {code}
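The Spark 2 behavior in the transcripts above can be modeled as a two-step resolution: the filter column is looked up in the DataFrame's current schema first and, failing that, in the underlying (pre-select) child schema, after which the original projection is re-applied. A rough Python sketch of that resolution order (hypothetical model, not the actual analyzer rule):

```python
def where(rows, current_cols, underlying_cols, col, value):
    """Filter `rows` on col == value.

    Spark 1.6-style: fail if `col` was selected out of the current schema.
    Spark 2.0-style: fall back to the underlying child schema, filter on the
    hidden column, and keep only the current projection in the output.
    """
    if col in current_cols:
        return [r for r in rows if r[col] == value]
    if col in underlying_cols:
        # 2.0 behavior: 'word' is resolvable through the child of the select(),
        # so the filter succeeds even though df2's schema is only [id].
        return [{c: r[c] for c in current_cols} for r in rows if r[col] == value]
    raise ValueError(f"cannot resolve '{col}' given input columns: {current_cols}")

rows = [{"id": 1, "word": "one"}, {"id": 2, "word": "two"}]
print(where(rows, ["id"], ["id", "word"], "word", "one"))  # [{'id': 1}]
```

The 1.6 behavior corresponds to skipping the fallback branch entirely and raising the `AnalysisException` shown in the first transcript.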
[jira] [Commented] (SPARK-18058) AnalysisException may be thrown when union two DFs whose struct fields have different nullability
[ https://issues.apache.org/jira/browse/SPARK-18058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600060#comment-15600060 ] Apache Spark commented on SPARK-18058: -- User 'CodingCat' has created a pull request for this issue: https://github.com/apache/spark/pull/15602 > AnalysisException may be thrown when union two DFs whose struct fields have > different nullability > - > > Key: SPARK-18058 > URL: https://issues.apache.org/jira/browse/SPARK-18058 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 2.0.1 >Reporter: Cheng Lian >Assignee: Nan Zhu > Fix For: 2.1.0 > > > The following Spark shell snippet reproduces this issue: > {code} > spark.range(10).createOrReplaceTempView("t1") > spark.range(10).map(i => i: > java.lang.Long).toDF("id").createOrReplaceTempView("t2") > sql("SELECT struct(id) FROM t1 UNION ALL SELECT struct(id) FROM t2") > {code} > {noformat} > org.apache.spark.sql.AnalysisException: Union can only be performed on tables > with the compatible column types. 
StructType(StructField(id,LongType,true)) > <> StructType(StructField(id,LongType,false)) at the first column of the > second table; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11$$anonfun$apply$12.apply(CheckAnalysis.scala:291) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11$$anonfun$apply$12.apply(CheckAnalysis.scala:289) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11.apply(CheckAnalysis.scala:289) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11.apply(CheckAnalysis.scala:278) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:278) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:132) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:61) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:573) > ... 50 elided > {noformat} > The reason is that two {{StructType}}s are treated as incompatible even if they > differ only in field nullability. 
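The nullability mismatch above boils down to a structural comparison that should ignore the `nullable` flag. A small self-contained Python sketch of that idea (my own illustration with a hypothetical `union_compatible` helper, not Spark's actual Catalyst code):

```python
# Hypothetical sketch of the fix direction for SPARK-18058: when deciding
# whether two struct types are union-compatible, field nullability should
# be ignored -- only field names and value types must match.
from dataclasses import dataclass

@dataclass(frozen=True)
class StructField:
    name: str
    data_type: str
    nullable: bool

def union_compatible(left, right):
    """True if the two field lists agree on name and type, regardless of
    nullability. The pre-fix check also compared `nullable`, which made
    StructField(id,LongType,true) <> StructField(id,LongType,false)."""
    if len(left) != len(right):
        return False
    return all(a.name == b.name and a.data_type == b.data_type
               for a, b in zip(left, right))

t1 = [StructField("id", "LongType", True)]   # struct(id) over a bigint view
t2 = [StructField("id", "LongType", False)]  # same, but non-nullable
print(union_compatible(t1, t2))  # True once nullability is ignored
```

With the nullability-sensitive comparison, the same pair would be rejected, which is exactly the `AnalysisException` in the report.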
[jira] [Resolved] (SPARK-18058) AnalysisException may be thrown when union two DFs whose struct fields have different nullability
[ https://issues.apache.org/jira/browse/SPARK-18058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-18058. --- Resolution: Fixed Assignee: Nan Zhu Fix Version/s: 2.1.0 > AnalysisException may be thrown when union two DFs whose struct fields have > different nullability > - > > Key: SPARK-18058 > URL: https://issues.apache.org/jira/browse/SPARK-18058 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 2.0.1 >Reporter: Cheng Lian >Assignee: Nan Zhu > Fix For: 2.1.0 > > > The following Spark shell snippet reproduces this issue: > {code} > spark.range(10).createOrReplaceTempView("t1") > spark.range(10).map(i => i: > java.lang.Long).toDF("id").createOrReplaceTempView("t2") > sql("SELECT struct(id) FROM t1 UNION ALL SELECT struct(id) FROM t2") > {code} > {noformat} > org.apache.spark.sql.AnalysisException: Union can only be performed on tables > with the compatible column types. StructType(StructField(id,LongType,true)) > <> StructType(StructField(id,LongType,false)) at the first column of the > second table; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11$$anonfun$apply$12.apply(CheckAnalysis.scala:291) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11$$anonfun$apply$12.apply(CheckAnalysis.scala:289) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11.apply(CheckAnalysis.scala:289) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11.apply(CheckAnalysis.scala:278) > at scala.collection.immutable.List.foreach(List.scala:381) > at > 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:278) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:132) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:61) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:573) > ... 50 elided > {noformat} > The reason is that two {{StructType}}s are treated as incompatible even if they > differ only in field nullability.
[jira] [Commented] (SPARK-17495) Hive hash implementation
[ https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15599977#comment-15599977 ] Tejas Patil commented on SPARK-17495: - [~rxin]: Sorry about that. In my original PR I intentionally did not introduce any usages of the Hive hash function in the rest of the code, to keep the PR atomic. However, I did not intend to have the Jira closed. There are two places off the top of my head where the Hive hash can be used: - When hash() is called as a function in the user query. I will work on this. - When hash partitioning is done. Would it be possible to get https://github.com/apache/spark/pull/15300 reviewed? After that's in, it will allow me to do this in a meaningful way (at least that was my main objective behind this Jira). > Hive hash implementation > > > Key: SPARK-17495 > URL: https://issues.apache.org/jira/browse/SPARK-17495 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Tejas Patil >Assignee: Tejas Patil >Priority: Minor > Fix For: 2.1.0 > > > Spark internally uses Murmur3Hash for partitioning. This is different from > the one used by Hive. For queries which use bucketing, this leads to different > results if one runs the same query on both engines. For us, we want users to > have backward compatibility so that one can switch parts of applications > across the engines without observing regressions.
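For context on what "Hive hash" means here: for string columns, Hive's bucketing hash follows the Java `String.hashCode` scheme (h = 31*h + c with 32-bit signed overflow), which is indeed different from Murmur3. A minimal Python sketch of that scheme (an illustration of the scheme only, not Spark's `HiveHash` expression):

```python
def hive_hash_string(s: str) -> int:
    """Java String.hashCode-style hash (h = 31*h + c) with 32-bit signed
    wraparound -- the scheme Hive uses when bucketing string columns.
    Illustrative sketch, not the actual Spark/Hive implementation."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # reinterpret as a signed 32-bit int, like Java's int overflow
    return h - 0x100000000 if h >= 0x80000000 else h

print(hive_hash_string("abc"))  # 96354, same as "abc".hashCode() in Java
```

Because Murmur3 and this scheme place the same key in different buckets, a bucketed table written by one engine cannot be bucket-pruned or bucket-joined correctly by the other, which is the compatibility gap the issue describes.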
[jira] [Commented] (SPARK-17495) Hive hash implementation
[ https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15599965#comment-15599965 ] Tejas Patil commented on SPARK-17495: - [~hvanhovell]: There were two datatypes that I had added TODOs for in the code: Decimal and date-related types. I will submit a PR. https://github.com/tejasapatil/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala#L631 > Hive hash implementation > > > Key: SPARK-17495 > URL: https://issues.apache.org/jira/browse/SPARK-17495 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Tejas Patil >Assignee: Tejas Patil >Priority: Minor > Fix For: 2.1.0 > > > Spark internally uses Murmur3Hash for partitioning. This is different from > the one used by Hive. For queries which use bucketing, this leads to different > results if one runs the same query on both engines. For us, we want users to > have backward compatibility so that one can switch parts of applications > across the engines without observing regressions.
[jira] [Commented] (SPARK-18065) Spark 2 allows filter/where on columns not in current schema
[ https://issues.apache.org/jira/browse/SPARK-18065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15599951#comment-15599951 ] Herman van Hovell commented on SPARK-18065: --- This is - unfortunately - not really a bug. The SQL spec allows you to order a result set based on column that is not in the projection, see TPC-DS query 98 for an example: {noformat} SELECT i_item_desc, i_category, i_class, i_current_price, sum(ss_ext_sales_price) AS itemrevenue, sum(ss_ext_sales_price) * 100 / sum(sum(ss_ext_sales_price)) OVER (PARTITION BY i_class) AS revenueratio FROM store_sales, item, date_dim WHERE ss_item_sk = i_item_sk AND i_category IN ('Sports', 'Books', 'Home') AND ss_sold_date_sk = d_date_sk AND d_date BETWEEN cast('1999-02-22' AS DATE) AND (cast('1999-02-22' AS DATE) + INTERVAL 30 days) GROUP BY i_item_id, i_item_desc, i_category, i_class, i_current_price ORDER BY i_category, i_class, i_item_id, i_item_desc, revenueratio {noformat} In Spark 1.6 we only resolved such a column if it was part of the child's child. In spark 2.0 we search the entire child tree. > Spark 2 allows filter/where on columns not in current schema > > > Key: SPARK-18065 > URL: https://issues.apache.org/jira/browse/SPARK-18065 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0, 2.0.1 >Reporter: Matthew Scruggs > > I noticed in Spark 2 (unlike 1.6) it's possible to use filter/where on a > DataFrame that previously had a column, but no longer has it in its schema > due to a select() operation. > In Spark 1.6.2, in spark-shell, we see that an exception is thrown when > attempting to filter/where using the selected-out column: > {code:title=Spark 1.6.2} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 1.6.2 > /_/ > Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60) > Type in expressions to have them evaluated. > Type :help for more information. 
> Spark context available as sc. > SQL context available as sqlContext. > scala> val df1 = sqlContext.createDataFrame(sc.parallelize(Seq((1, "one"), > (2, "two".selectExpr("_1 as id", "_2 as word") > df1: org.apache.spark.sql.DataFrame = [id: int, word: string] > scala> df1.show() > +---++ > | id|word| > +---++ > | 1| one| > | 2| two| > +---++ > scala> val df2 = df1.select("id") > df2: org.apache.spark.sql.DataFrame = [id: int] > scala> df2.printSchema() > root > |-- id: integer (nullable = false) > scala> df2.where("word = 'one'").show() > org.apache.spark.sql.AnalysisException: cannot resolve 'word' given input > columns: [id]; > {code} > However in Spark 2.0.0 and 2.0.1, we see that the same filter/where succeeds > (no AnalysisException) and seems to filter out data as if the column remains: > {code:title=Spark 2.0.1} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.0.1 > /_/ > > Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60) > Type in expressions to have them evaluated. > Type :help for more information. > scala> val df1 = sc.parallelize(Seq((1, "one"), (2, > "two"))).toDF().selectExpr("_1 as id", "_2 as word") > df1: org.apache.spark.sql.DataFrame = [id: int, word: string] > scala> df1.show() > +---++ > | id|word| > +---++ > | 1| one| > | 2| two| > +---++ > scala> val df2 = df1.select("id") > df2: org.apache.spark.sql.DataFrame = [id: int] > scala> df2.printSchema() > root > |-- id: integer (nullable = false) > scala> df2.where("word = 'one'").show() > +---+ > | id| > +---+ > | 1| > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
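The behavioral difference Herman describes (1.6 resolving a missing filter column only against the immediate child, 2.0 searching the entire child tree) can be modeled with a toy plan walker. This is an assumed illustration of the resolution strategy, not the analyzer's real code:

```python
# Toy model of attribute resolution for a filter over a projected plan.
# Spark 1.6 checked only the direct child's output; Spark 2.0 walks down
# the child tree and, on a hit, applies the filter before the projection.
class Plan:
    def __init__(self, output, child=None):
        self.output = output    # column names this node produces
        self.child = child      # upstream plan node, if any

def resolve(plan, name, search_subtree):
    node = plan
    while node is not None:
        if name in node.output:
            return True
        if not search_subtree:
            return False        # 1.6-style: give up past the direct node
        node = node.child       # 2.0-style: keep descending
    return False

base = Plan(["id", "word"])               # original DataFrame
projected = Plan(["id"], child=base)      # after df1.select("id")
print(resolve(projected, "word", search_subtree=False))  # Spark 1.6: False
print(resolve(projected, "word", search_subtree=True))   # Spark 2.0: True
```

This is why 2.0 can still filter on `word` (the data exists upstream) while 1.6 raised `AnalysisException`, and why the change is arguably valid-but-surprising rather than a bug.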
[jira] [Updated] (SPARK-15453) FileSourceScanExec to extract `outputOrdering` information
[ https://issues.apache.org/jira/browse/SPARK-15453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated SPARK-15453: Summary: FileSourceScanExec to extract `outputOrdering` information (was: Improve join planning for bucketed / sorted tables) > FileSourceScanExec to extract `outputOrdering` information > -- > > Key: SPARK-15453 > URL: https://issues.apache.org/jira/browse/SPARK-15453 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Tejas Patil >Assignee: Tejas Patil >Priority: Minor > Fix For: 2.1.0 > > > Datasource allows creation of bucketed and sorted tables but performing joins > on such tables still does not utilize this metadata to produce optimal query > plan. > As below, the `Exchange` and `Sort` can be avoided if the tables are known to > be hashed + sorted on relevant columns. > {noformat} > == Physical Plan == > WholeStageCodegen > : +- SortMergeJoin [j#20,k#21,i#22], [j#23,k#24,i#25], Inner, None > : :- INPUT > : +- INPUT > :- WholeStageCodegen > : : +- Sort [j#20 ASC,k#21 ASC,i#22 ASC], false, 0 > : : +- INPUT > : +- Exchange hashpartitioning(j#20, k#21, i#22, 200), None > : +- WholeStageCodegen > :: +- Project [j#20,k#21,i#22] > :: +- Filter (isnotnull(k#21) && isnotnull(j#20)) > ::+- Scan orc default.table7[j#20,k#21,i#22] Format: ORC, > InputPaths: file:/XXX/table7, PushedFilters: [IsNotNull(k), > IsNotNull(j)], ReadSchema: struct > +- WholeStageCodegen >: +- Sort [j#23 ASC,k#24 ASC,i#25 ASC], false, 0 >: +- INPUT >+- Exchange hashpartitioning(j#23, k#24, i#25, 200), None > +- WholeStageCodegen > : +- Project [j#23,k#24,i#25] > : +- Filter (isnotnull(k#24) && isnotnull(j#23)) > :+- Scan orc default.table8[j#23,k#24,i#25] Format: ORC, > InputPaths: file:/XXX/table8, PushedFilters: [IsNotNull(k), > IsNotNull(j)], ReadSchema: struct > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
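The optimization motivating this issue can be sketched as a toy planning rule: `Exchange` and `Sort` are only needed when the scan's existing bucketing and sort order do not already satisfy what the sort-merge join requires. A hypothetical Python sketch of that decision (assumed logic, not the actual `FileSourceScanExec` code):

```python
# Toy version of the planning decision for one side of a sort-merge join:
# skip the shuffle (Exchange) when the table is already bucketed on the
# join keys, and skip the Sort when it is already sorted on them.
def plan_join_side(scan_bucketed_by, scan_sorted_by, join_keys):
    ops = ["Scan"]
    if scan_bucketed_by != join_keys:
        ops.append("Exchange")   # repartition by join keys
    if scan_sorted_by != join_keys:
        ops.append("Sort")       # sort within partitions
    ops.append("SortMergeJoin")
    return ops

# Table already bucketed + sorted on the join keys: no Exchange, no Sort.
print(plan_join_side(["j", "k", "i"], ["j", "k", "i"], ["j", "k", "i"]))
# Plain unbucketed table: both operators are required, as in the plan above.
print(plan_join_side([], [], ["j", "k", "i"]))
```

Extracting `outputOrdering` from the scan is what lets the planner make the first decision; without it, every join side pays for the Exchange and Sort shown in the quoted physical plan.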
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15599904#comment-15599904 ] Saikat Kanjilal commented on SPARK-9487: Ping on this, [~holdenk] — can you let me know if I can move ahead with the above approach? Thanks > Use the same num. worker threads in Scala/Python unit tests > --- > > Key: SPARK-9487 > URL: https://issues.apache.org/jira/browse/SPARK-9487 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL, Tests >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng > Labels: starter > Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults > > > In Python we use `local[4]` for unit tests, while in Scala/Java we use > `local[2]` and `local` for some unit tests in SQL, MLlib, and other > components. If the operation depends on partition IDs, e.g., a random number > generator, this will lead to different results in Python and Scala/Java. It > would be nice to use the same number in all unit tests.
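The partition-ID dependence mentioned in the description can be demonstrated without Spark. This Python sketch (an assumed illustration, not Spark code) seeds a generator per partition, so the same data split into 2 versus 4 partitions produces different results — exactly why `local[2]` and `local[4]` tests can disagree:

```python
import random

# If a computation seeds a generator with the partition ID, the result
# depends on how many partitions the data is split into.
def per_partition_samples(data, num_partitions, seed=42):
    out = []
    for pid in range(num_partitions):
        rng = random.Random(seed * 1000 + pid)  # seeded by partition ID
        part = data[pid::num_partitions]        # round-robin "partitioning"
        out.extend(x + rng.random() for x in part)
    return out

a = per_partition_samples([0, 0, 0, 0], 2)  # like local[2]
b = per_partition_samples([0, 0, 0, 4], 4)[:0] or per_partition_samples([0, 0, 0, 0], 4)  # like local[4]
print(a != b)  # True: results depend on the partition count
```

Pinning both suites to the same worker-thread count removes this source of divergence.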
[jira] [Comment Edited] (SPARK-15487) Spark Master UI to reverse proxy Application and Workers UI
[ https://issues.apache.org/jira/browse/SPARK-15487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15599856#comment-15599856 ] Cédric Hernalsteens edited comment on SPARK-15487 at 10/23/16 3:49 PM: --- I'm glad to see that the PR went through. I got it to work by setting -Dspark.ui.reverseProxy=true in SPARK_MASTER_OPTS. Is this the correct way to proceed? was (Author: chernals): I'm glad to see that the PR went through. However I'm not sure how this is supposed to work. The Master webui gives me this : worker-20161023152840-172.28.0.8-42846 linking to the internal docker IP. So I tried to access the worker UI from http://spark-master-public-ip/worker-20161023152840-172.28.0.8-42846 http://spark-master-public-ip/target/worker-20161023152840-172.28.0.8-42846 http://spark-master-public-ip/target/20161023152840-172.28.0.8-42846 (from what I read in the discussion the last one should have been correct) However I'm stuck at the master's webui. I'm running 2.1.0 latest nightly build http://people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/ > Spark Master UI to reverse proxy Application and Workers UI > --- > > Key: SPARK-15487 > URL: https://issues.apache.org/jira/browse/SPARK-15487 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.6.0, 1.6.1 >Reporter: Gurvinder >Assignee: Gurvinder >Priority: Minor > Fix For: 2.1.0 > > > Currently when running in Standalone mode, Spark UI's link to workers and > application drivers are pointing to internal/protected network endpoints. So > to access workers/application UI user's machine has to connect to VPN or need > to have access to internal network directly. > Therefore the proposal is to make Spark master UI reverse proxy this > information back to the user. So only Spark master UI needs to be opened up > to internet. > The minimal changes can be done by adding another route e.g. 
> http://spark-master.com/target// so when the request goes to target, > ProxyServlet kicks in and takes the and forwards the request to it > and sends the response back to the user. > More information about the discussions for this feature can be found on this > mailing list thread: > http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-kubernetes-tc17599.html
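A toy model of what the master's reverse proxy does once `spark.ui.reverseProxy=true` is set. The exact public URL layout was the open question in the comment above, so the `/proxy/<worker-id>/` prefix and the worker address used here are assumptions for illustration only:

```python
# Assumed sketch of reverse-proxy routing in the master UI: requests for
# a known worker ID are forwarded to that worker's internal UI address,
# so only the master needs to be reachable from outside the cluster.
workers = {  # hypothetical registry: worker ID -> internal UI address
    "worker-20161023152840-172.28.0.8-42846": "http://172.28.0.8:42846",
}

def route(path):
    parts = path.lstrip("/").split("/", 2)
    if parts[0] == "proxy" and len(parts) >= 2 and parts[1] in workers:
        tail = parts[2] if len(parts) > 2 else ""
        return workers[parts[1]] + "/" + tail   # forward to the worker
    return None  # anything else is served by the master UI itself

print(route("/proxy/worker-20161023152840-172.28.0.8-42846/logPage"))
```

The point is that the browser only ever talks to the master's public address; the internal Docker IP in the worker link never needs to be routable from the user's machine.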
[jira] [Updated] (SPARK-18065) Spark 2 allows filter/where on columns not in current schema
[ https://issues.apache.org/jira/browse/SPARK-18065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Scruggs updated SPARK-18065: Description: I noticed in Spark 2 (unlike 1.6) it's possible to use filter/where on a DataFrame that previously had a column, but no longer has it in its schema due to a select() operation. In Spark 1.6.2, in spark-shell, we see that an exception is thrown when attempting to filter/where using the selected-out column: {code:title=Spark 1.6.2} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.6.2 /_/ Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60) Type in expressions to have them evaluated. Type :help for more information. Spark context available as sc. SQL context available as sqlContext. scala> val df1 = sqlContext.createDataFrame(sc.parallelize(Seq((1, "one"), (2, "two".selectExpr("_1 as id", "_2 as word") df1: org.apache.spark.sql.DataFrame = [id: int, word: string] scala> df1.show() +---++ | id|word| +---++ | 1| one| | 2| two| +---++ scala> val df2 = df1.select("id") df2: org.apache.spark.sql.DataFrame = [id: int] scala> df2.printSchema() root |-- id: integer (nullable = false) scala> df2.where("word = 'one'").show() org.apache.spark.sql.AnalysisException: cannot resolve 'word' given input columns: [id]; {code} However in Spark 2.0.0 and 2.0.1, we see that the same filter/where succeeds (no AnalysisException) and seems to filter out data as if the column remains: {code:title=Spark 2.0.1} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.0.1 /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60) Type in expressions to have them evaluated. Type :help for more information. 
scala> val df1 = sc.parallelize(Seq((1, "one"), (2, "two"))).toDF().selectExpr("_1 as id", "_2 as word") df1: org.apache.spark.sql.DataFrame = [id: int, word: string] scala> df1.show() +---++ | id|word| +---++ | 1| one| | 2| two| +---++ scala> val df2 = df1.select("id") df2: org.apache.spark.sql.DataFrame = [id: int] scala> df2.printSchema() root |-- id: integer (nullable = false) scala> df2.where("word = 'one'").show() +---+ | id| +---+ | 1| +---+ {code} was: I noticed in Spark 2 (unlike 1.6) it's possible to use filter/where on a DataFrame that previously had a column, but no longer has it in its schema due to a select() operation. In Spark 1.6.2, in spark-shell, we see that an exception is thrown when attempting to filter/where using the selected-out column: {code:title=Spark 1.6.2} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.6.2 /_/ Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60) Type in expressions to have them evaluated. Type :help for more information. Spark context available as sc. SQL context available as sqlContext. 
scala> val df1 = sqlContext.createDataFrame(sc.parallelize(Seq((1, "one"), (2, "two".selectExpr("_1 as id", "_2 as word") df1: org.apache.spark.sql.DataFrame = [id: int, word: string] scala> df1.show() +---++ | id|word| +---++ | 1| one| | 2| two| +---++ scala> val df2 = df1.select("id") df2: org.apache.spark.sql.DataFrame = [id: int] scala> df2.printSchema() root |-- id: integer (nullable = false) scala> df2.where("word = 'one'").show() org.apache.spark.sql.AnalysisException: cannot resolve 'word' given input columns: [id]; {code} However in Spark 2.0.0 and 2.0.1, we see that the same code succeeds and seems to filter out data as if the column remains: {code:title=Spark 2.0.1} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.0.1 /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60) Type in expressions to have them evaluated. Type :help for more information. scala> val df1 = sc.parallelize(Seq((1, "one"), (2, "two"))).toDF().selectExpr("_1 as id", "_2 as word") df1: org.apache.spark.sql.DataFrame = [id: int, word: string] scala> df1.show() +---++ | id|word| +---++ | 1| one| | 2| two| +---++ scala> val df2 = df1.select("id") df2: org.apache.spark.sql.DataFrame = [id: int] scala> df2.printSchema() root |-- id: integer (nullable = false) scala> df2.where("word = 'one'").show() +---+ | id| +---+ | 1| +---+ {code} > Spark 2 allows filter/where on columns not in current schema > > > Key: SPARK-18065 > URL: https://issues.apache.org/jira/browse/SPARK-18065 > Project: Spark > Issue T
[jira] [Commented] (SPARK-15487) Spark Master UI to reverse proxy Application and Workers UI
[ https://issues.apache.org/jira/browse/SPARK-15487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15599856#comment-15599856 ] Cédric Hernalsteens commented on SPARK-15487: - I'm glad to see that the PR went through. However I'm not sure how this is supposed to work. The Master webui gives me this : worker-20161023152840-172.28.0.8-42846 linking to the internal docker IP. So I tried to access the worker UI from http://spark-master-public-ip/worker-20161023152840-172.28.0.8-42846 http://spark-master-public-ip/target/worker-20161023152840-172.28.0.8-42846 http://spark-master-public-ip/target/20161023152840-172.28.0.8-42846 (from what I read in the discussion the last one should have been correct) However I'm stuck at the master's webui. I'm running 2.1.0 latest nightly build http://people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/ > Spark Master UI to reverse proxy Application and Workers UI > --- > > Key: SPARK-15487 > URL: https://issues.apache.org/jira/browse/SPARK-15487 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.6.0, 1.6.1 >Reporter: Gurvinder >Assignee: Gurvinder >Priority: Minor > Fix For: 2.1.0 > > > Currently when running in Standalone mode, Spark UI's link to workers and > application drivers are pointing to internal/protected network endpoints. So > to access workers/application UI user's machine has to connect to VPN or need > to have access to internal network directly. > Therefore the proposal is to make Spark master UI reverse proxy this > information back to the user. So only Spark master UI needs to be opened up > to internet. > The minimal changes can be done by adding another route e.g. > http://spark-master.com/target// so when request goes to target, > ProxyServlet kicks in and takes the and forwards the request to it > and send response back to user. 
> More information about the discussions for this feature can be found on this > mailing list thread: > http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-kubernetes-tc17599.html
[jira] [Created] (SPARK-18065) Spark 2 allows filter/where on columns not in current schema
Matthew Scruggs created SPARK-18065: --- Summary: Spark 2 allows filter/where on columns not in current schema Key: SPARK-18065 URL: https://issues.apache.org/jira/browse/SPARK-18065 Project: Spark Issue Type: Bug Affects Versions: 2.0.1, 2.0.0 Reporter: Matthew Scruggs I noticed in Spark 2 (unlike 1.6) it's possible to use filter/where on a DataFrame that previously had a column, but no longer has it in its schema due to a select() operation. In Spark 1.6.2, in spark-shell, we see that an exception is thrown when attempting to filter/where using the selected-out column: {code:title=Spark 1.6.2} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.6.2 /_/ Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60) Type in expressions to have them evaluated. Type :help for more information. Spark context available as sc. SQL context available as sqlContext. scala> val df1 = sqlContext.createDataFrame(sc.parallelize(Seq((1, "one"), (2, "two".selectExpr("_1 as id", "_2 as word") df1: org.apache.spark.sql.DataFrame = [id: int, word: string] scala> df1.show() +---++ | id|word| +---++ | 1| one| | 2| two| +---++ scala> val df2 = df1.select("id") df2: org.apache.spark.sql.DataFrame = [id: int] scala> df2.printSchema() root |-- id: integer (nullable = false) scala> df2.where("word = 'one'").show() org.apache.spark.sql.AnalysisException: cannot resolve 'word' given input columns: [id]; {code} However in Spark 2.0.0 and 2.0.1, we see that the same code succeeds and seems to filter out data as if the column remains: {code:title=Spark 2.0.1} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.0.1 /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60) Type in expressions to have them evaluated. Type :help for more information. 
scala> val df1 = sc.parallelize(Seq((1, "one"), (2, "two"))).toDF().selectExpr("_1 as id", "_2 as word") df1: org.apache.spark.sql.DataFrame = [id: int, word: string] scala> df1.show() +---++ | id|word| +---++ | 1| one| | 2| two| +---++ scala> val df2 = df1.select("id") df2: org.apache.spark.sql.DataFrame = [id: int] scala> df2.printSchema() root |-- id: integer (nullable = false) scala> df2.where("word = 'one'").show() +---+ | id| +---+ | 1| +---+ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18064) Spark SQL can't load default config file
darion yaphet created SPARK-18064: - Summary: Spark SQL can't load default config file Key: SPARK-18064 URL: https://issues.apache.org/jira/browse/SPARK-18064 Project: Spark Issue Type: Bug Components: SQL Reporter: darion yaphet
[jira] [Commented] (SPARK-16881) Migrate Mesos configs to use ConfigEntry
[ https://issues.apache.org/jira/browse/SPARK-16881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15599682#comment-15599682 ] Sandeep Singh commented on SPARK-16881: --- I can work on this. > Migrate Mesos configs to use ConfigEntry > > > Key: SPARK-16881 > URL: https://issues.apache.org/jira/browse/SPARK-16881 > Project: Spark > Issue Type: Task > Components: Mesos >Affects Versions: 2.0.0 >Reporter: Michael Gummelt >Priority: Minor > > https://github.com/apache/spark/pull/14414#discussion_r73032190 > We'd like to migrate Mesos' use of config vars to the new ConfigEntry class > so we can a) define all our configs in one place like YARN does, and b) make > use of features like default handling and generics
[jira] [Updated] (SPARK-17123) Performing set operations that combine string and date / timestamp columns may result in generated projection code which doesn't compile
[ https://issues.apache.org/jira/browse/SPARK-17123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-17123: -- Fix Version/s: 2.0.2 > Performing set operations that combine string and date / timestamp columns > may result in generated projection code which doesn't compile > > > Key: SPARK-17123 > URL: https://issues.apache.org/jira/browse/SPARK-17123 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.0.2, 2.1.0 > > > The following example program causes SpecificSafeProjection code generation > to produce Java code which doesn't compile: > {code} > import org.apache.spark.sql.types._ > spark.sql("set spark.sql.codegen.fallback=false") > val dateDF = spark.createDataFrame(sc.parallelize(Seq(Row(new > java.sql.Date(0)))), StructType(StructField("value", DateType) :: Nil)) > val longDF = sc.parallelize(Seq(new java.sql.Date(0).toString)).toDF > dateDF.union(longDF).collect() > {code} > This fails at runtime with the following error: > {code} > failed to compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 28, Column 107: No applicable constructor/method found > for actual parameters "org.apache.spark.unsafe.types.UTF8String"; candidates > are: "public static java.sql.Date > org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)" > /* 001 */ public java.lang.Object generate(Object[] references) { > /* 002 */ return new SpecificSafeProjection(references); > /* 003 */ } > /* 004 */ > /* 005 */ class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > /* 006 */ > /* 007 */ private Object[] references; > /* 008 */ private MutableRow mutableRow; > /* 009 */ private Object[] values; > /* 010 */ private org.apache.spark.sql.types.StructType schema; > /* 011 */ > /* 012 */ > /* 013 */ public SpecificSafeProjection(Object[] references) { > /* 014 */ this.references = references; > /* 015 */ mutableRow = (MutableRow) references[references.length - 1]; > /* 016 */ > /* 017 */ this.schema = (org.apache.spark.sql.types.StructType) > references[0]; > /* 018 */ } > /* 019 */ > /* 020 */ public java.lang.Object apply(java.lang.Object _i) { > /* 021 */ InternalRow i = (InternalRow) _i; > /* 022 */ > /* 023 */ values = new Object[1]; > /* 024 */ > /* 025 */ boolean isNull2 = i.isNullAt(0); > /* 026 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0)); > /* 027 */ boolean isNull1 = isNull2; > /* 028 */ final java.sql.Date value1 = isNull1 ? null : > org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(value2); > /* 029 */ isNull1 = value1 == null; > /* 030 */ if (isNull1) { > /* 031 */ values[0] = null; > /* 032 */ } else { > /* 033 */ values[0] = value1; > /* 034 */ } > /* 035 */ > /* 036 */ final org.apache.spark.sql.Row value = new > org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema(values, > schema); > /* 037 */ if (false) { > /* 038 */ mutableRow.setNullAt(0); > /* 039 */ } else { > /* 040 */ > /* 041 */ mutableRow.update(0, value); > /* 042 */ } > /* 043 */ > /* 044 */ return mutableRow; > /* 045 */ } > /* 046 */ } > {code} > Here, the invocation of {{DateTimeUtils.toJavaDate}} is incorrect because the > generated code tries to call it with a UTF8String while the method expects an > int instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
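The failure above is ultimately a type-coercion problem: when a set operation combines a DateType column with a StringType column, both sides should first be coerced to a common type (string), but the projection is generated as if the input were still a date, so a UTF8String reaches a converter that expects an int number of days since the epoch. A toy Python sketch of the intended behavior (illustrative names, not Spark's actual API):

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def to_java_date(days: int) -> date:
    # Python stand-in for DateTimeUtils.toJavaDate: Spark stores dates
    # internally as an int number of days since the Unix epoch.
    return EPOCH + timedelta(days=days)

def common_type(left: str, right: str) -> str:
    # Toy coercion rule for set operations: string absorbs date/timestamp.
    # Casting both sides to the common type *before* generating projection
    # code means no int-only converter ever receives a string value.
    if left == right:
        return left
    if "string" in (left, right):
        return "string"
    raise TypeError(f"no common type for {left} and {right}")

assert common_type("date", "string") == "string"
assert to_java_date(0) == date(1970, 1, 1)
```

Under this model the union reads both columns as strings, and `to_java_date` is only ever invoked on genuine date columns.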
[jira] [Commented] (SPARK-18027) .sparkStaging not clean on RM ApplicationNotFoundException
[ https://issues.apache.org/jira/browse/SPARK-18027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15599551#comment-15599551 ] Sean Owen commented on SPARK-18027: --- OK, though Spark will consider the app failed in this case no matter what. Is it consistent to not clean it up? It won't recover on the Spark side regardless. > .sparkStaging not clean on RM ApplicationNotFoundException > -- > > Key: SPARK-18027 > URL: https://issues.apache.org/jira/browse/SPARK-18027 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.0 >Reporter: David Shar >Priority: Minor > > Hi, > It seems that SPARK-7705 didn't fix all issues with .sparkStaging folder > cleanup. > In Client.scala's monitorApplication: > {code} > val report: ApplicationReport = > try { > getApplicationReport(appId) > } catch { > case e: ApplicationNotFoundException => > logError(s"Application $appId not found.") > return (YarnApplicationState.KILLED, > FinalApplicationStatus.KILLED) > case NonFatal(e) => > logError(s"Failed to contact YARN for application $appId.", e) > return (YarnApplicationState.FAILED, > FinalApplicationStatus.FAILED) > } > > if (state == YarnApplicationState.FINISHED || > state == YarnApplicationState.FAILED || > state == YarnApplicationState.KILLED) { > cleanupStagingDir(appId) > return (state, report.getFinalApplicationStatus) > } > {code} > In case of ApplicationNotFoundException, we don't clean up the sparkStaging > folder. > I believe we should call cleanupStagingDir(appId) in the catch clause above. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
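The proposed control flow can be sketched in a few lines. This is a Python stand-in, not Spark's actual Client.scala: `get_report` and `cleanup` are hypothetical placeholders for getApplicationReport and cleanupStagingDir, and the report object is simplified to a bare state string. The point is that the staging directory gets cleaned up on every terminal path, including the ApplicationNotFoundException branch:

```python
class ApplicationNotFoundException(Exception):
    pass

TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED"}

def monitor_application(get_report, cleanup, app_id):
    # Fetch the application state, treating a missing application as KILLED.
    try:
        state = get_report(app_id)
    except ApplicationNotFoundException:
        cleanup(app_id)  # the proposed fix: also clean up on this path
        return "KILLED"
    if state in TERMINAL_STATES:
        cleanup(app_id)  # existing behavior for normal terminal states
    return state

# The staging dir is removed even when YARN no longer knows the app:
cleaned = []

def missing_app(app_id):
    raise ApplicationNotFoundException(app_id)

assert monitor_application(missing_app, cleaned.append, "app_1") == "KILLED"
assert cleaned == ["app_1"]
```

Whether the NonFatal branch should clean up as well is less clear (the app may still be recoverable on the YARN side), which matches the narrower suggestion in the issue.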
[jira] [Resolved] (SPARK-18045) Move `HiveDataFrameAnalyticsSuite` to package `sql`
[ https://issues.apache.org/jira/browse/SPARK-18045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-18045. --- Resolution: Fixed Assignee: Jiang Xingbo Fix Version/s: 2.1.0 > Move `HiveDataFrameAnalyticsSuite` to package `sql` > --- > > Key: SPARK-18045 > URL: https://issues.apache.org/jira/browse/SPARK-18045 > Project: Spark > Issue Type: Test > Components: SQL >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo >Priority: Minor > Fix For: 2.1.0 > > > The test suite `HiveDataFrameAnalyticsSuite` has nothing to do with Hive; we > should move it to package `sql`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18038) Move output partitioning definition from UnaryNodeExec to its children
[ https://issues.apache.org/jira/browse/SPARK-18038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-18038. --- Resolution: Fixed Fix Version/s: 2.1.0 > Move output partitioning definition from UnaryNodeExec to its children > -- > > Key: SPARK-18038 > URL: https://issues.apache.org/jira/browse/SPARK-18038 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.1 >Reporter: Tejas Patil >Assignee: Tejas Patil >Priority: Trivial > Fix For: 2.1.0 > > > This was a suggestion by [~rxin] in one of the dev list discussions: > http://apache-spark-developers-list.1001551.n3.nabble.com/Project-not-preserving-child-partitioning-td19417.html > {noformat} > I think this is very risky because preserving output partitioning should not > be a property of UnaryNodeExec (e.g. exchange). > It would be better (safer) to move the output partitioning definition into > each of the operator and remove it from UnaryExecNode. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
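The shape of the change can be illustrated with a small Python model (illustrative class names, not Spark's actual operators): the base unary node stops providing a blanket "preserves the child's partitioning" default, and each operator declares its own guarantee.

```python
class UnaryNode:
    def __init__(self, child):
        self.child = child

    @property
    def output_partitioning(self):
        # No blanket default here: an operator such as an exchange
        # redistributes rows, so inheriting the child's partitioning
        # from the base class would be silently wrong for it.
        raise NotImplementedError

class Project(UnaryNode):
    @property
    def output_partitioning(self):
        # A projection genuinely preserves how rows are distributed.
        return self.child.output_partitioning

class Exchange(UnaryNode):
    def __init__(self, child, new_partitioning):
        super().__init__(child)
        self.new_partitioning = new_partitioning

    @property
    def output_partitioning(self):
        # An exchange defines a brand-new distribution.
        return self.new_partitioning

class Scan:
    # A leaf stand-in with a known distribution.
    output_partitioning = "hash(key, 200)"

assert Project(Scan()).output_partitioning == "hash(key, 200)"
assert Exchange(Scan(), "single").output_partitioning == "single"
```

Requiring each subclass to state its partitioning explicitly trades a little boilerplate for the safety the quoted comment asks for: an operator that forgets to declare it fails loudly instead of inheriting a wrong answer.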
[jira] [Commented] (SPARK-17495) Hive hash implementation
[ https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15599523#comment-15599523 ] Herman van Hovell commented on SPARK-17495: --- Oh? Which data types are we missing? > Hive hash implementation > > > Key: SPARK-17495 > URL: https://issues.apache.org/jira/browse/SPARK-17495 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Tejas Patil >Assignee: Tejas Patil >Priority: Minor > Fix For: 2.1.0 > > > Spark internally uses Murmur3Hash for partitioning. This is different from > the one used by Hive. For queries which use bucketing this leads to different > results if one tries the same query on both engines. For us, we want users to > have backward compatibility so that one can switch parts of applications > across the engines without observing regressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17123) Performing set operations that combine string and date / timestamp columns may result in generated projection code which doesn't compile
[ https://issues.apache.org/jira/browse/SPARK-17123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15599444#comment-15599444 ] Apache Spark commented on SPARK-17123: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/15601 > Performing set operations that combine string and date / timestamp columns > may result in generated projection code which doesn't compile > > > Key: SPARK-17123 > URL: https://issues.apache.org/jira/browse/SPARK-17123 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.1.0 > > > The following example program causes SpecificSafeProjection code generation > to produce Java code which doesn't compile: > {code} > import org.apache.spark.sql.types._ > spark.sql("set spark.sql.codegen.fallback=false") > val dateDF = spark.createDataFrame(sc.parallelize(Seq(Row(new > java.sql.Date(0)))), StructType(StructField("value", DateType) :: Nil)) > val longDF = sc.parallelize(Seq(new java.sql.Date(0).toString)).toDF > dateDF.union(longDF).collect() > {code} > This fails at runtime with the following error: > {code} > failed to compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 28, Column 107: No applicable constructor/method found > for actual parameters "org.apache.spark.unsafe.types.UTF8String"; candidates > are: "public static java.sql.Date > org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)" > /* 001 */ public java.lang.Object generate(Object[] references) { > /* 002 */ return new SpecificSafeProjection(references); > /* 003 */ } > /* 004 */ > /* 005 */ class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > /* 006 */ > /* 007 */ private Object[] references; > /* 008 */ private MutableRow mutableRow; > /* 009 */ private Object[] values; > /* 010 */ private org.apache.spark.sql.types.StructType schema; > /* 011 */ > /* 012 */ > /* 013 */ public SpecificSafeProjection(Object[] references) { > /* 014 */ this.references = references; > /* 015 */ mutableRow = (MutableRow) references[references.length - 1]; > /* 016 */ > /* 017 */ this.schema = (org.apache.spark.sql.types.StructType) > references[0]; > /* 018 */ } > /* 019 */ > /* 020 */ public java.lang.Object apply(java.lang.Object _i) { > /* 021 */ InternalRow i = (InternalRow) _i; > /* 022 */ > /* 023 */ values = new Object[1]; > /* 024 */ > /* 025 */ boolean isNull2 = i.isNullAt(0); > /* 026 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0)); > /* 027 */ boolean isNull1 = isNull2; > /* 028 */ final java.sql.Date value1 = isNull1 ? null : > org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(value2); > /* 029 */ isNull1 = value1 == null; > /* 030 */ if (isNull1) { > /* 031 */ values[0] = null; > /* 032 */ } else { > /* 033 */ values[0] = value1; > /* 034 */ } > /* 035 */ > /* 036 */ final org.apache.spark.sql.Row value = new > org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema(values, > schema); > /* 037 */ if (false) { > /* 038 */ mutableRow.setNullAt(0); > /* 039 */ } else { > /* 040 */ > /* 041 */ mutableRow.update(0, value); > /* 042 */ } > /* 043 */ > /* 044 */ return mutableRow; > /* 045 */ } > /* 046 */ } > {code} > Here, the invocation of {{DateTimeUtils.toJavaDate}} is incorrect because the > generated code tries to call it with a UTF8String while the method expects an > int instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.1.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15599386#comment-15599386 ] Liwei Lin commented on SPARK-18057: --- Cool ! > Update structured streaming kafka from 10.0.1 to 10.1.0 > --- > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Sub-task >Reporter: Cody Koeninger > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15233) Spark task metrics should include hdfs read write latency
[ https://issues.apache.org/jira/browse/SPARK-15233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15599290#comment-15599290 ] Yuance Li commented on SPARK-15233: --- What's new about this issue? Thx > Spark task metrics should include hdfs read write latency > - > > Key: SPARK-15233 > URL: https://issues.apache.org/jira/browse/SPARK-15233 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.6.1 >Reporter: Sital Kedia >Priority: Minor > > Currently the Spark task metrics do not include HDFS read/write latency. It > would be very useful to have those to find the bottleneck in a query. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
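A minimal sketch of how such a metric could be collected (a plain Python timing wrapper with hypothetical names, not Spark's actual task-metrics API): wrap each read/write call and accumulate the elapsed time alongside the existing byte counters.

```python
import time

class HdfsIOMetrics:
    # Hypothetical accumulator; Spark's real task metrics track bytes
    # read/written but (as of this issue) no read/write latency.
    def __init__(self):
        self.read_nanos = 0
        self.write_nanos = 0

    def timed(self, counter, fn, *args, **kwargs):
        # Time fn and add the elapsed nanoseconds to the named counter,
        # even if fn raises.
        start = time.perf_counter_ns()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed = time.perf_counter_ns() - start
            setattr(self, counter, getattr(self, counter) + elapsed)

metrics = HdfsIOMetrics()
data = metrics.timed("read_nanos", lambda: b"block-0 bytes")
assert data == b"block-0 bytes"
assert metrics.read_nanos >= 0 and metrics.write_nanos == 0
```

Aggregating these per-task counters into the UI would then surface where a query spends its I/O time, which is the point of the request.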
[jira] [Commented] (SPARK-17495) Hive hash implementation
[ https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15599246#comment-15599246 ] Reynold Xin commented on SPARK-17495: - [~tejasp] I am going to reopen this. I just realized the committed implementation doesn't work for all data types, and is mostly dead code. Can we push it to completion by implementing the remaining data types? > Hive hash implementation > > > Key: SPARK-17495 > URL: https://issues.apache.org/jira/browse/SPARK-17495 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Tejas Patil >Assignee: Tejas Patil >Priority: Minor > Fix For: 2.1.0 > > > Spark internally uses Murmur3Hash for partitioning. This is different from > the one used by Hive. For queries which use bucketing this leads to different > results if one tries the same query on both engines. For us, we want users to > have backward compatibility so that one can switch parts of applications > across the engines without observing regressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
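For strings, the compatibility gap is concrete: Hive's hash for strings is understood to follow Java's `String.hashCode` (h = 31*h + c in wrapping signed 32-bit arithmetic), while Spark's shuffle hashing uses Murmur3, so the same value lands in different buckets on the two engines. A hedged Python sketch of the Java-style hash (an assumption for illustration; verify against Hive's ObjectInspectorUtils before relying on it):

```python
def java_string_hash(s: str) -> int:
    # Java's String.hashCode: h = 31*h + c over the string's UTF-16 code
    # units, wrapping in signed 32-bit arithmetic. Plain ord() is only
    # equivalent to a UTF-16 code unit for characters below U+10000.
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Reinterpret the unsigned 32-bit value as signed, as Java would.
    return h - 0x1_0000_0000 if h >= 0x8000_0000 else h

assert java_string_hash("abc") == 96354  # matches "abc".hashCode() in Java
assert java_string_hash("") == 0
```

Completing the Hive-compatible hash for the remaining data types means pinning down the analogous per-type rules (ints, longs, dates, structs, and so on) against Hive's own implementation rather than guessing.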
[jira] [Reopened] (SPARK-17495) Hive hash implementation
[ https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reopened SPARK-17495: - > Hive hash implementation > > > Key: SPARK-17495 > URL: https://issues.apache.org/jira/browse/SPARK-17495 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Tejas Patil >Assignee: Tejas Patil >Priority: Minor > Fix For: 2.1.0 > > > Spark internally uses Murmur3Hash for partitioning. This is different from > the one used by Hive. For queries which use bucketing this leads to different > results if one tries the same query on both engines. For us, we want users to > have backward compatibility so that one can switch parts of applications > across the engines without observing regressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org