[jira] [Reopened] (SPARK-22600) Fix 64kb limit for deeply nested expressions under wholestage codegen

2017-12-13 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-22600:
-

> Fix 64kb limit for deeply nested expressions under wholestage codegen
> -
>
> Key: SPARK-22600
> URL: https://issues.apache.org/jira/browse/SPARK-22600
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.3.0
>
>
> This is an extension of SPARK-22543 to fix the 64KB compile error for deeply
> nested expressions under wholestage codegen.






[jira] [Comment Edited] (SPARK-22600) Fix 64kb limit for deeply nested expressions under wholestage codegen

2017-12-13 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290423#comment-16290423
 ] 

Liang-Chi Hsieh edited comment on SPARK-22600 at 12/14/17 6:59 AM:
---

The current approach proposes a new contract that Spark didn't promise before:
{{Expression.genCode}} must output something that can be used as a parameter name
or a literal. If we output a Java expression that produces a value, such as
{{var1 + 1}}, this approach will fail compilation because the expression can't be
used as a parameter name.

Changing the expression to a generated parameter name would be difficult because
we already use the expression in the generated code.

If we accept this new contract, we should document it clearly and check whether
any places use such an expression as codegen output.

The current approach is documented in the design doc. Please give me feedback if
you have time to go through it. Thank you.

The design doc is posted at 
https://docs.google.com/document/d/1By_V-A2sxCWbP7dZ5EzHIuMSe8K0fQL9lqovGWXnsfs/edit?usp=sharing
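
As a side note for readers, the contract above can be illustrated with a minimal,
self-contained Scala sketch. This is not Spark's actual CodegenContext or ExprCode
API; the class and method names below are made up purely to show why splitting
only works when a child's generated value is a plain variable name.

{code:scala}
// Minimal sketch: splitting a child's generated code into a separate method.
// The names here are illustrative only, not Spark's real codegen classes.
object SplitCodegenSketch {
  // `code` computes the result; `value` is how the parent refers to it.
  final case class ExprCode(code: String, value: String)

  // Wrap the parent's code in a private method that takes the child's value as
  // a parameter. This only compiles if `value` is a plain variable name.
  def splitIntoMethod(methodName: String, child: ExprCode, body: String): String =
    s"""private int $methodName(int ${child.value}) {
       |  $body
       |}""".stripMargin

  def main(args: Array[String]): Unit = {
    val asVariable   = ExprCode("int var1 = input.getInt(0);", value = "var1")
    val asExpression = ExprCode("", value = "var1 + 1") // a Java expression, not a name

    // Fine: "private int doCompute(int var1) { ... }"
    println(splitIntoMethod("doCompute", asVariable, "return var1 * 2;"))

    // Broken: "private int doCompute(int var1 + 1) { ... }" would not compile,
    // which is the failure mode described in the comment above.
    println(splitIntoMethod("doCompute", asExpression, "return 0;"))
  }
}
{code}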




was (Author: viirya):
The current approach proposes a new contract that Spark didn't promise before:
{{Expression.genCode}} must output something that can be used as a parameter name
or a literal. If we output a Java expression that produces a value, such as
{{var1 + 1}}, this approach will fail compilation because the expression can't be
used as a parameter name.

Changing the expression to a generated parameter name would be difficult because
we already use the expression in the generated code.

If we accept this new contract, we should document it clearly and check whether
any places use such an expression as codegen output.

The current approach is documented in the design doc.

The design doc is posted at 
https://docs.google.com/document/d/1By_V-A2sxCWbP7dZ5EzHIuMSe8K0fQL9lqovGWXnsfs/edit?usp=sharing



> Fix 64kb limit for deeply nested expressions under wholestage codegen
> -
>
> Key: SPARK-22600
> URL: https://issues.apache.org/jira/browse/SPARK-22600
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.3.0
>
>
> This is an extension of SPARK-22543 to fix the 64KB compile error for deeply
> nested expressions under wholestage codegen.






[jira] [Commented] (SPARK-22600) Fix 64kb limit for deeply nested expressions under wholestage codegen

2017-12-13 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290423#comment-16290423
 ] 

Liang-Chi Hsieh commented on SPARK-22600:
-

The current approach proposes a new contract that Spark didn't promise before:
{{Expression.genCode}} must output something that can be used as a parameter name
or a literal. If we output a Java expression that produces a value, such as
{{var1 + 1}}, this approach will fail compilation because the expression can't be
used as a parameter name.

Changing the expression to a generated parameter name would be difficult because
we already use the expression in the generated code.

If we accept this new contract, we should document it clearly and check whether
any places use such an expression as codegen output.

The current approach is documented in the design doc.

The design doc is posted at 
https://docs.google.com/document/d/1By_V-A2sxCWbP7dZ5EzHIuMSe8K0fQL9lqovGWXnsfs/edit?usp=sharing



> Fix 64kb limit for deeply nested expressions under wholestage codegen
> -
>
> Key: SPARK-22600
> URL: https://issues.apache.org/jira/browse/SPARK-22600
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.3.0
>
>
> This is an extension of SPARK-22543 to fix the 64KB compile error for deeply
> nested expressions under wholestage codegen.






[jira] [Updated] (SPARK-22779) ConfigEntry's default value should actually be a value

2017-12-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22779:

Fix Version/s: 2.3.0
  Component/s: (was: Spark Core)
   SQL

> ConfigEntry's default value should actually be a value
> --
>
> Key: SPARK-22779
> URL: https://issues.apache.org/jira/browse/SPARK-22779
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Reynold Xin
>Assignee: Marcelo Vanzin
> Fix For: 2.3.0
>
>
> ConfigEntry's config value right now shows a human-readable message. In some
> places in SQL we actually rely on the default value being a real value when
> setting the values.
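
To make the distinction concrete, here is a hedged, hypothetical sketch in Scala.
It is not Spark's internal ConfigEntry implementation; it only illustrates the
difference between a human-readable default string (for docs and UI) and the
typed default value that SQL code needs to rely on.

{code:scala}
// Hypothetical, simplified config entry; Spark's real ConfigEntry differs.
final case class SimpleConfigEntry[T](
    key: String,
    defaultValue: Option[T],      // the real, typed default
    defaultValueString: String) { // a human-readable description for display

  // Code that needs a concrete value must use the typed default,
  // not the display string.
  def readFrom(settings: Map[String, String], parse: String => T): T =
    settings.get(key).map(parse)
      .orElse(defaultValue)
      .getOrElse(throw new NoSuchElementException(key))
}

object ConfigEntrySketch {
  def main(args: Array[String]): Unit = {
    val entry = SimpleConfigEntry[Int](
      key = "spark.some.example.conf",             // placeholder key
      defaultValue = Some(100),                    // illustrative value only
      defaultValueString = "<depends on workload>")

    // Relying on the display string would not give callers a usable value;
    // the typed default does.
    println(entry.readFrom(Map.empty, _.toInt)) // prints 100
  }
}
{code}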






[jira] [Resolved] (SPARK-22779) ConfigEntry's default value should actually be a value

2017-12-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22779.
-
Resolution: Fixed
  Assignee: Marcelo Vanzin  (was: Reynold Xin)

> ConfigEntry's default value should actually be a value
> --
>
> Key: SPARK-22779
> URL: https://issues.apache.org/jira/browse/SPARK-22779
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Reynold Xin
>Assignee: Marcelo Vanzin
>
> ConfigEntry's config value right now shows a human-readable message. In some
> places in SQL we actually rely on the default value being a real value when
> setting the values.






[jira] [Resolved] (SPARK-22732) Add DataSourceV2 streaming APIs

2017-12-13 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-22732.
--
   Resolution: Fixed
 Assignee: Jose Torres
Fix Version/s: 2.3.0

> Add DataSourceV2 streaming APIs
> ---
>
> Key: SPARK-22732
> URL: https://issues.apache.org/jira/browse/SPARK-22732
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>Assignee: Jose Torres
> Fix For: 2.3.0
>
>
> Structured Streaming APIs are currently tucked in a spark internal package. 
> We need to expose a new version in the DataSourceV2 framework, and add the 
> APIs required for continuous processing.






[jira] [Resolved] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator

2017-12-13 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-3181.

   Resolution: Fixed
Fix Version/s: 2.3.0

> Add Robust Regression Algorithm with Huber Estimator
> 
>
> Key: SPARK-3181
> URL: https://issues.apache.org/jira/browse/SPARK-3181
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Fan Jiang
>Assignee: Yanbo Liang
>  Labels: features
> Fix For: 2.3.0
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Linear least-squares estimation assumes the errors are normally distributed and
> can behave badly when the errors are heavy-tailed. In practice we get various
> types of data, so we need to include robust regression to employ a fitting
> criterion that is not as vulnerable as least squares.
> In 1973, Huber introduced M-estimation for regression, which stands for
> "maximum likelihood type" estimation. The method is resistant to outliers in
> the response variable and has been widely used.
> The new feature for MLlib will contain 3 new files:
> /main/scala/org/apache/spark/mllib/regression/RobustRegression.scala
> /test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala
> /main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala
> and one new class, HuberRobustGradient, in
> /main/scala/org/apache/spark/mllib/optimization/Gradient.scala
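
For readers unfamiliar with the Huber criterion mentioned above, the following
Scala sketch shows the standard textbook Huber loss (it is not the code proposed
in this ticket): quadratic for small residuals, linear for large ones, which is
what makes it less sensitive to outliers than plain least squares.

{code:scala}
// Standard Huber loss: quadratic near zero, linear in the tails.
object HuberLossSketch {
  // delta is the threshold between the quadratic and linear regimes;
  // 1.345 is a commonly cited choice, used here only as an example.
  def huberLoss(residual: Double, delta: Double = 1.345): Double = {
    val a = math.abs(residual)
    if (a <= delta) 0.5 * a * a        // behaves like least squares
    else delta * (a - 0.5 * delta)     // grows only linearly for outliers
  }

  def main(args: Array[String]): Unit = {
    println(huberLoss(0.5))   // 0.125, same as the squared-error loss 0.5 * 0.5^2
    println(huberLoss(10.0))  // ~12.5 instead of 50.0 under squared error
  }
}
{code}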






[jira] [Commented] (SPARK-22776) Increase default value of spark.sql.codegen.maxFields

2017-12-13 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290351#comment-16290351
 ] 

Kazuaki Ishizaki commented on SPARK-22776:
--

Wait for re-merging SPARK-22600

> Increase default value of spark.sql.codegen.maxFields
> -
>
> Key: SPARK-22776
> URL: https://issues.apache.org/jira/browse/SPARK-22776
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>
> Since there has been a lot of effort to avoid the limitations of Java class
> files, generated code for whole-stage codegen now works with wider columns.
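
For context, the configuration under discussion can be overridden per session
today; a minimal sketch is below. The value 200 is only an example, and this
ticket is about what the built-in default should be.

{code:scala}
// Raising spark.sql.codegen.maxFields for one session; 200 is illustrative only.
import org.apache.spark.sql.SparkSession

object MaxFieldsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("maxFields example")
      .getOrCreate()

    // Queries over schemas wider than this threshold fall back from
    // whole-stage codegen; raising it keeps whole-stage codegen enabled.
    spark.conf.set("spark.sql.codegen.maxFields", "200")

    println(spark.conf.get("spark.sql.codegen.maxFields"))
    spark.stop()
  }
}
{code}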






[jira] [Updated] (SPARK-22781) Support creating streaming dataset with ORC files

2017-12-13 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-22781:
--
Summary: Support creating streaming dataset with ORC files  (was: Support 
creating streaming dataset with ORC file format)

> Support creating streaming dataset with ORC files
> -
>
> Key: SPARK-22781
> URL: https://issues.apache.org/jira/browse/SPARK-22781
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>
> This issue supports creating a streaming dataset with the ORC file format.






[jira] [Assigned] (SPARK-22781) Support creating streaming dataset with ORC file format

2017-12-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22781:


Assignee: Apache Spark

> Support creating streaming dataset with ORC file format
> ---
>
> Key: SPARK-22781
> URL: https://issues.apache.org/jira/browse/SPARK-22781
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>
> This issue supports creating a streaming dataset with the ORC file format.






[jira] [Assigned] (SPARK-22781) Support creating streaming dataset with ORC file format

2017-12-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22781:


Assignee: (was: Apache Spark)

> Support creating streaming dataset with ORC file format
> ---
>
> Key: SPARK-22781
> URL: https://issues.apache.org/jira/browse/SPARK-22781
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>
> This issue supports creating a streaming dataset with the ORC file format.






[jira] [Created] (SPARK-22781) Support creating streaming dataset with ORC file format

2017-12-13 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-22781:
-

 Summary: Support creating streaming dataset with ORC file format
 Key: SPARK-22781
 URL: https://issues.apache.org/jira/browse/SPARK-22781
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Dongjoon Hyun


This issue supports creating a streaming dataset with the ORC file format.
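
A rough sketch of the requested usage follows. It assumes the generic
{{format("orc")}} spelling for the streaming file source; whether a dedicated
reader method is added is exactly what this issue proposes, and the schema and
path below are placeholders.

{code:scala}
// Sketch only: reading ORC files as a streaming Dataset. Path and schema are
// placeholders; support for this source is what SPARK-22781 is about.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object OrcStreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("orc streaming sketch")
      .getOrCreate()

    val schema = StructType(Seq(StructField("value", StringType)))

    val stream = spark.readStream
      .schema(schema)          // streaming file sources require an explicit schema
      .format("orc")           // read ORC files as they arrive in the directory
      .load("/tmp/orc-input")  // placeholder input directory

    val query = stream.writeStream
      .format("console")
      .start()
    query.awaitTermination()
  }
}
{code}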






[jira] [Commented] (SPARK-22779) ConfigEntry's default value should actually be a value

2017-12-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290259#comment-16290259
 ] 

Apache Spark commented on SPARK-22779:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/19974

> ConfigEntry's default value should actually be a value
> --
>
> Key: SPARK-22779
> URL: https://issues.apache.org/jira/browse/SPARK-22779
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> ConfigEntry's config value right now shows a human-readable message. In some
> places in SQL we actually rely on the default value being a real value when
> setting the values.






[jira] [Updated] (SPARK-21417) Detect transitive join conditions via expressions

2017-12-13 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-21417:
--
Fix Version/s: (was: 2.3.0)

> Detect transitive join conditions via expressions
> -
>
> Key: SPARK-21417
> URL: https://issues.apache.org/jira/browse/SPARK-21417
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Claus Stadler
>Assignee: Anton Okolnychyi
>
> _Disclaimer: The nature of this report is similar to that of 
> https://issues.apache.org/jira/browse/CALCITE-1887 - yet, as SPARK (to my 
> understanding) uses its own SQL implementation, the requested improvement has 
> to be treated as a separate issue._
> Given table aliases ta, tb, column names ca, cb, and an arbitrary
> (deterministic) expression expr, Spark should be capable of inferring join
> conditions by transitivity:
> {noformat}
> ta.ca = expr AND tb.cb = expr -> ta.ca = tb.cb
> {noformat}
> The use case for us stems from SPARQL to SQL rewriting, where SPARQL queries 
> such as
> {code:java}
> SELECT {
>   dbr:Leipzig a ?type .
>   dbr:Leipzig dbo:mayor ?mayor
> }
> {code}
> result in an SQL query similar to
> {noformat}
> SELECT s.rdf a, s.rdf b WHERE a.s = 'dbr:Leipzig' AND b.s = 'dbr:Leipzig'
> {noformat}
> A consequence of the join condition not being recognized is that Apache
> Spark does not find an executable plan to process the query.
> Self contained example:
> {code:java}
> package my.package;
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StringType, StructField, StructType}
> import org.scalatest._
> class TestSparkSqlJoin extends FlatSpec {
>   "SPARK SQL processor" should "be capable of handling transitive join 
> conditions" in {
> val spark = SparkSession
>   .builder()
>   .master("local[2]")
>   .appName("Spark SQL parser bug")
>   .getOrCreate()
> import spark.implicits._
> // The schema is encoded in a string
> val schemaString = "s p o"
> // Generate the schema based on the string of schema
> val fields = schemaString.split(" ")
>   .map(fieldName => StructField(fieldName, StringType, nullable = true))
> val schema = StructType(fields)
> val data = List(("s1", "p1", "o1"))
> val dataRDD = spark.sparkContext.parallelize(data).map(attributes => 
> Row(attributes._1, attributes._2, attributes._3))
> val df = spark.createDataFrame(dataRDD, schema).as("TRIPLES")
> df.createOrReplaceTempView("TRIPLES")
> println("First Query")
> spark.sql("SELECT A.s FROM TRIPLES A, TRIPLES B WHERE A.s = B.s AND A.s = 
> 'dbr:Leipzig'").show(10)
> println("Second Query")
> spark.sql("SELECT A.s FROM TRIPLES A, TRIPLES B WHERE A.s = 'dbr:Leipzig' 
> AND B.s = 'dbr:Leipzig'").show(10)
>   }
> }
> {code}
> Output (excerpt):
> {noformat}
> First Query
> ...
> +---+
> |  s|
> +---+
> +---+
> Second Query
> - should be capable of handling transitive join conditions *** FAILED ***
>   org.apache.spark.sql.AnalysisException: Detected cartesian product for 
> INNER join between logical plans
> Project [s#3]
> +- Filter (isnotnull(s#3) && (s#3 = dbr:Leipzig))
>+- LogicalRDD [s#3, p#4, o#5]
> and
> Project
> +- Filter (isnotnull(s#20) && (s#20 = dbr:Leipzig))
>+- LogicalRDD [s#20, p#21, o#22]
> Join condition is missing or trivial.
> Use the CROSS JOIN syntax to allow cartesian products between these 
> relations.;
>   at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1080)
>   at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1077)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   ...
> Run completed in 6 seconds, 833 milliseconds.
> Total number of tests run: 1
> Suites: completed 1, aborted 0
> Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
> *** 1 TEST FAILED ***
> {noformat}
> Expected:
> A correctly working, executable, query plan for the second query 

[jira] [Commented] (SPARK-22771) SQL concat for binary

2017-12-13 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290174#comment-16290174
 ] 

Takeshi Yamamuro commented on SPARK-22771:
--

OK, I'll take it.

> SQL concat for binary 
> --
>
> Key: SPARK-22771
> URL: https://issues.apache.org/jira/browse/SPARK-22771
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Fernando Pereira
>Priority: Minor
>
> The spark.sql {{concat}} function automatically casts its arguments to
> StringType and returns a String.
> This might be the behavior of traditional databases; however, Spark has Binary
> as a standard type, and concat'ing binary values seems reasonable if it
> returns another binary sequence.
> Take the example of Python, where both {{bytes}} and {{unicode}} represent
> text: by concat'ing both we end up with the same type as the arguments, and
> when they are intermixed (str + unicode) the most generic type is returned
> (unicode).
> Following the same principle, I believe that when concat'ing binary it would
> make sense to return a binary.
> In terms of Spark behavior, this would affect only the case when all arguments
> are binary. All other cases should remain unchanged.
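
To make the reported behavior concrete, here is a small Scala sketch (the column
names and byte values are made up). Per the report above, {{concat}} currently
casts binary inputs to strings; the proposal is that an all-binary input should
yield a binary result.

{code:scala}
// Demonstrates concat over binary columns. Per this report the result is
// currently a string; the proposal is to return binary when all inputs are binary.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{concat, lit}

object BinaryConcatSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("binary concat sketch")
      .getOrCreate()
    import spark.implicits._

    // Both columns are BinaryType.
    val df = Seq((Array[Byte](1, 2), Array[Byte](3, 4))).toDF("a", "b")

    val result = df.select(concat($"a", $"b", lit(Array[Byte](5))).as("c"))

    // Inspect the output type: reportedly StringType today, BinaryType under
    // the proposed behavior.
    result.printSchema()
    result.show(truncate = false)
    spark.stop()
  }
}
{code}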






[jira] [Commented] (SPARK-7721) Generate test coverage report from Python

2017-12-13 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290175#comment-16290175
 ] 

Hyukjin Kwon commented on SPARK-7721:
-

Sure, I didn't mean to rush and start proceeding without investigating and
checking everything first. I just wanted to check your thoughts ahead of time. I
will try to find some time to take a look and proceed with this bit by bit, and
of course I will keep you updated.

> Generate test coverage report from Python
> -
>
> Key: SPARK-7721
> URL: https://issues.apache.org/jira/browse/SPARK-7721
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Reporter: Reynold Xin
>
> It would be great to have a test coverage report for Python. Compared with
> Scala, it is trickier to understand the coverage in Python without coverage
> reports because we employ both docstring tests and unit tests in the test files.






[jira] [Created] (SPARK-22780) make insert commands have real children to fix UI issues

2017-12-13 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-22780:
---

 Summary: make insert commands have real children to fix UI issues
 Key: SPARK-22780
 URL: https://issues.apache.org/jira/browse/SPARK-22780
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan









[jira] [Updated] (SPARK-22716) Avoid the creation of mutable states in addReferenceObj

2017-12-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22716:

Parent Issue: SPARK-22692  (was: SPARK-22510)

> Avoid the creation of mutable states in addReferenceObj
> ---
>
> Key: SPARK-22716
> URL: https://issues.apache.org/jira/browse/SPARK-22716
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Marco Gaido
> Fix For: 2.3.0
>
>
> `ctx.addReferenceObj` creates a global variable just for referring to an
> object, which seems like overkill. We should revisit it and always use
> `ctx.addReferenceMinorObj`.






[jira] [Updated] (SPARK-22776) Increase default value of spark.sql.codegen.maxFields

2017-12-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22776:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-22510

> Increase default value of spark.sql.codegen.maxFields
> -
>
> Key: SPARK-22776
> URL: https://issues.apache.org/jira/browse/SPARK-22776
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>
> Since there has been a lot of effort to avoid the limitations of Java class
> files, generated code for whole-stage codegen now works with wider columns.






[jira] [Commented] (SPARK-22771) SQL concat for binary

2017-12-13 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290160#comment-16290160
 ] 

Xiao Li commented on SPARK-22771:
-

This looks reasonable. We should fix it.

> SQL concat for binary 
> --
>
> Key: SPARK-22771
> URL: https://issues.apache.org/jira/browse/SPARK-22771
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Fernando Pereira
>Priority: Minor
>
> The spark.sql {{concat}} function automatically casts its arguments to
> StringType and returns a String.
> This might be the behavior of traditional databases; however, Spark has Binary
> as a standard type, and concat'ing binary values seems reasonable if it
> returns another binary sequence.
> Take the example of Python, where both {{bytes}} and {{unicode}} represent
> text: by concat'ing both we end up with the same type as the arguments, and
> when they are intermixed (str + unicode) the most generic type is returned
> (unicode).
> Following the same principle, I believe that when concat'ing binary it would
> make sense to return a binary.
> In terms of Spark behavior, this would affect only the case when all arguments
> are binary. All other cases should remain unchanged.






[jira] [Commented] (SPARK-22359) Improve the test coverage of window functions

2017-12-13 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290159#comment-16290159
 ] 

Xiao Li commented on SPARK-22359:
-

https://github.com/postgrespro/postgrespro/blob/ac92c4a9a53c88843533154d2224323509134323/src/test/regress/sql/window.sql

Above link has a few good examples.

> Improve the test coverage of window functions
> -
>
> Key: SPARK-22359
> URL: https://issues.apache.org/jira/browse/SPARK-22359
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>
> There are already quite a few integration tests using window functions, but
> the unit test coverage for window functions is not ideal.
> We'd like to test the following aspects:
> * Specifications
> ** different partition clauses (none, one, multiple)
> ** different order clauses (none, one, multiple, asc/desc, nulls first/last)
> * Frames and their combinations
> ** OffsetWindowFunctionFrame
> ** UnboundedWindowFunctionFrame
> ** SlidingWindowFunctionFrame
> ** UnboundedPrecedingWindowFunctionFrame
> ** UnboundedFollowingWindowFunctionFrame
> * Aggregate function types
> ** Declarative
> ** Imperative
> ** UDAF
> * Spilling
> ** Cover the conditions that WindowExec should spill at least once 
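
As one concrete starting point for the coverage listed above, here is a minimal
Scala sketch exercising a sliding frame with a single partition and order clause;
the data and column names are made up, and a real test would assert on the
collected results rather than print them.

{code:scala}
// One example case: a sliding frame (rowsBetween) with one partition clause
// and one ascending order clause. Data and names are illustrative only.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

object WindowCoverageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("window coverage sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("a", 2), ("a", 3), ("b", 10)).toDF("k", "v")

    // Sliding frame: the previous row and the current row within each partition.
    val w = Window.partitionBy($"k").orderBy($"v").rowsBetween(-1, 0)

    df.withColumn("running", sum($"v").over(w))
      .orderBy($"k", $"v")
      .show()
    // Expected "running" column: 1, 3, 5 for partition "a" and 10 for "b".
    spark.stop()
  }
}
{code}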






[jira] [Updated] (SPARK-22747) Shorten lifetime of global variables used in HashAggregateExec

2017-12-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22747:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-22510

> Shorten lifetime of global variables used in HashAggregateExec
> --
>
> Key: SPARK-22747
> URL: https://issues.apache.org/jira/browse/SPARK-22747
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>
> Generated code in {{HashAggregateExec}} uses global mutable variables that
> are passed to successor operations through the {{consume()}} method. This may
> cause the issue described in SPARK-22668.
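
A rough sketch of the problem shape, using plain strings to stand in for
generated Java (this is not Spark's actual generated code or codegen API): a
class-level mutable field stays alive across all rows and adds to global state,
while a local variable scoped to the per-row block has the shortest possible
lifetime.

{code:scala}
// Contrast of the two codegen styles this ticket is about; the snippets are
// illustrative strings, not real Spark-generated code.
object GlobalVsLocalCodegenSketch {
  // Style 1: a class-level mutable field, declared once and reassigned per row.
  def withGlobalState(): String =
    """private long agg_value;               // lives for the whole operator
      |// per-row code:
      |agg_value = agg_sum + input_value;
      |consume(agg_value);""".stripMargin

  // Style 2: a local variable inside the per-row block, so its lifetime ends
  // with the block and no global state is introduced.
  def withLocalVariable(): String =
    """// per-row code:
      |long agg_value = agg_sum + input_value;
      |consume(agg_value);""".stripMargin

  def main(args: Array[String]): Unit = {
    println(withGlobalState())
    println()
    println(withLocalVariable())
  }
}
{code}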






[jira] [Updated] (SPARK-22716) Avoid the creation of mutable states in addReferenceObj

2017-12-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22716:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-22510

> Avoid the creation of mutable states in addReferenceObj
> ---
>
> Key: SPARK-22716
> URL: https://issues.apache.org/jira/browse/SPARK-22716
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Marco Gaido
> Fix For: 2.3.0
>
>
> `ctx.addReferenceObj` creates a global variable just for referring to an
> object, which seems like overkill. We should revisit it and always use
> `ctx.addReferenceMinorObj`.






[jira] [Updated] (SPARK-22669) Avoid unnecessary function calls in code generation

2017-12-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22669:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-22510

> Avoid unnecessary function calls in code generation
> ---
>
> Key: SPARK-22669
> URL: https://issues.apache.org/jira/browse/SPARK-22669
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
> Fix For: 2.3.0
>
>
> In many parts of the code generation codebase, we split the code to avoid
> exceptions due to the 64KB method size limit. This generates a lot of methods
> which are called every time, even though sometimes this is not needed. As
> pointed out here:
> https://github.com/apache/spark/pull/19752#discussion_r153081547, this is a
> non-negligible overhead which can be avoided.
> In this JIRA, I propose to use the same approach throughout all the other
> cases, when possible. I am going to submit a PR soon.






[jira] [Commented] (SPARK-22752) FileNotFoundException while reading from Kafka

2017-12-13 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290150#comment-16290150
 ] 

Shixiong Zhu commented on SPARK-22752:
--

What's your code? You probably hit SPARK-21977

> FileNotFoundException while reading from Kafka
> --
>
> Key: SPARK-22752
> URL: https://issues.apache.org/jira/browse/SPARK-22752
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>
> We are running a stateful structured streaming job which reads from Kafka and 
> writes to HDFS. And we are hitting this exception:
> {noformat}
> 17/12/08 05:20:12 ERROR FileFormatWriter: Aborting job null.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 
> (TID 4, hcube1-1n03.eng.hortonworks.com, executor 1): 
> java.lang.IllegalStateException: Error reading delta file 
> /checkpointDir/state/0/0/1.delta of HDFSStateStoreProvider[id = (op=0, 
> part=0), dir = /checkpointDir/state/0/0]: /checkpointDir/state/0/0/1.delta 
> does not exist
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:410)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:362)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:360)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359)
>   at scala.Option.getOrElse(Option.scala:121)
> {noformat}
> Of course, the file doesn't exist in HDFS, and in the {{state/0/0}} directory
> there is no file at all, while we do have some files in the commits and offsets
> folders. I am not sure about the reason for this behavior. It seems to happen
> the second time the job is started, after the first run failed, so it looks
> like task failures can trigger it. Or it might be related to watermarks, since
> there were some problems with the incoming data for which the watermark was
> filtering out all the incoming data.






[jira] [Updated] (SPARK-22772) elt should use splitExpressionsWithCurrentInputs to split expression codes

2017-12-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22772:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-22510

> elt should use splitExpressionsWithCurrentInputs to split expression codes
> --
>
> Key: SPARK-22772
> URL: https://issues.apache.org/jira/browse/SPARK-22772
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.3.0
>
>
> In SPARK-22550, elt was changed to use {{buildCodeBlocks}} to manually split
> expression codes. We should use {{splitExpressionsWithCurrentInputs}} to do
> that because it considers both normal codegen and wholestage codegen.






[jira] [Commented] (SPARK-7721) Generate test coverage report from Python

2017-12-13 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290134#comment-16290134
 ] 

Reynold Xin commented on SPARK-7721:


We definitely don't need to do it in one go, but with stuff like this the key is
to know for sure we can do it. Otherwise it becomes half-baked infrastructure
that's committed but not actually functioning, and brings more hassle than
needed.


> Generate test coverage report from Python
> -
>
> Key: SPARK-7721
> URL: https://issues.apache.org/jira/browse/SPARK-7721
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Reporter: Reynold Xin
>
> It would be great to have a test coverage report for Python. Compared with
> Scala, it is trickier to understand the coverage in Python without coverage
> reports because we employ both docstring tests and unit tests in the test files.






[jira] [Commented] (SPARK-22771) SQL concat for binary

2017-12-13 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290101#comment-16290101
 ] 

Takeshi Yamamuro commented on SPARK-22771:
--

[~smilegator] Is this worth fixing?

> SQL concat for binary 
> --
>
> Key: SPARK-22771
> URL: https://issues.apache.org/jira/browse/SPARK-22771
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Fernando Pereira
>Priority: Minor
>
> The spark.sql {{concat}} function automatically casts its arguments to
> StringType and returns a String.
> This might be the behavior of traditional databases; however, Spark has Binary
> as a standard type, and concat'ing binary values seems reasonable if it
> returns another binary sequence.
> Take the example of Python, where both {{bytes}} and {{unicode}} represent
> text: by concat'ing both we end up with the same type as the arguments, and
> when they are intermixed (str + unicode) the most generic type is returned
> (unicode).
> Following the same principle, I believe that when concat'ing binary it would
> make sense to return a binary.
> In terms of Spark behavior, this would affect only the case when all arguments
> are binary. All other cases should remain unchanged.






[jira] [Updated] (SPARK-22600) Fix 64kb limit for deeply nested expressions under wholestage codegen

2017-12-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22600:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-22510

> Fix 64kb limit for deeply nested expressions under wholestage codegen
> -
>
> Key: SPARK-22600
> URL: https://issues.apache.org/jira/browse/SPARK-22600
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.3.0
>
>
> This is an extension of SPARK-22543 to fix the 64KB compile error for deeply
> nested expressions under wholestage codegen.






[jira] [Resolved] (SPARK-21870) Split codegen'd aggregation code into small functions for the HotSpot

2017-12-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21870.
-
Resolution: Duplicate

> Split codegen'd aggregation code into small functions for the HotSpot
> -
>
> Key: SPARK-21870
> URL: https://issues.apache.org/jira/browse/SPARK-21870
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In SPARK-21603, we saw a performance regression when HotSpot didn't compile
> overly long functions (the limit is 8000 bytes of bytecode).
> I checked and found that the codegen of `HashAggregateExec` frequently goes
> over the limit, for example:
> {code}
> spark.range(1000).selectExpr("id % 1024 AS a", "id AS 
> b").write.saveAsTable("t")
> sql("SELECT a, KURTOSIS(b)FROM t GROUP BY a")
> {code}
> This query goes over the limit and the actual bytecode size is `12356`.
> So it might be better to split the aggregation code into pieces.






[jira] [Resolved] (SPARK-22756) Run SparkR tests if hive_thriftserver module has code changes

2017-12-13 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22756.
--
Resolution: Not A Problem

Hey [~smilegator], let me leave this resolved based on what we talked about, but
please reopen it if I am mistaken.

> Run SparkR tests if hive_thriftserver module has code changes
> -
>
> Key: SPARK-22756
> URL: https://issues.apache.org/jira/browse/SPARK-22756
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> A recent PR change in hive_thriftserver caused a test failure in the CRAN
> requirements check. To some extent, the SparkR module depends on the
> hive_thriftserver module, so we should run the SparkR tests if the
> hive_thriftserver module has code changes.






[jira] [Assigned] (SPARK-22779) ConfigEntry's default value should actually be a value

2017-12-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22779:


Assignee: Reynold Xin  (was: Apache Spark)

> ConfigEntry's default value should actually be a value
> --
>
> Key: SPARK-22779
> URL: https://issues.apache.org/jira/browse/SPARK-22779
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> ConfigEntry's config value right now shows a human-readable message. In some
> places in SQL we actually rely on the default value being a real value when
> setting the values.






[jira] [Commented] (SPARK-22779) ConfigEntry's default value should actually be a value

2017-12-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290054#comment-16290054
 ] 

Apache Spark commented on SPARK-22779:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/19973

> ConfigEntry's default value should actually be a value
> --
>
> Key: SPARK-22779
> URL: https://issues.apache.org/jira/browse/SPARK-22779
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> ConfigEntry's config value right now shows a human-readable message. In some
> places in SQL we actually rely on the default value being a real value when
> setting the values.






[jira] [Assigned] (SPARK-22779) ConfigEntry's default value should actually be a value

2017-12-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22779:


Assignee: Apache Spark  (was: Reynold Xin)

> ConfigEntry's default value should actually be a value
> --
>
> Key: SPARK-22779
> URL: https://issues.apache.org/jira/browse/SPARK-22779
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> ConfigEntry's config value right now shows a human-readable message. In some
> places in SQL we actually rely on the default value being a real value when
> setting the values.






[jira] [Updated] (SPARK-22496) beeline display operation log

2017-12-13 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-22496:
-
Fix Version/s: (was: 2.3.0)

> beeline display operation log
> -
>
> Key: SPARK-22496
> URL: https://issues.apache.org/jira/browse/SPARK-22496
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: StephenZou
>Priority: Minor
>
> For now, when an end user runs queries in beeline or in Hue through STS,
> no logs are displayed; the end user will wait until the job finishes or fails.
> Progress information is needed to inform end users how the job is running if
> they are not familiar with the YARN RM or the standalone Spark master UI.






[jira] [Reopened] (SPARK-22496) beeline display operation log

2017-12-13 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-22496:
--

> beeline display operation log
> -
>
> Key: SPARK-22496
> URL: https://issues.apache.org/jira/browse/SPARK-22496
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: StephenZou
>Priority: Minor
>
> For now, when an end user runs queries in beeline or in Hue through STS,
> no logs are displayed; the end user will wait until the job finishes or fails.
> Progress information is needed to inform end users how the job is running if
> they are not familiar with the YARN RM or the standalone Spark master UI.






[jira] [Created] (SPARK-22779) ConfigEntry's default value should actually be a value

2017-12-13 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-22779:
---

 Summary: ConfigEntry's default value should actually be a value
 Key: SPARK-22779
 URL: https://issues.apache.org/jira/browse/SPARK-22779
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.1
Reporter: Reynold Xin
Assignee: Reynold Xin


ConfigEntry's config value right now shows a human-readable message. In some
places in SQL we actually rely on the default value being a real value when
setting the values.






[jira] [Assigned] (SPARK-22764) Flaky test: SparkContextSuite "Cancelling stages/jobs with custom reasons"

2017-12-13 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-22764:


Assignee: Marcelo Vanzin

> Flaky test: SparkContextSuite "Cancelling stages/jobs with custom reasons"
> --
>
> Key: SPARK-22764
> URL: https://issues.apache.org/jira/browse/SPARK-22764
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.3.0
>
>
> Saw this in a PR builder:
> {noformat}
> [info] - Cancelling stages/jobs with custom reasons. *** FAILED *** (135 
> milliseconds)
> [info]   Expected exception org.apache.spark.SparkException to be thrown, but 
> no exception was thrown (SparkContextSuite.scala:531)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
> [info]   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
> [info]   at org.scalatest.Assertions$class.intercept(Assertions.scala:822)
> [info]   at org.scalatest.FunSuite.intercept(FunSuite.scala:1560)
> {noformat}
> From the logs, the job is finishing before the test code cancels it:
> {noformat}
> 17/12/12 11:00:41.680 Executor task launch worker for task 1 INFO Executor: 
> Finished task 0.0 in stage 1.0 (TID 1). 703 bytes result sent to driver
> 17/12/12 11:00:41.681 task-result-getter-1 INFO TaskSetManager: Finished task 
> 0.0 in stage 1.0 (TID 1) in 13 ms on localhost (executor driver) (1/1)
> 17/12/12 11:00:41.681 task-result-getter-1 INFO TaskSchedulerImpl: Removed 
> TaskSet 1.0, whose tasks have all completed, from pool 
> 17/12/12 11:00:41.681 dag-scheduler-event-loop INFO DAGScheduler: ResultStage 
> 1 (apply at Assertions.scala:805) finished in 0.066 s
> 17/12/12 11:00:41.681 pool-1-thread-1-ScalaTest-running-SparkContextSuite 
> INFO DAGScheduler: Job 1 finished: apply at Assertions.scala:805, took 
> 0.066946 s
> 17/12/12 11:00:41.682 spark-listener-group-shared INFO DAGScheduler: Asked to 
> cancel job 1
> {noformat}






[jira] [Resolved] (SPARK-22764) Flaky test: SparkContextSuite "Cancelling stages/jobs with custom reasons"

2017-12-13 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-22764.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19956
[https://github.com/apache/spark/pull/19956]

> Flaky test: SparkContextSuite "Cancelling stages/jobs with custom reasons"
> --
>
> Key: SPARK-22764
> URL: https://issues.apache.org/jira/browse/SPARK-22764
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.3.0
>
>
> Saw this in a PR builder:
> {noformat}
> [info] - Cancelling stages/jobs with custom reasons. *** FAILED *** (135 
> milliseconds)
> [info]   Expected exception org.apache.spark.SparkException to be thrown, but 
> no exception was thrown (SparkContextSuite.scala:531)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
> [info]   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
> [info]   at org.scalatest.Assertions$class.intercept(Assertions.scala:822)
> [info]   at org.scalatest.FunSuite.intercept(FunSuite.scala:1560)
> {noformat}
> From the logs, the job is finishing before the test code cancels it:
> {noformat}
> 17/12/12 11:00:41.680 Executor task launch worker for task 1 INFO Executor: 
> Finished task 0.0 in stage 1.0 (TID 1). 703 bytes result sent to driver
> 17/12/12 11:00:41.681 task-result-getter-1 INFO TaskSetManager: Finished task 
> 0.0 in stage 1.0 (TID 1) in 13 ms on localhost (executor driver) (1/1)
> 17/12/12 11:00:41.681 task-result-getter-1 INFO TaskSchedulerImpl: Removed 
> TaskSet 1.0, whose tasks have all completed, from pool 
> 17/12/12 11:00:41.681 dag-scheduler-event-loop INFO DAGScheduler: ResultStage 
> 1 (apply at Assertions.scala:805) finished in 0.066 s
> 17/12/12 11:00:41.681 pool-1-thread-1-ScalaTest-running-SparkContextSuite 
> INFO DAGScheduler: Job 1 finished: apply at Assertions.scala:805, took 
> 0.066946 s
> 17/12/12 11:00:41.682 spark-listener-group-shared INFO DAGScheduler: Asked to 
> cancel job 1
> {noformat}






[jira] [Resolved] (SPARK-22772) elt should use splitExpressionsWithCurrentInputs to split expression codes

2017-12-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22772.
-
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 2.3.0

> elt should use splitExpressionsWithCurrentInputs to split expression codes
> --
>
> Key: SPARK-22772
> URL: https://issues.apache.org/jira/browse/SPARK-22772
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.3.0
>
>
> In SPARK-22550, elt was changed to use {{buildCodeBlocks}} to manually split
> expression codes. We should use {{splitExpressionsWithCurrentInputs}} to do
> that because it considers both normal codegen and wholestage codegen.






[jira] [Commented] (SPARK-22647) Docker files for image creation

2017-12-13 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289988#comment-16289988
 ] 

Erik Erlandson commented on SPARK-22647:


I'd like to propose migrating our images onto CentOS, which should also fix
this particular issue.

> Docker files for image creation
> ---
>
> Key: SPARK-22647
> URL: https://issues.apache.org/jira/browse/SPARK-22647
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>
> This covers the dockerfiles that need to be shipped to enable the Kubernetes 
> backend for Spark.






[jira] [Comment Edited] (SPARK-22359) Improve the test coverage of window functions

2017-12-13 Thread Gabor Somogyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289965#comment-16289965
 ] 

Gabor Somogyi edited comment on SPARK-22359 at 12/13/17 9:42 PM:
-

OK [~smurakozi], then I'll jump on Frames.


was (Author: gsomogyi):
OK, then I'll jump on Frames.

> Improve the test coverage of window functions
> -
>
> Key: SPARK-22359
> URL: https://issues.apache.org/jira/browse/SPARK-22359
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>
> There are already quite a few integration tests using window functions, but
> the unit test coverage for window functions is not ideal.
> We'd like to test the following aspects:
> * Specifications
> ** different partition clauses (none, one, multiple)
> ** different order clauses (none, one, multiple, asc/desc, nulls first/last)
> * Frames and their combinations
> ** OffsetWindowFunctionFrame
> ** UnboundedWindowFunctionFrame
> ** SlidingWindowFunctionFrame
> ** UnboundedPrecedingWindowFunctionFrame
> ** UnboundedFollowingWindowFunctionFrame
> * Aggregate function types
> ** Declarative
> ** Imperative
> ** UDAF
> * Spilling
> ** Cover the conditions under which WindowExec should spill at least once 
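
As a purely illustrative sketch (not an existing Spark test; the data and names below
are made up), a unit test for the sliding-frame bucket in the list above could be
shaped roughly like this:

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

object WindowFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("window-frame-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("a", 2), ("a", 3), ("b", 4)).toDF("key", "value")

    // A sliding frame: one row before through one row after the current row,
    // within each partition, ordered by value.
    val sliding = Window.partitionBy("key").orderBy("value").rowsBetween(-1, 1)

    // Expected sums per row for key "a": 1+2, 1+2+3, 2+3; for key "b": 4.
    df.withColumn("sumOverFrame", sum($"value").over(sliding)).show()

    spark.stop()
  }
}
{code}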



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22359) Improve the test coverage of window functions

2017-12-13 Thread Gabor Somogyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289965#comment-16289965
 ] 

Gabor Somogyi commented on SPARK-22359:
---

OK, then I'll jump on Frames.

> Improve the test coverage of window functions
> -
>
> Key: SPARK-22359
> URL: https://issues.apache.org/jira/browse/SPARK-22359
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>
> There are already quite a few integration tests using window functions, but 
> the unit test coverage for window functions is not ideal.
> We'd like to test the following aspects:
> * Specifications
> ** different partition clauses (none, one, multiple)
> ** different order clauses (none, one, multiple, asc/desc, nulls first/last)
> * Frames and their combinations
> ** OffsetWindowFunctionFrame
> ** UnboundedWindowFunctionFrame
> ** SlidingWindowFunctionFrame
> ** UnboundedPrecedingWindowFunctionFrame
> ** UnboundedFollowingWindowFunctionFrame
> * Aggregate function types
> ** Declarative
> ** Imperative
> ** UDAF
> * Spilling
> ** Cover the conditions under which WindowExec should spill at least once 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22574) Wrong request causing Spark Dispatcher going inactive

2017-12-13 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-22574:
--

Assignee: German Schiavon Matteo

> Wrong request causing Spark Dispatcher going inactive
> -
>
> Key: SPARK-22574
> URL: https://issues.apache.org/jira/browse/SPARK-22574
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Submit
>Affects Versions: 2.2.0
>Reporter: German Schiavon Matteo
>Assignee: German Schiavon Matteo
>Priority: Minor
> Fix For: 2.2.2, 2.3.0
>
>
> Submitting a malformed _CreateSubmissionRequest_ to the Spark Dispatcher puts the 
> Dispatcher into a bad state and makes it inactive as a Mesos framework.
> The class CreateSubmissionRequest initialises its arguments to null as follows:
> {code:title=CreateSubmissionRequest.scala|borderStyle=solid}
>   var appResource: String = null
>   var mainClass: String = null
>   var appArgs: Array[String] = null
>   var sparkProperties: Map[String, String] = null
>   var environmentVariables: Map[String, String] = null
> {code}
> Some of these variables are checked, but not all of them; for example, appArgs 
> and environmentVariables are not. 
> If _appArgs_ is not set, it causes the following error: 
> {code:title=error|borderStyle=solid}
> 17/11/21 14:37:24 INFO MesosClusterScheduler: Reviving Offers.
> Exception in thread "Thread-22" java.lang.NullPointerException
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.getDriverCommandValue(MesosClusterScheduler.scala:444)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.buildDriverCommand(MesosClusterScheduler.scala:451)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.org$apache$spark$scheduler$cluster$mesos$MesosClusterScheduler$$createTaskInfo(MesosClusterScheduler.scala:538)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:570)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:555)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:555)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:621)
> {code}
> This is because the scheduler accesses the field without checking whether it is null.
>  
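
A minimal sketch of the missing validation (this is not the Dispatcher's actual code;
the helper name is made up, and the field names simply mirror the snippet above):

{code:java}
// Hypothetical pre-flight check: reject a CreateSubmissionRequest-like payload whose
// required fields were left null, instead of letting MesosClusterScheduler hit an NPE later.
def validateSubmission(
    appResource: String,
    mainClass: String,
    appArgs: Array[String],
    sparkProperties: Map[String, String],
    environmentVariables: Map[String, String]): Unit = {
  require(appResource != null, "appResource must be set")
  require(mainClass != null, "mainClass must be set")
  require(appArgs != null, "appArgs must be set (use an empty array when there are no arguments)")
  require(sparkProperties != null, "sparkProperties must be set")
  require(environmentVariables != null, "environmentVariables must be set")
}
{code}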



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22574) Wrong request causing Spark Dispatcher going inactive

2017-12-13 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-22574.

   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.2

Issue resolved by pull request 19966
[https://github.com/apache/spark/pull/19966]

> Wrong request causing Spark Dispatcher going inactive
> -
>
> Key: SPARK-22574
> URL: https://issues.apache.org/jira/browse/SPARK-22574
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Submit
>Affects Versions: 2.2.0
>Reporter: German Schiavon Matteo
>Priority: Minor
> Fix For: 2.2.2, 2.3.0
>
>
> Submitting a malformed _CreateSubmissionRequest_ to the Spark Dispatcher puts the 
> Dispatcher into a bad state and makes it inactive as a Mesos framework.
> The class CreateSubmissionRequest initialises its arguments to null as follows:
> {code:title=CreateSubmissionRequest.scala|borderStyle=solid}
>   var appResource: String = null
>   var mainClass: String = null
>   var appArgs: Array[String] = null
>   var sparkProperties: Map[String, String] = null
>   var environmentVariables: Map[String, String] = null
> {code}
> Some of these variables are checked, but not all of them; for example, appArgs 
> and environmentVariables are not. 
> If _appArgs_ is not set, it causes the following error: 
> {code:title=error|borderStyle=solid}
> 17/11/21 14:37:24 INFO MesosClusterScheduler: Reviving Offers.
> Exception in thread "Thread-22" java.lang.NullPointerException
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.getDriverCommandValue(MesosClusterScheduler.scala:444)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.buildDriverCommand(MesosClusterScheduler.scala:451)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.org$apache$spark$scheduler$cluster$mesos$MesosClusterScheduler$$createTaskInfo(MesosClusterScheduler.scala:538)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:570)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:555)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:555)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:621)
> {code}
> This is because the scheduler accesses the field without checking whether it is null.
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22778) Kubernetes scheduler at master failing to run applications successfully

2017-12-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289923#comment-16289923
 ] 

Apache Spark commented on SPARK-22778:
--

User 'liyinan926' has created a pull request for this issue:
https://github.com/apache/spark/pull/19972

> Kubernetes scheduler at master failing to run applications successfully
> ---
>
> Key: SPARK-22778
> URL: https://issues.apache.org/jira/browse/SPARK-22778
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Critical
>
> Building images based on master and deploying Spark PI results in the 
> following error.
> 2017-12-13 19:57:19 INFO  SparkContext:54 - Successfully stopped SparkContext
> Exception in thread "main" org.apache.spark.SparkException: Could not parse 
> Master URL: 'k8s:https://xx.yy.zz.ww'
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
>   at org.apache.spark.SparkContext.(SparkContext.scala:496)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Shutdown hook called
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Deleting directory 
> /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd
> This is likely an artifact seen because of changes in master, or our 
> submission code in the reviews. We haven't seen this on our fork. Hopefully 
> once integration tests are ported against upstream/master, we will catch 
> these issues earlier. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22778) Kubernetes scheduler at master failing to run applications successfully

2017-12-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22778:


Assignee: Apache Spark

> Kubernetes scheduler at master failing to run applications successfully
> ---
>
> Key: SPARK-22778
> URL: https://issues.apache.org/jira/browse/SPARK-22778
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Assignee: Apache Spark
>Priority: Critical
>
> Building images based on master and deploying Spark PI results in the 
> following error.
> 2017-12-13 19:57:19 INFO  SparkContext:54 - Successfully stopped SparkContext
> Exception in thread "main" org.apache.spark.SparkException: Could not parse 
> Master URL: 'k8s:https://xx.yy.zz.ww'
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
>   at org.apache.spark.SparkContext.(SparkContext.scala:496)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Shutdown hook called
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Deleting directory 
> /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd
> This is likely an artifact seen because of changes in master, or our 
> submission code in the reviews. We haven't seen this on our fork. Hopefully 
> once integration tests are ported against upstream/master, we will catch 
> these issues earlier. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22778) Kubernetes scheduler at master failing to run applications successfully

2017-12-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22778:


Assignee: (was: Apache Spark)

> Kubernetes scheduler at master failing to run applications successfully
> ---
>
> Key: SPARK-22778
> URL: https://issues.apache.org/jira/browse/SPARK-22778
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Critical
>
> Building images based on master and deploying Spark PI results in the 
> following error.
> 2017-12-13 19:57:19 INFO  SparkContext:54 - Successfully stopped SparkContext
> Exception in thread "main" org.apache.spark.SparkException: Could not parse 
> Master URL: 'k8s:https://xx.yy.zz.ww'
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
>   at org.apache.spark.SparkContext.(SparkContext.scala:496)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Shutdown hook called
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Deleting directory 
> /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd
> This is likely an artifact seen because of changes in master, or our 
> submission code in the reviews. We haven't seen this on our fork. Hopefully 
> once integration tests are ported against upstream/master, we will catch 
> these issues earlier. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21417) Detect transitive join conditions via expressions

2017-12-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21417:


Assignee: Anton Okolnychyi  (was: Apache Spark)

> Detect transitive join conditions via expressions
> -
>
> Key: SPARK-21417
> URL: https://issues.apache.org/jira/browse/SPARK-21417
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Claus Stadler
>Assignee: Anton Okolnychyi
> Fix For: 2.3.0
>
>
> _Disclaimer: The nature of this report is similar to that of 
> https://issues.apache.org/jira/browse/CALCITE-1887 - yet, as SPARK (to my 
> understanding) uses its own SQL implementation, the requested improvement has 
> to be treated as a separate issue._
> Given table aliases ta, tb, column names ca, cb, and an arbitrary 
> (deterministic) expression expr, Spark should be capable of inferring join 
> conditions by transitivity:
> {noformat}
> ta.ca = expr AND tb.cb = expr -> ta.ca = tb.cb
> {noformat}
> The use case for us stems from SPARQL to SQL rewriting, where SPARQL queries 
> such as
> {code:java}
> SELECT {
>   dbr:Leipzig a ?type .
>   dbr:Leipzig dbo:mayor ?mayor
> }
> {code}
> result in an SQL query similar to
> {noformat}
> SELECT s.rdf a, s.rdf b WHERE a.s = 'dbr:Leipzig' AND b.s = 'dbr:Leipzig'
> {noformat}
> A consequence of the join condition not being recognized is that Apache 
> Spark does not find an executable plan to process the query.
> Self-contained example:
> {code:java}
> package my.package;
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StringType, StructField, StructType}
> import org.scalatest._
> class TestSparkSqlJoin extends FlatSpec {
>   "SPARK SQL processor" should "be capable of handling transitive join 
> conditions" in {
> val spark = SparkSession
>   .builder()
>   .master("local[2]")
>   .appName("Spark SQL parser bug")
>   .getOrCreate()
> import spark.implicits._
> // The schema is encoded in a string
> val schemaString = "s p o"
> // Generate the schema based on the string of schema
> val fields = schemaString.split(" ")
>   .map(fieldName => StructField(fieldName, StringType, nullable = true))
> val schema = StructType(fields)
> val data = List(("s1", "p1", "o1"))
> val dataRDD = spark.sparkContext.parallelize(data).map(attributes => 
> Row(attributes._1, attributes._2, attributes._3))
> val df = spark.createDataFrame(dataRDD, schema).as("TRIPLES")
> df.createOrReplaceTempView("TRIPLES")
> println("First Query")
> spark.sql("SELECT A.s FROM TRIPLES A, TRIPLES B WHERE A.s = B.s AND A.s = 
> 'dbr:Leipzig'").show(10)
> println("Second Query")
> spark.sql("SELECT A.s FROM TRIPLES A, TRIPLES B WHERE A.s = 'dbr:Leipzig' 
> AND B.s = 'dbr:Leipzig'").show(10)
>   }
> }
> {code}
> Output (excerpt):
> {noformat}
> First Query
> ...
> +---+
> |  s|
> +---+
> +---+
> Second Query
> - should be capable of handling transitive join conditions *** FAILED ***
>   org.apache.spark.sql.AnalysisException: Detected cartesian product for 
> INNER join between logical plans
> Project [s#3]
> +- Filter (isnotnull(s#3) && (s#3 = dbr:Leipzig))
>+- LogicalRDD [s#3, p#4, o#5]
> and
> Project
> +- Filter (isnotnull(s#20) && (s#20 = dbr:Leipzig))
>+- LogicalRDD [s#20, p#21, o#22]
> Join condition is missing or trivial.
> Use the CROSS JOIN syntax to allow cartesian products between these 
> relations.;
>   at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1080)
>   at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1077)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   ...
> Run completed in 6 seconds, 833 milliseconds.
> Total number of tests run: 1
> Suites: completed 1, aborted 0
> Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
> *** 1 TEST FAILED ***
> {noformat}
> Expected:
> A correctly working, 

[jira] [Assigned] (SPARK-21417) Detect transitive join conditions via expressions

2017-12-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21417:


Assignee: Apache Spark  (was: Anton Okolnychyi)

> Detect transitive join conditions via expressions
> -
>
> Key: SPARK-21417
> URL: https://issues.apache.org/jira/browse/SPARK-21417
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Claus Stadler
>Assignee: Apache Spark
> Fix For: 2.3.0
>
>
> _Disclaimer: The nature of this report is similar to that of 
> https://issues.apache.org/jira/browse/CALCITE-1887 - yet, as SPARK (to my 
> understanding) uses its own SQL implementation, the requested improvement has 
> to be treated as a separate issue._
> Given table aliases ta, tb, column names ca, cb, and an arbitrary 
> (deterministic) expression expr, Spark should be capable of inferring join 
> conditions by transitivity:
> {noformat}
> ta.ca = expr AND tb.cb = expr -> ta.ca = tb.cb
> {noformat}
> The use case for us stems from SPARQL to SQL rewriting, where SPARQL queries 
> such as
> {code:java}
> SELECT {
>   dbr:Leipzig a ?type .
>   dbr:Leipzig dbo:mayor ?mayor
> }
> {code}
> result in an SQL query similar to
> {noformat}
> SELECT s.rdf a, s.rdf b WHERE a.s = 'dbr:Leipzig' AND b.s = 'dbr:Leipzig'
> {noformat}
> A consequence of the join condition not being recognized is that Apache 
> Spark does not find an executable plan to process the query.
> Self-contained example:
> {code:java}
> package my.package;
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StringType, StructField, StructType}
> import org.scalatest._
> class TestSparkSqlJoin extends FlatSpec {
>   "SPARK SQL processor" should "be capable of handling transitive join 
> conditions" in {
> val spark = SparkSession
>   .builder()
>   .master("local[2]")
>   .appName("Spark SQL parser bug")
>   .getOrCreate()
> import spark.implicits._
> // The schema is encoded in a string
> val schemaString = "s p o"
> // Generate the schema based on the string of schema
> val fields = schemaString.split(" ")
>   .map(fieldName => StructField(fieldName, StringType, nullable = true))
> val schema = StructType(fields)
> val data = List(("s1", "p1", "o1"))
> val dataRDD = spark.sparkContext.parallelize(data).map(attributes => 
> Row(attributes._1, attributes._2, attributes._3))
> val df = spark.createDataFrame(dataRDD, schema).as("TRIPLES")
> df.createOrReplaceTempView("TRIPLES")
> println("First Query")
> spark.sql("SELECT A.s FROM TRIPLES A, TRIPLES B WHERE A.s = B.s AND A.s = 
> 'dbr:Leipzig'").show(10)
> println("Second Query")
> spark.sql("SELECT A.s FROM TRIPLES A, TRIPLES B WHERE A.s = 'dbr:Leipzig' 
> AND B.s = 'dbr:Leipzig'").show(10)
>   }
> }
> {code}
> Output (excerpt):
> {noformat}
> First Query
> ...
> +---+
> |  s|
> +---+
> +---+
> Second Query
> - should be capable of handling transitive join conditions *** FAILED ***
>   org.apache.spark.sql.AnalysisException: Detected cartesian product for 
> INNER join between logical plans
> Project [s#3]
> +- Filter (isnotnull(s#3) && (s#3 = dbr:Leipzig))
>+- LogicalRDD [s#3, p#4, o#5]
> and
> Project
> +- Filter (isnotnull(s#20) && (s#20 = dbr:Leipzig))
>+- LogicalRDD [s#20, p#21, o#22]
> Join condition is missing or trivial.
> Use the CROSS JOIN syntax to allow cartesian products between these 
> relations.;
>   at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1080)
>   at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1077)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   ...
> Run completed in 6 seconds, 833 milliseconds.
> Total number of tests run: 1
> Suites: completed 1, aborted 0
> Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
> *** 1 TEST FAILED ***
> {noformat}
> Expected:
> A correctly working, 

[jira] [Commented] (SPARK-22778) Kubernetes scheduler at master failing to run applications successfully

2017-12-13 Thread Yinan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289907#comment-16289907
 ] 

Yinan Li commented on SPARK-22778:
--

Just verified that the fix worked. I'm gonna send a PR soon.

> Kubernetes scheduler at master failing to run applications successfully
> ---
>
> Key: SPARK-22778
> URL: https://issues.apache.org/jira/browse/SPARK-22778
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Critical
>
> Building images based on master and deploying Spark PI results in the 
> following error.
> 2017-12-13 19:57:19 INFO  SparkContext:54 - Successfully stopped SparkContext
> Exception in thread "main" org.apache.spark.SparkException: Could not parse 
> Master URL: 'k8s:https://xx.yy.zz.ww'
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
>   at org.apache.spark.SparkContext.(SparkContext.scala:496)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Shutdown hook called
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Deleting directory 
> /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd
> This is likely an artifact seen because of changes in master, or our 
> submission code in the reviews. We haven't seen this on our fork. Hopefully 
> once integration tests are ported against upstream/master, we will catch 
> these issues earlier. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21417) Detect transitive join conditions via expressions

2017-12-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reopened SPARK-21417:
-

> Detect transitive join conditions via expressions
> -
>
> Key: SPARK-21417
> URL: https://issues.apache.org/jira/browse/SPARK-21417
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Claus Stadler
>Assignee: Anton Okolnychyi
> Fix For: 2.3.0
>
>
> _Disclaimer: The nature of this report is similar to that of 
> https://issues.apache.org/jira/browse/CALCITE-1887 - yet, as SPARK (to my 
> understanding) uses its own SQL implementation, the requested improvement has 
> to be treated as a separate issue._
> Given table aliases ta, tb, column names ca, cb, and an arbitrary 
> (deterministic) expression expr, Spark should be capable of inferring join 
> conditions by transitivity:
> {noformat}
> ta.ca = expr AND tb.cb = expr -> ta.ca = tb.cb
> {noformat}
> The use case for us stems from SPARQL to SQL rewriting, where SPARQL queries 
> such as
> {code:java}
> SELECT {
>   dbr:Leipzig a ?type .
>   dbr:Leipzig dbo:mayor ?mayor
> }
> {code}
> result in an SQL query similar to
> {noformat}
> SELECT s.rdf a, s.rdf b WHERE a.s = 'dbr:Leipzig' AND b.s = 'dbr:Leipzig'
> {noformat}
> A consequence of the join condition not being recognized is that Apache 
> Spark does not find an executable plan to process the query.
> Self-contained example:
> {code:java}
> package my.package;
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StringType, StructField, StructType}
> import org.scalatest._
> class TestSparkSqlJoin extends FlatSpec {
>   "SPARK SQL processor" should "be capable of handling transitive join 
> conditions" in {
> val spark = SparkSession
>   .builder()
>   .master("local[2]")
>   .appName("Spark SQL parser bug")
>   .getOrCreate()
> import spark.implicits._
> // The schema is encoded in a string
> val schemaString = "s p o"
> // Generate the schema based on the string of schema
> val fields = schemaString.split(" ")
>   .map(fieldName => StructField(fieldName, StringType, nullable = true))
> val schema = StructType(fields)
> val data = List(("s1", "p1", "o1"))
> val dataRDD = spark.sparkContext.parallelize(data).map(attributes => 
> Row(attributes._1, attributes._2, attributes._3))
> val df = spark.createDataFrame(dataRDD, schema).as("TRIPLES")
> df.createOrReplaceTempView("TRIPLES")
> println("First Query")
> spark.sql("SELECT A.s FROM TRIPLES A, TRIPLES B WHERE A.s = B.s AND A.s = 
> 'dbr:Leipzig'").show(10)
> println("Second Query")
> spark.sql("SELECT A.s FROM TRIPLES A, TRIPLES B WHERE A.s = 'dbr:Leipzig' 
> AND B.s = 'dbr:Leipzig'").show(10)
>   }
> }
> {code}
> Output (excerpt):
> {noformat}
> First Query
> ...
> +---+
> |  s|
> +---+
> +---+
> Second Query
> - should be capable of handling transitive join conditions *** FAILED ***
>   org.apache.spark.sql.AnalysisException: Detected cartesian product for 
> INNER join between logical plans
> Project [s#3]
> +- Filter (isnotnull(s#3) && (s#3 = dbr:Leipzig))
>+- LogicalRDD [s#3, p#4, o#5]
> and
> Project
> +- Filter (isnotnull(s#20) && (s#20 = dbr:Leipzig))
>+- LogicalRDD [s#20, p#21, o#22]
> Join condition is missing or trivial.
> Use the CROSS JOIN syntax to allow cartesian products between these 
> relations.;
>   at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1080)
>   at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1077)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   ...
> Run completed in 6 seconds, 833 milliseconds.
> Total number of tests run: 1
> Suites: completed 1, aborted 0
> Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
> *** 1 TEST FAILED ***
> {noformat}
> Expected:
> A correctly working, executable, query plan for the second query (ideally 
> equivalent 

[jira] [Resolved] (SPARK-21417) Detect transitive join conditions via expressions

2017-12-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21417.
-
Resolution: Not A Problem

> Detect transitive join conditions via expressions
> -
>
> Key: SPARK-21417
> URL: https://issues.apache.org/jira/browse/SPARK-21417
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Claus Stadler
>Assignee: Anton Okolnychyi
> Fix For: 2.3.0
>
>
> _Disclaimer: The nature of this report is similar to that of 
> https://issues.apache.org/jira/browse/CALCITE-1887 - yet, as SPARK (to my 
> understanding) uses its own SQL implementation, the requested improvement has 
> to be treated as a separate issue._
> Given table aliases ta, tb, column names ca, cb, and an arbitrary 
> (deterministic) expression expr, Spark should be capable of inferring join 
> conditions by transitivity:
> {noformat}
> ta.ca = expr AND tb.cb = expr -> ta.ca = tb.cb
> {noformat}
> The use case for us stems from SPARQL to SQL rewriting, where SPARQL queries 
> such as
> {code:java}
> SELECT {
>   dbr:Leipzig a ?type .
>   dbr:Leipzig dbo:mayor ?mayor
> }
> {code}
> result in an SQL query similar to
> {noformat}
> SELECT s.rdf a, s.rdf b WHERE a.s = 'dbr:Leipzig' AND b.s = 'dbr:Leipzig'
> {noformat}
> A consequence of the join condition not being recognized is that Apache 
> Spark does not find an executable plan to process the query.
> Self-contained example:
> {code:java}
> package my.package;
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StringType, StructField, StructType}
> import org.scalatest._
> class TestSparkSqlJoin extends FlatSpec {
>   "SPARK SQL processor" should "be capable of handling transitive join 
> conditions" in {
> val spark = SparkSession
>   .builder()
>   .master("local[2]")
>   .appName("Spark SQL parser bug")
>   .getOrCreate()
> import spark.implicits._
> // The schema is encoded in a string
> val schemaString = "s p o"
> // Generate the schema based on the string of schema
> val fields = schemaString.split(" ")
>   .map(fieldName => StructField(fieldName, StringType, nullable = true))
> val schema = StructType(fields)
> val data = List(("s1", "p1", "o1"))
> val dataRDD = spark.sparkContext.parallelize(data).map(attributes => 
> Row(attributes._1, attributes._2, attributes._3))
> val df = spark.createDataFrame(dataRDD, schema).as("TRIPLES")
> df.createOrReplaceTempView("TRIPLES")
> println("First Query")
> spark.sql("SELECT A.s FROM TRIPLES A, TRIPLES B WHERE A.s = B.s AND A.s = 
> 'dbr:Leipzig'").show(10)
> println("Second Query")
> spark.sql("SELECT A.s FROM TRIPLES A, TRIPLES B WHERE A.s = 'dbr:Leipzig' 
> AND B.s = 'dbr:Leipzig'").show(10)
>   }
> }
> {code}
> Output (excerpt):
> {noformat}
> First Query
> ...
> +---+
> |  s|
> +---+
> +---+
> Second Query
> - should be capable of handling transitive join conditions *** FAILED ***
>   org.apache.spark.sql.AnalysisException: Detected cartesian product for 
> INNER join between logical plans
> Project [s#3]
> +- Filter (isnotnull(s#3) && (s#3 = dbr:Leipzig))
>+- LogicalRDD [s#3, p#4, o#5]
> and
> Project
> +- Filter (isnotnull(s#20) && (s#20 = dbr:Leipzig))
>+- LogicalRDD [s#20, p#21, o#22]
> Join condition is missing or trivial.
> Use the CROSS JOIN syntax to allow cartesian products between these 
> relations.;
>   at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1080)
>   at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1077)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   ...
> Run completed in 6 seconds, 833 milliseconds.
> Total number of tests run: 1
> Suites: completed 1, aborted 0
> Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
> *** 1 TEST FAILED ***
> {noformat}
> Expected:
> A correctly working, executable, query plan for the 

[jira] [Reopened] (SPARK-21417) Detect transitive join conditions via expressions

2017-12-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reopened SPARK-21417:
-

> Detect transitive join conditions via expressions
> -
>
> Key: SPARK-21417
> URL: https://issues.apache.org/jira/browse/SPARK-21417
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Claus Stadler
>Assignee: Anton Okolnychyi
> Fix For: 2.3.0
>
>
> _Disclaimer: The nature of this report is similar to that of 
> https://issues.apache.org/jira/browse/CALCITE-1887 - yet, as SPARK (to my 
> understanding) uses its own SQL implementation, the requested improvement has 
> to be treated as a separate issue._
> Given table aliases ta, tb, column names ca, cb, and an arbitrary 
> (deterministic) expression expr, Spark should be capable of inferring join 
> conditions by transitivity:
> {noformat}
> ta.ca = expr AND tb.cb = expr -> ta.ca = tb.cb
> {noformat}
> The use case for us stems from SPARQL to SQL rewriting, where SPARQL queries 
> such as
> {code:java}
> SELECT {
>   dbr:Leipzig a ?type .
>   dbr:Leipzig dbo:mayor ?mayor
> }
> {code}
> result in an SQL query similar to
> {noformat}
> SELECT s.rdf a, s.rdf b WHERE a.s = 'dbr:Leipzig' AND b.s = 'dbr:Leipzig'
> {noformat}
> A consequence of the join condition not being recognized is that Apache 
> Spark does not find an executable plan to process the query.
> Self-contained example:
> {code:java}
> package my.package;
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StringType, StructField, StructType}
> import org.scalatest._
> class TestSparkSqlJoin extends FlatSpec {
>   "SPARK SQL processor" should "be capable of handling transitive join 
> conditions" in {
> val spark = SparkSession
>   .builder()
>   .master("local[2]")
>   .appName("Spark SQL parser bug")
>   .getOrCreate()
> import spark.implicits._
> // The schema is encoded in a string
> val schemaString = "s p o"
> // Generate the schema based on the string of schema
> val fields = schemaString.split(" ")
>   .map(fieldName => StructField(fieldName, StringType, nullable = true))
> val schema = StructType(fields)
> val data = List(("s1", "p1", "o1"))
> val dataRDD = spark.sparkContext.parallelize(data).map(attributes => 
> Row(attributes._1, attributes._2, attributes._3))
> val df = spark.createDataFrame(dataRDD, schema).as("TRIPLES")
> df.createOrReplaceTempView("TRIPLES")
> println("First Query")
> spark.sql("SELECT A.s FROM TRIPLES A, TRIPLES B WHERE A.s = B.s AND A.s = 
> 'dbr:Leipzig'").show(10)
> println("Second Query")
> spark.sql("SELECT A.s FROM TRIPLES A, TRIPLES B WHERE A.s = 'dbr:Leipzig' 
> AND B.s = 'dbr:Leipzig'").show(10)
>   }
> }
> {code}
> Output (excerpt):
> {noformat}
> First Query
> ...
> +---+
> |  s|
> +---+
> +---+
> Second Query
> - should be capable of handling transitive join conditions *** FAILED ***
>   org.apache.spark.sql.AnalysisException: Detected cartesian product for 
> INNER join between logical plans
> Project [s#3]
> +- Filter (isnotnull(s#3) && (s#3 = dbr:Leipzig))
>+- LogicalRDD [s#3, p#4, o#5]
> and
> Project
> +- Filter (isnotnull(s#20) && (s#20 = dbr:Leipzig))
>+- LogicalRDD [s#20, p#21, o#22]
> Join condition is missing or trivial.
> Use the CROSS JOIN syntax to allow cartesian products between these 
> relations.;
>   at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1080)
>   at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1077)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   ...
> Run completed in 6 seconds, 833 milliseconds.
> Total number of tests run: 1
> Suites: completed 1, aborted 0
> Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
> *** 1 TEST FAILED ***
> {noformat}
> Expected:
> A correctly working, executable, query plan for the second query (ideally 
> equivalent 

[jira] [Commented] (SPARK-22778) Kubernetes scheduler at master failing to run applications successfully

2017-12-13 Thread Anirudh Ramanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289881#comment-16289881
 ] 

Anirudh Ramanathan commented on SPARK-22778:


Excellent - thanks for the quick debugging, Matt. Now waiting for confirmation that 
the fix is sufficient. I also suggest we fix the URL format to prepend `k8s://` 
rather than just `k8s:`, so the master URL looks more well-formed. 
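
For illustration only (the address below is the same placeholder used in this ticket,
and the exact scheme is the subject of the suggestion above), the proposed `k8s://`
form would look like this from an application or the shell:

{code:java}
import org.apache.spark.sql.SparkSession

// Placeholder API-server address taken from this ticket; substitute a real cluster endpoint.
val spark = SparkSession.builder()
  .appName("SparkPi")
  .master("k8s://https://xx.yy.zz.ww")
  .getOrCreate()
{code}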

> Kubernetes scheduler at master failing to run applications successfully
> ---
>
> Key: SPARK-22778
> URL: https://issues.apache.org/jira/browse/SPARK-22778
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Critical
>
> Building images based on master and deploying Spark PI results in the 
> following error.
> 2017-12-13 19:57:19 INFO  SparkContext:54 - Successfully stopped SparkContext
> Exception in thread "main" org.apache.spark.SparkException: Could not parse 
> Master URL: 'k8s:https://xx.yy.zz.ww'
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
>   at org.apache.spark.SparkContext.(SparkContext.scala:496)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Shutdown hook called
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Deleting directory 
> /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd
> This is likely an artifact seen because of changes in master, or our 
> submission code in the reviews. We haven't seen this on our fork. Hopefully 
> once integration tests are ported against upstream/master, we will catch 
> these issues earlier. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22778) Kubernetes scheduler at master failing to run applications successfully

2017-12-13 Thread Yinan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289876#comment-16289876
 ] 

Yinan Li commented on SPARK-22778:
--

Ah, yes, the PR missed that. OK, I'm gonna give that a try and submit a PR to 
fix it.

> Kubernetes scheduler at master failing to run applications successfully
> ---
>
> Key: SPARK-22778
> URL: https://issues.apache.org/jira/browse/SPARK-22778
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Critical
>
> Building images based on master and deploying Spark PI results in the 
> following error.
> 2017-12-13 19:57:19 INFO  SparkContext:54 - Successfully stopped SparkContext
> Exception in thread "main" org.apache.spark.SparkException: Could not parse 
> Master URL: 'k8s:https://xx.yy.zz.ww'
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
>   at org.apache.spark.SparkContext.(SparkContext.scala:496)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Shutdown hook called
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Deleting directory 
> /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd
> This is likely an artifact seen because of changes in master, or our 
> submission code in the reviews. We haven't seen this on our fork. Hopefully 
> once integration tests are ported against upstream/master, we will catch 
> these issues earlier. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22778) Kubernetes scheduler at master failing to run applications successfully

2017-12-13 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289875#comment-16289875
 ] 

Matt Cheah commented on SPARK-22778:


And then notice we don't even have a {{resources}} directory on master: 
https://github.com/apache/spark/tree/master/resource-managers/kubernetes/core/src.

> Kubernetes scheduler at master failing to run applications successfully
> ---
>
> Key: SPARK-22778
> URL: https://issues.apache.org/jira/browse/SPARK-22778
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Critical
>
> Building images based on master and deploying Spark PI results in the 
> following error.
> 2017-12-13 19:57:19 INFO  SparkContext:54 - Successfully stopped SparkContext
> Exception in thread "main" org.apache.spark.SparkException: Could not parse 
> Master URL: 'k8s:https://xx.yy.zz.ww'
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
>   at org.apache.spark.SparkContext.(SparkContext.scala:496)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Shutdown hook called
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Deleting directory 
> /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd
> This is likely an artifact seen because of changes in master, or our 
> submission code in the reviews. We haven't seen this on our fork. Hopefully 
> once integration tests are ported against upstream/master, we will catch 
> these issues earlier. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22778) Kubernetes scheduler at master failing to run applications successfully

2017-12-13 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289871#comment-16289871
 ] 

Matt Cheah commented on SPARK-22778:


I see the problem. We're missing the {{META-INF/services}} file that tells 
service loaders to include our implementation. See 
https://github.com/apache-spark-on-k8s/spark/blob/branch-2.2-kubernetes/resource-managers/kubernetes/core/src/main/resources/META-INF/services/org.apache.spark.scheduler.ExternalClusterManager.
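
As a rough diagnostic sketch (not part of Spark; it just uses the standard
{{java.util.ServiceLoader}} mechanism the comment refers to), one way to check, e.g.
from a spark-shell inside the image, whether any {{ExternalClusterManager}}
implementations are registered on the classpath:

{code:java}
import java.util.ServiceLoader
import scala.collection.JavaConverters._

// Lists the ExternalClusterManager implementations registered via META-INF/services
// entries on the current classpath. If spark-kubernetes (and its services file) is
// missing from the image, KubernetesClusterManager will not appear here.
val serviceInterface = Class.forName("org.apache.spark.scheduler.ExternalClusterManager")
val managers = ServiceLoader.load(serviceInterface).iterator().asScala.toList
managers.foreach(m => println(m.getClass.getName))
{code}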

> Kubernetes scheduler at master failing to run applications successfully
> ---
>
> Key: SPARK-22778
> URL: https://issues.apache.org/jira/browse/SPARK-22778
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Critical
>
> Building images based on master and deploying Spark PI results in the 
> following error.
> 2017-12-13 19:57:19 INFO  SparkContext:54 - Successfully stopped SparkContext
> Exception in thread "main" org.apache.spark.SparkException: Could not parse 
> Master URL: 'k8s:https://xx.yy.zz.ww'
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
>   at org.apache.spark.SparkContext.(SparkContext.scala:496)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Shutdown hook called
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Deleting directory 
> /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd
> This is likely an artifact seen because of changes in master, or our 
> submission code in the reviews. We haven't seen this on our fork. Hopefully 
> once integration tests are ported against upstream/master, we will catch 
> these issues earlier. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22778) Kubernetes scheduler at master failing to run applications successfully

2017-12-13 Thread Anirudh Ramanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289868#comment-16289868
 ] 

Anirudh Ramanathan edited comment on SPARK-22778 at 12/13/17 8:41 PM:
--

I've verified that the image contains the right jars.
One more thing that changed from underneath us is 
https://github.com/apache/spark/pull/19631. Not sure yet if that's related.


was (Author: foxish):
I've verified that the image contains the right jars.
One more thing that changed from underneath us is 
https://github.com/apache/spark/pull/19631. Not sure if that's related.

> Kubernetes scheduler at master failing to run applications successfully
> ---
>
> Key: SPARK-22778
> URL: https://issues.apache.org/jira/browse/SPARK-22778
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Critical
>
> Building images based on master and deploying Spark PI results in the 
> following error.
> 2017-12-13 19:57:19 INFO  SparkContext:54 - Successfully stopped SparkContext
> Exception in thread "main" org.apache.spark.SparkException: Could not parse 
> Master URL: 'k8s:https://xx.yy.zz.ww'
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
>   at org.apache.spark.SparkContext.(SparkContext.scala:496)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Shutdown hook called
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Deleting directory 
> /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd
> This is likely an artifact seen because of changes in master, or our 
> submission code in the reviews. We haven't seen this on our fork. Hopefully 
> once integration tests are ported against upstream/master, we will catch 
> these issues earlier. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22778) Kubernetes scheduler at master failing to run applications successfully

2017-12-13 Thread Anirudh Ramanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289868#comment-16289868
 ] 

Anirudh Ramanathan commented on SPARK-22778:


I've verified that the image contains the right jars.
One more thing that changed from underneath us is 
https://github.com/apache/spark/pull/19631. Not sure if that's related.

> Kubernetes scheduler at master failing to run applications successfully
> ---
>
> Key: SPARK-22778
> URL: https://issues.apache.org/jira/browse/SPARK-22778
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Critical
>
> Building images based on master and deploying Spark PI results in the 
> following error.
> 2017-12-13 19:57:19 INFO  SparkContext:54 - Successfully stopped SparkContext
> Exception in thread "main" org.apache.spark.SparkException: Could not parse 
> Master URL: 'k8s:https://xx.yy.zz.ww'
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
>   at org.apache.spark.SparkContext.(SparkContext.scala:496)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Shutdown hook called
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Deleting directory 
> /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd
> This is likely an artifact seen because of changes in master, or our 
> submission code in the reviews. We haven't seen this on our fork. Hopefully 
> once integration tests are ported against upstream/master, we will catch 
> these issues earlier. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22778) Kubernetes scheduler at master failing to run applications successfully

2017-12-13 Thread Matt Cheah (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Cheah updated SPARK-22778:
---
Priority: Critical  (was: Major)

> Kubernetes scheduler at master failing to run applications successfully
> ---
>
> Key: SPARK-22778
> URL: https://issues.apache.org/jira/browse/SPARK-22778
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Critical
>
> Building images based on master and deploying Spark PI results in the 
> following error.
> 2017-12-13 19:57:19 INFO  SparkContext:54 - Successfully stopped SparkContext
> Exception in thread "main" org.apache.spark.SparkException: Could not parse 
> Master URL: 'k8s:https://xx.yy.zz.ww'
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
>   at org.apache.spark.SparkContext.(SparkContext.scala:496)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Shutdown hook called
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Deleting directory 
> /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd
> This is likely an artifact seen because of changes in master, or our 
> submission code in the reviews. We haven't seen this on our fork. Hopefully 
> once integration tests are ported against upstream/master, we will catch 
> these issues earlier. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22778) Kubernetes scheduler at master failing to run applications successfully

2017-12-13 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289836#comment-16289836
 ] 

Matt Cheah edited comment on SPARK-22778 at 12/13/17 8:28 PM:
--

The {{canCreate}} method for {{KubernetesClusterManager}} should match that 
URI. The primary possibility I can think of is that the 
{{KubernetesClusterManager}} isn't being service loaded at all, which would 
imply that {{spark-kubernetes}} isn't on the classpath. Can we verify that the 
Docker image contains all of the correct jars?
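
For reference, a minimal, self-contained sketch (the class names below are stand-ins, not the real Spark classes) of the two pieces referred to above: a {{canCreate}} check keyed on the master URL prefix, and discovery through {{java.util.ServiceLoader}}, which only finds an implementation if the spark-kubernetes jar, with its {{META-INF/services}} registration, is actually on the classpath inside the image:

{code:scala}
// Illustrative stand-ins only -- not the real Spark classes.
import java.util.ServiceLoader
import scala.collection.JavaConverters._

// Local stand-in for the ExternalClusterManager trait.
trait SketchClusterManager {
  def canCreate(masterURL: String): Boolean
}

// A manager that claims URLs carrying the k8s prefix, e.g. "k8s://https://host".
class SketchKubernetesManager extends SketchClusterManager {
  override def canCreate(masterURL: String): Boolean = masterURL.startsWith("k8s")
}

object ServiceLoadingCheck extends App {
  // ServiceLoader only returns implementations registered in a META-INF/services
  // file on the classpath; an empty result here mirrors the "no external cluster
  // manager registered" warning seen in the driver log.
  val managers = ServiceLoader.load(classOf[SketchClusterManager]).asScala.toList
  println(s"Discovered cluster managers: $managers")
}
{code}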


was (Author: mcheah):
The `canCreate` method for `KubernetesClusterManager` should match that URI. 
The primary possibility I can think of is that the KubernetesClusterManager 
isn't being service loaded at all, which would imply that `spark-kubernetes` 
isn't on the classpath. Can we verify that the Docker image contains all of the 
correct jars?

> Kubernetes scheduler at master failing to run applications successfully
> ---
>
> Key: SPARK-22778
> URL: https://issues.apache.org/jira/browse/SPARK-22778
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>
> Building images based on master and deploying Spark PI results in the 
> following error.
> 2017-12-13 19:57:19 INFO  SparkContext:54 - Successfully stopped SparkContext
> Exception in thread "main" org.apache.spark.SparkException: Could not parse 
> Master URL: 'k8s:https://xx.yy.zz.ww'
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
>   at org.apache.spark.SparkContext.(SparkContext.scala:496)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Shutdown hook called
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Deleting directory 
> /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd
> This is likely an artifact seen because of changes in master, or our 
> submission code in the reviews. We haven't seen this on our fork. Hopefully 
> once integration tests are ported against upstream/master, we will catch 
> these issues earlier. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22778) Kubernetes scheduler at master failing to run applications successfully

2017-12-13 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289836#comment-16289836
 ] 

Matt Cheah commented on SPARK-22778:


The `canCreate` method for `KubernetesClusterManager` should match that URI. 
The primary possibility I can think of is that the KubernetesClusterManager 
isn't being service loaded at all, which would imply that `spark-kubernetes` 
isn't on the classpath. Can we verify that the Docker image contains all of the 
correct jars?

> Kubernetes scheduler at master failing to run applications successfully
> ---
>
> Key: SPARK-22778
> URL: https://issues.apache.org/jira/browse/SPARK-22778
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>
> Building images based on master and deploying Spark PI results in the 
> following error.
> 2017-12-13 19:57:19 INFO  SparkContext:54 - Successfully stopped SparkContext
> Exception in thread "main" org.apache.spark.SparkException: Could not parse 
> Master URL: 'k8s:https://xx.yy.zz.ww'
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
>   at org.apache.spark.SparkContext.(SparkContext.scala:496)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Shutdown hook called
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Deleting directory 
> /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd
> This is likely an artifact seen because of changes in master, or our 
> submission code in the reviews. We haven't seen this on our fork. Hopefully 
> once integration tests are ported against upstream/master, we will catch 
> these issues earlier. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22778) Kubernetes scheduler at master failing to run applications successfully

2017-12-13 Thread Yinan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289822#comment-16289822
 ] 

Yinan Li edited comment on SPARK-22778 at 12/13/17 8:24 PM:


Just some background on this. The validation and parsing of the k8s master URL has 
been moved to SparkSubmit, as suggested in the review. The parsed master 
URL (https://... for example) has a {{k8s}} prefix prepended after parsing to 
satisfy {{KubernetesClusterManager}}, whose {{canCreate}} method checks whether 
the master URL starts with {{k8s}}. That's why you see the {{k8s:}} prefix. The 
issue seems to be that in the driver pod {{SparkContext}} could not find 
{{KubernetesClusterManager}}, based on the debug messages I added. The code that 
triggered the error (with the debugging I added) is as follows:

{code:java}
private def getClusterManager(url: String): Option[ExternalClusterManager] = {
  val loader = Utils.getContextOrSparkClassLoader
  val serviceLoaders =
    ServiceLoader.load(classOf[ExternalClusterManager], loader).asScala
  serviceLoaders.foreach { loader =>
    logInfo(s"Found the following external cluster manager: $loader")
  }

  val filteredServiceLoaders = serviceLoaders.filter(_.canCreate(url))
  if (filteredServiceLoaders.size > 1) {
    throw new SparkException(
      s"Multiple external cluster managers registered for the url $url: $serviceLoaders")
  } else if (filteredServiceLoaders.isEmpty) {
    logWarning(s"No external cluster manager registered for url $url")
  }
  filteredServiceLoaders.headOption
}
{code}

And I got the following:
{code:java}
No external cluster manager registered for url k8s:https://35.226.8.173
{code}
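
One way to confirm that from inside the driver container is to check whether any {{ExternalClusterManager}} service registration is visible on the classpath at all. A hypothetical debugging sketch (not part of the Spark codebase, and assuming the service file is named after {{org.apache.spark.scheduler.ExternalClusterManager}}):

{code:scala}
// Run with the same classpath as the driver: if no resource is printed, the
// spark-kubernetes jar (and its service registration) is missing from the image,
// which would explain why getClusterManager returns None.
object ClusterManagerRegistrationCheck extends App {
  val resource = "META-INF/services/org.apache.spark.scheduler.ExternalClusterManager"
  val urls = getClass.getClassLoader.getResources(resource)
  var found = false
  while (urls.hasMoreElements) {
    found = true
    println(s"Found service registration at: ${urls.nextElement()}")
  }
  if (!found) println(s"No $resource found on the classpath")
}
{code}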



was (Author: liyinan926):
Just some background on this. The validation and parsing of the k8s master URL has 
been moved to SparkSubmit, as suggested in the review. The parsed master 
URL (https://... for example) has a {{k8s}} prefix prepended after parsing to 
satisfy {{KubernetesClusterManager}}, whose {{canCreate}} method checks whether 
the master URL starts with {{k8s}}. That's why you see the {{k8s:}} prefix. The 
issue seems to be that in the driver pod {{SparkContext}} could not find 
{{KubernetesClusterManager}}, based on the debug messages I added:

{code:scala}
private def getClusterManager(url: String): Option[ExternalClusterManager] = {
  val loader = Utils.getContextOrSparkClassLoader
  val serviceLoaders =
    ServiceLoader.load(classOf[ExternalClusterManager], loader).asScala
  serviceLoaders.foreach { loader =>
    logInfo(s"Found the following external cluster manager: $loader")
  }

  val filteredServiceLoaders = serviceLoaders.filter(_.canCreate(url))
  if (filteredServiceLoaders.size > 1) {
    throw new SparkException(
      s"Multiple external cluster managers registered for the url $url: $serviceLoaders")
  } else if (filteredServiceLoaders.isEmpty) {
    logWarning(s"No external cluster manager registered for url $url")
  }
  filteredServiceLoaders.headOption
}
{code}

And I got the following:
{code:java}
No external cluster manager registered for url k8s:https://35.226.8.173
{code}


> Kubernetes scheduler at master failing to run applications successfully
> ---
>
> Key: SPARK-22778
> URL: https://issues.apache.org/jira/browse/SPARK-22778
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>
> Building images based on master and deploying Spark PI results in the 
> following error.
> 2017-12-13 19:57:19 INFO  SparkContext:54 - Successfully stopped SparkContext
> Exception in thread "main" org.apache.spark.SparkException: Could not parse 
> Master URL: 'k8s:https://xx.yy.zz.ww'
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
>   at org.apache.spark.SparkContext.(SparkContext.scala:496)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Shutdown hook called
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Deleting directory 
> /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd
> This is likely an artifact seen because of changes in master, or our 
> submission code in the reviews. 

[jira] [Commented] (SPARK-22778) Kubernetes scheduler at master failing to run applications successfully

2017-12-13 Thread Yinan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289822#comment-16289822
 ] 

Yinan Li commented on SPARK-22778:
--

Just some background on this. The validation and parsing of the k8s master URL has 
been moved to SparkSubmit, as suggested in the review. The parsed master 
URL (https://... for example) has a {{k8s}} prefix prepended after parsing to 
satisfy {{KubernetesClusterManager}}, whose {{canCreate}} method checks whether 
the master URL starts with {{k8s}}. That's why you see the {{k8s:}} prefix. The 
issue seems to be that in the driver pod {{SparkContext}} could not find 
{{KubernetesClusterManager}}, based on the debug messages I added:

{code:scala}
private def getClusterManager(url: String): Option[ExternalClusterManager] = {
  val loader = Utils.getContextOrSparkClassLoader
  val serviceLoaders =
    ServiceLoader.load(classOf[ExternalClusterManager], loader).asScala
  serviceLoaders.foreach { loader =>
    logInfo(s"Found the following external cluster manager: $loader")
  }

  val filteredServiceLoaders = serviceLoaders.filter(_.canCreate(url))
  if (filteredServiceLoaders.size > 1) {
    throw new SparkException(
      s"Multiple external cluster managers registered for the url $url: $serviceLoaders")
  } else if (filteredServiceLoaders.isEmpty) {
    logWarning(s"No external cluster manager registered for url $url")
  }
  filteredServiceLoaders.headOption
}
{code}

And I got the following:
{code:java}
No external cluster manager registered for url k8s:https://35.226.8.173
{code}


> Kubernetes scheduler at master failing to run applications successfully
> ---
>
> Key: SPARK-22778
> URL: https://issues.apache.org/jira/browse/SPARK-22778
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>
> Building images based on master and deploying Spark PI results in the 
> following error.
> 2017-12-13 19:57:19 INFO  SparkContext:54 - Successfully stopped SparkContext
> Exception in thread "main" org.apache.spark.SparkException: Could not parse 
> Master URL: 'k8s:https://xx.yy.zz.ww'
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
>   at org.apache.spark.SparkContext.(SparkContext.scala:496)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Shutdown hook called
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Deleting directory 
> /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd
> This is likely an artifact seen because of changes in master, or our 
> submission code in the reviews. We haven't seen this on our fork. Hopefully 
> once integration tests are ported against upstream/master, we will catch 
> these issues earlier. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22778) Kubernetes scheduler at master failing to run applications successfully

2017-12-13 Thread Anirudh Ramanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289818#comment-16289818
 ] 

Anirudh Ramanathan edited comment on SPARK-22778 at 12/13/17 8:23 PM:
--

I submitted it as `k8s://https://xx.yy.zz.ww` to spark-submit. However, it 
seems there has been a change in how that URL is validated on the 
client side, which strips out the k8s scheme and adds it back in the above 
format. That might be at fault here. 

Here's my full spark-submit command:

bin/spark-submit \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --master k8s://https://xx.yy.zz.ww \
  --conf spark.executor.instances=5 \
  --conf spark.app.name=spark-pi \
  --conf 
spark.kubernetes.driver.docker.image=foxish/spark-driver:spark-k8s-master-13dec-11-56
 \
  --conf 
spark.kubernetes.executor.docker.image=foxish/spark-executor:spark-k8s-master-13dec-11-56
 \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0-SNAPSHOT.jar
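
To make the failure mode concrete, here is a tiny illustrative sketch of the rewrite being described (this is not the actual SparkSubmit code; it only reproduces the behaviour of stripping the `k8s://` scheme and re-prefixing with `k8s:`):

{code:scala}
// Illustrative only: shows how stripping "k8s://" and prepending "k8s:" yields
// the single-slash form that SparkContext later fails to parse.
object MasterUrlRewriteSketch extends App {
  val submitted = "k8s://https://xx.yy.zz.ww"      // what was passed to spark-submit
  val stripped  = submitted.stripPrefix("k8s://")  // "https://xx.yy.zz.ww"
  val resolved  = s"k8s:$stripped"                 // "k8s:https://xx.yy.zz.ww"
  // A parser expecting "k8s://..." (two slashes) will not recognise this form.
  println(resolved)
}
{code}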


was (Author: foxish):
I submitted it as `k8s://https://xx.yy.zz.ww` to spark-submit. However, it 
seems there has been a change in how that URL is validated on the 
client side, which strips out the k8s scheme and adds it back in the above 
format. That might be at fault here. 

> Kubernetes scheduler at master failing to run applications successfully
> ---
>
> Key: SPARK-22778
> URL: https://issues.apache.org/jira/browse/SPARK-22778
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>
> Building images based on master and deploying Spark PI results in the 
> following error.
> 2017-12-13 19:57:19 INFO  SparkContext:54 - Successfully stopped SparkContext
> Exception in thread "main" org.apache.spark.SparkException: Could not parse 
> Master URL: 'k8s:https://xx.yy.zz.ww'
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
>   at org.apache.spark.SparkContext.(SparkContext.scala:496)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Shutdown hook called
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Deleting directory 
> /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd
> This is likely an artifact seen because of changes in master, or our 
> submission code in the reviews. We haven't seen this on our fork. Hopefully 
> once integration tests are ported against upstream/master, we will catch 
> these issues earlier. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22778) Kubernetes scheduler at master failing to run applications successfully

2017-12-13 Thread Anirudh Ramanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289818#comment-16289818
 ] 

Anirudh Ramanathan commented on SPARK-22778:


I submitted it as `k8s://https://xx.yy.zz.ww` to spark-submit. However, it 
seems there has been a change in how that URL is validated on the 
client side, which strips out the k8s scheme and adds it back in the above 
format. That might be at fault here. 

> Kubernetes scheduler at master failing to run applications successfully
> ---
>
> Key: SPARK-22778
> URL: https://issues.apache.org/jira/browse/SPARK-22778
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>
> Building images based on master and deploying Spark PI results in the 
> following error.
> 2017-12-13 19:57:19 INFO  SparkContext:54 - Successfully stopped SparkContext
> Exception in thread "main" org.apache.spark.SparkException: Could not parse 
> Master URL: 'k8s:https://xx.yy.zz.ww'
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
>   at org.apache.spark.SparkContext.(SparkContext.scala:496)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Shutdown hook called
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Deleting directory 
> /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd
> This is likely an artifact seen because of changes in master, or our 
> submission code in the reviews. We haven't seen this on our fork. Hopefully 
> once integration tests are ported against upstream/master, we will catch 
> these issues earlier. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22778) Kubernetes scheduler at master failing to run applications successfully

2017-12-13 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289815#comment-16289815
 ] 

Matt Cheah commented on SPARK-22778:


I think that URI should be 'k8s://https://xx.yy.zz.ww'; note the extra slashes.

> Kubernetes scheduler at master failing to run applications successfully
> ---
>
> Key: SPARK-22778
> URL: https://issues.apache.org/jira/browse/SPARK-22778
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>
> Building images based on master and deploying Spark PI results in the 
> following error.
> 2017-12-13 19:57:19 INFO  SparkContext:54 - Successfully stopped SparkContext
> Exception in thread "main" org.apache.spark.SparkException: Could not parse 
> Master URL: 'k8s:https://xx.yy.zz.ww'
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
>   at org.apache.spark.SparkContext.(SparkContext.scala:496)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Shutdown hook called
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Deleting directory 
> /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd
> This is likely an artifact seen because of changes in master, or our 
> submission code in the reviews. We haven't seen this on our fork. Hopefully 
> once integration tests are ported against upstream/master, we will catch 
> these issues earlier. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22778) Kubernetes scheduler at master failing to run applications successfully

2017-12-13 Thread Anirudh Ramanathan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anirudh Ramanathan updated SPARK-22778:
---
Description: 
Building images based on master and deploying Spark PI results in the following 
error.

2017-12-13 19:57:19 INFO  SparkContext:54 - Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: Could not parse 
Master URL: 'k8s:https://xx.yy.zz.ww'
at 
org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
at org.apache.spark.SparkContext.(SparkContext.scala:496)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Shutdown hook called
2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Deleting directory 
/tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd

This is likely an artifact seen because of changes in master, or our submission 
code in the reviews. We haven't seen this on our fork. Hopefully once 
integration tests are ported against upstream/master, we will catch these 
issues earlier. 

  was:
Building images based on master and deploying Spark PI results in the following 
error.

2017-12-13 19:57:19 INFO  SparkContext:54 - Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: Could not parse 
Master URL: 'k8s:https://35.197.21.13'
at 
org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
at org.apache.spark.SparkContext.(SparkContext.scala:496)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Shutdown hook called
2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Deleting directory 
/tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd

This is likely an artifact seen because of changes in master, or our submission 
code in the reviews. We haven't seen this on our fork. Hopefully once 
integration tests are ported against upstream/master, we will catch these 
issues earlier. 


> Kubernetes scheduler at master failing to run applications successfully
> ---
>
> Key: SPARK-22778
> URL: https://issues.apache.org/jira/browse/SPARK-22778
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>
> Building images based on master and deploying Spark PI results in the 
> following error.
> 2017-12-13 19:57:19 INFO  SparkContext:54 - Successfully stopped SparkContext
> Exception in thread "main" org.apache.spark.SparkException: Could not parse 
> Master URL: 'k8s:https://xx.yy.zz.ww'
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
>   at org.apache.spark.SparkContext.(SparkContext.scala:496)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Shutdown hook called
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Deleting directory 
> /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd
> This is likely an artifact seen because of changes in master, or our 
> submission code in the reviews. We haven't seen this on our fork. Hopefully 
> once integration tests are ported against upstream/master, we 

[jira] [Commented] (SPARK-22778) Kubernetes scheduler at master failing tests

2017-12-13 Thread Anirudh Ramanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289808#comment-16289808
 ] 

Anirudh Ramanathan commented on SPARK-22778:


[~mcheah] [~kimoonkim] [~ifilonenko] PTAL

> Kubernetes scheduler at master failing tests
> 
>
> Key: SPARK-22778
> URL: https://issues.apache.org/jira/browse/SPARK-22778
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>
> Building images based on master and deploying Spark PI results in the 
> following error.
> 2017-12-13 19:57:19 INFO  SparkContext:54 - Successfully stopped SparkContext
> Exception in thread "main" org.apache.spark.SparkException: Could not parse 
> Master URL: 'k8s:https://35.197.21.13'
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
>   at org.apache.spark.SparkContext.(SparkContext.scala:496)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Shutdown hook called
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Deleting directory 
> /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd
> This is likely an artifact seen because of changes in master, or our 
> submission code in the reviews. We haven't seen this on our fork. Hopefully 
> once integration tests are ported against upstream/master, we will catch 
> these issues earlier. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22778) Kubernetes scheduler at master failing to run applications successfully.

2017-12-13 Thread Anirudh Ramanathan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anirudh Ramanathan updated SPARK-22778:
---
Summary: Kubernetes scheduler at master failing to run applications 
successfully.  (was: Kubernetes scheduler at master failing tests)

> Kubernetes scheduler at master failing to run applications successfully.
> 
>
> Key: SPARK-22778
> URL: https://issues.apache.org/jira/browse/SPARK-22778
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>
> Building images based on master and deploying Spark PI results in the 
> following error.
> 2017-12-13 19:57:19 INFO  SparkContext:54 - Successfully stopped SparkContext
> Exception in thread "main" org.apache.spark.SparkException: Could not parse 
> Master URL: 'k8s:https://35.197.21.13'
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
>   at org.apache.spark.SparkContext.(SparkContext.scala:496)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Shutdown hook called
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Deleting directory 
> /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd
> This is likely an artifact seen because of changes in master, or our 
> submission code in the reviews. We haven't seen this on our fork. Hopefully 
> once integration tests are ported against upstream/master, we will catch 
> these issues earlier. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22778) Kubernetes scheduler at master failing to run applications successfully

2017-12-13 Thread Anirudh Ramanathan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anirudh Ramanathan updated SPARK-22778:
---
Summary: Kubernetes scheduler at master failing to run applications 
successfully  (was: Kubernetes scheduler at master failing to run applications 
successfully.)

> Kubernetes scheduler at master failing to run applications successfully
> ---
>
> Key: SPARK-22778
> URL: https://issues.apache.org/jira/browse/SPARK-22778
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>
> Building images based on master and deploying Spark PI results in the 
> following error.
> 2017-12-13 19:57:19 INFO  SparkContext:54 - Successfully stopped SparkContext
> Exception in thread "main" org.apache.spark.SparkException: Could not parse 
> Master URL: 'k8s:https://35.197.21.13'
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
>   at org.apache.spark.SparkContext.(SparkContext.scala:496)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Shutdown hook called
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Deleting directory 
> /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd
> This is likely an artifact seen because of changes in master, or our 
> submission code in the reviews. We haven't seen this on our fork. Hopefully 
> once integration tests are ported against upstream/master, we will catch 
> these issues earlier. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22778) Kubernetes scheduler at master failing tests

2017-12-13 Thread Anirudh Ramanathan (JIRA)
Anirudh Ramanathan created SPARK-22778:
--

 Summary: Kubernetes scheduler at master failing tests
 Key: SPARK-22778
 URL: https://issues.apache.org/jira/browse/SPARK-22778
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.3.0
Reporter: Anirudh Ramanathan


Building images based on master and deploying Spark PI results in the following 
error.

2017-12-13 19:57:19 INFO  SparkContext:54 - Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: Could not parse 
Master URL: 'k8s:https://35.197.21.13'
at 
org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
at org.apache.spark.SparkContext.(SparkContext.scala:496)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Shutdown hook called
2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Deleting directory 
/tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd

This is likely an artifact seen because of changes in master, or our submission 
code in the reviews. We haven't seen this on our fork. Hopefully once 
integration tests are ported against upstream/master, we will catch these 
issues earlier. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22360) Add unit test for Window Specifications

2017-12-13 Thread Sandor Murakozi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289803#comment-16289803
 ] 

Sandor Murakozi edited comment on SPARK-22360 at 12/13/17 8:09 PM:
---

I just did a quick check of the existing test cases to see the current coverage:

different partition clauses 
* None
** Window.rowsBetween 
** reverse unbounded range frame
** window function with aggregates
** Null inputs
* One
** Lots of tests
* Multiple
** No test

different order clauses 
* None
** Window.rowsBetween
** window function should fail if order by clause is not specified
** statistical functions
* One
** Lots of tests
* Multiple
** No test
* asc
** lots of tests
* desc
** aggregation and range between with unbounded
** reverse sliding range frame
** reverse unbounded range frame
* nulls first/last
** last/first with ignoreNulls

I will create tests for those that are not yet covered. I will also check if 
there are any special combinations (possibly also considering frames) that 
require additional test cases. 
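
For example, a minimal sketch of the kind of currently uncovered case (multiple partition columns combined with multiple order columns); the column names and data below are made up for illustration, and this is not an existing Spark test:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object MultiClauseWindowSketch extends App {
  val spark = SparkSession.builder().master("local[*]").appName("window-sketch").getOrCreate()
  import spark.implicits._

  // Toy data with two candidate partition columns and two candidate order columns.
  val df = Seq(("a", 1, 10L), ("a", 1, 20L), ("a", 2, 30L), ("b", 1, 40L))
    .toDF("key1", "key2", "value")

  // Multiple partition clauses and multiple order clauses (asc and desc).
  val w = Window.partitionBy($"key1", $"key2").orderBy($"value".asc, $"key2".desc)

  df.withColumn("rn", row_number().over(w)).show()
  spark.stop()
}
{code}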

[~jiangxb] are there other cases that need to be covered? 
Do you think it would be worthwhile to have a set of new test cases focusing on 
and systematically going through all the partition and order clauses?


was (Author: smurakozi):
I just did a quick check of the existing test cases to see the current coverage:

different partition clauses 
* None
** Window.rowsBetween 
** reverse unbounded range frame
** window function with aggregates
** Null inputs
* One
** Lots of tests
* Multiple
** No test
different order clauses 
* None
** Window.rowsBetween
** window function should fail if order by clause is not specified
** statistical functions
* One
** Lots of tests
* Multiple
** No test
* asc
** lots of tests
* desc
** aggregation and range between with unbounded
** reverse sliding range frame
** reverse unbounded range frame
* nulls first/last
** last/first with ignoreNulls

I will create tests for those that are not yet covered. I will also check if 
there are any special combinations (possibly also considering frames) that 
require additional test cases. 

[~jiangxb] are there other cases that need to be covered? 
Do you think it would be worthwhile to have a set of new test cases focusing on 
and systematically going through all the partition and order clauses?

> Add unit test for Window Specifications
> ---
>
> Key: SPARK-22360
> URL: https://issues.apache.org/jira/browse/SPARK-22360
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>
> * different partition clauses (none, one, multiple)
> * different order clauses (none, one, multiple, asc/desc, nulls first/last)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22647) Docker files for image creation

2017-12-13 Thread Anirudh Ramanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289805#comment-16289805
 ] 

Anirudh Ramanathan commented on SPARK-22647:


[~mcheah] [~erikerlandson] PTAL

> Docker files for image creation
> ---
>
> Key: SPARK-22647
> URL: https://issues.apache.org/jira/browse/SPARK-22647
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>
> This covers the dockerfiles that need to be shipped to enable the Kubernetes 
> backend for Spark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22360) Add unit test for Window Specifications

2017-12-13 Thread Sandor Murakozi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289803#comment-16289803
 ] 

Sandor Murakozi commented on SPARK-22360:
-

I just did a quick check of the existing test cases to see the current coverage:

different partition clauses 
* None
** Window.rowsBetween 
** reverse unbounded range frame
** window function with aggregates
** Null inputs
* One
** Lots of tests
* Multiple
** No test
* different order clauses 
* None
** Window.rowsBetween
** window function should fail if order by clause is not specified
** statistical functions
* One
** Lots of tests
* Multiple
** No test
* asc
** lots of tests
* desc
** aggregation and range between with unbounded
** reverse sliding range frame
** reverse unbounded range frame
* nulls first/last
** last/first with ignoreNulls

I will create tests for those that are not yet covered. I will also check if 
there are any special combinations (possibly also considering frames) that 
require additional test cases. 

[~jiangxb] are there other cases that need to be covered? 
Do you think it would be worthwhile to have a set of new test cases focusing on 
and systematically going through all the partition and order clauses?

> Add unit test for Window Specifications
> ---
>
> Key: SPARK-22360
> URL: https://issues.apache.org/jira/browse/SPARK-22360
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>
> * different partition clauses (none, one, multiple)
> * different order clauses (none, one, multiple, asc/desc, nulls first/last)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22360) Add unit test for Window Specifications

2017-12-13 Thread Sandor Murakozi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289803#comment-16289803
 ] 

Sandor Murakozi edited comment on SPARK-22360 at 12/13/17 8:08 PM:
---

I just did a quick check of the existing test cases to see the current coverage:

different partition clauses 
* None
** Window.rowsBetween 
** reverse unbounded range frame
** window function with aggregates
** Null inputs
* One
** Lots of tests
* Multiple
** No test
different order clauses 
* None
** Window.rowsBetween
** window function should fail if order by clause is not specified
** statistical functions
* One
** Lots of tests
* Multiple
** No test
* asc
** lots of tests
* desc
** aggregation and range between with unbounded
** reverse sliding range frame
** reverse unbounded range frame
* nulls first/last
** last/first with ignoreNulls

I will create tests for those that are not yet covered. I will also check if 
there are any special combinations (possibly also considering frames) that 
require additional test cases. 

[~jiangxb] are there other cases that need to be covered? 
Do you think it would be worthwhile to have a set of new test cases focusing on 
and systematically going through all the partition and order clauses?


was (Author: smurakozi):
I just did a quick check of the existing test cases to see the current coverage:

different partition clauses 
* None
** Window.rowsBetween 
** reverse unbounded range frame
** window function with aggregates
** Null inputs
* One
** Lots of tests
* Multiple
** No test
* different order clauses 
* None
** Window.rowsBetween
** window function should fail if order by clause is not specified
** statistical functions
* One
** Lots of tests
* Multiple
** No test
* asc
** lots of tests
* desc
** aggregation and range between with unbounded
** reverse sliding range frame
** reverse unbounded range frame
* nulls first/last
** last/first with ignoreNulls

I will create tests for those that are not yet covered. I will also check if 
there are any special combinations (possibly also considering frames) that 
require additional test cases. 

[~jiangxb] are there other cases that need to be covered? 
Do you think it would be worthwhile to have a set of new test cases focusing on 
and systematically going through all the partition and order clauses?

> Add unit test for Window Specifications
> ---
>
> Key: SPARK-22360
> URL: https://issues.apache.org/jira/browse/SPARK-22360
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>
> * different partition clauses (none, one, multiple)
> * different order clauses (none, one, multiple, asc/desc, nulls first/last)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22777) Docker container built for Kubernetes doesn't allow running entrypoint.sh

2017-12-13 Thread Anirudh Ramanathan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anirudh Ramanathan updated SPARK-22777:
---
Description: 
The default Docker images that are built throw an error when trying to run on a cluster. 
The error looks like the following:

{
{noformat}
  9s9s  1   kubelet, 
gke-jupytercluster2-default-pool-6be20085-4nm4spec.containers{spark-kubernetes-driver}
 Warning Failed  Error: failed to start container 
"spark-kubernetes-driver": Error response from daemon: {"message":"oci runtime 
error: container_linux.go:247: starting container process caused \"exec: 
\\\"/opt/entrypoint.sh\\\": permission denied\"\n"}
{noformat}
}

The fix is probably just changing permissions for the entrypoint script in the 
default docker image.

  was:
The default Docker images that are built throw an error when trying to run on a cluster. 
The error looks like the following:

{code:bash}

  9s9s  1   kubelet, 
gke-jupytercluster2-default-pool-6be20085-4nm4spec.containers{spark-kubernetes-driver}
 Warning Failed  Error: failed to start container 
"spark-kubernetes-driver": Error response from daemon: {"message":"oci runtime 
error: container_linux.go:247: starting container process caused \"exec: 
\\\"/opt/entrypoint.sh\\\": permission denied\"\n"}
{code}


The fix is probably just changing permissions for the entrypoint script in the 
default docker image.


> Docker container built for Kubernetes doesn't allow running entrypoint.sh
> -
>
> Key: SPARK-22777
> URL: https://issues.apache.org/jira/browse/SPARK-22777
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Minor
>
> The default Docker images that are built throw an error when trying to run on a cluster. 
> The error looks like the following:
> {
> {noformat}
>   9s  9s  1   kubelet, 
> gke-jupytercluster2-default-pool-6be20085-4nm4spec.containers{spark-kubernetes-driver}
>  Warning Failed  Error: failed to start container 
> "spark-kubernetes-driver": Error response from daemon: {"message":"oci 
> runtime error: container_linux.go:247: starting container process caused 
> \"exec: \\\"/opt/entrypoint.sh\\\": permission denied\"\n"}
> {noformat}
> }
> The fix is probably just changing permissions for the entrypoint script in 
> the default docker image.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22777) Docker container built for Kubernetes doesn't allow running entrypoint.sh

2017-12-13 Thread Anirudh Ramanathan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anirudh Ramanathan updated SPARK-22777:
---
Description: 
The default Docker images that are built throw an error when trying to run on a cluster. 
The error looks like the following:


{panel:title=My title}
  9s9s  1   kubelet, 
gke-jupytercluster2-default-pool-6be20085-4nm4spec.containers{spark-kubernetes-driver}
 Warning Failed  Error: failed to start container 
"spark-kubernetes-driver": Error response from daemon: {"message":"oci runtime 
error: container_linux.go:247: starting container process caused \"exec: 
\\\"/opt/entrypoint.sh\\\": permission denied\"\n"}

{panel}

The fix is probably just changing permissions for the entrypoint script in the 
default docker image.

  was:
The default Docker images that are built throw an error when trying to run on a cluster. 
The error looks like the following:

{
{noformat}
  9s9s  1   kubelet, 
gke-jupytercluster2-default-pool-6be20085-4nm4spec.containers{spark-kubernetes-driver}
 Warning Failed  Error: failed to start container 
"spark-kubernetes-driver": Error response from daemon: {"message":"oci runtime 
error: container_linux.go:247: starting container process caused \"exec: 
\\\"/opt/entrypoint.sh\\\": permission denied\"\n"}
{noformat}
}

The fix is probably just changing permissions for the entrypoint script in the 
default docker image.


> Docker container built for Kubernetes doesn't allow running entrypoint.sh
> -
>
> Key: SPARK-22777
> URL: https://issues.apache.org/jira/browse/SPARK-22777
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Minor
>
> The default Docker images that are built throw an error when trying to run on a cluster. 
> The error looks like the following:
> {panel:title=My title}
>   9s  9s  1   kubelet, 
> gke-jupytercluster2-default-pool-6be20085-4nm4spec.containers{spark-kubernetes-driver}
>  Warning Failed  Error: failed to start container 
> "spark-kubernetes-driver": Error response from daemon: {"message":"oci 
> runtime error: container_linux.go:247: starting container process caused 
> \"exec: \\\"/opt/entrypoint.sh\\\": permission denied\"\n"}
> {panel}
> The fix is probably just changing permissions for the entrypoint script in 
> the default docker image.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22777) Docker container built for Kubernetes doesn't allow running entrypoint.sh

2017-12-13 Thread Anirudh Ramanathan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anirudh Ramanathan updated SPARK-22777:
---
Description: 
The default Docker images that are built throw an error when trying to run on a cluster. 
The error looks like the following:


```
  9s9s  1   kubelet, 
gke-jupytercluster2-default-pool-6be20085-4nm4spec.containers{spark-kubernetes-driver}
 Warning Failed  Error: failed to start container 
"spark-kubernetes-driver": Error response from daemon: {"message":"oci runtime 
error: container_linux.go:247: starting container process caused \"exec: 
\\\"/opt/entrypoint.sh\\\": permission denied\"\n"}
```

The fix is probably just changing permissions for the entrypoint script in the 
default docker image.

  was:
The default Docker images that are built throw an error when trying to run on a cluster. 
The error looks like the following:


{panel:title=My title}
  9s9s  1   kubelet, 
gke-jupytercluster2-default-pool-6be20085-4nm4spec.containers{spark-kubernetes-driver}
 Warning Failed  Error: failed to start container 
"spark-kubernetes-driver": Error response from daemon: {"message":"oci runtime 
error: container_linux.go:247: starting container process caused \"exec: 
\\\"/opt/entrypoint.sh\\\": permission denied\"\n"}

{panel}

The fix is probably just changing permissions for the entrypoint script in the 
default docker image.


> Docker container built for Kubernetes doesn't allow running entrypoint.sh
> -
>
> Key: SPARK-22777
> URL: https://issues.apache.org/jira/browse/SPARK-22777
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Minor
>
> The default Docker images that are built throw an error when trying to run on a cluster. 
> The error looks like the following:
> ```
>   9s  9s  1   kubelet, 
> gke-jupytercluster2-default-pool-6be20085-4nm4spec.containers{spark-kubernetes-driver}
>  Warning Failed  Error: failed to start container 
> "spark-kubernetes-driver": Error response from daemon: {"message":"oci 
> runtime error: container_linux.go:247: starting container process caused 
> \"exec: \\\"/opt/entrypoint.sh\\\": permission denied\"\n"}
> ```
> The fix is probably just changing permissions for the entrypoint script in 
> the default docker image.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22777) Docker container built for Kubernetes doesn't allow running entrypoint.sh

2017-12-13 Thread Anirudh Ramanathan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anirudh Ramanathan updated SPARK-22777:
---
Description: 
The default Docker images that are built throw an error when trying to run on a cluster. 
The error looks like the following:

{code:bash}

  9s9s  1   kubelet, 
gke-jupytercluster2-default-pool-6be20085-4nm4spec.containers{spark-kubernetes-driver}
 Warning Failed  Error: failed to start container 
"spark-kubernetes-driver": Error response from daemon: {"message":"oci runtime 
error: container_linux.go:247: starting container process caused \"exec: 
\\\"/opt/entrypoint.sh\\\": permission denied\"\n"}
{code}


The fix is probably just changing permissions for the entrypoint script in the 
default docker image.

  was:
The default Docker images that are built throw an error when trying to run on a cluster. 
The error looks like the following:

  9s9s  1   kubelet, 
gke-jupytercluster2-default-pool-6be20085-4nm4spec.containers{spark-kubernetes-driver}
 Warning Failed  Error: failed to start container 
"spark-kubernetes-driver": Error response from daemon: {"message":"oci runtime 
error: container_linux.go:247: starting container process caused \"exec: 
\\\"/opt/entrypoint.sh\\\": permission denied\"\n"}

The fix is probably just changing permissions for the entrypoint script in the 
default docker image.


> Docker container built for Kubernetes doesn't allow running entrypoint.sh
> -
>
> Key: SPARK-22777
> URL: https://issues.apache.org/jira/browse/SPARK-22777
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Minor
>
> The default Docker images that are built throw an error when trying to run on a cluster. 
> The error looks like the following:
> {code:bash}
>   9s  9s  1   kubelet, 
> gke-jupytercluster2-default-pool-6be20085-4nm4spec.containers{spark-kubernetes-driver}
>  Warning Failed  Error: failed to start container 
> "spark-kubernetes-driver": Error response from daemon: {"message":"oci 
> runtime error: container_linux.go:247: starting container process caused 
> \"exec: \\\"/opt/entrypoint.sh\\\": permission denied\"\n"}
> {code}
> The fix is probably just changing permissions for the entrypoint script in 
> the default docker image.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22777) Docker container built for Kubernetes doesn't allow running entrypoint.sh

2017-12-13 Thread Anirudh Ramanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289795#comment-16289795
 ] 

Anirudh Ramanathan edited comment on SPARK-22777 at 12/13/17 8:02 PM:
--

[~liyinan926] [~ssuchter] [~kimoonkim]


was (Author: foxish):
[~liyinan926] [~ssuchter]

> Docker container built for Kubernetes doesn't allow running entrypoint.sh
> -
>
> Key: SPARK-22777
> URL: https://issues.apache.org/jira/browse/SPARK-22777
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Minor
>
> The default docker images that are built throw an error when trying to run on a cluster. 
> The error looks like the following:
>   9s  9s  1   kubelet, 
> gke-jupytercluster2-default-pool-6be20085-4nm4spec.containers{spark-kubernetes-driver}
>  Warning Failed  Error: failed to start container 
> "spark-kubernetes-driver": Error response from daemon: {"message":"oci 
> runtime error: container_linux.go:247: starting container process caused 
> \"exec: \\\"/opt/entrypoint.sh\\\": permission denied\"\n"}
> The fix is probably just changing permissions for the entrypoint script in 
> the default docker image.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22777) Docker container built for Kubernetes doesn't allow running entrypoint.sh

2017-12-13 Thread Anirudh Ramanathan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anirudh Ramanathan updated SPARK-22777:
---

[~liyinan926] [~ssuchter]

> Docker container built for Kubernetes doesn't allow running entrypoint.sh
> -
>
> Key: SPARK-22777
> URL: https://issues.apache.org/jira/browse/SPARK-22777
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Minor
>
> The default docker images that are built throw an error when trying to run on a cluster. 
> The error looks like the following:
>   9s  9s  1   kubelet, 
> gke-jupytercluster2-default-pool-6be20085-4nm4spec.containers{spark-kubernetes-driver}
>  Warning Failed  Error: failed to start container 
> "spark-kubernetes-driver": Error response from daemon: {"message":"oci 
> runtime error: container_linux.go:247: starting container process caused 
> \"exec: \\\"/opt/entrypoint.sh\\\": permission denied\"\n"}
> The fix is probably just changing permissions for the entrypoint script in 
> the default docker image.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22777) Docker container built for Kubernetes doesn't allow running entrypoint.sh

2017-12-13 Thread Anirudh Ramanathan (JIRA)
Anirudh Ramanathan created SPARK-22777:
--

 Summary: Docker container built for Kubernetes doesn't allow 
running entrypoint.sh
 Key: SPARK-22777
 URL: https://issues.apache.org/jira/browse/SPARK-22777
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.3.0
Reporter: Anirudh Ramanathan
Priority: Minor


The default docker images that are built throw an error when trying to run on a cluster. 
The error looks like the following:

  9s9s  1   kubelet, 
gke-jupytercluster2-default-pool-6be20085-4nm4spec.containers{spark-kubernetes-driver}
 Warning Failed  Error: failed to start container 
"spark-kubernetes-driver": Error response from daemon: {"message":"oci runtime 
error: container_linux.go:247: starting container process caused \"exec: 
\\\"/opt/entrypoint.sh\\\": permission denied\"\n"}

The fix is probably just changing permissions for the entrypoint script in the 
default docker image.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22765) Create a new executor allocation scheme based on that of MR

2017-12-13 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289793#comment-16289793
 ] 

Thomas Graves commented on SPARK-22765:
---

OK, so before you do anything else I would suggest trying a Spark version with 
SPARK-21656 (or backporting it) and then using a small idle timeout to see if 
that meets your needs.

I assume that even if you put a new feature in, you would still have to 
configure it for different types of jobs, so I don't see how that would be any 
different from setting a different idle timeout per job?
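
For reference, here is a minimal sketch of the kind of per-job tuning suggested 
above, assuming a build that already includes SPARK-21656; the values are 
illustrative assumptions, not recommendations:

{code:scala}
import org.apache.spark.sql.SparkSession

// Illustrative only: dynamic allocation with a short idle timeout so unused
// executors are released quickly. Master and deploy mode are expected to be
// supplied by spark-submit on YARN; all values are example assumptions.
val spark = SparkSession.builder()
  .appName("short-idle-timeout-sketch")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")
  .config("spark.dynamicAllocation.executorIdleTimeout", "10s")
  .getOrCreate()
{code}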

> Create a new executor allocation scheme based on that of MR
> ---
>
> Key: SPARK-22765
> URL: https://issues.apache.org/jira/browse/SPARK-22765
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0
>Reporter: Xuefu Zhang
>
> Many users migrating their workload from MR to Spark find a significant 
> resource consumption hike (i.e., SPARK-22683). While this might not be a 
> concern for users that are more performance-centric, for others conscious 
> about cost, such a hike creates a migration obstacle. This situation can get 
> worse as more users move to the cloud.
> Dynamic allocation makes it possible for Spark to be deployed in a 
> multi-tenant environment. With its performance-centric design, its 
> inefficiency has also unfortunately shown up, especially when compared with 
> MR. Thus, it's believed that an MR-styled scheduler still has its merit. 
> Based on our research, the inefficiency associated with dynamic allocation 
> comes from many aspects, such as executors idling out, bigger executors, 
> many stages (rather than only 2 stages as in MR) in a Spark job, etc.
> Rather than fine-tuning dynamic allocation for efficiency, the proposal here 
> is to add a new, efficiency-centric scheduling scheme based on that of MR. 
> Such an MR-based scheme can be further enhanced and better adapted to the 
> Spark execution model. This alternative is expected to still offer a good 
> performance improvement (compared to MR) with similar or even better 
> efficiency than MR.
> Inputs are greatly welcome!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22359) Improve the test coverage of window functions

2017-12-13 Thread Sandor Murakozi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289792#comment-16289792
 ] 

Sandor Murakozi commented on SPARK-22359:
-

I'm glad that you're joining, [~gsomogyi].
Do you have any preferred subtasks? I've started working on the first one, 
dealing with WindowSpec.
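
For illustration, here is a minimal sketch (with assumed data and column names) 
of the kind of WindowSpec combination covered by the subtask list quoted below: 
one partition clause, one descending order clause, and a sliding row frame.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("windowspec-sketch").getOrCreate()
import spark.implicits._

// Toy data: (key, value) pairs; names are assumptions for the sketch.
val df = Seq(("a", 1), ("a", 2), ("a", 3), ("b", 4)).toDF("key", "value")

// One partition clause, one descending order clause, and a sliding row frame.
val w = Window
  .partitionBy($"key")
  .orderBy($"value".desc)
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

df.withColumn("running_sum", sum($"value").over(w)).show()
{code}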

> Improve the test coverage of window functions
> -
>
> Key: SPARK-22359
> URL: https://issues.apache.org/jira/browse/SPARK-22359
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>
> There are already quite a few integration tests using window functions, but 
> the unit test coverage for window functions is not ideal.
> We'd like to test the following aspects:
> * Specifications
> ** different partition clauses (none, one, multiple)
> ** different order clauses (none, one, multiple, asc/desc, nulls first/last)
> * Frames and their combinations
> ** OffsetWindowFunctionFrame
> ** UnboundedWindowFunctionFrame
> ** SlidingWindowFunctionFrame
> ** UnboundedPrecedingWindowFunctionFrame
> ** UnboundedFollowingWindowFunctionFrame
> * Aggregate function types
> ** Declarative
> ** Imperative
> ** UDAF
> * Spilling
> ** Cover the conditions that WindowExec should spill at least once 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22754) Check spark.executor.heartbeatInterval setting in case of ExecutorLost

2017-12-13 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-22754:
--

Assignee: zhoukang

> Check spark.executor.heartbeatInterval setting in case of ExecutorLost
> --
>
> Key: SPARK-22754
> URL: https://issues.apache.org/jira/browse/SPARK-22754
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.1.0
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Minor
> Fix For: 2.3.0
>
>
> If spark.executor.heartbeatInterval is bigger than spark.network.timeout, it 
> will almost always cause the exception below.
> {code:java}
> Job aborted due to stage failure: Task 4763 in stage 3.0 failed 4 times, most 
> recent failure: Lost task 4763.3 in stage 3.0 (TID 22383, executor id: 4761, 
> host: xxx): ExecutorLostFailure (executor 4761 exited caused by one of the 
> running tasks) Reason: Executor heartbeat timed out after 154022 ms
> {code}
> Since many users do not get that point, they will set 
> spark.executor.heartbeatInterval incorrectly.
> We should check for this case when applications are submitted.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22754) Check spark.executor.heartbeatInterval setting in case of ExecutorLost

2017-12-13 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-22754.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19942
[https://github.com/apache/spark/pull/19942]
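
For context, a minimal sketch of the kind of submit-time validation the 
description quoted below asks for; this is an illustration only, not the actual 
change made in the pull request:

{code:scala}
import org.apache.spark.SparkConf

// Illustrative check: the executor heartbeat interval should stay below the
// network timeout, otherwise executors are easily reported as lost.
def checkHeartbeatInterval(conf: SparkConf): Unit = {
  val heartbeatMs = conf.getTimeAsMs("spark.executor.heartbeatInterval", "10s")
  val networkTimeoutMs = conf.getTimeAsMs("spark.network.timeout", "120s")
  require(heartbeatMs < networkTimeoutMs,
    s"spark.executor.heartbeatInterval ($heartbeatMs ms) must be less than " +
      s"spark.network.timeout ($networkTimeoutMs ms).")
}
{code}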

> Check spark.executor.heartbeatInterval setting in case of ExecutorLost
> --
>
> Key: SPARK-22754
> URL: https://issues.apache.org/jira/browse/SPARK-22754
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.1.0
>Reporter: zhoukang
>Priority: Minor
> Fix For: 2.3.0
>
>
> If spark.executor.heartbeatInterval is bigger than spark.network.timeout, it 
> will almost always cause the exception below.
> {code:java}
> Job aborted due to stage failure: Task 4763 in stage 3.0 failed 4 times, most 
> recent failure: Lost task 4763.3 in stage 3.0 (TID 22383, executor id: 4761, 
> host: xxx): ExecutorLostFailure (executor 4761 exited caused by one of the 
> running tasks) Reason: Executor heartbeat timed out after 154022 ms
> {code}
> Since many users do not get that point, they will set 
> spark.executor.heartbeatInterval incorrectly.
> We should check for this case when applications are submitted.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22776) Increase default value of spark.sql.codegen.maxFields

2017-12-13 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-22776:


 Summary: Increase default value of spark.sql.codegen.maxFields
 Key: SPARK-22776
 URL: https://issues.apache.org/jira/browse/SPARK-22776
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Kazuaki Ishizaki


Since a lot of effort has gone into avoiding the limitations of Java class 
files, code generated for whole-stage codegen now works with wider columns, so 
the default value of spark.sql.codegen.maxFields can be increased.
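
For illustration, the threshold can already be raised per session while a new 
default is being discussed, assuming the config is settable at runtime; the 
value below is an arbitrary example, not the proposed default.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("maxfields-sketch").getOrCreate()

// Illustrative only: raise the field-count threshold for whole-stage codegen
// in this session (200 is an arbitrary example value, not the proposed default).
spark.conf.set("spark.sql.codegen.maxFields", "200")
{code}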



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22765) Create a new executor allocation scheme based on that of MR

2017-12-13 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289726#comment-16289726
 ] 

Xuefu Zhang commented on SPARK-22765:
-

Yes, we are using Hive on Spark. Our Spark version is 1.6.1, which is old. 
Obviously it doesn't have the fix in SPARK-21656.

As commented in SPARK-22683, for individual queries our comparison was made 
between all of the MR jobs and all of the Spark jobs (usually just one).

> Create a new executor allocation scheme based on that of MR
> ---
>
> Key: SPARK-22765
> URL: https://issues.apache.org/jira/browse/SPARK-22765
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0
>Reporter: Xuefu Zhang
>
> Many users migrating their workload from MR to Spark find a significant 
> resource consumption hike (i.e., SPARK-22683). While this might not be a 
> concern for users that are more performance-centric, for others conscious 
> about cost, such a hike creates a migration obstacle. This situation can get 
> worse as more users move to the cloud.
> Dynamic allocation makes it possible for Spark to be deployed in a 
> multi-tenant environment. With its performance-centric design, its 
> inefficiency has also unfortunately shown up, especially when compared with 
> MR. Thus, it's believed that an MR-styled scheduler still has its merit. 
> Based on our research, the inefficiency associated with dynamic allocation 
> comes from many aspects, such as executors idling out, bigger executors, 
> many stages (rather than only 2 stages as in MR) in a Spark job, etc.
> Rather than fine-tuning dynamic allocation for efficiency, the proposal here 
> is to add a new, efficiency-centric scheduling scheme based on that of MR. 
> Such an MR-based scheme can be further enhanced and better adapted to the 
> Spark execution model. This alternative is expected to still offer a good 
> performance improvement (compared to MR) with similar or even better 
> efficiency than MR.
> Inputs are greatly welcome!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22765) Create a new executor allocation scheme based on that of MR

2017-12-13 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289700#comment-16289700
 ] 

Thomas Graves commented on SPARK-22765:
---

OK, so basically the executors idle out during DAG computation, or the 
scheduler is not fast enough to deploy tasks. What version of Spark are you 
using? We actually made a change to dynamic allocation recently so that it 
won't idle-timeout executors when there are tasks to run on them: 
https://issues.apache.org/jira/browse/SPARK-21656

Are you using Hive with Spark? Did you compare resource utilization for the 
Spark job against the multiple MR jobs that get run for a single query?

> Create a new executor allocation scheme based on that of MR
> ---
>
> Key: SPARK-22765
> URL: https://issues.apache.org/jira/browse/SPARK-22765
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0
>Reporter: Xuefu Zhang
>
> Many users migrating their workload from MR to Spark find a significant 
> resource consumption hike (i.e., SPARK-22683). While this might not be a 
> concern for users that are more performance-centric, for others conscious 
> about cost, such a hike creates a migration obstacle. This situation can get 
> worse as more users move to the cloud.
> Dynamic allocation makes it possible for Spark to be deployed in a 
> multi-tenant environment. With its performance-centric design, its 
> inefficiency has also unfortunately shown up, especially when compared with 
> MR. Thus, it's believed that an MR-styled scheduler still has its merit. 
> Based on our research, the inefficiency associated with dynamic allocation 
> comes from many aspects, such as executors idling out, bigger executors, 
> many stages (rather than only 2 stages as in MR) in a Spark job, etc.
> Rather than fine-tuning dynamic allocation for efficiency, the proposal here 
> is to add a new, efficiency-centric scheduling scheme based on that of MR. 
> Such an MR-based scheme can be further enhanced and better adapted to the 
> Spark execution model. This alternative is expected to still offer a good 
> performance improvement (compared to MR) with similar or even better 
> efficiency than MR.
> Inputs are greatly welcome!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2017-12-13 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289689#comment-16289689
 ] 

Xuefu Zhang commented on SPARK-22683:
-

[~tgraves], I can speak to our use case, where the same queries run on MR vs. 
Spark via Hive. Because Spark gets rid of MR's intermediate HDFS reads/writes, 
we expected better efficiency in addition to perf gains. While that expectation 
is met for some of our queries, usually long-running ones with many stages, the 
resource usage is much worse for other queries, especially short-running ones.

I believe that efficiency can be substantially enhanced in both cases.

> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>  Labels: pull-request-available
>
> Let's say an executor has spark.executor.cores / spark.task.cpus task slots.
> The current dynamic allocation policy allocates enough executors to have each 
> task slot execute a single task, which minimizes latency, but wastes resources 
> on executor allocation and idling overhead when tasks are small.
> By adding tasksPerExecutorSlot, it becomes possible to specify how many tasks 
> a single slot should ideally execute, to mitigate the overhead of executor 
> allocation.
> PR: https://github.com/apache/spark/pull/19881



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22765) Create a new executor allocation scheme based on that of MR

2017-12-13 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289676#comment-16289676
 ] 

Xuefu Zhang edited comment on SPARK-22765 at 12/13/17 6:30 PM:
---

Hi [~tgraves], thanks for your input.

In our busy, heavily loaded cluster environment, we have found that any idle 
time less than 60s is a problem. 30s works for small jobs, but starts causing 
problems for bigger jobs. The symptom is that newly allocated executors are 
idled out before completing a single task! I suspect that this is caused by a 
busy scheduler. As a result, we have to keep 60s as a minimum.

Having said that, however, I'm not against container reuse. Also, I used the 
word "enhanced" to qualify MR scheduling. Reuse is good, but in my opinion the 
speculation factor in dynamic allocation goes against efficiency. That is, you 
set an idle time just in case a new task comes within that period of time. 
When that doesn't happen, you waste your executor for 1 minute. (This is good 
for performance.) Please note that this happens a lot at the end of each stage 
because no tasks from the next stage will be scheduled until the current stage 
finishes.

If we can remove the speculation aspect of the scheduling, the efficiency 
should improve significantly with some compromise on performance. This would 
be a good starting point, which is the main purpose of my proposal for an 
enhanced MR-style scheduling scheme, which is open to many other possible 
improvements.



was (Author: xuefuz):
Hi [~tgraves], Thanks for your input.

In our busy, heavily loaded cluster environment, we have found that any idle 
time less than 60s is a problem. 30s works for small jobs, but starts having 
problem for bigger jobs. The symptom is that newly allocated executors are 
idled out before completing a single tasks! I suspected that this is caused by 
a busy scheduler. As a result, we have to keep 60s as a minimum.

Having said that, however, I'm not against container reuse. Also, I used the 
word "enhanced" to improve on MR scheduling. Reusing is good, but in my opinion 
the speculation factor in dynamic allocation goes against efficiency. That is, 
you set an idle time just in case a new task comes within that period of time. 
When that doesn't happen, you waste your executor for 1 minute. (This is good 
for performance.) Please note that this happens a lot at the end of each stage 
because no tasks from the next stage will be scheduled until the current stage 
finishes.

If we can remove the speculation aspect of the scheduling, the efficiency 
should improve significantly with some compromise on performance. This would be 
a good start point, which is the main purpose of my proposal of an enhanced 
MR-style scheduling, which is open to many other possible improvements.


> Create a new executor allocation scheme based on that of MR
> ---
>
> Key: SPARK-22765
> URL: https://issues.apache.org/jira/browse/SPARK-22765
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0
>Reporter: Xuefu Zhang
>
> Many users migrating their workload from MR to Spark find a significant 
> resource consumption hike (i.e., SPARK-22683). While this might not be a 
> concern for users that are more performance-centric, for others conscious 
> about cost, such a hike creates a migration obstacle. This situation can get 
> worse as more users move to the cloud.
> Dynamic allocation makes it possible for Spark to be deployed in a 
> multi-tenant environment. With its performance-centric design, its 
> inefficiency has also unfortunately shown up, especially when compared with 
> MR. Thus, it's believed that an MR-styled scheduler still has its merit. 
> Based on our research, the inefficiency associated with dynamic allocation 
> comes from many aspects, such as executors idling out, bigger executors, 
> many stages (rather than only 2 stages as in MR) in a Spark job, etc.
> Rather than fine-tuning dynamic allocation for efficiency, the proposal here 
> is to add a new, efficiency-centric scheduling scheme based on that of MR. 
> Such an MR-based scheme can be further enhanced and better adapted to the 
> Spark execution model. This alternative is expected to still offer a good 
> performance improvement (compared to MR) with similar or even better 
> efficiency than MR.
> Inputs are greatly welcome!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18278) SPIP: Support native submission of spark jobs to a kubernetes cluster

2017-12-13 Thread Yinan Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yinan Li updated SPARK-18278:
-
Component/s: Kubernetes

> SPIP: Support native submission of spark jobs to a kubernetes cluster
> -
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Kubernetes, Scheduler, 
> Spark Core
>Affects Versions: 2.3.0
>Reporter: Erik Erlandson
>  Labels: SPIP
> Attachments: SPARK-18278 Spark on Kubernetes Design Proposal Revision 
> 2 (1).pdf
>
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a kubernetes cluster.   The submitted application runs 
> in a driver executing on a kubernetes pod, and executor lifecycles are also 
> managed as pods.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22765) Create a new executor allocation scheme based on that of MR

2017-12-13 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289676#comment-16289676
 ] 

Xuefu Zhang commented on SPARK-22765:
-

Hi [~tgraves], thanks for your input.

In our busy, heavily loaded cluster environment, we have found that any idle 
time less than 60s is a problem. 30s works for small jobs, but starts causing 
problems for bigger jobs. The symptom is that newly allocated executors are 
idled out before completing a single task! I suspect that this is caused by a 
busy scheduler. As a result, we have to keep 60s as a minimum.

Having said that, however, I'm not against container reuse. Also, I used the 
word "enhanced" to mean improving on MR scheduling. Reuse is good, but in my 
opinion the speculation factor in dynamic allocation goes against efficiency. 
That is, you set an idle time just in case a new task comes within that period 
of time. When that doesn't happen, you waste your executor for 1 minute. (This 
is good for performance.) Please note that this happens a lot at the end of 
each stage because no tasks from the next stage will be scheduled until the 
current stage finishes.

If we can remove the speculation aspect of the scheduling, the efficiency 
should improve significantly with some compromise on performance. This would 
be a good starting point, which is the main purpose of my proposal for an 
enhanced MR-style scheduling scheme, which is open to many other possible 
improvements.


> Create a new executor allocation scheme based on that of MR
> ---
>
> Key: SPARK-22765
> URL: https://issues.apache.org/jira/browse/SPARK-22765
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0
>Reporter: Xuefu Zhang
>
> Many users migrating their workload from MR to Spark find a significant 
> resource consumption hike (i.e., SPARK-22683). While this might not be a 
> concern for users that are more performance-centric, for others conscious 
> about cost, such a hike creates a migration obstacle. This situation can get 
> worse as more users move to the cloud.
> Dynamic allocation makes it possible for Spark to be deployed in a 
> multi-tenant environment. With its performance-centric design, its 
> inefficiency has also unfortunately shown up, especially when compared with 
> MR. Thus, it's believed that an MR-styled scheduler still has its merit. 
> Based on our research, the inefficiency associated with dynamic allocation 
> comes from many aspects, such as executors idling out, bigger executors, 
> many stages (rather than only 2 stages as in MR) in a Spark job, etc.
> Rather than fine-tuning dynamic allocation for efficiency, the proposal here 
> is to add a new, efficiency-centric scheduling scheme based on that of MR. 
> Such an MR-based scheme can be further enhanced and better adapted to the 
> Spark execution model. This alternative is expected to still offer a good 
> performance improvement (compared to MR) with similar or even better 
> efficiency than MR.
> Inputs are greatly welcome!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2017-12-13 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289656#comment-16289656
 ] 

Thomas Graves commented on SPARK-22683:
---

I am also curious: when you are comparing Spark to MR-based jobs, and assuming 
you are running Hive or Pig, are you comparing Spark against the resource usage 
across the multiple MR jobs? Or are you running a straight MR job vs. a Spark 
app? You have to look across the jobs, because MR uses resources by having to 
write to HDFS and read from HDFS between jobs.

> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>  Labels: pull-request-available
>
> Let's say an executor has spark.executor.cores / spark.task.cpus task slots.
> The current dynamic allocation policy allocates enough executors to have each 
> task slot execute a single task, which minimizes latency, but wastes resources 
> on executor allocation and idling overhead when tasks are small.
> By adding tasksPerExecutorSlot, it becomes possible to specify how many tasks 
> a single slot should ideally execute, to mitigate the overhead of executor 
> allocation.
> PR: https://github.com/apache/spark/pull/19881



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2017-12-13 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289614#comment-16289614
 ] 

Thomas Graves commented on SPARK-22683:
---

So the issue brought up here seems to be resource waste vs. run time. I agree 
that resource waste is an important thing. There are lots of jobs that have 
very short tasks. It might be nice to update the description to go into more 
detail on the issue you are seeing rather than talking about your proposed 
solution.

You mention that fast tasks end up getting run on a small number of executors. 
This can actually be beneficial to resource usage, right? There is no need to 
get more executors if the tasks run fast enough on a small number; the downside 
is that if we have requested executors from YARN and are getting them, then 
resources are wasted for that startup/shutdown period. The other thing you 
mention, though, is that this affects the shuffle later. You can ask Spark to 
wait longer for a minimal number of executors to help with the shuffle issue 
(and also set spark.dynamicAllocation.initialExecutors if it's an early stage), 
which again adversely affects resource usage. This is another balance point, 
though, which I think goes against what you originally asked for, unless you 
actually put in another config that tries to enforce spreading those tasks.

I agree with Sean on the point that this would be a hard thing to optimize for 
all jobs, long-running vs. short. The fact that you are asking for 5+ cores per 
executor will naturally waste more resources when the executor isn't being 
used; that is inherent, and until we get to resizing executors (quickly) it 
will always be an issue. But if we can find something that by default works 
better for the majority of workloads, it makes sense to improve. As with any 
config, though, how do I know what to set tasksPerSlot to? It requires 
configuration and it could affect performance.

The reason I was told that dynamic allocation ramps up exponentially is to try 
to allow the quick tasks to run on existing executors before asking for more. 
You are essentially saying this isn't working well enough. But is it not 
working well enough because we are doing the exponential ask? What if we ask 
for everything up front like MR does? I see you made one comment about an 
executor that was started but not used, didn't get many tasks run on it, and 
then idle-timed out, so that might not help here; but the question is whether 
YARN would give you more containers immediately if you asked for them all 
first. Were your benchmarks done on a busy cluster or an empty cluster, how 
fast was your container allocation, did you hit other user limits, etc.? How 
many executors you get and how quickly they are allocated will be affected by 
those things.

Above you say "When running with 6 tasks per executor slot, our Spark jobs 
consume in average 30% less vcorehours than the MR jobs, this setting being 
valid for different workload sizes." Was this with this patch applied or 
without?

In 
https://issues.apache.org/jira/browse/SPARK-22683?focusedCommentId=16286032=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16286032
, regarding the WallTimeGain wrt MR (%): does this mean positive numbers ran 
faster than MR? Why is running with 6 or 8 slower? Is it shuffle issues, 
mistuned GC, or just unknown overhead?
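
For reference, a minimal sketch of the existing dynamic allocation knobs 
touched on above (initial/minimum executors, idle timeout, backlog timeout); 
all values are assumptions for illustration only, not recommendations.

{code:scala}
import org.apache.spark.SparkConf

// Illustrative only: today's dynamic allocation knobs that trade latency
// against resource usage (all values are arbitrary example assumptions).
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.initialExecutors", "20")
  .set("spark.dynamicAllocation.minExecutors", "5")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .set("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")
{code}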

> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>  Labels: pull-request-available
>
> Let's say an executor has spark.executor.cores / spark.task.cpus task slots.
> The current dynamic allocation policy allocates enough executors to have each 
> task slot execute a single task, which minimizes latency, but wastes resources 
> on executor allocation and idling overhead when tasks are small.
> By adding tasksPerExecutorSlot, it becomes possible to specify how many tasks 
> a single slot should ideally execute, to mitigate the overhead of executor 
> allocation.
> PR: https://github.com/apache/spark/pull/19881



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22765) Create a new executor allocation scheme based on that of MR

2017-12-13 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289613#comment-16289613
 ] 

Thomas Graves commented on SPARK-22765:
---

Why doesn't a very small idle timeout (< 5 seconds) work? If there is no work 
to be done, the executor should exit soon after a task finishes, similar to MR. 
Note that Tez added container reuse as well, which is similar to Spark's scheme.

Basically, I think you are proposing a change that essentially adds a config 
that does not reuse containers.
I'm not sure I agree with this, or that it will help resource utilization, 
especially when looking at the whole ecosystem. Without reuse you have to go 
back to YARN to ask for more containers, which depending on cluster usage could 
cause significant overhead while waiting for them. You are bringing up and 
killing processes, you are re-downloading things into the distributed cache, 
etc. So in terms of vcore/memory-seconds on YARN it might be better, but it 
affects other things as well. Just something to keep in mind.

Are you seeing this with very short running tasks?

> Create a new executor allocation scheme based on that of MR
> ---
>
> Key: SPARK-22765
> URL: https://issues.apache.org/jira/browse/SPARK-22765
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0
>Reporter: Xuefu Zhang
>
> Many users migrating their workload from MR to Spark find a significant 
> resource consumption hike (i.e., SPARK-22683). While this might not be a 
> concern for users that are more performance-centric, for others conscious 
> about cost, such a hike creates a migration obstacle. This situation can get 
> worse as more users move to the cloud.
> Dynamic allocation makes it possible for Spark to be deployed in a 
> multi-tenant environment. With its performance-centric design, its 
> inefficiency has also unfortunately shown up, especially when compared with 
> MR. Thus, it's believed that an MR-styled scheduler still has its merit. 
> Based on our research, the inefficiency associated with dynamic allocation 
> comes from many aspects, such as executors idling out, bigger executors, 
> many stages (rather than only 2 stages as in MR) in a Spark job, etc.
> Rather than fine-tuning dynamic allocation for efficiency, the proposal here 
> is to add a new, efficiency-centric scheduling scheme based on that of MR. 
> Such an MR-based scheme can be further enhanced and better adapted to the 
> Spark execution model. This alternative is expected to still offer a good 
> performance improvement (compared to MR) with similar or even better 
> efficiency than MR.
> Inputs are greatly welcome!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22774) Add compilation check for generated code in TPCDSQuerySuite

2017-12-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22774:


Assignee: Apache Spark

> Add compilation check for generated code in TPCDSQuerySuite
> ---
>
> Key: SPARK-22774
> URL: https://issues.apache.org/jira/browse/SPARK-22774
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>
> {{TPCDSQuerySuite}} already checks whether analysis can be performed 
> correctly. In addition, it would be good to check whether generated Java code 
> can be compiled correctly



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22774) Add compilation check for generated code in TPCDSQuerySuite

2017-12-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22774:


Assignee: (was: Apache Spark)

> Add compilation check for generated code in TPCDSQuerySuite
> ---
>
> Key: SPARK-22774
> URL: https://issues.apache.org/jira/browse/SPARK-22774
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>
> {{TPCDSQuerySuite}} already checks whether analysis can be performed 
> correctly. In addition, it would be good to check whether generated Java code 
> can be compiled correctly



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


