[jira] [Updated] (SPARK-23788) Race condition in StreamingQuerySuite

2018-03-24 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-23788:
-
Fix Version/s: 2.2.2

> Race condition in StreamingQuerySuite
> -
>
> Key: SPARK-23788
> URL: https://issues.apache.org/jira/browse/SPARK-23788
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Minor
> Fix For: 2.2.2, 2.3.1, 2.4.0
>
>
> The serializability test uses the same MemoryStream instance for 3 different 
> queries. If any of those queries asks it to commit before the others have run, 
> the rest will see empty DataFrames. This can fail the test if q3 is affected.
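
For readers unfamiliar with the failure mode, here is a minimal, hypothetical sketch 
(not the actual suite code) of the problematic pattern: one MemoryStream shared by 
several queries, so whichever query commits a batch first causes the source to drop 
its buffered data and the remaining queries see empty DataFrames. The fix is to give 
each query its own MemoryStream.

{code:scala}
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.sql.execution.streaming.MemoryStream

object SharedMemoryStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("sketch").getOrCreate()
    import spark.implicits._
    implicit val sqlContext: SQLContext = spark.sqlContext  // MemoryStream needs an implicit SQLContext

    val input = MemoryStream[Int]  // one shared source -- this is the race
    input.addData(1, 2, 3)

    // Three queries reading the same MemoryStream; whichever commits first drops the
    // buffered batch, so the later queries can end up reading nothing at all.
    val queries = (1 to 3).map { i =>
      input.toDF().writeStream.format("memory").queryName(s"q$i").start()
    }
    queries.foreach(_.processAllAvailable())
    queries.foreach(_.stop())
    spark.stop()
  }
}
{code}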



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23788) Race condition in StreamingQuerySuite

2018-03-24 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-23788.
--
   Resolution: Fixed
 Assignee: Jose Torres
Fix Version/s: 2.4.0
   2.3.1

> Race condition in StreamingQuerySuite
> -
>
> Key: SPARK-23788
> URL: https://issues.apache.org/jira/browse/SPARK-23788
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Minor
> Fix For: 2.3.1, 2.4.0
>
>
> The serializability test uses the same MemoryStream instance for 3 different 
> queries. If any of those queries asks it to commit before the others have run, 
> the rest will see empty DataFrames. This can fail the test if q3 is affected.






[jira] [Updated] (SPARK-23727) Support DATE predicate push down in parquet

2018-03-24 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23727:
--
Issue Type: Improvement  (was: Bug)

> Support DATE predicate push down in parquet
> -
>
> Key: SPARK-23727
> URL: https://issues.apache.org/jira/browse/SPARK-23727
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: yucai
>Priority: Major
>
> DATE predicate push down is missing and should be supported.
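
As a hypothetical illustration of what this would enable (the path and column name 
below are made up), a DATE comparison like the one in this sketch should ideally show 
up under PushedFilters on the Parquet scan node instead of being evaluated row by row 
after the scan:

{code:scala}
import java.sql.Date
import org.apache.spark.sql.SparkSession

object DatePushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("date-pushdown").getOrCreate()
    import spark.implicits._

    // Assumed Parquet dataset with an `event_date` column of type DATE.
    val events = spark.read.parquet("/tmp/events")

    val recent = events.filter($"event_date" >= Date.valueOf("2018-01-01"))
    // Check the physical plan: with predicate push down for DATE, the filter
    // should appear in PushedFilters on the FileScan parquet node.
    recent.explain(true)
  }
}
{code}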






[jira] [Commented] (SPARK-23598) WholeStageCodegen can lead to IllegalAccessError calling append for HashAggregateExec

2018-03-24 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412835#comment-16412835
 ] 

Dongjoon Hyun commented on SPARK-23598:
---

Hi, [~hvanhovell] and [~kiszk].
Although this test case sometimes fails in `branch-2.3`, I added `2.3.1` 
to `Fix Version/s` because the patch landed on `branch-2.3`.
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/lastCompletedBuild/testReport/org.apache.spark.sql.execution/WholeStageCodegenSuite/SPARK_23598__Codegen_working_for_lots_of_aggregation_operations_without_runtime_errors/

> WholeStageCodegen can lead to IllegalAccessError  calling append for 
> HashAggregateExec
> --
>
> Key: SPARK-23598
> URL: https://issues.apache.org/jira/browse/SPARK-23598
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: David Vogelbacher
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> Got the following stacktrace for a large QueryPlan using WholeStageCodeGen:
> {noformat}
> java.lang.IllegalAccessError: tried to access method 
> org.apache.spark.sql.execution.BufferedRowIterator.append(Lorg/apache/spark/sql/catalyst/InternalRow;)V
>  from class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7$agg_NestedClass
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7$agg_NestedClass.agg_doAggregateWithKeysOutput$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:109)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345){noformat}
> After disabling codegen, everything works.
> The root cause seems to be that we are trying to call the protected _append_ 
> method of 
> [BufferedRowIterator|https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/BufferedRowIterator.java#L68]
>  from an inner-class of a sub-class that is loaded by a different 
> class-loader (after codegen compilation).
> [https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-5.html#jvms-5.4.4] 
> states that a protected method _R_ can be accessed only if one of the 
> following two conditions is fulfilled:
>  # R is protected and is declared in a class C, and D is either a subclass of 
> C or C itself. Furthermore, if R is not static, then the symbolic reference 
> to R must contain a symbolic reference to a class T, such that T is either a 
> subclass of D, a superclass of D, or D itself.
>  # R is either protected or has default access (that is, neither public nor 
> protected nor private), and is declared by a class in the same run-time 
> package as D.
> 2.) doesn't apply as we have loaded the class with a different class loader 
> (and are in a different package) and 1.) doesn't apply because we are 
> apparently trying to call the method from an inner class of a subclass of 
> _BufferedRowIterator_.
> Looking at the Code path of _WholeStageCodeGen_, the following happens:
>  # In 
> [WholeStageCodeGen|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala#L527],
>  we create the subclass of _BufferedRowIterator_, along with a _processNext_ 
> method for processing the output of the child plan.
>  # In the child, which is a 
> [HashAggregateExec|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L517],
>  we create the method which shows up at the top of the stack trace (called 
> _doAggregateWithKeysOutput_ )
>  # We add this method to the compiled code invoking _addNewFunction_ of 
> [CodeGenerator|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L460]
> In the generated function body we call the _append_ method.
> Now, the _addNewFunction_ method states that:
> {noformat}
> If the 

[jira] [Updated] (SPARK-23598) WholeStageCodegen can lead to IllegalAccessError calling append for HashAggregateExec

2018-03-24 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23598:
--
Fix Version/s: 2.3.1

> WholeStageCodegen can lead to IllegalAccessError  calling append for 
> HashAggregateExec
> --
>
> Key: SPARK-23598
> URL: https://issues.apache.org/jira/browse/SPARK-23598
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: David Vogelbacher
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> Got the following stacktrace for a large QueryPlan using WholeStageCodeGen:
> {noformat}
> java.lang.IllegalAccessError: tried to access method 
> org.apache.spark.sql.execution.BufferedRowIterator.append(Lorg/apache/spark/sql/catalyst/InternalRow;)V
>  from class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7$agg_NestedClass
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7$agg_NestedClass.agg_doAggregateWithKeysOutput$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:109)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345){noformat}
> After disabling codegen, everything works.
> The root cause seems to be that we are trying to call the protected _append_ 
> method of 
> [BufferedRowIterator|https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/BufferedRowIterator.java#L68]
>  from an inner-class of a sub-class that is loaded by a different 
> class-loader (after codegen compilation).
> [https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-5.html#jvms-5.4.4] 
> states that a protected method _R_ can be accessed only if one of the 
> following two conditions is fulfilled:
>  # R is protected and is declared in a class C, and D is either a subclass of 
> C or C itself. Furthermore, if R is not static, then the symbolic reference 
> to R must contain a symbolic reference to a class T, such that T is either a 
> subclass of D, a superclass of D, or D itself.
>  # R is either protected or has default access (that is, neither public nor 
> protected nor private), and is declared by a class in the same run-time 
> package as D.
> 2.) doesn't apply as we have loaded the class with a different class loader 
> (and are in a different package) and 1.) doesn't apply because we are 
> apparently trying to call the method from an inner class of a subclass of 
> _BufferedRowIterator_.
> Looking at the Code path of _WholeStageCodeGen_, the following happens:
>  # In 
> [WholeStageCodeGen|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala#L527],
>  we create the subclass of _BufferedRowIterator_, along with a _processNext_ 
> method for processing the output of the child plan.
>  # In the child, which is a 
> [HashAggregateExec|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L517],
>  we create the method which shows up at the top of the stack trace (called 
> _doAggregateWithKeysOutput_ )
>  # We add this method to the compiled code invoking _addNewFunction_ of 
> [CodeGenerator|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L460]
> In the generated function body we call the _append_ method.
> Now, the _addNewFunction_ method states that:
> {noformat}
> If the code for the `OuterClass` grows too large, the function will be 
> inlined into a new private, inner class
> {noformat}
> This indeed seems to happen: the _doAggregateWithKeysOutput_ method is put 
> into a new private inner class. Thus, it doesn't have access to the protected 
> _append_ method anymore but still tries to call it, which results in the 
> _IllegalAccessError._ 
> Possible fixes:
>  * Pass in the _inlineToOuterClass_ flag when invoking the _addNewFunction_
>  * 

[jira] [Commented] (SPARK-23782) SHS should not show applications to user without read permission

2018-03-24 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412820#comment-16412820
 ] 

Marcelo Vanzin commented on SPARK-23782:


bq. The users can see which applications have been run by each users...

Sorry but I don't consider any of the things you mentioned sensitive. They 
basically boil down to: there are other users in the system, and they can run 
applications.

The consequences of this feature for the usability of the SHS (different users 
seeing different things) are a lot worse. I'm still against it unless you can make 
a very good case for it, which I haven't seen yet.

> SHS should not show applications to user without read permission
> 
>
> Key: SPARK-23782
> URL: https://issues.apache.org/jira/browse/SPARK-23782
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> The History Server shows all the applications to all the users, even though 
> they have no permission to read them. They cannot read the details of the 
> applications they cannot access, but still anybody can list all the 
> applications submitted by all users.
> For instance, if we have an admin user {{admin}} and two normal users {{u1}} 
> and {{u2}}, and each of them submitted one application, all of them can see 
> in the main page of SHS:
> ||App ID||App Name|| ... ||Spark User|| ... ||
> |app-123456789|The Admin App| .. |admin| ... |
> |app-123456790|u1 secret app| .. |u1| ... |
> |app-123456791|u2 secret app| .. |u2| ... |
> When clicking on an application, the proper permissions are applied, so each 
> user can open only the applications he/she has read permission for.
> Instead, each user should be able to list only the applications he/she has 
> permission to read, and should not see applications he/she lacks permission for.






[jira] [Commented] (SPARK-23791) Sub-optimal generated code for sum aggregating

2018-03-24 Thread Valentin Nikotin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412799#comment-16412799
 ] 

Valentin Nikotin commented on SPARK-23791:
--

When testing aggregation with different numbers of columns (v2.3.0), 
I found that with 100 columns both cases take approximately the same time.
With 90 columns the Spark job failed with:
{noformat}
18/03/25 00:11:33 ERROR Executor: Exception in task 117.0 in stage 1.0 (TID 4)
java.lang.ClassFormatError: Too many arguments in method signature in class 
file 
org/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage2
at java.lang.ClassLoader.defineClass1(Native Method)
{noformat}


> Sub-optimal generated code for sum aggregating
> --
>
> Key: SPARK-23791
> URL: https://issues.apache.org/jira/browse/SPARK-23791
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Valentin Nikotin
>Priority: Major
>  Labels: performance
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> It appears that with wholeStage codegen enabled, a simple Spark job 
> performing sum aggregation of 50 columns runs ~4 times slower than without 
> wholeStage codegen.
> Please check the test case code. Please note that the udf is only there to prevent 
> elimination optimizations that could be applied to literals.
> {code:scala}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.{Column, DataFrame, SparkSession}
> import org.apache.spark.sql.internal.SQLConf.WHOLESTAGE_CODEGEN_ENABLED
> object SPARK_23791 {
>   def main(args: Array[String]): Unit = {
> val spark = SparkSession
>   .builder()
>   .master("local[4]")
>   .appName("test")
>   .getOrCreate()
> def addConstColumns(prefix: String, cnt: Int, value: Column)(inputDF: 
> DataFrame) =
>   (0 until cnt).foldLeft(inputDF)((df, idx) => 
> df.withColumn(s"$prefix$idx", value))
> val dummy = udf(() => Option.empty[Int])
> def test(cnt: Int = 50, rows: Int = 500, grps: Int = 1000): Double = {
>   val t0 = System.nanoTime()
>   spark.range(rows).toDF()
> .withColumn("grp", col("id").mod(grps))
> .transform(addConstColumns("null_", cnt, dummy()))
> .groupBy("grp")
> .agg(sum("null_0"), (1 until cnt).map(idx => sum(s"null_$idx")): _*)
> .collect()
>   val t1 = System.nanoTime()
>   (t1 - t0) / 1e9
> }
> val timings = for (i <- 1 to 3) yield {
>   spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, true)
>   val with_wholestage = test()
>   spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, false)
>   val without_wholestage = test()
>   (with_wholestage, without_wholestage)
> }
> timings.foreach(println)
> println("Press enter ...")
> System.in.read()
>   }
> }
> {code}
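
For anyone who wants to look at the generated aggregation code directly, here is a 
small sketch (my addition, not part of the report) that reuses the shape of the test 
case above and dumps the whole-stage generated Java source via Spark's debug helpers:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.execution.debug._  // adds debugCodegen() to Dataset

object InspectGeneratedCode {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("codegen-dump").getOrCreate()
    import spark.implicits._

    val dummy = udf(() => Option.empty[Int])
    val df = spark.range(500).toDF()
      .withColumn("grp", col("id").mod(1000))
      .withColumn("null_0", dummy())
      .groupBy("grp")
      .agg(sum("null_0"))

    // Prints each WholeStageCodegen subtree together with the Java source it generates,
    // which is where the slow sum code can be inspected.
    df.debugCodegen()
    spark.stop()
  }
}
{code}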






[jira] [Updated] (SPARK-23791) Sub-optimal generated code for sum aggregating

2018-03-24 Thread Valentin Nikotin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Valentin Nikotin updated SPARK-23791:
-
Description: 
It appears that with wholeStage codegen enabled, a simple Spark job 
performing sum aggregation of 50 columns runs ~4 times slower than without 
wholeStage codegen.

Please check the test case code. Please note that the udf is only there to prevent 
elimination optimizations that could be applied to literals.


{code:scala}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.internal.SQLConf.WHOLESTAGE_CODEGEN_ENABLED

object SPARK_23791 {

  def main(args: Array[String]): Unit = {

val spark = SparkSession
  .builder()
  .master("local[4]")
  .appName("test")
  .getOrCreate()

def addConstColumns(prefix: String, cnt: Int, value: Column)(inputDF: 
DataFrame) =
  (0 until cnt).foldLeft(inputDF)((df, idx) => 
df.withColumn(s"$prefix$idx", value))

val dummy = udf(() => Option.empty[Int])

def test(cnt: Int = 50, rows: Int = 500, grps: Int = 1000): Double = {
  val t0 = System.nanoTime()
  spark.range(rows).toDF()
.withColumn("grp", col("id").mod(grps))
.transform(addConstColumns("null_", cnt, dummy()))
.groupBy("grp")
.agg(sum("null_0"), (1 until cnt).map(idx => sum(s"null_$idx")): _*)
.collect()
  val t1 = System.nanoTime()
  (t1 - t0) / 1e9
}

val timings = for (i <- 1 to 3) yield {
  spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, true)
  val with_wholestage = test()
  spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, false)
  val without_wholestage = test()
  (with_wholestage, without_wholestage)
}

timings.foreach(println)

println("Press enter ...")
System.in.read()
  }
}
{code}


  was:
It appears that with wholeStage codegen enabled, a simple Spark job 
performing sum aggregation of 50 nullable columns runs ~4 times slower than 
without wholeStage codegen.

Please check the test case code. Please note that the udf is only there to prevent 
elimination optimizations that could be applied to literals.


{code:scala}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.internal.SQLConf.WHOLESTAGE_CODEGEN_ENABLED

object SPARK_23791 {

  def main(args: Array[String]): Unit = {

val spark = SparkSession
  .builder()
  .master("local[4]")
  .appName("test")
  .getOrCreate()

def addConstColumns(prefix: String, cnt: Int, value: Column)(inputDF: 
DataFrame) =
  (0 until cnt).foldLeft(inputDF)((df, idx) => 
df.withColumn(s"$prefix$idx", value))

val dummy = udf(() => Option.empty[Int])

def test(cnt: Int = 50, rows: Int = 500, grps: Int = 1000): Double = {
  val t0 = System.nanoTime()
  spark.range(rows).toDF()
.withColumn("grp", col("id").mod(grps))
.transform(addConstColumns("null_", cnt, dummy()))
.groupBy("grp")
.agg(sum("null_0"), (1 until cnt).map(idx => sum(s"null_$idx")): _*)
.collect()
  val t1 = System.nanoTime()
  (t1 - t0) / 1e9
}

val timings = for (i <- 1 to 3) yield {
  spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, true)
  val with_wholestage = test()
  spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, false)
  val without_wholestage = test()
  (with_wholestage, without_wholestage)
}

timings.foreach(println)

println("Press enter ...")
System.in.read()
  }
}
{code}



> Sub-optimal generated code for sum aggregating
> --
>
> Key: SPARK-23791
> URL: https://issues.apache.org/jira/browse/SPARK-23791
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Valentin Nikotin
>Priority: Major
>  Labels: performance
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> It appears that with wholeStage codegen enabled, a simple Spark job 
> performing sum aggregation of 50 columns runs ~4 times slower than without 
> wholeStage codegen.
> Please check the test case code. Please note that the udf is only there to prevent 
> elimination optimizations that could be applied to literals.
> {code:scala}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.{Column, DataFrame, SparkSession}
> import org.apache.spark.sql.internal.SQLConf.WHOLESTAGE_CODEGEN_ENABLED
> object SPARK_23791 {
>   def main(args: Array[String]): Unit = {
> val spark = SparkSession
>   .builder()
>   .master("local[4]")
>   .appName("test")
>   .getOrCreate()
> def addConstColumns(prefix: String, cnt: Int, value: Column)(inputDF: 
> 

[jira] [Updated] (SPARK-23791) Sub-optimal generated code for sum aggregating

2018-03-24 Thread Valentin Nikotin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Valentin Nikotin updated SPARK-23791:
-
Summary: Sub-optimal generated code for sum aggregating  (was: Sub-optimal 
generated code when aggregating nullable columns)

> Sub-optimal generated code for sum aggregating
> --
>
> Key: SPARK-23791
> URL: https://issues.apache.org/jira/browse/SPARK-23791
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Valentin Nikotin
>Priority: Major
>  Labels: performance
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> It appears that with wholeStage codegen enabled, a simple Spark job 
> performing sum aggregation of 50 nullable columns runs ~4 times slower than 
> without wholeStage codegen.
> Please check the test case code. Please note that the udf is only there to prevent 
> elimination optimizations that could be applied to literals.
> {code:scala}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.{Column, DataFrame, SparkSession}
> import org.apache.spark.sql.internal.SQLConf.WHOLESTAGE_CODEGEN_ENABLED
> object SPARK_23791 {
>   def main(args: Array[String]): Unit = {
> val spark = SparkSession
>   .builder()
>   .master("local[4]")
>   .appName("test")
>   .getOrCreate()
> def addConstColumns(prefix: String, cnt: Int, value: Column)(inputDF: 
> DataFrame) =
>   (0 until cnt).foldLeft(inputDF)((df, idx) => 
> df.withColumn(s"$prefix$idx", value))
> val dummy = udf(() => Option.empty[Int])
> def test(cnt: Int = 50, rows: Int = 500, grps: Int = 1000): Double = {
>   val t0 = System.nanoTime()
>   spark.range(rows).toDF()
> .withColumn("grp", col("id").mod(grps))
> .transform(addConstColumns("null_", cnt, dummy()))
> .groupBy("grp")
> .agg(sum("null_0"), (1 until cnt).map(idx => sum(s"null_$idx")): _*)
> .collect()
>   val t1 = System.nanoTime()
>   (t1 - t0) / 1e9
> }
> val timings = for (i <- 1 to 3) yield {
>   spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, true)
>   val with_wholestage = test()
>   spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, false)
>   val without_wholestage = test()
>   (with_wholestage, without_wholestage)
> }
> timings.foreach(println)
> println("Press enter ...")
> System.in.read()
>   }
> }
> {code}






[jira] [Updated] (SPARK-23791) Sub-optimal generated code when aggregating nullable columns

2018-03-24 Thread Valentin Nikotin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Valentin Nikotin updated SPARK-23791:
-
Environment: (was: {code:java}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.internal.SQLConf.WHOLESTAGE_CODEGEN_ENABLED

object TestCase {

  def main(args: Array[String]): Unit = {

val spark = SparkSession
  .builder()
  .master("local[4]")
  .appName("test")
  .getOrCreate()

def addConstColumns(prefix: String, cnt: Int, value: Column)(inputDF: 
DataFrame) =
  (0 until cnt).foldLeft(inputDF)((df, idx) => 
df.withColumn(s"$prefix$idx", value))

val dummy = udf(() => Option.empty[Int])

def test(cnt: Int = 50, rows: Int = 500, grps: Int = 1000): Double = {
  val t0 = System.nanoTime()
  spark.range(rows).toDF()
.withColumn("grp", col("id").mod(grps))
.transform(addConstColumns("null_", cnt, dummy()))
.groupBy("grp")
.agg(sum("null_0"), (1 until cnt).map(idx => sum(s"null_$idx")): _*)
.collect()
  val t1 = System.nanoTime()
  (t1 - t0) / 1e9
}

val timings = for (i <- 1 to 3) yield {
  spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, true)
  val with_wholestage = test()
  spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, false)
  val without_wholestage = test()
  (with_wholestage, without_wholestage)
}

timings.foreach(println)

println("Press enter ...")
System.in.read()
  }
}
{code})

> Sub-optimal generated code when aggregating nullable columns
> 
>
> Key: SPARK-23791
> URL: https://issues.apache.org/jira/browse/SPARK-23791
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Valentin Nikotin
>Priority: Major
>  Labels: performance
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> It appears that with wholeStage codegen enabled, a simple Spark job 
> performing sum aggregation of 50 nullable columns runs ~4 times slower than 
> without wholeStage codegen.
> Please check the test case code. Please note that the udf is only there to prevent 
> elimination optimizations that could be applied to literals.
> {code:scala}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.{Column, DataFrame, SparkSession}
> import org.apache.spark.sql.internal.SQLConf.WHOLESTAGE_CODEGEN_ENABLED
> object SPARK_23791 {
>   def main(args: Array[String]): Unit = {
> val spark = SparkSession
>   .builder()
>   .master("local[4]")
>   .appName("test")
>   .getOrCreate()
> def addConstColumns(prefix: String, cnt: Int, value: Column)(inputDF: 
> DataFrame) =
>   (0 until cnt).foldLeft(inputDF)((df, idx) => 
> df.withColumn(s"$prefix$idx", value))
> val dummy = udf(() => Option.empty[Int])
> def test(cnt: Int = 50, rows: Int = 500, grps: Int = 1000): Double = {
>   val t0 = System.nanoTime()
>   spark.range(rows).toDF()
> .withColumn("grp", col("id").mod(grps))
> .transform(addConstColumns("null_", cnt, dummy()))
> .groupBy("grp")
> .agg(sum("null_0"), (1 until cnt).map(idx => sum(s"null_$idx")): _*)
> .collect()
>   val t1 = System.nanoTime()
>   (t1 - t0) / 1e9
> }
> val timings = for (i <- 1 to 3) yield {
>   spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, true)
>   val with_wholestage = test()
>   spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, false)
>   val without_wholestage = test()
>   (with_wholestage, without_wholestage)
> }
> timings.foreach(println)
> println("Press enter ...")
> System.in.read()
>   }
> }
> {code}






[jira] [Updated] (SPARK-23791) Sub-optimal generated code when aggregating nullable columns

2018-03-24 Thread Valentin Nikotin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Valentin Nikotin updated SPARK-23791:
-
Description: 
It appears that with wholeStage codegen enabled, a simple Spark job 
performing sum aggregation of 50 nullable columns runs ~4 times slower than 
without wholeStage codegen.

Please check the test case code. Please note that the udf is only there to prevent 
elimination optimizations that could be applied to literals.


{code:java}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.internal.SQLConf.WHOLESTAGE_CODEGEN_ENABLED

object SPARK_23791 {

  def main(args: Array[String]): Unit = {

val spark = SparkSession
  .builder()
  .master("local[4]")
  .appName("test")
  .getOrCreate()

def addConstColumns(prefix: String, cnt: Int, value: Column)(inputDF: 
DataFrame) =
  (0 until cnt).foldLeft(inputDF)((df, idx) => 
df.withColumn(s"$prefix$idx", value))

val dummy = udf(() => Option.empty[Int])

def test(cnt: Int = 50, rows: Int = 500, grps: Int = 1000): Double = {
  val t0 = System.nanoTime()
  spark.range(rows).toDF()
.withColumn("grp", col("id").mod(grps))
.transform(addConstColumns("null_", cnt, dummy()))
.groupBy("grp")
.agg(sum("null_0"), (1 until cnt).map(idx => sum(s"null_$idx")): _*)
.collect()
  val t1 = System.nanoTime()
  (t1 - t0) / 1e9
}

val timings = for (i <- 1 to 3) yield {
  spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, true)
  val with_wholestage = test()
  spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, false)
  val without_wholestage = test()
  (with_wholestage, without_wholestage)
}

timings.foreach(println)

println("Press enter ...")
System.in.read()
  }
}
{code}


  was:
It appears that with wholeStage codegen enabled, a simple Spark job 
performing sum aggregation of 50 nullable columns runs ~4 times slower than 
without wholeStage codegen.

Please check the test case code. Please note that the udf is only there to prevent 
elimination optimizations that could be applied to literals.


> Sub-optimal generated code when aggregating nullable columns
> 
>
> Key: SPARK-23791
> URL: https://issues.apache.org/jira/browse/SPARK-23791
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.2.0, 2.3.0
> Environment: {code:java}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.{Column, DataFrame, SparkSession}
> import org.apache.spark.sql.internal.SQLConf.WHOLESTAGE_CODEGEN_ENABLED
> object TestCase {
>   def main(args: Array[String]): Unit = {
> val spark = SparkSession
>   .builder()
>   .master("local[4]")
>   .appName("test")
>   .getOrCreate()
> def addConstColumns(prefix: String, cnt: Int, value: Column)(inputDF: 
> DataFrame) =
>   (0 until cnt).foldLeft(inputDF)((df, idx) => 
> df.withColumn(s"$prefix$idx", value))
> val dummy = udf(() => Option.empty[Int])
> def test(cnt: Int = 50, rows: Int = 500, grps: Int = 1000): Double = {
>   val t0 = System.nanoTime()
>   spark.range(rows).toDF()
> .withColumn("grp", col("id").mod(grps))
> .transform(addConstColumns("null_", cnt, dummy()))
> .groupBy("grp")
> .agg(sum("null_0"), (1 until cnt).map(idx => sum(s"null_$idx")): _*)
> .collect()
>   val t1 = System.nanoTime()
>   (t1 - t0) / 1e9
> }
> val timings = for (i <- 1 to 3) yield {
>   spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, true)
>   val with_wholestage = test()
>   spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, false)
>   val without_wholestage = test()
>   (with_wholestage, without_wholestage)
> }
> timings.foreach(println)
> println("Press enter ...")
> System.in.read()
>   }
> }
> {code}
>Reporter: Valentin Nikotin
>Priority: Major
>  Labels: performance
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> It appears that with wholeStage codegen enabled, a simple Spark job 
> performing sum aggregation of 50 nullable columns runs ~4 times slower than 
> without wholeStage codegen.
> Please check the test case code. Please note that the udf is only there to prevent 
> elimination optimizations that could be applied to literals.
> {code:java}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.{Column, DataFrame, SparkSession}
> import org.apache.spark.sql.internal.SQLConf.WHOLESTAGE_CODEGEN_ENABLED
> object SPARK_23791 {
>   def main(args: Array[String]): Unit = {
> val spark = SparkSession
>   .builder()
>   .master("local[4]")
>  

[jira] [Updated] (SPARK-23791) Sub-optimal generated code when aggregating nullable columns

2018-03-24 Thread Valentin Nikotin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Valentin Nikotin updated SPARK-23791:
-
Description: 
It appears that with wholeStage codegen enabled, a simple Spark job 
performing sum aggregation of 50 nullable columns runs ~4 times slower than 
without wholeStage codegen.

Please check the test case code. Please note that the udf is only there to prevent 
elimination optimizations that could be applied to literals.


{code:scala}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.internal.SQLConf.WHOLESTAGE_CODEGEN_ENABLED

object SPARK_23791 {

  def main(args: Array[String]): Unit = {

val spark = SparkSession
  .builder()
  .master("local[4]")
  .appName("test")
  .getOrCreate()

def addConstColumns(prefix: String, cnt: Int, value: Column)(inputDF: 
DataFrame) =
  (0 until cnt).foldLeft(inputDF)((df, idx) => 
df.withColumn(s"$prefix$idx", value))

val dummy = udf(() => Option.empty[Int])

def test(cnt: Int = 50, rows: Int = 500, grps: Int = 1000): Double = {
  val t0 = System.nanoTime()
  spark.range(rows).toDF()
.withColumn("grp", col("id").mod(grps))
.transform(addConstColumns("null_", cnt, dummy()))
.groupBy("grp")
.agg(sum("null_0"), (1 until cnt).map(idx => sum(s"null_$idx")): _*)
.collect()
  val t1 = System.nanoTime()
  (t1 - t0) / 1e9
}

val timings = for (i <- 1 to 3) yield {
  spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, true)
  val with_wholestage = test()
  spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, false)
  val without_wholestage = test()
  (with_wholestage, without_wholestage)
}

timings.foreach(println)

println("Press enter ...")
System.in.read()
  }
}
{code}


  was:
It appears that with wholeStage codegen enabled, a simple Spark job 
performing sum aggregation of 50 nullable columns runs ~4 times slower than 
without wholeStage codegen.

Please check the test case code. Please note that the udf is only there to prevent 
elimination optimizations that could be applied to literals.


{code:java}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.internal.SQLConf.WHOLESTAGE_CODEGEN_ENABLED

object SPARK_23791 {

  def main(args: Array[String]): Unit = {

val spark = SparkSession
  .builder()
  .master("local[4]")
  .appName("test")
  .getOrCreate()

def addConstColumns(prefix: String, cnt: Int, value: Column)(inputDF: 
DataFrame) =
  (0 until cnt).foldLeft(inputDF)((df, idx) => 
df.withColumn(s"$prefix$idx", value))

val dummy = udf(() => Option.empty[Int])

def test(cnt: Int = 50, rows: Int = 500, grps: Int = 1000): Double = {
  val t0 = System.nanoTime()
  spark.range(rows).toDF()
.withColumn("grp", col("id").mod(grps))
.transform(addConstColumns("null_", cnt, dummy()))
.groupBy("grp")
.agg(sum("null_0"), (1 until cnt).map(idx => sum(s"null_$idx")): _*)
.collect()
  val t1 = System.nanoTime()
  (t1 - t0) / 1e9
}

val timings = for (i <- 1 to 3) yield {
  spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, true)
  val with_wholestage = test()
  spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, false)
  val without_wholestage = test()
  (with_wholestage, without_wholestage)
}

timings.foreach(println)

println("Press enter ...")
System.in.read()
  }
}
{code}



> Sub-optimal generated code when aggregating nullable columns
> 
>
> Key: SPARK-23791
> URL: https://issues.apache.org/jira/browse/SPARK-23791
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.2.0, 2.3.0
> Environment: {code:java}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.{Column, DataFrame, SparkSession}
> import org.apache.spark.sql.internal.SQLConf.WHOLESTAGE_CODEGEN_ENABLED
> object TestCase {
>   def main(args: Array[String]): Unit = {
> val spark = SparkSession
>   .builder()
>   .master("local[4]")
>   .appName("test")
>   .getOrCreate()
> def addConstColumns(prefix: String, cnt: Int, value: Column)(inputDF: 
> DataFrame) =
>   (0 until cnt).foldLeft(inputDF)((df, idx) => 
> df.withColumn(s"$prefix$idx", value))
> val dummy = udf(() => Option.empty[Int])
> def test(cnt: Int = 50, rows: Int = 500, grps: Int = 1000): Double = {
>   val t0 = System.nanoTime()
>   spark.range(rows).toDF()
> .withColumn("grp", col("id").mod(grps))
> .transform(addConstColumns("null_", cnt, dummy()))
> 

[jira] [Updated] (SPARK-23791) Sub-optimal generated code when aggregating nullable columns

2018-03-24 Thread Valentin Nikotin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Valentin Nikotin updated SPARK-23791:
-
Description: 
It appears that with wholeStage codegen enabled, a simple Spark job 
performing sum aggregation of 50 nullable columns runs ~4 times slower than 
without wholeStage codegen.

Please check the test case code. Please note that the udf is only there to prevent 
elimination optimizations that could be applied to literals.

  was:
It appears that with wholeStage codegen enabled, a simple Spark job 
performing sum aggregation of 50 nullable columns runs ~4 times slower than 
without wholeStage codegen.

Please check the test case code. Please note that the udf is only there to prevent 
elimination optimizations that could be applied to literals.

 


> Sub-optimal generated code when aggregating nullable columns
> 
>
> Key: SPARK-23791
> URL: https://issues.apache.org/jira/browse/SPARK-23791
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.2.0, 2.3.0
> Environment: {code:java}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.{Column, DataFrame, SparkSession}
> import org.apache.spark.sql.internal.SQLConf.WHOLESTAGE_CODEGEN_ENABLED
> object TestCase {
>   def main(args: Array[String]): Unit = {
> val spark = SparkSession
>   .builder()
>   .master("local[4]")
>   .appName("test")
>   .getOrCreate()
> def addConstColumns(prefix: String, cnt: Int, value: Column)(inputDF: 
> DataFrame) =
>   (0 until cnt).foldLeft(inputDF)((df, idx) => 
> df.withColumn(s"$prefix$idx", value))
> val dummy = udf(() => Option.empty[Int])
> def test(cnt: Int = 50, rows: Int = 500, grps: Int = 1000): Double = {
>   val t0 = System.nanoTime()
>   spark.range(rows).toDF()
> .withColumn("grp", col("id").mod(grps))
> .transform(addConstColumns("null_", cnt, dummy()))
> .groupBy("grp")
> .agg(sum("null_0"), (1 until cnt).map(idx => sum(s"null_$idx")): _*)
> .collect()
>   val t1 = System.nanoTime()
>   (t1 - t0) / 1e9
> }
> val timings = for (i <- 1 to 3) yield {
>   spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, true)
>   val with_wholestage = test()
>   spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, false)
>   val without_wholestage = test()
>   (with_wholestage, without_wholestage)
> }
> timings.foreach(println)
> println("Press enter ...")
> System.in.read()
>   }
> }
> {code}
>Reporter: Valentin Nikotin
>Priority: Major
>  Labels: performance
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> It appears that with wholeStage codegen enabled, a simple Spark job 
> performing sum aggregation of 50 nullable columns runs ~4 times slower than 
> without wholeStage codegen.
> Please check the test case code. Please note that the udf is only there to prevent 
> elimination optimizations that could be applied to literals.






[jira] [Created] (SPARK-23791) Sub-optimal generated code when aggregating nullable columns

2018-03-24 Thread Valentin Nikotin (JIRA)
Valentin Nikotin created SPARK-23791:


 Summary: Sub-optimal generated code when aggregating nullable 
columns
 Key: SPARK-23791
 URL: https://issues.apache.org/jira/browse/SPARK-23791
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 2.3.0, 2.2.0
 Environment: {code:java}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.internal.SQLConf.WHOLESTAGE_CODEGEN_ENABLED

object TestCase {

  def main(args: Array[String]): Unit = {

val spark = SparkSession
  .builder()
  .master("local[4]")
  .appName("test")
  .getOrCreate()

def addConstColumns(prefix: String, cnt: Int, value: Column)(inputDF: 
DataFrame) =
  (0 until cnt).foldLeft(inputDF)((df, idx) => 
df.withColumn(s"$prefix$idx", value))

val dummy = udf(() => Option.empty[Int])

def test(cnt: Int = 50, rows: Int = 500, grps: Int = 1000): Double = {
  val t0 = System.nanoTime()
  spark.range(rows).toDF()
.withColumn("grp", col("id").mod(grps))
.transform(addConstColumns("null_", cnt, dummy()))
.groupBy("grp")
.agg(sum("null_0"), (1 until cnt).map(idx => sum(s"null_$idx")): _*)
.collect()
  val t1 = System.nanoTime()
  (t1 - t0) / 1e9
}

val timings = for (i <- 1 to 3) yield {
  spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, true)
  val with_wholestage = test()
  spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, false)
  val without_wholestage = test()
  (with_wholestage, without_wholestage)
}

timings.foreach(println)

println("Press enter ...")
System.in.read()
  }
}
{code}
Reporter: Valentin Nikotin


It appears that with wholeStage codegen enabled, a simple Spark job 
performing sum aggregation of 50 nullable columns runs ~4 times slower than 
without wholeStage codegen.

Please check the test case code. Please note that the udf is only there to prevent 
elimination optimizations that could be applied to literals.

 






[jira] [Commented] (SPARK-23645) pandas_udf can not be called with keyword arguments

2018-03-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412711#comment-16412711
 ] 

Apache Spark commented on SPARK-23645:
--

User 'mstewart141' has created a pull request for this issue:
https://github.com/apache/spark/pull/20900

> pandas_udf can not be called with keyword arguments
> ---
>
> Key: SPARK-23645
> URL: https://issues.apache.org/jira/browse/SPARK-23645
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, 
> OpenJDK 64-Bit Server VM, 1.8.0_141
>Reporter: Stu (Michael Stewart)
>Priority: Minor
>
> pandas_udf (and possibly all Python UDFs) does not accept keyword arguments because 
> the `UserDefinedFunction` class in `pyspark/sql/udf.py` has a `__call__` method, and 
> also wrapper utility methods, that only accept positional args and not kwargs:
> @ line 168:
> {code:java}
> ...
> def __call__(self, *cols):
> judf = self._judf
> sc = SparkContext._active_spark_context
> return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
> # This function is for improving the online help system in the interactive 
> interpreter.
> # For example, the built-in help / pydoc.help. It wraps the UDF with the 
> docstring and
> # argument annotation. (See: SPARK-19161)
> def _wrapped(self):
> """
> Wrap this udf with a function and attach docstring from func
> """
> # It is possible for a callable instance without __name__ attribute or/and
> # __module__ attribute to be wrapped here. For example, 
> functools.partial. In this case,
> # we should avoid wrapping the attributes from the wrapped function to 
> the wrapper
> # function. So, we take out these attribute names from the default names 
> to set and
> # then manually assign it after being wrapped.
> assignments = tuple(
> a for a in functools.WRAPPER_ASSIGNMENTS if a != '__name__' and a != 
> '__module__')
> @functools.wraps(self.func, assigned=assignments)
> def wrapper(*args):
> return self(*args)
> ...{code}
> as seen in:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit
> spark = SparkSession.builder.getOrCreate()
> df = spark.range(12).withColumn('b', col('id') * 2)
> def ok(a,b): return a*b
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  
> # no problems
> df.withColumn('ok', pandas_udf(f=ok, 
> returnType='bigint')(a='id',b='b')).show()  # fail with ~no stacktrace thanks 
> to wrapper helper
> ---
> TypeError Traceback (most recent call last)
>  in ()
> > 1 df.withColumn('ok', pandas_udf(f=ok, 
> returnType='bigint')(a='id',b='b')).show()
> TypeError: wrapper() got an unexpected keyword argument 'a'{code}
>  
>  
> *discourse*: it isn't difficult to swap back in the kwargs, allowing the UDF 
> to be called as such, but the cols tuple that gets passed in the call method:
> {code:java}
> _to_seq(sc, cols, _to_java_column){code}
>  has to be in the right order based on the function's declared argument order, 
> or the function will return incorrect results. So, the challenge here is to:
> (a) make sure to reconstruct the proper order of the full args/kwargs
> --> args first, and then kwargs (not in the order passed but in the order 
> requested by the fn; see the sketch below)
> (b) handle python2 and python3 `inspect` module inconsistencies
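
A rough sketch of (a): reordering the passed args/kwargs into the order declared by the 
wrapped function, using Python 3's `inspect.signature` (Python 2 would need 
`inspect.getargspec` instead). This only illustrates the idea and is not the actual patch:

{code:python}
import inspect

def ordered_cols(func, *args, **kwargs):
    """Return the column arguments in the positional order declared by func."""
    sig = inspect.signature(func)
    bound = sig.bind(*args, **kwargs)  # raises TypeError for unknown or missing arguments
    return list(bound.arguments.values())

def ok(a, b):
    return a * b

print(ordered_cols(ok, 'id', b='b'))    # ['id', 'b']
print(ordered_cols(ok, b='b', a='id'))  # ['id', 'b'] -- kwargs reordered to match ok's signature
{code}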






[jira] [Updated] (SPARK-23790) proxy-user failed connecting to a kerberos configured metastore

2018-03-24 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-23790:

Description: 
This appeared at a customer trying to integrate with a kerberized hdfs cluster.

This can be easily fixed with the proposed fix 
[here|https://github.com/apache/spark/pull/17333] and the problem was reported 
first [here|https://issues.apache.org/jira/browse/SPARK-19995] for yarn.

The other option is to add the delegation tokens to the current user's UGI as 
in [here|https://github.com/apache/spark/pull/17335]. The latter fixes the 
problem but leads to a failure when someone uses a HadoopRDD, because HadoopRDD 
uses FileInputFormat to get the splits, which consults the local ticket cache 
via TokenCache.obtainTokensForNamenodes. Eventually this will fail with:
{quote}Exception in thread "main" 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
can be issued only with kerberos or web authentication at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)
{quote}
This implies that the security mode is SIMPLE and the hadoop libs there are not 
aware of kerberos.

This is related to this [issue|https://issues.apache.org/jira/browse/MAPREDUCE-6876], 
and the workaround decided on was to 
[trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
 hadoop.

 

  was:
This appeared at a customer trying to integrate with a kerberized hdfs cluster.

This can be easily fixed with the proposed fix 
[here|https://github.com/apache/spark/pull/17333].

The other option is to add the delegation tokens to the current user's UGI as 
in [here|https://github.com/apache/spark/pull/17335]. The latter fixes the 
problem but leads to a failure when someone uses a HadoopRDD, because HadoopRDD 
uses FileInputFormat to get the splits, which consults the local ticket cache 
via TokenCache.obtainTokensForNamenodes. Eventually this will fail with:
{quote}Exception in thread "main" 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
can be issued only with kerberos or web authentication at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)
{quote}
This implies that the security mode is SIMPLE and the hadoop libs there are not 
aware of kerberos.

This is related to this 
[issue|https://issues.apache.org/jira/browse/MAPREDUCE-6876] and the workaround 
decided was to 
[trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
 hadoop.

 


> proxy-user failed connecting to a kerberos configured metastore
> ---
>
> Key: SPARK-23790
> URL: https://issues.apache.org/jira/browse/SPARK-23790
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> This appeared at a customer trying to integrate with a kerberized hdfs 
> cluster.
> This can be easily fixed with the proposed fix 
> [here|https://github.com/apache/spark/pull/17333] and the problem was 
> reported first [here|https://issues.apache.org/jira/browse/SPARK-19995] for 
> yarn.
> The other option is to add the delegation tokens to the current user's UGI as 
> in [here|https://github.com/apache/spark/pull/17335]. The latter fixes the 
> problem but leads to a failure when someone uses a HadoopRDD, because HadoopRDD 
> uses FileInputFormat to get the splits, which consults the local ticket cache 
> via TokenCache.obtainTokensForNamenodes. Eventually this will fail with:
> {quote}Exception in thread "main" 
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
> can be issued only with kerberos or web authentication at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)
> {quote}
> This implies that the security mode is SIMPLE and the hadoop libs there are not 
> aware of kerberos.
> This is related to this [issue|https://issues.apache.org/jira/browse/MAPREDUCE-6876], 
> and the workaround decided on was to 
> [trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
>  hadoop.
>  






[jira] [Commented] (SPARK-23790) proxy-user failed connecting to a kerberos configured metastore

2018-03-24 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412644#comment-16412644
 ] 

Stavros Kontopoulos commented on SPARK-23790:
-

[~susanxhuynh] fyi. [~vanzin], [~jerryshao] do you think we should revert 
to the other solution with the doAsRealUser(SessionState.start(state))? I don't 
think there is much progress 
[here|https://issues.apache.org/jira/browse/MAPREDUCE-6876].
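
For context, here is a minimal sketch of the UGI plumbing being discussed, written against 
Hadoop's UserGroupInformation API. The helper name, and where the delegation tokens come 
from, are my assumptions for illustration; this is not either of the actual PRs:

{code:scala}
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.{Credentials, UserGroupInformation}

object ProxyUserSketch {
  // Run `body` as the proxy user, after attaching already-obtained delegation tokens
  // to the proxy user's UGI so that later Hadoop calls (e.g. the FileInputFormat split
  // listing that goes through TokenCache.obtainTokensForNamenodes) can find them.
  def runAsProxy[T](proxyUserName: String, tokens: Credentials)(body: => T): T = {
    val realUser  = UserGroupInformation.getLoginUser  // the kerberos-authenticated real user
    val proxyUser = UserGroupInformation.createProxyUser(proxyUserName, realUser)
    proxyUser.addCredentials(tokens)
    proxyUser.doAs(new PrivilegedExceptionAction[T] {
      override def run(): T = body
    })
  }
}
{code}

The alternative mentioned above would instead wrap only the metastore connection in a doAs 
on the real user (the doAsRealUser(SessionState.start(state)) idea), leaving the rest of the 
job running as the proxy user.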

> proxy-user failed connecting to a kerberos configured metastore
> ---
>
> Key: SPARK-23790
> URL: https://issues.apache.org/jira/browse/SPARK-23790
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> This appeared at a customer trying to integrate with a kerberized hdfs 
> cluster.
> This can be easily fixed with the proposed fix 
> [here|https://github.com/apache/spark/pull/17333].
> The other option is to add the delegation tokens to the current user's UGI as 
> in [here|https://github.com/apache/spark/pull/17335]. The latter fixes the 
> problem but leads to a failure when someone uses a HadoopRDD, because HadoopRDD 
> uses FileInputFormat to get the splits, which consults the local ticket cache 
> via TokenCache.obtainTokensForNamenodes. Eventually this will fail with:
> {quote}Exception in thread "main" 
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
> can be issued only with kerberos or web authentication at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)
> {quote}
> This implies that the security mode is SIMPLE and the hadoop libs there are not 
> aware of kerberos.
> This is related to this 
> [issue|https://issues.apache.org/jira/browse/MAPREDUCE-6876] and the 
> workaround decided was to 
> [trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
>  hadoop.
>  






[jira] [Updated] (SPARK-23790) proxy-user failed connecting to a kerberos configured metastore

2018-03-24 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-23790:

Description: 
This appeared at a customer trying to integrate with a kerberized hdfs cluster.

This can be easily fixed with the proposed fix 
[here|https://github.com/apache/spark/pull/17333].

The other option is to add the delegation tokens to the current user's UGI as 
in [here|https://github.com/apache/spark/pull/17335]. The latter fixes the 
problem but leads to a failure when someone uses a HadoopRDD, because HadoopRDD 
uses FileInputFormat to get the splits, which consults the local ticket cache 
via TokenCache.obtainTokensForNamenodes. Eventually this will fail with:
{quote}Exception in thread "main" 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
can be issued only with kerberos or web authentication at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)
{quote}
This implies that the security mode is SIMPLE and the hadoop libs there are not 
aware of kerberos.

This is related to this 
[issue|https://issues.apache.org/jira/browse/MAPREDUCE-6876] and the workaround 
decided was to 
[trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
 hadoop.

 

  was:
This appeared at a customer trying to integrate with a kerberized hdfs cluster.

This can be easily fixed with the proposed fix 
[here|https://github.com/apache/spark/pull/17333].

The other option is to add the delegation tokens to the current user's UGI as 
in [here|https://github.com/apache/spark/pull/17335]. The latter fixes the 
problem but leads to a failure when someone uses a HadoopRDD, because HadoopRDD 
uses FileInputFormat to get the splits, which consults the local ticket cache 
via TokenCache.obtainTokensForNamenodes. Eventually this will fail with:
{quote}Exception in thread "main" 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
can be issued only with kerberos or web authentication at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)
{quote}
This implies that the security mode is SIMPLE and the hadoop libs there are not 
aware of kerberos.

This is related to this issue and the workaround decided was to 
[trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
 hadoop.

 


> proxy-user failed connecting to a kerberos configured metastore
> ---
>
> Key: SPARK-23790
> URL: https://issues.apache.org/jira/browse/SPARK-23790
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> This appeared at a customer trying to integrate with a kerberized hdfs 
> cluster.
> This can be easily fixed with the proposed fix 
> [here|https://github.com/apache/spark/pull/17333].
> The other option is to add the delegation tokens to the current user's UGI as 
> in [here|https://github.com/apache/spark/pull/17335]. The latter fixes the 
> problem but leads to a failure when someone uses a HadoopRDD, because HadoopRDD 
> uses FileInputFormat to get the splits, which in turn calls the local ticket 
> cache via TokenCache.obtainTokensForNamenodes. Eventually this will fail with:
> {quote}Exception in thread "main" 
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
> can be issued only with kerberos or web authentication
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)
> {quote}
> This implies that the security mode is SIMPLE and the Hadoop libs there are 
> not aware of Kerberos.
> This is related to this 
> [issue|https://issues.apache.org/jira/browse/MAPREDUCE-6876] and the 
> workaround decided was to 
> [trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
>  hadoop.
>  
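As a rough illustration of the kind of "trick" referenced above (this is only an 
assumption about what the linked SparkSubmit code boils down to, not a quote of it): 
the Hadoop configuration that the delegation-token code reads can be pre-populated so 
that TokenCache can resolve a renewer principal even when no YARN ResourceManager is 
configured, e.g. on Mesos.

{code:scala}
// Hedged sketch only; the property name and approach are assumptions, not a
// copy of the SparkSubmit code linked above.
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
val shortUserName = UserGroupInformation.getCurrentUser.getShortUserName

// spark.hadoop.* entries are copied into the Hadoop Configuration by Spark,
// so this effectively provides yarn.resourcemanager.principal to the Hadoop
// token-fetching code even though YARN is not actually in use.
sparkConf.set("spark.hadoop.yarn.resourcemanager.principal", shortUserName)
{code}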



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23790) proxy-user failed connecting to a kerberos configured metastore

2018-03-24 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-23790:

Description: 
This appeared at a customer trying to integrate with a kerberized hdfs cluster.

This can be easily fixed with the proposed fix 
[here|https://github.com/apache/spark/pull/17333].

The other option is to add the delegation tokens to the current user's UGI as 
in [here|https://github.com/apache/spark/pull/17335]. The latter fixes the 
problem but leads to a failure when someone uses a HadoopRDD, because HadoopRDD 
uses FileInputFormat to get the splits, which in turn calls the local ticket 
cache via TokenCache.obtainTokensForNamenodes. Eventually this will fail with:
{quote}Exception in thread "main" 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
can be issued only with kerberos or web authentication
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)
{quote}
This implies that the security mode is SIMPLE and the Hadoop libs there are not 
aware of Kerberos.

This is related to this issue where we had some issues in the past and the 
workaround decided was to 
[trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
 hadoop.

 

  was:
This appeared at a customer trying to integrate with a kerberized hdfs cluster.

This can be easily fixed with the proposed fix 
[here|https://github.com/apache/spark/pull/17333].

The other option is to add the delegation tokens to the current user's UGI as 
in [here|https://github.com/apache/spark/pull/17335] . The last fixes the 
problem but leads to a failure when someones uses a HadoopRDD because the 
latter, uses FileInputFormat to get the splits which calls the local ticket 
cache by using TokenCache.obtainTokensForNamenodes. Eventually this will fail 
with:
{quote}Exception in thread "main" 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
can be issued only with kerberos or web authenticationat 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)
{quote}
This implies that security mode is SIMPLE and hadoop libs there are not aware 
of kerberos.

This is related to this 
[issue|https://issues.apache.org/jira/browse/MAPREDUCE-6876] where we had some 
issues in the past and the workaround decided is to 
[trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
 hadoop.

 


> proxy-user failed connecting to a kerberos configured metastore
> ---
>
> Key: SPARK-23790
> URL: https://issues.apache.org/jira/browse/SPARK-23790
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> This appeared at a customer trying to integrate with a kerberized hdfs 
> cluster.
> This can be easily fixed with the proposed fix 
> [here|https://github.com/apache/spark/pull/17333].
> The other option is to add the delegation tokens to the current user's UGI as 
> in [here|https://github.com/apache/spark/pull/17335]. The latter fixes the 
> problem but leads to a failure when someone uses a HadoopRDD, because HadoopRDD 
> uses FileInputFormat to get the splits, which in turn calls the local ticket 
> cache via TokenCache.obtainTokensForNamenodes. Eventually this will fail with:
> {quote}Exception in thread "main" 
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
> can be issued only with kerberos or web authentication
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)
> {quote}
> This implies that the security mode is SIMPLE and the Hadoop libs there are 
> not aware of Kerberos.
> This is related to this issue where we had some issues in the past and the 
> workaround decided was to 
> [trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
>  hadoop.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23790) proxy-user failed connecting to a kerberos configured metastore

2018-03-24 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-23790:

Description: 
This appeared at a customer trying to integrate with a kerberized hdfs cluster.

This can be easily fixed with the proposed fix 
[here|https://github.com/apache/spark/pull/17333].

The other option is to add the delegation tokens to the current user's UGI as 
in [here|https://github.com/apache/spark/pull/17335]. The latter fixes the 
problem but leads to a failure when someone uses a HadoopRDD, because HadoopRDD 
uses FileInputFormat to get the splits, which in turn calls the local ticket 
cache via TokenCache.obtainTokensForNamenodes. Eventually this will fail with:
{quote}Exception in thread "main" 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
can be issued only with kerberos or web authentication
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)
{quote}
This implies that the security mode is SIMPLE and the Hadoop libs there are not 
aware of Kerberos.

This is related to this 
[issue|https://issues.apache.org/jira/browse/MAPREDUCE-6876] where we had some 
issues in the past and the workaround decided is to 
[trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
 hadoop.

 

  was:
This appeared at a customer trying to integrate with a kerberized hdfs cluster.

This can be easily fixed with the proposed fix 
[here|https://github.com/apache/spark/pull/17333].

The other option is to add the delegation tokens to the current user's UGI as 
in [here|https://github.com/apache/spark/pull/17335] . The last fixes the 
problem but leads to a failure when someones uses a HadoopRDD because the 
latter, uses FileInputFormat to get the splits which calls the local ticket 
cache by using TokenCache.obtainTokensForNamenodes. Eventually this will fail 
with:
{quote}Exception in thread "main" 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
can be issued only with kerberos or web authenticationat 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)
{quote}
This implies that security mode is SIMPLE and hadoop libs there are not aware 
of kerberos.

This is related to this [ 
issue|https://issues.apache.org/jira/browse/MAPREDUCE-6876]  where we had some 
issues in the past and the workaround decided is to 
[trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
 hadoop.

 


> proxy-user failed connecting to a kerberos configured metastore
> ---
>
> Key: SPARK-23790
> URL: https://issues.apache.org/jira/browse/SPARK-23790
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> This appeared at a customer trying to integrate with a kerberized hdfs 
> cluster.
> This can be easily fixed with the proposed fix 
> [here|https://github.com/apache/spark/pull/17333].
> The other option is to add the delegation tokens to the current user's UGI as 
> in [here|https://github.com/apache/spark/pull/17335]. The latter fixes the 
> problem but leads to a failure when someone uses a HadoopRDD, because HadoopRDD 
> uses FileInputFormat to get the splits, which in turn calls the local ticket 
> cache via TokenCache.obtainTokensForNamenodes. Eventually this will fail with:
> {quote}Exception in thread "main" 
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
> can be issued only with kerberos or web authentication
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)
> {quote}
> This implies that the security mode is SIMPLE and the Hadoop libs there are 
> not aware of Kerberos.
> This is related to this 
> [issue|https://issues.apache.org/jira/browse/MAPREDUCE-6876] where we had 
> some issues in the past and the workaround decided is to 
> [trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
>  hadoop.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23790) proxy-user failed connecting to a kerberos configured metastore

2018-03-24 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-23790:

Description: 
This appeared at a customer trying to integrate with a kerberized hdfs cluster.

This can be easily fixed with the proposed fix 
[here|https://github.com/apache/spark/pull/17333].

The other option is to add the delegation tokens to the current user's UGI as 
in [here|https://github.com/apache/spark/pull/17335]. The latter fixes the 
problem but leads to a failure when someone uses a HadoopRDD, because HadoopRDD 
uses FileInputFormat to get the splits, which in turn calls the local ticket 
cache via TokenCache.obtainTokensForNamenodes. Eventually this will fail with:
{quote}Exception in thread "main" 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
can be issued only with kerberos or web authentication
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)
{quote}
This implies that the security mode is SIMPLE and the Hadoop libs there are not 
aware of Kerberos.

This is related to this issue and the workaround decided was to 
[trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
 hadoop.

 

  was:
This appeared at a customer trying to integrate with a kerberized hdfs cluster.

This can be easily fixed with the proposed fix 
[here|https://github.com/apache/spark/pull/17333].

The other option is to add the delegation tokens to the current user's UGI as 
in [here|https://github.com/apache/spark/pull/17335] . The last fixes the 
problem but leads to a failure when someones uses a HadoopRDD because the 
latter, uses FileInputFormat to get the splits which calls the local ticket 
cache by using TokenCache.obtainTokensForNamenodes. Eventually this will fail 
with:
{quote}Exception in thread "main" 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
can be issued only with kerberos or web authenticationat 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)
{quote}
This implies that security mode is SIMPLE and hadoop libs there are not aware 
of kerberos.

This is related to this issue where we had some issues in the past and the 
workaround decided was to 
[trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
 hadoop.

 


> proxy-user failed connecting to a kerberos configured metastore
> ---
>
> Key: SPARK-23790
> URL: https://issues.apache.org/jira/browse/SPARK-23790
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> This appeared at a customer trying to integrate with a kerberized hdfs 
> cluster.
> This can be easily fixed with the proposed fix 
> [here|https://github.com/apache/spark/pull/17333].
> The other option is to add the delegation tokens to the current user's UGI as 
> in [here|https://github.com/apache/spark/pull/17335]. The latter fixes the 
> problem but leads to a failure when someone uses a HadoopRDD, because HadoopRDD 
> uses FileInputFormat to get the splits, which in turn calls the local ticket 
> cache via TokenCache.obtainTokensForNamenodes. Eventually this will fail with:
> {quote}Exception in thread "main" 
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
> can be issued only with kerberos or web authentication
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)
> {quote}
> This implies that the security mode is SIMPLE and the Hadoop libs there are 
> not aware of Kerberos.
> This is related to this issue and the workaround decided was to 
> [trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
>  hadoop.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23790) proxy-user failed connecting to a kerberos configured metastore

2018-03-24 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-23790:

Description: 
This appeared at a customer trying to integrate with a kerberized hdfs cluster.

This can be easily fixed with the proposed fix 
[here|https://github.com/apache/spark/pull/17333].

The other option is to add the delegation tokens to the current user's UGI as 
in [here|https://github.com/apache/spark/pull/17335]. The latter fixes the 
problem but leads to a failure when someone uses a HadoopRDD, because HadoopRDD 
uses FileInputFormat to get the splits, which in turn calls the local ticket 
cache via TokenCache.obtainTokensForNamenodes. Eventually this will fail with:
{quote}Exception in thread "main" 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
can be issued only with kerberos or web authentication
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)
{quote}
This implies that the security mode is SIMPLE and the Hadoop libs there are not 
aware of Kerberos.

This is related to this 
[issue|https://issues.apache.org/jira/browse/MAPREDUCE-6876] where we had some 
issues in the past, and the workaround decided is to 
[trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
 hadoop.

 

  was:
This appeared at a customer trying to integrate with a kerberized hdfs cluster.

This can be easily fixed with the proposed fix 
[here|https://github.com/apache/spark/pull/17333].

The other option is to add the delegation tokens to the current user's UGI as 
in [here|https://github.com/apache/spark/pull/17335] . The last fixes the 
problem but leads to a failure when someones uses a HadoopRDD because the 
latter, uses FileInputFormat to get the splits which calls the local ticket 
cache by using TokenCache.obtainTokensForNamenodes. Eventually this will fail 
with:
{quote}Exception in thread "main" 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
can be issued only with kerberos or web authenticationat 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)
{quote}
This implies that security mode is SIMPLE and hadoop libs there are not aware 
of kerberos.

This is related to this[ 
issue|https://issues.apache.org/jira/browse/MAPREDUCE-6876]  where we had some 
issues in the past and the workaround decided is to 
[trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
 hadoop.

 


> proxy-user failed connecting to a kerberos configured metastore
> ---
>
> Key: SPARK-23790
> URL: https://issues.apache.org/jira/browse/SPARK-23790
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> This appeared at a customer trying to integrate with a kerberized hdfs 
> cluster.
> This can be easily fixed with the proposed fix 
> [here|https://github.com/apache/spark/pull/17333].
> The other option is to add the delegation tokens to the current user's UGI as 
> in [here|https://github.com/apache/spark/pull/17335]. The latter fixes the 
> problem but leads to a failure when someone uses a HadoopRDD, because HadoopRDD 
> uses FileInputFormat to get the splits, which in turn calls the local ticket 
> cache via TokenCache.obtainTokensForNamenodes. Eventually this will fail with:
> {quote}Exception in thread "main" 
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
> can be issued only with kerberos or web authentication
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)
> {quote}
> This implies that the security mode is SIMPLE and the Hadoop libs there are 
> not aware of Kerberos.
> This is related to this 
> [issue|https://issues.apache.org/jira/browse/MAPREDUCE-6876] where we had some 
> issues in the past, and the workaround decided is to 
> [trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
>  hadoop.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23790) proxy-user failed connecting to a kerberos configured metastore

2018-03-24 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-23790:

Description: 
This appeared at a customer trying to integrate with a kerberized hdfs cluster.

This can be easily fixed with the proposed fix 
[here|https://github.com/apache/spark/pull/17333].

The other option is to add the delegation tokens to the current user's UGI as 
in [here|https://github.com/apache/spark/pull/17335]. The latter fixes the 
problem but leads to a failure when someone uses a HadoopRDD, because HadoopRDD 
uses FileInputFormat to get the splits, which in turn calls the local ticket 
cache via TokenCache.obtainTokensForNamenodes. Eventually this will fail with:
{quote}Exception in thread "main" 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
can be issued only with kerberos or web authentication
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)
{quote}
This implies that the security mode is SIMPLE and the Hadoop libs there are not 
aware of Kerberos.

This is related to this 
[issue|https://issues.apache.org/jira/browse/MAPREDUCE-6876] where we had some 
issues in the past, and the workaround decided is to 
[trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
 hadoop.

 

  was:
This appeared at a customer trying to integrate with a kerberized hdfs cluster.

This can be easily fixed with the proposed fix 
[here|https://github.com/apache/spark/pull/17333].

The other option is to add the delegation tokens to the current user's UGI as 
in [here|https://github.com/apache/spark/pull/17335] . The last fixes the 
problem but leads to a failure when someones uses a HadoopRDD because the 
latter, uses FileInputFormat to get the splits which calls the local ticket 
cache by using TokenCache.obtainTokensForNamenodes. Eventually this will fail 
with:

Exception in thread "main" 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
can be issued only with kerberos or web authentication
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)

This implies that security mode is SIMPLE and hadoop libs there are not aware 
of kerberos.

This is related to this[ 
issue|https://issues.apache.org/jira/browse/MAPREDUCE-6876]  where we had some 
issues in the past and the workaround decided is to 
[trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
 hadoop.

 


> proxy-user failed connecting to a kerberos configured metastore
> ---
>
> Key: SPARK-23790
> URL: https://issues.apache.org/jira/browse/SPARK-23790
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> This appeared at a customer trying to integrate with a kerberized hdfs 
> cluster.
> This can be easily fixed with the proposed fix 
> [here|https://github.com/apache/spark/pull/17333].
> The other option is to add the delegation tokens to the current user's UGI as 
> in [here|https://github.com/apache/spark/pull/17335]. The latter fixes the 
> problem but leads to a failure when someone uses a HadoopRDD, because HadoopRDD 
> uses FileInputFormat to get the splits, which in turn calls the local ticket 
> cache via TokenCache.obtainTokensForNamenodes. Eventually this will fail with:
> {quote}Exception in thread "main" 
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
> can be issued only with kerberos or web authentication
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)
> {quote}
> This implies that the security mode is SIMPLE and the Hadoop libs there are 
> not aware of Kerberos.
> This is related to this 
> [issue|https://issues.apache.org/jira/browse/MAPREDUCE-6876] where we had some 
> issues in the past, and the workaround decided is to 
> [trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
>  hadoop.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23790) proxy-user failed connecting to a kerberos configured metastore

2018-03-24 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-23790:

Description: 
This appeared at a customer trying to integrate with a kerberized hdfs cluster.

This can be easily fixed with the proposed fix 
[here|https://github.com/apache/spark/pull/17333].

The other option is to add the delegation tokens to the current user's UGI as 
in [here|https://github.com/apache/spark/pull/17335]. The latter fixes the 
problem but leads to a failure when someone uses a HadoopRDD, because HadoopRDD 
uses FileInputFormat to get the splits, which in turn calls the local ticket 
cache via TokenCache.obtainTokensForNamenodes. Eventually this will fail with:

Exception in thread "main" 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
can be issued only with kerberos or web authentication
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)

This implies that security mode is SIMPLE and hadoop libs there are not aware 
of kerberos.

This is related to this 
[issue|https://issues.apache.org/jira/browse/MAPREDUCE-6876] where we had some 
issues in the past and the workaround decided is to 
[trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
 hadoop.

 

  was:
This appeared at a customer trying to integrate with a kerberized hdfs cluster.

This is easily fixed with the proposed fix 
[here|https://github.com/apache/spark/pull/17333].

The other option is to add the delegation tokens to the current user's UGI as 
in [here|https://github.com/apache/spark/pull/17335] . The last fixes the 
problem but leads to a failure when someones uses a HadoopRDD because the 
latter, uses FileInputFormat to get the splits which calls the local ticket 
cache TokenCache.obtainTokensForNamenodes. Eventually this will fail with:

Exception in thread "main" 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
can be issued only with kerberos or web authentication
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)

This implies that security mode is SIMPLE and hadoop libs there are not aware 
of kerberos.

This is related to this[ 
issue|https://issues.apache.org/jira/browse/MAPREDUCE-6876]  where we had some 
issues in the past and the workaround decided is to 
[trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
 hadoop.

 


> proxy-user failed connecting to a kerberos configured metastore
> ---
>
> Key: SPARK-23790
> URL: https://issues.apache.org/jira/browse/SPARK-23790
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> This appeared at a customer trying to integrate with a kerberized hdfs 
> cluster.
> This can be easily fixed with the proposed fix 
> [here|https://github.com/apache/spark/pull/17333].
> The other option is to add the delegation tokens to the current user's UGI as 
> in [here|https://github.com/apache/spark/pull/17335]. The latter fixes the 
> problem but leads to a failure when someone uses a HadoopRDD, because HadoopRDD 
> uses FileInputFormat to get the splits, which in turn calls the local ticket 
> cache via TokenCache.obtainTokensForNamenodes. Eventually this will fail with:
> Exception in thread "main" 
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
> can be issued only with kerberos or web authentication
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)
> This implies that security mode is SIMPLE and hadoop libs there are not aware 
> of kerberos.
> This is related to this 
> [issue|https://issues.apache.org/jira/browse/MAPREDUCE-6876] where we had 
> some issues in the past and the workaround decided is to 
> [trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
>  hadoop.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23790) proxy-user failed connecting to a kerberos configured metastore

2018-03-24 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-23790:

Description: 
This appeared at a customer trying to integrate with a kerberized hdfs cluster.

This is easily fixed with the proposed fix 
[here|https://github.com/apache/spark/pull/17333].

The other option is to add the delegation tokens to the current user's UGI as 
in [here|https://github.com/apache/spark/pull/17335]. The latter fixes the 
problem but leads to a failure when someone uses a HadoopRDD, because HadoopRDD 
uses FileInputFormat to get the splits, which in turn calls the local ticket 
cache via TokenCache.obtainTokensForNamenodes. Eventually this will fail with:

Exception in thread "main" 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
can be issued only with kerberos or web authentication
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)

This implies that security mode is SIMPLE and hadoop libs there are not aware 
of kerberos.

This is related to this 
[issue|https://issues.apache.org/jira/browse/MAPREDUCE-6876] where we had some 
issues in the past and the workaround decided is to 
[trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
 hadoop.

 

  was:
This appeared at a customer trying to integrate with a kerberized hdfs cluster.

This is easily fixed with the proposed fix here: 
[https://github.com/apache/spark/pull/17333]

The other option is to add the delegation tokens to the current user's UGI as 
in here: [https://github.com/apache/spark/pull/17335]. The last fixes the 
problem but leads to a failure when someones uses a HadoopRDD because the 
latter, uses FileInputFormat to get the splits which calls the local ticket 
cache TokenCache.obtainTokensForNamenodes. Eventually this will fail with:

Exception in thread "main" 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
can be issued only with kerberos or web authentication
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)

This implies that security mode is SIMPLE and hadoop libs there are not aware 
of kerberos.

This is related to this issue: 
https://issues.apache.org/jira/browse/MAPREDUCE-6876

 


> proxy-user failed connecting to a kerberos configured metastore
> ---
>
> Key: SPARK-23790
> URL: https://issues.apache.org/jira/browse/SPARK-23790
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> This appeared at a customer trying to integrate with a kerberized hdfs 
> cluster.
> This is easily fixed with the proposed fix 
> [here|https://github.com/apache/spark/pull/17333].
> The other option is to add the delegation tokens to the current user's UGI as 
> in [here|https://github.com/apache/spark/pull/17335]. The latter fixes the 
> problem but leads to a failure when someone uses a HadoopRDD, because HadoopRDD 
> uses FileInputFormat to get the splits, which in turn calls the local ticket 
> cache via TokenCache.obtainTokensForNamenodes. Eventually this will fail with:
> Exception in thread "main" 
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
> can be issued only with kerberos or web authentication
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)
> This implies that security mode is SIMPLE and hadoop libs there are not aware 
> of kerberos.
> This is related to this 
> [issue|https://issues.apache.org/jira/browse/MAPREDUCE-6876] where we had 
> some issues in the past and the workaround decided is to 
> [trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
>  hadoop.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23790) proxy-user failed connecting to a kerberos configured metastore

2018-03-24 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-23790:

Description: 
This appeared at a customer trying to integrate with a kerberized hdfs cluster.

This can be easily fixed with the proposed fix 
[here|https://github.com/apache/spark/pull/17333].

The other option is to add the delegation tokens to the current user's UGI as 
in [here|https://github.com/apache/spark/pull/17335]. The latter fixes the 
problem but leads to a failure when someone uses a HadoopRDD, because HadoopRDD 
uses FileInputFormat to get the splits, which in turn calls the local ticket 
cache via TokenCache.obtainTokensForNamenodes. Eventually this will fail 
with:

Exception in thread "main" 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
can be issued only with kerberos or web authentication
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)

This implies that security mode is SIMPLE and hadoop libs there are not aware 
of kerberos.

This is related to this 
[issue|https://issues.apache.org/jira/browse/MAPREDUCE-6876] where we had some 
issues in the past and the workaround decided is to 
[trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
 hadoop.

 

  was:
This appeared at a customer trying to integrate with a kerberized hdfs cluster.

This can be easily fixed with the proposed fix 
[here|https://github.com/apache/spark/pull/17333].

The other option is to add the delegation tokens to the current user's UGI as 
in [here|https://github.com/apache/spark/pull/17335] . The last fixes the 
problem but leads to a failure when someones uses a HadoopRDD because the 
latter, uses FileInputFormat to get the splits which calls the local ticket 
cache TokenCache.obtainTokensForNamenodes. Eventually this will fail with:

Exception in thread "main" 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
can be issued only with kerberos or web authentication
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)

This implies that security mode is SIMPLE and hadoop libs there are not aware 
of kerberos.

This is related to this[ 
issue|https://issues.apache.org/jira/browse/MAPREDUCE-6876]  where we had some 
issues in the past and the workaround decided is to 
[trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
 hadoop.

 


> proxy-user failed connecting to a kerberos configured metastore
> ---
>
> Key: SPARK-23790
> URL: https://issues.apache.org/jira/browse/SPARK-23790
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> This appeared at a customer trying to integrate with a kerberized hdfs 
> cluster.
> This can be easily fixed with the proposed fix 
> [here|https://github.com/apache/spark/pull/17333].
> The other option is to add the delegation tokens to the current user's UGI as 
> in [here|https://github.com/apache/spark/pull/17335]. The latter fixes the 
> problem but leads to a failure when someone uses a HadoopRDD, because HadoopRDD 
> uses FileInputFormat to get the splits, which in turn calls the local ticket 
> cache via TokenCache.obtainTokensForNamenodes. Eventually this will fail 
> with:
> Exception in thread "main" 
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
> can be issued only with kerberos or web authentication
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)
> This implies that security mode is SIMPLE and hadoop libs there are not aware 
> of kerberos.
> This is related to this 
> [issue|https://issues.apache.org/jira/browse/MAPREDUCE-6876] where we had 
> some issues in the past and the workaround decided is to 
> [trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
>  hadoop.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23790) proxy-user failed connecting to a kerberos configured metastore

2018-03-24 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-23790:

Description: 
This appeared at a customer trying to integrate with a kerberized hdfs cluster.

This is easily fixed with the proposed fix here: 
[https://github.com/apache/spark/pull/17333]

The other option is to add the delegation tokens to the current user's UGI as 
in here: [https://github.com/apache/spark/pull/17335]. The latter fixes the 
problem but leads to a failure when someone uses a HadoopRDD, because HadoopRDD 
uses FileInputFormat to get the splits, which in turn calls the local ticket 
cache via TokenCache.obtainTokensForNamenodes. Eventually this will fail with:

Exception in thread "main" 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
can be issued only with kerberos or web authentication
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)

This implies that security mode is SIMPLE and hadoop libs there are not aware 
of kerberos.

This is related to this issue: 
https://issues.apache.org/jira/browse/MAPREDUCE-6876

 

  was:
This appeared at a customer trying to integrate with a kerberized hdfs cluster.

This is easily fixed with the proposed fix here: 
[https://github.com/apache/spark/pull/17333]

The other option is to add the delegation tokens to the current user's UGI as 
in here: [https://github.com/apache/spark/pull/17335]. The last fixes the 
problem but leads to a failure when someones uses a HadoopRDD because the 
latter FileInputFormat to get the splits which call the local ticket cache 
TokenCache.obtainTokensForNamenodes which will fail with:

Exception in thread "main" 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
can be issued only with kerberos or web authentication
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)

This implies that security mode is SIMPLE and hadoop libs there are not aware 
of kerberos.

This is related to this issue: 
https://issues.apache.org/jira/browse/MAPREDUCE-6876

 


> proxy-user failed connecting to a kerberos configured metastore
> ---
>
> Key: SPARK-23790
> URL: https://issues.apache.org/jira/browse/SPARK-23790
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> This appeared at a customer trying to integrate with a kerberized hdfs 
> cluster.
> This is easily fixed with the proposed fix here: 
> [https://github.com/apache/spark/pull/17333]
> The other option is to add the delegation tokens to the current user's UGI as 
> in here: [https://github.com/apache/spark/pull/17335]. The latter fixes the 
> problem but leads to a failure when someone uses a HadoopRDD, because HadoopRDD 
> uses FileInputFormat to get the splits, which in turn calls the local ticket 
> cache via TokenCache.obtainTokensForNamenodes. Eventually this will fail with:
> Exception in thread "main" 
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
> can be issued only with kerberos or web authentication
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)
> This implies that security mode is SIMPLE and hadoop libs there are not aware 
> of kerberos.
> This is related to this issue: 
> https://issues.apache.org/jira/browse/MAPREDUCE-6876
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23790) proxy-user failed connecting to a kerberos configured metastore

2018-03-24 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos created SPARK-23790:
---

 Summary: proxy-user failed connecting to a kerberos configured 
metastore
 Key: SPARK-23790
 URL: https://issues.apache.org/jira/browse/SPARK-23790
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 2.3.0
Reporter: Stavros Kontopoulos


This appeared at a customer trying to integrate with a kerberized hdfs cluster.

This is easily fixed with the proposed fix here: 
[https://github.com/apache/spark/pull/17333]

The other option is to add the delegation tokens to the current user's UGI as 
in here: [https://github.com/apache/spark/pull/17335]. The latter fixes the 
problem but leads to a failure when someone uses a HadoopRDD, because HadoopRDD 
uses FileInputFormat to get the splits, which calls the local ticket cache via 
TokenCache.obtainTokensForNamenodes and will fail with:

Exception in thread "main" 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
can be issued only with kerberos or web authentication
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)

This implies that security mode is SIMPLE and hadoop libs there are not aware 
of kerberos.

This is related to this issue: 
https://issues.apache.org/jira/browse/MAPREDUCE-6876

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23782) SHS should not show applications to user without read permission

2018-03-24 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412510#comment-16412510
 ] 

Marco Gaido commented on SPARK-23782:
-

[~vanzin] thanks for the link. I see that in the discussion there were doubts 
about this, so the PR dropped this part in order to focus on the other aspects, 
but there was no strong opinion against it.

bq. What sensitive information is being exposed to users that should not see it?

Users can see which applications have been run by each user, when, how long 
they lasted, their names, and which applications other users are currently 
running (for instance, whether they are connected through a spark-shell). This 
is information that should not be shared with non-authorized people, and if the 
application names are meaningful a user can easily guess what the others are 
doing on the cluster.

Moreover, if you compare with how other systems work, they of course do not 
show non-admin users what others are doing. Our current situation is the same 
as if, in Oracle or Postgres, you were able to list the queries run by other 
users: of course each user can list only their own queries.

bq. Won't you get that same info if you go to the resource manager's page and 
look at what applications have run?

I am not sure how the RM UI works. If it lists all the applications to all 
users, even those who do not have the rights for them, that is a big security 
hole, since from there you can also retrieve the logs. I hope the RM has better 
security than this, but I am not an expert on it; if it doesn't, I do believe 
it should be fixed. Moreover, I think we should not focus Spark on a specific 
resource manager (YARN), since Spark can run in many modes other than YARN.

> SHS should not show applications to user without read permission
> 
>
> Key: SPARK-23782
> URL: https://issues.apache.org/jira/browse/SPARK-23782
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> The History Server shows all the applications to all the users, even though 
> they have no permission to read them. They cannot read the details of the 
> applications they cannot access, but still anybody can list all the 
> applications submitted by all users.
> For instance, if we have an admin user {{admin}} and two normal users {{u1}} 
> and {{u2}}, and each of them submitted one application, all of them can see 
> in the main page of SHS:
> ||App ID||App Name|| ... ||Spark User|| ... ||
> |app-123456789|The Admin App| .. |admin| ... |
> |app-123456790|u1 secret app| .. |u1| ... |
> |app-123456791|u2 secret app| .. |u2| ... |
> When clicking on an application, the proper permissions are applied, and each 
> user can see only the applications he/she has read permission for.
> Instead, each user should see only the applications he/she has permission to 
> read, and should not be able to see the applications he/she does not have 
> permission for.
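A minimal sketch of the intended behaviour (the type and helper names are hypothetical 
placeholders, not Spark's actual SHS API): the listing itself is filtered by the same 
read-ACL check that already protects the per-application pages.

{code:scala}
// Hypothetical sketch of filtering the SHS listing by read permission.
// `AppSummary` and `canRead` are illustrative placeholders, not Spark APIs.
case class AppSummary(id: String, name: String, sparkUser: String, viewAcls: Set[String])

def canRead(requestingUser: String, app: AppSummary, admins: Set[String]): Boolean =
  admins.contains(requestingUser) ||
    requestingUser == app.sparkUser ||
    app.viewAcls.contains(requestingUser)

def visibleApps(requestingUser: String,
                allApps: Seq[AppSummary],
                admins: Set[String]): Seq[AppSummary] =
  // Apply the same check used for the per-application pages to the listing,
  // so users only ever see applications they are allowed to read.
  allApps.filter(app => canRead(requestingUser, app, admins))
{code}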



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23789) Shouldn't set hive.metastore.uris before invoking HiveDelegationTokenProvider

2018-03-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23789:


Assignee: Apache Spark

> Shouldn't set hive.metastore.uris before invoking HiveDelegationTokenProvider
> -
>
> Key: SPARK-23789
> URL: https://issues.apache.org/jira/browse/SPARK-23789
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> {noformat}
> 18/03/23 23:33:35 WARN HiveConf: DEPRECATED: hive.metastore.ds.retry.* no 
> longer has any effect.  Use hive.hmshandler.retry.* instead
> 18/03/23 23:33:35 WARN HiveConf: HiveConf of name hive.metastore.local does 
> not exist
> 18/03/23 23:33:35 WARN HiveConf: HiveConf of name 
> hive.metastore.ds.retry.attempts does not exist
> 18/03/23 23:33:35 WARN HiveConf: HiveConf of name 
> hive.metastore.ds.retry.interval does not exist
> 18/03/23 23:33:35 WARN HiveConf: HiveConf of name 
> hive.server2.enable.impersonation does not exist
> 18/03/23 23:33:35 INFO metastore: Trying to connect to metastore with URI 
> thrift://metastore.com:9083
> 18/03/23 23:33:35 ERROR TSaslTransport: SASL negotiation failure
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
>   at 
> org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
>   at 
> org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
>   at 
> org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:420)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:236)
>   at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
>   at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:166)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:124)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
>   at 
> 

[jira] [Commented] (SPARK-23789) Shouldn't set hive.metastore.uris before invoking HiveDelegationTokenProvider

2018-03-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412470#comment-16412470
 ] 

Apache Spark commented on SPARK-23789:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/20898

> Shouldn't set hive.metastore.uris before invoking HiveDelegationTokenProvider
> -
>
> Key: SPARK-23789
> URL: https://issues.apache.org/jira/browse/SPARK-23789
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> 18/03/23 23:33:35 WARN HiveConf: DEPRECATED: hive.metastore.ds.retry.* no 
> longer has any effect.  Use hive.hmshandler.retry.* instead
> 18/03/23 23:33:35 WARN HiveConf: HiveConf of name hive.metastore.local does 
> not exist
> 18/03/23 23:33:35 WARN HiveConf: HiveConf of name 
> hive.metastore.ds.retry.attempts does not exist
> 18/03/23 23:33:35 WARN HiveConf: HiveConf of name 
> hive.metastore.ds.retry.interval does not exist
> 18/03/23 23:33:35 WARN HiveConf: HiveConf of name 
> hive.server2.enable.impersonation does not exist
> 18/03/23 23:33:35 INFO metastore: Trying to connect to metastore with URI 
> thrift://metastore.com:9083
> 18/03/23 23:33:35 ERROR TSaslTransport: SASL negotiation failure
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
>   at 
> org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
>   at 
> org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
>   at 
> org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:420)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:236)
>   at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
>   at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:166)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:124)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at 
> 

[jira] [Assigned] (SPARK-23789) Shouldn't set hive.metastore.uris before invoking HiveDelegationTokenProvider

2018-03-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23789:


Assignee: (was: Apache Spark)

> Shouldn't set hive.metastore.uris before invoking HiveDelegationTokenProvider
> -
>
> Key: SPARK-23789
> URL: https://issues.apache.org/jira/browse/SPARK-23789
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> 18/03/23 23:33:35 WARN HiveConf: DEPRECATED: hive.metastore.ds.retry.* no 
> longer has any effect.  Use hive.hmshandler.retry.* instead
> 18/03/23 23:33:35 WARN HiveConf: HiveConf of name hive.metastore.local does 
> not exist
> 18/03/23 23:33:35 WARN HiveConf: HiveConf of name 
> hive.metastore.ds.retry.attempts does not exist
> 18/03/23 23:33:35 WARN HiveConf: HiveConf of name 
> hive.metastore.ds.retry.interval does not exist
> 18/03/23 23:33:35 WARN HiveConf: HiveConf of name 
> hive.server2.enable.impersonation does not exist
> 18/03/23 23:33:35 INFO metastore: Trying to connect to metastore with URI 
> thrift://metastore.com:9083
> 18/03/23 23:33:35 ERROR TSaslTransport: SASL negotiation failure
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
>   at 
> org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
>   at 
> org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
>   at 
> org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:420)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:236)
>   at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
>   at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:166)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:124)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
>   at 
> 

[jira] [Created] (SPARK-23789) Shouldn't set hive.metastore.uris before invoking HiveDelegationTokenProvider

2018-03-24 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-23789:
---

 Summary: Shouldn't set hive.metastore.uris before invoking 
HiveDelegationTokenProvider
 Key: SPARK-23789
 URL: https://issues.apache.org/jira/browse/SPARK-23789
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0, 2.4.0
Reporter: Yuming Wang



{noformat}
18/03/23 23:33:35 WARN HiveConf: DEPRECATED: hive.metastore.ds.retry.* no 
longer has any effect.  Use hive.hmshandler.retry.* instead
18/03/23 23:33:35 WARN HiveConf: HiveConf of name hive.metastore.local does not 
exist
18/03/23 23:33:35 WARN HiveConf: HiveConf of name 
hive.metastore.ds.retry.attempts does not exist
18/03/23 23:33:35 WARN HiveConf: HiveConf of name 
hive.metastore.ds.retry.interval does not exist
18/03/23 23:33:35 WARN HiveConf: HiveConf of name 
hive.server2.enable.impersonation does not exist
18/03/23 23:33:35 INFO metastore: Trying to connect to metastore with URI 
thrift://metastore.com:9083
18/03/23 23:33:35 ERROR TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: 
No valid credentials provided (Mechanism level: Failed to find any Kerberos 
tgt)]
at 
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
at 
org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
at 
org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
at 
org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:420)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:236)
at 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
at 
org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
at 
org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
at 
org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:166)
at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:124)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: GSSException: No valid credentials provided (Mechanism level: Failed 
to find any Kerberos tgt)
at 

[jira] [Comment Edited] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-03-24 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412458#comment-16412458
 ] 

Felix Cheung edited comment on SPARK-23780 at 3/24/18 6:53 AM:
---

here

[https://github.com/mages/googleVis/blob/master/R/zzz.R#L39]

 or here

[https://github.com/jeroen/jsonlite/blob/master/R/toJSON.R#L2] 


was (Author: felixcheung):
here

[https://github.com/mages/googleVis/blob/master/R/zzz.R#L39]

 

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Ivan Dzikovsky
>Priority: Major
>
> I've tried to use the googleVis library with Spark 2.2.1 and ran into a problem.
> Steps to reproduce:
> # Install R with the googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Then I got the following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But the expected result is to get some HTML code output, as was the case with Spark 2.1.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-03-24 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412458#comment-16412458
 ] 

Felix Cheung commented on SPARK-23780:
--

here

[https://github.com/mages/googleVis/blob/master/R/zzz.R#L39]

 

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Ivan Dzikovsky
>Priority: Major
>
> I've tried to use the googleVis library with Spark 2.2.1 and ran into a problem.
> Steps to reproduce:
> # Install R with the googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Then I got the following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But the expected result is to get some HTML code output, as was the case with Spark 2.1.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-03-24 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412457#comment-16412457
 ] 

Felix Cheung commented on SPARK-23780:
--

hmm, I think the cause of this is the incompatibility of the method signature 
of toJSON
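
A minimal, self-contained R sketch of that S4 signature conflict (illustrative only: this is not the actual SparkR or googleVis code, and the class and argument names below are made up). When the 'toJSON' generic that is visible at load time is defined without '...', any setMethod() call that adds extra arguments fails with exactly the error reported in this ticket.

{code}
library(methods)

# A generic defined WITHOUT '...' (assumed here to stand in for whichever
# toJSON definition ends up masking jsonlite's when googleVis loads):
setGeneric("toJSON", function(x) standardGeneric("toJSON"))

# A package that then registers a method with extra arguments fails at load
# time; try() is used only so the sketch runs end to end:
try(setMethod("toJSON", "data.frame", function(x, pretty = FALSE) "{}"))
# Error in rematchDefinition(definition, fdef, mnames, fnames, signature) :
#   methods can add arguments to the generic 'toJSON' only if '...' is an
#   argument to the generic

# Redefining the generic WITH '...' lets the same setMethod() call succeed:
setGeneric("toJSON", function(x, ...) standardGeneric("toJSON"))
setMethod("toJSON", "data.frame", function(x, pretty = FALSE, ...) "{}")
{code}

Which package's 'toJSON' definition actually does the masking at load time is what the links above are meant to pin down; the sketch only shows why such a signature mismatch produces this particular error.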

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Ivan Dzikovsky
>Priority: Major
>
> I've tried to use the googleVis library with Spark 2.2.1 and ran into a problem.
> Steps to reproduce:
> # Install R with the googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Then I got the following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But the expected result is to get some HTML code output, as was the case with Spark 2.1.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org