[jira] [Commented] (SPARK-24432) Add support for dynamic resource allocation

2021-03-18 Thread Erik Erlandson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304397#comment-17304397
 ] 

Erik Erlandson commented on SPARK-24432:


[~dongjoon] should this be closed, now that Spark 3.1 is available (per 
[above|https://issues.apache.org/jira/browse/SPARK-24432?focusedCommentId=17224905=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17224905])?

> Add support for dynamic resource allocation
> ---
>
> Key: SPARK-24432
> URL: https://issues.apache.org/jira/browse/SPARK-24432
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Yinan Li
>Priority: Major
>
> This is an umbrella ticket for work on adding support for dynamic resource 
> allocation in Kubernetes mode. This requires a Kubernetes-specific 
> external shuffle service. The feature is available in our fork at 
> github.com/apache-spark-on-k8s/spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization

2020-07-07 Thread Erik Erlandson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson updated SPARK-32159:
---
Fix Version/s: 3.0.1

> New udaf(Aggregator) has an integration bug with UnresolvedMapObjects 
> serialization
> ---
>
> Key: SPARK-32159
> URL: https://issues.apache.org/jira/browse/SPARK-32159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Erik Erlandson
>Priority: Major
> Fix For: 3.0.1
>
>
> The new user-defined aggregator feature (SPARK-27296), based on calling 
> 'functions.udaf(aggregator)', works fine when the aggregator input type is 
> atomic, e.g. 'Aggregator[Double, _, _]'. However, if the input type is an 
> array, like 'Aggregator[Array[Double], _, _]', it trips over the following:
> {code:java}
> /**
>  * When constructing [[MapObjects]], the element type must be given, which may not be available
>  * before analysis. This class acts like a placeholder for [[MapObjects]], and will be replaced by
>  * [[MapObjects]] during analysis after the input data is resolved.
>  * Note that, ideally we should not serialize and send unresolved expressions to executors, but
>  * users may accidentally do this (e.g. mistakenly reference an encoder instance when implementing
>  * Aggregator). Here we mark `function` as transient because it may reference scala Type, which is
>  * not serializable. Then even if users mistakenly reference an unresolved expression and serialize it,
>  * it's just a performance issue (more network traffic), and will not fail.
>  */
> case class UnresolvedMapObjects(
>     @transient function: Expression => Expression,
>     child: Expression,
>     customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with Unevaluable {
>   override lazy val resolved = false
>
>   override def dataType: DataType = customCollectionCls.map(ObjectType.apply).getOrElse {
>     throw new UnsupportedOperationException("not resolved")
>   }
> }
> {code}
> *The '@transient' causes 'function' to be deserialized as 'null' on the 
> executors, which produces a null-pointer exception here, when it evaluates 
> 'function(loopVar)':*
> {code:java}
> object MapObjects {
>   def apply(
>       function: Expression => Expression,
>       inputData: Expression,
>       elementType: DataType,
>       elementNullable: Boolean = true,
>       customCollectionCls: Option[Class[_]] = None): MapObjects = {
>     val loopVar = LambdaVariable("MapObject", elementType, elementNullable)
>     MapObjects(loopVar, function(loopVar), inputData, customCollectionCls)
>   }
> }
> {code}
> *I believe it may be possible to just use 'loopVar' instead of 
> 'function(loopVar)' whenever 'function' is null, but I need a second opinion 
> from Catalyst developers on what a robust fix should be.*
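
For reference, here is a minimal sketch of the failing shape, assuming a 
hypothetical 'ArraySum' aggregator; registering an aggregator whose input type 
is an array through functions.udaf is what exercises the UnresolvedMapObjects 
path described above:

{code:java}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf
import org.apache.spark.sql.{Encoder, Encoders}

// Hypothetical aggregator with an array input type, the failing case above.
object ArraySum extends Aggregator[Array[Double], Double, Double] {
  def zero: Double = 0.0
  def reduce(b: Double, a: Array[Double]): Double = b + a.sum
  def merge(b1: Double, b2: Double): Double = b1 + b2
  def finish(b: Double): Double = b
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Registration succeeds; the NPE surfaces on executors when the column is evaluated.
val arraySum = udaf(ArraySum)
{code}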



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization

2020-07-02 Thread Erik Erlandson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson updated SPARK-32159:
---
Description: 
The new user-defined aggregator feature (SPARK-27296), based on calling 
'functions.udaf(aggregator)', works fine when the aggregator input type is 
atomic, e.g. 'Aggregator[Double, _, _]'. However, if the input type is an 
array, like 'Aggregator[Array[Double], _, _]', it trips over the following:

{code:java}
/**
 * When constructing [[MapObjects]], the element type must be given, which may not be available
 * before analysis. This class acts like a placeholder for [[MapObjects]], and will be replaced by
 * [[MapObjects]] during analysis after the input data is resolved.
 * Note that, ideally we should not serialize and send unresolved expressions to executors, but
 * users may accidentally do this (e.g. mistakenly reference an encoder instance when implementing
 * Aggregator). Here we mark `function` as transient because it may reference scala Type, which is
 * not serializable. Then even if users mistakenly reference an unresolved expression and serialize it,
 * it's just a performance issue (more network traffic), and will not fail.
 */
case class UnresolvedMapObjects(
    @transient function: Expression => Expression,
    child: Expression,
    customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with Unevaluable {
  override lazy val resolved = false

  override def dataType: DataType = customCollectionCls.map(ObjectType.apply).getOrElse {
    throw new UnsupportedOperationException("not resolved")
  }
}
{code}

*The '@transient' causes 'function' to be deserialized as 'null' on the 
executors, which produces a null-pointer exception here, when it evaluates 
'function(loopVar)':*

{code:java}
object MapObjects {
  def apply(
      function: Expression => Expression,
      inputData: Expression,
      elementType: DataType,
      elementNullable: Boolean = true,
      customCollectionCls: Option[Class[_]] = None): MapObjects = {
    val loopVar = LambdaVariable("MapObject", elementType, elementNullable)
    MapObjects(loopVar, function(loopVar), inputData, customCollectionCls)
  }
}
{code}

*I believe it may be possible to just use 'loopVar' instead of 
'function(loopVar)' whenever 'function' is null, but I need a second opinion 
from Catalyst developers on what a robust fix should be.*
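
As a concrete illustration of that suggestion, here is a hedged sketch (not 
the actual patch) of where such a guard could live, mirroring the 
MapObjects.apply snippet above:

{code:java}
object MapObjects {
  def apply(
      function: Expression => Expression,
      inputData: Expression,
      elementType: DataType,
      elementNullable: Boolean = true,
      customCollectionCls: Option[Class[_]] = None): MapObjects = {
    val loopVar = LambdaVariable("MapObject", elementType, elementNullable)
    // Sketch: if 'function' arrived as null (stripped by @transient during
    // serialization), fall back to the identity mapping over loopVar.
    val lambdaBody = if (function == null) loopVar else function(loopVar)
    MapObjects(loopVar, lambdaBody, inputData, customCollectionCls)
  }
}
{code}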

  was:
The new user-defined aggregator feature (SPARK-27296), based on calling 
'functions.udaf(aggregator)', works fine when the aggregator input type is 
atomic, e.g. 'Aggregator[Double, _, _]'. However, if the input type is an 
array, like 'Aggregator[Array[Double], _, _]', it trips over the following:

{code:java}
/**
 * When constructing [[MapObjects]], the element type must be given, which may not be available
 * before analysis. This class acts like a placeholder for [[MapObjects]], and will be replaced by
 * [[MapObjects]] during analysis after the input data is resolved.
 * Note that, ideally we should not serialize and send unresolved expressions to executors, but
 * users may accidentally do this (e.g. mistakenly reference an encoder instance when implementing
 * Aggregator). Here we mark `function` as transient because it may reference scala Type, which is
 * not serializable. Then even if users mistakenly reference an unresolved expression and serialize it,
 * it's just a performance issue (more network traffic), and will not fail.
 */
case class UnresolvedMapObjects(
    @transient function: Expression => Expression,
    child: Expression,
    customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with Unevaluable {
  override lazy val resolved = false

  override def dataType: DataType = customCollectionCls.map(ObjectType.apply).getOrElse {
    throw new UnsupportedOperationException("not resolved")
  }
}
{code}

The '@transient' causes 'function' to be deserialized as 'null' on the 
executors, which produces a null-pointer exception here, when it evaluates 
'function(loopVar)':

{code:java}
object MapObjects {
  def apply(
      function: Expression => Expression,
      inputData: Expression,
      elementType: DataType,
      elementNullable: Boolean = true,
      customCollectionCls: Option[Class[_]] = None): MapObjects = {
    val loopVar = LambdaVariable("MapObject", elementType, elementNullable)
    MapObjects(loopVar, function(loopVar), inputData, customCollectionCls)
  }
}
{code}

I believe it may be possible to just use 'loopVar' instead of 
'function(loopVar)' whenever 'function' is null, but I need a second opinion 
from Catalyst developers on what a robust fix should be.


> New udaf(Aggregator) has an integration bug with UnresolvedMapObjects 
> serialization
> ---
>
> Key: SPARK-32159
> URL: https://issues.apache.org/jira/browse/SPARK-32159
> Project: Spark

[jira] [Updated] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization

2020-07-02 Thread Erik Erlandson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson updated SPARK-32159:
---
Description: 
The new user-defined aggregator feature (SPARK-27296), based on calling 
'functions.udaf(aggregator)', works fine when the aggregator input type is 
atomic, e.g. 'Aggregator[Double, _, _]'. However, if the input type is an 
array, like 'Aggregator[Array[Double], _, _]', it trips over the following:

{code:java}
/**
 * When constructing [[MapObjects]], the element type must be given, which may not be available
 * before analysis. This class acts like a placeholder for [[MapObjects]], and will be replaced by
 * [[MapObjects]] during analysis after the input data is resolved.
 * Note that, ideally we should not serialize and send unresolved expressions to executors, but
 * users may accidentally do this (e.g. mistakenly reference an encoder instance when implementing
 * Aggregator). Here we mark `function` as transient because it may reference scala Type, which is
 * not serializable. Then even if users mistakenly reference an unresolved expression and serialize it,
 * it's just a performance issue (more network traffic), and will not fail.
 */
case class UnresolvedMapObjects(
    @transient function: Expression => Expression,
    child: Expression,
    customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with Unevaluable {
  override lazy val resolved = false

  override def dataType: DataType = customCollectionCls.map(ObjectType.apply).getOrElse {
    throw new UnsupportedOperationException("not resolved")
  }
}
{code}

*The '@transient' causes 'function' to be deserialized as 'null' on the 
executors, which produces a null-pointer exception here, when it evaluates 
'function(loopVar)':*

{code:java}
object MapObjects {
  def apply(
      function: Expression => Expression,
      inputData: Expression,
      elementType: DataType,
      elementNullable: Boolean = true,
      customCollectionCls: Option[Class[_]] = None): MapObjects = {
    val loopVar = LambdaVariable("MapObject", elementType, elementNullable)
    MapObjects(loopVar, function(loopVar), inputData, customCollectionCls)
  }
}
{code}

*I believe it may be possible to just use 'loopVar' instead of 
'function(loopVar)' whenever 'function' is null, but I need a second opinion 
from Catalyst developers on what a robust fix should be.*

  was:
The new user-defined aggregator feature (SPARK-27296), based on calling 
'functions.udaf(aggregator)', works fine when the aggregator input type is 
atomic, e.g. 'Aggregator[Double, _, _]'. However, if the input type is an 
array, like 'Aggregator[Array[Double], _, _]', it trips over the following:

{code:java}
/**
 * When constructing [[MapObjects]], the element type must be given, which may not be available
 * before analysis. This class acts like a placeholder for [[MapObjects]], and will be replaced by
 * [[MapObjects]] during analysis after the input data is resolved.
 * Note that, ideally we should not serialize and send unresolved expressions to executors, but
 * users may accidentally do this (e.g. mistakenly reference an encoder instance when implementing
 * Aggregator). Here we mark `function` as transient because it may reference scala Type, which is
 * not serializable. Then even if users mistakenly reference an unresolved expression and serialize it,
 * it's just a performance issue (more network traffic), and will not fail.
 */
case class UnresolvedMapObjects(
    @transient function: Expression => Expression,
    child: Expression,
    customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with Unevaluable {
  override lazy val resolved = false

  override def dataType: DataType = customCollectionCls.map(ObjectType.apply).getOrElse {
    throw new UnsupportedOperationException("not resolved")
  }
}
{code}

*The '@transient' causes 'function' to be deserialized as 'null' on the 
executors, which produces a null-pointer exception here, when it evaluates 
'function(loopVar)':*

{code:java}
object MapObjects {
  def apply(
      function: Expression => Expression,
      inputData: Expression,
      elementType: DataType,
      elementNullable: Boolean = true,
      customCollectionCls: Option[Class[_]] = None): MapObjects = {
    val loopVar = LambdaVariable("MapObject", elementType, elementNullable)
    MapObjects(loopVar, function(loopVar), inputData, customCollectionCls)
  }
}
{code}

*I believe it may be possible to just use 'loopVar' instead of 
'function(loopVar)' whenever 'function' is null, but I need a second opinion 
from Catalyst developers on what a robust fix should be.*


> New udaf(Aggregator) has an integration bug with UnresolvedMapObjects 
> serialization
> ---
>
> Key: SPARK-32159
> URL: https://issues.apache.org/jira/browse/SPARK-32159
> Project: Spark
>

[jira] [Commented] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization

2020-07-02 Thread Erik Erlandson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150490#comment-17150490
 ] 

Erik Erlandson commented on SPARK-32159:


cc [~cloud_fan]

> New udaf(Aggregator) has an integration bug with UnresolvedMapObjects 
> serialization
> ---
>
> Key: SPARK-32159
> URL: https://issues.apache.org/jira/browse/SPARK-32159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Erik Erlandson
>Priority: Major
>
> The new user-defined aggregator feature (SPARK-27296), based on calling 
> 'functions.udaf(aggregator)', works fine when the aggregator input type is 
> atomic, e.g. 'Aggregator[Double, _, _]'. However, if the input type is an 
> array, like 'Aggregator[Array[Double], _, _]', it trips over the following:
> {code:java}
> /**
>  * When constructing [[MapObjects]], the element type must be given, which may not be available
>  * before analysis. This class acts like a placeholder for [[MapObjects]], and will be replaced by
>  * [[MapObjects]] during analysis after the input data is resolved.
>  * Note that, ideally we should not serialize and send unresolved expressions to executors, but
>  * users may accidentally do this (e.g. mistakenly reference an encoder instance when implementing
>  * Aggregator). Here we mark `function` as transient because it may reference scala Type, which is
>  * not serializable. Then even if users mistakenly reference an unresolved expression and serialize it,
>  * it's just a performance issue (more network traffic), and will not fail.
>  */
> case class UnresolvedMapObjects(
>     @transient function: Expression => Expression,
>     child: Expression,
>     customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with Unevaluable {
>   override lazy val resolved = false
>
>   override def dataType: DataType = customCollectionCls.map(ObjectType.apply).getOrElse {
>     throw new UnsupportedOperationException("not resolved")
>   }
> }
> {code}
> The '@transient' causes 'function' to be deserialized as 'null' on the 
> executors, which produces a null-pointer exception here, when it evaluates 
> 'function(loopVar)':
> {code:java}
> object MapObjects {
>   def apply(
>       function: Expression => Expression,
>       inputData: Expression,
>       elementType: DataType,
>       elementNullable: Boolean = true,
>       customCollectionCls: Option[Class[_]] = None): MapObjects = {
>     val loopVar = LambdaVariable("MapObject", elementType, elementNullable)
>     MapObjects(loopVar, function(loopVar), inputData, customCollectionCls)
>   }
> }
> {code}
> I believe it may be possible to just use 'loopVar' instead of 
> 'function(loopVar)' whenever 'function' is null, but I need a second opinion 
> from Catalyst developers on what a robust fix should be.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization

2020-07-02 Thread Erik Erlandson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson updated SPARK-32159:
---
Description: 
The new user-defined aggregator feature (SPARK-27296), based on calling 
'functions.udaf(aggregator)', works fine when the aggregator input type is 
atomic, e.g. 'Aggregator[Double, _, _]'. However, if the input type is an 
array, like 'Aggregator[Array[Double], _, _]', it trips over the following:

{code:java}
/**
 * When constructing [[MapObjects]], the element type must be given, which may not be available
 * before analysis. This class acts like a placeholder for [[MapObjects]], and will be replaced by
 * [[MapObjects]] during analysis after the input data is resolved.
 * Note that, ideally we should not serialize and send unresolved expressions to executors, but
 * users may accidentally do this (e.g. mistakenly reference an encoder instance when implementing
 * Aggregator). Here we mark `function` as transient because it may reference scala Type, which is
 * not serializable. Then even if users mistakenly reference an unresolved expression and serialize it,
 * it's just a performance issue (more network traffic), and will not fail.
 */
case class UnresolvedMapObjects(
    @transient function: Expression => Expression,
    child: Expression,
    customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with Unevaluable {
  override lazy val resolved = false

  override def dataType: DataType = customCollectionCls.map(ObjectType.apply).getOrElse {
    throw new UnsupportedOperationException("not resolved")
  }
}
{code}

The '@transient' causes 'function' to be deserialized as 'null' on the 
executors, which produces a null-pointer exception here, when it evaluates 
'function(loopVar)':

{code:java}
object MapObjects {
  def apply(
      function: Expression => Expression,
      inputData: Expression,
      elementType: DataType,
      elementNullable: Boolean = true,
      customCollectionCls: Option[Class[_]] = None): MapObjects = {
    val loopVar = LambdaVariable("MapObject", elementType, elementNullable)
    MapObjects(loopVar, function(loopVar), inputData, customCollectionCls)
  }
}
{code}

I believe it may be possible to just use 'loopVar' instead of 
'function(loopVar)' whenever 'function' is null, but I need a second opinion 
from Catalyst developers on what a robust fix should be.

  was:
The new user-defined aggregator feature (SPARK-27296), based on calling 
'functions.udaf(aggregator)', works fine when the aggregator input type is 
atomic, e.g. 'Aggregator[Double, _, _]'. However, if the input type is an 
array, like 'Aggregator[Array[Double], _, _]', it trips over the following:

{code:java}
/**
 * When constructing [[MapObjects]], the element type must be given, which may not be available
 * before analysis. This class acts like a placeholder for [[MapObjects]], and will be replaced by
 * [[MapObjects]] during analysis after the input data is resolved.
 * Note that, ideally we should not serialize and send unresolved expressions to executors, but
 * users may accidentally do this (e.g. mistakenly reference an encoder instance when implementing
 * Aggregator). Here we mark `function` as transient because it may reference scala Type, which is
 * not serializable. Then even if users mistakenly reference an unresolved expression and serialize it,
 * it's just a performance issue (more network traffic), and will not fail.
 */
case class UnresolvedMapObjects(
    @transient function: Expression => Expression,
    child: Expression,
    customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with Unevaluable {
  override lazy val resolved = false

  override def dataType: DataType = customCollectionCls.map(ObjectType.apply).getOrElse {
    throw new UnsupportedOperationException("not resolved")
  }
}
{code}

The '@transient' causes 'function' to be deserialized as 'null' on the 
executors, which produces a null-pointer exception here, when it evaluates 
'function(loopVar)':

{code:java}
object MapObjects {
  def apply(
      function: Expression => Expression,
      inputData: Expression,
      elementType: DataType,
      elementNullable: Boolean = true,
      customCollectionCls: Option[Class[_]] = None): MapObjects = {
    val loopVar = LambdaVariable("MapObject", elementType, elementNullable)
    MapObjects(loopVar, function(loopVar), inputData, customCollectionCls)
  }
}
{code}

I believe it may be possible to just use 'loopVar' instead of 
'function(loopVar)' whenever 'function' is null, but I need a second opinion 
from Catalyst developers on what a robust fix should be.


> New udaf(Aggregator) has an integration bug with UnresolvedMapObjects 
> serialization
> ---
>
> Key: SPARK-32159
> URL: https://issues.apache.org/jira/browse/SPARK-32159
> Project: 

[jira] [Created] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization

2020-07-02 Thread Erik Erlandson (Jira)
Erik Erlandson created SPARK-32159:
--

 Summary: New udaf(Aggregator) has an integration bug with 
UnresolvedMapObjects serialization
 Key: SPARK-32159
 URL: https://issues.apache.org/jira/browse/SPARK-32159
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Erik Erlandson


The new user-defined aggregator feature (SPARK-27296), based on calling 
'functions.udaf(aggregator)', works fine when the aggregator input type is 
atomic, e.g. 'Aggregator[Double, _, _]'. However, if the input type is an 
array, like 'Aggregator[Array[Double], _, _]', it trips over the following:

{code:java}
/**
 * When constructing [[MapObjects]], the element type must be given, which may not be available
 * before analysis. This class acts like a placeholder for [[MapObjects]], and will be replaced by
 * [[MapObjects]] during analysis after the input data is resolved.
 * Note that, ideally we should not serialize and send unresolved expressions to executors, but
 * users may accidentally do this (e.g. mistakenly reference an encoder instance when implementing
 * Aggregator). Here we mark `function` as transient because it may reference scala Type, which is
 * not serializable. Then even if users mistakenly reference an unresolved expression and serialize it,
 * it's just a performance issue (more network traffic), and will not fail.
 */
case class UnresolvedMapObjects(
    @transient function: Expression => Expression,
    child: Expression,
    customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with Unevaluable {
  override lazy val resolved = false

  override def dataType: DataType = customCollectionCls.map(ObjectType.apply).getOrElse {
    throw new UnsupportedOperationException("not resolved")
  }
}
{code}

The '@transient' causes 'function' to be deserialized as 'null' on the 
executors, which produces a null-pointer exception here, when it evaluates 
'function(loopVar)':

{code:java}
object MapObjects {
  def apply(
      function: Expression => Expression,
      inputData: Expression,
      elementType: DataType,
      elementNullable: Boolean = true,
      customCollectionCls: Option[Class[_]] = None): MapObjects = {
    val loopVar = LambdaVariable("MapObject", elementType, elementNullable)
    MapObjects(loopVar, function(loopVar), inputData, customCollectionCls)
  }
}
{code}

I believe it may be possible to just use 'loopVar' instead of 
'function(loopVar)' whenever 'function' is null, but I need a second opinion 
from Catalyst developers on what a robust fix should be.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30520) Eliminate deprecation warnings for UserDefinedAggregateFunction

2020-06-15 Thread Erik Erlandson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136153#comment-17136153
 ] 

Erik Erlandson commented on SPARK-30520:


Starting in Spark 3.0, any custom aggregator that would have been implemented 
using UserDefinedAggregateFunction should now be implemented using Aggregator. 
To use a custom Aggregator with a dynamically typed DataFrame (aka 
Dataset[Row]), register it using org.apache.spark.sql.functions.udaf.
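
For illustration, a minimal sketch of that registration path, assuming a 
hypothetical 'MySum' aggregator, an active 'spark' session, and an 'events' 
table:

{code:java}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf
import org.apache.spark.sql.{Encoder, Encoders}

// Hypothetical Aggregator replacing a UserDefinedAggregateFunction-style sum.
object MySum extends Aggregator[Double, Double, Double] {
  def zero: Double = 0.0
  def reduce(b: Double, a: Double): Double = b + a
  def merge(b1: Double, b2: Double): Double = b1 + b2
  def finish(b: Double): Double = b
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Register for use from untyped DataFrames and SQL.
spark.udf.register("mysum", udaf(MySum))
spark.sql("SELECT mysum(value) FROM events")
{code}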

 

> Eliminate deprecation warnings for UserDefinedAggregateFunction
> ---
>
> Key: SPARK-30520
> URL: https://issues.apache.org/jira/browse/SPARK-30520
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> {code}
> /Users/maxim/proj/eliminate-expr-info-warnings/sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala
> Warning:Warning:line (718)class UserDefinedAggregateFunction in package 
> expressions is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now 
> be registered as a UDF via the functions.udaf(agg) method.
>   val udaf = 
> clazz.getConstructor().newInstance().asInstanceOf[UserDefinedAggregateFunction]
> Warning:Warning:line (719)method register in class UDFRegistration is 
> deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now be registered 
> as a UDF via the functions.udaf(agg) method.
>   register(name, udaf)
> /Users/maxim/proj/eliminate-expr-info-warnings/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/udaf.scala
> Warning:Warning:line (328)class UserDefinedAggregateFunction in package 
> expressions is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now 
> be registered as a UDF via the functions.udaf(agg) method.
> udaf: UserDefinedAggregateFunction,
> Warning:Warning:line (326)class UserDefinedAggregateFunction in package 
> expressions is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now 
> be registered as a UDF via the functions.udaf(agg) method.
> case class ScalaUDAF(
> /Users/maxim/proj/eliminate-expr-info-warnings/sql/core/src/test/scala/org/apache/spark/sql/DataFrameWindowFunctionsSuite.scala
> Warning:Warning:line (363)class UserDefinedAggregateFunction in package 
> expressions is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now 
> be registered as a UDF via the functions.udaf(agg) method.
> val udaf = new UserDefinedAggregateFunction {
> /Users/maxim/proj/eliminate-expr-info-warnings/sql/core/src/test/java/test/org/apache/spark/sql/MyDoubleSum.java
> Warning:Warning:line (25)java: 
> org.apache.spark.sql.expressions.UserDefinedAggregateFunction in 
> org.apache.spark.sql.expressions has been deprecated
> Warning:Warning:line (35)java: 
> org.apache.spark.sql.expressions.UserDefinedAggregateFunction in 
> org.apache.spark.sql.expressions has been deprecated
> /Users/maxim/proj/eliminate-expr-info-warnings/sql/core/src/test/java/test/org/apache/spark/sql/MyDoubleAvg.java
> Warning:Warning:line (25)java: 
> org.apache.spark.sql.expressions.UserDefinedAggregateFunction in 
> org.apache.spark.sql.expressions has been deprecated
> Warning:Warning:line (36)java: 
> org.apache.spark.sql.expressions.UserDefinedAggregateFunction in 
> org.apache.spark.sql.expressions has been deprecated
> /Users/maxim/proj/eliminate-expr-info-warnings/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala
> Warning:Warning:line (36)class UserDefinedAggregateFunction in package 
> expressions is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now 
> be registered as a UDF via the functions.udaf(agg) method.
> class ScalaAggregateFunction(schema: StructType) extends 
> UserDefinedAggregateFunction {
> Warning:Warning:line (73)class UserDefinedAggregateFunction in package 
> expressions is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now 
> be registered as a UDF via the functions.udaf(agg) method.
> class ScalaAggregateFunctionWithoutInputSchema extends 
> UserDefinedAggregateFunction {
> Warning:Warning:line (100)class UserDefinedAggregateFunction in package 
> expressions is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now 
> be registered as a UDF via the functions.udaf(agg) method.
> class LongProductSum extends UserDefinedAggregateFunction {
> Warning:Warning:line (189)method register in class UDFRegistration is 
> deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now be registered 
> as a UDF via the functions.udaf(agg) method.
> spark.udf.register("mydoublesum", new MyDoubleSum)
> Warning:Warning:line (190)method register in class UDFRegistration is 
> deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now be 

[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public

2020-05-28 Thread Erik Erlandson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119054#comment-17119054
 ] 

Erik Erlandson commented on SPARK-7768:
---

Are there any issues with making this kind of change on a minor release 
boundary?

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30424) Change ExpressionEncoder toRow method to return UnsafeRow

2020-01-14 Thread Erik Erlandson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015183#comment-17015183
 ] 

Erik Erlandson commented on SPARK-30424:


The main place this change causes a compile failure is in SparkSession:

 
{code:java}
def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame{code}
And the key RDD impacted is LogicalRDD.

What I'm wondering is whether it is appropriate to change the signature of the 
RDD in LogicalRDD from InternalRow to the more specific UnsafeRow. My intuition 
is no; however, it's also true that this is what's actually occurring under the 
hood currently, so I'm curious what the Catalyst maintainers think about it.
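
For context, a sketch of the typing under discussion; ExpressionEncoder is 
internal API, and this assumes the pre-change signature, where toRow is 
statically typed as InternalRow even though the value it produces is an 
UnsafeRow:

{code:java}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

val enc = ExpressionEncoder[(Int, String)]()
// Static type is InternalRow; the runtime value is an UnsafeRow.
val row: InternalRow = enc.toRow((1, "spark"))
{code}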

 

 

> Change ExpressionEncoder toRow method to return UnsafeRow
> -
>
> Key: SPARK-30424
> URL: https://issues.apache.org/jira/browse/SPARK-30424
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Erik Erlandson
>Priority: Minor
>
> [~wenchen] observed that the toRow() method on ExpressionEncoder can have its 
> return type specified as UnsafeRow. See discussion on 
> [https://github.com/apache/spark/pull/25024] 
>  
> Not a high priority but could be done for 3.0.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30424) Change ExpressionEncoder toRow method to return UnsafeRow

2020-01-05 Thread Erik Erlandson (Jira)
Erik Erlandson created SPARK-30424:
--

 Summary: Change ExpressionEncoder toRow method to return UnsafeRow
 Key: SPARK-30424
 URL: https://issues.apache.org/jira/browse/SPARK-30424
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Erik Erlandson


[~wenchen] observed that the toRow() method on ExpressionEncoder can have its 
return type specified as UnsafeRow. See discussion on 
[https://github.com/apache/spark/pull/25024] 

 

Not a high priority but could be done for 3.0.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30423) Deprecate UserDefinedAggregateFunction

2020-01-05 Thread Erik Erlandson (Jira)
Erik Erlandson created SPARK-30423:
--

 Summary: Deprecate UserDefinedAggregateFunction
 Key: SPARK-30423
 URL: https://issues.apache.org/jira/browse/SPARK-30423
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Erik Erlandson
Assignee: Erik Erlandson


Anticipating the merging of SPARK-27296, the legacy methodology for 
implementing custom user defined aggregators over untyped DataFrame based on 
UserDefinedAggregateFunction will be made obsolete. This class should be 
annotated as deprecated once the new capability is officially merged.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30422) deprecate UserDefinedAggregateFunction in favor of SPARK-27296

2020-01-05 Thread Erik Erlandson (Jira)
Erik Erlandson created SPARK-30422:
--

 Summary: deprecate UserDefinedAggregateFunction in favor of 
SPARK-27296
 Key: SPARK-30422
 URL: https://issues.apache.org/jira/browse/SPARK-30422
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Erik Erlandson






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29865) k8s executor pods all have different prefixes in client mode

2019-11-14 Thread Erik Erlandson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson reassigned SPARK-29865:
--

Assignee: Marcelo Masiero Vanzin

> k8s executor pods all have different prefixes in client mode
> 
>
> Key: SPARK-29865
> URL: https://issues.apache.org/jira/browse/SPARK-29865
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Masiero Vanzin
>Assignee: Marcelo Masiero Vanzin
>Priority: Minor
> Fix For: 3.0.0
>
>
> This works in cluster mode since the features set things up so that all 
> executor pods have the same name prefix.
> But in client mode features are not used, so each executor ends up with a 
> different name prefix, which makes debugging a little bit annoying.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29865) k8s executor pods all have different prefixes in client mode

2019-11-14 Thread Erik Erlandson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson resolved SPARK-29865.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26488
[https://github.com/apache/spark/pull/26488]

> k8s executor pods all have different prefixes in client mode
> 
>
> Key: SPARK-29865
> URL: https://issues.apache.org/jira/browse/SPARK-29865
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Masiero Vanzin
>Priority: Minor
> Fix For: 3.0.0
>
>
> This works in cluster mode since the features set things up so that all 
> executor pods have the same name prefix.
> But in client mode features are not used, so each executor ends up with a 
> different name prefix, which makes debugging a little bit annoying.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27296) Efficient User Defined Aggregators

2019-10-19 Thread Erik Erlandson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson updated SPARK-27296:
---
Fix Version/s: (was: 3.0.0)

> Efficient User Defined Aggregators 
> ---
>
> Key: SPARK-27296
> URL: https://issues.apache.org/jira/browse/SPARK-27296
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL, Structured Streaming
>Affects Versions: 2.3.3, 2.4.0, 3.0.0
>Reporter: Erik Erlandson
>Assignee: Erik Erlandson
>Priority: Major
>  Labels: performance, usability
>
> Spark's UDAFs appear to be serializing and de-serializing to/from the 
> MutableAggregationBuffer for each row.  This gist shows a small reproducing 
> UDAF and a spark shell session:
> [https://gist.github.com/erikerlandson/3c4d8c6345d1521d89e0d894a423046f]
> The UDAF and its companion UDT are designed to count the number of times 
> that ser/de is invoked for the aggregator.  The spark shell session 
> demonstrates that it is executing ser/de on every row of the data frame.
> Note, Spark's pre-defined aggregators do not have this problem, as they are 
> based on an internal aggregating trait that does the correct thing and only 
> calls ser/de at points such as partition boundaries, presenting final 
> results, etc.
> This is a major problem for UDAFs, as it means that every UDAF is doing a 
> massive amount of unnecessary work per row, including but not limited to Row 
> object allocations. For a more realistic UDAF having its own non trivial 
> internal structure it is obviously that much worse.
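
For readers unfamiliar with the API, a minimal UDAF of the affected kind (a 
plain sketch, not the instrumented counting UDAF from the gist); each per-row 
update() call goes through the row-backed aggregation buffer whose ser/de cost 
is at issue:

{code:java}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class SimpleSum extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("sum", DoubleType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = { buffer(0) = 0.0 }
  // Called once per input row; buffer state is encoded/decoded around each call,
  // which is the per-row overhead this issue reports.
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getDouble(0) + input.getDouble(0)
  }
  def merge(b1: MutableAggregationBuffer, b2: Row): Unit = {
    b1(0) = b1.getDouble(0) + b2.getDouble(0)
  }
  def evaluate(buffer: Row): Double = buffer.getDouble(0)
}
{code}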



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27296) Efficient User Defined Aggregators

2019-10-19 Thread Erik Erlandson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955216#comment-16955216
 ] 

Erik Erlandson commented on SPARK-27296:


This started with the goal of fixing the performance bug in UDAF, but it 
ultimately became a new variation on user-defined aggregation, so I'm no longer 
sure whether this Jira should be categorized as "bug" or "feature".

> Efficient User Defined Aggregators 
> ---
>
> Key: SPARK-27296
> URL: https://issues.apache.org/jira/browse/SPARK-27296
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL, Structured Streaming
>Affects Versions: 2.3.3, 2.4.0, 3.0.0
>Reporter: Erik Erlandson
>Assignee: Erik Erlandson
>Priority: Major
>  Labels: performance, usability
> Fix For: 3.0.0
>
>
> Spark's UDAFs appear to be serializing and de-serializing to/from the 
> MutableAggregationBuffer for each row.  This gist shows a small reproducing 
> UDAF and a spark shell session:
> [https://gist.github.com/erikerlandson/3c4d8c6345d1521d89e0d894a423046f]
> The UDAF and its companion UDT are designed to count the number of times 
> that ser/de is invoked for the aggregator.  The spark shell session 
> demonstrates that it is executing ser/de on every row of the data frame.
> Note, Spark's pre-defined aggregators do not have this problem, as they are 
> based on an internal aggregating trait that does the correct thing and only 
> calls ser/de at points such as partition boundaries, presenting final 
> results, etc.
> This is a major problem for UDAFs, as it means that every UDAF is doing a 
> massive amount of unnecessary work per row, including but not limited to Row 
> object allocations. For a more realistic UDAF having its own non trivial 
> internal structure it is obviously that much worse.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27296) Efficient User Defined Aggregators

2019-10-19 Thread Erik Erlandson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson updated SPARK-27296:
---
Fix Version/s: 3.0.0
  Summary: Efficient User Defined Aggregators   (was: User Defined 
Aggregating Functions (UDAFs) have a major efficiency problem)

> Efficient User Defined Aggregators 
> ---
>
> Key: SPARK-27296
> URL: https://issues.apache.org/jira/browse/SPARK-27296
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL, Structured Streaming
>Affects Versions: 2.3.3, 2.4.0, 3.0.0
>Reporter: Erik Erlandson
>Assignee: Erik Erlandson
>Priority: Major
>  Labels: performance, usability
> Fix For: 3.0.0
>
>
> Spark's UDAFs appear to be serializing and de-serializing to/from the 
> MutableAggregationBuffer for each row.  This gist shows a small reproducing 
> UDAF and a spark shell session:
> [https://gist.github.com/erikerlandson/3c4d8c6345d1521d89e0d894a423046f]
> The UDAF and its companion UDT are designed to count the number of times 
> that ser/de is invoked for the aggregator.  The spark shell session 
> demonstrates that it is executing ser/de on every row of the data frame.
> Note, Spark's pre-defined aggregators do not have this problem, as they are 
> based on an internal aggregating trait that does the correct thing and only 
> calls ser/de at points such as partition boundaries, presenting final 
> results, etc.
> This is a major problem for UDAFs, as it means that every UDAF is doing a 
> massive amount of unnecessary work per row, including but not limited to Row 
> object allocations. For a more realistic UDAF having its own non trivial 
> internal structure it is obviously that much worse.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.

2019-07-24 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892037#comment-16892037
 ] 

Erik Erlandson commented on SPARK-27812:


Agreed with [~skonto] that downgrading isn't a good option. We need to keep 
abreast of K8s (and the API) over time. Invoking sys.exit seems a bit 
heavy-handed in theory, but it's also better than just hanging, and I don't 
know how one would manage a controlled unwind of a deadlock.

> kubernetes client import non-daemon thread which block jvm exit.
> 
>
> Key: SPARK-27812
> URL: https://issues.apache.org/jira/browse/SPARK-27812
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: Henry Yu
>Priority: Major
>
> I tried spark-submit to K8s in cluster mode. The driver pod failed to exit 
> due to an OkHttp WebSocket non-daemon thread.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27296) User Defined Aggregating Functions (UDAFs) have a major efficiency problem

2019-07-06 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16879747#comment-16879747
 ] 

Erik Erlandson commented on SPARK-27296:


I wrote up my benchmarking results 
[here|https://github.com/apache/spark/pull/25024#issue-293548866]. For 
aggregators having a non-trivial serde cost, the performance improvement can be 
two orders of magnitude. For aggregators with simpler serde, the improvement is 
correspondingly smaller.

> User Defined Aggregating Functions (UDAFs) have a major efficiency problem
> --
>
> Key: SPARK-27296
> URL: https://issues.apache.org/jira/browse/SPARK-27296
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL, Structured Streaming
>Affects Versions: 2.3.3, 2.4.0, 3.0.0
>Reporter: Erik Erlandson
>Assignee: Erik Erlandson
>Priority: Major
>  Labels: performance, usability
>
> Spark's UDAFs appear to be serializing and de-serializing to/from the 
> MutableAggregationBuffer for each row.  This gist shows a small reproducing 
> UDAF and a spark shell session:
> [https://gist.github.com/erikerlandson/3c4d8c6345d1521d89e0d894a423046f]
> The UDAF and its companion UDT are designed to count the number of times 
> that ser/de is invoked for the aggregator.  The spark shell session 
> demonstrates that it is executing ser/de on every row of the data frame.
> Note, Spark's pre-defined aggregators do not have this problem, as they are 
> based on an internal aggregating trait that does the correct thing and only 
> calls ser/de at points such as partition boundaries, presenting final 
> results, etc.
> This is a major problem for UDAFs, as it means that every UDAF is doing a 
> massive amount of unnecessary work per row, including but not limited to Row 
> object allocations. For a more realistic UDAF having its own non trivial 
> internal structure it is obviously that much worse.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27296) User Defined Aggregating Functions (UDAFs) have a major efficiency problem

2019-07-05 Thread Erik Erlandson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson updated SPARK-27296:
---
Target Version/s: 3.0.0

> User Defined Aggregating Functions (UDAFs) have a major efficiency problem
> --
>
> Key: SPARK-27296
> URL: https://issues.apache.org/jira/browse/SPARK-27296
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL, Structured Streaming
>Affects Versions: 2.3.3, 2.4.0, 3.0.0
>Reporter: Erik Erlandson
>Assignee: Erik Erlandson
>Priority: Major
>  Labels: performance, usability
>
> Spark's UDAFs appear to be serializing and de-serializing to/from the 
> MutableAggregationBuffer for each row.  This gist shows a small reproducing 
> UDAF and a spark shell session:
> [https://gist.github.com/erikerlandson/3c4d8c6345d1521d89e0d894a423046f]
> The UDAF and its companion UDT are designed to count the number of times 
> that ser/de is invoked for the aggregator.  The spark shell session 
> demonstrates that it is executing ser/de on every row of the data frame.
> Note, Spark's pre-defined aggregators do not have this problem, as they are 
> based on an internal aggregating trait that does the correct thing and only 
> calls ser/de at points such as partition boundaries, presenting final 
> results, etc.
> This is a major problem for UDAFs, as it means that every UDAF is doing a 
> massive amount of unnecessary work per row, including but not limited to Row 
> object allocations. For a more realistic UDAF having its own non trivial 
> internal structure it is obviously that much worse.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27296) User Defined Aggregating Functions (UDAFs) have a major efficiency problem

2019-07-05 Thread Erik Erlandson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson reassigned SPARK-27296:
--

Assignee: Erik Erlandson

> User Defined Aggregating Functions (UDAFs) have a major efficiency problem
> --
>
> Key: SPARK-27296
> URL: https://issues.apache.org/jira/browse/SPARK-27296
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL, Structured Streaming
>Affects Versions: 2.3.3, 2.4.0, 3.0.0
>Reporter: Erik Erlandson
>Assignee: Erik Erlandson
>Priority: Major
>  Labels: performance, usability
>
> Spark's UDAFs appear to be serializing and de-serializing to/from the 
> MutableAggregationBuffer for each row.  This gist shows a small reproducing 
> UDAF and a spark shell session:
> [https://gist.github.com/erikerlandson/3c4d8c6345d1521d89e0d894a423046f]
> The UDAF and its companion UDT are designed to count the number of times 
> that ser/de is invoked for the aggregator.  The spark shell session 
> demonstrates that it is executing ser/de on every row of the data frame.
> Note, Spark's pre-defined aggregators do not have this problem, as they are 
> based on an internal aggregating trait that does the correct thing and only 
> calls ser/de at points such as partition boundaries, presenting final 
> results, etc.
> This is a major problem for UDAFs, as it means that every UDAF is doing a 
> massive amount of unnecessary work per row, including but not limited to Row 
> object allocations. For a more realistic UDAF having its own non trivial 
> internal structure it is obviously that much worse.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27296) User Defined Aggregating Functions (UDAFs) have a major efficiency problem

2019-07-02 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877374#comment-16877374
 ] 

Erik Erlandson commented on SPARK-27296:


The basic approach as described above appears to be working (see the linked 
PR). To obtain the desired behavior I had to create a new API, which is fairly 
similar to UDAF, but inherits from TypedImperativeAggregate. This new API 
supports UDT and Column instantiation, and so I believe it offers feature 
parity with the original UDAF, with substantial performance improvements.

> User Defined Aggregating Functions (UDAFs) have a major efficiency problem
> --
>
> Key: SPARK-27296
> URL: https://issues.apache.org/jira/browse/SPARK-27296
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL, Structured Streaming
>Affects Versions: 2.3.3, 2.4.0, 3.0.0
>Reporter: Erik Erlandson
>Priority: Major
>  Labels: performance, usability
>
> Spark's UDAFs appear to be serializing and de-serializing to/from the 
> MutableAggregationBuffer for each row.  This gist shows a small reproducing 
> UDAF and a spark shell session:
> [https://gist.github.com/erikerlandson/3c4d8c6345d1521d89e0d894a423046f]
> The UDAF and its companion UDT are designed to count the number of times 
> that ser/de is invoked for the aggregator.  The spark shell session 
> demonstrates that it is executing ser/de on every row of the data frame.
> Note, Spark's pre-defined aggregators do not have this problem, as they are 
> based on an internal aggregating trait that does the correct thing and only 
> calls ser/de at points such as partition boundaries, presenting final 
> results, etc.
> This is a major problem for UDAFs, as it means that every UDAF is doing a 
> massive amount of unnecessary work per row, including but not limited to Row 
> object allocations. For a more realistic UDAF having its own non trivial 
> internal structure it is obviously that much worse.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27936) Support local dependency uploading from --py-files

2019-06-03 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16854998#comment-16854998
 ] 

Erik Erlandson commented on SPARK-27936:


cc [~skonto]

> Support local dependency uploading from --py-files
> --
>
> Key: SPARK-27936
> URL: https://issues.apache.org/jira/browse/SPARK-27936
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Erik Erlandson
>Priority: Major
>
> Support python dependency uploads, as in SPARK-23153



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27936) Support local dependency uploading from --py-files

2019-06-03 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-27936:
--

 Summary: Support local dependency uploading from --py-files
 Key: SPARK-27936
 URL: https://issues.apache.org/jira/browse/SPARK-27936
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.4.0
Reporter: Erik Erlandson


Support python dependency uploads, as in SPARK-23153



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27872) Driver and executors use a different service account breaking pull secrets

2019-05-29 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851026#comment-16851026
 ] 

Erik Erlandson commented on SPARK-27872:


[~skonto], executors were never given a dedicated service account (they use 
"default"), mostly on the principle of least permissions. However, I see no 
problem with providing them the same service account as the driver if it is 
required for some purpose. Definitely feel free to submit a PR for review.

> Driver and executors use a different service account breaking pull secrets
> --
>
> Key: SPARK-27872
> URL: https://issues.apache.org/jira/browse/SPARK-27872
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Driver and executors use different service accounts in case the driver has 
> one set up which is different than default: 
> [https://gist.github.com/skonto/9beb5afa2ec4659ba563cbb0a8b9c4dd]
> This makes the executor pods fail when the user links the driver service 
> account with a pull secret: 
> [https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#add-imagepullsecrets-to-a-service-account].
>  Executors will not use the driver's service account and will not be able to 
> get the secret in order to pull the related image. 
> I am not sure what the assumption here is for using the default account for 
> executors, probably the fact that this account is limited (btw, executors 
> don't create resources)? This is an inconsistency that could be worked around 
> with the pod template feature in Spark 3.0.0, but it breaks pull secrets and 
> in general I think it's a bug to have it. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23153) Support application dependencies in submission client's local file system

2019-05-22 Thread Erik Erlandson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson reassigned SPARK-23153:
--

 Assignee: Stavros Kontopoulos
Fix Version/s: 3.0.0

> Support application dependencies in submission client's local file system
> -
>
> Key: SPARK-23153
> URL: https://issues.apache.org/jira/browse/SPARK-23153
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Assignee: Stavros Kontopoulos
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, local dependencies are not supported with Spark on K8S, i.e. if 
> the user has code or dependencies only on the client where they run 
> {{spark-submit}}, then the current implementation has no way to make those 
> visible to the Spark application running inside the K8S pods that get 
> launched. This limits users to running only applications where the code and 
> dependencies are either baked into the Docker images used, or are available 
> via some external and globally accessible file system, e.g. HDFS, which are 
> not viable options for many users and environments.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23153) Support application dependencies in submission client's local file system

2019-05-22 Thread Erik Erlandson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson resolved SPARK-23153.

Resolution: Fixed

> Support application dependencies in submission client's local file system
> -
>
> Key: SPARK-23153
> URL: https://issues.apache.org/jira/browse/SPARK-23153
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Assignee: Stavros Kontopoulos
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, local dependencies are not supported with Spark on K8S, i.e. if 
> the user has code or dependencies only on the client where they run 
> {{spark-submit}}, then the current implementation has no way to make those 
> visible to the Spark application running inside the K8S pods that get 
> launched. This limits users to running only applications where the code and 
> dependencies are either baked into the Docker images used, or are available 
> via some external and globally accessible file system, e.g. HDFS, which are 
> not viable options for many users and environments.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27296) User Defined Aggregating Functions (UDAFs) have a major efficiency problem

2019-03-28 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16804176#comment-16804176
 ] 

Erik Erlandson commented on SPARK-27296:


My initial proposal would be to alter the logic underneath
{code:java}
register(name: String, udaf: UserDefinedAggregateFunction){code}
so that the UDAF gets hooked to a TypedImperativeAggregate, and registered in 
the same way that objects like CountMinSketchAgg are.
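
For context, the aggregation contract in question looks roughly like the 
following (an abridged sketch of Spark's internal TypedImperativeAggregate, 
not its exact definition; signatures vary by version). The key property is 
that serialize/deserialize are separate hooks, invoked only when buffers 
cross partition or exchange boundaries rather than once per input row:
{code:java}
import org.apache.spark.sql.catalyst.InternalRow

// Abridged sketch: the buffer is an arbitrary Scala object T, and ser/de
// happens only at buffer-exchange points, not per row.
abstract class TypedImperativeAggregateSketch[T] {
  def createAggregationBuffer(): T             // fresh per-group buffer
  def update(buffer: T, input: InternalRow): T // per-row, no ser/de
  def merge(buffer: T, other: T): T            // combine partial buffers
  def eval(buffer: T): Any                     // produce the final result
  def serialize(buffer: T): Array[Byte]        // boundary-only
  def deserialize(bytes: Array[Byte]): T       // boundary-only
}
{code}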

> User Defined Aggregating Functions (UDAFs) have a major efficiency problem
> --
>
> Key: SPARK-27296
> URL: https://issues.apache.org/jira/browse/SPARK-27296
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL, Structured Streaming
>Affects Versions: 2.3.3, 2.4.0, 3.0.0
>Reporter: Erik Erlandson
>Priority: Major
>  Labels: performance, usability
>
> Spark's UDAFs appear to be serializing and de-serializing to/from the 
> MutableAggregationBuffer for each row.  This gist shows a small reproducing 
> UDAF and a spark shell session:
> [https://gist.github.com/erikerlandson/3c4d8c6345d1521d89e0d894a423046f]
> The UDAF and its companion UDT are designed to count the number of times 
> that ser/de is invoked for the aggregator.  The spark shell session 
> demonstrates that it is executing ser/de on every row of the data frame.
> Note, Spark's pre-defined aggregators do not have this problem, as they are 
> based on an internal aggregating trait that does the correct thing and only 
> calls ser/de at points such as partition boundaries, presenting final 
> results, etc.
> This is a major problem for UDAFs, as it means that every UDAF is doing a 
> massive amount of unnecessary work per row, including but not limited to Row 
> object allocations. For a more realistic UDAF with its own non-trivial 
> internal structure, it is obviously that much worse.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27296) User Defined Aggregating Functions (UDAFs) have a major efficiency problem

2019-03-27 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-27296:
--

 Summary: User Defined Aggregating Functions (UDAFs) have a major 
efficiency problem
 Key: SPARK-27296
 URL: https://issues.apache.org/jira/browse/SPARK-27296
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL, Structured Streaming
Affects Versions: 2.4.0, 2.3.3, 3.0.0
Reporter: Erik Erlandson


Spark's UDAFs appear to be serializing and de-serializing to/from the 
MutableAggregationBuffer for each row.  This gist shows a small reproducing 
UDAF and a spark shell session:

[https://gist.github.com/erikerlandson/3c4d8c6345d1521d89e0d894a423046f]

The UDAF and its companion UDT are designed to count the number of times that 
ser/de is invoked for the aggregator.  The spark shell session demonstrates 
that it is executing ser/de on every row of the data frame.

Note, Spark's pre-defined aggregators do not have this problem, as they are 
based on an internal aggregating trait that does the correct thing and only 
calls ser/de at points such as partition boundaries, presenting final results, 
etc.

This is a major problem for UDAFs, as it means that every UDAF is doing a 
massive amount of unnecessary work per row, including but not limited to Row 
object allocations. For a more realistic UDAF with its own non-trivial 
internal structure, it is obviously that much worse.
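
For reference, here is a minimal sketch of the UserDefinedAggregateFunction 
API under discussion (a trivial summing UDAF, not the instrumented one from 
the gist). Per this issue, even a buffer this simple gets converted to and 
from the internal row representation on every input row:
{code:java}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class SumUDAF extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("x", DoubleType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("sum", DoubleType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = { buffer(0) = 0.0 }
  // Called once per row; the buffer round-trips through ser/de each time.
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) buffer(0) = buffer.getDouble(0) + input.getDouble(0)
  }
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
  }
  def evaluate(buffer: Row): Any = buffer.getDouble(0)
}
{code}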



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26973) Kubernetes version support strategy on test nodes / backend

2019-02-22 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775435#comment-16775435
 ] 

Erik Erlandson commented on SPARK-26973:


A couple other points:
 * Currently, k8s is evolving in a manner where breakage of existing 
functionality is unlikely, so testing against the earliest version we wish to 
support is probably optimal in a scenario where we are choosing one version 
to test against. (This heuristic might change in the future, for example if 
k8s goes to a 2.x series where backward compatibility may be broken.)
 * The integration testing was designed to support running against external 
clusters (GCP, etc) - this might provide an approach to supporting testing 
against multiple k8s versions. However, it would come with additional op-ex 
costs and decreased control over the environment. I mention it mostly because 
it's a plausible path to outsourcing some of the combinatorics that 
[~shaneknapp] discussed above

> Kubernetes version support strategy on test nodes / backend
> ---
>
> Key: SPARK-26973
> URL: https://issues.apache.org/jira/browse/SPARK-26973
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Kubernetes has a policy of supporting three minor releases; the current 
> ones are defined here: 
> [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]
> Moving from release 1.x to 1.(x+1) happens roughly every 100 days: 
> [https://gravitational.com/blog/kubernetes-release-cycle]
> This affects dependency upgrades in the Spark on K8s backend and the 
> version of Minikube that must be supported for testing. One other issue 
> is what users actually want at the time of a given release. Some popular 
> vendors like EKS ([https://aws.amazon.com/eks/faqs/]) have their own roadmap 
> for releases and may not catch up quickly (what is our view on this?).
> Follow the comments for a recent discussion on the topic: 
> [https://github.com/apache/spark/pull/23814]
> Clearly we need a strategy for this.
> A couple of options for the current state of things:
> a) Support only the last two versions, but that leaves out a version that 
> still receives patches.
> b) Support only the latest, which makes testing easier, but leaves out 
> other currently maintained versions.
> A good strategy will optimize at least the following:
> 1) percentage of users satisfied at release time,
> 2) how long it takes to support the latest K8s version, and
> 3) testing requirements, e.g. minikube versions used.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24434) Support user-specified driver and executor pod templates

2018-11-26 Thread Erik Erlandson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson updated SPARK-24434:
---
Fix Version/s: 3.0.0

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
> Fix For: 3.0.0
>
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25828) Bumping Version of kubernetes.client to latest version

2018-10-26 Thread Erik Erlandson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson reassigned SPARK-25828:
--

Assignee: Ilan Filonenko

> Bumping Version of kubernetes.client to latest version
> --
>
> Key: SPARK-25828
> URL: https://issues.apache.org/jira/browse/SPARK-25828
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Ilan Filonenko
>Assignee: Ilan Filonenko
>Priority: Minor
> Fix For: 3.0.0
>
>
> Upgrade the Kubernetes client version to at least 
> [4.0.0|https://mvnrepository.com/artifact/io.fabric8/kubernetes-client/4.0.0] 
> as we are falling behind on fabric8 updates. This will be an update to both 
> kubernetes/core and kubernetes/integration-tests



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25828) Bumping Version of kubernetes.client to latest version

2018-10-26 Thread Erik Erlandson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson resolved SPARK-25828.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22820
[https://github.com/apache/spark/pull/22820]

> Bumping Version of kubernetes.client to latest version
> --
>
> Key: SPARK-25828
> URL: https://issues.apache.org/jira/browse/SPARK-25828
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Ilan Filonenko
>Assignee: Ilan Filonenko
>Priority: Minor
> Fix For: 3.0.0
>
>
> Upgrade the Kubernetes client version to at least 
> [4.0.0|https://mvnrepository.com/artifact/io.fabric8/kubernetes-client/4.0.0] 
> as we are falling behind on fabric8 updates. This will be an update to both 
> kubernetes/core and kubernetes/integration-tests



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25828) Bumping Version of kubernetes.client to latest version

2018-10-24 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662978#comment-16662978
 ] 

Erik Erlandson commented on SPARK-25828:


cc [~skonto]

> Bumping Version of kubernetes.client to latest version
> --
>
> Key: SPARK-25828
> URL: https://issues.apache.org/jira/browse/SPARK-25828
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Ilan Filonenko
>Priority: Minor
>
> Upgrade the Kubernetes client version to at least 
> [4.0.0|https://mvnrepository.com/artifact/io.fabric8/kubernetes-client/4.0.0] 
> as we are falling behind on fabric8 updates. This will be an update to both 
> kubernetes/core and kubernetes/integration-tests



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25782) Add PCA Aggregator to support grouping

2018-10-19 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16657398#comment-16657398
 ] 

Erik Erlandson commented on SPARK-25782:


An ML Estimator would arguably also be a good API to expose.

> Add PCA Aggregator to support grouping
> --
>
> Key: SPARK-25782
> URL: https://issues.apache.org/jira/browse/SPARK-25782
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 2.3.2
>Reporter: Matt Saunders
>Priority: Minor
>
> I built an Aggregator that computes PCA on grouped datasets. I wanted to use 
> the PCA functions provided by MLlib, but they only work on a full dataset, 
> and I needed to do it on a grouped dataset (like a RelationalGroupedDataset). 
> So I built a little Aggregator that can do that, here's an example of how 
> it's called:
> {noformat}
> val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn
> // For each grouping, compute a PCA matrix/vector
> val pcaModels = inputData
>   .groupBy(keys:_*)
>   .agg(pcaAggregation.as(pcaOutput)){noformat}
> I used the same algorithms under the hood as 
> RowMatrix.computePrincipalComponentsAndExplainedVariance, though this works 
> directly on Datasets without converting to RDD first.
> I've seen others who wanted this ability (for example on Stack Overflow) so 
> I'd like to contribute it if it would be a benefit to the larger community. 
> If there is interest, I will prepare the code for a pull request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25782) Add PCA Aggregator to support grouping

2018-10-19 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16657394#comment-16657394
 ] 

Erik Erlandson commented on SPARK-25782:


Thanks [~mttsndrs]!

I agree it makes sense to support full Dataset aggregation functionality via a 
UDAF.
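
For reference, the general shape such a grouped Aggregator takes (a hedged 
sketch, not the contributor's PCAAggregator: this one only accumulates a 
per-group mean vector, whereas a PCA version would also accumulate 
second-moment statistics in reduce/merge and extract principal components in 
finish):
{code:java}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Encoders}

case class MomentBuf(n: Long, sum: Array[Double])

// Per-group mean of fixed-length feature arrays; usable via
// groupBy(keys:_*).agg(new MeanVectorAggregator(dim).toColumn)
class MeanVectorAggregator(dim: Int)
    extends Aggregator[Array[Double], MomentBuf, Array[Double]] {
  def zero: MomentBuf = MomentBuf(0L, new Array[Double](dim))
  def reduce(b: MomentBuf, x: Array[Double]): MomentBuf = {
    var i = 0
    while (i < dim) { b.sum(i) += x(i); i += 1 }
    MomentBuf(b.n + 1, b.sum)
  }
  def merge(b1: MomentBuf, b2: MomentBuf): MomentBuf = {
    var i = 0
    while (i < dim) { b1.sum(i) += b2.sum(i); i += 1 }
    MomentBuf(b1.n + b2.n, b1.sum)
  }
  def finish(b: MomentBuf): Array[Double] =
    if (b.n == 0L) b.sum else b.sum.map(_ / b.n)
  def bufferEncoder: Encoder[MomentBuf] = Encoders.product[MomentBuf]
  def outputEncoder: Encoder[Array[Double]] = Encoders.kryo[Array[Double]]
}
{code}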

> Add PCA Aggregator to support grouping
> --
>
> Key: SPARK-25782
> URL: https://issues.apache.org/jira/browse/SPARK-25782
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 2.3.2
>Reporter: Matt Saunders
>Priority: Minor
>
> I built an Aggregator that computes PCA on grouped datasets. I wanted to use 
> the PCA functions provided by MLlib, but they only work on a full dataset, 
> and I needed to do it on a grouped dataset (like a RelationalGroupedDataset). 
> So I built a little Aggregator that can do that, here's an example of how 
> it's called:
> {noformat}
> val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn
> // For each grouping, compute a PCA matrix/vector
> val pcaModels = inputData
>   .groupBy(keys:_*)
>   .agg(pcaAggregation.as(pcaOutput)){noformat}
> I used the same algorithms under the hood as 
> RowMatrix.computePrincipalComponentsAndExplainedVariance, though this works 
> directly on Datasets without converting to RDD first.
> I've seen others who wanted this ability (for example on Stack Overflow) so 
> I'd like to contribute it if it would be a benefit to the larger community. 
> If there is interest, I will prepare the code for a pull request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25782) Add PCA Aggregator to support grouping

2018-10-19 Thread Erik Erlandson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson updated SPARK-25782:
---
Target Version/s: 3.0.0
 Component/s: ML
  Issue Type: New Feature  (was: Improvement)

> Add PCA Aggregator to support grouping
> --
>
> Key: SPARK-25782
> URL: https://issues.apache.org/jira/browse/SPARK-25782
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 2.3.2
>Reporter: Matt Saunders
>Priority: Minor
>
> I built an Aggregator that computes PCA on grouped datasets. I wanted to use 
> the PCA functions provided by MLlib, but they only work on a full dataset, 
> and I needed to do it on a grouped dataset (like a RelationalGroupedDataset). 
> So I built a little Aggregator that can do that, here's an example of how 
> it's called:
> {noformat}
> val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn
> // For each grouping, compute a PCA matrix/vector
> val pcaModels = inputData
>   .groupBy(keys:_*)
>   .agg(pcaAggregation.as(pcaOutput)){noformat}
> I used the same algorithms under the hood as 
> RowMatrix.computePrincipalComponentsAndExplainedVariance, though this works 
> directly on Datasets without converting to RDD first.
> I've seen others who wanted this ability (for example on Stack Overflow) so 
> I'd like to contribute it if it would be a benefit to the larger community. 
> If there is interest, I will prepare the code for a pull request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25128) multiple simultaneous job submissions against k8s backend cause driver pods to hang

2018-09-06 Thread Erik Erlandson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson updated SPARK-25128:
---
Target Version/s: 3.0.0  (was: 2.4.0, 2.3.3)
Priority: Minor  (was: Major)

> multiple simultaneous job submissions against k8s backend cause driver pods 
> to hang
> ---
>
> Key: SPARK-25128
> URL: https://issues.apache.org/jira/browse/SPARK-25128
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Erik Erlandson
>Priority: Minor
>  Labels: kubernetes
>
> User is reporting that multiple "simultaneous" (or rapidly in succession) job 
> submissions against the k8s back-end are causing driver pods to hang in 
> "Waiting: PodInitializing" state. They filed an associated question at 
> [stackoverflow|https://stackoverflow.com/questions/51843212/spark-driver-pod-stuck-in-waiting-podinitializing-state-in-kubernetes].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25128) multiple simultaneous job submissions against k8s backend cause driver pods to hang

2018-09-06 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16605916#comment-16605916
 ] 

Erik Erlandson commented on SPARK-25128:


Retargeting to next release sounds good. There has been no traffic since filing 
and it shouldn't block the release.

> multiple simultaneous job submissions against k8s backend cause driver pods 
> to hang
> ---
>
> Key: SPARK-25128
> URL: https://issues.apache.org/jira/browse/SPARK-25128
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Erik Erlandson
>Priority: Major
>  Labels: kubernetes
>
> User is reporting that multiple "simultaneous" (or rapidly in succession) job 
> submissions against the k8s back-end are causing driver pods to hang in 
> "Waiting: PodInitializing" state. They filed an associated question at 
> [stackoverflow|https://stackoverflow.com/questions/51843212/spark-driver-pod-stuck-in-waiting-podinitializing-state-in-kubernetes].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-08-31 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599299#comment-16599299
 ] 

Erik Erlandson commented on SPARK-24434:


To amplify a little from my points above: I co-chair a SIG that is attended by 
some Apache Spark contributors, most frequently people involved around the 
kubernetes back-end. As chair, I do my best to provide input on the discussions 
we have there. However, the various community participants are their own 
independent entities; nobody in this community takes orders from me.

When everything is running smoothly, this kind of duplicated effort should 
never happen. Here things didn't go smoothly, and I hope to work it out as best 
we can.

[~skonto] I encourage you to post your dev on this feature, which allows 
everyone to discuss all the available options.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-08-31 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599283#comment-16599283
 ] 

Erik Erlandson commented on SPARK-24434:


Stavros, yes, I knew you were working on it, and also that there were no plans 
for 2.4.

As I said above, it is generally more efficient and respectful to coordinate 
with issue assignees. I did not request this second PR. On the other hand, 
multiple PRs for an issue doesn't violate any FOSS principles, it means there 
should be a community discussion about which PR ought to be pursued.

I'm not aware of any renewed push to get this into 2.4.  I don't see any 
discussion about it on dev@spark.

 

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-08-31 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599092#comment-16599092
 ] 

Erik Erlandson commented on SPARK-24434:


There are a few related, but separate, issues here.

I agree that it is most efficient, and considerate, to respect issue 
assignments and coordinate our distributed development around absences, etc.

To the best of my knowledge, the work Stavros did on 24434 was not made visible 
as a public WIP apache/spark branch. Making dev visible this way is one 
important way to minimize coordination problems.

Although this confusion is awkward, nothing in regard to 24434 has violated 
FOSS principles, or Spark governance. Onur's PR has been developed and reviewed 
on a public apache/spark branch. This Jira was filed, and has hosted discussion 
from all stakeholders.

The Kubernetes Big Data SIG is a separate community that overlaps with the 
Spark community. Our meetings are open to the public, and we publish recordings 
and meeting minutes. Although we discuss topics related to Spark on Kubernetes, 
we do not make Spark development decisions in that community. All of the work 
that members of the K8s Big Data SIG have contributed to Spark respects Apache 
governance and has been done using established Spark processes: SPIP, 
discussion on dev, Jira, and the PR workflow.

 

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25287) Check for JIRA_USERNAME and JIRA_PASSWORD up front in merge_spark_pr.py

2018-08-30 Thread Erik Erlandson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson resolved SPARK-25287.

   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22294
[https://github.com/apache/spark/pull/22294]

> Check for JIRA_USERNAME and JIRA_PASSWORD up front in merge_spark_pr.py
> ---
>
> Key: SPARK-25287
> URL: https://issues.apache.org/jira/browse/SPARK-25287
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.3.1
>Reporter: Erik Erlandson
>Assignee: Erik Erlandson
>Priority: Minor
>  Labels: infrastructure
> Fix For: 2.4.0
>
>
> I never remember to set {{JIRA_USERNAME}} and {{JIRA_PASSWORD}}, and it isn't 
> checked, so I get to the end of the {{merge_spark_pr.py}} process and it 
> fails on the Jira state update. An up-front check for this would be useful.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25287) Check for JIRA_USERNAME and JIRA_PASSWORD up front in merge_spark_pr.py

2018-08-30 Thread Erik Erlandson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson reassigned SPARK-25287:
--

Assignee: Erik Erlandson

> Check for JIRA_USERNAME and JIRA_PASSWORD up front in merge_spark_pr.py
> ---
>
> Key: SPARK-25287
> URL: https://issues.apache.org/jira/browse/SPARK-25287
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.3.1
>Reporter: Erik Erlandson
>Assignee: Erik Erlandson
>Priority: Minor
>  Labels: infrastructure
> Fix For: 2.4.0
>
>
> I never remember to set {{JIRA_USERNAME}} and {{JIRA_PASSWORD}}, and it isn't 
> checked, so I get to the end of the {{merge_spark_pr.py}} process and it 
> fails on the Jira state update. An up-front check for this would be useful.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25287) Check for JIRA_USERNAME and JIRA_PASSWORD up front in merge_spark_pr.py

2018-08-30 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-25287:
--

 Summary: Check for JIRA_USERNAME and JIRA_PASSWORD up front in 
merge_spark_pr.py
 Key: SPARK-25287
 URL: https://issues.apache.org/jira/browse/SPARK-25287
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 2.3.1
Reporter: Erik Erlandson


I never remember to set {{JIRA_USERNAME}} and {{JIRA_PASSWORD}}, and it isn't 
checked, so I get to the end of the {{merge_spark_pr.py}} process and it fails 
on the Jira state update. An up-front check for this would be useful.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25275) require membership in wheel to run 'su' (in dockerfiles)

2018-08-30 Thread Erik Erlandson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson resolved SPARK-25275.

   Resolution: Fixed
Fix Version/s: 2.4.0

> require membership in wheel to run 'su' (in dockerfiles)
> ---
>
> Key: SPARK-25275
> URL: https://issues.apache.org/jira/browse/SPARK-25275
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Erik Erlandson
>Priority: Major
>  Labels: docker, kubernetes
> Fix For: 2.4.0
>
>
> For improved security, configure the image so that users must be in the 
> wheel group in order to run su.
> See example:
> [https://github.com/openshift-evangelists/terminal-base-image/blob/master/image/Dockerfile#L53]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25275) require membership in wheel to run 'su' (in dockerfiles)

2018-08-30 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597935#comment-16597935
 ] 

Erik Erlandson commented on SPARK-25275:


{{merge_spark_pr.py}} failed to close this, closing manually.

> require membership in wheel to run 'su' (in dockerfiles)
> ---
>
> Key: SPARK-25275
> URL: https://issues.apache.org/jira/browse/SPARK-25275
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Erik Erlandson
>Priority: Major
>  Labels: docker, kubernetes
>
> For improved security, configure the image so that users must be in the 
> wheel group in order to run su.
> See example:
> [https://github.com/openshift-evangelists/terminal-base-image/blob/master/image/Dockerfile#L53]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25275) require membership in wheel to run 'su' (in dockerfiles)

2018-08-29 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-25275:
--

 Summary: require membership in wheel to run 'su' (in dockerfiles)
 Key: SPARK-25275
 URL: https://issues.apache.org/jira/browse/SPARK-25275
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.3.1, 2.3.0
Reporter: Erik Erlandson


For improved security, configure the image so that users must be in the wheel 
group in order to run su.

See example:

[https://github.com/openshift-evangelists/terminal-base-image/blob/master/image/Dockerfile#L53]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21097) Dynamic allocation will preserve cached data

2018-08-27 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594271#comment-16594271
 ] 

Erik Erlandson commented on SPARK-21097:


I'm wondering if this is going to be subsumed by the Shuffle Service redesign 
proposal.

cc [~mcheah]

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
>Priority: Major
> Attachments: Preserving Cached Data with Dynamic Allocation.pdf
>
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our spark clusters. One difficulty is that if a user has cached data 
> then we are either prevented from de-allocating any of their executors, or we 
> are forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> spark config. Now when an executor reaches its configured idle timeout, 
> instead of just killing it on the spot, we will stop sending it new tasks, 
> replicate all of its rdd blocks onto other executors, and then kill it. If 
> there is an issue while we replicate the data, like an error, it takes too 
> long, or there isn't enough space, then we will fall back to the original 
> behavior and drop the data and kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it will be completely opt-in, it is 
> unlikely to cause problems for other use cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public

2018-08-22 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588934#comment-16588934
 ] 

Erik Erlandson commented on SPARK-7768:
---

We use `UserDefinedType`, for example here:
[https://github.com/isarn/isarn-sketches-spark/blob/develop/src/main/scala/org/apache/spark/isarnproject/sketches/udt/TDigestUDT.scala#L37]

My colleague [~willbenton] and I gave a talk at Spark+AI summit in June on 
[this topic|https://databricks.com/session/apache-spark-for-library-developers]

A comment about {{Encoder}}s: they are strongly typed, which is quite nice to 
work with in Scala, but if you intend to expose your type via DataFrame and/or 
PySpark via py4j, they can't help you, and you need UDTs.
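
To illustrate what the non-public API in question looks like, here is a hedged 
sketch with a made-up Point2D type (not TDigestUDT). Because UserDefinedType 
is private[spark], real implementations like the one linked above place 
themselves in an org.apache.spark.* package to access it:
{code:java}
import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
import org.apache.spark.sql.types._

case class Point2D(x: Double, y: Double)

// Maps Point2D to/from an SQL array of doubles, so the type can surface
// through DataFrames (and hence PySpark via py4j). Note: compiles only
// inside an org.apache.spark.* package while the API remains private.
class Point2DUDT extends UserDefinedType[Point2D] {
  def sqlType: DataType = ArrayType(DoubleType, containsNull = false)
  def serialize(p: Point2D): Any = new GenericArrayData(Array(p.x, p.y))
  def deserialize(datum: Any): Point2D = datum match {
    case a: ArrayData => Point2D(a.getDouble(0), a.getDouble(1))
  }
  def userClass: Class[Point2D] = classOf[Point2D]
}
{code}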

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25128) multiple simultaneous job submissions against k8s backend cause driver pods to hang

2018-08-15 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581537#comment-16581537
 ] 

Erik Erlandson commented on SPARK-25128:


[~mcheah], [~liyinan926], wdyt?

> multiple simultaneous job submissions against k8s backend cause driver pods 
> to hang
> ---
>
> Key: SPARK-25128
> URL: https://issues.apache.org/jira/browse/SPARK-25128
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Erik Erlandson
>Priority: Major
>  Labels: kubernetes
>
> User is reporting that multiple "simultaneous" (or rapidly in succession) job 
> submissions against the k8s back-end are causing driver pods to hang in 
> "Waiting: PodInitializing" state. They filed an associated question at 
> [stackoverflow|https://stackoverflow.com/questions/51843212/spark-driver-pod-stuck-in-waiting-podinitializing-state-in-kubernetes].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25128) multiple simultaneous job submissions against k8s backend cause driver pods to hang

2018-08-15 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-25128:
--

 Summary: multiple simultaneous job submissions against k8s backend 
cause driver pods to hang
 Key: SPARK-25128
 URL: https://issues.apache.org/jira/browse/SPARK-25128
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.3.0
Reporter: Erik Erlandson


User is reporting that multiple "simultaneous" (or rapidly in succession) job 
submissions against the k8s back-end are causing driver pods to hang in 
"Waiting: PodInitializing" state. They filed an associated question at 
[stackoverflow|https://stackoverflow.com/questions/51843212/spark-driver-pod-stuck-in-waiting-podinitializing-state-in-kubernetes].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24817) Implement BarrierTaskContext.barrier()

2018-08-02 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567358#comment-16567358
 ] 

Erik Erlandson commented on SPARK-24817:


I have been looking at the use cases for barrier-mode on the design doc. The 
primary story seems to be along the lines of using {{mapPartitions}} to:
 # write out any partitioned data (and sync)
 # execute some kind of ML logic (TF, etc) (possibly syncing on stages here?)
 # optionally move back into "normal" spark executions

My mental model has been that the value proposition for Hydrogen is primarily a 
convergence argument: it is easier not to have to leave a Spark workflow to 
execute something like TF using some other toolchain. But OTOH, given that the 
Spark programmer has to write out the partitioned data and then invoke ML 
tooling like TF regardless, does the increased convenience pay for the cost 
in complexity of absorbing new clustering & scheduling models into Spark, 
along with other consequences such as SPARK-24615, compared to the "null 
hypothesis" of writing partition data, then using ML-specific clustering 
toolchains (kubeflow, for example), and consuming the resulting products in 
Spark afterward?
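
Concretely, the workflow pattern in question looks something like the 
following (a hedged sketch against the Spark 2.4 RDD barrier API; the 
training call and output path handling are placeholders, not real tooling):
{code:java}
import org.apache.spark.BarrierTaskContext
import org.apache.spark.rdd.RDD

def barrierTrain(data: RDD[Array[Double]], outDir: String): RDD[String] =
  data.barrier().mapPartitions { rows =>
    val ctx = BarrierTaskContext.get()
    // 1. write out this task's partitioned data
    val path = s"$outDir/part-${ctx.partitionId()}"
    // ... write `rows` to `path` ...
    ctx.barrier()  // sync: all partitions have been written
    // 2. run the external ML logic; each task can see its peers
    //    via ctx.getTaskInfos() if the tooling needs host addresses
    val result = s"trained-on-$path"  // placeholder
    ctx.barrier()  // sync: training stage complete
    Iterator.single(result)
  }
{code}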

> Implement BarrierTaskContext.barrier()
> --
>
> Key: SPARK-24817
> URL: https://issues.apache.org/jira/browse/SPARK-24817
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> Implement BarrierTaskContext.barrier(), to support global sync between all 
> the tasks in a barrier stage. The global sync shall finish immediately once 
> all tasks in the same barrier stage reach the same barrier.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24817) Implement BarrierTaskContext.barrier()

2018-08-02 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567301#comment-16567301
 ] 

Erik Erlandson commented on SPARK-24817:


Thanks [~jiangxb] - I'd expect that design to work out-of-box on the k8s 
backend. 

ML-specific code seems like it will have needs that are harder to predict, by 
definition. If it can use IP addresses in the cluster space, it should work 
regardless. If it wants fqdn, then perhaps additional pod configurations will 
be required.

> Implement BarrierTaskContext.barrier()
> --
>
> Key: SPARK-24817
> URL: https://issues.apache.org/jira/browse/SPARK-24817
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> Implement BarrierTaskContext.barrier(), to support global sync between all 
> the tasks in a barrier stage. The global sync shall finish immediately once 
> all tasks in the same barrier stage reach the same barrier.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24817) Implement BarrierTaskContext.barrier()

2018-08-01 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566159#comment-16566159
 ] 

Erik Erlandson commented on SPARK-24817:


I'm curious about what the {{barrier}} invocations inside {{mapPartitions}} 
closures imply about communication between executors, for example executors 
running on pods in a kube cluster. It is possible that whatever allows 
shuffle data to transfer between executors will also allow these {{barrier}} 
coordinations to work. However, we had to create a headless service for 
executors to register properly with the driver pod, and if every executor pod 
needs something like that for barrier to work, it will have an impact on kube 
backend support.

> Implement BarrierTaskContext.barrier()
> --
>
> Key: SPARK-24817
> URL: https://issues.apache.org/jira/browse/SPARK-24817
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> Implement BarrierTaskContext.barrier(), to support global sync between all 
> the tasks in a barrier stage. The global sync shall finish immediately once 
> all tasks in the same barrier stage reach the same barrier.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24580) List scenarios to be handled by barrier execution mode properly

2018-08-01 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566154#comment-16566154
 ] 

Erik Erlandson commented on SPARK-24580:


This is blocking SPARK-24582 which is marked as 'resolved' but it appears to be 
inactive.

> List scenarios to be handled by barrier execution mode properly
> ---
>
> Key: SPARK-24580
> URL: https://issues.apache.org/jira/browse/SPARK-24580
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Jiang Xingbo
>Priority: Major
>
> List scenarios to be handled by barrier execution mode to help the design. We 
> will start with simple ones to complex.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-31 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564558#comment-16564558
 ] 

Erik Erlandson commented on SPARK-24615:


Am I understanding correctly that this can't assign executors to desired 
resources without resorting to Dynamic Allocation to tear down an Executor and 
reallocate it somewhere else?

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator card (GPU, FPGA, TPU) is 
> predominant compared to CPUs. To make the current Spark architecture to work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule task onto the executors where 
> accelerators are equipped.
> Current Spark’s scheduler schedules tasks based on the locality of the data 
> plus the available of CPUs. This will introduce some problems when scheduling 
> tasks with accelerators required.
>  # CPU cores are usually more than accelerators on one node, using CPU cores 
> to schedule accelerator required tasks will introduce the mismatch.
>  # In one cluster, we always assume that CPU is equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires scheduler to schedule tasks with a smart way.
> So here propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator requires or not). This can be part of the work of Project 
> hydrogen.
> Details is attached in google doc. It doesn't cover all the implementation 
> details, just highlight the parts should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24793) Make spark-submit more useful with k8s

2018-07-12 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542102#comment-16542102
 ] 

Erik Erlandson edited comment on SPARK-24793 at 7/12/18 7:01 PM:
-

Also a good point that --kill and --status are existing invocation modes, so it 
is already part of the command scope. From that pov, I agree it makes sense to 
support them via the k8s backend


was (Author: eje):
Also a good point that {{--kill}} and {{--status}} are existing invocation 
modes, so it is already part of the command scope. From that pov, I agree it 
makes sense to support them via the k8s backend

> Make spark-submit more useful with k8s
> --
>
> Key: SPARK-24793
> URL: https://issues.apache.org/jira/browse/SPARK-24793
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Assignee: Anirudh Ramanathan
>Priority: Major
>
> Support controlling the lifecycle of Spark Application through spark-submit. 
> For example:
> {{ 
>   --kill app_name   If given, kills the driver specified.
>   --status app_name  If given, requests the status of the driver 
> specified.
> }}
> Potentially also --list to list all spark drivers running.
> Given that our submission client can actually launch jobs into many different 
> namespaces, we'll need an additional specification of the namespace through a 
> --namespace flag potentially.
> I think this is pretty useful to have instead of forcing a user to use 
> kubectl to manage the lifecycle of any k8s Spark Application.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24793) Make spark-submit more useful with k8s

2018-07-12 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542102#comment-16542102
 ] 

Erik Erlandson commented on SPARK-24793:


Also a good point that {{--kill}} and {{--status}} are existing invocation 
modes, so it is already part of the command scope. From that pov, I agree it 
makes sense to support them via the k8s backend

> Make spark-submit more useful with k8s
> --
>
> Key: SPARK-24793
> URL: https://issues.apache.org/jira/browse/SPARK-24793
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Assignee: Anirudh Ramanathan
>Priority: Major
>
> Support controlling the lifecycle of Spark Application through spark-submit. 
> For example:
> {{ 
>   --kill app_name   If given, kills the driver specified.
>   --status app_name  If given, requests the status of the driver 
> specified.
> }}
> Potentially also --list to list all spark drivers running.
> Given that our submission client can actually launch jobs into many different 
> namespaces, we'll need an additional specification of the namespace through a 
> --namespace flag potentially.
> I think this is pretty useful to have instead of forcing a user to use 
> kubectl to manage the lifecycle of any k8s Spark Application.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24793) Make spark-submit more useful with k8s

2018-07-12 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541930#comment-16541930
 ] 

Erik Erlandson commented on SPARK-24793:


Another possible angle (not mutually exclusive with above) is establishing 
spark-operator as a "standard" solution for supporting these kind of 
higher-level operations. "If you want to do higher-level CRUD on jobs, we 
recommend investigating spark-operator..."

> Make spark-submit more useful with k8s
> --
>
> Key: SPARK-24793
> URL: https://issues.apache.org/jira/browse/SPARK-24793
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Assignee: Anirudh Ramanathan
>Priority: Major
>
> Support controlling the lifecycle of Spark Application through spark-submit. 
> For example:
> {{ 
>   --kill app_name   If given, kills the driver specified.
>   --status app_name  If given, requests the status of the driver 
> specified.
> }}
> Potentially also --list to list all spark drivers running.
> Given that our submission client can actually launch jobs into many different 
> namespaces, we'll need an additional specification of the namespace through a 
> --namespace flag potentially.
> I think this is pretty useful to have instead of forcing a user to use 
> kubectl to manage the lifecycle of any k8s Spark Application.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24793) Make spark-submit more useful with k8s

2018-07-12 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541923#comment-16541923
 ] 

Erik Erlandson commented on SPARK-24793:


I am concerned that this is outside the scope of {{spark-submit}}, especially 
since it is arguably a k8s-centric use case.

But it's definitely a useful set of functionality. I'd propose strategic use 
of labels to make this kind of operation easier via {{kubectl}}, possibly 
supported via a tutorial example in the docs: "here's how to use labels to do 
common operations like 'kill this app' and 'list all the running driver 
pods'", etc.

> Make spark-submit more useful with k8s
> --
>
> Key: SPARK-24793
> URL: https://issues.apache.org/jira/browse/SPARK-24793
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Assignee: Anirudh Ramanathan
>Priority: Major
>
> Support controlling the lifecycle of Spark Application through spark-submit. 
> For example:
> {{ 
>   --kill app_name   If given, kills the driver specified.
>   --status app_name  If given, requests the status of the driver 
> specified.
> }}
> Potentially also --list to list all spark drivers running.
> Given that our submission client can actually launch jobs into many different 
> namespaces, we'll need an additional specification of the namespace through a 
> --namespace flag potentially.
> I think this is pretty useful to have instead of forcing a user to use 
> kubectl to manage the lifecycle of any k8s Spark Application.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24534) Add a way to bypass entrypoint.sh script if no spark cmd is passed

2018-06-19 Thread Erik Erlandson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson resolved SPARK-24534.

   Resolution: Fixed
Fix Version/s: 2.4.0

> Add a way to bypass entrypoint.sh script if no spark cmd is passed
> --
>
> Key: SPARK-24534
> URL: https://issues.apache.org/jira/browse/SPARK-24534
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Ricardo Martinelli de Oliveira
>Priority: Minor
> Fix For: 2.4.0
>
>
> As an improvement to the entrypoint.sh script, I'd like to propose that the 
> spark entrypoint do a passthrough if driver/executor/init is not the command 
> passed. Currently it raises an error.
> To be more specific, I'm talking about these lines:
> [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L113-L114]
> This allows the openshift-spark image to continue to function as a Spark 
> Standalone component, with custom configuration support etc. without 
> compromising the previous method to configure the cluster inside a kubernetes 
> environment.
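
A minimal sketch of the proposed pass-through behavior (illustrative only, not 
the merged patch; the {{SPARK_K8S_CMD}} dispatch variable is assumed from the 
existing script):

{code}
# at the end of entrypoint.sh's command dispatch
case "$SPARK_K8S_CMD" in
  driver | executor | init)
    # existing spark-on-k8s handling stays as-is
    ;;
  *)
    # pass-through: not a spark-on-k8s command, so just exec what was given
    exec "$@"
    ;;
esac
{code}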



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24534) Add a way to bypass entrypoint.sh script if no spark cmd is passed

2018-06-13 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511804#comment-16511804
 ] 

Erik Erlandson commented on SPARK-24534:


I think this has potential use for customization beyond the openshift 
downstream. It allows derived images to leverage the apache spark base images 
in contexts outside of directly running the driver and executor processes.

> Add a way to bypass entrypoint.sh script if no spark cmd is passed
> --
>
> Key: SPARK-24534
> URL: https://issues.apache.org/jira/browse/SPARK-24534
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Ricardo Martinelli de Oliveira
>Priority: Minor
>
> As an improvement to the entrypoint.sh script, I'd like to propose that the 
> spark entrypoint do a passthrough if driver/executor/init is not the command 
> passed. Currently it raises an error.
> To be more specific, I'm talking about these lines:
> [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L113-L114]
> This allows the openshift-spark image to continue to function as a Spark 
> Standalone component, with custom configuration support etc. without 
> compromising the previous method to configure the cluster inside a kubernetes 
> environment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24434) Support user-specified driver and executor pod templates

2018-06-01 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498337#comment-16498337
 ] 

Erik Erlandson edited comment on SPARK-24434 at 6/1/18 5:53 PM:


My current take on UX around this feature is that there's not much precedent 
from the Spark world. Assuming I'm right about that it's more likely to be 
driven by what expectations Kubernetes users have. In my experience that is 
along the lines of "pointing at a yaml file," but maybe there's more variety of 
user workflows than I think.

JSON definitely seems more amenable to inclusion in command-line arguments. 
I have been assuming that if users were specifying pod configurations, they'd 
be somewhat larger pod sub-structures, not easy to supply inline on a command 
line. Are "small" pod modifications also likely?
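
For concreteness, the kind of small template a user might point at (an 
illustrative sketch only; the merge semantics are exactly what needs design 
work here):

{code}
# pod-template.yaml (illustrative)
apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    disktype: ssd
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "spark"
    effect: "NoSchedule"
{code}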


was (Author: eje):
My current take on UX around this feature is that there's not much precedent 
from the Spark world. Assuming I'm right about that it's more likely to be 
driven by what expectations Kubernetes users have. In my experience that is 
along the lines of "pointing at a yaml file," but maybe there's more variety of 
user workflows than I think.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-06-01 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498337#comment-16498337
 ] 

Erik Erlandson commented on SPARK-24434:


My current take on UX around this feature is that there's not much precedent 
from the Spark world. Assuming I'm right about that it's more likely to be 
driven by what expectations Kubernetes users have. In my experience that is 
along the lines of "pointing at a yaml file," but maybe there's more variety of 
user workflows than I think.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-05-31 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496991#comment-16496991
 ] 

Erik Erlandson commented on SPARK-24434:


[~foxish] is there a technical (or UX) argument for JSON versus YAML (or 
allowing both)?

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-05-30 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495642#comment-16495642
 ] 

Erik Erlandson commented on SPARK-24434:


[~skonto] given the number of ideas that have gotten tossed around for this 
over time, an 'alternatives considered' section for a design doc will 
definitely be valuable.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24091) Internally used ConfigMap prevents use of user-specified ConfigMaps carrying Spark configs files

2018-05-30 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495619#comment-16495619
 ] 

Erik Erlandson commented on SPARK-24091:


If we support user-supplied YAML, that may become a source of ConfigMap 
specifications.

> Internally used ConfigMap prevents use of user-specified ConfigMaps carrying 
> Spark configs files
> 
>
> Key: SPARK-24091
> URL: https://issues.apache.org/jira/browse/SPARK-24091
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> The recent PR [https://github.com/apache/spark/pull/20669] for removing the 
> init-container introduced a internally used ConfigMap carrying Spark 
> configuration properties in a file for the driver. This ConfigMap gets 
> mounted under {{$SPARK_HOME/conf}} and the environment variable 
> {{SPARK_CONF_DIR}} is set to point to the mount path. This pretty much 
> prevents users from mounting their own ConfigMaps that carry custom Spark 
> configuration files, e.g., {{log4j.properties}} and {{spark-env.sh}} and 
> leaves users with only the option of building custom images. IMO, it is very 
> useful to support mounting user-specified ConfigMaps for custom Spark 
> configuration files. This worths further discussions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-05-30 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495615#comment-16495615
 ] 

Erik Erlandson commented on SPARK-24434:


Is the template-based solution being explicitly favored over other options, 
e.g. pod presets or webhooks, etc?

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24435) Support user-supplied YAML that can be merged with k8s pod descriptions

2018-05-30 Thread Erik Erlandson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson resolved SPARK-24435.

Resolution: Duplicate

> Support user-supplied YAML that can be merged with k8s pod descriptions
> ---
>
> Key: SPARK-24435
> URL: https://issues.apache.org/jira/browse/SPARK-24435
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Erik Erlandson
>Priority: Major
>  Labels: features, kubernetes
> Fix For: 2.4.0
>
>
> Kubernetes supports a large variety of configurations to Pods. Currently only 
> some of these are configurable from Spark, and they all operate by being 
> plumbed from --conf arguments through to pod creation in the code.
> To avoid the anti-pattern of trying to expose an unbounded Pod feature set 
> through Spark configuration keywords, the community is interested in working 
> out a sane way of allowing users to supply "arbitrary" Pod YAML which can be 
> merged with the pod configurations created by the kube backend.
> Multiple solutions have been considered, including Pod Pre-sets and loading 
> Pod template objects.  A requirement is that the policy for how user-supplied 
> YAML interacts with the configurations created by the kube back-end must be 
> easy to reason about, and also that whatever kubernetes features the solution 
> uses are supported on the kubernetes roadmap.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24435) Support user-supplied YAML that can be merged with k8s pod descriptions

2018-05-30 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-24435:
--

 Summary: Support user-supplied YAML that can be merged with k8s 
pod descriptions
 Key: SPARK-24435
 URL: https://issues.apache.org/jira/browse/SPARK-24435
 Project: Spark
  Issue Type: New Feature
  Components: Kubernetes
Affects Versions: 2.3.0
Reporter: Erik Erlandson
 Fix For: 2.4.0


Kubernetes supports a large variety of configurations to Pods. Currently only 
some of these are configurable from Spark, and they all operate by being 
plumbed from --conf arguments through to pod creation in the code.

To avoid the anti-pattern of trying to expose an unbounded Pod feature set 
through Spark configuration keywords, the community is interested in working 
out a sane way of allowing users to supply "arbitrary" Pod YAML which can be 
merged with the pod configurations created by the kube backend.

Multiple solutions have been considered, including Pod Pre-sets and loading Pod 
template objects.  A requirement is that the policy for how user-supplied YAML 
interacts with the configurations created by the kube back-end must be easy to 
reason about, and also that whatever kubernetes features the solution uses are 
supported on the kubernetes roadmap.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24248) [K8S] Use the Kubernetes cluster as the backing store for the state of pods

2018-05-16 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477812#comment-16477812
 ] 

Erik Erlandson commented on SPARK-24248:


Is the design above using re-sync as the fallback for the watcher losing 
connection, or periodic resync as a replacement for the watcher?  Are there any 
potential race-condition issues between a dequeuing thread and the thread 
querying pod states?

> [K8S] Use the Kubernetes cluster as the backing store for the state of pods
> ---
>
> Key: SPARK-24248
> URL: https://issues.apache.org/jira/browse/SPARK-24248
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> We have a number of places in KubernetesClusterSchedulerBackend right now 
> that maintains the state of pods in memory. However, the Kubernetes API can 
> always give us the most up to date and correct view of what our executors are 
> doing. We should consider moving away from in-memory state as much as can in 
> favor of using the Kubernetes cluster as the source of truth for pod status. 
> Maintaining less state in memory makes it so that there's a lower chance that 
> we accidentally miss updating one of these data structures and breaking the 
> lifecycle of executors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-02 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461324#comment-16461324
 ] 

Erik Erlandson commented on SPARK-24135:


> In the case of the executor failing to start at all, this wouldn't be caught 
> by Spark's task failure count logic because you're never going to end up 
> scheduling tasks on these executors that failed to start.

Aha, that argues for allowing a way to give up after repeated pod start 
failures.

> [K8s] Executors that fail to start up because of init-container errors are 
> not retried and limit the executor pool size
> ---
>
> Key: SPARK-24135
> URL: https://issues.apache.org/jira/browse/SPARK-24135
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> In KubernetesClusterSchedulerBackend, we detect if executors disconnect after 
> having been started or if executors hit the {{ERROR}} or {{DELETED}} states. 
> When executors fail in these ways, they are removed from the pending 
> executors pool and the driver should retry requesting these executors.
> However, the driver does not handle a different class of error: when the pod 
> enters the {{Init:Error}} state. This state comes up when the executor fails 
> to launch because one of its init-containers fails. Spark itself doesn't 
> attach any init-containers to the executors. However, custom web hooks can 
> run on the cluster and attach init-containers to the executor pods. 
> Additionally, pod presets can specify init containers to run on these pods. 
> Therefore Spark should be handling the {{Init:Error}} cases regardless of 
> whether Spark itself is aware of init-containers or not.
> This class of error is particularly bad because when we hit this state, the 
> failed executor will never start, but it's still seen as pending by the 
> executor allocator. The executor allocator won't request more rounds of 
> executors because its current batch hasn't been resolved to either running or 
> failed. Therefore we end up being stuck with the number of executors 
> that successfully started before the faulty one failed to start, potentially 
> creating a fake resource bottleneck.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-02 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461154#comment-16461154
 ] 

Erik Erlandson commented on SPARK-24135:


IIRC the dynamic allocation heuristic was to avoid scheduling new executors if 
there were executors still pending, to prevent a positive feedback loop from 
swamping kube with ever-increasing numbers of executor pod scheduling requests. 
How does that interact with the concept of killing a pending executor because 
its pod start is failing?

 

Restarting seems like it would eventually be limited by the job failure limit 
that Spark already has. If pod startup failures are deterministic, the job 
failure count will hit this limit and the job will be killed that way.  That 
isn't mutually exclusive with supporting some maximum number of pod startup 
attempts in the back-end, however.

> [K8s] Executors that fail to start up because of init-container errors are 
> not retried and limit the executor pool size
> ---
>
> Key: SPARK-24135
> URL: https://issues.apache.org/jira/browse/SPARK-24135
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> In KubernetesClusterSchedulerBackend, we detect if executors disconnect after 
> having been started or if executors hit the {{ERROR}} or {{DELETED}} states. 
> When executors fail in these ways, they are removed from the pending 
> executors pool and the driver should retry requesting these executors.
> However, the driver does not handle a different class of error: when the pod 
> enters the {{Init:Error}} state. This state comes up when the executor fails 
> to launch because one of its init-containers fails. Spark itself doesn't 
> attach any init-containers to the executors. However, custom web hooks can 
> run on the cluster and attach init-containers to the executor pods. 
> Additionally, pod presets can specify init containers to run on these pods. 
> Therefore Spark should be handling the {{Init:Error}} cases regardless of 
> whether Spark itself is aware of init-containers or not.
> This class of error is particularly bad because when we hit this state, the 
> failed executor will never start, but it's still seen as pending by the 
> executor allocator. The executor allocator won't request more rounds of 
> executors because its current batch hasn't been resolved to either running or 
> failed. Therefore we end up being stuck with the number of executors 
> that successfully started before the faulty one failed to start, potentially 
> creating a fake resource bottleneck.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-01 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459954#comment-16459954
 ] 

Erik Erlandson commented on SPARK-24135:


I think it makes sense to detect these failure states.  Even if they won't 
resolve by requesting replacement executors, reporting the specific failure 
mode in the error logs should aid in debugging. It could optionally be used as 
grounds for job failure, in the case of repeating executor failures.

> [K8s] Executors that fail to start up because of init-container errors are 
> not retried and limit the executor pool size
> ---
>
> Key: SPARK-24135
> URL: https://issues.apache.org/jira/browse/SPARK-24135
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> In KubernetesClusterSchedulerBackend, we detect if executors disconnect after 
> having been started or if executors hit the {{ERROR}} or {{DELETED}} states. 
> When executors fail in these ways, they are removed from the pending 
> executors pool and the driver should retry requesting these executors.
> However, the driver does not handle a different class of error: when the pod 
> enters the {{Init:Error}} state. This state comes up when the executor fails 
> to launch because one of its init-containers fails. Spark itself doesn't 
> attach any init-containers to the executors. However, custom web hooks can 
> run on the cluster and attach init-containers to the executor pods. 
> Additionally, pod presets can specify init containers to run on these pods. 
> Therefore Spark should be handling the {{Init:Error}} cases regardless of 
> whether Spark itself is aware of init-containers or not.
> This class of error is particularly bad because when we hit this state, the 
> failed executor will never start, but it's still seen as pending by the 
> executor allocator. The executor allocator won't request more rounds of 
> executors because its current batch hasn't been resolved to either running or 
> failed. Therefore we end up being stuck with the number of executors 
> that successfully started before the faulty one failed to start, potentially 
> creating a fake resource bottleneck.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23891) Debian based Dockerfile

2018-04-14 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438410#comment-16438410
 ] 

Erik Erlandson commented on SPARK-23891:


[~SercanKaraoglu] thanks for the information! You are correct; Spark also has a 
netty dep. Can you attach your customized docker file to this JIRA? That would 
be a very useful reference for our ongoing container image discussions.

> Debian based Dockerfile
> ---
>
> Key: SPARK-23891
> URL: https://issues.apache.org/jira/browse/SPARK-23891
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Sercan Karaoglu
>Priority: Minor
>
> The current dockerfile inherits from alpine linux, which causes the netty 
> tcnative ssl bindings to fail while loading; this is the case when we use 
> Google Cloud Platform's Bigtable client on top of a spark cluster. It would 
> be better to have another debian based dockerfile.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23891) Debian based Dockerfile

2018-04-12 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436196#comment-16436196
 ] 

Erik Erlandson commented on SPARK-23891:


I do think that these reports are very useful for collecting data on community 
use cases. Is this incompatibility something fundamental to alpine that can 
only be fixed via debian, or is it possible to hack the alpine build to fix it?

 

> Debian based Dockerfile
> ---
>
> Key: SPARK-23891
> URL: https://issues.apache.org/jira/browse/SPARK-23891
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Sercan Karaoglu
>Priority: Minor
>
> The current dockerfile inherits from alpine linux, which causes the netty 
> tcnative ssl bindings to fail while loading; this is the case when we use 
> Google Cloud Platform's Bigtable client on top of a spark cluster. It would 
> be better to have another debian based dockerfile.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23891) Debian based Dockerfile

2018-04-12 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436185#comment-16436185
 ] 

Erik Erlandson commented on SPARK-23891:


The question of what OS base to use for "canonical" images or dockerfiles is an 
open one. The use of alpine was influenced by the relatively small image size 
that resulted. We could entertain arguments about why debian, centos, or some 
other OS, might be an advantage.

The current position of the Apache Spark project is that the dockerfiles 
shipped with the project are for reference, and as an aid to users building 
their own images for use with the kubernetes back-end.  IMO, the project should 
not get into the business of supporting _multiple_ dockerfiles at the present 
time. In the future, if/when the "container image api" stabilizes further, we 
might reconsider maintaining multiple dockerfiles.

I'm interested to hear if others have a different point of view; my take 
currently is that if users would like to construct similar dockerfiles using an 
alternative base OS, it would be great to publish that as a github project 
where interested community members could use it.

> Debian based Dockerfile
> ---
>
> Key: SPARK-23891
> URL: https://issues.apache.org/jira/browse/SPARK-23891
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Sercan Karaoglu
>Priority: Minor
>
> Current dockerfile inherits from alpine linux which causes netty tcnative ssl 
> bindings to fail while loading which is the case when we use Google Cloud 
> Platforms Bigtable Client on top of spark cluster. would be better to have 
> another debian based dockerfile



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23891) Debian based Dockerfile

2018-04-12 Thread Erik Erlandson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson updated SPARK-23891:
---
  Priority: Minor  (was: Major)
Issue Type: New Feature  (was: Bug)

> Debian based Dockerfile
> ---
>
> Key: SPARK-23891
> URL: https://issues.apache.org/jira/browse/SPARK-23891
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Sercan Karaoglu
>Priority: Minor
>
> The current dockerfile inherits from alpine linux, which causes the netty 
> tcnative ssl bindings to fail while loading; this is the case when we use 
> Google Cloud Platform's Bigtable client on top of a spark cluster. It would 
> be better to have another debian based dockerfile.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23680) entrypoint.sh does not accept arbitrary UIDs, returning as an error

2018-03-16 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402266#comment-16402266
 ] 

Erik Erlandson commented on SPARK-23680:


commit workflow indicates to set the Assignee, however I cannot edit that field

> entrypoint.sh does not accept arbitrary UIDs, returning as an error
> ---
>
> Key: SPARK-23680
> URL: https://issues.apache.org/jira/browse/SPARK-23680
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
> Environment: OpenShift
>Reporter: Ricardo Martinelli de Oliveira
>Priority: Major
>  Labels: easyfix
>
> Openshift supports running pods using arbitrary UIDs 
> ([https://docs.openshift.com/container-platform/3.7/creating_images/guidelines.html#openshift-specific-guidelines)]
>   to improve security. Although entrypoint.sh was developed to cover this 
> feature, the script is returning an error[1].
> The issue is that the script uses getent to find the passwd entry of the 
> current UID, and if the entry is not found it creates an entry in 
> /etc/passwd. According to the getent man page:
> {code:java}
> EXIT STATUS
>    One of the following exit values can be returned by getent:
>    0  Command completed successfully.
>    1  Missing arguments, or database unknown.
>    2  One or more supplied key could not be found in the database.
>    3  Enumeration not supported on this database.
> {code}
> And since the script begins with a "set -ex" command, it turns on debug mode 
> and aborts the script if a command pipeline returns an exit code other 
> than 0.
> Having said that, the line below must be changed to remove the "-e" flag 
> from the set command:
> https://github.com/apache/spark/blob/v2.3.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L20
>  
>  
> [1]https://github.com/apache/spark/blob/v2.3.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L25-L34
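
A minimal sketch of the kind of fix being discussed, tolerating a missing 
passwd entry instead of letting {{set -e}} abort (illustrative, not the merged 
patch):

{code}
# run with an arbitrary, anonymous uid (e.g. under OpenShift)
myuid=$(id -u)
mygid=$(id -g)
# getent exits 2 when the uid has no passwd entry; don't let `set -e` kill us
uidentry=$(getent passwd "$myuid" || true)
if [ -z "$uidentry" ]; then
  echo "$myuid:x:$myuid:$mygid:anonymous uid:$SPARK_HOME:/bin/false" >> /etc/passwd
fi
{code}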



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23680) entrypoint.sh does not accept arbitrary UIDs, returning as an error

2018-03-16 Thread Erik Erlandson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson resolved SPARK-23680.

  Resolution: Fixed
Target Version/s: 2.3.1, 2.4.0

merged to master

> entrypoint.sh does not accept arbitrary UIDs, returning as an error
> ---
>
> Key: SPARK-23680
> URL: https://issues.apache.org/jira/browse/SPARK-23680
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
> Environment: OpenShift
>Reporter: Ricardo Martinelli de Oliveira
>Priority: Major
>  Labels: easyfix
>
> Openshift supports running pods using arbitrary UIDs 
> ([https://docs.openshift.com/container-platform/3.7/creating_images/guidelines.html#openshift-specific-guidelines)]
>   to improve security. Although entrypoint.sh was developed to cover this 
> feature, the script is returning an error[1].
> The issue is that the script uses getent to find the passwd entry of the 
> current UID, and if the entry is not found it creates an entry in 
> /etc/passwd. According to the getent man page:
> {code:java}
> EXIT STATUS
>    One of the following exit values can be returned by getent:
>    0  Command completed successfully.
>    1  Missing arguments, or database unknown.
>    2  One or more supplied key could not be found in the database.
>    3  Enumeration not supported on this database.
> {code}
> And since the script begins with a "set -ex" command, it turns on debug mode 
> and aborts the script if a command pipeline returns an exit code other 
> than 0.
> Having said that, the line below must be changed to remove the "-e" flag 
> from the set command:
> https://github.com/apache/spark/blob/v2.3.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L20
>  
>  
> [1]https://github.com/apache/spark/blob/v2.3.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L25-L34



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23680) entrypoint.sh does not accept arbitrary UIDs, returning as an error

2018-03-14 Thread Erik Erlandson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson updated SPARK-23680:
---
 Flags: Important
Labels: easyfix  (was: )

> entrypoint.sh does not accept arbitrary UIDs, returning as an error
> ---
>
> Key: SPARK-23680
> URL: https://issues.apache.org/jira/browse/SPARK-23680
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
> Environment: OpenShift
>Reporter: Ricardo Martinelli de Oliveira
>Priority: Major
>  Labels: easyfix
>
> Openshift supports running pods using arbitrary UIDs 
> ([https://docs.openshift.com/container-platform/3.7/creating_images/guidelines.html#openshift-specific-guidelines)]
>   to improve security. Although entrypoint.sh was developed to cover this 
> feature, the script is returning an error[1].
> The issue is that the script uses getent to find the passwd entry of the 
> current UID, and if the entry is not found it creates an entry in 
> /etc/passwd. According to the getent man page:
> {code:java}
> EXIT STATUS
>    One of the following exit values can be returned by getent:
>    0  Command completed successfully.
>    1  Missing arguments, or database unknown.
>    2  One or more supplied key could not be found in the database.
>    3  Enumeration not supported on this database.
> {code}
> And since the script begins with a "set -ex" command, it turns on debug mode 
> and aborts the script if a command pipeline returns an exit code other 
> than 0.
> Having said that, the line below must be changed to remove the "-e" flag 
> from the set command:
> https://github.com/apache/spark/blob/v2.3.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L20
>  
>  
> [1]https://github.com/apache/spark/blob/v2.3.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L25-L34



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23680) entrypoint.sh does not accept arbitrary UIDs, returning as an error

2018-03-14 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398701#comment-16398701
 ] 

Erik Erlandson commented on SPARK-23680:


[~rmartine] thanks for catching this! It will impact platforms running w/ 
anonymous uid such as OpenShift.

> entrypoint.sh does not accept arbitrary UIDs, returning as an error
> ---
>
> Key: SPARK-23680
> URL: https://issues.apache.org/jira/browse/SPARK-23680
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
> Environment: OpenShift
>Reporter: Ricardo Martinelli de Oliveira
>Priority: Major
>  Labels: easyfix
>
> Openshift supports running pods using arbitrary UIDs 
> ([https://docs.openshift.com/container-platform/3.7/creating_images/guidelines.html#openshift-specific-guidelines)]
>   to improve security. Although entrypoint.sh was developed to cover this 
> feature, the script is returning an error[1].
> The issue is that the script uses getent to find the passwd entry of the 
> current UID, and if the entry is not found it creates an entry in 
> /etc/passwd. According to the getent man page:
> {code:java}
> EXIT STATUS
>    One of the following exit values can be returned by getent:
>    0  Command completed successfully.
>    1  Missing arguments, or database unknown.
>    2  One or more supplied key could not be found in the database.
>    3  Enumeration not supported on this database.
> {code}
> And since the script begins with a "set -ex" command, it turns on debug mode 
> and aborts the script if a command pipeline returns an exit code other 
> than 0.
> Having said that, the line below must be changed to remove the "-e" flag 
> from the set command:
> https://github.com/apache/spark/blob/v2.3.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L20
>  
>  
> [1]https://github.com/apache/spark/blob/v2.3.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L25-L34



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23324) Announce new Kubernetes back-end for 2.3 release notes

2018-02-02 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16351020#comment-16351020
 ] 

Erik Erlandson commented on SPARK-23324:


cc [~sameer], [~foxish]

> Announce new Kubernetes back-end for 2.3 release notes
> --
>
> Key: SPARK-23324
> URL: https://issues.apache.org/jira/browse/SPARK-23324
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Kubernetes
>Affects Versions: 2.3.0
>Reporter: Erik Erlandson
>Priority: Major
>  Labels: documentation, kubernetes, release_notes
>
> This is an issue to request that the new Kubernetes scheduler back-end gets 
> called out in the 2.3 release notes, as it is a prominent new feature.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23324) Announce new Kubernetes back-end for 2.3 release notes

2018-02-02 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-23324:
--

 Summary: Announce new Kubernetes back-end for 2.3 release notes
 Key: SPARK-23324
 URL: https://issues.apache.org/jira/browse/SPARK-23324
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, Kubernetes
Affects Versions: 2.3.0
Reporter: Erik Erlandson


This is an issue to request that the new Kubernetes scheduler back-end gets 
called out in the 2.3 release notes, as it is a prominent new feature.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23137) spark.kubernetes.executor.podNamePrefix is ignored

2018-01-17 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329721#comment-16329721
 ] 

Erik Erlandson commented on SPARK-23137:


+1, a more general "app prefix" seems more useful

> spark.kubernetes.executor.podNamePrefix is ignored
> --
>
> Key: SPARK-23137
> URL: https://issues.apache.org/jira/browse/SPARK-23137
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>
> [~liyinan926] is fixing this as we speak. Should be a very minor change.
> It's also a non-critical option, so, if we decide that the safer thing is to 
> just remove it, we can do that as well. Will leave that decision to the 
> release czar and reviewers.
>  
> [~vanzin] [~felixcheung] [~sameerag]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22647) Docker files for image creation

2017-12-13 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289988#comment-16289988
 ] 

Erik Erlandson commented on SPARK-22647:


I'd like to propose migrating our images onto centos, which should also fix 
this particular issue.

> Docker files for image creation
> ---
>
> Key: SPARK-22647
> URL: https://issues.apache.org/jira/browse/SPARK-22647
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>
> This covers the dockerfiles that need to be shipped to enable the Kubernetes 
> backend for Spark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21277) Spark is invoking an incorrect serializer after UDAF completion

2017-07-05 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16075818#comment-16075818
 ] 

Erik Erlandson commented on SPARK-21277:


It would be ideal to document the requirement that all array data must be 
serialized via {{UnsafeArrayData}} for a UDT.  The obvious place would be on 
{{UserDefinedType}}, however now that it is no longer a public class there's no 
channel there for scaladoc.

> Spark is invoking an incorrect serializer after UDAF completion
> ---
>
> Key: SPARK-21277
> URL: https://issues.apache.org/jira/browse/SPARK-21277
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0
>Reporter: Erik Erlandson
>
> I'm writing a UDAF that also requires some custom UDT implementations.  The 
> UDAF (and UDT) logic appear to be executing properly up through the final 
> UDAF call to the {{evaluate}} method. However, after the evaluate method 
> completes, I am seeing the UDT {{deserialize}} method being called again, 
> but this time it is being invoked on data that wasn't produced by 
> my corresponding {{serialize}} method, and it is crashing.  The following 
> REPL output shows the execution and completion of {{evaluate}}, and then 
> another call to {{deserialize}} that sees some kind of {{UnsafeArrayData}} 
> object that my serialization doesn't produce, and so the method fails:
> {code}entering evaluate
> a= 
> [[0.5,10,2,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@f1813f2c,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@b3587fc7],[0.5,10,4,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@d3065487,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@f1a5ace9],[0.5,10,4,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@d01fbbcf,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@f1a5ace9]]
> leaving evaluate
> a= org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@27d73513
> java.lang.RuntimeException: Error while decoding: 
> java.lang.UnsupportedOperationException: Not supported on UnsafeArrayData.
> createexternalrow(newInstance(class 
> org.apache.spark.isarnproject.sketches.udt.TDigestArrayUDT).deserialize, 
> StructField(tdigestmlvecudaf(features),TDigestArrayUDT,true))
> {code}
> To reproduce, check out the branch {{first-cut}} of {{isarn-sketches-spark}}:
> https://github.com/erikerlandson/isarn-sketches-spark/tree/first-cut
> Then invoke {{xsbt console}} to get a REPL with a spark session.  In the REPL 
> execute:
> {code}
> Welcome to Scala 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_131).
> Type in expressions for evaluation. Or try :help.
> scala> import org.apache.spark.ml.linalg.Vectors
> scala> val training = spark.createDataFrame(Seq((1.0, Vectors.dense(0.0, 1.1, 
> 0.1)),(0.0, Vectors.dense(2.0, 1.0, -1.0)),(0.0, Vectors.dense(2.0, 1.3, 
> 1.0)),(1.0, Vectors.dense(0.0, 1.2, -0.5)))).toDF("label", "features")
> training: org.apache.spark.sql.DataFrame = [label: double, features: vector]
> scala> val featTD = 
> training.agg(TDigestMLVecUDAF(0.5,10)(training("features")))
> featTD: org.apache.spark.sql.DataFrame = [tdigestmlvecudaf(features): 
> tdigestarray]
> scala> featTD.first
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21277) Spark is invoking an incorrect serializer after UDAF completion

2017-07-01 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-21277:
--

 Summary: Spark is invoking an incorrect serializer after UDAF 
completion
 Key: SPARK-21277
 URL: https://issues.apache.org/jira/browse/SPARK-21277
 Project: Spark
  Issue Type: Bug
  Components: Optimizer, SQL
Affects Versions: 2.1.0
Reporter: Erik Erlandson


I'm writing a UDAF that also requires some custom UDT implementations.  The 
UDAF (and UDT) logic appear to be executing properly up through the final UDAF 
call to the {{evaluate}} method. However, after the evaluate method completes, 
I am seeing the UDT {{deserialize}} method being called again, but this time 
it is being invoked on data that wasn't produced by my corresponding 
{{serialize}} method, and it is crashing.  The following REPL output shows the 
execution and completion of {{evaluate}}, and then another call to 
{{deserialize}} that sees some kind of {{UnsafeArrayData}} object that my 
serialization doesn't produce, and so the method fails:

{code}entering evaluate
a= 
[[0.5,10,2,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@f1813f2c,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@b3587fc7],[0.5,10,4,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@d3065487,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@f1a5ace9],[0.5,10,4,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@d01fbbcf,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@f1a5ace9]]
leaving evaluate
a= org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@27d73513
java.lang.RuntimeException: Error while decoding: 
java.lang.UnsupportedOperationException: Not supported on UnsafeArrayData.
createexternalrow(newInstance(class 
org.apache.spark.isarnproject.sketches.udt.TDigestArrayUDT).deserialize, 
StructField(tdigestmlvecudaf(features),TDigestArrayUDT,true))
{code}

To reproduce, check out the branch {{first-cut}} of {{isarn-sketches-spark}}:
https://github.com/erikerlandson/isarn-sketches-spark/tree/first-cut

Then invoke {{xsbt console}} to get a REPL with a spark session.  In the REPL 
execute:
{code}
Welcome to Scala 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_131).
Type in expressions for evaluation. Or try :help.

scala> import org.apache.spark.ml.linalg.Vectors
scala> val training = spark.createDataFrame(Seq((1.0, Vectors.dense(0.0, 1.1, 
0.1)),(0.0, Vectors.dense(2.0, 1.0, -1.0)),(0.0, Vectors.dense(2.0, 1.3, 
1.0)),(1.0, Vectors.dense(0.0, 1.2, -0.5)))).toDF("label", "features")
training: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> val featTD = training.agg(TDigestMLVecUDAF(0.5,10)(training("features")))
featTD: org.apache.spark.sql.DataFrame = [tdigestmlvecudaf(features): 
tdigestarray]

scala> featTD.first
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python

2017-06-27 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065787#comment-16065787
 ] 

Erik Erlandson commented on SPARK-10915:


This would be great for exposing {{TDigest}} aggregation to py-spark datasets.  
(see https://github.com/isarn/isarn-sketches#t-digest)

Currently the newer {{Aggregator}} trait makes this easy to do for datasets in 
Scala.  Writing the alternative {{UserDefinedAggregateFunction}} is possible, 
although I'd have to code my own serializer for a TDigest UDT instead of just 
using {{Encoders.kryo}}.  But exposing a UDAF to python is a hack at best (see 
https://stackoverflow.com/a/33257733/3669757).
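
For example, a minimal {{Aggregator}} sketch (with a hypothetical {{Sketch}} 
type standing in for a TDigest) showing how {{Encoders.kryo}} avoids a 
hand-written UDT for the aggregation buffer:

{code}
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// hypothetical stand-in for a TDigest-like sketch
case class Sketch(sum: Double, count: Long) {
  def add(x: Double): Sketch = Sketch(sum + x, count + 1L)
  def merge(that: Sketch): Sketch = Sketch(sum + that.sum, count + that.count)
}

object SketchAgg extends Aggregator[Double, Sketch, Double] {
  def zero: Sketch = Sketch(0.0, 0L)
  def reduce(b: Sketch, x: Double): Sketch = b.add(x)
  def merge(b1: Sketch, b2: Sketch): Sketch = b1.merge(b2)
  def finish(b: Sketch): Double = b.sum / b.count
  // kryo serializes the buffer type directly -- no custom UDT required
  def bufferEncoder: Encoder[Sketch] = Encoders.kryo[Sketch]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// usage on a Dataset[Double]:  ds.select(SketchAgg.toColumn)
{code}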


> Add support for UDAFs in Python
> ---
>
> Key: SPARK-10915
> URL: https://issues.apache.org/jira/browse/SPARK-10915
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Justin Uang
>
> This should support python defined lambdas.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21026) Document jenkins plug-ins assumed by the spark documentation build

2017-06-08 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-21026:
--

 Summary: Document jenkins plug-ins assumed by the spark 
documentation build
 Key: SPARK-21026
 URL: https://issues.apache.org/jira/browse/SPARK-21026
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 2.1.1
Reporter: Erik Erlandson


I haven't been able to find documentation on what plug-ins the spark doc build 
assumes for jenkins.  Is there a list somewhere, or a gemfile?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-12-08 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15733439#comment-15733439
 ] 

Erik Erlandson commented on SPARK-18278:


As I understand it (and as I've built them) an "MVP" Apache Spark docker image 
consists of:

1. Some base OS image, presumably some variant of linux, with some package 
management, but starting from some minimalist install
2. A spark-compatible JRE
3. (if python support) whatever standard python installs are required to run 
py-spark
4. A Spark distro, likely installed from an official distro tarball

Hopefully I'm not over-simplifying, but IIUC the licensing around all of those 
is well understood and known to be FOSS compatible.  Other non-minimal or 
non-standard image builds are definitely possible, but I'd consider those to 
be under the purview of 3rd parties in the community.
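
For illustration only, a minimal dockerfile along those lines (base image, 
versions, and paths here are assumptions, not a project artifact):

{code}
# 1. minimalist linux base with package management
FROM debian:stretch-slim

# 2. a spark-compatible JRE, and 3. python for py-spark
RUN apt-get update && \
    apt-get install -y --no-install-recommends openjdk-8-jre-headless python && \
    rm -rf /var/lib/apt/lists/*

# 4. a Spark distro unpacked from an official tarball
COPY spark-2.2.0-bin-hadoop2.7 /opt/spark
ENV SPARK_HOME /opt/spark
WORKDIR /opt/spark
{code}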

Publishing "official Apache Spark" images would imply some new 
responsibilities, including maintenance.  A possible roadmap might be to add 
"official" images as part of a subsequent phase, drawing on experience with 
phase 1.  A separate registry organization could in principle be used, for 
example: https://hub.docker.com/u/k8s4spark/

A consequence of not having such an official image is that integration testing 
would then be based, at least initially, on 3rd-party images.


> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Reporter: Erik Erlandson
> Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf
>
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a kubernetes cluster.   The submitted application runs 
> in a driver executing on a kubernetes pod, and executors lifecycles are also 
> managed as pods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-12-03 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15718249#comment-15718249
 ] 

Erik Erlandson commented on SPARK-18278:


Not publishing images puts users in the position of not being able to run this 
out-of-the-box.  First they would have to either build images themselves, or 
find somebody else's 3rd-party images, etc.  It doesn't seem like it would make 
for good UX.

> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Reporter: Erik Erlandson
> Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf
>
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a kubernetes cluster.   The submitted application runs 
> in a driver executing on a kubernetes pod, and executors lifecycles are also 
> managed as pods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-12-03 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15718240#comment-15718240
 ] 

Erik Erlandson commented on SPARK-18278:


A possible scheme might be to publish the Dockerfiles, but not actually build 
the images.  However, it seems more standard to publish actual images for the 
community.  Is there some reason for not wanting to do that?

> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Reporter: Erik Erlandson
> Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf
>
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a kubernetes cluster. The submitted application runs 
> in a driver executing in a kubernetes pod, and executor lifecycles are also 
> managed as pods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-11-07 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15644919#comment-15644919
 ] 

Erik Erlandson commented on SPARK-18278:


Another comment on external plug-ins for scheduling: although I think it's a 
good idea to support them, it does introduce a maintenance burden: keeping 
external scheduling packages synced with the main Apache Spark project.  That 
is another argument for first-class support for schedulers of sufficient 
importance to the community.

> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Reporter: Erik Erlandson
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a kubernetes cluster. The submitted application runs 
> in a driver executing in a kubernetes pod, and executor lifecycles are also 
> managed as pods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-11-07 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15644841#comment-15644841
 ] 

Erik Erlandson commented on SPARK-18278:


I agree with [~willbenton] that Kube is a sufficiently popular container mgmt 
system that it warrants "first-class" sub-project status for Apache Spark.

I'm also interested in modifying the Spark scheduler support so that it is 
easier to plug in new schedulers externally.  I believe the necessary 
modifications would not be very intrusive: the system is already based on 
sub-classing abstract traits, so it would mostly be a matter of increasing 
their exposure.
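
For illustration, here's a minimal sketch of what "increasing their exposure" 
could look like.  The traits and method names below are hypothetical 
stand-ins I'm inventing for this sketch, not Spark's actual internal API:

{code}
// Hypothetical sketch only: illustrative stand-ins for Spark's internal
// scheduler abstractions, not its actual API.

// The kind of abstract trait the scheduler already subclasses internally;
// "increasing exposure" would mean making a seam like this stable and public.
trait SchedulerBackendSketch {
  def start(): Unit
  def stop(): Unit
  def requestExecutors(count: Int): Boolean
}

// An externally-registered plug-in point, keyed off the master URL.
trait ClusterManagerPlugin {
  def canCreate(masterUrl: String): Boolean
  def createBackend(masterUrl: String): SchedulerBackendSketch
}

// A Kubernetes scheduler could then live outside the main Spark tree:
class KubernetesClusterPlugin extends ClusterManagerPlugin {
  def canCreate(masterUrl: String): Boolean = masterUrl.startsWith("k8s://")

  def createBackend(masterUrl: String): SchedulerBackendSketch =
    new SchedulerBackendSketch {
      def start(): Unit = { /* create driver-owned executor pods */ }
      def stop(): Unit = { /* tear down executor pods */ }
      def requestExecutors(count: Int): Boolean = true
    }
}
{code}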



> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Reporter: Erik Erlandson
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a kubernetes cluster. The submitted application runs 
> in a driver executing in a kubernetes pod, and executor lifecycles are also 
> managed as pods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-11-04 Thread Erik Erlandson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson updated SPARK-18278:
---
External issue URL: https://github.com/kubernetes/kubernetes/issues/34377
 External issue ID: #34377

> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Affects Versions: 2.2.0
>Reporter: Erik Erlandson
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a kubernetes cluster. The submitted application runs 
> in a driver executing in a kubernetes pod, and executor lifecycles are also 
> managed as pods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-11-04 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15637591#comment-15637591
 ] 

Erik Erlandson commented on SPARK-18278:


Current prototype:
https://github.com/foxish/spark/tree/k8s-support
https://github.com/foxish/spark/pull/1

> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Affects Versions: 2.2.0
>Reporter: Erik Erlandson
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a kubernetes cluster. The submitted application runs 
> in a driver executing in a kubernetes pod, and executor lifecycles are also 
> managed as pods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-11-04 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-18278:
--

 Summary: Support native submission of spark jobs to a kubernetes 
cluster
 Key: SPARK-18278
 URL: https://issues.apache.org/jira/browse/SPARK-18278
 Project: Spark
  Issue Type: Umbrella
  Components: Build, Deploy, Documentation, Scheduler, Spark Core
Affects Versions: 2.2.0
Reporter: Erik Erlandson


A new Apache Spark sub-project that enables native support for submitting Spark 
applications to a kubernetes cluster. The submitted application runs in a 
driver executing in a kubernetes pod, and executor lifecycles are also managed 
as pods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


