[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python

2017-06-27 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065787#comment-16065787
 ] 

Erik Erlandson commented on SPARK-10915:


This would be great for exposing {{TDigest}} aggregation to PySpark datasets.
(see https://github.com/isarn/isarn-sketches#t-digest)

Currently the newer {{Aggregator}} trait makes this easy to do for Datasets in
Scala.  Writing the alternative {{UserDefinedAggregateFunction}} is possible,
although I'd have to code my own serializer for a TDigest UDT instead of just
using {{Encoders.kryo}}.  But exposing a UDAF to Python is a hack at best (see
https://stackoverflow.com/a/33257733/3669757)
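
For reference, here is a minimal sketch of the {{Aggregator}} + {{Encoders.kryo}} pattern mentioned above. The {{QuantileSketch}} class is an illustrative stand-in for a TDigest-like structure, not the isarn API:

{code}
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// stand-in for a TDigest-like sketch; not the isarn API
class QuantileSketch(val samples: Vector[Double]) extends Serializable {
  def add(x: Double): QuantileSketch = new QuantileSketch(samples :+ x)
  def merge(that: QuantileSketch): QuantileSketch = new QuantileSketch(samples ++ that.samples)
}

object SketchAgg extends Aggregator[Double, QuantileSketch, QuantileSketch] {
  def zero: QuantileSketch = new QuantileSketch(Vector.empty)
  def reduce(s: QuantileSketch, x: Double): QuantileSketch = s.add(x)
  def merge(a: QuantileSketch, b: QuantileSketch): QuantileSketch = a.merge(b)
  def finish(s: QuantileSketch): QuantileSketch = s
  // Encoders.kryo supplies the buffer/output serialization for the opaque sketch
  // type; a UserDefinedAggregateFunction would need a hand-written UDT instead
  def bufferEncoder: Encoder[QuantileSketch] = Encoders.kryo[QuantileSketch]
  def outputEncoder: Encoder[QuantileSketch] = Encoders.kryo[QuantileSketch]
}

// usage on a Dataset[Double]:
// val sketched = ds.select(SketchAgg.toColumn)
{code}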


> Add support for UDAFs in Python
> ---
>
> Key: SPARK-10915
> URL: https://issues.apache.org/jira/browse/SPARK-10915
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Justin Uang
>
> This should support python defined lambdas.






[jira] [Created] (SPARK-21277) Spark is invoking an incorrect serializer after UDAF completion

2017-07-01 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-21277:
--

 Summary: Spark is invoking an incorrect serializer after UDAF 
completion
 Key: SPARK-21277
 URL: https://issues.apache.org/jira/browse/SPARK-21277
 Project: Spark
  Issue Type: Bug
  Components: Optimizer, SQL
Affects Versions: 2.1.0
Reporter: Erik Erlandson


I'm writing a UDAF that also requires some custom UDT implementations.  The 
UDAF (and UDT) logic appears to execute properly up through the final UDAF 
call to the {{evaluate}} method. However, after {{evaluate}} completes, I am 
seeing the UDT {{deserialize}} method being called one more time, and this 
time it is invoked on data that wasn't produced by my corresponding 
{{serialize}} method, so it crashes.  The following REPL output shows the 
execution and completion of {{evaluate}}, followed by another call to 
{{deserialize}} that receives some kind of {{UnsafeArrayData}} object that my 
serialization doesn't produce, and so the method fails:

{code}entering evaluate
a= 
[[0.5,10,2,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@f1813f2c,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@b3587fc7],[0.5,10,4,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@d3065487,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@f1a5ace9],[0.5,10,4,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@d01fbbcf,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@f1a5ace9]]
leaving evaluate
a= org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@27d73513
java.lang.RuntimeException: Error while decoding: 
java.lang.UnsupportedOperationException: Not supported on UnsafeArrayData.
createexternalrow(newInstance(class 
org.apache.spark.isarnproject.sketches.udt.TDigestArrayUDT).deserialize, 
StructField(tdigestmlvecudaf(features),TDigestArrayUDT,true))
{code}

To reproduce, check out the branch {{first-cut}} of {{isarn-sketches-spark}}:
https://github.com/erikerlandson/isarn-sketches-spark/tree/first-cut

Then invoke {{xsbt console}} to get a REPL with a spark session.  In the REPL 
execute:
{code}
Welcome to Scala 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_131).
Type in expressions for evaluation. Or try :help.

scala> val training = spark.createDataFrame(Seq((1.0, Vectors.dense(0.0, 1.1, 0.1)), (0.0, Vectors.dense(2.0, 1.0, -1.0)), (0.0, Vectors.dense(2.0, 1.3, 1.0)), (1.0, Vectors.dense(0.0, 1.2, -0.5)))).toDF("label", "features")
training: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> val featTD = training.agg(TDigestMLVecUDAF(0.5,10)(training("features")))
featTD: org.apache.spark.sql.DataFrame = [tdigestmlvecudaf(features): 
tdigestarray]

scala> featTD.first
{code}






[jira] [Commented] (SPARK-21277) Spark is invoking an incorrect serializer after UDAF completion

2017-07-05 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075818#comment-16075818
 ] 

Erik Erlandson commented on SPARK-21277:


It would be ideal to document the requirement that all array data in a UDT must 
be serialized via {{UnsafeArrayData}}.  The obvious place would be on 
{{UserDefinedType}}; however, now that it is no longer a public class, there's 
no channel there for scaladoc.
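
As a concrete illustration, here is a minimal sketch of that pattern (names are illustrative and not from isarn-sketches-spark; the class must live under an {{org.apache.spark}} sub-package precisely because {{UserDefinedType}} is no longer public):

{code}
package org.apache.spark.example.udt

import org.apache.spark.sql.types._
import org.apache.spark.sql.catalyst.util.ArrayData
import org.apache.spark.sql.catalyst.expressions.UnsafeArrayData

case class DoubleVec(values: Array[Double])

class DoubleVecUDT extends UserDefinedType[DoubleVec] {
  def sqlType: DataType = ArrayType(DoubleType, containsNull = false)

  // store the payload as UnsafeArrayData so it survives Catalyst's internal
  // row conversions that can happen after evaluate()
  def serialize(obj: DoubleVec): Any =
    UnsafeArrayData.fromPrimitiveArray(obj.values)

  // match on ArrayData rather than a concrete class: Catalyst may hand back
  // an UnsafeArrayData here instead of the object produced by serialize()
  def deserialize(datum: Any): DoubleVec = datum match {
    case a: ArrayData => DoubleVec(a.toDoubleArray())
  }

  def userClass: Class[DoubleVec] = classOf[DoubleVec]
}
{code}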

> Spark is invoking an incorrect serializer after UDAF completion
> ---
>
> Key: SPARK-21277
> URL: https://issues.apache.org/jira/browse/SPARK-21277
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0
>Reporter: Erik Erlandson
>
> I'm writing a UDAF that also requires some custom UDT implementations.  The 
> UDAF (and UDT) logic appears to execute properly up through the final UDAF 
> call to the {{evaluate}} method. However, after {{evaluate}} completes, I am 
> seeing the UDT {{deserialize}} method being called one more time, and this 
> time it is invoked on data that wasn't produced by my corresponding 
> {{serialize}} method, so it crashes.  The following REPL output shows the 
> execution and completion of {{evaluate}}, followed by another call to 
> {{deserialize}} that receives some kind of {{UnsafeArrayData}} object that my 
> serialization doesn't produce, and so the method fails:
> {code}entering evaluate
> a= 
> [[0.5,10,2,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@f1813f2c,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@b3587fc7],[0.5,10,4,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@d3065487,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@f1a5ace9],[0.5,10,4,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@d01fbbcf,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@f1a5ace9]]
> leaving evaluate
> a= org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@27d73513
> java.lang.RuntimeException: Error while decoding: 
> java.lang.UnsupportedOperationException: Not supported on UnsafeArrayData.
> createexternalrow(newInstance(class 
> org.apache.spark.isarnproject.sketches.udt.TDigestArrayUDT).deserialize, 
> StructField(tdigestmlvecudaf(features),TDigestArrayUDT,true))
> {code}
> To reproduce, check out the branch {{first-cut}} of {{isarn-sketches-spark}}:
> https://github.com/erikerlandson/isarn-sketches-spark/tree/first-cut
> Then invoke {{xsbt console}} to get a REPL with a spark session.  In the REPL 
> execute:
> {code}
> Welcome to Scala 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_131).
> Type in expressions for evaluation. Or try :help.
> scala> val training = spark.createDataFrame(Seq((1.0, Vectors.dense(0.0, 1.1, 0.1)), (0.0, Vectors.dense(2.0, 1.0, -1.0)), (0.0, Vectors.dense(2.0, 1.3, 1.0)), (1.0, Vectors.dense(0.0, 1.2, -0.5)))).toDF("label", "features")
> training: org.apache.spark.sql.DataFrame = [label: double, features: vector]
> scala> val featTD = 
> training.agg(TDigestMLVecUDAF(0.5,10)(training("features")))
> featTD: org.apache.spark.sql.DataFrame = [tdigestmlvecudaf(features): 
> tdigestarray]
> scala> featTD.first
> {code}






[jira] [Created] (SPARK-27296) User Defined Aggregating Functions (UDAFs) have a major efficiency problem

2019-03-27 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-27296:
--

 Summary: User Defined Aggregating Functions (UDAFs) have a major 
efficiency problem
 Key: SPARK-27296
 URL: https://issues.apache.org/jira/browse/SPARK-27296
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL, Structured Streaming
Affects Versions: 2.4.0, 2.3.3, 3.0.0
Reporter: Erik Erlandson


Spark's UDAFs appear to be serializing and de-serializing to/from the 
MutableAggregationBuffer for each row.  This gist shows a small reproducing 
UDAF and a Spark shell session:

[https://gist.github.com/erikerlandson/3c4d8c6345d1521d89e0d894a423046f]

The UDAF and its companion UDT are designed to count the number of times that 
ser/de is invoked for the aggregator.  The Spark shell session demonstrates 
that ser/de is executed on every row of the data frame.

Note, Spark's pre-defined aggregators do not have this problem, as they are 
based on an internal aggregating trait that does the correct thing and only 
calls ser/de at points such as partition boundaries, presenting final results, 
etc.

This is a major problem for UDAFs, as it means that every UDAF is doing a 
massive amount of unnecessary work per row, including but not limited to Row 
object allocations. For a more realistic UDAF with its own non-trivial 
internal structure, the overhead is that much worse.
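
For illustration, here is a self-contained sketch in the spirit of the linked gist (this is not the gist itself, and all names are hypothetical): the UDT bumps global counters on each serialize/deserialize, and because the UDAF buffer holds a UDT value, running an aggregation shows the counters growing once per input row.

{code}
package org.apache.spark.example.agg

import java.util.concurrent.atomic.AtomicLong

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

object SerDeCount {
  val serCalls = new AtomicLong(0L)
  val deserCalls = new AtomicLong(0L)
}

@SQLUserDefinedType(udt = classOf[CounterUDT])
case class Counter(n: Long)

class CounterUDT extends UserDefinedType[Counter] {
  def sqlType: DataType = LongType
  def serialize(obj: Counter): Any = { SerDeCount.serCalls.incrementAndGet(); obj.n }
  def deserialize(datum: Any): Counter = {
    SerDeCount.deserCalls.incrementAndGet()
    Counter(datum.asInstanceOf[Long])
  }
  def userClass: Class[Counter] = classOf[Counter]
}

class CountingUDAF extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("x", DoubleType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("c", new CounterUDT) :: Nil)
  def dataType: DataType = new CounterUDT
  def deterministic: Boolean = true
  def initialize(buf: MutableAggregationBuffer): Unit = buf(0) = Counter(0L)
  // every update() reads the UDT value out of the buffer (deserialize) and
  // writes it back (serialize), so the ser/de cost is paid once per input row
  def update(buf: MutableAggregationBuffer, input: Row): Unit =
    buf(0) = Counter(buf.getAs[Counter](0).n + 1L)
  def merge(buf: MutableAggregationBuffer, other: Row): Unit =
    buf(0) = Counter(buf.getAs[Counter](0).n + other.getAs[Counter](0).n)
  def evaluate(buf: Row): Any = buf.getAs[Counter](0)
}
{code}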






[jira] [Commented] (SPARK-27296) User Defined Aggregating Functions (UDAFs) have a major efficiency problem

2019-03-28 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16804176#comment-16804176
 ] 

Erik Erlandson commented on SPARK-27296:


My initial proposal would be to alter the logic underneath
{code:java}
register(name: String, udaf: UserDefinedAggregateFunction){code}
so that the UDAF gets hooked to a TypedImperativeAggregate, and registered in 
the same way that objects like CountMinSketchAgg are.

> User Defined Aggregating Functions (UDAFs) have a major efficiency problem
> --
>
> Key: SPARK-27296
> URL: https://issues.apache.org/jira/browse/SPARK-27296
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL, Structured Streaming
>Affects Versions: 2.3.3, 2.4.0, 3.0.0
>Reporter: Erik Erlandson
>Priority: Major
>  Labels: performance, usability
>
> Spark's UDAFs appear to be serializing and de-serializing to/from the 
> MutableAggregationBuffer for each row.  This gist shows a small reproducing 
> UDAF and a Spark shell session:
> [https://gist.github.com/erikerlandson/3c4d8c6345d1521d89e0d894a423046f]
> The UDAF and its companion UDT are designed to count the number of times 
> that ser/de is invoked for the aggregator.  The Spark shell session 
> demonstrates that ser/de is executed on every row of the data frame.
> Note, Spark's pre-defined aggregators do not have this problem, as they are 
> based on an internal aggregating trait that does the correct thing and only 
> calls ser/de at points such as partition boundaries, presenting final 
> results, etc.
> This is a major problem for UDAFs, as it means that every UDAF is doing a 
> massive amount of unnecessary work per row, including but not limited to Row 
> object allocations. For a more realistic UDAF with its own non-trivial 
> internal structure, the overhead is that much worse.






[jira] [Created] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-11-04 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-18278:
--

 Summary: Support native submission of spark jobs to a kubernetes 
cluster
 Key: SPARK-18278
 URL: https://issues.apache.org/jira/browse/SPARK-18278
 Project: Spark
  Issue Type: Umbrella
  Components: Build, Deploy, Documentation, Scheduler, Spark Core
Affects Versions: 2.2.0
Reporter: Erik Erlandson


A new Apache Spark sub-project that enables native support for submitting Spark 
applications to a Kubernetes cluster. The submitted application runs in a 
driver executing on a Kubernetes pod, and executor lifecycles are also managed 
as pods.






[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-11-04 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15637591#comment-15637591
 ] 

Erik Erlandson commented on SPARK-18278:


Current prototype:
https://github.com/foxish/spark/tree/k8s-support
https://github.com/foxish/spark/pull/1

> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Affects Versions: 2.2.0
>Reporter: Erik Erlandson
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a Kubernetes cluster. The submitted application runs 
> in a driver executing on a Kubernetes pod, and executor lifecycles are also 
> managed as pods.






[jira] [Updated] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-11-04 Thread Erik Erlandson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson updated SPARK-18278:
---
External issue URL: https://github.com/kubernetes/kubernetes/issues/34377
 External issue ID: #34377

> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Affects Versions: 2.2.0
>Reporter: Erik Erlandson
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a Kubernetes cluster. The submitted application runs 
> in a driver executing on a Kubernetes pod, and executor lifecycles are also 
> managed as pods.






[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-11-07 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15644841#comment-15644841
 ] 

Erik Erlandson commented on SPARK-18278:


I agree with [~willbenton] that Kube is a sufficiently popular container 
management system that it warrants "first-class" sub-project status for Apache 
Spark.

I'm also interested in making modifications to the Spark scheduler support so 
that it is easier to plug in new ones externally.  I believe the necessary 
modifications would not be very intrusive.  The system is already based on 
sub-classing abstract traits.  It would be mostly a matter of increasing their 
exposure.
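
For context, the kind of trait involved looks roughly like the existing (currently {{private[spark]}}) {{ExternalClusterManager}}, which is discovered via {{java.util.ServiceLoader}}. The skeleton below is only a sketch against those internal signatures as I understand them; today it has to be compiled under an {{org.apache.spark}} package, which is exactly the exposure issue:

{code}
package org.apache.spark.scheduler

import org.apache.spark.SparkContext

// sketch of an externally pluggable scheduler backend; the SchedulerBackend
// body is elided since it depends entirely on the target cluster manager's API
class MyClusterManager extends ExternalClusterManager {
  def canCreate(masterURL: String): Boolean = masterURL.startsWith("mycluster://")

  def createTaskScheduler(sc: SparkContext, masterURL: String): TaskScheduler =
    new TaskSchedulerImpl(sc)

  def createSchedulerBackend(sc: SparkContext, masterURL: String,
      scheduler: TaskScheduler): SchedulerBackend =
    ???  // backend that talks to the external cluster manager

  def initialize(scheduler: TaskScheduler, backend: SchedulerBackend): Unit =
    scheduler.asInstanceOf[TaskSchedulerImpl].initialize(backend)
}
{code}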



> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Reporter: Erik Erlandson
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a Kubernetes cluster. The submitted application runs 
> in a driver executing on a Kubernetes pod, and executor lifecycles are also 
> managed as pods.






[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-11-07 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15644919#comment-15644919
 ] 

Erik Erlandson commented on SPARK-18278:


Another comment on external plug-ins for scheduling: although I think it's a 
good idea to support them, it does introduce the maintenance burden of keeping 
external scheduling packages in sync with the main Apache Spark project.  That 
is another argument for first-class support for schedulers of sufficient 
importance to the community.

> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Reporter: Erik Erlandson
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a Kubernetes cluster. The submitted application runs 
> in a driver executing on a Kubernetes pod, and executor lifecycles are also 
> managed as pods.






[jira] [Commented] (SPARK-24432) Add support for dynamic resource allocation

2021-03-18 Thread Erik Erlandson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304397#comment-17304397
 ] 

Erik Erlandson commented on SPARK-24432:


[~dongjoon] should this be closed, now that Spark 3.1 is available (per 
[above|https://issues.apache.org/jira/browse/SPARK-24432?focusedCommentId=17224905&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17224905])?

> Add support for dynamic resource allocation
> ---
>
> Key: SPARK-24432
> URL: https://issues.apache.org/jira/browse/SPARK-24432
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Yinan Li
>Priority: Major
>
> This is an umbrella ticket for work on adding support for dynamic resource 
> allocation into the Kubernetes mode. This requires a Kubernetes-specific 
> external shuffle service. The feature is available in our fork at 
> github.com/apache-spark-on-k8s/spark.






[jira] [Commented] (SPARK-24793) Make spark-submit more useful with k8s

2018-07-12 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541923#comment-16541923
 ] 

Erik Erlandson commented on SPARK-24793:


I am concerned that this is outside the scope of {{spark-submit}}, especially 
since it is arguably a k8s-centric use case.

But it's definitely a useful set of functionality.  I'd propose strategic use 
of labels to make these kinds of operations easier via {{kubectl}}, possibly 
supported by a tutorial example in the docs: "here's how to use labels to do 
common operations like kill this app, list all the running driver pods, etc."

> Make spark-submit more useful with k8s
> --
>
> Key: SPARK-24793
> URL: https://issues.apache.org/jira/browse/SPARK-24793
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Assignee: Anirudh Ramanathan
>Priority: Major
>
> Support controlling the lifecycle of Spark Application through spark-submit. 
> For example:
> {{ 
>   --kill app_name   If given, kills the driver specified.
>   --status app_name  If given, requests the status of the driver 
> specified.
> }}
> Potentially also --list to list all spark drivers running.
> Given that our submission client can actually launch jobs into many different 
> namespaces, we'll need an additional specification of the namespace through a 
> --namespace flag potentially.
> I think this is pretty useful to have instead of forcing a user to use 
> kubectl to manage the lifecycle of any k8s Spark Application.






[jira] [Commented] (SPARK-24793) Make spark-submit more useful with k8s

2018-07-12 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541930#comment-16541930
 ] 

Erik Erlandson commented on SPARK-24793:


Another possible angle (not mutually exclusive with the above) is establishing 
spark-operator as a "standard" solution for supporting these kinds of 
higher-level operations: "If you want to do higher-level CRUD on jobs, we 
recommend investigating spark-operator..."

> Make spark-submit more useful with k8s
> --
>
> Key: SPARK-24793
> URL: https://issues.apache.org/jira/browse/SPARK-24793
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Assignee: Anirudh Ramanathan
>Priority: Major
>
> Support controlling the lifecycle of Spark Application through spark-submit. 
> For example:
> {{ 
>   --kill app_name   If given, kills the driver specified.
>   --status app_name  If given, requests the status of the driver 
> specified.
> }}
> Potentially also --list to list all spark drivers running.
> Given that our submission client can actually launch jobs into many different 
> namespaces, we'll need an additional specification of the namespace through a 
> --namespace flag potentially.
> I think this is pretty useful to have instead of forcing a user to use 
> kubectl to manage the lifecycle of any k8s Spark Application.






[jira] [Commented] (SPARK-24793) Make spark-submit more useful with k8s

2018-07-12 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542102#comment-16542102
 ] 

Erik Erlandson commented on SPARK-24793:


Also a good point that {{--kill}} and {{--status}} are existing invocation 
modes, so they are already part of the command scope. From that POV, I agree 
it makes sense to support them via the k8s backend.

> Make spark-submit more useful with k8s
> --
>
> Key: SPARK-24793
> URL: https://issues.apache.org/jira/browse/SPARK-24793
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Assignee: Anirudh Ramanathan
>Priority: Major
>
> Support controlling the lifecycle of Spark Application through spark-submit. 
> For example:
> {{ 
>   --kill app_name   If given, kills the driver specified.
>   --status app_name  If given, requests the status of the driver 
> specified.
> }}
> Potentially also --list to list all spark drivers running.
> Given that our submission client can actually launch jobs into many different 
> namespaces, we'll need an additional specification of the namespace through a 
> --namespace flag potentially.
> I think this is pretty useful to have instead of forcing a user to use 
> kubectl to manage the lifecycle of any k8s Spark Application.






[jira] [Comment Edited] (SPARK-24793) Make spark-submit more useful with k8s

2018-07-12 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542102#comment-16542102
 ] 

Erik Erlandson edited comment on SPARK-24793 at 7/12/18 7:01 PM:
-

Also a good point that --kill and --status are existing invocation modes, so 
they are already part of the command scope. From that POV, I agree it makes 
sense to support them via the k8s backend.


was (Author: eje):
Also a good point that {{--kill}} and {{--status}} are existing invocation 
modes, so it is already part of the command scope. From that pov, I agree it 
makes sense to support them via the k8s backend

> Make spark-submit more useful with k8s
> --
>
> Key: SPARK-24793
> URL: https://issues.apache.org/jira/browse/SPARK-24793
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Assignee: Anirudh Ramanathan
>Priority: Major
>
> Support controlling the lifecycle of Spark Application through spark-submit. 
> For example:
> {{ 
>   --kill app_name   If given, kills the driver specified.
>   --status app_name  If given, requests the status of the driver 
> specified.
> }}
> Potentially also --list to list all spark drivers running.
> Given that our submission client can actually launch jobs into many different 
> namespaces, we'll need an additional specification of the namespace through a 
> --namespace flag potentially.
> I think this is pretty useful to have instead of forcing a user to use 
> kubectl to manage the lifecycle of any k8s Spark Application.






[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-31 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16564558#comment-16564558
 ] 

Erik Erlandson commented on SPARK-24615:


Am I understanding correctly that this can't assign executors to desired 
resources without resorting to Dynamic Allocation to tear down an Executor and 
reallocate it somewhere else?

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto executors where 
> accelerators are installed.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators:
>  # CPU cores usually outnumber accelerators on a node, so using CPU cores to 
> schedule accelerator-requiring tasks introduces a mismatch.
>  # In a cluster, we always assume that every node has CPUs, but this is not 
> true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator-requiring or not) 
> requires the scheduler to schedule tasks in a smart way.
> So the proposal here is to improve the current scheduler to support 
> heterogeneous tasks (accelerator-requiring or not). This can be part of the 
> work of Project Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Commented] (SPARK-24580) List scenarios to be handled by barrier execution mode properly

2018-08-01 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16566154#comment-16566154
 ] 

Erik Erlandson commented on SPARK-24580:


This is blocking SPARK-24582, which is marked as 'resolved' but appears to be 
inactive.

> List scenarios to be handled by barrier execution mode properly
> ---
>
> Key: SPARK-24580
> URL: https://issues.apache.org/jira/browse/SPARK-24580
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Jiang Xingbo
>Priority: Major
>
> List scenarios to be handled by barrier execution mode to help the design. We 
> will start with simple ones and move to complex ones.
>  
>  






[jira] [Commented] (SPARK-24817) Implement BarrierTaskContext.barrier()

2018-08-01 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16566159#comment-16566159
 ] 

Erik Erlandson commented on SPARK-24817:


I'm curious about what the {{barrier}} invocations inside {{mapPartitions}} 
closures imply about communication between executors, for example executors 
running on pods in a kube cluster. It is possible that whatever allows shuffle 
data to transfer between executors will also allow these {{barrier}} 
coordinations to work. However, we had to create a headless service for 
executors to register properly with the driver pod, and if every executor pod 
needs something like that for barrier to work, it will have an impact on kube 
backend support.

> Implement BarrierTaskContext.barrier()
> --
>
> Key: SPARK-24817
> URL: https://issues.apache.org/jira/browse/SPARK-24817
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> Implement BarrierTaskContext.barrier(), to support global sync between all 
> the tasks in a barrier stage. The global sync shall finish immediately once 
> all tasks in the same barrier stage reach the same barrier.






[jira] [Commented] (SPARK-24817) Implement BarrierTaskContext.barrier()

2018-08-02 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16567301#comment-16567301
 ] 

Erik Erlandson commented on SPARK-24817:


Thanks [~jiangxb] - I'd expect that design to work out of the box on the k8s 
backend.

ML-specific code seems like it will have needs that are harder to predict, by 
definition. If it can use IP addresses in the cluster space, it should work 
regardless. If it wants FQDNs, then perhaps additional pod configurations will 
be required.

> Implement BarrierTaskContext.barrier()
> --
>
> Key: SPARK-24817
> URL: https://issues.apache.org/jira/browse/SPARK-24817
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> Implement BarrierTaskContext.barrier(), to support global sync between all 
> the tasks in a barrier stage. The global sync shall finish immediately once 
> all tasks in the same barrier stage reach the same barrier.






[jira] [Commented] (SPARK-24817) Implement BarrierTaskContext.barrier()

2018-08-02 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16567358#comment-16567358
 ] 

Erik Erlandson commented on SPARK-24817:


I have been looking at the use cases for barrier mode in the design doc. The 
primary story seems to be along the lines of using {{mapPartitions}} to:
 # write out any partitioned data (and sync)
 # execute some kind of ML logic (TF, etc.) (possibly syncing on stages here?)
 # optionally move back into "normal" Spark execution

My mental model has been that the value proposition for Hydrogen is primarily a 
convergence argument: it is easier not to have to leave a Spark workflow and 
execute something like TF using some other toolchain. But OTOH, given that the 
Spark programmer has to write out the partitioned data and then invoke ML 
tooling like TF regardless, does the increase in convenience pay for the cost 
in complexity of absorbing new clustering & scheduling models into Spark (along 
with other consequences, for example SPARK-24615), compared to the "null 
hypothesis" of writing partition data, using ML-specific clustering toolchains 
(kubeflow, for example), and consuming the resulting products in Spark 
afterward?
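
As a concrete (and hypothetical) illustration of that story, assuming the barrier API roughly as it shipped in Spark 2.4, where {{data}}, {{writePartition}}, and {{runTraining}} are placeholders:

{code}
import org.apache.spark.BarrierTaskContext

// data: an existing DataFrame; writePartition / runTraining: hypothetical helpers
val results = data.rdd.barrier().mapPartitions { rows =>
  val ctx = BarrierTaskContext.get()
  // 1. write out this partition's data for the external ML tooling
  writePartition(rows, s"/tmp/part-${ctx.partitionId()}")
  ctx.barrier()  // sync: all partitions written
  // 2. execute the ML logic (e.g. launch TF against the written partitions)
  val model = runTraining(ctx.partitionId())
  ctx.barrier()  // sync: all tasks finished the ML stage
  // 3. move back into "normal" Spark execution with the results
  Iterator(model)
}
{code}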

> Implement BarrierTaskContext.barrier()
> --
>
> Key: SPARK-24817
> URL: https://issues.apache.org/jira/browse/SPARK-24817
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> Implement BarrierTaskContext.barrier(), to support global sync between all 
> the tasks in a barrier stage. The global sync shall finish immediately once 
> all tasks in the same barrier stage reach the same barrier.






[jira] [Created] (SPARK-25128) multiple simultaneous job submissions against k8s backend cause driver pods to hang

2018-08-15 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-25128:
--

 Summary: multiple simultaneous job submissions against k8s backend 
cause driver pods to hang
 Key: SPARK-25128
 URL: https://issues.apache.org/jira/browse/SPARK-25128
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.3.0
Reporter: Erik Erlandson


A user is reporting that multiple "simultaneous" (or in rapid succession) job 
submissions against the k8s back-end are causing driver pods to hang in the 
"Waiting: PodInitializing" state. They filed an associated question at 
[stackoverflow|https://stackoverflow.com/questions/51843212/spark-driver-pod-stuck-in-waiting-podinitializing-state-in-kubernetes].






[jira] [Commented] (SPARK-25128) multiple simultaneous job submissions against k8s backend cause driver pods to hang

2018-08-15 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16581537#comment-16581537
 ] 

Erik Erlandson commented on SPARK-25128:


[~mcheah], [~liyinan926], wdyt?

> multiple simultaneous job submissions against k8s backend cause driver pods 
> to hang
> ---
>
> Key: SPARK-25128
> URL: https://issues.apache.org/jira/browse/SPARK-25128
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Erik Erlandson
>Priority: Major
>  Labels: kubernetes
>
> A user is reporting that multiple "simultaneous" (or in rapid succession) job 
> submissions against the k8s back-end are causing driver pods to hang in the 
> "Waiting: PodInitializing" state. They filed an associated question at 
> [stackoverflow|https://stackoverflow.com/questions/51843212/spark-driver-pod-stuck-in-waiting-podinitializing-state-in-kubernetes].






[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public

2018-08-22 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16588934#comment-16588934
 ] 

Erik Erlandson commented on SPARK-7768:
---

We use {{UserDefinedType}}, for example here:
[https://github.com/isarn/isarn-sketches-spark/blob/develop/src/main/scala/org/apache/spark/isarnproject/sketches/udt/TDigestUDT.scala#L37]

My colleague [~willbenton] and I gave a talk at the Spark+AI Summit in June on 
[this topic|https://databricks.com/session/apache-spark-for-library-developers].

A comment about {{Encoders}}: they are strongly typed, which is quite nice to 
work with in Scala, but if you intend to expose your type via DataFrames and/or 
PySpark via py4j, they can't help you, and you need UDTs.

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.






[jira] [Commented] (SPARK-21097) Dynamic allocation will preserve cached data

2018-08-27 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594271#comment-16594271
 ] 

Erik Erlandson commented on SPARK-21097:


I'm wondering if this is going to be subsumed by the Shuffle Service redesign 
proposal.

cc [~mcheah]

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
>Priority: Major
> Attachments: Preserving Cached Data with Dynamic Allocation.pdf
>
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our spark clusters. One difficulty is that if a user has cached data 
> then we are either prevented from de-allocating any of their executors, or we 
> are forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> spark config. Now when an executor reaches its configured idle timeout, 
> instead of just killing it on the spot, we will stop sending it new tasks, 
> replicate all of its rdd blocks onto other executors, and then kill it. If 
> there is an issue while we replicate the data, like an error, it takes too 
> long, or there isn't enough space, then we will fall back to the original 
> behavior and drop the data and kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also since it will be completely opt-in it will 
> unlikely to cause problems for other use cases.






[jira] [Created] (SPARK-25275) require membership in wheel to run 'su' (in dockerfiles)

2018-08-29 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-25275:
--

 Summary: require membership in wheel to run 'su' (in dockerfiles)
 Key: SPARK-25275
 URL: https://issues.apache.org/jira/browse/SPARK-25275
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.3.1, 2.3.0
Reporter: Erik Erlandson


For improved security, configure the images so that users must be in the wheel 
group in order to run su.

See example:

[https://github.com/openshift-evangelists/terminal-base-image/blob/master/image/Dockerfile#L53]






[jira] [Commented] (SPARK-25275) require membership in wheel to run 'su' (in dockerfiles)

2018-08-30 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16597935#comment-16597935
 ] 

Erik Erlandson commented on SPARK-25275:


{{merge_spark_pr.py}} failed to close this, closing manually.

> require membership in wheel to run 'su' (in dockerfiles)
> ---
>
> Key: SPARK-25275
> URL: https://issues.apache.org/jira/browse/SPARK-25275
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Erik Erlandson
>Priority: Major
>  Labels: docker, kubernetes
>
> For improved security, configure the images so that users must be in the 
> wheel group in order to run su.
> See example:
> [https://github.com/openshift-evangelists/terminal-base-image/blob/master/image/Dockerfile#L53]






[jira] [Resolved] (SPARK-25275) require membership in wheel to run 'su' (in dockerfiles)

2018-08-30 Thread Erik Erlandson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson resolved SPARK-25275.

   Resolution: Fixed
Fix Version/s: 2.4.0

> require membership in wheel to run 'su' (in dockerfiles)
> ---
>
> Key: SPARK-25275
> URL: https://issues.apache.org/jira/browse/SPARK-25275
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Erik Erlandson
>Priority: Major
>  Labels: docker, kubernetes
> Fix For: 2.4.0
>
>
> For improved security, configure the images so that users must be in the 
> wheel group in order to run su.
> See example:
> [https://github.com/openshift-evangelists/terminal-base-image/blob/master/image/Dockerfile#L53]






[jira] [Created] (SPARK-25287) Check for JIRA_USERNAME and JIRA_PASSWORD up front in merge_spark_pr.py

2018-08-30 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-25287:
--

 Summary: Check for JIRA_USERNAME and JIRA_PASSWORD up front in 
merge_spark_pr.py
 Key: SPARK-25287
 URL: https://issues.apache.org/jira/browse/SPARK-25287
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 2.3.1
Reporter: Erik Erlandson


I never remember to set {{JIRA_USERNAME}} and {{JIRA_PASSWORD}}, and it isn't 
checked, so I get to the end of the {{merge_spark_pr.py}} process and it fails 
on the Jira state update. An up-front check for this would be useful.






[jira] [Assigned] (SPARK-25287) Check for JIRA_USERNAME and JIRA_PASSWORD up front in merge_spark_pr.py

2018-08-30 Thread Erik Erlandson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson reassigned SPARK-25287:
--

Assignee: Erik Erlandson

> Check for JIRA_USERNAME and JIRA_PASSWORD up front in merge_spark_pr.py
> ---
>
> Key: SPARK-25287
> URL: https://issues.apache.org/jira/browse/SPARK-25287
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.3.1
>Reporter: Erik Erlandson
>Assignee: Erik Erlandson
>Priority: Minor
>  Labels: infrastructure
> Fix For: 2.4.0
>
>
> I never remember to set {{JIRA_USERNAME}} and {{JIRA_PASSWORD}}, and it isn't 
> checked, so I get to the end of the {{merge_spark_pr.py}} process and it 
> fails on the Jira state update. An up-front check for this would be useful.






[jira] [Resolved] (SPARK-25287) Check for JIRA_USERNAME and JIRA_PASSWORD up front in merge_spark_pr.py

2018-08-30 Thread Erik Erlandson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson resolved SPARK-25287.

   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22294
[https://github.com/apache/spark/pull/22294]

> Check for JIRA_USERNAME and JIRA_PASSWORD up front in merge_spark_pr.py
> ---
>
> Key: SPARK-25287
> URL: https://issues.apache.org/jira/browse/SPARK-25287
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.3.1
>Reporter: Erik Erlandson
>Assignee: Erik Erlandson
>Priority: Minor
>  Labels: infrastructure
> Fix For: 2.4.0
>
>
> I never remember to set {{JIRA_USERNAME}} and {{JIRA_PASSWORD}}, and it isn't 
> checked, so I get to the end of the {{merge_spark_pr.py}} process and it 
> fails on the Jira state update. An up-front check for this would be useful.






[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-08-31 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16599092#comment-16599092
 ] 

Erik Erlandson commented on SPARK-24434:


There are a few related, but separate, issues here.

I agree that it is most efficient, and considerate, to respect issue 
assignments and coordinate our distributed development around absences, etc.

To the best of my knowledge, the work Stavros did on 24434 was not made visible 
as a public WIP apache/spark branch. Making development visible that way is an 
important means of minimizing coordination problems.

Although this confusion is awkward, nothing in regard to 24434 has violated 
FOSS principles or Spark governance. Onur's PR has been developed and reviewed 
on a public apache/spark branch. This Jira was filed, and has hosted discussion 
from all stakeholders.

The Kubernetes Big Data SIG is a separate community that overlaps with the 
Spark community. Our meetings are open to the public, and we publish recordings 
and meeting minutes. Although we discuss topics related to Spark on Kubernetes, 
we do not make Spark development decisions in that community. All of the work 
that members of the K8s Big Data SIG have contributed to Spark respects Apache 
governance and has been done using established Spark processes: SPIP, 
discussion on dev, Jira, and the PR workflow.

 

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 






[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-08-31 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16599283#comment-16599283
 ] 

Erik Erlandson commented on SPARK-24434:


Stavros, yes, I knew you were working on it, and also that there were no plans 
for 2.4.

As I said above, it is generally more efficient and respectful to coordinate 
with issue assignees. I did not request this second PR. On the other hand, 
multiple PRs for an issue don't violate any FOSS principles; it just means 
there should be a community discussion about which PR ought to be pursued.

I'm not aware of any renewed push to get this into 2.4.  I don't see any 
discussion about it on dev@spark.

 

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 






[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-08-31 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16599299#comment-16599299
 ] 

Erik Erlandson commented on SPARK-24434:


To amplify a little from my points above: I co-chair a SIG that is attended by 
some Apache Spark contributors, most frequently people involved with the 
kubernetes back-end. As chair, I do my best to provide input on the discussions 
we have there. However, the various community participants are their own 
independent entities; nobody in this community takes orders from me.

When everything is running smoothly, this kind of duplicated effort should 
never happen. Here things didn't go smoothly, and I hope to work it out as best 
we can.

[~skonto] I encourage you to post your dev on this feature, which allows 
everyone to discuss all the available options.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 






[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-01 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16459954#comment-16459954
 ] 

Erik Erlandson commented on SPARK-24135:


I think it makes sense to detect these failure states.  Even if they won't 
resolve by requesting replacement executors, reporting the specific failure 
mode in the error logs should aid in debugging. It could optionally be used as 
grounds for job failure in the case of repeated executor failures.

> [K8s] Executors that fail to start up because of init-container errors are 
> not retried and limit the executor pool size
> ---
>
> Key: SPARK-24135
> URL: https://issues.apache.org/jira/browse/SPARK-24135
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> In KubernetesClusterSchedulerBackend, we detect if executors disconnect after 
> having been started or if executors hit the {{ERROR}} or {{DELETED}} states. 
> When executors fail in these ways, they are removed from the pending 
> executors pool and the driver should retry requesting these executors.
> However, the driver does not handle a different class of error: when the pod 
> enters the {{Init:Error}} state. This state comes up when the executor fails 
> to launch because one of its init-containers fails. Spark itself doesn't 
> attach any init-containers to the executors. However, custom web hooks can 
> run on the cluster and attach init-containers to the executor pods. 
> Additionally, pod presets can specify init containers to run on these pods. 
> Therefore Spark should be handling the {{Init:Error}} cases regardless of 
> whether Spark itself is aware of init-containers or not.
> This class of error is particularly bad because when we hit this state, the 
> failed executor will never start, but it's still seen as pending by the 
> executor allocator. The executor allocator won't request more rounds of 
> executors because its current batch hasn't been resolved to either running or 
> failed. Therefore we end up stuck with the number of executors 
> that successfully started before the faulty one failed to start, potentially 
> creating a fake resource bottleneck.






[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-02 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16461154#comment-16461154
 ] 

Erik Erlandson commented on SPARK-24135:


IIRC the dynamic allocation heuristic was to avoid scheduling new executors if 
there were executors still pending, to prevent a positive feedback loop from 
swamping kube with ever-increasing numbers of executor pod scheduling requests. 
How does that interact with the concept of killing a pending executor because 
its pod start is failing?

 

Restarting seems like it would eventually be limited by the job failure limit 
that Spark already has. If pod startup failures are deterministic, the job 
failure count will hit this limit and the job will be killed that way.  That isn't 
mutually exclusive to supporting some maximum number of pod startup attempts in 
the back-end, however.
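
As a sketch of what such a cap could look like in the back-end (the class, threshold, and hook points below are hypothetical, not an existing Spark config or API):

{code:scala}
// Track consecutive executor pod startup failures and signal when the application
// should give up instead of requesting yet another replacement executor.
class PodStartupFailureTracker(maxStartupFailures: Int) {
  private var consecutiveFailures = 0

  // Called when any executor pod reaches the Running state.
  def onExecutorRunning(): Unit = synchronized { consecutiveFailures = 0 }

  // Called when an executor pod fails to start; returns true if the cap is exceeded
  // and the job should be failed rather than retried.
  def onStartupFailure(reason: String): Boolean = synchronized {
    consecutiveFailures += 1
    println(s"Executor pod failed to start ($reason): " +
      s"$consecutiveFailures of $maxStartupFailures allowed consecutive failures")
    consecutiveFailures >= maxStartupFailures
  }
}
{code}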

> [K8s] Executors that fail to start up because of init-container errors are 
> not retried and limit the executor pool size
> ---
>
> Key: SPARK-24135
> URL: https://issues.apache.org/jira/browse/SPARK-24135
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> In KubernetesClusterSchedulerBackend, we detect if executors disconnect after 
> having been started or if executors hit the {{ERROR}} or {{DELETED}} states. 
> When executors fail in these ways, they are removed from the pending 
> executors pool and the driver should retry requesting these executors.
> However, the driver does not handle a different class of error: when the pod 
> enters the {{Init:Error}} state. This state comes up when the executor fails 
> to launch because one of its init-containers fails. Spark itself doesn't 
> attach any init-containers to the executors. However, custom web hooks can 
> run on the cluster and attach init-containers to the executor pods. 
> Additionally, pod presets can specify init containers to run on these pods. 
> Therefore Spark should be handling the {{Init:Error}} cases regardless of whether 
> Spark itself is aware of init-containers or not.
> This class of error is particularly bad because when we hit this state, the 
> failed executor will never start, but it's still seen as pending by the 
> executor allocator. The executor allocator won't request more rounds of 
> executors because its current batch hasn't been resolved to either running or 
> failed. Therefore we end up stuck with the number of executors 
> that successfully started before the faulty one failed to start, potentially 
> creating a fake resource bottleneck.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-02 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16461324#comment-16461324
 ] 

Erik Erlandson commented on SPARK-24135:


> In the case of the executor failing to start at all, this wouldn't be caught 
> by Spark's task failure count logic because you're never going to end up 
> scheduling tasks on these executors that failed to start.

Aha, that argues for allowing a way to give up after repeated pod start 
failures.

> [K8s] Executors that fail to start up because of init-container errors are 
> not retried and limit the executor pool size
> ---
>
> Key: SPARK-24135
> URL: https://issues.apache.org/jira/browse/SPARK-24135
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> In KubernetesClusterSchedulerBackend, we detect if executors disconnect after 
> having been started or if executors hit the {{ERROR}} or {{DELETED}} states. 
> When executors fail in these ways, they are removed from the pending 
> executors pool and the driver should retry requesting these executors.
> However, the driver does not handle a different class of error: when the pod 
> enters the {{Init:Error}} state. This state comes up when the executor fails 
> to launch because one of its init-containers fails. Spark itself doesn't 
> attach any init-containers to the executors. However, custom web hooks can 
> run on the cluster and attach init-containers to the executor pods. 
> Additionally, pod presets can specify init containers to run on these pods. 
> Therefore Spark should be handling the {{Init:Error}} cases regardless of whether 
> Spark itself is aware of init-containers or not.
> This class of error is particularly bad because when we hit this state, the 
> failed executor will never start, but it's still seen as pending by the 
> executor allocator. The executor allocator won't request more rounds of 
> executors because its current batch hasn't been resolved to either running or 
> failed. Therefore we end up stuck with the number of executors 
> that successfully started before the faulty one failed to start, potentially 
> creating a fake resource bottleneck.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24248) [K8S] Use the Kubernetes cluster as the backing store for the state of pods

2018-05-16 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16477812#comment-16477812
 ] 

Erik Erlandson commented on SPARK-24248:


Is the design above using re-sync as the fallback for the watcher losing 
connection, or periodic resync as a replacement for the watcher?  Are there any 
potential race-condition issues between a dequeuing thread and the thread 
querying pod states?
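
To make the race-condition question concrete, here is a toy sketch (illustrative only, not the proposed design) of funneling both watch events and a periodic full re-list into a single atomically swapped snapshot that the dequeuing thread reads:

{code:scala}
// A single source of truth for executor pod states, written by the watch callback
// and by a periodic re-list, and read by the scheduling/dequeuing thread.
final class ExecutorPodSnapshotStore {
  @volatile private var states = Map.empty[String, String]  // pod name -> phase

  // Incremental update delivered by the watch.
  def updateFromWatch(podName: String, phase: String): Unit = synchronized {
    states = states + (podName -> phase)
  }

  // Wholesale replacement from a periodic re-list, so a missed watch event cannot
  // leave a stale entry behind indefinitely.
  def replaceSnapshot(all: Map[String, String]): Unit = synchronized {
    states = all
  }

  // Readers always see a consistent point-in-time copy of the immutable map.
  def snapshot(): Map[String, String] = states
}
{code}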

> [K8S] Use the Kubernetes cluster as the backing store for the state of pods
> ---
>
> Key: SPARK-24248
> URL: https://issues.apache.org/jira/browse/SPARK-24248
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> We have a number of places in KubernetesClusterSchedulerBackend right now 
> that maintain the state of pods in memory. However, the Kubernetes API can 
> always give us the most up to date and correct view of what our executors are 
> doing. We should consider moving away from in-memory state as much as we can in 
> favor of using the Kubernetes cluster as the source of truth for pod status. 
> Maintaining less state in memory makes it so that there's a lower chance that 
> we accidentally miss updating one of these data structures and breaking the 
> lifecycle of executors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24435) Support user-supplied YAML that can be merged with k8s pod descriptions

2018-05-30 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-24435:
--

 Summary: Support user-supplied YAML that can be merged with k8s 
pod descriptions
 Key: SPARK-24435
 URL: https://issues.apache.org/jira/browse/SPARK-24435
 Project: Spark
  Issue Type: New Feature
  Components: Kubernetes
Affects Versions: 2.3.0
Reporter: Erik Erlandson
 Fix For: 2.4.0


Kubernetes supports a large variety of configurations to Pods. Currently only 
some of these are configurable from Spark, and they all operate by being 
plumbed from --conf arguments through to pod creation in the code.

To avoid the anti-pattern of trying to expose an unbounded Pod feature set 
through Spark configuration keywords, the community is interested in working 
out a sane way of allowing users to supply "arbitrary" Pod YAML which can be 
merged with the pod configurations created by the kube backend.

Multiple solutions have been considered, including Pod Pre-sets and loading Pod 
template objects.  A requirement is that the policy for how user-supplied YAML 
interacts with the configurations created by the kube back-end must be easy to 
reason about, and also that whatever kubernetes features the solution uses are 
supported on the kubernetes roadmap.
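
For illustration only (this is one of the candidate approaches, not a settled design), a template-loading variant could parse the user's YAML with the fabric8 client and then layer the fields Spark must control on top of it. The merge policy and names below are assumptions:

{code:scala}
import java.io.FileInputStream
import io.fabric8.kubernetes.api.model.{Pod, PodBuilder}
import io.fabric8.kubernetes.client.KubernetesClient

// Parse a user-supplied pod YAML file into a Pod object.
def loadPodTemplate(client: KubernetesClient, path: String): Pod =
  client.pods().load(new FileInputStream(path)).get()

// Layer the fields Spark needs to own (pod name, container image, ...) on top of the
// user template, leaving everything else in the template untouched.
def applySparkRequiredFields(template: Pod, podName: String, image: String): Pod =
  new PodBuilder(template)
    .editOrNewMetadata()
      .withName(podName)
    .endMetadata()
    .editOrNewSpec()
      .addNewContainer()
        .withName("spark-kubernetes-driver")
        .withImage(image)
      .endContainer()
    .endSpec()
    .build()
{code}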



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24435) Support user-supplied YAML that can be merged with k8s pod descriptions

2018-05-30 Thread Erik Erlandson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson resolved SPARK-24435.

Resolution: Duplicate

> Support user-supplied YAML that can be merged with k8s pod descriptions
> ---
>
> Key: SPARK-24435
> URL: https://issues.apache.org/jira/browse/SPARK-24435
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Erik Erlandson
>Priority: Major
>  Labels: features, kubernetes
> Fix For: 2.4.0
>
>
> Kubernetes supports a large variety of configurations to Pods. Currently only 
> some of these are configurable from Spark, and they all operate by being 
> plumbed from --conf arguments through to pod creation in the code.
> To avoid the anti-pattern of trying to expose an unbounded Pod feature set 
> through Spark configuration keywords, the community is interested in working 
> out a sane way of allowing users to supply "arbitrary" Pod YAML which can be 
> merged with the pod configurations created by the kube backend.
> Multiple solutions have been considered, including Pod Pre-sets and loading 
> Pod template objects.  A requirement is that the policy for how user-supplied 
> YAML interacts with the configurations created by the kube back-end must be 
> easy to reason about, and also that whatever kubernetes features the solution 
> uses are supported on the kubernetes roadmap.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-05-30 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16495615#comment-16495615
 ] 

Erik Erlandson commented on SPARK-24434:


Is the template-based solution being explicitly favored over other options, 
e.g. pod presets or webhooks, etc?

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24091) Internally used ConfigMap prevents use of user-specified ConfigMaps carrying Spark configs files

2018-05-30 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16495619#comment-16495619
 ] 

Erik Erlandson commented on SPARK-24091:


If we support user-supplied yaml, that may become a source of config map 
specifications.

> Internally used ConfigMap prevents use of user-specified ConfigMaps carrying 
> Spark configs files
> 
>
> Key: SPARK-24091
> URL: https://issues.apache.org/jira/browse/SPARK-24091
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> The recent PR [https://github.com/apache/spark/pull/20669] for removing the 
> init-container introduced an internally used ConfigMap carrying Spark 
> configuration properties in a file for the driver. This ConfigMap gets 
> mounted under {{$SPARK_HOME/conf}} and the environment variable 
> {{SPARK_CONF_DIR}} is set to point to the mount path. This pretty much 
> prevents users from mounting their own ConfigMaps that carry custom Spark 
> configuration files, e.g., {{log4j.properties}} and {{spark-env.sh}} and 
> leaves users with only the option of building custom images. IMO, it is very 
> useful to support mounting user-specified ConfigMaps for custom Spark 
> configuration files. This warrants further discussion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-05-30 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16495642#comment-16495642
 ] 

Erik Erlandson commented on SPARK-24434:


[~skonto] given the number of ideas that have gotten tossed around for this 
over time, an 'alternatives considered' section for a design doc will 
definitely be valuable.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-05-31 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16496991#comment-16496991
 ] 

Erik Erlandson commented on SPARK-24434:


[~foxish] is there a technical (or ux) argument for json, versus yaml (or 
allowing both)?

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-06-01 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16498337#comment-16498337
 ] 

Erik Erlandson commented on SPARK-24434:


My current take on UX around this feature is that there's not much precedent 
from the Spark world. Assuming I'm right about that, it's more likely to be 
driven by what expectations Kubernetes users have. In my experience that is 
along the lines of "pointing at a yaml file," but maybe there's more variety of 
user workflows than I think.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24434) Support user-specified driver and executor pod templates

2018-06-01 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16498337#comment-16498337
 ] 

Erik Erlandson edited comment on SPARK-24434 at 6/1/18 5:53 PM:


My current take on UX around this feature is that there's not much precedent 
from the Spark world. Assuming I'm right about that, it's more likely to be 
driven by what expectations Kubernetes users have. In my experience that is 
along the lines of "pointing at a yaml file," but maybe there's more variety of 
user workflows than I think.

JSON definitely seems more amenable to inclusion in command-line arguments. 
I have been assuming that if users were specifying pod configurations, they'd be 
somewhat larger pod sub-structures and not easy to supply embedded on 
a command line. Are "small" pod modifications also likely?


was (Author: eje):
My current take on UX around this feature is that there's not much precedent 
from the Spark world. Assuming I'm right about that it's more likely to be 
driven by what expectations Kubernetes users have. In my experience that is 
along the lines of "pointing at a yaml file," but maybe there's more variety of 
user workflows than I think.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24534) Add a way to bypass entrypoint.sh script if no spark cmd is passed

2018-06-13 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16511804#comment-16511804
 ] 

Erik Erlandson commented on SPARK-24534:


I think this has potential use for customization beyond the openshift 
downstream. It allows derived images to leverage the apache spark base images 
in contexts outside of directly running the driver and executor processes.

> Add a way to bypass entrypoint.sh script if no spark cmd is passed
> --
>
> Key: SPARK-24534
> URL: https://issues.apache.org/jira/browse/SPARK-24534
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Ricardo Martinelli de Oliveira
>Priority: Minor
>
> As an improvement in the entrypoint.sh script, I'd like to propose that the spark 
> entrypoint do a passthrough if driver/executor/init is not the command 
> passed. Currently it raises an error.
> To be more specific, I'm talking about these lines:
> [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L113-L114]
> This allows the openshift-spark image to continue to function as a Spark 
> Standalone component, with custom configuration support etc. without 
> compromising the previous method to configure the cluster inside a kubernetes 
> environment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24534) Add a way to bypass entrypoint.sh script if no spark cmd is passed

2018-06-19 Thread Erik Erlandson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson resolved SPARK-24534.

   Resolution: Fixed
Fix Version/s: 2.4.0

> Add a way to bypass entrypoint.sh script if no spark cmd is passed
> --
>
> Key: SPARK-24534
> URL: https://issues.apache.org/jira/browse/SPARK-24534
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Ricardo Martinelli de Oliveira
>Priority: Minor
> Fix For: 2.4.0
>
>
> As an improvement in the entrypoint.sh script, I'd like to propose that the spark 
> entrypoint do a passthrough if driver/executor/init is not the command 
> passed. Currently it raises an error.
> To be more specific, I'm talking about these lines:
> [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L113-L114]
> This allows the openshift-spark image to continue to function as a Spark 
> Standalone component, with custom configuration support etc. without 
> compromising the previous method to configure the cluster inside a kubernetes 
> environment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26973) Kubernetes version support strategy on test nodes / backend

2019-02-22 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775435#comment-16775435
 ] 

Erik Erlandson commented on SPARK-26973:


A couple other points:
 * Currently, k8s is evolving in a manner where breakage of existing 
functionality is unlikely, and so testing against the earliest version 
we wish to support is probably optimal in a scenario where we are choosing one 
version to test against. (This heuristic might change in the future, for 
example if k8s goes to a 2.x series where backward compatibility may be broken)
 * The integration testing was designed to support running against external 
clusters (GCP, etc) - this might provide an approach to supporting testing 
against multiple k8s versions. However, it would come with additional op-ex 
costs and decreased control over the environment. I mention it mostly because 
it's a plausible path to outsourcing some of the combinatorics that 
[~shaneknapp] discussed above

> Kubernetes version support strategy on test nodes / backend
> ---
>
> Key: SPARK-26973
> URL: https://issues.apache.org/jira/browse/SPARK-26973
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Kubernetes has a policy for supporting three minor releases and the current 
> ones are defined here: 
> [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]
> Moving from release 1.x to 1.(x+1) happens roughly every 100 
> days: [https://gravitational.com/blog/kubernetes-release-cycle]
> This has an effect on dependency upgrades in the Spark on K8s backend and 
> the version of Minikube required to be supported for testing. One other issue 
> is what the users actually want at the given time of a release. Some popular 
> vendors like EKS([https://aws.amazon.com/eks/faqs/]) have their own roadmap 
> for releases and may not catch up fast (what is our view on this).
> Follow the comments for a recent discussion on the topic: 
> [https://github.com/apache/spark/pull/23814]
> Clearly we need a strategy for this.
> A couple of options for the current state of things:
> a) Support only the last two versions, but that leaves out a version that 
> still receives patches.
> b) Support only the latest, which makes testing easier, but leaves out other 
> currently maintained version.
> A good strategy will optimize at least the following:
> 1) percentage of users satisfied at release time.
> 2) how long it takes to support the latest K8s version
> 3) testing requirements eg. minikube versions used
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25128) multiple simultaneous job submissions against k8s backend cause driver pods to hang

2018-09-06 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605916#comment-16605916
 ] 

Erik Erlandson commented on SPARK-25128:


Retargeting to next release sounds good. There has been no traffic since filing 
and it shouldn't block the release.

> multiple simultaneous job submissions against k8s backend cause driver pods 
> to hang
> ---
>
> Key: SPARK-25128
> URL: https://issues.apache.org/jira/browse/SPARK-25128
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Erik Erlandson
>Priority: Major
>  Labels: kubernetes
>
> A user is reporting that multiple "simultaneous" (or in rapid succession) job 
> submissions against the k8s back-end are causing driver pods to hang in the 
> "Waiting: PodInitializing" state. They filed an associated question at 
> [stackoverflow|https://stackoverflow.com/questions/51843212/spark-driver-pod-stuck-in-waiting-podinitializing-state-in-kubernetes].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25128) multiple simultaneous job submissions against k8s backend cause driver pods to hang

2018-09-06 Thread Erik Erlandson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson updated SPARK-25128:
---
Target Version/s: 3.0.0  (was: 2.4.0, 2.3.3)
Priority: Minor  (was: Major)

> multiple simultaneous job submissions against k8s backend cause driver pods 
> to hang
> ---
>
> Key: SPARK-25128
> URL: https://issues.apache.org/jira/browse/SPARK-25128
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Erik Erlandson
>Priority: Minor
>  Labels: kubernetes
>
> A user is reporting that multiple "simultaneous" (or in rapid succession) job 
> submissions against the k8s back-end are causing driver pods to hang in the 
> "Waiting: PodInitializing" state. They filed an associated question at 
> [stackoverflow|https://stackoverflow.com/questions/51843212/spark-driver-pod-stuck-in-waiting-podinitializing-state-in-kubernetes].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25782) Add PCA Aggregator to support grouping

2018-10-19 Thread Erik Erlandson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson updated SPARK-25782:
---
Target Version/s: 3.0.0
 Component/s: ML
  Issue Type: New Feature  (was: Improvement)

> Add PCA Aggregator to support grouping
> --
>
> Key: SPARK-25782
> URL: https://issues.apache.org/jira/browse/SPARK-25782
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 2.3.2
>Reporter: Matt Saunders
>Priority: Minor
>
> I built an Aggregator that computes PCA on grouped datasets. I wanted to use 
> the PCA functions provided by MLlib, but they only work on a full dataset, 
> and I needed to do it on a grouped dataset (like a RelationalGroupedDataset). 
> So I built a little Aggregator that can do that; here's an example of how 
> it's called:
> {noformat}
> val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn
> // For each grouping, compute a PCA matrix/vector
> val pcaModels = inputData
>   .groupBy(keys:_*)
>   .agg(pcaAggregation.as(pcaOutput)){noformat}
> I used the same algorithms under the hood as 
> RowMatrix.computePrincipalComponentsAndExplainedVariance, though this works 
> directly on Datasets without converting to RDD first.
> I've seen others who wanted this ability (for example on Stack Overflow) so 
> I'd like to contribute it if it would be a benefit to the larger community. 
> If there is interest, I will prepare the code for a pull request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25782) Add PCA Aggregator to support grouping

2018-10-19 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16657394#comment-16657394
 ] 

Erik Erlandson commented on SPARK-25782:


Thanks [~mttsndrs]!

I agree it makes sense to support full Dataset aggregation functionality via a 
UDAF.
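
For reference, a minimal skeleton of what a Dataset {{Aggregator}}-based contribution could look like; the class, buffer, and simplified {{finish}} step below are illustrative, not the contributor's actual code:

{code:scala}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Per-group running statistics: count, per-dimension sums, flattened sums of products.
case class PCABuffer(n: Long, sums: Array[Double], prods: Array[Double])
case class PCAResult(secondMoments: Array[Double])

class PCAAggregator(dim: Int) extends Aggregator[Vector, PCABuffer, PCAResult] {
  def zero: PCABuffer = PCABuffer(0L, Array.fill(dim)(0.0), Array.fill(dim * dim)(0.0))

  def reduce(b: PCABuffer, v: Vector): PCABuffer = {
    val a = v.toArray
    val sums = b.sums.clone()
    val prods = b.prods.clone()
    var i = 0
    while (i < dim) {
      sums(i) += a(i)
      var j = 0
      while (j < dim) { prods(i * dim + j) += a(i) * a(j); j += 1 }
      i += 1
    }
    PCABuffer(b.n + 1, sums, prods)
  }

  def merge(b1: PCABuffer, b2: PCABuffer): PCABuffer = PCABuffer(
    b1.n + b2.n,
    b1.sums.zip(b2.sums).map { case (x, y) => x + y },
    b1.prods.zip(b2.prods).map { case (x, y) => x + y })

  // A real implementation would center the product sums and eigendecompose the
  // covariance to return the top principal components; returning the raw
  // second-moment matrix keeps this sketch short.
  def finish(b: PCABuffer): PCAResult =
    PCAResult(b.prods.map(_ / math.max(b.n, 1L)))

  def bufferEncoder: Encoder[PCABuffer] = Encoders.product[PCABuffer]
  def outputEncoder: Encoder[PCAResult] = Encoders.product[PCAResult]
}
{code}

Something shaped like this keeps the usage pattern from the description ({{new PCAAggregator(dim).toColumn}} inside {{agg}}) intact.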

> Add PCA Aggregator to support grouping
> --
>
> Key: SPARK-25782
> URL: https://issues.apache.org/jira/browse/SPARK-25782
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 2.3.2
>Reporter: Matt Saunders
>Priority: Minor
>
> I built an Aggregator that computes PCA on grouped datasets. I wanted to use 
> the PCA functions provided by MLlib, but they only work on a full dataset, 
> and I needed to do it on a grouped dataset (like a RelationalGroupedDataset). 
> So I built a little Aggregator that can do that; here's an example of how 
> it's called:
> {noformat}
> val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn
> // For each grouping, compute a PCA matrix/vector
> val pcaModels = inputData
>   .groupBy(keys:_*)
>   .agg(pcaAggregation.as(pcaOutput)){noformat}
> I used the same algorithms under the hood as 
> RowMatrix.computePrincipalComponentsAndExplainedVariance, though this works 
> directly on Datasets without converting to RDD first.
> I've seen others who wanted this ability (for example on Stack Overflow) so 
> I'd like to contribute it if it would be a benefit to the larger community. 
> If there is interest, I will prepare the code for a pull request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25782) Add PCA Aggregator to support grouping

2018-10-19 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16657398#comment-16657398
 ] 

Erik Erlandson commented on SPARK-25782:


An ML Estimator also arguably would be a good API to expose

> Add PCA Aggregator to support grouping
> --
>
> Key: SPARK-25782
> URL: https://issues.apache.org/jira/browse/SPARK-25782
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 2.3.2
>Reporter: Matt Saunders
>Priority: Minor
>
> I built an Aggregator that computes PCA on grouped datasets. I wanted to use 
> the PCA functions provided by MLlib, but they only work on a full dataset, 
> and I needed to do it on a grouped dataset (like a RelationalGroupedDataset). 
> So I built a little Aggregator that can do that; here's an example of how 
> it's called:
> {noformat}
> val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn
> // For each grouping, compute a PCA matrix/vector
> val pcaModels = inputData
>   .groupBy(keys:_*)
>   .agg(pcaAggregation.as(pcaOutput)){noformat}
> I used the same algorithms under the hood as 
> RowMatrix.computePrincipalComponentsAndExplainedVariance, though this works 
> directly on Datasets without converting to RDD first.
> I've seen others who wanted this ability (for example on Stack Overflow) so 
> I'd like to contribute it if it would be a benefit to the larger community. 
> If there is interest, I will prepare the code for a pull request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25828) Bumping Version of kubernetes.client to latest version

2018-10-24 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16662978#comment-16662978
 ] 

Erik Erlandson commented on SPARK-25828:


cc [~skonto]

> Bumping Version of kubernetes.client to latest version
> --
>
> Key: SPARK-25828
> URL: https://issues.apache.org/jira/browse/SPARK-25828
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Ilan Filonenko
>Priority: Minor
>
> Upgrade the Kubernetes client version to at least 
> [4.0.0|https://mvnrepository.com/artifact/io.fabric8/kubernetes-client/4.0.0] 
> as we are falling behind on fabric8 updates. This will be an update to both 
> kubernetes/core and kubernetes/integration-tests



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25828) Bumping Version of kubernetes.client to latest version

2018-10-26 Thread Erik Erlandson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson resolved SPARK-25828.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22820
[https://github.com/apache/spark/pull/22820]

> Bumping Version of kubernetes.client to latest version
> --
>
> Key: SPARK-25828
> URL: https://issues.apache.org/jira/browse/SPARK-25828
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Ilan Filonenko
>Assignee: Ilan Filonenko
>Priority: Minor
> Fix For: 3.0.0
>
>
> Upgrade the Kubernetes client version to at least 
> [4.0.0|https://mvnrepository.com/artifact/io.fabric8/kubernetes-client/4.0.0] 
> as we are falling behind on fabric8 updates. This will be an update to both 
> kubernetes/core and kubernetes/integration-tests



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25828) Bumping Version of kubernetes.client to latest version

2018-10-26 Thread Erik Erlandson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson reassigned SPARK-25828:
--

Assignee: Ilan Filonenko

> Bumping Version of kubernetes.client to latest version
> --
>
> Key: SPARK-25828
> URL: https://issues.apache.org/jira/browse/SPARK-25828
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Ilan Filonenko
>Assignee: Ilan Filonenko
>Priority: Minor
> Fix For: 3.0.0
>
>
> Upgrade the Kubernetes client version to at least 
> [4.0.0|https://mvnrepository.com/artifact/io.fabric8/kubernetes-client/4.0.0] 
> as we are falling behind on fabric8 updates. This will be an update to both 
> kubernetes/core and kubernetes/integration-tests



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23680) entrypoint.sh does not accept arbitrary UIDs, returning as an error

2018-03-14 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398701#comment-16398701
 ] 

Erik Erlandson commented on SPARK-23680:


[~rmartine] thanks for catching this! It will impact platforms running w/ 
anonymous uid such as OpenShift.

> entrypoint.sh does not accept arbitrary UIDs, returning as an error
> ---
>
> Key: SPARK-23680
> URL: https://issues.apache.org/jira/browse/SPARK-23680
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
> Environment: OpenShift
>Reporter: Ricardo Martinelli de Oliveira
>Priority: Major
>  Labels: easyfix
>
> Openshift supports running pods using arbitrary UIDs 
> ([https://docs.openshift.com/container-platform/3.7/creating_images/guidelines.html#openshift-specific-guidelines)]
>   to improve security. Although entrypoint.sh was developed to cover this 
> feature, the script is returning an error[1].
> The issue is that the script uses getent to find the passwd entry of the 
> current UID, and if the entry is not found it creates an entry in 
> /etc/passwd. According to the getent man page:
> {code:java}
> EXIT STATUS
>    One of the following exit values can be returned by getent:
>   0 Command completed successfully.
>   1 Missing arguments, or database unknown.
>   2 One or more supplied key could not be found in the 
> database.
>   3 Enumeration not supported on this database.
> {code}
> And since the script begins with a "set -ex" command, which turns on debug and 
> aborts the script if any command pipeline returns a non-zero exit code, the line 
> below must be changed to remove the "-e" flag from the set command:
> https://github.com/apache/spark/blob/v2.3.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L20
>  
>  
> [1]https://github.com/apache/spark/blob/v2.3.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L25-L34



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23680) entrypoint.sh does not accept arbitrary UIDs, returning as an error

2018-03-14 Thread Erik Erlandson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson updated SPARK-23680:
---
 Flags: Important
Labels: easyfix  (was: )

> entrypoint.sh does not accept arbitrary UIDs, returning as an error
> ---
>
> Key: SPARK-23680
> URL: https://issues.apache.org/jira/browse/SPARK-23680
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
> Environment: OpenShift
>Reporter: Ricardo Martinelli de Oliveira
>Priority: Major
>  Labels: easyfix
>
> Openshift supports running pods using arbitrary UIDs 
> ([https://docs.openshift.com/container-platform/3.7/creating_images/guidelines.html#openshift-specific-guidelines)]
>   to improve security. Although entrypoint.sh was developed to cover this 
> feature, the script is returning an error[1].
> The issue is that the script uses getent to find the passwd entry of the 
> current UID, and if the entry is not found it creates an entry in 
> /etc/passwd. According to the getent man page:
> {code:java}
> EXIT STATUS
>    One of the following exit values can be returned by getent:
>   0 Command completed successfully.
>   1 Missing arguments, or database unknown.
>   2 One or more supplied key could not be found in the 
> database.
>   3 Enumeration not supported on this database.
> {code}
> And since the script begins with a "set -ex" command, which turns on debug and 
> aborts the script if any command pipeline returns a non-zero exit code, the line 
> below must be changed to remove the "-e" flag from the set command:
> https://github.com/apache/spark/blob/v2.3.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L20
>  
>  
> [1]https://github.com/apache/spark/blob/v2.3.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L25-L34



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23680) entrypoint.sh does not accept arbitrary UIDs, returning as an error

2018-03-16 Thread Erik Erlandson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson resolved SPARK-23680.

  Resolution: Fixed
Target Version/s: 2.3.1, 2.4.0

merged to master

> entrypoint.sh does not accept arbitrary UIDs, returning as an error
> ---
>
> Key: SPARK-23680
> URL: https://issues.apache.org/jira/browse/SPARK-23680
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
> Environment: OpenShift
>Reporter: Ricardo Martinelli de Oliveira
>Priority: Major
>  Labels: easyfix
>
> Openshift supports running pods using arbitrary UIDs 
> ([https://docs.openshift.com/container-platform/3.7/creating_images/guidelines.html#openshift-specific-guidelines)]
>   to improve security. Although entrypoint.sh was developed to cover this 
> feature, the script is returning an error[1].
> The issue is that the script uses getent to find the passwd entry of the 
> current UID, and if the entry is not found it creates an entry in 
> /etc/passwd. According to the getent man page:
> {code:java}
> EXIT STATUS
>    One of the following exit values can be returned by getent:
>   0 Command completed successfully.
>   1 Missing arguments, or database unknown.
>   2 One or more supplied key could not be found in the 
> database.
>   3 Enumeration not supported on this database.
> {code}
> And since the script begins with a "set -ex" command, which turns on debug and 
> aborts the script if any command pipeline returns a non-zero exit code, the line 
> below must be changed to remove the "-e" flag from the set command:
> https://github.com/apache/spark/blob/v2.3.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L20
>  
>  
> [1]https://github.com/apache/spark/blob/v2.3.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L25-L34



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23680) entrypoint.sh does not accept arbitrary UIDs, returning as an error

2018-03-16 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16402266#comment-16402266
 ] 

Erik Erlandson commented on SPARK-23680:


The commit workflow indicates to set the Assignee; however, I cannot edit that field.

> entrypoint.sh does not accept arbitrary UIDs, returning as an error
> ---
>
> Key: SPARK-23680
> URL: https://issues.apache.org/jira/browse/SPARK-23680
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
> Environment: OpenShift
>Reporter: Ricardo Martinelli de Oliveira
>Priority: Major
>  Labels: easyfix
>
> Openshift supports running pods using arbitrary UIDs 
> ([https://docs.openshift.com/container-platform/3.7/creating_images/guidelines.html#openshift-specific-guidelines)]
>   to improve security. Although entrypoint.sh was developed to cover this 
> feature, the script is returning an error[1].
> The issue is that the script uses getent to find the passwd entry of the 
> current UID, and if the entry is not found it creates an entry in 
> /etc/passwd. According to the getent man page:
> {code:java}
> EXIT STATUS
>    One of the following exit values can be returned by getent:
>   0 Command completed successfully.
>   1 Missing arguments, or database unknown.
>   2 One or more supplied key could not be found in the 
> database.
>   3 Enumeration not supported on this database.
> {code}
> And since the script begins with a "set -ex" command, which turns on debug and 
> aborts the script if any command pipeline returns a non-zero exit code, the line 
> below must be changed to remove the "-e" flag from the set command:
> https://github.com/apache/spark/blob/v2.3.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L20
>  
>  
> [1]https://github.com/apache/spark/blob/v2.3.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L25-L34



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23891) Debian based Dockerfile

2018-04-12 Thread Erik Erlandson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson updated SPARK-23891:
---
  Priority: Minor  (was: Major)
Issue Type: New Feature  (was: Bug)

> Debian based Dockerfile
> ---
>
> Key: SPARK-23891
> URL: https://issues.apache.org/jira/browse/SPARK-23891
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Sercan Karaoglu
>Priority: Minor
>
> The current dockerfile inherits from alpine linux, which causes the netty tcnative 
> ssl bindings to fail while loading; this is the case when we use Google Cloud 
> Platform's Bigtable client on top of a spark cluster. It would be better to have 
> another debian based dockerfile.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23891) Debian based Dockerfile

2018-04-12 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16436185#comment-16436185
 ] 

Erik Erlandson commented on SPARK-23891:


The question of what OS base to use for "canonical" images or dockerfiles is an 
open one. The use of alpine was influenced by the relatively small image size 
that resulted. We could entertain arguments about why debian, centos, or some 
other OS, might be an advantage.

The current position of the Apache Spark project is that the dockerfiles 
shipped with the project are for reference, and as an aid to users building 
their own images for use with the kubernetes back-end.  IMO, the project should 
not get into the business of supporting _multiple_ dockerfiles at the present 
time. In the future, if/when the "container image api" stabilizes further, we 
might reconsider maintaining multiple dockerfiles.

I'm interested if others have a different point of view; my take currently is 
that if users would like to construct similar dockerfiles using an alternative 
base OS, it would be great to publish that as a github project where interested 
community members could use it.

> Debian based Dockerfile
> ---
>
> Key: SPARK-23891
> URL: https://issues.apache.org/jira/browse/SPARK-23891
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Sercan Karaoglu
>Priority: Minor
>
> The current dockerfile inherits from alpine linux, which causes the netty tcnative 
> ssl bindings to fail while loading; this is the case when we use Google Cloud 
> Platform's Bigtable client on top of a spark cluster. It would be better to have 
> another debian based dockerfile.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23891) Debian based Dockerfile

2018-04-12 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16436196#comment-16436196
 ] 

Erik Erlandson commented on SPARK-23891:


I do think that these reports are very useful for collecting data on community 
use cases. Is this incompatibility something fundamental to alpine that can 
only be fixed via debian, or is it possible to hack the alpine build to fix it?

 

> Debian based Dockerfile
> ---
>
> Key: SPARK-23891
> URL: https://issues.apache.org/jira/browse/SPARK-23891
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Sercan Karaoglu
>Priority: Minor
>
> The current dockerfile inherits from alpine linux, which causes the netty tcnative 
> ssl bindings to fail while loading; this is the case when we use Google Cloud 
> Platform's Bigtable client on top of a spark cluster. It would be better to have 
> another debian based dockerfile.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23891) Debian based Dockerfile

2018-04-14 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438410#comment-16438410
 ] 

Erik Erlandson commented on SPARK-23891:


[~SercanKaraoglu] thanks for the information! You are correct; Spark also has a 
netty dep. Can you attach your customized docker file to this JIRA? That would 
be a very useful reference for our ongoing container image discussions.

> Debian based Dockerfile
> ---
>
> Key: SPARK-23891
> URL: https://issues.apache.org/jira/browse/SPARK-23891
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Sercan Karaoglu
>Priority: Minor
>
> The current dockerfile inherits from alpine linux, which causes the netty tcnative 
> ssl bindings to fail while loading; this is the case when we use Google Cloud 
> Platform's Bigtable client on top of a spark cluster. It would be better to have 
> another debian based dockerfile.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22647) Docker files for image creation

2017-12-13 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16289988#comment-16289988
 ] 

Erik Erlandson commented on SPARK-22647:


I'd like to propose migrating our images onto centos, which should also fix 
this particular issue.

> Docker files for image creation
> ---
>
> Key: SPARK-22647
> URL: https://issues.apache.org/jira/browse/SPARK-22647
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>
> This covers the dockerfiles that need to be shipped to enable the Kubernetes 
> backend for Spark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23137) spark.kubernetes.executor.podNamePrefix is ignored

2018-01-17 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329721#comment-16329721
 ] 

Erik Erlandson commented on SPARK-23137:


+1, a more general "app prefix" seems more useful

> spark.kubernetes.executor.podNamePrefix is ignored
> --
>
> Key: SPARK-23137
> URL: https://issues.apache.org/jira/browse/SPARK-23137
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>
> [~liyinan926] is fixing this as we speak. Should be a very minor change.
> It's also a non-critical option, so, if we decide that the safer thing is to 
> just remove it, we can do that as well. Will leave that decision to the 
> release czar and reviewers.
>  
> [~vanzin] [~felixcheung] [~sameerag]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23324) Announce new Kubernetes back-end for 2.3 release notes

2018-02-02 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-23324:
--

 Summary: Announce new Kubernetes back-end for 2.3 release notes
 Key: SPARK-23324
 URL: https://issues.apache.org/jira/browse/SPARK-23324
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, Kubernetes
Affects Versions: 2.3.0
Reporter: Erik Erlandson


This is an issue to request that the new Kubernetes scheduler back-end gets 
called out in the 2.3 release notes, as it is a prominent new feature.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23324) Announce new Kubernetes back-end for 2.3 release notes

2018-02-02 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16351020#comment-16351020
 ] 

Erik Erlandson commented on SPARK-23324:


cc [~sameer], [~foxish]

> Announce new Kubernetes back-end for 2.3 release notes
> --
>
> Key: SPARK-23324
> URL: https://issues.apache.org/jira/browse/SPARK-23324
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Kubernetes
>Affects Versions: 2.3.0
>Reporter: Erik Erlandson
>Priority: Major
>  Labels: documentation, kubernetes, release_notes
>
> This is an issue to request that the new Kubernetes scheduler back-end gets 
> called out in the 2.3 release notes, as it is a prominent new feature.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24434) Support user-specified driver and executor pod templates

2018-11-26 Thread Erik Erlandson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson updated SPARK-24434:
---
Fix Version/s: 3.0.0

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
> Fix For: 3.0.0
>
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3250) More Efficient Sampling

2014-09-11 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130809#comment-14130809
 ] 

Erik Erlandson commented on SPARK-3250:
---

I developed prototype iterator classes for "fast gap sampling" with and without 
replacement.  The code, testing rig and test results can be seen here:
https://gist.github.com/erikerlandson/05db1f15c8d623448ff6

I also wrote up some discussion of the algorithms here:

Faster Random Samples With Gap Sampling
http://erikerlandson.github.io/blog/2014/09/11/faster-random-samples-with-gap-sampling/
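
For illustration, here is a minimal sketch of the without-replacement variant
(illustrative only; the actual prototype classes are in the gist above, and this
assumes a sampling probability 0 < p < 1):

{code}
import scala.util.Random

// Instead of a Bernoulli trial per element, draw the gap to the next accepted
// element from a geometric distribution and skip ahead; the skipping is where
// the speedup at low sampling probability comes from.
def gapSample[T](data: Iterator[T], p: Double, rng: Random = new Random): Iterator[T] =
  new Iterator[T] {
    // number of elements to skip before the next accepted element
    private def nextGap(): Int =
      (math.log(1.0 - rng.nextDouble()) / math.log(1.0 - p)).toInt
    private var cur: Option[T] = None
    private def skip(n: Int): Unit = {
      var i = 0
      while (i < n && data.hasNext) { data.next(); i += 1 }
    }
    private def advance(): Unit = {
      skip(nextGap())
      cur = if (data.hasNext) Some(data.next()) else None
    }
    advance()
    def hasNext: Boolean = cur.nonEmpty
    def next(): T = { val t = cur.get; advance(); t }
  }
{code}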


> More Efficient Sampling
> ---
>
> Key: SPARK-3250
> URL: https://issues.apache.org/jira/browse/SPARK-3250
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: RJ Nowling
>
> Sampling, as currently implemented in Spark, is an O\(n\) operation.  A 
> number of stochastic algorithms achieve speed ups by exploiting O\(k\) 
> sampling, where k is the number of data points to sample.  Examples of such 
> algorithms include KMeans MiniBatch (SPARK-2308) and Stochastic Gradient 
> Descent with mini batching.
> More efficient sampling may be achievable by packing partitions with an 
> ArrayBuffer or other data structure supporting random access.  Since many of 
> these stochastic algorithms perform repeated rounds of sampling, it may be 
> feasible to perform a transformation to change the backing data structure 
> followed by multiple rounds of sampling.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3250) More Efficient Sampling

2014-09-18 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139965#comment-14139965
 ] 

Erik Erlandson commented on SPARK-3250:
---

PR: https://github.com/apache/spark/pull/2455
[SPARK-3250] Implement Gap Sampling optimization for random sampling


> More Efficient Sampling
> ---
>
> Key: SPARK-3250
> URL: https://issues.apache.org/jira/browse/SPARK-3250
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: RJ Nowling
>
> Sampling, as currently implemented in Spark, is an O\(n\) operation.  A 
> number of stochastic algorithms achieve speed ups by exploiting O\(k\) 
> sampling, where k is the number of data points to sample.  Examples of such 
> algorithms include KMeans MiniBatch (SPARK-2308) and Stochastic Gradient 
> Descent with mini batching.
> More efficient sampling may be achievable by packing partitions with an 
> ArrayBuffer or other data structure supporting random access.  Since many of 
> these stochastic algorithms perform repeated rounds of sampling, it may be 
> feasible to perform a transformation to change the backing data structure 
> followed by multiple rounds of sampling.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-2315) drop, dropRight and dropWhile which take RDD input and return RDD

2015-06-13 Thread Erik Erlandson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson closed SPARK-2315.
-
Resolution: Won't Fix

There is now an implementation on the silex project:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.rdd.drop.DropRDDFunctions

> drop, dropRight and dropWhile which take RDD input and return RDD
> -
>
> Key: SPARK-2315
> URL: https://issues.apache.org/jira/browse/SPARK-2315
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Erik Erlandson
>Assignee: Erik Erlandson
>  Labels: features
>
> Last time I loaded in a text file, I found myself wanting to just skip the 
> first element as it was a header. I wrote candidate methods drop, 
> dropRight and dropWhile to satisfy this kind of need:
> val txt = sc.textFile("text_with_header.txt")
> val data = txt.drop(1)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1493) Apache RAT excludes don't work with file path (instead of file name)

2014-06-09 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025895#comment-14025895
 ] 

Erik Erlandson commented on SPARK-1493:
---

RAT itself appears to preclude exclusion using a "/path/to/file.ext" regex 
because it traverses the directory tree and applies its exclusion filter only 
to individual file names.  The filter never sees an entire path 
"path/to/file.ext", only "path", "to", and "file.ext"

https://github.com/apache/rat/blob/incubator-site-import/rat/rat-core/src/main/java/org/apache/rat/DirectoryWalker.java#L127

Either RAT needs a new filtering feature that can see an entire path, or the 
report it generates has to be filtered post-hoc.
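
As an illustration of the post-hoc option, a minimal sketch (file names and
patterns here are hypothetical, not Spark's actual build tooling) could filter
the generated report against full-path regexes:

{code}
import scala.io.Source

// Drop any RAT report line that matches one of the full-path exclusion
// regexes that RAT's own name-based filter cannot express.
def filterRatReport(reportFile: String, pathExcludes: Seq[String]): Seq[String] = {
  val patterns = pathExcludes.map(_.r)
  Source.fromFile(reportFile).getLines().toList
    .filterNot(line => patterns.exists(_.findFirstIn(line).isDefined))
}

// e.g. filterRatReport("rat-report.txt", Seq("""/path/to/file\.ext"""))
{code}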


> Apache RAT excludes don't work with file path (instead of file name)
> 
>
> Key: SPARK-1493
> URL: https://issues.apache.org/jira/browse/SPARK-1493
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Patrick Wendell
>  Labels: starter
> Fix For: 1.1.0
>
>
> Right now the way we do RAT checks, it doesn't work if you try to exclude:
> /path/to/file.ext
> you have to just exclude
> file.ext



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1493) Apache RAT excludes don't work with file path (instead of file name)

2014-06-09 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025895#comment-14025895
 ] 

Erik Erlandson edited comment on SPARK-1493 at 6/9/14 11:13 PM:


RAT itself appears to preclude exclusion using a "/path/to/file.ext" regex 
because it traverses the directory tree and applies its exclusion filter only 
to individual file names.  The filter never sees an entire path 
"path/to/file.ext", only "path", "to", and "file.ext"

https://github.com/apache/rat/blob/incubator-site-import/rat/rat-core/src/main/java/org/apache/rat/DirectoryWalker.java#L127

Either RAT needs a new filtering feature that can see an entire path, or the 
report it generates has to be filtered post-hoc.

Filed an RFE against RAT:  RAT-161


was (Author: eje):
RAT itself appears to preclude exclusion using a "/path/to/file.ext" regex 
because it traverses the directory tree and applies its exclusion filter only 
to individual file names.  The filter never sees an entire path 
"path/to/file.ext", only "path", "to", and "file.ext"

https://github.com/apache/rat/blob/incubator-site-import/rat/rat-core/src/main/java/org/apache/rat/DirectoryWalker.java#L127

Either RAT needs a new filtering feature that can see an entire path, or the 
report it generates has to be filtered post-hoc.


> Apache RAT excludes don't work with file path (instead of file name)
> 
>
> Key: SPARK-1493
> URL: https://issues.apache.org/jira/browse/SPARK-1493
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Patrick Wendell
>  Labels: starter
> Fix For: 1.1.0
>
>
> Right now the way we do RAT checks, it doesn't work if you try to exclude:
> /path/to/file.ext
> you have to just exclude
> file.ext



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1493) Apache RAT excludes don't work with file path (instead of file name)

2014-06-20 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038921#comment-14038921
 ] 

Erik Erlandson commented on SPARK-1493:
---

I submitted a proposed patch for RAT-161, which allows one to request 
path-spanning patterns by including a leading '/'.

If the '--dir' argument is /path/to/repo, and the contents of the '-E' file include:
/subpath/to/.*ext

then the pattern induced is:
/path/to/repo + /subpath/to/.*ext --> /path/to/repo/subpath/to/.*ext


> Apache RAT excludes don't work with file path (instead of file name)
> 
>
> Key: SPARK-1493
> URL: https://issues.apache.org/jira/browse/SPARK-1493
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Patrick Wendell
>  Labels: starter
>
> Right now the way we do RAT checks, it doesn't work if you try to exclude:
> /path/to/file.ext
> you have to just exclude
> file.ext



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLib

2014-06-20 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039256#comment-14039256
 ] 

Erik Erlandson commented on SPARK-1406:
---

How about a PMML pickler/unpickler, written as an extension to:
https://github.com/scala/pickling


> PMML model evaluation support via MLib
> --
>
> Key: SPARK-1406
> URL: https://issues.apache.org/jira/browse/SPARK-1406
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Thomas Darimont
>
> It would be useful if spark would provide support the evaluation of PMML 
> models (http://www.dmg.org/v4-2/GeneralStructure.html).
> This would allow to use analytical models that were created with a 
> statistical modeling tool like R, SAS, SPSS, etc. with Spark (MLib) which 
> would perform the actual model evaluation for a given input tuple. The PMML 
> model would then just contain the "parameterization" of an analytical model.
> Other projects like JPMML-Evaluator do a similar thing.
> https://github.com/jpmml/jpmml/tree/master/pmml-evaluator



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1856) Standardize MLlib interfaces

2014-06-20 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039495#comment-14039495
 ] 

Erik Erlandson commented on SPARK-1856:
---

How does this jira relate (if at all) to the MLI project, part of whose purpose 
is (or was) to provide a unified and pluggable type system for machine learning 
models, their inputs, parameters and training?

http://www.cs.berkeley.edu/~ameet/mlbase_website/mlbase_website/publications.html


> Standardize MLlib interfaces
> 
>
> Key: SPARK-1856
> URL: https://issues.apache.org/jira/browse/SPARK-1856
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
> Fix For: 1.1.0
>
>
> Instead of expanding MLlib based on the current class naming scheme 
> (ProblemWithAlgorithm),  we should standardize MLlib's interfaces that 
> clearly separate datasets, formulations, algorithms, parameter sets, and 
> models.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2262) Extreme Learning Machines (ELM) for MLLib

2014-06-24 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14042731#comment-14042731
 ] 

Erik Erlandson commented on SPARK-2262:
---

I'd like to have this assigned to me

> Extreme Learning Machines (ELM) for MLLib
> -
>
> Key: SPARK-2262
> URL: https://issues.apache.org/jira/browse/SPARK-2262
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Erik Erlandson
>  Labels: features
>
> MLLib has a gap in the NN space.  There's some good reason for this, as 
> batching gradient updates in traditional backprop training is known not to 
> perform well.
> However, Extreme Learning Machines (ELM) combine support for nonlinear 
> activation functions in a hidden layer with batch-friendly linear training. 
> There is also a body of ELM literature on various avenues for extension, 
> including multi-category classification, multiple hidden layers, and adaptive 
> addition/deletion of hidden nodes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2262) Extreme Learning Machines (ELM) for MLLib

2014-06-24 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-2262:
-

 Summary: Extreme Learning Machines (ELM) for MLLib
 Key: SPARK-2262
 URL: https://issues.apache.org/jira/browse/SPARK-2262
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Erik Erlandson


MLLib has a gap in the NN space.  There's some good reason for this, as 
batching gradient updates in traditional backprop training is known not to 
perform well.

However, Extreme Learning Machines (ELM) combine support for nonlinear 
activation functions in a hidden layer with batch-friendly linear training.  
There is also a body of ELM literature on various avenues for extension, 
including multi-category classification, multiple hidden layers, and adaptive 
addition/deletion of hidden nodes.
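
As a rough illustration of why ELM training is batch-friendly, here is a
minimal single-hidden-layer sketch (assuming Breeze for the linear algebra;
illustrative only, not proposed MLlib code): the input-to-hidden weights are
random and fixed, so fitting reduces to one linear least-squares solve for the
output weights.

{code}
import breeze.linalg.{pinv, DenseMatrix, DenseVector}
import breeze.numerics.sigmoid

// x: n x d input matrix, y: length-n target vector, hidden: hidden layer size
def trainELM(x: DenseMatrix[Double], y: DenseVector[Double], hidden: Int)
    : (DenseMatrix[Double], DenseVector[Double]) = {
  val w = DenseMatrix.rand(x.cols, hidden)   // random, fixed input->hidden weights
  val h = sigmoid(x * w)                     // nonlinear hidden activations
  val beta = pinv(h) * y                     // least-squares output weights
  (w, beta)
}

def predictELM(x: DenseMatrix[Double], w: DenseMatrix[Double],
               beta: DenseVector[Double]): DenseVector[Double] =
  sigmoid(x * w) * beta
{code}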



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2315) drop, dropRight and dropWhile which take RDD input and return RDD

2014-06-27 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-2315:
-

 Summary: drop, dropRight and dropWhile which take RDD input and 
return RDD
 Key: SPARK-2315
 URL: https://issues.apache.org/jira/browse/SPARK-2315
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Erik Erlandson


Last time I loaded in a text file, I found myself wanting to just skip the 
first element as it was a header. I wrote candidate methods drop, dropRight 
and dropWhile to satisfy this kind of need:

val txt = sc.textFile("text_with_header.txt")
val data = txt.drop(1)
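
For reference, a common workaround sketch (not the proposed drop() API): skip
the header by touching only the first partition, assuming the header sits at
the start of partition 0.

{code}
val txt = sc.textFile("text_with_header.txt")
val data = txt.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter   // drop the header line from partition 0 only
}
{code}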




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2315) drop, dropRight and dropWhile which take RDD input and return RDD

2014-06-27 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046725#comment-14046725
 ] 

Erik Erlandson commented on SPARK-2315:
---

PR:  https://github.com/apache/spark/pull/1254


> drop, dropRight and dropWhile which take RDD input and return RDD
> -
>
> Key: SPARK-2315
> URL: https://issues.apache.org/jira/browse/SPARK-2315
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Erik Erlandson
>  Labels: features
>
> Last time I loaded in a text file, I found myself wanting to just skip the 
> first element as it was a header. I wrote candidate methods drop, 
> dropRight and dropWhile to satisfy this kind of need:
> val txt = sc.textFile("text_with_header.txt")
> val data = txt.drop(1)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2352) [MLLIB] Add Artificial Neural Network (ANN) to Spark

2014-07-06 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053119#comment-14053119
 ] 

Erik Erlandson commented on SPARK-2352:
---

Related:  SPARK-2262

> [MLLIB] Add Artificial Neural Network (ANN) to Spark
> 
>
> Key: SPARK-2352
> URL: https://issues.apache.org/jira/browse/SPARK-2352
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
> Environment: MLLIB code
>Reporter: Bert Greevenbosch
>
> It would be good if the Machine Learning Library contained Artificial Neural 
> Networks (ANNs).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1486) Support multi-model training in MLlib

2014-07-15 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062339#comment-14062339
 ] 

Erik Erlandson commented on SPARK-1486:
---

Does the dev work on this issue effectively subsume SPARK-1457 and/or SPARK-1856?


> Support multi-model training in MLlib
> -
>
> Key: SPARK-1486
> URL: https://issues.apache.org/jira/browse/SPARK-1486
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
> Fix For: 1.1.0
>
>
> It is rare in practice to train just one model with a given set of 
> parameters. Usually, this is done by training multiple models with different 
> sets of parameters and then select the best based on their performance on the 
> validation set. MLlib should provide native support for multi-model 
> training/scoring. It requires decoupling of concepts like problem, 
> formulation, algorithm, parameter set, and model, which are missing in MLlib 
> now. MLI implements similar concepts, which we can borrow. There are 
> different approaches for multi-model training:
> 0) Keep one copy of the data, and train models one after another (or maybe in 
> parallel, depending on the scheduler).
> 1) Keep one copy of the data, and train multiple models at the same time 
> (similar to `runs` in KMeans).
> 2) Make multiple copies of the data (still stored distributively), and use 
> more cores to distribute the work.
> 3) Collect the data, make the entire dataset available on workers, and train 
> one or more models on each worker.
> Users should be able to choose which execution mode they want to use. Note 
> that 3) could cover many use cases in practice when the training data is not 
> huge, e.g., <1GB.
> This task will be divided into sub-tasks and this JIRA is created to discuss 
> the design and track the overall progress.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2488) Model SerDe in MLlib

2014-07-15 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062486#comment-14062486
 ] 

Erik Erlandson commented on SPARK-2488:
---

Related, possible duplicate:  SPARK-1406


> Model SerDe in MLlib
> 
>
> Key: SPARK-2488
> URL: https://issues.apache.org/jira/browse/SPARK-2488
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>
> Support model serialization/deserialization in MLlib. The first version could 
> be text-based.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2315) drop, dropRight and dropWhile which take RDD input and return RDD

2014-07-30 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14079300#comment-14079300
 ] 

Erik Erlandson commented on SPARK-2315:
---

Updated the PR with a proper lazy-transform implementation:
http://erikerlandson.github.io/blog/2014/07/29/deferring-spark-actions-to-lazy-transforms-with-the-promise-rdd/


> drop, dropRight and dropWhile which take RDD input and return RDD
> -
>
> Key: SPARK-2315
> URL: https://issues.apache.org/jira/browse/SPARK-2315
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Erik Erlandson
>  Labels: features
>
> Last time I loaded in a text file, I found myself wanting to just skip the 
> first element as it was a header. I wrote candidate methods drop, 
> dropRight and dropWhile to satisfy this kind of need:
> val txt = sc.textFile("text_with_header.txt")
> val data = txt.drop(1)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't

2014-07-30 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14080250#comment-14080250
 ] 

Erik Erlandson commented on SPARK-1021:
---

I deferred the computation of the partition bounds this way, and it seems to work 
properly in my testing and the unit tests:
https://github.com/erikerlandson/spark/compare/erikerlandson:rdd_drop_master...spark-1021
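
For illustration, the core of the deferral is just moving the bounds behind a
lazy val (a minimal sketch, not the actual Partitioner.scala change in the
branch above):

{code}
// The cluster job that samples the RDD to compute bounds now runs on first
// access to rangeBounds (i.e. when an action forces partitioning), not when
// the partitioner is constructed during the sortByKey transform.
class DeferredRangeBounds[K](computeBounds: () => Array[K]) extends Serializable {
  lazy val rangeBounds: Array[K] = computeBounds()
}
{code}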


> sortByKey() launches a cluster job when it shouldn't
> 
>
> Key: SPARK-1021
> URL: https://issues.apache.org/jira/browse/SPARK-1021
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.8.0, 0.9.0
>Reporter: Andrew Ash
>Assignee: Mark Hamstra
>  Labels: starter
>
> The sortByKey() method is listed as a transformation, not an action, in the 
> documentation.  But it launches a cluster job regardless.
> http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html
> Some discussion on the mailing list suggested that this is a problem with the 
> rdd.count() call inside Partitioner.scala's rangeBounds method.
> https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L102
> Josh Rosen suggests that rangeBounds should be made into a lazy variable:
> {quote}
> I wonder whether making RangePartitoner .rangeBounds into a lazy val would 
> fix this 
> (https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95).
>   We'd need to make sure that rangeBounds() is never called before an action 
> is performed.  This could be tricky because it's called in the 
> RangePartitioner.equals() method.  Maybe it's sufficient to just compare the 
> number of partitions, the ids of the RDDs used to create the 
> RangePartitioner, and the sort ordering.  This still supports the case where 
> I range-partition one RDD and pass the same partitioner to a different RDD.  
> It breaks support for the case where two range partitioners created on 
> different RDDs happened to have the same rangeBounds(), but it seems unlikely 
> that this would really harm performance since it's probably unlikely that the 
> range partitioners are equal by chance.
> {quote}
> Can we please make this happen?  I'll send a PR on GitHub to start the 
> discussion and testing.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2911) provide rdd.parent[T](j) to obtain jth parent of rdd

2014-08-07 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-2911:
-

 Summary: provide rdd.parent[T](j) to obtain jth parent of rdd
 Key: SPARK-2911
 URL: https://issues.apache.org/jira/browse/SPARK-2911
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Erik Erlandson
Priority: Minor


For writing RDD subclasses that involve more than a single parent dependency, 
it would be convenient (and more readable) to say:

rdd.parent[T](j)

instead of:

rdd.dependencies(j).rdd.asInstanceOf[RDD[T]]
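
For illustration, such a convenience could be provided as an enrichment over the
existing public dependencies sequence (a minimal sketch; names here are
illustrative, not from an actual patch):

{code}
import org.apache.spark.rdd.RDD

// rdd.parent[T](j): typed access to the j-th parent dependency's RDD
implicit class RDDParentOps[U](rdd: RDD[U]) {
  def parent[T](j: Int): RDD[T] =
    rdd.dependencies(j).rdd.asInstanceOf[RDD[T]]
}
{code}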





--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2911) provide rdd.parent[T](j) to obtain jth parent of rdd

2014-08-08 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090734#comment-14090734
 ] 

Erik Erlandson commented on SPARK-2911:
---

OK, shall I do it as part of this jira or file a separate one?

> provide rdd.parent[T](j) to obtain jth parent of rdd
> 
>
> Key: SPARK-2911
> URL: https://issues.apache.org/jira/browse/SPARK-2911
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Erik Erlandson
>Priority: Minor
>  Labels: easyfix, easytest
>
> For writing RDD subclasses that involve more than a single parent dependency, 
> it would be convenient (and more readable) to say:
> rdd.parent[T](j)
> instead of:
> rdd.dependencies(j).rdd.asInstanceOf[RDD[T]]



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2991) RDD transforms for scan and scanLeft

2014-08-12 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-2991:
-

 Summary: RDD transforms for scan and scanLeft 
 Key: SPARK-2991
 URL: https://issues.apache.org/jira/browse/SPARK-2991
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Erik Erlandson
Priority: Minor


Provide RDD transforms analogous to Scala scan(z)(f) (parallel prefix scan) and 
scanLeft(z)(f) (sequential prefix scan)

Discussion of a scanLeft implementation:
http://erikerlandson.github.io/blog/2014/08/09/implementing-an-rdd-scanleft-transform-with-cascade-rdds/

Discussion of scan:
http://erikerlandson.github.io/blog/2014/08/12/implementing-parallel-prefix-scan-as-a-spark-rdd-transform/
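
For illustration, here is a minimal sketch of the sequential prefix-scan idea
for the additive case (illustrative only; the cascade-RDD approach in the posts
above keeps the transform lazy, whereas this sketch triggers a job to collect
per-partition totals):

{code}
import org.apache.spark.rdd.RDD

// Running prefix sums: one output value per input element (unlike Scala's
// scanLeft, which also emits the initial z).
def prefixSums(rdd: RDD[Double], z: Double): RDD[Double] = {
  val partTotals = rdd.mapPartitions(it => Iterator(it.sum)).collect()
  val offsets = partTotals.scanLeft(z)(_ + _)    // running offset entering each partition
  val bc = rdd.sparkContext.broadcast(offsets)
  rdd.mapPartitionsWithIndex { (idx, it) =>
    it.scanLeft(bc.value(idx))(_ + _).drop(1)    // drop the per-partition seed value
  }
}
{code}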




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2992) The transforms formerly known as non-lazy

2014-08-12 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-2992:
-

 Summary: The transforms formerly known as non-lazy
 Key: SPARK-2992
 URL: https://issues.apache.org/jira/browse/SPARK-2992
 Project: Spark
  Issue Type: Umbrella
  Components: Spark Core
Reporter: Erik Erlandson
Priority: Minor


An umbrella for a grab-bag of tickets involving lazy implementations of 
transforms formerly thought to be non-lazy.




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2991) RDD transforms for scan and scanLeft

2014-08-12 Thread Erik Erlandson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson updated SPARK-2991:
--

Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-2992

> RDD transforms for scan and scanLeft 
> -
>
> Key: SPARK-2991
> URL: https://issues.apache.org/jira/browse/SPARK-2991
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Erik Erlandson
>Assignee: Erik Erlandson
>Priority: Minor
>  Labels: features
>
> Provide RDD transforms analogous to Scala scan(z)(f) (parallel prefix scan) 
> and scanLeft(z)(f) (sequential prefix scan)
> Discussion of a scanLeft implementation:
> http://erikerlandson.github.io/blog/2014/08/09/implementing-an-rdd-scanleft-transform-with-cascade-rdds/
> Discussion of scan:
> http://erikerlandson.github.io/blog/2014/08/12/implementing-parallel-prefix-scan-as-a-spark-rdd-transform/



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't

2014-08-12 Thread Erik Erlandson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson updated SPARK-1021:
--

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-2992

> sortByKey() launches a cluster job when it shouldn't
> 
>
> Key: SPARK-1021
> URL: https://issues.apache.org/jira/browse/SPARK-1021
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 0.8.0, 0.9.0, 1.0.0, 1.1.0
>Reporter: Andrew Ash
>Assignee: Mark Hamstra
>  Labels: starter
>
> The sortByKey() method is listed as a transformation, not an action, in the 
> documentation.  But it launches a cluster job regardless.
> http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html
> Some discussion on the mailing list suggested that this is a problem with the 
> rdd.count() call inside Partitioner.scala's rangeBounds method.
> https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L102
> Josh Rosen suggests that rangeBounds should be made into a lazy variable:
> {quote}
> I wonder whether making RangePartitoner .rangeBounds into a lazy val would 
> fix this 
> (https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95).
>   We'd need to make sure that rangeBounds() is never called before an action 
> is performed.  This could be tricky because it's called in the 
> RangePartitioner.equals() method.  Maybe it's sufficient to just compare the 
> number of partitions, the ids of the RDDs used to create the 
> RangePartitioner, and the sort ordering.  This still supports the case where 
> I range-partition one RDD and pass the same partitioner to a different RDD.  
> It breaks support for the case where two range partitioners created on 
> different RDDs happened to have the same rangeBounds(), but it seems unlikely 
> that this would really harm performance since it's probably unlikely that the 
> range partitioners are equal by chance.
> {quote}
> Can we please make this happen?  I'll send a PR on GitHub to start the 
> discussion and testing.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2315) drop, dropRight and dropWhile which take RDD input and return RDD

2014-08-12 Thread Erik Erlandson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson updated SPARK-2315:
--

Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-2992

> drop, dropRight and dropWhile which take RDD input and return RDD
> -
>
> Key: SPARK-2315
> URL: https://issues.apache.org/jira/browse/SPARK-2315
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Erik Erlandson
>Assignee: Erik Erlandson
>  Labels: features
>
> Last time I loaded in a text file, I found myself wanting to just skip the 
> first element as it was a header. I wrote candidate methods drop, 
> dropRight and dropWhile to satisfy this kind of need:
> val txt = sc.textFile("text_with_header.txt")
> val data = txt.drop(1)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3104) Jenkins failing to test some PRs when asked to

2014-08-18 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-3104:
-

 Summary: Jenkins failing to test some PRs when asked to
 Key: SPARK-3104
 URL: https://issues.apache.org/jira/browse/SPARK-3104
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Reporter: Erik Erlandson


I've seen a few PRs where Jenkins does not appear to be testing when requested:

https://github.com/apache/spark/pull/1964
https://github.com/apache/spark/pull/1254
https://github.com/apache/spark/pull/1839

Maybe the Jenkins logs have a record of what's going wrong?




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2360) CSV import to SchemaRDDs

2014-08-22 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107233#comment-14107233
 ] 

Erik Erlandson commented on SPARK-2360:
---

It appears that this is not a pure lazy transform, as it invokes {{first()}} when 
inferring the schema from headers.
I wrote up some ideas on this, pertaining to SPARK-2315, here:
http://erikerlandson.github.io/blog/2014/07/29/deferring-spark-actions-to-lazy-transforms-with-the-promise-rdd/


> CSV import to SchemaRDDs
> 
>
> Key: SPARK-2360
> URL: https://issues.apache.org/jira/browse/SPARK-2360
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Hossein Falaki
>
> I think the first step it to design the interface that we want to present to 
> users.  Mostly this is defining options when importing.  Off the top of my 
> head:
> - What is the separator?
> - Provide column names or infer them from the first row.
> - how to handle multiple files with possibly different schemas
> - do we have a method to let users specify the datatypes of the columns or 
> are they just strings?
> - what types of quoting / escaping do we want to support?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3250) More Efficient Sampling

2014-08-28 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114789#comment-14114789
 ] 

Erik Erlandson commented on SPARK-3250:
---

I did some experiments with sampling that models the gaps between samples (so 
one can use iterator.drop between samples).  The results are here:

https://gist.github.com/erikerlandson/66b42d96500589f25553

There appears to be a crossover point in efficiency, around sampling 
probability p=0.3, where densities below 0.3 are best done using the new logic, 
and higher sampling densities are better done using traditional filter-based 
logic.

I need to run more tests, but the first results are promising.  At low sampling 
densities the improvement is large.

> More Efficient Sampling
> ---
>
> Key: SPARK-3250
> URL: https://issues.apache.org/jira/browse/SPARK-3250
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: RJ Nowling
>
> Sampling, as currently implemented in Spark, is an O\(n\) operation.  A 
> number of stochastic algorithms achieve speed ups by exploiting O\(k\) 
> sampling, where k is the number of data points to sample.  Examples of such 
> algorithms include KMeans MiniBatch (SPARK-2308) and Stochastic Gradient 
> Descent with mini batching.
> More efficient sampling may be achievable by packing partitions with an 
> ArrayBuffer or other data structure supporting random access.  Since many of 
> these stochastic algorithms perform repeated rounds of sampling, it may be 
> feasible to perform a transformation to change the backing data structure 
> followed by multiple rounds of sampling.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2991) RDD transforms for scan and scanLeft

2015-08-26 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14715417#comment-14715417
 ] 

Erik Erlandson commented on SPARK-2991:
---

This RFE now has a [pull-req|https://github.com/willb/silex/pull/28] against 
the 3rd-party [silex|https://github.com/willb/silex] library:
https://github.com/willb/silex/pull/28


> RDD transforms for scan and scanLeft 
> -
>
> Key: SPARK-2991
> URL: https://issues.apache.org/jira/browse/SPARK-2991
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Erik Erlandson
>Assignee: Erik Erlandson
>Priority: Minor
>  Labels: features
>
> Provide RDD transforms analogous to Scala scan(z)(f) (parallel prefix scan) 
> and scanLeft(z)(f) (sequential prefix scan)
> Discussion of a scanLeft implementation:
> http://erikerlandson.github.io/blog/2014/08/09/implementing-an-rdd-scanleft-transform-with-cascade-rdds/
> Discussion of scan:
> http://erikerlandson.github.io/blog/2014/08/12/implementing-parallel-prefix-scan-as-a-spark-rdd-transform/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30520) Eliminate deprecation warnings for UserDefinedAggregateFunction

2020-06-15 Thread Erik Erlandson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136153#comment-17136153
 ] 

Erik Erlandson commented on SPARK-30520:


Starting in Spark 3.0, any custom aggregator that would have been implemented 
using UserDefinedAggregateFunction should now be implemented using Aggregator. 
To use a custom Aggregator with a dynamically typed DataFrame (i.e. 
Dataset[Row]), register it using org.apache.spark.sql.functions.udaf.
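
A minimal sketch of that migration path (the aggregator, registered name, and
data here are illustrative, not taken from the Spark code base):

{code}
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf

// A simple Aggregator standing in for what would previously have been a
// UserDefinedAggregateFunction.
object MySum extends Aggregator[Double, Double, Double] {
  def zero: Double = 0.0
  def reduce(b: Double, a: Double): Double = b + a
  def merge(b1: Double, b2: Double): Double = b1 + b2
  def finish(r: Double): Double = r
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

val spark = SparkSession.builder.getOrCreate()
// Register it for use on untyped DataFrames / SQL, per the comment above.
spark.udf.register("mysum", udaf(MySum))
spark.sql("SELECT mysum(x) FROM VALUES (1.0), (2.0), (3.0) AS t(x)").show()
{code}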

 

> Eliminate deprecation warnings for UserDefinedAggregateFunction
> ---
>
> Key: SPARK-30520
> URL: https://issues.apache.org/jira/browse/SPARK-30520
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> {code}
> /Users/maxim/proj/eliminate-expr-info-warnings/sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala
> Warning:Warning:line (718)class UserDefinedAggregateFunction in package 
> expressions is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now 
> be registered as a UDF via the functions.udaf(agg) method.
>   val udaf = 
> clazz.getConstructor().newInstance().asInstanceOf[UserDefinedAggregateFunction]
> Warning:Warning:line (719)method register in class UDFRegistration is 
> deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now be registered 
> as a UDF via the functions.udaf(agg) method.
>   register(name, udaf)
> /Users/maxim/proj/eliminate-expr-info-warnings/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/udaf.scala
> Warning:Warning:line (328)class UserDefinedAggregateFunction in package 
> expressions is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now 
> be registered as a UDF via the functions.udaf(agg) method.
> udaf: UserDefinedAggregateFunction,
> Warning:Warning:line (326)class UserDefinedAggregateFunction in package 
> expressions is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now 
> be registered as a UDF via the functions.udaf(agg) method.
> case class ScalaUDAF(
> /Users/maxim/proj/eliminate-expr-info-warnings/sql/core/src/test/scala/org/apache/spark/sql/DataFrameWindowFunctionsSuite.scala
> Warning:Warning:line (363)class UserDefinedAggregateFunction in package 
> expressions is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now 
> be registered as a UDF via the functions.udaf(agg) method.
> val udaf = new UserDefinedAggregateFunction {
> /Users/maxim/proj/eliminate-expr-info-warnings/sql/core/src/test/java/test/org/apache/spark/sql/MyDoubleSum.java
> Warning:Warning:line (25)java: 
> org.apache.spark.sql.expressions.UserDefinedAggregateFunction in 
> org.apache.spark.sql.expressions has been deprecated
> Warning:Warning:line (35)java: 
> org.apache.spark.sql.expressions.UserDefinedAggregateFunction in 
> org.apache.spark.sql.expressions has been deprecated
> /Users/maxim/proj/eliminate-expr-info-warnings/sql/core/src/test/java/test/org/apache/spark/sql/MyDoubleAvg.java
> Warning:Warning:line (25)java: 
> org.apache.spark.sql.expressions.UserDefinedAggregateFunction in 
> org.apache.spark.sql.expressions has been deprecated
> Warning:Warning:line (36)java: 
> org.apache.spark.sql.expressions.UserDefinedAggregateFunction in 
> org.apache.spark.sql.expressions has been deprecated
> /Users/maxim/proj/eliminate-expr-info-warnings/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala
> Warning:Warning:line (36)class UserDefinedAggregateFunction in package 
> expressions is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now 
> be registered as a UDF via the functions.udaf(agg) method.
> class ScalaAggregateFunction(schema: StructType) extends 
> UserDefinedAggregateFunction {
> Warning:Warning:line (73)class UserDefinedAggregateFunction in package 
> expressions is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now 
> be registered as a UDF via the functions.udaf(agg) method.
> class ScalaAggregateFunctionWithoutInputSchema extends 
> UserDefinedAggregateFunction {
> Warning:Warning:line (100)class UserDefinedAggregateFunction in package 
> expressions is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now 
> be registered as a UDF via the functions.udaf(agg) method.
> class LongProductSum extends UserDefinedAggregateFunction {
> Warning:Warning:line (189)method register in class UDFRegistration is 
> deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now be registered 
> as a UDF via the functions.udaf(agg) method.
> spark.udf.register("mydoublesum", new MyDoubleSum)
> Warning:Warning:line (190)method register in class UDFRegistration is 
> deprecated (since 3.0.0): Aggregator[IN, BUF, OUT

[jira] [Created] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization

2020-07-02 Thread Erik Erlandson (Jira)
Erik Erlandson created SPARK-32159:
--

 Summary: New udaf(Aggregator) has an integration bug with 
UnresolvedMapObjects serialization
 Key: SPARK-32159
 URL: https://issues.apache.org/jira/browse/SPARK-32159
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Erik Erlandson


The new user defined aggregator feature (SPARK-27296) based on calling 
'functions.udaf(aggregator)' works fine when the aggregator input type is 
atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an array, 
like 'Aggregator[Array[Double], _, _]',  it is tripping over the following:

{{
/**
 * When constructing [[MapObjects]], the element type must be given, which may 
not be available
 * before analysis. This class acts like a placeholder for [[MapObjects]], and 
will be replaced by
 * [[MapObjects]] during analysis after the input data is resolved.
 * Note that, ideally we should not serialize and send unresolved expressions 
to executors, but
 * users may accidentally do this(e.g. mistakenly reference an encoder instance 
when implementing
 * Aggregator). Here we mark `function` as transient because it may reference 
scala Type, which is
 * not serializable. Then even users mistakenly reference unresolved expression 
and serialize it,
 * it's just a performance issue(more network traffic), and will not fail.
 */
case class UnresolvedMapObjects(
@transient function: Expression => Expression,
child: Expression,
customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with 
Unevaluable {
  override lazy val resolved = false

  override def dataType: DataType = 
customCollectionCls.map(ObjectType.apply).getOrElse {
throw new UnsupportedOperationException("not resolved")
  }
}

}} 

The '@transient' is causing the function to be unpacked as 'null' over on the 
executors, and it is causing a null-pointer exception here, when it tries to do 
'function(loopVar)'

{{
object MapObjects {
  def apply(
  function: Expression => Expression,
  inputData: Expression,
  elementType: DataType,
  elementNullable: Boolean = true,
  customCollectionCls: Option[Class[_]] = None): MapObjects = {
val loopVar = LambdaVariable("MapObject", elementType, elementNullable)
MapObjects(loopVar, function(loopVar), inputData, customCollectionCls)
  }
}
}}

I believe it may be possible to just use 'loopVar' instead of 
'function(loopVar)' whenever 'function' is null, but I need a second opinion from 
Catalyst developers on what a robust fix should be.
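
For concreteness, the guard I have in mind would look roughly like the
following (a sketch only, not a reviewed fix):

{code}
object MapObjects {
  def apply(
      function: Expression => Expression,
      inputData: Expression,
      elementType: DataType,
      elementNullable: Boolean = true,
      customCollectionCls: Option[Class[_]] = None): MapObjects = {
    val loopVar = LambdaVariable("MapObject", elementType, elementNullable)
    // Sketch of the suggested guard: tolerate the transient 'function' having
    // deserialized as null, instead of hitting an NPE on function(loopVar).
    val mapped = if (function == null) loopVar else function(loopVar)
    MapObjects(loopVar, mapped, inputData, customCollectionCls)
  }
}
{code}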



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization

2020-07-02 Thread Erik Erlandson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson updated SPARK-32159:
---
Description: 
The new user defined aggregator feature (SPARK-27296) based on calling 
'functions.udaf(aggregator)' works fine when the aggregator input type is 
atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an array, 
like 'Aggregator[Array[Double], _, _]',  it is tripping over the following:

{{/**
 * When constructing [[MapObjects]], the element type must be given, which may 
not be available
 * before analysis. This class acts like a placeholder for [[MapObjects]], and 
will be replaced by
 * [[MapObjects]] during analysis after the input data is resolved.
 * Note that, ideally we should not serialize and send unresolved expressions 
to executors, but
 * users may accidentally do this(e.g. mistakenly reference an encoder instance 
when implementing
 * Aggregator). Here we mark `function` as transient because it may reference 
scala Type, which is
 * not serializable. Then even users mistakenly reference unresolved expression 
and serialize it,
 * it's just a performance issue(more network traffic), and will not fail.
 */
case class UnresolvedMapObjects(
@transient function: Expression => Expression,
child: Expression,
customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with 
Unevaluable {
  override lazy val resolved = false

  override def dataType: DataType = 
customCollectionCls.map(ObjectType.apply).getOrElse {
throw new UnsupportedOperationException("not resolved")
  }
}}}

The '@transient' is causing the function to be unpacked as 'null' over on the 
executors, and it is causing a null-pointer exception here, when it tries to do 
'function(loopVar)'

{{object MapObjects {
  def apply(
  function: Expression => Expression,
  inputData: Expression,
  elementType: DataType,
  elementNullable: Boolean = true,
  customCollectionCls: Option[Class[_]] = None): MapObjects = {
val loopVar = LambdaVariable("MapObject", elementType, elementNullable)
MapObjects(loopVar, function(loopVar), inputData, customCollectionCls)
  }
}
}}

I believe it may be possible to just use 'loopVar' instead of 
'function(loopVar)', whenever 'function' is null, but need second opinion from 
catalyst developers on what a robust fix should be

  was:
The new user defined aggregator feature (SPARK-27296) based on calling 
'functions.udaf(aggregator)' works fine when the aggregator input type is 
atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an array, 
like 'Aggregator[Array[Double], _, _]',  it is tripping over the following:

{{
/**
 * When constructing [[MapObjects]], the element type must be given, which may 
not be available
 * before analysis. This class acts like a placeholder for [[MapObjects]], and 
will be replaced by
 * [[MapObjects]] during analysis after the input data is resolved.
 * Note that, ideally we should not serialize and send unresolved expressions 
to executors, but
 * users may accidentally do this(e.g. mistakenly reference an encoder instance 
when implementing
 * Aggregator). Here we mark `function` as transient because it may reference 
scala Type, which is
 * not serializable. Then even users mistakenly reference unresolved expression 
and serialize it,
 * it's just a performance issue(more network traffic), and will not fail.
 */
case class UnresolvedMapObjects(
@transient function: Expression => Expression,
child: Expression,
customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with 
Unevaluable {
  override lazy val resolved = false

  override def dataType: DataType = 
customCollectionCls.map(ObjectType.apply).getOrElse {
throw new UnsupportedOperationException("not resolved")
  }
}

}} 

The '@transient' is causing the function to be unpacked as 'null' over on the 
executors, and it is causing a null-pointer exception here, when it tries to do 
'function(loopVar)'

{{
object MapObjects {
  def apply(
  function: Expression => Expression,
  inputData: Expression,
  elementType: DataType,
  elementNullable: Boolean = true,
  customCollectionCls: Option[Class[_]] = None): MapObjects = {
val loopVar = LambdaVariable("MapObject", elementType, elementNullable)
MapObjects(loopVar, function(loopVar), inputData, customCollectionCls)
  }
}
}}

I believe it may be possible to just use 'loopVar' instead of 
'function(loopVar)', whenever 'function' is null, but need second opinion from 
catalyst developers on what a robust fix should be


> New udaf(Aggregator) has an integration bug with UnresolvedMapObjects 
> serialization
> ---
>
> Key: SPARK-32159
> URL: https://issues.apache.org/jira/browse/SPARK-32159
> Project: Sp
