[jira] [Resolved] (SPARK-3682) Add helpful warnings to the UI

2016-02-08 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza resolved SPARK-3682.
---
Resolution: Won't Fix

> Add helpful warnings to the UI
> --
>
> Key: SPARK-3682
> URL: https://issues.apache.org/jira/browse/SPARK-3682
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 1.1.0
>Reporter: Sandy Ryza
> Attachments: SPARK-3682Design.pdf
>
>
> Spark has a zillion configuration options and a zillion different things that 
> can go wrong with a job.  Improvements like incremental and better metrics 
> and the proposed spark replay debugger provide more insight into what's going 
> on under the covers.  However, it's difficult for non-advanced users to 
> synthesize this information and understand where to direct their attention. 
> It would be helpful to have some sort of central location on the UI users 
> could go to that would provide indications about why an app/job is failing or 
> performing poorly.
> Some helpful messages that we could provide:
> * Warn that the tasks in a particular stage are spending a long time in GC.
> * Warn that spark.shuffle.memoryFraction does not fit inside the young 
> generation.
> * Warn that tasks in a particular stage are very short, and that the number 
> of partitions should probably be decreased.
> * Warn that tasks in a particular stage are spilling a lot, and that the 
> number of partitions should probably be increased.
> * Warn that a cached RDD that gets a lot of use does not fit in memory, and a 
> lot of time is being spent recomputing it.
> To start, probably two kinds of warnings would be most helpful.
> * Warnings at the app level that report on misconfigurations, issues with the 
> general health of executors.
> * Warnings at the job level that indicate why a job might be performing 
> slowly.
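As a concrete illustration of the first warning in the list above (tasks spending a long 
time in GC), a heuristic along these lines could drive such a message.  This is only a 
hedged Scala sketch with hypothetical names, not code from the attached design or from 
Spark's UI:

{code}
// Hypothetical heuristic, not actual Spark UI code: flag a stage whose tasks spend a
// large fraction of their wall-clock time in GC.
case class TaskTimes(durationMs: Long, gcTimeMs: Long)

def gcWarning(stageId: Int, tasks: Seq[TaskTimes], threshold: Double = 0.1): Option[String] = {
  val totalDuration = tasks.map(_.durationMs).sum.toDouble
  val totalGc = tasks.map(_.gcTimeMs).sum.toDouble
  if (totalDuration > 0 && totalGc / totalDuration > threshold)
    Some(f"Stage $stageId spends ${100 * totalGc / totalDuration}%.0f%% of task time in GC; " +
      "consider increasing executor memory or tuning GC.")
  else
    None
}
{code}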



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5490) KMeans costs can be incorrect if tasks need to be rerun

2016-02-08 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-5490:
--
Assignee: (was: Sandy Ryza)

> KMeans costs can be incorrect if tasks need to be rerun
> ---
>
> Key: SPARK-5490
> URL: https://issues.apache.org/jira/browse/SPARK-5490
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Sandy Ryza
>  Labels: clustering
>
> KMeans uses accumulators to compute the cost of a clustering at each 
> iteration.
> Each time a ShuffleMapTask completes, it increments the accumulators at the 
> driver.  If a task runs twice because of failures, the accumulators get 
> incremented twice.
> KMeans uses accumulators in ShuffleMapTasks.  This means that a task's cost 
> can end up being double-counted.
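For readers unfamiliar with the underlying hazard, here is a hedged, spark-shell-style 
sketch (it assumes an existing SparkContext {{sc}} and is not the KMeans code itself) of 
how accumulator updates made inside a transformation can be double-counted when a task is 
re-run:

{code}
// Sketch only: an accumulator updated inside a transformation, as the KMeans cost is,
// gets re-applied when the task that produced the update is re-executed after a failure.
val cost = sc.accumulator(0.0)

val points = sc.parallelize(Seq(1.0, 2.0, 3.0))
val contributions = points.map { p =>
  cost += p * p        // runs once per *task attempt*, not once per logical record
  p
}
contributions.count()
// If any task attempt of this stage is re-run (e.g. after an executor is lost), its
// updates are added again, so cost.value can exceed the true sum of p * p over the data.
println(cost.value)
{code}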



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7410) Add option to avoid broadcasting configuration with newAPIHadoopFile

2016-02-08 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza resolved SPARK-7410.
---
Resolution: Won't Fix

> Add option to avoid broadcasting configuration with newAPIHadoopFile
> 
>
> Key: SPARK-7410
> URL: https://issues.apache.org/jira/browse/SPARK-7410
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Sandy Ryza
>
> I'm working with a Spark application that creates thousands of HadoopRDDs and 
> unions them together.  Certain details of the way the data is stored require 
> this.
> Creating ten thousand of these RDDs takes about 10 minutes, even before any 
> of them is used in an action.  I dug into why this takes so long and it looks 
> like the overhead of broadcasting the Hadoop configuration is taking up most 
> of the time.  In this case, the broadcasting isn't helpful because each 
> HadoopRDD only corresponds to one or two tasks.  When I reverted the original 
> change that switched to broadcasting configurations, the time it took to 
> instantiate these RDDs improved 10x.
> It would be nice if there was a way to turn this broadcasting off.  Either 
> through a Spark configuration option, a Hadoop configuration option, or an 
> argument to hadoopFile / newAPIHadoopFile.
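For context, the access pattern being described looks roughly like the following 
spark-shell-style sketch (paths and counts are made up; it assumes an existing 
SparkContext {{sc}}):

{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Each newAPIHadoopFile call broadcasts its Hadoop Configuration, which dominates setup
// time when there are thousands of small RDDs that each back only one or two tasks.
val paths: Seq[String] = (0 until 10000).map(i => s"hdfs:///data/part-$i")   // hypothetical

val pieces = paths.map { p =>
  sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](p).map(_._2.toString)
}
val all = sc.union(pieces)   // the union itself is cheap; constructing the inputs is not
{code}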



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9999) Dataset API on top of Catalyst/DataFrame

2015-11-23 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022634#comment-15022634
 ] 

Sandy Ryza commented on SPARK-9999:
---

[~nchammas] it's not clear that it makes sense to add a similar API for Python 
and R.  The main point of the Dataset API, as I understand it, is to extend 
DataFrames to take advantage of Java / Scala's static typing systems.  This 
means recovering compile-time type safety, integration with existing Java / 
Scala object frameworks, and Scala syntactic sugar like pattern matching.  
Python and R are dynamically typed, so they can't take advantage of these.
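To make the static-typing point concrete, a hedged sketch of the 1.6-era Scala API 
(assumes an existing SparkContext {{sc}}; names are illustrative):

{code}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val ds = Seq(Person("Ann", 31), Person("Bo", 17)).toDS()
val adults = ds.filter(_.age >= 18)                      // typed lambda, checked at compile time
val labels = ds.map { case Person(n, a) => s"$n:$a" }    // Scala pattern matching on the element type
// A typo like _.agee fails to compile here, whereas the equivalent DataFrame column typo
// would only fail at runtime -- which is the benefit Python and R can't get.
{code}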

> Dataset API on top of Catalyst/DataFrame
> 
>
> Key: SPARK-9999
> URL: https://issues.apache.org/jira/browse/SPARK-9999
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]
> The initial version of the Dataset API has been merged in Spark 1.6. However, 
> it will take a few more releases to flesh everything out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2089) With YARN, preferredNodeLocalityData isn't honored

2015-10-29 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980717#comment-14980717
 ] 

Sandy Ryza commented on SPARK-2089:
---

My opinion is that we should be moving towards dynamic allocation as the norm, 
both for batch and long-running applications.  With dynamic allocation turned 
on, it's possible to attain close to the same behavior as static allocation if 
you set max executors and a really fast ramp-up time.
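For reference, one hedged way to express that configuration (exact values are 
illustrative, not recommendations):

{code}
import org.apache.spark.SparkConf

// With dynamic allocation on, capping max executors and using a very short backlog timeout
// approximates "ask for N executors right away", close to static allocation.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")                  // required by dynamic allocation
  .set("spark.dynamicAllocation.maxExecutors", "50")             // plays the role of a fixed count
  .set("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")  // ramp up almost immediately
{code}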

> With YARN, preferredNodeLocalityData isn't honored 
> ---
>
> Key: SPARK-2089
> URL: https://issues.apache.org/jira/browse/SPARK-2089
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>Priority: Critical
>
> When running in YARN cluster mode, apps can pass preferred locality data when 
> constructing a Spark context that will dictate where to request executor 
> containers.
> This is currently broken because of a race condition.  The Spark-YARN code 
> runs the user class and waits for it to start up a SparkContext.  During its 
> initialization, the SparkContext will create a YarnClusterScheduler, which 
> notifies a monitor in the Spark-YARN code that the SparkContext is ready.  The 
> Spark-YARN code then 
> immediately fetches the preferredNodeLocationData from the SparkContext and 
> uses it to start requesting containers.
> But in the SparkContext constructor that takes the preferredNodeLocationData, 
> setting preferredNodeLocationData comes after the rest of the initialization, 
> so, if the Spark-YARN code comes around quickly enough after being notified, 
> the data that's fetched is the empty, unset version.  This occurred during all 
> of my runs.
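A minimal, self-contained Scala sketch of that ordering problem (hypothetical names, not 
Spark's actual classes): the notification fires inside the constructor before the locality 
field is assigned, so a fast listener sees the empty default.

{code}
// Hypothetical illustration of the race described above -- not Spark's real classes.
class ContextLike(prefs: Map[String, Set[String]], onReady: ContextLike => Unit) {
  @volatile var preferredNodeLocationData: Map[String, Set[String]] = Map.empty
  onReady(this)                        // the monitor is notified here...
  preferredNodeLocationData = prefs    // ...but the data is only assigned afterwards
}

new ContextLike(
  Map("host1" -> Set("/rack1")),
  ready => println(s"requesting containers with: ${ready.preferredNodeLocationData}"))  // Map()
{code}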



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-16 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961513#comment-14961513
 ] 

Sandy Ryza commented on SPARK-9999:
---

So ClassTags would work for case classes and Avro specific records, but 
wouldn't work for tuples (or anywhere else types get erased).  Blrgh.  I wonder 
if the former is enough?  Tuples are pretty useful though.
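A quick illustration of why erasure gets in the way for tuples (plain Scala, no Spark 
required):

{code}
import scala.reflect.classTag

// A case class's ClassTag names a class whose fields can be inspected reflectively, but a
// tuple's ClassTag only says "Tuple2" -- the element types needed to build an encoder are erased.
case class Point(x: Double, y: Double)

println(classTag[Point].runtimeClass)          // class Point
println(classTag[(Int, String)].runtimeClass)  // class scala.Tuple2 -- Int and String are gone
{code}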

> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-9999
> URL: https://issues.apache.org/jira/browse/SPARK-9999
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-14 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957144#comment-14957144
 ] 

Sandy Ryza commented on SPARK-9999:
---

Maybe you all have thought through this as well, but I had some more thoughts 
on the proposed API.

Fundamentally, it seems weird to me that the user is responsible for having a 
matching Encoder around every time they want to map to a class of a particular 
type.  In 99% of cases, the Encoder used to encode any given type will be the 
same, and it seems more intuitive to me to specify this up front.

To be more concrete, suppose I want to use case classes in my app and have a 
function that can auto-generate an Encoder from a class object (though this 
might be a little bit time consuming because it needs to use reflection).  With 
the current proposal, any time I want to map my Dataset to a Dataset of some 
case class, I need to either have a line of code that generates an Encoder for 
that case class, or have an Encoder already lying around.  If I perform this 
operation within a method, I need to pass the Encoder down to the method and 
include it in the signature.

Ideally I would be able to register an EncoderSystem up front that caches 
Encoders and generates new Encoders whenever it sees a new class used.  This 
still of course requires the user to pass in type information when they call 
map, but it's easier for them to get this information than an actual encoder.  
If there's not some principled way to get this working implicitly with 
ClassTags, the user could just pass in classOf[MyCaseClass] as the second 
argument to map.
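To make the suggestion concrete, a hedged sketch of what such a registry might look like 
({{Encoder}} here is a stand-in trait, and none of this is an actual Spark API):

{code}
import scala.collection.concurrent.TrieMap

trait Encoder[T] { def schemaDescription: String }   // stand-in for a real encoder

object EncoderRegistry {
  private val cache = TrieMap.empty[Class[_], Encoder[_]]

  // Expensive (e.g. reflective) derivation happens once per class, then hits the cache.
  def encoderFor[T](cls: Class[T]): Encoder[T] =
    cache.getOrElseUpdate(cls, new Encoder[T] {
      val schemaDescription = cls.getDeclaredFields.map(_.getName).mkString(", ")
    }).asInstanceOf[Encoder[T]]
}

case class MyCaseClass(a: Int, b: String)
val enc = EncoderRegistry.encoderFor(classOf[MyCaseClass])   // derived once, then cached
{code}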

> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-9999
> URL: https://issues.apache.org/jira/browse/SPARK-9999
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-14 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956341#comment-14956341
 ] 

Sandy Ryza commented on SPARK-9999:
---

Thanks for the explanation [~rxin] and [~marmbrus].  I understand the problem 
and don't have any great ideas for an alternative workable solution.

> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-9999
> URL: https://issues.apache.org/jira/browse/SPARK-9999
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-14 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14958022#comment-14958022
 ] 

Sandy Ryza commented on SPARK-9999:
---

bq. The problem with doing this using a registry (like kryo in RDDs today) is 
that then you aren't finding out the object type until you have an example 
object from realizing the computation.

My suggestion was that the user would still need to pass the class object, so 
this shouldn't be a problem, unless I'm misunderstanding.

Thanks for the pointer to the test suite.  So am I to understand correctly that 
with Scala implicits magic I can do the following without any additional 
boilerplate?

{code}
import 

case class MyClass1()
case class MyClass2()

val ds : Dataset[MyClass1] = ...
val ds2: Dataset[MyClass2] = ds.map(funcThatConvertsFromMyClass1ToMyClass2)
{code}

and in Java, imagining those case classes above were POJOs, we'd be able to 
support the following?

{code}
Dataset<MyClass2> ds2 = ds1.map(funcThatConvertsFromMyClass1ToMyClass2, 
MyClass2.class);
{code}

If that's the case, then that resolves my concerns above.

Lastly, though, IIUC, it seems like for all the common cases, we could register 
an object with the SparkContext that converts from ClassTag to Encoder, and the 
RDD API would work.  Where does that break down?

> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-9999
> URL: https://issues.apache.org/jira/browse/SPARK-9999
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-13 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14955286#comment-14955286
 ] 

Sandy Ryza commented on SPARK-9999:
---

To ask the obvious question: what are the reasons that the RDD API couldn't be 
adapted to these purposes?  If I understand correctly, a summarization of the 
differences is that Datasets:
1. Support encoders for conversion to schema'd / efficiently serializable data
2. Have a GroupedDataset concept
3. Execute on Catalyst instead of directly on top of the DAGScheduler

How difficult would it be to add encoders on top of RDDs, as well as a 
GroupedRDD?  Is there anything in the RDD API contract that says RDDs can't be 
executed on top of Catalyst?

This creates some dependency hell as well, given that SQL depends on core, but 
surely that's better than exposing an entirely new API that looks almost like 
the original one.
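To spell out the shape being asked about, a purely hypothetical sketch (none of these 
classes exist in Spark; {{Enc}} is a stand-in for an encoder):

{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

trait Enc[T]   // stand-in for an encoder

// An "encoded" RDD would pair an RDD with a serialization strategy, and grouping would
// return a handle that defers materializing the groups, much like GroupedDataset does.
class EncodedRDD[T: ClassTag](rdd: RDD[T], enc: Enc[T]) {
  def groupWith[K: ClassTag](f: T => K): GroupedRDD[K, T] =
    new GroupedRDD(rdd.groupBy(f), enc)
}

class GroupedRDD[K, T](grouped: RDD[(K, Iterable[T])], enc: Enc[T]) {
  def mapGroups[U: ClassTag](f: (K, Iterator[T]) => U): RDD[U] =
    grouped.map { case (k, vs) => f(k, vs.iterator) }
}
{code}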

> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-9999
> URL: https://issues.apache.org/jira/browse/SPARK-9999
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-13 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14955320#comment-14955320
 ] 

Sandy Ryza commented on SPARK-9999:
---

[~rxin] where are the places where the API would need to break?

> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-9999
> URL: https://issues.apache.org/jira/browse/SPARK-9999
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-13 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14955840#comment-14955840
 ] 

Sandy Ryza commented on SPARK-9999:
---

If I understand correctly, it seems like there are ways to work around each of 
these issues that, necessarily, make the API dirtier, but avoid the need for a 
whole new public API.

* groupBy: deprecate the old groupBy and add a groupWith or groupby method that 
returns a GroupedRDD.
* partitions: have -1 be a special value that means "determined by the planner"
* encoders: what are the main obstacles to addressing this with an EncodedRDD 
that extends RDD?

Regarding the issues Michael brought up:
I'd love to get rid of class tags from the public API as well as take out 
JavaRDD, but these seem more like "nice to have" than core to the proposal.  Am 
I misunderstanding?

All of these of course add ugliness, but I think it's really easy to 
underestimate the cost of introducing a new API.  Applications everywhere 
become legacy and need to be rewritten to take advantage of new features.  Code 
examples and training materials everywhere become invalidated.  Can we point to 
systems that have successfully made a transition like this at this point in 
their maturity?

> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-9999
> URL: https://issues.apache.org/jira/browse/SPARK-9999
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10739) Add attempt window for long running Spark application on Yarn

2015-09-22 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902982#comment-14902982
 ] 

Sandy Ryza commented on SPARK-10739:


That's the one I was referring to as well.  That's about executor failure, 
whereas this is about AM failure, so they're different issues.

> Add attempt window for long running Spark application on Yarn
> -
>
> Key: SPARK-10739
> URL: https://issues.apache.org/jira/browse/SPARK-10739
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently Spark on YARN uses max attempts to control the failure count: if an 
> application's failure count reaches the max attempts, the application will not 
> be recovered by the RM.  This is not very effective for long-running 
> applications, since they can easily exceed the max over a long enough period, 
> and setting a very large max attempts hides the real problem.
> So here we introduce an attempt window to bound the application attempt count; 
> attempts that fall outside the window are ignored.  This mechanism was 
> introduced in Hadoop 2.6+ to support long-running applications, and it is quite 
> useful for Spark Streaming and Spark shell-like applications.
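For readers who haven't seen the Hadoop 2.6+ mechanism being referenced, a hedged sketch 
of how an AM submission could use it (assuming the YARN-611 API; not the proposed Spark 
patch itself):

{code}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext

// Attempts older than the validity interval stop counting against max attempts, so a
// long-running AM is only killed if it fails repeatedly within the window.
def configureAttemptWindow(ctx: ApplicationSubmissionContext,
                           maxAttempts: Int,
                           windowMs: Long): Unit = {
  ctx.setMaxAppAttempts(maxAttempts)
  ctx.setAttemptFailuresValidityInterval(windowMs)   // e.g. one hour: 60L * 60 * 1000
}
{code}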



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10739) Add attempt window for long running Spark application on Yarn

2015-09-21 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901901#comment-14901901
 ] 

Sandy Ryza commented on SPARK-10739:


I recall there was a JIRA similar to this that avoided killing the application 
when we reached a certain number of executor failures.  However, IIUC, this is 
about something different: deciding whether to have YARN restart the 
application when it fails.

> Add attempt window for long running Spark application on Yarn
> -
>
> Key: SPARK-10739
> URL: https://issues.apache.org/jira/browse/SPARK-10739
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently Spark on YARN uses max attempts to control the failure count: if an 
> application's failure count reaches the max attempts, the application will not 
> be recovered by the RM.  This is not very effective for long-running 
> applications, since they can easily exceed the max over a long enough period, 
> and setting a very large max attempts hides the real problem.
> So here we introduce an attempt window to bound the application attempt count; 
> attempts that fall outside the window are ignored.  This mechanism was 
> introduced in Hadoop 2.6+ to support long-running applications, and it is quite 
> useful for Spark Streaming and Spark shell-like applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4534) With YARN, JavaSparkContext provide to add preferredNodeLocalityData to SparkContext

2015-09-10 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza resolved SPARK-4534.
---
Resolution: Won't Fix

As SPARK-2089 is closed as "Won't Fix", also closing this.

> With YARN, JavaSparkContext provide to add preferredNodeLocalityData to 
> SparkContext
> 
>
> Key: SPARK-4534
> URL: https://issues.apache.org/jira/browse/SPARK-4534
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API
>Reporter: Lianhui Wang
>
> example:
> SparkConf sparkConf=new SparkConf();
> Configuration conf = new Configuration();
> List<InputFormatInfo> inputFormatInfoList = new ArrayList<InputFormatInfo>();
> inputFormatInfoList.add(new InputFormatInfo(conf, 
> org.apache.hadoop.mapred.TextInputFormat.class, path));
> JavaSparkContext ctx=new JavaSparkContext(sparkConf,inputFormatInfoList);
> We can use the above code to make preferredNodeLocalityData work in a Java 
> program.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9782) Add support for YARN application tags running Spark on YARN

2015-08-18 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-9782:
--
Assignee: Dennis Huo

 Add support for YARN application tags running Spark on YARN
 ---

 Key: SPARK-9782
 URL: https://issues.apache.org/jira/browse/SPARK-9782
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.4.1
Reporter: Dennis Huo
Assignee: Dennis Huo

 https://issues.apache.org/jira/browse/YARN-1390 originally added the new 
 “Application Tags” feature to YARN to help track the sources of applications 
 among many possible YARN clients. 
 https://issues.apache.org/jira/browse/YARN-1399 improved on this to allow a 
 set of tags to be applied, and for comparison, 
 https://issues.apache.org/jira/browse/MAPREDUCE-5699 added support for 
 MapReduce to easily propagate tags through to YARN via Configuration settings.
 Since the ApplicationSubmissionContext.setApplicationTags method was only 
 added in Hadoop 2.4+, Spark support will invoke the method via reflection the 
 same way other such version-specific methods are called elsewhere in the 
 YARN client. Since the usage of tags is generally not critical to the 
 functionality of older YARN setups, it should be safe to handle 
 NoSuchMethodException with just a logWarning.
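A rough Scala sketch of the reflection approach the description outlines (illustrative 
only, not the actual patch):

{code}
import java.util.{Set => JSet}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext

// Call setApplicationTags when the running Hadoop version has it (2.4+), and degrade to a
// warning when it does not.
def setTagsIfSupported(ctx: ApplicationSubmissionContext, tags: JSet[String]): Unit = {
  try {
    val m = ctx.getClass.getMethod("setApplicationTags", classOf[JSet[String]])
    m.invoke(ctx, tags)
  } catch {
    case _: NoSuchMethodException =>
      println(s"WARN: this version of YARN does not support application tags; ignoring $tags")
  }
}
{code}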



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9782) Add support for YARN application tags running Spark on YARN

2015-08-18 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza resolved SPARK-9782.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

 Add support for YARN application tags running Spark on YARN
 ---

 Key: SPARK-9782
 URL: https://issues.apache.org/jira/browse/SPARK-9782
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.4.1
Reporter: Dennis Huo
Assignee: Dennis Huo
 Fix For: 1.6.0


 https://issues.apache.org/jira/browse/YARN-1390 originally added the new 
 “Application Tags” feature to YARN to help track the sources of applications 
 among many possible YARN clients. 
 https://issues.apache.org/jira/browse/YARN-1399 improved on this to allow a 
 set of tags to be applied, and for comparison, 
 https://issues.apache.org/jira/browse/MAPREDUCE-5699 added support for 
 MapReduce to easily propagate tags through to YARN via Configuration settings.
 Since the ApplicationSubmissionContext.setApplicationTags method was only 
 added in Hadoop 2.4+, Spark support will invoke the method via reflection the 
 same way other such version-specific methods are called elsewhere in the 
 YARN client. Since the usage of tags is generally not critical to the 
 functionality of older YARN setups, it should be safe to handle 
 NoSuchMethodException with just a logWarning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8674) 2-sample, 2-sided Kolmogorov Smirnov Test

2015-08-17 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-8674:
--
Summary: 2-sample, 2-sided Kolmogorov Smirnov Test  (was: [WIP] 2-sample, 
2-sided Kolmogorov Smirnov Test Implementation)

 2-sample, 2-sided Kolmogorov Smirnov Test
 -

 Key: SPARK-8674
 URL: https://issues.apache.org/jira/browse/SPARK-8674
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Jose Cambronero
Assignee: Jose Cambronero
Priority: Minor

 We added functionality to calculate a 2-sample, 2-sided Kolmogorov Smirnov 
 test for 2 RDD[Double]. The calculation provides a test for the null 
 hypothesis that both samples come from the same probability distribution. The 
 implementation seeks to minimize the shuffles necessary. 
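For readers unfamiliar with the statistic, a local (non-distributed) Scala sketch of what 
is being computed; the MLlib proposal does the equivalent over two RDD[Double] while 
minimizing shuffles:

{code}
// 2-sample KS statistic: D = sup over x of |F1(x) - F2(x)|, the largest gap between the
// two empirical CDFs (ties between the samples are handled only approximately here).
def ksTwoSampleStatistic(xs: Seq[Double], ys: Seq[Double]): Double = {
  val n = xs.size.toDouble
  val m = ys.size.toDouble
  val tagged = (xs.map((_, true)) ++ ys.map((_, false))).sortBy(_._1)
  var cdfX = 0.0
  var cdfY = 0.0
  var d = 0.0
  for ((_, fromX) <- tagged) {
    if (fromX) cdfX += 1 / n else cdfY += 1 / m
    d = math.max(d, math.abs(cdfX - cdfY))
  }
  d
}

// e.g. ksTwoSampleStatistic(Seq(1.0, 2.0, 3.0), Seq(1.5, 2.5, 3.5)) is about 0.333
{code}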



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8674) [WIP] 2-sample, 2-sided Kolmogorov Smirnov Test Implementation

2015-08-17 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-8674:
--
Assignee: Jose Cambronero

 [WIP] 2-sample, 2-sided Kolmogorov Smirnov Test Implementation
 --

 Key: SPARK-8674
 URL: https://issues.apache.org/jira/browse/SPARK-8674
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Jose Cambronero
Assignee: Jose Cambronero
Priority: Minor

 We added functionality to calculate a 2-sample, 2-sided Kolmogorov Smirnov 
 test for 2 RDD[Double]. The calculation provides a test for the null 
 hypothesis that both samples come from the same probability distribution. The 
 implementation seeks to minimize the shuffles necessary. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7707) User guide and example code for Statistics.kernelDensity

2015-08-16 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698634#comment-14698634
 ] 

Sandy Ryza commented on SPARK-7707:
---

[~mengxr] thoughts on which page this should land in?  mllib-statistics?

 User guide and example code for Statistics.kernelDensity
 

 Key: SPARK-7707
 URL: https://issues.apache.org/jira/browse/SPARK-7707
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7707) User guide and example code for KernelDensity

2015-08-16 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-7707:
--
Summary: User guide and example code for KernelDensity  (was: User guide 
and example code for Statistics.kernelDensity)

 User guide and example code for KernelDensity
 -

 Key: SPARK-7707
 URL: https://issues.apache.org/jira/browse/SPARK-7707
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7707) User guide and example code for KernelDensity

2015-08-16 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza reassigned SPARK-7707:
-

Assignee: Sandy Ryza

 User guide and example code for KernelDensity
 -

 Key: SPARK-7707
 URL: https://issues.apache.org/jira/browse/SPARK-7707
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Sandy Ryza





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7707) User guide and example code for Statistics.kernelDensity

2015-08-12 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14694413#comment-14694413
 ] 

Sandy Ryza commented on SPARK-7707:
---

Again, sorry for the long delay here.  I'm traveling for the rest of this week, 
but can get to this next week.  Is that soon enough?

 User guide and example code for Statistics.kernelDensity
 

 Key: SPARK-7707
 URL: https://issues.apache.org/jira/browse/SPARK-7707
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Sandy Ryza





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9808) Remove hash shuffle file consolidation

2015-08-11 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681601#comment-14681601
 ] 

Sandy Ryza commented on SPARK-9808:
---

Have we considered removing the hash-based shuffle entirely?  Especially 
without shuffle file consolidation, it's almost always a poor choice.

 Remove hash shuffle file consolidation
 --

 Key: SPARK-9808
 URL: https://issues.apache.org/jira/browse/SPARK-9808
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Reporter: Josh Rosen
Assignee: Josh Rosen

 I think that we should remove {{spark.shuffle.consolidateFiles}} and its 
 associated implementation for Spark 1.5.0. This feature isn't properly tested 
 and does not work with the external shuffle service. Since it's likely to be 
 buggy and since its motivation has been subsumed by sort-based shuffle, I 
 think it's a prime candidate for removal now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9808) Remove hash shuffle file consolidation

2015-08-11 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14692589#comment-14692589
 ] 

Sandy Ryza commented on SPARK-9808:
---

I don't have strong opinions here, but as a data point: my impression was that 
nearly everyone using hash-based shuffle was using shuffle file consolidation.  
I believe we recommended it to most of our users when hash-based shuffle was 
the default.

 Remove hash shuffle file consolidation
 --

 Key: SPARK-9808
 URL: https://issues.apache.org/jira/browse/SPARK-9808
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Reporter: Josh Rosen
Assignee: Josh Rosen

 I think that we should remove {{spark.shuffle.consolidateFiles}} and its 
 associated implementation for Spark 1.5.0. This feature isn't properly tested 
 and does not work with the external shuffle service. Since it's likely to be 
 buggy and since its motivation has been subsumed by sort-based shuffle, I 
 think it's a prime candidate for removal now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests

2015-07-27 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza resolved SPARK-4352.
---
  Resolution: Fixed
   Fix Version/s: 1.5.0
Target Version/s: 1.5.0

 Incorporate locality preferences in dynamic allocation requests
 ---

 Key: SPARK-4352
 URL: https://issues.apache.org/jira/browse/SPARK-4352
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Assignee: Saisai Shao
Priority: Critical
 Fix For: 1.5.0

 Attachments: Supportpreferrednodelocationindynamicallocation.pdf


 Currently, achieving data locality in Spark is difficult unless an 
 application takes resources on every node in the cluster.  
 preferredNodeLocalityData provides a sort of hacky workaround that has been 
 broken since 1.0.
 With dynamic executor allocation, Spark requests executors in response to 
 demand from the application.  When this occurs, it would be useful to look at 
 the pending tasks and communicate their location preferences to the cluster 
 resource manager. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1744) Document how to pass in preferredNodeLocationData

2015-07-23 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza resolved SPARK-1744.
---
Resolution: Won't Fix

 Document how to pass in preferredNodeLocationData
 -

 Key: SPARK-1744
 URL: https://issues.apache.org/jira/browse/SPARK-1744
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.0.0
Reporter: Sandy Ryza





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9092) Make --num-executors compatible with dynamic allocation

2015-07-21 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635942#comment-14635942
 ] 

Sandy Ryza commented on SPARK-9092:
---

I had a brief discussion with [~andrewor14] about this offline and wanted to 
move the discussion public.

--num-executors and dynamic allocation are fundamentally at odds with each 
other in the sense that neither makes sense in the context of the other.  This 
means that essentially one needs to override the other.

My position is that it makes more sense for --num-executors to override dynamic 
allocation than the other way around.  I.e. if --num-executors is set, behave 
as if dynamic allocation were disabled.  The advantages of this are:
* Cluster operators can turn on dynamic allocation as the default cluster 
setting without impacting existing applications.  The precedent set by existing 
big data processing frameworks (MR, Impala, Tez) is that users can depend on 
the framework to determine how many resources to acquire from the cluster 
manager, so I think it's reasonable that most clusters would want to move to 
dynamic allocation as the default.  In a Cloudera setting, we plan to 
eventually enable dynamic allocation as the factory default for all clusters, 
but we'd like to minimize the extent to which we change the behavior of 
existing apps.
* --num-executors is conceptually a more specific property, and specific 
properties tend to override more general ones.
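
To make the proposal concrete, the precedence check could be as simple as the 
sketch below (the helper name is made up, and I'm assuming --num-executors is 
surfaced as spark.executor.instances, as spark-submit does today):

{code}
// Hedged sketch: treat dynamic allocation as disabled whenever the user has
// pinned an explicit executor count (via --num-executors, which spark-submit
// translates into spark.executor.instances).
import org.apache.spark.SparkConf

def dynamicAllocationEnabled(conf: SparkConf): Boolean = {
  val explicitNumExecutors = conf.contains("spark.executor.instances")
  conf.getBoolean("spark.dynamicAllocation.enabled", false) && !explicitNumExecutors
}
{code}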

 Make --num-executors compatible with dynamic allocation
 ---

 Key: SPARK-9092
 URL: https://issues.apache.org/jira/browse/SPARK-9092
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.2.0
Reporter: Niranjan Padmanabhan

 Currently, when you enable dynamic allocation, you can't use --num-executors 
 or the property spark.executor.instances. If we are to enable dynamic 
 allocation by default, we should make these work so that existing workloads 
 don't fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1640) In yarn-client mode, pass preferred node locations to AM

2015-07-20 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza resolved SPARK-1640.
---
Resolution: Invalid

 In yarn-client mode, pass preferred node locations to AM
 

 Key: SPARK-1640
 URL: https://issues.apache.org/jira/browse/SPARK-1640
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 0.9.0
Reporter: Sandy Ryza

 In yarn-cluster mode, if the user passes preferred node location data to the 
 SparkContext, the AM requests containers based on that data.
 In yarn-client mode, it would be good to do this as well.  This requires some 
 way of passing the data from the client process to the AM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8623) Some queries in spark-sql lead to NullPointerException when using Yarn

2015-06-26 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603243#comment-14603243
 ] 

Sandy Ryza commented on SPARK-8623:
---

Am able to reproduce this locally.  Looking into the cause.

 Some queries in spark-sql lead to NullPointerException when using Yarn
 --

 Key: SPARK-8623
 URL: https://issues.apache.org/jira/browse/SPARK-8623
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
 Environment: Hadoop 2.6, Kerberos
Reporter: Bolke de Bruin

 The following query was executed using spark-sql --master yarn-client on 
 1.5.0-SNAPSHOT:
 select * from wcs.geolite_city limit 10;
 This led to the following error:
 15/06/25 09:38:37 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 
 (TID 0, lxhnl008.ad.ing.net): java.lang.NullPointerException
   at org.apache.hadoop.conf.Configuration.init(Configuration.java:693)
   at org.apache.hadoop.mapred.JobConf.init(JobConf.java:442)
   at org.apache.hadoop.mapreduce.Job.init(Job.java:131)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD.getJob(SqlNewHadoopRDD.scala:83)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD.getConf(SqlNewHadoopRDD.scala:89)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.init(SqlNewHadoopRDD.scala:127)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 This does not happen in every case, i.e. some queries execute fine, and it is 
 unclear why.
 Using just spark-sql the query executes fine as well, and thus the issue 
 seems to lie in the communication with Yarn. Also the query executes fine 
 (with yarn) in spark-shell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8623) Some queries in spark-sql lead to NullPointerException when using Yarn

2015-06-26 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603694#comment-14603694
 ] 

Sandy Ryza commented on SPARK-8623:
---

Figured out the issue: my patch omitted registering a custom Kryo serializer 
for SerializableConfiguration, so the wrapper gets serialized and deserialized 
by Kryo directly.  That bypasses the writeObject and readObject methods, which 
means the internal Configuration is never instantiated.  Uploading a patch that 
I've found to fix the issue.
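
For illustration, the registration I'm describing would look roughly like the 
following sketch (class name and details are approximate, not the actual patch):

{code}
import java.io.{DataInputStream, DataOutputStream}
import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.hadoop.conf.Configuration

// Approximate sketch: serialize the Hadoop Configuration through its Writable
// methods so it is properly re-instantiated on the Kryo path as well.
class ConfigurationKryoSerializer extends Serializer[Configuration] {
  override def write(kryo: Kryo, output: Output, conf: Configuration): Unit = {
    val out = new DataOutputStream(output)
    conf.write(out)   // Configuration implements Writable
    out.flush()
  }

  override def read(kryo: Kryo, input: Input, clazz: Class[Configuration]): Configuration = {
    val conf = new Configuration(false)          // don't reload *-site.xml defaults
    conf.readFields(new DataInputStream(input))  // rebuild from the serialized bytes
    conf
  }
}

// Registration, roughly:
//   kryo.register(classOf[Configuration], new ConfigurationKryoSerializer)
{code}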

 Some queries in spark-sql lead to NullPointerException when using Yarn
 --

 Key: SPARK-8623
 URL: https://issues.apache.org/jira/browse/SPARK-8623
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
 Environment: Hadoop 2.6, Kerberos
Reporter: Bolke de Bruin

 The following query was executed using spark-sql --master yarn-client on 
 1.5.0-SNAPSHOT:
 select * from wcs.geolite_city limit 10;
 This led to the following error:
 15/06/25 09:38:37 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 
 (TID 0, lxhnl008.ad.ing.net): java.lang.NullPointerException
   at org.apache.hadoop.conf.Configuration.init(Configuration.java:693)
   at org.apache.hadoop.mapred.JobConf.init(JobConf.java:442)
   at org.apache.hadoop.mapreduce.Job.init(Job.java:131)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD.getJob(SqlNewHadoopRDD.scala:83)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD.getConf(SqlNewHadoopRDD.scala:89)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.init(SqlNewHadoopRDD.scala:127)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 This does not happen in every case, i.e. some queries execute fine, and it is 
 unclear why.
 Using just spark-sql the query executes fine as well, and thus the issue 
 seems to lie in the communication with Yarn. Also the query executes fine 
 (with yarn) in spark-shell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8623) Hadoop RDDs fail to properly serialize configuration

2015-06-26 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-8623:
--
Summary: Hadoop RDDs fail to properly serialize configuration  (was: Some 
queries in spark-sql lead to NullPointerException when using Yarn)

 Hadoop RDDs fail to properly serialize configuration
 

 Key: SPARK-8623
 URL: https://issues.apache.org/jira/browse/SPARK-8623
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.0
 Environment: Hadoop 2.6, Kerberos
Reporter: Bolke de Bruin
Assignee: Sandy Ryza

 The following query was executed using spark-sql --master yarn-client on 
 1.5.0-SNAPSHOT:
 select * from wcs.geolite_city limit 10;
 This led to the following error:
 15/06/25 09:38:37 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 
 (TID 0, lxhnl008.ad.ing.net): java.lang.NullPointerException
   at org.apache.hadoop.conf.Configuration.init(Configuration.java:693)
   at org.apache.hadoop.mapred.JobConf.init(JobConf.java:442)
   at org.apache.hadoop.mapreduce.Job.init(Job.java:131)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD.getJob(SqlNewHadoopRDD.scala:83)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD.getConf(SqlNewHadoopRDD.scala:89)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.init(SqlNewHadoopRDD.scala:127)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 This does not happen in every case, i.e. some queries execute fine, and it is 
 unclear why.
 Using just spark-sql the query executes fine as well, and thus the issue 
 seems to lie in the communication with Yarn. Also the query executes fine 
 (with yarn) in spark-shell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8623) Some queries in spark-sql lead to NullPointerException when using Yarn

2015-06-26 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza reassigned SPARK-8623:
-

Assignee: Sandy Ryza

 Some queries in spark-sql lead to NullPointerException when using Yarn
 --

 Key: SPARK-8623
 URL: https://issues.apache.org/jira/browse/SPARK-8623
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.0
 Environment: Hadoop 2.6, Kerberos
Reporter: Bolke de Bruin
Assignee: Sandy Ryza

 The following query was executed using spark-sql --master yarn-client on 
 1.5.0-SNAPSHOT:
 select * from wcs.geolite_city limit 10;
 This led to the following error:
 15/06/25 09:38:37 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 
 (TID 0, lxhnl008.ad.ing.net): java.lang.NullPointerException
   at org.apache.hadoop.conf.Configuration.init(Configuration.java:693)
   at org.apache.hadoop.mapred.JobConf.init(JobConf.java:442)
   at org.apache.hadoop.mapreduce.Job.init(Job.java:131)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD.getJob(SqlNewHadoopRDD.scala:83)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD.getConf(SqlNewHadoopRDD.scala:89)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.init(SqlNewHadoopRDD.scala:127)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 This does not happen in every case, i.e. some queries execute fine, and it is 
 unclear why.
 Using just spark-sql the query executes fine as well, and thus the issue 
 seems to lie in the communication with Yarn. Also the query executes fine 
 (with yarn) in spark-shell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8623) Some queries in spark-sql lead to NullPointerException when using Yarn

2015-06-26 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-8623:
--
Component/s: (was: SQL)
 Spark Core

 Some queries in spark-sql lead to NullPointerException when using Yarn
 --

 Key: SPARK-8623
 URL: https://issues.apache.org/jira/browse/SPARK-8623
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.0
 Environment: Hadoop 2.6, Kerberos
Reporter: Bolke de Bruin

 The following query was executed using spark-sql --master yarn-client on 
 1.5.0-SNAPSHOT:
 select * from wcs.geolite_city limit 10;
 This led to the following error:
 15/06/25 09:38:37 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 
 (TID 0, lxhnl008.ad.ing.net): java.lang.NullPointerException
   at org.apache.hadoop.conf.Configuration.init(Configuration.java:693)
   at org.apache.hadoop.mapred.JobConf.init(JobConf.java:442)
   at org.apache.hadoop.mapreduce.Job.init(Job.java:131)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD.getJob(SqlNewHadoopRDD.scala:83)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD.getConf(SqlNewHadoopRDD.scala:89)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.init(SqlNewHadoopRDD.scala:127)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 This does not happen in every case, i.e. some queries execute fine, and it is 
 unclear why.
 Using just spark-sql the query executes fine as well, and thus the issue 
 seems to lie in the communication with Yarn. Also the query executes fine 
 (with yarn) in spark-shell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8623) Some queries in spark-sql lead to NullPointerException when using Yarn

2015-06-25 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601593#comment-14601593
 ] 

Sandy Ryza commented on SPARK-8623:
---

Looking into it

 Some queries in spark-sql lead to NullPointerException when using Yarn
 --

 Key: SPARK-8623
 URL: https://issues.apache.org/jira/browse/SPARK-8623
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
 Environment: Hadoop 2.6, Kerberos
Reporter: Bolke de Bruin

 The following query was executed using spark-sql --master yarn-client on 
 1.5.0-SNAPSHOT:
 select * from wcs.geolite_city limit 10;
 This led to the following error:
 15/06/25 09:38:37 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 
 (TID 0, lxhnl008.ad.ing.net): java.lang.NullPointerException
   at org.apache.hadoop.conf.Configuration.init(Configuration.java:693)
   at org.apache.hadoop.mapred.JobConf.init(JobConf.java:442)
   at org.apache.hadoop.mapreduce.Job.init(Job.java:131)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD.getJob(SqlNewHadoopRDD.scala:83)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD.getConf(SqlNewHadoopRDD.scala:89)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.init(SqlNewHadoopRDD.scala:127)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 This does not happen in every case, i.e. some queries execute fine, and it is 
 unclear why.
 Using just spark-sql the query executes fine as well, and thus the issue 
 seems to lie in the communication with Yarn. Also the query executes fine 
 (with yarn) in spark-shell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7173) Support YARN node label expressions for the application master

2015-06-07 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576565#comment-14576565
 ] 

Sandy Ryza commented on SPARK-7173:
---

This requires additional work on top of SPARK-6470

 Support YARN node label expressions for the application master
 --

 Key: SPARK-7173
 URL: https://issues.apache.org/jira/browse/SPARK-7173
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.3.1
Reporter: Sandy Ryza





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8099) In yarn-cluster mode, --executor-cores can't be set in SparkConf

2015-06-05 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-8099:
--
Assignee: meiyoula

 In yarn-cluster mode, --executor-cores can't be set in SparkConf
 ---

 Key: SPARK-8099
 URL: https://issues.apache.org/jira/browse/SPARK-8099
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: meiyoula
Assignee: meiyoula
 Fix For: 1.5.0


 While testing the dynamic executor allocation function, I set the executor cores 
 with *--executor-cores 4* in the spark-submit command. But in 
 *ExecutorAllocationManager*, the *private val tasksPerExecutor = 
 conf.getInt(spark.executor.cores, 1) / conf.getInt(spark.task.cpus, 1)* 
 still evaluates to 1.
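
For illustration, a minimal sketch of the reported behavior (assuming the flag 
never lands in SparkConf; this is not the actual ExecutorAllocationManager code):

{code}
// Minimal illustration of the reported symptom: if --executor-cores never
// reaches SparkConf, the fallback default of 1 wins.
import org.apache.spark.SparkConf

val conf = new SparkConf()
// conf.set("spark.executor.cores", "4")   // what --executor-cores 4 should imply
val tasksPerExecutor =
  conf.getInt("spark.executor.cores", 1) / conf.getInt("spark.task.cpus", 1)
// tasksPerExecutor == 1, even though each executor actually has 4 cores
{code}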



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8099) In yarn-cluster mode, --executor-cores can't be set in SparkConf

2015-06-05 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-8099:
--
Assignee: (was: meiyoula)

 In yarn-cluster mode, --executor-cores can't be set in SparkConf
 ---

 Key: SPARK-8099
 URL: https://issues.apache.org/jira/browse/SPARK-8099
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: meiyoula
 Fix For: 1.5.0


 While testing the dynamic executor allocation function, I set the executor cores 
 with *--executor-cores 4* in the spark-submit command. But in 
 *ExecutorAllocationManager*, the *private val tasksPerExecutor = 
 conf.getInt(spark.executor.cores, 1) / conf.getInt(spark.task.cpus, 1)* 
 still evaluates to 1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8099) In yarn-cluster mode, --executor-cores can't be set in SparkConf

2015-06-05 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza resolved SPARK-8099.
---
   Resolution: Fixed
Fix Version/s: 1.5.0
 Assignee: meiyoula

 In yarn-cluster mode, --executor-cores can't be set in SparkConf
 ---

 Key: SPARK-8099
 URL: https://issues.apache.org/jira/browse/SPARK-8099
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: meiyoula
Assignee: meiyoula
 Fix For: 1.5.0


 While testing the dynamic executor allocation function, I set the executor cores 
 with *--executor-cores 4* in the spark-submit command. But in 
 *ExecutorAllocationManager*, the *private val tasksPerExecutor = 
 conf.getInt(spark.executor.cores, 1) / conf.getInt(spark.task.cpus, 1)* 
 still evaluates to 1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7699) Dynamic allocation: initial executors may be canceled before first job

2015-06-05 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza resolved SPARK-7699.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

 Dynamic allocation: initial executors may be canceled before first job
 --

 Key: SPARK-7699
 URL: https://issues.apache.org/jira/browse/SPARK-7699
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: meiyoula
Assignee: Saisai Shao
 Fix For: 1.5.0


 spark.dynamicAllocation.minExecutors 2
 spark.dynamicAllocation.initialExecutors  3
 spark.dynamicAllocation.maxExecutors 4
 Just run spark-shell with the above configurations; the initial executor 
 number is 2 rather than the configured 3.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8135) Don't load defaults when reconstituting Hadoop Configurations

2015-06-05 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-8135:
--
Summary: Don't load defaults when reconstituting Hadoop Configurations  
(was: In SerializableWritable, don't load defaults when instantiating 
Configuration)

 Don't load defaults when reconstituting Hadoop Configurations
 -

 Key: SPARK-8135
 URL: https://issues.apache.org/jira/browse/SPARK-8135
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Sandy Ryza
Assignee: Sandy Ryza

 Calling new Configuration() is an expensive operation because it loads any 
 Hadoop configuration XMLs from disk.
 In SerializableWritable, we call new Configuration needlessly when 
 instantiating an ObjectWritable.  The ObjectWritable only needs the 
 Configuration for its class cache, not for any Hadoop properties that might 
 be in XML files, so it should be ok to call new Configuration with 
 loadDefaults = false.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8135) In SerializableWritable, don't load defaults when instantiating Configuration

2015-06-05 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14575397#comment-14575397
 ] 

Sandy Ryza commented on SPARK-8135:
---

CC [~joshrosen]

 In SerializableWritable, don't load defaults when instantiating Configuration
 -

 Key: SPARK-8135
 URL: https://issues.apache.org/jira/browse/SPARK-8135
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Sandy Ryza
Assignee: Sandy Ryza

 Calling new Configuration() is an expensive operation because it loads any 
 Hadoop configuration XMLs from disk.
 In SerializableWritable, we call new Configuration needlessly when 
 instantiating an ObjectWritable.  The ObjectWritable only needs the 
 Configuration for its class cache, not for any Hadoop properties that might 
 be in XML files, so it should be ok to call new Configuration with 
 loadDefaults = false.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8135) In SerializableWritable, don't load defaults when instantiating Configuration

2015-06-05 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-8135:
-

 Summary: In SerializableWritable, don't load defaults when 
instantiating Configuration
 Key: SPARK-8135
 URL: https://issues.apache.org/jira/browse/SPARK-8135
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Sandy Ryza
Assignee: Sandy Ryza


Calling new Configuration() is an expensive operation because it loads any 
Hadoop configuration XMLs from disk.

In SerializableWritable, we call new Configuration needlessly when 
instantiating an ObjectWritable.  The ObjectWritable only needs the 
Configuration for its class cache, not for any Hadoop properties that might be 
in XML files, so it should be ok to call new Configuration with loadDefaults = 
false.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8135) In SerializableWritable, don't load defaults when instantiating Configuration

2015-06-05 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14575410#comment-14575410
 ] 

Sandy Ryza commented on SPARK-8135:
---

Your question made me think about the fact that, whether or not we go through 
ObjectWritable, we're still going to end up calling the default Configuration 
constructor when SerializableWritable wraps a Configuration.  I think this is 
undesirable for the reasons outlined above.

What do you think about adding a SerializableConfiguration class that 
instantiates a Configuration with loadDefaults = false?

Regarding your question, I think we could get away with that.  Is there an 
advantage though?  I think it would probably be more code.
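
For concreteness, a rough sketch of the wrapper I'm proposing (names and 
details are illustrative only, not a finished implementation):

{code}
import java.io.{ObjectInputStream, ObjectOutputStream}
import org.apache.hadoop.conf.Configuration

// Sketch of a SerializableConfiguration-style wrapper: on deserialization the
// Configuration is rebuilt with loadDefaults = false, so no *-site.xml files
// are read from disk.
class SerializableConfiguration(@transient var value: Configuration)
  extends Serializable {

  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    value.write(out)                    // Configuration implements Writable
  }

  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    value = new Configuration(false)    // skip loading default resources
    value.readFields(in)
  }
}
{code}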

 In SerializableWritable, don't load defaults when instantiating Configuration
 -

 Key: SPARK-8135
 URL: https://issues.apache.org/jira/browse/SPARK-8135
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Sandy Ryza
Assignee: Sandy Ryza

 Calling new Configuration() is an expensive operation because it loads any 
 Hadoop configuration XMLs from disk.
 In SerializableWritable, we call new Configuration needlessly when 
 instantiating an ObjectWritable.  The ObjectWritable only needs the 
 Configuration for its class cache, not for any Hadoop properties that might 
 be in XML files, so it should be ok to call new Configuration with 
 loadDefaults = false.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8135) Don't load defaults when reconstituting Hadoop Configurations

2015-06-05 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14575463#comment-14575463
 ] 

Sandy Ryza commented on SPARK-8135:
---

Cool.  Updated the PR with SerializableConfiguration.

 Don't load defaults when reconstituting Hadoop Configurations
 -

 Key: SPARK-8135
 URL: https://issues.apache.org/jira/browse/SPARK-8135
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Sandy Ryza
Assignee: Sandy Ryza

 Calling new Configuration() is an expensive operation because it loads any 
 Hadoop configuration XMLs from disk.
 In SerializableWritable, we call new Configuration needlessly when 
 instantiating an ObjectWritable.  The ObjectWritable only needs the 
 Configuration for its class cache, not for any Hadoop properties that might 
 be in XML files, so it should be ok to call new Configuration with 
 loadDefaults = false.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8136) AM link download test can be flaky

2015-06-05 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-8136:
--
Assignee: Hari Shreedharan

 AM link download test can be flaky
 --

 Key: SPARK-8136
 URL: https://issues.apache.org/jira/browse/SPARK-8136
 Project: Spark
  Issue Type: Bug
Reporter: Hari Shreedharan
Assignee: Hari Shreedharan

 Sometimes YARN does not replace the link (or replaces it too soon) causing 
 the YarnClusterSuite to fail. On a real cluster, the NM automatically 
 redirects once the app is complete. So we should make the test less strict 
 and have it only check the link's format rather than try to download the logs.
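
A relaxed check could be as simple as matching the link against an expected URL 
shape, e.g. the sketch below (the pattern is an illustrative assumption about 
the NM container-logs URL format, not the actual test code):

{code}
// Sketch: verify the AM log link *looks* like an NM container-logs URL
// instead of actually fetching it.
val containerLogsPattern = """https?://[^/]+:\d+/node/containerlogs/container_\S+/\S+""".r

def looksLikeLogLink(link: String): Boolean =
  containerLogsPattern.findFirstIn(link).isDefined
{code}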



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests

2015-06-04 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572245#comment-14572245
 ] 

Sandy Ryza commented on SPARK-4352:
---

I don't think it's abnormal.  Consider joining the results of a shuffle (no 
locality preferences) with a small table on HDFS (has locality preferences).  
Is there any particular reason we would expect to ramp down quickly in this 
situation?

 Incorporate locality preferences in dynamic allocation requests
 ---

 Key: SPARK-4352
 URL: https://issues.apache.org/jira/browse/SPARK-4352
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Assignee: Saisai Shao
Priority: Critical
 Attachments: Supportpreferrednodelocationindynamicallocation.pdf


 Currently, achieving data locality in Spark is difficult unless an 
 application takes resources on every node in the cluster.  
 preferredNodeLocalityData provides a sort of hacky workaround that has been 
 broken since 1.0.
 With dynamic executor allocation, Spark requests executors in response to 
 demand from the application.  When this occurs, it would be useful to look at 
 the pending tasks and communicate their location preferences to the cluster 
 resource manager. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests

2015-06-04 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572356#comment-14572356
 ] 

Sandy Ryza commented on SPARK-4352:
---

Right, but once we have placed 5 executors on those nodes, the other executors 
can go anywhere (and it's probably preferable that they do in order to spread 
out our resource consumption more evenly).

 Incorporate locality preferences in dynamic allocation requests
 ---

 Key: SPARK-4352
 URL: https://issues.apache.org/jira/browse/SPARK-4352
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Assignee: Saisai Shao
Priority: Critical
 Attachments: Supportpreferrednodelocationindynamicallocation.pdf


 Currently, achieving data locality in Spark is difficult unless an 
 application takes resources on every node in the cluster.  
 preferredNodeLocalityData provides a sort of hacky workaround that has been 
 broken since 1.0.
 With dynamic executor allocation, Spark requests executors in response to 
 demand from the application.  When this occurs, it would be useful to look at 
 the pending tasks and communicate their location preferences to the cluster 
 resource manager. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests

2015-06-04 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572338#comment-14572338
 ] 

Sandy Ryza commented on SPARK-4352:
---

In your example, what would be the advantage of requesting 15 executors on a, 
b, d if we could only make use of 5?

 Incorporate locality preferences in dynamic allocation requests
 ---

 Key: SPARK-4352
 URL: https://issues.apache.org/jira/browse/SPARK-4352
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Assignee: Saisai Shao
Priority: Critical
 Attachments: Supportpreferrednodelocationindynamicallocation.pdf


 Currently, achieving data locality in Spark is difficult unless an 
 application takes resources on every node in the cluster.  
 preferredNodeLocalityData provides a sort of hacky workaround that has been 
 broken since 1.0.
 With dynamic executor allocation, Spark requests executors in response to 
 demand from the application.  When this occurs, it would be useful to look at 
 the pending tasks and communicate their location preferences to the cluster 
 resource manager. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8062) NullPointerException in SparkHadoopUtil.getFileSystemThreadStatistics

2015-06-03 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570388#comment-14570388
 ] 

Sandy Ryza commented on SPARK-8062:
---

[~joshrosen] nothing sticks out to me past what you've found.  My other 
suspicion was that somehow the scheme from the path was coming in as null, but 
the implementation of Path#makeQualified seems like it should always set a 
scheme.  Given that we've never come across it, it seems possible that it's 
related to something maprfs is doing?  Defensive checks seem reasonable to me.
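
A defensive variant might look something like the sketch below (method name and 
structure are invented for illustration; this is not the proposed patch):

{code}
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: tolerate a missing scheme instead of assuming makeQualified always
// produces one, and fall back to "no statistics" rather than throwing an NPE.
def fileSystemStatisticsFor(path: Path, conf: Configuration): Seq[FileSystem.Statistics] = {
  val fs = path.getFileSystem(conf)
  val qualified = path.makeQualified(fs.getUri, fs.getWorkingDirectory)
  Option(qualified.toUri.getScheme) match {
    case Some(scheme) =>
      FileSystem.getAllStatistics.asScala.filter(_.getScheme == scheme).toList
    case None => Nil
  }
}
{code}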

 NullPointerException in SparkHadoopUtil.getFileSystemThreadStatistics
 -

 Key: SPARK-8062
 URL: https://issues.apache.org/jira/browse/SPARK-8062
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1
 Environment: MapR 4.0.1, Hadoop 2.4.1, Yarn
Reporter: Josh Rosen

 I received the following error report from a user:
 While running a Spark Streaming job that reads from MapRfs and writes to 
 HBase using Spark 1.2.1, the job intermittently experiences a total job 
 failure due to the following errors:
 {code}
 15/05/28 10:35:50 ERROR executor.Executor: Exception in task 1.1 in stage 6.0 
 (TID 24) 
 java.lang.NullPointerException 
 at 
 org.apache.spark.deploy.SparkHadoopUtil$$anonfun$4.apply(SparkHadoopUtil.scala:178)
  
 at 
 org.apache.spark.deploy.SparkHadoopUtil$$anonfun$4.apply(SparkHadoopUtil.scala:178)
  
 at 
 scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
  
 at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
 at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) 
 at scala.collection.AbstractIterable.foreach(Iterable.scala:54) 
 at scala.collection.TraversableLike$class.filter(TraversableLike.scala:263) 
 at scala.collection.AbstractTraversable.filter(Traversable.scala:105) 
 at 
 org.apache.spark.deploy.SparkHadoopUtil.getFileSystemThreadStatistics(SparkHadoopUtil.scala:178)
  
 at 
 org.apache.spark.deploy.SparkHadoopUtil.getFSBytesReadOnThreadCallback(SparkHadoopUtil.scala:139)
  
 at org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:116) 
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107) 
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69) 
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) 
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) 
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) 
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) 
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) 
 at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) 
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) 
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) 
 at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34) 
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) 
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) 
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) 
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) 
 at org.apache.spark.scheduler.Task.run(Task.scala:56) 
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200) 
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  
 at java.lang.Thread.run(Thread.java:744) 
 15/05/28 10:35:50 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 25 
 15/05/28 10:35:50 INFO executor.Executor: Running task 2.1 in stage 6.0 (TID 
 25) 
 15/05/28 10:35:50 INFO rdd.NewHadoopRDD: Input split: hdfs:/[REDACTED] 
 15/05/28 10:35:50 ERROR executor.Executor: Exception in task 2.1 in stage 6.0 
 (TID 25) 
 java.lang.NullPointerException 
 at 
 org.apache.spark.deploy.SparkHadoopUtil$$anonfun$4.apply(SparkHadoopUtil.scala:178)
  
 at 
 org.apache.spark.deploy.SparkHadoopUtil$$anonfun$4.apply(SparkHadoopUtil.scala:178)
  
 at 
 scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
  
 at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
 at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) 
 {code}
 Diving into the code here:
 The NPE is occurring on this line of SparkHadoopUtil (in 1.2.1.): 
 https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L178
 Here's that block of code from 1.2.1 (it's the same in 1.2.2):
 {code}
   private def getFileSystemThreadStatistics(path: Path, conf: 

[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests

2015-06-02 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569623#comment-14569623
 ] 

Sandy Ryza commented on SPARK-4352:
---

In the case where the task number <= executor number * cores, I think my 
earlier argument still stands.  Any executor requests beyond the ones needed to 
satisfy our preferences should be submitted with no locality preferences.  This 
means we will be less likely to bunch up requests on particular nodes where 
executors are not needed.  Consider the extreme case where we want to request 
100 executors but only have a single task with locality preferences, for data 
on 3 nodes.  Going purely by the ratio approach, we would end up requesting all 
100 executors on those three nodes.

For the other cases, your approach makes sense to me.

 Incorporate locality preferences in dynamic allocation requests
 ---

 Key: SPARK-4352
 URL: https://issues.apache.org/jira/browse/SPARK-4352
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Assignee: Saisai Shao
Priority: Critical
 Attachments: Supportpreferrednodelocationindynamicallocation.pdf


 Currently, achieving data locality in Spark is difficult unless an 
 application takes resources on every node in the cluster.  
 preferredNodeLocalityData provides a sort of hacky workaround that has been 
 broken since 1.0.
 With dynamic executor allocation, Spark requests executors in response to 
 demand from the application.  When this occurs, it would be useful to look at 
 the pending tasks and communicate their location preferences to the cluster 
 resource manager. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests

2015-06-01 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567755#comment-14567755
 ] 

Sandy Ryza commented on SPARK-4352:
---

[~jerryshao] I wouldn't say that the goal is necessarily to get as close as 
possible to the ratio of requests (3 : 3 : 2 : 1 in the example).  My idea was 
to get as close as possible to sum(cores from all executor requests with that 
node on their preferred list) = the number of tasks that prefer that node.

Why?  Let's look at the situation where we're requesting 18 executors.  Let's 
say we request 6 executors with a preference for a, b, c, d like you 
suggested. YARN would be perfectly happy giving us 6 executors on node d.  But 
we only have 10 tasks (with executors that have 2 cores, this means 5 
executors) that need to run on node d.  So we'd really prefer that the 6th 
executor be scheduled on a, b, or c, because placing it on d confers no 
additional advantage.

For the situation where we're requesting 7 executors I have less of an argument 
for why my 5 : 2 is better than your 2 : 2 : 3.  Thinking about it more now, it 
seems like your approach could be closer to optimal because getting executors 
on a or b means more of our tasks get to run on local data.  So I would 
certainly be open to something that tries to preserve the ratio when the number 
of executors we're allowed to request is under the maximum number of tasks 
targeted for any particular node.

 Incorporate locality preferences in dynamic allocation requests
 ---

 Key: SPARK-4352
 URL: https://issues.apache.org/jira/browse/SPARK-4352
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Assignee: Saisai Shao
Priority: Critical
 Attachments: Supportpreferrednodelocationindynamicallocation.pdf


 Currently, achieving data locality in Spark is difficult unless an 
 application takes resources on every node in the cluster.  
 preferredNodeLocalityData provides a sort of hacky workaround that has been 
 broken since 1.0.
 With dynamic executor allocation, Spark requests executors in response to 
 demand from the application.  When this occurs, it would be useful to look at 
 the pending tasks and communicate their location preferences to the cluster 
 resource manager. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests

2015-06-01 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568580#comment-14568580
 ] 

Sandy Ryza commented on SPARK-4352:
---

[~jerryshao] I think you're right that in the case where the number of 
executors is smaller than the maximum number of tasks with preferences on any 
particular node, using the ratio makes more sense.


 Incorporate locality preferences in dynamic allocation requests
 ---

 Key: SPARK-4352
 URL: https://issues.apache.org/jira/browse/SPARK-4352
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Assignee: Saisai Shao
Priority: Critical
 Attachments: Supportpreferrednodelocationindynamicallocation.pdf


 Currently, achieving data locality in Spark is difficult unless an 
 application takes resources on every node in the cluster.  
 preferredNodeLocalityData provides a sort of hacky workaround that has been 
 broken since 1.0.
 With dynamic executor allocation, Spark requests executors in response to 
 demand from the application.  When this occurs, it would be useful to look at 
 the pending tasks and communicate their location preferences to the cluster 
 resource manager. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7707) User guide and example code for Statistics.kernelDensity

2015-05-29 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14565638#comment-14565638
 ] 

Sandy Ryza commented on SPARK-7707:
---

Sorry for the delayed response here.  I will try to get to this next week.

 User guide and example code for Statistics.kernelDensity
 

 Key: SPARK-7707
 URL: https://issues.apache.org/jira/browse/SPARK-7707
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Sandy Ryza





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7896) IndexOutOfBoundsException in ChainedBuffer

2015-05-27 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561311#comment-14561311
 ] 

Sandy Ryza commented on SPARK-7896:
---

[~joshrosen] I'll take a look

 IndexOutOfBoundsException in ChainedBuffer
 --

 Key: SPARK-7896
 URL: https://issues.apache.org/jira/browse/SPARK-7896
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Arun Ahuja
Assignee: Sandy Ryza
Priority: Blocker

 I've run into this on two tasks that use the same dataset.
 The dataset is a collection of strings where the most common string appears 
 ~200M times and the next few appear ~50M times each.
 For this rdd: RDD[String], I can do rdd.map(x => (x, 1)).reduceByKey(_ + _) 
 to get the counts (how I got the number above), but I hit the error on 
 rdd.groupByKey().
 Also, I have a second RDD of strings rdd2: RDD[String] and I cannot do 
 rdd2.leftOuterJoin(rdd) without hitting this error
 {code}
 15/05/26 23:27:55 WARN scheduler.TaskSetManager: Lost task 3169.1 in stage 
 5.0 (TID 4843, demeter-csmaz10-19.demeter.hpc.mssm.edu): 
 java.lang.IndexOutOfBoundsException: 512
 at 
 scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
 at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47)
 at 
 org.apache.spark.util.collection.ChainedBuffer.write(ChainedBuffer.scala:110)
 at 
 org.apache.spark.util.collection.ChainedBufferOutputStream.write(ChainedBuffer.scala:141)
 at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
 at 
 org.apache.spark.serializer.KryoSerializationStream.flush(KryoSerializer.scala:147)
 at 
 org.apache.spark.util.collection.PartitionedSerializedPairBuffer.insert(PartitionedSerializedPairBuffer.scala:78)
 at 
 org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:219)
 at 
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7896) IndexOutOfBoundsException in ChainedBuffer

2015-05-27 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561347#comment-14561347
 ] 

Sandy Ryza commented on SPARK-7896:
---

This must be because we're overflowing the 2 GB limit of the ChainedBuffer.  
512 * 4 MB  = 2 GB.

I'll post a patch that uses long indices.  Would it be easy for you to try it 
out today [~arahuja]?
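
For reference, the arithmetic behind the 512 boundary, as a small illustration:

{code}
// With 4 MB chunks, chunk index 512 corresponds to byte offset
// 512 * 4 * 1024 * 1024 = 2^31, one past the largest signed Int.
val chunkSize = 4 * 1024 * 1024
val offsetAsInt: Int   = 512 * chunkSize    // overflows: wraps to Int.MinValue
val offsetAsLong: Long = 512L * chunkSize   // 2147483648, as intended
{code}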

 IndexOutOfBoundsException in ChainedBuffer
 --

 Key: SPARK-7896
 URL: https://issues.apache.org/jira/browse/SPARK-7896
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Arun Ahuja
Assignee: Sandy Ryza
Priority: Blocker

 I've run into this on two tasks that use the same dataset.
 The dataset is a collection of strings where the most common string appears 
 ~200M times and the next few appear ~50M times each.
 For this rdd: RDD[String], I can do rdd.map(x => (x, 1)).reduceByKey(_ + _) 
 to get the counts (how I got the number above), but I hit the error on 
 rdd.groupByKey().
 Also, I have a second RDD of strings rdd2: RDD[String] and I cannot do 
 rdd2.leftOuterJoin(rdd) without hitting this error
 {code}
 15/05/26 23:27:55 WARN scheduler.TaskSetManager: Lost task 3169.1 in stage 
 5.0 (TID 4843, demeter-csmaz10-19.demeter.hpc.mssm.edu): 
 java.lang.IndexOutOfBoundsException: 512
 at 
 scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
 at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47)
 at 
 org.apache.spark.util.collection.ChainedBuffer.write(ChainedBuffer.scala:110)
 at 
 org.apache.spark.util.collection.ChainedBufferOutputStream.write(ChainedBuffer.scala:141)
 at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
 at 
 org.apache.spark.serializer.KryoSerializationStream.flush(KryoSerializer.scala:147)
 at 
 org.apache.spark.util.collection.PartitionedSerializedPairBuffer.insert(PartitionedSerializedPairBuffer.scala:78)
 at 
 org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:219)
 at 
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests

2015-05-27 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561656#comment-14561656
 ] 

Sandy Ryza commented on SPARK-4352:
---

I have a couple concerns about that approach.  The first is that we can end up 
requesting many more executors on a node than we need.  E.g. if only a single 
task is interested in running on node1, but many tasks are interested in 
running on node2, nothing will stop YARN from scheduling all requested 
executors on node1.  Conversely, we can end up requesting far fewer executors 
on a node than we need.  Once a single executor has been placed on a node, even 
if that executor only has a single core and there are many pending tasks 
interested in that node, we will stop trying to place executors on that node. 

I have an idea for a scheme that I think will better tailor requests to 
locality needs.  It's definitely more complex, but I don't think it's more 
complex than what frameworks like MapReduce do, and I think it's an answer to a 
fundamentally complex problem.

The idea is basically to submit executor requests such that, for each node, 
sum(cores from all executor requests with that node on their preferred list) = 
the number of tasks that prefer that node.

As an example, let's imagine our executors have 2 cores each, we have 4 nodes 
named a through d, and 30 pending tasks.  The first 20 of these pending 
tasks have locality preferences for nodes a, b, and c, and the other 10 prefer 
nodes a, b, and d.  This means the number of tasks that prefer each node is 
a: 30, b: 30, c: 20, and d: 10.

If the number of executors we want to request is >= 15, then we would submit the 
following requests:
requests for 5 executors with nodes = a, b, c, d
requests for 5 executors with nodes = a, b, c
requests for 5 executors with nodes = a, b

If the number of executors we want to request is 7, then we would submit the 
following requests:
requests for 5 executors with nodes = a, b, c, d
requests for 2 executors with nodes = a, b, c

If the number of executors we want to request is 18, then we would submit the 
following requests:
requests for 5 executors with nodes = a, b, c, d
requests for 5 executors with nodes = a, b, c
requests for 5 executors with nodes = a, b
requests for 3 executors with no locality preferences

We might want to augment this to account for the fact that some nodes have 
executors already.

What do you think? 
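
To make the scheme concrete, here is a rough sketch of how the request groups 
could be computed (the object and method names are made up for illustration):

{code}
// Sketch of the tiered-request idea: for each node, the summed cores of all
// requests listing that node should match the number of tasks preferring it.
object LocalityAwareRequestPlanner {

  /** @param tasksPerNode pending tasks that prefer each node
    * @return (preferred nodes, executors to request) groups */
  def plan(
      tasksPerNode: Map[String, Int],
      coresPerExecutor: Int,
      totalExecutors: Int): Seq[(Seq[String], Int)] = {

    // Executors needed per node so that their cores cover the tasks preferring it.
    val needed = tasksPerNode.map { case (node, tasks) =>
      node -> ((tasks + coresPerExecutor - 1) / coresPerExecutor)
    }

    val groups = scala.collection.mutable.ArrayBuffer[(Seq[String], Int)]()
    var remaining = totalExecutors
    var prevThreshold = 0

    // Walk need thresholds from smallest to largest; at each step the nodes
    // still short of executors form the preferred list for that batch.
    for (threshold <- needed.values.toSeq.distinct.sorted if remaining > 0) {
      val nodes = needed.filter(_._2 >= threshold).keys.toSeq.sorted
      val batch = math.min(threshold - prevThreshold, remaining)
      groups += ((nodes, batch))
      remaining -= batch
      prevThreshold = threshold
    }
    if (remaining > 0) groups += ((Seq.empty, remaining))  // no locality preference
    groups.toSeq
  }
}

// Example from the comment above: 2-core executors, tasks preferring
// a: 30, b: 30, c: 20, d: 10, and 18 executors to request:
//   plan(Map("a" -> 30, "b" -> 30, "c" -> 20, "d" -> 10), 2, 18)
//   => ((a, b, c, d), 5), ((a, b, c), 5), ((a, b), 5), ((), 3)
{code}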

 Incorporate locality preferences in dynamic allocation requests
 ---

 Key: SPARK-4352
 URL: https://issues.apache.org/jira/browse/SPARK-4352
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Assignee: Saisai Shao
Priority: Critical
 Attachments: Supportpreferrednodelocationindynamicallocation.pdf


 Currently, achieving data locality in Spark is difficult unless an 
 application takes resources on every node in the cluster.  
 preferredNodeLocalityData provides a sort of hacky workaround that has been 
 broken since 1.0.
 With dynamic executor allocation, Spark requests executors in response to 
 demand from the application.  When this occurs, it would be useful to look at 
 the pending tasks and communicate their location preferences to the cluster 
 resource manager. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7699) Number of executors can be reduced from initial before work is scheduled

2015-05-26 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560227#comment-14560227
 ] 

Sandy Ryza commented on SPARK-7699:
---

I think tying this to the AM-RM heartbeat would just make things more confusing, 
especially now that the heartbeat interval is variable.  Whether we've had a 
fair chance to allocate resources also depends on internal YARN configurations, 
like the NM-RM heartbeat interval or whether continuous scheduling is enabled.  
I don't think there's any easy notion of fair chance that doesn't rely on a 
timeout.

Another option would be to avoid adjusting targetNumExecutors down before the 
first job is submitted.
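
That second option could be as simple as a guard along these lines (class and 
field names are invented for illustration, not an actual patch):

{code}
// Sketch: don't let the idle-timeout path shrink the target below the
// configured initial value until the first job has actually been submitted.
class InitialTargetGuard(initialExecutors: Int) {
  @volatile var firstJobSubmitted = false   // flipped by a job-start listener

  def adjustTargetDown(proposedTarget: Int): Int =
    if (firstJobSubmitted) proposedTarget
    else math.max(proposedTarget, initialExecutors)
}
{code}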



 Number of executors can be reduced from initial before work is scheduled
 

 Key: SPARK-7699
 URL: https://issues.apache.org/jira/browse/SPARK-7699
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: meiyoula
Priority: Minor

 spark.dynamicAllocation.minExecutors 2
 spark.dynamicAllocation.initialExecutors  3
 spark.dynamicAllocation.maxExecutors 4
 Just run spark-shell with the above configurations; the initial executor 
 number is 2 rather than the configured 3.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7699) Number of executors can be reduced from initial before work is scheduled

2015-05-26 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558748#comment-14558748
 ] 

Sandy Ryza commented on SPARK-7699:
---

We can't wait only on the initial allocation being made because YARN might not 
be able to fully satisfy it in any finite amount of time.

 Number of executors can be reduced from initial before work is scheduled
 

 Key: SPARK-7699
 URL: https://issues.apache.org/jira/browse/SPARK-7699
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: meiyoula
Priority: Minor

 spark.dynamicAllocation.minExecutors 2
 spark.dynamicAllocation.initialExecutors  3
 spark.dynamicAllocation.maxExecutors 4
 Just run spark-shell with the above configurations; the initial executor 
 number is 2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7699) Number of executors can be reduced from initial before work is scheduled

2015-05-26 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558757#comment-14558757
 ] 

Sandy Ryza commented on SPARK-7699:
---

I think delaying releasing them is exactly the point of the property.  If we 
don't want to do that, what's it there for? 

 Number of executors can be reduced from initial before work is scheduled
 

 Key: SPARK-7699
 URL: https://issues.apache.org/jira/browse/SPARK-7699
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: meiyoula
Priority: Minor

 spark.dynamicAllocation.minExecutors 2
 spark.dynamicAllocation.initialExecutors  3
 spark.dynamicAllocation.maxExecutors 4
 Just run spark-shell with the above configurations; the initial executor 
 number is 2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests

2015-05-26 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-4352:
--
Assignee: Saisai Shao

 Incorporate locality preferences in dynamic allocation requests
 ---

 Key: SPARK-4352
 URL: https://issues.apache.org/jira/browse/SPARK-4352
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Assignee: Saisai Shao
Priority: Critical
 Attachments: Supportpreferrednodelocationindynamicallocation.pdf


 Currently, achieving data locality in Spark is difficult unless an 
 application takes resources on every node in the cluster.  
 preferredNodeLocalityData provides a sort of hacky workaround that has been 
 broken since 1.0.
 With dynamic executor allocation, Spark requests executors in response to 
 demand from the application.  When this occurs, it would be useful to look at 
 the pending tasks and communicate their location preferences to the cluster 
 resource manager. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7699) Config spark.dynamicAllocation.initialExecutors has no effect

2015-05-24 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557911#comment-14557911
 ] 

Sandy Ryza edited comment on SPARK-7699 at 5/25/15 1:26 AM:


[~sowen] I think the possible flaw in your argument is that it relies on 
initial load being defined in some reasonable way.

I.e. I think the worry is that the following can happen:
* initial = 3 and min = 1
* cluster is large and uncontended
* first line of user code is a job submission that can make use of at least 3
* because the executor allocation thread starts immediately, requested 
executors ramps down to 1 before the user code has a chance to submit the job

Which is to say: what guarantees do we provide about initialExecutors other 
than that it's the number of executor requests we have before some opaque 
internal thing happens to adjust it down?  One possible such guarantee we could 
provide is that we won't adjust down for some fixed number of seconds after the 
SparkContext starts.


was (Author: sandyr):
[~sowen] I think the possible flaw in your argument is that it relies on 
initial load being defined in some reasonable.

I.e. I think the worry is that the following can happen:
* initial = 3 and min = 1
* cluster is large and uncontended
* first line of user code is a job submission that can make use of at least 3
* because the executor allocation thread starts immediately, requested 
executors ramps down to 1 before the user code has a chance to submit the job

Which is to say: what guarantees do we provide about initialExecutors other 
than that it's the number of executor requests we have before some opaque 
internal thing happens to adjust it down?  One possible such guarantee we could 
provide is that we won't adjust down for some fixed number of seconds after the 
SparkContext starts.

 Config spark.dynamicAllocation.initialExecutors has no effect 
 

 Key: SPARK-7699
 URL: https://issues.apache.org/jira/browse/SPARK-7699
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: meiyoula

 spark.dynamicAllocation.minExecutors 2
 spark.dynamicAllocation.initialExecutors  3
 spark.dynamicAllocation.maxExecutors 4
 Just run spark-shell with the above configurations; the initial executor 
 number is 2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7699) Config spark.dynamicAllocation.initialExecutors has no effect

2015-05-24 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557911#comment-14557911
 ] 

Sandy Ryza commented on SPARK-7699:
---

[~sowen] I think the possible flaw in your argument is that it relies on 
initial load being defined in some reasonable way.

I.e. I think the worry is that the following can happen:
* initial = 3 and min = 1
* cluster is large and uncontended
* first line of user code is a job submission that can make use of at least 3
* because the executor allocation thread starts immediately, requested 
executors ramps down to 1 before the user code has a chance to submit the job

Which is to say: what guarantees do we provide about initialExecutors other 
than that it's the number of executor requests we have before some opaque 
internal thing happens to adjust it down?  One possible such guarantee we could 
provide is that we won't adjust down for some fixed number of seconds after the 
SparkContext starts.
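
A minimal sketch of that guarantee, assuming a fixed startup grace period (the 
names and the 30-second value below are arbitrary, not actual Spark code):

{code}
// Don't allow the executor target to drop below initialExecutors until a fixed
// grace period after the SparkContext started (sketch only).
val startupGraceMs = 30000L
val contextStartTime = System.currentTimeMillis()

def rampDownAllowed(now: Long = System.currentTimeMillis()): Boolean =
  now - contextStartTime >= startupGraceMs

def boundedTarget(computedTarget: Int, initialExecutors: Int): Int =
  if (rampDownAllowed()) computedTarget
  else math.max(computedTarget, initialExecutors)
{code}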

 Config spark.dynamicAllocation.initialExecutors has no effect 
 

 Key: SPARK-7699
 URL: https://issues.apache.org/jira/browse/SPARK-7699
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: meiyoula

 spark.dynamicAllocation.minExecutors 2
 spark.dynamicAllocation.initialExecutors  3
 spark.dynamicAllocation.maxExecutors 4
 Just run spark-shell with the above configurations; the initial executor 
 number is 2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7699) Config spark.dynamicAllocation.initialExecutors has no effect

2015-05-22 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14555837#comment-14555837
 ] 

Sandy Ryza commented on SPARK-7699:
---

Sorry for the delay here.  The desired behavior is to never have outstanding 
requests for more than the number of executors we'd need to satisfy all current 
tasks (unless minExecutors is set).  I.e. we don't want to ramp down gradually.

So I think the concerns here are valid - this policy means that as soon as the 
dynamic allocation thread starts being active, it will cancel any container 
requests that were made as a result of initialExecutors.  If, however, these 
executor requests were actually fulfilled, dynamic allocation wouldn't throw 
away the executors.

This means that the relevant questions are: is there a window of time after 
we've requested the initial executors but before the dynamic allocation thread 
starts?  Is there something fundamental about this window of time that means it 
will probably still be there after future scheduling optimizations?  If not, do 
we want to make sure that ExecutorAllocationManager itself doesn't ramp down 
below initialExecutors until some criterion (probably time) is satisfied?  Or 
should we just scrap the property?
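
For reference, the "never request more than we need for current tasks" policy 
boils down to something like the following sketch, where tasksPerExecutor 
would come from spark.executor.cores / spark.task.cpus:

{code}
// Target executor count derived from the current task load (sketch only).
def desiredExecutors(pendingTasks: Int, runningTasks: Int,
                     tasksPerExecutor: Int, minExecutors: Int): Int = {
  val needed = math.ceil((pendingTasks + runningTasks).toDouble / tasksPerExecutor).toInt
  math.max(needed, minExecutors)
}
{code}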


 Config spark.dynamicAllocation.initialExecutors has no effect 
 

 Key: SPARK-7699
 URL: https://issues.apache.org/jira/browse/SPARK-7699
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: meiyoula

 spark.dynamicAllocation.minExecutors 2
 spark.dynamicAllocation.initialExecutors  3
 spark.dynamicAllocation.maxExecutors 4
 Just run spark-shell with the above configurations; the initial executor 
 number is 2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests

2015-05-20 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14551898#comment-14551898
 ] 

Sandy Ryza commented on SPARK-4352:
---

I don't think we should kill executors in order to achieve locality.  But I 
think the question of how we translate task preferences into new executor 
requests is non-trivial, and the way that it was previously done with 
generateNodeToWeight isn't necessarily the best way (but could be).

I don't think there's an obvious solution, but it would be good to lay out the 
options and their pros and cons.

 Incorporate locality preferences in dynamic allocation requests
 ---

 Key: SPARK-4352
 URL: https://issues.apache.org/jira/browse/SPARK-4352
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Priority: Critical

 Currently, achieving data locality in Spark is difficult unless an 
 application takes resources on every node in the cluster.  
 preferredNodeLocalityData provides a sort of hacky workaround that has been 
 broken since 1.0.
 With dynamic executor allocation, Spark requests executors in response to 
 demand from the application.  When this occurs, it would be useful to look at 
 the pending tasks and communicate their location preferences to the cluster 
 resource manager. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests

2015-05-19 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14551677#comment-14551677
 ] 

Sandy Ryza commented on SPARK-4352:
---

Thanks for posting this Saisai.  Can you export and attach it as a PDF so that 
we've got an immutable copy for posterity?

This mostly looks good to me.  Regarding a couple of your open questions:
1. I think best would be to modify requestTotalExecutors (which is private) and 
avoid any public API additions for now.
2. Task locality preferences are computed already, right?  Can we put any 
computation beyond that in ExecutorAllocationManager so that it only happens 
when dynamic allocation is turned on?

I think another big question is how we translate task locality preferences into 
requests to YARN.  This impacts what the preferredNodeLocalityData should look 
like.  At any moment, we have a set of pending tasks, each with a set of node 
preferences and rack preferences, as well as a number of desired executors.  
How do we account for the fact that some nodes have more pending tasks than 
others?  What happens when we're working with executors with 5 cores, but none 
of the tasks share nodes that they want?  What happens when the number of 
pending tasks exceeds the total capacity of the desired number of executors?

One approach would be to request executors at every location that we have 
pending tasks, and then return executors once we've reached the number that we 
need.  Another would be to condense down our preferences into an optimal number 
of executor requests. 
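
As a sketch of the second option, condensing preferences could look roughly 
like this (illustrative only; the input is assumed to be the list of preferred 
nodes for each pending task):

{code}
// Turn per-task node preferences into per-node executor request counts (sketch).
def nodeRequestCounts(pendingTaskPrefs: Seq[Seq[String]],
                      tasksPerExecutor: Int): Map[String, Int] = {
  // Count how many pending tasks prefer each node.
  val tasksPerNode = pendingTaskPrefs.flatten.groupBy(identity).mapValues(_.size)
  // Convert task counts into executor counts, rounding up.
  tasksPerNode.mapValues(n => math.ceil(n.toDouble / tasksPerExecutor).toInt).toMap
}
{code}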


 Incorporate locality preferences in dynamic allocation requests
 ---

 Key: SPARK-4352
 URL: https://issues.apache.org/jira/browse/SPARK-4352
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Priority: Critical

 Currently, achieving data locality in Spark is difficult unless an 
 application takes resources on every node in the cluster.  
 preferredNodeLocalityData provides a sort of hacky workaround that has been 
 broken since 1.0.
 With dynamic executor allocation, Spark requests executors in response to 
 demand from the application.  When this occurs, it would be useful to look at 
 the pending tasks and communicate their location preferences to the cluster 
 resource manager. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7579) User guide update for OneHotEncoder

2015-05-13 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541517#comment-14541517
 ] 

Sandy Ryza commented on SPARK-7579:
---

Ah.  I was actually referring to the examples in the ml-guide.md doc.  Didn't 
notice the ml-features.md doc.  Seems like it should be pretty straightforward 
to add something there. 
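
Something along these lines would probably do for ml-features.md (a sketch; the 
column names and sample data are made up, and sqlContext is assumed to be in 
scope as in spark-shell):

{code}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val df = sqlContext.createDataFrame(
  Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"))
).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")

val indexed = indexer.fit(df).transform(df)
val encoded = encoder.transform(indexed)   // categoryVec holds the one-hot vectors
{code}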

 User guide update for OneHotEncoder
 ---

 Key: SPARK-7579
 URL: https://issues.apache.org/jira/browse/SPARK-7579
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Reporter: Joseph K. Bradley
Assignee: Sandy Ryza

 Copied from [SPARK-7443]:
 {quote}
 Now that we have algorithms in spark.ml which are not in spark.mllib, we 
 should start making subsections for the spark.ml API as needed. We can follow 
 the structure of the spark.mllib user guide.
 * The spark.ml user guide can provide: (a) code examples and (b) info on 
 algorithms which do not exist in spark.mllib.
 * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
 still the primary API, we should provide links to the corresponding 
 algorithms in the spark.mllib user guide for more info.
 {quote}
 Note: I created a new subsection for links to spark.ml-specific guides in 
 this JIRA's PR: [SPARK-7557]. This transformer can go within the new 
 subsection. I'll try to get that PR merged ASAP.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5888) Add OneHotEncoder as a Transformer

2015-05-13 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541524#comment-14541524
 ] 

Sandy Ryza commented on SPARK-5888:
---

[~mengxr] that makes sense to me, but does that address the issue I brought up 
above?

 Add OneHotEncoder as a Transformer
 --

 Key: SPARK-5888
 URL: https://issues.apache.org/jira/browse/SPARK-5888
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xiangrui Meng
Assignee: Sandy Ryza
 Fix For: 1.4.0


 `OneHotEncoder` takes a categorical column and outputs a vector column, which 
 stores the category info in binary form.
 {code}
 val ohe = new OneHotEncoder()
   .setInputCol("countryIndex")
   .setOutputCol("countries")
 {code}
 It should read the category info from the metadata and assign feature names 
 properly in the output column. We need to discuss the default naming scheme 
 and whether we should let it process multiple categorical columns at the same 
 time.
 One category (the most frequent one) should be removed from the output to 
 make the output columns linearly independent. Or this could be an option turned 
 on by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5888) Add OneHotEncoder as a Transformer

2015-05-12 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541339#comment-14541339
 ] 

Sandy Ryza commented on SPARK-5888:
---

Hi [~hvanhovell], I agree that this should work.  [~mengxr], any thoughts on 
the best way to solve this?

 Add OneHotEncoder as a Transformer
 --

 Key: SPARK-5888
 URL: https://issues.apache.org/jira/browse/SPARK-5888
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xiangrui Meng
Assignee: Sandy Ryza
 Fix For: 1.4.0


 `OneHotEncoder` takes a categorical column and outputs a vector column, which 
 stores the category info in binary form.
 {code}
 val ohe = new OneHotEncoder()
   .setInputCol("countryIndex")
   .setOutputCol("countries")
 {code}
 It should read the category info from the metadata and assign feature names 
 properly in the output column. We need to discuss the default naming scheme 
 and whether we should let it process multiple categorical columns at the same 
 time.
 One category (the most frequent one) should be removed from the output to 
 make the output columns linearly independent. Or this could be an option turned 
 on by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5888) Add OneHotEncoder as a Transformer

2015-05-12 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541367#comment-14541367
 ] 

Sandy Ryza commented on SPARK-5888:
---

Right, but while the values are unknown at first, they will become known at 
some point during the execution (after StringIndexer.fit completes).  So it 
seems like at some point it would be good to pass these values down.

Put another way, it seems bad to me that the user should see different behavior 
with regard to attribute values if they use OneHotEncoder in the Pipeline way, 
as described by Herman above, vs in the slightly more verbose way where 
StringIndexer.transform is explicitly called first: 
https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderSuite.scala#L50.

I'm relatively unfamiliar with these APIs, so apologies if I'm not making sense.
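
To spell out the two styles being compared (a sketch with made-up column names; 
df is an assumed input DataFrame):

{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val indexer = new StringIndexer().setInputCol("country").setOutputCol("countryIndex")
val encoder = new OneHotEncoder().setInputCol("countryIndex").setOutputCol("countryVec")

// Pipeline style: the encoder only ever sees countryIndex while the pipeline runs.
val viaPipeline = new Pipeline().setStages(Array(indexer, encoder)).fit(df).transform(df)

// Explicit style: StringIndexer.transform runs first, so the indexed column and
// its metadata already exist when the encoder transforms it.
val viaExplicit = encoder.transform(indexer.fit(df).transform(df))
{code}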

 Add OneHotEncoder as a Transformer
 --

 Key: SPARK-5888
 URL: https://issues.apache.org/jira/browse/SPARK-5888
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xiangrui Meng
Assignee: Sandy Ryza
 Fix For: 1.4.0


 `OneHotEncoder` takes a categorical column and outputs a vector column, which 
 stores the category info in binary form.
 {code}
 val ohe = new OneHotEncoder()
   .setInputCol("countryIndex")
   .setOutputCol("countries")
 {code}
 It should read the category info from the metadata and assign feature names 
 properly in the output column. We need to discuss the default naming scheme 
 and whether we should let it process multiple categorical columns at the same 
 time.
 One category (the most frequent one) should be removed from the output to 
 make the output columns linearly independent. Or this could be an option turned 
 on by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7579) User guide update for OneHotEncoder

2015-05-12 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541356#comment-14541356
 ] 

Sandy Ryza commented on SPARK-7579:
---

I can take this up.  Any thoughts on how it should be incorporated?  E.g. as an 
example a la Example: Model Selection via Cross-Validation?

 User guide update for OneHotEncoder
 ---

 Key: SPARK-7579
 URL: https://issues.apache.org/jira/browse/SPARK-7579
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Reporter: Joseph K. Bradley
Assignee: Sandy Ryza

 Copied from [SPARK-7443]:
 {quote}
 Now that we have algorithms in spark.ml which are not in spark.mllib, we 
 should start making subsections for the spark.ml API as needed. We can follow 
 the structure of the spark.mllib user guide.
 * The spark.ml user guide can provide: (a) code examples and (b) info on 
 algorithms which do not exist in spark.mllib.
 * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
 still the primary API, we should provide links to the corresponding 
 algorithms in the spark.mllib user guide for more info.
 {quote}
 Note: I created a new subsection for links to spark.ml-specific guides in 
 this JIRA's PR: [SPARK-7557]. This transformer can go within the new 
 subsection. I'll try to get that PR merged ASAP.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5888) Add OneHotEncoder as a Transformer

2015-05-12 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541351#comment-14541351
 ] 

Sandy Ryza commented on SPARK-5888:
---

The values of the nominal output attribute should be based on those of 
the input attribute.  Are you saying that there is a way to propagate these at 
a later time?

 Add OneHotEncoder as a Transformer
 --

 Key: SPARK-5888
 URL: https://issues.apache.org/jira/browse/SPARK-5888
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xiangrui Meng
Assignee: Sandy Ryza
 Fix For: 1.4.0


 `OneHotEncoder` takes a categorical column and outputs a vector column, which 
 stores the category info in binary form.
 {code}
 val ohe = new OneHotEncoder()
   .setInputCol("countryIndex")
   .setOutputCol("countries")
 {code}
 It should read the category info from the metadata and assign feature names 
 properly in the output column. We need to discuss the default naming scheme 
 and whether we should let it process multiple categorical columns at the same 
 time.
 One category (the most frequent one) should be removed from the output to 
 make the output columns linearly independent. Or this could be an option turned 
 on by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7515) Update documentation for PySpark on YARN with cluster mode

2015-05-11 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza resolved SPARK-7515.
---
  Resolution: Fixed
   Fix Version/s: 1.5.0
Target Version/s:   (was: 1.4.0)

 Update documentation for PySpark on YARN with cluster mode
 --

 Key: SPARK-7515
 URL: https://issues.apache.org/jira/browse/SPARK-7515
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Minor
 Fix For: 1.5.0


 Now PySpark on YARN with cluster mode is supported, so let's update the docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7515) Update documentation for PySpark on YARN with cluster mode

2015-05-11 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-7515:
--
Assignee: Kousuke Saruta

 Update documentation for PySpark on YARN with cluster mode
 --

 Key: SPARK-7515
 URL: https://issues.apache.org/jira/browse/SPARK-7515
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Minor

 Now PySpark on YARN with cluster mode is supported, so let's update the docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7410) Add option to avoid broadcasting configuration with newAPIHadoopFile

2015-05-11 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14538784#comment-14538784
 ] 

Sandy Ryza commented on SPARK-7410:
---

Thanks for the pointer, [~joshrosen].  I looked over that JIRA, and it's hard to 
understand what's going on at first glance.  Are you saying it's a performance 
issue or a correctness issue?  Or both?  I can look deeper if you don't 
remember.

Regarding performance, broadcasting the conf is probably faster in the majority 
of cases.  But in cases where the RDD has only a couple of partitions, the 
reverse is true.  So what I'm advocating for is the ability to turn 
broadcasting off in the latter case.

 Add option to avoid broadcasting configuration with newAPIHadoopFile
 

 Key: SPARK-7410
 URL: https://issues.apache.org/jira/browse/SPARK-7410
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Sandy Ryza

 I'm working with a Spark application that creates thousands of HadoopRDDs and 
 unions them together.  Certain details of the way the data is stored require 
 this.
 Creating ten thousand of these RDDs takes about 10 minutes, even before any 
 of them is used in an action.  I dug into why this takes so long and it looks 
 like the overhead of broadcasting the Hadoop configuration is taking up most 
 of the time.  In this case, the broadcasting isn't helpful because each 
 HadoopRDD only corresponds to one or two tasks.  When I reverted the original 
 change that switched to broadcasting configurations, the time it took to 
 instantiate these RDDs improved 10x.
 It would be nice if there was a way to turn this broadcasting off.  Either 
 through a Spark configuration option, a Hadoop configuration option, or an 
 argument to hadoopFile / newAPIHadoopFile.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7533) Decrease spacing between AM-RM heartbeats.

2015-05-11 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-7533:
-

 Summary: Decrease spacing between AM-RM heartbeats.
 Key: SPARK-7533
 URL: https://issues.apache.org/jira/browse/SPARK-7533
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.3.1
Reporter: Sandy Ryza


The current default of spark.yarn.scheduler.heartbeat.interval-ms is 5 seconds. 
 This is really long.  For reference, the MR equivalent is 1 second.

To avoid noise and unnecessary communication, we could have a fast rate for 
when we're waiting for executors and a slow rate for when we're just 
heartbeating.
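
A minimal, self-contained sketch of the dual-rate idea (the trait and interval 
values below are made up; this is not the actual ApplicationMaster code):

{code}
// Poll the RM quickly while executor requests are outstanding, slowly otherwise.
trait Allocator {
  def allocateResources(): Unit
  def pendingRequests: Int
}

def heartbeatLoop(allocator: Allocator, isFinished: () => Boolean): Unit = {
  val fastIntervalMs = 200L   // while we still want more executors
  val slowIntervalMs = 3000L  // once the target allocation is satisfied
  while (!isFinished()) {
    allocator.allocateResources()
    val interval = if (allocator.pendingRequests > 0) fastIntervalMs else slowIntervalMs
    Thread.sleep(interval)
  }
}
{code}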



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7410) Add option to avoid broadcasting configuration with newAPIHadoopFile

2015-05-06 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-7410:
-

 Summary: Add option to avoid broadcasting configuration with 
newAPIHadoopFile
 Key: SPARK-7410
 URL: https://issues.apache.org/jira/browse/SPARK-7410
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Sandy Ryza


I'm working with a Spark application that creates thousands of HadoopRDDs and 
unions them together.  Certain details of the way the data is stored require 
this.

Creating ten thousand of these RDDs takes about 10 minutes, even before any of 
them is used in an action.  I dug into why this takes so long and it looks like 
the overhead of broadcasting the Hadoop configuration is taking up most of the 
time.  In this case, the broadcasting isn't helpful because each HadoopRDD only 
corresponds to one or two tasks.  When I reverted the original change that 
switched to broadcasting configurations, the time it took to instantiate these 
RDDs improved 10x.

It would be nice if there was a way to turn this broadcasting off.  Either 
through a Spark configuration option, a Hadoop configuration option, or an 
argument to hadoopFile / newAPIHadoopFile.
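
For reference, the usage pattern looks roughly like this (paths and input types 
are illustrative; sc is an assumed SparkContext):

{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Each newAPIHadoopFile call currently broadcasts a Hadoop Configuration.
val paths = (0 until 10000).map(i => s"/data/part-$i")
val rdds = paths.map { p =>
  sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](p)
}
val combined = sc.union(rdds)
{code}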



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4550) In sort-based shuffle, store map outputs in serialized form

2015-05-01 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza resolved SPARK-4550.
---
   Resolution: Fixed
Fix Version/s: 1.4.0

 In sort-based shuffle, store map outputs in serialized form
 ---

 Key: SPARK-4550
 URL: https://issues.apache.org/jira/browse/SPARK-4550
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
Priority: Critical
 Fix For: 1.4.0

 Attachments: SPARK-4550-design-v1.pdf, kryo-flush-benchmark.scala


 One drawback with sort-based shuffle compared to hash-based shuffle is that 
 it ends up storing many more Java objects in memory.  If Spark could store 
 map outputs in serialized form, it could
 * spill less often because the serialized form is more compact
 * reduce GC pressure
 This will only work when the serialized representations of objects are 
 independent from each other and occupy contiguous segments of memory.  E.g. 
 when Kryo reference tracking is left on, objects may contain pointers to 
 objects farther back in the stream, which means that the sort can't relocate 
 objects without corrupting them.
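
One way to see the requirement is to check whether records serialized in one 
stream are byte-for-byte the same as the same records serialized separately and 
concatenated (an illustrative check only, not Spark's internal test):

{code}
import java.io.ByteArrayOutputStream
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

val conf = new SparkConf().set("spark.kryo.referenceTracking", "false")
val ser = new KryoSerializer(conf).newInstance()

def bytesOf(items: Seq[Any]): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val stream = ser.serializeStream(bos)
  items.foreach(item => stream.writeObject(item))
  stream.close()
  bos.toByteArray
}

// If the serialized form is self-contained, these should match, and the sorter
// can relocate individual records without corrupting them.
val together = bytesOf(Seq("a" -> 1, "b" -> 2))
val apart = bytesOf(Seq("a" -> 1)) ++ bytesOf(Seq("b" -> 2))
println(together.sameElements(apart))
{code}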



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7311) Enable in-memory serialized map-side shuffle to work with SQL serializers

2015-05-01 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-7311:
-

 Summary: Enable in-memory serialized map-side shuffle to work with 
SQL serializers
 Key: SPARK-7311
 URL: https://issues.apache.org/jira/browse/SPARK-7311
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 1.4.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza


Right now it only works with KryoSerializer, but it would be useful to make it 
work with any Serializer that writes objects in a relocatable / self-contained 
fashion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)

2015-04-28 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518286#comment-14518286
 ] 

Sandy Ryza commented on SPARK-3655:
---

My opinion is that a secondary sort operator in core Spark would definitely be 
useful.

 Support sorting of values in addition to keys (i.e. secondary sort)
 ---

 Key: SPARK-3655
 URL: https://issues.apache.org/jira/browse/SPARK-3655
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0, 1.2.0
Reporter: koert kuipers
Assignee: Koert Kuipers

 Now that Spark has a sort-based shuffle, can we expect a secondary sort soon? 
 There are some use cases where getting a sorted iterator of values per key is 
 helpful.
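
For reference, a sorted iterator of values per key can already be approximated 
with the existing repartitionAndSortWithinPartitions API (a sketch; the 
key/value types below are illustrative):

{code}
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Partition on the primary key only, while the shuffle sorts on the (key, value) pair.
class PrimaryKeyPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case (k: String, _) => ((k.hashCode % numPartitions) + numPartitions) % numPartitions
    case _ => 0
  }
}

def sortedValuesPerKey(rdd: RDD[(String, Int)], partitions: Int): RDD[((String, Int), Unit)] = {
  rdd.map { case (k, v) => ((k, v), ()) }
    .repartitionAndSortWithinPartitions(new PrimaryKeyPartitioner(partitions))
}
{code}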



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7173) Support YARN node label expressions for the application master

2015-04-27 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-7173:
-

 Summary: Support YARN node label expressions for the application 
master
 Key: SPARK-7173
 URL: https://issues.apache.org/jira/browse/SPARK-7173
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.3.1
Reporter: Sandy Ryza






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6954) ExecutorAllocationManager can end up requesting a negative number of executors

2015-04-26 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-6954:
--
Summary: ExecutorAllocationManager can end up requesting a negative number 
of executors  (was: Dynamic allocation: numExecutorsPending in 
ExecutorAllocationManager should never become negative)

 ExecutorAllocationManager can end up requesting a negative number of executors
 --

 Key: SPARK-6954
 URL: https://issues.apache.org/jira/browse/SPARK-6954
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.3.1
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
  Labels: yarn
 Attachments: with_fix.png, without_fix.png


 I have a simple test case for dynamic allocation on YARN that fails with the 
 following stack trace-
 {code}
 15/04/16 00:52:14 ERROR Utils: Uncaught exception in thread 
 spark-dynamic-executor-allocation-0
 java.lang.IllegalArgumentException: Attempted to request a negative number of 
 executor(s) -21 from the cluster manager. Please specify a positive number!
   at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:338)
   at 
 org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1137)
   at 
 org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:294)
   at 
 org.apache.spark.ExecutorAllocationManager.addOrCancelExecutorRequests(ExecutorAllocationManager.scala:263)
   at 
 org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:230)
   at 
 org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorAllocationManager.scala:189)
   at 
 org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
   at 
 org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
   at 
 org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:189)
   at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
   at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
   at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}
 My test is as follows-
 # Start spark-shell with a single executor.
 # Run a {{select count(\*)}} query. The number of executors rises as input 
 size is non-trivial.
 # After the job finishes, the number of  executors falls as most of them 
 become idle.
 # Rerun the same query again, and the request to add executors fails with the 
 above error. In fact, the job itself continues to run with whatever executors 
 it already has, but it never gets more executors unless the shell is closed 
 and restarted. 
 In fact, this error only happens when I configure {{executorIdleTimeout}} 
 very small. For example, I can reproduce it with the following configs-
 {code}
 spark.dynamicAllocation.executorIdleTimeout 5
 spark.dynamicAllocation.schedulerBacklogTimeout 5
 {code}
 Although I can simply increase {{executorIdleTimeout}} to something like 60 
 secs to avoid the error, I think this is still a bug to be fixed.
 The root cause seems to be that {{numExecutorsPending}} accidentally becomes 
 negative if executors are killed too aggressively (i.e. 
 {{executorIdleTimeout}} is too small) because under that circumstance, the 
 new target # of executors can be smaller than the current # of executors. 
 When that happens, {{ExecutorAllocationManager}} ends up trying to add a 
 negative number of executors, which throws an exception.
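
One way to guard against this is to clamp the computed delta so it never goes 
negative (a sketch of the idea, not the actual patch):

{code}
// Never ask the cluster manager to "add" a negative number of executors.
def executorsToAdd(targetTotal: Int, currentTotal: Int, pendingAdds: Int): Int = {
  val stillNeeded = targetTotal - currentTotal - pendingAdds
  math.max(stillNeeded, 0)   // a negative value means requests should be cancelled instead
}
{code}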



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6891) ExecutorAllocationManager will request negative number executors

2015-04-24 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza resolved SPARK-6891.
---
Resolution: Duplicate

 ExecutorAllocationManager will request negative number executors
 

 Key: SPARK-6891
 URL: https://issues.apache.org/jira/browse/SPARK-6891
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: meiyoula
Priority: Critical
 Attachments: DynamicExecutorTest.scala


 Below is the exception:
15/04/14 10:10:18 ERROR Utils: Uncaught exception in thread 
 spark-dynamic-executor-allocation-0
 java.lang.IllegalArgumentException: Attempted to request a negative 
 number of executor(s) -1 from the cluster manager. Please specify a positive 
 number!
  at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:342)
  at 
 org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1170)
  at 
 org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:294)
  at 
 org.apache.spark.ExecutorAllocationManager.addOrCancelExecutorRequests(ExecutorAllocationManager.scala:263)
  at 
 org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:230)
  at 
 org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorAllocationManager.scala:189)
  at 
 org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
  at 
 org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
  at 
 org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1723)
  at 
 org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:189)
  at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
  at 
 java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
  at 
 java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
  at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
  at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
  at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)
 Below are the configurations I set:
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors   0
spark.dynamicAllocation.initialExecutors3
spark.dynamicAllocation.maxExecutors7
spark.dynamicAllocation.executorIdleTimeout 30
spark.shuffle.service.enabled   true



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6891) ExecutorAllocationManager will request negative number executors

2015-04-24 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14510550#comment-14510550
 ] 

Sandy Ryza commented on SPARK-6891:
---

This looks like a duplicate of SPARK-6954.  While this JIRA was filed earlier, 
I'm going to close it because the other one already has a patch with heavy 
discussion.  Sorry that we didn't notice this earlier.

 ExecutorAllocationManager will request negative number executors
 

 Key: SPARK-6891
 URL: https://issues.apache.org/jira/browse/SPARK-6891
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: meiyoula
Priority: Critical
 Attachments: DynamicExecutorTest.scala


 Below is the exception:
15/04/14 10:10:18 ERROR Utils: Uncaught exception in thread 
 spark-dynamic-executor-allocation-0
 java.lang.IllegalArgumentException: Attempted to request a negative 
 number of executor(s) -1 from the cluster manager. Please specify a positive 
 number!
  at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:342)
  at 
 org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1170)
  at 
 org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:294)
  at 
 org.apache.spark.ExecutorAllocationManager.addOrCancelExecutorRequests(ExecutorAllocationManager.scala:263)
  at 
 org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:230)
  at 
 org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorAllocationManager.scala:189)
  at 
 org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
  at 
 org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
  at 
 org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1723)
  at 
 org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:189)
  at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
  at 
 java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
  at 
 java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
  at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
  at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
  at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)
 Below are the configurations I set:
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors   0
spark.dynamicAllocation.initialExecutors3
spark.dynamicAllocation.maxExecutors7
spark.dynamicAllocation.executorIdleTimeout 30
spark.shuffle.service.enabled   true



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6954) Dynamic allocation: numExecutorsPending in ExecutorAllocationManager should never become negative

2015-04-15 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497465#comment-14497465
 ] 

Sandy Ryza commented on SPARK-6954:
---

Hi [~cheolsoo], are you running with a version of Spark that contains 
SPARK-6325? (1.3.0 does not).

 Dynamic allocation: numExecutorsPending in ExecutorAllocationManager should 
 never become negative
 -

 Key: SPARK-6954
 URL: https://issues.apache.org/jira/browse/SPARK-6954
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.3.0
Reporter: Cheolsoo Park
Priority: Minor
  Labels: yarn

 I have a simple test case for dynamic allocation on YARN that fails with the 
 following stack trace-
 {code}
 15/04/16 00:52:14 ERROR Utils: Uncaught exception in thread 
 spark-dynamic-executor-allocation-0
 java.lang.IllegalArgumentException: Attempted to request a negative number of 
 executor(s) -21 from the cluster manager. Please specify a positive number!
   at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:338)
   at 
 org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1137)
   at 
 org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:294)
   at 
 org.apache.spark.ExecutorAllocationManager.addOrCancelExecutorRequests(ExecutorAllocationManager.scala:263)
   at 
 org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:230)
   at 
 org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorAllocationManager.scala:189)
   at 
 org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
   at 
 org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
   at 
 org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:189)
   at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
   at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
   at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}
 My test is as follows-
 # Start spark-shell with a single executor.
 # Run a {{select count(\*)}} query. The number of executors rises as input 
 size is non-trivial.
 # After the job finishes, the number of  executors falls as most of them 
 become idle.
 # Rerun the same query again, and the request to add executors fails with the 
 above error. In fact, the job itself continues to run with whatever executors 
 it already has, but it never gets more executors unless the shell is closed 
 and restarted. 
 In fact, this error only happens when I configure {{executorIdleTimeout}} 
 very small. For example, I can reproduce it with the following configs-
 {code}
 spark.dynamicAllocation.executorIdleTimeout 5
 spark.dynamicAllocation.schedulerBacklogTimeout 5
 {code}
 Although I can simply increase {{executorIdleTimeout}} to something like 60 
 secs to avoid the error, I think this is still a bug to be fixed.
 The root cause seems to be that {{numExecutorsPending}} accidentally becomes 
 negative if executors are killed too aggressively (i.e. 
 {{executorIdleTimeout}} is too small) because under that circumstance, the 
 new target # of executors can be smaller than the current # of executors. 
 When that happens, {{ExecutorAllocationManager}} ends up trying to add a 
 negative number of executors, which throws an exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5888) Add OneHotEncoder as a Transformer

2015-04-13 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza reassigned SPARK-5888:
-

Assignee: Sandy Ryza

 Add OneHotEncoder as a Transformer
 --

 Key: SPARK-5888
 URL: https://issues.apache.org/jira/browse/SPARK-5888
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xiangrui Meng
Assignee: Sandy Ryza

 `OneHotEncoder` takes a categorical column and outputs a vector column, which 
 stores the category info in binary form.
 {code}
 val ohe = new OneHotEncoder()
   .setInputCol("countryIndex")
   .setOutputCol("countries")
 {code}
 It should read the category info from the metadata and assign feature names 
 properly in the output column. We need to discuss the default naming scheme 
 and whether we should let it process multiple categorical columns at the same 
 time.
 One category (the most frequent one) should be removed from the output to 
 make the output columns linearly independent. Or this could be an option turned 
 on by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6735) Provide options to make maximum executor failure count ( which kills the application ) relative to a window duration or disable it.

2015-04-09 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487670#comment-14487670
 ] 

Sandy Ryza commented on SPARK-6735:
---

Hi [~twinkle], can you submit the PR against the main Spark project?

 Provide options to make maximum executor failure count ( which kills the 
 application ) relative to a window duration or disable it.
 ---

 Key: SPARK-6735
 URL: https://issues.apache.org/jira/browse/SPARK-6735
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit, YARN
Affects Versions: 1.2.0, 1.2.1, 1.3.0
Reporter: Twinkle Sachdeva

 Currently there is a setting (spark.yarn.max.executor.failures) which sets the 
 maximum number of executor failures after which the application fails.
 For long-running applications, a user may want to never kill the application, 
 or may want such a setting to be relative to a window duration. This 
 improvement is to provide options to make the maximum executor failure count 
 (which kills the application) relative to a window duration, or to disable it.
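
A window-relative policy could be sketched as follows (a hypothetical helper, 
not the actual YARN AM code; here a non-positive window stands in for 
"disabled"):

{code}
import scala.collection.mutable

// Count only the executor failures that happened within the configured window.
class WindowedFailureTracker(maxFailures: Int, windowMs: Long) {
  private val failureTimes = mutable.Queue[Long]()

  def recordFailure(now: Long = System.currentTimeMillis()): Unit =
    failureTimes.enqueue(now)

  def shouldFailApplication(now: Long = System.currentTimeMillis()): Boolean = {
    if (windowMs <= 0) return false   // windowing disabled: never fail the app
    while (failureTimes.nonEmpty && now - failureTimes.head > windowMs) {
      failureTimes.dequeue()
    }
    failureTimes.size > maxFailures
  }
}
{code}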



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6735) Provide options to make maximum executor failure count ( which kills the application ) relative to a window duration or disable it.

2015-04-09 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487670#comment-14487670
 ] 

Sandy Ryza edited comment on SPARK-6735 at 4/9/15 5:06 PM:
---

Hi [~twinkle], can you submit the PR against the main Spark project?


was (Author: sandyr):
Hi [~twinkle], can you submit the PR against the main Spark project.

 Provide options to make maximum executor failure count ( which kills the 
 application ) relative to a window duration or disable it.
 ---

 Key: SPARK-6735
 URL: https://issues.apache.org/jira/browse/SPARK-6735
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit, YARN
Affects Versions: 1.2.0, 1.2.1, 1.3.0
Reporter: Twinkle Sachdeva

 Currently there is a setting (spark.yarn.max.executor.failures) which sets the 
 maximum number of executor failures after which the application fails.
 For long-running applications, a user may want to never kill the application, 
 or may want such a setting to be relative to a window duration. This 
 improvement is to provide options to make the maximum executor failure count 
 (which kills the application) relative to a window duration, or to disable it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6700) flaky test: run Python application in yarn-cluster mode

2015-04-03 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14395135#comment-14395135
 ] 

Sandy Ryza edited comment on SPARK-6700 at 4/3/15 9:54 PM:
---

Does this fail often?  How often?  Might the issue be related to SPARK-6506?


was (Author: sandyr):
Does this fail often?

 flaky test: run Python application in yarn-cluster mode 
 

 Key: SPARK-6700
 URL: https://issues.apache.org/jira/browse/SPARK-6700
 Project: Spark
  Issue Type: Bug
  Components: Tests
Reporter: Davies Liu
Assignee: Lianhui Wang
Priority: Critical
  Labels: test, yarn

 org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in 
 yarn-cluster mode
 Failing for the past 1 build (Since Failed#2025 )
 Took 12 sec.
 Error Message
 {code}
 Process 
 List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
  --master, yarn-cluster, --num-executors, 1, --properties-file, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties,
  --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp)
  exited with code 1
 Stacktrace
 sbt.ForkMain$ForkError: Process 
 List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
  --master, yarn-cluster, --num-executors, 1, --properties-file, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties,
  --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp)
  exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
   at org.scalatest.Suite$class.run(Suite.scala:1424)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
   at 
 

[jira] [Commented] (SPARK-6700) flaky test: run Python application in yarn-cluster mode

2015-04-03 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14395135#comment-14395135
 ] 

Sandy Ryza commented on SPARK-6700:
---

Does this fail often?

 flaky test: run Python application in yarn-cluster mode 
 

 Key: SPARK-6700
 URL: https://issues.apache.org/jira/browse/SPARK-6700
 Project: Spark
  Issue Type: Bug
  Components: Tests
Reporter: Davies Liu
Assignee: Lianhui Wang
Priority: Critical
  Labels: test, yarn

 org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in 
 yarn-cluster mode
 Failing for the past 1 build (Since Failed#2025 )
 Took 12 sec.
 Error Message
 {code}
 Process 
 List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
  --master, yarn-cluster, --num-executors, 1, --properties-file, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties,
  --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp)
  exited with code 1
 Stacktrace
 sbt.ForkMain$ForkError: Process 
 List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
  --master, yarn-cluster, --num-executors, 1, --properties-file, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties,
  --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp)
  exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
   at org.scalatest.Suite$class.run(Suite.scala:1424)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite.org$scalatest$BeforeAndAfterAll$$super$run(YarnClusterSuite.scala:44)
   at 
 org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
   

[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390153#comment-14390153
 ] 

Sandy Ryza commented on SPARK-6646:
---

This seems like a good opportunity to finally add a DataFrame 
registerTempTablet API.

 Spark 2.0: Rearchitecting Spark for Mobile Platforms
 

 Key: SPARK-6646
 URL: https://issues.apache.org/jira/browse/SPARK-6646
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker
 Attachments: Spark on Mobile - Design Doc - v1.pdf


 Mobile computing is quickly rising to dominance, and by the end of 2017, it 
 is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
 Spark’s project goal can be accomplished only when Spark runs efficiently for 
 the growing population of mobile users.
 Designed and optimized for modern data centers and Big Data applications, 
 Spark is unfortunately not a good fit for mobile computing today. In the past 
 few months, we have been prototyping the feasibility of a mobile-first Spark 
 architecture, and today we would like to share with you our findings. This 
 ticket outlines the technical design of Spark’s mobile support, and shares 
 results from several early prototypes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390196#comment-14390196
 ] 

Sandy Ryza commented on SPARK-6646:
---

[~srowen] I like the way you think.  I know a lot of good nodes out there 
looking for love or at least a casual shutdown hookup. 

 Spark 2.0: Rearchitecting Spark for Mobile Platforms
 

 Key: SPARK-6646
 URL: https://issues.apache.org/jira/browse/SPARK-6646
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker
 Attachments: Spark on Mobile - Design Doc - v1.pdf


 Mobile computing is quickly rising to dominance, and by the end of 2017, it 
 is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
 Spark’s project goal can be accomplished only when Spark runs efficiently for 
 the growing population of mobile users.
 Designed and optimized for modern data centers and Big Data applications, 
 Spark is unfortunately not a good fit for mobile computing today. In the past 
 few months, we have been prototyping the feasibility of a mobile-first Spark 
 architecture, and today we would like to share with you our findings. This 
 ticket outlines the technical design of Spark’s mobile support, and shares 
 results from several early prototypes.
 Mobile friendly version of the design doc: 
 https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4550) In sort-based shuffle, store map outputs in serialized form

2015-03-26 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383280#comment-14383280
 ] 

Sandy Ryza commented on SPARK-4550:
---

Java serialization appears to write out the full class name the first time an 
object is written and then refer to it by an identifier afterwards:

{code}
scala> import java.io.{ByteArrayOutputStream, ObjectOutputStream}

scala> val baos = new ByteArrayOutputStream()
scala> val oos = new ObjectOutputStream(baos)
scala> oos.writeObject(new java.util.Date())
scala> oos.flush()

scala> baos.toString
res8: String = ��??sr??java.util.Datehj�?KYtxpwLY6: x 
scala> baos.toByteArray.length
res9: Int = 46

scala> oos.writeObject(new java.util.Date())
scala> oos.flush()

scala> baos.toString
res14: String = ��??sr??java.util.Datehj�?KYtxpwLY6: xsq?~??wLY6�Dx 
scala> baos.toByteArray.length
res13: Int = 63

scala> oos.writeObject(new java.util.Date())
scala> oos.flush()

scala> baos.toString
res17: String = ��??sr??java.util.Datehj�?KYtxpwLY6: xsq?~??wLY6�Dxsq?~??wLY8?�x 
scala> baos.toByteArray.length
res18: Int = 80
{code}

There might be some fancy way to listen for the class name being written out 
and relocate that segment to the front of the stream.  However, this seems 
fairly involved and bug-prone; my opinion is that it isn't worth it given that 
Java serialization is already a severely performance-impaired option.  Another 
option, of course, would be to write the class name in front of every record, 
but that would bloat the serialized representation considerably.
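
For contrast with the Java serialization behavior above, here is a minimal 
Scala sketch (using the plain Kryo API directly, rather than Spark's serializer 
wrapper) of why disabling reference tracking keeps each record's bytes 
self-contained, which is the property the sort needs in order to relocate 
records:

{code}
import java.io.ByteArrayOutputStream
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output

val kryo = new Kryo()
// Registering the class means records carry only their field data, and with
// reference tracking off there are no back-pointers into earlier parts of the
// stream, so each record's byte range can be moved independently.
kryo.register(classOf[java.util.Date])
kryo.setReferences(false)

val baos = new ByteArrayOutputStream()
val out = new Output(baos)

kryo.writeObject(out, new java.util.Date())
out.flush()
val firstRecordBytes = baos.size()

kryo.writeObject(out, new java.util.Date())
out.flush()
// Roughly the same size as the first record: no state is shared between
// records, unlike the shrinking Java-serialized records above.
val secondRecordBytes = baos.size() - firstRecordBytes
{code}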

 In sort-based shuffle, store map outputs in serialized form
 ---

 Key: SPARK-4550
 URL: https://issues.apache.org/jira/browse/SPARK-4550
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
Priority: Critical
 Attachments: SPARK-4550-design-v1.pdf, kryo-flush-benchmark.scala


 One drawback with sort-based shuffle compared to hash-based shuffle is that 
 it ends up storing many more java objects in memory.  If Spark could store 
 map outputs in serialized form, it could
 * spill less often because the serialized form is more compact
 * reduce GC pressure
 This will only work when the serialized representations of objects are 
 independent from each other and occupy contiguous segments of memory.  E.g. 
 when Kryo reference tracking is left on, objects may contain pointers to 
 objects farther back in the stream, which means that the sort can't relocate 
 objects without corrupting them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6479) Create off-heap block storage API (internal)

2015-03-24 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378011#comment-14378011
 ] 

Sandy Ryza commented on SPARK-6479:
---

I believe he means wrapping Spark's call-outs to Tachyon in the new APIs 
proposed here and invoking Tachyon through them. 

 Create off-heap block storage API (internal)
 

 Key: SPARK-6479
 URL: https://issues.apache.org/jira/browse/SPARK-6479
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager, Spark Core
Reporter: Reynold Xin
 Attachments: SparkOffheapsupportbyHDFS.pdf


 Would be great to create APIs for off-heap block stores, rather than doing a 
 bunch of if statements everywhere.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6470) Allow Spark apps to put YARN node labels in their requests

2015-03-23 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza reassigned SPARK-6470:
-

Assignee: Sandy Ryza

 Allow Spark apps to put YARN node labels in their requests
 --

 Key: SPARK-6470
 URL: https://issues.apache.org/jira/browse/SPARK-6470
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Sandy Ryza
Assignee: Sandy Ryza





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6418) Add simple per-stage visualization to the UI

2015-03-20 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372144#comment-14372144
 ] 

Sandy Ryza commented on SPARK-6418:
---

I think this would be a great addition.  One note is that I think it might be 
easier to interpret if the results were sorted by start time.  If possible it 
would also be really helpful to somehow give an indication of which tasks ran 
on which nodes, and to have task metrics show when you hover over a task.  And 
to have filtering.  And pagination.  Ok, I'm getting carried away, but it would 
be great to see something like this go in.

 Add simple per-stage visualization to the UI
 

 Key: SPARK-6418
 URL: https://issues.apache.org/jira/browse/SPARK-6418
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Kay Ousterhout
Assignee: Pradyumn Shroff
 Attachments: Screen Shot 2015-03-18 at 6.13.04 PM.png


 Visualizing how tasks in a stage spend their time can be very helpful to 
 understanding performance.  Many folks have started using the visualization 
 tools here: https://github.com/kayousterhout/trace-analysis (see the README 
 at the bottom) to analyze their jobs after they've finished running, but it 
 would be great if this functionality were natively integrated into Spark's UI.
 I'd propose adding a relatively simple visualization to the stage detail 
 page, that's hidden by default but that users can view by clicking on a 
 drop-down menu.  The plan is to implement this using D3; a mock up of how 
 this would look (that uses D3) is attached.
 This is intended to be a much simpler and more limited version of SPARK-3468



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6418) Add simple per-stage visualization to the UI

2015-03-20 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372178#comment-14372178
 ] 

Sandy Ryza commented on SPARK-6418:
---

Yeah, all of that is wishful thinking, definitely not requirements for a first 
version.  Dealing with huge numbers of tasks, as you mentioned, seems to me 
like a requirement for a first cut, and something like just taking the first N 
seems sufficient to me. 

 Add simple per-stage visualization to the UI
 

 Key: SPARK-6418
 URL: https://issues.apache.org/jira/browse/SPARK-6418
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Kay Ousterhout
Assignee: Pradyumn Shroff
 Attachments: Screen Shot 2015-03-18 at 6.13.04 PM.png


 Visualizing how tasks in a stage spend their time can be very helpful to 
 understanding performance.  Many folks have started using the visualization 
 tools here: https://github.com/kayousterhout/trace-analysis (see the README 
 at the bottom) to analyze their jobs after they've finished running, but it 
 would be great if this functionality were natively integrated into Spark's UI.
 I'd propose adding a relatively simple visualization to the stage detail 
 page, that's hidden by default but that users can view by clicking on a 
 drop-down menu.  The plan is to implement this using D3; a mock up of how 
 this would look (that uses D3) is attached.  One change we'll make for the 
 initial implementation, compared to the attached visualization, is tasks will 
 be sorted by start time.
 This is intended to be a much simpler and more limited version of SPARK-3468



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4550) In sort-based shuffle, store map outputs in serialized form

2015-03-20 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372232#comment-14372232
 ] 

Sandy Ryza commented on SPARK-4550:
---

I spoke briefly with Reynold about this offline, and he pointed out that, with 
the patch, we now flush the Kryo serialization stream after every object we 
write.  I put together a micro-benchmark to stress this that writes a bunch of 
small records to a Kryo serialization stream with and without flushing:

runs without flush: (count: 30, mean: 226.40, stdev: 3.929377, max: 
241.00, min: 222.00)
runs with flush: (count: 30, mean: 226.30, stdev: 2.084067, max: 
234.00, min: 224.00)

There doesn't appear to be a significant difference.  The benchmark code is 
attached.
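
For reference, a rough sketch of the shape of that comparison (the attached 
kryo-flush-benchmark.scala is the real benchmark; the record type, counts, and 
timing here are illustrative only):

{code}
import java.io.ByteArrayOutputStream
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output

// Illustrative sketch, not the attached benchmark: time writing many small
// records to a Kryo Output with and without a flush after every record.
def timeRun(flushEachRecord: Boolean): Long = {
  val kryo = new Kryo()
  val out = new Output(new ByteArrayOutputStream())
  val start = System.nanoTime()
  var i = 0
  while (i < 1000000) {
    kryo.writeObject(out, Int.box(i))  // Integer has a built-in Kryo serializer
    if (flushEachRecord) out.flush()
    i += 1
  }
  out.close()
  (System.nanoTime() - start) / 1000000  // elapsed milliseconds
}

// Warm up the JIT, then compare the two modes over several runs.
(1 to 3).foreach(_ => timeRun(flushEachRecord = false))
val withoutFlush = (1 to 5).map(_ => timeRun(flushEachRecord = false))
val withFlush = (1 to 5).map(_ => timeRun(flushEachRecord = true))
println(s"without flush: $withoutFlush")
println(s"with flush:    $withFlush")
{code}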


 In sort-based shuffle, store map outputs in serialized form
 ---

 Key: SPARK-4550
 URL: https://issues.apache.org/jira/browse/SPARK-4550
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
Priority: Critical
 Attachments: SPARK-4550-design-v1.pdf


 One drawback with sort-based shuffle compared to hash-based shuffle is that 
 it ends up storing many more java objects in memory.  If Spark could store 
 map outputs in serialized form, it could
 * spill less often because the serialized form is more compact
 * reduce GC pressure
 This will only work when the serialized representations of objects are 
 independent from each other and occupy contiguous segments of memory.  E.g. 
 when Kryo reference tracking is left on, objects may contain pointers to 
 objects farther back in the stream, which means that the sort can't relocate 
 objects without corrupting them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4550) In sort-based shuffle, store map outputs in serialized form

2015-03-20 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-4550:
--
Attachment: kryo-flush-benchmark.scala

 In sort-based shuffle, store map outputs in serialized form
 ---

 Key: SPARK-4550
 URL: https://issues.apache.org/jira/browse/SPARK-4550
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
Priority: Critical
 Attachments: SPARK-4550-design-v1.pdf, kryo-flush-benchmark.scala


 One drawback with sort-based shuffle compared to hash-based shuffle is that 
 it ends up storing many more java objects in memory.  If Spark could store 
 map outputs in serialized form, it could
 * spill less often because the serialized form is more compact
 * reduce GC pressure
 This will only work when the serialized representations of objects are 
 independent from each other and occupy contiguous segments of memory.  E.g. 
 when Kryo reference tracking is left on, objects may contain pointers to 
 objects farther back in the stream, which means that the sort can't relocate 
 objects without corrupting them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6393) Extra RPC to the AM during killExecutor invocation

2015-03-17 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-6393:
-

 Summary: Extra RPC to the AM during killExecutor invocation
 Key: SPARK-6393
 URL: https://issues.apache.org/jira/browse/SPARK-6393
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.3.1
Reporter: Sandy Ryza


This was introduced by SPARK-6325



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


