[jira] [Resolved] (SPARK-3682) Add helpful warnings to the UI
[ https://issues.apache.org/jira/browse/SPARK-3682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved SPARK-3682. --- Resolution: Won't Fix > Add helpful warnings to the UI > -- > > Key: SPARK-3682 > URL: https://issues.apache.org/jira/browse/SPARK-3682 > Project: Spark > Issue Type: New Feature > Components: Web UI > Affects Versions: 1.1.0 > Reporter: Sandy Ryza > Attachments: SPARK-3682Design.pdf > > > Spark has a zillion configuration options and a zillion different things that > can go wrong with a job. Improvements like incremental and better metrics > and the proposed Spark replay debugger provide more insight into what's going > on under the covers. However, it's difficult for non-advanced users to > synthesize this information and understand where to direct their attention. > It would be helpful to have some sort of central location on the UI that users > could go to for indications about why an app/job is failing or > performing poorly. > Some helpful messages that we could provide: > * Warn that the tasks in a particular stage are spending a long time in GC. > * Warn that spark.shuffle.memoryFraction does not fit inside the young > generation. > * Warn that tasks in a particular stage are very short, and that the number > of partitions should probably be decreased. > * Warn that tasks in a particular stage are spilling a lot, and that the number > of partitions should probably be increased. > * Warn that a cached RDD that gets a lot of use does not fit in memory, and a > lot of time is being spent recomputing it. > To start, two kinds of warnings would probably be most helpful: > * Warnings at the app level that report on misconfigurations and issues with the > general health of executors. > * Warnings at the job level that indicate why a job might be performing > slowly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
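For illustration, a minimal sketch of what the first warning above might look like as a heuristic, assuming per-task timings along the lines of Spark's TaskMetrics fields (jvmGCTime, executorRunTime); the TaskTimes wrapper, the 10% threshold, and the message wording are all invented:
{code}
// Hypothetical heuristic for the "long time in GC" warning above.
case class TaskTimes(jvmGCTime: Long, executorRunTime: Long)

def gcWarning(tasks: Seq[TaskTimes], threshold: Double = 0.1): Option[String] = {
  val totalGc  = tasks.map(_.jvmGCTime).sum.toDouble
  val totalRun = tasks.map(_.executorRunTime).sum.toDouble
  if (totalRun > 0 && totalGc / totalRun > threshold)
    Some(f"Tasks spent ${100 * totalGc / totalRun}%.1f%% of their time in GC; " +
      "consider increasing executor memory or tuning GC settings.")
  else None
}
{code}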
[jira] [Updated] (SPARK-5490) KMeans costs can be incorrect if tasks need to be rerun
[ https://issues.apache.org/jira/browse/SPARK-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-5490: -- Assignee: (was: Sandy Ryza) > KMeans costs can be incorrect if tasks need to be rerun > --- > > Key: SPARK-5490 > URL: https://issues.apache.org/jira/browse/SPARK-5490 > Project: Spark > Issue Type: Bug > Components: MLlib > Affects Versions: 1.3.0 > Reporter: Sandy Ryza > Labels: clustering > > KMeans uses accumulators in ShuffleMapTasks to compute the cost of a clustering at each > iteration. > Each time a ShuffleMapTask completes, it increments the accumulators at the > driver. If a task runs twice because of failures, the accumulators get > incremented twice, so a task's cost > can end up being double-counted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
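A minimal sketch of the hazard being described, using the Spark 1.x accumulator API in local mode; any task retry, speculative attempt, or stage recomputation would apply the `cost +=` update a second time:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Accumulator updated inside a shuffle map stage: the update runs once per
// task *attempt*, so reruns inflate the total that KMeans reads as the cost.
val sc = new SparkContext(new SparkConf().setAppName("accum-sketch").setMaster("local[2]"))
val cost = sc.accumulator(0.0)

val keyed = sc.parallelize(1 to 100).map { x =>
  cost += x          // side effect in a ShuffleMapTask: not idempotent across reruns
  (x % 10, x)
}
keyed.reduceByKey(_ + _).count()
println(cost.value)  // correct only if every map task ran exactly once
{code}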
[jira] [Resolved] (SPARK-7410) Add option to avoid broadcasting configuration with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-7410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved SPARK-7410. --- Resolution: Won't Fix > Add option to avoid broadcasting configuration with newAPIHadoopFile > > > Key: SPARK-7410 > URL: https://issues.apache.org/jira/browse/SPARK-7410 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Sandy Ryza > > I'm working with a Spark application that creates thousands of HadoopRDDs and > unions them together. Certain details of the way the data is stored require > this. > Creating ten thousand of these RDDs takes about 10 minutes, even before any > of them is used in an action. I dug into why this takes so long and it looks > like the overhead of broadcasting the Hadoop configuration is taking up most > of the time. In this case, the broadcasting isn't helpful because each > HadoopRDD only corresponds to one or two tasks. When I reverted the original > change that switched to broadcasting configurations, the time it took to > instantiate these RDDs improved 10x. > It would be nice if there was a way to turn this broadcasting off. Either > through a Spark configuration option, a Hadoop configuration option, or an > argument to hadoopFile / newAPIHadoopFile. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
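A sketch of the usage pattern in question, assuming an existing SparkContext `sc` and made-up paths; each newAPIHadoopFile call broadcasts its own Hadoop Configuration even though each resulting RDD backs only a task or two:
{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Thousands of tiny HadoopRDDs, unioned: the per-RDD configuration
// broadcast dominates setup time.
val paths = (0 until 10000).map(i => s"hdfs:///data/part-$i")
val rdds = paths.map { p =>
  sc.newAPIHadoopFile(p, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
}
val all = sc.union(rdds)
{code}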
[jira] [Commented] (SPARK-9999) Dataset API on top of Catalyst/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15022634#comment-15022634 ] Sandy Ryza commented on SPARK-9999: --- [~nchammas] it's not clear that it makes sense to add a similar API for Python and R. The main point of the Dataset API, as I understand it, is to extend DataFrames to take advantage of Java / Scala's static typing systems. This means recovering compile-time type safety, integration with existing Java / Scala object frameworks, and Scala syntactic sugar like pattern matching. Python and R are dynamically typed, so they can't take advantage of these. > Dataset API on top of Catalyst/DataFrame > > > Key: SPARK-9999 > URL: https://issues.apache.org/jira/browse/SPARK-9999 > Project: Spark > Issue Type: Story > Components: SQL > Reporter: Reynold Xin > Assignee: Michael Armbrust > > The RDD API is very flexible, but as a result its execution is harder to optimize > in some cases. The DataFrame API, on the other hand, is much easier > to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to > use UDFs, lack of strong types in Scala/Java). > The goal of Spark Datasets is to provide an API that allows users to easily > express transformations on domain objects, while also providing the > performance and robustness advantages of the Spark SQL execution engine. > h2. Requirements > - *Fast* - In most cases, the performance of Datasets should be equal to or > better than working with RDDs. Encoders should be as fast or faster than > Kryo and Java serialization, and unnecessary conversion should be avoided. > - *Typesafe* - Similar to RDDs, objects and functions that operate on those > objects should provide compile-time safety where possible. When converting > from data where the schema is not known at compile-time (for example data > read from an external source such as JSON), the conversion function should > fail-fast if there is a schema mismatch. > - *Support for a variety of object models* - Default encoders should be > provided for a variety of object models: primitive types, case classes, > tuples, POJOs, JavaBeans, etc. Ideally, objects that follow standard > conventions, such as Avro SpecificRecords, should also work out of the box. > - *Java Compatible* - Datasets should provide a single API that works in > both Scala and Java. Where possible, shared types like Array will be used in > the API. Where not possible, overloaded functions should be provided for > both languages. Scala concepts, such as ClassTags, should not be required in > the user-facing API. > - *Interoperates with DataFrames* - Users should be able to seamlessly > transition between Datasets and DataFrames, without specifying conversion > boiler-plate. When names used in the input schema line up with fields in the > given class, no extra mapping should be necessary. Libraries like MLlib > should not need to provide different interfaces for accepting DataFrames and > Datasets as input. > For a detailed outline of the complete proposed API: > [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files] > For an initial discussion of the design considerations in this API: [design > doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#] > The initial version of the Dataset API has been merged in Spark 1.6. However, > it will take a few more future releases to flesh everything out. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
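As a concrete illustration of the static-typing point above, a sketch against the Spark 1.6-era Scala API, assuming a SQLContext named `sqlContext` in scope (e.g. in spark-shell); the Person class and the typo'd field are made up:
{code}
import org.apache.spark.sql.Dataset
import sqlContext.implicits._

case class Person(name: String, age: Int)

val ds: Dataset[Person] = sqlContext.createDataset(Seq(Person("a", 1)))
ds.map(_.age + 1)   // field access checked by the Scala compiler
// ds.map(_.aeg)    // would not compile: value aeg is not a member of Person

val df = ds.toDF()
df.select("aeg")    // compiles fine, fails only at runtime -- the DataFrame
                    // behavior that Python and R are necessarily limited to
{code}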
[jira] [Commented] (SPARK-2089) With YARN, preferredNodeLocalityData isn't honored
[ https://issues.apache.org/jira/browse/SPARK-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14980717#comment-14980717 ] Sandy Ryza commented on SPARK-2089: --- My opinion is that we should be moving towards dynamic allocation as the norm, both for batch and long-running applications. With dynamic allocation turned on, it's possible to attain close to the same behavior as static allocation if you set max executors and a really fast ramp-up time. > With YARN, preferredNodeLocalityData isn't honored > --- > > Key: SPARK-2089 > URL: https://issues.apache.org/jira/browse/SPARK-2089 > Project: Spark > Issue Type: Bug > Components: YARN > Affects Versions: 1.0.0 > Reporter: Sandy Ryza > Assignee: Sandy Ryza > Priority: Critical > > When running in YARN cluster mode, apps can pass preferred locality data when > constructing a Spark context that will dictate where to request executor > containers. > This is currently broken because of a race condition. The Spark-YARN code > runs the user class and waits for it to start up a SparkContext. During its > initialization, the SparkContext will create a YarnClusterScheduler, which > notifies a monitor in the Spark-YARN code that the SparkContext is ready. The Spark-YARN code then > immediately fetches the preferredNodeLocationData from the SparkContext and > uses it to start requesting containers. > But in the SparkContext constructor that takes the preferredNodeLocationData, > setting preferredNodeLocationData comes after the rest of the initialization, > so, if the Spark-YARN code comes around quickly enough after being notified, > the data that's fetched is the empty unset version. This occurred during all > of my runs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
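For reference, a sketch of the configuration combination the comment alludes to; these are all real Spark 1.x properties, but the values are illustrative:
{code}
import org.apache.spark.SparkConf

// Approximating static allocation under dynamic allocation: pin min == max
// executors and use a short ramp-up timeout.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "50")
  .set("spark.dynamicAllocation.maxExecutors", "50")
  .set("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")
  .set("spark.shuffle.service.enabled", "true")  // external shuffle service is required
{code}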
[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14961513#comment-14961513 ] Sandy Ryza commented on SPARK-9999: --- So ClassTags would work for case classes and Avro specific records, but wouldn't work for tuples (or anywhere else types get erased). Blrgh. I wonder if the former is enough? Tuples are pretty useful though. > RDD-like API on top of Catalyst/DataFrame > - > > Key: SPARK-9999 > URL: https://issues.apache.org/jira/browse/SPARK-9999 > Project: Spark > Issue Type: Story > Components: SQL > Reporter: Reynold Xin > Assignee: Michael Armbrust > > The RDD API is very flexible, but as a result its execution is harder to optimize > in some cases. The DataFrame API, on the other hand, is much easier > to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to > use UDFs, lack of strong types in Scala/Java). > The goal of Spark Datasets is to provide an API that allows users to easily > express transformations on domain objects, while also providing the > performance and robustness advantages of the Spark SQL execution engine. > h2. Requirements > - *Fast* - In most cases, the performance of Datasets should be equal to or > better than working with RDDs. Encoders should be as fast or faster than > Kryo and Java serialization, and unnecessary conversion should be avoided. > - *Typesafe* - Similar to RDDs, objects and functions that operate on those > objects should provide compile-time safety where possible. When converting > from data where the schema is not known at compile-time (for example data > read from an external source such as JSON), the conversion function should > fail-fast if there is a schema mismatch. > - *Support for a variety of object models* - Default encoders should be > provided for a variety of object models: primitive types, case classes, > tuples, POJOs, JavaBeans, etc. Ideally, objects that follow standard > conventions, such as Avro SpecificRecords, should also work out of the box. > - *Java Compatible* - Datasets should provide a single API that works in > both Scala and Java. Where possible, shared types like Array will be used in > the API. Where not possible, overloaded functions should be provided for > both languages. Scala concepts, such as ClassTags, should not be required in > the user-facing API. > - *Interoperates with DataFrames* - Users should be able to seamlessly > transition between Datasets and DataFrames, without specifying conversion > boiler-plate. When names used in the input schema line up with fields in the > given class, no extra mapping should be necessary. Libraries like MLlib > should not need to provide different interfaces for accepting DataFrames and > Datasets as input. > For a detailed outline of the complete proposed API: > [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files] > For an initial discussion of the design considerations in this API: [design > doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
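The erasure problem in the comment above can be seen directly; Person here is an arbitrary example class:
{code}
import scala.reflect.classTag

case class Person(name: String, age: Int)

// A ClassTag for a case class carries enough to reflect on its fields...
println(classTag[Person].runtimeClass)        // class Person
// ...but a ClassTag for a tuple records only Tuple2: the Int and String
// parameters are erased, so an encoder cannot be derived from it alone.
println(classTag[(Int, String)].runtimeClass) // class scala.Tuple2
{code}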
[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957144#comment-14957144 ] Sandy Ryza commented on SPARK-9999: --- Maybe you all have thought through this as well, but I had some more thoughts on the proposed API. Fundamentally, it seems weird to me that the user is responsible for having a matching Encoder around every time they want to map to a class of a particular type. In 99% of cases, the Encoder used to encode any given type will be the same, and it seems more intuitive to me to specify this up front. To be more concrete, suppose I want to use case classes in my app and have a function that can auto-generate an Encoder from a class object (though this might be a little bit time consuming because it needs to use reflection). With the current proposal, any time I want to map my Dataset to a Dataset of some case class, I need to either have a line of code that generates an Encoder for that case class, or have an Encoder already lying around. If I perform this operation within a method, I need to pass the Encoder down to the method and include it in the signature. Ideally I would be able to register an EncoderSystem up front that caches Encoders and generates new Encoders whenever it sees a new class used (see the sketch after this message). This still of course requires the user to pass in type information when they call map, but it's easier for them to get this information than an actual encoder. If there's not some principled way to get this working implicitly with ClassTags, the user could just pass in classOf[MyCaseClass] as the second argument to map. > RDD-like API on top of Catalyst/DataFrame > - > > Key: SPARK-9999 > URL: https://issues.apache.org/jira/browse/SPARK-9999 > Project: Spark > Issue Type: Story > Components: SQL > Reporter: Reynold Xin > Assignee: Michael Armbrust > > The RDD API is very flexible, but as a result its execution is harder to optimize > in some cases. The DataFrame API, on the other hand, is much easier > to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to > use UDFs, lack of strong types in Scala/Java). > The goal of Spark Datasets is to provide an API that allows users to easily > express transformations on domain objects, while also providing the > performance and robustness advantages of the Spark SQL execution engine. > h2. Requirements > - *Fast* - In most cases, the performance of Datasets should be equal to or > better than working with RDDs. Encoders should be as fast or faster than > Kryo and Java serialization, and unnecessary conversion should be avoided. > - *Typesafe* - Similar to RDDs, objects and functions that operate on those > objects should provide compile-time safety where possible. When converting > from data where the schema is not known at compile-time (for example data > read from an external source such as JSON), the conversion function should > fail-fast if there is a schema mismatch. > - *Support for a variety of object models* - Default encoders should be > provided for a variety of object models: primitive types, case classes, > tuples, POJOs, JavaBeans, etc. Ideally, objects that follow standard > conventions, such as Avro SpecificRecords, should also work out of the box. > - *Java Compatible* - Datasets should provide a single API that works in > both Scala and Java. Where possible, shared types like Array will be used in > the API. Where not possible, overloaded functions should be provided for > both languages. 
Scala concepts, such as ClassTags, should not be required in > the user-facing API. > - *Interoperates with DataFrames* - Users should be able to seamlessly > transition between Datasets and DataFrames, without specifying conversion > boiler-plate. When names used in the input schema line up with fields in the > given class, no extra mapping should be necessary. Libraries like MLlib > should not need to provide different interfaces for accepting DataFrames and > Datasets as input. > For a detailed outline of the complete proposed API: > [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files] > For an initial discussion of the design considerations in this API: [design > doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
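A hypothetical sketch of the EncoderSystem idea from the comment above; nothing here (Encoder as written, EncoderSystem, generateViaReflection) is real Spark API:
{code}
import scala.collection.concurrent.TrieMap

trait Encoder[T]  // stand-in for the real encoder abstraction

object EncoderSystem {
  private val cache = TrieMap.empty[Class[_], Encoder[_]]

  // Reflection-based generation runs once per distinct class; subsequent
  // lookups hit the cache, so the user only ever passes a class object
  // (e.g. classOf[MyCaseClass]) to map.
  def encoderFor[T](cls: Class[T]): Encoder[T] =
    cache.getOrElseUpdate(cls, generateViaReflection(cls)).asInstanceOf[Encoder[T]]

  private def generateViaReflection[T](cls: Class[T]): Encoder[T] =
    new Encoder[T] {}  // placeholder for reflection-based schema inference
}
{code}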
[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956341#comment-14956341 ] Sandy Ryza commented on SPARK-9999: --- Thanks for the explanation [~rxin] and [~marmbrus]. I understand the problem and don't have any great ideas for an alternative workable solution. > RDD-like API on top of Catalyst/DataFrame > - > > Key: SPARK-9999 > URL: https://issues.apache.org/jira/browse/SPARK-9999 > Project: Spark > Issue Type: Story > Components: SQL > Reporter: Reynold Xin > Assignee: Michael Armbrust > > The RDD API is very flexible, but as a result its execution is harder to optimize > in some cases. The DataFrame API, on the other hand, is much easier > to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to > use UDFs, lack of strong types in Scala/Java). > The goal of Spark Datasets is to provide an API that allows users to easily > express transformations on domain objects, while also providing the > performance and robustness advantages of the Spark SQL execution engine. > h2. Requirements > - *Fast* - In most cases, the performance of Datasets should be equal to or > better than working with RDDs. Encoders should be as fast or faster than > Kryo and Java serialization, and unnecessary conversion should be avoided. > - *Typesafe* - Similar to RDDs, objects and functions that operate on those > objects should provide compile-time safety where possible. When converting > from data where the schema is not known at compile-time (for example data > read from an external source such as JSON), the conversion function should > fail-fast if there is a schema mismatch. > - *Support for a variety of object models* - Default encoders should be > provided for a variety of object models: primitive types, case classes, > tuples, POJOs, JavaBeans, etc. Ideally, objects that follow standard > conventions, such as Avro SpecificRecords, should also work out of the box. > - *Java Compatible* - Datasets should provide a single API that works in > both Scala and Java. Where possible, shared types like Array will be used in > the API. Where not possible, overloaded functions should be provided for > both languages. Scala concepts, such as ClassTags, should not be required in > the user-facing API. > - *Interoperates with DataFrames* - Users should be able to seamlessly > transition between Datasets and DataFrames, without specifying conversion > boiler-plate. When names used in the input schema line up with fields in the > given class, no extra mapping should be necessary. Libraries like MLlib > should not need to provide different interfaces for accepting DataFrames and > Datasets as input. > For a detailed outline of the complete proposed API: > [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files] > For an initial discussion of the design considerations in this API: [design > doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958022#comment-14958022 ] Sandy Ryza commented on SPARK-9999: --- bq. The problem with doing this using a registry (like kryo in RDDs today) is that then you aren't finding out the object type until you have an example object from realizing the computation. My suggestion was that the user would still need to pass the class object, so this shouldn't be a problem, unless I'm misunderstanding. Thanks for the pointer to the test suite. So am I to understand correctly that with Scala implicits magic I can do the following without any additional boilerplate?
{code}
import ...
case class MyClass1()
case class MyClass2()

val ds: Dataset[MyClass1] = ...
val ds2: Dataset[MyClass2] = ds.map(funcThatConvertsFromMyClass1ToMyClass2)
{code}
and in Java, imagining those case classes above were POJOs, we'd be able to support the following?
{code}
Dataset<MyClass2> ds2 = ds1.map(funcThatConvertsFromMyClass1ToMyClass2, MyClass2.class);
{code}
If that's the case, then that resolves my concerns above. Lastly, though, IIUC, it seems like for all the common cases, we could register an object with the SparkContext that converts from ClassTag to Encoder, and the RDD API would work. Where does that break down? > RDD-like API on top of Catalyst/DataFrame > - > > Key: SPARK-9999 > URL: https://issues.apache.org/jira/browse/SPARK-9999 > Project: Spark > Issue Type: Story > Components: SQL > Reporter: Reynold Xin > Assignee: Michael Armbrust > > The RDD API is very flexible, but as a result its execution is harder to optimize > in some cases. The DataFrame API, on the other hand, is much easier > to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to > use UDFs, lack of strong types in Scala/Java). > The goal of Spark Datasets is to provide an API that allows users to easily > express transformations on domain objects, while also providing the > performance and robustness advantages of the Spark SQL execution engine. > h2. Requirements > - *Fast* - In most cases, the performance of Datasets should be equal to or > better than working with RDDs. Encoders should be as fast or faster than > Kryo and Java serialization, and unnecessary conversion should be avoided. > - *Typesafe* - Similar to RDDs, objects and functions that operate on those > objects should provide compile-time safety where possible. When converting > from data where the schema is not known at compile-time (for example data > read from an external source such as JSON), the conversion function should > fail-fast if there is a schema mismatch. > - *Support for a variety of object models* - Default encoders should be > provided for a variety of object models: primitive types, case classes, > tuples, POJOs, JavaBeans, etc. Ideally, objects that follow standard > conventions, such as Avro SpecificRecords, should also work out of the box. > - *Java Compatible* - Datasets should provide a single API that works in > both Scala and Java. Where possible, shared types like Array will be used in > the API. Where not possible, overloaded functions should be provided for > both languages. Scala concepts, such as ClassTags, should not be required in > the user-facing API. > - *Interoperates with DataFrames* - Users should be able to seamlessly > transition between Datasets and DataFrames, without specifying conversion > boiler-plate. When names used in the input schema line up with fields in the > given class, no extra mapping should be necessary. 
Libraries like MLlib > should not need to provide different interfaces for accepting DataFrames and > Datasets as input. > For a detailed outline of the complete proposed API: > [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files] > For an initial discussion of the design considerations in this API: [design > doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955286#comment-14955286 ] Sandy Ryza commented on SPARK-9999: --- To ask the obvious question: what are the reasons that the RDD API couldn't be adapted to these purposes? If I understand correctly, a summary of the differences is that Datasets: 1. Support encoders for conversion to schema'd / efficiently serializable data 2. Have a GroupedDataset concept 3. Execute on Catalyst instead of directly on top of the DAGScheduler How difficult would it be to add encoders on top of RDDs, as well as a GroupedRDD? Is there anything in the RDD API contract that says RDDs can't be executed on top of Catalyst? This surely creates some dependency hell, given that SQL depends on core, but that's still better than exposing an entirely new API that looks almost like the original one. > RDD-like API on top of Catalyst/DataFrame > - > > Key: SPARK-9999 > URL: https://issues.apache.org/jira/browse/SPARK-9999 > Project: Spark > Issue Type: Story > Components: SQL > Reporter: Reynold Xin > Assignee: Michael Armbrust > > The RDD API is very flexible, but as a result its execution is harder to optimize > in some cases. The DataFrame API, on the other hand, is much easier > to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to > use UDFs, lack of strong types in Scala/Java). > The goal of Spark Datasets is to provide an API that allows users to easily > express transformations on domain objects, while also providing the > performance and robustness advantages of the Spark SQL execution engine. > h2. Requirements > - *Fast* - In most cases, the performance of Datasets should be equal to or > better than working with RDDs. Encoders should be as fast or faster than > Kryo and Java serialization, and unnecessary conversion should be avoided. > - *Typesafe* - Similar to RDDs, objects and functions that operate on those > objects should provide compile-time safety where possible. When converting > from data where the schema is not known at compile-time (for example data > read from an external source such as JSON), the conversion function should > fail-fast if there is a schema mismatch. > - *Support for a variety of object models* - Default encoders should be > provided for a variety of object models: primitive types, case classes, > tuples, POJOs, JavaBeans, etc. Ideally, objects that follow standard > conventions, such as Avro SpecificRecords, should also work out of the box. > - *Java Compatible* - Datasets should provide a single API that works in > both Scala and Java. Where possible, shared types like Array will be used in > the API. Where not possible, overloaded functions should be provided for > both languages. Scala concepts, such as ClassTags, should not be required in > the user-facing API. > - *Interoperates with DataFrames* - Users should be able to seamlessly > transition between Datasets and DataFrames, without specifying conversion > boiler-plate. When names used in the input schema line up with fields in the > given class, no extra mapping should be necessary. Libraries like MLlib > should not need to provide different interfaces for accepting DataFrames and > Datasets as input. 
> For a detailed outline of the complete proposed API: > [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files] > For an initial discussion of the design considerations in this API: [design > doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955320#comment-14955320 ] Sandy Ryza commented on SPARK-9999: --- [~rxin] where are the places where the API would need to break? > RDD-like API on top of Catalyst/DataFrame > - > > Key: SPARK-9999 > URL: https://issues.apache.org/jira/browse/SPARK-9999 > Project: Spark > Issue Type: Story > Components: SQL > Reporter: Reynold Xin > Assignee: Michael Armbrust > > The RDD API is very flexible, but as a result its execution is harder to optimize > in some cases. The DataFrame API, on the other hand, is much easier > to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to > use UDFs, lack of strong types in Scala/Java). > The goal of Spark Datasets is to provide an API that allows users to easily > express transformations on domain objects, while also providing the > performance and robustness advantages of the Spark SQL execution engine. > h2. Requirements > - *Fast* - In most cases, the performance of Datasets should be equal to or > better than working with RDDs. Encoders should be as fast or faster than > Kryo and Java serialization, and unnecessary conversion should be avoided. > - *Typesafe* - Similar to RDDs, objects and functions that operate on those > objects should provide compile-time safety where possible. When converting > from data where the schema is not known at compile-time (for example data > read from an external source such as JSON), the conversion function should > fail-fast if there is a schema mismatch. > - *Support for a variety of object models* - Default encoders should be > provided for a variety of object models: primitive types, case classes, > tuples, POJOs, JavaBeans, etc. Ideally, objects that follow standard > conventions, such as Avro SpecificRecords, should also work out of the box. > - *Java Compatible* - Datasets should provide a single API that works in > both Scala and Java. Where possible, shared types like Array will be used in > the API. Where not possible, overloaded functions should be provided for > both languages. Scala concepts, such as ClassTags, should not be required in > the user-facing API. > - *Interoperates with DataFrames* - Users should be able to seamlessly > transition between Datasets and DataFrames, without specifying conversion > boiler-plate. When names used in the input schema line up with fields in the > given class, no extra mapping should be necessary. Libraries like MLlib > should not need to provide different interfaces for accepting DataFrames and > Datasets as input. > For a detailed outline of the complete proposed API: > [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files] > For an initial discussion of the design considerations in this API: [design > doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955840#comment-14955840 ] Sandy Ryza commented on SPARK-9999: --- If I understand correctly, it seems like there are ways to work around each of these issues that, necessarily, make the API dirtier, but avoid the need for a whole new public API. * groupBy: deprecate the old groupBy and add a groupWith or groupby method that returns a GroupedRDD (a signature sketch follows this message). * partitions: have -1 be a special value that means "determined by the planner" * encoders: what are the main obstacles to addressing this with an EncodedRDD that extends RDD? Regarding the issues Michael brought up: I'd love to get rid of class tags from the public API as well as take out JavaRDD, but these seem more like "nice to have" than core to the proposal. Am I misunderstanding? All of these of course add ugliness, but I think it's really easy to underestimate the cost of introducing a new API. Applications everywhere become legacy and need to be rewritten to take advantage of new features. Code examples and training materials everywhere become invalidated. Can we point to systems that have successfully made a transition like this at this point in their maturity? > RDD-like API on top of Catalyst/DataFrame > - > > Key: SPARK-9999 > URL: https://issues.apache.org/jira/browse/SPARK-9999 > Project: Spark > Issue Type: Story > Components: SQL > Reporter: Reynold Xin > Assignee: Michael Armbrust > > The RDD API is very flexible, but as a result its execution is harder to optimize > in some cases. The DataFrame API, on the other hand, is much easier > to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to > use UDFs, lack of strong types in Scala/Java). > The goal of Spark Datasets is to provide an API that allows users to easily > express transformations on domain objects, while also providing the > performance and robustness advantages of the Spark SQL execution engine. > h2. Requirements > - *Fast* - In most cases, the performance of Datasets should be equal to or > better than working with RDDs. Encoders should be as fast or faster than > Kryo and Java serialization, and unnecessary conversion should be avoided. > - *Typesafe* - Similar to RDDs, objects and functions that operate on those > objects should provide compile-time safety where possible. When converting > from data where the schema is not known at compile-time (for example data > read from an external source such as JSON), the conversion function should > fail-fast if there is a schema mismatch. > - *Support for a variety of object models* - Default encoders should be > provided for a variety of object models: primitive types, case classes, > tuples, POJOs, JavaBeans, etc. Ideally, objects that follow standard > conventions, such as Avro SpecificRecords, should also work out of the box. > - *Java Compatible* - Datasets should provide a single API that works in > both Scala and Java. Where possible, shared types like Array will be used in > the API. Where not possible, overloaded functions should be provided for > both languages. Scala concepts, such as ClassTags, should not be required in > the user-facing API. > - *Interoperates with DataFrames* - Users should be able to seamlessly > transition between Datasets and DataFrames, without specifying conversion > boiler-plate. When names used in the input schema line up with fields in the > given class, no extra mapping should be necessary. 
Libraries like MLlib > should not need to provide different interfaces for accepting DataFrames and > Datasets as input. > For a detailed outline of the complete proposed API: > [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files] > For an initial discussion of the design considerations in this API: [design > doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
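A hypothetical signature sketch for the groupWith workaround floated in the comment above; GroupedRDD does not exist in Spark, and the mapGroups body is a naive placeholder rather than anything shuffle-optimal:
{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

class GroupedRDD[K: ClassTag, V: ClassTag](self: RDD[(K, V)]) {
  def mapGroups[U: ClassTag](f: (K, Iterator[V]) => U): RDD[U] =
    self.groupByKey().map { case (k, vs) => f(k, vs.iterator) }  // relies on Spark 1.3+ pair-RDD implicits
}

implicit class RichPairRDD[K: ClassTag, V: ClassTag](self: RDD[(K, V)]) {
  def grouped: GroupedRDD[K, V] = new GroupedRDD(self)  // named to avoid clashing with the existing groupWith
}
{code}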
[jira] [Commented] (SPARK-10739) Add attempt window for long running Spark application on Yarn
[ https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902982#comment-14902982 ] Sandy Ryza commented on SPARK-10739: That's the one I was referring to as well. That's about executor failure, whereas this is about AM failure, so they're different issues. > Add attempt window for long running Spark application on Yarn > - > > Key: SPARK-10739 > URL: https://issues.apache.org/jira/browse/SPARK-10739 > Project: Spark > Issue Type: Improvement > Components: YARN > Reporter: Saisai Shao > Priority: Minor > > Currently Spark on YARN uses max attempts to control the failure count: if an > application's failure count reaches the max attempts, the application will not > be recovered by the RM. This is not very effective for long-running > applications, which can easily exceed the max over a long time period, while > setting a very large max attempts will hide real problems. > So this proposes an attempt window to control which application attempts are > counted: attempts that fall outside the window are ignored. The window was introduced in Hadoop 2.6+ > to support long-running applications, and it is quite useful for Spark Streaming and > Spark shell-like applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
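For context, a sketch of the Hadoop 2.6+ hook this proposal builds on, with an illustrative window value; real Spark code would guard the call for older YARN versions the same way it does for other version-specific APIs:
{code}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext

def applyAttemptWindow(ctx: ApplicationSubmissionContext, windowMs: Long): Unit = {
  ctx.setMaxAppAttempts(3)
  // Attempts that fall outside the validity window no longer count toward
  // max attempts. This method only exists in Hadoop 2.6+.
  ctx.setAttemptFailuresValidityInterval(windowMs)
}

// e.g. applyAttemptWindow(appContext, 60 * 60 * 1000L)  // one-hour window
{code}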
[jira] [Commented] (SPARK-10739) Add attempt window for long running Spark application on Yarn
[ https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901901#comment-14901901 ] Sandy Ryza commented on SPARK-10739: I recall there was a JIRA similar to this that avoided killing the application when we reached a certain number of executor failures. However, IIUC, this is about something different: deciding whether to have YARN restart the application when it fails. > Add attempt window for long running Spark application on Yarn > - > > Key: SPARK-10739 > URL: https://issues.apache.org/jira/browse/SPARK-10739 > Project: Spark > Issue Type: Improvement > Components: YARN > Reporter: Saisai Shao > Priority: Minor > > Currently Spark on YARN uses max attempts to control the failure count: if an > application's failure count reaches the max attempts, the application will not > be recovered by the RM. This is not very effective for long-running > applications, which can easily exceed the max over a long time period, while > setting a very large max attempts will hide real problems. > So this proposes an attempt window to control which application attempts are > counted: attempts that fall outside the window are ignored. The window was introduced in Hadoop 2.6+ > to support long-running applications, and it is quite useful for Spark Streaming and > Spark shell-like applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4534) With YARN, provide a way for JavaSparkContext to add preferredNodeLocalityData to SparkContext
[ https://issues.apache.org/jira/browse/SPARK-4534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved SPARK-4534. --- Resolution: Won't Fix As SPARK-2089 was closed as "Won't Fix", closing this as well. > With YARN, provide a way for JavaSparkContext to add preferredNodeLocalityData to > SparkContext > > > Key: SPARK-4534 > URL: https://issues.apache.org/jira/browse/SPARK-4534 > Project: Spark > Issue Type: Improvement > Components: Java API > Reporter: Lianhui Wang > > example:
> SparkConf sparkConf = new SparkConf();
> Configuration conf = new Configuration();
> List<InputFormatInfo> inputFormatInfoList = new ArrayList<InputFormatInfo>();
> inputFormatInfoList.add(new InputFormatInfo(conf, org.apache.hadoop.mapred.TextInputFormat.class, path));
> JavaSparkContext ctx = new JavaSparkContext(sparkConf, inputFormatInfoList);
> we can use the above code to let preferredNodeLocalityData work in a Java > program. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9782) Add support for YARN application tags running Spark on YARN
[ https://issues.apache.org/jira/browse/SPARK-9782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-9782: -- Assignee: Dennis Huo Add support for YARN application tags running Spark on YARN --- Key: SPARK-9782 URL: https://issues.apache.org/jira/browse/SPARK-9782 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.4.1 Reporter: Dennis Huo Assignee: Dennis Huo https://issues.apache.org/jira/browse/YARN-1390 originally added the new “Application Tags” feature to YARN to help track the sources of applications among many possible YARN clients. https://issues.apache.org/jira/browse/YARN-1399 improved on this to allow a set of tags to be applied, and for comparison, https://issues.apache.org/jira/browse/MAPREDUCE-5699 added support for MapReduce to easily propagate tags through to YARN via Configuration settings. Since the ApplicationSubmissionContext.setApplicationTags method was only added in Hadoop 2.4+, Spark support will invoke the method via reflection the same way other such version-specific methods are called elsewhere in the YARN client. Since the usage of tags is generally not critical to the functionality of older YARN setups, it should be safe to handle NoSuchMethodException with just a logWarning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
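A sketch of the reflection approach the description outlines, with logging stubbed out via println; the reflective method lookup doubles as the version guard:
{code}
import java.util.{Set => JSet}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext

def setApplicationTags(appContext: ApplicationSubmissionContext, tags: JSet[String]): Unit = {
  try {
    // setApplicationTags(Set<String>) only exists in Hadoop 2.4+,
    // so look it up reflectively rather than linking against it.
    val method = appContext.getClass.getMethod("setApplicationTags", classOf[JSet[String]])
    method.invoke(appContext, tags)
  } catch {
    case _: NoSuchMethodException =>
      println(s"WARN: ignoring application tags $tags; requires YARN 2.4+")
  }
}
{code}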
[jira] [Resolved] (SPARK-9782) Add support for YARN application tags running Spark on YARN
[ https://issues.apache.org/jira/browse/SPARK-9782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved SPARK-9782. --- Resolution: Fixed Fix Version/s: 1.6.0 Add support for YARN application tags running Spark on YARN --- Key: SPARK-9782 URL: https://issues.apache.org/jira/browse/SPARK-9782 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.4.1 Reporter: Dennis Huo Assignee: Dennis Huo Fix For: 1.6.0 https://issues.apache.org/jira/browse/YARN-1390 originally added the new “Application Tags” feature to YARN to help track the sources of applications among many possible YARN clients. https://issues.apache.org/jira/browse/YARN-1399 improved on this to allow a set of tags to be applied, and for comparison, https://issues.apache.org/jira/browse/MAPREDUCE-5699 added support for MapReduce to easily propagate tags through to YARN via Configuration settings. Since the ApplicationSubmissionContext.setApplicationTags method was only added in Hadoop 2.4+, Spark support will invoke the method via reflection the same way other such version-specific methods are called elsewhere in the YARN client. Since the usage of tags is generally not critical to the functionality of older YARN setups, it should be safe to handle NoSuchMethodException with just a logWarning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8674) 2-sample, 2-sided Kolmogorov Smirnov Test
[ https://issues.apache.org/jira/browse/SPARK-8674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-8674: -- Summary: 2-sample, 2-sided Kolmogorov Smirnov Test (was: [WIP] 2-sample, 2-sided Kolmogorov Smirnov Test Implementation) 2-sample, 2-sided Kolmogorov Smirnov Test - Key: SPARK-8674 URL: https://issues.apache.org/jira/browse/SPARK-8674 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Jose Cambronero Assignee: Jose Cambronero Priority: Minor We added functionality to calculate a 2-sample, 2-sided Kolmogorov Smirnov test for 2 RDD[Double]. The calculation provides a test for the null hypothesis that both samples come from the same probability distribution. The implementation seeks to minimize the shuffles necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
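A minimal local sketch of the statistic in question, the maximum absolute distance between the two empirical CDFs; the actual MLlib work distributes this computation, whereas here both samples fit in memory:
{code}
def ksStatistic(xs: Array[Double], ys: Array[Double]): Double = {
  val x = xs.sorted
  val y = ys.sorted
  val n = x.length.toDouble
  val m = y.length.toDouble
  var i = 0; var j = 0; var d = 0.0
  while (i < x.length && j < y.length) {
    val v = math.min(x(i), y(j))
    while (i < x.length && x(i) == v) i += 1  // consume ties in x
    while (j < y.length && y(j) == v) j += 1  // consume ties in y
    d = math.max(d, math.abs(i / n - j / m))  // ECDF gap just past value v
  }
  d  // large d is evidence against the null hypothesis of a common distribution
}
{code}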
[jira] [Updated] (SPARK-8674) [WIP] 2-sample, 2-sided Kolmogorov Smirnov Test Implementation
[ https://issues.apache.org/jira/browse/SPARK-8674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-8674: -- Assignee: Jose Cambronero [WIP] 2-sample, 2-sided Kolmogorov Smirnov Test Implementation -- Key: SPARK-8674 URL: https://issues.apache.org/jira/browse/SPARK-8674 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Jose Cambronero Assignee: Jose Cambronero Priority: Minor We added functionality to calculate a 2-sample, 2-sided Kolmogorov Smirnov test for 2 RDD[Double]. The calculation provides a test for the null hypothesis that both samples come from the same probability distribution. The implementation seeks to minimize the shuffles necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7707) User guide and example code for Statistics.kernelDensity
[ https://issues.apache.org/jira/browse/SPARK-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698634#comment-14698634 ] Sandy Ryza commented on SPARK-7707: --- [~mengxr] thoughts on which page this should land in? mllib-statistics? User guide and example code for Statistics.kernelDensity Key: SPARK-7707 URL: https://issues.apache.org/jira/browse/SPARK-7707 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7707) User guide and example code for KernelDensity
[ https://issues.apache.org/jira/browse/SPARK-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-7707: -- Summary: User guide and example code for KernelDensity (was: User guide and example code for Statistics.kernelDensity) User guide and example code for KernelDensity - Key: SPARK-7707 URL: https://issues.apache.org/jira/browse/SPARK-7707 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7707) User guide and example code for KernelDensity
[ https://issues.apache.org/jira/browse/SPARK-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza reassigned SPARK-7707: - Assignee: Sandy Ryza User guide and example code for KernelDensity - Key: SPARK-7707 URL: https://issues.apache.org/jira/browse/SPARK-7707 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7707) User guide and example code for Statistics.kernelDensity
[ https://issues.apache.org/jira/browse/SPARK-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694413#comment-14694413 ] Sandy Ryza commented on SPARK-7707: --- Again, sorry for the long delay here. I'm traveling for the rest of this week, but can get to this next week. Is that soon enough? User guide and example code for Statistics.kernelDensity Key: SPARK-7707 URL: https://issues.apache.org/jira/browse/SPARK-7707 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9808) Remove hash shuffle file consolidation
[ https://issues.apache.org/jira/browse/SPARK-9808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14681601#comment-14681601 ] Sandy Ryza commented on SPARK-9808: --- Have we considered removing the hash-based shuffle entirely? Especially without shuffle file consolidation, it's almost always a poor choice. Remove hash shuffle file consolidation -- Key: SPARK-9808 URL: https://issues.apache.org/jira/browse/SPARK-9808 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Josh Rosen Assignee: Josh Rosen I think that we should remove {{spark.shuffle.consolidateFiles}} and its associated implementation for Spark 1.5.0. This feature isn't properly tested and does not work with the external shuffle service. Since it's likely to be buggy and since its motivation has been subsumed by sort-based shuffle, I think it's a prime candidate for removal now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9808) Remove hash shuffle file consolidation
[ https://issues.apache.org/jira/browse/SPARK-9808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692589#comment-14692589 ] Sandy Ryza commented on SPARK-9808: --- I don't have strong opinions here, but as a data point: my impression was that nearly everyone using hash-based shuffle was using shuffle file consolidation. I believe we recommended it to most of our users when hash-based shuffle was the default. Remove hash shuffle file consolidation -- Key: SPARK-9808 URL: https://issues.apache.org/jira/browse/SPARK-9808 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Josh Rosen Assignee: Josh Rosen I think that we should remove {{spark.shuffle.consolidateFiles}} and its associated implementation for Spark 1.5.0. This feature isn't properly tested and does not work with the external shuffle service. Since it's likely to be buggy and since its motivation has been subsumed by sort-based shuffle, I think it's a prime candidate for removal now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
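For context, the configuration combination under discussion; both properties are real in Spark 1.x, where sort-based shuffle has been the default since 1.2:
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.manager", "hash")           // opt back into hash-based shuffle
  .set("spark.shuffle.consolidateFiles", "true")  // the flag proposed for removal
{code}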
[jira] [Resolved] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests
[ https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved SPARK-4352. --- Resolution: Fixed Fix Version/s: 1.5.0 Target Version/s: 1.5.0 Incorporate locality preferences in dynamic allocation requests --- Key: SPARK-4352 URL: https://issues.apache.org/jira/browse/SPARK-4352 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Saisai Shao Priority: Critical Fix For: 1.5.0 Attachments: Supportpreferrednodelocationindynamicallocation.pdf Currently, achieving data locality in Spark is difficult unless an application takes resources on every node in the cluster. preferredNodeLocalityData provides a sort of hacky workaround that has been broken since 1.0. With dynamic executor allocation, Spark requests executors in response to demand from the application. When this occurs, it would be useful to look at the pending tasks and communicate their location preferences to the cluster resource manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1744) Document how to pass in preferredNodeLocationData
[ https://issues.apache.org/jira/browse/SPARK-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved SPARK-1744. --- Resolution: Won't Fix Document how to pass in preferredNodeLocationData - Key: SPARK-1744 URL: https://issues.apache.org/jira/browse/SPARK-1744 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.0.0 Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9092) Make --num-executors compatible with dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635942#comment-14635942 ] Sandy Ryza commented on SPARK-9092: --- I had a brief discussion with [~andrewor14] about this offline and wanted to move the discussion public. --num-executors and dynamic allocation are fundamentally at odds with each other in the sense that neither makes sense in the context of the other. This means that essentially one needs to override the other. My position is that it makes more sense for --num-executors to override dynamic allocation than the other way around. I.e. if --num-executors is set, behave as if dynamic allocation were disabled. The advantages of this are: * Cluster operators can turn on dynamic allocation as the default cluster setting without impacting existing applications. The precedent set by existing big data processing frameworks (MR, Impala, Tez) is that users can depend on the framework to determine how many resources to acquire from the cluster manager, so I think it's reasonable that most clusters would want to move to dynamic allocation as the default. In a Cloudera setting, we plan to eventually enable dynamic allocation as the factory default for all clusters, but we'd like to minimize the extent to which we change the behavior of existing apps. * --num-executors is conceptually a more specific property, and specific properties tend to override more general ones. Make --num-executors compatible with dynamic allocation --- Key: SPARK-9092 URL: https://issues.apache.org/jira/browse/SPARK-9092 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Niranjan Padmanabhan Currently when you enable dynamic allocation, you can't use --num-executors or the property spark.executor.instances. If we are to enable dynamic allocation by default, we should make these work so that existing workloads don't fail -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
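A sketch of the precedence rule argued for above, as illustrative decision logic rather than Spark's actual implementation; --num-executors maps to the spark.executor.instances property:
{code}
import org.apache.spark.SparkConf

def dynamicAllocationEnabled(conf: SparkConf): Boolean = {
  val explicitCount = conf.contains("spark.executor.instances")
  if (explicitCount && conf.getBoolean("spark.dynamicAllocation.enabled", false)) {
    println("WARN: spark.executor.instances is set; behaving as if dynamic allocation were disabled")
  }
  // The more specific property wins: an explicit executor count turns
  // dynamic allocation off rather than failing the application.
  conf.getBoolean("spark.dynamicAllocation.enabled", false) && !explicitCount
}
{code}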
[jira] [Resolved] (SPARK-1640) In yarn-client mode, pass preferred node locations to AM
[ https://issues.apache.org/jira/browse/SPARK-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved SPARK-1640. --- Resolution: Invalid In yarn-client mode, pass preferred node locations to AM Key: SPARK-1640 URL: https://issues.apache.org/jira/browse/SPARK-1640 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 0.9.0 Reporter: Sandy Ryza In yarn-cluster mode, if the user passes preferred node location data to the SparkContext, the AM requests containers based on that data. In yarn-client mode, it would be good to do this as well. This would require some way of passing this data from the client process to the AM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8623) Some queries in spark-sql lead to NullPointerException when using Yarn
[ https://issues.apache.org/jira/browse/SPARK-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603243#comment-14603243 ] Sandy Ryza commented on SPARK-8623: --- I'm able to reproduce this locally. Looking into the cause. Some queries in spark-sql lead to NullPointerException when using Yarn -- Key: SPARK-8623 URL: https://issues.apache.org/jira/browse/SPARK-8623 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Environment: Hadoop 2.6, Kerberos Reporter: Bolke de Bruin The following query was executed using spark-sql --master yarn-client on 1.5.0-SNAPSHOT: select * from wcs.geolite_city limit 10; This led to the following error: 15/06/25 09:38:37 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, lxhnl008.ad.ing.net): java.lang.NullPointerException at org.apache.hadoop.conf.Configuration.init(Configuration.java:693) at org.apache.hadoop.mapred.JobConf.init(JobConf.java:442) at org.apache.hadoop.mapreduce.Job.init(Job.java:131) at org.apache.spark.sql.sources.SqlNewHadoopRDD.getJob(SqlNewHadoopRDD.scala:83) at org.apache.spark.sql.sources.SqlNewHadoopRDD.getConf(SqlNewHadoopRDD.scala:89) at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.init(SqlNewHadoopRDD.scala:127) at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124) at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) This does not happen in every case, i.e. some queries execute fine, and it is unclear why. Using just spark-sql, the query executes fine as well, and thus the issue seems to lie in the communication with Yarn. Also, the query executes fine (with yarn) in spark-shell. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8623) Some queries in spark-sql lead to NullPointerException when using Yarn
[ https://issues.apache.org/jira/browse/SPARK-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603694#comment-14603694 ] Sandy Ryza commented on SPARK-8623: --- Figured out the issue: my patch omitted registering a custom Kryo serializer for SerializableConfiguration, so it gets serialized and deserialized using Kryo's default field serialization. That means the writeObject and readObject methods are never called, and the internal Configuration is never instantiated. Uploading a patch that I've found to fix the issue. Some queries in spark-sql lead to NullPointerException when using Yarn -- Key: SPARK-8623 URL: https://issues.apache.org/jira/browse/SPARK-8623 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Environment: Hadoop 2.6, Kerberos Reporter: Bolke de Bruin The following query was executed using spark-sql --master yarn-client on 1.5.0-SNAPSHOT: select * from wcs.geolite_city limit 10; This led to the following error: 15/06/25 09:38:37 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, lxhnl008.ad.ing.net): java.lang.NullPointerException at org.apache.hadoop.conf.Configuration.init(Configuration.java:693) at org.apache.hadoop.mapred.JobConf.init(JobConf.java:442) at org.apache.hadoop.mapreduce.Job.init(Job.java:131) at org.apache.spark.sql.sources.SqlNewHadoopRDD.getJob(SqlNewHadoopRDD.scala:83) at org.apache.spark.sql.sources.SqlNewHadoopRDD.getConf(SqlNewHadoopRDD.scala:89) at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.init(SqlNewHadoopRDD.scala:127) at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124) at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) This does not happen in every case, i.e. some queries execute fine, and it is unclear why. Using just spark-sql, the query executes fine as well, and thus the issue seems to lie in the communication with Yarn. Also, the query executes fine (with yarn) in spark-shell. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
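For readers unfamiliar with this Kryo failure mode, here is a hedged sketch of the kind of registration the comment describes, assuming a SerializableConfiguration wrapper class like the one discussed in SPARK-8135 below.
{code}
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.serializers.JavaSerializer

// Registering Kryo's JavaSerializer for the wrapper forces it through Java
// serialization, so writeObject/readObject run and the wrapped Hadoop
// Configuration is reconstructed on the executor instead of left null.
def registerConfigWrapper(kryo: Kryo): Unit = {
  kryo.register(classOf[SerializableConfiguration], new JavaSerializer())
}
{code}
Without the registration, Kryo serializes the wrapper field by field, skips the custom read/write hooks, and the transient Configuration deserializes as null, producing the NullPointerException above.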
[jira] [Updated] (SPARK-8623) Hadoop RDDs fail to properly serialize configuration
[ https://issues.apache.org/jira/browse/SPARK-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-8623: -- Summary: Hadoop RDDs fail to properly serialize configuration (was: Some queries in spark-sql lead to NullPointerException when using Yarn) Hadoop RDDs fail to properly serialize configuration Key: SPARK-8623 URL: https://issues.apache.org/jira/browse/SPARK-8623 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.5.0 Environment: Hadoop 2.6, Kerberos Reporter: Bolke de Bruin Assignee: Sandy Ryza The following query was executed using spark-sql --master yarn-client on 1.5.0-SNAPSHOT: select * from wcs.geolite_city limit 10; This led to the following error: 15/06/25 09:38:37 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, lxhnl008.ad.ing.net): java.lang.NullPointerException at org.apache.hadoop.conf.Configuration.init(Configuration.java:693) at org.apache.hadoop.mapred.JobConf.init(JobConf.java:442) at org.apache.hadoop.mapreduce.Job.init(Job.java:131) at org.apache.spark.sql.sources.SqlNewHadoopRDD.getJob(SqlNewHadoopRDD.scala:83) at org.apache.spark.sql.sources.SqlNewHadoopRDD.getConf(SqlNewHadoopRDD.scala:89) at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.init(SqlNewHadoopRDD.scala:127) at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124) at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) This does not happen in every case, i.e. some queries execute fine, and it is unclear why. Using just spark-sql, the query executes fine as well, and thus the issue seems to lie in the communication with Yarn. Also, the query executes fine (with yarn) in spark-shell. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8623) Some queries in spark-sql lead to NullPointerException when using Yarn
[ https://issues.apache.org/jira/browse/SPARK-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza reassigned SPARK-8623: - Assignee: Sandy Ryza Some queries in spark-sql lead to NullPointerException when using Yarn -- Key: SPARK-8623 URL: https://issues.apache.org/jira/browse/SPARK-8623 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.5.0 Environment: Hadoop 2.6, Kerberos Reporter: Bolke de Bruin Assignee: Sandy Ryza The following query was executed using spark-sql --master yarn-client on 1.5.0-SNAPSHOT: select * from wcs.geolite_city limit 10; This led to the following error: 15/06/25 09:38:37 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, lxhnl008.ad.ing.net): java.lang.NullPointerException at org.apache.hadoop.conf.Configuration.init(Configuration.java:693) at org.apache.hadoop.mapred.JobConf.init(JobConf.java:442) at org.apache.hadoop.mapreduce.Job.init(Job.java:131) at org.apache.spark.sql.sources.SqlNewHadoopRDD.getJob(SqlNewHadoopRDD.scala:83) at org.apache.spark.sql.sources.SqlNewHadoopRDD.getConf(SqlNewHadoopRDD.scala:89) at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.init(SqlNewHadoopRDD.scala:127) at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124) at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) This does not happen in every case, i.e. some queries execute fine, and it is unclear why. Using just spark-sql, the query executes fine as well, and thus the issue seems to lie in the communication with Yarn. Also, the query executes fine (with yarn) in spark-shell. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8623) Some queries in spark-sql lead to NullPointerException when using Yarn
[ https://issues.apache.org/jira/browse/SPARK-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-8623: -- Component/s: (was: SQL) Spark Core Some queries in spark-sql lead to NullPointerException when using Yarn -- Key: SPARK-8623 URL: https://issues.apache.org/jira/browse/SPARK-8623 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.5.0 Environment: Hadoop 2.6, Kerberos Reporter: Bolke de Bruin The following query was executed using spark-sql --master yarn-client on 1.5.0-SNAPSHOT: select * from wcs.geolite_city limit 10; This led to the following error: 15/06/25 09:38:37 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, lxhnl008.ad.ing.net): java.lang.NullPointerException at org.apache.hadoop.conf.Configuration.init(Configuration.java:693) at org.apache.hadoop.mapred.JobConf.init(JobConf.java:442) at org.apache.hadoop.mapreduce.Job.init(Job.java:131) at org.apache.spark.sql.sources.SqlNewHadoopRDD.getJob(SqlNewHadoopRDD.scala:83) at org.apache.spark.sql.sources.SqlNewHadoopRDD.getConf(SqlNewHadoopRDD.scala:89) at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.init(SqlNewHadoopRDD.scala:127) at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124) at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) This does not happen in every case, i.e. some queries execute fine, and it is unclear why. Using just spark-sql, the query executes fine as well, and thus the issue seems to lie in the communication with Yarn. Also, the query executes fine (with yarn) in spark-shell. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8623) Some queries in spark-sql lead to NullPointerException when using Yarn
[ https://issues.apache.org/jira/browse/SPARK-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601593#comment-14601593 ] Sandy Ryza commented on SPARK-8623: --- Looking into it. Some queries in spark-sql lead to NullPointerException when using Yarn -- Key: SPARK-8623 URL: https://issues.apache.org/jira/browse/SPARK-8623 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Environment: Hadoop 2.6, Kerberos Reporter: Bolke de Bruin The following query was executed using spark-sql --master yarn-client on 1.5.0-SNAPSHOT: select * from wcs.geolite_city limit 10; This led to the following error: 15/06/25 09:38:37 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, lxhnl008.ad.ing.net): java.lang.NullPointerException at org.apache.hadoop.conf.Configuration.init(Configuration.java:693) at org.apache.hadoop.mapred.JobConf.init(JobConf.java:442) at org.apache.hadoop.mapreduce.Job.init(Job.java:131) at org.apache.spark.sql.sources.SqlNewHadoopRDD.getJob(SqlNewHadoopRDD.scala:83) at org.apache.spark.sql.sources.SqlNewHadoopRDD.getConf(SqlNewHadoopRDD.scala:89) at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.init(SqlNewHadoopRDD.scala:127) at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124) at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) This does not happen in every case, i.e. some queries execute fine, and it is unclear why. Using just spark-sql, the query executes fine as well, and thus the issue seems to lie in the communication with Yarn. Also, the query executes fine (with yarn) in spark-shell. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7173) Support YARN node label expressions for the application master
[ https://issues.apache.org/jira/browse/SPARK-7173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576565#comment-14576565 ] Sandy Ryza commented on SPARK-7173: --- This requires additional work on top of SPARK-6470 Support YARN node label expressions for the application master -- Key: SPARK-7173 URL: https://issues.apache.org/jira/browse/SPARK-7173 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.3.1 Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8099) In yarn-cluster mode, --executor-cores can't be set in SparkConf
[ https://issues.apache.org/jira/browse/SPARK-8099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-8099: -- Assignee: meiyoula In yarn-cluster mode, --executor-cores can't be set in SparkConf --- Key: SPARK-8099 URL: https://issues.apache.org/jira/browse/SPARK-8099 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: meiyoula Assignee: meiyoula Fix For: 1.5.0 While testing the dynamic executor allocation function, I set the executor cores with *--executor-cores 4* in the spark-submit command. But in *ExecutorAllocationManager*, the value *private val tasksPerExecutor = conf.getInt(spark.executor.cores, 1) / conf.getInt(spark.task.cpus, 1)* still evaluates to 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8099) In yarn-cluster mode, --executor-cores can't be set in SparkConf
[ https://issues.apache.org/jira/browse/SPARK-8099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-8099: -- Assignee: (was: meiyoula) In yarn-cluster mode, --executor-cores can't be set in SparkConf --- Key: SPARK-8099 URL: https://issues.apache.org/jira/browse/SPARK-8099 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: meiyoula Fix For: 1.5.0 While testing the dynamic executor allocation function, I set the executor cores with *--executor-cores 4* in the spark-submit command. But in *ExecutorAllocationManager*, the value *private val tasksPerExecutor = conf.getInt(spark.executor.cores, 1) / conf.getInt(spark.task.cpus, 1)* still evaluates to 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8099) In yarn-cluster mode, --executor-cores can't be set in SparkConf
[ https://issues.apache.org/jira/browse/SPARK-8099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved SPARK-8099. --- Resolution: Fixed Fix Version/s: 1.5.0 Assignee: meiyoula In yarn-cluster mode, --executor-cores can't be set in SparkConf --- Key: SPARK-8099 URL: https://issues.apache.org/jira/browse/SPARK-8099 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: meiyoula Assignee: meiyoula Fix For: 1.5.0 While testing the dynamic executor allocation function, I set the executor cores with *--executor-cores 4* in the spark-submit command. But in *ExecutorAllocationManager*, the value *private val tasksPerExecutor = conf.getInt(spark.executor.cores, 1) / conf.getInt(spark.task.cpus, 1)* still evaluates to 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
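The computation quoted in the description, spelled out as a sketch (assuming a SparkConf in scope) to show why the lost flag matters:
{code}
import org.apache.spark.SparkConf

// If --executor-cores never lands in the SparkConf, the default of 1 is
// used and dynamic allocation underestimates how many tasks fit per executor.
val conf = new SparkConf()
val tasksPerExecutor =
  conf.getInt("spark.executor.cores", 1) / conf.getInt("spark.task.cpus", 1)
// With --executor-cores 4 dropped, this evaluates to 1 instead of 4.
{code}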
[jira] [Resolved] (SPARK-7699) Dynamic allocation: initial executors may be canceled before first job
[ https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved SPARK-7699. --- Resolution: Fixed Fix Version/s: 1.5.0 Dynamic allocation: initial executors may be canceled before first job -- Key: SPARK-7699 URL: https://issues.apache.org/jira/browse/SPARK-7699 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.4.0 Reporter: meiyoula Assignee: Saisai Shao Fix For: 1.5.0 spark.dynamicAllocation.minExecutors 2 spark.dynamicAllocation.initialExecutors 3 spark.dynamicAllocation.maxExecutors 4 Just run spark-shell with the above configurations; the initial executor number is 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8135) Don't load defaults when reconstituting Hadoop Configurations
[ https://issues.apache.org/jira/browse/SPARK-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-8135: -- Summary: Don't load defaults when reconstituting Hadoop Configurations (was: In SerializableWritable, don't load defaults when instantiating Configuration) Don't load defaults when reconstituting Hadoop Configurations - Key: SPARK-8135 URL: https://issues.apache.org/jira/browse/SPARK-8135 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Sandy Ryza Assignee: Sandy Ryza Calling new Configuration() is an expensive operation because it loads any Hadoop configuration XMLs from disk. In SerializableWritable, we call new Configuration needlessly when instantiating an ObjectWritable. The ObjectWritable only needs the Configuration for its class cache, not for any Hadoop properties that might be in XML files, so it should be ok to call new Configuration with loadDefaults = false. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8135) In SerializableWritable, don't load defaults when instantiating Configuration
[ https://issues.apache.org/jira/browse/SPARK-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14575397#comment-14575397 ] Sandy Ryza commented on SPARK-8135: --- CC [~joshrosen] In SerializableWritable, don't load defaults when instantiating Configuration - Key: SPARK-8135 URL: https://issues.apache.org/jira/browse/SPARK-8135 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Sandy Ryza Assignee: Sandy Ryza Calling new Configuration() is an expensive operation because it loads any Hadoop configuration XMLs from disk. In SerializableWritable, we call new Configuration needlessly when instantiating an ObjectWritable. The ObjectWritable only needs the Configuration for its class cache, not for any Hadoop properties that might be in XML files, so it should be ok to call new Configuration with loadDefaults = false. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8135) In SerializableWritable, don't load defaults when instantiating Configuration
Sandy Ryza created SPARK-8135: - Summary: In SerializableWritable, don't load defaults when instantiating Configuration Key: SPARK-8135 URL: https://issues.apache.org/jira/browse/SPARK-8135 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Sandy Ryza Assignee: Sandy Ryza Calling new Configuration() is an expensive operation because it loads any Hadoop configuration XMLs from disk. In SerializableWritable, we call new Configuration needlessly when instantiating an ObjectWritable. The ObjectWritable only needs the Configuration for its class cache, not for any Hadoop properties that might be in XML files, so it should be ok to call new Configuration with loadDefaults = false. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8135) In SerializableWritable, don't load defaults when instantiating Configuration
[ https://issues.apache.org/jira/browse/SPARK-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14575410#comment-14575410 ] Sandy Ryza commented on SPARK-8135: --- Your question made me think about the fact that, whether or not we go through ObjectWritable, we're still going to end up calling the default Configuration constructor when SerializableWritable wraps a Configuration. I think this is undesirable for the reasons outlined above. What do you think about adding a SerializableConfiguration class that instantiates a Configuration with loadDefaults = false? Regarding your question, I think we could get away with that. Is there an advantage though? I think it would probably be more code. In SerializableWritable, don't load defaults when instantiating Configuration - Key: SPARK-8135 URL: https://issues.apache.org/jira/browse/SPARK-8135 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Sandy Ryza Assignee: Sandy Ryza Calling new Configuration() is an expensive operation because it loads any Hadoop configuration XMLs from disk. In SerializableWritable, we call new Configuration needlessly when instantiating an ObjectWritable. The ObjectWritable only needs the Configuration for its class cache, not for any Hadoop properties that might be in XML files, so it should be ok to call new Configuration with loadDefaults = false. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
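A sketch of the proposed wrapper (the class name and shape follow the comment above; this is not necessarily the final patch): on deserialization it rebuilds the Configuration with loadDefaults = false, so no XML files are read from disk.
{code}
import java.io.{ObjectInputStream, ObjectOutputStream}
import org.apache.hadoop.conf.Configuration

class SerializableConfiguration(@transient var value: Configuration)
  extends Serializable {

  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    value.write(out)  // Configuration is a Writable; serialize its entries
  }

  private def readObject(in: ObjectInputStream): Unit = {
    value = new Configuration(false)  // loadDefaults = false: skip *-site.xml
    value.readFields(in)
  }
}
{code}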
[jira] [Commented] (SPARK-8135) Don't load defaults when reconstituting Hadoop Configurations
[ https://issues.apache.org/jira/browse/SPARK-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14575463#comment-14575463 ] Sandy Ryza commented on SPARK-8135: --- Cool. Updated the PR with SerializableConfiguration. Don't load defaults when reconstituting Hadoop Configurations - Key: SPARK-8135 URL: https://issues.apache.org/jira/browse/SPARK-8135 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Sandy Ryza Assignee: Sandy Ryza Calling new Configuration() is an expensive operation because it loads any Hadoop configuration XMLs from disk. In SerializableWritable, we call new Configuration needlessly when instantiating an ObjectWritable. The ObjectWritable only needs the Configuration for its class cache, not for any Hadoop properties that might be in XML files, so it should be ok to call new Configuration with loadDefaults = false. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8136) AM link download test can be flaky
[ https://issues.apache.org/jira/browse/SPARK-8136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-8136: -- Assignee: Hari Shreedharan AM link download test can be flaky -- Key: SPARK-8136 URL: https://issues.apache.org/jira/browse/SPARK-8136 Project: Spark Issue Type: Bug Reporter: Hari Shreedharan Assignee: Hari Shreedharan Sometimes YARN does not replace the link (or replaces it too soon) causing the YarnClusterSuite to fail. On a real cluster, the NM automatically redirects once the app is complete. So we should make the test less strict and have it only check the link's format rather than try to download the logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests
[ https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572245#comment-14572245 ] Sandy Ryza commented on SPARK-4352: --- I don't think it's abnormal. Consider joining the results of a shuffle (no locality preferences) with a small table on HDFS (has locality preferences). Is there any particular reason we would expect to ramp down quickly in this situation? Incorporate locality preferences in dynamic allocation requests --- Key: SPARK-4352 URL: https://issues.apache.org/jira/browse/SPARK-4352 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Saisai Shao Priority: Critical Attachments: Supportpreferrednodelocationindynamicallocation.pdf Currently, achieving data locality in Spark is difficult unless an application takes resources on every node in the cluster. preferredNodeLocalityData provides a sort of hacky workaround that has been broken since 1.0. With dynamic executor allocation, Spark requests executors in response to demand from the application. When this occurs, it would be useful to look at the pending tasks and communicate their location preferences to the cluster resource manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
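A hypothetical illustration of that scenario, assuming an existing SparkContext named sc and made-up paths: the shuffled side of the join carries no locality preferences, while the small HDFS-backed side does.
{code}
// The reduceByKey output has no locality preferences; the HDFS scan does.
val counts = sc.textFile("hdfs:///logs")
  .map(word => (word, 1L))
  .reduceByKey(_ + _)                 // shuffle output: no preferences
val small = sc.textFile("hdfs:///small_table")
  .map(line => (line, ()))            // HDFS blocks: locality preferences
val joined = counts.join(small)       // one stage with mixed preferences
{code}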
[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests
[ https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572356#comment-14572356 ] Sandy Ryza commented on SPARK-4352: --- Right, but once we have placed 5 executors on those nodes, the other executors can go anywhere (and it's probably preferable that they do in order to spread out our resource consumption more evenly). Incorporate locality preferences in dynamic allocation requests --- Key: SPARK-4352 URL: https://issues.apache.org/jira/browse/SPARK-4352 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Saisai Shao Priority: Critical Attachments: Supportpreferrednodelocationindynamicallocation.pdf Currently, achieving data locality in Spark is difficult unless an application takes resources on every node in the cluster. preferredNodeLocalityData provides a sort of hacky workaround that has been broken since 1.0. With dynamic executor allocation, Spark requests executors in response to demand from the application. When this occurs, it would be useful to look at the pending tasks and communicate their location preferences to the cluster resource manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests
[ https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572338#comment-14572338 ] Sandy Ryza commented on SPARK-4352: --- In your example, what would be the advantage of requesting 15 executors on a, b, d if we could only make use of 5? Incorporate locality preferences in dynamic allocation requests --- Key: SPARK-4352 URL: https://issues.apache.org/jira/browse/SPARK-4352 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Saisai Shao Priority: Critical Attachments: Supportpreferrednodelocationindynamicallocation.pdf Currently, achieving data locality in Spark is difficult unless an application takes resources on every node in the cluster. preferredNodeLocalityData provides a sort of hacky workaround that has been broken since 1.0. With dynamic executor allocation, Spark requests executors in response to demand from the application. When this occurs, it would be useful to look at the pending tasks and communicate their location preferences to the cluster resource manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8062) NullPointerException in SparkHadoopUtil.getFileSystemThreadStatistics
[ https://issues.apache.org/jira/browse/SPARK-8062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570388#comment-14570388 ] Sandy Ryza commented on SPARK-8062: --- [~joshrosen] nothing sticks out to me past what you've found. My other suspicion was that somehow the scheme from the path was coming in as null, but the implementation of Path#makeQualified seems like it should always set a scheme. Given that we've never come across it, it seems possible that it's related to something maprfs is doing? Defensive checks seem reasonable to me. NullPointerException in SparkHadoopUtil.getFileSystemThreadStatistics - Key: SPARK-8062 URL: https://issues.apache.org/jira/browse/SPARK-8062 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: MapR 4.0.1, Hadoop 2.4.1, Yarn Reporter: Josh Rosen I received the following error report from a user: While running a Spark Streaming job that reads from MapRfs and writes to HBase using Spark 1.2.1, the job intermittently experiences a total job failure due to the following errors: {code} 15/05/28 10:35:50 ERROR executor.Executor: Exception in task 1.1 in stage 6.0 (TID 24) java.lang.NullPointerException at org.apache.spark.deploy.SparkHadoopUtil$$anonfun$4.apply(SparkHadoopUtil.scala:178) at org.apache.spark.deploy.SparkHadoopUtil$$anonfun$4.apply(SparkHadoopUtil.scala:178) at scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at scala.collection.TraversableLike$class.filter(TraversableLike.scala:263) at scala.collection.AbstractTraversable.filter(Traversable.scala:105) at org.apache.spark.deploy.SparkHadoopUtil.getFileSystemThreadStatistics(SparkHadoopUtil.scala:178) at org.apache.spark.deploy.SparkHadoopUtil.getFSBytesReadOnThreadCallback(SparkHadoopUtil.scala:139) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:116) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) 15/05/28 10:35:50 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 25 
15/05/28 10:35:50 INFO executor.Executor: Running task 2.1 in stage 6.0 (TID 25) 15/05/28 10:35:50 INFO rdd.NewHadoopRDD: Input split: hdfs:/[REDACTED] 15/05/28 10:35:50 ERROR executor.Executor: Exception in task 2.1 in stage 6.0 (TID 25) java.lang.NullPointerException at org.apache.spark.deploy.SparkHadoopUtil$$anonfun$4.apply(SparkHadoopUtil.scala:178) at org.apache.spark.deploy.SparkHadoopUtil$$anonfun$4.apply(SparkHadoopUtil.scala:178) at scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) {code} Diving into the code here: The NPE is occurring on this line of SparkHadoopUtil (in 1.2.1.): https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L178 Here's that block of code from 1.2.1 (it's the same in 1.2.2): {code} private def getFileSystemThreadStatistics(path: Path, conf:
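A hedged sketch of the defensive check being discussed (not the actual Spark patch; names are illustrative): skip statistics entries whose scheme is null rather than calling equals on them.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.JavaConverters._

// Tolerate null schemes, which a nonstandard FileSystem implementation
// such as maprfs might conceivably report, instead of throwing an NPE.
def matchingStats(path: Path, conf: Configuration) = {
  val scheme = path.getFileSystem(conf).makeQualified(path).toUri.getScheme
  FileSystem.getAllStatistics.asScala
    .filter(s => s.getScheme != null && s.getScheme == scheme)
}
{code}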
[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests
[ https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569623#comment-14569623 ] Sandy Ryza commented on SPARK-4352: --- In the case where the task number = executor number * cores, I think my earlier argument still stands. Any executor requests beyond the ones needed to satisfy our preferences should be submitted without locality preferences. This means we will be less likely to bunch up requests on particular nodes where executors are not needed. Consider the extreme case where we want to request 100 executors but only have a single task with locality preferences, for data on 3 nodes. Going purely by the ratio approach, we would end up requesting all 100 executors on those three nodes. For the other cases, your approach makes sense to me. Incorporate locality preferences in dynamic allocation requests --- Key: SPARK-4352 URL: https://issues.apache.org/jira/browse/SPARK-4352 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Saisai Shao Priority: Critical Attachments: Supportpreferrednodelocationindynamicallocation.pdf Currently, achieving data locality in Spark is difficult unless an application takes resources on every node in the cluster. preferredNodeLocalityData provides a sort of hacky workaround that has been broken since 1.0. With dynamic executor allocation, Spark requests executors in response to demand from the application. When this occurs, it would be useful to look at the pending tasks and communicate their location preferences to the cluster resource manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests
[ https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567755#comment-14567755 ] Sandy Ryza commented on SPARK-4352: --- [~jerryshao] I wouldn't say that the goal is necessarily to get as close as possible to the ratio of requests (3 : 3 : 2 : 1 in the example). My idea was to get as close as possible to sum(cores from all executor requests with that node on their preferred list) = the number of tasks that prefer that node. Why? Let's look at the situation where we're requesting 18 executors. Let's say we request 6 executors with a preference for a, b, c, d like you suggested. YARN would be perfectly happy giving us 6 executors on node d. But we only have 10 tasks (with executors that have 2 cores, this means 5 executors) that need to run on node d. So we'd really prefer that the 6th executor be scheduled on a, b, or c, because placing it on d confers no additional advantage. For the situation where we're requesting 7 executors, I have less of an argument for why my 5 : 2 is better than your 2 : 2 : 3. Thinking about it more now, it seems like your approach could be closer to optimal because getting executors on a or b means more of our tasks get to run on local data. So I would certainly be open to something that tries to preserve the ratio when the number of executors we're allowed to request is under the maximum number of tasks targeted for any particular node. Incorporate locality preferences in dynamic allocation requests --- Key: SPARK-4352 URL: https://issues.apache.org/jira/browse/SPARK-4352 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Saisai Shao Priority: Critical Attachments: Supportpreferrednodelocationindynamicallocation.pdf Currently, achieving data locality in Spark is difficult unless an application takes resources on every node in the cluster. preferredNodeLocalityData provides a sort of hacky workaround that has been broken since 1.0. With dynamic executor allocation, Spark requests executors in response to demand from the application. When this occurs, it would be useful to look at the pending tasks and communicate their location preferences to the cluster resource manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests
[ https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568580#comment-14568580 ] Sandy Ryza commented on SPARK-4352: --- [~jerryshao] I think you're right that in the case where the number of executors is smaller than the maximum number of tasks with preferences on any particular node, using the ratio makes more sense. Incorporate locality preferences in dynamic allocation requests --- Key: SPARK-4352 URL: https://issues.apache.org/jira/browse/SPARK-4352 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Saisai Shao Priority: Critical Attachments: Supportpreferrednodelocationindynamicallocation.pdf Currently, achieving data locality in Spark is difficult unless an application takes resources on every node in the cluster. preferredNodeLocalityData provides a sort of hacky workaround that has been broken since 1.0. With dynamic executor allocation, Spark requests executors in response to demand from the application. When this occurs, it would be useful to look at the pending tasks and communicate their location preferences to the cluster resource manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7707) User guide and example code for Statistics.kernelDensity
[ https://issues.apache.org/jira/browse/SPARK-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14565638#comment-14565638 ] Sandy Ryza commented on SPARK-7707: --- Sorry for the delayed response here. I will try to get to this next week. User guide and example code for Statistics.kernelDensity Key: SPARK-7707 URL: https://issues.apache.org/jira/browse/SPARK-7707 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
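If it helps whoever picks this up, the guide entry could be as small as the following sketch, assuming the 1.4 MLlib API (a KernelDensity builder class; the JIRA title's Statistics.kernelDensity naming may differ from what shipped) and an existing RDD[Double] named sample:
{code}
import org.apache.spark.mllib.stat.KernelDensity

val kd = new KernelDensity()
  .setSample(sample)    // RDD[Double] to estimate the density from
  .setBandwidth(3.0)    // standard deviation of the Gaussian kernel
// Estimate the density function at a few query points.
val densities: Array[Double] = kd.estimate(Array(-1.0, 2.0, 5.0))
{code}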
[jira] [Commented] (SPARK-7896) IndexOutOfBoundsException in ChainedBuffer
[ https://issues.apache.org/jira/browse/SPARK-7896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561311#comment-14561311 ] Sandy Ryza commented on SPARK-7896: --- [~joshrosen] I'll take a look. IndexOutOfBoundsException in ChainedBuffer -- Key: SPARK-7896 URL: https://issues.apache.org/jira/browse/SPARK-7896 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Arun Ahuja Assignee: Sandy Ryza Priority: Blocker I've run into this on two tasks that use the same dataset. The dataset is a collection of strings where the most common string appears ~200M times and the next few appear ~50M times each. For this rdd: RDD[String], I can do rdd.map( x => (x, 1)).reduceByKey( _ + _) to get the counts (how I got the numbers above), but I hit the error on rdd.groupByKey(). Also, I have a second RDD of strings rdd2: RDD[String], and I cannot do rdd2.leftOuterJoin(rdd) without hitting this error: {code} 15/05/26 23:27:55 WARN scheduler.TaskSetManager: Lost task 3169.1 in stage 5.0 (TID 4843, demeter-csmaz10-19.demeter.hpc.mssm.edu): java.lang.IndexOutOfBoundsException: 512 at scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43) at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47) at org.apache.spark.util.collection.ChainedBuffer.write(ChainedBuffer.scala:110) at org.apache.spark.util.collection.ChainedBufferOutputStream.write(ChainedBuffer.scala:141) at com.esotericsoftware.kryo.io.Output.flush(Output.java:155) at org.apache.spark.serializer.KryoSerializationStream.flush(KryoSerializer.scala:147) at org.apache.spark.util.collection.PartitionedSerializedPairBuffer.insert(PartitionedSerializedPairBuffer.scala:78) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:219) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7896) IndexOutOfBoundsException in ChainedBuffer
[ https://issues.apache.org/jira/browse/SPARK-7896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561347#comment-14561347 ] Sandy Ryza commented on SPARK-7896: --- This must be because we're overflowing the 2 GB limit of the ChainedBuffer: 512 * 4 MB = 2 GB. I'll post a patch that uses long indices. Would it be easy for you to try it out today, [~arahuja]? IndexOutOfBoundsException in ChainedBuffer -- Key: SPARK-7896 URL: https://issues.apache.org/jira/browse/SPARK-7896 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Arun Ahuja Assignee: Sandy Ryza Priority: Blocker I've run into this on two tasks that use the same dataset. The dataset is a collection of strings where the most common string appears ~200M times and the next few appear ~50M times each. For this rdd: RDD[String], I can do rdd.map( x => (x, 1)).reduceByKey( _ + _) to get the counts (how I got the numbers above), but I hit the error on rdd.groupByKey(). Also, I have a second RDD of strings rdd2: RDD[String], and I cannot do rdd2.leftOuterJoin(rdd) without hitting this error: {code} 15/05/26 23:27:55 WARN scheduler.TaskSetManager: Lost task 3169.1 in stage 5.0 (TID 4843, demeter-csmaz10-19.demeter.hpc.mssm.edu): java.lang.IndexOutOfBoundsException: 512 at scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43) at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47) at org.apache.spark.util.collection.ChainedBuffer.write(ChainedBuffer.scala:110) at org.apache.spark.util.collection.ChainedBufferOutputStream.write(ChainedBuffer.scala:141) at com.esotericsoftware.kryo.io.Output.flush(Output.java:155) at org.apache.spark.serializer.KryoSerializationStream.flush(KryoSerializer.scala:147) at org.apache.spark.util.collection.PartitionedSerializedPairBuffer.insert(PartitionedSerializedPairBuffer.scala:78) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:219) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
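The arithmetic behind the diagnosis, as a tiny sketch: byte positions in the buffer are tracked with an Int, and 512 chunks of 4 MB is exactly where a signed 32-bit index runs out.
{code}
// Hypothetical illustration (the 4 MB chunk size comes from the comment above).
val chunkSize = 4 * 1024 * 1024              // 4 MB per backing ArrayBuffer slot
val intLimit  = Int.MaxValue.toLong + 1      // 2 GB
println(intLimit / chunkSize)                // 512 -- the index in the exception
val pos: Long = 512L * chunkSize             // fine once positions are Longs
{code}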
[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests
[ https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561656#comment-14561656 ] Sandy Ryza commented on SPARK-4352: --- I have a couple concerns about that approach. The first is that we can end up requesting many more executors on a node than we need. E.g. if only a single task is interested in running on node1, but many tasks are interested in running on node2, nothing will stop YARN from scheduling all requested executors on node1. Conversely, we can end up requesting far fewer executors on a node than we need. Once a single executor has been placed on a node, even if that executor only has a single core and there are many pending tasks interested in that node, we will stop trying to place executors on that node. I have an idea for a scheme that I think will better tailor requests to locality needs. It's definitely more complex, but I don't think it's more complex than what frameworks like MapReduce do, and I think it's an answer to a fundamentally complex problem. The idea is basically to submit executor requests such that, for each node, sum(cores from all executor requests with that node on their preferred list) = the number of tasks that prefer that node. As an example, let's imagine our executors have 2 cores each, we have 4 nodes named a through d, and 30 pending tasks. The first 20 of these pending tasks have locality preferences for nodes a, b, and c, and the other 10 prefer nodes a, b, and d. Meaning that the number of tasks desired on each node is a: 30, b: 30, c: 20, and d: 10. If the number of executors we want to request is exactly 15, then we would submit the following requests: requests for 5 executors with nodes = a, b, c, d requests for 5 executors with nodes = a, b, c requests for 5 executors with nodes = a, b If the number of executors we want to request is 7, then we would submit the following requests: requests for 5 executors with nodes = a, b, c, d requests for 2 executors with nodes = a, b, c If the number of executors we want to request is 18, then we would submit the following requests: requests for 5 executors with nodes = a, b, c, d requests for 5 executors with nodes = a, b, c requests for 5 executors with nodes = a, b requests for 3 executors with no locality preferences We might want to augment this to account for the fact that some nodes have executors already. What do you think? Incorporate locality preferences in dynamic allocation requests --- Key: SPARK-4352 URL: https://issues.apache.org/jira/browse/SPARK-4352 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Saisai Shao Priority: Critical Attachments: Supportpreferrednodelocationindynamicallocation.pdf Currently, achieving data locality in Spark is difficult unless an application takes resources on every node in the cluster. preferredNodeLocalityData provides a sort of hacky workaround that has been broken since 1.0. With dynamic executor allocation, Spark requests executors in response to demand from the application. When this occurs, it would be useful to look at the pending tasks and communicate their location preferences to the cluster resource manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
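A sketch of the layered decomposition described above (one reading of the proposal, not Spark code): repeatedly peel off the thinnest remaining per-node need and emit that many requests naming every node that still needs executors.
{code}
// tasksPerNode: how many pending tasks prefer each node.
// Returns (numExecutors, preferredNodes) request groups.
def layeredRequests(
    tasksPerNode: Map[String, Int],
    coresPerExecutor: Int): Seq[(Int, Set[String])] = {
  // Executors still needed per node, rounding tasks up to whole executors.
  var remaining = tasksPerNode.mapValues { t =>
    (t + coresPerExecutor - 1) / coresPerExecutor
  }.toMap
  val requests = Seq.newBuilder[(Int, Set[String])]
  while (remaining.nonEmpty) {
    val layer = remaining.values.min            // thinnest remaining need
    requests += ((layer, remaining.keySet))     // layer executors across all
    remaining = remaining.mapValues(_ - layer).toMap.filter(_._2 > 0)
  }
  requests.result()
}

// The example above: 2-core executors and tasks a:30, b:30, c:20, d:10 yield
// (5, {a,b,c,d}), (5, {a,b,c}), (5, {a,b}).
{code}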
[jira] [Commented] (SPARK-7699) Number of executors can be reduced from initial before work is scheduled
[ https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560227#comment-14560227 ] Sandy Ryza commented on SPARK-7699: --- I think tying this to the AM-RM heartbeat would just make things more confusing, especially now that the heartbeat interval is variable. Whether we've had a fair chance to allocate resources also depends on internal YARN configurations, like the NM-RM heartbeat interval or whether continuous scheduling is enabled. I don't think there's any easy notion of a fair chance that doesn't rely on a timeout. Another option would be to avoid adjusting targetNumExecutors down before the first job is submitted. Number of executors can be reduced from initial before work is scheduled Key: SPARK-7699 URL: https://issues.apache.org/jira/browse/SPARK-7699 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: meiyoula Priority: Minor spark.dynamicAllocation.minExecutors 2 spark.dynamicAllocation.initialExecutors 3 spark.dynamicAllocation.maxExecutors 4 Just run spark-shell with the above configurations; the initial executor number is 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7699) Number of executors can be reduced from initial before work is scheduled
[ https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558748#comment-14558748 ] Sandy Ryza commented on SPARK-7699: --- We can't wait only on the initial allocation being made because YARN might not be able to fully satisfy it in any finite amount of time. Number of executors can be reduced from initial before work is scheduled Key: SPARK-7699 URL: https://issues.apache.org/jira/browse/SPARK-7699 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: meiyoula Priority: Minor spark.dynamicAllocation.minExecutors 2 spark.dynamicAllocation.initialExecutors 3 spark.dynamicAllocation.maxExecutors 4 Just run spark-shell with the above configurations; the initial executor number is 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7699) Number of executors can be reduced from initial before work is scheduled
[ https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558757#comment-14558757 ] Sandy Ryza commented on SPARK-7699: --- I think delaying releasing them is exactly the point of the property. If we don't want to do that, what's it there for? Number of executors can be reduced from initial before work is scheduled Key: SPARK-7699 URL: https://issues.apache.org/jira/browse/SPARK-7699 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: meiyoula Priority: Minor spark.dynamicAllocation.minExecutors 2 spark.dynamicAllocation.initialExecutors 3 spark.dynamicAllocation.maxExecutors 4 Just run spark-shell with the above configurations; the initial executor number is 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests
[ https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-4352: -- Assignee: Saisai Shao Incorporate locality preferences in dynamic allocation requests --- Key: SPARK-4352 URL: https://issues.apache.org/jira/browse/SPARK-4352 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Saisai Shao Priority: Critical Attachments: Supportpreferrednodelocationindynamicallocation.pdf Currently, achieving data locality in Spark is difficult unless an application takes resources on every node in the cluster. preferredNodeLocalityData provides a sort of hacky workaround that has been broken since 1.0. With dynamic executor allocation, Spark requests executors in response to demand from the application. When this occurs, it would be useful to look at the pending tasks and communicate their location preferences to the cluster resource manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7699) Config spark.dynamicAllocation.initialExecutors has no effect
[ https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557911#comment-14557911 ] Sandy Ryza edited comment on SPARK-7699 at 5/25/15 1:26 AM: [~sowen] I think the possible flaw in your argument is that it relies on initial load being defined in some reasonable way. I.e. I think the worry is that the following can happen: * initial = 3 and min = 1 * cluster is large and uncontended * first line of user code is a job submission that can make use of at least 3 * because the executor allocation thread starts immediately, requested executors ramps down to 1 before the user code has a chance to submit the job Which is to say: what guarantees do we provide about initialExecutors other than that it's the number of executor requests we have before some opaque internal thing happens to adjust it down? One possible such guarantee we could provide is that we won't adjust down for some fixed number of seconds after the SparkContext starts. was (Author: sandyr): [~sowen] I think the possible flaw in your argument is that it relies on initial load being defined in some reasonable. I.e. I think the worry is that the following can happen: * initial = 3 and min = 1 * cluster is large and uncontended * first line of user code is a job submission that can make use of at least 3 * because the executor allocation thread starts immediately, requested executors ramps down to 1 before the user code has a chance to submit the job Which is to say: what guarantees do we provide about initialExecutors other than that it's the number of executors requests we have before some opaque internal thing happens to adjust it down? One possible such guarantee we could provide is that we won't adjust down for some fixed number of seconds after the SparkContext starts. Config spark.dynamicAllocation.initialExecutors has no effect Key: SPARK-7699 URL: https://issues.apache.org/jira/browse/SPARK-7699 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula spark.dynamicAllocation.minExecutors 2 spark.dynamicAllocation.initialExecutors 3 spark.dynamicAllocation.maxExecutors 4 Just run spark-shell with the above configurations; the initial executor number is 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
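One way to make the "fixed number of seconds" guarantee concrete; a hypothetical sketch with invented names, not an existing Spark API:
{code}
// The allocation manager's lower bound stays at initialExecutors for a
// grace period after startup and only then falls to minExecutors.
class InitialExecutorFloor(initial: Int, min: Int, graceMs: Long) {
  private val startMs = System.currentTimeMillis()
  def currentFloor(): Int =
    if (System.currentTimeMillis() - startMs < graceMs) math.max(initial, min)
    else min
}
{code}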
[jira] [Commented] (SPARK-7699) Config spark.dynamicAllocation.initialExecutors has no effect
[ https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557911#comment-14557911 ] Sandy Ryza commented on SPARK-7699: --- [~sowen] I think the possible flaw in your argument is that it relies on initial load being defined in some reasonable way. I.e. I think the worry is that the following can happen: * initial = 3 and min = 1 * cluster is large and uncontended * first line of user code is a job submission that can make use of at least 3 * because the executor allocation thread starts immediately, the number of requested executors ramps down to 1 before the user code has a chance to submit the job Which is to say: what guarantees do we provide about initialExecutors other than that it's the number of executor requests we have before some opaque internal thing happens to adjust it down? One possible such guarantee we could provide is that we won't adjust down for some fixed number of seconds after the SparkContext starts. Config spark.dynamicAllocation.initialExecutors has no effect Key: SPARK-7699 URL: https://issues.apache.org/jira/browse/SPARK-7699 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula spark.dynamicAllocation.minExecutors 2 spark.dynamicAllocation.initialExecutors 3 spark.dynamicAllocation.maxExecutors 4 Just run the spark-shell with above configurations, the initial executor number is 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7699) Config spark.dynamicAllocation.initialExecutors has no effect
[ https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14555837#comment-14555837 ] Sandy Ryza commented on SPARK-7699: --- Sorry for the delay here. The desired behavior is to never have outstanding requests for more than the number of executors we'd need to satisfy all current tasks (unless minExecutors is set). I.e. we don't want to ramp down gradually. So I think the concerns here are valid - this policy means that as soon as the dynamic allocation thread starts being active, it will cancel any container requests that were made as a result of initialExecutors. If, however, these executor requests were actually fulfilled, dynamic allocation wouldn't throw away the executors. This means that the relevant questions are: is there a window of time after we've requested the initial executors but before the dynamic allocation thread starts? Is there something fundamental about this window of time that means it will probably still be there after future scheduling optimizations? If not, do we want to make sure that ExecutorAllocationManager itself doesn't ramp down below initialExecutors until some criterion (probably time) is satisfied? Or should we just scrap the property? Config spark.dynamicAllocation.initialExecutors has no effect Key: SPARK-7699 URL: https://issues.apache.org/jira/browse/SPARK-7699 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula spark.dynamicAllocation.minExecutors 2 spark.dynamicAllocation.initialExecutors 3 spark.dynamicAllocation.maxExecutors 4 Just run the spark-shell with above configurations, the initial executor number is 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
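To make the grace-period guarantee discussed above concrete, here is a minimal sketch of a time-based floor on the allocation target. All names and values are illustrative, not actual ExecutorAllocationManager members:

{code}
object AllocationFloorSketch {
  // Values mirroring the configs in this JIRA.
  val minExecutors = 1
  val initialExecutors = 3
  val maxExecutors = 4
  val rampDownGracePeriodMs = 60 * 1000L
  val startTimeMs = System.currentTimeMillis()

  // Hold the target at initialExecutors until the grace period after
  // startup has elapsed; always respect the min/max bounds.
  def boundedTarget(demandBasedTarget: Int, nowMs: Long): Int = {
    val floor =
      if (nowMs - startTimeMs < rampDownGracePeriodMs) math.max(minExecutors, initialExecutors)
      else minExecutors
    math.min(math.max(demandBasedTarget, floor), maxExecutors)
  }
}
{code}

Under this sketch, the race described above disappears: even if the allocation thread fires before the first job is submitted, the target can't drop below 3 until the grace period expires.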
[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests
[ https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14551898#comment-14551898 ] Sandy Ryza commented on SPARK-4352: --- I don't think we should kill executors in order to achieve locality. But I think the question of how we translate task preferences into new executor requests is non-trivial, and the way that it was previously done with generateNodeToWeight isn't necessarily the best way (but could be). I don't think there's an obvious solution, but it would be good to lay out the options and their pros and cons. Incorporate locality preferences in dynamic allocation requests --- Key: SPARK-4352 URL: https://issues.apache.org/jira/browse/SPARK-4352 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Critical Currently, achieving data locality in Spark is difficult unless an application takes resources on every node in the cluster. preferredNodeLocalityData provides a sort of hacky workaround that has been broken since 1.0. With dynamic executor allocation, Spark requests executors in response to demand from the application. When this occurs, it would be useful to look at the pending tasks and communicate their location preferences to the cluster resource manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests
[ https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14551677#comment-14551677 ] Sandy Ryza commented on SPARK-4352: --- Thanks for posting this Saisai. Can you export and attach it as a PDF so that we've got an immutable copy for posterity? This mostly looks good to me. Regarding a couple of your open questions: 1. I think the best option would be to modify requestTotalExecutors (which is private) and avoid any public API additions for now. 2. Task locality preferences are computed already, right? Can we put any computation beyond that in ExecutorAllocationManager so that it only happens when dynamic allocation is turned on? I think another big question is how we translate task locality preferences into requests to YARN. This impacts what the preferredNodeLocalityData should look like. At any moment, we have a set of pending tasks, each with a set of node preferences and rack preferences, as well as a number of desired executors. How do we account for the fact that some nodes have more pending tasks than others? What happens when we're working with executors with 5 cores, but none of the tasks share nodes that they want? What happens when the number of pending tasks exceeds the total capacity of the desired number of executors? One approach would be to request executors at every location that we have pending tasks, and then return executors once we've reached the number that we need. Another would be to condense down our preferences into an optimal number of executor requests. Incorporate locality preferences in dynamic allocation requests --- Key: SPARK-4352 URL: https://issues.apache.org/jira/browse/SPARK-4352 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Critical Currently, achieving data locality in Spark is difficult unless an application takes resources on every node in the cluster. preferredNodeLocalityData provides a sort of hacky workaround that has been broken since 1.0. With dynamic executor allocation, Spark requests executors in response to demand from the application. When this occurs, it would be useful to look at the pending tasks and communicate their location preferences to the cluster resource manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
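As a strawman for the "condense down our preferences" option, a sketch of collapsing pending-task node preferences into per-node executor request counts. Everything here (the names, the Seq-of-preferences input shape) is illustrative rather than actual Spark internals:

{code}
object LocalityCondenseSketch {
  // Each inner Seq is one pending task's preferred nodes.
  def requestsPerNode(
      pendingTaskPrefs: Seq[Seq[String]],
      coresPerExecutor: Int): Map[String, Int] = {
    val tasksPerNode: Map[String, Int] =
      pendingTaskPrefs.flatten.groupBy(identity).map { case (node, ts) => (node, ts.size) }
    // Round up: 7 pending tasks on a node with 5-core executors -> 2 requests.
    tasksPerNode.map { case (node, n) => (node, (n + coresPerExecutor - 1) / coresPerExecutor) }
  }
}
{code}

This also makes the over-request question above visible: summing these per-node counts can exceed the total number of executors we actually want, which is exactly where the "return executors once we've reached the number that we need" alternative comes in.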
[jira] [Commented] (SPARK-7579) User guide update for OneHotEncoder
[ https://issues.apache.org/jira/browse/SPARK-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541517#comment-14541517 ] Sandy Ryza commented on SPARK-7579: --- Ah. I was actually referring to the examples in the ml-guide.md doc. Didn't notice the ml-features.md doc. Seems like it should be pretty straightforward to add something there. User guide update for OneHotEncoder --- Key: SPARK-7579 URL: https://issues.apache.org/jira/browse/SPARK-7579 Project: Spark Issue Type: Documentation Components: Documentation, ML Reporter: Joseph K. Bradley Assignee: Sandy Ryza Copied from [SPARK-7443]: {quote} Now that we have algorithms in spark.ml which are not in spark.mllib, we should start making subsections for the spark.ml API as needed. We can follow the structure of the spark.mllib user guide. * The spark.ml user guide can provide: (a) code examples and (b) info on algorithms which do not exist in spark.mllib. * We should not duplicate info in the spark.ml guides. Since spark.mllib is still the primary API, we should provide links to the corresponding algorithms in the spark.mllib user guide for more info. {quote} Note: I created a new subsection for links to spark.ml-specific guides in this JIRA's PR: [SPARK-7557]. This transformer can go within the new subsection. I'll try to get that PR merged ASAP. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
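For reference, the kind of snippet ml-features.md could carry. This is a sketch assuming a DataFrame {{df}} with a string column "category"; the API calls are the spark.ml StringIndexer/OneHotEncoder ones discussed elsewhere in this thread:

{code}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// Map the string column to category indices, then one-hot encode them.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
val indexed = indexer.fit(df).transform(df)

val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)
{code}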
[jira] [Commented] (SPARK-5888) Add OneHotEncoder as a Transformer
[ https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541524#comment-14541524 ] Sandy Ryza commented on SPARK-5888: --- [~mengxr] that makes sense to me, but does that address the issue I brought up above? Add OneHotEncoder as a Transformer -- Key: SPARK-5888 URL: https://issues.apache.org/jira/browse/SPARK-5888 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng Assignee: Sandy Ryza Fix For: 1.4.0 `OneHotEncoder` takes a categorical column and output a vector column, which stores the category info in binaries. {code} val ohe = new OneHotEncoder() .setInputCol(countryIndex) .setOutputCol(countries) {code} It should read the category info from the metadata and assign feature names properly in the output column. We need to discuss the default naming scheme and whether we should let it process multiple categorical columns at the same time. One category (the most frequent one) should be removed from the output to make the output columns linear independent. Or this could be an option tuned on by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5888) Add OneHotEncoder as a Transformer
[ https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541339#comment-14541339 ] Sandy Ryza commented on SPARK-5888: --- Hi [~hvanhovell], I agree that this should work. [~mengxr], any thoughts on the best way to solve this? Add OneHotEncoder as a Transformer -- Key: SPARK-5888 URL: https://issues.apache.org/jira/browse/SPARK-5888 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng Assignee: Sandy Ryza Fix For: 1.4.0 `OneHotEncoder` takes a categorical column and output a vector column, which stores the category info in binaries. {code} val ohe = new OneHotEncoder() .setInputCol(countryIndex) .setOutputCol(countries) {code} It should read the category info from the metadata and assign feature names properly in the output column. We need to discuss the default naming scheme and whether we should let it process multiple categorical columns at the same time. One category (the most frequent one) should be removed from the output to make the output columns linear independent. Or this could be an option tuned on by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5888) Add OneHotEncoder as a Transformer
[ https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541367#comment-14541367 ] Sandy Ryza commented on SPARK-5888: --- Right, but while the values are unknown at first, they will become known at some point during the execution (after StringIndexer.fit completes). So it seems like at some point it would be good to pass these values down. Put another way, it seems bad to me that the user should see different behavior with regard to attribute values if they use OneHotEncoder in the Pipeline way, as described by Herman above, vs in the slightly more verbose way where StringIndexer.transform is explicitly called first: https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderSuite.scala#L50. I'm relatively unfamiliar with these APIs, so apologies if I'm not making sense. Add OneHotEncoder as a Transformer -- Key: SPARK-5888 URL: https://issues.apache.org/jira/browse/SPARK-5888 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng Assignee: Sandy Ryza Fix For: 1.4.0 `OneHotEncoder` takes a categorical column and output a vector column, which stores the category info in binaries. {code} val ohe = new OneHotEncoder() .setInputCol(countryIndex) .setOutputCol(countries) {code} It should read the category info from the metadata and assign feature names properly in the output column. We need to discuss the default naming scheme and whether we should let it process multiple categorical columns at the same time. One category (the most frequent one) should be removed from the output to make the output columns linear independent. Or this could be an option tuned on by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
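To spell out the two usage paths being compared above ({{df}} is a hypothetical DataFrame with a string column "country"):

{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val indexer = new StringIndexer().setInputCol("country").setOutputCol("countryIndex")
val encoder = new OneHotEncoder().setInputCol("countryIndex").setOutputCol("countryVec")

// Verbose path: StringIndexer.fit runs first, so the indexed column carries
// the fitted attribute values by the time the encoder sees it.
val encodedExplicit = encoder.transform(indexer.fit(df).transform(df))

// Pipeline path: whether the encoder sees those same attribute values
// depends on how metadata propagates between stages.
val encodedPipeline =
  new Pipeline().setStages(Array(indexer, encoder)).fit(df).transform(df)
{code}

The concern above is that these two paths should agree on the output column's attribute values.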
[jira] [Commented] (SPARK-7579) User guide update for OneHotEncoder
[ https://issues.apache.org/jira/browse/SPARK-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541356#comment-14541356 ] Sandy Ryza commented on SPARK-7579: --- I can take this up. Any thoughts on how it should be incorporated? E.g. as an example a la "Example: Model Selection via Cross-Validation"? User guide update for OneHotEncoder --- Key: SPARK-7579 URL: https://issues.apache.org/jira/browse/SPARK-7579 Project: Spark Issue Type: Documentation Components: Documentation, ML Reporter: Joseph K. Bradley Assignee: Sandy Ryza Copied from [SPARK-7443]: {quote} Now that we have algorithms in spark.ml which are not in spark.mllib, we should start making subsections for the spark.ml API as needed. We can follow the structure of the spark.mllib user guide. * The spark.ml user guide can provide: (a) code examples and (b) info on algorithms which do not exist in spark.mllib. * We should not duplicate info in the spark.ml guides. Since spark.mllib is still the primary API, we should provide links to the corresponding algorithms in the spark.mllib user guide for more info. {quote} Note: I created a new subsection for links to spark.ml-specific guides in this JIRA's PR: [SPARK-7557]. This transformer can go within the new subsection. I'll try to get that PR merged ASAP. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5888) Add OneHotEncoder as a Transformer
[ https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541351#comment-14541351 ] Sandy Ryza commented on SPARK-5888: --- The values of the nominal output attribute should be based on those of the input attribute. Are you saying that there is a way to propagate these at a later time? Add OneHotEncoder as a Transformer -- Key: SPARK-5888 URL: https://issues.apache.org/jira/browse/SPARK-5888 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng Assignee: Sandy Ryza Fix For: 1.4.0 `OneHotEncoder` takes a categorical column and output a vector column, which stores the category info in binaries. {code} val ohe = new OneHotEncoder() .setInputCol(countryIndex) .setOutputCol(countries) {code} It should read the category info from the metadata and assign feature names properly in the output column. We need to discuss the default naming scheme and whether we should let it process multiple categorical columns at the same time. One category (the most frequent one) should be removed from the output to make the output columns linear independent. Or this could be an option tuned on by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7515) Update documentation for PySpark on YARN with cluster mode
[ https://issues.apache.org/jira/browse/SPARK-7515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved SPARK-7515. --- Resolution: Fixed Fix Version/s: 1.5.0 Target Version/s: (was: 1.4.0) Update documentation for PySpark on YARN with cluster mode -- Key: SPARK-7515 URL: https://issues.apache.org/jira/browse/SPARK-7515 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Priority: Minor Fix For: 1.5.0 Now that PySpark on YARN with cluster mode is supported, let's update the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7515) Update documentation for PySpark on YARN with cluster mode
[ https://issues.apache.org/jira/browse/SPARK-7515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-7515: -- Assignee: Kousuke Saruta Update documentation for PySpark on YARN with cluster mode -- Key: SPARK-7515 URL: https://issues.apache.org/jira/browse/SPARK-7515 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Priority: Minor Now that PySpark on YARN with cluster mode is supported, let's update the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7410) Add option to avoid broadcasting configuration with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-7410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14538784#comment-14538784 ] Sandy Ryza commented on SPARK-7410: --- Thanks for the pointer, [~joshrosen]. I looked over that JIRA, and it's hard to understand what's going on at first glance. Are you saying it's a performance issue or a correctness issue? Or both? I can look deeper if you don't remember. Regarding performance, broadcasting the conf is probably faster in the majority of cases. But in cases where the RDD has only a couple partitions, the reverse is true. So what I'm advocating for is the ability to turn broadcasting off in the latter case. Add option to avoid broadcasting configuration with newAPIHadoopFile Key: SPARK-7410 URL: https://issues.apache.org/jira/browse/SPARK-7410 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.4.0 Reporter: Sandy Ryza I'm working with a Spark application that creates thousands of HadoopRDDs and unions them together. Certain details of the way the data is stored require this. Creating ten thousand of these RDDs takes about 10 minutes, even before any of them is used in an action. I dug into why this takes so long and it looks like the overhead of broadcasting the Hadoop configuration is taking up most of the time. In this case, the broadcasting isn't helpful because each HadoopRDD only corresponds to one or two tasks. When I reverted the original change that switched to broadcasting configurations, the time it took to instantiate these RDDs improved 10x. It would be nice if there was a way to turn this broadcasting off. Either through a Spark configuration option, a Hadoop configuration option, or an argument to hadoopFile / newAPIHadoopFile. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7533) Decrease spacing between AM-RM heartbeats.
Sandy Ryza created SPARK-7533: - Summary: Decrease spacing between AM-RM heartbeats. Key: SPARK-7533 URL: https://issues.apache.org/jira/browse/SPARK-7533 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.3.1 Reporter: Sandy Ryza The current default of spark.yarn.scheduler.heartbeat.interval-ms is 5 seconds. This is really long. For reference, the MR equivalent is 1 second. To avoid noise and unnecessary communication, we could have a fast rate for when we're waiting for executors and a slow rate for when we're just heartbeating. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
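A sketch of the two-rate idea, with illustrative names and intervals (these are not actual Spark configs or internals):

{code}
object HeartbeatIntervalSketch {
  val fastIntervalMs = 200L   // while container requests are outstanding
  val slowIntervalMs = 5000L  // steady state: just keeping the AM-RM link alive

  def nextHeartbeatDelay(pendingContainerRequests: Int): Long =
    if (pendingContainerRequests > 0) fastIntervalMs else slowIntervalMs
}
{code}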
[jira] [Created] (SPARK-7410) Add option to avoid broadcasting configuration with newAPIHadoopFile
Sandy Ryza created SPARK-7410: - Summary: Add option to avoid broadcasting configuration with newAPIHadoopFile Key: SPARK-7410 URL: https://issues.apache.org/jira/browse/SPARK-7410 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.4.0 Reporter: Sandy Ryza I'm working with a Spark application that creates thousands of HadoopRDDs and unions them together. Certain details of the way the data is stored require this. Creating ten thousand of these RDDs takes about 10 minutes, even before any of them is used in an action. I dug into why this takes so long and it looks like the overhead of broadcasting the Hadoop configuration is taking up most of the time. In this case, the broadcasting isn't helpful because each HadoopRDD only corresponds to one or two tasks. When I reverted the original change that switched to broadcasting configurations, the time it took to instantiate these RDDs improved 10x. It would be nice if there was a way to turn this broadcasting off. Either through a Spark configuration option, a Hadoop configuration option, or an argument to hadoopFile / newAPIHadoopFile. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
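The shape of the workload in question, assuming an existing SparkContext {{sc}} and a {{Seq[String]}} of input paths (both assumptions, not part of the report). The point is that each RDD covers only one or two tasks, so a per-RDD broadcast of the configuration buys nothing:

{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Thousands of small inputs, each becoming its own RDD of one or two tasks.
val rdds = paths.map { path =>
  sc.newAPIHadoopFile(path, classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text])
}
val all = sc.union(rdds)
{code}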
[jira] [Resolved] (SPARK-4550) In sort-based shuffle, store map outputs in serialized form
[ https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved SPARK-4550. --- Resolution: Fixed Fix Version/s: 1.4.0 In sort-based shuffle, store map outputs in serialized form --- Key: SPARK-4550 URL: https://issues.apache.org/jira/browse/SPARK-4550 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Priority: Critical Fix For: 1.4.0 Attachments: SPARK-4550-design-v1.pdf, kryo-flush-benchmark.scala One drawback with sort-based shuffle compared to hash-based shuffle is that it ends up storing many more java objects in memory. If Spark could store map outputs in serialized form, it could * spill less often because the serialized form is more compact * reduce GC pressure This will only work when the serialized representations of objects are independent from each other and occupy contiguous segments of memory. E.g. when Kryo reference tracking is left on, objects may contain pointers to objects farther back in the stream, which means that the sort can't relocate objects without corrupting them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7311) Enable in-memory serialized map-side shuffle to work with SQL serializers
Sandy Ryza created SPARK-7311: - Summary: Enable in-memory serialized map-side shuffle to work with SQL serializers Key: SPARK-7311 URL: https://issues.apache.org/jira/browse/SPARK-7311 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Affects Versions: 1.4.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Right now it only works with KryoSerializer, but it would be useful to make it work with any Serializer that writes objects in a relocatable / self-contained fashion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
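The property such a serializer needs can be phrased as: concatenating independently serialized segments must equal serializing the same objects through one stream. A property-check sketch (not an actual Spark API):

{code}
import java.io.ByteArrayOutputStream
import org.apache.spark.serializer.SerializerInstance

def isRelocatable(ser: SerializerInstance, a: AnyRef, b: AnyRef): Boolean = {
  def bytes(objs: AnyRef*): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val stream = ser.serializeStream(bos)
    objs.foreach(stream.writeObject(_))
    stream.close()
    bos.toByteArray
  }
  // Holds iff records are self-contained and can be reordered freely.
  java.util.Arrays.equals(bytes(a, b), bytes(a) ++ bytes(b))
}
{code}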
[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518286#comment-14518286 ] Sandy Ryza commented on SPARK-3655: --- My opinion is that a secondary sort operator in core Spark would definitely be useful. Support sorting of values in addition to keys (i.e. secondary sort) --- Key: SPARK-3655 URL: https://issues.apache.org/jira/browse/SPARK-3655 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: koert kuipers Assignee: Koert Kuipers Now that spark has a sort based shuffle, can we expect a secondary sort soon? There are some use cases where getting a sorted iterator of values per key is helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
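The building blocks exist today via repartitionAndSortWithinPartitions: partition on the primary key alone but sort on the full composite key. A sketch over a hypothetical {{events: RDD[((String, Long), String)]}} keyed by (user, timestamp):

{code}
import org.apache.spark.{HashPartitioner, Partitioner}

// Route records by the primary key only; the sort then orders each
// partition by (primary, secondary), so values arrive grouped by user
// and ordered by timestamp.
class PrimaryKeyPartitioner(partitions: Int) extends Partitioner {
  private val delegate = new HashPartitioner(partitions)
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = key match {
    case (primary, _) => delegate.getPartition(primary)
  }
}

val sorted = events.repartitionAndSortWithinPartitions(new PrimaryKeyPartitioner(8))
{code}

A first-class secondary sort operator would essentially wrap this pattern behind a friendlier signature.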
[jira] [Created] (SPARK-7173) Support YARN node label expressions for the application master
Sandy Ryza created SPARK-7173: - Summary: Support YARN node label expressions for the application master Key: SPARK-7173 URL: https://issues.apache.org/jira/browse/SPARK-7173 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.3.1 Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6954) ExecutorAllocationManager can end up requesting a negative number of executors
[ https://issues.apache.org/jira/browse/SPARK-6954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-6954: -- Summary: ExecutorAllocationManager can end up requesting a negative number of executors (was: Dynamic allocation: numExecutorsPending in ExecutorAllocationManager should never become negative) ExecutorAllocationManager can end up requesting a negative number of executors -- Key: SPARK-6954 URL: https://issues.apache.org/jira/browse/SPARK-6954 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.1 Reporter: Cheolsoo Park Assignee: Cheolsoo Park Labels: yarn Attachments: with_fix.png, without_fix.png I have a simple test case for dynamic allocation on YARN that fails with the following stack trace- {code} 15/04/16 00:52:14 ERROR Utils: Uncaught exception in thread spark-dynamic-executor-allocation-0 java.lang.IllegalArgumentException: Attempted to request a negative number of executor(s) -21 from the cluster manager. Please specify a positive number! at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:338) at org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1137) at org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:294) at org.apache.spark.ExecutorAllocationManager.addOrCancelExecutorRequests(ExecutorAllocationManager.scala:263) at org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:230) at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorAllocationManager.scala:189) at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189) at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618) at org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:189) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} My test is as follows- # Start spark-shell with a single executor. # Run a {{select count(\*)}} query. The number of executors rises as input size is non-trivial. # After the job finishes, the number of executors falls as most of them become idle. # Rerun the same query again, and the request to add executors fails with the above error. In fact, the job itself continues to run with whatever executors it already has, but it never gets more executors unless the shell is closed and restarted. In fact, this error only happens when I configure {{executorIdleTimeout}} very small. 
For example, I can reproduce it with the following configs- {code} spark.dynamicAllocation.executorIdleTimeout 5 spark.dynamicAllocation.schedulerBacklogTimeout 5 {code} Although I can simply increase {{executorIdleTimeout}} to something like 60 secs to avoid the error, I think this is still a bug to be fixed. The root cause seems to be that {{numExecutorsPending}} accidentally becomes negative if executors are killed too aggressively (i.e. {{executorIdleTimeout}} is too small) because under that circumstance, the new target # of executors can be smaller than the current # of executors. When that happens, {{ExecutorAllocationManager}} ends up trying to add a negative number of executors, which throws an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
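The invariant the fix amounts to, in sketch form (illustrative names, not the actual patch): never turn a target below the current outstanding count into a negative request.

{code}
// Clamp the delta at zero; idle-timeout kills handle the ramp-down side.
def executorsToAdd(target: Int, existing: Int, pending: Int): Int =
  math.max(0, target - (existing + pending))
{code}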
[jira] [Resolved] (SPARK-6891) ExecutorAllocationManager will request negative number executors
[ https://issues.apache.org/jira/browse/SPARK-6891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved SPARK-6891. --- Resolution: Duplicate ExecutorAllocationManager will request negative number executors Key: SPARK-6891 URL: https://issues.apache.org/jira/browse/SPARK-6891 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula Priority: Critical Attachments: DynamicExecutorTest.scala Below is the exception: 15/04/14 10:10:18 ERROR Utils: Uncaught exception in thread spark-dynamic-executor-allocation-0 java.lang.IllegalArgumentException: Attempted to request a negative number of executor(s) -1 from the cluster manager. Please specify a positive number! at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:342) at org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1170) at org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:294) at org.apache.spark.ExecutorAllocationManager.addOrCancelExecutorRequests(ExecutorAllocationManager.scala:263) at org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:230) at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorAllocationManager.scala:189) at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189) at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1723) at org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:189) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) Below are the configurations I set: spark.dynamicAllocation.enabled true spark.dynamicAllocation.minExecutors 0 spark.dynamicAllocation.initialExecutors 3 spark.dynamicAllocation.maxExecutors 7 spark.dynamicAllocation.executorIdleTimeout 30 spark.shuffle.service.enabled true -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6891) ExecutorAllocationManager will request negative number executors
[ https://issues.apache.org/jira/browse/SPARK-6891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14510550#comment-14510550 ] Sandy Ryza commented on SPARK-6891: --- This looks like a duplicate of SPARK-6954. While this JIRA was filed earlier, I'm going to close it because the other one already has a patch with heavy discussion. Sorry that we didn't notice this earlier. ExecutorAllocationManager will request negative number executors Key: SPARK-6891 URL: https://issues.apache.org/jira/browse/SPARK-6891 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula Priority: Critical Attachments: DynamicExecutorTest.scala Below is the exception: 15/04/14 10:10:18 ERROR Utils: Uncaught exception in thread spark-dynamic-executor-allocation-0 java.lang.IllegalArgumentException: Attempted to request a negative number of executor(s) -1 from the cluster manager. Please specify a positive number! at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:342) at org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1170) at org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:294) at org.apache.spark.ExecutorAllocationManager.addOrCancelExecutorRequests(ExecutorAllocationManager.scala:263) at org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:230) at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorAllocationManager.scala:189) at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189) at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1723) at org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:189) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) Below are the configurations I set: spark.dynamicAllocation.enabled true spark.dynamicAllocation.minExecutors 0 spark.dynamicAllocation.initialExecutors 3 spark.dynamicAllocation.maxExecutors 7 spark.dynamicAllocation.executorIdleTimeout 30 spark.shuffle.service.enabled true -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6954) Dynamic allocation: numExecutorsPending in ExecutorAllocationManager should never become negative
[ https://issues.apache.org/jira/browse/SPARK-6954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497465#comment-14497465 ] Sandy Ryza commented on SPARK-6954: --- Hi [~cheolsoo], are you running with a version of Spark that contains SPARK-6325? (1.3.0 does not). Dynamic allocation: numExecutorsPending in ExecutorAllocationManager should never become negative - Key: SPARK-6954 URL: https://issues.apache.org/jira/browse/SPARK-6954 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.0 Reporter: Cheolsoo Park Priority: Minor Labels: yarn I have a simple test case for dynamic allocation on YARN that fails with the following stack trace- {code} 15/04/16 00:52:14 ERROR Utils: Uncaught exception in thread spark-dynamic-executor-allocation-0 java.lang.IllegalArgumentException: Attempted to request a negative number of executor(s) -21 from the cluster manager. Please specify a positive number! at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:338) at org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1137) at org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:294) at org.apache.spark.ExecutorAllocationManager.addOrCancelExecutorRequests(ExecutorAllocationManager.scala:263) at org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:230) at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorAllocationManager.scala:189) at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189) at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618) at org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:189) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} My test is as follows- # Start spark-shell with a single executor. # Run a {{select count(\*)}} query. The number of executors rises as input size is non-trivial. # After the job finishes, the number of executors falls as most of them become idle. # Rerun the same query again, and the request to add executors fails with the above error. In fact, the job itself continues to run with whatever executors it already has, but it never gets more executors unless the shell is closed and restarted. In fact, this error only happens when I configure {{executorIdleTimeout}} very small. For eg, I can reproduce it with the following configs- {code} spark.dynamicAllocation.executorIdleTimeout 5 spark.dynamicAllocation.schedulerBacklogTimeout 5 {code} Although I can simply increase {{executorIdleTimeout}} to something like 60 secs to avoid the error, I think this is still a bug to be fixed. 
The root cause seems to be that {{numExecutorsPending}} accidentally becomes negative if executors are killed too aggressively (i.e. {{executorIdleTimeout}} is too small) because under that circumstance, the new target # of executors can be smaller than the current # of executors. When that happens, {{ExecutorAllocationManager}} ends up trying to add a negative number of executors, which throws an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5888) Add OneHotEncoder as a Transformer
[ https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza reassigned SPARK-5888: - Assignee: Sandy Ryza Add OneHotEncoder as a Transformer -- Key: SPARK-5888 URL: https://issues.apache.org/jira/browse/SPARK-5888 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng Assignee: Sandy Ryza `OneHotEncoder` takes a categorical column and output a vector column, which stores the category info in binaries. {code} val ohe = new OneHotEncoder() .setInputCol(countryIndex) .setOutputCol(countries) {code} It should read the category info from the metadata and assign feature names properly in the output column. We need to discuss the default naming scheme and whether we should let it process multiple categorical columns at the same time. One category (the most frequent one) should be removed from the output to make the output columns linear independent. Or this could be an option tuned on by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6735) Provide options to make maximum executor failure count (which kills the application) relative to a window duration or disable it.
[ https://issues.apache.org/jira/browse/SPARK-6735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487670#comment-14487670 ] Sandy Ryza commented on SPARK-6735: --- Hi [~twinkle], can you submit the PR against the main Spark project? Provide options to make maximum executor failure count (which kills the application) relative to a window duration or disable it. --- Key: SPARK-6735 URL: https://issues.apache.org/jira/browse/SPARK-6735 Project: Spark Issue Type: Improvement Components: Spark Submit, YARN Affects Versions: 1.2.0, 1.2.1, 1.3.0 Reporter: Twinkle Sachdeva Currently there is a setting (spark.yarn.max.executor.failures) which sets the maximum number of executor failures, after which the application fails. For long-running applications, a user may want the application never to be killed, or may want the failure threshold to be relative to a window duration. This improvement is to provide options to make the maximum executor failure count (which kills the application) relative to a window duration, or to disable it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6735) Provide options to make maximum executor failure count (which kills the application) relative to a window duration or disable it.
[ https://issues.apache.org/jira/browse/SPARK-6735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487670#comment-14487670 ] Sandy Ryza edited comment on SPARK-6735 at 4/9/15 5:06 PM: --- Hi [~twinkle], can you submit the PR against the main Spark project? was (Author: sandyr): Hi [~twinkle], can you submit the PR against the main Spark project. Provide options to make maximum executor failure count (which kills the application) relative to a window duration or disable it. --- Key: SPARK-6735 URL: https://issues.apache.org/jira/browse/SPARK-6735 Project: Spark Issue Type: Improvement Components: Spark Submit, YARN Affects Versions: 1.2.0, 1.2.1, 1.3.0 Reporter: Twinkle Sachdeva Currently there is a setting (spark.yarn.max.executor.failures) which sets the maximum number of executor failures, after which the application fails. For long-running applications, a user may want the application never to be killed, or may want the failure threshold to be relative to a window duration. This improvement is to provide options to make the maximum executor failure count (which kills the application) relative to a window duration, or to disable it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6700) flaky test: run Python application in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14395135#comment-14395135 ] Sandy Ryza edited comment on SPARK-6700 at 4/3/15 9:54 PM: --- Does this fail often? How often? Might the issue be related to SPARK-6506? was (Author: sandyr): Does this fail often? flaky test: run Python application in yarn-cluster mode Key: SPARK-6700 URL: https://issues.apache.org/jira/browse/SPARK-6700 Project: Spark Issue Type: Bug Components: Tests Reporter: Davies Liu Assignee: Lianhui Wang Priority: Critical Labels: test, yarn org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in yarn-cluster mode Failing for the past 1 build (Since Failed#2025 ) Took 12 sec. Error Message {code} Process List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit, --master, yarn-cluster, --num-executors, 1, --properties-file, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties, --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp) exited with code 1 Stacktrace sbt.ForkMain$ForkError: Process List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit, --master, yarn-cluster, --num-executors, 1, --properties-file, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties, --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122) at org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at 
scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at
[jira] [Commented] (SPARK-6700) flaky test: run Python application in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14395135#comment-14395135 ] Sandy Ryza commented on SPARK-6700: --- Does this fail often? flaky test: run Python application in yarn-cluster mode Key: SPARK-6700 URL: https://issues.apache.org/jira/browse/SPARK-6700 Project: Spark Issue Type: Bug Components: Tests Reporter: Davies Liu Assignee: Lianhui Wang Priority: Critical Labels: test, yarn org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in yarn-cluster mode Failing for the past 1 build (Since Failed#2025 ) Took 12 sec. Error Message {code} Process List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit, --master, yarn-cluster, --num-executors, 1, --properties-file, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties, --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp) exited with code 1 Stacktrace sbt.ForkMain$ForkError: Process List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit, --master, yarn-cluster, --num-executors, 1, --properties-file, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties, --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122) at org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) 
at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.deploy.yarn.YarnClusterSuite.org$scalatest$BeforeAndAfterAll$$super$run(YarnClusterSuite.scala:44) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390153#comment-14390153 ] Sandy Ryza commented on SPARK-6646: --- This seems like a good opportunity to finally add a DataFrame registerTempTablet API. Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390196#comment-14390196 ] Sandy Ryza commented on SPARK-6646: --- [~srowen] I like the way you think. I know a lot of good nodes out there looking for love or at least a casual shutdown hookup. Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4550) In sort-based shuffle, store map outputs in serialized form
[ https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383280#comment-14383280 ]

Sandy Ryza commented on SPARK-4550:
-----------------------------------

Java serialization appears to write out the full class name the first time an object is written and then refer to it by an identifier afterwards:

{code}
scala> import java.io.{ByteArrayOutputStream, ObjectOutputStream}
scala> val baos = new ByteArrayOutputStream()
scala> val oos = new ObjectOutputStream(baos)
scala> oos.writeObject(new java.util.Date())
scala> oos.flush()
scala> baos.toString
res8: String = ��??sr??java.util.Datehj�?KYtxpwLY6: x
scala> baos.toByteArray.length
res9: Int = 46
scala> oos.writeObject(new java.util.Date())
scala> oos.flush()
scala> baos.toString
res14: String = ��??sr??java.util.Datehj�?KYtxpwLY6: xsq?~??wLY6�Dx
scala> baos.toByteArray.length
res13: Int = 63
scala> oos.writeObject(new java.util.Date())
scala> oos.flush()
scala> baos.toString
res17: String = ��??sr??java.util.Datehj�?KYtxpwLY6: xsq?~??wLY6�Dxsq?~??wLY8?�x
scala> baos.toByteArray.length
res18: Int = 80
{code}

The first write costs 46 bytes, while each subsequent write adds only 17: the class descriptor is written once and back-referenced from then on.

There might be some fancy way to listen for the class name being written out and relocate that segment to the front of the stream. However, this seems fairly involved and bug-prone; my opinion is that it's not worth it given that Java serialization is already a severely performance-impaired option. Another option of course would be to write the class name in front of every record, but this would bloat the serialized representation considerably.

> In sort-based shuffle, store map outputs in serialized form
> ------------------------------------------------------------
>
>                 Key: SPARK-4550
>                 URL: https://issues.apache.org/jira/browse/SPARK-4550
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle, Spark Core
>    Affects Versions: 1.2.0
>            Reporter: Sandy Ryza
>            Assignee: Sandy Ryza
>            Priority: Critical
>         Attachments: SPARK-4550-design-v1.pdf, kryo-flush-benchmark.scala
>
> One drawback with sort-based shuffle compared to hash-based shuffle is that it ends up storing many more java objects in memory. If Spark could store map outputs in serialized form, it could
> * spill less often because the serialized form is more compact
> * reduce GC pressure
> This will only work when the serialized representations of objects are independent from each other and occupy contiguous segments of memory. E.g. when Kryo reference tracking is left on, objects may contain pointers to objects farther back in the stream, which means that the sort can't relocate objects without corrupting them.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
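[Editor's note] As the quoted issue points out, relocatability requires Kryo reference tracking to be off. Below is a minimal sketch of why, written against Kryo's public API ({{Kryo#setReferences}}, {{Output}}); it is not from any Spark patch, and the object and class names are illustrative only:

{code}
import java.io.ByteArrayOutputStream
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output

object KryoRelocationSketch {
  // Serialize the same object twice and report the two record sizes.
  def recordSizes(referenceTracking: Boolean): (Int, Int) = {
    val kryo = new Kryo()
    kryo.setReferences(referenceTracking)
    val baos = new ByteArrayOutputStream()
    val out = new Output(baos)
    val date = new java.util.Date()

    kryo.writeObject(out, date)
    out.flush()
    val first = baos.size()

    kryo.writeObject(out, date) // the same instance again
    out.flush()
    (first, baos.size() - first)
  }

  def main(args: Array[String]): Unit = {
    // With tracking on, the second record degenerates to a back-reference
    // into the first record's bytes -- exactly the pointer problem that
    // makes records unsafe to relocate during a sort.
    println(s"tracking on:  ${recordSizes(referenceTracking = true)}")
    // With tracking off, every record is self-contained and relocatable.
    println(s"tracking off: ${recordSizes(referenceTracking = false)}")
  }
}
{code}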
[jira] [Commented] (SPARK-6479) Create off-heap block storage API (internal)
[ https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378011#comment-14378011 ]

Sandy Ryza commented on SPARK-6479:
-----------------------------------

I believe he means wrapping Spark's call-outs to Tachyon in the new APIs proposed here and invoking Tachyon through them.

> Create off-heap block storage API (internal)
> --------------------------------------------
>
>                 Key: SPARK-6479
>                 URL: https://issues.apache.org/jira/browse/SPARK-6479
>             Project: Spark
>          Issue Type: Improvement
>          Components: Block Manager, Spark Core
>            Reporter: Reynold Xin
>         Attachments: SparkOffheapsupportbyHDFS.pdf
>
> Would be great to create APIs for off-heap block stores, rather than doing a bunch of if statements everywhere.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
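[Editor's note] To make the "APIs rather than if statements" idea concrete, here is one hypothetical shape such an internal API could take. This is a sketch only: the trait name and method signatures are invented for illustration, not taken from the attached design doc or any patch.

{code}
package org.apache.spark.storage

import java.nio.ByteBuffer

// Hypothetical sketch: one trait that each off-heap backend (e.g. Tachyon)
// implements, so the block manager calls through the trait instead of
// branching on the store type at every call site.
private[spark] trait OffHeapBlockStore {
  def init(executorId: String): Unit            // set up the backing store
  def put(blockId: String, bytes: ByteBuffer): Unit
  def get(blockId: String): Option[ByteBuffer]  // None if the block is absent
  def remove(blockId: String): Boolean          // true if a block was removed
  def contains(blockId: String): Boolean
  def shutdown(): Unit                          // release off-heap resources
}
{code}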
[jira] [Assigned] (SPARK-6470) Allow Spark apps to put YARN node labels in their requests
[ https://issues.apache.org/jira/browse/SPARK-6470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandy Ryza reassigned SPARK-6470:
---------------------------------

    Assignee: Sandy Ryza

> Allow Spark apps to put YARN node labels in their requests
> -----------------------------------------------------------
>
>                 Key: SPARK-6470
>                 URL: https://issues.apache.org/jira/browse/SPARK-6470
>             Project: Spark
>          Issue Type: Improvement
>          Components: YARN
>            Reporter: Sandy Ryza
>            Assignee: Sandy Ryza

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
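[Editor's note] For reference, a sketch of what attaching a node label to a container request could look like against the YARN client API. It assumes Hadoop 2.6+, where {{ResourceRequest#setNodeLabelExpression}} exists; the priority, resource sizes, and label value below are made-up placeholders, not values from any Spark change:

{code}
import org.apache.hadoop.yarn.api.records.{Priority, Resource, ResourceRequest}

object NodeLabelRequestSketch {
  // Build a container request carrying a YARN node-label expression.
  // All concrete values here are placeholders.
  def labeledRequest(labelExpr: String): ResourceRequest = {
    val request = ResourceRequest.newInstance(
      Priority.newInstance(1),         // request priority
      ResourceRequest.ANY,             // no host/rack locality constraint
      Resource.newInstance(2048, 1),   // 2 GB of memory, 1 vcore
      1)                               // number of containers
    request.setNodeLabelExpression(labelExpr) // e.g. "gpu"
    request
  }
}
{code}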
[jira] [Commented] (SPARK-6418) Add simple per-stage visualization to the UI
[ https://issues.apache.org/jira/browse/SPARK-6418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372144#comment-14372144 ]

Sandy Ryza commented on SPARK-6418:
-----------------------------------

I think this would be a great addition. One note is that I think it might be easier to interpret if the results were sorted by start time. If possible it would also be really helpful to somehow give an indication of which tasks ran on which nodes. And to have task metrics show when you hover over them or something. And to have filtering. And pagination. Ok, I'm getting carried away, but it would be great to see something like this go in.

> Add simple per-stage visualization to the UI
> ---------------------------------------------
>
>                 Key: SPARK-6418
>                 URL: https://issues.apache.org/jira/browse/SPARK-6418
>             Project: Spark
>          Issue Type: Improvement
>          Components: Web UI
>            Reporter: Kay Ousterhout
>            Assignee: Pradyumn Shroff
>         Attachments: Screen Shot 2015-03-18 at 6.13.04 PM.png
>
> Visualizing how tasks in a stage spend their time can be very helpful to understanding performance. Many folks have started using the visualization tools here: https://github.com/kayousterhout/trace-analysis (see the README at the bottom) to analyze their jobs after they've finished running, but it would be great if this functionality were natively integrated into Spark's UI.
> I'd propose adding a relatively simple visualization to the stage detail page, that's hidden by default but that users can view by clicking on a drop-down menu. The plan is to implement this using D3; a mock-up of how this would look is attached.
> This is intended to be a much simpler and more limited version of SPARK-3468

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6418) Add simple per-stage visualization to the UI
[ https://issues.apache.org/jira/browse/SPARK-6418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372178#comment-14372178 ]

Sandy Ryza commented on SPARK-6418:
-----------------------------------

Yeah, all of that is wishful thinking, definitely not requirements for a first version. Dealing with huge numbers of tasks, as you mentioned, seems to me like a requirement for a first cut, and something like just taking the first N seems sufficient to me.

> Add simple per-stage visualization to the UI
> ---------------------------------------------
>
>                 Key: SPARK-6418
>                 URL: https://issues.apache.org/jira/browse/SPARK-6418
>             Project: Spark
>          Issue Type: Improvement
>          Components: Web UI
>            Reporter: Kay Ousterhout
>            Assignee: Pradyumn Shroff
>         Attachments: Screen Shot 2015-03-18 at 6.13.04 PM.png
>
> Visualizing how tasks in a stage spend their time can be very helpful to understanding performance. Many folks have started using the visualization tools here: https://github.com/kayousterhout/trace-analysis (see the README at the bottom) to analyze their jobs after they've finished running, but it would be great if this functionality were natively integrated into Spark's UI.
> I'd propose adding a relatively simple visualization to the stage detail page, that's hidden by default but that users can view by clicking on a drop-down menu. The plan is to implement this using D3; a mock-up of how this would look is attached.
> One change we'll make for the initial implementation, compared to the attached visualization, is that tasks will be sorted by start time.
> This is intended to be a much simpler and more limited version of SPARK-3468

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
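[Editor's note] A trivial sketch of the "just take the first N" idea from the comment above; the {{TaskSpan}} type and the default cap are invented for illustration, not from any patch:

{code}
object StageVizSketch {
  // Hypothetical: the minimum the UI would need per task to draw one row.
  case class TaskSpan(taskId: Long, launchTime: Long, finishTime: Long)

  // Sort by start time and cap at the first N so the page stays
  // responsive for stages with huge numbers of tasks.
  def tasksToRender(tasks: Seq[TaskSpan], maxTasks: Int = 1000): Seq[TaskSpan] =
    tasks.sortBy(_.launchTime).take(maxTasks)
}
{code}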
[jira] [Commented] (SPARK-4550) In sort-based shuffle, store map outputs in serialized form
[ https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372232#comment-14372232 ]

Sandy Ryza commented on SPARK-4550:
-----------------------------------

I spoke briefly with Reynold about this offline, and he pointed out that, with the patch, we now flush the Kryo serialization stream after every object we write. I put together a micro-benchmark that stresses this by writing a bunch of small records to a Kryo serialization stream with and without flushing:

runs without flush: (count: 30, mean: 226.40, stdev: 3.929377, max: 241.00, min: 222.00)
runs with flush: (count: 30, mean: 226.30, stdev: 2.084067, max: 234.00, min: 224.00)

There doesn't appear to be a significant difference. The benchmark code is attached.

> In sort-based shuffle, store map outputs in serialized form
> ------------------------------------------------------------
>
>                 Key: SPARK-4550
>                 URL: https://issues.apache.org/jira/browse/SPARK-4550
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle, Spark Core
>    Affects Versions: 1.2.0
>            Reporter: Sandy Ryza
>            Assignee: Sandy Ryza
>            Priority: Critical
>         Attachments: SPARK-4550-design-v1.pdf
>
> One drawback with sort-based shuffle compared to hash-based shuffle is that it ends up storing many more java objects in memory. If Spark could store map outputs in serialized form, it could
> * spill less often because the serialized form is more compact
> * reduce GC pressure
> This will only work when the serialized representations of objects are independent from each other and occupy contiguous segments of memory. E.g. when Kryo reference tracking is left on, objects may contain pointers to objects farther back in the stream, which means that the sort can't relocate objects without corrupting them.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
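[Editor's note] The attached kryo-flush-benchmark.scala is the authoritative code; the following is only a rough reconstruction from the description above, to show the shape of such a micro-benchmark. The record type, record count, and timing method are assumptions:

{code}
import java.io.ByteArrayOutputStream
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output

object KryoFlushBench {
  // Time writing `records` small values to a Kryo stream, optionally
  // calling flush() after every single write.
  def run(flushEachRecord: Boolean, records: Int = 1000000): Long = {
    val kryo = new Kryo()
    val output = new Output(new ByteArrayOutputStream())
    val start = System.currentTimeMillis()
    var i = 0
    while (i < records) {
      kryo.writeObject(output, i)         // one small record (a boxed Int)
      if (flushEachRecord) output.flush()
      i += 1
    }
    output.flush()
    System.currentTimeMillis() - start
  }

  def main(args: Array[String]): Unit = {
    // 30 timed runs per configuration, mirroring the counts reported above.
    for (flush <- Seq(false, true)) {
      val runs = (1 to 30).map(_ => run(flush))
      println(s"flush=$flush: mean=${runs.sum.toDouble / runs.size} ms " +
        s"min=${runs.min} max=${runs.max}")
    }
  }
}
{code}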
[jira] [Updated] (SPARK-4550) In sort-based shuffle, store map outputs in serialized form
[ https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandy Ryza updated SPARK-4550:
------------------------------

    Attachment: kryo-flush-benchmark.scala

> In sort-based shuffle, store map outputs in serialized form
> ------------------------------------------------------------
>
>                 Key: SPARK-4550
>                 URL: https://issues.apache.org/jira/browse/SPARK-4550
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle, Spark Core
>    Affects Versions: 1.2.0
>            Reporter: Sandy Ryza
>            Assignee: Sandy Ryza
>            Priority: Critical
>         Attachments: SPARK-4550-design-v1.pdf, kryo-flush-benchmark.scala
>
> One drawback with sort-based shuffle compared to hash-based shuffle is that it ends up storing many more java objects in memory. If Spark could store map outputs in serialized form, it could
> * spill less often because the serialized form is more compact
> * reduce GC pressure
> This will only work when the serialized representations of objects are independent from each other and occupy contiguous segments of memory. E.g. when Kryo reference tracking is left on, objects may contain pointers to objects farther back in the stream, which means that the sort can't relocate objects without corrupting them.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6393) Extra RPC to the AM during killExecutor invocation
Sandy Ryza created SPARK-6393:
---------------------------------

             Summary: Extra RPC to the AM during killExecutor invocation
                 Key: SPARK-6393
                 URL: https://issues.apache.org/jira/browse/SPARK-6393
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core, YARN
    Affects Versions: 1.3.1
            Reporter: Sandy Ryza

This was introduced by SPARK-6325

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org