[jira] [Comment Edited] (SPARK-16917) Spark streaming kafka version compatibility.
[ https://issues.apache.org/jira/browse/SPARK-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418384#comment-15418384 ] Alexey Zotov edited comment on SPARK-16917 at 8/12/16 5:24 AM: --- [~sowen] [~c...@koeninger.org] It really seems to be confusing: 1. Security is supported in the new consumer API, which is implemented starting from Kafka v0.9. _spark-streaming-kafka-0-8_2.11_ does not support the new consumer API, so it does not look compatible with a secured Kafka v0.9. 2. _spark-streaming-kafka-0-10_2.11_ works with brokers 0.10 or higher. Based on the above, it looks like it is impossible to use Spark Streaming with a secured Kafka v0.9. Am I correct? If yes, then it would be great to mention it somewhere in the documentation. Thanks! > Spark streaming kafka version compatibility. > - > > Key: SPARK-16917 > URL: https://issues.apache.org/jira/browse/SPARK-16917 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.0.0 >Reporter: Sudev >Priority: Trivial > Labels: documentation > > It would be nice to have Kafka version compatibility information in the > official documentation. > It's very confusing now. > * If you look at this JIRA[1], it seems like Kafka is supported in Spark > 2.0.0. > * The documentation lists artifact for (Kafka 0.8) > spark-streaming-kafka-0-8_2.11 > Is Kafka 0.9 supported by Spark 2.0.0 ? 
> Since I'm confused here even after an hour's effort googling on the same, I > think someone should help add the compatibility matrix. > [1] https://issues.apache.org/jira/browse/SPARK-12177 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16917) Spark streaming kafka version compatibility.
[ https://issues.apache.org/jira/browse/SPARK-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418384#comment-15418384 ] Alexey Zotov commented on SPARK-16917: -- [~sowen] [~c...@koeninger.org] It really seems to be confusing: 1. Security is supported in the new consumer API, which is implemented starting from Kafka v0.9. _spark-streaming-kafka-0-8_2.11_ does not support the new consumer API, so it does not look compatible with a secured Kafka v0.9. 2. _spark-streaming-kafka-0-10_2.11_ works with brokers 0.10 or higher. Based on the above, it looks like it is impossible to use Spark Streaming with a secured Kafka v0.9. Please let me know what I have missed in the above reasoning. Thanks! > Spark streaming kafka version compatibility. > - > > Key: SPARK-16917 > URL: https://issues.apache.org/jira/browse/SPARK-16917 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.0.0 >Reporter: Sudev >Priority: Trivial > Labels: documentation > > It would be nice to have Kafka version compatibility information in the > official documentation. > It's very confusing now. > * If you look at this JIRA[1], it seems like Kafka is supported in Spark > 2.0.0. > * The documentation lists artifact for (Kafka 0.8) > spark-streaming-kafka-0-8_2.11 > Is Kafka 0.9 supported by Spark 2.0.0 ? > Since I'm confused here even after an hour's effort googling on the same, I > think someone should help add the compatibility matrix. > [1] https://issues.apache.org/jira/browse/SPARK-12177
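The two artifacts and broker ranges discussed above can be collected into a small lookup table. A minimal Python sketch, assuming the ranges stated in the comment — these entries reflect the commenter's reading, not an official matrix (which is exactly what this ticket asks to have documented):

```python
# Compatibility summary implied by the comment above, as a small lookup
# table. The broker ranges below are taken from the discussion, not from
# any official Spark documentation.

COMPAT = {
    "spark-streaming-kafka-0-8_2.11": {
        "min_broker": (0, 8, 2),
        "new_consumer_api": False,  # hence no security (SSL/SASL) support
    },
    "spark-streaming-kafka-0-10_2.11": {
        "min_broker": (0, 10, 0),
        "new_consumer_api": True,
    },
}

def supports_secured_broker(artifact, broker_version):
    """Security requires the new consumer API and a new-enough broker."""
    info = COMPAT[artifact]
    return info["new_consumer_api"] and broker_version >= info["min_broker"]

# A secured 0.9 broker is exactly the gap the comment points out:
print(any(supports_secured_broker(a, (0, 9, 0)) for a in COMPAT))  # False
```

Neither artifact covers a secured 0.9 broker, which matches the conclusion drawn in the comment.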
[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2
[ https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418364#comment-15418364 ] Dongjoon Hyun commented on SPARK-16975: --- Hi, [~rxin]. Could you review this PR? > Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2 > -- > > Key: SPARK-16975 > URL: https://issues.apache.org/jira/browse/SPARK-16975 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 > Environment: Ubuntu Linux 14.04 >Reporter: immerrr again > Labels: parquet > > Spark-2.0.0 seems to have some problems reading a parquet dataset generated > by 1.6.2. > {code} > In [80]: spark.read.parquet('/path/to/data') > ... > AnalysisException: u'Unable to infer schema for ParquetFormat at > /path/to/data. It must be specified manually;' > {code} > The dataset is ~150G and partitioned by _locality_code column. None of the > partitions are empty. I have narrowed the failing dataset to the first 32 > partitions of the data: > {code} > In [82]: spark.read.parquet(*subdirs[:32]) > ... > AnalysisException: u'Unable to infer schema for ParquetFormat at > /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AI. It must be > specified manually;' > {code} > Interestingly, it works OK if you remove any of the partitions from the list: > {code} > In [83]: for i in range(32): spark.read.parquet(*(subdirs[:i] + > subdirs[i+1:32])) > {code} > Another strange thing is that the schemas for the first and the last 31 > partitions of the subset are identical: > {code} > In [84]: spark.read.parquet(*subdirs[:31]).schema.fields == > spark.read.parquet(*subdirs[1:32]).schema.fields > Out[84]: True > {code} > Which got me interested and I tried this: > {code} > In [87]: spark.read.parquet(*([subdirs[0]] * 32)) > ... > AnalysisException: u'Unable to infer schema for ParquetFormat at > /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AQ. 
It must be > specified manually;' > In [88]: spark.read.parquet(*([subdirs[15]] * 32)) > ... > AnalysisException: u'Unable to infer schema for ParquetFormat at > /path/to/data/_locality_code=AX,/path/to/data/_locality_code=AX. It must be > specified manually;' > In [89]: spark.read.parquet(*([subdirs[31]] * 32)) > ... > AnalysisException: u'Unable to infer schema for ParquetFormat at > /path/to/data/_locality_code=BE,/path/to/data/_locality_code=BE. It must be > specified manually;' > {code} > If I read the first partition, save it in 2.0 and try to read in the same > manner, everything is fine: > {code} > In [100]: spark.read.parquet(subdirs[0]).write.parquet('spark-2.0-test') > 16/08/09 11:03:37 WARN ParquetRecordReader: Can not initialize counter due to > context is not a instance of TaskInputOutputContext, but is > org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl > In [101]: df = spark.read.parquet(*(['spark-2.0-test'] * 32)) > {code} > I have originally posted it to user mailing list, but with the last > discoveries this clearly seems like a bug.
[jira] [Assigned] (SPARK-17019) Expose off-heap memory usage in various places
[ https://issues.apache.org/jira/browse/SPARK-17019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17019: Assignee: Apache Spark > Expose off-heap memory usage in various places > -- > > Key: SPARK-17019 > URL: https://issues.apache.org/jira/browse/SPARK-17019 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Saisai Shao >Assignee: Apache Spark >Priority: Minor > > With SPARK-13992, Spark supports persisting data into off-heap memory, but > off-heap usage is currently not exposed, which makes it inconvenient for users > to monitor and profile. This proposes to expose off-heap memory as > well as on-heap memory usage in various places: > 1. Spark UI's executor page will display both on-heap and off-heap memory > usage. > 2. REST requests return both on-heap and off-heap memory. > 3. Both memory usages can also be obtained programmatically from > SparkListener.
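Once both kinds of usage are reported, a monitoring client could aggregate them from the executors REST endpoint. A hedged Python sketch of that aggregation; the `memoryMetrics`, `usedOnHeapStorageMemory` and `usedOffHeapStorageMemory` field names are hypothetical stand-ins for whatever the final REST schema exposes:

```python
# Sketch: aggregating on-heap and off-heap storage memory across executors
# from the JSON an /applications/<app-id>/executors REST endpoint could
# return. The memoryMetrics field names are assumptions based on this
# proposal, not a confirmed Spark API.

def total_memory_usage(executors):
    """Sum used on-heap and off-heap storage memory over all executors."""
    totals = {"onHeap": 0, "offHeap": 0}
    for executor in executors:
        metrics = executor.get("memoryMetrics", {})
        totals["onHeap"] += metrics.get("usedOnHeapStorageMemory", 0)
        totals["offHeap"] += metrics.get("usedOffHeapStorageMemory", 0)
    return totals

sample = [
    {"id": "1", "memoryMetrics": {"usedOnHeapStorageMemory": 1024,
                                  "usedOffHeapStorageMemory": 512}},
    {"id": "2", "memoryMetrics": {"usedOnHeapStorageMemory": 2048,
                                  "usedOffHeapStorageMemory": 0}},
]
print(total_memory_usage(sample))  # {'onHeap': 3072, 'offHeap': 512}
```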
[jira] [Assigned] (SPARK-17019) Expose off-heap memory usage in various places
[ https://issues.apache.org/jira/browse/SPARK-17019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17019: Assignee: (was: Apache Spark) > Expose off-heap memory usage in various places > -- > > Key: SPARK-17019 > URL: https://issues.apache.org/jira/browse/SPARK-17019 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Saisai Shao >Priority: Minor > > With SPARK-13992, Spark supports persisting data into off-heap memory, but > off-heap usage is currently not exposed, which makes it inconvenient for users > to monitor and profile. This proposes to expose off-heap memory as > well as on-heap memory usage in various places: > 1. Spark UI's executor page will display both on-heap and off-heap memory > usage. > 2. REST requests return both on-heap and off-heap memory. > 3. Both memory usages can also be obtained programmatically from > SparkListener.
[jira] [Commented] (SPARK-17019) Expose off-heap memory usage in various places
[ https://issues.apache.org/jira/browse/SPARK-17019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418305#comment-15418305 ] Apache Spark commented on SPARK-17019: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/14617 > Expose off-heap memory usage in various places > -- > > Key: SPARK-17019 > URL: https://issues.apache.org/jira/browse/SPARK-17019 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Saisai Shao >Priority: Minor > > With SPARK-13992, Spark supports persisting data into off-heap memory, but > off-heap usage is currently not exposed, which makes it inconvenient for users > to monitor and profile. This proposes to expose off-heap memory as > well as on-heap memory usage in various places: > 1. Spark UI's executor page will display both on-heap and off-heap memory > usage. > 2. REST requests return both on-heap and off-heap memory. > 3. Both memory usages can also be obtained programmatically from > SparkListener.
[jira] [Updated] (SPARK-16434) Avoid record-per type dispatch in JSON when reading
[ https://issues.apache.org/jira/browse/SPARK-16434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-16434: Assignee: Hyukjin Kwon > Avoid record-per type dispatch in JSON when reading > --- > > Key: SPARK-16434 > URL: https://issues.apache.org/jira/browse/SPARK-16434 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon > Fix For: 2.1.0 > > > Currently, {{JacksonParser.parse}} is doing type dispatch for each row to > read appropriate values. > It might not have to be done like this because the schema of {{DataFrame}} is > already there. > So, appropriate converters can be created first according to the schema, and > then applied to each row.
[jira] [Resolved] (SPARK-16434) Avoid record-per type dispatch in JSON when reading
[ https://issues.apache.org/jira/browse/SPARK-16434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-16434. - Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14102 [https://github.com/apache/spark/pull/14102] > Avoid record-per type dispatch in JSON when reading > --- > > Key: SPARK-16434 > URL: https://issues.apache.org/jira/browse/SPARK-16434 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > Fix For: 2.1.0 > > > Currently, {{JacksonParser.parse}} is doing type dispatch for each row to > read appropriate values. > It might not have to be done like this because the schema of {{DataFrame}} is > already there. > So, appropriate converters can be created first according to the schema, and > then applied to each row.
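The proposal in the description — choose converter functions once from the schema, then apply them to every row — can be sketched in a few lines. This is a plain-Python stand-in for the Scala change in {{JacksonParser}}, not the actual implementation:

```python
# Sketch of the proposed approach: build one converter per field from the
# schema up front, then apply the precomputed converters to every row,
# instead of re-dispatching on the type for each value.

def make_converter(data_type):
    """Return a conversion function chosen once, based on the schema type."""
    if data_type == "int":
        return int
    if data_type == "double":
        return float
    if data_type == "string":
        return str
    raise ValueError("unsupported type: %s" % data_type)

def parse_rows(schema, rows):
    # Type dispatch happens here, once per field...
    converters = [make_converter(t) for _, t in schema]
    # ...so the per-row loop only applies already-chosen functions.
    return [tuple(conv(v) for conv, v in zip(converters, row)) for row in rows]

schema = [("id", "int"), ("score", "double"), ("name", "string")]
rows = [["1", "0.5", "a"], ["2", "1.5", "b"]]
print(parse_rows(schema, rows))  # [(1, 0.5, 'a'), (2, 1.5, 'b')]
```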
[jira] [Resolved] (SPARK-13081) Allow set pythonExec of driver and executor through configuration
[ https://issues.apache.org/jira/browse/SPARK-13081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-13081. Resolution: Fixed Assignee: Jeff Zhang Fix Version/s: 2.1.0 > Allow set pythonExec of driver and executor through configuration > - > > Key: SPARK-13081 > URL: https://issues.apache.org/jira/browse/SPARK-13081 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Submit >Reporter: Jeff Zhang >Assignee: Jeff Zhang >Priority: Minor > Fix For: 2.1.0 > > > Currently a user has to export the environment variables PYSPARK_DRIVER_PYTHON and > PYSPARK_PYTHON to set the pythonExec of the driver and executors. That is fine in > interactive mode using bin/pyspark, but it is not so convenient if a user wants > to use pyspark in batch mode via bin/spark-submit. It would be better to > allow users to set pythonExec through "--conf"
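One plausible shape for such a feature is a precedence chain in which a `--conf` value overrides the corresponding environment variable. A Python sketch of that resolution order; the `spark.pyspark.python` / `spark.pyspark.driver.python` key names are assumptions for illustration, not a documented contract at the time of this ticket:

```python
# Sketch of a plausible resolution order: an explicit --conf value wins
# over the corresponding environment variable, and driver-specific
# settings win over general ones. The spark.pyspark.* key names are
# hypothetical here.

def resolve_python_exec(conf, env, driver=False):
    """Pick the Python executable for the driver or the executors."""
    candidates = []
    if driver:
        candidates += [conf.get("spark.pyspark.driver.python"),
                       env.get("PYSPARK_DRIVER_PYTHON")]
    candidates += [conf.get("spark.pyspark.python"),
                   env.get("PYSPARK_PYTHON")]
    for candidate in candidates:
        if candidate:
            return candidate
    return "python"  # fall back to the default interpreter on PATH

conf = {"spark.pyspark.python": "/opt/py35/bin/python"}
env = {"PYSPARK_PYTHON": "/usr/bin/python2.7"}
print(resolve_python_exec(conf, env))  # the --conf value wins
```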
[jira] [Commented] (SPARK-16955) Using ordinals in ORDER BY causes an analysis error when the query has a GROUP BY clause using ordinals
[ https://issues.apache.org/jira/browse/SPARK-16955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418288#comment-15418288 ] Apache Spark commented on SPARK-16955: -- User 'clockfly' has created a pull request for this issue: https://github.com/apache/spark/pull/14616 > Using ordinals in ORDER BY causes an analysis error when the query has a > GROUP BY clause using ordinals > --- > > Key: SPARK-16955 > URL: https://issues.apache.org/jira/browse/SPARK-16955 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Yin Huai > > The following queries work > {code} > select a from (select 1 as a) tmp order by 1 > select a, count(*) from (select 1 as a) tmp group by 1 > select a, count(*) from (select 1 as a) tmp group by 1 order by a > {code} > However, the following query does not > {code} > select a, count(*) from (select 1 as a) tmp group by 1 order by 1 > {code} > {code} > org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to > Group by position: '1' exceeds the size of the select list '0'. 
on unresolved > object, tree: > Aggregate [1] > +- SubqueryAlias tmp >+- Project [1 AS a#82] > +- OneRowRelation$ > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:749) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:739) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:739) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:715) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:715) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:714) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) > at > 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1237) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1182) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1182) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1181) >
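Conceptually, the {{ResolveOrdinalInOrderByAndGroupBy}} rule replaces a 1-based ordinal with the matching select-list expression, and the error above is what happens when the rule runs against a plan whose select list is empty from its point of view. A toy Python sketch of the substitution (illustrative only, not Spark's analyzer code):

```python
# Toy sketch of ordinal resolution: replace a 1-based position in a
# GROUP BY / ORDER BY list with the matching select-list expression.

def resolve_ordinals(select_list, refs):
    resolved = []
    for r in refs:
        if isinstance(r, int):  # an ordinal such as GROUP BY 1
            if not 1 <= r <= len(select_list):
                raise ValueError(
                    "Group by position: '%d' exceeds the size of the "
                    "select list '%d'." % (r, len(select_list)))
            resolved.append(select_list[r - 1])
        else:  # already an expression / column name
            resolved.append(r)
    return resolved

select_list = ["a", "count(*)"]
print(resolve_ordinals(select_list, [1]))  # GROUP BY 1 -> ['a']

# The reported failure corresponds to resolving an ordinal against a plan
# whose select list is not visible (effectively empty):
try:
    resolve_ordinals([], [1])
except ValueError as exc:
    print(exc)
```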
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418225#comment-15418225 ] Guoqiang Li commented on SPARK-6235: I'm doing this work and I'll put up the patch this month. > Address various 2G limits > - > > Key: SPARK-6235 > URL: https://issues.apache.org/jira/browse/SPARK-6235 > Project: Spark > Issue Type: Umbrella > Components: Shuffle, Spark Core >Reporter: Reynold Xin > > An umbrella ticket to track the various 2G limits we have in Spark, due to the > use of byte arrays and ByteBuffers.
[jira] [Commented] (SPARK-17029) Dataset toJSON goes through RDD form instead of transforming dataset itself
[ https://issues.apache.org/jira/browse/SPARK-17029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418149#comment-15418149 ] Andrew Ash commented on SPARK-17029: Note RDD form usage from https://issues.apache.org/jira/browse/SPARK-10705 > Dataset toJSON goes through RDD form instead of transforming dataset itself > --- > > Key: SPARK-17029 > URL: https://issues.apache.org/jira/browse/SPARK-17029 > Project: Spark > Issue Type: Bug >Reporter: Robert Kruszewski > > No longer necessary and can be optimized with datasets
[jira] [Assigned] (SPARK-17029) Dataset toJSON goes through RDD form instead of transforming dataset itself
[ https://issues.apache.org/jira/browse/SPARK-17029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17029: Assignee: (was: Apache Spark) > Dataset toJSON goes through RDD form instead of transforming dataset itself > --- > > Key: SPARK-17029 > URL: https://issues.apache.org/jira/browse/SPARK-17029 > Project: Spark > Issue Type: Bug >Reporter: Robert Kruszewski > > No longer necessary and can be optimized with datasets
[jira] [Commented] (SPARK-16578) Configurable hostname for RBackend
[ https://issues.apache.org/jira/browse/SPARK-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418143#comment-15418143 ] Miao Wang commented on SPARK-16578: --- OK. I will check with Junyang. > Configurable hostname for RBackend > -- > > Key: SPARK-16578 > URL: https://issues.apache.org/jira/browse/SPARK-16578 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > One of the requirements that comes up with SparkR being a standalone package > is that users can now install just the R package on the client side and > connect to a remote machine which runs the RBackend class. > We should check if we can support this mode of execution and what are the > pros / cons of it
[jira] [Commented] (SPARK-17029) Dataset toJSON goes through RDD form instead of transforming dataset itself
[ https://issues.apache.org/jira/browse/SPARK-17029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418144#comment-15418144 ] Apache Spark commented on SPARK-17029: -- User 'robert3005' has created a pull request for this issue: https://github.com/apache/spark/pull/14615 > Dataset toJSON goes through RDD form instead of transforming dataset itself > --- > > Key: SPARK-17029 > URL: https://issues.apache.org/jira/browse/SPARK-17029 > Project: Spark > Issue Type: Bug >Reporter: Robert Kruszewski > > No longer necessary and can be optimized with datasets
[jira] [Assigned] (SPARK-17029) Dataset toJSON goes through RDD form instead of transforming dataset itself
[ https://issues.apache.org/jira/browse/SPARK-17029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17029: Assignee: Apache Spark > Dataset toJSON goes through RDD form instead of transforming dataset itself > --- > > Key: SPARK-17029 > URL: https://issues.apache.org/jira/browse/SPARK-17029 > Project: Spark > Issue Type: Bug >Reporter: Robert Kruszewski >Assignee: Apache Spark > > No longer necessary and can be optimized with datasets
[jira] [Closed] (SPARK-17028) Backport SI-9734 for Scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-17028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu closed SPARK-17028. Resolution: Won't Fix > Backport SI-9734 for Scala 2.10 > --- > > Key: SPARK-17028 > URL: https://issues.apache.org/jira/browse/SPARK-17028 > Project: Spark > Issue Type: Bug > Components: Spark Shell > Environment: Scala 2.10 >Reporter: Shixiong Zhu > > SI-9734 will be included in Scala 2.11.9. However, we still need to backport > it to Spark Scala 2.10 Shell manually.
[jira] [Assigned] (SPARK-17027) PolynomialExpansion.choose is prone to integer overflow
[ https://issues.apache.org/jira/browse/SPARK-17027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17027: Assignee: (was: Apache Spark) > PolynomialExpansion.choose is prone to integer overflow > > > Key: SPARK-17027 > URL: https://issues.apache.org/jira/browse/SPARK-17027 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > The current implementation computes the power of k directly, and because of that it is > susceptible to integer overflow on relatively small input (4 features, degree > equal to 10). It would be better to use a recursive formula instead.
[jira] [Assigned] (SPARK-17027) PolynomialExpansion.choose is prone to integer overflow
[ https://issues.apache.org/jira/browse/SPARK-17027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17027: Assignee: Apache Spark > PolynomialExpansion.choose is prone to integer overflow > > > Key: SPARK-17027 > URL: https://issues.apache.org/jira/browse/SPARK-17027 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark >Priority: Minor > > The current implementation computes the power of k directly, and because of that it is > susceptible to integer overflow on relatively small input (4 features, degree > equal to 10). It would be better to use a recursive formula instead.
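A standard way to avoid the overflow is the multiplicative form C(n, k) = C(n, k-1) * (n - k + 1) / k, whose intermediate values stay close to the final result. A Python sketch of that recurrence (Python integers are arbitrary precision, so this only illustrates the formula; the overflow concern applies to fixed-width Int arithmetic as in the Scala code):

```python
# Multiplicative form of the binomial coefficient suggested by the issue:
# C(n, k) = C(n, k - 1) * (n - k + 1) / k. Every prefix product is itself
# a binomial coefficient, so the integer division below is always exact.

def choose(n, k):
    if k < 0 or k > n:
        return 0
    k = min(k, n - k)  # use the symmetry C(n, k) == C(n, n - k)
    result = 1
    for i in range(1, k + 1):
        result = result * (n - k + i) // i  # exact: result is C(n-k+i, i)
    return result

# The "4 features, degree 10" case from the description involves
# coefficients of this size:
print(choose(4 + 10, 10))  # 1001
```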
[jira] [Commented] (SPARK-16883) SQL decimal type is not properly cast to number when collecting SparkDataFrame
[ https://issues.apache.org/jira/browse/SPARK-16883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418111#comment-15418111 ] Apache Spark commented on SPARK-16883: -- User 'wangmiao1981' has created a pull request for this issue: https://github.com/apache/spark/pull/14613 > SQL decimal type is not properly cast to number when collecting SparkDataFrame > -- > > Key: SPARK-16883 > URL: https://issues.apache.org/jira/browse/SPARK-16883 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > To reproduce, run the following code. As you can see, "y" is a list of values. > {code} > registerTempTable(createDataFrame(iris), "iris") > str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y > from iris limit 5"))) > 'data.frame': 5 obs. of 2 variables: > $ x: num 1 1 1 1 1 > $ y:List of 5 > ..$ : num 2 > ..$ : num 2 > ..$ : num 2 > ..$ : num 2 > ..$ : num 2 > {code}
[jira] [Assigned] (SPARK-16883) SQL decimal type is not properly cast to number when collecting SparkDataFrame
[ https://issues.apache.org/jira/browse/SPARK-16883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16883: Assignee: (was: Apache Spark) > SQL decimal type is not properly cast to number when collecting SparkDataFrame > -- > > Key: SPARK-16883 > URL: https://issues.apache.org/jira/browse/SPARK-16883 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > To reproduce, run the following code. As you can see, "y" is a list of values. > {code} > registerTempTable(createDataFrame(iris), "iris") > str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y > from iris limit 5"))) > 'data.frame': 5 obs. of 2 variables: > $ x: num 1 1 1 1 1 > $ y:List of 5 > ..$ : num 2 > ..$ : num 2 > ..$ : num 2 > ..$ : num 2 > ..$ : num 2 > {code}
[jira] [Assigned] (SPARK-16883) SQL decimal type is not properly cast to number when collecting SparkDataFrame
[ https://issues.apache.org/jira/browse/SPARK-16883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16883: Assignee: Apache Spark > SQL decimal type is not properly cast to number when collecting SparkDataFrame > -- > > Key: SPARK-16883 > URL: https://issues.apache.org/jira/browse/SPARK-16883 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki >Assignee: Apache Spark > > To reproduce, run the following code. As you can see, "y" is a list of values. > {code} > registerTempTable(createDataFrame(iris), "iris") > str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y > from iris limit 5"))) > 'data.frame': 5 obs. of 2 variables: > $ x: num 1 1 1 1 1 > $ y:List of 5 > ..$ : num 2 > ..$ : num 2 > ..$ : num 2 > ..$ : num 2 > ..$ : num 2 > {code}
[jira] [Commented] (SPARK-17027) PolynomialExpansion.choose is prone to integer overflow
[ https://issues.apache.org/jira/browse/SPARK-17027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418112#comment-15418112 ] Apache Spark commented on SPARK-17027: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/14614 > PolynomialExpansion.choose is prone to integer overflow > > > Key: SPARK-17027 > URL: https://issues.apache.org/jira/browse/SPARK-17027 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > The current implementation computes the power of k directly, and because of that it is > susceptible to integer overflow on relatively small input (4 features, degree > equal to 10). It would be better to use a recursive formula instead.
[jira] [Resolved] (SPARK-17026) warning msg in MulticlassMetricsSuite
[ https://issues.apache.org/jira/browse/SPARK-17026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Ren resolved SPARK-17026. - Resolution: Not A Problem > warning msg in MulticlassMetricsSuite > - > > Key: SPARK-17026 > URL: https://issues.apache.org/jira/browse/SPARK-17026 > Project: Spark > Issue Type: Improvement >Reporter: Xin Ren >Priority: Trivial > > Got warning when building: > {code} > [warn] > /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala:74: > value precision in class MulticlassMetrics is deprecated: Use accuracy. > [warn]assert(math.abs(metrics.accuracy - metrics.precision) < delta) > [warn]^ > [warn] > /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala:75: > value recall in class MulticlassMetrics is deprecated: Use accuracy. > [warn]assert(math.abs(metrics.accuracy - metrics.recall) < delta) > [warn]^ > [warn] > /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala:76: > value fMeasure in class MulticlassMetrics is deprecated: Use accuracy. 
> [warn]assert(math.abs(metrics.accuracy - metrics.fMeasure) < delta) > [warn]^ > {code} > `precision`, `recall`, and `fMeasure` all reference `accuracy`: > {code} > assert(math.abs(metrics.accuracy - metrics.precision) < delta) > assert(math.abs(metrics.accuracy - metrics.recall) < delta) > assert(math.abs(metrics.accuracy - metrics.fMeasure) < delta) > {code} > {code} > /** >* Returns precision >*/ > @Since("1.1.0") > @deprecated("Use accuracy.", "2.0.0") > lazy val precision: Double = accuracy > /** >* Returns recall >* (equals to precision for multiclass classifier >* because sum of all false positives is equal to sum >* of all false negatives) >*/ > @Since("1.1.0") > @deprecated("Use accuracy.", "2.0.0") > lazy val recall: Double = accuracy > /** >* Returns f-measure >* (equals to precision and recall because precision equals recall) >*/ > @Since("1.1.0") > @deprecated("Use accuracy.", "2.0.0") > lazy val fMeasure: Double = accuracy > {code}
[jira] [Commented] (SPARK-16803) SaveAsTable does not work when source DataFrame is built on a Hive Table
[ https://issues.apache.org/jira/browse/SPARK-16803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418096#comment-15418096 ] Apache Spark commented on SPARK-16803: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/14612 > SaveAsTable does not work when source DataFrame is built on a Hive Table > > > Key: SPARK-16803 > URL: https://issues.apache.org/jira/browse/SPARK-16803 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > {noformat} > scala> sql("create table sample.sample stored as SEQUENCEFILE as select 1 as > key, 'abc' as value") > res2: org.apache.spark.sql.DataFrame = [] > scala> val df = sql("select key, value as value from sample.sample") > df: org.apache.spark.sql.DataFrame = [key: int, value: string] > scala> df.write.mode("append").saveAsTable("sample.sample") > scala> sql("select * from sample.sample").show() > +---+-+ > |key|value| > +---+-+ > | 1| abc| > | 1| abc| > +---+-+ > {noformat} > This works in Spark 1.6 but not in Spark 2.0. The error message from Spark 2.0 is > {noformat} > scala> df.write.mode("append").saveAsTable("sample.sample") > org.apache.spark.sql.AnalysisException: Saving data in MetastoreRelation > sample, sample > is not supported.; > {noformat} > So far, we do not plan to support it in Spark 2.0. Spark 1.6 works because it internally uses {{insertInto}}, but changing it back would break the semantics of {{saveAsTable}} (which uses by-name resolution rather than the by-position resolution used by {{insertInto}}). > Instead, users should use the {{insertInto}} API. We should improve the error message so that users know how to work around the limitation until it is supported.
[jira] [Comment Edited] (SPARK-17027) PolynomialExpansion.choose is prone to integer overflow
[ https://issues.apache.org/jira/browse/SPARK-17027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418071#comment-15418071 ] Maciej Szymkiewicz edited comment on SPARK-17027 at 8/11/16 10:38 PM: -- Yes, this is exactly the problem. {code} choose(14, 10) // res0: Int = -182 {code} was (Author: zero323): Yes, this exactly the problem. {code} choose(14, 10) // res0: Int = -182 {code} > PolynomialExpansion.choose is prone to integer overflow > > > Key: SPARK-17027 > URL: https://issues.apache.org/jira/browse/SPARK-17027 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz >Priority: Minor
[jira] [Assigned] (SPARK-17028) Backport SI-9734 for Scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-17028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17028: Assignee: Apache Spark > Backport SI-9734 for Scala 2.10 > --- > > Key: SPARK-17028 > URL: https://issues.apache.org/jira/browse/SPARK-17028 > Project: Spark > Issue Type: Bug > Components: Spark Shell > Environment: Scala 2.10 >Reporter: Shixiong Zhu >Assignee: Apache Spark > > SI-9734 will be included in Scala 2.11.9. However, we still need to backport > it to Spark Scala 2.10 Shell manually.
[jira] [Commented] (SPARK-17028) Backport SI-9734 for Scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-17028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418068#comment-15418068 ] Apache Spark commented on SPARK-17028: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/14611 > Backport SI-9734 for Scala 2.10 > --- > > Key: SPARK-17028 > URL: https://issues.apache.org/jira/browse/SPARK-17028 > Project: Spark > Issue Type: Bug > Components: Spark Shell > Environment: Scala 2.10 >Reporter: Shixiong Zhu
[jira] [Resolved] (SPARK-17014) arithmetic.sql
[ https://issues.apache.org/jira/browse/SPARK-17014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17014. --- Resolution: Invalid Believe this was opened in error as a duplicate > arithmetic.sql > -- > > Key: SPARK-17014 > URL: https://issues.apache.org/jira/browse/SPARK-17014 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Peter Lee >
[jira] [Assigned] (SPARK-17028) Backport SI-9734 for Scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-17028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17028: Assignee: (was: Apache Spark) > Backport SI-9734 for Scala 2.10 > --- > > Key: SPARK-17028 > URL: https://issues.apache.org/jira/browse/SPARK-17028 > Project: Spark > Issue Type: Bug > Components: Spark Shell > Environment: Scala 2.10 >Reporter: Shixiong Zhu
[jira] [Commented] (SPARK-17027) PolynomialExpansion.choose is prone to integer overflow
[ https://issues.apache.org/jira/browse/SPARK-17027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418065#comment-15418065 ] Sean Owen commented on SPARK-17027: --- Is the problem in the naive calculation of n choose k? {code} private def choose(n: Int, k: Int): Int = { Range(n, n - k, -1).product / Range(k, 1, -1).product } {code} Let's just call http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/util/CombinatoricsUtils.html#binomialCoefficient(int,%20int) > PolynomialExpansion.choose is prone to integer overflow > > > Key: SPARK-17027 > URL: https://issues.apache.org/jira/browse/SPARK-17027 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz >Priority: Minor
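To illustrate the failure mode and the incremental alternative suggested in the issue description, here is a self-contained Scala sketch. `chooseNaive` mirrors the implementation quoted above; `chooseSafe` is a hypothetical name, not Spark's or Commons Math's API:

```scala
// Naive n-choose-k, mirroring the quoted implementation: the numerator
// and denominator products can overflow Int long before the final
// (small) result does. The issue reports choose(14, 10) coming back
// negative for exactly this reason.
def chooseNaive(n: Int, k: Int): Int =
  Range(n, n - k, -1).product / Range(k, 1, -1).product

// Overflow-resistant variant: multiply and divide incrementally.
// After step i the accumulator holds C(n - k + i, i) exactly, so every
// division is exact and intermediate values stay close to the result.
def chooseSafe(n: Int, k: Int): Long = {
  var acc = 1L
  for (i <- 1 to k) acc = acc * (n - k + i) / i
  acc
}

// C(14, 10) = 1001; the naive Int version cannot produce this because
// Range(14, 4, -1).product alone exceeds Int.MaxValue.
val c = chooseSafe(14, 10) // 1001
```

Commons Math's `CombinatoricsUtils.binomialCoefficient`, as suggested above, uses a similar exact incremental scheme and additionally throws on `long` overflow.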
[jira] [Created] (SPARK-17028) Backport SI-9734 for Scala 2.10
Shixiong Zhu created SPARK-17028: Summary: Backport SI-9734 for Scala 2.10 Key: SPARK-17028 URL: https://issues.apache.org/jira/browse/SPARK-17028 Project: Spark Issue Type: Bug Components: Spark Shell Environment: Scala 2.10 Reporter: Shixiong Zhu
[jira] [Created] (SPARK-17027) PolynomialExpansion.choose is prone to integer overflow
Maciej Szymkiewicz created SPARK-17027: -- Summary: PolynomialExpansion.choose is prone to integer overflow Key: SPARK-17027 URL: https://issues.apache.org/jira/browse/SPARK-17027 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.0.0, 1.6.0 Reporter: Maciej Szymkiewicz Priority: Minor
[jira] [Resolved] (SPARK-17022) Potential deadlock in driver handling message
[ https://issues.apache.org/jira/browse/SPARK-17022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-17022. Resolution: Fixed Assignee: Tao Wang Fix Version/s: 2.1.0 2.0.1 > Potential deadlock in driver handling message > - > > Key: SPARK-17022 > URL: https://issues.apache.org/jira/browse/SPARK-17022 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 2.0.0 >Reporter: Tao Wang >Assignee: Tao Wang >Priority: Critical > Fix For: 2.0.1, 2.1.0 > > > Suppose t1 < t2 < t3. > At t1, someone calls YarnSchedulerBackend.doRequestTotalExecutors from one of three functions: CoarseGrainedSchedulerBackend.killExecutors, CoarseGrainedSchedulerBackend.requestTotalExecutors or CoarseGrainedSchedulerBackend.requestExecutors, all of which hold the lock `CoarseGrainedSchedulerBackend`. > YarnSchedulerBackend.doRequestTotalExecutors then sends a RequestExecutors message to `yarnSchedulerEndpoint` and waits for a reply. > At t2, someone sends a RemoveExecutor message to `yarnSchedulerEndpoint` and it is received by the endpoint. > At t3, the RequestExecutors message sent at t1 is received by the endpoint, so the endpoint handles RemoveExecutor before RequestExecutors. > When handling RemoveExecutor, the endpoint sends the same message to `driverEndpoint` and waits for a reply. > In `driverEndpoint`, handling that message requires the lock `CoarseGrainedSchedulerBackend`, which has been held since t1. > This causes a deadlock. > We encountered this issue in our deployment: it blocked the driver from handling any messages until both messages timed out.
[jira] [Resolved] (SPARK-16868) Executor will be both dead and alive when this executor reregister itself to driver.
[ https://issues.apache.org/jira/browse/SPARK-16868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-16868. Resolution: Fixed Assignee: carlmartin Fix Version/s: 2.1.0 > Executor will be both dead and alive when this executor reregister itself to > driver. > > > Key: SPARK-16868 > URL: https://issues.apache.org/jira/browse/SPARK-16868 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: carlmartin >Assignee: carlmartin >Priority: Minor > Fix For: 2.1.0 > > Attachments: 2016-8-3 15-41-47.jpg, 2016-8-3 15-51-13.jpg > > > In a rare condition, an executor will register its block manager twice. > !https://issues.apache.org/jira/secure/attachment/12821794/2016-8-3%2015-41-47.jpg! > When it is unregistered from BlockManagerMaster, the driver marks it as "DEAD" in the executors WebUI. > But when the heartbeat re-registers the block manager, this executor also gets a second status, "Active". > !https://issues.apache.org/jira/secure/attachment/12821795/2016-8-3%2015-51-13.jpg!
[jira] [Resolved] (SPARK-13602) o.a.s.deploy.worker.DriverRunner may leak the driver processes
[ https://issues.apache.org/jira/browse/SPARK-13602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-13602. Resolution: Fixed Assignee: Bryan Cutler Fix Version/s: 2.1.0 > o.a.s.deploy.worker.DriverRunner may leak the driver processes > -- > > Key: SPARK-13602 > URL: https://issues.apache.org/jira/browse/SPARK-13602 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Shixiong Zhu >Assignee: Bryan Cutler > Fix For: 2.1.0 > > > If Worker calls "System.exit", DriverRunner will not kill the driver > processes. We should add a shutdown hook in DriverRunner like > o.a.s.deploy.worker.ExecutorRunner
[jira] [Assigned] (SPARK-17026) warning msg in MulticlassMetricsSuite
[ https://issues.apache.org/jira/browse/SPARK-17026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17026: Assignee: Apache Spark > warning msg in MulticlassMetricsSuite > - > > Key: SPARK-17026 > URL: https://issues.apache.org/jira/browse/SPARK-17026 > Project: Spark > Issue Type: Improvement >Reporter: Xin Ren >Assignee: Apache Spark >Priority: Trivial
[jira] [Commented] (SPARK-17026) warning msg in MulticlassMetricsSuite
[ https://issues.apache.org/jira/browse/SPARK-17026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417995#comment-15417995 ] Apache Spark commented on SPARK-17026: -- User 'keypointt' has created a pull request for this issue: https://github.com/apache/spark/pull/14610 > warning msg in MulticlassMetricsSuite > - > > Key: SPARK-17026 > URL: https://issues.apache.org/jira/browse/SPARK-17026 > Project: Spark > Issue Type: Improvement >Reporter: Xin Ren >Priority: Trivial
[jira] [Assigned] (SPARK-17026) warning msg in MulticlassMetricsSuite
[ https://issues.apache.org/jira/browse/SPARK-17026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17026: Assignee: (was: Apache Spark) > warning msg in MulticlassMetricsSuite > - > > Key: SPARK-17026 > URL: https://issues.apache.org/jira/browse/SPARK-17026 > Project: Spark > Issue Type: Improvement >Reporter: Xin Ren >Priority: Trivial
[jira] [Created] (SPARK-17026) warning msg in MulticlassMetricsSuite
Xin Ren created SPARK-17026: --- Summary: warning msg in MulticlassMetricsSuite Key: SPARK-17026 URL: https://issues.apache.org/jira/browse/SPARK-17026 Project: Spark Issue Type: Improvement Reporter: Xin Ren Priority: Trivial
[jira] [Commented] (SPARK-17013) negative numeric literal parsing
[ https://issues.apache.org/jira/browse/SPARK-17013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417941#comment-15417941 ] Apache Spark commented on SPARK-17013: -- User 'petermaxlee' has created a pull request for this issue: https://github.com/apache/spark/pull/14608 > negative numeric literal parsing > > > Key: SPARK-17013 > URL: https://issues.apache.org/jira/browse/SPARK-17013 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Peter Lee > > As found in https://github.com/apache/spark/pull/14592/files#r74367410, Spark > 2.0 parses negative numeric literals as the unary minus of positive literals. > This introduces problems for the edge cases such as -9223372036854775809 > being parsed as decimal instead of bigint.
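The boundary described above can be seen without Spark: if a parser treats `-x` as unary minus applied to the positive literal `x`, the most negative bigint values break, because the magnitude of `Long.MinValue` is one greater than `Long.MaxValue`. A minimal Scala sketch of that arithmetic (an illustration of the adjacent boundary case, not Spark's parser):

```scala
// The positive part of "-9223372036854775808" does not fit in a Long:
// Long.MaxValue is 9223372036854775807, so parsing the unsigned digits
// first overflows before the unary minus is ever applied, even though
// the negated value is exactly Long.MinValue.
val positivePart = BigInt("9223372036854775808")

val fitsAsPositiveLong = positivePart <= BigInt(Long.MaxValue)  // false
val fitsOnceNegated    = -positivePart >= BigInt(Long.MinValue) // true
```

A parser that widens such literals to decimal loses the bigint type even when the negated value would have fit, which is the class of edge case the pull request addresses.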
[jira] [Commented] (SPARK-3577) Add task metric to report spill time
[ https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417929#comment-15417929 ] Kay Ousterhout commented on SPARK-3577: --- I believe spill time will currently be displayed as part of the task runtime, but not as part of scheduler delay. The scheduler delay is calculated by looking at the difference between two values: (1) The time that the task was running on the executor (2) The time from when the scheduler sent information about the task to the executor (so the executor could run the task) until the scheduler received a message that the task completed. Scheduler delay is (2) - (1). Usually when it's high, it's because of queueing delays in the scheduler that are either delaying the task getting sent to the executor (e.g., because the scheduler has a long queue of other tasks that need to be launched, or because tasks are large so take a while to send over the network) or that are delaying the task completion message getting back to the scheduler (which can happen when the rate of task launch is high -- greater than 1K or so task launches / second). > Add task metric to report spill time > > > Key: SPARK-3577 > URL: https://issues.apache.org/jira/browse/SPARK-3577 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.1.0 >Reporter: Kay Ousterhout >Priority: Minor > > The {{ExternalSorter}} passes its own {{ShuffleWriteMetrics}} into > {{ExternalSorter}}. The write time recorded in those metrics is never used. > We should probably add task metrics to report this spill time, since for > shuffles, this would have previously been reported as part of shuffle write > time (with the original hash-based sorter).
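The subtraction in (2) - (1) can be written down concretely. A hedged Scala sketch, simplified from what StagePage.getSchedulerDelay does; the parameter names here are illustrative, not Spark's exact metric fields:

```scala
// (2) the wall-clock time the scheduler observed for the task, from
//     sending it to the executor until the completion message arrived;
// (1) the time actually accounted for on the executor, including task
//     deserialization and result serialization.
// Scheduler delay is whatever is left over, clamped at zero.
def schedulerDelay(taskFinishMs: Long,
                   taskLaunchMs: Long,
                   executorRunMs: Long,
                   deserializeMs: Long,
                   resultSerializeMs: Long): Long = {
  val roundTrip  = taskFinishMs - taskLaunchMs                        // (2)
  val onExecutor = executorRunMs + deserializeMs + resultSerializeMs  // (1)
  math.max(0L, roundTrip - onExecutor)
}
```

Under this accounting, any time not captured by one of the executor-side metrics (such as unmetered spill time, in the question below) would surface in the leftover term rather than in its own column.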
[jira] [Commented] (SPARK-16905) Support SQL DDL: MSCK REPAIR TABLE
[ https://issues.apache.org/jira/browse/SPARK-16905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417928#comment-15417928 ] Apache Spark commented on SPARK-16905: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/14607 > Support SQL DDL: MSCK REPAIR TABLE > -- > > Key: SPARK-16905 > URL: https://issues.apache.org/jira/browse/SPARK-16905 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.1, 2.1.0 > > > MSCK REPAIR TABLE can be used to recover the partitions in the metastore based on the partitions in the file system. > Another syntax is: > ALTER TABLE table RECOVER PARTITIONS
[jira] [Resolved] (SPARK-17018) literals.sql for testing literal parsing
[ https://issues.apache.org/jira/browse/SPARK-17018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-17018. - Resolution: Fixed Assignee: Peter Lee Fix Version/s: 2.1.0 2.0.1 > literals.sql for testing literal parsing > > > Key: SPARK-17018 > URL: https://issues.apache.org/jira/browse/SPARK-17018 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Peter Lee >Assignee: Peter Lee > Fix For: 2.0.1, 2.1.0 >
[jira] [Commented] (SPARK-3577) Add task metric to report spill time
[ https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417916#comment-15417916 ] Tzach Zohar commented on SPARK-3577: Does this mean that currently, spill time will be displayed as part of the *Scheduler Delay*? Scheduler Delay is calculated pretty much as "everything that isn't specifically measured" (see [StagePage.getSchedulerDelay|https://github.com/apache/spark/blob/v2.0.0/core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala#L770]), so I'm wondering if indeed it might include spill time if it's not included anywhere else. If so - this might explain long Scheduler Delay values which would be hard to make sense of otherwise (which I think is what I'm seeing...). Thanks > Add task metric to report spill time > > > Key: SPARK-3577 > URL: https://issues.apache.org/jira/browse/SPARK-3577 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.1.0 >Reporter: Kay Ousterhout >Priority: Minor
[jira] [Commented] (SPARK-16784) Configurable log4j settings
[ https://issues.apache.org/jira/browse/SPARK-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417889#comment-15417889 ] Sean Owen commented on SPARK-16784: --- Oh, I really meant {{log4j.configuration}} to specify your own config. > Configurable log4j settings > --- > > Key: SPARK-16784 > URL: https://issues.apache.org/jira/browse/SPARK-16784 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Michael Gummelt > > I often want to change the logging configuration on a single spark job. This > is easy in client mode. I just modify log4j.properties. It's difficult in > cluster mode, because I need to modify the log4j.properties in the > distribution in which the driver runs. I'd like a way of setting this > dynamically, such as a java system property. Some brief searching showed > that log4j doesn't seem to accept such a property, but I'd like to open up > this idea for further comment. Maybe we can find a solution.
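For reference, the usual way to apply the {{log4j.configuration}} suggestion in cluster mode is to ship the properties file with the job and point the driver's (and, if desired, the executors') JVM at it via the extra-Java-options settings. A sketch, assuming a custom `log4j.properties` in the submission directory; the remaining submit arguments are elided:

```shell
spark-submit \
  --files log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
  ...
```

`--files` distributes the file into each container's working directory, so the relative `file:` URL resolves on the remote node. This is a configuration sketch of the standard approach, not a change to log4j itself.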
[jira] [Commented] (SPARK-16993) model.transform without label column in random forest regression
[ https://issues.apache.org/jira/browse/SPARK-16993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417892#comment-15417892 ] Sean Owen commented on SPARK-16993: --- You would need to show some code or more about the error. > model.transform without label column in random forest regression > > > Key: SPARK-16993 > URL: https://issues.apache.org/jira/browse/SPARK-16993 > Project: Spark > Issue Type: Question > Components: Java API, ML >Reporter: Dulaj Rajitha > > I need to use a separate data set for prediction (not the train/test split shown in the examples). > Those data do not have a label column (since they are the data whose labels need to be predicted), > but model.transform reports that the label column is missing: > org.apache.spark.sql.AnalysisException: cannot resolve 'label' given input > columns: [id,features,prediction]
[jira] [Comment Edited] (SPARK-16784) Configurable log4j settings
[ https://issues.apache.org/jira/browse/SPARK-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417857#comment-15417857 ] Michael Gummelt edited comment on SPARK-16784 at 8/11/16 8:11 PM: -- {{log4j.debug=true}} only results in log4j printing its internal debugging messages (e.g. config file location, appenders, etc.). It doesn't turn on debug logging for the application. was (Author: mgummelt): {{log4j.debug=true}} only results in log4j printing its debugging messages. It doesn't turn on debug logging for the application. > Configurable log4j settings > --- > > Key: SPARK-16784 > URL: https://issues.apache.org/jira/browse/SPARK-16784 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Michael Gummelt > > I often want to change the logging configuration on a single spark job. This > is easy in client mode. I just modify log4j.properties. It's difficult in > cluster mode, because I need to modify the log4j.properties in the > distribution in which the driver runs. I'd like a way of setting this > dynamically, such as a java system property. Some brief searching showed > that log4j doesn't seem to accept such a property, but I'd like to open up > this idea for further comment. Maybe we can find a solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16784) Configurable log4j settings
[ https://issues.apache.org/jira/browse/SPARK-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417856#comment-15417856 ] Michael Gummelt commented on SPARK-16784: - `log4j.debug=true` only results in log4j printing its debugging messages. It doesn't turn on debug logging for the application. > Configurable log4j settings > --- > > Key: SPARK-16784 > URL: https://issues.apache.org/jira/browse/SPARK-16784 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Michael Gummelt > > I often want to change the logging configuration on a single spark job. This > is easy in client mode. I just modify log4j.properties. It's difficult in > cluster mode, because I need to modify the log4j.properties in the > distribution in which the driver runs. I'd like a way of setting this > dynamically, such as a java system property. Some brief searching showed > that log4j doesn't seem to accept such a property, but I'd like to open up > this idea for further comment. Maybe we can find a solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-16784) Configurable log4j settings
[ https://issues.apache.org/jira/browse/SPARK-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt reopened SPARK-16784: - `log4j.debug=true` only results in log4j printing its debugging messages. It doesn't turn on debug logging for the application. > Configurable log4j settings > --- > > Key: SPARK-16784 > URL: https://issues.apache.org/jira/browse/SPARK-16784 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Michael Gummelt > > I often want to change the logging configuration on a single spark job. This > is easy in client mode. I just modify log4j.properties. It's difficult in > cluster mode, because I need to modify the log4j.properties in the > distribution in which the driver runs. I'd like a way of setting this > dynamically, such as a java system property. Some brief searching showed > that log4j doesn't seem to accept such a property, but I'd like to open up > this idea for further comment. Maybe we can find a solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16784) Configurable log4j settings
[ https://issues.apache.org/jira/browse/SPARK-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417857#comment-15417857 ] Michael Gummelt edited comment on SPARK-16784 at 8/11/16 8:10 PM: -- {{log4j.debug=true}} only results in log4j printing its debugging messages. It doesn't turn on debug logging for the application. was (Author: mgummelt): `log4j.debug=true` only results in log4j printing its debugging messages. It doesn't turn on debug logging for the application. > Configurable log4j settings > --- > > Key: SPARK-16784 > URL: https://issues.apache.org/jira/browse/SPARK-16784 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Michael Gummelt > > I often want to change the logging configuration on a single spark job. This > is easy in client mode. I just modify log4j.properties. It's difficult in > cluster mode, because I need to modify the log4j.properties in the > distribution in which the driver runs. I'd like a way of setting this > dynamically, such as a java system property. Some brief searching showed > that log4j doesn't seem to accept such a property, but I'd like to open up > this idea for further comment. Maybe we can find a solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-16784) Configurable log4j settings
[ https://issues.apache.org/jira/browse/SPARK-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated SPARK-16784: Comment: was deleted (was: `log4j.debug=true` only results in log4j printing its debugging messages. It doesn't turn on debug logging for the application.) > Configurable log4j settings > --- > > Key: SPARK-16784 > URL: https://issues.apache.org/jira/browse/SPARK-16784 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Michael Gummelt > > I often want to change the logging configuration on a single spark job. This > is easy in client mode. I just modify log4j.properties. It's difficult in > cluster mode, because I need to modify the log4j.properties in the > distribution in which the driver runs. I'd like a way of setting this > dynamically, such as a java system property. Some brief searching showed > that log4j doesn't seem to accept such a property, but I'd like to open up > this idea for further comment. Maybe we can find a solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16993) model.transform without label column in random forest regression
[ https://issues.apache.org/jira/browse/SPARK-16993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417847#comment-15417847 ] Dulaj Rajitha commented on SPARK-16993: --- But the thing is, if I add a dummy column as the label column, the process runs fine. I cannot continue without adding a dummy label column to the data set that needs the prediction. > model.transform without label column in random forest regression > > > Key: SPARK-16993 > URL: https://issues.apache.org/jira/browse/SPARK-16993 > Project: Spark > Issue Type: Question > Components: Java API, ML >Reporter: Dulaj Rajitha > > I need to use a separate data set to prediction (Not as show in example's > training data split). > But those data do not have the label column. (Since these data are the data > that needs to be predict the label). > but model.transform is informing label column is missing. > org.apache.spark.sql.AnalysisException: cannot resolve 'label' given input > columns: [id,features,prediction]
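The dummy-column workaround described above can be sketched in PySpark. This is a hedged sketch, not code from the ticket: the fitted {{model}}, the unlabeled DataFrame {{unlabeled}}, and the column names are assumed to already exist, and "label" must match the model's {{labelCol}}.

```python
# Hedged sketch of the dummy-label workaround, assuming a hypothetical
# fitted model `model` and unlabeled DataFrame `unlabeled`. A constant
# placeholder satisfies the schema check; transform() should not read
# the placeholder values when producing predictions.
from pyspark.sql.functions import lit

with_dummy_label = unlabeled.withColumn("label", lit(0.0))
predictions = model.transform(with_dummy_label)
predictions.select("id", "prediction").show()
```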
[jira] [Resolved] (SPARK-17024) Weird behaviour of the DataFrame when a column name contains dots.
[ https://issues.apache.org/jira/browse/SPARK-17024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Iaroslav Zeigerman resolved SPARK-17024. Resolution: Duplicate > Weird behaviour of the DataFrame when a column name contains dots. > -- > > Key: SPARK-17024 > URL: https://issues.apache.org/jira/browse/SPARK-17024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Iaroslav Zeigerman > > When a column name contains dots and one of the segment in a name is the same > as other column's name, Spark treats this column as a nested structure, > although the actual type of column is String/Int/etc. Example: > {code} > val df = sqlContext.createDataFrame(Seq( > ("user1", "task1"), > ("user2", "task2") > )).toDF("user", "user.task") > {code} > Two columns "user" and "user.task". Both of them are string, and the schema > resolution seems to be correct: > {noformat} > root > |-- user: string (nullable = true) > |-- user.task: string (nullable = true) > {noformat} > But when I'm trying to query this DataFrame like i.e.: > {code} > df.select(df("user"), df("user.task")) > {code} > Spark throws an exception "Can't extract value from user#2;" > It happens during the resolution of the LogicalPlan while processing the > "user.task" column. 
> Here is the full stacktrace: > {noformat} > Can't extract value from user#2; > org.apache.spark.sql.AnalysisException: Can't extract value from user#2; > at > org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:276) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:275) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:275) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:191) > at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151) > at org.apache.spark.sql.DataFrame.col(DataFrame.scala:708) > at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:696) > {noformat} > Is this actually an expected behaviour? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17024) Weird behaviour of the DataFrame when a column name contains dots.
[ https://issues.apache.org/jira/browse/SPARK-17024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417829#comment-15417829 ] Sean Owen commented on SPARK-17024: --- There are many issues that sound like this, like https://issues.apache.org/jira/browse/SPARK-15230 Can you try 2.0? I think this is a duplicate of several, so also please search JIRA. > Weird behaviour of the DataFrame when a column name contains dots. > -- > > Key: SPARK-17024 > URL: https://issues.apache.org/jira/browse/SPARK-17024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Iaroslav Zeigerman > > When a column name contains dots and one of the segment in a name is the same > as other column's name, Spark treats this column as a nested structure, > although the actual type of column is String/Int/etc. Example: > {code} > val df = sqlContext.createDataFrame(Seq( > ("user1", "task1"), > ("user2", "task2") > )).toDF("user", "user.task") > {code} > Two columns "user" and "user.task". Both of them are string, and the schema > resolution seems to be correct: > {noformat} > root > |-- user: string (nullable = true) > |-- user.task: string (nullable = true) > {noformat} > But when I'm trying to query this DataFrame like i.e.: > {code} > df.select(df("user"), df("user.task")) > {code} > Spark throws an exception "Can't extract value from user#2;" > It happens during the resolution of the LogicalPlan while processing the > "user.task" column. 
> Here is the full stacktrace: > {noformat} > Can't extract value from user#2; > org.apache.spark.sql.AnalysisException: Can't extract value from user#2; > at > org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:276) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:275) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:275) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:191) > at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151) > at org.apache.spark.sql.DataFrame.col(DataFrame.scala:708) > at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:696) > {noformat} > Is this actually an expected behaviour? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417788#comment-15417788 ] Nicholas Chammas edited comment on SPARK-17025 at 8/11/16 7:33 PM: --- cc [~josephkb], [~mengxr] I guess a first step would be to add a {{_to_java}} method to the base {{Transformer}} class that simply raises {{NotImplementedError}}. Ultimately though, is there a way to have the base class handle this work automatically, or do custom transformers need to each implement their own {{_to_java}} method? was (Author: nchammas): cc [~josephkb], [~mengxr] I guess a first step be to add a {{_to_java}} method to the base Transformer class that simply raises {{NotImplementedError}}. Is there a way to have the base class handle this work automatically, or do custom transformers need to each implement their own {{_to_java}} method? > Cannot persist PySpark ML Pipeline model that includes custom Transformer > - > > Key: SPARK-17025 > URL: https://issues.apache.org/jira/browse/SPARK-17025 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Following the example in [this Databricks blog > post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html] > under "Python tuning", I'm trying to save an ML Pipeline model. > This pipeline, however, includes a custom transformer. When I try to save the > model, the operation fails because the custom transformer doesn't have a > {{_to_java}} attribute. 
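The first step Nicholas suggests can be sketched in plain Python. This is a hypothetical illustration (the class names mirror the traceback, not real pyspark source): a base-class {{_to_java}} that raises {{NotImplementedError}} turns the confusing {{AttributeError}} into an explicit message.

```python
# Hedged sketch: give the base Transformer a _to_java that raises
# NotImplementedError, so a custom transformer that has not implemented
# persistence fails with a clear message instead of an AttributeError.
# Class names are illustrative, mirroring the traceback in this issue.
class Transformer:
    def _to_java(self):
        raise NotImplementedError(
            "{} does not implement _to_java; custom transformers must "
            "provide it to support persistence".format(type(self).__name__))

class PeoplePairFeaturizer(Transformer):
    """A custom transformer without its own _to_java."""
    pass

try:
    PeoplePairFeaturizer()._to_java()
except NotImplementedError as exc:
    print(exc)
```

Whether the base class can instead generate this bridging automatically is exactly the open question in the comment above; the sketch only covers the explicit-error fallback.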
> {code} > Traceback (most recent call last): > File ".../file.py", line 56, in > model.bestModel.save('model') > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 222, in save > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 217, in write > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py", > line 93, in __init__ > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 254, in _to_java > AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java' > {code} > Looking at the source code for > [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py], > I see that not even the base Transformer class has such an attribute. > I'm assuming this is missing functionality that is intended to be patched up > (i.e. [like > this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]). > I'm not sure if there is an existing JIRA for this (my searches didn't turn > up clear results). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17024) Weird behaviour of the DataFrame when a column name contains dots.
[ https://issues.apache.org/jira/browse/SPARK-17024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417807#comment-15417807 ] Iaroslav Zeigerman commented on SPARK-17024: If I query this way (with backquotes for "user.task"): {code} df.select(df("user"), df("`user.task`")) {code} leaving the rest of code unchanged it works fine. > Weird behaviour of the DataFrame when a column name contains dots. > -- > > Key: SPARK-17024 > URL: https://issues.apache.org/jira/browse/SPARK-17024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Iaroslav Zeigerman > > When a column name contains dots and one of the segment in a name is the same > as other column's name, Spark treats this column as a nested structure, > although the actual type of column is String/Int/etc. Example: > {code} > val df = sqlContext.createDataFrame(Seq( > ("user1", "task1"), > ("user2", "task2") > )).toDF("user", "user.task") > {code} > Two columns "user" and "user.task". Both of them are string, and the schema > resolution seems to be correct: > {noformat} > root > |-- user: string (nullable = true) > |-- user.task: string (nullable = true) > {noformat} > But when I'm trying to query this DataFrame like i.e.: > {code} > df.select(df("user"), df("user.task")) > {code} > Spark throws an exception "Can't extract value from user#2;" > It happens during the resolution of the LogicalPlan while processing the > "user.task" column. 
> Here is the full stacktrace: > {noformat} > Can't extract value from user#2; > org.apache.spark.sql.AnalysisException: Can't extract value from user#2; > at > org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:276) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:275) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:275) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:191) > at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151) > at org.apache.spark.sql.DataFrame.col(DataFrame.scala:708) > at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:696) > {noformat} > Is this actually an expected behaviour? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417788#comment-15417788 ] Nicholas Chammas edited comment on SPARK-17025 at 8/11/16 7:27 PM: --- cc [~josephkb], [~mengxr] I guess a first step be to add a {{_to_java}} method to the base Transformer class that simply raises {{NotImplementedError}}. Is there a way to have the base class handle this work automatically, or do custom transformers need to each implement their own {{_to_java}} method? was (Author: nchammas): cc [~josephkb] [~mengxr] > Cannot persist PySpark ML Pipeline model that includes custom Transformer > - > > Key: SPARK-17025 > URL: https://issues.apache.org/jira/browse/SPARK-17025 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Following the example in [this Databricks blog > post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html] > under "Python tuning", I'm trying to save an ML Pipeline model. > This pipeline, however, includes a custom transformer. When I try to save the > model, the operation fails because the custom transformer doesn't have a > {{_to_java}} attribute. 
> {code} > Traceback (most recent call last): > File ".../file.py", line 56, in > model.bestModel.save('model') > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 222, in save > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 217, in write > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py", > line 93, in __init__ > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 254, in _to_java > AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java' > {code} > Looking at the source code for > [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py], > I see that not even the base Transformer class has such an attribute. > I'm assuming this is missing functionality that is intended to be patched up > (i.e. [like > this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]). > I'm not sure if there is an existing JIRA for this (my searches didn't turn > up clear results). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417788#comment-15417788 ] Nicholas Chammas commented on SPARK-17025: -- cc [~josephkb] [~mengxr] > Cannot persist PySpark ML Pipeline model that includes custom Transformer > - > > Key: SPARK-17025 > URL: https://issues.apache.org/jira/browse/SPARK-17025 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Following the example in [this Databricks blog > post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html] > under "Python tuning", I'm trying to save an ML Pipeline model. > This pipeline, however, includes a custom transformer. When I try to save the > model, the operation fails because the custom transformer doesn't have a > {{_to_java}} attribute. > {code} > Traceback (most recent call last): > File ".../file.py", line 56, in > model.bestModel.save('model') > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 222, in save > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 217, in write > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py", > line 93, in __init__ > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 254, in _to_java > AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java' > {code} > Looking at the source code for > [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py], > I see that not even the base Transformer class has such an attribute. > I'm assuming this is missing functionality that is intended to be patched up > (i.e. 
[like > this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]). > I'm not sure if there is an existing JIRA for this (my searches didn't turn > up clear results). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17024) Weird behaviour of the DataFrame when a column name contains dots.
[ https://issues.apache.org/jira/browse/SPARK-17024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417789#comment-15417789 ] Iaroslav Zeigerman commented on SPARK-17024: If this behaviour is expected, is there a way to disable it? > Weird behaviour of the DataFrame when a column name contains dots. > -- > > Key: SPARK-17024 > URL: https://issues.apache.org/jira/browse/SPARK-17024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Iaroslav Zeigerman > > When a column name contains dots and one of the segment in a name is the same > as other column's name, Spark treats this column as a nested structure, > although the actual type of column is String/Int/etc. Example: > {code} > val df = sqlContext.createDataFrame(Seq( > ("user1", "task1"), > ("user2", "task2") > )).toDF("user", "user.task") > {code} > Two columns "user" and "user.task". Both of them are string, and the schema > resolution seems to be correct: > {noformat} > root > |-- user: string (nullable = true) > |-- user.task: string (nullable = true) > {noformat} > But when I'm trying to query this DataFrame like i.e.: > {code} > df.select(df("user"), df("user.task")) > {code} > Spark throws an exception "Can't extract value from user#2;" > It happens during the resolution of the LogicalPlan while processing the > "user.task" column. 
> Here is the full stacktrace: > {noformat} > Can't extract value from user#2; > org.apache.spark.sql.AnalysisException: Can't extract value from user#2; > at > org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:276) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:275) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:275) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:191) > at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151) > at org.apache.spark.sql.DataFrame.col(DataFrame.scala:708) > at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:696) > {noformat} > Is this actually an expected behaviour? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer
Nicholas Chammas created SPARK-17025: Summary: Cannot persist PySpark ML Pipeline model that includes custom Transformer Key: SPARK-17025 URL: https://issues.apache.org/jira/browse/SPARK-17025 Project: Spark Issue Type: Bug Components: ML, PySpark Affects Versions: 2.0.0 Reporter: Nicholas Chammas Priority: Minor Following the example in [this Databricks blog post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html] under "Python tuning", I'm trying to save an ML Pipeline model. This pipeline, however, includes a custom transformer. When I try to save the model, the operation fails because the custom transformer doesn't have a {{_to_java}} attribute. {code} Traceback (most recent call last): File ".../file.py", line 56, in model.bestModel.save('model') File "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 222, in save File "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 217, in write File "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py", line 93, in __init__ File "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 254, in _to_java AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java' {code} Looking at the source code for [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py], I see that not even the base Transformer class has such an attribute. I'm assuming this is missing functionality that is intended to be patched up (i.e. [like this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]). I'm not sure if there is an existing JIRA for this (my searches didn't turn up clear results). 
[jira] [Updated] (SPARK-17024) Weird behaviour of the DataFrame when a column name contains dots.
[ https://issues.apache.org/jira/browse/SPARK-17024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Iaroslav Zeigerman updated SPARK-17024: --- Summary: Weird behaviour of the DataFrame when a column name contains dots. (was: Weird behaviour of the DataFrame when the column name contains dots.) > Weird behaviour of the DataFrame when a column name contains dots. > -- > > Key: SPARK-17024 > URL: https://issues.apache.org/jira/browse/SPARK-17024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Iaroslav Zeigerman > > When a column name contains dots and one of the segment in a name is the same > as other column's name, Spark treats this column as a nested structure, > although the actual type of column is String/Int/etc. Example: > {code} > val df = sqlContext.createDataFrame(Seq( > ("user1", "task1"), > ("user2", "task2") > )).toDF("user", "user.task") > {code} > Two columns "user" and "user.task". Both of them are string, and the schema > resolution seems to be correct: > {noformat} > root > |-- user: string (nullable = true) > |-- user.task: string (nullable = true) > {noformat} > But when I'm trying to query this DataFrame like i.e.: > {code} > df.select(df("user"), df("user.task")) > {code} > Spark throws an exception "Can't extract value from user#2;" > It happens during the resolution of the LogicalPlan and while processing the > "user.task" column. 
> Here is the full stacktrace: > {noformat} > Can't extract value from user#2; > org.apache.spark.sql.AnalysisException: Can't extract value from user#2; > at > org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:276) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:275) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:275) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:191) > at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151) > at org.apache.spark.sql.DataFrame.col(DataFrame.scala:708) > at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:696) > {noformat} > Is this actually an expected behaviour? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17024) Weird behaviour of the DataFrame when the column name contains dots.
Iaroslav Zeigerman created SPARK-17024: -- Summary: Weird behaviour of the DataFrame when the column name contains dots. Key: SPARK-17024 URL: https://issues.apache.org/jira/browse/SPARK-17024 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0 Reporter: Iaroslav Zeigerman When a column name contains dots and one of the segments in the name is the same as another column's name, Spark treats this column as a nested structure, although the actual type of the column is String/Int/etc. Example: {code} val df = sqlContext.createDataFrame(Seq( ("user1", "task1"), ("user2", "task2") )).toDF("user", "user.task") {code} Two columns, "user" and "user.task". Both of them are strings, and the schema resolution seems to be correct: {noformat} root |-- user: string (nullable = true) |-- user.task: string (nullable = true) {noformat} But when I try to query this DataFrame, e.g.: {code} df.select(df("user"), df("user.task")) {code} Spark throws the exception "Can't extract value from user#2;". It happens during the resolution of the LogicalPlan while processing the "user.task" column. 
Here is the full stacktrace: {noformat} Can't extract value from user#2; org.apache.spark.sql.AnalysisException: Can't extract value from user#2; at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:276) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:275) at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) at scala.collection.immutable.List.foldLeft(List.scala:84) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:275) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:191) at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151) at org.apache.spark.sql.DataFrame.col(DataFrame.scala:708) at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:696) {noformat} Is this actually an expected behaviour? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17024) Weird behaviour of the DataFrame when a column name contains dots.
[ https://issues.apache.org/jira/browse/SPARK-17024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Iaroslav Zeigerman updated SPARK-17024: --- Description: When a column name contains dots and one of the segments in the name is the same as another column's name, Spark treats this column as a nested structure, although the actual type of the column is String/Int/etc. Example: {code} val df = sqlContext.createDataFrame(Seq( ("user1", "task1"), ("user2", "task2") )).toDF("user", "user.task") {code} Two columns, "user" and "user.task". Both of them are strings, and the schema resolution seems to be correct: {noformat} root |-- user: string (nullable = true) |-- user.task: string (nullable = true) {noformat} But when I try to query this DataFrame, e.g.: {code} df.select(df("user"), df("user.task")) {code} Spark throws the exception "Can't extract value from user#2;". It happens during the resolution of the LogicalPlan while processing the "user.task" column. Here is the full stacktrace: {noformat} Can't extract value from user#2; org.apache.spark.sql.AnalysisException: Can't extract value from user#2; at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:276) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:275) at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) at scala.collection.immutable.List.foldLeft(List.scala:84) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:275) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:191) at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151) at org.apache.spark.sql.DataFrame.col(DataFrame.scala:708) at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:696) {noformat} Is this actually expected behaviour? 
was: When a column name contains dots and one of the segment in a name is the same as other column's name, Spark treats this column as a nested structure, although the actual type of column is String/Int/etc. Example: {code} val df = sqlContext.createDataFrame(Seq( ("user1", "task1"), ("user2", "task2") )).toDF("user", "user.task") {code} Two columns "user" and "user.task". Both of them are string, and the schema resolution seems to be correct: {noformat} root |-- user: string (nullable = true) |-- user.task: string (nullable = true) {noformat} But when I'm trying to query this DataFrame like i.e.: {code} df.select(df("user"), df("user.task")) {code} Spark throws an exception "Can't extract value from user#2;" It happens during the resolution of the LogicalPlan and while processing the "user.task" column. Here is the full stacktrace: {noformat} Can't extract value from user#2; org.apache.spark.sql.AnalysisException: Can't extract value from user#2; at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:276) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:275) at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) at scala.collection.immutable.List.foldLeft(List.scala:84) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:275) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:191) at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151) at org.apache.spark.sql.DataFrame.col(DataFrame.scala:708) at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:696) {noformat} Is this actually an expected behaviour? > Weird behaviour of the DataFrame when a column name contains dots. 
> -- > > Key: SPARK-17024 > URL: https://issues.apache.org/jira/browse/SPARK-17024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Iaroslav Zeigerman > > When a column name contains dots and one of the segment in a name is the same > as other column's name, Spark treats this column as a nested structure, > although the actual type of column is String/Int/etc. Example: > {code} > val df = sqlContext.createDataFrame(Seq( > ("user1", "task1"), > ("user2", "task2") > )).toDF("user", "user.task") > {code} > Two columns "user" and "user.task". Both of them are string, and the
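A workaround worth noting for readers hitting this: Spark SQL generally lets you escape a column name that contains dots with backticks, e.g. {code}df.select(df("user"), df("`user.task`")){code}, so the resolver takes the whole name literally instead of parsing the dot as nested-field access. The toy resolver below is a plain-Python illustration of the ambiguity only (invented names and structure, not Spark's actual implementation):

```python
# Toy name resolver (illustrative sketch, NOT Spark's code).
# It mimics the ambiguity: an unquoted dotted name is parsed as
# column + nested-field access, while a backtick-quoted name is
# taken literally as a single column name.
def resolve(name, columns):
    if name.startswith("`") and name.endswith("`") and len(name) > 1:
        literal = name[1:-1]
        if literal in columns:
            return ("column", literal)
        raise ValueError(f"no such column: {literal!r}")
    head, dot, rest = name.partition(".")
    if dot:
        # Unquoted dotted name: treated as field extraction from `head`,
        # which fails downstream when `head` is a plain string column.
        return ("extract", head, rest)
    if name in columns:
        return ("column", name)
    raise ValueError(f"cannot resolve: {name!r}")

cols = ["user", "user.task"]
print(resolve("user.task", cols))    # ('extract', 'user', 'task') -- the reported failure mode
print(resolve("`user.task`", cols))  # ('column', 'user.task') -- backticks avoid the split
```

The first call reproduces the reported failure mode (the dotted name is split and treated as field extraction from the string column "user"); the backtick-quoted form resolves to the intended column.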
[jira] [Resolved] (SPARK-17021) simplify the constructor parameters of QuantileSummaries
[ https://issues.apache.org/jira/browse/SPARK-17021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-17021. -- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14603 [https://github.com/apache/spark/pull/14603] > simplify the constructor parameters of QuantileSummaries > > > Key: SPARK-17021 > URL: https://issues.apache.org/jira/browse/SPARK-17021 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17015) group-by-ordinal and order-by-ordinal test cases
[ https://issues.apache.org/jira/browse/SPARK-17015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17015: Fix Version/s: 2.0.1 > group-by-ordinal and order-by-ordinal test cases > > > Key: SPARK-17015 > URL: https://issues.apache.org/jira/browse/SPARK-17015 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Peter Lee >Assignee: Peter Lee > Fix For: 2.0.1, 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17016) group-by/order-by ordinal should throw AnalysisException instead of UnresolvedException
[ https://issues.apache.org/jira/browse/SPARK-17016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17016: Fix Version/s: 2.0.1 > group-by/order-by ordinal should throw AnalysisException instead of > UnresolvedException > --- > > Key: SPARK-17016 > URL: https://issues.apache.org/jira/browse/SPARK-17016 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Peter Lee >Assignee: Peter Lee > Fix For: 2.0.1, 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17023) Update Kafka connector to use Kafka 0.10.0.1
[ https://issues.apache.org/jira/browse/SPARK-17023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17023: Assignee: (was: Apache Spark) > Update Kafka connector to use Kafka 0.10.0.1 > --- > > Key: SPARK-17023 > URL: https://issues.apache.org/jira/browse/SPARK-17023 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Luciano Resende >Priority: Minor > > Update Kafka connector to use the latest version of the Kafka dependencies (0.10.0.1) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17023) Update Kafka connector to use Kafka 0.10.0.1
[ https://issues.apache.org/jira/browse/SPARK-17023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17023: Assignee: Apache Spark > Update Kafka connector to use Kafka 0.10.0.1 > --- > > Key: SPARK-17023 > URL: https://issues.apache.org/jira/browse/SPARK-17023 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Luciano Resende >Assignee: Apache Spark >Priority: Minor > > Update Kafka connector to use the latest version of the Kafka dependencies (0.10.0.1) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17023) Update Kafka connector to use Kafka 0.10.0.1
[ https://issues.apache.org/jira/browse/SPARK-17023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417622#comment-15417622 ] Apache Spark commented on SPARK-17023: -- User 'lresende' has created a pull request for this issue: https://github.com/apache/spark/pull/14606 > Update Kafka connector to use Kafka 0.10.0.1 > --- > > Key: SPARK-17023 > URL: https://issues.apache.org/jira/browse/SPARK-17023 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Luciano Resende >Priority: Minor > > Update Kafka connector to use the latest version of the Kafka dependencies (0.10.0.1) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17023) Update Kafka connector to use Kafka 0.10.0.1
Luciano Resende created SPARK-17023: --- Summary: Update Kafka connector to use Kafka 0.10.0.1 Key: SPARK-17023 URL: https://issues.apache.org/jira/browse/SPARK-17023 Project: Spark Issue Type: Improvement Components: Build Reporter: Luciano Resende Priority: Minor Update Kafka connector to use the latest version of the Kafka dependencies (0.10.0.1) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16577) Add check-cran script to Jenkins
[ https://issues.apache.org/jira/browse/SPARK-16577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417595#comment-15417595 ] Shivaram Venkataraman commented on SPARK-16577: --- Good point - Let me check this with [~shaneknapp] who maintains our Jenkins cluster. We can also try --no-manual and see if that removes the PDF checking > Add check-cran script to Jenkins > > > Key: SPARK-16577 > URL: https://issues.apache.org/jira/browse/SPARK-16577 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > After we have fixed the warnings from the CRAN checks we should add this as a > part of the Jenkins build. > This depends on SPARK-16507 and SPARK-16508 being resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16519) Handle SparkR RDD generics that create warnings in R CMD check
[ https://issues.apache.org/jira/browse/SPARK-16519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417592#comment-15417592 ] Shivaram Venkataraman commented on SPARK-16519: --- Yeah I think the simplest thing to do is to just append RDD to the method names - i.e. unpersistRDD, collectRDD etc. Lets do this first while we continue to discuss things in SPARK-16611 > Handle SparkR RDD generics that create warnings in R CMD check > -- > > Key: SPARK-16519 > URL: https://issues.apache.org/jira/browse/SPARK-16519 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > One of the warnings we get from R CMD check is that RDD implementations of > some of the generics are not documented. These generics are shared between > RDD, DataFrames in SparkR. The list includes > {quote} > WARNING > Undocumented S4 methods: > generic 'cache' and siglist 'RDD' > generic 'collect' and siglist 'RDD' > generic 'count' and siglist 'RDD' > generic 'distinct' and siglist 'RDD' > generic 'first' and siglist 'RDD' > generic 'join' and siglist 'RDD,RDD' > generic 'length' and siglist 'RDD' > generic 'partitionBy' and siglist 'RDD' > generic 'persist' and siglist 'RDD,character' > generic 'repartition' and siglist 'RDD' > generic 'show' and siglist 'RDD' > generic 'take' and siglist 'RDD,numeric' > generic 'unpersist' and siglist 'RDD' > {quote} > As described in > https://stat.ethz.ch/pipermail/r-devel/2003-September/027490.html this looks > like a limitation of R where exporting a generic from a package also exports > all the implementations of that generic. > One way to get around this is to remove the RDD API or rename the methods in > Spark 2.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17022) Potential deadlock in driver handling message
[ https://issues.apache.org/jira/browse/SPARK-17022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17022: Assignee: Apache Spark > Potential deadlock in driver handling message > - > > Key: SPARK-17022 > URL: https://issues.apache.org/jira/browse/SPARK-17022 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 2.0.0 >Reporter: Tao Wang >Assignee: Apache Spark >Priority: Critical > > Suppose t1 < t2 < t3. > At t1, someone calls YarnSchedulerBackend.doRequestTotalExecutors from one > of three functions: CoarseGrainedSchedulerBackend.killExecutors, > CoarseGrainedSchedulerBackend.requestTotalExecutors or > CoarseGrainedSchedulerBackend.requestExecutors, all of which hold the > lock `CoarseGrainedSchedulerBackend`. > Then YarnSchedulerBackend.doRequestTotalExecutors sends a > RequestExecutors message to `yarnSchedulerEndpoint` and waits for a reply. > At t2, someone sends a RemoveExecutor to `yarnSchedulerEndpoint` and the > message is received by the endpoint. > At t3, the RequestExecutors message sent at t1 is received by the endpoint. > The endpoint first handles RemoveExecutor and then the RequestExecutors > message. > When handling RemoveExecutor, it sends the same message to > `driverEndpoint` and waits for a reply. > In `driverEndpoint` it requests the lock `CoarseGrainedSchedulerBackend` to > handle that message, but the lock has been held since t1. > So it causes a deadlock. > We have seen this issue in our deployment; it blocks the driver from > handling any messages until both messages time out. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17022) Potential deadlock in driver handling message
[ https://issues.apache.org/jira/browse/SPARK-17022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417582#comment-15417582 ] Apache Spark commented on SPARK-17022: -- User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/14605 > Potential deadlock in driver handling message > - > > Key: SPARK-17022 > URL: https://issues.apache.org/jira/browse/SPARK-17022 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 2.0.0 >Reporter: Tao Wang >Priority: Critical > > Suppose t1 < t2 < t3. > At t1, someone calls YarnSchedulerBackend.doRequestTotalExecutors from one > of three functions: CoarseGrainedSchedulerBackend.killExecutors, > CoarseGrainedSchedulerBackend.requestTotalExecutors or > CoarseGrainedSchedulerBackend.requestExecutors, all of which hold the > lock `CoarseGrainedSchedulerBackend`. > Then YarnSchedulerBackend.doRequestTotalExecutors sends a > RequestExecutors message to `yarnSchedulerEndpoint` and waits for a reply. > At t2, someone sends a RemoveExecutor to `yarnSchedulerEndpoint` and the > message is received by the endpoint. > At t3, the RequestExecutors message sent at t1 is received by the endpoint. > The endpoint first handles RemoveExecutor and then the RequestExecutors > message. > When handling RemoveExecutor, it sends the same message to > `driverEndpoint` and waits for a reply. > In `driverEndpoint` it requests the lock `CoarseGrainedSchedulerBackend` to > handle that message, but the lock has been held since t1. > So it causes a deadlock. > We have seen this issue in our deployment; it blocks the driver from > handling any messages until both messages time out. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17022) Potential deadlock in driver handling message
[ https://issues.apache.org/jira/browse/SPARK-17022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17022: Assignee: (was: Apache Spark) > Potential deadlock in driver handling message > - > > Key: SPARK-17022 > URL: https://issues.apache.org/jira/browse/SPARK-17022 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 2.0.0 >Reporter: Tao Wang >Priority: Critical > > Suppose t1 < t2 < t3. > At t1, someone calls YarnSchedulerBackend.doRequestTotalExecutors from one > of three functions: CoarseGrainedSchedulerBackend.killExecutors, > CoarseGrainedSchedulerBackend.requestTotalExecutors or > CoarseGrainedSchedulerBackend.requestExecutors, all of which hold the > lock `CoarseGrainedSchedulerBackend`. > Then YarnSchedulerBackend.doRequestTotalExecutors sends a > RequestExecutors message to `yarnSchedulerEndpoint` and waits for a reply. > At t2, someone sends a RemoveExecutor to `yarnSchedulerEndpoint` and the > message is received by the endpoint. > At t3, the RequestExecutors message sent at t1 is received by the endpoint. > The endpoint first handles RemoveExecutor and then the RequestExecutors > message. > When handling RemoveExecutor, it sends the same message to > `driverEndpoint` and waits for a reply. > In `driverEndpoint` it requests the lock `CoarseGrainedSchedulerBackend` to > handle that message, but the lock has been held since t1. > So it causes a deadlock. > We have seen this issue in our deployment; it blocks the driver from > handling any messages until both messages time out. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16958) Reuse subqueries within single query
[ https://issues.apache.org/jira/browse/SPARK-16958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-16958. Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14548 [https://github.com/apache/spark/pull/14548] > Reuse subqueries within single query > > > Key: SPARK-16958 > URL: https://issues.apache.org/jira/browse/SPARK-16958 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.1.0 > > > There could be same subquery within a single query, we could reuse the result > without running it multiple times. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16519) Handle SparkR RDD generics that create warnings in R CMD check
[ https://issues.apache.org/jira/browse/SPARK-16519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417541#comment-15417541 ] Felix Cheung edited comment on SPARK-16519 at 8/11/16 4:40 PM: --- since we are undecided on what to export for RDD, should we just go ahead and rename in these internal methods so their names won't match the generics we are exporting? That should eliminate the warnings. If that makes sense, I could start this shortly. was (Author: felixcheung): since we are undecided on what to export for RDD, should we just go ahead and rename in these internal methods so we won't match the generics we are exporting? That should eliminate the warnings. If that makes sense, I could start this shortly. > Handle SparkR RDD generics that create warnings in R CMD check > -- > > Key: SPARK-16519 > URL: https://issues.apache.org/jira/browse/SPARK-16519 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > One of the warnings we get from R CMD check is that RDD implementations of > some of the generics are not documented. These generics are shared between > RDD, DataFrames in SparkR. The list includes > {quote} > WARNING > Undocumented S4 methods: > generic 'cache' and siglist 'RDD' > generic 'collect' and siglist 'RDD' > generic 'count' and siglist 'RDD' > generic 'distinct' and siglist 'RDD' > generic 'first' and siglist 'RDD' > generic 'join' and siglist 'RDD,RDD' > generic 'length' and siglist 'RDD' > generic 'partitionBy' and siglist 'RDD' > generic 'persist' and siglist 'RDD,character' > generic 'repartition' and siglist 'RDD' > generic 'show' and siglist 'RDD' > generic 'take' and siglist 'RDD,numeric' > generic 'unpersist' and siglist 'RDD' > {quote} > As described in > https://stat.ethz.ch/pipermail/r-devel/2003-September/027490.html this looks > like a limitation of R where exporting a generic from a package also exports > all the implementations of that generic. 
> One way to get around this is to remove the RDD API or rename the methods in > Spark 2.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16519) Handle SparkR RDD generics that create warnings in R CMD check
[ https://issues.apache.org/jira/browse/SPARK-16519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417541#comment-15417541 ] Felix Cheung commented on SPARK-16519: -- since we are undecided on what to export for RDD, should we just go ahead and rename in these internal methods so we won't match the generics we are exporting? That should eliminate the warnings. If that makes sense, I could start this shortly. > Handle SparkR RDD generics that create warnings in R CMD check > -- > > Key: SPARK-16519 > URL: https://issues.apache.org/jira/browse/SPARK-16519 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > One of the warnings we get from R CMD check is that RDD implementations of > some of the generics are not documented. These generics are shared between > RDD, DataFrames in SparkR. The list includes > {quote} > WARNING > Undocumented S4 methods: > generic 'cache' and siglist 'RDD' > generic 'collect' and siglist 'RDD' > generic 'count' and siglist 'RDD' > generic 'distinct' and siglist 'RDD' > generic 'first' and siglist 'RDD' > generic 'join' and siglist 'RDD,RDD' > generic 'length' and siglist 'RDD' > generic 'partitionBy' and siglist 'RDD' > generic 'persist' and siglist 'RDD,character' > generic 'repartition' and siglist 'RDD' > generic 'show' and siglist 'RDD' > generic 'take' and siglist 'RDD,numeric' > generic 'unpersist' and siglist 'RDD' > {quote} > As described in > https://stat.ethz.ch/pipermail/r-devel/2003-September/027490.html this looks > like a limitation of R where exporting a generic from a package also exports > all the implementations of that generic. > One way to get around this is to remove the RDD API or rename the methods in > Spark 2.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16577) Add check-cran script to Jenkins
[ https://issues.apache.org/jira/browse/SPARK-16577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417536#comment-15417536 ] Felix Cheung edited comment on SPARK-16577 at 8/11/16 4:36 PM: --- I found that to run the cran check on PDF it requires these texlive texinfo texlive-fonts-extra on Ubuntu - are these something we need to pre-install on Jenkins runs, before adding this to Jenkins? was (Author: felixcheung): I found that to run the cran check on PDF it requires these texlive texinfo texlive-fonts-extra on Ubuntu - are these something we need to pre-install on Jenkins runs? > Add check-cran script to Jenkins > > > Key: SPARK-16577 > URL: https://issues.apache.org/jira/browse/SPARK-16577 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > After we have fixed the warnings from the CRAN checks we should add this as a > part of the Jenkins build. > This depends on SPARK-16507 and SPARK-16508 being resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16577) Add check-cran script to Jenkins
[ https://issues.apache.org/jira/browse/SPARK-16577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417536#comment-15417536 ] Felix Cheung commented on SPARK-16577: -- I found that to run the cran check on PDF it requires these texlive texinfo texlive-fonts-extra on Ubuntu - are these something we need to pre-install on Jenkins runs? > Add check-cran script to Jenkins > > > Key: SPARK-16577 > URL: https://issues.apache.org/jira/browse/SPARK-16577 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > After we have fixed the warnings from the CRAN checks we should add this as a > part of the Jenkins build. > This depends on SPARK-16507 and SPARK-16508 being resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16831) CrossValidator reports incorrect avgMetrics
[ https://issues.apache.org/jira/browse/SPARK-16831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16831: -- Fix Version/s: (was: 1.6.3) > CrossValidator reports incorrect avgMetrics > --- > > Key: SPARK-16831 > URL: https://issues.apache.org/jira/browse/SPARK-16831 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Max Moroz >Assignee: Max Moroz > Fix For: 2.0.1, 2.1.0 > > > The avgMetrics are summed up across all folds instead of being averaged. This > is an easy fix in CrossValidator._fit() function: > {code}metrics[j]+=metric{code} should be > {code}metrics[j]+=metric/nFolds{code}. > {code} > dataset = spark.createDataFrame( > [(Vectors.dense([0.0]), 0.0), >(Vectors.dense([0.4]), 1.0), >(Vectors.dense([0.5]), 0.0), >(Vectors.dense([0.6]), 1.0), >(Vectors.dense([1.0]), 1.0)] * 1000, > ["features", "label"]).cache() > paramGrid = pyspark.ml.tuning.ParamGridBuilder().build() > tvs = > pyspark.ml.tuning.TrainValidationSplit(estimator=pyspark.ml.regression.LinearRegression(), > >estimatorParamMaps=paramGrid, > > evaluator=pyspark.ml.evaluation.RegressionEvaluator(), >trainRatio=0.8) > model = tvs.fit(train) > print(model.validationMetrics) > for folds in (3, 5, 10): > cv = > pyspark.ml.tuning.CrossValidator(estimator=pyspark.ml.regression.LinearRegression(), > > estimatorParamMaps=paramGrid, > > evaluator=pyspark.ml.evaluation.RegressionEvaluator(), > numFolds=folds > ) > cvModel = cv.fit(dataset) > print(folds, cvModel.avgMetrics) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
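To make the one-line fix above concrete, here is a dependency-free Python sketch of the corrected accumulation (the fold values are invented; this only mirrors the proposed {code}metrics[j] += metric/nFolds{code} accumulation, not PySpark's actual _fit code):

```python
# Sketch of the avgMetrics fix: divide each fold's metric by the number
# of folds while accumulating, so the result is a mean across folds
# rather than a sum. Fold values below are made-up example numbers.
def avg_metrics(per_fold_metrics):
    """per_fold_metrics: one list per fold, one metric per param combination."""
    n_folds = len(per_fold_metrics)
    n_params = len(per_fold_metrics[0])
    metrics = [0.0] * n_params
    for fold in per_fold_metrics:
        for j, metric in enumerate(fold):
            metrics[j] += metric / n_folds   # the corrected accumulation
    return metrics

folds = [[0.9, 0.6], [0.8, 0.7], [0.7, 0.8]]  # 3 folds, 2 param combinations
print(avg_metrics(folds))   # ~ [0.8, 0.7]; the buggy sum would give [2.4, 2.1]
```

With the original {code}metrics[j]+=metric{code}, the reported avgMetrics grow with numFolds even when per-fold performance is unchanged, which is the behaviour the issue describes.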
[jira] [Created] (SPARK-17022) Potential deadlock in driver handling message
Tao Wang created SPARK-17022: Summary: Potential deadlock in driver handling message Key: SPARK-17022 URL: https://issues.apache.org/jira/browse/SPARK-17022 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 2.0.0, 1.6.1, 1.6.0, 1.5.2, 1.5.1, 1.5.0 Reporter: Tao Wang Priority: Critical Suppose t1 < t2 < t3. At t1, someone calls YarnSchedulerBackend.doRequestTotalExecutors from one of three functions: CoarseGrainedSchedulerBackend.killExecutors, CoarseGrainedSchedulerBackend.requestTotalExecutors or CoarseGrainedSchedulerBackend.requestExecutors, all of which hold the lock `CoarseGrainedSchedulerBackend`. Then YarnSchedulerBackend.doRequestTotalExecutors sends a RequestExecutors message to `yarnSchedulerEndpoint` and waits for a reply. At t2, someone sends a RemoveExecutor to `yarnSchedulerEndpoint` and the message is received by the endpoint. At t3, the RequestExecutors message sent at t1 is received by the endpoint. The endpoint first handles RemoveExecutor and then the RequestExecutors message. When handling RemoveExecutor, it sends the same message to `driverEndpoint` and waits for a reply. In `driverEndpoint` it requests the lock `CoarseGrainedSchedulerBackend` to handle that message, but the lock has been held since t1. So it causes a deadlock. We have seen this issue in our deployment; it blocks the driver from handling any messages until both messages time out. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
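The pattern described above (holding a coarse lock while blocking on a synchronous reply whose handler needs the same lock) can be sketched with plain Python threading. All names here are invented stand-ins, not Spark code, and timeouts are used so the sketch terminates instead of actually deadlocking:

```python
import threading

# Illustrative sketch of the reported lock pattern; not Spark code.
sched_lock = threading.Lock()   # stands in for the CoarseGrainedSchedulerBackend monitor
reply_ready = threading.Event()
outcome = {}

def handle_remove_executor():
    # The endpoint needs the scheduler lock to process RemoveExecutor.
    # The timeout stands in for "blocked forever"; a real deadlock never returns.
    got_lock = sched_lock.acquire(timeout=0.5)
    outcome["endpoint_got_lock"] = got_lock
    if got_lock:
        sched_lock.release()
        reply_ready.set()       # a reply is only sent if handling succeeded

def do_request_total_executors():
    with sched_lock:            # lock held across the synchronous send-and-wait
        endpoint = threading.Thread(target=handle_remove_executor)
        endpoint.start()
        # Blocks waiting for a reply that cannot arrive while we hold the lock.
        outcome["caller_got_reply"] = reply_ready.wait(timeout=1.0)
        endpoint.join()

do_request_total_executors()
print(outcome)   # {'endpoint_got_lock': False, 'caller_got_reply': False}
```

The sketch only shows why holding the scheduler lock across a synchronous wait lets a single RemoveExecutor starve the reply; the actual fix proposed for Spark (in the pull request referenced in this thread) presumably avoids blocking on the reply while the lock is held.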
[jira] [Comment Edited] (SPARK-17020) Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data
[ https://issues.apache.org/jira/browse/SPARK-17020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417301#comment-15417301 ] Roi Reshef edited comment on SPARK-17020 at 8/11/16 2:09 PM: - Nevertheless, any attempt to repartition the resulting RDD also ends with (almost) all of its partitions staying on the same node. I transformed it into a ShuffledRDD via PairRDDFunctions, set a HashPartitioner with 140 partitions using *.partitionBy*, and yet, I got the same data-distribution as in the screenshot I attached. So I guess there's something very wrong with referring to a *DataFrame.rdd* without materializing it beforehand. What and why is beyond my understanding, currently. was (Author: roireshef): Nevertheless, any attempt to repartition the resulting RDD also ends with (almost) all of its partitions staying on the same node. I transformed it into a ShuffledRDD via PairRDDFunctions, set a HashPartitioner with 140 partitions *.partitionBy*, and yet, I got the same data-distribution as in the screenshot I attached. So I guess there's something very wrong with referring to a *DataFrame.rdd* without materializing it beforehand. What and why is beyond my understanding, currently. > Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data > -- > > Key: SPARK-17020 > URL: https://issues.apache.org/jira/browse/SPARK-17020 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Roi Reshef > Attachments: dataframe_cache.PNG, rdd_cache.PNG > > > Calling DataFrame's lazy val .rdd results with a new RDD with a poor > distribution of partitions across the cluster. Moreover, any attempt to > repartition this RDD further will fail. > Attached are a screenshot of the original DataFrame on cache and the > resulting RDD on cache. 
[jira] [Comment Edited] (SPARK-17020) Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data
[ https://issues.apache.org/jira/browse/SPARK-17020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417301#comment-15417301 ] Roi Reshef edited comment on SPARK-17020 at 8/11/16 2:09 PM: - Nevertheless, any attempt to repartition the resulting RDD also ends with (almost) all of its partitions staying on the same node. I transformed it into a ShuffledRDD via PairRDDFunctions, set a HashPartitioner with 140 partitions *.partitionBy*, and yet, I got the same data-distribution as in the screenshot I attached. So I guess there's something very wrong with referring to a *DataFrame.rdd* without materializing it beforehand. What and why is beyond my understanding, currently. was (Author: roireshef): Nevertheless, any attempt to repartition the resulting RDD also ends with (almost) all of its partitions staying on the same node. I transformed it into a ShuffledRDD via PairRDDFunctions, set a HashPartitioner with 140 partitions, and yet, I got the same data-distribution as in the screenshot I attached. So I guess there's something very wrong with referring to a *DataFrame.rdd* without materializing it beforehand. What and why is beyond my understanding, currently. > Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data > -- > > Key: SPARK-17020 > URL: https://issues.apache.org/jira/browse/SPARK-17020 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Roi Reshef > Attachments: dataframe_cache.PNG, rdd_cache.PNG > > > Calling DataFrame's lazy val .rdd results with a new RDD with a poor > distribution of partitions across the cluster. Moreover, any attempt to > repartition this RDD further will fail. > Attached are a screenshot of the original DataFrame on cache and the > resulting RDD on cache. 
[jira] [Commented] (SPARK-17020) Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data
[ https://issues.apache.org/jira/browse/SPARK-17020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417301#comment-15417301 ] Roi Reshef commented on SPARK-17020: Nevertheless, any attempt to repartition the resulting RDD also ends with (almost) all of its partitions staying on the same node. I transformed it into a ShuffledRDD via PairRDDFunctions, set a HashPartitioner with 140 partitions, and yet, I got the same data-distribution as in the screenshot I attached. So I guess there's something very wrong with referring to a *DataFrame.rdd* without materializing it beforehand. What and why is beyond my understanding, currently. > Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data > -- > > Key: SPARK-17020 > URL: https://issues.apache.org/jira/browse/SPARK-17020 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Roi Reshef > Attachments: dataframe_cache.PNG, rdd_cache.PNG > > > Calling DataFrame's lazy val .rdd results with a new RDD with a poor > distribution of partitions across the cluster. Moreover, any attempt to > repartition this RDD further will fail. > Attached are a screenshot of the original DataFrame on cache and the > resulting RDD on cache.
[jira] [Commented] (SPARK-17020) Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data
[ https://issues.apache.org/jira/browse/SPARK-17020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417291#comment-15417291 ] Sean Owen commented on SPARK-17020: --- I see, I was asking because you show the results of caching a DataFrame above. My guess is that in one case, the DataFrame is computed using the expected number of partitions, and somehow when you go straight through to the RDD, it ends up executing one task for one partition, thus putting the result in one big block. As to why, I don't know. You could confirm/deny by looking at the partition count for the DataFrame and RDD in these cases. > Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data > -- > > Key: SPARK-17020 > URL: https://issues.apache.org/jira/browse/SPARK-17020 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Roi Reshef > Attachments: dataframe_cache.PNG, rdd_cache.PNG > > > Calling DataFrame's lazy val .rdd results with a new RDD with a poor > distribution of partitions across the cluster. Moreover, any attempt to > repartition this RDD further will fail. > Attached are a screenshot of the original DataFrame on cache and the > resulting RDD on cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17020) Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data
[ https://issues.apache.org/jira/browse/SPARK-17020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417288#comment-15417288 ] Roi Reshef commented on SPARK-17020: The problem occurs only when calling **.rdd** on a *not-previously-cached* DataFrame. **data** is a DataFrame, so in the last code you have it cached, whereas in the one before it wasn't; only the RDD extracted from it was. > Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data > -- > > Key: SPARK-17020 > URL: https://issues.apache.org/jira/browse/SPARK-17020 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Roi Reshef > Attachments: dataframe_cache.PNG, rdd_cache.PNG > > > Calling DataFrame's lazy val .rdd results with a new RDD with a poor > distribution of partitions across the cluster. Moreover, any attempt to > repartition this RDD further will fail. > Attached are a screenshot of the original DataFrame on cache and the > resulting RDD on cache.
[jira] [Commented] (SPARK-17020) Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data
[ https://issues.apache.org/jira/browse/SPARK-17020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417268#comment-15417268 ] Sean Owen commented on SPARK-17020: --- Yeah, after it's cached and the partitions are established, I'd certainly expect it to do the sensible thing and use that locality, and that you'd find the locality of the RDD's partitions is the same and well-distributed. What's the code path where you cache the DataFrame? I only see the RDD cached here. > Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data > -- > > Key: SPARK-17020 > URL: https://issues.apache.org/jira/browse/SPARK-17020 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Roi Reshef > Attachments: dataframe_cache.PNG, rdd_cache.PNG > > > Calling DataFrame's lazy val .rdd results with a new RDD with a poor > distribution of partitions across the cluster. Moreover, any attempt to > repartition this RDD further will fail. > Attached are a screenshot of the original DataFrame on cache and the > resulting RDD on cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17020) Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data
[ https://issues.apache.org/jira/browse/SPARK-17020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417254#comment-15417254 ] Roi Reshef commented on SPARK-17020: Also note that I have just called: *data.cache().count()* val rdd = data.rdd.setName("rdd").cache() rdd.count and the rdd was distributed far better (similar to "data" DataFrame) I'm not sure it solves the issue with the rdd that ignores repartitioning methods further down the road. I'll have to check that > Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data > -- > > Key: SPARK-17020 > URL: https://issues.apache.org/jira/browse/SPARK-17020 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Roi Reshef > Attachments: dataframe_cache.PNG, rdd_cache.PNG > > > Calling DataFrame's lazy val .rdd results with a new RDD with a poor > distribution of partitions across the cluster. Moreover, any attempt to > repartition this RDD further will fail. > Attached are a screenshot of the original DataFrame on cache and the > resulting RDD on cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17020) Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data
[ https://issues.apache.org/jira/browse/SPARK-17020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417250#comment-15417250 ] Roi Reshef commented on SPARK-17020: val ab = SomeReader.read(...) //some reader function that uses spark-csv with inferSchema=true filter(!isnull($"name")). alias("revab") val meta = SomeReader.read(...) //same but different schema and data val udaf = ... //some UserDefinedAggregateFunction val features = ab.groupBy(...).agg(udaf(...)) val data = features. join(meta, $"meta.id" === $"features.id"). select(...) //only relevant fields val rdd = data.rdd.setName("rdd").cache() rdd.count > Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data > -- > > Key: SPARK-17020 > URL: https://issues.apache.org/jira/browse/SPARK-17020 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Roi Reshef > Attachments: dataframe_cache.PNG, rdd_cache.PNG > > > Calling DataFrame's lazy val .rdd results with a new RDD with a poor > distribution of partitions across the cluster. Moreover, any attempt to > repartition this RDD further will fail. > Attached are a screenshot of the original DataFrame on cache and the > resulting RDD on cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17020) Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data
[ https://issues.apache.org/jira/browse/SPARK-17020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417231#comment-15417231 ] Sean Owen commented on SPARK-17020: --- I think that's probably material, yes, as is the operations that created the DataFrame. Do you have any minimal reproduction? > Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data > -- > > Key: SPARK-17020 > URL: https://issues.apache.org/jira/browse/SPARK-17020 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Roi Reshef >Priority: Critical > Attachments: dataframe_cache.PNG, rdd_cache.PNG > > > Calling DataFrame's lazy val .rdd results with a new RDD with a poor > distribution of partitions across the cluster. Moreover, any attempt to > repartition this RDD further will fail. > Attached are a screenshot of the original DataFrame on cache and the > resulting RDD on cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2
[ https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417240#comment-15417240 ] Dongjoon Hyun commented on SPARK-16975: --- Great! Thank you for confirming. > Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2 > -- > > Key: SPARK-16975 > URL: https://issues.apache.org/jira/browse/SPARK-16975 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 > Environment: Ubuntu Linux 14.04 >Reporter: immerrr again > Labels: parquet > > Spark-2.0.0 seems to have some problems reading a parquet dataset generated > by 1.6.2. > {code} > In [80]: spark.read.parquet('/path/to/data') > ... > AnalysisException: u'Unable to infer schema for ParquetFormat at > /path/to/data. It must be specified manually;' > {code} > The dataset is ~150G and partitioned by _locality_code column. None of the > partitions are empty. I have narrowed the failing dataset to the first 32 > partitions of the data: > {code} > In [82]: spark.read.parquet(*subdirs[:32]) > ... > AnalysisException: u'Unable to infer schema for ParquetFormat at > /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AI. It must be > specified manually;' > {code} > Interestingly, it works OK if you remove any of the partitions from the list: > {code} > In [83]: for i in range(32): spark.read.parquet(*(subdirs[:i] + > subdirs[i+1:32])) > {code} > Another strange thing is that the schemas for the first and the last 31 > partitions of the subset are identical: > {code} > In [84]: spark.read.parquet(*subdirs[:31]).schema.fields == > spark.read.parquet(*subdirs[1:32]).schema.fields > Out[84]: True > {code} > Which got me interested and I tried this: > {code} > In [87]: spark.read.parquet(*([subdirs[0]] * 32)) > ... > AnalysisException: u'Unable to infer schema for ParquetFormat at > /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AQ. 
It must be > specified manually;' > In [88]: spark.read.parquet(*([subdirs[15]] * 32)) > ... > AnalysisException: u'Unable to infer schema for ParquetFormat at > /path/to/data/_locality_code=AX,/path/to/data/_locality_code=AX. It must be > specified manually;' > In [89]: spark.read.parquet(*([subdirs[31]] * 32)) > ... > AnalysisException: u'Unable to infer schema for ParquetFormat at > /path/to/data/_locality_code=BE,/path/to/data/_locality_code=BE. It must be > specified manually;' > {code} > If I read the first partition, save it in 2.0 and try to read in the same > manner, everything is fine: > {code} > In [100]: spark.read.parquet(subdirs[0]).write.parquet('spark-2.0-test') > 16/08/09 11:03:37 WARN ParquetRecordReader: Can not initialize counter due to > context is not a instance of TaskInputOutputContext, but is > org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl > In [101]: df = spark.read.parquet(*(['spark-2.0-test'] * 32)) > {code} > I have originally posted it to user mailing list, but with the last > discoveries this clearly seems like a bug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17020) Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data
[ https://issues.apache.org/jira/browse/SPARK-17020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17020: -- Priority: Major (was: Critical) > Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data > -- > > Key: SPARK-17020 > URL: https://issues.apache.org/jira/browse/SPARK-17020 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Roi Reshef > Attachments: dataframe_cache.PNG, rdd_cache.PNG > > > Calling DataFrame's lazy val .rdd results with a new RDD with a poor > distribution of partitions across the cluster. Moreover, any attempt to > repartition this RDD further will fail. > Attached are a screenshot of the original DataFrame on cache and the > resulting RDD on cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17021) simplify the constructor parameters of QuantileSummaries
[ https://issues.apache.org/jira/browse/SPARK-17021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417230#comment-15417230 ] Apache Spark commented on SPARK-17021: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/14603 > simplify the constructor parameters of QuantileSummaries > > > Key: SPARK-17021 > URL: https://issues.apache.org/jira/browse/SPARK-17021 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17021) simplify the constructor parameters of QuantileSummaries
[ https://issues.apache.org/jira/browse/SPARK-17021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17021: Assignee: Wenchen Fan (was: Apache Spark) > simplify the constructor parameters of QuantileSummaries > > > Key: SPARK-17021 > URL: https://issues.apache.org/jira/browse/SPARK-17021 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17021) simplify the constructor parameters of QuantileSummaries
[ https://issues.apache.org/jira/browse/SPARK-17021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17021: Assignee: Apache Spark (was: Wenchen Fan) > simplify the constructor parameters of QuantileSummaries > > > Key: SPARK-17021 > URL: https://issues.apache.org/jira/browse/SPARK-17021 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17020) Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data
[ https://issues.apache.org/jira/browse/SPARK-17020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417218#comment-15417218 ] Roi Reshef commented on SPARK-17020: [~srowen] Should there be any effect on this if I cached and materialized the DF before I call .rdd? > Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data > -- > > Key: SPARK-17020 > URL: https://issues.apache.org/jira/browse/SPARK-17020 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Roi Reshef >Priority: Critical > Attachments: dataframe_cache.PNG, rdd_cache.PNG > > > Calling DataFrame's lazy val .rdd results with a new RDD with a poor > distribution of partitions across the cluster. Moreover, any attempt to > repartition this RDD further will fail. > Attached are a screenshot of the original DataFrame on cache and the > resulting RDD on cache.
[jira] [Created] (SPARK-17021) simplify the constructor parameters of QuantileSummaries
Wenchen Fan created SPARK-17021: --- Summary: simplify the constructor parameters of QuantileSummaries Key: SPARK-17021 URL: https://issues.apache.org/jira/browse/SPARK-17021 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17020) Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data
[ https://issues.apache.org/jira/browse/SPARK-17020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417204#comment-15417204 ] Roi Reshef edited comment on SPARK-17020 at 8/11/16 1:13 PM: - [~srowen] I have 2 DataFrames that are generated from spark-csv reader. Then I pass them through several transformations, and join them together. After that I call either .rdd or .flatMap to get an RDD out of the joint DataFrame. Throughout the whole process I've monitored the distribution of the DataFrames. It is good until the point where ".rdd" is called was (Author: roireshef): [~srowen] I have 2 DataFrames that are generated from spark-csv reader. Then I pass them through several transformations, and join them together. After that I call either .rdd or .flatMap to get an RDD out of the joint DataFrame. Throughout all the process I've monitored the distribution of the DataFrames. It is good until the point where ".rdd" is called > Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data > -- > > Key: SPARK-17020 > URL: https://issues.apache.org/jira/browse/SPARK-17020 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Roi Reshef >Priority: Critical > Attachments: dataframe_cache.PNG, rdd_cache.PNG > > > Calling DataFrame's lazy val .rdd results with a new RDD with a poor > distribution of partitions across the cluster. Moreover, any attempt to > repartition this RDD further will fail. > Attached are a screenshot of the original DataFrame on cache and the > resulting RDD on cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org