[jira] [Comment Edited] (SPARK-16917) Spark streaming kafka version compatibility.

2016-08-11 Thread Alexey Zotov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418384#comment-15418384
 ] 

Alexey Zotov edited comment on SPARK-16917 at 8/12/16 5:24 AM:
---

[~sowen]
[~c...@koeninger.org]

It really seems confusing:
1. Security is supported in the new consumer API, which is available starting from 
Kafka v0.9. _spark-streaming-kafka-0-8_2.11_ does not support the new consumer API, 
so it does not look compatible with a secured Kafka v0.9 cluster.

2. _spark-streaming-kafka-0-10_2.11_ works with brokers 0.10 or higher.

Based on the above reasoning, it looks like it is impossible to use Spark Streaming 
with a secured Kafka v0.9 cluster. Am I correct? If yes, it would be great to mention 
this somewhere in the documentation.
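For concreteness, a minimal sbt sketch of the artifact choice discussed above (the 2.0.0 version below is illustrative; adjust to the Spark release in use):

{code}
// build.sbt sketch: pick the one integration module that matches the broker.
// spark-streaming-kafka-0-8 targets brokers 0.8.2.1+ (old consumer API, no security);
// spark-streaming-kafka-0-10 targets brokers 0.10.0+ (new consumer API, supports security).
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "2.0.0" % "provided",
  // use exactly one of the following:
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.0"
  // "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0"
)
{code}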

Thanks!



was (Author: azotcsit):
[~sowen]
[~c...@koeninger.org]

It really seems confusing:
1. Security is supported in the new consumer API, which is available starting from 
Kafka v0.9. _spark-streaming-kafka-0-8_2.11_ does not support the new consumer API, 
so it does not look compatible with a secured Kafka v0.9 cluster.

2. _spark-streaming-kafka-0-10_2.11_ works with brokers 0.10 or higher.

Based on the above reasoning, it looks like it is impossible to use Spark Streaming 
with a secured Kafka v0.9 cluster. Please let me know what I have missed in the above 
reasoning.

Thanks!


> Spark streaming kafka version compatibility. 
> -
>
> Key: SPARK-16917
> URL: https://issues.apache.org/jira/browse/SPARK-16917
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.0.0
>Reporter: Sudev
>Priority: Trivial
>  Labels: documentation
>
> It would be nice to have Kafka version compatibility information in the 
> official documentation; it's very confusing now. 
> * If you look at this JIRA [1], it seems like Kafka is supported in Spark 
> 2.0.0.
> * The documentation lists the artifact spark-streaming-kafka-0-8_2.11 
> (for Kafka 0.8).
> Is Kafka 0.9 supported by Spark 2.0.0?
> Since I'm still confused even after an hour's effort googling this, I 
> think someone should help add a compatibility matrix.
> [1] https://issues.apache.org/jira/browse/SPARK-12177






[jira] [Commented] (SPARK-16917) Spark streaming kafka version compatibility.

2016-08-11 Thread Alexey Zotov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418384#comment-15418384
 ] 

Alexey Zotov commented on SPARK-16917:
--

[~sowen]
[~c...@koeninger.org]

It really seems confusing:
1. Security is supported in the new consumer API, which is available starting from 
Kafka v0.9. _spark-streaming-kafka-0-8_2.11_ does not support the new consumer API, 
so it does not look compatible with a secured Kafka v0.9 cluster.

2. _spark-streaming-kafka-0-10_2.11_ works with brokers 0.10 or higher.

Based on the above reasoning, it looks like it is impossible to use Spark Streaming 
with a secured Kafka v0.9 cluster. Please let me know what I have missed in the above 
reasoning.

Thanks!


> Spark streaming kafka version compatibility. 
> -
>
> Key: SPARK-16917
> URL: https://issues.apache.org/jira/browse/SPARK-16917
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.0.0
>Reporter: Sudev
>Priority: Trivial
>  Labels: documentation
>
> It would be nice to have Kafka version compatibility information in the 
> official documentation; it's very confusing now. 
> * If you look at this JIRA [1], it seems like Kafka is supported in Spark 
> 2.0.0.
> * The documentation lists the artifact spark-streaming-kafka-0-8_2.11 
> (for Kafka 0.8).
> Is Kafka 0.9 supported by Spark 2.0.0?
> Since I'm still confused even after an hour's effort googling this, I 
> think someone should help add a compatibility matrix.
> [1] https://issues.apache.org/jira/browse/SPARK-12177






[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-11 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418364#comment-15418364
 ] 

Dongjoon Hyun commented on SPARK-16975:
---

Hi, [~rxin].
Could you review this PR?

> Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2
> --
>
> Key: SPARK-16975
> URL: https://issues.apache.org/jira/browse/SPARK-16975
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Ubuntu Linux 14.04
>Reporter: immerrr again
>  Labels: parquet
>
> Spark-2.0.0 seems to have some problems reading a parquet dataset generated 
> by 1.6.2. 
> {code}
> In [80]: spark.read.parquet('/path/to/data')
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data. It must be specified manually;'
> {code}
> The dataset is ~150G and partitioned by _locality_code column. None of the 
> partitions are empty. I have narrowed the failing dataset to the first 32 
> partitions of the data:
> {code}
> In [82]: spark.read.parquet(*subdirs[:32])
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AI. It must be 
> specified manually;'
> {code}
> Interestingly, it works OK if you remove any of the partitions from the list:
> {code}
> In [83]: for i in range(32): spark.read.parquet(*(subdirs[:i] + 
> subdirs[i+1:32]))
> {code}
> Another strange thing is that the schemas for the first and the last 31 
> partitions of the subset are identical:
> {code}
> In [84]: spark.read.parquet(*subdirs[:31]).schema.fields == 
> spark.read.parquet(*subdirs[1:32]).schema.fields
> Out[84]: True
> {code}
> Which got me interested and I tried this:
> {code}
> In [87]: spark.read.parquet(*([subdirs[0]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AQ. It must be 
> specified manually;'
> In [88]: spark.read.parquet(*([subdirs[15]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=AX,/path/to/data/_locality_code=AX. It must be 
> specified manually;'
> In [89]: spark.read.parquet(*([subdirs[31]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=BE,/path/to/data/_locality_code=BE. It must be 
> specified manually;'
> {code}
> If I read the first partition, save it in 2.0 and try to read in the same 
> manner, everything is fine:
> {code}
> In [100]: spark.read.parquet(subdirs[0]).write.parquet('spark-2.0-test')
> 16/08/09 11:03:37 WARN ParquetRecordReader: Can not initialize counter due to 
> context is not a instance of TaskInputOutputContext, but is 
> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
> In [101]: df = spark.read.parquet(*(['spark-2.0-test'] * 32))
> {code}
> I have originally posted it to user mailing list, but with the last 
> discoveries this clearly seems like a bug.
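As a side note, the "It must be specified manually" part of the error refers to passing an explicit schema instead of relying on inference. A rough Scala sketch follows (the field names are invented for illustration, not this dataset's real schema), which may or may not side-step the underlying bug:

{code}
// Sketch: read with an explicit schema so Spark does not need to infer it.
import org.apache.spark.sql.types._

val schema = StructType(Seq(                 // illustrative fields only
  StructField("_locality_code", StringType),
  StructField("value", LongType)))

val df = spark.read.schema(schema).parquet("/path/to/data")
{code}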






[jira] [Assigned] (SPARK-17019) Expose off-heap memory usage in various places

2016-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17019:


Assignee: Apache Spark

> Expose off-heap memory usage in various places
> --
>
> Key: SPARK-17019
> URL: https://issues.apache.org/jira/browse/SPARK-17019
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Saisai Shao
>Assignee: Apache Spark
>Priority: Minor
>
> With SPARK-13992, Spark supports persisting data in off-heap memory, but 
> off-heap usage is not currently exposed, which makes it inconvenient for 
> users to monitor and profile. This proposes exposing off-heap as well as 
> on-heap memory usage in various places:
> 1. Spark UI's executor page will display both on-heap and off-heap memory 
> usage.
> 2. REST requests return both on-heap and off-heap memory.
> 3. Both memory usage figures can also be obtained programmatically from 
> SparkListener (see the sketch after this list).
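A rough sketch of the listener hook item 3 refers to, using only the API that exists today (a single combined maxMem figure); splitting this into on-heap and off-heap numbers is what this issue proposes and is not available yet:

{code}
// Sketch: register a SparkListener and log block manager memory as it is reported today.
import org.apache.spark.scheduler.{SparkListener, SparkListenerBlockManagerAdded}

class MemoryReportingListener extends SparkListener {
  override def onBlockManagerAdded(event: SparkListenerBlockManagerAdded): Unit = {
    // Currently only a combined maxMem value is exposed here.
    println(s"Block manager ${event.blockManagerId} registered, maxMem=${event.maxMem}")
  }
}

// sc is an existing SparkContext
sc.addSparkListener(new MemoryReportingListener)
{code}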






[jira] [Assigned] (SPARK-17019) Expose off-heap memory usage in various places

2016-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17019:


Assignee: (was: Apache Spark)

> Expose off-heap memory usage in various places
> --
>
> Key: SPARK-17019
> URL: https://issues.apache.org/jira/browse/SPARK-17019
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Saisai Shao
>Priority: Minor
>
> With SPARK-13992, Spark supports persisting data in off-heap memory, but 
> off-heap usage is not currently exposed, which makes it inconvenient for 
> users to monitor and profile. This proposes exposing off-heap as well as 
> on-heap memory usage in various places:
> 1. Spark UI's executor page will display both on-heap and off-heap memory 
> usage.
> 2. REST requests return both on-heap and off-heap memory.
> 3. Both memory usage figures can also be obtained programmatically from 
> SparkListener.






[jira] [Commented] (SPARK-17019) Expose off-heap memory usage in various places

2016-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418305#comment-15418305
 ] 

Apache Spark commented on SPARK-17019:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/14617

> Expose off-heap memory usage in various places
> --
>
> Key: SPARK-17019
> URL: https://issues.apache.org/jira/browse/SPARK-17019
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Saisai Shao
>Priority: Minor
>
> With SPARK-13992, Spark supports persisting data in off-heap memory, but 
> off-heap usage is not currently exposed, which makes it inconvenient for 
> users to monitor and profile. This proposes exposing off-heap as well as 
> on-heap memory usage in various places:
> 1. Spark UI's executor page will display both on-heap and off-heap memory 
> usage.
> 2. REST requests return both on-heap and off-heap memory.
> 3. Both memory usage figures can also be obtained programmatically from 
> SparkListener.






[jira] [Updated] (SPARK-16434) Avoid record-per type dispatch in JSON when reading

2016-08-11 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-16434:

Assignee: Hyukjin Kwon

> Avoid record-per type dispatch in JSON when reading
> ---
>
> Key: SPARK-16434
> URL: https://issues.apache.org/jira/browse/SPARK-16434
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> Currently, {{JacksonParser.parse}} does type dispatch for each row to 
> read the appropriate values.
> It might not have to be done this way, because the schema of the {{DataFrame}} is 
> already known. 
> So, appropriate converters can be created up front according to the schema and 
> then applied to each row.
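A toy sketch of that idea (not the actual {{JacksonParser}} code; the types and inputs are invented for illustration): derive one converter per field from the schema once, then reuse the converters for every row.

{code}
// Toy example of "build converters from the schema once, apply them per row".
import org.apache.spark.sql.types._

def makeConverter(dt: DataType): String => Any = dt match {
  case IntegerType => s => s.toInt
  case DoubleType  => s => s.toDouble
  case _           => identity[String] _      // fall back to the raw string
}

val schema = StructType(Seq(StructField("a", IntegerType), StructField("b", DoubleType)))
val converters = schema.fields.map(f => makeConverter(f.dataType))   // created once

val rows = Seq(Array("1", "2.5"), Array("3", "4.5"))
val converted = rows.map(row => row.zip(converters).map { case (v, convert) => convert(v) })
{code}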






[jira] [Resolved] (SPARK-16434) Avoid record-per type dispatch in JSON when reading

2016-08-11 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-16434.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14102
[https://github.com/apache/spark/pull/14102]

> Avoid record-per type dispatch in JSON when reading
> ---
>
> Key: SPARK-16434
> URL: https://issues.apache.org/jira/browse/SPARK-16434
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> Currently, {{JacksonParser.parse}} does type dispatch for each row to 
> read the appropriate values.
> It might not have to be done this way, because the schema of the {{DataFrame}} is 
> already known. 
> So, appropriate converters can be created up front according to the schema and 
> then applied to each row.






[jira] [Resolved] (SPARK-13081) Allow set pythonExec of driver and executor through configuration

2016-08-11 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-13081.

   Resolution: Fixed
 Assignee: Jeff Zhang
Fix Version/s: 2.1.0

> Allow set pythonExec of driver and executor through configuration
> -
>
> Key: SPARK-13081
> URL: https://issues.apache.org/jira/browse/SPARK-13081
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Submit
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently the user has to export the environment variables PYSPARK_DRIVER_PYTHON and 
> PYSPARK_PYTHON to set the pythonExec of the driver and executors. That is fine for 
> interactive mode using bin/pyspark, but it is not so convenient if the user wants 
> to use PySpark in batch mode via bin/spark-submit. It would be better to 
> allow the user to set pythonExec through "--conf".






[jira] [Commented] (SPARK-16955) Using ordinals in ORDER BY causes an analysis error when the query has a GROUP BY clause using ordinals

2016-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418288#comment-15418288
 ] 

Apache Spark commented on SPARK-16955:
--

User 'clockfly' has created a pull request for this issue:
https://github.com/apache/spark/pull/14616

> Using ordinals in ORDER BY causes an analysis error when the query has a 
> GROUP BY clause using ordinals
> ---
>
> Key: SPARK-16955
> URL: https://issues.apache.org/jira/browse/SPARK-16955
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>
> The following queries work
> {code}
> select a from (select 1 as a) tmp order by 1
> select a, count(*) from (select 1 as a) tmp group by 1
> select a, count(*) from (select 1 as a) tmp group by 1 order by a
> {code}
> However, the following query does not
> {code}
> select a, count(*) from (select 1 as a) tmp group by 1 order by 1
> {code}
> {code}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> Group by position: '1' exceeds the size of the select list '0'. on unresolved 
> object, tree:
> Aggregate [1]
> +- SubqueryAlias tmp
>+- Project [1 AS a#82]
>   +- OneRowRelation$
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:749)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:739)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:739)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:715)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:715)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:714)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1237)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1182)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1182)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1181)
>   

[jira] [Commented] (SPARK-6235) Address various 2G limits

2016-08-11 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418225#comment-15418225
 ] 

Guoqiang Li commented on SPARK-6235:


I'm working on this and will post a patch this month.

> Address various 2G limits
> -
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>
> An umbrella ticket to track the various 2G limits we have in Spark, due to the 
> use of byte arrays and ByteBuffers.






[jira] [Commented] (SPARK-17029) Dataset toJSON goes through RDD form instead of transforming dataset itself

2016-08-11 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418149#comment-15418149
 ] 

Andrew Ash commented on SPARK-17029:


Note RDD form usage from https://issues.apache.org/jira/browse/SPARK-10705

> Dataset toJSON goes through RDD form instead of transforming dataset itself
> ---
>
> Key: SPARK-17029
> URL: https://issues.apache.org/jira/browse/SPARK-17029
> Project: Spark
>  Issue Type: Bug
>Reporter: Robert Kruszewski
>
> No longer necessary and can be optimized with datasets






[jira] [Assigned] (SPARK-17029) Dataset toJSON goes through RDD form instead of transforming dataset itself

2016-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17029:


Assignee: (was: Apache Spark)

> Dataset toJSON goes through RDD form instead of transforming dataset itself
> ---
>
> Key: SPARK-17029
> URL: https://issues.apache.org/jira/browse/SPARK-17029
> Project: Spark
>  Issue Type: Bug
>Reporter: Robert Kruszewski
>
> No longer necessary and can be optimized with datasets






[jira] [Commented] (SPARK-16578) Configurable hostname for RBackend

2016-08-11 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418143#comment-15418143
 ] 

Miao Wang commented on SPARK-16578:
---

OK. I will check with Junyang. 

> Configurable hostname for RBackend
> --
>
> Key: SPARK-16578
> URL: https://issues.apache.org/jira/browse/SPARK-16578
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> One of the requirements that comes up with SparkR being a standalone package 
> is that users can now install just the R package on the client side and 
> connect to a remote machine which runs the RBackend class.
> We should check if we can support this mode of execution and what the 
> pros / cons of it are.






[jira] [Commented] (SPARK-17029) Dataset toJSON goes through RDD form instead of transforming dataset itself

2016-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418144#comment-15418144
 ] 

Apache Spark commented on SPARK-17029:
--

User 'robert3005' has created a pull request for this issue:
https://github.com/apache/spark/pull/14615

> Dataset toJSON goes through RDD form instead of transforming dataset itself
> ---
>
> Key: SPARK-17029
> URL: https://issues.apache.org/jira/browse/SPARK-17029
> Project: Spark
>  Issue Type: Bug
>Reporter: Robert Kruszewski
>
> No longer necessary and can be optimized with datasets






[jira] [Assigned] (SPARK-17029) Dataset toJSON goes through RDD form instead of transforming dataset itself

2016-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17029:


Assignee: Apache Spark

> Dataset toJSON goes through RDD form instead of transforming dataset itself
> ---
>
> Key: SPARK-17029
> URL: https://issues.apache.org/jira/browse/SPARK-17029
> Project: Spark
>  Issue Type: Bug
>Reporter: Robert Kruszewski
>Assignee: Apache Spark
>
> No longer necessary and can be optimized with datasets






[jira] [Closed] (SPARK-17028) Backport SI-9734 for Scala 2.10

2016-08-11 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu closed SPARK-17028.

Resolution: Won't Fix

> Backport SI-9734 for Scala 2.10
> ---
>
> Key: SPARK-17028
> URL: https://issues.apache.org/jira/browse/SPARK-17028
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: Scala 2.10
>Reporter: Shixiong Zhu
>
> SI-9734 will be included in Scala 2.11.9. However, we still need to backport 
> it to Spark Scala 2.10 Shell manually. 






[jira] [Assigned] (SPARK-17027) PolynomialExpansion.choose is prone to integer overflow

2016-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17027:


Assignee: (was: Apache Spark)

> PolynomialExpansion.choose is prone to integer overflow 
> 
>
> Key: SPARK-17027
> URL: https://issues.apache.org/jira/browse/SPARK-17027
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> The current implementation computes the power of k directly and because of that it is 
> susceptible to integer overflow on relatively small input (4 features, degree 
> equal to 10). It would be better to use a recursive formula instead.






[jira] [Assigned] (SPARK-17027) PolynomialExpansion.choose is prone to integer overflow

2016-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17027:


Assignee: Apache Spark

> PolynomialExpansion.choose is prone to integer overflow 
> 
>
> Key: SPARK-17027
> URL: https://issues.apache.org/jira/browse/SPARK-17027
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Minor
>
> The current implementation computes the power of k directly and because of that it is 
> susceptible to integer overflow on relatively small input (4 features, degree 
> equal to 10). It would be better to use a recursive formula instead.






[jira] [Commented] (SPARK-16883) SQL decimal type is not properly cast to number when collecting SparkDataFrame

2016-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418111#comment-15418111
 ] 

Apache Spark commented on SPARK-16883:
--

User 'wangmiao1981' has created a pull request for this issue:
https://github.com/apache/spark/pull/14613

> SQL decimal type is not properly cast to number when collecting SparkDataFrame
> --
>
> Key: SPARK-16883
> URL: https://issues.apache.org/jira/browse/SPARK-16883
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> To reproduce, run the following code. As you can see, "y" is a list of values.
> {code}
> registerTempTable(createDataFrame(iris), "iris")
> str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y  
> from iris limit 5")))
> 'data.frame': 5 obs. of  2 variables:
>  $ x: num  1 1 1 1 1
>  $ y:List of 5
>   ..$ : num 2
>   ..$ : num 2
>   ..$ : num 2
>   ..$ : num 2
>   ..$ : num 2
> {code}






[jira] [Assigned] (SPARK-16883) SQL decimal type is not properly cast to number when collecting SparkDataFrame

2016-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16883:


Assignee: (was: Apache Spark)

> SQL decimal type is not properly cast to number when collecting SparkDataFrame
> --
>
> Key: SPARK-16883
> URL: https://issues.apache.org/jira/browse/SPARK-16883
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> To reproduce, run the following code. As you can see, "y" is a list of values.
> {code}
> registerTempTable(createDataFrame(iris), "iris")
> str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y  
> from iris limit 5")))
> 'data.frame': 5 obs. of  2 variables:
>  $ x: num  1 1 1 1 1
>  $ y:List of 5
>   ..$ : num 2
>   ..$ : num 2
>   ..$ : num 2
>   ..$ : num 2
>   ..$ : num 2
> {code}






[jira] [Assigned] (SPARK-16883) SQL decimal type is not properly cast to number when collecting SparkDataFrame

2016-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16883:


Assignee: Apache Spark

> SQL decimal type is not properly cast to number when collecting SparkDataFrame
> --
>
> Key: SPARK-16883
> URL: https://issues.apache.org/jira/browse/SPARK-16883
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>Assignee: Apache Spark
>
> To reproduce, run the following code. As you can see, "y" is a list of values.
> {code}
> registerTempTable(createDataFrame(iris), "iris")
> str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y  
> from iris limit 5")))
> 'data.frame': 5 obs. of  2 variables:
>  $ x: num  1 1 1 1 1
>  $ y:List of 5
>   ..$ : num 2
>   ..$ : num 2
>   ..$ : num 2
>   ..$ : num 2
>   ..$ : num 2
> {code}






[jira] [Commented] (SPARK-17027) PolynomialExpansion.choose is prone to integer overflow

2016-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418112#comment-15418112
 ] 

Apache Spark commented on SPARK-17027:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/14614

> PolynomialExpansion.choose is prone to integer overflow 
> 
>
> Key: SPARK-17027
> URL: https://issues.apache.org/jira/browse/SPARK-17027
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> The current implementation computes the power of k directly and because of that it is 
> susceptible to integer overflow on relatively small input (4 features, degree 
> equal to 10). It would be better to use a recursive formula instead.






[jira] [Resolved] (SPARK-17026) warning msg in MulticlassMetricsSuite

2016-08-11 Thread Xin Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Ren resolved SPARK-17026.
-
Resolution: Not A Problem

> warning msg in MulticlassMetricsSuite
> -
>
> Key: SPARK-17026
> URL: https://issues.apache.org/jira/browse/SPARK-17026
> Project: Spark
>  Issue Type: Improvement
>Reporter: Xin Ren
>Priority: Trivial
>
> Got warning when building:
> {code}
> [warn] 
> /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala:74:
>  value precision in class MulticlassMetrics is deprecated: Use accuracy.
> [warn]assert(math.abs(metrics.accuracy - metrics.precision) < delta)
> [warn]^
> [warn] 
> /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala:75:
>  value recall in class MulticlassMetrics is deprecated: Use accuracy.
> [warn]assert(math.abs(metrics.accuracy - metrics.recall) < delta)
> [warn]^
> [warn] 
> /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala:76:
>  value fMeasure in class MulticlassMetrics is deprecated: Use accuracy.
> [warn]assert(math.abs(metrics.accuracy - metrics.fMeasure) < delta)
> [warn]^
> {code}
> And `precision`, `recall`, and `fMeasure` all reference `accuracy`:
> {code}
> assert(math.abs(metrics.accuracy - metrics.precision) < delta)
> assert(math.abs(metrics.accuracy - metrics.recall) < delta)
> assert(math.abs(metrics.accuracy - metrics.fMeasure) < delta)
> {code}
> {code}
>   /**
>* Returns precision
>*/
>   @Since("1.1.0")
>   @deprecated("Use accuracy.", "2.0.0")
>   lazy val precision: Double = accuracy
>   /**
>* Returns recall
>* (equals to precision for multiclass classifier
>* because sum of all false positives is equal to sum
>* of all false negatives)
>*/
>   @Since("1.1.0")
>   @deprecated("Use accuracy.", "2.0.0")
>   lazy val recall: Double = accuracy
>   /**
>* Returns f-measure
>* (equals to precision and recall because precision equals recall)
>*/
>   @Since("1.1.0")
>   @deprecated("Use accuracy.", "2.0.0")
>   lazy val fMeasure: Double = accuracy
> {code}






[jira] [Commented] (SPARK-16803) SaveAsTable does not work when source DataFrame is built on a Hive Table

2016-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418096#comment-15418096
 ] 

Apache Spark commented on SPARK-16803:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/14612

> SaveAsTable does not work when source DataFrame is built on a Hive Table
> 
>
> Key: SPARK-16803
> URL: https://issues.apache.org/jira/browse/SPARK-16803
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> {noformat}
> scala> sql("create table sample.sample stored as SEQUENCEFILE as select 1 as 
> key, 'abc' as value")
> res2: org.apache.spark.sql.DataFrame = []
> scala> val df = sql("select key, value as value from sample.sample")
> df: org.apache.spark.sql.DataFrame = [key: int, value: string]
> scala> df.write.mode("append").saveAsTable("sample.sample")
> scala> sql("select * from sample.sample").show()
> +---+-+
> |key|value|
> +---+-+
> |  1|  abc|
> |  1|  abc|
> +---+-+
> {noformat}
> In Spark 1.6, it works, but Spark 2.0 does not work. The error message from 
> Spark 2.0 is
> {noformat}
> scala> df.write.mode("append").saveAsTable("sample.sample")
> org.apache.spark.sql.AnalysisException: Saving data in MetastoreRelation 
> sample, sample
>  is not supported.;
> {noformat}
> So far, we do not plan to support it in Spark 2.0. Spark 1.6 works because it 
> internally uses {{insertInto}}. But if we change it back, it will break the 
> semantics of {{saveAsTable}} (this method uses by-name resolution instead of 
> the by-position resolution used by {{insertInto}}).
> Instead, users should use the {{insertInto}} API. We should correct the error 
> message so that users understand how to bypass the limitation until it is supported. 
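A minimal sketch of the suggested bypass, reusing the table from the example above:

{code}
// Workaround sketch: append through insertInto (by-position resolution)
// instead of saveAsTable, which fails on a MetastoreRelation in 2.0.0.
val df = sql("select key, value from sample.sample")
df.write.insertInto("sample.sample")   // insertInto appends by default
{code}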






[jira] [Comment Edited] (SPARK-17027) PolynomialExpansion.choose is prone to integer overflow

2016-08-11 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418071#comment-15418071
 ] 

Maciej Szymkiewicz edited comment on SPARK-17027 at 8/11/16 10:38 PM:
--

Yes, this is exactly the problem. 


{code}
choose(14, 10)
// res0: Int = -182
{code}


was (Author: zero323):
Yes, this exactly the problem. 


{code}
choose(14, 10)
// res0: Int = -182
{code}

> PolynomialExpansion.choose is prone to integer overflow 
> 
>
> Key: SPARK-17027
> URL: https://issues.apache.org/jira/browse/SPARK-17027
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> The current implementation computes the power of k directly and because of that it is 
> susceptible to integer overflow on relatively small input (4 features, degree 
> equal to 10). It would be better to use a recursive formula instead.






[jira] [Assigned] (SPARK-17028) Backport SI-9734 for Scala 2.10

2016-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17028:


Assignee: Apache Spark

> Backport SI-9734 for Scala 2.10
> ---
>
> Key: SPARK-17028
> URL: https://issues.apache.org/jira/browse/SPARK-17028
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: Scala 2.10
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> SI-9734 will be included in Scala 2.11.9. However, we still need to backport 
> it to Spark Scala 2.10 Shell manually. 






[jira] [Commented] (SPARK-17028) Backport SI-9734 for Scala 2.10

2016-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418068#comment-15418068
 ] 

Apache Spark commented on SPARK-17028:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/14611

> Backport SI-9734 for Scala 2.10
> ---
>
> Key: SPARK-17028
> URL: https://issues.apache.org/jira/browse/SPARK-17028
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: Scala 2.10
>Reporter: Shixiong Zhu
>
> SI-9734 will be included in Scala 2.11.9. However, we still need to backport 
> it to Spark Scala 2.10 Shell manually. 






[jira] [Resolved] (SPARK-17014) arithmetic.sql

2016-08-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17014.
---
Resolution: Invalid

Believe this was opened in error as a duplicate

> arithmetic.sql
> --
>
> Key: SPARK-17014
> URL: https://issues.apache.org/jira/browse/SPARK-17014
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Peter Lee
>







[jira] [Assigned] (SPARK-17028) Backport SI-9734 for Scala 2.10

2016-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17028:


Assignee: (was: Apache Spark)

> Backport SI-9734 for Scala 2.10
> ---
>
> Key: SPARK-17028
> URL: https://issues.apache.org/jira/browse/SPARK-17028
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: Scala 2.10
>Reporter: Shixiong Zhu
>
> SI-9734 will be included in Scala 2.11.9. However, we still need to backport 
> it to Spark Scala 2.10 Shell manually. 






[jira] [Commented] (SPARK-17027) PolynomialExpansion.choose is prone to integer overflow

2016-08-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418065#comment-15418065
 ] 

Sean Owen commented on SPARK-17027:
---

Is the problem in the naive calculation of n choose k?

{code}
  private def choose(n: Int, k: Int): Int = {
Range(n, n - k, -1).product / Range(k, 1, -1).product
  }
{code}

Let's just call 
http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/util/CombinatoricsUtils.html#binomialCoefficient(int,%20int)
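Alternatively, a sketch of an overflow-resistant variant using the multiplicative formula with a Long accumulator (dividing at every step keeps the intermediates small); either this or the commons-math call above would avoid the wrap-around:

{code}
// Sketch: n-choose-k via the multiplicative formula.
// The division is exact at each step because C(n, i - 1) * (n - i + 1) is divisible by i.
def chooseSafe(n: Int, k: Int): Long = {
  require(k >= 0 && k <= n)
  var result = 1L
  for (i <- 1 to math.min(k, n - k)) {
    result = result * (n - i + 1) / i
  }
  result
}

chooseSafe(14, 10)   // 1001, whereas the current Int version returns -182 (see the comment above)
{code}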

> PolynomialExpansion.choose is prone to integer overflow 
> 
>
> Key: SPARK-17027
> URL: https://issues.apache.org/jira/browse/SPARK-17027
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> The current implementation computes the power of k directly and because of that it is 
> susceptible to integer overflow on relatively small input (4 features, degree 
> equal to 10). It would be better to use a recursive formula instead.






[jira] [Created] (SPARK-17028) Backport SI-9734 for Scala 2.10

2016-08-11 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-17028:


 Summary: Backport SI-9734 for Scala 2.10
 Key: SPARK-17028
 URL: https://issues.apache.org/jira/browse/SPARK-17028
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
 Environment: Scala 2.10
Reporter: Shixiong Zhu


SI-9734 will be included in Scala 2.11.9. However, we still need to backport it 
to Spark Scala 2.10 Shell manually. 






[jira] [Created] (SPARK-17027) PolynomialExpansion.choose is prone to integer overflow

2016-08-11 Thread Maciej Szymkiewicz (JIRA)
Maciej Szymkiewicz created SPARK-17027:
--

 Summary: PolynomialExpansion.choose is prone to integer overflow 
 Key: SPARK-17027
 URL: https://issues.apache.org/jira/browse/SPARK-17027
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.0.0, 1.6.0
Reporter: Maciej Szymkiewicz
Priority: Minor


The current implementation computes the power of k directly and because of that it is 
susceptible to integer overflow on relatively small input (4 features, degree 
equal to 10). It would be better to use a recursive formula instead.






[jira] [Resolved] (SPARK-17022) Potential deadlock in driver handling message

2016-08-11 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-17022.

   Resolution: Fixed
 Assignee: Tao Wang
Fix Version/s: 2.1.0
   2.0.1

> Potential deadlock in driver handling message
> -
>
> Key: SPARK-17022
> URL: https://issues.apache.org/jira/browse/SPARK-17022
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 2.0.0
>Reporter: Tao Wang
>Assignee: Tao Wang
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> Suppose t1 < t2 < t3.
> At t1, someone calls YarnSchedulerBackend.doRequestTotalExecutors from one 
> of three functions: CoarseGrainedSchedulerBackend.killExecutors, 
> CoarseGrainedSchedulerBackend.requestTotalExecutors or 
> CoarseGrainedSchedulerBackend.requestExecutors, all of which hold the 
> lock on `CoarseGrainedSchedulerBackend`.
> YarnSchedulerBackend.doRequestTotalExecutors then sends a 
> RequestExecutors message to `yarnSchedulerEndpoint` and waits for the reply.
> At t2, someone sends a RemoveExecutor message to `yarnSchedulerEndpoint` and the 
> message is received by the endpoint.
> At t3, the RequestExecutors message sent at t1 is received by the endpoint.
> The endpoint will therefore handle RemoveExecutor first and then the 
> RequestExecutors message.
> When handling RemoveExecutor, it sends the same message to 
> `driverEndpoint` and waits for the reply.
> `driverEndpoint` needs the lock on `CoarseGrainedSchedulerBackend` to 
> handle that message, but that lock has been held since t1.
> So it causes a deadlock.
> We have hit this issue in our deployment; it blocks the driver from handling 
> any messages until both messages time out.






[jira] [Resolved] (SPARK-16868) Executor will be both dead and alive when this executor reregister itself to driver.

2016-08-11 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-16868.

   Resolution: Fixed
 Assignee: carlmartin
Fix Version/s: 2.1.0

> Executor will be both dead and alive when this executor reregister itself to 
> driver.
> 
>
> Key: SPARK-16868
> URL: https://issues.apache.org/jira/browse/SPARK-16868
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: carlmartin
>Assignee: carlmartin
>Priority: Minor
> Fix For: 2.1.0
>
> Attachments: 2016-8-3 15-41-47.jpg, 2016-8-3 15-51-13.jpg
>
>
> In a rare condition, Executor will register its block manager twice.
> !https://issues.apache.org/jira/secure/attachment/12821794/2016-8-3%2015-41-47.jpg!
> When it is unregistered from BlockManagerMaster, the driver marks it as "DEAD" in 
> the executors web UI.
> But when a heartbeat re-registers the block manager, this executor will 
> also show another status, "Active".
> !https://issues.apache.org/jira/secure/attachment/12821795/2016-8-3%2015-51-13.jpg!






[jira] [Resolved] (SPARK-13602) o.a.s.deploy.worker.DriverRunner may leak the driver processes

2016-08-11 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-13602.

   Resolution: Fixed
 Assignee: Bryan Cutler
Fix Version/s: 2.1.0

> o.a.s.deploy.worker.DriverRunner may leak the driver processes
> --
>
> Key: SPARK-13602
> URL: https://issues.apache.org/jira/browse/SPARK-13602
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Bryan Cutler
> Fix For: 2.1.0
>
>
> If Worker calls "System.exit", DriverRunner will not kill the driver 
> processes. We should add a shutdown hook in DriverRunner like 
> o.a.s.deploy.worker.ExecutorRunner 
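A generic sketch of the suggested pattern (not DriverRunner's actual code): register a JVM shutdown hook that destroys the child process when the worker JVM exits.

{code}
// Sketch: keep a handle to the launched driver process and kill it from a shutdown hook.
object DriverProcessGuard {
  @volatile private var process: Option[Process] = None

  Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
    override def run(): Unit = process.foreach(_.destroy())
  }))

  def register(p: Process): Unit = { process = Some(p) }
}
{code}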






[jira] [Assigned] (SPARK-17026) warning msg in MulticlassMetricsSuite

2016-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17026:


Assignee: Apache Spark

> warning msg in MulticlassMetricsSuite
> -
>
> Key: SPARK-17026
> URL: https://issues.apache.org/jira/browse/SPARK-17026
> Project: Spark
>  Issue Type: Improvement
>Reporter: Xin Ren
>Assignee: Apache Spark
>Priority: Trivial
>
> Got warning when building:
> {code}
> [warn] 
> /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala:74:
>  value precision in class MulticlassMetrics is deprecated: Use accuracy.
> [warn]assert(math.abs(metrics.accuracy - metrics.precision) < delta)
> [warn]^
> [warn] 
> /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala:75:
>  value recall in class MulticlassMetrics is deprecated: Use accuracy.
> [warn]assert(math.abs(metrics.accuracy - metrics.recall) < delta)
> [warn]^
> [warn] 
> /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala:76:
>  value fMeasure in class MulticlassMetrics is deprecated: Use accuracy.
> [warn]assert(math.abs(metrics.accuracy - metrics.fMeasure) < delta)
> [warn]^
> {code}
> And `precision`, `recall`, and `fMeasure` all reference `accuracy`:
> {code}
> assert(math.abs(metrics.accuracy - metrics.precision) < delta)
> assert(math.abs(metrics.accuracy - metrics.recall) < delta)
> assert(math.abs(metrics.accuracy - metrics.fMeasure) < delta)
> {code}
> {code}
>   /**
>* Returns precision
>*/
>   @Since("1.1.0")
>   @deprecated("Use accuracy.", "2.0.0")
>   lazy val precision: Double = accuracy
>   /**
>* Returns recall
>* (equals to precision for multiclass classifier
>* because sum of all false positives is equal to sum
>* of all false negatives)
>*/
>   @Since("1.1.0")
>   @deprecated("Use accuracy.", "2.0.0")
>   lazy val recall: Double = accuracy
>   /**
>* Returns f-measure
>* (equals to precision and recall because precision equals recall)
>*/
>   @Since("1.1.0")
>   @deprecated("Use accuracy.", "2.0.0")
>   lazy val fMeasure: Double = accuracy
> {code}






[jira] [Commented] (SPARK-17026) warning msg in MulticlassMetricsSuite

2016-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417995#comment-15417995
 ] 

Apache Spark commented on SPARK-17026:
--

User 'keypointt' has created a pull request for this issue:
https://github.com/apache/spark/pull/14610

> warning msg in MulticlassMetricsSuite
> -
>
> Key: SPARK-17026
> URL: https://issues.apache.org/jira/browse/SPARK-17026
> Project: Spark
>  Issue Type: Improvement
>Reporter: Xin Ren
>Priority: Trivial
>
> Got warning when building:
> {code}
> [warn] 
> /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala:74:
>  value precision in class MulticlassMetrics is deprecated: Use accuracy.
> [warn]assert(math.abs(metrics.accuracy - metrics.precision) < delta)
> [warn]^
> [warn] 
> /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala:75:
>  value recall in class MulticlassMetrics is deprecated: Use accuracy.
> [warn]assert(math.abs(metrics.accuracy - metrics.recall) < delta)
> [warn]^
> [warn] 
> /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala:76:
>  value fMeasure in class MulticlassMetrics is deprecated: Use accuracy.
> [warn]assert(math.abs(metrics.accuracy - metrics.fMeasure) < delta)
> [warn]^
> {code}
> And `precision`, `recall`, and `fMeasure` all reference `accuracy`:
> {code}
> assert(math.abs(metrics.accuracy - metrics.precision) < delta)
> assert(math.abs(metrics.accuracy - metrics.recall) < delta)
> assert(math.abs(metrics.accuracy - metrics.fMeasure) < delta)
> {code}
> {code}
>   /**
>* Returns precision
>*/
>   @Since("1.1.0")
>   @deprecated("Use accuracy.", "2.0.0")
>   lazy val precision: Double = accuracy
>   /**
>* Returns recall
>* (equals to precision for multiclass classifier
>* because sum of all false positives is equal to sum
>* of all false negatives)
>*/
>   @Since("1.1.0")
>   @deprecated("Use accuracy.", "2.0.0")
>   lazy val recall: Double = accuracy
>   /**
>* Returns f-measure
>* (equals to precision and recall because precision equals recall)
>*/
>   @Since("1.1.0")
>   @deprecated("Use accuracy.", "2.0.0")
>   lazy val fMeasure: Double = accuracy
> {code}






[jira] [Assigned] (SPARK-17026) warning msg in MulticlassMetricsSuite

2016-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17026:


Assignee: (was: Apache Spark)

> warning msg in MulticlassMetricsSuite
> -
>
> Key: SPARK-17026
> URL: https://issues.apache.org/jira/browse/SPARK-17026
> Project: Spark
>  Issue Type: Improvement
>Reporter: Xin Ren
>Priority: Trivial
>
> Got warning when building:
> {code}
> [warn] 
> /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala:74:
>  value precision in class MulticlassMetrics is deprecated: Use accuracy.
> [warn]assert(math.abs(metrics.accuracy - metrics.precision) < delta)
> [warn]^
> [warn] 
> /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala:75:
>  value recall in class MulticlassMetrics is deprecated: Use accuracy.
> [warn]assert(math.abs(metrics.accuracy - metrics.recall) < delta)
> [warn]^
> [warn] 
> /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala:76:
>  value fMeasure in class MulticlassMetrics is deprecated: Use accuracy.
> [warn]assert(math.abs(metrics.accuracy - metrics.fMeasure) < delta)
> [warn]^
> {code}
> And `precision`, `recall`, and `fMeasure` all reference `accuracy`:
> {code}
> assert(math.abs(metrics.accuracy - metrics.precision) < delta)
> assert(math.abs(metrics.accuracy - metrics.recall) < delta)
> assert(math.abs(metrics.accuracy - metrics.fMeasure) < delta)
> {code}
> {code}
>   /**
>* Returns precision
>*/
>   @Since("1.1.0")
>   @deprecated("Use accuracy.", "2.0.0")
>   lazy val precision: Double = accuracy
>   /**
>* Returns recall
>* (equals to precision for multiclass classifier
>* because sum of all false positives is equal to sum
>* of all false negatives)
>*/
>   @Since("1.1.0")
>   @deprecated("Use accuracy.", "2.0.0")
>   lazy val recall: Double = accuracy
>   /**
>* Returns f-measure
>* (equals to precision and recall because precision equals recall)
>*/
>   @Since("1.1.0")
>   @deprecated("Use accuracy.", "2.0.0")
>   lazy val fMeasure: Double = accuracy
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17026) warning msg in MulticlassMetricsSuite

2016-08-11 Thread Xin Ren (JIRA)
Xin Ren created SPARK-17026:
---

 Summary: warning msg in MulticlassMetricsSuite
 Key: SPARK-17026
 URL: https://issues.apache.org/jira/browse/SPARK-17026
 Project: Spark
  Issue Type: Improvement
Reporter: Xin Ren
Priority: Trivial


Got warnings when building:

{code}
[warn] 
/home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala:74:
 value precision in class MulticlassMetrics is deprecated: Use accuracy.
[warn]assert(math.abs(metrics.accuracy - metrics.precision) < delta)
[warn]^
[warn] 
/home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala:75:
 value recall in class MulticlassMetrics is deprecated: Use accuracy.
[warn]assert(math.abs(metrics.accuracy - metrics.recall) < delta)
[warn]^
[warn] 
/home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala:76:
 value fMeasure in class MulticlassMetrics is deprecated: Use accuracy.
[warn]assert(math.abs(metrics.accuracy - metrics.fMeasure) < delta)
[warn]^
{code}

And `precision`, `recall`, and `fMeasure` all reference `accuracy`:
{code}
assert(math.abs(metrics.accuracy - metrics.precision) < delta)
assert(math.abs(metrics.accuracy - metrics.recall) < delta)
assert(math.abs(metrics.accuracy - metrics.fMeasure) < delta)
{code}

{code}
  /**
   * Returns precision
   */
  @Since("1.1.0")
  @deprecated("Use accuracy.", "2.0.0")
  lazy val precision: Double = accuracy


  /**
   * Returns recall
   * (equals to precision for multiclass classifier
   * because sum of all false positives is equal to sum
   * of all false negatives)
   */
  @Since("1.1.0")
  @deprecated("Use accuracy.", "2.0.0")
  lazy val recall: Double = accuracy


  /**
   * Returns f-measure
   * (equals to precision and recall because precision equals recall)
   */
  @Since("1.1.0")
  @deprecated("Use accuracy.", "2.0.0")
  lazy val fMeasure: Double = accuracy

{code}
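
For reference, one possible way to keep the aggregate check while avoiding the deprecated members is sketched below. This is only an illustration, not the fix actually applied: the {{expected*}} values are placeholders, and the per-label/weighted variants are simply the non-deprecated alternatives available on {{MulticlassMetrics}}.

{code}
// Sketch only: `metrics` is a MulticlassMetrics instance and `delta` the
// tolerance used by the suite; the expected* values are placeholders.
// The deprecated aggregate precision/recall/fMeasure all alias accuracy,
// so the aggregate assertion collapses to a single check on accuracy ...
assert(math.abs(metrics.accuracy - expectedAccuracy) < delta)
// ... while the per-label and weighted variants remain non-deprecated:
assert(math.abs(metrics.precision(0.0) - expectedPrecisionForLabel0) < delta)
assert(math.abs(metrics.weightedRecall - expectedWeightedRecall) < delta)
{code}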




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17013) negative numeric literal parsing

2016-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417941#comment-15417941
 ] 

Apache Spark commented on SPARK-17013:
--

User 'petermaxlee' has created a pull request for this issue:
https://github.com/apache/spark/pull/14608

> negative numeric literal parsing
> 
>
> Key: SPARK-17013
> URL: https://issues.apache.org/jira/browse/SPARK-17013
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Peter Lee
>
> As found in https://github.com/apache/spark/pull/14592/files#r74367410, Spark 
> 2.0 parses negative numeric literals as the unary minus of positive literals. 
> This introduces problems for edge cases such as -9223372036854775809 
> being parsed as decimal instead of bigint.
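
As a plain-Scala illustration of why "negate a positive literal" breaks down at the boundary (this is not Spark's parser; the digit string below is just the magnitude of Long.MinValue):

{code}
// The magnitude of Long.MinValue cannot itself be represented as a positive
// Long, so "parse the digits, then negate" cannot yield a bigint here.
val digits = "9223372036854775808"
val asDecimal = -BigDecimal(digits)   // representable, but only as a decimal
// java.lang.Long.parseLong(digits)   // would throw NumberFormatException
{code}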



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3577) Add task metric to report spill time

2016-08-11 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417929#comment-15417929
 ] 

Kay Ousterhout commented on SPARK-3577:
---

I believe spill time will currently be displayed as part of the task runtime, 
but not as part of scheduler delay.

The scheduler delay is calculated by looking at the difference between two 
values:

(1) The time that the task was running on the executor
(2) The time from when the scheduler sent information about the task to the 
executor (so the executor could run the task) until the scheduler received a 
message that the task completed.

Scheduler delay is (2) - (1).  Usually when it's high, it's because of queueing 
delays in the scheduler that are either delaying the task getting sent to the 
executor (e.g., because the scheduler has a long queue of other tasks that need 
to be launched, or because tasks are large so take a while to send over the 
network) or that are delaying the task completion message getting back to the 
scheduler (which can happen when the rate of task launch is high -- greater 
than 1K or so task launches / second).
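
In code form, the relationship is roughly the following; this is a sketch of the two quantities described above, not Spark's actual implementation, and the names are illustrative:

{code}
// (2) launch-to-completion time observed by the scheduler, minus
// (1) time the task actually spent running on the executor.
def schedulerDelay(roundTripTimeMs: Long, executorRunTimeMs: Long): Long =
  math.max(0L, roundTripTimeMs - executorRunTimeMs)
{code}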

> Add task metric to report spill time
> 
>
> Key: SPARK-3577
> URL: https://issues.apache.org/jira/browse/SPARK-3577
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.1.0
>Reporter: Kay Ousterhout
>Priority: Minor
>
> The {{ExternalSorter}} passes its own {{ShuffleWriteMetrics}} into 
> {{ExternalSorter}}.  The write time recorded in those metrics is never used.  
> We should probably add task metrics to report this spill time, since for 
> shuffles, this would have previously been reported as part of shuffle write 
> time (with the original hash-based sorter).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16905) Support SQL DDL: MSCK REPAIR TABLE

2016-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417928#comment-15417928
 ] 

Apache Spark commented on SPARK-16905:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/14607

> Support SQL DDL: MSCK REPAIR TABLE
> --
>
> Key: SPARK-16905
> URL: https://issues.apache.org/jira/browse/SPARK-16905
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.1, 2.1.0
>
>
> MSCK REPAIR TABLE could be used to recover the partitions in metastore based 
> on partitions in file system.
> Another syntax is:
> ALTER TABLE table RECOVER PARTITIONS
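
For reference, the recovery can then be issued through the SQL interface; a minimal sketch, assuming a partitioned table named {{logs}} whose partition directories already exist on the file system:

{code}
// Register partitions found on the file system but missing from the metastore.
spark.sql("MSCK REPAIR TABLE logs")
// Equivalent alternative syntax mentioned above:
spark.sql("ALTER TABLE logs RECOVER PARTITIONS")
{code}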



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17018) literals.sql for testing literal parsing

2016-08-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17018.
-
   Resolution: Fixed
 Assignee: Peter Lee
Fix Version/s: 2.1.0
   2.0.1

> literals.sql for testing literal parsing
> 
>
> Key: SPARK-17018
> URL: https://issues.apache.org/jira/browse/SPARK-17018
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Peter Lee
>Assignee: Peter Lee
> Fix For: 2.0.1, 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3577) Add task metric to report spill time

2016-08-11 Thread Tzach Zohar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417916#comment-15417916
 ] 

Tzach Zohar commented on SPARK-3577:


Does this mean that currently, spill time will be displayed as part of the 
*Scheduler Delay*? 
Scheduler Delay is calculated pretty much as "everything that isn't 
specifically measured" (see 
[StagePage.getSchedulerDelay|https://github.com/apache/spark/blob/v2.0.0/core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala#L770]),
 so I'm wondering if indeed it might include  spill time if it's not included 
anywhere else. 

If so - this might explain long Scheduler Delay values which would be hard to 
make sense of otherwise (which I think is what I'm seeing...).

Thanks

> Add task metric to report spill time
> 
>
> Key: SPARK-3577
> URL: https://issues.apache.org/jira/browse/SPARK-3577
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.1.0
>Reporter: Kay Ousterhout
>Priority: Minor
>
> The {{ExternalSorter}} passes its own {{ShuffleWriteMetrics}} into 
> {{ExternalSorter}}.  The write time recorded in those metrics is never used.  
> We should probably add task metrics to report this spill time, since for 
> shuffles, this would have previously been reported as part of shuffle write 
> time (with the original hash-based sorter).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16784) Configurable log4j settings

2016-08-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417889#comment-15417889
 ] 

Sean Owen commented on SPARK-16784:
---

Oh, I really meant {{log4j.configuration}} to specify your own config.
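
For illustration, a minimal sketch of how that property is typically wired in; the path is hypothetical, and in cluster mode the driver-side option has to be supplied at submit time (e.g. via {{--conf spark.driver.extraJavaOptions=...}}) rather than from inside the application:

{code}
import org.apache.spark.SparkConf

// log4j.configuration is log4j 1.x's standard override property;
// the file path here is a placeholder.
val log4jOpt = "-Dlog4j.configuration=file:/path/to/custom-log4j.properties"

val conf = new SparkConf()
  // Executor JVMs start after this conf is read, so the property applies there.
  .set("spark.executor.extraJavaOptions", log4jOpt)
{code}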

> Configurable log4j settings
> ---
>
> Key: SPARK-16784
> URL: https://issues.apache.org/jira/browse/SPARK-16784
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Michael Gummelt
>
> I often want to change the logging configuration on a single spark job.  This 
> is easy in client mode.  I just modify log4j.properties.  It's difficult in 
> cluster mode, because I need to modify the log4j.properties in the 
> distribution in which the driver runs.  I'd like a way of setting this 
> dynamically, such as a java system property.  Some brief searching showed 
> that log4j doesn't seem to accept such a property, but I'd like to open up 
> this idea for further comment.  Maybe we can find a solution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16993) model.transform without label column in random forest regression

2016-08-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417892#comment-15417892
 ] 

Sean Owen commented on SPARK-16993:
---

You would need to show some code or more about the error.

> model.transform without label column in random forest regression
> 
>
> Key: SPARK-16993
> URL: https://issues.apache.org/jira/browse/SPARK-16993
> Project: Spark
>  Issue Type: Question
>  Components: Java API, ML
>Reporter: Dulaj Rajitha
>
> I need to use a separate data set for prediction (not the training/test split 
> shown in the example).
> That data set does not have the label column, since it is the data whose 
> labels need to be predicted.
> But model.transform reports that the label column is missing.
> org.apache.spark.sql.AnalysisException: cannot resolve 'label' given input 
> columns: [id,features,prediction]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16784) Configurable log4j settings

2016-08-11 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417857#comment-15417857
 ] 

Michael Gummelt edited comment on SPARK-16784 at 8/11/16 8:11 PM:
--

{{log4j.debug=true}} only results in log4j printing its internal debugging 
messages (e.g. config file location, appenders, etc.).  It doesn't turn on 
debug logging for the application.


was (Author: mgummelt):
{{log4j.debug=true}} only results in log4j printing its debugging messages.  It 
doesn't turn on debug logging for the application.

> Configurable log4j settings
> ---
>
> Key: SPARK-16784
> URL: https://issues.apache.org/jira/browse/SPARK-16784
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Michael Gummelt
>
> I often want to change the logging configuration on a single spark job.  This 
> is easy in client mode.  I just modify log4j.properties.  It's difficult in 
> cluster mode, because I need to modify the log4j.properties in the 
> distribution in which the driver runs.  I'd like a way of setting this 
> dynamically, such as a java system property.  Some brief searching showed 
> that log4j doesn't seem to accept such a property, but I'd like to open up 
> this idea for further comment.  Maybe we can find a solution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16784) Configurable log4j settings

2016-08-11 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417856#comment-15417856
 ] 

Michael Gummelt commented on SPARK-16784:
-

`log4j.debug=true` only results in log4j printing its debugging messages.  It 
doesn't turn on debug logging for the application.

> Configurable log4j settings
> ---
>
> Key: SPARK-16784
> URL: https://issues.apache.org/jira/browse/SPARK-16784
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Michael Gummelt
>
> I often want to change the logging configuration on a single spark job.  This 
> is easy in client mode.  I just modify log4j.properties.  It's difficult in 
> cluster mode, because I need to modify the log4j.properties in the 
> distribution in which the driver runs.  I'd like a way of setting this 
> dynamically, such as a java system property.  Some brief searching showed 
> that log4j doesn't seem to accept such a property, but I'd like to open up 
> this idea for further comment.  Maybe we can find a solution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-16784) Configurable log4j settings

2016-08-11 Thread Michael Gummelt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gummelt reopened SPARK-16784:
-

`log4j.debug=true` only results in log4j printing its debugging messages.  It 
doesn't turn on debug logging for the application.

> Configurable log4j settings
> ---
>
> Key: SPARK-16784
> URL: https://issues.apache.org/jira/browse/SPARK-16784
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Michael Gummelt
>
> I often want to change the logging configuration on a single spark job.  This 
> is easy in client mode.  I just modify log4j.properties.  It's difficult in 
> cluster mode, because I need to modify the log4j.properties in the 
> distribution in which the driver runs.  I'd like a way of setting this 
> dynamically, such as a java system property.  Some brief searching showed 
> that log4j doesn't seem to accept such a property, but I'd like to open up 
> this idea for further comment.  Maybe we can find a solution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16784) Configurable log4j settings

2016-08-11 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417857#comment-15417857
 ] 

Michael Gummelt edited comment on SPARK-16784 at 8/11/16 8:10 PM:
--

{{log4j.debug=true}} only results in log4j printing its debugging messages.  It 
doesn't turn on debug logging for the application.


was (Author: mgummelt):
`log4j.debug=true` only results in log4j printing its debugging messages.  It 
doesn't turn on debug logging for the application.

> Configurable log4j settings
> ---
>
> Key: SPARK-16784
> URL: https://issues.apache.org/jira/browse/SPARK-16784
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Michael Gummelt
>
> I often want to change the logging configuration on a single spark job.  This 
> is easy in client mode.  I just modify log4j.properties.  It's difficult in 
> cluster mode, because I need to modify the log4j.properties in the 
> distribution in which the driver runs.  I'd like a way of setting this 
> dynamically, such as a java system property.  Some brief searching showed 
> that log4j doesn't seem to accept such a property, but I'd like to open up 
> this idea for further comment.  Maybe we can find a solution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-16784) Configurable log4j settings

2016-08-11 Thread Michael Gummelt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gummelt updated SPARK-16784:

Comment: was deleted

(was: `log4j.debug=true` only results in log4j printing its debugging messages. 
 It doesn't turn on debug logging for the application.)

> Configurable log4j settings
> ---
>
> Key: SPARK-16784
> URL: https://issues.apache.org/jira/browse/SPARK-16784
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Michael Gummelt
>
> I often want to change the logging configuration on a single spark job.  This 
> is easy in client mode.  I just modify log4j.properties.  It's difficult in 
> cluster mode, because I need to modify the log4j.properties in the 
> distribution in which the driver runs.  I'd like a way of setting this 
> dynamically, such as a java system property.  Some brief searching showed 
> that log4j doesn't seem to accept such a property, but I'd like to open up 
> this idea for further comment.  Maybe we can find a solution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16993) model.transform without label column in random forest regression

2016-08-11 Thread Dulaj Rajitha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417847#comment-15417847
 ] 

Dulaj Rajitha commented on SPARK-16993:
---

But the thing is, if I add a dummy column as the label column, the process works 
fine.
I could not proceed without adding a dummy label column to the data set that 
needs the prediction.
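
A minimal sketch of that workaround (the names {{model}} and {{toScore}} are illustrative, and "label" should match whatever label column the model was trained with):

{code}
import org.apache.spark.sql.functions.lit

// Add a placeholder label column so the pipeline's schema check passes,
// then drop it from the scored output.
val scored = model.transform(toScore.withColumn("label", lit(0.0)))
  .drop("label")
{code}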

> model.transform without label column in random forest regression
> 
>
> Key: SPARK-16993
> URL: https://issues.apache.org/jira/browse/SPARK-16993
> Project: Spark
>  Issue Type: Question
>  Components: Java API, ML
>Reporter: Dulaj Rajitha
>
> I need to use a separate data set for prediction (not the training/test split 
> shown in the example).
> That data set does not have the label column, since it is the data whose 
> labels need to be predicted.
> But model.transform reports that the label column is missing.
> org.apache.spark.sql.AnalysisException: cannot resolve 'label' given input 
> columns: [id,features,prediction]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17024) Weird behaviour of the DataFrame when a column name contains dots.

2016-08-11 Thread Iaroslav Zeigerman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Iaroslav Zeigerman resolved SPARK-17024.

Resolution: Duplicate

> Weird behaviour of the DataFrame when a column name contains dots.
> --
>
> Key: SPARK-17024
> URL: https://issues.apache.org/jira/browse/SPARK-17024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Iaroslav Zeigerman
>
> When a column name contains dots and one of its segments is the same as 
> another column's name, Spark treats this column as a nested structure, 
> although the actual type of the column is String/Int/etc. Example:
> {code}
>   val df = sqlContext.createDataFrame(Seq(
> ("user1", "task1"),
> ("user2", "task2")
>   )).toDF("user", "user.task")
> {code}
> Two columns "user" and "user.task". Both of them are string, and the schema 
> resolution seems to be correct:
> {noformat}
> root
>  |-- user: string (nullable = true)
>  |-- user.task: string (nullable = true)
> {noformat}
> But when I try to query this DataFrame, e.g.:
> {code}
>   df.select(df("user"), df("user.task"))
> {code}
> Spark throws an exception "Can't extract value from user#2;" 
> It happens during the resolution of the LogicalPlan while processing the  
> "user.task" column.
> Here is the full stacktrace:
> {noformat}
> Can't extract value from user#2;
> org.apache.spark.sql.AnalysisException: Can't extract value from user#2;
>   at 
> org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:276)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:275)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:275)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:191)
>   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
>   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:708)
>   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:696)
> {noformat}
> Is this actually an expected behaviour? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17024) Weird behaviour of the DataFrame when a column name contains dots.

2016-08-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417829#comment-15417829
 ] 

Sean Owen commented on SPARK-17024:
---

There are many issues that sound like this, like 
https://issues.apache.org/jira/browse/SPARK-15230
Can you try 2.0? I think this is a duplicate of several, so also please search 
JIRA.

> Weird behaviour of the DataFrame when a column name contains dots.
> --
>
> Key: SPARK-17024
> URL: https://issues.apache.org/jira/browse/SPARK-17024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Iaroslav Zeigerman
>
> When a column name contains dots and one of its segments is the same as 
> another column's name, Spark treats this column as a nested structure, 
> although the actual type of the column is String/Int/etc. Example:
> {code}
>   val df = sqlContext.createDataFrame(Seq(
> ("user1", "task1"),
> ("user2", "task2")
>   )).toDF("user", "user.task")
> {code}
> Two columns "user" and "user.task". Both of them are string, and the schema 
> resolution seems to be correct:
> {noformat}
> root
>  |-- user: string (nullable = true)
>  |-- user.task: string (nullable = true)
> {noformat}
> But when I try to query this DataFrame, e.g.:
> {code}
>   df.select(df("user"), df("user.task"))
> {code}
> Spark throws an exception "Can't extract value from user#2;" 
> It happens during the resolution of the LogicalPlan while processing the  
> "user.task" column.
> Here is the full stacktrace:
> {noformat}
> Can't extract value from user#2;
> org.apache.spark.sql.AnalysisException: Can't extract value from user#2;
>   at 
> org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:276)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:275)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:275)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:191)
>   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
>   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:708)
>   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:696)
> {noformat}
> Is this actually an expected behaviour? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2016-08-11 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417788#comment-15417788
 ] 

Nicholas Chammas edited comment on SPARK-17025 at 8/11/16 7:33 PM:
---

cc [~josephkb], [~mengxr]

I guess a first step would be to add a {{_to_java}} method to the base 
{{Transformer}} class that simply raises {{NotImplementedError}}.

Ultimately though, is there a way to have the base class handle this work 
automatically, or do custom transformers need to each implement their own 
{{_to_java}} method?


was (Author: nchammas):
cc [~josephkb], [~mengxr]

I guess a first step be to add a {{_to_java}} method to the base Transformer 
class that simply raises {{NotImplementedError}}.

Is there a way to have the base class handle this work automatically, or do 
custom transformers need to each implement their own {{_to_java}} method?

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17024) Weird behaviour of the DataFrame when a column name contains dots.

2016-08-11 Thread Iaroslav Zeigerman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417807#comment-15417807
 ] 

Iaroslav Zeigerman commented on SPARK-17024:


If I query this way (with backquotes for "user.task"):
{code}
df.select(df("user"), df("`user.task`"))
{code}
leaving the rest of the code unchanged, it works fine.

> Weird behaviour of the DataFrame when a column name contains dots.
> --
>
> Key: SPARK-17024
> URL: https://issues.apache.org/jira/browse/SPARK-17024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Iaroslav Zeigerman
>
> When a column name contains dots and one of its segments is the same as 
> another column's name, Spark treats this column as a nested structure, 
> although the actual type of the column is String/Int/etc. Example:
> {code}
>   val df = sqlContext.createDataFrame(Seq(
> ("user1", "task1"),
> ("user2", "task2")
>   )).toDF("user", "user.task")
> {code}
> Two columns "user" and "user.task". Both of them are string, and the schema 
> resolution seems to be correct:
> {noformat}
> root
>  |-- user: string (nullable = true)
>  |-- user.task: string (nullable = true)
> {noformat}
> But when I try to query this DataFrame, e.g.:
> {code}
>   df.select(df("user"), df("user.task"))
> {code}
> Spark throws an exception "Can't extract value from user#2;" 
> It happens during the resolution of the LogicalPlan while processing the  
> "user.task" column.
> Here is the full stacktrace:
> {noformat}
> Can't extract value from user#2;
> org.apache.spark.sql.AnalysisException: Can't extract value from user#2;
>   at 
> org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:276)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:275)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:275)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:191)
>   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
>   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:708)
>   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:696)
> {noformat}
> Is this actually an expected behaviour? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2016-08-11 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417788#comment-15417788
 ] 

Nicholas Chammas edited comment on SPARK-17025 at 8/11/16 7:27 PM:
---

cc [~josephkb], [~mengxr]

I guess a first step be to add a {{_to_java}} method to the base Transformer 
class that simply raises {{NotImplementedError}}.

Is there a way to have the base class handle this work automatically, or do 
custom transformers need to each implement their own {{_to_java}} method?


was (Author: nchammas):
cc [~josephkb] [~mengxr]

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2016-08-11 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417788#comment-15417788
 ] 

Nicholas Chammas commented on SPARK-17025:
--

cc [~josephkb] [~mengxr]

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17024) Weird behaviour of the DataFrame when a column name contains dots.

2016-08-11 Thread Iaroslav Zeigerman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417789#comment-15417789
 ] 

Iaroslav Zeigerman commented on SPARK-17024:


If this behaviour is expected, is there a way to disable it?

> Weird behaviour of the DataFrame when a column name contains dots.
> --
>
> Key: SPARK-17024
> URL: https://issues.apache.org/jira/browse/SPARK-17024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Iaroslav Zeigerman
>
> When a column name contains dots and one of its segments is the same as 
> another column's name, Spark treats this column as a nested structure, 
> although the actual type of the column is String/Int/etc. Example:
> {code}
>   val df = sqlContext.createDataFrame(Seq(
> ("user1", "task1"),
> ("user2", "task2")
>   )).toDF("user", "user.task")
> {code}
> Two columns "user" and "user.task". Both of them are string, and the schema 
> resolution seems to be correct:
> {noformat}
> root
>  |-- user: string (nullable = true)
>  |-- user.task: string (nullable = true)
> {noformat}
> But when I try to query this DataFrame, e.g.:
> {code}
>   df.select(df("user"), df("user.task"))
> {code}
> Spark throws an exception "Can't extract value from user#2;" 
> It happens during the resolution of the LogicalPlan while processing the  
> "user.task" column.
> Here is the full stacktrace:
> {noformat}
> Can't extract value from user#2;
> org.apache.spark.sql.AnalysisException: Can't extract value from user#2;
>   at 
> org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:276)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:275)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:275)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:191)
>   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
>   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:708)
>   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:696)
> {noformat}
> Is this actually an expected behaviour? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2016-08-11 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-17025:


 Summary: Cannot persist PySpark ML Pipeline model that includes 
custom Transformer
 Key: SPARK-17025
 URL: https://issues.apache.org/jira/browse/SPARK-17025
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark
Affects Versions: 2.0.0
Reporter: Nicholas Chammas
Priority: Minor


Following the example in [this Databricks blog 
post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
 under "Python tuning", I'm trying to save an ML Pipeline model.

This pipeline, however, includes a custom transformer. When I try to save the 
model, the operation fails because the custom transformer doesn't have a 
{{_to_java}} attribute.

{code}
Traceback (most recent call last):
  File ".../file.py", line 56, in 
model.bestModel.save('model')
  File 
"/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
 line 222, in save
  File 
"/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
 line 217, in write
  File 
"/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
 line 93, in __init__
  File 
"/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
 line 254, in _to_java
AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
{code}

Looking at the source code for 
[ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
 I see that not even the base Transformer class has such an attribute.

I'm assuming this is missing functionality that is intended to be patched up 
(i.e. [like 
this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).

I'm not sure if there is an existing JIRA for this (my searches didn't turn up 
clear results).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17024) Weird behaviour of the DataFrame when a column name contains dots.

2016-08-11 Thread Iaroslav Zeigerman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Iaroslav Zeigerman updated SPARK-17024:
---
Summary: Weird behaviour of the DataFrame when a column name contains dots. 
 (was: Weird behaviour of the DataFrame when the column name contains dots.)

> Weird behaviour of the DataFrame when a column name contains dots.
> --
>
> Key: SPARK-17024
> URL: https://issues.apache.org/jira/browse/SPARK-17024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Iaroslav Zeigerman
>
> When a column name contains dots and one of its segments is the same as 
> another column's name, Spark treats this column as a nested structure, 
> although the actual type of the column is String/Int/etc. Example:
> {code}
>   val df = sqlContext.createDataFrame(Seq(
> ("user1", "task1"),
> ("user2", "task2")
>   )).toDF("user", "user.task")
> {code}
> Two columns "user" and "user.task". Both of them are string, and the schema 
> resolution seems to be correct:
> {noformat}
> root
>  |-- user: string (nullable = true)
>  |-- user.task: string (nullable = true)
> {noformat}
> But when I try to query this DataFrame, e.g.:
> {code}
>   df.select(df("user"), df("user.task"))
> {code}
> Spark throws an exception "Can't extract value from user#2;" 
> It happens during the resolution of the LogicalPlan and while processing the  
> "user.task" column.
> Here is the full stacktrace:
> {noformat}
> Can't extract value from user#2;
> org.apache.spark.sql.AnalysisException: Can't extract value from user#2;
>   at 
> org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:276)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:275)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:275)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:191)
>   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
>   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:708)
>   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:696)
> {noformat}
> Is this actually an expected behaviour? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17024) Weird behaviour of the DataFrame when the column name contains dots.

2016-08-11 Thread Iaroslav Zeigerman (JIRA)
Iaroslav Zeigerman created SPARK-17024:
--

 Summary: Weird behaviour of the DataFrame when the column name 
contains dots.
 Key: SPARK-17024
 URL: https://issues.apache.org/jira/browse/SPARK-17024
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Iaroslav Zeigerman


When a column name contains dots and one of its segments is the same as 
another column's name, Spark treats this column as a nested structure, 
although the actual type of the column is String/Int/etc. Example:

{code}
  val df = sqlContext.createDataFrame(Seq(
("user1", "task1"),
("user2", "task2")
  )).toDF("user", "user.task")
{code}

Two columns "user" and "user.task". Both of them are string, and the schema 
resolution seems to be correct:

{noformat}
root
 |-- user: string (nullable = true)
 |-- user.task: string (nullable = true)
{noformat}

But when I try to query this DataFrame, e.g.:
{code}
  df.select(df("user"), df("user.task"))
{code}

Spark throws an exception "Can't extract value from user#2;" 
It happens during the resolution of the LogicalPlan and while processing the  
"user.task" column.

Here is the full stacktrace:

{noformat}
Can't extract value from user#2;
org.apache.spark.sql.AnalysisException: Can't extract value from user#2;
at 
org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:276)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:275)
at 
scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:275)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:191)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:708)
at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:696)
{noformat}

Is this actually an expected behaviour? 
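
A workaround that comes up later in this thread is to escape the dotted name with backticks so the analyzer treats it as a single column rather than a nested field; for example:

{code}
df.select(df("user"), df("`user.task`"))
{code}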



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17024) Weird behaviour of the DataFrame when a column name contains dots.

2016-08-11 Thread Iaroslav Zeigerman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Iaroslav Zeigerman updated SPARK-17024:
---
Description: 
When a column name contains dots and one of its segments is the same as 
another column's name, Spark treats this column as a nested structure, 
although the actual type of the column is String/Int/etc. Example:

{code}
  val df = sqlContext.createDataFrame(Seq(
("user1", "task1"),
("user2", "task2")
  )).toDF("user", "user.task")
{code}

Two columns "user" and "user.task". Both of them are string, and the schema 
resolution seems to be correct:

{noformat}
root
 |-- user: string (nullable = true)
 |-- user.task: string (nullable = true)
{noformat}

But when I try to query this DataFrame, e.g.:
{code}
  df.select(df("user"), df("user.task"))
{code}

Spark throws an exception "Can't extract value from user#2;" 
It happens during the resolution of the LogicalPlan while processing the  
"user.task" column.

Here is the full stacktrace:

{noformat}
Can't extract value from user#2;
org.apache.spark.sql.AnalysisException: Can't extract value from user#2;
at 
org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:276)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:275)
at 
scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:275)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:191)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:708)
at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:696)
{noformat}

Is this actually an expected behaviour? 

  was:
When a column name contains dots and one of its segments is the same as 
another column's name, Spark treats this column as a nested structure, 
although the actual type of the column is String/Int/etc. Example:

{code}
  val df = sqlContext.createDataFrame(Seq(
("user1", "task1"),
("user2", "task2")
  )).toDF("user", "user.task")
{code}

Two columns "user" and "user.task". Both of them are string, and the schema 
resolution seems to be correct:

{noformat}
root
 |-- user: string (nullable = true)
 |-- user.task: string (nullable = true)
{noformat}

But when I try to query this DataFrame, e.g.:
{code}
  df.select(df("user"), df("user.task"))
{code}

Spark throws an exception "Can't extract value from user#2;" 
It happens during the resolution of the LogicalPlan and while processing the  
"user.task" column.

Here is the full stacktrace:

{noformat}
Can't extract value from user#2;
org.apache.spark.sql.AnalysisException: Can't extract value from user#2;
at 
org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:276)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:275)
at 
scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:275)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:191)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:708)
at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:696)
{noformat}

Is this actually an expected behaviour? 


> Weird behaviour of the DataFrame when a column name contains dots.
> --
>
> Key: SPARK-17024
> URL: https://issues.apache.org/jira/browse/SPARK-17024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Iaroslav Zeigerman
>
> When a column name contains dots and one of its segments is the same as 
> another column's name, Spark treats this column as a nested structure, 
> although the actual type of the column is String/Int/etc. Example:
> {code}
>   val df = sqlContext.createDataFrame(Seq(
> ("user1", "task1"),
> ("user2", "task2")
>   )).toDF("user", "user.task")
> {code}
> Two columns "user" and "user.task". Both of them are string, and the 

[jira] [Resolved] (SPARK-17021) simplify the constructor parameters of QuantileSummaries

2016-08-11 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-17021.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14603
[https://github.com/apache/spark/pull/14603]

> simplify the constructor parameters of QuantileSummaries
> 
>
> Key: SPARK-17021
> URL: https://issues.apache.org/jira/browse/SPARK-17021
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17015) group-by-ordinal and order-by-ordinal test cases

2016-08-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17015:

Fix Version/s: 2.0.1

> group-by-ordinal and order-by-ordinal test cases
> 
>
> Key: SPARK-17015
> URL: https://issues.apache.org/jira/browse/SPARK-17015
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Peter Lee
>Assignee: Peter Lee
> Fix For: 2.0.1, 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17016) group-by/order-by ordinal should throw AnalysisException instead of UnresolvedException

2016-08-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17016:

Fix Version/s: 2.0.1

> group-by/order-by ordinal should throw AnalysisException instead of 
> UnresolvedException
> ---
>
> Key: SPARK-17016
> URL: https://issues.apache.org/jira/browse/SPARK-17016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Peter Lee
>Assignee: Peter Lee
> Fix For: 2.0.1, 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17023) Update Kafka connector to use Kafka 0.10.0.1

2016-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17023:


Assignee: (was: Apache Spark)

> Update Kafka connector to use Kafka 0.10.0.1
> ---
>
> Key: SPARK-17023
> URL: https://issues.apache.org/jira/browse/SPARK-17023
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Luciano Resende
>Priority: Minor
>
> Update Kafka connector to use latest version of Kafka dependencies (0.10.0.1)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17023) Update Kafka connector to use Kafka 0.10.0.1

2016-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17023:


Assignee: Apache Spark

> Update Kafka connector to use Kafka 0.10.0.1
> ---
>
> Key: SPARK-17023
> URL: https://issues.apache.org/jira/browse/SPARK-17023
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Luciano Resende
>Assignee: Apache Spark
>Priority: Minor
>
> Update Kafka connector to use latest version of Kafka dependencies (0.10.0.1)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17023) Update Kafka connector to use Kafka 0.10.0.1

2016-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417622#comment-15417622
 ] 

Apache Spark commented on SPARK-17023:
--

User 'lresende' has created a pull request for this issue:
https://github.com/apache/spark/pull/14606

> Update Kafka connector to use Kafka 0.10.0.1
> ---
>
> Key: SPARK-17023
> URL: https://issues.apache.org/jira/browse/SPARK-17023
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Luciano Resende
>Priority: Minor
>
> Update Kafka connector to use latest version of Kafka dependencies (0.10.0.1)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17023) Update Kafka connector to use Kafka 0.10.0.1

2016-08-11 Thread Luciano Resende (JIRA)
Luciano Resende created SPARK-17023:
---

 Summary: Update Kafka connector to use Kafka 0.10.0.1
 Key: SPARK-17023
 URL: https://issues.apache.org/jira/browse/SPARK-17023
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Luciano Resende
Priority: Minor


Update Kafka connector to use latest version of Kafka dependencies (0.10.0.1)
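
For context, an illustrative sbt coordinate for the client artifact at the proposed 
version (Spark pins this version in its own build files, so this is only a sketch of 
the dependency, not the actual change):

{code}
// Illustrative only: the Kafka client artifact at the proposed version.
libraryDependencies += "org.apache.kafka" % "kafka-clients" % "0.10.0.1"
{code}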



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16577) Add check-cran script to Jenkins

2016-08-11 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417595#comment-15417595
 ] 

Shivaram Venkataraman commented on SPARK-16577:
---

Good point - let me check this with [~shaneknapp], who maintains our Jenkins 
cluster. We can also try --no-manual and see if that removes the PDF check.

> Add check-cran script to Jenkins
> 
>
> Key: SPARK-16577
> URL: https://issues.apache.org/jira/browse/SPARK-16577
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> After we have fixed the warnings from the CRAN checks we should add this as a 
> part of the Jenkins build.
> This depends on SPARK-16507 and SPARK-16508 being resolved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16519) Handle SparkR RDD generics that create warnings in R CMD check

2016-08-11 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417592#comment-15417592
 ] 

Shivaram Venkataraman commented on SPARK-16519:
---

Yeah, I think the simplest thing to do is to just append RDD to the method names 
- i.e. unpersistRDD, collectRDD, etc. Let's do this first while we continue to 
discuss things in SPARK-16611.

> Handle SparkR RDD generics that create warnings in R CMD check
> --
>
> Key: SPARK-16519
> URL: https://issues.apache.org/jira/browse/SPARK-16519
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> One of the warnings we get from R CMD check is that RDD implementations of 
> some of the generics are not documented. These generics are shared between 
> RDD, DataFrames in SparkR. The list includes
> {quote}
> WARNING
> Undocumented S4 methods:
>   generic 'cache' and siglist 'RDD'
>   generic 'collect' and siglist 'RDD'
>   generic 'count' and siglist 'RDD'
>   generic 'distinct' and siglist 'RDD'
>   generic 'first' and siglist 'RDD'
>   generic 'join' and siglist 'RDD,RDD'
>   generic 'length' and siglist 'RDD'
>   generic 'partitionBy' and siglist 'RDD'
>   generic 'persist' and siglist 'RDD,character'
>   generic 'repartition' and siglist 'RDD'
>   generic 'show' and siglist 'RDD'
>   generic 'take' and siglist 'RDD,numeric'
>   generic 'unpersist' and siglist 'RDD'
> {quote}
> As described in 
> https://stat.ethz.ch/pipermail/r-devel/2003-September/027490.html this looks 
> like a limitation of R where exporting a generic from a package also exports 
> all the implementations of that generic. 
> One way to get around this is to remove the RDD API or rename the methods in 
> Spark 2.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17022) Potential deadlock in driver handling message

2016-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17022:


Assignee: Apache Spark

> Potential deadlock in driver handling message
> -
>
> Key: SPARK-17022
> URL: https://issues.apache.org/jira/browse/SPARK-17022
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 2.0.0
>Reporter: Tao Wang
>Assignee: Apache Spark
>Priority: Critical
>
> Suppose t1 < t2 < t3.
> At t1, someone calls YarnSchedulerBackend.doRequestTotalExecutors from one 
> of three functions: CoarseGrainedSchedulerBackend.killExecutors, 
> CoarseGrainedSchedulerBackend.requestTotalExecutors or 
> CoarseGrainedSchedulerBackend.requestExecutors, all of which hold the 
> `CoarseGrainedSchedulerBackend` lock.
> YarnSchedulerBackend.doRequestTotalExecutors then sends a RequestExecutors 
> message to `yarnSchedulerEndpoint` and waits for the reply.
> At t2, someone sends a RemoveExecutor message to `yarnSchedulerEndpoint` and 
> the message is received by the endpoint.
> At t3, the RequestExecutors message sent at t1 is received by the endpoint.
> The endpoint therefore handles RemoveExecutor first and RequestExecutors 
> second.
> When handling RemoveExecutor, the endpoint forwards the same message to 
> `driverEndpoint` and waits for the reply.
> To handle that message, `driverEndpoint` needs the 
> `CoarseGrainedSchedulerBackend` lock, which has been held since t1.
> This causes a deadlock.
> We have hit this issue in our deployment: it blocks the driver from handling 
> any messages until both messages time out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17022) Potential deadlock in driver handling message

2016-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417582#comment-15417582
 ] 

Apache Spark commented on SPARK-17022:
--

User 'WangTaoTheTonic' has created a pull request for this issue:
https://github.com/apache/spark/pull/14605

> Potential deadlock in driver handling message
> -
>
> Key: SPARK-17022
> URL: https://issues.apache.org/jira/browse/SPARK-17022
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 2.0.0
>Reporter: Tao Wang
>Priority: Critical
>
> Suppose t1 < t2 < t3.
> At t1, someone calls YarnSchedulerBackend.doRequestTotalExecutors from one 
> of three functions: CoarseGrainedSchedulerBackend.killExecutors, 
> CoarseGrainedSchedulerBackend.requestTotalExecutors or 
> CoarseGrainedSchedulerBackend.requestExecutors, all of which hold the 
> `CoarseGrainedSchedulerBackend` lock.
> YarnSchedulerBackend.doRequestTotalExecutors then sends a RequestExecutors 
> message to `yarnSchedulerEndpoint` and waits for the reply.
> At t2, someone sends a RemoveExecutor message to `yarnSchedulerEndpoint` and 
> the message is received by the endpoint.
> At t3, the RequestExecutors message sent at t1 is received by the endpoint.
> The endpoint therefore handles RemoveExecutor first and RequestExecutors 
> second.
> When handling RemoveExecutor, the endpoint forwards the same message to 
> `driverEndpoint` and waits for the reply.
> To handle that message, `driverEndpoint` needs the 
> `CoarseGrainedSchedulerBackend` lock, which has been held since t1.
> This causes a deadlock.
> We have hit this issue in our deployment: it blocks the driver from handling 
> any messages until both messages time out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17022) Potential deadlock in driver handling message

2016-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17022:


Assignee: (was: Apache Spark)

> Potential deadlock in driver handling message
> -
>
> Key: SPARK-17022
> URL: https://issues.apache.org/jira/browse/SPARK-17022
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 2.0.0
>Reporter: Tao Wang
>Priority: Critical
>
> Suppose t1 < t2 < t3.
> At t1, someone calls YarnSchedulerBackend.doRequestTotalExecutors from one 
> of three functions: CoarseGrainedSchedulerBackend.killExecutors, 
> CoarseGrainedSchedulerBackend.requestTotalExecutors or 
> CoarseGrainedSchedulerBackend.requestExecutors, all of which hold the 
> `CoarseGrainedSchedulerBackend` lock.
> YarnSchedulerBackend.doRequestTotalExecutors then sends a RequestExecutors 
> message to `yarnSchedulerEndpoint` and waits for the reply.
> At t2, someone sends a RemoveExecutor message to `yarnSchedulerEndpoint` and 
> the message is received by the endpoint.
> At t3, the RequestExecutors message sent at t1 is received by the endpoint.
> The endpoint therefore handles RemoveExecutor first and RequestExecutors 
> second.
> When handling RemoveExecutor, the endpoint forwards the same message to 
> `driverEndpoint` and waits for the reply.
> To handle that message, `driverEndpoint` needs the 
> `CoarseGrainedSchedulerBackend` lock, which has been held since t1.
> This causes a deadlock.
> We have hit this issue in our deployment: it blocks the driver from handling 
> any messages until both messages time out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16958) Reuse subqueries within single query

2016-08-11 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-16958.

   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14548
[https://github.com/apache/spark/pull/14548]

> Reuse subqueries within single query
> 
>
> Key: SPARK-16958
> URL: https://issues.apache.org/jira/browse/SPARK-16958
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.1.0
>
>
> There could be the same subquery repeated within a single query; we could 
> reuse its result instead of running it multiple times.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16519) Handle SparkR RDD generics that create warnings in R CMD check

2016-08-11 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417541#comment-15417541
 ] 

Felix Cheung edited comment on SPARK-16519 at 8/11/16 4:40 PM:
---

Since we are undecided on what to export for RDD, should we just go ahead and 
rename these internal methods so their names won't match the generics we are 
exporting? That should eliminate the warnings.

If that makes sense, I could start this shortly.


was (Author: felixcheung):
since we are undecided on what to export for RDD, should we just go ahead and 
rename in these internal methods so we won't match the generics we are 
exporting? That should eliminate the warnings.

If that makes sense, I could start this shortly.

> Handle SparkR RDD generics that create warnings in R CMD check
> --
>
> Key: SPARK-16519
> URL: https://issues.apache.org/jira/browse/SPARK-16519
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> One of the warnings we get from R CMD check is that RDD implementations of 
> some of the generics are not documented. These generics are shared between 
> RDD, DataFrames in SparkR. The list includes
> {quote}
> WARNING
> Undocumented S4 methods:
>   generic 'cache' and siglist 'RDD'
>   generic 'collect' and siglist 'RDD'
>   generic 'count' and siglist 'RDD'
>   generic 'distinct' and siglist 'RDD'
>   generic 'first' and siglist 'RDD'
>   generic 'join' and siglist 'RDD,RDD'
>   generic 'length' and siglist 'RDD'
>   generic 'partitionBy' and siglist 'RDD'
>   generic 'persist' and siglist 'RDD,character'
>   generic 'repartition' and siglist 'RDD'
>   generic 'show' and siglist 'RDD'
>   generic 'take' and siglist 'RDD,numeric'
>   generic 'unpersist' and siglist 'RDD'
> {quote}
> As described in 
> https://stat.ethz.ch/pipermail/r-devel/2003-September/027490.html this looks 
> like a limitation of R where exporting a generic from a package also exports 
> all the implementations of that generic. 
> One way to get around this is to remove the RDD API or rename the methods in 
> Spark 2.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16519) Handle SparkR RDD generics that create warnings in R CMD check

2016-08-11 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417541#comment-15417541
 ] 

Felix Cheung commented on SPARK-16519:
--

Since we are undecided on what to export for RDD, should we just go ahead and 
rename these internal methods so we won't match the generics we are 
exporting? That should eliminate the warnings.

If that makes sense, I could start this shortly.

> Handle SparkR RDD generics that create warnings in R CMD check
> --
>
> Key: SPARK-16519
> URL: https://issues.apache.org/jira/browse/SPARK-16519
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> One of the warnings we get from R CMD check is that RDD implementations of 
> some of the generics are not documented. These generics are shared between 
> RDD, DataFrames in SparkR. The list includes
> {quote}
> WARNING
> Undocumented S4 methods:
>   generic 'cache' and siglist 'RDD'
>   generic 'collect' and siglist 'RDD'
>   generic 'count' and siglist 'RDD'
>   generic 'distinct' and siglist 'RDD'
>   generic 'first' and siglist 'RDD'
>   generic 'join' and siglist 'RDD,RDD'
>   generic 'length' and siglist 'RDD'
>   generic 'partitionBy' and siglist 'RDD'
>   generic 'persist' and siglist 'RDD,character'
>   generic 'repartition' and siglist 'RDD'
>   generic 'show' and siglist 'RDD'
>   generic 'take' and siglist 'RDD,numeric'
>   generic 'unpersist' and siglist 'RDD'
> {quote}
> As described in 
> https://stat.ethz.ch/pipermail/r-devel/2003-September/027490.html this looks 
> like a limitation of R where exporting a generic from a package also exports 
> all the implementations of that generic. 
> One way to get around this is to remove the RDD API or rename the methods in 
> Spark 2.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16577) Add check-cran script to Jenkins

2016-08-11 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417536#comment-15417536
 ] 

Felix Cheung edited comment on SPARK-16577 at 8/11/16 4:36 PM:
---

I found that running the CRAN check with the PDF manual requires these packages 
on Ubuntu:
texlive
texinfo
texlive-fonts-extra

Are these something we need to pre-install for the Jenkins runs before adding 
this to Jenkins?



was (Author: felixcheung):
I found that to run the cran check on PDF it requires these 
texlive
texinfo
texlive-fonts-extra

on Ubuntu - are these something we need to pre-install on Jenkins runs?


> Add check-cran script to Jenkins
> 
>
> Key: SPARK-16577
> URL: https://issues.apache.org/jira/browse/SPARK-16577
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> After we have fixed the warnings from the CRAN checks we should add this as a 
> part of the Jenkins build.
> This depends on SPARK-16507 and SPARK-16508 being resolved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16577) Add check-cran script to Jenkins

2016-08-11 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417536#comment-15417536
 ] 

Felix Cheung commented on SPARK-16577:
--

I found that running the CRAN check with the PDF manual requires these packages 
on Ubuntu:
texlive
texinfo
texlive-fonts-extra

Are these something we need to pre-install for the Jenkins runs?


> Add check-cran script to Jenkins
> 
>
> Key: SPARK-16577
> URL: https://issues.apache.org/jira/browse/SPARK-16577
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> After we have fixed the warnings from the CRAN checks we should add this as a 
> part of the Jenkins build.
> This depends on SPARK-16507 and SPARK-16508 being resolved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16831) CrossValidator reports incorrect avgMetrics

2016-08-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16831:
--
Fix Version/s: (was: 1.6.3)

> CrossValidator reports incorrect avgMetrics
> ---
>
> Key: SPARK-16831
> URL: https://issues.apache.org/jira/browse/SPARK-16831
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Max Moroz
>Assignee: Max Moroz
> Fix For: 2.0.1, 2.1.0
>
>
> The avgMetrics are summed up across all folds instead of being averaged. This 
> is an easy fix in the CrossValidator._fit() function: 
> {code}metrics[j] += metric{code} should be 
> {code}metrics[j] += metric/nFolds{code}.
> {code}
> # Assumes a PySpark 2.0 shell, where `spark` is already defined.
> import pyspark.ml.tuning
> import pyspark.ml.regression
> import pyspark.ml.evaluation
> from pyspark.ml.linalg import Vectors
>
> dataset = spark.createDataFrame(
>     [(Vectors.dense([0.0]), 0.0),
>      (Vectors.dense([0.4]), 1.0),
>      (Vectors.dense([0.5]), 0.0),
>      (Vectors.dense([0.6]), 1.0),
>      (Vectors.dense([1.0]), 1.0)] * 1000,
>     ["features", "label"]).cache()
>
> paramGrid = pyspark.ml.tuning.ParamGridBuilder().build()
> tvs = pyspark.ml.tuning.TrainValidationSplit(
>     estimator=pyspark.ml.regression.LinearRegression(),
>     estimatorParamMaps=paramGrid,
>     evaluator=pyspark.ml.evaluation.RegressionEvaluator(),
>     trainRatio=0.8)
> model = tvs.fit(dataset)
> print(model.validationMetrics)
>
> for folds in (3, 5, 10):
>     cv = pyspark.ml.tuning.CrossValidator(
>         estimator=pyspark.ml.regression.LinearRegression(),
>         estimatorParamMaps=paramGrid,
>         evaluator=pyspark.ml.evaluation.RegressionEvaluator(),
>         numFolds=folds)
>     cvModel = cv.fit(dataset)
>     print(folds, cvModel.avgMetrics)
> {code}
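
As a toy illustration of the arithmetic above (plain Scala rather than the PySpark 
code in the quoted report, with made-up fold scores): accumulating raw per-fold 
metrics yields the sum over the folds, while accumulating metric/nFolds yields the 
mean that avgMetrics is meant to report.

{code}
object AvgMetricsSketch extends App {
  val foldMetrics = Seq(0.70, 0.75, 0.80)   // hypothetical per-fold evaluator scores
  val nFolds = foldMetrics.size

  val summed   = foldMetrics.foldLeft(0.0)(_ + _)           // like metrics[j] += metric
  val averaged = foldMetrics.foldLeft(0.0)(_ + _ / nFolds)  // like metrics[j] += metric / nFolds

  println(s"summed = $summed, averaged = $averaged")        // 2.25 vs 0.75
}
{code}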



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17022) Potential deadlock in driver handling message

2016-08-11 Thread Tao Wang (JIRA)
Tao Wang created SPARK-17022:


 Summary: Potential deadlock in driver handling message
 Key: SPARK-17022
 URL: https://issues.apache.org/jira/browse/SPARK-17022
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.0.0, 1.6.1, 1.6.0, 1.5.2, 1.5.1, 1.5.0
Reporter: Tao Wang
Priority: Critical


Suppose t1 < t2 < t3.
At t1, someone calls YarnSchedulerBackend.doRequestTotalExecutors from one of 
three functions: CoarseGrainedSchedulerBackend.killExecutors, 
CoarseGrainedSchedulerBackend.requestTotalExecutors or 
CoarseGrainedSchedulerBackend.requestExecutors, all of which hold the 
`CoarseGrainedSchedulerBackend` lock.
YarnSchedulerBackend.doRequestTotalExecutors then sends a RequestExecutors 
message to `yarnSchedulerEndpoint` and waits for the reply.

At t2, someone sends a RemoveExecutor message to `yarnSchedulerEndpoint` and the 
message is received by the endpoint.

At t3, the RequestExecutors message sent at t1 is received by the endpoint.

The endpoint therefore handles RemoveExecutor first and RequestExecutors second.

When handling RemoveExecutor, the endpoint forwards the same message to 
`driverEndpoint` and waits for the reply.

To handle that message, `driverEndpoint` needs the 
`CoarseGrainedSchedulerBackend` lock, which has been held since t1.

This causes a deadlock.

We have hit this issue in our deployment: it blocks the driver from handling any 
messages until both messages time out.
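
For illustration, here is a minimal, self-contained sketch of this lock pattern 
(plain Scala, not Spark code; all names below are stand-ins): one thread holds a 
monitor while blocking on a reply that can only be produced by another thread that 
needs the same monitor.

{code}
import java.util.concurrent.{CountDownLatch, TimeUnit}

object DeadlockSketch extends App {
  private val backendLock = new Object            // stands in for the CoarseGrainedSchedulerBackend monitor
  private val replyReady  = new CountDownLatch(1) // stands in for the RPC reply the requester waits for

  // Like doRequestTotalExecutors: take the lock, then block waiting for the endpoint's reply.
  private val requester = new Thread(new Runnable {
    override def run(): Unit = backendLock.synchronized {
      val gotReply = replyReady.await(2, TimeUnit.SECONDS) // timeout only so the sketch terminates
      println(s"requester got a reply: $gotReply")         // prints false: the reply never arrives
    }
  })

  // Like the endpoint handling RemoveExecutor before RequestExecutors:
  // it needs the same monitor before it can produce the reply.
  private val endpoint = new Thread(new Runnable {
    override def run(): Unit = backendLock.synchronized {
      replyReady.countDown()
      println("endpoint replied")
    }
  })

  requester.start()
  Thread.sleep(100)   // make sure the requester grabs the lock first
  endpoint.start()
  requester.join()
  endpoint.join()
  // Without the await timeout, the two threads would block each other forever.
}
{code}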



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17020) Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data

2016-08-11 Thread Roi Reshef (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417301#comment-15417301
 ] 

Roi Reshef edited comment on SPARK-17020 at 8/11/16 2:09 PM:
-

Nevertheless, any attempt to repartition the resulting RDD also ends with 
(almost) all of its partitions staying on the same node. I transformed it into 
a ShuffledRDD via PairRDDFunctions, setting a HashPartitioner with 140 
partitions using *.partitionBy*, and yet I got the same data distribution as in 
the screenshot I attached.

So I guess there's something very wrong with referring to *DataFrame.rdd* 
without materializing it beforehand. What and why is beyond my understanding, 
currently.


was (Author: roireshef):
Nevertheless, any attempt to repartition the resulting RDD also end with having 
(almost) all of its partitions stay on the same node. I made it transform into 
a ShuffledRDD via PairRDDFunctions, set a HashPartitioner with 140 partitions 
*.partitionBy*, and yet, I got the same data-distribution as in the screenshot 
I attached.

So I guess there's something very wrong with referring to a *DataFrame.rdd* 
without materializing it beforehand. What and why is beyond my understanding, 
currently.

> Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data
> --
>
> Key: SPARK-17020
> URL: https://issues.apache.org/jira/browse/SPARK-17020
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Roi Reshef
> Attachments: dataframe_cache.PNG, rdd_cache.PNG
>
>
> Calling DataFrame's lazy val .rdd results with a new RDD with a poor 
> distribution of partitions across the cluster. Moreover, any attempt to 
> repartition this RDD further will fail.
> Attached are a screenshot of the original DataFrame on cache and the 
> resulting RDD on cache.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17020) Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data

2016-08-11 Thread Roi Reshef (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417301#comment-15417301
 ] 

Roi Reshef edited comment on SPARK-17020 at 8/11/16 2:09 PM:
-

Nevertheless, any attempt to repartition the resulting RDD also end with having 
(almost) all of its partitions stay on the same node. I made it transform into 
a ShuffledRDD via PairRDDFunctions, set a HashPartitioner with 140 partitions 
*.partitionBy*, and yet, I got the same data-distribution as in the screenshot 
I attached.

So I guess there's something very wrong with referring to a *DataFrame.rdd* 
without materializing it beforehand. What and why is beyond my understanding, 
currently.


was (Author: roireshef):
Nevertheless, any attempt to repartition the resulting RDD also end with having 
(almost) all of its partitions stay on the same node. I made it transform into 
a ShuffledRDD via PairRDDFunctions, set a HashPartitioner with 140 partitions, 
and yet, I got the same data-distribution as in the screenshot I attached.

So I guess there's something very wrong with referring to a *DataFrame.rdd* 
without materializing it beforehand. What and why is beyond my understanding, 
currently.

> Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data
> --
>
> Key: SPARK-17020
> URL: https://issues.apache.org/jira/browse/SPARK-17020
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Roi Reshef
> Attachments: dataframe_cache.PNG, rdd_cache.PNG
>
>
> Calling DataFrame's lazy val .rdd results with a new RDD with a poor 
> distribution of partitions across the cluster. Moreover, any attempt to 
> repartition this RDD further will fail.
> Attached are a screenshot of the original DataFrame on cache and the 
> resulting RDD on cache.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17020) Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data

2016-08-11 Thread Roi Reshef (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417301#comment-15417301
 ] 

Roi Reshef commented on SPARK-17020:


Nevertheless, any attempt to repartition the resulting RDD also end with having 
(almost) all of its partitions stay on the same node. I made it transform into 
a ShuffledRDD via PairRDDFunctions, set a HashPartitioner with 140 partitions, 
and yet, I got the same data-distribution as in the screenshot I attached.

So I guess there's something very wrong with referring to a *DataFrame.rdd* 
without materializing it beforehand. What and why is beyond my understanding, 
currently.

> Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data
> --
>
> Key: SPARK-17020
> URL: https://issues.apache.org/jira/browse/SPARK-17020
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Roi Reshef
> Attachments: dataframe_cache.PNG, rdd_cache.PNG
>
>
> Calling DataFrame's lazy val .rdd results with a new RDD with a poor 
> distribution of partitions across the cluster. Moreover, any attempt to 
> repartition this RDD further will fail.
> Attached are a screenshot of the original DataFrame on cache and the 
> resulting RDD on cache.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17020) Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data

2016-08-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417291#comment-15417291
 ] 

Sean Owen commented on SPARK-17020:
---

I see, I was asking because you show the results of caching a DataFrame above.
My guess is that in one case, the DataFrame is computed using the expected 
number of partitions, and somehow when you go straight through to the RDD, it 
ends up executing one task for one partition, thus putting the result in one 
big block. As to why, I don't know. You could confirm/deny by looking at the 
partition count for the DataFrame and RDD in these cases.
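
For anyone who wants to run that check, a minimal sketch (Spark 2.0 API; the toy 
DataFrame below stands in for the reporter's joined DataFrame, so only whether the 
two counts match is of interest):

{code}
import org.apache.spark.sql.SparkSession

object PartitionCountCheck extends App {
  val spark = SparkSession.builder()
    .appName("partition-count-check")
    .master("local[4]")
    .getOrCreate()

  val data = spark.range(0L, 1000000L).toDF("id")   // stand-in for the joined DataFrame

  println(s"DataFrame partitions (before cache): ${data.rdd.getNumPartitions}")

  data.cache().count()                              // materialize the DataFrame first
  val rdd = data.rdd.setName("rdd").cache()
  rdd.count()                                       // then materialize the extracted RDD

  println(s"RDD partitions (after DataFrame cache): ${rdd.getNumPartitions}")

  spark.stop()
}
{code}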

> Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data
> --
>
> Key: SPARK-17020
> URL: https://issues.apache.org/jira/browse/SPARK-17020
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Roi Reshef
> Attachments: dataframe_cache.PNG, rdd_cache.PNG
>
>
> Calling DataFrame's lazy val .rdd results with a new RDD with a poor 
> distribution of partitions across the cluster. Moreover, any attempt to 
> repartition this RDD further will fail.
> Attached are a screenshot of the original DataFrame on cache and the 
> resulting RDD on cache.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17020) Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data

2016-08-11 Thread Roi Reshef (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417288#comment-15417288
 ] 

Roi Reshef commented on SPARK-17020:


The problem occurs only when calling *.rdd* on a *not-previously-cached* 
DataFrame.
*data* is a DataFrame, so in the last snippet it is cached, whereas in the one 
before it wasn't - only the RDD extracted from it was.

> Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data
> --
>
> Key: SPARK-17020
> URL: https://issues.apache.org/jira/browse/SPARK-17020
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Roi Reshef
> Attachments: dataframe_cache.PNG, rdd_cache.PNG
>
>
> Calling DataFrame's lazy val .rdd results with a new RDD with a poor 
> distribution of partitions across the cluster. Moreover, any attempt to 
> repartition this RDD further will fail.
> Attached are a screenshot of the original DataFrame on cache and the 
> resulting RDD on cache.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17020) Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data

2016-08-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417268#comment-15417268
 ] 

Sean Owen commented on SPARK-17020:
---

Yeah, after it's cached and the partitions are established, I'd certainly 
expect it to do the sensible thing and use that locality, and that you'd find 
the locality of the RDD's partitions is the same and well-distributed.

What's the code path where you cache the DataFrame? I only see the RDD cached 
here.

> Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data
> --
>
> Key: SPARK-17020
> URL: https://issues.apache.org/jira/browse/SPARK-17020
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Roi Reshef
> Attachments: dataframe_cache.PNG, rdd_cache.PNG
>
>
> Calling DataFrame's lazy val .rdd results with a new RDD with a poor 
> distribution of partitions across the cluster. Moreover, any attempt to 
> repartition this RDD further will fail.
> Attached are a screenshot of the original DataFrame on cache and the 
> resulting RDD on cache.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17020) Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data

2016-08-11 Thread Roi Reshef (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417254#comment-15417254
 ] 

Roi Reshef commented on SPARK-17020:


Also note that I have just called:

*data.cache().count()*
val rdd = data.rdd.setName("rdd").cache()
rdd.count

and the RDD was distributed far better (similarly to the "data" DataFrame).

I'm not sure it solves the issue with the RDD ignoring the repartitioning 
methods further down the road. I'll have to check that.

> Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data
> --
>
> Key: SPARK-17020
> URL: https://issues.apache.org/jira/browse/SPARK-17020
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Roi Reshef
> Attachments: dataframe_cache.PNG, rdd_cache.PNG
>
>
> Calling DataFrame's lazy val .rdd results with a new RDD with a poor 
> distribution of partitions across the cluster. Moreover, any attempt to 
> repartition this RDD further will fail.
> Attached are a screenshot of the original DataFrame on cache and the 
> resulting RDD on cache.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17020) Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data

2016-08-11 Thread Roi Reshef (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417250#comment-15417250
 ] 

Roi Reshef commented on SPARK-17020:


val ab = SomeReader.read(...).            // some reader function that uses spark-csv with inferSchema=true
  filter(!isnull($"name")).
  alias("revab")

val meta = SomeReader.read(...)           // same but different schema and data

val udaf = ...                            // some UserDefinedAggregateFunction
val features = ab.groupBy(...).agg(udaf(...))

val data = features.
  join(meta, $"meta.id" === $"features.id").
  select(...)                             // only relevant fields

val rdd = data.rdd.setName("rdd").cache()
rdd.count


> Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data
> --
>
> Key: SPARK-17020
> URL: https://issues.apache.org/jira/browse/SPARK-17020
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Roi Reshef
> Attachments: dataframe_cache.PNG, rdd_cache.PNG
>
>
> Calling DataFrame's lazy val .rdd results with a new RDD with a poor 
> distribution of partitions across the cluster. Moreover, any attempt to 
> repartition this RDD further will fail.
> Attached are a screenshot of the original DataFrame on cache and the 
> resulting RDD on cache.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17020) Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data

2016-08-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417231#comment-15417231
 ] 

Sean Owen commented on SPARK-17020:
---

I think that's probably material, yes, as are the operations that created the 
DataFrame. Do you have a minimal reproduction?

> Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data
> --
>
> Key: SPARK-17020
> URL: https://issues.apache.org/jira/browse/SPARK-17020
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Roi Reshef
>Priority: Critical
> Attachments: dataframe_cache.PNG, rdd_cache.PNG
>
>
> Calling DataFrame's lazy val .rdd results with a new RDD with a poor 
> distribution of partitions across the cluster. Moreover, any attempt to 
> repartition this RDD further will fail.
> Attached are a screenshot of the original DataFrame on cache and the 
> resulting RDD on cache.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-11 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417240#comment-15417240
 ] 

Dongjoon Hyun commented on SPARK-16975:
---

Great! Thank you for confirming.

> Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2
> --
>
> Key: SPARK-16975
> URL: https://issues.apache.org/jira/browse/SPARK-16975
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Ubuntu Linux 14.04
>Reporter: immerrr again
>  Labels: parquet
>
> Spark-2.0.0 seems to have some problems reading a parquet dataset generated 
> by 1.6.2. 
> {code}
> In [80]: spark.read.parquet('/path/to/data')
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data. It must be specified manually;'
> {code}
> The dataset is ~150G and partitioned by _locality_code column. None of the 
> partitions are empty. I have narrowed the failing dataset to the first 32 
> partitions of the data:
> {code}
> In [82]: spark.read.parquet(*subdirs[:32])
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AI. It must be 
> specified manually;'
> {code}
> Interestingly, it works OK if you remove any of the partitions from the list:
> {code}
> In [83]: for i in range(32): spark.read.parquet(*(subdirs[:i] + 
> subdirs[i+1:32]))
> {code}
> Another strange thing is that the schemas for the first and the last 31 
> partitions of the subset are identical:
> {code}
> In [84]: spark.read.parquet(*subdirs[:31]).schema.fields == 
> spark.read.parquet(*subdirs[1:32]).schema.fields
> Out[84]: True
> {code}
> Which got me interested and I tried this:
> {code}
> In [87]: spark.read.parquet(*([subdirs[0]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AQ. It must be 
> specified manually;'
> In [88]: spark.read.parquet(*([subdirs[15]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=AX,/path/to/data/_locality_code=AX. It must be 
> specified manually;'
> In [89]: spark.read.parquet(*([subdirs[31]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=BE,/path/to/data/_locality_code=BE. It must be 
> specified manually;'
> {code}
> If I read the first partition, save it in 2.0 and try to read in the same 
> manner, everything is fine:
> {code}
> In [100]: spark.read.parquet(subdirs[0]).write.parquet('spark-2.0-test')
> 16/08/09 11:03:37 WARN ParquetRecordReader: Can not initialize counter due to 
> context is not a instance of TaskInputOutputContext, but is 
> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
> In [101]: df = spark.read.parquet(*(['spark-2.0-test'] * 32))
> {code}
> I originally posted this to the user mailing list, but with these latest 
> discoveries it clearly seems like a bug.
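
As a possible stopgap, a minimal sketch of the "must be specified manually" route 
the error message points to: borrow the schema from a single partition that reads 
fine on its own and pass it explicitly, so no schema inference is attempted over 
the full set of paths. The paths below are placeholders, and whether this sidesteps 
the bug in every case is untested here.

{code}
import org.apache.spark.sql.SparkSession

object ManualSchemaRead extends App {
  val spark = SparkSession.builder().appName("manual-schema-read").getOrCreate()

  // Placeholders for the real partition directories.
  val subdirs = Seq("/path/to/data/_locality_code=AQ",
                    "/path/to/data/_locality_code=AI")

  // Take the schema from one partition that loads without trouble...
  val schema = spark.read.parquet(subdirs.head).schema

  // ...and supply it explicitly so the reader does not try to infer it.
  val df = spark.read.schema(schema).parquet(subdirs: _*)
  df.printSchema()
}
{code}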



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17020) Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data

2016-08-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17020:
--
Priority: Major  (was: Critical)

> Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data
> --
>
> Key: SPARK-17020
> URL: https://issues.apache.org/jira/browse/SPARK-17020
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Roi Reshef
> Attachments: dataframe_cache.PNG, rdd_cache.PNG
>
>
> Calling DataFrame's lazy val .rdd results with a new RDD with a poor 
> distribution of partitions across the cluster. Moreover, any attempt to 
> repartition this RDD further will fail.
> Attached are a screenshot of the original DataFrame on cache and the 
> resulting RDD on cache.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17021) simplify the constructor parameters of QuantileSummaries

2016-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417230#comment-15417230
 ] 

Apache Spark commented on SPARK-17021:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/14603

> simplify the constructor parameters of QuantileSummaries
> 
>
> Key: SPARK-17021
> URL: https://issues.apache.org/jira/browse/SPARK-17021
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17021) simplify the constructor parameters of QuantileSummaries

2016-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17021:


Assignee: Wenchen Fan  (was: Apache Spark)

> simplify the constructor parameters of QuantileSummaries
> 
>
> Key: SPARK-17021
> URL: https://issues.apache.org/jira/browse/SPARK-17021
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17021) simplify the constructor parameters of QuantileSummaries

2016-08-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17021:


Assignee: Apache Spark  (was: Wenchen Fan)

> simplify the constructor parameters of QuantileSummaries
> 
>
> Key: SPARK-17021
> URL: https://issues.apache.org/jira/browse/SPARK-17021
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17020) Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data

2016-08-11 Thread Roi Reshef (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417218#comment-15417218
 ] 

Roi Reshef commented on SPARK-17020:


[~srowen] Should there be any effect on this if I cache and materialize the 
DF before I call .rdd?

> Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data
> --
>
> Key: SPARK-17020
> URL: https://issues.apache.org/jira/browse/SPARK-17020
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Roi Reshef
>Priority: Critical
> Attachments: dataframe_cache.PNG, rdd_cache.PNG
>
>
> Calling DataFrame's lazy val .rdd results with a new RDD with a poor 
> distribution of partitions across the cluster. Moreover, any attempt to 
> repartition this RDD further will fail.
> Attached are a screenshot of the original DataFrame on cache and the 
> resulting RDD on cache.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17021) simplify the constructor parameters of QuantileSummaries

2016-08-11 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-17021:
---

 Summary: simplify the constructor parameters of QuantileSummaries
 Key: SPARK-17021
 URL: https://issues.apache.org/jira/browse/SPARK-17021
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17020) Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data

2016-08-11 Thread Roi Reshef (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417204#comment-15417204
 ] 

Roi Reshef edited comment on SPARK-17020 at 8/11/16 1:13 PM:
-

[~srowen] I have 2 DataFrames that are generated by the spark-csv reader.
Then I pass them through several transformations and join them together.
After that I call either .rdd or .flatMap to get an RDD out of the joined 
DataFrame.

Throughout the whole process I've monitored the distribution of the DataFrames. 
It is good until the point where ".rdd" is called.


was (Author: roireshef):
[~srowen] I have 2 DataFrames that are generated from spark-csv reader.
Then I pass them through several transformations, and join them together.
After that I call either .rdd or .flatMap to get an RDD out of the joint 
DataFrame.

Throughout all the process I've monitored the distribution of the DataFrames. 
It is good until the point where ".rdd" is called

> Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data
> --
>
> Key: SPARK-17020
> URL: https://issues.apache.org/jira/browse/SPARK-17020
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Roi Reshef
>Priority: Critical
> Attachments: dataframe_cache.PNG, rdd_cache.PNG
>
>
> Calling DataFrame's lazy val .rdd results with a new RDD with a poor 
> distribution of partitions across the cluster. Moreover, any attempt to 
> repartition this RDD further will fail.
> Attached are a screenshot of the original DataFrame on cache and the 
> resulting RDD on cache.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


