[jira] [Commented] (SPARK-10670) Link to each language's API in codetabs in ML docs: spark.ml

2015-09-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905898#comment-14905898
 ] 

Apache Spark commented on SPARK-10670:
--

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/8901

> Link to each language's API in codetabs in ML docs: spark.ml
> 
>
> Key: SPARK-10670
> URL: https://issues.apache.org/jira/browse/SPARK-10670
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Reporter: Joseph K. Bradley
>
> In the Markdown docs for the spark.ml Programming Guide, we have code 
> examples with codetabs for each language. We should link to each language's 
> API docs within the corresponding codetab, but we are inconsistent about 
> this. For an example of what we want to do, see the "Word2Vec" section in 
> https://github.com/apache/spark/blob/64743870f23bffb8d96dcc8a0181c1452782a151/docs/ml-features.md
> This JIRA is just for spark.ml, not spark.mllib






[jira] [Assigned] (SPARK-10670) Link to each language's API in codetabs in ML docs: spark.ml

2015-09-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10670:


Assignee: (was: Apache Spark)

> Link to each language's API in codetabs in ML docs: spark.ml
> 
>
> Key: SPARK-10670
> URL: https://issues.apache.org/jira/browse/SPARK-10670
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Reporter: Joseph K. Bradley
>
> In the Markdown docs for the spark.ml Programming Guide, we have code 
> examples with codetabs for each language. We should link to each language's 
> API docs within the corresponding codetab, but we are inconsistent about 
> this. For an example of what we want to do, see the "Word2Vec" section in 
> https://github.com/apache/spark/blob/64743870f23bffb8d96dcc8a0181c1452782a151/docs/ml-features.md
> This JIRA is just for spark.ml, not spark.mllib






[jira] [Assigned] (SPARK-10670) Link to each language's API in codetabs in ML docs: spark.ml

2015-09-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10670:


Assignee: Apache Spark

> Link to each language's API in codetabs in ML docs: spark.ml
> 
>
> Key: SPARK-10670
> URL: https://issues.apache.org/jira/browse/SPARK-10670
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> In the Markdown docs for the spark.ml Programming Guide, we have code 
> examples with codetabs for each language. We should link to each language's 
> API docs within the corresponding codetab, but we are inconsistent about 
> this. For an example of what we want to do, see the "Word2Vec" section in 
> https://github.com/apache/spark/blob/64743870f23bffb8d96dcc8a0181c1452782a151/docs/ml-features.md
> This JIRA is just for spark.ml, not spark.mllib






[jira] [Commented] (SPARK-10659) DataFrames and SparkSQL saveAsParquetFile does not preserve REQUIRED (not nullable) flag in schema

2015-09-23 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905856#comment-14905856
 ] 

Cheng Lian commented on SPARK-10659:


This behavior was once a hacky workaround for interoperability with Hive
(fields in Hive schemata are always nullable). I think we can remove it now.
One design problem that still needs to be settled is what to do when
persisting a DataFrame as a Parquet table into the Hive metastore and the
schema has non-nullable fields. Basically two choices:
# Persist the table in the Spark SQL data source specific format, which is
Hive incompatible but preserves the Parquet schema
# Turn the schema into its nullable form and save it in a Hive compatible format
I'd go for 1.
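
For reference, a minimal way to observe the behaviour under discussion (a
sketch assuming a 1.5-era shell with {{sc}} and {{sqlContext}} in scope and a
writable scratch path):

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// A DataFrame whose schema marks "id" as REQUIRED (non-nullable).
val schema = StructType(Seq(StructField("id", LongType, nullable = false)))
val df = sqlContext.createDataFrame(sc.parallelize(Seq(Row(1L), Row(2L))), schema)

df.printSchema()                       // id: long (nullable = false)
df.write.parquet("/tmp/required-flag-demo")

// Reading the files back shows the field silently relaxed to nullable.
sqlContext.read.parquet("/tmp/required-flag-demo").printSchema()
// id: long (nullable = true)
{code}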

> DataFrames and SparkSQL saveAsParquetFile does not preserve REQUIRED (not 
> nullable) flag in schema
> --
>
> Key: SPARK-10659
> URL: https://issues.apache.org/jira/browse/SPARK-10659
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0
>Reporter: Vladimir Picka
>
> DataFrames currently automatically promote all Parquet schema fields to
> optional when they are written to an empty directory. The problem remains in
> v1.5.0.
> The culprit is this code:
> {code}
> val relation = if (doInsertion) {
>   // This is a hack. We always set nullable/containsNull/valueContainsNull
>   // to true for the schema of a parquet data.
>   val df =
>     sqlContext.createDataFrame(
>       data.queryExecution.toRdd,
>       data.schema.asNullable)
>   val createdRelation =
>     createRelation(sqlContext, parameters, df.schema).asInstanceOf[ParquetRelation2]
>   createdRelation.insert(df, overwrite = mode == SaveMode.Overwrite)
>   createdRelation
> }
> {code}
> which was implemented as part of this PR:
> https://github.com/apache/spark/commit/1b490e91fd6b5d06d9caeb50e597639ccfc0bc3b
> This is very unexpected behaviour for some use cases, such as small-file
> packing, where files are read from one place and written to another: it ends
> up producing incompatible files, because "required" normally cannot be
> promoted to "optional". It is the essence of a schema that it enforces the
> "required" invariant on the data, so it should be assumed to be intentional.
> I believe a better approach is to keep the schema as is by default and to
> provide, e.g., a builder method or option that allows forcing fields to
> optional.
> Right now we have to override a private API so that our files are rewritten
> as is, with all the perils that entails.
> Vladimir






[jira] [Comment Edited] (SPARK-10659) DataFrames and SparkSQL saveAsParquetFile does not preserve REQUIRED (not nullable) flag in schema

2015-09-23 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905856#comment-14905856
 ] 

Cheng Lian edited comment on SPARK-10659 at 9/24/15 5:51 AM:
-

This behavior was once a hacky workaround for interoperability with Hive
(fields in Hive schemata are always nullable). I think we can remove it now.
One design problem that still needs to be settled is what to do when
persisting a DataFrame as a Parquet table into the Hive metastore and the
schema has non-nullable fields. Basically two choices:

# Persist the table in the Spark SQL data source specific format, which is
Hive incompatible but preserves the Parquet schema
# Turn the schema into its nullable form and save it in a Hive compatible format

I'd go for 1.


was (Author: lian cheng):
This behavior was once a hacky workaround for interoperability with Hive
(fields in Hive schemata are always nullable). I think we can remove it now.
One design problem that still needs to be settled is what to do when
persisting a DataFrame as a Parquet table into the Hive metastore and the
schema has non-nullable fields. Basically two choices:
# Persist the table in the Spark SQL data source specific format, which is
Hive incompatible but preserves the Parquet schema
# Turn the schema into its nullable form and save it in a Hive compatible format
I'd go for 1.

> DataFrames and SparkSQL saveAsParquetFile does not preserve REQUIRED (not 
> nullable) flag in schema
> --
>
> Key: SPARK-10659
> URL: https://issues.apache.org/jira/browse/SPARK-10659
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0
>Reporter: Vladimir Picka
>
> DataFrames currently automatically promote all Parquet schema fields to
> optional when they are written to an empty directory. The problem remains in
> v1.5.0.
> The culprit is this code:
> {code}
> val relation = if (doInsertion) {
>   // This is a hack. We always set nullable/containsNull/valueContainsNull
>   // to true for the schema of a parquet data.
>   val df =
>     sqlContext.createDataFrame(
>       data.queryExecution.toRdd,
>       data.schema.asNullable)
>   val createdRelation =
>     createRelation(sqlContext, parameters, df.schema).asInstanceOf[ParquetRelation2]
>   createdRelation.insert(df, overwrite = mode == SaveMode.Overwrite)
>   createdRelation
> }
> {code}
> which was implemented as part of this PR:
> https://github.com/apache/spark/commit/1b490e91fd6b5d06d9caeb50e597639ccfc0bc3b
> This is very unexpected behaviour for some use cases, such as small-file
> packing, where files are read from one place and written to another: it ends
> up producing incompatible files, because "required" normally cannot be
> promoted to "optional". It is the essence of a schema that it enforces the
> "required" invariant on the data, so it should be assumed to be intentional.
> I believe a better approach is to keep the schema as is by default and to
> provide, e.g., a builder method or option that allows forcing fields to
> optional.
> Right now we have to override a private API so that our files are rewritten
> as is, with all the perils that entails.
> Vladimir






[jira] [Updated] (SPARK-10763) Update Java MLLIB/ML tests to use simplified dataframe construction

2015-09-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10763:
--
Assignee: holdenk

> Update Java MLLIB/ML tests to use simplified dataframe construction
> ---
>
> Key: SPARK-10763
> URL: https://issues.apache.org/jira/browse/SPARK-10763
> Project: Spark
>  Issue Type: Test
>  Components: ML, MLlib
>Affects Versions: 1.6.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Minor
> Fix For: 1.6.0
>
>
> As introduced in https://issues.apache.org/jira/browse/SPARK-10630, we now
> have an easier way to create DataFrames from local Java lists. Let's update
> the tests to use it.
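
The simplification in question is roughly the following (shown in Scala for
brevity, assuming the List-based {{createDataFrame}} overload added by
SPARK-10630 and a {{sqlContext}} in scope; the Java tests make the analogous
call):

{code}
import java.util.Arrays
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

val schema = StructType(Seq(
  StructField("label", DoubleType, nullable = false),
  StructField("feature", DoubleType, nullable = false)))

// No JavaSparkContext / JavaRDD detour needed: pass a local list of Rows.
val df = sqlContext.createDataFrame(
  Arrays.asList(Row(1.0, 2.0), Row(0.0, 3.0)), schema)
{code}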






[jira] [Resolved] (SPARK-10763) Update Java MLLIB/ML tests to use simplified dataframe construction

2015-09-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10763.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8886
[https://github.com/apache/spark/pull/8886]

> Update Java MLLIB/ML tests to use simplified dataframe construction
> ---
>
> Key: SPARK-10763
> URL: https://issues.apache.org/jira/browse/SPARK-10763
> Project: Spark
>  Issue Type: Test
>  Components: ML, MLlib
>Affects Versions: 1.6.0
>Reporter: holdenk
>Priority: Minor
> Fix For: 1.6.0
>
>
> As introduced in https://issues.apache.org/jira/browse/SPARK-10630, we now
> have an easier way to create DataFrames from local Java lists. Let's update
> the tests to use it.






[jira] [Updated] (SPARK-10763) Update Java MLLIB/ML tests to use simplified dataframe construction

2015-09-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10763:
--
Affects Version/s: 1.6.0
 Target Version/s: 1.6.0

> Update Java MLLIB/ML tests to use simplified dataframe construction
> ---
>
> Key: SPARK-10763
> URL: https://issues.apache.org/jira/browse/SPARK-10763
> Project: Spark
>  Issue Type: Test
>  Components: ML, MLlib
>Affects Versions: 1.6.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Minor
> Fix For: 1.6.0
>
>
> As introduced in https://issues.apache.org/jira/browse/SPARK-10630, we now
> have an easier way to create DataFrames from local Java lists. Let's update
> the tests to use it.






[jira] [Created] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features

2015-09-23 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-10788:
-

 Summary: Decision Tree duplicates bins for unordered categorical 
features
 Key: SPARK-10788
 URL: https://issues.apache.org/jira/browse/SPARK-10788
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley


Decision trees in spark.ml (RandomForest.scala) effectively create a second
copy of each split. E.g., if there are 3 categories A, B, C, then we should
consider 3 splits:
* A vs. B, C
* A, B vs. C
* A, C vs. B

Currently, we also consider the 3 flipped splits:
* B,C vs. A
* C vs. A, B
* B vs. A, C

This means we communicate twice as much data as needed for these features.

We should eliminate these duplicate splits within the spark.ml implementation 
since the spark.mllib implementation will be removed before long (and will 
instead call into spark.ml).
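
The counting argument above can be made concrete with a small enumeration
sketch (illustrative only, not the RandomForest.scala code): restricting the
bit masks to 1 .. 2^(k-1) - 1 keeps the last category on the right-hand side,
so each unordered split is generated exactly once.

{code}
// Enumerate unique splits of an unordered categorical feature without the
// flipped duplicates (assumes k >= 2): bit i of the mask puts category i on
// the left side, and masks below 2^(k-1) never move the last category, so no
// split appears twice.
def unorderedSplits(categories: Seq[String]): Seq[(Set[String], Set[String])] = {
  val k = categories.length
  (1 until (1 << (k - 1))).map { mask =>
    val left = categories.zipWithIndex
      .collect { case (c, i) if ((mask >> i) & 1) == 1 => c }
      .toSet
    (left, categories.toSet -- left)
  }
}

// unorderedSplits(Seq("A", "B", "C")) gives 3 splits:
//   (A | B,C), (B | A,C), (A,B | C)
{code}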






[jira] [Commented] (SPARK-10770) SparkPlan.executeCollect/executeTake should return InternalRow rather than external Row

2015-09-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905817#comment-14905817
 ] 

Apache Spark commented on SPARK-10770:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/8900

> SparkPlan.executeCollect/executeTake should return InternalRow rather than 
> external Row
> ---
>
> Key: SPARK-10770
> URL: https://issues.apache.org/jira/browse/SPARK-10770
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>







[jira] [Assigned] (SPARK-10709) When loading a json dataset as a data frame, if the input path is wrong, the error message is very confusing

2015-09-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10709:


Assignee: Apache Spark

> When loading a json dataset as a data frame, if the input path is wrong, the 
> error message is very confusing
> 
>
> Key: SPARK-10709
> URL: https://issues.apache.org/jira/browse/SPARK-10709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>
> If you do something like {{sqlContext.read.json("a wrong path")}}, when we 
> actually read data, the error message is 
> {code}
> java.io.IOException: No input paths specified in job
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:198)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at org.apache.spark.ShuffleDependency.(Dependency.scala:91)
>   at 
> org.apache.spark.sql.execution.ShuffledRowRDD.getDependencies(ShuffledRowRDD.scala:59)
>   at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:226)
>   at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:224)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.dependencies(RDD.scala:224)
>   at 
> org.apache.spark.scheduler.DAGScheduler.visit$2(DAGScheduler.scala:427)
>   at 
> org.apache.spark.scheduler.DAGScheduler.getAncestorShuffleDependencies(DAGScheduler.scala:442)

[jira] [Assigned] (SPARK-10709) When loading a json dataset as a data frame, if the input path is wrong, the error message is very confusing

2015-09-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10709:


Assignee: (was: Apache Spark)

> When loading a json dataset as a data frame, if the input path is wrong, the 
> error message is very confusing
> 
>
> Key: SPARK-10709
> URL: https://issues.apache.org/jira/browse/SPARK-10709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> If you do something like {{sqlContext.read.json("a wrong path")}}, when we 
> actually read data, the error message is 
> {code}
> java.io.IOException: No input paths specified in job
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:198)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at org.apache.spark.ShuffleDependency.(Dependency.scala:91)
>   at 
> org.apache.spark.sql.execution.ShuffledRowRDD.getDependencies(ShuffledRowRDD.scala:59)
>   at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:226)
>   at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:224)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.dependencies(RDD.scala:224)
>   at 
> org.apache.spark.scheduler.DAGScheduler.visit$2(DAGScheduler.scala:427)
>   at 
> org.apache.spark.scheduler.DAGScheduler.getAncestorShuffleDependencies(DAGScheduler.scala:442)
>   at 
> org.apache

[jira] [Commented] (SPARK-10709) When loading a json dataset as a data frame, if the input path is wrong, the error message is very confusing

2015-09-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905803#comment-14905803
 ] 

Apache Spark commented on SPARK-10709:
--

User 'navis' has created a pull request for this issue:
https://github.com/apache/spark/pull/8899
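
For context, the direction such a fix usually takes is to validate the input
path eagerly and fail with the offending path in the message (a hedged sketch,
not the contents of the PR; {{sc}} is assumed to be a SparkContext):

{code}
import java.io.FileNotFoundException
import org.apache.hadoop.fs.Path

val input = "a wrong path"
val path = new Path(input)
val fs = path.getFileSystem(sc.hadoopConfiguration)

// Resolve the (possibly glob) path up front instead of letting the Hadoop
// input format fail much later with "No input paths specified in job".
val matched = fs.globStatus(path)
if (matched == null || matched.isEmpty) {
  throw new FileNotFoundException(s"Input path does not exist: $input")
}
{code}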

> When loading a json dataset as a data frame, if the input path is wrong, the 
> error message is very confusing
> 
>
> Key: SPARK-10709
> URL: https://issues.apache.org/jira/browse/SPARK-10709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> If you do something like {{sqlContext.read.json("a wrong path")}}, when we 
> actually read data, the error message is 
> {code}
> java.io.IOException: No input paths specified in job
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:198)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at org.apache.spark.ShuffleDependency.(Dependency.scala:91)
>   at 
> org.apache.spark.sql.execution.ShuffledRowRDD.getDependencies(ShuffledRowRDD.scala:59)
>   at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:226)
>   at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:224)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.dependencies(RDD.scala:224)
>   at 
> org.apache.spark.scheduler.DAGScheduler.visit$2(DAGScheduler.scala:427)
>   at 
> org.apac

[jira] [Updated] (SPARK-10787) Reset ObjectOutputStream more often to prevent OOME

2015-09-23 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated SPARK-10787:
---
Description: 
In the thread "Spark ClosureCleaner or java serializer OOM when trying to grow"
(http://search-hadoop.com/m/q3RTtAr5X543dNn), Jay Luan reported that
ClosureCleaner#ensureSerializable() resulted in an OOME.

The cause was that ObjectOutputStream keeps a strong reference to every object
that was written to it.

This issue tries to avoid the OOME by calling reset() more often.

  was:
In the thread "Spark ClosureCleaner or java serializer OOM when trying to grow",
Jay Luan reported that ClosureCleaner#ensureSerializable() resulted in an OOME.

The cause was that ObjectOutputStream keeps a strong reference to every object
that was written to it.

This issue tries to avoid the OOME by calling reset() more often.


> Reset ObjectOutputStream more often to prevent OOME
> ---
>
> Key: SPARK-10787
> URL: https://issues.apache.org/jira/browse/SPARK-10787
> Project: Spark
>  Issue Type: Bug
>Reporter: Ted Yu
>
> In the thread "Spark ClosureCleaner or java serializer OOM when trying to
> grow" (http://search-hadoop.com/m/q3RTtAr5X543dNn), Jay Luan reported that
> ClosureCleaner#ensureSerializable() resulted in an OOME.
> The cause was that ObjectOutputStream keeps a strong reference to every
> object that was written to it.
> This issue tries to avoid the OOME by calling reset() more often.






[jira] [Assigned] (SPARK-10787) Reset ObjectOutputStream more often to prevent OOME

2015-09-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10787:


Assignee: Apache Spark

> Reset ObjectOutputStream more often to prevent OOME
> ---
>
> Key: SPARK-10787
> URL: https://issues.apache.org/jira/browse/SPARK-10787
> Project: Spark
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Apache Spark
>
> In the thread "Spark ClosureCleaner or java serializer OOM when trying to
> grow", Jay Luan reported that ClosureCleaner#ensureSerializable() resulted in
> an OOME.
> The cause was that ObjectOutputStream keeps a strong reference to every
> object that was written to it.
> This issue tries to avoid the OOME by calling reset() more often.






[jira] [Assigned] (SPARK-10787) Reset ObjectOutputStream more often to prevent OOME

2015-09-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10787:


Assignee: (was: Apache Spark)

> Reset ObjectOutputStream more often to prevent OOME
> ---
>
> Key: SPARK-10787
> URL: https://issues.apache.org/jira/browse/SPARK-10787
> Project: Spark
>  Issue Type: Bug
>Reporter: Ted Yu
>
> In the thread "Spark ClosureCleaner or java serializer OOM when trying to
> grow", Jay Luan reported that ClosureCleaner#ensureSerializable() resulted in
> an OOME.
> The cause was that ObjectOutputStream keeps a strong reference to every
> object that was written to it.
> This issue tries to avoid the OOME by calling reset() more often.






[jira] [Commented] (SPARK-10787) Reset ObjectOutputStream more often to prevent OOME

2015-09-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905769#comment-14905769
 ] 

Apache Spark commented on SPARK-10787:
--

User 'tedyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/8896

> Reset ObjectOutputStream more often to prevent OOME
> ---
>
> Key: SPARK-10787
> URL: https://issues.apache.org/jira/browse/SPARK-10787
> Project: Spark
>  Issue Type: Bug
>Reporter: Ted Yu
>
> In the thread "Spark ClosureCleaner or java serializer OOM when trying to
> grow", Jay Luan reported that ClosureCleaner#ensureSerializable() resulted in
> an OOME.
> The cause was that ObjectOutputStream keeps a strong reference to every
> object that was written to it.
> This issue tries to avoid the OOME by calling reset() more often.






[jira] [Created] (SPARK-10787) Reset ObjectOutputStream more often to prevent OOME

2015-09-23 Thread Ted Yu (JIRA)
Ted Yu created SPARK-10787:
--

 Summary: Reset ObjectOutputStream more often to prevent OOME
 Key: SPARK-10787
 URL: https://issues.apache.org/jira/browse/SPARK-10787
 Project: Spark
  Issue Type: Bug
Reporter: Ted Yu


In the thread "Spark ClosureCleaner or java serializer OOM when trying to grow",
Jay Luan reported that ClosureCleaner#ensureSerializable() resulted in an OOME.

The cause was that ObjectOutputStream keeps a strong reference to every object
that was written to it.

This issue tries to avoid the OOME by calling reset() more often.
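
A minimal standalone illustration of the reset() idea (a sketch, not the
actual patch; the interval and the written data are made up):

{code}
import java.io.{ObjectOutputStream, OutputStream}

// A throwaway sink so the example only exercises the stream's own bookkeeping.
val sink: OutputStream = new OutputStream { override def write(b: Int): Unit = () }
val out = new ObjectOutputStream(sink)

val resetInterval = 10000
var written = 0L
Iterator.fill(100000)(new Array[Byte](1024)).foreach { obj =>
  out.writeObject(obj)
  written += 1
  // Without this, the stream's handle table keeps a strong reference to every
  // array written so far; reset() clears it and lets them be garbage collected.
  if (written % resetInterval == 0) out.reset()
}
out.close()
{code}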






[jira] [Commented] (SPARK-10086) Flaky StreamingKMeans test in PySpark

2015-09-23 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905731#comment-14905731
 ] 

Tathagata Das commented on SPARK-10086:
---

Actually, never mind, it's already in an eventually loop. The default timeout
is 30 seconds. Then I don't get why this is failing. Could it be a thread
visibility / race condition issue?

> Flaky StreamingKMeans test in PySpark
> -
>
> Key: SPARK-10086
> URL: https://issues.apache.org/jira/browse/SPARK-10086
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark, Streaming, Tests
>Affects Versions: 1.5.0
>Reporter: Joseph K. Bradley
>
> Here's a report on investigating test failures in StreamingKMeans in PySpark. 
> (See Jenkins links below.)
> It is a StreamingKMeans test which trains on a DStream with 2 batches and 
> then tests on those same 2 batches.  It fails here: 
> [https://github.com/apache/spark/blob/1968276af0f681fe51328b7dd795bd21724a5441/python/pyspark/mllib/tests.py#L1144]
> I recreated the same test, with variants training on: (1) the original 2 
> batches, (2) just the first batch, (3) just the second batch, and (4) neither 
> batch.  Here is code which avoids Streaming altogether to identify what 
> batches were processed.
> {code}
> from pyspark.mllib.clustering import StreamingKMeans, StreamingKMeansModel
> batches = [[[-0.5], [0.6], [0.8]], [[0.2], [-0.1], [0.3]]]
> batches = [sc.parallelize(batch) for batch in batches]
> stkm = StreamingKMeans(decayFactor=0.0, k=2)
> stkm.setInitialCenters([[0.0], [1.0]], [1.0, 1.0])
> # Train
> def update(rdd):
> stkm._model.update(rdd, stkm._decayFactor, stkm._timeUnit)
> # Remove one or both of these lines to test skipping batches.
> update(batches[0])
> update(batches[1])
> # Test
> def predict(rdd):
> return stkm._model.predict(rdd)
> predict(batches[0]).collect()
> predict(batches[1]).collect()
> {code}
> *Results*:
> {code}
> ### EXPECTED
> [0, 1, 1] 
>   
> [1, 0, 1]
> ### Skip batch 0
> [1, 0, 0]
> [0, 1, 0]
> ### Skip batch 1
> [0, 1, 1]
> [1, 0, 1]
> ### Skip both batches  (This is what we see in the test 
> failures.)
> [0, 1, 1]
> [0, 0, 0]
> {code}
> Skipping both batches reproduces the failure.  There is no randomness in the 
> StreamingKMeans algorithm (since initial centers are fixed, not randomized).
> CC: [~tdas] [~freeman-lab] [~mengxr]
> Failure message:
> {code}
> ==
> FAIL: test_trainOn_predictOn (__main__.StreamingKMeansTest)
> Test that prediction happens on the updated model.
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
>  line 1147, in test_trainOn_predictOn
> self._eventually(condition, catch_assertions=True)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
>  line 123, in _eventually
> raise lastValue
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
>  line 114, in _eventually
> lastValue = condition()
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
>  line 1144, in condition
> self.assertEqual(predict_results, [[0, 1, 1], [1, 0, 1]])
> AssertionError: Lists differ: [[0, 1, 1], [0, 0, 0]] != [[0, 1, 1], [1, 0, 1]]
> First differing element 1:
> [0, 0, 0]
> [1, 0, 1]
> - [[0, 1, 1], [0, 0, 0]]
> ? 
> + [[0, 1, 1], [1, 0, 1]]
> ?  +++   ^
> --
> Ran 62 tests in 164.188s
> {code}






[jira] [Commented] (SPARK-10086) Flaky StreamingKMeans test in PySpark

2015-09-23 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905729#comment-14905729
 ] 

Tathagata Das commented on SPARK-10086:
---

You could maintain a counter for the number of batches completed, i.e., a
foreachRDD that increments a counter. The test code should then wait for the
counter to reach 2 before checking the model. Alternatively, the check should
be in an eventually loop.

> Flaky StreamingKMeans test in PySpark
> -
>
> Key: SPARK-10086
> URL: https://issues.apache.org/jira/browse/SPARK-10086
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark, Streaming, Tests
>Affects Versions: 1.5.0
>Reporter: Joseph K. Bradley
>
> Here's a report on investigating test failures in StreamingKMeans in PySpark. 
> (See Jenkins links below.)
> It is a StreamingKMeans test which trains on a DStream with 2 batches and 
> then tests on those same 2 batches.  It fails here: 
> [https://github.com/apache/spark/blob/1968276af0f681fe51328b7dd795bd21724a5441/python/pyspark/mllib/tests.py#L1144]
> I recreated the same test, with variants training on: (1) the original 2 
> batches, (2) just the first batch, (3) just the second batch, and (4) neither 
> batch.  Here is code which avoids Streaming altogether to identify what 
> batches were processed.
> {code}
> from pyspark.mllib.clustering import StreamingKMeans, StreamingKMeansModel
> batches = [[[-0.5], [0.6], [0.8]], [[0.2], [-0.1], [0.3]]]
> batches = [sc.parallelize(batch) for batch in batches]
> stkm = StreamingKMeans(decayFactor=0.0, k=2)
> stkm.setInitialCenters([[0.0], [1.0]], [1.0, 1.0])
> # Train
> def update(rdd):
> stkm._model.update(rdd, stkm._decayFactor, stkm._timeUnit)
> # Remove one or both of these lines to test skipping batches.
> update(batches[0])
> update(batches[1])
> # Test
> def predict(rdd):
> return stkm._model.predict(rdd)
> predict(batches[0]).collect()
> predict(batches[1]).collect()
> {code}
> *Results*:
> {code}
> ### EXPECTED
> [0, 1, 1] 
>   
> [1, 0, 1]
> ### Skip batch 0
> [1, 0, 0]
> [0, 1, 0]
> ### Skip batch 1
> [0, 1, 1]
> [1, 0, 1]
> ### Skip both batches  (This is what we see in the test 
> failures.)
> [0, 1, 1]
> [0, 0, 0]
> {code}
> Skipping both batches reproduces the failure.  There is no randomness in the 
> StreamingKMeans algorithm (since initial centers are fixed, not randomized).
> CC: [~tdas] [~freeman-lab] [~mengxr]
> Failure message:
> {code}
> ==
> FAIL: test_trainOn_predictOn (__main__.StreamingKMeansTest)
> Test that prediction happens on the updated model.
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
>  line 1147, in test_trainOn_predictOn
> self._eventually(condition, catch_assertions=True)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
>  line 123, in _eventually
> raise lastValue
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
>  line 114, in _eventually
> lastValue = condition()
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
>  line 1144, in condition
> self.assertEqual(predict_results, [[0, 1, 1], [1, 0, 1]])
> AssertionError: Lists differ: [[0, 1, 1], [0, 0, 0]] != [[0, 1, 1], [1, 0, 1]]
> First differing element 1:
> [0, 0, 0]
> [1, 0, 1]
> - [[0, 1, 1], [0, 0, 0]]
> ? 
> + [[0, 1, 1], [1, 0, 1]]
> ?  +++   ^
> --
> Ran 62 tests in 164.188s
> {code}






[jira] [Resolved] (SPARK-10692) Failed batches are never reported through the StreamingListener interface

2015-09-23 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-10692.
---
   Resolution: Fixed
Fix Version/s: 1.6.0
   1.5.1

> Failed batches are never reported through the StreamingListener interface
> -
>
> Key: SPARK-10692
> URL: https://issues.apache.org/jira/browse/SPARK-10692
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Tathagata Das
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 1.5.1, 1.6.0
>
>
> If an output operation fails, then the corresponding batch is never marked
> as completed, as the data structures are not updated properly.
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L183






[jira] [Closed] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based

2015-09-23 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-10474.
-
Resolution: Fixed

> TungstenAggregation cannot acquire memory for pointer array after switching 
> to sort-based
> -
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.5.1, 1.6.0
>
>
> In an aggregation case, a lost task failed with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> store_sales is a big fact table and item is a small dimension table.




[jira] [Assigned] (SPARK-10786) SparkSQLCLIDriver should take the whole statement to generate the CommandProcessor

2015-09-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10786:


Assignee: Apache Spark

> SparkSQLCLIDriver should take the whole statement to generate the 
> CommandProcessor
> --
>
> Key: SPARK-10786
> URL: https://issues.apache.org/jira/browse/SPARK-10786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: SaintBacchus
>Assignee: Apache Spark
>Priority: Minor
>
> In the current implementation of SparkSQLCLIDriver.scala:
> *val proc: CommandProcessor = CommandProcessorFactory.get(Array(tokens(0)),
> hconf)*
> *CommandProcessorFactory* only takes the first token of the statement, which
> makes it hard to distinguish the statements *delete jar xxx* and *delete from
> xxx*.
> So it may be better to pass the whole statement into the
> *CommandProcessorFactory*.






[jira] [Assigned] (SPARK-10786) SparkSQLCLIDriver should take the whole statement to generate the CommandProcessor

2015-09-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10786:


Assignee: (was: Apache Spark)

> SparkSQLCLIDriver should take the whole statement to generate the 
> CommandProcessor
> --
>
> Key: SPARK-10786
> URL: https://issues.apache.org/jira/browse/SPARK-10786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: SaintBacchus
>Priority: Minor
>
> In the current implementation of SparkSQLCLIDriver.scala:
> *val proc: CommandProcessor = CommandProcessorFactory.get(Array(tokens(0)),
> hconf)*
> *CommandProcessorFactory* only takes the first token of the statement, which
> makes it hard to distinguish the statements *delete jar xxx* and *delete from
> xxx*.
> So it may be better to pass the whole statement into the
> *CommandProcessorFactory*.






[jira] [Commented] (SPARK-10786) SparkSQLCLIDriver should take the whole statement to generate the CommandProcessor

2015-09-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905692#comment-14905692
 ] 

Apache Spark commented on SPARK-10786:
--

User 'SaintBacchus' has created a pull request for this issue:
https://github.com/apache/spark/pull/8895

> SparkSQLCLIDriver should take the whole statement to generate the 
> CommandProcessor
> --
>
> Key: SPARK-10786
> URL: https://issues.apache.org/jira/browse/SPARK-10786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: SaintBacchus
>Priority: Minor
>
> In the current implementation of SparkSQLCLIDriver.scala:
> *val proc: CommandProcessor = CommandProcessorFactory.get(Array(tokens(0)),
> hconf)*
> *CommandProcessorFactory* only takes the first token of the statement, which
> makes it hard to distinguish the statements *delete jar xxx* and *delete from
> xxx*.
> So it may be better to pass the whole statement into the
> *CommandProcessorFactory*.






[jira] [Created] (SPARK-10786) SparkSQLCLIDriver should take the whole statement to generate the CommandProcessor

2015-09-23 Thread SaintBacchus (JIRA)
SaintBacchus created SPARK-10786:


 Summary: SparkSQLCLIDriver should take the whole statement to 
generate the CommandProcessor
 Key: SPARK-10786
 URL: https://issues.apache.org/jira/browse/SPARK-10786
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: SaintBacchus
Priority: Minor


In the current implementation of SparkSQLCLIDriver.scala:
*val proc: CommandProcessor = CommandProcessorFactory.get(Array(tokens(0)),
hconf)*

*CommandProcessorFactory* only takes the first token of the statement, which
makes it hard to distinguish the statements *delete jar xxx* and *delete from
xxx*.
So it may be better to pass the whole statement into the
*CommandProcessorFactory*.
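
A rough sketch of the proposed change (simplified; {{statement}} and {{hconf}}
are assumed to already be in scope as in SparkSQLCLIDriver, and the factory
call shown is Hive's String[]-based overload):

{code}
import org.apache.hadoop.hive.ql.processors.{CommandProcessor, CommandProcessorFactory}

val tokens: Array[String] = statement.trim.split("""\s+""")

// Today only the first token reaches the factory, so "delete jar foo.jar" and
// "delete from foo" look identical to it:
//   val proc = CommandProcessorFactory.get(Array(tokens(0)), hconf)

// Proposed: hand the factory the whole tokenized statement so it can choose
// the right processor for each case.
val proc: CommandProcessor = CommandProcessorFactory.get(tokens, hconf)
{code}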






[jira] [Updated] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-23 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10741:
-
Assignee: Wenchen Fan

> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>Assignee: Wenchen Fan
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
> bigint)) > cast(5 as double)) as boolean) AS 
> havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> {code}
> Failed Query with OrderBy
> {code}
>   def testParquetOrderBy() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedOrderBy =
>   """ SELECT c1, avg ( c2 ) c_avg
> | FROM test
> | GROUP BY c1
> | ORDER BY avg ( c2 )""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedOrderBy).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
> from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) 
> AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> {code}






[jira] [Resolved] (SPARK-6028) Provide an alternative RPC implementation based on the network transport module

2015-09-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-6028.

   Resolution: Fixed
Fix Version/s: 1.6.0

> Provide an alternative RPC implementation based on the network transport 
> module
> ---
>
> Key: SPARK-6028
> URL: https://issues.apache.org/jira/browse/SPARK-6028
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 1.6.0
>
>
> Network transport module implements a low level RPC interface. We can build a 
> new RPC implementation on top of that to replace Akka's.
> Design document: 
> https://docs.google.com/document/d/1CF5G6rGVQMKSyV_QKo4D2M-x6rxz5x1Ew7aK3Uq6u8c/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10724) SQL's floor() returns DOUBLE

2015-09-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10724:


Assignee: (was: Apache Spark)

> SQL's floor() returns DOUBLE
> 
>
> Key: SPARK-10724
> URL: https://issues.apache.org/jira/browse/SPARK-10724
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Simeon Simeonov
>Priority: Critical
>  Labels: sql
>
> This is a change in behavior from 1.4.1 where {{floor}} returns a BIGINT. 
> {code}
> scala> sql("select floor(1)").printSchema
> root
>  |-- _c0: double (nullable = true)
> {code}
> In the [Hive Language 
> Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF] 
> {{floor}} is defined to return BIGINT.
> This is a significant issue because it changes the DataFrame schema.
> I wonder what caused this and whether other SQL functions are affected.
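
Until the return type is fixed, a possible workaround sketch (assuming the 1.5.0 
behaviour shown above) is to cast the result back explicitly:

{code}
scala> sql("select cast(floor(1) as bigint) as f").printSchema
// expected:
// root
//  |-- f: long (nullable = true)
{code}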



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10724) SQL's floor() returns DOUBLE

2015-09-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10724:


Assignee: Apache Spark

> SQL's floor() returns DOUBLE
> 
>
> Key: SPARK-10724
> URL: https://issues.apache.org/jira/browse/SPARK-10724
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Simeon Simeonov
>Assignee: Apache Spark
>Priority: Critical
>  Labels: sql
>
> This is a change in behavior from 1.4.1 where {{floor}} returns a BIGINT. 
> {code}
> scala> sql("select floor(1)").printSchema
> root
>  |-- _c0: double (nullable = true)
> {code}
> In the [Hive Language 
> Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF] 
> {{floor}} is defined to return BIGINT.
> This is a significant issue because it changes the DataFrame schema.
> I wonder what caused this and whether other SQL functions are affected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10724) SQL's floor() returns DOUBLE

2015-09-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905658#comment-14905658
 ] 

Apache Spark commented on SPARK-10724:
--

User 'navis' has created a pull request for this issue:
https://github.com/apache/spark/pull/8893

> SQL's floor() returns DOUBLE
> 
>
> Key: SPARK-10724
> URL: https://issues.apache.org/jira/browse/SPARK-10724
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Simeon Simeonov
>Priority: Critical
>  Labels: sql
>
> This is a change in behavior from 1.4.1 where {{floor}} returns a BIGINT. 
> {code}
> scala> sql("select floor(1)").printSchema
> root
>  |-- _c0: double (nullable = true)
> {code}
> In the [Hive Language 
> Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF] 
> {{floor}} is defined to return BIGINT.
> This is a significant issue because it changes the DataFrame schema.
> I wonder what caused this and whether other SQL functions are affected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10692) Failed batches are never reported through the StreamingListener interface

2015-09-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905653#comment-14905653
 ] 

Apache Spark commented on SPARK-10692:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/8892

> Failed batches are never reported through the StreamingListener interface
> -
>
> Key: SPARK-10692
> URL: https://issues.apache.org/jira/browse/SPARK-10692
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Tathagata Das
>Assignee: Shixiong Zhu
>Priority: Critical
>
> If an output operation fails, the corresponding batch is never marked as 
> completed, because the data structures are not updated properly.
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L183
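
For reference, a minimal listener sketch that illustrates the reporting path in 
question (a plain StreamingListener; with the bug described above, a batch whose 
output operation fails never reaches onBatchCompleted):

{code}
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class BatchLogger extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(s"Batch ${info.batchTime} completed, processing delay = ${info.processingDelay}")
  }
}

// ssc.addStreamingListener(new BatchLogger())  // register on the StreamingContext
{code}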



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10043) Add window functions into SparkR

2015-09-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10043:

Target Version/s: 1.6.0  (was: 1.5.1, 1.6.0)

> Add window functions into SparkR
> 
>
> Key: SPARK-10043
> URL: https://issues.apache.org/jira/browse/SPARK-10043
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yu Ishikawa
>
> Add the following window functions to SparkR. I think we should also improve the 
> {{collect}} function in SparkR.
> - lead
> - cumeDist
> - denseRank
> - lag
> - ntile
> - percentRank
> - rank
> - rowNumber



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10692) Failed batches are never reported through the StreamingListener interface

2015-09-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10692:

Priority: Critical  (was: Blocker)

> Failed batches are never reported through the StreamingListener interface
> -
>
> Key: SPARK-10692
> URL: https://issues.apache.org/jira/browse/SPARK-10692
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Tathagata Das
>Assignee: Shixiong Zhu
>Priority: Critical
>
> If an output operation fails, the corresponding batch is never marked as 
> completed, because the data structures are not updated properly.
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L183



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10538) java.lang.NegativeArraySizeException during join

2015-09-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10538:

Target Version/s: 1.5.2  (was: 1.5.1)

> java.lang.NegativeArraySizeException during join
> 
>
> Key: SPARK-10538
> URL: https://issues.apache.org/jira/browse/SPARK-10538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>Assignee: Davies Liu
> Attachments: screenshot-1.png
>
>
> Hi,
> I've got a problem when joining tables in PySpark (in my example 20 of 
> them).
> I can observe that during calculation of the first partition (on one of the 
> consecutive joins) there is a big shuffle read size (294.7 MB / 146 records) 
> vs. other partitions (approx. 272.5 KB / 113 records).
> I can also observe that just before the crash the Python process grows to a few 
> GB of RAM.
> After some time there is an exception:
> {code}
> java.lang.NegativeArraySizeException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90)
>   at 
> org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:119)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> I'm running this on a 2-node cluster (12 cores, 64 GB RAM)
> Config:
> {code}
> spark.driver.memory  10g
> spark.executor.extraJavaOptions -XX:-UseGCOverheadLimit -XX:+UseParallelGC 
> -Dfile.encoding=UTF8
> spark.executor.memory   60g
> spark.storage.memoryFraction0.05
> spark.shuffle.memoryFraction0.75
> spark.driver.maxResultSize  10g  
> spark.cores.max 24
> spark.kryoserializer.buffer.max 1g
> spark.default.parallelism   200
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8115) Remove TestData

2015-09-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8115:
---
Target Version/s: 1.6.0  (was: 1.6.0, 1.5.1)

> Remove TestData
> ---
>
> Key: SPARK-8115
> URL: https://issues.apache.org/jira/browse/SPARK-8115
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Andrew Or
>Priority: Minor
>
> TestData was from the era when we didn't have easy ways to generate test 
> datasets. Now that we have implicits on Seq + toDF, it'd make more sense to put 
> the test datasets closer to the test suites.
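
A minimal sketch of the Seq + toDF pattern referred to above (assuming a 
SQLContext named {{sqlContext}} is in scope inside the test suite):

{code}
import sqlContext.implicits._

// Build a small test dataset inline instead of relying on TestData.
val testData = Seq((1, "a"), (2, "b"), (3, "c")).toDF("key", "value")
testData.show()
{code}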



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9841) Params.clear needs to be public

2015-09-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9841:
-
Shepherd: Joseph K. Bradley
Assignee: holdenk

> Params.clear needs to be public
> ---
>
> Key: SPARK-9841
> URL: https://issues.apache.org/jira/browse/SPARK-9841
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: holdenk
>
> It is currently impossible to clear Param values once set.  It would be 
> helpful to be able to.
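
A sketch of the intended usage once {{clear}} is public (the estimator and param 
chosen here are only illustrative):

{code}
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression().setMaxIter(50)
lr.clear(lr.maxIter)       // revert maxIter to its default
println(lr.getMaxIter)     // expected to print the default value (100)
{code}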



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9798) CrossValidatorModel Documentation Improvements

2015-09-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9798:
-
Shepherd: Joseph K. Bradley
Target Version/s: 1.6.0

> CrossValidatorModel Documentation Improvements
> --
>
> Key: SPARK-9798
> URL: https://issues.apache.org/jira/browse/SPARK-9798
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Feynman Liang
>Priority: Minor
>  Labels: starter
>
> CrossValidatorModel's avgMetrics and bestModel need documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9798) CrossValidatorModel Documentation Improvements

2015-09-23 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905573#comment-14905573
 ] 

Joseph K. Bradley commented on SPARK-9798:
--

This is a very small task, so I don't think it should be split.  Perhaps 
another doc or starter task?  Thanks!

> CrossValidatorModel Documentation Improvements
> --
>
> Key: SPARK-9798
> URL: https://issues.apache.org/jira/browse/SPARK-9798
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Feynman Liang
>Priority: Minor
>  Labels: starter
>
> CrossValidatorModel's avgMetrics and bestModel need documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10785) Scale QuantileDiscretizer using distributed binning

2015-09-23 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-10785:
-

 Summary: Scale QuantileDiscretizer using distributed binning
 Key: SPARK-10785
 URL: https://issues.apache.org/jira/browse/SPARK-10785
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley


[SPARK-10064] improves binning in decision trees by distributing the 
computation.  QuantileDiscretizer should do the same.
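
Purely as an illustration of the direction (not the proposed implementation), 
approximate cut points can be derived from a distributed sample rather than 
collecting the whole column to the driver:

{code}
import org.apache.spark.rdd.RDD

// Illustrative sketch: approximate quantile splits from a sample of the column.
def approxSplits(col: RDD[Double], numBins: Int, sampleFraction: Double): Array[Double] = {
  val sample = col.sample(withReplacement = false, sampleFraction).sortBy(identity).collect()
  if (sample.isEmpty) Array.empty[Double]
  else (1 until numBins).map(i => sample(i * sample.length / numBins)).toArray
}
{code}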



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5890) Add QuantileDiscretizer

2015-09-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5890:
-
Description: 
A `QuantileDiscretizer` takes a column with continuous features and outputs a 
column with binned categorical features.

{code}
val fd = new QuantileDiscretizer()
  .setInputCol("age")
  .setNumBins(32)
  .setOutputCol("ageBins")
{code}

This should be an automatic feature discretizer, which uses a simple algorithm 
like approximate quantiles to discretize features. It should set the ML 
attribute correctly in the output column.

  was:
A `FeatureDiscretizer` takes a column with continuous features and outputs a 
column with binned categorical features.

{code}
val fd = new FeatureDiscretizer()
  .setInputCol("age")
  .setNumBins(32)
  .setOutputCol("ageBins")
{code}

This should be an automatic feature discretizer, which uses a simple algorithm 
like approximate quantiles to discretize features. It should set the ML 
attribute correctly in the output column.


> Add QuantileDiscretizer
> ---
>
> Key: SPARK-5890
> URL: https://issues.apache.org/jira/browse/SPARK-5890
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>
> A `QuantileDiscretizer` takes a column with continuous features and outputs a 
> column with binned categorical features.
> {code}
> val fd = new QuantileDiscretizer()
>   .setInputCol("age")
>   .setNumBins(32)
>   .setOutputCol("ageBins")
> {code}
> This should be an automatic feature discretizer, which uses a simple algorithm 
> like approximate quantiles to discretize features. It should set the ML 
> attribute correctly in the output column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5890) Add QuantileDiscretizer

2015-09-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5890:
-
Summary: Add QuantileDiscretizer  (was: Add FeatureDiscretizer)

> Add QuantileDiscretizer
> ---
>
> Key: SPARK-5890
> URL: https://issues.apache.org/jira/browse/SPARK-5890
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>
> A `FeatureDiscretizer` takes a column with continuous features and outputs a 
> column with binned categorical features.
> {code}
> val fd = new FeatureDiscretizer()
>   .setInputCol("age")
>   .setNumBins(32)
>   .setOutputCol("ageBins")
> {code}
> This should be an automatic feature discretizer, which uses a simple algorithm 
> like approximate quantiles to discretize features. It should set the ML 
> attribute correctly in the output column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10731) The head() implementation of dataframe is very slow

2015-09-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10731.
-
   Resolution: Fixed
 Assignee: Reynold Xin
Fix Version/s: 1.5.1
   1.6.0

> The head() implementation of dataframe is very slow
> ---
>
> Key: SPARK-10731
> URL: https://issues.apache.org/jira/browse/SPARK-10731
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Jerry Lam
>Assignee: Reynold Xin
>  Labels: pyspark
> Fix For: 1.6.0, 1.5.1
>
>
> {code}
> df=sqlContext.read.parquet("someparquetfiles")
> df.head()
> {code}
> The above lines take over 15 minutes. It seems the DataFrame requires 3 
> stages to return the first row. It reads all the data (about 1 billion 
> rows) and runs Limit twice. The take(1) implementation in the RDD performs 
> much better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10699) Support checkpointInterval can be disabled

2015-09-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-10699.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8820
[https://github.com/apache/spark/pull/8820]

> Support checkpointInterval can be disabled
> --
>
> Key: SPARK-10699
> URL: https://issues.apache.org/jira/browse/SPARK-10699
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 1.6.0
>
>
> Currently users can set checkpointInterval to specify how often the 
> cache should be checkpointed, but we also need a way for users to disable 
> it.
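
A sketch of how disabling might look from the user side (assuming the common 
Spark convention of treating a non-positive value such as -1 as "disabled"; the 
choice of LDA here is only illustrative):

{code}
import org.apache.spark.ml.clustering.LDA

val lda = new LDA()
  .setK(10)
  .setCheckpointInterval(-1)   // assumed convention: -1 disables periodic checkpointing
{code}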



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10741:


Assignee: (was: Apache Spark)

> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
> bigint)) > cast(5 as double)) as boolean) AS 
> havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> {code}
> Failed Query with OrderBy
> {code}
>   def testParquetOrderBy() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedOrderBy =
>   """ SELECT c1, avg ( c2 ) c_avg
> | FROM test
> | GROUP BY c1
> | ORDER BY avg ( c2 )""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedOrderBy).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
> from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) 
> AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905479#comment-14905479
 ] 

Apache Spark commented on SPARK-10741:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/8889

> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
> bigint)) > cast(5 as double)) as boolean) AS 
> havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> {code}
> Failed Query with OrderBy
> {code}
>   def testParquetOrderBy() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedOrderBy =
>   """ SELECT c1, avg ( c2 ) c_avg
> | FROM test
> | GROUP BY c1
> | ORDER BY avg ( c2 )""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedOrderBy).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
> from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) 
> AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10741:


Assignee: Apache Spark

> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>Assignee: Apache Spark
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
> bigint)) > cast(5 as double)) as boolean) AS 
> havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> {code}
> Failed Query with OrderBy
> {code}
>   def testParquetOrderBy() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedOrderBy =
>   """ SELECT c1, avg ( c2 ) c_avg
> | FROM test
> | GROUP BY c1
> | ORDER BY avg ( c2 )""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedOrderBy).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
> from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) 
> AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10086) Flaky StreamingKMeans test in PySpark

2015-09-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10086:
--
Description: 
Here's a report on investigating test failures in StreamingKMeans in PySpark. 
(See Jenkins links below.)

It is a StreamingKMeans test which trains on a DStream with 2 batches and then 
tests on those same 2 batches.  It fails here: 
[https://github.com/apache/spark/blob/1968276af0f681fe51328b7dd795bd21724a5441/python/pyspark/mllib/tests.py#L1144]

I recreated the same test, with variants training on: (1) the original 2 
batches, (2) just the first batch, (3) just the second batch, and (4) neither 
batch.  Here is code which avoids Streaming altogether to identify what batches 
were processed.

{code}
from pyspark.mllib.clustering import StreamingKMeans, StreamingKMeansModel

batches = [[[-0.5], [0.6], [0.8]], [[0.2], [-0.1], [0.3]]]
batches = [sc.parallelize(batch) for batch in batches]

stkm = StreamingKMeans(decayFactor=0.0, k=2)
stkm.setInitialCenters([[0.0], [1.0]], [1.0, 1.0])

# Train
def update(rdd):
stkm._model.update(rdd, stkm._decayFactor, stkm._timeUnit)

# Remove one or both of these lines to test skipping batches.
update(batches[0])
update(batches[1])

# Test
def predict(rdd):
return stkm._model.predict(rdd)

predict(batches[0]).collect()
predict(batches[1]).collect()
{code}

*Results*:
{code}
### EXPECTED

[0, 1, 1]   
[1, 0, 1]

### Skip batch 0

[1, 0, 0]
[0, 1, 0]

### Skip batch 1

[0, 1, 1]
[1, 0, 1]

### Skip both batches  (This is what we see in the test 
failures.)

[0, 1, 1]
[0, 0, 0]
{code}

Skipping both batches reproduces the failure.  There is no randomness in the 
StreamingKMeans algorithm (since initial centers are fixed, not randomized).

CC: [~tdas] [~freeman-lab] [~mengxr]

Failure message:
{code}
==
FAIL: test_trainOn_predictOn (__main__.StreamingKMeansTest)
Test that prediction happens on the updated model.
--
Traceback (most recent call last):
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
 line 1147, in test_trainOn_predictOn
self._eventually(condition, catch_assertions=True)
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
 line 123, in _eventually
raise lastValue
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
 line 114, in _eventually
lastValue = condition()
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
 line 1144, in condition
self.assertEqual(predict_results, [[0, 1, 1], [1, 0, 1]])
AssertionError: Lists differ: [[0, 1, 1], [0, 0, 0]] != [[0, 1, 1], [1, 0, 1]]

First differing element 1:
[0, 0, 0]
[1, 0, 1]

- [[0, 1, 1], [0, 0, 0]]
? 

+ [[0, 1, 1], [1, 0, 1]]
?  +++   ^


--
Ran 62 tests in 164.188s
{code}


  was:
Here's a report on investigating test failures in StreamingKMeans in PySpark. 
(See Jenkins links below.)

It is a StreamingKMeans test which trains on a DStream with 2 batches and then 
tests on those same 2 batches.  It fails here: 
[https://github.com/apache/spark/blob/1968276af0f681fe51328b7dd795bd21724a5441/python/pyspark/mllib/tests.py#L1144]

I recreated the same test, with variants training on: (1) the original 2 
batches, (2) just the first batch, (3) just the second batch, and (4) neither 
batch.  Here is a reproduction of the failure which avoids Streaming altogether.

{code}
from pyspark.mllib.clustering import StreamingKMeans, StreamingKMeansModel

batches = [[[-0.5], [0.6], [0.8]], [[0.2], [-0.1], [0.3]]]
batches = [sc.parallelize(batch) for batch in batches]

stkm = StreamingKMeans(decayFactor=0.0, k=2)
stkm.setInitialCenters([[0.0], [1.0]], [1.0, 1.0])

# Train
def update(rdd):
stkm._model.update(rdd, stkm._decayFactor, stkm._timeUnit)

# Remove one or both of these lines to test skipping batches.
update(batches[0])
update(batches[1])

# Test
def predict(rdd):
return stkm._model.predict(rdd)

predict(batches[0]).collect()
predict(batches[1]).collect()
{code}

*Results*:
{code}
### EXPECTED

[0, 1, 1]   
[1, 0, 1]

### Skip batch 0

[1, 0, 0]
[0, 1, 0]

### Skip batch 1

[0, 1, 1]
[1, 0, 1]

### Skip both batches  (This is what we see in the test 
failures.)

[0, 1, 1]
[0, 0, 0]
{code}

Skipping both batches reproduces the failure.  There is no randomness in the 
StreamingKMeans algorithm (si

[jira] [Updated] (SPARK-10086) Flaky StreamingKMeans test in PySpark

2015-09-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10086:
--
Description: 
Here's a report on investigating test failures in StreamingKMeans in PySpark. 
(See Jenkins links below.)

It is a StreamingKMeans test which trains on a DStream with 2 batches and then 
tests on those same 2 batches.  It fails here: 
[https://github.com/apache/spark/blob/1968276af0f681fe51328b7dd795bd21724a5441/python/pyspark/mllib/tests.py#L1144]

I recreated the same test, with variants training on: (1) the original 2 
batches, (2) just the first batch, (3) just the second batch, and (4) neither 
batch.  Here is a reproduction of the failure which avoids Streaming altogether.

{code}
from pyspark.mllib.clustering import StreamingKMeans, StreamingKMeansModel

batches = [[[-0.5], [0.6], [0.8]], [[0.2], [-0.1], [0.3]]]
batches = [sc.parallelize(batch) for batch in batches]

stkm = StreamingKMeans(decayFactor=0.0, k=2)
stkm.setInitialCenters([[0.0], [1.0]], [1.0, 1.0])

# Train
def update(rdd):
stkm._model.update(rdd, stkm._decayFactor, stkm._timeUnit)

# Remove one or both of these lines to test skipping batches.
update(batches[0])
update(batches[1])

# Test
def predict(rdd):
return stkm._model.predict(rdd)

predict(batches[0]).collect()
predict(batches[1]).collect()
{code}

*Results*:
{code}
### EXPECTED

[0, 1, 1]   
[1, 0, 1]

### Skip batch 0

[1, 0, 0]
[0, 1, 0]

### Skip batch 1

[0, 1, 1]
[1, 0, 1]

### Skip both batches  (This is what we see in the test 
failures.)

[0, 1, 1]
[0, 0, 0]
{code}

Skipping both batches reproduces the failure.  There is no randomness in the 
StreamingKMeans algorithm (since initial centers are fixed, not randomized).

CC: [~tdas] [~freeman-lab] [~mengxr]

Failure message:
{code}
==
FAIL: test_trainOn_predictOn (__main__.StreamingKMeansTest)
Test that prediction happens on the updated model.
--
Traceback (most recent call last):
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
 line 1147, in test_trainOn_predictOn
self._eventually(condition, catch_assertions=True)
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
 line 123, in _eventually
raise lastValue
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
 line 114, in _eventually
lastValue = condition()
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
 line 1144, in condition
self.assertEqual(predict_results, [[0, 1, 1], [1, 0, 1]])
AssertionError: Lists differ: [[0, 1, 1], [0, 0, 0]] != [[0, 1, 1], [1, 0, 1]]

First differing element 1:
[0, 0, 0]
[1, 0, 1]

- [[0, 1, 1], [0, 0, 0]]
? 

+ [[0, 1, 1], [1, 0, 1]]
?  +++   ^


--
Ran 62 tests in 164.188s
{code}


  was:
Here's a report on investigating test failures in StreamingKMeans in PySpark. 
(See Jenkins links below.)

It is a StreamingKMeans test which trains on a DStream with 2 batches and then 
tests on those same 2 batches.  It fails here: 
[https://github.com/apache/spark/blob/1968276af0f681fe51328b7dd795bd21724a5441/python/pyspark/mllib/tests.py#L1144]

I recreated the same test, with variants training on: (1) the original 2 
batches, (2a) just the first batch, (2b) just the second batch.  Here's the 
code:

Here is a reproduction of the failure which avoids Streaming altogether.  (It 
is the same as the code I have linked in the comment above, except that I run 
each step manually rather than going through streaming.)

{code}
from pyspark.mllib.clustering import StreamingKMeans, StreamingKMeansModel

batches = [[[-0.5], [0.6], [0.8]], [[0.2], [-0.1], [0.3]]]
batches = [sc.parallelize(batch) for batch in batches]

stkm = StreamingKMeans(decayFactor=0.0, k=2)
stkm.setInitialCenters([[0.0], [1.0]], [1.0, 1.0])

# Train
def update(rdd):
stkm._model.update(rdd, stkm._decayFactor, stkm._timeUnit)

# Remove one or both of these lines to test skipping batches.
update(batches[0])
update(batches[1])

# Test
def predict(rdd):
return stkm._model.predict(rdd)

predict(batches[0]).collect()
predict(batches[1]).collect()
{code}

*Results*:
{code}
### EXPECTED

[0, 1, 1]   
[1, 0, 1]

### Skip batch 0

[1, 0, 0]
[0, 1, 0]

### Skip batch 1

[0, 1, 1]
[1, 0, 1]

### Skip both batches  (This is what we see in the test 
failures.)

[0, 1, 1]
[0

[jira] [Updated] (SPARK-10668) Use WeightedLeastSquares in LinearRegression with L2 regularization if the number of features is small

2015-09-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10668:
--
Shepherd: Xiangrui Meng

> Use WeightedLeastSquares in LinearRegression with L2 regularization if the 
> number of features is small
> --
>
> Key: SPARK-10668
> URL: https://issues.apache.org/jira/browse/SPARK-10668
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Kai Sasaki
>Priority: Critical
>
> If the number of features is small (<=4096) and the regularization is L2, we 
> should use WeightedLeastSquares to solve the problem rather than L-BFGS. The 
> former requires only one pass over the data.
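
For context, a standard derivation sketch of why a single pass suffices (this is 
the generic weighted ridge formulation, not necessarily the exact objective used 
by WeightedLeastSquares):

{code}
minimize over beta:   (1/n) * sum_i w_i * (x_i^T beta - y_i)^2  +  lambda * ||beta||^2

closed form:          (X^T W X + n * lambda * I) beta = X^T W y

X^T W X (d x d) and X^T W y (d x 1) can be accumulated in a single pass
(e.g. one treeAggregate), after which the small d x d system is solved on
the driver -- feasible here because d <= 4096.
{code}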



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10086) Flaky StreamingKMeans test in PySpark

2015-09-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10086:
--
Description: 
Here's a report on investigating test failures in StreamingKMeans in PySpark. 
(See Jenkins links below.)

It is a StreamingKMeans test which trains on a DStream with 2 batches and then 
tests on those same 2 batches.  It fails here: 
[https://github.com/apache/spark/blob/1968276af0f681fe51328b7dd795bd21724a5441/python/pyspark/mllib/tests.py#L1144]

I recreated the same test, with variants training on: (1) the original 2 
batches, (2a) just the first batch, (2b) just the second batch.  Here's the 
code:

Here is a reproduction of the failure which avoids Streaming altogether.  (It 
is the same as the code I have linked in the comment above, except that I run 
each step manually rather than going through streaming.)

{code}
from pyspark.mllib.clustering import StreamingKMeans, StreamingKMeansModel

batches = [[[-0.5], [0.6], [0.8]], [[0.2], [-0.1], [0.3]]]
batches = [sc.parallelize(batch) for batch in batches]

stkm = StreamingKMeans(decayFactor=0.0, k=2)
stkm.setInitialCenters([[0.0], [1.0]], [1.0, 1.0])

# Train
def update(rdd):
stkm._model.update(rdd, stkm._decayFactor, stkm._timeUnit)

# Remove one or both of these lines to test skipping batches.
update(batches[0])
update(batches[1])

# Test
def predict(rdd):
return stkm._model.predict(rdd)

predict(batches[0]).collect()
predict(batches[1]).collect()
{code}

*Results*:
{code}
### EXPECTED

[0, 1, 1]   
[1, 0, 1]

### Skip batch 0

[1, 0, 0]
[0, 1, 0]

### Skip batch 1

[0, 1, 1]
[1, 0, 1]

### Skip both batches  (This is what we see in the test 
failures.)

[0, 1, 1]
[0, 0, 0]
{code}

Skipping both batches reproduces the failure.  There is no randomness in the 
StreamingKMeans algorithm (since initial centers are fixed, not randomized).

CC: [~tdas] [~freeman-lab] [~mengxr]

Failure message:
{code}
==
FAIL: test_trainOn_predictOn (__main__.StreamingKMeansTest)
Test that prediction happens on the updated model.
--
Traceback (most recent call last):
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
 line 1147, in test_trainOn_predictOn
self._eventually(condition, catch_assertions=True)
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
 line 123, in _eventually
raise lastValue
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
 line 114, in _eventually
lastValue = condition()
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
 line 1144, in condition
self.assertEqual(predict_results, [[0, 1, 1], [1, 0, 1]])
AssertionError: Lists differ: [[0, 1, 1], [0, 0, 0]] != [[0, 1, 1], [1, 0, 1]]

First differing element 1:
[0, 0, 0]
[1, 0, 1]

- [[0, 1, 1], [0, 0, 0]]
? 

+ [[0, 1, 1], [1, 0, 1]]
?  +++   ^


--
Ran 62 tests in 164.188s
{code}


  was:
Here's a report on investigating this test failure:

[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41081/console]

It is a StreamingKMeans test which trains on a DStream with 2 batches and then 
tests on those same 2 batches.  It fails here: 
[https://github.com/apache/spark/blob/1968276af0f681fe51328b7dd795bd21724a5441/python/pyspark/mllib/tests.py#L1144]

I recreated the same test, with variants training on: (1) the original 2 
batches, (2a) just the first batch, (2b) just the second batch.  Here's the 
code:
[https://github.com/jkbradley/spark/blob/d3eedb7773b9e15595cbc79c009fe932703c0b11/examples/src/main/python/mllib/streaming_kmeans.py]

Disturbingly, only (2b) produced the failure, indicating that batch 2 was 
processed and 1 was not.  [~tdas] says queueStream should have consistency 
guarantees and that should not happen.  There is no randomness in the 
StreamingKMeans algorithm (since initial centers are fixed, not randomized).

*Current status: Not sure what happened*

CC: [~tdas] [~freeman-lab] [~mengxr]

Failure message:
{code}
==
FAIL: test_trainOn_predictOn (__main__.StreamingKMeansTest)
Test that prediction happens on the updated model.
--
Traceback (most recent call last):
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
 line 1147, in test_trainOn_predictOn
self._eventually(condition, catch_assertions=True)
  File 
"/home/jenki

[jira] [Resolved] (SPARK-10686) Add quantileCol to AFTSurvivalRegression

2015-09-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10686.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8836
[https://github.com/apache/spark/pull/8836]

> Add quantileCol to AFTSurvivalRegression
> 
>
> Key: SPARK-10686
> URL: https://issues.apache.org/jira/browse/SPARK-10686
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
> Fix For: 1.6.0
>
>
> By default `quantileCol` should be empty. If both `quantileProbabilities` and 
> `quantileCol` are set, we should append quantiles as a new column (of type 
> Vector).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8616) SQLContext doesn't handle tricky column names when loading from JDBC

2015-09-23 Thread Rick Hillegas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905409#comment-14905409
 ] 

Rick Hillegas commented on SPARK-8616:
--

The following email thread may be useful for understanding this issue: 
http://apache-spark-developers-list.1001551.n3.nabble.com/column-identifiers-in-Spark-SQL-td14280.html

Thanks,
-Rick

> SQLContext doesn't handle tricky column names when loading from JDBC
> 
>
> Key: SPARK-8616
> URL: https://issues.apache.org/jira/browse/SPARK-8616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Ubuntu 14.04, Sqlite 3.8.7, Spark 1.4.0
>Reporter: Gergely Svigruha
>
> Reproduce:
>  - create a table in a relational database (in my case sqlite) with a column 
> name containing a space:
>  CREATE TABLE my_table (id INTEGER, "tricky column" TEXT);
>  - try to create a DataFrame using that table:
> sqlContext.read.format("jdbc").options(Map(
>   "url" -> "jdbs:sqlite:...",
>   "dbtable" -> "my_table")).load()
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> column: tricky)
> According to the SQL spec this should be valid:
> http://savage.net.au/SQL/sql-99.bnf.html#delimited%20identifier
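
A possible workaround sketch until quoting is handled (this assumes the JDBC 
source accepts any expression that is valid in a FROM clause as {{dbtable}}): 
alias the tricky column inside a subquery.

{code}
sqlContext.read.format("jdbc").options(Map(
  "url" -> "jdbc:sqlite:...",
  "dbtable" -> """(SELECT id, "tricky column" AS tricky_column FROM my_table) AS t"""
)).load()
{code}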



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10784) Flaky Streaming ML test umbrella

2015-09-23 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-10784:
-

 Summary: Flaky Streaming ML test umbrella
 Key: SPARK-10784
 URL: https://issues.apache.org/jira/browse/SPARK-10784
 Project: Spark
  Issue Type: Umbrella
  Components: MLlib, Streaming
Reporter: Joseph K. Bradley


This is an umbrella for collecting reports of flaky Streaming ML tests in Scala 
or Python.  To report a failure, please first check the linked issues for duplicates.  If it is a 
new failure, please create a JIRA and link/comment about it here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9715) Store numFeatures in all ML PredictionModel types

2015-09-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-9715.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8675
[https://github.com/apache/spark/pull/8675]

> Store numFeatures in all ML PredictionModel types
> -
>
> Key: SPARK-9715
> URL: https://issues.apache.org/jira/browse/SPARK-9715
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Seth Hendrickson
>Priority: Minor
> Fix For: 1.6.0
>
>
> The PredictionModel abstraction should store numFeatures.  Currently, only 
> RandomForest* types do this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-23 Thread Ian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905327#comment-14905327
 ] 

Ian edited comment on SPARK-10741 at 9/23/15 9:45 PM:
--

Yes, going through all rules when resolving Sort on Aggregate is the correct 
approach.

The main problem appears to be that the execute call at 
https://github.com/apache/spark/blob/v1.5.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L571
resolves to different attribute ids, causing confusion at 
https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L592-L611.

Just so I understand a bit more:
the second approach you are proposing is to remove the confusion by changing 
how ids are resolved in Analyzer.scala#L571, right? 





was (Author: ianlcsd):
Yes, going through all rules when resolve Sort on Aggregate is a correct 
approach.

The main problem appeared that the execute call at 
(https://github.com/apache/spark/blob/v1.5.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L571)
 is resolving to different attribute ids, and causing confusion at  
https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L592-L611.

just for me to understand a bit more:
the second approach you are proposing is to remove the confusion by changing 
how ids are resolved in Analyzer.scala#L571, right? 




> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
> bigint)) > cast(5 as double)) as boolean) AS 
> havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> {code}
> Failed Query with OrderBy
> {code}
>   def testParquetOrderBy() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedOrderBy =
>   """ SELECT c1, avg ( c2 ) c_avg
> | FROM test
> | GROUP BY c1
> | ORDER BY avg ( c2 )""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedOrderBy).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
> from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) 
> AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10783) Do track the pointer array in UnsafeInMemorySorter

2015-09-23 Thread Andrew Or (JIRA)
Andrew Or created SPARK-10783:
-

 Summary: Do track the pointer array in UnsafeInMemorySorter
 Key: SPARK-10783
 URL: https://issues.apache.org/jira/browse/SPARK-10783
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0, 1.5.1
Reporter: Andrew Or
Priority: Blocker


SPARK-10474 (https://github.com/apache/spark/pull/) removed the pointer 
array tracking because `TungstenAggregate` would fail under memory pressure. 
However, this is somewhat of a hack that we should fix in the right way in 
1.6.0 to ensure we don't OOM because of this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4885) Enable fetched blocks to exceed 2 GB

2015-09-23 Thread Sai Nishanth Parepally (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905338#comment-14905338
 ] 

Sai Nishanth Parepally commented on SPARK-4885:
---

I am using Spark 1.4.1 and facing the same issue. What is the workaround for 
the 2 GB limitation?

Here is the error:
15/09/23 13:19:29 WARN TransportChannelHandler: Exception in connection from 
XX.XX.XX.XX:X
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
at 
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at 
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:18)
at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:312)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:570)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at 
org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:148)
at 
org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:153)
at 
org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1189)
at 
org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1198)
at org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:190)
at 
org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:480)
at 
org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:302)
at 
org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57)
at 
org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)

> Enable fetched blocks to exceed 2 GB
> 
>
> Key: SPARK-4885
> URL: https://issues.apache.org/jira/browse/SPARK-4885
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Sandy Ryza
>
> {code}
> 14/12/18 09:53:13 ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught 
> exception in thread Thread[handle-message-executor-12,5,main]
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.util.Arrays.copyOf(Arrays.java:2271)
> at 
> java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
> at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at 
> java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
> at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at 
> java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
> at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
> at 
> com.esotericsoftware.kryo.io.Output.require(Output.java:135)
> at 
> com.esotericsoftware.kryo.io.Output.writeLong(Output.java:477)
> at 
> com.esotericsoftware.kryo.io.Output.writeDouble(Output.java:596)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArra

[jira] [Commented] (SPARK-10767) Make pyspark shared params codegen more consistent

2015-09-23 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905329#comment-14905329
 ] 

holdenk commented on SPARK-10767:
-

My plan was to wait for that PR to go in and then do this as a quick cleanup 
after (since, as pointed out in the original PR, it's a pretty unrelated change).

> Make pyspark shared params codegen more consistent 
> ---
>
> Key: SPARK-10767
> URL: https://issues.apache.org/jira/browse/SPARK-10767
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Minor
>
> Namely "." shows up in some places in the template when using the param 
> docstring and not in others



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-23 Thread Ian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905327#comment-14905327
 ] 

Ian edited comment on SPARK-10741 at 9/23/15 9:36 PM:
--

Yes, going through all rules when resolving Sort on Aggregate is a correct 
approach.

The main problem appears to be that the execute call at 
https://github.com/apache/spark/blob/v1.5.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L571
resolves to different attribute ids, causing confusion at 
https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L592-L611.

Just so I understand a bit more: the second approach you are proposing removes 
the confusion by changing how ids are resolved in Analyzer.scala#L571, right? 





was (Author: ianlcsd):

Yes, going through all rules when resolving Sort on Aggregate is a correct 
approach.

The main problem appears to be that the execute call at 
https://github.com/apache/spark/blob/v1.5.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L571
resolves to different attribute ids, causing confusion at 
https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L592-L611.

Just so I understand a bit more: the second approach you are proposing removes 
the confusion by changing how ids are resolved in Analyzer.scala#L571, right? 




> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
> bigint)) > cast(5 as double)) as boolean) AS 
> havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> {code}
> Failed Query with OrderBy
> {code}
>   def testParquetOrderBy() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedOrderBy =
>   """ SELECT c1, avg ( c2 ) c_avg
> | FROM test
> | GROUP BY c1
> | ORDER BY avg ( c2 )""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedOrderBy).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
> from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) 
> AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-23 Thread Ian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905327#comment-14905327
 ] 

Ian commented on SPARK-10741:
-


Yes, going through all rules when resolving Sort on Aggregate is a correct 
approach.

The main problem appears to be that the execute call at 
https://github.com/apache/spark/blob/v1.5.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L571
resolves to different attribute ids, causing confusion at 
https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L592-L611.

Just so I understand a bit more: the second approach you are proposing removes 
the confusion by changing how ids are resolved in Analyzer.scala#L571, right? 




> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
> bigint)) > cast(5 as double)) as boolean) AS 
> havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> {code}
> Failed Query with OrderBy
> {code}
>   def testParquetOrderBy() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedOrderBy =
>   """ SELECT c1, avg ( c2 ) c_avg
> | FROM test
> | GROUP BY c1
> | ORDER BY avg ( c2 )""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedOrderBy).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
> from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) 
> AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10767) Make pyspark shared params codegen more consistent

2015-09-23 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905322#comment-14905322
 ] 

Joseph K. Bradley commented on SPARK-10767:
---

Oh, yeah, that is annoying. +1

> Make pyspark shared params codegen more consistent 
> ---
>
> Key: SPARK-10767
> URL: https://issues.apache.org/jira/browse/SPARK-10767
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Minor
>
> Namely "." shows up in some places in the template when using the param 
> docstring and not in others



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10767) Make pyspark shared params codegen more consistent

2015-09-23 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905303#comment-14905303
 ] 

holdenk commented on SPARK-10767:
-

Updated the description, sorry about that. This comes from 
https://github.com/apache/spark/pull/8214#issuecomment-142486825 

> Make pyspark shared params codegen more consistent 
> ---
>
> Key: SPARK-10767
> URL: https://issues.apache.org/jira/browse/SPARK-10767
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Minor
>
> Namely "." shows up in some places in the template when using the param 
> docstring and not in others



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based

2015-09-23 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905299#comment-14905299
 ] 

Andrew Or commented on SPARK-10474:
---

Alright, I think this should fix it for real:
https://github.com/apache/spark/pull/

[~jameszhouyi] [~xjrk] please test it out one last time. Thanks for following 
through on this issue!

> TungstenAggregation cannot acquire memory for pointer array after switching 
> to sort-based
> -
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In an aggregation case, a lost task happened with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimensi

[jira] [Updated] (SPARK-10767) Make pyspark shared params codegen more consistent

2015-09-23 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-10767:

Description: Namely "." shows up in some places in the template when using 
the param docstring and not in others

> Make pyspark shared params codegen more consistent 
> ---
>
> Key: SPARK-10767
> URL: https://issues.apache.org/jira/browse/SPARK-10767
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Minor
>
> Namely "." shows up in some places in the template when using the param 
> docstring and not in others



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based

2015-09-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905291#comment-14905291
 ] 

Apache Spark commented on SPARK-10474:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/

> TungstenAggregation cannot acquire memory for pointer array after switching 
> to sort-based
> -
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In an aggregation case, a lost task happened with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---

[jira] [Assigned] (SPARK-10622) Race condition between scheduler and YARN executor status update

2015-09-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10622:


Assignee: (was: Apache Spark)

> Race condition between scheduler and YARN executor status update
> 
>
> Key: SPARK-10622
> URL: https://issues.apache.org/jira/browse/SPARK-10622
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Priority: Critical
>
> This is a follow up to SPARK-8167. From the comment left in the code:
> {quote}
> TODO there's a race condition where while we are querying the 
> ApplicationMaster for the executor loss reason, there is the potential that 
> tasks will be scheduled on the executor that failed. We should fix this by 
> having this onDisconnected event also "blacklist" executors so that tasks are 
> not assigned to them.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10622) Race condition between scheduler and YARN executor status update

2015-09-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905285#comment-14905285
 ] 

Apache Spark commented on SPARK-10622:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/8887

> Race condition between scheduler and YARN executor status update
> 
>
> Key: SPARK-10622
> URL: https://issues.apache.org/jira/browse/SPARK-10622
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Priority: Critical
>
> This is a follow up to SPARK-8167. From the comment left in the code:
> {quote}
> TODO there's a race condition where while we are querying the 
> ApplicationMaster for the executor loss reason, there is the potential that 
> tasks will be scheduled on the executor that failed. We should fix this by 
> having this onDisconnected event also "blacklist" executors so that tasks are 
> not assigned to them.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10622) Race condition between scheduler and YARN executor status update

2015-09-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10622:


Assignee: Apache Spark

> Race condition between scheduler and YARN executor status update
> 
>
> Key: SPARK-10622
> URL: https://issues.apache.org/jira/browse/SPARK-10622
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Critical
>
> This is a follow up to SPARK-8167. From the comment left in the code:
> {quote}
> TODO there's a race condition where while we are querying the 
> ApplicationMaster for the executor loss reason, there is the potential that 
> tasks will be scheduled on the executor that failed. We should fix this by 
> having this onDisconnected event also "blacklist" executors so that tasks are 
> not assigned to them.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based

2015-09-23 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905279#comment-14905279
 ] 

Andrew Or commented on SPARK-10474:
---

Re-opening this because I found the real cause for this issue (which turns out 
to be the same as SPARK-10733).

In TungstenAggregate's prepare method, we only reserve a page. However, when we 
switch to sort-based aggregation, we try to acquire a page AND a pointer array. 
This means even if we spill the page we currently have, there's a reasonable 
chance that we still can't allocate for the pointer array.

The temporary solution for 1.5.1 would be to simply not track the pointer array 
(we already don't track it in some other places), which should be safe because 
of the generous shuffle safety fraction of 20%.

I'll submit another patch shortly.
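
For anyone following along, here is a minimal toy sketch of the accounting 
problem (made-up sizes and a fake per-task cap; this is not the real 
ShuffleMemoryManager or UnsafeKVExternalSorter code):

{code}
// Toy model: spilling the reserved page frees its bytes, but the extra
// pointer array still pushes the task over its (hypothetical) memory cap.
object PointerArrayShortfall {
  val maxForTask = 64L << 20          // pretend per-task cap: 64 MB
  var acquired   = 60L << 20          // bytes already held by the hash map

  def tryAcquire(bytes: Long): Boolean =
    if (acquired + bytes <= maxForTask) { acquired += bytes; true } else false

  def main(args: Array[String]): Unit = {
    val page = 16L << 20              // the one data page reserved in prepare()
    val ptrs = 8L << 20               // pointer array needed by the sorter
    acquired -= page                  // "spill": release the current page
    // Re-acquiring the page succeeds, but the additional pointer array does
    // not fit, which mirrors the switch-to-sort-based failure described above.
    println(s"page acquired: ${tryAcquire(page)}, pointer array acquired: ${tryAcquire(ptrs)}")
  }
}
{code}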

> TungstenAggregation cannot acquire memory for pointer array after switching 
> to sort-based
> -
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In an aggregation case, a lost task happened with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i

[jira] [Resolved] (SPARK-10733) TungstenAggregation cannot acquire page after switching to sort-based

2015-09-23 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-10733.
---
Resolution: Duplicate

Looks like this is a duplicate of SPARK-10474 after all. I'm closing this...

> TungstenAggregation cannot acquire page after switching to sort-based
> -
>
> Key: SPARK-10733
> URL: https://issues.apache.org/jira/browse/SPARK-10733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> This was uncovered after fixing SPARK-10474. Stack trace:
> {code}
> 15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage 
> 152.0 (TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 
> bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based

2015-09-23 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reopened SPARK-10474:
---

> TungstenAggregation cannot acquire memory for pointer array after switching 
> to sort-based
> -
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In an aggregation case, a lost task happened with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@s

[jira] [Commented] (SPARK-10413) Model should support prediction on single instance

2015-09-23 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905230#comment-14905230
 ] 

Joseph K. Bradley commented on SPARK-10413:
---

For API, I think my main question is whether predict() should take strong types 
(Vector, etc.) and/or Rows.  I prefer supporting strong types first (as you are 
doing) since we could add support for Rows later on (although there could be 
difficult questions about missing schema for Scala/Java).

For raw & probability, I would again vote for just making those public.  But 
that could be done at a later time.
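
As a rough sketch of the strongly-typed option (the trait and method names here 
are hypothetical, not the actual spark.ml API; assumes spark-mllib on the 
classpath for {{Vector}}):

{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical trait a model could mix in to predict on one instance
// without building a DataFrame; predict(Vector) is the proposed addition.
trait SingleInstancePrediction {
  def predict(features: Vector): Double
}

// Trivial stand-in model, only to show the call pattern.
class ConstantModel(value: Double) extends SingleInstancePrediction {
  override def predict(features: Vector): Double = value
}

object PredictSketch {
  def main(args: Array[String]): Unit =
    println(new ConstantModel(1.0).predict(Vectors.dense(0.5, 2.0)))
}
{code}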

> Model should support prediction on single instance
> --
>
> Key: SPARK-10413
> URL: https://issues.apache.org/jira/browse/SPARK-10413
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Critical
>
> Currently models in the pipeline API only implement transform(DataFrame). It 
> would be quite useful to support prediction on a single instance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9836) Provide R-like summary statistics for ordinary least squares via normal equation solver

2015-09-23 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904471#comment-14904471
 ] 

Mohamed Baddar edited comment on SPARK-9836 at 9/23/15 8:39 PM:


Thanks a lot [~mengxr], I will try one of the starter tasks, but it seems they 
are all taken. If so, what should I do next?


was (Author: mbaddar):
Thanks a lot, I will try one of the starter tasks, but it seems they are all 
taken. If so, what should I do next?

> Provide R-like summary statistics for ordinary least squares via normal 
> equation solver
> ---
>
> Key: SPARK-9836
> URL: https://issues.apache.org/jira/browse/SPARK-9836
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>
> In R, model fitting comes with summary statistics. We can provide most of 
> those via the normal equation solver (SPARK-9834). If some statistics require 
> additional passes over the dataset, we can expose an option to let users 
> select the desired statistics before model fitting.
> {code}
> > summary(model)
> Call:
> glm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
> Deviance Residuals: 
>  Min1QMedian3Q   Max  
> -1.30711  -0.25713  -0.05325   0.19542   1.41253  
> Coefficients:
>   Estimate Std. Error t value Pr(>|t|)
> (Intercept) 2.2514 0.3698   6.089 9.57e-09 ***
> Sepal.Width 0.8036 0.1063   7.557 4.19e-12 ***
> Speciesversicolor   1.4587 0.1121  13.012  < 2e-16 ***
> Speciesvirginica1.9468 0.1000  19.465  < 2e-16 ***
> ---
> Signif. codes:  
> 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> (Dispersion parameter for gaussian family taken to be 0.1918059)
> Null deviance: 102.168  on 149  degrees of freedom
> Residual deviance:  28.004  on 146  degrees of freedom
> AIC: 183.94
> Number of Fisher Scoring iterations: 2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10767) Make pyspark shared params codegen more consistent

2015-09-23 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905196#comment-14905196
 ] 

Joseph K. Bradley commented on SPARK-10767:
---

What issues specifically?

> Make pyspark shared params codegen more consistent 
> ---
>
> Key: SPARK-10767
> URL: https://issues.apache.org/jira/browse/SPARK-10767
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10782) Duplicate examples for drop_duplicates and DropDuplicates

2015-09-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905180#comment-14905180
 ] 

Sean Owen commented on SPARK-10782:
---

Looks like you're right; feel free to make a PR with the correct example for 
drop_duplicates.

> Duplicate examples for drop_duplicates and DropDuplicates
> -
>
> Key: SPARK-10782
> URL: https://issues.apache.org/jira/browse/SPARK-10782
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Asoka Diggs
>Priority: Trivial
>
> In the documentation for pyspark.sql, the source code examples for 
> DropDuplicates and drop_duplicates are identical to each other. It appears 
> that the example for DropDuplicates was copy/pasted for drop_duplicates and 
> not edited.
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.dropDuplicates



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-23 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905166#comment-14905166
 ] 

Yin Huai commented on SPARK-10741:
--

The second option sounds better.

> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
> bigint)) > cast(5 as double)) as boolean) AS 
> havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> {code}
> Failed Query with OrderBy
> {code}
>   def testParquetOrderBy() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedOrderBy =
>   """ SELECT c1, avg ( c2 ) c_avg
> | FROM test
> | GROUP BY c1
> | ORDER BY avg ( c2 )""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedOrderBy).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
> from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) 
> AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10782) Duplicate examples for drop_duplicates and DropDuplicates

2015-09-23 Thread Asoka Diggs (JIRA)
Asoka Diggs created SPARK-10782:
---

 Summary: Duplicate examples for drop_duplicates and DropDuplicates
 Key: SPARK-10782
 URL: https://issues.apache.org/jira/browse/SPARK-10782
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.5.0
Reporter: Asoka Diggs
Priority: Trivial


In the documentation for pyspark.sql, the source code examples for DropDuplicates 
and drop_duplicates are identical to each other. It appears that the example 
for DropDuplicates was copy/pasted for drop_duplicates and not edited.

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.dropDuplicates
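
For reference, a spark-shell style sketch of what two distinct examples could 
demonstrate (shown with the Scala dropDuplicates analogue; the data is 
illustrative only, and a Spark 1.5 shell with sqlContext available is assumed):

{code}
// Assumes a spark-shell session where sqlContext is predefined.
import sqlContext.implicits._

val df = Seq(("Alice", 5, 80), ("Alice", 5, 80), ("Alice", 10, 80))
  .toDF("name", "age", "height")

df.dropDuplicates().show()                      // drops fully identical rows
df.dropDuplicates(Seq("name", "height")).show() // dedupes on a subset of columns
{code}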




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10781) Allow certain number of failed tasks and allow job to succeed

2015-09-23 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-10781:
-

 Summary: Allow certain number of failed tasks and allow job to 
succeed
 Key: SPARK-10781
 URL: https://issues.apache.org/jira/browse/SPARK-10781
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.5.0
Reporter: Thomas Graves


MapReduce has the configs mapreduce.map.failures.maxpercent and 
mapreduce.reduce.failures.maxpercent, which allow a certain percentage of 
tasks to fail while the job still succeeds.

This could be a useful feature in Spark as well, for jobs that don't need all 
tasks to be successful.
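
For comparison, this is roughly how the MapReduce knobs named above are set via 
the Hadoop client API (a Spark-side equivalent does not exist yet; that is what 
this issue proposes):

{code}
import org.apache.hadoop.conf.Configuration

object FailurePercentExample {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Let up to 5% of map/reduce tasks fail without failing the whole job.
    conf.set("mapreduce.map.failures.maxpercent", "5")
    conf.set("mapreduce.reduce.failures.maxpercent", "5")
  }
}
{code}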





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10733) TungstenAggregation cannot acquire page after switching to sort-based

2015-09-23 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905083#comment-14905083
 ] 

Yin Huai commented on SPARK-10733:
--

Can you attach your query plan?

> TungstenAggregation cannot acquire page after switching to sort-based
> -
>
> Key: SPARK-10733
> URL: https://issues.apache.org/jira/browse/SPARK-10733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> This was uncovered after fixing SPARK-10474. Stack trace:
> {code}
> 15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage 
> 152.0 (TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 
> bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10733) TungstenAggregation cannot acquire page after switching to sort-based

2015-09-23 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904964#comment-14904964
 ] 

Yin Huai edited comment on SPARK-10733 at 9/23/15 7:23 PM:
---

[~jameszhouyi] Two more places for logging are 
{{UnsafeExternalSorter.acquireNewPage}} and 
{{ShuffleMemoryManager.tryToAcquire}}. In 
{{UnsafeExternalSorter.acquireNewPage}}, we log an entry saying that we are 
trying to acquire some memory space. In {{ShuffleMemoryManager.tryToAcquire}}, 
we log the size of memory that we want to acquire, the size of memory that has 
already been acquired for this task ({{curMem}}), {{maxToGrant}}, {{maxMemory}}, 
{{maxMemory / (2 * numActiveTasks)}}, and {{maxMemory / numActiveTasks}}. This 
information will be very helpful for debugging.
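
A hedged sketch of the kind of diagnostic line this would produce (standalone 
toy code, not the actual ShuffleMemoryManager; the names simply follow the 
values listed above):

{code}
object MemoryRequestLog {
  // Formats the values suggested above for ShuffleMemoryManager.tryToAcquire.
  def describe(numBytes: Long, curMem: Long, maxToGrant: Long,
               maxMemory: Long, numActiveTasks: Int): String =
    s"requesting=$numBytes curMem=$curMem maxToGrant=$maxToGrant " +
      s"maxMemory=$maxMemory minShare=${maxMemory / (2 * numActiveTasks)} " +
      s"maxShare=${maxMemory / numActiveTasks}"

  def main(args: Array[String]): Unit = {
    val mb = 1L << 20
    println(describe(16 * mb, 48 * mb, 8 * mb, 128 * mb, numActiveTasks = 4))
  }
}
{code}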


was (Author: yhuai):
[~jameszhouyi] Two more places for logging are 
{{UnsafeExternalSorter.acquireNewPage}} and 
{{ShuffleMemoryManager.tryToAcquire}}. In 
{{UnsafeExternalSorter.acquireNewPage}}, we log an entry saying that we are 
trying to acquire some memory space. In {{ShuffleMemoryManager.tryToAcquire}}, 
we log the size of memory that we want to acquire, the size of memory that has 
already been acquired for this task ({{curMem}}), {{maxToGrant}}, {{maxMemory / (2 * 
numActiveTasks)}}, and {{maxMemory / numActiveTasks}}. This information will 
be very helpful for debugging.

> TungstenAggregation cannot acquire page after switching to sort-based
> -
>
> Key: SPARK-10733
> URL: https://issues.apache.org/jira/browse/SPARK-10733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> This was uncovered after fixing SPARK-10474. Stack trace:
> {code}
> 15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage 
> 152.0 (TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 
> bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-23 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905069#comment-14905069
 ] 

Wenchen Fan commented on SPARK-10741:
-

This bug is caused by a conflict between 2 tricky part in our Analyzer. Let me 
explain it a little more.

We have a special rule for Sort on Aggregate in 
https://github.com/apache/spark/blob/v1.5.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L563-L604
In this rule, we put sort ordering expressions in Aggregate and call Analyzer 
to resolve this Aggregate again(which means we go through all rules).

We also have a special rule for parquet in 
https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L580-L612
In this rule, we convert hive's MetastoreRelation to LogicalRelation of 
parquet, which means we replaced leaf node and changed the output attribute 
ids. At the end of this rule, we go through the whole tree to replace old 
AttributeRefence of MetastoreRelation with new ones of LogicalRelation.

Then these two rules conflict. At the point when we resolve Sort on 
Aggregate, we only have the MetastoreRelation, but when we resolve the sort 
ordering expressions with the Aggregate, we go through all rules, so these 
ordering expressions end up referencing the Parquet LogicalRelation, whose 
output attribute ids differ from the old MetastoreRelation's. So, oops, our 
ordering expressions are referencing something that doesn't exist.

One solution is: do not go through all rules when resolving Sort on 
Aggregate (thus the Parquet relation conversion won't happen).
Another is: keep the attribute ids when converting MetastoreRelation to 
LogicalRelation.

Personally, I prefer the second one; what do you think?
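
For illustration only, here is a minimal sketch of why the ids drift (it uses 
Catalyst's AttributeReference as in the 1.5 codebase; treat the exact signatures 
as an assumption): every new AttributeReference mints a fresh exprId, so 
rebuilding a relation's output invalidates references that were resolved against 
the old attributes.

{code}
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.types.IntegerType

// Rebuilding a relation's output mints brand-new expression ids...
val oldC2 = AttributeReference("c2", IntegerType)()  // e.g. c2#16 in the MetastoreRelation
val newC2 = AttributeReference("c2", IntegerType)()  // e.g. c2#18 in the converted LogicalRelation
assert(oldC2.exprId != newC2.exprId)

// ...so an ordering expression resolved against oldC2 no longer matches the
// converted relation's output. Solution 2 above amounts to reusing the old
// attributes (and their exprIds) when building the LogicalRelation's output.
{code}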



> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
> bigint)) > cast(5 as double)) as boolean) AS 
> havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> {code}
> Failed Query with OrderBy
> {code}
>   def testParquetOrderBy() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedOrderBy =
>   """ SELECT c1, avg ( c2 ) c_avg
> | FROM test
> | GROUP BY c1
> | ORDER BY avg ( c2 )""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedOrderBy).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
> from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) 
> AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10494) Multiple Python UDFs together with aggregation or sort merge join may cause OOM (failed to acquire memory)

2015-09-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10494:

Fix Version/s: 1.6.0

> Multiple Python UDFs together with aggregation or sort merge join may cause 
> OOM (failed to acquire memory)
> --
>
> Key: SPARK-10494
> URL: https://issues.apache.org/jira/browse/SPARK-10494
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Reynold Xin
>Priority: Critical
> Fix For: 1.6.0, 1.5.1
>
>
> The RDD cache for Python UDFs was removed in 1.4, so N Python UDFs in one 
> query stage could end up evaluating the upstream SparkPlan 2^N times.
> In 1.5, if there is an aggregation or sort merge join in the upstream SparkPlan, 
> this will cause an OOM (failed to acquire memory).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10780) Set initialModel in KMeans in Pipelines API

2015-09-23 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-10780:
-

 Summary: Set initialModel in KMeans in Pipelines API
 Key: SPARK-10780
 URL: https://issues.apache.org/jira/browse/SPARK-10780
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Joseph K. Bradley






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10780) Set initialModel in KMeans in Pipelines API

2015-09-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10780:
--
Description: This is for the Scala version. After this is merged, create a 
JIRA for the Python version.

> Set initialModel in KMeans in Pipelines API
> ---
>
> Key: SPARK-10780
> URL: https://issues.apache.org/jira/browse/SPARK-10780
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This is for the Scala version. After this is merged, create a JIRA for 
> the Python version.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10779) Set initialModel for KMeans model in PySpark (spark.mllib)

2015-09-23 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-10779:
-

 Summary: Set initialModel for KMeans model in PySpark (spark.mllib)
 Key: SPARK-10779
 URL: https://issues.apache.org/jira/browse/SPARK-10779
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Joseph K. Bradley


Provide initialModel param for pyspark.mllib.clustering.KMeans



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10778) Implement toString for AssociationRules.Rule

2015-09-23 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-10778:
-

 Summary: Implement toString for AssociationRules.Rule
 Key: SPARK-10778
 URL: https://issues.apache.org/jira/browse/SPARK-10778
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.6.0
Reporter: Xiangrui Meng
Priority: Trivial


pretty print for association rules: {a, b, c} => {d}: 0.8



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10778) Implement toString for AssociationRules.Rule

2015-09-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10778:
--
Description: 
pretty print for association rules, e.g.

{code}
{a, b, c} => {d}: 0.8
{code}

  was:pretty print for association rules: {a, b, c} => {d}: 0.8
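
For reference, a minimal sketch of a toString that produces that format, using a 
stand-in case class rather than the real spark.mllib AssociationRules.Rule (the 
antecedent/consequent/confidence field names are assumed to match):

{code}
// Stand-in for AssociationRules.Rule; the real class is parameterized by the item type too.
case class Rule[Item](antecedent: Seq[Item], consequent: Seq[Item], confidence: Double) {
  override def toString: String =
    s"${antecedent.mkString("{", ", ", "}")} => ${consequent.mkString("{", ", ", "}")}: $confidence"
}

// Prints: {a, b, c} => {d}: 0.8
println(Rule(Seq("a", "b", "c"), Seq("d"), 0.8))
{code}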


> Implement toString for AssociationRules.Rule
> 
>
> Key: SPARK-10778
> URL: https://issues.apache.org/jira/browse/SPARK-10778
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Priority: Trivial
>  Labels: starter
>
> pretty print for association rules, e.g.
> {code}
> {a, b, c} => {d}: 0.8
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10403) UnsafeRowSerializer can't work with UnsafeShuffleManager (tungsten-sort)

2015-09-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-10403.
--
   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

Issue resolved by pull request 8873
[https://github.com/apache/spark/pull/8873]

> UnsafeRowSerializer can't work with UnsafeShuffleManager (tungsten-sort)
> 
>
> Key: SPARK-10403
> URL: https://issues.apache.org/jira/browse/SPARK-10403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Josh Rosen
> Fix For: 1.6.0, 1.5.1
>
>
> UnsafeRowSerializer relies on EOF in the stream, but UnsafeRowWriter will not 
> write EOF between partitions.
> {code}
> java.io.EOFException
>   at java.io.DataInputStream.readInt(DataInputStream.java:392)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:122)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
>   at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:174)
>   at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$executePartition$1(sort.scala:160)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {code}
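
To make the framing mismatch concrete, here is a self-contained sketch of the 
read/write contract described above; the -1 end-of-stream sentinel and the 
length-prefixed records are illustrative, not Spark's actual wire format:

{code}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

val EOF = -1
val out = new ByteArrayOutputStream()
val dos = new DataOutputStream(out)
Seq(Array[Byte](1, 2, 3), Array[Byte](4, 5)).foreach { row =>
  dos.writeInt(row.length)  // length prefix for each serialized row
  dos.write(row)            // row payload
}
dos.writeInt(EOF)           // the marker the reader expects at the end of a partition
dos.flush()

// The reader stops cleanly on the sentinel; without it, readInt() eventually hits
// a real end of stream and throws java.io.EOFException, as in the stack trace above.
val dis = new DataInputStream(new ByteArrayInputStream(out.toByteArray))
Iterator.continually(dis.readInt()).takeWhile(_ != EOF).foreach { size =>
  val buf = new Array[Byte](size)
  dis.readFully(buf)
  println(s"read a row of $size bytes")
}
{code}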



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10777) order by fails when column is aliased and projection includes windowed aggregate

2015-09-23 Thread N Campbell (JIRA)
N Campbell created SPARK-10777:
--

 Summary: order by fails when column is aliased and projection 
includes windowed aggregate
 Key: SPARK-10777
 URL: https://issues.apache.org/jira/browse/SPARK-10777
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: N Campbell


This statement fails in Spark (it works fine in Oracle and DB2):

select r as c1, min ( s ) over ()  as c2 from
( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t
order by r
Error: org.apache.spark.sql.AnalysisException: cannot resolve 'r' given input 
columns c1, c2; line 3 pos 9
SQLState:  null
ErrorCode: 0

Referencing the column alias instead works around the defect:

select r as c1, min ( s ) over ()  as c2 from
( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t
order by c1

These work fine

select r as c1, min ( s ) over ()  as c2 from
( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t
order by c1

select r as c1, s  as c2 from
( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t
order by r


create table  if not exists TINT ( RNUM int , CINT int   )
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' 
 STORED AS ORC  ;




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10728) Failed to set Jenkins Identity header on email.

2015-09-23 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904986#comment-14904986
 ] 

shane knapp commented on SPARK-10728:
-

well, the spark project doesn't need to fix it.  if you want to fix the jenkins 
plugin that causes the output, have at it!  :)

i really don't think we should have a bug filed against a different bug on a 
plugin for another project altogether.

> Failed to set Jenkins Identity header on email.
> ---
>
> Key: SPARK-10728
> URL: https://issues.apache.org/jira/browse/SPARK-10728
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Xiangrui Meng
>Assignee: shane knapp
>Priority: Trivial
>
> Saw a couple of Jenkins build failures due to "Failed to set Jenkins Identity 
> header on email", e.g.,
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3572/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/consoleFull
> {code}
> [error] running 
> /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/build/sbt
>  -Phadoop-1 -Dhadoop.version=2.0.0-mr1-cdh4.1.1 -Phive -Pkinesis-asl 
> -Phive-thriftserver test ; received return code 143
> Build step 'Execute shell' marked build as failure
> Archiving artifacts
> Recording test results
> ERROR: Failed to set Jenkins Identity header on email.
> java.lang.NullPointerException
>   at 
> org.jenkinsci.main.modules.instance_identity.InstanceIdentity.get(InstanceIdentity.java:126)
>   at 
> jenkins.plugins.mailer.tasks.MimeMessageBuilder.setJenkinsInstanceIdent(MimeMessageBuilder.java:188)
>   at 
> jenkins.plugins.mailer.tasks.MimeMessageBuilder.buildMimeMessage(MimeMessageBuilder.java:166)
>   at hudson.tasks.MailSender.createEmptyMail(MailSender.java:391)
>   at hudson.tasks.MailSender.createFailureMail(MailSender.java:260)
>   at hudson.tasks.MailSender.createMail(MailSender.java:178)
>   at hudson.tasks.MailSender.run(MailSender.java:107)
>   at hudson.tasks.Mailer.perform(Mailer.java:141)
>   at 
> hudson.tasks.BuildStepCompatibilityLayer.perform(BuildStepCompatibilityLayer.java:75)
>   at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:779)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:726)
>   at hudson.model.Build$BuildExecution.post2(Build.java:185)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:671)
>   at hudson.model.Run.execute(Run.java:1766)
>   at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
>   at hudson.model.ResourceController.execute(ResourceController.java:98)
>   at hudson.model.Executor.run(Executor.java:408)
> Sending e-mails to: spark-bu...@databricks.com rosenvi...@gmail.com
> Finished: FAILURE
> {code}
> The workaround documented on 
> https://issues.jenkins-ci.org/browse/JENKINS-26740 is to downgrade mailer to 
> 1.12.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10728) Failed to set Jenkins Identity header on email.

2015-09-23 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904979#comment-14904979
 ] 

Xiangrui Meng commented on SPARK-10728:
---

This is still an issue, though not high-priority. We should keep this JIRA open 
if we want to fix it sometime.

> Failed to set Jenkins Identity header on email.
> ---
>
> Key: SPARK-10728
> URL: https://issues.apache.org/jira/browse/SPARK-10728
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Xiangrui Meng
>Assignee: shane knapp
>Priority: Trivial
>
> Saw a couple of Jenkins build failures due to "Failed to set Jenkins Identity 
> header on email", e.g.,
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3572/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/consoleFull
> {code}
> [error] running 
> /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/build/sbt
>  -Phadoop-1 -Dhadoop.version=2.0.0-mr1-cdh4.1.1 -Phive -Pkinesis-asl 
> -Phive-thriftserver test ; received return code 143
> Build step 'Execute shell' marked build as failure
> Archiving artifacts
> Recording test results
> ERROR: Failed to set Jenkins Identity header on email.
> java.lang.NullPointerException
>   at 
> org.jenkinsci.main.modules.instance_identity.InstanceIdentity.get(InstanceIdentity.java:126)
>   at 
> jenkins.plugins.mailer.tasks.MimeMessageBuilder.setJenkinsInstanceIdent(MimeMessageBuilder.java:188)
>   at 
> jenkins.plugins.mailer.tasks.MimeMessageBuilder.buildMimeMessage(MimeMessageBuilder.java:166)
>   at hudson.tasks.MailSender.createEmptyMail(MailSender.java:391)
>   at hudson.tasks.MailSender.createFailureMail(MailSender.java:260)
>   at hudson.tasks.MailSender.createMail(MailSender.java:178)
>   at hudson.tasks.MailSender.run(MailSender.java:107)
>   at hudson.tasks.Mailer.perform(Mailer.java:141)
>   at 
> hudson.tasks.BuildStepCompatibilityLayer.perform(BuildStepCompatibilityLayer.java:75)
>   at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:779)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:726)
>   at hudson.model.Build$BuildExecution.post2(Build.java:185)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:671)
>   at hudson.model.Run.execute(Run.java:1766)
>   at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
>   at hudson.model.ResourceController.execute(ResourceController.java:98)
>   at hudson.model.Executor.run(Executor.java:408)
> Sending e-mails to: spark-bu...@databricks.com rosenvi...@gmail.com
> Finished: FAILURE
> {code}
> The workaround documented on 
> https://issues.jenkins-ci.org/browse/JENKINS-26740 is to downgrade mailer to 
> 1.12.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10728) Failed to set Jenkins Identity header on email.

2015-09-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10728:
--
Affects Version/s: (was: 1.6.0)

> Failed to set Jenkins Identity header on email.
> ---
>
> Key: SPARK-10728
> URL: https://issues.apache.org/jira/browse/SPARK-10728
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Xiangrui Meng
>Assignee: shane knapp
>Priority: Trivial
>
> Saw a couple of Jenkins build failures due to "Failed to set Jenkins Identity 
> header on email", e.g.,
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3572/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/consoleFull
> {code}
> [error] running 
> /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/build/sbt
>  -Phadoop-1 -Dhadoop.version=2.0.0-mr1-cdh4.1.1 -Phive -Pkinesis-asl 
> -Phive-thriftserver test ; received return code 143
> Build step 'Execute shell' marked build as failure
> Archiving artifacts
> Recording test results
> ERROR: Failed to set Jenkins Identity header on email.
> java.lang.NullPointerException
>   at 
> org.jenkinsci.main.modules.instance_identity.InstanceIdentity.get(InstanceIdentity.java:126)
>   at 
> jenkins.plugins.mailer.tasks.MimeMessageBuilder.setJenkinsInstanceIdent(MimeMessageBuilder.java:188)
>   at 
> jenkins.plugins.mailer.tasks.MimeMessageBuilder.buildMimeMessage(MimeMessageBuilder.java:166)
>   at hudson.tasks.MailSender.createEmptyMail(MailSender.java:391)
>   at hudson.tasks.MailSender.createFailureMail(MailSender.java:260)
>   at hudson.tasks.MailSender.createMail(MailSender.java:178)
>   at hudson.tasks.MailSender.run(MailSender.java:107)
>   at hudson.tasks.Mailer.perform(Mailer.java:141)
>   at 
> hudson.tasks.BuildStepCompatibilityLayer.perform(BuildStepCompatibilityLayer.java:75)
>   at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:779)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:726)
>   at hudson.model.Build$BuildExecution.post2(Build.java:185)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:671)
>   at hudson.model.Run.execute(Run.java:1766)
>   at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
>   at hudson.model.ResourceController.execute(ResourceController.java:98)
>   at hudson.model.Executor.run(Executor.java:408)
> Sending e-mails to: spark-bu...@databricks.com rosenvi...@gmail.com
> Finished: FAILURE
> {code}
> The workaround documented on 
> https://issues.jenkins-ci.org/browse/JENKINS-26740 is to downgrade mailer to 
> 1.12.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10763) Update Java MLLIB/ML tests to use simplified dataframe construction

2015-09-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10763:


Assignee: Apache Spark

> Update Java MLLIB/ML tests to use simplified dataframe construction
> ---
>
> Key: SPARK-10763
> URL: https://issues.apache.org/jira/browse/SPARK-10763
> Project: Spark
>  Issue Type: Test
>  Components: ML, MLlib
>Reporter: holdenk
>Assignee: Apache Spark
>Priority: Minor
>
> As introduced in https://issues.apache.org/jira/browse/SPARK-10630, we now 
> have an easier way to create DataFrames from local Java lists. Let's update 
> the tests to use it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10728) Failed to set Jenkins Identity header on email.

2015-09-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10728:
--
Labels:   (was: flaky-test)

> Failed to set Jenkins Identity header on email.
> ---
>
> Key: SPARK-10728
> URL: https://issues.apache.org/jira/browse/SPARK-10728
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Xiangrui Meng
>Assignee: shane knapp
>Priority: Trivial
>
> Saw a couple of Jenkins build failures due to "Failed to set Jenkins Identity 
> header on email", e.g.,
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3572/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/consoleFull
> {code}
> [error] running 
> /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/build/sbt
>  -Phadoop-1 -Dhadoop.version=2.0.0-mr1-cdh4.1.1 -Phive -Pkinesis-asl 
> -Phive-thriftserver test ; received return code 143
> Build step 'Execute shell' marked build as failure
> Archiving artifacts
> Recording test results
> ERROR: Failed to set Jenkins Identity header on email.
> java.lang.NullPointerException
>   at 
> org.jenkinsci.main.modules.instance_identity.InstanceIdentity.get(InstanceIdentity.java:126)
>   at 
> jenkins.plugins.mailer.tasks.MimeMessageBuilder.setJenkinsInstanceIdent(MimeMessageBuilder.java:188)
>   at 
> jenkins.plugins.mailer.tasks.MimeMessageBuilder.buildMimeMessage(MimeMessageBuilder.java:166)
>   at hudson.tasks.MailSender.createEmptyMail(MailSender.java:391)
>   at hudson.tasks.MailSender.createFailureMail(MailSender.java:260)
>   at hudson.tasks.MailSender.createMail(MailSender.java:178)
>   at hudson.tasks.MailSender.run(MailSender.java:107)
>   at hudson.tasks.Mailer.perform(Mailer.java:141)
>   at 
> hudson.tasks.BuildStepCompatibilityLayer.perform(BuildStepCompatibilityLayer.java:75)
>   at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:779)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:726)
>   at hudson.model.Build$BuildExecution.post2(Build.java:185)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:671)
>   at hudson.model.Run.execute(Run.java:1766)
>   at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
>   at hudson.model.ResourceController.execute(ResourceController.java:98)
>   at hudson.model.Executor.run(Executor.java:408)
> Sending e-mails to: spark-bu...@databricks.com rosenvi...@gmail.com
> Finished: FAILURE
> {code}
> The workaround documented on 
> https://issues.jenkins-ci.org/browse/JENKINS-26740 is to downgrade mailer to 
> 1.12.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10728) Failed to set Jenkins Identity header on email.

2015-09-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10728:
--
Target Version/s:   (was: 1.6.0)

> Failed to set Jenkins Identity header on email.
> ---
>
> Key: SPARK-10728
> URL: https://issues.apache.org/jira/browse/SPARK-10728
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Xiangrui Meng
>Assignee: shane knapp
>Priority: Trivial
>
> Saw a couple of Jenkins build failures due to "Failed to set Jenkins Identity 
> header on email", e.g.,
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3572/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/consoleFull
> {code}
> [error] running 
> /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/build/sbt
>  -Phadoop-1 -Dhadoop.version=2.0.0-mr1-cdh4.1.1 -Phive -Pkinesis-asl 
> -Phive-thriftserver test ; received return code 143
> Build step 'Execute shell' marked build as failure
> Archiving artifacts
> Recording test results
> ERROR: Failed to set Jenkins Identity header on email.
> java.lang.NullPointerException
>   at 
> org.jenkinsci.main.modules.instance_identity.InstanceIdentity.get(InstanceIdentity.java:126)
>   at 
> jenkins.plugins.mailer.tasks.MimeMessageBuilder.setJenkinsInstanceIdent(MimeMessageBuilder.java:188)
>   at 
> jenkins.plugins.mailer.tasks.MimeMessageBuilder.buildMimeMessage(MimeMessageBuilder.java:166)
>   at hudson.tasks.MailSender.createEmptyMail(MailSender.java:391)
>   at hudson.tasks.MailSender.createFailureMail(MailSender.java:260)
>   at hudson.tasks.MailSender.createMail(MailSender.java:178)
>   at hudson.tasks.MailSender.run(MailSender.java:107)
>   at hudson.tasks.Mailer.perform(Mailer.java:141)
>   at 
> hudson.tasks.BuildStepCompatibilityLayer.perform(BuildStepCompatibilityLayer.java:75)
>   at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:779)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:726)
>   at hudson.model.Build$BuildExecution.post2(Build.java:185)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:671)
>   at hudson.model.Run.execute(Run.java:1766)
>   at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
>   at hudson.model.ResourceController.execute(ResourceController.java:98)
>   at hudson.model.Executor.run(Executor.java:408)
> Sending e-mails to: spark-bu...@databricks.com rosenvi...@gmail.com
> Finished: FAILURE
> {code}
> The workaround documented on 
> https://issues.jenkins-ci.org/browse/JENKINS-26740 is to downgrade mailer to 
> 1.12.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10763) Update Java MLLIB/ML tests to use simplified dataframe construction

2015-09-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10763:


Assignee: (was: Apache Spark)

> Update Java MLLIB/ML tests to use simplified dataframe construction
> ---
>
> Key: SPARK-10763
> URL: https://issues.apache.org/jira/browse/SPARK-10763
> Project: Spark
>  Issue Type: Test
>  Components: ML, MLlib
>Reporter: holdenk
>Priority: Minor
>
> As introduced in https://issues.apache.org/jira/browse/SPARK-10630, we now 
> have an easier way to create DataFrames from local Java lists. Let's update 
> the tests to use it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10763) Update Java MLLIB/ML tests to use simplified dataframe construction

2015-09-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904978#comment-14904978
 ] 

Apache Spark commented on SPARK-10763:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/8886

> Update Java MLLIB/ML tests to use simplified dataframe construction
> ---
>
> Key: SPARK-10763
> URL: https://issues.apache.org/jira/browse/SPARK-10763
> Project: Spark
>  Issue Type: Test
>  Components: ML, MLlib
>Reporter: holdenk
>Priority: Minor
>
> As introduced in https://issues.apache.org/jira/browse/SPARK-10630, we now 
> have an easier way to create DataFrames from local Java lists. Let's update 
> the tests to use it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10733) TungstenAggregation cannot acquire page after switching to sort-based

2015-09-23 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904964#comment-14904964
 ] 

Yin Huai commented on SPARK-10733:
--

[~jameszhouyi] Two more places to add logging are 
{{UnsafeExternalSorter.acquireNewPage}} and 
{{ShuffleMemoryManager.tryToAcquire}}. In 
{{UnsafeExternalSorter.acquireNewPage}}, we log an entry to say we are trying 
to acquire some memory space. In {{ShuffleMemoryManager.tryToAcquire}}, we log 
the size of memory that we want to acquire, the size of memory that has already 
been acquired for this task ({{curMem}}), {{maxToGrant}}, {{maxMemory / (2 * 
numActiveTasks)}}, and {{maxMemory / numActiveTasks}}. This information will 
be very helpful for debugging.
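
As a rough illustration (the helper below is a stand-in, not Spark's actual 
internals), the quantities listed above could be folded into a single log line 
for {{ShuffleMemoryManager.tryToAcquire}} along these lines:

{code}
// Self-contained sketch: format the quantities listed above into one log line.
def formatTryToAcquireLog(
    requested: Long,
    curMem: Long,
    maxToGrant: Long,
    maxMemory: Long,
    numActiveTasks: Int): String = {
  s"tryToAcquire: requested=$requested, curMem=$curMem, maxToGrant=$maxToGrant, " +
    s"maxMemory/(2*N)=${maxMemory / (2 * numActiveTasks)}, maxMemory/N=${maxMemory / numActiveTasks}"
}

// Example: a task holding 32 MB asks for 16 MB, with 4 active tasks and 256 MB of shuffle memory.
println(formatTryToAcquireLog(16L << 20, 32L << 20, 16L << 20, 256L << 20, 4))
{code}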

> TungstenAggregation cannot acquire page after switching to sort-based
> -
>
> Key: SPARK-10733
> URL: https://issues.apache.org/jira/browse/SPARK-10733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> This is uncovered after fixing SPARK-10474. Stack trace:
> {code}
> 15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage 
> 152.0 (TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 
> bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10659) DataFrames and SparkSQL saveAsParquetFile does not preserve REQUIRED (not nullable) flag in schema

2015-09-23 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-10659:
---
Description: 
DataFrames currently automatically promote all Parquet schema fields to 
optional when they are written to an empty directory. The problem remains in 
v1.5.0.

The culprit is this code:
{code}
val relation = if (doInsertion) {
  // This is a hack. We always set nullable/containsNull/valueContainsNull 
to true
  // for the schema of a parquet data.
  val df =
sqlContext.createDataFrame(
  data.queryExecution.toRdd,
  data.schema.asNullable)
  val createdRelation =
createRelation(sqlContext, parameters, 
df.schema).asInstanceOf[ParquetRelation2]
  createdRelation.insert(df, overwrite = mode == SaveMode.Overwrite)
  createdRelation
}
{code}
which was implemented as part of this PR:
https://github.com/apache/spark/commit/1b490e91fd6b5d06d9caeb50e597639ccfc0bc3b

This is very unexpected behaviour for some use cases where files are read from one 
place and written to another, like small-file packing - it ends up with 
incompatible files, because "required" normally can't be promoted to optional. It 
is the essence of a schema that it enforces the "required" invariant on the data, 
so it should be assumed that this is intentional.

I believe a better approach is for the default behaviour to keep the schema as 
is, and to provide e.g. a builder method or option to allow forcing fields to 
optional.

Right now we have to override a private API so that our files are rewritten as 
is, with all its perils.
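
For what it's worth, a hypothetical sketch of the kind of opt-in knob suggested 
above; the "keepRequiredFields" option name does not exist in Spark and is purely 
an assumption, as are the paths:

{code}
// Read Parquet files and write them back without promoting REQUIRED fields to optional.
// The option name and paths are illustrative only.
val df = sqlContext.read.parquet("/data/in")
df.write
  .option("keepRequiredFields", "true")  // hypothetical opt-in to preserve nullability as-is
  .parquet("/data/out")
{code}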

Vladimir


  was:
DataFrames currently automatically promotes all Parquet schema fields to 
optional when they are written to an empty directory. The problem remains in 
v1.5.0.

The culprit is this code:
{code}
val relation = if (doInsertion) {
  // This is a hack. We always set
nullable/containsNull/valueContainsNull to true
  // for the schema of a parquet data.
  val df =
sqlContext.createDataFrame(
  data.queryExecution.toRdd,
  data.schema.asNullable)
  val createdRelation =
createRelation(sqlContext, parameters, 
df.schema).asInstanceOf[ParquetRelation2]
  createdRelation.insert(df, overwrite = mode == SaveMode.Overwrite)
  createdRelation
}
{code}
which was implemented as part of this PR:
https://github.com/apache/spark/commit/1b490e91fd6b5d06d9caeb50e597639ccfc0bc3b

This very unexpected behaviour for some use cases when files are read from one 
place and written to another like small file packing - it ends up with 
incompatible files because required can't be promoted to optional normally. It 
is essence of a schema that it enforces "required" invariant on data. It should 
be supposed that it is intended.

I believe that a better approach is to have default behaviour to keep schema as 
is and provide f.e. a builder method or option to allow forcing to optional.

Right now we have to overwrite private API so that our files are rewritten as 
is with all its perils.

Vladimir



> DataFrames and SparkSQL saveAsParquetFile does not preserve REQUIRED (not 
> nullable) flag in schema
> --
>
> Key: SPARK-10659
> URL: https://issues.apache.org/jira/browse/SPARK-10659
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0
>Reporter: Vladimir Picka
>
> DataFrames currently automatically promote all Parquet schema fields to 
> optional when they are written to an empty directory. The problem remains in 
> v1.5.0.
> The culprit is this code:
> {code}
> val relation = if (doInsertion) {
>   // This is a hack. We always set 
> nullable/containsNull/valueContainsNull to true
>   // for the schema of a parquet data.
>   val df =
> sqlContext.createDataFrame(
>   data.queryExecution.toRdd,
>   data.schema.asNullable)
>   val createdRelation =
> createRelation(sqlContext, parameters, 
> df.schema).asInstanceOf[ParquetRelation2]
>   createdRelation.insert(df, overwrite = mode == SaveMode.Overwrite)
>   createdRelation
> }
> {code}
> which was implemented as part of this PR:
> https://github.com/apache/spark/commit/1b490e91fd6b5d06d9caeb50e597639ccfc0bc3b
> This is very unexpected behaviour for some use cases where files are read from 
> one place and written to another, like small-file packing - it ends up with 
> incompatible files, because "required" normally can't be promoted to optional. 
> It is the essence of a schema that it enforces the "required" invariant on the 
> data, so it should be assumed that this is intentional.
> I believe a better approach is for the default behaviour to keep the schema 
> as is, and to provide e.g. a builder method or option to allow forcing
