[jira] [Commented] (SPARK-5770) Use addJar() to upload a new jar file to executor, it can't be added to classloader

2016-06-23 Thread huangning (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347750#comment-15347750
 ] 

huangning commented on SPARK-5770:
--

Because deleteJar and updateJar are not supported, I can't update a class 
without restarting the Thrift server.

> Use addJar() to upload a new jar file to executor, it can't be added to 
> classloader
> ---
>
> Key: SPARK-5770
> URL: https://issues.apache.org/jira/browse/SPARK-5770
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: meiyoula
>Priority: Minor
>
> First use addJar() to upload a jar to the executor, then change the jar 
> content and upload it again. We can see that the local jar file has been 
> updated, but the classloader still loads the old one. The executor log has no 
> error or exception pointing this out.
> I tested this with spark-shell, with "spark.files.overwrite" set to true.
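A minimal spark-shell sketch of the scenario described above (the jar path is a placeholder, and "spark.files.overwrite" is assumed to be set to true):

{code}
// First upload: classes in the jar are resolved normally on the executors.
sc.addJar("/tmp/my-udfs.jar")

// ... rebuild /tmp/my-udfs.jar with changed class bodies ...

// Second upload: the jar file on each executor is overwritten on disk,
// but the executor classloader keeps serving the classes it already loaded.
sc.addJar("/tmp/my-udfs.jar")
{code}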






[jira] [Assigned] (SPARK-16179) UDF explosion yielding empty dataframe fails

2016-06-23 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-16179:
--

Assignee: Davies Liu

> UDF explosion yielding empty dataframe fails
> 
>
> Key: SPARK-16179
> URL: https://issues.apache.org/jira/browse/SPARK-16179
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Vladimir Feinberg
>Assignee: Davies Liu
>
> Command to replicate 
> https://gist.github.com/vlad17/cff2bab81929f44556a364ee90981ac0
> Resulting failure
> https://gist.github.com/vlad17/964c0a93510d79cb130c33700f6139b7






[jira] [Commented] (SPARK-16000) Make model loading backward compatible with saved models using old vector columns

2016-06-23 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347705#comment-15347705
 ] 

Xiangrui Meng commented on SPARK-16000:
---

We can distribute the work easily if you can help create sub-task JIRAs and 
list the models that require automatic conversion. Having 
convertMatrixColumnsToML/FromML sounds good to me.
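For reference, a rough sketch of how the existing vector-column helpers are called (the matrix variants discussed above are assumed to follow the same shape; "df" and "features" are placeholders):

{code}
import org.apache.spark.mllib.util.MLUtils

// df has columns of the old mllib.linalg.Vector type, e.g. data written by a Spark 1.6 model.
val toMl = MLUtils.convertVectorColumnsToML(df, "features")

// ...and back, if an old consumer still expects mllib vectors.
val fromMl = MLUtils.convertVectorColumnsFromML(toMl, "features")
{code}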

> Make model loading backward compatible with saved models using old vector 
> columns
> -
>
> Key: SPARK-16000
> URL: https://issues.apache.org/jira/browse/SPARK-16000
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: yuhao yang
>
> To help users migrate from Spark 1.6. to 2.0, we should make model loading 
> backward compatible with models saved in 1.6. The main incompatibility is the 
> vector column type change.






[jira] [Updated] (SPARK-16179) UDF explosion yielding empty dataframe fails

2016-06-23 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-16179:
---
Affects Version/s: 2.0.0

> UDF explosion yielding empty dataframe fails
> 
>
> Key: SPARK-16179
> URL: https://issues.apache.org/jira/browse/SPARK-16179
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Vladimir Feinberg
>
> Command to replicate 
> https://gist.github.com/vlad17/cff2bab81929f44556a364ee90981ac0
> Resulting failure
> https://gist.github.com/vlad17/964c0a93510d79cb130c33700f6139b7






[jira] [Assigned] (SPARK-16181) Incorrect behavior for isNull filter

2016-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16181:


Assignee: (was: Apache Spark)

> Incorrect behavior for isNull filter
> 
>
> Key: SPARK-16181
> URL: https://issues.apache.org/jira/browse/SPARK-16181
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Kevin Chen
>
> Repro:
> JavaRDD<Row> leftRdd = javaSparkContext.parallelize(ImmutableList.of(RowFactory.create("x")));
> JavaRDD<Row> rightRdd = javaSparkContext.parallelize(ImmutableList.of(RowFactory.create("y")));
> StructType schema = DataTypes.createStructType(ImmutableList.of(
> DataTypes.createStructField("col", DataTypes.StringType, true)));
> Dataset<Row> left = sparkSession.createDataFrame(leftRdd, schema);
> Dataset<Row> right = sparkSession.createDataFrame(rightRdd, schema);
> // add a column to the right
> Dataset<Row> withConstantColumn = right.withColumn("new", functions.lit(true));
> // do a left join. Nothing matches; expect Dataset joined to have a single row ['x', null, null]
> Column joinCondition = left.col("col").equalTo(right.col("col"));
> Dataset<Row> joined = left.join(withConstantColumn, joinCondition, LeftOuter.toString());
> // filter for nulls, still expect the single row ['x', null, null]
> Dataset<Row> filtered = joined.filter(functions.col("new").isNull());
> // This fails with 1 != 0
> Assert.assertEquals(1, filtered.count());
> [~rxin]






[jira] [Assigned] (SPARK-16181) Incorrect behavior for isNull filter

2016-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16181:


Assignee: Apache Spark

> Incorrect behavior for isNull filter
> 
>
> Key: SPARK-16181
> URL: https://issues.apache.org/jira/browse/SPARK-16181
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Kevin Chen
>Assignee: Apache Spark
>
> Repro:
> JavaRDD<Row> leftRdd = javaSparkContext.parallelize(ImmutableList.of(RowFactory.create("x")));
> JavaRDD<Row> rightRdd = javaSparkContext.parallelize(ImmutableList.of(RowFactory.create("y")));
> StructType schema = DataTypes.createStructType(ImmutableList.of(
> DataTypes.createStructField("col", DataTypes.StringType, true)));
> Dataset<Row> left = sparkSession.createDataFrame(leftRdd, schema);
> Dataset<Row> right = sparkSession.createDataFrame(rightRdd, schema);
> // add a column to the right
> Dataset<Row> withConstantColumn = right.withColumn("new", functions.lit(true));
> // do a left join. Nothing matches; expect Dataset joined to have a single row ['x', null, null]
> Column joinCondition = left.col("col").equalTo(right.col("col"));
> Dataset<Row> joined = left.join(withConstantColumn, joinCondition, LeftOuter.toString());
> // filter for nulls, still expect the single row ['x', null, null]
> Dataset<Row> filtered = joined.filter(functions.col("new").isNull());
> // This fails with 1 != 0
> Assert.assertEquals(1, filtered.count());
> [~rxin]






[jira] [Assigned] (SPARK-16179) UDF explosion yielding empty dataframe fails

2016-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16179:


Assignee: Apache Spark

> UDF explosion yielding empty dataframe fails
> 
>
> Key: SPARK-16179
> URL: https://issues.apache.org/jira/browse/SPARK-16179
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Vladimir Feinberg
>Assignee: Apache Spark
>
> Command to replicate 
> https://gist.github.com/vlad17/cff2bab81929f44556a364ee90981ac0
> Resulting failure
> https://gist.github.com/vlad17/964c0a93510d79cb130c33700f6139b7






[jira] [Commented] (SPARK-16181) Incorrect behavior for isNull filter

2016-06-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347704#comment-15347704
 ] 

Apache Spark commented on SPARK-16181:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/13884

> Incorrect behavior for isNull filter
> 
>
> Key: SPARK-16181
> URL: https://issues.apache.org/jira/browse/SPARK-16181
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Kevin Chen
>
> Repro:
> JavaRDD<Row> leftRdd = javaSparkContext.parallelize(ImmutableList.of(RowFactory.create("x")));
> JavaRDD<Row> rightRdd = javaSparkContext.parallelize(ImmutableList.of(RowFactory.create("y")));
> StructType schema = DataTypes.createStructType(ImmutableList.of(
> DataTypes.createStructField("col", DataTypes.StringType, true)));
> Dataset<Row> left = sparkSession.createDataFrame(leftRdd, schema);
> Dataset<Row> right = sparkSession.createDataFrame(rightRdd, schema);
> // add a column to the right
> Dataset<Row> withConstantColumn = right.withColumn("new", functions.lit(true));
> // do a left join. Nothing matches; expect Dataset joined to have a single row ['x', null, null]
> Column joinCondition = left.col("col").equalTo(right.col("col"));
> Dataset<Row> joined = left.join(withConstantColumn, joinCondition, LeftOuter.toString());
> // filter for nulls, still expect the single row ['x', null, null]
> Dataset<Row> filtered = joined.filter(functions.col("new").isNull());
> // This fails with 1 != 0
> Assert.assertEquals(1, filtered.count());
> [~rxin]






[jira] [Commented] (SPARK-16164) CombineFilters should keep the ordering in the logical plan

2016-06-23 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347701#comment-15347701
 ] 

Cheng Lian commented on SPARK-16164:


I'm posting a summary of our offline and GitHub discussion about this issue 
here for future reference.

The case described in this ticket can be generalized as the following query 
plan fragment:

{noformat}
Filter p1
 Project ...
  Filter p2
   
{noformat}

In this fragment, predicate {{p1}} is actually a partial function, which means 
it's not defined for all possible input values (for example, {{scala.math.log}} 
is only defined for positive numbers). Thus this query plan relies on the fact 
that {{p1}} is evaluated after {{p2}}, so that {{p2}} can reduce the range of 
input values for {{p1}}. However, filter push-down and {{CombineFilters}} 
optimize this fragment into:

{noformat}
Project ...
 Filter p1 && p2
  
{noformat}

which forces {{p1}} to be evaluated before {{p2}}. This causes the exception 
because now {{p1}} may receive invalid input values that are supposed to be 
filtered out by {{p2}}.

The problem here is that the SQL optimizer should have the freedom to 
rearrange filter predicate evaluation order. For example, we may want to 
evaluate cheap predicates first to short-circuit expensive ones. However, to 
enable this kind of optimization, we essentially require all filter predicates 
to be deterministic *full* functions, which is violated in the above case 
({{p1}} is not a full function).

[PR #13872|https://github.com/apache/spark/pull/13872] "fixes" the specific 
case mentioned in this ticket by adjusting optimization rule 
{{CombineFilters}}, which is safe. But in general, user applications should NOT 
make the assumption that filter predicates are always evaluated in the order 
they appear in the original query.
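As a schematic illustration of the fragment above (run in spark-shell; the UDF and column names are made up and are not from the original report):

{code}
import org.apache.spark.sql.functions.udf

val parse = udf { s: String => s.toInt }   // partial: fails on non-numeric input

val df = Seq("1", "2", "oops").toDF("s")
  .filter($"s" rlike "^[0-9]+$")           // Filter p2: keeps only values in parse's domain
  .select(parse($"s").as("n"))             // Project
  .filter($"n" > 1)                        // Filter p1: relies on p2 having run first

// With the pre-fix CombineFilters behavior, p1 may end up evaluated together with
// (and before) p2 below the Project, so parse("oops") can be reached and throw.
df.show()
{code}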


> CombineFilters should keep the ordering in the logical plan
> ---
>
> Key: SPARK-16164
> URL: https://issues.apache.org/jira/browse/SPARK-16164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Dongjoon Hyun
> Fix For: 2.0.1, 2.1.0
>
>
> [~cmccubbin] reported a bug when he used StringIndexer in an ML pipeline with 
> additional filters. It seems that during filter pushdown, we changed the 
> ordering in the logical plan. I'm not sure whether we should treat this as a 
> bug.
> {code}
> val df1 = (0 until 3).map(_.toString).toDF
> val indexer = new StringIndexer()
>   .setInputCol("value")
>   .setOutputCol("idx")
>   .setHandleInvalid("skip")
>   .fit(df1)
> val df2 = (0 until 5).map(_.toString).toDF
> val predictions = indexer.transform(df2)
> predictions.show() // this is okay
> predictions.where('idx > 2).show() // this will throw an exception
> {code}
> Please see the notebook at 
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1233855/2159162931615821/588180/latest.html
>  for error messages.






[jira] [Updated] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2016-06-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15784:
--
Target Version/s: 2.1.0

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.






[jira] [Updated] (SPARK-16133) model loading backward compatibility for ml.feature

2016-06-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16133:
--
Assignee: yuhao yang

> model loading backward compatibility for ml.feature
> ---
>
> Key: SPARK-16133
> URL: https://issues.apache.org/jira/browse/SPARK-16133
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Assignee: yuhao yang
> Fix For: 2.0.1, 2.1.0
>
>







[jira] [Resolved] (SPARK-16133) model loading backward compatibility for ml.feature

2016-06-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-16133.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 13844
[https://github.com/apache/spark/pull/13844]

> model loading backward compatibility for ml.feature
> ---
>
> Key: SPARK-16133
> URL: https://issues.apache.org/jira/browse/SPARK-16133
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
> Fix For: 2.0.1, 2.1.0
>
>







[jira] [Assigned] (SPARK-16179) UDF explosion yielding empty dataframe fails

2016-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16179:


Assignee: (was: Apache Spark)

> UDF explosion yielding empty dataframe fails
> 
>
> Key: SPARK-16179
> URL: https://issues.apache.org/jira/browse/SPARK-16179
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Vladimir Feinberg
>
> Command to replicate 
> https://gist.github.com/vlad17/cff2bab81929f44556a364ee90981ac0
> Resulting failure
> https://gist.github.com/vlad17/964c0a93510d79cb130c33700f6139b7






[jira] [Commented] (SPARK-16179) UDF explosion yielding empty dataframe fails

2016-06-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347692#comment-15347692
 ] 

Apache Spark commented on SPARK-16179:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/13883

> UDF explosion yielding empty dataframe fails
> 
>
> Key: SPARK-16179
> URL: https://issues.apache.org/jira/browse/SPARK-16179
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Vladimir Feinberg
>
> Command to replicate 
> https://gist.github.com/vlad17/cff2bab81929f44556a364ee90981ac0
> Resulting failure
> https://gist.github.com/vlad17/964c0a93510d79cb130c33700f6139b7






[jira] [Resolved] (SPARK-16142) Group naive Bayes methods in generated doc

2016-06-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-16142.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 13877
[https://github.com/apache/spark/pull/13877]

> Group naive Bayes methods in generated doc
> --
>
> Key: SPARK-16142
> URL: https://issues.apache.org/jira/browse/SPARK-16142
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>  Labels: starter
> Fix For: 2.0.1, 2.1.0
>
>
> Follow SPARK-16107 and group the doc of spark.naiveBayes: spark.naiveBayes, 
> predict(NB), summary(NB), read/write.ml(NB) under Rd spark.naiveBayes.






[jira] [Updated] (SPARK-16177) model loading backward compatibility for ml.regression

2016-06-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16177:
--
Assignee: yuhao yang

> model loading backward compatibility for ml.regression
> --
>
> Key: SPARK-16177
> URL: https://issues.apache.org/jira/browse/SPARK-16177
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>







[jira] [Resolved] (SPARK-16177) model loading backward compatibility for ml.regression

2016-06-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-16177.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 13879
[https://github.com/apache/spark/pull/13879]

> model loading backward compatibility for ml.regression
> --
>
> Key: SPARK-16177
> URL: https://issues.apache.org/jira/browse/SPARK-16177
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>







[jira] [Created] (SPARK-16183) Large Spark SQL commands cause StackOverflowError in parser when using sqlContext.sql

2016-06-23 Thread Matthew Porter (JIRA)
Matthew Porter created SPARK-16183:
--

 Summary: Large Spark SQL commands cause StackOverflowError in 
parser when using sqlContext.sql
 Key: SPARK-16183
 URL: https://issues.apache.org/jira/browse/SPARK-16183
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.6.1
 Environment: Running on AWS EMR
Reporter: Matthew Porter


Hi,

I have created a PySpark SQL-based tool which auto-generates a complex SQL 
command to be run via sqlContext.sql(cmd) based on a large number of 
parameters. As the number of input files to be filtered and joined in this 
query grows, so does the length of the SQL query. The tool runs fine until 
roughly 200 files or more are included in the join, at which point the SQL command 
becomes very long (~100K characters). It is only on these longer queries that 
Spark fails, throwing an exception due to what seems to be too much recursion 
occurring within the SparkSQL parser:

{code}
Traceback (most recent call last):
...
merged_df = sqlsc.sql(cmd)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 
580, in sql
  File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 
813, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, 
in deco
  File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, 
in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o173.sql.
: java.lang.StackOverflowError
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 

[jira] [Commented] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext

2016-06-23 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347576#comment-15347576
 ] 

Yin Huai commented on SPARK-15345:
--

[~m1lan] Do you use hive-site.xml to store the url to your metastore?
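For context, a minimal sketch of supplying the metastore location through SparkSession conf instead of hive-site.xml (the URI below is a placeholder):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("hive.metastore.uris", "thrift://metastore-host:9083")  // placeholder URI
  .getOrCreate()  // note: with a pre-existing SparkContext this conf may be ignored, which is this ticket
spark.sql("show databases").show()
{code}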

> SparkSession's conf doesn't take effect when there's already an existing 
> SparkContext
> -
>
> Key: SPARK-15345
> URL: https://issues.apache.org/jira/browse/SPARK-15345
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Piotr Milanowski
>Assignee: Reynold Xin
>Priority: Blocker
> Fix For: 2.0.0
>
>
> I am working with branch-2.0; Spark is compiled with Hive support (-Phive and 
> -Phive-thriftserver).
> I am trying to access databases using this snippet:
> {code}
> from pyspark.sql import HiveContext
> hc = HiveContext(sc)
> hc.sql("show databases").collect()
> [Row(result='default')]
> {code}
> This means that Spark doesn't find any databases specified in the configuration.
> Using the same configuration (i.e. hive-site.xml and core-site.xml) in Spark 
> 1.6 and launching the above snippet, I can print out the existing databases.
> When run in DEBUG mode, this is what Spark (2.0) prints out:
> {code}
> 16/05/16 12:17:47 INFO SparkSqlParser: Parsing command: show databases
> 16/05/16 12:17:47 DEBUG SimpleAnalyzer: 
> === Result of Batch Resolution ===
> !'Project [unresolveddeserializer(createexternalrow(if (isnull(input[0, 
> string])) null else input[0, string].toString, 
> StructField(result,StringType,false)), result#2) AS #3]   Project 
> [createexternalrow(if (isnull(result#2)) null else result#2.toString, 
> StructField(result,StringType,false)) AS #3]
>  +- LocalRelation [result#2]  
>   
>  +- LocalRelation [result#2]
> 
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.Dataset$$anonfun$53.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  private final 
> org.apache.spark.sql.types.StructType 
> org.apache.spark.sql.Dataset$$anonfun$53.structType$1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(org.apache.spark.sql.catalyst.InternalRow)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  +++ closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) is now cleaned +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1)
>  +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(scala.collection.Iterator)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  +++ closure  
> 

[jira] [Commented] (SPARK-16168) Spark sql can not read ORC table

2016-06-23 Thread AnfengYuan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347566#comment-15347566
 ] 

AnfengYuan commented on SPARK-16168:


OK, to make it clear, I can reproduce it with the following steps:
1. Use Hive (version 1.2.1) to create an ORC table:
{code}create table testorc(id bigint, name string) stored as orc;{code}
2. Insert one row into this table:
{code}insert into testorc values(1, '1');{code}
3. Start the spark-sql shell and describe this table; I got:
{code}
spark-sql> desc testorc;
16/06/24 10:20:03 INFO execution.SparkSqlParser: Parsing command: desc testorc
16/06/24 10:20:03 INFO spark.SparkContext: Starting job: processCmd at 
CliDriver.java:376
16/06/24 10:20:03 INFO scheduler.DAGScheduler: Got job 1 (processCmd at 
CliDriver.java:376) with 1 output partitions
16/06/24 10:20:03 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 
(processCmd at CliDriver.java:376)
16/06/24 10:20:03 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/06/24 10:20:03 INFO scheduler.DAGScheduler: Missing parents: List()
16/06/24 10:20:03 INFO scheduler.DAGScheduler: Submitting ResultStage 1 
(MapPartitionsRDD[5] at processCmd at CliDriver.java:376), which has no missing 
parents
16/06/24 10:20:03 INFO memory.MemoryStore: Block broadcast_1 stored as values 
in memory (estimated size 4.2 KB, free 912.3 MB)
16/06/24 10:20:03 INFO memory.MemoryStore: Block broadcast_1_piece0 stored as 
bytes in memory (estimated size 2.5 KB, free 912.3 MB)
16/06/24 10:20:03 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in 
memory on 192.168.178.37:2874 (size: 2.5 KB, free: 912.3 MB)
16/06/24 10:20:03 INFO spark.SparkContext: Created broadcast 1 from broadcast 
at DAGScheduler.scala:996
16/06/24 10:20:03 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from 
ResultStage 1 (MapPartitionsRDD[5] at processCmd at CliDriver.java:376)
16/06/24 10:20:03 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 
tasks
16/06/24 10:20:03 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 
(TID 1, localhost, partition 0, PROCESS_LOCAL, 5571 bytes)
16/06/24 10:20:03 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 1)
16/06/24 10:20:03 INFO codegen.CodeGenerator: Code generated in 23.231239 ms
16/06/24 10:20:03 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 
1). 1046 bytes result sent to driver
16/06/24 10:20:03 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 
(TID 1) in 78 ms on localhost (1/1)
16/06/24 10:20:03 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose 
tasks have all completed, from pool 
16/06/24 10:20:03 INFO scheduler.DAGScheduler: ResultStage 1 (processCmd at 
CliDriver.java:376) finished in 0.079 s
16/06/24 10:20:03 INFO scheduler.DAGScheduler: Job 1 finished: processCmd at 
CliDriver.java:376, took 0.097624 s
id  bigint  NULL
namestring  NULL
Time taken: 0.28 seconds, Fetched 2 row(s)
16/06/24 10:20:03 INFO CliDriver: Time taken: 0.28 seconds, Fetched 2 row(s)
{code}
4. Query this table:
{code}select * from testorc;{code}

Then I got this error:
{code}
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1429)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1417)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1416)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1416)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1638)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1597)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1586)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1872)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1885)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1898)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1912)
at 

[jira] [Comment Edited] (SPARK-14172) Hive table partition predicate not passed down correctly

2016-06-23 Thread Pedro Osorio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347556#comment-15347556
 ] 

Pedro Osorio edited comment on SPARK-14172 at 6/24/16 2:10 AM:
---

[~lucasmf] I have been able to reproduce the issue by running the following 
code in a spark shell:

{code}
sqlContext.sql("drop table test_partition_predicate")
sqlContext.sql("create table test_partition_predicate (col1 string) partitioned 
by (partition_col string)")
sqlContext.sql("explain extended select * from test_partition_predicate where 
partition_col = '1' and rand() < 0.9").collect().foreach(println)
{code}

I have verified using community.cloud.databricks.com that in spark1.4 the 
physical plan uses the partition_col predicate in the HiveTableScan, but in 
versions 1.6 and 2.0, this only shows up in the filters section, which means 
the whole table is scanned.

As [~jiangxb1987] pointed out, the issue is in collectProjectsAndFilters and 
was introduced in this patch, I believe: 
https://github.com/apache/spark/pull/8486


was (Author: pedrosorio):
[~lucasmf] I have been able to reproduce the issue by running the following 
code in a spark shell:

{code}
sqlContext.sql("drop table test_partition_predicate")
sqlContext.sql("create table test_partition_predicate (col1 string) partitioned 
by (partition_col string)")
sqlContext.sql("explain extended select * from test_partition_predicate where 
partition_col = '1' and rand() < 0.9").collect().foreach(println)
{code}

I have verified using community.cloud.databricks.com that in spark1.4 the 
physical plan uses the partition_col predicate in the HiveTableScan, but in 
versions 1.6 and 2.0, this shows up in the filters section, which means the 
whole table is scanned.

As [~jiangxb1987] pointed out, the issue is in collectProjectsAndFilters and 
was introduced in this patch, I believe: 
https://github.com/apache/spark/pull/8486

> Hive table partition predicate not passed down correctly
> 
>
> Key: SPARK-14172
> URL: https://issues.apache.org/jira/browse/SPARK-14172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yingji Zhang
>Priority: Critical
>
> When the Hive SQL contains nondeterministic fields, the Spark plan will not push 
> down the partition predicate to the HiveTableScan. For example:
> {code}
> -- consider following query which uses a random function to sample rows
> SELECT *
> FROM table_a
> WHERE partition_col = 'some_value'
> AND rand() < 0.01;
> {code}
> The Spark plan will not push down the partition predicate to HiveTableScan, 
> which ends up scanning data from all partitions of the table.






[jira] [Commented] (SPARK-14172) Hive table partition predicate not passed down correctly

2016-06-23 Thread Pedro Osorio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347556#comment-15347556
 ] 

Pedro Osorio commented on SPARK-14172:
--

[~lucasmf] I have been able to reproduce the issue by running the following 
code in a spark shell:

{code}
sqlContext.sql("drop table test_partition_predicate")
sqlContext.sql("create table test_partition_predicate (col1 string) partitioned 
by (partition_col string)")
sqlContext.sql("explain extended select * from test_partition_predicate where 
partition_col = '1' and rand() < 0.9").collect().foreach(println)
{code}

I have verified using community.cloud.databricks.com that in spark1.4 the 
physical plan uses the partition_col predicate in the HiveTableScan, but in 
versions 1.6 and 2.0, this shows up in the filters section, which means the 
whole table is scanned.

As [~jiangxb1987] pointed out, the issue is in collectProjectsAndFilters and 
was introduced in this patch, I believe: 
https://github.com/apache/spark/pull/8486

> Hive table partition predicate not passed down correctly
> 
>
> Key: SPARK-14172
> URL: https://issues.apache.org/jira/browse/SPARK-14172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yingji Zhang
>Priority: Critical
>
> When the Hive SQL contains nondeterministic fields, the Spark plan will not push 
> down the partition predicate to the HiveTableScan. For example:
> {code}
> -- consider following query which uses a random function to sample rows
> SELECT *
> FROM table_a
> WHERE partition_col = 'some_value'
> AND rand() < 0.01;
> {code}
> The Spark plan will not push down the partition predicate to HiveTableScan, 
> which ends up scanning data from all partitions of the table.






[jira] [Resolved] (SPARK-16123) Avoid NegativeArraySizeException while reserving additional capacity in VectorizedColumnReader

2016-06-23 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-16123.
---
Resolution: Resolved
  Assignee: Sameer Agarwal

> Avoid NegativeArraySizeException while reserving additional capacity in 
> VectorizedColumnReader
> --
>
> Key: SPARK-16123
> URL: https://issues.apache.org/jira/browse/SPARK-16123
> Project: Spark
>  Issue Type: Bug
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
>
> Both off-heap and on-heap variants of ColumnVector.reserve() can 
> unfortunately overflow while reserving additional capacity during reads.
> {code}
> Caused by: java.lang.NegativeArraySizeException
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.reserveInternal(OnHeapColumnVector.java:461)
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.reserve(OnHeapColumnVector.java:397)
>   at 
> org.apache.spark.sql.execution.vectorized.ColumnVector.appendBytes(ColumnVector.java:675)
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putByteArray(OnHeapColumnVector.java:389)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readBinary(VectorizedPlainValuesReader.java:167)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readBinarys(VectorizedRleValuesReader.java:402)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:372)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:194)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:169)
> {code} 
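A toy illustration of the capacity arithmetic that can go negative (purely illustrative, not the actual reserve() code):

{code}
val elementsNeeded = 1200000000          // a vector already holding over a billion entries
val newCapacity = elementsNeeded * 2     // Int overflow: evaluates to -1894967296
// new Array[Byte](newCapacity)          // would throw java.lang.NegativeArraySizeException
{code}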






[jira] [Created] (SPARK-16182) companion object's shutdownhook not being invoked

2016-06-23 Thread Christian Chua (JIRA)
Christian Chua created SPARK-16182:
--

 Summary: companion object's shutdownhook not being invoked
 Key: SPARK-16182
 URL: https://issues.apache.org/jira/browse/SPARK-16182
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.1
 Environment: OSX El Capitan (java "1.8.0_65"), Oracle Linux 6 (java 
1.8.0_92-b14)
Reporter: Christian Chua
Priority: Critical


Spark streaming documentation recommends application developers create static 
connection pools. To clean up this pool, we add a shutdown hook.

The problem is that in Spark 1.6.1, the shutdown hook on an executor is called 
only for the first submitted job; on the second and subsequent job submissions, 
the executor's shutdown hook is NOT invoked.

The problem is not seen when using Java 1.7, and not seen when using Spark 1.6.0.

It looks like this bug was introduced by this change from 1.6.0 to 1.6.1:
https://issues.apache.org/jira/browse/SPARK-12486

Steps to reproduce the problem:

1.) install spark 1.6.1
2.) submit this basic spark application

import org.apache.spark.{ SparkContext, SparkConf }

object MyPool {
    def printToFile(f: java.io.File)(op: java.io.PrintWriter => Unit) {
        val p = new java.io.PrintWriter(f)
        try {
            op(p)
        } finally {
            p.close()
        }
    }

    def myfunc() = {
        "a"
    }

    def createEvidence() = {
        printToFile(new java.io.File("/var/tmp/evidence.txt")) { p =>
            p.println("the evidence")
        }
    }

    sys.addShutdownHook {
        createEvidence()
    }
}

object BasicSpark {
    def main(args: Array[String]) = {
        val sparkConf = new SparkConf().setAppName("BasicPi")
        val sc = new SparkContext(sparkConf)
        sc.parallelize(1 to 2).foreach { i =>
            println("f : " + MyPool.myfunc())
        }
        sc.stop()
    }
}


3.) you will see that /var/tmp/evidence.txt is created
4.) now delete this file 
5.) submit a second job
6.) you will see that /var/tmp/evidence.txt is no longer created on the second 
submission

7.) if you use java 7 or spark 1.6.0, the evidence file will be created on the 
second and subsequent submits







[jira] [Commented] (SPARK-16181) Incorrect behavior for isNull filter

2016-06-23 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347509#comment-15347509
 ] 

Reynold Xin commented on SPARK-16181:
-

cc [~cloud_fan]

> Incorrect behavior for isNull filter
> 
>
> Key: SPARK-16181
> URL: https://issues.apache.org/jira/browse/SPARK-16181
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Kevin Chen
>
> Repro:
> JavaRDD<Row> leftRdd = javaSparkContext.parallelize(ImmutableList.of(RowFactory.create("x")));
> JavaRDD<Row> rightRdd = javaSparkContext.parallelize(ImmutableList.of(RowFactory.create("y")));
> StructType schema = DataTypes.createStructType(ImmutableList.of(
> DataTypes.createStructField("col", DataTypes.StringType, true)));
> Dataset<Row> left = sparkSession.createDataFrame(leftRdd, schema);
> Dataset<Row> right = sparkSession.createDataFrame(rightRdd, schema);
> // add a column to the right
> Dataset<Row> withConstantColumn = right.withColumn("new", functions.lit(true));
> // do a left join. Nothing matches; expect Dataset joined to have a single row ['x', null, null]
> Column joinCondition = left.col("col").equalTo(right.col("col"));
> Dataset<Row> joined = left.join(withConstantColumn, joinCondition, LeftOuter.toString());
> // filter for nulls, still expect the single row ['x', null, null]
> Dataset<Row> filtered = joined.filter(functions.col("new").isNull());
> // This fails with 1 != 0
> Assert.assertEquals(1, filtered.count());
> [~rxin]






[jira] [Commented] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext

2016-06-23 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347497#comment-15347497
 ] 

Wenchen Fan commented on SPARK-15345:
-

hi, can you open a new JIRA for the remaining problem? thanks!

> SparkSession's conf doesn't take effect when there's already an existing 
> SparkContext
> -
>
> Key: SPARK-15345
> URL: https://issues.apache.org/jira/browse/SPARK-15345
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Piotr Milanowski
>Assignee: Reynold Xin
>Priority: Blocker
> Fix For: 2.0.0
>
>
> I am working with branch-2.0; Spark is compiled with Hive support (-Phive and 
> -Phive-thriftserver).
> I am trying to access databases using this snippet:
> {code}
> from pyspark.sql import HiveContext
> hc = HiveContext(sc)
> hc.sql("show databases").collect()
> [Row(result='default')]
> {code}
> This means that Spark doesn't find any databases specified in the configuration.
> Using the same configuration (i.e. hive-site.xml and core-site.xml) in Spark 
> 1.6 and launching the above snippet, I can print out the existing databases.
> When run in DEBUG mode, this is what Spark (2.0) prints out:
> {code}
> 16/05/16 12:17:47 INFO SparkSqlParser: Parsing command: show databases
> 16/05/16 12:17:47 DEBUG SimpleAnalyzer: 
> === Result of Batch Resolution ===
> !'Project [unresolveddeserializer(createexternalrow(if (isnull(input[0, 
> string])) null else input[0, string].toString, 
> StructField(result,StringType,false)), result#2) AS #3]   Project 
> [createexternalrow(if (isnull(result#2)) null else result#2.toString, 
> StructField(result,StringType,false)) AS #3]
>  +- LocalRelation [result#2]  
>   
>  +- LocalRelation [result#2]
> 
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.Dataset$$anonfun$53.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  private final 
> org.apache.spark.sql.types.StructType 
> org.apache.spark.sql.Dataset$$anonfun$53.structType$1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(org.apache.spark.sql.catalyst.InternalRow)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  +++ closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) is now cleaned +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1)
>  +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(scala.collection.Iterator)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  +++ closure  
> 

[jira] [Assigned] (SPARK-3723) DecisionTree, RandomForest: Add more instrumentation

2016-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3723:
---

Assignee: (was: Apache Spark)

> DecisionTree, RandomForest: Add more instrumentation
> 
>
> Key: SPARK-3723
> URL: https://issues.apache.org/jira/browse/SPARK-3723
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Some simple instrumentation would help advanced users understand performance, 
> and to check whether parameters (such as maxMemoryInMB) need to be tuned.
> Most important instrumentation (simple):
> * min, avg, max nodes per group
> * number of groups (passes over data)
> More advanced instrumentation:
> * For each tree (or averaged over trees), training set accuracy after 
> training each level.  This would be useful for visualizing learning behavior 
> (to convince oneself that model selection was being done correctly).






[jira] [Assigned] (SPARK-3723) DecisionTree, RandomForest: Add more instrumentation

2016-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3723:
---

Assignee: Apache Spark

> DecisionTree, RandomForest: Add more instrumentation
> 
>
> Key: SPARK-3723
> URL: https://issues.apache.org/jira/browse/SPARK-3723
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> Some simple instrumentation would help advanced users understand performance, 
> and to check whether parameters (such as maxMemoryInMB) need to be tuned.
> Most important instrumentation (simple):
> * min, avg, max nodes per group
> * number of groups (passes over data)
> More advanced instrumentation:
> * For each tree (or averaged over trees), training set accuracy after 
> training each level.  This would be useful for visualizing learning behavior 
> (to convince oneself that model selection was being done correctly).






[jira] [Commented] (SPARK-3723) DecisionTree, RandomForest: Add more instrumentation

2016-06-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347489#comment-15347489
 ] 

Apache Spark commented on SPARK-3723:
-

User 'smurching' has created a pull request for this issue:
https://github.com/apache/spark/pull/13881

> DecisionTree, RandomForest: Add more instrumentation
> 
>
> Key: SPARK-3723
> URL: https://issues.apache.org/jira/browse/SPARK-3723
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Some simple instrumentation would help advanced users understand performance, 
> and to check whether parameters (such as maxMemoryInMB) need to be tuned.
> Most important instrumentation (simple):
> * min, avg, max nodes per group
> * number of groups (passes over data)
> More advanced instrumentation:
> * For each tree (or averaged over trees), training set accuracy after 
> training each level.  This would be useful for visualizing learning behavior 
> (to convince oneself that model selection was being done correctly).






[jira] [Created] (SPARK-16181) Incorrect behavior for isNull filter

2016-06-23 Thread Kevin Chen (JIRA)
Kevin Chen created SPARK-16181:
--

 Summary: Incorrect behavior for isNull filter
 Key: SPARK-16181
 URL: https://issues.apache.org/jira/browse/SPARK-16181
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Kevin Chen


Repro:

JavaRDD<Row> leftRdd = javaSparkContext.parallelize(ImmutableList.of(RowFactory.create("x")));
JavaRDD<Row> rightRdd = javaSparkContext.parallelize(ImmutableList.of(RowFactory.create("y")));
StructType schema = DataTypes.createStructType(ImmutableList.of(
        DataTypes.createStructField("col", DataTypes.StringType, true)));
Dataset<Row> left = sparkSession.createDataFrame(leftRdd, schema);
Dataset<Row> right = sparkSession.createDataFrame(rightRdd, schema);

// add a column to the right
Dataset<Row> withConstantColumn = right.withColumn("new", functions.lit(true));
// do a left join. Nothing matches; expect Dataset joined to have a single row ['x', null, null]
Column joinCondition = left.col("col").equalTo(right.col("col"));
Dataset<Row> joined = left.join(withConstantColumn, joinCondition, LeftOuter.toString());
// filter for nulls, still expect the single row ['x', null, null]
Dataset<Row> filtered = joined.filter(functions.col("new").isNull());
// This fails with 1 != 0
Assert.assertEquals(1, filtered.count());

[~rxin]
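
For convenience, here is a Scala sketch of the same repro, assuming a Spark 2.0 
spark-shell session where `spark` and its implicits are available; it simply 
mirrors the Java code above.

{code}
import org.apache.spark.sql.functions._
import spark.implicits._

val left  = Seq("x").toDF("col")
val right = Seq("y").toDF("col")

// add a constant column to the right side, then left-outer join; nothing matches
val withConstantColumn = right.withColumn("new", lit(true))
val joined = left.join(withConstantColumn, left("col") === right("col"), "left_outer")

// expect the single row ['x', null, null] to survive the isNull filter
val filtered = joined.filter(col("new").isNull)
assert(filtered.count() == 1)  // reported to return 0 instead
{code}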



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16168) Spark sql can not read ORC table

2016-06-23 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347481#comment-15347481
 ] 

Hyukjin Kwon commented on SPARK-16168:
--

I think it would be nicer if there were some code to reproduce this.

> Spark sql can not read ORC table
> 
>
> Key: SPARK-16168
> URL: https://issues.apache.org/jira/browse/SPARK-16168
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.1.0
>Reporter: AnfengYuan
>
> When using the spark-sql shell to query an ORC table, exceptions are thrown.
> My table was generated by the tool in 
> https://github.com/hortonworks/hive-testbench
> {code}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1429)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1417)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1416)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1416)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1638)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1597)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1586)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1872)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1885)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1898)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:310)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$3.apply(QueryExecution.scala:131)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$3.apply(QueryExecution.scala:130)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at 
> org.apache.spark.sql.execution.QueryExecution.hiveResultString(QueryExecution.scala:130)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:323)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:239)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.IllegalArgumentException: Field "i_item_sk" does not 
> exist.
>   at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:254)
>   at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:254)
>   at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>   at scala.collection.AbstractMap.getOrElse(Map.scala:59)
>   at 
> 

[jira] [Closed] (SPARK-16146) Spark application failed by Yarn preempting

2016-06-23 Thread Cong Feng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cong Feng closed SPARK-16146.
-
Resolution: Fixed

> Spark application failed by Yarn preempting
> ---
>
> Key: SPARK-16146
> URL: https://issues.apache.org/jira/browse/SPARK-16146
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
> Environment: Amazon EC2, centos 6.6,
> Spark-1.6.1-bin-hadoop-2.6(binary from spark official web), Hadoop 2.7.2, 
> preemption and dynamic allocation enabled.
>Reporter: Cong Feng
>
> Hi,
> We are setting up our Spark cluster on Amazon EC2. We are using Spark YARN 
> client mode with Spark-1.6.1-bin-hadoop-2.6 (the binary from the Spark 
> official site) and Hadoop 2.7.2. We also enable preemption, dynamic 
> allocation, and spark.shuffle.service.enabled.
> During our tests we found that our Spark application frequently gets killed 
> when preemption happens. Mostly it seems the driver is trying to send an RPC 
> to an executor that has already been preempted; there are also some 
> "connection reset by peer" exceptions that cause the job to fail. Below are 
> the typical exceptions we found:
> 16/06/22 08:13:30 ERROR spark.ContextCleaner: Error cleaning RDD 49
> java.io.IOException: Failed to send RPC 5721681506291542850 to 
> nodexx.xx..ddns.xx.com/xx.xx.xx.xx:42857: 
> java.nio.channels.ClosedChannelException
> at 
> org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:239)
> at 
> org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:226)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:567)
> at 
> io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:801)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:699)
> at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1122)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:633)
> at 
> io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:32)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:908)
> at 
> io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:960)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:893)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.nio.channels.ClosedChannelException
> And 
> 16/06/19 22:33:14 INFO storage.BlockManager: Removing RDD 122
> 16/06/19 22:33:14 WARN server.TransportChannelHandler: Exception in 
> connection from nodexx-xx-xx.xx.ddns.xx.com/xx.xx.xx.xx:56618
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at sun.nio.ch.IOUtil.read(IOUtil.java:192)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
> at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
> at 
> io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
> at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)
> 16/06/19 22:33:14 ERROR client.TransportResponseHandler: Still have 2 
> requests outstanding when connection from 
> 

[jira] [Commented] (SPARK-16146) Spark application failed by Yarn preempting

2016-06-23 Thread Cong Feng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347479#comment-15347479
 ] 

Cong Feng commented on SPARK-16146:
---

It finally turned out to be a Zeppelin issue. We ran the same job in the Spark 
shell and saw exactly the same exception, but the shell is able to handle it and 
keeps the rest of the tasks running to the end, while Zeppelin treats those 
exceptions as fatal and fails the job (at least at the UI level). Thanks everyone 
for all the help; I'm resolving the ticket for now.

> Spark application failed by Yarn preempting
> ---
>
> Key: SPARK-16146
> URL: https://issues.apache.org/jira/browse/SPARK-16146
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
> Environment: Amazon EC2, centos 6.6,
> Spark-1.6.1-bin-hadoop-2.6(binary from spark official web), Hadoop 2.7.2, 
> preemption and dynamic allocation enabled.
>Reporter: Cong Feng
>
> Hi,
> We are setting up our Spark cluster on Amazon EC2. We are using Spark YARN 
> client mode with Spark-1.6.1-bin-hadoop-2.6 (the binary from the Spark 
> official site) and Hadoop 2.7.2. We also enable preemption, dynamic 
> allocation, and spark.shuffle.service.enabled.
> During our tests we found that our Spark application frequently gets killed 
> when preemption happens. Mostly it seems the driver is trying to send an RPC 
> to an executor that has already been preempted; there are also some 
> "connection reset by peer" exceptions that cause the job to fail. Below are 
> the typical exceptions we found:
> 16/06/22 08:13:30 ERROR spark.ContextCleaner: Error cleaning RDD 49
> java.io.IOException: Failed to send RPC 5721681506291542850 to 
> nodexx.xx..ddns.xx.com/xx.xx.xx.xx:42857: 
> java.nio.channels.ClosedChannelException
> at 
> org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:239)
> at 
> org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:226)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:567)
> at 
> io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:801)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:699)
> at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1122)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:633)
> at 
> io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:32)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:908)
> at 
> io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:960)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:893)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.nio.channels.ClosedChannelException
> And 
> 16/06/19 22:33:14 INFO storage.BlockManager: Removing RDD 122
> 16/06/19 22:33:14 WARN server.TransportChannelHandler: Exception in 
> connection from nodexx-xx-xx.xx.ddns.xx.com/xx.xx.xx.xx:56618
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at sun.nio.ch.IOUtil.read(IOUtil.java:192)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
> at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
> at 
> io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
> at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> at 
> 

[jira] [Commented] (SPARK-16172) SQL Context's

2016-06-23 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347477#comment-15347477
 ] 

Hyukjin Kwon commented on SPARK-16172:
--

Do you mind updating the title to be more descriptive and testing this against 
newer versions?

> SQL Context's 
> --
>
> Key: SPARK-16172
> URL: https://issues.apache.org/jira/browse/SPARK-16172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0, 1.2.1, 1.2.2
>Reporter: Scott Viteri
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Read operations on unsupported JSON files hang instead of failing.
> The jsonFile and jsonRDD methods take arbitrarily long when trying to read 
> multiple gzipped JSON files. If the RDD creation is not going to succeed, it 
> should fail quickly.
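
For reference, a minimal sketch of the reported call pattern, assuming a Spark 
1.2.x shell where jsonFile returns a SchemaRDD; the path is only illustrative.

{code}
// sc is the shell's SparkContext; the path stands in for a directory of gzipped JSON files.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val schemaRdd = sqlContext.jsonFile("/data/logs/*.json.gz")  // reported to hang instead of failing
schemaRdd.count()
{code}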



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType

2016-06-23 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347465#comment-15347465
 ] 

MIN-FU YANG commented on SPARK-15516:
-

OK, I've reproduced the problem with the following code:
{code}
import sqlContext.implicits._
val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
df1.write.parquet("data/test_table/key=1")
val df2 = sc.makeRDD(6 to 10).map(i => (i * 2147483647000L, i)).toDF("single", "triple")
df2.write.parquet("data/test_table/key=2")
val df3 = sqlContext.read.option("mergeSchema", "true").parquet("data/test_table")
df3.registerTempTable("df3")
sqlContext.sql("CREATE TABLE new_key_value_store LOCATION '/Users/lucasmf/data/new_key_value_store' AS select * from df3")
sqlContext.sql("SELECT * FROM new_key_value_store")
{code}

I'll look into it.

> Schema merging in driver fails for parquet when merging LongType and 
> IntegerType
> 
>
> Key: SPARK-15516
> URL: https://issues.apache.org/jira/browse/SPARK-15516
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Databricks
>Reporter: Hossein Falaki
>
> I tried to create a table from partitioned parquet directories that requires 
> schema merging. I get following error:
> {code}
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:831)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:826)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:826)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:801)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Failed to merge incompatible data 
> types LongType and IntegerType
> at org.apache.spark.sql.types.StructType$.merge(StructType.scala:462)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:420)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:418)
> at scala.Option.map(Option.scala:145)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:418)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:415)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at org.apache.spark.sql.types.StructType$.merge(StructType.scala:415)
> at org.apache.spark.sql.types.StructType.merge(StructType.scala:333)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:829)
> {code}
> cc @rxin and [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType

2016-06-23 Thread MIN-FU YANG (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MIN-FU YANG updated SPARK-15516:

Comment: was deleted

(was: I would like to look into this issue)

> Schema merging in driver fails for parquet when merging LongType and 
> IntegerType
> 
>
> Key: SPARK-15516
> URL: https://issues.apache.org/jira/browse/SPARK-15516
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Databricks
>Reporter: Hossein Falaki
>
> I tried to create a table from partitioned parquet directories that requires 
> schema merging. I get following error:
> {code}
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:831)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:826)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:826)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:801)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Failed to merge incompatible data 
> types LongType and IntegerType
> at org.apache.spark.sql.types.StructType$.merge(StructType.scala:462)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:420)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:418)
> at scala.Option.map(Option.scala:145)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:418)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:415)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at org.apache.spark.sql.types.StructType$.merge(StructType.scala:415)
> at org.apache.spark.sql.types.StructType.merge(StructType.scala:333)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:829)
> {code}
> cc @rxin and [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16178) SQL - Hive writer should not require partition names to match table partitions

2016-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16178:


Assignee: (was: Apache Spark)

> SQL - Hive writer should not require partition names to match table partitions
> --
>
> Key: SPARK-16178
> URL: https://issues.apache.org/jira/browse/SPARK-16178
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Ryan Blue
>
> SPARK-14459 added a check that the {{partition}} metadata on 
> {{InsertIntoTable}} must match the table's partition column names. But if 
> {{partitionBy}} is used to set up partition columns, those columns may not be 
> named or the names may not match.
> For example:
> {code}
> // Tables:
> // CREATE TABLE src (id string, date int, hour int, timestamp bigint);
> // CREATE TABLE dest (id string, timestamp bigint, c1 string, c2 int)
> //   PARTITIONED BY (utc_dateint int, utc_hour int);
> spark.table("src").write.partitionBy("date", "hour").insertInto("dest")
> {code}
> The call to partitionBy correctly places the date and hour columns at the end 
> of the logical plan, but the names don't match the "utc_" prefix and the 
> write fails. But the analyzer will verify the types and insert an {{Alias}} 
> so the query is actually valid.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16178) SQL - Hive writer should not require partition names to match table partitions

2016-06-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347462#comment-15347462
 ] 

Apache Spark commented on SPARK-16178:
--

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/13880

> SQL - Hive writer should not require partition names to match table partitions
> --
>
> Key: SPARK-16178
> URL: https://issues.apache.org/jira/browse/SPARK-16178
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Ryan Blue
>
> SPARK-14459 added a check that the {{partition}} metadata on 
> {{InsertIntoTable}} must match the table's partition column names. But if 
> {{partitionBy}} is used to set up partition columns, those columns may not be 
> named or the names may not match.
> For example:
> {code}
> // Tables:
> // CREATE TABLE src (id string, date int, hour int, timestamp bigint);
> // CREATE TABLE dest (id string, timestamp bigint, c1 string, c2 int)
> //   PARTITIONED BY (utc_dateint int, utc_hour int);
> spark.table("src").write.partitionBy("date", "hour").insertInto("dest")
> {code}
> The call to partitionBy correctly places the date and hour columns at the end 
> of the logical plan, but the names don't match the "utc_" prefix and the 
> write fails. But the analyzer will verify the types and insert an {{Alias}} 
> so the query is actually valid.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16178) SQL - Hive writer should not require partition names to match table partitions

2016-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16178:


Assignee: Apache Spark

> SQL - Hive writer should not require partition names to match table partitions
> --
>
> Key: SPARK-16178
> URL: https://issues.apache.org/jira/browse/SPARK-16178
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Ryan Blue
>Assignee: Apache Spark
>
> SPARK-14459 added a check that the {{partition}} metadata on 
> {{InsertIntoTable}} must match the table's partition column names. But if 
> {{partitionBy}} is used to set up partition columns, those columns may not be 
> named or the names may not match.
> For example:
> {code}
> // Tables:
> // CREATE TABLE src (id string, date int, hour int, timestamp bigint);
> // CREATE TABLE dest (id string, timestamp bigint, c1 string, c2 int)
> //   PARTITIONED BY (utc_dateint int, utc_hour int);
> spark.table("src").write.partitionBy("date", "hour").insertInto("dest")
> {code}
> The call to partitionBy correctly places the date and hour columns at the end 
> of the logical plan, but the names don't match the "utc_" prefix and the 
> write fails. But the analyzer will verify the types and insert an {{Alias}} 
> so the query is actually valid.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16180) Task hang on fetching blocks (cached RDD)

2016-06-23 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-16180:
---
Description: 
Here is the stack dump of the executor:

{code}
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
scala.concurrent.Await$.result(package.scala:107)
org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:102)
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:588)
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:585)
org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:570)
org.apache.spark.storage.BlockManager.get(BlockManager.scala:630)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:46)
org.apache.spark.scheduler.Task.run(Task.scala:96)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)

{code}

  was:
{code}
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
scala.concurrent.Await$.result(package.scala:107)
org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:102)
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:588)
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:585)
org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:570)
org.apache.spark.storage.BlockManager.get(BlockManager.scala:630)

[jira] [Updated] (SPARK-16180) Task hang on fetching blocks (cached RDD)

2016-06-23 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-16180:
---
Affects Version/s: 1.6.1

> Task hang on fetching blocks (cached RDD)
> -
>
> Key: SPARK-16180
> URL: https://issues.apache.org/jira/browse/SPARK-16180
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.6.1
>Reporter: Davies Liu
>
> {code}
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> scala.concurrent.Await$.result(package.scala:107)
> org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:102)
> org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:588)
> org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585)
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:585)
> org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:570)
> org.apache.spark.storage.BlockManager.get(BlockManager.scala:630)
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:46)
> org.apache.spark.scheduler.Task.run(Task.scala:96)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16180) Task hang on fetching blocks (cached RDD)

2016-06-23 Thread Davies Liu (JIRA)
Davies Liu created SPARK-16180:
--

 Summary: Task hang on fetching blocks (cached RDD)
 Key: SPARK-16180
 URL: https://issues.apache.org/jira/browse/SPARK-16180
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu


{code}
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
scala.concurrent.Await$.result(package.scala:107)
org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:102)
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:588)
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:585)
org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:570)
org.apache.spark.storage.BlockManager.get(BlockManager.scala:630)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:46)
org.apache.spark.scheduler.Task.run(Task.scala:96)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)

{code}
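
The stack dump bottoms out in a blocking Await.result inside fetchBlockSync. As a 
generic Scala illustration (not the BlockTransferService code), an unbounded wait 
parks the calling task for as long as the reply never arrives, while a bounded 
wait surfaces the stuck fetch as a failure; the timeout below is a made-up value.

{code}
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

// Stand-in for a remote block fetch whose reply never arrives.
val reply = Promise[Array[Byte]]().future

try {
  Await.result(reply, 2.seconds)  // a bounded wait fails fast with a TimeoutException
} catch {
  case _: java.util.concurrent.TimeoutException => println("fetch timed out instead of hanging")
}

// Await.result(reply, Duration.Inf) would park the calling thread indefinitely,
// which is the shape of the hang in the stack dump above.
{code}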



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType

2016-06-23 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347454#comment-15347454
 ] 

Hossein Falaki commented on SPARK-15516:


Here is an example:
{code}
create table mytable using parquet options (path "/logs/parquet-files", 
mergeSchema "true")
{code}

> Schema merging in driver fails for parquet when merging LongType and 
> IntegerType
> 
>
> Key: SPARK-15516
> URL: https://issues.apache.org/jira/browse/SPARK-15516
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Databricks
>Reporter: Hossein Falaki
>
> I tried to create a table from partitioned parquet directories that requires 
> schema merging. I get following error:
> {code}
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:831)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:826)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:826)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:801)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Failed to merge incompatible data 
> types LongType and IntegerType
> at org.apache.spark.sql.types.StructType$.merge(StructType.scala:462)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:420)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:418)
> at scala.Option.map(Option.scala:145)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:418)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:415)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at org.apache.spark.sql.types.StructType$.merge(StructType.scala:415)
> at org.apache.spark.sql.types.StructType.merge(StructType.scala:333)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:829)
> {code}
> cc @rxin and [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16179) UDF explosion yielding empty dataframe fails

2016-06-23 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-16179:
-

 Summary: UDF explosion yielding empty dataframe fails
 Key: SPARK-16179
 URL: https://issues.apache.org/jira/browse/SPARK-16179
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Reporter: Vladimir Feinberg


Command to replicate 
https://gist.github.com/vlad17/cff2bab81929f44556a364ee90981ac0

Resulting failure
https://gist.github.com/vlad17/964c0a93510d79cb130c33700f6139b7



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14351) Optimize ImpurityAggregator for decision trees

2016-06-23 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335165#comment-15335165
 ] 

Manoj Kumar edited comment on SPARK-14351 at 6/23/16 11:43 PM:
---

OK, so here are some benchmarks that partially validate your claims (all 
trained to maxDepth=30 with the auto feature selection strategy). The trend is 
that as the number of trees increases, the impact becomes larger. I'll see what 
I can optimize tomorrow.

|| n_tree ||  n_samples || n_features || totalTime ||  percent of total time 
spent in impurityCalculator || percent of total time spent in impurityStats ||
|1 |  1 |  500 | 7.90 | 0.328% | 0.01%
|10 |  1 |  500 | 7.67 | 1.3% | 0.12%
|100 |  1 |  500 | 18.156 | 5.19% | 0.29%
|1 |  500 |  1 | 7.1308 | 0.39% | 0.014%
|10 |  500 |  1 | 7.5506 | 1.37% | 0.12%
|100 |  500 |  1 | 17.61| 6.18% | 0.349%
|1 |  1000 |  1000 | 6.99 | 0.28% | 0.029%
|10 |  1000 |  1000 | 7.415  | 1.7% | 0.09%
|100 |  1000 |  1000 | 17.89 | 6.1% | 0.3%
|500 |  1000 |  1000 | 71.02 | 6.8% | 0.3%



was (Author: mechcoder):
OK, so here are some benchmarks that partially validate your claims (all 
trained to maxDepth=30 with the auto feature selection strategy). The trend is 
that as the number of trees increases, the impact becomes larger. I'll see what 
I can optimize tomorrow.

|| n_tree ||  n_samples || n_features || totalTime ||  percent of total time 
spent in impurityCalculator || percent of total time spent in impurityStats ||
|1 |  1 |  500 | 7.90 | 0.328% | 0.01%
|10 |  1 |  500 | 7.67 | 1.3% | 0.12%
|100 |  1 |  500 | 18.156 | 5.19% | 0.29%
|1 |  500 |  1 | 7.1308 | 0.39% | 0.014%
|10 |  500 |  1 | 7.5506 | 1.37% | 0.12%
|100 |  500 |  1 | 17.61| 6.18% | 0.349%
|1 |  1000 |  1000 | 6.99 | 0.28% | 0.029%
|10 |  1000 |  1000 | 7.415  | 1.7% | 0.09%
|100 |  1000 |  1000 | 17.89 | 6.1% | 0.3%
|500 |  1000 |  1000 | 71.02 | 6.8% | 0.3%


> Optimize ImpurityAggregator for decision trees
> --
>
> Key: SPARK-14351
> URL: https://issues.apache.org/jira/browse/SPARK-14351
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> {{RandomForest.binsToBestSplit}} currently takes a large amount of time.  
> Based on some quick profiling, I believe a big chunk of this is spent in 
> {{ImpurityAggregator.getCalculator}} (which seems to make unnecessary Array 
> copies) and {{RandomForest.calculateImpurityStats}}.
> This JIRA is for:
> * Doing more profiling to confirm that unnecessary time is being spent in 
> some of these methods.
> * Optimizing the implementation
> * Profiling again to confirm the speedups
> Local profiling for large enough examples should suffice, especially since 
> the optimizations should not need to change the amount of data communicated.
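
For the local-profiling step mentioned above, a generic timing-harness sketch of 
the sort that can attribute wall time to a hot method; this is illustrative only, 
not Spark internals, and the names are placeholders.

{code}
// Accumulate wall time spent in a wrapped code path and report it as a fraction of a total.
final class HotTimer {
  private var hotNanos = 0L
  def timed[T](body: => T): T = {
    val t0 = System.nanoTime()
    try body finally { hotNanos += System.nanoTime() - t0 }
  }
  def fractionOf(totalNanos: Long): Double = hotNanos.toDouble / totalNanos
}

// Usage sketch: wrap the suspected hot call in timer.timed { ... } and report
// timer.fractionOf(totalRunNanos) after the run.
{code}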



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14351) Optimize ImpurityAggregator for decision trees

2016-06-23 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335165#comment-15335165
 ] 

Manoj Kumar edited comment on SPARK-14351 at 6/23/16 11:43 PM:
---

OK, so here are some benchmarks that partially validate your claims (all 
trained to maxDepth=30 with the auto feature selection strategy). The trend is 
that as the number of trees increases, the impact becomes larger. I'll see what 
I can optimize tomorrow.

|| n_tree ||  n_samples || n_features || totalTime ||  percent of total time 
spent in impurityCalculator || percent of total time spent in impurityStats ||
|1 |  1 |  500 | 7.90 | 0.328% | 0.01%
|10 |  1 |  500 | 7.67 | 1.3% | 0.12%
|100 |  1 |  500 | 18.156 | 5.19% | 0.29%
|1 |  500 |  1 | 7.1308 | 0.39% | 0.014%
|10 |  500 |  1 | 7.5506 | 1.37% | 0.12%
|100 |  500 |  1 | 17.61| 6.18% | 0.349%
|1 |  1000 |  1000 | 6.99 | 0.28% | 0.029%
|10 |  1000 |  1000 | 7.415  | 1.7% | 0.09%
|100 |  1000 |  1000 | 17.89 | 6.1% | 0.3%
|500 |  1000 |  1000 | 71.02 | 6.8% | 0.3%



was (Author: mechcoder):
OK, so here are some benchmarks that partially validate your claims (all 
trained to maxDepth=30 with the auto feature selection strategy). The trend is 
that as the number of trees increases, the impact becomes larger. I'll see what 
I can optimize tomorrow.

|| n_tree ||  n_samples || n_features || totalTime ||  percent in 
binsToBestSplit || percent in impurityCalculator || percent in 
impurityStatsTime ||
|1 |  1 |  500 | 2 | 19.5% | 15% | 0.1%
|10 |  1 |  500 | 2.45 | 13% | 8.5%| 0.7%
|100 |  1 |  500 | 4.48 | 64.5% | 41.5% | 2.1%
|500 |  1 |  500 | 15.2 | 89.6% | 61.1% | 3.4%
|1 |  500 |  1 | 2.16 | 18.5% | 16.2% | ~
|10 |  500 |  1 | 2.70 | 14.8% | 11.1%| 0.4%
|100 |  500 |  1 | 9.07 | 43.5% | 31.4% | 1.9%
|1 |  1000 |  1000 | 2.02 | 24.7% | 14.8% | 0.2%
|10 |  1000 |  1000 | 6.2 | 12.8% | 9.6%| 0.1%
|50 |  1000 |  1000 | 4.05 | 38.5% | 28.8% | 2.8%
|100 |  1000 |  1000 | 10.19 | 45.3% | 30.6% | 3.18%


> Optimize ImpurityAggregator for decision trees
> --
>
> Key: SPARK-14351
> URL: https://issues.apache.org/jira/browse/SPARK-14351
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> {{RandomForest.binsToBestSplit}} currently takes a large amount of time.  
> Based on some quick profiling, I believe a big chunk of this is spent in 
> {{ImpurityAggregator.getCalculator}} (which seems to make unnecessary Array 
> copies) and {{RandomForest.calculateImpurityStats}}.
> This JIRA is for:
> * Doing more profiling to confirm that unnecessary time is being spent in 
> some of these methods.
> * Optimizing the implementation
> * Profiling again to confirm the speedups
> Local profiling for large enough examples should suffice, especially since 
> the optimizations should not need to change the amount of data communicated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16178) SQL - Hive writer should not require partition names to match table partitions

2016-06-23 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-16178:
-

 Summary: SQL - Hive writer should not require partition names to 
match table partitions
 Key: SPARK-16178
 URL: https://issues.apache.org/jira/browse/SPARK-16178
 Project: Spark
  Issue Type: Sub-task
Reporter: Ryan Blue


SPARK-14459 added a check that the {{partition}} metadata on 
{{InsertIntoTable}} must match the table's partition column names. But if 
{{partitionBy}} is used to set up partition columns, those columns may not be 
named or the names may not match.

For example:

{code}
// Tables:
// CREATE TABLE src (id string, date int, hour int, timestamp bigint);
// CREATE TABLE dest (id string, timestamp bigint, c1 string, c2 int)
//   PARTITIONED BY (utc_dateint int, utc_hour int);

spark.table("src").write.partitionBy("date", "hour").insertInto("dest")
{code}

The call to partitionBy correctly places the date and hour columns at the end 
of the logical plan, but the names don't match the "utc_" prefix and the write 
fails. But the analyzer will verify the types and insert an {{Alias}} so the 
query is actually valid.
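
Until the check is relaxed, one workaround sketch is to rename the columns so they 
match the destination's partition names before the insert, assuming the types 
already line up as the analyzer would verify; insertInto still matches data 
columns by position, so only the names are being adjusted here.

{code}
// Rename src's partition columns to the names dest expects, then insert.
spark.table("src")
  .withColumnRenamed("date", "utc_dateint")
  .withColumnRenamed("hour", "utc_hour")
  .write
  .partitionBy("utc_dateint", "utc_hour")
  .insertInto("dest")
{code}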



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType

2016-06-23 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347428#comment-15347428
 ] 

MIN-FU YANG commented on SPARK-15516:
-

Do you have any sample code?

> Schema merging in driver fails for parquet when merging LongType and 
> IntegerType
> 
>
> Key: SPARK-15516
> URL: https://issues.apache.org/jira/browse/SPARK-15516
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Databricks
>Reporter: Hossein Falaki
>
> I tried to create a table from partitioned parquet directories that requires 
> schema merging. I get following error:
> {code}
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:831)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:826)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:826)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:801)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Failed to merge incompatible data 
> types LongType and IntegerType
> at org.apache.spark.sql.types.StructType$.merge(StructType.scala:462)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:420)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:418)
> at scala.Option.map(Option.scala:145)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:418)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:415)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at org.apache.spark.sql.types.StructType$.merge(StructType.scala:415)
> at org.apache.spark.sql.types.StructType.merge(StructType.scala:333)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:829)
> {code}
> cc @rxin and [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16165) Fix the update logic for InMemoryTableScanExec.readBatches accumulator

2016-06-23 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-16165.

   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 13870
[https://github.com/apache/spark/pull/13870]

> Fix the update logic for InMemoryTableScanExec.readBatches accumulator
> --
>
> Key: SPARK-16165
> URL: https://issues.apache.org/jira/browse/SPARK-16165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>
> Currently, the `readBatches` accumulator of `InMemoryTableScanExec` is updated 
> only when `spark.sql.inMemoryColumnarStorage.partitionPruning` is true.
> Although this metric is used only for testing purposes, it should be updated 
> correctly regardless of SQL options.
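
For reference, a short sketch (not from the patch) of the kind of in-memory scan whose {{readBatches}} value is affected; the cached table name is hypothetical, assuming a spark-shell session where {{spark}} is the SparkSession:

{code}
// with partition pruning disabled, the readBatches accumulator was not updated before this fix
spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruning", "false")
val cached = spark.table("events").cache()
cached.filter("key > 10").count()   // scan over the cached relation whose metric is discussed above
{code}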



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16165) Fix the update logic for InMemoryTableScanExec.readBatches accumulator

2016-06-23 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16165:
---
Assignee: Dongjoon Hyun

> Fix the update logic for InMemoryTableScanExec.readBatches accumulator
> --
>
> Key: SPARK-16165
> URL: https://issues.apache.org/jira/browse/SPARK-16165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> Currently, the `readBatches` accumulator of `InMemoryTableScanExec` is updated 
> only when `spark.sql.inMemoryColumnarStorage.partitionPruning` is true.
> Although this metric is used only for testing purposes, it should be updated 
> correctly regardless of SQL options.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16174) Improve OptimizeIn optimizer to remove deterministic repetitions

2016-06-23 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16174:
--
Description: This issue improves the `OptimizeIn` optimizer to remove 
deterministic repetitions from SQL `IN` predicates. This prevents user mistakes 
and can also optimize some queries like 
[TPCDS-36|https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q36.sql#L19].
  (was: This issue adds an optimizer to remove the duplicated literals from SQL 
`IN` predicates. This optimizer prevents user mistakes and also can optimize 
some queries like 
[TPCDS-36|https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q36.sql#L19].)
Summary: Improve OptimizeIn optimizer to remove deterministic 
repetitions  (was: Add RemoveLiteralRepetitionFromIn optimizer)

> Improve OptimizeIn optimizer to remove deterministic repetitions
> 
>
> Key: SPARK-16174
> URL: https://issues.apache.org/jira/browse/SPARK-16174
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> This issue improves the `OptimizeIn` optimizer to remove deterministic 
> repetitions from SQL `IN` predicates. This prevents user mistakes and can also 
> optimize some queries like 
> [TPCDS-36|https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q36.sql#L19].
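
As an illustration (not taken from the patch), a predicate with repeated deterministic entries that such a rule could collapse; the table and column names are hypothetical, assuming {{spark}} is the SparkSession:

{code}
// duplicated literals and a deterministic expression in the IN list ...
val before = spark.sql("SELECT * FROM sales WHERE d_year IN (1999, 1999, 1999 + 1, 2000, 2001)")
// ... are equivalent to the deduplicated form the improved rule could produce
val after = spark.sql("SELECT * FROM sales WHERE d_year IN (1999, 2000, 2001)")
{code}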



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15663) SparkSession.catalog.listFunctions shouldn't include the list of built-in functions

2016-06-23 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-15663:
-
Labels: release_notes releasenotes  (was: )

> SparkSession.catalog.listFunctions shouldn't include the list of built-in 
> functions
> ---
>
> Key: SPARK-15663
> URL: https://issues.apache.org/jira/browse/SPARK-15663
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Sandeep Singh
>  Labels: release_notes, releasenotes
> Fix For: 2.0.0
>
>
> SparkSession.catalog.listFunctions currently returns all functions, including 
> the list of built-in functions. This makes the method not as useful because 
> anytime it is run the result set contains over 100 built-in functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16177) model loading backward compatibility for ml.regression

2016-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16177:


Assignee: Apache Spark

> model loading backward compatibility for ml.regression
> --
>
> Key: SPARK-16177
> URL: https://issues.apache.org/jira/browse/SPARK-16177
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16177) model loading backward compatibility for ml.regression

2016-06-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347399#comment-15347399
 ] 

Apache Spark commented on SPARK-16177:
--

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/13879

> model loading backward compatibility for ml.regression
> --
>
> Key: SPARK-16177
> URL: https://issues.apache.org/jira/browse/SPARK-16177
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16177) model loading backward compatibility for ml.regression

2016-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16177:


Assignee: (was: Apache Spark)

> model loading backward compatibility for ml.regression
> --
>
> Key: SPARK-16177
> URL: https://issues.apache.org/jira/browse/SPARK-16177
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16032) Audit semantics of various insertion operations related to partitioned tables

2016-06-23 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347172#comment-15347172
 ] 

Ryan Blue edited comment on SPARK-16032 at 6/23/16 10:59 PM:
-

bq. I am not sure apply by-name resolution just to partition columns is a good 
idea.

I'm not sure about this. After looking into it more, I agree in principle that in 
the long term we don't want to mix by-position column matching with by-name 
partitioning. But I'm less certain about whether it's a good idea right now. The 
more I look at it, the more I agree with you about what is "right". But I'm still 
concerned about how to move forward from where we are, given the way people are 
currently using the API.

I think we've already established that it isn't clear that the DataFrameWriter 
API relies on position. I actually think that most people aren't thinking about 
the choice between by-position and by-name resolution; they use whatever they can 
get working. My first use of the API was to build a partitioned table from an 
unpartitioned table, which failed. When I went looking for a solution, 
{{partitionBy}} was the obvious choice (suggested by my IDE) and, sure enough, 
it fixed the problem by [moving the partition columns by 
name|https://github.com/apache/spark/blob/v1.6.1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L180]
 to the end. This solution is common because it works and is more obvious than 
thinking about column order, since, as I noted above, it isn't clear that 
{{insertInto}} uses position.

The pattern of using {{partitionBy}} with {{insertInto}} has also become a best 
practice for maintaining ETL jobs in Spark. Consider this table setup, where 
data lands in {{src}} in batches and we move it to {{dest}} for long-term 
storage in Parquet. Here's some example DDL:

{code:lang=sql}
CREATE TABLE src (id string, timestamp bigint, other_properties map<string,string>);
CREATE TABLE dest (id string, timestamp bigint, c1 string, c2 int)
  PARTITIONED BY (utc_dateint int, utc_hour int);
{code}

The Spark code for this ETL job should be this:

{code:lang=java}
spark.table("src")
  .withColumn("c1", $"other_properties".getItem("c1"))
  .withColumn("c2", $"other_properties".getItem("c2").cast(IntegerType))
  .withColumn("date", dateint($"timetamp"))
  .withColumn("hour", hour($"timestamp"))
  .dropColumn("other_properties")
  .write.insertInto("dest")
{code}

But, users are likely to try this next version instead. That's because it isn't 
obvious that partition columns go after data columns; they are two separate 
lists in the DDL.

{code:lang=java}
spark.table("src")
  .withColumn("date", dateint($"timetamp"))
  .withColumn("hour", hour($"timestamp"))
  .withColumn("c1", $"other_properties".getItem("c1"))
  .withColumn("c2", $"other_properties".getItem("c2").cast(IntegerType))
  .dropColumn("other_properties")
  .write.insertInto("dest")
{code}

And again, the most obvious fix is to add {{partitionBy}} to specify the 
partition columns, which appears to users as a match for Hive's 
{{PARTITION("date", "hour")}} syntax. Users then get the impression that 
{{partitionBy}} is equivalent to {{PARTITION}}, though in reality Hive operates 
by position.

{code:lang=java}
spark.table("src")
  .withColumn("date", dateint($"timetamp"))
  .withColumn("hour", hour($"timestamp"))
  .withColumn("c1", $"other_properties".getItem("c1"))
  .withColumn("c2", $"other_properties".getItem("c2").cast(IntegerType))
  .dropColumn("other_properties")
  .write.partitionBy("date", "hour").insertInto("dest")Another case 
{code}

Another reason to use {{partitionBy}} is for maintaining ETL over time. When 
structure changes, so does column order. Say I want to add a dedup step so I 
get just one row per ID per day. My first attempt, based on getting the column 
order right to begin with, looks like this:

{code:lang=java}
// column orders change, causing the query to break
spark.table("src")
  .withColumn("date", dateint($"timetamp")) // moved to before dropDuplicates
  .dropDuplicates($"date", $"id")   // added to dedup records
  .withColumn("c1", $"other_properties".getItem("c1"))
  .withColumn("c2", $"other_properties".getItem("c2").cast(IntegerType))
  .withColumn("hour", hour($"timestamp"))
  .dropColumn("other_properties")
  .write.insertInto("dest")
{code}

The result is that I get crazy partitioning because c2 and hour end up being used 
as the partition values. The most obvious symptom of the wrong column order is the 
partitioning, and when I look into it, I find that {{partitionBy}} fixes it. In 
many cases, that's the first method I'll try because I see bad partition values. 
This solution doesn't always fix the query, but it does solve the partitioning 
problem I observed. (Also: other order problems are hidden by inserting {{Cast}} 
instead of the safer {{UpCast}}.)

Users will also _choose_ this over the right solution, 

[jira] [Updated] (SPARK-15443) Properly explain the streaming queries

2016-06-23 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-15443:
-
Fix Version/s: (was: 2.0.0)
   2.1.0
   2.0.1

> Properly explain the streaming queries
> --
>
> Key: SPARK-15443
> URL: https://issues.apache.org/jira/browse/SPARK-15443
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: Saisai Shao
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>
> Currently, when `explain()` is called on a streaming Dataset, it only shows the 
> parsed and analyzed logical plans, and exceptions in place of the optimized 
> logical plan and physical plan, as below:
> {code}
> scala> res0.explain(true)
> == Parsed Logical Plan ==
> FileSource[file:///tmp/input]
> == Analyzed Logical Plan ==
> value: string
> FileSource[file:///tmp/input]
> == Optimized Logical Plan ==
> org.apache.spark.sql.AnalysisException: Queries with streaming sources must 
> be executed with write.startStream();
> == Physical Plan ==
> org.apache.spark.sql.AnalysisException: Queries with streaming sources must 
> be executed with write.startStream();
> {code}
> The reason is that Structured Streaming materializes the plan dynamically at 
> run-time.
> So we should figure out a way to properly get the streaming plan. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15443) Properly explain the streaming queries

2016-06-23 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-15443.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13815
[https://github.com/apache/spark/pull/13815]

> Properly explain the streaming queries
> --
>
> Key: SPARK-15443
> URL: https://issues.apache.org/jira/browse/SPARK-15443
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: Saisai Shao
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently, when `explain()` is called on a streaming Dataset, it only shows the 
> parsed and analyzed logical plans, and exceptions in place of the optimized 
> logical plan and physical plan, as below:
> {code}
> scala> res0.explain(true)
> == Parsed Logical Plan ==
> FileSource[file:///tmp/input]
> == Analyzed Logical Plan ==
> value: string
> FileSource[file:///tmp/input]
> == Optimized Logical Plan ==
> org.apache.spark.sql.AnalysisException: Queries with streaming sources must 
> be executed with write.startStream();
> == Physical Plan ==
> org.apache.spark.sql.AnalysisException: Queries with streaming sources must 
> be executed with write.startStream();
> {code}
> The reason is that Structured Streaming materializes the plan dynamically at 
> run-time.
> So we should figure out a way to properly get the streaming plan. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16177) model loading backward compatibility for ml.regression

2016-06-23 Thread yuhao yang (JIRA)
yuhao yang created SPARK-16177:
--

 Summary: model loading backward compatibility for ml.regression
 Key: SPARK-16177
 URL: https://issues.apache.org/jira/browse/SPARK-16177
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: yuhao yang
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15854) Spark History server gets null pointer exception

2016-06-23 Thread Yesha Vora (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yesha Vora resolved SPARK-15854.

Resolution: Cannot Reproduce

> Spark History server gets null pointer exception
> 
>
> Key: SPARK-15854
> URL: https://issues.apache.org/jira/browse/SPARK-15854
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Yesha Vora
>
> In Spark 2, the Spark History Server is configured to use FsHistoryProvider. 
> The History Server does not show any finished/running applications and gets a 
> NullPointerException.
> {code}
> 16/06/03 23:06:40 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx:8020/spark2-history/application_1464912457462_0002.inprogress
> 16/06/03 23:06:50 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xx:8020/spark2-history/application_1464912457462_0002
> 16/06/03 23:08:27 WARN ServletHandler: Error for /api/v1/applications
> java.lang.NoSuchMethodError: 
> javax.ws.rs.core.Application.getProperties()Ljava/util/Map;
> at 
> org.glassfish.jersey.server.ApplicationHandler.(ApplicationHandler.java:331)
> at 
> org.glassfish.jersey.servlet.WebComponent.(WebComponent.java:392)
> at 
> org.glassfish.jersey.servlet.ServletContainer.init(ServletContainer.java:177)
> at 
> org.glassfish.jersey.servlet.ServletContainer.init(ServletContainer.java:369)
> at javax.servlet.GenericServlet.init(GenericServlet.java:244)
> at 
> org.spark_project.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:616)
> at 
> org.spark_project.jetty.servlet.ServletHolder.getServlet(ServletHolder.java:472)
> at 
> org.spark_project.jetty.servlet.ServletHolder.ensureInstance(ServletHolder.java:767)
> at 
> org.spark_project.jetty.servlet.ServletHolder.prepare(ServletHolder.java:752)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at 
> org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:479)
> at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
> at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> at org.spark_project.jetty.server.Server.handle(Server.java:499)
> at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
> at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
> at 
> org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
> at java.lang.Thread.run(Thread.java:745)
> 16/06/03 23:08:33 WARN ServletHandler: /api/v1/applications
> java.lang.NullPointerException
> at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
> at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
> at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
> at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at 
> org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:479)
> at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
> at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> at org.spark_project.jetty.server.Server.handle(Server.java:499)
> at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
> at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
> 

[jira] [Commented] (SPARK-15955) Failed Spark application returns with exitcode equals to zero

2016-06-23 Thread Yesha Vora (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347360#comment-15347360
 ] 

Yesha Vora commented on SPARK-15955:


[~sowen], I'm checking the exit code of the process that started the application. 

[~tgraves], this issue happens in both yarn-client and yarn-cluster modes. 


> Failed Spark application returns with exitcode equals to zero
> -
>
> Key: SPARK-15955
> URL: https://issues.apache.org/jira/browse/SPARK-15955
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Yesha Vora
>
> Scenario:
> * Set up cluster with wire-encryption enabled.
> * set 'spark.authenticate.enableSaslEncryption' = 'false' and 
> 'spark.shuffle.service.enabled' :'true'
> * run sparkPi application.
> {code}
> client token: Token { kind: YARN_CLIENT_TOKEN, service:  }
> diagnostics: Max number of executor failures (3) reached
> ApplicationMaster host: xx.xx.xx.xxx
> ApplicationMaster RPC port: 0
> queue: default
> start time: 1465941051976
> final status: FAILED
> tracking URL: https://xx.xx.xx.xxx:8090/proxy/application_1465925772890_0016/
> user: hrt_qa
> Exception in thread "main" org.apache.spark.SparkException: Application 
> application_1465925772890_0016 finished with failed status
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:1092)
> at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1139)
> at org.apache.spark.deploy.yarn.Client.main(Client.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> INFO ShutdownHookManager: Shutdown hook called{code}
> This Spark application exits with exit code 0. A failed application should not 
> return exit code 0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13288) [1.6.0] Memory leak in Spark streaming

2016-06-23 Thread roberto hashioka (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347355#comment-15347355
 ] 

roberto hashioka commented on SPARK-13288:
--

Do you really need to create multiple streams with DirectStream? I think you 
can create just one and Spark does the rest.

> [1.6.0] Memory leak in Spark streaming
> --
>
> Key: SPARK-13288
> URL: https://issues.apache.org/jira/browse/SPARK-13288
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: Bare metal cluster
> RHEL 6.6
>Reporter: JESSE CHEN
>  Labels: streaming
>
> Streaming in 1.6 seems to have a memory leak.
> Running the same streaming app in Spark 1.5.1 and 1.6, all things equal, 1.6 
> showed a gradual increasing processing time. 
> The app is simple: 1 Kafka receiver of tweet stream and 20 executors 
> processing the tweets in 5-second batches. 
> Spark 1.5.0 handles this smoothly and did not show increasing processing time 
> in the 40-minute test; but 1.6 showed increasing time about 8 minutes into 
> the test. Please see chart here:
> https://ibm.box.com/s/7q4ulik70iwtvyfhoj1dcl4nc469b116
> I captured heap dumps in two version and did a comparison. I noticed the Byte 
> is using 50X more space in 1.5.1.
> Here are some top classes in heap histogram and references. 
> Heap Histogram
>   
> All Classes (excluding platform)  
>   1.6.0 Streaming 1.5.1 Streaming 
> Class Instance Count  Total Size  Class   Instance Count  Total 
> Size
> class [B  84533,227,649,599   class [B5095
> 62,938,466
> class [C  44682   4,255,502   class [C130482  
> 12,844,182
> class java.lang.reflect.Method90591,177,670   class 
> java.lang.String  130171  1,562,052
>   
>   
> References by TypeReferences by Type  
>   
> class [B [0x640039e38]class [B [0x6c020bb08]  
> 
>   
> Referrers by Type Referrers by Type   
>   
> Class Count   Class   Count   
> java.nio.HeapByteBuffer   3239
> sun.security.util.DerInputBuffer1233
> sun.security.util.DerInputBuffer  1233
> sun.security.util.ObjectIdentifier  620 
> sun.security.util.ObjectIdentifier620 [[B 397 
> [Ljava.lang.Object;   408 java.lang.reflect.Method
> 326 
> 
> The total size by class B is 3GB in 1.5.1 and only 60MB in 1.6.0.
> The Java.nio.HeapByteBuffer referencing class did not show up in top in 
> 1.5.1. 
> I have also placed jstack output for 1.5.1 and 1.6.0 online..you can get them 
> here
> https://ibm.box.com/sparkstreaming-jstack160
> https://ibm.box.com/sparkstreaming-jstack151
> Jesse 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13288) [1.6.0] Memory leak in Spark streaming

2016-06-23 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347338#comment-15347338
 ] 

JESSE CHEN commented on SPARK-13288:


[~AlexSparkJiang] The code is:
val multiTweetStreams = (1 to numStreams).map { i =>
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topicsSet)
}
// unified stream
val tweetStream = ssc.union(multiTweetStreams)


> [1.6.0] Memory leak in Spark streaming
> --
>
> Key: SPARK-13288
> URL: https://issues.apache.org/jira/browse/SPARK-13288
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: Bare metal cluster
> RHEL 6.6
>Reporter: JESSE CHEN
>  Labels: streaming
>
> Streaming in 1.6 seems to have a memory leak.
> Running the same streaming app in Spark 1.5.1 and 1.6, all things equal, 1.6 
> showed a gradual increasing processing time. 
> The app is simple: 1 Kafka receiver of tweet stream and 20 executors 
> processing the tweets in 5-second batches. 
> Spark 1.5.0 handles this smoothly and did not show increasing processing time 
> in the 40-minute test; but 1.6 showed increasing time about 8 minutes into 
> the test. Please see chart here:
> https://ibm.box.com/s/7q4ulik70iwtvyfhoj1dcl4nc469b116
> I captured heap dumps in two version and did a comparison. I noticed the Byte 
> is using 50X more space in 1.5.1.
> Here are some top classes in heap histogram and references. 
> Heap Histogram
>   
> All Classes (excluding platform)  
>   1.6.0 Streaming 1.5.1 Streaming 
> Class Instance Count  Total Size  Class   Instance Count  Total 
> Size
> class [B  84533,227,649,599   class [B5095
> 62,938,466
> class [C  44682   4,255,502   class [C130482  
> 12,844,182
> class java.lang.reflect.Method90591,177,670   class 
> java.lang.String  130171  1,562,052
>   
>   
> References by TypeReferences by Type  
>   
> class [B [0x640039e38]class [B [0x6c020bb08]  
> 
>   
> Referrers by Type Referrers by Type   
>   
> Class Count   Class   Count   
> java.nio.HeapByteBuffer   3239
> sun.security.util.DerInputBuffer1233
> sun.security.util.DerInputBuffer  1233
> sun.security.util.ObjectIdentifier  620 
> sun.security.util.ObjectIdentifier620 [[B 397 
> [Ljava.lang.Object;   408 java.lang.reflect.Method
> 326 
> 
> The total size by class B is 3GB in 1.5.1 and only 60MB in 1.6.0.
> The Java.nio.HeapByteBuffer referencing class did not show up in top in 
> 1.5.1. 
> I have also placed jstack output for 1.5.1 and 1.6.0 online..you can get them 
> here
> https://ibm.box.com/sparkstreaming-jstack160
> https://ibm.box.com/sparkstreaming-jstack151
> Jesse 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16164) CombineFilters should keep the ordering in the logical plan

2016-06-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16164:
--
Summary: CombineFilters should keep the ordering in the logical plan  (was: 
Filter pushdown should keep the ordering in the logical plan)

> CombineFilters should keep the ordering in the logical plan
> ---
>
> Key: SPARK-16164
> URL: https://issues.apache.org/jira/browse/SPARK-16164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Dongjoon Hyun
> Fix For: 2.0.1, 2.1.0
>
>
> [~cmccubbin] reported a bug when he used StringIndexer in an ML pipeline with 
> additional filters. It seems that during filter pushdown, we changed the 
> ordering in the logical plan. I'm not sure whether we should treat this as a 
> bug.
> {code}
> val df1 = (0 until 3).map(_.toString).toDF
> val indexer = new StringIndexer()
>   .setInputCol("value")
>   .setOutputCol("idx")
>   .setHandleInvalid("skip")
>   .fit(df1)
> val df2 = (0 until 5).map(_.toString).toDF
> val predictions = indexer.transform(df2)
> predictions.show() // this is okay
> predictions.where('idx > 2).show() // this will throw an exception
> {code}
> Please see the notebook at 
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1233855/2159162931615821/588180/latest.html
>  for error messages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16176) model loading backward compatibility for ml.recommendation

2016-06-23 Thread yuhao yang (JIRA)
yuhao yang created SPARK-16176:
--

 Summary: model loading backward compatibility for ml.recommendation
 Key: SPARK-16176
 URL: https://issues.apache.org/jira/browse/SPARK-16176
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: yuhao yang
Priority: Minor


Check if current ALS can load the models saved by Apache Spark 1.6. If not, we 
need a fix.
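
A minimal sketch of the kind of check described above, assuming an ALS model directory that was written by Spark 1.6; the path and test DataFrame are hypothetical:

{code}
import org.apache.spark.ml.recommendation.ALSModel

// load a model saved by Spark 1.6 and verify it still produces predictions
val model = ALSModel.load("/path/to/als-model-saved-by-1.6")
model.transform(testRatings).show()   // testRatings: a DataFrame with the user/item columns
{code}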



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16164) Filter pushdown should keep the ordering in the logical plan

2016-06-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-16164.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 13872
[https://github.com/apache/spark/pull/13872]

> Filter pushdown should keep the ordering in the logical plan
> 
>
> Key: SPARK-16164
> URL: https://issues.apache.org/jira/browse/SPARK-16164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
> Fix For: 2.0.1, 2.1.0
>
>
> [~cmccubbin] reported a bug when he used StringIndexer in an ML pipeline with 
> additional filters. It seems that during filter pushdown, we changed the 
> ordering in the logical plan. I'm not sure whether we should treat this as a 
> bug.
> {code}
> val df1 = (0 until 3).map(_.toString).toDF
> val indexer = new StringIndexer()
>   .setInputCol("value")
>   .setOutputCol("idx")
>   .setHandleInvalid("skip")
>   .fit(df1)
> val df2 = (0 until 5).map(_.toString).toDF
> val predictions = indexer.transform(df2)
> predictions.show() // this is okay
> predictions.where('idx > 2).show() // this will throw an exception
> {code}
> Please see the notebook at 
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1233855/2159162931615821/588180/latest.html
>  for error messages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16164) Filter pushdown should keep the ordering in the logical plan

2016-06-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16164:
--
Assignee: Dongjoon Hyun

> Filter pushdown should keep the ordering in the logical plan
> 
>
> Key: SPARK-16164
> URL: https://issues.apache.org/jira/browse/SPARK-16164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Dongjoon Hyun
> Fix For: 2.0.1, 2.1.0
>
>
> [~cmccubbin] reported a bug when he used StringIndexer in an ML pipeline with 
> additional filters. It seems that during filter pushdown, we changed the 
> ordering in the logical plan. I'm not sure whether we should treat this as a 
> bug.
> {code}
> val df1 = (0 until 3).map(_.toString).toDF
> val indexer = new StringIndexer()
>   .setInputCol("value")
>   .setOutputCol("idx")
>   .setHandleInvalid("skip")
>   .fit(df1)
> val df2 = (0 until 5).map(_.toString).toDF
> val predictions = indexer.transform(df2)
> predictions.show() // this is okay
> predictions.where('idx > 2).show() // this will throw an exception
> {code}
> Please see the notebook at 
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1233855/2159162931615821/588180/latest.html
>  for error messages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11744) bin/pyspark --version doesn't return version and exit

2016-06-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347297#comment-15347297
 ] 

Nicholas Chammas commented on SPARK-11744:
--

This is not the appropriate place to ask random PySpark questions. Please post 
a question on Stack Overflow or on the Spark user list.

> bin/pyspark --version doesn't return version and exit
> -
>
> Key: SPARK-11744
> URL: https://issues.apache.org/jira/browse/SPARK-11744
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
>Reporter: Nicholas Chammas
>Assignee: Saisai Shao
>Priority: Minor
> Fix For: 1.6.0
>
>
> {{bin/pyspark \-\-help}} offers a {{\-\-version}} option:
> {code}
> $ ./spark/bin/pyspark --help
> Usage: ./bin/pyspark [options]
> Options:
> ...
>   --version,  Print the version of current Spark
> ...
> {code}
> However, trying to get the version in this way doesn't yield the expected 
> results.
> Instead of printing the version and exiting, we get the version, a stack 
> trace, and then get dropped into a broken PySpark shell.
> {code}
> $ ./spark/bin/pyspark --version
> Python 2.7.10 (default, Aug 11 2015, 23:39:10) 
> [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.5.2
>   /_/
> 
> Type --help for more information.
> Traceback (most recent call last):
>   File "/home/ec2-user/spark/python/pyspark/shell.py", line 43, in 
> sc = SparkContext(pyFiles=add_files)
>   File "/home/ec2-user/spark/python/pyspark/context.py", line 110, in __init__
> SparkContext._ensure_initialized(self, gateway=gateway)
>   File "/home/ec2-user/spark/python/pyspark/context.py", line 234, in 
> _ensure_initialized
> SparkContext._gateway = gateway or launch_gateway()
>   File "/home/ec2-user/spark/python/pyspark/java_gateway.py", line 94, in 
> launch_gateway
> raise Exception("Java gateway process exited before sending the driver 
> its port number")
> Exception: Java gateway process exited before sending the driver its port 
> number
> >>> 
> >>> sc
> Traceback (most recent call last):
>   File "", line 1, in 
> NameError: name 'sc' is not defined
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16175) Handle None for all Python UDT

2016-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16175:


Assignee: (was: Apache Spark)

> Handle None for all Python UDT
> --
>
> Key: SPARK-16175
> URL: https://issues.apache.org/jira/browse/SPARK-16175
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Davies Liu
> Attachments: nullvector.dbc
>
>
> For Scala UDTs, we do not call serialize()/deserialize() for null values; we 
> should do the same in Python. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16175) Handle None for all Python UDT

2016-06-23 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-16175:
---
Affects Version/s: 2.0.0
   1.6.1

> Handle None for all Python UDT
> --
>
> Key: SPARK-16175
> URL: https://issues.apache.org/jira/browse/SPARK-16175
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Davies Liu
> Attachments: nullvector.dbc
>
>
> For Scala UDTs, we do not call serialize()/deserialize() for null values; we 
> should do the same in Python. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16175) Handle None for all Python UDT

2016-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16175:


Assignee: Apache Spark

> Handle None for all Python UDT
> --
>
> Key: SPARK-16175
> URL: https://issues.apache.org/jira/browse/SPARK-16175
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Davies Liu
>Assignee: Apache Spark
> Attachments: nullvector.dbc
>
>
> For Scala UDTs, we do not call serialize()/deserialize() for null values; we 
> should do the same in Python. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16175) Handle None for all Python UDT

2016-06-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347287#comment-15347287
 ] 

Apache Spark commented on SPARK-16175:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/13878

> Handle None for all Python UDT
> --
>
> Key: SPARK-16175
> URL: https://issues.apache.org/jira/browse/SPARK-16175
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Davies Liu
> Attachments: nullvector.dbc
>
>
> For Scala UDTs, we do not call serialize()/deserialize() for null values; we 
> should do the same in Python. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16173) Can't join describe() of DataFrame in Scala 2.10

2016-06-23 Thread Bo Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347255#comment-15347255
 ] 

Bo Meng edited comment on SPARK-16173 at 6/23/16 9:58 PM:
--

Use the latest master, I was not able to reproduce, here is my code:
{quote}
val a = Seq(("Alice", 1)).toDF("name", "age").describe()
val b = Seq(("Bob", 2)).toDF("name", "grade").describe()

a.show()
b.show()

a.join(b, Seq("summary")).show()
{quote}
Anything I am missing? I am using Scala 2.11, does it only happen to Scala 
2.10? 


was (Author: bomeng):
Use the latest master, I was not able to reproduce, here is my code:
{quote}
val a = Seq(("Alice", 1)).toDF("name", "age").describe()
val b = Seq(("Bob", 2)).toDF("name", "grade").describe()

a.show()
b.show()

a.join(b, Seq("summary")).show()
{quote}
Anything I am missing?

> Can't join describe() of DataFrame in Scala 2.10
> 
>
> Key: SPARK-16173
> URL: https://issues.apache.org/jira/browse/SPARK-16173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Davies Liu
>
> describe() of DataFrame uses Seq() (it's actually an Iterator) to create 
> another DataFrame, which cannot be serialized in Scala 2.10.
> {code}
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2060)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:707)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:706)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:706)
>   at 
> org.apache.spark.sql.execution.ConvertToUnsafe.doExecute(rowFormatConverters.scala:38)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:82)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:79)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.NotSerializableException: 
> scala.collection.Iterator$$anon$11
> Serialization stack:
>   - object not serializable (class: scala.collection.Iterator$$anon$11, 
> value: empty iterator)
>   - field (class: scala.collection.Iterator$$anonfun$toStream$1, name: 
> $outer, type: interface scala.collection.Iterator)
>   - object (class scala.collection.Iterator$$anonfun$toStream$1, 
> )
>   - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
> interface scala.Function0)
>   - object (class scala.collection.immutable.Stream$Cons, 
> Stream(WrappedArray(1), WrappedArray(2.0), WrappedArray(NaN), 
> WrappedArray(2), WrappedArray(2)))
>   - field (class: scala.collection.immutable.Stream$$anonfun$zip$1, name: 
> $outer, type: class scala.collection.immutable.Stream)
>   - object (class scala.collection.immutable.Stream$$anonfun$zip$1, 
> )
>   - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
> interface scala.Function0)
>   - object (class 

[jira] [Updated] (SPARK-16004) Improve CatalogTable information

2016-06-23 Thread Bo Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Meng updated SPARK-16004:

Description: 
A few issues found when running "describe extended | formatted [tableName]" 
command:
1. The last access time is incorrectly displayed as something like "Last Access 
Time:   |Wed Dec 31 15:59:59 PST 1969"; I think we should display it as 
"UNKNOWN" as Hive does;
2. The comment field displays "null" instead of an empty string when the comment is None;

I will make a PR shortly.

  was:
A few issues found when running "describe extended | formatted [tableName]" 
command:
1. The last access time is incorrectly displayed something like "Last Access 
Time:   |Wed Dec 31 15:59:59 PST 1969", I think we should display as 
"UNKNOWN" as Hive does;
2. Owner is always empty, instead of the current login user, who creates the 
table;
3. Comments fields display "null" instead of empty string when commend is None;

I will make a PR shortly.


> Improve CatalogTable information
> 
>
> Key: SPARK-16004
> URL: https://issues.apache.org/jira/browse/SPARK-16004
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Bo Meng
>
> A few issues found when running "describe extended | formatted [tableName]" 
> command:
> 1. The last access time is incorrectly displayed as something like "Last Access 
> Time:   |Wed Dec 31 15:59:59 PST 1969"; I think we should display it as 
> "UNKNOWN" as Hive does;
> 2. The comment field displays "null" instead of an empty string when the 
> comment is None;
> I will make a PR shortly.
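
For reference, a sketch of the command whose output is being discussed, assuming a spark-shell session where {{spark}} is the SparkSession and a hypothetical table name:

{code}
// shows the Last Access Time and Comment fields among the formatted table metadata
spark.sql("DESCRIBE FORMATTED my_table").show(100, truncate = false)
{code}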



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15606) Driver hang in o.a.s.DistributedSuite on 2 core machine

2016-06-23 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-15606:
-
Fix Version/s: (was: 1.6.2)
   1.6.3

> Driver hang in o.a.s.DistributedSuite on 2 core machine
> ---
>
> Key: SPARK-15606
> URL: https://issues.apache.org/jira/browse/SPARK-15606
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
> Environment: AMD64 box with only 2 cores
>Reporter: Pete Robbins
>Assignee: Pete Robbins
> Fix For: 1.6.3, 2.0.0
>
>
> repeatedly failing task that crashes JVM *** FAILED ***
>   The code passed to failAfter did not complete within 10 milliseconds. 
> (DistributedSuite.scala:128)
> This test started failing and DistributedSuite began hanging following 
> https://github.com/apache/spark/pull/13055
> It looks like the extra message to remove the BlockManager deadlocks because 
> there are only 2 message-processing loop threads. Related to 
> https://issues.apache.org/jira/browse/SPARK-13906
> {code}
>   /** Thread pool used for dispatching messages. */
>   private val threadpool: ThreadPoolExecutor = {
> val numThreads = 
> nettyEnv.conf.getInt("spark.rpc.netty.dispatcher.numThreads",
>   math.max(2, Runtime.getRuntime.availableProcessors()))
> val pool = ThreadUtils.newDaemonFixedThreadPool(numThreads, 
> "dispatcher-event-loop")
> for (i <- 0 until numThreads) {
>   pool.execute(new MessageLoop)
> }
> pool
>   }
> {code} 
> Setting a minimum of 3 threads alleviates this issue but I'm not sure there 
> isn't another underlying problem.
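
For reference, a sketch of one way to apply the alleviation described above without patching Spark: force at least three dispatcher threads through the config key that the quoted snippet reads. The value 3 matches the minimum mentioned above; whether this is sufficient is exactly the open question.

{code}
import org.apache.spark.SparkConf

// hypothetical workaround: raise the dispatcher thread count via configuration
val conf = new SparkConf().set("spark.rpc.netty.dispatcher.numThreads", "3")
{code}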



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13288) [1.6.0] Memory leak in Spark streaming

2016-06-23 Thread roberto hashioka (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347265#comment-15347265
 ] 

roberto hashioka commented on SPARK-13288:
--

I'm using the createDirectStream. 

> [1.6.0] Memory leak in Spark streaming
> --
>
> Key: SPARK-13288
> URL: https://issues.apache.org/jira/browse/SPARK-13288
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: Bare metal cluster
> RHEL 6.6
>Reporter: JESSE CHEN
>  Labels: streaming
>
> Streaming in 1.6 seems to have a memory leak.
> Running the same streaming app in Spark 1.5.1 and 1.6, all things equal, 1.6 
> showed a gradual increasing processing time. 
> The app is simple: 1 Kafka receiver of tweet stream and 20 executors 
> processing the tweets in 5-second batches. 
> Spark 1.5.0 handles this smoothly and did not show increasing processing time 
> in the 40-minute test; but 1.6 showed increasing time about 8 minutes into 
> the test. Please see chart here:
> https://ibm.box.com/s/7q4ulik70iwtvyfhoj1dcl4nc469b116
> I captured heap dumps in two version and did a comparison. I noticed the Byte 
> is using 50X more space in 1.5.1.
> Here are some top classes in heap histogram and references. 
> Heap Histogram
>   
> All Classes (excluding platform)  
>   1.6.0 Streaming 1.5.1 Streaming 
> Class Instance Count  Total Size  Class   Instance Count  Total 
> Size
> class [B  84533,227,649,599   class [B5095
> 62,938,466
> class [C  44682   4,255,502   class [C130482  
> 12,844,182
> class java.lang.reflect.Method90591,177,670   class 
> java.lang.String  130171  1,562,052
>   
>   
> References by TypeReferences by Type  
>   
> class [B [0x640039e38]class [B [0x6c020bb08]  
> 
>   
> Referrers by Type Referrers by Type   
>   
> Class Count   Class   Count   
> java.nio.HeapByteBuffer   3239
> sun.security.util.DerInputBuffer1233
> sun.security.util.DerInputBuffer  1233
> sun.security.util.ObjectIdentifier  620 
> sun.security.util.ObjectIdentifier620 [[B 397 
> [Ljava.lang.Object;   408 java.lang.reflect.Method
> 326 
> 
> The total size by class B is 3GB in 1.5.1 and only 60MB in 1.6.0.
> The Java.nio.HeapByteBuffer referencing class did not show up in top in 
> 1.5.1. 
> I have also placed jstack output for 1.5.1 and 1.6.0 online..you can get them 
> here
> https://ibm.box.com/sparkstreaming-jstack160
> https://ibm.box.com/sparkstreaming-jstack151
> Jesse 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16173) Can't join describe() of DataFrame in Scala 2.10

2016-06-23 Thread Bo Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347255#comment-15347255
 ] 

Bo Meng edited comment on SPARK-16173 at 6/23/16 9:38 PM:
--

Use the latest master, I was not able to reproduce, here is my code:
{quote}
val a = Seq(("Alice", 1)).toDF("name", "age").describe()
val b = Seq(("Bob", 2)).toDF("name", "grade").describe()

a.show()
b.show()

a.join(b, Seq("summary")).show()
{quote}
Anything I am missing?


was (Author: bomeng):
Use the latest master, I was not able to reproduce, here is my code:
{{
val a = Seq(("Alice", 1)).toDF("name", "age").describe()
val b = Seq(("Bob", 2)).toDF("name", "grade").describe()

a.show()
b.show()

a.join(b, Seq("summary")).show()
}}
Anything I am missing?

> Can't join describe() of DataFrame in Scala 2.10
> 
>
> Key: SPARK-16173
> URL: https://issues.apache.org/jira/browse/SPARK-16173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Davies Liu
>
> describe() of DataFrame uses Seq() (it's actually an Iterator) to create 
> another DataFrame, which cannot be serialized in Scala 2.10.
> {code}
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2060)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:707)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:706)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:706)
>   at 
> org.apache.spark.sql.execution.ConvertToUnsafe.doExecute(rowFormatConverters.scala:38)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:82)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:79)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.NotSerializableException: 
> scala.collection.Iterator$$anon$11
> Serialization stack:
>   - object not serializable (class: scala.collection.Iterator$$anon$11, 
> value: empty iterator)
>   - field (class: scala.collection.Iterator$$anonfun$toStream$1, name: 
> $outer, type: interface scala.collection.Iterator)
>   - object (class scala.collection.Iterator$$anonfun$toStream$1, 
> )
>   - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
> interface scala.Function0)
>   - object (class scala.collection.immutable.Stream$Cons, 
> Stream(WrappedArray(1), WrappedArray(2.0), WrappedArray(NaN), 
> WrappedArray(2), WrappedArray(2)))
>   - field (class: scala.collection.immutable.Stream$$anonfun$zip$1, name: 
> $outer, type: class scala.collection.immutable.Stream)
>   - object (class scala.collection.immutable.Stream$$anonfun$zip$1, 
> )
>   - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
> interface scala.Function0)
>   - object (class scala.collection.immutable.Stream$Cons, 
> Stream((WrappedArray(1),(count,)), 

[jira] [Comment Edited] (SPARK-16173) Can't join describe() of DataFrame in Scala 2.10

2016-06-23 Thread Bo Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347255#comment-15347255
 ] 

Bo Meng edited comment on SPARK-16173 at 6/23/16 9:37 PM:
--

Using the latest master, I was not able to reproduce; here is my code:
{{
val a = Seq(("Alice", 1)).toDF("name", "age").describe()
val b = Seq(("Bob", 2)).toDF("name", "grade").describe()

a.show()
b.show()

a.join(b, Seq("summary")).show()
}}
Anything I am missing?


was (Author: bomeng):
Using the latest master, I was not able to reproduce; here is my code:

val a = Seq(("Alice", 1)).toDF("name", "age").describe()
val b = Seq(("Bob", 2)).toDF("name", "grade").describe()

a.show()
b.show()

a.join(b, Seq("summary")).show()

Anything I am missing?

> Can't join describe() of DataFrame in Scala 2.10
> 
>
> Key: SPARK-16173
> URL: https://issues.apache.org/jira/browse/SPARK-16173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Davies Liu
>
> describe() of DataFrame uses Seq() (it's an Iterator actually) to create 
> another DataFrame, which cannot be serialized in Scala 2.10.
> {code}
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2060)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:707)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:706)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:706)
>   at 
> org.apache.spark.sql.execution.ConvertToUnsafe.doExecute(rowFormatConverters.scala:38)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:82)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:79)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.NotSerializableException: 
> scala.collection.Iterator$$anon$11
> Serialization stack:
>   - object not serializable (class: scala.collection.Iterator$$anon$11, 
> value: empty iterator)
>   - field (class: scala.collection.Iterator$$anonfun$toStream$1, name: 
> $outer, type: interface scala.collection.Iterator)
>   - object (class scala.collection.Iterator$$anonfun$toStream$1, 
> )
>   - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
> interface scala.Function0)
>   - object (class scala.collection.immutable.Stream$Cons, 
> Stream(WrappedArray(1), WrappedArray(2.0), WrappedArray(NaN), 
> WrappedArray(2), WrappedArray(2)))
>   - field (class: scala.collection.immutable.Stream$$anonfun$zip$1, name: 
> $outer, type: class scala.collection.immutable.Stream)
>   - object (class scala.collection.immutable.Stream$$anonfun$zip$1, 
> )
>   - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
> interface scala.Function0)
>   - object (class scala.collection.immutable.Stream$Cons, 
> Stream((WrappedArray(1),(count,)), 
> 

[jira] [Commented] (SPARK-16173) Can't join describe() of DataFrame in Scala 2.10

2016-06-23 Thread Bo Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347255#comment-15347255
 ] 

Bo Meng commented on SPARK-16173:
-

Using the latest master, I was not able to reproduce; here is my code:

val a = Seq(("Alice", 1)).toDF("name", "age").describe()
val b = Seq(("Bob", 2)).toDF("name", "grade").describe()

a.show()
b.show()

a.join(b, Seq("summary")).show()

Anything I am missing?
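
For reference, the reporter's stack trace points at a lazily evaluated Stream whose unevaluated tail captures a scala.collection.Iterator, and that closure is what fails to serialize under Scala 2.10. As a hedged illustration only (plain Scala, not Spark internals; the names below are made up), the difference between passing such a Stream around and materializing it first looks like this:

{code}
// Zipping an Iterator-backed Stream keeps a reference to the Iterator in the
// unevaluated tail, which is the non-serializable object named in the trace.
val lazyRows = Iterator("count", "mean", "stddev", "min", "max")
  .toStream.zip(Stream("1", "2.0", "NaN", "2", "2"))

// Forcing the collection drops the Iterator reference, so the resulting rows
// can safely be captured in a closure; this mirrors the shape of a possible fix.
val strictRows = lazyRows.toList
{code}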

> Can't join describe() of DataFrame in Scala 2.10
> 
>
> Key: SPARK-16173
> URL: https://issues.apache.org/jira/browse/SPARK-16173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Davies Liu
>
> describe() of DataFrame uses Seq() (it's an Iterator actually) to create 
> another DataFrame, which cannot be serialized in Scala 2.10.
> {code}
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2060)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:707)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:706)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:706)
>   at 
> org.apache.spark.sql.execution.ConvertToUnsafe.doExecute(rowFormatConverters.scala:38)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:82)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:79)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.NotSerializableException: 
> scala.collection.Iterator$$anon$11
> Serialization stack:
>   - object not serializable (class: scala.collection.Iterator$$anon$11, 
> value: empty iterator)
>   - field (class: scala.collection.Iterator$$anonfun$toStream$1, name: 
> $outer, type: interface scala.collection.Iterator)
>   - object (class scala.collection.Iterator$$anonfun$toStream$1, 
> )
>   - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
> interface scala.Function0)
>   - object (class scala.collection.immutable.Stream$Cons, 
> Stream(WrappedArray(1), WrappedArray(2.0), WrappedArray(NaN), 
> WrappedArray(2), WrappedArray(2)))
>   - field (class: scala.collection.immutable.Stream$$anonfun$zip$1, name: 
> $outer, type: class scala.collection.immutable.Stream)
>   - object (class scala.collection.immutable.Stream$$anonfun$zip$1, 
> )
>   - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
> interface scala.Function0)
>   - object (class scala.collection.immutable.Stream$Cons, 
> Stream((WrappedArray(1),(count,)), 
> (WrappedArray(2.0),(mean,)), 
> (WrappedArray(NaN),(stddev,)), 
> (WrappedArray(2),(min,)), (WrappedArray(2),(max,
>   - field (class: scala.collection.immutable.Stream$$anonfun$map$1, name: 
> $outer, type: class scala.collection.immutable.Stream)
>   - object (class scala.collection.immutable.Stream$$anonfun$map$1, 
> )
>   - field (class: 

[jira] [Commented] (SPARK-15565) The default value of spark.sql.warehouse.dir needs to explicitly point to local filesystem

2016-06-23 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347247#comment-15347247
 ] 

Bikas Saha commented on SPARK-15565:


SPARK-15565 is pulled into Erie Spark2 by the last refresh. So I am resolving 
the jira.

> The default value of spark.sql.warehouse.dir needs to explicitly point to 
> local filesystem
> --
>
> Key: SPARK-15565
> URL: https://issues.apache.org/jira/browse/SPARK-15565
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> The default value of {{spark.sql.warehouse.dir}} is  
> {{System.getProperty("user.dir")/warehouse}}. Since 
> {{System.getProperty("user.dir")}} is a local dir, we should explicitly set 
> the scheme to local filesystem.
> This should be a one line change  (at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58).
> Also see 
> https://issues.apache.org/jira/browse/SPARK-15034?focusedCommentId=15301508=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15301508
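
As a minimal, hedged sketch of the user-side workaround until the default carries a scheme (the path below is illustrative, not a Spark default):

{code}
// Pin the warehouse directory to the local filesystem with an explicit file: URI.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("warehouse-dir-example")
  .config("spark.sql.warehouse.dir", "file:///tmp/spark-warehouse")
  .getOrCreate()
{code}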



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-15565) The default value of spark.sql.warehouse.dir needs to explicitly point to local filesystem

2016-06-23 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated SPARK-15565:
---
Comment: was deleted

(was: SPARK-15565 is pulled into Erie Spark2 by the last refresh. So I am 
resolving the jira.)

> The default value of spark.sql.warehouse.dir needs to explicitly point to 
> local filesystem
> --
>
> Key: SPARK-15565
> URL: https://issues.apache.org/jira/browse/SPARK-15565
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> The default value of {{spark.sql.warehouse.dir}} is  
> {{System.getProperty("user.dir")/warehouse}}. Since 
> {{System.getProperty("user.dir")}} is a local dir, we should explicitly set 
> the scheme to local filesystem.
> This should be a one line change  (at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58).
> Also see 
> https://issues.apache.org/jira/browse/SPARK-15034?focusedCommentId=15301508=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15301508



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15565) The default value of spark.sql.warehouse.dir needs to explicitly point to local filesystem

2016-06-23 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347247#comment-15347247
 ] 

Bikas Saha edited comment on SPARK-15565 at 6/23/16 9:34 PM:
-

SPARK-15565 is pulled into Erie Spark2 by the last refresh. So I am resolving 
the jira.


was (Author: bikassaha):
SPARK-15565 is pulled into Erie Spark2 by the last refresh. So I am resolving 
the jira.

> The default value of spark.sql.warehouse.dir needs to explicitly point to 
> local filesystem
> --
>
> Key: SPARK-15565
> URL: https://issues.apache.org/jira/browse/SPARK-15565
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> The default value of {{spark.sql.warehouse.dir}} is  
> {{System.getProperty("user.dir")/warehouse}}. Since 
> {{System.getProperty("user.dir")}} is a local dir, we should explicitly set 
> the scheme to local filesystem.
> This should be a one line change  (at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58).
> Also see 
> https://issues.apache.org/jira/browse/SPARK-15034?focusedCommentId=15301508=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15301508



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16175) Handle None for all Python UDT

2016-06-23 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg updated SPARK-16175:
--
Attachment: nullvector.dbc

Databricks notebook demonstrating the issue

> Handle None for all Python UDT
> --
>
> Key: SPARK-16175
> URL: https://issues.apache.org/jira/browse/SPARK-16175
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Davies Liu
> Attachments: nullvector.dbc
>
>
> For Scala UDTs, we do not call serialize()/deserialize() for null values; we 
> should do the same in Python. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16175) Handle None for all Python UDT

2016-06-23 Thread Davies Liu (JIRA)
Davies Liu created SPARK-16175:
--

 Summary: Handle None for all Python UDT
 Key: SPARK-16175
 URL: https://issues.apache.org/jira/browse/SPARK-16175
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Reporter: Davies Liu


For Scala UDTs, we do not call serialize()/deserialize() for null values; we 
should do the same in Python. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13697) TransformFunctionSerializer.loads doesn't restore the function's module name if it's '__main__'

2016-06-23 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-13697:
-
Fix Version/s: (was: 1.6.1)
   1.6.2

> TransformFunctionSerializer.loads doesn't restore the function's module name 
> if it's '__main__'
> ---
>
> Key: SPARK-13697
> URL: https://issues.apache.org/jira/browse/SPARK-13697
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 1.4.2, 1.5.3, 1.6.2, 2.0.0
>
>
> Here is a reproducer
> {code}
> >>> from pyspark.streaming import StreamingContext
> >>> from pyspark.streaming.util import TransformFunction
> >>> ssc = StreamingContext(sc, 1)
> >>> func = TransformFunction(sc, lambda x: x, sc.serializer)
> >>> func.rdd_wrapper(lambda x: x)
> TransformFunction(<function <lambda> at 0x106ac8b18>)
> >>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, 
> >>> func.rdd_wrap_func, func.deserializers))) 
> >>> func2 = ssc._transformerSerializer.loads(bytes)
> >>> print(func2.func.__module__)
> None
> >>> print(func2.rdd_wrap_func.__module__)
> None
> >>> 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16142) Group naive Bayes methods in generated doc

2016-06-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347205#comment-15347205
 ] 

Apache Spark commented on SPARK-16142:
--

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/13877

> Group naive Bayes methods in generated doc
> --
>
> Key: SPARK-16142
> URL: https://issues.apache.org/jira/browse/SPARK-16142
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>  Labels: starter
>
> Follow SPARK-16107 and group the doc of spark.naiveBayes: spark.naiveBayes, 
> predict(NB), summary(NB), read/write.ml(NB) under Rd spark.naiveBayes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13601) Invoke task failure callbacks before calling outputstream.close()

2016-06-23 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-13601:
-
Fix Version/s: 1.6.2

> Invoke task failure callbacks before calling outputstream.close()
> -
>
> Key: SPARK-13601
> URL: https://issues.apache.org/jira/browse/SPARK-13601
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.6.2, 2.0.0
>
>
> We need to submit another PR against Spark to call the task failure callbacks 
> before Spark calls the close function on various output streams.
> For example, we need to intercept an exception and call 
> TaskContext.markTaskFailed before calling close in the following code (in 
> PairRDDFunctions.scala):
> {code}
>   Utils.tryWithSafeFinally {
> while (iter.hasNext) {
>   val record = iter.next()
>   writer.write(record._1.asInstanceOf[AnyRef], 
> record._2.asInstanceOf[AnyRef])
>   // Update bytes written metric every few records
>   maybeUpdateOutputMetrics(outputMetricsAndBytesWrittenCallback, 
> recordsWritten)
>   recordsWritten += 1
> }
>   } {
> writer.close()
>   }
> {code}
> Changes to Spark should include unit tests to make sure this always works in 
> the future.
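
As a hedged sketch only of the intended shape of the change (not the actual Spark patch; {{iter}} and {{writer}} are the values from the snippet above, and {{markTaskFailed}} is the callback name taken from the description, whose exact signature in Spark may differ):

{code}
import org.apache.spark.TaskContext

try {
  while (iter.hasNext) {
    val record = iter.next()
    writer.write(record._1.asInstanceOf[AnyRef], record._2.asInstanceOf[AnyRef])
  }
} catch {
  case t: Throwable =>
    // Notify task failure callbacks before the writer is closed below.
    TaskContext.get().markTaskFailed(t)
    throw t
} finally {
  writer.close()
}
{code}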



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13465) Add a task failure listener to TaskContext

2016-06-23 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-13465:
-
Fix Version/s: 1.6.2

> Add a task failure listener to TaskContext
> --
>
> Key: SPARK-13465
> URL: https://issues.apache.org/jira/browse/SPARK-13465
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.6.2, 2.0.0
>
>
> TaskContext supports task completion callback, which gets called regardless 
> of task failures. However, there is no way for the listener to know if there 
> is an error. This ticket proposes adding a new listener that gets called when 
> a task fails.
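
A minimal, hedged usage sketch of the new listener (assuming an existing RDD named {{rdd}}; the logging is purely illustrative):

{code}
import org.apache.spark.TaskContext

rdd.foreachPartition { _ =>
  // Runs only when the task fails, unlike a completion listener.
  TaskContext.get().addTaskFailureListener { (ctx, error) =>
    println(s"Task ${ctx.taskAttemptId()} failed: ${error.getMessage}")
  }
}
{code}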



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16032) Audit semantics of various insertion operations related to partitioned tables

2016-06-23 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347172#comment-15347172
 ] 

Ryan Blue commented on SPARK-16032:
---

bq. I am not sure apply by-name resolution just to partition columns is a good 
idea.

I'm not sure about this. After looking into it more, I agree in principle and 
that in the long term we don't want to mix by-position column matching with 
by-name partitioning. But I'm less certain about whether or not it's a good 
idea right now. As I look at it more, I agree with you guys more about what is 
"right". But, I'm still concerned about how to move forward from where we're 
at, given the way people are currently using the API.

I think we've already established that it isn't clear that the DataFrameWriter 
API relies on position. I actually think that most people aren't thinking about 
the choice between by-position or by-name resolution and are using what they 
get working. My first use of the API was to build a partitioned table from an 
unpartitioned table, which failed. When I went looking for a solution, 
{{partitionBy}} was the obvious choice (suggested by my IDE) and, sure enough, 
it fixed the problem by [moving the partition columns by 
name|https://github.com/apache/spark/blob/v1.6.1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L180]
 to the end. This solution is common because it works and is more obvious than 
thinking about column order; as I noted above, it isn't clear that 
{{insertInto}} uses position.

The pattern of using {{partitionBy}} with {{insertInto}} has also become a best 
practice for maintaining ETL jobs in Spark. Consider this table setup, where 
data lands in {{src}} in batches and we move it to {{dest}} for long-term 
storage in Parquet. Here's some example DDL:

{code:lang=sql}
CREATE TABLE src (id string, timestamp bigint, other_properties map<string,string>);
CREATE TABLE dest (id string, timestamp bigint, c1 string, c2 int)
  PARTITIONED BY (utc_dateint int, utc_hour int);
{code}

The Spark code for this ETL job should be this:

{code:lang=java}
spark.table("src")
  .withColumn("c1", $"other_properties".getItem("c1"))
  .withColumn("c2", $"other_properties".getItem("c2").cast(IntegerType))
  .withColumn("date", dateint($"timetamp"))
  .withColumn("hour", hour($"timestamp"))
  .dropColumn("other_properties")
  .write.insertInto("dest")
{code}

But, users are likely to try this next version instead. That's because it isn't 
obvious that partition columns go after data columns; they are two separate 
lists in the DDL.

{code:lang=java}
spark.table("src")
  .withColumn("date", dateint($"timetamp"))
  .withColumn("hour", hour($"timestamp"))
  .withColumn("c1", $"other_properties".getItem("c1"))
  .withColumn("c2", $"other_properties".getItem("c2").cast(IntegerType))
  .drop("other_properties")
  .write.insertInto("dest")
{code}

And again, the most obvious fix is to add {{partitionBy}} to specify the 
partition columns, which appears to users as a match for Hive's 
{{PARTITION("date", "hour")}} syntax. Users then get the impression that 
{{partitionBy}} is equivalent to {{PARTITION}}, though in reality Hive operates 
by position.

{code:lang=java}
spark.table("src")
  .withColumn("date", dateint($"timetamp"))
  .withColumn("hour", hour($"timestamp"))
  .withColumn("c1", $"other_properties".getItem("c1"))
  .withColumn("c2", $"other_properties".getItem("c2").cast(IntegerType))
  .dropColumn("other_properties")
  .write.partitionBy("date", "hour").insertInto("dest")Another case 
{code}

Another reason to use {{partitionBy}} is for maintaining ETL over time. When 
structure changes, so does column order. Say I want to add a dedup step so I 
get just one row per ID per day. My first attempt, based on getting the column 
order right to begin with, looks like this:

{code:lang=java}
// column orders change, causing the query to break
spark.table("src")
  .withColumn("date", dateint($"timetamp")) // moved to before dropDuplicates
  .dropDuplicates($"date", $"id")   // added to dedup records
  .withColumn("c1", $"other_properties".getItem("c1"))
  .withColumn("c2", $"other_properties".getItem("c2").cast(IntegerType))
  .withColumn("hour", hour($"timestamp"))
  .dropColumn("other_properties")
  .write.insertInto("dest")
{code}

The result is that I get crazy partitioning because c2 and hour are used. The 
most obvious symptom of the wrong column order is partitioning and when I look 
into it, I find {{partitionBy}} fixes it. In many cases, that's the first 
method I'll try because I see bad partition values. This solution doesn't 
always fix the query, but it does solve the partitioning problem I observed. 
(Also: other order problems are hidden by inserting {{Cast}} instead of the 
safer {{UpCast}}.)

Users will also _choose_ this over the right solution, which is to add 
{{select}}:

{code:lang=java}
// 
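// [Hedged sketch only -- the original comment is truncated at this point, so the
// code below is not Ryan Blue's; it is one plausible reading of the select-based
// fix referred to above: compute the columns in any order, then select them in
// the dest order (id, timestamp, c1, c2, date, hour) before insertInto.
// dateint() is the commenter's helper from the earlier snippets.]
spark.table("src")
  .withColumn("date", dateint($"timestamp"))
  .dropDuplicates("date", "id")
  .withColumn("c1", $"other_properties".getItem("c1"))
  .withColumn("c2", $"other_properties".getItem("c2").cast(IntegerType))
  .withColumn("hour", hour($"timestamp"))
  .select("id", "timestamp", "c1", "c2", "date", "hour")
  .write.insertInto("dest")
{code}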

[jira] [Commented] (SPARK-9478) Add class weights to Random Forest

2016-06-23 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347138#comment-15347138
 ] 

Seth Hendrickson commented on SPARK-9478:
-

[~mengxr] Thanks for your feedback. Originally I did not implement a change to 
the sampling semantics, though after some thought it does not seem entirely 
correct to only apply the sampling weights after bagging. I checked 
scikit-learn and they do not use weighted sampling (instead applying weights 
after taking uniform samples), but I think we should implement the weighted 
sampling assuming it can fit into the current Spark abstractions.

From my understanding, it is reasonable to use the Poisson distribution as an 
approximation to the Multinomial sampling. Currently, we approximate binomial 
sampling using a Poisson sampler with constant mean. To implement weighted 
sampling with replacement, we can use a Poisson sampler with mean parameter 
proportional to the sample weight - is that correct? We could use the 
{{RandomDataGenerator}} class in StratifiedSamplingUtils, which maintains a 
cache of Poisson sampling functions. I am not an expert in sampling algorithms 
so I really appreciate your thoughts on this. 
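
To make the proposal concrete, here is a minimal hedged sketch of weight-proportional Poisson bagging (standalone Scala against commons-math3, not the proposed Spark patch; the subsample rate and per-instance seeding scheme are assumptions):

{code}
import org.apache.commons.math3.distribution.PoissonDistribution

// Expected multiplicity of each instance in a bagged sample is proportional to
// its weight: count_i ~ Poisson(subsampleRate * weight_i).
def baggedCounts(weights: Array[Double], subsampleRate: Double, seed: Long): Array[Int] = {
  weights.zipWithIndex.map { case (w, i) =>
    if (w <= 0.0) 0
    else {
      val poisson = new PoissonDistribution(subsampleRate * w)
      poisson.reseedRandomGenerator(seed + i) // deterministic per instance
      poisson.sample()
    }
  }
}

// Example: three instances with weights 1.0, 3.0 and 0.5 at a subsample rate of 1.0.
val counts = baggedCounts(Array(1.0, 3.0, 0.5), subsampleRate = 1.0, seed = 42L)
{code}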

> Add class weights to Random Forest
> --
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.4.1
>Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support class 
> weights. Class weights are important when there is imbalanced training data 
> or the evaluation metric of a classifier is imbalanced (e.g. true positive 
> rate at some false positive threshold). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16163) Statistics of logical plan is super slow on large query

2016-06-23 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347136#comment-15347136
 ] 

Davies Liu commented on SPARK-16163:


[~srowen] yes, thanks for correcting it.

> Statistics of logical plan is super slow on large query
> ---
>
> Key: SPARK-16163
> URL: https://issues.apache.org/jira/browse/SPARK-16163
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Davies Liu
> Fix For: 2.0.1
>
>
> It took several minutes to plan TPC-DS Q64, because canBroadcast() is 
> super slow on a large plan.
> Right now we are considering the schema in statistics(), so it's not trivial 
> anymore; we should cache the result (using a lazy val rather than a def).
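
The proposed change is essentially memoization per plan node; a tiny hedged illustration of the difference (toy classes, not the actual LogicalPlan code):

{code}
case class Stats(sizeInBytes: BigInt)

// With `def`, every call re-walks the whole subtree, and the cost compounds
// when parents repeatedly ask their children for statistics.
class NodeWithDef(children: Seq[NodeWithDef]) {
  def statistics: Stats = Stats(children.map(_.statistics.sizeInBytes).sum + 1)
}

// With `lazy val`, each node computes its statistics once and caches the result.
class NodeWithLazyVal(children: Seq[NodeWithLazyVal]) {
  lazy val statistics: Stats = Stats(children.map(_.statistics.sizeInBytes).sum + 1)
}
{code}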



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16174) Add RemoveLiteralRepetitionFromIn optimizer

2016-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16174:


Assignee: (was: Apache Spark)

> Add RemoveLiteralRepetitionFromIn optimizer
> ---
>
> Key: SPARK-16174
> URL: https://issues.apache.org/jira/browse/SPARK-16174
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> This issue adds an optimizer to remove the duplicated literals from SQL `IN` 
> predicates. This optimizer prevents user mistakes and also can optimize some 
> queries like 
> [TPCDS-36|https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q36.sql#L19].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16174) Add RemoveLiteralRepetitionFromIn optimizer

2016-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16174:


Assignee: Apache Spark

> Add RemoveLiteralRepetitionFromIn optimizer
> ---
>
> Key: SPARK-16174
> URL: https://issues.apache.org/jira/browse/SPARK-16174
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> This issue adds an optimizer to remove the duplicated literals from SQL `IN` 
> predicates. This optimizer prevents user mistakes and also can optimize some 
> queries like 
> [TPCDS-36|https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q36.sql#L19].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16174) Add RemoveLiteralRepetitionFromIn optimizer

2016-06-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347131#comment-15347131
 ] 

Apache Spark commented on SPARK-16174:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/13876

> Add RemoveLiteralRepetitionFromIn optimizer
> ---
>
> Key: SPARK-16174
> URL: https://issues.apache.org/jira/browse/SPARK-16174
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> This issue adds an optimizer to remove the duplicated literals from SQL `IN` 
> predicates. This optimizer prevents user mistakes and also can optimize some 
> queries like 
> [TPCDS-36|https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q36.sql#L19].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15069) GSoC 2016: Exposing more R and Python APIs for MLlib

2016-06-23 Thread Kai Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347121#comment-15347121
 ] 

Kai Jiang commented on SPARK-15069:
---

*6/22/2016 - Week 5*
To-do items
* Keep investigating more differences between the MLlib API and the R API for 
Decision Tree
* Start the same investigation for Random Forest
* Continue to update the Decision Tree wrapper PR according to the investigation.

> GSoC 2016: Exposing more R and Python APIs for MLlib
> 
>
> Key: SPARK-15069
> URL: https://issues.apache.org/jira/browse/SPARK-15069
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, PySpark, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Kai Jiang
>  Labels: gsoc2016, mentor
> Attachments: 1458791046_[GSoC2016]ApacheSpark_KaiJiang_Proposal.pdf
>
>
> This issue is for tracking the Google Summer of Code 2016 project for Kai 
> Jiang: "Apache Spark: Exposing more R and Python APIs for MLlib"
> See attached proposal for details.  Note that the tasks listed in the 
> proposal are tentative and can adapt as the community works on these various 
> parts of MLlib.
> This umbrella will contain links for tasks included in this project, to be 
> added as each task begins.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16174) Add RemoveLiteralRepetitionFromIn optimizer

2016-06-23 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-16174:
-

 Summary: Add RemoveLiteralRepetitionFromIn optimizer
 Key: SPARK-16174
 URL: https://issues.apache.org/jira/browse/SPARK-16174
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Dongjoon Hyun
Priority: Minor


This issue adds an optimizer to remove the duplicated literals from SQL `IN` 
predicates. This optimizer prevents user mistakes and also can optimize some 
queries like 
[TPCDS-36|https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q36.sql#L19].
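
As a hedged illustration of the rewrite's effect only (the values are made up, and the actual rule would operate on the Catalyst IN expression rather than on strings):

{code}
// The optimizer rule would keep only the distinct literals of an IN list.
val written = Seq("'TN'", "'TN'", "'GA'", "'TN'")  // literals as the user wrote them
val kept    = written.distinct                     // Seq("'TN'", "'GA'")
println(s"SELECT * FROM store_sales WHERE ss_state IN (${kept.mkString(", ")})")
{code}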



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13216) Spark streaming application not honoring --num-executors in restarting of an application from a checkpoint

2016-06-23 Thread Neelesh Srinivas Salian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347067#comment-15347067
 ] 

Neelesh Srinivas Salian commented on SPARK-13216:
-

[~srowen] are we moving forward with this? Based on [~hshreedharan]'s comment, 
it could be done.

CC'ing [~tdas] here to get more insight.

> Spark streaming application not honoring --num-executors in restarting of an 
> application from a checkpoint
> --
>
> Key: SPARK-13216
> URL: https://issues.apache.org/jira/browse/SPARK-13216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, Streaming
>Affects Versions: 1.5.0
>Reporter: Neelesh Srinivas Salian
>Priority: Minor
>  Labels: Streaming
>
> Scenario to help understand:
> 1) The Spark streaming job with 12 executors was initiated with checkpointing 
> enabled.
> 2) In version 1.3, the user was able to increase the number of executors to 20 
> using {{--num-executors}}, but was unable to do so in version 1.5.
> In 1.5, the Spark application still runs with 13 containers (1 for the driver 
> and 12 executors).
> There is a need to start from the checkpoint itself rather than restart the 
> application, to avoid the loss of information.
> 3) Checked the code in 1.3 and 1.5, which shows that the {{--num-executors}} 
> option has been deprecated.
> Any thoughts on this? Not sure if anyone hit this one specifically before.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15725) Dynamic allocation hangs YARN app when executors time out

2016-06-23 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-15725:
--
Fix Version/s: (was: 2.0.0)
   2.0.1

> Dynamic allocation hangs YARN app when executors time out
> -
>
> Key: SPARK-15725
> URL: https://issues.apache.org/jira/browse/SPARK-15725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Fix For: 2.0.1
>
>
> We've had a problem with dynamic allocation and YARN (since 1.6) where a 
> large stage will cause a lot of executors to get killed around the same time, 
> causing the driver and AM to lock up and wait forever. This can happen even 
> with a small number of executors (~100).
> When executors are killed by the driver, the [network connection to the 
> driver 
> disconnects|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L201].
>  That results in a call to the AM to find out why the executor died, followed 
> by a [blocking and retrying `RemoveExecutor` RPC 
> call|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L227]
>  that results in a second `KillExecutor` call to the AM. When a lot of 
> executors are killed around the same time, the driver's AM threads are all 
> taken up blocking and waiting on the AM (see the stack trace below, which was 
> the same for 42 threads). I think this behavior, the network disconnect and 
> subsequent cleanup, is unique to YARN.
> {code:title=Driver AM thread stack}
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> scala.concurrent.Await$.result(package.scala:190)
> org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$2.apply$mcV$sp(YarnSchedulerBackend.scala:286)
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$2.apply(YarnSchedulerBackend.scala:286)
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$2.apply(YarnSchedulerBackend.scala:286)
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> {code}
> The RPC calls to the AM aren't returning because the `YarnAllocator` is 
> spending all of its time in the `allocateResources` method. That class's 
> public methods are synchronized so only one RPC can be satisfied at a time. 
> The reason why it is constantly calling `allocateResources` is because [its 
> thread|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L467]
>  is [woken 
> up|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L686]
>  by calls to get the failure reason for an executor -- which is part of the 
> chain of events in the driver for each executor that goes down.
> The final result is that the `YarnAllocator` doesn't respond to RPC calls for 
> long enough that calls time out and replies for non-blocking calls are 
> dropped. Then the application is unable to do any work because everything 
> retries or exits and the application *hangs for 24+ hours*, until enough 
> errors accumulate that it dies.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: 

[jira] [Updated] (SPARK-13723) YARN - Change behavior of --num-executors when spark.dynamicAllocation.enabled true

2016-06-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13723:
--
Fix Version/s: (was: 2.0.0)
   2.0.1

> YARN - Change behavior of --num-executors when 
> spark.dynamicAllocation.enabled true
> ---
>
> Key: SPARK-13723
> URL: https://issues.apache.org/jira/browse/SPARK-13723
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>Assignee: Ryan Blue
>Priority: Minor
> Fix For: 2.0.1
>
>
> I think we should change the behavior when --num-executors is specified when 
> dynamic allocation is enabled. Currently if --num-executors is specified 
> dynamic allocation is disabled and it just uses a static number of executors.
> I would rather see the default behavior changed in the 2.x line. If the dynamic 
> allocation config is on, then num-executors sets both the maximum and the initial 
> number of executors. I think this would allow users to easily cap their usage and would 
> still allow it to free up executors. It would also allow users doing ML to start 
> out with a # of executors, and if they are actually caching the data the 
> executors wouldn't be freed up. So you would get very similar behavior to if 
> dynamic allocation was off.
> Part of the reason for this is that using a static number generally wastes 
> resources, especially with people doing ad hoc things with spark-shell. It 
> also has a big effect when people are doing MapReduce/ETL-type workloads.
> The problem is that people are used to specifying num-executors, so if we turn 
> dynamic allocation on by default in a cluster config it is just overridden.
> We should also update the spark-submit --help description for --num-executors.
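
A hedged sketch of the proposed mapping (the config keys below are existing Spark settings; the mapping itself is only the proposal, and the value 20 is illustrative):

{code}
import org.apache.spark.SparkConf

// Under the proposal, --num-executors 20 with dynamic allocation enabled would
// behave roughly like setting the initial and maximum executor counts.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.initialExecutors", "20")
  .set("spark.dynamicAllocation.maxExecutors", "20")
{code}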



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15230) Back quoted column with dot in it fails when running distinct on dataframe

2016-06-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15230:

Component/s: (was: Examples)
 SQL

> Back quoted column with dot in it fails when running distinct on dataframe
> --
>
> Key: SPARK-15230
> URL: https://issues.apache.org/jira/browse/SPARK-15230
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Barry Becker
>Assignee: Bo Meng
> Fix For: 2.0.1
>
>
> When working with a DataFrame, columns with dots in their names must be backquoted 
> (``) or the column name will not be found. This works for most DataFrame 
> methods, but I discovered that it does not work for distinct().
> Suppose you have a DataFrame, testDf, with a DoubleType column named 
> {{pos.NoZero}}. This statement:
> {noformat}
> testDf.select(new Column("`pos.NoZero`")).distinct().collect().mkString(", ")
> {noformat}
> will fail with this error:
> {noformat}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name 
> "pos.NoZero" among (pos.NoZero);
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1$$anonfun$40.apply(DataFrame.scala:1329)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1$$anonfun$40.apply(DataFrame.scala:1329)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1.apply(DataFrame.scala:1329)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1.apply(DataFrame.scala:1328)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165)
>   at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1328)
>   at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1348)
>   at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1319)
>   at org.apache.spark.sql.DataFrame.distinct(DataFrame.scala:1612)
>   at 
> com.mineset.spark.vizagg.selection.SelectionExpressionSuite$$anonfun$40.apply$mcV$sp(SelectionExpressionSuite.scala:317)
> {noformat}
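
One possible workaround, sketched under the assumption that {{withColumnRenamed}} resolves the dotted name correctly (the replacement name is arbitrary):

{code}
// Rename the dotted column before deduplicating, then collect as before.
testDf
  .withColumnRenamed("pos.NoZero", "pos_NoZero")
  .select("pos_NoZero")
  .distinct()
  .collect()
  .mkString(", ")
{code}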



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


