[jira] [Created] (SPARK-7940) Enforce whitespace checking for DO, TRY, CATCH, FINALLY, MATCH

2015-05-28 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-7940:
--

 Summary: Enforce whitespace checking for DO, TRY, CATCH, FINALLY, 
MATCH
 Key: SPARK-7940
 URL: https://issues.apache.org/jira/browse/SPARK-7940
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Reynold Xin
Assignee: Reynold Xin









[jira] [Commented] (SPARK-7938) Use errorprone in Spark

2015-05-28 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564300#comment-14564300
 ] 

Josh Rosen commented on SPARK-7938:
---

If possible, we should also integrate this into our SBT build so that these 
checks are run in pull request builders.  If it's not possible to do that yet 
(e.g. we'd need to wait for someone to write an SBT plugin), then we can just 
do Maven for now and leave SBT to future work.

> Use errorprone in Spark
> ---
>
> Key: SPARK-7938
> URL: https://issues.apache.org/jira/browse/SPARK-7938
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Reynold Xin
>  Labels: starter
>
> We have quite a bit of low level code written in Java (e.g. unsafe module). 
> One nice thing about Java is that we can use better tools for finding common 
> errors, e.g. Google's error-prone.
> This is a ticket to integrate error-prone into our Maven build.






[jira] [Commented] (SPARK-7541) Check model save/load for MLlib 1.4

2015-05-28 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564296#comment-14564296
 ] 

yuhao yang commented on SPARK-7541:
---

Oh, "checked" means I found no python support for save/load for the model. 

> Check model save/load for MLlib 1.4
> ---
>
> Key: SPARK-7541
> URL: https://issues.apache.org/jira/browse/SPARK-7541
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib, PySpark
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>
> For each model which supports save/load methods, we need to verify:
> * These methods are tested in unit tests in Scala and Python (if save/load is 
> supported in Python).
> * If a model's name, data members, or constructors have changed _at all_, 
> then we likely need to support a new save/load format version.  Different 
> versions must be tested in unit tests to ensure backwards compatibility 
> (i.e., verify we can load old model formats).
> * Examples in the programming guide should include save/load when available.  
> It's important to try running each example in the guide whenever it is 
> modified (since there are no automated tests).






[jira] [Comment Edited] (SPARK-7541) Check model save/load for MLlib 1.4

2015-05-28 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564263#comment-14564263
 ] 

yuhao yang edited comment on SPARK-7541 at 5/29/15 6:40 AM:


||model||Scala UT||Python UT||changes||backwards compatibility||
|LogisticRegressionModel|LogisticRegressionSuite|LogisticRegressionModel doctests|no public change|y|
|NaiveBayesModel|NaiveBayesSuite|NaiveBayesModel doctests|save/load 2.0|y|
|SVMModel|SVMSuite|SVMModel doctests|no public change|y|
|GaussianMixtureModel|GaussianMixtureSuite|checked|New Saveable in 1.4|New Saveable in 1.4|
|KMeansModel|KMeansSuite|KMeansModel doctests|New Saveable in 1.4|New Saveable in 1.4|
|PowerIterationClusteringModel|PowerIterationClusteringSuite|checked|New Saveable in 1.4|New Saveable in 1.4|
|Word2VecModel|Word2VecSuite|checked|New Saveable in 1.4|New Saveable in 1.4|
|MatrixFactorizationModel|MatrixFactorizationModelSuite|MatrixFactorizationModel doctests|no public change|y|
|IsotonicRegressionModel|IsotonicRegressionSuite|IsotonicRegressionModel|New Saveable in 1.4|New Saveable in 1.4|
|LassoModel|LassoSuite|LassoModel doctests|no public change|y|
|LinearRegressionModel|LinearRegressionSuite|LinearRegressionModel doctests|no public change|y|
|RidgeRegressionModel|RidgeRegressionSuite|RidgeRegressionModel doctests|no public change|y|
|DecisionTreeModel|DecisionTreeSuite|dt_model.save|no public change|y|
|RandomForestModel|RandomForestSuite|rf_model.save|no public change|y|
|GradientBoostedTreesModel|GradientBoostedTreesSuite|gbt_model.save|no public change|y|

The contents above have been checked and no obvious issues were detected. 
Joseph, do you think we should add save/load wherever it is available in the 
example documents?


was (Author: yuhaoyan):
||model||Scala UT||python UT||changes||backwards Compatibility||
|LogisticRegressionModel|LogisticRegressionSuite|LogisticRegressionModel doctests|no public change|y|
|NaiveBayesModel|NaiveBayesSuite|NaiveBayesModel doctests|save/load 2.0|y|
|SVMModel|SVMSuite|SVMModel doctests|no public change|y|
|GaussianMixtureModel|GaussianMixtureSuite|checked|New Savable in 1.4|New Savable in 1.4|
|KMeansModel|KMeansSuite|KMeansModel doctests|New Savable in 1.4|New Savable in 1.4|
|PowerIterationClusteringModel|PowerIterationClusteringSuite|checked|New Savable in 1.4|New Savable in 1.4|
|Word2VecModel|Word2VecSuite|checked|New Savable in 1.4|New Savable in 1.4|
|MatrixFactorizationModel|MatrixFactorizationModelSuite|MatrixFactorizationModel doctests|no public change|y|
|IsotonicRegressionModel|IsotonicRegressionSuite|IsotonicRegressionModel|New Savable in 1.4|New Savable in 1.4|
|LassoModel|LassoSuite|LassoModel doctests|no public change|y|
|LinearRegressionModel|LinearRegressionSuite|LinearRegressionModel doctests|no public change|y|
|RidgeRegressionModel|RidgeRegressionSuite|RidgeRegressionModel doctests|no public change|y|
|DecisionTreeModel|DecisionTreeSuite|dt_model.save|no public change|y|
|RandomForestModel|RandomForestSuite|rf_model.save|no public change|y|
|GradientBoostedTreesModel|GradientBoostedTreesSuite|gbt_model.sav|

[jira] [Commented] (SPARK-6806) SparkR examples in programming guide

2015-05-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564293#comment-14564293
 ] 

Apache Spark commented on SPARK-6806:
-

User 'shivaram' has created a pull request for this issue:
https://github.com/apache/spark/pull/6490

> SparkR examples in programming guide
> 
>
> Key: SPARK-6806
> URL: https://issues.apache.org/jira/browse/SPARK-6806
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, SparkR
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.4.0
>
>
> Add R examples for Spark Core and DataFrame programming guide






[jira] [Commented] (SPARK-7938) Use errorprone in Spark

2015-05-28 Thread Yijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564292#comment-14564292
 ] 

Yijie Shen commented on SPARK-7938:
---

I'd love to take this :)

> Use errorprone in Spark
> ---
>
> Key: SPARK-7938
> URL: https://issues.apache.org/jira/browse/SPARK-7938
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Reynold Xin
>  Labels: starter
>
> We have quite a bit of low level code written in Java (e.g. unsafe module). 
> One nice thing about Java is that we can use better tools for finding common 
> errors, e.g. Google's error-prone.
> This is a ticket to integrate error-prone into our Maven build.






[jira] [Commented] (SPARK-7541) Check model save/load for MLlib 1.4

2015-05-28 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564283#comment-14564283
 ] 

Joseph K. Bradley commented on SPARK-7541:
--

Awesome, thank you for the careful check!

Q: In the "python UT" column, I understand what the "doctests" are, but what do 
the other entries mean?  E.g., what does "checked" mean?

Good point about having save/load in all relevant example docs.  Would you mind 
putting together a PR for adding that to example code in the Markdown 
programming guide docs?
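
For reference, a minimal sketch of the kind of snippet the guide examples would gain, using the MLlib 1.3+ Saveable/Loader API (the model choice, data, and path are illustrative; `sc` is assumed as in a spark-shell session):

{code:scala}
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Train a tiny model, persist it, and load it back.
val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 1.0))))
val model = NaiveBayes.train(training)
model.save(sc, "myNaiveBayesModel")
val sameModel = NaiveBayesModel.load(sc, "myNaiveBayesModel")
{code}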

> Check model save/load for MLlib 1.4
> ---
>
> Key: SPARK-7541
> URL: https://issues.apache.org/jira/browse/SPARK-7541
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib, PySpark
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>
> For each model which supports save/load methods, we need to verify:
> * These methods are tested in unit tests in Scala and Python (if save/load is 
> supported in Python).
> * If a model's name, data members, or constructors have changed _at all_, 
> then we likely need to support a new save/load format version.  Different 
> versions must be tested in unit tests to ensure backwards compatibility 
> (i.e., verify we can load old model formats).
> * Examples in the programming guide should include save/load when available.  
> It's important to try running each example in the guide whenever it is 
> modified (since there are no automated tests).






[jira] [Updated] (SPARK-7936) Add configuration for initial size and limit of hash for aggregation

2015-05-28 Thread Navis (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Navis updated SPARK-7936:
-
Summary: Add configuration for initial size and limit of hash for 
aggregation  (was: Add configuration for initial size of hash for aggregation 
and limit)

> Add configuration for initial size and limit of hash for aggregation
> 
>
> Key: SPARK-7936
> URL: https://issues.apache.org/jira/browse/SPARK-7936
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Navis
>Priority: Minor
>
> Partial aggregation takes a lot of memory and usually cannot complete unless 
> the input is sliced into very small (and therefore numerous) partitions. This 
> patch limits the number of hash entries for partial aggregation; a 
> configurable initial hash size is a bonus.






[jira] [Updated] (SPARK-7939) Make URL partition recognition return String by default for all partition column types and values

2015-05-28 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-7939:
-
Summary: Make URL partition recognition return String by default for all 
partition column types and values  (was: Make URL partition recognition return 
String by default for all partition column values)

> Make URL partition recognition return String by default for all partition 
> column types and values
> -
>
> Key: SPARK-7939
> URL: https://issues.apache.org/jira/browse/SPARK-7939
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Jianshi Huang
>
> Imagine the following HDFS paths:
> /data/split=00
> /data/split=01
> ...
> /data/split=FF
> If I have at most 10 partitions (00, 01, ... 09), partition recognition 
> currently treats the column 'split' as an integer column. 
> If I have more than 10 partitions, the column 'split' is recognized as a 
> String...
> This is very confusing. *So I'm suggesting we treat partition columns as 
> String by default*, and allow the user to specify types if needed.
> Another example is date:
> /data/date=2015-04-01 => 'date' is String
> /data/date=20150401 => 'date' is Int
> Jianshi






[jira] [Created] (SPARK-7939) Make URL partition recognition return String by default for all partition column values

2015-05-28 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-7939:


 Summary: Make URL partition recognition return String by default 
for all partition column values
 Key: SPARK-7939
 URL: https://issues.apache.org/jira/browse/SPARK-7939
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Jianshi Huang


Imagine the following HDFS paths:

/data/split=00
/data/split=01
...
/data/split=FF

If I have at most 10 partitions (00, 01, ... 09), partition recognition 
currently treats the column 'split' as an integer column.

If I have more than 10 partitions, the column 'split' is recognized as a 
String...

This is very confusing. *So I'm suggesting we treat partition columns as String 
by default*, and allow the user to specify types if needed.

Another example is date:
/data/date=2015-04-01 => 'date' is String
/data/date=20150401 => 'date' is Int

Jianshi
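
Until a default is settled, a hedged workaround sketch is to cast the inferred column back to a string after reading (assuming Parquet data at the paths above and a spark-shell `sqlContext`):

{code:scala}
// Read the partitioned data; partition discovery may infer 'split' as an int.
val df = sqlContext.read.parquet("/data")

// Explicitly cast the partition column back to a string
// (select the remaining columns alongside as needed).
val splitAsString = df.select(df("split").cast("string").as("split"))
{code}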






[jira] [Updated] (SPARK-7938) Use errorprone in Spark

2015-05-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7938:
---
Labels: starter  (was: )

> Use errorprone in Spark
> ---
>
> Key: SPARK-7938
> URL: https://issues.apache.org/jira/browse/SPARK-7938
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Reynold Xin
>  Labels: starter
>
> We have quite a bit of low level code written in Java (e.g. unsafe module). 
> One nice thing about Java is that we can use better tools for finding common 
> errors, e.g. Google's error-prone.
> This is a ticket to integrate error-prone into our Maven build.






[jira] [Commented] (SPARK-7936) Add configuration for initial size of hash for aggregation and limit

2015-05-28 Thread Navis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564271#comment-14564271
 ] 

Navis commented on SPARK-7936:
--

Added two configurations:
1. spark.sql.aggregation.hash.initSize: the initial size of the hash, applied 
to both final and partial aggregation.
2. spark.sql.partial.aggregation.maxEntry: the maximum number of hash entries 
for partial aggregation; not applied to final aggregation.
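
A sketch of how the proposed settings would be applied (the keys are taken from the comment above and exist only with the proposed patch; the values are illustrative):

{code:scala}
// Initial hash size, used by both partial and final aggregation.
sqlContext.setConf("spark.sql.aggregation.hash.initSize", "10000")

// Cap on hash entries, for partial aggregation only.
sqlContext.setConf("spark.sql.partial.aggregation.maxEntry", "1000000")
{code}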

> Add configuration for initial size of hash for aggregation and limit
> 
>
> Key: SPARK-7936
> URL: https://issues.apache.org/jira/browse/SPARK-7936
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Navis
>Priority: Minor
>
> Partial aggregation takes a lot of memory and usually cannot complete unless 
> the input is sliced into very small (and therefore numerous) partitions. This 
> patch limits the number of hash entries for partial aggregation; a 
> configurable initial hash size is a bonus.






[jira] [Assigned] (SPARK-7936) Add configuration for initial size of hash for aggregation and limit

2015-05-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7936:
---

Assignee: Apache Spark

> Add configuration for initial size of hash for aggregation and limit
> 
>
> Key: SPARK-7936
> URL: https://issues.apache.org/jira/browse/SPARK-7936
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Navis
>Assignee: Apache Spark
>Priority: Minor
>
> Partial aggregation takes a lot of memory and usually cannot complete unless 
> the input is sliced into very small (and therefore numerous) partitions. This 
> patch limits the number of hash entries for partial aggregation; a 
> configurable initial hash size is a bonus.






[jira] [Commented] (SPARK-7936) Add configuration for initial size of hash for aggregation and limit

2015-05-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564270#comment-14564270
 ] 

Apache Spark commented on SPARK-7936:
-

User 'navis' has created a pull request for this issue:
https://github.com/apache/spark/pull/6488

> Add configuration for initial size of hash for aggregation and limit
> 
>
> Key: SPARK-7936
> URL: https://issues.apache.org/jira/browse/SPARK-7936
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Navis
>Priority: Minor
>
> Partial aggregation takes a lot of memory and usually cannot complete unless 
> the input is sliced into very small (and therefore numerous) partitions. This 
> patch limits the number of hash entries for partial aggregation; a 
> configurable initial hash size is a bonus.






[jira] [Assigned] (SPARK-7936) Add configuration for initial size of hash for aggregation and limit

2015-05-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7936:
---

Assignee: (was: Apache Spark)

> Add configuration for initial size of hash for aggregation and limit
> 
>
> Key: SPARK-7936
> URL: https://issues.apache.org/jira/browse/SPARK-7936
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Navis
>Priority: Minor
>
> Partial aggregation takes a lot of memory and usually cannot complete unless 
> the input is sliced into very small (and therefore numerous) partitions. This 
> patch limits the number of hash entries for partial aggregation; a 
> configurable initial hash size is a bonus.






[jira] [Created] (SPARK-7938) Use errorprone in Spark

2015-05-28 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-7938:
--

 Summary: Use errorprone in Spark
 Key: SPARK-7938
 URL: https://issues.apache.org/jira/browse/SPARK-7938
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Reynold Xin


We have quite a bit of low level code written in Java (e.g. unsafe module). One 
nice thing about Java is that we can use better tools for finding common 
errors, e.g. Google's error-prone.

This is a ticket to integrate error-prone into our Maven build.







[jira] [Commented] (SPARK-7937) Cannot compare Hive named_struct. (when using argmax, argmin)

2015-05-28 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564267#comment-14564267
 ] 

Jianshi Huang commented on SPARK-7937:
--

Blog post describing Hive's argmax/argmin feature: 
https://www.joefkelley.com/?p=727

Hive JIRA: https://issues.apache.org/jira/browse/HIVE-1128

Jianshi

> Cannot compare Hive named_struct. (when using argmax, argmin)
> -
>
> Key: SPARK-7937
> URL: https://issues.apache.org/jira/browse/SPARK-7937
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Jianshi Huang
>
> Imagine the following SQL:
> Intention: get last used bank account country.
>  
> {code:sql}
> select bank_account_id, 
>   max(named_struct(
> 'src_row_update_ts', unix_timestamp(src_row_update_ts,'/M/D 
> HH:mm:ss'), 
> 'bank_country', bank_country)).bank_country 
> from bank_account_monthly
> where year_month='201502' 
> group by bank_account_id
> {code}
> => 
> {noformat}
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 94 in stage 96.0 failed 4 times, most recent failure: Lost task 94.3 in 
> stage 96.0 (TID 22281, ): java.lang.RuntimeException: Type 
> StructType(StructField(src_row_update_ts,LongType,true), 
> StructField(bank_country,StringType,true)) does not support ordered operations
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.catalyst.expressions.LessThan.ordering$lzycompute(predicates.scala:222)
> at 
> org.apache.spark.sql.catalyst.expressions.LessThan.ordering(predicates.scala:215)
> at 
> org.apache.spark.sql.catalyst.expressions.LessThan.eval(predicates.scala:235)
> at 
> org.apache.spark.sql.catalyst.expressions.MaxFunction.update(aggregates.scala:147)
> at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:165)
> at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:149)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:724)
> {noformat}






[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer

2015-05-28 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564266#comment-14564266
 ] 

Josh Rosen commented on SPARK-7708:
---

Also, it looks like Chill is still using Kryo 2.21 instead of a newer version 
because of some Storm incompatibilities or dependency problems: 
https://github.com/twitter/chill/commit/3869b0122660c908e189ff08b615bd7221956224#commitcomment-8362755.
  Therefore, a version bump might be an uphill battle, since it might require 
community involvement from the Kryo and/or Chill developers.

If the only blocker for Chill is Storm compatibility issues that don't affect 
us, we might consider publishing our own fork of Chill under the 
org.apache.spark namespace, similar to how we used to publish custom versions 
of Pyrolite.  If possible, though, I'd like to avoid that option and only use 
it as a last resort.

I can't really spend much more time investigating this myself right now, but 
would really appreciate it if someone would dig into these issues in more 
detail and post a summary here. 

> Incorrect task serialization with Kryo closure serializer
> -
>
> Key: SPARK-7708
> URL: https://issues.apache.org/jira/browse/SPARK-7708
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.2
>Reporter: Akshat Aranya
>
> I've been investigating the use of Kryo for closure serialization with Spark 
> 1.2, and it seems like I've hit upon a bug:
> When a task is serialized before scheduling, the following log message is 
> generated:
> [info] o.a.s.s.TaskSetManager - Starting task 124.1 in stage 0.0 (TID 342, 
> , PROCESS_LOCAL, 302 bytes)
> This message comes from TaskSetManager which serializes the task using the 
> closure serializer.  Before the message is sent out, the TaskDescription 
> (which included the original task as a byte array), is serialized again into 
> a byte array with the closure serializer.  I added a log message for this in 
> CoarseGrainedSchedulerBackend, which produces the following output:
> [info] o.a.s.s.c.CoarseGrainedSchedulerBackend - 124.1 size=132
> The serialized size of TaskDescription (132 bytes) turns out to be _smaller_ 
> than serialized task that it contains (302 bytes). This implies that 
> TaskDescription.buffer is not getting serialized correctly.
> On the executor side, the deserialization produces a null value for 
> TaskDescription.buffer.
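
For context, a minimal sketch of the configuration under which this is reported, switching the closure serializer from the Java default to Kryo (assuming the `spark.closure.serializer` setting available in Spark 1.x; the app name is illustrative):

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kryo-closure-serialization")
  .set("spark.closure.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)
{code}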






[jira] [Updated] (SPARK-7937) Cannot compare Hive named_struct. (when using argmax, argmin)

2015-05-28 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-7937:
-
Description: 
Imagine the following SQL:

Intention: get last used bank account country.
 
{code:sql}
select bank_account_id, 
  max(named_struct(
'src_row_update_ts', unix_timestamp(src_row_update_ts,'/M/D HH:mm:ss'), 
'bank_country', bank_country)).bank_country 
from bank_account_monthly
where year_month='201502' 
group by bank_account_id
{code}

=> 
{noformat}
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 
94 in stage 96.0 failed 4 times, most recent failure: Lost task 94.3 in stage 
96.0 (TID 22281, ): java.lang.RuntimeException: Type 
StructType(StructField(src_row_update_ts,LongType,true), 
StructField(bank_country,StringType,true)) does not support ordered operations
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.ordering$lzycompute(predicates.scala:222)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.ordering(predicates.scala:215)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.eval(predicates.scala:235)
at 
org.apache.spark.sql.catalyst.expressions.MaxFunction.update(aggregates.scala:147)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:165)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:149)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
{noformat}

  was:
Imagine the following SQL:

Intention: get last used bank account country.
 
``` sql
select bank_account_id, 
  max(named_struct(
'src_row_update_ts', unix_timestamp(src_row_update_ts,'/M/D HH:mm:ss'), 
'bank_country', bank_country)).bank_country 
from bank_account_monthly
where year_month='201502' 
group by bank_account_id
```

=> 
```
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 
94 in stage 96.0 failed 4 times, most recent failure: Lost task 94.3 in stage 
96.0 (TID 22281, ): java.lang.RuntimeException: Type 
StructType(StructField(src_row_update_ts,LongType,true), 
StructField(bank_country,StringType,true)) does not support ordered operations
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.ordering$lzycompute(predicates.scala:222)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.ordering(predicates.scala:215)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.eval(predicates.scala:235)
at 
org.apache.spark.sql.catalyst.expressions.MaxFunction.update(aggregates.scala:147)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:165)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:149)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at 

[jira] [Updated] (SPARK-7937) Cannot compare Hive named_struct. (when using argmax, argmin)

2015-05-28 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-7937:
-
Description: 
Imagine the following SQL:

Intention: get last used bank account country.
 
``` sql
select bank_account_id, 
  max(named_struct(
'src_row_update_ts', unix_timestamp(src_row_update_ts,'/M/D HH:mm:ss'), 
'bank_country', bank_country)).bank_country 
from bank_account_monthly
where year_month='201502' 
group by bank_account_id
```

=> 
```
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 
94 in stage 96.0 failed 4 times, most recent failure: Lost task 94.3 in stage 
96.0 (TID 22281, ): java.lang.RuntimeException: Type 
StructType(StructField(src_row_update_ts,LongType,true), 
StructField(bank_country,StringType,true)) does not support ordered operations
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.ordering$lzycompute(predicates.scala:222)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.ordering(predicates.scala:215)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.eval(predicates.scala:235)
at 
org.apache.spark.sql.catalyst.expressions.MaxFunction.update(aggregates.scala:147)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:165)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:149)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
```

  was:
Imagine the following SQL:

Intention: get last used bank account country.
 
select bank_account_id, 
  max(named_struct(
'src_row_update_ts', unix_timestamp(src_row_update_ts,'/M/D HH:mm:ss'), 
'bank_country', bank_country)).bank_country 
from bank_account_monthly
where year_month='201502' 
group by bank_account_id
 
=> 

Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 
94 in stage 96.0 failed 4 times, most recent failure: Lost task 94.3 in stage 
96.0 (TID 22281, ): java.lang.RuntimeException: Type 
StructType(StructField(src_row_update_ts,LongType,true), 
StructField(bank_country,StringType,true)) does not support ordered operations
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.ordering$lzycompute(predicates.scala:222)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.ordering(predicates.scala:215)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.eval(predicates.scala:235)
at 
org.apache.spark.sql.catalyst.expressions.MaxFunction.update(aggregates.scala:147)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:165)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:149)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)

[jira] [Created] (SPARK-7937) Cannot compare Hive named_struct. (when using argmax, argmin)

2015-05-28 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-7937:


 Summary: Cannot compare Hive named_struct. (when using argmax, 
argmin)
 Key: SPARK-7937
 URL: https://issues.apache.org/jira/browse/SPARK-7937
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Jianshi Huang


Imagine the following SQL:

Intention: get last used bank account country.
 
select bank_account_id, 
  max(named_struct(
'src_row_update_ts', unix_timestamp(src_row_update_ts,'/M/D HH:mm:ss'), 
'bank_country', bank_country)).bank_country 
from bank_account_monthly
where year_month='201502' 
group by bank_account_id
 
=> 

Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 
94 in stage 96.0 failed 4 times, most recent failure: Lost task 94.3 in stage 
96.0 (TID 22281, ): java.lang.RuntimeException: Type 
StructType(StructField(src_row_update_ts,LongType,true), 
StructField(bank_country,StringType,true)) does not support ordered operations
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.ordering$lzycompute(predicates.scala:222)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.ordering(predicates.scala:215)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.eval(predicates.scala:235)
at 
org.apache.spark.sql.catalyst.expressions.MaxFunction.update(aggregates.scala:147)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:165)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:149)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
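
Until struct ordering is supported, a hedged workaround sketch is to compute the max timestamp per key and join back, via the DataFrame API on a HiveContext (table and column names are taken from the query above; the timestamp format stands in for the elided one):

{code:scala}
import org.apache.spark.sql.functions.max

val monthly = sqlContext.table("bank_account_monthly")
  .filter("year_month = '201502'")
  .selectExpr(
    "bank_account_id",
    "bank_country",
    "unix_timestamp(src_row_update_ts, 'yyyy/MM/dd HH:mm:ss') as ts")

// Max timestamp per account, with the key renamed to avoid join ambiguity.
val latest = monthly.groupBy("bank_account_id")
  .agg(max("ts").as("max_ts"))
  .withColumnRenamed("bank_account_id", "id")

// Join back to recover the bank_country of each account's most recent row.
val lastCountry = monthly
  .join(latest, monthly("bank_account_id") === latest("id") && monthly("ts") === latest("max_ts"))
  .select("bank_account_id", "bank_country")
{code}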







[jira] [Commented] (SPARK-7541) Check model save/load for MLlib 1.4

2015-05-28 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564263#comment-14564263
 ] 

yuhao yang commented on SPARK-7541:
---

||model||Scala UT||python UT||changes||backwards Compatibility||
|LogisticRegressionModel|LogisticRegressionSuite|LogisticRegressionModel doctests|no public change|y|
|NaiveBayesModel|NaiveBayesSuite|NaiveBayesModel doctests|save/load 2.0|y|
|SVMModel|SVMSuite|SVMModel doctests|no public change|y|
|GaussianMixtureModel|GaussianMixtureSuite|checked|New Savable in 1.4|New Savable in 1.4|
|KMeansModel|KMeansSuite|KMeansModel doctests|New Savable in 1.4|New Savable in 1.4|
|PowerIterationClusteringModel|PowerIterationClusteringSuite|checked|New Savable in 1.4|New Savable in 1.4|
|Word2VecModel|Word2VecSuite|checked|New Savable in 1.4|New Savable in 1.4|
|MatrixFactorizationModel|MatrixFactorizationModelSuite|MatrixFactorizationModel doctests|no public change|y|
|IsotonicRegressionModel|IsotonicRegressionSuite|IsotonicRegressionModel|New Savable in 1.4|New Savable in 1.4|
|LassoModel|LassoSuite|LassoModel doctests|no public change|y|
|LinearRegressionModel|LinearRegressionSuite|LinearRegressionModel doctests|no public change|y|
|RidgeRegressionModel|RidgeRegressionSuite|RidgeRegressionModel doctests|no public change|y|
|DecisionTreeModel|DecisionTreeSuite|dt_model.save|no public change|y|
|RandomForestModel|RandomForestSuite|rf_model.save|no public change|y|
|GradientBoostedTreesModel|GradientBoostedTreesSuite|gbt_model.sav|no public change|y|

The contents above have been checked and no obvious issues were detected. 
Joseph, do you think we should add save/load wherever it is available in the 
example documents?

> Check model save/load for MLlib 1.4
> ---
>
> Key: SPARK-7541
> URL: https://issues.apache.org/jira/browse/SPARK-7541
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib, PySpark
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>
> For each model which supports save/load methods, we need to verify:
> * These methods are tested in unit tests in Scala and Python (if save/load is 
> supported in Python).
> * If a model's name, data members, or constructors have changed _at all_, 
> then we likely need to support a new save/load format version.  Different 
> versions must be tested in unit tests to ensure backwards compatibility 
> (i.e., verify we can load old model formats).
> * Examples in the programming guide should include save/load when available.  
> It's important to try running each example in the guide whenever it is 
> modified (since there are no automated tests).






[jira] [Commented] (SPARK-7927) Enforce whitespace for more tokens in style checker

2015-05-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564251#comment-14564251
 ] 

Apache Spark commented on SPARK-7927:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/6487

> Enforce whitespace for more tokens in style checker
> ---
>
> Key: SPARK-7927
> URL: https://issues.apache.org/jira/browse/SPARK-7927
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.4.0
>
>
> Enforce whitespace on comma, colon, if, while, etc ... so we don't need to 
> keep spending time on this in code reviews.






[jira] [Resolved] (SPARK-7927) Enforce whitespace for more tokens in style checker

2015-05-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-7927.

   Resolution: Fixed
Fix Version/s: 1.4.0

> Enforce whitespace for more tokens in style checker
> ---
>
> Key: SPARK-7927
> URL: https://issues.apache.org/jira/browse/SPARK-7927
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.4.0
>
>
> Enforce whitespace on comma, colon, if, while, etc ... so we don't need to 
> keep spending time on this in code reviews.






[jira] [Resolved] (SPARK-7929) Remove Bagel examples

2015-05-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-7929.

   Resolution: Fixed
Fix Version/s: 1.4.0

> Remove Bagel examples
> -
>
> Key: SPARK-7929
> URL: https://issues.apache.org/jira/browse/SPARK-7929
> Project: Spark
>  Issue Type: Task
>  Components: Examples, GraphX
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.4.0
>
>
> Bagel has been deprecated for a while. We should remove the example code.






[jira] [Resolved] (SPARK-7922) ALSModel in the pipeline API should return DataFrames for factors

2015-05-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-7922.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 6468
[https://github.com/apache/spark/pull/6468]

> ALSModel in the pipeline API should return DataFrames for factors
> -
>
> Key: SPARK-7922
> URL: https://issues.apache.org/jira/browse/SPARK-7922
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.4.0
>
>
> This is to be more consistent with the pipeline API. It also helps maintain 
> consistent APIs across languages.
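
A hedged sketch of what the change means for callers (the factor-accessor shape is assumed from the issue title; the input data is invented):

{code:scala}
import org.apache.spark.ml.recommendation.ALS

// A tiny stand-in ratings DataFrame with user, item, and rating columns.
val ratings = sqlContext.createDataFrame(Seq(
  (0, 0, 4.0f), (0, 1, 2.0f), (1, 1, 3.0f))).toDF("user", "item", "rating")

val als = new ALS().setUserCol("user").setItemCol("item").setRatingCol("rating")
val model = als.fit(ratings)

// After this change, the factors come back as DataFrames rather than RDDs.
model.userFactors.show()
model.itemFactors.show()
{code}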






[jira] [Updated] (SPARK-7890) Document that Spark 2.11 now supports Kafka

2015-05-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-7890:
---
Assignee: Sean Owen  (was: Iulian Dragos)

> Document that Spark 2.11 now supports Kafka
> ---
>
> Key: SPARK-7890
> URL: https://issues.apache.org/jira/browse/SPARK-7890
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Patrick Wendell
>Assignee: Sean Owen
>Priority: Critical
>
> The building-spark.html page needs to be updated. It's a simple fix, just 
> remove the caveat about Kafka.






[jira] [Commented] (SPARK-7890) Document that Spark 2.11 now supports Kafka

2015-05-28 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564231#comment-14564231
 ] 

Patrick Wendell commented on SPARK-7890:


No - the JDBC component is not supported.

> Document that Spark 2.11 now supports Kafka
> ---
>
> Key: SPARK-7890
> URL: https://issues.apache.org/jira/browse/SPARK-7890
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Patrick Wendell
>Assignee: Sean Owen
>Priority: Critical
>
> The building-spark.html page needs to be updated. It's a simple fix, just 
> remove the caveat about Kafka.






[jira] [Resolved] (SPARK-7930) Shutdown hook deletes root local dir before SparkContext is stopped, throwing errors

2015-05-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-7930.

   Resolution: Fixed
Fix Version/s: 1.4.0

> Shutdown hook deletes root local dir before SparkContext is stopped, throwing 
> errors
> 
>
> Key: SPARK-7930
> URL: https://issues.apache.org/jira/browse/SPARK-7930
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
> Fix For: 1.4.0
>
>
> The shutdown hook for temp directories had priority 100 while SparkContext's 
> had 50, so the local root directory was deleted before the SparkContext was 
> shut down. This leads to scary errors in running jobs at the time of 
> shutdown, and is especially a problem when running streaming examples, where 
> Ctrl-C is the only way to shut down.






[jira] [Created] (SPARK-7936) Add configuration for initial size of hash for aggregation and limit

2015-05-28 Thread Navis (JIRA)
Navis created SPARK-7936:


 Summary: Add configuration for initial size of hash for 
aggregation and limit
 Key: SPARK-7936
 URL: https://issues.apache.org/jira/browse/SPARK-7936
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Navis
Priority: Minor


Partial aggregation takes a lot of memory and usually cannot complete unless 
the input is sliced into very small (and therefore numerous) partitions. This 
patch limits the number of hash entries for partial aggregation; a configurable 
initial hash size is a bonus.






[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer

2015-05-28 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564219#comment-14564219
 ] 

Josh Rosen commented on SPARK-7708:
---

I invested the time to dig into this because I was worried that this issue 
might impact us in 1.4 due to our increased serializer reuse.  On closer 
analysis, though, I think we're safe.  In 1.3.x, it appears that there are some 
cases where the old code _would_ re-use the same SerializerInstance and make 
multiple `serialize()` calls using the same `Output`.  If the bug didn’t 
manifest in those older versions and we didn’t introduce any new cases of this 
pattern in 1.4.0, then I don’t think we need to take any additional action for 
1.4.

It would be good to have someone else confirm this, though; my quick glances 
through IntelliJ suggest that things are okay.

Regarding upgrading Kryo, there may be some considerations due to our use of 
Chill.  I'm not sure whether Chill supports Kryo 3.x.  We also need to be 
careful not to introduce bugs / regressions by upgrading to 2.23.  Definitely 
give 2.23.0 a try, though, and let me know if it fixes the problem.  If it 
does, you can modify your PR to bump to that version and try to copy the code 
from my gist into a KryoSerializerSuite regression test.

> Incorrect task serialization with Kryo closure serializer
> -
>
> Key: SPARK-7708
> URL: https://issues.apache.org/jira/browse/SPARK-7708
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.2
>Reporter: Akshat Aranya
>
> I've been investigating the use of Kryo for closure serialization with Spark 
> 1.2, and it seems like I've hit upon a bug:
> When a task is serialized before scheduling, the following log message is 
> generated:
> [info] o.a.s.s.TaskSetManager - Starting task 124.1 in stage 0.0 (TID 342, 
> , PROCESS_LOCAL, 302 bytes)
> This message comes from TaskSetManager which serializes the task using the 
> closure serializer.  Before the message is sent out, the TaskDescription 
> (which included the original task as a byte array), is serialized again into 
> a byte array with the closure serializer.  I added a log message for this in 
> CoarseGrainedSchedulerBackend, which produces the following output:
> [info] o.a.s.s.c.CoarseGrainedSchedulerBackend - 124.1 size=132
> The serialized size of TaskDescription (132 bytes) turns out to be _smaller_ 
> than serialized task that it contains (302 bytes). This implies that 
> TaskDescription.buffer is not getting serialized correctly.
> On the executor side, the deserialization produces a null value for 
> TaskDescription.buffer.






[jira] [Resolved] (SPARK-7932) Scheduler delay shown in event timeline is incorrect

2015-05-28 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-7932.
---
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 6484
[https://github.com/apache/spark/pull/6484]

> Scheduler delay shown in event timeline is incorrect
> 
>
> Key: SPARK-7932
> URL: https://issues.apache.org/jira/browse/SPARK-7932
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
> Fix For: 1.4.0
>
>
> In StagePage.scala, we round *down* to the nearest percent when computing the 
> proportion of a task's time spent in each phase of execution.  Scheduler 
> delay is computed by taking 100 - sum(all other proportions), which means 
> that a few extra percent may go into the scheduler delay.  As a result, 
> scheduler delay can appear larger in the visualization than it actually is.
> cc [~shivaram]
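
As a worked illustration of the effect described above (the numbers are invented):

{code:scala}
// Three execution phases truly account for 99.8% of a task's time.
val proportions = Seq(33.6, 33.6, 32.6)

// StagePage-style flooring of each proportion: 33, 33, 32.
val floored = proportions.map(p => math.floor(p).toInt)

val shownDelay = 100 - floored.sum     // the visualization shows 2% scheduler delay
val trueDelay = 100 - proportions.sum  // the real delay is only ~0.2%
{code}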






[jira] [Resolved] (SPARK-7926) Switch to the official Pyrolite release

2015-05-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-7926.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 6472
[https://github.com/apache/spark/pull/6472]

> Switch to the official Pyrolite release
> ---
>
> Key: SPARK-7926
> URL: https://issues.apache.org/jira/browse/SPARK-7926
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.4.0
>
>
> Since there are official releases of Pyrolite on Maven Central, it is time 
> for us to switch to them.






[jira] [Commented] (SPARK-7909) spark-ec2 and associated tools not py3 ready

2015-05-28 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564168#comment-14564168
 ] 

Shivaram Venkataraman commented on SPARK-7909:
--

The packages will get to S3 once the 1.4 release is finalized. We are still 
testing / voting on release candidates, and you can follow this on the Spark 
developer mailing list.  BTW, I also have a change open against spark-ec2 for 
substituting the Spark version based on a pattern: 
https://github.com/mesos/spark-ec2/pull/116/files#diff-1d040c3294246f2b59643d63868fc2ad,
 so that should take care of picking up the binary once it's released.

However, feel free to send out PRs for the other Python 3 print fixes you had 
to make in init.sh etc. 

> spark-ec2 and associated tools not py3 ready
> 
>
> Key: SPARK-7909
> URL: https://issues.apache.org/jira/browse/SPARK-7909
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
> Environment: ec2 python3
>Reporter: Matthew Goodman
>
> At present there is no permutation of tools that supports Python 3 on both 
> the launching computer and the running cluster.  There are a couple of 
> problems involved:
>  - There is no prebuilt Spark binary with Python 3 support.
>  - spark-ec2/spark/init.sh contains inline Python 3-unfriendly print statements.
>  - Config files for cluster processes don't seem to make it to all nodes in a 
> working format.
> I have fixes for some of this, but the config and running-context debugging 
> remains elusive to me.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7909) spark-ec2 and associated tools not py3 ready

2015-05-28 Thread Matthew Goodman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564166#comment-14564166
 ] 

Matthew Goodman commented on SPARK-7909:


Using the prebuilt binaries from the links provided yields a working cluster.  
Is there a timeline for when the Spark 1.4.0 binaries will land in the S3 
bucket?  I can add the link to the spark/init.sh script, but it will bounce 
until the binary is actually placed in the bucket.

In either case I suspect the naming convention will be similar, so would a PR 
for the changes outlined above be a good step at this stage?

> spark-ec2 and associated tools not py3 ready
> 
>
> Key: SPARK-7909
> URL: https://issues.apache.org/jira/browse/SPARK-7909
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
> Environment: ec2 python3
>Reporter: Matthew Goodman
>
> At present there is no permutation of tools that supports Python 3 
> on both the launching computer and the running cluster.  There are a couple of 
> problems involved:
>  - There is no prebuilt Spark binary with Python 3 support.
>  - spark-ec2/spark/init.sh contains inline Python 3-unfriendly print statements
>  - Config files for cluster processes don't seem to make it to all nodes in a 
> working format.
> I have fixes for some of this, but the config and running-context debugging 
> remains elusive to me.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7935) sparkContext in SparkPlan is better defined as a val

2015-05-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7935:
---

Assignee: Apache Spark

> sparkContext in SparkPlan is better defined as a val
> 
>
> Key: SPARK-7935
> URL: https://issues.apache.org/jira/browse/SPARK-7935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: baishuo
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7935) sparkContext in SparkPlan is better defined as a val

2015-05-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7935:
---

Assignee: (was: Apache Spark)

> sparkContext in SparkPlan is better defined as a val
> 
>
> Key: SPARK-7935
> URL: https://issues.apache.org/jira/browse/SPARK-7935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: baishuo
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7935) sparkContext in SparkPlan is better defined as a val

2015-05-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564159#comment-14564159
 ] 

Apache Spark commented on SPARK-7935:
-

User 'baishuo' has created a pull request for this issue:
https://github.com/apache/spark/pull/6486

> sparkContext in SparkPlan is better defined as a val
> 
>
> Key: SPARK-7935
> URL: https://issues.apache.org/jira/browse/SPARK-7935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: baishuo
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7929) Remove Bagel examples

2015-05-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564144#comment-14564144
 ] 

Apache Spark commented on SPARK-7929:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/6487

> Remove Bagel examples
> -
>
> Key: SPARK-7929
> URL: https://issues.apache.org/jira/browse/SPARK-7929
> Project: Spark
>  Issue Type: Task
>  Components: Examples, GraphX
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Bagel has been deprecated for a while. We should remove the example code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7935) sparkContext in SparkPlan is better defined as a val

2015-05-28 Thread baishuo (JIRA)
baishuo created SPARK-7935:
--

 Summary: sparkContext in SparkPlan is better defined as a val
 Key: SPARK-7935
 URL: https://issues.apache.org/jira/browse/SPARK-7935
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: baishuo
Priority: Minor
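
Since the description is empty, here is a hedged illustration of the rationale 
the title suggests (simplified Scala, not SparkPlan itself): a {{def}} member is 
re-evaluated on every access, while a {{val}} is computed once.

{code}
class Example {
  private def lookup(): String = { println("lookup"); "sparkContext" }
  def asDef: String = lookup()  // prints "lookup" on every access
  val asVal: String = lookup()  // prints "lookup" once, at construction
}
{code}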






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7934) In some cases, Spark hangs in yarn-client mode.

2015-05-28 Thread Guoqiang Li (JIRA)
Guoqiang Li created SPARK-7934:
--

 Summary: In some cases, Spark hangs in yarn-client mode.
 Key: SPARK-7934
 URL: https://issues.apache.org/jira/browse/SPARK-7934
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.3.1
Reporter: Guoqiang Li


The log:

 

 {noformat}
15/05/29 10:20:20 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
15/05/29 10:20:20 INFO SecurityManager: Changing view acls to: spark
15/05/29 10:20:20 INFO SecurityManager: Changing modify acls to: spark
15/05/29 10:20:20 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(spark); users with 
modify permissions: Set(spark)
15/05/29 10:20:20 INFO HttpServer: Starting HTTP Server
15/05/29 10:20:20 INFO Server: jetty-8.y.z-SNAPSHOT
15/05/29 10:20:20 INFO AbstractConnector: Started SocketConnector@0.0.0.0:54276
15/05/29 10:20:20 INFO Utils: Successfully started service 'HTTP class server' 
on port 54276.
15/05/29 10:20:31 INFO SparkContext: Running Spark version 1.3.1
15/05/29 10:20:31 WARN SparkConf: The configuration option 
'spark.yarn.user.classpath.first' has been replaced as of Spark 1.3 and may be 
removed in the future. Use spark.{driver,executor}.userClassPathFirst instead.
15/05/29 10:20:31 INFO SecurityManager: Changing view acls to: spark
15/05/29 10:20:31 INFO SecurityManager: Changing modify acls to: spark
15/05/29 10:20:31 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(spark); users with 
modify permissions: Set(spark)
15/05/29 10:20:32 INFO Slf4jLogger: Slf4jLogger started
15/05/29 10:20:32 INFO Remoting: Starting remoting
15/05/29 10:20:33 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkdri...@10dian71.domain.test:55492]
15/05/29 10:20:33 INFO Utils: Successfully started service 'sparkDriver' on 
port 55492.
15/05/29 10:20:33 INFO SparkEnv: Registering MapOutputTracker
15/05/29 10:20:33 INFO SparkEnv: Registering BlockManagerMaster
15/05/29 10:20:33 INFO DiskBlockManager: Created local directory at 
/tmp/spark-94c41fce-1788-484e-9878-88d1bf8c7247/blockmgr-b3d7ba9d-6656-408f-b9e2-683784493f22
15/05/29 10:20:33 INFO MemoryStore: MemoryStore started with capacity 4.1 GB
15/05/29 10:20:34 INFO HttpFileServer: HTTP File server directory is 
/tmp/spark-271bab98-b4e8-4b02-8267-0020a38f355b/httpd-92bb8c15-51a7-4b40-9d01-2fb01cfbb148
15/05/29 10:20:34 INFO HttpServer: Starting HTTP Server
15/05/29 10:20:34 INFO Server: jetty-8.y.z-SNAPSHOT
15/05/29 10:20:34 INFO AbstractConnector: Started SocketConnector@0.0.0.0:38530
15/05/29 10:20:34 INFO Utils: Successfully started service 'HTTP file server' 
on port 38530.
15/05/29 10:20:34 INFO SparkEnv: Registering OutputCommitCoordinator
15/05/29 10:20:34 INFO Server: jetty-8.y.z-SNAPSHOT
15/05/29 10:20:34 INFO AbstractConnector: Started 
SelectChannelConnector@0.0.0.0:4040
15/05/29 10:20:34 INFO Utils: Successfully started service 'SparkUI' on port 
4040.
15/05/29 10:20:34 INFO SparkUI: Started SparkUI at 
http://10dian71.domain.test:4040
15/05/29 10:20:34 INFO SparkContext: Added JAR 
file:/opt/spark/spark-1.3.0-cdh5/lib/hadoop-lzo-0.4.15-gplextras5.0.1-SNAPSHOT.jar
 at 
http://192.168.10.71:38530/jars/hadoop-lzo-0.4.15-gplextras5.0.1-SNAPSHOT.jar 
with timestamp 1432866034769
15/05/29 10:20:34 INFO SparkContext: Added JAR 
file:/opt/spark/classes/toona-assembly.jar at 
http://192.168.10.71:38530/jars/toona-assembly.jar with timestamp 1432866034972
15/05/29 10:20:35 INFO RMProxy: Connecting to ResourceManager at 
10dian72/192.168.10.72:9080
15/05/29 10:20:36 INFO Client: Requesting a new application from cluster with 9 
NodeManagers
15/05/29 10:20:36 INFO Client: Verifying our application has not requested more 
than the maximum memory capability of the cluster (10240 MB per container)
15/05/29 10:20:36 INFO Client: Will allocate AM container, with 896 MB memory 
including 384 MB overhead
15/05/29 10:20:36 INFO Client: Setting up container launch context for our AM
15/05/29 10:20:36 INFO Client: Preparing resources for our AM container
15/05/29 10:20:37 INFO Client: Uploading resource 
file:/opt/spark/spark-1.3.0-cdh5/lib/spark-assembly-1.3.2-SNAPSHOT-hadoop2.3.0-cdh5.0.1.jar
 -> 
hdfs://ns1/user/spark/.sparkStaging/application_1429108701044_0881/spark-assembly-1.3.2-SNAPSHOT-hadoop2.3.0-cdh5.0.1.jar
15/05/29 10:20:39 INFO Client: Uploading resource 
hdfs://ns1:8020/input/lbs/recommend/toona/spark/conf -> 
hdfs://ns1/user/spark/.sparkStaging/application_1429108701044_0881/conf
15/05/29 10:20:41 INFO Client: Setting up the launch environment for our AM 
container
15/05/29 10:20:42 INFO SecurityManager: Changing view acls to: spark
15/05/29 10:20:42 INFO SecurityManager: Changing modify acls to: spark
15/05/29 10:20:42 INF

[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer

2015-05-28 Thread Akshat Aranya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564092#comment-14564092
 ] 

Akshat Aranya commented on SPARK-7708:
--

Wow, that's some serious sleuthing! I will try the newer version of Kryo and 
see if the rest of my serialization problems go away.

> Incorrect task serialization with Kryo closure serializer
> -
>
> Key: SPARK-7708
> URL: https://issues.apache.org/jira/browse/SPARK-7708
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.2
>Reporter: Akshat Aranya
>
> I've been investigating the use of Kryo for closure serialization with Spark 
> 1.2, and it seems like I've hit upon a bug:
> When a task is serialized before scheduling, the following log message is 
> generated:
> [info] o.a.s.s.TaskSetManager - Starting task 124.1 in stage 0.0 (TID 342, 
> , PROCESS_LOCAL, 302 bytes)
> This message comes from TaskSetManager which serializes the task using the 
> closure serializer.  Before the message is sent out, the TaskDescription 
> (which includes the original task as a byte array) is serialized again into 
> a byte array with the closure serializer.  I added a log message for this in 
> CoarseGrainedSchedulerBackend, which produces the following output:
> [info] o.a.s.s.c.CoarseGrainedSchedulerBackend - 124.1 size=132
> The serialized size of the TaskDescription (132 bytes) turns out to be _smaller_ 
> than the serialized task that it contains (302 bytes). This implies that 
> TaskDescription.buffer is not getting serialized correctly.
> On the executor side, the deserialization produces a null value for 
> TaskDescription.buffer.
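
To restate the invariant being violated, here is a hedged Scala sketch with 
stand-in values (plain Java serialization, not Spark's actual classes): a 
wrapper serialized with the same serializer can never be smaller than the byte 
array it contains.

{code}
case class TaskDescription(taskId: Long, buffer: Array[Byte])

def serialize(obj: AnyRef): Array[Byte] = {
  val bos = new java.io.ByteArrayOutputStream()
  val oos = new java.io.ObjectOutputStream(bos)
  oos.writeObject(obj); oos.close()
  bos.toByteArray
}

val taskBytes = Array.fill[Byte](302)(1.toByte)  // stand-in for the 302-byte task
val descBytes = serialize(TaskDescription(342L, taskBytes))
assert(descBytes.length >= taskBytes.length)     // holds here; the report above saw 132 < 302
{code}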



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7826) Suppress extra calling getCacheLocs.

2015-05-28 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-7826.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6352
[https://github.com/apache/spark/pull/6352]

> Suppress extra calling getCacheLocs.
> 
>
> Key: SPARK-7826
> URL: https://issues.apache.org/jira/browse/SPARK-7826
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 1.5.0
>
>
> There are too many extra calls to the method {{getCacheLocs}} in 
> {{DAGScheduler}}, each of which involves Akka communication.
> To improve {{DAGScheduler}} performance, suppress the extra calls to this method.
> In my application with over 1200 stages, the execution time dropped from 8.5 min 
> to 3.8 min with my patch.
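
The caching idea behind the patch can be sketched as a simple memoization 
(illustrative Scala; the real DAGScheduler keeps a similar {{cacheLocs}} map, 
but the names here are stand-ins):

{code}
import scala.collection.mutable

val cacheLocs = mutable.HashMap.empty[Int, Seq[String]]

// Stand-in for the expensive Akka round trip to the BlockManagerMaster.
def fetchLocs(rddId: Int): Seq[String] = Seq(s"host-for-$rddId")

// Repeated lookups within a scheduling pass now hit the map instead of Akka.
def getCacheLocs(rddId: Int): Seq[String] =
  cacheLocs.getOrElseUpdate(rddId, fetchLocs(rddId))
{code}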



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7933) The default merge script JIRA username / password should be empty

2015-05-28 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-7933:
--
Description: It looks like this was changed accidentally a few months ago.  
(was: It looks like this was added accidentally when [~pwendell] merged a PR a 
few months ago.)
Summary: The default merge script JIRA username / password should be 
empty  (was: Patrick's username / password shouldn't be the defaults in the 
merge script)

> The default merge script JIRA username / password should be empty
> -
>
> Key: SPARK-7933
> URL: https://issues.apache.org/jira/browse/SPARK-7933
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
> Fix For: 1.4.0
>
>
> It looks like this was changed accidentally a few months ago.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7933) Patrick's username / password shouldn't be the defaults in the merge script

2015-05-28 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564074#comment-14564074
 ] 

Patrick Wendell commented on SPARK-7933:


Thanks - this was a dummy password I added in there, but yeah, it's fine to have 
it be the empty string.

> Patrick's username / password shouldn't be the defaults in the merge script
> ---
>
> Key: SPARK-7933
> URL: https://issues.apache.org/jira/browse/SPARK-7933
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
> Fix For: 1.4.0
>
>
> It looks like this was added accidentally when [~pwendell] merged a PR a few 
> months ago.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7933) Patrick's username / password shouldn't be the defaults in the merge script

2015-05-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-7933.

   Resolution: Fixed
Fix Version/s: 1.4.0

> Patrick's username / password shouldn't be the defaults in the merge script
> ---
>
> Key: SPARK-7933
> URL: https://issues.apache.org/jira/browse/SPARK-7933
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
> Fix For: 1.4.0
>
>
> It looks like this was added accidentally when [~pwendell] merged a PR a few 
> months ago.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7933) Patrick's username / password shouldn't be the defaults in the merge script

2015-05-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7933:
---

Assignee: Apache Spark  (was: Kay Ousterhout)

> Patrick's username / password shouldn't be the defaults in the merge script
> ---
>
> Key: SPARK-7933
> URL: https://issues.apache.org/jira/browse/SPARK-7933
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Kay Ousterhout
>Assignee: Apache Spark
>Priority: Minor
>
> It looks like this was added accidentally when [~pwendell] merged a PR a few 
> months ago.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7933) Patrick's username / password shouldn't be the defaults in the merge script

2015-05-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7933:
---

Assignee: Kay Ousterhout  (was: Apache Spark)

> Patrick's username / password shouldn't be the defaults in the merge script
> ---
>
> Key: SPARK-7933
> URL: https://issues.apache.org/jira/browse/SPARK-7933
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
>
> It looks like this was added accidentally when [~pwendell] merged a PR a few 
> months ago.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7933) Patrick's username / password shouldn't be the defaults in the merge script

2015-05-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564073#comment-14564073
 ] 

Apache Spark commented on SPARK-7933:
-

User 'kayousterhout' has created a pull request for this issue:
https://github.com/apache/spark/pull/6485

> Patrick's username / password shouldn't be the defaults in the merge script
> ---
>
> Key: SPARK-7933
> URL: https://issues.apache.org/jira/browse/SPARK-7933
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
>
> It looks like this was added accidentally when [~pwendell] merged a PR a few 
> months ago.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7933) Patrick's username / password shouldn't be the defaults in the merge script

2015-05-28 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-7933:
-

 Summary: Patrick's username / password shouldn't be the defaults 
in the merge script
 Key: SPARK-7933
 URL: https://issues.apache.org/jira/browse/SPARK-7933
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout
Priority: Minor


It looks like this was added accidentally when [~pwendell] merged a PR a few 
months ago.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7932) Scheduler delay shown in event timeline is incorrect

2015-05-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7932:
---

Assignee: Kay Ousterhout  (was: Apache Spark)

> Scheduler delay shown in event timeline is incorrect
> 
>
> Key: SPARK-7932
> URL: https://issues.apache.org/jira/browse/SPARK-7932
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
>
> In StagePage.scala, we round *down* to the nearest percent when computing the 
> proportion of a task's time spent in each phase of execution.  Scheduler 
> delay is computed by taking 100 - sum(all other proportions), which means 
> that a few extra percent may go into the scheduler delay.  As a result, 
> scheduler delay can appear larger in the visualization than it actually is.
> cc [~shivaram]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7932) Scheduler delay shown in event timeline is incorrect

2015-05-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564065#comment-14564065
 ] 

Apache Spark commented on SPARK-7932:
-

User 'kayousterhout' has created a pull request for this issue:
https://github.com/apache/spark/pull/6484

> Scheduler delay shown in event timeline is incorrect
> 
>
> Key: SPARK-7932
> URL: https://issues.apache.org/jira/browse/SPARK-7932
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
>
> In StagePage.scala, we round *down* to the nearest percent when computing the 
> proportion of a task's time spent in each phase of execution.  Scheduler 
> delay is computed by taking 100 - sum(all other proportions), which means 
> that a few extra percent may go into the scheduler delay.  As a result, 
> scheduler delay can appear larger in the visualization than it actually is.
> cc [~shivaram]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7932) Scheduler delay shown in event timeline is incorrect

2015-05-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7932:
---

Assignee: Apache Spark  (was: Kay Ousterhout)

> Scheduler delay shown in event timeline is incorrect
> 
>
> Key: SPARK-7932
> URL: https://issues.apache.org/jira/browse/SPARK-7932
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Kay Ousterhout
>Assignee: Apache Spark
>Priority: Minor
>
> In StagePage.scala, we round *down* to the nearest percent when computing the 
> proportion of a task's time spent in each phase of execution.  Scheduler 
> delay is computed by taking 100 - sum(all other proportions), which means 
> that a few extra percent may go into the scheduler delay.  As a result, 
> scheduler delay can appear larger in the visualization than it actually is.
> cc [~shivaram]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7932) Scheduler delay shown in event timeline is incorrect

2015-05-28 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-7932:
-

 Summary: Scheduler delay shown in event timeline is incorrect
 Key: SPARK-7932
 URL: https://issues.apache.org/jira/browse/SPARK-7932
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout
Priority: Minor


In StagePage.scala, we round *down* to the nearest percent when computing the 
proportion of a task's time spent in each phase of execution.  Scheduler delay 
is computed by taking 100 - sum(all other proportions), which means that a few 
extra percent may go into the scheduler delay.  As a result, scheduler delay 
can appear larger in the visualization than it actually is.

cc [~shivaram]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer

2015-05-28 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564054#comment-14564054
 ] 

Josh Rosen commented on SPARK-7708:
---

I've opened https://github.com/EsotericSoftware/kryo/issues/312 to discuss this 
with the Kryo developers.  Updating to Kryo 2.23.0 fixes the symptoms that 
we've observed here, but it would still be good to get confirmation that our 
re-use of {{Output}} is something that Kryo intends to support.

> Incorrect task serialization with Kryo closure serializer
> -
>
> Key: SPARK-7708
> URL: https://issues.apache.org/jira/browse/SPARK-7708
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.2
>Reporter: Akshat Aranya
>
> I've been investigating the use of Kryo for closure serialization with Spark 
> 1.2, and it seems like I've hit upon a bug:
> When a task is serialized before scheduling, the following log message is 
> generated:
> [info] o.a.s.s.TaskSetManager - Starting task 124.1 in stage 0.0 (TID 342, 
> , PROCESS_LOCAL, 302 bytes)
> This message comes from TaskSetManager which serializes the task using the 
> closure serializer.  Before the message is sent out, the TaskDescription 
> (which includes the original task as a byte array) is serialized again into 
> a byte array with the closure serializer.  I added a log message for this in 
> CoarseGrainedSchedulerBackend, which produces the following output:
> [info] o.a.s.s.c.CoarseGrainedSchedulerBackend - 124.1 size=132
> The serialized size of the TaskDescription (132 bytes) turns out to be _smaller_ 
> than the serialized task that it contains (302 bytes). This implies that 
> TaskDescription.buffer is not getting serialized correctly.
> On the executor side, the deserialization produces a null value for 
> TaskDescription.buffer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6950) Spark master UI believes some applications are in progress when they are actually completed

2015-05-28 Thread Haopu Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564042#comment-14564042
 ] 

Haopu Wang commented on SPARK-6950:
---

I hit this issue on 1.3.0 and 1.3.1.
It can be reproduced using a very simple application and a standalone cluster 
(1 master and 1 slave).

> Spark master UI believes some applications are in progress when they are 
> actually completed
> ---
>
> Key: SPARK-6950
> URL: https://issues.apache.org/jira/browse/SPARK-6950
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.3.0
>Reporter: Matt Cheah
> Fix For: 1.3.1
>
>
> In Spark 1.2.x, I was able to set my spark event log directory to be a 
> different location from the default, and after the job finishes, I can replay 
> the UI by clicking on the appropriate link under "Completed Applications".
> Now, on a non-deterministic basis (though it seems to happen most of the time), 
> when I click on the link under "Completed Applications", I instead get a 
> webpage that says:
> Application history not found (app-20150415052927-0014)
> Application myApp is still in progress.
> I am able to view the application's UI using the Spark history server, so 
> something regressed in the Spark master code between 1.2 and 1.3, but that 
> regression does not apply in the history server use case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer

2015-05-28 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564038#comment-14564038
 ] 

Josh Rosen commented on SPARK-7708:
---

Two hours later and I've now found where the state was hiding.  I discovered 
this using the following isolated test project, which explains the 3-byte size 
difference: https://gist.github.com/JoshRosen/14ba69ef53af53ef2839

Intuitively, you might think that it's in {{Output}} because using a new 
{{Output}} solves the issue.  However, it turns out that the state was hiding 
inside Kryo's {{JavaSerializer}} class:

{code}
public class JavaSerializer extends Serializer {
  // Both fields persist across write() calls: this is the hidden state.
  private ObjectOutputStream objectStream;
  private Output lastOutput;

  public JavaSerializer() {
  }

  public void write(Kryo kryo, Output output, Object object) {
    try {
      if (output != this.lastOutput) {
        // Fresh Output: a new ObjectOutputStream writes the 4-byte stream header.
        this.objectStream = new ObjectOutputStream(output);
        this.lastOutput = output;
      } else {
        // Reused Output: reset() writes only the 1-byte TC_RESET flag.
        this.objectStream.reset();
      }

      this.objectStream.writeObject(object);
      this.objectStream.flush();
    } catch (Exception var5) {
      throw new KryoException("Error during Java serialization.", var5);
    }
  }

[...]
{code}

When you pass a new output, it opens a new ObjectOutputStream and writes a new 
stream header, but when you reuse the output it only writes a reset flag.  The 
header is two shorts, which is four bytes, whereas the reset is only one byte, 
leading to the 3-byte difference.

I'm not sure whether this is a bug in Kryo or whether our re-use of Output is 
unsafe.  I'll email the developers to ask.
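
The byte counts are easy to verify against plain java.io, independent of Kryo; 
a minimal Scala sketch:

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

val fresh = new ByteArrayOutputStream()
val head = new ObjectOutputStream(fresh)
head.flush()                          // force out the stream header
println(fresh.size())                 // 4 bytes: magic short + version short

val reused = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(reused)
oos.flush()
val headerSize = reused.size()
oos.reset()                           // a reused stream emits only the TC_RESET flag
oos.flush()
println(reused.size() - headerSize)   // 1 byte, hence the 3-byte difference
{code}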

> Incorrect task serialization with Kryo closure serializer
> -
>
> Key: SPARK-7708
> URL: https://issues.apache.org/jira/browse/SPARK-7708
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.2
>Reporter: Akshat Aranya
>
> I've been investigating the use of Kryo for closure serialization with Spark 
> 1.2, and it seems like I've hit upon a bug:
> When a task is serialized before scheduling, the following log message is 
> generated:
> [info] o.a.s.s.TaskSetManager - Starting task 124.1 in stage 0.0 (TID 342, 
> , PROCESS_LOCAL, 302 bytes)
> This message comes from TaskSetManager which serializes the task using the 
> closure serializer.  Before the message is sent out, the TaskDescription 
> (which includes the original task as a byte array) is serialized again into 
> a byte array with the closure serializer.  I added a log message for this in 
> CoarseGrainedSchedulerBackend, which produces the following output:
> [info] o.a.s.s.c.CoarseGrainedSchedulerBackend - 124.1 size=132
> The serialized size of the TaskDescription (132 bytes) turns out to be _smaller_ 
> than the serialized task that it contains (302 bytes). This implies that 
> TaskDescription.buffer is not getting serialized correctly.
> On the executor side, the deserialization produces a null value for 
> TaskDescription.buffer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7930) Shutdown hook deletes root local dir before SparkContext is stopped, throwing errors

2015-05-28 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-7930:
-
Priority: Critical  (was: Blocker)

> Shutdown hook deletes root local dir before SparkContext is stopped, throwing 
> errors
> 
>
> Key: SPARK-7930
> URL: https://issues.apache.org/jira/browse/SPARK-7930
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
>
> The shutdown hook for temp directories had priority 100 while SparkContext's 
> had 50, so the local root directory was deleted before the SparkContext was 
> shut down. This leads to scary errors in running jobs at shutdown time. It is 
> especially a problem when running streaming examples, where Ctrl-C is the 
> only way to shut down.
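
For illustration, the priority semantics can be sketched in a few lines of 
Scala (illustrative names only, not Spark's actual ShutdownHookManager API); 
hooks with higher priority run first, which is why the temp-dir cleanup 
preceded the SparkContext stop:

{code}
object Hooks {
  private var hooks = List.empty[(Int, () => Unit)]
  def add(priority: Int)(body: => Unit): Unit =
    hooks = (priority, () => body) :: hooks
  // Run hooks in descending priority order, mirroring the behavior described above.
  def runAll(): Unit = hooks.sortBy(-_._1).foreach(_._2())
}

Hooks.add(100) { println("deleting temp dirs") }    // priority 100: runs first
Hooks.add(50)  { println("stopping SparkContext") } // priority 50: runs too late
Hooks.runAll()
{code}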



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7931) Do not restart a socket receiver when the receiver is being shutdown

2015-05-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7931:
---

Assignee: Apache Spark  (was: Tathagata Das)

> Do not restart a socket receiver when the receiver is being shutdown
> 
>
> Key: SPARK-7931
> URL: https://issues.apache.org/jira/browse/SPARK-7931
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Critical
>
> Attempts to restart the socket receiver when it is supposed to be stopped 
> cause undesirable error messages.
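
As a hedged sketch of the guard the fix implies (illustrative, not the actual 
ReceiverSupervisor code), the restart path just needs to consult a stop flag:

{code}
object ReceiverGuard {
  @volatile private var stopping = false

  def beginShutdown(): Unit = { stopping = true }

  // Restarting after shutdown has begun only produces the noisy
  // error messages described above, so skip it.
  def maybeRestart(reason: String): Unit =
    if (!stopping) println(s"restarting receiver: $reason")
    else println(s"receiver is stopping; ignoring restart: $reason")
}
{code}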



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7931) Do not restart a socket receiver when the receiver is being shutdown

2015-05-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564034#comment-14564034
 ] 

Apache Spark commented on SPARK-7931:
-

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/6483

> Do not restart a socket receiver when the receiver is being shutdown
> 
>
> Key: SPARK-7931
> URL: https://issues.apache.org/jira/browse/SPARK-7931
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
>
> Attempts to restart the socket receiver when it is supposed to be stopped 
> cause undesirable error messages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7931) Do not restart a socket receiver when the receiver is being shutdown

2015-05-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7931:
---

Assignee: Tathagata Das  (was: Apache Spark)

> Do not restart a socket receiver when the receiver is being shutdown
> 
>
> Key: SPARK-7931
> URL: https://issues.apache.org/jira/browse/SPARK-7931
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
>
> Attempts to restart the socket receiver when it is supposed to be stopped 
> cause undesirable error messages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7930) Shutdown hook deletes root local dir before SparkContext is stopped, throwing errors

2015-05-28 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-7930:
-
Priority: Blocker  (was: Major)

> Shutdown hook deletes root local dir before SparkContext is stopped, throwing 
> errors
> 
>
> Key: SPARK-7930
> URL: https://issues.apache.org/jira/browse/SPARK-7930
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Blocker
>
> The shutdown hook for temp directories had priority 100 while SparkContext's 
> had 50, so the local root directory was deleted before the SparkContext was 
> shut down. This leads to scary errors in running jobs at shutdown time. It is 
> especially a problem when running streaming examples, where Ctrl-C is the 
> only way to shut down.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7931) Do not restart a socket receiver when the receiver is being shutdown

2015-05-28 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-7931:


 Summary: Do not restart a socket receiver when the receiver is 
being shutdown
 Key: SPARK-7931
 URL: https://issues.apache.org/jira/browse/SPARK-7931
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Critical


Attempts to restart the socket receiver when it is supposed to be stopped 
cause undesirable error messages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7930) Shutdown hook deletes root local dir before SparkContext is stopped, throwing errors

2015-05-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564025#comment-14564025
 ] 

Apache Spark commented on SPARK-7930:
-

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/6482

> Shutdown hook deletes root local dir before SparkContext is stopped, throwing 
> errors
> 
>
> Key: SPARK-7930
> URL: https://issues.apache.org/jira/browse/SPARK-7930
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> The shutdown hook for temp directories had priority 100 while SparkContext's 
> had 50, so the local root directory was deleted before the SparkContext was 
> shut down. This leads to scary errors in running jobs at shutdown time. It is 
> especially a problem when running streaming examples, where Ctrl-C is the 
> only way to shut down.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7930) Shutdown hook deletes root local dir before SparkContext is stopped, throwing errors

2015-05-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7930:
---

Assignee: Tathagata Das  (was: Apache Spark)

> Shutdown hook deletes root local dir before SparkContext is stopped, throwing 
> errors
> 
>
> Key: SPARK-7930
> URL: https://issues.apache.org/jira/browse/SPARK-7930
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> The shutdown hook for temp directories had priority 100 while SparkContext's 
> had 50, so the local root directory was deleted before the SparkContext was 
> shut down. This leads to scary errors in running jobs at shutdown time. It is 
> especially a problem when running streaming examples, where Ctrl-C is the 
> only way to shut down.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7930) Shutdown hook deletes root local dir before SparkContext is stopped, throwing errors

2015-05-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7930:
---

Assignee: Apache Spark  (was: Tathagata Das)

> Shutdown hook deletes root local dir before SparkContext is stopped, throwing 
> errors
> 
>
> Key: SPARK-7930
> URL: https://issues.apache.org/jira/browse/SPARK-7930
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Streaming
>Reporter: Tathagata Das
>Assignee: Apache Spark
>
> The shutdown hook for temp directories had priority 100 while SparkContext's 
> had 50, so the local root directory was deleted before the SparkContext was 
> shut down. This leads to scary errors in running jobs at shutdown time. It is 
> especially a problem when running streaming examples, where Ctrl-C is the 
> only way to shut down.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7930) Shutdown hook deletes root local dir before SparkContext is stopped, throwing errors

2015-05-28 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-7930:


 Summary: Shutdown hook deletes root local dir before SparkContext 
is stopped, throwing errors
 Key: SPARK-7930
 URL: https://issues.apache.org/jira/browse/SPARK-7930
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das


The shutdown hook for temp directories had priority 100 while SparkContext's had 
50, so the local root directory was deleted before the SparkContext was shut 
down. This leads to scary errors in running jobs at shutdown time. It is 
especially a problem when running streaming examples, where Ctrl-C is the only 
way to shut down.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7927) Enforce whitespace for more tokens in style checker

2015-05-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563999#comment-14563999
 ] 

Apache Spark commented on SPARK-7927:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/6481

> Enforce whitespace for more tokens in style checker
> ---
>
> Key: SPARK-7927
> URL: https://issues.apache.org/jira/browse/SPARK-7927
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Enforce whitespace on comma, colon, if, while, etc., so we don't need to 
> keep spending time on this in code reviews.
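
For illustration, here is the kind of spacing such rules would flag versus 
accept (hypothetical snippets, not taken from the PR):

{code}
// Would be flagged: no space after the comma or colon, none around if.
def bad(x:Int,y:Int):Int = {if(x>y)x else y}

// Would pass.
def good(x: Int, y: Int): Int = { if (x > y) x else y }
{code}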



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7577) User guide update for Bucketizer

2015-05-28 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-7577.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 6451
[https://github.com/apache/spark/pull/6451]

> User guide update for Bucketizer
> 
>
> Key: SPARK-7577
> URL: https://issues.apache.org/jira/browse/SPARK-7577
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Xusen Yin
> Fix For: 1.4.0
>
>
> Copied from [SPARK-7443]:
> {quote}
> Now that we have algorithms in spark.ml which are not in spark.mllib, we 
> should start making subsections for the spark.ml API as needed. We can follow 
> the structure of the spark.mllib user guide.
> * The spark.ml user guide can provide: (a) code examples and (b) info on 
> algorithms which do not exist in spark.mllib.
> * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
> still the primary API, we should provide links to the corresponding 
> algorithms in the spark.mllib user guide for more info.
> {quote}
> Note: I created a new subsection for links to spark.ml-specific guides in 
> this JIRA's PR: [SPARK-7557]. This transformer can go within the new 
> subsection. I'll try to get that PR merged ASAP.
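
For context, a minimal sketch of the kind of spark.ml snippet the new guide 
section covers, based on the 1.4 {{Bucketizer}} API (the column names here are 
made up):

{code}
import org.apache.spark.ml.feature.Bucketizer

val splits = Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity)
val bucketizer = new Bucketizer()
  .setInputCol("features")
  .setOutputCol("bucketedFeatures")
  .setSplits(splits)
// bucketizer.transform(df) maps each value in "features" to its bucket index.
{code}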



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7929) Remove Bagel examples

2015-05-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7929:
---

Assignee: Apache Spark  (was: Reynold Xin)

> Remove Bagel examples
> -
>
> Key: SPARK-7929
> URL: https://issues.apache.org/jira/browse/SPARK-7929
> Project: Spark
>  Issue Type: Task
>  Components: Examples, GraphX
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> Bagel has been deprecated for a while. We should remove the example code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7929) Remove Bagel examples

2015-05-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7929:
---

Assignee: Reynold Xin  (was: Apache Spark)

> Remove Bagel examples
> -
>
> Key: SPARK-7929
> URL: https://issues.apache.org/jira/browse/SPARK-7929
> Project: Spark
>  Issue Type: Task
>  Components: Examples, GraphX
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Bagel has been deprecated for a while. We should remove the example code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7929) Remove Bagel examples

2015-05-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563989#comment-14563989
 ] 

Apache Spark commented on SPARK-7929:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/6480

> Remove Bagel examples
> -
>
> Key: SPARK-7929
> URL: https://issues.apache.org/jira/browse/SPARK-7929
> Project: Spark
>  Issue Type: Task
>  Components: Examples, GraphX
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Bagel has been deprecated for a while. We should remove the example code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7929) Remove Bagel examples

2015-05-28 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-7929:
--

 Summary: Remove Bagel examples
 Key: SPARK-7929
 URL: https://issues.apache.org/jira/browse/SPARK-7929
 Project: Spark
  Issue Type: Task
  Components: Examples, GraphX
Reporter: Reynold Xin
Assignee: Reynold Xin


Bagel has been deprecated for a while. We should remove the example code.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7038) [Streaming] Spark Sink requires spark assembly in classpath

2015-05-28 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563985#comment-14563985
 ] 

Hari Shreedharan commented on SPARK-7038:
-

[~vanzin] - Does adding the shade plugin to the pom for the sink fix this issue?

> [Streaming] Spark Sink requires spark assembly in classpath
> ---
>
> Key: SPARK-7038
> URL: https://issues.apache.org/jira/browse/SPARK-7038
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.1
>Reporter: Hari Shreedharan
>
> In Spark 1.3.0, we shaded Guava, which means that the Spark Sink's Guava 
> dependency is no longer standard Guava - thus the one on Flume's 
> classpath does not work and can throw a NoClassDefFoundError while using the 
> Spark Sink.
> We must pull the Guava dependency into the Spark Sink jar.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7819) Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error

2015-05-28 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563978#comment-14563978
 ] 

Yin Huai commented on SPARK-7819:
-

BTW, the fix I did is 
https://github.com/apache/spark/commit/572b62cafe4bc7b1d464c9dcfb449c9d53456826.

> Isolated Hive Client Loader appears to cause Native Library 
> libMapRClient.4.0.2-mapr.so already loaded in another classloader error
> ---
>
> Key: SPARK-7819
> URL: https://issues.apache.org/jira/browse/SPARK-7819
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Fi
>Priority: Critical
> Attachments: stacktrace.txt, test.py
>
>
> In reference to the pull request: https://github.com/apache/spark/pull/5876
> I have been running the Spark 1.3 branch for some time with no major hiccups, 
> and recently switched to the Spark 1.4 branch.
> I build my spark distribution with the following build command:
> {noformat}
> make-distribution.sh --tgz --skip-java-test --with-tachyon -Phive 
> -Phive-0.13.1 -Pmapr4 -Pspark-ganglia-lgpl -Pkinesis-asl -Phive-thriftserver
> {noformat}
> When running a python script containing a series of smoke tests I use to 
> validate the build, I encountered an error under the following conditions:
> * start a spark context
> * start a hive context
> * run any hive query
> * stop the spark context
> * start a second spark context
> * run any hive query
> ** ERROR
> From what I can tell, the Isolated Class Loader is hitting a MapR class that 
> is loading its native library (presumably as part of a static initializer).
> Unfortunately, the JVM prohibits this the second time around.
> I would think that shutting down the SparkContext would clear out any 
> vestiges from the JVM, so I'm surprised that this would even be a problem.
> Note: all other smoke tests we are running pass fine.
> I will attach the stacktrace and a python script reproducing the issue (at 
> least for my environment and build).
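
A hedged Scala sketch of those reproduction steps (the attached test.py does 
the equivalent in Python; the app name and master URL are placeholders):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("smoke-test").setMaster("local[2]")

val sc1 = new SparkContext(conf)
new HiveContext(sc1).sql("SHOW TABLES").collect()  // any Hive query: works
sc1.stop()

val sc2 = new SparkContext(conf)
new HiveContext(sc2).sql("SHOW TABLES").collect()  // fails: native library already loaded
sc2.stop()
{code}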



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7819) Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error

2015-05-28 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-7819:

Target Version/s: 1.4.1, 1.5.0  (was: 1.4.1)

> Isolated Hive Client Loader appears to cause Native Library 
> libMapRClient.4.0.2-mapr.so already loaded in another classloader error
> ---
>
> Key: SPARK-7819
> URL: https://issues.apache.org/jira/browse/SPARK-7819
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Fi
>Priority: Critical
> Attachments: stacktrace.txt, test.py
>
>
> In reference to the pull request: https://github.com/apache/spark/pull/5876
> I have been running the Spark 1.3 branch for some time with no major hiccups, 
> and recently switched to the Spark 1.4 branch.
> I build my spark distribution with the following build command:
> {noformat}
> make-distribution.sh --tgz --skip-java-test --with-tachyon -Phive 
> -Phive-0.13.1 -Pmapr4 -Pspark-ganglia-lgpl -Pkinesis-asl -Phive-thriftserver
> {noformat}
> When running a python script containing a series of smoke tests I use to 
> validate the build, I encountered an error under the following conditions:
> * start a spark context
> * start a hive context
> * run any hive query
> * stop the spark context
> * start a second spark context
> * run any hive query
> ** ERROR
> From what I can tell, the Isolated Class Loader is hitting a MapR class that 
> is loading its native library (presumably as part of a static initializer).
> Unfortunately, the JVM prohibits this the second time around.
> I would think that shutting down the SparkContext would clear out any 
> vestiges from the JVM, so I'm surprised that this would even be a problem.
> Note: all other smoke tests we are running pass fine.
> I will attach the stacktrace and a python script reproducing the issue (at 
> least for my environment and build).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7837) NPE when save as parquet in speculative tasks

2015-05-28 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563977#comment-14563977
 ] 

Yin Huai commented on SPARK-7837:
-

We have made the parquet reader side robust to files left in _temporary. So, 
this problem should have a much smaller impact. 

I am re-targeting it to 1.5. Will keep an eye on it and investigate the root 
cause.

> NPE when save as parquet in speculative tasks
> -
>
> Key: SPARK-7837
> URL: https://issues.apache.org/jira/browse/SPARK-7837
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Yin Huai
>Priority: Critical
>
> The query is like {{df.orderBy(...).saveAsTable(...)}}.
> When there are no partitioning columns and there is a skewed key, I found the 
> following exception in speculative tasks. After these failures, it seems we 
> could not call {{SparkHadoopMapRedUtil.commitTask}} correctly.
> {code}
> java.lang.NullPointerException
>   at 
> parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:146)
>   at 
> parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
>   at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
>   at 
> org.apache.spark.sql.parquet.ParquetOutputWriter.close(newParquet.scala:115)
>   at 
> org.apache.spark.sql.sources.DefaultWriterContainer.abortTask(commands.scala:385)
>   at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:150)
>   at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122)
>   at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
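A hedged sketch of the failing pattern described above (table and column names 
are placeholders; speculation must be enabled for speculative attempts to run):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().set("spark.speculation", "true")
val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)

// "events" and "key" are illustrative; the key distribution is skewed,
// so some tasks run long and get speculative duplicates.
val df = sqlContext.table("events")
df.orderBy("key").saveAsTable("events_sorted")  // NPE in speculative attempts
{code}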



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7819) Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error

2015-05-28 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-7819:

Target Version/s: 1.4.1  (was: 1.4.0)

> Isolated Hive Client Loader appears to cause Native Library 
> libMapRClient.4.0.2-mapr.so already loaded in another classloader error
> ---
>
> Key: SPARK-7819
> URL: https://issues.apache.org/jira/browse/SPARK-7819
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Fi
>Priority: Critical
> Attachments: stacktrace.txt, test.py
>
>
> In reference to the pull request: https://github.com/apache/spark/pull/5876
> I have been running the Spark 1.3 branch for some time with no major hiccups, 
> and recently switched to the Spark 1.4 branch.
> I build my spark distribution with the following build command:
> {noformat}
> make-distribution.sh --tgz --skip-java-test --with-tachyon -Phive 
> -Phive-0.13.1 -Pmapr4 -Pspark-ganglia-lgpl -Pkinesis-asl -Phive-thriftserver
> {noformat}
> When running a python script containing a series of smoke tests I use to 
> validate the build, I encountered an error under the following conditions:
> * start a spark context
> * start a hive context
> * run any hive query
> * stop the spark context
> * start a second spark context
> * run any hive query
> ** ERROR
> From what I can tell, the Isolated Class Loader is hitting a MapR class that 
> is loading its native library (presumably as part of a static initializer).
> Unfortunately, the JVM prohibits this the second time around.
> I would think that shutting down the SparkContext would clear out any 
> vestiges from the JVM, so I'm surprised that this would even be a problem.
> Note: all other smoke tests we are running pass fine.
> I will attach the stacktrace and a python script reproducing the issue (at 
> least for my environment and build).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7837) NPE when save as parquet in speculative tasks

2015-05-28 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-7837:

Target Version/s: 1.5.0  (was: 1.4.0)

> NPE when save as parquet in speculative tasks
> -
>
> Key: SPARK-7837
> URL: https://issues.apache.org/jira/browse/SPARK-7837
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Yin Huai
>Priority: Critical
>
> The query is like {{df.orderBy(...).saveAsTable(...)}}.
> When there are no partitioning columns and there is a skewed key, I found the 
> following exception in speculative tasks. After these failures, it seems we 
> could not call {{SparkHadoopMapRedUtil.commitTask}} correctly.
> {code}
> java.lang.NullPointerException
>   at 
> parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:146)
>   at 
> parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
>   at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
>   at 
> org.apache.spark.sql.parquet.ParquetOutputWriter.close(newParquet.scala:115)
>   at 
> org.apache.spark.sql.sources.DefaultWriterContainer.abortTask(commands.scala:385)
>   at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:150)
>   at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122)
>   at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7819) Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error

2015-05-28 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563973#comment-14563973
 ] 

Yin Huai commented on SPARK-7819:
-

[~coderfi] I just checked in a bug fix related to the class loader and the 
Spark SQL conf set in the Spark conf (e.g. spark-defaults). Can you try the 
latest 1.4 branch and put the following entry in 
{{conf/spark-defaults.conf}}?

{{spark.sql.hive.metastore.sharedPrefixes 
com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni}}

Basically, it contains a few packages for JDBC drivers and a few MapR packages.

We will try to figure out a way to let JNI libs work with our two classloaders.
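For reference, the same property can presumably also be set programmatically 
before the HiveContext is created (a sketch; the spark-defaults.conf entry 
above is the suggested route):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf()
  .setAppName("shared-prefixes-test")  // app name is a placeholder
  .set("spark.sql.hive.metastore.sharedPrefixes",
    "com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc," +
    "com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni")

val sc = new SparkContext(conf)
val hc = new HiveContext(sc)  // prefixes are read when the isolated
                              // client loader is constructed
{code}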

> Isolated Hive Client Loader appears to cause Native Library 
> libMapRClient.4.0.2-mapr.so already loaded in another classloader error
> ---
>
> Key: SPARK-7819
> URL: https://issues.apache.org/jira/browse/SPARK-7819
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Fi
>Priority: Critical
> Attachments: stacktrace.txt, test.py
>
>
> In reference to the pull request: https://github.com/apache/spark/pull/5876
> I have been running the Spark 1.3 branch for some time with no major hiccups, 
> and recently switched to the Spark 1.4 branch.
> I build my spark distribution with the following build command:
> {noformat}
> make-distribution.sh --tgz --skip-java-test --with-tachyon -Phive 
> -Phive-0.13.1 -Pmapr4 -Pspark-ganglia-lgpl -Pkinesis-asl -Phive-thriftserver
> {noformat}
> When running a python script containing a series of smoke tests I use to 
> validate the build, I encountered an error under the following conditions:
> * start a spark context
> * start a hive context
> * run any hive query
> * stop the spark context
> * start a second spark context
> * run any hive query
> ** ERROR
> From what I can tell, the Isolated Class Loader is hitting a MapR class that 
> is loading its native library (presumably as part of a static initializer).
> Unfortunately, the JVM prohibits this the second time around.
> I would think that shutting down the SparkContext would clear out any 
> vestiges from the JVM, so I'm surprised that this would even be a problem.
> Note: all other smoke tests we are running pass fine.
> I will attach the stacktrace and a python script reproducing the issue (at 
> least for my environment and build).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7853) ClassNotFoundException for SparkSQL

2015-05-28 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-7853.
-
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 6459
[https://github.com/apache/spark/pull/6459]

> ClassNotFoundException for SparkSQL
> ---
>
> Key: SPARK-7853
> URL: https://issues.apache.org/jira/browse/SPARK-7853
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Cheng Hao
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.4.0
>
>
> Steps to reproduce:
> {code}
> bin/spark-sql --jars 
> ./sql/hive/src/test/resources/hive-hcatalog-core-0.13.1.jar
> CREATE TABLE t1(a string, b string) ROW FORMAT SERDE 
> 'org.apache.hive.hcatalog.data.JsonSerDe';
> {code}
> Throws an exception like:
> {noformat}
> 15/05/26 00:16:33 ERROR SparkSQLDriver: Failed in [CREATE TABLE t1(a string, 
> b string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe']
> org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution 
> Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot 
> validate serde: org.apache.hive.hcatalog.data.JsonSerDe
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:333)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:310)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:139)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:310)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:300)
>   at 
> org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:457)
>   at 
> org.apache.spark.sql.hive.execution.HiveNativeCommand.run(HiveNativeCommand.scala:33)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:922)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:922)
>   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:147)
>   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:727)
>   at 
> org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:57)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:283)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:218)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7198) VectorAssembler should carry ML metadata

2015-05-28 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-7198.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 6452
[https://github.com/apache/spark/pull/6452]

> VectorAssembler should carry ML metadata
> 
>
> Key: SPARK-7198
> URL: https://issues.apache.org/jira/browse/SPARK-7198
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.4.0
>
>
> Now it only outputs assembled vectors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7910) Expose partitioner information in JavaRDD

2015-05-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7910:
---

Assignee: (was: Apache Spark)

> Expose partitioner information in JavaRDD
> -
>
> Key: SPARK-7910
> URL: https://issues.apache.org/jira/browse/SPARK-7910
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API
>Reporter: holdenk
>Priority: Minor
>
> It would be useful to expose the partitioner info in the Java & Python APIs 
> for RDDs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7910) Expose partitioner information in JavaRDD

2015-05-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7910:
---

Assignee: Apache Spark

> Expose partitioner information in JavaRDD
> -
>
> Key: SPARK-7910
> URL: https://issues.apache.org/jira/browse/SPARK-7910
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API
>Reporter: holdenk
>Assignee: Apache Spark
>Priority: Minor
>
> It would be useful to expose the partitioner info in the Java & Python APIs 
> for RDDs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7890) Document that Spark 2.11 now supports Kafka

2015-05-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7890:
---

Assignee: Iulian Dragos  (was: Apache Spark)

> Document that Spark 2.11 now supports Kafka
> ---
>
> Key: SPARK-7890
> URL: https://issues.apache.org/jira/browse/SPARK-7890
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Patrick Wendell
>Assignee: Iulian Dragos
>Priority: Critical
>
> The building-spark.html page needs to be updated. It's a simple fix: just 
> remove the caveat about Kafka.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7890) Document that Spark 2.11 now supports Kafka

2015-05-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7890:
---

Assignee: Apache Spark  (was: Iulian Dragos)

> Document that Spark 2.11 now supports Kafka
> ---
>
> Key: SPARK-7890
> URL: https://issues.apache.org/jira/browse/SPARK-7890
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Patrick Wendell
>Assignee: Apache Spark
>Priority: Critical
>
> The building-spark.html page needs to be updated. It's a simple fix: just 
> remove the caveat about Kafka.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer

2015-05-28 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563831#comment-14563831
 ] 

Josh Rosen commented on SPARK-7708:
---

I think that there might be some state inside of the Kryo {{Output}} which 
isn't being reset properly after {{clear()}}.  I'm investigating now.
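A small sketch of the kind of reuse under suspicion, using plain Kryo rather 
than Spark's wrapper (buffer sizes and the payload are arbitrary):

{code}
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output

val kryo = new Kryo()
val out = new Output(4096, -1)  // one Output reused across writes

kryo.writeClassAndObject(out, "payload")
val first = out.position()
out.clear()  // resets the Output's position...
kryo.writeClassAndObject(out, "payload")
val second = out.position()  // ...but if other state (e.g. cached class
                             // names) survives, first and second differ
{code}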

> Incorrect task serialization with Kryo closure serializer
> -
>
> Key: SPARK-7708
> URL: https://issues.apache.org/jira/browse/SPARK-7708
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.2
>Reporter: Akshat Aranya
>
> I've been investigating the use of Kryo for closure serialization with Spark 
> 1.2, and it seems like I've hit upon a bug:
> When a task is serialized before scheduling, the following log message is 
> generated:
> [info] o.a.s.s.TaskSetManager - Starting task 124.1 in stage 0.0 (TID 342, 
> , PROCESS_LOCAL, 302 bytes)
> This message comes from TaskSetManager, which serializes the task using the 
> closure serializer.  Before the message is sent out, the TaskDescription 
> (which includes the original task as a byte array) is serialized again into 
> a byte array with the closure serializer.  I added a log message for this in 
> CoarseGrainedSchedulerBackend, which produces the following output:
> [info] o.a.s.s.c.CoarseGrainedSchedulerBackend - 124.1 size=132
> The serialized size of the TaskDescription (132 bytes) turns out to be 
> _smaller_ than the serialized task that it contains (302 bytes). This implies 
> that TaskDescription.buffer is not getting serialized correctly.
> On the executor side, the deserialization produces a null value for 
> TaskDescription.buffer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7928) Yarn App Master Logs are not displayed in the spark historyserver UI

2015-05-28 Thread Hari Shreedharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Shreedharan resolved SPARK-7928.
-
Resolution: Duplicate

This issue was fixed by https://github.com/apache/spark/pull/6166

Can you try that patch and see if it works for you?

> Yarn App Master Logs are not displayed in the spark historyserver UI
> 
>
> Key: SPARK-7928
> URL: https://issues.apache.org/jira/browse/SPARK-7928
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.1, 1.3.1
> Environment: yarn hadoop 2.7.0
>Reporter: Aditya Rao
>
> In Hadoop 2.7.0 the link to the App Master log has been disabled, as the YARN 
> Job History Server shows the app master logs in the Resource Manager UI, but 
> the Spark History Server doesn't show the app master logs.
> So anyone running a Spark job in yarn-cluster mode has no way to see the 
> results other than by checking the userlogs manually.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6101) Create a SparkSQL DataSource API implementation for DynamoDB

2015-05-28 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly updated SPARK-6101:

Description: similar to https://github.com/databricks/spark-avro  and 
https://github.com/databricks/spark-csv  (was: similar to 
https://github.com/databricks/spark-avro)
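By analogy with those two packages, usage would presumably look something like 
this (the format name and option keys below are invented for illustration):

{code}
// Hypothetical: "com.example.spark.dynamodb" and its options do not exist yet.
val df = sqlContext.read
  .format("com.example.spark.dynamodb")
  .option("table", "my-table")
  .option("region", "us-east-1")
  .load()

df.registerTempTable("my_table")
sqlContext.sql("SELECT * FROM my_table LIMIT 10").show()
{code}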

> Create a SparkSQL DataSource API implementation for DynamoDB
> 
>
> Key: SPARK-6101
> URL: https://issues.apache.org/jira/browse/SPARK-6101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Chris Fregly
>Assignee: Chris Fregly
> Fix For: 1.5.0
>
>
> similar to https://github.com/databricks/spark-avro  and 
> https://github.com/databricks/spark-csv



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7927) Enforce whitespace for more tokens in style checker

2015-05-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563820#comment-14563820
 ] 

Apache Spark commented on SPARK-7927:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/6478

> Enforce whitespace for more tokens in style checker
> ---
>
> Key: SPARK-7927
> URL: https://issues.apache.org/jira/browse/SPARK-7927
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Enforce whitespace on comma, colon, if, while, etc., so we don't need to 
> keep spending time on this in code reviews.
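As a concrete illustration, the kind of spacing such a checker would flag 
versus accept (an editor's sketch, not the actual Scalastyle rule definitions):

{code}
// Would be flagged: no space after the comma, the colon, or `if`.
def addBad(a:Int,b:Int):Int = if(a > 0) a + b else b

// Would pass: a single space after comma, colon, and `if`.
def addGood(a: Int, b: Int): Int = if (a > 0) a + b else b
{code}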



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7928) Yarn App Master Logs are not displayed in the spark historyserver UI

2015-05-28 Thread Aditya Rao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aditya Rao updated SPARK-7928:
--
Description: 
In Hadoop 2.7.0 the link to the App Master log has been disabled, as the YARN 
Job History Server shows the app master logs in the Resource Manager UI, but 
the Spark History Server doesn't show the app master logs.

So anyone running a Spark job in yarn-cluster mode has no way to see the 
results other than by checking the userlogs manually.

  was:
In Hadoop 2.7.0 the link to the App Master log has been disabled, as the YARN 
Job History Server shows the app master logs, but the Spark History Server 
doesn't show the app master logs.

So anyone running a Spark job in yarn-cluster mode has no way to see the 
results other than by checking the userlogs manually.


> Yarn App Master Logs are not displayed in the spark historyserver UI
> 
>
> Key: SPARK-7928
> URL: https://issues.apache.org/jira/browse/SPARK-7928
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.1, 1.3.1
> Environment: yarn hadoop 2.7.0
>Reporter: Aditya Rao
>
> In Hadoop 2.7.0 the link to the App Master log has been disabled, as the YARN 
> Job History Server shows the app master logs in the Resource Manager UI, but 
> the Spark History Server doesn't show the app master logs.
> So anyone running a Spark job in yarn-cluster mode has no way to see the 
> results other than by checking the userlogs manually.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7928) Yarn App Master Logs are not displayed in the spark historyserver UI

2015-05-28 Thread Aditya Rao (JIRA)
Aditya Rao created SPARK-7928:
-

 Summary: Yarn App Master Logs are not displayed in the spark 
historyserver UI
 Key: SPARK-7928
 URL: https://issues.apache.org/jira/browse/SPARK-7928
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.3.1, 1.2.1
 Environment: yarn hadoop 2.7.0
Reporter: Aditya Rao


In Hadoop 2.7.0 the link to the App Master log has been disabled, as the YARN 
Job History Server shows the app master logs, but the Spark History Server 
doesn't show the app master logs.

So anyone running a Spark job in yarn-cluster mode has no way to see the 
results other than by checking the userlogs manually.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7927) Enforce whitespace for more tokens in style checker

2015-05-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563812#comment-14563812
 ] 

Apache Spark commented on SPARK-7927:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/6477

> Enforce whitespace for more tokens in style checker
> ---
>
> Key: SPARK-7927
> URL: https://issues.apache.org/jira/browse/SPARK-7927
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Enforce whitespace on comma, colon, if, while, etc., so we don't need to 
> keep spending time on this in code reviews.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7921) Change includeFirst to dropLast in OneHotEncoder

2015-05-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7921:
---

Assignee: Xiangrui Meng  (was: Apache Spark)

> Change includeFirst to dropLast in OneHotEncoder
> 
>
> Key: SPARK-7921
> URL: https://issues.apache.org/jira/browse/SPARK-7921
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> Change includeFirst to dropLast and leave the default to true. There are a 
> couple of benefits:
> a. consistent with other tutorials of one-hot encoding (or dummy coding) 
> (e.g., http://www.ats.ucla.edu/stat/mult_pkg/faq/general/dummy.htm)
> b. keep the indices unmodified in the output vector. If we drop the first, 
> all indices will be shifted by 1.
> c. If users use StringIndexer, the last element is the least frequent one.
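A sketch of the proposed usage, assuming the new parameter is exposed as a 
setDropLast setter and that df is a DataFrame with a string "category" column:

{code}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val indexed = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
  .transform(df)

val encoded = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
  .setDropLast(true)   // proposed default: drop the last (least frequent)
  .transform(indexed)  // category, leaving the other indices unshifted
{code}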



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7921) Change includeFirst to dropLast in OneHotEncoder

2015-05-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7921:
---

Assignee: Apache Spark  (was: Xiangrui Meng)

> Change includeFirst to dropLast in OneHotEncoder
> 
>
> Key: SPARK-7921
> URL: https://issues.apache.org/jira/browse/SPARK-7921
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>
> Change includeFirst to dropLast and leave the default to true. There are a 
> couple of benefits:
> a. consistent with other tutorials of one-hot encoding (or dummy coding) 
> (e.g., http://www.ats.ucla.edu/stat/mult_pkg/faq/general/dummy.htm)
> b. keep the indices unmodified in the output vector. If we drop the first, 
> all indices will be shifted by 1.
> c. If users use StringIndexer, the last element is the least frequent one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7925) Address inconsistencies in capturing appName in different Metrics Sources

2015-05-28 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563808#comment-14563808
 ] 

Tathagata Das commented on SPARK-7925:
--

[~jerryshao] I remember you implemented some of these sources for streaming; do 
you know why the naming has this inconsistency between core and streaming?

> Address inconsistencies in capturing appName in different Metrics Sources
> -
>
> Key: SPARK-7925
> URL: https://issues.apache.org/jira/browse/SPARK-7925
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1
>Reporter: Bharat Venkat
>
> StreamingSource and ApplicationSource capture the appName; however, the rest 
> of the sources (DAGSchedulerSource, ExecutorSource, etc.) do not.  Capturing 
> the appName allows us to automate monitoring the metrics for an application.  
> It would be good if the appName were consistently captured across all Spark 
> metrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7927) Enforce whitespace for more tokens in style checker

2015-05-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563801#comment-14563801
 ] 

Apache Spark commented on SPARK-7927:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/6476

> Enforce whitespace for more tokens in style checker
> ---
>
> Key: SPARK-7927
> URL: https://issues.apache.org/jira/browse/SPARK-7927
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Enforce whitespace on comma, colon, if, while, etc., so we don't need to 
> keep spending time on this in code reviews.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7927) Enforce whitespace for more tokens in style checker

2015-05-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563787#comment-14563787
 ] 

Apache Spark commented on SPARK-7927:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/6475

> Enforce whitespace for more tokens in style checker
> ---
>
> Key: SPARK-7927
> URL: https://issues.apache.org/jira/browse/SPARK-7927
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Enforce whitespace on comma, colon, if, while, etc ... so we don't need to 
> keep spending time on this in code reviews.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer

2015-05-28 Thread Akshat Aranya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563786#comment-14563786
 ] 

Akshat Aranya commented on SPARK-7708:
--

[~joshrosen] I tried the test once again with your new code merged in, and it 
seems like it's not a problem with resetting the Kryo object.  In my test, I 
serialize the same object twice with the same KryoSerializerInstance, but I end 
up with two different serialized buffers:

{noformat}
serialized.limit=369
serialized.limit=366
{noformat}

Clearly, there is some state inside the serializer that isn't reset even after 
calling {{reset()}}.
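A minimal sketch of that check, assuming Spark's KryoSerializer and an 
arbitrary stand-in payload (names are illustrative):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

val ser = new KryoSerializer(new SparkConf()).newInstance()
val payload = Seq.fill(10)("task-bytes")  // stand-in for the real task

val first = ser.serialize(payload)
val second = ser.serialize(payload)
// With a clean reset the two limits should match; in the test above they don't.
println(s"serialized.limit=${first.limit}")
println(s"serialized.limit=${second.limit}")
{code}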

> Incorrect task serialization with Kryo closure serializer
> -
>
> Key: SPARK-7708
> URL: https://issues.apache.org/jira/browse/SPARK-7708
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.2
>Reporter: Akshat Aranya
>
> I've been investigating the use of Kryo for closure serialization with Spark 
> 1.2, and it seems like I've hit upon a bug:
> When a task is serialized before scheduling, the following log message is 
> generated:
> [info] o.a.s.s.TaskSetManager - Starting task 124.1 in stage 0.0 (TID 342, 
> , PROCESS_LOCAL, 302 bytes)
> This message comes from TaskSetManager, which serializes the task using the 
> closure serializer.  Before the message is sent out, the TaskDescription 
> (which includes the original task as a byte array) is serialized again into 
> a byte array with the closure serializer.  I added a log message for this in 
> CoarseGrainedSchedulerBackend, which produces the following output:
> [info] o.a.s.s.c.CoarseGrainedSchedulerBackend - 124.1 size=132
> The serialized size of the TaskDescription (132 bytes) turns out to be 
> _smaller_ than the serialized task that it contains (302 bytes). This implies 
> that TaskDescription.buffer is not getting serialized correctly.
> On the executor side, the deserialization produces a null value for 
> TaskDescription.buffer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7927) Enforce whitespace for more tokens in style checker

2015-05-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563780#comment-14563780
 ] 

Apache Spark commented on SPARK-7927:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/6474

> Enforce whitespace for more tokens in style checker
> ---
>
> Key: SPARK-7927
> URL: https://issues.apache.org/jira/browse/SPARK-7927
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Enforce whitespace on comma, colon, if, while, etc., so we don't need to 
> keep spending time on this in code reviews.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


