[jira] [Commented] (SPARK-4243) Spark SQL SELECT COUNT DISTINCT optimization

2015-11-10 Thread Piotr Niemcunowicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998799#comment-14998799
 ] 

Piotr Niemcunowicz commented on SPARK-4243:
---

The same happens when one uses HiveContext.
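
A minimal sketch of the HiveContext variant, assuming the same Parquet path and 
{{parquetFile}} temp table as in the description below (untested):

{code}
// hypothetical repro through HiveContext instead of SQLContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.parquetFile("/bojan/test/2014-10-20/").registerTempTable("parquetFile")
hiveContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile").collect().foreach(println)
{code}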

> Spark SQL SELECT COUNT DISTINCT optimization
> 
>
> Key: SPARK-4243
> URL: https://issues.apache.org/jira/browse/SPARK-4243
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Bojan Kostić
>
> Spark SQL runs slow when using this code:
> {code}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc) 
> val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/") 
> parquetFile.registerTempTable("parquetFile") 
> val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile") 
> count.map(t => t(0)).collect().foreach(println)
> {code}
> But with this query it runs much faster:
> {code}
> SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) a
> {code}
> Old query stats by phases: 
> 3.2 min 
> 17 s 
> New query stats by phases: 
> 0.3 s 
> 16 s 
> 20 s
> Maybe you should also see this query for optimization:
> {code}
> SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4) 
> FROM parquetFile 
> {code}






[jira] [Commented] (SPARK-4243) Spark SQL SELECT COUNT DISTINCT optimization

2015-11-10 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998813#comment-14998813
 ] 

Yin Huai commented on SPARK-4243:
-

The optimization of {{SELECT COUNT(DISTINCT f2) FROM parquetFile}} will be done 
as part of https://github.com/apache/spark/pull/9556. We will rewrite the 
query to the equivalent form {{SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM 
parquetFile) a}}.

The improvement of {{SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), 
COUNT(DISTINCT f4) FROM parquetFile}} is covered by SPARK-9241.
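
Until SPARK-9241 lands, a rough stop-gap sketch is to apply the same manual rewrite 
per distinct column, one query per column (hypothetical val names, untested):

{code}
// count(f1) plus one rewritten distinct count per column, run as separate queries
val cntF1 = sqlContext.sql("SELECT COUNT(f1) FROM parquetFile")
val cntDistinctF2 = sqlContext.sql("SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) a")
val cntDistinctF3 = sqlContext.sql("SELECT COUNT(*) FROM (SELECT DISTINCT f3 FROM parquetFile) a")
val cntDistinctF4 = sqlContext.sql("SELECT COUNT(*) FROM (SELECT DISTINCT f4 FROM parquetFile) a")
{code}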

> Spark SQL SELECT COUNT DISTINCT optimization
> 
>
> Key: SPARK-4243
> URL: https://issues.apache.org/jira/browse/SPARK-4243
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Bojan Kostić
>
> Spark SQL runs slow when using this code:
> {code}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc) 
> val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/") 
> parquetFile.registerTempTable("parquetFile") 
> val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile") 
> count.map(t => t(0)).collect().foreach(println)
> {code}
> But with this query it runs much faster:
> {code}
> SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) a
> {code}
> Old query stats by phases: 
> 3.2 min 
> 17 s 
> New query stats by phases: 
> 0.3 s 
> 16 s 
> 20 s
> Maybe you should also see this query for optimization:
> {code}
> SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4) 
> FROM parquetFile 
> {code}






[jira] [Updated] (SPARK-4243) Spark SQL SELECT COUNT DISTINCT optimization

2015-11-10 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-4243:

Target Version/s: 1.6.0

> Spark SQL SELECT COUNT DISTINCT optimization
> 
>
> Key: SPARK-4243
> URL: https://issues.apache.org/jira/browse/SPARK-4243
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Bojan Kostić
>
> Spark SQL runs slow when using this code:
> {code}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc) 
> val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/") 
> parquetFile.registerTempTable("parquetFile") 
> val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile") 
> count.map(t => t(0)).collect().foreach(println)
> {code}
> But with this query it runs much faster:
> {code}
> SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) a
> {code}
> Old query stats by phases: 
> 3.2 min 
> 17 s 
> New query stats by phases: 
> 0.3 s 
> 16 s 
> 20 s
> Maybe you should also see this query for optimization:
> {code}
> SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4) 
> FROM parquetFile 
> {code}






[jira] [Updated] (SPARK-10388) Public dataset loader interface

2015-11-10 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-10388:
---
Attachment: SPARK-10388PublicDataSetLoaderInterface.pdf

> Public dataset loader interface
> ---
>
> Key: SPARK-10388
> URL: https://issues.apache.org/jira/browse/SPARK-10388
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Attachments: SPARK-10388PublicDataSetLoaderInterface.pdf
>
>
> It is very useful to have a public dataset loader to fetch ML datasets from 
> popular repos, e.g., libsvm and UCI. This JIRA is to discuss the design, 
> requirements, and initial implementation.
> {code}
> val loader = new DatasetLoader(sqlContext)
> val df = loader.get("libsvm", "rcv1_train.binary")
> {code}
> User should be able to list (or preview) datasets, e.g.
> {code}
> val datasets = loader.ls("libsvm") // returns a local DataFrame
> datasets.show() // list all datasets under libsvm repo
> {code}
> It would be nice to allow 3rd-party packages to register new repos. Both the 
> API and implementation are pending discussion. Note that this requires http 
> and https support.






[jira] [Commented] (SPARK-10388) Public dataset loader interface

2015-11-10 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998759#comment-14998759
 ] 

Jeff Zhang commented on SPARK-10388:


[~mengxr] I talked with [~rams] offline and would love to collaborate with him 
on this ticket. I have attached the design; please help review. Thanks.

> Public dataset loader interface
> ---
>
> Key: SPARK-10388
> URL: https://issues.apache.org/jira/browse/SPARK-10388
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Attachments: SPARK-10388PublicDataSetLoaderInterface.pdf
>
>
> It is very useful to have a public dataset loader to fetch ML datasets from 
> popular repos, e.g., libsvm and UCI. This JIRA is to discuss the design, 
> requirements, and initial implementation.
> {code}
> val loader = new DatasetLoader(sqlContext)
> val df = loader.get("libsvm", "rcv1_train.binary")
> {code}
> User should be able to list (or preview) datasets, e.g.
> {code}
> val datasets = loader.ls("libsvm") // returns a local DataFrame
> datasets.show() // list all datasets under libsvm repo
> {code}
> It would be nice to allow 3rd-party packages to register new repos. Both the 
> API and implementation are pending discussion. Note that this requires http 
> and https support.






[jira] [Updated] (SPARK-4243) Spark SQL SELECT COUNT DISTINCT optimization

2015-11-10 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-4243:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-4366

> Spark SQL SELECT COUNT DISTINCT optimization
> 
>
> Key: SPARK-4243
> URL: https://issues.apache.org/jira/browse/SPARK-4243
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Bojan Kostić
>
> Spark SQL runs slow when using this code:
> {code}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc) 
> val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/") 
> parquetFile.registerTempTable("parquetFile") 
> val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile") 
> count.map(t => t(0)).collect().foreach(println)
> {code}
> But with this query it runs much faster:
> {code}
> SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) a
> {code}
> Old query stats by phases: 
> 3.2 min 
> 17 s 
> New query stats by phases: 
> 0.3 s 
> 16 s 
> 20 s
> Maybe you should also see this query for optimization:
> {code}
> SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4) 
> FROM parquetFile 
> {code}






[jira] [Commented] (SPARK-11343) Regression Imposes doubles on prediction/label columns

2015-11-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998732#comment-14998732
 ] 

Apache Spark commented on SPARK-11343:
--

User 'dahlem' has created a pull request for this issue:
https://github.com/apache/spark/pull/9598

> Regression Imposes doubles on prediction/label columns
> --
>
> Key: SPARK-11343
> URL: https://issues.apache.org/jira/browse/SPARK-11343
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.5.1
> Environment: all environments
>Reporter: Dominik Dahlem
>Assignee: Dominik Dahlem
> Fix For: 1.6.0
>
>
> Using pyspark.ml and DataFrames, the ALS recommender cannot be evaluated 
> using the RegressionEvaluator, because of a type mismatch between the model 
> transformation and the evaluation APIs. One can work around this by casting 
> the prediction column into double before passing it into the evaluator. 
> However, this does not work with pipelines and cross validation.
> Code and traceback below:
> {code}
> als = ALS(rank=10, maxIter=30, regParam=0.1, userCol='userID', 
> itemCol='movieID', ratingCol='rating')
> model = als.fit(training)
> predictions = model.transform(validation)
> evaluator = RegressionEvaluator(predictionCol='prediction', labelCol='rating')
> validationRmse = evaluator.evaluate(predictions, {evaluator.metricName: 
> 'rmse'})
> {code}
> Traceback:
> validationRmse = evaluator.evaluate(predictions, {evaluator.metricName: 'rmse'})
> File 
> "/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/pyspark.zip/pyspark/ml/evaluation.py",
>  line 63, in evaluate
> File 
> "/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/pyspark.zip/pyspark/ml/evaluation.py",
>  line 94, in _evaluate
> File 
> "/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
>  line 813, in __call__
> File 
> "/Users/dominikdahlem/projects/repositories/spark/python/pyspark/sql/utils.py",
>  line 42, in deco
> raise IllegalArgumentException(s.split(': ', 1)[1])
> pyspark.sql.utils.IllegalArgumentException: requirement failed: Column 
> prediction must be of type DoubleType but was actually FloatType.
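> A minimal sketch of that cast workaround, reusing the names from the snippet 
> above and writing the cast to a new column so the original {{prediction}} 
> column is left untouched (untested):
> {code}
> from pyspark.sql.functions import col
> from pyspark.ml.evaluation import RegressionEvaluator
> # cast the FloatType predictions to double under a new column name
> predictions = predictions.withColumn('prediction_double', col('prediction').cast('double'))
> evaluator = RegressionEvaluator(predictionCol='prediction_double', labelCol='rating')
> validationRmse = evaluator.evaluate(predictions, {evaluator.metricName: 'rmse'})
> {code}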






[jira] [Commented] (SPARK-10978) Allow PrunedFilterScan to eliminate predicates from further evaluation

2015-11-10 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998883#comment-14998883
 ] 

Yin Huai commented on SPARK-10978:
--

I am not sure {{partiallyHandledFilters}} is really useful. A filter is either 
evaluated for every row in the data source or it is not (it is possible that the 
data source does not support filter pushdown, or that it only has a 
coarse-grained index). So, as long as we know this information, we can decide 
whether we should add a Spark SQL-side filter. For our internal data sources, I 
think it is fine to let unhandledFilters just return all filters, because it is 
not really expensive to re-evaluate them. 
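
For reference, a minimal sketch of a source overriding the proposed hook (assuming 
{{unhandledFilters}} lands on {{BaseRelation}} with the signature quoted below; 
hypothetical relation, untested). Whatever the method returns gets re-evaluated on 
the Spark side:

{code}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter}

// hypothetical relation that evaluates EqualTo exactly and hands everything else back to Spark
abstract class ExactEqualityRelation extends BaseRelation {
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot(_.isInstanceOf[EqualTo])
}
{code}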


> Allow PrunedFilterScan to eliminate predicates from further evaluation
> --
>
> Key: SPARK-10978
> URL: https://issues.apache.org/jira/browse/SPARK-10978
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.3.0, 1.4.0, 1.5.0
>Reporter: Russell Alexander Spitzer
>Assignee: Cheng Lian
>Priority: Critical
> Fix For: 1.6.0
>
>
> Currently PrunedFilterScan allows implementors to push down predicates to an 
> underlying datasource. This is done solely as an optimization as the 
> predicate will be reapplied on the Spark side as well. This allows for 
> bloom-filter like operations but ends up doing a redundant scan for those 
> sources which can do accurate pushdowns.
> In addition, it makes it difficult for underlying sources to accept queries 
> which reference non-existent columns in order to provide ancillary 
> functionality. In our case we allow a Solr query to be passed in via a 
> non-existent solr_query column. Since this column is not returned, when Spark 
> does a filter on "solr_query" nothing passes. 
> Suggestion on the ML from [~marmbrus] 
> {quote}
> We have to try and maintain binary compatibility here, so probably the 
> easiest thing to do here would be to add a method to the class.  Perhaps 
> something like:
> def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
> By default, this could return all filters so behavior would remain the same, 
> but specific implementations could override it.  There is still a chance that 
> this would conflict with existing methods, but hopefully that would not be a 
> problem in practice.
> {quote}






[jira] [Commented] (SPARK-11578) User facing api for typed aggregation

2015-11-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998754#comment-14998754
 ] 

Apache Spark commented on SPARK-11578:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/9599

> User facing api for typed aggregation
> -
>
> Key: SPARK-11578
> URL: https://issues.apache.org/jira/browse/SPARK-11578
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Critical
> Fix For: 1.6.0
>
>







[jira] [Resolved] (SPARK-4243) Spark SQL SELECT COUNT DISTINCT optimization

2015-11-10 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-4243.
-
   Resolution: Fixed
 Assignee: Yin Huai
Fix Version/s: 1.6.0

> Spark SQL SELECT COUNT DISTINCT optimization
> 
>
> Key: SPARK-4243
> URL: https://issues.apache.org/jira/browse/SPARK-4243
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Bojan Kostić
>Assignee: Yin Huai
> Fix For: 1.6.0
>
>
> Spark SQL runs slow when using this code:
> {code}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc) 
> val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/") 
> parquetFile.registerTempTable("parquetFile") 
> val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile") 
> count.map(t => t(0)).collect().foreach(println)
> {code}
> But with this query it runs much faster:
> {code}
> SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) a
> {code}
> Old query stats by phases: 
> 3.2 min 
> 17 s 
> New query stats by phases: 
> 0.3 s 
> 16 s 
> 20 s
> Maybe you should also see this query for optimization:
> {code}
> SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4) 
> FROM parquetFile 
> {code}






[jira] [Resolved] (SPARK-11590) use native json_tuple in lateral view

2015-11-10 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-11590.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9562
[https://github.com/apache/spark/pull/9562]

> use native json_tuple in lateral view
> -
>
> Key: SPARK-11590
> URL: https://issues.apache.org/jira/browse/SPARK-11590
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 1.6.0
>
>







[jira] [Commented] (SPARK-4243) Spark SQL SELECT COUNT DISTINCT optimization

2015-11-10 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999138#comment-14999138
 ] 

Yin Huai commented on SPARK-4243:
-

https://github.com/apache/spark/pull/9556 has been merged. I am resolving this 
issue.

> Spark SQL SELECT COUNT DISTINCT optimization
> 
>
> Key: SPARK-4243
> URL: https://issues.apache.org/jira/browse/SPARK-4243
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Bojan Kostić
> Fix For: 1.6.0
>
>
> Spark SQL runs slow when using this code:
> {code}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc) 
> val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/") 
> parquetFile.registerTempTable("parquetFile") 
> val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile") 
> count.map(t => t(0)).collect().foreach(println)
> {code}
> But with this query it runs much faster:
> {code}
> SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) a
> {code}
> Old query stats by phases: 
> 3.2 min 
> 17 s 
> New query stats by phases: 
> 0.3 s 
> 16 s 
> 20 s
> Maybe you should also see this query for optimization:
> {code}
> SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4) 
> FROM parquetFile 
> {code}






[jira] [Resolved] (SPARK-10863) Method coltypes() to return the R column types of a DataFrame

2015-11-10 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-10863.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9579
[https://github.com/apache/spark/pull/9579]

> Method coltypes() to return the R column types of a DataFrame
> -
>
> Key: SPARK-10863
> URL: https://issues.apache.org/jira/browse/SPARK-10863
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.5.0
>Reporter: Oscar D. Lara Yejas
> Fix For: 1.6.0
>
>







[jira] [Updated] (SPARK-10863) Method coltypes() to return the R column types of a DataFrame

2015-11-10 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-10863:
--
Assignee: Oscar D. Lara Yejas

> Method coltypes() to return the R column types of a DataFrame
> -
>
> Key: SPARK-10863
> URL: https://issues.apache.org/jira/browse/SPARK-10863
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.5.0
>Reporter: Oscar D. Lara Yejas
>Assignee: Oscar D. Lara Yejas
> Fix For: 1.6.0
>
>







[jira] [Updated] (SPARK-10892) Join with Data Frame returns wrong results

2015-11-10 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10892:
-
Fix Version/s: (was: 1.6.0)

> Join with Data Frame returns wrong results
> --
>
> Key: SPARK-10892
> URL: https://issues.apache.org/jira/browse/SPARK-10892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Ofer Mendelevitch
>Priority: Critical
> Attachments: data.json
>
>
> I'm attaching a simplified reproducible example of the problem:
> 1. Loading a JSON file from HDFS as a Data Frame
> 2. Creating 3 data frames: PRCP, TMIN, TMAX
> 3. Joining the data frames together. Each of those has a column "value" with 
> the same name, so renaming them after the join.
> 4. The output seems incorrect; the first column has the correct values, but 
> the two other columns seem to have a copy of the values from the first column.
> Here's the sample code:
> {code}
> import org.apache.spark.sql._
> val sqlc = new SQLContext(sc)
> val weather = sqlc.read.format("json").load("data.json")
> val prcp = weather.filter("metric = 'PRCP'").as("prcp").cache()
> val tmin = weather.filter("metric = 'TMIN'").as("tmin").cache()
> val tmax = weather.filter("metric = 'TMAX'").as("tmax").cache()
> prcp.filter("year=2012 and month=10").show()
> tmin.filter("year=2012 and month=10").show()
> tmax.filter("year=2012 and month=10").show()
> val out = (prcp.join(tmin, "date_str").join(tmax, "date_str")
>   .select(prcp("year"), prcp("month"), prcp("day"), prcp("date_str"),
> prcp("value").alias("PRCP"), tmin("value").alias("TMIN"),
> tmax("value").alias("TMAX")) )
> out.filter("year=2012 and month=10").show()
> {code}
> The output is:
> {code}
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  PRCP|   10|USW00023272|0|2012|
> |20121002|  2|  PRCP|   10|USW00023272|0|2012|
> |20121003|  3|  PRCP|   10|USW00023272|0|2012|
> |20121004|  4|  PRCP|   10|USW00023272|0|2012|
> |20121005|  5|  PRCP|   10|USW00023272|0|2012|
> |20121006|  6|  PRCP|   10|USW00023272|0|2012|
> |20121007|  7|  PRCP|   10|USW00023272|0|2012|
> |20121008|  8|  PRCP|   10|USW00023272|0|2012|
> |20121009|  9|  PRCP|   10|USW00023272|0|2012|
> |20121010| 10|  PRCP|   10|USW00023272|0|2012|
> |20121011| 11|  PRCP|   10|USW00023272|3|2012|
> |20121012| 12|  PRCP|   10|USW00023272|0|2012|
> |20121013| 13|  PRCP|   10|USW00023272|0|2012|
> |20121014| 14|  PRCP|   10|USW00023272|0|2012|
> |20121015| 15|  PRCP|   10|USW00023272|0|2012|
> |20121016| 16|  PRCP|   10|USW00023272|0|2012|
> |20121017| 17|  PRCP|   10|USW00023272|0|2012|
> |20121018| 18|  PRCP|   10|USW00023272|0|2012|
> |20121019| 19|  PRCP|   10|USW00023272|0|2012|
> |20121020| 20|  PRCP|   10|USW00023272|0|2012|
> ++---+--+-+---+-+——+
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMIN|   10|USW00023272|  139|2012|
> |20121002|  2|  TMIN|   10|USW00023272|  178|2012|
> |20121003|  3|  TMIN|   10|USW00023272|  144|2012|
> |20121004|  4|  TMIN|   10|USW00023272|  144|2012|
> |20121005|  5|  TMIN|   10|USW00023272|  139|2012|
> |20121006|  6|  TMIN|   10|USW00023272|  128|2012|
> |20121007|  7|  TMIN|   10|USW00023272|  122|2012|
> |20121008|  8|  TMIN|   10|USW00023272|  122|2012|
> |20121009|  9|  TMIN|   10|USW00023272|  139|2012|
> |20121010| 10|  TMIN|   10|USW00023272|  128|2012|
> |20121011| 11|  TMIN|   10|USW00023272|  122|2012|
> |20121012| 12|  TMIN|   10|USW00023272|  117|2012|
> |20121013| 13|  TMIN|   10|USW00023272|  122|2012|
> |20121014| 14|  TMIN|   10|USW00023272|  128|2012|
> |20121015| 15|  TMIN|   10|USW00023272|  128|2012|
> |20121016| 16|  TMIN|   10|USW00023272|  156|2012|
> |20121017| 17|  TMIN|   10|USW00023272|  139|2012|
> |20121018| 18|  TMIN|   10|USW00023272|  161|2012|
> |20121019| 19|  TMIN|   10|USW00023272|  133|2012|
> |20121020| 20|  TMIN|   10|USW00023272|  122|2012|
> ++---+--+-+---+-+——+
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMAX|   10|USW00023272|  322|2012|
> |20121002|  2|  TMAX|   10|USW00023272|  344|2012|
> |20121003|  3|  TMAX|   10|USW00023272|  222|2012|
> |20121004|  4|  TMAX|   10|USW00023272|  189|2012|
> |20121005|  5|  TMAX|   10|USW00023272|  194|2012|
> |20121006|  6|  TMAX|   10|USW00023272|  200|2012|
> |20121007|  7|  TMAX|   

[jira] [Commented] (SPARK-10892) Join with Data Frame returns wrong results

2015-11-10 Thread Ofer Mendelevitch (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999171#comment-14999171
 ] 

Ofer Mendelevitch commented on SPARK-10892:
---

Sure, sorry. 
Any idea when this might be resolved?

> Join with Data Frame returns wrong results
> --
>
> Key: SPARK-10892
> URL: https://issues.apache.org/jira/browse/SPARK-10892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Ofer Mendelevitch
>Priority: Critical
> Attachments: data.json
>
>
> I'm attaching a simplified reproducible example of the problem:
> 1. Loading a JSON file from HDFS as a Data Frame
> 2. Creating 3 data frames: PRCP, TMIN, TMAX
> 3. Joining the data frames together. Each of those has a column "value" with 
> the same name, so renaming them after the join.
> 4. The output seems incorrect; the first column has the correct values, but 
> the two other columns seem to have a copy of the values from the first column.
> Here's the sample code:
> {code}
> import org.apache.spark.sql._
> val sqlc = new SQLContext(sc)
> val weather = sqlc.read.format("json").load("data.json")
> val prcp = weather.filter("metric = 'PRCP'").as("prcp").cache()
> val tmin = weather.filter("metric = 'TMIN'").as("tmin").cache()
> val tmax = weather.filter("metric = 'TMAX'").as("tmax").cache()
> prcp.filter("year=2012 and month=10").show()
> tmin.filter("year=2012 and month=10").show()
> tmax.filter("year=2012 and month=10").show()
> val out = (prcp.join(tmin, "date_str").join(tmax, "date_str")
>   .select(prcp("year"), prcp("month"), prcp("day"), prcp("date_str"),
> prcp("value").alias("PRCP"), tmin("value").alias("TMIN"),
> tmax("value").alias("TMAX")) )
> out.filter("year=2012 and month=10").show()
> {code}
> The output is:
> {code}
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  PRCP|   10|USW00023272|0|2012|
> |20121002|  2|  PRCP|   10|USW00023272|0|2012|
> |20121003|  3|  PRCP|   10|USW00023272|0|2012|
> |20121004|  4|  PRCP|   10|USW00023272|0|2012|
> |20121005|  5|  PRCP|   10|USW00023272|0|2012|
> |20121006|  6|  PRCP|   10|USW00023272|0|2012|
> |20121007|  7|  PRCP|   10|USW00023272|0|2012|
> |20121008|  8|  PRCP|   10|USW00023272|0|2012|
> |20121009|  9|  PRCP|   10|USW00023272|0|2012|
> |20121010| 10|  PRCP|   10|USW00023272|0|2012|
> |20121011| 11|  PRCP|   10|USW00023272|3|2012|
> |20121012| 12|  PRCP|   10|USW00023272|0|2012|
> |20121013| 13|  PRCP|   10|USW00023272|0|2012|
> |20121014| 14|  PRCP|   10|USW00023272|0|2012|
> |20121015| 15|  PRCP|   10|USW00023272|0|2012|
> |20121016| 16|  PRCP|   10|USW00023272|0|2012|
> |20121017| 17|  PRCP|   10|USW00023272|0|2012|
> |20121018| 18|  PRCP|   10|USW00023272|0|2012|
> |20121019| 19|  PRCP|   10|USW00023272|0|2012|
> |20121020| 20|  PRCP|   10|USW00023272|0|2012|
> ++---+--+-+---+-+——+
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMIN|   10|USW00023272|  139|2012|
> |20121002|  2|  TMIN|   10|USW00023272|  178|2012|
> |20121003|  3|  TMIN|   10|USW00023272|  144|2012|
> |20121004|  4|  TMIN|   10|USW00023272|  144|2012|
> |20121005|  5|  TMIN|   10|USW00023272|  139|2012|
> |20121006|  6|  TMIN|   10|USW00023272|  128|2012|
> |20121007|  7|  TMIN|   10|USW00023272|  122|2012|
> |20121008|  8|  TMIN|   10|USW00023272|  122|2012|
> |20121009|  9|  TMIN|   10|USW00023272|  139|2012|
> |20121010| 10|  TMIN|   10|USW00023272|  128|2012|
> |20121011| 11|  TMIN|   10|USW00023272|  122|2012|
> |20121012| 12|  TMIN|   10|USW00023272|  117|2012|
> |20121013| 13|  TMIN|   10|USW00023272|  122|2012|
> |20121014| 14|  TMIN|   10|USW00023272|  128|2012|
> |20121015| 15|  TMIN|   10|USW00023272|  128|2012|
> |20121016| 16|  TMIN|   10|USW00023272|  156|2012|
> |20121017| 17|  TMIN|   10|USW00023272|  139|2012|
> |20121018| 18|  TMIN|   10|USW00023272|  161|2012|
> |20121019| 19|  TMIN|   10|USW00023272|  133|2012|
> |20121020| 20|  TMIN|   10|USW00023272|  122|2012|
> ++---+--+-+---+-+——+
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMAX|   10|USW00023272|  322|2012|
> |20121002|  2|  TMAX|   10|USW00023272|  344|2012|
> |20121003|  3|  TMAX|   10|USW00023272|  222|2012|
> |20121004|  4|  TMAX|   10|USW00023272|  189|2012|
> |20121005|  5|  TMAX|   10|USW00023272|  194|2012|
> |20121006| 

[jira] [Updated] (SPARK-11626) ml.feature.Word2Vec.transform() should not recompute word-vector map each time

2015-11-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11626:
--
Issue Type: Bug  (was: Improvement)

> ml.feature.Word2Vec.transform() should not recompute word-vector map each time
> --
>
> Key: SPARK-11626
> URL: https://issues.apache.org/jira/browse/SPARK-11626
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.5.1
>Reporter: q79969786
>Priority: Minor
>
> org.apache.spark.ml.feature.Word2Vec.transform() is very slow; we should not 
> read the broadcast variable for every sentence.






[jira] [Commented] (SPARK-11631) DAGScheduler prints "Stopping DAGScheduler" at INFO to the logs with no corresponding "Starting"

2015-11-10 Thread Xiu(Joe) Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999101#comment-14999101
 ] 

Xiu(Joe) Guo commented on SPARK-11631:
--

I am looking at it and will submit a PR shortly.

> DAGScheduler prints "Stopping DAGScheduler" at INFO to the logs with no 
> corresponding "Starting"
> 
>
> Key: SPARK-11631
> URL: https://issues.apache.org/jira/browse/SPARK-11631
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0
> Environment: Spark sources as of today - revision {{5039a49}}
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> At stop, DAGScheduler prints out {{INFO DAGScheduler: Stopping 
> DAGScheduler}}, but there's no corresponding Starting INFO message. It can be 
> surprising.
> I think Spark should pick one of the following:
> 1. {{INFO DAGScheduler: Stopping DAGScheduler}} should be DEBUG at the most 
> (or even TRACE)
> 2. {{INFO DAGScheduler: Stopping DAGScheduler}} should have corresponding 
> {{INFO DAGScheduler: Starting DAGScheduler}}.






[jira] [Updated] (SPARK-10444) Remove duplication in Mesos schedulers

2015-11-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10444:
--
Target Version/s: 1.7.0  (was: 1.6.0)

> Remove duplication in Mesos schedulers
> --
>
> Key: SPARK-10444
> URL: https://issues.apache.org/jira/browse/SPARK-10444
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.5.0
>Reporter: Iulian Dragos
>  Labels: refactoring
>
> Currently coarse-grained and fine-grained Mesos schedulers don't share much 
> code, and that leads to inconsistencies. For instance:
> - only coarse-grained mode respects {{spark.cores.max}}, see SPARK-9873
> - only coarse-grained mode blacklists slaves that fail repeatedly, but that 
> seems generally useful
> - constraints and memory checking are done on both sides (code is shared 
> though)
> - framework re-registration (master election) is only done for cluster-mode 
> deployment
> We should find a better design that groups together common concerns and 
> generally improves the code.






[jira] [Commented] (SPARK-11583) Make MapStatus use less memory usage

2015-11-10 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999009#comment-14999009
 ] 

Imran Rashid commented on SPARK-11583:
--

I think [~srowen] is right that we are going in circles a bit here; we don't 
just want to re-implement our own RoaringBitmap ... I have a feeling we 
actually made a misstep in https://issues.apache.org/jira/browse/SPARK-11271.  
I think there was some confusion about the way these structures work, as well 
as some misleading comments.  After a bit of digging, here is my understanding, 
but please correct me (would especially appreciate feedback from [~lemire])

1. Given a relatively full set, {{RoaringBitmap}} is not going to do better 
than a {{BitSet}}.  In fact, it will most likely use a bit more space, because 
it's trying to do some extra book-keeping beyond a normal {{BitSet}}.  From 
https://issues.apache.org/jira/browse/SPARK-11271, it appears that in one case 
this is 20%.  However, we don't really know whether that is the worst case, 
relatively typical, or whether it could be far worse in other cases.  This might 
require a more thorough investigation from someone.

2. For sparse sets, a {{RoaringBitmap}} will use much less space than a 
{{BitSet}}, _including deserialized_.  That is, the old comment in MapStatus, 
["During serialization, this bitmap is compressed" | 
https://github.com/apache/spark/blob/6e823b4d7d52e9cf707f144256006e4575d23dc2/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L125]
 is very misleading -- the bitmap is always compressed.  The original 
implementation of {{HighlyCompressedMapStatus}} was just mostly concerned with 
decreasing the size of the network traffic, but its also compressed in memory.

3. When a set is nearly full, {{RoaringBitmap}} does *not* automatically invert 
the bits in order to minimize space.  Here's an example:

{noformat}
scala> import org.roaringbitmap._
import org.roaringbitmap._
scala> val rr = RoaringBitmap.bitmapOf(1,2,3,1000)
rr: org.roaringbitmap.RoaringBitmap = {1,2,3,1000}
scala> val x = rr.clone()
x: org.roaringbitmap.RoaringBitmap = {1,2,3,1000}
scala> x.flip(0,1001)
scala> rr.getSizeInBytes()
res1: Int = 22
scala> x.getSizeInBytes()
res2: Int = 2008
{noformat}

There is another comment in the old code: ["From a compression standpoint, it 
shouldn't matter whether we track empty or non-empty blocks" | 
https://github.com/apache/spark/blob/6e823b4d7d52e9cf707f144256006e4575d23dc2/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L177].
  This was also a bit misleading.  That comment was also probably referring to 
when the data was serialized, and sent over the network, with another layer of 
compression enabled, in which case the versions would be the same.  But the *in 
memory* usage can actually be very different.


So I'm pretty sure this means that after 
https://issues.apache.org/jira/browse/SPARK-11271, we are actually using much 
*more* memory when those sets are relatively empty -- that is, when most blocks 
are *non*-empty.  We did at least reduce the worst-case memory usage that comes 
with a nearly full set (when most blocks are *empty*).

The only thing Spark needs from this bitset is:
1) small size in-memory
2) small size serialized
3) fast {{contains()}}

{{RoaringBitmap}} is optimized for some other use cases, eg. fast intersection 
& modification.  Note that spark doesn't even need mutability for the bitsets 
in {{MapStatus}} -- after they are created, they are never changed.  
Nonetheless, {{RoaringBitmap}} might still be a good fit because it does 
compression.

I think our options are:

1. Roll our own compressed bit set -- basically what is in the current PR
2. Go back to RoaringBitmap, but choose whether to store the empty or non-empty 
blocks based on what will use the least memory.
3. Look for some other pre-existing bitset implementation which is closer to 
our needs.

I'm leaning towards (1), and moving forward with the PR, but I thought it was 
worth clarifying the situation and making sure we understood what was going on.
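
A rough sketch of option 2, using only {{RoaringBitmap}} calls that appear above 
plus {{add}} (hypothetical helper, untested): build both bitmaps and keep whichever 
is smaller in memory, remembering which one is stored.

{code}
import org.roaringbitmap.RoaringBitmap

// returns (bitmap, tracksEmptyBlocks): the bitmap holds either the empty or the
// non-empty block ids, whichever representation uses less memory
def chooseTrackedBlocks(blockSizes: Array[Long]): (RoaringBitmap, Boolean) = {
  val emptyBlocks = new RoaringBitmap()
  val nonEmptyBlocks = new RoaringBitmap()
  var i = 0
  while (i < blockSizes.length) {
    if (blockSizes(i) == 0L) emptyBlocks.add(i) else nonEmptyBlocks.add(i)
    i += 1
  }
  if (emptyBlocks.getSizeInBytes <= nonEmptyBlocks.getSizeInBytes) (emptyBlocks, true)
  else (nonEmptyBlocks, false)
}
{code}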

> Make MapStatus use less memory usage
> ---
>
> Key: SPARK-11583
> URL: https://issues.apache.org/jira/browse/SPARK-11583
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Reporter: Kent Yao
>
> In the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, as I 
> said, using BitSet can save ≈20% memory usage compared to RoaringBitmap. 
> For a Spark job that contains quite a lot of tasks, 20% seems a drop in the ocean. 
> Essentially, BitSet uses long[]. For example, a BitSet[200k] = long[3125].
> So if we use a HashSet[Int] to store reduceIds (when non-empty blocks are 
> dense, use reduceIds of empty blocks; when sparse, use non-empty ones). 
> For dense cases: if 

[jira] [Updated] (SPARK-11622) Make LibSVMRelation extends HadoopFsRelation and Add LibSVMOutputWriter

2015-11-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11622:
--
Shepherd: Kai Sasaki

> Make LibSVMRelation extends HadoopFsRelation and Add LibSVMOutputWriter
> ---
>
> Key: SPARK-11622
> URL: https://issues.apache.org/jira/browse/SPARK-11622
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Jeff Zhang
>Priority: Minor
>
> so that LibSVMRelation can leverage the features from HadoopFsRelation






[jira] [Updated] (SPARK-11622) Make LibSVMRelation extends HadoopFsRelation and Add LibSVMOutputWriter

2015-11-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11622:
--
Assignee: Jeff Zhang

> Make LibSVMRelation extends HadoopFsRelation and Add LibSVMOutputWriter
> ---
>
> Key: SPARK-11622
> URL: https://issues.apache.org/jira/browse/SPARK-11622
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
>Priority: Minor
>
> so that LibSVMRelation can leverage the features from HadoopFsRelation






[jira] [Updated] (SPARK-11626) ml.feature.Word2Vec.transform() should not recompute word-vector map each time

2015-11-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11626:
--
Assignee: q79969786

> ml.feature.Word2Vec.transform() should not recompute word-vector map each time
> --
>
> Key: SPARK-11626
> URL: https://issues.apache.org/jira/browse/SPARK-11626
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.5.1
>Reporter: q79969786
>Assignee: q79969786
>
> org.apache.spark.ml.feature.Word2Vec.transform() is very slow; we should not 
> read the broadcast variable for every sentence.






[jira] [Updated] (SPARK-11626) ml.feature.Word2Vec.transform() should not recompute word-vector map each time

2015-11-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11626:
--
Shepherd: Xiangrui Meng
Target Version/s: 1.6.0
Priority: Major  (was: Minor)

> ml.feature.Word2Vec.transform() should not recompute word-vector map each time
> --
>
> Key: SPARK-11626
> URL: https://issues.apache.org/jira/browse/SPARK-11626
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.5.1
>Reporter: q79969786
>
> org.apache.spark.ml.feature.Word2Vec.transform() is very slow; we should not 
> read the broadcast variable for every sentence.






[jira] [Updated] (SPARK-10892) Join with Data Frame returns wrong results

2015-11-10 Thread Ofer Mendelevitch (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ofer Mendelevitch updated SPARK-10892:
--
Fix Version/s: 1.6.0

> Join with Data Frame returns wrong results
> --
>
> Key: SPARK-10892
> URL: https://issues.apache.org/jira/browse/SPARK-10892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Ofer Mendelevitch
>Priority: Critical
> Fix For: 1.6.0
>
> Attachments: data.json
>
>
> I'm attaching a simplified reproducible example of the problem:
> 1. Loading a JSON file from HDFS as a Data Frame
> 2. Creating 3 data frames: PRCP, TMIN, TMAX
> 3. Joining the data frames together. Each of those has a column "value" with 
> the same name, so renaming them after the join.
> 4. The output seems incorrect; the first column has the correct values, but 
> the two other columns seem to have a copy of the values from the first column.
> Here's the sample code:
> {code}
> import org.apache.spark.sql._
> val sqlc = new SQLContext(sc)
> val weather = sqlc.read.format("json").load("data.json")
> val prcp = weather.filter("metric = 'PRCP'").as("prcp").cache()
> val tmin = weather.filter("metric = 'TMIN'").as("tmin").cache()
> val tmax = weather.filter("metric = 'TMAX'").as("tmax").cache()
> prcp.filter("year=2012 and month=10").show()
> tmin.filter("year=2012 and month=10").show()
> tmax.filter("year=2012 and month=10").show()
> val out = (prcp.join(tmin, "date_str").join(tmax, "date_str")
>   .select(prcp("year"), prcp("month"), prcp("day"), prcp("date_str"),
> prcp("value").alias("PRCP"), tmin("value").alias("TMIN"),
> tmax("value").alias("TMAX")) )
> out.filter("year=2012 and month=10").show()
> {code}
> The output is:
> {code}
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  PRCP|   10|USW00023272|0|2012|
> |20121002|  2|  PRCP|   10|USW00023272|0|2012|
> |20121003|  3|  PRCP|   10|USW00023272|0|2012|
> |20121004|  4|  PRCP|   10|USW00023272|0|2012|
> |20121005|  5|  PRCP|   10|USW00023272|0|2012|
> |20121006|  6|  PRCP|   10|USW00023272|0|2012|
> |20121007|  7|  PRCP|   10|USW00023272|0|2012|
> |20121008|  8|  PRCP|   10|USW00023272|0|2012|
> |20121009|  9|  PRCP|   10|USW00023272|0|2012|
> |20121010| 10|  PRCP|   10|USW00023272|0|2012|
> |20121011| 11|  PRCP|   10|USW00023272|3|2012|
> |20121012| 12|  PRCP|   10|USW00023272|0|2012|
> |20121013| 13|  PRCP|   10|USW00023272|0|2012|
> |20121014| 14|  PRCP|   10|USW00023272|0|2012|
> |20121015| 15|  PRCP|   10|USW00023272|0|2012|
> |20121016| 16|  PRCP|   10|USW00023272|0|2012|
> |20121017| 17|  PRCP|   10|USW00023272|0|2012|
> |20121018| 18|  PRCP|   10|USW00023272|0|2012|
> |20121019| 19|  PRCP|   10|USW00023272|0|2012|
> |20121020| 20|  PRCP|   10|USW00023272|0|2012|
> ++---+--+-+---+-+——+
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMIN|   10|USW00023272|  139|2012|
> |20121002|  2|  TMIN|   10|USW00023272|  178|2012|
> |20121003|  3|  TMIN|   10|USW00023272|  144|2012|
> |20121004|  4|  TMIN|   10|USW00023272|  144|2012|
> |20121005|  5|  TMIN|   10|USW00023272|  139|2012|
> |20121006|  6|  TMIN|   10|USW00023272|  128|2012|
> |20121007|  7|  TMIN|   10|USW00023272|  122|2012|
> |20121008|  8|  TMIN|   10|USW00023272|  122|2012|
> |20121009|  9|  TMIN|   10|USW00023272|  139|2012|
> |20121010| 10|  TMIN|   10|USW00023272|  128|2012|
> |20121011| 11|  TMIN|   10|USW00023272|  122|2012|
> |20121012| 12|  TMIN|   10|USW00023272|  117|2012|
> |20121013| 13|  TMIN|   10|USW00023272|  122|2012|
> |20121014| 14|  TMIN|   10|USW00023272|  128|2012|
> |20121015| 15|  TMIN|   10|USW00023272|  128|2012|
> |20121016| 16|  TMIN|   10|USW00023272|  156|2012|
> |20121017| 17|  TMIN|   10|USW00023272|  139|2012|
> |20121018| 18|  TMIN|   10|USW00023272|  161|2012|
> |20121019| 19|  TMIN|   10|USW00023272|  133|2012|
> |20121020| 20|  TMIN|   10|USW00023272|  122|2012|
> ++---+--+-+---+-+——+
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMAX|   10|USW00023272|  322|2012|
> |20121002|  2|  TMAX|   10|USW00023272|  344|2012|
> |20121003|  3|  TMAX|   10|USW00023272|  222|2012|
> |20121004|  4|  TMAX|   10|USW00023272|  189|2012|
> |20121005|  5|  TMAX|   10|USW00023272|  194|2012|
> |20121006|  6|  TMAX|   10|USW00023272|  

[jira] [Resolved] (SPARK-10371) Optimize sequential projections

2015-11-10 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-10371.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9480
[https://github.com/apache/spark/pull/9480]

> Optimize sequential projections
> ---
>
> Key: SPARK-10371
> URL: https://issues.apache.org/jira/browse/SPARK-10371
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Assignee: Nong Li
>Priority: Critical
> Fix For: 1.6.0
>
>
> In ML pipelines, each transformer/estimator appends new columns to the input 
> DataFrame. For example, it might produce DataFrames like the following 
> columns: a, b, c, d, where a is from raw input, b = udf_b(a), c = udf_c(b), 
> and d = udf_d(c). Some UDFs could be expensive. However, if we materialize c 
> and d, udf_b, and udf_c are triggered twice, i.e., value c is not re-used.
> It would be nice to detect this pattern and re-use intermediate values.
> {code}
> val input = sqlContext.range(10)
> val output = input.withColumn("x", col("id") + 1).withColumn("y", col("x") * 
> 2)
> output.explain(true)
> == Parsed Logical Plan ==
> 'Project [*,('x * 2) AS y#254]
>  Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L]
>   LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30
> == Analyzed Logical Plan ==
> id: bigint, x: bigint, y: bigint
> Project [id#252L,x#253L,(x#253L * cast(2 as bigint)) AS y#254L]
>  Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L]
>   LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30
> == Optimized Logical Plan ==
> Project [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS y#254L]
>  LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30
> == Physical Plan ==
> TungstenProject [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS 
> y#254L]
>  Scan PhysicalRDD[id#252L]
> Code Generation: true
> input: org.apache.spark.sql.DataFrame = [id: bigint]
> output: org.apache.spark.sql.DataFrame = [id: bigint, x: bigint, y: bigint]
> {code}






[jira] [Commented] (SPARK-10892) Join with Data Frame returns wrong results

2015-11-10 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999178#comment-14999178
 ] 

Yin Huai commented on SPARK-10892:
--

We are still working on it. It is a pretty hard one. For now, can you use the 
workaround provided by [~cloud_fan] (using {{$"prcp.value"}})?
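
For reference, a rough sketch of that workaround against the example in this issue, 
referencing columns through the {{prcp}}/{{tmin}}/{{tmax}} aliases instead of the 
parent DataFrames (untested):

{code}
import sqlc.implicits._  // sqlc is the SQLContext from the example; gives the $"..." syntax

val out = prcp.join(tmin, "date_str").join(tmax, "date_str")
  .select($"prcp.year", $"prcp.month", $"prcp.day", $"date_str",
    $"prcp.value".alias("PRCP"), $"tmin.value".alias("TMIN"), $"tmax.value".alias("TMAX"))

out.filter("year=2012 and month=10").show()
{code}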

> Join with Data Frame returns wrong results
> --
>
> Key: SPARK-10892
> URL: https://issues.apache.org/jira/browse/SPARK-10892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Ofer Mendelevitch
>Priority: Critical
> Attachments: data.json
>
>
> I'm attaching a simplified reproducible example of the problem:
> 1. Loading a JSON file from HDFS as a Data Frame
> 2. Creating 3 data frames: PRCP, TMIN, TMAX
> 3. Joining the data frames together. Each of those has a column "value" with 
> the same name, so renaming them after the join.
> 4. The output seems incorrect; the first column has the correct values, but 
> the two other columns seem to have a copy of the values from the first column.
> Here's the sample code:
> {code}
> import org.apache.spark.sql._
> val sqlc = new SQLContext(sc)
> val weather = sqlc.read.format("json").load("data.json")
> val prcp = weather.filter("metric = 'PRCP'").as("prcp").cache()
> val tmin = weather.filter("metric = 'TMIN'").as("tmin").cache()
> val tmax = weather.filter("metric = 'TMAX'").as("tmax").cache()
> prcp.filter("year=2012 and month=10").show()
> tmin.filter("year=2012 and month=10").show()
> tmax.filter("year=2012 and month=10").show()
> val out = (prcp.join(tmin, "date_str").join(tmax, "date_str")
>   .select(prcp("year"), prcp("month"), prcp("day"), prcp("date_str"),
> prcp("value").alias("PRCP"), tmin("value").alias("TMIN"),
> tmax("value").alias("TMAX")) )
> out.filter("year=2012 and month=10").show()
> {code}
> The output is:
> {code}
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  PRCP|   10|USW00023272|0|2012|
> |20121002|  2|  PRCP|   10|USW00023272|0|2012|
> |20121003|  3|  PRCP|   10|USW00023272|0|2012|
> |20121004|  4|  PRCP|   10|USW00023272|0|2012|
> |20121005|  5|  PRCP|   10|USW00023272|0|2012|
> |20121006|  6|  PRCP|   10|USW00023272|0|2012|
> |20121007|  7|  PRCP|   10|USW00023272|0|2012|
> |20121008|  8|  PRCP|   10|USW00023272|0|2012|
> |20121009|  9|  PRCP|   10|USW00023272|0|2012|
> |20121010| 10|  PRCP|   10|USW00023272|0|2012|
> |20121011| 11|  PRCP|   10|USW00023272|3|2012|
> |20121012| 12|  PRCP|   10|USW00023272|0|2012|
> |20121013| 13|  PRCP|   10|USW00023272|0|2012|
> |20121014| 14|  PRCP|   10|USW00023272|0|2012|
> |20121015| 15|  PRCP|   10|USW00023272|0|2012|
> |20121016| 16|  PRCP|   10|USW00023272|0|2012|
> |20121017| 17|  PRCP|   10|USW00023272|0|2012|
> |20121018| 18|  PRCP|   10|USW00023272|0|2012|
> |20121019| 19|  PRCP|   10|USW00023272|0|2012|
> |20121020| 20|  PRCP|   10|USW00023272|0|2012|
> ++---+--+-+---+-+——+
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMIN|   10|USW00023272|  139|2012|
> |20121002|  2|  TMIN|   10|USW00023272|  178|2012|
> |20121003|  3|  TMIN|   10|USW00023272|  144|2012|
> |20121004|  4|  TMIN|   10|USW00023272|  144|2012|
> |20121005|  5|  TMIN|   10|USW00023272|  139|2012|
> |20121006|  6|  TMIN|   10|USW00023272|  128|2012|
> |20121007|  7|  TMIN|   10|USW00023272|  122|2012|
> |20121008|  8|  TMIN|   10|USW00023272|  122|2012|
> |20121009|  9|  TMIN|   10|USW00023272|  139|2012|
> |20121010| 10|  TMIN|   10|USW00023272|  128|2012|
> |20121011| 11|  TMIN|   10|USW00023272|  122|2012|
> |20121012| 12|  TMIN|   10|USW00023272|  117|2012|
> |20121013| 13|  TMIN|   10|USW00023272|  122|2012|
> |20121014| 14|  TMIN|   10|USW00023272|  128|2012|
> |20121015| 15|  TMIN|   10|USW00023272|  128|2012|
> |20121016| 16|  TMIN|   10|USW00023272|  156|2012|
> |20121017| 17|  TMIN|   10|USW00023272|  139|2012|
> |20121018| 18|  TMIN|   10|USW00023272|  161|2012|
> |20121019| 19|  TMIN|   10|USW00023272|  133|2012|
> |20121020| 20|  TMIN|   10|USW00023272|  122|2012|
> ++---+--+-+---+-+——+
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMAX|   10|USW00023272|  322|2012|
> |20121002|  2|  TMAX|   10|USW00023272|  344|2012|
> |20121003|  3|  TMAX|   10|USW00023272|  222|2012|
> |20121004|  4|  TMAX|   10|USW00023272|  

[jira] [Updated] (SPARK-11382) Replace example code in mllib-decision-tree.md using include_example

2015-11-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11382:
--
Shepherd: Xiangrui Meng  (was: Xusen Yin)
Assignee: Xusen Yin

> Replace example code in mllib-decision-tree.md using include_example
> 
>
> Key: SPARK-11382
> URL: https://issues.apache.org/jira/browse/SPARK-11382
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Xusen Yin
>  Labels: starter
>
> This is similar to SPARK-11289 but for the example code in 
> mllib-decision-tree.md.






[jira] [Commented] (SPARK-10892) Join with Data Frame returns wrong results

2015-11-10 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999156#comment-14999156
 ] 

Yin Huai commented on SPARK-10892:
--

[~ofermend] (We only assign a fix version after a JIRA is resolved.)

> Join with Data Frame returns wrong results
> --
>
> Key: SPARK-10892
> URL: https://issues.apache.org/jira/browse/SPARK-10892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Ofer Mendelevitch
>Priority: Critical
> Attachments: data.json
>
>
> I'm attaching a simplified reproducible example of the problem:
> 1. Loading a JSON file from HDFS as a Data Frame
> 2. Creating 3 data frames: PRCP, TMIN, TMAX
> 3. Joining the data frames together. Each of those has a column "value" with 
> the same name, so renaming them after the join.
> 4. The output seems incorrect; the first column has the correct values, but 
> the two other columns seem to have a copy of the values from the first column.
> Here's the sample code:
> {code}
> import org.apache.spark.sql._
> val sqlc = new SQLContext(sc)
> val weather = sqlc.read.format("json").load("data.json")
> val prcp = weather.filter("metric = 'PRCP'").as("prcp").cache()
> val tmin = weather.filter("metric = 'TMIN'").as("tmin").cache()
> val tmax = weather.filter("metric = 'TMAX'").as("tmax").cache()
> prcp.filter("year=2012 and month=10").show()
> tmin.filter("year=2012 and month=10").show()
> tmax.filter("year=2012 and month=10").show()
> val out = (prcp.join(tmin, "date_str").join(tmax, "date_str")
>   .select(prcp("year"), prcp("month"), prcp("day"), prcp("date_str"),
> prcp("value").alias("PRCP"), tmin("value").alias("TMIN"),
> tmax("value").alias("TMAX")) )
> out.filter("year=2012 and month=10").show()
> {code}
> The output is:
> {code}
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  PRCP|   10|USW00023272|0|2012|
> |20121002|  2|  PRCP|   10|USW00023272|0|2012|
> |20121003|  3|  PRCP|   10|USW00023272|0|2012|
> |20121004|  4|  PRCP|   10|USW00023272|0|2012|
> |20121005|  5|  PRCP|   10|USW00023272|0|2012|
> |20121006|  6|  PRCP|   10|USW00023272|0|2012|
> |20121007|  7|  PRCP|   10|USW00023272|0|2012|
> |20121008|  8|  PRCP|   10|USW00023272|0|2012|
> |20121009|  9|  PRCP|   10|USW00023272|0|2012|
> |20121010| 10|  PRCP|   10|USW00023272|0|2012|
> |20121011| 11|  PRCP|   10|USW00023272|3|2012|
> |20121012| 12|  PRCP|   10|USW00023272|0|2012|
> |20121013| 13|  PRCP|   10|USW00023272|0|2012|
> |20121014| 14|  PRCP|   10|USW00023272|0|2012|
> |20121015| 15|  PRCP|   10|USW00023272|0|2012|
> |20121016| 16|  PRCP|   10|USW00023272|0|2012|
> |20121017| 17|  PRCP|   10|USW00023272|0|2012|
> |20121018| 18|  PRCP|   10|USW00023272|0|2012|
> |20121019| 19|  PRCP|   10|USW00023272|0|2012|
> |20121020| 20|  PRCP|   10|USW00023272|0|2012|
> ++---+--+-+---+-++
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMIN|   10|USW00023272|  139|2012|
> |20121002|  2|  TMIN|   10|USW00023272|  178|2012|
> |20121003|  3|  TMIN|   10|USW00023272|  144|2012|
> |20121004|  4|  TMIN|   10|USW00023272|  144|2012|
> |20121005|  5|  TMIN|   10|USW00023272|  139|2012|
> |20121006|  6|  TMIN|   10|USW00023272|  128|2012|
> |20121007|  7|  TMIN|   10|USW00023272|  122|2012|
> |20121008|  8|  TMIN|   10|USW00023272|  122|2012|
> |20121009|  9|  TMIN|   10|USW00023272|  139|2012|
> |20121010| 10|  TMIN|   10|USW00023272|  128|2012|
> |20121011| 11|  TMIN|   10|USW00023272|  122|2012|
> |20121012| 12|  TMIN|   10|USW00023272|  117|2012|
> |20121013| 13|  TMIN|   10|USW00023272|  122|2012|
> |20121014| 14|  TMIN|   10|USW00023272|  128|2012|
> |20121015| 15|  TMIN|   10|USW00023272|  128|2012|
> |20121016| 16|  TMIN|   10|USW00023272|  156|2012|
> |20121017| 17|  TMIN|   10|USW00023272|  139|2012|
> |20121018| 18|  TMIN|   10|USW00023272|  161|2012|
> |20121019| 19|  TMIN|   10|USW00023272|  133|2012|
> |20121020| 20|  TMIN|   10|USW00023272|  122|2012|
> ++---+--+-+---+-++
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMAX|   10|USW00023272|  322|2012|
> |20121002|  2|  TMAX|   10|USW00023272|  344|2012|
> |20121003|  3|  TMAX|   10|USW00023272|  222|2012|
> |20121004|  4|  TMAX|   10|USW00023272|  189|2012|
> |20121005|  5|  TMAX|   10|USW00023272|  194|2012|
> |20121006|  

[jira] [Created] (SPARK-11634) Make simple transformers and estimators implement default read/write

2015-11-10 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-11634:
-

 Summary: Make simple transformers and estimators implement default 
read/write
 Key: SPARK-11634
 URL: https://issues.apache.org/jira/browse/SPARK-11634
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical


After SPARK-11217, we can make simple transformers and estimators implement the 
default read/write.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10892) Join with Data Frame returns wrong results

2015-11-10 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999178#comment-14999178
 ] 

Yin Huai edited comment on SPARK-10892 at 11/10/15 7:32 PM:


We are still working on it. It is a pretty hard one. For now, can you use the 
workaround provided by [~cloud_fan] (using {{$"prcp.value"}})?
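
In the meantime, a minimal sketch of that workaround against the example in the 
description (it assumes the {{prcp}}/{{tmin}}/{{tmax}} aliases from the report; 
the {{$"..."}} syntax requires {{import sqlc.implicits._}}):

{code}
import sqlc.implicits._  // enables the $"..." column syntax

// Refer to columns through the dataset aliases rather than prcp("value") etc.
val out = prcp.join(tmin, "date_str").join(tmax, "date_str")
  .select($"prcp.year", $"prcp.month", $"prcp.day", $"prcp.date_str",
    $"prcp.value".alias("PRCP"), $"tmin.value".alias("TMIN"),
    $"tmax.value".alias("TMAX"))
out.filter("year = 2012 and month = 10").show()
{code}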


was (Author: yhuai):
We are still working on it. It is a pretty hard one. For not, can you use the 
workaround provided by [~cloud_fan] (using {{$"prcp.value"}})?

> Join with Data Frame returns wrong results
> --
>
> Key: SPARK-10892
> URL: https://issues.apache.org/jira/browse/SPARK-10892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Ofer Mendelevitch
>Priority: Critical
> Attachments: data.json
>
>
> I'm attaching a simplified reproducible example of the problem:
> 1. Loading a JSON file from HDFS as a Data Frame
> 2. Creating 3 data frames: PRCP, TMIN, TMAX
> 3. Joining the data frames together. Each of those has a column "value" with 
> the same name, so renaming them after the join.
> 4. The output seems incorrect; the first column has the correct values, but 
> the two other columns seem to have a copy of the values from the first column.
> Here's the sample code:
> {code}
> import org.apache.spark.sql._
> val sqlc = new SQLContext(sc)
> val weather = sqlc.read.format("json").load("data.json")
> val prcp = weather.filter("metric = 'PRCP'").as("prcp").cache()
> val tmin = weather.filter("metric = 'TMIN'").as("tmin").cache()
> val tmax = weather.filter("metric = 'TMAX'").as("tmax").cache()
> prcp.filter("year=2012 and month=10").show()
> tmin.filter("year=2012 and month=10").show()
> tmax.filter("year=2012 and month=10").show()
> val out = (prcp.join(tmin, "date_str").join(tmax, "date_str")
>   .select(prcp("year"), prcp("month"), prcp("day"), prcp("date_str"),
> prcp("value").alias("PRCP"), tmin("value").alias("TMIN"),
> tmax("value").alias("TMAX")) )
> out.filter("year=2012 and month=10").show()
> {code}
> The output is:
> {code}
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  PRCP|   10|USW00023272|0|2012|
> |20121002|  2|  PRCP|   10|USW00023272|0|2012|
> |20121003|  3|  PRCP|   10|USW00023272|0|2012|
> |20121004|  4|  PRCP|   10|USW00023272|0|2012|
> |20121005|  5|  PRCP|   10|USW00023272|0|2012|
> |20121006|  6|  PRCP|   10|USW00023272|0|2012|
> |20121007|  7|  PRCP|   10|USW00023272|0|2012|
> |20121008|  8|  PRCP|   10|USW00023272|0|2012|
> |20121009|  9|  PRCP|   10|USW00023272|0|2012|
> |20121010| 10|  PRCP|   10|USW00023272|0|2012|
> |20121011| 11|  PRCP|   10|USW00023272|3|2012|
> |20121012| 12|  PRCP|   10|USW00023272|0|2012|
> |20121013| 13|  PRCP|   10|USW00023272|0|2012|
> |20121014| 14|  PRCP|   10|USW00023272|0|2012|
> |20121015| 15|  PRCP|   10|USW00023272|0|2012|
> |20121016| 16|  PRCP|   10|USW00023272|0|2012|
> |20121017| 17|  PRCP|   10|USW00023272|0|2012|
> |20121018| 18|  PRCP|   10|USW00023272|0|2012|
> |20121019| 19|  PRCP|   10|USW00023272|0|2012|
> |20121020| 20|  PRCP|   10|USW00023272|0|2012|
> ++---+--+-+---+-++
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMIN|   10|USW00023272|  139|2012|
> |20121002|  2|  TMIN|   10|USW00023272|  178|2012|
> |20121003|  3|  TMIN|   10|USW00023272|  144|2012|
> |20121004|  4|  TMIN|   10|USW00023272|  144|2012|
> |20121005|  5|  TMIN|   10|USW00023272|  139|2012|
> |20121006|  6|  TMIN|   10|USW00023272|  128|2012|
> |20121007|  7|  TMIN|   10|USW00023272|  122|2012|
> |20121008|  8|  TMIN|   10|USW00023272|  122|2012|
> |20121009|  9|  TMIN|   10|USW00023272|  139|2012|
> |20121010| 10|  TMIN|   10|USW00023272|  128|2012|
> |20121011| 11|  TMIN|   10|USW00023272|  122|2012|
> |20121012| 12|  TMIN|   10|USW00023272|  117|2012|
> |20121013| 13|  TMIN|   10|USW00023272|  122|2012|
> |20121014| 14|  TMIN|   10|USW00023272|  128|2012|
> |20121015| 15|  TMIN|   10|USW00023272|  128|2012|
> |20121016| 16|  TMIN|   10|USW00023272|  156|2012|
> |20121017| 17|  TMIN|   10|USW00023272|  139|2012|
> |20121018| 18|  TMIN|   10|USW00023272|  161|2012|
> |20121019| 19|  TMIN|   10|USW00023272|  133|2012|
> |20121020| 20|  TMIN|   10|USW00023272|  122|2012|
> ++---+--+-+---+-++
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> 

[jira] [Resolved] (SPARK-7841) Spark build should not use lib_managed for dependencies

2015-11-10 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-7841.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9575
[https://github.com/apache/spark/pull/9575]

> Spark build should not use lib_managed for dependencies
> ---
>
> Key: SPARK-7841
> URL: https://issues.apache.org/jira/browse/SPARK-7841
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.3.1
>Reporter: Iulian Dragos
>Assignee: Josh Rosen
>  Labels: easyfix, sbt
> Fix For: 1.6.0
>
>
> - unnecessary duplication (I will have those libraries under ./m2, via maven 
> anyway)
> - every time I call make-distribution I lose lib_managed (via mvn clean 
> install) and have to wait to download again all jars next time I use sbt
> - Eclipse does not handle relative paths very well (source attachments from 
> lib_managed don’t always work)
> - it's not the default configuration. If we stray from defaults I think there 
> should be a clear advantage.
> Digging through history, the only reference to `retrieveManaged := true` I 
> found was in f686e3d, from July 2011 ("Initial work on converting build to 
> SBT 0.10.1"). My guess this is purely an accident of porting the build form 
> Sbt 0.7.x and trying to keep the old project layout.
> If there are reasons for keeping it, please comment (I didn't get any answers 
> on the [dev mailing 
> list|http://apache-spark-developers-list.1001551.n3.nabble.com/Why-use-quot-lib-managed-quot-for-the-Sbt-build-td12361.html])



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9830) Remove AggregateExpression1 and Aggregate Operator used to evaluate AggregateExpression1s

2015-11-10 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-9830.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9556
[https://github.com/apache/spark/pull/9556]

> Remove AggregateExpression1 and Aggregate Operator used to evaluate 
> AggregateExpression1s
> -
>
> Key: SPARK-9830
> URL: https://issues.apache.org/jira/browse/SPARK-9830
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.6.0
>
>
> In 1.6.0, we should remove the old aggregation code path (used to evaluate 
> AggregateExpression1). While removing this code path, there are several code 
> cleanups we need to do:
> 1. Remove all of our hacks from ResolveFunctions.
> 2. Remove all of the conversion logic from 
> {{org.apache.spark.sql.catalyst.expressions.aggregate.utils}}.
> 3. Remove the {{newAggregation}} field from {{logical.Aggregate}}.
> 4. Remove the query planning rule for the old aggregate path.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11583) Make MapStatus use less memory usage

2015-11-10 Thread Daniel Lemire (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999130#comment-14999130
 ] 

Daniel Lemire commented on SPARK-11583:
---

> So, this is about compressing the in-memory representation in some way, 
> whereas roaringbitmap compressed the external representation?

Probably not. Roaring bitmaps use about as much RAM as they use serialized 
bytes.

Possibly Spark would either need to use a version of Roaring like Lucene's (which 
eagerly handles the flips) or make use of our "runOptimize" method. See my 
answer below to [~irashid].

> Make MapStatus use less memory usage
> ---
>
> Key: SPARK-11583
> URL: https://issues.apache.org/jira/browse/SPARK-11583
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Reporter: Kent Yao
>
> In the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, as I 
> said, using BitSet can save ≈20% memory usage compared to RoaringBitMap. 
> For a Spark job that contains quite a lot of tasks, 20% seems a drop in the ocean. 
> Essentially, BitSet uses long[]. For example, a BitSet[200k] = long[3125].
> So we can use a HashSet[Int] to store reduceId (when non-empty blocks are 
> dense, use the reduceId of empty blocks; when sparse, use the non-empty ones). 
> For dense cases: if HashSet[Int](numNonEmptyBlocks).size <   
> BitSet[totalBlockNum], I use MapStatusTrackingNoEmptyBlocks
> For sparse cases: if HashSet[Int](numEmptyBlocks).size <   
> BitSet[totalBlockNum], I use MapStatusTrackingEmptyBlocks
> sparse case, 299/300 are empty
> sc.makeRDD(1 to 3, 3000).groupBy(x=>x).top(5)
> dense case,  no block is empty
> sc.makeRDD(1 to 900, 3000).groupBy(x=>x).top(5)
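
For reference, the arithmetic behind the {{BitSet[200k] = long[3125]}} figure 
quoted above, as a small sketch (approximate; JVM object headers and any 
HashSet per-entry overhead are ignored):

{code}
// A BitSet over N bits is backed by a long[] of ceil(N / 64) words.
val totalBlocks = 200000
val words = (totalBlocks + 63) / 64   // 3125 longs, matching the description
val bitSetBytes = words * 8           // roughly 25 KB per map status
println(s"BitSet over $totalBlocks blocks ~= $bitSetBytes bytes")
{code}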



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6990) Add Java linting script

2015-11-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6990:
---

Assignee: (was: Apache Spark)

> Add Java linting script
> ---
>
> Key: SPARK-6990
> URL: https://issues.apache.org/jira/browse/SPARK-6990
> Project: Spark
>  Issue Type: New Feature
>  Components: Project Infra
>Reporter: Josh Rosen
>Priority: Minor
>  Labels: starter
>
> It would be nice to add a {{dev/lint-java}} script to enforce style rules for 
> Spark's Java code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6990) Add Java linting script

2015-11-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6990:
---

Assignee: Apache Spark

> Add Java linting script
> ---
>
> Key: SPARK-6990
> URL: https://issues.apache.org/jira/browse/SPARK-6990
> Project: Spark
>  Issue Type: New Feature
>  Components: Project Infra
>Reporter: Josh Rosen
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>
> It would be nice to add a {{dev/lint-java}} script to enforce style rules for 
> Spark's Java code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6990) Add Java linting script

2015-11-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998961#comment-14998961
 ] 

Apache Spark commented on SPARK-6990:
-

User 'dskrvk' has created a pull request for this issue:
https://github.com/apache/spark/pull/9600

> Add Java linting script
> ---
>
> Key: SPARK-6990
> URL: https://issues.apache.org/jira/browse/SPARK-6990
> Project: Spark
>  Issue Type: New Feature
>  Components: Project Infra
>Reporter: Josh Rosen
>Priority: Minor
>  Labels: starter
>
> It would be nice to add a {{dev/lint-java}} script to enforce style rules for 
> Spark's Java code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11633) HiveContext throws TreeNode Exception : Failed to Copy Node

2015-11-10 Thread Saurabh Santhosh (JIRA)
Saurabh Santhosh created SPARK-11633:


 Summary: HiveContext throws TreeNode Exception : Failed to Copy 
Node
 Key: SPARK-11633
 URL: https://issues.apache.org/jira/browse/SPARK-11633
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
Reporter: Saurabh Santhosh
Priority: Critical


h2. HiveContext#sql is throwing the following exception in a specific scenario :

h2. Exception :

Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
Failed to copy node.
Is otherCopyArgs specified correctly for LogicalRDD.
Exception message: wrong number of arguments
ctor: public org.apache.spark.sql.execution.LogicalRDD
(scala.collection.Seq,org.apache.spark.rdd.RDD,org.apache.spark.sql.SQLContext)?

h2. Code :

{code:title=SparkClient.java|borderStyle=solid}
StructField[] fields = new StructField[2];
fields[0] = new StructField("F1", DataTypes.StringType, true, Metadata.empty());
fields[1] = new StructField("F2", DataTypes.StringType, true, Metadata.empty());

JavaRDD<Row> rdd = 
javaSparkContext.parallelize(Arrays.asList(RowFactory.create("", "", 0)));

DataFrame df = sparkHiveContext.createDataFrame(rdd, new StructType(fields));
sparkHiveContext.registerDataFrameAsTable(df, "t1");

DataFrame aliasedDf = sparkHiveContext.sql("select f1, F2 as F2 from t1");

sparkHiveContext.registerDataFrameAsTable(aliasedDf, "t2");
sparkHiveContext.registerDataFrameAsTable(aliasedDf, "t3");

sparkHiveContext.sql("select a.F1 from t2 a inner join t3 b on a.F2=b.F2");

{code}

h2. Observations:

* If F1 (the exact name of the field) is used instead of f1, the code works correctly (see the sketch below).
* If no alias is used for F2, the code also works, irrespective of the case of F1.
* If field F2 is not used in the final query, the code also works correctly.
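
A minimal Scala sketch of the working variant from the first observation, using 
the same temp tables registered above (this is only the reporter's workaround, 
not a fix for the underlying issue):

{code}
// Using the exact-case field name F1 avoids the "Failed to copy node" error.
sparkHiveContext.sql("select a.F1 from t2 a inner join t3 b on a.F2 = b.F2").show()
{code}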



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11382) Replace example code in mllib-decision-tree.md using include_example

2015-11-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11382.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9596
[https://github.com/apache/spark/pull/9596]

> Replace example code in mllib-decision-tree.md using include_example
> 
>
> Key: SPARK-11382
> URL: https://issues.apache.org/jira/browse/SPARK-11382
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Xusen Yin
>  Labels: starter
> Fix For: 1.6.0
>
>
> This is similar to SPARK-11289 but for the example code in 
> mllib-decision-tree.md.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11583) Make MapStatus use less memory usage

2015-11-10 Thread Daniel Lemire (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999117#comment-14999117
 ] 

Daniel Lemire commented on SPARK-11583:
---

> When a set is nearly full, RoaringBitmap does not automatically invert the 
> bits in order to minimize space. 

The Roaring implementation in Lucene inverts bits to minimize space, as 
described in...

https://github.com/apache/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/util/RoaringDocIdSet.java

The RoaringBitmap library, which we produced, does not. However, it does something 
similar upon request.

You might want to try...

 x.flip(0,1001);
 x.runOptimize();
 x.getSizeInBytes();

The call to runOptimize should significantly reduce memory usage in this case. 


The intention is that users should call "runOptimize" once their bitmap has 
been created and is no longer expected to change frequently. So "runOptimize" 
should always be called prior to serialization.
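
Spelled out as a small, self-contained sketch (same RoaringBitmap calls as the 
snippet above; exact method signatures may differ between library versions):

{code}
import org.roaringbitmap.RoaringBitmap

val bits = new RoaringBitmap()
for (i <- 0 until 1000) bits.add(i)  // a nearly-full range, as in the example

bits.flip(0, 1001)            // flip the range [0, 1001), as in the snippet above
bits.runOptimize()            // convert dense containers to run-length form
println(bits.getSizeInBytes)  // size estimate after optimization
{code}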


> Make MapStatus use less memory usage
> ---
>
> Key: SPARK-11583
> URL: https://issues.apache.org/jira/browse/SPARK-11583
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Reporter: Kent Yao
>
> In the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, as I 
> said, using BitSet can save ≈20% memory usage compared to RoaringBitMap. 
> For a Spark job that contains quite a lot of tasks, 20% seems a drop in the ocean. 
> Essentially, BitSet uses long[]. For example, a BitSet[200k] = long[3125].
> So we can use a HashSet[Int] to store reduceId (when non-empty blocks are 
> dense, use the reduceId of empty blocks; when sparse, use the non-empty ones). 
> For dense cases: if HashSet[Int](numNonEmptyBlocks).size <   
> BitSet[totalBlockNum], I use MapStatusTrackingNoEmptyBlocks
> For sparse cases: if HashSet[Int](numEmptyBlocks).size <   
> BitSet[totalBlockNum], I use MapStatusTrackingEmptyBlocks
> sparse case, 299/300 are empty
> sc.makeRDD(1 to 3, 3000).groupBy(x=>x).top(5)
> dense case,  no block is empty
> sc.makeRDD(1 to 900, 3000).groupBy(x=>x).top(5)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11583) Make MapStatus use less memory usage

2015-11-10 Thread Daniel Lemire (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999142#comment-14999142
 ] 

Daniel Lemire commented on SPARK-11583:
---

[~Qin Yao] wrote: 

"Roaringbitmap is same as the BitSet now we use in HiglyCompressedMapStatus, 
but take 20% memory usage more than BitSet. They both don't be compressed 
in-memory. According to the annotations of the former 
Roaring-HiglyCompressedMapStatus, it can be compressed during serialization not 
in-memory."


I think that's a misunderstanding.

Lucene and Apache Kylin use Roaring for in-memory bitmaps, and it saves a ton 
of memory. Druid uses them for memory-mapped bitmaps, and it compresses well.

If you do flips, then it is possible that Roaring might end up being 
inefficient. Lucene has one approach to that, in RoaringBitmap, we offer the 
"runOptimize" function.

> Make MapStatus use less memory usage
> ---
>
> Key: SPARK-11583
> URL: https://issues.apache.org/jira/browse/SPARK-11583
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Reporter: Kent Yao
>
> In the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, as I 
> said, using BitSet can save ≈20% memory usage compared to RoaringBitMap. 
> For a Spark job that contains quite a lot of tasks, 20% seems a drop in the ocean. 
> Essentially, BitSet uses long[]. For example, a BitSet[200k] = long[3125].
> So we can use a HashSet[Int] to store reduceId (when non-empty blocks are 
> dense, use the reduceId of empty blocks; when sparse, use the non-empty ones). 
> For dense cases: if HashSet[Int](numNonEmptyBlocks).size <   
> BitSet[totalBlockNum], I use MapStatusTrackingNoEmptyBlocks
> For sparse cases: if HashSet[Int](numEmptyBlocks).size <   
> BitSet[totalBlockNum], I use MapStatusTrackingEmptyBlocks
> sparse case, 299/300 are empty
> sc.makeRDD(1 to 3, 3000).groupBy(x=>x).top(5)
> dense case,  no block is empty
> sc.makeRDD(1 to 900, 3000).groupBy(x=>x).top(5)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11590) use native json_tuple in lateral view

2015-11-10 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-11590:
-
Assignee: Wenchen Fan

> use native json_tuple in lateral view
> -
>
> Key: SPARK-11590
> URL: https://issues.apache.org/jira/browse/SPARK-11590
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6152) Spark does not support Java 8 compiled Scala classes

2015-11-10 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999190#comment-14999190
 ] 

Josh Rosen edited comment on SPARK-6152 at 11/10/15 7:44 PM:
-

Does anyone have a standalone reproduction of this issue that I can use to test 
my PR? https://github.com/apache/spark/pull/9512
 
EDIT: just realized that this issue pertains to _Scala_ classes that were 
compiled with Java 8. Will add a new test to try that out.


was (Author: joshrosen):
Does anyone have a standalone reproduction of this issue that I can use to test 
my PR? https://github.com/apache/spark/pull/9512

> Spark does not support Java 8 compiled Scala classes
> 
>
> Key: SPARK-6152
> URL: https://issues.apache.org/jira/browse/SPARK-6152
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
> Environment: Java 8+
> Scala 2.11
>Reporter: Ronald Chen
>Assignee: Josh Rosen
>Priority: Critical
>
> Spark uses reflectasm to check Scala closures which fails if the *user 
> defined Scala closures* are compiled to Java 8 class version
> The cause is reflectasm does not support Java 8
> https://github.com/EsotericSoftware/reflectasm/issues/35
> Workaround:
> Don't compile Scala classes to Java 8, Scala 2.11 does not support nor 
> require any Java 8 features
> Stack trace:
> {code}
> java.lang.IllegalArgumentException
>   at 
> com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.<init>(Unknown
>  Source)
>   at 
> com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.<init>(Unknown
>  Source)
>   at 
> com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.<init>(Unknown
>  Source)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$getClassReader(ClosureCleaner.scala:41)
>   at 
> org.apache.spark.util.ClosureCleaner$.getInnerClasses(ClosureCleaner.scala:84)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:107)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:1478)
>   at org.apache.spark.rdd.RDD.map(RDD.scala:288)
>   at ...my Scala 2.11 compiled to Java 8 code calling into spark
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11636) Support as on Classes defined in the REPL

2015-11-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11636:


Assignee: Apache Spark  (was: Michael Armbrust)

> Support as on Classes defined in the REPL
> -
>
> Key: SPARK-11636
> URL: https://issues.apache.org/jira/browse/SPARK-11636
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11637) Alias do not work with udf with * parameter

2015-11-10 Thread Pierre Borckmans (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Borckmans updated SPARK-11637:
-
Description: 
In Spark < 1.5.0, this used to work :

{code:java|title=Spark <1.5.0|borderStyle=solid}
scala> sqlContext.sql("select hash(*) as x from T")
res2: org.apache.spark.sql.DataFrame = [x: int]
{code}

>From Spark 1.5.0+, it fails:
{code:java|title=Spark +1.5.0|borderStyle=solid}
scala> sqlContext.sql("select hash(*) as x from T")
org.apache.spark.sql.AnalysisException: unresolved operator 'Project ['hash(*) 
AS x#1];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
...
{code}

This is not specific to the `hash` udf. It also applies to user defined 
functions.
The `*` seems to be the issue.

  was:
In Spark < 1.5.0, this used to work :

{code:java|title=Spark <1.5.0|borderStyle=solid}
scala> sqlContext.sql("select hash(*) as x from T")
res2: org.apache.spark.sql.DataFrame = [x: int]
{code}

>From Spark 1.5.0+, it fails:
{code:java|title=Spark +1.5.0|borderStyle=solid}
scala> sqlContext.sql("select hash(*) as x from T")
org.apache.spark.sql.AnalysisException: unresolved operator 'Project ['hash(*) 
AS x#1];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
...
{code}



> Alias do not work with udf with * parameter
> ---
>
> Key: SPARK-11637
> URL: https://issues.apache.org/jira/browse/SPARK-11637
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0, 1.5.1, 1.5.2
>Reporter: Pierre Borckmans
>
> In Spark < 1.5.0, this used to work :
> {code:java|title=Spark <1.5.0|borderStyle=solid}
> scala> sqlContext.sql("select hash(*) as x from T")
> res2: org.apache.spark.sql.DataFrame = [x: int]
> {code}
> From Spark 1.5.0+, it fails:
> {code:java|title=Spark +1.5.0|borderStyle=solid}
> scala> sqlContext.sql("select hash(*) as x from T")
> org.apache.spark.sql.AnalysisException: unresolved operator 'Project 
> ['hash(*) AS x#1];
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> ...
> {code}
> This is not specific to the `hash` udf. It also applies to user defined 
> functions.
> The `*` seems to be the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11631) DAGScheduler prints "Stopping DAGScheduler" at INFO to the logs with no corresponding "Starting"

2015-11-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11631:


Assignee: (was: Apache Spark)

> DAGScheduler prints "Stopping DAGScheduler" at INFO to the logs with no 
> corresponding "Starting"
> 
>
> Key: SPARK-11631
> URL: https://issues.apache.org/jira/browse/SPARK-11631
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0
> Environment: Spark sources as of today - revision {{5039a49}}
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> At stop, DAGScheduler prints out {{INFO DAGScheduler: Stopping 
> DAGScheduler}}, but there's no corresponding Starting INFO message. It can be 
> surprising.
> I think Spark should make a change and pick one of the following:
> 1. {{INFO DAGScheduler: Stopping DAGScheduler}} should be DEBUG at most 
> (or even TRACE), or
> 2. {{INFO DAGScheduler: Stopping DAGScheduler}} should have a corresponding 
> {{INFO DAGScheduler: Starting DAGScheduler}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11638) Apache Spark on Docker with Bridge networking / run Spark in Mesos on Docker with Bridge networking

2015-11-10 Thread Radoslaw Gruchalski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Radoslaw Gruchalski updated SPARK-11638:

Attachment: 2.3.11.patch
2.3.4.patch
1.6.0-master.patch
1.5.2.patch
1.5.1.patch
1.5.0.patch
1.4.1.patch
1.4.0.patch

The {{2.3.4.patch}} and {{2.3.11.patch}} are {{akka-remote}} patches. The rest 
of the files are Apache Spark patches for the respective versions.

> Apache Spark on Docker with Bridge networking / run Spark in Mesos on Docker 
> with Bridge networking
> ---
>
> Key: SPARK-11638
> URL: https://issues.apache.org/jira/browse/SPARK-11638
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0
>Reporter: Radoslaw Gruchalski
> Attachments: 1.4.0.patch, 1.4.1.patch, 1.5.0.patch, 1.5.1.patch, 
> 1.5.2.patch, 1.6.0-master.patch, 2.3.11.patch, 2.3.4.patch
>
>
> Full description will be provided within the next few minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6519) Add spark.ml API for bisecting k-means

2015-11-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999358#comment-14999358
 ] 

Apache Spark commented on SPARK-6519:
-

User 'yu-iskw' has created a pull request for this issue:
https://github.com/apache/spark/pull/9604

> Add spark.ml API for bisecting k-means
> --
>
> Key: SPARK-6519
> URL: https://issues.apache.org/jira/browse/SPARK-6519
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Yu Ishikawa
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6519) Add spark.ml API for bisecting k-means

2015-11-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6519:
---

Assignee: Apache Spark

> Add spark.ml API for bisecting k-means
> --
>
> Key: SPARK-6519
> URL: https://issues.apache.org/jira/browse/SPARK-6519
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Yu Ishikawa
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11640) shading packages in spark-assembly jar

2015-11-10 Thread PJ Fanning (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated SPARK-11640:
---
Issue Type: Wish  (was: Bug)

> shading packages in spark-assembly jar
> --
>
> Key: SPARK-11640
> URL: https://issues.apache.org/jira/browse/SPARK-11640
> Project: Spark
>  Issue Type: Wish
>  Components: Build
>Reporter: PJ Fanning
>
> The spark assembly jar contains classes from many external dependencies like 
> hadoop and bouncycastle.
> I have run into issues trying to use bouncycastle code in a Spark job because 
> the JCE codebase expects the encryption code to be in a signed jar and since 
> the classes are copied into spark-assembly jar and it is not signed, the JCE 
> framework returns an error.
> If the bouncycastle classes in spark-assembly were shaded, then I could 
> deploy the properly signed bcprov jar. The spark code could access the shaded 
> copies of the bouncycastle classes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11640) shading packages in spark-assembly jar

2015-11-10 Thread PJ Fanning (JIRA)
PJ Fanning created SPARK-11640:
--

 Summary: shading packages in spark-assembly jar
 Key: SPARK-11640
 URL: https://issues.apache.org/jira/browse/SPARK-11640
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: PJ Fanning


The spark assembly jar contains classes from many external dependencies like 
hadoop and bouncycastle.
I have run into issues trying to use bouncycastle code in a Spark job because 
the JCE codebase expects the encryption code to be in a signed jar and since 
the classes are copied into spark-assembly jar and it is not signed, the JCE 
framework returns an error.
If the bouncycastle classes in spark-assembly were shaded, then I could deploy 
the properly signed bcprov jar. The spark code could access the shaded copies 
of the bouncycastle classes.
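
To illustrate the kind of relocation being requested, here is a sketch using the 
sbt-assembly plugin's shading rules (illustrative only; the actual Spark assembly 
is built with Maven, and the target package name here is made up):

{code}
// build.sbt (assumes the sbt-assembly plugin >= 0.14 is enabled)
assemblyShadeRules in assembly := Seq(
  // Relocate the bundled BouncyCastle classes so a properly signed bcprov jar
  // can be supplied separately at runtime without clashing.
  ShadeRule.rename("org.bouncycastle.**" -> "org.spark_project.bouncycastle.@1").inAll
)
{code}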





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-11640) shading packages in spark-assembly jar

2015-11-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-11640:
---

> shading packages in spark-assembly jar
> --
>
> Key: SPARK-11640
> URL: https://issues.apache.org/jira/browse/SPARK-11640
> Project: Spark
>  Issue Type: Wish
>  Components: Build
>Reporter: PJ Fanning
>
> The spark assembly jar contains classes from many external dependencies like 
> hadoop and bouncycastle.
> I have run into issues trying to use bouncycastle code in a Spark job because 
> the JCE codebase expects the encryption code to be in a signed jar and since 
> the classes are copied into spark-assembly jar and it is not signed, the JCE 
> framework returns an error.
> If the bouncycastle classes in spark-assembly were shaded, then I could 
> deploy the properly signed bcprov jar. The spark code could access the shaded 
> copies of the bouncycastle classes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11640) shading packages in spark-assembly jar

2015-11-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11640.
---
Resolution: Fixed

I'm going to take a pretty good guess that you want to build with the 
"hadoop-provided" profile if that's your issue.

> shading packages in spark-assembly jar
> --
>
> Key: SPARK-11640
> URL: https://issues.apache.org/jira/browse/SPARK-11640
> Project: Spark
>  Issue Type: Wish
>  Components: Build
>Reporter: PJ Fanning
>
> The spark assembly jar contains classes from many external dependencies like 
> hadoop and bouncycastle.
> I have run into issues trying to use bouncycastle code in a Spark job because 
> the JCE codebase expects the encryption code to be in a signed jar and since 
> the classes are copied into spark-assembly jar and it is not signed, the JCE 
> framework returns an error.
> If the bouncycastle classes in spark-assembly were shaded, then I could 
> deploy the properly signed bcprov jar. The spark code could access the shaded 
> copies of the bouncycastle classes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6152) Spark does not support Java 8 compiled Scala classes

2015-11-10 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999190#comment-14999190
 ] 

Josh Rosen commented on SPARK-6152:
---

Does anyone have a standalone reproduction of this issue that I can use to test 
my PR? https://github.com/apache/spark/pull/9512

> Spark does not support Java 8 compiled Scala classes
> 
>
> Key: SPARK-6152
> URL: https://issues.apache.org/jira/browse/SPARK-6152
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
> Environment: Java 8+
> Scala 2.11
>Reporter: Ronald Chen
>Assignee: Josh Rosen
>Priority: Critical
>
> Spark uses reflectasm to check Scala closures which fails if the *user 
> defined Scala closures* are compiled to Java 8 class version
> The cause is reflectasm does not support Java 8
> https://github.com/EsotericSoftware/reflectasm/issues/35
> Workaround:
> Don't compile Scala classes to Java 8, Scala 2.11 does not support nor 
> require any Java 8 features
> Stack trace:
> {code}
> java.lang.IllegalArgumentException
>   at 
> com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.<init>(Unknown
>  Source)
>   at 
> com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.<init>(Unknown
>  Source)
>   at 
> com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.<init>(Unknown
>  Source)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$getClassReader(ClosureCleaner.scala:41)
>   at 
> org.apache.spark.util.ClosureCleaner$.getInnerClasses(ClosureCleaner.scala:84)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:107)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:1478)
>   at org.apache.spark.rdd.RDD.map(RDD.scala:288)
>   at ...my Scala 2.11 compiled to Java 8 code calling into spark
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11635) Avoid Guava version dependency for Hashcodes

2015-11-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11635:


Assignee: Apache Spark

> Avoid Guava version dependency for Hashcodes
> 
>
> Key: SPARK-11635
> URL: https://issues.apache.org/jira/browse/SPARK-11635
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.6.0
>Reporter: Nishkam Ravi
>Assignee: Apache Spark
>
> To avoid a Guava version dependency we can use Hashcode instead of Hashcodes 
> (deprecated starting with 16.0).
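
For context, a minimal sketch of the proposed substitution (class names as I 
understand the Guava API; verify against the Guava version actually on the 
classpath):

{code}
import com.google.common.hash.HashCode

// HashCode's static factories replace the deprecated HashCodes helpers.
val hc: HashCode = HashCode.fromInt(42)
println(hc.asInt())
{code}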



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11637) Alias do not work with udf with * parameter

2015-11-10 Thread Pierre Borckmans (JIRA)
Pierre Borckmans created SPARK-11637:


 Summary: Alias do not work with udf with * parameter
 Key: SPARK-11637
 URL: https://issues.apache.org/jira/browse/SPARK-11637
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.1, 1.5.0, 1.5.2
Reporter: Pierre Borckmans


In Spark < 1.5.0, this used to work :
```
scala> sqlContext.sql("select hash(*) as x from T")
res2: org.apache.spark.sql.DataFrame = [x: int]
```

>From Spark 1.5.0+, it fails:
```
scala> sqlContext.sql("select hash(*) as x from T")
org.apache.spark.sql.AnalysisException: unresolved operator 'Project ['hash(*) 
AS x#1];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
...
```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11618) Refactoring of basic ML import/export

2015-11-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11618.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9587
[https://github.com/apache/spark/pull/9587]

> Refactoring of basic ML import/export
> -
>
> Key: SPARK-11618
> URL: https://issues.apache.org/jira/browse/SPARK-11618
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
> Fix For: 1.6.0
>
>
> This is for a few updates to the original PR for basic ML import/export in 
> [SPARK-11217].
> * The original PR diverges from the design doc in that it does not include 
> the Spark version or a model format version.  We should include the Spark 
> version in the metadata.  If we do that, then we don't really need a model 
> format version.
> * Proposal: DefaultParamsWriter includes two separable pieces of logic in 
> save(): (a) handling overwriting and (b) saving Params.  I want to separate 
> these by putting (a) in a save() method in Writer which calls an abstract 
> saveImpl, and (b) in the saveImpl implementation in DefaultParamsWriter.  
> This is described below:
> {code}
> abstract class Writer {
>   def save(path: String) = {
> // handle overwrite
> saveImpl(path)
>   }
>   def saveImpl(path: String)   // abstract
> }
> class DefaultParamsWriter extends Writer {
>   def saveImpl(path: String) = {
> // save Params
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6152) Spark does not support Java 8 compiled Scala classes

2015-11-10 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999222#comment-14999222
 ] 

Josh Rosen commented on SPARK-6152:
---

Yep, was able to reproduce trivially by running Spark's existing Scala unit 
tests with JDK 8. I'm going to add some plumbing to the build in order to let 
us test this in Jenkins.

> Spark does not support Java 8 compiled Scala classes
> 
>
> Key: SPARK-6152
> URL: https://issues.apache.org/jira/browse/SPARK-6152
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
> Environment: Java 8+
> Scala 2.11
>Reporter: Ronald Chen
>Assignee: Josh Rosen
>Priority: Critical
>
> Spark uses reflectasm to check Scala closures which fails if the *user 
> defined Scala closures* are compiled to Java 8 class version
> The cause is reflectasm does not support Java 8
> https://github.com/EsotericSoftware/reflectasm/issues/35
> Workaround:
> Don't compile Scala classes to Java 8, Scala 2.11 does not support nor 
> require any Java 8 features
> Stack trace:
> {code}
> java.lang.IllegalArgumentException
>   at 
> com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.<init>(Unknown
>  Source)
>   at 
> com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.<init>(Unknown
>  Source)
>   at 
> com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.<init>(Unknown
>  Source)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$getClassReader(ClosureCleaner.scala:41)
>   at 
> org.apache.spark.util.ClosureCleaner$.getInnerClasses(ClosureCleaner.scala:84)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:107)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:1478)
>   at org.apache.spark.rdd.RDD.map(RDD.scala:288)
>   at ...my Scala 2.11 compiled to Java 8 code calling into spark
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11637) Alias do not work with udf with * parameter

2015-11-10 Thread Pierre Borckmans (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Borckmans updated SPARK-11637:
-
Description: 
In Spark < 1.5.0, this used to work :

{code:java|title=Spark <1.5.0|borderStyle=solid}
scala> sqlContext.sql("select hash(*) as x from T")
res2: org.apache.spark.sql.DataFrame = [x: int]
{code}

>From Spark 1.5.0+, it fails:
{code:java|title=Spark>=1.5.0|borderStyle=solid}
scala> sqlContext.sql("select hash(*) as x from T")
org.apache.spark.sql.AnalysisException: unresolved operator 'Project ['hash(*) 
AS x#1];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
...
{code}

This is not specific to the `hash` udf. It also applies to user defined 
functions.
The `*` seems to be the issue.

  was:
In Spark < 1.5.0, this used to work :

{code:java|title=Spark <1.5.0|borderStyle=solid}
scala> sqlContext.sql("select hash(*) as x from T")
res2: org.apache.spark.sql.DataFrame = [x: int]
{code}

>From Spark 1.5.0+, it fails:
{code:java|title=Spark +1.5.0|borderStyle=solid}
scala> sqlContext.sql("select hash(*) as x from T")
org.apache.spark.sql.AnalysisException: unresolved operator 'Project ['hash(*) 
AS x#1];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
...
{code}

This is not specific to the `hash` udf. It also applies to user defined 
functions.
The `*` seems to be the issue.


> Alias do not work with udf with * parameter
> ---
>
> Key: SPARK-11637
> URL: https://issues.apache.org/jira/browse/SPARK-11637
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0, 1.5.1, 1.5.2
>Reporter: Pierre Borckmans
>
> In Spark < 1.5.0, this used to work :
> {code:java|title=Spark <1.5.0|borderStyle=solid}
> scala> sqlContext.sql("select hash(*) as x from T")
> res2: org.apache.spark.sql.DataFrame = [x: int]
> {code}
> From Spark 1.5.0+, it fails:
> {code:java|title=Spark>=1.5.0|borderStyle=solid}
> scala> sqlContext.sql("select hash(*) as x from T")
> org.apache.spark.sql.AnalysisException: unresolved operator 'Project 
> ['hash(*) AS x#1];
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> ...
> {code}
> This is not specific to the `hash` udf. It also applies to user defined 
> functions.
> The `*` seems to be the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11583) Make MapStatus use less memory usage

2015-11-10 Thread Daniel Lemire (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999309#comment-14999309
 ] 

Daniel Lemire edited comment on SPARK-11583 at 11/10/15 8:40 PM:
-

[~irashid] 

What I would suggest is a quantified benchmark. E.g., the Elastic people did 
something of the sort... comparing various formats including a BitSet, Roaring, 
and so forth, see  
https://www.elastic.co/blog/frame-of-reference-and-roaring-bitmaps

I'm available to help with this benchmark if it is needed... 




was (Author: lemire):
[~irashid] 

What I would suggest is a quantified benchmark. E.g., the Elastic people did 
something of the sort... comparing various formats including a BitSet, Roaring, 
and so forth, see  
https://www.elastic.co/blog/frame-of-reference-and-roaring-bitmaps

I'm available to help with this benchmark if it is needed... 



> Make MapStatus use less memory usage
> ---
>
> Key: SPARK-11583
> URL: https://issues.apache.org/jira/browse/SPARK-11583
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Reporter: Kent Yao
>
> In the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, as I 
> said, using BitSet can save ≈20% memory usage compared to RoaringBitMap. 
> For a Spark job that contains quite a lot of tasks, 20% seems a drop in the ocean. 
> Essentially, BitSet uses long[]. For example, a BitSet[200k] = long[3125].
> So we can use a HashSet[Int] to store reduceId (when non-empty blocks are 
> dense, use the reduceId of empty blocks; when sparse, use the non-empty ones). 
> For dense cases: if HashSet[Int](numNonEmptyBlocks).size <   
> BitSet[totalBlockNum], I use MapStatusTrackingNoEmptyBlocks
> For sparse cases: if HashSet[Int](numEmptyBlocks).size <   
> BitSet[totalBlockNum], I use MapStatusTrackingEmptyBlocks
> sparse case, 299/300 are empty
> sc.makeRDD(1 to 3, 3000).groupBy(x=>x).top(5)
> dense case,  no block is empty
> sc.makeRDD(1 to 900, 3000).groupBy(x=>x).top(5)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11583) Make MapStatus use less memory usage

2015-11-10 Thread Daniel Lemire (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999142#comment-14999142
 ] 

Daniel Lemire edited comment on SPARK-11583 at 11/10/15 8:42 PM:
-

[~Qin Yao] wrote: 

"Roaringbitmap is same as the BitSet now we use in HiglyCompressedMapStatus, 
but take 20% memory usage more than BitSet. They both don't be compressed 
in-memory. According to the annotations of the former 
Roaring-HiglyCompressedMapStatus, it can be compressed during serialization not 
in-memory."


I think that's a misunderstanding.

Lucene and Apache Kylin use Roaring for in-memory bitmaps, and it saves a ton 
of memory. Druid uses them for memory-mapped bitmaps, and it compresses well.

If you do flips, then it is possible that Roaring might end up being 
inefficient. Lucene has one approach to solve this matter and, in 
RoaringBitmap, we offer the "runOptimize" function. But, generally, you should 
expect Roaring bitmaps to compress rather well.

Please get in touch with examples if you want, we could discuss the matter 
further.


was (Author: lemire):
[~Qin Yao] wrote: 

"Roaringbitmap is same as the BitSet now we use in HiglyCompressedMapStatus, 
but take 20% memory usage more than BitSet. They both don't be compressed 
in-memory. According to the annotations of the former 
Roaring-HiglyCompressedMapStatus, it can be compressed during serialization not 
in-memory."


I think that's a misunderstanding.

Lucene and Apache Kylin use Roaring for in-memory bitmaps, and it saves a ton 
of memory. Druid uses them for memory-mapped bitmaps, and it compresses well.

If you do flips, then it is possible that Roaring might end up being 
inefficient. Lucene has one approach to that, in RoaringBitmap, we offer the 
"runOptimize" function.

> Make MapStatus use less memory uage
> ---
>
> Key: SPARK-11583
> URL: https://issues.apache.org/jira/browse/SPARK-11583
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Reporter: Kent Yao
>
> In the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, as I 
> said, using BitSet can save ≈20% memory usage compared to RoaringBitMap. 
> For a spark job contains quite a lot of tasks, 20% seems a drop in the ocean. 
> Essentially, BitSet uses long[]. For example a BitSet[200k] = long[3125].
> So if we use a HashSet[Int] to store reduceId (when non-empty blocks are 
> dense,use reduceId of empty blocks; when sparse, use non-empty ones). 
> For dense cases: if HashSet[Int](numNonEmptyBlocks).size <   
> BitSet[totalBlockNum], I use MapStatusTrackingNoEmptyBlocks
> For sparse cases: if HashSet[Int](numEmptyBlocks).size <   
> BitSet[totalBlockNum], I use MapStatusTrackingEmptyBlocks
> sparse case, 299/300 are empty
> sc.makeRDD(1 to 3, 3000).groupBy(x=>x).top(5)
> dense case,  no block is empty
> sc.makeRDD(1 to 900, 3000).groupBy(x=>x).top(5)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11638) Apache Spark on Docker with Bridge networking / run Spark in Mesos on Docker with Bridge networking

2015-11-10 Thread Radoslaw Gruchalski (JIRA)
Radoslaw Gruchalski created SPARK-11638:
---

 Summary: Apache Spark on Docker with Bridge networking / run Spark 
in Mesos on Docker with Bridge networking
 Key: SPARK-11638
 URL: https://issues.apache.org/jira/browse/SPARK-11638
 Project: Spark
  Issue Type: Improvement
  Components: Mesos, Spark Core
Affects Versions: 1.5.1, 1.5.0, 1.4.1, 1.4.0, 1.5.2, 1.6.0
Reporter: Radoslaw Gruchalski


Full description will be provided within the next few minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11583) Make MapStatus use less memory uage

2015-11-10 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999185#comment-14999185
 ] 

Imran Rashid commented on SPARK-11583:
--

Thanks [~lemire].  So I guess that means that option 2, going back to 
RoaringBitmap, would really just require us to insert a call to 
{{runOptimize}}; no need for us to worry about whether or not to flip the bits 
ourselves.

For the case of relatively full blocks, does the 20% overhead seem like a 
reasonable amount to you?  That sounds super-high to me.  If I understand 
correctly, the worst case of extra overhead will be when all the blocks are 
dense?  Furthermore, it seems like Roaring will actually save space unless the 
maximum element is exactly {{2^n - 1}}.  Otherwise, the Roaring bitmap will 
still be smaller because it can make the final block smaller.  (Well, I guess 
the condition isn't "exactly"; it's a bit more subtle, but without going into 
too many specifics...)  I actually tried with 2^17 - 1, and a 
{{java.util.BitSet}} saved less than 1% over Roaring.

So I'm now more inclined to stick with RoaringBitmap and let it do its job 
(just making sure we use it correctly by adding a call to {{runOptimize}}).
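
A small sketch of that size check (my own illustration, not code from the 
thread; it assumes the org.roaringbitmap.RoaringBitmap library and only prints 
rough in-memory sizes, so the with/without {{runOptimize}} cases can both be 
eyeballed):

{code}
// Dense case with maximum element 2^17 - 1, as tried above:
// compare java.util.BitSet against RoaringBitmap, before and after runOptimize.
import org.roaringbitmap.RoaringBitmap

val max = (1 << 17) - 1                       // 2^17 - 1
val bitSet = new java.util.BitSet(max + 1)
(0 to max).foreach(i => bitSet.set(i))

val roaring = new RoaringBitmap()
(0 to max).foreach(i => roaring.add(i))

println("BitSet bytes:           " + bitSet.toLongArray.length * 8)
println("Roaring bytes:          " + roaring.getSizeInBytes)
roaring.runOptimize()                         // opt in to run containers
println("Roaring after optimize: " + roaring.getSizeInBytes)
{code}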

> Make MapStatus use less memory uage
> ---
>
> Key: SPARK-11583
> URL: https://issues.apache.org/jira/browse/SPARK-11583
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Reporter: Kent Yao
>
> In the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, as I 
> said, using BitSet can save ≈20% memory usage compared to RoaringBitMap. 
> For a spark job contains quite a lot of tasks, 20% seems a drop in the ocean. 
> Essentially, BitSet uses long[]. For example a BitSet[200k] = long[3125].
> So if we use a HashSet[Int] to store reduceId (when non-empty blocks are 
> dense,use reduceId of empty blocks; when sparse, use non-empty ones). 
> For dense cases: if HashSet[Int](numNonEmptyBlocks).size <   
> BitSet[totalBlockNum], I use MapStatusTrackingNoEmptyBlocks
> For sparse cases: if HashSet[Int](numEmptyBlocks).size <   
> BitSet[totalBlockNum], I use MapStatusTrackingEmptyBlocks
> sparse case, 299/300 are empty
> sc.makeRDD(1 to 3, 3000).groupBy(x=>x).top(5)
> dense case,  no block is empty
> sc.makeRDD(1 to 900, 3000).groupBy(x=>x).top(5)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11636) Support as on Classes defined in the REPL

2015-11-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11636:


Assignee: Michael Armbrust  (was: Apache Spark)

> Support as on Classes defined in the REPL
> -
>
> Key: SPARK-11636
> URL: https://issues.apache.org/jira/browse/SPARK-11636
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11636) Support as on Classes defined in the REPL

2015-11-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999219#comment-14999219
 ] 

Apache Spark commented on SPARK-11636:
--

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/9602

> Support as on Classes defined in the REPL
> -
>
> Key: SPARK-11636
> URL: https://issues.apache.org/jira/browse/SPARK-11636
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11637) Alias do not work with udf with * parameter

2015-11-10 Thread Pierre Borckmans (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Borckmans updated SPARK-11637:
-
Description: 
In Spark < 1.5.0, this used to work :

{code:java|title=Spark <1.5.0|borderStyle=solid}
scala> sqlContext.sql("select hash(*) as x from T")
res2: org.apache.spark.sql.DataFrame = [x: int]
{code}

From Spark 1.5.0+, it fails:
{code:java|title=Spark +1.5.0|borderStyle=solid}
scala> sqlContext.sql("select hash(*) as x from T")
org.apache.spark.sql.AnalysisException: unresolved operator 'Project ['hash(*) 
AS x#1];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
...
{code}


  was:
In Spark < 1.5.0, this used to work :

{code:scala|title=Spark <1.5.0|borderStyle=solid}
scala> sqlContext.sql("select hash(*) as x from T")
res2: org.apache.spark.sql.DataFrame = [x: int]
{code}

From Spark 1.5.0+, it fails:
{code:scala|title=Spark +1.5.0|borderStyle=solid}
scala> sqlContext.sql("select hash(*) as x from T")
org.apache.spark.sql.AnalysisException: unresolved operator 'Project ['hash(*) 
AS x#1];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
...
{code}



> Alias do not work with udf with * parameter
> ---
>
> Key: SPARK-11637
> URL: https://issues.apache.org/jira/browse/SPARK-11637
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0, 1.5.1, 1.5.2
>Reporter: Pierre Borckmans
>
> In Spark < 1.5.0, this used to work :
> {code:java|title=Spark <1.5.0|borderStyle=solid}
> scala> sqlContext.sql("select hash(*) as x from T")
> res2: org.apache.spark.sql.DataFrame = [x: int]
> {code}
> From Spark 1.5.0+, it fails:
> {code:java|title=Spark +1.5.0|borderStyle=solid}
> scala> sqlContext.sql("select hash(*) as x from T")
> org.apache.spark.sql.AnalysisException: unresolved operator 'Project 
> ['hash(*) AS x#1];
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11631) DAGScheduler prints "Stopping DAGScheduler" at INFO to the logs with no corresponding "Starting"

2015-11-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999305#comment-14999305
 ] 

Apache Spark commented on SPARK-11631:
--

User 'xguo27' has created a pull request for this issue:
https://github.com/apache/spark/pull/9603

> DAGScheduler prints "Stopping DAGScheduler" at INFO to the logs with no 
> corresponding "Starting"
> 
>
> Key: SPARK-11631
> URL: https://issues.apache.org/jira/browse/SPARK-11631
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0
> Environment: Spark sources as of today - revision {{5039a49}}
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> At stop, DAGScheduler prints out {{INFO DAGScheduler: Stopping 
> DAGScheduler}}, but there's no corresponding Starting INFO message. It can be 
> surprising.
> I think Spark should have a change and pick one:
> 1. {{INFO DAGScheduler: Stopping DAGScheduler}} should be DEBUG at the most 
> (or even TRACE)
> 2. {{INFO DAGScheduler: Stopping DAGScheduler}} should have corresponding 
> {{INFO DAGScheduler: Starting DAGScheduler}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11631) DAGScheduler prints "Stopping DAGScheduler" at INFO to the logs with no corresponding "Starting"

2015-11-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11631:


Assignee: Apache Spark

> DAGScheduler prints "Stopping DAGScheduler" at INFO to the logs with no 
> corresponding "Starting"
> 
>
> Key: SPARK-11631
> URL: https://issues.apache.org/jira/browse/SPARK-11631
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0
> Environment: Spark sources as of today - revision {{5039a49}}
>Reporter: Jacek Laskowski
>Assignee: Apache Spark
>Priority: Trivial
>
> At stop, DAGScheduler prints out {{INFO DAGScheduler: Stopping 
> DAGScheduler}}, but there's no corresponding Starting INFO message. It can be 
> surprising.
> I think Spark should have a change and pick one:
> 1. {{INFO DAGScheduler: Stopping DAGScheduler}} should be DEBUG at the most 
> (or even TRACE)
> 2. {{INFO DAGScheduler: Stopping DAGScheduler}} should have corresponding 
> {{INFO DAGScheduler: Starting DAGScheduler}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-11635) Avoid Guava version dependency for Hashcodes

2015-11-10 Thread Nishkam Ravi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishkam Ravi closed SPARK-11635.


> Avoid Guava version dependency for Hashcodes
> 
>
> Key: SPARK-11635
> URL: https://issues.apache.org/jira/browse/SPARK-11635
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.6.0
>Reporter: Nishkam Ravi
>
> To avoid Guava version dependency we can use Hashcode instead of Hashcodes 
> (deprecated started 16.0).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11638) Apache Spark on Docker with Bridge networking / run Spark in Mesos on Docker with Bridge networking

2015-11-10 Thread Radoslaw Gruchalski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999363#comment-14999363
 ] 

Radoslaw Gruchalski commented on SPARK-11638:
-

Working on providing the full description. I will update the JIRA ticket 
shortly, giving full context and explaining what the attached patches bring to 
Spark.

> Apache Spark on Docker with Bridge networking / run Spark in Mesos on Docker 
> with Bridge networking
> ---
>
> Key: SPARK-11638
> URL: https://issues.apache.org/jira/browse/SPARK-11638
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0
>Reporter: Radoslaw Gruchalski
> Attachments: 1.4.0.patch, 1.4.1.patch, 1.5.0.patch, 1.5.1.patch, 
> 1.5.2.patch, 1.6.0-master.patch, 2.3.11.patch, 2.3.4.patch
>
>
> Full description will be provided within the next few minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11635) Avoid Guava version dependency for Hashcodes

2015-11-10 Thread Nishkam Ravi (JIRA)
Nishkam Ravi created SPARK-11635:


 Summary: Avoid Guava version dependency for Hashcodes
 Key: SPARK-11635
 URL: https://issues.apache.org/jira/browse/SPARK-11635
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.6.0
Reporter: Nishkam Ravi


To avoid a Guava version dependency we can use HashCode instead of HashCodes 
(deprecated starting with 16.0).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11077) Join elimination in Catalyst

2015-11-10 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11077:
-
Target Version/s:   (was: 1.6.0)

> Join elimination in Catalyst
> 
>
> Key: SPARK-11077
> URL: https://issues.apache.org/jira/browse/SPARK-11077
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>
> Join elimination is a query optimization where certain joins can be 
> eliminated when followed by projections that only keep columns from one side 
> of the join, and when certain columns are known to be unique or foreign keys. 
> This can be very useful for queries involving views and machine-generated 
> queries.
> Adding join elimination to Catalyst requires (1) support for unique and 
> foreign key hints in logical plans, (2) methods in the DataFrame API to let 
> users provide these hints, and (3) an optimizer rule that eliminates unique 
> key outer joins and referential integrity joins when followed by an 
> appropriate projection.
> This proposal is described in detail here: 
> https://docs.google.com/document/d/1-YgQSQywHfAo4PhAT-zOOkFZtVcju99h3dYQq-i9GWQ/edit?usp=sharing
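
As a rough illustration of the kind of query the rule would target (a sketch of 
my own, not from the proposal; the table and column names are made up, and the 
key-hint API is not shown because defining it is part of the proposal itself):

{code}
// If facts("custId") is a non-null foreign key into a unique customers("id"),
// the join neither drops nor duplicates rows of `facts`, and the projection
// keeps only `facts` columns, so the join could be eliminated entirely.
val facts = sqlContext.table("facts")
val customers = sqlContext.table("customers")

val q = facts
  .join(customers, facts("custId") === customers("id"))
  .select(facts("custId"), facts("amount"))   // only `facts` columns survive
{code}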



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11633) HiveContext throws TreeNode Exception : Failed to Copy Node

2015-11-10 Thread Saurabh Santhosh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Santhosh updated SPARK-11633:
-
Affects Version/s: 1.5.0

> HiveContext throws TreeNode Exception : Failed to Copy Node
> ---
>
> Key: SPARK-11633
> URL: https://issues.apache.org/jira/browse/SPARK-11633
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0, 1.5.1
>Reporter: Saurabh Santhosh
>Priority: Critical
>
> h2. HiveContext#sql is throwing the following exception in a specific 
> scenario :
> h2. Exception :
> Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Failed to copy node.
> Is otherCopyArgs specified correctly for LogicalRDD.
> Exception message: wrong number of arguments
> ctor: public org.apache.spark.sql.execution.LogicalRDD
> (scala.collection.Seq,org.apache.spark.rdd.RDD,org.apache.spark.sql.SQLContext)?
> h2. Code :
> {code:title=SparkClient.java|borderStyle=solid}
> StructField[] fields = new StructField[2];
> fields[0] = new StructField("F1", DataTypes.StringType, true, 
> Metadata.empty());
> fields[1] = new StructField("F2", DataTypes.StringType, true, 
> Metadata.empty());
> 
> JavaRDD rdd = 
> javaSparkContext.parallelize(Arrays.asList(RowFactory.create("", "", 0)));
> DataFrame df = sparkHiveContext.createDataFrame(rdd, new StructType(fields));
> sparkHiveContext.registerDataFrameAsTable(df, "t1");
> DataFrame aliasedDf = sparkHiveContext.sql("select f1, F2 as F2 from t1");
> sparkHiveContext.registerDataFrameAsTable(aliasedDf, "t2");
> sparkHiveContext.registerDataFrameAsTable(aliasedDf, "t3");
> sparkHiveContext.sql("select a.F1 from t2 a inner join t3 b on a.F2=b.F2");
> {code}
> h2. Observations :
> * if F1(exact name of field) is used instead of f1, the code works correctly.
> * If alias is not used for F2, then also code works irrespective of case of 
> F1.
> * if Field F2 is not used in the final query also the code works correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11635) Avoid Guava version dependency for Hashcodes

2015-11-10 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-11635.

Resolution: Invalid

Believe me, I'd have used the non-deprecated version if it were available in 
Guava 14.

> Avoid Guava version dependency for Hashcodes
> 
>
> Key: SPARK-11635
> URL: https://issues.apache.org/jira/browse/SPARK-11635
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.6.0
>Reporter: Nishkam Ravi
>
> To avoid Guava version dependency we can use Hashcode instead of Hashcodes 
> (deprecated started 16.0).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11633) HiveContext throws TreeNode Exception : Failed to Copy Node

2015-11-10 Thread Saurabh Santhosh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Santhosh updated SPARK-11633:
-
Affects Version/s: 1.5.1

> HiveContext throws TreeNode Exception : Failed to Copy Node
> ---
>
> Key: SPARK-11633
> URL: https://issues.apache.org/jira/browse/SPARK-11633
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0, 1.5.1
>Reporter: Saurabh Santhosh
>Priority: Critical
>
> h2. HiveContext#sql is throwing the following exception in a specific 
> scenario :
> h2. Exception :
> Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Failed to copy node.
> Is otherCopyArgs specified correctly for LogicalRDD.
> Exception message: wrong number of arguments
> ctor: public org.apache.spark.sql.execution.LogicalRDD
> (scala.collection.Seq,org.apache.spark.rdd.RDD,org.apache.spark.sql.SQLContext)?
> h2. Code :
> {code:title=SparkClient.java|borderStyle=solid}
> StructField[] fields = new StructField[2];
> fields[0] = new StructField("F1", DataTypes.StringType, true, 
> Metadata.empty());
> fields[1] = new StructField("F2", DataTypes.StringType, true, 
> Metadata.empty());
> 
> JavaRDD rdd = 
> javaSparkContext.parallelize(Arrays.asList(RowFactory.create("", "", 0)));
> DataFrame df = sparkHiveContext.createDataFrame(rdd, new StructType(fields));
> sparkHiveContext.registerDataFrameAsTable(df, "t1");
> DataFrame aliasedDf = sparkHiveContext.sql("select f1, F2 as F2 from t1");
> sparkHiveContext.registerDataFrameAsTable(aliasedDf, "t2");
> sparkHiveContext.registerDataFrameAsTable(aliasedDf, "t3");
> sparkHiveContext.sql("select a.F1 from t2 a inner join t3 b on a.F2=b.F2");
> {code}
> h2. Observations :
> * if F1(exact name of field) is used instead of f1, the code works correctly.
> * If alias is not used for F2, then also code works irrespective of case of 
> F1.
> * if Field F2 is not used in the final query also the code works correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11635) Avoid Guava version dependency for Hashcodes

2015-11-10 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999319#comment-14999319
 ] 

Marcelo Vanzin commented on SPARK-11635:


bq. why do we use Guava 14 instead of a newer version?

Backwards compatibility. Guava is exposed in the public API (we have to exclude 
a bunch of classes from shading because of that). Once we're allowed to break 
the public API, we can fully shade Guava and then use whatever version.

> Avoid Guava version dependency for Hashcodes
> 
>
> Key: SPARK-11635
> URL: https://issues.apache.org/jira/browse/SPARK-11635
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.6.0
>Reporter: Nishkam Ravi
>
> To avoid Guava version dependency we can use Hashcode instead of Hashcodes 
> (deprecated started 16.0).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11635) Avoid Guava version dependency for Hashcodes

2015-11-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11635:


Assignee: (was: Apache Spark)

> Avoid Guava version dependency for Hashcodes
> 
>
> Key: SPARK-11635
> URL: https://issues.apache.org/jira/browse/SPARK-11635
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.6.0
>Reporter: Nishkam Ravi
>
> To avoid Guava version dependency we can use Hashcode instead of Hashcodes 
> (deprecated started 16.0).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11635) Avoid Guava version dependency for Hashcodes

2015-11-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999213#comment-14999213
 ] 

Apache Spark commented on SPARK-11635:
--

User 'nishkamravi2' has created a pull request for this issue:
https://github.com/apache/spark/pull/9601

> Avoid Guava version dependency for Hashcodes
> 
>
> Key: SPARK-11635
> URL: https://issues.apache.org/jira/browse/SPARK-11635
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.6.0
>Reporter: Nishkam Ravi
>
> To avoid Guava version dependency we can use Hashcode instead of Hashcodes 
> (deprecated started 16.0).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11637) Alias do not work with udf with * parameter

2015-11-10 Thread Pierre Borckmans (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Borckmans updated SPARK-11637:
-
Description: 
In Spark < 1.5.0, this used to work :

{code:scala|title=Spark <1.5.0|borderStyle=solid}
scala> sqlContext.sql("select hash(*) as x from T")
res2: org.apache.spark.sql.DataFrame = [x: int]
{code}

From Spark 1.5.0+, it fails:
{code:scala|title=Spark +1.5.0|borderStyle=solid}
scala> sqlContext.sql("select hash(*) as x from T")
org.apache.spark.sql.AnalysisException: unresolved operator 'Project ['hash(*) 
AS x#1];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
...
{code}


  was:
In Spark < 1.5.0, this used to work :
```
scala> sqlContext.sql("select hash(*) as x from T")
res2: org.apache.spark.sql.DataFrame = [x: int]
```

From Spark 1.5.0+, it fails:
```
scala> sqlContext.sql("select hash(*) as x from T")
org.apache.spark.sql.AnalysisException: unresolved operator 'Project ['hash(*) 
AS x#1];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
...
```


> Alias do not work with udf with * parameter
> ---
>
> Key: SPARK-11637
> URL: https://issues.apache.org/jira/browse/SPARK-11637
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0, 1.5.1, 1.5.2
>Reporter: Pierre Borckmans
>
> In Spark < 1.5.0, this used to work :
> {code:scala|title=Spark <1.5.0|borderStyle=solid}
> scala> sqlContext.sql("select hash(*) as x from T")
> res2: org.apache.spark.sql.DataFrame = [x: int]
> {code}
> From Spark 1.5.0+, it fails:
> {code:scala|title=Spark +1.5.0|borderStyle=solid}
> scala> sqlContext.sql("select hash(*) as x from T")
> org.apache.spark.sql.AnalysisException: unresolved operator 'Project 
> ['hash(*) AS x#1];
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11635) Avoid Guava version dependency for Hashcodes

2015-11-10 Thread Nishkam Ravi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999308#comment-14999308
 ] 

Nishkam Ravi commented on SPARK-11635:
--

I see. So we end up with a version dependency either way: HashCode for newer 
Guava versions and HashCodes for older ones. I'm sure there is a good reason 
for it, but just for my own understanding: why do we use Guava 14 instead of a 
newer version? 

> Avoid Guava version dependency for Hashcodes
> 
>
> Key: SPARK-11635
> URL: https://issues.apache.org/jira/browse/SPARK-11635
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.6.0
>Reporter: Nishkam Ravi
>
> To avoid Guava version dependency we can use Hashcode instead of Hashcodes 
> (deprecated started 16.0).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11583) Make MapStatus use less memory uage

2015-11-10 Thread Daniel Lemire (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999309#comment-14999309
 ] 

Daniel Lemire commented on SPARK-11583:
---

[~irashid] 

What I would suggest is a quantified benchmark. E.g., the Elastic people did 
something of the sort... comparing various formats including a BitSet, Roaring, 
and so forth, see  
https://www.elastic.co/blog/frame-of-reference-and-roaring-bitmaps

I'm available to help with this benchmark if it is needed... 
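
A minimal sketch of such a comparison, in case it helps (my own illustration, 
not part of this comment; it assumes the org.roaringbitmap.RoaringBitmap 
artifact and measures only rough in-memory sizes, not serialized sizes):

{code}
// Compare java.util.BitSet vs RoaringBitmap for a sparse and a dense pattern
// of non-empty shuffle blocks, similar to the two cases in the ticket.
import org.roaringbitmap.RoaringBitmap

def roughSizes(totalBlocks: Int, nonEmpty: Seq[Int]): (Long, Long) = {
  val bitSet = new java.util.BitSet(totalBlocks)
  nonEmpty.foreach(i => bitSet.set(i))

  val roaring = new RoaringBitmap()
  nonEmpty.foreach(i => roaring.add(i))
  roaring.runOptimize()                       // as discussed in this thread

  (bitSet.toLongArray.length * 8L, roaring.getSizeInBytes.toLong)
}

println(roughSizes(3000, Seq(42)))            // sparse: 1 of 3000 blocks non-empty
println(roughSizes(3000, 0 until 3000))       // dense: every block non-empty
{code}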



> Make MapStatus use less memory uage
> ---
>
> Key: SPARK-11583
> URL: https://issues.apache.org/jira/browse/SPARK-11583
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Reporter: Kent Yao
>
> In the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, as I 
> said, using BitSet can save ≈20% memory usage compared to RoaringBitMap. 
> For a spark job contains quite a lot of tasks, 20% seems a drop in the ocean. 
> Essentially, BitSet uses long[]. For example a BitSet[200k] = long[3125].
> So if we use a HashSet[Int] to store reduceId (when non-empty blocks are 
> dense,use reduceId of empty blocks; when sparse, use non-empty ones). 
> For dense cases: if HashSet[Int](numNonEmptyBlocks).size <   
> BitSet[totalBlockNum], I use MapStatusTrackingNoEmptyBlocks
> For sparse cases: if HashSet[Int](numEmptyBlocks).size <   
> BitSet[totalBlockNum], I use MapStatusTrackingEmptyBlocks
> sparse case, 299/300 are empty
> sc.makeRDD(1 to 3, 3000).groupBy(x=>x).top(5)
> dense case,  no block is empty
> sc.makeRDD(1 to 900, 3000).groupBy(x=>x).top(5)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11639) Flaky test: BatchedWriteAheadLog - name log with aggregated entries with the timestamp of last entry

2015-11-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11639:


Assignee: (was: Apache Spark)

> Flaky test: BatchedWriteAheadLog - name log with aggregated entries with the 
> timestamp of last entry
> 
>
> Key: SPARK-11639
> URL: https://issues.apache.org/jira/browse/SPARK-11639
> Project: Spark
>  Issue Type: Test
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Burak Yavuz
>  Labels: flaky, flaky-test
>
> I added this test yesterday, and it has started showing flakiness.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11639) Flaky test: BatchedWriteAheadLog - name log with aggregated entries with the timestamp of last entry

2015-11-10 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-11639:
---

 Summary: Flaky test: BatchedWriteAheadLog - name log with 
aggregated entries with the timestamp of last entry
 Key: SPARK-11639
 URL: https://issues.apache.org/jira/browse/SPARK-11639
 Project: Spark
  Issue Type: Test
  Components: Streaming
Affects Versions: 1.6.0
Reporter: Burak Yavuz


I added this test yesterday, and it has started showing flakiness.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11639) Flaky test: BatchedWriteAheadLog - name log with aggregated entries with the timestamp of last entry

2015-11-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11639:


Assignee: Apache Spark

> Flaky test: BatchedWriteAheadLog - name log with aggregated entries with the 
> timestamp of last entry
> 
>
> Key: SPARK-11639
> URL: https://issues.apache.org/jira/browse/SPARK-11639
> Project: Spark
>  Issue Type: Test
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>  Labels: flaky, flaky-test
>
> I added this test yesterday, and it has started showing flakiness.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11639) Flaky test: BatchedWriteAheadLog - name log with aggregated entries with the timestamp of last entry

2015-11-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999376#comment-14999376
 ] 

Apache Spark commented on SPARK-11639:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/9605

> Flaky test: BatchedWriteAheadLog - name log with aggregated entries with the 
> timestamp of last entry
> 
>
> Key: SPARK-11639
> URL: https://issues.apache.org/jira/browse/SPARK-11639
> Project: Spark
>  Issue Type: Test
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Burak Yavuz
>  Labels: flaky, flaky-test
>
> I added this test yesterday, and it has started showing flakiness.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11640) shading packages in spark-assembly jar

2015-11-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11640.
---
Resolution: Not A Problem

> shading packages in spark-assembly jar
> --
>
> Key: SPARK-11640
> URL: https://issues.apache.org/jira/browse/SPARK-11640
> Project: Spark
>  Issue Type: Wish
>  Components: Build
>Reporter: PJ Fanning
>
> The spark assembly jar contains classes from many external dependencies like 
> hadoop and bouncycastle.
> I have run into issues trying to use bouncycastle code in a Spark job because 
> the JCE codebase expects the encryption code to be in a signed jar and since 
> the classes are copied into spark-assembly jar and it is not signed, the JCE 
> framework returns an error.
> If the bouncycastle classes in spark-assembly were shaded, then I could 
> deploy the properly signed bcprov jar. The spark code could access the shaded 
> copies of the bouncycastle classes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11636) Support as on Classes defined in the REPL

2015-11-10 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-11636:


 Summary: Support as on Classes defined in the REPL
 Key: SPARK-11636
 URL: https://issues.apache.org/jira/browse/SPARK-11636
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11633) HiveContext throws TreeNode Exception : Failed to Copy Node

2015-11-10 Thread Saurabh Santhosh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Santhosh updated SPARK-11633:
-
Description: 
h2. HiveContext#sql is throwing the following exception in a specific scenario :

h2. Exception :

Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
Failed to copy node.
Is otherCopyArgs specified correctly for LogicalRDD.
Exception message: wrong number of arguments
ctor: public org.apache.spark.sql.execution.LogicalRDD
(scala.collection.Seq,org.apache.spark.rdd.RDD,org.apache.spark.sql.SQLContext)?

h2. Code :

{code:title=SparkClient.java|borderStyle=solid}
StructField[] fields = new StructField[2];
fields[0] = new StructField("F1", DataTypes.StringType, true, Metadata.empty());
fields[1] = new StructField("F2", DataTypes.StringType, true, Metadata.empty());

JavaRDD<Row> rdd = 
javaSparkContext.parallelize(Arrays.asList(RowFactory.create("", "")));

DataFrame df = sparkHiveContext.createDataFrame(rdd, new StructType(fields));
sparkHiveContext.registerDataFrameAsTable(df, "t1");

DataFrame aliasedDf = sparkHiveContext.sql("select f1, F2 as F2 from t1");

sparkHiveContext.registerDataFrameAsTable(aliasedDf, "t2");
sparkHiveContext.registerDataFrameAsTable(aliasedDf, "t3");

sparkHiveContext.sql("select a.F1 from t2 a inner join t3 b on a.F2=b.F2");

{code}

h2. Observations :

* if F1(exact name of field) is used instead of f1, the code works correctly.
* If alias is not used for F2, then also code works irrespective of case of F1.
* if Field F2 is not used in the final query also the code works correctly.

  was:
h2. HiveContext#sql is throwing the following exception in a specific scenario :

h2. Exception :

Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
Failed to copy node.
Is otherCopyArgs specified correctly for LogicalRDD.
Exception message: wrong number of arguments
ctor: public org.apache.spark.sql.execution.LogicalRDD
(scala.collection.Seq,org.apache.spark.rdd.RDD,org.apache.spark.sql.SQLContext)?

h2. Code :

{code:title=SparkClient.java|borderStyle=solid}
StructField[] fields = new StructField[2];
fields[0] = new StructField("F1", DataTypes.StringType, true, Metadata.empty());
fields[1] = new StructField("F2", DataTypes.StringType, true, Metadata.empty());

JavaRDD rdd = 
javaSparkContext.parallelize(Arrays.asList(RowFactory.create("", "", 0)));

DataFrame df = sparkHiveContext.createDataFrame(rdd, new StructType(fields));
sparkHiveContext.registerDataFrameAsTable(df, "t1");

DataFrame aliasedDf = sparkHiveContext.sql("select f1, F2 as F2 from t1");

sparkHiveContext.registerDataFrameAsTable(aliasedDf, "t2");
sparkHiveContext.registerDataFrameAsTable(aliasedDf, "t3");

sparkHiveContext.sql("select a.F1 from t2 a inner join t3 b on a.F2=b.F2");

{code}

h2. Observations :

* if F1(exact name of field) is used instead of f1, the code works correctly.
* If alias is not used for F2, then also code works irrespective of case of F1.
* if Field F2 is not used in the final query also the code works correctly.


> HiveContext throws TreeNode Exception : Failed to Copy Node
> ---
>
> Key: SPARK-11633
> URL: https://issues.apache.org/jira/browse/SPARK-11633
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0, 1.5.1
>Reporter: Saurabh Santhosh
>Priority: Critical
>
> h2. HiveContext#sql is throwing the following exception in a specific 
> scenario :
> h2. Exception :
> Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Failed to copy node.
> Is otherCopyArgs specified correctly for LogicalRDD.
> Exception message: wrong number of arguments
> ctor: public org.apache.spark.sql.execution.LogicalRDD
> (scala.collection.Seq,org.apache.spark.rdd.RDD,org.apache.spark.sql.SQLContext)?
> h2. Code :
> {code:title=SparkClient.java|borderStyle=solid}
> StructField[] fields = new StructField[2];
> fields[0] = new StructField("F1", DataTypes.StringType, true, 
> Metadata.empty());
> fields[1] = new StructField("F2", DataTypes.StringType, true, 
> Metadata.empty());
> 
> JavaRDD rdd = 
> javaSparkContext.parallelize(Arrays.asList(RowFactory.create("", "")));
> DataFrame df = sparkHiveContext.createDataFrame(rdd, new StructType(fields));
> sparkHiveContext.registerDataFrameAsTable(df, "t1");
> DataFrame aliasedDf = sparkHiveContext.sql("select f1, F2 as F2 from t1");
> sparkHiveContext.registerDataFrameAsTable(aliasedDf, "t2");
> sparkHiveContext.registerDataFrameAsTable(aliasedDf, "t3");
> sparkHiveContext.sql("select a.F1 from t2 a inner join t3 b on a.F2=b.F2");
> {code}
> h2. Observations :
> * if F1(exact name of field) is used instead of f1, the code works correctly.
> * If alias is not used for F2, then also code works irrespective of case of 
> F1.
> * if Field F2 is not used in the final query also the code works correctly.

[jira] [Comment Edited] (SPARK-11583) Make MapStatus use less memory uage

2015-11-10 Thread Daniel Lemire (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999117#comment-14999117
 ] 

Daniel Lemire edited comment on SPARK-11583 at 11/10/15 8:40 PM:
-

> When a set is nearly full, RoaringBitmap does not automatically invert the 
> bits in order to minimize space. 

The Roaring implementation in Lucene inverts bits to minimize space, as 
described...

https://github.com/apache/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/util/RoaringDocIdSet.java

The RoaringBitmap library which we produced does not. However, it does 
something similar upon request.

You might want to try...

 x.flip(0,1001);
 x.runOptimize();
 x.getSizeInBytes();

The call to runOptimize should significantly reduce memory usage in this case. 


The intention is that users should call "runOptimize" once their bitmaps have 
been created and are no longer expected to change frequently. So "runOptimize" 
should always be called prior to serialization.



was (Author: lemire):
> When a set is nearly full, RoaringBitmap does not automatically invert the 
> bits in order to minimize space. 

The Roaring implementation in Lucene invert bits to minimize space, as 
descriped...

https://github.com/apache/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/util/RoaringDocIdSet.java

The RoaringBitmap libary which we produced does not. However, it does something 
similar upon request.

You might want to try...

 x.flip(0,1001);
 x.runOptimize();
 x.getSizeInBytes();

The call to runOptimize should significantly reduce memory usage in this case. 


The intention is that users should call "runOptimize" when their bitmaps has 
been created and is no long expected to be changed frequently. So "runOptimize" 
should always be called prior to serialization.
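
For reference, the effect described above can be reproduced with something like 
the following (my own sketch, assuming the org.roaringbitmap.RoaringBitmap 
library; the exact byte counts will depend on the library version):

{code}
// A nearly-full bitmap: every id in [0, 100000) except multiples of 1000.
// Before runOptimize it is held in bitmap containers; runOptimize lets the
// long runs collapse into much cheaper run containers.
import org.roaringbitmap.RoaringBitmap

val x = new RoaringBitmap()
(0 until 100000).foreach(i => if (i % 1000 != 0) x.add(i))

println(x.getSizeInBytes)   // before runOptimize
x.runOptimize()             // opt in to run-length encoding, as described above
println(x.getSizeInBytes)   // after: much smaller for this pattern
{code}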


> Make MapStatus use less memory uage
> ---
>
> Key: SPARK-11583
> URL: https://issues.apache.org/jira/browse/SPARK-11583
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Reporter: Kent Yao
>
> In the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, as I 
> said, using BitSet can save ≈20% memory usage compared to RoaringBitMap. 
> For a spark job contains quite a lot of tasks, 20% seems a drop in the ocean. 
> Essentially, BitSet uses long[]. For example a BitSet[200k] = long[3125].
> So if we use a HashSet[Int] to store reduceId (when non-empty blocks are 
> dense,use reduceId of empty blocks; when sparse, use non-empty ones). 
> For dense cases: if HashSet[Int](numNonEmptyBlocks).size <   
> BitSet[totalBlockNum], I use MapStatusTrackingNoEmptyBlocks
> For sparse cases: if HashSet[Int](numEmptyBlocks).size <   
> BitSet[totalBlockNum], I use MapStatusTrackingEmptyBlocks
> sparse case, 299/300 are empty
> sc.makeRDD(1 to 3, 3000).groupBy(x=>x).top(5)
> dense case,  no block is empty
> sc.makeRDD(1 to 900, 3000).groupBy(x=>x).top(5)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11638) Apache Spark in Docker with Bridge networking / run Spark on Mesos, in Docker with Bridge networking

2015-11-10 Thread Radoslaw Gruchalski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Radoslaw Gruchalski updated SPARK-11638:

Summary: Apache Spark in Docker with Bridge networking / run Spark on 
Mesos, in Docker with Bridge networking  (was: Apache Spark on Docker with 
Bridge networking / run Spark in Mesos on Docker with Bridge networking)

> Apache Spark in Docker with Bridge networking / run Spark on Mesos, in Docker 
> with Bridge networking
> 
>
> Key: SPARK-11638
> URL: https://issues.apache.org/jira/browse/SPARK-11638
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0
>Reporter: Radoslaw Gruchalski
> Attachments: 1.4.0.patch, 1.4.1.patch, 1.5.0.patch, 1.5.1.patch, 
> 1.5.2.patch, 1.6.0-master.patch, 2.3.11.patch, 2.3.4.patch
>
>
> Full description will be provided within the next few minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11638) Apache Spark in Docker with Bridge networking / run Spark on Mesos, in Docker with Bridge networking

2015-11-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999399#comment-14999399
 ] 

Sean Owen commented on SPARK-11638:
---

We don't use patches in Spark. Why not just get the text ready before opening a 
JIRA? 
I'm not sure what this is but I think you might want to discuss on user@ first.

> Apache Spark in Docker with Bridge networking / run Spark on Mesos, in Docker 
> with Bridge networking
> 
>
> Key: SPARK-11638
> URL: https://issues.apache.org/jira/browse/SPARK-11638
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0
>Reporter: Radoslaw Gruchalski
> Attachments: 1.4.0.patch, 1.4.1.patch, 1.5.0.patch, 1.5.1.patch, 
> 1.5.2.patch, 1.6.0-master.patch, 2.3.11.patch, 2.3.4.patch
>
>
> Full description will be provided within the next few minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8459) Add import/export to spark.mllib bisecting k-means

2015-11-10 Thread Yu Ishikawa (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Ishikawa updated SPARK-8459:
---
Labels: 1.7.0  (was: )

> Add import/export to spark.mllib bisecting k-means
> --
>
> Key: SPARK-8459
> URL: https://issues.apache.org/jira/browse/SPARK-8459
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Yu Ishikawa
>  Labels: 1.7.0
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9830) Remove AggregateExpression1 and Aggregate Operator used to evaluate AggregateExpression1s

2015-11-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999441#comment-14999441
 ] 

Apache Spark commented on SPARK-9830:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/9607

> Remove AggregateExpression1 and Aggregate Operator used to evaluate 
> AggregateExpression1s
> -
>
> Key: SPARK-9830
> URL: https://issues.apache.org/jira/browse/SPARK-9830
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.6.0
>
>
> In 1.6.0, we should remove the old aggregation code path (used to evaluate 
> AggregateExpression1). While removing this code path, there are several code 
> clean up we need to do,
> 1. Remove all of our hacks from ResolveFunctions.
> 2. Remove all of the conversion logics from 
> {org.apache.spark.sql.catalyst.expressions.aggregate.utils}}.
> 3. Remove {{newAggregation}} field from {{logical.Aggregate}}.
> 4. Remove the query planning rule for old aggregate path.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11641) Exchange plan string is too verbose

2015-11-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11641:


Assignee: Apache Spark  (was: Reynold Xin)

> Exchange plan string is too verbose
> ---
>
> Key: SPARK-11641
> URL: https://issues.apache.org/jira/browse/SPARK-11641
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> By default it says "TungstenExchange(Shuffle without coordinator)". It should 
> just say TungstenExchange.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11550) Replace example code in mllib-optimization.md using include_example

2015-11-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11550.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9516
[https://github.com/apache/spark/pull/9516]

> Replace example code in mllib-optimization.md using include_example
> ---
>
> Key: SPARK-11550
> URL: https://issues.apache.org/jira/browse/SPARK-11550
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>  Labels: starter
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11550) Replace example code in mllib-optimization.md using include_example

2015-11-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11550:
--
Assignee: Pravin Gadakh

> Replace example code in mllib-optimization.md using include_example
> ---
>
> Key: SPARK-11550
> URL: https://issues.apache.org/jira/browse/SPARK-11550
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Pravin Gadakh
>  Labels: starter
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11550) Replace example code in mllib-optimization.md using include_example

2015-11-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11550:
--
Target Version/s: 1.6.0

> Replace example code in mllib-optimization.md using include_example
> ---
>
> Key: SPARK-11550
> URL: https://issues.apache.org/jira/browse/SPARK-11550
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Pravin Gadakh
>  Labels: starter
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11641) Exchange plan string is too verbose

2015-11-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-11641.
-
   Resolution: Fixed
 Assignee: Yin Huai  (was: Reynold Xin)
Fix Version/s: 1.6.0

> Exchange plan string is too verbose
> ---
>
> Key: SPARK-11641
> URL: https://issues.apache.org/jira/browse/SPARK-11641
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Yin Huai
> Fix For: 1.6.0
>
>
> By default it says "TungstenExchange(Shuffle without coordinator)". It should 
> just say TungstenExchange.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6519) Add spark.ml API for bisecting k-means

2015-11-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6519:
-
Shepherd: Xiangrui Meng
Assignee: Yu Ishikawa

> Add spark.ml API for bisecting k-means
> --
>
> Key: SPARK-6519
> URL: https://issues.apache.org/jira/browse/SPARK-6519
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Yu Ishikawa
>Assignee: Yu Ishikawa
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10726) Using dynamic-executor-allocation,When jobs are submitted parallelly, executors will be removed before tasks finish

2015-11-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-10726.
---
Resolution: Duplicate

> Using dynamic-executor-allocation,When jobs are submitted parallelly, 
> executors will be removed before tasks finish
> ---
>
> Key: SPARK-10726
> URL: https://issues.apache.org/jira/browse/SPARK-10726
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: KaiXinXIaoLei
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


