[jira] [Resolved] (SPARK-18790) Keep a general offset history of stream batches
[ https://issues.apache.org/jira/browse/SPARK-18790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-18790. -- Resolution: Fixed Assignee: Tyson Condie Fix Version/s: 2.1.1 2.0.3 Target Version/s: (was: 2.1.0) > Keep a general offset history of stream batches > --- > > Key: SPARK-18790 > URL: https://issues.apache.org/jira/browse/SPARK-18790 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Tyson Condie >Assignee: Tyson Condie > Fix For: 2.0.3, 2.1.1 > > > Instead of only keeping the minimum number of offsets around, we should keep > enough information to allow us to roll back n batches and reexecute the > stream starting from a given point. In particular, we should create a config > in SQLConf, spark.sql.streaming.retainedBatches that defaults to 100 and > ensure that we keep enough log files in the following places to roll back the > specified number of batches: > the offsets that are present in each batch > versions of the state store > the files lists stored for the FileStreamSource > the metadata log stored by the FileStreamSink -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
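The retention policy described in SPARK-18790 can be sketched outside Spark. This is a minimal illustration only: the function name and the list-of-batch-ids "log" are hypothetical stand-ins, not the actual metadata log implementation.

```python
def purge_old_batches(batch_ids, retained_batches=100):
    """Keep only the newest `retained_batches` entries of a batch log.

    Illustrative sketch: retaining the last N batch ids is what allows a
    stream to roll back up to N batches and re-execute from that point.
    """
    if retained_batches <= 0:
        raise ValueError("retained_batches must be positive")
    return sorted(batch_ids)[-retained_batches:]
```

The same pruning rule would have to be applied consistently to each log the issue lists (offsets, state store versions, FileStreamSource file lists, FileStreamSink metadata); otherwise one log could lose history another still needs.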
[jira] [Assigned] (SPARK-18828) Refactor SparkR build and test scripts
[ https://issues.apache.org/jira/browse/SPARK-18828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18828: Assignee: (was: Apache Spark) > Refactor SparkR build and test scripts > -- > > Key: SPARK-18828 > URL: https://issues.apache.org/jira/browse/SPARK-18828 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung > > Since we are building the SparkR source package we are now seeing the call tree > getting more convoluted and more parts are getting duplicated. > We should try to clean this up. > One issue is the requirement to install SparkR before building the SparkR > source package (i.e. R CMD build) because of the loading of SparkR via > "library(SparkR)" in the vignettes. When we refactor that part of the > vignettes we should be able to further decouple the scripts.
[jira] [Assigned] (SPARK-18828) Refactor SparkR build and test scripts
[ https://issues.apache.org/jira/browse/SPARK-18828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18828: Assignee: Apache Spark > Refactor SparkR build and test scripts > -- > > Key: SPARK-18828 > URL: https://issues.apache.org/jira/browse/SPARK-18828 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Apache Spark > > Since we are building the SparkR source package we are now seeing the call tree > getting more convoluted and more parts are getting duplicated. > We should try to clean this up. > One issue is the requirement to install SparkR before building the SparkR > source package (i.e. R CMD build) because of the loading of SparkR via > "library(SparkR)" in the vignettes. When we refactor that part of the > vignettes we should be able to further decouple the scripts.
[jira] [Commented] (SPARK-18828) Refactor SparkR build and test scripts
[ https://issues.apache.org/jira/browse/SPARK-18828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15741200#comment-15741200 ] Apache Spark commented on SPARK-18828: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/16249 > Refactor SparkR build and test scripts > -- > > Key: SPARK-18828 > URL: https://issues.apache.org/jira/browse/SPARK-18828 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung > > Since we are building the SparkR source package we are now seeing the call tree > getting more convoluted and more parts are getting duplicated. > We should try to clean this up. > One issue is the requirement to install SparkR before building the SparkR > source package (i.e. R CMD build) because of the loading of SparkR via > "library(SparkR)" in the vignettes. When we refactor that part of the > vignettes we should be able to further decouple the scripts.
[jira] [Created] (SPARK-18828) Refactor SparkR build and test scripts
Felix Cheung created SPARK-18828: Summary: Refactor SparkR build and test scripts Key: SPARK-18828 URL: https://issues.apache.org/jira/browse/SPARK-18828 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.1.0 Reporter: Felix Cheung Since we are building the SparkR source package we are now seeing the call tree getting more convoluted and more parts are getting duplicated. We should try to clean this up. One issue is the requirement to install SparkR before building the SparkR source package (i.e. R CMD build) because of the loading of SparkR via "library(SparkR)" in the vignettes. When we refactor that part of the vignettes we should be able to further decouple the scripts.
[jira] [Updated] (SPARK-18570) Consider supporting other R formula operators
[ https://issues.apache.org/jira/browse/SPARK-18570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18570: - Priority: Minor (was: Major) > Consider supporting other R formula operators > - > > Key: SPARK-18570 > URL: https://issues.apache.org/jira/browse/SPARK-18570 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Felix Cheung >Priority: Minor > > Such as > {code} > * > X*Y include these variables and the interactions between them > ^ > (X + Z + W)^3 include these variables and all interactions up to three-way > | > X | Z conditioning: include x given z > {code} > Others include %in% and ` (backtick) > https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html
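For readers unfamiliar with R formula semantics, the `*` operator above is shorthand for the main effects plus their interaction. A toy sketch of that expansion (the helper name is hypothetical; the colon notation for interactions follows R's convention, and the parser is deliberately minimal):

```python
def expand_star(term):
    """Expand an R-style 'X*Y' formula term into its main effects plus
    the interaction term 'X:Y'. Only the two-variable case is handled."""
    if "*" not in term:
        return [term.strip()]
    left, right = (t.strip() for t in term.split("*", 1))
    return [left, right, f"{left}:{right}"]
```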
[jira] [Updated] (SPARK-18569) Support R formula arithmetic
[ https://issues.apache.org/jira/browse/SPARK-18569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18569: - Affects Version/s: (was: 2.2.0) Target Version/s: 2.2.0 > Support R formula arithmetic > - > > Key: SPARK-18569 > URL: https://issues.apache.org/jira/browse/SPARK-18569 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Felix Cheung > > I think we should support arithmetic, which makes it a lot more convenient to > build models. Something like > {code} > log(y) ~ a + log(x) > {code} > And to avoid resolution confusion we should support the I() operator: > {code} > I > I(X*Z) as is: include a new variable consisting of these variables multiplied > {code} > Such that this works: > {code} > y ~ a + I(b+c) > {code} > the term b+c is to be interpreted as the sum of b and c.
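The I() semantics described above ("as is": evaluate the arithmetic before fitting) can be illustrated with a toy evaluator. The function name and its restriction to `+` between plain variable names are hypothetical, purely for illustration:

```python
def eval_identity_term(term, row):
    """Evaluate an R-style I() term such as 'I(b+c)' against a row dict.

    Toy sketch: strips the I( ) wrapper and sums the named columns, so
    'I(b+c)' yields one computed feature rather than two separate terms.
    """
    if not (term.startswith("I(") and term.endswith(")")):
        raise ValueError("expected a term of the form I(...)")
    inner = term[2:-1]
    return sum(row[name.strip()] for name in inner.split("+"))
```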
[jira] [Updated] (SPARK-18569) Support R formula arithmetic
[ https://issues.apache.org/jira/browse/SPARK-18569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18569: - Affects Version/s: 2.2.0 > Support R formula arithmetic > - > > Key: SPARK-18569 > URL: https://issues.apache.org/jira/browse/SPARK-18569 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Felix Cheung > > I think we should support arithmetic, which makes it a lot more convenient to > build models. Something like > {code} > log(y) ~ a + log(x) > {code} > And to avoid resolution confusion we should support the I() operator: > {code} > I > I(X*Z) as is: include a new variable consisting of these variables multiplied > {code} > Such that this works: > {code} > y ~ a + I(b+c) > {code} > the term b+c is to be interpreted as the sum of b and c.
[jira] [Updated] (SPARK-18570) Consider supporting other R formula operators
[ https://issues.apache.org/jira/browse/SPARK-18570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18570: - Target Version/s: 2.2.0 > Consider supporting other R formula operators > - > > Key: SPARK-18570 > URL: https://issues.apache.org/jira/browse/SPARK-18570 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Felix Cheung > > Such as > {code} > * > X*Y include these variables and the interactions between them > ^ > (X + Z + W)^3 include these variables and all interactions up to three-way > | > X | Z conditioning: include x given z > {code} > Others include %in% and ` (backtick) > https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html
[jira] [Updated] (SPARK-18348) Improve tree ensemble model summary
[ https://issues.apache.org/jira/browse/SPARK-18348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18348: - Target Version/s: 2.2.0 > Improve tree ensemble model summary > --- > > Key: SPARK-18348 > URL: https://issues.apache.org/jira/browse/SPARK-18348 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Affects Versions: 2.0.0, 2.1.0 >Reporter: Felix Cheung > > During work on R APIs for tree ensemble models (e.g. Random Forest, GBT) it was > discovered and discussed that > - we don't have a good summary on nodes or trees for their observations, > loss, probability and so on > - we don't have a shared API with nicely formatted output > We believe this could be a shared API that benefits multiple language > bindings, including R, when available. > For example, here is what R {code}rpart{code} shows for a model summary: > {code} > Call: > rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis, > method = "class") > n= 81 > CP nsplit rel error xerror xstd > 1 0.17647059 0 1.000 1.000 0.2155872 > 2 0.01960784 1 0.8235294 0.9411765 0.2107780 > 3 0.0100 4 0.7647059 1.0588235 0.2200975 > Variable importance > Start Age Number > 64 24 12 > Node number 1: 81 observations, complexity param=0.1764706 > predicted class=absent expected loss=0.2098765 P(node) =1 > class counts: 64 17 > probabilities: 0.790 0.210 > left son=2 (62 obs) right son=3 (19 obs) > Primary splits: > Start < 8.5 to the right, improve=6.762330, (0 missing) > Number < 5.5 to the left, improve=2.866795, (0 missing) > Age < 39.5 to the left, improve=2.250212, (0 missing) > Surrogate splits: > Number < 6.5 to the left, agree=0.802, adj=0.158, (0 split) > Node number 2: 62 observations, complexity param=0.01960784 > predicted class=absent expected loss=0.09677419 P(node) =0.7654321 > class counts: 56 6 > probabilities: 0.903 0.097 > left son=4 (29 obs) right son=5 (33 obs) > Primary splits: > Start < 14.5 to the right, improve=1.0205280, (0 missing) > Age < 55 to the left, improve=0.6848635, (0 missing) > Number < 4.5 to the left, improve=0.2975332, (0 missing) > Surrogate splits: > Number < 3.5 to the left, agree=0.645, adj=0.241, (0 split) > Age < 16 to the left, agree=0.597, adj=0.138, (0 split) > Node number 3: 19 observations > predicted class=present expected loss=0.4210526 P(node) =0.2345679 > class counts: 8 11 > probabilities: 0.421 0.579 > Node number 4: 29 observations > predicted class=absent expected loss=0 P(node) =0.3580247 > class counts: 29 0 > probabilities: 1.000 0.000 > Node number 5: 33 observations, complexity param=0.01960784 > predicted class=absent expected loss=0.1818182 P(node) =0.4074074 > class counts: 27 6 > probabilities: 0.818 0.182 > left son=10 (12 obs) right son=11 (21 obs) > Primary splits: > Age < 55 to the left, improve=1.2467530, (0 missing) > Start < 12.5 to the right, improve=0.2887701, (0 missing) > Number < 3.5 to the right, improve=0.1753247, (0 missing) > Surrogate splits: > Start < 9.5 to the left, agree=0.758, adj=0.333, (0 split) > Number < 5.5 to the right, agree=0.697, adj=0.167, (0 split) > Node number 10: 12 observations > predicted class=absent expected loss=0 P(node) =0.1481481 > class counts: 12 0 > probabilities: 1.000 0.000 > Node number 11: 21 observations, complexity param=0.01960784 > predicted class=absent expected loss=0.2857143 P(node) =0.2592593 > class counts: 15 6 > probabilities: 0.714 0.286 > left son=22 (14 obs) right son=23 (7 obs) > Primary splits: > Age < 111 to the right, improve=1.71428600, (0 missing) > Start < 12.5 to the right, improve=0.79365080, (0 missing) > Number < 3.5 to the right, improve=0.07142857, (0 missing) > Node number 22: 14 observations > predicted class=absent expected loss=0.1428571 P(node) =0.1728395 > class counts: 12 2 > probabilities: 0.857 0.143 > Node number 23: 7 observations > predicted class=present expected loss=0.4285714 P(node) =0.08641975 > class counts: 3 4 > probabilities: 0.429 0.571 > {code}
[jira] [Commented] (SPARK-10413) Model should support prediction on single instance
[ https://issues.apache.org/jira/browse/SPARK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15741169#comment-15741169 ] Yanbo Liang commented on SPARK-10413: - [~anshbansal] Yeah, we will put this feature at high priority in the Spark 2.2 release cycle. I think there is no JIRA ticket for a predict method on the whole pipeline model; that work depends on this feature. Thanks. > Model should support prediction on single instance > -- > > Key: SPARK-10413 > URL: https://issues.apache.org/jira/browse/SPARK-10413 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Xiangrui Meng >Priority: Critical > > Currently models in the pipeline API only implement transform(DataFrame). It > would be quite useful to support prediction on a single instance. > UPDATE: This issue is for making predictions with single models. We can make > methods like {{def predict(features: Vector): Double}} public. > * This issue is *not* for single-instance prediction for full Pipelines, > which would require making predictions on {{Row}}s.
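The distinction the issue draws (dataset-level transform versus single-instance predict) can be shown with a toy model. This is not Spark's API; the class and method shapes below are illustrative only:

```python
class ToyLinearModel:
    """Toy model contrasting dataset-level transform with the
    single-instance predict the issue wants exposed publicly."""

    def __init__(self, weights, bias=0.0):
        self.weights = weights
        self.bias = bias

    def predict(self, features):
        # Single-instance path: no DataFrame machinery involved.
        return sum(w * x for w, x in zip(self.weights, features)) + self.bias

    def transform(self, dataset):
        # Dataset-level path, built on the single-instance method.
        return [(row, self.predict(row)) for row in dataset]
```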
[jira] [Updated] (SPARK-10413) Model should support prediction on single instance
[ https://issues.apache.org/jira/browse/SPARK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-10413: Labels: (was: 2.2.0) > Model should support prediction on single instance > -- > > Key: SPARK-10413 > URL: https://issues.apache.org/jira/browse/SPARK-10413 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Xiangrui Meng >Priority: Critical > > Currently models in the pipeline API only implement transform(DataFrame). It > would be quite useful to support prediction on a single instance. > UPDATE: This issue is for making predictions with single models. We can make > methods like {{def predict(features: Vector): Double}} public. > * This issue is *not* for single-instance prediction for full Pipelines, > which would require making predictions on {{Row}}s.
[jira] [Updated] (SPARK-10413) Model should support prediction on single instance
[ https://issues.apache.org/jira/browse/SPARK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-10413: Labels: 2.2.0 (was: ) > Model should support prediction on single instance > -- > > Key: SPARK-10413 > URL: https://issues.apache.org/jira/browse/SPARK-10413 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Xiangrui Meng >Priority: Critical > Labels: 2.2.0 > > Currently models in the pipeline API only implement transform(DataFrame). It > would be quite useful to support prediction on a single instance. > UPDATE: This issue is for making predictions with single models. We can make > methods like {{def predict(features: Vector): Double}} public. > * This issue is *not* for single-instance prediction for full Pipelines, > which would require making predictions on {{Row}}s.
[jira] [Updated] (SPARK-10884) Support prediction on single instance for regression and classification related models
[ https://issues.apache.org/jira/browse/SPARK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-10884: Labels: 2.2.0 (was: ) > Support prediction on single instance for regression and classification > related models > -- > > Key: SPARK-10884 > URL: https://issues.apache.org/jira/browse/SPARK-10884 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > Labels: 2.2.0 > > Support prediction on a single instance for regression and classification > related models (i.e., PredictionModel, ClassificationModel and their > subclasses). > Add corresponding test cases. > See parent issue for more details.
[jira] [Assigned] (SPARK-10884) Support prediction on single instance for regression and classification related models
[ https://issues.apache.org/jira/browse/SPARK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang reassigned SPARK-10884: --- Assignee: Yanbo Liang > Support prediction on single instance for regression and classification > related models > -- > > Key: SPARK-10884 > URL: https://issues.apache.org/jira/browse/SPARK-10884 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > Labels: 2.2.0 > > Support prediction on a single instance for regression and classification > related models (i.e., PredictionModel, ClassificationModel and their > subclasses). > Add corresponding test cases. > See parent issue for more details.
[jira] [Assigned] (SPARK-18827) Can't cache broadcast to disk
[ https://issues.apache.org/jira/browse/SPARK-18827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18827: Assignee: Apache Spark > Can't cache broadcast to disk > -- > > Key: SPARK-18827 > URL: https://issues.apache.org/jira/browse/SPARK-18827 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1, 2.0.2, 2.1.0 >Reporter: Yuming Wang >Assignee: Apache Spark > > How to reproduce it: > {code:java} > test("Cache broadcast to disk") { > val conf = new SparkConf() > .setAppName("Cache broadcast to disk") > .setMaster("local") > .set("spark.memory.useLegacyMode", "true") > .set("spark.storage.memoryFraction", "0.0") > sc = new SparkContext(conf) > val list = List[Int](1, 2, 3, 4) > val broadcast = sc.broadcast(list) > assert(broadcast.value.sum === 10) > } > {code} > It fails on Spark 2.0.1, 2.0.2 and 2.1.0.
[jira] [Assigned] (SPARK-18827) Can't cache broadcast to disk
[ https://issues.apache.org/jira/browse/SPARK-18827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18827: Assignee: (was: Apache Spark) > Can't cache broadcast to disk > -- > > Key: SPARK-18827 > URL: https://issues.apache.org/jira/browse/SPARK-18827 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1, 2.0.2, 2.1.0 >Reporter: Yuming Wang > > How to reproduce it: > {code:java} > test("Cache broadcast to disk") { > val conf = new SparkConf() > .setAppName("Cache broadcast to disk") > .setMaster("local") > .set("spark.memory.useLegacyMode", "true") > .set("spark.storage.memoryFraction", "0.0") > sc = new SparkContext(conf) > val list = List[Int](1, 2, 3, 4) > val broadcast = sc.broadcast(list) > assert(broadcast.value.sum === 10) > } > {code} > It fails on Spark 2.0.1, 2.0.2 and 2.1.0.
[jira] [Commented] (SPARK-18827) Can't cache broadcast to disk
[ https://issues.apache.org/jira/browse/SPARK-18827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15741061#comment-15741061 ] Apache Spark commented on SPARK-18827: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/16252 > Can't cache broadcast to disk > -- > > Key: SPARK-18827 > URL: https://issues.apache.org/jira/browse/SPARK-18827 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1, 2.0.2, 2.1.0 >Reporter: Yuming Wang > > How to reproduce it: > {code:java} > test("Cache broadcast to disk") { > val conf = new SparkConf() > .setAppName("Cache broadcast to disk") > .setMaster("local") > .set("spark.memory.useLegacyMode", "true") > .set("spark.storage.memoryFraction", "0.0") > sc = new SparkContext(conf) > val list = List[Int](1, 2, 3, 4) > val broadcast = sc.broadcast(list) > assert(broadcast.value.sum === 10) > } > {code} > It fails on Spark 2.0.1, 2.0.2 and 2.1.0.
[jira] [Assigned] (SPARK-18826) Make FileStream be able to start with most recent files
[ https://issues.apache.org/jira/browse/SPARK-18826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18826: Assignee: Shixiong Zhu (was: Apache Spark) > Make FileStream be able to start with most recent files > --- > > Key: SPARK-18826 > URL: https://issues.apache.org/jira/browse/SPARK-18826 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > When starting a stream with a lot of backfill and maxFilesPerTrigger, the > user will often want to start with the most recent files first. This would let > you keep low latency for recent data and slowly backfill historical data. > We should add an option to control this behavior.
[jira] [Assigned] (SPARK-18826) Make FileStream be able to start with most recent files
[ https://issues.apache.org/jira/browse/SPARK-18826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18826: Assignee: Apache Spark (was: Shixiong Zhu) > Make FileStream be able to start with most recent files > --- > > Key: SPARK-18826 > URL: https://issues.apache.org/jira/browse/SPARK-18826 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Shixiong Zhu >Assignee: Apache Spark > > When starting a stream with a lot of backfill and maxFilesPerTrigger, the > user will often want to start with the most recent files first. This would let > you keep low latency for recent data and slowly backfill historical data. > We should add an option to control this behavior.
[jira] [Commented] (SPARK-18826) Make FileStream be able to start with most recent files
[ https://issues.apache.org/jira/browse/SPARK-18826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15741011#comment-15741011 ] Apache Spark commented on SPARK-18826: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/16251 > Make FileStream be able to start with most recent files > --- > > Key: SPARK-18826 > URL: https://issues.apache.org/jira/browse/SPARK-18826 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > When starting a stream with a lot of backfill and maxFilesPerTrigger, the > user will often want to start with the most recent files first. This would let > you keep low latency for recent data and slowly backfill historical data. > We should add an option to control this behavior.
[jira] [Commented] (SPARK-18827) Can't cache broadcast to disk
[ https://issues.apache.org/jira/browse/SPARK-18827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15741009#comment-15741009 ] Yuming Wang commented on SPARK-18827: - I will create a PR later. > Can't cache broadcast to disk > -- > > Key: SPARK-18827 > URL: https://issues.apache.org/jira/browse/SPARK-18827 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1, 2.0.2, 2.1.0 >Reporter: Yuming Wang > > How to reproduce it: > {code:java} > test("Cache broadcast to disk") { > val conf = new SparkConf() > .setAppName("Cache broadcast to disk") > .setMaster("local") > .set("spark.memory.useLegacyMode", "true") > .set("spark.storage.memoryFraction", "0.0") > sc = new SparkContext(conf) > val list = List[Int](1, 2, 3, 4) > val broadcast = sc.broadcast(list) > assert(broadcast.value.sum === 10) > } > {code} > It fails on Spark 2.0.1, 2.0.2 and 2.1.0.
[jira] [Created] (SPARK-18827) Can't cache broadcast to disk
Yuming Wang created SPARK-18827: --- Summary: Can't cache broadcast to disk Key: SPARK-18827 URL: https://issues.apache.org/jira/browse/SPARK-18827 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.2, 2.0.1, 2.1.0 Reporter: Yuming Wang How to reproduce it: {code:java} test("Cache broadcast to disk") { val conf = new SparkConf() .setAppName("Cache broadcast to disk") .setMaster("local") .set("spark.memory.useLegacyMode", "true") .set("spark.storage.memoryFraction", "0.0") sc = new SparkContext(conf) val list = List[Int](1, 2, 3, 4) val broadcast = sc.broadcast(list) assert(broadcast.value.sum === 10) } {code} It fails on Spark 2.0.1, 2.0.2 and 2.1.0.
[jira] [Created] (SPARK-18826) Make FileStream be able to start with most recent files
Shixiong Zhu created SPARK-18826: Summary: Make FileStream be able to start with most recent files Key: SPARK-18826 URL: https://issues.apache.org/jira/browse/SPARK-18826 Project: Spark Issue Type: Improvement Components: Structured Streaming Reporter: Shixiong Zhu Assignee: Shixiong Zhu When starting a stream with a lot of backfill and maxFilesPerTrigger, the user will often want to start with the most recent files first. This would let you keep low latency for recent data and slowly backfill historical data. We should add an option to control this behavior.
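The latest-first behavior proposed above can be sketched as an ordering rule over candidate files. The function and parameter names here are hypothetical, not the actual FileStreamSource option names:

```python
def pick_next_batch(files, max_files_per_trigger, latest_first=True):
    """Choose the next batch of files for a file stream.

    Sketch: order candidate files by modification time, newest first,
    so recent data stays low-latency while the historical backfill
    proceeds in later triggers. `files` is a list of (path, mod_time)
    pairs.
    """
    ordered = sorted(files, key=lambda f: f[1], reverse=latest_first)
    return [path for path, _ in ordered[:max_files_per_trigger]]
```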
[jira] [Updated] (SPARK-15572) MLlib in R format: compatibility with other languages
[ https://issues.apache.org/jira/browse/SPARK-15572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-15572: Shepherd: Yanbo Liang > MLlib in R format: compatibility with other languages > - > > Key: SPARK-15572 > URL: https://issues.apache.org/jira/browse/SPARK-15572 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Joseph K. Bradley > > Currently, models saved in R cannot be loaded easily into other languages. > This is because R saves extra metadata (feature names) alongside the model. > We should fix this issue so that models can be transferred seamlessly between > languages.
[jira] [Comment Edited] (SPARK-15572) MLlib in R format: compatibility with other languages
[ https://issues.apache.org/jira/browse/SPARK-15572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740967#comment-15740967 ] Yanbo Liang edited comment on SPARK-15572 at 12/12/16 4:50 AM: --- Sure, that's great. I updated myself as the shepherd. was (Author: yanboliang): Sure, that great. I updated me as the shepherd. > MLlib in R format: compatibility with other languages > - > > Key: SPARK-15572 > URL: https://issues.apache.org/jira/browse/SPARK-15572 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Joseph K. Bradley > > Currently, models saved in R cannot be loaded easily into other languages. > This is because R saves extra metadata (feature names) alongside the model. > We should fix this issue so that models can be transferred seamlessly between > languages.
[jira] [Commented] (SPARK-15572) MLlib in R format: compatibility with other languages
[ https://issues.apache.org/jira/browse/SPARK-15572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740967#comment-15740967 ] Yanbo Liang commented on SPARK-15572: - Sure, that great. I updated me as the shepherd. > MLlib in R format: compatibility with other languages > - > > Key: SPARK-15572 > URL: https://issues.apache.org/jira/browse/SPARK-15572 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Joseph K. Bradley > > Currently, models saved in R cannot be loaded easily into other languages. > This is because R saves extra metadata (feature names) alongside the model. > We should fix this issue so that models can be transferred seamlessly between > languages.
[jira] [Resolved] (SPARK-18325) SparkR 2.1 QA: Check for new R APIs requiring example code
[ https://issues.apache.org/jira/browse/SPARK-18325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-18325. - Resolution: Fixed Fix Version/s: 2.1.1 > SparkR 2.1 QA: Check for new R APIs requiring example code > -- > > Key: SPARK-18325 > URL: https://issues.apache.org/jira/browse/SPARK-18325 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > Fix For: 2.1.1 > > > Audit list of new features added to MLlib's R API, and see which major items > are missing example code (in the examples folder). We do not need examples > for everything, only for major items such as new algorithms. > For any such items: > * Create a JIRA for that feature, and assign it to the author of the feature > (or yourself if interested). > * Link it to (a) the original JIRA which introduced that feature ("related > to") and (b) to this JIRA ("requires"). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18325) SparkR 2.1 QA: Check for new R APIs requiring example code
[ https://issues.apache.org/jira/browse/SPARK-18325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740943#comment-15740943 ] Yanbo Liang commented on SPARK-18325: - Since PR 16148 has been merged, I think we can resolve this task. PR 16214 is follow-up work which is not strongly required in this release (it may be merged after 2.1), so I will resolve this to avoid blocking the 2.1 release. Thanks. > SparkR 2.1 QA: Check for new R APIs requiring example code > -- > > Key: SPARK-18325 > URL: https://issues.apache.org/jira/browse/SPARK-18325 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > Audit list of new features added to MLlib's R API, and see which major items > are missing example code (in the examples folder). We do not need examples > for everything, only for major items such as new algorithms. > For any such items: > * Create a JIRA for that feature, and assign it to the author of the feature > (or yourself if interested). > * Link it to (a) the original JIRA which introduced that feature ("related > to") and (b) to this JIRA ("requires"). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)
[ https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740906#comment-15740906 ] caolan commented on SPARK-17147: I am using Spark 2.0.0 + Kafka 0.10 + compacted topics, including in some production environments, so this fix is really important. The question, then, is how importance is decided: compacted Kafka topics should be widely used by now, and Spark 2.0 should support them well. As for the other issue, it did not happen all the time and had no regular pattern: several times in one day, or not at all for several days. So I should enlarge spark.streaming.kafka.consumer.poll.ms, right? > Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets > (i.e. Log Compaction) > -- > > Key: SPARK-17147 > URL: https://issues.apache.org/jira/browse/SPARK-17147 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: Robert Conrad > > When Kafka does log compaction offsets often end up with gaps, meaning the > next requested offset will frequently not be offset+1. The logic in > KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset > will always be just an increment of 1 above the previous offset. > I have worked around this problem by changing CachedKafkaConsumer to use the > returned record's offset, from: > {{nextOffset = offset + 1}} > to: > {{nextOffset = record.offset + 1}} > and changed KafkaRDD from: > {{requestOffset += 1}} > to: > {{requestOffset = r.offset() + 1}} > (I also had to change some assert logic in CachedKafkaConsumer). > There's a strong possibility that I have misconstrued how to use the > streaming kafka consumer, and I'm happy to close this out if that's the case. > If, however, it is supposed to support non-consecutive offsets (e.g. due to > log compaction) I am also happy to contribute a PR. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
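The workaround described in the report amounts to deriving the next expected offset from the record actually returned, rather than assuming offsets are consecutive. A minimal Python simulation of the two strategies (with a hypothetical Record type, not Spark's actual KafkaRDD or CachedKafkaConsumer classes):

```python
from collections import namedtuple

# Hypothetical stand-in for a Kafka consumer record.
Record = namedtuple("Record", ["offset", "value"])

# A compacted log: offsets 0..5 once existed, but compaction removed 1, 2, and 4.
compacted_log = [Record(0, "a"), Record(3, "b"), Record(5, "c")]

def read_assuming_consecutive(log, start_offset):
    """Buggy logic: assumes the next offset is always previous + 1."""
    next_offset, out = start_offset, []
    for record in log:
        if record.offset != next_offset:
            raise AssertionError(
                f"expected offset {next_offset}, got {record.offset}")
        out.append(record.value)
        next_offset += 1
    return out

def read_from_returned_offsets(log, start_offset):
    """Fixed logic: derive the next offset from the record that came back."""
    next_offset, out = start_offset, []
    for record in log:
        assert record.offset >= next_offset  # offsets only move forward
        out.append(record.value)
        next_offset = record.offset + 1      # i.e. nextOffset = record.offset + 1
    return out

print(read_from_returned_offsets(compacted_log, 0))  # ['a', 'b', 'c']
try:
    read_assuming_consecutive(compacted_log, 0)
except AssertionError as e:
    print("consecutive-offset assumption fails:", e)
```

The simulation shows why the one-line change tolerates compaction gaps: the reader trusts the broker's returned offsets instead of its own counter.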
[jira] [Assigned] (SPARK-18824) Add optimizer rule to reorder expensive Filter predicates like ScalaUDF
[ https://issues.apache.org/jira/browse/SPARK-18824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18824: Assignee: (was: Apache Spark) > Add optimizer rule to reorder expensive Filter predicates like ScalaUDF > --- > > Key: SPARK-18824 > URL: https://issues.apache.org/jira/browse/SPARK-18824 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > During evaluation of predicates in Filter, we can reorder the expressions in > order to evaluate the more expensive expressions like ScalaUDF later. So if > other expressions are evaluated to false, we can avoid evaluation of these > UDFs. > We can add an optimizer rule to do this optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18824) Add optimizer rule to reorder expensive Filter predicates like ScalaUDF
[ https://issues.apache.org/jira/browse/SPARK-18824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740903#comment-15740903 ] Apache Spark commented on SPARK-18824: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/16245 > Add optimizer rule to reorder expensive Filter predicates like ScalaUDF > --- > > Key: SPARK-18824 > URL: https://issues.apache.org/jira/browse/SPARK-18824 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > During evaluation of predicates in Filter, we can reorder the expressions in > order to evaluate the more expensive expressions like ScalaUDF later. So if > other expressions are evaluated to false, we can avoid evaluation of these > UDFs. > We can add an optimizer rule to do this optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18824) Add optimizer rule to reorder expensive Filter predicates like ScalaUDF
[ https://issues.apache.org/jira/browse/SPARK-18824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18824: Assignee: Apache Spark > Add optimizer rule to reorder expensive Filter predicates like ScalaUDF > --- > > Key: SPARK-18824 > URL: https://issues.apache.org/jira/browse/SPARK-18824 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark > > During evaluation of predicates in Filter, we can reorder the expressions in > order to evaluate the more expensive expressions like ScalaUDF later. So if > other expressions are evaluated to false, we can avoid evaluation of these > UDFs. > We can add an optimizer rule to do this optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18332) SparkR 2.1 QA: Programming guide, migration guide, vignettes updates
[ https://issues.apache.org/jira/browse/SPARK-18332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740557#comment-15740557 ] Joseph K. Bradley edited comment on SPARK-18332 at 12/12/16 4:05 AM: - Let's do it after the 2.1 release. We can always update the docs post-hoc. I made a JIRA for it: [SPARK-18825] was (Author: josephkb): Let's do it after the 2.1 release. We can always update the docs post-hoc. I'll make a JIRA for it. > SparkR 2.1 QA: Programming guide, migration guide, vignettes updates > > > Key: SPARK-18332 > URL: https://issues.apache.org/jira/browse/SPARK-18332 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Critical > > Before the release, we need to update the SparkR Programming Guide, its > migration guide, and the R vignettes. Updates will include: > * Add migration guide subsection. > ** Use the results of the QA audit JIRAs and [SPARK-17692]. > * Check phrasing, especially in main sections (for outdated items such as "In > this release, ...") > * Update R vignettes > Note: This task is for large changes to the guides. New features are handled > in [SPARK-18330]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18825) Eliminate duplicate links in SparkR API doc index
Joseph K. Bradley created SPARK-18825: - Summary: Eliminate duplicate links in SparkR API doc index Key: SPARK-18825 URL: https://issues.apache.org/jira/browse/SPARK-18825 Project: Spark Issue Type: Documentation Components: Documentation, SparkR Reporter: Joseph K. Bradley The SparkR API docs contain many duplicate links with suffixes {{-method}} or {{-class}} in the index . E.g., {{atan}} and {{atan-method}} link to the same doc. Copying from [~felixcheung] in [SPARK-18332]: {quote} They are because of the {{@ aliases}} tags. I think we are adding them because CRAN checks require them to match the specific format - [~shivaram] would you know? I am pretty sure they are double-listed because in addition to aliases we also have {{@ rdname}} which automatically generate the links as well. I suspect if we change all the rdname to match the string in aliases then there will be one link. I can take a shot at this to test this out, but changes will be very extensive - is this something we could get into 2.1 still? {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18824) Add optimizer rule to reorder expensive Filter predicates like ScalaUDF
Liang-Chi Hsieh created SPARK-18824: --- Summary: Add optimizer rule to reorder expensive Filter predicates like ScalaUDF Key: SPARK-18824 URL: https://issues.apache.org/jira/browse/SPARK-18824 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh During evaluation of predicates in Filter, we can reorder the expressions in order to evaluate the more expensive expressions like ScalaUDF later. So if other expressions are evaluated to false, we can avoid evaluation of these UDFs. We can add an optimizer rule to do this optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
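The proposed rule can be illustrated outside Spark: order a conjunction of filter predicates so that cheap ones run first, and expensive ones (stand-ins for Scala UDFs, which Catalyst cannot inspect) are only evaluated when everything cheaper passed. This is an illustrative Python sketch with made-up cost numbers, not the actual Catalyst rule:

```python
# Track how often each predicate is evaluated.
calls = []

def cheap_pred(row):
    calls.append("cheap")
    return row["x"] > 0

def expensive_udf(row):
    calls.append("expensive")
    # Simulate an opaque, costly user-defined function.
    return sum(range(1000)) % 2 == 0 and row["name"].startswith("a")

# (estimated cost, predicate) pairs for one Filter's conjunction.
predicates = [(100, expensive_udf), (1, cheap_pred)]

def apply_filter(rows, predicates):
    # The "optimizer rule": sort predicates by estimated cost, cheapest first.
    ordered = [p for _, p in sorted(predicates, key=lambda cp: cp[0])]
    # all() short-circuits, so expensive predicates are skipped whenever a
    # cheaper one already returned False.
    return [row for row in rows if all(p(row) for p in ordered)]

rows = [{"x": -1, "name": "abc"}, {"x": 2, "name": "abc"}]
result = apply_filter(rows, predicates)
# The expensive predicate ran only for the row that passed the cheap one.
print([r["x"] for r in result], calls.count("expensive"))  # [2] 1
```

Without the reordering, the expensive predicate would run for every row; with it, the row failing the cheap predicate never reaches the UDF.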
[jira] [Commented] (SPARK-16073) Performance of Parquet encodings on saving primitive arrays
[ https://issues.apache.org/jira/browse/SPARK-16073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740807#comment-15740807 ] Kazuaki Ishizaki commented on SPARK-16073: -- This is an interesting topic. In the current situation, SPARK-16043 will not be merged soon, because performance issues for DataFrame/Dataset programs with primitive arrays are being addressed by other approaches. If there are benchmark programs for this measurement, I am happy to run them with SPARK-16043. > Performance of Parquet encodings on saving primitive arrays > --- > > Key: SPARK-16073 > URL: https://issues.apache.org/jira/browse/SPARK-16073 > Project: Spark > Issue Type: Task > Components: MLlib, SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > Spark supports both uncompressed and compressed (snappy, gzip, lzo) Parquet > data. However, Parquet also has its own encodings to compress columns/arrays, > e.g., dictionary encoding: > https://github.com/apache/parquet-format/blob/master/Encodings.md. > It might be worth checking the performance overhead of Parquet encodings on > saving large primitive arrays, which is a machine learning use case. If the > overhead is significant, we should expose a configuration in Spark to control > the encoding levels. > Note that this shouldn't be tested under Spark until SPARK-16043 was fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18806) driverwrapper and executor doesn't exit when worker killed
[ https://issues.apache.org/jira/browse/SPARK-18806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740792#comment-15740792 ] liujianhui commented on SPARK-18806: No, it's a problem: sometimes two copies of the same driver exist! And a coarse-grained executor in a zombie state keeps its memory reserved even after the worker has exited. > driverwrapper and executor doesn't exit when worker killed > -- > > Key: SPARK-18806 > URL: https://issues.apache.org/jira/browse/SPARK-18806 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.6.1 > Environment: java1.8 >Reporter: liujianhui > > Submit an application in standalone-cluster mode; the master will then > launch an executor and a driverwrapper on a worker. Both start a WorkerWatcher > to watch the worker. As a result, when the worker is killed manually, the > driverwrapper and executor sometimes do not exit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18820) Driver may send "LaunchTask" before executor receive "RegisteredExecutor"
[ https://issues.apache.org/jira/browse/SPARK-18820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740754#comment-15740754 ] jin xing commented on SPARK-18820: -- [~lins05] Thanks a lot for your comment : ) In our company's cluster, we see many of the NullPointerExceptions described above. Checking the source code, I found that CoarseGrainedSchedulerBackend updates executorDataMap first and then replies with "RegisteredExecutor". After executorDataMap is updated, the newly joined executor may be sent "LaunchTask", which can result in "LaunchTask" arriving before "RegisteredExecutor". What do you think about this? > Driver may send "LaunchTask" before executor receive "RegisteredExecutor" > - > > Key: SPARK-18820 > URL: https://issues.apache.org/jira/browse/SPARK-18820 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.6.3 > Environment: spark-1.6.3 >Reporter: jin xing > > CoarseGrainedSchedulerBackend will update executorDataMap after receiving > "RegisterExecutor", thus the task scheduler may assign tasks to this executor; > If LaunchTask arrives at CoarseGrainedExecutorBackend before > RegisteredExecutor, it will result in a NullPointerException and the executor > backend will exit; > Is it a bug? If so, can I make a PR? I think the driver should send "LaunchTask" > only after "RegisteredExecutor" has been received. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
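The ordering described in the comment can be reproduced with a toy message-queue model (plain Python, not Spark's actual RPC layer; the names mirror the discussion but are assumptions for illustration). If the driver registers the executor in its map before sending the RegisteredExecutor reply, a task scheduled in between is enqueued first:

```python
from collections import deque

# Messages flowing from driver to one executor, in send order.
executor_inbox = deque()

# Driver-side view of registered executors.
executor_data_map = {}

def schedule_tasks():
    # Any executor present in the map is eligible to receive tasks.
    for _executor_id in executor_data_map:
        executor_inbox.append("LaunchTask")

def driver_handle_register(executor_id):
    # Problematic order: update the map first...
    executor_data_map[executor_id] = {"free_cores": 1}
    # ...the scheduler may now pick this executor and send LaunchTask...
    schedule_tasks()
    # ...and only then is the RegisteredExecutor reply sent.
    executor_inbox.append("RegisteredExecutor")

driver_handle_register("exec-1")
print(list(executor_inbox))  # ['LaunchTask', 'RegisteredExecutor']
```

In this model the executor sees "LaunchTask" before it knows it is registered, which is exactly the window in which the reported NullPointerException can occur; registering in the map only after the reply is sent closes the window.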
[jira] [Created] (SPARK-18823) Assignation by column name variable not available or bug?
Vicente Masip created SPARK-18823: - Summary: Assignation by column name variable not available or bug? Key: SPARK-18823 URL: https://issues.apache.org/jira/browse/SPARK-18823 Project: Spark Issue Type: Question Components: SparkR Affects Versions: 2.0.2 Environment: RStudio Server in EC2 Instances (EMR Service of AWS) Emr 4. Or databricks (community.cloud.databricks.com). Reporter: Vicente Masip Fix For: 2.0.2 I really don't know if this is a bug or whether it can be done with some function. Sometimes it is very important to assign something to a column whose name has to be accessed through a variable. Outside of SparkR, I have always done this with double brackets, like this: # df could be the faithful data set as a normal data frame or data table. # accessing by variable name: myname = "waiting" df[[myname]] <- c(1:nrow(df)) # or even by column number df[[2]] <- df$eruptions The error is not caused by the right-hand side of the "<-" assignment operator. The problem is that I can't assign to a column name using a variable or column number as I do in these examples outside of Spark. It doesn't matter whether I am modifying or creating a column; same problem. I have also tried this, with no results: val df2 = withColumn(df,"tmp", df$eruptions) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18332) SparkR 2.1 QA: Programming guide, migration guide, vignettes updates
[ https://issues.apache.org/jira/browse/SPARK-18332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740557#comment-15740557 ] Joseph K. Bradley commented on SPARK-18332: --- Let's do it after the 2.1 release. We can always update the docs post-hoc. I'll make a JIRA for it. > SparkR 2.1 QA: Programming guide, migration guide, vignettes updates > > > Key: SPARK-18332 > URL: https://issues.apache.org/jira/browse/SPARK-18332 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Critical > > Before the release, we need to update the SparkR Programming Guide, its > migration guide, and the R vignettes. Updates will include: > * Add migration guide subsection. > ** Use the results of the QA audit JIRAs and [SPARK-17692]. > * Check phrasing, especially in main sections (for outdated items such as "In > this release, ...") > * Update R vignettes > Note: This task is for large changes to the guides. New features are handled > in [SPARK-18330]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740419#comment-15740419 ] Nicholas Chammas commented on SPARK-13587: -- Thanks to a lot of help from [~quasi...@gmail.com] and [his blog post on this problem|http://quasiben.github.io/blog/2016/4/15/conda-spark/], I was able to develop a solution that works for Spark on YARN: {code} set -e # Both these directories exist on all of our YARN nodes. # Otherwise, everything else is built and shipped out at submit-time # with our application. export HADOOP_CONF_DIR="/etc/hadoop/conf" export SPARK_HOME="/hadoop/spark/spark-2.0.2-bin-hadoop2.6" export PATH="$SPARK_HOME/bin:$PATH" python3 -m venv venv/ source venv/bin/activate pip install -U pip pip install -r requirements.pip pip install -r requirements-dev.pip deactivate # This convoluted zip machinery is to ensure that the paths to the files inside the zip # look the same to Python when it runs within YARN. # If there is a simpler way to express this, I'd be interested to know! pushd venv/ zip -rq ../venv.zip * popd pushd myproject/ zip -rq ../myproject.zip * popd pushd tests/ zip -rq ../tests.zip * popd export PYSPARK_PYTHON="venv/bin/python" spark-submit \ --conf "spark.yarn.appMasterEnv.PYSPARK_PYTHON=venv/bin/python" \ --conf "spark.yarn.appMasterEnv.SPARK_HOME=$SPARK_HOME" \ --master yarn \ --deploy-mode client \ --archives "venv.zip#venv,myproject.zip#myproject,tests.zip#tests" \ run_tests.py -v {code} My solution is based off of Ben's, except where Ben uses Conda I just use pip. I don't know if there is a way to adapt this solution to work with Spark on Mesos or Spark Standalone (and I haven't tried since my environment is YARN), but if someone figures it out please post your solution here! 
As Ben explains in [his blog post|http://quasiben.github.io/blog/2016/4/15/conda-spark/], this lets you build and ship an isolated environment with your PySpark application out to the YARN cluster. The YARN nodes don't even need to have the correct version of Python (or Python at all!) installed, because you are shipping out a complete Python environment via the {{--archives}} option. I hope this helps some people who are looking for a workaround they can use today while a more robust solution is developed directly into Spark. And I wonder... if this {{--archives}} technique can be extended or translated to Mesos and Standalone somehow, maybe that would be a good enough solution for the time being? People would be able to run their jobs in an isolated Python environment using their tool of choice (conda or pip), and Spark wouldn't need to add any virtualenv-specific machinery. > Support virtualenv in PySpark > - > > Key: SPARK-13587 > URL: https://issues.apache.org/jira/browse/SPARK-13587 > Project: Spark > Issue Type: New Feature > Components: PySpark >Reporter: Jeff Zhang > > Currently, it's not easy for user to add third party python packages in > pyspark. > * One way is to using --py-files (suitable for simple dependency, but not > suitable for complicated dependency, especially with transitive dependency) > * Another way is install packages manually on each node (time wasting, and > not easy to switch to different environment) > Python has now 2 different virtualenv implementation. One is native > virtualenv another is through conda. This jira is trying to migrate these 2 > tools to distributed environment -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18813) MLlib 2.2 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740216#comment-15740216 ] Felix Cheung commented on SPARK-18813: -- This is great, Joseph. Thanks for putting down the framework on this. > MLlib 2.2 Roadmap > - > > Key: SPARK-18813 > URL: https://issues.apache.org/jira/browse/SPARK-18813 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > Labels: roadmap > > *PROPOSAL: This includes a proposal for the 2.2 roadmap process for MLlib.* > The roadmap process described below is significantly updated since the 2.1 > roadmap [SPARK-15581]. Please refer to [SPARK-15581] for more discussion on > the basis for this proposal, and comment in this JIRA if you have suggestions > for improvements. > h1. Roadmap process > This roadmap is a master list for MLlib improvements we are working on during > this release. This includes ML-related changes in PySpark and SparkR. > *What is planned for the next release?* > * This roadmap lists issues which at least one Committer has prioritized. > See details below in "Instructions for committers." > * This roadmap only lists larger or more critical issues. > *How can contributors influence this roadmap?* > * If you believe an issue should be in this roadmap, please discuss the issue > on JIRA and/or the dev mailing list. Make sure to ping Committers since at > least one must agree to shepherd the issue. > * For general discussions, use this JIRA or the dev mailing list. For > specific issues, please comment on those issues or the mailing list. > h2. Target Version and Priority > This section describes the meaning of Target Version and Priority. _These > meanings have been updated in this proposal for the 2.2 process._ > || Category | Target Version | Priority | Shepherd | Put on roadmap? | In > next release? 
|| > | 1 | next release | Blocker | *must* | *must* | *must* | > | 2 | next release | Critical | *must* | yes, unless small | *best effort* | > | 3 | next release | Major | *must* | optional | *best effort* | > | 4 | next release | Minor | optional | no | maybe | > | 5 | next release | Trivial | optional | no | maybe | > | 6 | (empty) | (any) | yes | no | maybe | > | 7 | (empty) | (any) | no | no | maybe | > The *Category* in the table above has the following meaning: > 1. A committer has promised to see this issue to completion for the next > release. Contributions *will* receive attention. > 2-3. A committer has promised to see this issue to completion for the next > release. Contributions *will* receive attention. The issue may slip to the > next release if development is slower than expected. > 4-5. A committer has promised interest in this issue. Contributions *will* > receive attention. The issue may slip to another release. > 6. A committer has promised interest in this issue and should respond, but no > promises are made about priorities or releases. > 7. This issue is open for discussion, but it needs a committer to promise > interest to proceed. > h1. Instructions > h2. For contributors > Getting started > * Please read > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > carefully. Code style, documentation, and unit tests are important. > * If you are a first-time contributor, please always start with a small > [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather > than a larger feature. > Coordinating on JIRA > * Never work silently. Let everyone know on the corresponding JIRA page when > you start work. This is to avoid duplicate work. For small patches, you do > not need to get the JIRA assigned to you to begin work. > * For medium/large features or features with dependencies, please get > assigned first before coding and keep the ETA updated on the JIRA. 
If there > is no activity on the JIRA page for a certain amount of time, the JIRA should > be released for other contributors. > * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one > after another. > * Do not set these fields: Target Version, Fix Version, or Shepherd. Only > Committers should set those. > Writing and reviewing PRs > * Remember to add the `@Since("VERSION")` annotation to new public APIs. > * *Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code > review greatly helps to improve others' code as well as yours.* > h2. For Committers > Adding to this roadmap > * You can update the roadmap by (a) adding issues to this list and (b) > setting Target Versions. Only Committers may make these changes. > * *If you add an issue to this roadmap or set a Target Version, you _must_ > assign yourself or another Committer as Shepherd.* > * This list should be actively mana
[jira] [Updated] (SPARK-18821) Bisecting k-means wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18821: - Shepherd: Felix Cheung > Bisecting k-means wrapper in SparkR > --- > > Key: SPARK-18821 > URL: https://issues.apache.org/jira/browse/SPARK-18821 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Felix Cheung > > Implement a wrapper in SparkR to support bisecting k-means -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18822) Support ML Pipeline in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18822: - Shepherd: Felix Cheung > Support ML Pipeline in SparkR > - > > Key: SPARK-18822 > URL: https://issues.apache.org/jira/browse/SPARK-18822 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Felix Cheung > > From Joseph Bradley: > " > Supporting Pipelines and advanced use cases: There really needs to be more > design discussion around SparkR. Felix Cheung would you be interested in > leading some discussion? I'm envisioning something similar to what was done a > while back for Pipelines in Scala/Java/Python, where we consider several use > cases of MLlib: fitting a single model, creating and tuning a complex > Pipeline, and working with multiple languages. That should help inform what > APIs should look like in Spark R. > " > Certain ML model, such as OneVsRest, is harder to represent in a single call > R API. Having advanced API or Pipeline API like this could help to expose > that to our users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15767) Decision Tree Regression wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-15767: - Shepherd: Felix Cheung > Decision Tree Regression wrapper in SparkR > -- > > Key: SPARK-15767 > URL: https://issues.apache.org/jira/browse/SPARK-15767 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Kai Jiang >Assignee: Kai Jiang > > Implement a wrapper in SparkR to support decision tree regression. R's naive > Decision Tree Regression implementation is from package rpart with signature > rpart(formula, dataframe, method="anova"). I propose we could implement API > like spark.rpart(dataframe, formula, ...) . After having implemented > decision tree classification, we could refactor this two into an API more > like rpart() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18819) Failure to read single-row Parquet files
[ https://issues.apache.org/jira/browse/SPARK-18819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740208#comment-15740208 ] Michael Kamprath commented on SPARK-18819: -- One more note, this issue only arises when doubles are in the Parquet file. This code runs just fine in the ARM71 environment: {code} from pyspark.sql.types import * rdd2 = sc.parallelize([('row3',1,5,'name'),('row4',2,6,'string')]) my_schema2 = StructType([ StructField("id", StringType(), True), StructField("value1", IntegerType(), True), StructField("value2", IntegerType(), True), StructField("name",StringType(), True) ]) df2 = spark.createDataFrame( rdd2, schema=my_schema2) df2.coalesce(1).write.parquet('hdfs://master:9000/user/michael/test_data2',mode='overwrite') newdf2 = spark.read.parquet('hdfs://master:9000/user/michael/test_data2/') newdf2.take(1) {code} ARM71 requires doubles to be 8-byte aligned. So this is the first time I am digging into the Spark code ... is [SPARK-16962|https://github.com/apache/spark/pull/14762] a similar issue? I see that issue didn't address double alignment. > Failure to read single-row Parquet files > > > Key: SPARK-18819 > URL: https://issues.apache.org/jira/browse/SPARK-18819 > Project: Spark > Issue Type: Bug > Components: Input/Output, PySpark >Affects Versions: 2.0.2 > Environment: Ubuntu 14.04 LTS on ARM 7.1 >Reporter: Michael Kamprath >Priority: Critical > > When I create a data frame in PySpark with a small row count (less than > number executors), then write it to a parquet file, then load that parquet > file into a new data frame, and finally do any sort of read against the > loaded new data frame, Spark fails with an {{ExecutorLostFailure}}. 
> Example code to replicate this issue:
> {code}
> from pyspark.sql.types import *
> rdd = sc.parallelize([('row1',1,4.33,'name'),('row2',2,3.14,'string')])
> my_schema = StructType([
>     StructField("id", StringType(), True),
>     StructField("value1", IntegerType(), True),
>     StructField("value2", DoubleType(), True),
>     StructField("name", StringType(), True)
> ])
> df = spark.createDataFrame(rdd, schema=my_schema)
> df.write.parquet('hdfs://master:9000/user/michael/test_data', mode='overwrite')
> newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/')
> newdf.take(1)
> {code}
> The error I get when the {{take}} step runs is:
> {code}
> ---
> Py4JJavaError Traceback (most recent call last)
> in ()
> 1 newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/')
> 2 newdf.take(1)
> /usr/local/spark/python/pyspark/sql/dataframe.py in take(self, num)
> 346 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
> 347 """
> --> 348 return self.limit(num).collect()
> 349
> 350 @since(1.3)
> /usr/local/spark/python/pyspark/sql/dataframe.py in collect(self)
> 308 """
> 309 with SCCallSiteSync(self._sc) as css:
> --> 310 port = self._jdf.collectToPython()
> 311 return list(_load_from_socket(port, BatchedSerializer(PickleSerializer(
> 312
> /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in __call__(self, *args)
> 1131 answer = self.gateway_client.send_command(command)
> 1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, self.name)
> 1134
> 1135 for temp_arg in temp_args:
> /usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
> 61 def deco(*a, **kw):
> 62 try:
> ---> 63 return f(*a, **kw)
> 64 except py4j.protocol.Py4JJavaError as e:
> 65 s = e.java_exception.toString()
> /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
> 317 raise Py4JJavaError(
> 318 "An error occurred while calling {0}{1}{2}.\n".
> --> 319 format(target_id, ".", name), value)
> 320 else:
> 321 raise Py4JError(
> Py4JJavaError: An error occurred while calling o54.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 6, 10.10.10.4): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
> Driver stacktrace:
> at org.apache
[jira] [Updated] (SPARK-18822) Support ML Pipeline in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18822: - Description: >From Joseph Bradley: " Supporting Pipelines and advanced use cases: There really needs to be more design discussion around SparkR. Felix Cheung would you be interested in leading some discussion? I'm envisioning something similar to what was done a while back for Pipelines in Scala/Java/Python, where we consider several use cases of MLlib: fitting a single model, creating and tuning a complex Pipeline, and working with multiple languages. That should help inform what APIs should look like in Spark R. " Certain ML model, such as OneVsRest, is harder to represent in a single call R API. Having advanced API or Pipeline API like this could help to expose that to our users. was: >From Joseph Bradley: " Supporting Pipelines and advanced use cases: There really needs to be more design discussion around SparkR. Felix Cheung would you be interested in leading some discussion? I'm envisioning something similar to what was done a while back for Pipelines in Scala/Java/Python, where we consider several use cases of MLlib: fitting a single model, creating and tuning a complex Pipeline, and working with multiple languages. That should help inform what APIs should look like in Spark R. " Certain ML model, such as OneVsRest, is harder to represent in a single call R API. Having advanced API or Pipeline API like this could help to expose that to our users > Support ML Pipeline in SparkR > - > > Key: SPARK-18822 > URL: https://issues.apache.org/jira/browse/SPARK-18822 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Felix Cheung > > From Joseph Bradley: > " > Supporting Pipelines and advanced use cases: There really needs to be more > design discussion around SparkR. Felix Cheung would you be interested in > leading some discussion? 
I'm envisioning something similar to what was done a > while back for Pipelines in Scala/Java/Python, where we consider several use > cases of MLlib: fitting a single model, creating and tuning a complex > Pipeline, and working with multiple languages. That should help inform what > APIs should look like in Spark R. > " > Certain ML model, such as OneVsRest, is harder to represent in a single call > R API. Having advanced API or Pipeline API like this could help to expose > that to our users.
[jira] [Updated] (SPARK-18822) Support ML Pipeline in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18822: - Description: >From Joseph Bradley: " Supporting Pipelines and advanced use cases: There really needs to be more design discussion around SparkR. Felix Cheung would you be interested in leading some discussion? I'm envisioning something similar to what was done a while back for Pipelines in Scala/Java/Python, where we consider several use cases of MLlib: fitting a single model, creating and tuning a complex Pipeline, and working with multiple languages. That should help inform what APIs should look like in Spark R. " Certain ML model, such as OneVsRest, is harder to represent in a single call R API. Having advanced API or Pipeline API like this could help to expose that to our users was: >From Joseph Bradley: " Supporting Pipelines and advanced use cases: There really needs to be more design discussion around SparkR. Felix Cheung would you be interested in leading some discussion? I'm envisioning something similar to what was done a while back for Pipelines in Scala/Java/Python, where we consider several use cases of MLlib: fitting a single model, creating and tuning a complex Pipeline, and working with multiple languages. That should help inform what APIs should look like in Spark R. " > Support ML Pipeline in SparkR > - > > Key: SPARK-18822 > URL: https://issues.apache.org/jira/browse/SPARK-18822 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Felix Cheung > > From Joseph Bradley: > " > Supporting Pipelines and advanced use cases: There really needs to be more > design discussion around SparkR. Felix Cheung would you be interested in > leading some discussion? 
I'm envisioning something similar to what was done a > while back for Pipelines in Scala/Java/Python, where we consider several use > cases of MLlib: fitting a single model, creating and tuning a complex > Pipeline, and working with multiple languages. That should help inform what > APIs should look like in Spark R. > " > Certain ML model, such as OneVsRest, is harder to represent in a single call > R API. Having advanced API or Pipeline API like this could help to expose > that to our users
[jira] [Comment Edited] (SPARK-18813) MLlib 2.2 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740184#comment-15740184 ] Felix Cheung edited comment on SPARK-18813 at 12/11/16 7:11 PM: I added a couple of JIRAs for R that can be found with [this query|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20component%20in%20(SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0%20ORDER%20BY%20priority%20DESC] We could turn them into subtasks if we are having umbrella was (Author: felixcheung): I added a couple of JIRAs for R that can be found with [this query|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20component%20in%20(SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0%20ORDER%20BY%20priority%20DESC] > MLlib 2.2 Roadmap > - > > Key: SPARK-18813 > URL: https://issues.apache.org/jira/browse/SPARK-18813 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > Labels: roadmap > > *PROPOSAL: This includes a proposal for the 2.2 roadmap process for MLlib.* > The roadmap process described below is significantly updated since the 2.1 > roadmap [SPARK-15581]. Please refer to [SPARK-15581] for more discussion on > the basis for this proposal, and comment in this JIRA if you have suggestions > for improvements. > h1. Roadmap process > This roadmap is a master list for MLlib improvements we are working on during > this release. This includes ML-related changes in PySpark and SparkR. > *What is planned for the next release?* > * This roadmap lists issues which at least one Committer has prioritized. > See details below in "Instructions for committers." 
> * This roadmap only lists larger or more critical issues. > *How can contributors influence this roadmap?* > * If you believe an issue should be in this roadmap, please discuss the issue > on JIRA and/or the dev mailing list. Make sure to ping Committers since at > least one must agree to shepherd the issue. > * For general discussions, use this JIRA or the dev mailing list. For > specific issues, please comment on those issues or the mailing list. > h2. Target Version and Priority > This section describes the meaning of Target Version and Priority. _These > meanings have been updated in this proposal for the 2.2 process._ > || Category | Target Version | Priority | Shepherd | Put on roadmap? | In > next release? || > | 1 | next release | Blocker | *must* | *must* | *must* | > | 2 | next release | Critical | *must* | yes, unless small | *best effort* | > | 3 | next release | Major | *must* | optional | *best effort* | > | 4 | next release | Minor | optional | no | maybe | > | 5 | next release | Trivial | optional | no | maybe | > | 6 | (empty) | (any) | yes | no | maybe | > | 7 | (empty) | (any) | no | no | maybe | > The *Category* in the table above has the following meaning: > 1. A committer has promised to see this issue to completion for the next > release. Contributions *will* receive attention. > 2-3. A committer has promised to see this issue to completion for the next > release. Contributions *will* receive attention. The issue may slip to the > next release if development is slower than expected. > 4-5. A committer has promised interest in this issue. Contributions *will* > receive attention. The issue may slip to another release. > 6. A committer has promised interest in this issue and should respond, but no > promises are made about priorities or releases. > 7. This issue is open for discussion, but it needs a committer to promise > interest to proceed. > h1. Instructions > h2. 
For contributors > Getting started > * Please read > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > carefully. Code style, documentation, and unit tests are important. > * If you are a first-time contributor, please always start with a small > [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather > than a larger feature. > Coordinating on JIRA > * Never work silently. Let everyone know on the corresponding JIRA page when > you start work. This is to avoid duplicate work. For small patches, you do > not need to get the JIRA assigned to you to begin work. > * For medium/large features or features with dependencies, please get > assigned first before coding and keep the ETA updated on the JIRA. If there > is no activity on the JIRA page for a certain amount of time, the JIRA should > be released for other contributors. > * Do not claim multip
[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740185#comment-15740185 ] Felix Cheung commented on SPARK-15581: -- re: Pipeline in R - certainly. opened https://issues.apache.org/jira/browse/SPARK-18822 to track. > MLlib 2.1 Roadmap > - > > Key: SPARK-15581 > URL: https://issues.apache.org/jira/browse/SPARK-15581 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > Labels: roadmap > Fix For: 2.1.0 > > > This is a master list for MLlib improvements we are working on for the next > release. Please view this as a wish list rather than a definite plan, for we > don't have an accurate estimate of available resources. Due to limited review > bandwidth, features appearing on this list will get higher priority during > code review. But feel free to suggest new items to the list in comments. We > are experimenting with this process. Your feedback would be greatly > appreciated. > h1. Instructions > h2. For contributors: > * Please read > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > carefully. Code style, documentation, and unit tests are important. > * If you are a first-time Spark contributor, please always start with a > [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather > than a medium/big feature. Based on our experience, mixing the development > process with a big feature usually causes long delay in code review. > * Never work silently. Let everyone know on the corresponding JIRA page when > you start working on some features. This is to avoid duplicate work. For > small features, you don't need to wait to get JIRA assigned. > * For medium/big features or features with dependencies, please get assigned > first before coding and keep the ETA updated on the JIRA. 
If there exist no > activity on the JIRA page for a certain amount of time, the JIRA should be > released for other contributors. > * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one > after another. > * Remember to add the `@Since("VERSION")` annotation to new public APIs. > * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code > review greatly helps to improve others' code as well as yours. > h2. For committers: > * Try to break down big features into small and specific JIRA tasks and link > them properly. > * Add a "starter" label to starter tasks. > * Put a rough estimate for medium/big features and track the progress. > * If you start reviewing a PR, please add yourself to the Shepherd field on > JIRA. > * If the code looks good to you, please comment "LGTM". For non-trivial PRs, > please ping a maintainer to make a final pass. > * After merging a PR, create and link JIRAs for Python, example code, and > documentation if applicable. > h1. Roadmap (*WIP*) > This is NOT [a complete list of MLlib JIRAs for 2.1| > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority]. > We only include umbrella JIRAs and high-level tasks. > Major efforts in this release: > * Feature parity for the DataFrames-based API (`spark.ml`), relative to the > RDD-based API > * ML persistence > * Python API feature parity and test coverage > * R API expansion and improvements > * Note about new features: As usual, we expect to expand the feature set of > MLlib. However, we will prioritize API parity, bug fixes, and improvements > over new features. > Note `spark.mllib` is in maintenance mode now. 
We will accept bug fixes for > it, but new features, APIs, and improvements will only be added to `spark.ml`. > h2. Critical feature parity in DataFrame-based API > * Umbrella JIRA: [SPARK-4591] > h2. Persistence > * Complete persistence within MLlib > ** Python tuning (SPARK-13786) > * MLlib in R format: compatibility with other languages (SPARK-15572) > * Impose backwards compatibility for persistence (SPARK-15573) > h2. Python API > * Standardize unit tests for Scala and Python to improve and consolidate test > coverage for Params, persistence, and other common functionality (SPARK-15571) > * Improve Python API handling of Params, persistence (SPARK-14771) > (SPARK-14706) > ** Note: The linked JIRAs for this are incomplete. More to be created... > ** Related: Implement Python meta-algorithms in Scala (to simplify > persistence) (SPARK-15574) > * Feature parity: The main goal of the Python API is to have feature parity > wit
[jira] [Commented] (SPARK-18813) MLlib 2.2 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740184#comment-15740184 ] Felix Cheung commented on SPARK-18813: -- I added a couple of JIRAs for R that can be found with [this query|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20component%20in%20(SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0%20ORDER%20BY%20priority%20DESC] > MLlib 2.2 Roadmap > - > > Key: SPARK-18813 > URL: https://issues.apache.org/jira/browse/SPARK-18813 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > Labels: roadmap > > *PROPOSAL: This includes a proposal for the 2.2 roadmap process for MLlib.* > The roadmap process described below is significantly updated since the 2.1 > roadmap [SPARK-15581]. Please refer to [SPARK-15581] for more discussion on > the basis for this proposal, and comment in this JIRA if you have suggestions > for improvements. > h1. Roadmap process > This roadmap is a master list for MLlib improvements we are working on during > this release. This includes ML-related changes in PySpark and SparkR. > *What is planned for the next release?* > * This roadmap lists issues which at least one Committer has prioritized. > See details below in "Instructions for committers." > * This roadmap only lists larger or more critical issues. > *How can contributors influence this roadmap?* > * If you believe an issue should be in this roadmap, please discuss the issue > on JIRA and/or the dev mailing list. Make sure to ping Committers since at > least one must agree to shepherd the issue. > * For general discussions, use this JIRA or the dev mailing list. For > specific issues, please comment on those issues or the mailing list. > h2. 
Target Version and Priority > This section describes the meaning of Target Version and Priority. _These > meanings have been updated in this proposal for the 2.2 process._ > || Category | Target Version | Priority | Shepherd | Put on roadmap? | In > next release? || > | 1 | next release | Blocker | *must* | *must* | *must* | > | 2 | next release | Critical | *must* | yes, unless small | *best effort* | > | 3 | next release | Major | *must* | optional | *best effort* | > | 4 | next release | Minor | optional | no | maybe | > | 5 | next release | Trivial | optional | no | maybe | > | 6 | (empty) | (any) | yes | no | maybe | > | 7 | (empty) | (any) | no | no | maybe | > The *Category* in the table above has the following meaning: > 1. A committer has promised to see this issue to completion for the next > release. Contributions *will* receive attention. > 2-3. A committer has promised to see this issue to completion for the next > release. Contributions *will* receive attention. The issue may slip to the > next release if development is slower than expected. > 4-5. A committer has promised interest in this issue. Contributions *will* > receive attention. The issue may slip to another release. > 6. A committer has promised interest in this issue and should respond, but no > promises are made about priorities or releases. > 7. This issue is open for discussion, but it needs a committer to promise > interest to proceed. > h1. Instructions > h2. For contributors > Getting started > * Please read > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > carefully. Code style, documentation, and unit tests are important. > * If you are a first-time contributor, please always start with a small > [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather > than a larger feature. > Coordinating on JIRA > * Never work silently. Let everyone know on the corresponding JIRA page when > you start work. This is to avoid duplicate work. 
For small patches, you do > not need to get the JIRA assigned to you to begin work. > * For medium/large features or features with dependencies, please get > assigned first before coding and keep the ETA updated on the JIRA. If there > is no activity on the JIRA page for a certain amount of time, the JIRA should > be released for other contributors. > * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one > after another. > * Do not set these fields: Target Version, Fix Version, or Shepherd. Only > Committers should set those. > Writing and reviewing PRs > * Remember to add the `@Since("VERSION")` annotation to new public APIs. > * *Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code > review greatly helps to improve others' code as well as yours.* > h2. For Committers > Adding to this roadmap > * You can update the
[jira] [Commented] (SPARK-18822) Support ML Pipeline in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740181#comment-15740181 ] Felix Cheung commented on SPARK-18822: -- I'll take a shot at this. > Support ML Pipeline in SparkR > - > > Key: SPARK-18822 > URL: https://issues.apache.org/jira/browse/SPARK-18822 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Felix Cheung > > From Joseph Bradley: > " > Supporting Pipelines and advanced use cases: There really needs to be more > design discussion around SparkR. Felix Cheung would you be interested in > leading some discussion? I'm envisioning something similar to what was done a > while back for Pipelines in Scala/Java/Python, where we consider several use > cases of MLlib: fitting a single model, creating and tuning a complex > Pipeline, and working with multiple languages. That should help inform what > APIs should look like in Spark R. > "
[jira] [Created] (SPARK-18822) Support ML Pipeline in SparkR
Felix Cheung created SPARK-18822: Summary: Support ML Pipeline in SparkR Key: SPARK-18822 URL: https://issues.apache.org/jira/browse/SPARK-18822 Project: Spark Issue Type: New Feature Components: ML, SparkR Reporter: Felix Cheung From Joseph Bradley: " Supporting Pipelines and advanced use cases: There really needs to be more design discussion around SparkR. Felix Cheung would you be interested in leading some discussion? I'm envisioning something similar to what was done a while back for Pipelines in Scala/Java/Python, where we consider several use cases of MLlib: fitting a single model, creating and tuning a complex Pipeline, and working with multiple languages. That should help inform what APIs should look like in Spark R. "
[jira] [Created] (SPARK-18821) Bisecting k-means wrapper in SparkR
Felix Cheung created SPARK-18821: Summary: Bisecting k-means wrapper in SparkR Key: SPARK-18821 URL: https://issues.apache.org/jira/browse/SPARK-18821 Project: Spark Issue Type: New Feature Components: ML, SparkR Reporter: Felix Cheung Implement a wrapper in SparkR to support bisecting k-means
[jira] [Commented] (SPARK-18332) SparkR 2.1 QA: Programming guide, migration guide, vignettes updates
[ https://issues.apache.org/jira/browse/SPARK-18332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740172#comment-15740172 ] Felix Cheung commented on SPARK-18332: -- [~josephkb] they are because of the {code}@aliases{code} tags. I think we are adding them because CRAN checks require them to match the specific format - [~shivaram] would you know? I am pretty sure they are doubly listed because, in addition to aliases, we also have {code}@rdname{code}, which automatically generates the links as well. I suspect that if we change all the rdname tags to match the string in aliases, there will be one link. I can take a shot at testing this out, but the changes will be very extensive - is this something we could still get into 2.1? > SparkR 2.1 QA: Programming guide, migration guide, vignettes updates > > > Key: SPARK-18332 > URL: https://issues.apache.org/jira/browse/SPARK-18332 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Critical > > Before the release, we need to update the SparkR Programming Guide, its > migration guide, and the R vignettes. Updates will include: > * Add migration guide subsection. > ** Use the results of the QA audit JIRAs and [SPARK-17692]. > * Check phrasing, especially in main sections (for outdated items such as "In > this release, ...") > * Update R vignettes > Note: This task is for large changes to the guides. New features are handled > in [SPARK-18330].
[jira] [Updated] (SPARK-18226) SparkR displaying vector columns in incorrect way
[ https://issues.apache.org/jira/browse/SPARK-18226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krishna Kalyan updated SPARK-18226: --- Component/s: SparkR > SparkR displaying vector columns in incorrect way > - > > Key: SPARK-18226 > URL: https://issues.apache.org/jira/browse/SPARK-18226 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Grzegorz Chilkiewicz >Priority: Trivial > > I have encountered a problem with SparkR presenting Spark vectors from the > org.apache.spark.mllib.linalg package: > * `head(df)` shows in the vector column: "" > * cast to string does not work as expected, it shows: > "[1,null,null,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@79f50a91]" > * `showDF(df)` works correctly > To reproduce, start SparkR and paste the following code (example taken from > https://spark.apache.org/docs/latest/sparkr.html#naive-bayes-model)
> {code}
> # Fit a Bernoulli naive Bayes model with spark.naiveBayes
> titanic <- as.data.frame(Titanic)
> titanicDF <- createDataFrame(titanic[titanic$Freq > 0, -5])
> nbDF <- titanicDF
> nbTestDF <- titanicDF
> nbModel <- spark.naiveBayes(nbDF, Survived ~ Class + Sex + Age)
> # Model summary
> summary(nbModel)
> # Prediction
> nbPredictions <- predict(nbModel, nbTestDF)
> #
> # My modification to expose the problem #
> nbPredictions$rawPrediction_str <- cast(nbPredictions$rawPrediction, "string")
> head(nbPredictions)
> showDF(nbPredictions)
> {code}
[jira] [Updated] (SPARK-18226) SparkR displaying vector columns in incorrect way
[ https://issues.apache.org/jira/browse/SPARK-18226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krishna Kalyan updated SPARK-18226: --- Component/s: (was: SparkR) > SparkR displaying vector columns in incorrect way > - > > Key: SPARK-18226 > URL: https://issues.apache.org/jira/browse/SPARK-18226 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Grzegorz Chilkiewicz >Priority: Trivial > > I have encountered a problem with SparkR presenting Spark vectors from the > org.apache.spark.mllib.linalg package: > * `head(df)` shows in the vector column: "" > * cast to string does not work as expected, it shows: > "[1,null,null,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@79f50a91]" > * `showDF(df)` works correctly > To reproduce, start SparkR and paste the following code (example taken from > https://spark.apache.org/docs/latest/sparkr.html#naive-bayes-model)
> {code}
> # Fit a Bernoulli naive Bayes model with spark.naiveBayes
> titanic <- as.data.frame(Titanic)
> titanicDF <- createDataFrame(titanic[titanic$Freq > 0, -5])
> nbDF <- titanicDF
> nbTestDF <- titanicDF
> nbModel <- spark.naiveBayes(nbDF, Survived ~ Class + Sex + Age)
> # Model summary
> summary(nbModel)
> # Prediction
> nbPredictions <- predict(nbModel, nbTestDF)
> #
> # My modification to expose the problem #
> nbPredictions$rawPrediction_str <- cast(nbPredictions$rawPrediction, "string")
> head(nbPredictions)
> showDF(nbPredictions)
> {code}
[jira] [Commented] (SPARK-18820) Driver may send "LaunchTask" before executor receive "RegisteredExecutor"
[ https://issues.apache.org/jira/browse/SPARK-18820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15739970#comment-15739970 ] Shuai Lin commented on SPARK-18820: --- The driver first sends the {{RegisteredExecutor}} message and then, if there is a task scheduled to run on this executor, sends the {{LaunchTask}} message, both through the same underlying netty channel. So I think the order is guaranteed, and the problem described would never happen. > Driver may send "LaunchTask" before executor receive "RegisteredExecutor" > - > > Key: SPARK-18820 > URL: https://issues.apache.org/jira/browse/SPARK-18820 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.6.3 > Environment: spark-1.6.3 >Reporter: jin xing > > CoarseGrainedSchedulerBackend will update executorDataMap after receiving > "RegisterExecutor", thus task scheduler may assign tasks on to this executor; > If LaunchTask arrives at CoarseGrainedExecutorBackend before > RegisteredExecutor, it will result in NullPointerException and executor > backend will exit; > Is it a bug? If so can I make a pr? I think driver should send "LaunchTask" > after "RegisteredExecutor" is already received.
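The ordering argument in the comment above can be sketched with a toy, non-Spark model: two messages written to a single FIFO channel are always read back in send order. The {{queue.Queue}} here merely stands in for the single netty channel; this is an illustration of the claim, not Spark's actual RPC code.

```python
import queue

# Illustrative sketch (not Spark's RPC implementation): messages delivered
# over one ordered channel are consumed in the order they were sent, so an
# executor cannot observe LaunchTask before RegisteredExecutor.
channel = queue.Queue()  # stands in for the single underlying netty channel
channel.put("RegisteredExecutor")
channel.put("LaunchTask")
received = [channel.get(), channel.get()]
assert received == ["RegisteredExecutor", "LaunchTask"]
```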
[jira] [Updated] (SPARK-18820) Driver may send "LaunchTask" before executor receive "RegisteredExecutor"
[ https://issues.apache.org/jira/browse/SPARK-18820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jin xing updated SPARK-18820: - Description: CoarseGrainedSchedulerBackend will update executorDataMap after receiving "RegisterExecutor", thus task scheduler may assign tasks on to this executor; If LaunchTask arrives at CoarseGrainedExecutorBackend before RegisteredExecutor, it will result in NullPointerException and executor backend will exit; Is it a bug? If so can I make a pr? I think driver should send "LaunchTask" after "RegisteredExecutor" is already received. was: CoarseGrainedSchedulerBackend will update executorDataMap after receiving "RegisterExecutor", thus task scheduler may assign tasks on to this executor; If LaunchTask arrives at CoarseGrainedExecutorBackend before RegisteredExecutor, it will result in NullPointerException and executor backend will exit; Is it a bug? I think driver should send "LaunchTask" after "RegisteredExecutor" is already received. > Driver may send "LaunchTask" before executor receive "RegisteredExecutor" > - > > Key: SPARK-18820 > URL: https://issues.apache.org/jira/browse/SPARK-18820 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.6.3 > Environment: spark-1.6.3 >Reporter: jin xing > > CoarseGrainedSchedulerBackend will update executorDataMap after receiving > "RegisterExecutor", thus task scheduler may assign tasks on to this executor; > If LaunchTask arrives at CoarseGrainedExecutorBackend before > RegisteredExecutor, it will result in NullPointerException and executor > backend will exit; > Is it a bug? If so can I make a pr? I think driver should send "LaunchTask" > after "RegisteredExecutor" is already received.
[jira] [Created] (SPARK-18820) Driver may send "LaunchTask" before executor receives "RegisteredExecutor"
jin xing created SPARK-18820: Summary: Driver may send "LaunchTask" before executor receives "RegisteredExecutor" Key: SPARK-18820 URL: https://issues.apache.org/jira/browse/SPARK-18820 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.6.3 Environment: spark-1.6.3 Reporter: jin xing CoarseGrainedSchedulerBackend updates executorDataMap after receiving "RegisterExecutor", so the task scheduler may assign tasks to this executor. If "LaunchTask" arrives at CoarseGrainedExecutorBackend before "RegisteredExecutor", it results in a NullPointerException and the executor backend exits. Is this a bug? I think the driver should send "LaunchTask" only after "RegisteredExecutor" has been received. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
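The race reported above can be sketched without any Spark code. The class and message names below are hypothetical stand-ins for the CoarseGrainedExecutorBackend message protocol; this is an illustrative sketch of one possible fix (buffering early "LaunchTask" messages until registration completes), not the actual Spark implementation or the fix that was merged.

```python
from collections import deque

class ExecutorBackend:
    """Toy model of an executor backend's message loop (not Spark source)."""

    def __init__(self):
        self.registered = False
        self.pending = deque()   # tasks that arrived before registration
        self.launched = []

    def receive(self, message, payload=None):
        if message == "RegisteredExecutor":
            self.registered = True
            # Drain any tasks that raced ahead of the registration ack.
            while self.pending:
                self.launched.append(self.pending.popleft())
        elif message == "LaunchTask":
            if not self.registered:
                # Without a guard like this, the reported backend dereferences
                # a still-uninitialized executor and dies with a
                # NullPointerException.
                self.pending.append(payload)
            else:
                self.launched.append(payload)

backend = ExecutorBackend()
backend.receive("LaunchTask", "task-0")   # arrives first: buffered, no crash
backend.receive("RegisteredExecutor")     # registration drains the buffer
backend.receive("LaunchTask", "task-1")
print(backend.launched)                   # ['task-0', 'task-1']
```

The reporter's proposal (driver waits for "RegisteredExecutor" before sending "LaunchTask") attacks the same race from the sender's side; either ordering guarantee removes the NullPointerException window.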
[jira] [Comment Edited] (SPARK-18642) Spark SQL: Catalyst is scanning undesired columns
[ https://issues.apache.org/jira/browse/SPARK-18642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15739526#comment-15739526 ] Mohit edited comment on SPARK-18642 at 12/11/16 10:42 AM: -- [~dongjoon] We would appreciate it if you could share your findings in the form of 'touch-points' in the source code. was (Author: mohitgargk): [~dongjoon] Please share your findings in the form of 'touch-points' in the source code. > Spark SQL: Catalyst is scanning undesired columns > - > > Key: SPARK-18642 > URL: https://issues.apache.org/jira/browse/SPARK-18642 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 1.6.3 > Environment: Ubuntu 14.04 > Spark: Local Mode >Reporter: Mohit > Labels: performance > Fix For: 2.0.0 > > > When doing a left join between two tables, say A and B, Catalyst has > information about the projection required for table B. Only the required > columns should be scanned. > The code snippet below illustrates the scenario: > scala> val dfA = sqlContext.read.parquet("/home/mohit/ruleA") > dfA: org.apache.spark.sql.DataFrame = [aid: int, aVal: string] > scala> val dfB = sqlContext.read.parquet("/home/mohit/ruleB") > dfB: org.apache.spark.sql.DataFrame = [bid: int, bVal: string] > scala> dfA.registerTempTable("A") > scala> dfB.registerTempTable("B") > scala> sqlContext.sql("select A.aid, B.bid from A left join B on A.aid=B.bid > where B.bid<2").explain > == Physical Plan == > Project [aid#15,bid#17] > +- Filter (bid#17 < 2) >+- BroadcastHashOuterJoin [aid#15], [bid#17], LeftOuter, None > :- Scan ParquetRelation[aid#15,aVal#16] InputPaths: > file:/home/mohit/ruleA > +- Scan ParquetRelation[bid#17,bVal#18] InputPaths: > file:/home/mohit/ruleB > This is a watered-down example of a production issue with a huge > performance impact. 
> External reference: > http://stackoverflow.com/questions/40783675/spark-sql-catalyst-is-scanning-undesired-columns -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18642) Spark SQL: Catalyst is scanning undesired columns
[ https://issues.apache.org/jira/browse/SPARK-18642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15739526#comment-15739526 ] Mohit commented on SPARK-18642: --- [~dongjoon] Please share your findings in form of 'touch-points' from the source-code. > Spark SQL: Catalyst is scanning undesired columns > - > > Key: SPARK-18642 > URL: https://issues.apache.org/jira/browse/SPARK-18642 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 1.6.3 > Environment: Ubuntu 14.04 > Spark: Local Mode >Reporter: Mohit > Labels: performance > Fix For: 2.0.0 > > > When doing a left-join between two tables, say A and B, Catalyst has > information about the projection required for table B. Only the required > columns should be scanned. > Code snippet below explains the scenario: > scala> val dfA = sqlContext.read.parquet("/home/mohit/ruleA") > dfA: org.apache.spark.sql.DataFrame = [aid: int, aVal: string] > scala> val dfB = sqlContext.read.parquet("/home/mohit/ruleB") > dfB: org.apache.spark.sql.DataFrame = [bid: int, bVal: string] > scala> dfA.registerTempTable("A") > scala> dfB.registerTempTable("B") > scala> sqlContext.sql("select A.aid, B.bid from A left join B on A.aid=B.bid > where B.bid<2").explain > == Physical Plan == > Project [aid#15,bid#17] > +- Filter (bid#17 < 2) >+- BroadcastHashOuterJoin [aid#15], [bid#17], LeftOuter, None > :- Scan ParquetRelation[aid#15,aVal#16] InputPaths: > file:/home/mohit/ruleA > +- Scan ParquetRelation[bid#17,bVal#18] InputPaths: > file:/home/mohit/ruleB > This is a watered-down example from a production issue which has a huge > performance impact. > External reference: > http://stackoverflow.com/questions/40783675/spark-sql-catalyst-is-scanning-undesired-columns -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
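The column-pruning complaint above can be made concrete with a pure-Python sketch (this is not Catalyst code; the `scan` helper and the tiny in-memory table are invented for illustration). The scan records which columns it actually reads, mimicking what a Parquet scan with projection pushdown would do versus the `ParquetRelation[bid#17,bVal#18]` scan shown in the reported plan.

```python
def scan(rows, columns):
    """Read only the requested columns; remember which ones were touched."""
    scan.read_columns = set(columns)
    return [{c: r[c] for c in columns} for r in rows]

# Stand-in for table B from the report (columns bid, bVal).
table_b = [{"bid": 1, "bVal": "x"}, {"bid": 2, "bVal": "y"}]

# The query "select A.aid, B.bid ... left join ... on A.aid = B.bid" only
# needs B.bid, so an ideal plan scans just that column:
pruned = scan(table_b, ["bid"])
assert scan.read_columns == {"bid"}           # bVal was never read

# The plan in the report instead scans ParquetRelation[bid#17,bVal#18]:
unpruned = scan(table_b, ["bid", "bVal"])
assert scan.read_columns == {"bid", "bVal"}   # extra I/O for bVal
```

In user code, a common workaround while an optimizer misses pruning is to project explicitly before joining (e.g. `dfB.select("bid")`), which forces the narrower scan regardless of what the planner infers.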
[jira] [Resolved] (SPARK-18196) Optimise CompactBuffer implementation
[ https://issues.apache.org/jira/browse/SPARK-18196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18196. --- Resolution: Won't Fix For now this looks like a "wontfix", as it doesn't result in a speedup. > Optimise CompactBuffer implementation > - > > Key: SPARK-18196 > URL: https://issues.apache.org/jira/browse/SPARK-18196 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.2, 2.0.1 >Reporter: Adam Roberts >Priority: Minor > > This change slightly increases the class footprint (8 bytes on IBM Java, 12 > bytes on OpenJDK and Oracle's) but we've observed a 4% performance improvement > on PageRank using HiBench large with this change, so a worthy trade-off IMO. > This yields a shorter path length for the JIT because there are fewer if-else > statements. > Config used on HiBench: > spark.executor.memory 25G > spark.driver.memory 4G > spark.serializer org.apache.spark.serializer.KryoSerializer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
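For context on what is being optimised: Spark's CompactBuffer (Scala) keeps the first two elements in dedicated fields so that the common "very few values per key" case allocates no backing array. The Python rendition below is an illustrative sketch of that layout idea, not the Spark class or the proposed patch.

```python
class CompactBuffer:
    """Append-only buffer; first two elements live inline, rest spill lazily."""

    def __init__(self):
        self._e0 = None
        self._e1 = None
        self._size = 0
        self._rest = None            # allocated only on the 3rd element

    def append(self, value):
        if self._size == 0:
            self._e0 = value
        elif self._size == 1:
            self._e1 = value
        else:
            if self._rest is None:
                self._rest = []      # lazy spill: avoids allocation for <=2 items
            self._rest.append(value)
        self._size += 1
        return self

    def __len__(self):
        return self._size

    def __getitem__(self, i):
        if i == 0 and self._size > 0:
            return self._e0
        if i == 1 and self._size > 1:
            return self._e1
        if 2 <= i < self._size:
            return self._rest[i - 2]
        raise IndexError(i)

buf = CompactBuffer()
for v in "abc":
    buf.append(v)
print(list(buf))                     # ['a', 'b', 'c']
```

The JIRA's trade-off is visible even here: the branching in `append`/`__getitem__` is the "path length" cost, while a two-element buffer never allocates `_rest`.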
[jira] [Issue Comment Deleted] (SPARK-18819) Failure to read single-row Parquet files
[ https://issues.apache.org/jira/browse/SPARK-18819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Kamprath updated SPARK-18819: - Comment: was deleted (was: Possibly. I can dump the file created using [parquet-tools|https://github.com/Parquet/parquet-mr/tree/master/parquet-tools] on the ARM machines using the same java installation. I am assuming that this at least rules out the JVM, but not necessarily the parquet lib because I am using the latest snapshot of parquet to do the dump (which might not be the same as in spark 2.0.2). The fact that this problem arises with both HDFS and QFS as the file system rules out the file system itself, though not necessarily the spark interface to it. If this is not enough, I'll see what I can do to isolate it more.) > Failure to read single-row Parquet files > > > Key: SPARK-18819 > URL: https://issues.apache.org/jira/browse/SPARK-18819 > Project: Spark > Issue Type: Bug > Components: Input/Output, PySpark >Affects Versions: 2.0.2 > Environment: Ubuntu 14.04 LTS on ARM 7.1 >Reporter: Michael Kamprath >Priority: Critical > > When I create a data frame in PySpark with a small row count (less than > number executors), then write it to a parquet file, then load that parquet > file into a new data frame, and finally do any sort of read against the > loaded new data frame, Spark fails with an {{ExecutorLostFailure}}. 
> Example code to replicate this issue: > {code} > from pyspark.sql.types import * > rdd = sc.parallelize([('row1',1,4.33,'name'),('row2',2,3.14,'string')]) > my_schema = StructType([ > StructField("id", StringType(), True), > StructField("value1", IntegerType(), True), > StructField("value2", DoubleType(), True), > StructField("name",StringType(), True) > ]) > df = spark.createDataFrame( rdd, schema=my_schema) > df.write.parquet('hdfs://master:9000/user/michael/test_data',mode='overwrite') > newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/') > newdf.take(1) > {code} > The error I get when the {{take}} step runs is: > {code} > --- > Py4JJavaError Traceback (most recent call last) > in () > 1 newdf = > spark.read.parquet('hdfs://master:9000/user/michael/test_data/') > > 2 newdf.take(1) > /usr/local/spark/python/pyspark/sql/dataframe.py in take(self, num) > 346 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')] > 347 """ > --> 348 return self.limit(num).collect() > 349 > 350 @since(1.3) > /usr/local/spark/python/pyspark/sql/dataframe.py in collect(self) > 308 """ > 309 with SCCallSiteSync(self._sc) as css: > --> 310 port = self._jdf.collectToPython() > 311 return list(_load_from_socket(port, > BatchedSerializer(PickleSerializer( > 312 > /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in > __call__(self, *args) >1131 answer = self.gateway_client.send_command(command) >1132 return_value = get_return_value( > -> 1133 answer, self.gateway_client, self.target_id, self.name) >1134 >1135 for temp_arg in temp_args: > /usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in > get_return_value(answer, gateway_client, target_id, name) > 317 raise Py4JJavaError( > 318 "An error occurred while calling {0}{1}{2}.\n". 
> --> 319 format(target_id, ".", name), value) > 320 else: > 321 raise Py4JError( > Py4JJavaError: An error occurred while calling o54.collectToPython. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 > (TID 6, 10.10.10.4): ExecutorLostFailure (executor 2 exited caused by one of > the running tasks) Reason: Remote RPC client disassociated. Likely due to > containers exceeding thresholds, or network issues. Check driver logs for > WARN messages. > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441) >
[jira] [Commented] (SPARK-18819) Failure to read single-row Parquet files
[ https://issues.apache.org/jira/browse/SPARK-18819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15739488#comment-15739488 ] Michael Kamprath commented on SPARK-18819: -- Possibly. I can dump the file created using [parquet-tools|https://github.com/Parquet/parquet-mr/tree/master/parquet-tools] on the ARM machines using the same java installation. I am assuming that this at least rules out the JVM, but not necessarily the parquet lib because I am using the latest snapshot of parquet to do the dump (which might not be the same as in spark 2.0.2). The fact that this problem arises with both HDFS and QFS as the file system rules out the file system itself, though not necessarily the spark interface to it. If this is not enough, I'll see what I can do to isolate it more. > Failure to read single-row Parquet files > > > Key: SPARK-18819 > URL: https://issues.apache.org/jira/browse/SPARK-18819 > Project: Spark > Issue Type: Bug > Components: Input/Output, PySpark >Affects Versions: 2.0.2 > Environment: Ubuntu 14.04 LTS on ARM 7.1 >Reporter: Michael Kamprath >Priority: Critical > > When I create a data frame in PySpark with a small row count (less than > number executors), then write it to a parquet file, then load that parquet > file into a new data frame, and finally do any sort of read against the > loaded new data frame, Spark fails with an {{ExecutorLostFailure}}. 
> Example code to replicate this issue: > {code} > from pyspark.sql.types import * > rdd = sc.parallelize([('row1',1,4.33,'name'),('row2',2,3.14,'string')]) > my_schema = StructType([ > StructField("id", StringType(), True), > StructField("value1", IntegerType(), True), > StructField("value2", DoubleType(), True), > StructField("name",StringType(), True) > ]) > df = spark.createDataFrame( rdd, schema=my_schema) > df.write.parquet('hdfs://master:9000/user/michael/test_data',mode='overwrite') > newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/') > newdf.take(1) > {code} > The error I get when the {{take}} step runs is: > {code} > --- > Py4JJavaError Traceback (most recent call last) > in () > 1 newdf = > spark.read.parquet('hdfs://master:9000/user/michael/test_data/') > > 2 newdf.take(1) > /usr/local/spark/python/pyspark/sql/dataframe.py in take(self, num) > 346 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')] > 347 """ > --> 348 return self.limit(num).collect() > 349 > 350 @since(1.3) > /usr/local/spark/python/pyspark/sql/dataframe.py in collect(self) > 308 """ > 309 with SCCallSiteSync(self._sc) as css: > --> 310 port = self._jdf.collectToPython() > 311 return list(_load_from_socket(port, > BatchedSerializer(PickleSerializer( > 312 > /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in > __call__(self, *args) >1131 answer = self.gateway_client.send_command(command) >1132 return_value = get_return_value( > -> 1133 answer, self.gateway_client, self.target_id, self.name) >1134 >1135 for temp_arg in temp_args: > /usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in > get_return_value(answer, gateway_client, target_id, name) > 317 raise Py4JJavaError( > 318 "An error occurred while calling {0}{1}{2}.\n". 
> --> 319 format(target_id, ".", name), value) > 320 else: > 321 raise Py4JError( > Py4JJavaError: An error occurred while calling o54.collectToPython. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 > (TID 6, 10.10.10.4): ExecutorLostFailure (executor 2 exited caused by one of > the running tasks) Reason: Remote RPC client disassociated. Likely due to > containers exceeding thresholds, or network issues. Check driver logs for > WARN messages. > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAG
[jira] [Commented] (SPARK-18819) Failure to read single-row Parquet files
[ https://issues.apache.org/jira/browse/SPARK-18819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15739487#comment-15739487 ] Michael Kamprath commented on SPARK-18819: -- Possibly. I can dump the file created using [parquet-tools|https://github.com/Parquet/parquet-mr/tree/master/parquet-tools] on the ARM machines using the same java installation. I am assuming that this at least rules out the JVM, but not necessarily the parquet lib because I am using the latest snapshot of parquet to do the dump (which might not be the same as in spark 2.0.2). The fact that this problem arises with both HDFS and QFS as the file system rules out the file system itself, though not necessarily the spark interface to it. If this is not enough, I'll see what I can do to isolate it more. > Failure to read single-row Parquet files > > > Key: SPARK-18819 > URL: https://issues.apache.org/jira/browse/SPARK-18819 > Project: Spark > Issue Type: Bug > Components: Input/Output, PySpark >Affects Versions: 2.0.2 > Environment: Ubuntu 14.04 LTS on ARM 7.1 >Reporter: Michael Kamprath >Priority: Critical > > When I create a data frame in PySpark with a small row count (less than > number executors), then write it to a parquet file, then load that parquet > file into a new data frame, and finally do any sort of read against the > loaded new data frame, Spark fails with an {{ExecutorLostFailure}}. 
> Example code to replicate this issue: > {code} > from pyspark.sql.types import * > rdd = sc.parallelize([('row1',1,4.33,'name'),('row2',2,3.14,'string')]) > my_schema = StructType([ > StructField("id", StringType(), True), > StructField("value1", IntegerType(), True), > StructField("value2", DoubleType(), True), > StructField("name",StringType(), True) > ]) > df = spark.createDataFrame( rdd, schema=my_schema) > df.write.parquet('hdfs://master:9000/user/michael/test_data',mode='overwrite') > newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/') > newdf.take(1) > {code} > The error I get when the {{take}} step runs is: > {code} > --- > Py4JJavaError Traceback (most recent call last) > in () > 1 newdf = > spark.read.parquet('hdfs://master:9000/user/michael/test_data/') > > 2 newdf.take(1) > /usr/local/spark/python/pyspark/sql/dataframe.py in take(self, num) > 346 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')] > 347 """ > --> 348 return self.limit(num).collect() > 349 > 350 @since(1.3) > /usr/local/spark/python/pyspark/sql/dataframe.py in collect(self) > 308 """ > 309 with SCCallSiteSync(self._sc) as css: > --> 310 port = self._jdf.collectToPython() > 311 return list(_load_from_socket(port, > BatchedSerializer(PickleSerializer( > 312 > /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in > __call__(self, *args) >1131 answer = self.gateway_client.send_command(command) >1132 return_value = get_return_value( > -> 1133 answer, self.gateway_client, self.target_id, self.name) >1134 >1135 for temp_arg in temp_args: > /usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in > get_return_value(answer, gateway_client, target_id, name) > 317 raise Py4JJavaError( > 318 "An error occurred while calling {0}{1}{2}.\n". 
> --> 319 format(target_id, ".", name), value) > 320 else: > 321 raise Py4JError( > Py4JJavaError: An error occurred while calling o54.collectToPython. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 > (TID 6, 10.10.10.4): ExecutorLostFailure (executor 2 exited caused by one of > the running tasks) Reason: Remote RPC client disassociated. Likely due to > containers exceeding thresholds, or network issues. Check driver logs for > WARN messages. > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAG
[jira] [Commented] (SPARK-18750) Spark should be able to control the number of executors and should not throw stack overflow
[ https://issues.apache.org/jira/browse/SPARK-18750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15739460#comment-15739460 ] Sean Owen commented on SPARK-18750: --- I'm going to close this as a duplicate of SPARK-18769 unless there's evidence that this is an error from Spark, and can be patched separately from the apparent underlying cause, which is in that JIRA. > spark should be able to control the number of executor and should not throw > stack overslow > -- > > Key: SPARK-18750 > URL: https://issues.apache.org/jira/browse/SPARK-18750 > Project: Spark > Issue Type: Bug >Reporter: Neerja Khattar > > When running Sql queries on large datasets. Job fails with stack overflow > warning and it shows it is requesting lots of executors. > Looks like there is no limit to number of executors or not even having an > upperbound based on yarn available resources. > 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : > bdtcstr61n5.svr.us.jpmchase.net:8041 > 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : > bdtcstr61n8.svr.us.jpmchase.net:8041 > 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : > bdtcstr61n2.svr.us.jpmchase.net:8041 > 16/11/29 15:47:47 INFO yarn.YarnAllocator: Driver requested a total number of > 32770 executor(s). > 16/11/29 15:47:47 INFO yarn.YarnAllocator: Will request 24576 executor > containers, each with 1 cores and 6758 MB memory including 614 MB overhead > 16/11/29 15:49:11 INFO yarn.YarnAllocator: Driver requested a total number of > 52902 executor(s). 
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : > bdtcstr61n5.svr.us.jpmchase.net:8041 > 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : > bdtcstr61n8.svr.us.jpmchase.net:8041 > 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : > bdtcstr61n2.svr.us.jpmchase.net:8041 > 16/11/29 15:47:47 INFO yarn.YarnAllocator: Driver requested a total number of > 32770 executor(s). > 16/11/29 15:47:47 INFO yarn.YarnAllocator: Will request 24576 executor > containers, each with 1 cores and 6758 MB memory including 614 MB overhead > 16/11/29 15:49:11 INFO yarn.YarnAllocator: Driver requested a total number of > 52902 executor(s). > 16/11/29 15:49:11 WARN yarn.ApplicationMaster: Reporter thread fails 1 > time(s) in a row. > java.lang.StackOverflowError > at scala.collection.immutable.HashMap.$plus(HashMap.scala:57) > at scala.collection.immutable.HashMap.$plus(HashMap.scala:36) > at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:28) > at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:24) > at > scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48) > at > scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48) > at > scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224) > at > scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403) > at > scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:24) > at > scala.collection.TraversableLike$class.$plus$plus(TraversableLike.scala:156) > at > scala.collection.AbstractTraversable.$plus$plus(Traversable.scala:105) > at scala.collection.immutable.HashMap.$plus(HashMap.scala:60) > at scala.collection.immutable.Map$Map4.updated(Map.scala:172) > at 
scala.collection.immutable.Map$Map4.$plus(Map.scala:173) > at scala.collection.immutable.Map$Map4.$plus(Map.scala:158) > at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:28) > at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:24) > at > scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264) > at > scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245) > at > scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245) > at > scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245) >
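The log excerpt above shows the driver's requested executor total climbing without bound (32770, then 52902). A minimal sketch of the missing safeguard, assuming a cap like the one `spark.dynamicAllocation.maxExecutors` provides in real deployments: the helper name and the numbers are hypothetical, and this is not the YarnAllocator code.

```python
def bounded_target(requested, running, max_executors):
    """Clamp a runaway executor target, then return how many more to ask for."""
    target = min(requested, max_executors)   # never exceed the configured cap
    return max(target - running, 0)          # never ask for a negative count

# With the driver asking for 52902 executors (as in the log) and 4096 already
# running, a cap of 8192 limits the new container request to 4096:
print(bounded_target(requested=52902, running=4096, max_executors=8192))  # 4096

# If the cluster is already at or above target, nothing new is requested:
print(bounded_target(requested=10, running=20, max_executors=100))        # 0
```

This only addresses the unbounded-request symptom; the StackOverflowError in the reporter thread is the separate underlying issue tracked in SPARK-18769.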
[jira] [Commented] (SPARK-18819) Failure to read single-row Parquet files
[ https://issues.apache.org/jira/browse/SPARK-18819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15739458#comment-15739458 ] Sean Owen commented on SPARK-18819: --- Surely, this is specific to ARM if it doesn't occur on x86? I doubt it has anything to do with Parquet per se. I have no particular reason to believe ARM doesn't work, but also doubt it's been tested or is supported. This still just contains the driver stack trace, which says "something went wrong over there". It's not even clear the failure is from Spark. > Failure to read single-row Parquet files > > > Key: SPARK-18819 > URL: https://issues.apache.org/jira/browse/SPARK-18819 > Project: Spark > Issue Type: Bug > Components: Input/Output, PySpark >Affects Versions: 2.0.2 > Environment: Ubuntu 14.04 LTS on ARM 7.1 >Reporter: Michael Kamprath >Priority: Critical > > When I create a data frame in PySpark with a small row count (less than > number executors), then write it to a parquet file, then load that parquet > file into a new data frame, and finally do any sort of read against the > loaded new data frame, Spark fails with an {{ExecutorLostFailure}}. 
> Example code to replicate this issue: > {code} > from pyspark.sql.types import * > rdd = sc.parallelize([('row1',1,4.33,'name'),('row2',2,3.14,'string')]) > my_schema = StructType([ > StructField("id", StringType(), True), > StructField("value1", IntegerType(), True), > StructField("value2", DoubleType(), True), > StructField("name",StringType(), True) > ]) > df = spark.createDataFrame( rdd, schema=my_schema) > df.write.parquet('hdfs://master:9000/user/michael/test_data',mode='overwrite') > newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/') > newdf.take(1) > {code} > The error I get when the {{take}} step runs is: > {code} > --- > Py4JJavaError Traceback (most recent call last) > in () > 1 newdf = > spark.read.parquet('hdfs://master:9000/user/michael/test_data/') > > 2 newdf.take(1) > /usr/local/spark/python/pyspark/sql/dataframe.py in take(self, num) > 346 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')] > 347 """ > --> 348 return self.limit(num).collect() > 349 > 350 @since(1.3) > /usr/local/spark/python/pyspark/sql/dataframe.py in collect(self) > 308 """ > 309 with SCCallSiteSync(self._sc) as css: > --> 310 port = self._jdf.collectToPython() > 311 return list(_load_from_socket(port, > BatchedSerializer(PickleSerializer( > 312 > /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in > __call__(self, *args) >1131 answer = self.gateway_client.send_command(command) >1132 return_value = get_return_value( > -> 1133 answer, self.gateway_client, self.target_id, self.name) >1134 >1135 for temp_arg in temp_args: > /usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in > get_return_value(answer, gateway_client, target_id, name) > 317 raise Py4JJavaError( > 318 "An error occurred while calling {0}{1}{2}.\n". 
> --> 319 format(target_id, ".", name), value) > 320 else: > 321 raise Py4JError( > Py4JJavaError: An error occurred while calling o54.collectToPython. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 > (TID 6, 10.10.10.4): ExecutorLostFailure (executor 2 exited caused by one of > the running tasks) Reason: Remote RPC client disassociated. Likely due to > containers exceeding thresholds, or network issues. Check driver logs for > WARN messages. > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGSched
[jira] [Updated] (SPARK-18819) Failure to read single-row Parquet files
[ https://issues.apache.org/jira/browse/SPARK-18819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Kamprath updated SPARK-18819: - Description: When I create a data frame in PySpark with a small row count (less than number executors), then write it to a parquet file, then load that parquet file into a new data frame, and finally do any sort of read against the loaded new data frame, Spark fails with an {{ExecutorLostFailure}}. Example code to replicate this issue: {code} from pyspark.sql.types import * rdd = sc.parallelize([('row1',1,4.33,'name'),('row2',2,3.14,'string')]) my_schema = StructType([ StructField("id", StringType(), True), StructField("value1", IntegerType(), True), StructField("value2", DoubleType(), True), StructField("name",StringType(), True) ]) df = spark.createDataFrame( rdd, schema=my_schema) df.write.parquet('hdfs://master:9000/user/michael/test_data',mode='overwrite') newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/') newdf.take(1) {code} The error I get when the {{take}} step runs is: {code} --- Py4JJavaError Traceback (most recent call last) in () 1 newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/') > 2 newdf.take(1) /usr/local/spark/python/pyspark/sql/dataframe.py in take(self, num) 346 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')] 347 """ --> 348 return self.limit(num).collect() 349 350 @since(1.3) /usr/local/spark/python/pyspark/sql/dataframe.py in collect(self) 308 """ 309 with SCCallSiteSync(self._sc) as css: --> 310 port = self._jdf.collectToPython() 311 return list(_load_from_socket(port, BatchedSerializer(PickleSerializer( 312 /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in __call__(self, *args) 1131 answer = self.gateway_client.send_command(command) 1132 return_value = get_return_value( -> 1133 answer, self.gateway_client, self.target_id, self.name) 1134 1135 for temp_arg in temp_args: /usr/local/spark/python/pyspark/sql/utils.py 
in deco(*a, **kw) 61 def deco(*a, **kw): 62 try: ---> 63 return f(*a, **kw) 64 except py4j.protocol.Py4JJavaError as e: 65 s = e.java_exception.toString() /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 317 raise Py4JJavaError( 318 "An error occurred while calling {0}{1}{2}.\n". --> 319 format(target_id, ".", name), value) 320 else: 321 raise Py4JError( Py4JJavaError: An error occurred while calling o54.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 6, 10.10.10.4): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667) at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1873) at org.apache.spark.SparkContext.run
[jira] [Comment Edited] (SPARK-18819) Failure to read single-row Parquet files
[ https://issues.apache.org/jira/browse/SPARK-18819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15739452#comment-15739452 ] Michael Kamprath edited comment on SPARK-18819 at 12/11/16 9:42 AM: Sure. I updated the description above.
[jira] [Commented] (SPARK-18819) Failure to read single-row Parquet files
[ https://issues.apache.org/jira/browse/SPARK-18819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15739452#comment-15739452 ] Michael Kamprath commented on SPARK-18819: -- Sure. The complete error message is: {{code}} --- Py4JJavaError Traceback (most recent call last) in () 1 newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/') > 2 newdf.take(1) /usr/local/spark/python/pyspark/sql/dataframe.py in take(self, num) 346 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')] 347 """ --> 348 return self.limit(num).collect() 349 350 @since(1.3) /usr/local/spark/python/pyspark/sql/dataframe.py in collect(self) 308 """ 309 with SCCallSiteSync(self._sc) as css: --> 310 port = self._jdf.collectToPython() 311 return list(_load_from_socket(port, BatchedSerializer(PickleSerializer( 312 /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in __call__(self, *args) 1131 answer = self.gateway_client.send_command(command) 1132 return_value = get_return_value( -> 1133 answer, self.gateway_client, self.target_id, self.name) 1134 1135 for temp_arg in temp_args: /usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw) 61 def deco(*a, **kw): 62 try: ---> 63 return f(*a, **kw) 64 except py4j.protocol.Py4JJavaError as e: 65 s = e.java_exception.toString() /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 317 raise Py4JJavaError( 318 "An error occurred while calling {0}{1}{2}.\n". --> 319 format(target_id, ".", name), value) 320 else: 321 raise Py4JError( Py4JJavaError: An error occurred while calling o54.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 6, 10.10.10.4): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. 
Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1873) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1886) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1899) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39) at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2526) at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2523) at 
org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2523) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2546) at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:2523) at sun.reflect.N
[jira] [Resolved] (SPARK-18653) Dataset.show() generates incorrect padding for Unicode Character
[ https://issues.apache.org/jira/browse/SPARK-18653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18653. --- Resolution: Won't Fix > Dataset.show() generates incorrect padding for Unicode Character > > > Key: SPARK-18653 > URL: https://issues.apache.org/jira/browse/SPARK-18653 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki > > The following program generates incorrect space padding for > {{Dataset.show()}} when a column name or column value contains Unicode characters. > Program > {code:java} > case class UnicodeCaseClass(整数: Int, 実数: Double, s: String) > val ds = Seq(UnicodeCaseClass(1, 1.1, "文字列1")).toDS > ds.show > {code} > Output > {code} > +---+---++ > | 整数| 実数| s| > +---+---++ > | 1|1.1|文字列1| > +---+---++ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
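The misaligned table above happens because padding is computed from the code-point count, while East Asian Wide characters occupy two terminal columns each. A minimal pure-Python sketch of display-width-aware padding (stdlib only; this is not Spark's actual implementation, just an illustration of the fix):

```python
import unicodedata

def display_width(s: str) -> int:
    """Approximate terminal display width: East Asian Wide ('W') and
    Fullwidth ('F') characters occupy two columns, everything else one."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in s)

def pad(s: str, width: int) -> str:
    """Right-pad `s` with spaces up to the given display width."""
    return s + " " * max(0, width - display_width(s))

# len() sees 2 code points, but the string fills 4 terminal columns,
# which is why code-point-based padding comes up short:
assert len("整数") == 2
assert display_width("整数") == 4
```

Padding every cell with `pad` instead of `str.ljust` would keep the `+---+` borders aligned for the example in the report.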
[jira] [Updated] (SPARK-18819) Failure to read single-row Parquet files
[ https://issues.apache.org/jira/browse/SPARK-18819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Kamprath updated SPARK-18819: - Description: When I create a data frame in PySpark with a small row count (less than number executors), then write it to a parquet file, then load that parquet file into a new data frame, and finally do any sort of read against the loaded new data frame, Spark fails with an {{ExecutorLostFailure}}. Example code to replicate this issue: {code} from pyspark.sql.types import * rdd = sc.parallelize([('row1',1,4.33,'name'),('row2',2,3.14,'string')]) my_schema = StructType([ StructField("id", StringType(), True), StructField("value1", IntegerType(), True), StructField("value2", DoubleType(), True), StructField("name",StringType(), True) ]) df = spark.createDataFrame( rdd, schema=my_schema) df.write.parquet('hdfs://master:9000/user/michael/test_data',mode='overwrite') newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/') newdf.take(1) {code} The error I get when the {{take}} step runs is: {code} Py4JJavaError: An error occurred while calling o54.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 8, 10.10.10.4): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. 
Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1873) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1886) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1899) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39) at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2526) at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2523) at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2523) at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2546) at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:2523) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745) {code} I have tested this against HDFS 2.7 and QFS 1.2 on an ARM v7.1 based cluster. Both have the same results. Note I have verified this issue doesn't express on x86 platforms. The
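The trigger condition in the description above is a row count smaller than the number of executors, which means some partitions end up empty. A minimal pure-Python sketch (no Spark required; the slicing mirrors how `parallelize` splits a local list, not Spark's exact code) showing that the two-row dataset leaves half of four partitions empty:

```python
def split_into_partitions(rows, num_partitions):
    """Slice a list into roughly equal contiguous partitions."""
    n = len(rows)
    return [rows[n * i // num_partitions : n * (i + 1) // num_partitions]
            for i in range(num_partitions)]

rows = [('row1', 1, 4.33, 'name'), ('row2', 2, 3.14, 'string')]
parts = split_into_partitions(rows, 4)  # e.g. a 4-executor cluster
# Two of the four partitions carry a row; the other two are empty:
assert sum(1 for p in parts if not p) == 2
```

Tasks scheduled on the empty partitions would read Parquet files containing no row groups, which is the situation the reporter observed failing on ARM but not x86.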
[jira] [Updated] (SPARK-18628) Update handle invalid documentation string
[ https://issues.apache.org/jira/browse/SPARK-18628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18628: -- Assignee: Krishna Kalyan > Update handle invalid documentation string > -- > > Key: SPARK-18628 > URL: https://issues.apache.org/jira/browse/SPARK-18628 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Assignee: Krishna Kalyan >Priority: Trivial > Labels: starter > Fix For: 2.1.1, 2.2.0 > > > The handleInvalid parameter documentation string currently doesn't have > quotes around the options; after SPARK-18366 is in, it would be good to > update both the Scala param and Python param to have quotes around the > options, making them easier for users to read.
[jira] [Resolved] (SPARK-18628) Update handle invalid documentation string
[ https://issues.apache.org/jira/browse/SPARK-18628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18628. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 Issue resolved by pull request 16242 [https://github.com/apache/spark/pull/16242] > Update handle invalid documentation string > -- > > Key: SPARK-18628 > URL: https://issues.apache.org/jira/browse/SPARK-18628 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Trivial > Labels: starter > Fix For: 2.1.1, 2.2.0 > > > The handleInvalid parameter documentation string currently doesn't have > quotes around the options; after SPARK-18366 is in, it would be good to > update both the Scala param and Python param to have quotes around the > options, making them easier for users to read.
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15739432#comment-15739432 ] Sean Owen commented on SPARK-9487: -- I think this is going around in circles. You already have an open invitation to improve tests in any logical subset of the project in order to accomplish this change in number of worker threads. You're saying you are unable to get them to pass on Jenkins and unwilling to debug. I don't think there is more guidance to give here; either you can effect this change or not. If nobody can or seems willing to try, I think it should be closed, because this really isn't an error to start with, nor even that suboptimal (excepting that it has revealed that a couple of tests could be a little more robust). > Use the same num. worker threads in Scala/Python unit tests > --- > > Key: SPARK-9487 > URL: https://issues.apache.org/jira/browse/SPARK-9487 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL, Tests >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng > Labels: starter > Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults > > > In Python we use `local[4]` for unit tests, while in Scala/Java we use > `local[2]` and `local` for some unit tests in SQL, MLLib, and other > components. If the operation depends on partition IDs, e.g., random number > generator, this will lead to different results in Python and Scala/Java. It > would be nice to use the same number in all unit tests.
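The description's point about partition-ID-dependent operations can be made concrete with a minimal pure-Python sketch (no Spark; `sample_per_partition` is a hypothetical stand-in for the common pattern of seeding a per-partition random generator, as `mapPartitionsWithIndex`-based samplers do):

```python
import random

def sample_per_partition(data, num_partitions):
    """Draw one pseudo-random value per element, seeding the generator
    with the partition index -- the pattern that makes results depend
    on how the data is split across partitions."""
    n = len(data)
    out = []
    for pid in range(num_partitions):
        rng = random.Random(pid)  # seed tied to partition ID
        chunk = data[n * pid // num_partitions : n * (pid + 1) // num_partitions]
        out.extend(rng.random() for _ in chunk)
    return out

data = list(range(8))
# Identical input data, but 2 vs 4 partitions yield different "random" output,
# which is why local[2] and local[4] test runs can disagree:
assert sample_per_partition(data, 2) != sample_per_partition(data, 4)
```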
[jira] [Updated] (SPARK-18809) Kinesis deaggregation issue on master
[ https://issues.apache.org/jira/browse/SPARK-18809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18809: -- Assignee: Brian ONeill > Kinesis deaggregation issue on master > - > > Key: SPARK-18809 > URL: https://issues.apache.org/jira/browse/SPARK-18809 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.2 >Reporter: Brian ONeill >Assignee: Brian ONeill >Priority: Minor > Fix For: 2.2.0 > > > Fix for SPARK-14421 was never applied to master. > https://github.com/apache/spark/pull/16236 > Upgrade KCL to 1.6.2 to support deaggregation.
[jira] [Updated] (SPARK-18809) Kinesis deaggregation issue on master
[ https://issues.apache.org/jira/browse/SPARK-18809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18809: -- Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) > Kinesis deaggregation issue on master > - > > Key: SPARK-18809 > URL: https://issues.apache.org/jira/browse/SPARK-18809 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.2 >Reporter: Brian ONeill >Priority: Minor > Fix For: 2.2.0 > > > Fix for SPARK-14421 was never applied to master. > https://github.com/apache/spark/pull/16236 > Upgrade KCL to 1.6.2 to support deaggregation.
[jira] [Resolved] (SPARK-18809) Kinesis deaggregation issue on master
[ https://issues.apache.org/jira/browse/SPARK-18809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18809. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16236 [https://github.com/apache/spark/pull/16236] > Kinesis deaggregation issue on master > - > > Key: SPARK-18809 > URL: https://issues.apache.org/jira/browse/SPARK-18809 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.2 >Reporter: Brian ONeill > Fix For: 2.2.0 > > > Fix for SPARK-14421 was never applied to master. > https://github.com/apache/spark/pull/16236 > Upgrade KCL to 1.6.2 to support deaggregation.
[jira] [Commented] (SPARK-18819) Failure to read single-row Parquet files
[ https://issues.apache.org/jira/browse/SPARK-18819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15739422#comment-15739422 ] Sean Owen commented on SPARK-18819: --- This doesn't say anything about the underlying error though. Without that I think this would have to be closed as unactionable. Any more detail? > Failure to read single-row Parquet files > > > Key: SPARK-18819 > URL: https://issues.apache.org/jira/browse/SPARK-18819 > Project: Spark > Issue Type: Bug > Components: Input/Output, PySpark >Affects Versions: 2.0.2 > Environment: Ubuntu 14.04 LTS on ARM 7.1 >Reporter: Michael Kamprath >Priority: Critical > > When I create a data frame in PySpark with a small row count (less than > number executors), then write it to a parquet file, then load that parquet > file into a new data frame, and finally do any sort of read against the > loaded new data frame, Spark fails with an {{ExecutorLostFailure}}. > Example code to replicate this issue: > {code} > from pyspark.sql.types import * > rdd = sc.parallelize([('row1',1,4.33,'name'),('row2',2,3.14,'string')]) > my_schema = StructType([ > StructField("id", StringType(), True), > StructField("value1", IntegerType(), True), > StructField("value2", DoubleType(), True), > StructField("name",StringType(), True) > ]) > df = spark.createDataFrame( rdd, schema=my_schema) > df.write.parquet('hdfs://master:9000/user/michael/test_data',mode='overwrite') > newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/') > newdf.take(1) > {code} > The error I get when the {{take}} step runs is: > {code} > Py4JJavaError: An error occurred while calling o54.collectToPython. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 > (TID 8, 10.10.10.4): ExecutorLostFailure (executor 0 exited caused by one of > the running tasks) Reason: Remote RPC client disassociated. 
Likely due to > containers exceeding thresholds, or network issues. Check driver logs for > WARN messages. > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1873) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1886) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1899) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39) > at > org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2526) > at > 
org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2523) > at > org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2523) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2546) > at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:2523) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl
[jira] [Resolved] (SPARK-18799) Spark SQL expose interface for pluggable parser extension
[ https://issues.apache.org/jira/browse/SPARK-18799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18799. --- Resolution: Duplicate > Spark SQL expose interface for pluggable parser extension > --- > > Key: SPARK-18799 > URL: https://issues.apache.org/jira/browse/SPARK-18799 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jihong MA > > There used to be an interface to plug in a parser extension through > ParserDialect in HiveContext in all Spark 1.x versions. Starting with the Spark 2.x > releases, Apache Spark moved to the new parser (Antlr4); there is no longer a > way to extend the default SQL parser through the SparkSession interface. However, > this is a real pain and hard to work around when integrating other data > sources with Spark with extended support such as Insert, Update, or Delete > statements, or any other data management statement. > It would be very nice to continue to expose an interface for parser extension > to make data source integration easier and smoother.
[jira] [Commented] (SPARK-18786) pySpark SQLContext.getOrCreate(sc) take stopped sparkContext
[ https://issues.apache.org/jira/browse/SPARK-18786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15739415#comment-15739415 ] Sean Owen commented on SPARK-18786: --- I agree it's surprising and maybe fixable, but this may be in the category of things you just shouldn't do. You generally do not stop() a SparkContext except at the end of a program. > pySpark SQLContext.getOrCreate(sc) take stopped sparkContext > > > Key: SPARK-18786 > URL: https://issues.apache.org/jira/browse/SPARK-18786 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.0, 2.0.0 >Reporter: Alex Liu > > The following steps to reproduce the issue > {code} > import sys > sys.path.insert(1, 'spark/python/') > sys.path.insert(1, 'spark/python/lib/py4j-0.9-src.zip') > from pyspark import SparkContext, SQLContext > sc = SparkContext.getOrCreate() > sqlContext = SQLContext.getOrCreate(sc) > sqlContext.read.json(sc.parallelize(['{{ "name": "Adam" }}'])).show() > sc.stop() > sc = SparkContext.getOrCreate() > sqlContext = SQLContext.getOrCreate(sc) > sqlContext.read.json(sc.parallelize(['{{ "name": "Adam" }}'])).show() > {code} > It has the following errors after the last command > {code} > >>> sqlContext.read.json(sc.parallelize(['{{ "name": "Adam" }}'])).show() > Traceback (most recent call last): > > File "", line 1, in > File "spark/python/pyspark/sql/dataframe.py", line 257, in show > print(self._jdf.showString(n, truncate)) > File "spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in > __call__ > File "spark/python/pyspark/sql/utils.py", line 45, in deco > return f(*a, **kw) > File "spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in > get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o435.showString. > : java.lang.IllegalStateException: Cannot call methods on a stopped > SparkContext. 
> This stopped SparkContext was created at: > org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:59) > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > java.lang.reflect.Constructor.newInstance(Constructor.java:422) > py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234) > py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) > py4j.Gateway.invoke(Gateway.java:214) > py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79) > py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68) > py4j.GatewayConnection.run(GatewayConnection.java:209) > java.lang.Thread.run(Thread.java:745) > The currently active SparkContext was created at: > org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:59) > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > java.lang.reflect.Constructor.newInstance(Constructor.java:422) > py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234) > py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) > py4j.Gateway.invoke(Gateway.java:214) > py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79) > py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68) > py4j.GatewayConnection.run(GatewayConnection.java:209) > java.lang.Thread.run(Thread.java:745) > > at > org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:106) > at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1325) > at > 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:126) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) > at > org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:349) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$an
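The surprising behavior reported above comes from a getOrCreate-style cache that hands back an instance bound to the old, stopped context. A minimal pure-Python sketch of that caching pattern (hypothetical `FakeContext`/`FakeSQLContext` names for illustration only, not the actual PySpark source):

```python
class FakeContext:
    """Stand-in for a SparkContext with a stop() method."""
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


class FakeSQLContext:
    """Stand-in for a SQLContext with a class-level instance cache."""
    _instance = None

    def __init__(self, ctx):
        self.ctx = ctx

    @classmethod
    def get_or_create(cls, ctx):
        # Bug pattern: the cache is consulted without checking whether the
        # context wrapped by the cached instance has been stopped.
        if cls._instance is None:
            cls._instance = cls(ctx)
        return cls._instance


sc1 = FakeContext()
sql1 = FakeSQLContext.get_or_create(sc1)
sc1.stop()

sc2 = FakeContext()                       # a fresh, running context...
sql2 = FakeSQLContext.get_or_create(sc2)
assert sql2 is sql1                       # ...but the stale wrapper is returned,
assert sql2.ctx.stopped                   # still pointing at the stopped context
```

Invalidating the cache when the wrapped context is stopped (or keying the cache on the context) would avoid the "Cannot call methods on a stopped SparkContext" error in the report.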
[jira] [Updated] (SPARK-18786) pySpark SQLContext.getOrCreate(sc) take stopped sparkContext
[ https://issues.apache.org/jira/browse/SPARK-18786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-18786: --- Component/s: PySpark > pySpark SQLContext.getOrCreate(sc) take stopped sparkContext > > > Key: SPARK-18786 > URL: https://issues.apache.org/jira/browse/SPARK-18786 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.0, 2.0.0 >Reporter: Alex Liu > > The following steps reproduce the issue > {code} > import sys > sys.path.insert(1, 'spark/python/') > sys.path.insert(1, 'spark/python/lib/py4j-0.9-src.zip') > from pyspark import SparkContext, SQLContext > sc = SparkContext.getOrCreate() > sqlContext = SQLContext.getOrCreate(sc) > sqlContext.read.json(sc.parallelize(['{"name": "Adam"}'])).show() > sc.stop() > sc = SparkContext.getOrCreate() > sqlContext = SQLContext.getOrCreate(sc) > sqlContext.read.json(sc.parallelize(['{"name": "Adam"}'])).show() > {code} > The last command then fails with the following error > {code} > >>> sqlContext.read.json(sc.parallelize(['{"name": "Adam"}'])).show() > Traceback (most recent call last): > > File "", line 1, in > File "spark/python/pyspark/sql/dataframe.py", line 257, in show > print(self._jdf.showString(n, truncate)) > File "spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in > __call__ > File "spark/python/pyspark/sql/utils.py", line 45, in deco > return f(*a, **kw) > File "spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in > get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o435.showString. > : java.lang.IllegalStateException: Cannot call methods on a stopped > SparkContext. 
> This stopped SparkContext was created at: > org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:59) > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > java.lang.reflect.Constructor.newInstance(Constructor.java:422) > py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234) > py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) > py4j.Gateway.invoke(Gateway.java:214) > py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79) > py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68) > py4j.GatewayConnection.run(GatewayConnection.java:209) > java.lang.Thread.run(Thread.java:745) > The currently active SparkContext was created at: > org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:59) > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > java.lang.reflect.Constructor.newInstance(Constructor.java:422) > py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234) > py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) > py4j.Gateway.invoke(Gateway.java:214) > py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79) > py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68) > py4j.GatewayConnection.run(GatewayConnection.java:209) > java.lang.Thread.run(Thread.java:745) > > at > org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:106) > at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1325) > at > 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:126) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) > at > org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:349) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at
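The failure above reduces to the caching pattern behind getOrCreate: a class-level singleton that keeps handing back a context bound to a SparkContext that has since been stopped. The sketch below uses hypothetical stand-in classes (FakeSparkContext, FakeSQLContext — not Spark's real API) to illustrate the pattern and the obvious fix: the cache check must also test whether the cached instance's context was stopped, not just whether a cached instance exists.

```python
# Hypothetical stand-ins, not Spark's actual classes, illustrating the
# getOrCreate caching pattern from SPARK-18786.

class FakeSparkContext:
    def __init__(self):
        self._stopped = False

    def stop(self):
        self._stopped = True


class FakeSQLContext:
    _instance = None  # class-level singleton cache

    def __init__(self, sc):
        self.sc = sc

    @classmethod
    def get_or_create(cls, sc):
        # A naive cache would check only `cls._instance is None` and keep
        # returning a context bound to a stopped SparkContext (the bug
        # reported above). The extra `_stopped` test rebinds the cache to
        # the live context instead.
        if cls._instance is None or cls._instance.sc._stopped:
            cls._instance = cls(sc)
        return cls._instance


sc1 = FakeSparkContext()
ctx1 = FakeSQLContext.get_or_create(sc1)
sc1.stop()

sc2 = FakeSparkContext()
ctx2 = FakeSQLContext.get_or_create(sc2)
assert ctx2.sc is sc2          # rebound to the live context
assert not ctx2.sc._stopped
```

Without the `_stopped` test, `ctx2.sc` would still be the stopped `sc1`, which is exactly the state that triggers the IllegalStateException in the quoted trace.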
[jira] [Updated] (SPARK-18710) Add offset to GeneralizedLinearRegression models
[ https://issues.apache.org/jira/browse/SPARK-18710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wayne Zhang updated SPARK-18710: Shepherd: Yanbo Liang (was: Sean Owen) Remaining Estimate: 10h (was: 336h) Original Estimate: 10h (was: 336h) > Add offset to GeneralizedLinearRegression models > > > Key: SPARK-18710 > URL: https://issues.apache.org/jira/browse/SPARK-18710 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.0.2 >Reporter: Wayne Zhang > Labels: features > Fix For: 2.2.0 > > Original Estimate: 10h > Remaining Estimate: 10h > > The current GeneralizedLinearRegression model does not support an offset. An offset is useful for taking exposure into account, or for testing the incremental effect of new variables. While weights in the current implementation can achieve the same effect as an offset for certain models (e.g., Poisson and Binomial with a log offset), a dedicated offset option would cover more general cases, e.g., a negative offset, or an offset that is hard to express through weights (such as an offset to the probability rather than the odds in logistic regression). > The effort would involve: > * updating the regression class to support offsetCol > * updating IWLS to take the offset into account > * adding test cases for offset > I can start working on this if the community approves this feature. >
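To make the exposure use case concrete: in a GLM with a log link, the offset enters the linear predictor directly, eta = x . beta + offset, with mean mu = exp(eta). Setting offset = log(exposure) turns the model into a rate model. The pure-Python sketch below (not Spark's API; `poisson_mean` is a made-up helper) shows that a log-exposure offset scales the expected count multiplicatively by the exposure:

```python
import math

def poisson_mean(x, beta, offset=0.0):
    """Mean of a Poisson GLM with log link: mu = exp(x . beta + offset)."""
    eta = sum(xi * bi for xi, bi in zip(x, beta)) + offset
    return math.exp(eta)

x = [1.0, 2.0]       # intercept term plus one covariate
beta = [0.5, -0.25]  # coefficients
exposure = 10.0

mu_unit = poisson_mean(x, beta)                        # exposure = 1
mu_scaled = poisson_mean(x, beta, math.log(exposure))  # exposure = 10

# exp(x.beta + log(e)) = e * exp(x.beta): the offset multiplies the
# mean by the exposure while beta keeps its per-unit interpretation.
assert math.isclose(mu_scaled, exposure * mu_unit)
```

This also shows why weights are not a full substitute: a weight rescales the likelihood contribution, whereas an offset shifts the linear predictor itself, which is what cases like a negative offset require.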